The WWW is increasingly being used as a source of information, and users access
this large volume of information with direct-manipulation tools. Users clearly
need a tool that keeps the texts they want and discards the texts they do not
want from the flood of information that reaches them. This paper describes a
module that sifts through the large number of texts retrieved by the user.
The module is based on HowNet, a knowledge dictionary developed by Mr.
Zhendong Dong. In this dictionary, the concept of a word is divided into sememes.
In the philosophy of HowNet, all concepts in the world can be expressed by
combinations of the more than 1500 sememes. Sememes are very useful for handling
synonymy, which is the most difficult problem in text filtering. We divided the
set of sememes into two subsets: classifiable sememes and unclassifiable sememes.
Classifiable sememes are those that are more useful in distinguishing a
document's class from that of other documents. Unclassifiable sememes are those
that occur similarly in all documents. There are about 800 classifiable sememes.
We used these 800 classifiable sememes to build the Classifiable Sememes Vector
Space (CSVS).
A text is represented as a vector in the CSVS after the following steps:
1. Text preprocessing: identify the language of the text and apply the
processing appropriate to that language.
2. Part-of-speech tagging.
3. Keyword extraction.
4. Keyword sense disambiguation based on context: for each keyword, we calculate
the relevance of its classifiable sememes to the classifiable sememes of the
surrounding words. We increase the weight of a semantic item if it contains
classifiable sememes that also appear in a semantic item of a context word.
This is not a strict disambiguation algorithm; we only adjust the weights of
the semantic items.
5. The keywords are reduced to sememes, and the weights of the classifiable
sememes across all semantic items of all keywords are combined to give the
weights of the vector's features.
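
Steps 4 and 5 can be illustrated with a minimal sketch. It is not the module's
actual implementation: the toy lexicon, the function names, the boost value, and
the normalization of sense weights are all assumptions made for illustration.
Real HowNet entries and the module's own weighting formula would replace them.

```python
from collections import defaultdict

# Hypothetical toy lexicon (not real HowNet data): each keyword maps to its
# semantic items, each given as a set of classifiable sememes.
LEXICON = {
    "bank": [{"institution", "money"}, {"land", "water"}],
    "loan": [{"money", "lend"}],
    "river": [{"water", "flow"}],
}

def sense_weights(word, context_words, boost=1.0):
    """Step 4 (soft disambiguation): start every semantic item at weight 1.0
    and add `boost` for each sememe it shares with the context words.
    Normalizing the weights to sum to 1 is an assumption of this sketch."""
    context = set()
    for other in context_words:
        if other != word:
            for sense in LEXICON.get(other, []):
                context |= sense
    raw = [1.0 + boost * len(sense & context) for sense in LEXICON[word]]
    total = sum(raw)
    return [w / total for w in raw]

def text_vector(keywords):
    """Step 5: reduce keywords to sememes and accumulate the weights of all
    semantic items into one sparse vector (sememe -> weight)."""
    vec = defaultdict(float)
    for word in keywords:
        if word not in LEXICON:
            continue  # keyword has no entry in the toy lexicon
        weights = sense_weights(word, keywords)
        for sense, weight in zip(LEXICON[word], weights):
            for sememe in sense:
                vec[sememe] += weight
    return dict(vec)

vector = text_vector(["bank", "loan"])
```

In this toy example the financial sense of "bank" receives more weight than the
river sense because it shares the sememe "money" with "loan", yet the river
sense is kept with a smaller weight rather than discarded.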
A user provides some texts to express the kind of text he is interested in.
They are all expressed as vectors in the CSVS, and these vectors represent the
user's preference. The relevance of two texts can be measured by the cosine of
the angle between their vectors. When a new text arrives, it is also expressed
as a vector in the CSVS. We find its k nearest neighbours among the texts
provided by the user in the CSVS and calculate the relevance of the new text to
these k nearest neighbours. If the relevance is greater than a certain
threshold, the text is of interest to the user; if it is smaller, the text is
not of interest to the user. The value of k is determined by calculating the
neighbours of every training vector.
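
The filtering decision can be sketched as follows. This is an illustration
under stated assumptions, not the module's actual code: the text does not say
how the relevance to the k neighbours is aggregated, so the sketch assumes the
mean cosine similarity, and the function names and the default values of k and
the threshold are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (dict: sememe -> weight)."""
    dot = sum(w * v.get(key, 0.0) for key, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_u * norm_v)

def is_of_interest(new_vec, user_vecs, k=3, threshold=0.5):
    """Accept the new text if its mean similarity to its k nearest user
    texts exceeds the threshold. Using the mean is an assumption; k and
    threshold are placeholder values."""
    similarities = sorted((cosine(new_vec, u) for u in user_vecs), reverse=True)
    nearest = similarities[:k]
    if not nearest:
        return False  # no user texts to compare against
    return sum(nearest) / len(nearest) > threshold

# Hypothetical user profile: two texts about lending money.
user_texts = [{"money": 1.0, "lend": 1.0}, {"money": 1.0}]
accepted = is_of_interest({"money": 1.0}, user_texts, k=2)
rejected = is_of_interest({"water": 1.0}, user_texts, k=2)
```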

