Experiments with a Chunker and Lucene

Author: unknown  Date: 2004-12-04

First, we decided to improve our « picking up » module, which corrects spelling mistakes. Such mistakes are numerous in news texts. The goal was to avoid silence (missed results): a misspelled word cannot be matched at search time, so the text containing it is never proposed as a candidate. The second decision was to apply natural language processing to the input corpora during indexing. We had the following reasons for doing so:


1) we wanted to recognize compound words as genuine compound words in order to:

a) avoid noise due to misinterpretation of the components: a « pomme de terre » (potato) is not a « pomme » (apple).

b) exploit the fact that many compound words are recorded in the lexicon: we can identify words that are worth indexing and that therefore help identify the documents in which they appear.


2) we wanted to filter out certain grammatical categories; for instance, we wanted to avoid indexing empty words (stop words) and adverbs.

3) we wanted to insert only the lemmatized forms into the index, not the full forms, in order to group the various occurrences of the same lemma and compute a single weight over all occurrences of its full forms. This criterion holds for both simple and compound words.

4) we wanted to disambiguate certain difficult (and frequent) French words such as « tu », which can be a pronoun or the past participle of the verb « taire ».


5) we needed local grammars to recognize dates, times, numbers, etc., and the morphological analyzer already provided these algorithms.
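
To make reasons 1 to 3 concrete, here is a minimal Python sketch of how compound recognition, category filtering and lemmatization could combine before terms reach an index. It is not the system described in this article and does not use Lucene: the lexicon, the category labels, and the names `LEXICON`, `FILTERED`, `index_terms` and `term_weights` are all hypothetical, chosen only for illustration. Disambiguation (reason 4) and local grammars (reason 5) are left out.

```python
from collections import Counter

# Hypothetical toy lexicon: surface form (tuple of tokens) -> (lemma, category).
# Keys with several tokens are compound words.
LEXICON = {
    ("pomme", "de", "terre"): ("pomme de terre", "NOUN"),
    ("pommes", "de", "terre"): ("pomme de terre", "NOUN"),
    ("pomme",): ("pomme", "NOUN"),
    ("pommes",): ("pomme", "NOUN"),
    ("il",): ("il", "PRON"),
    ("les",): ("le", "DET"),
    ("la",): ("le", "DET"),
    ("de",): ("de", "PREP"),
    ("et",): ("et", "CONJ"),
    ("très",): ("très", "ADV"),
    ("mange",): ("manger", "VERB"),
    ("mangent",): ("manger", "VERB"),
}

# Categories kept out of the index: empty words and adverbs (assumed labels).
FILTERED = {"PRON", "DET", "PREP", "CONJ", "ADV"}

MAX_COMPOUND_LEN = max(len(key) for key in LEXICON)


def index_terms(text):
    """Return the lemmas to index, trying the longest compound first."""
    tokens = text.lower().split()
    terms = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so that "pomme de terre"
        # is never split into "pomme" + "de" + "terre".
        for n in range(min(MAX_COMPOUND_LEN, len(tokens) - i), 0, -1):
            entry = LEXICON.get(tuple(tokens[i:i + n]))
            if entry is not None:
                lemma, category = entry
                if category not in FILTERED:
                    terms.append(lemma)
                i += n
                break
        else:
            # Unknown word: index it unchanged (an assumption of this sketch).
            terms.append(tokens[i])
            i += 1
    return terms


def term_weights(text):
    """Count occurrences per lemma, grouping all full forms together."""
    return Counter(index_terms(text))
```

With this toy lexicon, `index_terms("Il mange la pomme de terre")` yields the two lemmas `manger` and `pomme de terre`, and the component `pomme` is not indexed on its own. In `term_weights`, the singular and plural forms of the compound are counted under one lemma, which is the grouping that reason 3 asks for.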
