First, we decided to improve our « picking up » module, which corrects spelling mistakes. Such mistakes are numerous in news texts. The goal was to avoid silence: a misspelled word cannot be matched, so the text containing it is never proposed as a candidate during search. The second decision was to apply natural language processing to the input corpora during indexing. We had the following reasons for doing so:
1) we wanted to recognize compound words as genuine compound words in order to:
a) avoid noise due to a false interpretation of their components: a « pomme de terre » is not a « pomme ».
b) take advantage of the many compound words recorded in the lexicon to identify words that are worth indexing and therefore useful for identifying the documents in which they appear.
2) we wanted to filter out certain grammatical categories; for instance, we wanted to avoid indexing empty words (function words) and adverbs.
3) we wanted to insert into the index only the lemmatized forms and not the full forms, in order to group the various occurrences of the same lemma and compute a single weight over all occurrences of its various full forms. This criterion holds for both simple and compound words.
4) we wanted to disambiguate certain difficult (and frequent) French words such as « tu », which can be a « Pronoun » or the « Past participle of the verb taire ».
5) we needed to use local grammars in order to recognize dates, times, numbers, etc., and the morphological analyzer already provided these algorithms.
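Reasons 1) to 3) can be illustrated with a minimal sketch. It is not the system described here: the toy lexicon, the category names, the `index_terms` function and the longest-match strategy are all assumptions chosen for illustration, and disambiguation and local grammars (reasons 4 and 5) are left out.

```python
from collections import Counter

# Hypothetical toy lexicon: full form -> (lemma, grammatical category).
# A real lexicon would be far larger; these entries are illustrative only.
LEXICON = {
    "pommes de terre": ("pomme de terre", "NOUN"),
    "pomme de terre": ("pomme de terre", "NOUN"),
    "pommes": ("pomme", "NOUN"),
    "pomme": ("pomme", "NOUN"),
    "mange": ("manger", "VERB"),
    "mangent": ("manger", "VERB"),
    "il": ("il", "PRON"),
    "ils": ("il", "PRON"),
    "des": ("un", "DET"),
    "une": ("un", "DET"),
    "souvent": ("souvent", "ADV"),
}

# Categories that are not indexed (assumed set: empty words and adverbs).
SKIPPED_CATEGORIES = {"PRON", "DET", "PREP", "ADV"}

# Longest compound, in tokens, that the matcher tries (assumption).
MAX_COMPOUND_LEN = 3


def index_terms(text):
    """Return a Counter of lemmas to index for one text."""
    tokens = text.lower().split()
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that a compound wins
        # over its first component ("pomme de terre" over "pomme").
        for n in range(min(MAX_COMPOUND_LEN, len(tokens) - i), 0, -1):
            form = " ".join(tokens[i:i + n])
            if form in LEXICON:
                lemma, category = LEXICON[form]
                if category not in SKIPPED_CATEGORIES:
                    counts[lemma] += 1  # group full forms under the lemma
                i += n
                break
        else:
            # Unknown word: keep the full form as its own index entry.
            counts[tokens[i]] += 1
            i += 1
    return counts
```

With this sketch, « Ils mangent souvent des pommes de terre » contributes the entries « manger » and « pomme de terre » but not « pomme », and the forms « mange » and « mangent » are counted under the same lemma.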

