
Stemming Errors

Author: unknown  Date: 2004-11-29

Two Kinds of Error

Natural languages are not completely regular constructs, and therefore stemmers operating on natural words inevitably make mistakes. On the one hand, words which ought to be merged together (such as "adhere" and "adhesion") may remain distinct after stemming; on the other, words which are really distinct may be wrongly conflated (e.g., "experiment" and "experience"). These are known as understemming errors and overstemming errors respectively. By counting these errors for a sample of words, we can gain an insight into the operation of a stemmer, and compare different stemmers with one another.


To enable the errors to be counted, the words in the collection must already be organised into 'conceptual groups', containing words (such as "adhere", "adheres", "adhering", "adhesion", "adhesive") which ought all to be merged to the same stem. The bibliography refers to two papers (Paice, 1994 and Paice, 1996) which describe this approach to stemmer evaluation in some detail.

In this document, we give simple examples to show how overstemming and understemming indices can be computed. In principle, the idea is to compare every pair of words in the sample, though in fact, to avoid wasting time, only certain pairs are actually considered. For each comparison, we know whether the pair of words belong to the same conceptual group, and whether they were in fact converted to the same stem:

  • If the two words belong to the same conceptual group, and are converted to the same stem, then the conflation is correct; if however they are converted to different stems, this is counted as an understemming error.
  • If the two words belong to different conceptual groups, and remain distinct after stemming, then the stemmer has behaved correctly. If however they are converted to the same stem, this is counted as an overstemming error.
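The two rules above can be sketched as a small function. The function name and the string labels are illustrative choices, not part of the original method description.

```python
def classify_pair(same_group: bool, same_stem: bool) -> str:
    """Classify one word pair according to the two rules above."""
    if same_group:
        # Same conceptual group: the pair should end up with one stem.
        return "correct" if same_stem else "understemming"
    # Different conceptual groups: the pair should stay distinct.
    return "overstemming" if same_stem else "correct"


# Same group but different stems: an understemming error.
print(classify_pair(same_group=True, same_stem=False))   # understemming
# Different groups but the same stem: an overstemming error.
print(classify_pair(same_group=False, same_stem=True))   # overstemming
```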




An Example

The following examples use the so-called Truncate(n) stemmer, which simply retains the first n letters of the word, where n is a suitable integer, such as 4, 5 or 6. If the word has fewer than n letters to start with, it is returned unchanged.
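A minimal sketch of Truncate(n) in Python follows. The function name `truncate` and the sample words are assumptions made for illustration.

```python
def truncate(word: str, n: int) -> str:
    """Truncate(n): keep the first n letters; shorter words are unchanged."""
    # Slicing past the end of a string simply returns the whole string.
    return word[:n]


print(truncate("division", 5))  # divis
print(truncate("dog", 5))       # dog
```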

Consider the seven words shown in the first column of the table below, which fall into two distinct conceptual groups. We will look at the effect of applying Truncate(5) and Truncate(4) to these words.

[Table: Truncate(N); image not reproduced]

Under-Stemming Truncate(5)

[Figure: Under-Stemming Truncate(5); image not reproduced]


1 = A pair of groupable words with identical stems. This represents a successful stemming operation.
0 = A pair of groupable words with non-identical stems. This represents an understemming error.
X = A pair of non-groupable words. These pairs are not considered when counting understemming errors.

In this case, the proportion of word pairs successfully merged is 10 out of 22, hence UI = 1 - (10/22) = 0.545.
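The understemming count can be sketched in code. The original word table is not available here, so the word list below is a hypothetical sample (seven words in groups of five and two) chosen only because it gives the same counts as the text. The totals quoted in this example, 22 groupable and 20 non-groupable pairs, sum to 42 = 7 × 6, which suggests each pair is counted in both orders; the sketch assumes that.

```python
from itertools import permutations

# Hypothetical sample (an assumption): not the words from the original table.
GROUPS = {
    "divide": ["divide", "divides", "divided", "division", "divisor"],
    "divine": ["divine", "divinity"],
}


def truncate(word, n):
    """Truncate(n): keep the first n letters of the word."""
    return word[:n]


def understemming_index(groups, n):
    """Return (merged, groupable, UI) over ordered same-group pairs."""
    merged = groupable = 0
    for words in groups.values():
        for a, b in permutations(words, 2):
            groupable += 1
            # True counts as 1 when the two stems are identical.
            merged += truncate(a, n) == truncate(b, n)
    return merged, groupable, 1 - merged / groupable


merged, groupable, ui = understemming_index(GROUPS, 5)
print(merged, groupable, round(ui, 3))  # 10 22 0.545
```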


Over-Stemming Truncate(5)

[Figure: Over-Stemming Truncate(5); image not reproduced]

1 = A pair of non-groupable words with non-identical stems. This represents a successful stemming operation.
0 = A pair of non-groupable words with identical stems. This represents an overstemming error.
X = A pair of groupable words. These pairs are not considered when counting overstemming errors.

In this case, the proportion of word pairs which were correctly not merged to the same stem is 20 out of 20, hence OI = 1 - (20/20) = 0.


Under-Stemming Truncate(4)

[Figure: Under-Stemming Truncate(4); image not reproduced]

In this case, the proportion of word pairs correctly merged is 22 out of 22, hence UI = 1 - (22/22) = 0.


Over-Stemming Truncate(4)

[Figure: Over-Stemming Truncate(4); image not reproduced]

Here, the proportion of word pairs which were correctly not conflated is 0 out of 20, so that OI = 1 - (0/20) = 1.
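The trade-off between the two stemmers can be seen by computing both indices side by side. This is a sketch under assumptions: the word list is hypothetical (the original table is not available), and pairs are counted once each rather than in both orders, which leaves the ratios unchanged.

```python
from itertools import combinations

# Hypothetical sample (an assumption): two conceptual groups, seven words.
GROUPS = [
    ["divide", "divides", "divided", "division", "divisor"],
    ["divine", "divinity"],
]


def stemming_indices(groups, n):
    """Return (UI, OI) for Truncate(n) over all unordered word pairs."""
    group_of = {w: i for i, ws in enumerate(groups) for w in ws}
    under = same = over = diff = 0
    for a, b in combinations(group_of, 2):
        same_stem = a[:n] == b[:n]
        if group_of[a] == group_of[b]:
            same += 1
            under += not same_stem   # groupable pair left unmerged
        else:
            diff += 1
            over += same_stem        # non-groupable pair wrongly merged
    return under / same, over / diff


for n in (4, 5):
    ui, oi = stemming_indices(GROUPS, n)
    print(n, round(ui, 3), round(oi, 3))
# 4 0.0 1.0
# 5 0.545 0.0
```

On this sample the shorter truncation removes every understemming error at the cost of conflating every cross-group pair, and the longer truncation does the reverse.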


Discussion

There are some problems with this approach. Firstly, the manual construction of the grouped word collection is time-consuming, which limits the size of the word collections that can be used. Secondly, it is sometimes unclear whether or not two words should be grouped together: for instance, whether "different" and "differentiate" should be merged depends very much on the topic of the original text. Moreover, whether a group contains all of the words it should and none of the words it should not depends heavily on the document being inspected.

An example of this would be an IR system containing two documents: one contains no words from the ‘Divine’ group but various words from the ‘Divide’ group, and the other contains words from the ‘Divine’ group but none from the ‘Divide’ group. Testing each document separately for understemming and overstemming errors with the Truncate(4) stemmer finds no errors, yet a query for the term ‘Divide’ would return both documents. This harms precision because of overstemming, but analysis based on the individual documents would not pick it up. The entire database would have to be checked to find all such errors.
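The two-document scenario can be sketched as follows. The document vocabularies and the group assignments are hypothetical, chosen only to mirror the ‘Divide’ and ‘Divine’ example; the function name is likewise an assumption.

```python
from itertools import combinations

# Hypothetical vocabularies of the two documents (assumed for illustration).
DOC_A = {"divide", "divided", "divides"}   # only the 'Divide' group
DOC_B = {"divine", "divinity"}             # only the 'Divine' group
GROUP_OF = {
    "divide": "Divide", "divided": "Divide", "divides": "Divide",
    "divine": "Divine", "divinity": "Divine",
}


def overstemming_errors(words, n=4):
    """Count unordered pairs from different groups that share a Truncate(n) stem."""
    return sum(
        1
        for a, b in combinations(sorted(words), 2)
        if GROUP_OF[a] != GROUP_OF[b] and a[:n] == b[:n]
    )


print(overstemming_errors(DOC_A))          # 0
print(overstemming_errors(DOC_B))          # 0
print(overstemming_errors(DOC_A | DOC_B))  # 6
```

Checked separately, neither document shows an overstemming error; the six cross-group conflations only appear once the vocabularies are combined.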
