A short history of language models


Statistical language models have been around for quite a long time. They were
first applied by Andrei Markov at the beginning of the 20th century to model
letter sequences in works of Russian literature (Manning and Sch¨utze 1999).
Another famous application of language models are Claude Shannon’s models of
letter sequences and word sequences, which he used to illustrate the implications
of coding and information theory (Shannon 1948). Later, statistical language
models were developed as a general natural language processing tool. Language
models were first successfully used for automatic speech recognition at the end of
the 1970’s. The by now standard model of automatic speech recognition consists
of two parts. The first part is the language model, that predicts the next word in
continuous speech. The second part models the acoustic signal and is therefore
called the acoustic model. The theory behind the speech recognition models is part of hidden Markov model theory (indeed, a ‘hidden’ version of Markov’s

数据挖掘交友


models) that was developed by Leonard Baum and his colleagues at IBM in
the late 1960s and early 1970s (Rabiner 1990; Jelinek 1997). Recently, hidden
Markov models are studied as part of a general graphical model formalism,
which subsumes many of the multivariate probabilistic models used in statistics,
systems engineering, information theory and pattern recognition. Examples
include Bayesian networks, Markov random fields, factor analysis and Kalman
filters (Jordan 1998; Bengio 1999). 数据挖掘工具


The application to information retrieval
Only very recently, since 1998, statistical language models are applied to information
retrieval. The past two years show a remarkably large number of
publications in which statistical language models are used to compute the ranking
of documents given a query. To sum them up quickly: Ponte and Croft
(1998) were the first to suggest the use of language models in information retrieval.
They used estimation based on risk functions to overcome the problem
of small sample sizes. Hiemstra (1998a) and Hiemstra and Kraaij (1999) were
the first to introduce ranking based on a mixture of global and local probability
distributions that is also used in the publications mentioned in the remainder of
this paragraph. Miller, Leek, and Schwartz (1999) use hidden Markov models
for ranking, including the use of bi-grams to model two word phrases and a
method for performing blind feedback. Sahami (1999) suggested an approach

数据挖掘工具


to document clustering based on smoothing the document models by using the
geometric mean of the global and local distributions. Berger and Lafferty (1999)
and Hiemstra and De Jong (1999) developed a model that includes statistical
translation. Ng (2000) introduced a model that uses the ratio of the conditional
probability of the query given the document and the prior probability of the
query, including a method for query expansion. Song and Croft (1999) used a
model which includes bi-grams and introduced Good Turing re-estimation to
smooth the document models. This chapter will address details of many of the
above mentioned publications. They will be recited where appropriate in the
following sections. It is assumed that the reader is familiar with the basics of
probability theory as for instance presented by Mood and Graybill (1963).

数据挖掘交友


4.1.3 Two models of information retrieval processes
This chapter will introduce two models of information retrieval: a basic retrieval
model and an extension of the basic model, the statistical translation retrieval
model. The basic model defines the system’s matching process. It has the same
function as the models presented in chapter 2. The extended model adds statistical
translation to the basic retrieval model to model both the matching process
and the query formulation process. Because today’s computers are still not able
to really understand the documents and the user’s request, both matching and
query formulation are modelled by simple probability mechanisms. Matching
is modelled by the generation of a random query from a relevant document and query formulation is modelled by translation of the query into the request
(Hiemstra and De Jong 1999).
Figure 4.1 suggests an information theoretic view of the problem (Miller,


Leek, and Schwartz 1999; Berger and Lafferty 1999). Information theory was
developed by Shannon (1948) to model the problem of decoding a message that
is sent over a noisy communication channel. From this viewpoint, a relevant
document d gets ‘corrupted’ into a query t1, · · · , tn by sending it through a
noisy channel, and the query gets again corrupted into a request s1, · · · , sn by
sending it through a second noisy channel. A natural language information
retrieval system can be thought of as a decoding function f : s1, · · · , sn ! d,
that tries to reproduce the message that was originally sent, that is, to find
the document that is relevant to the request. An optimal retrieval system will
choose f(s1, · · · , sn) such that:
model. A real life retrieval system does not know these probabilities, but instead
defines them by some simple basic principles. A basic principle for the matching

数据挖掘工具


model might be that each document has the same probability of being relevant,
and that within a document each occurrence of a term has the same probability
of ending up in the query. A basic principle for the query formulation model
might be that each query term is translated to one and only one word in the
request.

数据挖掘研究院


For each document in the collection, a two-step statistical model defines the
probability of generating the user request. Documents are ranked according
to this probability. If a request is entered, the system first uses the query
formulation model to hypothesise for each word in the request the terms that
might have generated it. This results in a structured query that represents all
queries that might have generated the request. In a second step, the system
uses the matching model of each document to calculate the probability that the
document generated any of the queries represented by the structured query.

数据挖掘实验室

The two parts have objectives that are similar to the two parts of the speech
recognition models. The objective of the translation model of information retrieval
is similar to the acoustic model of speech recognition. Both model the
observed signal, respectively the user’s request and the sound wave. The structured
query that represents all queries that might have generated the request,
can be compared to a so-called word lattice in speech recognition (Rabiner
1990). The objective of the basic model of information retrieval is similar to
the objective of the language model of speech recognition. The models predict
respectively the next term in the query, and the next word in speech. So, the
basic retrieval model is the ‘true’ language model, and the translation model is
the signal model. The distinction between language model and signal model can
also be made for e.g. models for part-of-speech tagging (Cutting et al. 1992),
and models for statistical machine translation (Brown et al. 1990). The major


difference with the models of speech recognition, part-of-speech tagging and machine
translation, is that for information retrieval there is a separate language
model for each document in the collection. 数据挖掘交友


4.1.5 The query formulation model
For the query formulation model a simple one-to-one statistical translation
model will be used (Hiemstra 1998b), that is, each query term is translated
to one, and only one request word. The model requires easier calculations during
actual use than the one-to-many models of Brown et al. (1990) which are
quite standard in the field. Training a one-to-one model from data, for instance
training a machine translation lexicon from a parallel corpus, is however less
straightforward, but can be done efficiently by some effective approximations.
The training of statistical translation models is not addressed by this thesis. The
existence of a translation tool for query formulation is simply assumed. Any natural language processing tool or algorithm that converts natural language
words into some other representation may be used as the translation/query formulation
tool. Examples are stemming algorithms (Porter 1980), edit distance

数据挖掘交友


algorithms (Baeza-Yates 1992), fuzzy matching algorithms (De Heer 1979), the
soundex algorithm (Gadd 1988), ontologies as Wordnet (Miller et al. 1990), or
machine-readable bilingual dictionaries. 数据挖掘论坛


4.1.6 The matching model
The matching model assumes that relevant documents are drawn at random
from the document collection. Given a relevant document, queries are generated
by the explicit generation of important terms and unimportant terms. The
important terms are supposed to be drawn at random from the document. The
unimportant terms are supposed to be drawn at random from the full collection.
The probabilities of drawing the terms from the document are calculated by a
simple procedure that, in introductory courses on probability theory (Mood and
Graybill 1963), is often explained by urns containing coloured balls. Consider 4
urns with coloured balls, one of them with 3 red balls, 1 blue ball and 6 yellow
balls. For instance, the probability of selecting at random the described urn
and then drawing at random a red ball is 1/4 × 3/(3+1+6) = 0.25 × 0.3 =
0.075, and the probability of drawing the urn and then drawing at random, 数据挖掘实验室
with replacement, first a red ball and then a blue ball is 1/4 × 3/10 × 1/10 =
0.25 × 0.3 × 0.1 = 0.0075. Instead of urns containing coloured balls, the system
uses documents containing terms, but the procedure is exactly the same. 数据挖掘交友

4.1.7 An ideal user
The probability mechanisms that define how requests are generated from a relevant
document should in some way reflect the way users choose the words when
they formulate the request. When users enter a request in a full text information
retrieval system, they do have a reasonable idea of what a relevant document
would look like and they will choose the words accordingly (Ponte and Croft
1998). To formulate a request, users might picture themselves a relevant document
to choose words from. A probability is assigned to each hypothesis “the
user has the document in mind” and the documents are ranked by this probability
(Miller et al. 1999). The document in the collection that is most similar
to the document that the user has in mind is the best candidate for retrieval.
An ideal user might be defined as follows. Ideal users choose the relevant
document they picture in their mind, and the corresponding query terms according 数据挖掘研究院
to the probability mechanism that is informally introduced in the previous
section. Ideal users know exactly what the collection looks like. Once they have
decided which document they are looking for, they choose important terms and
unimportant terms as defined above: the important terms are selected at random
from the relevant document, and the unimportant terms are selected at
random from the collection. Of course, ideal users do not exist in practice. Real
users do not know what the collection looks like, and they often do not know exactly what they are looking for. Thinking of the retrieval model as a model
of an ideal user explains under which circumstances the model works best. According
to the experimental results reported in chapter 5, 6 and 7, the ideal
user assumption provides a reasonable approximation of the behaviour of the
real world user. Similar reasons for using simplifications to real world problems
exist in other research areas for very different problems. For instance in thermodynamics, 数据挖掘论坛
an ideal gas consist of particals with zero volume that move in
any direction with equal probability. Ideal gasses do not exist in practice, but
in many cases they provide a convenient approximation of the real world: that
is the essence of modelling. 数据挖掘实验室


4.1.8 An overview of this chapter
The remainder of this chapter is structured as follows. Section 4.2 formally
introduces the basic model of the matching process. In section 4.3 translation
of terms is added to model the query formulation process. Section 4.4 addresses
the notion of importance of query terms. Section 4.5 and 4.6 will present the
exact same models in terms of respectively hidden Markov models and Bayesian
networks. Section 4.7 addresses implementation details and shows the resemblance
of the resulting weighting formulas with other models. Finally, section
4.8 introduces extensions for proximity searching. 数据挖掘研究院

 

数据挖掘交友

[数据挖掘工作交流] [数据挖掘研究院] [数据挖掘论坛] [数据挖掘实验室]
上一篇:EXACT STRING MATCHING ALGORITHMS : Introduction
下一篇:汉语拼音与汉语信息处理技术
最新评论共有 0 位网友发表了评论 , 查看所有评论
发表评论( 不能超过250字,需审核,请自觉遵守互联网相关政策法规。 )
匿名?
数据挖掘网站导航 数据挖掘论坛导航
  • 数据挖掘工具
  • 数据挖掘论坛
  • DataCruncher - Cognos
  • MineSet - MathSoft
  • Intelligent Miner - GainSmarts
  • Sqlserver - SAS - Clementine
  • CART - Weka - WizSoft
  • NeuroShell - ModelQuest
  • data mining tools - Darwin
  • 数据挖掘交友
  • 数据挖掘博客
  • 数据挖掘工具
  • 数据挖掘资源
  • 数据挖掘技术算法
  • 数据挖掘相关期刊、会议
  • 研究院联盟合作专区
  • 数据挖掘基础与相关技术
  • 数据挖掘厂商与就业
  • 数据挖掘研究者乐园
  • 知名厂商数据挖掘工具资料
  • 国内数据挖掘实验室
  • Foreign Data Mining Lab
  • 热点关注
  • 信息检索的核心支撑技术
  • 信息检索研究人员推荐读物
  • 清华信息检索在TREC评测中再创佳绩
  • 如何实现中文文献的自动聚合分类
  • Resources for Text, Speech and Language
  • 基于WordNet的文本分类技术研究和实现
  • 字符串匹配的KMP算法
  • 中创软件Infor中间件助力税收信息化
  • Boyer Moore 算法
  • 中文信息处理——纵览与建议
  • 论坛最新话题
  • 正规省级、国家级别期刊征集论文稿件
  • 寻data mining cookbook 一书的配套光盘
  • 网博垂直搜索引擎完全开源版
  • 电脑也会成为火灾元凶 操作不当也会有危险
  • 网络暴力间接逼死崔真实 韩国拟立法实名上
  • 网络最流行的歌曲单良《那一场雪》推荐给大
  • 快国庆了大家怎么安排
  • 08年“铁观音秋茶”安溪铁观音,茶叶批发网
  • 快国庆了大家怎么安排
  • 世界最大规模“网格计算”网络启动
  • 相关资讯
  • 信息检索权威资料收集
  • Artificial Intelligence as Smart as Huma
  • 2nd CFP: Social Linking Track at Hyperte
  • 如何实现中文文献的自动聚合分类
  • 信息检索的核心支撑技术
  • Efficient Similarity Search over Vector
  • MARS: A Matching and Ranking System for
  • 信息检索研究人员推荐读物
  • Resources for Text, Speech and Language
  • Information Wants to be Found
  • 数据挖掘实验室资料
  • 注册成为SAS用户与爱好者俱乐部会员
  • 水南梅
  • 明日烟
  • 新人报道
  • 下载
  • 厦门服务器托管,450元/月—0592-5177319 高
  • 买空间送域名--0592-5177319 高静
  • mit ocw 数据挖掘相关课程连接
  • Introduction to Data Mining
  • Data Mining & Business Intelligence