Statistical language models have been around for quite a long time. They were
first applied by Andrei Markov at the beginning of the 20th century to model
letter sequences in works of Russian literature (Manning and Sch¨utze 1999).
Another famous application of language models are Claude Shannon’s models of
letter sequences and word sequences, which he used to illustrate the implications
of coding and information theory (Shannon 1948). Later, statistical language
models were developed as a general natural language processing tool. Language
models were first successfully used for automatic speech recognition at the end of
the 1970’s. The by now standard model of automatic speech recognition consists
of two parts. The first part is the language model, that predicts the next word in
continuous speech. The second part models the acoustic signal and is therefore
called the acoustic model. The theory behind the speech recognition models is part of hidden Markov model theory (indeed, a ‘hidden’ version of Markov’s
数据挖掘交友
models) that was developed by Leonard Baum and his colleagues at IBM in
the late 1960s and early 1970s (Rabiner 1990; Jelinek 1997). Recently, hidden
Markov models are studied as part of a general graphical model formalism,
which subsumes many of the multivariate probabilistic models used in statistics,
systems engineering, information theory and pattern recognition. Examples
include Bayesian networks, Markov random fields, factor analysis and Kalman
filters (Jordan 1998; Bengio 1999). 数据挖掘工具
The application to information retrieval
Only very recently, since 1998, statistical language models are applied to information
retrieval. The past two years show a remarkably large number of
publications in which statistical language models are used to compute the ranking
of documents given a query. To sum them up quickly: Ponte and Croft
(1998) were the first to suggest the use of language models in information retrieval.
They used estimation based on risk functions to overcome the problem
of small sample sizes. Hiemstra (1998a) and Hiemstra and Kraaij (1999) were
the first to introduce ranking based on a mixture of global and local probability
distributions that is also used in the publications mentioned in the remainder of
this paragraph. Miller, Leek, and Schwartz (1999) use hidden Markov models
for ranking, including the use of bi-grams to model two word phrases and a
method for performing blind feedback. Sahami (1999) suggested an approach
数据挖掘工具
to document clustering based on smoothing the document models by using the
geometric mean of the global and local distributions. Berger and Lafferty (1999)
and Hiemstra and De Jong (1999) developed a model that includes statistical
translation. Ng (2000) introduced a model that uses the ratio of the conditional
probability of the query given the document and the prior probability of the
query, including a method for query expansion. Song and Croft (1999) used a
model which includes bi-grams and introduced Good Turing re-estimation to
smooth the document models. This chapter will address details of many of the
above mentioned publications. They will be recited where appropriate in the
following sections. It is assumed that the reader is familiar with the basics of
probability theory as for instance presented by Mood and Graybill (1963). 数据挖掘交友
4.1.3 Two models of information retrieval processes
This chapter will introduce two models of information retrieval: a basic retrieval
model and an extension of the basic model, the statistical translation retrieval
model. The basic model defines the system’s matching process. It has the same
function as the models presented in chapter 2. The extended model adds statistical
translation to the basic retrieval model to model both the matching process
and the query formulation process. Because today’s computers are still not able
to really understand the documents and the user’s request, both matching and
query formulation are modelled by simple probability mechanisms. Matching
is modelled by the generation of a random query from a relevant document and query formulation is modelled by translation of the query into the request
(Hiemstra and De Jong 1999).
Figure 4.1 suggests an information theoretic view of the problem (Miller,
Leek, and Schwartz 1999; Berger and Lafferty 1999). Information theory was
developed by Shannon (1948) to model the problem of decoding a message that
is sent over a noisy communication channel. From this viewpoint, a relevant
document d gets ‘corrupted’ into a query t1, · · · , tn by sending it through a
noisy channel, and the query gets again corrupted into a request s1, · · · , sn by
sending it through a second noisy channel. A natural language information
retrieval system can be thought of as a decoding function f : s1, · · · , sn ! d,
that tries to reproduce the message that was originally sent, that is, to find
the document that is relevant to the request. An optimal retrieval system will
choose f(s1, · · · , sn) such that:
model. A real life retrieval system does not know these probabilities, but instead
defines them by some simple basic principles. A basic principle for the matching 数据挖掘工具
model might be that each document has the same probability of being relevant,
and that within a document each occurrence of a term has the same probability
of ending up in the query. A basic principle for the query formulation model
might be that each query term is translated to one and only one word in the
request. 数据挖掘研究院
For each document in the collection, a two-step statistical model defines the
probability of generating the user request. Documents are ranked according
to this probability. If a request is entered, the system first uses the query
formulation model to hypothesise for each word in the request the terms that
might have generated it. This results in a structured query that represents all
queries that might have generated the request. In a second step, the system
uses the matching model of each document to calculate the probability that the
document generated any of the queries represented by the structured query.
数据挖掘实验室
The two parts have objectives that are similar to the two parts of the speech
recognition models. The objective of the translation model of information retrieval
is similar to the acoustic model of speech recognition. Both model the
observed signal, respectively the user’s request and the sound wave. The structured
query that represents all queries that might have generated the request,
can be compared to a so-called word lattice in speech recognition (Rabiner
1990). The objective of the basic model of information retrieval is similar to
the objective of the language model of speech recognition. The models predict
respectively the next term in the query, and the next word in speech. So, the
basic retrieval model is the ‘true’ language model, and the translation model is
the signal model. The distinction between language model and signal model can
also be made for e.g. models for part-of-speech tagging (Cutting et al. 1992),
and models for statistical machine translation (Brown et al. 1990). The major
difference with the models of speech recognition, part-of-speech tagging and machine
translation, is that for information retrieval there is a separate language
model for each document in the collection. 数据挖掘交友
4.1.5 The query formulation model
For the query formulation model a simple one-to-one statistical translation
model will be used (Hiemstra 1998b), that is, each query term is translated
to one, and only one request word. The model requires easier calculations during
actual use than the one-to-many models of Brown et al. (1990) which are
quite standard in the field. Training a one-to-one model from data, for instance
training a machine translation lexicon from a parallel corpus, is however less
straightforward, but can be done efficiently by some effective approximations.
The training of statistical translation models is not addressed by this thesis. The
existence of a translation tool for query formulation is simply assumed. Any natural language processing tool or algorithm that converts natural language
words into some other representation may be used as the translation/query formulation
tool. Examples are stemming algorithms (Porter 1980), edit distance
数据挖掘交友
algorithms (Baeza-Yates 1992), fuzzy matching algorithms (De Heer 1979), the
soundex algorithm (Gadd 1988), ontologies as Wordnet (Miller et al. 1990), or
machine-readable bilingual dictionaries. 数据挖掘论坛
4.1.6 The matching model
The matching model assumes that relevant documents are drawn at random
from the document collection. Given a relevant document, queries are generated
by the explicit generation of important terms and unimportant terms. The
important terms are supposed to be drawn at random from the document. The
unimportant terms are supposed to be drawn at random from the full collection.
The probabilities of drawing the terms from the document are calculated by a
simple procedure that, in introductory courses on probability theory (Mood and
Graybill 1963), is often explained by urns containing coloured balls. Consider 4
urns with coloured balls, one of them with 3 red balls, 1 blue ball and 6 yellow
balls. For instance, the probability of selecting at random the described urn
and then drawing at random a red ball is 1/4 × 3/(3+1+6) = 0.25 × 0.3 =
0.075, and the probability of drawing the urn and then drawing at random, 数据挖掘实验室
with replacement, first a red ball and then a blue ball is 1/4 × 3/10 × 1/10 =
0.25 × 0.3 × 0.1 = 0.0075. Instead of urns containing coloured balls, the system
uses documents containing terms, but the procedure is exactly the same. 数据挖掘交友
4.1.7 An ideal user
The probability mechanisms that define how requests are generated from a relevant
document should in some way reflect the way users choose the words when
they formulate the request. When users enter a request in a full text information
retrieval system, they do have a reasonable idea of what a relevant document
would look like and they will choose the words accordingly (Ponte and Croft
1998). To formulate a request, users might picture themselves a relevant document
to choose words from. A probability is assigned to each hypothesis “the
user has the document in mind” and the documents are ranked by this probability
(Miller et al. 1999). The document in the collection that is most similar
to the document that the user has in mind is the best candidate for retrieval.
An ideal user might be defined as follows. Ideal users choose the relevant
document they picture in their mind, and the corresponding query terms according 数据挖掘研究院
to the probability mechanism that is informally introduced in the previous
section. Ideal users know exactly what the collection looks like. Once they have
decided which document they are looking for, they choose important terms and
unimportant terms as defined above: the important terms are selected at random
from the relevant document, and the unimportant terms are selected at
random from the collection. Of course, ideal users do not exist in practice. Real
users do not know what the collection looks like, and they often do not know exactly what they are looking for. Thinking of the retrieval model as a model
of an ideal user explains under which circumstances the model works best. According
to the experimental results reported in chapter 5, 6 and 7, the ideal
user assumption provides a reasonable approximation of the behaviour of the
real world user. Similar reasons for using simplifications to real world problems
exist in other research areas for very different problems. For instance in thermodynamics, 数据挖掘论坛
an ideal gas consist of particals with zero volume that move in
any direction with equal probability. Ideal gasses do not exist in practice, but
in many cases they provide a convenient approximation of the real world: that
is the essence of modelling. 数据挖掘实验室
4.1.8 An overview of this chapter
The remainder of this chapter is structured as follows. Section 4.2 formally
introduces the basic model of the matching process. In section 4.3 translation
of terms is added to model the query formulation process. Section 4.4 addresses
the notion of importance of query terms. Section 4.5 and 4.6 will present the
exact same models in terms of respectively hidden Markov models and Bayesian
networks. Section 4.7 addresses implementation details and shows the resemblance
of the resulting weighting formulas with other models. Finally, section
4.8 introduces extensions for proximity searching. 数据挖掘研究院
数据挖掘交友