|
Statistical language models have been around for quite a long time. They were first applied by Andrei Markov at the beginning of the 20th century to model letter sequences in works of Russian literature (Manning and Sch¨utze 1999). Another famous application of language models are Claude Shannon’s models of letter sequences and word sequences, which he used to illustrate the implications of coding and information theory (Shannon 1948). Later, statistical language models were developed as a general natural language processing tool. Language models were first successfully used for automatic speech recognition at the end of the 1970’s. The by now standard model of automatic speech recognition consists of two parts. The first part is the language model, that predicts the next word in continuous speech. The second part models the acoustic signal and is therefore called the acoustic model. The theory behind the speech recognition models is part of hidden Markov model theory (indeed, a ‘hidden’ version of Markov’s 字串2 models) that was developed by Leonard Baum and his colleagues at IBM in the late 1960s and early 1970s (Rabiner 1990; Jelinek 1997). Recently, hidden Markov models are studied as part of a general graphical model formalism, which subsumes many of the multivariate probabilistic models used in statistics, systems engineering, information theory and pattern recognition. Examples include Bayesian networks, Markov random fields, factor analysis and Kalman filters (Jordan 1998; Bengio 1999).
字串5
The application to information retrieval Only very recently, since 1998, statistical language models are applied to information retrieval. The past two years show a remarkably large number of publications in which statistical language models are used to compute the ranking of documents given a query. To sum them up quickly: Ponte and Croft (1998) were the first to suggest the use of language models in information retrieval. They used estimation based on risk functions to overcome the problem of small sample sizes. Hiemstra (1998a) and Hiemstra and Kraaij (1999) were the first to introduce ranking based on a mixture of global and local probability distributions that is also used in the publications mentioned in the remainder of this paragraph. Miller, Leek, and Schwartz (1999) use hidden Markov models for ranking, including the use of bi-grams to model two word phrases and a method for performing blind feedback. Sahami (1999) suggested an approach
字串2 to document clustering based on smoothing the document models by using the geometric mean of the global and local distributions. Berger and Lafferty (1999) and Hiemstra and De Jong (1999) developed a model that includes statistical translation. Ng (2000) introduced a model that uses the ratio of the conditional probability of the query given the document and the prior probability of the query, including a method for query expansion. Song and Croft (1999) used a model which includes bi-grams and introduced Good Turing re-estimation to smooth the document models. This chapter will address details of many of the above mentioned publications. They will be recited where appropriate in the following sections. It is assumed that the reader is familiar with the basics of probability theory as for instance presented by Mood and Graybill (1963). 字串2
4.1.3 Two models of information retrieval processes This chapter will introduce two models of information retrieval: a basic retrieval model and an extension of the basic model, the statistical translation retrieval model. The basic model defines the system’s matching process. It has the same function as the models presented in chapter 2. The extended model adds statistical translation to the basic retrieval model to model both the matching process and the query formulation process. Because today’s computers are still not able to really understand the documents and the user’s request, both matching and query formulation are modelled by simple probability mechanisms. Matching is modelled by the generation of a random query from a relevant document and query formulation is modelled by translation of the query into the request (Hiemstra and De Jong 1999). Figure 4.1 suggests an information theoretic view of the problem (Miller,
字串1 Leek, and Schwartz 1999; Berger and Lafferty 1999). Information theory was developed by Shannon (1948) to model the problem of decoding a message that is sent over a noisy communication channel. From this viewpoint, a relevant document d gets ‘corrupted’ into a query t1, · · · , tn by sending it through a noisy channel, and the query gets again corrupted into a request s1, · · · , sn by sending it through a second noisy channel. A natural language information retrieval system can be thought of as a decoding function f : s1, · · · , sn ! d, that tries to reproduce the message that was originally sent, that is, to find the document that is relevant to the request. An optimal retrieval system will choose f(s1, · · · , sn) such that: model. A real life retrieval system does not know these probabilities, but instead defines them by some simple basic principles. A basic principle for the matching 字串3 model might be that each document has the same probability of being relevant, and that within a document each occurrence of a term has the same probability of ending up in the query. A basic principle for the query formulation model might be that each query term is translated to one and only one word in the request. 字串9
For each document in the collection, a two-step statistical model defines the probability of generating the user request. Documents are ranked according to this probability. If a request is entered, the system first uses the query formulation model to hypothesise for each word in the request the terms that might have generated it. This results in a structured query that represents all queries that might have generated the request. In a second step, the system uses the matching model of each document to calculate the probability that the document generated any of the queries represented by the structured query.
字串2
The two parts have objectives that are similar to the two parts of the speech recognition models. The objective of the translation model of information retrieval is similar to the acoustic model of speech recognition. Both model the observed signal, respectively the user’s request and the sound wave. The structured query that represents all queries that might have generated the request, can be compared to a so-called word lattice in speech recognition (Rabiner 1990). The objective of the basic model of information retrieval is similar to the objective of the language model of speech recognition. The models predict respectively the next term in the query, and the next word in speech. So, the basic retrieval model is the ‘true’ language model, and the translation model is the signal model. The distinction between language model and signal model can also be made for e.g. models for part-of-speech tagging (Cutting et al. 1992), and models for statistical machine translation (Brown et al. 1990). The major 字串5 difference with the models of speech recognition, part-of-speech tagging and machine translation, is that for information retrieval there is a separate language model for each document in the collection. 字串8
4.1.5 The query formulation model For the query formulation model a simple one-to-one statistical translation model will be used (Hiemstra 1998b), that is, each query term is translated to one, and only one request word. The model requires easier calculations during actual use than the one-to-many models of Brown et al. (1990) which are quite standard in the field. Training a one-to-one model from data, for instance training a machine translation lexicon from a parallel corpus, is however less straightforward, but can be done efficiently by some effective approximations. The training of statistical translation models is not addressed by this thesis. The existence of a translation tool for query formulation is simply assumed. Any natural language processing tool or algorithm that converts natural language words into some other representation may be used as the translation/query formulation tool. Examples are stemming algorithms (Porter 1980), edit distance
字串5 algorithms (Baeza-Yates 1992), fuzzy matching algorithms (De Heer 1979), the soundex algorithm (Gadd 1988), ontologies as Wordnet (Miller et al. 1990), or machine-readable bilingual dictionaries. 字串1
4.1.6 The matching model The matching model assumes that relevant documents are drawn at random from the document collection. Given a relevant document, queries are generated by the explicit generation of important terms and unimportant terms. The important terms are supposed to be drawn at random from the document. The unimportant terms are supposed to be drawn at random from the full collection. The probabilities of drawing the terms from the document are calculated by a simple procedure that, in introductory courses on probability theory (Mood and Graybill 1963), is often explained by urns containing coloured balls. Consider 4 urns with coloured balls, one of them with 3 red balls, 1 blue ball and 6 yellow balls. For instance, the probability of selecting at random the described urn and then drawing at random a red ball is 1/4 × 3/(3+1+6) = 0.25 × 0.3 = 0.075, and the probability of drawing the urn and then drawing at random,
字串4 with replacement, first a red ball and then a blue ball is 1/4 × 3/10 × 1/10 = 0.25 × 0.3 × 0.1 = 0.0075. Instead of urns containing coloured balls, the system uses documents containing terms, but the procedure is exactly the same. 字串4
4.1.7 An ideal user The probability mechanisms that define how requests are generated from a relevant document should in some way reflect the way users choose the words when they formulate the request. When users enter a request in a full text information retrieval system, they do have a reasonable idea of what a relevant document would look like and they will choose the words accordingly (Ponte and Croft 1998). To formulate a request, users might picture themselves a relevant document to choose words from. A probability is assigned to each hypothesis “the user has the document in mind” and the documents are ranked by this probability (Miller et al. 1999). The document in the collection that is most similar to the document that the user has in mind is the best candidate for retrieval. An ideal user might be defined as follows. Ideal users choose the relevant document they picture in their mind, and the corresponding query terms according 字串9 to the probability mechanism that is informally introduced in the previous section. Ideal users know exactly what the collection looks like. Once they have decided which document they are looking for, they choose important terms and unimportant terms as defined above: the important terms are selected at random from the relevant document, and the unimportant terms are selected at random from the collection. Of course, ideal users do not exist in practice. Real users do not know what the collection looks like, and they often do not know exactly what they are looking for. Thinking of the retrieval model as a model of an ideal user explains under which circumstances the model works best. According to the experimental results reported in chapter 5, 6 and 7, the ideal user assumption provides a reasonable approximation of the behaviour of the real world user. Similar reasons for using simplifications to real world problems exist in other research areas for very different problems. For instance in thermodynamics, 字串1 an ideal gas consist of particals with zero volume that move in any direction with equal probability. Ideal gasses do not exist in practice, but in many cases they provide a convenient approximation of the real world: that is the essence of modelling. 字串4
4.1.8 An overview of this chapter The remainder of this chapter is structured as follows. Section 4.2 formally introduces the basic model of the matching process. In section 4.3 translation of terms is added to model the query formulation process. Section 4.4 addresses the notion of importance of query terms. Section 4.5 and 4.6 will present the exact same models in terms of respectively hidden Markov models and Bayesian networks. Section 4.7 addresses implementation details and shows the resemblance of the resulting weighting formulas with other models. Finally, section 4.8 introduces extensions for proximity searching. 字串1
字串7
|