INTRODUCTION: Parsing a natural language sentence can be viewed as making a sequence of disambiguation decisions: de- termining the part-of-speech of the words, choosing between possible constituent structures, and select- ing labels for the constituents. Traditionally, disam- biguation problems in parsing have been addressed by enumerating possibilities and explicitly declaring knowledge which might aid the disambiguation pro- cess. However, these approaches have proved too brittle for most interesting natural language prob- lems. This work addresses the problem of automatically discovering the disambiguation criteria for all of the decisions made during the parsing process, given the set of possible features which can act as disambigua- tors. The candidate disambiguators are the words in the sentence, relationships among the words, and re- lationships among constituents already constructed in the parsing process. Since most natural language rules are not abso- lute, the disambiguation criteria discovered in this work are never applied deterministically. Instead, all decisions are pursued non-deterministically accord- ing to the probability of each choice. These proba- bilities are estimated using statistical decision tree models. The probability of a complete parse tree (T) of a sentence (S) is the product of each decision (dl) conditioned on all previous decisions: P(T[S) = H P(dildi-ldi-2""dlS)" diET Each decision sequence constructs a unique parse, and the parser selects the parse whose decision se- quence yields the highest cumulative probability. By combining a stack decoder search with a breadth- first algorithm with probabilistic pruning, it is pos- sible to identify the highest-probability parse for any sentence using a reasonable amount of memory and time. The claim of this work is that statistics from a large corpus of parsed sentences combined with information-theoretic classification and training al- gorithms can produce an accurate natural language parser without the aid of a complicated knowl- edge base or grammar. This claim is justified by constructing a parser, called SPATTER (Statistical PATTErn Recognizer), based on very limited lin- gnistic information, and comparing its performance to a state-of-the-art grammar-based parser on a common task. It remains to be shown that an accu- rate broad-coverage parser can improve the perfor- mance of a text processing application. This will be the subject of future experiments. One of the important points of this work is that statistical models of natural language should not be restricted to simple, context-insensitive models. In a problem like parsing, where long-distance lex- ical information is crucial to disambiguate inter- pretations accurately, local models like probabilistic context-free grammars are inadequate. This work illustrates that existing decision-tree technology can be used to construct and estimate models which se- lectively choose elements of the context which con- tribute to disambignation decisions, and which have few enough parameters to be trained using existing resources. I begin by describing decision-tree modeling, showing that decision-tree models are equivalent to interpolated n-gram models. Then I briefly describe the training and parsing procedures used in SPAT- TER. Finally, I present some results of experiments comparing SPATTER with a grammarian′s rule- based statistical parser, along with more recent re- suits showing SPATTER applied to the Wall Street Journal domain. REFERENCES: L. R. Bahl, P. F. Brown, P. V. deSouza, and R. L. Mercer. 1989. A tree-based statistical language model for natural language speech recognition. IEEE ~Pransactions on Acoustics, Speech, and Sig- nal Processing, Vol. 36, No. 7, pages 1001-1008. L. E. Baum. 1972. An inequality and associated maximization technique in statistical estimation of probabilistic functions of markov processes. In- equalities, Vol. 3, pages 1-8. E. Black and et al. 1991. A procedure for quanti- tatively comparing the syntactic coverage of en- glish grammars. Proceedings o/ the February 1991 DARPA Speech and Natural Language Workshop, pages 306-311. E. Black, R. Garside, and G. Leech. 1993. Statistically-driven computer grammars of english: the ibm/lancaster approach. Rodopi, Atlanta, Georgia. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Ci~ssi]ication and Regression Trees. Wadsworth and Brooks, Pacific Grove, California. P. F. Brown, V. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer. 1992. "Class-based n-gram models of natural language." Computa- tional Linguistics, 18(4), pages 467-479. D. M. Magerman. 1994. Natural Language Pars- ing as Statistical Pattern Recognition. Doctoral dissertation. Stanford University, Stanford, Cali- fornia.
published in ACL 95
http://acl.ldc.upenn.edu/P/P95/P95-1037.pdf

