
Non-linear models in NLP

Author: reposted from the internet. Date: 2007-05-08

If you talk to (many) "real" machine learning people, they express profound disbelief that almost everything we do in NLP is based on linear classifiers (maxent/LR, SVMs, perceptron, etc.). We only rarely use kernels and, while decision trees used to be popular, they seem to have fallen out of favor. Very few people use them for large-scale apps. (Our NYU friends are an exception.)

There are two possible explanations for this: (1) we really only need linear models; (2) we're too lazy to use anything other than linear models (or, alternatively, non-linear models don't scale). My experience tells me that for most of our sequence-y problems (parsing, tagging, etc.), there's very little to be gained by moving to, e.g., quadratic SVMs. I even tried doing NE tagging with boosted decision trees under Searn, because I really wanted it to work nicely, but it failed. I've also pondered the idea of making small decision trees with perceptrons on the leaves, so as to account for small amounts of non-linearity. Using default DT construction technology (e.g., information gain), this doesn't seem to help either. (Ryan McDonald has told me that other people have tried something similar and it hasn't worked for them either.) Perhaps this is because IG is the wrong metric to use (there exist DTs for regression with linear models on the leaves, and they are typically learned so as to maximize the "linearness" of the underlying data, but this is computationally too expensive).
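
To make concrete what a quadratic SVM would buy over a linear one, here is a minimal sketch. It assumes the homogeneous quadratic kernel (x·z)^2, which is only one of several things "quadratic" could mean, and the helper names are made up for illustration. It checks that this kernel equals an ordinary dot product over all pairwise feature conjunctions, so a quadratic model is a linear model over pairs of features.

```python
from itertools import product

def dot(x, z):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def quadratic_kernel(x, z):
    """Homogeneous quadratic kernel (an assumption; no +1 offset)."""
    return dot(x, z) ** 2

def pairwise_map(x):
    """Explicit feature map: one feature per ordered pair (i, j), valued x_i * x_j."""
    return [a * b for a, b in product(x, repeat=2)]

# Two toy binary feature vectors (hypothetical data).
x = [1, 0, 1]
z = [1, 1, 1]

# The kernel value and the explicit conjunction-space dot product agree.
print(quadratic_kernel(x, z) == dot(pairwise_map(x), pairwise_map(z)))
```

Read this way, "little to be gained from quadratic SVMs" says that adding every pairwise conjunction on top of the features we already hand-build rarely helps.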

One counter-example is the gains that people have gotten by using latent variable models (e.g., Koo and Collins), which are essentially non-linearified linear models. In a somewhat similar vein, one could consider "edge" features in CRFs (or any structured prediction technique) to be non-linear features, but this is perhaps stretching it.
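
One way to read the "edge features are non-linear" remark, sketched under my own assumptions (the tag set, the label sequence, and the function names are all hypothetical): an edge indicator over adjacent labels is the product of two per-position label indicators, so it is a conjunction rather than a linear function of them.

```python
from itertools import product

def node_indicator(labels, t, label):
    """1 if position t carries the given label, else 0."""
    return 1 if labels[t] == label else 0

def edge_indicator(labels, t, prev_label, label):
    """1 if positions t-1 and t carry (prev_label, label), else 0."""
    if t == 0:
        return 0
    return 1 if (labels[t - 1], labels[t]) == (prev_label, label) else 0

def edge_is_product_of_nodes(labels, tags):
    """Check that every edge indicator equals the product of two node indicators."""
    for t in range(1, len(labels)):
        for a, b in product(tags, repeat=2):
            conj = node_indicator(labels, t - 1, a) * node_indicator(labels, t, b)
            if edge_indicator(labels, t, a, b) != conj:
                return False
    return True

tags = ["B", "I", "O"]            # hypothetical tag set
labels = ["O", "B", "I", "O"]     # hypothetical label sequence
print(edge_is_product_of_nodes(labels, tags))
```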

Part of this may be because over time we've adapted to using features that don't need to be non-linearified. If we went back and treated each character in a word as a single feature and then required the learning algorithm to recover the important features (like "word is 'Bush'"), then clearly non-linearity would be required. This is essentially what vision people do, and exactly the cases where things like deep belief networks really shine. But so long as we're subjecting our learning algorithms to featuritis (John Langford's term), perhaps there's really not much to gain.
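
A toy sketch of the point about character features, with everything in it (the two-letter "words", the labels, the function names) invented for illustration. The data is an XOR pattern over positional character features, so no linear model can fit it, but a plain perceptron fits it as soon as a word-identity feature is added by hand.

```python
def char_features(word):
    """One indicator per (position, character) pair."""
    return {f"c{i}={ch}" for i, ch in enumerate(word)}

def word_features(word):
    """Character indicators plus one hand-engineered word-identity indicator."""
    return char_features(word) | {f"word={word}"}

def score(w, bias, feats):
    return bias + sum(w.get(f, 0.0) for f in feats)

def train_perceptron(data, featurize, epochs=50):
    """Standard perceptron updates; stops early after a mistake-free pass."""
    w, bias = {}, 0.0
    for _ in range(epochs):
        mistakes = 0
        for word, y in data:
            feats = featurize(word)
            if y * score(w, bias, feats) <= 0:
                for f in feats:
                    w[f] = w.get(f, 0.0) + y
                bias += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, bias

def training_errors(data, featurize, w, bias):
    """Count training examples the learned model gets wrong."""
    return sum(1 for word, y in data if y * score(w, bias, featurize(word)) <= 0)

# Hypothetical XOR-style data: positive iff the two characters differ.
data = [("ab", 1), ("ba", 1), ("aa", -1), ("bb", -1)]

w_char, b_char = train_perceptron(data, char_features)
w_word, b_word = train_perceptron(data, word_features)
print("char-only errors:", training_errors(data, char_features, w_char, b_char))
print("with word feature errors:", training_errors(data, word_features, w_word, b_word))
```

The char-only model must make at least one training error: the scores for "ab" and "ba" always sum to the same value as the scores for "aa" and "bb", so all four cannot have the right sign. With the word-identity feature the data becomes separable and the perceptron converges.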
