More data isn’t always a good thing in text mining

In text mining, it seems obvious that we should use all the data we can get our hands on when drawing conclusions. The temptation is always to use the broadest possible query to select the data set, because we don’t want to miss anything that might be important. The problem with such an all-inclusive strategy is that it often adds noise that obscures the signal we’re trying to detect.

So, for instance, if I’m doing a study for a chocolate candy manufacturer and simply enter the query “chocolate,” the vast majority of the data I collect will have nothing whatever to do with chocolate candy. This makes it much harder to detect the relevant trends and themes related to chocolate candy, because they’ll be obscured by unrelated issues, such as the color chocolate or chocolate ice cream. So the query “chocolate candy” might actually make more sense, even though it leaves out a lot of relevant data. As long as we have enough data, adding more that is mostly irrelevant could actually make our analysis less effective.

But how much data is enough? The answer may surprise you. It doesn’t really take as much data as you might think to spot a potentially interesting trend or correlation. To see why, let’s try a simple thought experiment. Say we’re given a coin and we’re told that it may or may not be “loaded,” where a loaded coin is one that nearly always comes up heads when flipped, whereas a normal coin will only come up heads half the time. How many flips will it take for me to decide, with 99% confidence, whether the coin is fair or loaded? The answer is 7, assuming every flip comes up heads. A fair coin produces one head with probability 1/2 (50%), two heads in a row with probability 1/4 (25%), and seven heads in a row with probability 1/128 (about 0.008), which is below 1%. So in this simple experiment I only needed seven data points to tell that something was probably amiss with the coin.
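The arithmetic in the thought experiment can be checked with a short sketch. It rests on two assumptions that go beyond the text: every flip comes up heads, and “99% confidence” is read as “a fair coin would produce this run less than 1% of the time.” The function name `flips_needed` is made up for illustration.

```python
from fractions import Fraction


def flips_needed(alpha=Fraction(1, 100), p_heads=Fraction(1, 2)):
    """Return (n, prob): the smallest run of heads whose probability
    under a fair coin falls below alpha, and that probability.

    Assumes every flip comes up heads (an illustrative simplification).
    """
    n = 0
    prob = Fraction(1)
    while prob >= alpha:
        n += 1
        prob *= p_heads  # each extra head halves the probability
    return n, prob


n, prob = flips_needed()
print(n, prob, float(prob))  # prints: 7 1/128 0.0078125
```

Tightening the threshold to 1 in 1,000 only raises the count to 10 flips, which shows how slowly the required data grows when the question is this narrow.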

But if seven examples are enough to draw a conclusion from a simple experiment, why do we usually use thousands of examples to draw conclusions from text? There are actually a couple of reasons. Partly it’s because we frequently don’t get to design our experiment before the data is generated. So we basically have to take whatever data is given to us, and some of it is certain to be redundant or irrelevant for our purposes. The other issue is that we usually aren’t simply trying to answer one yes/no question (e.g. “is the coin loaded or not”) but rather are looking across thousands of potential features and correlations to find a handful that are potentially interesting. When you have to cover more bases, you naturally need more data to do it with.

So the better and more relevant the data, and the more focused the subject of the analysis, the less data you actually need to get an accurate picture. Typically, when I get a fairly focused set of short documents (paragraphs) that are relevant to the subject under study, I can get a pretty good picture of between 25 and 50 themes using between 1,000 and 10,000 documents. Right around 500 documents usually turns out to be too small a set to be interesting (it might even be easier just to read the documents one by one than to analyze them using text mining techniques). Once I get above 100,000 documents, I’ll usually either sample the data or divide it into smaller chunks using some other feature of interest.
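As a rough illustration of the sampling step, here is a minimal sketch of drawing a uniform random sample once a collection passes a size threshold. The function name `sample_if_large`, the 10,000-document target, and the fixed seed are assumptions chosen for the example. The text only says that collections above 100,000 documents get sampled or split.

```python
import random


def sample_if_large(docs, threshold=100_000, target=10_000, seed=0):
    """Return docs unchanged when the collection is at or below threshold;
    otherwise return a uniform random sample of `target` documents.

    threshold, target, and seed are illustrative defaults, not fixed rules.
    """
    if len(docs) <= threshold:
        return list(docs)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(docs, target)


# Usage with placeholder documents
docs = [f"doc {i}" for i in range(150_000)]
subset = sample_if_large(docs)
print(len(subset))  # prints: 10000
```

Uniform sampling is only one option. Splitting the collection by some other feature of interest, as mentioned above, would need a grouping key instead of a random draw.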

The moral of the story is that adding more data is not a panacea. Being thoughtful about what you want to study and why, and then carefully selecting data that is relevant to those objectives, will produce much better results in the end.

Scott Spangler is an IBM senior technical staff member who has been researching knowledge-based systems and data mining for the past 20 years. He is the co-author, along with Jeffrey Kreulen, of the book “Mining the Talk: Unlocking the Business Value in Unstructured Information”, which shows readers how to leverage unstructured data to become more competitive, responsive and innovative.
