Generic Information Retrieval System

An information retrieval application is a collection of components which selects and returns to the user desired documents from a large set of documents (the corpus) in accordance with criteria (Detection Need) specified by the user. Information Retrieval (or Document Detection, as it is also called) performs two functions:

  • Document Search: the selection of documents from an existing collection of documents, and
  • Document Routing: the dissemination of incoming documents to appropriate users on the basis of user interest profiles.

A Detection Need is a set of criteria specified by the user which describes the kind of information desired. Detection Needs are frequently called queries when they occur in the Document Search task and are referred to as profiles in the Routing task. Detection Needs can be expressed in terms of keywords, keywords with Boolean operators, statements in free text, or example documents, depending upon the system. Until recently most systems have required the user to produce keyword or Boolean keyword queries. With advances in the field, users can define their Detection Needs in terms of free text and example documents.

Documents are generally returned to the user in the form of lists of document citations. These lists may be unordered. However, many systems rank order the documents, generally placing the documents most likely to be relevant at the top of the list, with less relevant documents placed lower on the list. The accuracy and utility of such systems are often measured by their ability to place most documents of interest to the user in the top part of the document list.

The main difference between search and routing is that the search process matches a single Detection Need against the stored corpus to return a sub-set of documents, whereas routing matches a single document as it enters the system against a group of Profiles to determine which users are interested in the document. Profiles therefore tend to be standing, long-term expressions of user needs, whereas search queries are typically ad hoc in nature.

A generic detection architecture can be used for both search and routing. Each of these tasks is discussed separately below, but the similarities should be apparent both in the figures and in the descriptions.

Search

Search is the retrieval of desired documents from an existing corpus.

Retrospective search is frequently interactive, as the user produces increasingly better search queries based on the results of initial searches and on the user's developing concept of his/her information need.

There are several methods that can be used to perform the search function; however, indexing the corpus by keyword, stem and/or phrase is central to most methods. Some methods apply statistical and/or learning techniques to better understand the content of the corpus and to determine appropriate keywords. Some methods also analyze free text Detection Needs to allow comparison with the indexed corpus or a single document.

Search can be understood in terms of a set of modules shown in Figure 1. Any particular document detection system uses its own set of modules, which may vary slightly from the ones described below. The modules are:

  1. Document Corpus
  2. Pre-Processing of Document Corpus
  3. Building Index from Stems
  4. Document Index
  5. Detection Need
  6. Convert Detection Need to System Specific Query
  7. Compare Query with Index
  8. Resultant Rank Ordered List of Documents

Figure 1. Document Detection: Search

1. Document Corpus

A corpus comprises the source documents from which the user will select the document sub-set. The content of the corpus may have a significant effect on performance in some applications. Its documents may be part of a very narrowly defined subject domain, or may pertain to a broad range of concepts covering several subject domains. Thus, different detection systems may have varying performance, depending upon the content of the corpus and the Detection Need.

TIPSTER detection applications have been developed and tested against a corpus of several gigabytes of text covering multiple subject domains.

2. Pre-Processing of Document Corpus

Pre-processing of the corpus is an area of continuing research and is a key discriminator in Document Detection methods. Most systems use some method of stemming. Stemming is the reduction of a word to its root. For example, "Contrary", "Contradiction" and "Contraband" all have 'contra' as a stem, which permits a certain amount of generalization over the meaning of the sentences or document.

Most systems also use a list of Stop words. A Stop word is a word which is usually ignored, e.g., 'a', 'an', 'the'. Stop words lack significance to the determination of the subject of a document at the rather general level at which document detection works. Research is progressing on identifying phrases and multi-term items such as dates and personal names so that these can be indexed as single terms.
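
The pre-processing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list and suffix list are hypothetical, and the suffix stripper is a deliberately naive stand-in for a real stemming algorithm, not the method used by any TIPSTER system.

```python
import re

# Hypothetical stop list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "and", "or", "is", "are"}

# Illustrative suffixes only; this is NOT a full stemming algorithm.
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str) -> list[str]:
    """Lower-case, tokenize, drop stop words, and stem the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


terms = preprocess("The markets are analyzing economic indicators")
# terms == ["market", "analyz", "economic", "indicator"]
```

Note that the stems need not be dictionary words ("analyz"); they only have to be produced consistently for both documents and queries.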

3. Building Index from Stems

This function is frequently very system specific because it is a key place in the detection system for optimizing run-time performance. Document Detection systems in general are concerned with speed due to the very large number of documents to be searched.

It may take a very long time to build the index for a large corpus. New indexes may only be built weekly or even less often. Some provision is always made for incrementally indexing documents that have been added to the corpus since the last full indexing.

4. Document Index

A document index is essentially a list of terms, stems, phrases, etc. (depending upon the search algorithm) with each term having an associated list of document identifiers which point to documents and stem locations that contain the particular item. Further information resulting from analyses of the frequency of terms in the document and corpus and of the co-occurrence of terms within the corpus may also be stored in the index to aid in the ranking of documents in the returned document set.

Frequently the index may be as large as the original document corpus, and various design and compression techniques are usually used to condense it.
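
A document index of the kind described in this section can be sketched as an in-memory inverted index. The class and method names below are hypothetical, and the sketch ignores compression and on-disk storage, which real systems need for large corpora.

```python
from collections import defaultdict


class DocumentIndex:
    """Maps each term to the documents and token positions where it occurs."""

    def __init__(self) -> None:
        # term -> {doc_id -> [positions]}
        self.postings: dict[str, dict[str, list[int]]] = defaultdict(dict)
        self.doc_count = 0

    def add_document(self, doc_id: str, terms: list[str]) -> None:
        """Index one document; can be called incrementally for new documents."""
        self.doc_count += 1
        for position, term in enumerate(terms):
            self.postings[term].setdefault(doc_id, []).append(position)

    def documents_for(self, term: str) -> set[str]:
        """Return the ids of all documents that contain the term."""
        return set(self.postings.get(term, {}))

    def term_frequency(self, term: str, doc_id: str) -> int:
        """Return how many times the term occurs in the given document."""
        return len(self.postings.get(term, {}).get(doc_id, []))


index = DocumentIndex()
index.add_document("d1", ["economic", "indicator", "market"])
index.add_document("d2", ["market", "market", "report"])
```

Storing positions alongside document identifiers is what allows phrase matching, and the per-document counts are the kind of frequency information that can later feed the ranking step.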

5. Detection Need

A Detection Need expresses the user's criteria for a relevant document. A Detection Need may take any of the forms described at the beginning of this paper. In the TIPSTER architecture the design of the Detection Need has been made generic to allow use of any or all of these forms.

This is a sample Detection Need:

Studies about economic indicators as they would apply to or be used in analyzing financial markets in European countries

6. Convert Detection Need to System Specific Query

When being processed the Detection Need is transformed in two stages: it is first transformed into a detection query, and then into a retrieval query. Some information in the Detection Need, such as keywords, may not require transformation.

Detection Needs are independent of the specific retrieval engine employed, while detection queries and retrieval queries are specific to a particular retrieval engine. By de-coupling the Detection Need and the system specific query, detection systems can more easily be ported to different domains and employ different indexing algorithms. This allows a more consistent interface with the user.

The detection query is specific to the retrieval engine but independent of the corpus over which retrieval is to be performed. The retrieval query is specific to the retrieval engine, to the operation, and to the corpus. The retrieval query may incorporate term weights based on the inverse document frequencies in a collection.
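
The two-stage transformation can be sketched as follows. The function names are hypothetical, and the weight log(N / df) is one common form of inverse document frequency chosen here for illustration; the source does not specify which weighting a given engine uses.

```python
import math


def to_detection_query(detection_need: str, stop_words: set[str]) -> list[str]:
    """Stage 1: engine-specific but corpus-independent list of query terms."""
    terms = detection_need.lower().split()
    return [t for t in terms if t not in stop_words]


def to_retrieval_query(
    detection_query: list[str], doc_freq: dict[str, int], num_docs: int
) -> dict[str, float]:
    """Stage 2: attach corpus-specific weights, here log(N / df)."""
    weights = {}
    for term in detection_query:
        df = doc_freq.get(term, 0)
        if df > 0:  # terms absent from the corpus cannot match anything
            weights[term] = math.log(num_docs / df)
    return weights


# Made-up document frequencies for a corpus of 1000 documents.
doc_freq = {"economic": 10, "markets": 100, "european": 1000}
query = to_detection_query("economic indicators in European markets", {"in"})
weights = to_retrieval_query(query, doc_freq, num_docs=1000)
```

With these made-up counts, "economic" receives the largest weight because it is the rarest term, while "european", which occurs in every document, receives a weight of zero.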

The interpretation, translation and processing of a Detection Need is also a performance-sensitive part of the detection application, again because of the large number of documents against which it will be compared.

Research is progressing on the use of phrase lists and term expansion when determining system specific queries. Certain words or phrases are replaced with more informative words which are determined from the document corpus itself. Abbreviations are frequently expanded to their full meaning.

7. Compare Query with Index

In attempting to select a desired document, the query is compared item by item with the corpus index, recognizing any embedded logic, such as "include" or "do not include".

It is not necessary to examine each document in the corpus, since its important constituent items were placed in the index prior to the query comparison. The use of an index significantly reduces the time required to identify a document, typically to a few seconds.
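
As a minimal sketch of this comparison, assuming a toy index that maps each term to the set of documents containing it, "include" and "do not include" logic reduces to set intersection and set difference. The function name and data are hypothetical.

```python
def match(index: dict[str, set[str]], include: list[str], exclude: list[str]) -> set[str]:
    """Return ids of documents containing every include term and no exclude term."""
    if not include:
        return set()
    # Intersect the posting sets of the required terms.
    candidates = set(index.get(include[0], set()))
    for term in include[1:]:
        candidates &= index.get(term, set())
    # Remove documents that contain any excluded term.
    for term in exclude:
        candidates -= index.get(term, set())
    return candidates


# Toy index: term -> ids of documents containing it.
index = {
    "economic": {"d1", "d2", "d3"},
    "market": {"d1", "d3"},
    "sports": {"d3"},
}
result = match(index, include=["economic", "market"], exclude=["sports"])
# result == {"d1"}
```

Only the posting sets for the query terms are touched; the documents themselves are never read during matching.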

8. Resultant Rank Ordered List of Documents

The list of relevant documents that results from the comparison process is rank ordered from the most relevant to the query to the least relevant. This is accomplished through the use of various weighting algorithms which are dependent upon the particular detection system. For example, a document that met every criterion in the Detection Need would be at the top of the list, and a document that met 90% of the criteria might be further down the list.

Document Detection systems typically rank order all the documents in the corpus but only return the top 'N' documents depending upon the desired cut-off specified.
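
A simple version of this ranking, assuming the score is just the fraction of criteria a document meets, might look like the sketch below. Real systems use more elaborate weighting; the function name and data are hypothetical, and documents meeting no criteria are dropped here rather than ranked last.

```python
import heapq


def rank(doc_terms: dict[str, set[str]], criteria: list[str], top_n: int) -> list[tuple[str, float]]:
    """Score each document by the fraction of criteria it meets; return the top N."""
    scored = []
    for doc_id, terms in doc_terms.items():
        met = sum(1 for c in criteria if c in terms)
        if met:
            scored.append((doc_id, met / len(criteria)))
    # Order by descending score; break ties by document id for stable output.
    return heapq.nsmallest(top_n, scored, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "d1": {"economic", "indicator", "market", "europe"},
    "d2": {"economic", "market"},
    "d3": {"sports"},
    "d4": {"economic", "indicator", "market"},
}
criteria = ["economic", "indicator", "market", "europe"]
top = rank(docs, criteria, top_n=2)
# top == [("d1", 1.0), ("d4", 0.75)]
```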

Routing

The goal of routing is to decide which users would be interested in each document in a corpus. Usually routing is applied to new incoming documents as opposed to archived documents. Any given document may be of interest to several users, and each would be notified of the existence of the document.

A user's interest is specified in an individual Profile. The Profile may be composed of multiple Detection Needs.

Routing employs a set of modules as shown in Figure 2. Any particular Detection system will use its own set of modules, which may vary slightly from the ones described below. The modules used by this generic routing system are:

  1. Profile of Multiple Detection Needs
  2. Convert Detection Need to System Specific Query
  3. Building Index from Queries
  4. Routing Profile Index
  5. Document to be Routed
  6. Pre-Processing of Document
  7. Compare Document with Index
  8. Resultant List of Profiles to which the Document belongs

Figure 2. Document Detection: Routing

1. Profile of Multiple Detection Needs

A Profile is a group of individual Detection Needs that describes a user's areas of interest. All Profiles will be compared to each incoming document (via the Profile index) to determine if any matches exist. If a document matches a Profile, the user is notified about the existence of a relevant document.

2. Convert Detection Need to System Specific Query

When being processed the Detection Need is transformed in two stages: it is first transformed into a detection query, and then into a routing query. Some information in the Detection Need, such as keywords, may not require transformation.

The Detection Need used in routing is the same Detection Need used in searching and it is converted to a specific query in the same way as in search.

3. Building Index from Queries

Building a routing profile index from the Profiles is similar to building the corpus index for searching, described above. The main difference is that the quantity of source data (Profiles) is usually much smaller than a document corpus. Additionally, Profiles may have more specific, structured data in the form of SGML tagged fields.

4. Routing Profile Index

The index will be system specific and will make use of all the pre-processing techniques employed by a particular detection system.

5. Document to be Routed

A stream of incoming documents is handled one at a time to determine where each should be directed. Routing implementations may handle multiple document streams and multiple Profiles.

6. Pre-Processing of Document

A document is pre-processed in the same manner that a query would be set up in a search, described above. In the case of routing the document and query roles are reversed compared with the search process.

7. Compare Document with Index

A comparison of a document against the query index identifies which queries, and in turn which Profiles, are relevant to the document. Essentially, the document in routing is analogous to the query in searching. The problem can be stated as: given this document, which of the indexed Profiles match it?

8. Resultant List of Profiles

The list of Profiles that was created by the matching process identifies which users should receive the document since each Profile is owned by a specific user.
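
The routing steps can be sketched end to end under simplifying assumptions: each Profile is reduced to a set of required terms, and a Profile matches when all of its terms occur in the document. The profile data, owner names and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical profiles: profile id -> (owner, required terms).
PROFILES = {
    "p1": ("alice", {"economic", "market"}),
    "p2": ("bob", {"sports"}),
    "p3": ("carol", {"market"}),
}


def build_profile_index(profiles):
    """Invert the profiles: term -> ids of profiles that require it."""
    index = defaultdict(set)
    for profile_id, (_owner, terms) in profiles.items():
        for term in terms:
            index[term].add(profile_id)
    return index


def route(document_terms, profiles, profile_index):
    """Return the owners of all profiles whose terms all occur in the document."""
    hits = defaultdict(int)
    for term in set(document_terms):
        for profile_id in profile_index.get(term, ()):
            hits[profile_id] += 1
    matched = [p for p, n in hits.items() if n == len(profiles[p][1])]
    return sorted(profiles[p][0] for p in matched)


profile_index = build_profile_index(PROFILES)
recipients = route(["economic", "market", "report"], PROFILES, profile_index)
# recipients == ["alice", "carol"]
```

Here the incoming document plays the role that the query plays in search: its terms are looked up in the Profile index, and only Profiles sharing at least one term with the document are ever examined.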

A Final Word

Document Detection is a mature technology with reliable and consistent implementations. However, advances continue to be made, particularly in the areas of expanding Detection Needs and in the conversion of Detection Needs to system specific queries.
