Introduction to Text Indexing with Apache Jakarta Lucene

 

What Lucene Is

Lucene is a Java library that adds text indexing and searching capabilities to an application. It is not a complete application that one can just download, install, and run. It offers a simple, yet powerful core API. To start using it, one needs to know only a few Lucene classes and methods.

Lucene offers two main services: text indexing and text searching. These two activities are relatively independent of each other, although indexing naturally affects searching. In this article I will focus on text indexing, and we will look at some of the core Lucene classes that provide text indexing capabilities.

Lucene Background

Lucene was originally written by Doug Cutting and was available for download from SourceForge. It joined the Apache Software Foundation's Jakarta family of open source server-side Java products in September of 2001. With each release since then, the project has enjoyed more visibility, attracting more users and developers. As of November 2002, Lucene version 1.2 has been released, with version 1.3 in the works. In addition to those organizations mentioned on the "Powered by Lucene" page, I have heard of FedEx, Overture, Mayo Clinic, Hewlett Packard, New Scientist magazine, Epiphany, and others using, or at least evaluating, Lucene.


Installing Lucene

Like most other Jakarta projects, Lucene is distributed as pre-compiled binaries or in source form. You can download the latest official release from Lucene's release page. There are also nightly builds, if you'd like to use the newest features. To demonstrate Lucene usage, I will assume that you will use the pre-compiled distribution. Simply download the Lucene .jar file and add its path to your CLASSPATH environment variable. If you choose to get the source distribution and build it yourself, you will need Jakarta Ant and JavaCC, which is available as a free download. Although the company that created JavaCC no longer exists, you can still get JavaCC from the URL listed in the References section of this article.
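
Before running the examples, it can help to confirm that the Lucene .jar really is visible to the JVM. The following is a minimal sketch of such a check. The class ClasspathCheck and its helper method are hypothetical and not part of Lucene; the sketch relies only on the standard Class.forName call and on the fully qualified name of IndexWriter, which appears in the import statements later in this article.

```java
/**
 * Hypothetical helper (not part of Lucene): reports whether a class
 * can be loaded from the current classpath.
 */
final class ClasspathCheck
{
    static boolean isOnClasspath(String className)
    {
        try
        {
            // Load without initializing the class; we only care
            // whether the name can be resolved.
            Class.forName(className, false,
                ClasspathCheck.class.getClassLoader());
            return true;
        }
        catch (ClassNotFoundException | LinkageError e)
        {
            return false;
        }
    }

    public static void main(String[] args)
    {
        String name = "org.apache.lucene.index.IndexWriter";
        System.out.println(name + (isOnClasspath(name)
            ? " found on the classpath"
            : " NOT found; check your CLASSPATH setting"));
    }
}
```

If the class is reported as missing, the most likely cause is that the CLASSPATH entry points at the directory containing the .jar file rather than at the .jar file itself.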

Indexing with Lucene

Before we jump into code, let's look at some of the fundamental Lucene classes for indexing text. They are IndexWriter, Analyzer, Document, and Field.

IndexWriter is used to create a new index and to add Documents to an existing index.

Before text is indexed, it is passed through an Analyzer. Analyzers are in charge of extracting indexable tokens out of text to be indexed, and eliminating the rest. Lucene comes with a few different Analyzer implementations. Some of them skip stop words (frequently used words that don't help distinguish one document from another, such as "a," "an," "the," "in," "on," etc.), some convert all tokens to lowercase letters so that searches are not case-sensitive, and so on.
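
To make the idea of analysis concrete, here is a toy sketch in plain Java. It is not Lucene code and does not use Lucene's Analyzer API: the class name ToyAnalyzer, the five-word stop list, and the tokenization rule (split on anything that is not a letter or digit) are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/**
 * Toy illustration of what an analyzer does. NOT Lucene code: the
 * stop-word list and the tokenization rule are assumptions.
 */
final class ToyAnalyzer
{
    private static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "an", "the", "in", "on"));

    static List<String> analyze(String text)
    {
        List<String> tokens = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+"))
        {
            String token = raw.toLowerCase(Locale.ROOT);
            // Skip empty fragments and stop words; keep everything else.
            if (!token.isEmpty() && !STOP_WORDS.contains(token))
            {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args)
    {
        // prints [quick, fox, sat, log]
        System.out.println(analyze("The Quick Fox sat on a Log"));
    }
}
```

Only the tokens that survive this step would end up in an index, which is why the choice of Analyzer directly affects what can be found later.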

An index consists of a set of Documents, and each Document consists of one or more Fields. Each Field has a name and a value. Think of a Document as a row in an RDBMS table, and Fields as columns in that row.
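
The row-and-column analogy can be sketched as a small data structure. This is a toy model in plain Java, not Lucene's Document or Field classes; the names ToyIndex and ToyDocument are hypothetical, and a real index does far more than keep documents in a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy model of the Document/Field structure. NOT Lucene code; all
 * names here are hypothetical.
 */
final class ToyIndex
{
    /** One "row": an ordered set of named fields ("columns"). */
    static final class ToyDocument
    {
        private final Map<String, String> fields = new LinkedHashMap<>();

        ToyDocument add(String name, String value)
        {
            fields.put(name, value);
            return this;
        }

        String get(String name)
        {
            return fields.get(name);
        }
    }

    private final List<ToyDocument> documents = new ArrayList<>();

    void addDocument(ToyDocument document)
    {
        documents.add(document);
    }

    ToyDocument document(int position)
    {
        return documents.get(position);
    }

    int size()
    {
        return documents.size();
    }

    public static void main(String[] args)
    {
        ToyIndex index = new ToyIndex();
        index.addDocument(new ToyDocument()
            .add("title", "Text Indexing")
            .add("body", "This is the text to index"));
        System.out.println(index.document(0).get("title"));
    }
}
```

The analogy has limits: unlike rows in a database table, the Documents in one Lucene index are not required to share the same set of Fields.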

Now, let's consider the simplest scenario, where you have a piece of text to index, stored in an instance of String. Here is how you could do it, using the classes described above:

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

/**
 * LuceneIndexExample class provides a simple
 * example of indexing with Lucene.  It creates a fresh
 * index called "index-1" in a temporary directory every
 * time it is invoked and adds a single document with a
 * single field to it.
 */
public class LuceneIndexExample
{
    public static void main(String[] args) throws Exception
    {
        String text = "This is the text to index with Lucene";

        // Store the index in a directory named "index-1" under the
        // system temporary directory.
        String indexDir =
            System.getProperty("java.io.tmpdir", "tmp") +
            System.getProperty("file.separator") + "index-1";
        Analyzer analyzer = new StandardAnalyzer();
        // true: create a new index, overwriting any existing one.
        boolean createFlag = true;

        IndexWriter writer =
            new IndexWriter(indexDir, analyzer, createFlag);
        Document document  = new Document();
        document.add(Field.Text("fieldname", text));
        writer.addDocument(document);
        // Closing the writer flushes all index changes to disk.
        writer.close();
    }
}


Let's step through the code. Lucene stores its indices in directories on the file system. Each index is contained within a single directory, and multiple indices should not share a directory. The first parameter in IndexWriter's constructor specifies the directory where the index should be stored. The second parameter provides the implementation of Analyzer that should be used for pre-processing the text before it is indexed. This particular implementation of Analyzer eliminates stop words, converts tokens to lower case, and performs a few other small input modifications, such as eliminating periods from acronyms. The last parameter is a boolean flag that, when true, tells IndexWriter to create a new index in the specified directory, or overwrite an index in that directory, if it already exists. A value of false instructs IndexWriter to instead add Documents to an existing index.

We then create a blank Document, and add a Field called fieldname to it, with a value of the String that we want to index. Once the Document is populated, we add it to the index via the instance of IndexWriter. Finally, we close the IndexWriter. This is important, as it ensures that all index changes are flushed to the disk.
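
Because a create flag of true overwrites an existing index, an application that is run repeatedly may want to pick the flag at run time instead of hard-coding it. The following is a minimal sketch of one way to do that in plain Java. The class CreateFlagChooser is hypothetical and not part of Lucene. It also rests on an assumption that you should verify against the Lucene version you use: that an existing index directory contains a file named "segments".

```java
import java.io.File;

/**
 * Hypothetical helper (not part of Lucene) for choosing the create
 * flag. ASSUMPTION: an existing index directory contains a file
 * named "segments"; check this against your Lucene version.
 */
final class CreateFlagChooser
{
    /**
     * Returns true when no index appears to exist in the directory,
     * so a new one should be created; false when an index seems to
     * be present and Documents should be added to it.
     */
    static boolean shouldCreate(File indexDir)
    {
        return !new File(indexDir, "segments").exists();
    }

    public static void main(String[] args)
    {
        File indexDir = new File(
            System.getProperty("java.io.tmpdir", "tmp"), "index-1");
        System.out.println("create flag: " + shouldCreate(indexDir));
    }
}
```

In LuceneIndexExample, the hard-coded flag could then be replaced with `boolean createFlag = CreateFlagChooser.shouldCreate(new File(indexDir));`, so that a second run adds to the index instead of replacing it.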

Analyzers

As I already mentioned, Analyzers are components that pre-process input text. They are also used when searching. Because the search string has to be processed the same way that the indexed text was processed, it is crucial to use the same Analyzer for both indexing and searching. Not using the same Analyzer will result in invalid search results.
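
The effect of a mismatch is easy to reproduce without Lucene. In the toy sketch below, which is plain Java with hypothetical names and not Lucene code, the indexed text is lower-cased. A query that goes through the same processing finds its term, while a query that skips it does not.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/**
 * Toy demonstration (NOT Lucene code) of why indexing and searching
 * must process text the same way.
 */
final class AnalyzerMismatchDemo
{
    /** Index-time processing: split on whitespace and lower-case. */
    static Set<String> indexTerms(String text)
    {
        Set<String> terms = new HashSet<>();
        for (String token : text.trim().split("\\s+"))
        {
            if (!token.isEmpty())
            {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Query processed the same way as the indexed text. */
    static boolean matchesWithSameProcessing(Set<String> terms,
                                             String query)
    {
        return terms.contains(query.toLowerCase(Locale.ROOT));
    }

    /** Query left unprocessed, i.e. a different "analyzer". */
    static boolean matchesWithoutProcessing(Set<String> terms,
                                            String query)
    {
        return terms.contains(query);
    }

    public static void main(String[] args)
    {
        Set<String> terms = indexTerms("Indexing with Lucene");
        // prints true
        System.out.println(matchesWithSameProcessing(terms, "Lucene"));
        // prints false
        System.out.println(matchesWithoutProcessing(terms, "Lucene"));
    }
}
```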


The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers that pre-process their input in different ways. Should you need to pre-process input text and queries in a way that is not provided by any of Lucene's Analyzers, you will need to implement a custom Analyzer. If you are indexing text with non-Latin characters, for instance, you will most definitely need to do this.


In this example of a custom Analyzer, we will assume we are indexing text in English. Our PorterStemAnalyzer will perform Porter stemming on its input. As stated by its creator, the Porter stemming algorithm (or "Porter stemmer") is a process for removing the more common morphological and inflexional endings from words in English. Its main function is to be part of a term normalization process that is usually done when setting up Information Retrieval systems.
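
To give a feel for what "removing endings" means, the sketch below implements only the first rule group of the Porter algorithm, commonly called step 1a, which handles plural endings. It is a deliberately incomplete illustration in plain Java, not a full stemmer and not Lucene's PorterStemFilter. The class name PorterStep1a is hypothetical, and the input is assumed to be a single lower-case English word.

```java
/**
 * Step 1a of the Porter stemming algorithm only (plural endings).
 * NOT a complete stemmer and NOT Lucene code. Assumes the input is
 * one lower-case English word.
 */
final class PorterStep1a
{
    static String apply(String word)
    {
        // The rules are tried from the longest suffix to the shortest.
        if (word.endsWith("sses"))
        {
            // sses -> ss, e.g. caresses -> caress
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("ies"))
        {
            // ies -> i, e.g. ponies -> poni
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("ss"))
        {
            // ss -> ss, e.g. caress stays caress
            return word;
        }
        if (word.endsWith("s"))
        {
            // s -> (nothing), e.g. cats -> cat
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args)
    {
        System.out.println(apply("caresses")); // prints caress
        System.out.println(apply("ponies"));   // prints poni
        System.out.println(apply("cats"));     // prints cat
    }
}
```

Note that a stem such as "poni" is not a dictionary word. That is acceptable for retrieval, because the same transformation is applied to both the indexed text and the query.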


This Analyzer will use an implementation of the Porter stemming algorithm provided by Lucene's PorterStemFilter class.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;

import java.io.Reader;
import java.util.Hashtable;

/**
 * PorterStemAnalyzer processes input
 * text by stemming English words to their roots.
 * This Analyzer also converts the input to lower case
 * and removes stop words.  A small set of default stop
 * words is defined in the STOP_WORDS
 * array, but a caller can specify an alternative set
 * of stop words by calling the non-default constructor.
 */
public class PorterStemAnalyzer extends Analyzer
{
    // Per-instance stop table, so that analyzers built with
    // different stop words do not overwrite each other's tables.
    private final Hashtable _stopTable;

    /**
     * An array containing some common English words
     * that are usually not useful for searching.
     */
    public static final String[] STOP_WORDS =
    {
        "0", "1", "2", "3", "4", "5", "6", "7", "8",
        "9", "000", "$",
        "about", "after", "all", "also", "an", "and",
        "another", "any", "are", "as", "at", "be",
        "because", "been", "before", "being", "between",
        "both", "but", "by", "came", "can", "come",
        "could", "did", "do", "does", "each", "else",
        "for", "from", "get", "got", "has", "had",
        "he", "have", "her", "here", "him", "himself",
        "his", "how","if", "in", "into", "is", "it",
        "its", "just", "like", "make", "many", "me",
        "might", "more", "most", "much", "must", "my",
        "never", "now", "of", "on", "only", "or",
        "other", "our", "out", "over", "re", "said",
        "same", "see", "should", "since", "so", "some",
        "still", "such", "take", "than", "that", "the",
        "their", "them", "then", "there", "these",
        "they", "this", "those", "through", "to", "too",
        "under", "up", "use", "very", "want", "was",
        "way", "we", "well", "were", "what", "when",
        "where", "which", "while", "who", "will",
        "with", "would", "you", "your",
        "a", "b", "c", "d", "e", "f", "g", "h", "i",
        "j", "k", "l", "m", "n", "o", "p", "q", "r",
        "s", "t", "u", "v", "w", "x", "y", "z"
    };

    /**
     * Builds an analyzer.
     */
    public PorterStemAnalyzer()
    {
        this(STOP_WORDS);
    }

    /**
     * Builds an analyzer with the given stop words.
     *
     * @param stopWords a String array of stop words
     */
    public PorterStemAnalyzer(String[] stopWords)
    {
        _stopTable = StopFilter.makeStopTable(stopWords);
    }

    /**
     * Processes the input by first converting it to
     * lower case, then by eliminating stop words, and
     * finally by performing Porter stemming on it.
     *
     * @param reader the Reader that
     *               provides access to the input text
     * @return an instance of TokenStream
     */
    public final TokenStream tokenStream(Reader reader)
    {
        return new PorterStemFilter(
            new StopFilter(new LowerCaseTokenizer(reader),
                _stopTable));
    }
}

The tokenStream(Reader) method is the core of the PorterStemAnalyzer. It lower-cases input, eliminates stop words, and uses the PorterStemFilter to remove common morphological and inflexional endings. This class includes only a small set of stop words for English. When using Lucene in a production system for indexing and searching text in English, I suggest that you use a more complete list of stop words.

To use our new PorterStemAnalyzer class, we need to modify a single line of our LuceneIndexExample class shown above, to instantiate PorterStemAnalyzer instead of StandardAnalyzer:

Old line:

Analyzer analyzer = new StandardAnalyzer();

New line:

Analyzer analyzer = new PorterStemAnalyzer();

The rest of the code remains unchanged. Anything indexed after this change will pass through the Porter stemmer. The process of text indexing with PorterStemAnalyzer is depicted in Figure 1.


Figure 1: The indexing process with PorterStemAnalyzer.

Because different Analyzers process their text input differently, note again that changing the Analyzer for an existing index is dangerous. It will result in erroneous search results later, in the same way that using different Analyzers for indexing and searching will produce invalid results.

Field Types

Lucene offers four different types of fields from which a developer can choose: Keyword, UnIndexed, UnStored, and Text. Which field type you should use depends on how you want to use that field and its values.

Keyword fields are not tokenized, but are indexed and stored in the index verbatim. This type is suitable for fields whose original value should be preserved in its entirety, such as URLs, dates, personal names, Social Security numbers, telephone numbers, etc.

UnIndexed fields are neither tokenized nor indexed, but their value is stored in the index word for word. This type is suitable for fields that you need to display with search results, but whose values you will never search directly. Because this type of field is not indexed, its values cannot be found by searching the index. Since the original value of a field of this type is stored in the index, this type is not suitable for storing fields with very large values, if index size is an issue.

UnStored fields are the opposite of UnIndexed fields. Fields of this type are tokenized and indexed, but are not stored in the index. This type is suitable for indexing large amounts of text that does not need to be retrieved in its original form, such as the bodies of Web pages, or any other type of text document.

Text fields are tokenized, indexed, and stored in the index. This means that fields of this type can be searched and their original values retrieved, but be cautious about the size of the values you store in Text fields, because stored values add to the size of the index.

If you look back at the LuceneIndexExample class, you will see that I used a Text field:

document.add(Field.Text("fieldname", text));

If we wanted to change the type of field "fieldname," we would call one of the other methods of class Field:

document.add(Field.Keyword("fieldname", text));

or

document.add(Field.UnIndexed("fieldname", text));

or

document.add(Field.UnStored("fieldname", text));

Although the Field.Text, Field.Keyword, Field.UnIndexed, and Field.UnStored calls may at first look like calls to constructors, they are really just calls to different static methods of the Field class. Table 1 summarizes the different field types.

Table 1: An overview of different field types.

Field method/type                Tokenized  Indexed  Stored
Field.Keyword(String, String)    No         Yes      Yes
Field.UnIndexed(String, String)  No         No       Yes
Field.UnStored(String, String)   Yes        Yes      No
Field.Text(String, String)       Yes        Yes      Yes
Field.Text(String, Reader)       Yes        Yes      No
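
The same information can be written down as a small lookup structure, which is handy when deciding on a field type in code. The enum below is a hypothetical summary of Table 1 in plain Java. It is not part of Lucene, the constant names are made up for this sketch, and the flags are copied from the table.

```java
/**
 * Hypothetical summary of Table 1 as an enum. NOT part of Lucene;
 * the constant names are made up, and the flags are copied from the
 * table of field types.
 */
enum FieldKind
{
    KEYWORD(false, true, true),
    UNINDEXED(false, false, true),
    UNSTORED(true, true, false),
    TEXT_FROM_STRING(true, true, true),
    TEXT_FROM_READER(true, true, false);

    final boolean tokenized;
    final boolean indexed;
    final boolean stored;

    FieldKind(boolean tokenized, boolean indexed, boolean stored)
    {
        this.tokenized = tokenized;
        this.indexed = indexed;
        this.stored = stored;
    }

    /** Only indexed fields can be matched by a search. */
    boolean isSearchable()
    {
        return indexed;
    }

    /** Only stored fields can be shown verbatim with search results. */
    boolean isRetrievable()
    {
        return stored;
    }

    public static void main(String[] args)
    {
        for (FieldKind kind : values())
        {
            System.out.println(kind
                + ": searchable=" + kind.isSearchable()
                + ", retrievable=" + kind.isRetrievable());
        }
    }
}
```

Read this way, the choice comes down to two questions: does the value need to be found by a search, and does it need to be displayed in its original form afterwards?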

Conclusion

In this article, we have learned about adding basic text indexing capabilities to your applications using IndexWriter and its associated classes. We have also developed a custom Analyzer that can perform Porter stemming on its input. Finally, we have looked at different field types and learned what each of them can be used for. In the next article of this Lucene series, we shall look at indexing in more depth, and address issues such as performance and multi-threading.

References
