RSS
热门关键字:  数据挖掘  人工智能  数据仓库  搜索引擎  数据挖掘导论

about Lucene ppt

来源: 作者:unkonwn 时间:2004-12-05 点击:

Prelude

  • my background..
  • please interrupt with questions
  • blog this talk now so that we can search for it later
    (using a Lucene-based blog search engine, of course)
  • In this course, Paolo and Antonio have presented many techniques.
  • I present real software that uses many of these techniques.

Lucene is


Lucene Architecture


[draw on whiteboard for reference throughout talk]

数据挖掘研究院


Lucene API

  • org.apache.lucene.document
  • org.apache.lucene.analysis
  • org.apache.lucene.index
  • org.apache.lucene.search


数据挖掘实验室


Package: org.apache.lucene.document

  • A Document is a sequence of Fields.
  • A Field is a <name, value> pair.
    • name is the name of the field, e.g., title, body, subject, date, etc.
    • value is text.
  • Field values may be stored, indexed or analyzed (and, now, vectored).

Example

 public Document makeDocument(File f) throws FileNotFoundException { 数据挖掘研究院 
Document doc = new Document();
doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));

doc.add(new Field("modified",
DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
Store.YES, Index.UN_TOKENIZED));

// Reader implies Store.NO and Index.TOKENIZED
doc.add(new Field("contents", new FileReader(f)));

return doc;
} 数据挖掘研究院

 

Example (continued)

field
stored
indexed
analyzed
path
yes
yes
yes
modified
yes
yes
no
content
no
yes
yes

数据挖掘研究院


Package: org.apache.lucene.analysis

  • An Analyzer is a TokenStream factory.
  • A TokenStream is an iterator over Tokens.
    • input is a character iterator (Reader)
  • A Token is tuple <text, type, start, length, positionIncrement>
    • text (e.g., “pisa”).
    • type (e.g., “word”, “sent”, “para”).
    • start & length offsets, in characters (e.g, <5,4>)
    • positionIncrement (normally 1)
  • standard TokenStream implementations are
    • Tokenizers, which divide characters into tokens and 数据挖掘研究院
    • TokenFilters, e.g., stop lists, stemmers, etc.

Example

public class ItalianAnalyzer extends Analyzer {
数据挖掘研究院

  private Set stopWords = 
StopFilter.makeStopSet(new String[] {"il", "la", "in"};

public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new WhitespaceTokenizer(reader);
 result = new LowerCaseFilter(result);
  result = new StopFilter(result, stopWords);
  result = new SnowballFilter(result, "Italian");
 return result;
 }
}

数据挖掘研究院


Package: org.apache.lucene.index

  • Term is <fieldName, text>
  • index maps Term → <df, <docNum, <position>* >*>
  • e.g., “content:pisa” → <2, <2, <14>>, <4, <2, 9>>>
  • new: term vectors!

Example

IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
writer.addDocument(makeDocument(files[i]));
}
writer.close();

Some Inverted Index Strategies

  1. batch-based: use file-sorting algorithms (textbook)
      + fastest to build
      + fastest to search
      - slow to update
  2. b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
      + fast to search
      - update/build does not scale
      - complex implementation
  3. segment based: lots of small indexes (Verity)
      + fast to build
      + fast to update
      - slower to search

Lucene′s Index Algorithm

  • two basic algorithms:
    1. make an index for a single document
    2. merge a set of indices
  • incremental algorithm:
    • maintain a stack of segment indices
    • create index for each incoming document
    • push new indexes onto the stack
    • let b=10 be the merge factor; M=∞

      数据挖掘实验室


      for (size = 1; size < M; size *= b) {
        if (there are b indexes with size docs on top of the stack) {
         pop them off the stack;
         merge them into a single index;
         push the merged index onto the stack;
       } else {
         break;
       }
      }
  • optimization: single-doc indexes kept in RAM, saves system calls
  • notes:
    • average b*logb(N)/2 indexes
      • N=1M, b=2 gives just 20 indexes
      • fast to update and not too slow to search
    • batch indexing  w/ M=∞, merge all at end
    • segment indexing w/ M<∞

Indexing Diagram

  • b = 3
  • 11 documents indexed
  • stack has four indexes
  • grayed indexes have been deleted
  • 5 merges have occurred


数据挖掘研究院


Index Compression

For keys in Term -> ... map, use technique from Paolo′s slides:


For values in Term -> ... map, use technique from Paolo′s slides:

数据挖掘研究院



数据挖掘研究院

VInt Encoding Example

数据挖掘实验室

Value

First byte 数据挖掘研究院

Second byte

数据挖掘实验室

Third byte

数据挖掘研究院

0 数据挖掘研究院

00000000


数据挖掘研究院


数据挖掘研究院

1 数据挖掘研究院

00000001

数据挖掘研究院


数据挖掘实验室


2 数据挖掘研究院

00000010 数据挖掘实验室


数据挖掘研究院


数据挖掘实验室

... 数据挖掘实验室


数据挖掘研究院


数据挖掘研究院


数据挖掘研究院

127 数据挖掘研究院

01111111 数据挖掘研究院


数据挖掘研究院


128

数据挖掘研究院

10000000 数据挖掘实验室

00000001

数据挖掘研究院


数据挖掘研究院

129

10000001

00000001 数据挖掘研究院


数据挖掘研究院

130 数据挖掘实验室

10000010 数据挖掘研究院

00000001

数据挖掘研究院


数据挖掘研究院

... 数据挖掘研究院


数据挖掘研究院


数据挖掘研究院


16,383

11111111

数据挖掘实验室

01111111

数据挖掘研究院


数据挖掘研究院

16,384

数据挖掘实验室

10000000

数据挖掘研究院

10000000 数据挖掘实验室

00000001

16,385 数据挖掘研究院

10000001 数据挖掘研究院

10000000

数据挖掘研究院

00000001 数据挖掘研究院

...

数据挖掘研究院


数据挖掘研究院


数据挖掘研究院


数据挖掘研究院

数据挖掘研究院

This provides compression while still being efficient to decode.

数据挖掘实验室


Package: org.apache.lucene.search

  • primitive queries:
    • TermQuery: match docs containing a Term
    • PhraseQuery: match docs w/ sequence of Terms
    • BooleanQuery: match docs matching other queries.
      e.g., +path:pisa +content:“Doug Cutting” -path:nutch
  • new: SpansQuery

    数据挖掘研究院


  • derived queries:
    • PrefixQuery, WildcardQuery, etc.

Example

Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel")); 数据挖掘研究院

PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("content", "leaning"));
leaningTower.add(new Term("content", "tower"));

BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT); 数据挖掘研究院

Search Algorithms

From Paolo′s slides:


Lucene′s Disjunctive Search Algorithm

  • described in http://lucene.sf.net/papers/riao97.ps
  • since all postings must be processed
    • goal is to minimize per-posting computation
  • merges postings through a fixed-size array of accumulator buckets
  • performs boolean logic with bit masks
  • scales well with large queries

[ draw a diagram to illustrate? ]

数据挖掘实验室


Lucene′s Conjunctive Search Algorithm

From Paolo′s slides:


Algorithm

数据挖掘研究院

  • use linked list of pointers to doc list
  • initially sorted by doc
  • loop
    • if all are at same doc, record hit
    • skip first to-or-past last and move to end of list

Scoring

From Paolo′s slides:

Is very much like Lucene′s Similarity.
数据挖掘研究院


Lucene′s Phrase Scoring

  • approximate phrase IDF with sum of terms
  • compute actual tf of phrase
  • slop penalizes slight mismatches by edit-distance

Thanks!

And there′s lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.

数据挖掘研究院

最新评论共有 0 位网友发表了评论
发表评论
评论内容:不能超过250字,需审核,请自觉遵守互联网相关政策法规。
匿名?