Prelude
- my background..
- please interrupt with questions
- blog this talk now so that we can search for it later
(using a Lucene-based blog search engine, of course) - In this course, Paolo and Antonio have presented many techniques.
- I present real software that uses many of these techniques.
Lucene is
- software library for search
- open source
- not a complete application
- set of java classes
- active user and developer communities
- widely used, e.g, IBM and Microsoft.
Lucene Architecture
[draw on whiteboard for reference throughout talk]
Lucene API
- org.apache.lucene.document
- org.apache.lucene.analysis
- org.apache.lucene.index
- org.apache.lucene.search
Package: org.apache.lucene.document
- A Document is a sequence of Fields.
- A Field is a <name, value> pair.
- name is the name of the field, e.g., title, body, subject, date, etc.
- value is text.
- Field values may be stored, indexed or analyzed (and, now, vectored).
Example
public Document makeDocument(File f) throws FileNotFoundException { 数据挖掘研究院
Document doc = new Document();
doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));
doc.add(new Field("modified",
DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
Store.YES, Index.UN_TOKENIZED));
// Reader implies Store.NO and Index.TOKENIZED
doc.add(new Field("contents", new FileReader(f)));
return doc;
} 数据挖掘研究院
Example (continued)
| field |
stored |
indexed |
analyzed |
| path |
yes |
yes |
yes |
| modified |
yes |
yes |
no |
| content |
no |
yes |
yes |
Package: org.apache.lucene.analysis
- An Analyzer is a TokenStream factory.
- A TokenStream is an iterator over Tokens.
- input is a character iterator (Reader)
- A Token is tuple <text, type, start, length, positionIncrement>
- text (e.g., “pisa”).
- type (e.g., “word”, “sent”, “para”).
- start & length offsets, in characters (e.g, <5,4>)
- positionIncrement (normally 1)
- standard TokenStream implementations are
- Tokenizers, which divide characters into tokens and 数据挖掘研究院
- TokenFilters, e.g., stop lists, stemmers, etc.
- Tokenizers, which divide characters into tokens and 数据挖掘研究院
Example
public class ItalianAnalyzer extends Analyzer {
数据挖掘研究院
private Set stopWords =
StopFilter.makeStopSet(new String[] {"il", "la", "in"};
public TokenStream tokenStream(String fieldName, Reader reader) {
TokenStream result = new WhitespaceTokenizer(reader);
result = new LowerCaseFilter(result);
result = new StopFilter(result, stopWords);
result = new SnowballFilter(result, "Italian");
return result;
}
}
Package: org.apache.lucene.index
- Term is <fieldName, text>
- index maps Term → <df, <docNum, <position>* >*>
- e.g., “content:pisa” → <2, <2, <14>>, <4, <2, 9>>>
- new: term vectors!
Example
IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
writer.addDocument(makeDocument(files[i]));
}
writer.close();
Some Inverted Index Strategies
- batch-based: use file-sorting algorithms (textbook)
- + fastest to build
+ fastest to search
- slow to update - b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
- + fast to search
- update/build does not scale
- complex implementation - segment based: lots of small indexes (Verity)
- + fast to build
+ fast to update
- slower to search
Lucene′s Index Algorithm
- two basic algorithms:
- make an index for a single document
- merge a set of indices
- incremental algorithm:
- maintain a stack of segment indices
- create index for each incoming document
- push new indexes onto the stack
- let b=10 be the merge factor; M=∞
for (size = 1; size < M; size *= b) {
if (there are b indexes with size docs on top of the stack) {
pop them off the stack;
merge them into a single index;
push the merged index onto the stack;
} else {
break;
}
}
- optimization: single-doc indexes kept in RAM, saves system calls
- notes:
- average b*logb(N)/2 indexes
- N=1M, b=2 gives just 20 indexes
- fast to update and not too slow to search
- batch indexing w/ M=∞, merge all at end
- equivalent to external merge sort, optimal
- equivalent to external merge sort, optimal
- segment indexing w/ M<∞
- average b*logb(N)/2 indexes
Indexing Diagram
- b = 3
- 11 documents indexed
- stack has four indexes
- grayed indexes have been deleted
- 5 merges have occurred

Index Compression
For keys in Term -> ... map, use technique from Paolo′s slides:
For values in Term -> ... map, use technique from Paolo′s slides:
VInt Encoding Example
|
Value |
First byte 数据挖掘研究院 |
Second byte |
Third byte |
|
0 数据挖掘研究院 |
00000000 |
||
|
1 数据挖掘研究院 |
00000001 |
|
|
|
2 数据挖掘研究院 |
00000010 数据挖掘实验室 |
|
|
|
... 数据挖掘实验室 |
|
|
|
|
127 数据挖掘研究院 |
01111111 数据挖掘研究院 |
|
|
|
128 |
10000000 数据挖掘实验室 |
00000001 |
|
|
129 |
10000001 |
00000001 数据挖掘研究院 |
|
|
130 数据挖掘实验室 |
10000010 数据挖掘研究院 |
00000001 |
|
|
... 数据挖掘研究院 |
|
|
|
|
16,383 |
11111111 |
01111111 |
|
|
16,384 |
10000000 |
10000000 数据挖掘实验室 |
00000001 |
|
16,385 数据挖掘研究院 |
10000001 数据挖掘研究院 |
10000000 |
00000001 数据挖掘研究院 |
|
... |
|
This provides compression while still being efficient to decode.
Package: org.apache.lucene.search
- primitive queries:
- TermQuery: match docs containing a Term
- PhraseQuery: match docs w/ sequence of Terms
- BooleanQuery: match docs matching other queries.
e.g., +path:pisa +content:“Doug Cutting” -path:nutch
- TermQuery: match docs containing a Term
- new: SpansQuery
- derived queries:
- PrefixQuery, WildcardQuery, etc.
Example
Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel")); 数据挖掘研究院
PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("content", "leaning"));
leaningTower.add(new Term("content", "tower"));
BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT); 数据挖掘研究院
Search Algorithms
From Paolo′s slides:
Lucene′s Disjunctive Search Algorithm
- described in http://lucene.sf.net/papers/riao97.ps
- since all postings must be processed
- goal is to minimize per-posting computation
- merges postings through a fixed-size array of accumulator buckets
- performs boolean logic with bit masks
- scales well with large queries
[ draw a diagram to illustrate? ]
Lucene′s Conjunctive Search Algorithm
From Paolo′s slides:
Algorithm
- use linked list of pointers to doc list
- initially sorted by doc
- loop
- if all are at same doc, record hit
- skip first to-or-past last and move to end of list
- if all are at same doc, record hit
Scoring
From Paolo′s slides:
Is very much like Lucene′s Similarity.
数据挖掘研究院
Lucene′s Phrase Scoring
- approximate phrase IDF with sum of terms
- compute actual tf of phrase
- slop penalizes slight mismatches by edit-distance
Thanks!
And there′s lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.

