Wednesday, September 17, 2014

Exporting Lucene index to xml with Luke

Luke is the open source Lucene toolbox originally written by Andrzej Bialecki and currently maintained by yours truly. The tool allows you to introspect into your solr / lucene index, check it for health, fix problems, verify field tokens and even experiment with scoring or read the index from HDFS.

In this post I would like to illustrate one particular luke's feature, that allows you to dump index into an xml for external processing.

Task

Extract indexed tokens from a field to a file for further analysis outside luke.

 

Indexing data

In order to extract tokens you need to index your field with term vectors configured. Usually, this also means, that you need to configure positions and offsets.

If you are indexing using Apache Solr, you would configure the following on your field:

<field indexed="true" name="Contents" omitnorms="false" stored="true" termoffsets="true" termpositions="true" termvectors="true" type="text">

With this line you make sure you field is going to store its contents, not only index; it will also store the term vectors, i.e. a term, its positions and offsets in the token stream.

 

Extracting index terms

One way to view the indexed tokens with luke is to search / list documents, select the field with term vectors enabled and click TV button (or right-click and choose "Field's Term Vector").




If you would like to extract this data into an external file, there is a way currently to accomplish this via menu Tools->Export index to XML:



In this case I have selected the docid 94724 (note, that this is lucene's internal doc id, not solr application level document id!), that is visible when viewing a particular document in luke. This dumps a document into the xml file, including the fields in the schema and each field's contents. In particular, this will dump the term vectors (if present) of a field, in my case:

<field flags="Idfp--SV-Nnum--------" name="Contents">
<val>CENTURY TEXT.</val>
<tv>
<t freq="1" offsets="0-7" positions="0" text="centuri" />
<t freq="1" offsets="0-7" positions="0" text="centuryä" />
<t freq="1" offsets="8-12" positions="1" text="text" />
<t freq="1" offsets="8-12" positions="1" text="textä" />
</tv>
</field>

No comments: