Tuesday, September 23, 2014

Indexing documents in Apache Solr using custom update chain and solrj api

This post focuses on how to target custom update chain using solrj api and index your documents in Apache Solr. The reason for this post existence is because I have spent more than one hour figuring this out. This warrants a blog post (hopefully for other's benefit as well).


Suppose that you have a default update chain, that is executed in every day situations, i.e. for majority of input documents:

<updaterequestprocessorchain default="true" name="everydaychain">
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />

In some specific cases you would like to execute a slightly modified update chain, in this case with a factory that drops duplicate values from document fields. For that purpose you have configured a custom update chain:

<updaterequestprocessorchain name="customchain">
<processor class="solr.UniqFieldsUpdateProcessorFactory" >
<lst name="fields">
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />

Your update request handler looks like this:

<requesthandler class="solr.UpdateRequestHandler" name="/update">
<lst name="defaults">
<str name="update.chain">everydaychain</str>

Every time you hit /update from your solrj backed code, you'll execute document indexing using the "everydaychain".


Using solrj, index documents against the custom update chain.


First before diving into the solution, I'll show the code that you use for normal indexing process from java, i.e. with every:

HttpSolrServer httpSolrServer = null;
try {
     httpSolrServer = new HttpSolrServer("http://localhost:8983/solr/core0");
     SolrInputDocument sid = new SolrInputDocument();
     sid.addField("field1", "value1");

     httpSolrServer.commit(); // hard commit; could be soft too
} catch (Exception e) {
     if (httpSolrServer != null) {

So far so good. Next turning to indexing with custom update chain. This part of non-obvious from the point of view of solrj api design: having an instance of SolrInputDocument, how would one access a custom update chain? You may notice, how the update chain is defined in the update request handler of your solrconfig.xml. It uses the update.chain parameter name. Luckily, this is an http parameter, that can be supplied on the /update endpoint. Figuring this out via http client of the httpSolrServer object led to nowhere.

Turns out, you can use UpdateRequest class instead. The object has got a nice setParam() method that lets you set a value for the update.chain parameter:

HttpSolrServer httpSolrServer = null;
        try {
            httpSolrServer = new HttpSolrServer(updateURL);

            SolrInputDocument sid = new SolrInputDocument();
            // dummy field
            sid.addField("field1", "value1");

            UpdateRequest updateRequest = new UpdateRequest();
            updateRequest.setParam("update.chain", "customchain");

            UpdateResponse updateResponse = updateRequest.process(httpSolrServer);
            if (updateResponse.getStatus() == 200) {
                log.info("Successfully added a document");
            } else {
                log.info("Adding document failed, status code=" + updateResponse.getStatus());
        } catch (Exception e) {
            if (httpSolrServer != null) {
                log.info("Released connection to the Solr server");


Executing the second code will trigger the LogUpdateProcessor to output the following line in the solr logs:

org.apache.solr.update.processor.LogUpdateProcessor  –
   [core0] webapp=/solr path=/update params={wt=javabin&

That's it for today. Happy indexing!

Wednesday, September 17, 2014

Exporting Lucene index to xml with Luke

Luke is the open source Lucene toolbox originally written by Andrzej Bialecki and currently maintained by yours truly. The tool allows you to introspect into your solr / lucene index, check it for health, fix problems, verify field tokens and even experiment with scoring or read the index from HDFS.

In this post I would like to illustrate one particular luke's feature, that allows you to dump index into an xml for external processing.


Extract indexed tokens from a field to a file for further analysis outside luke.


Indexing data

In order to extract tokens you need to index your field with term vectors configured. Usually, this also means, that you need to configure positions and offsets.

If you are indexing using Apache Solr, you would configure the following on your field:

<field indexed="true" name="Contents" omitnorms="false" stored="true" termoffsets="true" termpositions="true" termvectors="true" type="text">

With this line you make sure you field is going to store its contents, not only index; it will also store the term vectors, i.e. a term, its positions and offsets in the token stream.


Extracting index terms

One way to view the indexed tokens with luke is to search / list documents, select the field with term vectors enabled and click TV button (or right-click and choose "Field's Term Vector").

If you would like to extract this data into an external file, there is a way currently to accomplish this via menu Tools->Export index to XML:

In this case I have selected the docid 94724 (note, that this is lucene's internal doc id, not solr application level document id!), that is visible when viewing a particular document in luke. This dumps a document into the xml file, including the fields in the schema and each field's contents. In particular, this will dump the term vectors (if present) of a field, in my case:

<field flags="Idfp--SV-Nnum--------" name="Contents">
<val>CENTURY TEXT.</val>
<t freq="1" offsets="0-7" positions="0" text="centuri" />
<t freq="1" offsets="0-7" positions="0" text="centuryä" />
<t freq="1" offsets="8-12" positions="1" text="text" />
<t freq="1" offsets="8-12" positions="1" text="textä" />