Tuesday, September 23, 2014

Indexing documents in Apache Solr using a custom update chain and the SolrJ API

This post shows how to target a custom update chain through the SolrJ API when indexing your documents in Apache Solr. It exists because I spent more than an hour figuring this out, which warrants a blog post (hopefully for others' benefit as well).

Setup


Suppose you have a default update chain that is executed in everyday situations, i.e. for the majority of input documents:

<updateRequestProcessorChain default="true" name="everydaychain">
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

In some specific cases you would like to execute a slightly modified update chain, in this case with a factory that drops duplicate values from document fields. For that purpose you have configured a custom update chain:

<updateRequestProcessorChain name="customchain">
  <processor class="solr.UniqFieldsUpdateProcessorFactory">
    <lst name="fields">
      <str>field1</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
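
To make the effect of the uniq-fields step concrete, here is a minimal plain-Java sketch of what "dropping duplicate values from a field" means for a multi-valued field1. It is an illustration only, not Solr's implementation: the class and method names are hypothetical, and it assumes the first occurrence of each value is kept in its original position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Illustration only: mimics the effect of removing duplicate values from a
// multi-valued field. Hypothetical names; this is not Solr's own code.
class UniqFieldsSketch {

    // Returns the values with duplicates removed, keeping first-seen order
    // (an assumption made for this sketch).
    static List<Object> dropDuplicates(List<Object> values) {
        return new ArrayList<>(new LinkedHashSet<>(values));
    }

    public static void main(String[] args) {
        List<Object> field1 = Arrays.<Object>asList("a", "b", "a", "c", "b");
        System.out.println(dropDuplicates(field1)); // prints [a, b, c]
    }
}
```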

Your update request handler looks like this:

<requestHandler class="solr.UpdateRequestHandler" name="/update">
  <lst name="defaults">
    <str name="update.chain">everydaychain</str>
  </lst>
</requestHandler>

Every time you hit /update from your SolrJ-backed code, your documents are indexed through the "everydaychain".

Task


Using SolrJ, index documents against the custom update chain.

Solution


Before diving into the solution, here is the code you would use for the normal indexing process from Java, i.e. with everydaychain:

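import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer httpSolrServer = null;
try {
     httpSolrServer = new HttpSolrServer("http://localhost:8983/solr/core0");
     SolrInputDocument sid = new SolrInputDocument();
     sid.addField("field1", "value1");
     httpSolrServer.add(sid); // goes through the default chain, "everydaychain"

     httpSolrServer.commit(); // hard commit; could be soft too
} catch (Exception e) {
     e.printStackTrace(); // do not swallow indexing errors silently
} finally {
     // release the connection whether or not indexing succeeded
     if (httpSolrServer != null) {
         httpSolrServer.shutdown();
     }
}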

So far so good. Now for indexing with the custom update chain. This part is non-obvious from the point of view of the SolrJ API design: given an instance of SolrInputDocument, how does one reach a custom update chain? Notice how the update chain is defined in the update request handler in your solrconfig.xml: it uses the update.chain parameter name. Luckily, this is an HTTP parameter that can be supplied on the /update endpoint. Trying to set it via the HTTP client of the httpSolrServer object led nowhere.

It turns out you can use the UpdateRequest class instead. It has a handy setParam() method that lets you set a value for the update.chain parameter:

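import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer httpSolrServer = null;
try {
    httpSolrServer = new HttpSolrServer(updateURL);

    SolrInputDocument sid = new SolrInputDocument();
    // dummy field
    sid.addField("field1", "value1");

    UpdateRequest updateRequest = new UpdateRequest();
    updateRequest.setCommitWithin(2000); // commit within 2 seconds
    // override the handler's default chain for this request only
    updateRequest.setParam("update.chain", "customchain");
    updateRequest.add(sid);

    UpdateResponse updateResponse = updateRequest.process(httpSolrServer);
    // getStatus() is the status from the Solr response header, not the HTTP code; 0 means success
    if (updateResponse.getStatus() == 0) {
        log.info("Successfully added a document");
    } else {
        log.info("Adding document failed, status code=" + updateResponse.getStatus());
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    // release the connection whether or not indexing succeeded
    if (httpSolrServer != null) {
        httpSolrServer.shutdown();
        log.info("Released connection to the Solr server");
    }
}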

Running the second snippet will make the LogUpdateProcessor output the following line in the Solr logs:

org.apache.solr.update.processor.LogUpdateProcessor  –
   [core0] webapp=/solr path=/update params={wt=javabin&
      version=2&update.chain=customchain}
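
Since update.chain travels as an ordinary request parameter, as the params in the log line show, the same chain can in principle be selected by any HTTP client that appends it to the /update URL. The plain-Java sketch below only builds such a URL. The class and method names are hypothetical, and the core URL is just the one used in the first example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: builds an /update URL that selects a specific update chain.
// Hypothetical helper for illustration; not part of SolrJ.
class UpdateChainUrl {

    // Appends /update and a URL-encoded update.chain parameter to the core URL.
    static String withUpdateChain(String coreUrl, String chainName) {
        // tolerate a trailing slash on the core URL
        String base = coreUrl.endsWith("/")
                ? coreUrl.substring(0, coreUrl.length() - 1)
                : coreUrl;
        try {
            return base + "/update?update.chain=" + URLEncoder.encode(chainName, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withUpdateChain("http://localhost:8983/solr/core0", "customchain"));
        // prints http://localhost:8983/solr/core0/update?update.chain=customchain
    }
}
```

With SolrJ you do not need to build this URL yourself; setParam() on the UpdateRequest adds the parameter for you.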

That's it for today. Happy indexing!

Comments:

Unknown said...

I wonder if it is necessary to specify the default chain in the requestHandler while setting default="true" is enough for the everydaychain? In such case, you need to remove the default="true" from the customchain.

Dmitry Kan said...

Hossam, thanks for your comment. You're right, defaulting both chains to "true" is unnecessary. I will fix in the text.
