Saturday, September 7, 2013

Solr usability contest: make Apache Solr even cooler!

In August I took part in the Solr Usability Contest run by Alexandre Rafalovitch, author of the Apache Solr for Indexing Data How-to book from Packt.

As I have already told Alexandre (@arafalov), launching a contest like this was a great idea. While the Solr / Lucene mailing lists serve as a direct way of solving particular problems, and the Apache JIRA is the place for feature requests and bug reports, it is great to sometimes take a step back and look at the larger picture of features, limitations, possible improvements and so on.

The following three are the winning suggestions of yours truly:


On atomic updates

It so happened that we had been evaluating the new, sexy-sounding atomic updates feature and found out that it wasn't easy to enable. To actually make use of the feature, we would essentially have needed to make *all* fields stored. For the past several years I have been "fighting" against storing fields unless really necessary. Avoiding stored fields where possible is one of the usual performance suggestions, since it helps avoid extra disk seeks, and not everyone has SSDs installed on their Solr servers.

Having written some Lucene-level code some time ago, I could make an educated guess as to why all fields need to be stored for atomic updates. Essentially, upon an atomic update Solr has to retrieve the existing document with all its source values, update a value (or values) and push the document back into persistent storage (the index). To my taste this describes a bulk update.

There is one major advantage of atomic updates (given that all fields are stored): savings on network traffic. Indeed, instead of submitting an entire document with a couple of updated fields, you can send only the fields with new values along with the document id. There are other cool features, like setting a previously non-existent field on a document or deleting an existing field. All of these will surely make the atomic updates feature appealing to some folks. You will find real examples of how to use atomic updates in Alexandre's book. So go and get your copy now.
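To make the traffic argument concrete, here is a minimal Python sketch of what an atomic update payload could look like next to a full document resubmission. It only builds the JSON and compares sizes; nothing is sent to Solr. The helper `atomic_update`, the field names and the sample document are all made up for illustration, and the unique key field is assumed to be `id`. The `set` and `add` modifiers, and setting a field to null to remove it, are the atomic update syntax of Solr 4.

```python
import json


def atomic_update(doc_id, set_fields=None, add_fields=None, remove_fields=None):
    """Build one atomic-update document (hypothetical helper, not a Solr API)."""
    doc = {"id": doc_id}  # assumes the schema's unique key field is "id"
    for name, value in (set_fields or {}).items():
        doc[name] = {"set": value}   # replace the field's value
    for name, value in (add_fields or {}).items():
        doc[name] = {"add": value}   # append to a multi-valued field
    for name in (remove_fields or []):
        doc[name] = {"set": None}    # setting a field to null removes it
    return doc


# A made-up document, as it would be resubmitted in full after a price change.
full_doc = {
    "id": "doc-1",
    "title": "Solr usability contest",
    "body": "lorem ipsum " * 200,
    "price": 12,
}

# The same change expressed as an atomic update: only the touched fields travel.
update = atomic_update(
    "doc-1",
    set_fields={"price": 12},
    add_fields={"tags": "solr4"},
    remove_fields=["obsolete_flag"],
)

full_payload = json.dumps([full_doc])
atomic_payload = json.dumps([update])

print(atomic_payload)
print(len(full_payload), "bytes for the full document vs",
      len(atomic_payload), "bytes for the atomic update")
```

Either payload would be posted to the core's `/update` handler with a JSON content type. The saving grows with the size of the untouched fields, which is exactly the part Solr has to read back from stored values on the server side.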

In the course of reindexing our data in Solr4 we have discovered a lot of improvements, one of them being index compression. The lossless compression algorithm used in Lucene / Solr 4 is LZ4, which has made the index super compact. In our use case the index shrank from 100G in Solr3 to 20G in Solr4, which is simply amazing. By the way, the algorithm has the property of fast decompression (a fast decoder), which makes it an ideal fit for online use.

In light of the compactness of the index we are still considering evaluating the atomic updates feature, from just three perspectives:
  • traffic savings
  • speed of processing
  • index size increase vs storing only necessary fields

On interactivity of Solr dashboard

As we started evaluating the goodies of Solr4, I was positively surprised by how usable and eye-catching the Solr dashboard (admin) has become. Usability comes first (this is a usability contest, after all) and, oh yes, the dashboard has become usable for the first time. In the Solr 1.4.1 and 3.4 days we merely consulted the cache statistics page and the analysis page occasionally. In Solr4 one can now administer cores directly from the dashboard, optimize indices, study the frequency characteristics of text data and so on. This is of course on top of the features already mentioned, like field analysis and monitoring of the cache stats.

But... something is still missing. We are running several shards with frontend Solrs, and for us it has always been a bit of a pain to monitor our cluster. We intentionally do not use SolrCloud because of our requirements for logical sharding. Some time ago I blogged about Solr on FitNesse, which helped us see the state of the cluster with just one click. We have also set up RAM monitoring with Graphite, but wait, all of these tools are external to Solr. It would be really great to be able to integrate some of them directly into the Solr dashboard. So we hope this will move in the direction of "plug-n-play" interfaces that would allow implementing plugins for the Solr dashboard. In the meantime, good ol' jvisualvm is the tool that helps us monitor a heavy shard during soft-commit runs.
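As an illustration of the kind of external "one click" cluster check described above, here is a rough Python sketch that asks every shard's ping handler whether it is alive. This is not our FitNesse setup, just a sketch under assumptions: the shard URLs are made up, each core is assumed to have the `/admin/ping` handler configured (as in the Solr example config), and the response is assumed to be requested as JSON via `wt=json`. The demo at the bottom uses a stubbed fetch function, so it runs without a cluster.

```python
import json
from urllib.request import urlopen

# Hypothetical shard base URLs; replace with your own cores.
SHARDS = [
    "http://shard1.example.com:8983/solr/core0",
    "http://shard2.example.com:8983/solr/core0",
]


def ping_url(base_url):
    """URL of the ping handler for one core (assumes /admin/ping is enabled)."""
    return base_url.rstrip("/") + "/admin/ping?wt=json"


def parse_ping(body):
    """Return True if a JSON ping response reports status OK."""
    try:
        data = json.loads(body)
    except ValueError:
        return False
    return isinstance(data, dict) and data.get("status") == "OK"


def check_shards(shards, fetch=None, timeout=5):
    """Map each shard URL to True (alive) or False (down or unreadable)."""
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=timeout) as resp:
                return resp.read().decode("utf-8")
    report = {}
    for base in shards:
        try:
            report[base] = parse_ping(fetch(ping_url(base)))
        except OSError:  # connection refused, timeout, HTTP error
            report[base] = False
    return report


def fake_fetch(url):
    """Stub for the demo: the first shard answers OK, the second is unreachable."""
    if "shard1" in url:
        return '{"responseHeader": {"status": 0}, "status": "OK"}'
    raise OSError("connection refused")


demo_report = check_shards(SHARDS, fetch=fake_fetch)
for shard, alive in demo_report.items():
    print(shard, "OK" if alive else "DOWN")
```

Calling `check_shards(SHARDS)` without the `fetch` argument would hit the real ping handlers. A dashboard plugin could show the same report without leaving Solr, which is the point of the suggestion.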

On scripting capability

I also dared to fantasize about what could open Solr up to a wider audience, especially to people who do not dream of reading and changing the Solr source code. This could be enabled with a scripting capability. By this I mean a way of hacking into Solr via external interfaces in the language that fits your task and skillset best (Ruby or Scala or some other JVM-friendly language, or perhaps something outside the JVM family altogether). The best thing this would offer is an opportunity to experiment quickly with Solr search: changing the runtime order of analyzers or search components, affecting scoring, introducing advertisement entries, calculating some analytics, refining facets, etc. While some of these may sound far-fetched, the feature in general could make it possible to change core Solr behaviour without hacking on the source code and going through heavy-duty recompilation (although personally I would recommend diving into that anyway).


Concluding remarks

I would like to conclude that Solr4 has brought lots of compelling features and improvements (the extremely useful soft-commit feature, for example), and we are happy to see this blazingly fast search platform evolve so quickly. In these three usability suggestions I have tried to summarize what could be done to make the platform even more compelling and cool.

yours truly,
