Wednesday, May 28, 2014

Using system disk cache for speeding up the indexing with SOLR

Benchmarking is a rather hard subject in software development, especially in sandboxed environments like the JVM with its "uncontrolled" garbage collection. Still, some tasks are more I/O-heavy, such as indexing XML files into Apache Solr, and for those you have more control at the system level, which makes for better benchmarking.

So what about batch indexing? There are ways to speed it up purely on the SOLR side.

This post shows a possible way of speeding up indexing purely at the system level, and assumes Linux as the operating system.

The benchmarking setup that I had is the following:

Apache SOLR 4.3.1
Ubuntu with 16G RAM
Committing via softCommit feature

What I set out to do is to play around with the system disk cache. One of the recommendations for speeding up search is to cat the index files into the cache, using the command:

cat an_index_file > /dev/null

Then the index is read from the disk cache buffers and is faster than reading it cold.
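To warm a whole index directory rather than a single file, the same trick can be wrapped in a small function. This is only a sketch: `warm_cache` is a name made up for illustration, and the path in the usage comment is a placeholder, not the layout of my setup.

```shell
#!/bin/sh
# Hypothetical helper: read every regular file under a directory once,
# so that its pages end up in the kernel page cache.
warm_cache() {
    # The file contents are discarded; only the side effect
    # (the kernel caching the pages it read) matters here.
    find "$1" -type f -exec cat {} + > /dev/null
}

# Usage (placeholder path, adjust to your own Solr core):
#   warm_cache /path/to/solr/core/data/index
```

Whether the whole index fits into the cache depends on how much free RAM the box has; anything beyond that gets evicted again.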

What about bulk indexing of XML files into Solr? We could cat the XML files to be indexed into the disk cache and possibly speed up the indexing. The following figures are not statistically significant, nor was the test done on a large number of XML files, but they do show the trend:

With warmed up disk cache:
real    1m27.604s
user    0m2.220s
sys    0m2.860s

After dropping the file cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches

real    1m30.285s
user    0m2.148s
sys    0m3.700s

Again, hot cache:
real    1m27.924s
user    0m2.264s
sys    0m3.068s

Again, after dropping the file cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches

real    1m32.791s
user    0m2.204s
sys    0m3.104s

The figures above show a consistent trend: having the files cached speeds up the indexing by about 3-5 seconds of wall-clock time (out of roughly 90 seconds) for just 420 XML files.
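The 3-5 second figure is simply the cold-cache `real` time minus the hot-cache one for each pair of runs. A small sketch of that arithmetic, with `to_seconds` and `delta` as helper names made up for illustration:

```shell
#!/bin/sh
# Convert a `time`-style value such as "1m27.604s" into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print cold minus hot, in seconds.
delta() {
    cold=$(to_seconds "$1")
    hot=$(to_seconds "$2")
    awk -v c="$cold" -v h="$hot" 'BEGIN { printf "%.3f\n", c - h }'
}

delta 1m30.285s 1m27.604s   # first pair of runs  -> 2.681
delta 1m32.791s 1m27.924s   # second pair of runs -> 4.867
```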

Coupled with ways of increasing the throughput on the SOLR side, this approach could save some more seconds / minutes / hours in batch indexing.
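For anyone who wants to repeat the hot/cold comparison, the runs could be scripted roughly like this. It is a sketch under assumptions: `index_cmd` and `XML_DIR` stand in for whatever command and directory you use to post the XML files to Solr, and the function names are made up. The timer only has one-second resolution, so prefix the command with `time` instead if you want figures like the ones above.

```shell
#!/bin/sh
# Flush dirty pages to disk, then drop page cache, dentries and inodes.
drop_cache() {
    sync
    echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null
}

# Read every XML file once so that it sits in the page cache.
warm_xml() {
    find "$1" -type f -name '*.xml' -exec cat {} + > /dev/null
}

# Run a command and print the elapsed wall-clock time in whole seconds.
run_timed() {
    start=$(date +%s)
    "$@" > /dev/null
    end=$(date +%s)
    echo $((end - start))
}

# Usage (index_cmd and XML_DIR are hypothetical placeholders):
#   drop_cache;          run_timed index_cmd "$XML_DIR"   # cold cache
#   warm_xml "$XML_DIR"; run_timed index_cmd "$XML_DIR"   # hot cache
```

Alternating cold and hot runs several times, as in the figures above, helps to separate the cache effect from ordinary run-to-run noise.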


enashed said...

Hi, just to get an idea about your experiment with speeding up the indexing, I have a question here...

For those 420 xml files, what was the average file size, or number of documents per file?


Dmitry Kan said...

The average file size is 2.68 MB.
The average number of docs per file is 1.1k.