Benchmarking is a rather hard part of software development, especially in sandboxed runtime environments such as the JVM with its "uncontrolled" garbage collection. Still, some tasks are more I/O-heavy, such as indexing XML files into Apache Solr, and there you can control more at the system level and benchmark more reliably.
This post shows one possible way to speed up indexing purely at the system level. It assumes Linux as the operating system.
My benchmarking setup was the following:
Apache Solr 4.3.1
Ubuntu with 16 GB RAM
Commits via the softCommit feature
What I set out to do was to experiment with the system disk cache. One common recommendation for speeding up search is to cat the index files into the cache, using the command:
cat an_index_file > /dev/null
The index is then read from the disk cache buffers, which is faster than reading it cold from disk.
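For an index that consists of many files, the single cat above can be wrapped in a small helper. The following is only a minimal sketch and not part of the original setup: the function name warm_cache and the example path are hypothetical, and it assumes the files fit into free RAM (otherwise the kernel will evict pages again).

```shell
# warm_cache DIR (hypothetical helper): read every regular file under DIR
# once so the kernel can keep its pages in the disk cache.
# Prints the total number of bytes read, which can be compared with free RAM.
warm_cache() {
    # find lists regular files, cat reads them, wc -c counts the bytes;
    # tr strips the padding some wc implementations add.
    find "$1" -type f -exec cat {} + | wc -c | tr -d ' '
}

# Example call (the path is an assumption, adjust to your installation):
# warm_cache /path/to/solr/data/index
```

Printing the byte count is a design choice of this sketch: it gives a quick check of whether the data can plausibly stay cached on a machine with a given amount of memory.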
What about bulk indexing XML files into Solr? We could cat the XML files to be indexed into the disk cache and possibly speed up the indexing. The following figures are not statistically significant, and the test was not run on a large number of XML files, but they do show the trend. First, with the files already in the cache (hot cache):
real 1m27.604s
user 0m2.220s
sys 0m2.860s
After dropping the file cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches
real 1m30.285s
user 0m2.148s
sys 0m3.700s
Again, hot cache:
real 1m27.924s
user 0m2.264s
sys 0m3.068s
Again, after dropping the file cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches
real 1m32.791s
user 0m2.204s
sys 0m3.104s
The figures above show fairly clearly that having the files cached speeds up the indexing by roughly 3 to 5 seconds (2.7 s and 4.9 s of wall-clock time in the two pairs of runs) for just 420 XML files.
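The comparison can be reproduced from the four "real" values reported above. This is a small sketch of the arithmetic only; the helper names to_seconds and mean are hypothetical and the timings are copied from the runs in this post.

```shell
# to_seconds: convert a `time`-style value such as 1m27.604s to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# mean: arithmetic mean of the arguments (values already in seconds).
mean() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.3f\n", sum / NR }'
}

# "real" times of the two hot-cache and the two cold-cache runs.
hot=$(mean "$(to_seconds 1m27.604s)" "$(to_seconds 1m27.924s)")
cold=$(mean "$(to_seconds 1m30.285s)" "$(to_seconds 1m32.791s)")

# Average wall-clock time saved by the hot cache.
saved=$(awk -v h="$hot" -v c="$cold" 'BEGIN { printf "%.3f\n", c - h }')

echo "hot mean:  ${hot}s"
echo "cold mean: ${cold}s"
echo "saved:     ${saved}s"
```

For the four runs this gives means of 87.764 s (hot) and 91.538 s (cold), so about 3.8 s or roughly 4 percent of the cold run time. With only two runs per case this is a trend, not a statistically sound result.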
Coupled with ways of increasing throughput on the Solr side, this approach could save further seconds, minutes or even hours in batch indexing.
2 comments:
Hi, just to get an idea about your experiment with speeding up the indexing, I have a question here...
For those 420 xml files, what was the average file size, or number of documents per file?
Thanks
The average file size is 2.68 MB.
The average number of docs per file is 1.1k.