Wednesday, November 13, 2013

Lucene Revolution EU 2013 in Dublin: Day 2

This post covers my impressions of day 2 of the Lucene Revolution conference held in Dublin, Ireland, in 2013. Day 1 is described here.

The second day was slightly tougher to start after the previous evening in a bar with committers and conference participants, but a good portion of full Irish breakfast with coffee helped a lot and set me up for the day.

I decided to postpone coming to the conference venue (the Aviva Stadium) until the technical presentations started, but I was still in early enough to make it to the panel with Paul Doscher. Paul holds himself quite professionally and talks in a convincing manner. He didn't seem to be a BS CEO, and he tried his best to address the question at hand as deeply and at as low a level as possible. Maybe the EU audience was a bit sleepy, or not too comfortable talking otherwise (geeks, geeks), which didn't spawn many discussions or questions. Two of Paul's questions come to mind:

1. Have you been doing search / big data for the past 4-5 years?
2. Do you think search / big data is a good area to stay in until the end of your IT career?

There were more hands for 1 than for 2, which in Paul's mind was a bit strange, as he strongly believes search and big data is the area to stay in. There is so much complexity involved, and so much data has been accumulated by enterprises, that there will be strong demand for specialists of this sort. While I would second this, I can tell that the consumers of enterprise data will have higher standards for quality, because they do business with it. On the other hand, in my opinion, if you take so-called mass media (microblogs, forums, user sentiment and so on), there is also big data produced at high speed, but the user base on average is not as demanding. It is again businesses that will likely be interested in sifting some signal from the mass-media noise.

Coming back to the technical presentations.

The first technical presentation I visited was "Lucene Search Essentials: Scorers, Collectors and Custom Queries" by +Mikhail Khludnev. I have known Mikhail for more than a year, from sitting on the same Solr user mailing list and commenting on SOLR JIRA issues.

His team has been doing some really low-level work with Solr / Lucene code for e-commerce, and this presentation pretty much summarized his knowledge. I'd say that while the presentation is accessible to an unprepared audience, it goes quite deep into analyzing the complexities of certain search algorithms and lays the groundwork for more conscious search engineering. The slides.
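To give a flavor of the kind of extension points such low-level work touches, here is a minimal custom Collector for Lucene 4.x. This is my own sketch, not anything from Mikhail's slides: it simply gathers the global doc ids of every hit and ignores scores entirely.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Collects the global doc ids of all matching documents, ignoring scores.
public class DocIdCollector extends Collector {
  private final List<Integer> docIds = new ArrayList<Integer>();
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    // scores are not needed for plain id collection
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    docBase = context.docBase;  // remap per-segment ids to global ids
  }

  @Override
  public void collect(int doc) throws IOException {
    docIds.add(docBase + doc);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true;  // we don't care what order the docs arrive in
  }

  public List<Integer> getDocIds() {
    return docIds;
  }
}
```

You would pass it to a searcher as searcher.search(query, new DocIdCollector()), and it is exactly at this level, Scorers feeding Collectors, that the complexity analysis in the talk applies.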

In the next session, Isabel Drost-Fromm introduced the audience to the world of text classification and Apache Mahout. Isabel is a co-founder of Berlin Buzzwords, another exciting conference on search and related technologies, and a Mahout committer.

Before the session, Isabel mentioned to me that one of the goals of her presentation was to make people aware of good existing systems, like Lucene with its tokenizers, so they can focus on solving the particular task at hand instead of reinventing the wheel (a tiny illustration follows below). It was a good intro-level presentation, and after trying out Mahout, I must say you can get going with the examples quite easily. Isabel also mentioned that becoming a committer in the Apache community is easier than one might think: fixing or writing new documentation and attaching small patches (not necessarily code) might be enough. But removing yourself from committer status is much harder, because, if I sensed it right, Apache has a rather wide network for reminding you to commit. So you might find yourself coding over the weekend for open source, which, in the long run, may be a good career boost for you.
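For the record, here is roughly what "reuse Lucene's tokenizers" looks like in code; a minimal sketch assuming Lucene 4.5, with the field name and sample text being mine:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizeExample {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
    TokenStream ts = analyzer.tokenStream("body",
        new StringReader("Don't reinvent the wheel: reuse Lucene's analyzers."));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term.toString());  // lowercased, stopword-filtered tokens
    }
    ts.end();
    ts.close();
  }
}
```

A handful of lines buys you battle-tested tokenization, lowercasing and stopword handling, which is exactly the point Isabel was making.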

In the lobby, some folks wanted to chat about the luke project, which I have been contributing to lately. I finally added its creator, Andrzej Bialecki, as a contributor. If you enjoy using luke and want to make it better, feel free to join and start contributing yummy patches!

The next cool presentation was given by Alan Woodward and Charlie Hull (who published a nice writeup on the conference here and here). Their collaborative talk focused on turning search upside down: queries become documents and documents become queries (honestly, I would like to see how this is handled in the code in terms of naming the variables and functions). Charlie briefly mentioned that the code will be open-sourced. The specific use case for this is a bank of queries, or a taxonomy of queries (for instance, for patent examiners), that needs to be executed against each incoming document. If there is a hit, the document gets the query id as a kind of concept tag. What I liked in the presentation is the innovative look at the problem: given a set of queries, what is an efficient representation for them in a Lucene index? For example, a SpanNear query can be approximated by an AND over its operands; the AND query can then be quickly and cheaply checked for just one operand, and if that operand is not present, the entire query cannot hit the document. They also used a position-aware Scorer that can iterate over intervals of positions. I did similar work in the past, but used the not-so-efficient Highlighter component, so I'm glad there is similar work done in an efficient manner. UPD: the tool, called Luwak, has been open-sourced!
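To make the queries-as-documents idea concrete, here is the brute-force baseline their approach improves upon; a naive sketch of my own (not their code), using Lucene's MemoryIndex to hold one incoming document and running every registered query against it. The whole point of the talk is the pre-filtering that avoids executing all of the queries, which this sketch deliberately omits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;

// Naive "reverse search": check a single incoming document against
// every registered query, using a throwaway in-memory index.
public class NaiveMonitor {
  public static List<String> match(Map<String, Query> registeredQueries,
                                   String documentText, Analyzer analyzer) {
    MemoryIndex docIndex = new MemoryIndex();
    docIndex.addField("text", documentText, analyzer);

    List<String> matchingQueryIds = new ArrayList<String>();
    for (Map.Entry<String, Query> entry : registeredQueries.entrySet()) {
      if (docIndex.search(entry.getValue()) > 0.0f) {
        matchingQueryIds.add(entry.getKey());  // the query id becomes a concept tag
      }
    }
    return matchingQueryIds;
  }
}
```

With thousands of registered queries, this loop is exactly what kills you, hence the trick of indexing the queries themselves and running only the candidates whose cheap AND approximation matched.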

In the next presentation, Adrien Grand walked the audience through the design decisions behind the data structures in the Lucene index, compared Lucene indices to RDBMS indices (I tried to summarize this here) and talked about things at the hardware / OS level, like using SSD disks and leaving enough RAM for the OS cache buffers. During the Q&A, I suggested that in order to increase indexing throughput, one could set the merge factor to thousands during bulk indexing, and upon completion set it back to 10 or even 1, so that the search side stays efficient; indexing then does not need to merge as often. Adrien seconded this idea and also proposed using TieredMergePolicy (the default in Solr), where you can control the merging of segments by the number of deleted docs. This is needed both for efficiency and for timely space reclaiming. Even though I missed the closing remarks, since this presentation took more time than expected, I was happy to be there, as the talk was quite thought-provoking.
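Here is roughly what the bulk-indexing trick combined with Adrien's TieredMergePolicy suggestion could look like in code; a sketch assuming Lucene 4.5, with the parameter values and index path picked for illustration only:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BulkIndexTuning {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/path/to/index"));

    // Tolerate many segments while bulk indexing, so merges happen rarely.
    TieredMergePolicy mp = new TieredMergePolicy();
    mp.setSegmentsPerTier(50.0);
    mp.setMaxMergeAtOnce(50);        // merge more segments per merge when we do merge
    mp.setReclaimDeletesWeight(3.0); // favor merges that purge deleted docs

    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45,
        new StandardAnalyzer(Version.LUCENE_45));
    iwc.setMergePolicy(mp);

    IndexWriter writer = new IndexWriter(dir, iwc);
    // ... bulk indexing goes here ...

    // Collapse to a single segment at the end, so searching stays fast.
    writer.forceMerge(1);
    writer.close();
  }
}
```

The trade-off is a one-time expensive forceMerge at the end instead of many small merges stealing I/O throughout the indexing run.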

These were two intense days, both in terms of technical presentations and sightseeing. I only managed to explore Trinity College and the areas around it, but doing this on foot was quite a good sporting activity.

The apogee of the day, and perhaps of the conference, was meeting the entire Sematext team while roaming around the shopping streets near Trinity College! We had a good time together in a local bar. It was fun, thanks guys!

This pretty much wraps up my 2-post impressions and notes on the Lucene Revolution conference 2013 in Dublin.

If you enjoyed this writeup, there is a delightful one by +Mike Sokolov:
http://blog.safariflow.com/2013/11/25/this-revolution-will-be-televised/

And read the Flax writeups too (links above in the post).

Happy searching,

@dmitrykan
