Sunday, November 10, 2013

Lucene Revolution EU 2013 in Dublin: Day 1

The Lucene Revolution conference was held in Europe for the first time. Dublin, Ireland, was selected as the conference city, and I must say it was a great choice. The city itself is quite compact and friendly, full of places to see (Trinity College, Temple Bar, O'Connell Street and many more) as well as places to relax, with all the pubs, bars and shopping streets.



It was my first time in Dublin and my general impression of the city is very good.

Let's proceed to the conference!

Day 1


The keynote was presented by Michael Busch of Twitter, who talked about their data structure for holding an index in such a way that posting lists are automatically sorted and can be read backwards without reopening the index (a costly operation). This design also supports early termination naturally. Everything is stored in RAM and they never commit.
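To make the idea concrete, here is a minimal sketch of an append-only posting list that is read from the newest entry backwards and stops as soon as enough hits are collected. This is not Twitter's code: the class and method names are hypothetical, and it assumes document IDs are assigned in arrival order and that a single thread does the writing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical append-only posting list for one term (illustration only). */
class AppendOnlyPostings {
    private int[] docIds = new int[16];
    private int size = 0;

    /** Doc IDs arrive in increasing order, so appending keeps the list sorted. */
    void append(int docId) {
        if (size == docIds.length) {
            docIds = Arrays.copyOf(docIds, size * 2);
        }
        docIds[size++] = docId;
    }

    /** Walks from the newest entry backwards and stops after `limit` hits. */
    List<Integer> newest(int limit) {
        List<Integer> hits = new ArrayList<>();
        for (int i = size - 1; i >= 0 && hits.size() < limit; i--) {
            hits.add(docIds[i]);
        }
        return hits;
    }
}
```

Because new entries only ever land at the end, a reader asking for the newest few documents touches only the tail of the array, no matter how long the list has grown.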

Following the keynote, I watched Timothy Potter present on integrating Solr and Storm for realtime stream processing.
What I specifically liked in Tim's presentation was two things: (1) he gave personal recommendations for certain libraries and frameworks, like Netty and Coda Hale Metrics, and (2) he emphasized that even though we deal with relatively exotic technologies like Solr and Storm, we can still rely on old, proven technologies like Spring + Jackson to remap input JSON into neatly named properties of a Java class. This all comes in handy when you start working on your own backend code.

Next, LinkedIn engineers talked about their indexing architecture for a segmentation and targeting platform.
They use Lucene only, no Solr yet, but would be interested in looking into things like SolrCloud. Also mentioned in the presentation was the architectural decision to abstract the Lucene engine from the business logic by writing a JSON to Lucene query parser. This will help if they ever decide to try some other search engine. On one hand I sense here that a large business is more cautious than, let's say, a smaller company. On the other hand, imho, Lucene has grown into a very stable system and, given its long presence on the market and wide adoption, I wouldn't be too cautious. But biz is biz.
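A rough sketch of what such an abstraction layer could look like: the business logic builds an engine-neutral query tree (here assumed to be already parsed from JSON), and a separate translator turns it into Lucene's classic query syntax. All class names and field names below are hypothetical and are not LinkedIn's design.

```java
import java.util.ArrayList;
import java.util.List;

/** Engine-neutral query tree, e.g. the result of parsing a JSON request. */
interface QueryNode {}

final class TermNode implements QueryNode {
    final String field;
    final String value;
    TermNode(String field, String value) { this.field = field; this.value = value; }
}

final class BoolNode implements QueryNode {
    final String op; // "AND" or "OR"
    final List<QueryNode> clauses;
    BoolNode(String op, List<QueryNode> clauses) { this.op = op; this.clauses = clauses; }
}

/** Swap this implementation to target a different search engine. */
interface QueryTranslator<T> {
    T translate(QueryNode node);
}

/** Renders the tree as a classic Lucene query-syntax string. */
final class LuceneSyntaxTranslator implements QueryTranslator<String> {
    @Override
    public String translate(QueryNode node) {
        if (node instanceof TermNode) {
            TermNode t = (TermNode) node;
            // Quote the value and escape backslashes and quotes inside it.
            String escaped = t.value.replace("\\", "\\\\").replace("\"", "\\\"");
            return t.field + ":\"" + escaped + "\"";
        }
        BoolNode b = (BoolNode) node;
        List<String> parts = new ArrayList<>();
        for (QueryNode clause : b.clauses) {
            parts.add(translate(clause));
        }
        return "(" + String.join(" " + b.op + " ", parts) + ")";
    }
}
```

The business logic only ever sees `QueryNode`, so moving to another engine would mean writing one more `QueryTranslator` rather than touching the callers.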

Up next was the session on additions to the Lucene arsenal by Shai Erera and Adrien Grand. I'd say this was the most technically in-depth presentation of the day. Shai talked about the new replication module added to Lucene and expressed the hope of it being added to ElasticSearch and Solr, or even replacing their own replication methods. Adrien focused on the implementation details of a new feature called Index Sorting. The idea is that an index can always be kept sorted on some criterion, like the modification date of a document, which in turn enables early termination techniques and better compression of the sorted index.
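The early termination benefit can be illustrated without any Lucene code. The sketch below assumes the documents are already stored newest-first (the index sort order) and that the query asks for the top k matches in that same order; the `Doc` and `Result` types are made up for the example and do not mirror Lucene's API.

```java
import java.util.ArrayList;
import java.util.List;

class SortedIndexDemo {
    /** Toy document: an ID, a modification time, and whether it matches the query. */
    static final class Doc {
        final int id;
        final long modified;
        final boolean matches;
        Doc(int id, long modified, boolean matches) {
            this.id = id;
            this.modified = modified;
            this.matches = matches;
        }
    }

    /** Collected hits plus the number of documents that had to be inspected. */
    static final class Result {
        final List<Integer> ids = new ArrayList<>();
        int visited = 0;
    }

    /** Assumes `sortedDocs` is ordered newest-first, matching the requested sort. */
    static Result topK(List<Doc> sortedDocs, int k) {
        Result result = new Result();
        for (Doc doc : sortedDocs) {
            if (result.ids.size() == k) {
                break; // early termination: later docs cannot rank higher
            }
            result.visited++;
            if (doc.matches) {
                result.ids.add(doc.id);
            }
        }
        return result;
    }
}
```

Without the sort order, every matching document would have to be scored and compared before the top k could be known; with it, the scan can stop at the k-th match.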

After the lunch break I decided to go to "Shrinking the Haystack" using Solr and OpenNLP by Wes Caldwell. While his presentation wasn't too technical, it mentioned a really hilarious, uncommon practice they have at ISS Inc while building search and data processing solutions for FBI-type agencies. One question, he said, that strikes every software engineering applicant is: "Will you be OK to spend a weekend in Afghanistan?" Not a usual workplace, huh? To me, having spent a month in Algeria as a consultant, this sounded somewhat familiar, because when entering my hotel the car was always checked by tommy-gunners.

The take-away point of this presentation was that they build big data systems not to find a needle in the haystack, but to give human analysts tools that help them find actionable intelligence. I.e. you can't replace people. The presentation also mentioned some tech goodies, like boilerpipe for extracting text from an HTML page while leaving the boilerplate HTML out, and GeoNames, an open-source geo-entity database that can be indexed into a Solr core for better search. In terms of NLProc and machine learning, at ISS they have arrived at combining a gazetteer (dictionary) with supervised machine learning to get the best of both worlds. From my NLProc + ML experience at SemanticAnalyzer I can only second this.


Since I have spent a number of weeks focusing on various query parsers, changing the QueryParser.jj grammar with JavaCC, I decided to see an alternative approach to building parsers: the presentation on Parboiled by John Berryman. He has been implementing a legacy replacement system for patent examiners, who at times have super-long boolean and positional queries and who are not ready to give up their query syntax. What is great about Parboiled-based parsers is that they are plain Java, i.e. the parser itself is declaratively expressed in your own Java code. It works nicely with Lucene's SpanQueries and general Queries. On the other hand, in the lobby John mentioned that debugging a Parboiled-generated parser is not straightforward. Well, JavaCC isn't any better in terms of debugging. One listener from the audience mentioned Antlr, which he enjoyed using. There is a JIRA for introducing Antlr into query parsing with Lucene, if you would like to work on this practically. John's slides can be enjoyed here.
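To give a feel for what "the grammar lives in your Java code" means, here is a small hand-written recursive-descent parser for boolean queries. It deliberately does not use Parboiled's API (Parboiled expresses rules through its own rule-building methods); the class name, the grammar and the prefix-notation output are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hand-written sketch of a boolean query parser; NOT Parboiled's API. */
class BooleanQueryParser {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    private BooleanQueryParser(String input) {
        // Tokens are single parentheses or runs of non-space, non-parenthesis characters.
        Matcher m = Pattern.compile("[()]|[^\\s()]+").matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
    }

    /** Parses the whole input and returns the tree in prefix notation. */
    static String parse(String input) {
        BooleanQueryParser parser = new BooleanQueryParser(input);
        String tree = parser.orExpr();
        if (parser.pos != parser.tokens.size()) {
            throw new IllegalArgumentException("Unexpected token: " + parser.tokens.get(parser.pos));
        }
        return tree;
    }

    // orExpr := andExpr ("OR" andExpr)*
    private String orExpr() {
        String left = andExpr();
        while (accept("OR")) {
            left = "OR(" + left + ", " + andExpr() + ")";
        }
        return left;
    }

    // andExpr := atom ("AND" atom)*   (AND binds tighter than OR)
    private String andExpr() {
        String left = atom();
        while (accept("AND")) {
            left = "AND(" + left + ", " + atom() + ")";
        }
        return left;
    }

    // atom := WORD | "(" orExpr ")"
    private String atom() {
        if (accept("(")) {
            String inner = orExpr();
            if (!accept(")")) {
                throw new IllegalArgumentException("Missing ')'");
            }
            return inner;
        }
        if (pos >= tokens.size()) {
            throw new IllegalArgumentException("Unexpected end of query");
        }
        String token = tokens.get(pos);
        if (token.equals(")") || token.equals("AND") || token.equals("OR")) {
            throw new IllegalArgumentException("Unexpected token: " + token);
        }
        pos++;
        return token;
    }

    private boolean accept(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) {
            pos++;
            return true;
        }
        return false;
    }
}
```

For example, `BooleanQueryParser.parse("laser AND (diode OR led)")` returns `AND(laser, OR(diode, led))`. In a real system each rule would build a Lucene `Query` object instead of a string.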

After all this techy stuff I took a longer break and did some networking: with the Sematext guys at their booth, with Mikhail Khludnev, Documill (from Espoo, yeah!) and other people, to get a sense of what kind of audience had come to the conference and why they were interested in Lucene / Solr specifically.

The high point of the day was the lovely Stump the Chump session by Chris Hostetter, which yours truly had the privilege of winning: first place on the first attempt! I'll update this post with the video once it is out of production.

The second day of the conference follows in the next post, as this one is already quite a long read.

3 comments:

Mikhail Khludnev said...

One detail re Michael Busch's Twitter talk: Solr 4.4 is used for older tweets (the archive), while the in-memory hack searches the recent ones. It's a great promotion for Solr, I suppose.

Dmitry Kan said...

Cool, thanks for mentioning. Do you know what kind of setup they have for Solr? Do they use any new features, like soft commits?

Mikhail Khludnev said...

There weren't any details, as far as I remember.