Sunday, November 24, 2013

Training on NLProc and Machine Learning

Just delivered an in-person training on NLProc (imho, a better abbreviation for natural language processing than NLP) and machine learning for OK.ru (the Russian Facebook, owned by Mail.Ru Group) in Saint Petersburg, Russia.

OK.ru has a nice office not far from the Petrogradskaya subway station in Saint Petersburg (the central office is in Moscow).


If you feel like your project is somewhat stuck and needs a fresh look, or you need to widen your knowledge in NLProc and / or machine learning, feel free to contact me on g+. At the moment folks at SemanticAnalyzer can do this in Europe / the western part of Russia. At SemanticAnalyzer we also offer a full package of services for natural language processing development in case the expertise isn't available in your house. This includes project scoping, breaking the work down into technical tasks, time estimation, development, testing / evaluation and delivery.

Wednesday, November 13, 2013

Lucene Revolution EU 2013 in Dublin: Day 2

This post describes my impressions of day 2 of the Lucene Revolution conference held in Dublin, Ireland, in 2013. Day 1 is described here.

The second day was slightly tougher to start, right after the previous evening in a bar with committers and conference participants, but a good portion of full Irish breakfast with coffee helped a lot and set me up for the day.

I decided to postpone coming to the conference venue (Aviva Stadium) right until the technical presentations started. But I was still in early enough to make it to the panel with Paul Doscher. Paul carries himself quite professionally and talks in a convincing manner. He didn't seem to be a BS CEO, and he tried his best to address the question at hand as deeply and low-level as possible. Maybe the EU audience was a bit sleepy or not too comfortable talking otherwise (geeks, geeks), which didn't spawn as many discussions / questions. Two of Paul's questions come to mind:



1. Have you been doing search / big data during the past 4-5 years?
2. Do you think search / big data is a good area to stay in until the end of your IT career?

There were more hands for 1 than for 2. Which, in Paul's mind, was a bit strange, as he strongly believes search and big data is the area to stay in. There is so much complexity involved, and so much data has been accumulated by enterprises, that there will be a strong demand for specialists of this sort. While I would second this, I can tell that the consumers of enterprise data will have higher standards for quality, because they do business with it. On the other hand, in my opinion, if you take the so-called mass media (microblogs, forums, user sentiment and so on), there is also big data produced at high speed, but the user base on average is not as demanding. It is, again, businesses that will likely be interested in sifting some signal from the mass-media noise.

Coming back to the technical presentations.

The first technical presentation I visited was "Lucene Search Essentials: Scorers, Collectors and Custom Queries" by +Mikhail Khludnev. I have known Mikhail for more than a year, from sitting on the same Solr user mailing list and commenting on SOLR jiras.

His team has been doing some really low-level stuff with Solr / Lucene code for e-commerce, and this presentation pretty much summarized his knowledge. I'd say that while the presentation is accessible to an unprepared audience, it goes quite deep into analyzing the complexities of certain search algorithms and lays a background for more conscious search. The slides.
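To give a flavor of the level the talk operates at, here is a minimal custom Collector in the Lucene 4.x style that simply counts hits. This is my own toy example, not something from Mikhail's slides:

```java
import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Counts matching documents without collecting score docs.
// Usage: searcher.search(query, collector); then read collector.count.
public class CountingCollector extends Collector {
    public int count = 0;

    @Override
    public void setScorer(Scorer scorer) throws IOException {
        // Scores are not needed for counting.
    }

    @Override
    public void collect(int doc) throws IOException {
        count++;
    }

    @Override
    public void setNextReader(AtomicReaderContext context) throws IOException {
        // No per-segment state to track.
    }

    @Override
    public boolean acceptsDocsOutOfOrder() {
        return true; // hit order is irrelevant for counting
    }
}
```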

In the next session Isabel Drost-Fromm introduced the audience to the world of text classification and Apache Mahout. Isabel is a co-founder of Berlin Buzzwords, another exciting conference on search and related technologies, and a Mahout committer.



Before the session Isabel mentioned to me that one of the goals of her presentation was to let people know about existing good systems, like Lucene with its tokenizers (see the sketch below), and to focus on solving the particular task at hand instead of reinventing the wheel. It was a good intro-level presentation, and after trying out Mahout, I must say you can get going with the examples quite easily. Isabel also mentioned that becoming a committer in the Apache community is easier than one might think: fix or write new documentation, attach small patches (not necessarily code), and that might be enough. But removing yourself from the committer status is much harder, because, if I sensed it right, Apache has a rather wide network for reminding you to commit. So you might find yourself coding over the weekend for Open Source. Which, in the long run, may make a good career boost for you.
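Speaking of reusing Lucene's tokenizers instead of rolling your own: here is a minimal sketch (my own, using Lucene 4.x APIs) of running text through a Lucene analyzer, which is exactly the kind of preprocessing a text classification pipeline needs:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class TokenizeExample {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
        TokenStream ts = analyzer.tokenStream("body",
                new StringReader("Apache Mahout classifies text."));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            // Prints: apache, mahout, classifies, text (lowercased, stopwords removed)
            System.out.println(term.toString());
        }
        ts.end();
        ts.close();
    }
}
```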

In the lobbies some folks wanted to chat about the luke project, which I have been contributing to lately. I finally added its creator, Andrzej Bialecki, as a contributor. If you enjoy using luke and want to make it better, feel free to join and start contributing yummy patches!

The next cool presentation was given by Alan Woodward and Charlie Hull (who published a nice writeup on the conference here and here). Their collaborative talk focused on turning search upside down: queries become documents and documents become queries (honestly, I would like to see how this is handled in the code in terms of naming the variables and functions). Charlie briefly mentioned that the code will be open-sourced. The specific use case for this is a bank of queries, or a taxonomy of queries (for instance, for patent examiners), that needs to be executed against each incoming document. If there is a hit, the document gets tagged with the query id as a kind of concept. What I liked in the presentation is the innovative take on the problem: given a set of queries, what is an efficient representation of them in a Lucene index? For example, a SpanNear query can be represented as an AND over its operands; the AND query can then be checked quickly and cheaply on just one operand: if that operand is not present, the entire query cannot hit the document. They also used a position-aware Scorer that can iterate over the intervals of positions. I did similar work in the past, but used the not-so-efficient Highlighter component, so I'm glad there is similar work done in an efficient manner. UPD: the tool, called Luwak, has been open-sourced!
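To make the queries-as-documents idea concrete, here is a minimal sketch of the general pattern using Lucene's MemoryIndex: the incoming document is turned into a throwaway single-document index, and every stored query is run against it. This is my own illustration of the naive version of the idea, not Luwak's actual API, and it lacks exactly the pre-filtering optimizations the talk was about:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.Version;

public class QueryBankMatcher {
    public static void main(String[] args) {
        // The bank of stored queries, keyed by id (alerts, a patent taxonomy, etc.).
        Map<String, Query> queryBank = new HashMap<String, Query>();
        queryBank.put("q1", new TermQuery(new Term("body", "lucene")));
        queryBank.put("q2", new TermQuery(new Term("body", "mahout")));

        // Each incoming document becomes a tiny in-memory index...
        MemoryIndex doc = new MemoryIndex();
        doc.addField("body", "lucene revolution in dublin",
                new StandardAnalyzer(Version.LUCENE_45));

        // ...and every stored query is executed against it; a positive score is a hit.
        for (Map.Entry<String, Query> e : queryBank.entrySet()) {
            if (doc.search(e.getValue()) > 0.0f) {
                System.out.println("document matches query " + e.getKey());
            }
        }
    }
}
```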

In the next presentation Adrien Grand walked the audience through the design decisions behind the data structures in a Lucene index, compared Lucene indices to RDBMS indices (I tried to summarize it here) and talked about things on the hardware / OS level, like using SSD disks and allocating enough RAM for OS cache buffers. During the Q&A I suggested an idea: to increase indexing throughput, one can set the merge factor to thousands during bulk indexing and, upon completion, set it back to 10 or even 1 so that the search side is efficient. Indexing then does not need to merge as often. Adrien seconded this idea and also proposed using TieredMergePolicy (the default in Solr), where you can control the merging of segments by the number of deleted docs. This is needed both for efficiency and for timely space reclaiming. Even though I missed the closing remarks, since this presentation took more time than expected, I was happy to be there, as the presentation was quite thought-provoking.
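A minimal sketch of how this tuning might look with Lucene 4.x APIs (the concrete numbers are made up for illustration; tune them for your own data):

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BulkIndexing {
    public static void main(String[] args) throws Exception {
        TieredMergePolicy mergePolicy = new TieredMergePolicy();
        // Tolerate many segments per tier during bulk indexing, so merges are rare.
        mergePolicy.setSegmentsPerTier(100.0);
        mergePolicy.setMaxMergeAtOnce(100);
        // Merge away segments once enough of their docs are deleted (space reclaiming).
        mergePolicy.setForceMergeDeletesPctAllowed(10.0);

        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45,
                new StandardAnalyzer(Version.LUCENE_45));
        iwc.setMergePolicy(mergePolicy);

        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/index")), iwc);
        // ... bulk indexing happens here ...
        // Once done, compact the index so the search side touches few segments.
        writer.forceMerge(1);
        writer.close();
    }
}
```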

These have been two intense days, both in terms of technical presentations and sightseeing. I only managed to explore Trinity College and the areas around it, but doing this on foot was quite a good sporting activity.

The apogee of this day, and perhaps of the conference, was meeting the entire sematext team while roaming around the shopping streets near Trinity College! We had a good time together in a local bar. It's been fun, thanks guys!

This pretty much wraps up my 2-post impressions and notes on the Lucene Revolution conference 2013 in Dublin.

If you enjoyed this writeup, there is a delightful one by +Mike Sokolov:
http://blog.safariflow.com/2013/11/25/this-revolution-will-be-televised/

And read the Flax writeups too (links above in the post).

Happy searching,

@dmitrykan

Sunday, November 10, 2013

Lucene Revolution EU 2013 in Dublin: Day 1

The Lucene Revolution conference was held in Europe for the first time. Dublin, Ireland, was selected as the conference city, and I must say it was a great choice. The city itself is quite compact and friendly, full of places to see (Trinity College, Temple Bar, O'Connell Street and many more) as well as places to relax, with all the pubs, bars and shopping streets.



It was my first time in Dublin and my general impression of the city is very good.

Let's proceed to the conference!

Day 1


The keynote was presented by Michael Busch of Twitter, who talked about their data structure for holding an index in such a way that posting lists are automatically sorted and can be read backwards without reopening the index (a costly op). This also lets them support early termination naturally. Everything is stored in RAM and they never commit.
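A crude toy reconstruction of the idea as I understood it (my own sketch, certainly not Twitter's code): if postings are appended in arrival order, each posting list is implicitly time-sorted, so reading it backwards yields newest-first results and lets you stop early after k hits:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RamPostings {
    // term -> doc ids in arrival order; appending keeps each list sorted by time.
    private final Map<String, List<Integer>> postings =
            new HashMap<String, List<Integer>>();

    public void index(int docId, String... terms) {
        for (String t : terms) {
            List<Integer> list = postings.get(t);
            if (list == null) postings.put(t, list = new ArrayList<Integer>());
            list.add(docId);
        }
    }

    // Read backwards: newest first, terminating early after k hits.
    public List<Integer> latest(String term, int k) {
        List<Integer> hits = new ArrayList<Integer>();
        List<Integer> list = postings.get(term);
        if (list != null) {
            for (int i = list.size() - 1; i >= 0 && hits.size() < k; i--) {
                hits.add(list.get(i));
            }
        }
        return hits;
    }
}
```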

Following the keynote, I watched Timothy Potter present on integrating Solr and Storm for real-time stream processing. What I specifically liked in Tim's presentation were two things: (1) he gave personal recommendations for certain libraries and frameworks, like Netty and Coda Hale Metrics, and (2) he emphasized that even though we deal with relatively exotic technologies like Solr and Storm, we can still rely on old proven technologies, like Spring + Jackson, to remap input JSON into neatly named properties of a Java class. This all comes in handy when you start working on your own backend code.
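As a tiny illustration of the Jackson part (my own sketch, with made-up field names, not something from Tim's slides):

```java
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonRemapExample {
    // Maps raw JSON keys onto neatly named Java properties.
    public static class TweetEvent {
        @JsonProperty("id_str") public String id;
        @JsonProperty("text")   public String text;
        @JsonProperty("lang")   public String language;
    }

    public static void main(String[] args) throws Exception {
        String json = "{\"id_str\":\"42\",\"text\":\"hello dublin\",\"lang\":\"en\"}";
        TweetEvent event = new ObjectMapper().readValue(json, TweetEvent.class);
        System.out.println(event.id + " / " + event.language + " / " + event.text);
    }
}
```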

Next, LinkedIn engineers talked about their indexing architecture for a segmentation and targeting platform. They use Lucene only, no Solr yet, but would be interested in looking into things like SolrCloud. Also mentioned in the presentation was the architectural decision to abstract the Lucene engine from the business logic by writing a JSON-to-Lucene query parser. This will help if they ever decide to try some other search engine. On one hand, I sense here that a large business is more cautious than, let's say, a smaller company. On the other hand, imho, Lucene has grown into a very stable system and, given its long presence on the market and wide adoption, I wouldn't be too cautious. But biz is biz.
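Just to illustrate the kind of abstraction layer they meant (my own toy sketch of a made-up JSON query language, not LinkedIn's actual parser; Lucene 4.x APIs):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class JsonQueryParser {
    // Translates {"and": [{"field": ..., "term": ...}, ...]} into a Lucene query,
    // keeping the business logic unaware of Lucene classes.
    public static Query parse(String json) throws Exception {
        JsonNode root = new ObjectMapper().readTree(json);
        BooleanQuery bq = new BooleanQuery();
        for (JsonNode clause : root.get("and")) {
            bq.add(new TermQuery(new Term(clause.get("field").asText(),
                                          clause.get("term").asText())),
                   BooleanClause.Occur.MUST);
        }
        return bq;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse(
                "{\"and\":[{\"field\":\"skill\",\"term\":\"lucene\"}," +
                "{\"field\":\"country\",\"term\":\"ie\"}]}"));
    }
}
```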

Up next was the session on additions to the Lucene arsenal by Shai Erera and Adrien Grand. I'd say this presentation was the most technically in-depth of the day. Shai talked about the new replication module added to Lucene and expressed the hope of it being adopted by ElasticSearch and Solr, or even replacing their own replication methods. Adrien focused on the implementation details of a new feature called index sorting. The idea is that an index can be kept permanently sorted on some criterion, like the modification date of a document, which in turn enables early-termination techniques and better compression of the sorted index.
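For the record, much later Lucene versions exposed this directly on the index writer config; here is a minimal sketch with that modern API (not what was shown in 2013, where the work lived in a sorting merge policy):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;

public class SortedIndexExample {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        // Keep segments permanently sorted by modification date, newest first,
        // so time-sorted queries can terminate early.
        iwc.setIndexSort(new Sort(
                new SortField("modified", SortField.Type.LONG, true)));

        IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/sorted-index")), iwc);
        Document doc = new Document();
        doc.add(new NumericDocValuesField("modified", System.currentTimeMillis()));
        writer.addDocument(doc);
        writer.close();
    }
}
```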

After the lunch break I decided to go to "Shrinking the Haystack" using Solr and OpenNLP by Wes Caldwell. While his presentation wasn't too technical, it mentioned a really hilarious, uncommon practice they have at ISS Inc while building search and data processing solutions for FBI-type agencies. One question, he said, that strikes every software engineer applicant is: "Would you be OK spending a weekend in Afghanistan?" Not a usual working place, huh? To me, having spent a month in Algeria as a consultant, this sounded somewhat familiar, because when entering my hotel the car was always checked by tommy-gunners. The take-away point of this presentation was that they build big data systems not to find a needle in the haystack, but to give human analysts tools that help them find actionable intelligence data. I.e., you can't replace people. The presentation also mentioned some tech goodies, like boilerpipe for extracting text from HTML while leaving the boilerplate out, and GeoNames, an open-source geo-entity database that can be indexed into a Solr core for better search. In terms of NLProc and machine learning, at ISS they have arrived at combining a gazetteer (dictionary) with supervised machine learning to get the best of both worlds. From my NLProc + ML experience at SemanticAnalyzer, I can only second this.
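Boilerpipe, by the way, is a one-liner to get going with (a minimal sketch; the HTML is made up):

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeExample {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><div>Menu | Login | Share</div>"
                + "<p>The actual article text we want to keep.</p></body></html>";
        // Strips navigation, ads and other boilerplate, keeping the main text.
        String text = ArticleExtractor.INSTANCE.getText(html);
        System.out.println(text);
    }
}
```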


Since I have spent a number of weeks focusing on various query parsers, changing the QueryParser.jj grammar with JavaCC, I decided to see an alternative approach to building parsers: the presentation on Parboiled by John Berryman. He has been implementing a legacy-replacement system for patent examiners, who at times have super-long boolean and positional queries and who are not ready to give up their query syntax. What is great about Parboiled-based parsers is that they are plain Java, i.e. the parser itself is expressed declaratively using your own Java code. It works nicely with Lucene's SpanQueries and general Queries. On the other hand, in the lobbies John mentioned that debugging a Parboiled-generated parser is not straightforward. Well, JavaCC isn't any better in terms of debugging either. One listener from the audience mentioned Antlr, which he enjoyed using. There is a JIRA for introducing Antlr into query parsing with Lucene, if you would like to work on this practically. John's slides can be enjoyed here.
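For a taste of what "the parser is your own Java code" means, here is a toy Parboiled grammar (my own made-up sliver of a boolean query language, nothing from John's actual parser):

```java
import org.parboiled.BaseParser;
import org.parboiled.Parboiled;
import org.parboiled.Rule;
import org.parboiled.annotations.BuildParseTree;
import org.parboiled.parserunners.ReportingParseRunner;
import org.parboiled.support.ParsingResult;

// Rules are plain (overridable) Java methods; Parboiled enhances the class at runtime.
@BuildParseTree
class BoolQueryParser extends BaseParser<Object> {
    // Query := Term (" AND " Term)* EOI
    Rule Query() {
        return Sequence(Term(), ZeroOrMore(Sequence(" AND ", Term())), EOI);
    }

    // Term := one or more lowercase letters
    Rule Term() {
        return OneOrMore(CharRange('a', 'z'));
    }
}

public class ParboiledExample {
    public static void main(String[] args) {
        BoolQueryParser parser = Parboiled.createParser(BoolQueryParser.class);
        ParsingResult<Object> result =
                new ReportingParseRunner<Object>(parser.Query()).run("lucene AND solr");
        System.out.println(result.matched ? "parsed ok" : "parse failed");
    }
}
```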

After all this techy stuff I took a longer break and went networking with the sematext guys at their booth, with +Mikhail Khludnev, with Documill (from Espoo, yeah!) and with other people, to get a sense of what type of audience had come to the conference and why they were interested in Lucene / Solr specifically.

The apogee of the day was the lovely Stump the Chump session by +Chris Hostetter, which yours truly had the privilege to win, first time, first place! I'll update this post with the video once it is out of the production lines.

Up next is the second day of the conference in the next post, as this post is quite a long read already.