Monday, December 16, 2013

Lucene Toolbox Luke

As I mentioned in some of the previous posts, I have been working on luke, the Lucene Toolbox. Recently the releases have been more or less steady, with the binaries provided right on the github site.



Here is the most recent release of luke for the most recent release of Lucene 4.6.0:

port to Lucene 4.6.
Tested against the solr-4.6 index of exampledocs xml files.
Changelist:
1. bumped up the version of maven-assembly plugin to the latest 2.4
2. added maven-jar plugin which copies all binary dependencies into lib/ subdirectory
3. Fixed AnalyzerToolPlugin to use token stream reset outside while loop over tokens
4. added lock factory handling in FsDirectory (untested)
5. updated trivia in about.xml

At the moment the generated all-inclusive jar can only be launched by double-clicking it. There is some work to do to enable command line launches.

If you feel like contributing (patches and new features alike) and enjoy using the tool, get in touch or just send a pull request.

Enjoy the season's holidays,

Sunday, November 24, 2013

Training on NLProc and Machine Learning

Just did a training on NLProc (imho, better abbreviation for natural language processing than NLP) and Machine Learning for OK.ru (Russian Facebook owned by Mail.Ru Group) in person in Saint-Petersburg, Russia.

OK.ru has a nice office not far from Petrogradskaya subway station in Saint-Petersburg (the central office is in Moscow).


If you feel like your project is somewhat stuck and needs a fresh look, or you need to widen your knowledge in NLProc and / or machine learning, feel free to contact me on g+. At the moment folks at SemanticAnalyzer can do this in Europe / the western part of Russia. At SemanticAnalyzer we also offer a full package of services for natural language processing development in case you don't have the expertise in-house. This includes project scoping, breaking down into technical tasks, time estimation, development, testing / evaluation and delivery.

Wednesday, November 13, 2013

Lucene Revolution EU 2013 in Dublin: Day 2

This post shares my impressions of day 2 of the Lucene Revolution conference held in Dublin, Ireland, in 2013. Day 1 is described here.

The second day was slightly tougher to start, coming right after the previous evening in a bar with committers and conference participants, but a good portion of full Irish breakfast with coffee helped a lot and set me up for the day.

I decided to postpone coming to the conference venue (Aviva stadium) right until the technical presentations started, but I was still in early enough to make it to the panel with Paul Doscher. Paul carries himself quite professionally and talks in a convincing manner. He didn't seem to be a BS CEO, and tries his best to address the question at hand as deeply and at as low a level as possible. Maybe the EU audience was a bit sleepy or not too comfortable talking (geeks, geeks), so the panel didn't spawn many discussions / questions. Two of Paul's questions that come to mind:



1. Have you been doing search / big data during past 4-5 years?
2. Do you think the search / big data is a good area to stay until end of your IT career?

There were more hands for 1 than for 2, which Paul found a bit strange, as he strongly believes search and big data is the area to stay in. There is so much complexity involved and so much data has been accumulated by enterprises that there will be a strong demand for specialists of this sort. While I would second this, I can tell that the consumers of enterprise data will have higher quality standards, because they do business with it. On the other hand, in my opinion, if you take the so-called mass media (microblogs, forums, user sentiment and so on), big data is also produced there at high speed, but the user base is on average not as demanding. It is again businesses that will likely be interested in sifting some signal from the mass media noise.

Coming back to the technical presentations.

The first technical presentation I attended was "Lucene Search Essentials: Scorers, Collectors and Custom Queries" by +Mikhail Khludnev. I have known Mikhail for more than a year, as we sit on the same Solr user mailing list and comment on SOLR jira.

His team has been doing some really low-level stuff with Solr / Lucene code for e-commerce, and this presentation pretty much summarized his knowledge. I'd say that while the presentation is accessible to an unprepared audience, it goes quite deep into analyzing the complexities of certain search algorithms and lays the groundwork for more conscious search. The slides.

In the next session Isabel Drost-Fromm introduced the audience to the world of text classification and Apache Mahout. Isabel is a co-founder of Berlin Buzzwords, another exciting conference on search and related technologies, and is a Mahout committer.



Before the session Isabel mentioned to me that one of the goals of her presentation was to let people know about existing good systems, like Lucene with its tokenizers, so that they focus on solving the particular task at hand instead of reinventing the wheel. It was a good intro-level presentation, and after trying out Mahout I must say you can get going with the examples quite easily. Isabel mentioned that becoming a committer in the Apache community is easier than one might think: fix or write new documentation, attach small patches (not necessarily code), and that might be enough. But giving up committer status is much harder because, if I sensed it right, Apache has a rather wide network of people reminding you to commit. So you might find yourself coding for Open Source over the weekend, which in the long run may give your career a good boost.

In the lobby some folks wanted to chat about the luke project, which I have been contributing to lately. I finally added its creator Andrzej Bialecki as a contributor. If you enjoy using luke and want to make it better, feel free to join and start contributing yummy patches!

The next cool presentation was given by Alan Woodward and Charlie Hull (who published a nice writeup on the conference here and here). Their joint talk focused on turning search upside down: queries become documents and documents become queries (honestly, I would like to see how this is handled in the code in terms of naming variables and functions). Charlie briefly mentioned that the code will be open-sourced. The specific use case is a bank of queries or a taxonomy of queries (for instance, for patent examiners) that need to be executed against each incoming document. If there is a hit, the document gets the query id as a concept. What I liked in the presentation is the innovative look at the problem: given a set of queries, what is an efficient representation of them in a Lucene index? For example, a SpanNear query can be represented as an AND over its operands, and the AND query can then be checked quickly and efficiently for just one operand. If that operand is not present, the entire query cannot hit the document. They also used a position-aware Scorer that can iterate over intervals of positions. I did similar work in the past, but used the not so efficient Highlighter component, so I'm glad similar work has been done in an efficient manner. UPD: the tool, called Luwak, has been open-sourced!
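The prefiltering idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Luwak's actual code: the `StoredQuery` and `QueryIndex` classes, the query ids, the whitespace tokenizer and the choice of the first term as the mandatory one are all assumptions made up for this example. Each stored query is a positional "near" query that is indexed under a single term, so a document lacking that term never triggers the expensive positional check.

```python
import itertools
from collections import defaultdict


class StoredQuery:
    """Hypothetical 'near' query: all terms must occur within `slop` positions."""

    def __init__(self, query_id, terms, slop):
        self.query_id = query_id
        self.terms = terms
        self.slop = slop

    def matches(self, positions):
        # Expensive positional check: try every combination of term occurrences.
        occurrences = [positions.get(term, []) for term in self.terms]
        return any(max(combo) - min(combo) <= self.slop
                   for combo in itertools.product(*occurrences))


class QueryIndex:
    """Stores queries keyed by one mandatory term (an assumption: the first term)."""

    def __init__(self):
        self.by_term = defaultdict(list)

    def add(self, query):
        self.by_term[query.terms[0]].append(query)

    def match(self, text):
        # The incoming document plays the role of the query against the query index.
        positions = defaultdict(list)
        for pos, token in enumerate(text.lower().split()):
            positions[token].append(pos)
        hits = []
        for term in positions:
            # Cheap prefilter: only queries whose mandatory term occurs are checked.
            for query in self.by_term.get(term, []):
                if query.matches(positions):
                    hits.append(query.query_id)
        return sorted(hits)


index = QueryIndex()
index.add(StoredQuery("q-solar", ["solar", "panel"], slop=2))
index.add(StoredQuery("q-wind", ["wind", "turbine"], slop=1))
print(index.match("A new solar roof panel design"))
```

A real system would pick the rarest term of each query as the mandatory one, so that as few candidate queries as possible survive the prefilter.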

In the next presentation Adrien Grand walked the audience through the design decisions behind the data structures in the Lucene index, compared Lucene indices to RDBMS indices (I tried to summarize it here) and talked about things at the hardware / OS level, like using SSD disks and allocating enough RAM for the OS cache buffers. During the Q&A I suggested that, in order to increase indexing throughput, one can set the merge factor to thousands during indexing and, upon completion, to 10 or even 1, so that the search side stays efficient. The indexing then does not need to merge as often. Adrien seconded this idea and also proposed using TieredMergePolicy (the default in Solr), where you can control the merging of segments based on the number of deleted docs. This is needed both for efficiency and for timely reclaiming of space. Even though I missed the closing remarks, since this presentation took more time than expected, I was happy to be there as the presentation was quite thought-provoking.

These have been two intense days both in terms of technical presentations and sightseeing. I have only been able to explore Trinity College and areas around it, but doing this on foot was quite a good sport activity.

The apogee of this day, and perhaps of the conference, was meeting the entire sematext team while roaming around the shopping streets near Trinity College! We had a good time together in a local bar. It's been fun, thanks guys!

This pretty much wraps up my 2-post impressions and notes on the Lucene Revolution conference 2013 in Dublin.

If you enjoyed this writeup, there is a delightful writeup by +Mike Sokolov :
http://blog.safariflow.com/2013/11/25/this-revolution-will-be-televised/

And read the Flax writeups too (links above in the post).

Happy searching,

@dmitrykan

Sunday, November 10, 2013

Lucene Revolution EU 2013 in Dublin: Day 1

The Lucene Revolution conference was held in Europe for the first time. Dublin, Ireland, was selected as the conference city and I must say it is a great choice. The city itself is quite compact and friendly, full of places to see (Trinity College, Temple Bar, O'Connell street and many more) as well as places to relax, with all the pubs, bars and shopping streets.



It was my first time in Dublin and my general impression of the city is very good.

Let's proceed to the conference!

Day 1


The keynote was presented by Michael Busch of Twitter, who talked about their data structure for holding an index in such a way that posting lists are automatically sorted and one can read them backwards without reopening the index (a costly op). This design also supports early termination naturally. Everything is stored in RAM and they never commit.
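To make the idea concrete, here is a toy sketch in Python. It is not Twitter's actual data structure: the `AppendOnlyIndex` class and its methods are hypothetical names, and the only assumption carried over from the keynote is that document ids grow with arrival time. Appending to a posting list then keeps it sorted for free, and reading it backwards yields the newest documents first, so a search can stop as soon as it has enough hits.

```python
from collections import defaultdict


class AppendOnlyIndex:
    """Toy in-memory index; doc ids are assigned in arrival order."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> doc ids in ascending order
        self.docs = []

    def add(self, text):
        doc_id = len(self.docs)
        self.docs.append(text)
        for term in set(text.lower().split()):
            # Appending keeps every posting list sorted by doc id (and by time).
            self.postings[term].append(doc_id)
        return doc_id

    def newest(self, term, k):
        """Return up to k (k >= 1) newest doc ids containing term, newest first."""
        hits = []
        for doc_id in reversed(self.postings.get(term, [])):
            hits.append(doc_id)
            if len(hits) == k:
                break  # early termination: older postings are never touched
        return hits


index = AppendOnlyIndex()
index.add("lucene revolution in dublin")
index.add("solr and storm")
index.add("lucene keynote by twitter")
index.add("lucene index sorting")
print(index.newest("lucene", 2))
```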

Following the keynote I watched Timothy Potter present on integrating Solr and Storm for realtime stream processing.
What I specifically liked in Tim's presentation were two things: (1) he gave personal recommendations for certain libraries and frameworks, like Netty and Coda Hale Metrics, and (2) he emphasized that even though we deal with relatively exotic technologies like Solr and Storm, we can still rely on old proven technologies, like Spring + Jackson, to remap input json into neatly named properties of a Java class. This all comes in handy when you start working on your own backend code.

Next, LinkedIn engineers talked about the indexing architecture of their segmentation and targeting platform.
They use Lucene only, no Solr yet, but would be interested in looking into things like SolrCloud. Also mentioned in the presentation was the architectural decision to abstract the Lucene engine from the business logic by writing a json to Lucene query parser. This will help if they ever decide to try some other search engine. On the one hand I sense here that a large business is more cautious than, let's say, a smaller company. On the other hand, imho, Lucene has grown into a very stable system, and given its long presence on the market and wide adoption I wouldn't be too cautious. But biz is biz.

Up next was the session on additions to the Lucene arsenal by Shai Erera and Adrien Grand. I'd say this presentation was the most technically in-depth of the day. Shai talked about the new replication module added to Lucene and expressed the hope of it being added to ElasticSearch and Solr or even replacing their own replication methods. Adrien focused on the implementation details of a new feature called Index Sorting. The idea is that an index can always be kept sorted on some criterion, like the modification date of a document, which in turn enables early termination techniques and better compression of the sorted index.

After the lunch break I decided to go to "Shrinking the Haystack" using Solr and OpenNLP by Wes Caldwell. While his presentation wasn't too technical, it mentioned a really hilarious, uncommon practice they have at ISS Inc while building search and data processing solutions for FBI-type agencies. One question that, he said, strikes every software engineering applicant is: "Will you be ok to spend a weekend in Afghanistan?" Not a usual workplace, huh? To me, having spent a month in Algeria as a consultant, this sounded somewhat familiar, because when entering my hotel the car was always checked by tommy-gunners. The take-away point of this presentation was that they build big data systems not to find a needle in the haystack, but to give human analysts tools that help them find actionable intelligence. I.e. you can't replace people. The presentation also mentioned some tech goodies, like boilerpipe for extracting text from an html page while leaving the boilerplate html out, and GeoNames, an open source geo-entity database that can be indexed into a Solr core for better search. In terms of NLProc and machine learning, at ISS they have arrived at combining a gazetteer (dictionary) with supervised machine learning to get the best of both worlds. From my NLProc + ML experience at SemanticAnalyzer I can only second this.


Since I had spent a number of weeks focusing on various query parsers, changing the QueryParser.jj grammar with JavaCC, I decided to see an alternative approach to building parsers: the presentation on Parboiled by John Berryman. He has been implementing a legacy replacement system for patent examiners, who at times have super-long boolean and positional queries and who are not ready to give up their query syntax. What is great about Parboiled-based parsers is that they are Java, i.e. the parser itself is declaratively expressed in your own Java code. It works nicely with Lucene's SpanQueries and general Queries. On the other hand, in the lobby John mentioned that debugging a Parboiled-generated parser is not straightforward. Well, JavaCC isn't any better in terms of debugging. One listener from the audience mentioned Antlr, which he enjoyed using. There is a JIRA for introducing Antlr into query parsing with Lucene, if you would like to work on this in practice. John's slides can be enjoyed here.

After all this techy stuff I took a longer break and went networking with the sematext guys at their booth, with +Mikhail Khludnev, Documill (from Espoo, yeah!) and other people, to get a sense of what type of audience had arrived at the conference and why they were interested in Lucene / Solr specifically.

The apogee of the day was the lovely Stump the Chump session by +Chris Hostetter, which yours truly had the privilege of winning, taking first place on the first try! I'll update this post with the video once it is out of the production lines.

Up next is the second day of the conference in the next post as this post is quite a long read already.

Sunday, October 20, 2013

SentiScan: a blog post about sentiment detection technology (the emotional tone of messages)

The other day our partner YouScan published an interview with me about our joint technology for detecting sentiment or, in other words, emotional coloring in texts. This task has been known to theoretical and practical computational linguistics for quite a long time, and many approaches have been created. Traditionally two groups of methods are distinguished: those based on machine learning and statistics, and rule-based approaches. There are also methods that combine both approaches, as well as a newfangled algorithm based on neural networks.

In the SentiScan technology we also combine both approaches and add a twist of our own: object orientation. This is not OOP (object-oriented programming), but the search for named entities and the detection of sentiment towards them. We obtain the list of entities from user search queries that describe a brand, a product name, a person's name or other phenomena. The system's task is to find these objects in the text, isolate the sentiment context and recognize the sentiment itself.

We used machine learning methods to find polar units, i.e. units that have an unambiguous tonal value, either positive or negative. Examples of such unambiguously colored units:

positive
благородный (noble)
доход (income)
изысканный (refined)
лояльный (loyal)
необыкновенный (extraordinary)
оперативный (prompt)
передовой (advanced)

negative
абсурд (absurdity)
винить (to blame)
вымогательство (extortion)
грабеж (robbery)
идиотский (idiotic)
нытье (whining)
отвратительный (disgusting)

As you can see, the dictionaries contain representatives of any part of speech: not only adjectives, but also nouns and verbs. There are adverbs too (отвратительно, "disgustingly").

After the input text has been split into individual sentences, the algorithm performs syntactic analysis in order to identify the objects in the text as well as their mutual influence. The syntactic analysis rules were selected specifically for the sentiment detection task and would not suit, for example, a general syntactic analysis task or its applications (machine translation or a spell-checker).

During syntactic analysis, information and statistics are accumulated about the sentiment flow (its strength and polarity: positive or negative) and its direction towards the target object. Having accumulated the information about individual sentences, the algorithm moves to the text level, where the final information is computed. In the end the algorithm delivers a verdict on the whole text (which may also consist of a single sentence or even a single word): positive or negative. The text may also be marked with a neutral flag, in two cases:

1. the text contained no tonally colored units and no syntactic contrast (object A, but object B)

2. the text contained mixed sentiment and it is unclear what the author meant by the statement. In this case the algorithm can optionally assign the label "mixed sentiment", i.e. positive+negative.
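The text-level decision can be illustrated with a heavily simplified Python sketch. It is not the SentiScan implementation: there is no syntactic analysis, no target object and no sentiment strength here, only exact-match counting of polar words per sentence followed by a verdict for the whole text. The function names, the `mark_mixed` flag and the rule that equal counts mean mixed sentiment are assumptions made for this illustration; the lexicon is a handful of words from the lists above.

```python
import re

# Tiny polarity lexicon taken from the word lists above; a real one is far larger.
POSITIVE = {"благородный", "изысканный", "лояльный", "оперативный", "передовой"}
NEGATIVE = {"абсурд", "вымогательство", "грабеж", "идиотский", "отвратительный"}


def sentence_counts(sentence):
    """Count positive and negative polar units in one sentence (exact match only)."""
    tokens = re.findall(r"\w+", sentence.lower())
    positive = sum(1 for token in tokens if token in POSITIVE)
    negative = sum(1 for token in tokens if token in NEGATIVE)
    return positive, negative


def verdict(text, mark_mixed=False):
    """Accumulate sentence-level counts and decide on the whole text."""
    positive = negative = 0
    for sentence in re.split(r"[.!?]+", text):
        p, n = sentence_counts(sentence)
        positive += p
        negative += n
    if positive == 0 and negative == 0:
        return "neutral"  # case 1: no polar units at all
    if positive == negative:
        # case 2 (assumed rule): balanced counts are treated as mixed sentiment
        return "mixed" if mark_mixed else "neutral"
    return "positive" if positive > negative else "negative"


print(verdict("Сервис оперативный. Но цены это грабеж и абсурд."))
```

Without lemmatization only dictionary forms are matched, which is one of the reasons a production system needs proper morphological and syntactic analysis.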

The algorithm briefly described here also has a separate capability for determining the objectivity ("impartiality") and subjectivity of a text or message. If the author of a text does not use emotionally colored expressions, the text can on the whole be considered objective or impartial, and subjective if the author does use them. Detecting the author's subjectivity can be useful to those brands that are looking for "genuine" reviews of their products, i.e. reviews based on facts.

You can try this system in action by writing to us at info@semanticanalyzer.info. The technical details and performance characteristics are described in the documentation.



Sunday, October 6, 2013

Deep learning: analyzing text and images with Recursive Neural Networks

Continuing the topic of Deep Learning, which has lately received a lot of attention from the scientific community and industry, I would like to talk about a rather unusual application area for neural networks: parsing images with recursive neural networks. In this post I will briefly describe the gist and give links for a more detailed study of the material. In addition, there will be a link to an assignment from Stanford for those who wish to try their hand and get a practical result in deep learning for NLP.

This post is based on a talk by Richard Socher; the video in English is available here.

As it turns out, not only language but also images can have a recursive (or some regular) structure. But first, about the recursive properties of sentences. The principle of compositionality assumes that any natural language sentence can be represented as a hierarchical structure of connected constituents. The meaning of a sentence can be represented through the meanings of the words it contains and a list of rules for joining words into groups. For example, in the sentence:

Это страна моего рождения. ("This is the country of my birth.")

The words "моего" ("my") and "рождения" ("birth") form a single group, "страна" ("country") and "моего рождения" form a super-group, and the whole sentence closes its meaning as a hierarchical representation of these groups. Thus we obtain not only the meanings of the individual words of the sentence, but also the tree structures that tie these words together.
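The grouping described above can be written down as a nested structure. The following Python sketch only illustrates the tree itself, not a recursive neural network: the tuple-based representation and the `constituents` function are assumptions made for this example. It walks the tree bottom-up and lists every multi-word group, innermost first.

```python
def constituents(node):
    """Return (phrase, groups) for a binary tree of words.

    A leaf is a plain string; an inner node is a (left, right) tuple.
    `groups` lists every multi-word constituent, innermost first.
    """
    if isinstance(node, str):
        return node, []
    left, right = node
    left_text, left_groups = constituents(left)
    right_text, right_groups = constituents(right)
    phrase = left_text + " " + right_text
    return phrase, left_groups + right_groups + [phrase]


# "моего" + "рождения" form a group, "страна" joins it, the sentence closes the tree.
tree = ("Это", ("страна", ("моего", "рождения")))
sentence, groups = constituents(tree)
for group in groups:
    print(group)
```

A recursive neural network follows the same bottom-up traversal, but instead of concatenating strings it combines the vectors of the two children into a vector for their parent.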

How can images be parsed by analogy with recursive text representations? One can argue that a similar principle of compositionality exists for images. Consider the image:

Let us consider some segments of the image: the building, the cone-shaped roof, the row of windows, the individual windows. In this order they describe a "nested" recursive structure with which the whole building can be described. There are also objects parallel to the building: people, trees and grass. Thus an image of, say, a house can be represented as a tree structure in which nodes at the lower levels are constituents of their ancestor nodes, exactly as in the tree structures of natural language sentences.

The algorithm based on recursive neural networks by Socher et al. achieves a quality score of 78.1%. The application area of such algorithms is scene recognition (a field of image analysis). The source code and datasets can be viewed here.

An assignment. For a closer look at deep learning applied to NLP tasks, the Stanford site offers an assignment: implement a simple window-based named entity recognizer (Named Entity Recognition, or NER). The assignment description is here. The Java starter code and the training set are here. The training set is annotated with persons, for example:

Franz PERSON
Fischler PERSON
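To give a feeling for what "window-based" means, here is a small Python sketch of how such a training file could be turned into training examples. It is not the Stanford starter code (which is in Java): the function names, the `<pad>` token, the window size and the `O` label used for the non-person token "said" are assumptions made for this illustration. Each example pairs a token's label with the window of tokens around it.

```python
def read_tokens(lines):
    """Parse 'token LABEL' lines into (token, label) pairs, skipping other lines."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def windows(pairs, size=1, pad="<pad>"):
    """Build (window, label) examples with `size` tokens of context on each side."""
    tokens = [pad] * size + [token for token, _ in pairs] + [pad] * size
    examples = []
    for i, (_, label) in enumerate(pairs):
        # tokens[i + size] is the center token of the window
        examples.append((tokens[i:i + 2 * size + 1], label))
    return examples


# The first two lines come from the example above; "said O" is a made-up line.
pairs = read_tokens(["Franz PERSON", "Fischler PERSON", "", "said O"])
examples = windows(pairs, size=1)
for window, label in examples:
    print(window, label)
```

In the actual assignment each token in the window would be mapped to a word vector and the concatenated vectors fed into a neural network classifier.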

The starter code has two modes: actual execution and submitting the solution to the server. Clearly, of these two modes we are interested in the first one.

Quite a lot of material on Deep Learning can be found on Richard Socher's site, including the application of this approach to sentiment analysis.

Thursday, October 3, 2013

Why I started learning (j)ruby

I have been pretty comfortable with Perl for a while in my programming life, but at some point I realized that what Dijkstra says about the impact a programming language has on a programmer's mind seems to hold. I.e. if you code in the same language for long enough, you will eventually look at every problem at hand through the prism of what your programming language has to offer to solve it. By this I mean data structures, debugging, dumping data contents to stdout, references, copying, working with file encodings, the web and so on. I don't want to stretch this too far and suggest that even the way a for or for each loop works affects one's mind; I just want to say that practicing other languages for the same tasks can be quite useful. This still assumes that each language has its strengths, and in the case of Perl these are certainly regular expressions.

This post won't compare Perl and Ruby or Ruby to any other language. I just want to note along the way the Ruby features that I find the most interesting.

  # build an array of 10 random floats in [0, 1)
  probabilities = Array.new(10) { rand }
  # print them on one line, each followed by a space
  probabilities.each { |p| print "#{p} " }

This prints:

0.972042584650313 0.148158594901043 0.109142777878581 0.825619772397228 0.177120402897994 0.411135204463207 0.0448148166075958 0.996025937730191 0.143679780727901 0.311015907725463

That is, with just two lines of essentially functional code you can create an array of 10 random real numbers between 0 and 1 and print them to stdout.

Another compelling feature of ruby is that it can be turned into jruby, and then the whole mass of java libraries becomes available at your scripting fingertips. Supposing that you have a text file with some categories separated by semicolons, you can load them into guava's ArrayListMultimap:

# using guava 13.0.1 in jruby
require 'java'
require '/home/user/.m2/repository/com/google/guava/guava/13.0.1/guava-13.0.1.jar'

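# fName: path to a text file with semicolon-separated fields, one record per line
def loadCategories(fName)
  myCategoryMultimap = com.google.common.collect.ArrayListMultimap.create
  # File.foreach closes the file once all lines have been read
  File.foreach(fName) do |line|
    # use the "Category=..." part of the line as the multimap key
    category = line[/Category=[^;]+/]
    myCategoryMultimap.put category, line
  end
  return myCategoryMultimap
end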

To summarize so far, two features: functional style of writing code and java-friendliness make (j)ruby a compelling next language to learn if you come from the scripting / Java world.

P.S. If you are in Finland around Helsinki you might be interested in Helsinki Ruby Brigade where the sessions have been pretty technical and interesting.

Monday, September 23, 2013

Fixing X issues with jconsole and jvisualvm under ubuntu

This is merely a technical post describing how to solve issues with running the aforementioned jdk tools on ubuntu without an X server installed.

Sometimes you need to run X based apps on ubuntu servers that have neither graphical libraries nor a GUI installed.

The easy way to check whether your ubuntu server is missing any libraries is to run the command recommended on stackoverflow.com:



jvisualvm -J-Dnetbeans.logger.console=true


This command will output the names of the shared libraries that are required to run the command but are missing. For example, these libraries could be libxrender1, libxtst6 and libxi6. Their names may also be printed in the .so form, like so: libXrender.so.1.

In order to install them you can run:



sudo apt-get install libxrender1
sudo apt-get install libxtst6
sudo apt-get install libxi6


After installing a library, run the jvisualvm command above again to see if there are any libraries still missing.

If all libraries are in order, jvisualvm should start, given that you have connected to your ubuntu server with the ssh -X command line parameter, which forwards the GUI of the graphical command to your client machine.

Have fun monitoring!


Saturday, September 7, 2013

Solr usability contest: make Apache Solr even cooler!

In August I took part in the Solr Usability Contest run by Alexandre Rafalovitch, author of the Apache Solr for Indexing Data How-to book from Packt.

As I have already told Alexandre (@arafalov), it was a great idea to launch a contest like this. While Solr / Lucene mail-lists serve as a direct way of solving particular problems and Apache jira is a way of doing some feature requests and bug submissions, it is great to sometimes take a step back and have a look at a larger perspective of features / limitations / possible improvements and so on.

The following three are the winning suggestions of yours truly:

 

On atomic updates

It so happened that we had been evaluating the new, sexy-sounding atomic updates feature and found out that it wasn't easy to enable. To actually make use of the feature, we essentially would have needed to make *all* fields stored. For the past several years I have been "fighting" against storing fields unless really necessary. Avoiding stored fields where possible is one of the standard performance suggestions, because it helps to avoid extra disk seeks, and not everyone has SSD disks installed on their Solr servers. Having written some Lucene-level code some time ago, I could make a wild guess as to why all fields need to be stored for atomic updates. Essentially, upon an atomic update Solr has to retrieve the existing document with all its source values, update a value (or values) and push the document back into persistent storage (the index). To my taste this describes a bulk update. There is one major advantage of atomic updates (given that all fields are stored): savings on network traffic. Indeed, instead of submitting an entire document with a couple of updated fields, you can send only the fields with new values along with the document id. There are other cool features, like setting a previously non-existent field on a document or deleting an existing field. All of these will surely make the atomic update feature appealing to some folks. You will find real examples of how to use the atomic updates feature in Alexandre's book. So go and get your copy now.
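To show what "send only the fields with new values" looks like in practice, here is a minimal Python sketch that builds the JSON body of an atomic update. The `atomic_update` helper, the document id and the field names `price` and `tags` are made up for this illustration, and `id` is assumed to be the unique key of the schema; the `set` and `add` modifiers are the ones Solr 4 uses for atomic updates.

```python
import json


def atomic_update(doc_id, set_fields=None, add_fields=None):
    """Build one atomic update document: the id plus only the changed fields."""
    doc = {"id": doc_id}
    for name, value in (set_fields or {}).items():
        doc[name] = {"set": value}  # replace the value; None removes the field
    for name, value in (add_fields or {}).items():
        doc[name] = {"add": value}  # append to a multi-valued field
    return doc


# Hypothetical document and fields, for illustration only.
payload = json.dumps([
    atomic_update("doc-42", set_fields={"price": 19.99}, add_fields={"tags": "sale"})
])
print(payload)
```

The resulting string would be posted to the update handler of a core as JSON; the exact URL depends on your installation.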

In the course of reindexing our data in solr4 we have discovered a lot of improvements, one of them being index compression. The lossless compression algorithm used in Lucene / Solr 4 is lz4, which has made the index super compact. In our use case the index takes 20G in solr4 vs 100G in solr3, which is simply amazing. By the way, the algorithm has the property of fast decompression (a fast decoder), which makes it an ideal fit for online use.

In light of the compactness of the index we are still considering evaluating the atomic updates feature, from three perspectives:
  • traffic savings
  • speed of processing
  • index size increase vs storing only necessary fields

On interactivity of Solr dashboard

As we started evaluating the goodies of Solr4, I was positively surprised by how usable and eye-catching the Solr dashboard (admin) has become. Above all comes usability (it is a usability contest, after all) and, oh yes, it has become usable for the first time. In the Solr 1.4.1 and 3.4 days we merely consulted the cache statistics page and the analysis page occasionally. In Solr4 one can now administer the cores directly from the dashboard, optimize indices, study the frequency characteristics of text data and so on. This is of course on top of the features already mentioned, like field analysis and monitoring the cache stats.

But.. something is still missing. We are running several shards with frontend solrs, and for us it has always been a bit of a pain to monitor our cluster. We intentionally do not use SolrCloud because of our requirements for logical sharding. Some time ago I blogged about Solr on FitNesse, which helped to see the state of the cluster with just one click. We have also set up RAM monitoring with graphite, but wait, all of these tools are external to Solr. It would be really great to be able to integrate some of them directly into the Solr dashboard. So we hope this will move in the direction of "plug-n-play" interfaces that would allow implementing plugins for the Solr dashboard. In the meantime, good ol' jvisualvm is a tool that helps to monitor a heavy shard during soft-commit runs:



On scripting capability

I also dared to fantasize about what could open Solr up to a wider audience, especially people who are not dreaming of reading and changing the Solr source code. This can be enabled with a scripting capability. By this I mean a way of hacking into Solr via external interfaces in the language that best fits your task and skillset (ruby or scala or some other JVM friendly language, or perhaps something outside the JVM family altogether). The best thing this would offer is an opportunity to experiment quickly with Solr search: changing the runtime order of analyzers or search components, affecting scoring, introducing advertisement entries, calculating some analytics, refining facets etc. While some of these may sound far-fetched, the feature in general may make it possible to change core Solr behaviour without heavy-duty source code hacking and recompilation (although personally I would recommend diving into that anyway).

 

Concluding remarks

I would like to conclude that Solr4 has brought lots of compelling features and improvements (the extremely great soft-commit feature, for example), and we are happy to see this blazingly fast search platform evolve so fast. In these three usability suggestions I have tried to summarize what would be great to do to make the platform even more compelling and cool.

yours truly,


Friday, September 6, 2013

Monitoring Solr with graphite and carbon


This blog post requires graphite, carbon and python to be installed on your *ux. I'm running this on ubuntu.

http://graphite.wikidot.com/
https://launchpad.net/graphite/+download


To set up monitoring of the RAM usage of Solr instances (shards) with graphite you will need two things:

1. backend: carbon
2. frontend: graphite

The data can be pushed to carbon using the following simple python script.

In my local cron I have:

1,6,11,16,21,26,31,36,41,46,51,56 * * * * \
   /home/dmitry/Downloads/graphite-web-0.9.10\
          /examples/update_ram_usage.sh

The shell script is a wrapper for getting data from the remote server + pushing it to carbon with a python script:

scp -i /home/dmitry/keys/somekey.pem \
    user@remote_server:/path/memory.csv \
    /home/dmitry/Downloads/MemoryStats.csv

python /home/dmitry/Downloads/graphite-web-0.9.10/examples/solr_ram_usage.py

An example entry in the MemoryStats.csv:

2013-09-06T07:56:02.000Z,SHARD_NAME,20756,33554432,10893512,32%,15.49%,SOLR/shard_name/tomcat
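
Carbon's plaintext protocol expects one "metric.path value timestamp" line per data point, with the timestamp in seconds since the Unix epoch. As a minimal sketch of how one entry of this file maps onto such a line (the helper name to_carbon_line and the "system" prefix argument are only illustrative; only the three fields that the script in this post reads are given names):

```python
import calendar
from datetime import datetime


def to_carbon_line(csv_entry, prefix="system"):
    """Turn one MemoryStats.csv entry into a carbon plaintext line."""
    fields = csv_entry.strip().split(",")
    # Only the fields used in this post are named; the others are ignored here.
    timestamp, shard_name, percent = fields[0], fields[1], fields[5]
    # The trailing "Z" marks UTC, so convert with timegm (UTC)
    # rather than mktime (which assumes local time).
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    epoch = calendar.timegm(parsed.timetuple())
    return "%s.%s %s %d" % (prefix, shard_name, percent.rstrip("%"), epoch)


entry = ("2013-09-06T07:56:02.000Z,SHARD_NAME,20756,33554432,"
         "10893512,32%,15.49%,SOLR/shard_name/tomcat")
print(to_carbon_line(entry))  # prints: system.SHARD_NAME 32 1378454162
```

Keeping the conversion in a small pure function makes it easy to check the output against a known entry before anything is sent over the socket.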

The command to produce a memory stat on ubuntu:

ssh user@remote_server pidstat -r -l -C java | grep /path/to/shard


The python script parses the csv file (you may want to define your own format for the input file; I'm giving this one as an example):

import calendar
import sys
import time
from datetime import datetime
from socket import socket

CARBON_SERVER = '127.0.0.1'
CARBON_PORT = 2003

# optional first argument: seconds to sleep after sending (default 60)
delay = 60
if len(sys.argv) > 1:
  delay = int(sys.argv[1])

sock = socket()
try:
  sock.connect((CARBON_SERVER, CARBON_PORT))
except OSError:
  print("Couldn't connect to %(server)s on port %(port)d, is carbon-agent.py running?" % {'server': CARBON_SERVER, 'port': CARBON_PORT})
  sys.exit(1)

filename = '/home/dmitry/Downloads/MemoryStats.csv'

with open(filename, 'r') as f:
  lines = [line.strip() for line in f]

print(lines)

lines_to_send = []

for line in lines:
  # skip blank lines and the header row
  if not line or line.startswith("Time stamp"):
    continue
  shard = line.split(',')
  # the time stamp ends in "Z" (UTC), so convert it with timegm, not mktime
  timestamp = calendar.timegm(datetime.strptime(shard[0], "%Y-%m-%dT%H:%M:%S.%fZ").timetuple())
  # shard[1] is the shard name, shard[5] the percentage to plot (e.g. "32%")
  lines_to_send.append("system.%s %s %d" % (shard[1], shard[5].replace("%", ""), timestamp))

# all lines must end in a newline
message = '\n'.join(lines_to_send) + '\n'
print("sending message\n")
print('-' * 80)
print(message)
print()
# sockets carry bytes, so encode the text before sending
sock.sendall(message.encode('utf-8'))
time.sleep(delay)

After the data has been pushed you can view it in the graphite web UI. The good thing about graphite compared to jconsole or jvisualvm is that it persists data points, so you can view and analyze them later.
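
Besides the UI, graphite-web exposes a render API that can return the stored points as JSON. A small sketch of building such a URL follows; the host and port (127.0.0.1:8080) and the helper name render_url are assumptions for illustration, so adjust them to wherever your graphite-web is served:

```python
from urllib.parse import urlencode


def render_url(base_url, target, since="-1h", fmt="json"):
    """Build a graphite render API URL for one metric."""
    query = urlencode({"target": target, "from": since, "format": fmt})
    return "%s/render?%s" % (base_url.rstrip("/"), query)


# Assumed location of graphite-web; the metric name matches the
# "system.<shard name>" paths pushed by the script above.
url = render_url("http://127.0.0.1:8080", "system.SHARD_NAME")
print(url)
```

Fetching that URL with any HTTP client should give the data points for the last hour, which is handy for quick checks from the command line.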




For Amazon users, an alternative way of viewing the RAM usage graphs is CloudWatch, although at the time of writing it stores only two weeks' worth of data.

Sunday, August 25, 2013

ReVerb: Open Information Extraction


The previous post, on semantic relations between words, introduced the open source tool word2vec, which makes it possible to build or discover what are, in a sense, semantic networks of words and phrases.

In this post we look at a system that finds connections between a query and documents through a triple: Object1-Relation-Object2, where the objects {Object1, Object2} are given as a noun or a semantic class of a noun, and the Relation as a verb or a case-type link.

The system is called ReVerb (from Relation and Verb). Its source code is available on github. The system supports English only.
Attempts to represent knowledge as such triples come up fairly often (for example, this approach was mentioned in Gerhard Weikum's talk at RuSSIR'2011). The first impression of the approach is that it takes too narrow a view of semantics and that nothing worthwhile can be done with it. That is not quite true, however. Often, before tackling a computational linguistics task (be it machine translation, sentiment analysis or information retrieval), you need to take the first steps in studying the data you have. These steps may include building frequency tables of words or phrases (N-grams), identifying the keywords that represent a document, and so on. By the way, many of these initial steps can be done efficiently with Linux tools such as cat, cut, grep, awk, sed, wc (from word count, not what you might think) and others. So, by using existing text processing tools, you can solve the initial tasks without writing a single line of code!

A demo of the system extracting knowledge from 500 million web pages is available here. What is remarkable about it?
For example, you can get a list of African countries with the query:

Argument1: type:Country
Relation: is located in
Argument2: Africa

The system returns a list of 45 states, apparently those about which something is published on the Web (officially there are 54 recognized sovereign states in Africa, according to Wikipedia).
You can also ask general questions, for example which actors starred in which films:

Argument1: what/who
Relation: starred in
Argument2: what/who

For example, Barbra Streisand starred in "Yentl", Jessica Alba in "Sin City", and Johnny Depp in "Pirates of the Caribbean".
Using the case-type link "symbol of", we get a list of the symbols of different countries.

Argument1: what/who
Relation: symbol of
Argument2: type:Country

For Scotland it is the unicorn.
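
The three-slot queries above can be mimicked on a tiny scale. The following is only a toy sketch of matching Argument1-Relation-Argument2 patterns, not ReVerb's actual implementation: the triples are the examples from this post, while the TYPES table and the function names are made up for illustration.

```python
# Toy triples taken from the examples in this post.
TRIPLES = [
    ("Barbra Streisand", "starred in", "Yentl"),
    ("Jessica Alba", "starred in", "Sin City"),
    ("Johnny Depp", "starred in", "Pirates of the Caribbean"),
    ("unicorn", "symbol of", "Scotland"),
]

# Hypothetical type annotations, standing in for a real type system.
TYPES = {"Scotland": "Country"}


def slot_matches(slot, value):
    """None plays the role of "what/who"; "type:X" checks the value's type."""
    if slot is None:
        return True
    if slot.startswith("type:"):
        return TYPES.get(value) == slot[len("type:"):]
    return slot == value


def query(arg1=None, relation=None, arg2=None):
    """Return all triples matching the three query slots."""
    return [
        t for t in TRIPLES
        if slot_matches(arg1, t[0])
        and (relation is None or relation == t[1])
        and slot_matches(arg2, t[2])
    ]


print(query(relation="symbol of", arg2="type:Country"))
print(query(relation="starred in"))
```

The first query returns the Scotland triple and the second returns the three "starred in" triples, mirroring the demo queries on a handful of facts.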

Documents with such meta-information can be indexed and searched with, for example, Apache Solr. But that is a separate story.

Unsupervised machine learning for determining the meaning of words: word2vec, an open source tool from Google

Cross-post of my post from http://mathlingvo.ru/

The Google Open Source Blog has announced a new open source tool, word2vec. The Google researchers claim that it can learn the meaning of words merely by reading huge volumes of data. The tool applies "distributed representations" of text data to discover relations between concepts, all by means of unsupervised machine learning based on neural networks.
Interestingly, the model places similar countries close together, and likewise similar capitals. Such relations emerge automatically while the algorithm is being trained.
The source code has a good license: Apache License 2.0, which allows you to modify it without publishing the changes and to embed it in commercial applications as well.
The article also mentions Deep Learning, a method that has become popular lately and gives results an order of magnitude better than previous methods. By the way, most winners of machine learning competitions on kaggle (yours truly also had the honour of taking part) use either ensembles of methods based on Decision Trees or Deep Learning methods.
// ./demo_word.sh
Enter word or sentence (EXIT to break): machine translation

Word: machine  Position in vocabulary: 799

Word: translation  Position in vocabulary: 1206

          Word       Cosine distance
------------------------------------------------------------------------
          mmix              0.485542
    translator              0.484659
          msil              0.483476
        manual              0.479708
        turing              0.462978
  introduction              0.458771
      readable              0.449272
    unabridged              0.448343
      machines              0.447570
       rosetta              0.443270
      compiler              0.438949
    dictionary              0.437040
  translations              0.436334
    translated              0.429008
 specification              0.422286
    typewriter              0.422246
           awk              0.420415
       version              0.417623
   interpreter              0.415583
        itrans              0.414944
         tools              0.413505
     annotated              0.413150
        lincos              0.411448
      abridged              0.411152
          text              0.407197
      language              0.404664
        freedb              0.403896
       vulgate              0.402863
         xpath              0.401687
    calculator              0.397689
        enigma              0.394239
       klingon              0.394041
       opencyc              0.393687
       systran              0.391636
       multics              0.391623
           kli              0.389196
           apl              0.386948
      editions              0.383799
        skybox              0.383791
         algol              0.383730
Enter word or sentence (EXIT to break): weather

Word: weather  Position in vocabulary: 2693

          Word       Cosine distance
------------------------------------------------------------------------
          warm              0.634004
      humidity              0.611526
         humid              0.605240
       summers              0.594220
 thunderstorms              0.591256
      snowfall              0.590065
 precipitation              0.582246
       climate              0.580110
       winters              0.577238
      rainfall              0.570583
         rainy              0.566492
below_freezing              0.566140
  rainy_season              0.561857
        cooler              0.558795
         winds              0.558283
        colder              0.557494
  cold_winters              0.545980
           wet              0.545650
        frosts              0.539969
         drier              0.539645
      climatic              0.537766
        warmer              0.535417
        winter              0.532653
  warm_summers              0.530857
         el_ni              0.530692
  temperatures              0.528191
relative_humidity           0.527605
        summer              0.527042
  mild_winters              0.526249
       monsoon              0.524260
   trade_winds              0.523211
       daytime              0.523093
      seasonal              0.520377
           dry              0.519703
    hurricanes              0.517527
     subarctic              0.514771
    visibility              0.514740
     snowfalls              0.513660
     monsoonal              0.513538
   hot_summers              0.513050
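
The scores in the column headed "Cosine distance" appear to be cosine similarities (larger means closer), judging by the ordering of the lists. As a hedged sketch of how such a ranking can be computed, here is a plain-Python version on made-up two-dimensional vectors; real word2vec vectors have hundreds of dimensions, and the toy values and function names below are purely illustrative:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, top_n=3):
    """Rank all other words by cosine similarity to the given word."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, invented for this example only.
toy_vectors = {
    "weather": [1.0, 0.2],
    "climate": [0.9, 0.3],
    "compiler": [0.1, 1.0],
}

for other, score in nearest("weather", toy_vectors):
    print("%12s %.6f" % (other, score))
```

With these toy values "climate" ranks above "compiler" for the query "weather", which is the same kind of ordering the demo prints.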

Saturday, August 17, 2013

What is it like to study mathematics at Saint Petersburg State University? (my answer on quora.com)

As it turns out, not all of my readers are on quora. So because of this and in the spirit of posting non-technical blogs too, I'm reposting an answer I gave to the question there: "What is it like to study mathematics at Saint Petersburg State University?"




I studied math and other subjects (like physics and computer science) from 2002 to 2005 at Saint Petersburg State University (SPbU) in a Specialist program (comparable to a Master's degree).

My experience was constantly comparative in the beginning: as I got used to the teaching style of SPbU professors and docents, I kept viewing it side by side with the style of the state university in my home city (which at that time had a population 10x smaller than Saint Petersburg).

So perhaps I can approach answering your question from the perspective of comparison.

1. (a) In my home university we were taught to learn long theorem proofs in a fashion that would enable a student to easily reproduce them at a (pre-)exam. I remember only one occasion when a theorem was so long that learning all the low-level details was impossible (no matter how many days I tried), so really deriving the proof was the only option. Of course you would learn the fundamental constructs and apparatus for deriving the proof, so you wouldn't be working it out completely from scratch.

  (b) At SPbU, in contrast, you wouldn't be expected to learn the entire theorem proof at all, but instead to be ready to derive it. Some of the practical tasks given alongside the theoretical proofs required the same: derive a solution as you go. This was the first thing that struck me as very different.

2. (a) In my home university I was expected to learn about 80% of the definitions, theorem formulations and their proofs.
    (b) At my first exam on Control Theory at SPbU, the professor told me that a student should learn about 35% (or even less): the _most_ important theorem formulations and their proofs plus the _most_ important definitions. The rest is derivable, as explained in (1) (b).

3. (a) The highlight of the fun part of studying at my home university that comes to mind: once a professor of mathematical analysis came to the class and asked: "Do you want theory and tasks today or talk about life?" "Life" was the answer, and the first question from the audience was: "Girls of which country were the most beautiful?".

   (b) At SPbU there were all sorts of surprises that opened students' minds or made studying more fun. One example: during one of the exams on electrodynamics (a complex theory with integral calculus, Lie algebra and so on), the professor said 10 minutes after the start: "Those who would like a C mark (3 or "satisfactory" in Russia) can get it right now without answering their questions." A few people rushed towards him and left the exam room. About 10 minutes later he continued: "Those who would like a B mark (4 or "good" in Russia) can get it now, but you have to show me what you have written." Some more people rushed towards him. 15 minutes later (and a few drops of sweat on our brave necks) he said: "The rest just get A's (5 or "excellent", the best mark), because you have survived and didn't know in advance what to expect." What I learnt was that it is not always necessary to be an egghead and learn everything in order to be ready to stand up. Sometimes it is important to be a good person, to be brave and to keep courage in your heart. That may lead to more adventures and opportunities in the future!

With a few exceptions, I would say that studying math was both fun and rather instructive, in that it developed some fundamental skills of reasoning and of attacking a problem at hand without having trained specifically to solve that class of problems before. That is what you need in real life, be it further PhD studies or solving other complex problems, including those occurring in life.

Saturday, July 20, 2013

Controlled reflection and template methods in java

This was in drafts for a long time and since then I have lost the original context. But I thought I would make it compilable and let you, the reader, decide whether you find any use for it.

Suppose you have a base class A. Suppose also that you need to instantiate two classes B and C, sub-classes of A, with the same configuration data. One straightforward way to achieve this in java is to use constructors:

ConfigData configData = setConfigData();
B b = new B(configData);
C c = new C(configData);

The question is: is there a way to keep everything in just one method of the base class that governs setting the config data and returns an instance of a subclass of A (B or C)?

Yes, there is! One way to do this is to implement a constructor in the base class A.

Another method is to use "templated" reflection. By "templated" I here refer to Java generics. In order to make sure we get the proper class instances, we should limit the accepted classes with <T extends A>:

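public class A {
    // declaration updated thanks to Pitko's comment below
    public static <T extends A> T configureA(Class<T> aTemplateClass) {
        ConfigData configData = setConfigData();
        try {
            // look up the constructor that takes a ConfigData and invoke it reflectively
            return aTemplateClass.getConstructor(ConfigData.class).newInstance(configData);
        } catch (ReflectiveOperationException e) {
            // e.g. the subclass has no public constructor accepting ConfigData
            throw new RuntimeException("Failed to instantiate " + aTemplateClass.getName(), e);
        }
    }

    private static ConfigData setConfigData() {
        return new ConfigData("configParam1Value");
    }
}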

public class ConfigData {
    public String configParam1;

    public ConfigData(String _configParam1) {
        configParam1 = _configParam1;
    }

    /* (non-Javadoc)
     * @see java.lang.Object#toString()
     */
    @Override
    public String toString() {
        return "ConfigData [configParam1=" + configParam1 + "]";
    }
}

The child classes will look alike (in practice they will have different implementation logic), so I illustrate with just the B subclass:

public class B extends A {
    ConfigData configData;

    public B(ConfigData _configData) {
        this.configData = _configData;
    }

    /**
     * @return the configData
     */
    public ConfigData getConfigData() {
        return configData;
    }
}

Now we can say:

B b = A.configureA(B.class);
C c = A.configureA(C.class);
  
System.out.println("Class B:" + b.getConfigData());
System.out.println("Class C:" + c.getConfigData()); 

// which outputs:
Class B:ConfigData [configParam1=configParam1Value]
Class C:ConfigData [configParam1=configParam1Value]

This post illustrates the usage of reflection and generics in java. We were able to access child classes in the base class using "controlled" reflection, that is, we allowed only subclasses of the base class to be passed into the reflective method. We use generics to return proper subclass instances from the base class.

Saturday, June 15, 2013

Solr on FitNesse

This year's Berlin Buzzwords conference was as intense as last year's. For me, in particular, it was heavier on the discussion side (I hooked up with Robert Muir to discuss the "deduplication of postings lists" in Lucene and with Ted Dunning to speak some Russian), but some of the talks were interesting enough for me to try something practical immediately.

Dominik Benz of Inovex presented the FitNesse tool.

In its own words, FitNesse is "the fully integrated standalone wiki and acceptance testing framework". Dominik described their experience with integrating it and said that the upfront investment is almost nil and that the tool suits non-technical people. At this point I can confirm the former point, while the latter needs more investigation.

As the presentation concentrated quite heavily on how one would go about integrating FitNesse into the cycle of a Big Data project, I got curious whether this tool would be suitable for some of the tasks on the Solr side. I have also compiled a presentation of my own that summarizes what follows (some of the slides were borrowed from Dominik's slides).
After a bit of thinking I decided to implement a FitNesse fixture that checks the health of a solr cluster. Sometimes, when the cluster is big (say, tens of nodes), someone may be overloading it with posted data or with queries. Some of the nodes (with solr shards) can go down or become unresponsive. It would be nice, in a wiki setting, to be able to tell at a glance whether the cluster is up and running or is short of CPU / RAM etc.

I'll present a quite simple fixture for checking solr health, which took me roughly 15 minutes to implement. I hope it can be useful for you too.

Here is what the FitNesse UI looks like after executing the fixture:



The Java code:

package example;

import fit.ColumnFixture;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.SolrPingResponse;

import java.io.IOException;
import java.net.MalformedURLException;

/**
 * Column fixture that reports whether the Solr shard at shardURL answers a ping.
 *
 * @author dmitry
 */
public class SolrShardsFixture extends ColumnFixture {

    private String shardURL;
    private String shardName;

    public boolean isShardUp() {
        if (shardURL == null || shardURL.isEmpty())
            throw new RuntimeException("shardURL url is empty");
        try {
            SolrServer serverTopic = new CommonsHttpSolrServer(shardURL);
            SolrPingResponse solrPingResponse = serverTopic.ping();

            if (solrPingResponse.getStatus() == 0)
                return true;

        } catch (MalformedURLException e) {
            throw new RuntimeException("Failed to create SolrServer instance: " + e.getMessage());
        } catch (IOException e) {
            throw new RuntimeException("Failed to ping the SolrServer instance: " + e.getMessage());
        } catch (SolrServerException e) {
            throw new RuntimeException(e.getMessage());
        }
        return false;
    }

    public void setShardURL(String shardURL) {
        this.shardURL = shardURL;
    }

    public void setShardName(String shardName) {
        this.shardName = shardName;
    }
}

Monday, May 6, 2013

MTEngine: switching UI languages

This is the latest update from our test environment:

We have implemented a feature that respects your browser language. That is, if your browser tells us en-us, we'll show the English version, and if it is ru-ru, the Russian version is loaded.
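
Browsers announce their language preferences in the Accept-Language HTTP header. The post does not show how MTEngine reads it, so the following is only a sketch of one possible way to pick a UI language from that header; the function name, the supported language list and the English fallback are assumptions made for the example.

```python
SUPPORTED = ("en", "ru")  # assumed set of UI languages


def pick_ui_language(accept_language, supported=SUPPORTED, default="en"):
    """Pick the best supported UI language from an Accept-Language header."""
    candidates = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        candidates.append((quality, tag.strip().lower()))
    # Highest quality first; the sort is stable, so header order breaks ties.
    for quality, tag in sorted(candidates, key=lambda c: -c[0]):
        primary = tag.split("-")[0]  # "ru-ru" -> "ru"
        if quality > 0 and primary in supported:
            return primary
    return default


print(pick_ui_language("ru-RU,ru;q=0.9,en-US;q=0.8"))
print(pick_ui_language("en-us"))
```

The first call selects "ru" and the second "en"; a header with no supported language falls back to the default.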



Friday, April 26, 2013

MTEngine: latest developments

Here are the latest developments in our test environment, MTEngine_test:

1. We took a snapshot of sentences from opencorpora.org and are working on pushing these into a new UI feature called "tasks". Each task is one Russian sentence to be translated and rated by the user.
2. The free-form translation feature remains in the UI and is moved into its own tab (screenshot in the Russian version of this message below).

For the production version of MTEngine we have made one improvement: when you register and use the system for the first time, the dictionary entries are looked up in the common dictionary contributed by all our users.

Happy translations!


The Russian version of this message, translated into English:

Fresh developments in the test version of the MTEngine project:

1. We took a dump of sentences from the opencorpora.org project and are working on a new feature called "tasks". Each task is one sentence in Russian to be translated and rated by the user.
2. The feature with free-form translation of users' Russian sentences will live in a separate tab:



We have made an improvement in the production version too: now, when a user registers and makes their first translations, the dictionary entries are taken from the common translation dictionary created by all users of the project.

Successful translations and have a good weekend!


Friday, April 19, 2013

What grammatical challenges prevent Google Translate from being more effective?

Cross-posting my answer to the question in the topic on quora.com [1].

Google is pretty good at modeling language pairs that are close enough. By close enough I mean languages that share many vocabulary units and have similar word order, a similar level of morphological richness and other grammatical features in common.

Let's pick an example of a pair where Google Translate (GT) is good. The round-trip method is one way to verify whether the languages are close enough, at least statistically, for GT:

(these examples are using GT only, no human interpretation involved)

English: I am in a shop.
Dutch: Ik ben in een winkel.
back to English: I'm in a store. (quite ok)

English: I danced into the room.
Dutch: Ik danste in de kamer.
back to English: I danced in the room. (preposition issues)


Let's pick a pair of more unrelated languages (by the way, when we claim the languages are unrelated grammatically, they may also be unrelated semantically or even pragmatically: different languages were created by people to suit their needs at particular moments of history). One such pair is English and Finnish:

Finnish: Hän on kaupassa.
English: He is in the shop.
back to Finnish: Hän on myymälä. (roughly the original Finnish sentence)

This example has the pronoun hän, which in Finnish is not gender specific. It has to be resolved based on a larger context than just one sentence: somewhere before this sentence in the text there should have been a mention of who hän refers to.

To conclude this particular example: Google Translate translates at the sentence level, and that is a limitation in itself, because it makes correct pronoun resolution impossible. Pronouns matter if we want to understand how the objects in a text interact.


Let's pick another example of unrelated languages: English and Russian.

Russian: Маска бывает правдивее и выразительнее лица.
English: The mask is truthful and expressive face. (should have been: The mask can be more truthful and expressive than a face)
back to Russian: Маска правдивым и выразительным лицом. (hard to translate, but the meaning is roughly: The mask being a truthful and expressive face).

To conclude this example: languages with rich morphology, such as Russian, convey grammatical case through word inflection alone and thus require deeper grammatical analysis, which purely statistical machine translation methods lack no matter how much data has been acquired. There are methods that combine rules and statistics.


Another pair and different example:
English: Reporters said that IBM has bought Lotus.
Japanese: 記者は、IBMがロータスを買っていると述べた。
back to English: The reporter said that IBM Lotus are buying.

Japanese has a "recursive syntax" that represents this English sentence as:

Reporters (IBM Lotus has bought) said that.

i.e. the verb is syntactically placed after the subject-object pair of a sentence or a sub-sentence (direct / indirect object).

To conclude this example: there should be a method of mapping syntax structures as larger units of the language, and that should be done in a more controlled fashion (i.e. it is hard to derive from pure statistics).


References
[1] http://www.quora.com/Linguistics/What-grammatical-challenges-prevent-Google-Translate-from-being-more-effective