Wednesday, November 9, 2011

axis2: serialization and deserialization of wsdl2java generated objects

Using axis2's wsdl2java tool and a third-party WSDL, I generated a service stub and supporting classes (data holders). Because the data loaded from the service needed post-processing, one of the data holder objects had to be serialized.

Questions that I had and posted on stackoverflow.com were:

1) is there a standard axis2 tool / approach that can be used for the purpose?

2) since the data holder class does not implement the Serializable interface, what would be the easiest way of serializing the object into XML format, with the ability to restore the original object?

The data binding option -d jaxbri was used, and each field of the class in question is annotated with an @XmlElement tag, e.g.:

@XmlElement(name = "ID", required = true)
protected String id;



Here is how I solved it:

1. The set of Java classes generated by axis2 (client side) includes a class called ObjectFactory. Most of its methods create JAXBElement objects from the values of the data holder's fields.
2. I had to implement a serializable wrapper class, ASerializable, for the data holder, which uses the ObjectFactory to create the JAXBElement objects for all the fields.
3. Some external code uses the wrapper class to create a serializable object and writes it to the output stream.
4. On the receiving end:
4. on the receiving end:

ASerializable aSerializable = (ASerializable) in.readObject();
A a = new A();
a.setID(aSerializable.getID().getValue());

It still looks like extra work to serialize an already-annotated class, but it is better than serializing into some ad-hoc text format and doing manual type checking during deserialization.
A good intro to serialization in Java can be found here.
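To make the approach concrete, here is a minimal, self-contained sketch of the pattern. `A`, `ObjectFactory`, and `createAID` stand in for the wsdl2java-generated classes and are hypothetical names; the real generated `ObjectFactory` returns `JAXBElement` objects, which are themselves Serializable, and the tiny `Holder` class plays that role here:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Stand-in for JAXBElement, which the generated ObjectFactory would produce;
// the real javax.xml.bind.JAXBElement is itself Serializable.
class Holder<T> implements Serializable {
    private final T value;
    Holder(T value) { this.value = value; }
    T getValue() { return value; }
}

// Stand-in for the wsdl2java-generated data holder (not Serializable itself).
class A {
    private String id;
    public String getID() { return id; }
    public void setID(String id) { this.id = id; }
}

// Stand-in for the generated ObjectFactory; createAID is a hypothetical name.
class ObjectFactory {
    Holder<String> createAID(String value) { return new Holder<>(value); }
}

// The serializable wrapper: it uses the ObjectFactory to wrap each field.
class ASerializable implements Serializable {
    private final Holder<String> id;
    ASerializable(A a, ObjectFactory factory) { this.id = factory.createAID(a.getID()); }
    Holder<String> getID() { return id; }
}

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        A a = new A();
        a.setID("42");

        // Sender side: wrap the data holder and write it to a stream
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bos);
        out.writeObject(new ASerializable(a, new ObjectFactory()));
        out.flush();

        // Receiving end: read the wrapper back and restore the data holder
        ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()));
        ASerializable aSerializable = (ASerializable) in.readObject();
        A restored = new A();
        restored.setID(aSerializable.getID().getValue());
        System.out.println(restored.getID()); // prints 42
    }
}
```

The wrapper adds one class per data holder, but in exchange the standard Java serialization machinery handles all the type bookkeeping.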

Saturday, October 1, 2011

First international publication

Celebration!



My first international publication has shown up on DBLP. It was something I wanted to achieve as an intermediate goal in my academic career. In a way, this gives some visibility to what I have been doing for around four years: NLP (Natural Language Processing) and, more precisely, Machine Translation. Before going international, I had five publications in Russian scientific journals and conference proceedings.

At the same ICSOFT'11 conference where this publication was presented as a poster, I had the honor of serving as the knowledge-based systems track chair. Both presenting my work and leading the session were exciting. I would also like to thank ICSOFT's organizing committee for the participant's grant that made my attendance possible. Special thanks to Sergio Brissos.


ICSOFT 2011 Conference


ICSOFT is not strictly an NLP conference. However, it has a knowledge-based systems track, where rather relevant NLP-related topics are listed:

Ontology Engineering
Decision Support Systems
Intelligent Problem Solving
Expert Systems
Reasoning Techniques
Knowledge Acquisition
Knowledge Mining
Machine Learning
Natural Language Processing
Human-Machine Cooperation

Two publications I remembered



Among these, the ontology engineering articles were strong. One of them (Barbara Furletti, Franco Turini: Mining Influence Rules out of Ontologies, see here) was about mining an ontology and reasoning rules from the data of the oldest Italian bank. It is exceptional, in my view, when commercial data (even old data) is given away to researchers.

Another article I remembered, not directly related to NLP, was by Manolya Kavakli et al. (Manolya Kavakli, Tarashankar Rudra, Manning Li: An Embodied Conversational Agent for Counselling Aborigines - Mr. Warnanggal). One of its challenges is providing health assistance via a computer-based system to Australian Aborigines, people who are not very motivated, are poor, and steal food and other things. Another challenge is dealing with the roughly 500 languages that these Aborigines speak. There is potential for interesting NLP problems here.

Why would I recommend going to a conference not directly related to your research topic?


As a preamble, I should mention that, in a way, whatever we do in NLP materializes in the form of program code. Therefore our work qualifies for a software engineering conference as well as for an NLP one.

Going to a strictly software engineering conference can give you the following benefits:
* concrete questions about your work in the light of software development practices. Some NLP researchers may think it is not very important to make their software configurable, reusable, or performant. At the end of the day, this matters a lot, especially if you plan to turn your work into an industrial-grade solution

* if you do a poster presentation, people can give you good insights into the quality of your poster and what can be improved. There were two extreme cases at the conference: one with the entire article text pasted onto the poster, and another with a couple of boxes and an arrow between them. The audience reacted predictably: the first poster drew almost no attention, while the second gathered the majority of the audience.

* you can pause and reflect a little bit: are you doing something valuable? Do you like what you do?


A couple of words about Spain, where ICSOFT'11 took place. +45°C is something I experienced for the first time; visiting the royal palace, the Alcázar of Seville, was extremely interesting; and, of course, partying with conference peers over Spanish wine and tapas made the event memorable.

Enjoy your research life and publish your work as soon as possible.

Saturday, August 27, 2011

Evaluating a machine translation system

There is a machine translation system that translates from Russian into English.
A manual evaluation of the system's output is needed.

Target audience: anyone who wants to make machine translators (examples: translate.google.com, translate.ru) better, and people interested in applied linguistics (Natural Language Processing). Programming skills are NOT required.

Contact: dmitry.kan[+AT+]gmail.com, twitter: DmitryKan

The task: get a batch of sentences from me (volume: as much as you are willing to take on).

The batch contains sentences in Russian and their expert translations into English.

Run the Russian sentences through the system. Review their English translations manually.

Compile a list of the words that were not found (this is easy: words that were not found are output in Russian). Send me the list of words, and I will add them to the machine translation system.

As your output, split the batch into three groups of sentences:
1. translated well
2. translated acceptably (the English phrase makes clear what the Russian one said)
3. translated poorly (the English phrase does not make clear what the Russian one said)

This is volunteer work. The starting bonus: a conference or journal paper co-authored with me, if that interests you. If not -- all possible publicity for you.

Further on: if we work well together, I will offer you work on linguistic projects (programming skills required).

Links for those interested
[1] http://www.slideshare.net/dmitrykan/icsoft-2011-51cr
[2] http://www.slideshare.net/dmitrykan/automatic-build-of-semantic-translational-dictionary
[3] http://ufal.mff.cuni.cz/umc/

Interested in rule-based machine translation (RBMT)?

I'm looking for students and activists of rule-based machine translation to help me in the evaluation of my machine translation system from Russian into English. Details by e-mail: dmitry.kan[+AT+]gmail.com (replace the characters from [ to ] with @).


Thursday, July 14, 2011

Interested in machine translation between Russian and English?

Then mark August 15-19, 2011 in your calendar. Web of Data'11 has accepted my poster on machine translation with semantic features; the full paper title is:

Semantic Feature Machine Translation System for Information Retrieval

Some details on the work, from another poster accepted to ICSOFT'11, can be checked here:

Sunday, July 10, 2011

Sample code integrating the AOT morphological lemmatizer in C#

AOT offers its lemmatizer for Russian and English at www.aot.ru. If you need to integrate their COM component into a C# project, read on.

After installing the library with Setup.exe, load Lemmatizer.dll into your C# project. Copy the following method, or its body, for example into your main class:


private static void initAOTMorphoanalyzer()
{
    // Requires "using System.Reflection;" and a COM reference to LEMMATIZERLib
    LEMMATIZERLib.ILemmatizer lemmatizerRu = new LEMMATIZERLib.LemmatizerRussian();
    lemmatizerRu.LoadDictionariesRegistry();

    // Collect all morphological paradigms for the word form "мыла"
    LEMMATIZERLib.IParadigmCollection piParadigmCollection = lemmatizerRu.CreateParadigmCollectionFromForm("мыла", 0, 0);

    Console.Out.WriteLine(piParadigmCollection.Count);

    for (int j = 0; j < piParadigmCollection.Count; j++)
    {
        object[] args = { j };

        // The interop collection is late-bound, so read its Item and Norm
        // properties via reflection
        Type paradigmCollectionType = piParadigmCollection.GetType();
        object item = paradigmCollectionType.InvokeMember("Item", BindingFlags.GetProperty, null, piParadigmCollection, args);

        // Norm holds the normal (dictionary) form of the paradigm
        object norm = item.GetType().InvokeMember("Norm", BindingFlags.GetProperty, null, item, null);
        Console.Out.WriteLine(norm);
    }
}



Result:
2
МЫЛО
МЫТЬ


Wednesday, June 8, 2011

Amazed by Scala #1: objects and compilation

Seven minutes, and here is an object that can be compiled into Java classes:



import scala.actors._
import Actor._

object TopStock {
  val symbols = List("AAPL", "GOOG", "IBM", "MSFT")
  val receiver = self
  val year = 2008

  def main(args: Array[String]) = {
    symbols.foreach { symbol =>
      actor { receiver ! getYearEndClosing(symbol, year) }
    }

    val (topStock, highestPrice) = getTopStock(symbols.length)
    printf("Top stock of %d is %s closing at price %f\n", year, topStock, highestPrice)
  }

  def getYearEndClosing(symbol: String, year: Int) = {
    val url = "http://ichart.finance.yahoo.com/table.csv?s=" +
      symbol + "&a=11&b=01&c=" + year + "&d=11&e=31&f=" + year +
      "&g=m"

    val data = io.Source.fromURL(url).mkString
    val price = data.split("\n")(1).split(",")(4).toDouble
    (symbol, price)
  }

  def getTopStock(count: Int): (String, Double) = {
    (1 to count).foldLeft("", 0.0) { (previousHigh, index) =>
      receiveWithin(10000) {
        case (symbol: String, price: Double) =>
          if (price > previousHigh._2) (symbol, price) else previousHigh
      }
    }
  }
}


Saved in TopStock.scala. Compiled with

> scalac TopStock.scala

ran with

> scala TopStock

Top stock of 2008 is GOOG closing at price 307,650000

Amazed by Scala

Top stock of 2008 is GOOG closing at price 307,650000 among (AAPL, GOOG, IBM, MSFT). I am amazed by the simplicity, clarity, and beauty of the following Scala code from this book.



import scala.actors._
import Actor._

val symbols = List("AAPL", "GOOG", "IBM", "MSFT")
val receiver = self
val year = 2008

symbols.foreach { symbol =>
  actor { receiver ! getYearEndClosing(symbol, year) }
}

val (topStock, highestPrice) = getTopStock(symbols.length)

printf("Top stock of %d is %s closing at price %f\n", year,
  topStock, highestPrice)

def getYearEndClosing(symbol: String, year: Int) = {
  val url = "http://ichart.finance.yahoo.com/table.csv?s=" +
    symbol + "&a=11&b=01&c=" + year + "&d=11&e=31&f=" + year +
    "&g=m"

  val data = io.Source.fromURL(url).mkString
  val price = data.split("\n")(1).split(",")(4).toDouble
  (symbol, price)
}

def getTopStock(count: Int): (String, Double) = {
  (1 to count).foldLeft("", 0.0) { (previousHigh, index) =>
    receiveWithin(10000) {
      case (symbol: String, price: Double) =>
        if (price > previousHigh._2) (symbol, price) else previousHigh
    }
  }
}

Wednesday, May 25, 2011

StackOverflow

Do you know the famous resource http://stackoverflow.com by Joel Spolsky and team? Here is my own definition of stackoverflow in Java:

private static void log(String logStatement) {
    log(logStatement);
}

Tuesday, April 19, 2011

NerdCamp

NerdCamp is coming, and I'm going to present on Apache technologies, specifically on Apache Solr and a bit on Apache Hadoop. If you are into all this, like creative thinking, want to create a start-up, and will be around Saint Petersburg this weekend, you should definitely come to NerdCamp!

Confirmed Key Participants

Yury Lifshits, Sergey Poduzov, Alexander Shtuchkin,
Vladimir Gorovoy, Nikolay Vyahhi, Vladimir Aluferov, Yakov Sirotkin and myself

The preliminary program has juicy topics among which are:
How the Web will Transform Education by Yury Lifshits,
Geo Information Systems Around Us by Aleksander Klechikov,
Software Development for little ones by Yakov Sirotkin,
Introducing NoSQL: Apache SOLR and Hadoop by Dmitry Kan,
Introducing Cloud Services by Dmitry Petrov,
Nature of Entrepreneurship: Craft, Lifestyle or Science by Alexey Baranov,
How to grow places for smart people -- workshop

and more!

Friday, March 25, 2011

Angry Birds player: DK

Angry Birds player: DK: "DK is utterly fed up with the pigs' antics! Follow his exploits in the Angry Birds Finnish Championship, and come join in yourself."

Monday, March 21, 2011

concert of components?

"ZooKeeper: Because coordinating distributed systems is a Zoo"

This post is short and not useful in the sense that it doesn't give you any code snippets or technical recommendations. I would just like to cite the SolrCloud wiki. SolrCloud is "the set of Solr features that take Solr's distributed search to the next level, enabling and simplifying the creation and use of Solr clusters." It uses the Apache ZooKeeper project (a subproject of Hadoop) as a distributed system for keeping the state of a cluster of SOLR instances up to date. In a distributed system every component can potentially crash, yet the system is expected to provide its service to its users. If a single SOLR instance crashes, its replica will take over; but if ZooKeeper crashes, the system will still continue serving user requests, although no updates to the system's state will be visible (sounds interesting, I know). To improve on that, this is what is possible:

"Running multiple zookeeper servers in concert (a zookeeper ensemble) allows for high availability of the zookeeper service. Every zookeeper server needs to know about every other zookeeper server in the ensemble, and a majority of servers are needed to provide service. For example, a zookeeper ensemble of 3 servers allows any one to fail with the remaining 2 constituting a majority to continue providing service. 5 zookeeper servers are needed to allow for the failure of up to 2 servers at a time."
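The majority arithmetic in the quote above can be sketched in a few lines of Java; this is a generic illustration of the quorum rule, not ZooKeeper API code:

```java
public class Quorum {
    // A ZooKeeper ensemble keeps serving as long as a majority of its
    // servers survive, so n servers tolerate floor((n - 1) / 2) failures.
    static int tolerableFailures(int ensembleSize) {
        return (ensembleSize - 1) / 2;
    }

    public static void main(String[] args) {
        System.out.println(tolerableFailures(3)); // prints 1
        System.out.println(tolerableFailures(5)); // prints 2
    }
}
```

Note that going from 3 to 4 servers buys no extra fault tolerance, which is why ensembles are usually sized with an odd number of servers.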

So, if you have a big zoo with a variety of animals in it, make sure you have 5 zookeepers, so that at least 3 of them can take care of your pigs and elephants while the other 2 are stuck somewhere else.

Saturday, January 29, 2011

SOLR: speed up batch posting

If you are familiar with Apache SOLR and deal with an index worth millions of documents, being able to (re-)index fast becomes quite important. There are various techniques for tweaking things to make indexing go faster. Another simple way to speed up your (batch) indexing, without changing your SOLR schema, is to adjust logging.

Apparently, when deployed under Tomcat, SOLR logs each and every update request during the POSTing process. Experience shows that a heavy HTTP operation completes faster when logging is minimal.

SOLR (as of 1.4 at least) has an admin GUI which serves as a central information hub for a given SOLR core. Among other useful features, it has a page where one can set the logging levels of the different SOLR components. In a default SOLR installation you can access the page via http://localhost:8983/solr/admin/logging. By default, the logging levels are mostly INFO, which permits logging of all select/update requests (imagine one million such log entries for a batch reindexing).

It would be handy to automatically change the logging levels to, say, WARNING before batch POSTing and back to INFO afterwards. solr/admin/logging is declared as a servlet in the web.xml of the corresponding SOLR core:



<servlet>
  <servlet-name>Logging</servlet-name>
  <servlet-class>org.apache.solr.servlet.LogLevelSelection</servlet-class>
</servlet>



All the components which allow changing their logging levels are listed on the page http://localhost:8983/solr/admin/logging. Using curl, we can send a POST request to the servlet and set the desired levels. It is reasonable to implement a function which takes the logging level and the URL of the SOLR core as parameters (choose your own favourite language; this one is in Perl):


sub setSolrLogLevel
{
my ($url, $level) = @_;

print "setting logging level to $level\n";
my $res = system("curl --user user:pass -d \"submit=set&root=$level&fi=$level" .
"&fi.alphasense=$level&fi.alphasense.solr=$level&fi.alphasense.solr.query=$level" .
"&fi.alphasense.solr.query.AlphaSenseQParserPlugin=$level&httpclient=$level&httpclient.wire=$level&httpclient.wire.content=$level&httpclient.wire.header=$level&javax=$level&javax.management=$level&javax.management.mbeanserver=$level&org=$level&org.apache=$level&org.apache.catalina=$level&org.apache.catalina.core=$level&org.apache.catalina.core.ContainerBase=$level&org.apache.catalina.core.ContainerBase.%5BCatalina%5D=$level&org.apache.catalina.core.ContainerBase.%5BCatalina%5D.%5Blocalhost%5D=$level&org.apache.catalina.core.ContainerBase.%5BCatalina%5D.%5Blocalhost%5D.%5B%2Fsolrtopic%5D=$level&org.apache.catalina.core.ContainerBase.%5BCatalina%5D.%5Blocalhost%5D.%5B%2Fsolrtopic%5D.%5BLogging%5D=$level&org.apache.catalina.core.ContainerBase.%5BCatalina%5D.%5Blocalhost%5D.%5B%2Fsolrtopic%5D.%5BSolrServer%5D=$level&org.apache.catalina.core.ContainerBase.%5BCatalina%5D.%5Blocalhost%5D.%5B%2Fsolrtopic%5D.%5BSolrUpdate%5D=$level&org.apache.catalina.core.ContainerBase.%5BCatalina%5D.%5Blocalhost%5D.%5B%2Fsolrtopic%5D.%5Bdefault%5D=$level&org.apache.catalina.core.ContainerBase.%5BCatalina%5D.%5Blocalhost%5D.%5B%2Fsolrtopic%5D.%5Bjsp%5D=$level&org.apache.catalina.core.ContainerBase.%5BCatalina%5D.%5Blocalhost%5D.%5B%2Fsolrtopic%5D.%5Bping%5D=$level&org.apache.catalina.session=$level&org.apache.catalina.session.ManagerBase=$level&org.apache.commons=$level&org.apache.commons.digester=$level&org.apache.commons.digester.Digester=$level&org.apache.commons.digester.Digester.sax=$level&org.apache.commons.httpclient=$level&org.apache.commons.httpclient.ChunkedInputStream=$level&org.apache.commons.httpclient.HeaderElement=$level&org.apache.commons.httpclient.HttpClient=$level&org.apache.commons.httpclient.HttpConnection=$level&org.apache.commons.httpclient.HttpMethodBase=$level&org.apache.commons.httpclient.HttpMethodDirector=$level&org.apache.commons.httpclient.HttpParser=$level&org.apache.commons.httpclient.HttpState=$level&org.apache.commons.httpclient.MultiThreadedHttpConnectionManager=$level" .
"&org.apache.commons.httpclient.SimpleHttpConnectionManager=$level&org.apache.commons.httpclient.auth=$level&org.apache.commons.httpclient.auth.AuthChallengeProcessor=$level&org.apache.commons.httpclient.cookie=$level&org.apache.commons.httpclient.cookie.CookiePolicy=$level&org.apache.commons.httpclient.cookie.CookieSpec=$level&org.apache.commons.httpclient.methods=$level&org.apache.commons.httpclient.methods.EntityEnclosingMethod=$level&org.apache.commons.httpclient.methods.ExpectContinueMethod=$level&org.apache.commons.httpclient.methods.PostMethod=$level&org.apache.commons.httpclient.params=$level&org.apache.commons.httpclient.params.DefaultHttpParams=$level&org.apache.commons.httpclient.params.HttpMethodParams=$level&org.apache.commons.httpclient.util=$level&org.apache.commons.httpclient.util.EncodingUtil=$level&org.apache.commons.httpclient.util.ExceptionUtil=$level&org.apache.commons.httpclient.util.IdleConnectionHandler=$level&org.apache.jasper=$level&org.apache.jasper.EmbeddedServletOptions=$level&org.apache.jasper.JspCompilationContext=$level&org.apache.jasper.compiler=$level&org.apache.jasper.compiler.Compiler=$level&org.apache.jasper.compiler.JspConfig=$level&org.apache.jasper.compiler.JspRuntimeContext=$level&org.apache.jasper.compiler.TldLocationsCache=$level&org.apache.jasper.servlet=$level&org.apache.jasper.servlet.JspServlet=$level&org.apache.jasper.servlet.JspServletWrapper=$level&org.apache.solr=$level&org.apache.solr.analysis=$level&org.apache.solr.analysis.BaseTokenFilterFactory=$level&org.apache.solr.analysis.BaseTokenizerFactory=$level&org.apache.solr.client=$level&org.apache.solr.client.solrj=$level&org.apache.solr.client.solrj.impl=$level&org.apache.solr.client.solrj.impl.CommonsHttpSolrServer=$level&org.apache.solr.common=$level&org.apache.solr.common.util=$level&org.apache.solr.common.util.ConcurrentLRUCache=$level&org.apache.solr.core=$level&org.apache.solr.core.Config=$level&org.apache.solr.core.CoreContainer=$level" .
"&org.apache.solr.core.JmxMonitoredMap=$level&org.apache.solr.core.RequestHandlers=$level&org.apache.solr.core.SolrConfig=$level&org.apache.solr.core.SolrCore=$level&org.apache.solr.core.SolrResourceLoader=$level&org.apache.solr.handler=$level&org.apache.solr.handler.AnalysisRequestHandler=$level&org.apache.solr.handler.XmlUpdateRequestHandler=$level&org.apache.solr.handler.admin=$level&org.apache.solr.handler.admin.LukeRequestHandler=$level&org.apache.solr.handler.admin.SystemInfoHandler=$level&org.apache.solr.handler.component=$level&org.apache.solr.handler.component.QueryElevationComponent=$level&org.apache.solr.handler.component.SearchHandler=$level&org.apache.solr.handler.component.SpellCheckComponent=$level&org.apache.solr.highlight=$level&org.apache.solr.highlight.SolrHighlighter=$level&org.apache.solr.request=$level&org.apache.solr.request.BinaryResponseWriter=$level&org.apache.solr.request.XSLTResponseWriter=$level&org.apache.solr.schema=$level&org.apache.solr.schema.FieldType=$level&org.apache.solr.schema.IndexSchema=$level&org.apache.solr.search=$level&org.apache.solr.search.SolrIndexSearcher=$level&org.apache.solr.servlet=$level&org.apache.solr.servlet.LogLevelSelection=$level&org.apache.solr.servlet.SolrDispatchFilter=$level&org.apache.solr.servlet.SolrRequestParsers=$level&org.apache.solr.servlet.SolrServlet=$level&org.apache.solr.servlet.SolrUpdateServlet=$level&org.apache.solr.spelling=$level&org.apache.solr.spelling.AbstractLuceneSpellChecker=$level&org.apache.solr.spelling.FileBasedSpellChecker=$level&org.apache.solr.spelling.IndexBasedSpellChecker=$level&org.apache.solr.update=$level&org.apache.solr.update.SolrIndexConfig=$level&org.apache.solr.update.UpdateHandler=$level&org.apache.solr.util=$level&org.apache.solr.util.SolrPluginUtils=$level&org.apache.solr.util.plugin=$level&org.apache.solr.util.plugin.AbstractPluginLoader=$level\" $url");
print "Result code:$res\n";
}


There you go. Call setSolrLogLevel("http://localhost:8983/solr/admin/logging", "WARNING"); before the batch POSTing and setSolrLogLevel("http://localhost:8983/solr/admin/logging", "INFO"); after the batch POSTing has finished.
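As a side note, the long hand-maintained parameter string above could also be generated programmatically from a list of logger categories. Here is a minimal Java sketch; the `body` helper and the two sample category names are illustrative, not part of SOLR itself:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LogParams {
    // Build the application/x-www-form-urlencoded body that the
    // /solr/admin/logging servlet expects: submit=set, the root level,
    // and one category=level pair per logger.
    static String body(String level, List<String> categories) {
        return "submit=set&root=" + level + "&" +
                categories.stream()
                          .map(c -> c + "=" + level)
                          .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        System.out.println(body("WARNING",
                Arrays.asList("org.apache.solr", "org.apache.solr.core.SolrCore")));
        // prints submit=set&root=WARNING&org.apache.solr=WARNING&org.apache.solr.core.SolrCore=WARNING
    }
}
```

Keeping the category names in one list makes it much easier to add or drop loggers than editing a single multi-kilobyte string literal.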