Friday, November 19, 2010

"OOPness" in Java

Optimising my code led me to a refactoring in which several methods with return values were merged into one. Java has no "pass by reference": every argument, whether a primitive or an object such as a collection, is passed by value (for objects, the value being passed is the reference). So I could not hand results back through the parameters and had to come up with an inner class that holds all the return values.

So, by not supporting "pass by reference", Java makes your code even more OOP-ish. Even though the topic may be well known, this article helps to put things together and refresh your memory.
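The holder-class idea can be sketched as follows. This is only a minimal illustration, not the code from my refactoring: the class names, the fields and the analyze method are all hypothetical. It shows one method returning a primitive and a collection together through a small nested class.

```java
import java.util.ArrayList;
import java.util.List;

final class ResultHolderDemo {

    /** Holds everything the merged method hands back (hypothetical fields). */
    static final class ParseResult {
        final int tokenCount;          // a primitive result
        final List<String> longWords;  // a collection result

        ParseResult(int tokenCount, List<String> longWords) {
            this.tokenCount = tokenCount;
            this.longWords = longWords;
        }
    }

    /** One pass over the input instead of two separate methods. */
    static ParseResult analyze(String text, int minLength) {
        int count = 0;
        List<String> longWords = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty input yields a single empty token
            }
            count++;
            if (token.length() >= minLength) {
                longWords.add(token);
            }
        }
        return new ParseResult(count, longWords);
    }

    public static void main(String[] args) {
        ParseResult r = analyze("pass by value is the only option", 5);
        // prints: 7 tokens, long words: [value, option]
        System.out.println(r.tokenCount + " tokens, long words: " + r.longWords);
    }
}
```

The fields are final, so the caller gets an immutable snapshot of the results instead of parameters that are silently modified.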

Wednesday, November 3, 2010

Successive replacement in regular expressions (java)

I'm not sure how often people out there do successive replacement in a target text using a regular expression pattern, but Java has a rather neat solution for it. I'm publishing it here because I know that especially younger developers can end up re-inventing the wheel and have longer debugging sessions.

So the task is like this: you have a text T, like "cat-1 dog-1 cat-1 elephant-1 cat-2 dog-2 cat-3".
Suppose we want to change numerals attached to the words "cat" to their word representations: "1" to "one", "2" to "two".

One straightforward way would be to match all "cat-([0-9]+)" subsequences and then run a replace operation on T.

So the code would look something like this:


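String T = "cat-1 dog-1 cat-1 elephant-1 cat-2 dog-2 cat-3";
Pattern catPattern = Pattern.compile("cat-([0-9]+)");
// the matcher keeps working on the original text, even after T is reassigned
Matcher catMatcher = catPattern.matcher(T);
Map<String, String> numToWord = new HashMap<String, String>();
numToWord.put("1", "one");
numToWord.put("2", "two");
numToWord.put("3", "three"); // ...
while (catMatcher.find())
{
    // replaces the first occurrence of the digit anywhere in T, not only after "cat-"
    T = T.replaceFirst(catMatcher.group(1), numToWord.get(catMatcher.group(1)));
}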


This code produces:

cat-one dog-one cat-1 elephant-1 cat-two dog-2 cat-three

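This misses one cat substitution (the second cat-1) and wrongly changes dog-1, because replaceFirst replaces the first "1" it finds anywhere in the text. OK, let's use replaceAll instead and make sure we touch only cats: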


while (catMatcher.find())
{
    T = T.replaceAll("cat-" + catMatcher.group(1), "cat-" + numToWord.get(catMatcher.group(1)));
}


which produces what we want:

cat-one dog-1 cat-one elephant-1 cat-two dog-2 cat-three

But now what happens inside the loop is logically out of sync with the loop condition: we iterate over individual matches, but call replaceAll on the whole text. It is probably not efficient either, as replaceAll is attempted again for duplicate matches, when it is no longer needed.

Any more elegant and correct solution?

Yes! It is called Matcher.appendReplacement:


Pattern catPattern = Pattern.compile("cat-([0-9]+)");
Matcher catMatcher = catPattern.matcher(T);
Map<String, String> numToWord = new HashMap<String, String>();
numToWord.put("1", "one");
numToWord.put("2", "two");
numToWord.put("3", "three"); // ...

StringBuffer sb = new StringBuffer();

while (catMatcher.find())
{
    System.out.println("Match:" + catMatcher.group(1));
    // copies the text since the previous match, then the replacement
    catMatcher.appendReplacement(sb, "cat-" + numToWord.get(catMatcher.group(1)));
}
// copies whatever follows the last match
catMatcher.appendTail(sb);


now sb.toString() contains:

cat-one dog-1 cat-one elephant-1 cat-two dog-2 cat-three

If you add System.out.println(sb.toString()); inside the while loop, you will also see that the replacements happen in step with the loop: each iteration handles exactly the match it has just found.
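For readers on Java 9 or later, the same per-match replacement can also be written with Matcher.replaceAll, which accepts a function. The sketch below is a self-contained variant under that assumption; the class and method names are hypothetical, and unknown numerals are simply left unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CatReplacer {
    private static final Pattern CAT = Pattern.compile("cat-([0-9]+)");

    /** Replaces the numeral after each "cat-" using the given lookup table. */
    static String replaceCats(String text, Map<String, String> numToWord) {
        Matcher m = CAT.matcher(text);
        // Java 9+: the function is called once per match, in order of appearance.
        return m.replaceAll(match -> {
            String digits = match.group(1);
            String word = numToWord.getOrDefault(digits, digits);
            // quoteReplacement guards against '$' or '\' in the replacement text
            return "cat-" + Matcher.quoteReplacement(word);
        });
    }

    public static void main(String[] args) {
        Map<String, String> numToWord = new HashMap<>();
        numToWord.put("1", "one");
        numToWord.put("2", "two");
        numToWord.put("3", "three");
        // prints: cat-one dog-1 cat-one elephant-1 cat-two dog-2 cat-three
        System.out.println(replaceCats("cat-1 dog-1 cat-1 elephant-1 cat-2 dog-2 cat-3", numToWord));
    }
}
```

On older Java versions the appendReplacement / appendTail loop remains the way to go.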

Saturday, August 21, 2010

B2B: what is WSDL (simple explanation)

Just for those wondering what a WSDL file or technology is in the area of Business-to-Business (when companies speak to each other automatically over the network): you can think of it as a declaration of the client-server conversation. In this case one B (the server) provides some functionality (like taking square roots of big numbers) and another B (the client) has a lot of these numbers and needs their square roots.

So the server declares a method (which is in fact a remote method):

Double squareRoot(Double number)

in its WSDL file. The client side then takes this file and generates the client-side code, which handles the communication protocol and the remote method invocation automatically. All the client needs to do is implement the business logic around this invocation, like a web page with a text field for the number, or an entire hardware device with a touch screen display.

In general, the input and return types of remote methods can be simple or complex, in fact as complex and "proprietary" as you need, because the entire WSDL file is nothing but XML, as are the messages sent over the network.

Among the libraries used in the industry are Axis and Axis2. Using them you can generate the Java code for your client from WSDL (and even generate WSDL from your Java code for your server).

Tuesday, August 17, 2010

Zero-width negative lookahead group: example

Suppose you have to split a string in Java where the meaningful parts are separated by runs of consecutive spaces, say from 2 to infinity. The problem is that sometimes the gap inside a meaningful part (like a person's name) also consists of 2 spaces. You want to exclude such cases and keep the parts of the name together. Example:

name1[one_space]surname1[two_spaces]age
name2[two_spaces]surname2[two_spaces]age

In this example it is easy to solve the problem with a zero-width negative lookahead group, because age is always supposed to be numeric: split on a run of spaces only if it is not followed by a non-digit. Note that (?!\\w) would not work here, since \w matches digits as well as letters, so the lookahead is (?!\\D). The final code will be something like this:

String[] parts = current_string.split("\\s{2,}(?!\\D)");
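To see the lookahead at work, here is a small self-contained sketch. The class name, the method name and the sample records are made up for illustration. It uses the lookahead (?!\\D), "not followed by a non-digit", because \w also matches digits and would therefore block the split in front of the age.

```java
import java.util.Arrays;

final class LookaheadSplitDemo {

    /** Splits on two or more whitespace characters that are NOT followed by a non-digit. */
    static String[] splitRecord(String line) {
        return line.split("\\s{2,}(?!\\D)");
    }

    public static void main(String[] args) {
        // one space inside the name, two spaces before the age
        // prints: [name1 surname1, 25]
        System.out.println(Arrays.toString(splitRecord("name1 surname1  25")));

        // two spaces inside the name as well: the name stays in one piece
        // prints: [name2  surname2, 31]
        System.out.println(Arrays.toString(splitRecord("name2  surname2  31")));
    }
}
```

Since the lookahead is zero-width, the digit that follows the spaces is only inspected, not consumed, so the age comes out intact.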

Thursday, July 1, 2010

Paris impressions

I'm back from my second trip to Paris. Only now, after returning, did it occur to me: everyone in Paris is relaxing. Or does it only look that way? Either way, it is easy to relax there. I enjoyed riding the public bicycles!

Tuesday, April 27, 2010

Dumper and sorting of keys

When you use Dumper for debugging, to conveniently log data structures by reference, you can additionally sort the output. Say you have a hash referenced by $hash_ref. If you need to output the hash contents with its keys sorted, you can do:


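use strict;
use warnings;
use Data::Dumper;

# sample hash with numeric keys (illustrative values); generate your hash here
my $hash_ref = { 10 => 'ten', 2 => 'two', 33 => 'thirty-three' };

$Data::Dumper::Sortkeys = \&my_filter;
print Dumper($hash_ref), "\n";

sub my_filter {
    my ($hash) = @_;
    # return an array ref containing the hash keys to dump
    # in the order that you want them to be dumped
    # (here: ascending numeric order)
    return [
        sort { $a <=> $b } keys %$hash
    ];
}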

Wednesday, April 21, 2010

Convergence

Program code approximates the solution of a problem. The code converges to the solution through the cycle "change the code" - "compile/interpret" - "run". The result is something like a time series: the code may fail to converge for a week and then converge in a single day. If the code does not converge, it is worth looking at the very concept of the solution / the problem / the compiler.

The point of all this is that it is useful to teach programming in a way that treats a program as a coherent mathematical scheme, and not only as a set of algorithms, techniques and design.

Tuesday, April 20, 2010

Thursday, April 8, 2010

An idea for GUI designers and developers -- a standalone app / OS level change.

When I'm under VPN and PuTTY, I sometimes need to run an sftp client. Currently I have two options: WinSCP (I find it easier to use in some cases, like quickly viewing / editing the contents of different text files) and Secure Shell Client from SSH. For some reason the first client takes about 2-3 minutes to load after I log in. So it would be great to have an option to configure the launch of a program depending on a certain event, like PuTTY being started and connected to a certain host.

This would give a lot of opportunities to configure your GUI world quite flexibly and save a lot of time otherwise spent on monkey repetitions.

Saturday, March 13, 2010

giza++ under windows: episode 2

It turned out that, in order to run GIZA++ comfortably under win32 with cygwin, I had to recompile it without the flag -DBINARY_SEARCH_FOR_TTABLE (credits: http://code.google.com/p/giza-pp/issues/detail?id=9).

The full list of steps goes here (I suppose that source corpus is stored in corpus.ru and target corpus is stored in corpus.en):

1. produce vcb and dictionary files with "plain2snt.out corpus.ru corpus.en" (credits: http://vee-r.blogspot.com/2006/12/giza-guide.html)
2. produce cooc file with "snt2cooc.out corpus.ru.vcb corpus.en.vcb corpus.ru_corpus.en.snt > ru_en.cooc" (credits: myself, after analyzing the train-factored-phrase-model.perl from Moses package)
3. run GIZA++ with config file:

outputfileprefix play_giza
sourcevocabularyfile corpus.ru.vcb
targetvocabularyfile corpus.en.vcb
c corpus.ru_corpus.en.snt
CoocurrenceFile ru_en.cooc
model1dumpfrequency 1
model4smoothfactor 0.4
nsmooth 4
onlyaldumps 1
nodumps 1
p0 .999
m1 5
m2 0
m3 3
m4 3
o giza

If these options are stored in giza.config, run "GIZA++ giza.config". This produces the giza.A3.final file, a typical entry of which is:

# Sentence pair (1) source length 4 target length 7 alignment score : 2.25315e-10
there is a book on the table
NULL ({ }) на ({ }) столе ({ 5 6 7 }) лежит ({ 1 2 }) книга ({ 3 4 })

which means the following mapping:

"столе" --> "on the table"
"лежит" --> "there is"
"книга" --> "a book"

Running GIZA++ under win32 and under Linux gives the same results in terms of word mappings, except that the alignment scores may differ slightly, possibly due to different floating-point precision models.
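Reading the mapping off an A3 entry by hand gets tedious quickly, so here is a rough Java sketch of how the two sentence lines of an entry could be turned into the word mapping shown above. It is not part of GIZA++ or Moses; the class name, the method name and its parameters are my own assumptions, and it only handles the "token ({ indices })" layout seen in the entry above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class A3LineParser {
    // One "token ({ i j k })" group; the indices are 1-based positions in the plain line.
    private static final Pattern GROUP = Pattern.compile("(\\S+) \\(\\{([0-9 ]*)\\}\\)");

    /**
     * plainLine:   the sentence printed as plain words (second line of an entry)
     * alignedLine: the sentence printed with ({ ... }) index sets (third line of an entry)
     */
    static Map<String, String> parse(String plainLine, String alignedLine) {
        String[] plain = plainLine.trim().split("\\s+");
        Map<String, String> mapping = new LinkedHashMap<>();
        Matcher m = GROUP.matcher(alignedLine);
        while (m.find()) {
            String indices = m.group(2).trim();
            if (indices.isEmpty()) {
                continue; // NULL or a word aligned to nothing
            }
            List<String> words = new ArrayList<>();
            for (String idx : indices.split("\\s+")) {
                words.add(plain[Integer.parseInt(idx) - 1]);
            }
            // note: a repeated word overwrites its earlier entry in this simple sketch
            mapping.put(m.group(1), String.join(" ", words));
        }
        return mapping;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = parse(
                "there is a book on the table",
                "NULL ({ }) на ({ }) столе ({ 5 6 7 }) лежит ({ 1 2 }) книга ({ 3 4 })");
        for (Map.Entry<String, String> e : mapping.entrySet()) {
            System.out.println("\"" + e.getKey() + "\" --> \"" + e.getValue() + "\"");
        }
    }
}
```

For the entry above this reproduces the three mappings listed earlier; the comment line starting with "# Sentence pair" is expected to be skipped by the caller.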

Saturday, February 27, 2010

giza++ under windows

After an 'outrageous' attempt to compile giza++ under Visual Studio C++ Express 2008 (with more than 2000 compile time errors) I switched over to cygwin and installed:

gcc version 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)
make GNU Make 3.81 (built for i686-pc-cygwin)

With

$ make

under giza-pp-v1.0.3\giza-pp I obtained two executables, GIZA++-v2/GIZA++.exe and mkcls-v2/mkcls.exe, which run perfectly under Windows XP Professional Version 2002 SP 2.

Saturday, December 19, 2009

RuSSIR'2010

What is RuSSIR? It is the summer school on Information Retrieval. Where is it held? Usually in Russia, this time in Voronezh. Want to apply? Read on.




4th Russian Summer School in Information Retrieval (RuSSIR 2010)
Monday September 13 - Saturday September 18, 2010
Voronezh, Russia
http://romip.ru/russir2010/eng/



FIRST CALL FOR COURSE PROPOSALS



The 4th Russian Summer School in Information Retrieval (RuSSIR 2010) will be held on September 13-18, 2010 in Voronezh, Russia, one of the major cities in south-western Russia. The mission of the school is to teach students about modern problems and methods in Information Retrieval; to stimulate scientific research in the field of Information Retrieval; and to create an opportunity for informal contacts among scientists, students and industry professionals. The Russian Conference for Young Scientists in Information Retrieval will be co-located with the school. RuSSIR 2010 will offer 4 or 5 courses and host approximately 100 participants. The working languages of the school are English (preferable) and Russian. The target audience of RuSSIR is advanced graduate and PhD students, post-doctoral researchers, academic and industrial researchers, and developers.


The RuSSIR 2010 Organizing Committee invites proposals for courses on a wide
range of IR-related topics, including but not limited to:
- IR theory and models
- IR architectures
- Algorithms and data structures for IR
- Text IR
- Multimedia (including music, speech, image, video) IR
- Natural language techniques for IR tasks
- User interfaces for IR
- Web IR (including duplicate detection, hyperlink analysis, query logs)
- Text mining, information and fact extraction
- Mobile applications for IR
- Dynamic media IR (blogs, news, WIKIs)
- Social IR (collaborative filtering, tagging, recommender systems)
- IR evaluation.


Each course should consist of five 90-minute-long sessions (normally in five consecutive days). The course may include both lectures and practical exercises in computer labs.


RuSSIR 2010 organizers will cover travel expenses and accommodations at the school for one lecturer per course, but there is no additional honorarium. The RuSSIR organizers would highly appreciate if, whenever possible, lecturers could find alternative funding to cover travel and accommodation expenses and indicate this possibility in the proposal.


Course proposals for RuSSIR 2010 must be submitted by email to Pavel Braslavski (pb@yandex-team.ru), by February 14, 2010. A course proposal should contain a brief description of the course (up to 200 words), preferred schedule, prerequisites, equipment needs, a short description of teaching/research experience and contact information of the lecturer(s). All proposals will be evaluated by the RuSSIR 2010 program committee according to the school goals, presentation clarity, lecturer’s qualifications and experience. Topics not featured at previous RuSSIRs are preferred. All submitters will be notified by March 1, 2010. Early informal inquiries about the school or the proposal evaluation process are encouraged.


About RuSSIR: The Russian Summer School in Information Retrieval is co-organized by the Russian Information Retrieval Evaluation Seminar (ROMIP) and Voronezh State University. Previous schools took place in Ekaterinburg, Taganrog, and Petrozavodsk. Previous RuSSIR courses included IR Models (by Djoerd Hiemstra), Modeling Web Searcher Behavior and Interactions (by Eugene Agichtein), Computational Advertising (by James Shanahan), Text Mining, Information and Fact Extraction (by Marie-Francine Moens), Natural Language Processing for Information Access (by Horacio Saggion), Music IR (by Andreas Rauber), and others. Ricardo Baeza-Yates, VP of Research for Europe and Latin America at Yahoo, has been confirmed as an invited lecturer for RuSSIR 2010 with the course 'Web data mining'.


About the RuSSIR 2010 location: Voronezh is a major city in southwestern Russia, spanning both sides of the Voronezh River, with a population of 850,000. Express trains from Moscow to Voronezh take about 10 hours. There are also regular flights from Moscow, Munich, Prague, Tel-Aviv, and Istanbul. The town was founded in 1586. In the 17th century, Voronezh gradually evolved into a sizeable town, especially after Tsar Peter the Great built a dockyard in Voronezh. Currently, Voronezh is an administrative, economic and cultural center of the Voronezh region. The area surrounding Voronezh has many attractions, including an archeological museum, the Divnogorie nature and historical reserve, the Kostomarovo cave monastery, and the Orlov trotter stud farm at Khrenovoe. Voronezh has a large student population: 37 institutions of higher education and 53 colleges educate over 127,000 students today. Voronezh State University was founded in 1918 and is one of the largest universities in Russia, with a total enrollment of 22,000.


Contacts

E-mail: school[at]romip[dot]ru (substitute [at] with "@" and [dot] with ".").

Monday, November 30, 2009

Semantic Analysis: theory, applications and use cases

I gave a talk at the 6th FRUCT seminar by Nokia Research Center, Helsinki University of Technology and Nokia Siemens Networks.

Monday, October 12, 2009

Augmented Reality with Adobe Flash

Having carefully followed the Adobe MAX conference, I found a ground-shaking presentation on Augmented Reality with Flash.

As the presenters define it, AR is superimposing "graphics over real-world environments in realtime". So I checked out the exciting (and long) tutorial on how to set up the development environment and hit the road. To run a quick demo from the tutorial, I printed a black-and-white pattern (it makes sense to make it smaller than I did, as it is much easier to handle). Before diving into the details, I wanted to get an idea of what it would look like and made a small video, which I share with you (please be patient with the video quality and mute your player).



It is too early to draw any conclusions about the future of AR, but it sounds like a very exciting field of software development. It is where creative people get together and come up with an exciting business card and all sorts of interesting things.

Wednesday, September 16, 2009

Sting's programmer's mind

First impression of Sting's web-site is his quote of the day:

"I quite like using songs as a modular system where you can mix and match lines from different songs. It's a tradition now and people expect it. Basically, it's all one big song. You could say it was an aspect of postmodernism if you liked but you'd be called pretentious if you said that."

Perl: a concise way to map one array onto another in a Perl hash

Suppose I have two arrays. Suppose further that the elements of one array logically map onto the elements of the other.


my @ar1 = (...);
my @ar2 = (...);


An easy way to map ar1 (keys) onto ar2 (values) in Perl is a hash slice:


my %hash;
@hash{@ar1} = @ar2;


Important assumption: the order in these two arrays matters. In other words, the first element of ar1 maps to the first element of ar2, ..., the n-th element of ar1 maps onto the n-th element of ar2, and there are exactly n elements in both arrays.


Examples

With unique keys the hash preserves the correct mapping (include use Data::Dumper in your code):


sub unique_mapping
{
my @ar1 = ('a', 'b', 'c', 'd', 'e');
my @ar2 = ('1', '2', '3', '4', '5');

print Dumper(\@ar1);
print Dumper(\@ar2);

my %hash;
@hash{@ar1} = @ar2;


print Dumper(\%hash);

}


Result:


$VAR1 = [
'a',
'b',
'c',
'd',
'e'
];
$VAR1 = [
'1',
'2',
'3',
'4',
'5'
];
$VAR1 = {
'e' => '5',
'c' => '3',
'a' => '1',
'b' => '2',
'd' => '4'
};


When the keys are not unique, the mapping is probably not what you want, because a later value overwrites an earlier one stored under the same key:


sub keys_non_unique_mapping
{
my @ar1 = ('a', 'b', 'b', 'd', 'e');
my @ar2 = ('1', '2', '3', '4', '5');

print Dumper(\@ar1);
print Dumper(\@ar2);

my %hash;
@hash{@ar1} = @ar2;


print Dumper(\%hash);

}


Result:


$VAR1 = [
'a',
'b',
'b',
'd',
'e'
];
$VAR1 = [
'1',
'2',
'3',
'4',
'5'
];
$VAR1 = {
'e' => '5',
'a' => '1',
'b' => '3',
'd' => '4'
};

Monday, September 14, 2009

Logging: helpful perl snippet to start with

I needed to set up simple logging in my small Perl app, which serves as a plugin for a company's "big" product. Here is what I came up with:


use strict;

my $log_file = "plugin_request.log"; # path to your log file

my $LOG_HANDLE = open_log_file_for_writing($log_file);

log_entry($LOG_HANDLE, "Logging started");

log_entry($LOG_HANDLE, "Logging finished");

close_log_file($LOG_HANDLE);


sub open_log_file_for_writing
{
my $log_file = shift;
my $LOGGING_HANDLE;

print "INFO Opening log file...\n";

unless (open $LOGGING_HANDLE, ">>", $log_file) {
warn "Cannot open log file $log_file: $!\n";
return;
}

my $current_time = localtime;
print $LOGGING_HANDLE "\n".$current_time."\n";

return $LOGGING_HANDLE;
}

sub log_entry
{
my $LOGGING_HANDLE = shift;
my $log_entry = shift;

print $LOGGING_HANDLE $log_entry."\n";
}

sub close_log_file
{
my $LOGGING_HANDLE = shift;
print "INFO Closing log file...\n";
close($LOGGING_HANDLE);
}




upd: the logging handle can be made global to make logging easier to use. This comes at the cost of a global variable, but may still suit moderately sized Perl scripts. The code changes a bit:


my $g_LOGGING_HANDLE = open_log_file_for_writing($log_file);
log_entry("Logging started");

sub log_entry
{
my $log_entry = shift;

print $g_LOGGING_HANDLE $log_entry."\n";
}


Now you can quickly call log_entry("log entry goes here") from wherever you want, for example deep inside some procedure or function, without having to pass the logging handle down as well.

upd1: If you want to control whether to log or not, another small modification will do it for you:


my $LOG_ENABLED = 1; # put 1 to enable logging, 0 to disable logging

my $log_file = "plugin_request.log";
...

my $g_LOGGING_HANDLE; # stays undef while logging is disabled

if ($LOG_ENABLED)
{
$g_LOGGING_HANDLE = open_log_file_for_writing($log_file);
}


...

sub log_entry
{
my $log_entry = shift;

return if (!defined($g_LOGGING_HANDLE));

print $g_LOGGING_HANDLE $log_entry."\n";
}

sub close_log_file
{
# nothing to close when logging is disabled
return if (!defined($g_LOGGING_HANDLE));

print "INFO Closing log file...\n";
close($g_LOGGING_HANDLE);
}

Friday, September 11, 2009

C++: if file line length exceeds array (buffer) length

Suppose we have implemented the following scenario:


// read input file line by line
// allocate 256 characters for each line

ifstream input_file("some_file.txt");
const int BUF_SIZE=256;
char buf[BUF_SIZE];
string s, strCurString;

if (!input_file.is_open())
{
cerr << "File some_file.txt could not be opened!" << endl;
getch();
exit(EXIT_FAILURE);
}

while(!input_file.eof()) {
input_file.getline(buf, BUF_SIZE);
strCurString = buf;
s += strCurString;
}

cout << "File contents: " << endl << s << endl;


But what if the length of the current line exceeds BUF_SIZE? In this case the while loop never ends and becomes an infinite loop. Why? Because getline() sets a special bit (failbit) in the input file stream object, saying that the last getline() operation failed (here not because the end of the file was reached, but because the line did not fit into the buffer). After that, all subsequent calls to getline() fail to read anything, so eof() never becomes true. You can see this by calling input_file.gcount(), which keeps returning 0 (zero) after the getline() call that set the failbit.


To overcome this, we can use the following trick:


// read input file line by line
// allocate 256 characters for each line

ifstream input_file("some_file.txt");
const int BUF_SIZE=256;
char buf[BUF_SIZE];
string s, strCurString;

if (!input_file.is_open())
{
cerr << "File some_file.txt could not be opened!" << endl;
getch();
exit(EXIT_FAILURE);
}

while(!input_file.eof()) {
input_file.getline(buf, BUF_SIZE);

// remember about failbit when amount of
// characters in the current line is
// more than BUF_SIZE
if (input_file.fail() && !input_file.eof())
// clear up the failbit and
// continue reading the input file
input_file.clear();
strCurString = buf;
s += strCurString;
}

cout << "File contents: " << endl << s << endl;

Monday, August 31, 2009

HackDay'09

HackDay'09 is about to happen in the glorious city on the Neva river, Saint-Petersburg! Together with our friends we have decided to participate with our ideas based on semantic analysis. Let's see where the HackDay takes us. On Friday we are heading there!

Sunday, August 2, 2009

Porvoo is a unique old city, but Old Town is disgusting

Introduction
Surprised by the title? Read on for the details.

Visiting Porvoo
The unique old city is the old city of Porvoo [1], which we visited today. It was a pleasure, though it rained almost all the time. Some photos are here [2]; the comments are in Russian, though (well, as it was once said in the humorous Russian show KVN, "learn Russian in order to understand humour", and vice versa). Porvoo may have become widely known thanks to the fact that Finland was declared an autonomous Grand Duchy of Russia in the local cathedral.

Related work and criticism
I usually try to avoid posting any anti-ads in this blog, but this time it is unavoidable. This is about the second part of the title: Old Town, a restaurant reputed to have refined taste and special service. Well, it was already about 16:00, but should that make any difference to the service quality? First, we waited far too long to place an order, going from being *very* hungry to thinking about leaving with nothing for our pains. When the waitress finally came, she got the soup order completely wrong, so two of us had a soup (delicious, nothing to say!) instead of one. But the adjective in the title starts with the pasta. Have you ever eaten fast-food spaghetti, maybe somewhere in a student dormitory or at a camp? Take that spaghetti, add some seafood (which was nice, however) and put it into the microwave. The taste was something like that. The culmination was the dessert that we ordered and that never arrived. As some random passers-by had apparently taken all the cakes we had ordered with black tea, the waitress was not very embarrassed to inform us that only one cake was left. Why did that happen? Because the restaurant has a side feature: a cafe. Sounds like a multitier technology, right? In practice it averages out to a (restaurant + cafe) / 2 quality level. The place managed to blur our impression of Porvoo, as we left the city right after that, but thinking about it now, about 5 hours later, I generally feel the journey was pleasant. We decided not to throw our wrathful feelings in their faces, but to publish this post.

Discussion
We thought about why the restaurant quality was *that* poor and came to the conclusion that its target audience is tourists, which means mostly occasional visitors, who will not remember it (as they usually don't come back) even if it wasn't worthwhile. Also, if you do care about your clients, think twice before coupling a restaurant with a cafe in the same cramped premises.

Conclusion
Instead of blurring, use sharpening: go for a home pizza, or maybe visit some really expensive place to eat, and polish the collected impressions of your day in Porvoo.

Bibliography
[1] http://en.wikipedia.org/wiki/Porvoo_Cathedral
[2] http://picasaweb.google.ru/dmitry.kan/Porvoo#