Saturday, December 19, 2009


What is RuSSIR? It is the summer school on Information Retrieval. Where is it held? Usually in Russia, this time in Voronezh. Want to apply? Read on.

4th Russian Summer School in Information Retrieval (RuSSIR 2010)
Monday September 13 - Saturday September 18, 2010
Voronezh, Russia


The 4th Russian Summer School in Information Retrieval (RuSSIR 2010) will be held on September 13-18, 2010 in Voronezh, Russia, one of the major cities in south-western Russia. The mission of the school is to teach students about modern problems and methods in Information Retrieval; to stimulate scientific research in the field of Information Retrieval; and to create an opportunity for informal contacts among scientists, students and industry professionals. The Russian Conference for Young Scientists in Information Retrieval will be co-located with the school. RuSSIR 2010 will offer 4 or 5 courses and host approximately 100 participants. The working languages of the school are English (preferable) and Russian. The target audience of RuSSIR is advanced graduate and PhD students, post-doctoral researchers, academic and industrial researchers, and developers.

The RuSSIR 2010 Organizing Committee invites proposals for courses on a wide
range of IR-related topics, including but not limited to:
- IR theory and models
- IR architectures
- Algorithms and data structures for IR
- Text IR
- Multimedia (including music, speech, image, video) IR
- Natural language techniques for IR tasks
- User interfaces for IR
- Web IR (including duplicate detection, hyperlink analysis, query logs)
- Text mining, information and fact extraction
- Mobile applications for IR
- Dynamic media IR (blogs, news, WIKIs)
- Social IR (collaborative filtering, tagging, recommender systems)
- IR evaluation.

Each course should consist of five 90-minute sessions (normally on five consecutive days). The course may include both lectures and practical exercises in computer labs.

RuSSIR 2010 organizers will cover travel expenses and accommodation at the school for one lecturer per course, but there is no additional honorarium. The RuSSIR organizers would highly appreciate it if, whenever possible, lecturers could find alternative funding to cover travel and accommodation expenses and indicate this possibility in the proposal.

Course proposals for RuSSIR 2010 must be submitted by email to Pavel Braslavski by February 14, 2010. A course proposal should contain a brief description of the course (up to 200 words), preferred schedule, prerequisites, equipment needs, a short description of teaching/research experience and contact information of the lecturer(s). All proposals will be evaluated by the RuSSIR 2010 program committee according to the school goals, presentation clarity, lecturer’s qualifications and experience. Topics not featured at previous RuSSIRs are preferred. All submitters will be notified by March 1, 2010. Early informal inquiries about the school or the proposal evaluation process are encouraged.

About RuSSIR: The Russian Summer School in Information Retrieval is co-organized by the Russian Information Retrieval Evaluation Seminar (ROMIP) and Voronezh State University. Previous schools took place in Ekaterinburg, Taganrog, and Petrozavodsk. Previous RuSSIR courses included IR Models (by Djoerd Hiemstra), Modeling Web Searcher Behavior and Interactions (by Eugene Agichtein), Computational Advertising (by James Shanahan), Text Mining, Information and Fact Extraction (by Marie-Francine Moens), Natural Language Processing for Information Access (by Horacio Saggion), Music IR (by Andreas Rauber), and others. Ricardo Baeza-Yates, VP of Research for Europe and Latin America at Yahoo, has confirmed as an invited lecturer for RuSSIR 2010 with the course 'Web data mining'.

About the RuSSIR 2010 location: Voronezh is a major city in southwestern Russia, spanning both sides of the Voronezh River, with a population of 850,000. Express trains from Moscow to Voronezh take about 10 hours. There are also regular flights from Moscow, Munich, Prague, Tel-Aviv, and Istanbul. The town was founded in 1586. In the 17th century, Voronezh gradually evolved into a sizeable town, especially after Tsar Peter the Great built a dockyard there. Currently, Voronezh is an administrative, economic and cultural center of the Voronezh region. The surrounding area has many attractions, including an archeological museum, the nature and historical reserve Divnogorie, the Kostomarovo cave monastery, and the Orlov trotter stud farm at Khrenovoe. Voronezh has a large student population: 37 institutions of higher education and 53 colleges educate over 127,000 students today. Voronezh State University was founded in 1918 and is one of the largest universities in Russia, with a total enrollment of 22,000.


To get in touch, use the e-mail address school[at]romip[dot]ru, substituting @ for [at] and "." for [dot].

Monday, November 30, 2009

Monday, October 12, 2009

Augmented Reality with Adobe Flash

Having carefully followed the Adobe MAX conference, I have found a groundbreaking presentation on Augmented Reality with Flash.

As the presenters define it, AR is superimposing "graphics over real-world environments in realtime". So I checked out the exciting (and long) tutorial on how to set up the development environment and hit the road. To run a quick demo from the tutorial I printed a black and white pattern (it makes sense to make it smaller than I did, as a small one is much easier to handle). Before diving into details I wanted to get an idea of how it was going to look, so I made a small video which I share with you (please be patient about the video quality and mute your player).

It is too early to draw any conclusions on the future of AR, but it sounds like a very exciting field of software development. It is where creative people get together and come up with an exciting business card and all sorts of interesting things.

Wednesday, September 16, 2009

Sting's programmer's mind

First impression of Sting's web-site is his quote of the day:

"I quite like using songs as a modular system where you can mix and match lines from different songs. It's a tradition now and people expect it. Basically, it's all one big song. You could say it was an aspect of postmodernism if you liked but you'd be called pretentious if you said that."

Perl: concise way to map one array onto another in a Perl hash

Suppose I have two arrays. Suppose further that the elements of one of these arrays logically map onto the elements of the other.

my @ar1 = (...);
my @ar2 = (...);

An easy way to map ar1 (keys) onto ar2 (values) in Perl is a hash slice:

my %hash;
@hash{@ar1} = (@ar2) x @ar1;
# note: the "x @ar1" repetition is harmless but not required;
# @hash{@ar1} = @ar2; produces the same hash

Important assumption: the order of elements in these two arrays matters. In other words, the first element of ar1 maps to the first element of ar2, ..., the n-th element of ar1 maps onto the n-th element of ar2, and there are exactly n elements in both arrays.


With unique keys the hash obviously preserves the correct mapping (add use Data::Dumper; to your code):

sub unique_mapping {
    my @ar1 = ('a', 'b', 'c', 'd', 'e');
    my @ar2 = ('1', '2', '3', '4', '5');

    print Dumper(\@ar1);
    print Dumper(\@ar2);

    my %hash;
    @hash{@ar1} = (@ar2) x @ar1;

    print Dumper(\%hash);
}



$VAR1 = [
          'a',
          'b',
          'c',
          'd',
          'e'
        ];
$VAR1 = [
          '1',
          '2',
          '3',
          '4',
          '5'
        ];
$VAR1 = {
          'e' => '5',
          'c' => '3',
          'a' => '1',
          'b' => '2',
          'd' => '4'
        };

When the keys are not unique, the mapping is probably not what you want: for a repeated key a later value silently overwrites the earlier one (here 'b' ends up mapped to '3' and the value '2' is lost):

sub keys_non_unique_mapping {
    my @ar1 = ('a', 'b', 'b', 'd', 'e');
    my @ar2 = ('1', '2', '3', '4', '5');

    print Dumper(\@ar1);
    print Dumper(\@ar2);

    my %hash;
    @hash{@ar1} = (@ar2) x @ar1;

    print Dumper(\%hash);
}



$VAR1 = [
          'a',
          'b',
          'b',
          'd',
          'e'
        ];
$VAR1 = [
          '1',
          '2',
          '3',
          '4',
          '5'
        ];
$VAR1 = {
          'e' => '5',
          'a' => '1',
          'b' => '3',
          'd' => '4'
        };

Monday, September 14, 2009

Logging: helpful Perl snippet to start with

I needed to set up simple logging in my small Perl app, which serves as a plugin for the company's "big" product. Here is what I have come up with:

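use strict;
use warnings;

my $log_file = "file_string_here";    # placeholder: path to your log file

my $LOG_HANDLE = open_log_file_for_writing($log_file);

log_entry($LOG_HANDLE, "Logging started");

log_entry($LOG_HANDLE, "Logging finished");

close_log_file($LOG_HANDLE);


# opens the log file for appending; returns the handle, or undef on failure
sub open_log_file_for_writing {
    my $log_file = shift;
    my $LOGGING_HANDLE;

    print "INFO Opening log file...\n";

    unless (open $LOGGING_HANDLE, ">>", $log_file) {
        return undef;
    }

    # start every logging session with a timestamp
    my $current_time = localtime;
    print $LOGGING_HANDLE "\n".$current_time."\n";

    return $LOGGING_HANDLE;
}

# writes one line to the given log handle
sub log_entry {
    my $LOGGING_HANDLE = shift;
    my $log_entry = shift;

    print $LOGGING_HANDLE $log_entry."\n";
}

sub close_log_file {
    my $LOGGING_HANDLE = shift;
    print "INFO Closing log file...\n";
    close $LOGGING_HANDLE;
}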

upd: the logging handle can be made global to simplify the use of logging. It comes at the cost of a global variable, but might still suit moderately sized Perl scripts. The code will change a bit:

my $g_LOGGING_HANDLE = open_log_file_for_writing($log_file);
log_entry("Logging started");

sub log_entry {
    my $log_entry = shift;

    print $g_LOGGING_HANDLE $log_entry."\n";
}

Now you can quickly call log_entry("log entry goes here") from wherever you want, for example deep inside some procedure or function, without the need to pass the logging handle down as well.

upd1: If you want to control whether to log or not, another small modification will do it for you:

my $LOG_ENABLED = 1; # put 1 to enable logging, 0 to disable logging

my $log_file = "plugin_request.log";

# the handle stays undefined when logging is disabled
my $g_LOGGING_HANDLE;
if ($LOG_ENABLED) {
    $g_LOGGING_HANDLE = open_log_file_for_writing($log_file);
}

sub log_entry {
    my $log_entry = shift;

    return if (!defined($g_LOGGING_HANDLE));

    print $g_LOGGING_HANDLE $log_entry."\n";
}

sub close_log_file {
    return if (!defined($g_LOGGING_HANDLE));

    print "INFO Closing log file...\n";
    close $g_LOGGING_HANDLE;
}

Friday, September 11, 2009

C++: if file line length exceeds array (buffer) length

Suppose we have implemented the following scenario:

// read input file line by line
// allocate 256 characters for each line

ifstream input_file("some_file.txt");
const int BUF_SIZE=256;
char buf[BUF_SIZE];
string s, strCurString;

if (!input_file.is_open())
    cerr << "File some_file.txt could not be opened!" << endl;

while(!input_file.eof()) {
    input_file.getline(buf, BUF_SIZE);
    strCurString = buf;
    s += strCurString;
}

cout << "File contents: " << endl << s << endl;

But what if the length of the current line exceeds BUF_SIZE? Well, in this case the while loop will never end, becoming an infinite loop. Why? Simply because a special bit (failbit) will be set in the input file stream object, saying that the last getline() operation has failed (in this case not due to the end of the file, but because the buffer was filled before the end of the line was reached). From then on all subsequent calls to getline() fail to read anything (this can be seen by calling input_file.gcount(), which constantly returns 0 (zero) after the getline() call that set the failbit), so the end of the file is never reached and eof() never becomes true.
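The stuck state can be reproduced without a file. The following is only a minimal sketch: the helper read_bounded, the struct ReadResult and the tiny buffer size kBufSize are illustrative names that are not part of the original snippet, and a std::istringstream is assumed to behave like the file stream here.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper type: what one bounded getline() call left behind.
struct ReadResult {
    std::string text;       // characters stored in the buffer
    std::streamsize count;  // value of gcount() after the call
    bool failed;            // true if failbit (or badbit) is set
};

// Deliberately tiny buffer so that a 10-character line overflows it.
constexpr std::size_t kBufSize = 8;

// Performs a single istream::getline() call into a fixed-size buffer.
ReadResult read_bounded(std::istream& in) {
    assert(kBufSize > 1);       // room for one character plus '\0'
    char buf[kBufSize] = {};    // zero-filled, so buf is always a valid C string
    in.getline(buf, static_cast<std::streamsize>(kBufSize));
    return ReadResult{std::string(buf), in.gcount(), in.fail()};
}
```

Called twice on a std::istringstream holding "0123456789\nab\n", the first call stores the first 7 characters and reports a failure; the second call reads nothing and gcount() is 0, while eof() on the stream is still false. That is exactly the condition under which while(!input_file.eof()) spins forever.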

To overcome this, we can use a trick found here:

// read input file line by line
// allocate 256 characters for each line

ifstream input_file("some_file.txt");
const int BUF_SIZE=256;
char buf[BUF_SIZE];
string s, strCurString;

if (!input_file.is_open())
    cerr << "File some_file.txt could not be opened!" << endl;

while(!input_file.eof()) {
    input_file.getline(buf, BUF_SIZE);

    // remember about failbit when amount of
    // characters in the current line is
    // more than BUF_SIZE
    if (input_file.fail() && !input_file.eof())
        // clear up the failbit and
        // continue reading the input file
        input_file.clear();

    strCurString = buf;
    s += strCurString;
}

cout << "File contents: " << endl << s << endl;
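The same idea can be packaged as a function, next to an alternative that avoids the fixed buffer altogether. This is a sketch under assumptions: the function names read_lines_bounded and read_lines_unbounded are made up for illustration, both take any std::istream, and both drop the newlines just as the snippet in this post does. The bounded version additionally stops when a call reads nothing at all, so it cannot loop forever on a stream that failed to open.

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Concatenates all lines of `in`, reading through a buffer of
// `buf_size` characters and clearing failbit after an over-long line.
std::string read_lines_bounded(std::istream& in, std::size_t buf_size) {
    assert(buf_size > 1);  // room for at least one character plus '\0'
    std::vector<char> buf(buf_size);
    std::string s;
    for (;;) {
        in.getline(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount();
        // Nothing was read: end of input or a real error, so stop.
        if (in.bad() || (in.fail() && got == 0)) break;
        s += buf.data();
        if (in.eof()) break;
        // The line was longer than the buffer: clear failbit and
        // continue with the rest of the same line.
        if (in.fail()) in.clear();
    }
    return s;
}

// Alternative: std::getline() grows the std::string as needed,
// so there is no buffer limit and no failbit to clear.
std::string read_lines_unbounded(std::istream& in) {
    std::string s, line;
    while (std::getline(in, line)) s += line;
    return s;
}
```

Either function can be tried on a std::istringstream or on an opened std::ifstream. When there is no hard requirement for a fixed-size buffer, the std::getline() variant is usually the simpler choice.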

Monday, August 31, 2009


HackDay'09 is about to happen in the glorious city on the Neva river -- Saint-Petersburg! Together with our friends we have decided to participate with our ideas based on semantic analysis. Let's see where HackDay takes us. On Friday we are heading there!

Sunday, August 2, 2009

Porvoo is a unique old city, but Old Town is disgusting

Surprised by the title? Read on for the details.

Visiting Porvoo
The unique old city is the old city of Porvoo [1], which we visited today. It was a pleasure, though it rained almost all the time. Some photos are here [2]. The comments are in Russian though (well, as it was once said in the humorous Russian show KVN, "learn Russian in order to understand humour", and vice versa). Porvoo is widely known because it was in the local cathedral that Finland was declared an autonomous Grand Duchy within Russia.

Related work and criticism
I usually try to avoid posting any anti-ads in the blog, but this time it's unavoidable. This is about the second part of the title: Old Town, a restaurant with reportedly refined taste and special service. Well, it was already about 16:00, but should that make any difference to the service quality? First we waited far too long to place an order, going from being *very* hungry to thinking of leaving with nothing for our pains. When the waitress finally came, she got the soup order completely wrong, so two of us had soup (delicious, nothing to complain about!) instead of one. But the adjective in the title starts with the pasta. Have you ever eaten fast-food spaghetti? Somewhere in a student dormitory, maybe, or at a camp. So take this spaghetti, add some seafood (which was nice, however) and put it into the microwave. The taste was something like that. The culmination was the dessert that we ordered and that never arrived. Since some random passers-by had apparently taken all the cakes we had ordered together with black tea, the waitress was not too embarrassed to inform us that only one cake was left. Why did that happen? Because the restaurant had another side feature: a cafe. Sounds like a multitier technology, right? Which in practice averages out to a (restaurant + cafe) / 2 quality level. The place managed to blur our impression of Porvoo, as we left the city right after that, but thinking about it now, about 5 hours later, I generally feel the journey was pleasant. We decided not to throw our wrath in their faces, but to publish this post instead.

We have thought about why the restaurant quality was *that* poor and came to the conclusion that its target audience is tourists. That means mostly occasional visitors, who will never remember it (as they usually don't come back) even if it wasn't worthwhile. Also, if you do care about your clients, think twice before coupling a restaurant with a cafe in the same cramped premises.

Instead of blurring, use sharpening: go for a homemade pizza, or maybe visit some really expensive place to eat, and polish the impressions collected during your day in Porvoo.


Friday, July 31, 2009

Perl: trust programmatic access to VB Project

It has already become a tradition to cross-post useful bits of information, so let me continue.

If you run into the following problem when accessing the VB Project programmatically from your Perl script:

"Programmatic access to Visual Basic Project is not trusted"

then you will need to allow and trust the access in your target MS Office app:

Office 2003 and Office XP

1. Open the Office 2003 or Office XP application in question. On the Tools menu, click Macro, and then click Security to open the Macro Security dialog box.
2. On the Trusted Sources tab, click to select the Trust access to Visual Basic Project check box to turn on access.
3. Click OK to apply the setting. You may need to restart the application for the code to run properly if you automate from a Component Object Model (COM) add-in or template.

Office 2007

1. Open the 2007 Microsoft Office system application in question. Click the Microsoft Office button, and then click Application Options.
2. Click the Trust Center tab, and then click Trust Center Settings.
3. Click the Macro Settings tab, click to select the Trust access to the VBA project object model check box, and then click OK.
4. Click OK.

Tuesday, July 28, 2009

Java: encoding and listing files in a directory

Two code snippets, tested to work under win32 at least.

Encoding. When your input text files are in an encoding different from your platform's default, there is no way to use FileReader. Instead you should go deeper into the class hierarchy and specify the file encoding, which you must know in advance. Code (inspired by answers here):

private static String loadFileContents(String filename)
        throws FileNotFoundException, IOException {
    StringBuilder contents = new StringBuilder();

    InputStream is =
        new BufferedInputStream(new FileInputStream(filename));
    // decode the bytes with an explicit encoding instead of the platform default
    Reader reader = new InputStreamReader(is, "UTF-8");

    BufferedReader bufferedReader =
        new BufferedReader(reader);

    try {
        String line;
        while ( (line = bufferedReader.readLine()) != null ) {
            contents.append(line).append("\n");
        }
    } finally {
        // closing the outermost reader also closes the wrapped streams
        bufferedReader.close();
    }

    return contents.toString();
}

Directory listing. Suppose you want to list the files in a directory based on some filename criteria. An easy way to do this is to implement a custom filename filter. Entire code (seen here):



import java.io.File;
import java.io.FilenameFilter;

/**
 * @author Dmitry_Kan
 * Lists given directory based on filename pattern
 */
public class DirectoryLister {

    private String filenamePattern;
    private String dirname;

    public DirectoryLister(String filepat, String directory) {
        filenamePattern = filepat;
        dirname = directory;
    }

    public String[] getListOfFiles() {
        // accepts names that contain both the pattern and "txt"
        class CustomFilter implements FilenameFilter {
            public boolean accept(File dir, String s) {
                if (s.contains(filenamePattern) && s.contains("txt") )
                    return true;
                return false;
            }
        }
        return new File(dirname).list(new CustomFilter());
    }
}


Wednesday, July 22, 2009

tabs in Firefox 3.5.1

Just got amazed: it is so easy to drag-n-drop tabs in Firefox 3.5.1. If you have two browser windows open and you want to move a tab from one of them to the other, drag the tab and drop it onto the tab area of the other Firefox window.

Friday, July 17, 2009

Post builder script for CGI development in Eclipse


In CGI development it can be a great help to have some sort of post-build script. The primary aim of this script is to update the CGI script(s) and modules on the server when you build your Perl project under Eclipse.

So I have come up with the following:

#!/usr/bin/perl -w

use strict;

# append "/" to the target path if required
my $target_path = $ARGV[0];
if (not( $target_path =~ m/\/$/)) {
    $target_path .= "/";
}

# copy cgi script: what, from where, to where
my $file = "uni_search.cgi";
my $copy_base_path = "../";
my $copy_from = $copy_base_path.$file;
my $copy_to = $target_path.$file;

copy($copy_from, $copy_to);

# copy Perl modules
my @modules = ("", "", "");
my $modules_path = "some_path/";
$copy_base_path = "./";
my $target_copy_base_path = $target_path.$modules_path;

foreach my $module(@modules) {
    copy($copy_base_path.$module, $target_copy_base_path.$module);
}

sub copy {
    my $copy_from = shift;
    my $copy_to = shift;

    my $result;

    # chmod'ing ("your_sudo_password_here" is a placeholder)
    $result = system("echo "."your_sudo_password_here".
        " | sudo -S chmod +w ".$copy_to);
    print "Changing write rights: ".$result."\n";

    # updates / copies the cgi script to server
    $result = system("sudo cp ".$copy_from." ".$copy_to);
    if ($result == 0) {
        print "Result of copying the ".$copy_from.
            " to server's cgi-bin: ".$result."\n\n";
    }
}

It is easy to add a builder which will call the script. In the Perl perspective go to Project->Properties->Builders. Select "New..." on the right. Choose "Program" in the popup window. Configure the program by specifying what to launch, the working directory and the command line arguments.

There you go. After the builder has been successfully configured, edit your code and do a full rebuild (select the root node of your project and then Project->Build Project) whenever you want to deploy your set of CGI scripts to your web server.

Thursday, July 9, 2009

Information retrieval: conceptual problem

From my experience with grep, for example, I have come to the conclusion that the main problem of current search engine interfaces (not grep-like ones, as grep serves other aims) is that they lack one major feature: visualizing the search space or, from another perspective, offering search hints.

Suppose I want to search for *something* about the HTTP protocol (some very specific detail) and I know very little about HTTP (an artificial example). So, for example, I don't even know which category that particular detail belongs to. Needless to say, I don't know the term I'm looking for. Now I'm stuck. I need some starting hints which would lead me to the first query terms apart from HTTP. Because HTTP is a very wide term in a tree of terms (like the tree root), I will otherwise spend hours and hours reading through millions of returned hits.

Instead, I would use a visualized concept graph which directly corresponds to the search index of the search engine. Then I could easily jump across those categories in a concise way, getting deeper and deeper into the HTTP topic and discovering previously unknown terms, which I could search for further. By narrowing down the search space and learning in parallel, I would come to a result faster.

Tuesday, June 30, 2009

Perl is truly weird

How else would you treat the following: $#$ar_ref?

It is actually $#{$ar_ref}, which is the last index of the array referenced by $ar_ref.

Friday, June 26, 2009

Michael's last curtain call

It suddenly came to my mind right after I learned about Michael's passing away: the world has changed, or is about to change, when *SUCH* people leave it..

But the strangest thing is that now, when I see him performing on my screen, I do not *feel* that he has passed away. It is a half sad and half light feeling about what happened.

"This is it. This is the final curtain call." Michael Jackson

Wednesday, June 24, 2009

Gmail: spam handling as a user experience

It might be a better idea to post the text below as a feature request to the Gmail team directly, but I post it here for two reasons:

1) I'm rather busy at work to find their blog
2) let's check Google's track_everything_happening_on_the_web property

About spam again. Recently I have been receiving quite a bulk of these pleasant messages (giving the unique opportunity to enlarge something on a recipient's body etc) - which results in about 100+ messages / day filtered out to the spam folder.

I am not sure about the habits of the majority of e-mail users, but I tend to check all the spam messages (don't worry, only the titles; well, rarely the body as well) and get rid of them manually, rather than leaving the engine to delete them automatically after N days. I should note the precision of the spam filtering algorithm Google has in Gmail: it works perfectly in my case. But still, even the tiniest 0,(0*)1% probability of overlooking a valid message in the spam folder may cause hard times and irreversible processes in my brain, which I do not want to happen.

So here is my idea. Suppose I have 10 pages of junk piled up in the spam folder. When I open one page and start looking through the e-mail subjects, I obviously spend some time on it (like 2-3 minutes for 100 messages). Once I have done this, I may or may not proceed to the next spam page, but regardless of that Gmail internally marks the page as "read". When I sign out, or after a while (another configurable parameter), Gmail silently deletes all the "read" messages.

For the sake of flexibility there can be a third feature: a button (or a fixed option) that explicitly prohibits silently removing messages even though they have been read.

Friday, June 19, 2009

Vocabulary size: test results

So, your vocabulary: Also a very good result. Your vocabulary is well above average
Take the test

Tuesday, June 16, 2009

Perl's flexibility

I suspected Perl to be rather flexible and to implement the principle "type as you think and get it working" (which does not always apply, as Perl is 'weird' by definition), but I could not imagine it goes that far:

sub getHeaderName {
    return (split " - ", shift)[1]; # let's see if it works. Upd: it works!
}

Here the input string has the form "process_type - process_name" and I need the process_name value from it.

Monday, June 15, 2009

spam: get phd diploma!

"GET YOUR DIPLOMA TODAY!If you are looking for a fast and cheap way to get a diploma, this is the best way out for you. Choose the desired field and degree and call us right now: For US: some_num Outside US: some_another_num
"Just leave your NAME & PHONE NO. (with CountryCode)" in the voicemail.

Our staff will get back to you in next few days!"

I wonder whether this fake diploma would help me, e.g. when applying for jobs at Google and later at work.

Wednesday, June 10, 2009

Google's search suggestions

Is it "null" that Google predicted to be next in the sequence, or is it that what Google predicted turned out to be null (a system issue)?

Tuesday, June 9, 2009

Syntax and semantics

It may be a consequence of my PhD topic, which has made my eye keener, but I come across the above two terms every now and then. People refer to syntax as rules, and to semantics as, in a way, an invariant of those rules: what one basically wishes to express using the rules.

Today I've bumped into an article with the following title: "Syntax is from Mars while Semantics from Venus! Insights from Spectral Analysis of Distributional Similarity Networks.":

"We observe that while the syntactic network has a hierarchical structure with strong communities and their mixtures, the semantic network has several tightly knit communities along with a large core without any such well-defined community structure."

Another example comes from a functional specification that found me at work: "The [some company's] command language provides syntax and semantics to perform service-specific and service-independent management operations."

It is important to note that in general one can bind any semantics to a particular syntactic structure, i.e. define what exactly happens after applying this particular structure. And at this point we come to another level: pragmatics. What will happen refers to the pragmatics behind an action rather than its semantics. The semantics is just a formal intermediate representation of the syntactic structures used. I will try to gather illustrations in English later.

Sunday, June 7, 2009

Cooking and programming

Cooking is like programming, or like compiling a project from its source code. When compiling from source, just as when writing code, one sometimes has to be inventive and maybe even change the code. Therefore it is also a creative process. One more thing: an important property of compiling a proj.. cooking your favourite dish is that one can do it in any country, even if this dish is not cooked in that particular country.

Friday, May 22, 2009

A researcher's credo, or how to survive in science

Today I attended a very interesting lecture on machine translation, where in the informal part the speaker (Matti Kääriäinen) discussed the problems of the academic atmosphere. The main problem, in his view, is that most researchers aim at publishing papers for the sake of publishing papers. He also outlined the steps of what he considers a sound routine:

1) set a scientific goal and imagine that it has been achieved: what then? How will the world change? If you cannot think further than solving the problem, pick another problem;
2) keep the pace: do at least a little today (unless there are strong reasons to postpone it until tomorrow);
3) publish your achievements in any form: papers, source code, etc.

Unnecessary risks should also be taken off the researcher's shoulders. If, for example, one group studying the genome believes that the intelligence gene is on chromosome 7 and another believes it is on chromosome 9, then half a year of work by both groups ending with the result "the gene is on chromosome 9" should not mean the end of the careers of the first group's members. This responsibility should lie with the professor as the head of the research group.

And what happens in practice: one researcher got a 5-year contract with a university. Half a year went by actively, and after that he began to think constantly: what happens in 4.5 years? And so it is with very many people: one has to make a living, and people think about it a great deal, trying to meet all the formal requirements for a contract extension and forgetting about the main goal.

Something to think about.

Tuesday, April 28, 2009

IR workshop @ AM-CP

The IR workshop @ AM-CP has hosted another talk of mine. This time I went into more detail on the functional theory of natural language and on computer semantics. The audience was rather large as well: 10-15 people. After diving into computer semantics we turned to the topic of Machine Translation, briefly covering statistical and classical approaches. It was also important to watch how the students followed the talk and to keep them laughing from time to time to break the ice.


Saturday, April 11, 2009


For those of you who silently keep track of what's going on, here is my message: last week I gave a conference talk at CPS'09 (Control Processes and Stability). The talk is in Russian. In brief: the article proposes an automatic method for creating a semantic translation dictionary ru->en. The method is based on statistical approaches and implemented with the use of GIZA++. A comparison of an experimental machine translation system based on the dictionary with an SMT system based on Moses is provided.

Monday, March 30, 2009

autoflush feature in ActivePerl

I tend to post here only those recipes that have worked for me in practice.

There was another feature I couldn't get working under win32 with ActivePerl. It is autoflush, i.e. automatically flushing the buffer content to the physical file.

Spotted here. The solution is to make the filehandle "hot":

select((select(FILEHANDLE), $| = 1)[0]); # make the handle hot

Sunday, March 22, 2009

Analogy of a face search in a human brain

While I was preparing my breakfast, the thought of a facial expression flashed across my mind. The small picture was so vivid that I immediately started to search for its owner in my memory. This is the exact verb I should use here: to search. The search took just about two seconds (lucky me), and I started to analyze how it happened: what algorithm I had potentially used and what its complexity was. As I had met the owner of the face about a week earlier, I believe the search took place in a cache and, on the one hand, looked like a linear search. On the other hand, I distinctly realized while it was happening that I was clearly skipping some faces without any further detailed analysis (i.e. without calculating some metric to determine the relevance level). One important observation is that I did not draw a clear line in the search space based on gender: rather it was a quick search (not a reference to the existing algorithm, just an allegory) with very fast pruning of the search space.

It might turn out that existing search algorithms are not that far from what happens inside a human brain.

mood: thoughtful :)

Tuesday, March 10, 2009

podcasting on Google

I've decided to make my first podcast with a certain purpose: there is a podcast contest by budam. The general idea is to create a podcast spoken entirely in English, which must not be your native language.

The podcast is about my interview with Google in Trondheim (pity, pity, the office was recently closed) and - Google, rest assured - it does not disclose a single task from the NDA-covered list of tasks I got that long day. It actually was a long day, with 5 interviews and one lunch. I bet I drank a 5-litre bottle of water: when you speak for 45 minutes per interview, your throat easily gets parched.

Thursday, March 5, 2009

Microsoft fights itself

No, no, no. This is not any kind of political post related to the company's current state. It is not a holy war type of post either. I have just got the interesting message below, which suggested the title.

I'll translate the key phrases as well:

Data Execution Prevention - Microsoft Windows

To protect the computer, this program was shut down by the system

Name: Windows®installer
Publisher: Microsoft Corporation

Sunday, March 1, 2009

Local gravity

Imagine that you set off to buy your favourite pizza at the local pizzeria. Suppose you didn't just walk there but went by bicycle (yes, take a backpack too; I'll explain why below). Once the pizza, hot and fragrant, packed in a cardboard box, is in your hands, you start wondering how to get home with it comfortably.

If you hold the pizza on the handlebars, keeping it horizontal, it will be hard to manoeuvre, especially if the road is bumpy with ice. You could tie the pizza to the rear rack, but that is inconvenient: you have to turn around every now and then to check that the pizza is still in place.

Then you could put the pizza in the backpack. Stop! How can you put a pizza in a backpack? All its toppings would gather at the bottom of the backpack. You can! You can put the pizza into a backpack with local gravity, whose direction vector you can set yourself. Thus, gravity for the pizza in the backpack will be directed along the axis of your movement on the bicycle, but in the opposite direction.

It remains to think about how to achieve the following:

1) local gravity must be truly local, i.e. it must not act, for example, on the hat on your head;

2) local gravity must not compete with the Earth's natural gravity; ideally the two should coexist in parallel, without any interference;

3) the vector of local gravity can be changed arbitrarily.

Bon appétit! :)

Monday, February 23, 2009

Mr. Lecturer

Last week I gave lectures on Machine Translation (in Russian) at my home university, Saint Petersburg State University.

I let my students know beforehand that the course was experimental and that they were the pioneers it would be tested on.

After the 3.5-hour lecture I asked them how they felt about the experiment.

The answer was: the course was interesting. Had it been uninteresting, they would have slept. I believe it is the best compliment, especially given that a human stays concentrated only for the first 40 (15?) minutes.

Saturday, February 14, 2009

Unicode in Perl

Sometimes it feels that Perl's power in string manipulation comes at the cost of awkward syntax.

When you open a file for reading without caring what encoding its contents are in, you do:

open FILE, "<".$filename or die $!;

But if you do care about the encoding, you should open the file using the following instruction:

open ENC_FILE, "<:encoding(cp1251)", $enc_filename or die $!;

Now the key point is the comma following the encoding instruction. It makes this a three-argument open, where the mode (with its encoding layer) and the filename are separate arguments. If you put "." there instead, the mode string "<:encoding(cp1251)" is concatenated with the filename, Perl sees a two-argument open and treats everything after "<" as the filename, so the file fails to open.

Another important addition: if you know in advance which encoding the file contents are in, specify it using the encoding instruction above. That way all string data is decoded into Perl's internal representation, which by default is utf8.

Tuesday, February 10, 2009

I feel like on top of the world (c)

.. when I manage to trigger an unhandled exception in a compiler / interpreter. This time it happened with the Perl Command Line Interpreter:

Perl: file or directory

To check this, the prescription says:

if (-d $file) {
   print $file." is a directory\n";
} else {
   print $file." is a file\n";
}

When this is used together with IO::Dir, which helps you enumerate the contents of a given directory, one non-obvious step should not be forgotten:

use IO::Dir;

tie my %dir, 'IO::Dir', $dir;
foreach my $entry (keys %dir) {
   next if ($entry eq '.' or $entry eq '..');
   # important part is here: concatenation with the full path,
   # because -d resolves a bare name against the current working directory
   if (-d $dir."/".$entry) {
      print $entry." is a directory\n";
   }
}

Sunday, February 8, 2009

Simple Perl modules

Having already made it a rule to post technical details on which I have spent more than 20 minutes, I decided to post this as well.

keywords: How to write perl modules

Answer: it's simple!

Create a file StringManip.pm in a Lib/ directory, wherever you like, with the following contents:

package Lib::StringManip;

use strict;

use base 'Exporter';
our @EXPORT = ('trim');

# Perl trim function to remove whitespace from the start and end of the string
sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

# a module file must end with a true value
1;


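Add the full path of the directory that contains Lib/ to the PERL5LIB environment variable. In my case (win32) it is: PERL5LIB=%PERL5LIB%;D:\Programming\Perl

i.e. inside D:\Programming\Perl I have Lib/. In Linux/Unix: export PERL5LIB=some_path, where some_path is the directory that contains Lib/ (Perl resolves Lib::StringManip to Lib/StringManip.pm relative to each PERL5LIB entry).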

Usage snippet:

#!perl -w

use strict;
use Lib::StringManip;

print trim(" trim me! ");

Thursday, February 5, 2009

Natural Language Processing and preparation of a human brain

Having read a number of articles dealing with natural language processing (NLP), cognition and linguistics, such as Beyond Zipf's law: Modeling the structure of human language, I have come to the conclusion that NLP is in essence one of the most accurate and non-intrusive ways to understand how the human brain works.

Compare NLP, for example, to neuropsychology.

Thursday, January 29, 2009

Tuesday, January 27, 2009

How do I load an org.w3c.dom.Document from XML in a string?

The entire title of this post and its contents were copied with one aim: to give the original page a higher rank in Google PageRank.

A good reason for this is that good things should be replicated.

I have a complete XML document in a string and would like a Document object. Google turns up all sorts of garbage. What is the simplest solution? (In Java 1.5)

Solution: Thanks to Matt McMinn, I have settled on this implementation. It has the right level of input flexibility and exception granularity for me. (It's good to know if the error came from malformed XML - SAXException - or just bad IO - IOException.)

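public static org.w3c.dom.Document loadXMLFrom(String xml)
        throws org.xml.sax.SAXException, java.io.IOException {
    // wrap the string's bytes in a stream and delegate to the stream overload
    return loadXMLFrom(new java.io.ByteArrayInputStream(xml.getBytes()));
}

public static org.w3c.dom.Document loadXMLFrom(java.io.InputStream is)
        throws org.xml.sax.SAXException, java.io.IOException {
    javax.xml.parsers.DocumentBuilderFactory factory =
            javax.xml.parsers.DocumentBuilderFactory.newInstance();
    javax.xml.parsers.DocumentBuilder builder = null;
    try {
        builder = factory.newDocumentBuilder();
    } catch (javax.xml.parsers.ParserConfigurationException ex) {
        // a broken parser setup is unrecoverable here, so fail loudly
        throw new IllegalStateException(ex);
    }
    org.w3c.dom.Document doc = builder.parse(is);
    return doc;
}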
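
As a possible variation (a sketch of mine, not part of the copied answer), the string can be handed to the parser as characters rather than bytes, which sidesteps the question of which charset to use for the byte conversion. The class name XmlStringDemo and the helper name parse are hypothetical; InputSource and StringReader are standard JDK classes.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmlStringDemo {
    // Hypothetical helper: parses XML held in a String via a character stream.
    static Document parse(String xml) throws SAXException, IOException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            // InputSource wraps the Reader, so no byte/charset conversion is involved
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException ex) {
            // same policy as above: a misconfigured parser is not recoverable here
            throw new IllegalStateException("XML parser is misconfigured", ex);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<post><title>hello</title></post>");
        // prints the name of the root element: post
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Malformed input still surfaces as a SAXException, so the exception granularity described above is preserved.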

Saturday, January 17, 2009

GUI vs Command Line

For those in software development and aiming at more, here is my observation:

when one launches a task in a GUI and it somehow freezes, there is no painless way to stop that particular task other than killing the entire GUI.

when one launches a task in the command line and suspects it has frozen, one simply stops the task with CTRL+X, CTRL+Z or CTRL+C (whichever key combination does it).

The point is: I would aim at multithreading (or even creating separate processes) for time-consuming tasks so that I could easily undo any task-related actions. I would be really happy to have functionality where I can press CTRL+Z to undo any action, be it a new thread/process for a task or typing a letter in a sentence.
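
To make the multithreading idea a bit more concrete, here is a minimal sketch in Java, assuming the time-consuming task cooperates by checking its thread's interrupt flag. The names CancellableTaskDemo, stuckTask and runAndCancel are made up for illustration; the ExecutorService and Future calls are from the standard java.util.concurrent package. It only shows stopping one task without killing the whole program, not a general undo.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CancellableTaskDemo {
    // Hypothetical "frozen" task: it loops until its thread is interrupted.
    static Runnable stuckTask() {
        return () -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    // restore the interrupt flag so the loop condition ends the task
                    Thread.currentThread().interrupt();
                }
            }
        };
    }

    // Runs the task on a worker thread; if it has not finished within
    // timeoutMillis, cancels just that task. Returns true if it was cancelled.
    static boolean runAndCancel(Runnable task, long timeoutMillis) throws InterruptedException {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        Future<?> handle = worker.submit(task);
        try {
            handle.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false; // the task finished on its own
        } catch (TimeoutException e) {
            return handle.cancel(true); // interrupts the worker thread only
        } catch (ExecutionException e) {
            return false; // the task failed by itself, nothing to cancel
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        boolean cancelled = runAndCancel(stuckTask(), 200);
        System.out.println("cancelled: " + cancelled);
    }
}
```

The calling thread (the "GUI" in the scenario above) stays alive throughout; only the worker is interrupted.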

Thursday, January 8, 2009

Unusual Billie Jean video

It was fun and an honour to be a fan of Michael in my youth. This video looks very unusual and at the same time unique compared with what you can find for the video search query "Billie Jean":