Friday, November 19, 2010

"OOPness" in Java

Optimising my code led me to a refactoring in which several methods with return values were merged into one. Since Java offers no "pass by reference" for parameters (primitives are passed by value, and for objects such as collections the reference itself is passed by value), I had to come up with an inner class to hold all the return values.

So by not supporting "pass by reference", Java pushes your code to be even more OOP-ish. The topic may be well known, but this article helps to put things together and refresh the memory.
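
To make the idea concrete, here is a minimal sketch of such a holder class. The names (MergedMethodDemo, Result, analyze) and the word-counting task are hypothetical and only serve as an illustration; they are not taken from the code that was actually refactored.

```java
import java.util.ArrayList;
import java.util.List;

public class MergedMethodDemo {

    // Hypothetical holder for everything the merged method has to return.
    static class Result {
        final int count;           // a primitive result
        final List<String> items;  // a collection result

        Result(int count, List<String> items) {
            this.count = count;
            this.items = items;
        }
    }

    // One pass over the input instead of two separate methods,
    // one returning the count and one returning the list.
    static Result analyze(String text) {
        List<String> longWords = new ArrayList<>();
        int total = 0;
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // empty input yields one empty token
            }
            total++;
            if (word.length() > 3) {
                longWords.add(word);
            }
        }
        return new Result(total, longWords);
    }

    public static void main(String[] args) {
        Result r = analyze("a quick brown fox");
        System.out.println(r.count + " words, long ones: " + r.items);
    }
}
```

The caller gets both values from a single call and reads them as fields of the returned object, which is what the out-parameters would otherwise have been used for.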

Wednesday, November 3, 2010

Successive replacement in regular expressions (java)

I am not sure how often people out there need successive replacement in a target text driven by a regular expression pattern, but Java has a rather neat solution for it. I'm publishing it here because I know that younger developers especially may reinvent the wheel here and end up with long debugging sessions.

So the task is like this: you have a text T, like "cat-1 dog-1 cat-1 elephant-1 cat-2 dog-2 cat-3".
Suppose we want to change numerals attached to the words "cat" to their word representations: "1" to "one", "2" to "two".

One straightforward way would be to match all "cat-([0-9]+)" subsequences and then run a replace operation on T.

So the code would look something like this:


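import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String T = "cat-1 dog-1 cat-1 elephant-1 cat-2 dog-2 cat-3";
Pattern catPattern = Pattern.compile("cat-([0-9]+)");
// the matcher keeps scanning the original string, even though T is reassigned below
Matcher catMatcher = catPattern.matcher(T);
Map<String, String> numToWord = new HashMap<String, String>();
numToWord.put("1", "one");
numToWord.put("2", "two");
numToWord.put("3", "three"); // ...
while (catMatcher.find())
{
// replaces the first occurrence of the numeral anywhere in T, not only after "cat-"
T = T.replaceFirst(catMatcher.group(1), numToWord.get(catMatcher.group(1)));
}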


This code produces:

cat-one dog-one cat-1 elephant-1 cat-two dog-2 cat-three

This misses one substitution (the second "cat-1") and wrongly changes "dog-1". OK, let's use replaceAll instead and make sure we touch only cats:


while (catMatcher.find())
{
T = T.replaceAll("cat-" + catMatcher.group(1), "cat-" + numToWord.get(catMatcher.group(1)));
}


which produces what we want:

cat-one dog-1 cat-one elephant-1 cat-two dog-2 cat-three

But now what happens inside the loop is logically out of sync with the loop condition: we iterate over individual matches, yet call replaceAll for each of them. It is probably not efficient either, since replaceAll is attempted again for duplicate matches even when nothing is left to replace.

Any more elegant and correct solution?

Yes! It is called Matcher.appendReplacement


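// T holds the original text here
Pattern catPattern = Pattern.compile("cat-([0-9]+)");
Matcher catMatcher = catPattern.matcher(T);
Map<String, String> numToWord = new HashMap<String, String>();
numToWord.put("1", "one");
numToWord.put("2", "two");
numToWord.put("3", "three"); // ...

// collects the text between matches plus the replacements
StringBuffer sb = new StringBuffer();

while (catMatcher.find())
{
System.out.println("Match:" + catMatcher.group(1));
// appends everything since the previous match, then the replacement for this match
catMatcher.appendReplacement(sb, "cat-" + numToWord.get(catMatcher.group(1)));
}
// appends the rest of the text after the last match
catMatcher.appendTail(sb);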


Now sb.toString() contains:

cat-one dog-1 cat-one elephant-1 cat-two dog-2 cat-three

If you add System.out.println(sb.toString()); inside the while loop, you will also see that each replacement happens in step with the loop: every iteration appends the text up to and including the current match, with the replacement already applied.
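
The same technique can be wrapped into a reusable method. The following is only a sketch under a few assumptions of mine: the class and method names (CatReplacer, spellOutCats) are hypothetical, a numeral missing from the map is left untouched, and the replacement goes through Matcher.quoteReplacement so that a "$" or "\" in it is not interpreted as a group reference.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CatReplacer {

    private static final Pattern CAT = Pattern.compile("cat-([0-9]+)");

    // Replaces the numeral after every "cat-" with its word form.
    static String spellOutCats(String text, Map<String, String> numToWord) {
        Matcher m = CAT.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String word = numToWord.get(m.group(1));
            // Assumption of this sketch: an unknown numeral keeps the matched text as is.
            String replacement = (word == null) ? m.group() : "cat-" + word;
            // quoteReplacement makes '$' and '\' in the replacement literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> numToWord = new HashMap<>();
        numToWord.put("1", "one");
        numToWord.put("2", "two");
        numToWord.put("3", "three");
        System.out.println(spellOutCats(
                "cat-1 dog-1 cat-1 elephant-1 cat-2 dog-2 cat-3", numToWord));
    }
}
```

StringBuffer is used because appendReplacement has accepted it since the method was introduced; newer Java versions also have an overload that takes a StringBuilder.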

Saturday, August 21, 2010

B2B: what is WSDL (simple explanation)

For those wondering what a WSDL file or technology is in the area of business-to-business (B2B) communication, when companies talk to each other automatically over the network: you can think of it as a declaration of the client-server conversation. In this case one B (the server) provides some functionality (like taking square roots of big numbers) and another B (the client) has a lot of these numbers and needs their square roots.

So the server declares a method (which is in fact a remote method):

Double squareRoot(Double number)

in its WSDL file. The client side then takes this file and generates the client-side code, which handles the communication protocol and the remote method invocation automatically. All the client has to do is implement the business logic around this invocation, like a web page with a text field for the number, or an entire hardware device with a touch-screen display.
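
From the client's point of view the result looks roughly like the sketch below. Everything in it is a hand-written stand-in: the names SquareRootService, LocalStub and describe are hypothetical, and the stub computes the root locally, whereas real generated code would wrap the call into an XML message and send it to the server.

```java
public class WsdlClientSketch {

    // Mirrors the operation the server declares in its WSDL file.
    interface SquareRootService {
        Double squareRoot(Double number);
    }

    // Hypothetical stand-in for the generated client stub.
    // Real generated code would talk to the server over the network here.
    static class LocalStub implements SquareRootService {
        @Override
        public Double squareRoot(Double number) {
            return Math.sqrt(number);
        }
    }

    // The "business logic" the client writes itself: it only sees the interface.
    static String describe(SquareRootService service, double input) {
        return "sqrt(" + input + ") = " + service.squareRoot(input);
    }

    public static void main(String[] args) {
        System.out.println(describe(new LocalStub(), 16.0));
    }
}
```

The point of the design is that describe does not know or care whether the implementation behind the interface is local or remote.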

In general, the input and return types of remote methods can be simple or complex, in fact as complex and "proprietary" as you need, because the entire WSDL file is nothing but XML, as are the messages being sent over the network.

Libraries used in the industry for this include Axis and Axis2. With them you can generate the Java code for your client from WSDL (and even generate WSDL from your Java code for your server).

Tuesday, August 17, 2010

Zero-width negative lookahead group: example

Suppose you have to split a string in Java where the meaningful parts are separated by runs of spaces, from 2 to any number. The problem is that sometimes a meaningful part (like a person's name) itself contains 2 spaces. You want to exclude such cases and keep the parts of the name together. Example:

name1[one_space]surname1[two_spaces]age
name2[two_spaces]surname2[two_spaces]age

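In this example it is easy to solve the problem with a zero-width negative lookahead group, because the age is always supposed to be numeric: a run of spaces counts as a separator only if it is not followed by a non-digit. Note that \\w would not work here, since it matches digits as well, so the group is (?!\\D). The final code will be something like this:

String[] parts = current_string.split("\\s{2,}(?!\\D)");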
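
Here is a small runnable sketch of such a split on the two sample lines. It assumes the lookahead is written as (?!\\D), so that spaces are treated as a separator only when a digit follows (\\w would also match the digits of the age); the class and method names (LookaheadSplit, splitRecord) and the concrete ages are made up for the illustration.

```java
import java.util.Arrays;

public class LookaheadSplit {

    // Two or more whitespace characters that are NOT followed by a non-digit,
    // i.e. the separator in front of the numeric age.
    static String[] splitRecord(String line) {
        return line.split("\\s{2,}(?!\\D)");
    }

    public static void main(String[] args) {
        // one space inside the name, two spaces before the age
        System.out.println(Arrays.toString(splitRecord("name1 surname1  25")));
        // two spaces inside the name, two spaces before the age
        System.out.println(Arrays.toString(splitRecord("name2  surname2  31")));
    }
}
```

In both cases the name stays in one piece and the age ends up as the second element of the array.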

Thursday, July 1, 2010

Paris impressions

I am back from my second trip to Paris. Only now, after returning, did it occur to me: everyone in Paris is relaxing. Or does it only look that way? Be that as it may, it is easy to relax there. I enjoyed riding the public bicycles!

Tuesday, April 27, 2010

Dumper and sorting of keys

When you use Data::Dumper for debugging, to conveniently log data structures by reference, you can additionally sort the output. Say you have a hash referenced by $hash_ref. If you need to output the hash contents with its keys sorted, you can do:


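use strict;
use Data::Dumper;

# sample data; generate your hash here
my $hash_ref = { 10 => 'ten', 2 => 'two', 1 => 'one' };

# Dumper calls my_filter once for every hash it dumps
$Data::Dumper::Sortkeys = \&my_filter;
print Dumper($hash_ref), "\n";

sub my_filter {
my ($hash) = @_;
# return an array ref containing the hash keys to dump
# in the order that you want them to be dumped
return [
# numeric sort; use cmp instead of <=> for string keys
sort {$a <=> $b} keys %$hash
];
}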

Wednesday, April 21, 2010

Convergence

The code of a program approximates the solution of a problem. The code converges to the solution through the cycle "change the code" - "compile/interpret" - "run". The result is something like a time series: the code may fail to converge for a week and then converge in a single day. If the code does not converge, it is worth taking a look at the very concept behind the solution, at the problem itself, or at the compiler.

All this is to say that it is useful to teach programming in a way that treats a program as a coherent mathematical scheme, and not only as a set of algorithms, techniques and design.

Thursday, April 8, 2010

An idea for GUI designers and developers -- a standalone app / OS level change.

When I'm on VPN with putty I sometimes need to run an sftp client. Currently I have two options: WinSCP (which I find easier to use in some cases, like quickly viewing / editing the contents of different text files) and Secure Shell Client from SSH. For some reason the first client takes about 2-3 minutes to load after I log in. So it would be great to have an option to configure the launch of a program depending on a certain event, like putty being started and connected to a certain host.

This would open up a lot of opportunities to configure your GUI world quite flexibly and save a lot of time otherwise spent on monkey work.

Saturday, March 13, 2010

giza++ under windows: episode 2

It turned out that, in order to run GIZA++ comfortably under win32 with cygwin, I had to recompile it without the flag -DBINARY_SEARCH_FOR_TTABLE (credits: http://code.google.com/p/giza-pp/issues/detail?id=9).

The full list of steps follows (I assume that the source corpus is stored in corpus.ru and the target corpus in corpus.en):

1. produce vcb and dictionary files with "plain2snt.out corpus.ru corpus.en" (credits: http://vee-r.blogspot.com/2006/12/giza-guide.html)
2. produce cooc file with "snt2cooc.out corpus.ru.vcb corpus.en.vcb corpus.ru_corpus.en.snt > ru_en.cooc" (credits: myself, after analyzing the train-factored-phrase-model.perl from Moses package)
3. run GIZA++ with config file:

outputfileprefix play_giza
sourcevocabularyfile corpus.ru.vcb
targetvocabularyfile corpus.en.vcb
c corpus.ru_corpus.en.snt
CoocurrenceFile ru_en.cooc
model1dumpfrequency 1
model4smoothfactor 0.4
nsmooth 4
onlyaldumps 1
nodumps 1
p0 .999
m1 5
m2 0
m3 3
m4 3
o giza

If these options are stored in giza.config, run "GIZA++ giza.config". This produces the giza.A3.final file, a typical entry of which is:

# Sentence pair (1) source length 4 target length 7 alignment score : 2.25315e-10
there is a book on the table
NULL ({ }) на ({ }) столе ({ 5 6 7 }) лежит ({ 1 2 }) книга ({ 3 4 })

which means the following mapping:

"столе" --> "on the table"
"лежит" --> "there is"
"книга" --> "a book"
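
Reading such an entry programmatically is straightforward. The sketch below is my own illustration, not part of GIZA++ or Moses: the class and method names (A3Reader, readMapping) are hypothetical, it assumes the entry layout shown above (every source word followed by "({ ... })" with 1-based positions in the target sentence), and it keeps only one mapping per source word, so a repeated source word overwrites the earlier one.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class A3Reader {

    // One "word ({ i j k })" group of the alignment line.
    private static final Pattern GROUP =
            Pattern.compile("(\\S+) \\(\\{([^}]*)\\}\\)");

    // targetLine: the plain sentence (second line of an entry)
    // alignmentLine: the line with the ({ ... }) groups (third line of an entry)
    static Map<String, String> readMapping(String targetLine, String alignmentLine) {
        String[] target = targetLine.trim().split("\\s+");
        Map<String, String> mapping = new LinkedHashMap<>();
        Matcher m = GROUP.matcher(alignmentLine);
        while (m.find()) {
            String indices = m.group(2).trim();
            if (indices.isEmpty()) {
                continue; // word with no aligned positions, e.g. "NULL ({ })"
            }
            StringBuilder phrase = new StringBuilder();
            for (String idx : indices.split("\\s+")) {
                if (phrase.length() > 0) {
                    phrase.append(' ');
                }
                // positions in the file are 1-based
                phrase.append(target[Integer.parseInt(idx) - 1]);
            }
            mapping.put(m.group(1), phrase.toString());
        }
        return mapping;
    }

    // Usage: java A3Reader "<target sentence>" "<alignment line>"
    public static void main(String[] args) {
        if (args.length == 2) {
            System.out.println(readMapping(args[0], args[1]));
        }
    }
}
```

Given the second and third lines of the entry above, readMapping returns the same three pairs as listed, in the order in which the source words appear.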

Running GIZA++ under win32 and under Linux gives the same results in terms of word mappings, except that the alignment scores may differ slightly, possibly due to different floating-point precision models.

Saturday, February 27, 2010

giza++ under windows

After an 'outrageous' attempt to compile giza++ under Visual Studio C++ Express 2008 (with more than 2000 compile time errors) I switched over to cygwin and installed:

gcc version 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)
make GNU Make 3.81 (built for i686-pc-cygwin)

With

$ make

under giza-pp-v1.0.3\giza-pp I obtained two executables: GIZA++-v2/GIZA++.exe and mkcls-v2/mkcls.exe, which run perfectly under Windows XP Professional Version 2002 SP 2.