Saturday, March 13, 2010

giza++ under windows: episode 2

It turned out that, in order to run GIZA++ comfortably under win32 with Cygwin, I had to recompile it without the -DBINARY_SEARCH_FOR_TTABLE flag (credits: http://code.google.com/p/giza-pp/issues/detail?id=9).

The full list of steps follows (I assume the source corpus is stored in corpus.ru and the target corpus in corpus.en):

1. produce the vocabulary (.vcb) and sentence (.snt) files with "plain2snt.out corpus.ru corpus.en" (credits: http://vee-r.blogspot.com/2006/12/giza-guide.html)
2. produce the cooc file with "snt2cooc.out corpus.ru.vcb corpus.en.vcb corpus.ru_corpus.en.snt > ru_en.cooc" (credits: myself, after analyzing train-factored-phrase-model.perl from the Moses package)
3. run GIZA++ with a config file:
3. run GIZA++ with config file:

outputfileprefix play_giza
sourcevocabularyfile corpus.ru.vcb
targetvocabularyfile corpus.en.vcb
c corpus.ru_corpus.en.snt
CoocurrenceFile ru_en.cooc
model1dumpfrequency 1
model4smoothfactor 0.4
nsmooth 4
onlyaldumps 1
nodumps 1
p0 .999
m1 5
m2 0
m3 3
m4 3
o giza

If these options are stored in giza.config, run "GIZA++ giza.config". This produces the giza.A3.final file, a typical entry of which is:

# Sentence pair (1) source length 4 target length 7 alignment score : 2.25315e-10
there is a book on the table
NULL ({ }) на ({ }) столе ({ 5 6 7 }) лежит ({ 1 2 }) книга ({ 3 4 })

which encodes the following mappings:

"столе" --> "on the table"
"лежит" --> "there is"
"книга" --> "a book"

Running GIZA++ under win32 and under Linux gives the same results in terms of word mappings; only the alignment scores may differ slightly, presumably due to differences in floating-point behavior between the two builds.
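As an aside, the intermediate files from steps 1 and 2 are plain text and easy to inspect. Below is a sketch of readers for the .vcb and .snt formats as I understand them (the field layout is my own reading of what plain2snt.out produces: one "id token frequency" line per word in .vcb; three lines per sentence pair in .snt, namely a count, the source sentence as vocabulary ids, and the target sentence as vocabulary ids):

```python
def parse_vcb(text):
    """Map vocabulary id -> token from .vcb text ("<id> <token> <freq>" lines)."""
    vocab = {}
    for line in text.strip().splitlines():
        wid, token, _freq = line.split()
        vocab[int(wid)] = token
    return vocab

def parse_snt(text):
    """Yield (count, source_ids, target_ids) triples from .snt text."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 3):
        count = int(lines[i])
        src = [int(x) for x in lines[i + 1].split()]
        tgt = [int(x) for x in lines[i + 2].split()]
        yield count, src, tgt

if __name__ == "__main__":
    # tiny inline sample in place of corpus.en.vcb / corpus.ru_corpus.en.snt
    vcb = parse_vcb("2 there 1\n3 is 1\n4 a 1\n5 book 1")
    for count, src, tgt in parse_snt("1\n2 3\n4 5"):
        print(count, [vcb.get(i, "?") for i in src])
```

Mapping ids back through the .vcb files this way is a quick sanity check that plain2snt.out numbered the corpus the way you expect.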