Saturday, March 13, 2010

giza++ under windows: episode 2

It turned out that, to run GIZA++ comfortably under win32 with Cygwin, I had to recompile it without the -DBINARY_SEARCH_FOR_TTABLE flag (credits: http://code.google.com/p/giza-pp/issues/detail?id=9).

The full list of steps follows (I assume that the source corpus is stored in corpus.ru and the target corpus in corpus.en):

1. produce the vcb (vocabulary) and snt (sentence) files with "plain2snt.out corpus.ru corpus.en" (credits: http://vee-r.blogspot.com/2006/12/giza-guide.html)
2. produce the cooc file with "snt2cooc.out corpus.ru.vcb corpus.en.vcb corpus.ru_corpus.en.snt > ru_en.cooc" (credits: myself, after analyzing train-factored-phrase-model.perl from the Moses package)
3. run GIZA++ with config file:

outputfileprefix play_giza
sourcevocabularyfile corpus.ru.vcb
targetvocabularyfile corpus.en.vcb
c corpus.ru_corpus.en.snt
CoocurrenceFile ru_en.cooc
model1dumpfrequency 1
model4smoothfactor 0.4
nsmooth 4
onlyaldumps 1
nodumps 1
p0 .999
m1 5
m2 0
m3 3
m4 3
o giza

If these options are stored in giza.config, run "GIZA++ giza.config". This produces the file giza.A3.final, a typical entry of which is:

# Sentence pair (1) source length 4 target length 7 alignment score : 2.25315e-10
there is a book on the table
NULL ({ }) на ({ }) столе ({ 5 6 7 }) лежит ({ 1 2 }) книга ({ 3 4 })

which means the following mapping:

"столе" --> "on the table"
"лежит" --> "there is"
"книга" --> "a book"
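
The same mapping can be read off mechanically. Below is a minimal Python sketch, not part of GIZA++ itself. It assumes the entry layout shown above: the second line holds the target sentence and the third holds "word ({ positions })" groups with 1-based positions into that sentence. The function name parse_a3_entry is my own, purely for illustration.

```python
import re

# One "word ({ i j k })" group from the third line of an A3.final entry.
_GROUP = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_a3_entry(target_line, alignment_line):
    """Return (source_word, [target_words]) pairs, skipping NULL and unaligned words."""
    target = target_line.split()
    pairs = []
    for word, positions in _GROUP.findall(alignment_line):
        indices = [int(p) for p in positions.split()]
        if word == "NULL" or not indices:
            continue
        # Positions in the file are 1-based.
        pairs.append((word, [target[i - 1] for i in indices]))
    return pairs


entry_target = "there is a book on the table"
entry_align = "NULL ({ }) на ({ }) столе ({ 5 6 7 }) лежит ({ 1 2 }) книга ({ 3 4 })"

for src, tgt in parse_a3_entry(entry_target, entry_align):
    print(src, "-->", " ".join(tgt))
```

For the entry above this prints the three mappings listed; "на" is skipped because it has no aligned positions.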

Running GIZA++ under win32 and under Linux gives the same word mappings, although the alignment scores may differ slightly, possibly because of different floating-point precision models.
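
To repeat the run identically on both platforms, the three steps listed earlier can be scripted. This is only a sketch. It assumes the binaries plain2snt.out, snt2cooc.out and GIZA++ are on the PATH under exactly those names, and that giza.config already contains the options shown above. The helper names build_steps and run_steps are hypothetical.

```python
import subprocess


def build_steps(src="corpus.ru", tgt="corpus.en",
                cooc="ru_en.cooc", config="giza.config"):
    """Return (argv, stdout_path) pairs; stdout_path is None when nothing is redirected."""
    snt = f"{src}_{tgt}.snt"  # file name produced by plain2snt.out in step 1
    return [
        (["plain2snt.out", src, tgt], None),
        (["snt2cooc.out", f"{src}.vcb", f"{tgt}.vcb", snt], cooc),
        (["GIZA++", config], None),
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            # Equivalent of the shell redirection "> ru_en.cooc".
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


# Only print the commands here; call run_steps(build_steps()) to execute them.
for argv, stdout_path in build_steps():
    print(" ".join(argv) + (f" > {stdout_path}" if stdout_path else ""))
```

Calling run_steps(build_steps()) from the directory that holds the corpus files would execute the sequence.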
