What is GIZA++?
GIZA++ is an extension of the program GIZA (part of the SMT
toolkit EGYPT), which was
developed by the Statistical Machine Translation team during
the 1999 summer workshop at the Center for Language and Speech
Processing at Johns Hopkins University (CLSP/JHU). GIZA++ adds
many features on top of GIZA. These extensions were designed and written by Franz Josef Och.
What is a parallel corpus?
A parallel corpus is a collection of texts, each of which is translated into
one or more languages other than the original.
In the simplest case only
two languages are involved: one of the corpora is an exact translation
of the other. Some parallel corpora, however, exist in several languages.
Installing GIZA++
Step 1- Download GIZA++ using the following command (the -O option saves the archive under the name used in the next steps):
$ wget -O giza-pp-master.zip https://github.com/moses-smt/giza-pp/archive/master.zip
Step 2- Make a folder for your GIZA++ installation:
$ mkdir giza-practice
Step 3- Move the downloaded archive into the installation folder:
$ mv giza-pp-master.zip giza-practice/
Step 4- Change to the installation folder:
$ cd giza-practice/
Step 5- Unzip the archive:
$ unzip giza-pp-master.zip
Step 6- Change to the extracted source directory:
$ cd giza-pp-master/
Step 7- Remove any previous build output:
$ make clean
Step 8- Build GIZA++ and its helper tools:
$ make
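Before moving on, it can help to confirm that the build produced the tools used later in this tutorial. The following is a minimal sketch, not part of GIZA++ itself: the function name check_giza_build is made up for illustration, and the list of paths is taken from the commands used in the "Running GIZA++" section.

```shell
# Sketch: report whether the binaries used later in this tutorial exist.
# check_giza_build is a hypothetical helper, not something shipped with GIZA++.
# Argument 1: path to the giza-pp-master directory (defaults to the current one).
check_giza_build() {
    root="${1:-.}"
    missing=0
    for tool in GIZA++-v2/GIZA++ GIZA++-v2/plain2snt.out mkcls-v2/mkcls; do
        if [ -x "$root/$tool" ]; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Exit status 0 when every tool is present, 1 otherwise.
    return "$missing"
}
```

For example, from inside giza-pp-master/ you would run check_giza_build with no arguments. If anything is reported as missing, look at the output of make for compiler errors.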
Creating a Parallel Corpus to Use in GIZA++
GIZA++ is a tool for word alignment; it uses a parallel corpus to build a dictionary.
In this example we use two languages: English as the source language and Hindi as the target language.
Step 1. First, create a file called hindi.txt and copy the Hindi text below into it.
मैंने उसे किताब दी .
मैंने किताब को पढ़ा .
वह किताब को प्यार करता था .
उसने किताब दी .
मैंने किताब को पढ़ा .
वह किताब को प्यार करता था .
उसने किताब दी .
Step 2. Now create a file called english.txt and copy the English text below into it.
I gave him the book .
I read the book .
He loved the book .
He gave the book .
I read the book .
He loved the book .
He gave the book .
Now our parallel corpus is created.
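Because line N of one file must be the translation of line N of the other, the two files need the same number of lines. The sketch below checks only that; check_parallel is a hypothetical helper name, and it cannot tell whether the sentences are actually correct translations of each other.

```shell
# Sketch: verify that two corpus files have the same number of lines.
# check_parallel is a hypothetical helper; it only compares line counts.
check_parallel() {
    src_lines=$(wc -l < "$1" | tr -d ' ')
    tgt_lines=$(wc -l < "$2" | tr -d ' ')
    if [ "$src_lines" -eq "$tgt_lines" ]; then
        echo "OK: $src_lines sentence pairs"
    else
        echo "MISMATCH: $1 has $src_lines lines, $2 has $tgt_lines lines"
        return 1
    fi
}
```

You would call it as check_parallel english.txt hindi.txt. Note that wc -l counts newline characters, so make sure the last sentence in each file ends with a newline.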
Running GIZA++
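Step 2. Change to the GIZA++-v2 directory:
$ cd giza-pp-master/GIZA++-v2/
Step 3. Use the following command to convert your corpus into GIZA++ format (english.txt and hindi.txt must be in this directory, or be given with their paths):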
./plain2snt.out [source_language_corpus] [target_language_corpus]
$ ./plain2snt.out english.txt hindi.txt
Step 4. Type the following commands to create the word classes for each language:
$ ./../mkcls-v2/mkcls -p[source_language_corpus] -V[source_language_corpus].vcb.classes
$ ./../mkcls-v2/mkcls -p[target_language_corpus] -V[target_language_corpus].vcb.classes
Example
$ ./../mkcls-v2/mkcls -penglish.txt -Venglish.txt.vcb.classes
$ ./../mkcls-v2/mkcls -phindi.txt -Vhindi.txt.vcb.classes
Step 5. Create an output directory:
$ mkdir myout
Step 6. Now use GIZA++ to build your dictionary:
./GIZA++ -S [target_language_corpus].vcb -T [source_language_corpus].vcb -C [target_language_corpus]_[source_language_corpus].snt -o [prefix] -outputpath [output_folder]
Example:
$ ./GIZA++ -S hindi.vcb -T english.vcb -C hindi_english.snt -outputpath myout -o test
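The template and the example differ only in the file names, so the mapping can be sketched as a small function that prints the command for a given pair of corpus files. This is an illustration only: giza_command is a hypothetical helper, it does not run GIZA++, and it assumes the naming seen in the example above (english.txt gives english.vcb, and the pair gives hindi_english.snt).

```shell
# Sketch: print the GIZA++ command for a pair of corpus files.
# giza_command is a hypothetical helper; it only builds the command string.
# Assumes the .vcb and .snt names follow the example in this tutorial.
giza_command() {
    src="${1%.txt}"   # source corpus without extension, e.g. english
    tgt="${2%.txt}"   # target corpus without extension, e.g. hindi
    outdir="$3"       # output folder, e.g. myout
    prefix="$4"       # output file prefix, e.g. test
    echo "./GIZA++ -S $tgt.vcb -T $src.vcb -C ${tgt}_${src}.snt -outputpath $outdir -o $prefix"
}
```

Running giza_command english.txt hindi.txt myout test prints the same command as the example, which you can then compare against the .vcb and .snt files actually present in the directory.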
Note: if you get an error such as "ERROR: NO COOCURRENCE FILE GIVEN!", update the Makefile inside GIZA++-v2, rebuild with make clean and make, and then run the GIZA++ command again.
Replace the line CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE
with the line CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DWORDINDEX_WITH_4_BYTE
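If you prefer not to edit the Makefile by hand, the same change can be sketched with sed. The function name remove_binary_search_flag is made up for illustration; it removes the -DBINARY_SEARCH_FOR_TTABLE flag wherever it appears in the file and keeps a backup copy with a .bak suffix.

```shell
# Sketch: remove -DBINARY_SEARCH_FOR_TTABLE from a Makefile, keeping a backup.
# remove_binary_search_flag is a hypothetical helper name.
# Argument 1: path to the Makefile (defaults to ./Makefile).
remove_binary_search_flag() {
    makefile="${1:-Makefile}"
    # -i.bak edits the file in place and saves the original as Makefile.bak.
    sed -i.bak 's/ -DBINARY_SEARCH_FOR_TTABLE//g' "$makefile"
}
```

After running it inside GIZA++-v2, rebuild with make clean and make so that the change takes effect.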
GIZA++ writes its output files to the myout/ directory. Of these, the file named [prefix].actual.ti.final (test.actual.ti.final in our case) is the final file.
Each line contains a source word, a target word, and the probability of that alignment:
test.actual.ti.final:
book NULL 1
. को 0.333333
gave दी 1
He था 0.333333
him उसे 1
loved प्यार 0.5
read पढ़ा 1
the . 1
He उसने 0.333333
. किताब 0.666667
loved करता 0.5
I मैंने 1
He वह 0.333333
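To turn this table into a simple one-best dictionary, you can keep only the highest-probability target word for each source word. The awk sketch below assumes the three-column layout shown above (source word, target word, probability); best_translations is a hypothetical helper name, and when several candidates tie it keeps the first one it reads.

```shell
# Sketch: keep the most probable target word for each source word.
# best_translations is a hypothetical helper; it expects lines of the form
# "source_word target_word probability" and skips anything else.
best_translations() {
    awk 'NF == 3 && (!($1 in p) || $3 + 0 > p[$1] + 0) {
             p[$1] = $3   # best probability seen so far for this source word
             t[$1] = $2   # target word that has that probability
         }
         END { for (s in p) print s, t[s], p[s] }' "$1" | LC_ALL=C sort
}
```

You would run it as best_translations myout/test.actual.ti.final. With a corpus as small as the one in this tutorial, expect some odd entries such as punctuation aligned to content words.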
References:
http://www.statmt.org/moses/giza/GIZA++.html
http://okapiframework.org/wiki/index.php?title=GIZA%2B%2B_Installation_and_Running_Tutorial
In target portion of the training corpus, only 9 unique tokens appeared
lambda for PP calculation in IBM-1,IBM-2,HMM:= 21/(25-4)== 1
ERROR: NO COOCURRENCE FILE GIVEN!
Aborted (core dumped)
it is giving this error
How to resolve the following error:
ERROR: NO COOCURRENCE FILE GIVEN!
Aborted (core dumped)
I've also tried the above steps but it didn't work.
Which one is the trained model? How can we load the trained alignment model to get alignment results on a new pair of eng-hi sentences?