
Installing and Using GIZA++ on Ubuntu for Word Alignment


What is GIZA++?


GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT), which was developed by the Statistical Machine Translation team during the 1999 summer workshop at the Center for Language and Speech Processing at Johns Hopkins University (CLSP/JHU). GIZA++ adds many features to GIZA. The extensions were designed and written by Franz Josef Och.


What is a parallel corpus?


A parallel corpus is a collection of texts, each of which is translated into one or more languages other than the original.

The simplest case involves only two languages: one corpus is an exact translation of the other. Some parallel corpora, however, cover several languages.

Installing GIZA++


Step 1- Download GIZA++ using the following command (the -O option saves the archive under the name used in the next steps):

$ wget -O giza-pp-master.zip https://github.com/moses-smt/giza-pp/archive/master.zip

Step 2- Create a folder for your GIZA++ installation

$ mkdir giza-practice

Step 3- Move the downloaded archive into the installation directory

$ mv giza-pp-master.zip giza-practice/

Step 4- Change to your installation directory


$ cd giza-practice/

Step 5- Unzip the archive

$ unzip giza-pp-master.zip

Step 6- Change to the extracted directory

$ cd giza-pp-master/

Step 7- Remove any leftovers from a previous build

$ make clean

Step 8- Build GIZA++

$ make
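
Once make finishes, it can help to confirm that the programs used later in this post were actually built. The sketch below assumes you are still inside giza-pp-master/ and checks only the three binaries this tutorial calls. The function name check_giza_build is made up for illustration.

```shell
#!/bin/sh
# check_giza_build DIR: report whether the binaries used later in this
# tutorial exist and are executable under DIR (the unpacked giza-pp-master).
# Hypothetical helper; the paths are the ones used in this post.
check_giza_build() {
    root=$1
    rc=0
    for bin in GIZA++-v2/GIZA++ GIZA++-v2/plain2snt.out mkcls-v2/mkcls; do
        if [ -x "$root/$bin" ]; then
            echo "found:   $bin"
        else
            echo "missing: $bin"
            rc=1
        fi
    done
    return $rc
}

# Assumes the current directory is giza-pp-master/.
check_giza_build . || echo "Build looks incomplete; re-run make and read its output."
```

If anything is reported as missing, the error messages printed by make are the first place to look.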

Creating a Parallel Corpus to Use in GIZA++

GIZA++ is a word alignment tool; it uses a parallel corpus to build a translation dictionary.

In this example we use two languages: English as the source language and Hindi as the target language.

Step 1. First, create a file called hindi.txt and copy the Hindi text below into it.

मैंने उसे किताब दी .
मैंने किताब को पढ़ा .
वह किताब को प्यार करता था .
उसने किताब दी .

Step 2. Now create a file called english.txt and copy the English text below into it.

I gave him the book .
I read the book .
He loved the book .
He gave the book .

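Our parallel corpus is now ready. Each line of hindi.txt is the translation of the same line of english.txt, and the tokens (including the final period) are separated by spaces.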
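
Because the two files are paired line by line, a quick sanity check is to confirm that they have the same number of lines before going further. The sketch below assumes english.txt and hindi.txt are in the current directory. The helper name same_line_count is made up for illustration.

```shell
#!/bin/sh
# same_line_count A B: succeed only if files A and B have the same number
# of lines. Hypothetical helper name.
same_line_count() {
    a=$(( $(wc -l < "$1") ))
    b=$(( $(wc -l < "$2") ))
    [ "$a" -eq "$b" ]
}

if [ -f english.txt ] && [ -f hindi.txt ]; then
    if same_line_count english.txt hindi.txt; then
        echo "OK: line counts match"
        # Show the sentence pairs side by side, separated by a tab.
        paste english.txt hindi.txt
    else
        echo "Mismatch: english.txt and hindi.txt differ in line count"
    fi
fi
```

A matching line count does not prove the translations are correct, but a mismatch always means the pairing is broken somewhere.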


Running GIZA++


Step 1. Copy the hindi.txt and english.txt files to giza-pp-master/GIZA++-v2/

Step 2. Change to that directory:

$ cd giza-pp-master/GIZA++-v2/

Step 3. Use the following command to convert your corpus into GIZA++ format:

./plain2snt.out [source_language_corpus] [target_language_corpus]

$ ./plain2snt.out english.txt hindi.txt
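
Among other things, this conversion builds a vocabulary file for each language, in which every distinct token is listed with a numeric ID and a frequency count. To get a rough feel for what goes into it, the sketch below counts the tokens of a corpus file with standard tools. It is an illustration only: it does not reproduce the layout or ID assignment of the real .vcb files, and the helper name token_counts is made up.

```shell
#!/bin/sh
# token_counts FILE: print "count token" for every distinct
# whitespace-separated token in FILE, most frequent first.
# Hypothetical helper; this is NOT the .vcb format itself.
token_counts() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print count[w], w }' "$1" |
        LC_ALL=C sort -k1,1nr -k2,2
}

if [ -f english.txt ]; then
    token_counts english.txt
fi
```

For the four English sentences above, "the", "book" and "." each occur four times, so they appear at the top of the list.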

Step 4. Run the following commands to build the word classes:

$ ./../mkcls-v2/mkcls -p[source_language_corpus] -V[source_language_corpus].vcb.classes

$ ./../mkcls-v2/mkcls -p[target_language_corpus] -V[target_language_corpus].vcb.classes

Example:
$ ./../mkcls-v2/mkcls -penglish.txt -Venglish.txt.vcb.classes
$ ./../mkcls-v2/mkcls -phindi.txt -Vhindi.txt.vcb.classes

Step 5. Create an output directory:

$ mkdir myout

Step 6. Now use GIZA++ to build your dictionary:

./GIZA++ -S [target_language_corpus].vcb -T [source_language_corpus].vcb -C [target_language_corpus]_[source_language_corpus].snt -o [prefix] -outputpath [output_folder]

Example:
$ ./GIZA++ -S hindi.vcb -T english.vcb -C hindi_english.snt -outputpath myout -o test
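
The file names in this command all follow from the two corpus names, so it is easy to mistype one of them. As a small convenience, the sketch below fills in the template from four arguments and prints the resulting command without running it. The helper name giza_command is made up for illustration.

```shell
#!/bin/sh
# giza_command FIRST SECOND OUTDIR PREFIX: print the GIZA++ command line
# that uses FIRST.vcb, SECOND.vcb and FIRST_SECOND.snt.
# Hypothetical helper; it only prints the command, it does not run it.
giza_command() {
    printf './GIZA++ -S %s.vcb -T %s.vcb -C %s_%s.snt -outputpath %s -o %s\n' \
        "$1" "$2" "$1" "$2" "$3" "$4"
}

# Prints the same command as the example in this step.
giza_command hindi english myout test
```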

Note: if you get an error, update the Makefile inside GIZA++-v2. Replace the line

CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE

with the line

CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DWORDINDEX_WITH_4_BYTE

(this drops the -DBINARY_SEARCH_FOR_TTABLE flag), then rebuild with make clean and make before running GIZA++ again.


GIZA++ writes its output files to the myout/ directory. Of the various files it produces, [prefix].actual.ti.final (test.actual.ti.final in our case) is the final one.

It lists the aligned source and target words together with their probability values:

test.actual.ti.final:

book NULL 1
. को 0.333333
gave दी 1
He था 0.333333
him उसे 1
loved प्यार 0.5
read पढ़ा 1
the . 1
He उसने 0.333333
. किताब 0.666667
loved करता 0.5
I मैंने 1
He वह 0.333333
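
Several words in this table have more than one candidate (for example "He" and "."). To read it as a simple dictionary, one option is to keep only the highest-probability entry for each first-column word. The sketch below does that with awk, assuming the three-column layout shown above and the output path myout/test.actual.ti.final. The helper name best_alignments is made up for illustration.

```shell
#!/bin/sh
# best_alignments FILE: for each first-column word in a
# "word word probability" table, keep the second-column word with the
# highest probability. On ties, the entry that appears first in FILE wins.
# Hypothetical helper name.
best_alignments() {
    awk '!($1 in best) || ($3 + 0) > best[$1] {
             best[$1] = $3 + 0   # numeric value, used for comparison
             prob[$1] = $3       # original text, used for printing
             pick[$1] = $2
         }
         END { for (w in best) print w, pick[w], prob[w] }' "$1" |
        LC_ALL=C sort
}

if [ -f myout/test.actual.ti.final ]; then
    best_alignments myout/test.actual.ti.final
fi
```

With only four sentence pairs, many of these "best" entries are still wrong (for instance "the" paired with "."), which mostly shows how small the corpus is.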


References: 

http://www.statmt.org/moses/giza/GIZA++.html 
http://okapiframework.org/wiki/index.php?title=GIZA%2B%2B_Installation_and_Running_Tutorial
