techie knowledge: Installing And Using GIZA++ in Ubuntu for Word Alignment

What is GIZA++?

GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT), which was developed by the Statistical Machine Translation team during the 1999 summer workshop at the Center for Language and Speech Processing at Johns Hopkins University (CLSP/JHU). GIZA++ adds a number of features to GIZA. The extensions were designed and written by Franz Josef Och.

What is a parallel corpus?

A parallel corpus is a collection of texts, each of which is translated into one or more languages other than the original.

The simplest case is where only two languages are involved: one of the corpora is an exact translation of the other. Some parallel corpora, however, exist in several languages.

Installing GIZA++

Step 1- Download GIZA++ using the following command (the -O option saves the archive as giza-pp-master.zip, the name the later steps expect):

$ wget -O giza-pp-master.zip https://github.com/moses-smt/giza-pp/archive/master.zip

Step 2- Make a folder for your GIZA++ installation

$ mkdir giza-practice

Step 3- Move the archive to the installation directory

$ mv giza-pp-master.zip giza-practice/

Step 4- Change to your installation directory

$ cd giza-practice/

Step 5- Unzip the archive

$ unzip giza-pp-master.zip

Step 6- Change into the extracted directory

$ cd giza-pp-master/

Step 7- Remove any previous build output

$ make clean

Step 8- Build GIZA++ and mkcls

$ make
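Before moving on, it can help to confirm that the build produced the binaries used later in this tutorial. The sketch below is only an illustration: the function name check_giza_build is made up for this example, and it checks just the three binaries that the following steps invoke, assuming they end up in GIZA++-v2/ and mkcls-v2/.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GIZA++): report which of the binaries
# used in this tutorial are missing under a giza-pp-master/ directory.
# Returns 0 if all of them are present and executable, 1 otherwise.
check_giza_build() {
    local root="${1:-.}" missing=0 bin
    for bin in GIZA++-v2/GIZA++ GIZA++-v2/plain2snt.out mkcls-v2/mkcls; do
        if [ -x "$root/$bin" ]; then
            echo "ok:      $bin"
        else
            echo "missing: $bin"
            missing=1
        fi
    done
    return "$missing"
}
```

From inside giza-pp-master/, call it as check_giza_build . and re-run make if anything is reported missing.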

Creating a Parallel Corpus to Use in GIZA++

GIZA++ is a word-alignment tool: it learns a translation dictionary from a parallel corpus.

In this example we use two languages: English as the source language and Hindi as the target language.

Step 1. First, create a file called hindi.txt and copy the Hindi text below into it, one sentence per line.

मैंने उसे किताब दी .

मैंने किताब को पढ़ा .

वह किताब को प्यार करता था .

उसने किताब दी .

Step 2. Now create a file called english.txt and copy the English text below into it, one sentence per line and in the same order as the Hindi sentences.

I gave him the book .

I read the book .

He loved the book .

He gave the book .

Our parallel corpus is now ready: each line of english.txt is the translation of the same-numbered line of hindi.txt.
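The two files can also be written straight from the shell. This is a minimal sketch, assuming a UTF-8 terminal and that each sentence sits on its own line with no blank lines in between (an assumption of this example). The final check only compares line counts, which catches a missing or extra sentence but not a wrong translation.

```shell
#!/usr/bin/env bash
# Write the Hindi side of the corpus, one sentence per line.
cat > hindi.txt <<'EOF'
मैंने उसे किताब दी .
मैंने किताब को पढ़ा .
वह किताब को प्यार करता था .
उसने किताब दी .
EOF

# Write the English side; line N must translate line N of hindi.txt.
cat > english.txt <<'EOF'
I gave him the book .
I read the book .
He loved the book .
He gave the book .
EOF

# Sanity check: both sides need the same number of lines.
hi_lines=$(wc -l < hindi.txt)
en_lines=$(wc -l < english.txt)
if [ "$hi_lines" -eq "$en_lines" ]; then
    echo "corpus ok: $en_lines sentence pairs"
else
    echo "line count mismatch: hindi=$hi_lines english=$en_lines" >&2
fi
```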

Running GIZA++

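Step 1. Copy the hindi.txt and english.txt files to giza-pp-master/GIZA++-v2/

Step 2. Change to that directory (if you are still inside giza-pp-master/ after the build, use cd GIZA++-v2/ instead):

$ cd giza-pp-master/GIZA++-v2/

Step 3. Use the following command to convert your corpus into GIZA++ format:

./plain2snt.out [source_language_corpus] [target_language_corpus]

$ ./plain2snt.out english.txt hindi.txt

This writes the vocabulary files english.vcb and hindi.vcb and the sentence files english_hindi.snt and hindi_english.snt, which the later steps use.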

Step 4. Type the following commands to build the word classes. Name each classes file after the .vcb file from Step 3 (english.vcb, not english.txt.vcb), so that GIZA++ can find it next to the vocabulary file:

$ ./../mkcls-v2/mkcls -p[source_language_corpus] -V[source_language_vcb_file].classes

$ ./../mkcls-v2/mkcls -p[target_language_corpus] -V[target_language_vcb_file].classes

Example:

$ ./../mkcls-v2/mkcls -penglish.txt -Venglish.vcb.classes

$ ./../mkcls-v2/mkcls -phindi.txt -Vhindi.vcb.classes

Step 5. Create the output directory:

$ mkdir myout

Step 6. Now use GIZA++ to build your dictionary:

./GIZA++ -S [target_language_corpus].vcb -T [source_language_corpus].vcb -C [target_language_corpus]_[source_language_corpus].snt -o [prefix] -outputpath [output_folder]

Example:

$ ./GIZA++ -S hindi.vcb -T english.vcb -C hindi_english.snt -outputpath myout -o test

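Note: if GIZA++ aborts with "ERROR: NO COOCURRENCE FILE GIVEN!", update the Makefile inside GIZA++-v2 as shown here, rebuild with make clean and make, and then run the Step 6 command again. The change removes the -DBINARY_SEARCH_FOR_TTABLE flag, which is what makes GIZA++ require a co-occurrence file.

Replace the line CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE

with the line CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DWORDINDEX_WITH_4_BYTE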
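If the error is "ERROR: NO COOCURRENCE FILE GIVEN!" and you would rather not change the Makefile, another route that should work is to give GIZA++ the co-occurrence file it asks for. The sketch below is an assumption-laden illustration, not part of the original steps: run_giza_with_cooc is a made-up wrapper name, it assumes snt2cooc.out was built alongside GIZA++ in GIZA++-v2/, that it prints the co-occurrence table to standard output, and that the vocabulary files are passed in the same order as for -S and -T.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: build a co-occurrence file with snt2cooc.out and
# pass it to GIZA++ via -CoocurrenceFile.
# Run from inside GIZA++-v2/ after plain2snt.out (Step 3).
run_giza_with_cooc() {
    local src_vcb="$1" tgt_vcb="$2" snt="$3" outdir="$4" prefix="$5"
    local cooc="${snt%.snt}.cooc"   # e.g. hindi_english.cooc

    # Stop early if the binaries are not where this sketch expects them.
    if [ ! -x ./snt2cooc.out ] || [ ! -x ./GIZA++ ]; then
        echo "snt2cooc.out or GIZA++ not found in $(pwd)" >&2
        return 1
    fi

    mkdir -p "$outdir"
    # Assumed order: same vocabulary order as the -S / -T options below.
    ./snt2cooc.out "$src_vcb" "$tgt_vcb" "$snt" > "$cooc" || return 1
    ./GIZA++ -S "$src_vcb" -T "$tgt_vcb" -C "$snt" \
        -CoocurrenceFile "$cooc" -outputpath "$outdir" -o "$prefix"
}
```

For the example corpus the call would be run_giza_with_cooc hindi.vcb english.vcb hindi_english.snt myout test.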

GIZA++ writes its output files to the myout/ directory. Of these, the file named [prefix].actual.ti.final (test.actual.ti.final in our case) holds the final translation table.

Each line lists an English word, a Hindi word, and the translation probability estimated for that pair:

test.actual.ti.final:

book NULL 1

. को 0.333333

gave दी 1

He था 0.333333

him उसे 1

loved प्यार 0.5

read पढ़ा 1

the . 1

He उसने 0.333333

. किताब 0.666667

loved करता 0.5

I मैंने 1

He वह 0.333333
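To read the table as a dictionary, one option is to keep only the most probable Hindi word for each English word. The helper below is a sketch for illustration: best_alignments is a made-up name, and it assumes every data line has exactly three whitespace-separated columns (English word, Hindi word, probability), as in the listing above. When several candidates tie, it keeps the one that appears first in the file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each English word in a *.actual.ti.final file,
# print the line with the highest probability, sorted by English word.
best_alignments() {
    awk 'NF == 3 {
             prob = $3 + 0                       # force a numeric comparison
             if (!($1 in best) || prob > best[$1]) {
                 best[$1] = prob                 # highest probability so far
                 line[$1] = $1 " " $2 " " $3     # keep the original text
             }
         }
         END { for (w in line) print line[w] }' "$1" | LC_ALL=C sort
}
```

Calling best_alignments myout/test.actual.ti.final on the table above keeps ". किताब 0.666667" and drops ". को 0.333333". Entries such as "the ." show how little a four-sentence corpus can tell the model.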


Comments:

Unknown, 28 August 2019 at 11:47
In target portion of the training corpus, only 9 unique tokens appeared
lambda for PP calculation in IBM-1,IBM-2,HMM:= 21/(25-4)== 1
ERROR: NO COOCURRENCE FILE GIVEN!
Aborted (core dumped)

It is giving this error.

Aditya, 21 November 2019 at 04:46
How to resolve the following error:
ERROR: NO COOCURRENCE FILE GIVEN!
Aborted (core dumped)
I've also tried the above steps but it didn't work.
code01029 January 2022 at 06:54
Which one is the trained model? How can we load the trained alignment model to get alignment results on a new pair of eng-hi sentences?
