techie knowledge: April 2017

Installing and Using GIZA++ on Ubuntu for Word Alignment

What is GIZA++?

GIZA++ is an extension of the program GIZA (part of the SMT toolkit EGYPT), which was developed by the Statistical Machine Translation team during the 1999 summer workshop at the Center for Language and Speech Processing at Johns Hopkins University (CLSP/JHU). GIZA++ includes many additional features. The extensions of GIZA++ were designed and written by Franz Josef Och.

What is a parallel corpus?

A parallel corpus is a collection of texts, each of which is translated into one or more languages other than the original.

The simplest case involves only two languages: one of the corpora is an exact translation of the other. Some parallel corpora, however, exist in several languages.

Installing GIZA++

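Step 1- Download GIZA++ using the following command (the -O option saves the archive as giza-pp-master.zip, the name used in the later steps):

$ wget -O giza-pp-master.zip https://github.com/moses-smt/giza-pp/archive/master.zip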

Step 2- Create a folder for your GIZA++ installation

$ mkdir giza-practice

Step 3- Move the archive into the installation folder

$ mv giza-pp-master.zip giza-practice/

Step 4- Change to your installation folder

$ cd giza-practice/

Step 5- Unzip the archive

$ unzip giza-pp-master.zip

Step 6- Change to the extracted folder

$ cd giza-pp-master/

Step 7- Clean any previous build artifacts

$ make clean

Step 8- Build GIZA++ and mkcls

$ make

Creating a Parallel Corpus to Use in GIZA++

GIZA++ is a tool for word alignment; it uses a parallel corpus to create a dictionary.

In this example we use two languages: English as the source language and Hindi as the target language.

Step 1. First, create a file called hindi.txt and copy the Hindi text below into it.

मैंने उसे किताब दी .

मैंने किताब को पढ़ा .

वह किताब को प्यार करता था .

उसने किताब दी .

Step 2. Now create a file called english.txt and copy the English text below into it.

I gave him the book .

I read the book .

He loved the book .

He gave the book .

Now our parallel corpus is created. Each line of hindi.txt is the translation of the line with the same number in english.txt.

Running GIZA++

Step 1. Copy the hindi.txt and english.txt files to giza-pp-master/GIZA++-v2/

Step 2. $ cd giza-pp-master/GIZA++-v2/

Step 3. Use the following command to convert your corpus into GIZA++ format:

./plain2snt.out [source_language_corpus] [target_language_corpus]

$ ./plain2snt.out english.txt hindi.txt

Step 4. Run the following commands to generate the word classes for each language:

$ ./../mkcls-v2/mkcls -p[source_language_corpus]   -V[source_language_corpus].vcb.classes

$ ./../mkcls-v2/mkcls -p[target_language_corpus] -V[target_language_corpus].vcb.classes

Example

$ ./../mkcls-v2/mkcls -penglish.txt -Venglish.txt.vcb.classes

$ ./../mkcls-v2/mkcls -phindi.txt -Vhindi.txt.vcb.classes

Step 5. Create the output directory using the command:

$ mkdir myout

Step 6. Now use GIZA++ to build your dictionary:

./GIZA++ -S [target_language_corpus].vcb -T [source_language_corpus].vcb -C [target_language_corpus]_[source_language_corpus].snt -o [prefix] -outputpath [output_folder]

Ex.:

$ ./GIZA++ -S hindi.vcb -T english.vcb -C hindi_english.snt -outputpath myout -o test

Note: if you get an error, update the Makefile inside GIZA++-v2 and then rebuild with make clean followed by make.

Replace the line CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DBINARY_SEARCH_FOR_TTABLE -DWORDINDEX_WITH_4_BYTE

with the line CFLAGS_OPT = $(CFLAGS) -O3 -funroll-loops -DNDEBUG -DWORDINDEX_WITH_4_BYTE -DWORDINDEX_WITH_4_BYTE

GIZA++ generates its output files in the myout/ directory. Of these files, the one named [prefix].actual.ti.final (test.actual.ti.final in our case) holds the final result.

It lists each aligned pair of source and target words together with its probability value:

test.actual.ti.final:

book NULL 1

. को 0.333333

gave दी 1

He था 0.333333

him उसे 1

loved प्यार 0.5

read पढ़ा 1

the . 1

He उसने 0.333333

. किताब 0.666667

loved करता 0.5

I मैंने 1

He वह 0.333333
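To use the table in a program, it can be loaded line by line. The following is only a minimal Java sketch, assuming every line has the three whitespace-separated fields shown above (two words followed by a probability). The class name TiFinalReader and the method bestPairs are hypothetical and are not part of GIZA++, and the sample lines are copied from the listing above instead of being read from myout/.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of GIZA++): for each first-column word,
// keeps the second-column word that has the highest probability.
class TiFinalReader {

    // Assumes lines of the form "<word> <word> <probability>".
    static Map<String, String> bestPairs(List<String> lines) {
        Map<String, String> best = new LinkedHashMap<>();
        Map<String, Double> bestProb = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) {
                continue; // skip blank or malformed lines
            }
            double prob = Double.parseDouble(fields[2]);
            Double previous = bestProb.get(fields[0]);
            if (previous == null || prob > previous) {
                bestProb.put(fields[0], prob);
                best.put(fields[0], fields[1]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "gave दी 1",
                "loved प्यार 0.5",
                ". को 0.333333",
                ". किताब 0.666667");
        bestPairs(sample).forEach(
                (first, second) -> System.out.println(first + " -> " + second));
    }
}
```

With these sample lines, the entry for "." ends up paired with किताब, because 0.666667 is larger than 0.333333. On such a tiny corpus many of the pairs are wrong, which is expected.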

References:

http://www.statmt.org/moses/giza/GIZA++.html
http://okapiframework.org/wiki/index.php?title=GIZA%2B%2B_Installation_and_Running_Tutorial

hashCode() and equals() Methods in Java

As Java programmers, we know that java.lang.Object is the base class of every class in the Java language.

The Object class provides some methods with default implementations.

Since Object is the base class, the methods it defines are also available to every class defined in Java, but sometimes the default implementation of these methods is not appropriate for new user-defined classes.

Here we discuss the two most important methods of the Object class.

The methods defined in the Object class

public boolean equals(Object obj)

Indicates whether some other object is "equal to" this one.

The default implementation of the equals() method returns true only if the two references being compared point to the same object.

In other words, it compares references, not values.
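The reference comparison can be seen in a small sketch. The Point class below is hypothetical and exists only for illustration; it does not override equals(), so the implementation inherited from Object is used.

```java
// Hypothetical class that does NOT override equals(),
// so Object.equals() (reference comparison) applies.
class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }
}

class DefaultEqualsDemo {
    public static void main(String[] args) {
        Point a = new Point(1, 2);
        Point b = new Point(1, 2); // same field values, different object
        Point c = a;               // second reference to the same object

        System.out.println(a.equals(b)); // false: two distinct objects
        System.out.println(a.equals(c)); // true: same reference
    }
}
```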

public int hashCode()

Returns a hash code value for the object.

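The value returned by hashCode() is the object's hash code, an int. The default implementation typically returns distinct values for distinct objects; it may be derived from the object's memory address, but this is not required.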

Contract between equals() and hashCode()

1. If two objects are equal according to equals(), their hash codes must also be equal.
2. If you override the equals() method, you must also override the hashCode() method.
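Why the second rule matters shows up as soon as objects are stored in a hash-based collection. The sketch below uses two hypothetical classes, BrokenKey and GoodKey, which are made up for this illustration: both override equals(), but only GoodKey also overrides hashCode().

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical class: equals() is overridden, hashCode() is not.
class BrokenKey {
    final String id;

    BrokenKey(String id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof BrokenKey && Objects.equals(id, ((BrokenKey) o).id);
    }
    // hashCode() is inherited from Object, which breaks the contract
}

// Hypothetical class: hashCode() is built from the same field as equals().
class GoodKey {
    final String id;

    GoodKey(String id) {
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof GoodKey && Objects.equals(id, ((GoodKey) o).id);
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(id);
    }
}

class ContractDemo {
    public static void main(String[] args) {
        Set<BrokenKey> broken = new HashSet<>();
        broken.add(new BrokenKey("A1"));
        // Usually prints false: the equal object almost certainly
        // has a different inherited hash code, so the lookup misses.
        System.out.println(broken.contains(new BrokenKey("A1")));

        Set<GoodKey> good = new HashSet<>();
        good.add(new GoodKey("A1"));
        System.out.println(good.contains(new GoodKey("A1"))); // true
    }
}
```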

Sometimes we do not want to use the default implementation of the equals() method in our own class, so we must override it.

Ex. Suppose we have a class Student and we want to decide whether two students are equal based on the instance variable studentId.

Then we have to override the equals() method to meet our requirement.

The equals method implements an equivalence relation. It is:

• Reflexive: For any non-null reference value x, x.equals(x) must return true.

• Symmetric: For any non-null reference values x and y, x.equals(y) must return true if and only if y.equals(x) returns true.

• Transitive: For any non-null reference values x, y, z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) must return true.

• Consistent: For any non-null reference values x and y, multiple invocations of x.equals(y) consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.

• For any non-null reference value x, x.equals(null) must return false.
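These properties can be spot-checked on a few objects. The sketch below is one possible version of the Student class mentioned earlier, assuming studentId is an int and adding a name field purely for illustration. Checking a handful of instances does not prove the properties; it only shows what each one means.

```java
// Hypothetical Student class: two students are equal when studentId matches.
class Student {
    final int studentId;
    final String name; // not used by equals(); assumed here for illustration

    Student(int studentId, String name) {
        this.studentId = studentId;
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        return studentId == ((Student) o).studentId;
    }

    @Override
    public int hashCode() {
        return Integer.hashCode(studentId);
    }
}

class EqualsPropertiesDemo {
    public static void main(String[] args) {
        Student x = new Student(7, "Asha");
        Student y = new Student(7, "A. Kumar");
        Student z = new Student(7, "Asha K.");

        // Each line prints true for this implementation.
        System.out.println("reflexive:  " + x.equals(x));
        System.out.println("symmetric:  " + (x.equals(y) == y.equals(x)));
        System.out.println("transitive: " + (!(x.equals(y) && y.equals(z)) || x.equals(z)));
        System.out.println("consistent: " + (x.equals(y) == x.equals(y)));
        System.out.println("null check: " + !x.equals(null));
    }
}
```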

Here we provide an example of how to override equals() and hashCode():

package in.co.techieknowledge;

import java.util.Objects;

public class Movie {

 String movieName;
 int price;

 public Movie(String movieName, int price) {
  this.movieName = movieName;
  this.price = price;
 }

 @Override
 public String toString() {
  return "Movie name is " + movieName + " And price is " + price;
 }

 /*
  * Two Movie objects are considered equal when their movieName is the
  * same; price is deliberately ignored.
  */
 @Override
 public boolean equals(Object o) {
  if (o == this)
   return true; // same reference
  if (o == null)
   return false;
  if (this.getClass() != o.getClass())
   return false; // different class, so never equal
  Movie movie = (Movie) o;
  // Objects.equals is null-safe, so a null movieName does not throw
  return Objects.equals(this.movieName, movie.movieName);
 }

 /*
  * Built from the same field as equals(), so equal movies always share a
  * hash code.
  */
 @Override
 public int hashCode() {
  return 31 * Objects.hashCode(movieName);
 }
}

Test the above class

package in.co.techieknowledge;

public class Test {

 public static void main(String[] args) {

  // Same name, different price
  Movie movie1 = new Movie("The Ghazi Attack", 200);
  Movie movie2 = new Movie("The Ghazi Attack", 300);

  System.out.println(movie1);
  System.out.println(movie2);

  if (movie1.equals(movie2))
   System.out.println("objects are equal");
  else
   System.out.println("objects are not equal");

  System.out.println(movie2);
  System.out.println(movie1);

  // Reversed comparison: the result must be the same (symmetry)
  if (movie2.equals(movie1))
   System.out.println("objects are equal");
  else
   System.out.println("objects are not equal");
 }
}

Both comparisons print "objects are equal", because the two movies have the same name and equals() ignores the price.

References:
https://docs.oracle.com

Some Basic Points About Map, Set and List from the Java Collections Framework

A Set is a Collection that cannot contain duplicate elements.

The Java platform contains three general-purpose Set implementations:

1. HashSet :

    Uses a hash table to store its elements.
    Uses a hash function (the element's hashCode()) for storing and retrieving its elements.
    Iteration order is not maintained in a HashSet.

2. TreeSet :

   Uses a red-black tree to store its elements.
   Elements are kept sorted by their values (natural ordering, or a Comparator supplied at construction).

3. LinkedHashSet (LinkedList + HashSet)

   Implemented as a hash table with a linked list running through it.
   Orders its elements based on the order in which they were inserted into the set (insertion order).
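The difference between the three implementations is easiest to see by adding the same elements to each and looking at the iteration order. This is a minimal sketch; the class SetOrderDemo and the helper orderIn are names made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class SetOrderDemo {

    // Adds every input element to the given set and returns the
    // elements in the order the set iterates over them.
    static List<String> orderIn(Set<String> set, List<String> input) {
        set.addAll(input);
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        // "apple" appears twice; every Set keeps it only once.
        List<String> input = Arrays.asList("pear", "apple", "mango", "apple");

        System.out.println(orderIn(new HashSet<>(), input));       // three elements, order unspecified
        System.out.println(orderIn(new TreeSet<>(), input));       // [apple, mango, pear]
        System.out.println(orderIn(new LinkedHashSet<>(), input)); // [pear, apple, mango]
    }
}
```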

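A List is an ordered Collection (sometimes called a sequence). Lists may contain duplicate elements.

The Java platform contains two general-purpose List implementations:

1. ArrayList :

     Uses a resizable array to store its elements.
     Elements can be accessed randomly by index.
     Maintains the insertion order of its elements.

2. LinkedList :

   Doubly-linked list implementation of the List interface.
   Elements are accessed sequentially.
   Maintains the insertion order of its elements.

Note : Removing an element from a LinkedList is faster than from an ArrayList once the position has been reached (for example, at either end or through an iterator), because an ArrayList has to shift the elements that follow. Removing by index still requires the LinkedList to walk to that position first.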
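Both implementations share the List interface, so the same code works with either one and only the cost differs. The following sketch uses a hypothetical helper, dropAll, that removes elements through an iterator, which is the case where LinkedList avoids the shifting that ArrayList has to do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class ListDemo {

    // Removes every occurrence of `unwanted` through the iterator.
    // Works for any modifiable List.
    static List<String> dropAll(List<String> list, String unwanted) {
        Iterator<String> it = list.iterator();
        while (it.hasNext()) {
            if (it.next().equals(unwanted)) {
                it.remove(); // LinkedList unlinks a node; ArrayList shifts the tail
            }
        }
        return list;
    }

    public static void main(String[] args) {
        List<String> arrayList = new ArrayList<>(Arrays.asList("a", "b", "a", "c"));
        List<String> linkedList = new LinkedList<>(Arrays.asList("a", "b", "a", "c"));

        System.out.println(arrayList.get(2));  // a (direct index lookup)
        System.out.println(linkedList.get(2)); // a (same result, but the list walks its nodes)

        System.out.println(dropAll(arrayList, "a"));  // [b, c]
        System.out.println(dropAll(linkedList, "a")); // [b, c]
    }
}
```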

A Map is an object that maps keys to values.
A map cannot contain duplicate keys: each key can map to at most one value.

The Java platform contains three general-purpose Map implementations:

1. HashMap :

   Hash table based implementation of the Map interface.
   Makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

2. TreeMap :

   A red-black tree based NavigableMap implementation.
   The map is sorted according to the natural ordering of its keys, or by a Comparator supplied at construction.

3. LinkedHashMap :

   Hash table and linked list implementation of the Map interface.
   Maintains the insertion order of its keys.
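As with sets, the ordering behaviour can be compared by putting the same entries into each kind of map. This is a minimal sketch; MapOrderDemo and keyOrder are names invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapOrderDemo {

    // Fills the given map with the same entries and returns its keys
    // in the order the map iterates over them.
    static List<String> keyOrder(Map<String, Integer> map) {
        map.put("pear", 3);
        map.put("apple", 1);
        map.put("mango", 2);
        map.put("pear", 30); // same key again: replaces the value, no second entry
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        System.out.println(keyOrder(new HashMap<>()));       // three keys, order unspecified
        System.out.println(keyOrder(new TreeMap<>()));       // [apple, mango, pear]
        System.out.println(keyOrder(new LinkedHashMap<>())); // [pear, apple, mango]
    }
}
```

Putting "pear" a second time does not move it in the LinkedHashMap, since re-inserting an existing key leaves the insertion order unchanged.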
