
Natural Language Processing - Word Alignment

I am looking for word alignment tools and algorithms.
I am working with bilingual English-Hindi text and am currently using:

  • DTW (Dynamic Time Warping) algorithm
  • CLA (Competitive Linking Algorithm)
  • NATools
  • Giza++

Could you please suggest any other language-independent algorithm or tool that can perform statistical word alignment on a parallel English-Hindi corpus, and a way to evaluate the resulting alignments?
I have read that some tools work best for certain languages. How true is that? If it is true, could you give an example of a tool better suited to Asian languages like Hindi? Counter-examples of what I shouldn't use for such languages are also welcome.

I have heard a bit about the Uplug word aligner. Could someone tell me whether this tool would be useful for my purpose?

Thank you.. :)

boddhisattva asked Mar 11 '10 14:03


2 Answers

The Berkeley Aligner is very good. By doing joint training of the IBM word alignment models, it's able to get a much lower alignment error rate (AER) than older packages like GIZA++.
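Since AER is also what you would use to evaluate any of these tools, here is a minimal sketch of the metric as it is usually defined: AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted alignment, S the "sure" gold links and P the "possible" gold links (with S contained in P). The function name and the toy link sets are made up for illustration; links are (source index, target index) pairs.

```python
def alignment_error_rate(predicted, sure, possible=None):
    """Alignment error rate of predicted links against gold links.

    Each alignment is a collection of (source_index, target_index) pairs.
    'possible' links are optional; sure links always count as possible.
    """
    a = set(predicted)
    s = set(sure)
    p = s | set(possible or ())
    if not a and not s:
        # Nothing predicted and nothing expected: treat as no error.
        return 0.0
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Toy example (hypothetical data): three predicted links, two sure gold
# links and one extra possible gold link.
predicted = {(0, 0), (1, 2), (2, 1)}
sure = {(0, 0), (1, 1)}
possible = {(1, 2)}
aer = alignment_error_rate(predicted, sure, possible)
```

Lower is better: a prediction identical to the sure links scores 0.0.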

It also supports some more advanced features such as syntactic distortion (i.e., using parse tree information to get better alignments). For this, you only need parse trees for one language of the pair, so you should be okay doing Hindi<->English, since there are plenty of good, freely available English parsers.

If you decide not to go with the Berkeley Aligner, you should probably just use GIZA++. For years, it has been essentially the standard word aligner in the machine translation community.
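To give a feel for what these aligners do internally, here is a rough sketch of EM training for IBM Model 1, the simplest of the IBM models that GIZA++ implements. It is a toy illustration, not GIZA++'s actual code: it omits the NULL word and everything the higher models add, and the function names and the two-sentence transliterated Hindi-English corpus are invented for the example.

```python
from collections import defaultdict


def train_ibm_model1(bitext, iterations=10):
    """Estimate lexical translation probabilities t[(f, e)] = P(f | e).

    bitext is a list of (foreign_tokens, english_tokens) sentence pairs.
    """
    f_vocab = {f for f_sent, _ in bitext for f in f_sent}
    # Start from a uniform distribution over the foreign vocabulary.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected link counts under the current t.
        for f_sent, e_sent in bitext:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the counts per English word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def best_alignment(f_sent, e_sent, t):
    """Link each foreign word to its most probable English word."""
    return [
        (i, max(range(len(e_sent)), key=lambda j: t[(f, e_sent[j])]))
        for i, f in enumerate(f_sent)
    ]


# Toy transliterated corpus (hypothetical): "bada ghar" ~ "big house".
bitext = [(["bada", "ghar"], ["big", "house"]), (["ghar"], ["house"])]
t = train_ibm_model1(bitext)
links = best_alignment(["bada", "ghar"], ["big", "house"], t)
```

Nothing in the model is specific to a language pair; it only looks at co-occurrence in the parallel sentences, which is why the same tools are used for English-Hindi as for any other pair.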

dmcer answered Oct 09 '22 05:10


Uplug is a great tool; I have been using it for aligning English<->Macedonian texts. It essentially builds on Giza++ by adding so-called clue alignments. Its advanced setting combines the clue alignments with Giza++ and performs 3 such iterations. The more clues (POS tags, lemmas, ...) you provide, the better the results will be. But I have to mention that you should not expect fundamentally different results than you would get by just using Giza++.

Anyway, if you plan to seriously study the topic of SMT, I suggest that you read the PhD thesis about Uplug; it will be very beneficial for you.

msaveski answered Oct 09 '22 06:10