Saturday, January 21, 2012

Statistical Machine Translation - SMT

Machine translation services, like Google's web-based translator, are based on statistical machine translation, in which statistics play the central role in selecting the best-matching translation.


1) Introduction:
a) Word-based translation
The fundamental unit of translation is a word in some natural language.
Typically, the number of words in translated sentences differs, because of compound words, morphology and idioms.
The ratio of the lengths of sequences of translated words is called fertility, which tells how many foreign words each native word produces; for example, the English "slap" corresponds to the Spanish "dar una bofetada" (fertility 3).
Simple word-based translation cannot translate between languages with different fertility.
An example of a word-based translation system is the freely available GIZA++ package (GPL), which includes training programs for the IBM models, an HMM model and Model 6.
Word-based translation is not widely used today.

b) Phrase-based translation
The aim is to reduce the restrictions of word-based translation by translating whole sequences of words, where the source and target sequences may differ in length.
The sequences of words are called blocks or phrases; they are typically not linguistic phrases but phrases found in corpora using statistical methods.
This is the most commonly used approach today.

c) Syntax-based translation
Based on the idea of translating syntactic units, rather than single words or strings of words (as in phrase-based MT), i.e. (partial) parse trees of sentences/utterances.
The idea of syntax-based translation is quite old in MT, though its statistical counterpart did not take off until the advent of strong stochastic parsers in the 1990s.
Examples of this approach include DOP-based MT and, more recently, synchronous context-free grammars.

2) Challenges:
-Sentence alignment:
In parallel corpora, single sentences in one language may be found translated into several sentences in the other, and vice versa.
Sentence alignment can be performed with the Gale-Church alignment algorithm (a toy sketch of the idea follows this list).
-Idioms:
Depending on the corpora used, idioms may not translate "idiomatically". For example, using the Canadian Hansard as the bilingual corpus, "hear" may almost invariably be translated to "Bravo!", since in Parliament "Hear, Hear!" becomes "Bravo!".
-Different word orders:
Word order differs between languages. Some classification can be done by naming the typical order of subject (S), verb (V) and object (O) in a sentence, and one can talk, for instance, of SVO or VSO languages. There are also additional differences in word order, for instance where modifiers of nouns are located, or whether the same words are used as a question or a statement.
-Out-of-vocabulary (OOV) words:
SMT systems store different word forms as separate symbols without any relation to each other, so word forms or phrases that were not in the training data cannot be translated. This might be because of a lack of training data, changes in the domain where the system is used, or differences in morphology.
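The sentence-alignment sketch referenced above: a minimal Python illustration of the dynamic-programming idea behind length-based alignment. This is not the real Gale-Church algorithm (which models character-length ratios probabilistically and also allows 2-1, 1-2 and 2-2 matches); the length-difference cost and the skip penalty here are simplifications chosen just for illustration.

def align(src_sents, tgt_sents):
    # Align two sentence lists using 1-1, 1-0 and 0-1 moves (simplified).
    n, m = len(src_sents), len(tgt_sents)
    INF = float("inf")
    SKIP = 10.0  # hypothetical penalty for leaving a sentence unaligned
    # cost[i][j]: best cost of aligning the first i source / j target sentences
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # 1-1 match, penalized by character-length mismatch
                c = cost[i - 1][j - 1] + abs(len(src_sents[i - 1]) - len(tgt_sents[j - 1]))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            if i:        # 1-0: source sentence has no counterpart
                c = cost[i - 1][j] + SKIP
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j)
            if j:        # 0-1: target sentence has no counterpart
                c = cost[i][j - 1] + SKIP
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i, j - 1)
    # Trace back the cheapest path and collect the 1-1 pairs
    pairs, ij = [], (n, m)
    while ij != (0, 0):
        pi, pj = back[ij[0]][ij[1]]
        if (pi, pj) == (ij[0] - 1, ij[1] - 1):
            pairs.append((pi, pj))
        ij = (pi, pj)
    return list(reversed(pairs))

# Example: the middle target sentence has no source counterpart
print(align(["Hello world.", "How are you?"],
            ["Bonjour le monde.",
             "This extra target sentence has no source-side counterpart at all.",
             "Comment allez-vous ?"]))
# -> [(0, 0), (1, 2)]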

3) Models:
A) Language Model:
P(statement)
Ensures fluency and grammatically well-structured output.
Typically an n-gram model.
Needs a monolingual corpus.
B) Translation Model:
P(target statement | source statement)
Models the translation itself.
Needs a parallel corpus.
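The decoder combines the two models; in the classical noisy-channel formulation (Bayes' rule) this is written as:

best target = argmax over targets of P(target | source)
            = argmax over targets of P(source | target) * P(target)

where P(target) comes from the language model and P(source | target) from the translation model. Note that Bayes' rule turns the translation model around to P(source | target); in practice, phrase-based systems such as Moses score both directions in a log-linear combination.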

4) Our work:
We will use phrase-based translation.
We will work on translating English into Arabic.
We will use Moses, SRILM and GIZA++.

a) Environment Setup:
-Download Ubuntu 10.04 LTS
http://www.ubuntu.com/download/ubuntu/download
-Install VirtualBox
Install Ubuntu in a VirtualBox VM.
-Set up a shared folder between Windows and Ubuntu:
sudo apt-get install virtualbox-ose-guest-modules-2.6.26-2-686
(if needed, restore a previously saved copy of /sbin/mount.vboxsf, e.g. over FTP)
sudo chmod +rx /sbin/mount.vboxsf
sudo modprobe vboxvfs
sudo mount.vboxsf shared-folder /mnt/xp
or: sudo mount -t vboxsf c:/shared-folder /mnt/xp

https://forums.virtualbox.org/viewtopic.php?p=4586

-Install needed tools:
http://cl.naist.jp/~eric-n/ubuntu-nlp/dists/lucid/nlp/

b) Data Preparation:
Corpus files: bilingual, in our case Arabic and English (we will use the UN corpus).
*Needed data files:
-A large sentence-aligned bilingual parallel corpus.
We refer to this set as the training data, since it will be used to train the translation model.
-A larger monolingual corpus.
We need data in the target language to train the language model. You could simply use the target side of the parallel corpus, but it is better to assemble large amounts of monolingual text, since this will help improve the fluency of your translations.
-A small sentence-aligned bilingual corpus
To use as a development set (somewhere around 1000 sentence pairs ought to be sufficient).
This data should be disjoint from your training data.
It will be used to optimize the parameters of your model in minimum error rate training (MERT).
-A small sentence-aligned bilingual corpus
To use as a test set to evaluate the translation quality of your system and any modifications that you make to it.
The test set should be disjoint from the dev and training sets.

-Data Tokenization:
For example, using whitespace to delineate words.
For many languages, tokenization can be as simple as separating punctuation off as its own token.
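For example, using the Moses tokenizer script (the input/output file names here are illustrative):
tokenizer.perl -l en < training.raw.en > training.en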

-Data Normalization:
Normalize your data by lowercasing it.
The system treats words with variant capitalization as distinct, which can lead to worse probability estimates for their translation, since the counts are fragmented.
For each language you might want to normalize the text in other ways.
Another example is to convert all numbers into words.
Using Moses scripts:
lowercase.perl < training.ar > training.lc.ar
(Do not redirect the output to the same file as the input; the shell truncates the file before it is read. In practice lowercasing matters for the English side, since Arabic has no case.)

-Sentence length:
You can remove overly long sentences to speed up processing.
Using Moses scripts:
clean-corpus-n.perl training en ar training.clean 1 40
(arguments: corpus file prefix, the two language extensions, output file prefix, and the minimum and maximum sentence lengths)
…
Results: Input sentences: 36615 Output sentences: 36615


c) Creating Language Model:
-Statistical language modeling aims to build a statistical language model that estimates the distribution of natural language as accurately as possible.
-A statistical language model (SLM) is a probability distribution P(s) over strings s that attempts to reflect how frequently a string s occurs as a sentence.
-By expressing various language phenomena in terms of simple parameters in a statistical model, SLMs provide an easy way to deal with complex natural language in a computer.
-Used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval.
**Types:
i-Unigram models
-Used in information retrieval
-It splits the probabilities of different terms in a context, e.g. from P(t1t2t3) = P(t1)P(t2 | t1)P(t3 | t1t2) to Puni(t1t2t3) = P(t1)P(t2)P(t3).
-The probability of each word depends only on the word itself, so the model can be viewed as a combination of one-state finite automata.
-For each automaton there is only one way to reach its single state, with one associated probability; across the whole model, these word probabilities must sum to 1.
-In information retrieval contexts, unigram language models are often smoothed to avoid instances where P(term) = 0.
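As a minimal illustration of such smoothing, a toy unigram model with add-one (Laplace) smoothing (the corpus and the smoothing choice are just for this sketch):

from collections import Counter

def train_unigram(tokens):
    # Estimate add-one (Laplace) smoothed unigram probabilities
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves probability mass for unseen words
    def prob(word):
        return (counts[word] + 1) / (total + vocab)
    return prob

p = train_unigram("the cat sat on the mat".split())
print(p("the"))  # seen word: (2 + 1) / (6 + 6) = 0.25
print(p("dog"))  # unseen word: small but non-zero, so P(term) = 0 is avoided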


ii-N-Gram Language Model:
-The goal of a language model is to determine the probability of a word sequence.
-In n-gram language models, we condition the probability of a word on the identity of the last (n −1) words.
-The choice of n is based on a trade-off between detail and reliability, and will be dependent on the available quantity of training data.
-Most widely used, and many tools exist to generate this model (a toy estimation sketch follows the tool list below).
We used:
SRILM: http://www-speech.sri.com/projects/srilm/
NGramTool: http://www.nlplab.cn/zhangle/ngram.html
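To make the n-gram conditioning concrete, here is a toy maximum-likelihood trigram estimator (real toolkits such as SRILM add smoothing, e.g. Kneser-Ney, and backoff on top of this):

from collections import Counter

def train_trigram(tokens):
    # P(w3 | w1, w2) as a relative frequency over a padded token stream
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    tri = Counter(zip(padded, padded[1:], padded[2:]))
    bi = Counter(zip(padded, padded[1:]))
    def prob(w1, w2, w3):
        return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return prob

p = train_trigram("the cat sat on the mat".split())
print(p("the", "cat", "sat"))  # 1.0: "sat" always follows "the cat" here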

**N-Gram format: ARPA (the standard SRILM format). SYNOPSIS:
\data\
ngram 1=n1
ngram 2=n2
...
ngram N=nN
\1-grams:
p w [bow]
...
\2-grams:
p w1 w2 [bow]
...
\N-grams:
p w1 ... wN
...
\end\

-DESCRIPTION
The so-called ARPA (or Doug Paul) format for N-gram backoff models starts with a header, introduced by the keyword \data\, listing the number of N-grams of each length. Following that, N-grams are listed one per line, grouped into sections by length, each section starting with the keyword \N-grams:, where N is the length of the N-grams to follow.
Each N-gram line:
Starts with the logarithm (base 10) of conditional probability p of that N-gram
Followed by the words w1...wN making up the N-gram.
These are optionally followed by the logarithm (base 10) of the backoff weight for the N-gram.
The keyword \end\ concludes the model representation.
Note: Backoff weights are required only for those N-grams that form a prefix of longer N-grams in the model. In particular, the highest-order N-grams need no backoff weights (they would be useless). (So in our tri-gram example the 3-grams won't have them, but the 1-grams and 2-grams will.)
Important tags:
<s> : start-of-sentence marker
</s> : end-of-sentence marker
<unk> : class of unknown words

**Generation:
Using SRILM; we will use a tri-gram model:
For Arabic-to-English translation (English language model):
ngram-count -order 3 -interpolate -kndiscount -unk -text training.en -lm lm/english.lm
For English-to-Arabic translation (Arabic language model):
ngram-count -order 3 -interpolate -kndiscount -unk -text training.ar -lm lm/arabic.lm



d) Translation Model:
Using Moses: (Arabic to English)
nohup nice train-model.perl -scripts-root-dir /usr/share/moses/scripts/ -root-dir /mnt/xp -corpus training -f ar -e en -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/mnt/xp/lm/english.lm &>training.out
This will take time, so we send it to the background.
Training is complete once the training.out file shows the line:
(9) create moses.ini @ Mon Nov 7 14:26:51 EET 2011

Using Moses: (English to Arabic)
nohup nice train-model.perl -scripts-root-dir /usr/share/moses/scripts/ -root-dir /mnt/xp -corpus training -f en -e ar -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/mnt/xp/lm/arabic.lm &>training.out
This will take time, so we send it to the background. (If you use a 5-gram language model, change 0:3 to 0:5.)
Training is complete once the training.out file shows the line:
(9) create moses.ini @ Mon Nov 7 19:28:37 EET 2011

**This generates a set of files that make up the translation model: the phrase table, reordering tables, the configuration file, etc.

Example of a phrase-table entry:
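Each line of the phrase table pairs a source phrase with a target phrase, separated by "|||", followed by the model scores (in this generation of Moses: the two phrase translation probabilities, the two lexical weights, and the constant phrase penalty 2.718). The entry below reuses the "resolution" example from the validation step; the scores are invented for illustration:

resolution ||| القرار ||| 0.7 0.5 0.6 0.4 2.718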


** moses.ini describes all the parts of the model; it looks like:
#########################
### MOSES CONFIG FILE ###
#########################
[ttable-file]
0 0 0 5 /mnt/xp/model/phrase-table.gz
# language models: type(srilm/irstlm), factors, order, file
[lmodel-file]
0 0 3 /mnt/xp/lm/english.lm
# distortion (reordering) files
[distortion-file]
0-0 wbe-msd-bidirectional-fe-allff 6 /mnt/xp/model/reordering-table.wbe-msd-bidirectional-fe.gz
…….
…….
…….
…….

e) Validate The Generated Models:
echo "resolution" | TMP=/tmp moses -f model/moses.ini
….
….
….
Best Translation: القرار [1] [Total=-8.393]

f) Test The Model:
Using moses:
moses -config model/moses.ini -input-file test.en 1>output1.out 2> output2.out &
Keep monitoring the output files (or use ps) until the process finishes.

The first output file contains the translated text.
Example of content:
62 / 174 . معهد الأمم المتحدة الأفريقي منع الجريمة ومعاملة المجرمين
الجمعية العامة
تشير قرارها 61 / 182 المؤرخ 20 كانون 2006 وسائر قرارات ،
وإذ الأميــن ،
مراعاة بالحاجة الملحة إنشاء فعالة استراتيجيات منع الجريمة لأفريقيا وكذلك أهمية إنفاذ القوانين و القضائي الإقليمي ودون الإقليمي ،
مراعاة أيضا ) الفترة 2006-2010 ، الذي أقره اجتماع المائدة المستديرة لأفريقيا في أبوجا يومي 5 ( 6 أيلــول 2005

g) Evaluation Of The Translation:
-Re-casing: not needed for Arabic (you would first need to train a re-caser).
-Detokenize the output:
detokenizer.perl -l en < first.out > first.detokenized
-Wrap output in XML file:
wrap-xml.perl data/devtest/nc-test2007-ref.en.sgm en osama-oransa < first.detokenized > output.sgm
-Score the translation:
mteval-v12b.pl -s test.en.sgm -r test.ar.sgm -t output.ar.sgm -c

Results:
Evaluation of en-to-ar translation using:
src set “un" (1 docs, 2007 segs)
ref set "nc-test2007" (1 refs)
tst set "nc-test2007" (1 systems)

NIST score = 9.1469 BLEU score = 0.6776 for system “osama-oransa"

** Manual sgm wrapping:
-Remove " characters to avoid Excel issues.
-Use Excel to add <seg id="1...n">statement</seg> around each line.
*Source test data: (test.en.sgm)
Header:
<srcset setid="un" srclang="en">
<doc docid="un-test" genre="wb" origlang="en">
Footer:
</doc></srcset>
*Target test data: (test.ar.sgm)
Header:
<refset trglang="ar" setid="un" srclang="en">
<doc sysid="osama-oransa" docid="un-test" genre="wb" origlang="en">
Footer:
</doc></refset>
*Result test data: (output.ar.sgm)
Header:
<tstset trglang="ar" setid="un" srclang="en">
<doc sysid="osama-oransa" docid="un-test" genre="wb" origlang="en">
Footer:
</doc></tstset>
You can have multiple doc(s) in the same set (src, ref, tst), each with a unique id.
You can wrap every few <seg> elements in <p>…</p>; a minimal example follows.
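Putting the pieces together, a minimal hand-wrapped source file would look like this (the segment text is illustrative):

<srcset setid="un" srclang="en">
<doc docid="un-test" genre="wb" origlang="en">
<seg id="1">The General Assembly adopted the resolution.</seg>
<seg id="2">The round table was held in Abuja.</seg>
</doc></srcset>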


5) References:
-Moses step by step: http://www.statmt.org/moses_steps.html
-Wikipedia: http://en.wikipedia.org/wiki/Language_model
-Joshua step by step: http://cs.jhu.edu/~ccb/joshua/
-Evaluation plan (BLEU scoring reference): http://www.itl.nist.gov/iad/mig/tests/mt/2009/MT09_EvalPlan.pdf
