Time: 12:45pm-1:45pm, Friday, May 7
Place: Room 6496, CUNY Graduate Center, 365 Fifth Ave (34th St. & 35th St.).
Speaker: Dr. Matthew Snover (Maryland)

Selective Translation-Model Adaptation: Learning Translation Rules from
Monolingual Text

With thousands of languages in the world, and the increasing speed and
quantity of information being distributed across the world, automatic
translation between languages by computers, Machine Translation (MT), has
become an increasingly important area of research. State-of-the-art MT
systems rely not upon hand-crafted translation rules written by human
experts, but rather on learned statistical models that translate a source
language to a target language. These models are typically generated from
large, parallel corpora containing copies of text in both the source and
target languages. The co-occurrence of words across languages in parallel
corpora allows the creation of translation rules that specify the
probability of translating words or phrases from one language to the
other. Monolingual corpora, containing text only in one
language (primarily the target language), are not used to model the
translation process, but are used to better model the structure of the
target language. Unlike parallel data, which require expensive human
translators to generate, monolingual data are cheap and widely available.

Topics and events similar to those in a source document being translated
often occur in documents in a comparable monolingual corpus. In
much the same way that a human translator would use world knowledge to aid
translation, the MT system may be able to use these relevant documents
from comparable corpora to guide translation by biasing the translation
system to produce output more similar to the relevant documents.

We describe a method for generating new translation rules from monolingual
data specifically targeted for the document that is being translated. Rule
generation leverages the existing translation system and topical overlap
between the foreign source text and the monolingual text, and unlike regular
translation rule generation does not require parallel text. For each
source document to be translated, potentially comparable documents are
selected from the monolingual data using cross-lingual information
retrieval. By biasing the MT system towards the selected relevant
documents and then measuring the similarity of the biased output
to the relevant documents using Translation Edit Rate Plus (TERp), it is
possible to identify sub-sentential regions of the source and comparable
documents that are possible translations of each other. This process
results in the generation of new translation rules, where the source side
is taken from the document to be translated and the target side is fluent
target language text taken from the monolingual data. The use of these
rules results in improvements over a state-of-the-art statistical
translation system. These techniques are most effective when there is a
high degree of similarity between the source and relevant passages (such
as when they report on the same news stories), but some benefit,
approximately half, can be achieved when the passages are only
historically or topically related.
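
As a rough illustration of the similarity-measurement step, the sketch below
computes a plain word-level edit rate between a system output and candidate
passages. It is a simplification and not TERp itself: TERp also allows block
shifts and stem, synonym and paraphrase matches, none of which are modeled
here. The function names word_edit_rate and most_similar are hypothetical and
chosen only for this example.

```python
def word_edit_rate(hypothesis: str, reference: str) -> float:
    """Word-level edit distance divided by reference length.

    Assumption: whitespace tokenization, unit cost for every edit, and no
    shifts or paraphrase matching (so this is not TER or TERp proper).
    """
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the hypothesis prefix seen so far into ref[:j]
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1] / len(ref)


def most_similar(hypothesis: str, candidates: list[str]) -> str:
    """Return the candidate passage with the lowest edit rate."""
    return min(candidates, key=lambda c: word_edit_rate(hypothesis, c))
```

A lower score means the biased output is closer to the candidate passage, so
under these assumptions the lowest-scoring passages would be the ones examined
for possible sub-sentential translation pairs.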

Matthew Snover is a recent PhD graduate from the University of Maryland in
College Park.  His research focuses on  In June he will join the Blender
lab at CUNY Queens College and Graduate Center as a post-doctoral