Word Segmentation and Transliteration in Chinese and Japanese

Speaker: Masato Hagiwara (Rakuten Institute of Technology, New York)

Time: 2:15pm-3:30pm, April 5, Friday

Place: Room 6496, CUNY Graduate Center, 5th Ave & 34th St.

Processing Chinese and Japanese demands special attention because of
their complex, unsegmented writing systems based on ideographic Chinese
characters.

Because words are not explicitly separated by whitespace, word
segmentation is an essential step for processing these languages. We
first review traditional and state-of-the-art approaches to word
segmentation and POS tagging (together called morphological analysis
in Japanese processing) in both languages. These approaches
include: (semi-)Markov structure prediction models, CRF-based models,
stack-based decoding models, and pointwise approaches.

We then elaborate on transliteration, one of the most fundamental but
difficult problems when dealing with these languages because of the
uniqueness of their sound systems compared to major European
languages, especially English. We specifically focus on recent
semantic transliteration models which take different language origins
into consideration.

Finally, we touch upon recent advances in models that integrate
knowledge from these two fields: transliteration helps proper word
segmentation and greatly reduces compound noun splitting errors.

Masato Hagiwara is a senior scientist working at Rakuten Institute of
Technology, New York.  He received his Ph.D. degree in Information
Science from Nagoya University in 2009. Before joining Rakuten, he
worked at Google and Microsoft Research as an intern, and at Baidu
Japan as a full-time R&D engineer, focusing on search engine-related
Japanese language processing. His research interests include Japanese
and Chinese word segmentation, knowledge acquisition, transliteration,
and language education. He received several paper awards from Japanese
domestic conferences for his work on knowledge acquisition and
transliteration. He speaks Japanese, Chinese, and English fluently.