Time: 2:15pm-3:30pm, Friday, March 11, 2011
Place: Room 4102, CUNY Graduate Center, 365 Fifth Ave (between 34th and 35th Streets).
Speaker: Mitch Marcus (Penn)
Title: Acquiring linguistic structure automatically using minimal computation

Abstract:

Modeling the acquisition of language from naturally occurring data is a central challenge for 
linguistics and child language development, and also for the applied cognitive science of 
natural language processing.  From a scientific viewpoint, this has been a central challenge 
for over fifty years.  From a technological viewpoint, supervised machine learning methods 
have yielded quite powerful technologies over the past decade, but require very expensive 
annotated corpora for training; unsupervised learning algorithms with similar performance 
would be of enormous value.

To force ourselves to more fully utilize the statistical and linguistic aspects of the signal 
the child uses to learn her native language, we have adopted a research strategy of minimal 
computation, attempting unsupervised language learning using only very simple counting 
methods.  To date, we have approached the learning of morphological structure, part-of-speech 
induction, and part-of-speech tagging using two key sources of constraint.  First, the process 
of language acquisition, whether in children or machines, must exploit the Zipfian 
statistical distribution of the underlying data stream.  We have used this constraint to 
develop a state-of-the-art algorithm for morphology acquisition.  Second, appropriate 
linguistic representations provide essential constraints about domains of locality that make 
the learning problem tractable.  We have exploited locality implications of the Minimalist 
Program to develop new algorithms for automatically distinguishing open-class lexical items 
from closed-class words and grammatical formatives, and then used the output of this process 
to develop a fully unsupervised model of part-of-speech labeling.

This talk presents joint work with Erwin Chan, Constantine Lignos, Qiuye Zhao, and Charles Yang.

Speaker Bio:

Mitchell Marcus is the RCA Professor of Artificial Intelligence in the Department of Computer 
and Information Science at the University of Pennsylvania. He was the principal investigator 
for the Penn Treebank Project through the mid-1990s; he and his collaborators continue to 
develop hand-annotated corpora for use world-wide as training materials for statistical 
natural language systems. His other research interests include statistical natural language 
processing, human-robot communication, and cognitively plausible models for automatic 
acquisition of linguistic structure. He has served as chair of Penn's Computer and 
Information Science Department, as chair of the Penn Faculty Senate, and as president of the 
Association for Computational Linguistics. He is also a Fellow of the American Association 
for Artificial Intelligence.