Title: Computational readability: need for a domain-oriented approach? 
Speaker: Thomas François (Penn)
Place: Science Center, Rm 4102, CUNY Graduate Center, 5th Ave & 34th St.

Abstract:

Readability aims at automatically assessing the difficulty of texts for a given
population, using some of the linguistic characteristics of the texts. The
classic attempts to do so (Flesch, 1948; Dale and Chall, 1948) were often
concerned with providing a general tool that could be used for a large range of
situations. More recently, researchers have focused on more specific groups of
readers, such as adults with intellectual disabilities (Feng et al., 2009) or
readers in a foreign language (François, 2012). These studies have shown that,
in such contexts, some specialized features are more valuable than the
“classic” ones.

So far, the impact of the corpus used to train readability models, which is
strongly connected with the task a given formula targets, has not been much
investigated. However, Collins-Thompson et al. (2005) applied their model to
various corpora and noticed large differences in performance across them.

In this talk, we first report on our previous work, which led to a new
computational model for the readability of French as a foreign language (FFL),
trained on a corpus of texts from textbooks. We then describe how this
state-of-the-art model behaves when applied to a different corpus of texts, from
FFL “readers”. We show not only that ten-fold cross-validation does not
provide an accurate estimate of the model's performance for all types of texts
intended for an FFL audience, but also that the usefulness of some features
may vary dramatically depending on the characteristics of the corpus. We
conclude by suggesting that these findings argue for a more domain-oriented
approach to readability and that the main avenue for further performance
improvement might be the reliable labelling of a large domain corpus.

Bio:

Thomas François is a Belgian American Educational Foundation (BAEF) and
Fulbright Fellow. He is doing a postdoctoral research stay at the Institute for
Research in Cognitive Science (IRCS) of the University of Pennsylvania, where he
focuses on various approaches to improving current readability models for FFL. He
received his Ph.D. from the University of Louvain (Belgium) in 2011. His
thesis, entitled “Les apports du traitement automatique du langage à la
lisibilité du français langue étrangère”, provides a comprehensive review of
the readability field for English and French as well as a new readability
formula for FFL. It received the 2012 Best Thesis Award from ATALA, the French
association for NLP.