Weizmann Institute

Introduction to Parsing
Morphologically-Rich Languages

European Summer School for Logic, Language and Information (ESSLLI) 2013, Germany

Reut Tsarfaty, Weizmann Institute of Science

Keywords

Statistical Parsing, Structure Prediction, Joint Modeling, Morphology and Morphsyntax, Cross-Framework Evaluation, Universal Schemes.

Description

Parsing is a key task in natural language processing, in which we aim to automatically predict, for every input sentence, a graphical representation that captures its predicate-argument structure (informally, "who did what to whom"). Parsers are key components in a range of technological applications, from Question Answering systems to Machine Translation technology. The best parsing systems to date are data-driven and statistical, and have been shown to accurately predict parses for English texts. Cross-linguistic evaluation campaigns, however, reveal that these results are somewhat misleading, and that many of these systems do not perform as well when applied to parsing morphologically-rich languages (PMRL).

Morphologically rich languages (MRLs) are languages such as Arabic, Czech, Finnish, Hebrew, Turkish, and many more. Their shared characteristic is that predicate-argument information, which English usually marks through word order, may instead be marked at the word level, which allows for variable word-order patterns. These properties cast doubt on the adequacy of the formal methods and structure-prediction procedures developed for accurately predicting such structures in English, and make those methods a poor starting point for the development of language technology for MRLs. This is the first course to explore the intricate relationships between refined linguistic structure and advanced structure prediction, and to propose principled solutions.

In this course we will build from the ground up a formal framework that accommodates the intricate structures of MRLs yet allows for effective statistical learning. We cover parsers for different representation types (Phrase-Structures, Dependency-Structures, Relational Networks) and apply both generative and discriminative modeling methods, paying careful attention to the feature-space that learners have to deal with. We further present and solve particular challenges concerning the annotation and evaluation of MRL structures, setting the stage for the development and comparative evaluation of cross-linguistic universal-parsing techniques.

Materials*

Copyright Note

The outline and materials on these pages form the basis of the PMRL textbook now being developed for publication in the Synthesis Lectures on Human Language Technologies series by Morgan and Claypool Publishers. The citation entry for the manuscript is displayed here.

Contact

If you found these materials helpful, or if you have any thoughts or comments, I would be happy to hear from you.