LING 406, Spring 2004

Topics in Computational Linguistics

Instructor: Richard Sproat

Time: TT 1:00 - 2:20

Place: Beckman 1420

Office Hours: Wednesdays 3-5, FLB 4103

 

This is a lab course that introduces students to the practical problems of building a working natural language processing system.

This year's proposed project is to build a fairly complete text normalization system for English, designed so that it can be adapted to other languages.

A text normalization system (also called a text preprocessing system) performs end-of-sentence detection, disambiguates and expands abbreviations, expands digit strings into number names or sequences of digit names, and so forth. The range of phenomena that such systems cover is described in:

Richard Sproat, Alan Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. "Normalization of non-standard words." Computer Speech and Language, 15(3), 287-333, 2001. Prepublication PDF version here.
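As a rough illustration of the three tasks named above, here is a minimal Python sketch. It is not the design the class will adopt and it is not taken from the paper: the abbreviation table, the function names (number_name, digit_sequence, expand_token, split_sentences, normalize), and the rule that digit strings longer than three characters are read digit by digit are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation table. A real system must disambiguate by context
# (for example, "St." may be "Street" or "Saint").
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister", "St.": "Street"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_name(n):
    """Return the English name of an integer in the range 0..999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    name = ONES[hundreds] + " hundred"
    return name + (" " + number_name(rest) if rest else "")


def digit_sequence(digits):
    """Read a digit string one digit at a time, as for a phone number."""
    return " ".join(ONES[int(d)] for d in digits)


def expand_token(token):
    """Expand a known abbreviation or a digit string; pass other tokens through."""
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    match = re.fullmatch(r"(\d+)([.,;:!?]*)", token)
    if match:
        digits, punct = match.groups()
        leading_zero = len(digits) > 1 and digits[0] == "0"
        # Assumption: short strings are numbers, everything else is a digit sequence.
        if len(digits) <= 3 and not leading_zero:
            return number_name(int(digits)) + punct
        return digit_sequence(digits) + punct
    return token


def split_sentences(text):
    """Naive end-of-sentence detection over whitespace-separated tokens.

    A token ends a sentence if it ends in . ! or ?, is not a known
    abbreviation, and is followed by a capitalized token or by nothing.
    """
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        ends = token[-1] in ".!?" and token not in ABBREVIATIONS
        if ends and (nxt is None or nxt[0].isupper()):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def normalize(text):
    """Return a list of normalized sentences."""
    return [" ".join(expand_token(t) for t in sentence)
            for sentence in split_sentences(text)]


print(normalize("Dr. Smith lives at 42 Elm St. in town. Call 5551234 now!"))
```

The sketch fails in instructive ways. Because "St." is in the abbreviation table, a sentence that really does end in "St." is never split, and a year such as 2004 is read digit by digit. Cases like these are where annotated training and test data become necessary.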

Depending on the class size, we may split into teams or work as a single team. Some members will be responsible for collecting and annotating data for training and testing; others will be responsible for designing modules of the system.

All software modules will be developed in Python, an open-source, interpreted, interactive, object-oriented programming language.

If the project is successful, we plan to publish the software on the web, with all participants listed as authors.

Here is an overview of the course and an approximate timeline.