Current Projects and Research
- Computational analysis of Persian language weblogs and conversational text   [abstract]
- Persian multiword expressions/complex predicates [see the recently organized International Conference on Complex Predicates in Iranian Languages (July 2008)]
- Lexical semantics of events for computational analysis
- Auxiliary clitic in Eastern Armenian (with Arsalan Kahnemuyipour)  [LSA abstract]
- Computational analysis of Tajiki Persian  [LREC abstract]
- Heritage language speakers of Persian


Past Research on Persian NLP

My main experience has been in the development of computational grammars for Persian (Farsi) systems. This section describes the issues that arise in Persian computational analysis and summarizes past projects that are not proprietary.

  • Shiraz Project
While working at CRL, I was the computational linguist responsible for the development of the Shiraz machine translation system. The Shiraz project was an MT prototype developed at CRL that translated Persian text into English. The project began in October 1997 and the final version was delivered in August 1999. The system used typed feature structures and an underlying unification-based formalism to describe Persian linguistic phenomena. It used an electronic bilingual Persian-to-English dictionary of approximately 50,000 terms, a complete morphological analyzer, a syntactic parser, and transfer and generation modules. The system components were tested on a bilingual tagged corpus developed from a large Persian corpus of on-line material (approximately 10MB). The system was mainly targeted at translating news material.

This prototype performed tokenization and full morphological analysis. Compounds and light verbs were also recognized. The syntactic parser could analyze noun phrases (including relative clauses), prepositional phrases and basic sentential constructions. The resulting feature structures were transferred into English syntax, and morphological generation was performed on the final translations. The dictionary was built by a team of Persian lexicographers and included single words, compounds and phrasal expressions. It contained information about the orthography, morphosyntactic category and syntactic properties of lexical items, as well as their English word-sense equivalents.
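To make the idea of a unification-based formalism concrete, here is a minimal sketch of feature structure unification in Python. It is an illustration only and not the Shiraz implementation: it uses untyped nested dictionaries without reentrancy, whereas Shiraz used typed feature structures, and the feature names (`cat`, `agr`, `num`, `per`) are hypothetical.

```python
from typing import Optional, Union

# Assumption: a feature structure is a nested dict whose leaves are atomic strings.
Value = Union[str, dict]


def unify(a: Value, b: Value) -> Optional[Value]:
    """Return the unification of two feature values, or None if they clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy so the inputs are left untouched
        for feature, b_value in b.items():
            if feature in result:
                merged = unify(result[feature], b_value)
                if merged is None:
                    return None  # a clash anywhere below fails the whole unification
                result[feature] = merged
            else:
                result[feature] = b_value
        return result
    # Atomic values (or an atom against a structure) unify only if identical.
    return a if a == b else None


# Hypothetical lexical entry and two hypothetical grammar constraints.
noun = {"cat": "noun", "agr": {"num": "pl"}}
compatible = {"agr": {"num": "pl", "per": "3"}}
incompatible = {"agr": {"num": "sg"}}

merged_entry = unify(noun, compatible)
clash = unify(noun, incompatible)
```

Compatible information is merged into a single structure, while conflicting values (plural against singular here) make unification fail, which is how a grammar rule of this kind rejects an analysis.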

Detailed write-ups from the Shiraz project can be found on the publications page.

A general description of Persian is given in these pages:   (also available on the CRL website)
Introduction to Persian:  An overview of the linguistic properties of Persian.
Persian morphology:  An introduction to Persian inflectional morphology as well as a discussion of the morphological grammar used in Shiraz.
Persian noun phrase:  A computational analysis of the Persian NP with emphasis on boundary detection.
Persian syntax:  A description of Persian syntax with emphasis on issues that arise in a computational analysis of written text. The syntactic grammar used in Shiraz is also introduced.
System architecture:  A description of the unification-based architecture of the Shiraz system (courtesy of Jan W. Amtrup).
Dictionary structure:  A description of the dictionary structure developed for Shiraz.
Chart of Persian characters:  Includes the Unicode encoding and Shiraz transliteration of Persian characters (glyphs missing).

  • Inxight Entity Extraction system
I was responsible for developing the linguistic aspects of the Persian (Farsi) information extraction system at Inxight Software (Sunnyvale, California). The system performs full segmentation, morphological analysis, part-of-speech tagging, shallow parsing, named entity extraction, and transcription of proper names to English. The linguistic knowledge is encoded with Xerox finite-state technology (XFST), and the resulting analyses are disambiguated using an HMM tagger.
For more information, see LinguistX and Thingfinder.
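As a rough illustration of how an HMM tagger can disambiguate the candidate tags proposed by a morphological analyzer, the following Python sketch runs the Viterbi algorithm over a lattice of candidate tags. It is not the Inxight implementation: the function name, the tag set, the English toy sentence (used for readability) and all probabilities are invented for the example.

```python
import math


def viterbi(candidates, start_p, trans_p, emit_p, floor=1e-12):
    """Pick the most probable tag sequence.

    candidates: non-empty list of (token, [candidate tags]) pairs.
    start_p[tag], trans_p[(prev, tag)], emit_p[(tag, token)]: toy probabilities;
    missing entries fall back to a small floor value.
    """
    def logp(table, key):
        return math.log(table.get(key, floor))

    # columns[i][tag] = (log score of the best path ending in tag at i, previous tag)
    token, tags = candidates[0]
    columns = [{t: (logp(start_p, t) + logp(emit_p, (t, token)), None) for t in tags}]
    for token, tags in candidates[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, s + logp(trans_p, (p, t))) for p, (s, _) in columns[-1].items()),
                key=lambda pair: pair[1],
            )
            column[t] = (score + logp(emit_p, (t, token)), prev)
        columns.append(column)

    # Backtrace from the best final tag to the start of the sentence.
    tag = max(columns[-1], key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    path.reverse()
    return path


# Invented toy model: "can" and "rusts" are each ambiguous between NOUN and VERB.
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    ("DET", "NOUN"): 0.8, ("DET", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.1,
}
emit_p = {
    ("DET", "the"): 0.9,
    ("NOUN", "can"): 0.3, ("VERB", "can"): 0.4,
    ("VERB", "rusts"): 0.5, ("NOUN", "rusts"): 0.1,
}
lattice = [("the", ["DET"]), ("can", ["NOUN", "VERB"]), ("rusts", ["VERB", "NOUN"])]
best_tags = viterbi(lattice, start_p, trans_p, emit_p)
```

With these toy numbers the tagger selects DET, NOUN, VERB: although "can" is slightly more likely as a verb in isolation, the transition from a determiner strongly favors a noun.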

Other Computational Projects

  • Automatic Farsi transcription at Inxight Software
  • Persian and Eastern Armenian entity extraction for the Multilingual Named Entity Detection and Transliteration project at the University of Illinois at Urbana-Champaign
  • Slovenian morphological analyzer, English ontology acquisition, Turkish parsing grammar

Heritage Language Courses Taught

Language analysis courses, Persian and Armenian, within the UCSD Linguistics Department.