
Current Projects

  • The Language of Iranian Blogs

    Persian is one of the top ten languages in the Blogosphere (source: Technorati). The main goal of the project is to provide a linguistic analysis of the language used in Iranian blogs - including conversational Persian text, code-switching, and neologisms - and provide tools for blog analysis such as a morphological analyzer, automatic topic and language classification, and a lexicon of blog vocabulary.
    Some findings from this project were published in "Low-density language strategies for Persian and Armenian," in Language Engineering for Lesser-Studied Languages, Sergei Nirenburg (ed.), IOS Press, Amsterdam, February 2009.

  • Tajiki Persian Machine Translation

    This project uses a finite-state transducer to convert Tajiki Persian text (in Cyrillic) to Iranian Persian script (Perso-Arabic), then runs the transliterated document through an existing Persian-to-English MT system. We use this strategy for the rapid prototyping of MT for the low-resource Tajiki language.
    The preliminary results were reported in "Low-density language bootstrapping: The case of Tajiki Persian," in Proceedings of LREC 2008, Marrakech, Morocco, May 2008.

  • Persian NLP Resources

    I am collecting a list of existing NLP resources for the Persian language. The results will be posted on the SFIL website.

  • Persian Heritage Language

    - Developing a language teaching tool for analysis of Persian text, based on morphological analysis.
    - Grammar book designed for Persian heritage speakers.
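As a rough illustration of the script-conversion step in the Tajiki Persian Machine Translation project above, the sketch below maps Cyrillic characters to Perso-Arabic ones with a plain lookup table. It is not the project's transducer: a real finite-state transducer would apply context-sensitive rules, whereas this toy works one character at a time. The table `CYRILLIC_TO_PERSIAN` and the function `transliterate` are hypothetical names, and the mapping is deliberately incomplete.

```python
# Toy character-level Tajiki (Cyrillic) -> Perso-Arabic mapping.
# Assumption: this table is illustrative and incomplete; it ignores context
# (for example, word-initial vowels) that a real transducer must handle.
CYRILLIC_TO_PERSIAN = {
    "б": "ب", "т": "ت", "к": "ک", "р": "ر", "д": "د",
    "м": "م", "н": "ن", "с": "س", "ш": "ش", "л": "ل",
    "о": "ا",   # Tajiki "о" commonly corresponds to the Persian long vowel written with alef
    "ӣ": "ی",
    "ӯ": "و",
    "а": "",    # short vowels are normally left unwritten word-internally
    "и": "",
}


def transliterate(word: str) -> str:
    """Map each character through the table; unknown characters pass through unchanged."""
    return "".join(CYRILLIC_TO_PERSIAN.get(ch, ch) for ch in word.lower())


if __name__ == "__main__":
    for tajiki in ["китоб", "модар"]:
        print(tajiki, "->", transliterate(tajiki))
```

Even this small table shows why the approach is attractive for rapid prototyping: once the text is in Perso-Arabic script, existing Persian tools can be reused, although words where the two orthographies diverge need the context-sensitive rules this sketch omits.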

Conference Organization

- Third Workshop on Computational Approaches to Arabic Script-based Languages; MT Summit XII, Ottawa, August 2009
- International Conference on Complex Predicates in Iranian Languages; Paris, July 2008

Persian NLP: Shiraz Project (1997-1999)

I was the computational linguist responsible for the development of the Shiraz machine translation system at the Computing Research Lab (CRL) at New Mexico State University. Shiraz was an MT prototype that translated Persian text into English, using typed feature structures and an underlying unification-based formalism to describe Persian linguistic phenomena. It used an electronic bilingual Persian-to-English dictionary of approximately 50,000 terms, a complete morphological analyzer, a syntactic parser, and transfer and generation modules. The system components were tested on a bilingual tagged corpus developed from a large Persian corpus of online material (approximately 10MB). The system was mainly targeted at translating news material.

Coverage: Tokenization and full morphological analysis. Compounds and light verbs were also recognized. The syntactic parser could analyze noun phrases (including relative clauses), prepositional phrases, and basic sentential constructions. The resulting feature structures were transferred into English syntax, and morphological generation was performed on the final translations. The dictionary was built by a team of Persian lexicographers and included single words, compounds, and phrasal expressions. It contained information about the orthography, morphosyntactic category, and syntactic properties of lexical items, as well as their English word-sense equivalents.

Detailed write-ups from the Shiraz project can be found on the publications page.
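For readers unfamiliar with unification-based formalisms such as the one mentioned above, the following is a minimal sketch of feature structure unification over nested dictionaries. It is an assumption-laden simplification, not the Shiraz implementation: it has no type hierarchy and no structure sharing (reentrancy), both of which typed feature structure systems normally provide. The function name `unify` and the feature names `cat`, `agr`, `num`, and `per` are hypothetical.

```python
def unify(a, b):
    """Return the unification of two feature structures, or None if they clash.

    Feature structures are modeled as nested dicts; atomic values unify only
    when they are equal. Assumption: None is reserved as the failure signal.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)  # shallow copy so the inputs are not modified
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # conflicting values under the same feature
                result[key] = merged
            else:
                result[key] = b_val
        return result
    return a if a == b else None


if __name__ == "__main__":
    noun = {"cat": "N", "agr": {"num": "sg"}}
    constraint = {"agr": {"num": "sg", "per": 3}}
    print(unify(noun, constraint))
    print(unify({"agr": {"num": "sg"}}, {"agr": {"num": "pl"}}))
```

The first call merges compatible information from both structures; the second fails because the two structures disagree on number, which is how a unification grammar rules out ill-formed analyses.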

Persian NLP: Entity Extraction at Inxight (2002-2004)

I was responsible for developing the linguistic aspects of the Persian (Farsi) information extraction system at Inxight Software (now Business Objects). The system performed full segmentation, morphological analysis, part-of-speech tagging, shallow parsing, named entity extraction, and transcription of proper names to English. The linguistic knowledge was developed with the Xerox finite-state technology (XFST), and its output was disambiguated using an HMM tagger.

For more information, see SAP Business Objects Text Analysis.
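To illustrate how an HMM tagger of the kind mentioned above can choose among ambiguous analyses, here is a minimal Viterbi decoding sketch. It is a generic textbook formulation, not the Inxight system: the function name `viterbi`, the two-tag tagset, and all probabilities are made-up assumptions chosen only to make the example runnable.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a toy HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            # Pick the predecessor tag that maximizes the path probability.
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the best final tag to recover the full sequence.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


if __name__ == "__main__":
    # Hypothetical toy model: every number here is invented for illustration.
    tags = ["N", "V"]
    start_p = {"N": 0.8, "V": 0.2}
    trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
    emit_p = {"N": {"time": 0.6, "flies": 0.4}, "V": {"time": 0.2, "flies": 0.8}}
    print(viterbi(["time", "flies"], tags, start_p, trans_p, emit_p))
```

In a pipeline like the one described, the candidate tags for each token would come from the morphological analyzer, and decoding would select one analysis per token.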

Other Computational Projects and Consulting

  • Automatic Farsi transcription at Inxight Software
  • Persian and Eastern Armenian entity extraction for Multilingual Named Entity Detection and Transliteration project at U. of Illinois, Urbana-Champaign
  • Slovenian morphological analyzer, English ontology acquisition, Turkish parsing grammar

Heritage Language Courses Taught

Language analysis courses, Persian and Armenian, within the UCSD Linguistics dept.