Persian Morphology



Karine Megerdoomian
Computing Research Laboratory




Introduction

Persian morphology is an affixal system consisting mainly of suffixes and a few prefixes. The nominal paradigm consists of a relatively small number of affixes. The verbal inflectional system is quite regular and can be obtained by the combination of prefixes, stems, inflections and auxiliaries.

The following is a brief description of Persian morphology as well as the morphological grammar used in the Shiraz project. Since we are only dealing with written text, the difficulties encountered in the analysis of the written language are briefly discussed. The following sections describe the nominal and verbal morphology of Persian. The last section presents examples of the rules used in the morphological analyzer. The transliteration used is described in the Appendix.


Ambiguities in Written Text

Certain ambiguities arise in a computational analysis of Persian text because the same surface form can represent different morphemes. In addition, short vowels are not marked in written text, so a single written word may have several possible analyses. For instance, the word mrdm could be analyzed, among other possibilities, either as the noun mardom (people) or as the past tense of the verb mordan (to die): mordam (I died). Furthermore, certain affixes always appear bound, whereas others can also appear as free morphemes. The morphological analyzer is able to recognize all the possible surface forms of the affixes; to disambiguate, it also uses the part of speech of the element that the morpheme attaches to.
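The mrdm ambiguity can be illustrated with a minimal Python sketch. It is not part of the Shiraz analyzer: the two-entry lexicon, the function names, and the assumption that exactly the letters a, e and o stand for the unwritten short vowels are all hypothetical simplifications.

```python
from typing import List

# Hypothetical mini-lexicon: vocalized form -> gloss. Only the two readings
# mentioned in the text are listed.
VOCALIZED = {
    "mardom": "noun: people",
    "mordam": "verb: I died (mordan, past, 1sg)",
}

# Assumption: a, e and o are the short vowels that are not written.
SHORT_VOWELS = set("aeo")


def written_form(vocalized: str) -> str:
    """Drop the short vowels to approximate the written (unvocalized) form."""
    return "".join(ch for ch in vocalized if ch not in SHORT_VOWELS)


def analyses(surface: str) -> List[str]:
    """Return every lexicon reading whose written form equals the surface string."""
    return sorted(
        f"{form} ({gloss})"
        for form, gloss in VOCALIZED.items()
        if written_form(form) == surface
    )


for reading in analyses("mrdm"):
    print(reading)
```

Both entries collapse to the same written string, so the lookup returns two candidate readings that a later component would have to choose between.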


Nominal Morphology

There are no case forms and no gender distinctions in Persian. Person, number and sometimes animacy, however, are distinguished. Although there is no overt definite marker, a suffix is used on nouns and adjectives to indicate indefiniteness. The enclitic suffix which links nominal elements to a relative clause has the same surface form as the indefinite. There exist several morphemes to mark plurality, some of which are borrowings from Arabic. There are also some plural forms in Persian that follow the Arabic template morphology (also known as "broken" plurals) as shown below.

    ketâb --> kotob (books)
    faghir --> fogharâ (poor [people])

However, the rules for forming these plurals are not productive in Persian. These loan words are listed in the lexicon and need not undergo morphological analysis.

The elements within a Noun Phrase are linked by the enclitic particle called ezafe. This morpheme is usually an unwritten vowel, but it could also have an orthographic realization in certain phonological environments. The role of the ezafe is to mark nominal determination and it indicates nothing as to the nature of the semantic relation between the linked elements. In most cases, this relation can be translated as a genitive (or possessive) structure. Examples of this construction are given below:

Adjectives follow the same morphological patterns as nouns. They can also appear with comparative and superlative morphemes. Certain adverbs, mainly manner adverbs, can behave like adjectives and can appear with all the adjectival affixes. There are three types of ordinal constructions in Persian, which are formed by attaching their respective morphemes to the cardinal number.

Personal pronouns can appear either as free forms or as clitics. Although the cliticized pronouns have the same surface form, they can have different functions depending on the part of speech of the host or the syntactic context in which they appear. On the last element of a Noun Phrase, the clitic is interpreted as a possessive pronoun: ketâb-at [book + 2sg] (your book). Attached to transitive verbs and prepositions, the clitic is the accusative form of the personal pronoun: did-am-at [see(past) + 1sg infl. + 2sg] (I saw you). The clitic may appear on adverbials, numerical expressions and interrogative elements with a partitive meaning: vasat-ash [middle + 3sg] (in the middle of it). On intransitive verbs, it can be used as the subject clitic. It is also used in impersonal verbal constructions. Most of these usages, however, are limited to colloquial speech; apart from the possessive clitics, they are rarely used in written text.
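The dependence of the clitic's reading on its host can be sketched as a simple lookup table in Python. The category labels and the function name are hypothetical; only the four host-to-reading cases come from the description above, and impersonal constructions are left out.

```python
# Host category -> reading of an attached pronominal clitic.
# The category labels are hypothetical names chosen for this sketch.
CLITIC_READING = {
    "noun_phrase_final": "possessive",
    "transitive_verb": "accusative",
    "preposition": "accusative",
    "adverbial": "partitive",
    "numeral": "partitive",
    "interrogative": "partitive",
    "intransitive_verb": "subject",
}


def clitic_reading(host_category: str) -> str:
    """Return the reading of a clitic on the given host, or 'unknown'."""
    return CLITIC_READING.get(host_category, "unknown")


print(clitic_reading("noun_phrase_final"))  # host as in ketâb-at
print(clitic_reading("transitive_verb"))    # host as in did-am-at
```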

The present indicative of the verb budan (to be) has a series of enclitic forms which can attach to the elements within a Noun Phrase. This morpheme is a verbal element but it can attach to nouns, adjectives and classifiers. The morphological analyzer needs to recognize this copula morpheme and separate it into a distinct lexical structure.

There exist other lexical elements, such as the preposition be, the postposition râ, or the relativizer ke, that usually appear as separate words in written text, but which can also be found as attached morphemes.


Verbal Morphology

Inflectional Paradigm

The inflectional system for Persian verbs consists of simple forms and compound forms; the latter require an auxiliary verb. The simple forms are divided into two groups according to the stem used in their formation: the tenses that use the Present Stem and those formed on the Past (or Aorist) Stem. The Present Stem needs to be specified in the lexicon since it cannot be derived, while the Past Stem is easily derivable from the infinitival form of the verb. The citation form for the verb is the infinitive.
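The asymmetry between the two stems can be sketched in Python as follows. This is an illustration rather than the analyzer's actual code: the function names are hypothetical, the Past Stem is assumed to be the infinitive minus its final -an, and the Present Stems in the table are illustrative lexicon entries for verbs mentioned in this document.

```python
from typing import Dict, Tuple

# Hypothetical lexicon: infinitive (citation form) -> Present Stem.
# Present Stems have to be listed because they cannot be derived.
PRESENT_STEMS: Dict[str, str] = {
    "kardan": "kon",
    "dâdan": "deh",
    "zadan": "zan",
    "budan": "bâsh",
    "mordan": "mir",
}


def past_stem(infinitive: str) -> str:
    """Derive the Past Stem by removing the infinitive ending -an (assumption)."""
    if not infinitive.endswith("an"):
        raise ValueError(f"not an infinitive in -an: {infinitive!r}")
    return infinitive[:-2]


def stems(infinitive: str) -> Tuple[str, str]:
    """Return (Present Stem from the lexicon, Past Stem derived by rule)."""
    return PRESENT_STEMS[infinitive], past_stem(infinitive)


print(stems("mordan"))
print(stems("kardan"))
```

The derived Past Stem mord is the base of the form mordam (I died) cited earlier in the document.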

In addition to the verb stems, the following elements also participate in the formation of the verbal inflectional system in Persian:

The complete inflectional system can be obtained by the various combinations of these elements.

Light Verbs

Most verbal constructions in Persian are formed using a light verb such as kardan (do, make), dâdan (give), zadan (hit, strike). The number of verbs that can be used as light verbs is limited, but these constructions are extremely productive in Persian. These structures consist of a preverbal element, which could be a noun, adjective or preposition, followed by a light verb, which has partly or completely lost its original meaning. Since these Light Verb or Compound Verb constructions are noncompositional in meaning, they are included in the dictionary as compounds.

Verbal inflection can only appear on the light verb itself, but bound morphemes can be attached to the preverbal element as well as the light verb. These inflectional morphemes are analyzed in the morphological component.


Morphological Grammar

The linguistic information associated with the morphemes is described using a unification-based morphological formalism. The morphological rule describes the concatenation of stems and morphemes (using regular expressions) and the combination of morphological features of words and morphemes (using feature structures and unification). Stems and their features are stored in the lexicon as feature structures. A morphological rule associates a surface form, representing a sequence of morphemes, to a linguistic structure, and describes how the features of the stem and the morpheme are combined.
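To make the notion of unification concrete, here is a minimal Python sketch that treats feature structures as nested dictionaries. It is a simplification under stated assumptions (no variables, no reentrancy, no type hierarchy) and does not reproduce the formalism used in the project; the function name and the sample entry are hypothetical.

```python
from typing import Any, Dict, Optional

FeatureStructure = Dict[str, Any]


def unify(a: FeatureStructure, b: FeatureStructure) -> Optional[FeatureStructure]:
    """Merge two feature structures; return None if any feature values clash."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            # Feature only present in b: copy it over.
            result[key] = b_val
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            # Both values are structures: unify them recursively.
            sub = unify(a_val, b_val)
            if sub is None:
                return None
            result[key] = sub
        elif a_val != b_val:
            # Atomic values that differ cannot be unified.
            return None
    return result


# Hypothetical lexicon entry and the information contributed by a plural morpheme.
entry = {"type": "Noun", "exp": "ketAb", "lex": {"regular": True}}
plural = {"type": "Noun", "lex": {"regular": True}, "infl": {"number": "Plural"}}

print(unify(entry, plural))
print(unify({"type": "Verb"}, plural))
```

The first call succeeds and adds the infl information to the entry; the second fails because the type values differ.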

As an example, consider the Plural rule for Persian given below (string variables are prefixed with the dollar sign, regular expressions are enclosed between angle brackets):


Plural = <
	   <$stem "hA">
	   Noun[exp: "$stem$",
		lex: [regular: True],
		infl: [number: Plural]]
>;

The regular expression in angle brackets describes the surface form of the morpheme (the suffix hA in this example). The feature structure on the following lines gives a partial description of the entry. The type is Noun, which indicates that the plural morpheme appears on a lexical element of type Noun. exp is the orthographic form (or citation form) of the entry as it is entered in the lexicon. The lexical information available in the dictionary is presented under lex, and the inflectional information is given under the path infl. The feature structure unifies the given inflection with the morpheme information. In this specific example, the morphological rule marks the number feature as Plural if the lexical element is of type Noun and is marked as regular in the dictionary.
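The behavior of the Plural rule can be approximated in Python as a rough sketch. This is not the project's rule engine: the one-entry lexicon, the function name, and the use of Python's re module in place of the formalism's own regular expressions are assumptions made for illustration.

```python
import re
from typing import Any, Dict, Optional

# Hypothetical lexicon: citation form -> type and lexical features.
LEXICON: Dict[str, Dict[str, Any]] = {
    "ketAb": {"type": "Noun", "lex": {"regular": True}},
}

# Surface pattern of the rule: a stem followed by the suffix hA.
PLURAL_PATTERN = re.compile(r"^(?P<stem>.+)hA$")


def apply_plural(surface: str) -> Optional[Dict[str, Any]]:
    """Return the analysis of a regular plural noun, or None if the rule fails."""
    match = PLURAL_PATTERN.match(surface)
    if match is None:
        return None
    stem = match.group("stem")
    entry = LEXICON.get(stem)
    # The rule only applies to lexicon entries of type Noun marked as regular.
    if entry is None or entry["type"] != "Noun" or not entry["lex"].get("regular"):
        return None
    return {
        "type": "Noun",
        "exp": stem,
        "lex": dict(entry["lex"]),
        "infl": {"number": "Plural"},
    }


print(apply_plural("ketAbhA"))
print(apply_plural("ketAb"))
```

The first call yields a Noun structure with number set to Plural; the second returns None because the surface form does not end in hA.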

It is also possible to account for the morphotactics of the inflections (i.e., the relative order of the morphemes). For instance, the indefinite marker in Persian can follow the plural morpheme but the reverse is not true. This rule can be written in the following manner:


Indefinite = <
		< <<$base \ Vowel> "yy">
		  Noun[infl.indefinite: True] > |
		< $base
		  Noun[infl.indefinite: False] >
>;

The Plural rule can be used within the Indefinite rule in order to account for more complex morphological phenomena. The string analyzed by the Plural rule is bound to the variable base. This variable can then be used in the Indefinite rule to check, for instance, the character that it ends in. In other words, after the plural morpheme hA has been detected on the word, the Indefinite rule applies. The first alternative checks whether the surface form matched by base ends in a vowel; this is true here, since hA ends with the vowel "A". The feature structure that follows requires this entry to be of type Noun. The successful application of this rule adds (unifies) the corresponding structure to the output feature structure. So, in this example, if the suffix yy has been recognized following the plural morpheme hA, the indefinite feature in the structure is marked True; otherwise it is marked False.
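The ordering of the two suffixes can be sketched in Python by stripping the indefinite marker first and then analyzing what remains as a possible plural. The function names and the vowel inventory are hypothetical, and the lexicon check on the stem is omitted to keep the example short.

```python
from typing import Any, Dict, Optional, Tuple

# Assumption: transliteration characters counted as vowels in this sketch.
VOWELS = set("Aaeiou")


def split_indefinite(surface: str) -> Tuple[str, bool]:
    """Strip a final yy if the remaining base ends in a vowel."""
    if surface.endswith("yy") and len(surface) > 2 and surface[-3] in VOWELS:
        return surface[:-2], True
    return surface, False


def split_plural(base: str) -> Tuple[str, Optional[str]]:
    """Strip the plural suffix hA from the base, if present."""
    if base.endswith("hA") and len(base) > 2:
        return base[:-2], "Plural"
    return base, None


def analyze(surface: str) -> Dict[str, Any]:
    """Apply the indefinite check outside the plural check, mirroring the rule order."""
    base, indefinite = split_indefinite(surface)
    stem, number = split_plural(base)
    infl: Dict[str, Any] = {"indefinite": indefinite}
    if number is not None:
        infl["number"] = number
    return {"type": "Noun", "exp": stem, "infl": infl}


print(analyze("ketAbhAyy"))
print(analyze("ketAbhA"))
```

Because the indefinite suffix is only looked for at the very end of the word, a string with the suffixes in the reverse order is never analyzed as plural plus indefinite.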


