Persian Syntax



Karine Megerdoomian
Computing Research Laboratory



Introduction

Persian syntax is quite ambiguous in written form, which raises difficulties for the automatic parsing of written text. Several factors contribute to the ambiguity. Although Persian is a verb-final language, it does not adhere to a strict word order, and sentential constituents may occur in various positions in the clause; this is especially the case for preposition phrases and adverbials. In addition, there are no overt markers, such as case morphology, to indicate the function of a noun phrase or its boundary; in Persian, only specific direct objects receive an overt marker. In spoken language, the ezafe morpheme links the elements within the noun phrase, but this morpheme, being a short vowel, is absent in written text. Furthermore, subjects are optional in Persian, and subject-verb agreement is not always present for inanimate subjects. Since short vowels are not transcribed, lexical ambiguity is a further problem in the automatic parsing of Persian text.

Persian preposition phrases, however, are easily recognized and can be used to mark phrasal boundaries in the sentence. Additionally, the verb almost always occurs in sentence-final position in written text, which facilitates parsing. This section describes Persian syntax, concentrating on issues that may arise in a computational analysis of written text. Certain rules used in the Shiraz syntactic grammar are presented. Syntactic disambiguation methods, where available, are also discussed.


Word Order

Persian is an SOV language: sentences appear in the word order Subject-Object-Verb. The verb is marked for tense and aspect and usually agrees with the subject in person and number. Persian is a pro-drop language; thus the subject is optional. The object marker is used to indicate specific direct objects in simple sentences (see footnote 1).

If there is an oblique object or a Prepositional Phrase in the clause, it precedes the indefinite direct object as shown in (2), but usually follows the specific or definite object as in (3).

Although these examples describe the canonical word order, Persian is a free word order language and sentential constituents can be moved around in the clause. These "scrambled" clauses often give rise to focused or topicalized readings. In the written language, although most elements may appear in relatively free order, sentences often remain verb-final. Adverbs and preposition phrases, however, can appear in various positions quite freely. Apart from manner adverbs, which occur within the verb phrase, adverbs may appear almost anywhere in the clause, between the various constituents. Adverbs usually cannot follow the verb.

Although Persian is verb-final at the sentential level, it behaves like a head-initial language in noun phrases (NP) and preposition phrases (PP). Thus, the head noun in an NP is often followed by its modifiers and possessors (4), and the preposition precedes its complement NP (5).

Certain preposition phrases, such as locative and directional PPs, can follow the verb as shown in the following examples. The preposition is sometimes optional in these cases. These constructions, however, do not often occur in written text.

Subordinate clauses follow the main clause, as illustrated in (7). Persian has the complementizer ke (that), which marks both subordinate constructions and relative clauses; it is often optional.

Questions are usually formed in-situ, i.e., the element being questioned is replaced by the interrogative form without changing the word order.

Noun Phrases

Simple Noun Phrases

The head of a noun phrase can be a noun or an infinitival verb. Pronouns and proper names may also head noun phrases, and they function as possessors in complex noun phrases (i.e., possessive constructions such as ketâb-e hushang (Hushang's book)).

The head noun is preceded by the determiner, the numeral constructions and the quantifiers, and it is followed by the modifiers, which usually consist of an adjectival phrase (AP). Superlative adjectives, however, do not appear in the AP; instead, they precede the head noun. Numeral constructions, quantifiers and superlative adjectives are in complementary distribution, i.e., if one of these elements is present, the others cannot occur within the NP.

The relative ordering of the constituents of the simple NP is given below:

NP = determiner specifier head modifier
where the head is a Noun and the parts of speech or phrases that can appear in each of the other categories are as shown below. Note that all the constituents, with the exception of the head noun, are optional.
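
The ordering template and the complementary-distribution constraint can be sketched as a small check in Python. This is only an illustration: the tag names (DET, NUM, QUANT, SUPERL, N, AP) and the function name are assumptions, not labels from the Shiraz grammar, and the sketch assumes that the specifier slot holds at most one numeral construction, quantifier or superlative adjective.

```python
# Hypothetical tag names; the Shiraz grammar's own category labels may differ.
SPECIFIERS = {"NUM", "QUANT", "SUPERL"}  # in complementary distribution


def is_simple_np(tags):
    """Return True if tags fit: (determiner) (specifier) head (modifier)*."""
    i = 0
    if i < len(tags) and tags[i] == "DET":       # optional determiner
        i += 1
    if i < len(tags) and tags[i] in SPECIFIERS:  # at most one specifier
        i += 1
    if i >= len(tags) or tags[i] != "N":         # the head noun is obligatory
        return False
    i += 1
    # Everything after the head must be a modifier (adjectival phrase).
    return all(tag == "AP" for tag in tags[i:])


print(is_simple_np(["DET", "NUM", "N", "AP"]))  # well-formed sequence
print(is_simple_np(["NUM", "QUANT", "N"]))      # two specifiers are rejected
```

Because only one specifier is consumed before the head is required, a sequence containing both a numeral construction and a quantifier fails the check, which mirrors the complementary distribution described above.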

The modifiers are linked to the head noun with the ezafe morpheme. The following example represents a simple Noun Phrase where CL stands for Classifier and Ez for the ezafe morpheme. Classifiers indicate the class or type of the noun. Thus, for instance, is used with count inanimate nouns, nafar indicates people, qalâde (=collar) can be used when giving a count for dogs, etc.

The infinitival constructions are very similar to the English gerundive. The infinitive head can appear in a predicate construction or with an adverbial. The objects of the verb become arguments of a possessive construction as exemplified in (11).

Possessive Constructions

These constructions are the equivalent of the genitive or possessive constructions in English, such as "Mao's red book", "her mother's hat" or "the syntax of noun phrases". In English, the link between the two nouns is marked by "'s" (e.g., Mao's) or the preposition "of". Pronouns appear in their genitive form (e.g., her). The element joining the Persian noun phrase constituents to each other is the ezafe suffix. The ezafe, however, is usually pronounced as the short vowel /e/ and is therefore not marked in written text. The result, in Persian written text, is a series of consecutive nouns without any overt links or boundaries, as shown in the example in (12), transcribed as it appears in Persian text (i.e., without short vowels). The actual pronunciation for this example is given in (13), where the ezafe morpheme is represented by the -e following the first three nouns, linking each one to the following constituent. Note that the last constituent in the NP does not carry the ezafe suffix, thus marking the end boundary of the noun phrase. In this example, each noun forms a simple NP; these simple NPs then join together to form the complex NP given in (12).

When pronouns are used as the possessor, the constructions are identical:

NP Boundaries

Certain morphological and syntactic elements can help resolve some of the ambiguities that arise in parsing Persian written text. As already mentioned, the ezafe suffix can mark boundaries within Noun Phrases. Unfortunately, this morpheme is often absent from the written form. It does occur, however, after the vowels â and u, as exemplified below.

In (15), the adjective zyba is followed by the ezafe suffix y, which indicates that the adjective is linked to the following element daryvsh. Thus, the absence of the ezafe after the adjective zyba will mark a noun phrase boundary as illustrated in the examples below.

Certain morphemes, such as the pronominal clitics, the indefinite article and the enclitic used to link NPs to relative clauses, can only occur as the last element in the NP. The detection of any of these morphemes indicates that the boundary of the noun phrase has been reached. In addition, proper names and pronouns often mark the boundary of the noun phrase.

In the current Shiraz grammar, these boundary markers have been incorporated within the NP rules. Thus, if a simple noun phrase carries a boundary marker, it is not allowed to join with another NP to form a more complex phrase. As a simple illustration, consider the two N'-forming rules, NounBarClitic and NounBarEzafe. These rules contain a left-hand side (lhs) and a right-hand side (rhs) as in rewrite rules. In the first rule, the right-hand side is satisfied if a clitic is detected (indicated by clitic.function: True). As can be seen in the left-hand side of this rule, this nominal element is tagged as the head of the N' (per.NounBar) and the value of the boundary feature is set to True. This boundary value is transferred up when the higher NP level is formed; this NP will not be allowed to join to another noun phrase following it since the boundary has already been set to True.


// N' --> N   carrying a boundary marker
NounBarClitic = per.Rule[
    lhs: per.NounBar[
        head: #head,
        boundary: True],
    rhs: <:
        #head = per.Noun[infl.clitic.function: True]
    :>
];

In the case of the NounBarEzafe rule, however, when an ezafe feature is detected (shown in the right-hand side of the rule as infl.ezafe:per.EzTrue), the boundary feature in the left-hand side is set to False. This allows the N' and the higher NP to join to the following noun phrase construction.


// N' --> N   carrying ezafe - no boundary set
NounBarEzafe = per.Rule[
    lhs: per.NounBar[
        head: #head,
        boundary: False],
    rhs: <:
        #head = per.Noun[infl.ezafe: per.EzTrue]
    :>
];
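
The effect of the boundary feature in these two rules can be illustrated with a short Python sketch. It is not the Shiraz implementation: the class names, the boolean fields standing in for the inflectional features, and the helper `may_join_following_np` are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Noun:
    form: str
    has_clitic: bool = False  # stands in for infl.clitic.function: True
    has_ezafe: bool = False   # stands in for infl.ezafe: per.EzTrue


@dataclass
class NounBar:
    head: Noun
    boundary: bool


def noun_bar(noun):
    """Build an N' from a noun, mirroring NounBarClitic and NounBarEzafe."""
    if noun.has_clitic:
        return NounBar(head=noun, boundary=True)   # the NP must end here
    if noun.has_ezafe:
        return NounBar(head=noun, boundary=False)  # may link to what follows
    return None  # other N'-forming rules are outside this sketch


def may_join_following_np(nbar):
    """A phrase whose boundary is already set cannot absorb a following NP."""
    return not nbar.boundary


linked = noun_bar(Noun("ketâb", has_ezafe=True))
closed = noun_bar(Noun("kotak", has_clitic=True))
print(may_join_following_np(linked), may_join_following_np(closed))
```

In the grammar itself the boundary value is percolated from N' to the NP level; the sketch collapses those two levels into one object for brevity.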

Relative Clauses

Persian relative clauses are usually introduced by the relativizer ke (that), which is used regardless of the animacy, gender or function of the head noun. In restrictive relative clauses, the head noun often carries an enclitic morpheme (Encl) which links the noun to the following relative clause. If the relativized noun is the object of the main sentence, then it may appear with the object marker, as illustrated in (20).

The relative clause may be separated from the head noun by the main verb, as illustrated below. In addition, several relative clauses can follow a single head noun.

If the head noun is the subject or direct object of the relative clause, it is often left as a gap as was shown in the examples in (19) and (20). However, even in such cases, the relativized noun may be replaced by a resumptive pronoun in the clause it originated from. Thus, in (22), the head noun plâk-e kuchak (small plaque) is the subject of the relative clause; it is substituted by the resumptive pronoun ân (it). The use of the resumptive pronoun usually occurs when the head noun is separated from the relative clause by an intervening verb. In this example, the verb pey borde-and (have found) precedes the relative clause.

When the head noun is the indirect object or is extracted from a Prepositional Phrase adjunct in the clause, a resumptive pronoun is used. In other words, the position from which the head noun originates is substituted by a pronoun that agrees with the head noun. This is exemplified in the three NP cases below:

Verb Phrases

As already discussed, the verb in Persian usually occurs in the sentence-final position, with objects, adverbials and adjuncts preceding it. The relative order of the direct object and the indirect object or PP may be modified based on the specificity of the direct object. The verb agrees in number and person with the subject of the clause. However, if the subject is inanimate, the agreement may default to the third person singular as illustrated in the contrast in the examples below, taken from the same newspaper article, both containing an inanimate plural subject but giving rise to different agreements on the verb:

Light Verb Constructions

Persian simple verbs are quite rare compared to the number of light verb constructions, also known as complex predicates, in the language. These constructions consist of a noun, adjective or preposition followed by a light verb such as "do", "give" or "hit", forming non-compositional units of meaning. In other words, the meaning of these light verb constructions cannot be obtained by translating each element separately, as the examples illustrate:
zamin xordan "floor eat" to fall
zendegi kardan "life do" to live
gul zadan "deception hit" to deceive
shekast dâdan "defeat give" to defeat
e'lâm kardan "announcement do" to announce
âsib didan "damage see" to be damaged
pâyân yâftan "ending find" to end
na're keshidan "yelling pull" to yell, to roar
e'teqâd dâshtan "belief have" to believe
be donyâ âmadan "to world come" to be born
az dast dâdan "from hand give" to lose

These constructions can also be used as purely idiomatic expressions:
del be daryâ zadan "heart to sea hit" to take a risk

In any case, these complex predicates are extremely productive in Persian. New verbs are formed following this pattern, by joining a nominal or adjectival word (possibly a loan word) to a light verb as shown:
email zadan "email hit" to (send) email
klik kardan "click do" to click (on a mouse)

In addition, simple verbs have been dying out and are being replaced by light verb constructions. The light verbs used in these complex predicates are not always semantically vacuous. In fact, these verbs may contribute to the aspectual reading of the predicate or give the verb a causative interpretation. They may also contribute to the transitivity of the verb phrase, as shown in the examples below. The first sentence consists of the light verb construction shekanje dâdan (torture give) and gives rise to a transitive sentence. The second sentence, on the other hand, is formed with the light verb construction shekanje didan (torture see), and the result is a passive reading.

For the purposes of the Shiraz project, however, light verb constructions were entered into the dictionary as lexical units with their corresponding English translations. In other words, light verb constructions are treated as compounds in the Shiraz machine translation system: each element of the construction undergoes morphological analysis, and the results are joined together when the light verb construction is recognized. Consider the example in (31), representing a light verb construction in which both the nominal and the verbal parts carry morphemes.

In this example, the light verb zadand carries information on tense, aspect, number and person. The nominal part kotak (beating) carries the clitic pronoun for third person singular. On verbs, this clitic is analyzed as an object (i.e., accusative); on the noun here, the morphological analyzer labels it as possessive, as (32) shows. The result of morphological analysis and lexical lookup for each part is shown in (32) for the nominal part and in (33) for the verbal part, where lex represents the lexical information and infl is the inflectional information computed by the morphological analyzer. Note that in (32), the noun has been analyzed as singular and as carrying a clitic pronoun (third person singular). In (33), the verb is analyzed as active voice, preterite tense, with third person plural agreement; there are no clitic pronouns on the verb.

(32) Noun[
    lex : LexMorph[number : Singular, regular : True],
    infl : NominalInfl[
        number : Singular,
        clitic : Clitic[
            person : Third,
            number : Singular,
            function : Possessive],
        ezafe : EzFalse,
        indefEncl : False,
        indefinite : False,
        enclitic : False],
    exp : "ktk",
    trans : <: LSign[exp : "beating"] :>]

(33) Verb[
    lex : LexMorph[
        number : Singular,
        presentStem : "zn",
        regular : True],
    infl : VerbalInfl[
        voice : Active,
        clitic : Clitic[function : Null],
        tense : Preterite,
        causative : False,
        negation : False,
        mood : Indicative,
        person : Third,
        participle : PartFalse,
        numberAgr : Plural],
    exp : "zdn",
    trans : <:
        LSign[exp : "hit"]
        LSign[exp : "play"] :>]

The simple rule below shows how the two parts of such a light verb construction are unified in the Shiraz grammar, and how their morphological information is percolated from the right-hand side, up to the left-hand side, in order to form the single light verb construction NominalLVEntry.

(34) NominalLV = per.Rule[
    lhs: per.NominalLVEntry[
        infl: [ mood: #mood,            // verbal morphemes
                tense: #tense,
                voice: #voice,
                person: #person,
                numberAgr: #numberAgr,
                causative: #caus,
                negation: #neg,
                participle: #part,
                clitic: #clitic,        // nominal morphemes
                number: #number,
                ezafe: #ezafe,
                indefEncl: #indencl,
                indefinite: #indef,
                enclitic: #encl]],
    rhs: <:
        per.Noun[infl: [
            number: #number = Top,
            ezafe: #ezafe = Top,
            indefEncl: #indencl = Top,
            indefinite: #indef = Top,
            enclitic: #encl = Top,
            clitic: #clitic = Top]]
        per.Verb[infl: [
            mood: #mood = Top,
            tense: #tense = Top,
            voice: #voice = Top,
            person: #person = Top,
            numberAgr: #numberAgr = Top,
            causative: #caus = Top,
            negation: #neg = Top,
            participle: #part = Top]]
    :>,
    recursive: True,  // This rule can apply recursively
    lookup: True,     // Perform dictionary lookup after creating lhs
    remove: True];    // Remove all edges used by this rule after parsing

The final structure for the light verb construction, after the rule in (34) has applied and the result has been looked up in the dictionary, is shown in (35). We now have a light verb construction (NominalLVEntry) resulting from the unification of the two parts.

(35) NominalLVEntry[
    lex : LexMorph[
        number : Singular,
        regular : True],
    infl : NominalLVInfl[
        voice : Active,
        number : Singular,
        causative : False,
        ezafe : EzFalse,
        tense : Preterite,
        person : Third,
        clitic : Clitic[
            person : Third,
            number : Singular,
            function : Possessive],
        participle : PartFalse,
        mood : Indicative,
        negation : False,
        numberAgr : Plural,
        indefinite : False,
        indefEncl : False,
        enclitic : False],
    exp : "ktk zdn",
    trans : <: LSign[exp : "beat up"] :>]
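
The percolation and lookup steps can be approximated in Python as follows. This is a rough sketch rather than the Shiraz unification engine: plain dictionaries stand in for typed feature structures, the one-entry lexicon and the function name `unify_light_verb` are hypothetical, and only a subset of the features in (32) and (33) is reproduced.

```python
# Hypothetical one-entry lexicon keyed on the joined written forms, as in (35).
LVC_LEXICON = {"ktk zdn": "beat up"}


def unify_light_verb(noun, verb, lexicon=LVC_LEXICON):
    """Join a noun and a verb analysis into one light verb entry, or None."""
    exp = f"{noun['exp']} {verb['exp']}"
    if exp not in lexicon:  # dictionary lookup on the combined form
        return None
    # Only 'clitic' is expected on both parts; the noun's value is kept,
    # since rule (34) takes #clitic from the nominal part.
    conflicts = (noun["infl"].keys() & verb["infl"].keys()) - {"clitic"}
    if conflicts:
        raise ValueError(f"conflicting features: {sorted(conflicts)}")
    infl = {**verb["infl"], **noun["infl"]}  # noun values override the verb's
    return {"exp": exp, "infl": infl, "trans": lexicon[exp]}


noun = {"exp": "ktk",
        "infl": {"number": "Singular", "ezafe": "EzFalse",
                 "clitic": {"person": "Third", "number": "Singular",
                            "function": "Possessive"}}}
verb = {"exp": "zdn",
        "infl": {"voice": "Active", "tense": "Preterite", "person": "Third",
                 "numberAgr": "Plural", "clitic": {"function": "Null"}}}

entry = unify_light_verb(noun, verb)
print(entry["trans"], entry["infl"]["tense"], entry["infl"]["numberAgr"])
```

The merged entry carries the verb's tense and agreement together with the noun's number and clitic, which is the same combination of features visible in (35).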

If light verb constructions always occurred as a single unit, they could easily be recognized. This is not the case, however. The two parts of the construction can be separated by intervening elements. The object of the light verb, for instance, may appear between the two parts of the construction, as shown in (36) for the light verb construction âsheq shodan (fall in love). In (37), the light verb predicate afzâyesh yâftan (increase) has been separated by the adjective shadid, which is behaving as an adverb. (38) represents the light verb construction xâstâr shodan (request) with an intervening object, which itself consists of a complex noun phrase composed of an NP and a PP.

In all of these examples, the separated parts of the light verb construction are still to be recognized as one unit. However, in certain cases, the separated constituents lose the light verb construction meaning. Compare the two sentences in (39). In (39a), the light verb construction is interpreted as a unit, whereas in (39b), the intervening object marker splits the light verb construction. In this case, the nominal part jâru (broom) has become the direct object of the verb zadan (to hit). A similar effect is obtained by the relativization of the nominal part in (40).

Compare (40), however, to the construction in (41) with the light verb predicate latme zadan (damage). In this instance, even when the nominal element is relativized, the light verb construction still obtains.

The examples discussed in this section show that light verb constructions do not form a unified category. Further research is required, however, to better classify the various light verb predicates according to their properties. The current Shiraz dictionary contains more than 8000 light verb constructions, and the syntactic parser can correctly recognize them, along with any inflection that appears on them. The parser, however, cannot at this point recognize light verb constructions with intervening elements.
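
A limitation of this kind is what one would expect from a recognizer that only considers adjacent words. The Python sketch below illustrates the idea and is not a description of the Shiraz parser's actual algorithm: the lexicon keyed on pairs of citation forms and the function name are assumptions, and inflection is ignored.

```python
# Hypothetical lexicon of light verb constructions, keyed on (nominal, verb).
LVC_PAIRS = {
    ("afzâyesh", "yâftan"): "increase",
    ("zendegi", "kardan"): "to live",
}


def find_adjacent_lvcs(tokens, lexicon=LVC_PAIRS):
    """Return (index, translation) for each lexicon pair found side by side."""
    hits = []
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if pair in lexicon:
            hits.append((i, lexicon[pair]))
    return hits


# Contiguous construction: recognized.
print(find_adjacent_lvcs(["afzâyesh", "yâftan"]))
# Intervening element, as with shadid in (37): missed by adjacency matching.
print(find_adjacent_lvcs(["afzâyesh", "shadid", "yâftan"]))
```

Widening the search window would recover cases such as (37), but it would also wrongly merge cases such as (39b), where the separated parts no longer form a light verb construction, so the classification question raised above has to be addressed first.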



Footnotes

1. Brackets indicate optionality in the examples. Ez=ezafe morpheme discussed in the section on Noun Phrases. Imp=imperfective marker, Neg=negation, Obj=object marker, Past=past tense, Present=present tense, Subj=subjunctive. Person is marked by 1, 2 or 3; number is either pl=plural or sg=singular.
