The highly ambiguous structure of the Persian Noun Phrase (NP) causes immense difficulties for automatic parsing of written text. Numerous factors contribute to the ambiguity of the Persian NP structure. Certain vowels, known as short vowels, are not written, which produces additional lexical ambiguities. There are very few overt morphemes in the language to mark boundaries of Noun Phrases: With the exception of the specific object marker, the language lacks Case morphology. There are often no particles in written text linking the constituents of a Noun Phrase, such as "of" in English, since these particles are pronounced as short vowels and are therefore not transcribed in written form. Furthermore, since the basic word order in Persian is Subject-Object-Verb, the lack of overt morphology for marking boundaries makes it very difficult to determine where the Subject ends and the Object begins. All of these factors, coupled with a relatively free word order and the optionality of the subject, combine to make the Persian Noun Phrase extremely ambiguous for an analysis of written text.
This report introduces the constituents forming a Noun Phrase in Persian, as well as a description of its structure. It also shows how the lexical and morphological information present in the written text could be used in determining the boundaries of the NP. To describe the NP rules in the Shiraz project, a unification-based syntactic grammar was used. This grammar, known as Bolero, operates on typed feature structures. Relative Clauses are discussed briefly in the last section.
The head Noun is followed by the modifiers, which usually consist of an Adjectival Phrase (AP) construction. There can be several modifiers in a Noun Phrase. The elements preceding the head noun are the determiner, the numeral constructions and the quantifiers. Although adjectives generally follow the noun, the superlative adjective can only appear before the head. Numerals, quantifiers and superlative adjectives are in complementary distribution; if one of these elements is present, the others cannot appear within the NP. Since complementary distribution usually indicates that the lexical elements occupy the same position, the numeral, quantifier and superlative constructions are all placed under the specifier category.
The relative ordering of the constituents of the simple NP is as follows:
NP = [determiner] [specifier] head [modifier]
where the head is a Noun and the parts of speech or phrases that can appear in each of the other categories are as shown below. Brackets indicate optionality. Note that all the constituents, with the exception of the head noun, are optional.
NP = predicate head or NP = adjunct head
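The ordering template and the complementary distribution of the specifier elements described above can be illustrated with a small Python sketch. It is not part of the Shiraz grammar: the tag names, the dictionary layout and the function parse_simple_np are assumptions made for illustration, and the words are English placeholder glosses rather than Persian forms.

```python
# Surface position of each tag in the simple NP template:
# determiner < specifier < head < modifier.
# Numerals, quantifiers and superlatives share the specifier slot.
# All tag names here are hypothetical, chosen for this sketch only.
SLOT_ORDER = {
    "determiner": 0,
    "numeral": 1, "quantifier": 1, "superlative": 1,
    "noun": 2,
    "adjective": 3,
}
MODIFIER_SLOT = 3  # the only slot that may be filled more than once


def parse_simple_np(tagged):
    """Check a list of (word, tag) pairs against the simple NP template.

    Returns a dict with the filled slots, or raises ValueError when the
    sequence violates the ordering or fills a single slot twice.
    """
    np = {"determiner": None, "specifier": None, "head": None, "modifiers": []}
    last_slot = -1
    for word, tag in tagged:
        if tag not in SLOT_ORDER:
            raise ValueError(f"unknown tag: {tag}")
        slot = SLOT_ORDER[tag]
        if slot < last_slot:
            raise ValueError(f"{word!r} ({tag}) is out of order")
        if slot == last_slot and slot != MODIFIER_SLOT:
            # Covers complementary distribution: a second specifier
            # element competes for the slot that is already filled.
            raise ValueError(f"{word!r} ({tag}) competes for a filled slot")
        if slot == 0:
            np["determiner"] = word
        elif slot == 1:
            np["specifier"] = (tag, word)
        elif slot == 2:
            np["head"] = word
        else:
            np["modifiers"].append(word)
        last_slot = slot
    if np["head"] is None:
        raise ValueError("a simple NP needs a head noun")
    return np
```

Under these assumptions, a sequence tagged numeral, noun, adjective is accepted, while a sequence containing both a numeral and a superlative is rejected because the two compete for the specifier slot.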
The complex noun phrase is the equivalent of the genitive or possessive constructions in English, such as "Mao's red book", "her mother's hat" or "the syntax of noun phrases". In English, the link between the two nouns is marked by 's (e.g., Mao's) or the preposition "of". In the case of pronouns, the latter appear in their genitive form (e.g., her). Other languages, such as Turkish or Armenian, use Case to indicate the link between noun phrases.
The element joining the Persian noun phrase constituents to each other is the ezafe suffix. The ezafe, however, is usually pronounced as the short vowel /e/ and is therefore not marked in written text. The result, in Persian written text, is a series of consecutive nouns without any overt links or boundaries as shown in the example (1) transcribed as it appears in Persian text (i.e., without short vowels). The actual pronunciation for this example is given in (2); the ezafe morpheme is represented by the -e following the first three nouns, linking each one to the following constituent. Note that the last constituent in the NP does not carry the ezafe suffix, thus marking the end boundary of the noun phrase.
In this example, each noun forms a simple NP, and these simple NPs then join together to form the complex NP given in (1). The lack of Case and agreement, as well as of overt linking morphemes, coupled with a verb-final word order, can make the computational parsing of Persian NPs extremely ambiguous. In the next section, we will present possible boundary markers or joining elements in Persian that can help resolve some of the parsing ambiguities.
The constituent ordering for the simple standard NP, discussed above, already points to some of the beginning boundaries of the Persian NP. Hence, the determiner, if present, is the first element of the noun phrase. If there is no determiner, the specifier is the first element. As already mentioned, the possessor elements constitute the boundaries of the complex noun phrase as well. In other words, if a simple nominal NP is followed by a possessor NP structure, the two NPs can join to form a bigger NP, but no other NP element can join to the right of this newly formed complex NP (see note 3).
Consider the sentence in (1) below with its corresponding noun phrase boundaries as shown on the gloss in (2).
In this example, there are eight NP constituents between the preposition "according to" and the final verb. The NP boundary can, in principle, fall after any of the nouns in this sentence, which leads to a very high parsing ambiguity. Now compare the sentence below containing proper nouns:
In this case, the proper names can be used to detect the final boundaries of the noun phrases, so analyses joining Amryka (US) and banv (Lady), or linking fransh (France) and vzyr (minister), will not be formed.
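To give a rough sense of how quickly the ambiguity grows, and how much each known boundary helps, the following Python sketch counts the ways a run of consecutive nouns can be split into contiguous noun phrases. This is a deliberate simplification and not a figure from the Shiraz parser: it counts only flat segmentations and ignores internal bracketing and grammatical function. The function count_segmentations and the gap positions used in the usage note are hypothetical.

```python
def count_segmentations(n_nouns, known=None):
    """Count the ways to split n_nouns consecutive nouns into contiguous NPs.

    Between n nouns there are n - 1 gaps, and each gap is either an NP
    boundary or a join. `known` maps a gap index (0-based) to True for a
    known boundary or False for a known join; only the remaining gaps
    are free to vary.
    """
    if n_nouns < 1:
        raise ValueError("need at least one noun")
    known = known or {}
    gaps = n_nouns - 1
    for gap in known:
        if not 0 <= gap < gaps:
            raise ValueError(f"gap index {gap} is out of range")
    free_gaps = gaps - len(known)
    return 2 ** free_gaps


# Eight nouns with no boundary information: 2**7 = 128 segmentations.
unconstrained = count_segmentations(8)

# The same run with two gaps resolved (positions chosen arbitrarily
# here, e.g. boundaries after proper names): 2**5 = 32 segmentations.
constrained = count_segmentations(8, known={2: True, 5: True})
```

Every gap that a proper name or another marker resolves halves the number of flat segmentations in this simplified count.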
Although the indefinite article and the enclitic particle have different syntactic functions, they have the same surface representation and cannot be differentiated in morphology. This morpheme can appear on a noun or on an adjective.
In all of these instances, the presence of the indefinite/enclitic morpheme (IE) marks the NP boundary, in the sense that no other NP element can follow the noun or the noun-AP combination in the examples above. This is exemplified below:
Since the noun mrdm appears with the enclitic affix y, the simple NP consisting of mrdmy Azadh is not allowed to join to the NP to the right.
Instead of appearing as a lexical pronoun, the possessive pronoun may be cliticized onto the rightmost constituent of the simple NP as shown in the two examples below. When the clitic is present, it marks the end boundary and the simple NP cannot join to the following nominal element to form a complex NP.
The sentence below shows how the clitic is used to denote the end boundary of the Noun Phrase, thus not allowing it to join to the following element.
The ezafe morpheme does not mark the end boundary of a Noun Phrase but rather the lack thereof, since the ezafe is used to join the head of a NP to the constituents following it. As already mentioned, the ezafe is rarely written in Persian text since it is a short vowel. When it appears after a vowel, however, it has the surface form y. In these cases, the ezafe can be used to indicate that the simple NP should be joined to the following nominal phrase.
In the sentence below, the adjective zyba (beautiful) appears with an overt ezafe morpheme, which indicates that the simple NP zn zybay (beautiful woman/wife + EZ) should be joined to the following NP (Dariush), thus forming the complex noun phrase zn zybay daryvsh. In other words, the NP boundary can NOT be at this location.
In the second example below, on the other hand, the adjective zyba (beautiful) does not carry the ezafe suffix. Note that since this word ends in a vowel, if the ezafe were present, it would have appeared in its overt form y; hence we can conclude with certainty that the ezafe is not present. The absence of the ezafe indicates that a boundary should be set following the adjective, thus forming two separate noun phrases as shown.
The following table presents the co-occurrence possibilities for the ezafe, indefinite/enclitic and pronominal clitic morphemes. The combination of these features is used in certain rules in the syntactic parser.
This section introduces a few sample NP rules from the Bolero syntactic grammar. These rules demonstrate how the information from the structure and the boundary-marking elements of the NP are incorporated within the grammar.
The presence of a boundary marker, such as the indefinite/enclitic morpheme, is denoted by a feature called boundary on the NP feature structures. When a boundary marker (e.g., clitic or IE) is encountered, the value for this feature is set to True. The True value indicates that the NP has reached a boundary and cannot join to the following constituent to form a bigger noun phrase. If no boundary-marking morpheme is found on the NP constituents, the value is set to False. In such cases, the NP is free to join to the next element. In certain cases, as when the presence or absence of an ezafe morpheme cannot be determined, the boundary is set to "Undefined", in which case the NP may or may not join to the constituent following it.
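The three-valued boundary feature can be sketched in Python as follows. This is one reading of the description above, not the Bolero implementation: the names Boundary and boundary_value and the three input flags are hypothetical. In particular, giving the boundary markers precedence over an overt ezafe is an assumption, since the co-occurrence constraints between these morphemes are not modelled here.

```python
from enum import Enum


class Boundary(Enum):
    TRUE = "True"            # NP is closed; it cannot join to the right
    FALSE = "False"          # NP is open; it is free to join to the right
    UNDEFINED = "Undefined"  # cannot tell; the parser keeps both options


def boundary_value(has_ie=False, has_pronominal_clitic=False, ezafe=None):
    """Derive the boundary feature from morphological information.

    ezafe is True when an overt ezafe is written (surface form y after a
    vowel), False when its absence is certain (the word ends in a vowel
    but carries no y), and None when it cannot be determined because
    the ezafe would not be written anyway.
    """
    # Assumption: an indefinite/enclitic or pronominal clitic closes the
    # NP regardless of the ezafe value.
    if has_ie or has_pronominal_clitic:
        return Boundary.TRUE
    if ezafe is True:
        return Boundary.FALSE
    if ezafe is False:
        return Boundary.TRUE
    return Boundary.UNDEFINED
```

For instance, boundary_value(ezafe=True) yields Boundary.FALSE, which corresponds to a form such as zybay with an overt ezafe, while a call with no information at all yields Boundary.UNDEFINED.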
Consider the rule NounBarIndefinite given below. This rule contains a left-hand side (lhs) and a right-hand side (rhs) as in rewrite rules.
// N' --> Noun carrying the indefinite/enclitic morpheme; boundary is set to True
NounBarIndefinite = Rule[
  lhs: NounBar[
    head: #head,
    boundary: True],
  rhs: <:
    #head= Entry[form.morph:[
      lex.pos: Noun,
      infl.indefiniteEnclitic: True]]
  :>
];
The right-hand side of this rule is satisfied if an entry with a Noun POS is recognized, which also carries an indefinite/enclitic morpheme. As can be seen in the left-hand side of this rule, this nominal element is tagged as the head of the N' and the value of the boundary feature is set to True. The boundary value is transferred up when the higher NP level is formed as shown below in the NPo rule.
The NPo is the feature structure forming a standard simple NP. It contains all of the constituents that could constitute the standard noun phrase. Each constituent on the rhs is linked by a variable (marked by the pound sign #) to the elements in the feature structure in the lhs of the rule. As mentioned, the boundary value that was set for the NounBar (N') is also transferred up to the NPo structure.
// NPo --> Det? Spec? N' where N' --> Noun Adj?
NPo = Rule[
  lhs: NounPhraseZero[
    determiner: #det,
    specifier: #spec,
    head: #head,
    modifier: #mod,
    boundary: #bnd],
  rhs: <:
    "optional" #det= Entry[form.morph.lex.pos: Determiner]
    "optional" Specifier[specType: #spec = Top]
    NounBar[
      head: #head = Top,
      modifier: #mod = Top,
      boundary: #bnd = Top]
  :>
];
The complex noun phrase, which consists of two or more simple NPs, is formed using the recursive rule called complexNP. The right-hand side of this rule looks for an NPo structure followed by a Noun Phrase feature structure. This construction could be exemplified with the noun phrase zn zybay daryvsh (woman beautiful-EZ Dariush), in which the simple NP (or NPo) "woman beautiful-EZ" and the proper name NP "Dariush" join to form a bigger NP. What should be noted is that the right-hand side of this rule is satisfied only if the boundary value is set to False or to Undefined. Hence, if the boundary value is True, such as when an IE morpheme is encountered, the complex noun phrase will not be formed.
// NP --> NPo NP, allowed only when the NPo boundary is False or Undefined
complexNP = Rule[
  lhs: NounPhrase[
    head: #np1,
    possessor: #np2],
  rhs: <:
    #np1= NounPhraseZero[
      boundary: FalseOrUndefined]
    #np2= NounPhrase
  :>
];
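The effect of the boundary check in the complexNP rule can be mimicked in a few lines of Python. This is only a sketch of the rule's logic, not of Bolero's unification machinery: the classes NPZero and NounPhrase and the function complex_np are hypothetical stand-ins for the corresponding feature structures, and the boundary value is represented as a plain string.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class NPZero:
    """Stand-in for a simple NP (NounPhraseZero) with its boundary value."""
    words: Tuple[str, ...]
    boundary: str  # "True", "False" or "Undefined"


@dataclass
class NounPhrase:
    """Stand-in for a NounPhrase: a head NPo with an optional possessor."""
    head: NPZero
    possessor: Optional["NounPhrase"] = None


def complex_np(np1: NPZero, np2: NounPhrase) -> Optional[NounPhrase]:
    """Join np1 to np2 if np1 has not reached a boundary.

    Mirrors the FalseOrUndefined constraint on the right-hand side of
    the rule: when np1.boundary is "True" the rule does not apply and
    None is returned.
    """
    if np1.boundary not in ("False", "Undefined"):
        return None
    return NounPhrase(head=np1, possessor=np2)


# zn zybay carries an overt ezafe, so its boundary is False and it can
# join to the proper name that follows.
woman = NPZero(words=("zn", "zybay"), boundary="False")
dariush = NounPhrase(head=NPZero(words=("daryvsh",), boundary="True"))
joined = complex_np(woman, dariush)

# mrdmy carries the indefinite/enclitic morpheme, so the boundary is
# True and the join is blocked.
people = NPZero(words=("mrdmy", "Azadh"), boundary="True")
blocked = complex_np(people, dariush)
```

Because the possessor is itself a NounPhrase, the result of one join can serve as the second argument of another, which is how the recursion in the rule builds longer chains of simple NPs.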
Relative Clauses are used to give further information about a nominal element, such as in the English sentence "The man, whom I met yesterday, has had an accident.", where "whom I met yesterday" represents a relative clause providing further information about "the man". In Persian, relative clauses are usually introduced by the relativizer kh [ke] (that), which is used regardless of the animacy, gender or function of the head noun. In nonrestrictive relative clauses, the head noun often carries the Enclitic morpheme which links it to the following relative clause. In these instances, the head noun is usually interpreted as definite.
The relative clause construction is similar to English: The head noun is followed by the relativizer (kh in Persian), which is then followed by the clause that relates to the head noun, as shown below:
head noun [`kh' [ Clause] ] ...
In certain cases, the relative clause can be separated from the head noun by the verb of the sentence. In addition, several relative clauses could follow a head noun. As mentioned above, the relativizer kh does not vary depending on animacy or function of the head noun; in other words, relative pronouns such as "who", "which", "whom" do not exist in Persian. It is also not possible to precede the relativizer by a preposition as in the English examples "to whom", "in which".
If the head noun is the subject or direct object of the relative clause, it is often left as a gap as shown in the examples (1) and (2) below, respectively. Note that the subject in the clause (tv "you") is optional, since Persian is a pro-drop (i.e., optional subject) language.
In certain instances, however, even if the head noun is the subject or direct object of the relative clause, it may be replaced by a pronoun in the clause it originated from. In the following example, the head noun plak kvchk (small plaque) is the subject of the relative clause; it is substituted by the resumptive pronoun An (it). The use of the resumptive pronoun usually occurs when the head noun is separated from the relative clause by an intervening verb. In this example, the verb py brdh and (have found) precedes the relative clause.
In the example (1) above, the head noun ayn bchh-ha (these kids) is the indirect object of the clause; it is extracted from the PP complement of the verb "ask". As the example shows, the preposition az (from) is left behind in the relative clause, and the head noun is replaced by a pronoun Anha (they/them). A similar example is given in (2) with an inanimate head noun. In (3), the head noun zn (woman) is also extracted from the PP complement of the clausal verb. In this instance, however, the head noun is replaced by a clitic pronoun -sh (him/her), which appears attached to the preposition bray (for). The word-for-word gloss for this example is then "woman that for-her book (you) bought".
If the head noun of the relative clause is the object of the main sentence, then it may appear with the object marker ra, as shown in the following sentence. Note that the head noun receives an object marker, even if it is the subject of the relative clause.
The examples below show a head noun separated from the relative clause by an intervening verb and an intervening adverb, respectively.
This report describes the structure of the Noun Phrase in Persian and explains how certain morphological and syntactic features could be helpful in determining boundaries of Noun Phrases in Persian. These constraints, when incorporated within the syntactic grammar, can reduce the number of parses produced during analysis. The way in which these boundary markers were incorporated within the Bolero grammar, used in specifying the syntactic structure of Persian, is also discussed. The final section covers the relative clause constructions in Persian.
1. Since there is no Case in Persian, the surface form of the pronoun is always the same whether it is used in a subject, object or possessive context.
2. There are no capital letters in Persian, hence proper names are not easily differentiated from nouns.
3. There are cases as in (i), where the proper name can be modified and even joined to the right by another possessor (here, a pronoun) but such instances seldom occur in written text.