Shiraz Dictionary Structure





The Shiraz Persian-to-English dictionary consists of approximately 50,000 entries, including single words, phrases and proper names. Each entry consists of several fields: the stem used in dictionary lookup, the part of speech, and the English translations. Irregular entries also carry morphological features. This information is stored as a feature structure describing each dictionary entry.
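As an illustration of the fields listed above, an entry could be modeled roughly as follows. This is only a sketch: the class name, the attribute names, and the use of a plain dictionary for features are assumptions made for the example and do not reflect the actual Shiraz storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DictionaryEntry:
    """Hypothetical model of one dictionary entry (names are illustrative)."""
    headword: str                       # citation form in Shiraz romanization
    pos: str                            # part of speech, e.g. "Noun", "LightVerb"
    senses: List[str]                   # English translations
    variants: List[str] = field(default_factory=list)        # orthographic variants
    present_stem: Optional[str] = None  # only meaningful for verbs
    features: Dict[str, str] = field(default_factory=dict)   # e.g. {"Number": "Plural"}


# The light verb cited in the Part of Speech section, as an entry.
entry = DictionaryEntry(headword="asrar krdn", pos="LightVerb", senses=["insist"])
```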


Citation Form

Each entry is input in the Headword field of the dictionary, in citation form, using the Shiraz romanization method. The citation form for nonverbal elements is the uninflected form or stem, and the citation form for verbal elements is the infinitive. If an entry has other orthographic variants, they are listed in a separate field called Variants.

The vowels generally known as short vowels (a, e, o) are usually not written in Persian; only the long vowels (y, u, A) are represented in text. Words that differ only in their short vowels are therefore input as a single entry in the dictionary, which creates certain ambiguities. The word krm, for instance, can be pronounced with different vowel combinations, resulting in five possible lexical elements as shown below. A reader relies on context to determine which word is meant in a given sentence.

A great number of words in Persian are compounds, such as light verbs, compound nouns, and a number of prepositions. They are input in the dictionary with a space between their constituent elements.


Part of Speech

The POS field holds the part of speech for the entry. The main Open Class parts of speech in the Shiraz dictionary are: Noun, Adjective, Proper Name, Verb, and Light Verb. Light verbs consist of one or more preverbal elements, which can be a noun, adjective or preposition, followed by a verb that has lost its original meaning; they are categorized as LightVerb in our dictionary. For example, asrar krdn[esrAr kardan], meaning "insist", consists of the noun asrar "insistence" and the verb krdn "do".
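Because compounds are entered with a space between their constituents, a light verb headword can be separated into its preverbal elements and its verb by splitting on whitespace. The function below is a minimal sketch; its name is hypothetical, and it assumes that every constituent is a single token with no internal space.

```python
from typing import List, Tuple


def split_light_verb(headword: str) -> Tuple[List[str], str]:
    """Split a light verb headword into (preverbal elements, verb).

    Assumes the last space-separated token is the verb and everything
    before it is preverbal material.
    """
    parts = headword.split()
    if len(parts) < 2:
        raise ValueError("a light verb needs at least one preverbal element")
    return parts[:-1], parts[-1]


preverbal, verb = split_light_verb("asrar krdn")
```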

Closed Class items in the dictionary are: Prepositions, Postposition (object marker ra[rA]), Conjunctions, Relativizers, Numerals (numbers and digits), Determiners, Interrogatives, Interjections, Titles, Phrases, Numeratives (classifiers used to form numeral expressions), and Number Units (which refer to numbers such as hzar[hezAr] "thousand" and mylyvn[milion] "million"). Pronouns are also among the Closed Class items. They are of two kinds: Personal Pronouns, such as mn[man] "I" or av[au] "he/she", and Quantifier Pronouns, such as hmh[hame] "everyone".

The current version of the dictionary also contains certain placeholder POS categories, such as POSNotAvailable, for entries whose part of speech was not clear to the lexicographer; these entries still need to be edited.


Present Stem

Verbs are entered in the dictionary in their infinitival form. Every Persian verb has two stems, Present and Past. The Past Stem can be derived from the infinitival form of the verb, but the Present Stem cannot easily be obtained from the surface form of the infinitive. As a result, the Present Stem of each verb is input in the dictionary.
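The asymmetry between the two stems can be sketched as follows. The sketch assumes that the Past Stem is obtained by removing the final n of the romanized infinitive (the infinitive ending, whose short vowel is not written), which is an assumption about the derivation rather than a description of the Shiraz analyzer. The Present Stem is simply looked up. The function names and the lookup table are hypothetical, and the present stem kn[kon] for krdn is supplied here only for illustration.

```python
def past_stem(infinitive: str) -> str:
    """Derive the Past Stem by dropping the final 'n' of the infinitive (assumed rule)."""
    if not infinitive.endswith("n"):
        raise ValueError("expected an infinitive ending in 'n'")
    return infinitive[:-1]


# Present stems cannot be computed from the surface form, so they are stored.
# Illustrative table with one entry; not taken from the Shiraz dictionary.
PRESENT_STEMS = {"krdn": "kn"}


def present_stem(infinitive: str) -> str:
    """Look up the stored Present Stem for an infinitive."""
    return PRESENT_STEMS[infinitive]
```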

Sense

The English translations for the Persian entries are listed in this field. For machine translation purposes, the most generic meaning of every Persian entry is input in the dictionary. However, since the dictionary also needs to be used as a stand-alone tool, the synonyms and infrequent meanings of the entries are also included.

Features

Number

Persian contains a large number of Arabic loan words. The main area in which the Arabic borrowings are noticeable is the formation of plural nouns. These plural forms follow the "broken" plural formation in Arabic, which is based on a consonantal root. The rules for forming these plurals are not used productively in Persian, however; instead, the forms derived from the Arabic morphological paradigm have been lexicalized into the language. These plural nouns are input in the Shiraz dictionary as lexical elements, with the feature Number set to Plural. These entries are treated as irregular and are not analyzed for number by the morphological analyzer.

Number Type

There are also a number of irregular ordinal numbers that cannot be derived from their cardinal forms. These ordinal numbers, which do not follow the morphological rules for ordinal formation, are input in the dictionary as irregular, with the feature Ordinal indicating the number type. Persian has three types of ordinals, distinguished by their morphological structure and syntactic behavior. When an irregular ordinal is entered in the dictionary, its corresponding ordinal type should also be set. For example, one of the ordinal forms of the cardinal number yk[yek], meaning "one", is avl[aval], meaning "first". In this case, the irregular ordinal number avl[aval] is input in the dictionary and its number type is set to Ordinal Third. (The form avly is of type Second and avlyn is of type First.)
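The three ordinal types from the example above can be sketched as an enumeration. The value labels "Ordinal First" and "Ordinal Second" are formed by analogy with "Ordinal Third" and are assumptions, as are the enum, table, and function names. The text does not say whether avly and avlyn are stored as entries of their own or derived from avl, so listing them in the table is also an assumption.

```python
from enum import Enum
from typing import Optional


class OrdinalType(Enum):
    FIRST = "Ordinal First"    # label assumed by analogy
    SECOND = "Ordinal Second"  # label assumed by analogy
    THIRD = "Ordinal Third"    # label used in the text


# Ordinal forms of yk "one" mentioned in the text, with their types.
ORDINAL_TYPES = {
    "avl": OrdinalType.THIRD,
    "avly": OrdinalType.SECOND,
    "avlyn": OrdinalType.FIRST,
}


def number_type(headword: str) -> Optional[OrdinalType]:
    """Return the ordinal type recorded for a headword, or None if there is none."""
    return ORDINAL_TYPES.get(headword)
```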

Regular Feature

If an entry is treated as irregular (i.e., if its Number feature is set to Plural, or if it is marked as an Ordinal), the value of the Regular feature is automatically set to False. This information is used by the morphological and syntactic components during analysis.
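The rule for the Regular feature amounts to a small derived value. The sketch below assumes features are held in a plain dictionary and that the number type is stored under a key named "NumberType" with values beginning with "Ordinal"; both the key names and the function name are hypothetical.

```python
from typing import Dict


def is_regular(features: Dict[str, str]) -> bool:
    """Return False for lexicalized plurals and irregular ordinals, True otherwise."""
    is_plural = features.get("Number") == "Plural"
    is_ordinal = features.get("NumberType", "").startswith("Ordinal")
    return not (is_plural or is_ordinal)


# An entry with no irregular features is regular; a lexicalized plural
# or an irregular ordinal such as avl is not.
plain = is_regular({})
broken_plural = is_regular({"Number": "Plural"})
irregular_ordinal = is_regular({"NumberType": "Ordinal Third"})
```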


A Richer Lexicon

The dictionary could be further enriched by adding features that have not been included in the current version. These features could help eliminate spurious analyses in morphology and disambiguate certain parsing and translation results. They include features indicating person or number on pronouns, animacy on nominal elements, and verb category properties such as transitivity or use in impersonal constructions. Marking compound heads and incorporating certain phonological information in the dictionary could also help disambiguate results.

