Architecture of The Shiraz Machine Translation System

Jan W. Amtrup
Computing Research Laboratory

Introduction

Meat, the Multilingual Environment for Advanced Translations, is the underlying architecture of the Shiraz machine translation system. This architecture is centered around the notion of a chart, capable of storing partial and complete results on a multitude of description levels, ranging from simple tokens, which appear in the source text, to syntactic analyses and target language output strings. The system operates on Unicode strings; complex typed feature structures are used to encode linguistic knowledge and intermediate results. The system consists of a number of modules which can be configured to perform different tasks, from glossing a text to full machine translation.

This report describes some aspects of the architecture and gives an outline of the modules involved in the translation from Persian to English. The system is written entirely in C++ and can be used both on Unix machines and on PCs. We have also applied Meat to translations from Korean, Japanese, Spanish, Russian, Serbo-Croatian, and Turkish.

Charts for Shiraz

The central data structure within Shiraz is a chart, which is used to store partial and completed results on all levels of linguistic description. A chart [Kay:80] is an acyclic, directed graph of hypotheses about parts of a document. Vertices correspond to points between words; edges denote words or descriptions of a sequence of words. Charts are well suited to representing results within a natural language processing system. They make it possible to separate the description of what needs to be processed from the exact order in which actions are carried out, thus allowing for a wide range of search and processing strategies. Moreover, they remove redundancy, since not only complete results are stored but also all partial results that arise during a computation. These partial results can be reused in a larger context.
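The chart structure described above can be illustrated with a minimal C++ sketch. It is not the Meat implementation: the names Edge, Chart, startingAt and combine are assumptions made for this illustration, and edge labels are plain strings rather than feature structures.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One hypothesis about a stretch of the input. It spans the words between
// vertex `from` and vertex `to`; vertices are the points between words.
struct Edge {
    std::size_t from;
    std::size_t to;
    std::string label;  // a word, or a description of a sequence of words
};

class Chart {
public:
    // A text of n words has n + 1 vertices.
    explicit Chart(std::size_t wordCount) : vertexCount_(wordCount + 1) {}

    // Edges always point forward, which keeps the graph acyclic.
    void add(std::size_t from, std::size_t to, const std::string& label) {
        assert(from < to && to < vertexCount_);
        edges_.push_back(Edge{from, to, label});
    }

    // All hypotheses that start at a given vertex.
    std::vector<Edge> startingAt(std::size_t vertex) const {
        std::vector<Edge> result;
        for (const Edge& e : edges_)
            if (e.from == vertex) result.push_back(e);
        return result;
    }

    // Combine two adjacent edges into a larger hypothesis. Both inputs stay
    // in the chart, so they remain available for reuse in other contexts.
    bool combine(const Edge& left, const Edge& right, const std::string& label) {
        if (left.to != right.from) return false;
        add(left.from, right.to, label);
        return true;
    }

    std::size_t size() const { return edges_.size(); }

private:
    std::size_t vertexCount_;
    std::vector<Edge> edges_;
};
```

Because combine only adds edges and never removes them, the order in which hypotheses are combined is left to the caller, which is the separation of content from processing order that the chart provides.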

Shiraz uses several types of edges to distinguish between different types and levels of description. Thus, the chart is not restricted to a single purpose (say, syntactic parsing or generation) but stores all hypotheses on all levels. Internally, so-called tags mark which module an edge belongs to. In fact, the chart used for Shiraz is a weaker version of the layered chart used in [Amtrup:97], in that it does not support hypergraphs or the distribution of modules for parallel processing.
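As a rough illustration of tagged edges, the following C++ sketch lets a module read all edges carrying one tag and add new edges with another tag over the same span. The tag names TOKEN and MATOKEN are taken from Figure 2; everything else (TaggedEdge, edgesWithTag, runModule, the string-valued edge content) is a hypothetical simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct TaggedEdge {
    std::size_t from;
    std::size_t to;
    std::string tag;      // which module produced the edge, e.g. "TOKEN"
    std::string content;  // stand-in for a feature structure
};

typedef std::vector<TaggedEdge> TaggedChart;

// Select the edges that belong to one level of description.
std::vector<TaggedEdge> edgesWithTag(const TaggedChart& chart,
                                     const std::string& tag) {
    std::vector<TaggedEdge> result;
    for (const TaggedEdge& e : chart)
        if (e.tag == tag) result.push_back(e);
    return result;
}

// Run one module: read every edge tagged sourceTag and add a new edge
// tagged targetTag over the same span. The source edges are kept.
void runModule(TaggedChart& chart, const std::string& sourceTag,
               const std::string& targetTag,
               const std::function<std::string(const std::string&)>& analyze) {
    // Copy the inputs first so that adding edges does not disturb the loop.
    const std::vector<TaggedEdge> inputs = edgesWithTag(chart, sourceTag);
    for (const TaggedEdge& e : inputs) {
        assert(e.from < e.to);
        chart.push_back(TaggedEdge{e.from, e.to, targetTag, analyze(e.content)});
    }
}
```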

Edges in the chart are annotated with complex typed feature structures following [Carpenter:92]. Different types of feature structures can be used to encode different aspects of linguistic knowledge conveniently. We use an efficient implementation based on a vector-oriented representation for feature structures. Figure 1 shows an image of a chart and some of its edges. In the lower part of the image, part of a feature structure is shown.

Figure 1: A Chart and some edges

Components and the application definition

The Shiraz system is designed to fulfill different functions within a natural language processing scenario. Two main requirements have to be met:

The core system must be capable of processing different languages, possibly using different scripts. Thus, Shiraz uses Unicode characters to represent user data throughout. Strings within feature structures are also Unicode.
The system has to handle different tasks. In the Persian case, at least a glosser and a full translation system have to be supported.

The approach we chose in order to realize a configurable, flexible system is a combination of extreme modularization and user-defined applications. Shiraz currently consists of 27 different modules. The user can compose a sequence of modules in order to build a complete application. At runtime, the system interprets the application definition and executes the modules needed.

An application definition file defines

A set of variable definitions, which can be used later on to save on typing and to group things,
A set of application definitions, which define which modules to execute for which application, and
A set of module definitions, which define the parameters for individual modules.

A small excerpt from the Persian application definition file is shown in Figure 2. It exemplifies the composition of modules to form a complete application, as well as the definition of parameters, variables, and the incorporation of command-line parameters.


// Variable definitions
$RES=/home/mcm2/meat/per

// Global parameters
tangoModule = $(RES)/shiraz.mod

// An application
application lookup = Tokenizer($File=$1):PostTokenizer:MorphAnalyzer:
                     DictionaryLookup:DictionaryCompoundLookup:ChartViewer

// Sample module definitions
module Tokenizer {
  class = Tokenizer
  inputFile = /home/mcm/$File
  encoding = UTF8
}

module MorphAnalyzer {
  class = MorphAnalyzer
  grammar = $(RES)/GenMorph.samba
  rule = Morphology
  type = chart
  sourceTag = TOKEN
  targetTag = MATOKEN
}

Figure 2: A sample application definition file
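To make the application line in Figure 2 more concrete, here is a hedged C++ sketch of how such a line could be split into a sequence of module calls. The actual Meat parser is not described in this report; ModuleCall and parseApplication are hypothetical names, and the sketch only assumes that modules are separated by colons and that an optional parameter list is enclosed in parentheses.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct ModuleCall {
    std::string name;       // e.g. "Tokenizer"
    std::string arguments;  // raw text between the parentheses, may be empty
};

// Turn one whitespace-free item such as "Tokenizer($File=$1)" into a call.
static ModuleCall makeCall(const std::string& item) {
    ModuleCall call;
    const std::size_t open = item.find('(');
    if (open == std::string::npos) {
        call.name = item;
        return call;
    }
    call.name = item.substr(0, open);
    const std::size_t close = item.rfind(')');
    if (close == std::string::npos || close < open)
        call.arguments = item.substr(open + 1);
    else
        call.arguments = item.substr(open + 1, close - open - 1);
    return call;
}

// Split the right-hand side of an application definition at every colon
// that is not inside parentheses. Whitespace and line breaks are ignored.
std::vector<ModuleCall> parseApplication(const std::string& definition) {
    std::vector<ModuleCall> calls;
    std::string current;
    int depth = 0;
    for (char c : definition) {
        if (std::isspace(static_cast<unsigned char>(c))) continue;
        if (c == '(') ++depth;
        if (c == ')' && depth > 0) --depth;
        if (c == ':' && depth == 0) {
            if (!current.empty()) calls.push_back(makeCall(current));
            current.clear();
        } else {
            current += c;
        }
    }
    if (!current.empty()) calls.push_back(makeCall(current));
    return calls;
}
```

Applied to the lookup application of Figure 2, this yields six module calls, the first of which carries the parameter text $File=$1. Variable substitution is left out of the sketch.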

Components of the Shiraz system

In this section, we give a short overview of the main components that are involved in constructing an English translation from a Persian document. Using the mechanism just mentioned, an application is defined as a sequence of modules which are executed one after the other. The results of each component are gathered in the central chart and can be used by any other component. The translation process can be divided into five major steps:

Reading and preparing the input text
Morphological analysis and dictionary lookup
Syntactic parsing
Transfer
Generation and preparation of the target language output
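The sequential execution of modules over one shared chart can be sketched as follows. This is an assumption-laden illustration rather than the Meat code: SharedChart is a trivial stand-in for the real chart, and Module, Registry and runApplication are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Stand-in for the central chart: it only records what each module added.
struct SharedChart {
    std::vector<std::string> results;
};

typedef std::function<void(SharedChart&)> Module;
typedef std::map<std::string, Module> Registry;

// Execute the named modules one after the other on the same chart.
// Returns false as soon as a module name is not registered.
bool runApplication(const std::vector<std::string>& moduleNames,
                    const Registry& registry, SharedChart& chart) {
    for (const std::string& name : moduleNames) {
        const Registry::const_iterator it = registry.find(name);
        if (it == registry.end()) return false;
        assert(it->second);  // every registered module must be callable
        it->second(chart);   // results accumulate in the shared chart
    }
    return true;
}
```

Because every module receives the same chart, a later module can use whatever any earlier module has stored there, which mirrors the five steps listed above.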

Preparing the input text

The first step in preparing the input text for translation is performed by a Tokenizer, which reads an input file and splits it into separate items such as words, punctuation, and numbers. The input file is usually not in ASCII format, so a conversion from its encoding to Unicode has to be performed. The tokenizer is a generic Unicode tokenizer; it is not specialized for any language.

For the Persian case, we also added a PostTokenizer. The task of this component is to postprocess the Tokenizer output with respect to some peculiarities of Persian. In particular, detached affixes are reattached to their kernels.
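The reattachment step might look roughly like the C++ sketch below. The report does not specify how the PostTokenizer decides what counts as a detached affix, so the sketch simply assumes two caller-supplied sets of affix tokens. The function name reattachAffixes and the romanized sample tokens used to exercise it are illustrative assumptions, not data from the Shiraz system.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Merge detached affix tokens with the neighboring word token.
// A prefix token is joined to the token that follows it; a suffix token is
// joined to the token that precedes it. All other tokens pass through.
std::vector<std::string> reattachAffixes(const std::vector<std::string>& tokens,
                                         const std::set<std::string>& prefixes,
                                         const std::set<std::string>& suffixes) {
    std::vector<std::string> out;
    std::string pendingPrefix;
    for (const std::string& token : tokens) {
        if (!pendingPrefix.empty()) {
            out.push_back(pendingPrefix + token);
            pendingPrefix.clear();
        } else if (prefixes.count(token) > 0) {
            pendingPrefix = token;  // wait for the word it belongs to
        } else if (suffixes.count(token) > 0 && !out.empty()) {
            out.back() += token;
        } else {
            out.push_back(token);
        }
    }
    // A prefix at the very end has nothing to attach to; keep it as it is.
    if (!pendingPrefix.empty()) out.push_back(pendingPrefix);
    assert(out.size() <= tokens.size());
    return out;
}
```

With the assumed prefix set {"mi"} and suffix set {"ha"}, the token sequence mi, ravam, ketab, ha becomes miravam, ketabha. A real implementation would operate on Unicode strings and would need language-specific conditions beyond simple set membership.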

Morphological analysis and dictionary lookup

Before dictionary lookup can be performed, the inflected surface words need to be processed by a Morphological Analyzer. We use Samba [Zajac:98], a formalism that combines finite-state transducers with feature structures, to describe morphological properties of words. Figure 3 shows a simple rule that describes the suffix which marks the causative form of Persian verbs. For more information, see the web report on Persian Morphology.

CausativePastStem < GeneralRule;
CausativePastStem = 
< RegularPresentStem 
  <"|n" "d"> 
  [form.morph.infl: per.Form.VerbalInflection[
     causative: True]] 
>;

Figure 3: A simple morphological rule

The dictionary itself is based on citation forms. It contains approximately 50,000 entries. Dictionary Lookup takes the citation forms generated by the morphological analyzer and uses them to access lemma definitions in the lexicon. The inflectional information gained by morphology is then unified with the dictionary entry, yielding a rich description of the input word. For a more detailed description of the structure of the lexicon, see the web report on the Shiraz dictionary.
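The unification of inflectional information with a dictionary entry can be illustrated with a strongly simplified C++ sketch. It assumes untyped feature structures without reentrancy, which is much weaker than the typed formalism of [Carpenter:92] used in Shiraz. The names FS, makeAtom, withFeature, unify and lookup are hypothetical, as is the sample lexicon entry used to exercise the code.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Simplified feature structure: either an atomic value or a set of features.
// A structure with neither places no constraint. Types and reentrancy are
// deliberately left out.
struct FS {
    std::string atom;
    std::map<std::string, std::shared_ptr<const FS> > features;
};

FS makeAtom(const std::string& value) {
    FS fs;
    fs.atom = value;
    return fs;
}

FS withFeature(FS fs, const std::string& name, const FS& value) {
    fs.features[name] = std::make_shared<const FS>(value);
    return fs;
}

// Unify a and b into out. Returns false if the two structures conflict.
bool unify(const FS& a, const FS& b, FS& out) {
    if (a.atom.empty() && a.features.empty()) { out = b; return true; }
    if (b.atom.empty() && b.features.empty()) { out = a; return true; }
    if (!a.atom.empty() || !b.atom.empty()) {
        // Atoms unify only with an identical atom.
        if (a.atom != b.atom) return false;
        out = a;
        return true;
    }
    FS result = a;
    for (const auto& entry : b.features) {
        auto it = result.features.find(entry.first);
        if (it == result.features.end()) {
            result.features[entry.first] = entry.second;
            continue;
        }
        assert(it->second && entry.second);
        FS merged;
        if (!unify(*it->second, *entry.second, merged)) return false;
        it->second = std::make_shared<const FS>(merged);
    }
    out = result;
    return true;
}

// Look up a citation form and unify the entry with the inflectional
// information delivered by morphology.
bool lookup(const std::map<std::string, FS>& lexicon,
            const std::string& citationForm, const FS& inflection, FS& out) {
    const auto it = lexicon.find(citationForm);
    if (it == lexicon.end()) return false;
    return unify(it->second, inflection, out);
}
```

A lookup fails either when the citation form is not in the lexicon or when the morphological analysis contradicts the entry, for example when it claims a part of speech that differs from the one in the lexicon.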

Compounding is handled by the Compound Lookup component. Here, we are not looking for individual words in the dictionary, but rather consider any sequence of words as a potential compound. The compound lookup procedure is based on both citation forms and surface forms, since some compound parts are not words in their own right. We do not record the internal structure of compounds in the dictionary, but since Persian is a head-final language, we assume that the last element in a compound carries the most important inflectional information. The compound inherits this inflectional information, if possible.
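A hedged C++ sketch of the compound lookup idea follows. It simplifies the procedure described above in two ways that should be read as assumptions: a candidate sequence is tried once with all citation forms and once with all surface forms (not with mixtures), and inflection is a plain string. Word, CompoundEdge, findCompounds and the maxLength limit are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Word {
    std::string citation;    // citation form from morphological analysis
    std::string surface;     // form as it appears in the text
    std::string inflection;  // stand-in for inflectional features
};

struct CompoundEdge {
    std::size_t from;        // vertex before the first word
    std::size_t to;          // vertex after the last word
    std::string entry;       // dictionary key that matched
    std::string inflection;  // inherited from the last element
};

// Try every sequence of 2..maxLength adjacent words as a compound.
std::vector<CompoundEdge> findCompounds(const std::vector<Word>& words,
                                        const std::set<std::string>& compounds,
                                        std::size_t maxLength) {
    std::vector<CompoundEdge> found;
    for (std::size_t i = 0; i < words.size(); ++i) {
        std::string citationKey;
        std::string surfaceKey;
        for (std::size_t j = i; j < words.size() && j - i < maxLength; ++j) {
            if (j > i) { citationKey += ' '; surfaceKey += ' '; }
            citationKey += words[j].citation;
            surfaceKey += words[j].surface;
            if (j == i) continue;  // a single word is not a compound
            std::string key;
            if (compounds.count(citationKey) > 0) key = citationKey;
            else if (compounds.count(surfaceKey) > 0) key = surfaceKey;
            if (key.empty()) continue;
            assert(i < j + 1);
            // The last element supplies the inflectional information.
            found.push_back(CompoundEdge{i, j + 1, key, words[j].inflection});
        }
    }
    return found;
}
```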

Syntactic Parsing

The parser employed in Shiraz is a unification-based, bidirectional chart parser. Figure 4 shows a simple syntax rule for the composition of complex noun phrases. The rules are phrase structure rules and consist of a left-hand side, which describes the constituent being formed, and a right-hand side, which describes which subconstituents are used for the construction. Feature structures on both sides make it possible to formulate restrictions and to build up structure. The rules can be parametrized to allow for certain special situations. First, they can be marked as non-recursive, in which case they are not used to propose new categories more than once at the same position. Second, they can be marked to perform dictionary lookup. In this case, the left-hand side is considered to refer to a dictionary entry, and it is only constructed if there is an entry in the dictionary which matches the citation form built.


complexNP = per.Rule.Rule[
  lhs: per.Rule.NounPhrase[
     head: #np1,
     possessor: #np2],
  rhs: <:
   #np1= per.Rule.NounPhraseZero[
            boundary: per.Type.FalseOrUndefined]
    #np2= per.Rule.NounPhrase[
            head: per.Entry.Entry[form.morph.lex.pos: per.Type.NounHeads]]
  :>
];

Figure 4: A sample syntax rule

In the Shiraz system, we use three incarnations of the parser to perform different tasks. You can think of this as having a grammar with different levels, each of which is applied in sequence. These incarnations are:

The Auxiliary Verb Parser, which attaches auxiliary verbs to their main verb counterparts. In doing so, it merges their morphological information.
The Light Verb Parser. Persian shows a large number of light verbs, which are combinations of non-verbal words (nouns, adjectives, etc.) with semantically poor verbs (most often "do"). These combinations have non-compositional semantics, so they need to be lexicalized. The light verb parser combines the individual parts of a light verb and looks it up in the dictionary.
Finally, the Syntax Parser is used to construct syntactic constituents from individual words.

Transfer

The Transfer component is used to transform Persian syntactic structures to their English counterparts. Currently, we are only performing lexical transfer, i.e. the Persian morphological information is mapped to English inflectional features. Like all components within the Shiraz system, transfer is based on the chart notion. Incorporating syntactic transfer will make it possible to reuse partial translations within larger constructs (cf. [Amtrup:95]).

Generation and Surface Construction

Two components are involved in the final construction of English surface strings: The Syntactic Generator creates English fragments, and the Surface Generator searches for a suitable path through the fragments.

Syntactic generation currently uses a simple method of linearization of English words. There is no complex mechanism to generate surface strings from syntactic descriptions. A sample rule for the generator is shown in Figure 5.

np1 = [
  structure: per.Rule.NounPhrase[
    head: #1= Top,
    relClause: #2= Top],
  order: <: #1 #2 :>,
  trigger: "relClause"
];

Figure 5: A sample generation rule

The rule demonstrates the three elements present in a generation rule:

The structure defines what kind of syntactic structure can be handled by the rule.
The order defines in which surface order the underlying parts should be generated. Fixed strings can be inserted here as well, e.g. to mark the possessor in English with an additional "of".
The trigger restricts the application of rules. If present, then the feature marked by the trigger path has to be non-empty in order for the rule to be applicable.
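The interplay of order and trigger can be sketched in C++ as follows. The sketch leaves out the structure check entirely and represents the already generated sub-parts as strings keyed by feature name, so it should be read as an illustration of the linearization idea under these assumptions. OrderItem, GenerationRule and generate are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One element of the order list: either a reference to a feature of the
// structure or a fixed string that is inserted literally.
struct OrderItem {
    bool isFixedString;
    std::string value;  // feature name, or the literal text
};

struct GenerationRule {
    std::vector<OrderItem> order;
    std::string trigger;  // empty means the rule has no trigger
};

// Linearize the parts according to the rule. Returns false if the rule is
// not applicable because the triggering feature is missing or empty.
bool generate(const GenerationRule& rule,
              const std::map<std::string, std::string>& parts,
              std::string& out) {
    if (!rule.trigger.empty()) {
        const auto it = parts.find(rule.trigger);
        if (it == parts.end() || it->second.empty()) return false;
    }
    std::string result;
    for (const OrderItem& item : rule.order) {
        std::string piece;
        if (item.isFixedString) {
            piece = item.value;
        } else {
            const auto it = parts.find(item.value);
            if (it != parts.end()) piece = it->second;
        }
        if (piece.empty()) continue;  // nothing to emit for this item
        if (!result.empty()) result += ' ';
        result += piece;
    }
    out = result;
    return true;
}
```

For a rule with the order head, "of", possessor and the trigger possessor, the parts head = "the book" and possessor = "the teacher" are linearized as "the book of the teacher". The same rule does not apply to a noun phrase without a possessor.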

Apart from constructing surface strings from syntactic descriptions, a morphological generation procedure is performed during this phase. Thus, English words are generated with correct inflection.

The surface generation, finally, chooses the best path through the graph of generated English surface fragments and issues these as output. In the future, we plan to use an English language model to choose among the many possible surface strings. The string which is ranked best by the model will be issued.
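Choosing a path through the graph of surface fragments can be sketched as a shortest-path computation over an acyclic graph. The report does not say how the best path is defined, so the C++ sketch below assumes that each fragment carries a numeric cost and that the path with the lowest total cost wins. Fragment, bestPath and the cost field are assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>
#include <vector>

struct Fragment {
    std::size_t from;
    std::size_t to;    // edges point forward: from < to
    std::string text;  // generated English surface fragment
    double cost;       // assumed score; lower is better
};

// Find the cheapest sequence of fragments leading from vertex 0 to
// lastVertex. Returns false if no complete path exists.
bool bestPath(const std::vector<Fragment>& fragments, std::size_t lastVertex,
              std::vector<std::string>& out) {
    const double inf = std::numeric_limits<double>::infinity();
    const std::size_t none = fragments.size();
    std::vector<double> best(lastVertex + 1, inf);
    std::vector<std::size_t> via(lastVertex + 1, none);
    best[0] = 0.0;
    // All edges point forward, so visiting vertices in increasing order
    // guarantees that best[v] is final when v is expanded.
    for (std::size_t v = 0; v < lastVertex; ++v) {
        if (best[v] == inf) continue;
        for (std::size_t i = 0; i < fragments.size(); ++i) {
            const Fragment& f = fragments[i];
            if (f.from != v || f.to <= v || f.to > lastVertex) continue;
            if (best[v] + f.cost < best[f.to]) {
                best[f.to] = best[v] + f.cost;
                via[f.to] = i;
            }
        }
    }
    if (best[lastVertex] == inf) return false;
    out.clear();
    for (std::size_t v = lastVertex; v != 0; v = fragments[via[v]].from) {
        assert(via[v] != none);
        out.push_back(fragments[via[v]].text);
    }
    std::reverse(out.begin(), out.end());
    return true;
}
```

A language model, as planned for future versions, could supply the costs, although scoring whole strings rather than individual fragments would require a different search.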

System Statistics

The system is written entirely in C++ (with the exception of a small Java applet used to render Persian script for the glosser). It consists of approximately 27,000 lines of code. It can be run on both Unix platforms (using the GNU compiler) and PCs running Windows NT (using Visual C++). Translating a sentence of medium length and complexity (i.e., ambiguity) takes between 8 and 15 seconds.

Conclusion

Shiraz is a machine translation system for translating written Persian text into English. It is based on two main architectural foundations: the use of a chart throughout the system, which allows an integrated view of results created on all levels of linguistic description, and the use of a complex typed feature structure formalism, which unifies the view of the descriptions themselves.

Apart from a major renovation (the system was written in a short period of time, which led to some suboptimal solutions and left almost no time for optimization), the main components that could be added are a model of syntactic-semantic transfer and a more elaborate syntactic generation.

References

Amtrup, Jan W., 1995: Chart-based Incremental Transfer in Machine Translation. In Proceedings of the Sixth International Conference on Theoretical and Methodological Issues in Machine Translation. KU Leuven, July 1995, pp. 188-195.
Amtrup, Jan W., 1997: Layered Charts for Speech Translation. In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, TMI '97, Santa Fe, NM, July 1997, pp. 192-199.
Carpenter, Bob, 1992: The Logic of Typed Feature Structures. Cambridge Tracts in Theoretical Computer Science, Cambridge University Press, Cambridge, UK.
Kay, Martin, 1980: Algorithmic Schemata and Data Structures in Syntactic Processing. Technical Report CSL-80-12, Xerox Palo Alto Research Center.
Zajac, Remi, 1998: Feature Structures, Unification and Finite-State Transducers. In: FSMNLP'98, International Workshop on Finite State Methods in Natural Language Processing, Ankara, Turkey, 1998.
