Background on Interlingua
An interlingua is a notation for representing the content of a text that can mediate between source and target languages in machine translation. Interlinguas have a long history in machine translation (see, for example, the proceedings of the series of SIG-IL workshops at http://crl.nmsu.edu/Events/FWOI/index.html), and the PIs all have considerable experience building interlingua-based and other MT systems. The advantages of using an interlingua for translation are well known. First, because each language has its own independent analyzer mapping it into the interlingua and its own generator mapping it out of the interlingua, any number of source and target languages can be connected without writing explicit rules for each language pair and each direction. Thus, interlingual systems both save development time and reduce system size, especially for bi-directional multilingual systems involving more than two languages. Second, an intermediate language representation can provide a neutral basis of comparison for translation equivalents that differ syntactically.
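The size argument can be made concrete with a small count. The sketch below assumes a simplified comparison that is not spelled out in the text: an interlingual system needs one analyzer and one generator per language, while a direct pairwise design needs one rule module per ordered language pair. The function names are illustrative only.

```python
def interlingua_components(n_languages: int) -> int:
    """Assumed count: one analyzer plus one generator per language."""
    return 2 * n_languages


def pairwise_components(n_languages: int) -> int:
    """Assumed count: one rule module per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


if __name__ == "__main__":
    # Print the two counts side by side for a few system sizes.
    for n in (2, 3, 5, 10):
        print(n, interlingua_components(n), pairwise_components(n))
```

Under these assumptions the interlingual count grows linearly and the pairwise count quadratically; for ten languages the figures are 20 and 90.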

In spite of these advantages, interlingual machine translation has been used much less widely than transfer-based MT, example-based MT (EBMT), or, more recently, systems based on statistical methods, which have been gaining popularity in all areas of language technology. One reason is that there is no commonly accepted theory of interlingua, and the problem is too big to address from scratch in the life span of a typical research or development effort. We do not claim that a standard theory of interlingual representation would automatically increase the popularity of interlingual representations. In fact, a central difficulty is that any standard theory will have to account for the fact that different aspects of interlingua are relevant for different applications of MT. What is both necessary and feasible in the near term, however, is the development of a methodology for building interlingual representations. Such a methodology would involve interlingua theory, representation design, encoding/annotation manuals, actual text encoding/annotation, and evaluation measures. Guidelines for evaluating an interlingua would put boundaries on the problem and enable subsequent projects to gain the benefits of linguistic knowledge instead of being forced to abandon it.

Ideally, interlingual representations would have the following important properties:

Inter-coder compatibility: Two annotators, faced with the same piece of text, should be able to annotate it with compatible interlingual representations. Compatible encodings would not change meaning in a way that would cause system failure, but are not necessarily identical. For example, a natural language generator taking two compatible interlingua representations as input might produce the same output or two different but equally acceptable outputs. Compatibility is especially important in multi-site development efforts where, for example, a source language analyzer built in Italy might have to produce an interlingua that is compatible with a target language generator built in Korea.

Granularity and coverage that are appropriate to the application: For any given application, it is not necessary to represent every aspect of text meaning in the interlingua. An interlingual representation that is too deep will take a long time to develop, may not meet the criterion of inter-coder compatibility, and may be difficult to produce reliably with NLP software. Conversely, an interlingual representation that is not detailed enough will lose text meaning distinctions that are necessary for the application. It is always necessary to strike a balance in order to build a running system. Striking a balance does not mean sacrificing theoretical correctness. It can, for example, involve a detailed theory that allows underspecification for non-critical details.

To be well specified and most useful, these properties of interlingua require three distinct but related enterprises:

The representational formalism, which involves issues such as whether or not a phenomenon of meaning should be represented by a simple slot-filler pair, or, instead, have scope over a larger unit of representation, and where in relation to other representational units the phenomenon typically fits.

The representation content (structures, terms and symbols), which involves issues such as whether or not the values representing the phenomenon are discrete, if so, which actual representation symbols to use and, if not, how the continuum is represented, how the values are determined, and what the relationship is between the symbols and lexical item definitions. Ideally, the various essential resources—a collection of all allowed semantic symbols, etc.—are also provided.

Examples of representation tied to actual text, and of annotation methodology. Naturally, these are most useful when accompanied by detailed manuals describing decision procedures, by examples in various languages, and by associated tools that support the manual or automated creation of interlingua representations.

Developing such representations and supporting knowledge resources is not trivial, especially because there is often no obviously correct representation for some phenomenon. To ensure the success of such an enterprise, we rely on three strengths unique to the team of participants in this proposal: (1) Agreement on a clear methodology for arriving at decisions regarding the interlingua, as indicated by the annotation experiments conducted for recent SIG-IL (Special Interest Group on Interlinguas) workshops (Habash 2002; Habash 2003); (2) Complementary research foci and synergy among the participants, as exemplified by a solid history of successful cross-site collaborations on the Pangloss MT project (LTI, CRL, ISI) (Farwell et al. 1994), the Nitrogen natural language generator (ISI and UMIACS), the Mikrokosmos and Omega ontologies (ISI and CRL), the 2002 Johns Hopkins Summer Workshop on Generation for Machine Translation (UMIACS and Columbia), and the three workshops of the SIG-IL that have been held since 1998; (3) Successful experience in working together and producing consistent results in our current ITR NSF grant #IIS-0326553.


Interlingua Description and Representation
Recognizing the complexity of interlinguas, we adopt an incrementally deepening approach, which allows us to produce some quantity of relatively stable annotations while exploring alternatives at the next level down. Moving the annotation from surface form to deep semantic representation, we currently identify three levels of representation, referred to as IL0, IL1, and IL2. Each level of representation incorporates additional semantic features and removes existing syntactic ones. Throughout, we make as much use of automated procedures as possible.

IL0 is a deep syntactic dependency representation, constructed by hand-correcting the output of a dependency parser. This parser outputs a variant intermediate between the analytical and tectogrammatical levels of the Prague School (Hajic et al. 2001). IL0 includes part-of-speech tags for words and a parse tree that makes explicit the syntactic predicate-argument structure of verbs. The parse tree is labeled with syntactic categories such as Subject or Object, which refer to deep-syntactic grammatical function (normalized for voice alternations). IL0 does not contain function words (determiners, auxiliaries, etc.), but encodes their contributions as features. Semantically void punctuation is removed. Though this representation is purely syntactic, many disambiguation decisions have already been made (e.g., relative clause and PP attachment), and the representation abstracts as much as possible from surface-syntactic phenomena. By allowing annotators to see how textual units relate syntactically when making semantic judgments, IL0 is a useful starting point for semantic annotation at IL1.
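As an illustration of the properties just listed, the following is a minimal sketch of how an IL0-style tree might be held in memory. The class name `IL0Node`, the field names, the tag set, and the feature labels are all assumptions made for this example; the text does not specify the project's actual file or data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IL0Node:
    """One content word in a hypothetical IL0-style dependency tree."""
    lemma: str
    pos: str                          # part-of-speech tag (tag set assumed)
    relation: Optional[str] = None    # deep-syntactic function, e.g. "Subject"
    # Contributions of dropped function words, stored as features.
    features: Dict[str, str] = field(default_factory=dict)
    children: List["IL0Node"] = field(default_factory=list)

    def add(self, child: "IL0Node") -> "IL0Node":
        """Attach a dependent and return self so calls can be chained."""
        self.children.append(child)
        return self


# "The book was bought by Mary": the passive is normalized, so "Mary" is
# the deep Subject and "book" the deep Object; "the", "was" and "by"
# survive only as features (feature names are illustrative).
root = IL0Node("buy", "V", features={"tense": "past", "voice": "passive"})
root.add(IL0Node("Mary", "PN", relation="Subject"))
root.add(IL0Node("book", "N", relation="Object", features={"definite": "yes"}))
```

Keeping the function-word information as features rather than nodes is what lets the tree contain content words only while still recording voice, tense, and definiteness.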

IL1 is an intermediate semantic representation. It associates lexical units such as nouns, adjectives, adverbs and verbs with semantic concepts drawn from an ontology of symbols (see below and Section 3.3). It also replaces the syntactic relations in IL0, such as Subject and Object, with thematic roles, such as agent, theme and goal. Thus, like PropBank (Kingsbury et al. 2002), IL1 neutralizes different alternations for argument realization. However, IL1 is not an interlingua; it does not normalize over all linguistic realizations of the same semantics. In particular, it does not address how the meanings of individual lexical units combine to form the meaning of a phrase or clause. It also does not address idioms, metaphors and other non-literal uses of language, and does not assign semantic features to prepositions. Though some aspects of IL1 remain to be fleshed out, we have created IL1 annotations for our test corpus.
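The two changes IL1 makes to IL0 (attaching concepts to content words and replacing syntactic relations with thematic roles) can be sketched as a simple lookup. The tables `ROLE_MAP` and `CONCEPTS`, the concept identifiers, and the function `to_il1` are hypothetical stand-ins invented for this example; they are not taken from the project's ontology or annotation tools.

```python
from typing import Dict, List, Tuple

# Hypothetical per-verb mapping from deep-syntactic function to thematic role.
ROLE_MAP: Dict[str, Dict[str, str]] = {
    "buy": {"Subject": "agent", "Object": "theme"},
    "receive": {"Subject": "goal", "Object": "theme"},
}

# Hypothetical concept identifiers standing in for ontology symbols.
CONCEPTS: Dict[str, str] = {
    "buy": "BUY-1",
    "receive": "RECEIVE-1",
    "Mary": "PERSON",
    "book": "BOOK-1",
}


def to_il1(verb: str, args: List[Tuple[str, str]]) -> dict:
    """Convert (relation, lemma) pairs for one verb into role/concept pairs."""
    roles = ROLE_MAP[verb]
    return {
        "concept": CONCEPTS[verb],
        "args": [
            {"role": roles[rel], "concept": CONCEPTS[lemma], "lemma": lemma}
            for rel, lemma in args
        ],
    }


bought = to_il1("buy", [("Subject", "Mary"), ("Object", "book")])
received = to_il1("receive", [("Subject", "Mary"), ("Object", "book")])
```

In this toy table the same Subject relation maps to agent for one verb and to goal for the other, which is the kind of distinction a purely syntactic label cannot express.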

IL2, which is in its design stage, is intended to be an interlingua, albeit a relatively simple one. As a representation of meaning that is (reasonably) independent of language, IL2 will capture similarities in meaning across languages and across different lexical/syntactic realizations within a language. For example, IL2 is expected to normalize over conversives (e.g., X bought a book from Y vs. Y sold a book to X), as does FrameNet (Baker et al. 1998), and over non-literal language usage (e.g., X started its business vs. X opened its doors to customers). The exact definition of IL2, as well as annotation manuals and associated resources, will be a major research contribution of this project.
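Since IL2 is still being designed, the following is only a toy sketch of what normalizing over conversives could look like, not a description of the planned representation. The table `CONVERSIVES`, the event label, and the participant names are assumptions introduced for this example.

```python
from typing import Dict, Tuple

# Hypothetical table: verb -> (shared event label,
#                              verb-specific role -> shared participant name).
CONVERSIVES: Dict[str, Tuple[str, Dict[str, str]]] = {
    "buy": ("COMMERCIAL-TRANSFER",
            {"agent": "buyer", "source": "seller", "theme": "goods"}),
    "sell": ("COMMERCIAL-TRANSFER",
             {"agent": "seller", "goal": "buyer", "theme": "goods"}),
}

Normalized = Tuple[str, Tuple[Tuple[str, str], ...]]


def normalize(verb: str, args: Dict[str, str]) -> Normalized:
    """Map a verb and its role fillers onto a verb-independent event."""
    event, role_map = CONVERSIVES[verb]
    participants = {role_map[role]: filler for role, filler in args.items()}
    # Sort so that the result does not depend on argument order.
    return event, tuple(sorted(participants.items()))


# "X bought a book from Y" vs. "Y sold a book to X"
bought = normalize("buy", {"agent": "X", "theme": "book", "source": "Y"})
sold = normalize("sell", {"agent": "Y", "theme": "book", "goal": "X"})
```

Both sentences reduce to the same event with the same buyer, seller, and goods, so `bought == sold` holds even though the verbs and their role assignments differ.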

To progress from IL0 to IL1, annotators select semantic terms (concepts that represent particular senses of words) for the nouns, verbs, adjectives, and adverbs in each sentence. These terms are represented as concepts in the 110,000-node ontology Omega (Philpot et al. 2003). Still under construction at ISI, Omega has been assembled semi-automatically from a variety of sources, including Princeton’s WordNet (Fellbaum 1998), New Mexico State University’s Mikrokosmos (Mahesh and Nirenburg 1995), ISI’s Upper Model (Bateman et al. 1989) and ISI’s SENSUS (Knight and Luk 1994). After the uppermost region of Omega was created by hand, these various resources’ contents were incorporated and, to some extent, reconciled. After that, several million instances of people, locations, and other facts were added (Fleischman et al. 2003). The ontology, which has been used in several projects in recent years (Hovy et al. 2001), can be browsed using the DINO browser at http://blombos.isi.edu:8000/dino; this browser forms a part of the annotation environment. Omega remains under continued development and extension.

In addition to its semantic sense, each verb in Omega is assigned one or more theta grids specifying the arguments associated with the verb and their theta roles (or thematic roles). Theta roles are abstractions of deep semantic relations that generalize over verb classes. They are by far the most common approach in the field to representing predicate-argument structure. However, there are numerous variations, with little agreement even on terminology (Fillmore 1968; Stowell 1981; Jackendoff 1972; Levin and Rappaport-Hovav 1998). The theta grids used in our project were extracted from the Lexical Conceptual Structure Verb Database (LVD) (Dorr 2001). The WordNet senses assigned to each entry in the LVD were then used to link the theta grids to the verbs in the Omega ontology. In addition to the theta roles, the theta grids specify syntactic realization information, such as Subject, Object or Prepositional Phrase, and the Obligatory/Optional nature of each argument. The set of theta roles used, although based on research in LCS-based MT (Dorr 1993; Habash et al. 2002), has been simplified for this project. This list was used in the Interlingua Annotation Experiment 2002 (Habash and Dorr 2002). [1]
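The information a theta grid carries (a role, its syntactic realization, and whether the argument is obligatory) can be sketched as a small data structure. The class names, the field names, and the sample entry for "buy" are invented for illustration and are not copied from the LVD or from Omega.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class ThetaArg:
    role: str          # thematic role, e.g. "agent"
    realization: str   # syntactic realization: "Subject", "Object" or "PP"
    obligatory: bool   # True for Obligatory, False for Optional


@dataclass(frozen=True)
class ThetaGrid:
    verb: str
    args: Tuple[ThetaArg, ...]

    def missing_obligatory(self, realized_roles: Iterable[str]) -> List[str]:
        """Return obligatory roles that are absent from an annotation."""
        present = set(realized_roles)
        return [a.role for a in self.args
                if a.obligatory and a.role not in present]


# Hypothetical grid for one sense of "buy".
BUY_GRID = ThetaGrid("buy", (
    ThetaArg("agent", "Subject", True),
    ThetaArg("theme", "Object", True),
    ThetaArg("source", "PP", False),
))
```

A check such as `BUY_GRID.missing_obligatory(["agent"])` would then flag an annotation that omits the theme, which is one way the obligatory/optional marking could be put to use during annotation.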

[1] Other contributors to this list are Dan Gildea and Karin Kipper Schuler.

 
