Background on Interlingua
An interlingua is a notation representing
the content of text that can be used to mediate between source and
target languages in machine translation. Interlinguas have a long
history in machine translation (see, for example, the proceedings of the
series of SIG-IL workshops at http://crl.nmsu.edu/Events/FWOI/index.html),
and the PIs all have considerable experience building interlingua-based
and other MT systems. The advantages of using an interlingua for translation
are well known. First, because each language has its own independent
analyzer mapping it into the interlingua and generator mapping it
out of the interlingua, any number of source and target languages
can be connected without having to write explicit rules for each language
pair and each direction. Thus, interlingual systems both save development
time and reduce system size, especially for bi-directional multilingual
systems involving more than two languages. Second, an intermediate
language representation can provide a neutral basis of comparison
for translation equivalents that differ syntactically.
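The first advantage can be illustrated with a simple count. The sketch below assumes one module per analyzer, per generator, and per ordered language pair, which is a simplification; the function names are illustrative and do not come from any system described here.

```python
def interlingua_modules(n_languages: int) -> int:
    """One analyzer plus one generator per language."""
    return 2 * n_languages


def pairwise_modules(n_languages: int) -> int:
    """One rule set per ordered (source, target) pair of distinct languages."""
    return n_languages * (n_languages - 1)


# Columns: number of languages, interlingual modules, pairwise modules.
for n in (2, 3, 5, 10):
    print(n, interlingua_modules(n), pairwise_modules(n))
```

Under this simplified count the two designs break even at three languages; beyond that, the interlingual design grows linearly with the number of languages while the pairwise design grows quadratically.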
In spite of these advantages, interlingual machine
translation has been used much less widely than transfer-based MT,
EBMT, or, more recently, systems based on statistical methods, which
have been gaining popularity in all areas of language technology.
One reason for this situation is that there is no commonly accepted
theory of interlingua, and the problem is too big to address from
scratch in the life span of a typical research or development effort.
We do not claim that a standard theory of interlingual representation
would automatically increase the popularity of interlingual representations.
Indeed, a central difficulty is that any standard theory
will have to account for the fact that different aspects of interlingua
are relevant for different applications of MT. What is both
necessary and feasible in the near term, however, is the development of a methodology
for building interlingual representations. Such a methodology would
involve interlingua theory, representation design, encoding/annotation
manuals, actual text encoding/annotation, and evaluation measures.
Guidelines for evaluating an interlingua would put boundaries on the
problem and enable subsequent projects to gain the benefits of linguistic
knowledge instead of being forced to abandon it.
Ideally, interlingual representations would have
the following important properties:
Inter-coder compatibility: Two annotators, faced
with the same piece of text, should be able to annotate it with compatible
interlingual representations. Compatible encodings need not be
identical, but they must not differ in meaning in a way that would cause
system failure. For example, a natural language generator taking two compatible
interlingua representations as input might produce the same output
or two different but equally acceptable outputs. Compatibility is
especially important in multi-site development efforts where, for
example, a source language analyzer built in Italy might have to produce
an interlingua that is compatible with a target language generator
built in Korea.
Granularity and coverage that are appropriate to
the application: For any given application, it is not necessary to
represent every aspect of text meaning in the interlingua. An interlingual
representation that is too deep will take a long time to develop,
may not meet the criterion of inter-coder compatibility, and may be
difficult to produce reliably with NLP software. Conversely, an interlingual
representation that is not detailed enough will lose text meaning
distinctions that are necessary for the application. It is always
necessary to strike a balance in order to build a running system.
Striking a balance does not mean sacrificing theoretical correctness.
It can, for example, involve a detailed theory that allows underspecification
for non-critical details.
To be well specified and most useful, these properties
of interlingua require three distinct but related enterprises:
The representational formalism, which involves issues
such as whether or not a phenomenon of meaning should be represented
by a simple slot-filler pair, or, instead, have scope over a larger
unit of representation, and where in relation to other representational
units the phenomenon typically fits.
The representation content (structures, terms and
symbols), which involves issues such as whether or not the values
representing the phenomenon are discrete; if so, which actual representation
symbols to use; if not, how the continuum is represented; how
the values are determined; and what the relationship is between the
symbols and lexical item definitions. Ideally, the various essential
resources (a collection of all allowed semantic symbols, etc.) are
also provided.
Examples of representation tied to actual text, and
of annotation methodology. Naturally, these are most useful with detailed
manuals describing decision procedures, with examples in various languages,
and with associated tools that support the manual or automated creation
of interlingua representations.
Developing such representations and supporting knowledge
resources is not trivial, especially because there is often no obviously
correct representation for some phenomenon. To ensure the success
of such an enterprise, we rely on three strengths unique to the team
of participants in this proposal: (1) Agreement on a clear methodology
for arriving at decisions regarding the interlingua, as indicated
by the annotation experiments conducted for recent SIG-IL (Special
Interest Group on Interlinguas) workshops (Habash 2002; Habash 2003);
(2) Complementary research foci and synergy among the participants,
as exemplified by a solid history of successful cross-site collaborations
on the Pangloss MT project (LTI, CRL, ISI) (Farwell et al. 1994),
the Nitrogen natural language generator (ISI and UMIACS), the Mikrokosmos
and Omega ontologies (ISI and CRL), the 2002 Johns Hopkins Summer
Workshop on Generation for Machine Translation (UMIACS and Columbia),
and the three workshops of the SIG-IL that have been held since 1998;
(3) Successful experience in working together and producing consistent
results in our current ITR NSF grant #IIS-0326553.
Interlingua Description and Representation
Recognizing the complexity of interlinguas,
we adopt an incrementally deepening approach, which allows us to produce
some quantity of relatively stable annotations while exploring alternatives
at the next level down. Moving the annotation from surface form to
deep semantic representation, we currently identify three levels of
representation, referred to as IL0, IL1, and IL2. Each level of representation
incorporates additional semantic features and removes existing syntactic
ones. Throughout, we make as much use of automated procedures as possible.
IL0 is a deep syntactic dependency representation,
constructed by hand-correcting the output of a dependency parser.
This parser outputs a variant intermediate between the analytical
and tectogrammatical levels of the Prague School (Hajic et al. 2001).
IL0 includes part-of-speech tags for words and a parse tree that makes
explicit the syntactic predicate-argument structure of verbs. The
parse tree is labeled with syntactic categories such as Subject or
Object, which refer to deep-syntactic grammatical function (normalized
for voice alternations). IL0 does not contain function words (determiners,
auxiliaries, etc.), but encodes their contributions as features. Semantically
void punctuation is removed. Though this representation is purely
syntactic, many disambiguation decisions have been made (e.g., relative
clause and PP attachment) and the representation abstracts as much as
possible from surface-syntactic phenomena. By allowing annotators
to see how textual units relate syntactically when making semantic
judgments, IL0 is a useful starting point for semantic annotation
at IL1.
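As a rough illustration of the properties just listed, an IL0-style tree could be held in a structure like the one below. This is a minimal sketch only: the class, its field names, the feature names, and the example sentence are our assumptions for illustration, not the project's actual annotation format.

```python
from dataclasses import dataclass, field


@dataclass
class IL0Node:
    """One content word in a hypothetical IL0-style dependency tree."""
    lemma: str
    pos: str                                        # part-of-speech tag
    relation: str = "Root"                          # deep-syntactic function
    features: dict = field(default_factory=dict)    # contributions of dropped function words
    children: list = field(default_factory=list)


# "The book was bought by Mary." The function words (the, was, by) are not
# nodes; their contributions are recorded as features. The passive is
# normalized, so "Mary" is the deep Subject and "book" the deep Object.
tree = IL0Node(
    lemma="buy",
    pos="Verb",
    features={"tense": "past", "voice": "passive"},
    children=[
        IL0Node("Mary", "ProperNoun", relation="Subject"),
        IL0Node("book", "Noun", relation="Object", features={"definite": True}),
    ],
)


def arguments(node: IL0Node) -> dict:
    """Map each grammatical function under a node to the lemma that fills it."""
    return {child.relation: child.lemma for child in node.children}


print(arguments(tree))
```

An active sentence such as "Mary bought the book" would yield the same Subject and Object assignments under this sketch, which is the point of normalizing for voice.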
IL1 is an intermediate semantic representation. With
lexical units like nouns, adjectives, adverbs and verbs, it associates
semantic concepts, drawn from an ontology of symbols (see below and
Section 3.3). It also replaces the syntactic relations in IL0, such
as Subject and Object, with thematic roles, such as agent, theme and
goal. Thus, like PropBank (Kingsbury et al. 2002), IL1 neutralizes
different alternations for argument realization. However, IL1 is not
an interlingua; it does not normalize over all linguistic realizations
of the same semantics. In particular, it does not address how the
meanings of individual lexical units combine to form the meaning of
a phrase or clause. It also does not address idioms, metaphors and
other non-literal uses of language, and does not assign semantic features
to prepositions. Though some aspects of IL1 remain to be fleshed out,
we have created IL1 annotations for our test corpus.
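The step from syntactic relations to thematic roles can be sketched as a relabeling. In the example below, the relation labels "IndirectObject" and "PP(to)", the role table, and the concept names are all hypothetical placeholders chosen for illustration; they are not entries from the ontology or from our annotation manuals.

```python
# Hypothetical mapping from deep-syntactic function to thematic role for one verb.
ROLE_MAP = {
    "give": {
        "Subject": "agent",
        "Object": "theme",
        "IndirectObject": "goal",
        "PP(to)": "goal",
    },
}

# Invented concept labels standing in for ontology symbols.
CONCEPTS = {"give": "GIVE", "Mary": "PERSON", "John": "PERSON", "book": "BOOK"}


def to_il1(verb: str, il0_args: dict) -> dict:
    """Relabel syntactic arguments with thematic roles and attach concepts."""
    roles = ROLE_MAP[verb]
    return {
        "concept": CONCEPTS[verb],
        "roles": {roles[rel]: (lemma, CONCEPTS[lemma]) for rel, lemma in il0_args.items()},
    }


# "Mary gave John a book" vs. "Mary gave a book to John"
double_object = to_il1("give", {"Subject": "Mary", "IndirectObject": "John", "Object": "book"})
prepositional = to_il1("give", {"Subject": "Mary", "Object": "book", "PP(to)": "John"})

print(double_object == prepositional)
```

Both realizations receive the same role assignment here, which illustrates what it means to neutralize alternations in argument realization.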
IL2, which is in its design stage, is intended to
be an interlingua, albeit a relatively simple one. As a representation
of meaning that is (reasonably) independent of language, IL2 will
capture similarities in meaning across languages and across different
lexical/syntactic realizations within a language. For example, IL2
is expected to normalize over conversives (e.g., X bought a book from
Y vs. Y sold a book to X), as does FrameNet (Baker et al. 1998), and over
non-literal language usage (e.g., X started its business vs. X opened
its doors to customers). The exact definition of IL2, as well as annotation
manuals and associated resources, will be a major research contribution
of this project.
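Because IL2 is still being designed, the following is only an illustration of what normalizing over conversives could look like, not a description of IL2 itself. The event label, the participant names (buyer, seller, goods), and the role names assigned to each verb are assumptions made for this sketch.

```python
# Hypothetical table: each verb maps to a shared event label plus a renaming
# of its own thematic roles into participants of that event.
CONVERSIVES = {
    "buy": ("COMMERCIAL-TRANSFER", {"agent": "buyer", "source": "seller", "theme": "goods"}),
    "sell": ("COMMERCIAL-TRANSFER", {"agent": "seller", "goal": "buyer", "theme": "goods"}),
}


def normalize(verb: str, roles: dict) -> dict:
    """Rewrite a verb-specific role assignment in terms of the shared event."""
    event, renaming = CONVERSIVES[verb]
    normalized = {"event": event}
    for role, filler in roles.items():
        normalized[renaming[role]] = filler
    return normalized


# "X bought a book from Y" vs. "Y sold a book to X"
bought = normalize("buy", {"agent": "X", "theme": "book", "source": "Y"})
sold = normalize("sell", {"agent": "Y", "theme": "book", "goal": "X"})

print(bought == sold)
```

Under these assumptions the two sentences receive identical representations even though their verbs and thematic role assignments differ.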
To progress from IL0 to IL1, annotators select semantic
terms (concepts that represent particular senses of words) for the
nouns, verbs, adjectives, and adverbs in each sentence. These terms
are represented as concepts in the 110,000-node ontology Omega (Philpot
et al. 2003). Still under construction at ISI, Omega has been assembled
semi-automatically from a variety of sources, including Princeton’s
WordNet (Fellbaum 1998), New Mexico State University’s Mikrokosmos
(Mahesh and Nirenburg 1995), ISI’s Upper Model (Bateman et al.
1989) and ISI’s SENSUS (Knight and Luk 1994). After the uppermost
region of Omega was created by hand, these various resources’
contents were incorporated and, to some extent, reconciled. After
that, several million instances of people, locations, and other facts
were added (Fleischman et al. 2003). The ontology, which has been
used in several projects in recent years (Hovy et al. 2001), can be
browsed using the DINO browser at http://blombos.isi.edu:8000/dino;
this browser forms a part of the annotation environment. Omega remains
under continued development and extension.
In addition to its semantic sense, each verb in Omega
is assigned one or more theta grids specifying the arguments associated
with a verb and their theta roles (or thematic roles). Theta roles
are abstractions of deep semantic relations that generalize over verb
classes. They are by far the most common approach in the field to
representing predicate-argument structure. However, there are numerous
variations with little agreement even on terminology (Fillmore 1968;
Stowell 1981; Jackendoff 1972; Levin and Rappaport-Hovav 1998). The
theta grids used in our project were extracted from the Lexical Conceptual
Structure Verb Database (LVD) (Dorr 2001). The WordNet senses assigned
to each entry in the LVD were then used to link the theta grids to
the verbs in the Omega ontology. In addition to the theta roles, the
theta grids specify syntactic realization information, such as Subject,
Object or Prepositional Phrase, and the Obligatory/Optional nature
of the argument. The set of theta roles used, although based on research
in LCS-based MT (Dorr 1993; Habash et al. 2002), has been simplified
for this project. This list was used in the Interlingua Annotation
Experiment 2002 (Habash and Dorr 2002).1
1 Other contributors to this list are Dan Gildea
and Karin Kipper Schuler.
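The information carried by a theta grid, as described above, can be pictured as a list of slots, each with a role, a syntactic realization, and an obligatory/optional flag. The sketch below is illustrative only: the class name, field names, and the example grid are assumptions and are not taken from the LVD or from Omega.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class ThetaSlot:
    role: str          # thematic role, e.g. "agent"
    realization: str   # Subject, Object, or Prepositional Phrase
    obligatory: bool   # True if the argument must be expressed


# Hypothetical grid for one verb sense; not an actual LVD entry.
EXAMPLE_GRID = [
    ThetaSlot("agent", "Subject", True),
    ThetaSlot("theme", "Object", True),
    ThetaSlot("goal", "Prepositional Phrase", False),
]


def missing_obligatory(grid: List[ThetaSlot], realized: Set[str]) -> List[str]:
    """List obligatory roles whose syntactic realization is absent from a clause."""
    return [
        slot.role
        for slot in grid
        if slot.obligatory and slot.realization not in realized
    ]


# A clause with only a Subject lacks the obligatory theme; the optional goal
# is not reported.
print(missing_obligatory(EXAMPLE_GRID, {"Subject"}))
```

A verb with several theta grids would simply carry a list of such grids, one per argument pattern.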