Goals
We propose research to develop
and test a well-defined, well-motivated, and practical level of representation
that captures semantic information from natural language text. Because
the representation will be motivated by and tested against six languages,
we refer to it as an “interlingual representation.” This
research will help provide the basis for a paradigmatic shift in natural
language processing (NLP), enabling corpus-based research as well as
linguistic research into language-independent meaning representations
in applications such as machine translation, question answering, text
summarization, and information retrieval. The novelty of the research
comes not only from the interlingual representation itself, but also
from improved methodologies for designing and evaluating such representations.
We conceptualize the research as developing a series of transitional
steps from surface representation to deep semantic representation (IL0,
IL1, IL2), with rules for transitioning from one to the next, at each
stage ensuring intercoder reliability. So far, we have created IL0 and
IL1, and have been gathering experience toward specifying IL2. In the
next stage, we will proceed from IL1 to IL2, and possibly further, in
a number of ways. First, we will move from lexical semantic content
to information about the actual events, objects, and properties referred
to in the text. Second, we will create for this information a well-defined
representation language (the interlingual representation structure).
Third, extending IL2, we will investigate the appropriate coding of
temporal, modal, and aspectual information from text. Finally, we will
integrate information that relates the participants to an event and
also events to each other. At each stage, we will verify the intercoder
reliability of the coding schema, and code a sizeable multilingual corpus
with the additional information.
Objectives
We will continue to develop an interlingual
representation framework based on a careful study of the text corpora
in six languages already collected, and their translations. Building
upon the representation frameworks currently in use or under development,
the framework will include a formal definition of the representation
language along with coding manuals for the main components of meaning
(lexical semantic concepts, events and objects, time, aspect, modality,
etc.). An important part of this work will involve reducing ambiguity
and vagueness in the large ontological specification of meanings.
We will annotate these bilingual corpora using the
agreed-upon interlingual representation. This effort will also include
a straightforward extension of those corpora as needed, without further
research being required.
We will extend our current set of annotation tools
(a tree editor, annotation interface, etc.), and build and deploy for
the whole group new ones as required. Our current tools enable effective
and relatively problem-free annotation at the six sites and subsequent
merging of the results.
We will design new metrics and conduct evaluations
of the interlingual representations, both to measure annotator agreement and
to choose a granularity of meaning representation that is appropriate
for a given task. Our current metrics, based on inter-annotator reliability,
will be augmented to consider also the growth rate of the interlingual
representation, the ability to handle translation divergences, and the quality
of the target-language text that can be generated from the interlingua.
We will also closely examine, on a case-by-case basis, the interaction
between reliable coding of the univocal semantics of the text and legitimate
differences in understanding/interpretation as indicated by alternative
translations.
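As an illustration of what an inter-annotator reliability metric can look like, the sketch below computes Cohen's kappa for two annotators who label the same items. This is only an assumed example: the text above does not say which agreement statistic the project uses, and the function name `cohen_kappa` and the labels `EVENT`, `OBJECT`, and `PROPERTY` are hypothetical, chosen purely for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Hypothetical helper for illustration; not the project's actual metric.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both annotators pick the same label
    # if each draws independently from their own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four text spans, one list per annotator.
annotator_1 = ["EVENT", "OBJECT", "EVENT", "PROPERTY"]
annotator_2 = ["EVENT", "OBJECT", "OBJECT", "PROPERTY"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. A statistic of this kind corrects raw agreement for how often annotators would coincide by accident, which matters when a few labels dominate the coding scheme.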
Corpus
The target data set is modeled on, and is an extension of,
the DARPA MT Evaluation data set (White and O’Connell, 1994) and
includes data from the Linguistic Data Consortium (LDC) Multiple Translation
Arabic, Part 1 (Walker et al., 2003). The data set consists of 6 bilingual
comparable corpora (Japanese, Korean, Hindi, Arabic, French, and Spanish).
Each corpus currently contains 5 source language news articles (125
by the end of the project), each with either two or three independently
produced high-quality translations into English, each around 300 words
in length. Since the source news articles for each individual language
corpus are different, the 6 corpora are comparable rather than parallel.
Ultimately, each corpus will have between 150,000 and 200,000 words,
for a total of some 1,000,000 words. Thus, for any given corpus, the
annotation effort is to assign interlingua annotations to a set of 3
or 4 parallel texts, 2 or 3 of which are in the same language, English,
and all of which supposedly communicate the same information.