Goals
We propose research to develop and test a well-defined, well-motivated, and practical level of representation that captures semantic information from natural language text. Because the representation will be motivated by and tested against six languages, we refer to it as an “interlingual representation.” This research will help provide the basis for a paradigm shift in natural language processing (NLP), enabling corpus-based research as well as linguistic research into language-independent meaning representations in applications such as machine translation, question answering, text summarization, and information retrieval. The novelty of the research comes not only from the interlingual representation itself, but also from improved methodologies for designing and evaluating such representations.

We conceptualize the research as developing a series of transitional steps from surface representation to deep semantic representation (IL0, IL1, IL2), with rules for transitioning from one to the next, ensuring intercoder reliability at each stage. So far, we have created IL0 and IL1, and have been gathering experience toward specifying IL2. In the next stage, we will proceed from IL1 to IL2, and possibly further, in a number of ways. First, we will move from lexical semantic content to information about the actual events, objects, and properties referred to in the text. Second, we will create for this information a well-defined representation language (the interlingual representation structure). Third, extending IL2, we will investigate the appropriate coding of temporal, modal, and aspectual information from text. Finally, we will integrate information that relates the participants to an event and also events to each other. At each stage, we will verify the intercoder reliability of the coding schema, and code a sizeable multilingual corpus with the additional information.

Objectives
We will continue to develop an interlingual representation framework based on a careful study of the text corpora in six languages already collected, and their translations. Building upon the representation frameworks currently in use or under development, the framework will include a formal definition of the representation language along with coding manuals for the main components of meaning (lexical semantic concepts, events and objects, time, aspect, modality, etc.). An important part of this work will involve reducing ambiguity and vagueness in the large ontological specification of meanings.

We will annotate these bilingual corpora using the agreed-upon interlingual representation. This effort will also include a straightforward extension of those corpora as needed, without further research being required.

We will extend our current set of annotation tools (a tree editor, annotation interface, etc.), and build and deploy new ones for the whole group as required. Our current tools enable effective and relatively problem-free annotation at the six sites and subsequent merging of the results.

We will design new metrics and conduct various evaluations of the interlingual representations, both to measure annotator agreement and to choose a granularity of meaning representation that is appropriate for a given task. Our current metrics, based on inter-annotator reliability, will be augmented to consider also the growth rate of the interlingual representation, the ability to handle translation divergences, and the quality of the target-language text that can be generated from the interlingua. We will also examine closely, on a case-by-case basis, the interaction between reliable coding of the univocal semantics of the text and legitimate differences in understanding/interpretation as indicated by alternative translations.
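As an illustration of what an inter-annotator reliability metric can look like, the following is a minimal sketch of Cohen's kappa for two annotators. It is an assumption for illustration only: the text above does not say which agreement statistic the project uses, and the function name `cohen_kappa` and the concept labels (`MOVE`, `SAY`, `BUY`) are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Illustrative only: the project's actual agreement metric is not
    specified in the text, so this is a generic stand-in.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two
    # annotators' marginal proportions, summed over labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical concept labels assigned to six text spans by two annotators.
annotator_1 = ["MOVE", "MOVE", "SAY", "SAY", "BUY", "MOVE"]
annotator_2 = ["MOVE", "SAY", "SAY", "SAY", "BUY", "MOVE"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few labels dominate the coding. With more than two annotators per text, a multi-rater statistic would be the natural substitute.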

Corpus
The target data set is modeled on, and extends, the DARPA MT Evaluation data set (White and O’Connell, 1994) and includes data from the Linguistic Data Consortium (LDC) Multiple Translation Arabic, Part 1 (Walker et al., 2003). The data set consists of 6 bilingual comparable corpora (Japanese, Korean, Hindi, Arabic, French, and Spanish). Each corpus currently contains 5 source-language news articles (125 by the end of the project), each with either two or three independently produced high-quality translations into English, each around 300 words in length. Since the source news articles for each individual language corpus are different, the 6 corpora are comparable rather than parallel. Ultimately, each corpus will have between 150,000 and 200,000 words, for a total of some 1,000,000 words. Thus, for any given corpus, the annotation effort is to assign interlingua annotations to a set of 3 or 4 parallel texts, 2 or 3 of which are in the same language, English, and all of which supposedly communicate the same information.

 
