Timetable
October: Phase 0 (preparing materials and tools)
November–January: Phase 1 (learning to annotate)
    Annotate training corpora
    Cross-correlate (locally and across sites)
    Refine instructions, ontology, etc.
    Go back and fix
February–March: Phase 2 (annotating for agreement measure)
    Everyone annotates the same (English) test materials
    We measure annotator agreement
End of March: 6-month report; the main annotation task starts
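
The timetable does not say which agreement statistic Phase 2 will use. As one possibility, the sketch below computes Cohen's kappa for two annotators who labelled the same items. The choice of kappa, the function name, and the role labels are all assumptions made for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Assumption: the workplan names no specific agreement measure;
    kappa is used here only as a common example.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical semantic-role labels from two annotators on four items.
a1 = ["AGENT", "THEME", "AGENT", "GOAL"]
a2 = ["AGENT", "THEME", "THEME", "GOAL"]
kappa = cohen_kappa(a1, a2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when a few labels dominate the annotation.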


Overall Setup
Assumptions: 2 translations per text.

Each site collects N texts in its source language and translates them (two translations) into English. For Korean, call the source texts K1, K2, etc. Their translations are K1E1, K1E2, K2E1, K2E2, etc.

 


Fig. 1. Overall setup.

Phase 1: Annotator training
Assumptions: 2 annotators per site. 1 annotator per site is bilingual.

Each site has (at least) two annotators A1 and A2. Their annotations of the English texts are called K1E1A1, K1E2A1, K2E1A1, K2E2A1, K1E1A2, K1E2A2, K2E1A2, K2E2A2, and of the (Korean) originals K1A1, K2A1, etc.

 

Fig. 2. Annotators and the texts they annotate.
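
The naming scheme above can be sketched as a small label generator. The function names are hypothetical, and treating A1 as the bilingual annotator who also annotates the source texts is an assumption; the text only says that one annotator per site is bilingual.

```python
def text_labels(lang="K", n_texts=2, n_translations=2):
    """Return source labels (K1, K2, ...) and translation labels (K1E1, ...)."""
    sources = [f"{lang}{i}" for i in range(1, n_texts + 1)]
    translations = [
        f"{s}E{j}" for s in sources for j in range(1, n_translations + 1)
    ]
    return sources, translations

def annotation_labels(texts, n_annotators=2):
    """Append an annotator suffix (A1, A2, ...) to every text label."""
    return [f"{t}A{a}" for a in range(1, n_annotators + 1) for t in texts]

sources, translations = text_labels()
# Both annotators annotate every English translation.
english_annotations = annotation_labels(translations)
# Assumption: only the bilingual annotator (taken to be A1) annotates the originals.
source_annotations = annotation_labels(sources, n_annotators=1)
```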


We can use the bilingual annotator to check the source–target consistency of the interlingua (IL) annotations, and the other annotator to check cross-site IL consistency.

Fig. 3. Using annotators for intra-site cross-language and cross-site consistency.

Source Collection
The data set consists of 6 bilingual parallel corpora. Each corpus is made up of 125 source language news articles along with three independently produced translations into English. (The source news articles for each individual language corpus are different from the source articles in the other language corpora.) The source languages are Japanese, Korean, Hindi, Arabic, French and Spanish. Typically, each article is between 300 and 400 words long (or the equivalent), and thus each corpus has between 150,000 and 200,000 words. Consequently, the size of the entire data set is around 1,000,000 words. The Spanish, French, and Japanese corpora are based on the DARPA MT evaluation data (White and O’Connell, 1994). The Arabic corpus is based on LDC’s Multiple Translation Arabic, Part 1 (Walker et al., 2003).
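
The size estimates can be checked with a few lines of arithmetic. The sketch assumes that the per-corpus word count covers each source article together with its three translations; the text does not state this explicitly, but it is the reading under which the quoted figures work out.

```python
N_CORPORA = 6
ARTICLES_PER_CORPUS = 125
# Assumption: one source version plus three English translations per article.
VERSIONS_PER_ARTICLE = 1 + 3
WORDS_PER_ARTICLE = (300, 400)  # lower and upper bound

# Words per corpus: articles x versions x words per article.
per_corpus = tuple(
    ARTICLES_PER_CORPUS * VERSIONS_PER_ARTICLE * w for w in WORDS_PER_ARTICLE
)
# Words in the whole data set: six corpora of that size.
total = tuple(N_CORPORA * c for c in per_corpus)
```

Under this assumption the bounds are 150,000 to 200,000 words per corpus and 900,000 to 1,200,000 words overall, consistent with the figure of around 1,000,000 words.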

For any given subcorpus, the annotation effort is to assign interlingual content to a set of 4 parallel texts, 3 of which are in the same language, English, and all of which theoretically communicate the same information. A multilingual parallel data set of source language texts and English translations offers a unique perspective and unique problem for annotating texts for meaning.


Annotation Manuals
Markup instructions are contained in three manuals: a users’ guide for Tiamat (including procedural instructions), a definitional guide to semantic roles, and a manual for creating a dependency structure (IL0). Together these manuals allow the annotator to understand (1) the intention behind aspects of the dependency structure; (2) how to use Tiamat to mark up texts; and (3) how to determine appropriate semantic roles and ontological concepts. In choosing a set of appropriate ontological concepts, annotators were encouraged to look at the name of the concept and its definition, the name and definition of the parent node, example sentences, lexical synonyms attached to the same node, and sub- and super-classes of the node.


Ongoing Findings

 
