Timetable
October: Phase 0 (preparing materials and tools)
November–January: Phase 1 (learning to annotate)
Annotate training corpora
Cross-correlate (locally and across sites)
Refine instructions, ontology, etc.
Go back and fix
February–March: Phase 2 (annotating to measure agreement)
Everyone annotates same (English) test materials
We measure annotator agreement
End of March: 6-month report; the main annotation task starts
Overall Setup
Assumptions: 2 translations per text.
Each site collects N texts in its source language and has each one
translated into English twice. For Korean, for example, the source texts
are called K1, K2, etc., and their translations K1E1, K1E2, K2E1,
K2E2, etc.

Fig. 1. Overall setup.
Phase 1: Annotator training
Assumptions: 2 annotators per site. 1 annotator per site is bilingual.
Each site has (at least) two annotators A1 and A2.
Their annotations of the English texts are called K1E1A1, K1E2A1,
K2E1A1, K2E2A1, K1E1A2, K1E2A2, K2E1A2, K2E2A2, and of the (Korean)
originals K1A1, K2A1, etc.

Fig. 2. Annotators and the texts they annotate.
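The naming scheme above can be illustrated with a short sketch. The function names and parameters here are hypothetical and not part of any project tooling; the sketch also assumes that A1 is the bilingual annotator and is therefore the only one who annotates the source-language originals.

```python
def text_ids(lang, n_texts, n_translations=2):
    """Return source IDs (K1, K2, ...) and translation IDs (K1E1, K1E2, ...)."""
    sources = [f"{lang}{i}" for i in range(1, n_texts + 1)]
    translations = [
        f"{src}E{j}" for src in sources for j in range(1, n_translations + 1)
    ]
    return sources, translations


def annotation_ids(lang, n_texts, n_translations=2, n_annotators=2):
    """Return IDs for annotations of the English texts and of the originals.

    Assumption: annotator A1 is the bilingual one, so only A1 annotates
    the source-language originals.
    """
    sources, translations = text_ids(lang, n_texts, n_translations)
    annotators = [f"A{k}" for k in range(1, n_annotators + 1)]
    english = [f"{t}{a}" for a in annotators for t in translations]
    originals = [f"{src}A1" for src in sources]
    return english, originals


english, originals = annotation_ids("K", n_texts=2)
print(english)
print(originals)
```

For two Korean texts this lists the eight English annotation IDs in the order given above (K1E1A1 through K2E2A2), followed by K1A1 and K2A1 for the originals.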
We can use the bilingual annotator to check for source–target
consistency of IL, and the other annotator to check cross-site IL
consistency.

Fig. 3. Using annotators for intra-site cross-language and cross-site consistency.
Source Collection
The data set consists of 6 bilingual
parallel corpora. Each corpus is made up of 125 source language news
articles along with three independently produced translations into
English. (The source news articles for each individual language corpus
are different from the source articles in the other language corpora.)
The source languages are Japanese, Korean, Hindi, Arabic, French and
Spanish. Typically, each article is between 300 and 400 words long
(or the equivalent), so each corpus, counting the source articles together
with their three translations, has between 150,000 and 200,000
words. Consequently, the size of the entire data set is around 1,000,000
words. The Spanish, French, and Japanese corpora are based on the
DARPA MT evaluation data (White and O’Connell 1994). The Arabic
corpus is based on LDC’s Multiple Translation Arabic,
Part 1 (Walker et al., 2003).
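The size estimates above follow from simple arithmetic, sketched here for reference. The constant and function names are illustrative only; the sketch assumes that the per-corpus word count covers each source article plus its three English translations, which is the reading under which the stated figures work out.

```python
ARTICLES_PER_CORPUS = 125
VERSIONS_PER_ARTICLE = 1 + 3      # source text plus three English translations
WORDS_PER_ARTICLE = (300, 400)    # typical lower and upper bounds
NUM_CORPORA = 6


def corpus_word_range():
    """Lower and upper word-count estimate for one bilingual corpus."""
    low, high = WORDS_PER_ARTICLE
    texts = ARTICLES_PER_CORPUS * VERSIONS_PER_ARTICLE
    return texts * low, texts * high


def dataset_word_range():
    """Lower and upper word-count estimate for all corpora together."""
    low, high = corpus_word_range()
    return NUM_CORPORA * low, NUM_CORPORA * high


print(corpus_word_range())   # (150000, 200000)
print(dataset_word_range())  # (900000, 1200000)
```

The full data set thus falls between roughly 900,000 and 1,200,000 words, consistent with the figure of around 1,000,000.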
For any given subcorpus, the annotation effort is
to assign interlingual content to a set of four parallel texts, three of
which are in the same language, English, and all of which theoretically
communicate the same information. A multilingual parallel data set
of source language texts and English translations offers both a unique
perspective on, and a unique problem for, annotating texts for meaning.
Annotation Manuals
Markup instructions are contained
in three manuals: a users’ guide for Tiamat (including procedural
instructions), a definitional guide to semantic roles, and a manual
for creating a dependency structure (IL0). Together these manuals
allow the annotator to understand (1) the intention behind aspects
of the dependency structure; (2) how to use Tiamat to mark up texts;
and (3) how to determine appropriate semantic roles and ontological
concepts. In choosing a set of appropriate ontological concepts, annotators
were encouraged to look at the name of the concept and its definition,
the name and definition of the parent node, example sentences, lexical
synonyms attached to the same node, and sub- and super-classes of
the node.
Ongoing Findings