Annotation Tools

We have assembled and, where necessary, built a suite of tools for use in the annotation process.

Since we gather our corpora from disparate sources, we have to standardize the text before presenting it to automated procedures. For English, this involves sentence boundary detection, but for other languages, it may involve word segmentation, chunking of text, demorphing, or similar language-specific operations.
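As a rough illustration of the English standardization step, the sketch below splits text into sentences with a simple rule: a boundary is a run of sentence-final punctuation followed by whitespace and an uppercase letter, unless the preceding token is a known abbreviation. This is not the project's actual sentence boundary detector; the function name `split_sentences` and the short `ABBREVIATIONS` list are assumptions made for the example.

```python
import re

# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split text at ., ! or ? when followed by whitespace and an uppercase letter."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+(?=\s+[A-Z])", text):
        end = match.end()
        # Inspect the token that ends at this punctuation mark.
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # Not a sentence boundary; keep scanning.
        sentences.append(text[start:end].strip())
        start = end
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived. He sat down."):
        print(sentence)
```

A rule of this kind only makes sense for languages that mark sentence ends with punctuation and capitalization; languages that need word segmentation would require a different first step.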

The text is then processed by a dependency parser. For English, we use two parsers, one from Prague and the other from Connexor (Tapanainen and Järvinen 1997). Their output is converted to a standard form, then viewed and corrected in TrEd (Hajič et al. 2001), a graphical tree editor written in Perl/Tk. The hand-corrected deep dependency structure produced by this process is the IL0 representation for that sentence. Already at this stage, some lexical items are replaced by features (e.g., tense), morphological forms are replaced by features on the citation form, and certain constructions (e.g., passive) are regularized, with empty arguments inserted.

To derive IL1 from the IL0 representation, annotators use Tiamat, a tool developed specifically for this project. Tiamat displays background resources and information, including the IL0 tree and the Omega ontology. It is principally the annotator’s workspace: it shows the current sentence, the current word(s) to be annotated, and the ontology’s options for annotation, including theta roles (already connected to other parts of the sentence, as far as possible). Annotators mark up the text through simple point-and-click selection of words, concepts, and theta roles.


Evaluating the annotators’ output by visual inspection of the annotated IL1 files alone would be daunting. We therefore also developed an annotation agreement evaluation tool that compares the annotators’ output and generates evaluation measures. Its reports let researchers examine both gross-level phenomena, such as inter-annotator agreement, and more detailed points of interest, such as lexical items on which agreement was particularly low, which may indicate gaps or other inconsistencies in the ontology being used.
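The kind of report described above can be sketched as follows. The example computes raw pairwise agreement (the proportion of annotator pairs that chose the same concept), both overall and per lexical item, and then lists the items that fall below a threshold. The measures actually produced by the project's evaluation tool are not specified here, so the data layout, the function names, and the 0.5 threshold are all assumptions; raw agreement is also not chance-corrected, unlike a statistic such as kappa.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_agreement(annotations):
    """Compute raw pairwise agreement.

    annotations: {annotator: {(sentence_id, word): concept}} (hypothetical layout).
    Returns (overall, per_word), each a proportion of agreeing annotator pairs.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for a, b in combinations(sorted(annotations), 2):
        # Only compare items that both annotators labeled.
        shared = annotations[a].keys() & annotations[b].keys()
        for key in shared:
            word = key[1]
            total[word] += 1
            if annotations[a][key] == annotations[b][key]:
                agree[word] += 1
    per_word = {w: agree[w] / total[w] for w in total}
    n = sum(total.values())
    overall = sum(agree.values()) / n if n else 0.0
    return overall, per_word


def low_agreement_words(per_word, threshold=0.5):
    """Return the words whose agreement falls below the (assumed) threshold."""
    return sorted(w for w, score in per_word.items() if score < threshold)


if __name__ == "__main__":
    # Invented toy data, for illustration only.
    toy = {
        "A": {(1, "bank"): "FINANCIAL", (1, "run"): "MOVE"},
        "B": {(1, "bank"): "RIVER", (1, "run"): "MOVE"},
        "C": {(1, "bank"): "FINANCIAL", (1, "run"): "MOVE"},
    }
    overall, per_word = pairwise_agreement(toy)
    print(overall, per_word, low_agreement_words(per_word))
```

Grouping by lexical item rather than by sentence is what makes it possible to spot words for which the ontology offers unclear or overlapping choices.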




Annotation Manuals


English IL0 Annotation Manual

Arabic IL0 Annotation Manual
Hindi IL0 Annotation Manual
Japanese IL0 Annotation Manual
Korean IL0 Annotation Manual
Spanish IL0 Annotation Manual


IL1 Theta Role Manual
Revised Theta Role Definitions



Annotated and Unannotated Data


The files are available as two sets. The unannotated texts are available here and their annotated counterparts are available here.