feminine/masculine
dictionary entry
broken plurals
copular
segmentation
pos of kan and ina
About the Annotation Manual
Structure of the Annotation Manual Sections
Each section of the manual should cover structure and content. Structure is basically syntax (and thus much can be reused from the existing syntax manual which Owen inserts into these pages), though some things will of course have to be changed. Content refers to the process of linking the words to the ontology.
In writing the manual, the linguistic content of the manual should be separated from the description of how to use the tool (so that we can reuse the manual without change if the tool changes). However, we need to have a clear idea of what the tool does before writing the linguistic content part.
Procedure for Creating/Maintaining the Annotation Manual
Each section has one person in charge. Everyone should feel free to add comments on the corresponding comment page.
The person in charge can incorporate the comment into the draft and then remove the comment, or argue against the comment in another comment. Eventually, all comments should be removed, either by incorporation (then the person in charge removes it) or by retracting the comment (then the person making the comment removes it).
Important: when you have changed a part of the annotation manual, and you have given people in the Content Committee time to comment (or the change was discussed), then you should copy the updated page to the mirror site on the Annotator Wiki. This is crucial, since the annotators do not have access to this page!
English IL0 Manual
Each node in the dependency tree can be thought of as an attribute-value matrix, i.e., a bundle of features with values. All values must be set for each node in the tree. This will require checking each node before finishing the analysis. If DRole is omitted, it is assumed to be the same as SRole (which is frequently the case). There is a certain redundancy among these features, which is intended (for error checking).
Here is a list of features:
- Position (wpos). The linear position of the word in the sentence. This should not be modified or annotated, except for new empty nodes created by the annotator, which should always be given the wpos 100.
- Word (lex). This is the inflected word form associated with the node. It is almost always correctly displayed already. Example: went.
- Part-of-Speech (POS). This is the lexical class, taken from a short list. Example: verb. Specific options:
o V -- verbs, but not auxiliary verbs (=Aux)
o N -- common nouns
o PN -- proper nouns
o Adj -- adjectives
o Adv -- adverbs
o P -- prepositions and subordinating conjunctions
o G -- what is this?
o Conj -- coordinating conjunctions, but not subordinating conjunctions; also includes the comma used in enumerations instead of repeated and
o Det -- determiners
o Aux -- auxiliary verbs
o Pun -- punctuation marks, but not the comma used in conjunctions
o Sym -- various symbols (dollar signs and the like)
o Uh -- speech-specific sounds, even if meaningful (such as /UH HUH/)
o Misc -- everything else, including greetings (Hi, Hello) and interjections (Okay)
- Supertag (Stag). IGNORE. This is the supertag. This should not be annotated, as it will be filled in automatically.
- Base form (Root). This is the base form (lexeme) of the inflected form. A first "guess" will be included, which needs to be checked and corrected. Example: go.
- Morphological Features (Morph). A complete specification of the morphological features needed to derive the inflected word form from the base form. The options are grouped by part-of-speech; in the menu of the GRAPH tool, all options are displayed at once and the GRAPH tool does not enforce a proper choice of features given the part-of-speech. Possibilities are:
o NOUNS (including proper nouns and determiners)
- sg -- singular
- pl -- plural
o VERBS (including auxiliaries)
- 3sg_PRES -- 3rd person singular present sings
- PRES -- present tense, but NOT 3rd person singular sing; also use for all subjunctives (lest he sing)
- PAST -- past tense sang
- PAPRT -- past participle sung
- PRESPRT -- present participle singing
- INF -- base form when used in infinitive sing
o ADJECTIVES and ADVERBS
- COMP -- morphological comparative longer
- SUPER -- morphological superlative longest
- In comparatives and superlatives formed with more and most, label the adjective or adverb as None.
o MISC
- None -- use this when the word is not inflected, other than infinitive verbs (e.g. adjectives in base form)
- --- -- use this when the word is not inflected, other than infinitive verbs (e.g. adjectives in base form), or when the word does not ever inflect (prepositions, particles, etc).
Note: in fact "---" subsumes "None" and can be used whenever "None" can be used.
- Functional Role Reassignment (FRR). This feature is only used on verbs, nouns, adjectives, or prepositions, and reflects ways in which the usual distribution of roles has been changed. There are only five options:
o Pass -- passive (only verbs). See discussion on passives.
o Erg -- ergative (only verbs). See discussion on ergatives.
o There -- there-insertion. See discussion on there-insertion.
o Pred -- predicative (only nouns, adjectives, prepositions). Use this to indicate that the noun, adjective, or preposition is used as head of a predicative construction with a dependent form of be which is analyzed as an auxiliary.
o None -- none of the above four cases apply.
- Surface role (SRole). This is the role of the node with respect to its mother, as the node appears in the surface string. All nodes have a surface role.
o Subj -- surface subject. An argument that agrees in person and number with the tensed verb is always a surface subject in English; other surface subjects are the empty nodes of non-finite verbs. Surface subjects can also be verbs (sentential subjects, as in That John came bothered Mercedes.), or adjectives (Earlier is better than later) or prepositions (In the morning suits me best).
o Obj -- surface object. This will never have a preposition as its head (if it does, use PrepObj). This is also the surface role of a complement of a preposition. This is also the role of sentential complements (the verb which heads the complement has the role Obj).
o Obj2 -- surface indirect object. This will never have a preposition as its head (if it does, use PrepObj2).
o PrepObj -- surface prepositional object. An object which is dominated by a preposition. It is in fact the preposition which gets the role "PrepObj". Examples: put the book (Obj) on (PrepObj) the table (Obj) or give the book (Obj) to (PrepObj) Mary (Obj)
o PrepObj2 -- second surface prepositional argument (rare).
o Adj -- all adjuncts, including modifiers, appositions, and the like. Also, all function words (determiners, auxiliaries, subordinating conjunctions) will be labeled "Adj".
- Deep role (DRole). This is the role of the node with respect to its mother, in some deeper representation. This is a little murky. We will use strictly syntactic criteria. Specifically, DRole is different from SRole only if there is a form of the verb in which it is realized with more arguments. DRole reflects the argument patterns of the verb if it were in its active, non-ergative form.
o Subj -- deep subject. The surface subject, except for passives or ergative verbs (the door opened), in which case there is no deep subject or it is expressed (for passives) by the by phrase. The deep subject may, but need not, agree in person and number with the tensed verb; empty surface subjects are like overt surface subjects in that they may or may not be deep subjects.
o Obj -- deep object. This will never have a preposition associated with it in the underlying form (on the surface it might). This is in surface subject position for passives and ergatives. This is also the deep role of a complement of a preposition.
o Obj2 -- deep indirect object. This is for verbs which allow dative shift and have a V NP NP pattern (such as give). Whether or not the dependent has a preposition associated with it, if it can be realized without preposition, it is an Obj2 (typically, the recipient).
o PrepObj -- deep prepositional object. An object which is always dominated by a preposition. It is in fact the preposition which gets the role "PrepObj". Examples: put the book (Obj) on (PrepObj) the table (Obj) but give the book (Obj) to (Obj2) Mary (Obj). Note that put may be the only verb that has a deep PrepObj.
o Adj -- all adjuncts, including modifiers, auxiliaries, appositions, and the like.
- Done. This feature is only a check to make sure that the default values have been checked. Set it to "Y" when you are done with the features for one node.
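The feature list above can be sketched as a small data structure with a consistency check. This is only an illustration, not the GRAPH tool's own format: the class `Node`, its field names, and the method names are hypothetical, and accepting "None" and "---" for every part of speech is an assumption made here for leniency.

```python
from dataclasses import dataclass
from typing import List, Optional

# Value sets copied from the feature list above.
POS_TAGS = {"V", "N", "PN", "Adj", "Adv", "P", "G", "Conj", "Det",
            "Aux", "Pun", "Sym", "Uh", "Misc"}
SROLES = {"Subj", "Obj", "Obj2", "PrepObj", "PrepObj2", "Adj"}
FRR_VALUES = {"Pass", "Erg", "There", "Pred", "None"}

NOMINAL = {"sg", "pl"}
VERBAL = {"3sg_PRES", "PRES", "PAST", "PAPRT", "PRESPRT", "INF"}
MORPH_BY_POS = {
    "N": NOMINAL, "PN": NOMINAL, "Det": NOMINAL,
    "V": VERBAL, "Aux": VERBAL,
    "Adj": {"COMP", "SUPER"}, "Adv": {"COMP", "SUPER"},
}
UNINFLECTED = {"None", "---"}  # assumption: accepted for any part of speech


@dataclass
class Node:
    """Hypothetical container for the features of one node."""
    wpos: int
    lex: str
    pos: str
    root: str
    morph: str = "---"
    frr: str = "None"
    srole: str = "Adj"
    drole: Optional[str] = None  # an omitted DRole means "same as SRole"
    done: str = "N"

    def effective_drole(self) -> str:
        """Apply the rule that an omitted DRole defaults to the SRole."""
        return self.srole if self.drole is None else self.drole

    def problems(self) -> List[str]:
        """List what still needs attention; an empty list means nothing was found."""
        found = []
        if self.pos not in POS_TAGS:
            found.append(f"unknown POS: {self.pos}")
        if self.srole not in SROLES:
            found.append(f"unknown SRole: {self.srole}")
        if self.frr not in FRR_VALUES:
            found.append(f"unknown FRR: {self.frr}")
        allowed = MORPH_BY_POS.get(self.pos, set()) | UNINFLECTED
        if self.morph not in allowed:
            found.append(f"Morph {self.morph} does not fit POS {self.pos}")
        if self.done != "Y":
            found.append("Done is not set to Y")
        return found


books = Node(wpos=30, lex="books", pos="N", root="book",
             morph="pl", srole="Obj", done="Y")
print(books.effective_drole(), books.problems())
```

A check of this kind covers the gap noted under Morph, where the tool shows all options at once and does not enforce a choice that fits the part-of-speech.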
Verbs are heads of sentences and clauses.
The head of any complete clausal utterance is the main verb. Incomplete utterances (NPs, PPs, Greetings) should have as their head the usual head for that type of phrase.
Auxiliary verbs (do, have, had, auxiliary-be) are deleted. Their meaning is represented as features on the main verb (for example, tense:fut). Modals (can, must, etc) are syntactically very much like auxiliaries, but they are included in IL0 for semantic reasons as dependents of the main verb. In all cases, when the main verb is missing, as in VP ellipsis, an empty verb node should be created and used as the head of the entire clause.
Sequences of auxiliary verbs (had been Ven, are to be Ving, could have been being Ven) should be annotated with the main verb as the head, and all auxiliaries removed (except modal auxiliaries).
When the main verb is a form of the copula, the head of the clause will vary depending on the type of copular sentence. Predicative copular constructions will have the predicate as their head. Equative copular constructions will have the copula as their head.
In an infinitive construction, to should be treated as an auxiliary, i.e., it should be removed. This includes instances of want to and have to (in which the to depends on the embedded verb).
In distinguishing between arguments and adjuncts, consistency is the most important thing. This distinction will matter most for annotating empty categories. In addition, each argument will be annotated with a feature encoding its grammatical role. All non-arguments will be annotated as adjuncts, including function words.
The only NPs that will be considered arguments for annotation purposes are
A list of argument patterns of common verbs can be consulted for questionable cases.
Only constituents marked Subj, Obj, or PrepObj will count as arguments. When the same constituent can appear either as an object or a prepositional object, it will count as an argument. As such, the deep role of an object that can appear with or without a preposition is always a role where there is no preposition. For example, with depart the source NP has a deep role of Obj if its surface role is PrepObj.
Key:
Subj = subject
Obj = object argument
PrepObj = oblique argument
Adj = adjunct
ARRIVE: (subj X) arrives (obj Z = goal) or (subj X) arrives (prepobj in Z = goal)
DEPART: (subj X) departs (obj Z = source) or (subj X) departs (prepobj from Z = source)
LEAVE: (subj X) leaves (obj Y) (adj for Z)
PUT: (subj X) puts (obj Y) (prepobj in/on/under/etc. Z = goal)
The role of each argument (subject, object, indirect object) must be annotated as a feature of its node. See the features page for a more detailed description.
Both deep and surface grammatical relations should be annotated when there is a functional role reversal, i.e. a mismatch between surface subject and deep subject. There are two possible cases:
* In a passive construction, the surface subject should be annotated as the deep object or indirect object (depending on which argument is passivized). The deep subject (surface oblique) should be annotated as such. However, if the deep subject is missing, there is no need to include a missing node for it.
* In an ergative/unaccusative construction, the surface subject should be annotated as the deep object. A verb is in this form if the same verb also can realize its subject as its direct object, with an agentive subject (e.g. the window opened vs the wind opened the window).
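The two cases above amount to a small mapping from surface role to deep role. The sketch below is an illustration only: the function name `deep_role` and its parameters are hypothetical, and it deliberately leaves out the by-phrase of a passive (the surface oblique that is the deep subject), since identifying it requires looking at the preposition.

```python
def deep_role(srole: str, head_frr: str = "None", passivized: str = "Obj") -> str:
    """Suggest a DRole for a dependent from its SRole and the FRR of its head verb.

    passivized: which deep argument was promoted in a passive, "Obj" or "Obj2".
    """
    if srole == "Subj" and head_frr == "Pass":
        # Passive: the surface subject is the deep object or indirect object.
        return passivized
    if srole == "Subj" and head_frr == "Erg":
        # Ergative/unaccusative: the surface subject is the deep object.
        return "Obj"
    # No functional role reversal: deep and surface roles coincide.
    return srole


print(deep_role("Subj", "Erg"))   # the window opened
print(deep_role("Subj", "Pass"))  # the window was opened
```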
Missing arguments will appear as empty nodes. Missing adjuncts will not. In some cases, deciding whether the missing constituent is just an adjunct or a seemingly optional argument is quite difficult. Consistency is the important thing in such a situation. See the discussion of arguments vs. adjuncts.
New empty nodes are created using the "new" option under "Node" in the GRAPH tool. The new node should have feature wpos set to 100, feature lex to e, and feature POS to N (most cases) or V (if VP ellipsis).
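The default values described in this paragraph can be written down as a tiny helper. This is a sketch under the assumption that a node is represented as a plain dictionary; the function name `make_empty_node` is hypothetical and is not part of the GRAPH tool.

```python
def make_empty_node(vp_ellipsis: bool = False) -> dict:
    """Return the default features of a newly created empty node."""
    return {
        "wpos": 100,                         # fixed position for new empty nodes
        "lex": "e",                          # marks the node as empty
        "POS": "V" if vp_ellipsis else "N",  # V only for VP ellipsis
    }


print(make_empty_node())
print(make_empty_node(vp_ellipsis=True))
```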
- VP ellipsis: This will require an empty verbal head with the auxiliary as its dependent. Only the verbal head and no missing arguments will be added.
- Quantified noun phrases without a noun head: put in an empty noun head.
- Subject and object control verbs: These constructions will require a missing category to be included as a dependent of the embedded verb, in particular the surface subject of this verbal head.
- Gerundive, Infinitive and Participial VPs: put in an empty surface subject NP if no subject is present.
- Raising verbs: Raising verbs will not have a missing category. Instead, annotate them with the surface subject as the direct dependent (and surface subject) of the lower verb.
- ECM verbs: ECM verbs will not have a missing category either, on analogy with raising verbs. The lower subject will be the dependent of the lower verb.
- Imperatives: put in an empty subject NP if one is not present.
- Relative clause complementizers: when no relativizer is present include one, unless the clause is a reduced relative.
- Conjunctions: In lists of conjoined phrases where there is only one conjunction but more than two conjuncts (e.g. Tom, Dick, and Harry), a comma separating two conjuncts in lieu of a conjunction can be analyzed as the missing conjunction.
Note that ergative constructions ("the window opened") should not have their missing subject NP included as an empty NP.
Raising verbs will not have a missing category. Instead, annotate them with the surface subject as the direct dependent of the lower verb. In other words, in a raising construction, it is really the lower verb that is imposing the selectional restrictions on the subject of the whole clause.
Verbs (and adjectives) that will be regarded as raising predicates here include seem, appear, need, tend, start, turn out, be supposed to, be going to (gonna), have to, continue, be certain, be likely.
Tests for raising (vs. control structures) include using expletive there (as in (1)), expletive-it (as in (2)), weather-it (as in (3)), and a non-thematic subject from a sentential idiom as the subject of the verb in question (as in (4)). Raising structures occur with all of these types of subjects (exception discussed below). Control structures do not occur with these.
Raising
Control structures should have an empty node included as the subject of their lower verb.
Subject control structures are easy to confuse with raising structures because they appear similar in some contexts.
Raising
Object control verbs include: tell, tempt, force, persuade, appeal to. As with subject control verbs, object control constructions cannot be used with expletives or non-thematic subjects of sentential idioms. Here too an empty node must be included as the dependent of the lower verb. Just like subject control verbs can be confused with raising, object control verbs can be confused with ECM verbs. Using an expletive object is generally a good test to distinguish between the two, as shown here with the control verb decide and the ECM verb believe.
In an exceptional case marking (ECM, also known as AcI "Akkusativ cum Infinitiv") construction, the NP that appears to be a direct object will only be the subject of the lower verb. That is, it will have as its head not the ECM verb, but the lower verb.
Common ECM verbs include expect, assume, believe, forbid, know, let, need.
As with raising verbs, the best tests are to use expletive there and non-thematic subject idioms.
Exceptional case marking constructions with for as in (1-2) below should be analyzed as a subordinate clause with for as a complementizer dependent on the subordinate clause's main verb:
Non-finite (gerundive and infinitive) verb phrases (as present participles or infinitives) can appear with or without subjects. Past participles can only appear without subjects.
In general, non-finite clauses will be dependents of main verbs. Exceptions are reduced relative clauses, (such as 5 above), if they modify nouns. In cases that are not clear, the default choice of a head should be the verb.
Small clause complements will be analyzed with the predication as the head of the small clause and dependent on the head verb. The predication may be nominal, prepositional, or adjectival:
In the case of a past participle-headed predication, like the following, the participle should be tagged as an adjective.
When the subject of a clause is an expletive it or there, the expletive will only be the surface subject, not the deep subject. This can be indicated through the features on the node. In addition, the head node in a there-construction will have an FRR value of "There".
Often, however, there-constructions with the copula will be missing a clear predication, as in the following example:
Deciding whether post-NP material is part of the noun phrase or in fact the main predication of the sentence can be tricky. The best test to use is simply to attempt to paraphrase the sentence as a copular sentence, then parse it on that basis. For example, (1) and (3) can be treated as having a main predication. The sentence in (5), however, seems better analyzed as missing a predication, according to this test.
As with other full clauses, the head of a wh-question will be its main/lexical verb. The wh-word will be a dependent of the main verb like any other argument.
When the wh-word is part of a long-distance dependency, it will not be a dependent of the highest main verb, but of the embedded main verb heading the clause in which the wh-word originated. The linear order will allow a reconstruction of the wh-word's surface position. In cases of long-distance dependencies, there may be "crossing arcs". This is ok.
If an overt subject is not present, as in (1), include an empty noun; otherwise an imperative will have the same analysis as a declarative sentence.
A relative clause will be the dependent of whatever it modifies, in most cases a noun. The arc is labeled Adj. As with other clauses, its main verb will be its own head. The relativizer will be a dependent of the main verb like any other argument (or adjunct, in cases such as the place where he saw the fish).
Wh-word relativizers and that should be analyzed the same (except of course for part-of-speech). Empty relative pronoun nodes should be inserted if and only if neither a wh-word nor a that-complementizer is present. The arc label should reflect the grammatical function of the relativized argument, independently of the type of complementizer (wh, that, or empty).
In long-distance dependencies, the relativizer will not be a dependent of the highest main verb, but of the embedded main verb heading the clause in which it originated. The linear order will allow a reconstruction of its surface position.
Reduced relative clauses (the flight chosen by you or the airline flying to Wausau) are analyzed like regular relative clauses without overt relative pronoun. They have only an empty subject inserted, but not an empty complementizer, nor an empty auxiliary.
Reduced relative clauses appear similar to non-finite past or present participial clauses and may be difficult to distinguish from these. However, they will always depend on a nominal rather than a verbal head. Although most reduced relative clauses are postnominal, it seems that they can be preposed as in (1) below. When sentence initial, it may be difficult to decide what they depend on. If it is clear that they modify a noun phrase (as in (1) below), choose the noun; otherwise choose the verb as their default head, have in (2) and (3), sang in (4). Note that world knowledge needs to be used when making these decisions.
The surface vs. deep subject of a passive construction can be indicated through the use of the features. The grammatical subject (usually the patient) will be indicated as the surface subject but underlying object. See the discussion of grammatical roles for a related treatment of ergative/unaccusative constructions.
The underlying subject (usually the agent), if expressed, will be a surface oblique argument, but the deep subject. If it is not expressed, an empty node should not be included.
Passive morphology (i.e. the auxiliary be or got) will be a dependent of the main verb.
VP-ellipsis should be annotated with an empty verbal head as the root node. Any auxiliaries and the subject will be dependents of this node. No missing arguments should be added.
Nominal modifiers
The head of a noun phrase is the head noun. Any determiner is a dependent. Adjectives are separate dependents from determiners. If there are multiple adjectives, the default structure will simply have each adjective as a direct dependent of the noun. This is the case for multiple determiners also.
Adverbial noun modifiers can be dependents of the determiner or the noun in the phrase they modify. For example, approximately, nearly, practically, almost, about, at most, only can depend on cardinals or some quantifiers; at least, only, just, even can depend on nouns (i.e. modify entire noun phrases). These classes have some overlap; the default head choice in cases of ambiguity should be the noun.
Compound nouns
Compound noun phrases, when clear, can have multiple noun phrases as dependents. For example, child safety seat will have seat as the head and child and safety as its direct dependents. A good test for this is to remove each noun in turn, to see if the phrase still retains part of its original sense. Because a child safety seat is a seat for children and a safety seat, this analysis is the one we want.
In contrast, a phrase like seven-day advance purchase should be annotated with purchase as the head, advance as its dependent, and seven-day as the dependent of advance (compare advance purchase vs. *seven-day purchase).
In cases where it's not clear whether or which nouns modify each other, the default compound structure will have all modifying nouns as direct dependents on the rightmost noun.
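The default compound structure can be sketched as follows. This is an illustration only: the function name `default_compound_arcs` and the representation of arcs as (dependent, head) pairs are assumptions, not part of the annotation tools.

```python
def default_compound_arcs(nouns):
    """Attach every modifying noun directly to the rightmost noun.

    Returns a list of (dependent, head) pairs.
    """
    if not nouns:
        return []
    head = nouns[-1]
    return [(modifier, head) for modifier in nouns[:-1]]


print(default_compound_arcs(["child", "safety", "seat"]))
```

For child safety seat the default happens to coincide with the intended analysis. A phrase like seven-day advance purchase still has to be corrected by hand, because seven-day depends on advance rather than on purchase.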
http://www.cis.upenn.edu/~creswell/dependency/compound.gif
Proper Nouns
Proper nouns should have the value PN for feature POS. They are treated largely like nouns, except that compound proper nouns are not analyzed syntactically as if they were common nouns, but rather given right-branching structures. (The intuition is that they are really fixed phrases.) So in British Airways, British is the head, has POS PN, and carries the other features of this proper noun (in American English, singular number). Airways is a dependent on British (with SRole Adj), and also has POS PN.
In London Heathrow airport, London Heathrow is interpreted as a compound proper noun as described above, and airport as a common noun, which has London Heathrow as its dependent (SRole Adj).
http://www.cis.upenn.edu/~creswell/dependency/pnouns.gif
Quantifier headed NPs
In a noun phrase consisting of only a quantifier, the quantifier should be the head of the NP. Any modifying phrases are directly dependent on it.
GENERAL
Adverbs and adjectives are modifying concepts -- adjectives for nouns, adverbs for verbs. For example, in the phrase, (الكتاب الكبير) "the big book" the adjective "big" (الكبير) modifies the concept "book" (الكتاب) by identifying the size of the book. In (سافر صباحا) "he-left-travelling in-morning" the adverb صباحا modifies the verb سافر by specifying the time in which the action was performed.
DEGREE
The degree of the modification can be specified by other modifiers, such as (جدا) "very", as in (كتاب كبير جدا) "a very big book":
In addition, there are two kinds of degree specification: comparative and superlative forms. In the first, the degree of modification is specified by comparing the case in question to one other case: (هذا الكتاب اكبر من ذلك الكتاب) "This book is bigger than that book." In the second, the degree of modification is specified by comparing to all other cases: (هذا اكبر كتاب / هذا هو الكتاب الاكبر) "This is the biggest book". The form of the adjective will be the same as its normal descriptive (not comparative nor superlative) form.
COPULAR ADJECTIVES
See the manual section on "copular constructions" for how to handle such sentences as (الكتاب كبير) "The book is big."
Prepositions dominate their object NPs. For example,
As in the English IL0 Manual, "John gave a book to Mary" (اعطى يحيى الكتاب الى مريم ) and "John gave Mary a book" (اعطى يحيى مريم الكتاب) look different at IL0 and IL1, but the same at IL2.
Sentences whose main verb is a form of "to be" (in Arabic, a sentence with sisters of kAna كان واخواتها, sisters of Ain~a ان واخواتها, or a nominal sentence with topic/subject مبتدأ -complement/predicate خبر) will always have the predicate خبر as the sentence head with the subject/topic مبتدأ as the child marked with DSyntRole of Subj. The verbal element ( كان / ان ) will be a child of the heading predicate خبر. For example,

Conjunction has its own part-of-speech (Conj). The conjunction (and, or, but, etc) is placed as a dependent of the first conjunct with role Mod, and the second conjunct is a dependent of the conjunction with role Obj.
If a comma acts as a conjunction, it is treated as such (given part-of-speech Conj and analyzed as in the above paragraph). However, note that in "chicken, ducks, and geese", the second (last) comma does not serve as a conjunction (since there is an explicit "and"), and it is removed at IL0. The first comma does serve as a conjunction.
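The two rules above can be illustrated with a short sketch. It rests on several assumptions: the function names are hypothetical, the conjunction list is only the one given in the text, arcs are written as (dependent, head, role) triples, and every comma in the input is assumed to belong to an enumeration. How to nest three or more conjuncts is not spelled out above, so only the two-conjunct case is built.

```python
CONJUNCTIONS = {"and", "or", "but"}  # the examples named in the text


def coordination_arcs(first, conj, second):
    """Arcs for a two-way coordination, as (dependent, head, role) triples."""
    return [
        (conj, first, "Mod"),   # conjunction depends on the first conjunct
        (second, conj, "Obj"),  # second conjunct depends on the conjunction
    ]


def classify_commas(tokens):
    """Decide for each comma whether it acts as a conjunction.

    A comma directly followed by an explicit conjunction is removed;
    any other comma is treated as a conjunction (POS Conj).
    Returns a dict mapping token index to "Conj" or "remove".
    """
    labels = {}
    for i, token in enumerate(tokens):
        if token == ",":
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            labels[i] = "remove" if following in CONJUNCTIONS else "Conj"
    return labels


print(coordination_arcs("ducks", "and", "geese"))
print(classify_commas(["chicken", ",", "ducks", ",", "and", "geese"]))
```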
An empty node is a node which does not correspond to a word (or other graphical manifestation such as a punctuation mark) in the input string.
In all cases, when you create an empty node, give it a wpos feature so that it ends up in a position that roughly corresponds to its grammatical function (i.e., if it is a subject, to the left of its governing verb, and so on). When the fs files come out of the parser, the nodes have wpos features in increments of 10, so there are enough unused positions to place new nodes where they belong. Never reuse an already used position.
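One way to pick a position that respects this paragraph is sketched below. The function name `free_wpos` is hypothetical, and the choice to take the nearest free integer next to the governing node is an assumption made for illustration; any unused position on the correct side would do.

```python
def free_wpos(used, governor_wpos, before=True):
    """Return an unused wpos directly next to the governing node.

    used: set of wpos values already taken in the sentence.
    before: True to place the new node to the left of the governor
            (e.g. an empty subject), False to place it to the right.
    """
    step = -1 if before else 1
    candidate = governor_wpos + step
    while candidate in used:  # never reuse an already used position
        candidate += step
    if candidate < 1:
        raise ValueError("no free position to the left of the governor")
    return candidate


used_positions = {10, 20, 30}  # parser output comes in increments of 10
print(free_wpos(used_positions, 20))                # empty subject of the verb at 20
print(free_wpos(used_positions, 20, before=False))  # empty node after the verb
```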
There are (at least) three types of empty nodes.
Empty nominal nodes: big-PRO, and related cases
These are cases of empty nodes where the meaning can be derived from the syntactic context:
Empty nominal nodes: little-pro, missing argument in passive, and related cases
Empty verbal nodes
This happens in cases of VP-ellipsis ("Mary has seen the chicken but Gigi has
not"). VP-ellipsis should be annotated with an empty verbal head as the root
node. The lexeme and word of the empty head should be filled in from the
antecedent between brackets, e.g. "<play>" for "Mary plays with cats and
so does Tony". Any auxiliaries (including the auxiliary which is overt in the
string) are deleted, as usual in IL0. The subject and other arguments will be
dependents of this node. No missing arguments should be added.
Remove all punctuation, except meaningful punctuation. Examples:
Creating IL0
Currently, researchers create IL0 files. The procedure is: read the manual; run the text through the parser (see link below); correct the fs file in Tred; carefully check the resulting fs files; email them to Owen for approval.