feminine/masculine

dictionary entry

broken plurals

copular

segmentation

pos of kan and ina


Arabic Annotation Manual  

(Version 0.0)


 

About the Annotation Manual

Structure of the Annotation Manual Sections

Each section of the manual should cover structure and content. Structure is basically syntax (and thus much can be reused from the existing syntax manual which Owen inserts into these pages), though some things will of course have to be changed. Content refers to the process of linking the words to the ontology.

In writing the manual, the linguistic content of the manual should be separated from the description of how to use the tool (so that we can reuse the manual without change if the tool changes). However, we need to have a clear idea of what the tool does before writing the linguistic content part.

Procedure for Creating/Maintaining the Annotation Manual

Each section has one person in charge. Please all feel free to add comments in the corresponding comment page.

The person in charge can incorporate the comment into the draft and then remove the comment, or argue against the comment in another comment. Eventually, all comments should be removed, either by incorporation (then the person in charge removes it) or by retracting the comment (then the person making the comment removes it).

Important: when you have changed a part of the annotation manual, and you have given people in the Content Committee time to comment (or the change was discussed), then you should copy the updated page to the mirror site on the Annotator Wiki. This is crucial, since the annotators do not have access to this page!

English IL0 Manual


Features on All Nodes 

Each node in the dependency tree can be thought of as an attribute-value matrix, i.e., a bundle of features with values. All values must be set for each node in the tree. This will require checking each node before finishing the analysis. Here is a list of features:

If DRole is omitted, it is assumed to be the same as SRole (which is frequently the case). There is a certain redundancy among these features, which is intended (for error checking).
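The default described above can be sketched in Python. This is only an illustration: the class name Node, the lowercase field names, and the method deep_role are assumptions made for the example and are not part of the annotation tool. Only the rule itself (an omitted DRole is read as the SRole) comes from the manual.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical attribute-value bundle for one dependency-tree node."""
    word: str
    lex: str
    pos: str
    srole: str                   # surface role (SRole)
    drole: Optional[str] = None  # deep role (DRole); None means "omitted"

    def deep_role(self) -> str:
        # An omitted DRole is assumed to be the same as SRole.
        return self.srole if self.drole is None else self.drole


book = Node(word="book", lex="book", pos="N", srole="Obj")
window = Node(word="window", lex="window", pos="N", srole="Subj", drole="Obj")
```

Here book.deep_role() falls back to the surface role, while window keeps its explicitly set deep role.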

Verbs 

Verbs are heads of sentences and clauses.


Verbs and Auxiliaries: choosing a head

The head of any complete clausal utterance is the main verb. Incomplete utterances (NPs, PPs, Greetings) should have as their head the usual head for that type of phrase.

Auxiliary verbs (do, have, had, auxiliary-be) are deleted. Their meaning is represented as features on the main verb (for example, tense:fut). Modals (can, must, etc) are syntactically very much like auxiliaries, but they are included in IL0 for semantic reasons as dependents of the main verb. In all cases, when the main verb is missing, as in VP ellipsis, an empty verb node should be created and used as the head of the entire clause.

Sequences of auxiliary verbs (had been Ven, are to be Ving, could have been being Ven) should be annotated with the main verb as the head, and all auxiliaries removed (except modal auxiliaries).
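As a rough sketch of this reduction, assume the annotator has already labeled each word of the verb group as aux, modal, or main. The function name reduce_verb_group and these labels are hypothetical. The sketch does not cover VP ellipsis, where an empty verb node has to be created instead.

```python
def reduce_verb_group(group):
    """Reduce a verb group to its head and the modals that are kept.

    group is a list of (word, kind) pairs, where kind is one of
    'aux', 'modal', or 'main' (labels assumed for this sketch).
    """
    mains = [word for word, kind in group if kind == "main"]
    if len(mains) != 1:
        # A missing main verb (VP ellipsis) is not handled here.
        raise ValueError("expected exactly one main verb")
    # Plain auxiliaries are dropped; modals stay as dependents of the main verb.
    modals = [word for word, kind in group if kind == "modal"]
    return {"head": mains[0], "modal_dependents": modals}


result = reduce_verb_group([
    ("could", "modal"),
    ("have", "aux"),
    ("been", "aux"),
    ("being", "aux"),
    ("seen", "main"),
])
```

The meaning carried by the deleted auxiliaries would still have to be recorded as features on the main verb, which this sketch leaves out.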

When the main verb is a form of the copula, the head of the clause will vary depending on the type of copular sentence. Predicative copular constructions will have the predicate as their head. Equative copular constructions will have the copula as their head.

In an infinitive construction, to should be treated as an auxiliary, i.e., it should be removed. This includes instances of want to and have to (in which the to depends on the embedded verb).


Arguments and adjuncts

In distinguishing between arguments and adjuncts, consistency is the most important thing. This distinction will matter most for annotating empty categories. In addition, each argument will be annotated with a feature encoding its grammatical role. All non-arguments will be annotated as adjuncts, including function words.

The only NPs that will be considered arguments for annotation purposes are

  1. NPs that never appear with a preposition;
  2. NPs that appear with a preposition but can occur with the same verb in an alternation without a preposition (e.g. the indirect object Y in give X to Y);
  3. NPs that are obligatory (e.g. Y in put X on Y).

    A list of argument patterns of common verbs can be consulted for questionable cases.
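The three criteria can be read as a simple disjunction. The sketch below assumes the annotator has already judged each criterion for the NP in question. The names NPUsage and is_argument are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class NPUsage:
    """Hypothetical record of an annotator's judgments about one NP."""
    never_with_preposition: bool = False          # criterion 1
    alternates_without_preposition: bool = False  # criterion 2
    obligatory: bool = False                      # criterion 3


def is_argument(np: NPUsage) -> bool:
    # An NP counts as an argument if any one of the three criteria holds.
    return (np.never_with_preposition
            or np.alternates_without_preposition
            or np.obligatory)


give_to_y = NPUsage(alternates_without_preposition=True)  # give X to Y
put_on_y = NPUsage(obligatory=True)                       # put X on Y
```

An NP for which none of the criteria holds is annotated as an adjunct.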


Verb argument patterns

Only constituents marked Subj, Obj, or PrepObj will count as arguments. When the same constituent can appear either as an object or a prepositional object, it will count as an argument. As such, the deep role of an object that can appear with or without a preposition is always a role where there is no preposition. For example, with depart the source NP has a deep role of Obj if its surface role is PrepObj.

Key:
Subj = subject
Obj = object argument
PrepObj = oblique argument
Adj = adjunct

ARRIVE: (subj X) arrives (obj Z = goal) or (subj X) arrives (prepobj in Z = goal)
DEPART: (subj X) departs (obj Z = source) or (subj X) departs (prepobj from Z = source)
LEAVE: (subj X) leaves (obj Y) (adj for Z)
PUT: (subj X) puts (obj Y) (prepobj in/on/under/etc. Z = goal)
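The rule that an alternating prepositional object receives the deep role Obj might be sketched as a lookup. The table ALTERNATING and the function deep_role are hypothetical names. Only the arrive and depart alternations listed above are encoded.

```python
# (verb, preposition) pairs whose prepositional object can also appear
# as a plain object, taken from the ARRIVE and DEPART patterns above.
ALTERNATING = {("arrive", "in"), ("depart", "from")}


def deep_role(verb, srole, prep=None):
    """Return the deep role for a constituent with surface role srole."""
    if srole == "PrepObj" and (verb, prep) in ALTERNATING:
        # The deep role is the one used when there is no preposition.
        return "Obj"
    return srole


depart_source = deep_role("depart", "PrepObj", "from")
put_goal = deep_role("put", "PrepObj", "on")
```

For put, the goal stays a PrepObj at both levels because it never appears without a preposition.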


Grammatical relations

The role of each argument (subject, object, indirect object) must be annotated as a feature of its node. See the features page for a more detailed description.

Both deep and surface grammatical relations should be annotated when there is a functional role reversal, i.e. a mismatch between surface subject and deep subject. There are two possible cases:

* In a passive construction, the surface subject should be annotated as the deep object or indirect object (depending on which argument is passivized). The deep subject (surface oblique) should be annotated as such. However, if the deep subject is missing, there is no need to include a missing node for it.

* In an ergative/unaccusative construction, the surface subject should be annotated as the deep object. A verb is in this form if the same verb also can realize its subject as its direct object, with an agentive subject (e.g. the window opened vs the wind opened the window).
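The two cases can be summarized in a small function that returns the surface and deep role of the surface subject. The function name, the construction labels, and the role abbreviation IndObj are assumptions made for this sketch. The manual itself only names the roles in prose.

```python
def subject_roles(construction, passivized="Obj"):
    """Return (surface role, deep role) for the surface subject.

    construction is 'passive', 'unaccusative', or anything else for an
    ordinary active clause. passivized says which argument was
    passivized ('Obj' or the hypothetical label 'IndObj').
    """
    if construction == "passive":
        return ("Subj", passivized)
    if construction == "unaccusative":
        # e.g. "the window opened": surface subject, deep object
        return ("Subj", "Obj")
    return ("Subj", "Subj")


window_opened = subject_roles("unaccusative")
was_given = subject_roles("passive", passivized="IndObj")
```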


Empty categories and missing constituents

Missing arguments will appear as empty nodes. Missing adjuncts will not. In some cases, deciding whether the missing constituent is just an adjunct or a seemingly optional argument is quite difficult. Consistency is the important thing in such a situation. See the discussion of arguments vs. adjuncts.

New empty nodes are created using the "new" option under "Node" in the GRAPH tool. The new node should have feature wpos set to 100, feature lex to e, and feature POS to N (most cases) or V (if VP ellipsis).
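The feature settings for a new empty node can be written out as a plain dictionary. This is only a sketch of the values named above. The function new_empty_node is hypothetical and does not reflect how the GRAPH tool stores nodes.

```python
def new_empty_node(vp_ellipsis=False):
    """Return the feature values this section prescribes for an empty node."""
    return {
        "wpos": 100,
        "lex": "e",
        # N in most cases, V when the node stands for an elided verb.
        "POS": "V" if vp_ellipsis else "N",
    }


empty_subject = new_empty_node()
empty_verb = new_empty_node(vp_ellipsis=True)
```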

* VP ellipsis: This will require an empty verbal head with the auxiliary as its dependent. Only the verbal head and no missing arguments will be added.

* Quantified noun phrases without a noun head: put in an empty noun head.

* Subject and object control verbs: These constructions will require a missing category to be included as a dependent of the embedded verb, in particular the surface subject of this verbal head.

* Gerundive, Infinitive and Participial VPs: put in an empty surface subject NP if no subject is present.

* Raising verbs: Raising verbs will not have a missing category. Instead, annotate them with the surface subject as the direct dependent (and surface subject) of the lower verb.

* ECM verbs: ECM verbs will not have a missing category either, on analogy with raising verbs. The lower subject will be the dependent of the lower verb.

* Imperatives: put in an empty subject NP if one is not present.

* Relative clause complementizers: when no relativizer is present include one, unless the clause is a reduced relative.

* Conjunctions: In lists of conjoined phrases where there is only one conjunction but more than two conjuncts (e.g. Tom, Dick, and Harry), a comma separating two conjuncts in lieu of a conjunction can be analyzed as the missing conjunction.

Note that ergative constructions ("the window opened") should not have their missing subject NP included as an empty NP.


Raising Verbs

Raising verbs will not have a missing category. Instead, annotate them with the surface subject as the direct dependent of the lower verb. In other words, in a raising construction, it is really the lower verb that is imposing the selectional restrictions on the subject of the whole clause.

Picture

Verbs (and adjectives) that will be regarded as raising predicates here include seem, appear, need, tend, start, turn out, be supposed to, be going to (gonna), have to, continue, be certain, be likely.

Tests for raising (vs. control structures) include using expletive there (as in (1)), expletive-it (as in (2)), weather-it (as in (3)), and a non-thematic subject from a sentential idiom as the subject of the verb in question (as in (4)). Raising structures occur with all of these types of subjects (with an exception discussed below). Control structures do not occur with these.

Raising

  1. There is likely to be a problem when he's around.
  2. It is likely that Jerry will lose the race.
  3. It is likely to rain on Tuesday.
  4. The cat is likely to be out of the bag.
Control
  1. ? There tried to be a problem when he was around
  2. ? It tried that Jerry will lose the race.
  3. ? It tried to rain on Tuesday.
  4. ? The cat tried to be out of the bag.
Note that these types of subjects must be allowed by the lower verb if they are to be acceptable in a raising structure. In other words, if a verb doesn't take an expletive-there subject normally, it won't work in a raising structure either:
  1. ? There eats an apple.
  2. ? There seems to eat an apple.
In addition, these tests are not entirely decisive. Some raising verbs cannot occur with an expletive-it and a finite sentential complement:
  1. ? It starts/continues/tends that John is a problem.
Finally, don't get confused by these tests. Raising verbs can occur with ordinary NPs as subjects too.
  1. Kim tends/seems/is likely to nominate Sandy.
  2. Those apples tend/seem/are likely to decay rather quickly.

Control Structures

Control structures should have an empty node included as the subject of their lower verb.

Subject control structures are easy to confuse with raising structures because they appear similar in some contexts.

  1. John seems to neglect his duties.
  2. John tried to neglect his duties.
However, subject control structures can not appear with the same types of subjects that raising structures allow:

Raising

  1. There seems to be a problem when he's around.
  2. It seems that Jerry will lose the race.
  3. It seems to rain on Tuesday.
  4. The cat seems to be out of the bag.
Control
  1. ? There tried to be a problem when he was around
  2. ? It tried that Jerry will lose the race.
  3. ? It tried to rain on Tuesday.
  4. ? The cat tried to be out of the bag.
Some common subject control verbs/adjectives are try, hope, want (wanna), be keen, be eager, desire, expect, decide, be silly, be lucky.
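The difference between the two analyses can be sketched as arcs of the form (dependent, head). The verb sets below contain only the single-word verbs from the lists in this manual. The function subject_arcs is hypothetical, and attaching the lower verb to the higher verb is an assumption of this sketch, since the manual only states where the subject attaches. The empty node is written with the lex value e.

```python
RAISING = {"seem", "appear", "need", "tend", "start", "continue"}
SUBJECT_CONTROL = {"try", "hope", "want", "desire", "expect", "decide"}


def subject_arcs(higher, subject, lower):
    """Return (dependent, head) arcs for 'subject higher to lower ...'."""
    if higher in RAISING:
        # No empty node: the overt subject depends on the lower verb.
        return [(lower, higher), (subject, lower)]
    if higher in SUBJECT_CONTROL:
        # The overt subject stays with the higher verb; an empty node
        # is added as the subject of the lower verb.
        return [(subject, higher), (lower, higher), ("e", lower)]
    raise ValueError("verb not in either list: " + higher)


raising = subject_arcs("seem", "John", "neglect")  # John seems to neglect ...
control = subject_arcs("try", "John", "neglect")   # John tried to neglect ...
```

Verbs such as need, want, and expect appear in more than one list in this manual, so a lookup like this can only be a starting point. The tests with expletives and idioms decide the actual analysis.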

Object control verbs include: tell, tempt, force, persuade, appeal to. As with subject control verbs, object control constructions cannot be used with expletives or non-thematic subjects of sentential idioms. Here too an empty node must be included as the dependent of the lower verb. Just as subject control verbs can be confused with raising verbs, object control verbs can be confused with ECM verbs. Using an expletive object is generally a good test to distinguish between the two, as shown here with the control verb decide and the ECM verb believe.

  1. ? I decided there to be a problem.
  2. ? I decided the shoe to be on the other foot.
  3. I believed there to be a problem.
  4. I believed the shoe to be on the other foot.
Note that although want is a subject control verb, when it appears with a second NP, it is an ECM verb. In addition, it can appear with an infinitival for-complement. An empty node should only be included in its subject control version. The case with for should be analyzed as an ECM construction, differing only in the fact that for appears as a complementizer dependent of the embedded verb.
  1. I want to leave.
  2. * There wants to be a solution.
  3. I want him to leave.
  4. I want there to be a solution.
  5. I want for him to win the race.

Exceptional Case Marking Verbs

In an exceptional case marking (ECM, also known as AcI "Akkusativ cum Infinitiv") construction, the NP that appears to be a direct object will only be the subject of the lower verb. That is, it will have as its head not the ECM verb, but the lower verb.

Common ECM verbs include expect, assume, believe, forbid, know, let, need.

As with raising verbs, the best tests are to use expletive there and non-thematic subject idioms.

  1. I believe there to be a problem.
  2. I believe the shoe to be on the other foot.
  3. I need there to be a solution.
  4. I need the cat to be out of the bag.
  5. He let there be light.
ECM constructions may be confused with object control. See Control for a discussion of this matter.

Exceptional case marking constructions with for as in (1-2) below should be analyzed as a subordinate clause with for as a complementizer dependent on the subordinate clause's main verb:

  1. For me to eat Crispy Critters would be unprecedented.
  2. I want for you to eat only Crispy Critters.
Some ECM verbs (need) subcategorize for either an NP and an infinitive or an NP and a past participle. In the case of the latter, the analysis will be the same as that of the small clause complement analysis. The past participle will be tagged as an adjective.
  1. John needs me to solve the problem.
  2. John needs the problem solved.

Non-finite clauses

Non-finite verb phrases headed by a present participle (gerundive) or an infinitive can appear with or without subjects. Past participles can only appear without subjects.

  1. Norma's complaining about everyone never fails to annoy me.
  2. Complaining about everyone never fails to annoy others.
  3. For Bunny to leave now would disrupt everything.
  4. To leave now would disrupt everything.
  5. Depressed by the results, Uli ceased to make an effort.
  6. Before leaving/While jogging/After eating, Max called Mike.
When they appear without subjects, an empty noun node should be included as a dependent of the verb. If a subject noun phrase is present and part of the VP, as in (1) and (3) above, an empty node should not be included. Instead, that head noun (and its dependents if any) should be a dependent of the non-finite verb.

In general, non-finite clauses will be dependents of main verbs. Exceptions are reduced relative clauses (such as (5) above), if they modify nouns. In cases that are not clear, the default choice of a head should be the verb.


Small clauses

Small clause complements will be analyzed with the predication as the head of the small clause and dependent on the head verb. The predication may be nominal, prepositional, or adjectival:

  1. The manager considers Ernie an asset to the company.
  2. The agent considers that issue outside the scope of our discussion.
  3. We consider the problem intractable.
The analysis of small clauses is identical to predicative copular constructions except the latter have the copula as an additional dependent on the head predication.

In the case of a past participle-headed predication, like the following, the participle should be tagged as an adjective.

  1. We consider the problem solved.
  2. We need the car repaired.

Expletive subjects and there-insertion

When the subject of a clause is an expletive it or there, the expletive will only be the surface subject, not the deep subject. This can be indicated through the features on the node. In addition, the head node in a there-construction will have an FRR value of "There."

  1. It surprised me that those peppers were so expensive.
  2. That those peppers were so expensive surprised me.
  3. There was a man in the garden.
  4. A man was in the garden.
  5. There arose shouts in the crowd.
  6. Shouts arose in the crowd.
There-insertion: In sentences like (3) above, the dependency tree will have to incorporate the analysis of predicative copular constructions. In this example the head of the sentence will be the head of the predicate in the garden.

Often, however, there-constructions with the copula will be missing a clear predication, as in the following example:

  1. There's three non stops.
In such a case, include an empty head for the sentence, with the noun phrase, the copula, and the there as its dependents. The NP will be the underlying subject as above, and the there will be the surface subject.

Deciding whether post-NP material is part of the noun phrase or in fact the main predication of the sentence can be tricky. The best test to use is simply to attempt to paraphrase the sentence as a copular sentence, then parse it on that basis. For example, (1) and (3) can be treated as having a main predication. The sentence in (5), however, seems better analyzed as missing a predication, according to this test.

  1. There are no flights to Newark at that time
  2. No flights to Newark are at that time
  3. There is a flight departing San Jose at ten thirty a. m.
  4. A flight is departing San Jose at ten thirty a. m.
  5. There are no lower fares for this particular trip
  6. ?? No lower fares are for this particular trip

Wh-questions

As with other full clauses, the head of a wh-question will be its main/lexical verb. The wh-word will be a dependent of the main verb like any other argument.

When the wh-word is part of a long-distance dependency, it will not be a dependent of the highest main verb, but of the embedded main verb heading the clause in which the wh-word originated. The linear order will allow a reconstruction of the wh-word's surface position. In cases of long-distance dependencies, there may be "crossing arcs". This is ok.


Imperatives

If an overt subject is not present, as in (1), include an empty noun; otherwise an imperative will have the same analysis as a declarative sentence.

  1. Leave me alone!
  2. You leave me alone!

Relative clauses

A relative clause will be the dependent of whatever it modifies, in most cases a noun. The arc is labeled Adj. As with other clauses, its main verb will be its own head. The relativizer will be a dependent of the main verb like any other argument (or adjunct, in cases such as the place where he saw the fish).

Wh-word relativizers and that should be analyzed the same (except of course for part-of-speech). Empty relative pronoun nodes should be inserted if and only if neither a wh-word nor a that-complementizer is present. The arc label should reflect the grammatical function of the relativized argument, independently of the type of complementizer (wh, that, or empty).

In long-distance dependencies, the relativizer will not be a dependent of the highest main verb, but of the embedded main verb heading the clause in which it originated. The linear order will allow a reconstruction of its surface position.

Reduced relative clauses (the flight chosen by you or the airline flying to Wausau) are analyzed like regular relative clauses without overt relative pronoun. They have only an empty subject inserted, but not an empty complementizer, nor an empty auxiliary.

Reduced relative clauses appear similar to non-finite past or present participial clauses and may be difficult to distinguish from these. However, they will always depend on a nominal rather than a verbal head. Although most reduced relative clauses are postnominal, it seems that they can be preposed as in (1) below. When sentence initial, it may be difficult to decide what they depend on. If it is clear that they modify a noun phrase (as in (1) below), choose the noun; otherwise choose the verb as their default head, have in (2) and (3), sang in (4). Note that world knowledge needs to be used when making these decisions.

  1. [Staying at the Palace Hotel], you can use the gym.
  2. [Returning on the eleventh], I have a couple flights, the first one departing Baltimore at twelve forty p.m.
  3. The lowest rate I have for a car [using your discount number] is going to be Avis.
  4. [Playing in the yard], the boy sang happily.
Two tests to use to decide whether the clause is modifying the verb or a noun:

Passive

The surface vs. deep subject of a passive construction can be indicated through the use of the features. The grammatical subject (usually the patient) will be indicated as the surface subject but underlying object. See the discussion of grammatical roles for a related treatment of ergative/unaccusative constructions.

The underlying subject (usually the agent), if expressed, will be a surface oblique argument, but the deep subject. If it is not expressed, an empty node should not be included.

Passive morphology (i.e. the auxiliary be or get) will be a dependent of the main verb.


VP ellipsis

VP-ellipsis should be annotated with an empty verbal head as the root node. Any auxiliaries and the subject will be dependents of this node. No missing arguments should be added.


Nouns and Proper Nouns 

Nominal modifiers

The head of a noun phrase is the head noun. Any determiner is a dependent. Adjectives are separate dependents from determiners. If there are multiple adjectives, the default structure will simply have each adjective as a direct dependent of the noun. This is the case for multiple determiners also.

Adverbial noun modifiers can be dependents of the determiner or the noun in the phrase they modify. For example, approximately, nearly, practically, almost, about, at most, only can depend on cardinals or some quantifiers; at least, only, just, even can depend on nouns (i.e. modify entire noun phrases). These classes have some overlap; the default head choice in cases of ambiguity should be the noun.

Compound nouns

Compound noun phrases, when clear, can have multiple noun phrases as dependents. For example, child safety seat will have seat as the head and child and safety as its direct dependents. A good test for this is to remove each noun in turn, to see if the phrase still retains part of its original sense. Because a child safety seat is a seat for children and a safety seat, this analysis is the one we want.

In contrast, a phrase like seven-day advance purchase should be annotated with purchase as the head, advance as its dependent, and seven-day as the dependent of advance (compare advance purchase vs. *seven-day purchase).

In cases where it's not clear whether or which nouns modify each other, the default compound structure will have all modifying nouns as direct dependents on the rightmost noun.

http://www.cis.upenn.edu/~creswell/dependency/compound.gif
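The default structure for unclear compounds can be sketched as follows. The function name default_compound and the (dependent, head) pair format are assumptions made for this illustration.

```python
def default_compound(nouns):
    """Attach every modifying noun directly to the rightmost noun.

    nouns is the compound in surface order; the result is a list of
    (dependent, head) pairs.
    """
    head = nouns[-1]
    return [(noun, head) for noun in nouns[:-1]]


arcs = default_compound(["child", "safety", "seat"])
```

For child safety seat the default happens to match the intended analysis. A phrase like seven-day advance purchase needs the nested structure described above instead, so the default applies only when the internal structure is unclear.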

Proper Nouns

Proper nouns should have the value PN for feature POS. They are treated largely like nouns, except that compound proper nouns are not analyzed syntactically as if they were common nouns, but rather given right-branching structures. (The intuition is that they are really fixed phrases.) So in British Airways, British is the head, has POS PN, and carries the other features of this proper noun (in American English, singular number). Airways is a dependent on British (with SRole Adj), and also has POS PN.

In London Heathrow airport, London Heathrow is interpreted as a compound proper noun as described above, and airport as a common noun, which has London Heathrow as its dependent (SRole Adj).

http://www.cis.upenn.edu/~creswell/dependency/pnouns.gif

Quantifier headed NPs

In a noun phrase consisting of only a quantifier, the quantifier should be the head of the NP. Any modifying phrases are directly dependent on it.

  1. All of the students registered for the class, but five/many/some/most wished they hadn't.
In a sentence with the copula as the main verb, a post-copular NP headed by a quantifier and modified by a relative clause (all I needed) should be treated as an equative construction.
  1. This vacation is all I've ever wished for.
  2. All I've ever wished for is this vacation.
  3. ? I consider this vacation all I've ever wished for.
  4. ?? This vacation seems all I've ever wished for.
  5. Coca-cola is all he drinks.
  6. All he drinks is Coca-cola
  7. ?? I consider Coca-cola all he drinks
  8. ?? Coca-cola seems all he drinks.

Adjectives and Adverbs (DONE)

GENERAL

Adverbs and adjectives are modifying concepts: adjectives modify nouns, and adverbs modify verbs. For example, in the phrase (الكتاب الكبير) "the big book", the adjective "big" (الكبير) modifies the concept "book" (الكتاب) by identifying the size of the book. In (سافر صباحا) "he-left-travelling in-morning", the adverb (صباحا) modifies the verb (سافر) by specifying the time at which the action was performed.

DEGREE

The degree of the modification can be specified by other modifiers, such as (جدا) "very", as in (كتاب كبير جدا) "a very big book":

In addition, there are two kinds of degree specification:  comparative and superlative forms. In the first, the degree of modification is specified by comparing the case in question to one other case: (هذا الكتاب اكبر من ذلك الكتاب) "This book is bigger than that book." In the second, the degree of modification is specified by comparing to all other cases: (هذا اكبر كتاب / هذا هو الكتاب الاكبر) "This is the biggest book".  The form of the adjective will be the same as its normal descriptive (not comparative nor superlative) form.

COPULAR ADJECTIVES

See the manual section on "copular constructions" for how to handle such sentences as (الكتاب كبير) "The book is big."


Prepositions and Particles (DONE)

Prepositions dominate their object NPs. For example,

As in the English IL0 Manual, "John gave a book to Mary" (اعطى يحيى الكتاب الى مريم) and "John gave Mary a book" (اعطى يحيى مريم الكتاب) look different at IL0 and IL1, but the same at IL2.


Copular constructions  (DONE)

Sentences whose main verb is a form of "to be" (in Arabic, a sentence with sisters of kAna كان واخواتها, sisters of Ain~a ان واخواتها, or a nominal sentence with topic/subject مبتدأ and complement/predicate خبر) will always have the predicate خبر as the sentence head, with the subject/topic مبتدأ as a child marked with a DSyntRole of Subj. The verbal element (كان / ان) will be a child of the head predicate خبر. For example,
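A minimal sketch of this structure follows, using arcs of the form (dependent, head, role). The function copular_arcs is hypothetical, and the role of the verbal element is left as None because this section does not name it. The sentence with كان is an illustrative example added here and is not taken from the corpus.

```python
def copular_arcs(subject, predicate, verbal_element=None):
    """Return (dependent, head, role) arcs for a copular sentence."""
    # The predicate heads the sentence; the subject/topic is its Subj child.
    arcs = [(subject, predicate, "Subj")]
    if verbal_element is not None:
        # The verbal element also hangs off the predicate; its role is
        # not specified in this section, so it is left as None.
        arcs.append((verbal_element, predicate, None))
    return arcs


# الكتاب كبير "The book is big."
nominal = copular_arcs("الكتاب", "كبير")
# كان الكتاب كبيرا "The book was big."
with_kana = copular_arcs("الكتاب", "كبيرا", verbal_element="كان")
```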


Conjunction 

Conjunction has its own part-of-speech (Conj). The conjunction (and, or, but, etc) is placed as a dependent of the first conjunct with role Mod, and the second conjunct is a dependent of the conjunction with role Obj.

If a comma acts as a conjunction, it is treated as such (given part-of-speech Conj and analyzed as in the above paragraph). However, note that in "chicken, ducks, and geese", the second (last) comma does not serve as a conjunction (since there is an explicit "and"), and it is removed at IL0. The first comma does serve as a conjunction.
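For a simple two-conjunct case, the attachment described above can be sketched as arcs of the form (dependent, head, role). The function conjoin is a hypothetical name. How the arcs chain when there are more than two conjuncts is not spelled out here, so the sketch covers one conjunction only.

```python
def conjoin(first, conjunction, second):
    """Return (dependent, head, role) arcs for 'first conjunction second'."""
    return [
        (conjunction, first, "Mod"),   # the conjunction depends on the first conjunct
        (second, conjunction, "Obj"),  # the second conjunct depends on the conjunction
    ]


and_arcs = conjoin("chicken", "and", "ducks")
comma_arcs = conjoin("chicken", ",", "ducks")  # a comma acting as a conjunction
```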


Empty Nodes 

An empty node is a node which does not correspond to a word (or other graphical manifestation such as a punctuation mark) in the input string.

In all cases, when you create an empty node, give it a wpos feature so that it ends up in a position that roughly corresponds to its grammatical function (i.e., if it is a subject, to the left of its governing verb, and so on). When the fs files come out of the parser, the nodes have wpos features in increments of 10, so there are enough unused positions to place new nodes where they belong. Never reuse an already used position.
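One way to pick a position is to look for an unused integer between the two neighbouring nodes, preferring the midpoint. The function free_wpos is a hypothetical helper and is not part of Tred or the parser output.

```python
def free_wpos(used, left, right):
    """Return an unused wpos strictly between left and right.

    used is the set of positions already taken in the tree. Positions
    closest to the midpoint are tried first.
    """
    mid = (left + right) // 2
    candidates = sorted(range(left + 1, right), key=lambda p: abs(p - mid))
    for position in candidates:
        if position not in used:  # never reuse an already used position
            return position
    raise ValueError("no unused position between %d and %d" % (left, right))


used = {10, 20, 30}
# Empty subject placed to the left of a governing verb at wpos 10.
subject_pos = free_wpos(used, 0, 10)
# Empty node placed between the words at wpos 10 and 20.
middle_pos = free_wpos(used, 10, 20)
```

After choosing a position, it should be added to the set of used positions before the next empty node is placed.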

There are (at least) three types of empty nodes.

Empty nominal nodes: big-PRO, and related cases

These are cases of empty nodes where the meaning can be derived from the syntactic context:

In these cases, we introduce an empty node and identify the node with which it is co-referential. We then copy the co-referential node's word and lexeme values to the empty node, but add brackets around the value: "<Dominic>".

Empty nominal nodes: little-pro, missing argument in passive, and related cases

In these cases, we label both the lexeme and the word feature of the new node "<pro>". In case of doubt ("<pro>" or "<Dominic>"), ask yourself: can I tell from syntax alone what this node means? If no, "<pro>". If yes, fill in the lexeme.
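The labeling decision can be sketched as a one-line rule. The function empty_nominal_label is a hypothetical name. Its argument is the lexeme of the co-referential node when syntax alone identifies one, and None otherwise.

```python
def empty_nominal_label(antecedent_lexeme=None):
    """Return the value for the lexeme and word features of an empty nominal node."""
    if antecedent_lexeme is None:
        # The meaning cannot be derived from syntax alone.
        return "<pro>"
    # Copy the co-referential node's value and add brackets around it.
    return "<" + antecedent_lexeme + ">"


known = empty_nominal_label("Dominic")
unknown = empty_nominal_label()
```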

Empty verbal nodes

This happens in cases of VP-ellipsis ("Mary has seen the chicken but Gigi has not"). VP-ellipsis should be annotated with an empty verbal head as the root node. The lexeme and word of the empty head should be filled in from the antecedent between brackets, e.g. "<play>" for "Mary plays with cats and so does Tony". Any auxiliaries (including the auxiliary which is overt in the string) are deleted, as usual in IL0. The subject and other arguments will be dependents of this node. No missing arguments should be added.


Punctuation

Remove all punctuation, except meaningful punctuation. Examples:

Do remove:

Creating IL0

Currently, researchers create IL0 files. Read the manual. Run text through parser (see link below). Correct fs file in Tred. Check carefully the resulting fs files. Email to Owen for blessing.

Checklist for Producing IL0 from Connexor Parses