The Penn Treebank guidelines ("Part-of-Speech Tagging Guidelines For The Penn Treebank Project (3rd Revision)," Beatrice Santorini. University of Pennsylvania Computer and Information Science Department Technical Report MS-CIS-90-47, LINC LAB 178) were followed as closely as possible, with the following changes:

1. tokenization of hyphenated items ("New York-based" has been replaced by "New York - based", for example)
2. addition of the HYPH tag for hyphens in the above tokenization
3. addition of the AFX tag for rare cases of affixes that must be tokenized as above (mis- and poorly understood, e.g.)

Specific guidelines for hyphenated items, along with a number of additional bits of tagging policy, are below (with thanks to Justin Mott for collecting and maintaining these lists).


*** GUIDELINES FOR HYPHENATED ITEMS ***

The most fundamental problem we have is the question of things like: New York-based (or, appropriately for us, Hong Kong-based). Since York-based (or Kong-based) clearly does not form a single lexical (or semantic) unit, whereas New York (or Hong Kong) does, it is counterintuitive and misrepresentative of the actual structure to leave such items as a single token. In addition, many other uses of the hyphen could stand to have their annotation revisited and improved.

Obviously, one possible solution would be to separate every occurrence of a hyphenated item. This, however, would overgenerate tokens to an enormous extent, because of the variety of uses of the hyphen in standard written English and, more importantly, because not all hyphenated items form lexical/semantic/morphological units.

The goal, then, is to determine a policy for the annotation of hyphenated items that reasonably reflects morphological constituency, suits other levels of annotation, and will be intuitive to annotators.
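To make the overgeneration concern concrete, here is a minimal Python sketch of the rejected "separate every hyphen" approach. The function name split_all_hyphens is hypothetical and is not part of any Treebank tooling; it only illustrates why a blanket split is too coarse.

```python
import re


def split_all_hyphens(text):
    """Naively turn every hyphen into its own token (the rejected approach)."""
    tokens = []
    for chunk in text.split():
        # A capturing group makes re.split keep the hyphens in the result;
        # empty strings between adjacent hyphens are filtered out.
        tokens.extend(part for part in re.split(r"(-)", chunk) if part)
    return tokens


# The split is what we want here: "Hong Kong" is recoverable as a unit.
print(split_all_hyphens("Hong Kong-based companies"))
# ['Hong', 'Kong', '-', 'based', 'companies']

# Here it overgenerates: "wishy-washy" and "vis-a-vis" are single words
# that the guidelines keep as single tokens.
print(split_all_hyphens("a wishy-washy stance vis-a-vis reform"))
# ['a', 'wishy', '-', 'washy', 'stance', 'vis', '-', 'a', '-', 'vis', 'reform']
```

The second call produces eleven tokens where the guidelines call for five, which is why the policy is stated in terms of the parts of speech being joined and not as a blanket split.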
It should be noted that many other uses of the 'hyphen' either fall under established policy, or can easily be accounted for in a way that is harmonious with pre-existing policy. The two most common of such uses are simple punctuation (in which case it is technically an em- or en-dash) and 'symbolic' usage (as in 40-50%). In cases where a 'hyphen' is acting as punctuation, it should be treated as whichever form of punctuation it is functioning as. In practice it is (nearly?) always a comma or a colon. Symbolic usage (keeping in mind that the main determining factor is whether it can be read out as a word) is tagged as SYM.

The ITR/E biomedical project (http://www.ldc.upenn.edu/myl/ITR/, funded via award EIA-0205448 from the National Science Foundation's Information Technology Research (ITR) program) has two major innovations with regard to this question. First is the use of the POS tag HYPH to label hyphens. Second is the label AFX for subword morphological units. Let us steal them.

Guidelines:

1) Do not break anything into units smaller than a word.
    cross-strait/JJ relations
    pro-Beijing/JJ position
    conduct a meta-search/NN
    high-tech/JJ solution
    vis-a-vis/IN
    wishy-washy/JJ
If absolutely forced to, use the AFX tag for subword morphemes.
    pre/AFX -/HYPH and/CC post/AFX -/HYPH natal/JJ care
    Indo/AFX -/HYPH European/JJ and/CC -/HYPH Iranian/JJ linguistics
    was self/AFX -/HYPH designed/VBN by the company

2) Break collocations involving participles.
    Hong/NNP Kong/NNP -/HYPH based/VBN companies
    wrinkle/NN -/HYPH removing/VBG cream
This policy is suspended if:
a) breaking it would result in a subword unit
    self-governing/JJ island
    non-leg-related/JJ problems
    non-threatening/JJ stance
b) the putative participle does not correspond to an actual verb (or not in the usual sense of an existing verb)
    half-assed/JJ job
    is very hard-nosed/JJ
    life-sized/JJ portrait
c) it is a combination of participle-particle
    chopped-off/JJ finger
    grown-ups/NNS
d) it occurs as part of a proper noun
    "A Number of Regulations Concerning Establishing Foreign-Invested/NNP Enterprises"

3) In general, break combinations with simple verbs as the final element:
    must/MD -/HYPH see/VB television
    the bag/NN -/HYPH wrap/VB method
However, do not break when simple verbs are the first element:
    committed a break-in/NN
    a bad break-up/NN
    the know-how/NN

4) Break all collapsed phrases.
    a/DT -/HYPH man/NN 's/POS -/HYPH life/NN -/HYPH is/VBZ -/HYPH hard/JJ generation
    the editor/NN -/HYPH in/IN -/HYPH chief/NN

5) Break all noun-noun (including proper noun) combinations.
    have beer/NN -/HYPH guts/NNS
    Taiwan/NNP -/HYPH Palau/NNP trade
This does not apply to nouns which are tagged as NNP because they appear in a proper noun.
    the '98 East-West/NNP China Cooperation, Investment and Trade Negotiation Conference

6) Do not break adjective-adjective combinations.
    white-hot/JJ poker

7) Break noun-adjective combinations.
    visa/NN -/HYPH free/JJ entry
    Rumsfeld/NNP -/HYPH free/JJ administration
    secretary/NN -/HYPH general/JJ
    world/NN -/HYPH famous/JJ annotator

8) Do not break adjective-noun combinations.
    are all pretty-boys/NNS
    large-scale/JJ assault
    works full-time/NN
    left-wing/JJ leanings
    Western-style/JJ houses

9) Break any combination with a cardinal number.
    16/CD -/HYPH year/NN -/HYPH old/JJ gymnast
    20/CD -/HYPH odd/JJ years
Numbers which are spelled out are excepted.
    twenty-five/CD years
    three-hundred/CD and forty-five/CD

10) Do not break combinations of adjective and adverb (in either order).
    remain ever-alert/JJ
    first-ever/JJ

12) If a hyphenated item is to be treated as a single token, follow the guidelines in the old POS manual.


ADDENDA-

A1) Complex collocations.
In instances where hyphenated items are themselves attached via a hyphen, the decision on how to break them should follow the bracketing of the unit. For instance-
    (elementary-school)-age children
should be-
    elementary-school/NN -/HYPH age/NN children
Conversely-
    non-leg-related
should be-
    non-leg-related/JJ
since breaking the participle off would result in a non-unit.

A2) /t,d/ Deletion:
Items such as the following were originally combinations of participle and noun. As part of the more general phenomenon of deleting coda /t,d/ in modern English, they appear as
    life-size
    ice-cream
    skim-milk
(Note that the hyphenated forms are not always the prescriptively preferred ones, but given that there is so much variation we can easily imagine them, and more importantly, we should be ready to encounter them.)
The default in such instances should be to break them as noun-noun combinations. If, however, there is no corresponding noun (as in skim-milk), it should be unbroken.

A3) Participle vs. Adjective
In instances where it is ambiguous whether an item is an adjective or a participle, apply the diagnostics listed in the POS manual. Despite this, there will unavoidably be some inter-annotator (and possibly intra-annotator) variation. So it is possible to imagine the following competing annotations-
    clean/RB -/HYPH shaven/VBN men    or    clean-shaven/JJ men
    easy/RB -/HYPH going/VBN people    or    easy-going/JJ people
As a practical solution, I recommend annotators keep a list of frequently occurring collocations they encounter. These cases can then be discussed among annotators and a single annotation scheme adopted for them.

A4) Remaining doubts, inconsistencies, etc.
Cases such as 'iced-tea' and 'cutting-edge' are still a little worrisome for me. In theory, I think they should be broken up, since one of the main motivations for this revision is getting access to participles.


*** MISCELLANEOUS ADDITIONAL ITEMS ***

This list contains items that have been discussed in the course of POS-tagging the Xinhua corpus.

1. NN or NNS:
For items which are ambiguous between NN(P) and NN(P)S, the default is to rely on the surface forms.
    foreign affairs/NNS
    savings/NNS and loans/NNS
    communications/NNS
    sales/NNS income
    Foreign Affairs/NNPS Bureau
This does not apply to entities which take singular agreement.
    the Philippines/NNP
    the Virgin Islands/NNP
    Tianjin Customs/NNP has decided
A notable exception to this is the item 'data', which is only marked NNS when it triggers plural agreement. Also, 'news' is treated as singular.
    Xinhua News/NNP Agency

2. Hyphenated Items:
    twenty-five/CD
    most/RBS -/HYPH favored/VBN -/HYPH nation/NN
    double-digit/JJ unemployment rate
    (was) self/AFX -/HYPH designed/VBN by the company
    the post-industrialization/JJ mid-development/JJ stage
    a large, high-speed/JJ V-shaped/JJ rise
    large-scaled/JJ post-processing/JJ polished rice
    start-up/JJ projects
    vice-Premier/NNP
    to be multi-share/JJ holding companies
    self-run/JJ business import and export rights
    help laid-off/JJ workers be re-employed
    exclusively -/HYPH owned
    world/NN -/HYPH class/NN
    re-employment/NN projects/NNS
Do not break hyphenated items with a participle when they occur as part of a proper noun:
    "A Number of Regulations Concerning Establishing Foreign-Invested/NNP Enterprises"
    the '98 East-West/NNP China Cooperation, Investment and Trade Negotiation Conference

3. NN or NNP
    RMB/NN
    GDP/NN
    Chapter/NN 4 of the US Commercial Code
    reached 1 billion Finland Marks/NNS
    more than 120,000 State/NN owned and collective enterprises

4. JJ or VBN
    refined/JJ rice (vs.
    unrefined/JJ rice)
    a blown/VBN glass factory
    under the planned/JJ economy system
    specialized/JJ accessory factories
    bonded/VBN area
    the mixed/JJ economy
    mixed/JJ ownership economy
    developed/JJ countries
    advanced/JJ technology

5. Sundry Items:
    per/IN capita/FW
    per/IN annum/FW
    yuan/NNS
    east/NN Asia
    East/NNP Asia
    eastern/JJ Asia
    chemical/NN and mechanical/JJ industries
    ranking first/RB
    world/NN wide/RB
    October 30th/NN
    express/JJ mail
    --/, It will benefit...
    continue to maintain relatively large grow/VB
    real/JJ estate
    pursuant/JJ to law

6. New Items
    these ``three do not fears/NNS? ''
    grassroots/JJ organisation
    Post Code: 423700/CD
    in the lower/JJ picture (vs. upper)
    Holy/JJ Cow/NN
    It's not right, Old/NNP Ye/NNP ,
    employed/JJ people
    30/CD -/HYPH some/DT
    extroverted/JJ agriculture
    processed/VBN products
    cease/VB fire/NN
    Cretaceous/JJ period
    involved/JJ parties
    personnel/NNS of the/PDT both/DT sides
    after many years '/?? of construction
    warned the West not to air/NN strike
    United Nation's/NNS resolutions
    world renowned/JJ
    prime/JJ minister
    resulting in German/NNP 's/POS division into two
    crisis-ridden/JJ
    a yuhua/NN stone
    torch-shaped/JJ stripes
    liquified/VBN natural gas
    degrees centigrade/JJ
    anthraquinone/NN hydrogen peroxide solution
    China Academy Of/IN Sciences

-----------------------------
Ann Bies
bies@ldc.upenn.edu
Linguistic Data Consortium
October 4, 2004
-----------------------------