BOLT Chinese-English Word Alignment and Tagging Training -- Discussion Forum

Linguistic Data Consortium

Authors: Xuansong Li, Katherine Peterson, Stephen Grimes, Stephanie Strassel

1. Introduction

The DARPA BOLT Program created new techniques for automated translation and
linguistic analysis that can be applied to informal genres of text and speech
common to online and in-person communications. LDC supports the BOLT Program
by collecting informal data sources, including discussion forums, text
messaging and chat, and conversational telephone speech, in English, Chinese
and Egyptian Arabic, and by applying annotations including translation, word
alignment, Treebanking, PropBanking, co-reference and queries/responses. Word
alignment data is contained in this release.

In machine translation, word alignment is a crucial intermediate stage
indicating corresponding word relations in parallel text. Word alignment data
for this release is built on Treebank annotation: tokens resulting from
treebank annotation (including empty categories/traces) are directly extracted
for word-level alignment.

The word alignment annotation in this release is linguistically oriented and
supported by linguistic theories, aiming to reach a variety of users in NLP as
well as in other research and education fields. For MT research, the data is
intended for all MT performers with varying MT models. The annotation data
format is designed to allow MT users to flexibly tailor or customize the
annotation for different uses. Please see sections 4.2 and 8 of this document
for how to correctly use the data for your model.

Chinese word alignment in this release is also supplemented with a layer of
tagging data. The alignment and tagging annotation are performed at two
levels: character level and Chinese treebank (CTB) token level. Character-level
alignments are manually annotated; CTB-token alignment and tagging are
automatically generated from the character-level alignments.

This release includes Chinese discussion forum word alignment and tagging
annotation data. This file contains documentation for the corpus BOLT
Chinese-English Word Alignment and Tagging Training -- Discussion Forum.

2. Source Data Profile

2.1 Data Source and Selection

Data used for word alignment are discussion forum posts harvested online at
LDC in 2012 for the BOLT project. Threads were collected based on the results
of manual data scouting by native speaker annotators. Scouts were instructed
to seek content that is in the Chinese language; original (written by the
post's author rather than quoted); interactive; and informal.

The harvested data was further selected and sentence-segmented at LDC for
translation by professional translation agencies. A manual selection procedure
was used to choose data appropriate for translation; selection criteria
included linguistic features and topic features. After selection, the selected
posts were segmented into sentence units (SUs), and the files were assigned to
professional translators for careful translation. Translators followed LDC's
BOLT Translation guidelines. After translations were completed, bilingual LDC
staff performed quality control by selecting a proportional sample from each
delivery and scrutinizing it for mistakes.

Chinese source tokens for character and CTB word alignment were automatically
extracted from word-segmented files provided by the BOLT Chinese Treebank
annotation team at Brandeis University. The word-segmented tokens are used
directly for automatic CTB alignment generation. They are also tokenized for
character alignment by inserting white space to separate characters.
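For illustration only, this character tokenization step can be approximated
with a short Python sketch like the one below. This is not the tool used at
LDC, and the treatment of non-Chinese tokens is an assumption.

    # Minimal sketch of character tokenization for alignment input
    # (assumption: non-Chinese tokens such as Latin words or numbers are
    # kept whole; the actual LDC procedure is not distributed here).

    def char_tokenize(segment):
        """Split a word-segmented line into whitespace-delimited character tokens."""
        out_tokens = []
        for token in segment.split():
            if all(ord(ch) < 128 for ch in token):
                out_tokens.append(token)        # keep ASCII tokens intact
            else:
                out_tokens.extend(list(token))  # one token per Chinese character
        return " ".join(out_tokens)

    # "完成 了" -> "完 成 了"
    print(char_tokenize("完成 了"))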
2.2 Data Profile

Language   Genre   Files   Words    CharTokens   CTBTokens   Segments
----------------------------------------------------------------------
Chinese    forum   570     448094   672140       442520      20819

Note 1: 1 token = 1 character; 1 word = 1.5 characters
Note 2: All token counts are based on the Chinese data
Note 3: CTB token counts are Chinese word-segmented words (exclusive of
        empty categories and traces)
Note 4: The CharToken count also excludes empty category markers

3. Annotation

3.1 WA Annotation Task

The Chinese word alignment (WA) tasks consist of the following components:

-Identifying, aligning, and tagging 8 different types of links
-Identifying, attaching, and tagging local-level unmatched words
-Identifying and tagging sentence/discourse-level unmatched words
-Identifying and tagging all instances of Chinese 的 (DE) except when they
 are part of a semantic link

3.2 WA Annotation Guidelines

The annotation guidelines used in developing this corpus are contained in the
docs/ directory of this release. They are updated versions of the guidelines
used for the GALE Word Alignment project; both the alignment and the tagging
guidelines have been updated to reflect the alignment of the BOLT data.

3.3 WA Annotation Process

The annotation process involves the following steps:

-Annotator training to familiarize the Chinese WA annotation team with the
 alignment and tagging guidelines.
-Annotation to produce first-pass alignment annotation on Chinese files.
-Second pass by senior annotators to review and correct the first-pass
 alignment and tagging annotation.
-Quality control by the lead annotator for annotation consistency across all
 files.
-Automatic and manual sanity checks to ensure file format consistency.

4. File Format Description

4.1 File Format

A given document name has five files associated with it:

-Chinese character tokenized file (section 4.1.1)
-Chinese CTB-tokenized file (section 4.1.2)
-English tokenized file (section 4.1.3)
-Character-based word alignment file (section 4.1.4)
-CTB-based word alignment file (section 4.1.5)

In each of the file types, all annotation for a given sentence segment
consistently appears on the n-th line of all associated files (for some fixed
"n"). Therefore all files associated with a given document have the same
number of lines.
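As an illustration of this line-by-line correspondence, the Python sketch
below reads the five files associated with a hypothetical document in
parallel. The paths are hypothetical; only the file suffixes and the
one-segment-per-line layout follow the format described in this section.

    # Sketch: read the five parallel files for one document. The paths are
    # hypothetical; in the release the files live under separate directories
    # (see section 5).

    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    cmn    = read_lines("example.cmn.tkn")   # character-tokenized source
    ctb    = read_lines("example.ctb.tkn")   # CTB-tokenized source
    eng    = read_lines("example.eng.tkn")   # tokenized English translation
    wa     = read_lines("example.wa")        # character-based alignments
    ctb_wa = read_lines("example.ctb.wa")    # CTB-based alignments

    # All five files for a document have the same number of lines.
    assert len(cmn) == len(ctb) == len(eng) == len(wa) == len(ctb_wa)

    for n, segment in enumerate(zip(cmn, ctb, eng, wa, ctb_wa), start=1):
        pass  # each 5-tuple describes sentence segment n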
4.1.1 Chinese character tokenized file (.cmn.tkn)

Each file contains one tokenized source segment per line. Each Chinese
character is treated as a separate token as the input to word alignment; no
higher-level grouping is considered at this stage. Whitespace is the token
delimiter, and the implicit numbering of tokens (1, 2, ...) is referenced by
the word alignment file.

4.1.2 Chinese CTB-based tokenized file (.ctb.tkn)

Each file contains one CTB-tokenized source segment per line. The tokens are
referenced by the .ctb.wa token numbers and were extracted from CTB tokens.

4.1.3 English tokenized file (.eng.tkn)

Each file contains one tokenized translation segment per line. The tokenizer
used was an LDC script intended to approximate ETB tokenization. Whitespace is
the token delimiter, and the implicit numbering of tokens (1, 2, ...) is
referenced by the word alignment file.

4.1.4 Character-based word alignment file (.wa)

Each line of the word alignment file contains a set of alignments for a given
sentence. The alignments are space-delimited and appear in no particular
order. A given alignment contains a comma-delimited list of Chinese tokens, a
hyphen, a comma-delimited list of English tokens, and an obligatory link type
in parentheses. Additionally, each token number may optionally be followed by
a tag in square brackets. Chinese token numbers always appear before the
hyphen and English token numbers always after it, no matter which language is
the source or the translation. The following examples should make this
representation clear (a parsing sketch follows the link code list in section
4.1.5):

Example 1. 13,14-16(SEM)
  Chinese tokens 13 and 14 are linked to English token 16. The link type is
  SEM. There are no tagged tokens.

Example 2. 22-25[OMN],26,27,28[POS](GIS)
  Chinese token 22 is linked to English tokens 25, 26, 27, and 28. English
  tokens 25 and 28 are tagged OMN and POS respectively. The link type is GIS.

Example 3. -3[CON](NTR)
  English token 3, which is tagged as CON, has no correspondent in the Chinese
  sentence. The link type is NTR (not translated). (The link type is NTR if
  and only if either the Chinese or the English token list of the alignment is
  empty.)

Occasionally automatic sentence segment realignment did not produce valid
Chinese-English pairs; for example, one segment in one language may be empty
while the other contains tokens. In this case annotators had the option of
"rejecting" a sentence for annotation. When this happens, the word "rejected"
(without quotes) appears in the .wa file for that line.

4.1.5 CTB-based word alignment file (*.ctb.wa)

CTB-based word alignment was derived from the character-based word alignment
using a post-processing script. The Chinese tokens in the CTB-based word
alignment are CTB words extracted from CTB parse trees. The CTB-based word
alignment file has the same structure as the character-based word alignment
file, with two exceptions. As CTB tokens are often composed of multiple
Chinese characters, we attempted to preserve the tags for each individual
character. Similarly, there may be multiple link types reported in case a
"super-link" was inferred from two individual links in the character-based WA
file. In the case of multiple word tags or multiple link types, a comma
separates each tag or link-type marker.

To facilitate data processing, MET (word tag) and MTA (link tag) are used for
meta data such as translation/transcription markups and treebank traces. MRK
(word tag) indicates that markup characters may be present in the token, but a
token tagged with MRK is still required to be aligned. A new tag "ALT" is
added to show alternate translations.

Valid word tag codes:

  DEM  DE-modifier marker
  DEC  DE-clause marker
  DEP  DE-possessive marker
  TEN  Tense/Passive marker
  OMN  Omni-function-preposition marker
  POS  Possessive marker
  TOI  To-infinitive marker
  SEN  Sentence marker
  MEA  Measure-word marker
  DET  Determiner/demonstrative marker
  CLA  Clause marker
  ANA  Anaphoric-reference marker
  LOC  Local context marker
  RHE  Rhetorical marker
  COO  Not Translated: Context obligatory marker
  CON  Not Translated: Context non-obligatory marker
  INC  Not Translated: Incorrect marker
  TYP  Typo marker
  MET  Meta word marker
  MRK  Markup present on the token

Link codes:

  SEM  Semantic link
  FUN  Function link
  PDE  DE-possessive link
  CDE  DE-clause link
  MDE  DE-modifier link
  GIF  Grammatically Inferred Function link
  GIS  Grammatically Inferred Semantic link
  COI  Contextually Inferred link
  TIN  (Translated) Incorrect link
  NTR  Not Translated link
  MTA  Link for Meta word
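The notation in sections 4.1.4 and 4.1.5 can be read mechanically. The Python
sketch below parses a character-based .wa line into token numbers, optional
tags and the link type. It is illustrative only (the function and variable
names are ours) and does not attempt to handle every edge case in the released
files, such as the comma-separated multiple tags and link types that can
appear in .ctb.wa lines.

    # Sketch of a parser for one character-based .wa line, e.g.
    #   "13,14-16(SEM) 22-25[OMN],26,27,28[POS](GIS) -3[CON](NTR)"

    import re

    TOKEN  = re.compile(r"(\d+)(?:\[([^\]]+)\])?")    # token number + optional [TAG]
    RECORD = re.compile(r"^(.*?)-(.*)\(([^()]+)\)$")  # chinese-english(LINKTYPE)

    def parse_wa_line(line):
        if line.strip() == "rejected":
            return None                               # rejected segment, no alignments
        alignments = []
        for record in line.split():
            chinese, english, link = RECORD.match(record).groups()
            alignments.append({
                "chinese": [(int(n), tag) for n, tag in TOKEN.findall(chinese)],
                "english": [(int(n), tag) for n, tag in TOKEN.findall(english)],
                "link": link,
            })
        return alignments

    print(parse_wa_line("13,14-16(SEM) -3[CON](NTR)"))
    # [{'chinese': [(13, ''), (14, '')], 'english': [(16, '')], 'link': 'SEM'},
    #  {'chinese': [], 'english': [(3, 'CON')], 'link': 'NTR'}]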
4.2 Using the Data

This section provides some strategies that could be helpful for using the data
for various tasks.

4.2.1 Character-based word alignment data

This part of the corpus consists of character-level alignments, and the data
provide syntactic information at the same time.

-Word alignment information can be extracted by noting terminal
 semantic/function alignments, as in the following examples:

   China <--> 中国 (terminal semantic link)
   at <--> 在 (terminal function link)

 The syntactic information is captured by composite links, as in the following
 example:

   has completed <--> 完成了 (a grammatically inferred link; both "has" and
   "了" are tagged as "tense marker")

 If terminal links are of primary interest when using the data, such links can
 readily be obtained by stripping or splitting all the tagged words inside
 composite links. Composite links therefore provide both alignment and
 syntactic information.

-Word tags can be used to infer syntactic information about phrases, as in the
 following example:

   the flowers <--> 花

 At the word-level alignment, the tag (determiner) is attached to the head
 word. Given the additional alignment

   fresh <--> 鲜

 we can infer a minimum phrasal unit in which "the fresh flowers" corresponds
 to "鲜花". At this inferred phrase-level alignment, the determiner is
 automatically attached to the phrase "fresh flowers", since it has already
 been attached to the head word at the word-level alignment.

-If the word tags for the source and the target language within an alignment
 are of the same type, the tagged words serve the same function but with
 different forms of expression. For example, "has", "已经" and "了" in the
 following two alignments are all tagged as "tense marker":

   has completed <--> 已经完成
   has completed <--> 完成了

 Through these alignments, the user can recognize the patterns of tense usage
 across the source and target languages; through word tagging, the user can
 find the distribution of words comprising such patterns.

-Within a complex alignment (composed of multiple words), if the tagged words
 are adjacent, they may be regarded as one single Chinese word, as in the
 following example:

   has completed <--> 已经完成

 Here "has", "已", and "经" are all tagged as "tense marker". As 已 and 经 are
 adjacent, they are regarded as a single Chinese word linked to "has".

4.2.2 CTB token-based word alignment data

CTB token-based word alignment data were derived by automatically
post-processing the character-level alignment results.

-In CTB WA, link types are preserved to indicate the different internal
 structures of a joint CTB WA. For example, if at the character-level WA
 "fresh" is aligned to 鲜 and "the flowers" is aligned to 花, with link types
 "SEM" and "GIS" respectively, then after CTB WA post-processing the CTB token
 鲜花 is aligned to "the fresh flowers" and the link type is kept as
 "SEM GIS". If instead "fresh" is aligned to 鲜 and "flowers" is aligned to 花
 and both link types are "SEM", then after CTB WA post-processing only one
 "SEM" is kept.

-In CTB WA, word tags are also preserved. The unmatched functional or
 local-contextually added words are attached to the CTB tokens. If users of
 this data release choose not to use such syntactic information or word
 dependency relations, the attached words can be automatically removed or
 moved out of the alignments based on word tag clues, since all unmatched and
 attached words are tagged (see the sketch below).
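As a sketch of the removal strategy just described (and of the stripping and
re-attaching mentioned in sections 4.2.1 and 8), the following function drops
every tagged token from the output of the parse_wa_line() sketch in section
4.1, leaving only the untagged terminal correspondences. It is illustrative
only; a real customization might instead re-attach the tagged tokens
elsewhere.

    # Sketch: remove tagged (unmatched/attached) tokens from parsed alignments,
    # using the output format of the parse_wa_line() sketch above. Links left
    # without tokens on either side (e.g. NTR links) are dropped entirely.

    def strip_tagged_tokens(alignments):
        stripped = []
        for a in alignments or []:
            chinese = [n for n, tag in a["chinese"] if not tag]
            english = [n for n, tag in a["english"] if not tag]
            if chinese and english:
                stripped.append({"chinese": chinese, "english": english,
                                 "link": a["link"]})
        return stripped

    aligns = parse_wa_line("22-25[OMN],26,27,28[POS](GIS) -3[CON](NTR)")
    print(strip_tagged_tokens(aligns))
    # [{'chinese': [22], 'english': [26, 27], 'link': 'GIS'}]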
5. Data Directory Structure

-data/source/tokenized/{character_tokenized,ctb_tokenized}/: tokenized
 (character-tokenized and CTB-word-tokenized) source data
-data/translation/tokenized/: tokenized translation data
-data/WA/{character_aligned,ctb_aligned}/: character-level and CTB word-level
 aligned and tagged data

6. Documentation

-docs/BOLT_Chinese_WA_tagging_guidelines_v1.0.pdf: instructions for annotators
 adding Chinese tags
-docs/BOLT_Chinese_alignment_guidelines_v1.0.pdf: instructions for annotators
 performing word alignment
-docs/filelist.txt: file list showing the release package structure
-docs/LREC2012_LDC_parallel_aligned_BOLT.pdf: reference paper for WA
-docs/LREC2012_wa_autoalign.pdf: reference paper for WA
-docs/LREC2010_Enriching_Word_Alignment_with_Linguistic_Tags.pdf: reference
 paper for WA

7. Data Validation and Sanity Checks

A set of data validation and sanity checks was performed by LDC's technical
staff, in particular:

--Bilingual annotators checked several files by hand to ensure that their word
  alignment annotations were faithfully recorded in the output format.
--It was verified that all directories contain the same number of files with
  the same filename stems.
--It was verified that all files associated with a given document contain the
  same number of sentence segments.
--It was verified that all tokens for a given sentence were annotated and that
  those annotations appear in the .wa file.
--It was verified that all token numbers referenced in the .wa file have a
  corresponding token in the .tkn file.

8. Use of Word Alignment for Your MT Model

In MT modelling, linguistic-rule-based alignment annotation may not always be
the MT-preferred approach. To satisfy MT preferences, users can modify the
annotation by detaching and re-attaching tagged morphemes to automatically
derive MT-style annotations. For instance, a model may not favor the
preposition "of" being attached and co-aligned to NP2 in the structure
(NP1 (P NP2)), as annotated in the alignment data for the example "island
(NP1) of Japan (NP2)". If the model favours "of" being attached to NP1 instead
of NP2, the annotation can be customized in two steps: 1) detach all instances
of the preposition "of" from NP2 via the word tag, and 2) re-attach "of" to
NP1. Such automatic customization of the annotation is possible because all
unaligned and attached words are tagged with appropriate word tags.

9. Acknowledgements

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does
not necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred.

10. Copyright Info

(c) 2012, 2013, 2014, 2015 Trustees of the University of Pennsylvania

11. Contact Information

If you have questions about this release, please contact the following
personnel at LDC:

  Xuansong Li
  Stephen Grimes
  Stephanie Strassel

--------------------------------------------------------------------------
README Created April 12, 2015 by Xuansong Li