GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2

Authors: Xuansong Li, Stephen Grimes, Stephanie M Strassel
Linguistic Data Consortium


1. Introduction

This file contains documentation for the GALE Chinese-English Word Alignment
and Tagging -- Broadcast Training Part 2. The corpus includes character-level
word aligned and tagged broadcast conversation data.


2. Source Data Profile

2.1 Data Source

The file names indicate the source from which the data were first harvested.
The data were chiefly harvested from TV or radio broadcast sources. Given the
file name "CCTV1_LEGALREPORT_CMN_20070409", "CCTV1" means "China Central TV,
first channel", "LEGALREPORT" is the name of the program ("Legal Report"),
"CMN" stands for the Chinese language, and "20070409" indicates the date
04/09/2007.

2.2 Data Profile

   Language   Genre   Files   Words   CharTokens   Segments
   ---------------------------------------------------------
   Chinese    bc      9       43379   65069        2419

   Note: 1 token = 1 character; 1 word averages roughly 1.5 characters.
   Note: All token counts are based on the Chinese data.


3. Annotation

3.1 WA Annotation Task

The Chinese word alignment (WA) task consists of the following components:

- Identifying, aligning, and tagging 8 different types of links
- Identifying, attaching, and tagging local-level unmatched words
- Identifying and tagging sentence/discourse-level unmatched words
- Identifying and tagging all instances of Chinese 的 (DE) except when they
  are part of a semantic link

3.2 WA Annotation Guidelines

The annotation guidelines used in developing this corpus are contained in the
docs/ directory of this release. They are the same versions used for the GALE
Chinese Word Alignment Tagging Pilot Corpus release.

3.3 WA Annotation Process

The annotation process involved the following steps:

- Annotator training to familiarize the Chinese WA annotation team with the
  alignment and tagging guidelines
- Annotation to produce first-pass alignment annotation on the Chinese files
- A second pass by senior annotators to review and correct the first-pass
  alignment and tagging annotation
- Quality control by the lead annotator for annotation consistency across all
  files
- Automatic and manual sanity checks to ensure file format consistency


4. File Format Description

4.1 File Format

- Chinese raw source file (section 4.1.1)
- English raw translation file (section 4.1.2)
- Chinese character tokenized file (section 4.1.3)
- English tokenized file (section 4.1.4)
- Character-based word alignment file (section 4.1.5)

In each of these file types, all annotation for a given sentence segment
consistently appears on the n-th line of all associated files (for some fixed
"n"). Therefore all files associated with a given document have the same
number of lines.

4.1.1 Chinese raw source (or translation - see Note) file (.cmn.raw)

Each file contains one source segment per line. Chinese characters generally
appear as blocks of consecutive characters with no intervening whitespace.

Note: The treebank bc data is English->Chinese translation. Therefore the
.cmn.raw files are the translation raw files.

4.1.2 English translation (or source - see Note) raw file (.eng.raw)

Each file contains one translation segment per line.

Note: The treebank bc data is English->Chinese translation. Therefore the
.eng.raw files are the source raw files.

4.1.3 Chinese character tokenized file (.cmn.tkn)

Each file contains one tokenized source segment per line. Each Chinese
character is treated as a separate token in the input to word alignment -- no
higher-level grouping is considered at this stage. Whitespace is the token
delimiter, and the implicit numbering of tokens (1, 2, ...) is referenced by
the word alignment file.
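As a minimal illustration of this implicit numbering (a sketch only, not part
of the release; the function name and file path below are hypothetical), the
tokens of a .tkn line can be indexed from 1 in Python so that they match the
references in the corresponding .wa line:

    def number_tokens(tkn_line):
        # Split a tokenized line on whitespace and number the tokens from 1,
        # matching the implicit numbering referenced by the .wa files.
        return {i: tok for i, tok in enumerate(tkn_line.split(), start=1)}

    # Illustrative usage: look up token 3 of the first segment of a document.
    # with open("CCTV1_LEGALREPORT_CMN_20070409.cmn.tkn", encoding="utf-8") as f:
    #     segments = f.read().splitlines()
    # print(number_tokens(segments[0])[3])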
4.1.4 English tokenized file (.eng.tkn)

Each file contains one tokenized translation segment per line. The tokenizer
used was an LDC script intended to approximate ETB tokenization. Whitespace is
the token delimiter, and the implicit numbering of tokens (1, 2, ...) is
referenced by the word alignment file.

4.1.5 Character-based word alignment file (.wa)

Each line of the word alignment file contains the set of alignments for a
given sentence. The alignments are space-delimited and appear in no particular
order. A given alignment contains a comma-delimited list of Chinese token
numbers, a hyphen, a comma-delimited list of English token numbers, and an
obligatory link type in parentheses. Additionally, each token number may
optionally be followed by a tag in square brackets. Chinese token numbers
always appear before the hyphen and English token numbers after it, no matter
which language is the source or translation.

The following examples should make this representation clear:

Example 1. 13,14-16(SEM)
  Chinese tokens 13 and 14 are linked to English token 16. The link type is
  SEM. There are no tagged tokens.

Example 2. 22-25[OMN],26,27,28[POS](GIS)
  Chinese token 22 is linked to English tokens 25, 26, 27, and 28. English
  tokens 25 and 28 are tagged OMN and POS respectively. The link type is GIS.

Example 3. -3[CON](NTR)
  English token 3, which is tagged CON, has no correspondent in the Chinese
  sentence. The link type is NTR (not translated). (The link type is NTR if
  and only if either the Chinese or the English token list of the alignment
  is empty.)

Occasionally automatic sentence segment realignment did not produce valid
Chinese-English pairs. For example, one segment in one language may be empty
while the other contains tokens. In such cases annotators had the option of
"rejecting" a sentence for annotation. When this happens, the word "rejected"
(without quotes) appears in the .wa file on that line.

To facilitate data processing, MET (word tag) and MTA (link tag) are used for
metadata such as translation/transcription markup and treebank traces.

Valid word tag codes:

  DEM   DE-modifier marker
  DEC   DE-clause marker
  DEP   DE-possessive marker
  TEN   Tense/Passive marker
  OMN   Omni-function-preposition marker
  POS   Possessive marker
  TOI   To-infinitive marker
  SEN   Sentence marker
  MEA   Measure-word marker
  DET   Determiner/demonstrative marker
  CLA   Clause marker
  ANA   Anaphoric-reference marker
  LOC   Local context marker
  RHE   Rhetorical marker
  COO   Not Translated: Context obligatory marker
  CON   Not Translated: Context non-obligatory marker
  INC   Not Translated: Incorrect marker
  TYP   Typo marker
  MET   Meta word marker

Link codes:

  SEM   Semantic link
  FUN   Function link
  PDE   DE-possessive link
  CDE   DE-clause link
  MDE   DE-modifier link
  GIF   Grammatically Inferred Function link
  GIS   Grammatically Inferred Semantic link
  COI   Contextually Inferred link
  TIN   (Translated) Incorrect link
  NTR   Not Translated link
  MTA   Link for meta word
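As an informal aid to processing this format, the following Python sketch
parses one .wa line into a list of alignment records. It is not part of the
release or an official LDC tool; the function names and the record layout are
choices made for this illustration only.

    import re

    # One token reference: a 1-based token number with an optional [TAG].
    TOKEN_RE = re.compile(r'(\d+)(?:\[([A-Z]+)\])?')
    # One alignment: <Chinese tokens> - <English tokens> ( LINKTYPE )
    ALIGN_RE = re.compile(r'(.*?)-(.*?)\(([A-Z]+)\)')

    def parse_token_list(text):
        """Parse a comma-delimited token list such as '25[OMN],26,27,28[POS]'.
        Returns (token_number, tag_or_None) pairs; the empty side of an NTR
        alignment yields an empty list."""
        pairs = []
        for item in filter(None, text.split(',')):
            m = TOKEN_RE.fullmatch(item)
            if m is None:
                raise ValueError("unexpected token reference: %r" % item)
            pairs.append((int(m.group(1)), m.group(2)))
        return pairs

    def parse_wa_line(line):
        """Parse one line of a .wa file into a list of alignment records,
        or return None if the segment was rejected by the annotator."""
        line = line.strip()
        if not line or line == 'rejected':
            return None
        records = []
        for chunk in line.split():
            m = ALIGN_RE.fullmatch(chunk)
            if m is None:
                raise ValueError("unexpected alignment: %r" % chunk)
            cmn, eng, link = m.groups()
            records.append({'cmn': parse_token_list(cmn),
                            'eng': parse_token_list(eng),
                            'link': link})
        return records

    # The three documented examples above:
    # parse_wa_line('13,14-16(SEM) 22-25[OMN],26,27,28[POS](GIS) -3[CON](NTR)')
    # -> [{'cmn': [(13, None), (14, None)], 'eng': [(16, None)], 'link': 'SEM'},
    #     {'cmn': [(22, None)],
    #      'eng': [(25, 'OMN'), (26, None), (27, None), (28, 'POS')],
    #      'link': 'GIS'},
    #     {'cmn': [], 'eng': [(3, 'CON')], 'link': 'NTR'}]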
4.2 Using the Data

This section provides some strategies that may be helpful when using the data
for various tasks. The corpus consists of character-level alignments, and the
data provide syntactic information at the same time.

- Word alignment information can be extracted by noting terminal
  semantic/function alignments, as in the following examples:

    China <--> 中国 (terminal semantic link)
    at <--> 在 (terminal function link)

  The syntactic information is captured by composite links, as in the
  following example:

    has completed <--> 完成了

  (This is a grammatically inferred link; both "has" and "了" are tagged as
  "tense marker".)

  If terminal links are of primary interest when using the data, such links
  can be readily obtained by stripping or splitting out all the tagged words
  inside composite links. Composite links therefore provide both alignment
  and syntactic information.

- Word tags can be used to infer syntactic information about phrases, as in
  the following example:

    the flowers <--> 花

  At the word-level alignment, the tag (determiner) is attached to the head
  word. Now, given the following alignments,

    the flowers <--> 花
    fresh <--> 鲜

  we can infer a minimum phrasal unit in which "the fresh flowers" corresponds
  to "鲜花". At this inferred phrase-level alignment, the determiner is
  automatically attached to the phrase "fresh flowers" since it has already
  been attached to the head word at the word-level alignment.

- If the word tags for both the source and the target language within an
  alignment are of the same type, the tagged words assume the same function
  but with different forms of expression. For example, "has", "已经" and "了"
  in the following two alignments are all tagged as "tense marker":

    has completed <--> 已经完成
    has completed <--> 完成了

  Through these alignments, the user can recognize patterns of tense usage
  across the source and target languages; through word tagging, the user can
  find the distribution of words comprising such patterns.

- Within a complex alignment (composed of multiple words), if the tagged
  words are adjacent, as in the following example, they may be regarded as a
  single Chinese word:

    has completed <--> 已经完成

  ("has", "已", and "经" are all tagged as "tense marker"; as 已 and 经 are
  adjacent, they are regarded as a single Chinese word linked to "has".)


5. Data Directory Structure

- data/parallel_word_aligned/bc/source/{raw,tokenized}/:
  raw (un-tokenized) and tokenized Chinese source data
- data/parallel_word_aligned/bc/translation/{raw,tokenized}/:
  raw (un-tokenized) and tokenized English translation data
- data/parallel_word_aligned/bc/WA/:
  character-level aligned and tagged data


6. Documentation

- docs/GALE_Chinese_WA_tagging_guidelines_v1.0.pdf: instructions for
  annotators adding Chinese tags
- docs/GALE_Chinese_alignment_guidelines_v4.0.pdf: instructions for
  annotators performing word alignment
- docs/files.sha1: checksums for the data files
- docs/README.txt: this file, containing general documentation
- docs/Enriching_Word_Alignment_with_Linguistic_Tags_full_paper.pdf: a
  published paper describing the tag-enriched word alignment approach


7. Data Validation

7.1 Data Consistency

The following data consistency checks were performed:

- Bilingual annotators checked several files by hand to ensure that their
  word alignment annotations were faithfully recorded in the output format.
- It was verified that all files associated with a given document contain the
  same number of sentence segments.
- It was verified that all tokens for a given sentence were annotated and
  that those annotations appear in the .wa file.
- It was verified that all token numbers referenced in the .wa file have a
  corresponding token in the .tkn file.

7.2 Sanity Checks

A set of independent sanity checks was performed by LDC's technical staff.
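Users who wish to re-run the segment-count and token-number checks above on
their own copy of the data could do so along the following lines. This is a
sketch only, not an official LDC validation script; the function name and the
lightweight number extraction (which relies on tags containing no digits) are
assumptions of this illustration.

    import re

    def check_document(wa_path, cmn_tkn_path, eng_tkn_path):
        """Check one document: the associated files must have the same number
        of segments, and every token number referenced in the .wa file must
        have a corresponding token in the matching .tkn line."""
        def read_lines(path):
            with open(path, encoding='utf-8') as f:
                return f.read().splitlines()

        wa_lines, cmn_lines, eng_lines = (read_lines(p) for p in
                                          (wa_path, cmn_tkn_path, eng_tkn_path))
        problems = []
        if not (len(wa_lines) == len(cmn_lines) == len(eng_lines)):
            problems.append('segment counts differ: %d/%d/%d'
                            % (len(wa_lines), len(cmn_lines), len(eng_lines)))
            return problems

        for i, (wa, cmn, eng) in enumerate(zip(wa_lines, cmn_lines, eng_lines),
                                           start=1):
            if wa.strip() in ('', 'rejected'):
                continue                 # rejected segments carry no alignments
            n_cmn, n_eng = len(cmn.split()), len(eng.split())
            for chunk in wa.split():
                src, _, rest = chunk.partition('-')
                tgt = rest[:rest.rindex('(')]   # drop the trailing (LINKTYPE)
                # Tags in [...] contain no digits, so every number is a token id.
                for t in (int(n) for n in re.findall(r'\d+', src)):
                    if not 1 <= t <= n_cmn:
                        problems.append('line %d: Chinese token %d out of range '
                                        'in %s' % (i, t, chunk))
                for t in (int(n) for n in re.findall(r'\d+', tgt)):
                    if not 1 <= t <= n_eng:
                        problems.append('line %d: English token %d out of range '
                                        'in %s' % (i, t, chunk))
        return problems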
8. Acknowledgments

This work was supported in part by the Defense Advanced Research Projects
Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this
publication does not necessarily reflect the position or the policy of the
Government, and no official endorsement should be inferred.

Special thanks to the annotators: Yuanyuan Xu, Xiaoming Zhang, Haomin Zhang,
Jie Ma, and Rongzhi Chen.


9. Contact Information

If you have questions about this release, please contact the following
personnel at the LDC:

  Project manager:    Xuansong Li
  Lead programmer:    Stephen Grimes
  Project consultant: Stephanie Strassel

--------------------------------------------------------------------------
README created June 8, 2011 by Stephen Grimes
README updated June 27, 2011 by Xuansong Li
README updated June 29, 2011 by Xuansong Li