GALE Chinese-English Parallel Aligned Treebank Training

Authors: Xuansong Li, Stephen Grimes, Stephanie Strassel, Xiaoyi Ma,
         Nianwen Xue, Mitch Marcus, Ann Taylor

1. Introduction

This file contains documentation for the GALE Chinese-English Parallel
Aligned Treebank Training release. Data were sourced from Chinese
broadcast news, newswire, and web sources and translated to English.
Chinese and English Treebank annotations were performed independently,
and finally the parallel texts were word aligned to create this release.

2. Source Data Profile

2.1 Data Source and Selection

Newswire data were from 1996-1998 Xinhua. They are treebank-annotated
files selected from a subset of Chinese Treebank 6.0.

Broadcast conversation data were from 2005 China Central TV and 2005
Phoenix TV. They are treebank-annotated files selected from OntoNotes
3.0.

Web data were harvested online. The treebanked web files were selected
from OntoNotes 4.0.

2.2 Data Profile

Genre   Files    Words   CharTokens   CTBTokens   Segments
----------------------------------------------------------
bc         10    57571        86356       60270       3328
nw        172    64337        96505       57722       2092
wb         86    30925        46388       31240       1321
----------------------------------------------------------
Total     268   152833       229249      149232       6741

Word counts are estimated from character counts at 1 word = 1.5
characters. All token counts are based on Chinese.

2.3 Data Pre-processing and Complications

2.3.1 Complications in Preparing Data for WA Task

-Segment mismatch between source and translation raw files
-Segment mismatch between TB tree files and raw files
-Segment mismatch between translation TB files and source language TB
 files
-Inconsistent file format within a package and across packages
-Inconsistent annotation style across packages
-Incomplete common source of translation TB and source language TB

2.3.2 Handling Complications

-To solve the problem of segment mismatch, automatic re-alignment was
 performed on the CTB WB dataset.
-To handle inconsistent data and file formats (including inconsistent
 segment indexing) within a package and across packages, segments were
 renumbered, and files were converted into the desired release format.
-To handle the inconsistent annotation style across packages, such as
 the first sentence being annotated in one package but not in another,
 we removed the inconsistently annotated sentence segments.
-For CTB WB files contained in this release, raw files were not provided
 in the OntoNotes/CTB release. Raw files were provided by LDC, but
 sentence segments from these raw files did not match the tokenized
 files. The raw files were edited by hand so that sentence segments
 matched.

3. Annotation

3.1 WA Annotation Task

The Chinese word alignment (WA) task consists of the following
components:

-Identifying, aligning, and tagging 8 different types of links
-Identifying, attaching, and tagging local-level unmatched words
-Identifying and tagging sentence/discourse-level unmatched words
-Identifying and tagging all instances of Chinese 的(DE) except when
 they are a part of a semantic link

3.2 WA Annotation Guidelines

Annotation guidelines used in developing this corpus are contained in
the docs/ directory of this release. They were the same versions used
for the GALE Chinese Word Alignment Tagging Pilot Corpus release. They
are also available at the following URLs:

3.3 WA Annotation Process

The annotation process involves the following steps:

-Annotator training to familiarize the Chinese WA annotation team with
 the alignment and tagging guidelines.
-Annotation to produce first-pass alignment annotation on Chinese files.
-Second pass by senior annotators to review and correct the first-pass
 alignment and tagging annotation.
-Quality control by the lead annotator for annotation consistency of all
 files.
-Automatic and manual sanity checks to ensure file format consistency.

4. File Format Description

4.1 File Format

For each document there are nine associated files:

  Chinese raw source file               (section 4.1.1)
  English raw translation file          (section 4.1.2)
  Chinese character tokenized file      (section 4.1.3)
  Chinese CTB tokenized file            (section 4.1.4)
  English tokenized file                (section 4.1.5)
  Chinese treebank file                 (section 4.1.6)
  English treebank file                 (section 4.1.7)
  Character-based word alignment file   (section 4.1.8)
  CTB-based word alignment file         (section 4.1.9)

In each of the nine file types, all annotation for a given sentence
segment is on one line. Therefore all files associated with a given
document have the same number of lines. Below are details about each
file type.

4.1.1 Chinese raw source file

Each file contains one source segment per line and is UTF-8 encoded.
Chinese characters generally appear as blocks of consecutive characters
with no intervening whitespace.

4.1.2 English raw translation file

Each file contains one translation segment per line and is UTF-8
encoded.

4.1.3 Chinese character tokenized file

Each file contains one tokenized source segment per line. Chinese
characters are whitespace-delimited, i.e. each Chinese character is
considered to be a single token. Each string/token has an implicit
number, and this number is referenced by the .wa and .cmn.tree files.
These character tokenized files were obtained from CTB tokens (see
4.1.4).

4.1.4 Chinese CTB tokenized file

While the character tokenized file is used for character-based word
alignment, the CTB tokenized file enumerates the tokens used for
CTB-based word alignment. The tokens are taken directly from the leaves
of CTB trees. It should be stressed that all tree leaves, including
Empty Category markers or speaker IDs, are treated as tokens.

4.1.5 English tokenized file

Each file contains one tokenized translation segment per line. For
non-treebank data, the tokenizer used was an LDC script intended to
approximate English Treebank tokenization.
For treebank data, the tokens were taken directly from the leaves of
ECTB trees. All tree leaves, such as Empty Category markers, are
retained from the syntax tree. Whitespace is the token delimiter, and
the implicit numbering of tokens (1, 2, ...) is referenced by the word
alignment file and the tree file.

4.1.6 Chinese treebank file

There is generally one tree per line, but a line containing no trees is
a mismatched segment (corresponding to a blank line in the raw,
tokenized, and WA files). Multiple trees may be found on a line, in
which case they are separated by whitespace. The leaves of the tree
were replaced by token numbers, which are numbered 1, 2, and so on.

4.1.7 English treebank file

The English treebank file is structured the same as the Chinese
treebank file above (4.1.6).

4.1.8 Character-based word alignment file

Each line of the word alignment file contains a set of alignments for a
given sentence. The alignments are space-delimited and appear in no
particular order. A given alignment contains a comma-delimited list of
source tokens, a hyphen, a comma-delimited list of translation tokens,
and an obligatory link type in parentheses. Additionally, each token
number may optionally be followed by a tag in square brackets. The
following examples should make this representation clear:

Example 1. 13,14-16(SEM)

  Chinese tokens 13 and 14 are linked to English token 16. The link
  type is SEM. There are no tagged tokens.

Example 2. 22-25[OMN],26,27,28[POS](GIS)

  Chinese token 22 is linked to English tokens 25, 26, 27, and 28.
  English tokens 25 and 28 are tagged OMN and POS respectively. The
  link type is GIS.

Example 3. -3[CON](NTR)

  English token 3, which is tagged as CON, has no correspondent in the
  Chinese sentence. The link type is NTR (not translated).

Occasionally, automatic sentence segment realignment did not produce
valid Chinese-English pairs. For example, one segment in one language
may be empty while the other contains tokens.
In this case annotators had the option of "rejecting" a sentence for
annotation. When this happens, the word "rejected" (without quotes)
appears in the .wa file for that line.

Note: For treebank data, because the tokens originate from Chinese
treebank files, some tokens are Empty Category tokens. No word
alignment was performed on these empty tokens; they were not explicitly
linked to another token, nor were they explicitly marked as not
translated. They should be tagged as MET and appear as not-translated
with the NTR link type.

4.2 Using the Data

This section provides some strategies that could be helpful for using
the data for various tasks.

4.2.1 Character-based word alignment data

This part of the corpus consists of character-level alignments, and the
data provide syntactic information at the same time. Word alignment
information can be extracted by noting terminal semantic/function
alignments as in the following examples:

  China <--> 中国   (terminal semantic link)
  at    <--> 在     (terminal function link)

The syntactic information is captured by composite links as in the
following example:

  has completed <--> 完成了

  (this is a grammatically inferred link; both "has" and "了" are
  tagged as "tense marker")

If terminal links are of primary interest when using the data, such
links can be readily obtained by stripping or splitting all the tagged
words inside composite links. Therefore, composite links provide both
alignment and syntactic information.

Word tags can be used to infer syntactic information about phrases, as
in the following example:

  the flowers <--> 花

At the word-level alignment, the tag (determiner) is attached to the
head word. Now, given the following alignments,

  the flowers <--> 花
  fresh       <--> 鲜

we can infer a minimum phrasal unit where "the fresh flowers"
corresponds with "鲜花".
At this inferred phrase-level alignment, the determiner is
automatically attached to the phrase "fresh flowers" since it has
already been attached to the head word at the word-level alignment.

If the word tags for the source and the target language within an
alignment are of the same type, the tagged words serve the same
function, each with its own form of expression. For example, "has",
"已经" and "了" in the following two alignments are all tagged as
"tense marker":

  has completed <--> 已经完成
  has completed <--> 完成了

Through these alignments, the user can recognize the patterns of tense
usage across the source and target languages; through word tagging, the
user can find the distribution of words comprising such patterns.

Within a complex alignment (composed of multiple words), if the tagged
words are adjacent, as in the following example, they may be regarded
as one single Chinese word:

  has completed <--> 已经完成

  ("has", "已", and "经" are all tagged as "tense marker". As 已 and 经
  are adjacent, they are regarded as a single Chinese word linked to
  "has".)

4.2.2 CTB token-based word alignment data

CTB token-based word alignment data were obtained by automatically
post-processing the character-level alignment results.

In CTB WA, link types are preserved to indicate the different internal
structures of a joint CTB WA. For example, at the character-level WA,
"fresh" is aligned to 鲜 and "the flowers" is aligned to 花, and the
link types are "SEM" and "GIS" respectively. After CTB WA
post-processing, the CTB token 鲜花 is aligned to "the fresh flowers",
and we keep the link type as "SEM GIS". If "fresh" is aligned to 鲜 and
"flowers" is aligned to 花, and the link types are both "SEM", then
after CTB WA post-processing we keep only one "SEM".

In CTB WA, word tags are also preserved. The unmatched functional or
local-contextually added words are attached to the CTB tokens.
If users of this data release choose not to use such syntactic
information or word dependency relations, the attached words can be
automatically removed or moved out of alignments based on word tag
clues, since all unmatched and attached words are tagged.

5. Data Directory Structure

-data/parallel_word_aligned_treebank/{bc,nw,wb}/source/{raw,tokenized}/:
 raw (un-tokenized) and tokenized Chinese source data
-data/parallel_word_aligned_treebank/{bc,nw,wb}/translation/{raw,character_tokenized,ctb_tokenized}/:
 raw (un-tokenized), character tokenized, and Chinese treebank word
 segmented data
-data/parallel_word_aligned_treebank/{bc,nw,wb}/WA/{character_aligned,ctb_aligned}/:
 character-level and CTB word-level aligned and tagged data
-data/parallel_word_aligned_treebank/{bc,nw,wb}/tree/{CTB,ECTB}:
 tree files

6. Documentation

-docs/GALE_Chinese_WA_tagging_guidelines_v1.0.pdf:
 instructions for annotators regarding assigning tags
-docs/GALE_Chinese_alignment_guidelines_v4.0.pdf:
 instructions for annotators regarding word alignment
-docs/nw-filename-mapping.txt:
 two-column file listing the Chinese Treebank name, used for this
 release, in column one and the original name in column two
-docs/README.txt:
 this file

7. Data Validation

The following data consistency checks were performed:

-Bilingual annotators checked several files by hand to ensure that
 their word alignment annotations were faithfully recorded in the
 output format.
-It was verified that all files associated with a given document
 contain the same number of sentence segments.
-It was verified that all tokens for a given sentence were annotated
 and that those annotations appear in the .wa file.
-It was verified that all token numbers referenced in a given .wa file
 have correspondents in the .tkn files.
-It was verified that all token numbers referenced in the .tree files
 have correspondents in the .tkn files.

8. Copyright Information

Portions (c) 2005 China Central TV, Phoenix TV,
(c) 1996-1998 Xinhua News Agency,
(c) 2011 Trustees of the University of Pennsylvania

9. Acknowledgments

This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content
of this publication does not necessarily reflect the position or the
policy of the Government, and no official endorsement should be
inferred.

Special thanks to annotators: Yuanyuan Xu, Xiaoming Zhang, Haomin
Zhang, Jie Ma and Rongzhi Chen.

10. Contact Information

If you have questions about this release, please contact the following
personnel at the LDC.

  Project manager     Xuansong Li
  Technical lead      Stephen Grimes
  Project Consultant  Stephanie Strassel

--------------------------------------------------------------------------
README Updated Feb 2, 2011 Stephen Grimes
README Updated June 6, 2011 Xuansong Li
README Updated June 8, 2011 Stephen Grimes
README Updated June 29, 2011 Xuansong Li
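
As an illustration of the character-based alignment format described in
section 4.1.8, the following minimal Python sketch parses one line of a
.wa file. It is not part of the release. The names Alignment,
parse_alignment, and parse_wa_line are hypothetical, and the sketch
assumes that items are whitespace-delimited and that tags and link types
contain no hyphens or parentheses. A link type that itself contains a
space, such as the "SEM GIS" case mentioned in section 4.2.2, is not
handled.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# (token number, optional tag such as "OMN"); the tag is None when absent.
Token = Tuple[int, Optional[str]]


class Alignment(NamedTuple):
    source: List[Token]       # Chinese side (left of the hyphen)
    translation: List[Token]  # English side (right of the hyphen)
    link_type: str            # e.g. "SEM", "GIS", "NTR"


# Assumption: tags and link types contain no hyphens or parentheses.
_ITEM = re.compile(r"([^-()]*)-([^-()]*)\(([^()]+)\)")
_TOKEN = re.compile(r"(\d+)(?:\[([^\]]+)\])?")


def _parse_side(text: str) -> List[Token]:
    """Parse a comma-delimited token list; an empty side yields []."""
    if not text:
        return []
    tokens: List[Token] = []
    for part in text.split(","):
        match = _TOKEN.fullmatch(part)
        if match is None:
            raise ValueError(f"unrecognized token: {part!r}")
        tokens.append((int(match.group(1)), match.group(2)))
    return tokens


def parse_alignment(item: str) -> Alignment:
    """Parse one alignment such as '22-25[OMN],26,27,28[POS](GIS)'."""
    match = _ITEM.fullmatch(item)
    if match is None:
        raise ValueError(f"unrecognized alignment: {item!r}")
    src, tgt, link = match.groups()
    return Alignment(_parse_side(src), _parse_side(tgt), link)


def parse_wa_line(line: str) -> Optional[List[Alignment]]:
    """Return the alignments on one line, or None for a rejected segment."""
    stripped = line.strip()
    if stripped == "rejected":
        return None
    # A blank line (mismatched segment) yields an empty list.
    return [parse_alignment(item) for item in stripped.split()]


if __name__ == "__main__":
    # Illustrative line built from the three examples in section 4.1.8.
    demo = "13,14-16(SEM) 22-25[OMN],26,27,28[POS](GIS) -3[CON](NTR)"
    for alignment in parse_wa_line(demo):
        print(alignment)
```

Returning None for "rejected" lines and an empty list for blank lines
keeps the two cases distinguishable, so line numbers stay in step with
the other per-document files.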