GALE Chinese-English Parallel Aligned Treebank Training

    Authors: Xuansong Li, Stephen Grimes, Stephanie Strassel
             Xiaoyi Ma, Nianwen Xue, Mitch Marcus, Ann Taylor  

                 
1. Introduction

 This file contains documentation for the GALE Chinese-English Parallel Aligned 
 Treebank Training release. Data were sourced from Chinese broadcast news,
 newswire, and web sources and translated to English. Chinese and English
 Treebank annotations were performed independently, and finally the parallel
 texts were word aligned to create this release.

2. Source Data Profile

2.1 Data Source and Selection

 Newswire data were from 1996-1998 Xinhua. They are treebank-annotated
 files selected from a subset of Chinese Treebank 6.0.

 Broadcast conversation data were from 2005 China Central TV and 2005 
 Phoenix TV. They are treebank-annotated files selected from OntoNotes 3.0. 

 Web data were harvested online. The treebanked web files were selected from 
 OntoNotes 4.0.

2.2 Data Profile

 Genre  Files  Words  CharTokens  CTBTokens  Segments
 ----------------------------------------------------
  bc      10    57571     86356      60270      3328
  nw     172    64337     96505      57722      2092
  wb      86    30925     46388      31240      1321
 ----------------------------------------------------
 Total   268   152833    229249     149232      6741

 Word counts are estimates computed as 1 word = 1.5 characters. All token
 counts are based on the Chinese source.
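
 As a quick illustration of the 1.5 characters-per-word ratio, the sketch
 below recomputes the Words column from the CharTokens column of the table
 above. Rounding to the nearest integer is an assumption; the rounding rule
 is not stated in this document.

```python
# Character-token counts per genre, copied from the table in section 2.2.
char_tokens = {"bc": 86356, "nw": 96505, "wb": 46388}

# Assumption: Words = CharTokens / 1.5, rounded to the nearest integer.
words = {genre: round(count / 1.5) for genre, count in char_tokens.items()}
total_words = round(sum(char_tokens.values()) / 1.5)

print(words)        # per-genre word estimates
print(total_words)  # estimate for the whole release
```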

2.3 Data Pre-processing and Complications

2.3.1 Complications in Preparing Data for WA Task

 -Segment mismatch between source and translation raw files

 -Segment mismatch between TB tree files and raw files

 -Segment mismatch between translation TB files and source language TB
  files

 -Inconsistent file format within a package and across packages

 -Inconsistent annotation style across packages

 -Incomplete common source of translation TB and source language TB

2.3.2 Handling Complications

 -To solve the problem of segment mismatch, automatic re-alignment was
  performed on the CTB WB dataset.

 -To handle inconsistent data and file format (including inconsistent
  segment indexing) within a package and across packages, segments
  were renumbered, and files were converted into the desired release
  format.

 -To handle the inconsistent annotation style across packages, such as
  the first sentence annotated in one package but not another package,
  we removed the inconsistently annotated sentence segments.

 -For CTB WB files contained in this release, raw files weren't provided 
  in the OntoNotes/CTB release. Raw files were provided by LDC, but 
  sentence segments from these raw files did not match tokenized files. 
  The raw files were edited by hand for sentence segment match.

3. Annotation

3.1 WA Annotation Task

 The Chinese word alignment (WA) task consists of the following
 components:

 -Identifying, aligning, and tagging 8 different types of links

 -Identifying, attaching, and tagging local level unmatched words

 -Identifying and tagging sentence/discourse-level unmatched words

 -Identifying and tagging all instances of Chinese 的(DE) except when
  they are a part of a semantic link.

3.2 WA Annotation Guidelines

 Annotation guidelines used in developing this corpus are contained in
 the docs/ directory of this release. They were the same versions used
 for the GALE Chinese Word Alignment Tagging Pilot Corpus release. They
 are also available at the following URLs:

3.3 WA Annotation Process

 This annotation process involves the following steps:

 -Annotator training to familiarize Chinese WA annotation team with
  alignment and tagging guidelines.
 -Annotation to produce first pass alignment annotation on Chinese files.
 -Second pass by senior annotators to review and correct first pass
  alignment and tagging annotation.
 -Quality control by lead annotator for annotation consistency of all
  files.
 -Automatic and manual sanity checks to ensure file format consistency.

4. File Format Description

4.1 File Format

 For each document there are nine associated files:

 Chinese raw source file (section 4.1.1)
 English raw translation file (section 4.1.2)
 Chinese character tokenized file (section 4.1.3)
 Chinese CTB tokenized file (section 4.1.4)
 English tokenized file (section 4.1.5)
 Chinese treebank file (section 4.1.6)
 English treebank file (section 4.1.7)
 Character-based word alignment file (section 4.1.8)
 CTB-based word alignment file (section 4.1.9)

 In each of the nine file types, all annotation for a given sentence 
 segment is on one line.  Therefore all files associated with a
 given document have the same number of lines. Below are details about
 each file type.
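
 Because every file for a document has one line per sentence segment, the
 files can be read in parallel. The following is a minimal sketch, not part
 of the release tooling. The function name read_parallel_segments is
 hypothetical, and the caller supplies the file paths, since file naming is
 not described here.

```python
from pathlib import Path

def read_parallel_segments(paths):
    """Return one tuple per sentence segment, with one entry per input file."""
    columns = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        # split("\n") keeps blank lines, which mark mismatched segments.
        lines = text.split("\n")
        if lines and lines[-1] == "":
            lines.pop()  # drop the empty string after the final newline
        columns.append(lines)
    counts = {len(lines) for lines in columns}
    if len(counts) > 1:
        raise ValueError(f"line counts differ across files: {sorted(counts)}")
    return list(zip(*columns))
```

 Each returned tuple holds the same segment as it appears in each of the
 given files, in the order the paths were passed.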

4.1.1 Chinese raw source file

 Each file contains one source segment per line and is UTF-8 encoded. Chinese
 characters generally appear as blocks of consecutive characters with 
 no intervening whitespace.

4.1.2 English translation raw file

 Each file contains one translation segment per line and is UTF-8 encoded.

4.1.3 Chinese character tokenized file

 Each file contains one tokenized source segment per line. Chinese characters
 are whitespace-delimited, i.e. each Chinese character is considered to be a
 single token. Each token has an implicit number, and this number is
 referenced by the .wa and .cmn.tree files. These character tokenized files
 were obtained from CTB tokens (see 4.1.4).
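
 The implicit numbering can be made explicit with a few lines of code. This
 sketch assumes numbering starts at 1, as described for the English
 tokenized files (4.1.5) and the tree files (4.1.6). The function name
 number_tokens is hypothetical.

```python
def number_tokens(line):
    """Map 1-based token numbers to the whitespace-delimited tokens of a line."""
    return {i: token for i, token in enumerate(line.split(), start=1)}

tokens = number_tokens("中 国 完 成 了")
print(tokens[1], tokens[5])
```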

4.1.4 Chinese CTB tokenized file

 While the character tokenized file is used for character-based word
 alignment, the CTB tokenized file enumerates the tokens used for
 CTB-based word alignment. The tokens are taken directly from leaves
 of CTB trees. It should be stressed that all tree leaves, including 
 Empty Category markers or speaker IDs, are treated as tokens.

4.1.5 English tokenized file

 Each file contains one tokenized translation segment per line. For
 non-treebank data, the tokenizer used was an LDC script intended to
 approximate English Treebank tokenization.  For treebank data, the tokens
 were taken directly from the leaves of ECTB trees. All tree leaves, including
 Empty Category markers, are retained from the syntax tree. Whitespace
 is the token delimiter, and the implicit numbering of tokens (1,2, ...)
 is referenced by the word alignment file and the tree file.

4.1.6 Chinese treebank file

 There is generally one tree per line, but a line containing no trees
 is a mismatched segment (corresponding to a blank line in the raw,
 tokenized, and WA files). Multiple trees may be found on a line, in
 which case they are separated by whitespace. The leaves of each tree
 were replaced by token numbers (1, 2, ...).
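
 The sketch below shows one way to split a tree line into its trees and
 recover the leaf token numbers. It assumes Penn Treebank style bracketing
 in which each leaf number directly precedes a closing parenthesis, e.g.
 "(NR 1)". The sample line and the function names are illustrative only and
 are not taken from the release.

```python
import re

def split_trees(line):
    """Split one line into its top-level bracketed trees."""
    trees, depth, start = [], 0, None
    for i, ch in enumerate(line):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' in tree line")
            if depth == 0:
                trees.append(line[start:i + 1])
    if depth != 0:
        raise ValueError("unbalanced '(' in tree line")
    return trees

def leaf_numbers(tree):
    """Return the token numbers in leaf position, i.e. directly before ')'."""
    return [int(n) for n in re.findall(r"\s(\d+)\)", tree)]

# Hypothetical line holding two trees separated by whitespace.
line = "(IP (NP (NR 1)) (VP (VV 2) (AS 3))) (FRAG (NN 4))"
for tree in split_trees(line):
    print(leaf_numbers(tree))
```

 A blank line yields an empty list of trees, which matches the mismatched
 segment case described above.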

4.1.7 English treebank file

 The English treebank file is structured the same as the Chinese 
 treebank file above (4.1.6).

4.1.8 Character-based word alignment file

 Each line of the word alignment file contains a set of alignments for
 a given sentence. The alignments are space-delimited and appear in no
 particular order. A given alignment contains a comma-delimited list of
 source tokens, a hyphen, a comma-delimited list of translation tokens,
 and an obligatory link type in parentheses. Additionally, each token
 number may optionally be followed by a tag in square brackets. The
 following examples should make this representation clear:

 Example 1.  13,14-16(SEM)

 Chinese tokens 13 and 14 are linked to English token 16. The link type
 is SEM. There are no tagged tokens.

 Example 2.  22-25[OMN],26,27,28[POS](GIS)

 Chinese token 22 is linked to English tokens 25, 26, 27, and 28. 
 English tokens 25 and 28 are tagged OMN and POS respectively. The 
 link type is GIS.

 Example 3.  -3[CON](NTR)

 English token 3, which is tagged as CON, has no correspondent in the
 Chinese sentence. The link type is NTR (not translated).

 Occasionally, automatic sentence segment realignment did not produce
 valid Chinese-English pairs. For example, one segment in one language
 may be empty while the other contains tokens. In this case annotators
 had the option of "rejecting" a sentence for annotation. When this
 happened, the word "rejected" (without quotes) appears in the .wa file
 for that line.

 Note: For treebank data, because the tokens originate from Chinese
 treebank files, some tokens are Empty Category tokens.  No word
 alignment was performed on these empty tokens -- they were not
 explicitly linked to another token nor were they explicitly marked as
 not translated. They should be tagged as MET and appear as
 not-translated with the NTR link type.
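
 The alignment format described in this section can be parsed with a short
 script. The following is a minimal sketch, not the tooling used to build
 the release. It assumes that tags and link types contain no hyphens,
 parentheses, or whitespace, which holds for the three examples above. The
 function names are hypothetical.

```python
import re

# One alignment: SOURCE-TRANSLATION(LINK); either side may be empty.
ALIGNMENT = re.compile(r"^([^-()]*)-([^-()]*)\(([^()]*)\)$")
# One token reference: a number, optionally followed by [TAG].
TOKEN = re.compile(r"^(\d+)(?:\[([^\]]*)\])?$")

def parse_side(side):
    """Parse '25[OMN],26' into [(25, 'OMN'), (26, None)]."""
    tokens = []
    for item in filter(None, side.split(",")):
        match = TOKEN.match(item)
        if match is None:
            raise ValueError(f"bad token reference: {item!r}")
        tokens.append((int(match.group(1)), match.group(2)))
    return tokens

def parse_wa_line(line):
    """Return None for a rejected segment, else a list of
    (source_tokens, translation_tokens, link_type) tuples."""
    line = line.strip()
    if line == "rejected":
        return None
    links = []
    for field in line.split():
        match = ALIGNMENT.match(field)
        if match is None:
            raise ValueError(f"bad alignment: {field!r}")
        source, translation, link_type = match.groups()
        links.append((parse_side(source), parse_side(translation), link_type))
    return links

for link in parse_wa_line("13,14-16(SEM) 22-25[OMN],26,27,28[POS](GIS) -3[CON](NTR)"):
    print(link)
```

 Each token is returned as a (number, tag) pair, with None as the tag for
 untagged tokens. A blank line yields an empty list.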

4.2 Using the Data

 This section provides some strategies that could be helpful for using
 the data for various tasks.

4.2.1 Character-based word alignment data

 This part of the corpus consists of character-level alignments, and
 the data provide syntactic information at the same time. Word
 alignment information can be extracted by noting terminal
 semantic/function alignments as in the following examples:

  China <--> 中国 (terminal semantic link)
  at <--> 在 (terminal function link)

 The syntactic information is captured by composite links as in the
 following example:

  has completed <--> 完成了 (this is a grammatically inferred link, both
  "has" and "了" are tagged as "tense marker")

 If terminal links are of primary interest when using the data, such
 links can be readily obtained by stripping or splitting all the
 tagged words inside composite links.  Therefore, composite links
 provide both alignment and syntactic information.
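
 Stripping tagged words from a composite link might look like the sketch
 below. It assumes each link has already been parsed into lists of
 (token_number, tag) pairs, with None for untagged tokens. The tag name
 "TEN" and the token numbers are placeholders chosen for illustration.

```python
def strip_tagged(link):
    """Drop tagged tokens from both sides of a parsed link.

    `link` is (source, translation, link_type), where each side is a list of
    (token_number, tag) pairs and tag is None for untagged tokens.
    """
    source, translation, link_type = link

    def untagged(side):
        return [number for number, tag in side if tag is None]

    return untagged(source), untagged(translation), link_type

# Hypothetical parse of "has completed <--> 完成了": the characters 完, 成, 了
# are source tokens 5-7, and "has", "completed" are translation tokens 3-4.
composite = ([(5, None), (6, None), (7, "TEN")], [(3, "TEN"), (4, None)], "GIS")
print(strip_tagged(composite))
```

 What remains is the terminal correspondence between 完成 and "completed";
 the removed tagged tokens carried the syntactic information.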

 Word tags can be used to infer syntactic information about phrases,
 as in the following example:

  the flowers <--> 花 : At the word level alignment, the tag
  (determiner) is attached to the head word.

 Now, given the alignments "the flowers <--> 花" and "fresh <--> 鲜",
 we can infer a minimum phrasal unit where "the fresh flowers"
 corresponds with "鲜花".  At this inferred phrase-level alignment,
 the determiner is automatically attached to the phrase "fresh
 flowers" since it has already been attached to the head word at the
 word-level alignment.

 If the word tags for both source and the target language within an
 alignment are of the same type, they would assume the same function
 but with unique forms of expression.  For example, "has", "已经" and
 "了" in the following two alignments are all tagged as "tense
 marker".

  has completed <--> 已经完成
  has completed <--> 完成了

 Through these alignments, the user can recognize the patterns of
 tense usage across the source and target languages; through word
 tagging, the user can find the distribution of words comprising such
 patterns.

 Within a complex alignment (composed of multiple words), if the
 tagged words are adjacent, as in the following example, they may be
 regarded as one single Chinese word:

  has completed <--> 已经完成 ("has", "已", and "经" are all
  tagged as "tense marker").

 As 已 and 经 are adjacent, they are regarded as a single Chinese
 word linked to "has".
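
 Grouping adjacent tagged characters into words can be sketched as follows.
 This assumes "adjacent" means consecutive token numbers in the character
 tokenized file. The function name group_adjacent is hypothetical.

```python
def group_adjacent(numbers):
    """Group token numbers into runs of consecutive values."""
    runs = []
    for n in sorted(numbers):
        if runs and n == runs[-1][-1] + 1:
            runs[-1].append(n)
        else:
            runs.append([n])
    return runs

# Character tokens for 已经完成; tokens 1 and 2 carry the same tag.
char_tokens = "已 经 完 成".split()
tagged = [1, 2]
merged = ["".join(char_tokens[n - 1] for n in run) for run in group_adjacent(tagged)]
print(merged)
```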

4.2.2 CTB token-based word alignment data

 CTB token-based word alignment data were produced by automatically
 post-processing the character-level alignment results.

 In CTB WA, link types are preserved to indicate the different internal
 structures of a joint CTB WA. For example, at the character-level
 WA, "fresh" is aligned to 鲜 and "the flowers" is aligned to 花, and
 the link types are "SEM" and "GIS" respectively. After CTB WA
 post-processing, the CTB token 鲜花 is aligned to "the fresh
 flowers", and we keep the link type as "SEM GIS". If instead "fresh"
 is aligned to 鲜 and "flowers" is aligned to 花, and the link types
 are both "SEM", then after CTB WA post-processing we keep only one
 "SEM".
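
 The link type rule above can be illustrated with a small sketch. It covers
 only how link types are combined once character-level links have been
 grouped under one CTB token; the grouping step itself is not shown. Keeping
 types in first-seen order is an assumption based on the "SEM GIS" example,
 and the function name merge_link_types is hypothetical.

```python
def merge_link_types(link_types):
    """Join the distinct link types of merged links, keeping first-seen order."""
    return " ".join(dict.fromkeys(link_types))

print(merge_link_types(["SEM", "GIS"]))  # 鲜 (SEM) + 花 (GIS)
print(merge_link_types(["SEM", "SEM"]))  # 鲜 (SEM) + 花 (SEM)
```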
 
 In CTB WA, word tags are also preserved. The unmatched functional or
 local-contextually added words are attached to the CTB-tokens. If
 users of this data release choose not to use such syntactic
 information or word dependency relations, the attached words can be
 automatically removed or moved out of alignments based on word tag
 clues since all unmatched and attached words are tagged.

5. Data Directory Structure

 -data/parallel_word_aligned_treebank/{bc,nw,wb}/source/{raw,tokenized}/: 
  raw (un-tokenized) and tokenized Chinese source data

 -data/parallel_word_aligned_treebank/{bc,nw,wb}/translation/
  {raw,character_tokenized,ctb_tokenized}/: raw (un-tokenized),
  character tokenized, and Chinese treebank word segmented data

 -data/parallel_word_aligned_treebank/{bc,nw,wb}/WA/
  {character_aligned,ctb_aligned}/: character-level and CTB word-level
  aligned and tagged data

 -data/parallel_word_aligned_treebank/{bc,nw,wb}/tree/{CTB,ECTB}: tree files

6. Documentation

 -docs/GALE_Chinese_WA_tagging_guidelines_v1.0.pdf: instructions 
  for annotators regarding assigning tags

 -docs/GALE_Chinese_alignment_guidelines_v4.0.pdf: instructions for
  annotators regarding word alignment

 -docs/nw-filename-mapping.txt: two-column file listing the original name
  in column two and the Chinese Treebank name, used for this release, in 
  column one

 -docs/README.txt: this file

7. Data Validation

 The following data consistency checks were performed:

 -Bilingual annotators checked several files by hand to ensure that
  their word alignment annotations were faithfully recorded in the
  output format.
 -It was verified that all files associated with a given document
  contain the same number of sentence segments.
 -It was verified that all tokens for a given sentence were annotated and
  that those annotations appear in the .wa file.
 -It was verified that all token numbers referenced in a given .wa file have 
  correspondents in the .tkn files.
 -It was verified that all token numbers referenced in the .tree files 
  have correspondents in the .tkn files.
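
 A token-range check of this kind can be approximated as in the sketch
 below. It is not the validation script used for the release. It assumes
 the .wa format described in section 4.1.8 and 1-based token numbering, and
 the function names are hypothetical.

```python
import re

TAG = re.compile(r"\[[^\]]*\]")  # optional [TAG] after a token number

def referenced_numbers(wa_line):
    """Return (source_numbers, translation_numbers) referenced on one .wa line."""
    source, translation = set(), set()
    if wa_line.strip() == "rejected":
        return source, translation
    for field in wa_line.split():
        # Remove tags first so only numbers, commas, '-' and '(LINK)' remain.
        left, _, right = TAG.sub("", field).partition("-")
        right = right.split("(", 1)[0]  # drop the trailing "(LINK)" part
        source.update(int(n) for n in left.split(",") if n)
        translation.update(int(n) for n in right.split(",") if n)
    return source, translation

def out_of_range(numbers, token_line):
    """List referenced numbers that have no token on the tokenized line."""
    count = len(token_line.split())
    return sorted(n for n in numbers if not 1 <= n <= count)

source, translation = referenced_numbers("1,2-1(SEM) -3[CON](NTR)")
print(out_of_range(source, "中 国"))
print(out_of_range(translation, "China is"))
```

 An empty list means every referenced number has a corresponding token on
 that line; any listed number points past the end of the tokenized segment.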

8. Copyright Information

 Portions (c) 2005 China Central TV, Phoenix TV, (c) 1996-1998 Xinhua News 
 Agency, (c) 2011 Trustees of the University of Pennsylvania 

9. Acknowledgments

This work was supported in part by the Defense Advanced Research 
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of 
this publication does not necessarily reflect the position or the policy 
of the Government, and no official endorsement should be inferred.
Special thanks to annotators: Yuanyuan Xu, Xiaoming Zhang, Haomin Zhang, 
Jie Ma and Rongzhi Chen.  

10. Contact Information

 If you have questions about this release, please contact the following
 personnel at the LDC.

 Project manager     Xuansong Li         <xuansong@ldc.upenn.edu>
 Technical lead      Stephen Grimes      <sgrimes@ldc.upenn.edu>
 Project consultant  Stephanie Strassel  <strassel@ldc.upenn.edu>

--------------------------------------------------------------------------
README Updated Feb 2, 2011 Stephen Grimes
README Updated June 6, 2011 Xuansong Li
README Updated June 8, 2011 Stephen Grimes
README Updated June 29, 2011 Xuansong Li