BOLT Chinese-English Word Alignment and Tagging Training -- Discussion Forum

Linguistic Data Consortium

Authors: Xuansong Li, Katherine Peterson, Stephen Grimes, Stephanie Strassel

1. Introduction

The DARPA BOLT Program created new techniques for automated translation and
linguistic analysis that can be applied to informal genres of text and speech
common to online and in-person communications. LDC supports the BOLT Program
by collecting informal data sources, including discussion forums, text
messaging and chat, and conversational telephone speech, in English, Chinese
and Egyptian Arabic, and by applying annotations including translation, word
alignment, Treebanking, PropBanking, co-reference and queries/responses. Word
alignment data is contained in this release.

In machine translation, word alignment is a crucial intermediate stage
indicating corresponding word relations in parallel text. Word alignment data
for this release is built on Treebank annotation: tokens resulting from
treebank annotation (including empty categories/traces) are directly extracted
for word-level alignment.

The word alignment annotation in this release is linguistically oriented and
supported by linguistic theories, aiming to reach a variety of users in NLP as
well as in other research and education fields. For MT research, the data is
intended for all MT performers with varying MT models. The annotation data
format is designed to allow MT users to flexibly tailor or customize the
annotation for different uses. Please see sections 4.2 and 8 of this document
for how to correctly use the data for your model.

Chinese word alignment in this release is also supplemented with a layer of
tagging data. The alignment and tagging annotation are performed at two
levels: character level and Chinese treebank (CTB) token level. Character-level
alignments are manually annotated; CTB-token alignment and tagging are
automatically generated from the character-level alignments.

This release includes Chinese discussion forum word alignment and tagging
annotation data. This file contains documentation for the corpus BOLT
Chinese-English Word Alignment and Tagging Training -- Discussion Forum.

2. Source Data Profile

2.1 Data Source and Selection

Data used for word alignment are discussion forum posts harvested online at
LDC in 2012 for the BOLT project. Threads were collected based on the results
of manual data scouting by native speaker annotators. Scouts were instructed
to seek content that is in the Chinese language; original (written by the
post's author rather than quoted); interactive; and informal.

The harvested data was further selected and sentence-segmented at LDC for
translation by professional translation agencies. A manual selection procedure
was used to choose data appropriate for translation; selection criteria
included linguistic features and topic features. After selection, the selected
posts were segmented into sentence units (SUs), and the files were assigned to
professional translators for careful translation. Translators followed LDC's
BOLT Translation guidelines. After translations were completed, bilingual LDC
staff performed quality control by selecting a proportional sample from each
delivery and scrutinizing it for mistakes.

Chinese source tokens for character and CTB word alignment were automatically
extracted from word-segmented files provided by the BOLT Chinese Treebank
annotation team at Brandeis University. The word-segmented tokens are used
directly for automatic CTB alignment generation. They are also tokenized for
character alignment by inserting white space to separate characters.
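For illustration only, this character tokenization step can be approximated
with a short Python sketch like the one below. This is not the tool used at
LDC, and the treatment of non-Chinese tokens is an assumption.

    # Minimal sketch of character tokenization for alignment input
    # (assumption: non-Chinese tokens such as Latin words or numbers are
    # kept whole; the actual LDC procedure is not distributed here).

    def char_tokenize(segment):
        """Split a word-segmented line into whitespace-delimited character tokens."""
        out_tokens = []
        for token in segment.split():
            if all(ord(ch) < 128 for ch in token):
                out_tokens.append(token)        # keep ASCII tokens intact
            else:
                out_tokens.extend(list(token))  # one token per Chinese character
        return " ".join(out_tokens)

    # "完成 了" -> "完 成 了"
    print(char_tokenize("完成 了"))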
2.2 Data Profile

Language   Genre   Files   Words    CharTokens   CTBTokens   Segments
----------------------------------------------------------------------
Chinese    forum   570     448094   672140       442520      20819

Note 1: 1 token = 1 character; 1 word = 1.5 characters
Note 2: All token counts are based on the Chinese data
Note 3: CTB token counts are Chinese word-segmented words (exclusive of
        empty categories and traces)
Note 4: The CharToken count also excludes empty category markers

3. Annotation

3.1 WA Annotation Task

The Chinese word alignment (WA) tasks consist of the following components:

-Identifying, aligning, and tagging 8 different types of links
-Identifying, attaching, and tagging local-level unmatched words
-Identifying and tagging sentence/discourse-level unmatched words
-Identifying and tagging all instances of Chinese 的 (DE) except when they
 are part of a semantic link

3.2 WA Annotation Guidelines

The annotation guidelines used in developing this corpus are contained in the
docs/ directory of this release. They are updated versions of the guidelines
used for the GALE Word Alignment project; both the alignment and the tagging
guidelines have been updated to reflect the alignment of the BOLT data.

3.3 WA Annotation Process

The annotation process involves the following steps:

-Annotator training to familiarize the Chinese WA annotation team with the
 alignment and tagging guidelines.
-Annotation to produce first-pass alignment annotation on Chinese files.
-Second pass by senior annotators to review and correct the first-pass
 alignment and tagging annotation.
-Quality control by the lead annotator for annotation consistency across all
 files.
-Automatic and manual sanity checks to ensure file format consistency.

4. File Format Description

4.1 File Format

A given document name has five files associated with it:

-Chinese character tokenized file (section 4.1.1)
-Chinese CTB-tokenized file (section 4.1.2)
-English tokenized file (section 4.1.3)
-Character-based word alignment file (section 4.1.4)
-CTB-based word alignment file (section 4.1.5)

In each of the file types, all annotation for a given sentence segment
consistently appears on the n-th line of all associated files (for some fixed
"n"). Therefore all files associated with a given document have the same
number of lines.
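As an illustration of this line-by-line correspondence, the Python sketch
below reads the five files associated with a hypothetical document in
parallel. The paths are hypothetical; only the file suffixes and the
one-segment-per-line layout follow the format described in this section.

    # Sketch: read the five parallel files for one document. The paths are
    # hypothetical; in the release the files live under separate directories
    # (see section 5).

    def read_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]

    cmn    = read_lines("example.cmn.tkn")   # character-tokenized source
    ctb    = read_lines("example.ctb.tkn")   # CTB-tokenized source
    eng    = read_lines("example.eng.tkn")   # tokenized English translation
    wa     = read_lines("example.wa")        # character-based alignments
    ctb_wa = read_lines("example.ctb.wa")    # CTB-based alignments

    # All five files for a document have the same number of lines.
    assert len(cmn) == len(ctb) == len(eng) == len(wa) == len(ctb_wa)

    for n, segment in enumerate(zip(cmn, ctb, eng, wa, ctb_wa), start=1):
        pass  # each 5-tuple describes sentence segment n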
4.1.1 Chinese character tokenized file (.cmn.tkn)

Each file contains one tokenized source segment per line. Each Chinese
character is treated as a separate token as the input to word alignment; no
higher-level grouping is considered at this stage. Whitespace is the token
delimiter, and the implicit numbering of tokens (1, 2, ...) is referenced by
the word alignment file.

4.1.2 Chinese CTB-based tokenized file (.ctb.tkn)

Each file contains one CTB-tokenized source segment per line. The tokens are
referenced by the .ctb.wa token numbers and were extracted from CTB tokens.

4.1.3 English tokenized file (.eng.tkn)

Each file contains one tokenized translation segment per line. The tokenizer
used was an LDC script intended to approximate ETB tokenization. Whitespace is
the token delimiter, and the implicit numbering of tokens (1, 2, ...) is
referenced by the word alignment file.

4.1.4 Character-based word alignment file (.wa)

Each line of the word alignment file contains a set of alignments for a given
sentence. The alignments are space-delimited and appear in no particular
order. A given alignment contains a comma-delimited list of Chinese tokens, a
hyphen, a comma-delimited list of English tokens, and an obligatory link type
in parentheses. Additionally, each token number may optionally be followed by
a tag in square brackets. Chinese token numbers always appear before the
hyphen and English token numbers always after it, no matter which language is
the source or the translation. The following examples should make this
representation clear (a parsing sketch follows the link code list in section
4.1.5):

Example 1. 13,14-16(SEM)
  Chinese tokens 13 and 14 are linked to English token 16. The link type is
  SEM. There are no tagged tokens.

Example 2. 22-25[OMN],26,27,28[POS](GIS)
  Chinese token 22 is linked to English tokens 25, 26, 27, and 28. English
  tokens 25 and 28 are tagged OMN and POS respectively. The link type is GIS.

Example 3. -3[CON](NTR)
  English token 3, which is tagged as CON, has no correspondent in the Chinese
  sentence. The link type is NTR (not translated). (The link type is NTR if
  and only if either the Chinese or the English token list of the alignment is
  empty.)

Occasionally automatic sentence segment realignment did not produce valid
Chinese-English pairs; for example, one segment in one language may be empty
while the other contains tokens. In this case annotators had the option of
"rejecting" a sentence for annotation. When this happens, the word "rejected"
(without quotes) appears in the .wa file for that line.

4.1.5 CTB-based word alignment file (*.ctb.wa)

CTB-based word alignment was derived from the character-based word alignment
using a post-processing script. The Chinese tokens in the CTB-based word
alignment are CTB words extracted from CTB parse trees. The CTB-based word
alignment file has the same structure as the character-based word alignment
file, with two exceptions. As CTB tokens are often composed of multiple
Chinese characters, we attempted to preserve the tags for each individual
character. Similarly, there may be multiple link types reported in case a
"super-link" was inferred from two individual links in the character-based WA
file. In the case of multiple word tags or multiple link types, a comma
separates each tag or link-type marker.

To facilitate data processing, MET (word tag) and MTA (link tag) are used for
meta data such as translation/transcription markups and treebank traces. MRK
(word tag) indicates that markup characters may be present in the token, but a
token tagged with MRK is still required to be aligned. A new tag "ALT" is
added to show alternate translations.

Valid word tag codes:

  DEM  DE-modifier marker
  DEC  DE-clause marker
  DEP  DE-possessive marker
  TEN  Tense/Passive marker
  OMN  Omni-function-preposition marker
  POS  Possessive marker
  TOI  To-infinitive marker
  SEN  Sentence marker
  MEA  Measure-word marker
  DET  Determiner/demonstrative marker
  CLA  Clause marker
  ANA  Anaphoric-reference marker
  LOC  Local context marker
  RHE  Rhetorical marker
  COO  Not Translated: Context obligatory marker
  CON  Not Translated: Context non-obligatory marker
  INC  Not Translated: Incorrect marker
  TYP  Typo marker
  MET  Meta word marker
  MRK  Markup present on the token

Link codes:

  SEM  Semantic link
  FUN  Function link
  PDE  DE-possessive link
  CDE  DE-clause link
  MDE  DE-modifier link
  GIF  Grammatically Inferred Function link
  GIS  Grammatically Inferred Semantic link
  COI  Contextually Inferred link
  TIN  (Translated) Incorrect link
  NTR  Not Translated link
  MTA  Link for Meta word
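The notation in sections 4.1.4 and 4.1.5 can be read mechanically. The Python
sketch below parses a character-based .wa line into token numbers, optional
tags and the link type. It is illustrative only (the function and variable
names are ours) and does not attempt to handle every edge case in the released
files, such as the comma-separated multiple tags and link types that can
appear in .ctb.wa lines.

    # Sketch of a parser for one character-based .wa line, e.g.
    #   "13,14-16(SEM) 22-25[OMN],26,27,28[POS](GIS) -3[CON](NTR)"

    import re

    TOKEN  = re.compile(r"(\d+)(?:\[([^\]]+)\])?")    # token number + optional [TAG]
    RECORD = re.compile(r"^(.*?)-(.*)\(([^()]+)\)$")  # chinese-english(LINKTYPE)

    def parse_wa_line(line):
        if line.strip() == "rejected":
            return None                               # rejected segment, no alignments
        alignments = []
        for record in line.split():
            chinese, english, link = RECORD.match(record).groups()
            alignments.append({
                "chinese": [(int(n), tag) for n, tag in TOKEN.findall(chinese)],
                "english": [(int(n), tag) for n, tag in TOKEN.findall(english)],
                "link": link,
            })
        return alignments

    print(parse_wa_line("13,14-16(SEM) -3[CON](NTR)"))
    # [{'chinese': [(13, ''), (14, '')], 'english': [(16, '')], 'link': 'SEM'},
    #  {'chinese': [], 'english': [(3, 'CON')], 'link': 'NTR'}]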
4.2 Using the Data

This section provides some strategies that could be helpful for using the data
for various tasks.

4.2.1 Character-based word alignment data

This part of the corpus consists of character-level alignments, and the data
provide syntactic information at the same time.

-Word alignment information can be extracted by noting terminal
 semantic/function alignments, as in the following examples:

   China <--> 中国 (terminal semantic link)
   at <--> 在 (terminal function link)

 The syntactic information is captured by composite links, as in the following
 example:

   has completed <--> 完成了 (a grammatically inferred link; both "has" and
   "了" are tagged as "tense marker")

 If terminal links are of primary interest when using the data, such links can
 readily be obtained by stripping or splitting all the tagged words inside
 composite links. Composite links therefore provide both alignment and
 syntactic information.

-Word tags can be used to infer syntactic information about phrases, as in the
 following example:

   the flowers <--> 花

 At the word-level alignment, the tag (determiner) is attached to the head
 word. Given the additional alignment

   fresh <--> 鲜

 we can infer a minimum phrasal unit in which "the fresh flowers" corresponds
 to "鲜花". At this inferred phrase-level alignment, the determiner is
 automatically attached to the phrase "fresh flowers", since it has already
 been attached to the head word at the word-level alignment.

-If the word tags for the source and the target language within an alignment
 are of the same type, the tagged words serve the same function but with
 different forms of expression. For example, "has", "已经" and "了" in the
 following two alignments are all tagged as "tense marker":

   has completed <--> 已经完成
   has completed <--> 完成了

 Through these alignments, the user can recognize the patterns of tense usage
 across the source and target languages; through word tagging, the user can
 find the distribution of words comprising such patterns.

-Within a complex alignment (composed of multiple words), if the tagged words
 are adjacent, they may be regarded as one single Chinese word, as in the
 following example:

   has completed <--> 已经完成

 Here "has", "已", and "经" are all tagged as "tense marker". As 已 and 经 are
 adjacent, they are regarded as a single Chinese word linked to "has".

4.2.2 CTB token-based word alignment data

CTB token-based word alignment data were derived by automatically
post-processing the character-level alignment results.

-In CTB WA, link types are preserved to indicate the different internal
 structures of a joint CTB WA. For example, if at the character-level WA
 "fresh" is aligned to 鲜 and "the flowers" is aligned to 花, with link types
 "SEM" and "GIS" respectively, then after CTB WA post-processing the CTB token
 鲜花 is aligned to "the fresh flowers" and the link type is kept as
 "SEM GIS". If instead "fresh" is aligned to 鲜 and "flowers" is aligned to 花
 and both link types are "SEM", then after CTB WA post-processing only one
 "SEM" is kept.

-In CTB WA, word tags are also preserved. The unmatched functional or
 local-contextually added words are attached to the CTB tokens. If users of
 this data release choose not to use such syntactic information or word
 dependency relations, the attached words can be automatically removed or
 moved out of the alignments based on word tag clues, since all unmatched and
 attached words are tagged (see the sketch below).
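As a sketch of the removal strategy just described (and of the stripping and
re-attaching mentioned in sections 4.2.1 and 8), the following function drops
every tagged token from the output of the parse_wa_line() sketch in section
4.1, leaving only the untagged terminal correspondences. It is illustrative
only; a real customization might instead re-attach the tagged tokens
elsewhere.

    # Sketch: remove tagged (unmatched/attached) tokens from parsed alignments,
    # using the output format of the parse_wa_line() sketch above. Links left
    # without tokens on either side (e.g. NTR links) are dropped entirely.

    def strip_tagged_tokens(alignments):
        stripped = []
        for a in alignments or []:
            chinese = [n for n, tag in a["chinese"] if not tag]
            english = [n for n, tag in a["english"] if not tag]
            if chinese and english:
                stripped.append({"chinese": chinese, "english": english,
                                 "link": a["link"]})
        return stripped

    aligns = parse_wa_line("22-25[OMN],26,27,28[POS](GIS) -3[CON](NTR)")
    print(strip_tagged_tokens(aligns))
    # [{'chinese': [22], 'english': [26, 27], 'link': 'GIS'}]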
5. Data Directory Structure

-data/source/tokenized/{character_tokenized,ctb_tokenized}/: tokenized
 (character-tokenized and CTB-word-tokenized) source data
-data/translation/tokenized/: tokenized translation data
-data/WA/{character_aligned,ctb_aligned}/: character-level and CTB word-level
 aligned and tagged data

6. Documentation

-docs/BOLT_Chinese_WA_tagging_guidelines_v1.0.pdf: instructions for annotators
 adding Chinese tags
-docs/BOLT_Chinese_alignment_guidelines_v1.0.pdf: instructions for annotators
 performing word alignment
-docs/filelist.txt: file list showing the release package structure
-docs/LREC2012_LDC_parallel_aligned_BOLT.pdf: reference paper for WA
-docs/LREC2012_wa_autoalign.pdf: reference paper for WA
-docs/LREC2010_Enriching_Word_Alignment_with_Linguistic_Tags.pdf: reference
 paper for WA

7. Data Validation and Sanity Checks

A set of data validation and sanity checks was performed by LDC's technical
staff, in particular:

--Bilingual annotators checked several files by hand to ensure that their word
  alignment annotations were faithfully recorded in the output format.
--It was verified that all directories contain the same number of files with
  the same filename stems.
--It was verified that all files associated with a given document contain the
  same number of sentence segments.
--It was verified that all tokens for a given sentence were annotated and that
  those annotations appear in the .wa file.
--It was verified that all token numbers referenced in the .wa file have a
  corresponding token in the .tkn file.

8. Use of Word Alignment for Your MT Model

In MT modelling, linguistic-rule-based alignment annotation may not always be
the MT-preferred approach. To satisfy MT preferences, users can modify the
annotation by detaching and re-attaching tagged morphemes to automatically
derive MT-style annotations. For instance, a model may not favor the
preposition "of" being attached and co-aligned to NP2 in the structure
(NP1 (P NP2)), as annotated in the alignment data for the example "island
(NP1) of Japan (NP2)". If the model favours "of" being attached to NP1 instead
of NP2, the annotation can be customized in two steps: 1) detach all instances
of the preposition "of" from NP2 via the word tag, and 2) re-attach "of" to
NP1. Such automatic customization of the annotation is possible because all
unaligned and attached words are tagged with appropriate word tags.

9. Acknowledgements

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does
not necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred.

10. Copyright Info

(c) 2012, 2013, 2014, 2015 Trustees of the University of Pennsylvania

11. Contact Information

If you have questions about this release, please contact the following
personnel at LDC:

  Xuansong Li
  Stephen Grimes
  Stephanie Strassel

--------------------------------------------------------------------------
README Created April 12, 2015 by Xuansong Li