UN Parallel Documents 1993-2007 Version 1.0

1. Introduction

This dataset contains the text of United Nations parliamentary documents
in Arabic, Chinese, English, French, Russian, and Spanish from 1993
through 2007. The data is provided in two formats.

1.1 Raw Text

The raw text is very close to what was extracted from the word
processing documents, converted to UTF-8 encoding.

1.2 Word-aligned Text

The word-aligned text has been normalized, tokenized, aligned at the
sentence level, further broken into sub-sentential "chunk-pairs", and
then aligned at the word level. The sentence, chunk, and word alignment
operations were performed separately for each individual language pair.

2. About the UN Parliamentary Documents

The United Nations parliamentary documents are available from the UN
Official Document System (UN ODS) at http://ods.un.org/. The ODS, in its
main "UNDOC" database, contains the full text of all types of UN
parliamentary documents. It has complete coverage going back to 1993,
and variable coverage before that.

Documents exist in one or more of the official languages of the UN:
Arabic, Chinese, English, French, Russian, and Spanish. (There are also
a large number of German documents, marked with the language "other",
but these are not included in this dataset.)

For more information, see the ODS documentation at
http://documents.un.org/help_E.htm. For more details of the UN
bibliographic systems, see http://www.un.org/depts/dhl/unbisref_manual/.

3. Some Terminology

Document: A publication independent of language and format, such as
"The Agenda for the 52nd General Assembly".

Document Symbol: A symbolic identifier assigned by the UN to a document.
For example, "A/52/100" denotes published text number 100 from the 52nd
session of the General Assembly.

Release date: The date when the document was entered into the ODS
database.

Document instance: Version of a document in a particular language and
format, e.g.
the English PDF version of "The Agenda for the 52nd General Assembly".

4. Mining the Data from the ODS

Since the years before 1993 yield almost no useful parallel documents,
the ODS was mined for parallel documents from 1993-2007. Searching by
release date, all documents that were released from January 1, 1993
through December 31, 2007 in more than one language were downloaded.

This dataset uses the UN document symbols to identify text files. In
some cases, the document had a compound document symbol, such as

  A/45/2 • A/45/2(SUPP)    (i.e. "A/45/2 & A/45/2(SUPP)")

In these cases, we simply use the first document identifier, "A/45/2"
in the example. Note that this results in some extreme cases, such as
the document "V & A/C.5/48/70" ending up with the short identifier "V".

UN document symbols were turned into directory/file names by mapping
the characters space, tab, newline, period, open/close parenthesis,
open/close square brackets, colon, and comma to underscore. For example:

  A/47/24/ADD.1(SUPP) --> A_47_24_ADD_1_SUPP

As the example shows, the slash is mapped to underscore as well, and no
trailing underscore is kept for the closing parenthesis.

5. Extracting the Text

The ODS provides documents in a variety of formats, including PDF, Word,
and various versions of WordPerfect. Working from the word processing
formats, the Word and WordPerfect documents were converted to HTML. In a
second step, the HTML documents were converted to text. The text was
then broken into sentences. The raw documents simply contain one
sentence per line.

The text extraction method retained some of the "visual formatting" that
was present in the original word processing documents. For example,
suppose an author used line breaks or empty lines for a title page like
so:

  United Nations
  Conference
  on Trade and
  Development

In such cases, each segment "United Nations", "Conference", "on Trade
and", and "Development" becomes a separate sentence, and forms a
separate line in the text file in this dataset.

6. Processing the Text

For the word-aligned data, the text was further preprocessed.
This included the following steps:

- UTF-8 normalization maps characters (and in some cases words) to more
  standard forms
- Chinese segmentation splits Chinese text into "words"
- Chunk splitting separates sentences into chunks at certain boundaries
  in the text

7. Performing Word Alignments

Given two parallel chunks in English and one foreign language (Arabic,
Chinese, French, Spanish, or Russian), word alignment was performed in
both directions, English-to-foreign and foreign-to-English. We trained
two models, one for English-to-foreign and one for foreign-to-English.
In each case, we trained the model using a recipe consisting of 3
Model-1 iterations and 2 HMM iterations. A posterior probability matrix
was computed for each sentence pair. Using this matrix, we obtained the
maximum a posteriori alignment for each English/foreign position. For
further details on word alignment, we refer the reader to
(Kumar et al.).

8. Format of Word-Aligned Files

Each word-aligned file consists of a series of records. Each record
consists of four lines containing the following:

a) one English chunk
b) one foreign chunk
c) English-to-foreign alignment information
d) foreign-to-English alignment information

The alignment information consists of a series of integers, one for
each word in the source language. Each integer is the zero-based index
of the word in the target language to which this source word has been
aligned. If the source token has not been aligned to any target word,
then the index is -1.

For example, the English-to-foreign alignment should be interpreted as
follows: for every token in the English sentence, the index indicates
the foreign token to which it has been aligned.

As an illustration, below is a possible alignment chunk for
"Mary did not slap the green witch" <-->
"Mary no daba una bofetada a la bruja verde".

  Mary did not slap the green witch
  Mary no daba una bofetada a la bruja verde
  0 -1 1 2 6 8 7
  0 2 3 -1 -1 -1 4 6 5

This represents the following word alignments:

English-to-Spanish:

  Mary->Mary
  did NOT ALIGNED
  not->no
  slap->daba
  the->la
  green->verde
  witch->bruja

Spanish-to-English:

  Mary->Mary
  no->not
  daba->slap
  una NOT ALIGNED
  bofetada NOT ALIGNED
  a NOT ALIGNED
  la->the
  bruja->witch
  verde->green

We only provide English-foreign alignment pairs. For example, we do not
provide Arabic-Spanish word-aligned data.

Furthermore, the chunks are contained in the file in random order. That
is, a word-aligned file does not follow the order of the corresponding
raw text file; it contains the chunks from the raw text file in random
order.

9. Directory Structure

There are two main subdirectories, "doc" and "data".

9.1 doc

The "doc" directory contains the following files:

README        - This file.

INDEX         - A file mapping UN documents to their release year and
                month. (The data files are arranged in a directory
                hierarchy based on release year and month - see below.)
                Example line from this file:

                  1993/01 A_46_2

                This indicates that the directory (for raw data) or
                file (for word-aligned data) for the UN document
                "A_46_2" is found in the 1993/01 (January 1993)
                subdirectory.

FILES_RAW     - A list of all files included in the "raw" portion of
                this dataset.

FILES_ALIGNED - A list of all files included in the "aligned" portion
                of this dataset.

9.2 data

The "data" directory contains two subdirectories, "raw" and "aligned".
The "raw" directory holds the raw document files, and the "aligned"
directory contains the word-aligned files.

9.2.1 Date-based directory hierarchy

Both raw and aligned data are stored in a parallel date-based directory
hierarchy based on the release date of the UN document. The location of
a UN document in this directory hierarchy can be looked up in the INDEX
file. This makes it easy to find corresponding raw and aligned files,
if desired.
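The INDEX lookup can be sketched in a few lines of Python. This is only
an illustration, not part of the dataset: the function names are
hypothetical, the path templates follow the layout given in sections
9.2.2 and 9.2.3, and the INDEX entry for "A_48_413" is inferred from the
example path in section 9.2.3 rather than copied from the real INDEX
file.

```python
def load_index(lines):
    """Build a {document name: "YYYY/MM"} mapping from INDEX lines."""
    index = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: blank or malformed lines are ignored
        date_dir, doc_name = fields
        index[doc_name] = date_dir
    return index


def raw_path(index, doc_name, lang):
    """Path of the raw text file for one document and language code."""
    return f"data/raw/{index[doc_name]}/{doc_name}/{doc_name}.{lang}"


def aligned_path(index, doc_name, pair):
    """Path of the word-aligned file for one document and language pair."""
    return f"data/aligned/{pair}/{index[doc_name]}/{doc_name}"


# Two INDEX-style lines; the second one is an assumed entry (see above).
index = load_index(["1993/01 A_46_2", "1993/01 A_48_413"])
print(raw_path(index, "A_46_2", "ar"))
# data/raw/1993/01/A_46_2/A_46_2.ar
print(aligned_path(index, "A_48_413", "en-ar"))
# data/aligned/en-ar/1993/01/A_48_413
```

In practice the lines would come from reading the file doc/INDEX; the
sketch takes a list of strings so that it does not depend on where the
dataset has been unpacked.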
Both "raw" and "aligned/en-" directories have subdirectories for the
years 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003,
2004, 2005, 2006, and 2007. Each year directory has subdirectories for
the months 01, 02, 03, 04, 05, 06, 07, 08, 09, 10, 11, and 12.

9.2.2 data/raw

The "data/raw" directory contains the raw document text files. In the
appropriate spot in the date-based directory hierarchy, there is one
directory for each UN document. This directory contains one file for
each language in which the document is available. Each file carries the
language code as an extension (.ar, .en, .es, .fr, .ru, and .zh).

As an example, the complete path to the Arabic text file for the
document "A_46_2", which was released in January of 1993, is as follows:

  data/raw/1993/01/A_46_2/A_46_2.ar

And the Spanish and French versions of the same document are contained
in the following files:

  data/raw/1993/01/A_46_2/A_46_2.es
  data/raw/1993/01/A_46_2/A_46_2.fr

9.2.3 data/aligned

Under "aligned", there are five subdirectories for the five language
pairs "en-ar", "en-zh", "en-fr", "en-es", and "en-ru". Below this level,
the aligned files are stored in the directory hierarchy that is arranged
by date. The word-aligned data for one language pair for one UN document
is stored in a single file.

Thus, the English-Arabic word-aligned text for the UN document A_48_413
can be found in the following file:

  data/aligned/en-ar/1993/01/A_48_413

And the English-Spanish, English-French, and English-Russian
word-aligned text for the same document can be found in the following
files:

  data/aligned/en-es/1993/01/A_48_413
  data/aligned/en-fr/1993/01/A_48_413
  data/aligned/en-ru/1993/01/A_48_413

9.3 Compression

In order to save space, the files are not present on the DVDs in the
format described in this section. Instead, logical groups of files are
combined into tar files, and compressed using the bzip2 compression
utility.
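The archives do not have to be unpacked before use. The following Python
sketch reads word-aligned records (the four-line format from section 8)
directly from a bzip2-compressed tar file using the standard "tarfile"
module. It is a minimal illustration under several assumptions that this
README does not confirm: the aligned files are UTF-8 encoded, records
follow each other with no separator lines, and every regular file in the
archive is a word-aligned file. All function names are hypothetical.

```python
import tarfile


def parse_record(lines):
    """Parse one four-line record into tokens and alignment indices."""
    english, foreign, e2f, f2e = (line.split() for line in lines)
    e2f = [int(i) for i in e2f]
    f2e = [int(i) for i in f2e]
    # One index per source token, as described in section 8.
    if len(e2f) != len(english) or len(f2e) != len(foreign):
        raise ValueError("alignment length does not match token count")
    return english, foreign, e2f, f2e


def iter_records(text):
    """Yield parsed records from the full text of one word-aligned file."""
    lines = text.split("\n")
    # Assumption: records are consecutive groups of four lines.
    for start in range(0, len(lines) - 3, 4):
        yield parse_record(lines[start:start + 4])


def iter_archive_records(archive_path):
    """Yield (member name, record) for each regular file in a .tar.bz2."""
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # Assumption: the aligned files are UTF-8 encoded.
            text = archive.extractfile(member).read().decode("utf-8")
            for record in iter_records(text):
                yield member.name, record


# The example record from section 8.
example = [
    "Mary did not slap the green witch",
    "Mary no daba una bofetada a la bruja verde",
    "0 -1 1 2 6 8 7",
    "0 2 3 -1 -1 -1 4 6 5",
]
english, foreign, e2f, f2e = parse_record(example)
pairs = [(e, foreign[j]) for e, j in zip(english, e2f) if j >= 0]
print(pairs)
# [('Mary', 'Mary'), ('not', 'no'), ('slap', 'daba'), ('the', 'la'),
#  ('green', 'verde'), ('witch', 'bruja')]
```

Calling iter_archive_records("en-es-1993.tar.bz2") would then walk
through one year of English-Spanish data. The member names it yields
depend on how the archives were built, which is not specified here.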
For the raw files, all files for each year have been combined into one
compressed tar file. File sizes range from about 135 to 348 MBytes.

  % cd data/raw
  % ls -l
  135367526 1993.tar.bz2
  219287568 1994.tar.bz2
  212082791 1995.tar.bz2
  225026541 1996.tar.bz2
  252403354 1997.tar.bz2
  290371527 1998.tar.bz2
  301889346 1999.tar.bz2
  339254596 2000.tar.bz2
  328620027 2001.tar.bz2
  326741654 2002.tar.bz2
  310739409 2003.tar.bz2
  323849446 2004.tar.bz2
  337267011 2005.tar.bz2
  348437909 2006.tar.bz2
  281384252 2007.tar.bz2

For the word-aligned files, the files for each year for each language
pair were combined into one compressed tar file.

  % cd data/aligned
  % ls -la
   30384751 en-ar-1993.tar.bz2
   80900798 en-ar-1994.tar.bz2
   81996148 en-ar-1995.tar.bz2
   80684075 en-ar-1996.tar.bz2
   83625663 en-ar-1997.tar.bz2
   95628265 en-ar-1998.tar.bz2
   99381318 en-ar-1999.tar.bz2
  117439799 en-ar-2000.tar.bz2
  113731125 en-ar-2001.tar.bz2
  108869821 en-ar-2002.tar.bz2
  109077286 en-ar-2003.tar.bz2
  113483413 en-ar-2004.tar.bz2
  120468766 en-ar-2005.tar.bz2
  126040703 en-ar-2006.tar.bz2
  100220936 en-ar-2007.tar.bz2
   60973035 en-es-1993.tar.bz2
  102078443 en-es-1994.tar.bz2
   88208057 en-es-1995.tar.bz2
   96733640 en-es-1996.tar.bz2
   99937632 en-es-1997.tar.bz2
  109371904 en-es-1998.tar.bz2
  112810073 en-es-1999.tar.bz2
  127890075 en-es-2000.tar.bz2
  118276833 en-es-2001.tar.bz2
  112910349 en-es-2002.tar.bz2
  118623925 en-es-2003.tar.bz2
  126155978 en-es-2004.tar.bz2
  135797604 en-es-2005.tar.bz2
  132869190 en-es-2006.tar.bz2
  107843815 en-es-2007.tar.bz2
   61879476 en-fr-1993.tar.bz2
  101890814 en-fr-1994.tar.bz2
   90828662 en-fr-1995.tar.bz2
  101058863 en-fr-1996.tar.bz2
  109492814 en-fr-1997.tar.bz2
  115845466 en-fr-1998.tar.bz2
  117847594 en-fr-1999.tar.bz2
  137961532 en-fr-2000.tar.bz2
  135978538 en-fr-2001.tar.bz2
  133039835 en-fr-2002.tar.bz2
  128627804 en-fr-2003.tar.bz2
  134986466 en-fr-2004.tar.bz2
  141379142 en-fr-2005.tar.bz2
  141673542 en-fr-2006.tar.bz2
  116268271 en-fr-2007.tar.bz2
   36790895 en-ru-1993.tar.bz2
  102452567 en-ru-1994.tar.bz2
   97899578 en-ru-1995.tar.bz2
  101792324 en-ru-1996.tar.bz2
  106055776 en-ru-1997.tar.bz2
  118421156 en-ru-1998.tar.bz2
  124786875 en-ru-1999.tar.bz2
  139907671 en-ru-2000.tar.bz2
  134473740 en-ru-2001.tar.bz2
  132815336 en-ru-2002.tar.bz2
  127360179 en-ru-2003.tar.bz2
  134087992 en-ru-2004.tar.bz2
  143615790 en-ru-2005.tar.bz2
  146617817 en-ru-2006.tar.bz2
  122258069 en-ru-2007.tar.bz2
        116 en-zh-1993.tar.bz2
        119 en-zh-1994.tar.bz2
     753313 en-zh-1995.tar.bz2
    9503752 en-zh-1996.tar.bz2
   49040908 en-zh-1997.tar.bz2
   92486515 en-zh-1998.tar.bz2
  100639719 en-zh-1999.tar.bz2
  114162464 en-zh-2000.tar.bz2
  112065579 en-zh-2001.tar.bz2
  109705298 en-zh-2002.tar.bz2
  109464402 en-zh-2003.tar.bz2
  114638863 en-zh-2004.tar.bz2
  121908949 en-zh-2005.tar.bz2
  122583229 en-zh-2006.tar.bz2
  100128511 en-zh-2007.tar.bz2

10. Acknowledging the Data

We are very pleased to be able to release this dataset, and we hope
that many groups find it useful in their work. If you use this data, we
would like to ask you to acknowledge it in your presentations and
publications.

We are also interested in learning how this data is used, so we would
appreciate hearing from you about how you used it.

11. Contact Information

We would welcome comments, suggestions, questions about the contents of
this dataset, suggestions for possible future data sets, and any other
feedback. Please send email to:

  ngrams@google.com

12. References

(Kumar et al.) Shankar Kumar, Franz Och, and Wolfgang Macherey.
Improving Word Alignment with Bridge Languages. Conference on Empirical
Methods in Natural Language Processing and Computational Natural
Language Learning, 2007.

Alex Franz
Shankar Kumar
Thorsten Brants

Google Research
September 2012