UN Parallel Documents 1993-2007 Version 1.0

1. Introduction

This dataset contains the text of United Nations parliamentary
documents in Arabic, Chinese, English, French, Russian, and Spanish
from 1993 through 2007. The data is provided in two formats.

1.1 Raw Text

The raw text is very close to what was extracted from the word
processing documents, converted to UTF-8 encoding.

1.2 Word-aligned Text

The word-aligned text has been normalized, tokenized, aligned at the
sentence level, further broken into sub-sentential "chunk-pairs", and
then aligned at the word level. The sentence, chunk, and word
alignment operations were performed separately for each individual
language pair.


2. About the UN Parliamentary Documents

The United Nations parliamentary documents are available from the UN
Official Document System (UN ODS) at http://ods.un.org/.

The ODS, in its main "UNDOC" database, contains the full text of all
types of UN parliamentary documents. It has complete coverage going
back to 1993, and variable coverage before that. Documents exist in
one or more of the official languages of the UN: Arabic, Chinese,
English, French, Russian, and Spanish.

(There are also a large number of German documents, marked with the
language "other", but these are not included in this dataset.)

For more information, see the ODS documentation at
http://documents.un.org/help_E.htm.

For more details of the UN bibliographic systems, see
http://www.un.org/depts/dhl/unbisref_manual/.


3. Some Terminology

Document: A publication independent of language and format, such as
          "The Agenda for the 52nd General Assembly"

Document Symbol: A symbolic identifier assigned by the UN to a
                 document. For example, "A/52/100" denotes published
                 text number 100 from the 52nd session of the General
                 Assembly.

Release date: The date when the document was entered into the ODS
              database.

Document instance: Version of a document in a particular language and
                   format, e.g. the English PDF version of "The Agenda
                   for the 52nd General Assembly".


4. Mining the Data from the ODS

Since the years before 1993 yield almost no useful parallel documents,
the ODS was mined for parallel documents from 1993-2007. Searching by
release date, all documents that were released between January 1, 1993
and December 31, 2007 in more than one language were downloaded.

This dataset uses the UN document symbols to identify text files. In
some cases, the document had a compound document symbol, such as
A/45/2&nbsp;&#8226;&nbsp;A/45/2(SUPP) (i.e. "A/45/2 &
A/45/2(SUPP)"). In these cases, we simply use the first document
identifier, "A/45/2" in the example. Note that this results in some
extreme cases, such as the document "V & A/C.5/48/70" ending up with
the short identifier "V".

UN document symbols were turned into directory/file names by
mapping the characters space, tab, newline, period, open/close
parenthesis, open/close square brackets, colon, and comma to
underscore. For example:

A/47/24/ADD.1(SUPP) --> A_47_24_ADD_1_SUPP
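
The mapping can be sketched in Python as follows. This is a minimal
illustration, not the tool that was used to build the dataset. Two
details are assumptions inferred from the example rather than stated
in the text: "/" is also mapped to underscore, and trailing
underscores are dropped. The name symbol_to_name is hypothetical.

```python
# Characters the text lists as mapped to underscore:
# space, tab, newline, period, parentheses, square brackets, colon, comma.
LISTED_CHARS = " \t\n.()[]:,"

# Assumption: "/" is mapped as well. It is not in the list above,
# but the example A/47/24/ADD.1(SUPP) shows it replaced.
TABLE = str.maketrans({c: "_" for c in LISTED_CHARS + "/"})


def symbol_to_name(symbol: str) -> str:
    """Turn a UN document symbol into a directory/file name."""
    # Assumption: trailing underscores are removed, since the example
    # result does not end in "_" although the symbol ends in ")".
    return symbol.translate(TABLE).rstrip("_")


print(symbol_to_name("A/47/24/ADD.1(SUPP)"))  # A_47_24_ADD_1_SUPP
```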


5. Extracting the Text

The ODS provides documents in a variety of formats, including PDF,
Word, and various versions of WordPerfect. Working from the word
processing formats, the Word and WordPerfect documents were converted
to HTML. In a second step, the HTML documents were converted to text.

The text was then broken into sentences. The raw documents simply
contain one sentence per line.

The text extraction method retained some of the "visual formatting"
that was present in the original word processing documents. For
example, suppose an author used line breaks or empty lines for a title
page like so:

United Nations
Conference
on Trade and
Development

In such cases, each segment "United Nations", "Conference", "on Trade
and", and "Development" becomes a separate sentence, and forms a
separate line in the text file in this dataset.


6. Processing the Text

For the word-aligned data, the text was further preprocessed. This
included the following steps:

- UTF-8 normalization maps characters (and in some cases words)
  to more standard forms

- Chinese segmentation splits Chinese text into "words"

- Chunk splitting separates sentences into chunks at certain
  boundaries in the text


7. Performing Word Alignments

Given two parallel chunks in English and one foreign language (Arabic,
Chinese, French, Spanish, or Russian), word alignment was performed in
both directions, English-to-foreign and foreign-to-English.

We trained two models, one for English-to-foreign and one for
foreign-to-English. In each case, we trained the model using a recipe
consisting of 3 Model-1 iterations and 2 HMM iterations. A posterior
probability matrix was computed for each sentence pair. Using this
matrix, we obtained the Maximum A Posteriori alignment for each
English/foreign position. For further details on word alignment, we
refer the reader to (Kumar et al.).
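
As a rough illustration of the last step, the sketch below picks the
most probable target position for each source position from a
posterior matrix. It is not the implementation used for this dataset.
The matrix layout is an assumption made for the example: one row per
source word, one column per target word, and a final column holding
the probability that the source word is left unaligned. The name
map_alignment is hypothetical.

```python
def map_alignment(posteriors):
    """Return one link per source word: a target index, or -1.

    posteriors[i][j] is the posterior probability that source word i
    aligns to target word j. Assumption: the last column of each row
    is the probability of "no alignment".
    """
    links = []
    for row in posteriors:
        # Index of the largest posterior in this row.
        best = max(range(len(row)), key=row.__getitem__)
        # The assumed "no alignment" column is reported as -1.
        links.append(-1 if best == len(row) - 1 else best)
    return links


# Three source words, two target words plus the "no alignment" column.
example = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.6, 0.2],
]
print(map_alignment(example))  # [0, -1, 1]
```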

8. Format of Word-Aligned Files

Each word-aligned file consists of a series of records. Each record
consists of four lines containing the following:

a) one English chunk
b) one foreign chunk
c) English-to-foreign alignment information
d) foreign-to-English alignment information

The alignment information consists of a series of integers, one for
each word in the source language. Each integer is the zero-based index
of the target-language word to which the source word has been aligned.
If the source token has not been aligned to any target word, then the
index is -1.

For example, the English-to-foreign alignment should be interpreted as
follows: For every token in the English sentence, the index indicates
the foreign token to which it has been aligned.

As an illustration, below is a possible alignment chunk for "Mary did
not slap the green witch" <--> "Mary no daba una bofetada a la bruja
verde".

Mary did not slap the green witch
Mary no daba una bofetada a la bruja verde
0 -1 1 2 6 8 7
0 2 3 -1 -1 -1 4 6 5

This represents the following word alignments:

English-to-Spanish:

Mary->Mary
did NOT ALIGNED
not->no
slap->daba
the->la
green->verde
witch->bruja

Spanish-to-English:

Mary->Mary
no->not
daba->slap
una NOT ALIGNED
bofetada NOT ALIGNED
a NOT ALIGNED
la->the
bruja->witch
verde->green
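
The record format above can be read with a few lines of Python. The
sketch below is only an illustration. It assumes that tokens are
separated by whitespace, that records follow each other without blank
lines, and that indices are zero-based as in the illustration. The
names read_records and aligned_pairs are hypothetical, and the sample
record is a shortened version of the illustration, not a line from
the dataset.

```python
from itertools import islice


def read_records(lines):
    """Yield (english, foreign, e2f, f2e) for each four-line record."""
    it = iter(lines)
    while True:
        block = [line.rstrip("\n") for line in islice(it, 4)]
        if len(block) < 4:
            return  # end of input (or an incomplete final record)
        english, foreign = block[0].split(), block[1].split()
        e2f = [int(i) for i in block[2].split()]
        f2e = [int(i) for i in block[3].split()]
        yield english, foreign, e2f, f2e


def aligned_pairs(source, target, links):
    """Pair each source token with its target token, or None for -1."""
    return [(w, target[j] if j >= 0 else None) for w, j in zip(source, links)]


sample = ["the green witch\n", "la bruja verde\n", "0 2 1\n", "0 2 1\n"]
for english, foreign, e2f, f2e in read_records(sample):
    print(aligned_pairs(english, foreign, e2f))
    print(aligned_pairs(foreign, english, f2e))
```

With real data, an open file object for a word-aligned file can be
passed to read_records in place of the sample list.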

We only provide English-foreign alignment pairs. For example, we do
not provide Arabic-Spanish word-aligned data. Furthermore, the chunks
within each word-aligned file appear in random order. That is, the
file contains the chunks of the corresponding raw text, but not in the
order in which they occur there.

9. Directory Structure

There are two main subdirectories, "doc" and "data".

9.1 doc

The "doc" directory contains the following files:

README - This file.

INDEX - A file mapping UN documents to their release year and month.
        (The data files are arranged in a directory hierarchy based
        on release year and month - see below.)

        Example line from this file:

        1993/01 A_46_2

        This indicates that the directory (for raw data) or file (for
        word-aligned data) for the UN document "A_46_2" is found in
        the 1993/01 (January 1993) subdirectory.


FILES_RAW - a list of all files included in the "raw" portion of this
            dataset

FILES_ALIGNED - a list of all files included in the "aligned" portion
                of this dataset
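
The INDEX file described above can be loaded into a lookup table with
a short Python function. This is a sketch under the assumption that
every line has exactly the two whitespace-separated fields shown in
the example line. The name load_index is hypothetical.

```python
def load_index(lines):
    """Map each document name to its "year/month" directory."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Assumption: two fields per line, e.g. "1993/01 A_46_2".
        date_dir, name = line.split()
        index[name] = date_dir
    return index


index = load_index(["1993/01 A_46_2\n"])
print(index["A_46_2"])  # 1993/01
```

With real data, an open file object for doc/INDEX can be passed in
place of the one-line sample list.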


9.2 data

The "data" directory contains two subdirectories "raw" and "aligned".
The "raw" directory holds the raw document files, and the "aligned"
directory contains the word-aligned files.

9.2.1 Date-based directory hierarchy

Both raw and aligned data are stored in a parallel date-based
directory hierarchy based on the release date of the UN document. The
location of a UN document in this directory hierarchy can be looked up
in the INDEX file. This makes it easy to find corresponding raw and
aligned files, if desired.

Thus, both "raw" and "aligned/en-<foreign>" directories have
subdirectories for the years 1993, 1994, 1995, 1996, 1997, 1998, 1999,
2000, 2001, 2002, 2003, 2004, 2005, 2006, and 2007. Each year
directory has subdirectories for the months 01, 02, 03, 04, 05, 06,
07, 08, 09, 10, 11, and 12.
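
For scripts that walk the hierarchy, the year/month subdirectory names
listed above can be generated as follows (a small sketch; the variable
name date_dirs is hypothetical):

```python
# One "year/month" entry per month from 1993/01 through 2007/12.
date_dirs = [
    f"{year}/{month:02d}"
    for year in range(1993, 2008)
    for month in range(1, 13)
]

print(len(date_dirs), date_dirs[0], date_dirs[-1])  # 180 1993/01 2007/12
```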

9.2.2 data/raw

The "data/raw" directory contains the raw document text files. In the
appropriate spot in the date-based directory hierarchy, there is one
directory for each UN document. This directory contains one file for
each language in which the document is available. Each file carries
the language code as an extension (.ar, .en, .es, .fr, .ru, and .zh).

As an example, the complete path to the Arabic text file for the
document "A_46_2", which was released in January of 1993, is as follows:

data/raw/1993/01/A_46_2/A_46_2.ar 

And the Spanish and French versions of the same document are contained
in the following files:

data/raw/1993/01/A_46_2/A_46_2.es
data/raw/1993/01/A_46_2/A_46_2.fr
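
The path layout above can be expressed as a small Python helper. This
is an illustrative sketch; the name raw_path is hypothetical, and the
"year/month" value is expected to come from the INDEX file.

```python
from pathlib import PurePosixPath

LANGUAGES = ("ar", "en", "es", "fr", "ru", "zh")


def raw_path(date_dir, name, lang):
    """Build the relative path of one raw text file.

    date_dir is the "year/month" entry from INDEX, name is the
    document name, and lang is one of the six language codes.
    """
    if lang not in LANGUAGES:
        raise ValueError(f"unknown language code: {lang}")
    return PurePosixPath("data/raw") / date_dir / name / f"{name}.{lang}"


print(raw_path("1993/01", "A_46_2", "ar"))  # data/raw/1993/01/A_46_2/A_46_2.ar
```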

9.2.3 data/aligned

Under "aligned", there are five subdirectories for the five language
pairs "en-ar", "en-zh", "en-fr", "en-es", and "en-ru".

Below this level, the aligned files are stored in the directory
hierarchy that is arranged by date. The word-aligned data for one
language pair for one UN document is stored in a single file.

Thus, the English-Arabic word-aligned text for the UN document
A_48_413 can be found in the following file:

data/aligned/en-ar/1993/01/A_48_413

And the English-Spanish, English-French, and English-Russian
word-aligned text for the same document can be found in the following
files:

data/aligned/en-es/1993/01/A_48_413
data/aligned/en-fr/1993/01/A_48_413
data/aligned/en-ru/1993/01/A_48_413
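
The same layout can be written as a Python helper that returns the
word-aligned file for a given language pair. This is an illustrative
sketch; the name aligned_path is hypothetical.

```python
from pathlib import PurePosixPath

LANGUAGE_PAIRS = ("en-ar", "en-zh", "en-fr", "en-es", "en-ru")


def aligned_path(pair, date_dir, name):
    """Build the relative path of one word-aligned file.

    pair is one of the five language pairs, date_dir is the
    "year/month" entry from INDEX, and name is the document name.
    """
    if pair not in LANGUAGE_PAIRS:
        raise ValueError(f"unknown language pair: {pair}")
    return PurePosixPath("data/aligned") / pair / date_dir / name


for pair in ("en-ar", "en-es", "en-fr", "en-ru"):
    print(aligned_path(pair, "1993/01", "A_48_413"))
```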

9.3 Compression

In order to save space, the files are not present on the DVDs in the
format described in this section. Instead, logical groups of files are
combined into tar files, and compressed using the bzip2 compression
utility. 

For the raw files, all files for each year have been combined into one
compressed tar file. File sizes range from about 135 to 348 MBytes.

% cd data/raw
% ls  -l

135367526 1993.tar.bz2
219287568 1994.tar.bz2
212082791 1995.tar.bz2
225026541 1996.tar.bz2
252403354 1997.tar.bz2
290371527 1998.tar.bz2
301889346 1999.tar.bz2
339254596 2000.tar.bz2
328620027 2001.tar.bz2
326741654 2002.tar.bz2
310739409 2003.tar.bz2
323849446 2004.tar.bz2
337267011 2005.tar.bz2
348437909 2006.tar.bz2
281384252 2007.tar.bz2


For the word-aligned files, the files for each year for each language
pair were combined into one compressed tar file.

% cd data/aligned
% ls -la
30384751 en-ar-1993.tar.bz2
80900798 en-ar-1994.tar.bz2
81996148 en-ar-1995.tar.bz2
80684075 en-ar-1996.tar.bz2
83625663 en-ar-1997.tar.bz2
95628265 en-ar-1998.tar.bz2
99381318 en-ar-1999.tar.bz2
117439799 en-ar-2000.tar.bz2
113731125 en-ar-2001.tar.bz2
108869821 en-ar-2002.tar.bz2
109077286 en-ar-2003.tar.bz2
113483413 en-ar-2004.tar.bz2
120468766 en-ar-2005.tar.bz2
126040703 en-ar-2006.tar.bz2
100220936 en-ar-2007.tar.bz2
60973035 en-es-1993.tar.bz2
102078443 en-es-1994.tar.bz2
88208057 en-es-1995.tar.bz2
96733640 en-es-1996.tar.bz2
99937632 en-es-1997.tar.bz2
109371904 en-es-1998.tar.bz2
112810073 en-es-1999.tar.bz2
127890075 en-es-2000.tar.bz2
118276833 en-es-2001.tar.bz2
112910349 en-es-2002.tar.bz2
118623925 en-es-2003.tar.bz2
126155978 en-es-2004.tar.bz2
135797604 en-es-2005.tar.bz2
132869190 en-es-2006.tar.bz2
107843815 en-es-2007.tar.bz2
61879476 en-fr-1993.tar.bz2
101890814 en-fr-1994.tar.bz2
90828662 en-fr-1995.tar.bz2
101058863 en-fr-1996.tar.bz2
109492814 en-fr-1997.tar.bz2
115845466 en-fr-1998.tar.bz2
117847594 en-fr-1999.tar.bz2
137961532 en-fr-2000.tar.bz2
135978538 en-fr-2001.tar.bz2
133039835 en-fr-2002.tar.bz2
128627804 en-fr-2003.tar.bz2
134986466 en-fr-2004.tar.bz2
141379142 en-fr-2005.tar.bz2
141673542 en-fr-2006.tar.bz2
116268271 en-fr-2007.tar.bz2
36790895 en-ru-1993.tar.bz2
102452567 en-ru-1994.tar.bz2
97899578 en-ru-1995.tar.bz2
101792324 en-ru-1996.tar.bz2
106055776 en-ru-1997.tar.bz2
118421156 en-ru-1998.tar.bz2
124786875 en-ru-1999.tar.bz2
139907671 en-ru-2000.tar.bz2
134473740 en-ru-2001.tar.bz2
132815336 en-ru-2002.tar.bz2
127360179 en-ru-2003.tar.bz2
134087992 en-ru-2004.tar.bz2
143615790 en-ru-2005.tar.bz2
146617817 en-ru-2006.tar.bz2
122258069 en-ru-2007.tar.bz2
116 en-zh-1993.tar.bz2
119 en-zh-1994.tar.bz2
753313 en-zh-1995.tar.bz2
9503752 en-zh-1996.tar.bz2
49040908 en-zh-1997.tar.bz2
92486515 en-zh-1998.tar.bz2
100639719 en-zh-1999.tar.bz2
114162464 en-zh-2000.tar.bz2
112065579 en-zh-2001.tar.bz2
109705298 en-zh-2002.tar.bz2
109464402 en-zh-2003.tar.bz2
114638863 en-zh-2004.tar.bz2
121908949 en-zh-2005.tar.bz2
122583229 en-zh-2006.tar.bz2
100128511 en-zh-2007.tar.bz2
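
The archives can be unpacked with the standard tar and bzip2 tools, or
from Python with the standard tarfile module. The sketch below is one
possible way to do this. The name unpack and the destination
directory are hypothetical, and the directory layout inside each
archive is not described here, so inspect the result before relying on
a particular path.

```python
import tarfile
from pathlib import Path


def unpack(archive, dest):
    """Extract one bzip2-compressed tar file into the directory dest."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:bz2" opens the archive for reading with bzip2 decompression.
    with tarfile.open(archive, mode="r:bz2") as tar:
        tar.extractall(dest)
    return dest


# Example call (paths are placeholders):
# unpack("data/aligned/en-ar-1993.tar.bz2", "unpacked/aligned")
```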


10. Acknowledging the Data

We are very pleased to be able to release this dataset, and we hope that
many groups find it useful in their work. If you use this data, we
ask that you acknowledge it in your presentations and publications.
We are also interested in how the data is used, and would appreciate
hearing from you about your use of it.

11. Contact Information

We welcome comments, questions about the contents of this dataset,
suggestions for possible future datasets, and any other feedback.
Please send email to: ngrams@google.com

12. References

Improving Word Alignment with Bridge Languages, Shankar Kumar, Franz
Och, Wolfgang Macherey, Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning, 2007
(Kumar et al.)


Alex Franz
Shankar Kumar
Thorsten Brants

Google Research
September 2012