Arabic English Parallel News Part 1
Item Name: | Arabic English Parallel News Part 1 |
Author(s): | Linguistic Data Consortium |
LDC Catalog No.: | LDC2004T18 |
ISBN: | ISBN 1-58563-310-0 |
ISLRN: | 233-597-996-883-6 |
DOI: | https://doi.org/10.35111/et6p-7264 |
Release Date: | October 26, 2004 |
Member Year(s): | 2004 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Project(s): | GALE, TIDES |
Language(s): | English, Standard Arabic |
Language ID(s): | eng, arb |
License(s): | LDC User Agreement for Non-Members |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Linguistic Data Consortium. Arabic English Parallel News Part 1 LDC2004T18. Web Download. Philadelphia: Linguistic Data Consortium, 2004. |
Introduction
Arabic English Parallel News Part 1 was developed by the Linguistic Data Consortium (LDC) and contains Arabic news stories and their English translations aligned at sentence level, totaling approximately 2 million Arabic words and 2.5 million English words.
Data
LDC collected the data in this corpus via Ummah Press Service from January 2001 to September 2004. The corpus contains 8,439 story pairs comprising 68,685 sentence pairs, aligned at the sentence level. All data files are SGML documents.
Ummah Press Service publishes weekly digests. Each issue of the Ummah publication contains a series of articles from various Arabic newspapers (e.g., Al-Ahram, Al-Hayat, Asharq Al-Awsat, Al-Hakika, Al-Alam Al-Youm, Al-Gomhouria, Al-Ittihad) and their English translations.
Ummah sent each weekly issue to LDC by email, encoded in CP1256 or UTF-8. The emails were decoded and reformatted, and the Arabic text was converted to UTF-8 where necessary. The data arrived aligned at the story level but not at the sentence level; sentence alignment was performed at LDC using Champollion v1.1.
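The encoding normalization step described above can be illustrated with a minimal Python sketch. This is not LDC's actual pipeline, which is not documented here; the function name `to_utf8` and the "try UTF-8 first, fall back to CP1256" strategy are assumptions made for illustration only.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return `raw` re-encoded as UTF-8.

    Assumption: the input is either valid UTF-8 or CP1256 (Windows Arabic).
    This fallback heuristic is illustrative, not LDC's documented method.
    """
    try:
        # Valid UTF-8 input decodes cleanly and is passed through unchanged.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # CP1256-encoded Arabic is almost never valid UTF-8, so fall back.
        text = raw.decode("cp1256")
    return text.encode("utf-8")


if __name__ == "__main__":
    word = "أخبار"  # hypothetical sample text ("news")
    cp1256_bytes = word.encode("cp1256")
    print(to_utf8(cp1256_bytes).decode("utf-8"))
```

Trying UTF-8 first is a common heuristic because Arabic text in a single-byte encoding such as CP1256 rarely forms valid UTF-8 byte sequences, so a decoding failure is a reasonable signal to fall back.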
Samples
For an example of the data in this corpus, please view this Arabic example (SGM) and this English example (SGM).
Updates
None at this time.