Arabic Gigaword Second Edition
|Item Name:||Arabic Gigaword Second Edition|
|Author(s):||David Graff, Ke Chen, Junbo Kong, Kazuaki Maeda|
|LDC Catalog No.:||LDC2006T02|
|Release Date:||January 19, 2006|
|Application(s):||information retrieval, language modeling, natural language processing|
LDC User Agreement for Non-Members
|Online Documentation:||LDC2006T02 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Graff, David, et al. Arabic Gigaword Second Edition LDC2006T02. DVD. Philadelphia: Linguistic Data Consortium, 2006.|
Arabic Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC); its catalog number is LDC2006T02 and its ISBN is 1-58563-371-2. It is a comprehensive archive of newswire text data acquired from Arabic news sources by the LDC at the University of Pennsylvania.
Arabic Gigaword Second Edition includes all of the content of the first edition of Arabic Gigaword (LDC2003T12) as well as new data.
Five distinct sources of Arabic newswire are represented here:
|Agence France Presse||(afp_arb; formerly afa)|
|Al Hayat News Agency||(hyt_arb; formerly alh)|
|An Nahar News Agency||(nhr_arb; formerly ann)|
|Xinhua News Agency||(xin_arb; formerly xia)|
The seven-character codes in the parentheses above consist of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_"). The language code "arb" denotes Standard Arabic in ISO 639-3. In the first edition of the Arabic Gigaword corpus, a simpler three-character code identified both the source and the language. The new convention distinguishes data sets by source and language more naturally when a single newswire provider distributes data in multiple languages.
Ummah Press is a new source added to the Second Edition. The following table shows the new data that appear for the first time in the Second Edition.
|Agence France Presse||2003.01-2004.12||143,766 documents|
|Al Hayat News Agency||2002.01-2003.12||64,308 documents|
|An Nahar News Agency||2003.01-2004.01||16,316 documents|
|Ummah Press||2003.01-2004.12||4,641 documents|
|Xinhua News Agency||2003.06-2004.12||106,236 documents|
The table below presents the following information for each source: the number of files; Gzip-MB, the total size of the compressed files; Totl-MB, the total size of the uncompressed files (approximately 5.3 gigabytes overall); and K-words, the number of space-separated tokens in the text, excluding SGML tags.
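The token count described above can be approximated with a few lines of Python. This is only a sketch: the exact procedure LDC used is not given here, so the tag pattern and the function name `count_tokens` are assumptions made for illustration.

```python
import re

# Assumption: an SGML tag is anything between "<" and the next ">".
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens after removing SGML tags."""
    without_tags = TAG_RE.sub(" ", text)
    return len(without_tags.split())
```

Dividing the result by 1,000 gives a figure comparable in spirit to the K-words column, although small differences from the published totals should be expected.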
All text files in this corpus have been converted to UTF-8 character encoding.
Owing to the use of UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text data, including article headlines and datelines, contain a mixture of single-byte and multi-byte characters. In general, single-byte characters in the text data will consist of digits and punctuation marks (where the original source relied on ASCII punctuation codes, rather than Arabic-specific punctuation), whereas multi-byte characters consist of Arabic letters and a small number of special punctuation or other symbols. This variable-width character encoding is intrinsic to UTF-8, and all UTF-8 capable processes will handle the data appropriately.
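The variable-width behaviour can be checked directly. The sketch below, with the hypothetical helper names `utf8_widths` and `is_single_byte_line`, reports how many bytes each character occupies in UTF-8 and whether a line contains only single-byte (ASCII) characters.

```python
from typing import List, Tuple


def utf8_widths(text: str) -> List[Tuple[str, int]]:
    """Return each character with the number of bytes it needs in UTF-8."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


def is_single_byte_line(line: str) -> bool:
    """True when every character in the line is ASCII (one byte in UTF-8)."""
    return line.isascii()
```

An ASCII-only line is not necessarily markup, since a text line made up of digits and ASCII punctuation would also pass this check.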
Each data file name consists of the seven-character prefix, an underscore character ("_"), and a six-digit date (representing the year and month during which the file contents were generated by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). Thus, each file contains all the usable data received by LDC for the given month from the given news source.
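A file name built this way can be split into its parts with a regular expression. The following is a minimal sketch; the names `GigawordFileName` and `parse_file_name` are hypothetical, and the sample name `afp_arb_200301.gz` is constructed from the description above rather than copied from the distribution.

```python
import re
from typing import NamedTuple

# Three-character source ID, "_", three-character language code, "_",
# four-digit year, two-digit month, then the ".gz" extension.
FILE_NAME_RE = re.compile(r"^([a-z]{3})_([a-z]{3})_(\d{4})(\d{2})\.gz$")


class GigawordFileName(NamedTuple):
    source: str
    language: str
    year: int
    month: int


def parse_file_name(name: str) -> GigawordFileName:
    """Split a name such as 'afp_arb_200301.gz' into its components."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    source, language, year, month = match.groups()
    month_number = int(month)
    if not 1 <= month_number <= 12:
        raise ValueError(f"invalid month in file name: {name!r}")
    return GigawordFileName(source, language, int(year), month_number)
```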
All text data are presented in SGML form, using a very simple, minimal markup structure. The file gigaword_a.dtd in the "dtd" directory provides the formal "Document Type Declaration" for parsing the SGML content. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using this DTD file.
Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs).
All sources have received uniform treatment in terms of quality control, and the DOCs from every source have been categorized into three distinct "types":
|story||this type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences|
|multi||this type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ... (some general area like finance or sports)" and so on|
|other||these DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on|
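Reading the DOCs of one type from a compressed file might look like the sketch below. It is not a substitute for a real SGML parser and makes assumptions that should be checked against gigaword_a.dtd: that each document opens with a line of the form `<DOC id="..." type="...">` and closes with a line starting `</DOC>`. The function name `iter_docs` is hypothetical.

```python
import gzip
import re
from typing import Iterator, List, Optional, Tuple

# Assumed shape of the opening tag; verify against gigaword_a.dtd.
DOC_OPEN_RE = re.compile(r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>')


def iter_docs(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (doc_id, doc_type, body) for each DOC in a gzip-compressed file."""
    doc_id: Optional[str] = None
    doc_type: Optional[str] = None
    body: List[str] = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            opened = DOC_OPEN_RE.match(line)
            if opened:
                doc_id, doc_type = opened.groups()
                body = []
            elif line.startswith("</DOC>"):
                if doc_id is not None and doc_type is not None:
                    yield doc_id, doc_type, "".join(body)
                doc_id = doc_type = None
            elif doc_id is not None:
                body.append(line)
```

Keeping only the "story" DOCs is then a matter of filtering on the second element of each tuple, for example `[d for d in iter_docs(path) if d[1] == "story"]`.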
The general strategy for categorizing DOCs into these three classes was, for each source, to discover the most common and frequent clues in the text stream that correlated with the "non-story" types. When none of the known clues was in evidence, the DOC was classified as a "story."
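The shape of that strategy, trying known clues first and falling back to "story", can be sketched as follows. The clue patterns here are entirely hypothetical placeholders; the actual clues used for each source are not documented in this description.

```python
import re

# Hypothetical clues for illustration only: (type label, pattern) pairs,
# tried in order. The real per-source clues are not listed in this document.
CLUES = (
    ("multi", re.compile(r"news briefs", re.IGNORECASE)),
    ("other", re.compile(r"^\s*\d+\s*-\s*\d+\s*$", re.MULTILINE)),
)


def classify_doc(text: str, clues=CLUES) -> str:
    """Return the type of the first matching clue, or 'story' if none match."""
    for doc_type, pattern in clues:
        if pattern.search(text):
            return doc_type
    return "story"
```

Because "story" is the default, any non-story DOC whose clue was never discovered would be labeled as a story under this scheme.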
Other "Gigaword" corpora (in English and Chinese) had a fourth category, "advis" (for "advisory"), which applied to DOCs that contain text intended solely for news service editors, not the news-reading public. In preparing the Arabic data, the task of determining patterns for assigning "non-story" type labels was carried out by a native speaker of Arabic, and (for whatever reason) this person did not find the "advis" category to be applicable to any of the data.
As described in the introduction section, a new naming scheme for file names and document IDs is used in the Second Edition. All of the documents in the first edition of the Arabic Gigaword corpus can be mapped to the same documents in this edition by changing the prefix of DOC IDs and file names as below. Upper-case letters are used for the DOC IDs; lower-case letters are used for the file and directory names. The underscore character that connects the seven-character prefix to the date is included in the following table.
|afa||afp_arb_|
|alh||hyt_arb_|
|ann||nhr_arb_|
|xia||xin_arb_|
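The prefix change can be sketched in Python using the former and current codes listed in the source table near the top of this page. The sketch assumes that a first-edition DOC ID or file name begins with the three-character code followed directly by the date; the function name `map_prefix` and the sample identifiers are hypothetical.

```python
# Former three-character codes and their Second Edition replacements.
OLD_TO_NEW = {
    "afa": "afp_arb",
    "alh": "hyt_arb",
    "ann": "nhr_arb",
    "xia": "xin_arb",
}


def map_prefix(name: str) -> str:
    """Rewrite a first-edition DOC ID or file name to the new prefix scheme."""
    old, rest = name[:3], name[3:]
    new = OLD_TO_NEW.get(old.lower())
    if new is None:
        raise ValueError(f"unknown first-edition prefix in {name!r}")
    new += "_"  # the underscore joins the prefix to the date
    # Upper case is kept for DOC IDs, lower case for file names.
    return (new.upper() if old.isupper() else new) + rest
```

For instance, under the stated assumption, `map_prefix("afa200201.gz")` would produce a name beginning with `afp_arb_`, and an upper-case DOC ID would receive the upper-case prefix.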
For an example of the data in this corpus, please examine this screenshot which is an image of the text from a single file.