Arabic Gigaword Fifth Edition

Item Name: Arabic Gigaword Fifth Edition
Author(s): Robert Parker, David Graff, Ke Chen, Junbo Kong, Kazuaki Maeda
LDC Catalog No.: LDC2011T11
ISBN: 1-58563-595-2
ISLRN: 494-144-988-211-3
DOI: https://doi.org/10.35111/p02g-rw14
Release Date: October 21, 2011
Member Year(s): 2011
DCMI Type(s): Text
Data Source(s): newswire
Project(s): GALE
Application(s): natural language processing, language modeling, information retrieval
Language(s): Standard Arabic, Arabic
Language ID(s): arb, ara
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2011T11 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Parker, Robert, et al. Arabic Gigaword Fifth Edition LDC2011T11. Web Download. Philadelphia: Linguistic Data Consortium, 2011.

Introduction

Arabic Gigaword Fifth Edition, Linguistic Data Consortium (LDC) catalog number LDC2011T11 and ISBN 1-58563-595-2, was produced by LDC. It is a comprehensive archive of newswire text data that has been acquired from Arabic news sources by LDC at the University of Pennsylvania. Arabic Gigaword Fifth Edition includes all of the content of the fourth edition of Arabic Gigaword (LDC2009T30) plus new data covering the period from January 2009 through December 2010.

Nine distinct sources of Arabic newswire are represented here:

  • Asharq Al-Awsat (aaw_arb)
  • Agence France Presse (afp_arb)
  • Al-Ahram (ahr_arb)
  • Assabah (asb_arb)
  • Al Hayat (hyt_arb)
  • An Nahar (nhr_arb)
  • Al-Quds Al-Arabi (qds_arb)
  • Ummah Press (umh_arb)
  • Xinhua News Agency (xin_arb)

The seven-character codes shown above serve both as the names of the directories where the data files are found and as the seven-character prefix that appears at the beginning of every file name. Each code consists of a three-character source ID and the three-character language code (arb), separated by an underscore (_) character. The language code conforms to the ISO 639-3 standard.
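As a minimal sketch of how this naming convention could be used, the Python snippet below splits a directory or file name into its source ID and language code. The helper name split_prefix and the file name in the example are hypothetical; only the prefix layout and the source list are taken from the description above.

```python
import re

# Prefix layout: three-letter source ID, underscore, three-letter language code.
PREFIX_RE = re.compile(r"^([a-z]{3})_([a-z]{3})")

# Source IDs as listed in this catalog entry.
SOURCES = {
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "ahr": "Al-Ahram",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}


def split_prefix(name):
    """Return (source_id, language_code) for a directory or file name."""
    match = PREFIX_RE.match(name.lower())
    if match is None:
        raise ValueError(f"no source prefix in {name!r}")
    return match.group(1), match.group(2)


# "afp_arb_200901" is a made-up file name used only for illustration.
source_id, lang = split_prefix("afp_arb_200901")
print(SOURCES[source_id], lang)
```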

In addition to adding new data, the following updates were made:

  • Repeated documents in Asharq Al-Awsat data from 2008 were removed.
  • Document formatting and docid duplication problems were corrected in Agence France Presse (AFP) data.
  • Significant duplication of content in 2007-2008 An Nahar data was detected, and the duplicated documents were removed.

More details about these changes can be found in the included readme file.

Data

All text data are presented in SGML form, using a very simple, minimal markup structure. For every opening tag (DOC, HEADLINE, DATELINE, TEXT, P), there is always a corresponding closing tag. The attribute values in the DOC tag are always presented within double quotes. The id= attribute of DOC consists of the seven-character source abbreviation (in CAPS), an underscore character, an 8-digit date string representing the date of the story (YYYYMMDD), a period, and a 4-digit sequence number starting at 0001 for each date (e.g. XIN_ARB_200101.0001). In this way, every DOC in the corpus is uniquely identifiable by its id string.
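The id layout described above can be checked and decomposed with a regular expression. The following Python sketch assumes the stated layout (source abbreviation, underscore, eight-digit date, period, four-digit sequence number); the function name parse_doc_id and the sample ID are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout, per the description: SRC_LNG_YYYYMMDD.NNNN
DOC_ID_RE = re.compile(r"([A-Z]{3})_([A-Z]{3})_(\d{8})\.(\d{4})")


def parse_doc_id(doc_id):
    """Split a DOC id into source, language, date and sequence number."""
    match = DOC_ID_RE.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"unexpected DOC id format: {doc_id!r}")
    source, lang, date_str, seq = match.groups()
    return {
        "source": source,
        "language": lang,
        # strptime also rejects impossible dates such as month 13.
        "date": datetime.strptime(date_str, "%Y%m%d").date(),
        "sequence": int(seq),
    }


# Hypothetical id, made up to follow the stated layout.
info = parse_doc_id("XIN_ARB_20010115.0001")
print(info)
```

Because the pattern follows the stated eight-digit date format, an id carrying only six date digits would be rejected with a ValueError.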

For this release, all sources have received a uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into three distinct types. The classification is indicated by the type="string" attribute that is included in each opening DOC tag. The three types are:

  • story: This is by far the most frequent type, and it represents the most typical newswire item: a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
  • multi: This type of DOC contains a series of unrelated blurbs, each of which briefly describes a particular topic or event. It is typically applied to DOCs that contain summaries of today's news, news briefs in ... (some general area like finance or sports), and so on.
  • other: This represents DOCs that clearly do not fall into any of the above types. In general, items of this type are intended for broad circulation (they are not advisories), they may be topically coherent (unlike multi type DOCs), and they typically do not contain paragraphs or sentences (they aren't really stories). These are things like lists of sports scores, stock prices, temperatures around the world, and so on.

Other Gigaword corpora (e.g., in English and Chinese) have a fourth category, advis (for advisory), which applies to DOCs that contain text intended solely for news service editors, not the news-reading public. The task of determining patterns for assigning non-story type labels was carried out by a native speaker of Arabic, and the advis category was determined to be inapplicable to the data.
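To get a quick picture of how DOC units are distributed across these types, the opening DOC tags can be tallied. The Python sketch below is one possible approach, assuming only that attributes appear as name="value" pairs inside the opening tag; the function name count_doc_types and the sample ids are hypothetical.

```python
import re
from collections import Counter

# Opening DOC tag and its name="value" attributes.
DOC_OPEN_RE = re.compile(r"<DOC\b([^>]*)>")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def count_doc_types(sgml_text):
    """Count DOC units by the value of their type attribute."""
    counts = Counter()
    for attr_text in DOC_OPEN_RE.findall(sgml_text):
        attrs = dict(ATTR_RE.findall(attr_text))
        # "missing" is a placeholder label for tags without a type attribute.
        counts[attrs.get("type", "missing")] += 1
    return counts


# Made-up fragment; the ids are illustrative only.
sample = (
    '<DOC id="AFP_ARB_20090101.0001" type="story">\n</DOC>\n'
    '<DOC id="AFP_ARB_20090101.0002" type="multi">\n</DOC>\n'
    '<DOC id="AFP_ARB_20090101.0003" type="story">\n</DOC>\n'
)
counts = count_doc_types(sample)
print(counts)
```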

Note that the markup was applied algorithmically, using logic that was based on less-than-complete knowledge of the data. For the most part, the HEADLINE, DATELINE and TEXT tags have their intended content, but due to the inherent variability (and the inevitable source errors) in the data, users may find occasional mishaps where the headline and/or dateline were not successfully identified (and hence show up within TEXT), or where an initial sentence or paragraph has been mistakenly tagged as the headline or dateline.
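Given these caveats, code that reads the corpus is safer if it treats HEADLINE and DATELINE as optional. The following Python sketch shows one way to do that with regular expressions rather than a full SGML parser. The function names and the sample text are hypothetical, and falling back to the whole TEXT body when no P tags are present is an assumption made here, not documented corpus behavior.

```python
import re

DOC_RE = re.compile(r"<DOC\b([^>]*)>(.*?)</DOC>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
P_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def first_element(tag, body):
    """Return the stripped content of the first <tag> element, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", body, re.DOTALL)
    return match.group(1).strip() if match else None


def iter_docs(sgml_text):
    """Yield one dict per DOC; headline and dateline may be None."""
    for attr_text, body in DOC_RE.findall(sgml_text):
        text = first_element("TEXT", body) or ""
        paragraphs = [p.strip() for p in P_RE.findall(text)]
        if not paragraphs and text:
            # Assumption: keep unparagraphed TEXT as a single block.
            paragraphs = [text]
        yield {
            **dict(ATTR_RE.findall(attr_text)),
            "headline": first_element("HEADLINE", body),
            "dateline": first_element("DATELINE", body),
            "paragraphs": paragraphs,
        }


# Made-up sample; ids and content are placeholders.
sample = """\
<DOC id="AFP_ARB_20090101.0001" type="story">
<HEADLINE>
Example headline
</HEADLINE>
<DATELINE>
Example dateline
</DATELINE>
<TEXT>
<P>
First paragraph.
</P>
<P>
Second paragraph.
</P>
</TEXT>
</DOC>
<DOC id="AFP_ARB_20090101.0002" type="other">
<TEXT>
Score list line 1
Score list line 2
</TEXT>
</DOC>
"""
docs = list(iter_docs(sample))
for doc in docs:
    print(doc["id"], doc["type"], doc["headline"], len(doc["paragraphs"]))
```

A missing headline comes back as None rather than raising an error, so downstream code can decide whether to inspect the first paragraph of TEXT instead.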

Sample

Please view this sample.

Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

Updates

None at this time.
