Arabic Gigaword Fifth Edition


Item Name: Arabic Gigaword Fifth Edition
Authors: Robert Parker, David Graff, Ke Chen, Junbo Kong, and Kazuaki Maeda
LDC Catalog No.: LDC2011T11
ISBN: 1-58563-595-2
Release Date: Oct 21, 2011
Data Type: text
Data Source(s): newswire
Project(s): GALE
Application(s): information retrieval, language modeling, natural language processing
Language(s): Arabic
Language ID(s): arb
Distribution: 1 DVD
Member fee: $0 for 2011 members
Non-member Fee: US $6000.00
Reduced-License Fee: US $3000.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Robert Parker, et al. 2011. Arabic Gigaword Fifth Edition. Linguistic Data Consortium, Philadelphia

Introduction

Arabic Gigaword Fifth Edition, Linguistic Data Consortium (LDC) catalog number LDC2011T11 and ISBN 1-58563-595-2, was produced by LDC. It is a comprehensive archive of newswire text data that has been acquired from Arabic news sources by LDC at the University of Pennsylvania. Arabic Gigaword Fifth Edition includes all of the content of the fourth edition of Arabic Gigaword (LDC2009T30) plus new data covering the period from January 2009 through December 2010.

Nine distinct sources of Arabic newswire are represented here:

  • Asharq Al-Awsat (aaw_arb)
  • Agence France Presse (afp_arb)
  • Al-Ahram (ahr_arb)
  • Assabah (asb_arb)
  • Al Hayat (hyt_arb)
  • An Nahar (nhr_arb)
  • Al-Quds Al-Arabi (qds_arb)
  • Ummah Press (umh_arb)
  • Xinhua News Agency (xin_arb)

The seven-character codes shown above serve both as the names of the directories where the data files are found and as the seven-character prefix that begins every file name. Each code consists of a three-character source ID and the three-character language code (arb), separated by an underscore (_) character. The language code conforms to the ISO 639-3 standard.
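
As a minimal sketch of how these codes can be used, the Python below splits a code into its source ID and language code and looks up the source for a file name. The function names and the file name afp_arb_200901.gz are hypothetical, chosen only for illustration; the only assumption taken from the text is that every file name begins with one of the nine codes.

```python
from typing import Tuple

# The nine source codes listed above, mapped to their source names.
SOURCES = {
    "aaw_arb": "Asharq Al-Awsat",
    "afp_arb": "Agence France Presse",
    "ahr_arb": "Al-Ahram",
    "asb_arb": "Assabah",
    "hyt_arb": "Al Hayat",
    "nhr_arb": "An Nahar",
    "qds_arb": "Al-Quds Al-Arabi",
    "umh_arb": "Ummah Press",
    "xin_arb": "Xinhua News Agency",
}


def split_code(code: str) -> Tuple[str, str]:
    """Split a code such as 'afp_arb' into (source ID, language code)."""
    if len(code) != 7 or code[3] != "_":
        raise ValueError(f"not a seven-character source code: {code!r}")
    return code[:3], code[4:]


def source_of(file_name: str) -> str:
    """Return the source name for a file name that starts with a known code.

    Raises KeyError if the first seven characters are not a listed code.
    """
    return SOURCES[file_name[:7]]


print(split_code("xin_arb"))
# Hypothetical file name, used only to exercise the prefix lookup.
print(source_of("afp_arb_200901.gz"))
```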

In addition to adding new data, the following updates were made:

  • Repeated documents in Asharq Al-Awsat data from 2008 were removed.
  • Document formatting and docid duplication problems were corrected in Agence France Presse (AFP) data.
  • Significant duplication of content in 2007-2008 An Nahar data was detected, and the duplicated documents were removed.

More details about these changes can be found in the included readme file.

Data

All text data are presented in SGML form, using a very simple, minimal markup structure. For every opening tag (DOC, HEADLINE, DATELINE, TEXT, P), there is always a corresponding closing tag. The attribute values in the DOC tag are always presented within double quotes. The id= attribute of DOC consists of the seven-character source abbreviation (in CAPS), an underscore character, an 8-digit date string representing the date of the story (YYYYMMDD), a period, and a 4-digit sequence number starting at 0001 for each date (e.g. XIN_ARB_20010101.0001). In this way, every DOC in the corpus is uniquely identifiable by the id string.
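
The id layout described above can be checked mechanically. The following Python is a sketch under that description only: the names DOC_ID, DocId, and parse_doc_id are hypothetical, and the sample id is made up for illustration rather than taken from the corpus.

```python
import re
from datetime import date, datetime
from typing import NamedTuple

# Source abbreviation in caps, underscore, YYYYMMDD, period, 4-digit sequence.
DOC_ID = re.compile(r"([A-Z]{3}_[A-Z]{3})_(\d{8})\.(\d{4})")


class DocId(NamedTuple):
    source: str    # e.g. "XIN_ARB"
    day: date      # date of the story
    sequence: int  # per-date sequence number, starting at 1


def parse_doc_id(doc_id: str) -> DocId:
    """Parse a DOC id string; raise ValueError if it does not fit the layout."""
    match = DOC_ID.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"unexpected DOC id: {doc_id!r}")
    source, ymd, seq = match.groups()
    # strptime also rejects impossible dates such as month 13.
    return DocId(source, datetime.strptime(ymd, "%Y%m%d").date(), int(seq))


# Illustrative id only; not claimed to exist in the corpus.
print(parse_doc_id("XIN_ARB_20010101.0001"))
```

Using fullmatch rather than search means that an id with a shorter date string or a lowercase prefix is rejected instead of being partially matched.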

For this release, all sources have received a uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into three distinct types. The classification is indicated by the type=string attribute that is included in each opening DOC tag. The three types are:

  • story: This is by far the most frequent type, and it represents the most typical newswire item: a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
  • multi: This type of DOC contains a series of unrelated blurbs, each of which briefly describes a particular topic or event; this is typically applied to DOCs that contain summaries of today's news, news briefs in ... (some general area like finance or sports), and so on.
  • other: This represents DOCs that clearly do not fall into any of the above types -- in general, items of this type are intended for broad circulation (they are not advisories), they may be topically coherent (unlike multi type DOCs), and they typically do not contain paragraphs or sentences (they aren't really stories); these are things like lists of sports scores, stock prices, temperatures around the world, and so on.
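
Because the type is carried as an attribute of the opening DOC tag, documents can be tallied or filtered by type without parsing their contents. The Python below is a rough sketch of that idea: the function name doc_attributes is hypothetical, the sample markup is invented placeholder text rather than corpus data, and the code assumes only that attribute values are double-quoted as described above.

```python
import re
from collections import Counter
from typing import Dict, Iterator

DOC_OPEN = re.compile(r"<DOC\s+([^>]*)>")  # an opening DOC tag and its attributes
ATTR = re.compile(r'(\w+)="([^"]*)"')      # one name="value" pair


def doc_attributes(sgml: str) -> Iterator[Dict[str, str]]:
    """Yield the attributes of each opening DOC tag as a dictionary."""
    for match in DOC_OPEN.finditer(sgml):
        yield dict(ATTR.findall(match.group(1)))


# Invented placeholder markup, not taken from the corpus.
sample = """<DOC id="AFP_ARB_20090101.0001" type="story">
<HEADLINE>
placeholder headline
</HEADLINE>
<TEXT>
<P>
placeholder paragraph
</P>
</TEXT>
</DOC>
<DOC id="AFP_ARB_20090101.0002" type="multi">
<TEXT>
<P>
placeholder blurbs
</P>
</TEXT>
</DOC>
"""

counts = Counter(attrs.get("type") for attrs in doc_attributes(sample))
story_ids = [a["id"] for a in doc_attributes(sample) if a.get("type") == "story"]
print(counts)
print(story_ids)
```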

Other Gigaword corpora (e.g., in English and Chinese) have a fourth category, advis (for advisory), which applies to DOCs that contain text intended solely for news service editors, not the news-reading public. The task of determining patterns for assigning non-story type labels was carried out by a native speaker of Arabic, and the advis category was determined to be inapplicable to the data.

Note that the markup was applied algorithmically, using logic that was based on less-than-complete knowledge of the data. For the most part, the HEADLINE, DATELINE and TEXT tags have their intended content. However, due to the inherent variability (and the inevitable source errors) in the data, users may find occasional mishaps where the headline and/or dateline were not successfully identified (and hence show up within TEXT), or where an initial sentence or paragraph has been mistakenly tagged as the headline or dateline.
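
One practical consequence is that code reading the corpus is safer if it treats HEADLINE and DATELINE as optional. The Python below is a sketch under that assumption: the names Doc and parse_doc are hypothetical, the sample block is invented placeholder text, and the fallback for TEXT without P tags is a guess about how such documents might look, not something the documentation states.

```python
import re
from typing import List, NamedTuple, Optional


class Doc(NamedTuple):
    headline: Optional[str]  # None when no HEADLINE element was found
    dateline: Optional[str]  # None when no DATELINE element was found
    paragraphs: List[str]


def _first(tag: str, block: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag>, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_doc(block: str) -> Doc:
    """Extract headline, dateline, and paragraphs from one DOC block."""
    text = _first("TEXT", block) or ""
    paragraphs = [p.strip() for p in re.findall(r"<P>(.*?)</P>", text, re.DOTALL)]
    if not paragraphs and text:
        # Assumption: TEXT without P tags is kept as a single paragraph.
        paragraphs = [text]
    return Doc(_first("HEADLINE", block), _first("DATELINE", block), paragraphs)


# Invented placeholder block in which no headline or dateline was tagged.
block = """<DOC id="HYT_ARB_20090101.0001" type="story">
<TEXT>
<P>
placeholder headline that was not detected
</P>
<P>
placeholder body paragraph
</P>
</TEXT>
</DOC>"""

doc = parse_doc(block)
print(doc.headline is None, len(doc.paragraphs))
```

A caller that needs a headline for every story could then decide for itself whether to fall back to the first paragraph, which keeps that heuristic out of the parsing step.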

Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

Updates

None at this time.

Content Copyright

Portions © 1994-2010 Agence France Presse, © 2006-2010 Al-Ahram, © 2006-2010 Al-Quds Al-Arabi, © 2006-2010 Asharq Al-Awsat, © 2004-2010 Assabah, © 1994-2003, 2005-2010 Al Hayat, © 1995-2010 An Nahar, © 2003-2010 Ummah Press, © 2001-2010 Xinhua News Agency, © 2003, 2006, 2007, 2009, 2011 Trustees of the University of Pennsylvania