Arabic Gigaword Fourth Edition


Item Name: Arabic Gigaword Fourth Edition
Authors: Robert Parker, David Graff, Ke Chen, Junbo Kong, and Kazuaki Maeda
LDC Catalog No.: LDC2009T30
ISBN: 1-58563-532-4
Release Date: Dec 17, 2009
Data Type: text
Data Source(s): newswire
Project(s): GALE
Application(s): information retrieval, language modeling, natural language processing
Language(s): Arabic
Language ID(s): arb
Distribution: 1 DVD
Member Fee: $0 for 2009 members
Non-member Fee: US $5000.00
Reduced-License Fee: US $2500.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Robert Parker, et al. 2009. Arabic Gigaword Fourth Edition. Linguistic Data Consortium, Philadelphia.

Introduction

Arabic Gigaword Fourth Edition, Linguistic Data Consortium (LDC) catalog number LDC2009T30 and ISBN 1-58563-532-4, is a comprehensive archive of Arabic newswire text that LDC has acquired over several years. It includes all of the content of Arabic Gigaword Third Edition (LDC2007T40) as well as newly collected data. In addition, three new sources have been added in the fourth edition: Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi.

Nine distinct international sources of Arabic newswire are represented here:

  • Al-Ahram (ahr_arb)
  • Asharq Al-Awsat (aaw_arb)
  • Agence France Presse (afp_arb)
  • Assabah (asb_arb)
  • Al Hayat (hyt_arb)
  • An Nahar (nhr_arb)
  • Al-Quds Al-Arabi (qds_arb)
  • Ummah Press (umh_arb)
  • Xinhua News Agency (xin_arb)

The seven-character codes shown above serve both as the names of the directories where the data files are found and as the prefix that begins every file name. Each code consists of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_").
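The naming scheme described above can be sketched in Python as follows. This is a minimal illustration, not part of the corpus tooling: the function name `parse_prefix` and the sample file name are hypothetical, and only the seven-character prefix is taken from this description (the remainder of a real file name is not documented here).

```python
import re

# Source IDs and names as listed in this catalog entry.
SOURCES = {
    "ahr": "Al-Ahram",
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}

# Three-character source ID, an underscore, three-character language code.
PREFIX_RE = re.compile(r"^([a-z]{3})_([a-z]{3})")


def parse_prefix(name):
    """Return (source_id, language_code, source_name) for a directory or file name."""
    match = PREFIX_RE.match(name)
    if match is None:
        raise ValueError(f"no seven-character prefix in {name!r}")
    source_id, language = match.groups()
    if source_id not in SOURCES:
        raise ValueError(f"unknown source ID {source_id!r}")
    return source_id, language, SOURCES[source_id]


# Hypothetical file name: everything after the prefix is made up for illustration.
print(parse_prefix("afp_arb_example.gz"))
```

Because the directory name and the file-name prefix are the same code, the same function works for both.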

These news services all use Modern Standard Arabic (MSA), so there should be a fairly limited scope for orthographic and lexical variation due to regional Arabic dialects. However, to the extent that regional dialects might have an influence on MSA usage, the following should be noted:

  • Al-Ahram is based in Cairo, Egypt.
  • Asharq Al-Awsat is based in London, England, UK.
  • An Nahar is based in Beirut, Lebanon.
  • Al Hayat was originally a Lebanese news service, but it has been based in London during the entire period represented in this archive.
  • Assabah is based in Tunisia.
  • The Xinhua and Agence France Presse (AFP) services are obviously international in scope (Xinhua is based in Beijing, AFP in Paris), and the regional distribution of Arabic reporters and editors for these services is not known.
  • The content provided by Ummah Press comes from diverse sources throughout the Arabic-speaking world.
  • Al-Quds Al-Arabi is based in London, England, UK.

New in the Fourth Edition

  • New Sources

    This release is the first edition of Arabic Gigaword to include content from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi, covering the period from November 2006 through December 2008.

  • New Data for Existing Sources

    This release contains all data collected by LDC from January 2007 through December 2008, except for Ummah Press, for which data from January 2005 through December 2008 is included.

The table below shows data quantity by source under the following categories: data source (Source); the number of files per source (#Files); compressed file size (Gzip-MB); uncompressed file size (Totl-MB); the number of space-separated word tokens in the text, in thousands (K-words); and the number of documents per source (#DOCs).

Source    #Files  Gzip-MB  Totl-MB  K-words    #DOCs
aaw_arb       26      114      386    36694    87506
afp_arb      176      530     1979   184631   930656
ahr_arb       26      114      131    42265   107187
asb_arb       52       45      149    14322    32794
hyt_arb      166      663     2224   209318   448335
nhr_arb      157      784     2662   253559   557151
qds_arb       26       62      198    18996    49352
umh_arb       68      9.3       31     2995    11350
xin_arb       91      245      890    85689   492664
Totals       788     5018     8650   848469  2716995
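
As a rough illustration of how the table can be used, the sketch below derives the average number of words per document for each source from the word-count and document-count columns. It assumes that the word-count column is expressed in thousands of tokens; the names `ROWS` and `words_per_document` are hypothetical and chosen only for this example.

```python
# Rows copied from the table above: (source, #Files, word count in thousands, #DOCs).
ROWS = [
    ("aaw_arb", 26, 36694, 87506),
    ("afp_arb", 176, 184631, 930656),
    ("ahr_arb", 26, 42265, 107187),
    ("asb_arb", 52, 14322, 32794),
    ("hyt_arb", 166, 209318, 448335),
    ("nhr_arb", 157, 253559, 557151),
    ("qds_arb", 26, 18996, 49352),
    ("umh_arb", 68, 2995, 11350),
    ("xin_arb", 91, 85689, 492664),
]


def words_per_document(k_words, docs):
    """Average tokens per document, assuming k_words counts thousands of tokens."""
    return k_words * 1000 / docs


averages = {source: words_per_document(k, d) for source, _files, k, d in ROWS}

# List sources from longest to shortest average document.
for source, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{source}  {avg:8.1f} words/doc")

# Column sums for the file, word, and document counts.
total_files = sum(row[1] for row in ROWS)
total_k_words = sum(row[2] for row in ROWS)
total_docs = sum(row[3] for row in ROWS)
print(total_files, total_k_words, total_docs)
```

The newswire agencies (afp_arb, xin_arb) come out with shorter average documents than the newspaper sources, which is worth keeping in mind when sampling by document rather than by word count.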

Samples

For an example of the data contained in this corpus, please examine this JPEG image of the text content.

Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Content Copyright

Portions © 1994-2008 Agence France Presse, © 2006-2008 Al-Ahram, © 2006-2008 Al-Quds Al-Arabi, © 2006-2008 Asharq Al-Awsat, © 2004-2008 Assabah, © 1994-2003, 2005-2008 Al Hayat, © 1995-2008 An Nahar, © 2003-2008 Ummah Press, © 2001-2008 Xinhua News Agency, © 2003, 2006, 2007, 2009 Trustees of the University of Pennsylvania