Arabic Gigaword Fourth Edition
|Item Name:||Arabic Gigaword Fourth Edition|
|Author(s):||Robert Parker, David Graff, Ke Chen, Junbo Kong, Kazuaki Maeda|
|LDC Catalog No.:||LDC2009T30|
|Release Date:||December 17, 2009|
|Application(s):||natural language processing, language modeling, information retrieval|
|Language(s):||Standard Arabic, Arabic|
|Language ID(s):||arb, ara|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC2009T30 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Parker, Robert, et al. Arabic Gigaword Fourth Edition LDC2009T30. Web Download. Philadelphia: Linguistic Data Consortium, 2009.|
Arabic Gigaword Fourth Edition, Linguistic Data Consortium (LDC) catalog number LDC2009T30 and ISBN 1-58563-532-4, is a comprehensive archive of Arabic newswire text that LDC has acquired over several years. It includes all of the content of Arabic Gigaword Third Edition (LDC2007T40) as well as newly collected data. In addition, three new sources have been added in the fourth edition: Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi.
Nine distinct international sources of Arabic newswire are represented here:
- Al-Ahram (ahr_arb)
- Asharq Al-Awsat (aaw_arb)
- Agence France Presse (afp_arb)
- Assabah (asb_arb)
- Al Hayat (hyt_arb)
- An Nahar (nhr_arb)
- Al-Quds Al-Arabi (qds_arb)
- Ummah Press (umh_arb)
- Xinhua News Agency (xin_arb)
The seven-character codes shown above serve both as the names of the directories where the data files are found and as the prefix that begins every file name. Each code consists of a three-character source ID and the three-character language code ("arb"), separated by an underscore ("_") character.
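The naming scheme described above can be illustrated with a short Python sketch that splits a file name's seven-character prefix into its source ID and language code. This is only an illustration, not LDC tooling: the function name `parse_prefix` and the sample file name are hypothetical, and whatever follows the prefix in real file names is not specified here, so the sketch ignores it.

```python
import re

# Source IDs and names, copied from the list of nine sources above.
SOURCES = {
    "ahr": "Al-Ahram",
    "aaw": "Asharq Al-Awsat",
    "afp": "Agence France Presse",
    "asb": "Assabah",
    "hyt": "Al Hayat",
    "nhr": "An Nahar",
    "qds": "Al-Quds Al-Arabi",
    "umh": "Ummah Press",
    "xin": "Xinhua News Agency",
}

# Three-character source ID, an underscore, three-character language code.
PREFIX_RE = re.compile(r"^([a-z]{3})_([a-z]{3})")


def parse_prefix(file_name):
    """Return (source_id, language_code, source_name) for a file name.

    Hypothetical helper: only the first seven characters are examined.
    """
    match = PREFIX_RE.match(file_name)
    if match is None:
        raise ValueError(f"no seven-character prefix in {file_name!r}")
    source_id, language_code = match.groups()
    if source_id not in SOURCES:
        raise ValueError(f"unknown source ID {source_id!r}")
    return source_id, language_code, SOURCES[source_id]


# The part after the prefix is a made-up placeholder, not a real file name.
print(parse_prefix("xin_arb_example"))
```

Because the same code names the directory, the returned source ID can also be used to locate the directory that holds a given file.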
These news services all use Modern Standard Arabic (MSA), so there should be a fairly limited scope for orthographic and lexical variation due to regional Arabic dialects. However, to the extent that regional dialects might have an influence on MSA usage, the following should be noted:
- Al-Ahram is based in Cairo, Egypt.
- Asharq Al-Awsat is based in London, England, UK.
- An Nahar is based in Beirut, Lebanon.
- Al Hayat was originally a Lebanese news service, but it has been based in London during the entire period represented in this archive.
- Assabah is based in Tunisia.
- The Xinhua and Agence France Presse (AFP) services are obviously international in scope (Xinhua is based in Beijing, AFP in Paris), and the regional distribution of Arabic reporters and editors for these services is not known.
- The content provided by Ummah Press comes from diverse sources throughout the Arabic-speaking world.
- Al-Quds Al-Arabi is based in London, England, UK.
New in the Fourth Edition
- New Sources
This release marks the first edition of Arabic Gigaword to include content from Al-Ahram, Asharq Al-Awsat and Al-Quds Al-Arabi covering the period from November 2006 through December 2008.
- New Data for Existing Sources
This release contains all data collected by LDC from January 2007 through December 2008, except for Ummah Press, for which data from January 2005 through December 2008 is included.
The table below shows data quantity by source under the following categories: data source (Source); the number of files per source (#Files); compressed file size (Gzip-MB); uncompressed file size (Totl-MB); the number of space-separated word tokens in the text (K-words); and the number of documents per source (#DOCs).
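As a rough illustration of the K-words measure, the sketch below counts space-separated word tokens in a string and expresses the total in thousands. It rests on assumptions: the exact counting procedure used by LDC is not described beyond "space-separated", so Python's `str.split()` (which splits on any run of whitespace) stands in for it, and the function names and sample sentence are hypothetical.

```python
def count_word_tokens(text):
    """Count whitespace-separated tokens (assumed stand-in for LDC's count)."""
    return len(text.split())


def to_k_words(token_count):
    """Express a token count in thousands of words, as in the K-words column."""
    return token_count / 1000


# Made-up sample sentence, not taken from the corpus.
sample = "هذا نص عربي قصير"
tokens = count_word_tokens(sample)
print(tokens, to_k_words(tokens))
```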
For an example of the data contained in this corpus, please examine this JPEG image of the text content.
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.