
Arabic Gigaword Second Edition

Item Name:	Arabic Gigaword Second Edition
Author(s):	David Graff, Ke Chen, Junbo Kong, Kazuaki Maeda
LDC Catalog No.:	LDC2006T02
ISBN:	1-58563-371-2
ISLRN:	299-814-033-635-4
DOI:	https://doi.org/10.35111/scbm-4q37
Release Date:	January 19, 2006
Member Year(s):	2006
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	information retrieval, language modeling
Language(s):	Standard Arabic
Language ID(s):	arb
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2006T02 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Graff, David, et al. Arabic Gigaword Second Edition LDC2006T02. Web Download. Philadelphia: Linguistic Data Consortium, 2006.
Related Works:
isVersionOf	LDC2003T12 Arabic Gigaword
hasVersion	LDC2007T40 Arabic Gigaword Third Edition; LDC2009T30 Arabic Gigaword Fourth Edition; LDC2011T11 Arabic Gigaword Fifth Edition
hasAnnotation	LDC2018T08 2007 CoNLL Shared Task - Arabic & English
hasOutcome	LDC2007T08 ISI Arabic-English Automatically Extracted Parallel Text; LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation; LDC2010T14 NIST 2005 Open Machine Translation (OpenMT) Evaluation

Introduction

Arabic Gigaword Second Edition was developed by the Linguistic Data Consortium (LDC) and contains approximately 1.6 million documents (1,591,983) of Arabic newswire text collected by LDC.

This second edition includes all of the content of the first edition of Arabic Gigaword (LDC2003T12) as well as new data.

Data

The following table summarizes the corpus by source. For each source it lists the source code used in this edition, the corresponding code from the first edition, the collection span and number of documents new to this edition, the total number of documents, and the word count in K-words (thousands of words). Ummah Press is a new source in the second edition and therefore has no first edition code.

Source	Second Edition Codes	First Edition Codes	Second Edition Collection Span	New Docs	Total Docs	K-words
Agence France Presse	afp_arb	afa	01/2003 - 12/2004	143,766	660,621	123,594
Al Hayat News Agency	hyt_arb	alh	01/2002 - 12/2003	64,308	369,555	169,100
An Nahar News Agency	nhr_arb	ann	01/2003 - 01/2004	16,316	344,084	151,078
Ummah Press	umh_arb		01/2003 - 12/2004	4,641	4,641	1,201
Xinhua News Agency	xin_arb	xia	06/2003 - 12/2004	106,236	213,082	36,933
Total				335,267	1,591,983	481,906
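As a quick consistency check, the Total row can be recomputed from the per-source rows. The sketch below copies the figures from the table; the names SOURCES and column_totals are illustrative and not part of the corpus documentation.

```python
# Per-source figures copied from the table above, keyed by second edition code:
# (new docs, total docs, K-words)
SOURCES = {
    "afp_arb": (143_766, 660_621, 123_594),
    "hyt_arb": (64_308, 369_555, 169_100),
    "nhr_arb": (16_316, 344_084, 151_078),
    "umh_arb": (4_641, 4_641, 1_201),
    "xin_arb": (106_236, 213_082, 36_933),
}


def column_totals(sources):
    """Sum each column (new docs, total docs, K-words) across all sources."""
    new_docs, total_docs, k_words = (sum(col) for col in zip(*sources.values()))
    return new_docs, total_docs, k_words


print(column_totals(SOURCES))  # (335267, 1591983, 481906)
```

The three sums agree with the Total row: 335,267 new documents, 1,591,983 documents overall, and 481,906 K-words.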

Further statistics for each source are included in the corpus documentation. All text files in this corpus have been converted to UTF-8 character encoding.

Owing to the use of UTF-8, the SGML tagging within each file shows up as lines of single-byte-per-character (ASCII) text, whereas lines of actual text data, including article headlines and datelines, contain a mixture of single-byte and multi-byte characters. In general, single-byte characters in the text data will consist of digits and punctuation marks (where the original source relied on ASCII punctuation codes, rather than Arabic-specific punctuation), whereas multi-byte characters consist of Arabic letters and a small number of special punctuation or other symbols. This variable-width character encoding is intrinsic to UTF-8, and all UTF-8 capable processes will handle the data appropriately.
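The mix of single-byte and multi-byte characters can be illustrated with a minimal Python sketch. The sample strings are invented for illustration and are not taken from the corpus; byte_widths and is_pure_ascii are hypothetical helper names.

```python
def byte_widths(text):
    """Return the number of UTF-8 bytes used by each character in text."""
    return [len(ch.encode("utf-8")) for ch in text]


def is_pure_ascii(line):
    """True if every character in the line occupies a single byte in UTF-8."""
    return all(width == 1 for width in byte_widths(line))


# ASCII digits and punctuation take one byte each.
print(byte_widths("12/2004."))  # [1, 1, 1, 1, 1, 1, 1, 1]

# Arabic letters (here the word "salam") take two bytes each.
print(byte_widths("سلام"))  # [2, 2, 2, 2]

# A line mixing Arabic letters with ASCII digits is not pure ASCII.
print(is_pure_ascii("سلام 2004"))  # False
```

A check like is_pure_ascii is one rough way to tell markup lines from text lines, though a text line made up only of digits and ASCII punctuation would also pass it.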

Each data file name consists of the seven-character source code (for example, afp_arb), an underscore character ("_"), and a six-digit date giving the year and month in which the file contents were generated by the respective news source. Each file therefore contains all the usable data received by LDC for the given month from the given news source.
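The naming scheme lends itself to simple parsing. The following is a minimal sketch, assuming the six digits are ordered as a four-digit year followed by a two-digit month (YYYYMM) and that the prefix is one of the second edition codes from the table; the sample name "afp_arb_200301" and the helper parse_file_name are illustrative, not taken from the corpus documentation.

```python
import re
from typing import NamedTuple


class GigawordFileName(NamedTuple):
    source: str  # seven-character source code, e.g. "afp_arb"
    year: int
    month: int


# Assumption: three lowercase letters + "_arb", an underscore, then YYYYMM.
# Anything after the six digits (such as a file extension) is ignored.
_NAME_RE = re.compile(r"^([a-z]{3}_arb)_(\d{4})(\d{2})(?!\d)")


def parse_file_name(name):
    """Split a data file name into source code, year, and month."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized file name: {name!r}")
    year, month = int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in file name: {name!r}")
    return GigawordFileName(match.group(1), year, month)


print(parse_file_name("afp_arb_200301"))
# GigawordFileName(source='afp_arb', year=2003, month=1)
```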

All text data are presented in SGML form, using a very simple, minimal markup structure. The file gigaword_a.dtd in the "dtd" directory provides the formal "Document Type Declaration" for parsing the SGML content. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using this DTD file.

Unlike older corpora, the present corpus uses only the information structure that is common to all sources and serves a clear function: headline, dateline, and core news content (usually containing paragraphs).

All sources have received uniform treatment in terms of quality control, and every DOC has been categorized into one of three distinct "types":

story	this type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences
multi	this type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ... (some general area like finance or sports)" and so on
other	these DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on
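A tally of documents by type could be sketched as follows. This assumes that each document opens with a DOC tag carrying a type attribute, as in <DOC id="..." type="story">; the attribute names and the sample markup are assumptions for illustration, and the DTD file gigaword_a.dtd is the authoritative description of the markup.

```python
import re
from collections import Counter

# Assumption: the type is stored as type="..." on the opening DOC tag.
DOC_OPEN_RE = re.compile(r'<DOC\b[^>]*\btype="(story|multi|other)"', re.IGNORECASE)


def count_doc_types(sgml_text):
    """Count DOC elements by their type attribute value."""
    return Counter(m.group(1).lower() for m in DOC_OPEN_RE.finditer(sgml_text))


# Hypothetical sample markup, not copied from the corpus.
sample = """
<DOC id="EXAMPLE_0001" type="story">
</DOC>
<DOC id="EXAMPLE_0002" type="other">
</DOC>
<DOC id="EXAMPLE_0003" type="story">
</DOC>
"""

counts = count_doc_types(sample)
print(counts["story"], counts["multi"], counts["other"])  # 2 0 1
```

A regular expression is enough for a quick count over opening tags; for anything that depends on document structure, a real SGML parser driven by the DTD is the safer choice.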

Samples

For an example of the data in this corpus, please view this sample (TXT).

Copyright

Portions © 1994-2004 Agence France Presse, © 1994-2003 Al Hayat News Agency, © 1995-2004 An Nahar News Agency, © 2001-2004 Xinhua News Agency, © 2003-2004 Ummah Press, © 2005-2006 Trustees of the University of Pennsylvania
