Arabic Newswire Part 1
Item Name: | Arabic Newswire Part 1 |
Author(s): | David Graff, Kevin Walker |
LDC Catalog No.: | LDC2001T55 |
ISBN: | 1-58563-190-6 |
ISLRN: | 013-368-610-633-9 |
DOI: | https://doi.org/10.35111/6at4-b624 |
Member Year(s): | 2001 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Project(s): | EARS, GALE, TIDES, TREC |
Application(s): | information retrieval, language modeling |
Language(s): | Standard Arabic |
Language ID(s): | arb |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2001T55 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Graff, David, and Kevin Walker. Arabic Newswire Part 1 LDC2001T55. Web Download. Philadelphia: Linguistic Data Consortium, 2001. |
Introduction
This publication contains the Arabic Newswire A Corpus, Linguistic Data Consortium (LDC) catalog number LDC2001T55 and ISBN 1-58563-190-6. The Arabic Newswire Corpus is composed of articles from the Agence France Presse (AFP) Arabic Newswire. The source material was tagged using TIPSTER-style SGML and was transcoded to Unicode (UTF-8). The corpus includes articles from May 13, 1994 to December 20, 2000.
Data
The data consists of 2,337 compressed (zipped) Arabic text files, totaling 209 MB compressed (869 MB uncompressed). The corpus contains 383,872 documents and approximately 76 million tokens, with 666,094 unique words.
A template of the tagging is presented below.
<DOC>
<DOCNO>yyyymmdd_AFP_ARB.dddd</DOCNO>
<HEADER>Arabic Text</HEADER>
<BODY>
<HEADLINE>Arabic Text</HEADLINE>
<TEXT>
<P>One or More Paragraphs of Arabic Text</P>
</TEXT>
<FOOTER>Arabic Text</FOOTER>
</BODY>
<TRAILER>Arabic Text</TRAILER>
</DOC>
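As an illustration of how the template above might be read, the following Python sketch pulls the document number, date, headline, and paragraphs out of each <DOC> element with regular expressions. It is not part of the corpus distribution. It assumes a file has already been unzipped and decoded as UTF-8, that tags appear exactly as in the template, and that the first eight digits of DOCNO are a valid calendar date. The function name parse_docs and the sample document (including its DOCNO and Arabic placeholder text) are invented for this example.

```python
import re
from datetime import date

# Patterns follow the tagging template; real files may deviate from it.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(
    r"<DOCNO>\s*(\d{4})(\d{2})(\d{2})_AFP_ARB\.(\d+)\s*</DOCNO>"
)
HEADLINE_RE = re.compile(r"<HEADLINE>(.*?)</HEADLINE>", re.DOTALL)
P_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def parse_docs(sgml_text):
    """Return one dict per <DOC> element found in sgml_text (hypothetical helper)."""
    docs = []
    for match in DOC_RE.finditer(sgml_text):
        content = match.group(1)
        docno = DOCNO_RE.search(content)
        if docno is None:
            # Skip documents whose DOCNO does not match the template.
            continue
        year, month, day, seq = docno.groups()
        headline = HEADLINE_RE.search(content)
        docs.append({
            "docno": f"{year}{month}{day}_AFP_ARB.{seq}",
            # Assumption: the yyyymmdd prefix is a valid calendar date.
            "date": date(int(year), int(month), int(day)),
            "headline": headline.group(1).strip() if headline else None,
            "paragraphs": [p.strip() for p in P_RE.findall(content)],
        })
    return docs


# Invented sample document shaped like the template (not taken from the corpus).
sample = """<DOC>
<DOCNO>19940513_AFP_ARB.0001</DOCNO>
<HEADER>ترويسة</HEADER>
<BODY>
<HEADLINE>عنوان</HEADLINE>
<TEXT>
<P>فقرة أولى</P>
<P>فقرة ثانية</P>
</TEXT>
<FOOTER>تذييل</FOOTER>
</BODY>
<TRAILER>خاتمة</TRAILER>
</DOC>"""

docs = parse_docs(sample)
for doc in docs:
    print(doc["docno"], doc["date"].isoformat(), len(doc["paragraphs"]))
```

Regular expressions are used here only because the template is flat and regular. A full SGML parser would be the safer choice if the actual files contain entities or nested markup not shown in the template.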
Samples
For a sample file of tagged articles, please see this sample.
Updates
There are no updates at this time.