Title: BOLT Arabic Discussion Forum Source Data
Authors: Jennifer Tracey, Haejoong Lee, Stephanie Strassel, Safa Ismael

1.0 Introduction

This release consists of discussion forum threads in Arabic harvested
from the Internet using a combination of manual and automatic
processes. The raw HTML was downloaded for each discussion thread, and
the HTML files were then converted to an XML format. The target of the
collection was Egyptian Arabic, but the collection should be expected
to contain other Arabic varieties as well (see the description of
collection methods below).

2.0 Directory structure

  README                  - this file
  data/                   - directory containing data files
    arz/                  - arz (Egyptian Arabic)
      <format>/           - html or xml
        *.zip             - zip archives containing all threads for
                            a single forum
  docs/                   - directory containing package documentation
    web_text.rng          - a RELAX NG schema for .xml files
    arz_file.tab          - list of threads contained in the package
                            along with their original URL and token
                            count
    arz_suspect_LID.txt   - list of files that may not be primarily
                            in Arabic

3.0 File naming

The file naming convention for HTML and XML files is

  bolt-<lang>-DF-<site>-<forum>-<thread>.<ext>

where

  <lang>   is one of the language IDs: arz, cmn and eng
  <site>   is a numeric ID associated with the web site
  <forum>  is a numeric ID associated with the forum
  <thread> is a numeric ID associated with the discussion thread
  <ext>    is one of "html" or "xml"

4.0 Data profile

  +----------+-------------+---------------+
  | language | num_threads | num_tokens    |
  +----------+-------------+---------------+
  | arz      | 813,080     | 648,423,321   |
  +----------+-------------+---------------+

See docs/arz_file.tab for post/token counts and original URLs for
individual threads.

5.0 Data format

Each HTML file is the raw HTML downloaded from the discussion thread.
If the thread spans multiple URLs, the file is a concatenation of the
downloaded HTML files.

The XML files have the following format.

  <doc id="...">
    <headline>...</headline>
    <post author="..." datetime="...">...</post>
    <post author="..." datetime="...">...</post>
    ...
  </doc>

The "id" attribute of the <doc> element contains the document ID, which
is the file name minus the extension.
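Because a document ID is a file name without its extension, it can be
split into the component IDs described in section 3.0. The following is
a minimal Python sketch, assuming names of the form
bolt-<lang>-DF-<site>-<forum>-<thread> with an optional ".html" or
".xml" extension. The names parse_thread_name and ThreadName, the field
names, and the sample IDs are illustrative only and are not part of this
package.

```python
import re
from typing import NamedTuple, Optional

# Pattern assumed from the naming convention in section 3.0.
# The extension is optional so that bare document IDs also match.
_NAME_RE = re.compile(
    r"^bolt-(arz|cmn|eng)-DF-(\d+)-(\d+)-(\d+)(?:\.(html|xml))?$"
)


class ThreadName(NamedTuple):
    """Components of a file name or document ID (illustrative field names)."""
    lang: str
    site_id: str
    forum_id: str
    thread_id: str
    ext: Optional[str]


def parse_thread_name(name: str) -> Optional[ThreadName]:
    """Split a file name or document ID into its parts.

    Numeric IDs are kept as strings so that any leading zeros survive.
    Returns None if the name does not follow the assumed convention.
    """
    match = _NAME_RE.match(name)
    if match is None:
        return None
    return ThreadName(*match.groups())


if __name__ == "__main__":
    # Hypothetical file name, used only to show the call.
    print(parse_thread_name("bolt-arz-DF-1-22-333.xml"))
```

Keeping the IDs as strings rather than integers is a deliberate choice
here: it avoids altering an ID when it is written back out.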
The <headline> element contains the title (or subject) of the thread.
Each post is represented by a <post> element. Each <post> element has a
poster ("author" attribute), post date ("datetime" attribute), and the
body text. The body of a post is mixed content, and can contain text,
links (<a> elements), image tags (<img> elements), and quotes (<quote>
elements). Quotes may optionally reference an original author and post
date, and the quote body contains the same mixed content as the post
body. A RELAX NG schema for this format is included in the docs
directory.

6.0 Data collection and processing

Collection of threads was seeded based on the results of manual data
scouting by native speaker annotators. Scouts were instructed to seek
content that is in Egyptian Arabic, original (written by the post's
author rather than quoted), interactive, and informal. Upon locating an
appropriate thread, scouts submitted the URL and some simple judgments
about it to a database, via a web browser plugin. When multiple threads
from a forum were submitted, the entire forum was automatically
harvested and added to the collection.

This method allowed collection of large volumes of data, but the scale
of the collection precludes manual review of all data. Only a small
portion of the threads included in this package have been manually
reviewed, and some offensive or otherwise undesired content is to be
expected, as well as some threads that contain a large amount of
non-Arabic content. Language ID was performed on all threads in this
package (using CLD2), and threads for which the LID results indicate a
high probability of largely non-Arabic content are listed in
arz_suspect_LID.txt in the docs directory of this package. Note also
that many threads may contain a mixture of Egyptian and other varieties
of Arabic, even among the threads that are primarily Arabic.
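Users who want to exclude the threads flagged by language ID can filter
on the suspect list. The following is a minimal Python sketch. It
assumes, without confirmation from this README, that
arz_suspect_LID.txt holds one entry per line and that the first
whitespace-separated field of each line is a file name or document ID,
possibly with a directory prefix or an extension. The function names are
illustrative only.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def load_suspect_ids(lines: Iterable[str]) -> Set[str]:
    """Collect document IDs from the lines of the suspect list.

    Assumption: one entry per line, with the file name or document ID
    in the first whitespace-separated field.
    """
    ids: Set[str] = set()
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        # Drop any directory prefix, then any known extension.
        name = PurePosixPath(entry.split()[0]).name
        for ext in (".xml", ".html"):
            if name.endswith(ext):
                name = name[: -len(ext)]
        ids.add(name)
    return ids


def drop_suspect(doc_ids: Iterable[str], suspect: Set[str]) -> List[str]:
    """Return the document IDs that are not on the suspect list."""
    return [doc_id for doc_id in doc_ids if doc_id not in suspect]
```

A typical call would open docs/arz_suspect_LID.txt with
encoding="utf-8", pass the file object to load_suspect_ids, and then
apply drop_suspect to the document IDs of interest.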
Initial scouting efforts permitted scouts to search for content on an
unlimited number of topics, while subsequent instructions refined the
list of topics to those related to current events and other dynamic
events. This data release contains a mix of general and current event
topic threads.

The HTML files were harvested using custom-written harvester scripts.
Each harvester script was paired with a conversion script. HTML
documents were converted using these scripts and validated against a
RELAX NG schema (included in the docs directory). Apart from schema
validation, no further quality control was performed on the data.

For each site, commonly used quote styles were identified and
processed. However, some quotes may not have been recognized, since
quote markup can vary in unanticipated ways.

8. Acknowledgements

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145.
The content does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

9. Contact Information

If you have any questions about the data in this release, please
contact the following personnel at the LDC.

  Jennifer Tracey    - BOLT Discussion Forum collection manager
  Stephanie Strassel - BOLT project PI

-----------
README created by Jennifer Tracey on 6 April 2015
       updated by Jennifer Tracey on 18 December 2015