Title: MADCAT Phase 3 Training Set
Authors: David Lee, Safa Ismael, Dave Doermann, Stephanie Strassel, Zhiyi Song, Stephen Grimes

1. Introduction

The MADCAT (Multilingual Automatic Document Classification, Analysis and Translation) Phase 3 Training Set contains all training data created by the Linguistic Data Consortium to support Phase 3 of the DARPA MADCAT Program. The data consists of handwritten Arabic documents, scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.

The data creation process for MADCAT begins with Arabic source documents in three genres: newswire, weblog and newsgroup text, originally collected as part of the DARPA GALE Program. Arabic-speaking "scribes" copy each document by hand, following specific instructions as to the writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents are processed to optimize their appearance for the handwriting task, which may result in an original source document being broken into multiple "pages" for handwriting. Each resulting page is assigned to up to 5 independent scribes, using different writing conditions.

Once handwritten data has been collected from the scribes, it is checked for quality and completeness. Each page is then scanned at high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images are annotated to indicate the physical coordinates of each line and token. Explicit reading order is also labeled, as are any errors produced by the scribes when copying the text.

The final step is to produce a unified data format that takes the multiple data streams and generates a single XML output file containing all required information.
The XML file contains four distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer.

The docs/ directory includes the file DataFlowChart.pdf, which provides additional details of the data creation process.

2. Data Profile

This release includes 4,540 annotation files in both GEDI XML (.gedi.xml) and MADCAT XML (.madcat.xml) format, along with their corresponding scanned image files in TIFF format.

Files are named as follows:

    galeID_page#_scribeID.{tif|gedi.xml|madcat.xml}

3. Contents

3.1 Data

The data/ directory contains three subdirectories:

data/images/
    This directory contains TIFF image files (.tif), which are scanned images of handwritten Arabic text.

data/gedi/
    This directory contains GEDI XML files (.gedi.xml), which include ground truth annotations and source transcripts. All .gedi.xml files were validated using gedi.dtd under dtd/.

data/madcat/
    This directory contains the official MADCAT XML files (.madcat.xml), which include ground truth annotations, source transcripts, and translations. All .madcat.xml files were validated using madcat.v1.1.1.dtd under dtd/.

3.2 Documentation

README.txt: this file

The docs/ directory includes:

    FileStats.tab: a tab-delimited list containing information about the files, such as document ID, type, number of tokens, and number of segments
    Scribe_Demographics.tab: a table that summarizes the demographic information collected on the included scribes
    MADCAT_Data_Format_Spec_v4h.pdf: a description, with examples, of the MADCAT XML format
    writing_conditions.tab: the writing conditions assigned to scribes for handwriting each MADCAT document (e.g. pen vs. pencil)
    DataFlowChart.pdf: explains the data creation process
    files.md5: md5 checksums of files in the data directory

The writing conditions in writing_conditions.tab are listed as three-letter strings, which may be interpreted as follows:

    First letter refers to the writing instrument:
        I => Pen
        P => Pencil
    Second letter refers to the type of paper:
        U => Unlined Paper
        L => Lined Paper
    Third letter refers to the writing speed:
        C => Careful
        N => Normal
        F => Fast

The dtd/ directory includes:

    gedi.dtd: the DTD for the gedi.xml files
    madcat.v1.1.1.dtd: the DTD for the madcat.xml files
    madcat.v1.1.1.std: an XML schema equivalent of the DTD above

4. Copyright Information

Portions © 2006 Agence France Presse, Al-Ahram, Al Hayat, Al Quds-Al Arabi, An Nahar, Asharq Al-Awsat, Assabah, Xinhua News Agency, © 2006, 2013 Trustees of the University of Pennsylvania

--
December 2012
Stephanie Strassel
Stephen Grimes
Zhiyi Song
Xiaoyi Ma
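
Example: decoding file names and writing conditions

The file naming pattern in Section 2 and the three-letter writing condition codes in Section 3.2 can be decoded with a few lines of Python. The following is a minimal sketch, not part of the release. The function and class names are hypothetical, and it assumes that the page and scribe ID fields contain no underscores (the GALE ID may) and that condition codes are upper case. The file name in the usage note is made up for illustration and does not refer to an actual file in the corpus.

```python
from typing import Dict, NamedTuple

# File suffixes listed in the Data Profile section.
KNOWN_SUFFIXES = (".madcat.xml", ".gedi.xml", ".tif")

# Writing condition letters as listed in the Documentation section.
INSTRUMENT = {"I": "pen", "P": "pencil"}
PAPER = {"U": "unlined", "L": "lined"}
SPEED = {"C": "careful", "N": "normal", "F": "fast"}


class FileName(NamedTuple):
    gale_id: str
    page: str
    scribe_id: str
    suffix: str


def parse_file_name(name: str) -> FileName:
    """Split galeID_page#_scribeID.<suffix> into its parts.

    Assumption: page and scribe ID contain no underscores, so the
    name is split from the right and the GALE ID keeps any of its own.
    """
    for suffix in KNOWN_SUFFIXES:
        if name.endswith(suffix):
            stem = name[: -len(suffix)]
            break
    else:
        raise ValueError(f"unrecognized suffix: {name!r}")

    parts = stem.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected galeID_page#_scribeID, got {stem!r}")
    gale_id, page, scribe_id = parts
    return FileName(gale_id, page, scribe_id, suffix)


def decode_writing_condition(code: str) -> Dict[str, str]:
    """Decode a three-letter writing condition string, e.g. 'IUC'."""
    if len(code) != 3:
        raise ValueError(f"expected three letters, got {code!r}")
    try:
        return {
            "instrument": INSTRUMENT[code[0]],
            "paper": PAPER[code[1]],
            "speed": SPEED[code[2]],
        }
    except KeyError as exc:
        raise ValueError(f"unknown letter in {code!r}") from exc
```

For instance, parse_file_name("DOC_ID_0001_p2_S07.madcat.xml") splits the made-up name into the GALE ID "DOC_ID_0001", the page "p2", the scribe ID "S07" and the suffix ".madcat.xml", and decode_writing_condition("PLF") reads as pencil, lined paper, fast. Splitting from the right is a deliberate choice so that underscores inside the GALE ID do not disturb the other two fields.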