MADCAT Phase 1 Training
LDC2012T15

Authors: David Lee, Safa Ismael, Stephen Grimes, Dave Doermann, Stephanie Strassel, Zhiyi Song

1. Introduction

The MADCAT Phase 1 Training Corpus contains all training data created by the Linguistic Data Consortium to support Phase 1 of the DARPA MADCAT Program. The data consists of handwritten Arabic documents, scanned at high resolution and annotated for the physical coordinates of each line and token. Digital transcripts and English translations of each document are also provided, with the various content and annotation layers integrated in a single MADCAT XML output.

The data creation process for MADCAT begins with Arabic source documents in three genres: newswire, weblog and newsgroup text, originally collected as part of the DARPA GALE Program. Arabic-speaking "scribes" copy the documents by hand, following specific instructions as to the writing speed (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents are processed to optimize their appearance for the handwriting task, which may result in original source documents being broken into multiple "pages" for handwriting. Each resulting page is assigned to up to 5 independent scribes, using different writing conditions.

Once handwritten data has been collected from scribes, it is checked for quality and completeness, then each page is scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images are then annotated to indicate the physical coordinates of each line and token. Explicit reading order is also labeled, as are any errors produced by the scribes when copying the text.

The final step is to produce a unified data format that takes multiple data streams and generates a single XML output file that contains all required information.
The XML file contains four distinct components: a text layer that consists of the source text, tokenization and sentence segmentation; an image layer that consists of bounding boxes; a scribe demographic layer that consists of scribe ID and partition (train/test); and a document metadata layer.

The docs/ directory includes the file flowchart.pdf, which provides additional details of the data creation process.

2. Data Profile

This release includes 9693 annotation files in MADCAT XML format (.madcat.xml) along with their corresponding scanned image files in TIFF format. Files are named as follows:

    galeID_page#_scribeID.{tif|madcat.xml}

3. Contents

3.1 Data

The data/ directory contains two subdirectories:

data/images/
    This directory contains TIFF image files (.tif) with scanned images of handwritten Arabic text.

data/madcat/
    This directory contains the official MADCAT XML files (.madcat.xml), which include ground truth annotations, source transcripts, and translations. All .madcat.xml files were validated using NIST's MADCATEval_1.0 toolkit.

3.2 Documentation

README.txt - this file

The docs/ directory includes:

- a tab-delimited list containing information about the files, such as document ID, type, number of tokens, and number of segments

- a table that summarizes collected demographic information on included scribes

- MADCAT_Data_Format_Spec_v4h.pdf: a description with examples of the MADCAT XML format

- madcat.v1.0.5.dtd: the DTD for the madcat.xml files

- madcat.v1.0.6.std: an XML schema equivalent of the DTD above

- the instructions given to scribes for each document, outlining the specific conditions for handwriting each MADCAT document (e.g., pen vs. pencil)

  The writing conditions themselves are listed as three-letter strings, which may be interpreted as follows:

  First letter refers to writing instrument:
      I => Pen
      P => Pencil

  Second letter refers to type of paper:
      U => Unlined Paper
      L => Lined Paper

  Third letter refers to writing speed:
      C => Careful
      N => Normal
      F => Fast

- flowchart.pdf: explains the annotation process

- files.md5: md5 checksums of files in data directory

4. Copyright Information

Portions (c) 2007 Al-Ahram, (c) 2007 Al Hayat, (c) 2007 Al Quds - Al Arabi, (c) 2007 Asharq Al-Awsat, (c) 2007 An Nahar, (c) 2007 Assabah, (c) 2007-2010 Trustees of the University of Pennsylvania.

--
April 2012
Stephanie Strassel
Stephen Grimes
Zhiyi Song
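
Example: reading file names and writing condition codes

The file naming pattern in Section 2 and the three-letter writing condition codes in Section 3.2 can be handled with a few lines of Python. The sketch below is illustrative only and is not part of the release. The function and class names (parse_name, decode_conditions, CorpusFile) are hypothetical, and the sample file name in the usage note is invented. The parser assumes that the page number and scribe ID contain no underscores, while the GALE ID may; that assumption should be checked against the actual file list.

```python
from pathlib import PurePosixPath
from typing import NamedTuple

# The two file types named in the README's pattern.
SUFFIXES = (".madcat.xml", ".tif")

# Code tables copied from the writing condition key in Section 3.2.
INSTRUMENT = {"I": "pen", "P": "pencil"}
PAPER = {"U": "unlined", "L": "lined"}
SPEED = {"C": "careful", "N": "normal", "F": "fast"}


class CorpusFile(NamedTuple):
    """Hypothetical container for the fields of one corpus file name."""
    gale_id: str
    page: str
    scribe_id: str
    kind: str  # "tif" or "madcat.xml"


def parse_name(filename: str) -> CorpusFile:
    """Split galeID_page#_scribeID.{tif|madcat.xml} into its fields."""
    name = PurePosixPath(filename).name
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            stem = name[: -len(suffix)]
            break
    else:
        raise ValueError(f"unexpected file type: {filename!r}")
    # Assumption: only the GALE ID may contain underscores, so the
    # last two underscore-separated fields are page and scribe ID.
    parts = stem.rsplit("_", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"name does not match the pattern: {filename!r}")
    gale_id, page, scribe_id = parts
    return CorpusFile(gale_id, page, scribe_id, suffix.lstrip("."))


def decode_conditions(code: str) -> dict:
    """Expand a three-letter writing condition string such as 'IUC'."""
    if len(code) != 3:
        raise ValueError(f"expected three letters, got {code!r}")
    try:
        return {
            "instrument": INSTRUMENT[code[0]],
            "paper": PAPER[code[1]],
            "speed": SPEED[code[2]],
        }
    except KeyError as exc:
        raise ValueError(f"unknown letter in {code!r}") from exc
```

With an invented name, parse_name("data/madcat/DOC_ID.0001_2_SCRIBE7.madcat.xml") returns the fields "DOC_ID.0001", "2", "SCRIBE7" and "madcat.xml", and decode_conditions("PLF") returns pencil, lined, fast. Splitting from the right keeps any underscores inside the GALE ID intact.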