Title: MADCAT Phase 1, 2, and 3 Composite Evaluation Set

Authors: David Lee, Safa Ismael, Dave Doermann, Stephanie Strassel,
         Song Chen, Stephen Grimes


1. Introduction

MADCAT (Multilingual Automatic Document Classification Analysis and
Translation) was a technology evaluation program sponsored by the U.S.
Defense Advanced Research Projects Agency (DARPA). MADCAT's goal was to
produce systems that could automatically convert foreign language text
images into English transcripts for use by humans and downstream
processes, including summarization and information extraction. The core
evaluation task in MADCAT was the translation of handwritten Arabic
documents. The Linguistic Data Consortium (LDC) created publicly
available linguistic resources to support the MADCAT evaluations.

The MADCAT Phase 1, 2, and 3 Composite Evaluation Set contains all
evaluation data created by LDC to support Phases 1, 2, and 3 of the
DARPA MADCAT Program. Phase 2 and Phase 3 data were also used to support
the NIST OpenHaRT 2010 and 2013 evaluations.

The data consists of handwritten Arabic documents, scanned at high
resolution and annotated for the physical coordinates of each line and
token. Digital transcripts and English translations of each document are
also provided, with the various content and annotation layers integrated
in a single MADCAT XML output.

The data was previously released in the following packages:

  LDC2008R43 - MADCAT Phase 1 Pilot Eval NIST V1.0
  LDC2008E52 - MADCAT Phase 1 Pilot Evaluation
  LDC2009R65 - MADCAT Phase 2 Eval Gold Standard Reference
  LDC2010R65 - MADCAT Phase 3 Eval Gold Standard Reference


2. Data Profile

This release includes 1,643 images and corresponding annotation files.

  phase | file_count
  ---------------------
  1     |   470
  ---------------------
  2     |   540
  ---------------------
  3     |   633
  ---------------------
  Total | 1,643

For details of the data profile of each phase, please refer to the
documentation of that phase under ./docs.


3. MADCAT Data Creation Process

The data creation process for MADCAT began with Arabic source documents
in three genres: newswire, weblog and newsgroup text, originally
collected as part of the DARPA GALE Program. Arabic-speaking "scribes"
copied the documents by hand, following specific instructions as to the
writing style (fast, normal, careful), writing implement (pen, pencil)
and paper (lined, unlined). Prior to assignment, source documents were
processed to optimize their appearance for the handwriting task, which
may have resulted in original source documents being broken into
multiple "pages" for handwriting. Each resulting page was assigned to up
to three independent scribes, using different writing conditions.

Once handwritten data had been collected from scribes, it was checked
for quality and completeness, then each page was scanned at a high
resolution (600 dpi, greyscale) to create a digital version of the
handwritten document. The scanned images were then annotated to indicate
the physical coordinates of each line and token. Explicit reading order
was also labeled, as were any errors produced by the scribes when
copying the text.

The final step was to produce a unified data format that takes multiple
data streams and generates a single MADCAT XML output file containing
all required information. The XML file contains four distinct
components:

  - a text layer that consists of the source text, tokenization and
    sentence segmentation;
  - an image layer that consists of bounding boxes;
  - a scribe demographic layer that consists of scribe ID and partition
    (train/test); and
  - a document metadata layer.

The ./docs/ directory includes the file DataFlowChart.pdf, which
provides additional details of the data creation process.

All madcat.xml files were validated using NIST's MADCATEval_1.0 toolkit.
You can find the toolkit on NIST's website:

  https://www.nist.gov/itl/iad/mig/tools


4. Contents

4.1 Data

The data/ directory contains three subdirectories, one for each phase.
The subdirectories they contain are enumerated below:

  data/phase1/madcat/
    This directory contains MADCAT XML files (.madcat.xml), which
    include ground truth annotations, source transcripts, and
    translations.

  data/phase1/gedi/
    This directory contains GEDI XML files (.gedi.xml), which include
    ground truth annotation.

  data/phase1/images/
    This directory contains scanned image files (.tif) of the
    handwritten documents.

  data/phase1/segmentation/
    This directory contains segmentation-only MADCAT XML files
    (.textseg.madcat.xml | .wordseg.madcat.xml | .lineseg.madcat.xml).

  data/phase2/gedi/
    This directory contains the output of the GEDI annotation tool.

  data/phase2/madcat/
    This directory contains the full reference MADCAT XML files
    (.madcat.xml).

  data/phase2/line-segmentation/
    This directory contains line-segmentation-only MADCAT XML files
    (.lineseg.madcat.xml). These files are included for reference only.

  data/phase2/segmentation/
    This directory contains word-segmentation-only MADCAT XML files
    (.seg.madcat.xml). These files are included for reference only.

  data/phase2/images/
    This directory contains scanned image files (.tif) of the
    handwritten documents.

  data/phase3/gedi/
    This directory contains the output of the GEDI annotation tool.

  data/phase3/madcat/
    This directory contains the full reference MADCAT XML files
    (.madcat.xml).

  data/phase3/images/
    This directory contains scanned image files (.tif) of the
    handwritten documents.

4.2 DTD

The DTDs for each phase can be found in ./dtd/phase{1,2,3}.

  ../dtd/phase1/
    - madcat.v1.0.5.dtd: the DTD for the madcat.xml files
    - madcat.v1.0.6.xsd: an XML schema equivalent of the DTD above
      (for reference only)

  ../dtd/phase2/
    - madcat.v1.0.5.dtd: the DTD for the madcat.xml files
    - madcat.v1.0.6.xsd: an XML schema equivalent of the DTD above
    - madcat.v1.1.0.dtd: a DTD for segmentation-only files
      (for reference only)
    - madcat.v1.1.0.xsd: an XML schema equivalent of the DTD above
      (for reference only)

  ../dtd/phase3/
    - madcat.v1.1.1.dtd: a DTD for segmentation-only files
      (for reference only)
    - madcat.v1.1.1.xsd: an XML schema equivalent of the DTD above
      (for reference only)

4.3 Documentation

The docs/ directory includes:

  README.txt:
    this file

  MADCAT_Data_Format_Spec_v4h.pdf:
    a description, with examples, of the MADCAT XML format

  DataFlowChart.pdf:
    an illustration of the detailed workflow of the data creation
    process

  openhart2010_filelist.txt:
    Phase 2 files that were used in the OpenHaRT 2010 evaluation

  openhart2013_filelist.txt:
    Phase 3 files that were used in the OpenHaRT 2013 evaluation

  Phase{1,2,3}_FileStats.tab:
    a list of MADCAT XML file statistics

  Phase{1,2,3}_ScribeDemographics.tab:
    demographic information for all participating scribes, wherever
    available

  Phase{1,2,3}_writing_conditions.tab:
    the handwriting instructions given to scribes for each MADCAT
    document (e.g., pen vs. pencil). The writing conditions are listed
    as three-letter strings, which may be interpreted as follows:

    First letter refers to writing instrument:
      I => Pen
      P => Pencil

    Second letter refers to type of paper:
      U => Unlined Paper
      L => Lined Paper

    Third letter refers to writing speed:
      C => Careful
      N => Normal
      F => Fast


5. Sponsorship

This work was supported in part by the Defense Advanced Research
Projects Agency, MADCAT Program No. HR0011-08-1-004 and GALE Program
Grant No. HR0011-06-1-0003. The content of this publication does not
necessarily reflect the position or the policy of the Government, and no
official endorsement should be inferred.


6. Copyright Information

Portions © 2007-2008 Al-Ahram, Al Hayat, Al Quds - Al Arabi,
Asharq Al-Awsat, An Nahar, Assabah, Agence France Presse,
Xinhua News Agency
© 2004-2013 Trustees of the University of Pennsylvania


7. Contact

Stephanie Strassel
Song Chen
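

Usage note: decoding writing conditions

The three-letter writing-condition strings described in section 4.3 can
be decoded mechanically. The following is a minimal Python sketch, not
part of the corpus tooling: the function name decode_writing_condition
and the keys of the returned dictionary are hypothetical, and only the
letter mappings themselves are taken from this README. How the strings
are laid out inside the Phase{1,2,3}_writing_conditions.tab files is not
described here, so the sketch handles a single string only.

```python
# Letter-to-meaning tables copied from section 4.3 of this README.
INSTRUMENT = {"I": "pen", "P": "pencil"}
PAPER = {"U": "unlined", "L": "lined"}
SPEED = {"C": "careful", "N": "normal", "F": "fast"}


def decode_writing_condition(code):
    """Decode a three-letter writing-condition string such as 'IUN'.

    Hypothetical helper for illustration; the name and the returned
    dictionary keys are assumptions, not part of the MADCAT release.
    """
    code = code.strip()
    if len(code) != 3:
        raise ValueError(f"expected a three-letter code, got {code!r}")
    try:
        return {
            "instrument": INSTRUMENT[code[0]],  # first letter
            "paper": PAPER[code[1]],            # second letter
            "speed": SPEED[code[2]],            # third letter
        }
    except KeyError as exc:
        # A letter outside the documented tables is reported, not guessed.
        raise ValueError(
            f"unknown letter {exc.args[0]!r} in code {code!r}"
        ) from None


if __name__ == "__main__":
    print(decode_writing_condition("PLF"))
```

Under these assumptions, "PLF" decodes to pencil, lined paper, fast
writing. Codes containing letters not listed in section 4.3 raise a
ValueError rather than being silently mapped.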
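

Usage note: checking image counts against the Data Profile

After unpacking the release, the per-phase counts in section 2 can be
compared with the contents of the data/phase{1,2,3}/images/ directories
listed in section 4.1. The sketch below is an illustration only. It
assumes that each unit counted under file_count corresponds to exactly
one .tif file in the matching images/ directory, which this README
implies but does not state outright. The function names count_images and
report_mismatches, and the corpus_root argument, are hypothetical.

```python
from pathlib import Path

# Per-phase counts as listed in section 2 (Data Profile) of this README.
EXPECTED_COUNTS = {"phase1": 470, "phase2": 540, "phase3": 633}


def count_images(corpus_root):
    """Count .tif files in data/<phase>/images/ for each phase.

    corpus_root is the directory that contains data/ (an assumption
    about where the release was unpacked). Missing directories count
    as zero.
    """
    root = Path(corpus_root)
    counts = {}
    for phase in EXPECTED_COUNTS:
        images_dir = root / "data" / phase / "images"
        if images_dir.is_dir():
            counts[phase] = sum(1 for _ in images_dir.glob("*.tif"))
        else:
            counts[phase] = 0
    return counts


def report_mismatches(corpus_root):
    """Return {phase: (found, expected)} for phases whose counts differ."""
    found = count_images(corpus_root)
    return {
        phase: (found[phase], expected)
        for phase, expected in EXPECTED_COUNTS.items()
        if found[phase] != expected
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_counts.py /path/to/unpacked/release
    print(report_mismatches(sys.argv[1]) or "all phase counts match")
```

An empty result from report_mismatches means every phase directory holds
the number of images given in the Data Profile table.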