Title: MADCAT Phase 1, 2, and 3 Composite Evaluation Set

Authors: David Lee, Safa Ismael, Dave Doermann, Stephanie Strassel, Song Chen,
         Stephen Grimes

1. Introduction

MADCAT (Multilingual Automatic Document Classification Analysis and Translation)
was a technology evaluation program sponsored by the U.S. Defense Advanced
Research Projects Agency (DARPA). MADCAT’s goal was to produce systems that
could automatically convert foreign language text images into English transcripts
for use by humans and downstream processes, including summarization and information
extraction. The core evaluation task in MADCAT was the translation of handwritten
Arabic documents. The Linguistic Data Consortium (LDC) created publicly available
linguistic resources to support the MADCAT evaluations.

The MADCAT Phase 1, 2, and 3 Composite Evaluation Set contains all
evaluation data created by the Linguistic Data Consortium to support
Phases 1, 2, and 3 of the DARPA MADCAT Program. Phase 2 and Phase 3 data
were also used to support the NIST OpenHaRT 2010 and 2013 evaluations.

The data consists of handwritten Arabic documents, scanned at high
resolution and annotated for the physical coordinates of each line and
token. Digital transcripts and English translations of each document are
also provided, with the various content and annotation layers integrated
in a single MADCAT XML output.

The data was previously released in the following packages:

    LDC2008R43 - MADCAT Phase 1 Pilot Eval NIST V1.0
    LDC2008E52 - MADCAT Phase 1 Pilot Evaluation
    LDC2009R65 - MADCAT Phase 2 Eval Gold Standard Reference
    LDC2010R65 - MADCAT Phase 3 Eval Gold Standard Reference

2. Data Profile

This release includes 1,643 images and corresponding annotation files.

 phase  | file_count
 ---------------------
 1      |   470       
 ---------------------
 2      |   540       
 ---------------------
 3      |   633       
 ---------------------
 Total  | 1,643       

For details of data profiles in each phase, please refer to the
documentation of each phase under ./docs.
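
The per-phase counts above can be checked against an unpacked copy of the
release. The following is a minimal sketch, not part of the corpus tooling.
It assumes that the .tif files sit directly inside data/<phase>/images/ (as
described in Section 4.1) and that the release root is passed in by the
caller. The function names are illustrative only.

```python
from pathlib import Path

# Per-phase image counts, copied from the table above.
EXPECTED = {"phase1": 470, "phase2": 540, "phase3": 633}


def count_images(root):
    """Count .tif files under <root>/data/<phase>/images for each phase."""
    root = Path(root)
    return {
        phase: sum(1 for _ in (root / "data" / phase / "images").glob("*.tif"))
        for phase in EXPECTED
    }


def report(root):
    """Print found vs. expected counts for each phase and return the counts."""
    counts = count_images(root)
    for phase, expected in EXPECTED.items():
        status = "ok" if counts[phase] == expected else "MISMATCH"
        print(f"{phase}: found {counts[phase]}, expected {expected} ({status})")
    return counts
```

Calling report() with the path to the release root prints one line per
phase. A missing directory is simply reported as zero files found.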

3. MADCAT Data Creation Process

The data creation process for MADCAT began with Arabic source documents
in three genres: newswire, weblog and newsgroup text, originally collected
as part of the DARPA GALE Program. Arabic-speaking "scribes" copied the
documents by hand, following specific instructions as to the writing style
(fast, normal, careful), writing implement (pen, pencil) and paper (lined,
unlined). Prior to assignment, source documents were processed to optimize
their appearance for the handwriting task, which could result in an
original source document being broken into multiple "pages" for
handwriting. Each resulting page was assigned to up to three independent
scribes, using different writing conditions.

Once handwritten data had been collected from scribes, it was checked for
quality and completeness, then each page was scanned at a high resolution
(600 dpi, greyscale) to create a digital version of the handwritten
document. The scanned images were then annotated to indicate the physical
coordinates of each line and token. Explicit reading order was also
labeled, as were any errors produced by the scribes when copying the
text.
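
Because the scans are 600 dpi, a distance measured in image pixels can be
converted to a physical size on the page. The sketch below shows the
arithmetic only. It assumes that the coordinate values being converted are
pixel offsets in the scanned image, which this README does not state
explicitly. The function name is illustrative.

```python
DPI = 600            # scan resolution stated above
MM_PER_INCH = 25.4   # exact by definition of the inch


def pixels_to_mm(pixels, dpi=DPI):
    """Convert a length in pixels to millimetres at the given resolution."""
    return pixels * MM_PER_INCH / dpi
```

Under that assumption, a token box 300 pixels wide corresponds to half an
inch (about 12.7 mm) of handwriting on the page.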

The final step was to produce a unified data format that takes multiple
data streams and generates a single MADCAT XML output file containing
all required information. The XML file contains four distinct components:
a text layer that consists of the source text, tokenization and sentence
segmentation; an image layer that consists of bounding boxes; a scribe
demographic layer that consists of scribe ID and partition (train/test);
and a document metadata layer.
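
One way to get a first look at how these layers appear in a given file is
to tally the element names it contains. The sketch below uses only the
Python standard library and deliberately assumes no MADCAT element names.
The authoritative description of the format is the specification PDF in the
docs/ directory. The function name is illustrative.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_path):
    """Count how often each element name occurs in one XML file."""
    counts = Counter()
    # iterparse streams the file, so large annotation files stay cheap to scan.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        # Drop any "{namespace}" prefix that ElementTree adds to tag names.
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
        elem.clear()  # release the element's content once it is counted
    return counts
```

Printing tag_counts(path).most_common() for one .madcat.xml file gives a
quick inventory of its elements, which can then be read against the format
specification.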

The ./docs/ directory includes the file DataFlowChart.pdf which provides
additional details of the data creation process.

All madcat.xml files were validated using NIST's MADCATEval_1.0 toolkit. You
can find the toolkit on NIST's website: https://www.nist.gov/itl/iad/mig/tools

4. Contents

4.1 Data

The data/ directory contains three subdirectories, one for each phase. The
contents of each are listed below:

   data/phase1/madcat/

     This directory contains MADCAT XML files (.madcat.xml), which
     include ground truth annotations, source transcripts, and
     translations.

   data/phase1/gedi/

     This directory contains GEDI XML files (.gedi.xml), which
     include ground truth annotation.

   data/phase1/images/

     This directory contains scanned image files (.tif) of the
     handwritten documents.

   data/phase1/segmentation/

     This directory contains segmentation-only MADCAT XML files
     (.textseg.madcat.xml | .wordseg.madcat.xml | .lineseg.madcat.xml).

   data/phase2/gedi/

     This directory contains the output of the GEDI annotation tool.

   data/phase2/madcat/

     This directory contains the full reference MADCAT XML files
     (.madcat.xml).

   data/phase2/line-segmentation/

     This directory contains line-segmentation-only MADCAT XML files
     (.lineseg.madcat.xml).  These files are included for reference only.

   data/phase2/segmentation/

     This directory contains word-segmentation-only MADCAT XML files
     (.seg.madcat.xml).  These files are included for reference only. 

   data/phase2/images/

     This directory contains scanned image files (.tif) of the
     handwritten documents.

   data/phase3/gedi/

     This directory contains the output of the GEDI annotation tool.

   data/phase3/madcat/

     This directory contains the full reference MADCAT XML files
     (.madcat.xml).

   data/phase3/images/

     This directory contains scanned image files (.tif) of the
     handwritten documents.
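
When working with one phase, it is often useful to pair each image with its
full reference annotation. The sketch below does this for a single phase
directory. It ASSUMES that an image named <stem>.tif corresponds to an
annotation named <stem>.madcat.xml. That naming correspondence is an
assumption made for illustration and should be checked against the actual
file listing. The function names are illustrative.

```python
from pathlib import Path


def pair_phase_files(phase_dir):
    """Map each image stem to (image path, MADCAT XML path or None)."""
    phase_dir = Path(phase_dir)
    suffix = ".madcat.xml"
    # ASSUMPTION: "<stem>.tif" and "<stem>.madcat.xml" share the same stem.
    annotations = {
        p.name[: -len(suffix)]: p
        for p in (phase_dir / "madcat").glob("*" + suffix)
    }
    return {
        img.stem: (img, annotations.get(img.stem))
        for img in sorted((phase_dir / "images").glob("*.tif"))
    }


def unpaired(pairs):
    """Return the stems of images for which no annotation file was found."""
    return sorted(stem for stem, (_img, xml) in pairs.items() if xml is None)
```

If unpaired() returns a non-empty list for a phase, either the naming
assumption does not hold for that phase or the copy of the release is
incomplete.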

4.2 DTD

The DTDs for each phase can be found in ./dtd/phase{1,2,3}.

../dtd/phase1/
     - madcat.v1.0.5.dtd: the DTD for the madcat.xml files
    
     - madcat.v1.0.6.xsd: an XML schema equivalent of the DTD above
                          (for reference only)

../dtd/phase2/
     - madcat.v1.0.5.dtd: the DTD for the madcat.xml files

     - madcat.v1.0.6.xsd: an XML schema equivalent of the DTD above

     - madcat.v1.1.0.dtd: a DTD for segmentation-only files (for
                          reference only)

     - madcat.v1.1.0.xsd: an XML schema equivalent of the DTD above
                          (for reference only)

../dtd/phase3/
     - madcat.v1.1.1.dtd: a DTD for segmentation-only files (for
                          reference only)

     - madcat.v1.1.1.xsd: an XML schema equivalent of the DTD above
                          (for reference only)

4.3 Documentation

The docs/ directory includes:

   README.txt: this file
   
   MADCAT_Data_Format_Spec_v4h.pdf: a description with examples of the
                                    MADCAT XML format

   DataFlowChart.pdf: illustrates detailed workflow of the data creation
                      process
  
   openhart2010_filelist.txt: Phase 2 files that were used in the OpenHaRT
                              2010 evaluation

   openhart2013_filelist.txt: Phase 3 files that were used in the OpenHaRT
                              2013 evaluation

   Phase{1,2,3}_FileStats.tab: a list of MADCAT XML file statistics

   Phase{1,2,3}_ScribeDemographics.tab: demographic information for
                                        participating scribes, where available

   Phase{1,2,3}_writing_conditions.tab: the writing conditions that scribes
                                        were instructed to follow for each
                                        MADCAT document (e.g., pen vs.
                                        pencil).

       The writing conditions themselves are listed as three-letter
       strings, which may be interpreted as follows:

               First letter refers to writing instrument:
                               I => Pen
                               P => Pencil
               Second letter refers to type of paper:
                               U => Unlined Paper
                               L => Lined Paper
               Third letter refers to writing speed:
                               C => Careful
                               N => Normal
                               F => Fast
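
       The legend above can be applied mechanically. The following is a
       minimal sketch of a decoder for a single three-letter string. The
       letter mappings come from the legend. The function and variable
       names are illustrative, and the column layout of the .tab files is
       not assumed here.

```python
# Letter mappings taken from the legend above; names are illustrative only.
INSTRUMENT = {"I": "pen", "P": "pencil"}
PAPER = {"U": "unlined", "L": "lined"}
SPEED = {"C": "careful", "N": "normal", "F": "fast"}


def decode_writing_condition(code):
    """Return (instrument, paper, speed) for a three-letter condition string."""
    code = code.strip().upper()
    if len(code) != 3:
        raise ValueError(f"expected a three-letter code, got {code!r}")
    try:
        return INSTRUMENT[code[0]], PAPER[code[1]], SPEED[code[2]]
    except KeyError as exc:
        raise ValueError(f"unknown letter {exc.args[0]!r} in {code!r}") from None


print(decode_writing_condition("IUC"))  # ('pen', 'unlined', 'careful')
```

       Unknown letters raise ValueError rather than being silently mapped,
       so unexpected values in a conditions file are surfaced early.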


5. Sponsorship

This work was supported in part by the Defense Advanced Research Projects
Agency, MADCAT Program No. HR0011-08-1-004 and GALE Program Grant No.
HR0011-06-1-0003. The content of this publication does not necessarily
reflect the position or the policy of the Government, and no official
endorsement should be inferred.

6. Copyright Information

Portions © 2007-2008 Al-Ahram, Al Hayat, Al Quds - Al Arabi, Asharq Al-Awsat,
An Nahar, Assabah, Agence France Presse, Xinhua News Agency, © 2004-2013
Trustees of the University of Pennsylvania

7. Contact

Stephanie Strassel <strassel@ldc.upenn.edu>
Song Chen <zhiyi@ldc.upenn.edu>