MADCAT Phases 1-3 Composite Evaluation Set

Item Name: MADCAT Phases 1-3 Composite Evaluation Set
Author(s): David Lee, Safa Ismael, Dave Doermann, Stephanie Strassel, Song Chen, Stephen Grimes
LDC Catalog No.: LDC2026T05
ISLRN: 604-223-719-294-7
DOI: https://doi.org/10.35111/wks8-ak28
Release Date: May 15, 2026
Member Year(s): 2026
DCMI Type(s): StillImage, Text
Data Source(s): newsgroups, newswire, weblogs
Project(s): GALE, MADCAT, OpenHaRT
Application(s): handwriting recognition, machine translation
Language(s): Arabic
Language ID(s): ara
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2026T05 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Lee, David, et al. MADCAT Phases 1-3 Composite Evaluation Set LDC2026T05. Web Download. Philadelphia: Linguistic Data Consortium, 2026.

Introduction

MADCAT (Multilingual Automatic Document Classification Analysis and Translation) Phases 1-3 Composite Evaluation Set contains the evaluation data created by the Linguistic Data Consortium (LDC) to support Phases 1-3 of the DARPA MADCAT program and the NIST OpenHaRT 2010 and 2013 evaluations. It consists of handwritten Arabic documents, scanned at high resolution and annotated for the physical coordinates of each line and token, together with digital transcripts and English translations. Content and annotation layers are integrated in a single MADCAT XML output.

The goal of the MADCAT program was to automatically convert foreign-language text images into English transcripts for use by humans and downstream processes, including summarization and information extraction. The core evaluation task in MADCAT was the translation of handwritten Arabic documents.

Data

Arabic source documents were collected by LDC in three genres: newswire, weblog and newsgroup text. Arabic-speaking scribes copied documents by hand, following specific instructions as to the writing style (fast, normal, careful), writing implement (pen, pencil) and paper (lined, unlined). Prior to assignment, source documents were processed to optimize their appearance for the handwriting task, which resulted in some source documents being split into multiple pages for handwriting. Each resulting page was assigned to up to three independent scribes, who copied it under different writing conditions.
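The three instruction dimensions listed above span at most 3 x 2 x 2 = 12 writing conditions. The sketch below simply enumerates them; the variable names are illustrative, and nothing here implies that every combination actually occurs in the release.

```python
from itertools import product

# Condition values as listed in the description above.
WRITING_STYLES = ("fast", "normal", "careful")
IMPLEMENTS = ("pen", "pencil")
PAPER_TYPES = ("lined", "unlined")

# Every possible (style, implement, paper) combination.
# Assumption: the dimensions are independent; the release may not
# contain all of these combinations.
conditions = list(product(WRITING_STYLES, IMPLEMENTS, PAPER_TYPES))

for style, implement, paper in conditions:
    print(f"{style:8s} {implement:7s} {paper}")
```

Since each page went to up to three scribes, any single page can cover at most three of these conditions.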

The handwritten, transcribed documents were checked for quality and completeness; then each page was scanned at a high resolution (600 dpi, greyscale) to create a digital version of the handwritten document. The scanned images were annotated to indicate the physical coordinates of each line and token. Explicit reading order was also labeled, along with any errors produced by the scribes when copying the text.
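As a worked example of what the 600 dpi scan resolution implies, the following sketch converts pixel distances to millimetres. It assumes that annotation coordinates are expressed in pixels of the scanned image, which the description does not state explicitly; the function names are hypothetical.

```python
MM_PER_INCH = 25.4
SCAN_DPI = 600  # scan resolution stated in the description


def pixels_to_mm(pixels: float, dpi: int = SCAN_DPI) -> float:
    """Convert a pixel distance to millimetres at the given resolution."""
    return pixels / dpi * MM_PER_INCH


def box_size_mm(left, top, right, bottom, dpi=SCAN_DPI):
    """Return (width_mm, height_mm) for a bounding box given in pixels.

    Assumption: coordinates are pixel offsets in the scanned image.
    """
    return pixels_to_mm(right - left, dpi), pixels_to_mm(bottom - top, dpi)


# A hypothetical token box 1200 px wide and 300 px tall.
width_mm, height_mm = box_size_mm(0, 0, 1200, 300)
print(round(width_mm, 1), round(height_mm, 1))
```

At this resolution 600 pixels correspond to one inch (25.4 mm), so the hypothetical box above is about 50.8 mm wide and 12.7 mm tall.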

In the final step, a unified data format was produced consisting of the source text, tokenization and sentence segmentation; an image layer of bounding boxes; a scribe demographic layer containing scribe ID and partition (train/test); and a document metadata layer.

This release includes 1,643 scanned images in TIFF format together with their corresponding annotation files in both GEDI XML and MADCAT XML formats (.gedi.xml and .madcat.xml). GEDI XML files contain ground truth annotations.

Phase    File Count
1        470
2        540
3        633
Total    1,643
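One way to sanity-check a local copy is to group files by document and confirm that each image has both annotation files. The sketch below is a minimal illustration under stated assumptions: the `.tif` image extension, the leading dot on `.gedi.xml`, and the idea that the three files of a document share a common stem are all guesses about the layout, not documented facts, and the helper names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

# Annotation suffixes come from the description; ".tif" and the shared
# file stem are assumptions about the release layout.
SUFFIXES = (".madcat.xml", ".gedi.xml", ".tif")


def group_by_document(paths):
    """Map each document stem to the set of suffixes found for it."""
    groups = defaultdict(set)
    for path in paths:
        name = Path(path).name
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                groups[name[: -len(suffix)]].add(suffix)
                break
    return dict(groups)


def incomplete_documents(groups):
    """Return stems that are missing one or more of the expected files."""
    expected = set(SUFFIXES)
    return sorted(stem for stem, found in groups.items() if found != expected)


# Per-phase file counts from the table above.
PHASE_COUNTS = {1: 470, 2: 540, 3: 633}
assert sum(PHASE_COUNTS.values()) == 1643
```

For a real check, the paths could come from something like `Path(root).rglob("*")`, where `root` is wherever the release was unpacked.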

Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program No. HR0011-08-1-004 and GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Updates

No updates at this time.
