CAMIO Transcription Languages

Item Name: CAMIO Transcription Languages
Author(s): Michael Arrigo, Stephanie Strassel, Christopher Caruso
LDC Catalog No.: LDC2022T07
ISLRN: 014-810-264-834-8
DOI: https://doi.org/10.35111/r7ds-gy89
Release Date: December 15, 2022
Member Year(s): 2022
DCMI Type(s): StillImage, Text
Data Source(s): web collection
Project(s): CAMIO
Application(s): keyword spotting, language identification, OCR decoding, script identification, text localization
Language(s): English, Arabic, Persian, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, Vietnamese, Mandarin Chinese
Language ID(s): eng, ara, fas, hin, jpn, kan, kor, rus, tam, tha, urd, vie, cmn
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2022T07 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Arrigo, Michael, Stephanie Strassel, and Christopher Caruso. CAMIO Transcription Languages LDC2022T07. Web Download. Philadelphia: Linguistic Data Consortium, 2022.

Introduction

CAMIO Transcription Languages was developed by the Linguistic Data Consortium and contains nearly 70,000 images of machine printed text with corresponding annotations and transcripts in the following 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese.

This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition (OCR) and related technologies for 35 languages across 24 unique script types. The CAMIO (Corpus of Annotated Multilingual Images for OCR) collection was designed to address gaps in language and script coverage from existing corpora and to support future evaluation of OCR capabilities through a systematically constructed data set.


Data

Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes. For the 13 languages represented in this release, 1250 images per language were also annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in a comprehensive XML output format defined for this corpus.

The script for each language is indicated in parentheses: Arabic (Arabic), Chinese (Simplified), English (Latin), Farsi (Arabic), Hindi (Devanagari), Japanese (Japanese), Kannada (Kannada), Korean (Hangul), Russian (Cyrillic), Tamil (Tamil), Thai (Thai), Urdu (Arabic), and Vietnamese (Latin).
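The language, code, and script information above can be collected into a small lookup table. The following is a minimal Python sketch, not part of the corpus distribution. It pairs the language IDs from the metadata with the scripts listed in the preceding paragraph. The names LANGUAGES, TRANSCRIBED_IMAGES_PER_LANGUAGE, and languages_by_script are hypothetical and chosen for illustration only, and the script labels are copied as written in this description.

```python
from collections import defaultdict

# (language name, language ID, script) as listed in this catalog entry.
# The pairing of IDs with scripts is assembled here for illustration.
LANGUAGES = [
    ("Arabic", "ara", "Arabic"),
    ("Chinese", "cmn", "Simplified"),
    ("English", "eng", "Latin"),
    ("Farsi", "fas", "Arabic"),
    ("Hindi", "hin", "Devanagari"),
    ("Japanese", "jpn", "Japanese"),
    ("Kannada", "kan", "Kannada"),
    ("Korean", "kor", "Hangul"),
    ("Russian", "rus", "Cyrillic"),
    ("Tamil", "tam", "Tamil"),
    ("Thai", "tha", "Thai"),
    ("Urdu", "urd", "Arabic"),
    ("Vietnamese", "vie", "Latin"),
]

# Number of images per language that carry line-level transcriptions,
# as stated in the Data section.
TRANSCRIBED_IMAGES_PER_LANGUAGE = 1250


def languages_by_script(entries):
    """Group language IDs by the script named for each language."""
    groups = defaultdict(list)
    for _name, code, script in entries:
        groups[script].append(code)
    return dict(groups)


if __name__ == "__main__":
    for script, codes in sorted(languages_by_script(LANGUAGES).items()):
        print(f"{script}: {', '.join(codes)}")
    # Total transcribed images implied by 13 languages x 1250 images each.
    print(len(LANGUAGES) * TRANSCRIBED_IMAGES_PER_LANGUAGE)
```

Grouping by script makes it easy to see, for example, that Arabic, Farsi, and Urdu share one script, which may matter when splitting data for script identification experiments.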

Data for each language is partitioned into training, validation, and test sets.

Updates

None at this time.
