Title: CAMIO Transcription Languages
Authors: Michael Arrigo, Stephanie Strassel, Christopher Caruso
Catalog ID: LDC2022T07
Linguistic Data Consortium
July 21, 2022

1.0 Introduction

CAMIO (Corpus of Annotated Multilingual Images for OCR) was developed by
Linguistic Data Consortium to serve as a resource to support the development
and evaluation of optical character recognition (OCR) and related technologies
for 35 languages across 24 unique script types. CAMIO was designed to address
gaps in language and script coverage in existing corpora and to support future
evaluation of OCR capabilities through a systematically constructed data set.

The CAMIO corpus comprises nearly 70,000 images of machine-printed text,
covering a wide variety of topics and styles, document domains, attributes,
and scanning/capture artifacts. Most images have been exhaustively annotated
for text localization, resulting in over 2.3M line-level bounding boxes. For
13 of the 35 languages, 1250 images per language have been further annotated
with orthographic transcriptions of each line plus specification of reading
order, yielding over 2.4M tokens of transcribed text. The resulting
annotations are represented in a comprehensive XML output format defined for
this corpus.

The current package contains data and annotations for the 13 CAMIO languages
that were annotated with orthographic transcription. The script for each
language is indicated in parentheses: Arabic (Arabic), Chinese (Simplified),
English (Latin), Farsi (Arabic), Hindi (Devanagari), Japanese (Japanese),
Kannada (Kannada), Korean (Hangul), Russian (Cyrillic), Tamil (Tamil), Thai
(Thai), Urdu (Arabic), and Vietnamese (Latin). Data for each language are
partitioned into test, train, and validation sets.
2.0 Directory Structure

The directory structure and contents of the package are summarized below --
paths shown are relative to the base (root) directory of the package:

  ./README.txt -- this file
  ./data/{test,train,valid}/{LNG}/
      png/ -- PNG versions of all images
      xml/ -- XML files for images with annotation
  ./docs/
      guidelines/
          CAMIO_Auditing_Guidelines_V1.3.pdf
          CAMIO_Data_Scouting_Guidelines_V1.0.pdf
          CAMIO_Reading_Order_Guidelines_V1.0.pdf
          CAMIO_Text_Localization_Guidelines_V1.3.pdf
          CAMIO_Transcription_Guidelines_V1.2.pdf
      corpus_partitions/ -- lists indicating which images are partitioned
          into each of the test/train/valid data subdirectories
      image_status.tab -- information on images included (see Section 6.0)
      language_summary.tab -- summary of content per language
      reading_order.tab -- information about reading order (see Section 6.0)
      urls.tab -- mapping of file names to source URLs
  ./dtds/camio.v.1.0.dtd

For each XML file in ./data/{test,train,valid}/{LNG}/xml/, there is a
corresponding PNG file in ./data/{test,train,valid}/{LNG}/png/. For example:

  ./data/test/arb/png/ARB_BK_J00SV16ZA.png.ldcc
  ...
  ./data/test/arb/xml/ARB_BK_J00SV16ZA.xml
  ...

The file names assigned to individual documents provide the following
information about the document:

  Language   3-letter abbrev.
  Genre      2-letter abbrev.
  Global-ID  9-digit alphanumeric assigned to this image

The language portion is the standard 3-letter code established in ISO 639-3
(see ./docs/language_summary.tab for a complete list). The possible genre
codes for this corpus are:

  BK = book
  CS = card/slide
  PR = periodical
  RC = record
  ST = scene text
  WP = webpage

The three fields are joined by underscore characters, yielding a 16-character
file name (e.g. ARB_BK_J00SV16ZA).

A set of 1390 images does not contain the 2-letter genre code in their file
names. These are images for which auditing has not been carried out; they
have a value of "no" in column 5 of ./docs/image_status.tab.
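As an illustration of the naming convention above, a file name stem can be split mechanically into its three fields. The helper below is a hypothetical sketch, not part of the corpus tooling; it assumes the underscore-joined fields described above and treats two-field names (the unaudited images that lack a genre code) as having no genre.

```python
import re

# Genre codes listed in Section 2.0 of this README.
GENRES = {"BK", "CS", "PR", "RC", "ST", "WP"}

def parse_camio_name(stem):
    """Split a CAMIO file name stem into (language, genre, global_id).

    Audited images use LNG_GG_XXXXXXXXX (16 characters); unaudited
    images lack the genre field, so genre is returned as None for them.
    The language code is lowercased to match the data subdirectory names.
    """
    parts = stem.split("_")
    if len(parts) == 3 and parts[1] in GENRES:
        lang, genre, gid = parts
    elif len(parts) == 2:
        lang, gid = parts
        genre = None
    else:
        raise ValueError("unrecognized CAMIO file name: %r" % stem)
    if not re.fullmatch(r"[A-Z0-9]{9}", gid):
        raise ValueError("global ID should be 9 alphanumeric chars: %r" % gid)
    return lang.lower(), genre, gid

# Example using the file name cited above.
print(parse_camio_name("ARB_BK_J00SV16ZA"))  # -> ('arb', 'BK', 'J00SV16ZA')
```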
3.0 Content Summary

A complete summary of the contents is provided by language in
./docs/language_summary.tab. The overall contents are as follows:

  +------------------------+-------------+
  | n_languages            | 13          |
  +------------------------+-------------+
  | n_images_total         | 16,246      |
  +------------------------+-------------+
  | n_images_boxed         | 16,246      |
  +------------------------+-------------+
  | n_boxes                | 323,668     |
  +------------------------+-------------+
  | n_tokens*              | 2,431,141   |
  +------------------------+-------------+
  | n_images_transcribed   | 16,246      |
  +------------------------+-------------+

  * For Chinese, Japanese, and Thai, n_tokens represents the number of
    non-whitespace characters. For all other languages, it represents the
    number of whitespace-delimited tokens.

4.0 Source Data Collection, Processing and Formats

Data collection was conducted by LDC annotators and crowdworkers, who provided
URLs for machine-print images in a CAMIO language along with a set of feature
labels describing the image properties. Once downloaded via LDC’s web data
collection system, the images received a unique identifier and were added to a
comprehensive tracking database where metadata values were recorded, including
the image’s original URL, a unique source identifier (e.g. website name),
image provenance (data scouting or crowd), language, and the various feature
labels.

The collected images represented various file formats including JPG, PNG and
TIF. For consistency, all documents were converted to a standard file format
(PNG) after collection and before auditing and annotation, and all documents
were assigned a normalized corpus filename.

Prior to corpus distribution, all PNG image files were converted into LDCC
format, which applies a specialized header block prepended to the image data.
The header block provides metadata about the file contents, including the MD5
checksum (for self-validation), the data type, and byte count.
The LDCC header block always begins with a 16-byte ASCII signature, as shown
between double-quotes on the following line (where "\n" represents the ASCII
"newline" character 0x0A):

  "LDCc \n1024 \n"

Note that the "1024" on the second line of the signature represents the exact
byte count of the LDCC header block. (If/when this header design needs to
accommodate larger quantities of metadata, the header byte count can be
expanded as needed in increments of 1024 bytes. Such expansion does not arise
in the present release.)

Immediately after the 16-byte signature, a YAML string presents a data
structure comprising the file-specific header content, expressed as a set of
"key: value" pairings in UTF-8 encoding. The YAML string is padded at the end
with space characters, such that when the following 8-byte string is appended,
the full header block size is exactly 1024 bytes (or whatever size is stated
in the initial signature):

  "endLDCc\n"

In order to process the content of an LDCC header:

  - read the initial block of 1024 bytes from the *.ldcc data file
  - check that it begins with "LDCc \n1024 \n" and ends with "endLDCc\n"
  - strip off those 16- and 8-byte portions
  - pass the remainder of the block to a YAML parser

In order to access the original content of the data file, simply skip or
remove the initial 1024 bytes. Images can be unwrapped using 'dd':

  dd bs=1024 skip=1 if=<infile.ldcc> of=<outfile>

5.0 Annotation and Auditing Process

CAMIO annotation consisted of auditing, text localization, reading order and
transcription. All annotation was performed using a custom web-based user
interface developed by LDC for CAMIO.

All collected documents were subject to auditing, which consisted of verifying
the document quality and manually labeling feature metadata (genre, document
domain, attributes, artifacts). After auditing, most documents were subject to
text localization, in which bounding boxes were drawn around each line of
machine-printed text.
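The unwrapping recipe above can also be sketched in code. The example below is a minimal illustration, not LDC tooling: it builds a synthetic 1024-byte LDCC wrapper around some payload bytes, then validates the signature and trailer and recovers the payload, following the steps listed above. A real header carries a YAML metadata string; here the metadata is parsed with a naive "key: value" split, and in practice a proper YAML parser (e.g. PyYAML) should be used instead.

```python
import hashlib

SIG = b"LDCc \n1024 \n"   # 16-byte signature; "1024" = header byte count
END = b"endLDCc\n"        # 8-byte trailer
HDR = 1024                # total header size in bytes

def wrap_ldcc(payload, data_type="image/png"):
    """Build a synthetic LDCC file: 1024-byte header + raw payload."""
    meta = "md5: %s\ndata_type: %s\nbyte_count: %d\n" % (
        hashlib.md5(payload).hexdigest(), data_type, len(payload))
    yaml_bytes = meta.encode("utf-8")
    pad = HDR - len(SIG) - len(yaml_bytes) - len(END)
    assert pad >= 0, "metadata too large for a 1024-byte header"
    # Pad with spaces so signature + YAML + padding + trailer == 1024 bytes.
    return SIG + yaml_bytes + b" " * pad + END + payload

def unwrap_ldcc(blob):
    """Validate the header and return (metadata_dict, payload)."""
    header, payload = blob[:HDR], blob[HDR:]
    assert header.startswith(SIG) and header.endswith(END)
    yaml_str = header[len(SIG):-len(END)].decode("utf-8").rstrip()
    # Naive parse; real LDCC headers should go through a YAML parser.
    meta = dict(line.split(": ", 1) for line in yaml_str.splitlines())
    # Self-validation via the embedded MD5 checksum.
    assert meta["md5"] == hashlib.md5(payload).hexdigest()
    return meta, payload

meta, data = unwrap_ldcc(wrap_ldcc(b"fake image bytes"))
print(meta["byte_count"], len(data))  # -> 16 16
```

The final `blob[HDR:]` slice is the programmatic equivalent of the `dd bs=1024 skip=1` invocation shown above.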
The bounding box for each line consists of a unique id, its contents (e.g.
text, features), and coordinates indicating the location of the line on the
page. Explicit, natural reading order was then indicated by applying a next_id
tag to each bounding box. A subset of images for the 13 transcription
languages was then selected for orthographic transcription, with one
transcript for each line of machine-printed text for which a bounding box had
been produced. Guidelines for each of these annotation tasks are provided in
the ./docs directory.

5.1 XML Format

Annotation and document metadata are presented in a unified XML format
defined by LDC for CAMIO. There is one XML file per image containing the
original source URL, source and language/script info from the Collection
task, document-level features from Auditing, line zone numbers from Reading
Order, bounding box coordinates from Text Localization, line-level features
from Reading Order, line-level orientation features from Transcription, and
the transcript itself.

Each line zone (lineZone) in the XML contains the four coordinate points of
the bounding box plus line-level features. Each lineZone contains an id
attribute and a next_id attribute with values of the form "LineID001". The
value for next_id can be "NONE" when the current lineZone is the last in a
sequence's reading order (i.e. the final lineZone in the document). All
lineZone ids are unique within a given document.

Transcripts are provided for a subset of images for the 13 languages included
in this corpus. Where there is no transcript, the transcript tags are empty.
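The id/next_id linkage described above effectively makes each document's lineZones a singly linked list. As a hypothetical illustration (the lineZone, id, and next_id names follow the description above, but the surrounding toy XML is invented, not taken from the corpus), reading order can be recovered by finding the zone nothing points to and then chasing next_id until "NONE":

```python
import xml.etree.ElementTree as ET

# Toy document using the lineZone/id/next_id scheme described above;
# element order on disk deliberately differs from reading order.
doc = """
<document>
  <lineZone id="LineID002" next_id="LineID003"><transcript>line B</transcript></lineZone>
  <lineZone id="LineID001" next_id="LineID002"><transcript>line A</transcript></lineZone>
  <lineZone id="LineID003" next_id="NONE"><transcript>line C</transcript></lineZone>
</document>
"""

def reading_order(root):
    """Return lineZone elements in reading order by chasing next_id links."""
    zones = {z.get("id"): z for z in root.iter("lineZone")}
    targets = {z.get("next_id") for z in zones.values()}
    # The head of the chain is the one id that no next_id points to.
    (head,) = set(zones) - targets
    order, cur = [], head
    while cur != "NONE":
        order.append(zones[cur])
        cur = zones[cur].get("next_id")
    return order

root = ET.fromstring(doc)
print([z.findtext("transcript") for z in reading_order(root)])
# -> ['line A', 'line B', 'line C']
```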
5.2 Transcription Markup and Features

The transcripts can contain the following six markup symbols:

  +---+-----------+----------------+----------------------------------+
  |   | symbol(s) | code_point     | name                             |
  +---+-----------+----------------+----------------------------------+
  | 1 | ﹟        | U+FE5F         | small number sign                |
  +---+-----------+----------------+----------------------------------+
  | 2 | ⦋word⦌    | U+298B, U+298C | left/right square bracket with   |
  |   |           |                | underbar                         |
  +---+-----------+----------------+----------------------------------+
  | 3 | ₍word₎    | U+208D, U+208E | subscript left/right parenthesis |
  +---+-----------+----------------+----------------------------------+
  | 4 | ₊         | U+208A         | subscript plus sign              |
  +---+-----------+----------------+----------------------------------+

  (1) stands in for any words/symbols that are not in the relevant script
      and could not be typed.
  (2) enclose any words that are in the relevant script but contain uncommon
      characters, diacritics, or features that could not be keyboarded.
  (3) enclose words that are not fully readable by themselves but that could
      be understood from context.
  (4) stands in for any words that are in the relevant script but are
      unreadable.

For the images that received transcription, annotators marked the orientation
of the text being transcribed at the line level. There are three orientation
flags: vertical, upside down, and mirror.

For the three Arabic-script transcription languages (Arabic, Farsi, and Urdu),
the default directionality is right-to-left. For Chinese, Japanese, and
Korean, the default directionality is left-to-right when written horizontally
and top-to-bottom when written vertically. Even though both are considered
"normal," vertical text is still flagged as such for these languages. The
remaining seven transcription languages are written left-to-right by default.
6.0 Documents

The file ./docs/image_status.tab contains the following fields for each image
included in this release:

  (1) file_name - the unique name of the image
  (2) n_boxes - the number of bounding boxes, if applicable
  (3) n_transcripts - the number of transcripts, if applicable
  (4) n_tokens - the number of tokens, if applicable
  (5) auditing - whether the image has been audited
  (6) text_localization - whether the image received text localization
  (7) reading_order - whether the image received reading order
  (8) transcription - whether the image has been transcribed

The file ./docs/language_summary.tab contains the following fields for each
of the 13 languages included in this release:

  (1) iso_code
  (2) language_name
  (3) n_images - the total number of images included
  (4) n_ann_images - the number of images that received annotation
  (5) n_boxes
  (6) avg_boxes_per_image
  (7) n_trn_images - the number of images that have been transcribed
  (8) n_transcripts (if applicable)
  (9) n_tokens (if applicable)

The file ./docs/reading_order.tab contains a mapping of all annotated file
names to the source of the reading order provided. The possible values are:

  (1) native - order of zones provided by a native speaker trained on the task
  (2) non-native - order of zones provided by a non-native speaker trained on
      the task
  (3) automatic - order of zones reflects the creation of bounding boxes
      during text localization annotation

The file ./docs/urls.tab contains a mapping of the file name for each image
in the release to its full source URL.

7.0 Known Issues

A small number of images received incomplete or no reading order annotation;
these are labeled "automatic" in ./docs/reading_order.tab. For 33 XML files,
there is at least one bounding box with a missing transcript. For these 33
file names, n_boxes is greater than n_transcripts in ./docs/image_status.tab.

8.0 References

Michael Arrigo, Stephanie Strassel, Nolan King, Thao Tran, Lisa Mason.
CAMIO: A Corpus for OCR in Multiple Languages.
LREC 2022: 13th Edition of the Language Resources and Evaluation Conference,
Marseille, June 20-25.

9.0 Copyright Statement

TBD.

10.0 Acknowledgements

We gratefully acknowledge the contributions of CAMIO project lead annotators
Justin Mott and Neil Kuster, software developers Alex Shelmire and Jonathan
Wright, and the hundreds of CAMIO annotators whose efforts were central to
the development of this corpus.

11.0 Contacts

If you have questions about this data release, please contact the following
personnel at LDC.

  Stephanie Strassel
  Christopher Caruso

----------------------
README created by Michael Arrigo on December 17, 2020
  updated by Jeremy Getman on June 17, 2022
  updated by Jeremy Getman on June 22, 2022
  updated by Stephanie Strassel on July 5, 2022
  updated by Jeremy Getman on July 7, 2022
  updated by Stephanie Strassel on July 21, 2022
  updated by Jeremy Getman on July 25, 2022