Title: CAMIO Transcription Languages
Authors: Michael Arrigo, Stephanie Strassel, Christopher Caruso
Catalog ID: LDC2022T07
Linguistic Data Consortium
July 21, 2022

1.0 Introduction

CAMIO (Corpus of Annotated Multilingual Images for OCR) was developed by
Linguistic Data Consortium to serve as a resource to support the development
and evaluation of optical character recognition (OCR) and related technologies
for 35 languages across 24 unique script types. CAMIO was designed to address
gaps in language and script coverage in existing corpora and to support future
evaluation of OCR capabilities through a systematically constructed data set.

The CAMIO corpus comprises nearly 70,000 images of machine-printed text,
covering a wide variety of topics and styles, document domains, attributes,
and scanning/capture artifacts. Most images have been exhaustively annotated
for text localization, resulting in over 2.3M line-level bounding boxes. For
13 of the 35 languages, 1250 images per language have been further annotated
with orthographic transcriptions of each line plus specification of reading
order, yielding over 2.4M tokens of transcribed text. The resulting
annotations are represented in a comprehensive XML output format defined for
this corpus.

The current package contains data and annotations for the 13 CAMIO languages
that were annotated with orthographic transcription. The script for each
language is indicated in parentheses: Arabic (Arabic), Chinese (Simplified),
English (Latin), Farsi (Arabic), Hindi (Devanagari), Japanese (Japanese),
Kannada (Kannada), Korean (Hangul), Russian (Cyrillic), Tamil (Tamil), Thai
(Thai), Urdu (Arabic), and Vietnamese (Latin). Data for each language are
partitioned into test, train, and validation sets.
2.0 Directory Structure

The directory structure and contents of the package are summarized below --
paths shown are relative to the base (root) directory of the package:

  ./README.txt -- this file
  ./data/{test,train,valid}/{LNG}/
      png/ -- PNG versions of all images
      xml/ -- XML files for images with annotation
  ./docs/
      guidelines/
          CAMIO_Auditing_Guidelines_V1.3.pdf
          CAMIO_Data_Scouting_Guidelines_V1.0.pdf
          CAMIO_Reading_Order_Guidelines_V1.0.pdf
          CAMIO_Text_Localization_Guidelines_V1.3.pdf
          CAMIO_Transcription_Guidelines_V1.2.pdf
      corpus_partitions/ -- lists indicating which images are partitioned
          into each of the test/train/valid data subdirectories
      image_status.tab -- information on images included (see Section 6.0)
      language_summary.tab -- summary of content per language
      reading_order.tab -- information about reading order (see Section 6.0)
      urls.tab -- mapping of file names to source URLs
  ./dtds/camio.v.1.0.dtd

For each XML file in ./data/{test,train,valid}/{LNG}/xml/, there is a
corresponding PNG file in ./data/{test,train,valid}/{LNG}/png/. For example:

  ./data/test/arb/png/ARB_BK_J00SV16ZA.png.ldcc
  ...
  ./data/test/arb/xml/ARB_BK_J00SV16ZA.xml
  ...

The file names assigned to individual documents provide the following
information about the document:

  Language   3-letter abbrev.
  Genre      2-letter abbrev.
  Global-ID  9-digit alphanumeric assigned to this image

The language portion is the standard 3-letter code established in ISO 639-3
(see ./docs/language_summary.tab for a complete list). The possible genre
codes for this corpus are:

  BK = book
  CS = card/slide
  PR = periodical
  RC = record
  ST = scene text
  WP = webpage

The three fields are joined by underscore characters, yielding a 16-character
file name (e.g. ARB_BK_J00SV16ZA).

A set of 1390 images does not contain the 2-letter genre code in their file
names. These are images for which auditing has not been carried out; they
have a value of "no" in column 5 of ./docs/image_status.tab.
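As an illustration of the naming convention above, a file name stem can be split mechanically into its three fields. The helper below is a hypothetical sketch, not part of the corpus tooling; it assumes the underscore-joined fields described above and treats two-field names (the unaudited images that lack a genre code) as having no genre.

```python
import re

# Genre codes listed in Section 2.0 of this README.
GENRES = {"BK", "CS", "PR", "RC", "ST", "WP"}

def parse_camio_name(stem):
    """Split a CAMIO file name stem into (language, genre, global_id).

    Audited images use LNG_GG_XXXXXXXXX (16 characters); unaudited
    images lack the genre field, so genre is returned as None for them.
    The language code is lowercased to match the data subdirectory names.
    """
    parts = stem.split("_")
    if len(parts) == 3 and parts[1] in GENRES:
        lang, genre, gid = parts
    elif len(parts) == 2:
        lang, gid = parts
        genre = None
    else:
        raise ValueError("unrecognized CAMIO file name: %r" % stem)
    if not re.fullmatch(r"[A-Z0-9]{9}", gid):
        raise ValueError("global ID should be 9 alphanumeric chars: %r" % gid)
    return lang.lower(), genre, gid

# Example using the file name cited above.
print(parse_camio_name("ARB_BK_J00SV16ZA"))  # -> ('arb', 'BK', 'J00SV16ZA')
```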
3.0 Content Summary

A complete summary of the contents is provided by language in
./docs/language_summary.tab. The overall contents are as follows:

  +------------------------+-------------+
  | n_languages            | 13          |
  +------------------------+-------------+
  | n_images_total         | 16,246      |
  +------------------------+-------------+
  | n_images_boxed         | 16,246      |
  +------------------------+-------------+
  | n_boxes                | 323,668     |
  +------------------------+-------------+
  | n_tokens*              | 2,431,141   |
  +------------------------+-------------+
  | n_images_transcribed   | 16,246      |
  +------------------------+-------------+

  * For Chinese, Japanese, and Thai, n_tokens represents the number of
    non-whitespace characters. For all other languages, it represents the
    number of whitespace-delimited tokens.

4.0 Source Data Collection, Processing and Formats

Data collection was conducted by LDC annotators and crowdworkers, who provided
URLs for machine-print images in a CAMIO language along with a set of feature
labels describing the image properties. Once downloaded via LDC’s web data
collection system, the images received a unique identifier and were added to a
comprehensive tracking database where metadata values were recorded, including
the image’s original URL, a unique source identifier (e.g. website name),
image provenance (data scouting or crowd), language, and the various feature
labels.

The collected images represented various file formats including JPG, PNG and
TIF. For consistency, all documents were converted to a standard file format
(PNG) after collection and before auditing and annotation, and all documents
were assigned a normalized corpus filename.

Prior to corpus distribution, all PNG image files were converted into LDCC
format, which applies a specialized header block prepended to the image data.
The header block provides metadata about the file contents, including the MD5
checksum (for self-validation), the data type, and byte count.
The LDCC header block always begins with a 16-byte ASCII signature, as shown
between double-quotes on the following line (where "\n" represents the ASCII
"newline" character 0x0A):

  "LDCc \n1024 \n"

Note that the "1024" on the second line of the signature represents the exact
byte count of the LDCC header block. (If/when this header design needs to
accommodate larger quantities of metadata, the header byte count can be
expanded as needed in increments of 1024 bytes. Such expansion does not arise
in the present release.)

Immediately after the 16-byte signature, a YAML string presents a data
structure comprising the file-specific header content, expressed as a set of
"key: value" pairings in UTF-8 encoding. The YAML string is padded at the end
with space characters, such that when the following 8-byte string is appended,
the full header block size is exactly 1024 bytes (or whatever size is stated
in the initial signature):

  "endLDCc\n"

In order to process the content of an LDCC header:

  - read the initial block of 1024 bytes from the *.ldcc data file
  - check that it begins with "LDCc \n1024 \n" and ends with "endLDCc\n"
  - strip off those 16- and 8-byte portions
  - pass the remainder of the block to a YAML parser

In order to access the original content of the data file, simply skip or
remove the initial 1024 bytes. Images can be unwrapped using 'dd':

  dd bs=1024 skip=1 if=<infile.ldcc> of=<outfile>

5.0 Annotation and Auditing Process

CAMIO annotation consisted of auditing, text localization, reading order and
transcription. All annotation was performed using a custom web-based user
interface developed by LDC for CAMIO.

All collected documents were subject to auditing, which consisted of verifying
the document quality and manually labeling feature metadata (genre, document
domain, attributes, artifacts). After auditing, most documents were subject to
text localization, in which bounding boxes were drawn around each line of
machine-printed text.
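The unwrapping recipe above can also be sketched in code. The example below is a minimal illustration, not LDC tooling: it builds a synthetic 1024-byte LDCC wrapper around some payload bytes, then validates the signature and trailer and recovers the payload, following the steps listed above. A real header carries a YAML metadata string; here the metadata is parsed with a naive "key: value" split, and in practice a proper YAML parser (e.g. PyYAML) should be used instead.

```python
import hashlib

SIG = b"LDCc \n1024 \n"   # 16-byte signature; "1024" = header byte count
END = b"endLDCc\n"        # 8-byte trailer
HDR = 1024                # total header size in bytes

def wrap_ldcc(payload, data_type="image/png"):
    """Build a synthetic LDCC file: 1024-byte header + raw payload."""
    meta = "md5: %s\ndata_type: %s\nbyte_count: %d\n" % (
        hashlib.md5(payload).hexdigest(), data_type, len(payload))
    yaml_bytes = meta.encode("utf-8")
    pad = HDR - len(SIG) - len(yaml_bytes) - len(END)
    assert pad >= 0, "metadata too large for a 1024-byte header"
    # Pad with spaces so signature + YAML + padding + trailer == 1024 bytes.
    return SIG + yaml_bytes + b" " * pad + END + payload

def unwrap_ldcc(blob):
    """Validate the header and return (metadata_dict, payload)."""
    header, payload = blob[:HDR], blob[HDR:]
    assert header.startswith(SIG) and header.endswith(END)
    yaml_str = header[len(SIG):-len(END)].decode("utf-8").rstrip()
    # Naive parse; real LDCC headers should go through a YAML parser.
    meta = dict(line.split(": ", 1) for line in yaml_str.splitlines())
    # Self-validation via the embedded MD5 checksum.
    assert meta["md5"] == hashlib.md5(payload).hexdigest()
    return meta, payload

meta, data = unwrap_ldcc(wrap_ldcc(b"fake image bytes"))
print(meta["byte_count"], len(data))  # -> 16 16
```

The final `blob[HDR:]` slice is the programmatic equivalent of the `dd bs=1024 skip=1` invocation shown above.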
The bounding box for each line consists of a unique id, its contents (e.g.
text, features), and coordinates indicating the location of the line on the
page. Explicit, natural reading order was then indicated by applying a next_id
tag to each bounding box. A subset of images for the 13 transcription
languages was then selected for orthographic transcription, with one
transcript for each line of machine-printed text for which a bounding box had
been produced. Guidelines for each of these annotation tasks are provided in
the ./docs directory.

5.1 XML Format

Annotation and document metadata are presented in a unified XML format
defined by LDC for CAMIO. There is one XML file per image containing the
original source URL, source and language/script info from the Collection
task, document-level features from Auditing, line zone numbers from Reading
Order, bounding box coordinates from Text Localization, line-level features
from Reading Order, line-level orientation features from Transcription, and
the transcript itself.

Each line zone (lineZone) in the XML contains the four coordinate points of
the bounding box plus line-level features. Each lineZone contains an id
attribute and a next_id attribute with values of the form "LineID001". The
value for next_id can be "NONE" when the current lineZone is the last in a
sequence's reading order (i.e. the final lineZone in the document). All
lineZone ids are unique within a given document.

Transcripts are provided for a subset of images for the 13 languages included
in this corpus. Where there is no transcript, the transcript tags are empty.
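The id/next_id linkage described above effectively makes each document's lineZones a singly linked list. As a hypothetical illustration (the lineZone, id, and next_id names follow the description above, but the surrounding toy XML is invented, not taken from the corpus), reading order can be recovered by finding the zone nothing points to and then chasing next_id until "NONE":

```python
import xml.etree.ElementTree as ET

# Toy document using the lineZone/id/next_id scheme described above;
# element order on disk deliberately differs from reading order.
doc = """
<document>
  <lineZone id="LineID002" next_id="LineID003"><transcript>line B</transcript></lineZone>
  <lineZone id="LineID001" next_id="LineID002"><transcript>line A</transcript></lineZone>
  <lineZone id="LineID003" next_id="NONE"><transcript>line C</transcript></lineZone>
</document>
"""

def reading_order(root):
    """Return lineZone elements in reading order by chasing next_id links."""
    zones = {z.get("id"): z for z in root.iter("lineZone")}
    targets = {z.get("next_id") for z in zones.values()}
    # The head of the chain is the one id that no next_id points to.
    (head,) = set(zones) - targets
    order, cur = [], head
    while cur != "NONE":
        order.append(zones[cur])
        cur = zones[cur].get("next_id")
    return order

root = ET.fromstring(doc)
print([z.findtext("transcript") for z in reading_order(root)])
# -> ['line A', 'line B', 'line C']
```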
5.2 Transcription Markup and Features

The transcripts can contain the following six markup symbols:

  +---+-----------+----------------+----------------------------------+
  |   | symbol(s) | code_point     | name                             |
  +---+-----------+----------------+----------------------------------+
  | 1 | ﹟        | U+FE5F         | small number sign                |
  +---+-----------+----------------+----------------------------------+
  | 2 | ⦋word⦌    | U+298B, U+298C | left/right square bracket with   |
  |   |           |                | underbar                         |
  +---+-----------+----------------+----------------------------------+
  | 3 | ₍word₎    | U+208D, U+208E | subscript left/right parenthesis |
  +---+-----------+----------------+----------------------------------+
  | 4 | ₊         | U+208A         | subscript plus sign              |
  +---+-----------+----------------+----------------------------------+

  (1) stands in for any words/symbols that are not in the relevant script
      and could not be typed.
  (2) enclose any words that are in the relevant script but contain uncommon
      characters, diacritics, or features that could not be keyboarded.
  (3) enclose words that are not fully readable by themselves but that could
      be understood from context.
  (4) stands in for any words that are in the relevant script but are
      unreadable.

For the images that received transcription, annotators marked the orientation
of the text being transcribed at the line level. There are three orientation
flags: vertical, upside down, and mirror.

For the three Arabic-script transcription languages (Arabic, Farsi, and Urdu),
the default directionality is right-to-left. For Chinese, Japanese, and
Korean, the default directionality is left-to-right when written horizontally
and top-to-bottom when written vertically. Even though both are considered
"normal," vertical text is still flagged as such for these languages. The
remaining seven transcription languages are written left-to-right by default.
6.0 Documents

The file ./docs/image_status.tab contains the following fields for each image
included in this release:

  (1) file_name - the unique name of the image
  (2) n_boxes - the number of bounding boxes, if applicable
  (3) n_transcripts - the number of transcripts, if applicable
  (4) n_tokens - the number of tokens, if applicable
  (5) auditing - whether the image has been audited
  (6) text_localization - whether the image received text localization
  (7) reading_order - whether the image received reading order
  (8) transcription - whether the image has been transcribed

The file ./docs/language_summary.tab contains the following fields for each
of the 13 languages included in this release:

  (1) iso_code
  (2) language_name
  (3) n_images - the total number of images included
  (4) n_ann_images - the number of images that received annotation
  (5) n_boxes
  (6) avg_boxes_per_image
  (7) n_trn_images - the number of images that have been transcribed
  (8) n_transcripts (if applicable)
  (9) n_tokens (if applicable)

The file ./docs/reading_order.tab contains a mapping of all annotated file
names to the source of the reading order provided. The possible values are:

  (1) native - order of zones provided by a native speaker trained on the task
  (2) non-native - order of zones provided by a non-native speaker trained on
      the task
  (3) automatic - order of zones reflects the creation of bounding boxes
      during text localization annotation

The file ./docs/urls.tab contains a mapping of the file name for each image
in the release to its full source URL.

7.0 Known Issues

A small number of images received incomplete or no reading order annotation;
these are labeled "automatic" in ./docs/reading_order.tab. For 33 XML files,
there is at least one bounding box with a missing transcript. For these 33
file names, n_boxes is greater than n_transcripts in ./docs/image_status.tab.

8.0 References

Michael Arrigo, Stephanie Strassel, Nolan King, Thao Tran, Lisa Mason.
CAMIO: A Corpus for OCR in Multiple Languages.
LREC 2022: 13th Edition of the Language Resources and Evaluation Conference,
Marseille, June 20-25.

9.0 Copyright Statement

TBD.

10.0 Acknowledgements

We gratefully acknowledge the contributions of CAMIO project lead annotators
Justin Mott and Neil Kuster, software developers Alex Shelmire and Jonathan
Wright, and the hundreds of CAMIO annotators whose efforts were central to
the development of this corpus.

11.0 Contacts

If you have questions about this data release, please contact the following
personnel at LDC.

  Stephanie Strassel
  Christopher Caruso

----------------------
README created by Michael Arrigo on December 17, 2020
  updated by Jeremy Getman on June 17, 2022
  updated by Jeremy Getman on June 22, 2022
  updated by Stephanie Strassel on July 5, 2022
  updated by Jeremy Getman on July 7, 2022
  updated by Stephanie Strassel on July 21, 2022
  updated by Jeremy Getman on July 25, 2022