CAMIO Transcription Languages

Item Name:	CAMIO Transcription Languages
Author(s):	Michael Arrigo, Stephanie Strassel, Christopher Caruso
LDC Catalog No.:	LDC2022T07
ISLRN:	014-810-264-834-8
DOI:	https://doi.org/10.35111/r7ds-gy89
Release Date:	December 15, 2022
Member Year(s):	2022
DCMI Type(s):	StillImage, Text
Data Source(s):	web collection
Project(s):	CAMIO
Application(s):	keyword spotting, language identification, OCR decoding, script identification, text localization
Language(s):	English, Arabic, Persian, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, Vietnamese, Mandarin Chinese
Language ID(s):	eng, ara, fas, hin, jpn, kan, kor, rus, tam, tha, urd, vie, cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2022T07 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Arrigo, Michael, Stephanie Strassel, and Christopher Caruso. CAMIO Transcription Languages LDC2022T07. Web Download. Philadelphia: Linguistic Data Consortium, 2022.

Introduction

CAMIO Transcription Languages was developed by the Linguistic Data Consortium and contains nearly 70,000 images of machine-printed text with corresponding annotations and transcripts in the following 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese.

This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition (OCR) and related technologies for 35 languages across 24 unique script types. The CAMIO (Corpus of Annotated Multilingual Images for OCR) collection was designed to address gaps in language and script coverage from existing corpora and to support future evaluation of OCR capabilities through a systematically constructed data set.

Data

Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes. For the 13 languages represented in this release, 1250 images per language were also annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in a comprehensive XML output format defined for this corpus.

The script for each language is indicated in parentheses: Arabic (Arabic), Chinese (Simplified), English (Latin), Farsi (Arabic), Hindi (Devanagari), Japanese (Japanese), Kannada (Kannada), Korean (Hangul), Russian (Cyrillic), Tamil (Tamil), Thai (Thai), Urdu (Arabic), and Vietnamese (Latin).

Data for each language is partitioned into test, train, and validation sets.

Updates

None at this time.

Copyright

Portions © 2007, 2015, 2017-2020 1399 picofiles, © 2015-2019 65tes-habeshamusic.com, © 2019-2020 Accessify.com, © 2019-2020 Adobe, © 2013, 2019-2020 Alamy Ltd., © 2010-2011, 2019-2020 Amazon.com, Inc. or its affiliates, © 2008, 2018-2019 ambebi.ge, © 2000, 2019-2020 A Medium Corporation, © 2019-2020 App Annie, © 2019 AppKiwi, © 2014, 2019 Armenian News - Tert.am, © 2012-2014, 2018-2019 ARMENPRESS, © 2002, 2006, 2008, 2010-2011, 2013-2014, 2019-2020 Assimba.org, © 2011-2019 Atv - Eritrean Satellite Television, © 2016-2017 AtYourService.pk, © 2018-2019 Aysor, © 2019 Bag, © 2002-2003, 2009-2019 Baidu, © 2017-2019 Bangla sms bengali shayari, © 2019 bbcode0.com, © 2014, 2019-2020 Benawa Network, © 2002, 2012-2019 Bennett Coleman & Co. Ltd., © 2013-2019 Best TV, © 2000, 2019-2020 BigCommerce Pty. Ltd., © 2019 Bnet Technologies, © 2017 BONDHU2U, © 2011, 2015-2018 BuzzFeed, Inc., © 2016-2017, 2019 CBSEPORTAL.COM, © 2019-2020 cinejosh.com, © 2019 Civic Network OPORA, © 2010, 2018-2019 Clipart.com, a division of Vital Imagery Ltd., © 2000, 2015, 2019-2020 Cloudinary, © 2012, 2019 CMS, © 2013, 2018-2019 COUNTRY.ua, © 2014-2019 CyberAgent, Inc., © 2019 Daily Hunt, © 2013-2020 Dehai.org, © 2015, 2019 Deutsche Welle, © 2016-2019 DF Marketplace Company Limited, © 2019 DigitalOcean, LLC, © 2017, 2019 DocPlayer.hu, © 2019 Dreamstime, © 2000, 2007-2019 DuckDuckGo Blog, © 2012-2014, 2016-2019 DVB Multimedia Group, © 2018-2019 DYODEKA SA, © 2012-2019 EastAFRO.com, © 2018-2020 eBay Inc., © 2017-2019 Electronic Database of Cultural Values, © 2019-2020 ePapersland.com, © 2010-2019 Eritrea-Chat.Com, © 2019 Ethiopian Press Agency, © 2019 Etsy, Inc., © 2015-2016, 2019-2020 Exotic India, © 2017-2018 Ezinemart, © 2019-2020 F5, Inc., © 2019 Fine Arts Department. Ministry of Culture, © 2013-2019 Free 4 Reader, © 2016-2019 FRESH NEWS, © 2016-2019 Global Publishers, © 2000-2001, 2003, 2005-2006, 2011, 2015-2020 Google Inc., © 2019-2020 Google LLC, © 2013-2018 Goolgule, © 2013-2014, 2016, 2019 Hetq, © 2010-2014, 2016 Himalayabon.com, © 2011, 2019-2020 Holding "Labyrinth", © 2019 Houshamadyan - Houshamadyan e.V., © 2014, 2019 HRAPARAK, © 2019-2020 Imgur, Inc, © 2016, 2019 Institute for Development of Freedom of Information, © 2013-2015, 2017-2019 IRAVABAN.NET, © 2018-2019 Islam land, © 2019 Jagran Prakashan Ltd, © 2013, 2019 Jofogas, © 2005-2006, 2019 Kapruka.com, © 2019 Kerala Niyamasabha, © 2019 Kesari Weekly, © 2019 Khamsat.com, subsidiary of Hsoub, © 2019 Kidzpark, © 2019 LEPL LEGISLATIVE HERALD OF GEORGIA, © 2019 LLC "Infourok", © 2012, 2019-2020 LLC "Yandex", © 2014, 2019 Magzter Inc, © 2003, 2019 Mahibere Kidusan, © 2006, 2017-2020 Mashreq News, © 2012, 2016-2019 Matichon Public Co., Ltd., © 2016-2018 MemeBuster, © 2019 Mereb Inc., © 2019 Microsoft, © 2014-2019 MillardAyo.com, © 2019 Minhaj-ul-Quran International, © 2015-2016, 2018-2019 MJ Innovations (Pvt) Ltd, © 2003, 2008, 2019-2020 Mohalla Tech Pvt. Ltd., © 2019 Mohsensoft, © 2019 MyShared Inc., © 2018 Nai, © 2014, 2017-2019 Newsroom Ltd., © 2019 Nikand, © 2002, 2019 nplg.gov.ge, © 2016, 2019 nuaodisha.com, © 2012, 2015-2016, 2018-2019 OdiaWeb, © 2019 Omedia Studio, © 2015-2019 online auction auction.ru, © 2019-2020 Owler Inc., © 2019 Oxford University Press, © 2019 "Paste.Pics", © 2010, 2014-2019 People's Daily Online, © 2017, 2019 Pinterest, © 2019 Prom.ua, © 2019 Qurango, © 2011, 2013, 2019-2020 ResearchGate GmbH, © 2019-2020 Reddit Inc, © 2014, 2016-2017, 2019-2020 RFE/RL, Inc., © 2019 Rozetka Online Store, © 2012, 2018-2019 Sambad, © 2016-2019 Satenaw News/Breaking News, © 2019 Scribd Inc., © 2018-2019 Semayat Book Store, © 2018-2019 Shant TV, © 2002, 2008, 2013, 2019 Share Your Essays, © 2018-2019 Shutterstock, Inc., © 2019 Simon Ager, © 2008, 2010-2013, 2019-2020 SlidePlayer.com Inc., © 2017, 2019-2020 SlideServe, © 2019 Slide-Share, © 2016-2017 Smart Doc Posters, © 2009, 2019-2020 SmugMug, Inc., © 2019 spotidoc.com, © 2009, 2019 Squarespace, © 2007, 2019 svitppt Inc., © 2013-2014, 2017, 2019 Tabula, © 2019 Teachers Pay Teachers|Teacher Synergy LLC, © 2019-2020 TAMIL TEXTBOOKS, © 2019 Tanzania Educational Publishers Ltd, © 2019-2020 TeluguOne.com, © 2010, 2019 Text Book Centre Ltd, © 2016, 2018-2019 The Hankyoreh, © 2019 The News Minute, © 2019 The Samaja Epaper, © 2000, 2019 The University of Chicago, © 2010, 2018-2019 Tibet News, © 2015, 2017 Tibetan Community Health Network, © 2017-2019 Tigray Communication Affairs Bureau, © 2019-2020 TripAdvisor LLC, © 2015-2019 Tsanpo.com, © 2015-2019 Tsem Rinpoche, © 2017-2019 University of South-East Asia, © 2004-2007, 2011-2014, 2016-2019 Upali Newspapers (Pvt) Ltd., © 2019 VietTouch, © 2019 Vindad, © 2019 Vinh Phuc Newspaper, © 2012, 2019 Wasabi Technologies, Inc., © 2019-2020 WatKhemaraRatanaram.org, © 2011, 2016, 2018-2020 Wonder Idea Technology Co., Ltd., © 2019 WorthPoint Corporation, © 2013, 2019 www.Dek-D.com, © 2019 Yakaboo, © 2009, 2011, 2019 yeddyurappa.in, © 2001, 2003-2004, 2009, 2012-2013, 2016, 2019-2020 Yumpu.com, © 2011-2019 ZeHabesha, © 2020, 2022 Trustees of the University of Pennsylvania