LDC Corpora ⇔ Projects
Many of the corpora in the Catalog were developed for, or used in, sponsored research projects. Some of those resources served as training and test data for benchmark evaluations of language-based systems developed during the project. A corpus is associated with a given project because it was developed for the project, was used in the project, or was otherwise considered relevant to the project's work. A corpus may therefore appear under more than one project.
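Because the association runs in both directions, it can help to load the listing into two lookups: project to corpora, and corpus to projects. The following is only a minimal sketch, assuming the plain-text layout used on this page (a project name on its own line, followed by pipe-delimited rows of catalog ID and title). The function name `parse_listing` and the variable names are illustrative and not part of any LDC tool; the four sample rows are copied from the ACE and AQUAINT lists.

```python
from collections import defaultdict

# Four rows copied from the ACE and AQUAINT lists on this page.
LISTING = """\
ACE
| LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | |
| LDC2004T14 | Proposition Bank I |
AQUAINT
| LDC2008T25 | AQUAINT-2 Information-Retrieval Text Research Collection | |
| LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus |
"""


def parse_listing(text):
    """Return (project -> [(catalog_id, title)], catalog_id -> [projects]).

    Assumes a line starting with "|" is a corpus row and any other
    non-empty line is a project heading (an assumption about this layout).
    """
    by_project = defaultdict(list)
    by_corpus = defaultdict(list)
    project = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("|"):
            # Split the row into cells and drop the empty trailing cell.
            cells = [c.strip() for c in line.strip("|").split("|")]
            cells = [c for c in cells if c]
            if project is None or len(cells) < 2:
                continue  # skip rows that do not fit the assumed shape
            catalog_id, title = cells[0], cells[1]
            by_project[project].append((catalog_id, title))
            by_corpus[catalog_id].append(project)
        else:
            project = line  # a new project heading starts here
    return dict(by_project), dict(by_corpus)


by_project, by_corpus = parse_listing(LISTING)

# Corpora associated with more than one project in the sample.
shared = {cid: projects for cid, projects in by_corpus.items() if len(projects) > 1}
print(shared)
```

Keeping the inverse lookup alongside the per-project lists makes it easy to see which catalog IDs recur across projects without scanning every list by hand.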
ACE
| LDC2017T10 | Abstract Meaning Representation (AMR) Annotation Release 2.0 | |
| LDC2020T02 | Abstract Meaning Representation (AMR) Annotation Release 3.0 | |
| LDC2024T11 | Abstract Meaning Representation 3.0 - Machine Translations | |
| LDC2005T09 | ACE 2004 Multilingual Training Corpus | |
| LDC2008T03 | ACE 2005 English SpatialML Annotations | |
| LDC2011T02 | ACE 2005 English SpatialML Annotations Version 2 | |
| LDC2010T09 | ACE 2005 Mandarin SpatialML Annotations | |
| LDC2006T06 | ACE 2005 Multilingual Training Corpus | |
| LDC2014T18 | ACE 2007 Multilingual Training Corpus | |
| LDC2015T20 | ACE 2007 Spanish DevTest - Pilot Evaluation | |
| LDC2010T18 | ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 | |
| LDC2005T07 | ACE Time Normalization (TERN) 2004 English Training Data v 1.0 | |
| LDC2003T11 | ACE-2 Version 1.0 | |
| LDC2024T05 | Automatic Content Extraction for Portuguese | |
| LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | |
| LDC2019T07 | Chinese Abstract Meaning Representation 1.0 | |
| LDC2011T08 | Datasets for Generic Relation Extraction (reACE) | |
| LDC2004T14 | Proposition Bank I | |
| LDC2009T11 | REFLEX Entity Translation Training/DevTest | |
| LDC2004T09 | TIDES Extraction (ACE) 2003 Multilingual Training Data |
AIDA
| LDC2023T10 | AIDA Scenario 1 and 2 Reference Knowledge Base | |
| LDC2025T13 | AIDA Scenario 1 Evaluation Topic Source Data, Annotation, and Assessment | |
| LDC2024T02 | AIDA Scenario 1 Practice Topic Annotation | |
| LDC2023T11 | AIDA Scenario 1 Practice Topic Source Data | |
| LDC2024T06 | AIDA Scenario 2 Practice Topic Annotation | |
| LDC2024T04 | AIDA Scenario 2 Practice Topic Source Data | |
| LDC2025T02 | AIDA Scenario 3 Practice Topic Source Data and Annotation | |
| LDC2023S01 | AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts |
American National Corpus (ANC)
| LDC2005T35 | American National Corpus (ANC) Second Release | |
| LDC2010T22 | Manually Annotated Sub-Corpus First Release | |
| LDC2013T12 | Manually Annotated Sub-Corpus Third Release |
AnnoDIFP
| LDC2025S06 | AnnoDIFP Session Audio and Transcripts |
AQUAINT
| LDC2008T25 | AQUAINT-2 Information-Retrieval Text Research Collection | |
| LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus |
ATIS
| LDC2021T04 | ATIS - Seven Languages | |
| LDC93S4A | ATIS0 Complete | |
| LDC93S4B | ATIS0 Pilot | |
| LDC93S4B-2 | ATIS0 Read | |
| LDC93S4B-3 | ATIS0 SD Read | |
| LDC93S5 | ATIS2 | |
| LDC95S26 | ATIS3 Test Data | |
| LDC94S19 | ATIS3 Training Data | |
| LDC2019T04 | Multilingual ATIS |
BOLT
| LDC2014T12 | Abstract Meaning Representation (AMR) Annotation Release 1.0 | |
| LDC2017T10 | Abstract Meaning Representation (AMR) Annotation Release 2.0 | |
| LDC2020T02 | Abstract Meaning Representation (AMR) Annotation Release 3.0 | |
| LDC2020T07 | Abstract Meaning Representation 2.0 - Four Translations | |
| LDC2024T11 | Abstract Meaning Representation 3.0 - Machine Translations | |
| LDC2019T01 | BOLT Arabic Discussion Forum Parallel Training Data | |
| LDC2018T10 | BOLT Arabic Discussion Forums | |
| LDC2021T07 | BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
| LDC2017T05 | BOLT Chinese Discussion Forum Parallel Training Data | |
| LDC2016T05 | BOLT Chinese Discussion Forums | |
| LDC2018T15 | BOLT Chinese SMS/Chat | |
| LDC2021T11 | BOLT Chinese SMS/Chat Parallel Training Data | |
| LDC2016T19 | BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training | |
| LDC2020T15 | BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training | |
| LDC2019T13 | BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training | |
| LDC2025S09 | BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio | |
| LDC2025T14 | BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations | |
| LDC2025S04 | BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio | |
| LDC2025T05 | BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations | |
| LDC2021T14 | BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
| LDC2021T18 | BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
| LDC2017T07 | BOLT Egyptian Arabic SMS/Chat and Transliteration | |
| LDC2021T15 | BOLT Egyptian Arabic SMS/Chat Parallel Training Data | |
| LDC2021T12 | BOLT Egyptian Arabic Treebank - Conversational Telephone Speech | |
| LDC2018T23 | BOLT Egyptian Arabic Treebank - Discussion Forum | |
| LDC2021T17 | BOLT Egyptian Arabic Treebank - SMS/Chat | |
| LDC2020T05 | BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training | |
| LDC2019T18 | BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training | |
| LDC2019T06 | BOLT Egyptian-English Word Alignment -- Discussion Forum Training | |
| LDC2020T20 | BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
| LDC2017T11 | BOLT English Discussion Forums | |
| LDC2020T21 | BOLT English PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
| LDC2018T19 | BOLT English SMS/Chat | |
| LDC2020T09 | BOLT English Translation Treebank - Chinese Discussion Forum | |
| LDC2021T19 | BOLT English Translation Treebank - Chinese SMS/Chat | |
| LDC2022T06 | BOLT English Translation Treebank - Egyptian Arabic SMS/Chat | |
| LDC2019T15 | BOLT English Treebank - Discussion Forum | |
| LDC2021T03 | BOLT English Treebank - SMS/Chat | |
| LDC2018T18 | BOLT Information Retrieval Comprehensive Training and Evaluation | |
| LDC2013T21 | Chinese Treebank 8.0 | |
| LDC2016T13 | Chinese Treebank 9.0 | |
| LDC2024T03 | LoReHLT Hausa Representative Language Pack | |
| LDC2025T08 | LoReHLT Uzbek Representative Language Pack |
CAMIO
| LDC2022T07 | CAMIO Transcription Languages |
CHiME
| LDC2017S07 | CHiME2 Grid | |
| LDC2017S10 | CHiME2 WSJ0 | |
| LDC2017S24 | CHiME3 |
Communicator
| LDC2004T15 | 2000 Communicator Dialogue Act Tagged | |
| LDC2002S56 | 2000 Communicator Evaluation | |
| LDC2004T16 | 2001 Communicator Dialogue Act Tagged | |
| LDC2003S01 | 2001 Communicator Evaluation |
CoNLL
| LDC2015T12 | 2006 CoNLL Shared Task - Arabic & Czech | |
| LDC2015T11 | 2006 CoNLL Shared Task - Ten Languages | |
| LDC2018T08 | 2007 CoNLL Shared Task - Arabic & English | |
| LDC2018T06 | 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish | |
| LDC2018T07 | 2007 CoNLL Shared Task - Greek, Hungarian & Italian | |
| LDC2012T03 | 2009 CoNLL Shared Task Part 1 | |
| LDC2012T04 | 2009 CoNLL Shared Task Part 2 | |
| LDC2017T13 | 2015-2016 CoNLL Shared Task |
DARPA-CSR
| LDC2005S08 | BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts | |
| LDC93S6A | CSR-I (WSJ0) Complete | |
| LDC93S6C | CSR-I (WSJ0) Other | |
| LDC93S6B | CSR-I (WSJ0) Sennheiser | |
| LDC94S13A | CSR-II (WSJ1) Complete | |
| LDC94S13C | CSR-II (WSJ1) Other | |
| LDC94S13B | CSR-II (WSJ1) Sennheiser | |
| LDC95S23 | CSR-III Speech | |
| LDC95T6 | CSR-III Text | |
| LDC96S33 | CSR-IV HUB3 | |
| LDC96S31 | CSR-IV HUB4 |
DASL
| LDC2003T15 | SLX Corpus of Classic Sociolinguistic Interviews |
DEFT
| LDC2014T12 | Abstract Meaning Representation (AMR) Annotation Release 1.0 | |
| LDC2017T10 | Abstract Meaning Representation (AMR) Annotation Release 2.0 | |
| LDC2020T02 | Abstract Meaning Representation (AMR) Annotation Release 3.0 | |
| LDC2020T07 | Abstract Meaning Representation 2.0 - Four Translations | |
| LDC2024T11 | Abstract Meaning Representation 3.0 - Machine Translations | |
| LDC2020L02 | Chinese Lexical Resources for Gender, Number, Animacy | |
| LDC2019T03 | DEFT Chinese Committed Belief Annotation | |
| LDC2020T19 | DEFT Chinese Light and Rich ERE Annotation | |
| LDC2019T16 | DEFT English Committed Belief Annotation | |
| LDC2023T04 | DEFT English Light and Rich ERE Annotation | |
| LDC2016T07 | DEFT Narrative Text | |
| LDC2019T09 | DEFT Spanish Committed Belief Annotation | |
| LDC2025T04 | DEFT Spanish Light and Rich ERE Annotation | |
| LDC2018T01 | DEFT Spanish Treebank | |
| LDC2016T23 | Richer Event Description | |
| LDC2023T13 | TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 | |
| LDC2017T09 | The EventStatus Corpus |
DIRHA
| LDC2018S01 | DIRHA English WSJ Audio |
DOE/IRS2008-0256
| LDC2025L01 | Iraqi Arabic - English Lexical Database | |
| LDC2023L01 | Moroccan Arabic - English Lexical Database |
EARS
| LDC97S66 | 1996 English Broadcast News Dev and Eval (HUB4) | |
| LDC97S44 | 1996 English Broadcast News Speech (HUB4) | |
| LDC97T22 | 1996 English Broadcast News Transcripts (HUB4) | |
| LDC98S71 | 1997 English Broadcast News Speech (HUB4) | |
| LDC98T28 | 1997 English Broadcast News Transcripts (HUB4) | |
| LDC2001S91 | 1997 HUB4 Broadcast News Evaluation Non-English Test Material | |
| LDC2002S11 | 1997 HUB4 English Evaluation Speech and Transcripts | |
| LDC2002S22 | 1997 HUB5 Arabic Evaluation | |
| LDC2002T39 | 1997 HUB5 Arabic Transcripts | |
| LDC2002S24 | 1997 HUB5 German Evaluation | |
| LDC2003T03 | 1997 HUB5 German Transcripts | |
| LDC2002S25 | 1997 HUB5 Spanish Evaluation | |
| LDC2003T04 | 1997 HUB5 Spanish Transcripts | |
| LDC98S73 | 1997 Mandarin Broadcast News Speech (HUB4-NE) | |
| LDC98T24 | 1997 Mandarin Broadcast News Transcripts (HUB4-NE) | |
| LDC2002S10 | 1998 HUB5 English Evaluation | |
| LDC2003T02 | 1998 HUB5 English Transcripts | |
| LDC2002S13 | 2001 HUB5 English Evaluation | |
| LDC2002S12 | 2001 HUB5 Mandarin Evaluation | |
| LDC2003T01 | 2001 HUB5 Mandarin Transcripts | |
| LDC2004S11 | 2002 Rich Transcription Broadcast News and Conversational Telephone Speech | |
| LDC99L23 | American English Spoken Lexicon | |
| LDC2005S07 | Arabic CTS Levantine Fisher Training Data Set 3, Speech | |
| LDC2005T03 | Arabic CTS Levantine Fisher Training Data Set 3, Transcripts | |
| LDC2003T12 | Arabic Gigaword | |
| LDC2001T55 | Arabic Newswire Part 1 | |
| LDC2005S08 | BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts | |
| LDC96S46 | CALLFRIEND American English-Non-Southern Dialect | |
| LDC2019S21 | CALLFRIEND American English-Non-Southern Dialect Second Edition | |
| LDC96S47 | CALLFRIEND American English-Southern Dialect | |
| LDC2020S08 | CALLFRIEND American English-Southern Dialect Second Edition | |
| LDC2019S18 | CALLFRIEND Canadian French Second Edition | |
| LDC96S49 | CALLFRIEND Egyptian Arabic | |
| LDC2019S04 | CALLFRIEND Egyptian Arabic Second Edition | |
| LDC96S55 | CALLFRIEND Mandarin Chinese-Mainland Dialect | |
| LDC2018S09 | CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition | |
| LDC96S56 | CALLFRIEND Mandarin Chinese-Taiwan Dialect | |
| LDC2020S06 | CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition | |
| LDC97L20 | CALLHOME American English Lexicon (PRONLEX) | |
| LDC97S42 | CALLHOME American English Speech | |
| LDC97T14 | CALLHOME American English Transcripts | |
| LDC97S45 | CALLHOME Egyptian Arabic Speech | |
| LDC2002S37 | CALLHOME Egyptian Arabic Speech Supplement | |
| LDC97T19 | CALLHOME Egyptian Arabic Transcripts | |
| LDC2002T38 | CALLHOME Egyptian Arabic Transcripts Supplement | |
| LDC96L15 | CALLHOME Mandarin Chinese Lexicon | |
| LDC96S34 | CALLHOME Mandarin Chinese Speech | |
| LDC96T16 | CALLHOME Mandarin Chinese Transcripts | |
| LDC2003T09 | Chinese Gigaword | |
| LDC2005T14 | Chinese Gigaword Second Edition | |
| LDC2005T08 | Discourse Graphbank | |
| LDC99L22 | Egyptian Colloquial Arabic Lexicon | |
| LDC2003T05 | English Gigaword | |
| LDC2005T12 | English Gigaword Second Edition | |
| LDC2005S13 | Fisher English Training Part 2, Speech | |
| LDC2005T19 | Fisher English Training Part 2, Transcripts | |
| LDC2004S13 | Fisher English Training Speech Part 1 Speech | |
| LDC2004T19 | Fisher English Training Speech Part 1 Transcripts | |
| LDC2005S15 | HKUST Mandarin Telephone Speech, Part 1 | |
| LDC2005T32 | HKUST Mandarin Telephone Transcript Data, Part 1 | |
| LDC2018S18 | HUB5 Mandarin Telephone Speech and Transcripts Second Edition | |
| LDC98S69 | HUB5 Mandarin Telephone Speech Corpus | |
| LDC98T26 | HUB5 Mandarin Transcripts | |
| LDC2005S14 | Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) | |
| LDC2006S29 | Levantine Arabic QT Training Data Set 5, Speech | |
| LDC2006T07 | Levantine Arabic QT Training Data Set 5, Transcripts | |
| LDC95T13 | Mandarin Chinese News Text | |
| LDC95T21 | North American News Text Corpus | |
| LDC98T30 | North American News Text Supplement | |
| LDC2004S08 | RT-03 MDE Training Data Speech | |
| LDC2004T12 | RT-03 MDE Training Data Text and Annotations | |
| LDC2005S16 | RT-04 MDE Training Data Speech | |
| LDC2005T24 | RT-04 MDE Training Data Text/Annotations | |
| LDC2004S10 | Santa Barbara Corpus of Spoken American English Part III | |
| LDC2005S25 | Santa Barbara Corpus of Spoken American English Part IV | |
| LDC2006T12 | Spanish Gigaword First Edition | |
| LDC2009T21 | Spanish Gigaword Second Edition | |
| LDC2001S13 | Switchboard Cellular Part 1 Audio | |
| LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio | |
| LDC2001T14 | Switchboard Cellular Part 1 Transcription | |
| LDC2004S07 | Switchboard Cellular Part 2 Audio | |
| LDC97S62 | Switchboard-1 Release 2 | |
| LDC98S75 | Switchboard-2 Phase I | |
| LDC99S79 | Switchboard-2 Phase II | |
| LDC2002S06 | Switchboard-2 Phase III Audio | |
| LDC98S72 | Taiwanese Putonghua Speech and Transcripts | |
| LDC98T25 | TDT Pilot Study Corpus | |
| LDC2000S92 | TDT2 Careful Transcription Audio | |
| LDC2000T44 | TDT2 Careful Transcription Text | |
| LDC99S84 | TDT2 English Audio | |
| LDC2001S93 | TDT2 Mandarin Audio Corpus | |
| LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | |
| LDC2001S94 | TDT3 English Audio | |
| LDC2001S95 | TDT3 Mandarin Audio | |
| LDC2001T58 | TDT3 Multilanguage Text Version 2.0 |
GALE
| LDC97S66 | 1996 English Broadcast News Dev and Eval (HUB4) | |
| LDC97S44 | 1996 English Broadcast News Speech (HUB4) | |
| LDC97T22 | 1996 English Broadcast News Transcripts (HUB4) | |
| LDC98S71 | 1997 English Broadcast News Speech (HUB4) | |
| LDC98T28 | 1997 English Broadcast News Transcripts (HUB4) | |
| LDC2001S91 | 1997 HUB4 Broadcast News Evaluation Non-English Test Material | |
| LDC2002S11 | 1997 HUB4 English Evaluation Speech and Transcripts | |
| LDC2002S22 | 1997 HUB5 Arabic Evaluation | |
| LDC2002T39 | 1997 HUB5 Arabic Transcripts | |
| LDC2002S24 | 1997 HUB5 German Evaluation | |
| LDC2003T03 | 1997 HUB5 German Transcripts | |
| LDC2002S25 | 1997 HUB5 Spanish Evaluation | |
| LDC2003T04 | 1997 HUB5 Spanish Transcripts | |
| LDC98S73 | 1997 Mandarin Broadcast News Speech (HUB4-NE) | |
| LDC98T24 | 1997 Mandarin Broadcast News Transcripts (HUB4-NE) | |
| LDC2002S10 | 1998 HUB5 English Evaluation | |
| LDC2003T02 | 1998 HUB5 English Transcripts | |
| LDC2002S13 | 2001 HUB5 English Evaluation | |
| LDC2002S12 | 2001 HUB5 Mandarin Evaluation | |
| LDC2003T01 | 2001 HUB5 Mandarin Transcripts | |
| LDC2004S11 | 2002 Rich Transcription Broadcast News and Conversational Telephone Speech | |
| LDC2009T05 | 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data | |
| LDC2011T05 | 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set | |
| LDC2017T10 | Abstract Meaning Representation (AMR) Annotation Release 2.0 | |
| LDC2020T02 | Abstract Meaning Representation (AMR) Annotation Release 3.0 | |
| LDC2024T11 | Abstract Meaning Representation 3.0 - Machine Translations | |
| LDC2005T09 | ACE 2004 Multilingual Training Corpus | |
| LDC2005T07 | ACE Time Normalization (TERN) 2004 English Training Data v 1.0 | |
| LDC2003T11 | ACE-2 Version 1.0 | |
| LDC93T1 | ACL/DCI | |
| LDC99L23 | American English Spoken Lexicon | |
| LDC2012T21 | Annotated English Gigaword | |
| LDC2005S07 | Arabic CTS Levantine Fisher Training Data Set 3, Speech | |
| LDC2005T03 | Arabic CTS Levantine Fisher Training Data Set 3, Transcripts | |
| LDC2004T18 | Arabic English Parallel News Part 1 | |
| LDC2003T12 | Arabic Gigaword | |
| LDC2011T11 | Arabic Gigaword Fifth Edition | |
| LDC2009T30 | Arabic Gigaword Fourth Edition | |
| LDC2007T40 | Arabic Gigaword Third Edition | |
| LDC2004T17 | Arabic News Translation Text Part 1 | |
| LDC2001T55 | Arabic Newswire Part 1 | |
| LDC2012T07 | Arabic Treebank - Broadcast News v1.0 | |
| LDC2016T02 | Arabic Treebank - Weblog | |
| LDC2003T07 | Arabic Treebank: Part 1 - 10K-word English Translation | |
| LDC2003T06 | Arabic Treebank: Part 1 v 2.0 | |
| LDC2005T02 | Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis) | |
| LDC2010T13 | Arabic Treebank: Part 1 v 4.1 | |
| LDC2004T02 | Arabic Treebank: Part 2 v 2.0 | |
| LDC2011T09 | Arabic Treebank: Part 2 v 3.1 | |
| LDC2005T20 | Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) | |
| LDC2004T11 | Arabic Treebank: Part 3 v 1.0 | |
| LDC2012T09 | Arabic-Dialect/English Parallel Text | |
| LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | |
| LDC2005S08 | BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts | |
| LDC2000T43 | BLLIP 1987-89 WSJ Corpus Release 1 | |
| LDC2002L49 | Buckwalter Arabic Morphological Analyzer Version 1.0 | |
| LDC2004L02 | Buckwalter Arabic Morphological Analyzer Version 2.0 | |
| LDC96S46 | CALLFRIEND American English-Non-Southern Dialect | |
| LDC2019S21 | CALLFRIEND American English-Non-Southern Dialect Second Edition | |
| LDC96S47 | CALLFRIEND American English-Southern Dialect | |
| LDC2020S08 | CALLFRIEND American English-Southern Dialect Second Edition | |
| LDC2019S18 | CALLFRIEND Canadian French Second Edition | |
| LDC96S49 | CALLFRIEND Egyptian Arabic | |
| LDC2019S04 | CALLFRIEND Egyptian Arabic Second Edition | |
| LDC96S55 | CALLFRIEND Mandarin Chinese-Mainland Dialect | |
| LDC2018S09 | CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition | |
| LDC96S56 | CALLFRIEND Mandarin Chinese-Taiwan Dialect | |
| LDC2020S06 | CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition | |
| LDC97L20 | CALLHOME American English Lexicon (PRONLEX) | |
| LDC97S42 | CALLHOME American English Speech | |
| LDC97T14 | CALLHOME American English Transcripts | |
| LDC97S45 | CALLHOME Egyptian Arabic Speech | |
| LDC2002S37 | CALLHOME Egyptian Arabic Speech Supplement | |
| LDC97T19 | CALLHOME Egyptian Arabic Transcripts | |
| LDC2002T38 | CALLHOME Egyptian Arabic Transcripts Supplement | |
| LDC96L15 | CALLHOME Mandarin Chinese Lexicon | |
| LDC96S34 | CALLHOME Mandarin Chinese Speech | |
| LDC96T16 | CALLHOME Mandarin Chinese Transcripts | |
| LDC2005T13 | CCGbank | |
| LDC96L14 | CELEX2 | |
| LDC2005T10 | Chinese English News Magazine Parallel Text | |
| LDC2003T09 | Chinese Gigaword | |
| LDC2011T13 | Chinese Gigaword Fifth Edition | |
| LDC2009T27 | Chinese Gigaword Fourth Edition | |
| LDC2005T14 | Chinese Gigaword Second Edition | |
| LDC2007T38 | Chinese Gigaword Third Edition | |
| LDC2005T06 | Chinese News Translation Text Part 1 | |
| LDC2005T23 | Chinese Proposition Bank 1.0 | |
| LDC2001T11 | Chinese Treebank 2.0 | |
| LDC2004T05 | Chinese Treebank 4.0 | |
| LDC2005T01 | Chinese Treebank 5.0 | |
| LDC2007T36 | Chinese Treebank 6.0 | |
| LDC2010T07 | Chinese Treebank 7.0 | |
| LDC2013T21 | Chinese Treebank 8.0 | |
| LDC2016T13 | Chinese Treebank 9.0 | |
| LDC2002L27 | Chinese-English Translation Lexicon Version 3.0 | |
| LDC2018T20 | Concretely Annotated English Gigaword | |
| LDC2005T08 | Discourse Graphbank | |
| LDC99L22 | Egyptian Colloquial Arabic Lexicon | |
| LDC2009T01 | English CTS Treebank with Structural Metadata | |
| LDC2003T05 | English Gigaword | |
| LDC2011T07 | English Gigaword Fifth Edition | |
| LDC2009T13 | English Gigaword Fourth Edition | |
| LDC2005T12 | English Gigaword Second Edition | |
| LDC2007T07 | English Gigaword Third Edition | |
| LDC2012T02 | English Translation Treebank: An-Nahar Newswire | |
| LDC2012T13 | English Web Treebank | |
| LDC2006T10 | English-Arabic Treebank v 1.0 | |
| LDC95T11 | European Language Newspaper Text | |
| LDC2005S13 | Fisher English Training Part 2, Speech | |
| LDC2005T19 | Fisher English Training Part 2, Transcripts | |
| LDC2004S13 | Fisher English Training Speech Part 1 Speech | |
| LDC2004T19 | Fisher English Training Speech Part 1 Transcripts | |
| LDC2007S02 | Fisher Levantine Arabic Conversational Telephone Speech | |
| LDC2007T04 | Fisher Levantine Arabic Conversational Telephone Speech, Transcripts | |
| LDC2013T14 | GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 | |
| LDC2014T03 | GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2 | |
| LDC2013T10 | GALE Arabic-English Parallel Aligned Treebank -- Newswire | |
| LDC2014T08 | GALE Arabic-English Parallel Aligned Treebank -- Web Training | |
| LDC2014T19 | GALE Arabic-English Word Alignment -- Broadcast Training Part 1 | |
| LDC2014T22 | GALE Arabic-English Word Alignment -- Broadcast Training Part 2 | |
| LDC2014T05 | GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web | |
| LDC2014T10 | GALE Arabic-English Word Alignment Training Part 2 -- Newswire | |
| LDC2014T14 | GALE Arabic-English Word Alignment Training Part 3 -- Web | |
| LDC2015T06 | GALE Chinese-English Parallel Aligned Treebank -- Training | |
| LDC2013T23 | GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 | |
| LDC2014T25 | GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 | |
| LDC2015T04 | GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 | |
| LDC2015T18 | GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 | |
| LDC2012T16 | GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web | |
| LDC2012T20 | GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire | |
| LDC2012T24 | GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web | |
| LDC2013T05 | GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web | |
| LDC2017T06 | GALE English-Chinese Parallel Aligned Treebank -- Training | |
| LDC2008T02 | GALE Phase 1 Arabic Blog Parallel Text | |
| LDC2007T24 | GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 | |
| LDC2008T09 | GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 | |
| LDC2009T03 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 | |
| LDC2009T09 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 | |
| LDC2008T06 | GALE Phase 1 Chinese Blog Parallel Text | |
| LDC2009T02 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 | |
| LDC2009T06 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 | |
| LDC2007T23 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 | |
| LDC2008T08 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2 | |
| LDC2008T18 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 | |
| LDC2009T15 | GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 | |
| LDC2010T03 | GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2 | |
| LDC2007T20 | GALE Phase 1 Distillation Training | |
| LDC2012T06 | GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 | |
| LDC2012T14 | GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 | |
| LDC2013S02 | GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 | |
| LDC2013S07 | GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 | |
| LDC2013T04 | GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 | |
| LDC2013T17 | GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 | |
| LDC2012T18 | GALE Phase 2 Arabic Broadcast News Parallel Text | |
| LDC2014S07 | GALE Phase 2 Arabic Broadcast News Speech Part 1 | |
| LDC2015S01 | GALE Phase 2 Arabic Broadcast News Speech Part 2 | |
| LDC2014T17 | GALE Phase 2 Arabic Broadcast News Transcripts Part 1 | |
| LDC2015T01 | GALE Phase 2 Arabic Broadcast News Transcripts Part 2 | |
| LDC2012T17 | GALE Phase 2 Arabic Newswire Parallel Text | |
| LDC2013T01 | GALE Phase 2 Arabic Web Parallel Text | |
| LDC2013T11 | GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1 | |
| LDC2013T16 | GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2 | |
| LDC2013S04 | GALE Phase 2 Chinese Broadcast Conversation Speech | |
| LDC2013T08 | GALE Phase 2 Chinese Broadcast Conversation Transcripts | |
| LDC2014T04 | GALE Phase 2 Chinese Broadcast News Parallel Text Part 1 | |
| LDC2014T11 | GALE Phase 2 Chinese Broadcast News Parallel Text Part 2 | |
| LDC2013S08 | GALE Phase 2 Chinese Broadcast News Speech | |
| LDC2013T20 | GALE Phase 2 Chinese Broadcast News Transcripts | |
| LDC2014T15 | GALE Phase 2 Chinese Newswire Parallel Text Part 1 | |
| LDC2014T20 | GALE Phase 2 Chinese Newswire Parallel Text Part 2 | |
| LDC2014T26 | GALE Phase 2 Chinese Web Parallel Text | |
| LDC2015T05 | GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text | |
| LDC2015T07 | GALE Phase 3 and 4 Arabic Broadcast News Parallel Text | |
| LDC2015T19 | GALE Phase 3 and 4 Arabic Newswire Parallel Text | |
| LDC2016T08 | GALE Phase 3 and 4 Arabic Web Parallel Text | |
| LDC2016T09 | GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text | |
| LDC2016T15 | GALE Phase 3 and 4 Chinese Broadcast News Parallel Text | |
| LDC2016T25 | GALE Phase 3 and 4 Chinese Newswire Parallel Text | |
| LDC2017T02 | GALE Phase 3 and 4 Chinese Web Parallel Text | |
| LDC2015S11 | GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 | |
| LDC2016S01 | GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 | |
| LDC2015T16 | GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 | |
| LDC2016T06 | GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 | |
| LDC2016S07 | GALE Phase 3 Arabic Broadcast News Speech Part 1 | |
| LDC2017S02 | GALE Phase 3 Arabic Broadcast News Speech Part 2 | |
| LDC2016T17 | GALE Phase 3 Arabic Broadcast News Transcripts Part 1 | |
| LDC2017T04 | GALE Phase 3 Arabic Broadcast News Transcripts Part 2 | |
| LDC2014S09 | GALE Phase 3 Chinese Broadcast Conversation Speech Part 1 | |
| LDC2015S06 | GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 | |
| LDC2014T28 | GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1 | |
| LDC2015T09 | GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 | |
| LDC2015S13 | GALE Phase 3 Chinese Broadcast News Speech | |
| LDC2015T25 | GALE Phase 3 Chinese Broadcast News Transcripts | |
| LDC2016T11 | GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences | |
| LDC2017S15 | GALE Phase 4 Arabic Broadcast Conversation Speech | |
| LDC2017T12 | GALE Phase 4 Arabic Broadcast Conversation Transcripts | |
| LDC2016T20 | GALE Phase 4 Arabic Broadcast News Parallel Sentences | |
| LDC2018S05 | GALE Phase 4 Arabic Broadcast News Speech | |
| LDC2018T14 | GALE Phase 4 Arabic Broadcast News Transcripts | |
| LDC2016T27 | GALE Phase 4 Arabic Newswire Parallel Sentences | |
| LDC2016T14 | GALE Phase 4 Arabic Weblog Parallel Sentences | |
| LDC2015T14 | GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences | |
| LDC2016S03 | GALE Phase 4 Chinese Broadcast Conversation Speech | |
| LDC2016T12 | GALE Phase 4 Chinese Broadcast Conversation Transcripts | |
| LDC2015T21 | GALE Phase 4 Chinese Broadcast News Parallel Sentences | |
| LDC2017S25 | GALE Phase 4 Chinese Broadcast News Speech | |
| LDC2017T18 | GALE Phase 4 Chinese Broadcast News Transcripts | |
| LDC2015T24 | GALE Phase 4 Chinese Newswire Parallel Sentences | |
| LDC2016T04 | GALE Phase 4 Chinese Weblog Parallel Sentences | |
| LDC2005S15 | HKUST Mandarin Telephone Speech, Part 1 | |
| LDC2005T32 | HKUST Mandarin Telephone Transcript Data, Part 1 | |
| LDC2000T50 | Hong Kong Hansards Parallel Text | |
| LDC2000T47 | Hong Kong Laws Parallel Text | |
| LDC2000T46 | Hong Kong News Parallel Text | |
| LDC2004T08 | Hong Kong Parallel Text | |
| LDC2018S18 | HUB5 Mandarin Telephone Speech and Transcripts Second Edition | |
| LDC98S69 | HUB5 Mandarin Telephone Speech Corpus | |
| LDC98T26 | HUB5 Mandarin Transcripts | |
| LDC95T8 | Japanese Business News Text | |
| LDC99T34 | Japanese Business News Text Supplement | |
| LDC2000T45 | Korean Newswire | |
| LDC2005S14 | Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) | |
| LDC95T13 | Mandarin Chinese News Text | |
| LDC2001T02 | Message Understanding Conference (MUC) 7 | |
| LDC2003T18 | Multiple-Translation Arabic (MTA) Part 1 | |
| LDC2005T05 | Multiple-Translation Arabic (MTA) Part 2 | |
| LDC2003T17 | Multiple-Translation Chinese (MTC) Part 2 | |
| LDC2004T07 | Multiple-Translation Chinese (MTC) Part 3 | |
| LDC2002T01 | Multiple-Translation Chinese Corpus | |
| LDC2010T21 | NIST 2008 Open Machine Translation (OpenMT) Evaluation | |
| LDC2010T01 | NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations | |
| LDC95T21 | North American News Text Corpus | |
| LDC98T30 | North American News Text Supplement | |
| LDC2007T21 | OntoNotes Release 1.0 | |
| LDC2008T04 | OntoNotes Release 2.0 | |
| LDC2009T24 | OntoNotes Release 3.0 | |
| LDC2011T03 | OntoNotes Release 4.0 | |
| LDC2013T19 | OntoNotes Release 5.0 | |
| LDC2004T23 | Prague Arabic Dependency Treebank 1.0 | |
| LDC2004T14 | Proposition Bank I | |
| LDC2004S08 | RT-03 MDE Training Data Speech | |
| LDC2004T12 | RT-03 MDE Training Data Text and Annotations | |
| LDC2005S16 | RT-04 MDE Training Data Speech | |
| LDC2005T24 | RT-04 MDE Training Data Text/Annotations | |
| LDC2004S10 | Santa Barbara Corpus of Spoken American English Part III | |
| LDC2005S25 | Santa Barbara Corpus of Spoken American English Part IV | |
| LDC2013T18 | Semantic Textual Similarity (STS) 2013 Machine Translation | |
| LDC2006T12 | Spanish Gigaword First Edition | |
| LDC2009T21 | Spanish Gigaword Second Edition | |
| LDC95T9 | Spanish News Text | |
| LDC99T41 | Spanish Newswire Text, Volume 2 | |
| LDC2001S13 | Switchboard Cellular Part 1 Audio | |
| LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio | |
| LDC2001T14 | Switchboard Cellular Part 1 Transcription | |
| LDC2004S07 | Switchboard Cellular Part 2 Audio | |
| LDC97S62 | Switchboard-1 Release 2 | |
| LDC98S75 | Switchboard-2 Phase I | |
| LDC99S79 | Switchboard-2 Phase II | |
| LDC2002S06 | Switchboard-2 Phase III Audio | |
| LDC98S72 | Taiwanese Putonghua Speech and Transcripts | |
| LDC98T25 | TDT Pilot Study Corpus | |
| LDC2000S92 | TDT2 Careful Transcription Audio | |
| LDC2000T44 | TDT2 Careful Transcription Text | |
| LDC99S84 | TDT2 English Audio | |
| LDC2001S93 | TDT2 Mandarin Audio Corpus | |
| LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | |
| LDC2001S94 | TDT3 English Audio | |
| LDC2001S95 | TDT3 Mandarin Audio | |
| LDC2001T58 | TDT3 Multilanguage Text Version 2.0 | |
| LDC2005S11 | TDT4 Multilingual Broadcast News Speech Corpus | |
| LDC2005T16 | TDT4 Multilingual Text and Annotations | |
| LDC2004T09 | TIDES Extraction (ACE) 2003 Multilingual Training Data | |
| LDC93T3A | TIPSTER Complete | |
| LDC2018T13 | TRAD Arabic-French Parallel Text -- Newsgroup | |
| LDC2018T21 | TRAD Arabic-French Parallel Text -- Newswire | |
| LDC2018T02 | TRAD Chinese-French Parallel Text -- Blog | |
| LDC2018T17 | TRAD Chinese-French Parallel Text -- Broadcast News | |
| LDC2000T52 | TREC Mandarin | |
| LDC2000T51 | TREC Spanish | |
| LDC99T42 | Treebank-3 | |
| LDC94T4B-1 | UN Parallel Text (English) | |
| LDC94T4B-3 | UN Parallel Text (Spanish) |
GENOA
| LDC2004S05 | ISL Meeting Speech Part 1 | |
| LDC2004T10 | ISL Meeting Transcripts Part 1 |
HAVIC
| LDC2018V01 | HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation | |
| LDC2022V01 | HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation | |
| LDC2022V02 | HAVIC MED Novel 2 Test -- Videos, Metadata and Annotation | |
| LDC2019V01 | HAVIC MED Progress Test -- Videos, Metadata and Annotation | |
| LDC2021V01 | HAVIC MED Training Data -- Videos, Metadata and Annotation | |
| LDC2016V01 | HAVIC Pilot Transcription |
Hub4
| LDC98T31 | 1996 CSR HUB4 Language Model | |
| LDC97S66 | 1996 English Broadcast News Dev and Eval (HUB4) | |
| LDC97S44 | 1996 English Broadcast News Speech (HUB4) | |
| LDC97T22 | 1996 English Broadcast News Transcripts (HUB4) | |
| LDC98S71 | 1997 English Broadcast News Speech (HUB4) | |
| LDC98T28 | 1997 English Broadcast News Transcripts (HUB4) | |
| LDC2001S91 | 1997 HUB4 Broadcast News Evaluation Non-English Test Material | |
| LDC2002S11 | 1997 HUB4 English Evaluation Speech and Transcripts | |
| LDC98S73 | 1997 Mandarin Broadcast News Speech (HUB4-NE) | |
| LDC98T24 | 1997 Mandarin Broadcast News Transcripts (HUB4-NE) | |
| LDC98S74 | 1997 Spanish Broadcast News Speech (HUB4-NE) | |
| LDC98T29 | 1997 Spanish Broadcast News Transcripts (HUB4-NE) | |
| LDC2000S86 | 1998 HUB4 Broadcast News Evaluation English Test Material | |
| LDC2015S05 | Mandarin Chinese Phonetic Segmentation and Tone | |
| LDC95T21 | North American News Text Corpus | |
| LDC98T30 | North American News Text Supplement |
Hub5-LVCSR
| LDC2002S22 | 1997 HUB5 Arabic Evaluation | |
| LDC2002T39 | 1997 HUB5 Arabic Transcripts | |
| LDC2002S23 | 1997 HUB5 English Evaluation | |
| LDC2002S24 | 1997 HUB5 German Evaluation | |
| LDC2003T03 | 1997 HUB5 German Transcripts | |
| LDC2002S25 | 1997 HUB5 Spanish Evaluation | |
| LDC2003T04 | 1997 HUB5 Spanish Transcripts | |
| LDC2002S10 | 1998 HUB5 English Evaluation | |
| LDC2003T02 | 1998 HUB5 English Transcripts | |
| LDC2002S09 | 2000 HUB5 English Evaluation Speech | |
| LDC2002T43 | 2000 HUB5 English Evaluation Transcripts | |
| LDC2002S13 | 2001 HUB5 English Evaluation | |
| LDC2002S12 | 2001 HUB5 Mandarin Evaluation | |
| LDC2003T01 | 2001 HUB5 Mandarin Transcripts | |
| LDC97L20 | CALLHOME American English Lexicon (PRONLEX) | |
| LDC97S42 | CALLHOME American English Speech | |
| LDC97T14 | CALLHOME American English Transcripts | |
| LDC97S45 | CALLHOME Egyptian Arabic Speech | |
| LDC2002S37 | CALLHOME Egyptian Arabic Speech Supplement | |
| LDC97T19 | CALLHOME Egyptian Arabic Transcripts | |
| LDC2002T38 | CALLHOME Egyptian Arabic Transcripts Supplement | |
| LDC97L18 | CALLHOME German Lexicon | |
| LDC97S43 | CALLHOME German Speech | |
| LDC97T15 | CALLHOME German Transcripts | |
| LDC96L17 | CALLHOME Japanese Lexicon | |
| LDC96S37 | CALLHOME Japanese Speech | |
| LDC96T18 | CALLHOME Japanese Transcripts | |
| LDC96L15 | CALLHOME Mandarin Chinese Lexicon | |
| LDC96S34 | CALLHOME Mandarin Chinese Speech | |
| LDC96T16 | CALLHOME Mandarin Chinese Transcripts | |
| LDC96L16 | CALLHOME Spanish Lexicon | |
| LDC96S35 | CALLHOME Spanish Speech | |
| LDC96T17 | CALLHOME Spanish Transcripts | |
| LDC99L22 | Egyptian Colloquial Arabic Lexicon | |
| LDC2018S18 | HUB5 Mandarin Telephone Speech and Transcripts Second Edition | |
| LDC98S69 | HUB5 Mandarin Telephone Speech Corpus | |
| LDC98T26 | HUB5 Mandarin Transcripts | |
| LDC98S70 | HUB5 Spanish Telephone Speech Corpus | |
| LDC98T27 | HUB5 Spanish Transcripts | |
| LDC97S62 | Switchboard-1 Release 2 | |
| LDC2001T60 | Syllable-Final /s/ Lenition |
JANUS
| LDC2004S05 | ISL Meeting Speech Part 1 | |
| LDC2004T10 | ISL Meeting Transcripts Part 1 |
KAIROS
| LDC2025T11 | KAIROS Phase 1 Quizlet | |
| LDC2025T15 | KAIROS Phase 2 Quizlet | |
| LDC2025T07 | KAIROS Schema Learning Complex Event Annotation |
LID
| LDC96S46 | CALLFRIEND American English-Non-Southern Dialect | |
| LDC2019S21 | CALLFRIEND American English-Non-Southern Dialect Second Edition | |
| LDC96S47 | CALLFRIEND American English-Southern Dialect | |
| LDC2020S08 | CALLFRIEND American English-Southern Dialect Second Edition | |
| LDC96S48 | CALLFRIEND Canadian French | |
| LDC2019S18 | CALLFRIEND Canadian French Second Edition | |
| LDC96S49 | CALLFRIEND Egyptian Arabic | |
| LDC2019S04 | CALLFRIEND Egyptian Arabic Second Edition | |
| LDC96S50 | CALLFRIEND Farsi | |
| LDC2014S01 | CALLFRIEND Farsi Second Edition Speech | |
| LDC2014T01 | CALLFRIEND Farsi Second Edition Transcripts | |
| LDC96S51 | CALLFRIEND German | |
| LDC96S52 | CALLFRIEND Hindi | |
| LDC96S53 | CALLFRIEND Japanese | |
| LDC96S54 | CALLFRIEND Korean | |
| LDC96S55 | CALLFRIEND Mandarin Chinese-Mainland Dialect | |
| LDC2018S09 | CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition | |
| LDC96S56 | CALLFRIEND Mandarin Chinese-Taiwan Dialect | |
| LDC2020S06 | CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition | |
| LDC2023S08 | CALLFRIEND Russian Speech | |
| LDC2023T09 | CALLFRIEND Russian Text | |
| LDC96S57 | CALLFRIEND Spanish-Caribbean Dialect | |
| LDC96S58 | CALLFRIEND Spanish-Non-Caribbean Dialect | |
| LDC96S59 | CALLFRIEND Tamil | |
| LDC96S60 | CALLFRIEND Vietnamese |
Linguistic Atlas Project
| LDC2012S03 | Digital Archive of Southern Speech | |
| LDC2016S05 | Digital Archive of Southern Speech - NLP Version |
LORELEI
| LDC2020T02 | Abstract Meaning Representation (AMR) Annotation Release 3.0 | |
| LDC2024T11 | Abstract Meaning Representation 3.0 - Machine Translations | |
| LDC2023T10 | AIDA Scenario 1 and 2 Reference Knowledge Base | |
| LDC2023S01 | AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts | |
| LDC2024T03 | LoReHLT Hausa Representative Language Pack | |
| LDC2025T08 | LoReHLT Uzbek Representative Language Pack | |
| LDC2021T02 | LORELEI Akan Representative Language Pack | |
| LDC2018T04 | LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text | |
| LDC2022T05 | LORELEI Bengali Representative Language Pack | |
| LDC2020T10 | LORELEI Entity Detection and Linking Knowledge Base | |
| LDC2024T01 | LORELEI Farsi Representative Language Pack | |
| LDC2025T12 | LORELEI Hindi Representative Language Pack | |
| LDC2025T01 | LORELEI Hungarian Representative Language Pack | |
| LDC2023T07 | LORELEI Indonesian Representative Language Pack | |
| LDC2022T01 | LORELEI Kinyarwanda Incident Language Pack | |
| LDC2020T11 | LORELEI Oromo Incident Language Pack | |
| LDC2018T11 | LORELEI Somali Representative Language Pack - Monolingual and Parallel Text | |
| LDC2023T01 | LORELEI Swahili Representative Language Pack | |
| LDC2023T02 | LORELEI Tagalog Representative Language Pack | |
| LDC2023T03 | LORELEI Tamil Representative Language Pack | |
| LDC2023T08 | LORELEI Thai Representative Language Pack | |
| LDC2020T22 | LORELEI Tigrinya Incident Language Pack | |
| LDC2020T24 | LORELEI Ukrainian Representative Language Pack | |
| LDC2024T07 | LORELEI Uyghur Incident Language Pack | |
| LDC2020T17 | LORELEI Vietnamese Representative Language Pack | |
| LDC2022T03 | LORELEI Wolof Representative Language Pack | |
| LDC2024T10 | LORELEI Yoruba Representative Language Pack | |
| LDC2023T06 | LORELEI Zulu Representative Language Pack |
Machine Reading
| LDC2020T04 | Machine Reading Phase 1 IC Training Data | |
| LDC2019T14 | Machine Reading Phase 1 NFL Scoring Training Data |
MADCAT
| LDC2014T13 | MADCAT Chinese Pilot Training Set | |
| LDC2012T15 | MADCAT Phase 1 Training Set | |
| LDC2013T09 | MADCAT Phase 2 Training Set | |
| LDC2013T15 | MADCAT Phase 3 Training Set |
MALACH
| LDC2014S04 | USC-SFI MALACH Interviews and Transcripts Czech | |
| LDC2012S05 | USC-SFI MALACH Interviews and Transcripts English | |
| LDC2019S11 | USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition |
MIXER
| LDC2019S09 | First DIHARD Challenge Development - Eight Sources | |
| LDC2019S12 | First DIHARD Challenge Evaluation - Nine Sources | |
| LDC2023S02 | Mixer 3 Speech | |
| LDC2020S03 | Mixer 4 and 5 Speech | |
| LDC2013S03 | Mixer 6 Speech | |
| LDC2025S08 | Mixer 7 English Speech | |
| LDC2023S04 | Mixer 7 Spanish Speech | |
| LDC2023S09 | REMIX Telephone Collection | |
| LDC2022S06 | Second DIHARD Challenge Evaluation - Eleven Sources | |
| LDC2022S12 | Third DIHARD Challenge Development | |
| LDC2022S14 | Third DIHARD Challenge Evaluation |
MT08
| LDC2010T01 | NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations |
MUC
| LDC2003T13 | Message Understanding Conference (MUC) 6 | |
| LDC96T10 | Message Understanding Conference (MUC) 6 Additional News Text | |
| LDC2001T02 | Message Understanding Conference (MUC) 7 | |
| LDC2010T15 | Message Understanding Conference 7 Timed (MUC7_T) | |
| LDC95T21 | North American News Text Corpus | |
| LDC93T3A | TIPSTER Complete | |
| LDC93T3B | TIPSTER Volume 1 | |
| LDC93T3C | TIPSTER Volume 2 | |
| LDC93T3D | TIPSTER Volume 3 |
NIEUW
| LDC2022S09 | Xi'an Guanzhong Object Naming |
NIST Automatic Meeting Recognition
| LDC2004S09 | NIST Meeting Pilot Corpus Speech | |
| LDC2004T13 | NIST Meeting Pilot Corpus Transcripts and Metadata |
NIST LRE
| LDC2006S31 | 2003 NIST Language Recognition Evaluation | |
| LDC2008S05 | 2005 NIST Language Recognition Evaluation | |
| LDC2009S05 | 2007 NIST Language Recognition Evaluation Supplemental Training Set | |
| LDC2009S04 | 2007 NIST Language Recognition Evaluation Test Set | |
| LDC2014S06 | 2009 NIST Language Recognition Evaluation Test Set | |
| LDC2018S06 | 2011 NIST Language Recognition Evaluation Test Set | |
| LDC2025S02 | 2015 NIST Language Recognition Evaluation Test Set | |
| LDC2022S10 | 2017 NIST Language Recognition Evaluation Training and Development Sets | |
| LDC2023S01 | AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts | |
| LDC2023S02 | Mixer 3 Speech | |
| LDC2019S02 | Multi-Language Conversational Telephone Speech 2011 -- Arabic Group | |
| LDC2018S03 | Multi-Language Conversational Telephone Speech 2011 -- Central Asian | |
| LDC2018S08 | Multi-Language Conversational Telephone Speech 2011 -- Central European | |
| LDC2019S15 | Multi-Language Conversational Telephone Speech 2011 -- East Asian | |
| LDC2019S06 | Multi-Language Conversational Telephone Speech 2011 -- English Group | |
| LDC2020S05 | Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese | |
| LDC2016S11 | Multi-Language Conversational Telephone Speech 2011 -- Slavic Group | |
| LDC2017S14 | Multi-Language Conversational Telephone Speech 2011 -- South Asian | |
| LDC2018S12 | Multi-Language Conversational Telephone Speech 2011 -- Spanish | |
| LDC2017S09 | Multi-Language Conversational Telephone Speech 2011 -- Turkish |
NIST MT
| LDC2009T05 | 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data | |
| LDC2014T09 | HyTER Networks of Selected OpenMT08/09 Sentences | |
| LDC2010T10 | NIST 2002 Open Machine Translation (OpenMT) Evaluation | |
| LDC2010T11 | NIST 2003 Open Machine Translation (OpenMT) Evaluation | |
| LDC2010T12 | NIST 2004 Open Machine Translation (OpenMT) Evaluation | |
| LDC2010T14 | NIST 2005 Open Machine Translation (OpenMT) Evaluation | |
| LDC2010T17 | NIST 2006 Open Machine Translation (OpenMT) Evaluation | |
| LDC2010T21 | NIST 2008 Open Machine Translation (OpenMT) Evaluation | |
| LDC2013T07 | NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets | |
| LDC2010T23 | NIST 2009 Open Machine Translation (OpenMT) Evaluation | |
| LDC2013T03 | NIST 2012 Open Machine Translation (OpenMT) Evaluation | |
| LDC2014T02 | NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source | |
| LDC2013T18 | Semantic Textual Similarity (STS) 2013 Machine Translation |
NIST OpenSAT
| LDC2022S01 | 2017 NIST OpenSAT Pilot - SSSF | |
| LDC2023S06 | 2019 OpenSAT Public Safety Communications Simulation |
NIST Public Safety
| LDC2023S06 | 2019 OpenSAT Public Safety Communications Simulation |
NIST SRE
| LDC96S61 | 1996 Speaker Recognition Benchmark | |
| LDC99S80 | 1997 Speaker Recognition Benchmark | |
| LDC98S76 | 1998 Speaker Recognition Benchmark | |
| LDC99S81 | 1999 Speaker Recognition Benchmark | |
| LDC2001S97 | 2000 NIST Speaker Recognition Evaluation | |
| LDC2002S34 | 2001 NIST Speaker Recognition Evaluation Corpus | |
| LDC2004S04 | 2002 NIST Speaker Recognition Evaluation | |
| LDC2010S03 | 2003 NIST Speaker Recognition Evaluation | |
| LDC2006S44 | 2004 NIST Speaker Recognition Evaluation | |
| LDC2011S04 | 2005 NIST Speaker Recognition Evaluation Test Data | |
| LDC2011S01 | 2005 NIST Speaker Recognition Evaluation Training Data | |
| LDC2011S10 | 2006 NIST Speaker Recognition Evaluation Test Set Part 1 | |
| LDC2012S01 | 2006 NIST Speaker Recognition Evaluation Test Set Part 2 | |
| LDC2011S09 | 2006 NIST Speaker Recognition Evaluation Training Set | |
| LDC2011S11 | 2008 NIST Speaker Recognition Evaluation Supplemental Set | |
| LDC2011S08 | 2008 NIST Speaker Recognition Evaluation Test Set | |
| LDC2011S05 | 2008 NIST Speaker Recognition Evaluation Training Set Part 1 | |
| LDC2011S07 | 2008 NIST Speaker Recognition Evaluation Training Set Part 2 | |
| LDC2017S06 | 2010 NIST Speaker Recognition Evaluation Test Set | |
| LDC2019S20 | 2016 NIST Speaker Recognition Evaluation Test Set | |
| LDC2020S04 | 2018 NIST Speaker Recognition Evaluation Test Set | |
| LDC2023V01 | 2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual | |
| LDC2023S03 | 2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge | |
| LDC2024S05 | Call My Net 1 | |
| LDC2019S09 | First DIHARD Challenge Development - Eight Sources | |
| LDC2019S12 | First DIHARD Challenge Evaluation - Nine Sources | |
| LDC2013S05 | Greybeard | |
| LDC2025S05 | IWSLT 2022-2023 Shared Task Training, Development and Test Set | |
| LDC2024S01 | KASET - Kurmanji and Sorani Kurdish Speech and Transcripts | |
| LDC2023S02 | Mixer 3 Speech | |
| LDC2020S03 | Mixer 4 and 5 Speech | |
| LDC2013S03 | Mixer 6 Speech | |
| LDC2025S08 | Mixer 7 English Speech | |
| LDC2023S04 | Mixer 7 Spanish Speech | |
| LDC2009T26 | NXT Switchboard Annotations | |
| LDC2023S09 | REMIX Telephone Collection | |
| LDC2022S06 | Second DIHARD Challenge Evaluation - Eleven Sources | |
| LDC2001S13 | Switchboard Cellular Part 1 Audio | |
| LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio | |
| LDC2001T14 | Switchboard Cellular Part 1 Transcription | |
| LDC2004S07 | Switchboard Cellular Part 2 Audio | |
| LDC93S8 | Switchboard Credit Card | |
| LDC97S62 | Switchboard-1 Release 2 | |
| LDC98S75 | Switchboard-2 Phase I | |
| LDC99S79 | Switchboard-2 Phase II | |
| LDC2002S06 | Switchboard-2 Phase III Audio |
OpenHaRT
| LDC2012T15 | MADCAT Phase 1 Training Set | |
| LDC2013T09 | MADCAT Phase 2 Training Set | |
| LDC2013T15 | MADCAT Phase 3 Training Set |
PEA-TRAD
| LDC2018T13 | TRAD Arabic-French Parallel Text -- Newsgroup | |
| LDC2018T21 | TRAD Arabic-French Parallel Text -- Newswire | |
| LDC2018T02 | TRAD Chinese-French Parallel Text -- Blog | |
| LDC2018T17 | TRAD Chinese-French Parallel Text -- Broadcast News |
RATS
| LDC2017S20 | RATS Keyword Spotting | |
| LDC2018S10 | RATS Language Identification | |
| LDC2024S03 | RATS Low Speech Density | |
| LDC2021S08 | RATS Speaker Identification | |
| LDC2015S02 | RATS Speech Activity Detection |
REFLEX-MTE
| LDC2009T11 | REFLEX Entity Translation Training/DevTest |
RM
| LDC96S39 | RM Isolated and Spelled Word Data |
ROAR
| LDC2019S09 | First DIHARD Challenge Development - Eight Sources | |
| LDC2019S12 | First DIHARD Challenge Evaluation - Nine Sources | |
| LDC2004S05 | ISL Meeting Speech Part 1 | |
| LDC2004T10 | ISL Meeting Transcripts Part 1 | |
| LDC2022S06 | Second DIHARD Challenge Evaluation - Eleven Sources | |
| LDC2022S14 | Third DIHARD Challenge Evaluation |
RT
| LDC2007S12 | 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data | |
| LDC2007S11 | 2004 Spring NIST Rich Transcription (RT-04S) Development Data | |
| LDC2011S06 | 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set | |
| LDC2019S09 | First DIHARD Challenge Development - Eight Sources | |
| LDC2019S12 | First DIHARD Challenge Evaluation - Nine Sources | |
| LDC2022S06 | Second DIHARD Challenge Evaluation - Eleven Sources | |
| LDC2022S12 | Third DIHARD Challenge Development |
SemEval
| LDC2016T10 | SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing | |
| LDC2011T01 | SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages |
SID
| LDC2001S13 | Switchboard Cellular Part 1 Audio | |
| LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio | |
| LDC2001T14 | Switchboard Cellular Part 1 Transcription | |
| LDC2004S07 | Switchboard Cellular Part 2 Audio | |
| LDC98S75 | Switchboard-2 Phase I | |
| LDC99S79 | Switchboard-2 Phase II | |
| LDC2002S06 | Switchboard-2 Phase III Audio |
SPINE
| LDC2000S96 | Speech in Noisy Environments (SPINE) Evaluation Audio | |
| LDC2000T54 | Speech in Noisy Environments (SPINE) Evaluation Transcripts | |
| LDC2000S87 | Speech in Noisy Environments (SPINE) Training Audio | |
| LDC2000T49 | Speech in Noisy Environments (SPINE) Training Transcripts | |
| LDC2001S04 | Speech in Noisy Environments (SPINE2) Part 1 Audio | |
| LDC2001T05 | Speech in Noisy Environments (SPINE2) Part 1 Transcripts | |
| LDC2001S06 | Speech in Noisy Environments (SPINE2) Part 2 Audio | |
| LDC2001T07 | Speech in Noisy Environments (SPINE2) Part 2 Transcripts | |
| LDC2001S08 | Speech in Noisy Environments (SPINE2) Part 3 Audio | |
| LDC2001T09 | Speech in Noisy Environments (SPINE2) Part 3 Transcripts | |
| LDC2001S99 | Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio |
TAC
| LDC2024T09 | MultiTACRED | |
| LDC2023T13 | TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 | |
| LDC2017T17 | TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014 | |
| LDC2019T08 | TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 | |
| LDC2019T17 | TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 | |
| LDC2018T03 | TAC KBP Comprehensive English Source Corpora 2009-2014 | |
| LDC2018T16 | TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 | |
| LDC2020T03 | TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 | |
| LDC2020T13 | TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015 | |
| LDC2018T22 | TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 | |
| LDC2021T08 | TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014 | |
| LDC2021T06 | TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010 | |
| LDC2020T08 | TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013 | |
| LDC2019T19 | TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017 | |
| LDC2019T02 | TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 | |
| LDC2019T12 | TAC KBP Evaluation Source Corpora 2016-2017 | |
| LDC2020T18 | TAC KBP Event Argument - Comprehensive Training and Evaluation Data 2016-2017 | |
| LDC2014T16 | TAC KBP Reference Knowledge Base | |
| LDC2016T26 | TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014 | |
| LDC2018T24 | TAC Relation Extraction Dataset |
Talkbank
| LDC2005T35 | American National Corpus (ANC) Second Release | |
| LDC2004V01 | FORM1 Kinematic Gesture | |
| LDC2003V01 | FORM2 Kinematic Gesture | |
| LDC2003L01 | Grassfields Bantu Fieldwork: Dschang Lexicon | |
| LDC2003S02 | Grassfields Bantu Fieldwork: Dschang Tone Paradigms | |
| LDC2001S16 | Grassfields Bantu Fieldwork: Ngomba Tone Paradigms | |
| LDC2004L01 | Klex: Finite-State Lexical Transducer for Korean | |
| LDC2004T03 | Morphologically Annotated Korean Text | |
| LDC2003S06 | Santa Barbara Corpus of Spoken American English Part II | |
| LDC2004S10 | Santa Barbara Corpus of Spoken American English Part III | |
| LDC2005S25 | Santa Barbara Corpus of Spoken American English Part IV | |
| LDC2003T15 | SLX Corpus of Classic Sociolinguistic Interviews | |
| LDC2004S12 | TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls |
TDT
| LDC2010T18 | ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 | |
| LDC98T25 | TDT Pilot Study Corpus | |
| LDC2000S92 | TDT2 Careful Transcription Audio | |
| LDC2000T44 | TDT2 Careful Transcription Text | |
| LDC99S84 | TDT2 English Audio | |
| LDC2001S93 | TDT2 Mandarin Audio Corpus | |
| LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | |
| LDC2001S94 | TDT3 English Audio | |
| LDC2001S95 | TDT3 Mandarin Audio | |
| LDC2001T58 | TDT3 Multilanguage Text Version 2.0 | |
| LDC2005S11 | TDT4 Multilingual Broadcast News Speech Corpus | |
| LDC2005T16 | TDT4 Multilingual Text and Annotations | |
| LDC2007V02 | TRECVID 2003 Keyframes & Transcripts | |
| LDC2007V01 | TRECVID 2005 Keyframes & Transcripts |
TERN
| LDC2010T18 | ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 |
TIDES
| LDC2005T09 | ACE 2004 Multilingual Training Corpus | |
| LDC2010T18 | ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 | |
| LDC2005T07 | ACE Time Normalization (TERN) 2004 English Training Data v 1.0 | |
| LDC2003T11 | ACE-2 Version 1.0 | |
| LDC93T1 | ACL/DCI | |
| LDC2004T18 | Arabic English Parallel News Part 1 | |
| LDC2003T12 | Arabic Gigaword | |
| LDC2004T17 | Arabic News Translation Text Part 1 | |
| LDC2001T55 | Arabic Newswire Part 1 | |
| LDC2003T07 | Arabic Treebank: Part 1 - 10K-word English Translation | |
| LDC2003T06 | Arabic Treebank: Part 1 v 2.0 | |
| LDC2005T02 | Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis) | |
| LDC2004T02 | Arabic Treebank: Part 2 v 2.0 | |
| LDC2005T20 | Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) | |
| LDC2004T11 | Arabic Treebank: Part 3 v 1.0 | |
| LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | |
| LDC2000T43 | BLLIP 1987-89 WSJ Corpus Release 1 | |
| LDC2002L49 | Buckwalter Arabic Morphological Analyzer Version 1.0 | |
| LDC2004L02 | Buckwalter Arabic Morphological Analyzer Version 2.0 | |
| LDC2005T13 | CCGbank | |
| LDC96L14 | CELEX2 | |
| LDC2005T10 | Chinese English News Magazine Parallel Text | |
| LDC2003T09 | Chinese Gigaword | |
| LDC2005T14 | Chinese Gigaword Second Edition | |
| LDC2005T06 | Chinese News Translation Text Part 1 | |
| LDC2005T23 | Chinese Proposition Bank 1.0 | |
| LDC2001T11 | Chinese Treebank 2.0 | |
| LDC2004T05 | Chinese Treebank 4.0 | |
| LDC2005T01 | Chinese Treebank 5.0 | |
| LDC2007T36 | Chinese Treebank 6.0 | |
| LDC2010T07 | Chinese Treebank 7.0 | |
| LDC2013T21 | Chinese Treebank 8.0 | |
| LDC2002L27 | Chinese-English Translation Lexicon Version 3.0 | |
| LDC2007T02 | English Chinese Translation Treebank v 1.0 | |
| LDC2003T05 | English Gigaword | |
| LDC2005T12 | English Gigaword Second Edition | |
| LDC95T11 | European Language Newspaper Text | |
| LDC2000T50 | Hong Kong Hansards Parallel Text | |
| LDC2000T47 | Hong Kong Laws Parallel Text | |
| LDC2000T46 | Hong Kong News Parallel Text | |
| LDC2004T08 | Hong Kong Parallel Text | |
| LDC95T8 | Japanese Business News Text | |
| LDC99T34 | Japanese Business News Text Supplement | |
| LDC2000T45 | Korean Newswire | |
| LDC95T13 | Mandarin Chinese News Text | |
| LDC2001T02 | Message Understanding Conference (MUC) 7 | |
| LDC2003T18 | Multiple-Translation Arabic (MTA) Part 1 | |
| LDC2005T05 | Multiple-Translation Arabic (MTA) Part 2 | |
| LDC2003T17 | Multiple-Translation Chinese (MTC) Part 2 | |
| LDC2004T07 | Multiple-Translation Chinese (MTC) Part 3 | |
| LDC2006T04 | Multiple-Translation Chinese (MTC) Part 4 | |
| LDC2002T01 | Multiple-Translation Chinese Corpus | |
| LDC95T21 | North American News Text Corpus | |
| LDC98T30 | North American News Text Supplement | |
| LDC2004T23 | Prague Arabic Dependency Treebank 1.0 | |
| LDC2004T14 | Proposition Bank I | |
| LDC2006T12 | Spanish Gigaword First Edition | |
| LDC2009T21 | Spanish Gigaword Second Edition | |
| LDC95T9 | Spanish News Text | |
| LDC99T41 | Spanish Newswire Text, Volume 2 | |
| LDC98T25 | TDT Pilot Study Corpus | |
| LDC2000S92 | TDT2 Careful Transcription Audio | |
| LDC2000T44 | TDT2 Careful Transcription Text | |
| LDC99S84 | TDT2 English Audio | |
| LDC2001S93 | TDT2 Mandarin Audio Corpus | |
| LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | |
| LDC2001S94 | TDT3 English Audio | |
| LDC2001S95 | TDT3 Mandarin Audio | |
| LDC2001T58 | TDT3 Multilanguage Text Version 2.0 | |
| LDC2005S11 | TDT4 Multilingual Broadcast News Speech Corpus | |
| LDC2005T16 | TDT4 Multilingual Text and Annotations | |
| LDC2004T09 | TIDES Extraction (ACE) 2003 Multilingual Training Data | |
| LDC93T3A | TIPSTER Complete | |
| LDC2000T52 | TREC Mandarin | |
| LDC2000T51 | TREC Spanish | |
| LDC99T42 | Treebank-3 | |
| LDC94T4B-1 | UN Parallel Text (English) | |
| LDC94T4B-3 | UN Parallel Text (Spanish) |
Tipster
| LDC95T13 | Mandarin Chinese News Text | |
| LDC95T9 | Spanish News Text | |
| LDC93T3A | TIPSTER Complete | |
| LDC93T3B | TIPSTER Volume 1 | |
| LDC93T3C | TIPSTER Volume 2 | |
| LDC93T3D | TIPSTER Volume 3 |
TRAD
| LDC2018T13 | TRAD Arabic-French Parallel Text -- Newsgroup | |
| LDC2018T21 | TRAD Arabic-French Parallel Text -- Newswire | |
| LDC2018T02 | TRAD Chinese-French Parallel Text -- Blog | |
| LDC2018T17 | TRAD Chinese-French Parallel Text -- Broadcast News |
TREC
| LDC2001T55 | Arabic Newswire Part 1 | |
| LDC95T13 | Mandarin Chinese News Text | |
| LDC95T9 | Spanish News Text | |
| LDC93T3A | TIPSTER Complete | |
| LDC93T3B | TIPSTER Volume 1 | |
| LDC93T3C | TIPSTER Volume 2 | |
| LDC93T3D | TIPSTER Volume 3 | |
| LDC2000T52 | TREC Mandarin | |
| LDC2000T51 | TREC Spanish | |
| LDC2007V02 | TRECVID 2003 Keyframes & Transcripts | |
| LDC2010V01 | TRECVID 2004 Keyframes & Transcripts | |
| LDC2007V01 | TRECVID 2005 Keyframes & Transcripts | |
| LDC2010V02 | TRECVID 2006 Keyframes |
VACE
| LDC2012V01 | 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News | |
| LDC2011V05 | 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1 | |
| LDC2011V06 | 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2 | |
| LDC2011V03 | NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1 | |
| LDC2011V04 | NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2 | |
| LDC2011V01 | NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 1 | |
| LDC2011V02 | NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2 |
VAST
| LDC2023V01 | 2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual | |
| LDC2019S09 | First DIHARD Challenge Development - Eight Sources | |
| LDC2019S12 | First DIHARD Challenge Evaluation - Nine Sources | |
| LDC2022S06 | Second DIHARD Challenge Evaluation - Eleven Sources | |
| LDC2022S12 | Third DIHARD Challenge Development | |
| LDC2022S14 | Third DIHARD Challenge Evaluation | |
| LDC2019S05 | VAST Chinese Speech and Transcripts |