LDC Corpora ⇔ Projects
Many of the corpora in the Catalog were developed for, or used in, sponsored research projects. Some of those resources served as training and test data for benchmark tests of language-based systems developed during those projects. A corpus is associated with a given project because it was developed for the project, was used in the project, or was otherwise considered relevant to the project's work.
ACE
LDC2017T10 | Abstract Meaning Representation (AMR) Annotation Release 2.0 | |
LDC2020T02 | Abstract Meaning Representation (AMR) Annotation Release 3.0 | |
LDC2005T09 | ACE 2004 Multilingual Training Corpus | |
LDC2008T03 | ACE 2005 English SpatialML Annotations | |
LDC2011T02 | ACE 2005 English SpatialML Annotations Version 2 | |
LDC2010T09 | ACE 2005 Mandarin SpatialML Annotations | |
LDC2006T06 | ACE 2005 Multilingual Training Corpus | |
LDC2014T18 | ACE 2007 Multilingual Training Corpus | |
LDC2015T20 | ACE 2007 Spanish DevTest - Pilot Evaluation | |
LDC2010T18 | ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 | |
LDC2005T07 | ACE Time Normalization (TERN) 2004 English Training Data v 1.0 | |
LDC2003T11 | ACE-2 Version 1.0 | |
LDC2024T05 | Automatic Content Extraction for Portuguese | |
LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | |
LDC2019T07 | Chinese Abstract Meaning Representation 1.0 | |
LDC2011T08 | Datasets for Generic Relation Extraction (reACE) | |
LDC2004T14 | Proposition Bank I | |
LDC2009T11 | REFLEX Entity Translation Training/DevTest | |
LDC2004T09 | TIDES Extraction (ACE) 2003 Multilingual Training Data |
AIDA
LDC2023T10 | AIDA Scenario 1 and 2 Reference Knowledge Base | |
LDC2024T02 | AIDA Scenario 1 Practice Topic Annotation | |
LDC2023T11 | AIDA Scenario 1 Practice Topic Source Data | |
LDC2024T06 | AIDA Scenario 2 Practice Topic Annotation | |
LDC2024T04 | AIDA Scenario 2 Practice Topic Source Data | |
LDC2023S01 | AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts |
American National Corpus (ANC)
LDC2005T35 | American National Corpus (ANC) Second Release | |
LDC2010T22 | Manually Annotated Sub-Corpus First Release | |
LDC2013T12 | Manually Annotated Sub-Corpus Third Release |
AQUAINT
LDC2008T25 | AQUAINT-2 Information-Retrieval Text Research Collection | |
LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus |
ATIS
LDC2021T04 | ATIS - Seven Languages | |
LDC93S4A | ATIS0 Complete | |
LDC93S4B | ATIS0 Pilot | |
LDC93S4B-2 | ATIS0 Read | |
LDC93S4B-3 | ATIS0 SD Read | |
LDC93S5 | ATIS2 | |
LDC95S26 | ATIS3 Test Data | |
LDC94S19 | ATIS3 Training Data | |
LDC2019T04 | Multilingual ATIS |
BOLT
LDC2014T12 | Abstract Meaning Representation (AMR) Annotation Release 1.0 | |
LDC2017T10 | Abstract Meaning Representation (AMR) Annotation Release 2.0 | |
LDC2020T02 | Abstract Meaning Representation (AMR) Annotation Release 3.0 | |
LDC2020T07 | Abstract Meaning Representation 2.0 - Four Translations | |
LDC2019T01 | BOLT Arabic Discussion Forum Parallel Training Data | |
LDC2018T10 | BOLT Arabic Discussion Forums | |
LDC2021T07 | BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
LDC2017T05 | BOLT Chinese Discussion Forum Parallel Training Data | |
LDC2016T05 | BOLT Chinese Discussion Forums | |
LDC2018T15 | BOLT Chinese SMS/Chat | |
LDC2021T11 | BOLT Chinese SMS/Chat Parallel Training Data | |
LDC2016T19 | BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training | |
LDC2020T15 | BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training | |
LDC2019T13 | BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training | |
LDC2021T14 | BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
LDC2021T18 | BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
LDC2017T07 | BOLT Egyptian Arabic SMS/Chat and Transliteration | |
LDC2021T15 | BOLT Egyptian Arabic SMS/Chat Parallel Training Data | |
LDC2021T12 | BOLT Egyptian Arabic Treebank - Conversational Telephone Speech | |
LDC2018T23 | BOLT Egyptian Arabic Treebank - Discussion Forum | |
LDC2021T17 | BOLT Egyptian Arabic Treebank - SMS/Chat | |
LDC2020T05 | BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training | |
LDC2019T18 | BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training | |
LDC2019T06 | BOLT Egyptian-English Word Alignment -- Discussion Forum Training | |
LDC2020T20 | BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
LDC2017T11 | BOLT English Discussion Forums | |
LDC2020T21 | BOLT English PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech | |
LDC2018T19 | BOLT English SMS/Chat | |
LDC2020T09 | BOLT English Translation Treebank - Chinese Discussion Forum | |
LDC2021T19 | BOLT English Translation Treebank - Chinese SMS/Chat | |
LDC2022T06 | BOLT English Translation Treebank - Egyptian Arabic SMS/Chat | |
LDC2019T15 | BOLT English Treebank - Discussion Forum | |
LDC2021T03 | BOLT English Treebank - SMS/Chat | |
LDC2018T18 | BOLT Information Retrieval Comprehensive Training and Evaluation | |
LDC2013T21 | Chinese Treebank 8.0 | |
LDC2016T13 | Chinese Treebank 9.0 | |
LDC2024T03 | LoReHLT Hausa Representative Language Pack |
CAMIO
LDC2022T07 | CAMIO Transcription Languages |
CHiME
LDC2017S07 | CHiME2 Grid | |
LDC2017S10 | CHiME2 WSJ0 | |
LDC2017S24 | CHiME3 |
Communicator
LDC2004T15 | 2000 Communicator Dialogue Act Tagged | |
LDC2002S56 | 2000 Communicator Evaluation | |
LDC2004T16 | 2001 Communicator Dialogue Act Tagged | |
LDC2003S01 | 2001 Communicator Evaluation |
CoNLL
LDC2015T12 | 2006 CoNLL Shared Task - Arabic & Czech | |
LDC2015T11 | 2006 CoNLL Shared Task - Ten Languages | |
LDC2018T08 | 2007 CoNLL Shared Task - Arabic & English | |
LDC2018T06 | 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish | |
LDC2018T07 | 2007 CoNLL Shared Task - Greek, Hungarian & Italian | |
LDC2012T03 | 2009 CoNLL Shared Task Part 1 | |
LDC2012T04 | 2009 CoNLL Shared Task Part 2 | |
LDC2017T13 | 2015-2016 CoNLL Shared Task |
DARPA-CSR
LDC2005S08 | BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts | |
LDC93S6A | CSR-I (WSJ0) Complete | |
LDC93S6C | CSR-I (WSJ0) Other | |
LDC93S6B | CSR-I (WSJ0) Sennheiser | |
LDC94S13A | CSR-II (WSJ1) Complete | |
LDC94S13C | CSR-II (WSJ1) Other | |
LDC94S13B | CSR-II (WSJ1) Sennheiser | |
LDC95S23 | CSR-III Speech | |
LDC95T6 | CSR-III Text | |
LDC96S33 | CSR-IV HUB3 | |
LDC96S31 | CSR-IV HUB4 |
DASL
LDC2003T15 | SLX Corpus of Classic Sociolinguistic Interviews |
DEFT
LDC2014T12 | Abstract Meaning Representation (AMR) Annotation Release 1.0 | |
LDC2017T10 | Abstract Meaning Representation (AMR) Annotation Release 2.0 | |
LDC2020T02 | Abstract Meaning Representation (AMR) Annotation Release 3.0 | |
LDC2020T07 | Abstract Meaning Representation 2.0 - Four Translations | |
LDC2020L02 | Chinese Lexical Resources for Gender, Number, Animacy | |
LDC2019T03 | DEFT Chinese Committed Belief Annotation | |
LDC2020T19 | DEFT Chinese Light and Rich ERE Annotation | |
LDC2019T16 | DEFT English Committed Belief Annotation | |
LDC2023T04 | DEFT English Light and Rich ERE Annotation | |
LDC2016T07 | DEFT Narrative Text | |
LDC2019T09 | DEFT Spanish Committed Belief Annotation | |
LDC2018T01 | DEFT Spanish Treebank | |
LDC2016T23 | Richer Event Description | |
LDC2023T13 | TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 | |
LDC2017T09 | The EventStatus Corpus |
DIRHA
LDC2018S01 | DIRHA English WSJ Audio |
DOE/IRS2008-0256
LDC2023L01 | Moroccan Arabic - English Lexical Database |
EARS
LDC97S66 | 1996 English Broadcast News Dev and Eval (HUB4) | |
LDC97S44 | 1996 English Broadcast News Speech (HUB4) | |
LDC97T22 | 1996 English Broadcast News Transcripts (HUB4) | |
LDC98S71 | 1997 English Broadcast News Speech (HUB4) | |
LDC98T28 | 1997 English Broadcast News Transcripts (HUB4) | |
LDC2001S91 | 1997 HUB4 Broadcast News Evaluation Non-English Test Material | |
LDC2002S11 | 1997 HUB4 English Evaluation Speech and Transcripts | |
LDC2002S22 | 1997 HUB5 Arabic Evaluation | |
LDC2002T39 | 1997 HUB5 Arabic Transcripts | |
LDC2002S24 | 1997 HUB5 German Evaluation | |
LDC2003T03 | 1997 HUB5 German Transcripts | |
LDC2002S25 | 1997 HUB5 Spanish Evaluation | |
LDC2003T04 | 1997 HUB5 Spanish Transcripts | |
LDC98S73 | 1997 Mandarin Broadcast News Speech (HUB4-NE) | |
LDC98T24 | 1997 Mandarin Broadcast News Transcripts (HUB4-NE) | |
LDC2002S10 | 1998 HUB5 English Evaluation | |
LDC2003T02 | 1998 HUB5 English Transcripts | |
LDC2002S13 | 2001 HUB5 English Evaluation | |
LDC2002S12 | 2001 HUB5 Mandarin Evaluation | |
LDC2003T01 | 2001 HUB5 Mandarin Transcripts | |
LDC2004S11 | 2002 Rich Transcription Broadcast News and Conversational Telephone Speech | |
LDC99L23 | American English Spoken Lexicon | |
LDC2005S07 | Arabic CTS Levantine Fisher Training Data Set 3, Speech | |
LDC2005T03 | Arabic CTS Levantine Fisher Training Data Set 3, Transcripts | |
LDC2003T12 | Arabic Gigaword | |
LDC2001T55 | Arabic Newswire Part 1 | |
LDC2005S08 | BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts | |
LDC96S46 | CALLFRIEND American English-Non-Southern Dialect | |
LDC2019S21 | CALLFRIEND American English-Non-Southern Dialect Second Edition | |
LDC96S47 | CALLFRIEND American English-Southern Dialect | |
LDC2020S08 | CALLFRIEND American English-Southern Dialect Second Edition | |
LDC2019S18 | CALLFRIEND Canadian French Second Edition | |
LDC96S49 | CALLFRIEND Egyptian Arabic | |
LDC2019S04 | CALLFRIEND Egyptian Arabic Second Edition | |
LDC96S55 | CALLFRIEND Mandarin Chinese-Mainland Dialect | |
LDC2018S09 | CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition | |
LDC96S56 | CALLFRIEND Mandarin Chinese-Taiwan Dialect | |
LDC2020S06 | CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition | |
LDC97L20 | CALLHOME American English Lexicon (PRONLEX) | |
LDC97S42 | CALLHOME American English Speech | |
LDC97T14 | CALLHOME American English Transcripts | |
LDC97S45 | CALLHOME Egyptian Arabic Speech | |
LDC2002S37 | CALLHOME Egyptian Arabic Speech Supplement | |
LDC97T19 | CALLHOME Egyptian Arabic Transcripts | |
LDC2002T38 | CALLHOME Egyptian Arabic Transcripts Supplement | |
LDC96L15 | CALLHOME Mandarin Chinese Lexicon | |
LDC96S34 | CALLHOME Mandarin Chinese Speech | |
LDC96T16 | CALLHOME Mandarin Chinese Transcripts | |
LDC2003T09 | Chinese Gigaword | |
LDC2005T14 | Chinese Gigaword Second Edition | |
LDC2005T08 | Discourse Graphbank | |
LDC99L22 | Egyptian Colloquial Arabic Lexicon | |
LDC2003T05 | English Gigaword | |
LDC2005T12 | English Gigaword Second Edition | |
LDC2005S13 | Fisher English Training Part 2, Speech | |
LDC2005T19 | Fisher English Training Part 2, Transcripts | |
LDC2004S13 | Fisher English Training Speech Part 1 Speech | |
LDC2004T19 | Fisher English Training Speech Part 1 Transcripts | |
LDC2005S15 | HKUST Mandarin Telephone Speech, Part 1 | |
LDC2005T32 | HKUST Mandarin Telephone Transcript Data, Part 1 | |
LDC2018S18 | HUB5 Mandarin Telephone Speech and Transcripts Second Edition | |
LDC98S69 | HUB5 Mandarin Telephone Speech Corpus | |
LDC98T26 | HUB5 Mandarin Transcripts | |
LDC2005S14 | Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) | |
LDC2006S29 | Levantine Arabic QT Training Data Set 5, Speech | |
LDC2006T07 | Levantine Arabic QT Training Data Set 5, Transcripts | |
LDC95T13 | Mandarin Chinese News Text | |
LDC95T21 | North American News Text Corpus | |
LDC98T30 | North American News Text Supplement | |
LDC2004S08 | RT-03 MDE Training Data Speech | |
LDC2004T12 | RT-03 MDE Training Data Text and Annotations | |
LDC2005S16 | RT-04 MDE Training Data Speech | |
LDC2005T24 | RT-04 MDE Training Data Text/Annotations | |
LDC2004S10 | Santa Barbara Corpus of Spoken American English Part III | |
LDC2005S25 | Santa Barbara Corpus of Spoken American English Part IV | |
LDC2006T12 | Spanish Gigaword First Edition | |
LDC2009T21 | Spanish Gigaword Second Edition | |
LDC2001S13 | Switchboard Cellular Part 1 Audio | |
LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio | |
LDC2001T14 | Switchboard Cellular Part 1 Transcription | |
LDC2004S07 | Switchboard Cellular Part 2 Audio | |
LDC97S62 | Switchboard-1 Release 2 | |
LDC98S75 | Switchboard-2 Phase I | |
LDC99S79 | Switchboard-2 Phase II | |
LDC2002S06 | Switchboard-2 Phase III Audio | |
LDC98S72 | Taiwanese Putonghua Speech and Transcripts | |
LDC98T25 | TDT Pilot Study Corpus | |
LDC2000S92 | TDT2 Careful Transcription Audio | |
LDC2000T44 | TDT2 Careful Transcription Text | |
LDC99S84 | TDT2 English Audio | |
LDC2001S93 | TDT2 Mandarin Audio Corpus | |
LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | |
LDC2001S94 | TDT3 English Audio | |
LDC2001S95 | TDT3 Mandarin Audio | |
LDC2001T58 | TDT3 Multilanguage Text Version 2.0 |
GALE
LDC97S66 | 1996 English Broadcast News Dev and Eval (HUB4) | |
LDC97S44 | 1996 English Broadcast News Speech (HUB4) | |
LDC97T22 | 1996 English Broadcast News Transcripts (HUB4) | |
LDC98S71 | 1997 English Broadcast News Speech (HUB4) | |
LDC98T28 | 1997 English Broadcast News Transcripts (HUB4) | |
LDC2001S91 | 1997 HUB4 Broadcast News Evaluation Non-English Test Material | |
LDC2002S11 | 1997 HUB4 English Evaluation Speech and Transcripts | |
LDC2002S22 | 1997 HUB5 Arabic Evaluation | |
LDC2002T39 | 1997 HUB5 Arabic Transcripts | |
LDC2002S24 | 1997 HUB5 German Evaluation | |
LDC2003T03 | 1997 HUB5 German Transcripts | |
LDC2002S25 | 1997 HUB5 Spanish Evaluation | |
LDC2003T04 | 1997 HUB5 Spanish Transcripts | |
LDC98S73 | 1997 Mandarin Broadcast News Speech (HUB4-NE) | |
LDC98T24 | 1997 Mandarin Broadcast News Transcripts (HUB4-NE) | |
LDC2002S10 | 1998 HUB5 English Evaluation | |
LDC2003T02 | 1998 HUB5 English Transcripts | |
LDC2002S13 | 2001 HUB5 English Evaluation | |
LDC2002S12 | 2001 HUB5 Mandarin Evaluation | |
LDC2003T01 | 2001 HUB5 Mandarin Transcripts | |
LDC2004S11 | 2002 Rich Transcription Broadcast News and Conversational Telephone Speech | |
LDC2009T05 | 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data | |
LDC2011T05 | 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set | |
LDC2017T10 | Abstract Meaning Representation (AMR) Annotation Release 2.0 | |
LDC2020T02 | Abstract Meaning Representation (AMR) Annotation Release 3.0 | |
LDC2005T09 | ACE 2004 Multilingual Training Corpus | |
LDC2005T07 | ACE Time Normalization (TERN) 2004 English Training Data v 1.0 | |
LDC2003T11 | ACE-2 Version 1.0 | |
LDC93T1 | ACL/DCI | |
LDC99L23 | American English Spoken Lexicon | |
LDC2012T21 | Annotated English Gigaword | |
LDC2005S07 | Arabic CTS Levantine Fisher Training Data Set 3, Speech | |
LDC2005T03 | Arabic CTS Levantine Fisher Training Data Set 3, Transcripts | |
LDC2004T18 | Arabic English Parallel News Part 1 | |
LDC2003T12 | Arabic Gigaword | |
LDC2011T11 | Arabic Gigaword Fifth Edition | |
LDC2009T30 | Arabic Gigaword Fourth Edition | |
LDC2007T40 | Arabic Gigaword Third Edition | |
LDC2004T17 | Arabic News Translation Text Part 1 | |
LDC2001T55 | Arabic Newswire Part 1 | |
LDC2012T07 | Arabic Treebank - Broadcast News v1.0 | |
LDC2016T02 | Arabic Treebank - Weblog | |
LDC2003T07 | Arabic Treebank: Part 1 - 10K-word English Translation | |
LDC2003T06 | Arabic Treebank: Part 1 v 2.0 | |
LDC2005T02 | Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis) | |
LDC2010T13 | Arabic Treebank: Part 1 v 4.1 | |
LDC2004T02 | Arabic Treebank: Part 2 v 2.0 | |
LDC2011T09 | Arabic Treebank: Part 2 v 3.1 | |
LDC2005T20 | Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) | |
LDC2004T11 | Arabic Treebank: Part 3 v 1.0 | |
LDC2012T09 | Arabic-Dialect/English Parallel Text | |
LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | |
LDC2005S08 | BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts | |
LDC2000T43 | BLLIP 1987-89 WSJ Corpus Release 1 | |
LDC2002L49 | Buckwalter Arabic Morphological Analyzer Version 1.0 | |
LDC2004L02 | Buckwalter Arabic Morphological Analyzer Version 2.0 | |
LDC96S46 | CALLFRIEND American English-Non-Southern Dialect | |
LDC2019S21 | CALLFRIEND American English-Non-Southern Dialect Second Edition | |
LDC96S47 | CALLFRIEND American English-Southern Dialect | |
LDC2020S08 | CALLFRIEND American English-Southern Dialect Second Edition | |
LDC2019S18 | CALLFRIEND Canadian French Second Edition | |
LDC96S49 | CALLFRIEND Egyptian Arabic | |
LDC2019S04 | CALLFRIEND Egyptian Arabic Second Edition | |
LDC96S55 | CALLFRIEND Mandarin Chinese-Mainland Dialect | |
LDC2018S09 | CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition | |
LDC96S56 | CALLFRIEND Mandarin Chinese-Taiwan Dialect | |
LDC2020S06 | CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition | |
LDC97L20 | CALLHOME American English Lexicon (PRONLEX) | |
LDC97S42 | CALLHOME American English Speech | |
LDC97T14 | CALLHOME American English Transcripts | |
LDC97S45 | CALLHOME Egyptian Arabic Speech | |
LDC2002S37 | CALLHOME Egyptian Arabic Speech Supplement | |
LDC97T19 | CALLHOME Egyptian Arabic Transcripts | |
LDC2002T38 | CALLHOME Egyptian Arabic Transcripts Supplement | |
LDC96L15 | CALLHOME Mandarin Chinese Lexicon | |
LDC96S34 | CALLHOME Mandarin Chinese Speech | |
LDC96T16 | CALLHOME Mandarin Chinese Transcripts | |
LDC2005T13 | CCGbank | |
LDC96L14 | CELEX2 | |
LDC2005T10 | Chinese English News Magazine Parallel Text | |
LDC2003T09 | Chinese Gigaword | |
LDC2011T13 | Chinese Gigaword Fifth Edition | |
LDC2009T27 | Chinese Gigaword Fourth Edition | |
LDC2005T14 | Chinese Gigaword Second Edition | |
LDC2007T38 | Chinese Gigaword Third Edition | |
LDC2005T06 | Chinese News Translation Text Part 1 | |
LDC2005T23 | Chinese Proposition Bank 1.0 | |
LDC2001T11 | Chinese Treebank 2.0 | |
LDC2004T05 | Chinese Treebank 4.0 | |
LDC2005T01 | Chinese Treebank 5.0 | |
LDC2007T36 | Chinese Treebank 6.0 | |
LDC2010T07 | Chinese Treebank 7.0 | |
LDC2013T21 | Chinese Treebank 8.0 | |
LDC2016T13 | Chinese Treebank 9.0 | |
LDC2002L27 | Chinese-English Translation Lexicon Version 3.0 | |
LDC2018T20 | Concretely Annotated English Gigaword | |
LDC2005T08 | Discourse Graphbank | |
LDC99L22 | Egyptian Colloquial Arabic Lexicon | |
LDC2009T01 | English CTS Treebank with Structural Metadata | |
LDC2003T05 | English Gigaword | |
LDC2011T07 | English Gigaword Fifth Edition | |
LDC2009T13 | English Gigaword Fourth Edition | |
LDC2005T12 | English Gigaword Second Edition | |
LDC2007T07 | English Gigaword Third Edition | |
LDC2012T02 | English Translation Treebank: An-Nahar Newswire | |
LDC2012T13 | English Web Treebank | |
LDC2006T10 | English-Arabic Treebank v 1.0 | |
LDC95T11 | European Language Newspaper Text | |
LDC2005S13 | Fisher English Training Part 2, Speech | |
LDC2005T19 | Fisher English Training Part 2, Transcripts | |
LDC2004S13 | Fisher English Training Speech Part 1 Speech | |
LDC2004T19 | Fisher English Training Speech Part 1 Transcripts | |
LDC2007S02 | Fisher Levantine Arabic Conversational Telephone Speech | |
LDC2007T04 | Fisher Levantine Arabic Conversational Telephone Speech, Transcripts | |
LDC2013T14 | GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1 | |
LDC2014T03 | GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2 | |
LDC2013T10 | GALE Arabic-English Parallel Aligned Treebank -- Newswire | |
LDC2014T08 | GALE Arabic-English Parallel Aligned Treebank -- Web Training | |
LDC2014T19 | GALE Arabic-English Word Alignment -- Broadcast Training Part 1 | |
LDC2014T22 | GALE Arabic-English Word Alignment -- Broadcast Training Part 2 | |
LDC2014T05 | GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web | |
LDC2014T10 | GALE Arabic-English Word Alignment Training Part 2 -- Newswire | |
LDC2014T14 | GALE Arabic-English Word Alignment Training Part 3 -- Web | |
LDC2015T06 | GALE Chinese-English Parallel Aligned Treebank -- Training | |
LDC2013T23 | GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1 | |
LDC2014T25 | GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2 | |
LDC2015T04 | GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3 | |
LDC2015T18 | GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4 | |
LDC2012T16 | GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web | |
LDC2012T20 | GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire | |
LDC2012T24 | GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web | |
LDC2013T05 | GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web | |
LDC2017T06 | GALE English-Chinese Parallel Aligned Treebank -- Training | |
LDC2008T02 | GALE Phase 1 Arabic Blog Parallel Text | |
LDC2007T24 | GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 | |
LDC2008T09 | GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2 | |
LDC2009T03 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1 | |
LDC2009T09 | GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2 | |
LDC2008T06 | GALE Phase 1 Chinese Blog Parallel Text | |
LDC2009T02 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1 | |
LDC2009T06 | GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2 | |
LDC2007T23 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 | |
LDC2008T08 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2 | |
LDC2008T18 | GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3 | |
LDC2009T15 | GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1 | |
LDC2010T03 | GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2 | |
LDC2007T20 | GALE Phase 1 Distillation Training | |
LDC2012T06 | GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 | |
LDC2012T14 | GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 | |
LDC2013S02 | GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 | |
LDC2013S07 | GALE Phase 2 Arabic Broadcast Conversation Speech Part 2 | |
LDC2013T04 | GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 | |
LDC2013T17 | GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2 | |
LDC2012T18 | GALE Phase 2 Arabic Broadcast News Parallel Text | |
LDC2014S07 | GALE Phase 2 Arabic Broadcast News Speech Part 1 | |
LDC2015S01 | GALE Phase 2 Arabic Broadcast News Speech Part 2 | |
LDC2014T17 | GALE Phase 2 Arabic Broadcast News Transcripts Part 1 | |
LDC2015T01 | GALE Phase 2 Arabic Broadcast News Transcripts Part 2 | |
LDC2012T17 | GALE Phase 2 Arabic Newswire Parallel Text | |
LDC2013T01 | GALE Phase 2 Arabic Web Parallel Text | |
LDC2013T11 | GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1 | |
LDC2013T16 | GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2 | |
LDC2013S04 | GALE Phase 2 Chinese Broadcast Conversation Speech | |
LDC2013T08 | GALE Phase 2 Chinese Broadcast Conversation Transcripts | |
LDC2014T04 | GALE Phase 2 Chinese Broadcast News Parallel Text Part 1 | |
LDC2014T11 | GALE Phase 2 Chinese Broadcast News Parallel Text Part 2 | |
LDC2013S08 | GALE Phase 2 Chinese Broadcast News Speech | |
LDC2013T20 | GALE Phase 2 Chinese Broadcast News Transcripts | |
LDC2014T15 | GALE Phase 2 Chinese Newswire Parallel Text Part 1 | |
LDC2014T20 | GALE Phase 2 Chinese Newswire Parallel Text Part 2 | |
LDC2014T26 | GALE Phase 2 Chinese Web Parallel Text | |
LDC2015T05 | GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text | |
LDC2015T07 | GALE Phase 3 and 4 Arabic Broadcast News Parallel Text | |
LDC2015T19 | GALE Phase 3 and 4 Arabic Newswire Parallel Text | |
LDC2016T08 | GALE Phase 3 and 4 Arabic Web Parallel Text | |
LDC2016T09 | GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text | |
LDC2016T15 | GALE Phase 3 and 4 Chinese Broadcast News Parallel Text | |
LDC2016T25 | GALE Phase 3 and 4 Chinese Newswire Parallel Text | |
LDC2017T02 | GALE Phase 3 and 4 Chinese Web Parallel Text | |
LDC2015S11 | GALE Phase 3 Arabic Broadcast Conversation Speech Part 1 | |
LDC2016S01 | GALE Phase 3 Arabic Broadcast Conversation Speech Part 2 | |
LDC2015T16 | GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1 | |
LDC2016T06 | GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2 | |
LDC2016S07 | GALE Phase 3 Arabic Broadcast News Speech Part 1 | |
LDC2017S02 | GALE Phase 3 Arabic Broadcast News Speech Part 2 | |
LDC2016T17 | GALE Phase 3 Arabic Broadcast News Transcripts Part 1 | |
LDC2017T04 | GALE Phase 3 Arabic Broadcast News Transcripts Part 2 | |
LDC2014S09 | GALE Phase 3 Chinese Broadcast Conversation Speech Part 1 | |
LDC2015S06 | GALE Phase 3 Chinese Broadcast Conversation Speech Part 2 | |
LDC2014T28 | GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1 | |
LDC2015T09 | GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2 | |
LDC2015S13 | GALE Phase 3 Chinese Broadcast News Speech | |
LDC2015T25 | GALE Phase 3 Chinese Broadcast News Transcripts | |
LDC2016T11 | GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences | |
LDC2017S15 | GALE Phase 4 Arabic Broadcast Conversation Speech | |
LDC2017T12 | GALE Phase 4 Arabic Broadcast Conversation Transcripts | |
LDC2016T20 | GALE Phase 4 Arabic Broadcast News Parallel Sentences | |
LDC2018S05 | GALE Phase 4 Arabic Broadcast News Speech | |
LDC2018T14 | GALE Phase 4 Arabic Broadcast News Transcripts | |
LDC2016T27 | GALE Phase 4 Arabic Newswire Parallel Sentences | |
LDC2016T14 | GALE Phase 4 Arabic Weblog Parallel Sentences | |
LDC2015T14 | GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences | |
LDC2016S03 | GALE Phase 4 Chinese Broadcast Conversation Speech | |
LDC2016T12 | GALE Phase 4 Chinese Broadcast Conversation Transcripts | |
LDC2015T21 | GALE Phase 4 Chinese Broadcast News Parallel Sentences | |
LDC2017S25 | GALE Phase 4 Chinese Broadcast News Speech | |
LDC2017T18 | GALE Phase 4 Chinese Broadcast News Transcripts | |
LDC2015T24 | GALE Phase 4 Chinese Newswire Parallel Sentences | |
LDC2016T04 | GALE Phase 4 Chinese Weblog Parallel Sentences | |
LDC2005S15 | HKUST Mandarin Telephone Speech, Part 1 | |
LDC2005T32 | HKUST Mandarin Telephone Transcript Data, Part 1 | |
LDC2000T50 | Hong Kong Hansards Parallel Text | |
LDC2000T47 | Hong Kong Laws Parallel Text | |
LDC2000T46 | Hong Kong News Parallel Text | |
LDC2004T08 | Hong Kong Parallel Text | |
LDC2018S18 | HUB5 Mandarin Telephone Speech and Transcripts Second Edition | |
LDC98S69 | HUB5 Mandarin Telephone Speech Corpus | |
LDC98T26 | HUB5 Mandarin Transcripts | |
LDC95T8 | Japanese Business News Text | |
LDC99T34 | Japanese Business News Text Supplement | |
LDC2000T45 | Korean Newswire | |
LDC2005S14 | Levantine Arabic QT Training Data Set 4 (Speech + Transcripts) | |
LDC95T13 | Mandarin Chinese News Text | |
LDC2001T02 | Message Understanding Conference (MUC) 7 | |
LDC2003T18 | Multiple-Translation Arabic (MTA) Part 1 | |
LDC2005T05 | Multiple-Translation Arabic (MTA) Part 2 | |
LDC2003T17 | Multiple-Translation Chinese (MTC) Part 2 | |
LDC2004T07 | Multiple-Translation Chinese (MTC) Part 3 | |
LDC2002T01 | Multiple-Translation Chinese Corpus | |
LDC2010T21 | NIST 2008 Open Machine Translation (OpenMT) Evaluation | |
LDC2010T01 | NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations | |
LDC95T21 | North American News Text Corpus | |
LDC98T30 | North American News Text Supplement | |
LDC2007T21 | OntoNotes Release 1.0 | |
LDC2008T04 | OntoNotes Release 2.0 | |
LDC2009T24 | OntoNotes Release 3.0 | |
LDC2011T03 | OntoNotes Release 4.0 | |
LDC2013T19 | OntoNotes Release 5.0 | |
LDC2004T23 | Prague Arabic Dependency Treebank 1.0 | |
LDC2004T14 | Proposition Bank I | |
LDC2004S08 | RT-03 MDE Training Data Speech | |
LDC2004T12 | RT-03 MDE Training Data Text and Annotations | |
LDC2005S16 | RT-04 MDE Training Data Speech | |
LDC2005T24 | RT-04 MDE Training Data Text/Annotations | |
LDC2004S10 | Santa Barbara Corpus of Spoken American English Part III | |
LDC2005S25 | Santa Barbara Corpus of Spoken American English Part IV | |
LDC2013T18 | Semantic Textual Similarity (STS) 2013 Machine Translation | |
LDC2006T12 | Spanish Gigaword First Edition | |
LDC2009T21 | Spanish Gigaword Second Edition | |
LDC95T9 | Spanish News Text | |
LDC99T41 | Spanish Newswire Text, Volume 2 | |
LDC2001S13 | Switchboard Cellular Part 1 Audio | |
LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio | |
LDC2001T14 | Switchboard Cellular Part 1 Transcription | |
LDC2004S07 | Switchboard Cellular Part 2 Audio | |
LDC97S62 | Switchboard-1 Release 2 | |
LDC98S75 | Switchboard-2 Phase I | |
LDC99S79 | Switchboard-2 Phase II | |
LDC2002S06 | Switchboard-2 Phase III Audio | |
LDC98S72 | Taiwanese Putonghua Speech and Transcripts | |
LDC98T25 | TDT Pilot Study Corpus | |
LDC2000S92 | TDT2 Careful Transcription Audio | |
LDC2000T44 | TDT2 Careful Transcription Text | |
LDC99S84 | TDT2 English Audio | |
LDC2001S93 | TDT2 Mandarin Audio Corpus | |
LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | |
LDC2001S94 | TDT3 English Audio | |
LDC2001S95 | TDT3 Mandarin Audio | |
LDC2001T58 | TDT3 Multilanguage Text Version 2.0 | |
LDC2005S11 | TDT4 Multilingual Broadcast News Speech Corpus | |
LDC2005T16 | TDT4 Multilingual Text and Annotations | |
LDC2004T09 | TIDES Extraction (ACE) 2003 Multilingual Training Data | |
LDC93T3A | TIPSTER Complete | |
LDC2018T13 | TRAD Arabic-French Parallel Text -- Newsgroup | |
LDC2018T21 | TRAD Arabic-French Parallel Text -- Newswire | |
LDC2018T02 | TRAD Chinese-French Parallel Text -- Blog | |
LDC2018T17 | TRAD Chinese-French Parallel Text -- Broadcast News | |
LDC2000T52 | TREC Mandarin | |
LDC2000T51 | TREC Spanish | |
LDC99T42 | Treebank-3 | |
LDC94T4B-1 | UN Parallel Text (English) | |
LDC94T4B-3 | UN Parallel Text (Spanish) |
GENOA
LDC2004S05 | ISL Meeting Speech Part 1 | |
LDC2004T10 | ISL Meeting Transcripts Part 1 |
HAVIC
LDC2018V01 | HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation | |
LDC2022V01 | HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation | |
LDC2022V02 | HAVIC MED Novel 2 Test -- Videos, Metadata and Annotation | |
LDC2019V01 | HAVIC MED Progress Test -- Videos, Metadata and Annotation | |
LDC2021V01 | HAVIC MED Training Data -- Videos, Metadata and Annotation | |
LDC2016V01 | HAVIC Pilot Transcription |
Hub4
LDC98T31 | 1996 CSR HUB4 Language Model | |
LDC97S66 | 1996 English Broadcast News Dev and Eval (HUB4) | |
LDC97S44 | 1996 English Broadcast News Speech (HUB4) | |
LDC97T22 | 1996 English Broadcast News Transcripts (HUB4) | |
LDC98S71 | 1997 English Broadcast News Speech (HUB4) | |
LDC98T28 | 1997 English Broadcast News Transcripts (HUB4) | |
LDC2001S91 | 1997 HUB4 Broadcast News Evaluation Non-English Test Material | |
LDC2002S11 | 1997 HUB4 English Evaluation Speech and Transcripts | |
LDC98S73 | 1997 Mandarin Broadcast News Speech (HUB4-NE) | |
LDC98T24 | 1997 Mandarin Broadcast News Transcripts (HUB4-NE) | |
LDC98S74 | 1997 Spanish Broadcast News Speech (HUB4-NE) | |
LDC98T29 | 1997 Spanish Broadcast News Transcripts (HUB4-NE) | |
LDC2000S86 | 1998 HUB4 Broadcast News Evaluation English Test Material | |
LDC2015S05 | Mandarin Chinese Phonetic Segmentation and Tone | |
LDC95T21 | North American News Text Corpus | |
LDC98T30 | North American News Text Supplement |
Hub5-LVCSR
LDC2002S22 | 1997 HUB5 Arabic Evaluation | |
LDC2002T39 | 1997 HUB5 Arabic Transcripts | |
LDC2002S23 | 1997 HUB5 English Evaluation | |
LDC2002S24 | 1997 HUB5 German Evaluation | |
LDC2003T03 | 1997 HUB5 German Transcripts | |
LDC2002S25 | 1997 HUB5 Spanish Evaluation | |
LDC2003T04 | 1997 HUB5 Spanish Transcripts | |
LDC2002S10 | 1998 HUB5 English Evaluation | |
LDC2003T02 | 1998 HUB5 English Transcripts | |
LDC2002S09 | 2000 HUB5 English Evaluation Speech | |
LDC2002T43 | 2000 HUB5 English Evaluation Transcripts | |
LDC2002S13 | 2001 HUB5 English Evaluation | |
LDC2002S12 | 2001 HUB5 Mandarin Evaluation | |
LDC2003T01 | 2001 HUB5 Mandarin Transcripts | |
LDC97L20 | CALLHOME American English Lexicon (PRONLEX) | |
LDC97S42 | CALLHOME American English Speech | |
LDC97T14 | CALLHOME American English Transcripts | |
LDC97S45 | CALLHOME Egyptian Arabic Speech | |
LDC2002S37 | CALLHOME Egyptian Arabic Speech Supplement | |
LDC97T19 | CALLHOME Egyptian Arabic Transcripts | |
LDC2002T38 | CALLHOME Egyptian Arabic Transcripts Supplement | |
LDC97L18 | CALLHOME German Lexicon | |
LDC97S43 | CALLHOME German Speech | |
LDC97T15 | CALLHOME German Transcripts | |
LDC96L17 | CALLHOME Japanese Lexicon | |
LDC96S37 | CALLHOME Japanese Speech | |
LDC96T18 | CALLHOME Japanese Transcripts | |
LDC96L15 | CALLHOME Mandarin Chinese Lexicon | |
LDC96S34 | CALLHOME Mandarin Chinese Speech | |
LDC96T16 | CALLHOME Mandarin Chinese Transcripts | |
LDC96L16 | CALLHOME Spanish Lexicon | |
LDC96S35 | CALLHOME Spanish Speech | |
LDC96T17 | CALLHOME Spanish Transcripts | |
LDC99L22 | Egyptian Colloquial Arabic Lexicon | |
LDC2018S18 | HUB5 Mandarin Telephone Speech and Transcripts Second Edition | |
LDC98S69 | HUB5 Mandarin Telephone Speech Corpus | |
LDC98T26 | HUB5 Mandarin Transcripts | |
LDC98S70 | HUB5 Spanish Telephone Speech Corpus | |
LDC98T27 | HUB5 Spanish Transcripts | |
LDC97S62 | Switchboard-1 Release 2 | |
LDC2001T60 | Syllable-Final /s/ Lenition |
JANUS
LDC2004S05 | ISL Meeting Speech Part 1 | |
LDC2004T10 | ISL Meeting Transcripts Part 1 |
LID
LDC96S46 | CALLFRIEND American English-Non-Southern Dialect | |
LDC2019S21 | CALLFRIEND American English-Non-Southern Dialect Second Edition | |
LDC96S47 | CALLFRIEND American English-Southern Dialect | |
LDC2020S08 | CALLFRIEND American English-Southern Dialect Second Edition | |
LDC96S48 | CALLFRIEND Canadian French | |
LDC2019S18 | CALLFRIEND Canadian French Second Edition | |
LDC96S49 | CALLFRIEND Egyptian Arabic | |
LDC2019S04 | CALLFRIEND Egyptian Arabic Second Edition | |
LDC96S50 | CALLFRIEND Farsi | |
LDC2014S01 | CALLFRIEND Farsi Second Edition Speech | |
LDC2014T01 | CALLFRIEND Farsi Second Edition Transcripts | |
LDC96S51 | CALLFRIEND German | |
LDC96S52 | CALLFRIEND Hindi | |
LDC96S53 | CALLFRIEND Japanese | |
LDC96S54 | CALLFRIEND Korean | |
LDC96S55 | CALLFRIEND Mandarin Chinese-Mainland Dialect | |
LDC2018S09 | CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition | |
LDC96S56 | CALLFRIEND Mandarin Chinese-Taiwan Dialect | |
LDC2020S06 | CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition | |
LDC2023S08 | CALLFRIEND Russian Speech | |
LDC2023T09 | CALLFRIEND Russian Text | |
LDC96S57 | CALLFRIEND Spanish-Caribbean Dialect | |
LDC96S58 | CALLFRIEND Spanish-Non-Caribbean Dialect | |
LDC96S59 | CALLFRIEND Tamil | |
LDC96S60 | CALLFRIEND Vietnamese |
Linguistic Atlas Project
LDC2012S03 | Digital Archive of Southern Speech | |
LDC2016S05 | Digital Archive of Southern Speech - NLP Version |
LORELEI
LDC2020T02 | Abstract Meaning Representation (AMR) Annotation Release 3.0 | |
LDC2023T10 | AIDA Scenario 1 and 2 Reference Knowledge Base | |
LDC2023S01 | AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts | |
LDC2024T03 | LoReHLT Hausa Representative Language Pack | |
LDC2021T02 | LORELEI Akan Representative Language Pack | |
LDC2018T04 | LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text | |
LDC2022T05 | LORELEI Bengali Representative Language Pack | |
LDC2020T10 | LORELEI Entity Detection and Linking Knowledge Base | |
LDC2024T01 | LORELEI Farsi Representative Language Pack | |
LDC2023T07 | LORELEI Indonesian Representative Language Pack | |
LDC2022T01 | LORELEI Kinyarwanda Incident Language Pack | |
LDC2020T11 | LORELEI Oromo Incident Language Pack | |
LDC2018T11 | LORELEI Somali Representative Language Pack - Monolingual and Parallel Text | |
LDC2023T01 | LORELEI Swahili Representative Language Pack | |
LDC2023T02 | LORELEI Tagalog Representative Language Pack | |
LDC2023T03 | LORELEI Tamil Representative Language Pack | |
LDC2023T08 | LORELEI Thai Representative Language Pack | |
LDC2020T22 | LORELEI Tigrinya Incident Language Pack | |
LDC2020T24 | LORELEI Ukrainian Representative Language Pack | |
LDC2024T07 | LORELEI Uyghur Incident Language Pack | |
LDC2020T17 | LORELEI Vietnamese Representative Language Pack | |
LDC2022T03 | LORELEI Wolof Representative Language Pack | |
LDC2024T10 | LORELEI Yoruba Representative Language Pack | |
LDC2023T06 | LORELEI Zulu Representative Language Pack |
Machine Reading
LDC2020T04 | Machine Reading Phase 1 IC Training Data | |
LDC2019T14 | Machine Reading Phase 1 NFL Scoring Training Data |
MADCAT
LDC2014T13 | MADCAT Chinese Pilot Training Set | |
LDC2012T15 | MADCAT Phase 1 Training Set | |
LDC2013T09 | MADCAT Phase 2 Training Set | |
LDC2013T15 | MADCAT Phase 3 Training Set |
MALACH
LDC2014S04 | USC-SFI MALACH Interviews and Transcripts Czech | |
LDC2012S05 | USC-SFI MALACH Interviews and Transcripts English | |
LDC2019S11 | USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition |
MIXER
LDC2019S09 | First DIHARD Challenge Development - Eight Sources | |
LDC2019S12 | First DIHARD Challenge Evaluation - Nine Sources | |
LDC2023S02 | Mixer 3 Speech | |
LDC2020S03 | Mixer 4 and 5 Speech | |
LDC2013S03 | Mixer 6 Speech | |
LDC2023S04 | Mixer 7 Spanish Speech | |
LDC2023S09 | REMIX Telephone Collection | |
LDC2022S06 | Second DIHARD Challenge Evaluation - Eleven Sources | |
LDC2022S12 | Third DIHARD Challenge Development | |
LDC2022S14 | Third DIHARD Challenge Evaluation |
MT08
LDC2010T01 | NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations |
MUC
LDC2003T13 | Message Understanding Conference (MUC) 6 | |
LDC96T10 | Message Understanding Conference (MUC) 6 Additional News Text | |
LDC2001T02 | Message Understanding Conference (MUC) 7 | |
LDC2010T15 | Message Understanding Conference 7 Timed (MUC7_T) | |
LDC95T21 | North American News Text Corpus | |
LDC93T3A | TIPSTER Complete | |
LDC93T3B | TIPSTER Volume 1 | |
LDC93T3C | TIPSTER Volume 2 | |
LDC93T3D | TIPSTER Volume 3 |
NIEUW
LDC2022S09 | Xi'an Guanzhong Object Naming |
NIST Automatic Meeting Recognition
LDC2004S09 | NIST Meeting Pilot Corpus Speech | |
LDC2004T13 | NIST Meeting Pilot Corpus Transcripts and Metadata |
NIST LRE
LDC2006S31 | 2003 NIST Language Recognition Evaluation | |
LDC2008S05 | 2005 NIST Language Recognition Evaluation | |
LDC2009S05 | 2007 NIST Language Recognition Evaluation Supplemental Training Set | |
LDC2009S04 | 2007 NIST Language Recognition Evaluation Test Set | |
LDC2014S06 | 2009 NIST Language Recognition Evaluation Test Set | |
LDC2018S06 | 2011 NIST Language Recognition Evaluation Test Set | |
LDC2022S10 | 2017 NIST Language Recognition Evaluation Training and Development Sets | |
LDC2023S01 | AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts | |
LDC2023S02 | Mixer 3 Speech | |
LDC2019S02 | Multi-Language Conversational Telephone Speech 2011 -- Arabic Group | |
LDC2018S03 | Multi-Language Conversational Telephone Speech 2011 -- Central Asian | |
LDC2018S08 | Multi-Language Conversational Telephone Speech 2011 -- Central European | |
LDC2019S15 | Multi-Language Conversational Telephone Speech 2011 -- East Asian | |
LDC2019S06 | Multi-Language Conversational Telephone Speech 2011 -- English Group | |
LDC2020S05 | Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese | |
LDC2016S11 | Multi-Language Conversational Telephone Speech 2011 -- Slavic Group | |
LDC2017S14 | Multi-Language Conversational Telephone Speech 2011 -- South Asian | |
LDC2018S12 | Multi-Language Conversational Telephone Speech 2011 -- Spanish | |
LDC2017S09 | Multi-Language Conversational Telephone Speech 2011 -- Turkish |
NIST MT
LDC2009T05 | 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data | |
LDC2014T09 | HyTER Networks of Selected OpenMT08/09 Sentences | |
LDC2010T10 | NIST 2002 Open Machine Translation (OpenMT) Evaluation | |
LDC2010T11 | NIST 2003 Open Machine Translation (OpenMT) Evaluation | |
LDC2010T12 | NIST 2004 Open Machine Translation (OpenMT) Evaluation | |
LDC2010T14 | NIST 2005 Open Machine Translation (OpenMT) Evaluation | |
LDC2010T17 | NIST 2006 Open Machine Translation (OpenMT) Evaluation | |
LDC2010T21 | NIST 2008 Open Machine Translation (OpenMT) Evaluation | |
LDC2013T07 | NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets | |
LDC2010T23 | NIST 2009 Open Machine Translation (OpenMT) Evaluation | |
LDC2013T03 | NIST 2012 Open Machine Translation (OpenMT) Evaluation | |
LDC2014T02 | NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source | |
LDC2013T18 | Semantic Textual Similarity (STS) 2013 Machine Translation |
NIST OpenSAT
LDC2022S01 | 2017 NIST OpenSAT Pilot - SSSF | |
LDC2023S06 | 2019 OpenSAT Public Safety Communications Simulation |
NIST Public Safety
LDC2023S06 | 2019 OpenSAT Public Safety Communications Simulation |
NIST SRE
LDC96S61 | 1996 Speaker Recognition Benchmark | |
LDC99S80 | 1997 Speaker Recognition Benchmark | |
LDC98S76 | 1998 Speaker Recognition Benchmark | |
LDC99S81 | 1999 Speaker Recognition Benchmark | |
LDC2001S97 | 2000 NIST Speaker Recognition Evaluation | |
LDC2002S34 | 2001 NIST Speaker Recognition Evaluation Corpus | |
LDC2004S04 | 2002 NIST Speaker Recognition Evaluation | |
LDC2010S03 | 2003 NIST Speaker Recognition Evaluation | |
LDC2006S44 | 2004 NIST Speaker Recognition Evaluation | |
LDC2011S04 | 2005 NIST Speaker Recognition Evaluation Test Data | |
LDC2011S01 | 2005 NIST Speaker Recognition Evaluation Training Data | |
LDC2011S10 | 2006 NIST Speaker Recognition Evaluation Test Set Part 1 | |
LDC2012S01 | 2006 NIST Speaker Recognition Evaluation Test Set Part 2 | |
LDC2011S09 | 2006 NIST Speaker Recognition Evaluation Training Set | |
LDC2011S11 | 2008 NIST Speaker Recognition Evaluation Supplemental Set | |
LDC2011S08 | 2008 NIST Speaker Recognition Evaluation Test Set | |
LDC2011S05 | 2008 NIST Speaker Recognition Evaluation Training Set Part 1 | |
LDC2011S07 | 2008 NIST Speaker Recognition Evaluation Training Set Part 2 | |
LDC2017S06 | 2010 NIST Speaker Recognition Evaluation Test Set | |
LDC2019S20 | 2016 NIST Speaker Recognition Evaluation Test Set | |
LDC2020S04 | 2018 NIST Speaker Recognition Evaluation Test Set | |
LDC2023V01 | 2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual | |
LDC2023S03 | 2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge | |
LDC2024S05 | Call My Net 1 | |
LDC2019S09 | First DIHARD Challenge Development - Eight Sources | |
LDC2019S12 | First DIHARD Challenge Evaluation - Nine Sources | |
LDC2013S05 | Greybeard | |
LDC2024S01 | KASET - Kurmanji and Sorani Kurdish Speech and Transcripts | |
LDC2023S02 | Mixer 3 Speech | |
LDC2020S03 | Mixer 4 and 5 Speech | |
LDC2013S03 | Mixer 6 Speech | |
LDC2023S04 | Mixer 7 Spanish Speech | |
LDC2009T26 | NXT Switchboard Annotations | |
LDC2023S09 | REMIX Telephone Collection | |
LDC2022S06 | Second DIHARD Challenge Evaluation - Eleven Sources | |
LDC2001S13 | Switchboard Cellular Part 1 Audio | |
LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio | |
LDC2001T14 | Switchboard Cellular Part 1 Transcription | |
LDC2004S07 | Switchboard Cellular Part 2 Audio | |
LDC93S8 | Switchboard Credit Card | |
LDC97S62 | Switchboard-1 Release 2 | |
LDC98S75 | Switchboard-2 Phase I | |
LDC99S79 | Switchboard-2 Phase II | |
LDC2002S06 | Switchboard-2 Phase III Audio |
OpenHaRT
LDC2012T15 | MADCAT Phase 1 Training Set | |
LDC2013T09 | MADCAT Phase 2 Training Set | |
LDC2013T15 | MADCAT Phase 3 Training Set |
PEA-TRAD
LDC2018T13 | TRAD Arabic-French Parallel Text -- Newsgroup | |
LDC2018T21 | TRAD Arabic-French Parallel Text -- Newswire | |
LDC2018T02 | TRAD Chinese-French Parallel Text -- Blog | |
LDC2018T17 | TRAD Chinese-French Parallel Text -- Broadcast News |
RATS
LDC2017S20 | RATS Keyword Spotting | |
LDC2018S10 | RATS Language Identification | |
LDC2024S03 | RATS Low Speech Density | |
LDC2021S08 | RATS Speaker Identification | |
LDC2015S02 | RATS Speech Activity Detection |
REFLEX-MTE
LDC2009T11 | REFLEX Entity Translation Training/DevTest |
RM
LDC96S39 | RM Isolated and Spelled Word Data |
ROAR
LDC2019S09 | First DIHARD Challenge Development - Eight Sources | |
LDC2019S12 | First DIHARD Challenge Evaluation - Nine Sources | |
LDC2004S05 | ISL Meeting Speech Part 1 | |
LDC2004T10 | ISL Meeting Transcripts Part 1 | |
LDC2022S06 | Second DIHARD Challenge Evaluation - Eleven Sources | |
LDC2022S14 | Third DIHARD Challenge Evaluation |
RT
LDC2007S12 | 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data | |
LDC2007S11 | 2004 Spring NIST Rich Transcription (RT-04S) Development Data | |
LDC2011S06 | 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set | |
LDC2019S09 | First DIHARD Challenge Development - Eight Sources | |
LDC2019S12 | First DIHARD Challenge Evaluation - Nine Sources | |
LDC2022S06 | Second DIHARD Challenge Evaluation - Eleven Sources | |
LDC2022S12 | Third DIHARD Challenge Development |
SemEval
LDC2016T10 | SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing | |
LDC2011T01 | SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages |
SID
LDC2001S13 | Switchboard Cellular Part 1 Audio | |
LDC2001S15 | Switchboard Cellular Part 1 Transcribed Audio | |
LDC2001T14 | Switchboard Cellular Part 1 Transcription | |
LDC2004S07 | Switchboard Cellular Part 2 Audio | |
LDC98S75 | Switchboard-2 Phase I | |
LDC99S79 | Switchboard-2 Phase II | |
LDC2002S06 | Switchboard-2 Phase III Audio |
SPINE
LDC2000S96 | Speech in Noisy Environments (SPINE) Evaluation Audio | |
LDC2000T54 | Speech in Noisy Environments (SPINE) Evaluation Transcripts | |
LDC2000S87 | Speech in Noisy Environments (SPINE) Training Audio | |
LDC2000T49 | Speech in Noisy Environments (SPINE) Training Transcripts | |
LDC2001S04 | Speech in Noisy Environments (SPINE2) Part 1 Audio | |
LDC2001T05 | Speech in Noisy Environments (SPINE2) Part 1 Transcripts | |
LDC2001S06 | Speech in Noisy Environments (SPINE2) Part 2 Audio | |
LDC2001T07 | Speech in Noisy Environments (SPINE2) Part 2 Transcripts | |
LDC2001S08 | Speech in Noisy Environments (SPINE2) Part 3 Audio | |
LDC2001T09 | Speech in Noisy Environments (SPINE2) Part 3 Transcripts | |
LDC2001S99 | Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio |
TAC
LDC2024T09 | MultiTACRED | |
LDC2023T13 | TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017 | |
LDC2017T17 | TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014 | |
LDC2019T08 | TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014 | |
LDC2019T17 | TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017 | |
LDC2018T03 | TAC KBP Comprehensive English Source Corpora 2009-2014 | |
LDC2018T16 | TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013 | |
LDC2020T03 | TAC KBP English Event Argument - Training and Evaluation Data 2014-2015 | |
LDC2020T13 | TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015 | |
LDC2018T22 | TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014 | |
LDC2021T08 | TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014 | |
LDC2021T06 | TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010 | |
LDC2020T08 | TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013 | |
LDC2019T19 | TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017 | |
LDC2019T02 | TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015 | |
LDC2019T12 | TAC KBP Evaluation Source Corpora 2016-2017 | |
LDC2020T18 | TAC KBP Event Argument - Comprehensive Training and Evaluation Data 2016-2017 | |
LDC2014T16 | TAC KBP Reference Knowledge Base | |
LDC2016T26 | TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014 | |
LDC2018T24 | TAC Relation Extraction Dataset |
Talkbank
LDC2005T35 | American National Corpus (ANC) Second Release | |
LDC2004V01 | FORM1 Kinematic Gesture | |
LDC2003V01 | FORM2 Kinematic Gesture | |
LDC2003L01 | Grassfields Bantu Fieldwork: Dschang Lexicon | |
LDC2003S02 | Grassfields Bantu Fieldwork: Dschang Tone Paradigms | |
LDC2001S16 | Grassfields Bantu Fieldwork: Ngomba Tone Paradigms | |
LDC2004L01 | Klex: Finite-State Lexical Transducer for Korean | |
LDC2004T03 | Morphologically Annotated Korean Text | |
LDC2003S06 | Santa Barbara Corpus of Spoken American English Part II | |
LDC2004S10 | Santa Barbara Corpus of Spoken American English Part III | |
LDC2005S25 | Santa Barbara Corpus of Spoken American English Part IV | |
LDC2003T15 | SLX Corpus of Classic Sociolinguistic Interviews | |
LDC2004S12 | TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls |
TDT
LDC2010T18 | ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 | |
LDC98T25 | TDT Pilot Study Corpus | |
LDC2000S92 | TDT2 Careful Transcription Audio | |
LDC2000T44 | TDT2 Careful Transcription Text | |
LDC99S84 | TDT2 English Audio | |
LDC2001S93 | TDT2 Mandarin Audio Corpus | |
LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | |
LDC2001S94 | TDT3 English Audio | |
LDC2001S95 | TDT3 Mandarin Audio | |
LDC2001T58 | TDT3 Multilanguage Text Version 2.0 | |
LDC2005S11 | TDT4 Multilingual Broadcast News Speech Corpus | |
LDC2005T16 | TDT4 Multilingual Text and Annotations | |
LDC2007V02 | TRECVID 2003 Keyframes & Transcripts | |
LDC2007V01 | TRECVID 2005 Keyframes & Transcripts |
TERN
LDC2010T18 | ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 |
TIDES
LDC2005T09 | ACE 2004 Multilingual Training Corpus | |
LDC2010T18 | ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 | |
LDC2005T07 | ACE Time Normalization (TERN) 2004 English Training Data v 1.0 | |
LDC2003T11 | ACE-2 Version 1.0 | |
LDC93T1 | ACL/DCI | |
LDC2004T18 | Arabic English Parallel News Part 1 | |
LDC2003T12 | Arabic Gigaword | |
LDC2004T17 | Arabic News Translation Text Part 1 | |
LDC2001T55 | Arabic Newswire Part 1 | |
LDC2003T07 | Arabic Treebank: Part 1 - 10K-word English Translation | |
LDC2003T06 | Arabic Treebank: Part 1 v 2.0 | |
LDC2005T02 | Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis) | |
LDC2004T02 | Arabic Treebank: Part 2 v 2.0 | |
LDC2005T20 | Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) | |
LDC2004T11 | Arabic Treebank: Part 3 v 1.0 | |
LDC2005T33 | BBN Pronoun Coreference and Entity Type Corpus | |
LDC2000T43 | BLLIP 1987-89 WSJ Corpus Release 1 | |
LDC2002L49 | Buckwalter Arabic Morphological Analyzer Version 1.0 | |
LDC2004L02 | Buckwalter Arabic Morphological Analyzer Version 2.0 | |
LDC2005T13 | CCGbank | |
LDC96L14 | CELEX2 | |
LDC2005T10 | Chinese English News Magazine Parallel Text | |
LDC2003T09 | Chinese Gigaword | |
LDC2005T14 | Chinese Gigaword Second Edition | |
LDC2005T06 | Chinese News Translation Text Part 1 | |
LDC2005T23 | Chinese Proposition Bank 1.0 | |
LDC2001T11 | Chinese Treebank 2.0 | |
LDC2004T05 | Chinese Treebank 4.0 | |
LDC2005T01 | Chinese Treebank 5.0 | |
LDC2007T36 | Chinese Treebank 6.0 | |
LDC2010T07 | Chinese Treebank 7.0 | |
LDC2013T21 | Chinese Treebank 8.0 | |
LDC2002L27 | Chinese-English Translation Lexicon Version 3.0 | |
LDC2007T02 | English Chinese Translation Treebank v 1.0 | |
LDC2003T05 | English Gigaword | |
LDC2005T12 | English Gigaword Second Edition | |
LDC95T11 | European Language Newspaper Text | |
LDC2000T50 | Hong Kong Hansards Parallel Text | |
LDC2000T47 | Hong Kong Laws Parallel Text | |
LDC2000T46 | Hong Kong News Parallel Text | |
LDC2004T08 | Hong Kong Parallel Text | |
LDC95T8 | Japanese Business News Text | |
LDC99T34 | Japanese Business News Text Supplement | |
LDC2000T45 | Korean Newswire | |
LDC95T13 | Mandarin Chinese News Text | |
LDC2001T02 | Message Understanding Conference (MUC) 7 | |
LDC2003T18 | Multiple-Translation Arabic (MTA) Part 1 | |
LDC2005T05 | Multiple-Translation Arabic (MTA) Part 2 | |
LDC2003T17 | Multiple-Translation Chinese (MTC) Part 2 | |
LDC2004T07 | Multiple-Translation Chinese (MTC) Part 3 | |
LDC2006T04 | Multiple-Translation Chinese (MTC) Part 4 | |
LDC2002T01 | Multiple-Translation Chinese Corpus | |
LDC95T21 | North American News Text Corpus | |
LDC98T30 | North American News Text Supplement | |
LDC2004T23 | Prague Arabic Dependency Treebank 1.0 | |
LDC2004T14 | Proposition Bank I | |
LDC2006T12 | Spanish Gigaword First Edition | |
LDC2009T21 | Spanish Gigaword Second Edition | |
LDC95T9 | Spanish News Text | |
LDC99T41 | Spanish Newswire Text, Volume 2 | |
LDC98T25 | TDT Pilot Study Corpus | |
LDC2000S92 | TDT2 Careful Transcription Audio | |
LDC2000T44 | TDT2 Careful Transcription Text | |
LDC99S84 | TDT2 English Audio | |
LDC2001S93 | TDT2 Mandarin Audio Corpus | |
LDC2001T57 | TDT2 Multilanguage Text Version 4.0 | |
LDC2001S94 | TDT3 English Audio | |
LDC2001S95 | TDT3 Mandarin Audio | |
LDC2001T58 | TDT3 Multilanguage Text Version 2.0 | |
LDC2005S11 | TDT4 Multilingual Broadcast News Speech Corpus | |
LDC2005T16 | TDT4 Multilingual Text and Annotations | |
LDC2004T09 | TIDES Extraction (ACE) 2003 Multilingual Training Data | |
LDC93T3A | TIPSTER Complete | |
LDC2000T52 | TREC Mandarin | |
LDC2000T51 | TREC Spanish | |
LDC99T42 | Treebank-3 | |
LDC94T4B-1 | UN Parallel Text (English) | |
LDC94T4B-3 | UN Parallel Text (Spanish) |
Tipster
LDC95T13 | Mandarin Chinese News Text | |
LDC95T9 | Spanish News Text | |
LDC93T3A | TIPSTER Complete | |
LDC93T3B | TIPSTER Volume 1 | |
LDC93T3C | TIPSTER Volume 2 | |
LDC93T3D | TIPSTER Volume 3 |
TRAD
LDC2018T13 | TRAD Arabic-French Parallel Text -- Newsgroup | |
LDC2018T21 | TRAD Arabic-French Parallel Text -- Newswire | |
LDC2018T02 | TRAD Chinese-French Parallel Text -- Blog | |
LDC2018T17 | TRAD Chinese-French Parallel Text -- Broadcast News |
TREC
LDC2001T55 | Arabic Newswire Part 1 | |
LDC95T13 | Mandarin Chinese News Text | |
LDC95T9 | Spanish News Text | |
LDC93T3A | TIPSTER Complete | |
LDC93T3B | TIPSTER Volume 1 | |
LDC93T3C | TIPSTER Volume 2 | |
LDC93T3D | TIPSTER Volume 3 | |
LDC2000T52 | TREC Mandarin | |
LDC2000T51 | TREC Spanish | |
LDC2007V02 | TRECVID 2003 Keyframes & Transcripts | |
LDC2010V01 | TRECVID 2004 Keyframes & Transcripts | |
LDC2007V01 | TRECVID 2005 Keyframes & Transcripts | |
LDC2010V02 | TRECVID 2006 Keyframes |
VACE
LDC2012V01 | 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News | |
LDC2011V05 | 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1 | |
LDC2011V06 | 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2 | |
LDC2011V03 | NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1 | |
LDC2011V04 | NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2 | |
LDC2011V01 | NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 1 | |
LDC2011V02 | NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2 |
VAST
LDC2023V01 | 2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual | |
LDC2019S09 | First DIHARD Challenge Development - Eight Sources | |
LDC2019S12 | First DIHARD Challenge Evaluation - Nine Sources | |
LDC2022S06 | Second DIHARD Challenge Evaluation - Eleven Sources | |
LDC2022S12 | Third DIHARD Challenge Development | |
LDC2022S14 | Third DIHARD Challenge Evaluation | |
LDC2019S05 | VAST Chinese Speech and Transcripts |