LDC Corpora ⇔ Projects

Many of the corpora in the Catalog were developed for, or used in, sponsored research projects. Some of those resources served as training and test data for benchmark tests of language-based systems developed during those projects. A corpus is associated with a given project because it was developed for the project, was used in the project, or was otherwise considered relevant to the project's work.

ACE

LDC2017T10 Abstract Meaning Representation (AMR) Annotation Release 2.0
LDC2020T02 Abstract Meaning Representation (AMR) Annotation Release 3.0
LDC2005T09 ACE 2004 Multilingual Training Corpus
LDC2008T03 ACE 2005 English SpatialML Annotations
LDC2011T02 ACE 2005 English SpatialML Annotations Version 2
LDC2010T09 ACE 2005 Mandarin SpatialML Annotations
LDC2006T06 ACE 2005 Multilingual Training Corpus
LDC2014T18 ACE 2007 Multilingual Training Corpus
LDC2015T20 ACE 2007 Spanish DevTest - Pilot Evaluation
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
LDC2003T11 ACE-2 Version 1.0
LDC2024T05 Automatic Content Extraction for Portuguese
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
LDC2019T07 Chinese Abstract Meaning Representation 1.0
LDC2011T08 Datasets for Generic Relation Extraction (reACE)
LDC2004T14 Proposition Bank I
LDC2009T11 REFLEX Entity Translation Training/DevTest
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data

AIDA

LDC2023T10 AIDA Scenario 1 and 2 Reference Knowledge Base
LDC2024T02 AIDA Scenario 1 Practice Topic Annotation
LDC2023T11 AIDA Scenario 1 Practice Topic Source Data
LDC2024T06 AIDA Scenario 2 Practice Topic Annotation
LDC2024T04 AIDA Scenario 2 Practice Topic Source Data
LDC2023S01 AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts

American National Corpus (ANC)

LDC2005T35 American National Corpus (ANC) Second Release
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC2013T12 Manually Annotated Sub-Corpus Third Release

AQUAINT

LDC2008T25 AQUAINT-2 Information-Retrieval Text Research Collection
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus

ATIS

LDC2021T04 ATIS - Seven Languages
LDC93S4A ATIS0 Complete
LDC93S4B ATIS0 Pilot
LDC93S4B-2 ATIS0 Read
LDC93S4B-3 ATIS0 SD Read
LDC93S5 ATIS2
LDC95S26 ATIS3 Test Data
LDC94S19 ATIS3 Training Data
LDC2019T04 Multilingual ATIS

BOLT

LDC2014T12 Abstract Meaning Representation (AMR) Annotation Release 1.0
LDC2017T10 Abstract Meaning Representation (AMR) Annotation Release 2.0
LDC2020T02 Abstract Meaning Representation (AMR) Annotation Release 3.0
LDC2020T07 Abstract Meaning Representation 2.0 - Four Translations
LDC2019T01 BOLT Arabic Discussion Forum Parallel Training Data
LDC2018T10 BOLT Arabic Discussion Forums
LDC2021T07 BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
LDC2017T05 BOLT Chinese Discussion Forum Parallel Training Data
LDC2016T05 BOLT Chinese Discussion Forums
LDC2018T15 BOLT Chinese SMS/Chat
LDC2021T11 BOLT Chinese SMS/Chat Parallel Training Data
LDC2016T19 BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training
LDC2020T15 BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training
LDC2019T13 BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training
LDC2021T14 BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
LDC2021T18 BOLT Egyptian Arabic PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
LDC2017T07 BOLT Egyptian Arabic SMS/Chat and Transliteration
LDC2021T15 BOLT Egyptian Arabic SMS/Chat Parallel Training Data
LDC2021T12 BOLT Egyptian Arabic Treebank - Conversational Telephone Speech
LDC2018T23 BOLT Egyptian Arabic Treebank - Discussion Forum
LDC2021T17 BOLT Egyptian Arabic Treebank - SMS/Chat
LDC2020T05 BOLT Egyptian Arabic-English Word Alignment -- Conversational Telephone Speech Training
LDC2019T18 BOLT Egyptian Arabic-English Word Alignment -- SMS/Chat Training
LDC2019T06 BOLT Egyptian-English Word Alignment -- Discussion Forum Training
LDC2020T20 BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
LDC2017T11 BOLT English Discussion Forums
LDC2020T21 BOLT English PropBank and Sense -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
LDC2018T19 BOLT English SMS/Chat
LDC2020T09 BOLT English Translation Treebank - Chinese Discussion Forum
LDC2021T19 BOLT English Translation Treebank - Chinese SMS/Chat
LDC2022T06 BOLT English Translation Treebank - Egyptian Arabic SMS/Chat
LDC2019T15 BOLT English Treebank - Discussion Forum
LDC2021T03 BOLT English Treebank - SMS/Chat
LDC2018T18 BOLT Information Retrieval Comprehensive Training and Evaluation
LDC2013T21 Chinese Treebank 8.0
LDC2016T13 Chinese Treebank 9.0
LDC2024T03 LoReHLT Hausa Representative Language Pack

CAMIO

LDC2022T07 CAMIO Transcription Languages

CHiME

LDC2017S07 CHiME2 Grid
LDC2017S10 CHiME2 WSJ0
LDC2017S24 CHiME3

Communicator

LDC2004T15 2000 Communicator Dialogue Act Tagged
LDC2002S56 2000 Communicator Evaluation
LDC2004T16 2001 Communicator Dialogue Act Tagged
LDC2003S01 2001 Communicator Evaluation

CoNLL

LDC2015T12 2006 CoNLL Shared Task - Arabic & Czech
LDC2015T11 2006 CoNLL Shared Task - Ten Languages
LDC2018T08 2007 CoNLL Shared Task - Arabic & English
LDC2018T06 2007 CoNLL Shared Task - Basque, Catalan, Czech & Turkish
LDC2018T07 2007 CoNLL Shared Task - Greek, Hungarian & Italian
LDC2012T03 2009 CoNLL Shared Task Part 1
LDC2012T04 2009 CoNLL Shared Task Part 2
LDC2017T13 2015-2016 CoNLL Shared Task

DARPA-CSR

LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
LDC93S6A CSR-I (WSJ0) Complete
LDC93S6C CSR-I (WSJ0) Other
LDC93S6B CSR-I (WSJ0) Sennheiser
LDC94S13A CSR-II (WSJ1) Complete
LDC94S13C CSR-II (WSJ1) Other
LDC94S13B CSR-II (WSJ1) Sennheiser
LDC95S23 CSR-III Speech
LDC95T6 CSR-III Text
LDC96S33 CSR-IV HUB3
LDC96S31 CSR-IV HUB4

DASL

LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews

DEFT

LDC2014T12 Abstract Meaning Representation (AMR) Annotation Release 1.0
LDC2017T10 Abstract Meaning Representation (AMR) Annotation Release 2.0
LDC2020T02 Abstract Meaning Representation (AMR) Annotation Release 3.0
LDC2020T07 Abstract Meaning Representation 2.0 - Four Translations
LDC2020L02 Chinese Lexical Resources for Gender, Number, Animacy
LDC2019T03 DEFT Chinese Committed Belief Annotation
LDC2020T19 DEFT Chinese Light and Rich ERE Annotation
LDC2019T16 DEFT English Committed Belief Annotation
LDC2023T04 DEFT English Light and Rich ERE Annotation
LDC2016T07 DEFT Narrative Text
LDC2019T09 DEFT Spanish Committed Belief Annotation
LDC2018T01 DEFT Spanish Treebank
LDC2016T23 Richer Event Description
LDC2023T13 TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017
LDC2017T09 The EventStatus Corpus

DIRHA

LDC2018S01 DIRHA English WSJ Audio

DOE/IRS2008-0256

LDC2023L01 Moroccan Arabic - English Lexical Database

EARS

LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
LDC97S44 1996 English Broadcast News Speech (HUB4)
LDC97T22 1996 English Broadcast News Transcripts (HUB4)
LDC98S71 1997 English Broadcast News Speech (HUB4)
LDC98T28 1997 English Broadcast News Transcripts (HUB4)
LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material
LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
LDC2002S22 1997 HUB5 Arabic Evaluation
LDC2002T39 1997 HUB5 Arabic Transcripts
LDC2002S24 1997 HUB5 German Evaluation
LDC2003T03 1997 HUB5 German Transcripts
LDC2002S25 1997 HUB5 Spanish Evaluation
LDC2003T04 1997 HUB5 Spanish Transcripts
LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
LDC2002S10 1998 HUB5 English Evaluation
LDC2003T02 1998 HUB5 English Transcripts
LDC2002S13 2001 HUB5 English Evaluation
LDC2002S12 2001 HUB5 Mandarin Evaluation
LDC2003T01 2001 HUB5 Mandarin Transcripts
LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
LDC99L23 American English Spoken Lexicon
LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
LDC2005T03 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
LDC2003T12 Arabic Gigaword
LDC2001T55 Arabic Newswire Part 1
LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
LDC96S46 CALLFRIEND American English-Non-Southern Dialect
LDC2019S21 CALLFRIEND American English-Non-Southern Dialect Second Edition
LDC96S47 CALLFRIEND American English-Southern Dialect
LDC2020S08 CALLFRIEND American English-Southern Dialect Second Edition
LDC2019S18 CALLFRIEND Canadian French Second Edition
LDC96S49 CALLFRIEND Egyptian Arabic
LDC2019S04 CALLFRIEND Egyptian Arabic Second Edition
LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
LDC2018S09 CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition
LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect
LDC2020S06 CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition
LDC97L20 CALLHOME American English Lexicon (PRONLEX)
LDC97S42 CALLHOME American English Speech
LDC97T14 CALLHOME American English Transcripts
LDC97S45 CALLHOME Egyptian Arabic Speech
LDC2002S37 CALLHOME Egyptian Arabic Speech Supplement
LDC97T19 CALLHOME Egyptian Arabic Transcripts
LDC2002T38 CALLHOME Egyptian Arabic Transcripts Supplement
LDC96L15 CALLHOME Mandarin Chinese Lexicon
LDC96S34 CALLHOME Mandarin Chinese Speech
LDC96T16 CALLHOME Mandarin Chinese Transcripts
LDC2003T09 Chinese Gigaword
LDC2005T14 Chinese Gigaword Second Edition
LDC2005T08 Discourse Graphbank
LDC99L22 Egyptian Colloquial Arabic Lexicon
LDC2003T05 English Gigaword
LDC2005T12 English Gigaword Second Edition
LDC2005S13 Fisher English Training Part 2, Speech
LDC2005T19 Fisher English Training Part 2, Transcripts
LDC2004S13 Fisher English Training Speech Part 1 Speech
LDC2004T19 Fisher English Training Speech Part 1 Transcripts
LDC2005S15 HKUST Mandarin Telephone Speech, Part 1
LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
LDC2018S18 HUB5 Mandarin Telephone Speech and Transcripts Second Edition
LDC98S69 HUB5 Mandarin Telephone Speech Corpus
LDC98T26 HUB5 Mandarin Transcripts
LDC2005S14 Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
LDC2006S29 Levantine Arabic QT Training Data Set 5, Speech
LDC2006T07 Levantine Arabic QT Training Data Set 5, Transcripts
LDC95T13 Mandarin Chinese News Text
LDC95T21 North American News Text Corpus
LDC98T30 North American News Text Supplement
LDC2004S08 RT-03 MDE Training Data Speech
LDC2004T12 RT-03 MDE Training Data Text and Annotations
LDC2005S16 RT-04 MDE Training Data Speech
LDC2005T24 RT-04 MDE Training Data Text/Annotations
LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
LDC2006T12 Spanish Gigaword First Edition
LDC2009T21 Spanish Gigaword Second Edition
LDC2001S13 Switchboard Cellular Part 1 Audio
LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
LDC2001T14 Switchboard Cellular Part 1 Transcription
LDC2004S07 Switchboard Cellular Part 2 Audio
LDC97S62 Switchboard-1 Release 2
LDC98S75 Switchboard-2 Phase I
LDC99S79 Switchboard-2 Phase II
LDC2002S06 Switchboard-2 Phase III Audio
LDC98S72 Taiwanese Putonghua Speech and Transcripts
LDC98T25 TDT Pilot Study Corpus
LDC2000S92 TDT2 Careful Transcription Audio
LDC2000T44 TDT2 Careful Transcription Text
LDC99S84 TDT2 English Audio
LDC2001S93 TDT2 Mandarin Audio Corpus
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001S94 TDT3 English Audio
LDC2001S95 TDT3 Mandarin Audio
LDC2001T58 TDT3 Multilanguage Text Version 2.0

GALE

LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
LDC97S44 1996 English Broadcast News Speech (HUB4)
LDC97T22 1996 English Broadcast News Transcripts (HUB4)
LDC98S71 1997 English Broadcast News Speech (HUB4)
LDC98T28 1997 English Broadcast News Transcripts (HUB4)
LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material
LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
LDC2002S22 1997 HUB5 Arabic Evaluation
LDC2002T39 1997 HUB5 Arabic Transcripts
LDC2002S24 1997 HUB5 German Evaluation
LDC2003T03 1997 HUB5 German Transcripts
LDC2002S25 1997 HUB5 Spanish Evaluation
LDC2003T04 1997 HUB5 Spanish Transcripts
LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
LDC2002S10 1998 HUB5 English Evaluation
LDC2003T02 1998 HUB5 English Transcripts
LDC2002S13 2001 HUB5 English Evaluation
LDC2002S12 2001 HUB5 Mandarin Evaluation
LDC2003T01 2001 HUB5 Mandarin Transcripts
LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
LDC2009T05 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data
LDC2011T05 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
LDC2017T10 Abstract Meaning Representation (AMR) Annotation Release 2.0
LDC2020T02 Abstract Meaning Representation (AMR) Annotation Release 3.0
LDC2005T09 ACE 2004 Multilingual Training Corpus
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
LDC2003T11 ACE-2 Version 1.0
LDC93T1 ACL/DCI
LDC99L23 American English Spoken Lexicon
LDC2012T21 Annotated English Gigaword
LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
LDC2005T03 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
LDC2004T18 Arabic English Parallel News Part 1
LDC2003T12 Arabic Gigaword
LDC2011T11 Arabic Gigaword Fifth Edition
LDC2009T30 Arabic Gigaword Fourth Edition
LDC2007T40 Arabic Gigaword Third Edition
LDC2004T17 Arabic News Translation Text Part 1
LDC2001T55 Arabic Newswire Part 1
LDC2012T07 Arabic Treebank - Broadcast News v1.0
LDC2016T02 Arabic Treebank - Weblog
LDC2003T07 Arabic Treebank: Part 1 - 10K-word English Translation
LDC2003T06 Arabic Treebank: Part 1 v 2.0
LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
LDC2010T13 Arabic Treebank: Part 1 v 4.1
LDC2004T02 Arabic Treebank: Part 2 v 2.0
LDC2011T09 Arabic Treebank: Part 2 v 3.1
LDC2005T20 Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
LDC2004T11 Arabic Treebank: Part 3 v 1.0
LDC2012T09 Arabic-Dialect/English Parallel Text
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1
LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0
LDC96S46 CALLFRIEND American English-Non-Southern Dialect
LDC2019S21 CALLFRIEND American English-Non-Southern Dialect Second Edition
LDC96S47 CALLFRIEND American English-Southern Dialect
LDC2020S08 CALLFRIEND American English-Southern Dialect Second Edition
LDC2019S18 CALLFRIEND Canadian French Second Edition
LDC96S49 CALLFRIEND Egyptian Arabic
LDC2019S04 CALLFRIEND Egyptian Arabic Second Edition
LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
LDC2018S09 CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition
LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect
LDC2020S06 CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition
LDC97L20 CALLHOME American English Lexicon (PRONLEX)
LDC97S42 CALLHOME American English Speech
LDC97T14 CALLHOME American English Transcripts
LDC97S45 CALLHOME Egyptian Arabic Speech
LDC2002S37 CALLHOME Egyptian Arabic Speech Supplement
LDC97T19 CALLHOME Egyptian Arabic Transcripts
LDC2002T38 CALLHOME Egyptian Arabic Transcripts Supplement
LDC96L15 CALLHOME Mandarin Chinese Lexicon
LDC96S34 CALLHOME Mandarin Chinese Speech
LDC96T16 CALLHOME Mandarin Chinese Transcripts
LDC2005T13 CCGbank
LDC96L14 CELEX2
LDC2005T10 Chinese English News Magazine Parallel Text
LDC2003T09 Chinese Gigaword
LDC2011T13 Chinese Gigaword Fifth Edition
LDC2009T27 Chinese Gigaword Fourth Edition
LDC2005T14 Chinese Gigaword Second Edition
LDC2007T38 Chinese Gigaword Third Edition
LDC2005T06 Chinese News Translation Text Part 1
LDC2005T23 Chinese Proposition Bank 1.0
LDC2001T11 Chinese Treebank 2.0
LDC2004T05 Chinese Treebank 4.0
LDC2005T01 Chinese Treebank 5.0
LDC2007T36 Chinese Treebank 6.0
LDC2010T07 Chinese Treebank 7.0
LDC2013T21 Chinese Treebank 8.0
LDC2016T13 Chinese Treebank 9.0
LDC2002L27 Chinese-English Translation Lexicon Version 3.0
LDC2018T20 Concretely Annotated English Gigaword
LDC2005T08 Discourse Graphbank
LDC99L22 Egyptian Colloquial Arabic Lexicon
LDC2009T01 English CTS Treebank with Structural Metadata
LDC2003T05 English Gigaword
LDC2011T07 English Gigaword Fifth Edition
LDC2009T13 English Gigaword Fourth Edition
LDC2005T12 English Gigaword Second Edition
LDC2007T07 English Gigaword Third Edition
LDC2012T02 English Translation Treebank: An-Nahar Newswire
LDC2012T13 English Web Treebank
LDC2006T10 English-Arabic Treebank v 1.0
LDC95T11 European Language Newspaper Text
LDC2005S13 Fisher English Training Part 2, Speech
LDC2005T19 Fisher English Training Part 2, Transcripts
LDC2004S13 Fisher English Training Speech Part 1 Speech
LDC2004T19 Fisher English Training Speech Part 1 Transcripts
LDC2007S02 Fisher Levantine Arabic Conversational Telephone Speech
LDC2007T04 Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
LDC2013T14 GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1
LDC2014T03 GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2
LDC2013T10 GALE Arabic-English Parallel Aligned Treebank -- Newswire
LDC2014T08 GALE Arabic-English Parallel Aligned Treebank -- Web Training
LDC2014T19 GALE Arabic-English Word Alignment -- Broadcast Training Part 1
LDC2014T22 GALE Arabic-English Word Alignment -- Broadcast Training Part 2
LDC2014T05 GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web
LDC2014T10 GALE Arabic-English Word Alignment Training Part 2 -- Newswire
LDC2014T14 GALE Arabic-English Word Alignment Training Part 3 -- Web
LDC2015T06 GALE Chinese-English Parallel Aligned Treebank -- Training
LDC2013T23 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1
LDC2014T25 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2
LDC2015T04 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3
LDC2015T18 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4
LDC2012T16 GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
LDC2012T20 GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire
LDC2012T24 GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web
LDC2013T05 GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web
LDC2017T06 GALE English-Chinese Parallel Aligned Treebank -- Training
LDC2008T02 GALE Phase 1 Arabic Blog Parallel Text
LDC2007T24 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
LDC2008T09 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
LDC2009T09 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
LDC2008T06 GALE Phase 1 Chinese Blog Parallel Text
LDC2009T02 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
LDC2009T06 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
LDC2007T23 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
LDC2008T08 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
LDC2008T18 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
LDC2009T15 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
LDC2010T03 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
LDC2007T20 GALE Phase 1 Distillation Training
LDC2012T06 GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1
LDC2012T14 GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2
LDC2013S02 GALE Phase 2 Arabic Broadcast Conversation Speech Part 1
LDC2013S07 GALE Phase 2 Arabic Broadcast Conversation Speech Part 2
LDC2013T04 GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1
LDC2013T17 GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2
LDC2012T18 GALE Phase 2 Arabic Broadcast News Parallel Text
LDC2014S07 GALE Phase 2 Arabic Broadcast News Speech Part 1
LDC2015S01 GALE Phase 2 Arabic Broadcast News Speech Part 2
LDC2014T17 GALE Phase 2 Arabic Broadcast News Transcripts Part 1
LDC2015T01 GALE Phase 2 Arabic Broadcast News Transcripts Part 2
LDC2012T17 GALE Phase 2 Arabic Newswire Parallel Text
LDC2013T01 GALE Phase 2 Arabic Web Parallel Text
LDC2013T11 GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1
LDC2013T16 GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2
LDC2013S04 GALE Phase 2 Chinese Broadcast Conversation Speech
LDC2013T08 GALE Phase 2 Chinese Broadcast Conversation Transcripts
LDC2014T04 GALE Phase 2 Chinese Broadcast News Parallel Text Part 1
LDC2014T11 GALE Phase 2 Chinese Broadcast News Parallel Text Part 2
LDC2013S08 GALE Phase 2 Chinese Broadcast News Speech
LDC2013T20 GALE Phase 2 Chinese Broadcast News Transcripts
LDC2014T15 GALE Phase 2 Chinese Newswire Parallel Text Part 1
LDC2014T20 GALE Phase 2 Chinese Newswire Parallel Text Part 2
LDC2014T26 GALE Phase 2 Chinese Web Parallel Text
LDC2015T05 GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text
LDC2015T07 GALE Phase 3 and 4 Arabic Broadcast News Parallel Text
LDC2015T19 GALE Phase 3 and 4 Arabic Newswire Parallel Text
LDC2016T08 GALE Phase 3 and 4 Arabic Web Parallel Text
LDC2016T09 GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
LDC2016T15 GALE Phase 3 and 4 Chinese Broadcast News Parallel Text
LDC2016T25 GALE Phase 3 and 4 Chinese Newswire Parallel Text
LDC2017T02 GALE Phase 3 and 4 Chinese Web Parallel Text
LDC2015S11 GALE Phase 3 Arabic Broadcast Conversation Speech Part 1
LDC2016S01 GALE Phase 3 Arabic Broadcast Conversation Speech Part 2
LDC2015T16 GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1
LDC2016T06 GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2
LDC2016S07 GALE Phase 3 Arabic Broadcast News Speech Part 1
LDC2017S02 GALE Phase 3 Arabic Broadcast News Speech Part 2
LDC2016T17 GALE Phase 3 Arabic Broadcast News Transcripts Part 1
LDC2017T04 GALE Phase 3 Arabic Broadcast News Transcripts Part 2
LDC2014S09 GALE Phase 3 Chinese Broadcast Conversation Speech Part 1
LDC2015S06 GALE Phase 3 Chinese Broadcast Conversation Speech Part 2
LDC2014T28 GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1
LDC2015T09 GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2
LDC2015S13 GALE Phase 3 Chinese Broadcast News Speech
LDC2015T25 GALE Phase 3 Chinese Broadcast News Transcripts
LDC2016T11 GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences
LDC2017S15 GALE Phase 4 Arabic Broadcast Conversation Speech
LDC2017T12 GALE Phase 4 Arabic Broadcast Conversation Transcripts
LDC2016T20 GALE Phase 4 Arabic Broadcast News Parallel Sentences
LDC2018S05 GALE Phase 4 Arabic Broadcast News Speech
LDC2018T14 GALE Phase 4 Arabic Broadcast News Transcripts
LDC2016T27 GALE Phase 4 Arabic Newswire Parallel Sentences
LDC2016T14 GALE Phase 4 Arabic Weblog Parallel Sentences
LDC2015T14 GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences
LDC2016S03 GALE Phase 4 Chinese Broadcast Conversation Speech
LDC2016T12 GALE Phase 4 Chinese Broadcast Conversation Transcripts
LDC2015T21 GALE Phase 4 Chinese Broadcast News Parallel Sentences
LDC2017S25 GALE Phase 4 Chinese Broadcast News Speech
LDC2017T18 GALE Phase 4 Chinese Broadcast News Transcripts
LDC2015T24 GALE Phase 4 Chinese Newswire Parallel Sentences
LDC2016T04 GALE Phase 4 Chinese Weblog Parallel Sentences
LDC2005S15 HKUST Mandarin Telephone Speech, Part 1
LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
LDC2000T50 Hong Kong Hansards Parallel Text
LDC2000T47 Hong Kong Laws Parallel Text
LDC2000T46 Hong Kong News Parallel Text
LDC2004T08 Hong Kong Parallel Text
LDC2018S18 HUB5 Mandarin Telephone Speech and Transcripts Second Edition
LDC98S69 HUB5 Mandarin Telephone Speech Corpus
LDC98T26 HUB5 Mandarin Transcripts
LDC95T8 Japanese Business News Text
LDC99T34 Japanese Business News Text Supplement
LDC2000T45 Korean Newswire
LDC2005S14 Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
LDC95T13 Mandarin Chinese News Text
LDC2001T02 Message Understanding Conference (MUC) 7
LDC2003T18 Multiple-Translation Arabic (MTA) Part 1
LDC2005T05 Multiple-Translation Arabic (MTA) Part 2
LDC2003T17 Multiple-Translation Chinese (MTC) Part 2
LDC2004T07 Multiple-Translation Chinese (MTC) Part 3
LDC2002T01 Multiple-Translation Chinese Corpus
LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
LDC2010T01 NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations
LDC95T21 North American News Text Corpus
LDC98T30 North American News Text Supplement
LDC2007T21 OntoNotes Release 1.0
LDC2008T04 OntoNotes Release 2.0
LDC2009T24 OntoNotes Release 3.0
LDC2011T03 OntoNotes Release 4.0
LDC2013T19 OntoNotes Release 5.0
LDC2004T23 Prague Arabic Dependency Treebank 1.0
LDC2004T14 Proposition Bank I
LDC2004S08 RT-03 MDE Training Data Speech
LDC2004T12 RT-03 MDE Training Data Text and Annotations
LDC2005S16 RT-04 MDE Training Data Speech
LDC2005T24 RT-04 MDE Training Data Text/Annotations
LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
LDC2013T18 Semantic Textual Similarity (STS) 2013 Machine Translation
LDC2006T12 Spanish Gigaword First Edition
LDC2009T21 Spanish Gigaword Second Edition
LDC95T9 Spanish News Text
LDC99T41 Spanish Newswire Text, Volume 2
LDC2001S13 Switchboard Cellular Part 1 Audio
LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
LDC2001T14 Switchboard Cellular Part 1 Transcription
LDC2004S07 Switchboard Cellular Part 2 Audio
LDC97S62 Switchboard-1 Release 2
LDC98S75 Switchboard-2 Phase I
LDC99S79 Switchboard-2 Phase II
LDC2002S06 Switchboard-2 Phase III Audio
LDC98S72 Taiwanese Putonghua Speech and Transcripts
LDC98T25 TDT Pilot Study Corpus
LDC2000S92 TDT2 Careful Transcription Audio
LDC2000T44 TDT2 Careful Transcription Text
LDC99S84 TDT2 English Audio
LDC2001S93 TDT2 Mandarin Audio Corpus
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001S94 TDT3 English Audio
LDC2001S95 TDT3 Mandarin Audio
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
LDC2005T16 TDT4 Multilingual Text and Annotations
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
LDC93T3A TIPSTER Complete
LDC2018T13 TRAD Arabic-French Parallel Text -- Newsgroup
LDC2018T21 TRAD Arabic-French Parallel Text -- Newswire
LDC2018T02 TRAD Chinese-French Parallel Text -- Blog
LDC2018T17 TRAD Chinese-French Parallel Text -- Broadcast News
LDC2000T52 TREC Mandarin
LDC2000T51 TREC Spanish
LDC99T42 Treebank-3
LDC94T4B-1 UN Parallel Text (English)
LDC94T4B-3 UN Parallel Text (Spanish)

GENOA

LDC2004S05 ISL Meeting Speech Part 1
LDC2004T10 ISL Meeting Transcripts Part 1

HAVIC

LDC2018V01 HAVIC MED Event E051-E060 -- Videos, Metadata and Annotation
LDC2022V01 HAVIC MED Novel 1 Test -- Videos, Metadata and Annotation
LDC2022V02 HAVIC MED Novel 2 Test -- Videos, Metadata and Annotation
LDC2019V01 HAVIC MED Progress Test -- Videos, Metadata and Annotation
LDC2021V01 HAVIC MED Training Data -- Videos, Metadata and Annotation
LDC2016V01 HAVIC Pilot Transcription

Hub4

LDC98T31 1996 CSR HUB4 Language Model
LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
LDC97S44 1996 English Broadcast News Speech (HUB4)
LDC97T22 1996 English Broadcast News Transcripts (HUB4)
LDC98S71 1997 English Broadcast News Speech (HUB4)
LDC98T28 1997 English Broadcast News Transcripts (HUB4)
LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material
LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
LDC98S74 1997 Spanish Broadcast News Speech (HUB4-NE)
LDC98T29 1997 Spanish Broadcast News Transcripts (HUB4-NE)
LDC2000S86 1998 HUB4 Broadcast News Evaluation English Test Material
LDC2015S05 Mandarin Chinese Phonetic Segmentation and Tone
LDC95T21 North American News Text Corpus
LDC98T30 North American News Text Supplement

Hub5-LVCSR

LDC2002S22 1997 HUB5 Arabic Evaluation
LDC2002T39 1997 HUB5 Arabic Transcripts
LDC2002S23 1997 HUB5 English Evaluation
LDC2002S24 1997 HUB5 German Evaluation
LDC2003T03 1997 HUB5 German Transcripts
LDC2002S25 1997 HUB5 Spanish Evaluation
LDC2003T04 1997 HUB5 Spanish Transcripts
LDC2002S10 1998 HUB5 English Evaluation
LDC2003T02 1998 HUB5 English Transcripts
LDC2002S09 2000 HUB5 English Evaluation Speech
LDC2002T43 2000 HUB5 English Evaluation Transcripts
LDC2002S13 2001 HUB5 English Evaluation
LDC2002S12 2001 HUB5 Mandarin Evaluation
LDC2003T01 2001 HUB5 Mandarin Transcripts
LDC97L20 CALLHOME American English Lexicon (PRONLEX)
LDC97S42 CALLHOME American English Speech
LDC97T14 CALLHOME American English Transcripts
LDC97S45 CALLHOME Egyptian Arabic Speech
LDC2002S37 CALLHOME Egyptian Arabic Speech Supplement
LDC97T19 CALLHOME Egyptian Arabic Transcripts
LDC2002T38 CALLHOME Egyptian Arabic Transcripts Supplement
LDC97L18 CALLHOME German Lexicon
LDC97S43 CALLHOME German Speech
LDC97T15 CALLHOME German Transcripts
LDC96L17 CALLHOME Japanese Lexicon
LDC96S37 CALLHOME Japanese Speech
LDC96T18 CALLHOME Japanese Transcripts
LDC96L15 CALLHOME Mandarin Chinese Lexicon
LDC96S34 CALLHOME Mandarin Chinese Speech
LDC96T16 CALLHOME Mandarin Chinese Transcripts
LDC96L16 CALLHOME Spanish Lexicon
LDC96S35 CALLHOME Spanish Speech
LDC96T17 CALLHOME Spanish Transcripts
LDC99L22 Egyptian Colloquial Arabic Lexicon
LDC2018S18 HUB5 Mandarin Telephone Speech and Transcripts Second Edition
LDC98S69 HUB5 Mandarin Telephone Speech Corpus
LDC98T26 HUB5 Mandarin Transcripts
LDC98S70 HUB5 Spanish Telephone Speech Corpus
LDC98T27 HUB5 Spanish Transcripts
LDC97S62 Switchboard-1 Release 2
LDC2001T60 Syllable-Final /s/ Lenition

JANUS

LDC2004S05 ISL Meeting Speech Part 1
LDC2004T10 ISL Meeting Transcripts Part 1

LID

LDC96S46 CALLFRIEND American English-Non-Southern Dialect
LDC2019S21 CALLFRIEND American English-Non-Southern Dialect Second Edition
LDC96S47 CALLFRIEND American English-Southern Dialect
LDC2020S08 CALLFRIEND American English-Southern Dialect Second Edition
LDC96S48 CALLFRIEND Canadian French
LDC2019S18 CALLFRIEND Canadian French Second Edition
LDC96S49 CALLFRIEND Egyptian Arabic
LDC2019S04 CALLFRIEND Egyptian Arabic Second Edition
LDC96S50 CALLFRIEND Farsi
LDC2014S01 CALLFRIEND Farsi Second Edition Speech
LDC2014T01 CALLFRIEND Farsi Second Edition Transcripts
LDC96S51 CALLFRIEND German
LDC96S52 CALLFRIEND Hindi
LDC96S53 CALLFRIEND Japanese
LDC96S54 CALLFRIEND Korean
LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
LDC2018S09 CALLFRIEND Mandarin Chinese-Mainland Dialect Second Edition
LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect
LDC2020S06 CALLFRIEND Mandarin Chinese-Taiwan Dialect Second Edition
LDC2023S08 CALLFRIEND Russian Speech
LDC2023T09 CALLFRIEND Russian Text
LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect
LDC96S59 CALLFRIEND Tamil
LDC96S60 CALLFRIEND Vietnamese

Linguistic Atlas Project

LDC2012S03 Digital Archive of Southern Speech
LDC2016S05 Digital Archive of Southern Speech - NLP Version

LORELEI

LDC2020T02 Abstract Meaning Representation (AMR) Annotation Release 3.0
LDC2023T10 AIDA Scenario 1 and 2 Reference Knowledge Base
LDC2023S01 AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
LDC2024T03 LoReHLT Hausa Representative Language Pack
LDC2021T02 LORELEI Akan Representative Language Pack
LDC2018T04 LORELEI Amharic Representative Language Pack - Monolingual and Parallel Text
LDC2022T05 LORELEI Bengali Representative Language Pack
LDC2020T10 LORELEI Entity Detection and Linking Knowledge Base
LDC2024T01 LORELEI Farsi Representative Language Pack
LDC2023T07 LORELEI Indonesian Representative Language Pack
LDC2022T01 LORELEI Kinyarwanda Incident Language Pack
LDC2020T11 LORELEI Oromo Incident Language Pack
LDC2018T11 LORELEI Somali Representative Language Pack - Monolingual and Parallel Text
LDC2023T01 LORELEI Swahili Representative Language Pack
LDC2023T02 LORELEI Tagalog Representative Language Pack
LDC2023T03 LORELEI Tamil Representative Language Pack
LDC2023T08 LORELEI Thai Representative Language Pack
LDC2020T22 LORELEI Tigrinya Incident Language Pack
LDC2020T24 LORELEI Ukrainian Representative Language Pack
LDC2024T07 LORELEI Uyghur Incident Language Pack
LDC2020T17 LORELEI Vietnamese Representative Language Pack
LDC2022T03 LORELEI Wolof Representative Language Pack
LDC2024T10 LORELEI Yoruba Representative Language Pack
LDC2023T06 LORELEI Zulu Representative Language Pack

Machine Reading

LDC2020T04 Machine Reading Phase 1 IC Training Data
LDC2019T14 Machine Reading Phase 1 NFL Scoring Training Data

MADCAT

LDC2014T13 MADCAT Chinese Pilot Training Set
LDC2012T15 MADCAT Phase 1 Training Set
LDC2013T09 MADCAT Phase 2 Training Set
LDC2013T15 MADCAT Phase 3 Training Set

MALACH

LDC2014S04 USC-SFI MALACH Interviews and Transcripts Czech
LDC2012S05 USC-SFI MALACH Interviews and Transcripts English
LDC2019S11 USC-SFI MALACH Interviews and Transcripts English – Speech Recognition Edition

MIXER

LDC2019S09 First DIHARD Challenge Development - Eight Sources
LDC2019S12 First DIHARD Challenge Evaluation - Nine Sources
LDC2023S02 Mixer 3 Speech
LDC2020S03 Mixer 4 and 5 Speech
LDC2013S03 Mixer 6 Speech
LDC2023S04 Mixer 7 Spanish Speech
LDC2023S09 REMIX Telephone Collection
LDC2022S06 Second DIHARD Challenge Evaluation - Eleven Sources
LDC2022S12 Third DIHARD Challenge Development
LDC2022S14 Third DIHARD Challenge Evaluation

MT08

LDC2010T01 NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations

MUC

LDC2003T13 Message Understanding Conference (MUC) 6
LDC96T10 Message Understanding Conference (MUC) 6 Additional News Text
LDC2001T02 Message Understanding Conference (MUC) 7
LDC2010T15 Message Understanding Conference 7 Timed (MUC7_T)
LDC95T21 North American News Text Corpus
LDC93T3A TIPSTER Complete
LDC93T3B TIPSTER Volume 1
LDC93T3C TIPSTER Volume 2
LDC93T3D TIPSTER Volume 3

NIEUW

LDC2022S09 Xi'an Guanzhong Object Naming

NIST Automatic Meeting Recognition

LDC2004S09 NIST Meeting Pilot Corpus Speech
LDC2004T13 NIST Meeting Pilot Corpus Transcripts and Metadata

NIST LRE

LDC2006S31 2003 NIST Language Recognition Evaluation
LDC2008S05 2005 NIST Language Recognition Evaluation
LDC2009S05 2007 NIST Language Recognition Evaluation Supplemental Training Set
LDC2009S04 2007 NIST Language Recognition Evaluation Test Set
LDC2014S06 2009 NIST Language Recognition Evaluation Test Set
LDC2018S06 2011 NIST Language Recognition Evaluation Test Set
LDC2022S10 2017 NIST Language Recognition Evaluation Training and Development Sets
LDC2023S01 AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
LDC2023S02 Mixer 3 Speech
LDC2019S02 Multi-Language Conversational Telephone Speech 2011 -- Arabic Group
LDC2018S03 Multi-Language Conversational Telephone Speech 2011 -- Central Asian
LDC2018S08 Multi-Language Conversational Telephone Speech 2011 -- Central European
LDC2019S15 Multi-Language Conversational Telephone Speech 2011 -- East Asian
LDC2019S06 Multi-Language Conversational Telephone Speech 2011 -- English Group
LDC2020S05 Multi-Language Conversational Telephone Speech 2011 -- Mandarin Chinese
LDC2016S11 Multi-Language Conversational Telephone Speech 2011 -- Slavic Group
LDC2017S14 Multi-Language Conversational Telephone Speech 2011 -- South Asian
LDC2018S12 Multi-Language Conversational Telephone Speech 2011 -- Spanish
LDC2017S09 Multi-Language Conversational Telephone Speech 2011 -- Turkish

NIST MT

LDC2009T05 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data
LDC2014T09 HyTER Networks of Selected OpenMT08/09 Sentences
LDC2010T10 NIST 2002 Open Machine Translation (OpenMT) Evaluation
LDC2010T11 NIST 2003 Open Machine Translation (OpenMT) Evaluation
LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation
LDC2010T14 NIST 2005 Open Machine Translation (OpenMT) Evaluation
LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
LDC2013T07 NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets
LDC2010T23 NIST 2009 Open Machine Translation (OpenMT) Evaluation
LDC2013T03 NIST 2012 Open Machine Translation (OpenMT) Evaluation
LDC2014T02 NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source
LDC2013T18 Semantic Textual Similarity (STS) 2013 Machine Translation

NIST OpenSAT

LDC2022S01 2017 NIST OpenSAT Pilot - SSSF
LDC2023S06 2019 OpenSAT Public Safety Communications Simulation

NIST Public Safety

LDC2023S06 2019 OpenSAT Public Safety Communications Simulation

NIST SRE

LDC96S61 1996 Speaker Recognition Benchmark
LDC99S80 1997 Speaker Recognition Benchmark
LDC98S76 1998 Speaker Recognition Benchmark
LDC99S81 1999 Speaker Recognition Benchmark
LDC2001S97 2000 NIST Speaker Recognition Evaluation
LDC2002S34 2001 NIST Speaker Recognition Evaluation Corpus
LDC2004S04 2002 NIST Speaker Recognition Evaluation
LDC2010S03 2003 NIST Speaker Recognition Evaluation
LDC2006S44 2004 NIST Speaker Recognition Evaluation
LDC2011S04 2005 NIST Speaker Recognition Evaluation Test Data
LDC2011S01 2005 NIST Speaker Recognition Evaluation Training Data
LDC2011S10 2006 NIST Speaker Recognition Evaluation Test Set Part 1
LDC2012S01 2006 NIST Speaker Recognition Evaluation Test Set Part 2
LDC2011S09 2006 NIST Speaker Recognition Evaluation Training Set
LDC2011S11 2008 NIST Speaker Recognition Evaluation Supplemental Set
LDC2011S08 2008 NIST Speaker Recognition Evaluation Test Set
LDC2011S05 2008 NIST Speaker Recognition Evaluation Training Set Part 1
LDC2011S07 2008 NIST Speaker Recognition Evaluation Training Set Part 2
LDC2017S06 2010 NIST Speaker Recognition Evaluation Test Set
LDC2019S20 2016 NIST Speaker Recognition Evaluation Test Set
LDC2020S04 2018 NIST Speaker Recognition Evaluation Test Set
LDC2023V01 2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual
LDC2023S03 2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge
LDC2024S05 Call My Net 1
LDC2019S09 First DIHARD Challenge Development - Eight Sources
LDC2019S12 First DIHARD Challenge Evaluation - Nine Sources
LDC2013S05 Greybeard
LDC2024S01 KASET - Kurmanji and Sorani Kurdish Speech and Transcripts
LDC2023S02 Mixer 3 Speech
LDC2020S03 Mixer 4 and 5 Speech
LDC2013S03 Mixer 6 Speech
LDC2023S04 Mixer 7 Spanish Speech
LDC2009T26 NXT Switchboard Annotations
LDC2023S09 REMIX Telephone Collection
LDC2022S06 Second DIHARD Challenge Evaluation - Eleven Sources
LDC2001S13 Switchboard Cellular Part 1 Audio
LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
LDC2001T14 Switchboard Cellular Part 1 Transcription
LDC2004S07 Switchboard Cellular Part 2 Audio
LDC93S8 Switchboard Credit Card
LDC97S62 Switchboard-1 Release 2
LDC98S75 Switchboard-2 Phase I
LDC99S79 Switchboard-2 Phase II
LDC2002S06 Switchboard-2 Phase III Audio

OpenHaRT

LDC2012T15 MADCAT Phase 1 Training Set
LDC2013T09 MADCAT Phase 2 Training Set
LDC2013T15 MADCAT Phase 3 Training Set

PEA-TRAD

LDC2018T13 TRAD Arabic-French Parallel Text -- Newsgroup
LDC2018T21 TRAD Arabic-French Parallel Text -- Newswire
LDC2018T02 TRAD Chinese-French Parallel Text -- Blog
LDC2018T17 TRAD Chinese-French Parallel Text -- Broadcast News

RATS

LDC2017S20 RATS Keyword Spotting
LDC2018S10 RATS Language Identification
LDC2024S03 RATS Low Speech Density
LDC2021S08 RATS Speaker Identification
LDC2015S02 RATS Speech Activity Detection

REFLEX-MTE

LDC2009T11 REFLEX Entity Translation Training/DevTest

RM

LDC96S39 RM Isolated and Spelled Word Data

ROAR

LDC2019S09 First DIHARD Challenge Development - Eight Sources
LDC2019S12 First DIHARD Challenge Evaluation - Nine Sources
LDC2004S05 ISL Meeting Speech Part 1
LDC2004T10 ISL Meeting Transcripts Part 1
LDC2022S06 Second DIHARD Challenge Evaluation - Eleven Sources
LDC2022S14 Third DIHARD Challenge Evaluation

RT

LDC2007S12 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data
LDC2007S11 2004 Spring NIST Rich Transcription (RT-04S) Development Data
LDC2011S06 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set
LDC2019S09 First DIHARD Challenge Development - Eight Sources
LDC2019S12 First DIHARD Challenge Evaluation - Nine Sources
LDC2022S06 Second DIHARD Challenge Evaluation - Eleven Sources
LDC2022S12 Third DIHARD Challenge Development

SemEval

LDC2016T10 SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing
LDC2011T01 SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages

SID

LDC2001S13 Switchboard Cellular Part 1 Audio
LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
LDC2001T14 Switchboard Cellular Part 1 Transcription
LDC2004S07 Switchboard Cellular Part 2 Audio
LDC98S75 Switchboard-2 Phase I
LDC99S79 Switchboard-2 Phase II
LDC2002S06 Switchboard-2 Phase III Audio

SPINE

LDC2000S96 Speech in Noisy Environments (SPINE) Evaluation Audio
LDC2000T54 Speech in Noisy Environments (SPINE) Evaluation Transcripts
LDC2000S87 Speech in Noisy Environments (SPINE) Training Audio
LDC2000T49 Speech in Noisy Environments (SPINE) Training Transcripts
LDC2001S04 Speech in Noisy Environments (SPINE2) Part 1 Audio
LDC2001T05 Speech in Noisy Environments (SPINE2) Part 1 Transcripts
LDC2001S06 Speech in Noisy Environments (SPINE2) Part 2 Audio
LDC2001T07 Speech in Noisy Environments (SPINE2) Part 2 Transcripts
LDC2001S08 Speech in Noisy Environments (SPINE2) Part 3 Audio
LDC2001T09 Speech in Noisy Environments (SPINE2) Part 3 Transcripts
LDC2001S99 Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio

TAC

LDC2024T09 MultiTACRED
LDC2023T13 TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data 2016-2017
LDC2017T17 TAC KBP Chinese Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2011-2014
LDC2019T08 TAC KBP Chinese Regular Slot Filling - Comprehensive Training and Evaluation Data 2014
LDC2019T17 TAC KBP Cold Start - Comprehensive Evaluation Data 2012-2017
LDC2018T03 TAC KBP Comprehensive English Source Corpora 2009-2014
LDC2018T16 TAC KBP English Entity Linking - Comprehensive Training and Evaluation Data 2009-2013
LDC2020T03 TAC KBP English Event Argument - Training and Evaluation Data 2014-2015
LDC2020T13 TAC KBP English Event Nugget Detection and Coreference - Comprehensive Training and Evaluation Data 2014-2015
LDC2018T22 TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014
LDC2021T08 TAC KBP English Sentiment Slot Filling -- Comprehensive Training and Evaluation Data 2013-2014
LDC2021T06 TAC KBP English Surprise Slot Filling -- Comprehensive Training and Evaluation Data 2010
LDC2020T08 TAC KBP English Temporal Slot Filling - Comprehensive Training and Evaluation Data 2011 and 2013
LDC2019T19 TAC KBP Entity Discovery and Linking - Comprehensive Evaluation Data 2016-2017
LDC2019T02 TAC KBP Entity Discovery and Linking - Comprehensive Training and Evaluation Data 2014-2015
LDC2019T12 TAC KBP Evaluation Source Corpora 2016-2017
LDC2020T18 TAC KBP Event Argument - Comprehensive Training and Evaluation Data 2016-2017
LDC2014T16 TAC KBP Reference Knowledge Base
LDC2016T26 TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014
LDC2018T24 TAC Relation Extraction Dataset

TalkBank

LDC2005T35 American National Corpus (ANC) Second Release
LDC2004V01 FORM1 Kinematic Gesture
LDC2003V01 FORM2 Kinematic Gesture
LDC2003L01 Grassfields Bantu Fieldwork: Dschang Lexicon
LDC2003S02 Grassfields Bantu Fieldwork: Dschang Tone Paradigms
LDC2001S16 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
LDC2004T03 Morphologically Annotated Korean Text
LDC2003S06 Santa Barbara Corpus of Spoken American English Part II
LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews
LDC2004S12 TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls

TDT

LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
LDC98T25 TDT Pilot Study Corpus
LDC2000S92 TDT2 Careful Transcription Audio
LDC2000T44 TDT2 Careful Transcription Text
LDC99S84 TDT2 English Audio
LDC2001S93 TDT2 Mandarin Audio Corpus
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001S94 TDT3 English Audio
LDC2001S95 TDT3 Mandarin Audio
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
LDC2005T16 TDT4 Multilingual Text and Annotations
LDC2007V02 TRECVID 2003 Keyframes & Transcripts
LDC2007V01 TRECVID 2005 Keyframes & Transcripts

TERN

LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0

TIDES

LDC2005T09 ACE 2004 Multilingual Training Corpus
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
LDC2003T11 ACE-2 Version 1.0
LDC93T1 ACL/DCI
LDC2004T18 Arabic English Parallel News Part 1
LDC2003T12 Arabic Gigaword
LDC2004T17 Arabic News Translation Text Part 1
LDC2001T55 Arabic Newswire Part 1
LDC2003T07 Arabic Treebank: Part 1 - 10K-word English Translation
LDC2003T06 Arabic Treebank: Part 1 v 2.0
LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
LDC2004T02 Arabic Treebank: Part 2 v 2.0
LDC2005T20 Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
LDC2004T11 Arabic Treebank: Part 3 v 1.0
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1
LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0
LDC2005T13 CCGbank
LDC96L14 CELEX2
LDC2005T10 Chinese English News Magazine Parallel Text
LDC2003T09 Chinese Gigaword
LDC2005T14 Chinese Gigaword Second Edition
LDC2005T06 Chinese News Translation Text Part 1
LDC2005T23 Chinese Proposition Bank 1.0
LDC2001T11 Chinese Treebank 2.0
LDC2004T05 Chinese Treebank 4.0
LDC2005T01 Chinese Treebank 5.0
LDC2007T36 Chinese Treebank 6.0
LDC2010T07 Chinese Treebank 7.0
LDC2013T21 Chinese Treebank 8.0
LDC2002L27 Chinese-English Translation Lexicon Version 3.0
LDC2007T02 English Chinese Translation Treebank v 1.0
LDC2003T05 English Gigaword
LDC2005T12 English Gigaword Second Edition
LDC95T11 European Language Newspaper Text
LDC2000T50 Hong Kong Hansards Parallel Text
LDC2000T47 Hong Kong Laws Parallel Text
LDC2000T46 Hong Kong News Parallel Text
LDC2004T08 Hong Kong Parallel Text
LDC95T8 Japanese Business News Text
LDC99T34 Japanese Business News Text Supplement
LDC2000T45 Korean Newswire
LDC95T13 Mandarin Chinese News Text
LDC2001T02 Message Understanding Conference (MUC) 7
LDC2003T18 Multiple-Translation Arabic (MTA) Part 1
LDC2005T05 Multiple-Translation Arabic (MTA) Part 2
LDC2003T17 Multiple-Translation Chinese (MTC) Part 2
LDC2004T07 Multiple-Translation Chinese (MTC) Part 3
LDC2006T04 Multiple-Translation Chinese (MTC) Part 4
LDC2002T01 Multiple-Translation Chinese Corpus
LDC95T21 North American News Text Corpus
LDC98T30 North American News Text Supplement
LDC2004T23 Prague Arabic Dependency Treebank 1.0
LDC2004T14 Proposition Bank I
LDC2006T12 Spanish Gigaword First Edition
LDC2009T21 Spanish Gigaword Second Edition
LDC95T9 Spanish News Text
LDC99T41 Spanish Newswire Text, Volume 2
LDC98T25 TDT Pilot Study Corpus
LDC2000S92 TDT2 Careful Transcription Audio
LDC2000T44 TDT2 Careful Transcription Text
LDC99S84 TDT2 English Audio
LDC2001S93 TDT2 Mandarin Audio Corpus
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001S94 TDT3 English Audio
LDC2001S95 TDT3 Mandarin Audio
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
LDC2005T16 TDT4 Multilingual Text and Annotations
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
LDC93T3A TIPSTER Complete
LDC2000T52 TREC Mandarin
LDC2000T51 TREC Spanish
LDC99T42 Treebank-3
LDC94T4B-1 UN Parallel Text (English)
LDC94T4B-3 UN Parallel Text (Spanish)

TIPSTER

LDC95T13 Mandarin Chinese News Text
LDC95T9 Spanish News Text
LDC93T3A TIPSTER Complete
LDC93T3B TIPSTER Volume 1
LDC93T3C TIPSTER Volume 2
LDC93T3D TIPSTER Volume 3

TRAD

LDC2018T13 TRAD Arabic-French Parallel Text -- Newsgroup
LDC2018T21 TRAD Arabic-French Parallel Text -- Newswire
LDC2018T02 TRAD Chinese-French Parallel Text -- Blog
LDC2018T17 TRAD Chinese-French Parallel Text -- Broadcast News

TREC

LDC2001T55 Arabic Newswire Part 1
LDC95T13 Mandarin Chinese News Text
LDC95T9 Spanish News Text
LDC93T3A TIPSTER Complete
LDC93T3B TIPSTER Volume 1
LDC93T3C TIPSTER Volume 2
LDC93T3D TIPSTER Volume 3
LDC2000T52 TREC Mandarin
LDC2000T51 TREC Spanish
LDC2007V02 TRECVID 2003 Keyframes & Transcripts
LDC2010V01 TRECVID 2004 Keyframes & Transcripts
LDC2007V01 TRECVID 2005 Keyframes & Transcripts
LDC2010V02 TRECVID 2006 Keyframes

VACE

LDC2012V01 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News
LDC2011V05 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
LDC2011V06 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2
LDC2011V03 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
LDC2011V04 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2
LDC2011V01 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 1
LDC2011V02 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2

VAST

LDC2023V01 2019 NIST Speaker Recognition Evaluation Test Set -- Audio-Visual
LDC2019S09 First DIHARD Challenge Development - Eight Sources
LDC2019S12 First DIHARD Challenge Evaluation - Nine Sources
LDC2022S06 Second DIHARD Challenge Evaluation - Eleven Sources
LDC2022S12 Third DIHARD Challenge Development
LDC2022S14 Third DIHARD Challenge Evaluation
LDC2019S05 VAST Chinese Speech and Transcripts