LDC Corpora ⇔ Projects

Many of the corpora in the Catalog were developed for, or used in, sponsored research projects. Some of these resources served as training and test data for benchmark tests of language-based systems developed during a project. A corpus is associated with a given project because it was developed for the project, was used in the project, or was otherwise considered relevant to the project's work.

ACE

  • LDC2003T11 ACE-2 Version 1.0
  • LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
  • LDC2004T14 Proposition Bank I
  • LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
  • LDC2005T09 ACE 2004 Multilingual Training Corpus
  • LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
  • LDC2006T06 ACE 2005 Multilingual Training Corpus
  • LDC2008T03 ACE 2005 English SpatialML Annotations
  • LDC2009T11 REFLEX Entity Translation Training/DevTest
  • LDC2010T09 ACE 2005 Mandarin SpatialML Annotations
  • LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
  • LDC2011T02 ACE 2005 English SpatialML Annotations Version 2
  • LDC2011T08 Datasets for Generic Relation Extraction (reACE)
  • LDC2014T18 ACE 2007 Multilingual Training Corpus
  • LDC2015T20 ACE 2007 Spanish DevTest - Pilot Evaluation

American National Corpus (ANC)

  • LDC2005T35 American National Corpus (ANC) Second Release
  • LDC2010T22 Manually Annotated Sub-Corpus First Release
  • LDC2013T12 Manually Annotated Sub-Corpus Third Release

AQUAINT

  • LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
  • LDC2008T25 AQUAINT-2 Information-Retrieval Text Research Collection

ATIS

BOLT

  • LDC2014T12 Abstract Meaning Representation (AMR) Annotation Release 1.0
  • LDC2013T21 Chinese Treebank 8.0
  • LDC2016T05 BOLT Chinese Discussion Forums
  • LDC2016T13 Chinese Treebank 9.0
  • LDC2016T19 BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training
  • LDC2017T05 BOLT Chinese Discussion Forum Parallel Training Data
  • LDC2017T07 BOLT Egyptian Arabic SMS/Chat and Transliteration

Communicator

CoNLL

DARPA-CSR

DASL

  • LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews

DEFT

EARS

  • LDC2000S92 TDT2 Careful Transcription Audio
  • LDC2000T44 TDT2 Careful Transcription Text
  • LDC2001S13 Switchboard Cellular Part 1 Audio
  • LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
  • LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material
  • LDC2001S93 TDT2 Mandarin Audio Corpus
  • LDC2001S94 TDT3 English Audio
  • LDC2001S95 TDT3 Mandarin Audio
  • LDC2001T14 Switchboard Cellular Part 1 Transcription
  • LDC2001T55 Arabic Newswire Part 1
  • LDC2001T57 TDT2 Multilanguage Text Version 4.0
  • LDC2001T58 TDT3 Multilanguage Text Version 2.0
  • LDC2002S06 Switchboard-2 Phase III Audio
  • LDC2002S10 1998 HUB5 English Evaluation
  • LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
  • LDC2002S12 2001 HUB5 Mandarin Evaluation
  • LDC2002S13 2001 HUB5 English Evaluation
  • LDC2002S22 1997 HUB5 Arabic Evaluation
  • LDC2002S24 1997 HUB5 German Evaluation
  • LDC2002S25 1997 HUB5 Spanish Evaluation
  • LDC2002S37 CALLHOME Egyptian Arabic Speech Supplement
  • LDC2002T38 CALLHOME Egyptian Arabic Transcripts Supplement
  • LDC2002T39 1997 HUB5 Arabic Transcripts
  • LDC2003T01 2001 HUB5 Mandarin Transcripts
  • LDC2003T02 1998 HUB5 English Transcripts
  • LDC2003T03 1997 HUB5 German Transcripts
  • LDC2003T04 1997 HUB5 Spanish Transcripts
  • LDC2003T05 English Gigaword
  • LDC2003T09 Chinese Gigaword
  • LDC2003T12 Arabic Gigaword
  • LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
  • LDC2004S07 Switchboard Cellular Part 2 Audio
  • LDC2004S08 RT-03 MDE Training Data Speech
  • LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
  • LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
  • LDC2004S13 Fisher English Training Speech Part 1 Speech
  • LDC2004T12 RT-03 MDE Training Data Text and Annotations
  • LDC2004T19 Fisher English Training Speech Part 1 Transcripts
  • LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
  • LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
  • LDC2005S13 Fisher English Training Part 2, Speech
  • LDC2005T03 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
  • LDC2005T08 Discourse Graphbank
  • LDC2005T12 English Gigaword Second Edition
  • LDC2005T14 Chinese Gigaword Second Edition
  • LDC2005T19 Fisher English Training Part 2, Transcripts
  • LDC95T13 Mandarin Chinese News Text
  • LDC95T21 North American News Text Corpus
  • LDC96L15 CALLHOME Mandarin Chinese Lexicon
  • LDC96S34 CALLHOME Mandarin Chinese Speech
  • LDC96S46 CALLFRIEND American English-Non-Southern Dialect
  • LDC96S47 CALLFRIEND American English-Southern Dialect
  • LDC96S49 CALLFRIEND Egyptian Arabic
  • LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
  • LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect
  • LDC96T16 CALLHOME Mandarin Chinese Transcripts
  • LDC97L20 CALLHOME American English Lexicon (PRONLEX)
  • LDC97S42 CALLHOME American English Speech
  • LDC97S44 1996 English Broadcast News Speech (HUB4)
  • LDC97S45 CALLHOME Egyptian Arabic Speech
  • LDC97S62 Switchboard-1 Release 2
  • LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
  • LDC97T14 CALLHOME American English Transcripts
  • LDC97T19 CALLHOME Egyptian Arabic Transcripts
  • LDC97T22 1996 English Broadcast News Transcripts (HUB4)
  • LDC98S69 HUB5 Mandarin Telephone Speech Corpus
  • LDC98S71 1997 English Broadcast News Speech (HUB4)
  • LDC98S72 Taiwanese Putonghua Speech and Transcripts
  • LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
  • LDC98S75 Switchboard-2 Phase I
  • LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
  • LDC98T25 TDT Pilot Study Corpus
  • LDC98T26 HUB5 Mandarin Transcripts
  • LDC98T28 1997 English Broadcast News Transcripts (HUB4)
  • LDC98T30 North American News Text Supplement
  • LDC99L22 Egyptian Colloquial Arabic Lexicon
  • LDC99L23 American English Spoken Lexicon
  • LDC99S79 Switchboard-2 Phase II
  • LDC99S84 TDT2 English Audio
  • LDC2005S14 Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
  • LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
  • LDC2005S16 RT-04 MDE Training Data Speech
  • LDC2005T24 RT-04 MDE Training Data Text/Annotations
  • LDC2006T12 Spanish Gigaword First Edition
  • LDC2006S29 Levantine Arabic QT Training Data Set 5, Speech
  • LDC2006T07 Levantine Arabic QT Training Data Set 5, Transcripts
  • LDC2005S15 HKUST Mandarin Telephone Speech, Part 1
  • LDC2009T21 Spanish Gigaword Second Edition

GALE

  • LDC2000S92 TDT2 Careful Transcription Audio
  • LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1
  • LDC2000T44 TDT2 Careful Transcription Text
  • LDC2000T45 Korean Newswire
  • LDC2000T46 Hong Kong News Parallel Text
  • LDC2000T47 Hong Kong Laws Parallel Text
  • LDC2000T50 Hong Kong Hansards Parallel Text
  • LDC2000T51 TREC Spanish
  • LDC2000T52 TREC Mandarin
  • LDC2001S13 Switchboard Cellular Part 1 Audio
  • LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
  • LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material
  • LDC2001S93 TDT2 Mandarin Audio Corpus
  • LDC2001S94 TDT3 English Audio
  • LDC2001S95 TDT3 Mandarin Audio
  • LDC2001T02 Message Understanding Conference (MUC) 7
  • LDC2013T14 GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 1
  • LDC2001T11 Chinese Treebank 2.0
  • LDC2001T14 Switchboard Cellular Part 1 Transcription
  • LDC2001T55 Arabic Newswire Part 1
  • LDC2001T57 TDT2 Multilanguage Text Version 4.0
  • LDC2001T58 TDT3 Multilanguage Text Version 2.0
  • LDC2002L27 Chinese-English Translation Lexicon Version 3.0
  • LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
  • LDC2002S06 Switchboard-2 Phase III Audio
  • LDC2002S10 1998 HUB5 English Evaluation
  • LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
  • LDC2002S12 2001 HUB5 Mandarin Evaluation
  • LDC2002S13 2001 HUB5 English Evaluation
  • LDC2002S22 1997 HUB5 Arabic Evaluation
  • LDC2002S24 1997 HUB5 German Evaluation
  • LDC2002S25 1997 HUB5 Spanish Evaluation
  • LDC2002S37 CALLHOME Egyptian Arabic Speech Supplement
  • LDC2002T01 Multiple-Translation Chinese Corpus
  • LDC2002T38 CALLHOME Egyptian Arabic Transcripts Supplement
  • LDC2002T39 1997 HUB5 Arabic Transcripts
  • LDC2003T01 2001 HUB5 Mandarin Transcripts
  • LDC2003T02 1998 HUB5 English Transcripts
  • LDC2003T03 1997 HUB5 German Transcripts
  • LDC2003T04 1997 HUB5 Spanish Transcripts
  • LDC2003T05 English Gigaword
  • LDC2003T06 Arabic Treebank: Part 1 v 2.0
  • LDC2003T07 Arabic Treebank: Part 1 - 10K-word English Translation
  • LDC2003T09 Chinese Gigaword
  • LDC2003T11 ACE-2 Version 1.0
  • LDC2003T12 Arabic Gigaword
  • LDC2003T17 Multiple-Translation Chinese (MTC) Part 2
  • LDC2003T18 Multiple-Translation Arabic (MTA) Part 1
  • LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
  • LDC2007T24 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
  • LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0
  • LDC2004S07 Switchboard Cellular Part 2 Audio
  • LDC2004S08 RT-03 MDE Training Data Speech
  • LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
  • LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
  • LDC2004S13 Fisher English Training Speech Part 1 Speech
  • LDC2004T02 Arabic Treebank: Part 2 v 2.0
  • LDC2004T05 Chinese Treebank 4.0
  • LDC2004T07 Multiple-Translation Chinese (MTC) Part 3
  • LDC2004T08 Hong Kong Parallel Text
  • LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
  • LDC2004T11 Arabic Treebank: Part 3 v 1.0
  • LDC2004T12 RT-03 MDE Training Data Text and Annotations
  • LDC2004T14 Proposition Bank I
  • LDC2004T17 Arabic News Translation Text Part 1
  • LDC2004T18 Arabic English Parallel News Part 1
  • LDC2004T19 Fisher English Training Speech Part 1 Transcripts
  • LDC2004T23 Prague Arabic Dependency Treebank 1.0
  • LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
  • LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
  • LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
  • LDC2005S13 Fisher English Training Part 2, Speech
  • LDC2005T01 Chinese Treebank 5.0
  • LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
  • LDC2005T03 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
  • LDC2005T05 Multiple-Translation Arabic (MTA) Part 2
  • LDC2005T06 Chinese News Translation Text Part 1
  • LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
  • LDC2005T08 Discourse Graphbank
  • LDC2005T09 ACE 2004 Multilingual Training Corpus
  • LDC2005T10 Chinese English News Magazine Parallel Text
  • LDC2005T12 English Gigaword Second Edition
  • LDC2005T13 CCGbank
  • LDC2005T14 Chinese Gigaword Second Edition
  • LDC2013T23 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 1
  • LDC2005T16 TDT4 Multilingual Text and Annotations
  • LDC2005T19 Fisher English Training Part 2, Transcripts
  • LDC93T1 ACL/DCI
  • LDC93T3A TIPSTER Complete
  • LDC94T4B-1 UN Parallel Text (English)
  • LDC94T4B-3 UN Parallel Text (Spanish)
  • LDC95T11 European Language Newspaper Text
  • LDC95T13 Mandarin Chinese News Text
  • LDC95T21 North American News Text Corpus
  • LDC95T8 Japanese Business News Text
  • LDC95T9 Spanish News Text
  • LDC96L14 CELEX2
  • LDC96L15 CALLHOME Mandarin Chinese Lexicon
  • LDC96S34 CALLHOME Mandarin Chinese Speech
  • LDC96S46 CALLFRIEND American English-Non-Southern Dialect
  • LDC96S47 CALLFRIEND American English-Southern Dialect
  • LDC96S49 CALLFRIEND Egyptian Arabic
  • LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
  • LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect
  • LDC96T16 CALLHOME Mandarin Chinese Transcripts
  • LDC97L20 CALLHOME American English Lexicon (PRONLEX)
  • LDC97S42 CALLHOME American English Speech
  • LDC97S44 1996 English Broadcast News Speech (HUB4)
  • LDC97S45 CALLHOME Egyptian Arabic Speech
  • LDC97S62 Switchboard-1 Release 2
  • LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
  • LDC97T14 CALLHOME American English Transcripts
  • LDC97T19 CALLHOME Egyptian Arabic Transcripts
  • LDC97T22 1996 English Broadcast News Transcripts (HUB4)
  • LDC98S69 HUB5 Mandarin Telephone Speech Corpus
  • LDC98S71 1997 English Broadcast News Speech (HUB4)
  • LDC98S72 Taiwanese Putonghua Speech and Transcripts
  • LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
  • LDC98S75 Switchboard-2 Phase I
  • LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
  • LDC98T25 TDT Pilot Study Corpus
  • LDC98T26 HUB5 Mandarin Transcripts
  • LDC98T28 1997 English Broadcast News Transcripts (HUB4)
  • LDC98T30 North American News Text Supplement
  • LDC99L22 Egyptian Colloquial Arabic Lexicon
  • LDC99L23 American English Spoken Lexicon
  • LDC99S79 Switchboard-2 Phase II
  • LDC99S84 TDT2 English Audio
  • LDC99T34 Japanese Business News Text Supplement
  • LDC99T41 Spanish Newswire Text, Volume 2
  • LDC99T42 Treebank-3
  • LDC2005T20 Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
  • LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
  • LDC2005S14 Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
  • LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
  • LDC2005S16 RT-04 MDE Training Data Speech
  • LDC2005T24 RT-04 MDE Training Data Text/Annotations
  • LDC2005T23 Chinese Proposition Bank 1.0
  • LDC2006T12 Spanish Gigaword First Edition
  • LDC2006T10 English-Arabic Treebank v 1.0
  • LDC2005S15 HKUST Mandarin Telephone Speech, Part 1
  • LDC2007S02 Fisher Levantine Arabic Conversational Telephone Speech
  • LDC2007T04 Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
  • LDC2007T07 English Gigaword Third Edition
  • LDC2007T20 GALE Phase 1 Distillation Training
  • LDC2007T21 OntoNotes Release 1.0
  • LDC2007T23 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
  • LDC2007T38 Chinese Gigaword Third Edition
  • LDC2007T40 Arabic Gigaword Third Edition
  • LDC2008T02 GALE Phase 1 Arabic Blog Parallel Text
  • LDC2008T04 OntoNotes Release 2.0
  • LDC2008T06 GALE Phase 1 Chinese Blog Parallel Text
  • LDC2008T08 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
  • LDC2008T09 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
  • LDC2008T18 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
  • LDC2009T09 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
  • LDC2009T01 English CTS Treebank with Structural Metadata
  • LDC2009T02 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
  • LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
  • LDC2009T05 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data
  • LDC2009T06 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
  • LDC2009T13 English Gigaword Fourth Edition
  • LDC2009T15 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
  • LDC2009T21 Spanish Gigaword Second Edition
  • LDC2009T24 OntoNotes Release 3.0
  • LDC2009T27 Chinese Gigaword Fourth Edition
  • LDC2009T30 Arabic Gigaword Fourth Edition
  • LDC2010T01 NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations
  • LDC2010T03 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
  • LDC2010T13 Arabic Treebank: Part 1 v 4.1
  • LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
  • LDC2011T03 OntoNotes Release 4.0
  • LDC2011T05 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
  • LDC2011T07 English Gigaword Fifth Edition
  • LDC2011T09 Arabic Treebank: Part 2 v 3.1
  • LDC2011T11 Arabic Gigaword Fifth Edition
  • LDC2011T13 Chinese Gigaword Fifth Edition
  • LDC2012T02 English Translation Treebank: An-Nahar Newswire
  • LDC2012T06 GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1
  • LDC2012T07 Arabic Treebank - Broadcast News v1.0
  • LDC2012T09 Arabic-Dialect/English Parallel Text
  • LDC2012T13 English Web Treebank
  • LDC2012T14 GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2
  • LDC2012T16 GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
  • LDC2012T17 GALE Phase 2 Arabic Newswire Parallel Text
  • LDC2012T18 GALE Phase 2 Arabic Broadcast News Parallel Text
  • LDC2013T04 GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1
  • LDC2013T01 GALE Phase 2 Arabic Web Parallel Text
  • LDC2012T20 GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire
  • LDC2013T05 GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web
  • LDC2013S02 GALE Phase 2 Arabic Broadcast Conversation Speech Part 1
  • LDC2012T21 Annotated English Gigaword
  • LDC2012T24 GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web
  • LDC2013T08 GALE Phase 2 Chinese Broadcast Conversation Transcripts
  • LDC2013S04 GALE Phase 2 Chinese Broadcast Conversation Speech
  • LDC2013T10 GALE Arabic-English Parallel Aligned Treebank -- Newswire
  • LDC2013T11 GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 1
  • LDC2013T16 GALE Phase 2 Chinese Broadcast Conversation Parallel Text Part 2
  • LDC2013S07 GALE Phase 2 Arabic Broadcast Conversation Speech Part 2
  • LDC2013T17 GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 2
  • LDC2013T18 Semantic Textual Similarity (STS) 2013 Machine Translation
  • LDC2013T20 GALE Phase 2 Chinese Broadcast News Transcripts
  • LDC2013S08 GALE Phase 2 Chinese Broadcast News Speech
  • LDC2013T19 OntoNotes Release 5.0
  • LDC2014T03 GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Part 2
  • LDC2014T04 GALE Phase 2 Chinese Broadcast News Parallel Text Part 1
  • LDC2014T05 GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web
  • LDC2014T08 GALE Arabic-English Parallel Aligned Treebank -- Web Training
  • LDC2014T10 GALE Arabic-English Word Alignment Training Part 2 -- Newswire
  • LDC2014T11 GALE Phase 2 Chinese Broadcast News Parallel Text Part 2
  • LDC2014T14 GALE Arabic-English Word Alignment Training Part 3 -- Web
  • LDC2014T15 GALE Phase 2 Chinese Newswire Parallel Text Part 1
  • LDC2014T17 GALE Phase 2 Arabic Broadcast News Transcripts Part 1
  • LDC2014S07 GALE Phase 2 Arabic Broadcast News Speech Part 1
  • LDC2014T20 GALE Phase 2 Chinese Newswire Parallel Text Part 2
  • LDC2014T19 GALE Arabic-English Word Alignment -- Broadcast Training Part 1
  • LDC2014T22 GALE Arabic-English Word Alignment -- Broadcast Training Part 2
  • LDC2014T25 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 2
  • LDC2014T26 GALE Phase 2 Chinese Web Parallel Text
  • LDC2014S09 GALE Phase 3 Chinese Broadcast Conversation Speech Part 1
  • LDC2014T28 GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 1
  • LDC2015T01 GALE Phase 2 Arabic Broadcast News Transcripts Part 2
  • LDC2015S01 GALE Phase 2 Arabic Broadcast News Speech Part 2
  • LDC2015T04 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 3
  • LDC2015T05 GALE Phase 3 and 4 Arabic Broadcast Conversation Parallel Text
  • LDC2015T06 GALE Chinese-English Parallel Aligned Treebank -- Training
  • LDC2013T21 Chinese Treebank 8.0
  • LDC2015T07 GALE Phase 3 and 4 Arabic Broadcast News Parallel Text
  • LDC2015T09 GALE Phase 3 Chinese Broadcast Conversation Transcripts Part 2
  • LDC2015S06 GALE Phase 3 Chinese Broadcast Conversation Speech Part 2
  • LDC2015T14 GALE Phase 4 Chinese Broadcast Conversation Parallel Sentences
  • LDC2015T21 GALE Phase 4 Chinese Broadcast News Parallel Sentences
  • LDC2015S11 GALE Phase 3 Arabic Broadcast Conversation Speech Part 1
  • LDC2015T16 GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 1
  • LDC2015T18 GALE Chinese-English Word Alignment and Tagging -- Broadcast Training Part 4
  • LDC2015T19 GALE Phase 3 and 4 Arabic Newswire Parallel Text
  • LDC2015T24 GALE Phase 4 Chinese Newswire Parallel Sentences
  • LDC2015T25 GALE Phase 3 Chinese Broadcast News Transcripts
  • LDC2015S13 GALE Phase 3 Chinese Broadcast News Speech
  • LDC2016T02 Arabic Treebank - Weblog
  • LDC2016T04 GALE Phase 4 Chinese Weblog Parallel Sentences
  • LDC2016T06 GALE Phase 3 Arabic Broadcast Conversation Transcripts Part 2
  • LDC2016S01 GALE Phase 3 Arabic Broadcast Conversation Speech Part 2
  • LDC2016T08 GALE Phase 3 and 4 Arabic Web Parallel Text
  • LDC2016T09 GALE Phase 3 and 4 Chinese Broadcast Conversation Parallel Text
  • LDC2016T11 GALE Phase 4 Arabic Broadcast Conversation Parallel Sentences
  • LDC2016T12 GALE Phase 4 Chinese Broadcast Conversation Transcripts
  • LDC2016S03 GALE Phase 4 Chinese Broadcast Conversation Speech
  • LDC2016T13 Chinese Treebank 9.0
  • LDC2016T14 GALE Phase 4 Arabic Weblog Parallel Sentences
  • LDC2007T36 Chinese Treebank 6.0
  • LDC2010T07 Chinese Treebank 7.0
  • LDC2016T15 GALE Phase 3 and 4 Chinese Broadcast News Parallel Text
  • LDC2016T17 GALE Phase 3 Arabic Broadcast News Transcripts Part 1
  • LDC2016S07 GALE Phase 3 Arabic Broadcast News Speech Part 1
  • LDC2016T20 GALE Phase 4 Arabic Broadcast News Parallel Sentences
  • LDC2016T25 GALE Phase 3 and 4 Chinese Newswire Parallel Text
  • LDC2016T27 GALE Phase 4 Arabic Newswire Parallel Sentences
  • LDC2017T02 GALE Phase 3 and 4 Chinese Web Parallel Text
  • LDC2017T04 GALE Phase 3 Arabic Broadcast News Transcripts Part 2
  • LDC2017S02 GALE Phase 3 Arabic Broadcast News Speech Part 2
  • LDC2017T06 GALE English-Chinese Parallel Aligned Treebank -- Training

GENOA

HAVIC

Hub4

  • LDC2000S86 1998 HUB4 Broadcast News Evaluation English Test Material
  • LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material
  • LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
  • LDC95T21 North American News Text Corpus
  • LDC97S44 1996 English Broadcast News Speech (HUB4)
  • LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
  • LDC97T22 1996 English Broadcast News Transcripts (HUB4)
  • LDC98S71 1997 English Broadcast News Speech (HUB4)
  • LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
  • LDC98S74 1997 Spanish Broadcast News Speech (HUB4-NE)
  • LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
  • LDC98T28 1997 English Broadcast News Transcripts (HUB4)
  • LDC98T29 1997 Spanish Broadcast News Transcripts (HUB4-NE)
  • LDC98T30 North American News Text Supplement
  • LDC98T31 1996 CSR HUB4 Language Model
  • LDC2015S05 Mandarin Chinese Phonetic Segmentation and Tone

Hub5-LVCSR

JANUS

LID

Linguistic Atlas Project

  • LDC2012S03 Digital Archive of Southern Speech
  • LDC2016S05 Digital Archive of Southern Speech - NLP Version

MADCAT

MALACH

  • LDC2012S05 USC-SFI MALACH Interviews and Transcripts English
  • LDC2014S04 USC-SFI MALACH Interviews and Transcripts Czech

MIXER

MT08

  • LDC2010T01 NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations

MUC

NIST Automatic Meeting Recognition

  • LDC2004S09 NIST Meeting Pilot Corpus Speech
  • LDC2004T13 NIST Meeting Pilot Corpus Transcripts and Metadata

NIST LRE

  • LDC2008S05 2005 NIST Language Recognition Evaluation
  • LDC2009S04 2007 NIST Language Recognition Evaluation Test Set
  • LDC2009S05 2007 NIST Language Recognition Evaluation Supplemental Training Set
  • LDC2014S06 2009 NIST Language Recognition Evaluation Test Set
  • LDC2006S31 2003 NIST Language Recognition Evaluation
  • LDC2016S11 Multi-Language Conversational Telephone Speech 2011 -- Slavic Group

NIST MT

  • LDC2013T07 NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets
  • LDC2009T05 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data
  • LDC2010T10 NIST 2002 Open Machine Translation (OpenMT) Evaluation
  • LDC2010T11 NIST 2003 Open Machine Translation (OpenMT) Evaluation
  • LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation
  • LDC2010T14 NIST 2005 Open Machine Translation (OpenMT) Evaluation
  • LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
  • LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
  • LDC2010T23 NIST 2009 Open Machine Translation (OpenMT) Evaluation
  • LDC2013T03 NIST 2012 Open Machine Translation (OpenMT) Evaluation
  • LDC2013T18 Semantic Textual Similarity (STS) 2013 Machine Translation
  • LDC2014T02 NIST 2012 Open Machine Translation (OpenMT) Progress Test Five Language Source
  • LDC2014T09 HyTER Networks of Selected OpenMT08/09 Sentences

NIST SRE

  • LDC2001S97 2000 NIST Speaker Recognition Evaluation
  • LDC2012S01 2006 NIST Speaker Recognition Evaluation Test Set Part 2
  • LDC2011S05 2008 NIST Speaker Recognition Evaluation Training Set Part 1
  • LDC2006S44 2004 NIST Speaker Recognition Evaluation
  • LDC2010S03 2003 NIST Speaker Recognition Evaluation
  • LDC2011S01 2005 NIST Speaker Recognition Evaluation Training Data
  • LDC2011S04 2005 NIST Speaker Recognition Evaluation Test Data
  • LDC2011S07 2008 NIST Speaker Recognition Evaluation Training Set Part 2
  • LDC2011S08 2008 NIST Speaker Recognition Evaluation Test Set
  • LDC2011S09 2006 NIST Speaker Recognition Evaluation Training Set
  • LDC2011S10 2006 NIST Speaker Recognition Evaluation Test Set Part 1
  • LDC2011S11 2008 NIST Speaker Recognition Evaluation Supplemental Set
  • LDC2013S05 Greybeard
  • LDC2017S06 2010 NIST Speaker Recognition Evaluation Test Set

RATS

REFLEX-MTE

  • LDC2009T11 REFLEX Entity Translation Training/DevTest

RM

  • LDC96S39 RM Isolated and Spelled Word Data

ROAR

RT

  • LDC2011S06 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set
  • LDC2007S11 2004 Spring NIST Rich Transcription (RT-04S) Development Data
  • LDC2007S12 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data

SemEval

  • LDC2011T01 SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
  • LDC2016T10 SDP 2014 & 2015: Broad Coverage Semantic Dependency Parsing

SID

  • LDC2001S13 Switchboard Cellular Part 1 Audio
  • LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
  • LDC2001T14 Switchboard Cellular Part 1 Transcription
  • LDC2002S06 Switchboard-2 Phase III Audio
  • LDC2002S34 2001 NIST Speaker Recognition Evaluation Corpus
  • LDC2004S04 2002 NIST Speaker Recognition Evaluation
  • LDC2004S07 Switchboard Cellular Part 2 Audio
  • LDC96S61 1996 Speaker Recognition Benchmark
  • LDC98S75 Switchboard-2 Phase I
  • LDC98S76 1998 Speaker Recognition Benchmark
  • LDC99S79 Switchboard-2 Phase II
  • LDC99S80 1997 Speaker Recognition Benchmark
  • LDC99S81 1999 Speaker Recognition Benchmark

SPINE

  • LDC2000S87 Speech in Noisy Environments (SPINE) Training Audio
  • LDC2000S96 Speech in Noisy Environments (SPINE) Evaluation Audio
  • LDC2000T49 Speech in Noisy Environments (SPINE) Training Transcripts
  • LDC2000T54 Speech in Noisy Environments (SPINE) Evaluation Transcripts
  • LDC2001S04 Speech in Noisy Environments (SPINE2) Part 1 Audio
  • LDC2001S06 Speech in Noisy Environments (SPINE2) Part 2 Audio
  • LDC2001S08 Speech in Noisy Environments (SPINE2) Part 3 Audio
  • LDC2001S99 Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
  • LDC2001T05 Speech in Noisy Environments (SPINE2) Part 1 Transcripts
  • LDC2001T07 Speech in Noisy Environments (SPINE2) Part 2 Transcripts
  • LDC2001T09 Speech in Noisy Environments (SPINE2) Part 3 Transcripts

TAC

  • LDC2014T16 TAC KBP Reference Knowledge Base
  • LDC2016T26 TAC KBP Spanish Cross-lingual Entity Linking - Comprehensive Training and Evaluation Data 2012-2014

Talkbank

  • LDC2004V01 FORM1 Kinematic Gesture
  • LDC2001S16 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
  • LDC2003L01 Grassfields Bantu Fieldwork: Dschang Lexicon
  • LDC2003S02 Grassfields Bantu Fieldwork: Dschang Tone Paradigms
  • LDC2003S06 Santa Barbara Corpus of Spoken American English Part II
  • LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews
  • LDC2003V01 FORM2 Kinematic Gesture
  • LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
  • LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
  • LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
  • LDC2004S12 TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls
  • LDC2004T03 Morphologically Annotated Korean Text
  • LDC2005T35 American National Corpus (ANC) Second Release

TDT

TERN

  • LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0

TIDES

Tipster

TREC

VACE

  • LDC2011V01 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 1
  • LDC2011V02 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2
  • LDC2011V03 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
  • LDC2011V04 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2
  • LDC2011V06 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2
  • LDC2011V05 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
  • LDC2012V01 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News