This dataset contains examples from the ATIS corpus [1] that have been translated into Turkish and Hindi. The release contains a subset of examples from the original ATIS corpus distributed by LDC, with translations as well as annotations in the target language. The training and test split in each language is the one used in [2]. Please cite [2] when referring to the dataset.

English
========
The dataset has 2 files for the original English ATIS corpus as used in [3]:

  train.tsv
  test.tsv

The training set contains 4978 utterances selected from the Class A (context independent) training data in the ATIS-2 and ATIS-3 corpora, while the test set contains 893 utterances from the ATIS-3 Nov93 and Dec94 datasets. Each utterance has its named entities marked via table lookup, including domain-specific entities such as city, airline, and airport names, and dates.

Format
======
Each line in this original dataset includes the following 3 tab-separated columns corresponding to an utterance:

  - original utterance in English
  - intents of the utterance (multiple intents are separated by a space)
  - utterance with xml tags for slot labels

Hindi and Turkish
==================
The dataset includes the following files for the translations to Hindi and Turkish (target languages):

  Hindi-train_1600.tsv
  Hindi-train.tsv
  Hindi-test.tsv
  Turkish-train_638.tsv
  Turkish-train.tsv
  Turkish-test.tsv

Hindi-train_1600.tsv and Turkish-train_638.tsv contain 1600 Hindi utterances and 638 Turkish utterances, respectively, obtained from crowd-sourcing. Hindi-train.tsv and Turkish-train.tsv contain 600 utterances each, as used in [2]. Hindi-test.tsv and Turkish-test.tsv contain the 893 and 715 test utterances, respectively, used in [2].

Format
======
Each file is in tab-separated format, and has the following 6 columns:

  - Original English utterance in ATIS.
  - Original English utterance's gold BIO format slot sequence in ATIS.
  - Machine translation of the manually translated target-language utterance back to English.
    (Please note that this column is set to dummy_trans for the Turkish training set.)
  - Intent of the utterance.
  - Manually translated utterance in the target language.
  - BIO format slot sequence for the manually translated target-language utterance.

Contact
=======
For questions about the dataset, please email dilek@ieee.org, gokhan.tur@ieee.org, or abhirast@google.com

References
==========
[1] G. Tur, D. Hakkani-Tur, and L. Heck, "What is left to be understood in ATIS?" in IEEE SLT, 2010.
[2] S. Upadhyay, M. Faruqui, G. Tur, D. Hakkani-Tur, and L. Heck, "(Almost) Zero-Shot Cross-Lingual Spoken Language Understanding," in IEEE ICASSP, 2018.
[3] G. Tur, D. Hakkani-Tur, and L. Heck, "What is left to be understood in ATIS?" in IEEE SLT, 2010.
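
Example
=======
The 6-column format described above for the Hindi and Turkish files can be read with a few lines of Python. This is a minimal sketch, not part of the release: the column names in COLUMNS and the helper names parse_row and read_target_tsv are hypothetical labels chosen here, and the code assumes that the files have no header row and that tokens and BIO tags are whitespace-separated and aligned one-to-one.

```python
import csv

# Hypothetical names for the 6 columns, in the order listed in this README.
COLUMNS = [
    "english_utterance",
    "english_slots",
    "back_translation",
    "intent",
    "target_utterance",
    "target_slots",
]


def parse_row(fields):
    """Turn one 6-field row into a dict with token/tag pairs for the target language."""
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    # Assumption: tokens and BIO tags are separated by whitespace.
    tokens = row["target_utterance"].split()
    tags = row["target_slots"].split()
    # Record whether the two sequences line up instead of failing on a mismatch.
    row["aligned"] = len(tokens) == len(tags)
    row["target_pairs"] = list(zip(tokens, tags))
    return row


def read_target_tsv(path):
    """Read a tab-separated file such as Hindi-test.tsv into a list of dicts."""
    with open(path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quote characters in the utterances as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [parse_row(fields) for fields in reader if fields]
```

Calling read_target_tsv("Hindi-test.tsv") would then return one dict per utterance. Checking the "aligned" flag before training on "target_pairs" is a cheap way to catch rows where the tokenization assumption does not hold.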