|Item Name:||TS Wikipedia|
|Author(s):||Taner Sezer, Türker Sezer|
|LDC Catalog No.:||LDC2015T15|
|Release Date:||July 15, 2015|
|Data Source(s):||web collection|
|Application(s):||part of speech tagging, information extraction, morphology|
LDC For-Profit Membership Agreement
Creative Commons-Attribution-Share-Alike 3.0 (NFP, Non-Member)
|Online Documentation:||LDC2015T15 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Sezer, Taner, and Türker Sezer. TS Wikipedia LDC2015T15. Web Download. Philadelphia: Linguistic Data Consortium, 2015.|
TS Wikipedia is a collection of approximately 1.6 million processed Turkish Wikipedia pages. The data is tokenized and includes part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams.
The data is in a word-per-line format with five tab-separated columns: token, part-of-speech tag, morphological analysis, lemma and corrected token spelling if needed. All data is presented in UTF-8 XML files and was selected and filtered to reduce non-Turkish characters, mathematical formulas and non-Turkish entries.
Please view this sample.
None at this time.