TS Wikipedia
| Item Name: | TS Wikipedia | 
| Author(s): | Taner Sezer, Türker Sezer | 
| LDC Catalog No.: | LDC2015T15 | 
| ISBN: | 1-58563-723-8 | 
| DOI: | https://doi.org/10.35111/mem6-4951 | 
| Release Date: | July 15, 2015 | 
| Member Year(s): | 2015 | 
| DCMI Type(s): | Text | 
| Data Source(s): | web collection | 
| Application(s): | part of speech tagging, information extraction, morphology | 
| Language(s): | Turkish | 
| Language ID(s): | tur | 
| License(s): | Creative Commons-Attribution-Share-Alike 3.0 (NFP, Non-Member); LDC For-Profit Membership Agreement | 
| Online Documentation: | LDC2015T15 Documents | 
| Licensing Instructions: | Subscription & Standard Members, and Non-Members | 
| Citation: | Sezer, Taner, and Türker Sezer. TS Wikipedia LDC2015T15. Web Download. Philadelphia: Linguistic Data Consortium, 2015. | 
Introduction
TS Wikipedia is a collection of approximately 1.6 million processed Turkish Wikipedia pages. The data is tokenized and includes part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams.
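For readers unfamiliar with the terms, bi-grams and tri-grams are runs of two and three consecutive tokens. The sketch below shows one generic way to count them in Python. It is not the procedure the authors used, which this entry does not describe; the function name `ngrams` and the sample tokens are illustrative assumptions.

```python
from collections import Counter
from typing import Iterator, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens (hypothetical helper)."""
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


# Made-up token sequence, not taken from the corpus.
tokens = ["bir", "iki", "üç", "bir", "iki"]

bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))
```

With this input, the pair `("bir", "iki")` occurs twice and each of the three tri-grams occurs once.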
Data
The data is in a word-per-line format with five tab-separated columns: token, part-of-speech tag, morphological analysis, lemma and, where needed, the corrected spelling of the token. All data is presented in UTF-8 XML files. The source pages were selected and filtered to reduce non-Turkish characters, mathematical formulas and non-Turkish entries.
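The column layout described above could be read with a short Python routine along the following lines. This is a minimal sketch under stated assumptions: the XML markup lines are assumed to contain no tab characters (so any line with fewer than four fields is skipped), the fifth column is assumed to be absent or empty when no correction applies, and the names `TokenRow` and `parse_rows` as well as the sample tag values are invented for illustration. The actual tagset and XML structure should be checked against the corpus documentation.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class TokenRow(NamedTuple):
    """One word-per-line record (hypothetical container)."""
    token: str
    pos: str
    morph: str
    lemma: str
    corrected: Optional[str]  # None when no corrected spelling is given


def parse_rows(lines: Iterable[str]) -> Iterator[TokenRow]:
    """Yield a TokenRow for each tab-separated data line."""
    for raw in lines:
        fields = raw.rstrip("\r\n").split("\t")
        # Assumption: markup and blank lines have fewer than four fields.
        if len(fields) < 4:
            continue
        corrected = fields[4] if len(fields) > 4 and fields[4] else None
        yield TokenRow(fields[0], fields[1], fields[2], fields[3], corrected)


# Made-up sample lines; the tags shown are placeholders, not the corpus tagset.
sample = [
    "<doc>",
    "evler\tNoun\tev+Noun+A3pl\tev",
    "kitab\tNoun\tkitap+Noun\tkitap\tkitap",
    "</doc>",
]
rows = list(parse_rows(sample))
```

To read a real file, the same function can be given an open file handle, for example `open(path, encoding="utf-8")`, since the files are UTF-8 encoded.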
Samples
Please view this sample.
Updates
None at this time.