Turkish NLP Resources

This project develops language resources and tools for Turkish natural language processing (NLP). Although Turkish is spoken by over 80 million people worldwide, it remains under-resourced in NLP technology compared with languages such as English or Chinese.

Project Goals

Create large-scale annotated corpora for various NLP tasks
Develop Turkish-specific pre-trained language models
Build open-source tools for Turkish text processing
Establish benchmarks for evaluating Turkish NLP systems

Current Progress

Corpora

We have developed several annotated corpora:

TurkishNews: A corpus of 100,000 news articles with topic annotations
TurkishNER: A named entity recognition dataset with 50,000 annotated sentences
TurkishSentiment: A sentiment analysis dataset with 25,000 annotated reviews

Pre-trained Models

We have released several pre-trained models:

TurkishBERT: A BERT model trained on a large corpus of Turkish text
TurkishGPT: A generative model for Turkish text
TurkishWord2Vec: Word embeddings for Turkish

Tools

We have developed several tools:

TurkishTokenizer: A tokenizer specifically designed for Turkish
TurkishMorphAnalyzer: A morphological analyzer for Turkish
TurkishNLP-Pipeline: An end-to-end pipeline for Turkish text processing
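
One reason Turkish benefits from a dedicated tokenizer is casing: Turkish distinguishes dotted and dotless I (İ/i and I/ı), so generic lowercasing gives wrong results. The following is a minimal sketch of Turkish-aware case folding in Python. It is an illustration only and does not show the TurkishTokenizer API; the function names `turkish_lower` and `turkish_upper` are hypothetical.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules (hypothetical helper)."""
    # Map the two Turkish-specific capitals before calling str.lower():
    # the default rules turn "I" into "i" (Turkish expects "ı") and
    # "İ" into "i" followed by a combining dot (Turkish expects plain "i").
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(text: str) -> str:
    """Uppercase text using Turkish casing rules (hypothetical helper)."""
    # Map dotless "ı" first so it is not confused with the dotted "i",
    # then map dotted "i" to "İ"; str.upper() handles everything else.
    return text.replace("ı", "I").replace("i", "İ").upper()


if __name__ == "__main__":
    print(turkish_lower("ISPARTA"))   # dotless: ısparta
    print(turkish_lower("İSTANBUL"))  # dotted: istanbul
    print(turkish_upper("istanbul"))  # İSTANBUL
```

Doing the two Turkish-specific replacements before the generic call keeps the sketch free of locale settings, which vary between systems.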

Future Work

We plan to expand our resources to include:

Multimodal datasets (text-image, text-speech)
Domain-specific language models (medical, legal, etc.)
Cross-lingual resources for Turkish and related languages
