awesome-turkish-language-models

A curated list of Turkish AI models, datasets, papers

Stars: 66

A curated list of Turkish AI models, datasets, and papers, aimed at sharing and spreading information about Turkish AI resources. The repository brings together a selection of these resources, favoring entries with a unique element such as model performance, an unusual task, or a group/company worth highlighting. It covers models (LLMs, VLMs, NLP, speech, and multi-modal models), datasets, live leaderboards, benchmark datasets, papers, tutorials, tools and APIs, MCPs, the state of AI in Türkiye, and miscellaneous resources.

README:

The purpose of this repo is to share and spread information about Turkish AI models, datasets, and papers. Turkish resources like these are scarce and scattered across the web. This repo aims to bring a curated selection of them together. It is a selection, not a list of all Turkish NLP/LLM models or datasets, so not every BERT- or LLaMA-based model is going to make it here. The same applies to low-quality Google Translate translations of datasets. We aim for each entry to have some unique element of its own. This can be model performance, uniqueness of the task, highlighting groups/companies (not everyone shares their stuff, so why not appreciate it!), etc. If you want to add anything, you are welcome to 😏; please check out the Contributing section.

Models

LLMs

  1. ytu-ce-cosmos/Turkish-Llama
  2. Trendyol/Llama-3-Trendyol-LLM-8b-chat-v2.0
  3. Trendyol/Trendyol-LLM-7B-chat-v4.1.0
  4. TURKCELL/Turkcell-LLM-7b-v1
  5. KOCDIGITAL/Kocdigital-LLM-8b-v0.1
  6. WiroAI/OpenR1-Qwen-7B-Turkish Reasoning model
  7. WiroAI/wiroai-turkish-llm-9b
  8. ytu-ce-cosmos/Turkish-Gemma-9b-v0.1
  9. Trendyol/Trendyol-LLM-8B-T1 Qwen3 finetune, has thinking mode
  10. ytu-ce-cosmos/Turkish-Gemma-9b-T1
  11. vngrs-ai/Kumru-2B Uses the Mistral architecture. It is trained from scratch (not a finetune).
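
Most entries in this list are written as `owner/name` identifiers. As a minimal sketch, assuming an entry is hosted on the Hugging Face Hub (the list itself does not guarantee this, and the entry type has to be known in advance), the helper below rebuilds the page URL from such an identifier. The names `hub_url`, `HUB`, and `PREFIXES` are hypothetical and only illustrate the Hub's URL layout for models, datasets, and Spaces.

```python
# Hypothetical helper: turn an `owner/name` entry from this list into a Hub URL.
# Assumption: the entry is hosted on the Hugging Face Hub; some entries are not.
HUB = "https://huggingface.co"

# Path prefix per repository type; model pages sit at the root of the Hub.
PREFIXES = {"model": "", "dataset": "datasets/", "space": "spaces/"}


def hub_url(repo_id: str, kind: str = "model") -> str:
    """Build the page URL for a repository ID such as 'vngrs-ai/Kumru-2B'."""
    owner, sep, name = repo_id.strip().partition("/")
    # Reject entries that are not exactly one owner and one name.
    if not (owner and sep and name) or "/" in name:
        raise ValueError(f"expected 'owner/name', got {repo_id!r}")
    return f"{HUB}/{PREFIXES[kind]}{owner}/{name}"


if __name__ == "__main__":
    print(hub_url("vngrs-ai/Kumru-2B"))
    print(hub_url("allenai/c4", kind="dataset"))
```

Entries without an owner prefix raise `ValueError` instead of producing a guessed URL, so they need to be looked up by hand.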

VLMs

  1. ytu-ce-cosmos/Turkish-LLaVA

NLP

  1. Trendyol/tybert
  2. Trendyol/tyroberta
  3. ytu-ce-cosmos/turkish-base-bert-uncased
  4. ytu-ce-cosmos/turkish-colbert
  5. ytu-ce-cosmos/turkish-gpt2-large
  6. dbmdz/bert-base-turkish-128k-uncased
  7. TURKCELL/bert-offensive-lang-detection-tr
  8. asafaya/kanarya-2b
  9. boun-tabi-LMG/TURNA
  10. Helsinki-NLP group Lots of translation models for Turkish
  11. VRLLab/TurkishBERTweet Tweet sentiment analysis
  12. akdeniz27/bert-base-turkish-cased-ner
  13. Trendyol/TY-ecomm-embed-multilingual-base-v1.2.0 Turkish and multilingual embeddings
  14. artiwise-ai/modernbert-base-tr-uncased
  15. ytu-ce-cosmos/turkish-e5-large Turkish retrieval model

Speech models

To be added

Multi-modal models

  1. kesimeg/lora-turkish-clip CLIP model finetuned on a Turkish dataset

Datasets

Text only

  1. merve/turkish_instructions Instruction tuning dataset
  2. BrewInteractive/alpaca-tr Instruction tuning dataset
  3. Metin/WikiRAG-TR
  4. MBZUAI/Bactrian-X
  5. Helsinki-NLP group Lots of translation datasets for Turkish
  6. turkish-nlp-suite/turkish-wikiNER
  7. turkish-nlp-suite/InstrucTurca
  8. WiroAI/dolphin-r1-turkish Reasoning dataset
  9. allenai/c4 Web scrape
  10. HPLT/HPLT2.0_cleaned Web scrape
  11. unimelb-nlp/wikiann NER
  12. TUR2SQL Text to SQL query dataset
  13. dolphin-r1-turkish Reasoning dataset
  14. emre/ct_tree_of_thought_turkish Turkish Tree of Thoughts (ToT) dataset
  15. evreny/prompt_injection_tr Turkish prompts for prompt injection
  16. HuggingFaceFW/fineweb-2 Has ~95 million Turkish texts
  17. TURSpider Text-to-SQL dataset
  18. vngrs-ai/vngrs-web-corpus Pretraining data which is a collection of different datasets crawled from the internet
  19. HuggingFaceFW/finetranslations Has 58 million Turkish-English text pairs for translation. Translations were generated with Gemma3-27B (from the original Turkish dataset to English)
  20. ytu-ce-cosmos/Cosmos-Turkish-Corpus-v1.0 Pretraining data which is a collection of different datasets crawled from the internet

Text & Images/Videos

  1. ytu-ce-cosmos/Turkish-LLaVA-Finetune
  2. ytu-ce-cosmos/Turkish-LLaVA-Pretrain
  3. ytu-ce-cosmos/turkce-kitap
  4. 99eren99/LLaVA1.5-Data-Turkish
  5. TasvirEt
  6. nezahatkorkmaz/turkish-medical-vqa-evaluated Medical image question and answer dataset
  7. nezahatkorkmaz/unsloth-pmc-vqa-tr Medical image question answering dataset. Translated from the PMC-VQA dataset. Requires access to the images from the original dataset.
  8. BosphorusSign22k Sign recognition
  9. FinePDFs Has 1.7 million Turkish entries. A PDF dataset that can be great for pretraining and RAG benchmark curation.
  10. ituperceptron/image-captioning-turkish Image captioning dataset. 200k long, 100k short captions

Text & Speech

  1. mozilla-foundation/common_voice_17_0 Older versions of this dataset (v16, v15, etc.) are also available.

Live Leaderboards

  1. malhajar/OpenLLMTurkishLeaderboard_v0.2
  2. KUIS-AI/Cetvel
  3. kesimeg/Turkish-rewardbench Reward model comparison
  4. TurkBench/TurkBench
  5. newmindai/Mezura Has RAG, Human evaluation (ELO score) and other benchmark scores. It also includes benchmarks in malhajar/OpenLLMTurkishLeaderboard_v0.2
  6. newmindai/Mizan Embedding model leaderboard. Compares abilities of embedding models on tasks such as retrieval, clustering etc.

Benchmark Datasets

Text only

  1. AYueksel/TurkishMMLU
  2. alibayram/turkish_mmlu
  3. ytu-ce-cosmos/gsm8k_tr
  4. Holmeister's Collections A collection of 17 datasets for 11 different tasks (Truthfulness, fairness, summarization etc.). For more see the paper
  5. CohereLabs/Global-MMLU MMLU for multiple languages including Turkish
  6. mrlbenchmarks/global-piqa-nonparallel Cultural commonsense benchmark.
  7. ytu-ce-cosmos/gpqa-extended_tr Graduate level science questions.
  8. CohereLabsCommunity/multilingual-reward-bench Reward benchmark (preference prediction)

Visual and Text

  1. CohereLabs/m-WildVision
  2. CohereLabs/AyaVisionBench
  3. kesimeg/MMStar_tr
  4. metu-yks/yksbench A visual benchmark based on the university entrance exam. Questions include visuals related to mathematics, geometry, physics, chemistry, biology, and geography.

Papers

  1. Cosmos-LLaVA: Chatting with the Visual
  2. Introducing cosmosGPT: Monolingual Training for Turkish Language Models
  3. TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish
  4. TURSpider: A Turkish Text-to-SQL Dataset and LLM-Based Study
  5. How do LLMs perform on Turkish? A multi-faceted multi-prompt evaluation Performances of various LLMs in Turkish
  6. Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish
  7. YKSBench: Stress-Testing Multimodal Models with Exam-Style Questions Paper of YKSBench benchmark.
  8. TurkBench: A Benchmark for Evaluating Turkish Large Language Models Paper of TurkBench benchmark

Tutorials and Codes

  1. METU NLP Lab Git repo
  2. wikipedia ToT data generation notebook

Tools and APIs

  1. Glosbe
  2. Wiktionary
  3. Zemberek Some Turkish NLP tools
  4. 3rt4nm4n/turkish-apis A list of Turkish APIs

MCPs

  1. THY-MCP
  2. borsa-mcp MCP Server for Istanbul Stock Exchange and Turkish Investment Fund Data
  3. yargi-mcp MCP Server for Turkish Legal Databases
  4. mevzuat-mcp MCP Server for Searching Turkish Legislation
  5. yoktez-mcp MCP Server for Turkish Thesis Database
  6. yokatlas-mcp MCP Server for YOK Atlas

State of AI in Türkiye

  1. KUIS-AI Youtube channel
  2. TR-AI Youtube channel
  3. Trendyol Tech Youtube channel Has videos related to their AI products and how they integrate AI

Miscellaneous

  1. Mukayese: Turkish NLP Strikes Back
  2. Mukayese github repo
  3. Wikipedia dumps Can be used as a dataset
  4. Turkish Encoder-only Models List A collection of encoder-only Turkish models
  5. Turkish Instruction Datasets List A collection of Turkish instruction datasets
  6. Turkish Vision-Language Datasets List A collection of Turkish vision-language datasets
  7. Cosmos App The app of the Cosmos AI Research group, hosting their Cosmos model (also has an iOS version)
  8. ITU NLP Research Tools and Resources

Contributing

If you have anything to add, just make a pull request! Before doing so, please consider whether the model/dataset/etc. has enough quality or uniqueness. Hugging Face is crowded with finetunes of LLaMA and BERT, and the same applies to datasets: many have multiple machine-translated versions. This makes it hard to find good-quality sources. We want to keep this list as curated as possible while still covering enough sources.
