R&D Resources

Here is the list of datasets and resources shared publicly by TurkAI.

Multi-Purpose Language Understanding (ÇADA) Dataset

The Multi-Purpose Language Understanding (ÇADA) dataset was developed to evaluate the performance of Turkish artificial intelligence (AI) models. It is a comprehensive dataset designed to measure how well Turkish natural language processing (NLP) models perform across a variety of tasks.

Access to Repository
* Requires a HuggingFace account.

TVoice Dataset

TVoice is a dataset of Turkish audio clips curated for training Turkish speech-to-text (STT) models. It places special emphasis on regional accents and dialects from various parts of Turkey, including the Doğu (Eastern), Ege (Aegean), and Kuzey Doğu (Northeastern) regions. By providing this linguistic diversity, TVoice aims to improve the accuracy and versatility of STT models, helping them understand and transcribe speech across different local dialects and accents.

Access Request Form
* After submitting the form, you will be granted access to the HuggingFace repository.

** Requires a HuggingFace account.
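
Since both datasets are accessed through HuggingFace repositories that require an account, the following is a minimal sketch of how an access-restricted repository might be loaded in Python once access has been granted. It assumes the `datasets` library is installed and that a recent version is in use (one that accepts the `token` argument), and that a personal access token is available in the `HF_TOKEN` environment variable. The helper names `resolve_hf_token` and `load_gated_dataset` and the repository ID in the usage comment are hypothetical placeholders; the actual repository IDs are not given on this page.

```python
import os


def resolve_hf_token(explicit_token=None):
    """Return an access token: the explicit argument if given,
    otherwise the value of the HF_TOKEN environment variable."""
    token = explicit_token or os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "No HuggingFace token found; pass one explicitly or set HF_TOKEN."
        )
    return token


def load_gated_dataset(repo_id, token=None, split=None):
    """Load an access-restricted dataset repository (hypothetical helper)."""
    # Imported lazily so the token helper can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, token=resolve_hf_token(token))


# Usage (placeholder repository ID, not a real one):
#   ds = load_gated_dataset("your-org/your-dataset")
#   print(ds)
```

Reading the token from the environment keeps it out of source code; passing it explicitly is also supported for notebooks or scripts that manage credentials differently.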