R&D Resources
Here is the list of datasets and resources that TurkAI shares publicly.
Multi-Purpose Language Understanding (ÇADA) Dataset
The Multi-Purpose Language Understanding (ÇADA) dataset was developed to evaluate Turkish artificial intelligence (AI) models. It is a comprehensive dataset designed to measure how well Turkish natural language processing (NLP) models perform across a variety of tasks.
TVoice Dataset
TVoice is a dataset of Turkish audio clips curated for training speech-to-text (STT) models. It places special emphasis on regional accents and dialects from various parts of Türkiye, including the Doğu (Eastern), Ege (Aegean), and Kuzey Doğu (Northeastern) regions. By providing this linguistic diversity, TVoice aims to improve the accuracy and versatility of STT models, helping them understand and transcribe speech across local dialects and accents.
** Requires a Hugging Face account.
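Since TVoice targets STT accuracy across regional dialects, one way to check whether a model handles each region equally well is to compute word error rate (WER) per region. The following is a minimal sketch under stated assumptions: the `(region, reference, hypothesis)` tuple layout, the function names, and the sample sentences are hypothetical and do not reflect the actual TVoice schema or any TurkAI tooling.

```python
from collections import defaultdict


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer_by_region(samples):
    """Compute WER per region.

    samples: iterable of (region, reference, hypothesis) strings.
    This layout is an assumption for illustration only.
    Returns {region: total word errors / total reference words}.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for region, reference, hypothesis in samples:
        ref = reference.split()
        hyp = hypothesis.split()
        errors[region] += edit_distance(ref, hyp)
        words[region] += len(ref)
    # Skip regions with no reference words to avoid division by zero.
    return {r: errors[r] / words[r] for r in words if words[r] > 0}


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from TVoice.
    samples = [
        ("Ege", "bugün hava çok güzel", "bugün hava güzel"),
        ("Doğu", "nereye gidiyorsun", "nereye gidiyorsun"),
    ]
    for region, wer in wer_by_region(samples).items():
        print(f"{region}: WER = {wer:.2f}")
```

A large gap between regions in such a breakdown would suggest that a model needs more training data for the weaker dialect. The whitespace tokenization here is deliberately simple; a real evaluation would likely normalize casing and punctuation first.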