Korean Undergraduates Launch ‘Dia,’ an Emotion-Driven AI Voice Model Running on a Single GPU

2025-04-23 08:35

Newsroom

Born in Seoul, not Silicon Valley: ‘Dia,’ an AI TTS model capable of emotional expression, unveiled

Image source: Unblock Media

- Dia, developed by Toby Kim and Jaeyong Sung, is a 1.6-billion-parameter TTS model capable of expressing emotion
- It runs in real time on a single GPU and reportedly outperforms ElevenLabs and Sesame CSM

[Unblock Media] A shift in AI voice generation is being led not by a tech giant, but by two undergraduate students in South Korea. "Dia" is a text-to-speech (TTS) model created by Toby Kim and Jaeyong Sung at Nari Labs. It can reproduce convincing emotion, screaming, and genuine-sounding alarm. With 1.6 billion parameters, it streams in real time on a single GPU. According to Deedy Das, who introduced the model on Twitter, it clearly surpasses industry leaders such as ElevenLabs and Sesame CSM.

Deedy Das (@deedydas) on X:

We just solved text-to-speech AI.

This model can simulate perfect emotion, screaming and show genuine alarm.

- clearly beats 11 labs and Sesame
- it’s only 1.6B params
- streams realtime on 1 GPU
- made by a 1.5 person team in Korea!!

It's called Dia by Nari Labs.

4:15 PM · Apr 22, 2025

Das wrote, "Audio has likely reached the point where it’s no longer distinguishable. Many won’t realize it’s AI."

Toby Kim started the project after becoming intrigued by the podcast feature in Google’s NotebookLM and disappointed by how robotic existing TTS APIs sounded. The biggest challenge the team faced was computing power. They obtained access to Google’s TPU Research Cloud and taught themselves large-scale training tools such as JAX, Flax, and Pallas kernels. After three months, they had fully trained Dia.

Dia is now moving from a research model to a B2C application. The app lets users create natural conversations, remix voice content, and share expressive outputs with friends, lowering the barrier to expressive AI voice interaction. Use cases include personal emotional voice assistants, AI-driven storytelling, and voice-supported tools for healthcare.

Unlike TTS products such as ElevenLabs, which focus on clarity and rhythm, Dia’s strength lies in emotional fidelity. Its low inference cost and ability to run on a single GPU mark a significant step toward real-time voice interfaces, virtual characters, and emotion-centric AI. And the model was developed not in Silicon Valley, but in Seoul.