Two AITRICS Speech Papers Selected for ICASSP 2025


Editorial context: This article is part of Startup Korea's original market analysis coverage. It is written to explain startup trends, business model risks, and technology adoption signals for general information, not as investment advice.

admin

Apr 15, 2025 - 00:00

AI healthcare company AITRICS has had two papers on speech AI accepted at ‘ICASSP 2025’, a global academic conference on speech, audio, and signal processing. At the conference, held from April 6 to 11 in Hyderabad, India, AITRICS presented its speech AI research through poster presentations.

One of the accepted papers, 'Stable-TTS: Stable Speaker-Adaptive Text-to-Speech Synthesis via Prosody Prompting,' introduces a framework that reproduces a specific speaker's speaking style and intonation from only a small amount of voice data. The work targets the unstable audio quality of existing speaker-adaptive models. By combining a Prosody Language Model (PLM) with prior preservation learning, it produces stable, natural speech even when the available samples are limited or noisy. In particular, it proved effective at generating voices with high speaker similarity from low-quality samples or small amounts of data.

The other paper, 'Face-StyleSpeech: Improved Face-Voice Mapping for Enhanced Face Image-based Zero-shot Speech Synthesis,' presents a zero-shot TTS model that generates realistic voices from a single facial image. The model captures speaker characteristics inferred from the face and combines them with prosody information (Prosody Codes), and it shows significantly improved naturalness and expressiveness compared with existing face-based speech synthesis models.

Han Wooseok, a researcher at AITRICS, said the results demonstrate that natural and stable voices can be generated even when data is limited. The technology is expected to be particularly useful where data is difficult to obtain, such as in real clinical settings.

He added that the research will serve as a stepping stone toward multimodal LLMs (Large Language Models) that handle speech and images as well as text. AITRICS plans to continue its research and development with the aim of delivering medical AI services that offer users a better experience and higher reliability.