Kakao's Multimodal AI Kanana-o Proves Global Competitiveness
Kakao has opened a new horizon in AI technology by unveiling 'Kanana-o', an integrated multimodal language model that processes text, voice, and images simultaneously. The model marks a clear evolution toward an AI that sees, hears, speaks, and even empathizes like a human.
'Kanana-o' is the result of efficiently integrating 'Kanana-v', a model specialized in image processing, and 'Kanana-a', a model strong in audio understanding and generation, using 'model merging' technology. Furthermore, through 'merged learning' that simultaneously trains image, audio, and text data, it has acquired the ability to comprehensively understand visual and auditory information and connect it with text. Now, no matter what combination of questions is entered, it can process them seamlessly and respond with contextually appropriate text or remarkably natural speech.
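Kakao has not published the details of its merging recipe, but the simplest form of 'model merging' is element-wise interpolation of the weights of two models that share an architecture. The sketch below is purely illustrative: the function and parameter names, and the use of plain lists to stand in for weight tensors, are assumptions, not Kakao's implementation.

```python
# Minimal sketch of model merging via weight interpolation.
# All names are hypothetical; the real Kanana-o recipe (layer selection,
# per-layer weights, shared backbone) has not been disclosed.
def merge_weights(vision_model, audio_model, alpha=0.5):
    """Blend two same-shaped 'state dicts' parameter by parameter.

    alpha=0.5 gives an equal blend of the two source models.
    """
    merged = {}
    for name, w_v in vision_model.items():
        w_a = audio_model[name]
        merged[name] = [alpha * v + (1 - alpha) * a for v, a in zip(w_v, w_a)]
    return merged

# Toy example: tiny stand-ins for a vision-specialized and an
# audio-specialized model sharing one layer.
kanana_v = {"layer1.weight": [1.0, 2.0]}
kanana_a = {"layer1.weight": [3.0, 4.0]}
print(merge_weights(kanana_v, kanana_a))  # {'layer1.weight': [2.0, 3.0]}
```

In practice a merged model is then fine-tuned jointly, which corresponds to the 'merged learning' step the article describes.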
Particularly noteworthy is 'Kanana-o's voice emotion recognition technology. It analyzes non-verbal cues such as the user's intonation, tone of voice, and vocal tremors to generate emotional, natural voice responses that fit the conversation context. Trained on a large-scale Korean dataset, it precisely reflects Korean's distinctive sentence structure, intonation, and ending changes, and can also recognize regional dialects, such as those of the Jeju and Gyeongsang regions, and convert them to standard Korean. Kakao is also accelerating development of its own Korean voice tokenizer.
Thanks to streaming-based voice synthesis technology, users can receive instant responses without long waiting times. For example, if a user says, "Create a fairy tale that matches this picture" along with an image, 'Kanana-o' understands the voice, analyzes the user's emotions, and tells a creative story in real-time.
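The point of streaming synthesis is that audio playback can begin after the first chunk rather than after the entire reply is generated. The sketch below illustrates the idea with stand-in generator functions; every name here is hypothetical, since Kakao's actual pipeline is not public.

```python
# Illustrative sketch of streaming speech synthesis (not Kakao's API).
def generate_text(prompt):
    # Stand-in for a language model streaming its reply token by token.
    for token in ["Once", "upon", "a", "time"]:
        yield token

def synthesize_chunk(token):
    # Stand-in for a TTS step turning one token into an audio chunk.
    return f"<audio:{token}>"

def streaming_tts(prompt):
    # Audio is yielded as soon as each text token arrives, so perceived
    # latency is roughly one chunk instead of the whole utterance.
    for token in generate_text(prompt):
        yield synthesize_chunk(token)

for chunk in streaming_tts("Tell a fairy tale about this picture"):
    print(chunk)
```

The design trade-off is latency versus prosody: smaller chunks start playback sooner, while larger chunks give the synthesizer more context for natural intonation.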
In benchmark tests, 'Kanana-o' stood shoulder-to-shoulder with top global models in both Korean and English evaluations, and particularly excelled in Korean benchmarks. In emotion recognition capabilities, it led by a significant margin in both Korean and English, clearly demonstrating the potential for AI to understand and communicate emotions.
Going forward, Kakao plans to focus 'Kanana-o' development on multi-turn dialogue processing, stronger full-duplex response capabilities, and safety measures to prevent inappropriate responses. The goal is to improve the user experience in multi-speaker voice environments and achieve interactions closer to natural, real-world conversation. Kim Byung-hak, Kakao's Kanana Performance Leader, stated, "The Kanana model is evolving into an AI that sees, hears, speaks, and empathizes like a human," and affirmed that Kakao would contribute to the domestic AI ecosystem through proprietary multimodal technology and continued sharing of research results. 'Kanana-o' represents a remarkable step forward for Kakao's own 'Kanana' model lineup, unveiled last year, and raises expectations for what comes next.