Two years after ChatGPT's release, while the AI community focuses on advanced reasoning capabilities in English with models like o1, multilingual development deserves more attention! Qian Liu proudly presents Sailor2, a community-driven project delivering state-of-the-art multilingual language models at three scales: 1B, 8B, and 20B parameters.
Released under the Apache 2.0 license, these models specialize in South-East Asian (SEA) languages, making advanced models more accessible across the region. 🌏
✈️ Building upon the foundation of Qwen2.5, Sailor2 is continually pre-trained on 500B high-quality tokens to support 15 languages: English, Chinese, Burmese 🇲🇲, Cebuano🇵🇭, Ilocano🇵🇭, Indonesian🇮🇩, Javanese🇮🇩, Khmer🇰🇭, Lao🇱🇦, Malay🇲🇾, Sundanese🇮🇩, Tagalog🇵🇭, Thai🇹🇭, Vietnamese🇻🇳, and Waray🇵🇭.
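The chat models work with standard 🤗 Transformers APIs. Here is a minimal sketch; the repo ID is an assumption inferred from the demo Space linked below, and the generation settings are illustrative:

```python
# Minimal sketch: chatting with a Sailor2 model via Hugging Face Transformers.
# "sail/Sailor2-8B-Chat" is an assumed repo ID (inferred from the demo Space
# name); swap in the 1B or 20B variant as needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-8B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A Vietnamese prompt: "Hello! Can you introduce Vietnam?"
messages = [{"role": "user", "content": "Chào bạn! Bạn có thể giới thiệu về Việt Nam không?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```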
During development, we employ a range of advanced technologies to ensure top-tier performance and efficiency:
- 1️⃣ model expansion 📈 (see the sketch after this list)
- 2️⃣ optimized data mixing strategies 🔄
- 3️⃣ multi-stage pre-training protocols 🔬
- 4️⃣ advanced multilingual post-training⚡
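To make item 1️⃣ concrete: model expansion grows a smaller checkpoint (e.g., a Qwen2.5 base) into a deeper one before continual pre-training. One common recipe is block expansion in the style of LLaMA Pro, where duplicated layers get zero-initialized output projections so the expanded model starts out numerically identical to the original. The sketch below illustrates that idea and is not necessarily Sailor2's exact procedure:

```python
# Illustrative sketch of depth (block) expansion, à la LLaMA Pro.
# This shows one common way such expansion can work, not Sailor2's
# exact recipe.
import copy
import torch
from transformers import AutoModelForCausalLM

def expand_blocks(model, every_n: int = 4):
    """Insert a copy after every `every_n` decoder layers. Zeroing the
    copies' attention and MLP output projections makes each new block an
    identity residual, so the expanded model's outputs are unchanged
    until continual pre-training updates them."""
    expanded = torch.nn.ModuleList()
    for i, layer in enumerate(model.model.layers):
        expanded.append(layer)
        if (i + 1) % every_n == 0:
            new_layer = copy.deepcopy(layer)
            new_layer.self_attn.o_proj.weight.data.zero_()
            new_layer.mlp.down_proj.weight.data.zero_()
            expanded.append(new_layer)
    for idx, layer in enumerate(expanded):
        layer.self_attn.layer_idx = idx  # keep KV-cache layer indices consistent
    model.model.layers = expanded
    model.config.num_hidden_layers = len(expanded)
    return model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
model = expand_blocks(model, every_n=4)
print(model.config.num_hidden_layers)  # deeper model, ready for continual pre-training
```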
Despite its compact 20B-parameter size, our flagship base model matches or surpasses significantly larger counterparts across SEA languages, including Qwen2.5-32B, Gemma2-27B, Llama3.1-70B, and Aya-Expanse-32B 🎯. Our 20B chat model achieves a roughly 50-50 win rate against GPT-4o on most SEA languages, with GPT-4o itself serving as the judge!
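The head-to-head numbers come from pairwise LLM-as-judge evaluation: the judge sees a question plus two anonymized answers and picks a winner, and ties conventionally count as half a win. A minimal sketch of such a loop follows; the judge prompt and scoring convention are illustrative assumptions, not Sailor2's exact evaluation harness:

```python
# Minimal sketch of pairwise win-rate scoring with GPT-4o as the judge.
# The judge prompt and tie handling below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Given a user question and two
answers, reply with exactly one letter: A if answer A is better, B if answer
B is better, or T for a tie. Judge helpfulness, fluency, and correctness in
the question's language.

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}"""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b
            ),
        }],
    )
    return resp.choices[0].message.content.strip()[:1]  # "A", "B", or "T"

def win_rate(pairs) -> float:
    """pairs: iterable of (question, our_answer, baseline_answer).
    A tie counts as half a win, the usual convention behind 50-50 scores."""
    score = 0.0
    total = 0
    for question, ours, baseline in pairs:
        verdict = judge(question, ours, baseline)
        score += {"A": 1.0, "T": 0.5}.get(verdict, 0.0)
        total += 1
    return score / total if total else 0.0
```

In practice, judges exhibit position bias, so evaluations usually run each pair twice with the answer order swapped and average the two verdicts.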
Demo: https://huggingface.co/spaces/sail/Sailor2-20B-Chat
Read related articles: