Sailor2

Two years after ChatGPT’s release, while the AI community focuses on advanced reasoning capabilities in English with models like o1, multilingual development deserves more attention! Qian Liu proudly presents Sailor2, a community-driven project delivering state-of-the-art multilingual language models at three scales: 0.8B, 8B, and 20B parameters.

Released under the Apache 2.0 license, these models specialize in South-East Asian (SEA) languages, making advanced models more accessible across the region. 🌏

✈️ Building upon the foundation of Qwen2.5, Sailor2 is continually pre-trained on 500B high-quality tokens to support 15 languages, including English, Chinese, Burmese 🇲🇲, Cebuano 🇵🇭, Ilocano 🇵🇭, Indonesian 🇮🇩, Javanese 🇮🇩, Khmer 🇰🇭, Lao 🇱🇦, Malay 🇲🇾, Sundanese 🇮🇩, Tagalog 🇵🇭, Thai 🇹🇭, Vietnamese 🇻🇳, and Waray 🇵🇭.

During development, we employ a range of advanced technologies to ensure top-tier performance and efficiency:

  • 1️⃣ model expansion 📈
  • 2️⃣ optimized data mixing strategies 🔄
  • 3️⃣ multi-stage pre-training protocols 🔬
  • 4️⃣ advanced multilingual post-training ⚡
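To give a feel for the first technique, model expansion typically means growing a smaller base model into a larger one, for example by duplicating transformer layers before continued pre-training, so the expanded model starts from a good initialization instead of random weights. The sketch below is an illustrative assumption of simple depth up-scaling with interleaved layer copies, not Sailor2's exact recipe:

```python
# Minimal sketch of depth up-scaling ("model expansion"):
# grow a stack of transformer layers by interleaving copies of
# existing layers. Layers are stand-ins here (any objects work);
# in practice they would be transformer blocks with weights.
# This is an assumed illustration, not Sailor2's documented method.
import copy

def expand_depth(layers, target_depth):
    """Duplicate layers in place order until target_depth is reached."""
    if target_depth < len(layers):
        raise ValueError("target_depth must be >= current depth")
    n = len(layers)
    extra_total = target_depth - n
    expanded = []
    for i, layer in enumerate(layers):
        expanded.append(layer)
        # Spread the extra copies as evenly as possible across the stack.
        extra = extra_total // n + (1 if i < extra_total % n else 0)
        for _ in range(extra):
            expanded.append(copy.deepcopy(layer))
    return expanded

# Example: expand an 8-layer stack (labelled 0..7) to 12 layers.
small = list(range(8))
big = expand_depth(small, 12)
assert len(big) == 12
```

Keeping duplicated layers adjacent to their originals preserves the model's overall function at initialization better than appending fresh random layers, which is why the continued pre-training can converge faster.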
Sailor2 20B base model benchmark

Despite its compact 20B-parameter size, our flagship base model matches or surpasses significantly larger counterparts, including Qwen2.5-32B, Gemma2-27B, Llama3.1-70B, and Aya-Expanse-32B, across SEA languages 🎯. Our 20B chat model achieves a 50-50 win rate against GPT-4o on most SEA languages, with GPT-4o as the judge!

Demo

https://huggingface.co/spaces/sail/Sailor2-20B-Chat
