Apple has officially entered the language model landscape with the release of DCLM-7B, an open-source language model published together with its weights, training code, and training dataset.
Key Highlights
- Model Specifications: The 7B base model is trained on 2.5 trillion tokens of primarily English data, with a 2048-token context window.
- Training Data: Combines datasets from DCLM-BASELINE, StarCoder, and ProofPile2.
- Performance: The model achieves an MMLU score of 63.7, placing it above Mistral-0.3 (62.7) but below Llama3 (66.2).
- License: Released under an open license, specifically the Apple Sample Code License.
- Comparison: Matches the performance of closed-dataset models like Mistral.
- Training Framework: Developed using PyTorch and the OpenLM framework.
- Availability: The model is available on Hugging Face and usable from the Transformers library (a loading sketch follows this list).
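For reference, here is a minimal sketch of loading the model through Transformers. The Hub id `apple/DCLM-7B` and the use of `trust_remote_code=True` are assumptions based on the release; the official model card may additionally require the `open_lm` package.

```python
# Minimal sketch: loading DCLM-7B via Hugging Face Transformers.
# The model id "apple/DCLM-7B" and any extra requirements (e.g. the open_lm
# package or trust_remote_code=True) are assumptions -- check the model card.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "apple/DCLM-7B"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Systematic data curation improves language models because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```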
| Model | Params | Tokens | Open dataset? | CORE | MMLU | EXTENDED |
|---|---|---|---|---|---|---|
| **Open weights, closed datasets** | | | | | | |
| Llama2 | 7B | 2T | ✗ | 49.2 | 45.8 | 34.1 |
| DeepSeek | 7B | 2T | ✗ | 50.7 | 48.5 | 35.3 |
| Mistral-0.3 | 7B | ? | ✗ | 57.0 | 62.7 | 45.1 |
| QWEN-2 | 7B | ? | ✗ | 57.5 | 71.9 | 50.5 |
| Llama3 | 8B | 15T | ✗ | 57.6 | 66.2 | 46.3 |
| Gemma | 8B | 6T | ✗ | 57.8 | 64.3 | 44.6 |
| Phi-3 | 7B | ? | ✗ | 61.0 | 69.9 | 57.9 |
| **Open weights, open datasets** | | | | | | |
| Falcon | 7B | 1T | ✓ | 44.1 | 27.4 | 25.1 |
| OLMo-1.7 | 7B | 2.1T | ✓ | 47.0 | 54.0 | 34.2 |
| MAP-Neo | 7B | 4.5T | ✓ | 50.2 | 57.1 | 40.4 |
| DCLM-7B | 7B | 2.5T | ✓ | 56.1 | 63.7 | 43.6 |
Model Card for DCLM-Baseline-7B
DCLM-Baseline-7B is a language model with 7 billion parameters, trained on the DCLM-Baseline dataset, which is part of the DataComp for Language Models (DCLM) benchmark. This model aims to demonstrate the benefits of systematic data curation techniques in enhancing language model performance.
Model Details
| Size | Training Tokens | Layers | Hidden Size | Attention Heads | Context Length |
|---|---|---|---|---|---|
| 7B | 2.5T | 32 | 4096 | 32 | 2048 |
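As a rough sanity check, these dimensions are consistent with a roughly 7B parameter count. The sketch below assumes a Llama-style SwiGLU MLP width (~11008) and a ~32k vocabulary; neither value is stated in the table above.

```python
# Back-of-envelope parameter count from the specs above, as a rough sanity
# check. The MLP width (~11008, a Llama-style SwiGLU expansion) and the
# vocabulary size (~32k) are assumptions, not published DCLM values.
d_model, n_layers, vocab = 4096, 32, 32_000
d_ff = 11_008  # assumed SwiGLU intermediate size

attn = 4 * d_model * d_model            # Q, K, V, O projections per layer
mlp = 3 * d_model * d_ff                # gate, up, down projections per layer
embeddings = 2 * vocab * d_model        # input embeddings + output head

total = n_layers * (attn + mlp) + embeddings
print(f"{total / 1e9:.2f}B parameters")  # ~6.7B, i.e. the "7B" class
```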
Model Sources
- Repository: https://github.com/mlfoundations/dclm
- Dataset: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
- Paper: DataComp-LM: In search of the next generation of training sets for language models
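Since the DCLM-Baseline dataset linked above is hosted on the Hugging Face Hub, it can in principle be streamed without a full download. The sketch below assumes the standard `datasets` streaming API and a `text` field in each record; the actual schema may differ.

```python
# Minimal sketch: streaming a few records from the DCLM-Baseline dataset.
# The dataset id matches the link above; the "text" field name is an
# assumption about the record schema.
from datasets import load_dataset

ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
for i, record in enumerate(ds):
    print(record.get("text", "")[:200])  # preview the first 200 characters
    if i >= 2:
        break
```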
This release marks a significant step for Apple, contributing to the open-source AI community and providing developers with robust tools for natural language processing tasks.