Apple has officially entered the language model landscape with the release of DCLM-7B, an open-source language model published together with its weights, training code, and training dataset.
Key Highlights
- Model Specifications: The 7B base model is trained on 2.5 trillion tokens of primarily English data, with a 2048-token context window.
- Training Data: Combines datasets from DCLM-BASELINE, StarCoder, and ProofPile2.
- Performance: The model achieves an MMLU score of 63.7, placing it above Mistral-0.3 (62.7) but below Llama3 (66.2).
- License: Released under an open license, specifically the Apple Sample Code License.
- Comparison: Matches the performance of closed-dataset models like Mistral.
- Training Framework: Developed using PyTorch and the OpenLM framework.
- Availability: The model is available on Hugging Face and usable from the Transformers library (a loading sketch follows this list).
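For reference, here is a minimal sketch of loading the model through Transformers. The Hub id `apple/DCLM-7B` and the use of `trust_remote_code=True` are assumptions based on the release; the official model card may additionally require the `open_lm` package.

```python
# Minimal sketch: loading DCLM-7B via Hugging Face Transformers.
# The model id "apple/DCLM-7B" and any extra requirements (e.g. the open_lm
# package or trust_remote_code=True) are assumptions -- check the model card.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "apple/DCLM-7B"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Systematic data curation improves language models because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```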
| Model | Params | Tokens | Open dataset? | CORE | MMLU | EXTENDED |
|---|---|---|---|---|---|---|
| **Open weights, closed datasets** | | | | | | |
| Llama2 | 7B | 2T | ✗ | 49.2 | 45.8 | 34.1 |
| DeepSeek | 7B | 2T | ✗ | 50.7 | 48.5 | 35.3 |
| Mistral-0.3 | 7B | ? | ✗ | 57.0 | 62.7 | 45.1 |
| QWEN-2 | 7B | ? | ✗ | 57.5 | 71.9 | 50.5 |
| Llama3 | 8B | 15T | ✗ | 57.6 | 66.2 | 46.3 |
| Gemma | 8B | 6T | ✗ | 57.8 | 64.3 | 44.6 |
| Phi-3 | 7B | ? | ✗ | 61.0 | 69.9 | 57.9 |
| **Open weights, open datasets** | | | | | | |
| Falcon | 7B | 1T | ✓ | 44.1 | 27.4 | 25.1 |
| OLMo-1.7 | 7B | 2.1T | ✓ | 47.0 | 54.0 | 34.2 |
| MAP-Neo | 7B | 4.5T | ✓ | 50.2 | 57.1 | 40.4 |
| DCLM-7B | 7B | 2.5T | ✓ | 56.1 | 63.7 | 43.6 |
Model Card for DCLM-Baseline-7B
DCLM-Baseline-7B is a language model with 7 billion parameters, trained on the DCLM-Baseline dataset, which is part of the DataComp for Language Models (DCLM) benchmark. This model aims to demonstrate the benefits of systematic data curation techniques in enhancing language model performance.
Model Details
| Size | Training Tokens | Layers | Hidden Size | Attention Heads | Context Length |
|---|---|---|---|---|---|
| 7B | 2.5T | 32 | 4096 | 32 | 2048 |
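As a rough sanity check, these dimensions are consistent with a roughly 7B parameter count. The sketch below assumes a Llama-style SwiGLU MLP width (~11008) and a ~32k vocabulary; neither value is stated in the table above.

```python
# Back-of-envelope parameter count from the specs above, as a rough sanity
# check. The MLP width (~11008, a Llama-style SwiGLU expansion) and the
# vocabulary size (~32k) are assumptions, not published DCLM values.
d_model, n_layers, vocab = 4096, 32, 32_000
d_ff = 11_008  # assumed SwiGLU intermediate size

attn = 4 * d_model * d_model            # Q, K, V, O projections per layer
mlp = 3 * d_model * d_ff                # gate, up, down projections per layer
embeddings = 2 * vocab * d_model        # input embeddings + output head

total = n_layers * (attn + mlp) + embeddings
print(f"{total / 1e9:.2f}B parameters")  # ~6.7B, i.e. the "7B" class
```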
Model Sources
- Repository: https://github.com/mlfoundations/dclm
- Dataset: https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
- Paper: DataComp-LM: In search of the next generation of training sets for language models
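Since the DCLM-Baseline dataset linked above is hosted on the Hugging Face Hub, it can in principle be streamed without a full download. The sketch below assumes the standard `datasets` streaming API and a `text` field in each record; the actual schema may differ.

```python
# Minimal sketch: streaming a few records from the DCLM-Baseline dataset.
# The dataset id matches the link above; the "text" field name is an
# assumption about the record schema.
from datasets import load_dataset

ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
for i, record in enumerate(ds):
    print(record.get("text", "")[:200])  # preview the first 200 characters
    if i >= 2:
        break
```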
This release marks a significant step for Apple, contributing to the open-source AI community and providing developers with robust tools for natural language processing tasks.