Pixtral 12B (pixtral-12b-240910)

Mistral has released Pixtral 12B, a vision-language model (pixtral-12b-240910). Some notes on the release follow.

Text backbone: Mistral Nemo 12B
Vision Adapter: 400M
Uses GeLU (for vision adapter) & 2D RoPE (for vision encoder)
Larger vocabulary – 131,072
Three new special tokens – img, img_break, img_end
Image size: 1024 x 1024 pixels
Patch size: 16 x 16 pixels
Tokenizer support in mistral_common
Model weights in bf16
Haven’t seen the inference code yet
Weights up on Hugging Face Hub
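
To make the image and patch sizes above concrete, here is a rough sketch of how many tokens one image might take up. It assumes a layout that is not confirmed in these notes: one img token per 16 x 16 patch, an img_break after every row of patches except the last, and a single img_end closing the image. It also assumes the image is not resized. The function name image_token_count is hypothetical.

```python
import math


def image_token_count(width: int, height: int, patch_size: int = 16) -> int:
    """Estimate the tokens used by one image (assumed layout, see above)."""
    # Number of patches along each axis; partial patches are rounded up.
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    # Each row contributes `cols` img tokens plus one terminator:
    # img_break for every row except the last, img_end for the last.
    return rows * (cols + 1)


print(image_token_count(1024, 1024))  # 64 rows * (64 + 1)
print(image_token_count(64, 64))      # 4 rows * (4 + 1)
```

Under these assumptions a full 1024 x 1024 image costs 4160 tokens, so image-heavy prompts consume context quickly.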

Installation

mistral_common now has image support! You can pass images and image URLs alongside text in the user message.

pip install --upgrade mistral_common

To use the model checkpoint:

# pip install huggingface-hub

from huggingface_hub import snapshot_download

snapshot_download(repo_id="mistral-community/pixtral-12b-240910", local_dir="...")

Images

You can encode images as follows:

from mistral_common.protocol.instruct.messages import (
    UserMessage,
    TextChunk,
    ImageURLChunk,
    ImageChunk,
)
from PIL import Image
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

tokenizer = MistralTokenizer.from_model("pixtral")

image = Image.new('RGB', (64, 64))

# tokenize images and text
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            UserMessage(
                content=[
                    TextChunk(text="Describe this image"),
                    ImageChunk(image=image),
                ]
            )
        ],
        model="pixtral",
    )
)
tokens, text, images = tokenized.tokens, tokenized.text, tokenized.images

# Count the number of tokens
print("# tokens", len(tokens))
print("# images", len(images))

Image URLs

You can also pass image URLs, which are downloaded automatically:

url_dog = "https://picsum.photos/id/237/200/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"

# tokenize image urls and text
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            UserMessage(
                content=[
                    TextChunk(text="Can this animal"),
                    ImageURLChunk(image_url=url_dog),
                    TextChunk(text="live here?"),
                    ImageURLChunk(image_url=url_mountain),
                ]
            )
        ],
        model="pixtral",
    )
)
tokens, text, images = tokenized.tokens, tokenized.text, tokenized.images

# Count the number of tokens
print("# tokens", len(tokens))
print("# images", len(images))
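
If the image lives on disk rather than behind a public URL, one option is to wrap its bytes in a base64 data URL. The sketch below uses only the standard library; the helper name to_data_url is hypothetical, and whether ImageURLChunk accepts such data URLs is an assumption to check against your installed mistral_common version.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Wrap raw image bytes in a base64 data URL."""
    # b64encode returns bytes; decode to str so it can be embedded in the URL.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Example with placeholder bytes; in practice read them from an image file.
print(to_data_url(b"abc"))
```

The resulting string could then be tried as the image_url argument in place of the http URLs above.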
