Pixtral 12B

Pixtral 12B (pixtral-12b-240910)

Mistral has released Pixtral 12B, a vision language model (pixtral-12b-240910). Some notes on the release:

  1. Text backbone: Mistral Nemo 12B
  2. Vision Adapter: 400M parameters
  3. Uses GeLU (for vision adapter) & 2D RoPE (for vision encoder)
  4. Larger vocabulary – 131,072
  5. Three new special tokens – img, img_break, img_end
  6. Image size: 1024 x 1024 pixels
  7. Patch size: 16 x 16 pixels
  8. Tokenizer support in mistral_common
  9. Model weights in bf16
  10. Haven’t seen the inference code yet
  11. Weights up on Hugging Face Hub
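
To get a feel for what the image size and patch size mean for sequence length, here is a rough back-of-the-envelope sketch. It assumes one token per 16 x 16 patch, one img_break token at the end of every patch row except the last, and a single img_end token closing the image. That layout is an assumption inferred from the special tokens listed above, not something confirmed by inference code, and the function name estimate_image_tokens is made up for illustration. It also ignores any resizing the tokenizer may apply.

```python
import math

PATCH_SIZE = 16  # patch size from the notes above


def estimate_image_tokens(width: int, height: int, patch_size: int = PATCH_SIZE) -> int:
    """Estimate the token count for one image (hypothetical helper).

    Assumed layout: one token per patch, plus one separator per patch row
    (img_break after every row except the last, img_end after the last).
    """
    cols = math.ceil(width / patch_size)   # patches per row
    rows = math.ceil(height / patch_size)  # number of patch rows
    return rows * cols + rows              # patch tokens + row separators


print(estimate_image_tokens(1024, 1024))  # 64 rows * 64 cols + 64 = 4160
print(estimate_image_tokens(64, 64))      # 4 rows * 4 cols + 4 = 20
```

Under these assumptions a full 1024 x 1024 image costs about 4,160 tokens, so a few large images can take up a noticeable share of the context window.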

Installation

mistral_common now has image support: you can pass images and image URLs alongside text in the user message.

pip install --upgrade mistral_common

To use the model checkpoint:

# pip install huggingface-hub

from huggingface_hub import snapshot_download

snapshot_download(repo_id="mistral-community/pixtral-12b-240910", local_dir="...")

Images

You can encode images as follows:

from mistral_common.protocol.instruct.messages import (
    UserMessage,
    TextChunk,
    ImageURLChunk,
    ImageChunk,
)
from PIL import Image
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

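# load the tokenizer that ships with image support
tokenizer = MistralTokenizer.from_model("pixtral")

# blank 64 x 64 RGB image used as a placeholder input
image = Image.new('RGB', (64, 64))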

# tokenize images and text
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            UserMessage(
                content=[
                    TextChunk(text="Describe this image"),
                    ImageChunk(image=image),
                ]
            )
        ],
        model="pixtral",
    )
)
tokens, text, images = tokenized.tokens, tokenized.text, tokenized.images

# Count the number of tokens
print("# tokens", len(tokens))
print("# images", len(images))

Image URLs

You can also pass an image URL, and the image will be downloaded automatically:

url_dog = "https://picsum.photos/id/237/200/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"

# tokenize image urls and text
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(
        messages=[
            UserMessage(
                content=[
                    TextChunk(text="Can this animal"),
                    ImageURLChunk(image_url=url_dog),
                    TextChunk(text="live here?"),
                    ImageURLChunk(image_url=url_mountain),
                ]
            )
        ],
        model="pixtral",
    )
)
tokens, text, images = tokenized.tokens, tokenized.text, tokenized.images

# Count the number of tokens
print("# tokens", len(tokens))
print("# images", len(images))
