Mistral released Pixtral 12B Vision Language Model (pixtral-12b-240910). Some notes on the release below.
- Text backbone: Mistral Nemo 12B
- Vision Adapter: 400M
- Uses GeLU (for vision adapter) & 2D RoPE (for vision encoder)
- Larger vocabulary – 131,072
- Three new special tokens –
img
,img_break
,img_end
- Image size: 1024 x 1024 pixels
- Patch size: 16 x 16 pixels
- Tokenizer support in mistral_common
- Model weights in bf16
- Haven’t seen the inference code yet
- Weights up on Hugging Face Hub
Installation
Mistral common has image support! You can now pass images and URLs alongside text into the user message.
pip install --upgrade mistral_common
To use the model checkpoint:
# pip install huggingface-hub
from huggingface_hub import snapshot_download
snapshot_download(repo_id="mistral-community/pixtral-12b-240910", local_dir="...")
Images
You can encode images as follows:
from mistral_common.protocol.instruct.messages import (
UserMessage,
TextChunk,
ImageURLChunk,
ImageChunk,
)
from PIL import Image
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
tokenizer = MistralTokenizer.from_model("pixtral")
image = Image.new('RGB', (64, 64))
# tokenize images and text
tokenized = tokenizer.encode_chat_completion(
ChatCompletionRequest(
messages=[
UserMessage(
content=[
TextChunk(text="Describe this image"),
ImageChunk(image=image),
]
)
],
model="pixtral",
)
)
tokens, text, images = tokenized.tokens, tokenized.text, tokenized.images
# Count the number of tokens
print("# tokens", len(tokens))
print("# images", len(images))
Image URLs
You can pass image url which will be automatically downloaded
url_dog = "https://picsum.photos/id/237/200/300"
url_mountain = "https://picsum.photos/seed/picsum/200/300"
# tokenize image urls and text
tokenized = tokenizer.encode_chat_completion(
ChatCompletionRequest(
messages=[
UserMessage(
content=[
TextChunk(text="Can this animal"),
ImageURLChunk(image_url=url_dog),
TextChunk(text="live here?"),
ImageURLChunk(image_url=url_mountain),
]
)
],
model="pixtral",
)
)
tokens, text, images = tokenized.tokens, tokenized.text, tokenized.images
# Count the number of tokens
print("# tokens", len(tokens))
print("# images", len(images))
Read related articles: