Mistral AI: Models and API Guide
Mistral AI offers a range of powerful open-weight models from the efficient 7B parameter model to the massive Mixtral 8x22B mixture-of-experts architecture. This guide covers everything from API access to self-hosting, quantization, and fine-tuning.
Model Family Overview
Mistral offers several models optimized for different use cases:
| Model | Parameters | Architecture | Best For |
|---|---|---|---|
| Mistral 7B | 7B | Dense transformer | Edge devices, fast inference |
| Mixtral 8x7B | 46.7B (12.9B active) | Mixture-of-Experts | High quality, good speed |
| Mixtral 8x22B | 141B (39B active) | Mixture-of-Experts | Best quality, larger compute |
| Mistral Large | Unknown | Proprietary | Enterprise (via API) |
| Codestral | 22B | Code-optimized | Code generation |
| Mistral Nemo | 12B | Optimized dense | Balanced quality/speed |
| Mistral Small | Unknown | Efficient | Cost-effective API calls |
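The total and active counts in the Parameters column can be reproduced from layer shapes. The sketch below assumes the Mixtral 8x7B hyperparameters reported in its paper (hidden size 4096, expert feed-forward size 14336, 32 layers, 8 experts with 2 routed per token, 8 KV heads of dimension 128, a 32,000-token vocabulary). The function name and its structure are illustrative, not part of any library.

```python
def mixtral_param_counts(
    dim=4096,             # hidden size (assumed from the Mixtral 8x7B paper)
    ffn_dim=14336,        # per-expert feed-forward size
    n_layers=32,
    n_experts=8,
    experts_per_token=2,  # experts the router activates for each token
    n_kv_heads=8,
    head_dim=128,
    vocab=32000,
):
    """Return (total, active-per-token) parameter counts for a Mixtral-style MoE."""
    kv_dim = n_kv_heads * head_dim
    attention = 2 * dim * dim + 2 * dim * kv_dim  # q/o plus k/v projections
    expert = 3 * dim * ffn_dim                    # gate, up and down matrices
    router = dim * n_experts                      # linear layer that scores experts
    norms = 2 * dim                               # two RMSNorm weight vectors
    shared_per_layer = attention + router + norms
    embeddings = 2 * vocab * dim + dim            # input table, output head, final norm

    total = n_layers * (shared_per_layer + n_experts * expert) + embeddings
    active = n_layers * (shared_per_layer + experts_per_token * expert) + embeddings
    return total, active


total, active = mixtral_param_counts()
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B")
```

With these shapes the script prints 46.7B total and 12.9B active. Only the expert feed-forward blocks are skipped for a given token; attention, embeddings, and the router always run, which is why the active count is well above 2/8 of the total.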
def get_model_info(model_name):
    """Get information about Mistral models."""
    # VRAM figures are approximate and cover model weights only
    # (parameters x bytes per parameter); KV cache and activations add more.
    models = {
        "mistral-7b": {
            "parameters": "7B",
            "architecture": "Dense Transformer",
            "context": "32K tokens",
            "best_for": "Edge devices, quick inference",
            "vram_min": "~14GB (FP16), ~4GB (4-bit)"
        },
        "mixtral-8x7b": {
            "parameters": "46.7B (12.9B active)",
            "architecture": "Mixture of Experts",
            "context": "32K tokens",
            "best_for": "High quality with good speed",
            "vram_min": "~94GB (FP16), ~26GB (4-bit)"
        },
        "mixtral-8x22b": {
            "parameters": "141B (39B active)",
            "architecture": "Mixture of Experts",
            "context": "65K tokens",
            "best_for": "Maximum quality",
            "vram_min": "~282GB (FP16), ~80GB (4-bit)"
        },
        "mistral-large": {
            "parameters": "Proprietary",
            "architecture": "Proprietary",
            "context": "128K tokens",
            "best_for": "Enterprise, complex tasks",
            "vram_min": "API only"
        }
    }
    return models.get(model_name, "Unknown model")

for model in ["mistral-7b", "mixtral-8x7b", "mistral-large"]:
    info = get_model_info(model)
    print(f"{model}:")
    print(f"  Parameters: {info['parameters']}")
    print(f"  VRAM: {info['vram_min']}")
    print()

Expected output:
mistral-7b:
  Parameters: 7B
  VRAM: ~14GB (FP16), ~4GB (4-bit)

mixtral-8x7b:
  Parameters: 46.7B (12.9B active)
  VRAM: ~94GB (FP16), ~26GB (4-bit)

mistral-large:
  Parameters: Proprietary
  VRAM: API only

API Access via La Plateforme
Mistral’s official API service:
pip install mistralai

from mistralai import Mistral

client = Mistral(api_key="<your-mistral-api-key>")
response = client.chat.complete(
    model="mistral-large-latest",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what Mixture of Experts means in AI."}
    ],
    temperature=0.7,
    max_tokens=300
)
print(response.choices[0].message.content)

Expected output:
Mixture of Experts (MoE) is a neural network architecture where multiple specialized sub-networks ("experts") are trained, but only a subset activates for each input. Think of it like a hospital: instead of every doctor seeing every patient, a routing system sends each case to the right specialist. This makes MoE models efficient: Mixtral 8x7B has 46B total parameters but only uses 12B per token, giving it the knowledge of a large model with the speed of a small one.

Streaming with Mistral API
def stream_mistral(prompt):
    """Stream Mistral response token by token."""
    response = client.chat.stream(
        model="mistral-large-latest",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200
    )
    for chunk in response:
        # Some chunks carry no text (for example the final one), so skip empty deltas
        if chunk.data.choices[0].delta.content:
            print(chunk.data.choices[0].delta.content, end="", flush=True)
    print()

stream_mistral("List 3 benefits of open-source LLMs")

Function Calling
Mistral supports tool use similar to OpenAI:
from mistralai import Mistral
from mistralai.models import Tool, Function, ToolCall
tools = [
Tool(
function=Function(
name="search_database",
description="Search the threat database for indicators",
parameters={
"type": "object",
"properties": {
"indicator": {
"type": "string",
"description": "IP address, hash, or domain to check"
}
},
"required": ["indicator"]
}
)
)
]
response = client.chat.complete(
model="mistral-large-latest",
messages=[
{"role": "user", "content": "Check IP 192.168.1.1 in the threat database"}
],
tools=tools,
tool_choice="auto"
)
message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Tool called: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")
else:
    print(f"Response: {message.content}")

Expected output:
Tool called: search_database
Arguments: {"indicator": "192.168.1.1"}

Embeddings
Mistral provides embedding models for RAG applications:
from mistralai import Mistral
client = Mistral(api_key="<your-mistral-api-key>")
def get_embeddings(texts):
    """Get embeddings for a list of texts."""
    response = client.embeddings.create(
        model="mistral-embed",
        inputs=texts
    )
    return [d.embedding for d in response.data]

# Example
embeddings = get_embeddings([
    "Mistral AI provides powerful open-source models",
    "Mixture of Experts architecture is efficient"
])
print(f"Embedding dimension: {len(embeddings[0])}")
print(f"First 5 values: {embeddings[0][:5]}")

Expected output:
Embedding dimension: 1024
First 5 values: [0.032, -0.015, 0.087, -0.042, 0.011]

Self-Hosting with Ollama
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull Mistral models
ollama pull mistral # Mistral 7B
ollama pull mixtral # Mixtral 8x7B
ollama pull codestral # Codestral 22B
# Run
ollama run mistral

Self-Hosting with vLLM
# Start vLLM server:
# python -m vllm.entrypoints.openai.api_server \
# --model mistralai/Mistral-7B-Instruct-v0.3 \
# --port 8000
# Then use OpenAI-compatible client:
from openai import OpenAI
local_client = OpenAI(
api_key="not-needed",
base_url="http://localhost:8000/v1"
)
response = local_client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.3",
messages=[{"role": "user", "content": "Hello from local Mistral!"}]
)
print(response.choices[0].message.content)

Quantization (GGUF and GPTQ)
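As a rough guide to what quantization saves, weight memory scales linearly with the number of bits stored per parameter. The sketch below is a back-of-the-envelope estimate rather than a measurement: the 7.24 billion parameter count for Mistral 7B is approximate, and `weight_memory_gb` is a hypothetical helper written for this illustration.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


MISTRAL_7B_PARAMS = 7.24e9  # approximate parameter count (assumption)

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:<6} ~{weight_memory_gb(MISTRAL_7B_PARAMS, bits):.1f} GB")
```

Real quantized files come out somewhat larger than the 4-bit estimate because formats such as GGUF and GPTQ also store per-block scales and metadata, and inference needs extra memory for the KV cache and activations.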
GGUF (CPU-friendly, via llama.cpp)
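# Option 1: let LM Studio or Ollama download a pre-quantized model
# Option 2: convert and quantize manually with llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release -j
# Convert the Hugging Face checkpoint to an FP16 GGUF file, then quantize to Q4_K_M.
# Recent llama.cpp versions name these tools convert_hf_to_gguf.py and llama-quantize;
# older checkouts used convert.py and ./quantize instead.
python convert_hf_to_gguf.py ./Mistral-7B-Instruct-v0.3/ --outtype f16 --outfile ./mistral-7b-f16.gguf
./build/bin/llama-quantize ./mistral-7b-f16.gguf ./mistral-7b-q4_k_m.gguf Q4_K_M

GPTQ (GPU-optimized, via AutoGPTQ)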
# Using AutoGPTQ:
# from auto_gptq import AutoGPTQForCausalLM
# from transformers import AutoTokenizer
#
# model = AutoGPTQForCausalLM.from_quantized(
# "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
# use_safetensors=True,
# device="cuda:0"
# )
# tokenizer = AutoTokenizer.from_pretrained(
# "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
# )

def compare_quantization_methods():
    """Compare quantization methods for Mistral 7B."""
    methods = {
        "FP16": {"vram": "~14GB", "speed": "1.0x", "quality": "100%"},
        "GPTQ 4-bit": {"vram": "~4GB", "speed": "1.1x", "quality": "~99%"},
        "GGUF Q4_K_M": {"vram": "~4.5GB", "speed": "0.8x", "quality": "~98%"},
        "GGUF Q2_K": {"vram": "~2.5GB", "speed": "0.7x", "quality": "~85%"},
        "AWQ 4-bit": {"vram": "~4GB", "speed": "1.2x", "quality": "~99%"},
    }
    print(f"{'Method':<15} {'VRAM':<12} {'Relative Speed':<15} {'Quality'}")
    print("-" * 57)
    for method, specs in methods.items():
        print(f"{method:<15} {specs['vram']:<12} {specs['speed']:<15} {specs['quality']}")

compare_quantization_methods()

Expected output:
Method          VRAM         Relative Speed  Quality
---------------------------------------------------------
FP16            ~14GB        1.0x            100%
GPTQ 4-bit      ~4GB         1.1x            ~99%
GGUF Q4_K_M     ~4.5GB       0.8x            ~98%
GGUF Q2_K       ~2.5GB       0.7x            ~85%
AWQ 4-bit       ~4GB         1.2x            ~99%

Performance Benchmarks
def mistral_benchmarks():
    """Mistral model benchmark data."""
    benchmarks = {
        "Mistral 7B": {
            "MMLU": 64.2,
            "HumanEval": 30.5,
            "GSM8K": 43.2,
            "inference_speed": "50-70 tok/s (8GB VRAM)"
        },
        "Mixtral 8x7B": {
            "MMLU": 70.6,
            "HumanEval": 40.2,
            "GSM8K": 60.0,
            "inference_speed": "40-60 tok/s (24GB VRAM)"
        },
        "Mistral Large": {
            "MMLU": 84.0,
            "HumanEval": 73.0,  # Estimated
            "GSM8K": 85.0,  # Estimated
            "inference_speed": "API only"
        }
    }
    print(f"{'Model':<20} {'MMLU':<8} {'HumanEval':<12} {'GSM8K':<8}")
    print("-" * 48)
    for model, data in benchmarks.items():
        print(f"{model:<20} {data['MMLU']:<8} {data['HumanEval']:<12} {data['GSM8K']:<8}")

mistral_benchmarks()

Expected output:
Model                MMLU     HumanEval    GSM8K
------------------------------------------------
Mistral 7B           64.2     30.5         43.2
Mixtral 8x7B         70.6     40.2         60.0
Mistral Large        84.0     73.0         85.0

Fine-Tuning Mistral
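LoRA keeps the base weights frozen and trains two small low-rank matrices per adapted projection, which adds rank × (in + out) parameters for each one. The sketch below works through that arithmetic for rank 16 on the four attention projections, assuming Mistral 7B's published shapes (hidden size 4096, 32 layers, 8 KV heads of dimension 128). The base parameter count of 7.25 billion is approximate, and `lora_trainable_params` is a hypothetical helper, not a PEFT function.

```python
def lora_trainable_params(shapes, rank, n_layers):
    """Count LoRA parameters: each adapted (in, out) matrix adds rank * (in + out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


# Assumed Mistral 7B attention shapes: hidden size 4096, K/V width 8 * 128 = 1024.
ATTENTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
BASE_PARAMS = 7.25e9  # approximate size of the base model (assumption)

trainable = lora_trainable_params(ATTENTION_SHAPES, rank=16, n_layers=32)
print(f"LoRA parameters: {trainable:,}")
print(f"Share of base model: {trainable / BASE_PARAMS:.2%}")
```

Under these assumptions the adapters hold about 13.6 million parameters, roughly 0.19% of the base model. Doubling the rank or adding the MLP projections to the target modules scales that figure proportionally.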
# Using Hugging Face Transformers + PEFT
# from transformers import AutoModelForCausalLM, AutoTokenizer
# from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# from datasets import Dataset

def setup_lora_training():
    """Print a LoRA fine-tuning configuration and workflow for Mistral 7B."""
    config = {
        "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
        "lora_r": 16,
        "lora_alpha": 32,
        "lora_dropout": 0.05,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "batch_size": 4,
        "learning_rate": 2e-4,
        "num_epochs": 3,
    }
    print("LoRA fine-tuning configuration:")
    for k, v in config.items():
        print(f"  {k}: {v}")
    print()
    print("Steps:")
    print("1. Load tokenizer and model in 4-bit")
    print("2. Apply LoRA adapters (adds ~0.2% trainable params)")
    print("3. Train on your dataset")
    print("4. Merge adapters or keep separate (recommended)")
    print("5. Push to Hugging Face Hub or save locally")

setup_lora_training()

Common Errors
- Wrong model name for API: Use mistral-large-latest for the latest Mistral Large. Model IDs change between versions, so check the docs for current names.
- Missing mistralai SDK: The official Python SDK is called mistralai (not mistral). Install it with pip install mistralai.
- Insufficient VRAM for local inference: Mixtral 8x7B needs roughly 94GB for FP16 weights and about 26GB at 4-bit, so it does not fit on a single 24GB consumer GPU without more aggressive quantization or CPU offloading. Use a smaller model such as Mistral 7B if VRAM is limited.
- Function calling format differences: Mistral's tool format differs slightly from OpenAI's. Use the Tool and Function classes from mistralai.models for correct formatting.
- Context length exceeded: Mistral 7B has 32K context, Mixtral 8x22B has 65K, and Mistral Large has 128K. Check your model's limit before sending long conversations.
- Quantization quality loss: Q2_K quantization can lose 10-15% accuracy on complex tasks. Use Q4_K_M or Q5_K_M for production applications where quality matters.
- Embedding dimension mismatch: mistral-embed produces 1024-dimensional vectors. If you are switching from an OpenAI embedding model that returns 1536-dimensional vectors, update your vector database schema.
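The embedding dimension mismatch in the last item is easiest to catch before anything is written to the index. The following is a minimal sketch of such a guard; `check_embedding_dim` is a hypothetical helper, and the vectors are stand-ins for what an embeddings call would return.

```python
def check_embedding_dim(vectors, expected_dim):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has {len(vec)} dimensions, index expects {expected_dim}"
            )
    return vectors


# Stand-in vectors; in practice these would come from the embeddings API.
fake_mistral_vectors = [[0.0] * 1024, [0.1] * 1024]
check_embedding_dim(fake_mistral_vectors, expected_dim=1024)  # passes silently

try:
    # Simulates an index that was created for 1536-dimensional vectors.
    check_embedding_dim(fake_mistral_vectors, expected_dim=1536)
except ValueError as err:
    print(f"Caught mismatch: {err}")
```

Running the check at ingestion time turns a confusing vector-database error into an explicit message about which vector and which dimension disagree.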
Practice Questions
1. What's the difference between Mistral 7B and Mixtral 8x7B?
Mistral 7B is a dense 7B model. Mixtral 8x7B is a mixture-of-experts model with 8 experts per layer (46.7B total parameters, 12.9B active per token), which gives noticeably higher quality at moderate compute cost per token.
2. How do you access Mistral models via API?
Use the mistralai Python SDK with client.chat.complete(). Choose models like mistral-large-latest, mistral-small-latest, or codestral-latest.
3. What quantization methods are available for Mistral?
GGUF (for CPU/llama.cpp), GPTQ (for GPU), AWQ (for GPU, fastest), and bitsandbytes 4-bit (for Hugging Face inference).
4. Which Mistral model would you use for on-device deployment?
Mistral 7B quantized to 4-bit: it fits in about 4GB of VRAM, runs on consumer hardware, and provides good quality for most tasks.
5. Challenge: Deploy Mistral 7B as a local API
Set up Mistral 7B locally using Ollama or vLLM, then create a Python script that sends prompts to the local endpoint and returns responses. Benchmark latency against the cloud API.
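For the latency comparison in the challenge, a small timing harness is enough to get started. This is a sketch under stated assumptions: `benchmark` and `fake_backend` are hypothetical names, and the fake backend only sleeps so the example runs without any server. Swap it for a function that calls the local endpoint or the cloud client.

```python
import statistics
import time


def benchmark(send_prompt, prompts, runs=3):
    """Time send_prompt(prompt) for each prompt, `runs` times each, in seconds."""
    latencies = []
    for prompt in prompts:
        for _ in range(runs):
            start = time.perf_counter()
            send_prompt(prompt)
            latencies.append(time.perf_counter() - start)
    return {
        "calls": len(latencies),
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def fake_backend(prompt):
    """Placeholder backend; replace with a call to the local or cloud client."""
    time.sleep(0.01)
    return prompt.upper()


stats = benchmark(fake_backend, ["Hello", "Summarize MoE in one line"], runs=2)
print(stats)
```

Passing the backend in as a function keeps the harness identical for both targets, so any difference in the numbers comes from the endpoint rather than the measurement code. Using the same prompts and max_tokens on both sides keeps the comparison fair.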
Mini Project: Mistral Model Selector
def recommend_mistral_model(use_case, hardware=None):
    """Recommend the best Mistral model for a use case."""
    recommendations = {
        "chat": {"model": "mistral-large", "reason": "Best conversation quality"},
        "code": {"model": "codestral", "reason": "Specialized for code generation"},
        "on-device": {"model": "Mistral 7B (Q4)", "reason": "Smallest, runs on edge"},
        "rag": {"model": "mistral-embed + mistral-small", "reason": "Good quality embeddings"},
        "classification": {"model": "Mistral 7B", "reason": "Fast, sufficient for labels"},
        "summarization": {"model": "mixtral-8x7b", "reason": "Good quality/cost ratio"},
        "analysis": {"model": "mistral-large", "reason": "Best reasoning quality"},
    }
    rec = recommendations.get(use_case, {"model": "mistral-small", "reason": "Default choice"})
    print(f"Recommended for '{use_case}':")
    print(f"  Model: {rec['model']}")
    print(f"  Why: {rec['reason']}")
    if hardware:
        hw = hardware.lower()
        if hw in ("cpu", "apple silicon"):
            print(f"  Tip: Use GGUF quantization for {hardware}")
        # Match common GPU descriptions such as "NVIDIA RTX 4090", not only the word "gpu"
        elif any(keyword in hw for keyword in ("gpu", "nvidia", "rtx", "cuda")):
            print("  Tip: Use AWQ or GPTQ quantization for GPU")
    return rec

recommend_mistral_model("on-device", "CPU")
recommend_mistral_model("code", "NVIDIA RTX 4090")
recommend_mistral_model("analysis")
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro. Updated 2026-06-20.