Mistral AI: Models and API Guide

DodaTech Updated Jun 20, 2026 9 min read

Mistral AI offers a range of open-weight models, from the efficient Mistral 7B to the much larger Mixtral 8x22B mixture-of-experts model. This guide covers API access, self-hosting, quantization, and fine-tuning.

Learning Path

    flowchart LR
  A["DeepSeek API<br/>Open-Source LLMs"] --> B["Mistral AI<br/>Models & API"]
  B --> C["LangChain<br/>LLM Applications"]
  C --> D["Self-Hosting<br/>Ollama & vLLM"]
  style B fill:#f90,color:#fff,stroke-width:2px

What you’ll learn: the Mistral AI model family, API access via La Plateforme, self-hosting with Ollama and vLLM, quantization methods (GGUF, GPTQ), function calling, embeddings, and fine-tuning.

Why it matters: Mistral’s open-weight models deliver strong quality for their compute cost. Mistral reported that Mixtral 8x7B matches or outperforms GPT-3.5 on most standard benchmarks while activating only about 13B parameters per token.

Real-world use: DodaZIP uses Mistral 7B for on-device file description generation. Durga Antivirus Pro benchmarks Mistral models for lightweight threat classification on edge devices.

Model Family Overview

Mistral offers several models optimized for different use cases:

Model	Parameters	Architecture	Best For
Mistral 7B	7B	Dense transformer	Edge devices, fast inference
Mixtral 8x7B	~47B (~13B active)	Mixture-of-Experts	High quality, good speed
Mixtral 8x22B	141B (39B active)	Mixture-of-Experts	Best quality, larger compute
Mistral Large	Unknown	Proprietary	Enterprise (via API)
Codestral	22B	Code-optimized	Code generation
Mistral Nemo	12B	Optimized dense	Balanced quality/speed
Mistral Small	Unknown	Efficient	Cost-effective API calls

def get_model_info(model_name):
    """Return a dict of facts about a Mistral model, or None if it is unknown."""
    # VRAM figures are rough weight-only estimates (about 2 bytes per parameter
    # at FP16, 0.5 bytes at 4-bit). KV cache and runtime overhead come on top.
    models = {
        "mistral-7b": {
            "parameters": "7B",
            "architecture": "Dense Transformer",
            "context": "32K tokens",
            "best_for": "Edge devices, quick inference",
            "vram_min": "~14GB (FP16), ~4GB (4-bit)"
        },
        "mixtral-8x7b": {
            "parameters": "~47B (~13B active)",
            "architecture": "Mixture of Experts",
            "context": "32K tokens",
            "best_for": "High quality with good speed",
            "vram_min": "~90GB (FP16), ~24GB (4-bit)"
        },
        "mixtral-8x22b": {
            "parameters": "141B (39B active)",
            "architecture": "Mixture of Experts",
            "context": "64K tokens",
            "best_for": "Maximum quality",
            "vram_min": "~280GB (FP16), ~70GB (4-bit)"
        },
        "mistral-large": {
            "parameters": "Proprietary",
            "architecture": "Proprietary",
            "context": "128K tokens",
            "best_for": "Enterprise, complex tasks",
            "vram_min": "API only"
        }
    }
    # Return None for unknown names so callers can check before indexing
    return models.get(model_name)

for model in ["mistral-7b", "mixtral-8x7b", "mistral-large"]:
    info = get_model_info(model)
    if info is None:
        print(f"{model}: unknown model")
        continue
    print(f"{model}:")
    print(f"  Parameters: {info['parameters']}")
    print(f"  VRAM: {info['vram_min']}")
    print()

Expected output:

mistral-7b:
  Parameters: 7B
  VRAM: ~14GB (FP16), ~4GB (4-bit)

mixtral-8x7b:
  Parameters: ~47B (~13B active)
  VRAM: ~90GB (FP16), ~24GB (4-bit)

mistral-large:
  Parameters: Proprietary
  VRAM: API only
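
As a rough cross-check on VRAM figures like those above, weight memory can be estimated as parameter count times bytes per parameter. The sketch below is a back-of-the-envelope helper, not a sizing tool: the function name is hypothetical, the parameter counts are rounded, and it ignores KV cache, activations, and runtime overhead, so real requirements are higher.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Estimate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 params) * bytes_per_weight / 1e9 bytes per GB
    return params_billions * bytes_per_weight


# Rounded total parameter counts (assumed). For mixture-of-experts models every
# expert must be resident in memory, so the total count is what matters here.
MODELS = {"Mistral 7B": 7.3, "Mixtral 8x7B": 46.7, "Mixtral 8x22B": 141.0}

for name, params in MODELS.items():
    fp16 = estimate_weight_memory_gb(params, 16)
    int4 = estimate_weight_memory_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB at FP16, ~{int4:.1f} GB at 4-bit (weights only)")
```

Note that a mixture-of-experts model saves compute per token, not memory: only some experts run for each token, but all of them have to be loaded.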

API Access via La Plateforme

Mistral’s official API service:

pip install mistralai

from mistralai import Mistral

client = Mistral(api_key="<your-mistral-api-key>")

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what Mixture of Experts means in AI."}
    ],
    temperature=0.7,
    max_tokens=300
)

print(response.choices[0].message.content)

Example output (model responses vary between runs):

Mixture of Experts (MoE) is a neural network architecture where multiple specialized sub-networks ("experts") are trained, but only a subset activates for each input. Think of it like a hospital: instead of every doctor seeing every patient, a routing system sends each case to the right specialist. This makes MoE models efficient: Mixtral 8x7B has about 47B total parameters but uses only about 13B per token, giving it the knowledge of a large model with per-token compute closer to that of a small one.
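
The routing idea in that explanation can be illustrated with a toy top-2 router. This is a minimal sketch under simplifying assumptions, not Mixtral's actual implementation: the "experts" are plain scaling functions, the router scores are made up, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(x, experts, router_logits, k=2):
    """Combine outputs from only the selected experts; the others are never called."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Eight toy "experts": expert i multiplies its input by (i + 1).
experts = [lambda x, s=s: x * s for s in range(1, 9)]
# Made-up router scores for one token; experts 2 and 5 tie for the highest score.
logits = [0.1, 0.0, 2.0, -1.0, 0.3, 2.0, 0.2, -0.5]

print(sorted(route_top_k(logits)))         # prints [2, 5]
print(moe_forward(10.0, experts, logits))  # prints 45.0
```

Only two of the eight expert functions run for this token, which is why the active parameter count is much smaller than the total.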

Streaming with Mistral API

def stream_mistral(prompt):
    """Stream Mistral response token by token."""
    response = client.chat.stream(
        model="mistral-large-latest",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200
    )
    
    for chunk in response:
        if chunk.data.choices[0].delta.content:
            print(chunk.data.choices[0].delta.content, end="", flush=True)
    print()

stream_mistral("List 3 benefits of open-source LLMs")
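
If you need the full text after streaming (for logging or post-processing) rather than only printing it, the deltas can be collected as they arrive. The sketch below assumes chunks expose the same `chunk.data.choices[0].delta.content` path used above; `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks stand in for a live stream so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas from a stream of chunks into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.data.choices[0].delta.content
        if delta:  # skip empty or None deltas
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a stream chunk."""
    delta = SimpleNamespace(content=text)
    choice = SimpleNamespace(delta=delta)
    return SimpleNamespace(data=SimpleNamespace(choices=[choice]))


# Simulated stream; with the real API you would pass client.chat.stream(...) instead.
demo = [fake_chunk("Open "), fake_chunk("weights"), fake_chunk(None), fake_chunk("!")]
print(collect_stream(demo))  # prints "Open weights!"
```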

Function Calling

Mistral supports tool use similar to OpenAI:

from mistralai import Mistral
from mistralai.models import Tool, Function

tools = [
    Tool(
        function=Function(
            name="search_database",
            description="Search the threat database for indicators",
            parameters={
                "type": "object",
                "properties": {
                    "indicator": {
                        "type": "string",
                        "description": "IP address, hash, or domain to check"
                    }
                },
                "required": ["indicator"]
            }
        )
    )
]

response = client.chat.complete(
    model="mistral-large-latest",
    messages=[
        {"role": "user", "content": "Check IP 192.168.1.1 in the threat database"}
    ],
    tools=tools,
    tool_choice="auto"
)

message = response.choices[0].message
if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Tool called: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")
else:
    print(f"Response: {message.content}")

Expected output:

Tool called: search_database
Arguments: {"indicator": "192.168.1.1"}
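
The model only requests the call; your code has to run it. A minimal dispatch sketch follows, assuming the arguments arrive as a JSON string as in the output above. The `search_database` body, its placeholder data, and the `TOOL_REGISTRY` and `run_tool_call` names are all hypothetical, and a stand-in object replaces a real `message.tool_calls` entry so the example runs offline.

```python
import json
from types import SimpleNamespace


def search_database(indicator: str) -> dict:
    """Hypothetical local lookup; a real version would query your threat feed."""
    known_bad = {"203.0.113.7"}  # placeholder entry from a documentation IP range
    return {"indicator": indicator, "malicious": indicator in known_bad}


# Map tool names (as declared to the model) to local Python callables.
TOOL_REGISTRY = {"search_database": search_database}


def run_tool_call(tool_call):
    """Decode the arguments, call the matching function, return a JSON string."""
    name = tool_call.function.name
    args = tool_call.function.arguments
    if isinstance(args, str):  # handle JSON-string arguments; dicts pass through
        args = json.loads(args)
    if name not in TOOL_REGISTRY:
        raise ValueError(f"Model requested unknown tool: {name}")
    return json.dumps(TOOL_REGISTRY[name](**args))


# Stand-in for one entry of message.tool_calls.
demo_call = SimpleNamespace(
    function=SimpleNamespace(
        name="search_database", arguments='{"indicator": "192.168.1.1"}'
    )
)
print(run_tool_call(demo_call))
```

In a full loop, the returned string would typically be sent back to the model as a tool-result message so it can write the final answer. Rejecting unknown tool names, as done here, keeps the model from triggering functions you did not declare.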

Embeddings

Mistral provides embedding models for RAG applications:

from mistralai import Mistral

client = Mistral(api_key="<your-mistral-api-key>")

def get_embeddings(texts):
    """Get embeddings for a list of texts."""
    response = client.embeddings.create(
        model="mistral-embed",
        inputs=texts
    )
    return [d.embedding for d in response.data]

# Example
embeddings = get_embeddings([
    "Mistral AI provides powerful open-source models",
    "Mixture of Experts architecture is efficient"
])

print(f"Embedding dimension: {len(embeddings[0])}")
print(f"First 5 values: {embeddings[0][:5]}")

Expected output (the dimension is fixed; the values shown are illustrative and will differ):

Embedding dimension: 1024
First 5 values: [0.032, -0.015, 0.087, -0.042, 0.011]
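
In a RAG pipeline, these vectors are usually compared with cosine similarity to find the documents closest to a query. The following is a small pure-Python sketch of that step; the function names are hypothetical, and tiny 3-dimensional vectors stand in for real 1024-dimensional embeddings so it runs without an API key.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors; in practice these would come from get_embeddings(...).
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
print(rank_documents(query, docs)[0][0])  # prints 1
```

For more than a few thousand documents, a vector database or a NumPy-based implementation would be a more practical choice than this loop.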

Self-Hosting with Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull Mistral models
ollama pull mistral          # Mistral 7B
ollama pull mixtral          # Mixtral 8x7B
ollama pull codestral        # Codestral 22B

# Run
ollama run mistral
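
Once Ollama is running, it also serves a local HTTP API, by default on port 11434. The sketch below calls its generate endpoint using only the standard library. The helper names are hypothetical, and the request is built in a separate function so it can be inspected without a server; the actual call is left commented out because it needs Ollama running locally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_generate_request(prompt, model="mistral"):
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="mistral", timeout=120):
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Usage (requires the Ollama service to be running with the model pulled):
# print(generate("Summarize what a mixture-of-experts model is in one sentence."))
```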

Self-Hosting with vLLM

# Start vLLM server:
# python -m vllm.entrypoints.openai.api_server \
#     --model mistralai/Mistral-7B-Instruct-v0.3 \
#     --port 8000

# Then use OpenAI-compatible client:
from openai import OpenAI

local_client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8000/v1"
)

response = local_client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Hello from local Mistral!"}]
)
print(response.choices[0].message.content)

Quantization (GGUF and GPTQ)

GGUF (CPU-friendly, via llama.cpp)

# Download quantized model
# Using LM Studio or Ollama (handles quantization automatically)

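# Or manually with llama.cpp (current releases build with CMake):
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
pip install -r requirements.txt   # dependencies for the conversion script

# Convert the Hugging Face checkpoint to an FP16 GGUF file, then quantize to Q4_K_M
python convert_hf_to_gguf.py ./Mistral-7B-Instruct-v0.3/ --outtype f16 --outfile ./mistral-7b-f16.gguf
./build/bin/llama-quantize ./mistral-7b-f16.gguf ./mistral-7b-q4_k_m.gguf Q4_K_M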

GPTQ (GPU-optimized, via AutoGPTQ)

# Using AutoGPTQ:
# from auto_gptq import AutoGPTQForCausalLM
# from transformers import AutoTokenizer
#
# model = AutoGPTQForCausalLM.from_quantized(
#     "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
#     use_safetensors=True,
#     device="cuda:0"
# )
# tokenizer = AutoTokenizer.from_pretrained(
#     "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
# )

def compare_quantization_methods():
    """Compare quantization methods for Mistral 7B."""
    methods = {
        "FP16": {"vram": "~14GB", "speed": "1.0x", "quality": "100%"},
        "GPTQ 4-bit": {"vram": "~4GB", "speed": "1.1x", "quality": "~99%"},
        "GGUF Q4_K_M": {"vram": "~4.5GB", "speed": "0.8x", "quality": "~98%"},
        "GGUF Q2_K": {"vram": "~2.5GB", "speed": "0.7x", "quality": "~85%"},
        "AWQ 4-bit": {"vram": "~4GB", "speed": "1.2x", "quality": "~99%"},
    }
    
    print(f"{'Method':<15} {'VRAM':<12} {'Relative Speed':<15} {'Quality'}")
    print("-" * 57)
    for method, specs in methods.items():
        print(f"{method:<15} {specs['vram']:<12} {specs['speed']:<15} {specs['quality']}")

compare_quantization_methods()

Expected output:

Method           VRAM         Relative Speed   Quality
---------------------------------------------------------
FP16             ~14GB        1.0x             100%
GPTQ 4-bit       ~4GB         1.1x             ~99%
GGUF Q4_K_M      ~4.5GB       0.8x             ~98%
GGUF Q2_K        ~2.5GB       0.7x             ~85%
AWQ 4-bit        ~4GB         1.2x             ~99%

Performance Benchmarks

def mistral_benchmarks():
    """Mistral model benchmark data."""
    benchmarks = {
        "Mistral 7B": {
            "MMLU": 64.2,
            "HumanEval": 30.5,
            "GSM8K": 43.2,
            "inference_speed": "50-70 tok/s (8GB VRAM)"
        },
        "Mixtral 8x7B": {
            "MMLU": 70.6,
            "HumanEval": 40.2,
            "GSM8K": 60.0,
            "inference_speed": "40-60 tok/s (24GB VRAM)"
        },
        "Mistral Large": {
            "MMLU": 84.0,
            "HumanEval": 73.0,  # Estimated
            "GSM8K": 85.0,      # Estimated
            "inference_speed": "API only"
        }
    }
    
    print(f"{'Model':<20} {'MMLU':<8} {'HumanEval':<12} {'GSM8K':<8}")
    print("-" * 48)
    for model, data in benchmarks.items():
        print(f"{model:<20} {data['MMLU']:<8} {data['HumanEval']:<12} {data['GSM8K']:<8}")

mistral_benchmarks()

Expected output:

Model                MMLU     HumanEval    GSM8K   
-------------------------------------------------
Mistral 7B           64.2     30.5         43.2    
Mixtral 8x7B         70.6     40.2         60.0    
Mistral Large        84.0     73.0         85.0

Fine-Tuning Mistral

# Using Hugging Face Transformers + PEFT
# from transformers import AutoModelForCausalLM, AutoTokenizer
# from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# from datasets import Dataset

def setup_lora_training():
    """Prepare Mistral 7B for LoRA fine-tuning."""
    config = {
        "model_name": "mistralai/Mistral-7B-Instruct-v0.3",
        "lora_r": 16,
        "lora_alpha": 32,
        "lora_dropout": 0.05,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "batch_size": 4,
        "learning_rate": 2e-4,
        "num_epochs": 3,
    }
    
    print("LoRA fine-tuning configuration:")
    for k, v in config.items():
        print(f"  {k}: {v}")
    print()
    print("Steps:")
    print("1. Load tokenizer and model in 4-bit")
    print("2. Apply LoRA adapters (adds ~0.2% trainable params)")
    print("3. Train on your dataset")
    print("4. Merge adapters or keep separate (recommended)")
    print("5. Push to Hugging Face Hub or save locally")

setup_lora_training()
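
The share of trainable parameters that LoRA adds can be worked out directly from the configuration above: each adapted weight matrix gains two small matrices whose combined size is rank × (inputs + outputs). The sketch below does that arithmetic. The layer shapes (hidden size 4096, 32 layers, k_proj and v_proj projecting to 1024 because of grouped-query attention) and the 7.25B base count are assumptions about Mistral 7B stated here for illustration, so verify them against the model's config before relying on the result.

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) next to one weight matrix."""
    return rank * (in_features + out_features)


# Assumed Mistral 7B shapes: hidden size 4096, 32 layers, and 8 KV heads of
# dimension 128, so k_proj and v_proj map 4096 -> 1024.
HIDDEN, KV_DIM, LAYERS, RANK = 4096, 1024, 32, 16
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

per_layer = sum(lora_param_count(i, o, RANK) for i, o in PROJECTIONS.values())
trainable = per_layer * LAYERS
BASE_PARAMS = 7.25e9  # approximate total parameter count (assumed)

print(f"Trainable LoRA parameters: {trainable:,}")              # 13,631,488
print(f"Share of base model: {trainable / BASE_PARAMS:.2%}")    # 0.19%
```

Doubling `lora_r` doubles the adapter size, so the rank is the main knob for trading capacity against memory.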

Common Errors

Wrong model name for API: Use mistral-large-latest for the latest Mistral Large. Model IDs change between versions, so check the docs for current names.
Missing mistralai SDK: The official Python SDK is called mistralai (not mistral). Install it with pip install mistralai.
VRAM insufficient for local inference: Mixtral 8x7B needs roughly 90GB for FP16 weights, because all ~47B parameters must be loaded even though only ~13B are active per token. On consumer GPUs, use 4-bit quantization (about 24GB) or a smaller model.
Function calling format differences: Mistral’s tool format differs slightly from OpenAI’s. Use the Tool and Function classes from mistralai.models for correct formatting.
Context length exceeded: Mistral 7B has a 32K context window, Mixtral 8x22B has 64K, and Mistral Large has 128K. Check your model’s limit before sending long prompts or conversations.
Quantization quality loss: Q2_K quantization can lose 10-15% accuracy on complex tasks. Use Q4_K_M or Q5_K_M for production applications where quality matters.
Embedding dimension mismatch: mistral-embed produces 1024-dimensional vectors. If you are switching from an OpenAI model with 1536-dimensional embeddings, update your vector database schema and re-embed your documents.
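
Context-length errors in particular can be caught before a request is sent. The sketch below is a rough pre-flight check under loud assumptions: it estimates tokens at about 4 characters each (a crude heuristic for English text, not the model's tokenizer), uses conservative round-number limits, and all names are hypothetical. For exact counts you would need the model's own tokenizer.

```python
# Conservative round-number limits (assumed; check the docs for exact values).
CONTEXT_LIMITS = {
    "mistral-7b": 32_000,
    "mixtral-8x22b": 64_000,
    "mistral-large": 128_000,
}


def rough_token_count(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; real tokenizers can differ substantially."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(text: str, model: str, reserved_for_output: int = 512) -> bool:
    """True if the estimated prompt tokens plus the output budget fit the limit."""
    return rough_token_count(text) + reserved_for_output <= CONTEXT_LIMITS[model]


prompt = "Summarize this log line. " * 10_000  # 250,000 characters
print(fits_in_context(prompt, "mistral-7b"))     # prints False
print(fits_in_context(prompt, "mistral-large"))  # prints True
```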

Practice Questions

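1. What’s the difference between Mistral 7B and Mixtral 8x7B? Mistral 7B is a dense model with about 7B parameters, all of which are used for every token. Mixtral 8x7B is a sparse mixture-of-experts model with 8 experts per layer, of which 2 are selected per token (about 47B total parameters, about 13B active). This gives higher quality at a moderate per-token compute cost, although all of the weights must still fit in memory.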

2. How do you access Mistral models via API? Use the mistralai Python SDK with client.chat.complete(). Choose models like mistral-large-latest, mistral-small-latest, or codestral-latest.

3. What quantization methods are available for Mistral? GGUF (for CPU/llama.cpp), GPTQ (for GPU), AWQ (for GPU, fastest), and bitsandbytes 4-bit (for Hugging Face inference).

4. Which Mistral model would you use for on-device deployment? Mistral 7B quantized to 4-bit. Its weights need roughly 4-5GB of memory, it runs on consumer hardware, and it provides good quality for most tasks.

5. Challenge: Deploy Mistral 7B as a local API Set up Mistral 7B locally using Ollama or vLLM, then create a Python script that sends prompts to the local endpoint and returns responses. Benchmark latency vs the cloud API.
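
For the benchmarking part of the challenge, one possible starting point is a small timing harness that accepts any function taking a prompt, so the same code can time a local endpoint and the cloud API. This is a sketch with hypothetical names; `fake_backend` simulates a model call with a short sleep so the example runs without a server.

```python
import statistics
import time


def benchmark(generate, prompts, warmup=1):
    """Time generate(prompt) for each prompt and summarize latency in seconds."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up calls are not timed (model load, caches)
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        latencies.append(time.perf_counter() - start)
    return {
        "runs": len(latencies),
        "mean_s": statistics.mean(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def fake_backend(prompt):
    """Stand-in for a real call to a local endpoint or the cloud API."""
    time.sleep(0.01)
    return prompt.upper()


print(benchmark(fake_backend, ["hello", "world", "mistral"]))
```

To compare backends fairly, use the same prompts and the same maximum output length for both, since latency scales with the number of generated tokens.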

Mini Project: Mistral Model Selector

def recommend_mistral_model(use_case, hardware=None):
    """Recommend the best Mistral model for a use case."""
    recommendations = {
        "chat": {"model": "mistral-large", "reason": "Best conversation quality"},
        "code": {"model": "codestral", "reason": "Specialized for code generation"},
        "on-device": {"model": "Mistral 7B (Q4)", "reason": "Smallest, runs on edge"},
        "rag": {"model": "mistral-embed + mistral-small", "reason": "Good quality embeddings"},
        "classification": {"model": "Mistral 7B", "reason": "Fast, sufficient for labels"},
        "summarization": {"model": "mixtral-8x7b", "reason": "Good quality/cost ratio"},
        "analysis": {"model": "Mistral Large", "reason": "Best reasoning quality"},
    }
    
    rec = recommendations.get(use_case, {"model": "mistral-small", "reason": "Default choice"})
    
    print(f"Recommended for '{use_case}':")
    print(f"  Model: {rec['model']}")
    print(f"  Why: {rec['reason']}")
    
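    if hardware:
        hw = hardware.lower()
        if hw in ["cpu", "apple silicon"]:
            print(f"  Tip: Use GGUF quantization for {hardware}")
        # Match common GPU names too, so "NVIDIA RTX 4090" gets the GPU tip
        elif any(key in hw for key in ("gpu", "nvidia", "rtx", "cuda")):
            print("  Tip: Use AWQ or GPTQ quantization for GPU")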
    
    return rec

recommend_mistral_model("on-device", "CPU")
recommend_mistral_model("code", "NVIDIA RTX 4090")
recommend_mistral_model("analysis")

FAQ

Is Mistral AI truly open-source?

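Partly. Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, and Mistral Nemo are released under Apache 2.0, which permits commercial use. Codestral ships under a custom non-production license, and Mistral Large is a commercial model offered primarily through the API. Check the current license terms for your use case.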

How does Mistral 7B compare to Llama 3 8B?

Mistral 7B and Llama 3 8B are comparable. Llama 3 8B scores slightly higher on MMLU (68.4 vs 64.2), while Mistral 7B is slightly smaller, which can translate into somewhat higher throughput (tokens per second) at the same precision. Choose based on benchmarks that reflect your own workload and hardware.

Can I fine-tune Mistral models?

Yes. Mistral 7B and Mixtral 8x7B support full fine-tuning and LoRA/QLoRA. Hugging Face Transformers, Axolotl, and Unsloth all support Mistral fine-tuning with minimal code changes.

What is Mistral’s Le Chat?

Le Chat is Mistral’s conversational chat interface (like ChatGPT). It offers free access to Mistral models with web search, file upload, and multi-modal capabilities. The API powers the same models for programmatic access.

Which Mistral model is best for coding?

Codestral (22B) is specialized for code generation. Mixtral 8x22B is the best general model for complex coding tasks. Mistral 7B works well for simple code generation on edge devices.
