AI API Cost Optimization — Caching, Batching and Quantization Strategies

DodaTech Updated 2026-06-22 7 min read

AI API costs can quickly spiral from pennies to thousands of dollars as usage grows — this guide covers practical strategies to reduce LLM costs by 60-80% without sacrificing quality.

What You'll Learn

You'll learn cost reduction strategies including semantic caching, request batching, prompt compression, model quantization, and intelligent provider routing using Python and Redis.

Why It Matters

GPT-4o costs $10 per million output tokens. A chatbot handling 100K conversations per month can cost over $5,000. Cost optimization is not optional — it determines whether your AI product is profitable.
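
To see how a bill of that size can arise, here is a back-of-the-envelope sketch. The per-conversation token counts are illustrative assumptions, not measurements; the prices are the GPT-4o rates used in the routing section of this guide ($2.50 per million input tokens, $10 per million output tokens).

```python
# Illustrative assumptions: adjust these to your own traffic profile.
CONVERSATIONS_PER_MONTH = 100_000
INPUT_TOKENS_PER_CONVERSATION = 6_000   # assumed: prompts plus re-sent chat history
OUTPUT_TOKENS_PER_CONVERSATION = 4_000  # assumed: all assistant replies combined

INPUT_PRICE_PER_MILLION = 2.50    # USD, GPT-4o input rate used in this guide
OUTPUT_PRICE_PER_MILLION = 10.00  # USD, GPT-4o output rate used in this guide


def monthly_cost(conversations: int, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated monthly spend in USD."""
    input_cost = conversations * input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = conversations * output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost


estimate = monthly_cost(
    CONVERSATIONS_PER_MONTH,
    INPUT_TOKENS_PER_CONVERSATION,
    OUTPUT_TOKENS_PER_CONVERSATION,
)
print(f"Estimated monthly cost: ${estimate:,.2f}")
```

Under these assumptions the output tokens account for $4,000 of the $5,500 total, which is why strategies that avoid or shorten model responses tend to pay off first.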

Real-World Use

Doda Browser's AI assistant uses semantic caching with Redis to serve 40% of queries from cache, cutting monthly API costs by $2,800 while maintaining sub-50ms response times for cached queries.

Cost Optimization Architecture

flowchart TD
    A[Request] --> B[Semantic Cache]
    B -->|Hit| C[Return Cached]
    B -->|Miss| D[Prompt Compressor]
    D --> E[Model Router]
    E --> F[Provider]
    F --> G[Output Cache]
    G --> H[Return]
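
The flow above can be read as a short function. The sketch below wires the stages together with plain callables; the names (`Pipeline`, `lookup`, `compress`, `route`, `call_provider`, `store`) are hypothetical placeholders for the components built in the following sections, not an existing API, and in-memory stand-ins replace Redis and the real provider.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Pipeline:
    """Hypothetical glue object: each field stands in for one box in the diagram."""
    lookup: Callable[[str], Optional[str]]    # Semantic Cache (None on a miss)
    compress: Callable[[str], str]            # Prompt Compressor
    route: Callable[[str], str]               # Model Router (returns a model name)
    call_provider: Callable[[str, str], str]  # Provider: (model, prompt) -> response
    store: Callable[[str, str], None]         # Output Cache

    def handle(self, prompt: str) -> str:
        cached = self.lookup(prompt)
        if cached is not None:
            return cached                     # Hit: skip every paid step
        compressed = self.compress(prompt)
        model = self.route(compressed)
        response = self.call_provider(model, compressed)
        self.store(prompt, response)          # Key on the original prompt
        return response


# Usage with in-memory stand-ins.
memory: dict[str, str] = {}
calls: list[str] = []


def fake_provider(model: str, prompt: str) -> str:
    calls.append(model)
    return f"[{model}] answer to: {prompt}"


pipeline = Pipeline(
    lookup=memory.get,
    compress=lambda p: " ".join(p.split()),
    route=lambda p: "gpt-4o" if len(p) > 200 else "gpt-4o-mini",
    call_provider=fake_provider,
    store=memory.__setitem__,
)

first = pipeline.handle("What   is 2+2?")
second = pipeline.handle("What   is 2+2?")
print(first)
print(f"Provider calls: {len(calls)}")
```

The second call is answered from the cache, so the provider is invoked only once.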

Semantic Caching

Cache responses to semantically similar queries, not just exact matches.

import hashlib
import json
import numpy as np
from openai import OpenAI
import redis

client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)
SIMILARITY_THRESHOLD = 0.92

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (
        np.linalg.norm(a) * np.linalg.norm(b)
    )

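def semantic_cache_key(embedding: list[float]) -> str:
    # Derive a stable Redis key by hashing the embedding's raw bytes.
    # The hash is computed outside the f-string because multi-line
    # f-string expressions are a syntax error before Python 3.12.
    digest = hashlib.md5(np.array(embedding).tobytes()).hexdigest()
    return f"semantic:{digest}"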

def cached_completion(prompt: str, **kwargs) -> str:
    prompt_embedding = get_embedding(prompt)

    # Scan recent cache entries for semantic matches
    cursor = 0
    while True:
        cursor, keys = cache.scan(
            cursor, match="semantic:*", count=100
        )
        for key in keys:
            cached = cache.get(key)
            if not cached:
                continue

            cached_data = json.loads(cached)
            cached_emb = cached_data["embedding"]
            similarity = cosine_similarity(
                prompt_embedding, cached_emb
            )

            if similarity >= SIMILARITY_THRESHOLD:
                print(f"Cache HIT (similarity: {similarity:.3f})")
                return cached_data["response"]

        if cursor == 0:
            break

    # Cache miss — call API
    response = client.chat.completions.create(
        model=kwargs.get("model", "gpt-4o-mini"),
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content

    # Cache the result
    key = semantic_cache_key(prompt_embedding)
    cache.setex(key, 3600, json.dumps({
        "embedding": prompt_embedding,
        "response": result,
        "prompt": prompt
    }))

    print(f"Cache MISS — API called")
    return result

# Test
result1 = cached_completion("What is the capital of France?")
result2 = cached_completion("Tell me the capital city of France")
print(f"Query 1: {result1[:60]}...")
print(f"Query 2: {result2[:60]}...")

Expected output (the similarity score and the response wording will vary from run to run):

Cache MISS — API called
Cache HIT (similarity: 0.956)
Query 1: Paris is the capital and largest city of France...
Query 2: Paris is the capital and largest city of France...
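
Scanning Redis key by key costs one round trip and one JSON decode per cached entry, so lookup time grows with cache size. One possible alternative, sketched below under the assumption that the cache fits in memory, keeps unit-normalized embeddings in a NumPy matrix so a single matrix-vector product scores every entry at once. The `InMemorySemanticCache` class and its methods are hypothetical and not part of the code above; toy 3-dimensional vectors stand in for real embeddings.

```python
from typing import Optional

import numpy as np


class InMemorySemanticCache:
    """Hypothetical in-process cache; embeddings are stored unit-normalized."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.vectors: list[np.ndarray] = []
        self.responses: list[str] = []

    @staticmethod
    def _normalize(embedding: list[float]) -> np.ndarray:
        vec = np.asarray(embedding, dtype=np.float64)
        return vec / np.linalg.norm(vec)

    def add(self, embedding: list[float], response: str) -> None:
        self.vectors.append(self._normalize(embedding))
        self.responses.append(response)

    def lookup(self, embedding: list[float]) -> Optional[str]:
        if not self.vectors:
            return None
        # For unit vectors, the dot product equals the cosine similarity.
        scores = np.stack(self.vectors) @ self._normalize(embedding)
        best = int(np.argmax(scores))
        if scores[best] >= self.threshold:
            return self.responses[best]
        return None


cache_index = InMemorySemanticCache(threshold=0.92)
cache_index.add([1.0, 0.0, 0.0], "Paris")
cache_index.add([0.0, 1.0, 0.0], "Berlin")

print(cache_index.lookup([0.98, 0.05, 0.0]))  # close to the first vector
print(cache_index.lookup([0.5, 0.5, 0.7]))    # not close to either entry
```

Unlike the scan above, which returns the first entry over the threshold, this version returns the best-scoring entry. For caches too large for an exact in-memory search, an approximate nearest-neighbor index such as FAISS plays the same role.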

Request Batching

Batch multiple independent requests into a single API call so that the shared system prompt and the per-request overhead are paid once per batch rather than once per question.

import re

def batch_completions(
    prompts: list[str],
    model: str = "gpt-4o-mini",
    batch_size: int = 5
) -> list[str]:
    # Entries stay None when the model omits or mislabels an answer
    results = [None] * len(prompts)

    system_prompt = """Answer each question below. Format your response as:
A1: [answer]
A2: [answer]
..."""

    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        # Number questions from 1 within each batch
        batch_prompt = "\n---SEPARATOR---\n".join(
            f"Q{n}: {p}" for n, p in enumerate(batch, start=1)
        )

        # Reuses the OpenAI `client` created in the semantic caching example
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": batch_prompt}
            ],
            max_tokens=200 * len(batch)
        )

        text = response.choices[0].message.content
        # Match answers by their "A<n>:" label rather than by line position,
        # so blank lines or extra text do not shift answers to the wrong slot
        for line in text.strip().split("\n"):
            match = re.match(r"A(\d+):\s*(.*)", line.strip())
            if match and 1 <= int(match.group(1)) <= len(batch):
                results[start + int(match.group(1)) - 1] = match.group(2).strip()

    return results

prompts = [
    "What is 2+2?",
    "What is the boiling point of water?",
    "Who wrote Romeo and Juliet?",
]
results = batch_completions(prompts)
for prompt, result in zip(prompts, results):
    print(f"Q: {prompt}")
    print(f"A: {result}\n")

Expected output:

Q: What is 2+2?
A: 4

Q: What is the boiling point of water?
A: 100 degrees Celsius (212 degrees Fahrenheit)

Q: Who wrote Romeo and Juliet?
A: William Shakespeare
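
A line-by-line parser assumes every answer fits on one line. Models may wrap long answers across lines, insert blank lines, or skip a label. A more tolerant parser, sketched here with a hypothetical helper named `parse_batched_answers`, captures everything between one `A<n>:` label and the next, and leaves missing answers as `None` so the caller can retry them individually.

```python
import re
from typing import Optional

# Matches "A<number>:" at the start of a line and lazily captures everything
# up to the next label or the end of the text.
ANSWER_PATTERN = re.compile(
    r"^A(\d+):\s*(.*?)(?=^A\d+:|\Z)", re.MULTILINE | re.DOTALL
)


def parse_batched_answers(text: str, expected: int) -> list[Optional[str]]:
    """Return answers ordered by label; unanswered slots stay None."""
    answers: list[Optional[str]] = [None] * expected
    for match in ANSWER_PATTERN.finditer(text):
        index = int(match.group(1)) - 1
        if 0 <= index < expected:
            # Collapse line breaks inside a wrapped answer into single spaces.
            answers[index] = " ".join(match.group(2).split())
    return answers


reply = """A1: 4

A2: 100 degrees Celsius
(212 degrees Fahrenheit)
A4: out of range, ignored"""

print(parse_batched_answers(reply, expected=3))
```

Here the wrapped second answer is joined into one string, the out-of-range `A4` label is ignored, and the third slot stays `None` because the reply never answered it.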

Prompt Compression

Reduce input token count by removing redundant words and phrases.

import re

def compress_prompt(text: str) -> str:
    """Remove low-information words to reduce token count."""

    # Remove filler words
    filler_words = [
        "basically", "actually", "essentially", "literally",
        "honestly", "simply", "just", "very", "really", "quite",
    ]
    for word in filler_words:
        text = re.sub(
            rf"\b{word}\b", "", text, flags=re.IGNORECASE
        )

    # Collapse the whitespace left behind by the removals
    text = re.sub(r"\s+", " ", text).strip()

    # Drop remaining "-ly" adverbs. This is a crude heuristic: the check is
    # case-sensitive, a word followed by punctuation (such as "easily.") is
    # not matched, and common words that merely end in "ly" need a keep-list.
    keep = {
        "only", "early", "daily", "monthly", "yearly",
        "apply", "reply", "rely", "supply", "family",
    }
    compressed = [
        word for word in text.split()
        if not word.endswith("ly") or word in keep
    ]

    return " ".join(compressed)

original = """I basically want to understand how the actually very complex
system essentially works in a really simple way that literally anyone
could understand quite easily."""

compressed = compress_prompt(original)
print(f"Original: {len(original.split())} words")
print(f"Compressed: {len(compressed.split())} words")
print(f"Compression: {1 - len(compressed.split())/len(original.split()):.0%}")
print(f"\nCompressed: {compressed}")

Expected output:

Original: 25 words
Compressed: 18 words
Compression: 28%

Compressed: I want to understand how the complex system works in a simple way that anyone could understand easily.
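
Blind word removal can damage terms that carry meaning in a particular domain. One way to guard against that, sketched below, is a whitelist of protected terms that the compressor never touches. The `compress_with_whitelist` function and its `protected` parameter are hypothetical additions for illustration, and this sketch covers the filler-word step only.

```python
import re

FILLER_PATTERN = re.compile(
    r"\b(?:basically|actually|essentially|literally|honestly"
    r"|simply|just|very|really|quite)\b",
    re.IGNORECASE,
)
EDGE_PUNCTUATION = ".,;:!?()\"'"


def compress_with_whitelist(text: str, protected: frozenset[str] = frozenset()) -> str:
    """Strip filler words, leaving any token found in `protected` untouched."""
    kept = []
    for token in text.split():
        # Compare on a lowercased copy with edge punctuation stripped.
        core = token.strip(EDGE_PUNCTUATION).lower()
        if core in protected:
            kept.append(token)    # protected term: keep verbatim
            continue
        cleaned = FILLER_PATTERN.sub("", token)
        if cleaned.strip(EDGE_PUNCTUATION):
            kept.append(cleaned)  # keep tokens that still carry text
    return " ".join(kept)


sample = "We really just need a just-in-time compiler."
print(compress_with_whitelist(sample))
print(compress_with_whitelist(sample, protected=frozenset({"just-in-time"})))
```

Without the whitelist, the word-boundary match strips "just" out of "just-in-time" and leaves "-in-time"; with it, the term survives intact.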

Model Routing with Cost-Aware Policy

Route simple queries to cheap models and complex ones to expensive models.

from typing import Optional

class CostAwareRouter:
    def __init__(self):
        self.models = {
            "gpt-4o-mini": {"cost_per_1k_input": 0.00015, "cost_per_1k_output": 0.0006},
            "gpt-4o": {"cost_per_1k_input": 0.0025, "cost_per_1k_output": 0.01},
        }

    def estimate_complexity(self, prompt: str) -> float:
        """Score prompt complexity 0-1 as the fraction of heuristics that fire."""
        text = prompt.lower()
        signals = [
            len(prompt) > 200,
            "code" in text and "explain" in text,
            "compare" in text or "difference" in text,
            "analyze" in text or "evaluate" in text,
            any(char in prompt for char in "{}[]"),
        ]
        return sum(signals) / len(signals)

    def route(self, prompt: str) -> str:
        complexity = self.estimate_complexity(prompt)

        if complexity > 0.5:
            return "gpt-4o"
        return "gpt-4o-mini"

    def estimate_cost(self, prompt: str, model: str) -> float:
        prices = self.models[model]
        input_tokens = len(prompt.split()) * 1.3
        output_tokens = 200
        return (
            input_tokens / 1000 * prices["cost_per_1k_input"]
            + output_tokens / 1000 * prices["cost_per_1k_output"]
        )

router = CostAwareRouter()
prompts = [
    "What is 2+2?",
    """Analyze the time complexity of this code and suggest optimizations:
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n-i-1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
    return arr"""
]

for prompt in prompts:
    model = router.route(prompt)
    cost = router.estimate_cost(prompt, model)
    print(f"Complexity: {router.estimate_complexity(prompt):.2f}")
    print(f"Routed to: {model}")
    print(f"Est. cost: ${cost:.6f}\n")

Expected output:

Complexity: 0.00
Routed to: gpt-4o-mini
Est. cost: $0.000121

Complexity: 0.60
Routed to: gpt-4o
Est. cost: $0.002114
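
How much routing saves depends on the share of traffic that can go to the cheaper model. The sketch below compares an all-`gpt-4o` baseline with a routed mix, using the per-1K-token prices from the router above; the token counts and the 80/20 split are illustrative assumptions, and `request_cost` and `blended_cost` are hypothetical helper names.

```python
PRICES_PER_1K = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "gpt-4o": {"input": 0.0025, "output": 0.01},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-1K-token prices."""
    price = PRICES_PER_1K[model]
    return (
        input_tokens / 1000 * price["input"]
        + output_tokens / 1000 * price["output"]
    )


def blended_cost(share_to_mini: float, input_tokens: int, output_tokens: int) -> float:
    """Average cost per request when `share_to_mini` of traffic uses gpt-4o-mini."""
    mini = request_cost("gpt-4o-mini", input_tokens, output_tokens)
    full = request_cost("gpt-4o", input_tokens, output_tokens)
    return share_to_mini * mini + (1 - share_to_mini) * full


# Assumed workload: 500 input and 200 output tokens per request,
# with 80% of queries simple enough for the cheaper model.
baseline = request_cost("gpt-4o", 500, 200)
routed = blended_cost(0.8, 500, 200)
print(f"Baseline: ${baseline:.6f} per request")
print(f"Routed:   ${routed:.6f} per request")
print(f"Savings:  {1 - routed / baseline:.0%}")
```

With these assumed numbers the routed mix costs roughly a quarter of the baseline. The real figure depends on how many queries the complexity estimator can safely send to the cheaper model.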

Common Errors

Error	Cause	Fix
Cache hit rate below 10%	Similarity threshold too strict	Lower cosine threshold from 0.95 to 0.85
Batching degrades response quality	Prompts too diverse in a single batch	Group semantically similar prompts before batching
Prompt compression removes critical context	Aggressive truncation of domain-specific terms	Use a whitelist of terms to never compress
Model router sends hard queries to cheap model	Complexity estimator too simplistic	Add keyword-based rules for known high-complexity domains
Cost savings look good but latency triples	Cache miss path now checks all entries	Replace linear scan with FAISS-based ANN cache lookup

Practice Questions

Why is semantic caching more effective than exact-match caching for LLM APIs? Users ask the same question with different wording; semantic caching catches paraphrases that exact matching misses.
How does request batching reduce per-token API cost? Batching shares the fixed input context overhead across multiple queries and reduces the number of API calls.
What is the risk of using a cheap model for all queries? Cheap models have lower accuracy on complex reasoning tasks, potentially degrading user experience and trust.
How does prompt compression reduce cost without changing the model? Fewer input tokens mean lower per-request cost; compression removes redundant words without altering the core query.
Challenge: Build a cost-tracking dashboard that captures every LLM API call, logs the model used, prompt/response tokens, latency, and cost, and alerts when daily spending exceeds a configurable budget.
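
A possible starting point for the challenge, offered as a sketch rather than a full solution: a small in-memory tracker that logs each call and reports when the day's spend passes a budget. The `CostTracker` class and its fields are hypothetical; a real dashboard would persist the records and send the alert somewhere useful.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class CostTracker:
    """Hypothetical in-memory tracker; a real dashboard would persist its records."""
    daily_budget_usd: float
    records: list = field(default_factory=list)
    spend_by_day: dict = field(default_factory=lambda: defaultdict(float))

    def log_call(self, model: str, prompt_tokens: int, response_tokens: int,
                 latency_ms: float, cost_usd: float,
                 day: Optional[date] = None) -> bool:
        """Record one API call; return True once the day's spend exceeds the budget."""
        day = day or date.today()
        self.records.append({
            "day": day, "model": model, "prompt_tokens": prompt_tokens,
            "response_tokens": response_tokens, "latency_ms": latency_ms,
            "cost_usd": cost_usd,
        })
        self.spend_by_day[day] += cost_usd
        return self.spend_by_day[day] > self.daily_budget_usd


tracker = CostTracker(daily_budget_usd=1.00)
today = date(2026, 6, 22)
for _ in range(3):
    over_budget = tracker.log_call("gpt-4o", 1200, 300, 850.0, 0.40, day=today)
    print(f"Spent ${tracker.spend_by_day[today]:.2f}, over budget: {over_budget}")
```

The third call pushes the day's total past the $1.00 budget, so `log_call` returns `True` and the caller can raise an alert.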

Mini Project

Build a cost-optimized AI chat proxy. Create a FastAPI proxy that sits between users and the LLM API, implementing semantic caching with FAISS for sub-10ms lookups, model routing based on query complexity, prompt compression for long inputs, and usage tracking with per-user cost allocation. Output a daily cost report showing savings compared to an uncached, single-model baseline.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
