AI API Cost Optimization — Caching, Batching and Quantization Strategies
AI API costs can quickly spiral from pennies to thousands of dollars as usage grows. This guide covers practical strategies that aim to reduce LLM costs by 60-80% without sacrificing quality.
What You'll Learn
You'll learn cost reduction strategies including semantic caching, request batching, prompt compression, model quantization, and intelligent provider routing using Python and Redis.
Why It Matters
GPT-4o costs $10 per million output tokens. A chatbot handling 100K conversations per month can cost over $5,000. Cost optimization is not optional — it determines whether your AI product is profitable.
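To see how these figures combine, here is a back-of-the-envelope calculation. The 5,000 output tokens per conversation is an assumed value, chosen only so the arithmetic lands on the $5,000 example; your own traffic will differ, and input tokens are ignored for simplicity.

```python
# Rough monthly cost estimate. All traffic figures are illustrative assumptions.
PRICE_PER_M_OUTPUT = 10.00  # USD per million output tokens (GPT-4o figure quoted above)


def monthly_output_cost(
    conversations: int,
    output_tokens_per_conversation: int,
    price_per_million: float = PRICE_PER_M_OUTPUT,
) -> float:
    """Return the monthly output-token cost in USD."""
    total_tokens = conversations * output_tokens_per_conversation
    return total_tokens / 1_000_000 * price_per_million


# Assumption: each conversation produces about 5,000 output tokens in total.
cost = monthly_output_cost(conversations=100_000, output_tokens_per_conversation=5_000)
print(f"Estimated monthly cost: ${cost:,.2f}")  # Estimated monthly cost: $5,000.00
```

Halving the output tokens per conversation halves the bill, which is why the strategies below focus on avoiding calls and trimming tokens.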
Real-World Use
Doda Browser's AI assistant uses semantic caching with Redis to serve 40% of queries from cache, cutting monthly API costs by $2,800 while maintaining sub-50ms response times for cached queries.
Cost Optimization Architecture
flowchart TD
A[Request] --> B[Semantic Cache]
B -->|Hit| C[Return Cached]
B -->|Miss| D[Prompt Compressor]
D --> E[Model Router]
E --> F[Provider]
F --> G[Output Cache]
G --> H[Return]
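The flowchart can be read as a single function that tries the cache first and only then pays for a provider call. The sketch below is one possible way to wire the stages together; `handle_request`, `fake_provider`, and the stand-in stages are hypothetical names used for illustration, with a plain dict in place of Redis and a fake provider in place of a real API.

```python
from typing import Callable, Optional


def handle_request(
    prompt: str,
    cache_lookup: Callable[[str], Optional[str]],
    compress: Callable[[str], str],
    route: Callable[[str], str],
    call_provider: Callable[[str, str], str],
    cache_store: Callable[[str, str], None],
) -> str:
    """Run one request through the stages shown in the flowchart."""
    cached = cache_lookup(prompt)              # Semantic Cache
    if cached is not None:
        return cached                          # Hit: Return Cached
    compressed = compress(prompt)              # Miss: Prompt Compressor
    model = route(compressed)                  # Model Router
    answer = call_provider(model, compressed)  # Provider
    cache_store(prompt, answer)                # Output Cache
    return answer                              # Return


# Stand-in stages so the sketch runs without Redis or an API key.
store: dict[str, str] = {}
calls: list[str] = []


def fake_provider(model: str, prompt: str) -> str:
    calls.append(model)
    return f"[{model}] answer to: {prompt}"


stages = dict(
    cache_lookup=store.get,
    compress=lambda p: " ".join(p.split()),
    route=lambda p: "gpt-4o" if len(p) > 200 else "gpt-4o-mini",
    call_provider=fake_provider,
    cache_store=store.__setitem__,
)

first = handle_request("What is   the capital of France?", **stages)
second = handle_request("What is   the capital of France?", **stages)
print(first)
print(f"Provider calls: {len(calls)}")  # Provider calls: 1
```

Passing the stages in as callables keeps each one replaceable, so the exact-match dict here can later be swapped for the semantic cache described in the next section.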
Semantic Caching
Cache responses to semantically similar queries, not just exact matches.
import hashlib
import json
import numpy as np
from openai import OpenAI
import redis
client = OpenAI()
cache = redis.Redis(host="localhost", port=6379, db=0)
SIMILARITY_THRESHOLD = 0.92
def get_embedding(text: str) -> list[float]:
    """Embed text with the OpenAI embeddings endpoint."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (
        np.linalg.norm(a) * np.linalg.norm(b)
    ))

def semantic_cache_key(embedding: list[float]) -> str:
    # Derive a stable Redis key by hashing the embedding's raw bytes
    digest = hashlib.md5(np.array(embedding).tobytes()).hexdigest()
    return f"semantic:{digest}"
def cached_completion(prompt: str, **kwargs) -> str:
    prompt_embedding = get_embedding(prompt)
    # Linearly scan cache entries for a semantic match
    cursor = 0
    while True:
        cursor, keys = cache.scan(
            cursor, match="semantic:*", count=100
        )
        for key in keys:
            cached = cache.get(key)
            if not cached:
                continue  # Entry expired between SCAN and GET
            cached_data = json.loads(cached)
            cached_emb = cached_data["embedding"]
            similarity = cosine_similarity(
                prompt_embedding, cached_emb
            )
            if similarity >= SIMILARITY_THRESHOLD:
                print(f"Cache HIT (similarity: {similarity:.3f})")
                return cached_data["response"]
        if cursor == 0:
            break
    # Cache miss: call the API
    response = client.chat.completions.create(
        model=kwargs.get("model", "gpt-4o-mini"),
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
    # Cache the result for one hour
    key = semantic_cache_key(prompt_embedding)
    cache.setex(key, 3600, json.dumps({
        "embedding": prompt_embedding,
        "response": result,
        "prompt": prompt
    }))
    print("Cache MISS: API called")
    return result

# Test
result1 = cached_completion("What is the capital of France?")
result2 = cached_completion("Tell me the capital city of France")
print(f"Query 1: {result1[:60]}...")
print(f"Query 2: {result2[:60]}...")
Expected output (the exact similarity score will vary):
Cache MISS: API called
Cache HIT (similarity: 0.956)
Query 1: Paris is the capital and largest city of France...
Query 2: Paris is the capital and largest city of France...
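The SCAN loop in `cached_completion` compares the query against one cached embedding at a time, so lookup time grows with cache size. As a rough sketch of one alternative, the class below keeps unit-normalized embeddings in a single NumPy matrix and scores them all with one matrix-vector product. `InMemorySemanticIndex` is a hypothetical helper, not part of Redis or the OpenAI SDK, and the 3-dimensional vectors are toy stand-ins for real embeddings.

```python
import numpy as np


class InMemorySemanticIndex:
    """Brute-force cosine lookup over unit-normalized embeddings held in memory."""

    def __init__(self, dim: int, threshold: float = 0.92):
        self.threshold = threshold
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.responses: list[str] = []

    @staticmethod
    def _normalize(vec) -> np.ndarray:
        v = np.asarray(vec, dtype=np.float32)
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v

    def add(self, embedding, response: str) -> None:
        self.vectors = np.vstack([self.vectors, self._normalize(embedding)])
        self.responses.append(response)

    def lookup(self, embedding):
        """Return the best cached response at or above the threshold, else None."""
        if not self.responses:
            return None
        # For unit vectors, the dot product equals cosine similarity.
        sims = self.vectors @ self._normalize(embedding)
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None


# Toy 3-dimensional "embeddings" for illustration only.
index = InMemorySemanticIndex(dim=3)
index.add([1.0, 0.0, 0.0], "Paris")
print(index.lookup([0.99, 0.05, 0.0]))  # Paris
print(index.lookup([0.0, 1.0, 0.0]))    # None
```

Unlike the SCAN loop, which returns the first entry above the threshold, this version returns the closest one. It is still a linear scan under the hood, just a much cheaper one per entry, so very large caches may still call for an approximate nearest-neighbor index.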
Request Batching
Batch multiple independent requests into a single API call so that the fixed per-request overhead (the system prompt and the request round trip) is shared across several questions.
import re
from typing import Optional

def batch_completions(
    prompts: list[str],
    model: str = "gpt-4o-mini",
    batch_size: int = 5
) -> list[Optional[str]]:
    """Answer several prompts per API call; answers that fail to parse stay None."""
    results = [None] * len(prompts)
    system_prompt = """Answer each question below. Format your response as:
A1: [answer]
A2: [answer]
..."""
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        batch_prompt = "\n---SEPARATOR---\n".join(
            f"Q{n}: {p}" for n, p in enumerate(batch, start=1)
        )
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": batch_prompt}
            ],
            max_tokens=200 * len(batch)
        )
        text = response.choices[0].message.content
        # Match answers by their "A<n>:" label rather than by line position,
        # so blank lines or extra text in the reply do not shift the mapping
        for match in re.finditer(r"^A(\d+):\s*(.*)$", text, re.MULTILINE):
            n = int(match.group(1))
            if 1 <= n <= len(batch):
                results[start + n - 1] = match.group(2).strip()
    return results
prompts = [
"What is 2+2?",
"What is the boiling point of water?",
"Who wrote Romeo and Juliet?",
]
results = batch_completions(prompts)
for i, (prompt, result) in enumerate(zip(prompts, results)):
print(f"Q: {prompt}")
print(f"A: {result}\n")
Expected output:
Q: What is 2+2?
A: 4
Q: What is the boiling point of water?
A: 100 degrees Celsius (212 degrees Fahrenheit)
Q: Who wrote Romeo and Juliet?
A: William Shakespeare
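The saving from batching comes from paying for the system prompt once per batch instead of once per question. The arithmetic below is a rough sketch of that effect; the 150-token system prompt and 20-token queries are assumed figures, and the few extra tokens spent on separators and `Q1:` labels are ignored.

```python
def input_tokens_unbatched(n_queries: int, system_tokens: int, query_tokens: int) -> int:
    """Each query is sent alone, so the system prompt is paid for every time."""
    return n_queries * (system_tokens + query_tokens)


def input_tokens_batched(
    n_queries: int, system_tokens: int, query_tokens: int, batch_size: int
) -> int:
    """The system prompt is paid once per batch instead of once per query."""
    n_calls = -(-n_queries // batch_size)  # ceiling division
    return n_calls * system_tokens + n_queries * query_tokens


# Assumed workload: 100 queries of about 20 tokens each, a 150-token system prompt.
separate = input_tokens_unbatched(100, 150, 20)
batched = input_tokens_batched(100, 150, 20, batch_size=5)
print(f"Unbatched input tokens: {separate}")                    # 17000
print(f"Batched input tokens:   {batched}")                     # 5000
print(f"Input-token saving:     {1 - batched / separate:.0%}")  # 71%
```

Output tokens are unchanged by batching, so the saving applies only to the input side, and it shrinks as the system prompt gets shorter relative to the questions.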
Prompt Compression
Reduce input token count by removing redundant words and phrases.
import re

def compress_prompt(text: str, target_ratio: float = 0.5) -> str:
    """Remove low-information words to reduce token count.

    Note: target_ratio is accepted but not enforced by these heuristics.
    """
    # Remove filler words
    filler_words = [
        "basically", "actually", "essentially", "literally",
        "honestly", "simply", "just", "very", "really", "quite"
    ]
    for word in filler_words:
        text = re.sub(
            rf"\b{word}\b", "", text, flags=re.IGNORECASE
        )
    # Remove redundant whitespace
    text = re.sub(r"\s+", " ", text).strip()

    # Helper that truncates a comma-separated list to its first 3 items.
    # It is defined here but not applied; pass it to re.sub with a pattern
    # that matches the lists in your own prompts.
    def truncate_list(match):
        items = match.group(0).split(",")
        if len(items) > 3:
            return ",".join(items[:3]) + ",..."
        return match.group(0)

    # Drop words ending in "ly" as a crude adverb filter, keeping a few
    # common exceptions. Words followed by punctuation ("easily.") are kept.
    keep = {"only", "early", "daily", "monthly", "yearly"}
    compressed = []
    for word in text.split():
        if word.endswith("ly") and word not in keep:
            continue
        compressed.append(word)
    return " ".join(compressed)
original = """I basically want to understand how the actually very complex
system essentially works in a really simple way that literally anyone
could understand quite easily."""
compressed = compress_prompt(original)
print(f"Original: {len(original.split())} words")
print(f"Compressed: {len(compressed.split())} words")
print(f"Compression: {1 - len(compressed.split())/len(original.split()):.0%}")
print(f"\nCompressed: {compressed}")
Expected output:
Original: 25 words
Compressed: 18 words
Compression: 28%
Compressed: I want to understand how the complex system works in a simple way that anyone could understand easily.
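Word-level filters like the one above can delete a term that carries meaning in a particular domain, for example "simply" in "simply connected space". One way to guard against this is a protected-terms list that the filter never touches. The sketch below is a hypothetical variant, not the author's implementation; `compress_with_protected_terms` and its `protected` parameter are names introduced here for illustration.

```python
import re

FILLER_WORDS = {
    "basically", "actually", "essentially", "literally",
    "honestly", "simply", "just", "very", "really", "quite",
}


def compress_with_protected_terms(text: str, protected: set[str]) -> str:
    """Drop filler words unless they appear in the protected set."""
    protected_lower = {term.lower() for term in protected}
    kept = []
    for word in text.split():
        # Compare on the bare word, ignoring surrounding punctuation and case.
        core = re.sub(r"^\W+|\W+$", "", word).lower()
        if core in FILLER_WORDS and core not in protected_lower:
            continue  # Note: punctuation attached to a dropped word is lost too
        kept.append(word)
    return " ".join(kept)


text = "Explain why a simply connected space is basically just one without holes"
print(compress_with_protected_terms(text, protected={"simply"}))
# Explain why a simply connected space is one without holes
print(compress_with_protected_terms(text, protected=set()))
# Explain why a connected space is one without holes
```

The second call shows the failure mode: without protection, "simply connected" becomes "connected", which is a different mathematical property.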
Model Routing with Cost-Aware Policy
Route simple queries to cheap models and complex ones to expensive models.
class CostAwareRouter:
    def __init__(self):
        # Prices in USD per 1,000 tokens
        self.models = {
            "gpt-4o-mini": {"cost_per_1k_input": 0.00015, "cost_per_1k_output": 0.0006},
            "gpt-4o": {"cost_per_1k_input": 0.0025, "cost_per_1k_output": 0.01},
        }

    def estimate_complexity(self, prompt: str) -> float:
        """Score prompt complexity 0-1 as the fraction of signals that fire."""
        text = prompt.lower()
        signals = [
            len(prompt) > 200,
            "code" in text and "explain" in text,
            "compare" in text or "difference" in text,
            "analyze" in text or "evaluate" in text,
            any(char in prompt for char in "{}[]"),
        ]
        return sum(signals) / len(signals)

    def route(self, prompt: str) -> str:
        complexity = self.estimate_complexity(prompt)
        if complexity > 0.5:
            return "gpt-4o"
        return "gpt-4o-mini"

    def estimate_cost(self, prompt: str, model: str) -> float:
        prices = self.models[model]
        input_tokens = len(prompt.split()) * 1.3  # Rough words-to-tokens heuristic
        output_tokens = 200  # Assumed average response length
        return (
            input_tokens / 1000 * prices["cost_per_1k_input"]
            + output_tokens / 1000 * prices["cost_per_1k_output"]
        )
router = CostAwareRouter()
prompts = [
    "What is 2+2?",
    """Analyze the time complexity of this code and suggest optimizations:
def bubble_sort(arr):
    n = len(arr)
    for i in range(n):
        for j in range(0, n-i-1):
            if arr[j] > arr[j+1]:
                arr[j], arr[j+1] = arr[j+1], arr[j]
    return arr"""
]
for prompt in prompts:
    model = router.route(prompt)
    cost = router.estimate_cost(prompt, model)
    print(f"Complexity: {router.estimate_complexity(prompt):.2f}")
    print(f"Routed to: {model}")
    print(f"Est. cost: ${cost:.6f}\n")
Expected output:
Complexity: 0.00
Routed to: gpt-4o-mini
Est. cost: $0.000121
Complexity: 0.60
Routed to: gpt-4o
Est. cost: $0.002114
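How much routing saves depends on what share of traffic the cheap model can absorb. The sketch below estimates a blended per-request cost from the same per-1K prices used in `CostAwareRouter`. The 300 input tokens, 200 output tokens, and 80% cheap-model share are assumptions for illustration, not measurements.

```python
PRICES_PER_1K = {  # USD per 1,000 tokens, same figures as CostAwareRouter
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "gpt-4o": {"input": 0.0025, "output": 0.01},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request to the given model."""
    prices = PRICES_PER_1K[model]
    return (
        input_tokens / 1000 * prices["input"]
        + output_tokens / 1000 * prices["output"]
    )


def blended_cost(cheap_share: float, input_tokens: int, output_tokens: int) -> float:
    """Average cost per request when cheap_share of traffic goes to gpt-4o-mini."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    cheap = request_cost("gpt-4o-mini", input_tokens, output_tokens)
    expensive = request_cost("gpt-4o", input_tokens, output_tokens)
    return cheap_share * cheap + (1 - cheap_share) * expensive


# Assumed traffic: 300 input and 200 output tokens per request, 80% routed cheap.
baseline = request_cost("gpt-4o", 300, 200)
blended = blended_cost(0.8, 300, 200)
print(f"All gpt-4o:   ${baseline:.6f} per request")    # $0.002750
print(f"80% routed:   ${blended:.6f} per request")     # $0.000682
print(f"Saving:       {1 - blended / baseline:.0%}")   # 75%
```

The saving is only real if the cheap model's answers are acceptable for the traffic sent to it, so it is worth sampling routed responses for quality before trusting the number.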
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Cache hit rate below 10% | Similarity threshold too strict | Lower the cosine threshold in small steps (e.g., from 0.92 toward 0.85) and spot-check for wrong cache hits |
| Batching degrades response quality | Prompts too diverse in a single batch | Group semantically similar prompts before batching |
| Prompt compression removes critical context | Aggressive truncation of domain-specific terms | Use a whitelist of terms to never compress |
| Model router sends hard queries to cheap model | Complexity estimator too simplistic | Add keyword-based rules for known high-complexity domains |
| Cost savings look good but latency triples | Cache miss path now checks all entries | Replace linear scan with FAISS-based ANN cache lookup |
Practice Questions
Why is semantic caching more effective than exact-match caching for LLM APIs? Users ask the same question with different wording; semantic caching catches paraphrases that exact matching misses.
How does request batching reduce per-token API cost? Batching shares the fixed input context overhead across multiple queries and reduces the number of API calls.
What is the risk of using a cheap model for all queries? Cheap models have lower accuracy on complex reasoning tasks, potentially degrading user experience and trust.
How does prompt compression reduce cost without changing the model? Fewer input tokens mean lower per-request cost; compression removes redundant words without altering the core query.
Challenge: Build a cost-tracking dashboard that captures every LLM API call, logs the model used, prompt/response tokens, latency, and cost, and alerts when daily spending exceeds a configurable budget.
Mini Project
Build a cost-optimized AI chat proxy. Create a FastAPI proxy that sits between users and the LLM API, implementing semantic caching with FAISS for sub-10ms lookups, model routing based on query complexity, prompt compression for long inputs, and usage tracking with per-user cost allocation. Output a daily cost report showing savings compared to an uncached, single-model baseline.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.