Designing AI API Endpoints — Best Practices for LLM-Powered Services
In this tutorial, you'll learn about Designing AI API Endpoints. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Designing AI API endpoints requires different patterns than traditional REST APIs — streaming responses, prompt validation, context management, and cost-aware Rate Limiting are unique to LLM-powered services.
What You'll Learn
You'll learn patterns for building AI API endpoints including streaming responses, request caching, Prompt Injection detection, structured JSON output, and usage-based Rate Limiting with FastAPI.
Why It Matters
Standard REST patterns break under AI workloads. LLM calls are slow, expensive, and nondeterministic. Proper Api Design reduces latency by 60%, cuts costs by half, and prevents abuse through Prompt Injection and excessive usage.
Real-World Use
Doda Browser's AI features — smart search, page summarization, and code completion — are all served through a unified AI API Gateway that handles streaming, caching, and Rate Limiting across multiple LLM providers.
AI API Architecture
flowchart LR
A[Client] --> B[API Gateway]
B --> C[Auth / Rate Limit]
C --> D[Prompt Guard]
D --> E[Cache Check]
E --> F[LLM Provider]
F --> G[Streaming Response]
E --> H[Cache Store]
G --> A
Streaming Responses
LLM responses take seconds to generate. Streaming returns tokens as they arrive, improving perceived latency.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI
from fastapi.responses import StreamingResponse
import json
app = FastAPI(title="AI API Gateway")
client = OpenAI()
class ChatRequest(BaseModel):
model: str = "gpt-4o-mini"
messages: list[dict]
stream: bool = True
max_tokens: int = 1024
def stream_generator(response):
for chunk in response:
if chunk.choices[0].delta.content:
yield f"data: {json.dumps({
'content': chunk.choices[0].delta.content
})}\n\n"
yield "data: [DONE]\n\n"
@app.post("/v1/chat/completions")
async def chat_completion(request: ChatRequest):
try:
response = client.chat.completions.create(
model=request.model,
messages=request.messages,
stream=request.stream,
max_tokens=request.max_tokens,
)
return StreamingResponse(
stream_generator(response),
media_type="text/event-stream"
)
except Exception as e:
raise HTTPException(status_code=502, detail=str(e))
# Test with curl
print("Endpoint: POST /v1/chat/completions")
print("Streams SSE-formatted tokens as they arrive")
Expected output:
Endpoint: POST /v1/chat/completions
Streams SSE-formatted tokens as they arrive
Structured Output with Pydantic
Force LLMs to return valid JSON with a predefined schema.
from pydantic import BaseModel, Field
from typing import List, Optional
from openai import OpenAI
client = OpenAI()
class AnalysisResult(BaseModel):
sentiment: str = Field(
description="One of: positive, negative, neutral"
)
confidence: float = Field(
ge=0.0, le=1.0, description="Confidence score"
)
key_points: List[str] = Field(
max_length=5, description="Up to 5 key points"
)
summary: str = Field(
max_length=200, description="One-sentence summary"
)
def analyze_text(text: str) -> AnalysisResult:
response = client.beta.chat.completions.parse(
model="gpt-4o-2024-08-06",
messages=[
{"role": "system", "content": "Analyze the text and return structured output."},
{"role": "user", "content": text}
],
response_format=AnalysisResult,
)
return response.choices[0].message.parsed
result = analyze_text(
"The new update is fantastic! The speed improvements are remarkable, "
"though the UI changes will take some getting used to."
)
print(f"Sentiment: {result.sentiment}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Key Points: {result.key_points}")
print(f"Summary: {result.summary}")
Expected output:
Sentiment: positive
Confidence: 0.92
Key Points: ['Speed improvements are remarkable', 'UI changes may need adjustment']
Summary: Users are very positive about performance gains but have mixed feelings about the interface changes.
Request Caching
Cache identical requests to reduce costs and latency for repeated queries.
import hashlib
import JSON
import Redis.asyncio as Redis
from FastAPI import Depends
Redis_client = Redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 3600 # 1 hour
def make_cache_key(request: ChatRequest) -> str:
raw = JSON.dumps(request.model_dump(), sort_keys=True)
return f"ai:cache:{hashlib.sha256(raw.encode()).hexdigest()}"
@app.post("/v1/chat/completions/cached")
async def cached_chat(request: ChatRequest):
cache_key = make_cache_key(request)
# Check cache
cached = await Redis_client.get(cache_key)
if cached:
return JSON.loads(cached)
# No cache hit — call LLM
response = client.chat.completions.create(
model=request.model,
messages=request.messages,
max_tokens=request.max_tokens,
)
result = {
"content": response.choices[0].message.content,
"model": request.model,
"cached": False
}
# Store in cache
await Redis_client.setex(
cache_key, CACHE_TTL, JSON.dumps(result)
)
return result
# Test cache behavior
print("First call: hits LLM, caches result")
print("Second call with same input: returns cached result")
print(f"Cache TTL: {CACHE_TTL}s")
Expected output:
First call: hits LLM, caches result
Second call with same input: returns cached result
Cache TTL: 3600s
Rate Limiting with Token Awareness
Track usage per API key and limit based on tokens consumed.
from FastAPI import Request, HTTPException
import time
class TokenBucket:
def __init__(self, max_tokens: int, refill_rate: float):
self.max_tokens = max_tokens
self.refill_rate = refill_rate
self.tokens = max_tokens
self.last_refill = time.time()
def consume(self, tokens: int) -> bool:
now = time.time()
elapsed = now - self.last_refill
self.tokens = min(
self.max_tokens,
self.tokens + elapsed * self.refill_rate
)
self.last_refill = now
if self.tokens >= tokens:
self.tokens -= tokens
return True
return False
buckets = {} # API_key -> TokenBucket
def get_token_bucket(API_key: str) -> TokenBucket:
if API_key not in buckets:
buckets[API_key] = TokenBucket(
max_tokens=100000,
refill_rate=1000 # tokens per second
)
return buckets[API_key]
@app.middleware("HTTP")
async def rate_limit_middleware(request: Request, call_next):
API_key = request.headers.get("X-API-Key", "anonymous")
bucket = get_token_bucket(API_key)
estimated_tokens = 500 # estimate per request
if not bucket.consume(estimated_tokens):
raise HTTPException(
status_code=429,
detail={
"error": "rate_limit_exceeded",
"message": "Token quota exceeded. Try again later.",
"retry_after": 60
}
)
response = await call_next(request)
response.headers["X-RateLimit-Remaining"] = str(
int(bucket.tokens)
)
return response
print("Rate limit middleware active")
print("Token bucket: 100K max, 1000 tokens/sec refill")
Expected output:
Rate limit middleware active
Token bucket: 100K max, 1000 tokens/sec refill
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Timeout on long LLM calls | Default HTTP timeout too short | Set timeout to 300s or use streaming with keep-alive |
| Repeated identical API calls | No caching layer | Add Redis cache with SHA256 request hash |
| Prompt Injection in system message | User input not sanitized | Separate system and user messages; validate with guardrail model |
| JSON parsing fails on structured output | LLM returns malformed JSON | Use response_format with Pydantic model in supported models |
| Rate Limiting blocks legitimate users | Global rate limit without per-key tracking | Implement token bucket per API key instead of global counter |
Practice Questions
Why is streaming important for LLM API endpoints? Streaming returns tokens as they are generated, reducing perceived latency from seconds to milliseconds for the first token.
How does request caching reduce costs in AI APIs? Caching identical requests avoids repeated LLM invocations, cutting API costs dollar-for-dollar for repeated queries.
What is the difference between a token bucket and a fixed-window rate limiter? Token buckets allow bursts up to the bucket capacity while limiting average rate; fixed windows cap requests per calendar interval.
Why should structured output be validated server-side and not trusted from the LLM? LLMs can still produce invalid output even with structured mode; server-side Pydantic validation catches and handles malformed responses.
Challenge: Build an API Gateway that routes to different LLM providers (OpenAI, Anthropic, local Ollama) based on the model name in the request, with automatic fallback if one provider fails.
Mini Project
Build an AI-powered content moderation API. Create a FastAPI endpoint that accepts text, sends it to an LLM for toxicity classification (categories: hate speech, harassment, spam, safe), returns structured JSON with category and confidence, implements request caching with Redis to avoid re-checking identical content, and logs every request with token usage per API key for billing.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro