AI Observability and Monitoring — LangSmith, Weights and Biases and Production Tracing

DodaTech Updated 2026-06-22 7 min read

AI observability gives you visibility into how LLM applications behave in production. This guide covers tracing, monitoring, experiment tracking, and debugging with LangSmith, Weights and Biases, and custom instrumentation.

What You'll Learn

You'll learn to instrument LLM applications with LangSmith for tracing, track experiments with Weights and Biases, monitor latency and cost in production, and build custom dashboards for AI system health.

Why It Matters

LLM applications fail silently. A prompt change can degrade quality without any error message. Observability tools capture token usage, latency, response quality, and failure modes — giving you the data needed to debug, optimize, and improve AI systems.

Real-World Use

Doda Browser's AI team uses LangSmith to trace every AI assistant interaction, correlating user satisfaction scores with specific model versions, prompts, and retrieval contexts to continuously improve the assistant's helpfulness.

Observability Stack

flowchart LR
    A[Application] --> B[LangSmith Tracer]
    A --> C[Custom Logger]
    B --> D[Traces DB]
    C --> E[Metrics DB]
    D --> F[Dashboard]
    E --> F
    F --> G[Alerts]
    F --> H[Optimization]

LangSmith Tracing

Instrument your LLM application with LangSmith for full trace visibility.

import os
from langsmith import Client
from langsmith.run_helpers import traceable
from openai import OpenAI

# Requires: pip install langsmith openai (and OPENAI_API_KEY set in the environment)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-langsmith-api-key"
os.environ["LANGCHAIN_PROJECT"] = "ai-assistant"

client = OpenAI()
langsmith_client = Client()

@traceable(run_type="llm")
def call_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
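    # @traceable records this call's inputs, outputs, latency, and errors.
    # Token usage is only recorded when the run includes it, for example by
    # wrapping the client with langsmith.wrappers.wrap_openai.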
    return result

@traceable(run_type="chain")
def rag_pipeline(query: str) -> dict:
    # Step 1: Retrieve (a list of document strings)
    documents = retrieve_documents(query)
    print(f"Retrieved {len(documents)} documents")

    # Step 2: Generate
    context = "\n\n".join(documents)
    prompt = f"Context: {context}\n\nQuestion: {query}"
    answer = call_llm(prompt)

    return {"query": query, "answer": answer, "context": documents}

def retrieve_documents(query: str) -> list[str]:
    # Simulated retrieval; returns a list so len() counts documents, not characters
    return ["Relevant documentation about API rate limits."]

# Run traced pipeline
result = rag_pipeline("What are the rate limits?")
print(f"Traced pipeline complete: {result['answer'][:60]}...")
print("View trace at: https://smith.langchain.com")

Expected output (the answer text comes from the model and will vary):

Retrieved 1 documents
Traced pipeline complete: Based on the documentation, API rate limits are 100...
View trace at: https://smith.langchain.com

Custom Instrumentation

Build a custom monitoring wrapper for LLM calls.

import functools
import json
import time
from datetime import datetime
from typing import Callable

class LLMMonitor:
    def __init__(self):
        self.logs = []

    def monitor(self, func: Callable) -> Callable:
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            start = time.time()
            error = None
            result = None

            try:
                result = func(*args, **kwargs)
            except Exception as e:
                error = str(e)
                raise
            finally:
                elapsed = time.time() - start
                log_entry = {
                    "timestamp": datetime.now().isoformat(),
                    "function": func.__name__,
                    "latency_ms": round(elapsed * 1000, 2),
                    "error": error,
                    "args_preview": str(args)[:100],
                    "result_preview": str(result)[:100] if result else None
                }
                self.logs.append(log_entry)

                # Alert on slow calls (3 s threshold, so the 3.2 s demo call below triggers it)
                if elapsed > 3.0:
                    print(f"[ALERT] Slow LLM call: {elapsed:.2f}s")
                # Alert on errors
                if error:
                    print(f"[ALERT] LLM call failed: {error}")

            return result
        return wrapper

    def get_stats(self) -> dict:
        if not self.logs:
            return {"total_calls": 0}

        latencies = [l["latency_ms"] for l in self.logs if not l["error"]]
        errors = [l for l in self.logs if l["error"]]

        return {
            "total_calls": len(self.logs),
            "total_errors": len(errors),
            "error_rate": round(len(errors) / len(self.logs) * 100, 2),
            "avg_latency_ms": round(sum(latencies) / len(latencies), 2) if latencies else 0,
            "p95_latency_ms": sorted(latencies)[int(len(latencies) * 0.95)] if latencies else 0,
        }

    def export_logs(self, path: str):
        with open(path, "w") as f:
            json.dump(self.logs, f, indent=2)
        print(f"Exported {len(self.logs)} log entries to {path}")

monitor = LLMMonitor()

@monitor.monitor
def slow_llm_call(prompt: str) -> str:
    time.sleep(3.2)  # Simulate slow call
    return f"Response to: {prompt[:20]}..."

@monitor.monitor
def fast_llm_call(prompt: str) -> str:
    return f"Fast response to: {prompt[:20]}..."

# Test
fast_llm_call("What is Python?")
slow_llm_call("Explain quantum computing")
fast_llm_call("What is machine learning?")

stats = monitor.get_stats()
print(f"\nMonitor stats:")
print(f"  Total calls: {stats['total_calls']}")
print(f"  Error rate: {stats['error_rate']}%")
print(f"  Avg latency: {stats['avg_latency_ms']}ms")
print(f"  P95 latency: {stats['p95_latency_ms']}ms")

Expected output (latencies are approximate and vary slightly between runs):

[ALERT] Slow LLM call: 3.20s

Monitor stats:
  Total calls: 3
  Error rate: 0.0%
  Avg latency: 1066.67ms
  P95 latency: 3200.0ms
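
With only three samples, the P95 reported above is simply the slowest call. A small sketch of the nearest-rank percentile makes that behavior explicit. The helper name `nearest_rank_percentile` is hypothetical, and this is one common percentile definition, not the only one; it can differ by one position from the `int(len(latencies) * 0.95)` index used in `get_stats`.

```python
import math
from typing import Sequence


def nearest_rank_percentile(values: Sequence[float], pct: float) -> float:
    """Return the smallest sample such that at least pct% of samples are <= it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Three calls like the demo above: two fast, one slow.
small = [0.01, 3200.0, 0.02]
print(nearest_rank_percentile(small, 95))   # 3200.0, i.e. just the maximum

# With 100 samples the P95 is no longer the single worst call.
large = [float(i) for i in range(1, 101)]   # 1.0 .. 100.0
print(nearest_rank_percentile(large, 95))   # 95.0
print(nearest_rank_percentile(large, 50))   # 50.0
```

In practice this means tail percentiles are only informative once the window holds enough requests; with a handful of samples, report the maximum instead.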

Experiment Tracking with Weights and Biases

Track prompt versions, model parameters, and evaluation metrics.

import wandb
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class ExperimentConfig:
    model: str
    temperature: float
    prompt_template: str
    max_tokens: int
    top_p: float

@dataclass
class ExperimentResult:
    accuracy: float
    avg_latency_ms: float
    total_cost: float
    avg_tokens_per_response: int

class ExperimentTracker:
    def __init__(self, project_name: str = "ai-assistant"):
        self.project = project_name
        self.run = None

    def start_experiment(self, config: ExperimentConfig):
        self.run = wandb.init(
            project=self.project,
            config=asdict(config)
        )
        print(f"Started experiment: {self.run.name}")
        print(f"Config: {asdict(config)}")

    def log_metrics(self, result: ExperimentResult, step: int = 0):
        if not self.run:
            print("No active experiment")
            return

        metrics = asdict(result)
        wandb.log(metrics, step=step)
        print(f"Logged metrics: {metrics}")

    def log_prompt(self, prompt: str, response: str, score: float):
        if not self.run:
            return
        wandb.log({
            "sample_prompt": wandb.HTML(f"<pre>{prompt}</pre>"),
            "sample_response": wandb.HTML(f"<pre>{response}</pre>"),
            "sample_score": score
        })

    def finish(self):
        if self.run:
            name = self.run.name
            wandb.finish()
            self.run = None  # later log_* calls become no-ops instead of logging to a finished run
            print(f"Experiment {name} finished")

# Simulate experiment tracking
def mock_experiment():
    tracker = ExperimentTracker()

    config = ExperimentConfig(
        model="gpt-4o-mini",
        temperature=0.3,
        prompt_template="Answer concisely: {query}",
        max_tokens=200,
        top_p=0.9
    )
    tracker.start_experiment(config)

    # Simulate evaluation
    results = ExperimentResult(
        accuracy=0.87,
        avg_latency_ms=1240,
        total_cost=0.45,
        avg_tokens_per_response=156
    )
    tracker.log_metrics(results)

    tracker.finish()

print("WandB experiment tracking ready (requires wandb login)")
print("Run: mock_experiment() after setting up wandb")

Expected output:

WandB experiment tracking ready (requires wandb login)
Run: mock_experiment() after setting up wandb

Production Monitoring Dashboard

Build a real-time metrics dashboard for LLM operations.

from collections import deque
from dataclasses import dataclass, field
from typing import Deque

@dataclass
class MetricsWindow:
    window_size: int = 1000
    latencies: Deque[float] = field(default_factory=deque)
    tokens: Deque[int] = field(default_factory=deque)
    errors: Deque[bool] = field(default_factory=deque)
    costs: Deque[float] = field(default_factory=deque)

    def add(self, latency: float, tokens: int, error: bool, cost: float):
        self.latencies.append(latency)
        self.tokens.append(tokens)
        self.errors.append(error)
        self.costs.append(cost)

        # Maintain window size
        if len(self.latencies) > self.window_size:
            self.latencies.popleft()
            self.tokens.popleft()
            self.errors.popleft()
            self.costs.popleft()

    def get_snapshot(self) -> dict:
        n = len(self.latencies)
        if n == 0:
            return {"status": "no_data"}

        recent_latencies = list(self.latencies)
        recent_errors = list(self.errors)

        return {
            "total_requests": n,
            "avg_latency_ms": round(sum(recent_latencies) / n, 1),
            "p95_latency_ms": sorted(recent_latencies)[
                int(n * 0.95)
            ] if n > 1 else recent_latencies[0],
            "error_rate": round(
                sum(recent_errors) / n * 100, 2
            ),
            "total_tokens": sum(self.tokens),
            "total_cost": round(sum(self.costs), 4),
            "avg_cost_per_request": round(
                sum(self.costs) / n, 6
            ),
            "requests_per_minute": round(
                n / 5, 1  # Assuming 5-minute window
            )
        }

# Simulate dashboard
dashboard = MetricsWindow(window_size=100)
import random
for _ in range(50):
    dashboard.add(
        latency=random.gauss(800, 200),
        tokens=random.randint(50, 400),
        error=random.random() < 0.03,
        cost=random.uniform(0.001, 0.01)
    )

snapshot = dashboard.get_snapshot()
print("Production Dashboard Snapshot:")
for key, value in snapshot.items():
    print(f"  {key}: {value}")

Sample output (values vary between runs because the inputs are random):

Production Dashboard Snapshot:
  total_requests: 50
  avg_latency_ms: 795.3
  p95_latency_ms: 1128.0
  error_rate: 4.0
  total_tokens: 11250
  total_cost: 0.2756
  avg_cost_per_request: 0.005512
  requests_per_minute: 10.0
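
The simulation above feeds random costs into the window. In a real system the cost of a request is usually derived from its token counts. The sketch below shows one way to do that, counting both input and output tokens. The `Pricing` class, the `request_cost` helper, the model name, and the prices are all placeholders for illustration, not real provider rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price per one million tokens, in USD. Values below are placeholders."""
    input_per_million: float
    output_per_million: float


# Hypothetical price table; replace with your provider's current rates.
PRICES = {
    "example-model": Pricing(input_per_million=1.00, output_per_million=4.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from both token counts."""
    price = PRICES[model]
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


# A RAG request typically has a long prompt and a short answer.
cost = request_cost("example-model", input_tokens=3000, output_tokens=200)
print(f"{cost:.6f}")  # 0.003800
```

With these placeholder prices, counting only the 200 output tokens would give 0.000800, under a quarter of the estimate, which is why retrieval-heavy prompts need input tokens in the cost metric.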

Common Errors

Error	Cause	Fix
LangSmith traces not appearing in dashboard	LANGCHAIN_API_KEY not set or project name mismatch	Verify environment variables and check API key validity
WandB run crashes with duplicate name	Multiple runs with same config	Add timestamp or random suffix to run names
Custom logger slows down production	Synchronous file I/O on every call	Use async logging or batch writes every 100ms
P95 latency metric is misleading	Cold starts skew the distribution	Separate cold start requests in metrics or use warm-up period
Cost tracking underestimates actual spend	Only tracking output tokens	Track both input and output tokens; include per-request overhead
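
The batch-write fix in the table can be sketched with a queue and a background thread, so the request path only does an in-memory enqueue. This is a minimal illustration, not a production logger: the class name `BatchedLogWriter` and its parameters are hypothetical, and it flushes when the batch is full or the queue has been idle for `flush_interval` seconds.

```python
import json
import os
import queue
import tempfile
import threading


class BatchedLogWriter:
    """Buffers log entries in memory and writes them from a background thread."""

    def __init__(self, path: str, flush_interval: float = 0.1, batch_size: int = 100):
        self.path = path
        self.flush_interval = flush_interval
        self.batch_size = batch_size
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, entry: dict) -> None:
        # Called on the request path: only an in-memory enqueue, no file I/O.
        self._queue.put(entry)

    def close(self) -> None:
        # None is a sentinel telling the worker to flush and stop.
        self._queue.put(None)
        self._thread.join()

    def _run(self) -> None:
        batch = []
        running = True
        while running:
            try:
                item = self._queue.get(timeout=self.flush_interval)
            except queue.Empty:
                self._flush(batch)  # queue went idle: write what we have
                continue
            if item is None:
                running = False
            else:
                batch.append(item)
            if not running or len(batch) >= self.batch_size:
                self._flush(batch)

    def _flush(self, batch: list) -> None:
        if not batch:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            f.writelines(json.dumps(entry) + "\n" for entry in batch)
        batch.clear()


path = os.path.join(tempfile.mkdtemp(), "llm_calls.jsonl")
writer = BatchedLogWriter(path)
for i in range(250):
    writer.log({"call": i, "latency_ms": 100 + i})
writer.close()

with open(path, encoding="utf-8") as f:
    print(sum(1 for _ in f))  # 250
```

Calling `close()` before the process exits matters here: the worker is a daemon thread, so entries still in the queue would otherwise be lost.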

Practice Questions

What is the difference between tracing and monitoring in AI observability? Tracing captures detailed per-request execution paths (individual LLM calls, retrievals); monitoring aggregates metrics over time (latency, error rate, cost).
Why is token-level tracing important for debugging LLM applications? Token counts explain cost spikes and context-window-exceeded errors, and they help identify prompts that are unexpectedly long.
How can observability data be used to optimize prompts? By correlating prompt templates with response quality scores, latency, and token usage to identify poorly performing prompts.
What metrics should trigger an alert in production AI systems? An error rate above 5%, P95 latency above 10 s, cost per request above a set threshold, and a sudden drop in user satisfaction score.
Challenge: Build a full observability stack with a custom Python SDK that wraps all LLM calls with automatic tracing, logs every interaction to ClickHouse, computes real-time metrics with Apache Flink, and serves a Grafana dashboard with latency, cost, error rate, and quality score panels.
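
The alert thresholds named in the practice answers (error rate above 5%, P95 latency above 10 seconds) can be expressed as a simple rule check over a metrics snapshot. The sketch below assumes a snapshot dict with the same keys as the dashboard example; the `check_alerts` function and the cost limit are hypothetical.

```python
from typing import Dict, List

# Thresholds from the practice answers; the cost limit is a placeholder value.
THRESHOLDS: Dict[str, float] = {
    "error_rate": 5.0,              # percent
    "p95_latency_ms": 10_000.0,     # 10 seconds
    "avg_cost_per_request": 0.02,   # USD, hypothetical
}


def check_alerts(snapshot: Dict[str, float]) -> List[str]:
    """Return one message per metric that exceeds its threshold."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = snapshot.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric}={value} exceeds {limit}")
    return alerts


healthy = {"error_rate": 4.0, "p95_latency_ms": 1128.0, "avg_cost_per_request": 0.005512}
degraded = {"error_rate": 7.5, "p95_latency_ms": 12500.0, "avg_cost_per_request": 0.005512}

print(check_alerts(healthy))   # []
print(check_alerts(degraded))
# ['error_rate=7.5 exceeds 5.0', 'p95_latency_ms=12500.0 exceeds 10000.0']
```

Static thresholds like these are a starting point; a sudden drop in a quality or satisfaction score usually needs a comparison against a recent baseline rather than a fixed limit.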

Mini Project

Build an AI assistant monitoring dashboard. Instrument a simple RAG chatbot with LangSmith traces, export a custom metrics stream to Prometheus (latency, token count, error rate), create a Grafana dashboard with panels for real-time requests, P99 latency heatmap, cost per user, and response quality scores from user feedback, and configure alerts for anomaly detection.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
