Embedding Models and Semantic Search — From Text to Vector Representations

DodaTech Updated 2026-06-22 6 min read

Embedding models convert text into dense vector representations that capture semantic meaning — this guide covers popular embedding models, similarity search techniques, and production deployment strategies.

What You'll Learn

You'll learn how embedding models work, compare OpenAI, sentence-transformers, Cohere, and BGE models, build semantic search with Python, and implement hybrid retrieval for RAG pipelines.

Why It Matters

Embeddings are the foundation of semantic search, RAG, clustering, and recommendation systems. Choosing the right embedding model affects retrieval accuracy, latency, and cost — making it one of the most important decisions in any AI pipeline.

Real-World Use

Doda Browser's smart search indexes web pages using BGE-base embeddings stored in Chroma, so users can search by meaning rather than keywords: a query for "budget-friendly laptops" matches a page that says "affordable notebooks."

Embedding Model Landscape

flowchart LR
    A[Text Input] --> B{Model Type}
    B --> C[OpenAI]
    B --> D[Sentence Transformers]
    B --> E[Cohere]
    B --> F[BGE]
    C --> G[1536d / 3072d]
    D --> H[384d / 768d]
    E --> I[1024d / 4096d]
    F --> J[768d / 1024d]
    G --> K[Vector DB]
    H --> K
    I --> K
    J --> K
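
The dimensions in the diagram translate directly into index size. As a rough sketch, assuming uncompressed float32 storage (4 bytes per value) and ignoring index overhead and metadata, raw vector storage can be estimated as follows. `raw_index_bytes` is a hypothetical helper, not part of any library.

```python
def raw_index_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for n_vectors embeddings of size dim (float32 by default)."""
    return n_vectors * dim * bytes_per_value

# Compare the dimensions shown in the diagram for one million vectors.
for dim in (384, 768, 1536, 3072):
    gib = raw_index_bytes(1_000_000, dim) / 2**30
    print(f"{dim:>5}d x 1M vectors: {gib:.2f} GiB")
```

Under these assumptions, one million vectors need about 1.43 GiB at 384 dimensions and about 11.44 GiB at 3072 dimensions, which is why dimension is a cost decision as well as an accuracy decision.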

Comparing Embedding Models

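Run different models through the same set of sentence pairs. The mock functions below return random vectors in place of real model calls, so they exercise the harness without measuring semantic quality; swap in a real embedding function to get meaningful scores.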

import numpy as np
from typing import List, Dict

class EmbeddingBenchmark:
    def __init__(self):
        self.test_pairs = [
            ("The cat sat on the mat", "A feline rested on the rug"),
            ("Python is a programming language",
             "Python is a type of snake"),
            ("Machine learning is fascinating",
             "I enjoy studying artificial intelligence"),
            ("The stock market crashed today",
             "Markets fell sharply in afternoon trading"),
        ]

    def cosine_similarity(self, a: List[float], b: List[float]) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (
            np.linalg.norm(a) * np.linalg.norm(b)
        ))

    def evaluate_model(
        self, embed_func, model_name: str
    ) -> Dict:
        scores = []
        for text1, text2 in self.test_pairs:
            emb1 = embed_func(text1)
            emb2 = embed_func(text2)
            score = self.cosine_similarity(emb1, emb2)
            scores.append(score)

        return {
            "model": model_name,
            "avg_similarity": round(np.mean(scores), 4),
            "min_similarity": round(min(scores), 4),
            "max_similarity": round(max(scores), 4),
            "scores": [round(s, 4) for s in scores]
        }

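# Simulate different embedding models with random vectors.
# zlib.crc32 gives a seed that is stable across runs; the built-in
# hash() is randomized per process for strings.
import zlib

def mock_openai(text: str) -> List[float]:
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    return rng.standard_normal(1536).tolist()

def mock_sbert(text: str) -> List[float]:
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    return rng.standard_normal(384).tolist()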

benchmark = EmbeddingBenchmark()
# Results comparison
models = [
    ("OpenAI text-embedding-3-small", mock_openai),
    ("Sentence-Transformers all-MiniLM-L6-v2", mock_sbert),
]

for name, func in models:
    result = benchmark.evaluate_model(func, name)
    print(f"{name}:")
    print(f"  Avg similarity: {result['avg_similarity']}")
    print(f"  Dimension: {len(func('test'))}")

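Expected output (the mocks return random vectors, so the average similarity is close to 0 and the exact figures depend on the seed; only the dimensions are meaningful):

OpenAI text-embedding-3-small:
  Avg similarity: (near 0)
  Dimension: 1536
Sentence-Transformers all-MiniLM-L6-v2:
  Avg similarity: (near 0)
  Dimension: 384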
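
Average similarity alone does not show whether a model separates related from unrelated text, and the second test pair (programming language vs. snake) is arguably a negative example. One possible refinement is to label each pair and report the gap between the mean similarity of related and unrelated pairs. The sketch below uses a toy word-count embedder as a stand-in for a real model; `toy_embed`, `separation`, and the labels are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def toy_embed(text: str) -> Dict[str, float]:
    """Stand-in embedder: lowercase word counts (lexical, not semantic)."""
    return dict(Counter(text.lower().split()))

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def separation(embed, pairs: List[Tuple[str, str, bool]]) -> float:
    """Mean similarity of related pairs minus mean similarity of unrelated pairs."""
    related = [cosine(embed(a), embed(b)) for a, b, label in pairs if label]
    unrelated = [cosine(embed(a), embed(b)) for a, b, label in pairs if not label]
    return sum(related) / len(related) - sum(unrelated) / len(unrelated)

labeled_pairs = [
    ("The cat sat on the mat", "A feline rested on the rug", True),
    ("Python is a programming language", "Python is a type of snake", False),
]
print(f"Separation: {separation(toy_embed, labeled_pairs):+.4f}")
```

The word-count embedder produces a negative gap here because the unrelated pair shares more words than the related one. A useful semantic model should produce a positive gap on the same pairs.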

Using Sentence-Transformers Locally

Run embedding models locally for low-latency, offline inference.

from sentence_transformers import SentenceTransformer

# Load a lightweight model (384-dim embeddings)
model = SentenceTransformer(
    "all-MiniLM-L6-v2"
)

# Encode sentences
sentences = [
    "The weather is beautiful today",
    "It is sunny and warm outside",
    "I need to buy groceries for dinner",
    "The stock market had a volatile session",
]

embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dtype: {embeddings.dtype}")
print(f"First 5 values: {embeddings[0][:5]}")

# Compute similarity matrix (cosine similarity by default; returns a torch.Tensor)
similarities = model.similarity(embeddings, embeddings)
print(f"\nSimilarity matrix {tuple(similarities.shape)}:")
for i in range(len(sentences)):
    for j in range(i+1, len(sentences)):
        sim = similarities[i][j].item()
        print(f"  [{i}] vs [{j}]: {sim:.4f}")

Expected output (values are illustrative; exact numbers depend on the model version and hardware):

Embedding shape: (4, 384)
Embedding dtype: float32
First 5 values: [-0.0234  0.0567 -0.0123  0.0891  0.0345]

Similarity matrix (4, 4):
  [0] vs [1]: 0.8765
  [0] vs [2]: 0.2341
  [0] vs [3]: 0.1234
  [1] vs [2]: 0.1987
  [1] vs [3]: 0.1012
  [2] vs [3]: 0.0893
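
A pairwise cosine similarity matrix can also be computed directly with NumPy: L2-normalize each row, then multiply the matrix by its transpose. The sketch below uses small hand-written vectors rather than real model output, and `cosine_matrix` is a hypothetical helper.

```python
import numpy as np

def cosine_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for the rows of a 2-D array."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

vectors = np.array([
    [1.0, 0.0],
    [2.0, 0.0],   # same direction as row 0, different length
    [0.0, 3.0],   # orthogonal to row 0
])
print(np.round(cosine_matrix(vectors), 4))
```

Rows 0 and 1 score 1.0 against each other despite their different lengths, and both score 0.0 against row 2, which shows that cosine similarity depends only on direction.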

Building a Semantic Search Engine

End-to-end semantic search with local embeddings and FAISS.

from typing import Dict, List

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

class SemanticSearch:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.dimension = self.model.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatL2(self.dimension)
        self.documents = []

    def add_documents(self, docs: List[str]):
        embeddings = self.model.encode(
            docs, show_progress_bar=True
        )
        self.index.add(embeddings.astype(np.float32))
        self.documents.extend(docs)
        print(f"Indexed {len(docs)} documents. "
              f"Total: {len(self.documents)}")

    def search(self, query: str, k: int = 3) -> List[Dict]:
        query_embedding = self.model.encode([query])
        distances, indices = self.index.search(
            query_embedding.astype(np.float32), k
        )

        results = []
        for dist, idx in zip(distances[0], indices[0]):
            # FAISS returns -1 for missing neighbors when k exceeds the index size
            if 0 <= idx < len(self.documents):
                results.append({
                    "document": self.documents[idx],
                    "score": float(1 / (1 + dist)),
                    "distance": float(dist)
                })
        return results

# Build search index
search_engine = SemanticSearch()
docs = [
    "Python is a high-level programming language.",
    "JavaScript runs in web browsers for interactive pages.",
    "Machine learning models learn patterns from data.",
    "Vector databases store embeddings for similarity search.",
    "RAG combines retrieval with LLMs for accurate answers.",
    "FAISS is a library for efficient similarity search.",
]
search_engine.add_documents(docs)

# Search
query = "How do I find similar vectors?"
results = search_engine.search(query, k=3)
print(f"\nQuery: {query}")
for r in results:
    print(f"  Score: {r['score']:.4f} | {r['document']}")

Expected output (scores are illustrative; the encoding progress bar is omitted):

Indexed 6 documents. Total: 6

Query: How do I find similar vectors?
  Score: 0.7845 | FAISS is a library for efficient similarity search.
  Score: 0.6543 | Vector databases store embeddings for similarity search.
  Score: 0.4321 | Machine learning models learn patterns from data.
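
For hybrid retrieval, a semantic ranking like the one above is often fused with a keyword ranking (for example from BM25). A minimal sketch of reciprocal rank fusion (RRF), one common fusion method, follows. The two input rankings are made-up document IDs, and k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)), ranks start at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda d: scores[d], reverse=True)

semantic = ["faiss", "vector_db", "ml"]   # hypothetical embedding-search order
keyword = ["vector_db", "rag", "faiss"]   # hypothetical keyword-search order
print(reciprocal_rank_fusion([semantic, keyword]))
# ['vector_db', 'faiss', 'rag', 'ml']
```

RRF uses only ranks, so it avoids having to put cosine scores and keyword scores on a common scale.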

OpenAI Embeddings API

Use OpenAI's embedding models for cloud-based, high-dimensional embeddings.

from typing import List

import numpy as np
from openai import OpenAI

client = OpenAI()

def get_OpenAI_embedding(
    text: str, model: str = "text-embedding-3-small"
) -> List[float]:
    response = client.embeddings.create(
        model=model,
        input=text
    )
    return response.data[0].embedding

# Compare two texts
texts = [
    "The new AI model achieves state-of-the-art results",
    "This neural network outperforms all previous approaches",
    "I need to buy milk and eggs from the store",
]

embeddings = [get_OpenAI_embedding(t) for t in texts]

# Dimension reduction option
small_embedding = get_OpenAI_embedding(
    "test", "text-embedding-3-small"
)
full_dim = len(small_embedding)
# Can request smaller dimensions via dimensions param
small_256 = client.embeddings.create(
    model="text-embedding-3-small",
    input="test",
    dimensions=256
).data[0].embedding

print(f"Full dimension: {full_dim}")
print(f"Reduced dimension: {len(small_256)}")

cos_sim = lambda a, b: np.dot(a, b) / (
    np.linalg.norm(a) * np.linalg.norm(b)
)
print(f"\nSimilarity AI-related texts: "
      f"{cos_sim(embeddings[0], embeddings[1]):.4f}")
print(f"Similarity unrelated texts: "
      f"{cos_sim(embeddings[0], embeddings[2]):.4f}")

Expected output (similarity values are illustrative):

Full dimension: 1536
Reduced dimension: 256

Similarity AI-related texts: 0.8123
Similarity unrelated texts: 0.1234
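
When full-size embeddings are already stored, a shorter version can be approximated on the client by keeping the leading values and re-normalizing. This only makes sense for models trained so that leading dimensions carry most of the information, which OpenAI's documentation describes for the text-embedding-3 models. The sketch uses a made-up vector, not real API output, and `truncate_embedding` is a hypothetical helper.

```python
import math
from typing import List

def truncate_embedding(vec: List[float], dims: int) -> List[float]:
    """Keep the first `dims` values and rescale the result to unit L2 norm."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

full = [3.0, 4.0, 12.0]               # made-up vector for illustration
short = truncate_embedding(full, 2)
print(short)  # [0.6, 0.8]
```

Re-normalizing matters if downstream code uses dot product as a stand-in for cosine similarity, since a truncated vector no longer has unit length.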

Common Errors

Error	Cause	Fix
Dot-product scores are unbounded or rank differently from cosine	Embedding vectors not normalized	L2-normalize vectors before using dot product (inner product) as a stand-in for cosine similarity
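Search returns wrong results	Queries encoded with a different model than the indexed documents, or index dimension does not match the model output	Use the same model for indexing and querying, and create the FAISS or vector DB index with the model's output dimension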
Sentence-transformers model downloads slowly on first run	No local cache or slow network	Download once and store it with `model.save(path)`, or point the `cache_folder` parameter at a persistent directory
OpenAI embedding API returns different sizes	Different model or `dimensions` value between calls (text-embedding-3-small defaults to 1536, text-embedding-3-large to 3072)	Pin the model name and `dimensions` value in one place and check vector length before indexing
Local model uses 100% CPU and is slow	No GPU available	Use an ONNX version or a distilled model (e.g., all-MiniLM-L6-v2)
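
Dimension and data problems like these can be caught before indexing with a small guard. The following is a sketch, assuming embeddings arrive as a NumPy-compatible array; `check_embeddings` is a hypothetical helper, not a FAISS or sentence-transformers function.

```python
import numpy as np

def check_embeddings(vectors, expected_dim: int) -> np.ndarray:
    """Validate shape and values, then return a contiguous float32 array."""
    vectors = np.asarray(vectors)
    if vectors.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {vectors.ndim}-D")
    if vectors.shape[1] != expected_dim:
        raise ValueError(
            f"dimension mismatch: index expects {expected_dim}, got {vectors.shape[1]}"
        )
    if not np.isfinite(vectors).all():
        raise ValueError("embeddings contain NaN or infinite values")
    return np.ascontiguousarray(vectors, dtype=np.float32)

batch = check_embeddings(np.zeros((2, 384)), expected_dim=384)
print(batch.shape, batch.dtype)  # (2, 384) float32
```

Calling a guard like this before adding vectors to an index turns a silent mismatch into an explicit error at the point where it is easiest to debug.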

Practice Questions

What is the difference between sparse (TF-IDF) and dense (embedding) text representations? TF-IDF produces sparse vectors matching exact terms; dense embeddings capture semantic meaning in low-dimensional continuous vectors.
Why does cosine similarity work better than Euclidean distance for embeddings? Cosine similarity measures angle between vectors, which is invariant to vector magnitude; Euclidean distance is affected by vector length.
How does the dimensions parameter in OpenAI's embedding API affect search quality? Reducing dimensions trades some accuracy for storage efficiency and search speed; the text-embedding-3 models accept a smaller `dimensions` value such as 256.
What is the purpose of a Matryoshka embedding model? Matryoshka models produce a single embedding that can be truncated to multiple dimensions, enabling flexible storage/accuracy trade-offs without re-encoding.
Challenge: Build a cross-encoder reranker that takes the top 20 results from a bi-encoder (embedding) search, computes relevance scores with a cross-encoder model, and re-ranks them — measure the NDCG improvement over the bi-encoder alone.
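
For the challenge, NDCG@k can be computed from graded relevance labels listed in ranked order. This is a minimal sketch using the log2 discount with linear gain (an exponential gain, 2^rel - 1, is also common). The relevance labels are made up for illustration.

```python
import math
from typing import List

def dcg(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances: List[float], k: int) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

bi_encoder_order = [1, 0, 3, 2]   # hypothetical labels in bi-encoder rank order
reranked_order = [3, 2, 1, 0]     # same documents after reranking
print(f"NDCG@4 before: {ndcg(bi_encoder_order, 4):.4f}")
print(f"NDCG@4 after:  {ndcg(reranked_order, 4):.4f}")
```

Averaging the per-query difference between the two values over a labeled query set gives the improvement the challenge asks for.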

Mini Project

Build a multilingual semantic search for documentation. Load documentation files in English, Spanish, and French, encode them using a multilingual embedding model (distiluse-base-multilingual-cased-v2), index them in FAISS, and build a FastAPI endpoint that accepts queries in any of the three languages and returns relevant results from the entire corpus, demonstrating cross-language semantic understanding.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
