NLP Basics for Developers — Tokenization, Embeddings & Text Processing Explained

DodaTech Updated 2026-06-22 4 min read

Natural language processing (NLP) gives computers the ability to interpret and generate human language. This guide covers the core NLP concepts every developer should know.

What You'll Learn

You'll learn the fundamentals of NLP including tokenization, stop word removal, stemming, lemmatization, embeddings, and building a text classifier with Python.

Why It Matters

NLP powers search engines, chatbots, translation services, and content moderation systems. Understanding how text is processed and represented is the foundation for building any AI-powered language application.

Real-World Use

Durga Antivirus Pro uses NLP to analyze threat descriptions and classify malware reports by severity, helping security analysts prioritize incidents without reading every report manually.

NLP Pipeline Overview

flowchart LR
    A[Raw Text] --> B[Tokenization]
    B --> C[Cleaning]
    C --> D[Normalization]
    D --> E[Feature Extraction]
    E --> F[Model Training]
    F --> G[Prediction]

Tokenization

Tokenization splits text into individual units — tokens — which can be words, subwords, or characters.

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download("punkt_tab", quiet=True)

text = "Natural language processing enables computers to understand text. It powers search engines and chatbots."

# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Word tokenization
words = word_tokenize(text)
print("Words:", words)
print("Token count:", len(words))

Expected output:

Sentences: ['Natural language processing enables computers to understand text.', 'It powers search engines and chatbots.']
Words: ['Natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'text', '.', 'It', 'powers', 'search', 'engines', 'and', 'chatbots', '.']
Token count: 16
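
Word-level tokens are only one of the granularities mentioned above. As a rough illustration of subword tokenization, here is a minimal sketch of greedy longest-match splitting in the style of WordPiece. The vocabulary is hand-picked for this example (an assumption; real vocabularies are learned from a corpus), and `subword_tokenize` is a hypothetical helper, not a library function.

```python
# Toy vocabulary, hand-picked for illustration only (not a trained vocabulary).
# Pieces starting with "##" continue a word, following the WordPiece convention.
VOCAB = {"token", "##ize", "##ization", "##s", "chat", "##bot", "un", "##known"}

def subword_tokenize(word, vocab=VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(subword_tokenize("tokenization"))  # ['token', '##ization']
print(subword_tokenize("chatbots"))      # ['chat', '##bot', '##s']
print(subword_tokenize("xyz"))           # ['[UNK]']
```

Even with this tiny vocabulary, "chatbots" is covered by three known pieces, while a string with no matching pieces still falls back to the unknown token.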

Text Normalization

Normalization reduces surface variation in text, such as letter case, inflected word forms, and filler words, so that equivalent words map to the same token.

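import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt_tab", quiet=True)  # tokenizer data used by word_tokenize
nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)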

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

text = "The running dogs were playing happily in the gardens"
tokens = word_tokenize(text.lower())

# Remove stop words and punctuation
cleaned = [t for t in tokens if t not in stop_words and t.isalpha()]
print("Cleaned tokens:", cleaned)

# Stemming
stems = [stemmer.stem(t) for t in cleaned]
print("Stemmed:", stems)

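# Lemmatization: lemmatize() treats every word as a noun unless pos is given,
# so verb forms such as "running" stay unchanged (pass pos="v" to get "run")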
lemmas = [lemmatizer.lemmatize(t) for t in cleaned]
print("Lemmatized:", lemmas)

Expected output:

Cleaned tokens: ['running', 'dogs', 'playing', 'happily', 'gardens']
Stemmed: ['run', 'dog', 'play', 'happili', 'garden']
Lemmatized: ['running', 'dog', 'playing', 'happily', 'garden']
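
A generic stop word list can remove words that matter for a particular task; NLTK's English list, for example, includes negations such as "not". The sketch below shows one way to keep such words. It uses a small hand-written stop list and a hypothetical `DOMAIN_KEEP` set (both assumptions made for illustration, not NLTK data) and only the standard library.

```python
import re

# Small illustrative stop list (an assumption; NLTK's English list is much longer).
GENERIC_STOP_WORDS = {"the", "is", "a", "an", "not", "in", "of", "this", "was"}

# Words a sentiment or security task may need to keep (hypothetical domain choice).
DOMAIN_KEEP = {"not"}

def normalize(text, stop_words=GENERIC_STOP_WORDS, keep=DOMAIN_KEEP):
    """Lowercase, extract alphabetic tokens, and drop stop words except kept ones."""
    tokens = re.findall(r"[a-z]+", text.lower())
    active_stop_words = stop_words - keep
    return [t for t in tokens if t not in active_stop_words]

print(normalize("This file is not a threat"))              # ['file', 'not', 'threat']
print(normalize("This file is not a threat", keep=set()))  # ['file', 'threat']
```

With an empty keep-set the negation disappears and the sentence appears to say the opposite, which is the failure mode a domain-specific list is meant to avoid.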

Text Representations

Raw text must be converted into numerical form before machine learning models can use it.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "NLP powers search engines and chatbots",
    "Chatbots use NLP to understand user queries",
    "Search engines index web pages for fast retrieval",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Display feature names and matrix
feature_names = vectorizer.get_feature_names_out()
print("Vocabulary size:", len(feature_names))
print("\nTop terms per document:")
for i, doc in enumerate(documents):
    row = tfidf_matrix[i].toarray().flatten()
    top_indices = row.argsort()[-3:][::-1]
    # Cast to a plain float so recent NumPy versions print 0.481 rather than np.float64(0.481)
    top_terms = [(feature_names[j], round(float(row[j]), 3)) for j in top_indices]
    print(f"  Document {i+1}: {top_terms}")

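Expected output, summarized (equally weighted terms tie, so which tied terms appear in each top three, and in what order, can vary):

Vocabulary size: 17

Top terms per document:
  Document 1: 'powers' and 'and' (0.481 each), then one of the shared terms (0.366)
  Document 2: three of 'use', 'to', 'understand', 'user', 'queries' (0.403 each)
  Document 3: three of 'index', 'web', 'pages', 'for', 'fast', 'retrieval' (0.374 each)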
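
To see where these weights come from, the following sketch recomputes TF-IDF by hand with the standard library. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (lowercasing, tokens of two or more word characters, smoothed IDF, L2 normalization). The helpers `tokenize` and `tfidf_vector` are names invented for this example.

```python
import math
import re

documents = [
    "NLP powers search engines and chatbots",
    "Chatbots use NLP to understand user queries",
    "Search engines index web pages for fast retrieval",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(documents)

# Document frequency: the number of documents that contain each term.
df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_vector(tokens):
    """Return L2-normalized TF-IDF weights for one document as a dict."""
    raw = {term: tokens.count(term) * idf[term] for term in set(tokens)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
print("Vocabulary size:", len(vocabulary))
print("Doc 1 weight of 'powers':", round(vectors[0]["powers"], 3))
print("Doc 1 weight of 'search':", round(vectors[0]["search"], 3))
```

A term that occurs in only one document, such as "powers", receives a larger IDF than a term shared by two documents, such as "search", which is why it ends up with the higher weight.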

Word Embeddings with spaCy

Word embeddings capture semantic meaning by mapping words to dense vectors, so that words used in similar contexts end up close together in the vector space.

import spacy

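# Load a small English model.
# Note: en_core_web_sm ships without static word vectors, so similarity()
# falls back to context tensors and spaCy emits a warning. For real word
# vectors, load en_core_web_md or en_core_web_lg instead.
nlp = spacy.load("en_core_web_sm")

word1 = nlp("king")
word2 = nlp("queen")
word3 = nlp("man")
word4 = nlp("woman")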

# Vector similarity
print(f"king vs queen: {word1.similarity(word2):.4f}")
print(f"king vs man: {word1.similarity(word3):.4f}")
print(f"man vs woman: {word3.similarity(word4):.4f}")
print(f"king vs woman: {word1.similarity(word4):.4f}")

# Vector size
print(f"\nEmbedding dimension: {word1.vector.shape}")

Expected output (the similarity scores are illustrative and vary with the model and its version):

king vs queen: 0.7254
king vs man: 0.4378
man vs woman: 0.6421
king vs woman: 0.3195

Embedding dimension: (96,)
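
spaCy's `similarity()` defaults to cosine similarity between the two vectors. The sketch below computes that measure by hand so the scores above are easier to interpret. The three-dimensional vectors are made up for illustration and are not real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors for illustration only (not real embeddings).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

print(round(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]), 3))
print(round(cosine_similarity(toy_vectors["king"], toy_vectors["banana"]), 3))
```

Vectors pointing in nearly the same direction score close to 1, and vectors with little overlap score close to 0, regardless of their lengths.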

Common Errors

Error	Cause	Fix
Tokenization splits contractions incorrectly	Default tokenizer misses edge cases	Use a tokenizer with explicit contraction rules, such as spaCy's, or a subword tokenizer
Stemming produces non-words	Aggressive suffix removal	Use lemmatization for readability-critical tasks
Out-of-vocabulary words in prediction	Vocabulary built on limited training data	Use subword tokenization (BPE, WordPiece)
Stop word removal deletes meaningful terms	Generic stop word list	Customize stop words for your domain
Similarity score is too low for synonyms	Embedding model lacks context	Use contextual embeddings (BERT, sentence-transformers)

Practice Questions

What is the difference between stemming and lemmatization? Stemming removes suffixes heuristically and may produce non-words; lemmatization uses vocabulary and morphological analysis to return dictionary base forms.
Why are stop words removed during text preprocessing? Stop words carry little semantic meaning and add noise, so removing them reduces dimensionality and improves model focus on meaningful terms.
What advantage do embeddings have over TF-IDF vectors? Embeddings capture semantic relationships between words (synonyms, analogies) whereas TF-IDF relies on exact term matching and frequency.
How does subword tokenization handle rare words? Subword tokenization splits rare words into smaller known subunits, so a word the model has never seen can usually still be represented instead of being mapped to an unknown token.
Challenge: Build a document similarity search tool that takes a query, vectorizes all documents using TF-IDF and sentence embeddings, and returns the top 3 most relevant documents with similarity scores using both methods.

Mini Project

Build a spam detector using NLP. Collect a dataset of SMS messages labeled as spam or ham, preprocess the text with tokenization and TF-IDF vectorization, train a logistic regression classifier, and evaluate precision and recall. Deploy the model as a Flask endpoint that accepts text and returns a spam probability score.
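
For the evaluation step, here is a minimal sketch of how precision and recall are computed from true and predicted labels. The labels are invented for illustration, and `precision_recall` is a hypothetical helper; in a real project a library routine such as scikit-learn's metrics would normally do this.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Compute precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and predictions, invented for illustration.
y_true = ["spam", "ham", "spam", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.67 recall=0.67
```

Precision answers "of the messages flagged as spam, how many really were spam", while recall answers "of the real spam messages, how many were caught". A spam filter usually needs both reported, since blocking legitimate messages and missing spam have different costs.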

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
