NLP Basics for Developers — Tokenization, Embeddings & Text Processing Explained
Natural language processing (NLP) gives computers the ability to understand, interpret, and generate human language. This guide covers the core NLP concepts every developer should know.
What You'll Learn
You'll learn the fundamentals of NLP including tokenization, stop word removal, stemming, lemmatization, embeddings, and building a text classifier with Python.
Why It Matters
NLP powers search engines, chatbots, translation services, and content moderation systems. Understanding how text is processed and represented is the foundation for building any AI-powered language application.
Real-World Use
Durga Antivirus Pro uses NLP to analyze threat descriptions and classify malware reports by severity, helping security analysts prioritize incidents without reading every report manually.
NLP Pipeline Overview
flowchart LR
A[Raw Text] --> B[Tokenization]
B --> C[Cleaning]
C --> D[Normalization]
D --> E[Feature Extraction]
E --> F[Model Training]
F --> G[Prediction]
Tokenization
Tokenization splits text into individual units — tokens — which can be words, subwords, or characters.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download("punkt_tab", quiet=True)
text = "Natural language processing enables computers to understand text. It powers search engines and chatbots."
# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)
# Word tokenization
words = word_tokenize(text)
print("Words:", words)
print("Token count:", len(words))
Expected output:
Sentences: ['Natural language processing enables computers to understand text.', 'It powers search engines and chatbots.']
Words: ['Natural', 'language', 'processing', 'enables', 'computers', 'to', 'understand', 'text', '.', 'It', 'powers', 'search', 'engines', 'and', 'chatbots', '.']
Token count: 16
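The three granularities named above (words, subwords, characters) can be compared in a short pure-Python sketch. The vocabulary `TOY_VOCAB` and the helper `greedy_subword_tokenize` are hypothetical names invented for this illustration. The greedy longest-match rule is a simplification in the spirit of WordPiece, not the exact algorithm of any particular library.

```python
# Toy comparison of word, character, and subword tokenization.
# TOY_VOCAB is a hypothetical subword vocabulary invented for this example.
TOY_VOCAB = {"token", "ization", "un", "break", "able", "s"}


def greedy_subword_tokenize(word, vocab):
    """Split a word by repeatedly taking the longest prefix found in vocab."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


word_tokens = "unbreakable tokens".split()   # word level
char_tokens = list("token")                  # character level
subword_tokens = greedy_subword_tokenize("tokenization", TOY_VOCAB)

print("Words:", word_tokens)
print("Characters:", char_tokens)
print("Subwords:", subword_tokens)  # ['token', 'ization']
```

Because unknown spans fall back to single characters, even a word the vocabulary has never seen still produces a token sequence, which is the property that makes subword schemes robust to rare words.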
Text Normalization
Normalization converts text to a consistent format for processing.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
text = "The running dogs were playing happily in the gardens"
tokens = word_tokenize(text.lower())
# Remove stop words and punctuation
cleaned = [t for t in tokens if t not in stop_words and t.isalpha()]
print("Cleaned tokens:", cleaned)
# Stemming
stems = [stemmer.stem(t) for t in cleaned]
print("Stemmed:", stems)
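# Lemmatization (lemmatize() treats every word as a noun by default,
# so verb forms such as "running" are returned unchanged)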
lemmas = [lemmatizer.lemmatize(t) for t in cleaned]
print("Lemmatized:", lemmas)
Expected output:
Cleaned tokens: ['running', 'dogs', 'playing', 'happily', 'gardens']
Stemmed: ['run', 'dog', 'play', 'happili', 'garden']
Lemmatized: ['running', 'dog', 'playing', 'happily', 'garden']
Text Representations
Raw text must be converted into numerical form before machine learning models can use it.
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [
    "NLP powers search engines and chatbots",
    "Chatbots use NLP to understand user queries",
    "Search engines index web pages for fast retrieval",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
# Display feature names and matrix
feature_names = vectorizer.get_feature_names_out()
print("Vocabulary size:", len(feature_names))
print("\nTop terms per document:")
for i, doc in enumerate(documents):
    row = tfidf_matrix[i].toarray().flatten()
    top_indices = row.argsort()[-3:][::-1]
    top_terms = [(feature_names[j], round(row[j], 3)) for j in top_indices]
    print(f" Document {i+1}: {top_terms}")
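Expected output (summarized, because terms with tied scores can print in any order):
Vocabulary size: 17
Top terms per document:
Document 1: 'powers' and 'and' at about 0.481, then one of the shared terms at about 0.366
Document 2: three of the five terms unique to this document, each at about 0.403
Document 3: three of the six terms unique to this document, each at about 0.374
Terms that occur in only one document score highest because their inverse document frequency is larger than that of terms shared across documents.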
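To make the weighting concrete, the sketch below recomputes TF-IDF by hand for the same three documents. It assumes scikit-learn's documented defaults: raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization. The helper names `tokenize` and `tfidf_vector` are hypothetical and exist only for this illustration.

```python
import math
import re

documents = [
    "NLP powers search engines and chatbots",
    "Chatbots use NLP to understand user queries",
    "Search engines index web pages for fast retrieval",
]


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring TfidfVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(documents)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}


def tfidf_vector(tokens):
    """Return a dict of L2-normalized TF-IDF weights for one document."""
    raw = {term: tokens.count(term) * idf[term] for term in set(tokens)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = [tfidf_vector(tokens) for tokens in tokenized]
print("Vocabulary size:", len(vocabulary))                # 17
print("Doc 1 'powers':", round(weights[0]["powers"], 3))  # 0.481
print("Doc 1 'search':", round(weights[0]["search"], 3))  # 0.366
```

'powers' appears in one document and 'search' in two, so 'powers' receives the larger idf and therefore the larger normalized weight.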
Word Embeddings with spaCy
Word embeddings capture semantic meaning by mapping words to dense vector spaces.
import spacy
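# Load a small English model. Note: en_core_web_sm ships without static word
# vectors, so similarity() falls back on context tensors and spaCy warns about it.
# For true word vectors, load en_core_web_md or en_core_web_lg instead.
nlp = spacy.load("en_core_web_sm")
word1 = nlp("king")
word2 = nlp("queen")
word3 = nlp("man")
word4 = nlp("woman")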
# Vector similarity
print(f"king vs queen: {word1.similarity(word2):.4f}")
print(f"king vs man: {word1.similarity(word3):.4f}")
print(f"man vs woman: {word3.similarity(word4):.4f}")
print(f"king vs woman: {word1.similarity(word4):.4f}")
# Vector size
print(f"\nEmbedding dimension: {word1.vector.shape}")
Expected output (illustrative; exact scores depend on the model and spaCy version):
king vs queen: 0.7254
king vs man: 0.4378
man vs woman: 0.6421
king vs woman: 0.3195
Embedding dimension: (96,)
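By default, spaCy's similarity() is a cosine similarity between vectors. The sketch below shows that computation in plain Python. The three-dimensional vectors in `toy_vectors` are hypothetical values invented for illustration, not real embeddings from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.3],
    "banana": [0.1, 0.0, 0.9],
}

king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
king_banana = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(f"king vs queen:  {king_queen:.4f}")
print(f"king vs banana: {king_banana:.4f}")
```

Vectors that point in similar directions score close to 1 and unrelated directions score close to 0, regardless of vector length.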
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Tokenization splits contractions wrong | Default tokenizer misses edge cases | Use a dedicated tokenizer such as spaCy's, or a GPT-style subword tokenizer |
| Stemming produces non-words | Aggressive suffix removal | Use lemmatization for readability-critical tasks |
| Out-of-vocabulary words in prediction | Vocabulary built on limited training data | Use subword tokenization (BPE, WordPiece) |
| Stop word removal deletes meaningful terms | Generic stop word list | Customize stop words for your domain |
| Similarity score is too low for synonyms | Embedding model lacks context | Use contextual embeddings (BERT, sentence-transformers) |
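The stop word row in the table can be illustrated with a minimal sketch. `GENERIC_STOP_WORDS` is a small hypothetical list (not the list shipped by any library), and `DOMAIN_KEEP` assumes a domain such as threat reports where negations change the meaning.

```python
# GENERIC_STOP_WORDS is a small hypothetical list, not a real library's list.
GENERIC_STOP_WORDS = {"the", "is", "a", "this", "not", "no"}
# Words assumed to carry meaning in this domain, so they must survive filtering.
DOMAIN_KEEP = {"not", "no"}
DOMAIN_STOP_WORDS = GENERIC_STOP_WORDS - DOMAIN_KEEP


def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop tokens in stop_words."""
    return [t for t in text.lower().split() if t not in stop_words]


sentence = "This file is not a threat"
print(remove_stop_words(sentence, GENERIC_STOP_WORDS))  # ['file', 'threat']
print(remove_stop_words(sentence, DOMAIN_STOP_WORDS))   # ['file', 'not', 'threat']
```

With the generic list the sentence appears to describe a threat; keeping "not" preserves the original meaning.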
Practice Questions
What is the difference between stemming and lemmatization? Stemming removes suffixes heuristically and may produce non-words; lemmatization uses vocabulary analysis to return dictionary base forms.
Why are stop words removed during text preprocessing? Stop words carry little semantic meaning and add noise, so removing them reduces dimensionality and improves model focus on meaningful terms.
What advantage do embeddings have over TF-IDF vectors? Embeddings capture semantic relationships between words (synonyms, analogies) whereas TF-IDF relies on exact term matching and frequency.
How does subword tokenization handle rare words? Subword tokenization splits rare words into smaller known subunits, so a rare word is represented as a sequence of known pieces instead of a single unknown token.
Challenge: Build a document similarity search tool that takes a query, vectorizes all documents using TF-IDF and sentence embeddings, and returns the top 3 most relevant documents with similarity scores using both methods.
Mini Project
Build a spam detector using NLP. Collect a dataset of SMS messages labeled as spam or ham, preprocess the text with tokenization and TF-IDF vectorization, train a logistic regression classifier, and evaluate precision and recall. Deploy the model as a Flask endpoint that accepts text and returns a spam probability score.
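As a starting point for the training step, here is a minimal sketch that assumes scikit-learn is installed. The six messages and their labels are hypothetical toy data, and `spam_probability` is a helper name invented for this example; a real project needs a labeled SMS dataset and a held-out test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages; replace with a real labeled SMS dataset.
messages = [
    "win a free prize now",
    "free cash prize claim today",
    "claim your free reward now",
    "lunch meeting at noon",
    "see you at the meeting tomorrow",
    "are we still on for lunch",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# Chain TF-IDF vectorization and logistic regression into one estimator.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(messages, labels)


def spam_probability(text):
    """Return the model's estimated probability that text is spam."""
    # Column 1 of predict_proba corresponds to label 1 (spam).
    return float(model.predict_proba([text])[0, 1])


print(f"free prize:    {spam_probability('free prize'):.3f}")
print(f"lunch meeting: {spam_probability('lunch meeting'):.3f}")
```

Six messages are far too few to evaluate anything. Once a real dataset is split into training and test sets, `precision_score` and `recall_score` from `sklearn.metrics` can measure the precision and recall the project asks for.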
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.