Natural Language Processing (NLP): Complete Beginner's Guide

DodaTech Updated Jun 20, 2026 9 min read

Natural Language Processing (NLP) is the branch of AI that enables computers to understand, interpret, and generate human language — powering everything from chatbots and translation to sentiment analysis and search engines.

What You’ll Learn

By the end of this tutorial, you’ll understand the complete NLP pipeline: tokenization, stemming, lemmatization, stop word removal, bag of words, TF-IDF, word embeddings (Word2Vec, GloVe), and transformers. You’ll build a small sentiment analysis model in Python using NLTK and scikit-learn.

Why It Matters

Every Google search, every chatbot interaction, every spell-check relies on NLP. It’s the technology that bridges human communication and machine understanding — and it’s embedded in almost every modern application.

Real-World Use

When you type “best Italian restaurants near me” into Google, NLP tokenizes your query, recognizes “Italian” as a cuisine type and “near me” as a location intent, then returns geo-filtered results — all in under a second.

The NLP Pipeline


flowchart LR
  A[Raw Text] --> B[Tokenization]
  B --> C[Cleaning]
  C --> D[Stemming/Lemmatization]
  D --> E[Stop Word Removal]
  E --> F[Vectorization]
  F --> G[Model]
  G --> H[Prediction]
  B --> I["'Hello world' → ['Hello', 'world']"]
  F --> J["'hello' → [0.1, 0.3, 0.8, ...]"]

Prerequisites: Python basics. Familiarity with Machine Learning concepts helps but isn’t required for the first half of this tutorial.

Tokenization: Splitting Text into Pieces

Tokenization is the first step in any NLP pipeline. It splits raw text into smaller units called tokens — typically words, subwords, or characters. Think of it like breaking a sentence into individual LEGO bricks before building something new.

import re

def simple_tokenize(text):
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

text = "NLP is amazing! Can computers really understand language?"
tokens = simple_tokenize(text)
print(f"Original: {text}")
print(f"Tokens: {tokens}")

Expected output:

Original: NLP is amazing! Can computers really understand language?
Tokens: ['nlp', 'is', 'amazing', 'can', 'computers', 'really', 'understand', 'language']

Why lowercase? Without it, “NLP”, “Nlp”, and “nlp” would be three different tokens. Lowercasing reduces vocabulary size and improves model generalization.
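To make the vocabulary effect concrete, here is a minimal sketch that counts distinct tokens with and without lowercasing. The sample sentence and the helper name vocabulary are made up for illustration.

```python
import re

def vocabulary(text, lowercase):
    """Return the set of distinct word tokens in text."""
    if lowercase:
        text = text.lower()
    return set(re.findall(r"\b\w+\b", text))

# Made-up sample in which the same word appears in three casings.
sample = "NLP is fun. Nlp is useful. I use nlp daily."

cased = vocabulary(sample, lowercase=False)
uncased = vocabulary(sample, lowercase=True)

print(len(cased), sorted(cased))      # 9 distinct tokens
print(len(uncased), sorted(uncased))  # 7 distinct tokens
```

The three spellings of “nlp” collapse into one entry, which is the reduction described above.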

Using NLTK for Tokenization

import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Dr. Smith visited OpenAI's office. He was impressed!"

# Sentence tokenization
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Word tokenization
words = word_tokenize(text)
print("Words:", words)

Expected output:

Sentences: ["Dr. Smith visited OpenAI's office.", 'He was impressed!']
Words: ['Dr.', 'Smith', 'visited', 'OpenAI', "'s", 'office', '.', 'He', 'was', 'impressed', '!']

Notice how sent_tokenize correctly handles “Dr.” as part of a sentence rather than splitting at the period. This is why you should use libraries instead of writing your own tokenizer.
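For contrast, here is a rough sketch of what a hand-written sentence splitter tends to do with the same text. The rule used (split after ".", "!" or "?" followed by whitespace) is an assumption chosen for illustration, not how NLTK works internally.

```python
import re

text = "Dr. Smith visited OpenAI's office. He was impressed!"

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

print(naive_sentences)
# ['Dr.', "Smith visited OpenAI's office.", 'He was impressed!']
```

The abbreviation “Dr.” is treated as a full sentence, which is exactly the kind of edge case a trained tokenizer handles for you.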

Stemming vs Lemmatization

Both techniques reduce words to their base form, but they work differently.

Stemming strips suffixes using crude rules. Lemmatization uses vocabulary and morphological analysis to return the dictionary form.

from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['running', 'runner', 'ran', 'better', 'studies']
for word in words:
    print(f"{word:12} → stem: {stemmer.stem(word):12} lemma: {lemmatizer.lemmatize(word)}")

Expected output:

running      → stem: run         lemma: running
runner       → stem: runner      lemma: runner
ran          → stem: ran         lemma: ran
better       → stem: better      lemma: better
studies      → stem: studi       lemma: study

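Key difference: stemming is fast but crude (it produces non-words like “studi”), while lemmatization is slower but produces real words. Note that WordNetLemmatizer treats every word as a noun unless you pass a part-of-speech tag, which is why “running”, “ran”, and “better” come back unchanged above; lemmatizer.lemmatize('running', pos='v') returns “run”. Classical models such as TF-IDF with Naive Bayes tend to benefit most from lemmatization, whereas deep learning models usually rely on subword tokenization instead.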
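The contrast between rule-based stripping and dictionary lookup can be sketched without any library. The suffix list and the mini-dictionary below are hypothetical stand-ins: this is not the Porter algorithm and not WordNet, only an illustration of the two ideas.

```python
# Toy suffix stripper: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ing", "ies", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real lexicon.
LEMMAS = {"studies": "study", "running": "run", "ran": "run", "better": "good"}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMAS.get(word, word)

for w in ["studies", "running", "ran", "better"]:
    print(f"{w:10} stem: {toy_stem(w):10} lemma: {toy_lemmatize(w)}")
```

The rule-based version yields fragments such as “stud” and “runn” and cannot connect “ran” to “run”, while the lookup version returns real words but only for entries it knows.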

Bag of Words and TF-IDF

Computers can’t understand words — they need numbers. Bag of Words (BoW) converts text into a vector by counting word occurrences. TF-IDF improves on BoW by weighting words by their importance across documents.
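Before moving to TF-IDF, here is a minimal Bag of Words sketch in plain Python. It assumes the simplest possible tokenizer (lowercase, split on whitespace) and uses two short example documents.

```python
from collections import Counter

documents = [
    "I love machine learning and NLP",
    "NLP is fascinating and useful",
]

# Tokenize very simply: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in documents]

# The vocabulary is every distinct word, in a fixed (sorted) order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# Each document becomes a vector of counts, one slot per vocabulary word.
bow_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow_vectors.append([counts[word] for word in vocab])

print(vocab)
for vec in bow_vectors:
    print(vec)
# ['and', 'fascinating', 'i', 'is', 'learning', 'love', 'machine', 'nlp', 'useful']
# [1, 0, 1, 0, 1, 1, 1, 1, 0]
# [1, 1, 0, 1, 0, 0, 0, 1, 1]
```

Word order is lost entirely; only the counts survive, which is where the name “bag” comes from.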

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I love machine learning and NLP",
    "NLP is fascinating and useful",
    "Machine learning builds intelligent systems"
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("\nTF-IDF matrix:")
print(X.toarray().round(3))

Expected output:

Vocabulary: ['and' 'builds' 'fascinating' 'intelligent' 'is' 'learning' 'love' 'machine' 'nlp' 'systems' 'useful']

TF-IDF matrix:
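[[0.418 0.    0.    0.    0.    0.418 0.549 0.418 0.418 0.    0.   ]
 [0.373 0.    0.49  0.    0.49  0.    0.    0.    0.373 0.    0.49 ]
 [0.    0.49  0.    0.49  0.    0.373 0.    0.373 0.    0.49  0.   ]]

TF-IDF gives higher weight to words that appear frequently in a document but rarely across the collection. In the first row, “love” (found in one document) scores 0.549, while “and” and “nlp” (found in two) score 0.418. Very common words like “the” and “and” get low weights because they appear almost everywhere and carry little meaning. Also note that “I” is missing from the vocabulary: TfidfVectorizer’s default token pattern keeps only tokens of two or more characters.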
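The weighting can also be worked out by hand. The sketch below assumes the smoothed IDF formula and L2 normalization that scikit-learn documents for its default settings, and it leaves out the single-character token “I”, as the default tokenizer does.

```python
import math

# The three example documents, pre-tokenized and lowercased ("I" omitted).
docs = [
    ["love", "machine", "learning", "and", "nlp"],
    ["nlp", "is", "fascinating", "and", "useful"],
    ["machine", "learning", "builds", "intelligent", "systems"],
]
n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})

# Document frequency: in how many documents does each word appear?
df = {w: sum(w in d for d in docs) for w in vocab}

# Smoothed IDF (assumed to match scikit-learn's default smooth_idf=True).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """Term count times IDF, then scaled so the row has unit L2 length."""
    raw = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

row = tfidf_row(docs[0])
for word, weight in zip(vocab, row):
    if weight > 0:
        print(f"{word:10} {weight:.3f}")
# and        0.418
# learning   0.418
# love       0.549
# machine    0.418
# nlp        0.418
```

“love” appears in only one document, so its IDF is larger and it ends up with the highest weight in the row.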

Word Embeddings: Word2Vec and GloVe

BoW and TF-IDF create sparse vectors that don’t capture meaning. Word embeddings are dense vectors where similar words have similar representations.

Word2Vec learns embeddings by predicting a word from its neighbors (CBOW) or neighbors from a word (Skip-gram). GloVe learns by analyzing word co-occurrence statistics across the entire corpus.
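To see what Skip-gram actually trains on, the sketch below generates (center, context) pairs from a sentence. The function name skipgram_pairs and the window size are illustrative choices; a real Word2Vec implementation adds subsampling, negative sampling, and a neural network on top of pairs like these.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = ["the", "cat", "sat", "down"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'),
#  ('sat', 'cat'), ('sat', 'down'), ('down', 'sat')]
```

CBOW uses the same windows but flips the task: the context words together are the input and the center word is the prediction target.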

# Simplified demonstration of embedding similarity.
# These 4-dimensional vectors are hand-picked toy values, not trained embeddings.

embeddings = {
    "king": [0.8, 0.2, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.7, 0.3],
    "man": [0.7, 0.3, 0.2, 0.1],
    "woman": [0.7, 0.2, 0.1, 0.3],
}

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    mag1 = sum(a * a for a in v1) ** 0.5
    mag2 = sum(b * b for b in v2) ** 0.5
    return dot / (mag1 * mag2)

print(f"king vs queen: {cosine_similarity(embeddings['king'], embeddings['queen']):.3f}")
print(f"king vs man:   {cosine_similarity(embeddings['king'], embeddings['man']):.3f}")
print(f"man vs woman:  {cosine_similarity(embeddings['man'], embeddings['woman']):.3f}")

Expected output:

king vs queen: 0.972
king vs man:   0.833
man vs woman:  0.952

The classic example: king - man + woman ≈ queen. Real embeddings capture relationships like this because they are trained on large corpora, often billions of words; the four-dimensional vectors above are hand-picked toy values for illustration only.
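The analogy can be checked on the same toy vectors. This is only a sketch: the vectors are hand-picked, so the result illustrates the arithmetic rather than proving anything about trained embeddings.

```python
# Same hand-picked toy vectors as in the similarity example.
embeddings = {
    "king": [0.8, 0.2, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.7, 0.3],
    "man": [0.7, 0.3, 0.2, 0.1],
    "woman": [0.7, 0.2, 0.1, 0.3],
}

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sum(a * a for a in v1) ** 0.5
    norm2 = sum(b * b for b in v2) ** 0.5
    return dot / (norm1 * norm2)

# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Rank every word by similarity to the target vector.
ranked = sorted(embeddings, key=lambda word: cosine(target, embeddings[word]), reverse=True)
for word in ranked:
    print(f"{word:6} {cosine(target, embeddings[word]):.3f}")
```

With these values “queen” ranks first. Libraries that implement this lookup usually exclude the three input words from the candidates, since one of them is often the nearest neighbor.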

Transformers: Modern NLP

Transformers (introduced in the 2017 paper “Attention Is All You Need”) revolutionized NLP by replacing sequential processing with self-attention. Instead of reading words one at a time, left to right, a transformer looks at all words in the input at once and weighs each word’s relevance to every other word.

Key components:

  • Self-attention: computes how much each word should “attend” to every other word
  • Multi-head attention: runs multiple attention mechanisms in parallel
  • Positional encoding: adds position information, since self-attention on its own ignores word order
  • Encoder-decoder structure: the encoder reads the input, the decoder generates the output

Pre-trained models like BERT (encoder only) and GPT (decoder only) are transformers trained on massive text corpora. You can fine-tune them for specific tasks with comparatively little labeled data.
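Self-attention is easier to follow with numbers. The sketch below implements scaled dot-product attention in plain Python under one big simplification: queries, keys, and values are all the raw input vectors, whereas a real transformer first multiplies the input by three learned weight matrices. The input values are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    outputs, weights = [], []
    for q in X:
        # Score this token against every token, scaled by sqrt(dimension).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all token vectors.
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, X)) for i in range(d)])
    return outputs, weights

# Three toy 2-dimensional token vectors (made-up values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1 and says how much one token draws from every other token. Nothing in the computation depends on token position, which is why positional encoding has to be added separately.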

Building a Sentiment Analyzer with NLTK

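Let’s build a small end-to-end sentiment classifier, using NLTK for stop words and scikit-learn for vectorization and classification. Six training reviews are far too few for a real model, but they keep the example readable: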

import nltk
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)

# Training data
reviews = [
    "This product is amazing and wonderful I love it",
    "Absolutely fantastic best purchase ever made",
    "Great quality works perfectly highly recommend",
    "Terrible product complete waste of money",
    "Horrible experience would never buy again",
    "Poor quality broke in one day very disappointed",
]
labels = [1, 1, 1, 0, 0, 0]

# Preprocess and vectorize
processed = [preprocess(r) for r in reviews]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed)

# Train model
model = MultinomialNB()
model.fit(X, labels)

# Test
test_reviews = [
    "This is the best thing I ever bought amazing quality",
    "Worst product ever avoid at all costs terrible",
]
test_processed = [preprocess(r) for r in test_reviews]
test_X = vectorizer.transform(test_processed)
predictions = model.predict(test_X)

for review, pred in zip(test_reviews, predictions):
    sentiment = "Positive" if pred else "Negative"
    print(f"'{review[:40]}...' → {sentiment}")

Expected output:

'This is the best thing I ever bought ama...' → Positive
'Worst product ever avoid at all costs te...' → Negative

Common NLP Errors

1. Not Removing Stop Words

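Words like “the”, “is”, “at” often add noise without signal for bag-of-words models, so filtering them early usually helps. Review the list first, though: standard stop lists often include negations such as “not”, which sentiment tasks need.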
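Stop word removal can also go wrong in the other direction. The sketch below uses a tiny hand-written stop list (hypothetical; real lists such as NLTK's English list are much longer and, as far as I know, also contain “not”) to show how dropping a negation changes what the model sees.

```python
# A tiny hand-written stop list (hypothetical; real lists are much longer).
STOP_WORDS = {"the", "is", "this", "a", "not"}

def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop every stop word."""
    return [t for t in text.lower().split() if t not in stop_words]

review = "This movie is not good"

# With "not" in the stop list, the review looks positive.
print(remove_stop_words(review, STOP_WORDS))            # ['movie', 'good']

# Keeping negations preserves the meaning.
print(remove_stop_words(review, STOP_WORDS - {"not"}))  # ['movie', 'not', 'good']
```

For sentiment tasks it is usually worth reviewing the stop list and removing negations from it before filtering.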

2. Case Sensitivity Mismatch

“Apple” (company) and “apple” (fruit) differ by case. Lowercase everything unless case is semantically meaningful.

3. Using the Wrong Tokenizer

Splitting on whitespace leaves punctuation glued to words (“amazing!” stays one token), and a naive \w+ regex splits “Don’t” into [“Don”, “t”] instead of [“Do”, “n’t”]. Use NLTK or spaCy tokenizers.
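A quick comparison of two do-it-yourself tokenizers shows both failure modes on one short input. The sample text is made up.

```python
import re

text = "Don't stop!"

# Whitespace splitting keeps punctuation attached to the word.
whitespace_tokens = text.split()

# A word-character regex drops punctuation but cuts the contraction in two.
regex_tokens = re.findall(r"\b\w+\b", text)

print(whitespace_tokens)  # ["Don't", 'stop!']
print(regex_tokens)       # ['Don', 't', 'stop']
```

Neither result matches the [“Do”, “n’t”] split that a linguistically informed tokenizer produces.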

4. Ignoring Out-of-Vocabulary Words

When your model meets a word it never saw in training, the word is usually dropped or mapped to a generic unknown token without any warning. Use subword tokenization (BPE, WordPiece) or pre-trained embeddings to reduce the problem.
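One simple way to make out-of-vocabulary words visible is to reserve an explicit unknown token and report which words hit it. The sketch below is a minimal illustration; the names UNK, build_vocab, and encode are made up for this example.

```python
UNK = "<unk>"

def build_vocab(sentences):
    """Map each training word to an integer id; id 0 is reserved for unknowns."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    vocab = {UNK: 0}
    for w in words:
        vocab[w] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Convert a sentence to ids, sending unseen words to the <unk> id."""
    return [vocab.get(w, vocab[UNK]) for w in sentence.lower().split()]

vocab = build_vocab(["great product", "terrible product"])
sentence = "great gizmo"
ids = encode(sentence, vocab)
unknown = [w for w in sentence.lower().split() if w not in vocab]

print(ids)      # [1, 0]
print(unknown)  # ['gizmo']
```

Logging the share of unknown tokens at inference time is a cheap way to notice when the input has drifted away from the training data.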

5. Forgetting About Word Ambiguity

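“Bank” can mean a financial institution or a river bank. Static embeddings give both senses the same vector, so models get confused. Contextual embeddings (such as BERT) address this by computing a different vector for each occurrence.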

6. Training on Imbalanced Data

If 95% of your reviews are positive, a model that always predicts “positive” gets 95% accuracy but is useless. Always check class distribution.
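The numbers are easy to verify. This sketch builds the 95/5 split described above and scores a “model” that always predicts positive; the labels are synthetic.

```python
# 95 positive reviews (label 1) and 5 negative reviews (label 0).
y_true = [1] * 95 + [0] * 5

# A "model" that ignores its input and always predicts positive.
y_pred = [1] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the negative class: how many real negatives did we catch?
true_neg = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
negative_recall = true_neg / y_true.count(0)

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.95
print(f"negative recall: {negative_recall:.2f}")  # negative recall: 0.00
```

Accuracy looks strong while the model never finds a single negative review, which is why per-class precision and recall are the better check on imbalanced data.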

Practice Questions

1. What’s the difference between stemming and lemmatization? Stemming strips suffixes with crude rules (fast, produces non-words like “studi”). Lemmatization returns dictionary words using vocabulary and morphological analysis (slower, more accurate).

2. Why do we convert text to numbers in NLP? Computers process numbers, not words. Vectorization converts text into numerical representations that machine learning models can process.

3. What problem do word embeddings solve that BoW doesn’t? BoW treats “good” and “excellent” as completely unrelated tokens. Embeddings represent them as similar vectors, capturing semantic meaning.

4. How do transformers differ from RNNs? Transformers process all words in parallel using self-attention. RNNs process sequentially. Transformers are faster to train and better at capturing long-range dependencies.

5. Challenge: Build a spam classifier Collect 50 spam and 50 non-spam emails. Use TF-IDF vectorization and a Naive Bayes classifier. What’s your precision and recall?

FAQ

What's the difference between NLP and NLU?
NLP is the broad field of processing text. NLU (Natural Language Understanding) is a subset focused on comprehension — understanding intent, meaning, and context.
Do I need to be good at linguistics for NLP?
No. Modern NLP relies on statistical patterns learned from data. Basic language understanding helps but isn’t required.
Is ChatGPT an NLP model?
Yes. ChatGPT is built on a large language model (LLM) that uses the transformer architecture, the approach behind most state-of-the-art NLP systems today.
Which NLP library should I start with?
Start with NLTK for learning fundamentals, then move to spaCy for production. For deep learning, use Hugging Face Transformers.
Can NLP handle multiple languages?
Yes. Multilingual BERT handles 100+ languages. Most NLP libraries support multiple languages with different accuracy levels.

Mini Project: Review Sentiment Dashboard

Build a tool that reads product reviews from a CSV file, analyzes sentiment for each review, and displays a pie chart of positive vs negative results. Security angle: The same technique powers Durga Antivirus Pro’s analysis of security reports — processing thousands of threat descriptions to classify severity automatically.

What’s Next

Before moving on, make sure you understand:

  • The NLP pipeline: tokenization → vectorization → modeling
  • The difference between stemming and lemmatization
  • How TF-IDF improves on Bag of Words
  • Why embeddings capture semantic meaning

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

What’s Next

Congratulations on completing this NLP tutorial! Here’s where to go from here:

  • Practice daily — Apply NLP to text you encounter every day
  • Build a project — Create a spam filter or sentiment analyzer for your own data
  • Explore related topics — Check out Computer Vision and Machine Learning tutorials

Remember: every expert was once a beginner. Keep coding!
