Natural Language Processing (NLP) — Beginner's Guide

DodaTech Updated Jun 6, 2026 8 min read

Natural Language Processing (NLP) is the branch of AI that enables computers to understand, interpret, and generate human language — powering chatbots, translation, and sentiment analysis.

What You’ll Learn

In this tutorial, you’ll learn how computers process text using tokenization, embeddings, and bag-of-words, then build a simple sentiment analysis model in Python.

Why It Matters

Every time you use Google Search, chat with a customer service bot, or use autocorrect, NLP is working. It’s how machines bridge the gap between human language and computer understanding.

Real-World Use

When you type “weather in Tokyo” into Google, NLP breaks your query into words, understands “weather” is the topic and “Tokyo” is the location, then returns relevant results — all in milliseconds.

flowchart LR
  A[Raw Text] --> B[Tokenization]
  B --> C[Cleaning]
  C --> D[Vectorization]
  D --> E[Model]
  E --> F[Prediction]
  B --> G["'Hello world' -> ['Hello', 'world']"]
  D --> H["'hello' -> [0.1, 0.3, 0.8]"]

What Is NLP?

Imagine you meet someone who speaks only Chinese, and you speak only English. You can’t communicate because there’s no shared representation. NLP bridges this gap — but between humans and computers.

Computers don’t understand words. They understand numbers. So NLP is about converting text into numbers that computers can process, while preserving the meaning.

The NLP Pipeline

Most NLP projects follow roughly these steps:

Raw text — collect the text data
Tokenization — split text into pieces (words, sentences)
Cleaning — remove irrelevant characters, punctuation, stop words
Vectorization — convert tokens to numbers
Modeling — train or use a model
Prediction — get results
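The six steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the stop-word list, the helper names (tokenize, clean, vectorize, predict), the two-word vocabulary, and the count-comparing "model" are all made up for this example.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "is", "a", "and", "this"}

def tokenize(text):
    # Step 2: split lowercase text into word tokens
    return re.findall(r"\w+", text.lower())

def clean(tokens):
    # Step 3: drop stop words
    return [t for t in tokens if t not in STOP_WORDS]

def vectorize(tokens, vocabulary):
    # Step 4: count how often each vocabulary word occurs
    return [tokens.count(word) for word in vocabulary]

def predict(vector):
    # Steps 5-6: a stand-in "model" that answers positive only when
    # the first vocabulary word outnumbers the second
    good, bad = vector
    return "positive" if good > bad else "negative"

vocabulary = ["great", "awful"]  # hypothetical two-word vocabulary
raw_text = "This movie is great, and the cast is great!"  # Step 1: raw text

tokens = tokenize(raw_text)
cleaned = clean(tokens)
vector = vectorize(cleaned, vocabulary)

print(cleaned)          # ['movie', 'great', 'cast', 'great']
print(vector)           # [2, 0]
print(predict(vector))  # positive
```

A real project would swap the last function for a trained model, as the sentiment example later in this guide does.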

Tokenization: Breaking Text into Pieces

Tokenization is the first step. It splits text into tokens — usually words or subwords.

import re

def simple_tokenize(text):
    # Lowercase the text, then keep each run of word characters (punctuation is dropped)
    tokens = re.findall(r'\b\w+\b', text.lower())
    return tokens

text = "NLP is amazing! Can computers really understand language?"
tokens = simple_tokenize(text)
print(f"Original: {text}")
print(f"Tokens: {tokens}")

Expected output:

Original: NLP is amazing! Can computers really understand language?
Tokens: ['nlp', 'is', 'amazing', 'can', 'computers', 'really', 'understand', 'language']

What happened here? We converted the text into a list of lowercase words. The exclamation mark and question mark are gone because \b\w+\b only matches word characters. This makes it easier for the computer to process.

Why lowercase?

Without lowercase, “NLP”, “Nlp”, and “nlp” would be three different tokens. Lowercasing treats them as the same word, reducing complexity.
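To see the effect on vocabulary size, here is a small sketch that counts unique tokens with and without lowercasing. The sample sentence is made up for illustration.

```python
import re

text = "NLP is fun. Nlp is everywhere. I study nlp."

# Same tokenizer, applied to the original and the lowercased text
cased = re.findall(r"\w+", text)
lowered = re.findall(r"\w+", text.lower())

print(sorted(set(cased)))
print(sorted(set(lowered)))
print(len(set(cased)), "unique tokens ->", len(set(lowered)), "after lowercasing")
```

Here the vocabulary shrinks from 8 unique tokens to 6, because the three spellings of "nlp" collapse into one.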

Bag of Words: The Simplest Vectorization

Bag of Words (BoW) converts text into a vector by counting word occurrences. It ignores word order — like a “bag” of words thrown together.

from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love machine learning",
    "I love deep learning",
    "Machine learning is amazing"
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bag of Words matrix:")
print(X.toarray())

Expected output:

Vocabulary: ['amazing' 'deep' 'is' 'learning' 'love' 'machine']
Bag of Words matrix:
[[0 0 0 1 1 1]
 [0 1 0 1 1 0]
 [1 0 1 1 0 1]]

Reading the matrix:

Row 1: “I love machine learning” → one ‘learning’, one ‘love’, one ‘machine’
Row 2: “I love deep learning” → one ‘deep’, one ‘learning’, one ‘love’
Row 3: “Machine learning is amazing” → one ‘amazing’, one ‘is’, one ‘learning’, one ‘machine’

Each column is a word from the vocabulary. Each row is a document. The number shows how many times that word appears. The word “I” is missing from the vocabulary because CountVectorizer’s default tokenizer only keeps tokens of two or more characters.
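To make the "ignores word order" point concrete, here is a from-scratch sketch of Bag of Words. The bag_of_words helper is written for this example and is not part of scikit-learn; it splits on whitespace, so unlike CountVectorizer it keeps one-letter words.

```python
def bag_of_words(documents):
    # Build a sorted vocabulary from every lowercase word in the corpus
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    # One row per document, one column per vocabulary word
    matrix = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(["dog bites man", "man bites dog"])
print(vocabulary)  # ['bites', 'dog', 'man']
print(matrix)      # [[1, 1, 1], [1, 1, 1]]
```

Both sentences produce the same row even though they mean very different things. That is the price of throwing away word order.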

Word Embeddings: Capturing Meaning

Bag of Words can’t capture meaning. “Good” and “excellent” are different tokens even though they’re similar in meaning. Embeddings solve this.

Word embeddings are dense vectors where similar words have similar vectors. They’re usually learned from very large text collections, often billions of words.

# Simplified concept: what embeddings look like
embeddings = {
    "king": [0.8, 0.2, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.7, 0.3],
    "man": [0.7, 0.3, 0.2, 0.1],
    "woman": [0.7, 0.2, 0.1, 0.3],
}

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    mag1 = sum(a * a for a in v1) ** 0.5
    mag2 = sum(b * b for b in v2) ** 0.5
    return dot / (mag1 * mag2)

print(f"king vs queen: {cosine_similarity(embeddings['king'], embeddings['queen']):.3f}")
print(f"king vs man: {cosine_similarity(embeddings['king'], embeddings['man']):.3f}")

Expected output:

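king vs queen: 0.972
king vs man: 0.833

What’s happening? In this hand-made example, king and queen are more similar (0.972) than king and man (0.833) because their vectors point in nearly the same direction, which mirrors what the words share semantically: both are royalty. Real embeddings learn this geometry from data, and that is what lets models capture relationships like “king - man + woman ≈ queen.”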
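The analogy “king - man + woman ≈ queen” can be checked directly on the toy vectors from this section. This is only a sketch: the four vectors are hand-picked, so it shows how the arithmetic works rather than proving anything about real embeddings.

```python
# Same toy vectors and similarity function as in the example above
embeddings = {
    "king": [0.8, 0.2, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.7, 0.3],
    "man": [0.7, 0.3, 0.2, 0.1],
    "woman": [0.7, 0.2, 0.1, 0.3],
}

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    mag1 = sum(a * a for a in v1) ** 0.5
    mag2 = sum(b * b for b in v2) ** 0.5
    return dot / (mag1 * mag2)

# king - man + woman, computed one dimension at a time
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Find the word whose vector is closest in direction to the result
nearest = max(embeddings, key=lambda word: cosine_similarity(target, embeddings[word]))

for word, vector in embeddings.items():
    print(f"{word}: {cosine_similarity(target, vector):.3f}")
print("Nearest word:", nearest)
```

With these vectors the nearest word is queen, ahead of king, man, and woman.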

Sentiment Analysis: Real NLP in Action

Let’s build a simple sentiment classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training data: reviews and their sentiment (1=positive, 0=negative)
reviews = [
    "This product is amazing and wonderful",
    "I love this so much it is great",
    "Terrible product complete waste of money",
    "Horrible experience would not recommend",
    "Absolutely fantastic best purchase ever",
    "Poor quality broke in one day",
]
labels = [1, 1, 0, 0, 1, 0]

# Convert text to TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

model = MultinomialNB()
model.fit(X, labels)

# Test new reviews
test_reviews = [
    "This is the best thing I ever bought",
    "Terrible quality would not recommend",
]
# transform() reuses the training vocabulary; words it has never seen are ignored
test_X = vectorizer.transform(test_reviews)
predictions = model.predict(test_X)

for review, pred in zip(test_reviews, predictions):
    sentiment = "Positive" if pred else "Negative"
    print(f"'{review}' -> {sentiment}")

Expected output:

'This is the best thing I ever bought' -> Positive
'Terrible quality would not recommend' -> Negative

How it works:

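TfidfVectorizer converts text to numerical features, giving more weight to words that are frequent in one review but rare across the others
MultinomialNB (Naive Bayes) learns which words are associated with positive vs negative sentiment
The model can then classify new reviews based on the words they share with the training data; words it has never seen contribute nothing to the prediction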
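To see what "weighting words" means in practice, here is a hand-rolled sketch of the IDF part of TF-IDF. It assumes the smoothed formula that scikit-learn's documentation gives for its default settings, ln((1 + n) / (1 + df)) + 1, and it skips the final length normalization that TfidfVectorizer applies. The three sample documents and the idf and tfidf helpers are made up for this example.

```python
import math

docs = [
    "this product is great",
    "this product is terrible",
    "this phone is great value",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1,
    # where df is the number of documents that contain the term
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf(term, tokens):
    # Term frequency (raw count in this document) times IDF
    return tokens.count(term) * idf(term)

for term in ["this", "great", "terrible"]:
    print(f"{term}: idf = {idf(term):.3f}")
# this: idf = 1.000
# great: idf = 1.288
# terrible: idf = 1.693
```

"this" appears in every document, so it gets the lowest weight. "terrible" appears in only one, so it stands out the most.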

Security Applications of NLP

Phishing email detection — NLP models analyze email text for suspicious patterns: urgent language, mismatched URLs, unusual sender addresses.

Threat intelligence — NLP processes security reports and dark web forums to identify emerging threats and vulnerabilities.

Malware analysis — NLP techniques analyze API call sequences in malware samples, treating them like “sentences” of system calls.

Content filtering — Social media platforms use NLP to detect hate speech, harassment, and harmful content automatically.
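The malware analysis idea above, treating API calls like words in a sentence, can be sketched with bigram counts. The traces and call names below are hypothetical and chosen only for illustration; a real system would use recorded traces and a trained model rather than a set difference.

```python
from collections import Counter

# Hypothetical API call traces; the call names are illustrative only
benign_trace = ["OpenFile", "ReadFile", "CloseFile", "OpenFile", "ReadFile", "CloseFile"]
suspicious_trace = ["OpenFile", "ReadFile", "EncryptData", "WriteFile", "DeleteFile"]

def bigrams(calls):
    # Pair each call with the one that follows it, like word pairs in a sentence
    return Counter(zip(calls, calls[1:]))

benign = bigrams(benign_trace)
suspicious = bigrams(suspicious_trace)

# Call pairs that appear only in the suspicious trace
unusual = sorted(set(suspicious) - set(benign))
for pair in unusual:
    print(" -> ".join(pair))
```

The bigram counts play the same role as the Bag of Words matrix earlier: they turn a sequence into numbers that a classifier can learn from.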

Common Mistakes Beginners Make

1. Not removing stop words

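Words like “the”, “is”, “at” usually add noise without adding meaning, so filter them out during cleaning. Check the list first, though: in sentiment analysis a word like “not” changes the meaning and should stay.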

2. Ignoring case sensitivity

“Apple” (company) and “apple” (fruit) are different. Lowercase everything unless case matters for your task.

3. Using the wrong tokenization

Splitting on whitespace alone leaves punctuation attached to words, so “amazing!” and “amazing” become different tokens. Libraries like NLTK or spaCy handle these cases for you.

4. Not handling out-of-vocabulary words

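When your model sees a word it wasn’t trained on, it has no representation for it; a Bag of Words model silently ignores the word. Subword tokenization, or embeddings built from subwords, lets the model assemble unseen words from familiar pieces.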

5. Forgetting about context

“Bank” can mean a financial institution or a riverbank. Without context, models get confused.
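Mistakes 3 and 4 are easy to reproduce. The sketch below uses a made-up sentence and a hypothetical three-word training vocabulary to show punctuation sticking to tokens and an unseen word disappearing from the vector.

```python
import re

text = "Great phone! Great price."

# Mistake 3: whitespace splitting keeps punctuation glued to words
print(text.lower().split())              # ['great', 'phone!', 'great', 'price.']
# Regex tokenization separates the words from the punctuation
print(re.findall(r"\w+", text.lower()))  # ['great', 'phone', 'great', 'price']

# Mistake 4: a word missing from the training vocabulary has no column to count in
vocabulary = ["great", "phone", "price"]  # hypothetical training vocabulary
new_tokens = re.findall(r"\w+", "Superb phone".lower())
vector = [new_tokens.count(word) for word in vocabulary]
unknown = [token for token in new_tokens if token not in vocabulary]

print(vector)   # [0, 1, 0]
print(unknown)  # ['superb']
```

The vector records "phone" but carries no trace of "superb", which is the word that actually expresses the opinion.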

Practice Questions

What is tokenization? The process of splitting text into smaller pieces (tokens) like words, subwords, or characters.
What’s the difference between Bag of Words and word embeddings? BoW counts word occurrences (sparse, ignores meaning). Embeddings are dense vectors that capture semantic relationships.
Why do we lowercase text in NLP? To reduce vocabulary size and treat “Hello” and “hello” as the same word.
What’s TF-IDF and why use it over simple counts? TF-IDF weights words by how important they are to a document, down-weighting common words like “the” that appear everywhere.
Give an example of an NLP security application. Phishing email detection — NLP analyzes email text for suspicious patterns.

Challenge

Collect 20 tweets about a popular product (10 positive, 10 negative). Build a sentiment classifier. How well does it generalize to new tweets?

Real-World Task

Use Python’s textblob library to analyze the sentiment of news headlines about a topic you care about. Are most headlines positive, negative, or neutral?

FAQ

Do I need to be good at linguistics for NLP?

No. Basic understanding of language structure helps, but modern NLP relies on statistical patterns learned from data, not linguistic rules.

What’s the difference between NLP and NLU?

NLP is the broad field (processing text). NLU (Natural Language Understanding) is a subset focused on comprehension — understanding intent and meaning.

Is ChatGPT an NLP model?

Yes. ChatGPT is built on a large language model (LLM) that uses the transformer architecture, which underlies most state-of-the-art NLP systems today.

Can NLP handle multiple languages?

Yes. Models like multilingual BERT can process 100+ languages. Most NLP libraries support multiple languages.

What’s a stop word?

Common words like “the”, “and”, “is” that carry little meaning and are often removed during preprocessing to reduce noise.

Mini Project: Review Analyzer

Build a tool that reads product reviews from a file and classifies them as positive or negative. Add a simple visualization showing the positive/negative ratio.

Security angle: This same technique powers Durga Antivirus Pro’s analysis of security reports — processing thousands of threat descriptions to classify their severity automatically.