Apache Cassandra Guide — NoSQL Distributed Database

DodaTech Updated Jun 7, 2026 11 min read

Apache Cassandra is a highly scalable, distributed NoSQL database designed for handling massive amounts of data across commodity servers with no single point of failure, offering tunable consistency and linear scalability.

What You’ll Learn

By the end of this tutorial, you’ll understand Cassandra’s distributed architecture, write CQL queries with partition keys and clustering columns, configure consistency levels for reads and writes, design data models for scale, and manage data replication across clusters.

Why Cassandra Matters

Cassandra powers some of the world’s largest data systems — Netflix, Apple, Instagram, and Uber rely on it for high-availability, high-throughput workloads. Doda Browser uses Cassandra for session management across millions of users, while Durga Antivirus Pro leverages it for real-time threat intelligence ingestion from global sensors. Learning Cassandra gives you a skill critical for big data and real-time systems.

Cassandra Learning Path

    flowchart LR
  A[SQL Basics] --> B[MongoDB]
  B --> C[Cassandra]
  C --> D[Redis]
  D --> E[Elasticsearch]
  E --> F[Database Design]
  C --> G{You Are Here}
  style G fill:#f90,color:#fff
  
Prerequisites: Familiarity with SQL concepts helps since CQL (Cassandra Query Language) looks similar. Understanding of distributed systems basics (nodes, replication) is beneficial but not required.

What Is Cassandra? (The “Why” First)

Imagine you’re building a global application with users in every country. A single SQL database on one server won’t work — if that server goes down, your entire app goes down. And one server can’t handle millions of concurrent writes. Cassandra solves this with a ring of nodes — every node is identical, there’s no master, and data is distributed and replicated automatically. Add more nodes and your capacity grows linearly.

Cassandra vs Traditional SQL

Feature          | SQL Database        | Cassandra
Architecture     | Master-slave        | Masterless (ring)
Query language   | SQL                 | CQL (SQL-like)
Joins            | Yes                 | No (denormalized)
ACID             | Yes                 | Eventually consistent
Scalability      | Vertical (scale up) | Horizontal (scale out)
Write throughput | Limited by master   | Linear with nodes

Cassandra Architecture

    flowchart TB
    subgraph DataCenter
        subgraph Rack1
            N1[Node 1]
            N2[Node 2]
        end
        subgraph Rack2
            N3[Node 3]
            N4[Node 4]
        end
    end
    Client1[App Client] --> N1
    Client2[App Client] --> N3
    N1 --- N2
    N1 --- N3
    N1 --- N4
    N2 --- N3
    N2 --- N4
    N3 --- N4
    subgraph Keyspace
        T1[Table: users]
        T2[Table: orders]
    end
    N1 --> T1
    N1 --> T2
    N2 --> T1
    N2 --> T2
    N3 --> T1
    N3 --> T2
  

Every node in a Cassandra cluster is equal. Data is automatically partitioned across nodes using consistent hashing, and each piece of data is replicated to multiple nodes based on the replication factor.
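To make the idea of consistent hashing with a replication factor concrete, here is a minimal Python sketch. It is an illustration only, not Cassandra's implementation: the `Ring` class, the `token` helper, and the use of MD5 are assumptions made for this example (Cassandra's default partitioner uses Murmur3), and the node names are hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Hash a string to a 64-bit ring position (MD5 is used only for illustration)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    """Toy token ring: each node owns several tokens (virtual nodes)."""

    def __init__(self, nodes, vnodes=8):
        self._ring = sorted((token(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._tokens = [t for t, _ in self._ring]
        self._node_count = len(set(nodes))

    def replicas(self, partition_key: str, replication_factor: int):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        rf = min(replication_factor, self._node_count)
        # First ring token that is >= the key's token owns the key.
        start = bisect.bisect_left(self._tokens, token(partition_key))
        owners = []
        step = 0
        while len(owners) < rf:
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            step += 1
        return owners


ring = Ring(["node1", "node2", "node3", "node4"])
for key in ("alice", "bob", "carol"):
    print(key, "->", ring.replicas(key, replication_factor=3))
```

Each key maps to three distinct nodes, and the same key always maps to the same nodes, which is what lets any node act as coordinator and route a request without a central directory.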

Getting Started with CQL

CQL (Cassandra Query Language) looks like SQL but has important differences:

-- Create a keyspace (similar to a database in SQL)
CREATE KEYSPACE shop
WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 3
};

-- Use the keyspace
USE shop;

-- Create a table
CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    first_name TEXT,
    last_name TEXT,
    email TEXT,
    created_at TIMESTAMP
);

-- Insert data
INSERT INTO users (user_id, first_name, last_name, email, created_at)
VALUES (uuid(), 'Alice', 'Johnson', 'alice@example.com', toTimestamp(now()));

INSERT INTO users (user_id, first_name, last_name, email, created_at)
VALUES (uuid(), 'Bob', 'Smith', 'bob@example.com', toTimestamp(now()));

Partition Keys and Clustering Columns

The most important concept in Cassandra is the primary key design — it determines how data is distributed and ordered:

-- Composite partition key: orders are partitioned by (year, month)
-- and clustered by order_id, then created_at, within each partition
CREATE TABLE orders_by_month (
    year INT,
    month INT,
    order_id UUID,
    created_at TIMESTAMP,  -- must be declared because it is part of the primary key
    customer_id UUID,
    total_amount DECIMAL,
    status TEXT,
    PRIMARY KEY ((year, month), order_id, created_at)
) WITH CLUSTERING ORDER BY (order_id DESC, created_at DESC);

-- Insert an order (every primary key column, including created_at, is required)
INSERT INTO orders_by_month (year, month, order_id, created_at, customer_id, total_amount, status)
VALUES (2026, 6, uuid(), toTimestamp(now()), uuid(), 149.99, 'PENDING');

-- Query all orders from June 2026 (efficient: hits ONE partition)
SELECT * FROM orders_by_month WHERE year = 2026 AND month = 6;

-- Example output (generated UUIDs and timestamps will differ):
-- year | month | order_id                             | created_at                      | customer_id                          | status  | total_amount
-- 2026 | 6     | 550e8400-e29b-41d4-a716-446655440000 | 2026-06-07 10:00:00.000000+0000 | 660e8400-e29b-41d4-a716-446655440001 | PENDING | 149.99

Key Rules for Primary Keys:

  • The partition key (year, month) determines which nodes store the data
  • All rows with the same partition key are stored together on the same replica nodes
  • Clustering columns (order_id, created_at) sort rows within a partition
  • You MUST include the partition key in WHERE clauses for efficient queries
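The rules above can be modelled in a few lines of Python. This is only a sketch under simplifying assumptions: a dictionary stands in for the cluster, the `insert`, `select`, and `scan_by_status` helpers are hypothetical, and the order IDs are plain strings rather than UUIDs.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for orders_by_month: the partition key picks
# the bucket, the clustering columns decide row order inside that bucket.
partitions = defaultdict(list)


def insert(year, month, order_id, created_at, status):
    rows = partitions[(year, month)]
    rows.append((order_id, created_at, status))
    # Mirrors CLUSTERING ORDER BY (order_id DESC, created_at DESC).
    rows.sort(key=lambda row: (row[0], row[1]), reverse=True)


def select(year, month):
    """Like WHERE year = ? AND month = ?: a single bucket lookup."""
    return partitions.get((year, month), [])


def scan_by_status(status):
    """Like filtering without the partition key: every bucket is visited."""
    return [row for rows in partitions.values() for row in rows if row[2] == status]


insert(2026, 6, "order-001", "2026-06-01T09:00", "PENDING")
insert(2026, 6, "order-003", "2026-06-03T12:30", "SHIPPED")
insert(2026, 5, "order-002", "2026-05-20T08:15", "PENDING")

print(select(2026, 6))
print(scan_by_status("PENDING"))
```

`select` touches one bucket no matter how much data exists, while `scan_by_status` grows with the total number of partitions, which is the cost difference the fourth rule is warning about.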

Why This Matters

In SQL databases, you can query any column efficiently with proper indexes. In Cassandra, you design your tables around your query patterns — not the other way around. If you need to query orders by customer, create a table with customer_id as the partition key. If you also need to query by date, create a separate table.

Consistency Levels

Cassandra’s tunable consistency lets you trade latency and availability against consistency, per request:

-- CQL 3 has no USING CONSISTENCY clause. In cqlsh, CONSISTENCY is a session
-- command that applies to the statements after it; drivers set the level per statement.

-- Strong consistency: all replicas must respond
CONSISTENCY ALL;
INSERT INTO users (...) VALUES (...);
-- Slower but guarantees all replicas are updated

-- Eventual consistency: just one replica
CONSISTENCY ONE;
INSERT INTO users (...) VALUES (...);
-- Fast but reads may see stale data

-- Quorum consistency: majority of replicas
CONSISTENCY QUORUM;
INSERT INTO users (...) VALUES (...);
-- Balance of speed and consistency

-- Read with quorum (the QUORUM level set above still applies)
SELECT * FROM users WHERE user_id = ?;

Consistency Level Trade-offs

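Level        | Writes          | Reads           | Failure Tolerance
ONE          | Fastest         | Fastest         | All but one replica can be down
QUORUM       | Balanced        | Balanced        | A minority of replicas can be down
ALL          | Slowest         | Slowest         | None (every replica must respond)
LOCAL_QUORUM | Good (multi-DC) | Good (multi-DC) | A minority of replicas in the local DC can be down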
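The arithmetic behind these trade-offs is small enough to sketch in Python. The helper names below are hypothetical, and only the three single-data-center levels are modelled. The overlap rule used here (read replicas plus write replicas must exceed the replication factor) is the standard condition for a read to see the latest acknowledged write.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a strict majority of the replicas."""
    return replication_factor // 2 + 1


def required_replicas(level: str, replication_factor: int) -> int:
    """How many replicas must respond for the given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(replication_factor)
    if level == "ALL":
        return replication_factor
    raise ValueError(f"level not modelled in this sketch: {level}")


def reads_see_latest_write(write_level: str, read_level: str, rf: int) -> bool:
    """Read and write replica sets are guaranteed to overlap when R + W > RF."""
    return required_replicas(read_level, rf) + required_replicas(write_level, rf) > rf


RF = 3
for level in ("ONE", "QUORUM", "ALL"):
    needed = required_replicas(level, RF)
    print(f"{level}: needs {needed} of {RF} replicas, tolerates {RF - needed} down")

print(reads_see_latest_write("QUORUM", "QUORUM", RF))  # True  (2 + 2 > 3)
print(reads_see_latest_write("ONE", "ONE", RF))        # False (1 + 1 <= 3)
```

With a replication factor of 3, QUORUM for both reads and writes gives overlapping replica sets while still tolerating one replica being down, which is why it is a common default choice.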

Data Replication

-- Create a keyspace with NetworkTopologyStrategy (production)
CREATE KEYSPACE production
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
};

-- Check replication status
DESCRIBE KEYSPACE production;

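Cassandra replicates data across data centers automatically. With NetworkTopologyStrategy, each data center has its own replication factor, and the data center names in the keyspace definition must match the names the cluster’s snitch reports. Because each data center holds a full set of replicas, clients that use a local consistency level such as LOCAL_QUORUM can keep reading and writing in eu-west even if us-east fails.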
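A short Python sketch shows why a per-data-center replication factor isolates failures when requests use LOCAL_QUORUM. The replication map is copied from the keyspace definition above; the `replicas_up` outage scenario and the helper names are hypothetical.

```python
# Per-data-center replication factors, as in the production keyspace above.
replication = {"us-east": 3, "eu-west": 3}


def local_quorum(dc: str) -> int:
    """Majority of the replicas inside a single data center."""
    return replication[dc] // 2 + 1


def can_serve_local_quorum(dc: str, replicas_up: dict) -> bool:
    """True when enough replicas in `dc` are alive for LOCAL_QUORUM."""
    return replicas_up.get(dc, 0) >= local_quorum(dc)


# Hypothetical outage: two of the three us-east replicas for a partition are down.
replicas_up = {"us-east": 1, "eu-west": 3}
for dc in replication:
    print(dc, can_serve_local_quorum(dc, replicas_up))
```

In this scenario us-east cannot reach a local quorum, but eu-west is unaffected because its quorum is computed only from its own replicas.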

Query Examples with Expected Output

Time-Series Data

-- Create a table for sensor readings (like Durga Antivirus Pro threat events)
CREATE TABLE sensor_readings (
    sensor_id TEXT,
    reading_time TIMESTAMP,
    temperature FLOAT,
    humidity FLOAT,
    pressure FLOAT,
    PRIMARY KEY (sensor_id, reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- Insert readings
INSERT INTO sensor_readings (sensor_id, reading_time, temperature, humidity, pressure)
VALUES ('sensor-01', '2026-06-07 10:00:00', 22.5, 45.0, 1013.2);
INSERT INTO sensor_readings (sensor_id, reading_time, temperature, humidity, pressure)
VALUES ('sensor-01', '2026-06-07 10:05:00', 22.7, 44.8, 1013.1);
INSERT INTO sensor_readings (sensor_id, reading_time, temperature, humidity, pressure)
VALUES ('sensor-01', '2026-06-07 10:10:00', 22.8, 44.5, 1013.0);

-- Get latest 10 readings for sensor-01
SELECT * FROM sensor_readings
WHERE sensor_id = 'sensor-01'
LIMIT 10;

-- Output (reverse order due to CLUSTERING ORDER BY DESC):
-- sensor_id  | reading_time                | temperature | humidity | pressure
-- sensor-01  | 2026-06-07 10:10:00         | 22.8        | 44.5     | 1013.0
-- sensor-01  | 2026-06-07 10:05:00         | 22.7        | 44.8     | 1013.1
-- sensor-01  | 2026-06-07 10:00:00         | 22.5        | 45.0     | 1013.2

Using ALLOW FILTERING (Use Sparingly)

-- Query without partition key (slower — scans all nodes)
SELECT * FROM sensor_readings
WHERE temperature > 23.0
ALLOW FILTERING;

Warning: ALLOW FILTERING should be avoided in production. It forces Cassandra to scan all partitions. Create a separate table designed for this query pattern instead.

Common Cassandra Errors

1. Cannot change the primary key of an existing table

Adding a regular column is a cheap metadata change, but a table’s primary key (its partition key and clustering columns) cannot be altered after creation. Changing it means creating a new table and copying the data over. Fix: Design your primary key carefully upfront. To add a new column: ALTER TABLE users ADD phone TEXT;

2. InvalidQueryException: Cannot execute this query as it might involve data filtering

-- WRONG
SELECT * FROM orders_by_month WHERE total_amount > 100;
-- Error: Cannot execute this query as it might involve data filtering

-- RIGHT
SELECT * FROM orders_by_month WHERE year = 2026 AND month = 6;

Fix: Always query by the partition key. Use ALLOW FILTERING only for one-off analytical queries.

3. ReadTimeout or WriteTimeout

The coordinator node timed out waiting for replicas. Fix: Increase timeout settings, add more nodes, reduce consistency level, or optimize queries to hit fewer partitions.

4. Hinted handoff errors

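When a node is down, other nodes store hints to replay later. Hints are only collected for max_hint_window_in_ms (default 3 hours); writes that arrive after the node has been down longer than that are not hinted, so the node will be missing data when it returns. Fix: Monitor node health with nodetool status, bring nodes back within the hint window, and run nodetool repair on any node that was down for longer.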

5. Tombstone overload

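Deletes in Cassandra don’t immediately remove data; they write tombstones (deletion markers) that reads must skip over until compaction purges them, which by default happens no sooner than gc_grace_seconds (10 days) after the delete. Too many tombstones slow down reads. Fix: Avoid delete-heavy access patterns, use TTL (time-to-live) for data with a natural lifetime so expiry is automatic, and make sure compaction keeps up. Note that expired TTL data is also treated as tombstones until it is compacted away.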

-- Use TTL for automatic cleanup
INSERT INTO sessions (session_id, user_id, data)
VALUES ('abc123', 'user1', '{"ip":"192.168.1.1"}')
USING TTL 86400;  -- Auto-delete after 24 hours
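To see why tombstones hurt reads, consider this small Python model. It is a sketch, not Cassandra's read path: the `read_live_rows` function and the queue-like partition are hypothetical, and the threshold constant mirrors the default `tombstone_warn_threshold` setting.

```python
TOMBSTONE_WARN_THRESHOLD = 1_000  # default tombstone_warn_threshold in cassandra.yaml


def read_live_rows(partition, limit):
    """Return up to `limit` live rows and the number of tombstones skipped."""
    live, tombstones_scanned = [], 0
    for row_id, deleted in partition:
        if deleted:
            tombstones_scanned += 1  # deletion markers still have to be read
            continue
        live.append(row_id)
        if len(live) == limit:
            break
    return live, tombstones_scanned


# Hypothetical queue-like partition: 5,000 deleted rows followed by 10 live ones.
partition = [(i, True) for i in range(5_000)] + [(i, False) for i in range(5_000, 5_010)]

rows, skipped = read_live_rows(partition, limit=10)
print(len(rows), skipped)                  # 10 5000
print(skipped > TOMBSTONE_WARN_THRESHOLD)  # True
```

The query asks for only 10 rows, yet it has to step over 5,000 deletion markers first, which is the pattern that triggers tombstone warnings on queue-like tables.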

6. Size of partition XYZ exceeds recommended limit

Each partition has a recommended maximum of 100MB. Large partitions cause GC pressure and slow queries. Fix: Add more partition key columns to spread data across more partitions.
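A back-of-the-envelope estimate helps when choosing how finely to bucket a partition key. The Python sketch below assumes a hypothetical workload (200-byte rows, one reading every 10 seconds per sensor) and a hypothetical `bucket_key` helper; only the 100MB guideline comes from the text above.

```python
from datetime import datetime

PARTITION_LIMIT_BYTES = 100 * 1024 * 1024  # the 100MB guideline quoted above

# Hypothetical workload numbers, chosen only for illustration.
ROW_BYTES = 200
ROWS_PER_DAY = 8_640  # one reading every 10 seconds


def partition_bytes(days_per_partition: int) -> int:
    """Estimated size of a partition that holds this many days of data."""
    return ROW_BYTES * ROWS_PER_DAY * days_per_partition


def bucket_key(sensor_id: str, reading_time: datetime):
    """Partition key with a month bucket: (sensor_id, year, month)."""
    return (sensor_id, reading_time.year, reading_time.month)


print(partition_bytes(365) > PARTITION_LIMIT_BYTES)  # True: one year per sensor is too big
print(partition_bytes(31) > PARTITION_LIMIT_BYTES)   # False: one month per sensor fits
print(bucket_key("sensor-01", datetime(2026, 6, 7, 10, 0)))  # ('sensor-01', 2026, 6)
```

Under these assumed numbers, partitioning by sensor alone would exceed the guideline within a year, while adding year and month to the partition key keeps each partition comfortably below it.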

7. Connection refused / Unavailable exception

Not enough replicas are available to meet the consistency level. Fix: Check nodetool status to see which nodes are down. Reduce consistency level or add more replicas.

Practice Questions

1. What is a partition key in Cassandra?

The partition key determines which node stores a given row. It’s the first part of the PRIMARY KEY definition. All rows with the same partition key reside on the same node, making queries by partition key fast.

2. How does Cassandra achieve high availability?

Cassandra has a masterless architecture where every node is identical. Data is automatically replicated to multiple nodes (configurable via replication_factor). If any node fails, other nodes serve the data without downtime.

3. What is tunable consistency?

Tunable consistency lets you choose the trade-off between consistency and availability per query. You specify how many replicas must acknowledge writes or respond to reads — ONE (fast, eventual), QUORUM (balanced), or ALL (strong).

4. Challenge: Design a table for storing user sessions by user_id.

CREATE TABLE user_sessions (
    user_id UUID,
    session_id UUID,
    login_time TIMESTAMP,
    ip_address TEXT,
    user_agent TEXT,
    expires_at TIMESTAMP,
    PRIMARY KEY (user_id, session_id)
) WITH CLUSTERING ORDER BY (session_id DESC);

5. Why can’t you do JOINs in Cassandra?

Joins require scanning multiple partitions across different nodes, which defeats the purpose of a distributed database. Instead, denormalize your data: store related information together in the same table, designed around your query patterns.

Real-World Task: Build a Real-Time Analytics Pipeline

Design a Cassandra schema for tracking user events — similar to what Doda Browser uses for analytics:

-- Events by type (for dashboard aggregation)
CREATE TABLE events_by_type (
    event_type TEXT,
    year INT,
    month INT,
    day INT,
    event_time TIMEUUID,
    user_id UUID,
    metadata TEXT,
    PRIMARY KEY ((event_type, year, month, day), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Events by user (for user activity timeline)
CREATE TABLE events_by_user (
    user_id UUID,
    year INT,
    month INT,
    event_time TIMEUUID,
    event_type TEXT,
    metadata TEXT,
    PRIMARY KEY ((user_id, year, month), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Insert sample events
INSERT INTO events_by_type (event_type, year, month, day, event_time, user_id, metadata)
VALUES ('page_view', 2026, 6, 7, now(), uuid(), '{"page":"/home","referrer":"google.com"}');

INSERT INTO events_by_user (user_id, year, month, event_time, event_type, metadata)
VALUES (uuid(), 2026, 6, now(), 'page_view', '{"page":"/home"}');
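In application code, the two inserts above usually come from a single event that is written to both tables. The following Python sketch shows how one event could be turned into one row per table. The `rows_for_event` function is hypothetical, and `uuid.uuid1()` is used as a stand-in for CQL's now(); sending the rows to a cluster is left out.

```python
import uuid
from datetime import datetime, timezone


def rows_for_event(event_type: str, user_id: uuid.UUID, metadata: str, at: datetime):
    """Build one row per table from a single event (denormalized write)."""
    # uuid1() is time-based and reflects the current clock, so in real use
    # `at` should be the same instant the UUID is generated.
    event_time = uuid.uuid1()
    by_type = {
        "event_type": event_type, "year": at.year, "month": at.month, "day": at.day,
        "event_time": event_time, "user_id": user_id, "metadata": metadata,
    }
    by_user = {
        "user_id": user_id, "year": at.year, "month": at.month,
        "event_time": event_time, "event_type": event_type, "metadata": metadata,
    }
    return by_type, by_user


by_type, by_user = rows_for_event(
    "page_view", uuid.uuid4(), '{"page":"/home"}', datetime.now(timezone.utc)
)
print(sorted(by_type))
print(sorted(by_user))
```

Both rows share the same event_time and payload; they differ only in which columns form the partition key, so each table answers its own query from a single partition.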

FAQ

What is the difference between Cassandra and MongoDB?
Cassandra is a wide-column store optimized for high write throughput and distributed deployments. MongoDB is a document store with richer query capabilities. Cassandra uses CQL (SQL-like), MongoDB uses JSON-like documents. Both are NoSQL, but they serve different use cases.
Is Cassandra ACID compliant?
No. Cassandra is eventually consistent by default. You can get strongly consistent reads by using ALL, or by using QUORUM for both reads and writes, but this costs latency and availability. Cassandra prioritizes availability and partition tolerance (AP in the CAP theorem).
How do I back up Cassandra?
Use nodetool snapshot to take a snapshot of a keyspace, which creates hard links to SSTable files. Then copy the snapshot directory to backup storage. For incremental backups, enable incremental_backups: true in cassandra.yaml.
What is a wide-row model in Cassandra?
A wide row has a single partition key with many clustering columns, resulting in a large partition containing many rows. This is efficient for time-series data where queries always include the partition key and a range on the clustering column.
How do I monitor Cassandra performance?
Use nodetool info (node stats), nodetool tablestats (table metrics; cfstats is the older name for the same command), nodetool tpstats (thread pool stats), and nodetool proxyhistograms (latency percentiles). Enable metrics reporting to Elasticsearch or Prometheus.

Try It Yourself

Start a Cassandra cluster with Docker and run these cluster management commands:

# Check cluster status
nodetool status

# Output (abbreviated; UN means Up/Normal):
# Status=Up/Down | State=Normal/Leaving/Joining/Moving
# --  Address      Load    Tokens  Owns   Host ID    Rack
# UN  192.168.1.1  1.5 GB  256     33.3%  abc123...  rack1
# UN  192.168.1.2  1.5 GB  256     33.3%  def456...  rack1
# UN  192.168.1.3  1.5 GB  256     33.4%  ghi789...  rack1

# Check keyspace replication (DESCRIBE is a cqlsh command, so run it through cqlsh)
cqlsh -e "DESCRIBE KEYSPACE shop;"

# Flush data from memtable to SSTable
nodetool flush shop

# Repair inconsistencies
nodetool repair shop

These operational patterns are used daily in DodaZIP’s distributed file storage and Durga Antivirus Pro’s global threat sensor network.

What’s Next

Congratulations on completing this Cassandra tutorial! Here’s where to go from here:

  • Practice daily — Consistency is more important than long study sessions
  • Build a project — Apply what you learned by building something real
  • Explore related topics — Check out other tutorials in the same category
  • Join the community — Discuss with other learners and share your progress

Remember: every expert was once a beginner. Keep coding!

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
