Learn Developer: Pandas Guide — Python Data Analysis and Manipulation Library

Q: How do I improve Pandas performance?

: Specify dtypes on read, use categorical types for string columns with few unique values, avoid iterrows(), use inplace sparingly, and filter early to reduce data size.

Q: Can Pandas handle streaming data?

: Not natively. Use pd.read_csv() with chunksize parameter for chunked processing, or integrate with streaming frameworks like Kafka and process micro-batches.

Developer Tools

Pandas Guide — Python Data Analysis and Manipulation Library

DodaTech Updated Jun 7, 2026 11 min read

Pandas is the foundational Python library for data analysis, providing high-performance, easy-to-use data structures (DataFrame and Series) and tools for reading, cleaning, transforming, analyzing, and visualizing structured data.

What You’ll Learn

You’ll use DataFrames and Series for tabular data, read data from CSV/Excel/JSON, perform groupby aggregations, merge and join multiple datasets, create pivot tables, handle missing values, work with time series data, and integrate with matplotlib for visualization.

Why Pandas Matters

Data analysis is the foundation of data science, machine learning, and business intelligence. Pandas is the most widely used Python library for this purpose — installed in virtually every Python data environment. DodaTech’s security analytics pipeline uses Pandas to process millions of log entries daily, grouping by IP addresses, detecting anomalous patterns, and generating threat reports. Without Pandas, this analysis would require writing complex loops in pure Python that are slower and more error-prone.

Pandas Learning Path

    flowchart LR
  A[Python Basics] --> B[Pandas]
  B --> C[Series & DataFrame]
  C --> D[Data I/O]
  C --> E[Data Cleaning]
  C --> F[Transformation]
  D --> G[GroupBy & Aggregation]
  E --> H[Merge & Join]
  F --> I[Pivot Tables]
  I --> J[Time Series]
  J --> K[Visualization]
  B:::current

  classDef current fill:#150458,color:#fff,stroke:#333,stroke-width:2px

Prerequisites: Solid Python fundamentals (lists, dicts, loops, functions). Familiarity with NumPy arrays is helpful but not required.

DataFrame and Series

A Series is a one-dimensional labeled array. A DataFrame is a two-dimensional labeled data structure (like a spreadsheet):

import pandas as pd
import numpy as np

# Creating a Series
temperatures = pd.Series([22.5, 24.0, 19.8, 21.3, 23.1],
                         index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
                         name='Temperature (°C)')

print(temperatures)
# Output:
# Mon    22.5
# Tue    24.0
# Wed    19.8
# Thu    21.3
# Fri    23.1
# Name: Temperature (°C), dtype: float64

# Creating a DataFrame from a dictionary
data = {
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Hyderabad'],
    'temperature': [32.5, 35.2, 28.0, 34.1, 30.8],
    'humidity': [78, 65, 82, 74, 70],
    'aqi': [152, 210, 68, 95, 112]  # Air Quality Index
}

df = pd.DataFrame(data)
print(df)
# Output:
#         city  temperature  humidity  aqi
# 0     Mumbai         32.5        78  152
# 1      Delhi         35.2        65  210
# 2  Bangalore         28.0        82   68
# 3    Chennai         34.1        74   95
# 4  Hyderabad         30.8        70  112

print(df.shape)   # (5, 4) — 5 rows, 4 columns
print(df.dtypes)  # Data types of each column
print(df.describe())  # Summary statistics

Reading Data

Pandas reads data from dozens of formats. CSV is the most common:

# Read from CSV
df = pd.read_csv('sensor_data.csv')
print(df.head())  # First 5 rows

# Common read_csv parameters
df = pd.read_csv(
    'log_data.csv',
    sep=',',              # Delimiter
    header=0,             # Row 0 is the header
    index_col='timestamp', # Use timestamp column as index
    parse_dates=['timestamp'],  # Parse as datetime
    na_values=['NA', 'NULL', '-'],  # Treat these as NaN
    usecols=['timestamp', 'sensor_id', 'value'],  # Only load these columns
    dtype={'sensor_id': 'int32', 'value': 'float32'},  # Specify types
    nrows=10000            # Only read first 10,000 rows
)

# Other file formats
excel_df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
json_df = pd.read_json('data.json')
sql_df = pd.read_sql('SELECT * FROM devices', connection)
parquet_df = pd.read_parquet('data.parquet')

Output: df.head() shows the first 5 rows with column headers. Using parse_dates converts string timestamps to datetime objects automatically. na_values ensures missing data is consistently represented as NaN.

Data Cleaning and Missing Data

Real-world data is messy. Pandas provides tools to handle this:

# Check for missing values
print(df.isnull().sum())
# Output:
# timestamp      0
# sensor_id      0
# value         12
# location      45
# dtype: int64

# Handle missing values
df_cleaned = df.copy()

# Option 1: Drop rows with any NaN
df_dropped = df_cleaned.dropna()

# Option 2: Fill with specific value
df_cleaned['value'] = df_cleaned['value'].fillna(df_cleaned['value'].mean())

# Option 3: Forward fill (use previous value)
df_cleaned['location'] = df_cleaned['location'].ffill()

# Option 4: Interpolate (for time series)
df_cleaned['value'] = df_cleaned['value'].interpolate(method='linear')

# Remove duplicates
df_cleaned = df_cleaned.drop_duplicates(subset=['timestamp', 'sensor_id'])

# Filter out outliers (values beyond 3 standard deviations)
mean = df_cleaned['value'].mean()
std = df_cleaned['value'].std()
df_cleaned = df_cleaned[
    (df_cleaned['value'] >= mean - 3 * std) &
    (df_cleaned['value'] <= mean + 3 * std)
]

print(f"Original rows: {len(df)}, Cleaned: {len(df_cleaned)}")

Output: The dataset goes from having 12 null values in the value column and 45 in location to a fully populated, duplicate-free, outlier-filtered DataFrame. The interpolate() method fills gaps in time series data by drawing a straight line between known points.

GroupBy and Aggregation

GroupBy splits data into groups, applies a function, and combines results:

# Sample IoT sensor data
sensor_data = pd.DataFrame({
    'sensor_id': ['S01', 'S02', 'S03', 'S01', 'S02', 'S03', 'S01', 'S02', 'S03'],
    'location': ['Room-A', 'Room-B', 'Room-C', 'Room-A', 'Room-B', 'Room-C', 'Room-A', 'Room-B', 'Room-C'],
    'reading': [23.5, 24.1, 22.8, 24.0, 24.3, 23.1, 23.8, 24.5, 22.9],
    'hour': [8, 8, 8, 12, 12, 12, 16, 16, 16]
})

# Group by sensor and compute statistics
stats = sensor_data.groupby('sensor_id')['reading'].agg([
    ('mean_reading', 'mean'),
    ('max_reading', 'max'),
    ('min_reading', 'min'),
    ('range', lambda x: x.max() - x.min()),
    ('count', 'count')
]).reset_index()

print(stats)
# Output:
#   sensor_id  mean_reading  max_reading  min_reading  range  count
# 0       S01         23.77         24.0         23.5    0.5      3
# 1       S02         24.30         24.5         24.1    0.4      3
# 2       S03         22.93         23.1         22.8    0.3      3

# Multiple groupby columns
hourly = sensor_data.groupby(['location', 'hour'])['reading'].mean()
print(hourly)
# Output:
# location  hour
# Room-A    8      23.50
#           12     24.00
#           16     23.80
# Room-B    8      24.10
#           12     24.30
#           16     24.50
# Room-C    8      22.80
#           12     23.10
#           16     22.90

Merge and Join

Combine datasets from different sources:

# Two related datasets
devices = pd.DataFrame({
    'device_id': [1, 2, 3, 4],
    'device_name': ['Thermostat', 'Camera', 'Motion Sensor', 'Smoke Detector'],
    'location_id': [101, 102, 101, 103]
})

locations = pd.DataFrame({
    'location_id': [101, 102, 103, 104],
    'building': ['HQ-A', 'HQ-B', 'Warehouse-1', 'Warehouse-2'],
    'floor': [3, 5, 1, 2]
})

readings = pd.DataFrame({
    'device_id': [1, 1, 2, 3, 4],
    'timestamp': ['2026-06-07 08:00', '2026-06-07 09:00',
                  '2026-06-07 08:00', '2026-06-07 08:00', '2026-06-07 08:00'],
    'value': [23.5, 24.0, 12, 44, 0]
})

# Inner join — only matching records
device_location = pd.merge(devices, locations, on='location_id', how='inner')

# Left join — keep all devices
all_devices = pd.merge(devices, locations, on='location_id', how='left')

# Full outer join
full_join = pd.merge(devices, locations, on='location_id', how='outer')

# Merge with readings
complete = pd.merge(
    pd.merge(devices, readings, on='device_id', how='left'),
    locations, on='location_id', how='left'
)

print(complete[['device_name', 'building', 'value', 'timestamp']])
# Output:
#       device_name    building  value            timestamp
# 0      Thermostat        HQ-A   23.5  2026-06-07 08:00:00
# 1      Thermostat        HQ-A   24.0  2026-06-07 09:00:00
# 2           Camera        HQ-B   12.0  2026-06-07 08:00:00
# 3    Motion Sensor        HQ-A   44.0  2026-06-07 08:00:00
# 4  Smoke Detector  Warehouse-1    0.0  2026-06-07 08:00:00

Output: The three datasets (devices, locations, readings) are combined into a single unified DataFrame showing device names alongside their building locations and sensor readings.

Pivot Tables

Pivot tables summarize data in a cross-tabulation format:

# Create a pivot table — average reading by sensor and hour
pivot = sensor_data.pivot_table(
    values='reading',
    index='location',
    columns='hour',
    aggfunc='mean',
    fill_value=0
)

print(pivot)
# Output:
# hour         8     12    16
# location
# Room-A    23.5  24.0  23.8
# Room-B    24.1  24.3  24.5
# Room-C    22.8  23.1  22.9

# Multi-level pivot
pivot_multi = sensor_data.pivot_table(
    values='reading',
    index=['location', 'sensor_id'],
    columns='hour',
    aggfunc='mean',
    margins=True,       # Add row/column totals
    margins_name='Avg'
)

Time Series Analysis

Pandas excels at time series data:

# Create a time series
dates = pd.date_range('2026-06-01', periods=30, freq='H')
ts = pd.Series(
    np.random.normal(25, 3, len(dates)),
    index=dates,
    name='temperature'
)

# Resample to different frequencies
hourly = ts.resample('H').mean()
daily = ts.resample('D').agg(['mean', 'min', 'max', 'std'])
weekly = ts.resample('W').mean()

# Rolling window calculations
ts_rolling = ts.rolling(window=6).mean()  # 6-hour rolling average

print(daily.head())
# Output:
#                           mean        min        max       std
# 2026-06-01  00:00:00  24.894264  20.975229  28.213571  2.281423
# 2026-06-02  00:00:00  24.931515  21.264955  29.667568  2.528765
# 2026-06-03  00:00:00  25.425816  19.908546  29.003511  2.681503

# Shift and lag
ts.shift(1)   # Previous hour
ts.diff()     # Change from previous hour
ts.pct_change()  # Percentage change

# Date-based filtering
june_week2 = ts['2026-06-08':'2026-06-14']
weekday_avg = ts.groupby(ts.index.dayofweek).mean()  # 0=Monday

Output: The daily resample computes mean, min, max, and standard deviation for each day. The 6-hour rolling average smooths out short-term fluctuations, making trends visible.

Visualization with Matplotlib

Pandas integrates with matplotlib for quick plotting:

import matplotlib.pyplot as plt

# Prepare data
city_data = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Hyderabad'],
    'temperature': [32.5, 35.2, 28.0, 34.1, 30.8],
    'humidity': [78, 65, 82, 74, 70],
    'aqi': [152, 210, 68, 95, 112]
})

# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Bar plot
city_data.plot(x='city', y='temperature', kind='bar',
               ax=axes[0, 0], title='Temperature by City', color='orange')

# Line plot
city_data.plot(x='city', y=['temperature', 'humidity'], kind='line',
               ax=axes[0, 1], title='Temperature & Humidity', marker='o')

# Scatter plot
city_data.plot(x='humidity', y='temperature', kind='scatter',
               ax=axes[1, 0], title='Humidity vs Temperature', c='aqi', colormap='viridis')

# Pie chart
city_data.plot(x='city', y='aqi', kind='pie',
               ax=axes[1, 1], title='AQI Distribution', autopct='%1.1f%%',
               legend=False)

plt.tight_layout()
plt.show()

Security Angle: Sanitizing Imported Data

def safely_import_csv(filepath):
    """Import CSV with security sanitization"""
    df = pd.read_csv(filepath)

    # Strip whitespace from string columns
    string_cols = df.select_dtypes(include='object').columns
    df[string_cols] = df[string_cols].apply(lambda x: x.str.strip())

    # Remove potentially malicious characters
    for col in string_cols:
        df[col] = df[col].str.replace(r'[<>\'"]', '', regex=True)

    # Validate numeric columns
    numeric_cols = df.select_dtypes(include='number').columns
    for col in numeric_cols:
        # Clip outliers to reasonable ranges
        df[col] = df[col].clip(lower=-1e6, upper=1e6)

    # Check for SQL injection patterns in string fields
    sql_patterns = r'(\bSELECT\b|\bDROP\b|\bINSERT\b|\bDELETE\b|\bUNION\b|\b--\b)'
    for col in string_cols:
        suspicious = df[col].str.contains(sql_patterns, case=False, na=False)
        if suspicious.any():
            print(f"Warning: Suspicious content found in column '{col}'")

    return df

DodaTech uses this sanitization pattern before analyzing CSV data uploaded through Durga Antivirus Pro’s threat intelligence portal.

Common Mistakes Beginners Make

Chained indexing (df[df['col'] > 5]['other']): This creates a copy and can raise SettingWithCopyWarning. Use .loc[] for all assignment operations.
Modifying a slice instead of the original: Operations on filtered DataFrames may modify copies, not the original. Use .copy() when you need a separate DataFrame.
Forgetting that inplace=True modifies the original: Many Pandas operations have an inplace parameter. Using it without understanding the side effects leads to data loss.
Iterating rows with iterrows(): It’s extremely slow for large datasets. Use vectorized operations, apply(), or groupby/transform instead.
Not specifying data types on large imports: Default dtype inference reads the entire file first. Specify dtype in read_csv for large datasets to save memory and speed up loading.
Assuming groupby preserves original index: After groupby operations, the grouped columns become the index. Use as_index=False or .reset_index() to keep them as columns.

Practice Questions

What is the difference between a Series and a DataFrame?
How do you handle missing values in a DataFrame?
What does groupby do and how do you aggregate results?
How do you merge two DataFrames on a common column?
What is the purpose of pivot_table?

Answers:

A Series is a one-dimensional labeled array (single column). A DataFrame is a two-dimensional labeled structure with multiple columns, like a spreadsheet.
Use dropna() to remove rows with NaN, fillna() to replace with a value, ffill()/bfill() for forward/backward fill, or interpolate() for time series gap filling.
groupby splits data into groups based on column values. Apply aggregation functions (mean, sum, count, etc.) to compute per-group statistics.
Use pd.merge(df1, df2, on='key_column', how='inner'). The how parameter controls inner, left, right, or outer join behavior.
pivot_table creates a cross-tabulation, with one column as index, another as columns, and aggregated values in the cells.

Challenge

Analyze a 30-day IoT sensor dataset: load and clean missing timestamps, detect and remove outliers (values beyond 3 standard deviations), calculate hourly and daily aggregates per sensor, merge with location data, create a pivot table showing average readings by location and hour, generate a time series plot with rolling averages, and export the cleaned data as Parquet.

Real-World Task

Process a web server access log: parse timestamps, extract status codes and request paths, calculate requests per minute (resample), identify the top 10 most-requested URLs, group by IP to detect potential DDoS patterns (too many requests from single IP), merge with a known_bots.csv whitelist, and generate a security summary report as CSV.

FAQ

What is the difference between Pandas and NumPy?

: NumPy provides multi-dimensional arrays and mathematical operations. Pandas builds on NumPy with labeled axes, missing data handling, time series functionality, and data manipulation tools like groupby and merge.

Is Pandas suitable for big data?

: Pandas works in memory, so it’s limited by RAM. For datasets larger than memory, use Dask (Pandas-compatible parallel computing), Vaex, or PySpark.

When should I use apply() vs vectorized operations?

: Vectorized operations (column arithmetic, built-in methods) are 10-100x faster than apply(). Use apply() only when you need row-wise logic that can’t be vectorized.

How do I improve Pandas performance?

: Specify dtypes on read, use categorical types for string columns with few unique values, avoid iterrows(), use inplace sparingly, and filter early to reduce data size.

Can Pandas handle streaming data?

: Not natively. Use pd.read_csv() with chunksize parameter for chunked processing, or integrate with streaming frameworks like Kafka and process micro-batches.

Try It Yourself

import pandas as pd
import numpy as np

# Create a DataFrame with 1000 rows of sensor data
df = pd.DataFrame({
    'sensor': np.random.choice(['S01', 'S02', 'S03', 'S04'], 1000),
    'value': np.random.normal(25, 5, 1000),
    'timestamp': pd.date_range('2026-06-01', periods=1000, freq='H')
})

print(df.head(10))
print(df.groupby('sensor')['value'].agg(['mean', 'std', 'count']))
print(df.set_index('timestamp').resample('D')['value'].mean().head())