Pandas Guide — Python Data Analysis and Manipulation Library
Pandas is the foundational Python library for data analysis, providing high-performance, easy-to-use data structures (DataFrame and Series) and tools for reading, cleaning, transforming, analyzing, and visualizing structured data.
What You’ll Learn
You’ll use DataFrames and Series for tabular data, read data from CSV/Excel/JSON, perform groupby aggregations, merge and join multiple datasets, create pivot tables, handle missing values, work with time series data, and integrate with matplotlib for visualization.
Why Pandas Matters
Data analysis is the foundation of data science, machine learning, and business intelligence. Pandas is the most widely used Python library for this purpose — installed in virtually every Python data environment. DodaTech’s security analytics pipeline uses Pandas to process millions of log entries daily, grouping by IP addresses, detecting anomalous patterns, and generating threat reports. Without Pandas, this analysis would require writing complex loops in pure Python that are slower and more error-prone.
Pandas Learning Path
flowchart LR
A[Python Basics] --> B[Pandas]
B --> C[Series & DataFrame]
C --> D[Data I/O]
C --> E[Data Cleaning]
C --> F[Transformation]
D --> G[GroupBy & Aggregation]
E --> H[Merge & Join]
F --> I[Pivot Tables]
I --> J[Time Series]
J --> K[Visualization]
B:::current
classDef current fill:#150458,color:#fff,stroke:#333,stroke-width:2px
DataFrame and Series
A Series is a one-dimensional labeled array. A DataFrame is a two-dimensional labeled data structure (like a spreadsheet):
import pandas as pd
import numpy as np
# Creating a Series
temperatures = pd.Series([22.5, 24.0, 19.8, 21.3, 23.1],
index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri'],
name='Temperature (°C)')
print(temperatures)
# Output:
# Mon 22.5
# Tue 24.0
# Wed 19.8
# Thu 21.3
# Fri 23.1
# Name: Temperature (°C), dtype: float64
# Creating a DataFrame from a dictionary
data = {
'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Hyderabad'],
'temperature': [32.5, 35.2, 28.0, 34.1, 30.8],
'humidity': [78, 65, 82, 74, 70],
'aqi': [152, 210, 68, 95, 112] # Air Quality Index
}
df = pd.DataFrame(data)
print(df)
# Output:
# city temperature humidity aqi
# 0 Mumbai 32.5 78 152
# 1 Delhi 35.2 65 210
# 2 Bangalore 28.0 82 68
# 3 Chennai 34.1 74 95
# 4 Hyderabad 30.8 70 112
print(df.shape) # (5, 4) — 5 rows, 4 columns
print(df.dtypes) # Data types of each column
print(df.describe()) # Summary statisticsReading Data
Pandas reads data from dozens of formats. CSV is the most common:
# Read from CSV
df = pd.read_csv('sensor_data.csv')
print(df.head()) # First 5 rows
# Common read_csv parameters
df = pd.read_csv(
'log_data.csv',
sep=',', # Delimiter
header=0, # Row 0 is the header
index_col='timestamp', # Use timestamp column as index
parse_dates=['timestamp'], # Parse as datetime
na_values=['NA', 'NULL', '-'], # Treat these as NaN
usecols=['timestamp', 'sensor_id', 'value'], # Only load these columns
dtype={'sensor_id': 'int32', 'value': 'float32'}, # Specify types
nrows=10000 # Only read first 10,000 rows
)
# Other file formats
excel_df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
json_df = pd.read_json('data.json')
sql_df = pd.read_sql('SELECT * FROM devices', connection)
parquet_df = pd.read_parquet('data.parquet')Output: df.head() shows the first 5 rows with column headers. Using parse_dates converts string timestamps to datetime objects automatically. na_values ensures missing data is consistently represented as NaN.
Data Cleaning and Missing Data
Real-world data is messy. Pandas provides tools to handle this:
# Check for missing values
print(df.isnull().sum())
# Output:
# timestamp 0
# sensor_id 0
# value 12
# location 45
# dtype: int64
# Handle missing values
df_cleaned = df.copy()
# Option 1: Drop rows with any NaN
df_dropped = df_cleaned.dropna()
# Option 2: Fill with specific value
df_cleaned['value'] = df_cleaned['value'].fillna(df_cleaned['value'].mean())
# Option 3: Forward fill (use previous value)
df_cleaned['location'] = df_cleaned['location'].ffill()
# Option 4: Interpolate (for time series)
df_cleaned['value'] = df_cleaned['value'].interpolate(method='linear')
# Remove duplicates
df_cleaned = df_cleaned.drop_duplicates(subset=['timestamp', 'sensor_id'])
# Filter out outliers (values beyond 3 standard deviations)
mean = df_cleaned['value'].mean()
std = df_cleaned['value'].std()
df_cleaned = df_cleaned[
(df_cleaned['value'] >= mean - 3 * std) &
(df_cleaned['value'] <= mean + 3 * std)
]
print(f"Original rows: {len(df)}, Cleaned: {len(df_cleaned)}")Output: The dataset goes from having 12 null values in the value column and 45 in location to a fully populated, duplicate-free, outlier-filtered DataFrame. The interpolate() method fills gaps in time series data by drawing a straight line between known points.
GroupBy and Aggregation
GroupBy splits data into groups, applies a function, and combines results:
# Sample IoT sensor data
sensor_data = pd.DataFrame({
'sensor_id': ['S01', 'S02', 'S03', 'S01', 'S02', 'S03', 'S01', 'S02', 'S03'],
'location': ['Room-A', 'Room-B', 'Room-C', 'Room-A', 'Room-B', 'Room-C', 'Room-A', 'Room-B', 'Room-C'],
'reading': [23.5, 24.1, 22.8, 24.0, 24.3, 23.1, 23.8, 24.5, 22.9],
'hour': [8, 8, 8, 12, 12, 12, 16, 16, 16]
})
# Group by sensor and compute statistics
stats = sensor_data.groupby('sensor_id')['reading'].agg([
('mean_reading', 'mean'),
('max_reading', 'max'),
('min_reading', 'min'),
('range', lambda x: x.max() - x.min()),
('count', 'count')
]).reset_index()
print(stats)
# Output:
# sensor_id mean_reading max_reading min_reading range count
# 0 S01 23.77 24.0 23.5 0.5 3
# 1 S02 24.30 24.5 24.1 0.4 3
# 2 S03 22.93 23.1 22.8 0.3 3
# Multiple groupby columns
hourly = sensor_data.groupby(['location', 'hour'])['reading'].mean()
print(hourly)
# Output:
# location hour
# Room-A 8 23.50
# 12 24.00
# 16 23.80
# Room-B 8 24.10
# 12 24.30
# 16 24.50
# Room-C 8 22.80
# 12 23.10
# 16 22.90Merge and Join
Combine datasets from different sources:
# Two related datasets
devices = pd.DataFrame({
'device_id': [1, 2, 3, 4],
'device_name': ['Thermostat', 'Camera', 'Motion Sensor', 'Smoke Detector'],
'location_id': [101, 102, 101, 103]
})
locations = pd.DataFrame({
'location_id': [101, 102, 103, 104],
'building': ['HQ-A', 'HQ-B', 'Warehouse-1', 'Warehouse-2'],
'floor': [3, 5, 1, 2]
})
readings = pd.DataFrame({
'device_id': [1, 1, 2, 3, 4],
'timestamp': ['2026-06-07 08:00', '2026-06-07 09:00',
'2026-06-07 08:00', '2026-06-07 08:00', '2026-06-07 08:00'],
'value': [23.5, 24.0, 12, 44, 0]
})
# Inner join — only matching records
device_location = pd.merge(devices, locations, on='location_id', how='inner')
# Left join — keep all devices
all_devices = pd.merge(devices, locations, on='location_id', how='left')
# Full outer join
full_join = pd.merge(devices, locations, on='location_id', how='outer')
# Merge with readings
complete = pd.merge(
pd.merge(devices, readings, on='device_id', how='left'),
locations, on='location_id', how='left'
)
print(complete[['device_name', 'building', 'value', 'timestamp']])
# Output:
# device_name building value timestamp
# 0 Thermostat HQ-A 23.5 2026-06-07 08:00:00
# 1 Thermostat HQ-A 24.0 2026-06-07 09:00:00
# 2 Camera HQ-B 12.0 2026-06-07 08:00:00
# 3 Motion Sensor HQ-A 44.0 2026-06-07 08:00:00
# 4 Smoke Detector Warehouse-1 0.0 2026-06-07 08:00:00Output: The three datasets (devices, locations, readings) are combined into a single unified DataFrame showing device names alongside their building locations and sensor readings.
Pivot Tables
Pivot tables summarize data in a cross-tabulation format:
# Create a pivot table — average reading by sensor and hour
pivot = sensor_data.pivot_table(
values='reading',
index='location',
columns='hour',
aggfunc='mean',
fill_value=0
)
print(pivot)
# Output:
# hour 8 12 16
# location
# Room-A 23.5 24.0 23.8
# Room-B 24.1 24.3 24.5
# Room-C 22.8 23.1 22.9
# Multi-level pivot
pivot_multi = sensor_data.pivot_table(
values='reading',
index=['location', 'sensor_id'],
columns='hour',
aggfunc='mean',
margins=True, # Add row/column totals
margins_name='Avg'
)Time Series Analysis
Pandas excels at time series data:
# Create a time series
dates = pd.date_range('2026-06-01', periods=30, freq='H')
ts = pd.Series(
np.random.normal(25, 3, len(dates)),
index=dates,
name='temperature'
)
# Resample to different frequencies
hourly = ts.resample('H').mean()
daily = ts.resample('D').agg(['mean', 'min', 'max', 'std'])
weekly = ts.resample('W').mean()
# Rolling window calculations
ts_rolling = ts.rolling(window=6).mean() # 6-hour rolling average
print(daily.head())
# Output:
# mean min max std
# 2026-06-01 00:00:00 24.894264 20.975229 28.213571 2.281423
# 2026-06-02 00:00:00 24.931515 21.264955 29.667568 2.528765
# 2026-06-03 00:00:00 25.425816 19.908546 29.003511 2.681503
# Shift and lag
ts.shift(1) # Previous hour
ts.diff() # Change from previous hour
ts.pct_change() # Percentage change
# Date-based filtering
june_week2 = ts['2026-06-08':'2026-06-14']
weekday_avg = ts.groupby(ts.index.dayofweek).mean() # 0=MondayOutput: The daily resample computes mean, min, max, and standard deviation for each day. The 6-hour rolling average smooths out short-term fluctuations, making trends visible.
Visualization with Matplotlib
Pandas integrates with matplotlib for quick plotting:
import matplotlib.pyplot as plt
# Prepare data
city_data = pd.DataFrame({
'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Hyderabad'],
'temperature': [32.5, 35.2, 28.0, 34.1, 30.8],
'humidity': [78, 65, 82, 74, 70],
'aqi': [152, 210, 68, 95, 112]
})
# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Bar plot
city_data.plot(x='city', y='temperature', kind='bar',
ax=axes[0, 0], title='Temperature by City', color='orange')
# Line plot
city_data.plot(x='city', y=['temperature', 'humidity'], kind='line',
ax=axes[0, 1], title='Temperature & Humidity', marker='o')
# Scatter plot
city_data.plot(x='humidity', y='temperature', kind='scatter',
ax=axes[1, 0], title='Humidity vs Temperature', c='aqi', colormap='viridis')
# Pie chart
city_data.plot(x='city', y='aqi', kind='pie',
ax=axes[1, 1], title='AQI Distribution', autopct='%1.1f%%',
legend=False)
plt.tight_layout()
plt.show()Security Angle: Sanitizing Imported Data
def safely_import_csv(filepath):
"""Import CSV with security sanitization"""
df = pd.read_csv(filepath)
# Strip whitespace from string columns
string_cols = df.select_dtypes(include='object').columns
df[string_cols] = df[string_cols].apply(lambda x: x.str.strip())
# Remove potentially malicious characters
for col in string_cols:
df[col] = df[col].str.replace(r'[<>\'"]', '', regex=True)
# Validate numeric columns
numeric_cols = df.select_dtypes(include='number').columns
for col in numeric_cols:
# Clip outliers to reasonable ranges
df[col] = df[col].clip(lower=-1e6, upper=1e6)
# Check for SQL injection patterns in string fields
sql_patterns = r'(\bSELECT\b|\bDROP\b|\bINSERT\b|\bDELETE\b|\bUNION\b|\b--\b)'
for col in string_cols:
suspicious = df[col].str.contains(sql_patterns, case=False, na=False)
if suspicious.any():
print(f"Warning: Suspicious content found in column '{col}'")
return dfDodaTech uses this sanitization pattern before analyzing CSV data uploaded through Durga Antivirus Pro’s threat intelligence portal.
Common Mistakes Beginners Make
Chained indexing (
df[df['col'] > 5]['other']): This creates a copy and can raiseSettingWithCopyWarning. Use.loc[]for all assignment operations.Modifying a slice instead of the original: Operations on filtered DataFrames may modify copies, not the original. Use
.copy()when you need a separate DataFrame.Forgetting that
inplace=Truemodifies the original: Many Pandas operations have aninplaceparameter. Using it without understanding the side effects leads to data loss.Iterating rows with
iterrows(): It’s extremely slow for large datasets. Use vectorized operations,apply(), or groupby/transform instead.Not specifying data types on large imports: Default
dtypeinference reads the entire file first. Specifydtypeinread_csvfor large datasets to save memory and speed up loading.Assuming
groupbypreserves original index: After groupby operations, the grouped columns become the index. Useas_index=Falseor.reset_index()to keep them as columns.
Practice Questions
- What is the difference between a Series and a DataFrame?
- How do you handle missing values in a DataFrame?
- What does
groupbydo and how do you aggregate results? - How do you merge two DataFrames on a common column?
- What is the purpose of
pivot_table?
Answers:
- A Series is a one-dimensional labeled array (single column). A DataFrame is a two-dimensional labeled structure with multiple columns, like a spreadsheet.
- Use
dropna()to remove rows with NaN,fillna()to replace with a value,ffill()/bfill()for forward/backward fill, orinterpolate()for time series gap filling. groupbysplits data into groups based on column values. Apply aggregation functions (mean,sum,count, etc.) to compute per-group statistics.- Use
pd.merge(df1, df2, on='key_column', how='inner'). Thehowparameter controls inner, left, right, or outer join behavior. pivot_tablecreates a cross-tabulation, with one column as index, another as columns, and aggregated values in the cells.
Challenge
Analyze a 30-day IoT sensor dataset: load and clean missing timestamps, detect and remove outliers (values beyond 3 standard deviations), calculate hourly and daily aggregates per sensor, merge with location data, create a pivot table showing average readings by location and hour, generate a time series plot with rolling averages, and export the cleaned data as Parquet.
Real-World Task
Process a web server access log: parse timestamps, extract status codes and request paths, calculate requests per minute (resample), identify the top 10 most-requested URLs, group by IP to detect potential DDoS patterns (too many requests from single IP), merge with a known_bots.csv whitelist, and generate a security summary report as CSV.
FAQ
Try It Yourself
import pandas as pd
import numpy as np
# Create a DataFrame with 1000 rows of sensor data
df = pd.DataFrame({
'sensor': np.random.choice(['S01', 'S02', 'S03', 'S04'], 1000),
'value': np.random.normal(25, 5, 1000),
'timestamp': pd.date_range('2026-06-01', periods=1000, freq='H')
})
print(df.head(10))
print(df.groupby('sensor')['value'].agg(['mean', 'std', 'count']))
print(df.set_index('timestamp').resample('D')['value'].mean().head())What’s Next
Related topics: Python, NumPy, Matplotlib, Data Science
What’s Next
Congratulations on completing this Pandas tutorial! Here’s where to go from here:
- Practice daily — Consistency is more important than long study sessions
- Build a project — Apply what you learned by building something real
- Explore related topics — Check out other tutorials in the same category
- Join the community — Discuss with other learners and share your progress
Remember: every expert was once a beginner. Keep coding!
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro