DataHark - Every Data is a Story


Wednesday, December 17, 2025

📊 Descriptive Statistics Explained Simply | Mean, Median & Std Dev

Part 3: Statistics for Data Science Series

Goal: Learn how to summarize data effectively before analysis

When working with data, the first and most important question is:

👉 “What does this data look like?”

This is where Descriptive Statistics comes in.

Descriptive statistics help us summarize, understand, and interpret data using simple numerical measures. They are the foundation for any machine learning, prediction, or dashboarding work that follows.


📌 What is Descriptive Statistics?

Descriptive statistics are techniques used to summarize and describe the main features of a dataset.

They help answer questions like:

What is the average value?
How spread out is the data?
Are there extreme values?
Where does most of the data lie?

👉 Unlike inferential statistics, descriptive statistics do not make predictions — they explain what already exists.

📈 Measures of Central Tendency (Finding the “Center”)

1️⃣ Mean (Average)

Definition:
The sum of all values divided by the total number of values.

Formula:
Mean = (Sum of values) / (Number of values)

📌 Real-Life Example:
Average daily sales of an e-commerce store over 5 days:
₹10k, ₹12k, ₹11k, ₹13k, ₹14k

Mean = ₹12k

🟢 Best used when data has no extreme outliers
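
Here is a minimal sketch of the calculation above, using Python's built-in statistics module. The variable names are illustrative, and the figures are the five daily sales values from the example, in thousands of rupees.

```python
from statistics import mean

# Daily sales from the example above, in thousands of rupees
daily_sales_k = [10, 12, 11, 13, 14]

# Mean = (sum of values) / (number of values)
manual_mean = sum(daily_sales_k) / len(daily_sales_k)

# The standard library gives the same result
library_mean = mean(daily_sales_k)

print(manual_mean)   # 12.0
print(library_mean)  # 12
```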

2️⃣ Median (Middle Value)

Definition:
The middle value when data is sorted. With an even number of values, it is the average of the two middle values.

📌 Example:
Monthly salaries in a startup:
₹25k, ₹30k, ₹35k, ₹40k, ₹5,00,000

Median = ₹35k

🟢 Best used when data contains outliers
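
To see why the median holds up better here, the sketch below compares it with the mean for the same five salaries (₹5,00,000 written as 500000). It uses only the standard library, and the variable names are illustrative.

```python
from statistics import mean, median

# Monthly salaries from the example above, in rupees
salaries = [25_000, 30_000, 35_000, 40_000, 500_000]

salary_median = median(salaries)  # middle value of the sorted list
salary_mean = mean(salaries)      # pulled upward by the one very large salary

print(salary_median)  # 35000
print(salary_mean)    # 126000
```

The mean lands far above what four of the five people actually earn, while the median stays at a typical salary.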

3️⃣ Mode (Most Frequent Value)

Definition:
The value that appears most often.

📌 Example:
Most sold product sizes:
M, M, L, S, M, L

Mode = M

🟢 Useful for categorical data
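
A short sketch of the same example in Python. The mode works on text labels as well as numbers, which is why it suits categorical data. The variable names are illustrative.

```python
from collections import Counter
from statistics import mode

# Product sizes from the example above
sizes = ["M", "M", "L", "S", "M", "L"]

most_common_size = mode(sizes)
size_counts = Counter(sizes).most_common()  # every size with its frequency

print(most_common_size)  # M
print(size_counts)       # [('M', 3), ('L', 2), ('S', 1)]
```

If two values tie for most frequent, statistics.multimode() returns all of them.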

📊 Measures of Dispersion (Understanding Spread)

4️⃣ Range

Definition:
Difference between the maximum and minimum values.

📌 Example:
City temperature range:
Min = 20°C, Max = 45°C

Range = 25°C

⚠️ Highly sensitive to outliers
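
The sketch below shows how a single extreme reading changes the range. Only the minimum (20°C) and maximum (45°C) come from the example; the readings in between and the extra spike are hypothetical values added for illustration.

```python
# Hypothetical temperature readings in °C; only 20 and 45 come from the example
temps_c = [20, 27, 31, 38, 45]
temp_range = max(temps_c) - min(temps_c)

# One hypothetical faulty sensor reading is enough to stretch the range
temps_with_spike = temps_c + [58]
spike_range = max(temps_with_spike) - min(temps_with_spike)

print(temp_range)   # 25
print(spike_range)  # 38
```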

5️⃣ Variance

Definition:
Average of squared differences from the mean. (For a sample rather than a full population, the sum is divided by n - 1 instead of n.)

Low variance → values are close together
High variance → values are spread out

🟢 Widely used in finance and risk analysis
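
A worked sketch of the definition, reusing the daily sales figures from the mean example (in thousands of rupees). It computes both the population version (divide by n) and the sample version (divide by n - 1) and checks them against the standard library. The variable names are illustrative.

```python
from statistics import pvariance, variance

daily_sales_k = [10, 12, 11, 13, 14]
n = len(daily_sales_k)

mean_k = sum(daily_sales_k) / n                             # 12.0
squared_diffs = [(x - mean_k) ** 2 for x in daily_sales_k]  # [4.0, 0.0, 1.0, 1.0, 4.0]

population_variance = sum(squared_diffs) / n        # 2.0
sample_variance = sum(squared_diffs) / (n - 1)      # 2.5

# The standard library agrees with the manual calculation
print(population_variance == pvariance(daily_sales_k))  # True
print(sample_variance == variance(daily_sales_k))       # True
```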

6️⃣ Standard Deviation

Definition:
Square root of variance.

📌 Example:
Stocks with high standard deviation are more volatile.

🟢 Most commonly used dispersion metric
🟢 Same unit as original data
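
As a rough illustration of the stock example, the sketch below compares two made-up series of daily returns (in percent). Both series and their names are hypothetical; the point is only that the more scattered series has the larger standard deviation.

```python
import math
from statistics import stdev, variance

# Hypothetical daily returns in percent, invented for illustration
steady_stock = [0.2, -0.1, 0.3, -0.2, 0.1]
volatile_stock = [2.5, -3.0, 4.0, -2.0, 1.5]

steady_sd = stdev(steady_stock)
volatile_sd = stdev(volatile_stock)

# Standard deviation is the square root of variance,
# so it is expressed in the same unit as the data (percent here)
matches_definition = math.isclose(volatile_sd, math.sqrt(variance(volatile_stock)))

print(volatile_sd > steady_sd)  # True
print(matches_definition)       # True
```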

📐 Percentiles & Interquartile Range (IQR)

7️⃣ Percentiles

Definition:
A percentile shows the value below which a certain percentage of data falls.

📌 Example:
90th percentile salary = ₹20 LPA
If you earn ₹20 LPA, you earn more than about 90% of employees.
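
A small sketch of how a 90th percentile can be computed with the standard library. The salary list is hypothetical and was chosen so the result matches the ₹20 LPA figure above. Different tools interpolate percentiles slightly differently; this sketch assumes the "inclusive" method of statistics.quantiles.

```python
from statistics import quantiles

# Hypothetical salaries in LPA, invented for illustration
salaries_lpa = [4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 30]

# n=10 splits the data into ten equal parts and returns nine cut points;
# the last cut point is the 90th percentile
deciles = quantiles(salaries_lpa, n=10, method="inclusive")
p90 = deciles[-1]

# Share of salaries at or below the 90th percentile
share_at_or_below = sum(s <= p90 for s in salaries_lpa) / len(salaries_lpa)

print(p90)  # 20.0
print(round(share_at_or_below, 2))
```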

8️⃣ Interquartile Range (IQR)

Formula:
IQR = Q3 − Q1

Why it matters:

Identifies outliers
Used in box plots
Robust against extreme values

📌 Real Scenario:
Detecting abnormal insurance claims or fraud transactions.
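
One common convention for flagging outliers with the IQR is to treat anything more than 1.5 × IQR below Q1 or above Q3 as unusual. The post does not name a specific rule, so take the 1.5 multiplier as an assumption. The claim amounts below are hypothetical.

```python
from statistics import quantiles

# Hypothetical insurance claim amounts in rupees, invented for illustration
claims = [1200, 1500, 1800, 2000, 2200, 2500, 2700, 3000, 25000]

# n=4 returns the three quartile cut points: Q1, Q2 (median), Q3
q1, q2, q3 = quantiles(claims, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the quartiles are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [c for c in claims if c < lower_fence or c > upper_fence]

print(q1, q3, iqr)  # 1800.0 2700.0 900.0
print(outliers)     # [25000]
```

Because Q1 and Q3 ignore the extremes, the ₹25,000 claim does not distort the fences that end up flagging it.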

🧠 Summary Statistics in Python (Hands-On)

Pandas is one of the most widely used Python libraries for descriptive statistics.

🔹 Sample Dataset

import pandas as pd

data = {
    "Sales": [12000, 15000, 10000, 18000, 16000],
    "Profit": [2000, 3000, 1500, 4000, 3500]
}

df = pd.DataFrame(data)

🔹 Using Pandas `.describe()`

df.describe()

.describe() instantly provides:

Count
Mean
Standard Deviation
Minimum & Maximum
25%, 50% (Median), 75% percentiles

🟢 Used in almost every real-world data analysis project
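
A self-contained sketch that runs .describe() on the sample dataset and pulls individual numbers out of the result. The summary comes back as a DataFrame, so rows can be looked up by label. The variable name summary is illustrative.

```python
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    "Sales": [12000, 15000, 10000, 18000, 16000],
    "Profit": [2000, 3000, 1500, 4000, 3500],
})

summary = df.describe()

# Row labels of the summary table
print(list(summary.index))
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# Individual values are looked up by row label and column name
print(summary.loc["mean", "Sales"])  # 14200.0
print(summary.loc["50%", "Sales"])   # 15000.0 (the median)
```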

🔹 Individual Statistics in Pandas

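df.mean()          # average of each column
df.median()        # middle value of each column
df.std()           # sample standard deviation (divides by n - 1 by default)
df.var()           # sample variance (divides by n - 1 by default)
df.quantile(0.75)  # 75th percentile (Q3) of each column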
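
Two details are worth a quick sketch: pandas computes the sample variance by default (ddof=1), and the IQR can be built from two quantile() calls. The example uses the Sales column from the sample dataset, and the variable names are illustrative.

```python
import math
import pandas as pd

sales = pd.Series([12000, 15000, 10000, 18000, 16000], name="Sales")

# Default is the sample variance (divides by n - 1)
sample_var = sales.var()
# ddof=0 gives the population variance (divides by n)
population_var = sales.var(ddof=0)

# IQR = Q3 - Q1
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)
iqr = q3 - q1

print(math.isclose(sample_var, 10_200_000))     # True
print(math.isclose(population_var, 8_160_000))  # True
print(iqr)                                      # 4000.0
```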

🏢 Real-World Applications

📌 Business

Average revenue per customer
Monthly sales analysis
Customer behavior tracking

📌 Finance

Stock volatility measurement
Risk evaluation
Portfolio performance

📌 Healthcare

Patient recovery analysis
Hospital stay durations
Disease statistics

🚀 Key Takeaways

Descriptive statistics summarize data
Mean, median, mode explain central tendency
Standard deviation explains variability
Percentiles & IQR help detect outliers
Pandas .describe() is essential for EDA

🔜 What’s Next?

Part 4: Data Visualization for Statistics

Histograms
Box plots
Bar charts
Python visualizations

Statistics isn’t hard — it’s just misunderstood. Keep learning! 🚀

❓ Frequently Asked Questions (FAQ)

What is descriptive statistics?

Descriptive statistics is a branch of statistics that summarizes and describes the main characteristics of a dataset using measures like mean, median, mode, variance, and standard deviation.

Why is descriptive statistics important in data science?

Descriptive statistics helps data scientists understand data distribution, identify patterns, detect outliers, and prepare datasets for further analysis and machine learning models.

What is the difference between mean and median?

The mean is the average of all values, while the median is the middle value when data is sorted. Median is more reliable when the dataset contains outliers.

When should I use standard deviation?

Standard deviation is used to measure how spread out values are from the mean. It is commonly used in finance, business analytics, and risk assessment.

What is Pandas describe() used for?

The describe() function in Pandas provides a quick summary of key descriptive statistics including count, mean, standard deviation, minimum, maximum, and percentiles.

Is descriptive statistics enough for data analysis?

Descriptive statistics is the first step in data analysis. For predictions and conclusions about future data, inferential statistics and machine learning techniques are required.

Labels: Descriptive Statistics in SAS, Statistics, statistics examples, Statistics for Data Science

📊 Types of Data & Data Collection Methods in Data Science (Part 2)

Understanding data is the first and most important step in data science.

Before analysis, modeling, or machine learning, a data scientist must know what type of data they are working with and how it was collected.

In Part 2 of our Statistics for Data Science series, you’ll learn:

Different types of data used in data science
How data is collected in real-world projects
Sampling methods and their importance
Hands-on examples to classify datasets

🎯 Goal of This Post

Understand your data before analyzing it.

Incorrect data understanding leads to:

Wrong statistical methods
Poor model performance
Misleading insights

📌 Types of Data in Data Science

Data can be classified in multiple ways depending on its nature and usage.

🔹 Qualitative vs Quantitative Data

📘 Qualitative Data (Categorical Data)

Qualitative data describes qualities or characteristics and is non-numeric.

Examples:

Gender (Male/Female)
Product category
Customer feedback (Good, Bad, Average)
City names

📌 Used for:

Classification
Sentiment analysis
Grouping and segmentation

📗 Quantitative Data (Numerical Data)

Quantitative data represents numbers and measurable values.

Examples:

Age
Salary
Temperature
Number of purchases

📌 Used for:

Statistical calculations
Regression models
Forecasting

🔹 Discrete vs Continuous Data

📘 Discrete Data

Discrete data consists of countable values.

Examples:

Number of customers
Number of defects
Number of website visits

📌 Values are whole numbers.

📗 Continuous Data

Continuous data can take any value within a range.

Examples:

Height
Weight
Time
Temperature

📌 Can have decimal values.

🔹 Structured vs Unstructured Data

📘 Structured Data

Structured data is organized in rows and columns.

Examples:

Excel files
SQL tables
CSV datasets

📌 Easy to analyze using SQL, Excel, Python, or BI tools.

📗 Unstructured Data

Unstructured data has no predefined format.

Examples:

Text documents
Emails
Images
Videos
Social media posts

📌 Requires advanced processing (NLP, Computer Vision).

📌 Data Collection Methods in Data Science

Understanding how data is collected helps assess data quality and bias.

🔹 Common Data Collection Techniques

1️⃣ Surveys & Questionnaires

Online forms
Feedback surveys
Market research

📌 Risk: Response bias

2️⃣ Observational Data

Website click tracking
User behavior logs
Sensor data

📌 Often real-time and free of self-reporting bias, but can still carry selection bias

3️⃣ Experiments (A/B Testing)

Marketing experiments
Product feature testing

📌 Controlled and reliable

4️⃣ Transactional Data

Sales records
Banking transactions
E-commerce logs

📌 Highly structured and reliable

5️⃣ Third-Party Data

Government datasets
APIs
External vendors

📌 Verify credibility and freshness

📌 Sampling Methods in Statistics

Sampling allows us to study a subset of data instead of the entire population.

🔹 Types of Sampling Methods

📘 Random Sampling

Every unit has an equal chance of being selected
Reduces bias
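
A minimal sketch of simple random sampling with the standard library. The population of customer IDs and the sample size are hypothetical, and the fixed seed is only there to make the run repeatable.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: customer IDs 1 to 100
population = list(range(1, 101))

# Every ID has the same chance of being picked; no ID is picked twice
random_sample = rng.sample(population, k=10)

print(len(random_sample))  # 10
print(random_sample)
```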

📘 Stratified Sampling

Population divided into groups (strata)
Sample taken from each group

📌 Used in surveys and finance
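
One way to sketch stratified sampling is to group records by stratum and then draw the same fraction from each group. The customer records, the Urban/Rural split, and the 10% fraction below are all hypothetical.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical customers: 60 Urban and 40 Rural
customers = [(f"C{i:03d}", "Urban" if i < 60 else "Rural") for i in range(100)]

# Step 1: divide the population into groups (strata)
by_stratum = defaultdict(list)
for customer_id, region in customers:
    by_stratum[region].append(customer_id)

# Step 2: take the same fraction from each group
fraction = 0.1
stratified_sample = {
    region: rng.sample(members, round(len(members) * fraction))
    for region, members in by_stratum.items()
}

for region, picked in stratified_sample.items():
    print(region, len(picked))
# Urban 6
# Rural 4
```

The sample keeps the 60/40 split of the population, which a small simple random sample would not guarantee.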

📘 Systematic Sampling

Every nth observation selected

📌 Simple and efficient
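
Systematic sampling can be sketched with list slicing: pick a random starting point, then take every nth item. The population and the step of 10 are hypothetical.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical population: customer IDs 1 to 100
population = list(range(1, 101))

step = 10                     # select every 10th observation
start = rng.randrange(step)   # random starting position within the first 10
systematic_sample = population[start::step]

print(len(systematic_sample))  # 10
print(systematic_sample)
```

If the data has a repeating pattern with the same period as the step (for example, weekly cycles and a step of 7), this method can pick up that pattern, so it is worth checking the ordering first.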

📘 Convenience Sampling

Easily available data

📌 Risk: High bias

📌 Why Sampling Matters in Data Science

Saves time and cost
Makes large datasets manageable
Enables faster experimentation
Supports inferential statistics

🧪 Hands-On: Classify Sample Datasets

Let’s classify real-world datasets.

Dataset	Qualitative / Quantitative	Discrete / Continuous	Structured / Unstructured
Customer Gender	Qualitative	N/A	Structured
Monthly Salary	Quantitative	Continuous	Structured
Product Reviews	Qualitative	N/A	Unstructured
Number of Orders	Quantitative	Discrete	Structured
Website Session Time	Quantitative	Continuous	Structured
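
As a rough first pass, column dtypes in pandas can hint at the classification. The sketch below is only a heuristic with hypothetical column names and values: numeric codes such as PIN codes are stored as numbers but are really qualitative, so the result always needs a human check.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype

# Hypothetical dataset, invented for illustration
df = pd.DataFrame({
    "customer_gender": ["F", "M", "F"],
    "monthly_salary": [42000.50, 55000.00, 61000.75],
    "number_of_orders": [3, 7, 2],
})

def classify(column: pd.Series) -> str:
    """Guess the data type of a column from its dtype (heuristic only)."""
    if not is_numeric_dtype(column):
        return "Qualitative"
    if is_integer_dtype(column):
        return "Quantitative, Discrete"
    return "Quantitative, Continuous"

labels = {name: classify(df[name]) for name in df.columns}

for name, label in labels.items():
    print(f"{name}: {label}")
# customer_gender: Qualitative
# monthly_salary: Quantitative, Continuous
# number_of_orders: Quantitative, Discrete
```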

🧠 Key Takeaways

✔ Always identify data type before analysis
✔ Choose statistical methods based on data nature
✔ Understand data collection to avoid bias
✔ Sampling impacts accuracy and conclusions

🔗 What’s Next in This Series?

👉 Part 3: Descriptive Statistics – Mean, Median, Mode & Variability

Labels: Data Collection Methods, Qualitative vs Quantitative Data, Sampling Methods, Statistics, Statistics for Data Science, Types of Data in Data Science
