Wednesday, December 17, 2025

📊 Descriptive Statistics Explained Simply | Mean, Median & Std Dev

Part 3: Statistics for Data Science Series

Goal: Learn how to summarize data effectively before analysis

When working with data, the first and most important question is:

👉 “What does this data look like?”

This is where Descriptive Statistics comes in.

Descriptive statistics help us summarize, understand, and interpret data using simple numerical measures. Before any machine learning, prediction, or dashboarding, descriptive stats are your foundation.

“Descriptive Statistics Explained Simply” on datahark.in


📌 What is Descriptive Statistics?

Descriptive statistics are techniques used to summarize and describe the main features of a dataset.

They help answer questions like:

  • What is the average value?
  • How spread out is the data?
  • Are there extreme values?
  • Where does most of the data lie?

👉 Unlike inferential statistics, descriptive statistics do not make predictions or generalize beyond the data; they describe what is already there.


📈 Measures of Central Tendency (Finding the “Center”)

1️⃣ Mean (Average)

Definition:
The sum of all values divided by the total number of values.

Formula:
Mean = (Sum of values) / (Number of values)

📌 Real-Life Example:
Daily sales of an e-commerce store over 5 days:
₹10k, ₹12k, ₹11k, ₹13k, ₹14k

Mean = (10 + 12 + 11 + 13 + 14) / 5 = ₹12k

🟢 Best used when data has no extreme outliers
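The formula above maps directly onto code. Here is a minimal sketch using Python's standard-library `statistics` module and the daily sales figures from the example, expressed in thousands of rupees; the variable names are illustrative.

```python
from statistics import mean

# Daily sales from the example above, in thousands of rupees
daily_sales_k = [10, 12, 11, 13, 14]

# Mean = (sum of values) / (number of values)
manual_mean = sum(daily_sales_k) / len(daily_sales_k)

# The library function gives the same result
library_mean = mean(daily_sales_k)

print("Manual mean: ", manual_mean)
print("Library mean:", library_mean)
```

Both approaches give 12, matching the ₹12k worked out by hand.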


2️⃣ Median (Middle Value)

Definition:
The middle value when data is sorted.

📌 Example:
Monthly salaries in a startup:
₹25k, ₹30k, ₹35k, ₹40k, ₹5,00,000

Median = ₹35k (the mean, ₹1,26,000, is pulled far above every typical salary by the single outlier)

🟢 Best used when data contains outliers
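To see why the median is preferred here, it helps to compute both measures with and without the extreme salary. This is a small sketch using the standard-library `statistics` module on the salary figures from the example; the variable names are illustrative.

```python
from statistics import mean, median

# Monthly salaries from the example, in rupees (the last value is the outlier)
salaries = [25_000, 30_000, 35_000, 40_000, 500_000]

mean_all = mean(salaries)
median_all = median(salaries)

# Drop the outlier and recompute both measures
typical = salaries[:-1]
mean_typical = mean(typical)
median_typical = median(typical)

print("With outlier:    mean =", mean_all, " median =", median_all)
print("Without outlier: mean =", mean_typical, " median =", median_typical)
```

Removing one value moves the mean from 126000 to 32500, while the median only shifts from 35000 to 32500.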


3️⃣ Mode (Most Frequent Value)

Definition:
The value that appears most often.

📌 Example:
Most sold product sizes:
M, M, L, S, M, L

Mode = M

🟢 Useful for categorical data
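Because the mode only counts occurrences, it works on text labels as well as numbers. The sketch below uses the standard library on the size data from the example; the second list, with a tie, is a made-up illustration.

```python
from collections import Counter
from statistics import mode, multimode

# Product sizes from the example above
sizes = ["M", "M", "L", "S", "M", "L"]

most_common_size = mode(sizes)              # single most frequent value
size_counts = Counter(sizes).most_common()  # full frequency table, highest first

# Hypothetical data with a tie: multimode returns every value tied for first place
tied_modes = multimode(["S", "S", "L", "L", "M"])

print("Mode:", most_common_size)
print("Counts:", size_counts)
print("Tied modes:", tied_modes)
```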


📊 Measures of Dispersion (Understanding Spread)

4️⃣ Range

Definition:
Difference between the maximum and minimum values.

📌 Example:
City temperature range:
Min = 20°C, Max = 45°C

Range = 45°C − 20°C = 25°C

⚠️ Highly sensitive to outliers


5️⃣ Variance

Definition:
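Average of squared differences from the mean. Dividing by n gives the population variance; dividing by n − 1 gives the sample variance, which is what Pandas .var() computes by default.

  • Low variance → values are close together
  • High variance → values are spread out

🟢 Widely used in finance and risk analysis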
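The definition can be followed step by step in code. This sketch computes the variance by hand for a small illustrative list (the sales figures from the mean example, in ₹ thousands) and compares it with the standard-library functions; note that two versions exist, one dividing by n and one by n − 1.

```python
from statistics import pvariance, variance

values = [10, 12, 11, 13, 14]  # illustrative data, in thousands of rupees
n = len(values)
mean_value = sum(values) / n

# Squared difference of each value from the mean
squared_diffs = [(x - mean_value) ** 2 for x in values]

population_var = sum(squared_diffs) / n      # divide by n
sample_var = sum(squared_diffs) / (n - 1)    # divide by n - 1

print("Population variance:", population_var, "| library:", pvariance(values))
print("Sample variance:    ", sample_var, "| library:", variance(values))
```

The sample version is slightly larger because it corrects for estimating the mean from the same data.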


6️⃣ Standard Deviation

Definition:
Square root of variance.

📌 Example:
Stocks with high standard deviation are more volatile.

🟢 Most widely used dispersion metric
🟢 Same unit as original data
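As a rough illustration of the volatility example, the sketch below compares two made-up series of daily returns that share the same mean but differ in spread. The numbers are hypothetical and chosen only to make the contrast obvious.

```python
from statistics import mean, stdev, variance

# Hypothetical daily returns in percent (made-up numbers)
stable_stock = [0.2, -0.1, 0.1, 0.0, -0.2]
volatile_stock = [2.0, -1.0, 1.0, 0.0, -2.0]

stable_sd = stdev(stable_stock)      # sample standard deviation
volatile_sd = stdev(volatile_stock)

print("Stable:   mean =", round(mean(stable_stock), 3), " std dev =", round(stable_sd, 3))
print("Volatile: mean =", round(mean(volatile_stock), 3), " std dev =", round(volatile_sd, 3))

# Standard deviation is the square root of variance
print("Check:", round(volatile_sd ** 2, 6), "vs variance", round(variance(volatile_stock), 6))
```

Both series average out to zero, so the mean alone cannot tell them apart; the standard deviation can.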


📏 Percentiles & Interquartile Range (IQR)

7️⃣ Percentiles

Definition:
A percentile shows the value below which a certain percentage of data falls.

📌 Example:
90th percentile salary = ₹20 LPA
If you earn ₹20 LPA, you earn more than about 90% of employees.
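A percentile can be read in two directions: from a percentage to a value, or from a value to a percentage. The sketch below shows both with NumPy on a hypothetical salary list; the data are made up, and `np.percentile` uses linear interpolation by default, so other tools may give slightly different answers on small samples.

```python
import numpy as np

# Hypothetical annual salaries in lakhs per annum (made-up numbers)
salaries_lpa = np.array([4, 5, 6, 6, 7, 8, 9, 10, 12, 15, 20])

# From percentage to value: the salary below which about 90% of the data falls
p90 = np.percentile(salaries_lpa, 90)
p50 = np.percentile(salaries_lpa, 50)  # the median

# From value to percentage: share of salaries strictly below a given one
my_salary = 12
percentile_rank = (salaries_lpa < my_salary).mean() * 100

print("90th percentile:", p90)
print("50th percentile:", p50)
print(f"A salary of {my_salary} LPA is above {percentile_rank:.1f}% of the sample")
```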


8️⃣ Interquartile Range (IQR)

Formula:
IQR = Q3 − Q1 (Q1 = 25th percentile, Q3 = 75th percentile)

Why it matters:

  • Identifies outliers
  • Used in box plots
  • Robust against extreme values

📌 Real Scenario:
Detecting abnormal insurance claims or fraud transactions.
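One common way to turn the IQR into an outlier check is the 1.5 × IQR rule used for box-plot whiskers. The post does not name a specific rule, so treat this as one reasonable convention rather than the only option; the claim amounts are made up.

```python
import pandas as pd

# Hypothetical insurance claim amounts in rupees (made-up numbers)
claims = pd.Series([12_000, 15_000, 14_000, 13_500, 16_000, 15_500, 14_500, 250_000])

q1 = claims.quantile(0.25)
q3 = claims.quantile(0.75)
iqr = q3 - q1

# Assumed convention: flag anything more than 1.5 * IQR outside the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = claims[(claims < lower_fence) | (claims > upper_fence)]

print("Q1:", q1, " Q3:", q3, " IQR:", iqr)
print("Fences:", lower_fence, "to", upper_fence)
print("Flagged claims:", outliers.tolist())
```

Because the quartiles barely move when one extreme value is added, the fences stay tight around the typical claims and the 250,000 claim stands out.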


🧠 Summary Statistics in Python (Hands-On)

Pandas is one of the most widely used Python libraries for descriptive statistics.

🔹 Sample Dataset

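import pandas as pd

# Small illustrative dataset: five observations of sales and profit
data = {
    "Sales": [12000, 15000, 10000, 18000, 16000],
    "Profit": [2000, 3000, 1500, 4000, 3500]
}

# Each dictionary key becomes a column
df = pd.DataFrame(data)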

🔹 Using Pandas .describe()

df.describe()

.describe() instantly provides:

  • Count
  • Mean
  • Standard Deviation
  • Minimum & Maximum
  • 25%, 50% (Median), 75% percentiles

🟢 Used in almost every real-world data analysis project


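🔹 Individual Statistics in Pandas

df.mean()          # column-wise mean
df.median()        # column-wise median (50th percentile)
df.std()           # sample standard deviation (ddof=1 by default)
df.var()           # sample variance (ddof=1 by default)
df.quantile(0.75)  # 75th percentile (Q3) of each column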

🏒 Real-World Applications

πŸ“Œ Business

  • Average revenue per customer
  • Monthly sales analysis
  • Customer behavior tracking

πŸ“Œ Finance

  • Stock volatility measurement
  • Risk evaluation
  • Portfolio performance

πŸ“Œ Healthcare

  • Patient recovery analysis
  • Hospital stay durations
  • Disease statistics

🚀 Key Takeaways

  • Descriptive statistics summarize data
  • Mean, median, and mode describe central tendency
  • Variance and standard deviation measure variability
  • Percentiles & IQR are robust to outliers and help detect them
  • Pandas .describe() is essential for EDA

🔜 What’s Next?

Part 4: Data Visualization for Statistics

  • Histograms
  • Box plots
  • Bar charts
  • Python visualizations

Statistics isn’t hard; it’s just misunderstood. Keep learning! 🚀

❓ Frequently Asked Questions (FAQ)

What is descriptive statistics?

Descriptive statistics is a branch of statistics that summarizes and describes the main characteristics of a dataset using measures like mean, median, mode, variance, and standard deviation.

Why is descriptive statistics important in data science?

Descriptive statistics helps data scientists understand data distribution, identify patterns, detect outliers, and prepare datasets for further analysis and machine learning models.

What is the difference between mean and median?

The mean is the average of all values, while the median is the middle value when data is sorted. Median is more reliable when the dataset contains outliers.

When should I use standard deviation?

Standard deviation is used to measure how spread out values are from the mean. It is commonly used in finance, business analytics, and risk assessment.

What is Pandas describe() used for?

The describe() method in Pandas provides a quick summary of key descriptive statistics for each numeric column, including count, mean, standard deviation, minimum, maximum, and percentiles.

Is descriptive statistics enough for data analysis?

Descriptive statistics is the first step in data analysis. For predictions and conclusions about future data, inferential statistics and machine learning techniques are required.


📊 Types of Data & Data Collection Methods in Data Science (Part 2)

Understanding data is the first and most important step in data science.

Before analysis, modeling, or machine learning, a data scientist must know what type of data they are working with and how it was collected.

In Part 2 of our Statistics for Data Science series, you’ll learn:

  • Different types of data used in data science
  • How data is collected in real-world projects
  • Sampling methods and their importance
  • Hands-on examples to classify datasets



🎯 Goal of This Post

Understand your data before analyzing it.

Incorrect data understanding leads to:

  • Wrong statistical methods
  • Poor model performance
  • Misleading insights


📌 Types of Data in Data Science

Data can be classified in multiple ways depending on its nature and usage.


🔹 Qualitative vs Quantitative Data

📘 Qualitative Data (Categorical Data)

Qualitative data describes qualities or characteristics and is non-numeric.

Examples:

  • Gender (Male/Female)
  • Product category
  • Customer feedback (Good, Bad, Average)
  • City names

📌 Used for:

  • Classification
  • Sentiment analysis
  • Grouping and segmentation


📗 Quantitative Data (Numerical Data)

Quantitative data represents numbers and measurable values.

Examples:

  • Age
  • Salary
  • Temperature
  • Number of purchases

📌 Used for:

  • Statistical calculations
  • Regression models
  • Forecasting


🔹 Discrete vs Continuous Data

📘 Discrete Data

Discrete data consists of countable values.

Examples:

  • Number of customers
  • Number of defects
  • Number of website visits

📌 Values are typically whole numbers (counts).


📗 Continuous Data

Continuous data can take any value within a range.

Examples:

  • Height
  • Weight
  • Time
  • Temperature

📌 Can have decimal values.


🔹 Structured vs Unstructured Data

📘 Structured Data

Structured data is organized in rows and columns.

Examples:

  • Excel files
  • SQL tables
  • CSV datasets

📌 Easy to analyze using SQL, Excel, Python, or BI tools.


📗 Unstructured Data

Unstructured data has no predefined format.

Examples:

  • Text documents
  • Emails
  • Images
  • Videos
  • Social media posts

📌 Requires advanced processing (NLP, Computer Vision).


📌 Data Collection Methods in Data Science

Understanding how data is collected helps assess data quality and bias.


🔹 Common Data Collection Techniques

1️⃣ Surveys & Questionnaires

  • Online forms
  • Feedback surveys
  • Market research

📌 Risk: Response bias


2️⃣ Observational Data

  • Website click tracking
  • User behavior logs
  • Sensor data

📌 Often real-time and free of self-reporting bias, but still prone to selection bias and confounding


3️⃣ Experiments (A/B Testing)

  • Marketing experiments
  • Product feature testing

📌 Controlled and, with proper randomization, reliable


4️⃣ Transactional Data

  • Sales records
  • Banking transactions
  • E-commerce logs

📌 Highly structured and reliable


5️⃣ Third-Party Data

  • Government datasets
  • APIs
  • External vendors

📌 Verify credibility and freshness


📌 Sampling Methods in Statistics

Sampling allows us to study a subset of data instead of the entire population.


🔹 Types of Sampling Methods

📘 Random Sampling

  • Every unit has an equal chance of being selected
  • Reduces selection bias


📘 Stratified Sampling

  • Population divided into groups (strata)
  • Sample taken from each group

📌 Used in surveys and finance


📘 Systematic Sampling

  • Every nth observation is selected

📌 Simple and efficient


📘 Convenience Sampling

  • Uses whatever data is easiest to obtain

📌 Risk: High bias
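The first three sampling methods above can be sketched in a few lines of Pandas. The customer table, its column names, and the segment sizes are hypothetical; `groupby(...).sample(...)` requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical customer table: 100 rows, three segments of unequal size
customers = pd.DataFrame({
    "customer_id": range(1, 101),
    "segment": ["retail"] * 60 + ["business"] * 30 + ["enterprise"] * 10,
})

# Random sampling: every row has the same chance of being selected
random_sample = customers.sample(n=10, random_state=42)

# Stratified sampling: draw 10% from each segment so every group is represented
stratified_sample = customers.groupby("segment").sample(frac=0.1, random_state=42)

# Systematic sampling: every 10th row, starting from the first
systematic_sample = customers.iloc[::10]

print(random_sample["segment"].value_counts())
print(stratified_sample["segment"].value_counts())
print(systematic_sample["customer_id"].tolist())
```

With only ten rows, a purely random sample can miss the small enterprise segment entirely, which is the situation stratified sampling is designed to avoid.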


📌 Why Sampling Matters in Data Science

  • Saves time and cost
  • Makes large datasets manageable
  • Enables faster experimentation
  • Supports inferential statistics


🧪 Hands-On: Classify Sample Datasets

Let’s classify real-world datasets.

| Dataset | Qualitative / Quantitative | Discrete / Continuous | Structured / Unstructured |
|---|---|---|---|
| Customer Gender | Qualitative | Discrete | Structured |
| Monthly Salary | Quantitative | Continuous | Structured |
| Product Reviews | Qualitative | N/A | Unstructured |
| Number of Orders | Quantitative | Discrete | Structured |
| Website Session Time | Quantitative | Continuous | Structured |
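For structured data, a first-pass classification like the one above can be approximated from column dtypes. The sketch below is a rough heuristic, not a rule: the DataFrame and the `classify` helper are hypothetical, and integer-coded categories such as PIN codes or product IDs would be mislabeled as quantitative.

```python
import pandas as pd

# Hypothetical structured dataset (made-up values)
df = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "monthly_salary": [42000.50, 55000.00, 61000.75],
    "num_orders": [3, 7, 1],
    "session_minutes": [4.2, 11.8, 0.9],
})

def classify(series: pd.Series) -> str:
    """Guess the data type from the dtype alone (heuristic only)."""
    if pd.api.types.is_integer_dtype(series):
        return "quantitative / discrete"
    if pd.api.types.is_float_dtype(series):
        return "quantitative / continuous"
    return "qualitative"

labels = {column: classify(df[column]) for column in df.columns}

for column, label in labels.items():
    print(f"{column:16s} -> {label}")
```

The final call still needs domain knowledge; the dtype only narrows down the options.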

🧠 Key Takeaways

✔ Always identify data type before analysis
✔ Choose statistical methods based on data nature
✔ Understand data collection to avoid bias
✔ Sampling impacts accuracy and conclusions


🔗 What’s Next in This Series?

👉 Part 3: Descriptive Statistics – Mean, Median, Mode & Variability


Tuesday, December 16, 2025

📊 Introduction to Statistics in Data Science: A Complete Beginner’s Guide

Statistics is the backbone of Data Science.

From understanding raw data to building predictive models, statistics helps data scientists make sense of uncertainty, patterns, and trends hidden inside data.

In this detailed guide, you’ll learn:

  • What statistics is (in simple terms)
  • Why statistics is critical for data science
  • Key statistical concepts every data scientist must know
  • Difference between statistics, mathematics, and machine learning
  • Real-world applications of statistics in data science



📌 What is Statistics?

Statistics is the science of collecting, analyzing, interpreting, and presenting data.

In simple words:

Statistics helps us turn raw data into meaningful insights.

Example:

If a company collects sales data from 10,000 customers, statistics helps answer:

  • What is the average purchase value?
  • Which product sells the most?
  • Are sales increasing or decreasing over time?

Without statistics, data is just numbers with no meaning.


📌 Why Statistics is the Backbone of Data Science

Data Science is not just about coding or machine learning.
At its core, it is about decision-making using data, and statistics provides the foundation for that.

Why statistics is essential in data science:

1️⃣ Understanding Data

Before applying any machine learning algorithm, a data scientist must:

  • Understand data distribution
  • Detect outliers
  • Identify missing values
  • Summarize data using statistical measures

2️⃣ Making Inferences from Data

Statistics helps answer questions like:

  • Is this result significant or just random?
  • Can we generalize sample results to the population?
  • How confident are we in our predictions?

3️⃣ Model Evaluation

Statistical concepts are used to:

  • Measure model accuracy
  • Compare multiple models
  • Validate assumptions
  • Avoid overfitting

4️⃣ Decision Making Under Uncertainty

Real-world data is noisy and imperfect.
Statistics allows data scientists to quantify uncertainty and make informed decisions.


📌 Key Statistical Concepts Used in Data Science

🔹 1. Descriptive Statistics

Descriptive statistics summarize and describe data.

Common measures include:

  • Mean (Average)
  • Median
  • Mode
  • Variance
  • Standard Deviation
  • Percentiles

📊 Example:
Average salary of employees, highest score in an exam, monthly revenue summary.


🔹 2. Probability

Probability measures the likelihood of an event occurring.

Why probability matters in data science:

  • Used in predictive modeling
  • Foundation of machine learning algorithms
  • Helps estimate risk and uncertainty

📌 Example:

What is the probability that a customer will churn next month?
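The simplest estimate of such a probability is the observed frequency in past data. The sketch below uses a made-up list of customer records; a real churn model would use many more features, so this only illustrates the basic ratio and how conditioning on a group changes it.

```python
# Hypothetical records: (plan, churned), where churned is 1 if the customer left
records = [
    ("monthly", 1), ("monthly", 0), ("monthly", 1), ("monthly", 0),
    ("annual", 0), ("annual", 0), ("annual", 0), ("annual", 1),
    ("annual", 0), ("annual", 0),
]

# Empirical probability = number of churned customers / total customers
p_churn = sum(churned for _, churned in records) / len(records)

# Conditional probability: restrict to one plan, then take the same ratio
monthly = [churned for plan, churned in records if plan == "monthly"]
p_churn_given_monthly = sum(monthly) / len(monthly)

print("P(churn) =", p_churn)
print("P(churn | monthly plan) =", p_churn_given_monthly)
```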


🔹 3. Inferential Statistics

Inferential statistics allows us to draw conclusions about a population using a sample.

Key techniques:

  • Confidence Intervals
  • Hypothesis Testing
  • Statistical Significance

📌 Example:

Can we conclude that a new marketing strategy increased sales?
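One assumption-light way to approach this question is a permutation test, sketched below with the standard library only. The post does not prescribe a particular test, so this is just one possible technique, and the sales figures are made up.

```python
import random

# Hypothetical daily sales (in thousands of rupees) before and after the new strategy
before = [100, 102, 98, 101, 99, 103, 97, 100]
after = [108, 110, 107, 111, 109, 112, 106, 110]

def average(values):
    return sum(values) / len(values)

observed_diff = average(after) - average(before)

# If the strategy had no effect, the group labels are interchangeable.
# Shuffle the pooled data many times and count how often chance alone
# produces a difference at least as large as the observed one.
rng = random.Random(0)  # fixed seed so the run is repeatable
pooled = before + after
n_after = len(after)
n_permutations = 10_000
at_least_as_large = 0
for _ in range(n_permutations):
    rng.shuffle(pooled)
    diff = average(pooled[:n_after]) - average(pooled[n_after:])
    if diff >= observed_diff:
        at_least_as_large += 1

# One-sided p-value; the +1 keeps the estimate from being exactly zero
p_value = (at_least_as_large + 1) / (n_permutations + 1)

print("Observed difference in means:", observed_diff)
print("Approximate p-value:", p_value)
```

A small p-value says the difference is unlikely to be a fluke of which days landed in which group. It does not rule out other explanations, such as seasonality, unless the comparison was a controlled experiment.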


🔹 4. Data Distributions

Understanding how data is distributed is crucial.

Common distributions:

  • Normal Distribution
  • Binomial Distribution
  • Poisson Distribution

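📊 Many classical statistical methods (and some ML models, such as Gaussian Naive Bayes) assume that the data or the model errors are approximately normally distributed.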
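To get a feel for the three distributions listed above, it can help to simulate them and compare their means and variances. The sketch below uses NumPy's random generator; the scenarios and parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Normal: continuous and symmetric around the mean (e.g. heights in cm)
heights_cm = rng.normal(loc=170, scale=10, size=10_000)

# Binomial: number of successes in n independent yes/no trials
# (e.g. conversions out of 100 visitors at a 5% conversion rate)
conversions = rng.binomial(n=100, p=0.05, size=10_000)

# Poisson: count of events in a fixed interval at a constant average rate
# (e.g. support tickets per hour, averaging 3)
tickets_per_hour = rng.poisson(lam=3, size=10_000)

samples = [("normal", heights_cm), ("binomial", conversions), ("poisson", tickets_per_hour)]
for name, sample in samples:
    print(f"{name:9s} mean={sample.mean():.2f}  variance={sample.var():.2f}")
```

For the Poisson sample the mean and variance come out close to each other, which is a defining property of that distribution.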


🔹 5. Correlation and Regression

These techniques help understand relationships between variables.

  • Correlation: Measures the strength and direction of a linear relationship between two variables
  • Regression: Predicts one variable using others

📌 Example:

How does advertising spend affect sales?
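A minimal sketch of both ideas with NumPy, using made-up monthly figures: `np.corrcoef` gives the Pearson correlation, and `np.polyfit` with degree 1 fits a straight line by least squares. The data and the ₹40k scenario are hypothetical.

```python
import numpy as np

# Hypothetical monthly figures in thousands of rupees (made-up numbers)
ad_spend = np.array([10, 15, 20, 25, 30, 35])
sales = np.array([120, 150, 170, 210, 230, 270])

# Correlation: strength and direction of the linear relationship (-1 to 1)
r = np.corrcoef(ad_spend, sales)[0, 1]

# Regression: fit sales = slope * ad_spend + intercept
slope, intercept = np.polyfit(ad_spend, sales, deg=1)

# Use the fitted line to predict sales for a new spend level
predicted_sales = slope * 40 + intercept

print(f"Correlation: {r:.3f}")
print(f"Fitted line: sales = {slope:.2f} * ad_spend + {intercept:.2f}")
print(f"Predicted sales at ad spend of 40: {predicted_sales:.1f}")
```

A high correlation here describes how the two columns move together; on its own it does not show that the spend caused the sales.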


📌 Statistics vs Mathematics vs Machine Learning

| Aspect | Statistics | Mathematics | Machine Learning |
|---|---|---|---|
| Purpose | Analyze & interpret data | Abstract problem solving | Learn patterns from data |
| Focus | Uncertainty & inference | Theory & proofs | Prediction & automation |
| Data | Real-world data | Often theoretical | Large datasets |
| Output | Insights & decisions | Equations | Models & predictions |

👉 Statistics bridges mathematics and machine learning.


📌 Real-World Use Cases of Statistics in Data Science

🏦 1. Business & Marketing

  • Customer segmentation
  • A/B testing
  • Demand forecasting
  • Pricing optimization

💰 2. Finance

  • Risk analysis
  • Fraud detection
  • Portfolio optimization
  • Credit scoring

🏥 3. Healthcare

  • Clinical trials
  • Disease prediction
  • Treatment effectiveness analysis

🛒 4. E-Commerce

  • Recommendation systems
  • Conversion rate optimization
  • Customer churn analysis

📱 5. Technology & AI

  • Model evaluation
  • Feature selection
  • Performance metrics


📌 Simple Example Using Python

import numpy as np

data = [50, 60, 70, 80, 90]

mean = np.mean(data)
std_dev = np.std(data)  # population standard deviation (ddof=0); pass ddof=1 for the sample version

print("Mean:", mean)
print("Standard Deviation:", std_dev)

📌 This basic summary shows the center and spread of the data before modeling.


📌 Why Data Scientists Must Learn Statistics First

Many beginners jump directly into machine learning, but without statistics:

  • Models become black boxes
  • Results are misinterpreted
  • Decisions become risky

👉 Strong statistics = strong data scientist


📌 Final Thoughts

Statistics is not optional in data science: it is foundational.

Whether you’re analyzing customer data, building predictive models, or evaluating AI systems, statistics ensures your insights are accurate, reliable, and meaningful.

In the next post of this series, we’ll dive deeper into Types of Data and Data Collection Methods in Data Science.



What’s Next in This Series?

👉 Part 2: Types of Data & Data Collection Methods in Data Science

