Wednesday, December 17, 2025

📊 Descriptive Statistics Explained Simply | Mean, Median & Std Dev

Part 3: Statistics for Data Science Series

Goal: Learn how to summarize data effectively before analysis

When working with data, the first and most important question is:

👉 “What does this data look like?”

This is where Descriptive Statistics comes in.

Descriptive statistics help us summarize, understand, and interpret data using simple numerical measures. Before any machine learning, prediction, or dashboarding, descriptive statistics are your foundation.

📌 What is Descriptive Statistics?

Descriptive statistics are techniques used to summarize and describe the main features of a dataset.

They help answer questions like:

What is the average value?
How spread out is the data?
Are there extreme values?
Where does most of the data lie?

👉 Unlike inferential statistics, descriptive statistics do not make predictions; they describe the data you already have.

📈 Measures of Central Tendency (Finding the “Center”)

1️⃣ Mean (Average)

Definition:
The sum of all values divided by the total number of values.

Formula:
Mean = (Sum of values) / (Number of values)

📌 Real-Life Example:
Average daily sales of an e-commerce store over 5 days:
₹10k, ₹12k, ₹11k, ₹13k, ₹14k

Mean = ₹12k

🟢 Best used when data has no extreme outliers
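The mean formula above can be checked in a couple of lines. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative, and the numbers are the daily sales from the example (in thousands of rupees).

```python
from statistics import mean

# Daily sales in thousands of rupees, taken from the example above
daily_sales_k = [10, 12, 11, 13, 14]

# Mean = (sum of values) / (number of values)
average_sales_k = mean(daily_sales_k)
manual_average_k = sum(daily_sales_k) / len(daily_sales_k)

print(f"Mean daily sales: ₹{average_sales_k}k")
```

Both `average_sales_k` and `manual_average_k` come out to 12, matching the hand calculation.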

2️⃣ Median (Middle Value)

Definition:
The middle value when data is sorted.

📌 Example:
Monthly salaries in a startup:
₹25k, ₹30k, ₹35k, ₹40k, ₹5,00,000

Median = ₹35k

🟢 Best used when data contains outliers
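To see why the median is preferred here, it helps to compute both measures on the same salaries. The sketch below uses the standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# Monthly salaries from the startup example (in rupees)
salaries = [25_000, 30_000, 35_000, 40_000, 500_000]

salary_mean = mean(salaries)      # pulled upward by the single large salary
salary_median = median(salaries)  # middle value of the sorted list

print(f"Mean:   ₹{salary_mean:,}")
print(f"Median: ₹{salary_median:,}")
```

The mean works out to ₹126,000, which is higher than four of the five salaries, while the median stays at ₹35,000.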

3️⃣ Mode (Most Frequent Value)

Definition:
The value that appears most often.

📌 Example:
Most sold product sizes:
M, M, L, S, M, L

Mode = M

🟢 Useful for categorical data
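Because the mode only needs counts, it works on labels as well as numbers. A small sketch with `collections.Counter` from the standard library, using the sizes from the example (variable names are illustrative):

```python
from collections import Counter

# Product sizes sold, from the example above
sizes = ["M", "M", "L", "S", "M", "L"]

size_counts = Counter(sizes)

# most_common(1) returns a list with the single most frequent (value, count) pair
most_common_size, frequency = size_counts.most_common(1)[0]

print(f"Mode = {most_common_size} (sold {frequency} times)")
```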

📊 Measures of Dispersion (Understanding Spread)

4️⃣ Range

Definition:
Difference between the maximum and minimum values.

📌 Example:
City temperature range:
Min = 20°C, Max = 45°C

Range = 25°C

⚠️ Highly sensitive to outliers

5️⃣ Variance

Definition:
Average of squared differences from the mean.

Low variance → values are close together
High variance → values are spread out

🟢 Widely used in finance and risk analysis
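One detail worth knowing: "average of squared differences" can mean dividing by n (population variance) or by n − 1 (sample variance). The sketch below shows both with the standard `statistics` module, reusing the daily sales figures from the mean example as an assumption. Pandas uses the n − 1 version by default.

```python
from statistics import mean, pvariance, variance

# Daily sales in thousands of rupees (same data as the mean example)
daily_sales_k = [10, 12, 11, 13, 14]

center = mean(daily_sales_k)
squared_diffs = [(x - center) ** 2 for x in daily_sales_k]

population_var = pvariance(daily_sales_k)  # sum of squared diffs / n
sample_var = variance(daily_sales_k)       # sum of squared diffs / (n - 1)

print(f"Squared differences: {squared_diffs}")
print(f"Population variance: {population_var}")
print(f"Sample variance:     {sample_var}")
```

The squared differences add up to 10, so the population variance is 10 / 5 = 2 and the sample variance is 10 / 4 = 2.5.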

6️⃣ Standard Deviation

Definition:
Square root of variance.

📌 Example:
Stocks with high standard deviation are more volatile.

🟢 Most commonly reported dispersion metric
🟢 Same unit as original data
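The volatility idea can be illustrated with two made-up return series that have roughly the same average but very different spread. The numbers below are hypothetical, not real stock data, and the sketch uses the population standard deviation from the standard `statistics` module.

```python
from statistics import pstdev, pvariance

# Hypothetical daily returns in percent; both series average about 1%
stable_returns = [1.0, 1.2, 0.9, 1.1, 0.8]
volatile_returns = [5.0, -4.0, 6.5, -3.5, 1.0]

stable_sd = pstdev(stable_returns)      # population standard deviation
volatile_sd = pstdev(volatile_returns)

# Standard deviation is the square root of the variance
stable_sd_from_var = pvariance(stable_returns) ** 0.5

print(f"Stable stock:   std dev = {stable_sd:.2f}%")
print(f"Volatile stock: std dev = {volatile_sd:.2f}%")
```

Since the result is in percent, the same unit as the returns, the two numbers can be compared directly with the data.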

📐 Percentiles & Interquartile Range (IQR)

7️⃣ Percentiles

Definition:
A percentile shows the value below which a certain percentage of data falls.

📌 Example:
90th percentile salary = ₹20 LPA
If you earn ₹20 LPA, you earn more than roughly 90% of employees.
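Here is a minimal sketch of the percentile idea, assuming a hypothetical company of 100 employees whose salaries run evenly from 1 to 100 LPA. It uses `statistics.quantiles` from the standard library with the "inclusive" method, which interpolates between neighbouring values; other tools may use slightly different percentile definitions.

```python
from statistics import quantiles

# Hypothetical salaries in LPA for 100 employees: 1, 2, ..., 100
salaries_lpa = list(range(1, 101))

# n=100 returns the 99 cut points that split the data into percentiles
percentile_cuts = quantiles(salaries_lpa, n=100, method="inclusive")
p90 = percentile_cuts[89]  # index 89 holds the 90th percentile

# Share of employees earning less than the 90th percentile value
share_below = sum(s < p90 for s in salaries_lpa) / len(salaries_lpa)

print(f"90th percentile salary: {p90:.1f} LPA")
print(f"Share of employees below it: {share_below:.0%}")
```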

8️⃣ Interquartile Range (IQR)

Formula:
IQR = Q3 − Q1

Why it matters:

Identifies outliers
Used in box plots
Robust against extreme values

📌 Real Scenario:
Detecting abnormal insurance claims or fraud transactions.
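A common way to turn the IQR into an outlier check is the 1.5 × IQR rule: flag anything below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. That multiplier is a widely used convention rather than something defined in this post, and the claim amounts below are hypothetical.

```python
from statistics import quantiles

# Hypothetical insurance claim amounts (in rupees)
claims = [1200, 1500, 1100, 1300, 1400, 1600, 1250, 9000]

# n=4 returns the three quartile cut points: Q1, median, Q3
q1, _median, q3 = quantiles(claims, n=4, method="inclusive")
iqr = q3 - q1

# Fences from the conventional 1.5 * IQR rule (an assumption of this sketch)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [c for c in claims if c < lower_fence or c > upper_fence]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Flagged claims: {outliers}")
```

Only the 9000 claim falls outside the fences, so it is the one flagged for review.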

🧠 Summary Statistics in Python (Hands-On)

Pandas is one of the most widely used Python libraries for descriptive statistics.

🔹 Sample Dataset

import pandas as pd

data = {
    "Sales": [12000, 15000, 10000, 18000, 16000],
    "Profit": [2000, 3000, 1500, 4000, 3500]
}

df = pd.DataFrame(data)

🔹 Using Pandas `.describe()`

df.describe()

.describe() instantly provides:

Count
Mean
Standard Deviation
Minimum & Maximum
25%, 50% (Median), 75% percentiles

🟢 Used in almost every real-world data analysis project
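The result of `.describe()` is itself a DataFrame, so individual numbers can be pulled out by row and column label. The sketch below rebuilds the sample dataset from this post so that it runs on its own; the variable names `summary`, `sales_mean`, and so on are illustrative.

```python
import pandas as pd

# Same sample dataset as above
df = pd.DataFrame({
    "Sales": [12000, 15000, 10000, 18000, 16000],
    "Profit": [2000, 3000, 1500, 4000, 3500],
})

summary = df.describe()

# Rows are statistics, columns are the numeric variables
sales_mean = summary.loc["mean", "Sales"]
sales_median = summary.loc["50%", "Sales"]
sales_iqr = summary.loc["75%", "Sales"] - summary.loc["25%", "Sales"]

print(summary)
print(f"Sales IQR: {sales_iqr}")
```

Note that the `std` row is the sample standard deviation (dividing by n − 1), which is the Pandas default.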

🔹 Individual Statistics in Pandas

df.mean()
df.median()
df.std()
df.var()
df.quantile(0.75)
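Each of these calls returns one value per numeric column. One thing to watch: `std()` and `var()` divide by n − 1 by default, and the `ddof` parameter switches to the population version. A short sketch on the same sample dataset (rebuilt here so it runs standalone):

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [12000, 15000, 10000, 18000, 16000],
    "Profit": [2000, 3000, 1500, 4000, 3500],
})

column_means = df.mean()                       # Series indexed by column name
sample_std = df["Sales"].std()                 # default ddof=1: divide by n - 1
population_std = df["Sales"].std(ddof=0)       # ddof=0: divide by n
upper_quartile = df["Sales"].quantile(0.75)    # 75th percentile of one column

print(column_means)
print(f"Sample std: {sample_std:.2f}, population std: {population_std:.2f}")
```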

🏢 Real-World Applications

📌 Business

Average revenue per customer
Monthly sales analysis
Customer behavior tracking

📌 Finance

Stock volatility measurement
Risk evaluation
Portfolio performance

📌 Healthcare

Patient recovery analysis
Hospital stay durations
Disease statistics

🚀 Key Takeaways

Descriptive statistics summarize data
Mean, median, mode explain central tendency
Standard deviation explains variability
Percentiles & IQR are robust to outliers and help detect them
Pandas .describe() is essential for EDA

🔜 What’s Next?

Part 4: Data Visualization for Statistics

Histograms
Box plots
Bar charts
Python visualizations

Statistics isn’t hard — it’s just misunderstood. Keep learning! 🚀

❓ Frequently Asked Questions (FAQ)

What is descriptive statistics?

Descriptive statistics is a branch of statistics that summarizes and describes the main characteristics of a dataset using measures like mean, median, mode, variance, and standard deviation.

Why is descriptive statistics important in data science?

Descriptive statistics helps data scientists understand data distribution, identify patterns, detect outliers, and prepare datasets for further analysis and machine learning models.

What is the difference between mean and median?

The mean is the average of all values, while the median is the middle value when data is sorted. Median is more reliable when the dataset contains outliers.

When should I use standard deviation?

Standard deviation is used to measure how spread out values are from the mean. It is commonly used in finance, business analytics, and risk assessment.

What is Pandas describe() used for?

The describe() function in Pandas provides a quick summary of key descriptive statistics including count, mean, standard deviation, minimum, maximum, and percentiles.

Is descriptive statistics enough for data analysis?

Descriptive statistics is the first step in data analysis. For predictions and conclusions about future data, inferential statistics and machine learning techniques are required.

Labels: Descriptive Statistics in SAS, Statistics, statistics examples, Statistics for Data Science

Monday, August 4, 2025

📊 PROC RANK in SAS – Rank, Percentile, and Group Your Data Easily

Introduction

In data analysis, ranking values is essential for identifying top performers, segmenting data, and calculating percentiles. PROC RANK in SAS makes this process easy by assigning ranks, percentiles, or group numbers to numeric variables.

🔧 Syntax of PROC RANK

PROC RANK DATA=input_dataset OUT=output_dataset
          <TIES=LOW|HIGH|MEAN|DENSE> <GROUPS=n>;
    BY group_variable;
    VAR variable_to_rank;
    RANKS rank_variable;
RUN;

Key Options and Statements Explained:

Option	Description
`DATA=`	Input dataset
`OUT=`	Output dataset with new rank variable
`TIES=`	Specifies how tied values are handled (default is MEAN)
`GROUPS=`	Divides data into n groups of roughly equal size (such as quartiles or deciles); group numbers run from 0 to n-1
`BY`	Statement: performs ranking within each BY-group
`VAR`	Statement: variable to rank
`RANKS`	Statement: name of the new variable that stores the rank

📌 Example 1: Basic Ranking

proc rank data=sashelp.class out=ranked_class;
    var height;
    ranks height_rank;
run;

Explanation:
Ranks students in sashelp.class by their height, storing the result in height_rank.

📌 Example 2: Ranking within Groups

proc sort data=sashelp.class out=sorted_class;
    by sex;
run;

proc rank data=sorted_class out=ranked_sex;
    by sex;
    var weight;
    ranks weight_rank;
run;

Explanation:
Ranks weight within each sex group.

📌 Example 3: Create Percentile or Quantile Groups

proc rank data=sashelp.class out=grouped_class groups=4;
    var age;
    ranks age_quartile;
run;

Explanation:
Divides age into 4 quartile groups (0 to 3).
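To see where the group numbers 0 to 3 come from, here is a small Python sketch of the grouping formula given in the SAS documentation for GROUPS=, FLOOR(rank × k / (n + 1)), where k is the number of groups and n is the number of non-missing values. The helper `group_numbers` and the ages below are hypothetical, and the sketch assumes there are no tied values; it is an illustration, not SAS output.

```python
from math import floor

def group_numbers(values, groups):
    """Assign each value a group number from 0 to groups-1 (assumes no ties)."""
    n = len(values)
    # Rank 1 goes to the smallest value, rank n to the largest
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    # Grouping formula: FLOOR(rank * k / (n + 1))
    return [floor(rank * groups / (n + 1)) for rank in ranks]

# Hypothetical ages with no ties
ages = [11, 12, 13, 14, 15, 16, 17, 18]
age_quartile = group_numbers(ages, 4)

print(age_quartile)
```

With eight distinct values and four groups, each group receives two values. With tied values at a group boundary, the groups can end up uneven.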

📌 TIES= Option in Action

proc rank data=sashelp.class out=ranked_ties ties=low;
    var height;
    ranks height_rank;
run;

TIES= Options:

LOW – Lowest rank for all ties
HIGH – Highest rank for all ties
MEAN – Average rank (default)
DENSE – No gaps between ranks
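The four tie-handling rules are easier to compare side by side. Since this series also uses Pandas, the sketch below uses `Series.rank()`, whose `min`, `max`, `average`, and `dense` methods behave analogously to LOW, HIGH, MEAN, and DENSE. It is a Pandas analogy with hypothetical heights, not PROC RANK output.

```python
import pandas as pd

# Hypothetical heights with one tie (62 appears twice)
heights = pd.Series([60, 62, 62, 65])

tie_ranks = pd.DataFrame({
    "height": heights,
    "low": heights.rank(method="min"),       # like TIES=LOW
    "high": heights.rank(method="max"),      # like TIES=HIGH
    "mean": heights.rank(method="average"),  # like TIES=MEAN
    "dense": heights.rank(method="dense"),   # like TIES=DENSE
})

print(tie_ranks)
```

The two tied heights share rank 2 under `low`, rank 3 under `high`, and rank 2.5 under `mean`. Under `dense` they share rank 2 and the next value gets rank 3, so no rank is skipped.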

✅ When to Use PROC RANK

Ranking top N values
Creating quantile-based bins (e.g., deciles, quartiles)
Calculating percentiles
Segmenting customers or products
Normalizing scorecards

🧠 Tips for Using PROC RANK

Always sort the dataset before using BY.
Use GROUPS= for percentiles or bucketing.
To rank several variables, list them in one VAR statement and give the same number of names, in the same order, in one RANKS statement.
Combine with PROC SQL or PROC PRINT for better reporting.

📎 Final Thoughts

PROC RANK is a powerful yet simple procedure in SAS that enables effective data ranking and segmentation. It’s especially useful in scoring, customer segmentation, and exploratory data analysis.

Labels: About SAS, Base SAS, Descriptive Statistics in SAS, PROC Rank, PROC Rank Example, PROC Rank with BY, PROC Rank with CLASS, Procs, SAS, SAS Data Analysis, SAS PROC guide, SAS Procedures, SAS Programming Basics, SAS Summary

Saturday, July 12, 2025

✅ All About PROC SUMMARY in SAS: A Comprehensive Guide with Practical Examples

📘 Introduction

PROC SUMMARY in SAS is a procedure used to generate summary statistics for numeric variables. It supports the same statements and statistics as PROC MEANS; the main difference is that it prints nothing by default, which makes it a natural fit for building summary datasets quietly.

In this guide, you'll learn:

Complete syntax of PROC SUMMARY
All available options and statements
Grouped examples
How it differs from PROC MEANS
Output datasets and tips

🛠 Syntax of PROC SUMMARY

PROC SUMMARY <options>;
  VAR variable(s);
  CLASS variable(s);
  BY variable(s);
  OUTPUT OUT=dataset <stat-options>;
RUN;

⚙️ Common Options in PROC SUMMARY

Option	Description
`DATA=`	Specifies the input dataset
`N`	Count of non-missing values
`MEAN`	Mean or average value
`STD`	Standard deviation
`MIN`	Minimum value
`MAX`	Maximum value
`SUM`	Total sum
`MAXDEC=`	Maximum number of decimals
`NWAY`	Outputs only the rows that combine all CLASS variables (the highest `_TYPE_` value)
`CHARTYPE`	Stores `_TYPE_` in the output dataset as a character (binary string) value instead of a number

📄 Statements in PROC SUMMARY

Statement	Purpose
`VAR`	Specifies numeric variables to analyze
`CLASS`	Group summary statistics by categorical variables
`BY`	BY-group processing; data must be sorted
`OUTPUT`	Outputs summary statistics to a dataset

✅ PROC SUMMARY vs PROC MEANS

Feature	PROC MEANS	PROC SUMMARY
Displays output	Yes (default)	No (unless `PRINT`)
Statistics and statements	Same as PROC SUMMARY	Same as PROC MEANS
Typical use	For display/review	For data pipelines
Output dataset	Optional	Common

🧪 Examples of PROC SUMMARY

Example 1: Basic Summary Statistics

proc summary data=sashelp.class print;
run;

Note: Use PRINT to display output. Without a VAR statement, PROC SUMMARY reports only the number of observations.

Example 2: Specifying Variables

proc summary data=sashelp.class print;
  var height weight;
run;

Example 3: Using CLASS Statement

proc summary data=sashelp.class print;
  class sex;
  var height weight;
run;

Example 4: Creating Output Dataset

proc summary data=sashelp.class n mean maxdec=1;
  class sex;
  var height weight;
  output out=summary_stats;
run;

Example 5: Custom Output Variable Names

proc summary data=sashelp.class;
  class sex;
  var weight;
  output out=summary_data
    mean=mean_weight
    max=max_weight
    min=min_weight;
run;
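For readers coming from the Pandas part of this series, a rough Python analogue of Example 5 is a grouped aggregation with named output columns. This is only a sketch: the small `students` table is hypothetical (it is not `sashelp.class`), and the result corresponds to the per-group rows only, without the overall row that PROC SUMMARY also writes when NWAY is not specified.

```python
import pandas as pd

# Hypothetical stand-in for a class dataset
students = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "M"],
    "weight": [84.0, 98.0, 112.5, 102.5, 85.0],
})

# Named aggregation: new_column=(source_column, statistic)
summary_data = (
    students.groupby("sex")
    .agg(
        mean_weight=("weight", "mean"),
        max_weight=("weight", "max"),
        min_weight=("weight", "min"),
    )
    .reset_index()
)

print(summary_data)
```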

Example 6: Using BY Statement (sorted data)

proc sort data=sashelp.class out=sorted;
  by sex;
run;

proc summary data=sorted print;
  by sex;
  var height;
run;

🔄 Output Options in PROC SUMMARY

You can customize the summary stats by combining the following keywords in the OUTPUT statement:

Keyword	Description
`N=`	Assigns name to N (count)
`MEAN=`	Assigns name to mean
`STD=`	Assigns name to standard deviation
`SUM=`	Assigns name to sum
`MIN=`	Assigns name to minimum
`MAX=`	Assigns name to maximum

Example:

output out=myout n=n_obs mean=avg std=stdev;

🧠 Tips for PROC SUMMARY

Always use PRINT if you want to display results.
Use NWAY if you need only fully classified combinations.
Use meaningful output variable names with =.
Great for creating reusable summary datasets.

📦 Summary Table

Feature	Description
Primary Use	Summary statistics
Shows Output	No (by default)
Supports Groups	Yes (CLASS or BY)
Custom Output	Yes (OUTPUT statement)
Output Dataset	Yes
Flexible Output	Highly customizable

Labels: About SAS, Base SAS, Basics, Descriptive Statistics in SAS, PROC MEAN, PROC summary with BY, Procs, SAS, SAS Data Analysis, SAS PROC guide, SAS Procedures, SAS Programming, SAS Summary, SPROC summary Example

📊 PROC MEANS in SAS – A Complete Guide with Syntax, Options & Examples

📘 Introduction

The PROC MEANS procedure in SAS is one of the most frequently used procedures for generating descriptive statistics. It helps compute means, medians, standard deviations, minimums, maximums, and more for numeric variables.

This blog post explores everything about PROC MEANS:

Syntax and arguments
Available statistics
Options and statements
Multiple grouped examples
Tips for better analysis

🔧 Syntax of PROC MEANS

PROC MEANS <options>;
  VAR variable(s);
  CLASS variable(s);
  BY variable(s);
  OUTPUT OUT=dataset <output-options>;
RUN;

🧾 Commonly Used Options in PROC MEANS

Option	Description
`N`	Count of non-missing values
`MEAN`	Average value
`STD`	Standard deviation
`MIN`	Minimum value
`MAX`	Maximum value
`MEDIAN`	Median value
`SUM`	Sum of values
`MAXDEC=`	Maximum number of decimals
`DATA=`	Specifies input dataset
`NWAY`	Forces output only for combinations of all class variables
`CHARTYPE`	Stores `_TYPE_` in the output dataset as a character (binary string) value instead of a number
`Q1`, `Q3`	1st and 3rd quartiles

🧠 Key Statements in PROC MEANS

Statement	Purpose
`VAR`	Specifies numeric variables to analyze
`CLASS`	Performs group-wise analysis (similar to GROUP BY)
`BY`	Performs BY-group processing (requires sorted data)
`OUTPUT`	Saves results to a new dataset

🧪 PROC MEANS Examples

✅ Example 1: Basic Summary Statistics

proc means data=sashelp.class;
run;

Output: N, Mean, Std, Min, Max for all numeric variables.

✅ Example 2: Specify Variables and Options

proc means data=sashelp.class mean std maxdec=2;
  var age height weight;
run;

Output: Mean and standard deviation for specified variables with 2 decimals.

✅ Example 3: Using CLASS Statement

proc means data=sashelp.class n mean median maxdec=1;
  class sex;
  var height weight;
run;

Output: Summary by gender.

✅ Example 4: Using BY Statement

proc sort data=sashelp.class out=sorted;
  by sex;
run;

proc means data=sorted n mean std;
  by sex;
run;

Note: BY requires pre-sorting.

✅ Example 5: Saving Output to a Dataset

proc means data=sashelp.class n mean max min;
  var height weight;
  class sex;
  output out=class_summary mean=mean_height mean_weight;
run;

Output Dataset: class_summary with the mean of height and weight by sex, plus an overall row (`_TYPE_` = 0) because NWAY is not specified.

✅ Example 6: Percentiles and Custom Statistics

proc means data=sashelp.class n mean median q1 q3;
  var weight;
run;

📌 When to Use CLASS vs BY in PROC MEANS

Feature	CLASS	BY
Sorting	Not required	Requires sorting
Output	Summary by group	Separate table per group
Best for	Quick grouped reports	Large datasets that are already sorted

🧠 Tips for Using PROC MEANS Effectively

Use MAXDEC= to format output.
CLASS is easier to use than BY for grouped summaries.
Combine with OUTPUT statement to reuse summary data.
Filter data with a WHERE statement (or the WHERE= data set option) instead of creating a separate subset first.

🧾 Summary Table

Feature	Description
Procedure Name	`PROC MEANS`
Primary Use	Descriptive statistics
Key Outputs	N, Mean, Std, Min, Max, Median, etc.
Common Options	`MAXDEC=`, `NWAY`, `CHARTYPE`
Supports Grouping	Yes – via `CLASS` and `BY`

Labels: About SAS, Base SAS, Descriptive Statistics in SAS, PROC MEANS, PROC MEANS Example, PROC MEANS with BY, PROC MEANS with CLASS, Procs, SAS, SAS Data Analysis, SAS PROC guide, SAS Procedures, SAS Programming Basics, SAS Summary

DataHark - Every Data is a Story
