Wednesday, December 17, 2025

📊 Descriptive Statistics Explained Simply | Mean, Median & Std Dev

Part 3: Statistics for Data Science Series

Goal: Learn how to summarize data effectively before analysis

When working with data, the first and most important question is:

👉 “What does this data look like?”

This is where Descriptive Statistics comes in.

Descriptive statistics help us summarize, understand, and interpret data using simple numerical measures. Before any machine learning, prediction, or dashboarding, descriptive stats are your foundation.



📌 What is Descriptive Statistics?

Descriptive statistics are techniques used to summarize and describe the main features of a dataset.

They help answer questions like:

  • What is the average value?
  • How spread out is the data?
  • Are there extreme values?
  • Where does most of the data lie?

👉 Unlike inferential statistics, descriptive statistics do not make predictions; they describe the data you already have.


📈 Measures of Central Tendency (Finding the “Center”)

1️⃣ Mean (Average)

Definition:
The sum of all values divided by the total number of values.

Formula:
Mean = (Sum of values) / (Number of values)

📌 Real-Life Example:
Average daily sales of an e-commerce store over 5 days:
₹10k, ₹12k, ₹11k, ₹13k, ₹14k

Mean = (10 + 12 + 11 + 13 + 14) / 5 = ₹12k

🟢 Best used when data has no extreme outliers
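The formula can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the variable names are illustrative, and the values are the daily sales from the example (in thousands of rupees).

```python
from statistics import mean

# Daily sales in thousands of rupees, taken from the example above.
daily_sales_k = [10, 12, 11, 13, 14]

# Mean = (sum of values) / (number of values)
manual_mean = sum(daily_sales_k) / len(daily_sales_k)

# The standard library gives the same result.
library_mean = mean(daily_sales_k)

print(manual_mean, library_mean)
```

Both values come out to 12, matching the ₹12k worked out by hand.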


2️⃣ Median (Middle Value)

Definition:
The middle value when data is sorted.

📌 Example:
Monthly salaries in a startup:
₹25k, ₹30k, ₹35k, ₹40k, ₹5,00,000

Median = ₹35k (the mean of the same values is ₹126k, pulled up by the single ₹5,00,000 salary)

🟢 Best used when data contains outliers
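To see how one extreme value moves the mean but not the median, here is a small sketch with the salaries from the example, written in thousands of rupees (so ₹5,00,000 becomes 500). The variable names are illustrative.

```python
from statistics import mean, median

# Salaries in thousands of rupees; 500 is the ₹5,00,000 outlier.
salaries_k = [25, 30, 35, 40, 500]

middle = median(salaries_k)   # middle value of the sorted data
average = mean(salaries_k)    # pulled upward by the outlier

# Dropping the outlier changes the mean a lot, the median only a little.
without_outlier = salaries_k[:-1]
middle_trimmed = median(without_outlier)
average_trimmed = mean(without_outlier)

print(middle, average)                  # 35 126
print(middle_trimmed, average_trimmed)  # 32.5 32.5
```

The median stays close to what a typical employee earns, while the mean of ₹126k describes nobody in the startup.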


3️⃣ Mode (Most Frequent Value)

Definition:
The value that appears most often.

📌 Example:
Most sold product sizes:
M, M, L, S, M, L

Mode = M

🟢 Useful for categorical data
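A short sketch of finding the mode of categorical data with the standard library. The size list is the one from the example; the tied list at the end is a made-up illustration of what happens when two values are equally frequent.

```python
from collections import Counter
from statistics import mode, multimode

sizes = ["M", "M", "L", "S", "M", "L"]

# Count how often each size appears.
counts = Counter(sizes)

# mode() returns the single most frequent value.
most_common_size = mode(sizes)

# multimode() returns every value tied for most frequent
# (hypothetical data with a tie between S and L).
tied_sizes = multimode(["S", "S", "L", "L"])

print(counts["M"], most_common_size, tied_sizes)
```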


📊 Measures of Dispersion (Understanding Spread)

4️⃣ Range

Definition:
Difference between the maximum and minimum values.

📌 Example:
City temperature range:
Min = 20°C, Max = 45°C

Range = 45°C − 20°C = 25°C

⚠️ Highly sensitive to outliers


5️⃣ Variance

Definition:
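Average of squared differences from the mean. Dividing by n gives the population variance; dividing by n − 1 gives the sample variance, which is what Pandas .var() returns by default.

  • Low variance → values are close together
  • High variance → values are spread out

🟢 Widely used in finance and risk analysis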
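The definition can be followed step by step. The sketch below reuses the daily sales figures from the mean example (an assumption made for illustration) and computes both the population variance (divide by n) and the sample variance (divide by n − 1), then checks them against the standard library.

```python
from statistics import pvariance, variance

values = [10, 12, 11, 13, 14]  # daily sales in thousands of rupees
n = len(values)

# Step 1: the mean.
center = sum(values) / n

# Step 2: squared differences from the mean.
squared_diffs = [(x - center) ** 2 for x in values]

# Step 3: average them (population) or divide by n - 1 (sample).
population_variance = sum(squared_diffs) / n
sample_variance = sum(squared_diffs) / (n - 1)

print(population_variance, sample_variance)  # 2.0 2.5
```

The squared differences are 4, 0, 1, 1 and 4, which sum to 10; dividing by 5 gives 2.0 and dividing by 4 gives 2.5.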


6️⃣ Standard Deviation

Definition:
Square root of variance.

📌 Example:
Stocks with high standard deviation are more volatile.

🟢 Most widely used dispersion metric
🟢 Same unit as original data
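As a rough illustration of the volatility example, the sketch below compares two made-up series of daily returns (in percent). Both series and their names are hypothetical; they are chosen so that the averages match while the spreads differ.

```python
from statistics import mean, stdev

# Hypothetical daily returns in percent for two stocks.
steady_returns = [1.0, 1.2, 0.9, 1.1, 0.8]
volatile_returns = [5.0, -4.0, 6.0, -3.0, 1.0]

# Both average about 1% per day...
steady_mean = mean(steady_returns)
volatile_mean = mean(volatile_returns)

# ...but the sample standard deviations are very different
# (roughly 0.16 versus roughly 4.53).
steady_sd = stdev(steady_returns)
volatile_sd = stdev(volatile_returns)

print(round(steady_sd, 2), round(volatile_sd, 2))
```

Because standard deviation is in the same unit as the data, the second stock's returns can be read directly as typically swinging by about 4.5 percentage points around the average.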


Percentiles & Interquartile Range (IQR)

7️⃣ Percentiles

Definition:
A percentile shows the value below which a certain percentage of data falls.

📌 Example:
90th percentile salary = ₹20 LPA
If you earn ₹20 LPA, you earn more than about 90% of employees.
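A percentile can be computed with the standard library's quantiles function. The salary list below is hypothetical, picked so that the 90th percentile lands on 20 LPA as in the example; the "inclusive" method interpolates linearly between the sorted data points.

```python
from statistics import quantiles

# Hypothetical salaries in LPA (lakhs per annum).
salaries_lpa = [4, 5, 6, 7, 8, 9, 10, 12, 15, 20, 35]

# n=100 returns the 99 cut points for percentiles 1 through 99.
cut_points = quantiles(salaries_lpa, n=100, method="inclusive")

p50 = cut_points[49]  # 50th percentile (the median)
p90 = cut_points[89]  # 90th percentile

print(p50, p90)  # 9.0 20.0
```

With only a handful of values the percentiles are coarse; on real data with thousands of rows the cut points become much more meaningful.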


8️⃣ Interquartile Range (IQR)

Formula:
IQR = Q3 − Q1 (75th percentile minus 25th percentile)

Why it matters:

  • Identifies outliers
  • Used in box plots
  • Robust against extreme values

📌 Real Scenario:
Detecting abnormal insurance claims or fraudulent transactions.
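One common convention for flagging outliers with the IQR (not stated in this post, so treat it as an assumption) is the 1.5 × IQR rule: anything below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is flagged. The claim amounts below are made up for illustration.

```python
from statistics import quantiles

# Hypothetical insurance claim amounts; 9800 is the suspicious one.
claims = [1200, 1500, 1100, 1300, 1250, 1400, 1350, 9800]

# Quartiles with linear interpolation between sorted data points.
q1, q2, q3 = quantiles(claims, n=4, method="inclusive")
iqr = q3 - q1

# Fences under the 1.5 x IQR convention (an assumed rule of thumb).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in claims if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)   # 1237.5 1425.0 187.5
print(outliers)      # [9800]
```

Because Q1 and Q3 barely move when the extreme claim is added or removed, the fences stay stable, which is what makes the IQR robust.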


🧠 Summary Statistics in Python (Hands-On)

Pandas is one of the most widely used Python libraries for descriptive statistics.

🔹 Sample Dataset

import pandas as pd

data = {
    "Sales": [12000, 15000, 10000, 18000, 16000],
    "Profit": [2000, 3000, 1500, 4000, 3500]
}

df = pd.DataFrame(data)

🔹 Using Pandas .describe()

df.describe()

.describe() instantly provides:

  • Count
  • Mean
  • Standard Deviation
  • Minimum & Maximum
  • 25%, 50% (Median), 75% percentiles

🟢 Used in almost every real-world data analysis project
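To make the numbers in that summary less of a black box, the sketch below rebuilds the same fields for the Sales column using only the standard library. It assumes the Pandas defaults of sample standard deviation (n − 1) and linearly interpolated percentiles, which is what the "inclusive" method here computes; the dictionary layout is just for illustration.

```python
from statistics import mean, quantiles, stdev

# The Sales column from the sample dataset above.
sales = [12000, 15000, 10000, 18000, 16000]

# Quartiles with linear interpolation between sorted values.
q1, q2, q3 = quantiles(sales, n=4, method="inclusive")

summary = {
    "count": len(sales),
    "mean": mean(sales),
    "std": stdev(sales),   # sample standard deviation (n - 1)
    "min": min(sales),
    "25%": q1,
    "50%": q2,             # the median
    "75%": q3,
    "max": max(sales),
}

for name, value in summary.items():
    print(f"{name:>5}  {value:.2f}")
```

For this column the mean is 14200, the quartiles are 12000, 15000 and 16000, and the standard deviation is roughly 3193.74.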


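🔹 Individual Statistics in Pandas

df.mean()          # mean of each column
df.median()        # median of each column
df.std()           # sample standard deviation (divides by n - 1)
df.var()           # sample variance (divides by n - 1)
df.quantile(0.75)  # 75th percentile (Q3) of each column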

🏢 Real-World Applications

📌 Business

  • Average revenue per customer
  • Monthly sales analysis
  • Customer behavior tracking

📌 Finance

  • Stock volatility measurement
  • Risk evaluation
  • Portfolio performance

📌 Healthcare

  • Patient recovery analysis
  • Hospital stay durations
  • Disease statistics

🚀 Key Takeaways

  • Descriptive statistics summarize data
  • Mean, median, mode explain central tendency
  • Standard deviation explains variability
  • Percentiles & IQR help detect outliers and are robust to them
  • Pandas .describe() is essential for EDA

🔜 What’s Next?

Part 4: Data Visualization for Statistics

  • Histograms
  • Box plots
  • Bar charts
  • Python visualizations

Statistics isn’t hard; it’s just misunderstood. Keep learning! 🚀

❓ Frequently Asked Questions (FAQ)

What is descriptive statistics?

Descriptive statistics is a branch of statistics that summarizes and describes the main characteristics of a dataset using measures like mean, median, mode, variance, and standard deviation.

Why is descriptive statistics important in data science?

Descriptive statistics helps data scientists understand data distribution, identify patterns, detect outliers, and prepare datasets for further analysis and machine learning models.

What is the difference between mean and median?

The mean is the average of all values, while the median is the middle value when data is sorted. Median is more reliable when the dataset contains outliers.

When should I use standard deviation?

Standard deviation is used to measure how spread out values are from the mean. It is commonly used in finance, business analytics, and risk assessment.

What is Pandas describe() used for?

The describe() function in Pandas provides a quick summary of key descriptive statistics including count, mean, standard deviation, minimum, maximum, and percentiles.

Is descriptive statistics enough for data analysis?

Descriptive statistics is the first step in data analysis. For predictions and conclusions about future data, inferential statistics and machine learning techniques are required.


Monday, August 4, 2025

📊 PROC RANK in SAS – Rank, Percentile, and Group Your Data Easily

Introduction

In data analysis, ranking values is essential for identifying top performers, segmenting data, and calculating percentiles. PROC RANK in SAS makes this process easy by assigning ranks, percentiles, or group numbers to numeric variables.



🔧 Syntax of PROC RANK

PROC RANK DATA=input_dataset OUT=output_dataset <TIES=LOW|HIGH|MEAN|DENSE> <GROUPS=n>;
    BY group_variable;
    VAR variable_to_rank;
    RANKS rank_variable;
RUN;

Key Options Explained:

Option / Statement    Description
DATA=                 Input dataset
OUT=                  Output dataset with the new rank variable
RANKS                 Statement naming the new variable that stores the rank
TIES=                 Specifies how tied values are handled (default is MEAN)
BY                    Performs ranking within each BY-group
VAR                   Variable to rank
GROUPS=               Divides data into n groups of roughly equal size (such as quartiles or deciles)

📌 Example 1: Basic Ranking

proc rank data=sashelp.class out=ranked_class;
    var height;
    ranks height_rank;
run;

Explanation:
Ranks students in sashelp.class by their height, storing the result in height_rank.


📌 Example 2: Ranking within Groups

proc sort data=sashelp.class out=sorted_class;
    by sex;
run;

proc rank data=sorted_class out=ranked_sex;
    by sex;
    var weight;
    ranks weight_rank;
run;

Explanation:
Ranks weight within each sex group.


📌 Example 3: Create Percentile or Quantile Groups

proc rank data=sashelp.class out=grouped_class groups=4;
    var age;
    ranks age_quartile;
run;

Explanation:
Divides age into 4 quartile groups (0 to 3).
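For readers who want to see how group numbers come out of ranks, here is a Python sketch (not SAS code). It assumes the group formula given in the SAS documentation, FLOOR(rank × k / (n + 1)), and uses made-up ages with no ties so that each value has a simple rank from 1 to n. The function name is hypothetical.

```python
from math import floor

def group_numbers(values, k):
    """Assign group numbers 0..k-1, assuming all values are distinct."""
    n = len(values)
    # Rank 1 goes to the smallest value, rank n to the largest.
    rank_of = {v: r for r, v in enumerate(sorted(values), start=1)}
    # Assumed formula: FLOOR(rank * k / (n + 1)).
    return [floor(rank_of[v] * k / (n + 1)) for v in values]

ages = [11, 12, 13, 14, 15, 16, 17, 18]  # hypothetical, no ties
quartile_groups = group_numbers(ages, 4)

print(quartile_groups)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With ties in the data (as in sashelp.class, where several students share an age), tied values receive the same rank and the groups can end up with unequal sizes.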


📌 TIES= Option in Action

proc rank data=sashelp.class out=ranked_ties ties=low;
    var height;
    ranks height_rank;
run;

TIES= Options:

  • LOW – Lowest rank for all ties
  • HIGH – Highest rank for all ties
  • MEAN – Average rank (default)
  • DENSE – No gaps between ranks
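The four tie-handling rules are easiest to compare side by side on a tiny list. The following is a Python sketch of the semantics described in the list above, not SAS code; the function name and the sample heights are hypothetical.

```python
def rank_with_ties(values, ties="mean"):
    """Rank values in ascending order using one of four tie rules."""
    if ties not in {"low", "high", "mean", "dense"}:
        raise ValueError(f"unknown ties rule: {ties}")

    # 1-based positions of each distinct value in sorted order.
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)

    # Dense ranks: distinct values numbered 1, 2, 3, ... with no gaps.
    dense_rank = {v: r for r, v in enumerate(sorted(positions), start=1)}

    ranks = []
    for v in values:
        pos = positions[v]
        if ties == "low":
            ranks.append(pos[0])               # lowest position in the tie
        elif ties == "high":
            ranks.append(pos[-1])              # highest position in the tie
        elif ties == "dense":
            ranks.append(dense_rank[v])        # no gaps after ties
        else:
            ranks.append(sum(pos) / len(pos))  # average position
    return ranks

heights = [60, 62, 62, 65]  # two students tied at 62
for rule in ("low", "high", "mean", "dense"):
    print(rule, rank_with_ties(heights, rule))
```

For the two tied heights, LOW gives rank 2 to both, HIGH gives 3, MEAN gives 2.5, and DENSE gives 2 while letting the next value take rank 3 instead of 4.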


✅ When to Use PROC RANK

  • Ranking top N values
  • Creating quantile-based bins (e.g., deciles, quartiles)
  • Calculating percentiles
  • Segmenting customers or products
  • Normalizing scorecards


🧠 Tips for Using PROC RANK

  • Always sort the dataset before using BY.
  • Use GROUPS= for percentiles or bucketing.
  • For multiple variables, list them in one VAR statement and give the matching rank names, in the same order, in the RANKS statement.
  • Combine with PROC SQL or PROC PRINT for better reporting.


📎 Final Thoughts

PROC RANK is a powerful yet simple procedure in SAS that enables effective data ranking and segmentation. It’s especially useful in scoring, customer segmentation, and exploratory data analysis.


Saturday, July 12, 2025

✅ All About PROC SUMMARY in SAS: A Comprehensive Guide with Practical Examples

📘 Introduction

PROC SUMMARY in SAS is a powerful procedure used to generate summary statistics for numeric variables. It is functionally equivalent to PROC MEANS; the main practical difference is that PROC SUMMARY prints nothing by default, which makes it a natural fit for writing summary datasets silently.

In this guide, you'll learn:

  • Complete syntax of PROC SUMMARY
  • All available options and statements
  • Grouped examples
  • How it differs from PROC MEANS
  • Output datasets and tips



🛠 Syntax of PROC SUMMARY

PROC SUMMARY <options>;
    VAR variable(s);
    CLASS variable(s);
    BY variable(s);
    OUTPUT OUT=dataset <stat-options>;
RUN;

⚙️ Common Options in PROC SUMMARY

Option      Description
DATA=       Specifies the input dataset
N           Count of non-missing values
MEAN        Mean or average value
STD         Standard deviation
MIN         Minimum value
MAX         Maximum value
SUM         Total sum
MAXDEC=     Maximum number of decimal places in printed output
NWAY        Outputs only the rows that combine all CLASS variables (the highest _TYPE_ value)
CHARTYPE    Stores _TYPE_ in the output dataset as a character variable instead of a numeric one

📄 Statements in PROC SUMMARY

Statement    Purpose
VAR          Specifies numeric variables to analyze
CLASS        Groups summary statistics by categorical variables
BY           BY-group processing; data must be sorted
OUTPUT       Writes summary statistics to a dataset

✅ PROC SUMMARY vs PROC MEANS

Feature                  PROC MEANS                        PROC SUMMARY
Displays output          Yes (default)                     No (unless PRINT)
Without VAR statement    Analyzes all numeric variables    Reports only the observation count
Flexibility              Moderate                          High
Use in production        For display/review                For data pipelines
Output dataset           Optional                          Common

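🧪 Examples of PROC SUMMARY

Example 1: Basic Summary Statistics

proc summary data=sashelp.class print;
run;

Note: Use PRINT to display output. Because there is no VAR statement, PROC SUMMARY reports only the number of observations here; PROC MEANS would analyze all numeric variables instead.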


Example 2: Specifying Variables

proc summary data=sashelp.class print;
    var height weight;
run;

Example 3: Using CLASS Statement

proc summary data=sashelp.class print;
    class sex;
    var height weight;
run;

Example 4: Creating Output Dataset

proc summary data=sashelp.class n mean maxdec=1;
    class sex;
    var height weight;
    output out=summary_stats;
run;

Note: An OUTPUT statement with no statistic keywords writes the default statistics (N, MIN, MAX, MEAN, STD) to summary_stats, one row per statistic, identified by the _STAT_ variable. MAXDEC= affects only printed output.

Example 5: Custom Output Variable Names

proc summary data=sashelp.class;
    class sex;
    var weight;
    output out=summary_data mean=mean_weight max=max_weight min=min_weight;
run;

Example 6: Using BY Statement (sorted data)

proc sort data=sashelp.class out=sorted;
    by sex;
run;

proc summary data=sorted print;
    by sex;
    var height;
run;

🔄 Output Options in PROC SUMMARY

You can customize the summary stats by combining the following keywords in the OUTPUT statement:

Keyword    Description
N=         Assigns name to N (count)
MEAN=      Assigns name to mean
STD=       Assigns name to standard deviation
SUM=       Assigns name to sum
MIN=       Assigns name to minimum
MAX=       Assigns name to maximum

Example:

output out=myout n=n_obs mean=avg std=stdev;

🧠 Tips for PROC SUMMARY

  • Always use PRINT if you want to display results.
  • Use NWAY if you need only fully classified combinations.
  • Use meaningful output variable names with =.
  • Great for creating reusable summary datasets.
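The NWAY tip is easier to picture with a small simulation. This Python sketch (not SAS code) mimics what an output dataset looks like for one CLASS variable: an overall row with _TYPE_ = 0 plus one row per class level with _TYPE_ = 1, where NWAY keeps only the latter. The records, the function name, and the column name mean_height are hypothetical.

```python
from statistics import mean

# Hypothetical (sex, height) records.
records = [("F", 56), ("F", 64), ("M", 60), ("M", 66), ("M", 72)]

def summarize_by_class(rows, nway=False):
    """Mimic an output dataset with one CLASS variable."""
    output = []
    if not nway:
        # Overall summary across every row (_TYPE_ = 0).
        output.append({
            "sex": None,
            "_TYPE_": 0,
            "_FREQ_": len(rows),
            "mean_height": mean([h for _, h in rows]),
        })
    for level in sorted({s for s, _ in rows}):
        heights = [h for s, h in rows if s == level]
        # One row per class level (_TYPE_ = 1).
        output.append({
            "sex": level,
            "_TYPE_": 1,
            "_FREQ_": len(heights),
            "mean_height": mean(heights),
        })
    return output

all_rows = summarize_by_class(records)
nway_rows = summarize_by_class(records, nway=True)

for row in all_rows:
    print(row)
```

Without NWAY the result has three rows (overall, F, M); with it, only the F and M rows remain, which is usually what a downstream merge expects.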


📦 Summary Table

Feature            Description
Primary Use        Summary statistics
Shows Output       No (by default)
Supports Groups    Yes (CLASS or BY)
Custom Output      Yes (OUTPUT statement)
Output Dataset     Yes
Flexible Output    Highly customizable


📊 PROC MEANS in SAS – A Complete Guide with Syntax, Options & Examples

📘 Introduction

The PROC MEANS procedure in SAS is one of the most frequently used procedures for generating descriptive statistics. It helps compute means, medians, standard deviations, minimums, maximums, and more for numeric variables.

This blog post explores everything about PROC MEANS:

  • Syntax and arguments
  • Available statistics
  • Options and statements
  • Multiple grouped examples
  • Tips for better analysis



🔧 Syntax of PROC MEANS

PROC MEANS <options>;
    VAR variable(s);
    CLASS variable(s);
    BY variable(s);
    OUTPUT OUT=dataset <output-options>;
RUN;

🧾 Commonly Used Options in PROC MEANS

Option      Description
N           Count of non-missing values
MEAN        Average value
STD         Standard deviation
MIN         Minimum value
MAX         Maximum value
MEDIAN      Median value
SUM         Sum of values
MAXDEC=     Maximum number of decimal places in printed output
DATA=       Specifies input dataset
NWAY        Outputs only the rows that combine all CLASS variables
CHARTYPE    Stores _TYPE_ in the output dataset as a character variable instead of a numeric one
Q1, Q3      1st and 3rd quartiles

🧠 Key Statements in PROC MEANS

Statement    Purpose
VAR          Specifies numeric variables to analyze
CLASS        Performs group-wise analysis (similar to GROUP BY)
BY           Performs BY-group processing (requires sorted data)
OUTPUT       Saves results to a new dataset

🧪 PROC MEANS Examples

✅ Example 1: Basic Summary Statistics

proc means data=sashelp.class;
run;

Output: N, Mean, Std, Min, Max for all numeric variables.


✅ Example 2: Specify Variables and Options

proc means data=sashelp.class mean std maxdec=2;
    var age height weight;
run;

Output: Mean and standard deviation for specified variables with 2 decimals.


✅ Example 3: Using CLASS Statement

proc means data=sashelp.class n mean median maxdec=1;
    class sex;
    var height weight;
run;

Output: N, mean, and median of height and weight for each value of sex.


✅ Example 4: Using BY Statement

proc sort data=sashelp.class out=sorted;
    by sex;
run;

proc means data=sorted n mean std;
    by sex;
run;

Note: BY requires pre-sorting.


✅ Example 5: Saving Output to a Dataset

proc means data=sashelp.class n mean max min;
    var height weight;
    class sex;
    output out=class_summary mean=mean_height mean_weight;
run;

Output Dataset: class_summary with the mean of height and weight by sex. Because NWAY is not specified, it also contains an overall row (_TYPE_ = 0) across both groups.


✅ Example 6: Percentiles and Custom Statistics

proc means data=sashelp.class n mean median q1 q3;
    var weight;
run;

📌 When to Use CLASS vs BY in PROC MEANS

Feature        CLASS                               BY
Sorting        Not required                        Requires sorting
Output         Summary by group                    Separate table per group
Flexibility    More user-friendly for reporting    Ideal for structured data

🧠 Tips for Using PROC MEANS Effectively

  • Use MAXDEC= to format output.
  • CLASS is easier to use than BY for grouped summaries.
  • Combine with OUTPUT statement to reuse summary data.
  • Filter data with a WHERE statement or the WHERE= data set option so PROC MEANS summarizes only the rows you need.

🧾 Summary Table

Feature              Description
Procedure Name       PROC MEANS
Primary Use          Descriptive statistics
Key Outputs          N, Mean, Std, Min, Max, Median, etc.
Common Options       MAXDEC=, NWAY, CHARTYPE
Supports Grouping    Yes – via CLASS and BY
