Wednesday, December 17, 2025

📊 Types of Data & Data Collection Methods in Data Science (Part 2)

Understanding data is the first and most important step in data science.

Before analysis, modeling, or machine learning, a data scientist must know what type of data they are working with and how it was collected.

In Part 2 of our Statistics for Data Science series, you’ll learn:

  • Different types of data used in data science
  • How data is collected in real-world projects
  • Sampling methods and their importance
  • Hands-on examples to classify datasets



🎯 Goal of This Post

Understand your data before analyzing it.

Misunderstanding your data leads to:

  • Wrong statistical methods
  • Poor model performance
  • Misleading insights


📌 Types of Data in Data Science

Data can be classified in multiple ways depending on its nature and usage.


🔹 Qualitative vs Quantitative Data

📘 Qualitative Data (Categorical Data)

Qualitative data describes qualities or characteristics and is non-numeric.

Examples:

  • Gender (Male/Female)
  • Product category
  • Customer feedback (Good, Bad, Average)
  • City names

📌 Used for:

  • Classification
  • Sentiment analysis
  • Grouping and segmentation


📗 Quantitative Data (Numerical Data)

Quantitative data represents numbers and measurable values.

Examples:

  • Age
  • Salary
  • Temperature
  • Number of purchases

📌 Used for:

  • Statistical calculations
  • Regression models
  • Forecasting


🔹 Discrete vs Continuous Data

📘 Discrete Data

Discrete data consists of countable values.

Examples:

  • Number of customers
  • Number of defects
  • Number of website visits

📌 Values are counts, so they are typically whole numbers.


📗 Continuous Data

Continuous data can take any value within a range.

Examples:

  • Height
  • Weight
  • Time
  • Temperature

📌 Can have decimal values.


🔹 Structured vs Unstructured Data

📘 Structured Data

Structured data is organized in rows and columns.

Examples:

  • Excel files
  • SQL tables
  • CSV datasets

📌 Easy to analyze using SQL, Excel, Python, or BI tools.


📗 Unstructured Data

Unstructured data has no predefined format.

Examples:

  • Text documents
  • Emails
  • Images
  • Videos
  • Social media posts

📌 Requires advanced processing (NLP, Computer Vision).


📌 Data Collection Methods in Data Science

Understanding how data is collected helps assess data quality and bias.


🔹 Common Data Collection Techniques

1️⃣ Surveys & Questionnaires

  • Online forms
  • Feedback surveys
  • Market research

📌 Risk: Response bias


2️⃣ Observational Data

  • Website click tracking
  • User behavior logs
  • Sensor data

📌 Captures real behavior, often in real time, but can still suffer from selection bias and measurement error


3️⃣ Experiments (A/B Testing)

  • Marketing experiments
  • Product feature testing

📌 Controlled, and reliable when groups are assigned at random and the sample is large enough


4️⃣ Transactional Data

  • Sales records
  • Banking transactions
  • E-commerce logs

📌 Highly structured and reliable


5️⃣ Third-Party Data

  • Government datasets
  • APIs
  • External vendors

📌 Verify credibility and freshness


📌 Sampling Methods in Statistics

Sampling allows us to study a subset of data instead of the entire population.


🔹 Types of Sampling Methods

📘 Random Sampling

  • Every unit has an equal chance of being selected
  • Reduces selection bias


📘 Stratified Sampling

  • Population divided into groups (strata)
  • Sample taken from each group

📌 Used in surveys and finance


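📘 Systematic Sampling

  • Every nth observation is selected, ideally from a random starting point

📌 Simple and efficient, but can mislead if the data has a repeating pattern that lines up with n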


📘 Convenience Sampling

  • Uses whatever data is easiest to obtain

📌 Risk: High bias
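
The four sampling methods above can be sketched with Python's standard library. This is only a minimal illustration: the population of 100 customer records, the `region` field, and the sample sizes are all made-up assumptions, not data from a real project.

```python
import random
from collections import defaultdict

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100 customers, 70 urban and 30 rural
population = [
    {"id": i, "region": "urban" if i < 70 else "rural"} for i in range(100)
]

# Simple random sampling: every unit has the same chance of selection
simple_sample = random.sample(population, k=10)

# Systematic sampling: random starting point, then every 10th unit
step = 10
start = random.randrange(step)
systematic_sample = population[start::step]

# Stratified sampling: draw 10% from each region separately
strata = defaultdict(list)
for row in population:
    strata[row["region"]].append(row)
stratified_sample = []
for rows in strata.values():
    stratified_sample.extend(random.sample(rows, k=len(rows) // 10))

# Convenience sampling: take whatever is easiest to reach (the first 10 rows)
convenience_sample = population[:10]


def urban_share(sample):
    """Fraction of sampled customers who live in the urban region."""
    return sum(r["region"] == "urban" for r in sample) / len(sample)


print("population urban share: ", urban_share(population))
print("stratified urban share: ", urban_share(stratified_sample))
print("convenience urban share:", urban_share(convenience_sample))
```

In this toy population the stratified sample reproduces the 70/30 split exactly, while the convenience sample contains only urban customers, which is the kind of bias the warning above refers to.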


📌 Why Sampling Matters in Data Science

  • Saves time and cost
  • Makes large datasets manageable
  • Enables faster experimentation
  • Supports inferential statistics


🧪 Hands-On: Classify Sample Datasets

Let’s classify real-world datasets.

| Dataset | Qualitative / Quantitative | Discrete / Continuous | Structured / Unstructured |
|---|---|---|---|
| Customer Gender | Qualitative | N/A | Structured |
| Monthly Salary | Quantitative | Continuous | Structured |
| Product Reviews | Qualitative | N/A | Unstructured |
| Number of Orders | Quantitative | Discrete | Structured |
| Website Session Time | Quantitative | Continuous | Structured |
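
A rough way to automate the first two columns of the table above is sketched below. The `classify_column` helper and the sample values are hypothetical, and the rule is only a heuristic: it would mislabel numerically coded categories such as ZIP codes, so treat its output as a first guess to be checked by hand.

```python
def classify_column(values):
    """Heuristically label a column (assumption: values are str, int, float, or None)."""
    non_missing = [v for v in values if v is not None]
    # Any non-numeric value makes the column qualitative (bool is not treated as numeric)
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_missing
    )
    if not is_numeric:
        return ("Qualitative", "N/A")
    # Numeric columns holding only whole numbers are treated as counts
    if all(float(v).is_integer() for v in non_missing):
        return ("Quantitative", "Discrete")
    return ("Quantitative", "Continuous")


# Made-up sample values for four of the datasets in the table
columns = {
    "customer_gender": ["F", "M", "F"],
    "monthly_salary": [4200.50, 3100.00, 5890.75],
    "number_of_orders": [3, 0, 12],
    "session_time_min": [1.5, 12.25, 0.8],
}

for name, values in columns.items():
    print(name, classify_column(values))
```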

🧠 Key Takeaways

✔ Always identify data type before analysis
✔ Choose statistical methods based on data nature
✔ Understand data collection to avoid bias
✔ Sampling impacts accuracy and conclusions


🔗 What’s Next in This Series?

👉 Part 3: Descriptive Statistics – Mean, Median, Mode & Variability


Tuesday, December 16, 2025

📊 Introduction to Statistics in Data Science: A Complete Beginner’s Guide

Statistics is the backbone of Data Science.

From understanding raw data to building predictive models, statistics helps data scientists make sense of uncertainty, patterns, and trends hidden inside data.

In this detailed guide, you’ll learn:

  • What statistics is (in simple terms)
  • Why statistics is critical for data science
  • Key statistical concepts every data scientist must know
  • Difference between statistics, mathematics, and machine learning
  • Real-world applications of statistics in data science



📌 What is Statistics?

Statistics is the science of collecting, analyzing, interpreting, and presenting data.

In simple words:

Statistics helps us turn raw data into meaningful insights.

Example:

If a company collects sales data from 10,000 customers, statistics helps answer:

  • What is the average purchase value?
  • Which product sells the most?
  • Is sales increasing or decreasing over time?

Without statistics, data is just numbers with no meaning.


📌 Why Statistics is the Backbone of Data Science

Data Science is not just about coding or machine learning.
At its core, it is about decision-making using data, and statistics provides the foundation for that.

Why statistics is essential in data science:

1️⃣ Understanding Data

Before applying any machine learning algorithm, a data scientist must:

  • Understand data distribution
  • Detect outliers
  • Identify missing values
  • Summarize data using statistical measures

2️⃣ Making Inferences from Data

Statistics helps answer questions like:

  • Is this result significant or just random?
  • Can we generalize sample results to the population?
  • How confident are we in our predictions?

3️⃣ Model Evaluation

Statistical concepts are used to:

  • Measure model accuracy
  • Compare multiple models
  • Validate assumptions
  • Avoid overfitting

4️⃣ Decision Making Under Uncertainty

Real-world data is noisy and imperfect.
Statistics allows data scientists to quantify uncertainty and make informed decisions.


📌 Key Statistical Concepts Used in Data Science

🔹 1. Descriptive Statistics

Descriptive statistics summarize and describe data.

Common measures include:

  • Mean (Average)
  • Median
  • Mode
  • Variance
  • Standard Deviation
  • Percentiles

📊 Example:
Average salary of employees, highest score in an exam, monthly revenue summary.
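
The measures listed above can be computed with Python's built-in `statistics` module. The salary figures below are invented for illustration; note how the single large value pulls the mean well above the median.

```python
import statistics

# Hypothetical employee salaries, including one outlier
salaries = [32000, 35000, 35000, 41000, 48000, 52000, 120000]

mean = statistics.mean(salaries)
median = statistics.median(salaries)
mode = statistics.mode(salaries)
variance = statistics.variance(salaries)          # sample variance (divides by n - 1)
std_dev = statistics.stdev(salaries)              # sample standard deviation
quartiles = statistics.quantiles(salaries, n=4)   # three cut points: Q1, Q2, Q3

print("Mean:", round(mean, 2))
print("Median:", median)
print("Mode:", mode)
print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 2))
print("Quartiles:", quartiles)
```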


🔹 2. Probability

Probability measures the likelihood of an event occurring.

Why probability matters in data science:

  • Used in predictive modeling
  • Foundation of machine learning algorithms
  • Helps estimate risk and uncertainty

📌 Example:

What is the probability that a customer will churn next month?
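
One simple way to approach the churn question is to estimate the probability as a relative frequency from past data. The churn history below is made up, and the second calculation assumes customers churn independently with the same probability, which real customers rarely do exactly.

```python
# Hypothetical history for 20 customers: 1 = churned last month, 0 = stayed
churned = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

n = len(churned)
p_churn = sum(churned) / n  # relative frequency as a probability estimate

# Under the independence assumption, the chance that at least one
# of 5 customers churns is 1 - (1 - p)^5
p_at_least_one = 1 - (1 - p_churn) ** 5

print("Estimated churn probability:", p_churn)
print("P(at least one of 5 churns):", round(p_at_least_one, 4))
```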


🔹 3. Inferential Statistics

Inferential statistics allows us to draw conclusions about a population using a sample.

Key techniques:

  • Confidence Intervals
  • Hypothesis Testing
  • Statistical Significance

📌 Example:

Can we conclude that a new marketing strategy increased sales?
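
A question like this is often framed as a hypothesis test. The sketch below uses a two-proportion z-test, which is one common choice rather than the only valid one, and the visitor and sales counts are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical counts (not from any real campaign)
control_visitors, control_sales = 1000, 100       # old strategy
treatment_visitors, treatment_sales = 1000, 130   # new strategy

p_control = control_sales / control_visitors
p_treatment = treatment_sales / treatment_visitors

# Pooled conversion rate under the null hypothesis of no difference
p_pooled = (control_sales + treatment_sales) / (control_visitors + treatment_visitors)
std_error = math.sqrt(
    p_pooled * (1 - p_pooled) * (1 / control_visitors + 1 / treatment_visitors)
)

z = (p_treatment - p_control) / std_error
# One-sided p-value: chance of a lift at least this large if the strategy had no effect
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, one-sided p-value = {p_value:.4f}")
```

A small p-value (conventionally below 0.05) suggests the lift is unlikely to be random noise, but it does not by itself show the effect is large enough to matter for the business.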


🔹 4. Data Distributions

Understanding how data is distributed is crucial.

Common distributions:

  • Normal Distribution
  • Binomial Distribution
  • Poisson Distribution

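📊 Some models and statistical tests (for example, inference in linear regression and Gaussian Naive Bayes) assume normally distributed data or errors, so check the distribution before relying on them.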


🔹 5. Correlation and Regression

These techniques help understand relationships between variables.

  • Correlation: Measures strength of relationship
  • Regression: Predicts one variable using others

📌 Example:

How does advertising spend affect sales?
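
To make the two ideas concrete, the sketch below computes the Pearson correlation and a least-squares line by hand. The advertising and sales figures are invented, and with only five points the result is purely illustrative.

```python
import math

# Hypothetical data, both in thousands
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 103]

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(sales) / n

# Sums of squares and cross-products around the means
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, sales))
s_xx = sum((x - mean_x) ** 2 for x in ad_spend)
s_yy = sum((y - mean_y) ** 2 for y in sales)

# Correlation: strength of the linear relationship, between -1 and 1
r = s_xy / math.sqrt(s_xx * s_yy)

# Regression: sales is approximated by intercept + slope * ad_spend
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(f"correlation r = {r:.3f}")
print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend")
```

Here the slope estimates how much sales change per extra unit of ad spend, while r only says how closely the points follow a straight line. Neither number proves that advertising causes the change.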


📌 Statistics vs Mathematics vs Machine Learning

| Aspect | Statistics | Mathematics | Machine Learning |
|---|---|---|---|
| Purpose | Analyze & interpret data | Abstract problem solving | Learn patterns from data |
| Focus | Uncertainty & inference | Theory & proofs | Prediction & automation |
| Data | Real-world data | Often theoretical | Large datasets |
| Output | Insights & decisions | Equations | Models & predictions |

👉 Statistics bridges mathematics and machine learning.


📌 Real-World Use Cases of Statistics in Data Science

🏦 1. Business & Marketing

  • Customer segmentation
  • A/B testing
  • Demand forecasting
  • Pricing optimization

💰 2. Finance

  • Risk analysis
  • Fraud detection
  • Portfolio optimization
  • Credit scoring

πŸ₯ 3. Healthcare

  • Clinical trials
  • Disease prediction
  • Treatment effectiveness analysis

🛒 4. E-Commerce

  • Recommendation systems
  • Conversion rate optimization
  • Customer churn analysis

📱 5. Technology & AI

  • Model evaluation
  • Feature selection
  • Performance metrics


📌 Simple Example Using Python

import numpy as np

data = [50, 60, 70, 80, 90]

mean = np.mean(data)
std_dev = np.std(data)  # population standard deviation (ddof=0 by default)

print("Mean:", mean)
print("Standard Deviation:", std_dev)

📌 This basic statistical analysis helps understand data spread before modeling.
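
One detail worth knowing: `np.std` divides by n by default, which gives the population standard deviation. When the data is a sample, the n - 1 version is usually preferred. The sketch below shows the difference on the same five numbers using only the standard library.

```python
import statistics

data = [50, 60, 70, 80, 90]

population_sd = statistics.pstdev(data)  # divides by n, like np.std(data)
sample_sd = statistics.stdev(data)       # divides by n - 1, like np.std(data, ddof=1)

print("Population SD:", round(population_sd, 2))  # 14.14
print("Sample SD:", round(sample_sd, 2))          # 15.81
```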


📌 Why Data Scientists Must Learn Statistics First

Many beginners jump directly into machine learning, but without statistics:

  • Models become black boxes
  • Results are misinterpreted
  • Decisions become risky

👉 Strong statistics = strong data scientist


📌 Final Thoughts

Statistics is not optional in data science; it is foundational.

Whether you’re analyzing customer data, building predictive models, or evaluating AI systems, statistics ensures your insights are accurate, reliable, and meaningful.

In the next post of this series, we’ll dive deeper into Types of Data and Data Collection Methods in Data Science.



What’s Next in This Series?

👉 Part 2: Types of Data & Data Collection Methods in Data Science



Tuesday, August 5, 2025

🔄 Mastering IF-THEN-ELSE and DO Loops in SAS – Complete Guide with Examples

🔰 Introduction

Conditional logic and looping are core concepts in any programming language. In SAS, IF-THEN-ELSE statements and DO loops provide powerful tools for controlling the flow of your Data Step code. This guide explains how to use these tools effectively with real-world examples.



🧩 1. IF-THEN-ELSE in SAS

The IF-THEN-ELSE statement allows you to execute specific code based on conditions.

✅ Syntax:

IF condition THEN action;
ELSE IF condition THEN action;
ELSE action;

📌 Example 1: Simple IF-THEN-ELSE

data class_flag;
    set sashelp.class;
    if age < 13 then group = 'Child';
    else if age < 18 then group = 'Teen';
    else group = 'Adult';
run;

Explanation:
Classifies students into Child, Teen, or Adult groups based on age.


🧪 More IF-THEN-ELSE Examples in SAS


📌 Example 1: Assign Grades Based on Scores

data grades;
    input name $ score;
    if score >= 90 then grade = 'A';
    else if score >= 80 then grade = 'B';
    else if score >= 70 then grade = 'C';
    else if score >= 60 then grade = 'D';
    else grade = 'F';
    datalines;
John 85
Sara 92
Alex 67
Nina 74
Bob 58
;
run;

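📌 Example 2: Handle Missing Values in Conditions

data test_missing;
    input id age;
    /* Check for missing first: a missing value sorts below every number, */
    /* so "age < 18" would otherwise be true for it */
    if age = . then status = 'Missing';
    else if age < 18 then status = 'Minor';
    else status = 'Adult';
    datalines;
1 25
2 .
3 17
;
run;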

📌 Example 3: Create Flags for Categorical Variables

data product_flag;
    /* Default character length is 8, which would truncate 'Electronics' */
    length product $10 category $12;
    input product $ category $;
    if category = 'Electronics' then flag = 1;
    else flag = 0;
    datalines;
Laptop Electronics
Shoes Apparel
Phone Electronics
Watch Accessories
;
run;

📌 Example 4: Nested IF-THEN-ELSE Logic

data nested_logic;
    /* Without LENGTH, warning takes length 8 from 'Heatwave' and 'Check city' is truncated */
    length warning $10;
    input city $ temp;
    if city = 'Delhi' then do;
        if temp > 40 then warning = 'Heatwave';
        else warning = 'Normal';
    end;
    else warning = 'Check city';
    datalines;
Delhi 45
Delhi 30
Mumbai 32
;
run;

📌 Example 5: Use with Multiple Variables

data risk_check;
    /* Without LENGTH, risk takes length 4 from 'High' and 'Medium' becomes 'Medi' */
    length risk $6;
    input age income;
    if age < 25 and income < 30000 then risk = 'High';
    else if age >= 25 and income < 30000 then risk = 'Medium';
    else risk = 'Low';
    datalines;
22 25000
28 28000
35 60000
;
run;

📌 Example 6: Case-Insensitive Character Comparison

data department;
    input empname $ dept $;
    if upcase(dept) = 'HR' then team = 'Human Resources';
    else team = 'Other';
    datalines;
John HR
Alex finance
Sara hr
;
run;

📌 Example 7: Assign Labels to Numeric Ranges

data salary_bracket;
    /* Without LENGTH, bracket takes length 3 from 'Low' and longer labels are truncated */
    length bracket $6;
    input empid salary;
    if salary < 30000 then bracket = 'Low';
    else if 30000 <= salary < 60000 then bracket = 'Medium';
    else bracket = 'High';
    datalines;
101 25000
102 32000
103 60000
104 58000
;
run;

📌 Example 8: IF Without ELSE (Slower)

data no_else;
    set sashelp.class;
    if age < 12 then group = 'Preteen';
    if age >= 12 then group = 'Teen';
run;

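💡 Best Practices:

  • Use IF-THEN/ELSE instead of multiple IF statements for better performance: SAS skips the remaining ELSE branches once a condition is true.
  • When checking for missing numeric values, use IF var = . or the MISSING(var) function, which also covers character variables and special missing values.
  • Add a LENGTH statement for character variables that are assigned labels of different lengths, so longer values are not truncated.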


🔄 2. DO Loops in SAS

DO loops are used to repeat code a specified number of times or while a condition is true.


✅ 2.1 DO Loop Syntax

do index = start to end;
    /* repeated statements */
end;

📌 Example 2: Basic DO Loop

data loop_example;
    do i = 1 to 5;
        square = i**2;
        output;
    end;
run;

Explanation:
Generates a dataset with numbers 1 to 5 and their squares.


✅ 2.2 DO WHILE Loop

data do_while;
    x = 1;
    do while (x < 5);   /* condition is checked before each iteration */
        square = x**2;
        output;
        x + 1;          /* sum statement: adds 1 to x */
    end;
run;

✅ 2.3 DO UNTIL Loop

data do_until;
    x = 1;
    do until (x > 5);   /* condition is checked after each iteration */
        cube = x**3;
        output;
        x + 1;
    end;
run;

🧠 Combining IF and DO Loops

data even_numbers;
    do i = 1 to 10;
        if mod(i, 2) = 0 then output;
    end;
run;

Explanation:
Generates only even numbers between 1 and 10 using both DO loop and IF condition.


⚠️ Common Mistakes to Avoid

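| Mistake | Fix |
|---|---|
| Missing OUTPUT in DO loop | Add output; inside the loop when each iteration should write a row |
| Forgetting semicolons | End each SAS statement with ; |
| Using multiple IF instead of IF-THEN-ELSE | Chain the conditions with ELSE IF so SAS stops checking after the first match |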

🔚 Conclusion

Mastering IF-THEN-ELSE and DO loops in SAS empowers you to create dynamic, flexible, and readable data processing routines. Whether you're classifying data, creating new variables, or iterating through logic, these tools are fundamental to writing efficient SAS programs.


Monday, August 4, 2025

📊 PROC RANK in SAS – Rank, Percentile, and Group Your Data Easily

Introduction

In data analysis, ranking values is essential for identifying top performers, segmenting data, and calculating percentiles. PROC RANK in SAS makes this process easy by assigning ranks, percentiles, or group numbers to numeric variables.



🔧 Syntax of PROC RANK

PROC RANK DATA=input_dataset OUT=output_dataset
          <TIES=MEAN|LOW|HIGH|DENSE> <GROUPS=n>;
    BY group_variable;
    VAR variable_to_rank;
    RANKS rank_variable;
RUN;

Key Options Explained:

| Option / Statement | Description |
|---|---|
| DATA= | Input dataset |
| OUT= | Output dataset with new rank variable |
| TIES= | Specifies how tied values are handled (default is MEAN) |
| GROUPS= | Assigns group numbers 0 to n-1, splitting the data into roughly equal-sized groups (like quartiles or deciles) |
| BY | Perform ranking within each BY-group |
| VAR | Variable to rank |
| RANKS | Name of the new variable that stores the rank |

📌 Example 1: Basic Ranking

proc rank data=sashelp.class out=ranked_class;
    var height;
    ranks height_rank;
run;

Explanation:
Ranks students in sashelp.class by their height, storing the result in height_rank.


📌 Example 2: Ranking within Groups

proc sort data=sashelp.class out=sorted_class;
    by sex;
run;

proc rank data=sorted_class out=ranked_sex;
    by sex;
    var weight;
    ranks weight_rank;
run;

Explanation:
Ranks weight within each sex group.


📌 Example 3: Create Percentile or Quantile Groups

proc rank data=sashelp.class out=grouped_class groups=4;
    var age;
    ranks age_quartile;
run;

Explanation:
Divides age into 4 quartile groups (0 to 3).


📌 TIES= Option in Action

proc rank data=sashelp.class out=ranked_ties ties=low;
    var height;
    ranks height_rank;
run;

TIES= Options:

  • LOW – Lowest rank for all ties
  • HIGH – Highest rank for all ties
  • MEAN – Average rank (default)
  • DENSE – No gaps between ranks


✅ When to Use PROC RANK

  • Ranking top N values
  • Creating quantile-based bins (e.g., deciles, quartiles)
  • Calculating percentiles
  • Segmenting customers or products
  • Normalizing scorecards


🧠 Tips for Using PROC RANK

  • Always sort the dataset before using BY.
  • Use GROUPS= for percentiles or bucketing.
  • To rank multiple variables, list them in one VAR statement and give the matching rank names, in the same order, in one RANKS statement.
  • Combine with PROC SQL or PROC PRINT for better reporting.


📎 Final Thoughts

PROC RANK is a powerful yet simple procedure in SAS that enables effective data ranking and segmentation. It’s especially useful in scoring, customer segmentation, and exploratory data analysis.


Saturday, July 12, 2025

✅ All About PROC SUMMARY in SAS: A Comprehensive Guide with Practical Examples

📘 Introduction

PROC SUMMARY in SAS is a powerful procedure used to generate summary statistics for numeric variables. It shares its syntax and statistics with PROC MEANS. The main differences are that PROC SUMMARY prints nothing by default and, when no VAR statement is given, only counts observations, which makes it well suited to building output datasets quietly.

In this guide, you'll learn:

  • Complete syntax of PROC SUMMARY
  • All available options and statements
  • Grouped examples
  • How it differs from PROC MEANS
  • Output datasets and tips



🛠 Syntax of PROC SUMMARY

PROC SUMMARY <options>;
    VAR variable(s);
    CLASS variable(s);
    BY variable(s);
    OUTPUT OUT=dataset <stat-options>;
RUN;

βš™οΈ Common Options in PROC SUMMARY

OptionDescription
DATA=Specifies the input dataset
NCount of non-missing values
MEANMean or average value
STDStandard deviation
MINMinimum value
MAXMaximum value
SUMTotal sum
MAXDEC=Maximum number of decimals
NWAYOutputs only rows with all CLASS variables present
CHARTYPEAdds a TYPE variable to output

📄 Statements in PROC SUMMARY

| Statement | Purpose |
|---|---|
| VAR | Specifies numeric variables to analyze |
| CLASS | Group summary statistics by categorical variables |
| BY | BY-group processing; data must be sorted |
| OUTPUT | Outputs summary statistics to a dataset |

✅ PROC SUMMARY vs PROC MEANS

| Feature | PROC MEANS | PROC SUMMARY |
|---|---|---|
| Displays output | Yes (default) | No (unless PRINT) |
| When VAR is omitted | Analyzes all numeric variables | Counts observations only |
| Typical use | For display/review | For data pipelines |
| Output dataset | Optional | Common |

🧪 Examples of PROC SUMMARY

Example 1: Basic Summary Statistics

proc summary data=sashelp.class print;
run;

Note: Use PRINT to display output. Because there is no VAR statement here, PROC SUMMARY reports only the number of observations.


Example 2: Specifying Variables

proc summary data=sashelp.class print;
    var height weight;
run;

Example 3: Using CLASS Statement

proc summary data=sashelp.class print;
    class sex;
    var height weight;
run;

Example 4: Creating Output Dataset

proc summary data=sashelp.class n mean maxdec=1;
    class sex;
    var height weight;
    output out=summary_stats;
run;

Example 5: Custom Output Variable Names

proc summary data=sashelp.class;
    class sex;
    var weight;
    output out=summary_data
        mean=mean_weight
        max=max_weight
        min=min_weight;
run;

Example 6: Using BY Statement (sorted data)

proc sort data=sashelp.class out=sorted;
    by sex;
run;

proc summary data=sorted print;
    by sex;
    var height;
run;

🔄 Output Options in PROC SUMMARY

You can customize the summary stats by combining the following keywords in the OUTPUT statement:

| Keyword | Description |
|---|---|
| N= | Assigns name to N (count) |
| MEAN= | Assigns name to mean |
| STD= | Assigns name to standard deviation |
| SUM= | Assigns name to sum |
| MIN= | Assigns name to minimum |
| MAX= | Assigns name to maximum |

Example:

output out=myout n=n_obs mean=avg std=stdev;

🧠 Tips for PROC SUMMARY

  • Always use PRINT if you want to display results.
  • Use NWAY if you need only fully classified combinations.
  • Give output variables meaningful names with the statistic= syntax, for example mean=avg_height.
  • Great for creating reusable summary datasets.


📦 Summary Table

| Feature | Description |
|---|---|
| Primary Use | Summary statistics |
| Shows Output | No (by default) |
| Supports Groups | Yes (CLASS or BY) |
| Custom Output | Yes (OUTPUT statement) |
| Output Dataset | Yes |
| Flexible Output | Highly customizable |

Β 

Click here to Read more Β»
