Personal Learning Notes · BRUR · Statistics

Undergraduate Course
Learning Notes

A complete personal reference covering undergraduate statistics courses — definitions, theory, formulas, visualizations, applications, and critical notes from lectures and textbooks at Begum Rokeya University, Rangpur.

STAT1101 · Principles of Statistics I | STAT1201 · Principles of Statistics II | STAT1102 · Probability Theory | STAT2102 · Probability Distributions | STAT2101 · Regression Analysis & Diagnostics | STAT3203 · Econometrics | STAT2201 · Sampling Distribution | STAT2203 · ANOVA & Design of Experiment | STAT3201 · Hypothesis Testing | STAT4101 · Multivariate Distribution | STAT4201 · Multivariate Analysis II | STAT4102 · Sampling Techniques | STAT4106 · Categorical Data Analysis | STAT4104 · Research Methodology
σ
STAT1101 · STAT1201 · B.Sc. Statistics · BRUR
Principles of Statistics I & II
Statistics & Origin · Central Tendency · Dispersion · Index Numbers · Time Series · Correlation · Regression · Attributes · Shape · Bivariate
01
Foundation
What is Statistics?
📖
What it is

The Science of Data

Statistics is the science of collecting, organising, analysing, interpreting, and presenting data to make informed decisions and draw conclusions under uncertainty.

💡
Two Branches

Descriptive vs Inferential

  • Descriptive: Summarises & describes data (means, charts, tables)
  • Inferential: Draws conclusions about a population from a sample using probability
Where to Use

Applications

  • Medical research & clinical trials
  • Economics & finance forecasting
  • Government census & planning
  • Machine learning & AI systems
  • Agriculture & environmental studies
⚠️
Where NOT to Use

Cautions

  • Predicting individuals with certainty
  • When data quality is very poor
  • Proving causation from correlation alone
  • Non-homogeneous data without caution
Key Quote: "Statistics is the grammar of science." — Karl Pearson. It converts raw numbers into knowledge.
· · ·
02
Background
History & Origins
🏛️
Ancient Roots

Early Beginnings

  • Babylonians collected census data ~3000 BCE
  • Egyptians used data for pyramid construction planning
  • Romans conducted systematic population censuses
  • India: Arthashastra of Kautilya mentions data collection
📜
Modern Development

17th–20th Century

  • Graunt (1662): Bills of Mortality — first statistical study of births and deaths
  • Gauss & Laplace: Normal distribution, method of least squares
  • Pearson: Correlation coefficient r, chi-square test
  • Fisher: ANOVA, experimental design, p-values, maximum likelihood
Etymology: From the Latin statisticum collegium ("council of state") and Italian statista ("statesman") — originally data useful to the state. The word entered English as statistics in the 18th century.
· · ·
03
Core Concepts
Definitions & Classifications
Population · Sample · Parameter · Statistic · Variable · Attribute · Quantitative · Qualitative · Discrete · Continuous · Nominal · Ordinal · Interval · Ratio
🔍
Key Definitions

Must-Know Terms

  • Population (N): Complete set of all items of interest
  • Sample (n): Subset of population actually studied
  • Parameter: Numerical measure of a population (μ, σ, π)
  • Statistic: Numerical measure of a sample (x̄, s, p̂)
  • Variable: A measurable characteristic that varies
🗂️
Levels of Measurement

Classification of Data

  • Nominal: Labels/categories only — gender, blood type, colour. No ordering.
  • Ordinal: Ordered categories, unequal intervals — ranking, education level
  • Interval: Equal intervals, no true zero — temperature (°C, °F), IQ
  • Ratio: True zero exists — weight, height, income, time
Levels of Measurement — Hierarchy
Nominal (labels only: gender, colour) ⊂ Ordinal (+ order: rankings, grades) ⊂ Interval (+ equal intervals, no true zero: temperature °C) ⊂ Ratio (+ true zero, ratios meaningful, e.g. 2× as heavy: weight, height). Information content increases from Nominal to Ratio.
· · ·
04
Practical View
Uses, Importance & Limitations
Major Uses

Why We Use Statistics

  • Simplifying complex masses of data into meaningful summaries
  • Comparing groups, phenomena, and time periods
  • Establishing relationships between variables
  • Forecasting future trends based on past data
  • Testing hypotheses scientifically with rigour
Importance

Why It Matters

  • Basis for evidence-based policy and decision-making
  • Essential in every science, social study, and industry
  • Enables uncertainty quantification and risk assessment
  • Guides business, economic, and medical decisions
⚠️
Limitations

What Statistics Cannot Do

  • Deals only with quantifiable, aggregated facts
  • Results can be misused or deliberately manipulated
  • Statistical laws apply to groups, not individuals
  • Requires homogeneous, high-quality data
  • Cannot prove causation on its own
· · ·
05
Data Collection
Sources of Statistical Data
🔵
Primary Sources

Original Data (First-hand)

  • Direct personal observation
  • Questionnaires & structured surveys
  • Interviews (direct/indirect methods)
  • Experimental data from controlled studies
  • Registration systems (births, deaths, marriages)
📂
Secondary Sources

Existing/Published Data

  • Government publications & national census
  • Research journals, reports & theses
  • International agencies (UN, WHO, World Bank, IMF)
  • Newspapers, almanacs, online databases
💡
Which to Choose?

Primary vs Secondary

Use primary when precision & specificity are critical and budget allows. Use secondary when time/cost are constraints. Always check secondary data for reliability, suitability, and adequacy before use.

· · ·
06
Data Pipeline
Processing & Preprocessing
⚙️
Steps in the Process

Data Processing Pipeline

  • Editing: Check for errors, omissions, inconsistencies
  • Coding: Assign numerical values to categorical responses
  • Classification: Group data into meaningful classes
  • Tabulation: Arrange data in tables (frequency distributions)
  • Presentation: Charts, graphs, diagrams for communication
📊
Frequency Distributions

Organising Raw Data

  • Class interval, class limits, class mark (midpoint)
  • Class frequency & relative frequency (proportion)
  • Cumulative frequency (less than / greater than)
  • Histogram, Frequency Polygon, Ogive (cumulative curve)
Golden Rule of Preprocessing: "Garbage in, garbage out." Cleaning the data is the most critical step — missing values, outliers, and coding errors must be detected and handled before any statistical analysis.
Histogram — Frequency Distribution Concept
Bars with frequencies 5, 12, 18, 15, 9, 3 over class intervals 10–20 through 60–70, with a frequency polygon overlaid.
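The editing-to-tabulation steps above can be sketched numerically. A minimal sketch of building a frequency distribution with relative and "less than" cumulative frequencies; the raw values, class width, and starting limit are invented for illustration:

```python
# Group raw data (invented) into class intervals of width 10 starting at 10
raw = [12, 18, 25, 31, 37, 42, 45, 51, 55, 58, 63, 22, 28, 34, 47]

width, lo = 10, 10
freq = {}
for v in raw:
    start = lo + ((v - lo) // width) * width   # lower class limit of v's interval
    freq[(start, start + width)] = freq.get((start, start + width), 0) + 1

total = sum(freq.values())
rel = {c: f / total for c, f in freq.items()}   # relative frequency (proportion)

cum, less_than = 0, {}                          # "less than" cumulative frequency
for c in sorted(freq):
    cum += freq[c]
    less_than[c[1]] = cum
```

Plotting `freq` as bars gives the histogram; plotting class marks against frequencies gives the frequency polygon; plotting `less_than` gives the ogive.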
· · ·
07
Descriptive Statistics
Measures of Central Tendency
📖
What it is

The "Centre" of Data

A single value representing the typical or central value in a dataset. The three primary measures are Mean, Median, and Mode, each optimal under different data conditions.

🔢
Key Formulas

The Big Five

  • AM: Σx / n  — arithmetic average
  • Median: Middle value in sorted data
  • Mode: Most frequently occurring value
  • GM: (x₁·x₂·…·xₙ)^(1/n) — for ratios, growth
  • HM: n / Σ(1/xᵢ) — for rates & speeds
When to Use Each

Right Tool, Right Job

  • Mean: Symmetric data, no extreme outliers, interval/ratio scale
  • Median: Skewed distributions, income, housing prices, ordinal data
  • Mode: Categorical data, most popular item, bimodal distributions
  • GM: Ratios, growth rates, compound interest, index numbers
  • HM: Averaging rates, speeds, prices per unit
⚠️
Cautions

Common Mistakes

  • Mean is highly sensitive to outliers — check for skewness first
  • Mode may not exist or may not be unique (bimodal)
  • Never compute the mean for nominal or ordinal data
  • AM ≥ GM ≥ HM always (equality only when all values equal)
Arithmetic Mean: x̄ = (1/n) · Σᵢ xᵢ
Geometric Mean: GM = (x₁ · x₂ · … · xₙ)^(1/n) = exp[(1/n) Σ ln xᵢ]
Harmonic Mean: HM = n / Σᵢ (1/xᵢ)
Inequality: HM ≤ GM ≤ AM   (always; equality iff all xᵢ equal)
Median (odd n): M = x₍(n+1)/2₎ after sorting
Median (even n): M = [x₍n/2₎ + x₍n/2+1₎] / 2
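The formulas above can be checked numerically. A minimal sketch with made-up positive data, verifying the AM ≥ GM ≥ HM inequality:

```python
import math
from statistics import mean, median, mode

# Illustrative positive data (values invented for the example)
data = [2, 4, 4, 4, 5, 5, 7, 9]

am = mean(data)                                             # Σx / n
gm = math.exp(sum(math.log(x) for x in data) / len(data))   # (Πx)^(1/n)
hm = len(data) / sum(1 / x for x in data)                   # n / Σ(1/x)

# The classic inequality holds for any positive data
assert am >= gm >= hm
```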
Central Tendency — Symmetric vs Skewed Distributions
Symmetric: Mean = Median = Mode. Right skewed (+): Mode < Median < Mean, i.e. Mean > Median > Mode.
· · ·
08
Spread of Data
Measures of Dispersion
📏
What it is

Quantifying Variability

Dispersion measures the spread or variability in a dataset. Two datasets can have the same mean but vastly different spreads — dispersion captures this critical difference.

⚙️
All Measures

Absolute & Relative

  • Range: Max − Min (simplest; very sensitive to outliers)
  • Quartile Deviation (QD): (Q3−Q1)/2
  • Mean Deviation (MD): Σ|x−x̄| / n
  • Variance (σ²): Σ(x−x̄)² / n or s² = Σ(x−x̄)² / (n−1)
  • Std Deviation (σ): √Variance
  • Coeff. of Variation (CV): (σ/x̄)×100 — unit-free comparator
💡
Main Idea

Absolute vs Relative

  • Absolute: Range, SD, Variance — in original units; cannot compare datasets with different units
  • Relative: CV — unit-free percentage; use to compare variability across different datasets
Population Variance: σ² = (1/N) · Σᵢ (xᵢ − μ)²
Sample Variance: s² = (1/(n−1)) · Σᵢ (xᵢ − x̄)²
Std Deviation: σ = √[ Σᵢ (xᵢ − μ)² / N ]
Computing formula: σ² = (1/n) Σ xᵢ² − x̄²
Coeff. of Variation: CV = (σ / x̄) × 100%
Quartile Deviation: QD = (Q3 − Q1) / 2
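A quick numeric check of the variance formulas, including the computing shortcut σ² = (1/n)Σx² − x̄²; the data values are invented:

```python
from statistics import mean, pstdev, stdev

data = [10, 12, 23, 23, 16, 23, 21, 16]   # hypothetical sample

x_bar = mean(data)        # 18.0
sigma = pstdev(data)      # population SD (divide by N)
s = stdev(data)           # sample SD (divide by n−1); always ≥ pstdev
cv = sigma / x_bar * 100  # coefficient of variation: unit-free %

# Computing formula check: σ² = (1/n)Σx² − x̄²
var_short = sum(x * x for x in data) / len(data) - x_bar ** 2
assert abs(var_short - sigma ** 2) < 1e-9
```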
· · ·
09
Economic Measurement
Index Numbers
📈
What it is

Relative Change Measure

An index number measures the relative change in a variable (or group) compared to a base period. Expressed as a percentage relative to the base (base period = 100). Used to track changes over time.

⚙️
Types

Key Methods

  • Laspeyres Index: Uses base-period quantities as weights
  • Paasche Index: Uses current-period quantities as weights
  • Fisher's Ideal Index: Geometric mean of Laspeyres & Paasche — satisfies time reversal & factor reversal tests
  • Value index: Ratio of current to base-period value
Real-World Use

Applications

  • Consumer Price Index (CPI) — measuring inflation
  • Stock market indices (S&P 500, BSE Sensex)
  • Human Development Index (HDI)
  • Adjusting wages for purchasing power
Laspeyres P-Index: L = (Σ p₁q₀) / (Σ p₀q₀) × 100
Paasche P-Index: P = (Σ p₁q₁) / (Σ p₀q₁) × 100
Fisher Ideal Index: F = √(L × P)
Simple Price Rel.: P₀₁ = (p₁ / p₀) × 100
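A minimal sketch computing all three indices on an invented two-item basket (tuples of base price, base quantity, current price, current quantity):

```python
import math

# Hypothetical basket: (p0, q0, p1, q1) per item
basket = [(10, 5, 12, 4), (20, 3, 22, 5)]

L = 100 * sum(p1 * q0 for p0, q0, p1, q1 in basket) / sum(p0 * q0 for p0, q0, p1, q1 in basket)
P = 100 * sum(p1 * q1 for p0, q0, p1, q1 in basket) / sum(p0 * q1 for p0, q0, p1, q1 in basket)
F = math.sqrt(L * P)   # Fisher's ideal index always lies between L and P
```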
· · ·
10
Temporal Data
Time Series Basics
🕐
What it is

Data Over Time

A time series is a sequence of data points collected at successive, equally-spaced time intervals. Goal: identify patterns, decompose components, and forecast future values.

💡
4 Components

Decomposition (TSCI)

  • Trend (T): Long-term direction (upward/downward/stationary)
  • Seasonal (S): Regular periodic fluctuations within a year
  • Cyclical (C): Long-run waves lasting 2–10 years (business cycles)
  • Irregular (I): Random, unpredictable residual variation
⚙️
Methods

Trend Estimation

  • Moving averages: Simple smoothing of irregular fluctuations
  • Least squares: Fit linear/polynomial trend equation
  • Exponential smoothing: Weighted past observations
Additive Model: Y = T + S + C + I
Multiplicative Model: Y = T × S × C × I
Trend Line (OLS): Ŷ = a + bt   (t = coded time)
3-Period Moving Avg: MA₃ = (Yₜ₋₁ + Yₜ + Yₜ₊₁) / 3
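The 3-period moving average above is easy to implement directly; a sketch on an invented series (note one point is lost at each end):

```python
def moving_average(y, window=3):
    """Centred simple moving average; smooths irregular fluctuations."""
    half = window // 2
    return [sum(y[i - half:i + half + 1]) / window
            for i in range(half, len(y) - half)]

series = [12, 15, 14, 18, 20, 19, 23]   # invented time-series values
ma3 = moving_average(series)            # smoother than the raw series
```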
· · ·
11
Bivariate Analysis
Correlation
🔗
What it is

Measuring Association

Correlation measures the strength and direction of the linear relationship between two variables. The Pearson correlation coefficient r ranges from −1 to +1.

💡
Types

Types of Correlation

  • Positive (r > 0): Both variables increase together
  • Negative (r < 0): One increases, other decreases
  • Zero (r = 0): No linear relationship
  • Perfect (r = ±1): All points on a straight line
⚙️
Methods

How to Compute

  • Pearson's r: For interval/ratio data with linear relation
  • Spearman's ρ: For ordinal/ranked data or non-linear relations
  • Scatter diagram: Always plot first — visualise the relationship
⚠️
Critical Warning

Correlation ≠ Causation

High correlation does not prove one variable causes the other. A lurking (confounding) variable may drive both. Always investigate mechanism and theory before claiming causation.

Pearson's r: r = Σ(xᵢ−x̄)(yᵢ−ȳ) / √[Σ(xᵢ−x̄)² · Σ(yᵢ−ȳ)²]
r (computing form): r = [nΣxy − (Σx)(Σy)] / √{[nΣx²−(Σx)²][nΣy²−(Σy)²]}
Spearman's ρ: ρ = 1 − 6Σdᵢ² / [n(n²−1)]   (dᵢ = rank difference)
r² (Coeff. of Det.): r² = Explained variation / Total variation
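Pearson's r from the definitional (deviation) formula, on invented data:

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation via the deviation formula."""
    xb, yb = mean(x), mean(y)
    sxy = sum((a - xb) * (b - yb) for a, b in zip(x, y))
    sxx = sum((a - xb) ** 2 for a in x)
    syy = sum((b - yb) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)   # positive: y tends to rise with x
```

A variable is perfectly correlated with itself, so `pearson_r(x, x)` returns 1.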
Correlation Strength — Scatter Plot Patterns
Panels: r ≈ +0.95 (strong positive), r ≈ +0.50 (moderate positive), r ≈ 0 (no linear relationship), r ≈ −0.90 (strong negative).
· · ·
12
Prediction
Regression Analysis
📉
What it is

Line of Best Fit

Regression establishes a mathematical relationship to predict the value of a dependent variable (Y) from an independent variable (X) using the principle of Ordinary Least Squares (OLS).

⚙️
OLS Principle

Minimising Residuals

OLS minimises the sum of squared residuals (SSE) — the vertical distances between observed Y and predicted Ŷ. This gives the unique best-fit line through the data. Two regression lines exist: Y on X, and X on Y; they intersect at (x̄, ȳ).

💡
Link with Correlation

Regression Coefficients & r

  • b_yx × b_xy = r² (always)
  • r = √(b_yx × b_xy) when both same sign
  • Sign of b always equals sign of r
  • r² = proportion of variance explained
Regression Line Y on X: Ŷ = ȳ + b_yx(x − x̄)
Slope b_yx: b_yx = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² = r · (σ_y / σ_x)
Intercept a: a = ȳ − b_yx · x̄
Regression Line X on Y: X̂ = x̄ + b_xy(y − ȳ)
b_xy: b_xy = r · (σ_x / σ_y)
r² (Coeff. of Det.): r² = b_yx × b_xy = SSR / SST
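Both regression lines can be fitted and the identity b_yx × b_xy = r² verified on invented data:

```python
from statistics import mean

def fit_line(x, y):
    """OLS line of y on x: returns (intercept, slope)."""
    xb, yb = mean(x), mean(y)
    slope = (sum((a - xb) * (b - yb) for a, b in zip(x, y))
             / sum((a - xb) ** 2 for a in x))
    return yb - slope * xb, slope

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a_yx, b_yx = fit_line(x, y)   # Y on X
a_xy, b_xy = fit_line(y, x)   # X on Y
r_squared = b_yx * b_xy       # product of the two slopes = r²
```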
· · ·
13
Qualitative Data
Analysis of Attributes
🔤
What it is

Non-numerical Characteristics

Attributes are qualitative characteristics (literacy, colour, gender, disease) that are categorised rather than measured. Analysis counts classes and tests association between categories.

⚙️
Methods

Statistical Tools

  • Contingency tables: Cross-tabulation of two attributes
  • χ² test: Tests independence between attributes
  • Yule's Q: Coefficient of association (−1 to +1)
  • Consistency check: Ensure all class frequencies ≥ 0
💡
Association

When Are Attributes Related?

Two attributes are associated if their joint frequency differs from expectation under independence. Positive association: both present together more than chance. Negative: inversely linked.

Chi-square Test: χ² = Σ (O − E)² / E    (O = observed, E = expected)
Expected Frequency: E = (Row total × Column total) / Grand total
Yule's Q: Q = (AD − BC) / (AD + BC)
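Both statistics on a 2×2 contingency table; the cell counts are invented:

```python
# 2×2 table with invented counts:
#            B present  B absent
# A present      30        10
# A absent       20        40
a, b, c, d = 30, 10, 20, 40
n = a + b + c + d
obs = [[a, b], [c, d]]
row = [a + b, c + d]
col = [a + c, b + d]

# χ² = Σ (O − E)²/E with E = (row total × column total) / n
chi2 = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))

Q = (a * d - b * c) / (a * d + b * c)   # Yule's Q: 0 means no association
```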
· · ·
14
Distribution Shape
Shape Characteristics — Skewness & Kurtosis
〰️
Skewness

Asymmetry of Distribution

  • Symmetric (Sk=0): Mean = Median = Mode
  • Positive skew (+): Mean > Median > Mode — right tail longer
  • Negative skew (−): Mean < Median < Mode — left tail longer
📐
Kurtosis

Peakedness (Tailedness)

  • Mesokurtic (β₂=3): Normal distribution — standard shape
  • Leptokurtic (β₂>3): More peaked, heavier tails than normal
  • Platykurtic (β₂<3): Flatter peak, lighter tails than normal
  • Excess kurtosis = β₂ − 3
Pearson's Skewness: Sk₁ = (Mean − Mode) / σ
Pearson's 2nd: Sk₂ = 3(Mean − Median) / σ
Bowley's Skewness: Sk_B = (Q3 + Q1 − 2·Median) / (Q3 − Q1)
Kurtosis β₂: β₂ = μ₄ / σ⁴   (4th central moment / (σ²)²)
Excess Kurtosis: γ₂ = β₂ − 3   (= 0 for normal)
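Moment-based skewness and kurtosis computed directly from the central-moment definitions, on an invented right-skewed sample:

```python
from statistics import mean, pstdev

def shape(data):
    """Moment skewness m₃/σ³ (signed) and kurtosis β₂ = μ₄/σ⁴."""
    mu, sigma, n = mean(data), pstdev(data), len(data)
    m3 = sum((x - mu) ** 3 for x in data) / n
    m4 = sum((x - mu) ** 4 for x in data) / n
    return m3 / sigma ** 3, m4 / sigma ** 4

skew, beta2 = shape([2, 3, 3, 4, 4, 4, 5, 5, 10])  # long right tail from the 10
# skew > 0 (right-skewed); beta2 > 3 (leptokurtic: the outlier fattens the tail)
```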
· · ·
15
Two-Variable Analysis
Bivariate Distribution
📊
What it is

Joint Distribution of (X, Y)

A bivariate distribution shows the joint frequency distribution of two variables simultaneously — revealing their individual behaviour AND their joint patterns and dependence structure.

⚙️
Key Concepts

Components

  • Marginal distributions: Distribution of each variable alone (sum over other)
  • Conditional distributions: One variable given the other fixed
  • Bivariate normal: 2D bell curve — described by μₓ, μᵧ, σₓ, σᵧ, and ρ
💡
Why It Matters

Bridge to Multivariate

Bivariate analysis is the essential bridge between single-variable and multivariate statistics. Correlation and regression both rest on understanding the bivariate joint distribution of (X, Y).

Bivariate Normal: f(x,y) parameterised by μₓ, μᵧ, σₓ, σᵧ, ρ
Conditional Mean: E(Y|X=x) = μᵧ + ρ·(σᵧ/σₓ)·(x − μₓ)
Conditional Variance: Var(Y|X=x) = σᵧ²(1 − ρ²)
Independence: ρ = 0 → X and Y uncorrelated (and independent in the bivariate normal case)
STAT1102 · Probability Theory
P(A)
STAT1102 · B.Sc. Statistics · BRUR
Probability Theory
Set Theory · De Morgan's Laws · Probability Axioms · Conditional Probability · Bayes' Theorem · Random Variables · Mathematical Expectation
P1
Foundations
Set Theory & Algebra of Sets
📖
What it is

Sets & Notation

  • Set: A well-defined collection of distinct objects
  • Roster: A = {1, 2, 3}; Set-builder: {x : x < 4, x ∈ ℕ}
  • Universal set (Ω): Contains all elements under study
  • Empty set (∅): No elements; ∅ ⊆ every set
⚙️
Set Operations

Union, Intersection, Complement

  • Union (A ∪ B): Elements in A or B or both
  • Intersection (A ∩ B): Elements in both A and B
  • Complement (Aᶜ): In Ω but not in A
  • Difference (A − B): In A but not in B = A ∩ Bᶜ
  • Sym. difference (A △ B): (A−B) ∪ (B−A)
💡
Key Laws

Algebra Laws

  • Commutative: A∪B = B∪A; A∩B = B∩A
  • Associative: (A∪B)∪C = A∪(B∪C)
  • Distributive: A∩(B∪C) = (A∩B)∪(A∩C)
  • Idempotent: A∪A = A; A∩A = A
🏛️
De Morgan's Laws

Complement of Unions

  • (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ — complement of union = intersection of complements
  • (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ — complement of intersection = union of complements
  • Used constantly to find complements of complex events
De Morgan I: (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
De Morgan II: (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
Inclusion–Exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|
Power Set Size: |P(A)| = 2^|A|
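Python's built-in set type lets us verify De Morgan's laws and inclusion–exclusion directly; the universal set and events here are arbitrary examples:

```python
U = set(range(1, 11))   # universal set Ω = {1, …, 10}
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

# De Morgan I: (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ
assert U - (A | B) == (U - A) & (U - B)
# De Morgan II: (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ
assert U - (A & B) == (U - A) | (U - B)
# Inclusion–exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|
assert len(A | B) == len(A) + len(B) - len(A & B)
# Power set size: |P(A)| = 2^|A|
assert 2 ** len(A) == 16
```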
· · ·
P2
Core Theory
Probability Fundamentals & Axioms
🎲
Building Blocks

Experiment, Outcomes, Events

  • Random experiment: Outcome not predictable with certainty
  • Sample space (S): Set of ALL possible outcomes
  • Event (A): A subset of the sample space
  • Mutually exclusive: A ∩ B = ∅
  • Exhaustive: A ∪ B = S
📐
Kolmogorov's 3 Axioms

Foundations of Probability

  • Axiom 1: P(A) ≥ 0 for every event A
  • Axiom 2: P(S) = 1 (certain event)
  • Axiom 3: Mutually exclusive events: P(A∪B) = P(A)+P(B)
⚙️
Methods of Assignment

4 Approaches

  • Classical: P(A) = m/n (equally likely outcomes)
  • Relative frequency: P(A) = lim f/n as n→∞
  • Subjective: Expert judgment & belief
  • Axiomatic: Kolmogorov's general framework
Classical Probability: P(A) = (No. of favourable outcomes) / (Total equally likely outcomes)
Complement Rule: P(Aᶜ) = 1 − P(A)
Bounds: 0 ≤ P(A) ≤ 1   always
Addition (general): P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Mutually exclusive: P(A ∪ B) = P(A) + P(B)
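Classical probability and the general addition rule, checked exactly (with fractions) on the sample space of two fair dice; the two events are arbitrary examples:

```python
from fractions import Fraction
from itertools import product

# Classical probability: 36 equally likely outcomes for two fair dice
S = list(product(range(1, 7), repeat=2))      # sample space, |S| = 36
A = {s for s in S if s[0] + s[1] == 7}        # event: the sum is 7
B = {s for s in S if s[0] == 6}               # event: first die shows 6

P = lambda E: Fraction(len(E), len(S))        # favourable / total

# General addition rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
assert P(A | B) == P(A) + P(B) - P(A & B)
```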
· · ·
P3
Counting
Permutations, Combinations & Counting Rules
🔢
Fundamental Principle

Multiplication Rule

If task 1 can be done in m ways and task 2 in n ways, then both can be done in m × n ways. Extended to k tasks: m₁ × m₂ × … × mₖ.

🔀
Permutations

Ordered Arrangements

  • nPr = n! / (n−r)! arrangements of r from n
  • All n items: n! arrangements
  • With repetition: nʳ arrangements
  • Circular: (n−1)! arrangements
🎯
Combinations

Unordered Selections

  • nCr = n! / [r!(n−r)!] — select r from n, order irrelevant
  • Also written C(n,r) or ⁿCᵣ or (n choose r)
  • nC0 = nCn = 1; nC1 = n
  • nCr = nC(n−r) (symmetry)
Permutation nPr: n! / (n−r)!
Combination nCr: n! / [r! · (n−r)!]
Binomial Theorem: (a+b)ⁿ = Σₖ nCk · aⁿ⁻ᵏ · bᵏ
3-Event Addition: P(A∪B∪C) = P(A)+P(B)+P(C)−P(A∩B)−P(B∩C)−P(A∩C)+P(A∩B∩C)
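The counting rules map directly onto `math.perm` and `math.comb` (Python 3.8+):

```python
from math import comb, perm

assert perm(5, 2) == 20              # nPr = n!/(n−r)!: ordered arrangements
assert comb(5, 2) == 10              # nCr = n!/[r!(n−r)!]: unordered selections
assert comb(5, 2) == comb(5, 3)      # symmetry: nCr = nC(n−r)
assert comb(5, 0) == comb(5, 5) == 1
# Binomial theorem with a = b = 1: Σₖ nCk = 2ⁿ
assert sum(comb(5, k) for k in range(6)) == 2 ** 5
```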
· · ·
P4
Updated Belief
Conditional Probability, Independence & Bayes' Theorem
🔍
Conditional Probability

Probability Given Information

P(A|B) = the probability of A given that B has already occurred. We restrict the sample space to B and measure A within it. This is the "updated" probability with new information.

💡
Independence

When Knowledge Changes Nothing

  • A and B independent iff P(A|B) = P(A)
  • Equivalently: P(A ∩ B) = P(A) · P(B)
  • Independence ≠ mutual exclusivity
  • Mutually exclusive events with P>0 are never independent
🔄
Bayes' Theorem

Reversing Conditional Probability

Given P(E|H) we find P(H|E). We update prior belief P(H) with evidence E to get posterior P(H|E). Used in: medical diagnosis, spam filtering, ML classifiers.

Law of Total Probability

Averaging Over Causes

If {H₁,…,Hₙ} is a partition of S, then: P(E) = Σᵢ P(E|Hᵢ)·P(Hᵢ). The denominator of Bayes' theorem — the total probability of the evidence.

Conditional Prob.: P(A|B) = P(A ∩ B) / P(B)
Multiplication Rule: P(A ∩ B) = P(A) · P(B|A) = P(B) · P(A|B)
Independence Test: P(A ∩ B) = P(A) · P(B) iff independent
Total Probability: P(E) = Σᵢ P(E|Hᵢ) · P(Hᵢ)
Bayes' Theorem: P(Hᵢ|E) = P(E|Hᵢ)·P(Hᵢ) / Σⱼ P(E|Hⱼ)·P(Hⱼ)
Bayes Intuition: A medical test is positive. Bayes' theorem gives the true probability of actually having the disease, accounting for the test's false-positive rate AND the disease prevalence (the prior). Without Bayes, most people vastly overestimate their risk.
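The medical-test intuition above works out numerically. A sketch with hypothetical numbers (1% prevalence, 99% sensitivity, 5% false-positive rate):

```python
# Hypothetical screening test
prior = 0.01            # P(disease): prevalence
sensitivity = 0.99      # P(positive | disease)
false_pos = 0.05        # P(positive | no disease)

# Law of total probability: overall chance of a positive result
p_positive = sensitivity * prior + false_pos * (1 - prior)

# Bayes' theorem: P(disease | positive) — far below the 0.99 many expect
posterior = sensitivity * prior / p_positive
```

Even with a 99%-sensitive test, the posterior is only about 1 in 6, because true positives from the rare disease are swamped by false positives from the healthy majority.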
· · ·
P5
Core Theory
Random Variables & Mathematical Expectation
🎯
Random Variable

Mapping Outcomes to Numbers

X: S → ℝ assigns a real number to each sample point. Capital X = the RV (function); lowercase x = the value it takes. Converts non-numeric experiments into numbers for analysis.

💡
Discrete vs Continuous

Two Types of RVs

  • Discrete: Countable values {0,1,2,…} — described by PMF p(x)
  • Continuous: Any value in an interval — described by PDF f(x)
  • CDF F(x) = P(X ≤ x) exists for both types
⚙️
Expectation & Moments

Summary Measures

  • E(X): Probability-weighted average — the "centre of gravity"
  • Var(X) = E(X²) − [E(X)]²
  • rth raw moment: μ'ᵣ = E(Xʳ)
  • rth central moment: μᵣ = E[(X−μ)ʳ]
  • Linearity: E(aX+b) = aE(X)+b
🔢
Covariance & Correlation

Between Two RVs

  • Cov(X,Y) = E(XY) − E(X)·E(Y)
  • ρ(X,Y) = Cov(X,Y) / (σ_X·σ_Y)
  • Independent → Cov = 0 (not always vice versa)
E(X) — discrete: Σₓ x · p(x)   where Σ p(x) = 1
E(X) — continuous: ∫₋∞^∞ x · f(x) dx   where ∫ f(x) dx = 1
Variance: Var(X) = E(X²) − [E(X)]²
CDF: F(x) = P(X ≤ x)
Covariance: Cov(X,Y) = E[(X−μₓ)(Y−μᵧ)]
Var(aX+bY): a²Var(X) + b²Var(Y) + 2ab·Cov(X,Y)
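The discrete expectation and variance formulas, plus the linearity property, checked on a fair die:

```python
# A fair die as a discrete random variable: PMF, E(X), Var(X)
pmf = {x: 1 / 6 for x in range(1, 7)}

assert abs(sum(pmf.values()) - 1) < 1e-12       # Σ p(x) = 1

EX = sum(x * p for x, p in pmf.items())         # E(X) = 3.5
EX2 = sum(x * x * p for x, p in pmf.items())    # E(X²)
var = EX2 - EX ** 2                             # Var(X) = E(X²) − [E(X)]² = 35/12

# Linearity: E(aX + b) = a·E(X) + b, illustrated with a = 2, b = 1
EY = sum((2 * x + 1) * p for x, p in pmf.items())
assert abs(EY - (2 * EX + 1)) < 1e-12
```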
STAT2102 · Probability Distributions
f(x)
STAT2102 · B.Sc. Statistics · BRUR
Probability Distributions
Bernoulli · Binomial · Poisson · Geometric · Negative Binomial · Hypergeometric · Uniform · Normal · Exponential · Gamma · Beta
Bernoulli · Binomial · Poisson · Geometric · Neg. Binomial · Hypergeometric · Uniform · Normal · Exponential · Gamma · Beta
D1
Discrete Distributions
Bernoulli, Binomial & Poisson
🪙
Bernoulli(p)

Single Trial — Success/Failure

  • One trial, two outcomes: 1 (success) with prob p, 0 with prob (1−p)
  • E(X) = p; Var(X) = p(1−p)
  • Building block for Binomial
🎰
Binomial(n, p)

n Independent Bernoulli Trials

  • Counts number of successes in n independent trials
  • P(X=k) = C(n,k)·pᵏ·(1−p)ⁿ⁻ᵏ
  • E(X) = np; Var(X) = np(1−p)
  • Use when: fixed n, each trial independent, constant p
☎️
Poisson(λ)

Rare Events in Time/Space

  • Counts events in a fixed interval (time, area, volume)
  • P(X=k) = e⁻λ·λᵏ / k!
  • E(X) = Var(X) = λ — unique equal mean & variance!
  • Use for: calls/hour, defects/unit, accidents/year
🔢
Geometric(p)

Waiting for First Success

  • P(X=k) = (1−p)^(k−1)·p where k=1,2,3,…
  • E(X) = 1/p; Var(X) = (1−p)/p²
  • Memoryless: P(X>s+t|X>s) = P(X>t)
  • Use for: number of trials to first success
Binomial PMF: P(X=k) = C(n,k) · pᵏ · (1−p)ⁿ⁻ᵏ
Binomial Mean/Var: E(X) = np ; Var(X) = np(1−p)
Poisson PMF: P(X=k) = e⁻λ · λᵏ / k!   (k = 0, 1, 2, …)
Poisson Mean/Var: E(X) = Var(X) = λ
Geometric PMF: P(X=k) = (1−p)^(k−1) · p
Hypergeometric: P(X=k) = C(K,k)·C(N−K,n−k) / C(N,n)
Binomial(10, 0.3) vs Poisson(3) — PMF Comparison
k = 0–10 on the horizontal axis, P(X=k) on the vertical; Binomial(10, 0.3) and Poisson(λ=3) bars compared side by side.
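The PMFs are short one-liners, and the comparison in the figure uses n·p = 3 = λ so both distributions share the same mean. A minimal sketch:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

# Binomial(10, 0.3): mean np = 3, matching Poisson(λ = 3)
b = [binom_pmf(k, 10, 0.3) for k in range(11)]
assert abs(sum(b) - 1) < 1e-12                                  # a PMF sums to 1

mean_b = sum(k * pk for k, pk in enumerate(b))                  # E(X) = np = 3
var_b = sum(k * k * pk for k, pk in enumerate(b)) - mean_b ** 2 # Var = np(1−p) = 2.1
```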
· · ·
D2
Continuous Distributions
Normal, Exponential, Uniform, Gamma & Beta
🔔
Normal N(μ, σ²)

The Bell Curve — Most Important

  • Symmetric about mean μ; inflection points at μ±σ
  • 68-95-99.7 rule for 1σ, 2σ, 3σ from mean
  • Standard Normal Z ~ N(0,1): Z = (X−μ)/σ
  • Central Limit Theorem: sample means → Normal
⏱️
Exponential(λ)

Time Until First Event

  • f(x) = λe⁻λˣ for x ≥ 0
  • E(X) = 1/λ; Var(X) = 1/λ²
  • Memoryless: P(X>s+t|X>s) = P(X>t)
  • Continuous analog of geometric distribution
📐
Uniform U(a, b)

Equal Probability Everywhere

  • f(x) = 1/(b−a) for a ≤ x ≤ b
  • E(X) = (a+b)/2; Var(X) = (b−a)²/12
  • All values equally likely in [a, b]
🌀
Gamma & Beta

Flexible Family Distributions

  • Gamma(α,β): Generalises exponential; waiting time for αth event. E(X)=αβ
  • Beta(α,β): Defined on [0,1]; used for proportions, probabilities. Very flexible shape.
Normal PDF: f(x) = (1/(σ√(2π))) · exp[−(x−μ)²/(2σ²)]
Standard Normal Z: Z = (X − μ) / σ ~ N(0,1)
Exponential PDF: f(x) = λ·e⁻λˣ , x ≥ 0 ; E(X) = 1/λ
Uniform PDF: f(x) = 1/(b−a) for x ∈ [a,b]
Gamma PDF: f(x) = x^(α−1) · e^(−x/β) / [β^α · Γ(α)] , x > 0
Beta PDF: f(x) = x^(α−1) · (1−x)^(β−1) / B(α,β) , x ∈ [0,1]
Normal Distribution — The 68-95-99.7 Empirical Rule
Shaded bands around the mean: μ ± σ covers 68.27%, μ ± 2σ covers 95.45%, and μ ± 3σ covers 99.73% of the total area.
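The 68-95-99.7 figures can be recovered from the standard normal CDF, expressible with the error function:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF Φ(z) via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Empirical rule: P(μ − kσ < X < μ + kσ) depends only on k after standardising
within = [phi(k) - phi(-k) for k in (1, 2, 3)]
# ≈ 0.6827, 0.9545, 0.9973 — the 68-95-99.7 rule
```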
Which Distribution to Use? Binary single trial → Bernoulli. Counting successes in n fixed independent trials with constant p → Binomial. Rare events in time/space → Poisson. Waiting for first event → Geometric/Exponential. Sampling without replacement → Hypergeometric. Heights, errors, averages → Normal. Waiting for the αth event → Gamma. Proportions → Beta.
STAT2101 · Regression Analysis & Diagnostics
β
STAT2101 · B.Sc. Statistics · BRUR
Regression Analysis & Diagnostics
Simple Linear Regression · OLS · Hypothesis Testing · ANOVA · Model Diagnostics · Residual Analysis · Multicollinearity · Influential Points · Logistic Regression
R1
Foundation
Simple Linear Regression Model
📉
What it is

The Population Model

Y = β₀ + β₁X + ε. We model the linear relationship between a response Y (dependent) and a predictor X (independent), where ε is random error. We estimate β₀ & β₁ from sample data.

⚙️
Model Assumptions

LINE Assumptions

  • L — Linearity: True relationship is linear in X
  • I — Independence: Errors εᵢ are independent
  • N — Normality: Errors ~ N(0, σ²)
  • E — Equal variance: Var(εᵢ) = σ² (homoscedasticity)
💡
Interpretation

Meaning of Coefficients

  • β₀ (intercept): Expected value of Y when X = 0
  • β₁ (slope): Change in E(Y) for each 1-unit increase in X
  • Sign of β₁ tells direction; magnitude tells strength
Where to Use

Regression Applications

  • Predicting outcomes (sales, yield, price) from predictors
  • Quantifying effect size of a predictor on outcome
  • Controlling for confounders in observational studies
  • Building clinical prediction models
Population Model: Yᵢ = β₀ + β₁Xᵢ + εᵢ , εᵢ ~ N(0, σ²)
Fitted Model: Ŷᵢ = b₀ + b₁Xᵢ
Residuals: eᵢ = Yᵢ − Ŷᵢ   (observed minus predicted)
Simple Linear Regression — Fitted Line & Residuals
Fitted line Ŷ = b₀ + b₁X through the scatter of (X, Y) points, with vertical segments eᵢ marking the residuals and ȳ shown as a horizontal reference.
· · ·
R2
Estimation
OLS Estimation & BLUE Properties
⚙️
OLS Principle

Minimise Sum of Squared Errors

We choose b₀ and b₁ to minimise SSE = Σ(Yᵢ − b₀ − b₁Xᵢ)². Taking partial derivatives and setting to zero gives the normal equations, leading to closed-form solutions.

💡
Gauss-Markov Theorem

BLUE Estimators

Under the LINE assumptions, OLS estimators are Best Linear Unbiased Estimators (BLUE). They have the smallest variance among all linear unbiased estimators. This is the most important theorem in regression.

📊
Variance Decomposition

SST = SSR + SSE

  • SST: Total sum of squares = Σ(Yᵢ−ȳ)²
  • SSR: Regression SS = Σ(Ŷᵢ−ȳ)² (explained by model)
  • SSE: Error SS = Σ(Yᵢ−Ŷᵢ)² (unexplained/residual)
  • R² = SSR/SST — proportion of variance explained
OLS Slope: b₁ = Σ(xᵢ−x̄)(yᵢ−ȳ) / Σ(xᵢ−x̄)² = Sxy / Sxx
OLS Intercept: b₀ = ȳ − b₁x̄
Error Variance σ²: s² = MSE = SSE / (n−2)   (unbiased estimator)
SST: Σ(yᵢ−ȳ)² = SSR + SSE
R² (Coeff. of Det.): R² = SSR/SST = 1 − SSE/SST ∈ [0,1]
Var(b₁): σ²/Sxx = σ² / Σ(xᵢ−x̄)²
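OLS estimation and the SST = SSR + SSE decomposition, worked through on invented near-linear data:

```python
from statistics import mean

# Invented data, roughly y ≈ 2x
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 9.9, 12.1]

xb, yb = mean(x), mean(y)
sxx = sum((a - xb) ** 2 for a in x)
b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sxx   # OLS slope Sxy/Sxx
b0 = yb - b1 * xb                                           # OLS intercept
yhat = [b0 + b1 * a for a in x]

sst = sum((b - yb) ** 2 for b in y)                  # total SS
ssr = sum((h - yb) ** 2 for h in yhat)               # explained SS
sse = sum((b - h) ** 2 for b, h in zip(y, yhat))     # residual SS
assert abs(sst - (ssr + sse)) < 1e-9                 # SST = SSR + SSE
r2 = ssr / sst                                       # near 1 for this data
```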
· · ·
R3
Inference
Hypothesis Tests & Confidence Intervals
🔬
t-Test for Slope

Is X a Significant Predictor?

  • H₀: β₁ = 0 (X has no linear effect on Y)
  • H₁: β₁ ≠ 0 (X is a significant predictor)
  • t = b₁ / SE(b₁) ~ t(n−2) under H₀
  • Reject H₀ if |t| > t_(α/2, n−2)
📏
Confidence Intervals

For β₁ and Mean Response

  • CI for β₁: b₁ ± t_(α/2, n−2) · SE(b₁)
  • CI for E(Y|x*): Ŷ ± t · s·√[1/n + (x*−x̄)²/Sxx]
  • PI for new Y: Ŷ ± t · s·√[1 + 1/n + (x*−x̄)²/Sxx] — wider!
💡
CI vs Prediction Interval

Key Distinction

CI for mean E(Y|x*) is narrower — for the average at x*. Prediction interval (PI) is wider — for an individual future observation. PI includes extra uncertainty from ε. Both narrow near x̄, widen as x* moves away.

🔢
F-Test

Overall Model Significance

  • H₀: All β₁ = … = βₖ = 0 (no predictors help)
  • F = MSR / MSE ~ F(k, n−k−1) under H₀
  • Equivalent to t-test in simple regression (F = t²)
t-statistic for β₁: t = b₁ / [s / √Sxx] ~ t(n−2)
SE(b₁): SE(b₁) = s / √Sxx   where s = √MSE
CI for β₁: b₁ ± t_(α/2, n−2) · SE(b₁)
F for overall model: F = MSR / MSE = (SSR/k) / (SSE/(n−k−1))
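The t-test for the slope, assembled step by step on invented data (for n = 6, df = 4 and the two-sided 5% critical value is t ≈ 2.776):

```python
from math import sqrt
from statistics import mean

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 9.9, 12.1]   # invented, strongly linear
n = len(x)

xb, yb = mean(x), mean(y)
sxx = sum((a - xb) ** 2 for a in x)
b1 = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sxx
b0 = yb - b1 * xb

sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
s = sqrt(sse / (n - 2))       # s = √MSE on n−2 degrees of freedom
se_b1 = s / sqrt(sxx)         # standard error of the slope
t = b1 / se_b1                # reject H0: β1 = 0 if |t| > 2.776 here
```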
· · ·
R4
Variance Partitioning
ANOVA Table for Regression
ANOVA Table Structure — Simple Linear Regression
Source              | df    | SS                 | MS              | F ratio
Regression (Model)  | k = 1 | SSR = Σ(Ŷᵢ − ȳ)²   | MSR = SSR/k     | MSR/MSE
Error (Residual)    | n − 2 | SSE = Σ(Yᵢ − Ŷᵢ)²  | MSE = SSE/(n−2) |
Total               | n − 1 | SST = Σ(Yᵢ − ȳ)²   |                 |

R² = SSR/SST  |  Adjusted R² = 1 − [(n−1)/(n−k−1)](1−R²)  |  F ~ F(k, n−k−1)
· · ·
R5
Model Checking
Residual Analysis & Diagnostics
🔬
Why Diagnostics?

Checking Model Assumptions

Residuals eᵢ = Yᵢ − Ŷᵢ carry information about assumption violations. Always plot residuals before trusting inference. A good model has residuals that look like random noise.

📊
Key Diagnostic Plots

4 Essential Plots

  • Residuals vs Fitted (Ŷᵢ): Check linearity & homoscedasticity. Should be random scatter around zero.
  • Normal Q-Q plot: Check normality of residuals. Points should lie on a straight diagonal line.
  • Scale-Location plot: √|eᵢ| vs Ŷᵢ — check homoscedasticity.
  • Residuals vs Leverage: Identify influential points & Cook's D.
💡
Standardised Residuals

Types of Residuals

  • Ordinary: eᵢ = Yᵢ − Ŷᵢ (raw residuals)
  • Standardised: rᵢ = eᵢ / (s√(1−hᵢᵢ)) — scale-free; should be within ±2
  • Studentised deleted: rᵢ* — uses s₍ᵢ₎ without point i — best for outlier detection
Residual Patterns — Diagnosing Assumption Violations
Panels (residuals vs Ŷ unless noted): ✓ random scatter (good), ✗ curved pattern (nonlinearity), ✗ fan shape (heteroscedasticity), ✓ straight diagonal on the Q-Q plot (normal residuals).
· · ·
R6
Diagnostics
Assumption Violations & Remedies
📈
Non-linearity

Pattern in Residuals

  • Detect: Curved pattern in residuals vs Ŷ plot
  • Remedy: Add polynomial term (X²), log-transform X or Y, use non-parametric regression
  • Test: Ramsey RESET test
📡
Heteroscedasticity

Non-constant Variance

  • Detect: Fan shape in residuals; Breusch-Pagan or White test
  • Consequence: OLS still unbiased but NOT BLUE; SEs are wrong
  • Remedy: WLS (weighted least squares), log(Y), robust SEs (HC errors)
🔗
Autocorrelation

Non-independent Errors

  • Detect: Durbin-Watson test (d ≈ 2 is good; d < 1.5 or d > 2.5 signals problem)
  • Common in: Time series data
  • Remedy: Include lagged variables, GLS, Cochrane-Orcutt
🧩
Multicollinearity

Correlated Predictors

  • Detect: VIF > 10 signals serious problem; VIF > 5 is concerning
  • Consequence: Large SEs, unstable coefficient estimates, wrong signs
  • Remedy: Drop one correlated variable, ridge regression, PCA
Durbin-Watson d: d = Σᵢ(eᵢ − eᵢ₋₁)² / Σᵢ eᵢ²   ∈ [0,4]; d ≈ 2 → no autocorrelation
VIF (Var. Inflation): VIF_j = 1/(1 − Rⱼ²)   where Rⱼ² = R² of Xⱼ regressed on all other predictors
Breusch-Pagan Test: Regress eᵢ² on Xᵢ; test F or nR² ~ χ²(k)
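The Durbin-Watson statistic is simple to compute from a residual series. A sketch on two invented residual sequences, one trending (positive autocorrelation, d near 0) and one sign-alternating (negative autocorrelation, d above 2):

```python
def durbin_watson(e):
    """d = Σ(eᵢ − eᵢ₋₁)² / Σeᵢ² ; lies in [0, 4], ≈ 2 when errors are uncorrelated."""
    return (sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
            / sum(v * v for v in e))

trending = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]         # smooth drift upward
alternating = [0.5, -0.3, 0.2, -0.4, 0.1, -0.2, 0.3, -0.1]  # sign flips each step

d_pos = durbin_watson(trending)      # far below 2 → positive autocorrelation
d_neg = durbin_watson(alternating)   # above 2 → negative autocorrelation
```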
· · ·
R7
Outlier Detection
Influential Points, Outliers & Leverage
🎯
Outliers in Y

Large Residuals

A point with a large studentised residual |rᵢ| > 2 or 3. Outliers in Y can inflate MSE and distort regression estimates. Check if real or data error.

🔭
High Leverage Points

Outliers in X Space

Leverage hᵢᵢ (hat matrix diagonal) measures how far Xᵢ is from x̄. Rule of thumb: hᵢᵢ > 2(k+1)/n signals high leverage. High leverage = potential for high influence.

💡
Cook's Distance D

Overall Influence

Cook's D measures the effect of deleting point i on ALL fitted values. D > 1 (or D > 4/n) suggests the point is influential. Combines residual size and leverage: a high-leverage point with large residual is most influential.

🔢
DFFITS & DFBETAS

Change-in-Fit Statistics

  • DFFITS: Change in Ŷᵢ when point i is deleted (standardised)
  • DFBETAS_j: Change in b_j when point i is deleted
  • Flag if |DFFITS| > 2√((k+1)/n), with k predictors plus the intercept
Leverage hᵢᵢhᵢᵢ = (Hat matrix)ᵢᵢ = 1/n + (xᵢ−x̄)²/Sxx   ∈ [1/n, 1]
Standardised Resid.rᵢ = eᵢ / (s√(1−hᵢᵢ))
Cook's DistanceDᵢ = eᵢ² · hᵢᵢ / [p · MSE · (1−hᵢᵢ)²]   (p = k+1 parameters)
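A pure-Python sketch of the three formulas above for a simple one-predictor regression. The data are hypothetical; the last y value is inflated relative to the trend to make an influential point visible.

```python
# Minimal sketch (pure Python): leverage h_ii, standardised residuals and
# Cook's D for a simple regression, using the formulas above.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 15.0]   # last y pulled away from the line (hypothetical)
n, p = len(x), 2                 # p = k + 1 parameters (intercept + slope)

xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # residuals
mse = sum(ei ** 2 for ei in e) / (n - p)

for xi, ei in zip(x, e):
    h = 1 / n + (xi - xbar) ** 2 / sxx              # leverage (hat diagonal)
    r = ei / (mse ** 0.5 * (1 - h) ** 0.5)          # standardised residual
    D = ei ** 2 * h / (p * mse * (1 - h) ** 2)      # Cook's distance
    print(f"x={xi:.0f}  h={h:.3f}  r={r:+.2f}  D={D:.3f}")
```

The run shows the high-leverage point at x=5 combining a sizeable residual with h = 0.6, giving a Cook's D above 1.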
· · ·
R8
Extension
Multiple Linear Regression
🔢
The Model

k Predictors

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε. Each βⱼ is the partial effect of Xⱼ on Y, holding all other predictors constant. Estimated by matrix algebra: b = (X'X)⁻¹X'Y.

💡
Adjusted R²

Penalised Fit Measure

R² always increases when adding predictors (even irrelevant ones). Adjusted R² penalises for the number of predictors — use this to compare models with different numbers of predictors.

⚙️
Model Selection

Choosing Predictors

  • Forward selection: Add predictors one at a time
  • Backward elimination: Remove least significant predictors
  • Stepwise: Combine both directions
  • AIC/BIC: Information criteria — lower is better
  • Cross-validation: Out-of-sample prediction error
🎯
Logistic Regression

Binary Response Variable

When Y ∈ {0,1}, linear regression is inappropriate. Use logistic regression: log[p/(1−p)] = β₀ + β₁X₁ + …. Coefficients interpreted as log-odds; exp(βⱼ) = odds ratio. Estimated by MLE, not OLS.

Matrix FormY = Xβ + ε   ;   b = (X'X)⁻¹X'Y
Adjusted R²R²_adj = 1 − [(n−1)/(n−k−1)] · (1 − R²)
AICAIC = n·ln(SSE/n) + 2k
Logit Modelln[p/(1−p)] = β₀ + Σ βⱼXⱼ   ;   p = P(Y=1|X)
Odds RatioOR_j = exp(βⱼ) — effect of 1-unit increase in Xⱼ on odds of Y=1
OLS vs LogisticUse OLS regression when Y is (at least approximately) continuous. Use logistic regression when Y is binary (0/1). Never fit a linear regression to a binary outcome — it can predict probabilities outside [0,1] and violates the normality/homoscedasticity assumptions.
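The logit link and the odds-ratio reading of exp(βⱼ) can be sketched as follows. The coefficients here are hypothetical, not fitted by MLE.

```python
# Minimal sketch (pure Python): the logit model log[p/(1-p)] = b0 + b1*x.
import math

b0, b1 = -3.0, 0.8        # hypothetical coefficients on the log-odds scale

def prob(x):
    """Invert the logit: p = 1/(1 + exp(-(b0 + b1*x))) -- always inside (0,1)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

odds_ratio = math.exp(b1)  # multiplicative change in odds per 1-unit increase in x

print(round(prob(0), 4))   # low x: small probability
print(round(prob(10), 4))  # high x: probability near 1, never above it
print(round(odds_ratio, 3))
```

Unlike a linear fit, prob(x) can never leave [0,1], which is exactly why the logit is used for binary Y.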
STAT3203 · Econometrics
Y
STAT3203 · B.Sc. Statistics Year 3 · BRUR
Econometrics
Classical Linear Model · OLS · Multicollinearity · Heteroscedasticity · Autocorrelation · Specification Errors · Dummy Variables · Simultaneous Equations · Time Series
🎓 What is Econometrics? Econometrics is what happens when statistics and economics go on a date and have a baby called "regression." It asks: "Yes, we think X causes Y in theory — but how strong is that relationship in actual data, and can we prove it?" As Gujarati puts it: "Econometrics is the art and science of using statistical methods to test economic theories and forecast economic phenomena." The joke among economists: "Economists use models to explain what has already happened, and models to predict the future — and the same model is usually wrong in both cases." 😄
E1
Introduction
What is Econometrics?
📖
What it is

Definition

Econometrics = Economics + Metrics. It applies statistical and mathematical methods to quantify economic relationships, test economic theories, and forecast future economic activity. Gujarati defines it as the "quantitative analysis of actual economic phenomena."

⚙️
The Three Steps

Econometric Methodology

  • 1. Economic model: Theory says Y depends on X₁, X₂,… (e.g., consumption depends on income)
  • 2. Econometric model: Add error term — Y = f(X₁,X₂) + ε
  • 3. Estimate & test: Use data to estimate parameters and test hypotheses
💡
Real World Example

Keynesian Consumption Function

Theory: Consumption increases with income.
Econometric model: C = β₀ + β₁Y + ε
β₁ = Marginal Propensity to Consume (MPC) — how much of each extra taka is consumed. We estimate this from real survey data!

Where Used

Applications

  • Estimating wage-education returns (does education pay?)
  • Measuring price elasticity of demand
  • Evaluating effect of minimum wage on employment
  • Forecasting GDP, inflation, exchange rates
  • Policy evaluation (did a program reduce poverty?)
😂 Econometrician's Joke"An economist, a physicist, and an econometrician are stranded on an island with canned food. The physicist says 'let's use a rock to open the cans.' The economist says 'assume we have a can opener.' The econometrician says 'let's regress can-opening on island conditions, correct for heteroscedasticity, and check the instrumental variables.'" — Econometrics solves real problems, just very thoroughly! 😄
· · ·
E2
Foundation
Classical Linear Regression Model (CLRM)
🎯 The CLRM is the BackboneEvery econometrics problem starts by asking: "Which CLRM assumption is violated here?" Like a doctor checking vital signs before treating a patient — you must check the assumptions before trusting the results.
📋
The 10 Assumptions

CLRM Assumptions (Gujarati)

  • A1: Linear in parameters — model is linear in β (not necessarily in X)
  • A2: Fixed X values — X is non-stochastic (or fixed in repeated sampling)
  • A3: Zero mean error — E(εᵢ) = 0
  • A4: Homoscedasticity — Var(εᵢ) = σ² (constant)
  • A5: No autocorrelation — Cov(εᵢ, εⱼ) = 0, i≠j
  • A6: Zero covariance between error and X — Cov(εᵢ, Xᵢ) = 0
  • A7: n > k — more observations than parameters
  • A8: Variability in X — Var(X) ≠ 0
  • A9: No perfect multicollinearity — no exact linear relation among Xs
  • A10: Normality of ε — εᵢ ~ N(0, σ²)
💡
LINE Simplified

Remember: LINE

  • Linearity — relationship is linear in parameters
  • Independence — errors are independent of each other
  • Normality — errors are normally distributed
  • Equal variance — errors have constant variance (homoscedastic)

😄 Memory tip: "LINE up your assumptions or your results will be crooked!"

⚠️
What Happens When Violated

Consequences Table

  • A4 violated (hetero): OLS unbiased but inefficient; wrong SEs
  • A5 violated (autocorr): OLS unbiased but inefficient; wrong SEs
  • A9 violated (multicoll): OLS unbiased but very large variance; unreliable estimates
  • Omitted variable: OLS biased AND inconsistent — the worst!
🌍
Real Scenario

Estimating Wage Equation

Model: Wage = β₀ + β₁Education + β₂Experience + ε
Check: Does error have constant variance? (Workers with more education may have more variable wages → heteroscedasticity). Are education & experience correlated? (Older workers often have more experience AND education → multicollinearity). Always diagnose first!

CLRM ModelYᵢ = β₁ + β₂X₂ᵢ + β₃X₃ᵢ + … + βₖXₖᵢ + εᵢ
Error assumptionsE(εᵢ)=0 ; Var(εᵢ)=σ² ; Cov(εᵢ,εⱼ)=0 (i≠j) ; εᵢ~N(0,σ²)
Matrix formY = Xβ + ε  ;  β̂ = (X'X)⁻¹X'Y (OLS estimator)
· · ·
E3
Estimation
OLS & Gauss-Markov Theorem
🎯
OLS Principle

Minimise Squared Errors

OLS chooses β̂ to minimise SSE = Σeᵢ² = Σ(Yᵢ − Ŷᵢ)². The "squaring" penalises large errors more — like a strict teacher who really hates big mistakes more than small ones! 😄 The solution is unique and closed-form.

🏆
Gauss-Markov Theorem

BLUE — Why OLS is Best

Under assumptions A1–A9 (without normality), OLS estimators are:
Best — minimum variance
Linear — in Y
Unbiased — E(β̂) = β
Estimators
No other linear unbiased estimator has smaller variance! Think of it as OLS being the "most efficient honest statistician."

⚙️
OLS Properties

Algebraic Properties

  • Σeᵢ = 0 (residuals sum to zero)
  • Σeᵢ·Xᵢ = 0 (residuals uncorrelated with X)
  • Regression line passes through (X̄, Ȳ)
  • Σeᵢ·Ŷᵢ = 0 (residuals uncorrelated with fitted values)
📐
Goodness of Fit

R² and its Limits

  • R² ∈ [0,1]; R²=1 perfect fit; R²=0 model explains nothing
  • Warning: High R² ≠ good model! You can have high R² with spurious regression (two random trends)
  • Adjusted R²: Penalises for extra predictors — use for model comparison
  • 😄 "A high R² in time series is suspicious, not impressive!"
OLS slope (simple)β̂₂ = Σ(Xᵢ−X̄)(Yᵢ−Ȳ) / Σ(Xᵢ−X̄)² = Cov(X,Y)/Var(X)
OLS interceptβ̂₁ = Ȳ − β̂₂·X̄
UnbiasednessE(β̂) = β   (on average, hits the true value)
R² = ESS/TSS = 1 − RSS/TSS  ;  RSS=Σeᵢ², ESS=Σ(Ŷᵢ−Ȳ)², TSS=Σ(Yᵢ−Ȳ)²
Adjusted R²R̄² = 1 − [(n−1)/(n−k)](1−R²)
🌍 Real WorldBangladesh rice yield data: Yield = 1200 + 45·Fertiliser + 30·Rain + ε. R² = 0.82 means 82% of variation in yield is explained by fertiliser and rainfall. β̂₂=45 means: holding rain constant, each kg of fertiliser per acre increases yield by 45 kg. This directly guides agricultural policy!
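Unbiasedness, E(β̂) = β, can be checked by simulation: generate many samples from a known model and average the OLS slopes. The true model below (Y = 2 + 3X + ε) is hypothetical.

```python
# Minimal sketch (pure Python): Monte Carlo check that OLS is unbiased.
import random

random.seed(42)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
xbar = sum(x) / len(x)
sxx = sum((xi - xbar) ** 2 for xi in x)

def ols_slope(y):
    ybar = sum(y) / len(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

slopes = []
for _ in range(5000):
    y = [2.0 + 3.0 * xi + random.gauss(0.0, 1.0) for xi in x]  # true slope = 3
    slopes.append(ols_slope(y))

avg = sum(slopes) / len(slopes)
print(round(avg, 2))   # averages close to the true slope 3
```

Each individual slope estimate varies, but the average over repeated samples sits on top of the true β, which is what unbiasedness means.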
· · ·
E4
Problem 1
Multicollinearity — The Identity Crisis
😂 The Multicollinearity Joke"Multicollinearity is like trying to tell apart identical twins by asking their friends — everyone says 'they're basically the same.' Your model literally cannot figure out who is doing what." When X₁ and X₂ are nearly perfectly correlated, the model gets confused about whose "fault" it is when Y changes.
🔍
What it is

Correlated Predictors

Multicollinearity occurs when two or more predictor variables are highly correlated with each other. Perfect multicollinearity = exact linear relationship (OLS breaks down entirely). Near-perfect = high but not perfect correlation (OLS works but gives unreliable estimates).

⚙️
Detection

How to Detect

  • Correlation matrix: |rᵢⱼ| > 0.8 between predictors — warning sign
  • VIF (Variance Inflation Factor): VIF > 10 = serious; VIF > 5 = concern
  • Condition number: κ > 30 signals multicollinearity
  • Rule of thumb sign: High R² but few significant individual t-tests
⚠️
Consequences

What Goes Wrong

  • OLS still unbiased and BLUE — estimates are correct on average
  • But standard errors inflate — estimates are imprecise
  • t-statistics become small → variables appear insignificant even when they matter
  • Coefficient signs may be wrong or change with small data changes
  • Confidence intervals become very wide
💡
Remedies

What to Do

  • Drop a variable — but risk omitted variable bias
  • Ridge regression — adds a penalty λ to shrink coefficients
  • Get more data — reduces variance (often the best solution)
  • Principal Component Analysis (PCA) — use orthogonal components
  • Combine variables — e.g., use wealth index instead of income + savings
VIF formulaVIF_j = 1/(1 − Rj²)   where Rj² = R² from regressing Xⱼ on all other X's
VIF interpretationVIF=1 → no collinearity ; VIF=5 → moderate ; VIF>10 → serious problem
Var(β̂ⱼ) inflatedVar(β̂ⱼ) = σ²/[Sⱼⱼ(1−Rⱼ²)] = (σ²/Sⱼⱼ) · VIF_j
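With exactly two predictors, Rⱼ² in the VIF formula reduces to the squared pairwise correlation r₁₂², so VIF = 1/(1 − r₁₂²). A sketch with hypothetical near-collinear data:

```python
# Minimal sketch (pure Python): VIF for two predictors via their correlation.
x1 = [10, 12, 14, 16, 18, 20]
x2 = [21, 24, 29, 31, 37, 41]   # moves almost in lockstep with x1 (hypothetical)

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (sum((ai - ma) ** 2 for ai in a)
           * sum((bi - mb) ** 2 for bi in b)) ** 0.5
    return num / den

r12 = corr(x1, x2)
vif = 1.0 / (1.0 - r12 ** 2)
print(round(r12, 3), round(vif, 1))   # high r12 inflates VIF far past 10
```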
🌍 Bangladesh ExampleRegressing household expenditure on income and wealth. Income and wealth are highly correlated (r=0.92). VIF comes out at 8.2. The model can't tell apart the separate effects of income vs wealth. Solution: use only income, or create a composite "socioeconomic status" score.
😄 Tip: "If two variables always go up together in your data, your model has the same problem as a detective who always finds two suspects at the crime scene at the same time — it cannot tell who did it."
· · ·
E5
Problem 2
Heteroscedasticity — The Unequal Spreader
😄 Analogy"Heteroscedasticity is like a group of students whose test scores vary wildly for rich students (some study hard, some don't) but are very consistent for poor students (all must study). The variance of the 'error' in predicting scores is not equal across income groups." This violates A4!
📡
What it is

Non-constant Error Variance

Heteroscedasticity means Var(εᵢ) = σᵢ² — the variance of the error term is NOT constant across observations. It changes with one or more predictors. Very common in cross-sectional data (individuals, firms, countries with very different sizes).

⚙️
Detection Tests

How to Detect

  • Visual: Plot residuals vs fitted Ŷ — fan/funnel shape = hetero
  • Park test: Regress ln(eᵢ²) on ln(Xᵢ)
  • Glejser test: Regress |eᵢ| on Xᵢ
  • Breusch-Pagan (BP) test: Lagrange multiplier test — most popular
  • White test: More general — no specific form assumed
⚠️
Consequences

What Goes Wrong

  • OLS estimators remain unbiased and consistent
  • BUT they are no longer BLUE (not minimum variance)
  • Standard errors are biased → t and F tests unreliable
  • Confidence intervals too narrow or too wide
  • Hypothesis tests give wrong conclusions
💡
Remedies

Fixing Heteroscedasticity

  • WLS (Weighted Least Squares): Weight observations by 1/σᵢ² — best if σᵢ² known
  • Log transformation: ln(Y) = β₀ + β₁X — often stabilises variance
  • White's HC standard errors: "Robust" SEs — keeps OLS estimates but corrects SEs
  • FGLS: Feasible GLS when form is estimated
Heteroscedastic modelYᵢ = β₁ + β₂Xᵢ + εᵢ  where  Var(εᵢ) = σᵢ² ≠ constant
WLS objectiveMinimise: Σ wᵢeᵢ²  where  wᵢ = 1/σᵢ² (higher weight = more precise obs.)
Breusch-PaganBP = n·R² ~ χ²(k)  from regressing eᵢ²/σ̂² on all X's
White test statn·R² ~ χ²(p)  where p = number of regressors in auxiliary regression
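The WLS objective above has a closed form: use weighted means in the usual slope formula. A sketch with hypothetical data, assuming σᵢ² grows proportionally to Xᵢ² (so wᵢ = 1/xᵢ²):

```python
# Minimal sketch (pure Python): WLS with known weights w_i = 1/sigma_i^2.

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 5.7, 9.0, 8.5]            # hypothetical heteroscedastic data
w = [1.0 / xi ** 2 for xi in x]          # assumed: sigma_i^2 proportional to x_i^2

sw = sum(w)
xw = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted means
yw = sum(wi * yi for wi, yi in zip(w, y)) / sw
b1 = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y)) \
     / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
b0 = yw - b1 * xw
print(round(b1, 3), round(b0, 3))
```

Precise (low-variance) observations get large weights and dominate the fit; noisy ones are down-weighted, restoring efficiency.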
🌍 Real ExampleRegressing household food expenditure on income across 1000 Bangladeshi families. Rich families have very variable food spending (some eat lavishly, some save); poor families all spend similarly near subsistence. This creates a fan shape in residuals — classic heteroscedasticity. Remedy: use ln(expenditure) or WLS with weight 1/income².
· · ·
E6
Problem 3
Autocorrelation — The Time Traveller's Problem
😄 The Autocorrelation Joke"Autocorrelation is like a gossip chain. What happened yesterday affects what people say today, which affects tomorrow. Errors in time series data are like rumours — yesterday's error whispers to today's error." When today's residual tells tomorrow's what to be, you have autocorrelation!
🔗
What it is

Correlated Error Terms

Autocorrelation (serial correlation) means Cov(εᵢ, εⱼ) ≠ 0 for i≠j. This violates assumption A5. Most common in time series data (monthly GDP, daily stock prices, annual inflation). Positive autocorrelation is most common — errors persist in the same direction.

⚙️
Detection

Tests for Autocorrelation

  • Plot residuals over time: Look for cyclical or trending patterns
  • Durbin-Watson (DW) test: d ≈ 2 → no autocorrelation; d < 1.5 → positive AC; d > 2.5 → negative AC
  • Breusch-Godfrey (BG) test: More general — detects higher-order autocorrelation
  • Run test: Non-parametric test for randomness in residuals
⚠️
Consequences

What Goes Wrong

  • OLS estimates remain unbiased and consistent
  • But NOT BLUE — inefficient; larger variances than GLS
  • s² underestimates σ² → t & F tests give inflated significance
  • R² is overestimated — model looks better than it is!
💡
Remedies

Fixing Autocorrelation

  • Generalised Least Squares (GLS): Use the transformed model (most correct)
  • Cochrane-Orcutt method: Iterative GLS for AR(1) errors
  • Include lagged Y (Yₜ₋₁): Often removes autocorrelation
  • Newey-West HAC SEs: Robust SEs that account for autocorrelation
  • First-differencing: Use ΔY = Yₜ − Yₜ₋₁ as the dependent variable
AR(1) error processεₜ = ρεₜ₋₁ + uₜ  where  |ρ| < 1  and  uₜ ~ WN(0, σ²)
Durbin-Watson dd = Σₜ(eₜ − eₜ₋₁)² / Σeₜ² ≈ 2(1−ρ̂) ; d∈[0,4]
ρ̂ estimatorρ̂ = Σₜ eₜeₜ₋₁ / Σₜ eₜ₋₁²
GLS transformationYₜ − ρYₜ₋₁ = β₁(1−ρ) + β₂(Xₜ − ρXₜ₋₁) + uₜ
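The ρ̂ estimator and the quasi-differencing step behind Cochrane-Orcutt can be sketched directly. The residual series is hypothetical, with visible positive persistence.

```python
# Minimal sketch (pure Python): estimate rho from residuals, then apply the
# GLS / Cochrane-Orcutt quasi-difference y*_t = y_t - rho * y_{t-1}.

e = [1.0, 0.8, 0.9, 0.5, 0.6, 0.3, 0.2, -0.1, 0.0, -0.2]

rho = sum(e[t] * e[t - 1] for t in range(1, len(e))) \
      / sum(e[t - 1] ** 2 for t in range(1, len(e)))
d_approx = 2 * (1 - rho)            # Durbin-Watson approximation d = 2(1 - rho)
print(round(rho, 3), round(d_approx, 3))   # rho near 0.8 -> d well below 2

y = [10.0, 11.0, 12.5, 13.0, 14.2]  # hypothetical series to transform
y_star = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
print([round(v, 3) for v in y_star])
```

One observation is lost in the transform (the Prais-Winsten variant rescales the first observation instead of dropping it).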
🌍 Bangladesh ExampleRegressing annual rice production on fertiliser use and rainfall (1980–2023). The DW statistic = 1.12 signals positive autocorrelation — a good crop year tends to be followed by another good year (farmers reinvest; soil quality persists). Cochrane-Orcutt iteration gives ρ̂ = 0.48, and the corrected model gives more reliable coefficient estimates.
· · ·
E7
Model Misspecification
Specification Errors — Building the Wrong House
🏗️
What it is

Using the Wrong Model

Specification errors arise when the model is incorrectly specified — wrong variables, wrong functional form, or wrong structural assumptions. The most dangerous error in econometrics!

⚠️
Type 1: Omitted Variable

Leaving Out a Key Variable

  • True model: Y = β₁ + β₂X₂ + β₃X₃ + ε
  • Estimated model: Y = α₁ + α₂X₂ + u (X₃ omitted)
  • Result: OLS estimator of β₂ is biased and inconsistent
  • Bias direction depends on correlation between X₂ and X₃
  • 😄 "Like measuring height but ignoring whether you're on a slope!"
⚠️
Type 2: Irrelevant Variable

Including an Unnecessary Variable

  • True model: Y = β₁ + β₂X₂ + ε
  • Estimated model: Y = α₁ + α₂X₂ + α₃X₃ + u (X₃ is irrelevant)
  • Result: OLS estimators remain unbiased but inefficient (larger variance)
  • R² increases artificially — use adjusted R² instead!
⚙️
Type 3: Wrong Functional Form

Linear When Non-linear

  • True: Y = β₁ + β₂X + β₃X² + ε (quadratic)
  • Fitted: Y = α₁ + α₂X + u (linear)
  • Residuals will show a curved pattern
  • RESET test (Ramsey) detects wrong functional form
💡
Detecting Specification Errors

Tests

  • RESET test: Add Ŷ², Ŷ³ to model; test their joint significance
  • Davidson-MacKinnon J-test: Test between non-nested models
  • Residual plots: Patterns indicate misspecification
  • Theory: Always use economic theory to guide model choice!
Omitted Variable BiasBias(β̂₂) = β₃ · (Cov(X₂,X₃)/Var(X₂))   ≠ 0 if β₃≠0 & X₂,X₃ correlated
RESET testAdd Ŷ², Ŷ³ to regression; F-test on their coefficients. Reject H₀ → misspecification.
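The omitted-variable bias formula above is exact in the noise-free case, which makes it easy to verify numerically. The data below are hypothetical, with β₂ = 2, β₃ = 3 and X₂, X₃ positively correlated.

```python
# Minimal sketch (pure Python): the short regression of Y on X2 alone recovers
# beta2 + beta3 * Cov(X2,X3)/Var(X2), not beta2 -- omitted-variable bias.

x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [1.5, 1.8, 3.1, 3.9, 5.2]                    # correlated with x2
y = [1 + 2 * a + 3 * b for a, b in zip(x2, x3)]   # true model, no noise

def slope(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) \
           / sum((a - mx) ** 2 for a in xs)

short = slope(x2, y)          # biased slope from the short regression
bias = 3 * slope(x2, x3)      # beta3 * Cov(x2,x3)/Var(x2)
print(round(short, 3), round(2 + bias, 3))   # identical: the bias formula is exact here
```

Because Cov(X₂,X₃) > 0 and β₃ > 0, the short-regression slope overstates β₂, mirroring the upward "ability bias" in the wage example.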
🌍 Classic ExampleWage regression omitting "ability." Model: Wage = β₀ + β₁Education + ε. Problem: Ability affects both wages AND education choices. Omitting ability biases β̂₁ upward — we attribute to education some of what is really due to innate ability. This is the classic "ability bias" in returns to education. Solution: use IQ scores, sibling fixed effects, or instrumental variables (Angrist & Krueger's famous quarter-of-birth IV).
· · ·
E8
Qualitative Predictors
Dummy Variables — Turning Categories into Numbers
💡 What is a Dummy Variable?A dummy (indicator) variable takes values 0 or 1 to represent a categorical characteristic. Male = 1, Female = 0. Urban = 1, Rural = 0. It's called "dummy" because it's a stand-in number for something that isn't naturally numeric. 😄 "It's not that the variable is stupid — it's just pretending to be a number!"
🔢
What it is

Binary Indicator Variables

For a qualitative variable with m categories, we include m−1 dummy variables (omit one — the "base" or "reference" category). Including all m dummies causes perfect multicollinearity — the dummy variable trap!

⚙️
Interpretation

Reading Dummy Coefficients

  • Wage = 5000 + 800·MALE + 200·Education + ε
  • MALE=1 (male): avg wage = 5000+800+200·Edu = 5800+200·Edu
  • MALE=0 (female): avg wage = 5000+200·Edu
  • So: men earn 800 taka more than women on average, holding education fixed
  • The dummy coefficient is the shift in intercept for that category
💡
Interaction Dummies

Dummies with Slopes

  • Wage = β₀ + β₁MALE + β₂Education + β₃(MALE×Education) + ε
  • β₃ allows the slope of education to differ by gender
  • Male return to education: β₂ + β₃
  • Female return to education: β₂
  • This is the Chow test idea — testing if two groups have different regression relationships
⚠️
Dummy Trap

The Most Common Mistake!

For m categories, ALWAYS include m−1 dummies. If you include all m, the sum of all dummies = 1 (a constant) which creates PERFECT multicollinearity. Example: if you have MALE and FEMALE dummies, they always sum to 1 = the intercept column → perfect collinearity. Drop one! The dropped category is the "reference group."

General formYᵢ = β₀ + β₁Dᵢ + β₂Xᵢ + εᵢ  (D=1 for group A, D=0 for group B)
Group A meanE(Yᵢ|Dᵢ=1,Xᵢ) = (β₀+β₁) + β₂Xᵢ   (shifted intercept)
Group B meanE(Yᵢ|Dᵢ=0,Xᵢ) = β₀ + β₂Xᵢ   (reference group)
Interaction (slope shift)Yᵢ = β₀ + β₁Dᵢ + β₂Xᵢ + β₃(DᵢXᵢ) + εᵢ
Chow Test F-statF = [(SSEᵣ − (SSE₁+SSE₂))/k] / [(SSE₁+SSE₂)/(n₁+n₂−2k)]
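Dummy coding and the intercept-shift reading can be sketched with the section's hypothetical fitted wage model (coefficients 5000, 800, 200 are illustrative, not estimated here).

```python
# Minimal sketch (pure Python): dummy coefficients as intercept shifts, plus
# m-1 coding that avoids the dummy variable trap.

b0, b_male, b_edu = 5000.0, 800.0, 200.0

def predicted_wage(male, education):
    """male is a 0/1 dummy; its coefficient shifts the intercept for that group."""
    return b0 + b_male * male + b_edu * education

gap = predicted_wage(1, 12) - predicted_wage(0, 12)
print(gap)   # the gap equals b_male at every education level

def dummies(category, levels):
    """Code m categories as m-1 dummies; the first level is the reference."""
    return [1 if category == lv else 0 for lv in levels[1:]]

print(dummies("urban", ["rural", "urban", "semi-urban"]))
print(dummies("rural", ["rural", "urban", "semi-urban"]))  # reference: all zeros
```

The reference category is encoded as all zeros, so the m dummies never sum to a constant column and the trap is avoided.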
🌍 Bangladesh Policy ExampleEvaluating impact of a microfinance program: Treatment = 1 (received loan), Control = 0. Model: Income = β₀ + β₁·Treatment + β₂·Education + β₃·Age + ε. β₁ estimates the Average Treatment Effect (ATE) — did the loan raise income? If β₁ = 2500 (significant), the program raises income by Tk 2500 holding other factors fixed. This is the basis of impact evaluation / program evaluation in development economics!
· · ·
E9
Advanced
Simultaneous Equation Models — Cause and Effect in Both Directions
🔄
What it is

Bidirectional Causality

In many economic situations, variables determine each other simultaneously. Supply & demand: price determines quantity demanded AND quantity supplied determines price. This simultaneity causes OLS to be biased and inconsistent — the "simultaneity bias."

⚙️
Endogenous vs Exogenous

Variable Classification

  • Endogenous (jointly determined): Price & Quantity in supply-demand system
  • Exogenous (determined outside): Income, weather, policy variables
  • Structural form: The economic behavioural equations
  • Reduced form: Each endogenous variable expressed only in terms of exogenous variables
💡
Identification Problem

Can We Estimate the Equations?

  • Under-identified: Cannot estimate from data alone
  • Exactly identified: Unique estimates possible
  • Over-identified: Multiple estimates possible; use 2SLS
  • Order condition: (K−k) ≥ (m−1) where K=total exogenous, k=exogenous in equation, m=endogenous in equation
🔢
Estimation Methods

How to Estimate

  • ILS (Indirect Least Squares): For exactly identified equations
  • 2SLS (Two-Stage Least Squares): Most popular for over-identified. Stage 1: regress endogenous X on instruments; Stage 2: use fitted X̂ in main regression
  • 3SLS / FIML: Full system methods for efficiency
Demand equationQd = α₀ + α₁P + α₂Income + u₁   (structural)
Supply equationQs = β₀ + β₁P + β₂Weather + u₂   (structural)
EquilibriumQd = Qs   (market clears)
2SLS Stage 1Regress P on ALL exogenous variables → get P̂
2SLS Stage 2Replace P with P̂ in structural equation → OLS gives consistent estimates
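In the just-identified case with one instrument z, the two 2SLS stages collapse to a single ratio: β̂_IV = Cov(z,y)/Cov(z,x). A sketch with hypothetical data:

```python
# Minimal sketch (pure Python): the IV estimator, identical to 2SLS when there
# is exactly one instrument for one endogenous regressor.

z = [1.0, 2.0, 3.0, 4.0, 5.0]   # instrument (exogenous, hypothetical)
x = [1.2, 2.1, 3.3, 3.8, 5.1]   # endogenous regressor
y = [2.5, 4.0, 6.8, 7.5, 10.2]

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

beta_iv = cov(z, y) / cov(z, x)
print(round(beta_iv, 3))
```

Replacing x with its stage-1 fitted values x̂ (from regressing x on z) and running OLS of y on x̂ gives exactly the same slope, which is the 2SLS mechanics in miniature.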
😄 Why OLS Fails Here"Using OLS for a simultaneous system is like trying to figure out who started a fight when both parties hit each other at exactly the same time — you can't tell cause from effect!" Price rises → quantity supplied rises (supply); but quantity demanded falls → price falls (demand). OLS blends these two directions and gives wrong answers for both. 2SLS untangles them using instruments.
· · ·
E10
Time Series
Time Series Econometrics — Stationarity, Unit Roots & Cointegration
⚡ The Spurious Regression Warning!Regressing one non-stationary time series on another can give a high R² and significant t-statistics PURELY BY CHANCE — even if they have nothing to do with each other. Example: Bangladesh rice production and global smartphone sales both trend upward → regressing one on the other gives R²=0.94 but it is COMPLETELY MEANINGLESS. Always test for stationarity first!
📈
Stationarity

The Key Concept in Time Series

A time series is weakly stationary if its mean, variance, and autocovariances are constant over time (don't depend on t). Most economic time series (GDP, prices, exchange rates) are NON-stationary — they have trends and drifts.

⚙️
Unit Root Tests

Testing for Non-stationarity

  • Augmented Dickey-Fuller (ADF) test: H₀: series has unit root (non-stationary); Reject H₀ → stationary. The most widely used test.
  • Phillips-Perron (PP) test: Non-parametric correction for serial correlation
  • KPSS test: H₀: stationary (opposite hypothesis — use alongside ADF)
💡
Cointegration

Long-Run Equilibrium

Two non-stationary I(1) series are cointegrated if their linear combination is stationary I(0). They share a long-run equilibrium relationship. Use Engle-Granger two-step method or Johansen test. If cointegrated: use Error Correction Model (ECM).

🔢
Remedies for Non-stationarity

Making Series Stationary

  • Differencing: ΔYₜ = Yₜ − Yₜ₋₁ removes unit root (most common)
  • Log transformation: ln(Yₜ) — stabilises variance and often stationarises
  • Detrending: Remove deterministic trend by regression
  • If I(1) and cointegrated → use ECM instead of differencing
Random walk (unit root)Yₜ = Yₜ₋₁ + εₜ   (non-stationary; var grows with t)
ADF test equationΔYₜ = α + βYₜ₋₁ + Σγⱼ·ΔYₜ₋ⱼ + εₜ   H₀: β=0 (unit root)
Integration order I(d)I(0)=stationary ; I(1)=one difference needed ; I(2)=two differences
Error Correction ModelΔYₜ = α + γ(Yₜ₋₁ − β·Xₜ₋₁) + θΔXₜ + εₜ  (short-run + long-run)
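The random-walk and differencing formulas above can be demonstrated by simulation: a random walk is non-stationary, but its first difference is just the stationary shock series.

```python
# Minimal sketch (pure Python): build Y_t = Y_{t-1} + eps_t, then recover the
# stationary shocks by first-differencing.
import random

random.seed(1)
eps = [random.gauss(0.0, 1.0) for _ in range(200)]   # stationary white noise

y = [0.0]
for step in eps:
    y.append(y[-1] + step)        # random walk: variance grows with t

dy = [y[t] - y[t - 1] for t in range(1, len(y))]     # first difference
print(all(abs(d - s) < 1e-9 for d, s in zip(dy, eps)))  # differencing recovers eps
```

This is why I(1) series are modelled in differences (unless they are cointegrated, in which case the ECM above keeps the long-run level information).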
🌍 Bangladesh ApplicationTesting whether the taka-dollar exchange rate and domestic price level are cointegrated (Purchasing Power Parity). Both series are I(1). Engle-Granger test finds cointegration — a long-run PPP relationship holds. Estimate ECM: the speed-of-adjustment coefficient γ̂ = −0.23 means 23% of any deviation from long-run PPP is corrected each quarter. Highly useful for monetary policy!
STAT4101 · Multivariate Distribution
Σ
STAT4101 · B.Sc. Statistics Year 4 · BRUR
Multivariate Distribution
Aspects of MVA · Distances · Matrix Decompositions · Multivariate Normal · MLE · Inference · Hotelling T² · MANOVA · Multivariate Regression
🎓 Why Multivariate Analysis? "In real life, nothing happens in isolation." Blood pressure AND cholesterol AND BMI together predict heart disease — not one alone. Multivariate analysis handles p variables simultaneously, capturing their joint distributions, correlations, and interactions. As Johnson & Wichern put it: "Most data sets encountered in practice contain measurements on several variables that must be analyzed jointly." The key advantage: we preserve the covariance structure that gets lost when analyzing variables one at a time. 😄 Joke: "A univariate statistician sees a forest of trees. A multivariate statistician sees the forest, the ecosystem, the relationships between trees, AND the soil composition — all at once!"
M1
Introduction
Aspects of Multivariate Analysis
📖
What it is

Meaning & Scope

Multivariate Analysis (MVA) refers to statistical techniques for analysing data with p ≥ 2 variables measured on each observation. Goal: understand the joint behaviour, interdependencies, and structure of these variables simultaneously — not one at a time.

Applications

Where MVA is Used

  • Medical: Joint analysis of blood pressure, cholesterol, BMI, age for heart disease risk
  • Ecology: Species abundance across multiple environmental variables
  • Finance: Portfolio of stocks — returns, risks, correlations simultaneously
  • Psychology: Intelligence tests measuring multiple cognitive dimensions
  • Agriculture: Crop yield as function of soil, rain, temperature, fertiliser jointly
💡
Key Concept

The Data Matrix

MVA operates on an n × p data matrix X: n observations (rows), p variables (columns). Each row is a p-dimensional observation vector xᵢ = (xᵢ₁, xᵢ₂, …, xᵢₚ)'. The entire dataset is the matrix X of dimension n×p.

⚠️
When NOT to Use

Limitations & Cautions

  • Requires multivariate normality for many classical methods — always check!
  • Highly sensitive to outliers — a single bad row can distort everything
  • Sample size n must be >> p (as a rule: n ≥ 5p minimum)
  • Interpretation becomes very challenging as p grows large ("curse of dimensionality")
😄 The "Curse of Dimensionality" Joke"In 1D you need 10 points to understand a distribution. In 10D you need 10¹⁰ points — more than the entire human population. This is why every multivariate statistician is simultaneously excited about p variables and terrified of having too many." — The curse is real, and MVA is largely about fighting it!
· · ·
M2
Distance Measures
Euclidean & Statistical Distance
📏
Euclidean Distance

Ordinary Geometric Distance

The familiar straight-line distance between two points x and y in p-dimensional space: d(x,y) = √[Σᵢ(xᵢ−yᵢ)²]. Simple but has a critical flaw: it treats all variables equally regardless of their scale or correlation. A variable measured in kilometres swamps one measured in centimetres!

🎯
Mahalanobis Distance

Statistical Distance — The MVP

Mahalanobis distance accounts for the scale AND correlation structure of the data via the covariance matrix Σ: d²(x,μ) = (x−μ)'Σ⁻¹(x−μ). It's unit-free and correlation-corrected. Think of it as Euclidean distance in "standardised space" rotated to remove correlations.

💡
Why Mahalanobis?

Advantages Over Euclidean

  • Scale-invariant — variables on different units treated fairly
  • Accounts for correlations — correlated variables don't double-count
  • Identifies multivariate outliers — points far from the centroid in σ units
  • d²(x,μ) ~ χ²(p) under multivariate normality — useful for outlier detection!
🌍
Real Example

Medical Diagnosis

Patient has systolic BP=140mmHg and age=45 years. Euclidean distance from population mean (120mmHg, 40yrs) = √(20²+5²) = 20.6. But BP and age have different scales AND are correlated. Mahalanobis distance gives a meaningful "how unusual is this patient" measure corrected for both scale and the BP-age correlation.

Euclidean Distanced(x,y) = √[(x−y)'(x−y)] = √[Σᵢ(xᵢ−yᵢ)²]
Statistical (Mahalanobis)d²(x,μ) = (x−μ)' Σ⁻¹ (x−μ)
Sample versiond²(xᵢ,x̄) = (xᵢ−x̄)' S⁻¹ (xᵢ−x̄) ~ χ²(p) approximately
Outlier thresholdFlag xᵢ as outlier if d²(xᵢ,x̄) > χ²_(0.975)(p)
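For p = 2 the Mahalanobis formula can be computed by hand, inverting the 2×2 covariance matrix directly. The means echo the BP/age example above; the covariance matrix is a hypothetical illustration.

```python
# Minimal sketch (pure Python, p=2): Euclidean vs squared Mahalanobis distance.

mu = (120.0, 40.0)        # population means: BP (mmHg), age (years)
x = (140.0, 45.0)         # the patient
S = [[100.0, 30.0],       # hypothetical: Var(BP)=100, Var(age)=25, Cov=30
     [30.0, 25.0]]

d_euclid = ((x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2) ** 0.5

det = S[0][0] * S[1][1] - S[0][1] * S[1][0]          # 2x2 inverse by hand
Sinv = [[S[1][1] / det, -S[0][1] / det],
        [-S[1][0] / det, S[0][0] / det]]
v = (x[0] - mu[0], x[1] - mu[1])
d2_mahal = (v[0] * (Sinv[0][0] * v[0] + Sinv[0][1] * v[1])
            + v[1] * (Sinv[1][0] * v[0] + Sinv[1][1] * v[1]))
print(round(d_euclid, 2), round(d2_mahal, 4))
```

The Euclidean value (about 20.6) is dominated by the BP scale, while d² ≈ 4.06 is unit-free and can be compared against the χ²(2) outlier threshold.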
😄 Distance Analogy"Euclidean distance measures 'as the crow flies.' Mahalanobis distance measures 'as the statistician walks' — taking into account the terrain (correlations) and the different scales of measurement (variances). They're both right, but one is much smarter about context."
· · ·
M3
Linear Algebra Tools
Matrix Decompositions — Spectral, Cholesky & Square Root
🔷
Spectral Decomposition

Eigenvalue Decomposition

Every symmetric positive definite matrix A can be decomposed as: A = PΛP' where P = matrix of eigenvectors (orthonormal columns) and Λ = diagonal matrix of eigenvalues λ₁ ≥ λ₂ ≥ … ≥ λₚ > 0. The eigenvectors give the "principal directions" of the data; eigenvalues give the "lengths" in those directions. Foundation of PCA!

🔺
Cholesky Decomposition

Lower-Triangular Factorisation

Every positive definite matrix Σ can be written as Σ = LL' where L is a lower-triangular matrix with positive diagonal entries. Why useful? (1) Simulate multivariate normal data: if Z~N(0,I), then X = μ + LZ ~ N(μ,Σ). (2) Solve linear systems efficiently. (3) Check positive definiteness — Cholesky fails if Σ is not positive definite.

💡
Square Root of Matrix

Matrix Square Root A^(1/2)

Using spectral decomposition: A^(1/2) = PΛ^(1/2)P' where Λ^(1/2) = diag(√λ₁, …, √λₚ). Property: A^(1/2) · A^(1/2) = A. Used to transform data to uncorrelated form: if X ~ Nₚ(μ,Σ), then Σ^(-1/2)(X−μ) ~ Nₚ(0,I) — the "sphering" or "whitening" transformation essential for many multivariate tests.

🔢
Partitioned Covariance

Block Structure of Σ

Partition the p-vector x = (x₍₁₎', x₍₂₎')' into groups of p₁ and p₂ variables. Then Σ = [[Σ₁₁, Σ₁₂],[Σ₂₁, Σ₂₂]] where Σ₁₁=Var(x₍₁₎), Σ₂₂=Var(x₍₂₎), Σ₁₂=Cov(x₍₁₎,x₍₂₎). Used in canonical correlation, conditional distributions, and regression of one group on another.

Spectral DecompositionΣ = PΛP' = Σᵢ λᵢ eᵢeᵢ'   (P orthogonal, Λ diagonal)
CholeskyΣ = LL'   (L lower triangular, lᵢᵢ > 0)
Matrix Square RootΣ^(1/2) = PΛ^(1/2)P'   where   Λ^(1/2)=diag(√λ₁,…,√λₚ)
Whitening TransformZ = Σ^(-1/2)(X − μ) ~ Nₚ(0, Iₚ)
Conditional (partitioned)E(X₁|X₂) = μ₁ + Σ₁₂Σ₂₂⁻¹(X₂−μ₂)
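The Cholesky factorisation Σ = LL' is easy to verify by hand in the 2×2 case, along with its use for generating correlated draws X = μ + LZ. The matrix Σ below is hypothetical.

```python
# Minimal sketch (pure Python, 2x2): Cholesky factor, verified by rebuilding
# Sigma = L L', then used to correlate independent standard normals.

S = [[4.0, 2.0],
     [2.0, 3.0]]          # hypothetical positive definite covariance matrix

l11 = S[0][0] ** 0.5
l21 = S[1][0] / l11
l22 = (S[1][1] - l21 ** 2) ** 0.5    # fails (sqrt of negative) if S not p.d.
L = [[l11, 0.0], [l21, l22]]

rec = [[L[0][0] ** 2, L[0][0] * L[1][0]],          # L L'
       [L[1][0] * L[0][0], L[1][0] ** 2 + L[1][1] ** 2]]
print(L)
print(rec)                 # matches S up to rounding

z = (0.5, -1.0)            # independent standard normal draws (hypothetical)
x = (L[0][0] * z[0], L[1][0] * z[0] + L[1][1] * z[1])   # x = L z ~ N(0, S)
print(x)
```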
😄 Matrix Square Root Joke"Why can't a matrix go to therapy alone? Because it needs its square root to become 'whole' — and its inverse to undo its past mistakes!" More seriously: the matrix square root is what lets us transform any multivariate normal distribution into a standard one, making everything else tractable.
· · ·
M4
Variation in p Dimensions
Covariance Matrix & Generalised Variance
📊
Covariance Matrix Σ

The Multivariate Analogue of Variance

For a p-dimensional random vector X, the covariance matrix Σ (p×p) captures ALL pairwise variances and covariances: σᵢᵢ = Var(Xᵢ) on diagonal; σᵢⱼ = Cov(Xᵢ,Xⱼ) off-diagonal. Σ is symmetric and positive (semi)definite. The sample version S = (n−1)⁻¹Σᵢ(xᵢ−x̄)(xᵢ−x̄)' is the unbiased estimator.

🔢
Generalised Variance

|Σ| — One Number for All Variation

The determinant |Σ| is called the generalised variance — it summarises the total variation in all p variables in a single number. Geometrically: |Σ| is proportional to the squared volume of the p-dimensional ellipsoid formed by the data. |Σ| = 0 means variables are perfectly linearly dependent (degenerate distribution).

💡
Total Variation

Trace of Σ — Alternative Summary

tr(Σ) = σ₁₁ + σ₂₂ + … + σₚₚ = sum of all variances. This is the "total variance" measure. tr(Σ) = Σλᵢ (sum of eigenvalues). Used in PCA: proportion of variance explained by kth PC = λₖ/tr(Σ). Both |Σ| and tr(Σ) are used as scalar measures of multivariate scatter.

🌍
Correlation Matrix

Standardised Version

R = D^(-1/2) Σ D^(-1/2) where D = diag(σ₁₁,…,σₚₚ). All diagonal entries of R = 1; off-diagonal rᵢⱼ ∈ [−1,1]. Working with R (instead of Σ) is equivalent to standardising all variables to unit variance. Most MVA methods can work with either Σ or R — the choice matters for interpretation!

Population ΣΣ = E[(X−μ)(X−μ)']   (p×p symmetric positive definite)
Sample SS = (1/(n−1)) · Σᵢ(xᵢ−x̄)(xᵢ−x̄)'
Generalised Variance|S| = det(S)   (volume of data ellipsoid)
Total Variancetr(S) = s₁₁ + s₂₂ + … + sₚₚ = Σᵢ λᵢ
Correlation MatrixR = D^(-1/2) S D^(-1/2)   (D = diag of variances)
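💻 Code Sketch — a short NumPy check of the four scalar/matrix summaries above (S, |S|, tr(S), R) on simulated correlated data; the mixing matrix is arbitrary, chosen only to induce correlation:

```python
import numpy as np

rng = np.random.default_rng(42)
# Correlated 3-variable data (arbitrary upper-triangular mixing)
X = rng.standard_normal((200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 0.7]])
xbar = X.mean(axis=0)
S = (X - xbar).T @ (X - xbar) / (len(X) - 1)    # unbiased sample covariance
gen_var = np.linalg.det(S)                       # generalised variance |S|
tot_var = np.trace(S)                            # total variance tr(S) = Σλᵢ
D_inv_half = np.diag(1 / np.sqrt(np.diag(S)))
R = D_inv_half @ S @ D_inv_half                  # correlation matrix
```

Note tr(S) equals the sum of the eigenvalues of S, and every diagonal entry of R is exactly 1.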
· · ·
M5
Core Distribution
The Multivariate Normal Distribution
🔔
Definition & Meaning

Nₚ(μ, Σ)

A p-dimensional random vector X follows a multivariate normal distribution Nₚ(μ,Σ) if every linear combination a'X is (univariate) normal for any non-zero vector a. Parameters: mean vector μ (p×1) — location; covariance matrix Σ (p×p) — shape and spread. The MVN is completely characterised by just these two parameters!

📐
Properties

Key Properties of MVN

  • Marginals are normal: Each Xᵢ ~ N(μᵢ, σᵢᵢ)
  • Conditionals are normal: (X₁|X₂=x₂) ~ N(μ₁.₂, Σ₁₁.₂)
  • Linear combinations: AX+b ~ N(Aμ+b, AΣA')
  • Uncorrelated → Independent: special to joint normality! If (Xᵢ,Xⱼ) are jointly MVN and Cov(Xᵢ,Xⱼ)=0, then Xᵢ⊥Xⱼ
  • Quadratic forms: (X−μ)'Σ⁻¹(X−μ) ~ χ²(p)
💡
Contours & Geometry

Elliptical Contours

Contours of constant density for MVN are ellipsoids in p-dimensional space: {x : (x−μ)'Σ⁻¹(x−μ) = c²}. The shape/orientation is determined by Σ. Axes of the ellipse = eigenvectors of Σ; lengths proportional to √λᵢ. In 2D: a tilted ellipse if variables are correlated, circles if uncorrelated.

⚠️
Important Caution

Marginals Normal ≠ Joint Normal

Each variable being normally distributed does NOT imply joint multivariate normality! A classic counterexample: X~N(0,1) and Y = X if |X|>1, Y = −X otherwise. Then X~N, Y~N but (X,Y) is NOT bivariate normal. Always test joint normality, not just marginals!

MVN pdff(x) = (2π)^(-p/2)|Σ|^(-1/2) exp[-½(x−μ)'Σ⁻¹(x−μ)]
Density contour(x−μ)'Σ⁻¹(x−μ) = c²   (p-dim ellipsoid)
Linear transformAX + b ~ Nₖ(Aμ+b, AΣA')   (A: k×p)
Quadratic form(X−μ)'Σ⁻¹(X−μ) ~ χ²(p)
Conditional dist.X₁|X₂=x₂ ~ N(μ₁+Σ₁₂Σ₂₂⁻¹(x₂−μ₂), Σ₁₁−Σ₁₂Σ₂₂⁻¹Σ₂₁)
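💻 Code Sketch — the MVN pdf and the χ²(p) quadratic-form result above, implemented directly in NumPy (parameter values are illustrative). The mean of (X−μ)'Σ⁻¹(X−μ) over many draws should be close to p:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Density f(x) = (2π)^(-p/2)|Σ|^(-1/2) exp[-½(x−μ)'Σ⁻¹(x−μ)]."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** -0.5
    return norm_const * np.exp(-0.5 * quad)

rng = np.random.default_rng(0)
mu = np.zeros(3)
Sigma = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]])
X = rng.multivariate_normal(mu, Sigma, size=20000)
# Quadratic form (X−μ)'Σ⁻¹(X−μ) ~ χ²(3): its sample mean should be ≈ p = 3
d2 = np.einsum('ij,ji->i', X - mu, np.linalg.solve(Sigma, (X - mu).T))
```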
Bivariate Normal Contours — Different Correlation Structures
Panels: ρ = 0 — circular contours (independent variables) · ρ = +0.7 — ellipse tilted one way (positive correlation) · ρ = −0.8 — ellipse tilted the other way (negative correlation)
· · ·
M6
Estimation
MLE of Mean Vector & Covariance Matrix
🎯
MLE of μ

Sample Mean Vector

The MLE of the mean vector μ is simply the sample mean vector x̄ = (1/n)Σᵢxᵢ. It is unbiased E(x̄)=μ and its sampling distribution is: x̄ ~ Nₚ(μ, Σ/n). Larger n → smaller variance of x̄ → more precise estimate. Intuition: just average each variable separately.

⚙️
MLE of Σ

MLE vs Unbiased Estimator

  • MLE: Σ̂ = (1/n)Σᵢ(xᵢ−x̄)(xᵢ−x̄)' — biased (uses n, not n−1)
  • Unbiased S: S = (1/(n−1))Σᵢ(xᵢ−x̄)(xᵢ−x̄)' — used in practice
  • MLE is biased by factor (n−1)/n — for large n, difference negligible
  • Both are consistent estimators (converge to Σ as n→∞)
💡
Sufficiency

Sufficient Statistics for MVN

For MVN data, (x̄, S) is a jointly sufficient statistic for (μ, Σ) — meaning all information in the sample about the parameters is captured by the sample mean vector and sample covariance matrix. No other summary can add more information. This is the multivariate analogue of the fact that (x̄, s²) is sufficient for (μ,σ²) in univariate normal.

📈
Large Sample Behaviour

Asymptotic Results

  • √n(x̄ − μ) → Nₚ(0, Σ) as n→∞ (multivariate CLT)
  • n·(x̄−μ)'S⁻¹(x̄−μ) → χ²(p) as n→∞
  • S → Σ in probability (consistency)
  • These are the basis for large-sample inference about μ
MLE of μμ̂ = x̄ = (1/n) Σᵢ xᵢ   (unbiased)
MLE of Σ (biased)Σ̂ = (1/n) Σᵢ(xᵢ−x̄)(xᵢ−x̄)'
Unbiased SS = (1/(n−1)) Σᵢ(xᵢ−x̄)(xᵢ−x̄)'   (used in tests)
Distribution of x̄x̄ ~ Nₚ(μ, Σ/n)
Multivariate CLT√n(x̄ − μ) →_d Nₚ(0, Σ)   as n→∞
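💻 Code Sketch — the biased-MLE vs unbiased-S relationship above, verified on simulated data. The two estimators differ by exactly the factor (n−1)/n:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))          # n = 50 observations, p = 2
n = len(X)
xbar = X.mean(axis=0)                     # MLE of μ — just the sample mean vector
centred = X - xbar
Sigma_mle = centred.T @ centred / n       # MLE of Σ (divides by n — biased)
S = centred.T @ centred / (n - 1)         # unbiased S (divides by n−1)
# Sigma_mle = S · (n−1)/n, so the difference vanishes as n grows
```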
· · ·
M7
Diagnostics
Assessing Multivariate Normality
🔬
Step 1: Marginal Checks

Univariate Marginal Normality

  • Plot histogram and Q-Q plot for each variable separately
  • Shapiro-Wilk or Kolmogorov-Smirnov test for each Xⱼ
  • Check for skewness and kurtosis near 0 and 3 respectively
  • Warning: All marginals normal ≠ joint MVN! This is necessary but NOT sufficient
📊
Step 2: Bivariate Checks

P-P and Q-Q Plots

  • Bivariate Q-Q plot: Plot ordered chi-square quantiles vs ordered Mahalanobis distances d²ᵢ
  • If MVN: points should fall approximately on a 45° line
  • Bivariate scatter: Should show elliptical pattern for each pair
  • Departures from ellipse indicate non-normality or outliers
💡
Step 3: Outlier Detection

Finding Multivariate Outliers

  • Compute d²ᵢ = (xᵢ−x̄)'S⁻¹(xᵢ−x̄) for each observation
  • Under MVN: d²ᵢ ≈ χ²(p)
  • Flag observations with d²ᵢ > χ²_{0.975}(p) as potential outliers
  • An outlier in a single variable may not appear as multivariate outlier and vice versa!
🔄
Step 4: Transformations

Achieving Near-Normality

  • Square root √x: Right-skewed count data (Poisson-like)
  • Log ln(x): Right-skewed positive data (income, concentrations)
  • Logit ln[p/(1-p)]: Proportions data bounded in (0,1)
  • Box-Cox: x^(λ) — λ estimated from data; λ=0 gives log, λ=0.5 gives √x
  • Fisher's z = ½ln[(1+r)/(1-r)]: For correlation coefficients
Chi-sq Q-Q plotPlot d²₍ᵢ₎ vs χ²_{i/(n+1)}(p) — should be linear if MVN holds
Box-Cox transformx^(λ) = (x^λ − 1)/λ if λ≠0 ; ln(x) if λ=0   (choose λ maximising normality)
Outlier thresholdd²ᵢ > χ²_{0.975}(p) → potential multivariate outlier
Fisher's zz = 0.5 · ln[(1+r)/(1−r)] ~ N(0.5·ln[(1+ρ)/(1−ρ)], 1/(n−3))
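💻 Code Sketch — the Step 3 outlier screen and the chi-square Q-Q check above, in NumPy. To avoid a SciPy dependency, the χ²(2) reference quantiles here are approximated by Monte Carlo (sums of squared standard normals); in practice `scipy.stats.chi2.ppf` would be used:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.multivariate_normal(np.zeros(2), [[1.0, 0.6], [0.6, 2.0]], size=300)
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
diff = X - xbar
d2 = np.einsum('ij,ji->i', diff, np.linalg.solve(S, diff.T))  # Mahalanobis d²ᵢ

# χ²(2) reference quantiles by Monte Carlo (a scipy-free approximation)
chi2_sample = (rng.standard_normal((100000, 2)) ** 2).sum(axis=1)
cutoff = np.quantile(chi2_sample, 0.975)       # ≈ χ²₀.₉₇₅(2) ≈ 7.38
outliers = np.where(d2 > cutoff)[0]            # flagged observations

# Q-Q check: sorted d² vs χ² quantiles at (i − 0.5)/n should be near-linear
qq_theory = np.quantile(chi2_sample, (np.arange(1, 301) - 0.5) / 300)
```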
😄 Transformation Tip"Transforming data to normality is like ironing a wrinkled shirt — the content (information) doesn't change, but the shape becomes much more manageable. The Box-Cox transformation is like an automatic iron that figures out the right temperature (λ) by itself!" Remember to always report which transformation was used so results can be back-transformed for interpretation.
· · ·
M8
Sampling Theory
Wishart Distribution & Sampling Distributions
📐
Wishart Distribution

Multivariate Analogue of χ²

If X₁,…,Xₙ are iid Nₚ(0,Σ), then the matrix W = Σᵢ XᵢXᵢ' ~ Wₚ(n,Σ) follows a Wishart distribution with n degrees of freedom and scale matrix Σ. The sample covariance matrix satisfies: (n−1)S ~ Wₚ(n−1,Σ). It is the matrix generalisation of the chi-square distribution — just as s² has a chi-square distribution in univariate normal, S has a Wishart distribution!

⚙️
Properties of Wishart

Key Facts

  • E(W) = nΣ — so E(S) = Σ (unbiased)
  • If p=1: W reduces to σ²χ²(n) — the familiar univariate result
  • Reproductive: W₁~Wₚ(n₁,Σ) + W₂~Wₚ(n₂,Σ) → W₁+W₂~Wₚ(n₁+n₂,Σ)
  • Used in construction of Hotelling T² and Wilks' Lambda test statistics
💡
Key Sampling Results

Distribution of x̄ and S

  • x̄ and S are independent when sampling from MVN (multivariate analogue of independence of x̄ and s²)
  • x̄ ~ Nₚ(μ, Σ/n)
  • (n−1)S ~ Wₚ(n−1, Σ)
  • n(x̄−μ)'S⁻¹(x̄−μ) ~ [p(n−1)/(n−p)] · Fₚ,ₙ₋ₚ (Hotelling T²)
Wishart(n−1)S ~ Wₚ(n−1, Σ)   when X₁,…,Xₙ iid Nₚ(μ,Σ)
E(S)E(S) = Σ   (unbiased)
Independencex̄ ⊥ S   (when sampling from MVN)
Hotelling T² dist.T² = n(x̄−μ₀)'S⁻¹(x̄−μ₀) ~ [p(n−1)/(n−p)] · Fₚ,ₙ₋ₚ
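💻 Code Sketch — a simulation of the Wishart construction W = Σᵢ xᵢxᵢ' with xᵢ iid Nₚ(0,Σ), checking E(W) = nΣ by averaging over many replicates (Σ and the replicate count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
n, reps = 10, 5000
L = np.linalg.cholesky(Sigma)              # so L @ z ~ N₂(0, Σ) for z ~ N(0, I)
W_sum = np.zeros((2, 2))
for _ in range(reps):
    X = L @ rng.standard_normal((2, n))    # n iid N₂(0, Σ) columns
    W_sum += X @ X.T                       # one draw of W ~ W₂(n, Σ)
W_mean = W_sum / reps                      # should approach E(W) = nΣ
```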
· · ·
M9
Inference
Hotelling T² & MANOVA
🔬
Hotelling's T²

Multivariate t-Test

Tests H₀: μ = μ₀ (mean vector equals a specified vector). Hotelling's T² = n(x̄−μ₀)'S⁻¹(x̄−μ₀). This is the multivariate generalisation of the one-sample t-test. Under H₀: [(n−p)/p(n−1)]·T² ~ Fₚ,ₙ₋ₚ. Reject H₀ if this exceeds F_α(p, n−p). The TWO-SAMPLE version tests H₀: μ₁=μ₂ using the pooled covariance matrix.

📊
MANOVA

Multivariate ANOVA

MANOVA tests whether group mean vectors are equal: H₀: μ₁=μ₂=…=μg. Decomposes the total scatter matrix T into: T = H + E where H=between-group (hypothesis) matrix and E=within-group (error) matrix. Tests use functions of H and E — primarily Wilks' Lambda Λ = |E|/|H+E|.

💡
MANOVA Test Statistics

Four Equivalent Tests

  • Wilks' Lambda: Λ = |E|/|T| — most widely used
  • Pillai's Trace: tr(H(H+E)⁻¹)
  • Hotelling-Lawley Trace: tr(HE⁻¹)
  • Roy's Largest Root: λ₁/(1+λ₁) — most powerful for single-direction alternatives
  • All four equivalent in large samples; differ for small n or specific alternatives
⚠️
MANOVA Assumptions

Requirements

  • Multivariate normality within each group
  • Homogeneity of covariance matrices: Σ₁=Σ₂=…=Σg (Box's M test)
  • Independence of observations
  • n > p (more obs than variables — essential!)
  • ⚠ If assumptions violated → use permutation MANOVA (vegan package)
Hotelling T²T² = n(x̄−μ₀)'S⁻¹(x̄−μ₀)
T² to FF = [(n−p)/p(n−1)] · T² ~ Fₚ,ₙ₋ₚ   under H₀
MANOVA decomp.T = H + E   (Total = Between + Within)
Wilks' LambdaΛ = |E| / |H + E| ∈ (0,1]   (Λ≈1 → H₀ not rejected)
Pillai's TraceV = tr[H(H+E)⁻¹] = Σᵢ λᵢ/(1+λᵢ)
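💻 Code Sketch — the one-sample Hotelling T² and its F transform from the formulas above, as a small NumPy function (the simulated data here are under H₀, just to exercise the code):

```python
import numpy as np

def hotelling_t2(X, mu0):
    """One-sample Hotelling T² = n(x̄−μ₀)'S⁻¹(x̄−μ₀) and its F transform."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    diff = xbar - mu0
    T2 = n * diff @ np.linalg.solve(S, diff)
    F = (n - p) / (p * (n - 1)) * T2        # ~ F(p, n−p) under H₀
    return T2, F

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=40)   # data under H₀
T2, F = hotelling_t2(X, np.zeros(2))
```

Compare F against the F(p, n−p) critical value to decide the test.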
😄 MANOVA Analogy"MANOVA is like ANOVA but instead of asking 'do these groups have different means on ONE measure?' it asks 'do these groups differ on ANY combination of ALL measures simultaneously?' It's like comparing entire personality profiles rather than just one trait. Much more powerful when variables are correlated!" — And Wilks' Lambda is like the p-value's sophisticated older sibling who considers the whole picture.
· · ·
M10
Prediction
Multivariate Multiple Regression
📉
What it is

Multiple Y, Multiple X

Multivariate multiple regression has multiple response variables Y (n×m matrix) AND multiple predictors X (n×(k+1) matrix). Model: Y = XB + E where B (k+1)×m is the coefficient matrix and E is n×m error matrix. Each column of Y is a separate response; they share the same predictors X.

⚙️
Estimation

Matrix OLS

OLS estimator: B̂ = (X'X)⁻¹X'Y. Each column of B̂ is the OLS solution for that response variable separately — so multivariate regression is equivalent to running m separate univariate regressions! However, joint analysis is more efficient and enables tests involving ALL responses simultaneously.

💡
Why Use Jointly?

Advantage of Joint Analysis

  • Tests on coefficient matrix B involving multiple responses simultaneously
  • Accounts for correlations among response variables → more powerful tests
  • Can test hypotheses of form CBM = 0 (general linear hypothesis)
  • Residual covariance matrix Ê'Ê/(n−k−1) estimates Σ — the cross-response correlations
Model (matrix)Y(n×m) = X(n×(k+1)) · B((k+1)×m) + E(n×m)
OLS estimatorB̂ = (X'X)⁻¹X'Y
Residual matrixÊ = Y − XB̂ = (I − H)Y   where H=X(X'X)⁻¹X'
Error covariance est.Σ̂ = Ê'Ê/(n−k−1)
General hypothesisH₀: CBM = 0  → test via Wilks' Λ or Hotelling trace
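💻 Code Sketch — the matrix OLS estimator B̂ = (X'X)⁻¹X'Y above, on simulated data with an assumed true B. It also confirms the point in the text that each column of B̂ equals the separate univariate OLS fit for that response:

```python
import numpy as np

rng = np.random.default_rng(9)
n, k, m = 100, 2, 3                               # n obs, k predictors, m responses
X = np.column_stack([np.ones(n), rng.standard_normal((n, k))])  # with intercept
B_true = rng.standard_normal((k + 1, m))          # hypothetical coefficients
Y = X @ B_true + 0.1 * rng.standard_normal((n, m))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)         # B̂ = (X'X)⁻¹X'Y, all m at once
E_hat = Y - X @ B_hat                             # residual matrix Ê
Sigma_hat = E_hat.T @ E_hat / (n - k - 1)         # error covariance estimate

# Column 0 of B̂ = univariate OLS for response 0 alone
b_col0 = np.linalg.lstsq(X, Y[:, 0], rcond=None)[0]
```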
STAT4201 · Multivariate Analysis II
W
STAT4201 · B.Sc. Statistics Year 4 · BRUR
Multivariate Analysis II
PCA · ICA · Factor Analysis · Cluster Analysis · Discriminant Analysis & Classification
🎓 The Big Picture of MVA II Where Multivariate I asked "how are variables distributed and how do we test hypotheses about means?", Multivariate II asks "what structure is hidden in the data?" PCA finds orthogonal dimensions of maximum variance. Factor Analysis finds latent constructs driving correlations. Cluster Analysis groups similar observations. Discriminant Analysis builds rules to classify new observations. Together, these are the core of unsupervised and supervised multivariate learning. 😄 "MVA II is where statistics starts looking suspiciously like machine learning — because it basically is!"
A1
Dimension Reduction
Principal Component Analysis (PCA)
😄 PCA Analogy"PCA is like finding the best angle to photograph a 3D sculpture so it reveals the most information in a 2D photo. You rotate your perspective to capture maximum variance in each new direction — the first principal component is the angle with the best overall view, the second adds what the first missed, and so on!" Each photo (PC) is orthogonal to the others.
📊
What it is

Finding Maximum Variance Directions

PCA transforms p correlated variables into p uncorrelated Principal Components (PCs) that are linear combinations of the originals. PC1 captures maximum variance; PC2 captures maximum of remaining variance orthogonal to PC1; and so on. Goal: represent data in fewer dimensions with minimal information loss.

⚙️
How PCA Works

The Eigenvalue Approach

  • Compute S (or R for standardised PCA)
  • Find eigenvalues λ₁≥λ₂≥…≥λₚ and eigenvectors e₁,e₂,…,eₚ of S
  • ith PC: Yᵢ = eᵢ'X (linear combination with eigenvector weights)
  • Var(Yᵢ) = λᵢ; Cov(Yᵢ,Yⱼ) = 0 for i≠j
  • Retain k PCs where Σᵢ₌₁ᵏ λᵢ/tr(S) ≥ 0.80 (80% variance rule)
💡
Choosing # of PCs

How Many to Keep?

  • 80% variance rule: Keep enough PCs to explain ≥80% of total variance
  • Scree plot: Plot λᵢ vs i; look for "elbow" — PCs before the bend
  • Kaiser criterion: Keep PCs with λᵢ > 1 (from R, not S)
  • Cross-validation: Prediction error-based selection
⚠️
Conditions & Cautions

When to Use PCA

  • ✅ Use when: variables are correlated; dimension reduction needed; no interpretability required
  • ❌ Don't use when: variables are already uncorrelated (PCA adds nothing)
  • ⚠ PCA components have no natural interpretation — a mix of original variables
  • ⚠ Sensitive to scale — standardise first (use R not S) if variables on different scales
  • ⚠ PCA is unsupervised — it ignores any response/class labels
PCs from eigendecomp.S = PΛP' → Yᵢ = eᵢ'(X−x̄)   (ith PC)
Variance of ith PCVar(Yᵢ) = λᵢ
Proportion explainedPVEᵢ = λᵢ / Σⱼλⱼ = λᵢ / tr(S)
Loadingslᵢⱼ = eᵢⱼ · √λᵢ   (correlation between PC i and variable j scaled)
Communalityhⱼ² = Σᵢ lᵢⱼ²   (variance of Xⱼ explained by retained PCs)
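💻 Code Sketch — PCA via the eigendecomposition S = PΛP' exactly as in the steps above: eigenvalues sorted descending, scores Yᵢ = eᵢ'(X−x̄), PVE, and the 80%-variance rule (the population covariance is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(11)
X = rng.multivariate_normal([0, 0, 0], [[4.0, 2.0, 0.0],
                                        [2.0, 3.0, 0.5],
                                        [0.0, 0.5, 1.0]], size=500)
xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(S)                   # eigh returns ascending order
order = np.argsort(vals)[::-1]                   # sort so λ₁ ≥ λ₂ ≥ …
vals, vecs = vals[order], vecs[:, order]
scores = (X - xbar) @ vecs                       # PC scores Yᵢ = eᵢ'(X − x̄)
pve = vals / vals.sum()                          # proportion of variance explained
k = np.searchsorted(np.cumsum(pve), 0.80) + 1    # smallest k with cum PVE ≥ 80%
```

The PC scores are uncorrelated with Var(Yᵢ) = λᵢ, by construction.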
🌍 Real Application: Socioeconomic IndexBangladesh district data: 8 variables (income, education, health access, sanitation, literacy, employment, poverty rate, infrastructure). PCA extracts PC1 (accounts for 62% variance) which has high positive loadings on income, education, infrastructure and negative loading on poverty — this is a "development index" that can rank districts. Avoids multicollinearity issues in regression by replacing 8 correlated variables with 2-3 orthogonal PCs.
· · ·
A2
Signal Separation
Independent Component Analysis (ICA)
🎵
What it is

Beyond Uncorrelated — Finding Independence

ICA decomposes X = AS + noise where S are statistically independent source signals and A is the mixing matrix. Goal: estimate A and recover S. Unlike PCA (finds uncorrelated components), ICA finds components that are statistically independent — a much stronger condition. Non-Gaussian sources are required!

🔊
The Cocktail Party Problem

Classic Motivation

Imagine p microphones recording a party with p speakers talking simultaneously. Each microphone records a mixture of all voices. ICA recovers the individual voices (independent sources) from the mixed recordings. Applications: EEG/fMRI brain signal separation, audio source separation, financial return decomposition, image processing.

💡
ICA vs PCA

Key Differences

  • PCA: Finds uncorrelated components (2nd-order statistics only)
  • ICA: Finds statistically independent components (uses higher-order statistics)
  • Gaussian case: for jointly Gaussian data, zero covariance already implies independence → ICA cannot improve on PCA and the sources are unidentifiable
  • ICA condition: At most one component can be Gaussian
  • ICA is identified only up to sign, scaling, and permutation of the components — unlike PCA's unique, variance-ordered solution
⚠️
When to Use ICA

Conditions

  • ✅ When sources are truly statistically independent (not just uncorrelated)
  • ✅ When sources are non-Gaussian (critical assumption!)
  • ✅ Signal/source separation problems
  • ❌ Don't use when sources are Gaussian — PCA is equivalent and simpler
  • ⚠ ICA components are not ordered by variance (unlike PCA)
ICA modelX = AS   (X=observed, A=mixing matrix, S=independent sources)
SeparationŜ = WX   where W = A⁻¹ is the unmixing matrix (estimated by ICA)
Non-GaussianityMaximise negentropy: J(y) ≈ [E{G(y)} − E{G(z)}]² (G non-quadratic function)
FastICA updatew ← E[Xg(w'X)] − E[g'(w'X)]w   (gradient-based iteration)
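💻 Code Sketch — a minimal one-unit FastICA in NumPy, using exactly the update rule above with g = tanh: whiten the mixtures, iterate w ← E[Zg(w'Z)] − E[g'(w'Z)]w, normalise. The sources (independent uniforms) and mixing matrix are illustrative; the recovered component should match one true source up to sign:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Two independent non-Gaussian (uniform, unit-variance) sources
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(2, n))
A = np.array([[1.0, 0.5], [0.5, 1.0]])          # hypothetical mixing matrix
X = A @ S                                        # observed mixtures

# Centre and whiten: Z = Λ^{-1/2} E' X so that Cov(Z) = I
X = X - X.mean(axis=1, keepdims=True)
vals, vecs = np.linalg.eigh(X @ X.T / n)
Z = np.diag(vals ** -0.5) @ vecs.T @ X

# One-unit FastICA fixed-point iteration with g = tanh, g' = 1 − tanh²
w = rng.standard_normal(2)
w /= np.linalg.norm(w)
for _ in range(500):
    wx = w @ Z
    g, gp = np.tanh(wx), 1 - np.tanh(wx) ** 2
    w_new = (Z * g).mean(axis=1) - gp.mean() * w
    w_new /= np.linalg.norm(w_new)
    if abs(abs(w_new @ w) - 1) < 1e-9:           # converged (up to sign)
        w = w_new
        break
    w = w_new

s_hat = w @ Z                                    # one recovered component
corr = np.corrcoef(s_hat, S)[0, 1:]              # correlation with true sources
```

One source should show correlation near ±1 — the sign ambiguity mentioned above.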
· · ·
A3
Latent Structure
Factor Analysis — Discovering Hidden Constructs
😄 Factor Analysis Analogy"Factor analysis is like figuring out that what students score on reading, writing, and comprehension tests is really driven by a single underlying construct: 'verbal intelligence.' You can't directly measure verbal intelligence, but you can observe its effects on multiple tests. Factor analysis extracts these invisible factors that drive the observable correlations." — Widely used in psychology, social science, and education.
🔍
What it is

Latent Factor Model

Factor Analysis (FA) models p observed variables as linear combinations of m << p latent (unobservable) common factors F plus unique factors: X = μ + LF + ε. L is the (p×m) loading matrix; F are m common factors; ε are p unique (specific) factors. Goal: interpret the common factors as meaningful latent constructs.

⚙️
FA vs PCA

Critical Differences

  • PCA: Explains total variance; components are explicit linear combos of X; descriptive
  • FA: Explains common variance only (not unique/error variance); factors are latent unobservables; model-based
  • PCA: Unique solution; components are ordered by variance
  • FA: Solution not unique — rotation can be applied to improve interpretability!
💡
Factor Rotation

Making Factors Interpretable

  • Orthogonal rotation (Varimax): Maximises variance of squared loadings per column — produces "simple structure" where each variable loads highly on one factor and near-zero on others. Factors remain uncorrelated.
  • Oblique rotation (Promax, Oblimin): Allows factors to be correlated — more realistic when latent constructs are related (e.g., verbal and mathematical intelligence are correlated)
⚠️
Conditions

When FA is Appropriate

  • ✅ Variables are correlated (|R|<1) — if uncorrelated, no common factors exist
  • ✅ You believe latent constructs drive the correlations (theory-driven)
  • ✅ Communalities h² should be reasonable — if all h²≈0, model fails
  • ❌ Don't use FA when all variance is unique — use PCA instead
  • ⚠ Factor identification requires subjective interpretation — what does Factor 1 "mean"?
Factor ModelX − μ = L·F + ε   (L: loadings p×m; F: factors m×1; ε: unique factors)
Covariance structureΣ = LL' + Ψ   (Ψ = diag unique variances)
Communalityhⱼ² = Σₖ lⱼₖ²   (proportion of Var(Xⱼ) explained by common factors)
Uniquenessψⱼ = 1 − hⱼ²   (proportion unexplained by common factors)
Factor scoresF̂ = (L'Ψ⁻¹L)⁻¹L'Ψ⁻¹(X−μ)   (Bartlett's method)
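💻 Code Sketch — the covariance structure Σ = LL' + Ψ, communalities, and uniquenesses above, checked for a hypothetical one-factor model on standardised variables (the loadings are made-up illustrative values):

```python
import numpy as np

# Hypothetical loadings: 4 standardised variables, 1 common factor
L = np.array([[0.9], [0.8], [0.7], [0.3]])
h2 = (L ** 2).sum(axis=1)          # communalities hⱼ² = Σₖ lⱼₖ²
Psi = np.diag(1 - h2)              # uniquenesses ψⱼ = 1 − hⱼ² (unit variances)
Sigma = L @ L.T + Psi              # implied covariance Σ = LL' + Ψ
# For standardised variables, the diagonal of Σ must be exactly 1
```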
🌍 Bangladesh Application: Poverty Index10 district-level variables measured: income, education, sanitation, health access, child mortality, malnutrition, electricity, road access, drinking water quality, school enrolment. FA extracts 3 factors: Factor 1 (high loadings on income, electricity, roads) = "Infrastructure & Economy"; Factor 2 (health access, child mortality, malnutrition) = "Health Status"; Factor 3 (education, school enrolment) = "Human Capital". These factors become inputs to a multidimensional poverty index. Much more interpretable than raw 10-variable data!
· · ·
A4
Unsupervised Grouping
Cluster Analysis — Finding Natural Groups
😄 Clustering Joke"Cluster analysis is what you do when you have data but no one told you what groups exist. It's like showing up at a party where you know nobody — after a while you notice people naturally cluster by interest, age group, or how loudly they speak. Cluster analysis does this mathematically, without you having to mingle!" The key challenge: you don't know the 'right' answer — there's no objective truth in unsupervised learning.
🗂️
What it is

Grouping Without Labels

Cluster analysis partitions n observations into g groups (clusters) such that observations within a cluster are similar and observations between clusters are dissimilar. It is unsupervised — no predefined groups or labels. Goal: discover natural structure in data.

⚙️
Hierarchical Clustering

Building a Dendrogram

  • Agglomerative (bottom-up): Start with n clusters (each obs = 1 cluster); merge closest pair; repeat until all in 1 cluster. Most common.
  • Divisive (top-down): Start with 1 cluster; split recursively
  • Linkage methods: Single (minimum distance), Complete (maximum), Average (UPGMA), Ward's (minimise within-cluster variance)
  • Result: Dendrogram — cut at desired level to get g clusters
💡
K-Means Clustering

Iterative Partitioning

  • Specify k (number of clusters) in advance
  • Algorithm: (1) Assign each obs to nearest centroid; (2) Update centroids as cluster means; (3) Repeat until convergence
  • Minimises: Σₖ Σᵢ∈Cₖ ‖xᵢ − μₖ‖² (within-cluster sum of squares)
  • Sensitive to initial centroids — run multiple times with random starts
  • Choosing k: Elbow plot, Silhouette coefficient, Gap statistic
⚠️
Conditions & Cautions

When Each Method Works

  • ✅ Hierarchical: Small-medium n; want to see ALL possible groupings; no need to prespecify k
  • ✅ K-means: Large n; approximately spherical clusters; k known or can be estimated
  • ❌ K-means fails for non-spherical clusters (use DBSCAN, Gaussian mixture models)
  • ⚠ Scale matters enormously — standardise variables before clustering!
  • ⚠ No "correct" clustering — always validate with external criteria
K-Means objectiveMinimise W(C) = Σₖ₌₁ᴷ Σᵢ∈Cₖ ‖xᵢ − μ̄ₖ‖²
Single linkage dd(A,B) = min{d(a,b) : a∈A, b∈B}
Complete linkage dd(A,B) = max{d(a,b) : a∈A, b∈B}
Ward's linkageMerge A and B if merger minimises increase in total within-cluster SS
Silhouettes(i) = [b(i)−a(i)] / max{a(i),b(i)} ∈ [−1,1]  (higher=better cluster)
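💻 Code Sketch — a bare-bones Lloyd's algorithm for the K-means objective above (assign to nearest centroid, update centroids as cluster means, repeat), run on two artificial well-separated Gaussian blobs:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal Lloyd's algorithm minimising within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # random starting points
    for _ in range(iters):
        # (1) assign each observation to its nearest centroid
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # (2) update centroids as cluster means (keep old centroid if empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):                   # (3) converged
            break
        centroids = new
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((50, 2)) + [0, 0],     # blob at (0,0)
               rng.standard_normal((50, 2)) + [6, 6]])    # blob at (6,6)
labels, centroids = kmeans(X, k=2)
```

Real analyses should rerun with multiple random starts, as the notes warn.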
🌍 Bangladesh Health Cluster64 districts clustered on 6 health indicators. K-means (k=3 chosen by elbow plot) identifies: Cluster 1 (12 districts, Dhaka-centred) = high healthcare access, low mortality; Cluster 2 (28 districts) = moderate on all indicators; Cluster 3 (24 districts, Char/haor areas) = low access, high child mortality, high malnutrition. This clustering directly informs resource allocation for the Ministry of Health — districts in Cluster 3 receive priority funding.
· · ·
A5
Supervised Classification
Discriminant & Classification Analysis
😄 Discriminant Analysis Analogy"Discriminant analysis is like training a sorting machine. You show it thousands of labelled patients ('has disease' / 'no disease') along with their test results. It learns the pattern of test results that best separates the groups. Then when a new patient arrives with only test results (no diagnosis), the machine classifies them. Fisher's Linear Discriminant is one of the oldest and most elegant classification algorithms — predating neural networks by 80+ years!"
🎯
What it is

Supervised Group Separation

Discriminant Analysis has TWO goals: (1) Description: find linear combinations of variables (discriminant functions) that best separate g known groups; (2) Classification: build a rule to assign future observations to one of the g groups. Unlike cluster analysis: group memberships are KNOWN for the training data.

⚙️
Fisher's LDA

Linear Discriminant Analysis

  • Find direction w that maximises between-group variance / within-group variance: w = Sₚ⁻¹(x̄₁ − x̄₂) for 2-group case
  • Classify new x to group 1 if: w'x ≥ midpoint(w'x̄₁, w'x̄₂)
  • Assumes equal covariance matrices Σ₁=Σ₂ → uses pooled Sₚ
  • For g>2: compute g−1 discriminant functions
💡
Probabilistic Classification

Bayes Classification Rules

  • Linear discriminant rule (LDA): Equal Σ → linear boundary
  • Quadratic discriminant (QDA): Unequal Σ → quadratic boundary (more flexible)
  • Classify x to group g* = argmax Pᵢ·f(x|group i) (posterior probability)
  • Minimum ECM rule: Accounts for unequal misclassification costs and prior probabilities
⚠️
Evaluating Performance

Classification Error

  • APER: Apparent Error Rate — fraction misclassified on training data (optimistically biased)
  • Cross-validated error: Leave-one-out cross-validation — better estimate of true error
  • Confusion matrix: Rows = actual groups; columns = predicted groups; diagonal = correct classifications
  • ⚠ APER always underestimates true misclassification rate — always use CV!
Fisher's discriminantw = Sₚ⁻¹(x̄₁ − x̄₂)   (direction of maximum separation)
Pooled covarianceSₚ = [(n₁−1)S₁ + (n₂−1)S₂] / (n₁+n₂−2)
LDA classificationAssign x to g1 if (x̄₁−x̄₂)'Sₚ⁻¹[x − ½(x̄₁+x̄₂)] ≥ ln(π₂/π₁)
Mahalanobis criterionAssign x to group g* = argminᵢ D²(x, x̄ᵢ)   (nearest centroid)
APERAPER = (# misclassified) / n   (training error — always optimistic)
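💻 Code Sketch — two-group Fisher LDA from the formulas above: pooled Sₚ, discriminant direction w = Sₚ⁻¹(x̄₁−x̄₂), midpoint classification rule (equal priors), and APER on simulated well-separated groups:

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=60)   # group 1 (simulated)
X2 = rng.multivariate_normal([3, 3], np.eye(2), size=60)   # group 2 (simulated)
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
n1, n2 = len(X1), len(X2)
S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
Sp = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)       # pooled covariance
w = np.linalg.solve(Sp, m1 - m2)                           # Fisher direction
midpoint = 0.5 * (w @ m1 + w @ m2)

# Classify to group 1 if w'x ≥ midpoint (equal priors, equal costs)
pred1 = X1 @ w >= midpoint
pred2 = X2 @ w >= midpoint
aper = (np.sum(~pred1) + np.sum(pred2)) / (n1 + n2)        # training error rate
```

As the notes stress, APER is optimistic — leave-one-out CV gives an honest estimate.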
🌍 Bangladesh Medical ApplicationClassifying TB patients into 3 treatment response groups (rapid/moderate/slow responder) based on 6 baseline clinical variables (age, BMI, sputum grade, haemoglobin, ESR, CD4 count). LDA builds two discriminant functions. Cross-validated APER = 18% (82% correctly classified). The discriminant scores of new patients can be computed from their baseline labs to predict treatment response category — guiding personalised treatment decisions before expensive sensitivity testing is complete.
STAT2201 · Sampling Distribution
STAT2201 · B.Sc. Statistics Year 2 · BRUR
Sampling Distribution
Sampling Distributions of Mean · Variance · Proportions · CLT · t · F · χ² Distributions · Estimation · Confidence Intervals
🎓 What is a Sampling Distribution? "If you took your sample 10,000 times and computed the mean each time, what would the distribution of those means look like?" THAT is the sampling distribution — not the distribution of data, but the distribution of a statistic over repeated sampling. 😄 "The sampling distribution is the bridge between data and inference — without it, statistics would just be fancy arithmetic."
S1
Foundations
Population vs Sample · Parameters vs Statistics
🌍
Population & Parameters

The Complete Set

  • Population (N): All items of interest — fixed but usually unobservable
  • Parameter: μ, σ², π — numerical summaries of the population, FIXED but UNKNOWN
  • Almost never observe the whole population — too large, costly, or destructive
🔬
Sample & Statistics

What We Actually Observe

  • Sample (n): n observations drawn from the population
  • Statistic: x̄, s², p̂ — functions of the sample; RANDOM VARIABLE before sampling
  • The KEY insight: statistics vary sample to sample — this variation has a pattern = sampling distribution
💡
3 Different Distributions

Never Confuse These!

  • Population distribution: All individuals — shape could be anything
  • Sample distribution: Your n observations — approximates population
  • Sampling distribution: Distribution of the STATISTIC over repeated samples
  • 😄 "Confusing these three is the #1 intro-stats mistake. The CLT applies to the THIRD one!"
⚠️
Standard Error

SE ≠ SD

  • SD: Variability of individual observations (fixed, doesn't shrink with n)
  • SE(x̄): Variability of the sample MEAN over repeated samples = σ/√n
  • SE shrinks as n increases — more data → more precise estimate of μ
Standard Error of x̄SE(x̄) = σ/√n   (decreases with n — more data = more precise)
UnbiasednessE(x̄) = μ   ;   E(s²) = σ² (why we divide by n−1, not n)
· · ·
S2
Key Result
Sampling Distribution of the Mean
📊
Normal Population

Exact Result (Any n)

If X₁,…,Xₙ iid N(μ,σ²), then x̄ ~ N(μ, σ²/n) exactly for any n. Standardise: Z = (x̄−μ)/(σ/√n) ~ N(0,1). When σ unknown, replace with s → T = (x̄−μ)/(s/√n) ~ t(n−1).

💡
Effect of n

More Data = Narrower Distribution

  • Larger n → smaller SE = σ/√n → sampling distribution narrows around μ
  • Doubling n reduces SE by √2 ≈ 1.41 (not by 2 — diminishing returns!)
  • To halve the SE, you must QUADRUPLE n — sampling is expensive!
x̄ from N(μ,σ²)x̄ ~ N(μ, σ²/n)   exact for any n
Z statistic (σ known)Z = (x̄ − μ) / (σ/√n) ~ N(0,1)
t statistic (σ unknown)T = (x̄ − μ) / (s/√n) ~ t(n−1)
· · ·
S3
The Crown Jewel
Central Limit Theorem (CLT)
👑
The CLT

Most Important Theorem in Statistics

Let X₁,…,Xₙ be iid with mean μ and finite variance σ². Then as n→∞: √n(x̄−μ)/σ →_d N(0,1) regardless of the population distribution shape. For large n: x̄ ≈ N(μ, σ²/n). This is why the normal distribution appears everywhere!

💡
Why it's Magic

Any Population → Normal x̄

  • Population can be exponential, uniform, skewed, bimodal — doesn't matter!
  • n ≥ 30: CLT approximation usually good; n ≥ 50 for very skewed populations
  • Foundation for: t-tests, z-tests, ANOVA, regression inference, and almost everything
  • 😄 "CLT: Statistics' superhero. No matter what messy distribution you throw at it — average enough and you get normal. Every time."
⚠️
When CLT Fails

Important Exceptions

  • Cauchy distribution: NO finite variance → CLT doesn't apply
  • Very small n with highly skewed data
  • Dependent observations: standard CLT requires independence
  • Always use exact t/F/χ² when population is exactly normal
CLT (formal)√n(x̄ − μ)/σ →_d N(0,1) as n→∞   (iid, finite σ²)
Practical formx̄ ≈ N(μ, σ²/n) for n≥30 approximately
Sum versionSₙ = ΣXᵢ ≈ N(nμ, nσ²) for large n
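💻 Code Sketch — a quick CLT demonstration: sample means of a heavily right-skewed Exponential(1) population (μ = σ = 1) should behave like N(μ, σ²/n) even though the population is nothing like normal:

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 40, 20000
# Skewed population: Exponential(1) has μ = 1 and σ = 1
means = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
# CLT says: mean of `means` ≈ μ = 1, sd of `means` ≈ σ/√n = 1/√40 ≈ 0.158
```

Plotting a histogram of `means` would show the near-normal bell shape.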
· · ·
S4
Variance Inference
Chi-Square, t & F Distributions
📐
Chi-Square χ²(k)

Sum of Squared Normals

  • χ²(k) = Z₁²+…+Zₖ² where Zᵢ iid N(0,1)
  • Mean=k; Var=2k; Right-skewed; always ≥ 0
  • Sampling dist of variance: (n−1)s²/σ² ~ χ²(n−1)
  • ⚠ Requires population normality — sensitive to departures!
🍺
t Distribution

Z ÷ √(χ²/ν) — The Guinness Distribution

  • t(ν) = Z/√(χ²(ν)/ν); heavier tails than N(0,1)
  • T = (x̄−μ)/(s/√n) ~ t(n−1) when sampling from N(μ,σ²)
  • As ν→∞: t(ν) → N(0,1)
  • 😄 "Invented by Gosset at Guinness Brewery — published as 'Student' because Guinness prohibited employee publications. Cheers to small samples! 🍺"
💡
F Distribution

Ratio of Two Chi-Squares

  • F(k₁,k₂) = [χ²(k₁)/k₁] / [χ²(k₂)/k₂]
  • F = s₁²/s₂² ~ F(n₁−1,n₂−1) for variance ratio test
  • F = MSA/MSE in ANOVA; t²(ν) = F(1,ν)
  • Named for Ronald Fisher — inventor of ANOVA, p-values, and experimental design
χ² from sample variance(n−1)s²/σ² ~ χ²(n−1)   (population normal)
CI for σ²[(n−1)s²/χ²_{α/2}, (n−1)s²/χ²_{1−α/2}]
Two-sample t (equal σ)T = (x̄₁−x̄₂)/(sₚ√(1/n₁+1/n₂)) ~ t(n₁+n₂−2)
CI for μ (σ unknown)x̄ ± t_{α/2,n−1} · s/√n
Sample size for μn = (z_{α/2} · σ / E)²   (E = desired margin of error)
🌍 Bangladesh ExampleA nutritionist samples 40 children aged 5–10 from Rangpur to estimate mean height. Sample mean = 112 cm, s = 8.4 cm. 95% CI: 112 ± t_{0.025,39} × 8.4/√40 = 112 ± 2.023 × 1.33 = [109.3, 114.7] cm. We are 95% confident the true population mean height is between 109.3 and 114.7 cm. To halve the margin of error, we would need n = 4×40 = 160 children — quadrupling the sample!
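💻 Code Sketch — the Rangpur height CI reproduced in NumPy, using the table value t₀.₀₂₅,₃₉ = 2.023 quoted in the example (in practice `scipy.stats.t.ppf` would supply it):

```python
import numpy as np

# Numbers from the Rangpur height example above
n, xbar, s = 40, 112.0, 8.4
t_crit = 2.023                       # t₀.₀₂₅,₃₉ from tables, as in the example
se = s / np.sqrt(n)                  # standard error of the mean ≈ 1.33
ci = (xbar - t_crit * se, xbar + t_crit * se)   # 95% CI for μ
```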
· · ·
S5
Proportions
Sampling Distribution of Proportions & Estimation
📈
Sample Proportion

For Binary Outcomes

p̂ = X/n where X~Binomial(n,p). E(p̂)=p (unbiased); Var(p̂)=p(1−p)/n. By CLT: p̂ ≈ N(p, p(1−p)/n) when np≥10 AND n(1−p)≥10. Standard error: SE(p̂) = √[p(1−p)/n].

💡
CI Interpretation

What 95% CI Really Means

A 95% CI: if repeated sampling 100 times and CI computed each time, about 95 of those intervals contain the true μ. It does NOT mean "95% probability μ is in this specific interval" — μ is fixed! 😄 "The CI is a fishing net — 95% of the time it catches the fish (μ). Once cast, the fish is either inside or not."

p̂ approx. dist.p̂ ≈ N(p, p(1−p)/n) for large n (np≥10 AND n(1−p)≥10)
95% CI for pp̂ ± 1.96 · √[p̂(1−p̂)/n]   (Wald interval)
Sample size for pn = z²_{α/2} · p(1−p) / E²   (use p=0.5 if unknown)
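💻 Code Sketch — the Wald interval and the planning formula above, on a hypothetical survey (320 "yes" out of 800 is an invented example, not from the notes):

```python
import numpy as np

# Hypothetical survey: 320 of 800 respondents say yes
n, x = 800, 320
p_hat = x / n                                    # 0.40
se = np.sqrt(p_hat * (1 - p_hat) / n)            # SE(p̂)
wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)    # 95% Wald interval

# Planning: n for margin of error E = 0.03 at 95%, conservative p = 0.5
n_needed = int(np.ceil(1.96 ** 2 * 0.5 * 0.5 / 0.03 ** 2))
```

Note the success/failure conditions np ≥ 10 and n(1−p) ≥ 10 are easily met here.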
STAT2203 · Analysis of Variance & Design of Experiment
F
STAT2203 · B.Sc. Statistics Year 2 · BRUR
Analysis of Variance & Design of Experiment
One-Way ANOVA · Two-Way ANOVA · Post-Hoc Tests · CRD · RBD · LSD · Factorial Designs · 2ᵏ Designs
🎓 ANOVA in one sentence ANOVA tests whether means of 3+ groups differ — by comparing BETWEEN-group variance to WITHIN-group variance. Why not just do many t-tests? With g groups you'd need C(g,2) t-tests, inflating Type I error massively. ANOVA controls this with ONE test. 😄 "ANOVA: Statistics' way of comparing all your groups at once, without letting false alarms pile up." Fisher's golden rule of DOE: "Block what you can; randomise what you cannot."
V1
Core Test
One-Way ANOVA
📊
What it is

Comparing g Group Means

H₀: μ₁=μ₂=…=μg vs H₁: at least one μᵢ differs. Partitions total variation: SST = SSA + SSE. If between-group (MSA) >> within-group (MSE), groups differ. F = MSA/MSE ~ F(g−1, N−g) under H₀.

⚙️
The Logic

Why Variance Tests Means

  • MSA (between): Measures group-mean differences — large if μᵢ differ
  • MSE (within): Measures random error — unaffected by group differences
  • Under H₀: both estimate σ² → F≈1. Under H₁: MSA >> MSE → F >> 1
💡
Effect Size

η² — How Meaningful Is the Effect?

η² = SSA/SST: proportion of variance explained. Benchmarks: small=0.01, medium=0.06, large=0.14. ALWAYS report — significant F with tiny η² means real but trivially small difference! 😄 "Statistical significance ≠ practical importance."

⚠️
What ANOVA Doesn't Tell

Which Groups Differ?

Significant F only says "at least one mean differs" — need post-hoc tests to find WHICH pairs differ. Never claim the group with highest mean is significantly different without a post-hoc test — that's data dredging!

One-Way ANOVA Table
Source | df | SS | MS | F
Among Groups (Treatment) | g−1 | SSA = Σnᵢ(ȳᵢ−ȳ)² | MSA = SSA/(g−1) | MSA/MSE
Within Groups (Error) | N−g | SSE = ΣΣ(yᵢⱼ−ȳᵢ)² | MSE = SSE/(N−g) |
Total | N−1 | SST = ΣΣ(yᵢⱼ−ȳ)² | |
F ~ F(g−1, N−g) under H₀  |  η² = SSA/SST
ANOVA modelYᵢⱼ = μ + αᵢ + εᵢⱼ   (αᵢ=group effect, Σαᵢ=0)
F statisticF = MSA/MSE ~ F(g−1, N−g) under H₀
Effect size η²η² = SSA/SST   (small=0.01, medium=0.06, large=0.14)
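The SSA/SSE decomposition is easy to verify by hand and against scipy — a sketch with made-up yields for g = 3 groups of 4 observations each:

```python
import numpy as np
from scipy import stats

# Illustrative yields (made-up numbers)
groups = [np.array([20., 22, 19, 23]),
          np.array([25., 27, 26, 24]),
          np.array([30., 29, 31, 28])]

N = sum(len(g) for g in groups)
g_ = len(groups)
grand = np.concatenate(groups).mean()

SSA = sum(len(g) * (g.mean() - grand)**2 for g in groups)   # between-group SS
SSE = sum(((g - g.mean())**2).sum() for g in groups)        # within-group SS
SST = SSA + SSE                                             # total SS

MSA, MSE = SSA / (g_ - 1), SSE / (N - g_)
F = MSA / MSE                                               # F = MSA/MSE
eta_sq = SSA / SST                                          # effect size η²

# scipy's one-way ANOVA should reproduce the same F statistic
F_scipy, p = stats.f_oneway(*groups)
```

Here η² ≈ 0.88 — the group factor explains most of the variance, so the large F is also practically meaningful.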
· · ·
V2
Checking the Model
Assumptions, Diagnostics & Post-Hoc Tests
📋
3 Assumptions

Independence · Equal Variance · Normality

  • Independence: All observations independent — ensured by randomisation
  • Homoscedasticity: σ₁²=…=σg² — test with Levene's test (robust) or Bartlett's (sensitive to non-normality)
  • Normality: Residuals eᵢⱼ = yᵢⱼ − ȳᵢ should be ~ N(0,σ²) — check Q-Q plot or Shapiro-Wilk
🔍
Post-Hoc Tests

Finding Which Pairs Differ

  • Tukey's HSD: Best for all pairwise comparisons — controls FWER
  • Bonferroni: α* = α/m — conservative; good for any planned comparisons
  • Scheffé: Controls FWER for all contrasts — most conservative
  • Kruskal-Wallis: Non-parametric alternative when normality fails
Tukey HSD|ȳᵢ − ȳⱼ| > q_{α,g,N−g} · √(MSE/n) → significant pair
Bonferroni adjusted αα* = α/m   (m = number of comparisons)
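The Bonferroni adjustment α* = α/m can be sketched directly with pairwise t-tests — a minimal Python example on made-up data for three groups (illustrative only; Tukey's HSD would be preferred for all-pairs comparisons):

```python
from itertools import combinations
from scipy import stats

# Made-up data for three groups (hypothetical measurements)
groups = {"A": [20, 22, 19, 23], "B": [25, 27, 26, 24], "C": [30, 29, 31, 28]}

alpha = 0.05
pairs = list(combinations(groups, 2))      # C(3,2) = 3 pairwise comparisons
alpha_star = alpha / len(pairs)            # Bonferroni-adjusted per-test level

results = {}
for a, b in pairs:
    t, p = stats.ttest_ind(groups[a], groups[b])   # two-sample t-test per pair
    results[(a, b)] = (p, p < alpha_star)          # significant at adjusted level?
```

Each pair must now clear p < 0.05/3 ≈ 0.0167 rather than 0.05 — that is how the family-wise error rate stays controlled.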
· · ·
V3
Two Factors & Interaction
Two-Way ANOVA & Interaction Effect
📐
Two-Way ANOVA

Model & Decomposition

Tests: (1) Main effect of A; (2) Main effect of B; (3) Interaction A×B. SST = SSA + SSB + SSAB + SSE. Interaction is most interesting — does the effect of A depend on the level of B? Plot interaction plots: parallel lines = no interaction; crossing lines = interaction present.

💡
Interaction

It Depends! — The Most Important Result

Significant interaction means the effect of fertiliser on yield DEPENDS on which crop variety is used. Cannot interpret main effects in isolation when interaction is significant. 😄 "Interaction is statistics saying: 'it depends' — and that's almost always the most scientifically interesting answer."

Two-way modelYᵢⱼₖ = μ + αᵢ + βⱼ + (αβ)ᵢⱼ + εᵢⱼₖ
SS decompositionSST = SSA + SSB + SSAB + SSE
F for interactionF_{AB} = MSAB/MSE ~ F((a−1)(b−1), ab(n−1))
· · ·
V4
Experimental Designs
CRD · RBD · LSD & Factorial Designs
🎲
CRD

Completely Randomised Design

Treatments randomly assigned to all units with no restrictions. Simplest design — use when units are homogeneous. Analysis: one-way ANOVA. df_error = N−t. Disadvantage: if units are heterogeneous, MSE will be large and F-test will be weak.

🧱
RBD

Randomised Block Design

Group similar units into blocks; randomise treatments within blocks. Removes block variation from error → smaller MSE → more powerful F-test. Fisher's golden rule: "Block what you can, randomise what you cannot." df_error = (t−1)(b−1). Widely used in agricultural, medical, and industrial experiments.

🔲
LSD

Latin Square Design — Two-Way Blocking

Controls TWO nuisance variables (rows and columns) simultaneously. A t×t square where each treatment appears exactly once in each row and column. df_error = (t−1)(t−2). Assumes no three-way interaction between rows, columns, and treatments.

🔢
2ᵏ Factorial

k Factors at 2 Levels Each

  • All 2ᵏ combinations of k factors (each at low/high)
  • Estimates all main effects AND all interactions
  • Fractional 2^{k-p}: half/quarter fractions to reduce runs
  • Yates algorithm computes all effects efficiently
  • 😄 "The 2ᵏ design: maximum information, minimum runs — the statistician's favourite meal."
CRD modelYᵢⱼ = μ + τᵢ + εᵢⱼ   df_error = N−t
RBD modelYᵢⱼ = μ + τᵢ + βⱼ + εᵢⱼ   df_error = (t−1)(b−1)
LSD modelYᵢⱼₖ = μ + ρᵢ + γⱼ + τₖ + εᵢⱼₖ   df_error = (t−1)(t−2)
2² main effect AEffect A = [(y_a+y_ab) − (y_(1)+y_b)] / 2n
🌍 Bangladesh Agricultural TrialTesting 4 fertiliser treatments (t=4) on rice in 3 blocks (b=3) of similar soil fertility. RBD gives df_error=(4−1)(3−1)=6. Result: F=8.4 (p=0.014), η²=0.62 — treatments explain 62% of variance. Tukey post-hoc: Treatment D significantly outperforms A and B (p<0.05) but not C (p=0.12). Blocking removed soil-fertility variability, making the test sensitive enough to detect real treatment differences that a CRD might have missed.
STAT3201 · Hypothesis Testing
H₀
STAT3201 · B.Sc. Statistics Year 3 · BRUR
Hypothesis Testing
Neyman-Pearson Framework · Type I & II Errors · Power · MP & UMP Tests · Likelihood Ratio Tests · p-values · Non-Parametric Tests
🎓 The Court of Statistics We assume H₀ is true (innocent until proven guilty) and ask: how surprising is our data if H₀ were true? If very surprising (small p-value), we reject H₀. 😄 "H₀ is like a stubborn professor — it won't budge unless the evidence is overwhelming. And even then, there's a chance you made a mistake (Type I error)." Key texts: Casella & Berger for theory; Lehmann & Romano for advanced testing.
H1
Framework
Hypotheses, Decision Rules & Error Types
📖
H₀ and H₁

Setting Up the Test

  • H₀: Status quo / no effect — assumed true by default
  • H₁: What we're trying to demonstrate
  • Simple: Completely specifies distribution (μ=5)
  • Composite: Specifies a class (μ>5)
  • One vs two-sided: H₁: μ>μ₀ vs H₁: μ≠μ₀
⚖️
Error Types

Four Outcomes

  • ✅ H₀ true, Don't reject: Correct (prob 1−α)
  • ❌ H₀ true, Reject: Type I error α — false alarm
  • ❌ H₀ false, Don't reject: Type II error β — missed detection
  • ✅ H₀ false, Reject: Power = 1−β — correct detection
💡
The Tradeoff

α↓ → β↑ for Fixed n

Decreasing α (fewer false alarms) increases β (more misses) for fixed n. Only way to reduce both: increase n. Power = 1−β should be ≥ 0.80 in well-designed studies. 😄 "Demanding 99.9% confidence with n=5 is like demanding perfect night vision in complete darkness — physically impossible with so little data!"

⚠️
Key Asymmetry

H₀ and H₁ Are Not Equal

We control α directly. β depends on α, n, and the true parameter. We can NEVER "prove H₀" — only fail to reject it. "Not guilty ≠ innocent. Fail to reject H₀ ≠ H₀ is true."

Type I error αP(reject H₀ | H₀ true) — false positive; set before the test
Type II error βP(fail to reject H₀ | H₁ true) — false negative; depends on n, δ, σ
Power1 − β = P(reject H₀ | H₁ true) — ability to detect a real effect
Sample size (z-test)n = σ²(z_α + z_β)² / (μ₁−μ₀)²
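The sample-size formula above can be checked numerically — a sketch for a one-sided z-test with assumed design values (μ₀ = 100, μ₁ = 105, σ = 10 are illustrative):

```python
import math
from scipy.stats import norm

# Illustrative design: detect a shift from mu0 = 100 to mu1 = 105 with sigma = 10
mu0, mu1, sigma = 100.0, 105.0, 10.0
alpha, power_target = 0.05, 0.80

z_a = norm.ppf(1 - alpha)          # one-sided critical value z_alpha
z_b = norm.ppf(power_target)       # z_beta, with beta = 1 - power

# n = sigma^2 (z_alpha + z_beta)^2 / (mu1 - mu0)^2, rounded up
n = math.ceil(sigma**2 * (z_a + z_b)**2 / (mu1 - mu0)**2)

# Achieved power at that n: P(reject H0 | mu = mu1)
power = 1 - norm.cdf(z_a - (mu1 - mu0) / (sigma / math.sqrt(n)))
```

Rounding n up means the achieved power slightly exceeds the 0.80 target — the usual convention in study design.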
· · ·
H2
Optimal Tests
Neyman-Pearson Lemma · UMP Tests & LRT
🏆
N-P Lemma

Most Powerful Test for Simple H

For H₀:θ=θ₀ vs H₁:θ=θ₁ (both simple), the Most Powerful (MP) test at level α rejects H₀ when Λ(x) = L(θ₁)/L(θ₀) > k. The N-P Lemma derives the optimal rejection region from the likelihood ratio — no guessing needed. For Gaussian data, this recovers the z-test as optimal.

🎯
UMP & MLR

Composite Alternatives

  • UMP test: Most powerful test for EVERY θ∈H₁ — exists for one-sided hypotheses in exponential families
  • MLR (Monotone Likelihood Ratio): If L(θ₁)/L(θ₀) is monotone in some statistic T(x), then rejecting for large T gives the UMP test for H₀:θ≤θ₀
  • Normal, Poisson, Binomial — all have MLR in their natural parameter
💡
LRT — General Tests

Wilks' Theorem

LRT: Λ = L(θ̂₀)/L(θ̂) ∈ [0,1]. Wilks (1938): −2 ln Λ → χ²(r) under H₀ where r = number of restrictions. This makes LRT applicable to ANY hypothesis. The chi-square test of independence is a special case. Reject H₀ if −2 ln Λ > χ²_α(r).

N-P MP testReject H₀ if L(θ₁;x)/L(θ₀;x) > k   (k: size-α critical value)
LRT statisticΛ = L(θ̂₀)/L(θ̂)   (θ̂₀=restricted MLE; θ̂=unrestricted MLE)
Wilks' theorem−2 ln Λ →_d χ²(r) under H₀   (r = #restrictions)
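Wilks' theorem in action — a minimal sketch testing H₀: λ = 3 for Poisson counts (the data are made up; r = 1 restriction here):

```python
from scipy.stats import chi2, poisson

# Illustrative counts; H0: lambda = 3 vs the unrestricted alternative
data = [2, 4, 3, 5, 1, 3, 4, 2, 6, 3]
lam0 = 3.0
lam_hat = sum(data) / len(data)     # unrestricted MLE of lambda (sample mean)

def loglik(lam):
    return sum(poisson.logpmf(x, lam) for x in data)

# -2 ln Lambda = 2[l(lam_hat) - l(lam0)] ~ chi2(1) under H0 (Wilks, r = 1)
lrt = 2 * (loglik(lam_hat) - loglik(lam0))
p_value = chi2.sf(lrt, df=1)
```

With λ̂ = 3.3 so close to λ₀ = 3, −2 ln Λ is tiny and H₀ comfortably survives — the LRT rejects only when the restricted fit is markedly worse.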
· · ·
H3
Most Misused
p-values — Meaning, Misuse & Parametric Tests
📖
What p IS

Correct Definition

p-value = P(T ≥ t_obs | H₀) = probability of data as extreme or more extreme than observed, assuming H₀ true. Small p → data surprising under H₀ → evidence against H₀. It is a continuous measure of evidence, NOT a binary pass/fail.

⚠️
What p is NOT

5 Common Misconceptions

  • ❌ "P(H₀ is true)" — H₀ has no probability in frequentist stats
  • ❌ "Probability results occurred by chance"
  • ❌ "Probability results will replicate"
  • ❌ Measures effect size — huge n can make trivial effects "significant"
  • ✅ "How surprising is my data if H₀ were true?"
💡
Common Tests

Parametric Quick Reference

  • One-sample z: Z = (x̄−μ₀)/(σ/√n) ~ N(0,1)
  • One-sample t: T = (x̄−μ₀)/(s/√n) ~ t(n−1)
  • Paired t: T = d̄/(sD/√n) ~ t(n−1)
  • χ² GOF: χ² = Σ(O−E)²/E ~ χ²(k−1−p)
  • χ² independence: ~ χ²((r−1)(c−1))
📊
Non-Parametric Tests

When Assumptions Fail

  • Wilcoxon signed-rank: Non-parametric one-sample/paired t
  • Mann-Whitney U: Non-parametric two-sample t (ranks)
  • Kruskal-Wallis: Non-parametric one-way ANOVA
  • Spearman's ρ: Non-parametric correlation
  • ⚠ Less powerful than parametric when assumptions hold — use as backup
p-value (two-sided)p = 2·P(T ≥ |t_obs| | H₀)
Decision ruleReject H₀ iff p < α   (set α before the test!)
Mann-Whitney UU = n₁n₂ + n₁(n₁+1)/2 − R₁   (R₁ = rank sum of group 1)
Kruskal-Wallis HH = [12/N(N+1)] Σ Rᵢ²/nᵢ − 3(N+1) ~ χ²(g−1)
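The rank tests above are one scipy call each — a sketch with made-up, deliberately outlier-heavy reaction times (ms) where a t-test would be dubious:

```python
from scipy import stats

# Hypothetical skewed reaction times for two teaching methods (made-up numbers)
a = [310, 295, 340, 1200, 305, 298]
b = [260, 255, 270, 248, 265, 900]

# Mann-Whitney U: rank-based two-sample test, robust to the outliers above
u_stat, p_mw = stats.mannwhitneyu(a, b, alternative="two-sided")

# Kruskal-Wallis generalises it to 3+ groups (shown here with the same two)
h_stat, p_kw = stats.kruskal(a, b)
```

The 1200 ms and 900 ms outliers barely move the ranks, so both tests still detect the group difference — exactly the robustness the parametric t-test lacks here.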
😄 The p-hacking Warning"If you torture your data long enough, it will confess to anything." — Ronald Coase. Running 20 tests and reporting only the p<0.05 result guarantees a false positive. Pre-register your hypotheses before seeing the data, report ALL analyses, and always report effect sizes alongside p-values. The replication crisis in psychology was largely caused by widespread p-hacking and selective reporting. Register your analysis plan first — commit before you look!
STAT4102 · Sampling Techniques
n
STAT4102 · B.Sc. Statistics Year 4 · BRUR
Sampling Techniques
SRS · Stratified · Systematic · Cluster · PPS · Ratio & Regression Estimation · Non-Sampling Errors
🎓 Why Sampling? "You don't need to eat the whole pot of soup to know if it's salty — one spoonful is enough, IF it's well stirred." That's sampling. 😄 The goal: make valid inferences about a population of N units by examining only n << N units, saving time, cost, and resources while maintaining accuracy.
T1
Foundations
Basic Concepts & Probability Sampling
📖
Key Terms

Sampling Vocabulary

  • Sampling frame: List of all N population units — must be complete and up-to-date
  • Sampling unit: The unit selected at each draw
  • Inclusion probability πᵢ: Probability unit i is selected
  • Design effect (DEFF): Ratio of actual variance to SRS variance
💡
Probability vs Non-Probability

Two Types of Sampling

  • Probability: Every unit has known, non-zero inclusion probability → valid inference possible. SRS, stratified, cluster, systematic.
  • Non-probability: Convenience, purposive, quota — no valid inference to population. Use only for exploratory work.
⚠️
Key Principle

Unbiasedness & Efficiency

  • Unbiased estimator: E(ȳ) = Ȳ on average
  • Efficiency: Smaller variance = more information per unit cost
  • Goal: choose design that minimises variance for given cost
· · ·
T2
Baseline Design
Simple Random Sampling (SRS)
🎲
SRSWOR vs SRSWR

With vs Without Replacement

  • SRSWOR: Each unit selected at most once — more common; smaller variance
  • SRSWR: Units can repeat — simpler theory; larger variance
  • Finite population correction (FPC) = (1−f) = (1−n/N) — matters when n/N > 0.05
⚙️
Estimation

Mean, Total, Proportion

  • ȳ = (1/n)Σyᵢ — unbiased estimator of Ȳ
  • ŷ_total = Nȳ — unbiased estimator of total Y
  • p̂ = x/n — unbiased estimator of proportion P
Var(ȳ) — SRSWORV(ȳ) = (1−f)·S²/n   where f=n/N, S²=Σ(yᵢ−Ȳ)²/(N−1)
Estimated Varv(ȳ) = (1−f)·s²/n   where s²=Σ(yᵢ−ȳ)²/(n−1)
95% CI for Ȳȳ ± 1.96·√v(ȳ)
Sample size nn = N·z²S² / (N·e² + z²S²)   (e=desired margin of error)
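The SRSWOR formulas are a few lines of Python — a sketch with assumed summary statistics (N, n, ȳ, s² below are all made up for illustration):

```python
import math

# Illustrative SRSWOR: N = 500 households, n = 50 sampled incomes (thousand BDT)
N, n = 500, 50
f = n / N                                   # sampling fraction
y_bar, s2 = 42.0, 81.0                      # assumed sample mean and variance s^2

v_ybar = (1 - f) * s2 / n                   # estimated Var(y_bar) with FPC (1 - f)
ci = (y_bar - 1.96 * math.sqrt(v_ybar),
      y_bar + 1.96 * math.sqrt(v_ybar))     # 95% CI for the population mean

# Required n for margin e = 2 with z = 1.96 and planning value S^2 = 81
z, e, S2 = 1.96, 2.0, 81.0
n_req = math.ceil(N * z**2 * S2 / (N * e**2 + z**2 * S2))
```

Note the FPC at work: with f = 0.1 the variance is 10% smaller than the infinite-population formula s²/n would give.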
· · ·
T3
Improved Efficiency
Stratified Random Sampling
🗂️
What it is

Divide & Sample

Divide population into L non-overlapping strata; take SRS within each stratum. Why? Reduces variance by removing between-stratum variation from the error. Always more efficient than SRS if strata are internally homogeneous.

⚙️
Allocation Methods

How Many from Each Stratum?

  • Proportional: nₕ = n·(Nₕ/N) — simple; good when σₕ similar
  • Optimal (Neyman): nₕ ∝ Nₕσₕ — minimises variance for fixed n
  • Cost-optimal: nₕ ∝ Nₕσₕ/√cₕ — accounts for variable cost per stratum
💡
When to Stratify

Good Stratification Criteria

  • Variable highly correlated with study variable Y
  • Administrative convenience (districts, regions, age groups)
  • Need separate estimates for subgroups (domains)
  • Oversampling rare subgroups for adequate representation
Stratified meanȳ_st = Σₕ Wₕȳₕ   (Wₕ = Nₕ/N = stratum weight)
Var(ȳ_st)V(ȳ_st) = Σₕ Wₕ²(1−fₕ)Sₕ²/nₕ
Neyman allocationnₕ = n · (NₕSₕ) / Σₕ(NₕSₕ)
Proportional alloc.nₕ = n · Nₕ/N
🌍 Bangladesh HIES ExampleHousehold Income & Expenditure Survey stratifies by division (8) × urban/rural (2) = 16 strata. Neyman allocation samples more from Dhaka (large, variable) and less from Sylhet (small, homogeneous). Result: 40% lower variance than SRS of same total size — more accurate poverty estimates at lower cost.
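The stratified formulas can be sketched in numpy — the three strata below (sizes, SDs, means) are hypothetical, chosen only to show the mechanics of ȳ_st, V(ȳ_st) and Neyman allocation:

```python
import numpy as np

# Hypothetical 3-stratum population (all numbers are assumptions)
N_h = np.array([1000, 600, 400])            # stratum sizes
S_h = np.array([12.0, 8.0, 20.0])           # within-stratum SDs
ybar_h = np.array([50.0, 55.0, 40.0])       # stratum sample means
n_h = np.array([40, 25, 35])                # achieved stratum sample sizes

W_h = N_h / N_h.sum()                       # stratum weights W_h = N_h/N
ybar_st = float((W_h * ybar_h).sum())       # stratified mean

f_h = n_h / N_h                             # per-stratum sampling fractions
v_st = float((W_h**2 * (1 - f_h) * S_h**2 / n_h).sum())   # Var(ybar_st)

# Neyman allocation of a total n = 100: n_h proportional to N_h * S_h
n_total = 100
neyman = n_total * (N_h * S_h) / (N_h * S_h).sum()
```

Neyman allocation oversamples stratum 3 relative to its share of N because its S_h = 20 is largest — more variability earns more sample.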
· · ·
T4
Practical Designs
Systematic & Cluster Sampling
📋
Systematic Sampling

Every kth Unit

  • k = N/n (sampling interval); select random start r ∈ {1,…,k}; then r, r+k, r+2k, …
  • Very easy to implement — just a list and arithmetic
  • Efficient when list is in random order (≈SRS)
  • ⚠ Periodic pattern in list + periodic k = biased disaster!
🏘️
Cluster Sampling

Sample Groups, Not Individuals

  • Divide population into clusters; randomly select m clusters; survey ALL units in selected clusters
  • Cost-efficient when clusters are geographically compact
  • Less efficient statistically — units within cluster tend to be similar (intraclass correlation ρ)
  • DEFF = 1 + (b̄−1)ρ where b̄ = avg cluster size
💡
Two-Stage Cluster

Select Clusters, Then Sub-Sample

Stage 1: Select m PSUs (primary sampling units) with probability proportional to size. Stage 2: Select n SSUs within each selected PSU. Used in virtually all large national surveys (DHS, MICS, census post-enumeration). More flexible than single-stage cluster sampling.

Systematic interval kk = N/n   (round to integer); sample: r, r+k, r+2k, …
Cluster mean estimatorȳ_cl = (1/m)Σᵢȳᵢ   (ȳᵢ = mean of ith selected cluster)
Design effectDEFF = V(ȳ_cluster) / V(ȳ_SRS) = 1 + (b̄−1)ρ
Intraclass corr. ρρ = (MSB−MSW) / (MSB + (b̄−1)MSW)   (between/within cluster)
· · ·
T5
Auxiliary Information
Ratio & Regression Estimation
📈
Ratio Estimator

Using a Correlated Auxiliary Variable

If auxiliary variable X (with known population mean X̄) is highly correlated with Y: ȳ_R = R̂·X̄ where R̂ = ȳ/x̄. Biased but often much lower MSE than ȳ. Best when the ratio Y/X is more nearly constant than Y itself — e.g., estimating crop yield per hectare.

⚙️
Regression Estimator

OLS-Based Improvement

ȳ_reg = ȳ + b̂(X̄−x̄) where b̂ = Σ(xᵢ−x̄)(yᵢ−ȳ)/Σ(xᵢ−x̄)². Always has smaller or equal variance than ȳ. More general than ratio estimator — doesn't require proportionality. Gain in efficiency ∝ ρ²(X,Y).

💡
When to Use Each

Ratio vs Regression vs SRS

  • Use ratio when Y∝X (passes through origin) and ρ>0.5
  • Use regression for general linear relationship
  • Both reduce variance when |ρ(X,Y)| is large
  • SRS if no good auxiliary variable available
Ratio estimatorȳ_R = (ȳ/x̄)·X̄ = R̂·X̄
Approx. Var(ȳ_R)V(ȳ_R) ≈ (1−f)/n · (Sᵧ² + R²Sₓ² − 2RSₓᵧ)
Regression estimatorȳ_reg = ȳ + b̂(X̄ − x̄)
Var(ȳ_reg)V(ȳ_reg) ≈ (1−f)Sᵧ²(1−ρ²)/n   (always ≤ V(ȳ))
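Both estimators are a few lines of numpy — a sketch with a hypothetical farm sample (x = area in ha, y = output in tonnes; the known X̄ = 2.2 is an assumption):

```python
import numpy as np

# Hypothetical farm sample; X_bar is the (assumed known) population mean of x
x = np.array([1.0, 2.0, 1.5, 3.0, 2.5])
y = np.array([2.1, 4.3, 3.0, 6.2, 5.1])
X_bar = 2.2

R_hat = y.mean() / x.mean()                  # estimated ratio R̂ = ȳ/x̄
ybar_ratio = R_hat * X_bar                   # ratio estimator of the mean

b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
ybar_reg = y.mean() + b_hat * (X_bar - x.mean())   # regression estimator
```

Because the sample happened to over-represent large farms (x̄ = 2.0 < X̄ = 2.2... here actually under-represent), both estimators shift ȳ toward what a representative x would imply.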
· · ·
T6
Advanced & Errors
PPS Sampling & Non-Sampling Errors
⚖️
PPS Sampling

Probability Proportional to Size

Select PSUs with probability proportional to a size measure (number of households, land area). Larger clusters have higher selection probability. Combined with equal-probability sub-sampling within PSUs → self-weighting sample. Used in almost all national surveys.

⚠️
Non-Sampling Errors

Often Bigger Than Sampling Error!

  • Coverage error: Frame misses units (undercoverage of homeless, migrants)
  • Non-response: Selected units don't participate — can cause serious bias
  • Measurement error: Wrong answers due to question wording, recall, interviewer bias
  • Processing error: Data entry, coding mistakes
  • 😄 "A perfectly designed sample with 40% non-response is worse than a simple convenience sample for many questions."
💡
Hansen-Hurwitz Estimator

PPS with Replacement

Each draw selects unit i with probability pᵢ = Mᵢ/M₀; over n draws the inclusion probability is πᵢ = n·Mᵢ/M₀. Estimator of the total: Ŷ_HH = (1/n)Σ(yᵢ/pᵢ). Unbiased. Variance ∝ variation of the ratios yᵢ/pᵢ — good PPS makes these nearly constant, reducing variance dramatically compared to SRS for skewed populations (like business surveys).

PPS prob. of selectionπᵢ = n·Mᵢ / M₀   (Mᵢ=size of unit i, M₀=total size)
HH estimatorŶ_HH = (1/n)·Σᵢ(yᵢ/pᵢ)   (pᵢ = Mᵢ/M₀; unbiased for the total Y)
Horvitz-Thompsonŷ_HT = Σᵢ∈s (yᵢ/πᵢ)   (unbiased for any design)
STAT4106 · Categorical Data Analysis
χ²
STAT4106 · B.Sc. Statistics Year 4 · BRUR
Categorical Data Analysis
Contingency Tables · χ² Tests · Odds Ratios · Logistic Regression · Log-Linear Models · Ordinal Data · Matched Pairs
🎓 Why Categorical Data Analysis? Most real-world outcomes are categorical — disease/no disease, vote/don't vote, pass/fail. You cannot use t-tests or ANOVA on counts. CDA provides the correct tools: chi-square tests for independence, odds ratios for effect size, logistic regression for prediction, and log-linear models for multi-way tables. As Agresti notes: "Categorical data analysis is arguably more important in practice than normal-theory methods."
C1
Foundations
Distributions for Categorical Data
📖
Key Distributions

Binomial, Multinomial & Poisson

  • Binomial(n,π): n independent trials, count successes. Foundation for proportions.
  • Multinomial(n; π₁,…,πk): n trials, k categories. Joint distribution of cell counts.
  • Poisson(μ): Independent cell counts — used in log-linear models
⚙️
Sampling Schemes

Poisson, Multinomial, Product-Multinomial

  • Poisson: Both margins random — all counts independent Poisson
  • Multinomial: Grand total n fixed; cell counts ~ Multinomial
  • Product-multinomial: Row totals fixed (prospective study); each row ~ Multinomial
  • χ² test gives same result for all three — convenient!
Binomial PMFP(Y=k) = C(n,k)·πᵏ·(1−π)^(n−k)
Multinomial PMFP(n₁,…,nk) = n!/(n₁!…nk!) · π₁^n₁·…·πk^nk
MLE of ππ̂ = y/n   (sample proportion — unbiased, consistent)
· · ·
C2
Core Tool
Contingency Tables & χ² Tests
📊
r×c Contingency Table

Cross-Tabulation

An r×c table cross-classifies n observations by two categorical variables (r rows, c columns). Cell count nᵢⱼ = observations in row i, column j. Marginal totals: nᵢ₊ (row), n₊ⱼ (column). Test: are the two variables independent?

⚙️
Pearson χ² Test

Testing Independence

  • H₀: rows and columns are independent (πᵢⱼ = πᵢ₊·π₊ⱼ)
  • Expected count: Eᵢⱼ = nᵢ₊·n₊ⱼ/n (under H₀)
  • χ² = Σ(nᵢⱼ−Eᵢⱼ)²/Eᵢⱼ ~ χ²((r−1)(c−1)) under H₀
  • ⚠ Requires Eᵢⱼ ≥ 5 in all cells — use Fisher's exact if violated
💡
Likelihood Ratio G²

Alternative to χ²

G² = 2Σnᵢⱼ·ln(nᵢⱼ/Eᵢⱼ) ~ χ²((r−1)(c−1)). Also called the deviance. Preferred in log-linear model context — additive across hierarchical models. χ² and G² converge for large n; differ for small n.

⚠️
Fisher's Exact Test

Small Samples

For 2×2 tables with small expected counts: compute exact probability of observing table this extreme, conditioning on both margins fixed. p = C(n₁₊,n₁₁)·C(n₂₊,n₂₁)/C(n,n₊₁). No large-sample approximation needed.

Expected cell countEᵢⱼ = nᵢ₊·n₊ⱼ / n   (under independence)
Pearson χ²X² = Σᵢⱼ(nᵢⱼ−Eᵢⱼ)²/Eᵢⱼ ~ χ²((r−1)(c−1))
Likelihood ratio G²G² = 2Σᵢⱼ nᵢⱼ·ln(nᵢⱼ/Eᵢⱼ) ~ χ²((r−1)(c−1))
dfdf = (r−1)(c−1)   (for r×c independence test)
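scipy wraps the whole independence test in one call — a sketch on a hypothetical 2×3 table (counts are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table: district (rows) x education level (columns)
table = np.array([[30, 45, 25],
                  [20, 35, 45]])

# correction=False gives the plain Pearson X^2 (no Yates continuity correction)
chi2_stat, p, df, expected = chi2_contingency(table, correction=False)

# Sanity checks against the formulas: df = (r-1)(c-1), E_ij = n_i+ * n_+j / n
n = table.sum()
E_00 = table[0].sum() * table[:, 0].sum() / n
```

All expected counts here are ≥ 25, so the χ² approximation is safe; with small Eᵢⱼ the same data would call for Fisher's exact test instead.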
· · ·
C3
Effect Size
Measures of Association — OR, RR & φ
📐
Odds Ratio (OR)

Most Important Association Measure

OR = (n₁₁·n₂₂)/(n₁₂·n₂₁) = (odds of outcome in group 1)/(odds in group 2). OR=1 means no association. OR>1 means higher odds in group 1. OR is the natural parameter for logistic regression and case-control studies. Does not depend on marginal totals — unlike RR.

⚙️
Relative Risk (RR)

Risk Ratio for Prospective Studies

  • RR = (n₁₁/n₁₊) / (n₂₁/n₂₊) = risk in exposed / risk in unexposed
  • More intuitive than OR when outcomes are common
  • Only valid when row totals are fixed (prospective/cohort design)
  • For rare outcomes: OR ≈ RR
💡
φ and Cramér's V

Symmetric Association Measures

  • φ = (n₁₁n₂₂−n₁₂n₂₁)/√(n₁₊n₂₊n₊₁n₊₂) — for 2×2 tables; ∈ [−1,1]; |φ| = √(χ²/n)
  • Cramér's V = √(χ²/(n·min(r−1,c−1))) — for r×c; ∈ [0,1]
  • V=0: no association; V=1: perfect association
Odds RatioOR = (n₁₁·n₂₂) / (n₁₂·n₂₁)   (2×2 table)
ln(OR) SESE[ln(OR)] = √(1/n₁₁ + 1/n₁₂ + 1/n₂₁ + 1/n₂₂)
95% CI for ORexp[ln(OR) ± 1.96·SE(ln OR)]
Relative RiskRR = (n₁₁/n₁₊) / (n₂₁/n₂₊)
Cramér's VV = √[χ²/(n·min(r−1,c−1))]
🌍 Bangladesh TB Study2×2 table: smokers vs non-smokers, TB vs no TB. OR=3.2 (95% CI: 1.8–5.7, p<0.001). Interpretation: smokers have 3.2 times the odds of TB compared to non-smokers. Since TB is rare (<5%), OR ≈ RR: smokers have approximately 3× the risk. This is statistically significant AND clinically meaningful — OR=3.2 is a strong association.
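The OR, its log-scale CI, and the RR comparison are easy to compute by hand — a sketch on a hypothetical 2×2 cohort table (counts are made up, not the TB study above):

```python
import math

# Hypothetical 2x2 table: exposed/unexposed (rows) x cases/non-cases (columns)
n11, n12 = 40, 160       # exposed:   cases, non-cases
n21, n22 = 20, 280       # unexposed: cases, non-cases

OR = (n11 * n22) / (n12 * n21)                       # cross-product ratio
se_log = math.sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)    # SE of ln(OR)
ci = (math.exp(math.log(OR) - 1.96 * se_log),        # 95% CI on the OR scale
      math.exp(math.log(OR) + 1.96 * se_log))

# Relative risk for comparison (valid here since rows form a cohort design)
RR = (n11 / (n11 + n12)) / (n21 / (n21 + n22))
```

With a 20% outcome rate in the exposed row, the outcome is not rare, so OR = 3.5 noticeably exceeds RR = 3.0 — a reminder that OR ≈ RR only for rare outcomes.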
· · ·
C4
Binary Outcomes
Logistic Regression
🔢
The Model

Logit Link Function

For binary Y∈{0,1}: log[π/(1−π)] = β₀ + β₁X₁ + … + βₖXₖ where π = P(Y=1|X). The logit link ensures predicted probabilities ∈ (0,1). Estimated by Maximum Likelihood Estimation (MLE), not OLS. Iteratively Reweighted Least Squares (IRLS) algorithm.

⚙️
Interpretation

Coefficients as Log-Odds

  • βⱼ = change in log-odds of Y=1 per unit increase in Xⱼ (others fixed)
  • exp(βⱼ) = odds ratio for 1-unit increase in Xⱼ — most interpretable
  • exp(βⱼ) > 1: higher Xⱼ → higher odds; < 1: lower odds
  • 95% CI for OR: exp(βⱼ ± 1.96·SE(βⱼ))
💡
Model Fit

Assessing Goodness of Fit

  • Deviance: −2·log-likelihood; lower = better; compare nested models
  • Hosmer-Lemeshow test: Goodness of fit for grouped data
  • Pseudo R²: McFadden's, Nagelkerke's — analogue of R² (not identical!)
  • AUC-ROC: Discrimination ability — AUC>0.7 good; >0.8 excellent
⚠️
Common Extensions

Multinomial & Ordinal Logistic

  • Multinomial logistic: Nominal Y with >2 categories — g−1 logit equations vs reference
  • Ordinal logistic (PO model): Ordered Y — Proportional Odds: log[P(Y≤j)/P(Y>j)] = αⱼ−β'X
  • PO assumption: same βs for all cut-points — test with parallel lines test
Logit modellogit(π) = ln[π/(1−π)] = β₀ + β₁X₁ + … + βₖXₖ
Predicted probabilityπ̂ = exp(β̂'x) / [1 + exp(β̂'x)] = 1/[1+exp(−β̂'x)]
Odds RatioOR_j = exp(β̂ⱼ) — per unit increase in Xⱼ holding others fixed
Wald testz = β̂ⱼ/SE(β̂ⱼ) ~ N(0,1)   (test H₀: βⱼ=0)
LR test (nested)G² = −2[ℓ(reduced) − ℓ(full)] ~ χ²(df_diff)
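The IRLS/Newton-Raphson fitting mentioned above can be sketched in a dozen lines of numpy — a toy dataset (study hours vs pass/fail, made up) rather than any library's implementation:

```python
import numpy as np

# Tiny made-up dataset: x = study hours, y = pass (1) / fail (0)
x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([0., 0., 0., 1., 0., 1., 1., 1.])
X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept

beta = np.zeros(2)
for _ in range(25):                             # Newton-Raphson (= IRLS) steps
    pi = 1 / (1 + np.exp(-X @ beta))            # predicted probabilities in (0,1)
    W = pi * (1 - pi)                           # IRLS weights
    grad = X.T @ (y - pi)                       # score vector (gradient of log-lik)
    H = X.T @ (X * W[:, None])                  # Fisher information matrix
    beta = beta + np.linalg.solve(H, grad)      # Newton update

odds_ratio = np.exp(beta[1])                    # OR per extra study hour
```

exp(β̂₁) > 1 confirms the expected direction: each extra study hour multiplies the odds of passing.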
· · ·
C5
Multi-Way Tables
Log-Linear Models
📦
What it is

Modelling Cell Counts

Log-linear models treat cell counts as Poisson: ln(μᵢⱼ) = λ + λᵢᴬ + λⱼᴮ + λᵢⱼᴬᴮ. All variables are response variables — no distinction between X and Y. Especially useful for 3+ way tables to model partial and conditional independence structures.

⚙️
Model Hierarchy

Saturated vs Parsimonious

  • Saturated: All interactions included; perfect fit; df=0 — useless for testing
  • [AB,AC,BC]: All 2-way interactions; no 3-way
  • [AB,C]: A and B interact; C independent of both
  • [A,B,C]: Complete independence of A, B, C
  • Select model by G² (deviance) and AIC
💡
Link to Logistic Regression

Equivalence Result

For a 2×J table (binary Y), the log-linear model [XY, X] is exactly equivalent to logistic regression of Y on X. The association parameter in the log-linear model = the logistic regression coefficient. This provides a unified framework for all categorical models.

Saturated 2-wayln(μᵢⱼ) = λ + λᵢᴬ + λⱼᴮ + λᵢⱼᴬᴮ
Independence modelln(μᵢⱼ) = λ + λᵢᴬ + λⱼᴮ   (no interaction term)
MLE fitted countsμ̂ᵢⱼ = nᵢ₊·n₊ⱼ/n   (independence model = Eᵢⱼ)
Model selectionAIC = G² − 2·df   (choose model with lowest AIC)
· · ·
C6
Special Topics
Ordinal Data & Matched Pairs
📊
Ordinal Association

Concordant & Discordant Pairs

  • Concordant pair: Both variables rank same direction
  • Discordant pair: Variables rank opposite directions
  • Gamma γ: (C−D)/(C+D) ∈ [−1,1] — ignores ties
  • Kendall's τb: Accounts for ties — preferred
  • Spearman's ρ: Correlation of ranks
⚙️
Matched Pairs (McNemar)

Paired Binary Data

  • n subjects measured twice (before/after) or matched pairs
  • Only discordant pairs (b and c) carry information about change
  • McNemar's test: χ² = (b−c)²/(b+c) ~ χ²(1)
  • Odds ratio for matched pairs: OR = b/c
💡
Cochran-Mantel-Haenszel

Controlling for Confounding

CMH test: test association between X and Y controlling for a third variable Z (stratification). Combines evidence across K strata. Common OR estimate: OR_MH = Σₖ(aₖdₖ/nₖ) / Σₖ(bₖcₖ/nₖ). Essential for removing confounding in observational studies.

McNemar χ²χ² = (b−c)²/(b+c) ~ χ²(1)   (b,c = discordant cell counts)
Matched OROR = b/c   (ratio of discordant pairs)
Gammaγ = (C−D)/(C+D)   (C=concordant, D=discordant pairs)
CMH statisticχ²_MH = [Σₖ(aₖ−μₖ)]² / Σₖσₖ² ~ χ²(1)
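McNemar's test is short enough to compute directly — a sketch with made-up before/after discordant counts:

```python
from scipy.stats import chi2

# Hypothetical before/after binary outcome for matched subjects:
# only the discordant pairs (b and c) carry information about change
b, c = 15, 5

chi2_stat = (b - c)**2 / (b + c)            # McNemar chi-square, df = 1
p_value = chi2.sf(chi2_stat, df=1)
matched_or = b / c                          # matched-pairs odds ratio
```

The concordant pairs never enter the statistic — subjects who didn't change tell us nothing about the direction of change.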
STAT4104 · Research Methodology
R
STAT4104 · B.Sc. Statistics Year 4 · BRUR
Research Methodology
Research Design · Literature Review · Measurement · Questionnaire Design · Validity & Reliability · Data Collection · Report Writing · Ethics
🎓 What is Research Methodology? Research methodology is the systematic framework for conducting scientific inquiry — it answers "HOW do we find out what we want to know?" It covers study design, measurement, data collection, analysis strategy, and reporting. As Saunders et al. describe it: "Research methodology is the theory of how research should be undertaken." 😄 "Good methodology won't save bad ideas, but bad methodology will ruin good ones."
R1
Foundations
Nature & Types of Research
📖
What is Research?

Systematic Inquiry

Research is a systematic, controlled, empirical investigation of natural phenomena guided by theory and hypotheses about the relationship between variables. It is not just "searching the web" — it requires rigour, replicability, and transparency.

⚙️
Types by Purpose

Basic vs Applied vs Action

  • Basic/Pure: Advances knowledge without immediate application — testing theory
  • Applied: Solves specific practical problems — policy evaluation, product testing
  • Action research: Researcher is also a participant; improves practice while studying it
💡
Types by Approach

Quantitative vs Qualitative vs Mixed

  • Quantitative: Numbers, tests, generalisation — large n, structured data
  • Qualitative: Meaning, context, depth — interviews, observation, small n
  • Mixed methods: Combines both — sequential, concurrent, or embedded designs
Types by Time

Cross-Sectional vs Longitudinal

  • Cross-sectional: One point in time — snapshot; cheap but no causation
  • Longitudinal: Same subjects over time — tracks change; expensive but causal insight
  • Retrospective: Past data — case-control; recall bias risk
  • Prospective: Follow forward — cohort; gold standard for temporal causation
· · ·
R2
Study Design
Research Design & Paradigms
🔭
Research Paradigms

Positivism, Interpretivism & Pragmatism

  • Positivism: Objective reality exists; can be measured; deductive; quantitative
  • Interpretivism: Reality is socially constructed; context matters; inductive; qualitative
  • Pragmatism: "Whatever works" — mixed methods; research question drives method choice
  • Most statistics students work within a positivist paradigm
⚙️
Experimental Design

RCT — Gold Standard

  • Randomised Controlled Trial (RCT): Random assignment to treatment/control → allows causal inference
  • Quasi-experiment: No randomisation but comparison group exists (DID, RDD)
  • Observational: No manipulation — correlation only (unless IV, matching used)
💡
Causal Inference

Why RCTs Rule

RCT removes selection bias — treatment and control groups are identical on average (observed AND unobserved). Average Treatment Effect (ATE) = E[Y(1)−Y(0)]. Without randomisation, Y(1) and Y(0) differ systematically — we observe only one potential outcome per person (fundamental problem of causal inference).

🌍 Bangladesh Microfinance RCTBandhan microfinance RCT (Banerjee et al.): randomly assigned microcredit to some villages; compared income/consumption 2 years later. ATE estimate = positive but modest income effect. RCT design means we can confidently attribute this to the credit program — not to pre-existing differences between borrowers and non-borrowers. Landmark example of rigorous impact evaluation.
· · ·
R3
Before Data Collection
Literature Review & Hypothesis Formulation
📚
Literature Review

Why Review the Literature?

  • Identifies what is already known — avoid duplicating work
  • Locates gaps your research fills
  • Provides theoretical framework and conceptual models
  • Guides appropriate methodology and instruments
  • Databases: PubMed, Web of Science, Scopus, Google Scholar, JSTOR
🎯
Hypothesis Formulation

Good Hypotheses

  • Stated as relationship between two or more variables
  • Testable with available data and methods
  • Grounded in theory and prior literature
  • Null H₀: No effect/relationship — what we statistically test
  • Directional (one-sided): Stronger theory → directional; exploratory → two-sided
💡
PICO Framework

Structuring Research Questions

Especially in health research: Population — Intervention/Exposure — Comparison — Outcome. Example: Among Bangladeshi children under 5 (P), does exclusive breastfeeding for 6 months (I) compared to mixed feeding (C) reduce stunting rates (O)? Clear PICO prevents vague, unanswerable questions.

· · ·
R4
Measurement
Measurement, Scales & Questionnaire Design
📏
Scales of Measurement

Nominal · Ordinal · Interval · Ratio

  • Nominal: Gender, religion, blood type — categories only; mode appropriate
  • Ordinal: Education level, satisfaction rating — ranked; median appropriate
  • Interval: Temperature, IQ — equal intervals, no true zero; mean appropriate
  • Ratio: Income, weight, height — true zero; all measures appropriate
📝
Questionnaire Design

Golden Rules

  • Each question measures ONE thing only (no double-barrelled questions)
  • Use simple, clear language appropriate for target population
  • Avoid leading questions ("Don't you agree that…?")
  • Order: easy/non-sensitive first; sensitive/demographics last
  • Pilot test with 10–20 people before full deployment
💡
Response Scales

Likert, Semantic Differential & VAS

  • Likert scale: 1–5 or 1–7 agreement scale; treat as ordinal (or approximately interval for ≥5 points)
  • Semantic differential: Bipolar adjectives (good–bad, fast–slow) on 7-point scale
  • VAS (Visual Analogue Scale): 0–100mm line; continuous; good for pain, intensity
· · ·
R5
Quality Assurance
Validity, Reliability & Data Quality
🎯
Validity

Are We Measuring What We Intend?

  • Content validity: Items cover the full domain
  • Construct validity: Measures the theoretical construct (convergent + discriminant)
  • Criterion validity: Correlates with gold standard (concurrent + predictive)
  • Internal validity: Study design allows causal inference (no confounding)
  • External validity: Results generalise to other populations/settings
🔁
Reliability

Consistency of Measurement

  • Test-retest reliability: Same result on repeated measurement (Pearson r)
  • Inter-rater reliability: Different raters agree (Cohen's κ)
  • Internal consistency: Items in scale hang together (Cronbach's α ≥ 0.7)
💡
Validity vs Reliability

The Dartboard Analogy

Reliable but not valid: all darts in tight cluster but hitting the wrong target. Valid but not reliable: darts scattered but centred on the right target. Reliable AND valid: tight cluster on the correct target. Reliability is necessary but not sufficient for validity.

Cronbach's α: α = (k/(k−1)) · [1 − Σσᵢ²/σ²_total]   (k = number of items; α ≥ 0.7 acceptable)
Cohen's κ: κ = (P_o − P_e)/(1 − P_e)   (P_o = observed agreement; P_e = agreement expected by chance)
κ interpretation: 0.00–0.20: slight; 0.21–0.40: fair; 0.41–0.60: moderate; 0.61–0.80: substantial; >0.80: almost perfect
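The two formulas above can be computed from first principles. A minimal sketch (the toy respondent and rater data are hypothetical, chosen only to exercise the formulas):

```python
import statistics as st

def cronbach_alpha(items):
    """alpha = (k/(k-1)) * (1 - sum of item variances / variance of total score).
    `items`: one list per respondent, each holding k item scores."""
    k = len(items[0])
    item_vars = [st.variance(col) for col in zip(*items)]   # sample variance per item
    total_var = st.variance([sum(row) for row in items])    # variance of summed score
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def cohens_kappa(r1, r2):
    """kappa = (P_o - P_e) / (1 - P_e) for two raters' labels on the same subjects."""
    n = len(r1)
    p_o = sum(x == y for x, y in zip(r1, r2)) / n                   # observed agreement
    cats = set(r1) | set(r2)
    p_e = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def kappa_label(k):
    """Interpretation bands quoted in the notes (Landis & Koch)."""
    for cut, label in [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"), (0.8, "substantial")]:
        if k <= cut:
            return label
    return "almost perfect"

# Toy data: 5 respondents x 3 Likert items
scale = [[4, 5, 4], [2, 2, 3], [5, 4, 5], [3, 3, 3], [1, 2, 1]]
print(round(cronbach_alpha(scale), 3))                 # → 0.945 (internally consistent)

# Two raters classifying the same 10 subjects
a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
b = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
kap = cohens_kappa(a, b)
print(round(kap, 3), kappa_label(kap))                 # → 0.583 moderate
```

Note that κ is computed against P_e, which is why two raters who agree 80% of the time here earn only a "moderate" κ of 0.583: over half of that agreement was expected by chance alone.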
· · ·
R6
Field Work
Data Collection Methods & Ethics
📋
Collection Methods

Survey, Interview, Observation, Secondary

  • Self-administered survey: Cheap, large scale, no interviewer bias — but low response rate
  • Face-to-face interview: High response, complex questions possible — expensive, interviewer bias
  • Telephone/CATI (computer-assisted telephone interviewing): Moderate cost, fast — coverage bias (misses households without phone access)
  • CAPI: Computer-assisted personal interview — error reduction, skip patterns automated
  • Secondary data: HIES, DHS, census, administrative records
🛡️
Research Ethics

Core Principles

  • Informed consent: Voluntary participation with full information
  • Confidentiality: Individual data not disclosed; anonymise outputs
  • No harm: Physical, psychological, social harm must be minimised
  • Honesty: No fabrication, falsification, or plagiarism of data/results
  • IRB/Ethics Board approval: Required for human subjects research
📊
Report Writing

Structure of a Research Report

  • Abstract: Background, objective, methods, results, conclusions (≤250 words)
  • Introduction: Problem, rationale, objectives, hypotheses
  • Methods: Study design, population, sample, instruments, analysis plan
  • Results: Tables, figures, statistical findings — no interpretation
  • Discussion: Interpret, compare with literature, limitations, implications
  • Conclusion: Answer the research question; recommendations
😄 Ethics Reminder"In research ethics, the three golden rules are: (1) Do not harm participants, (2) Do not lie to participants, (3) Do not lie about participants in your results. The fourth, unofficial rule: (4) Do not add anyone to the author list without a genuine contribution — honorary authorship is a form of research misconduct." Always get IRB clearance before data collection, not after — retroactive approval doesn't exist!

📚 Reference Books

[1]
Probability and Statistical Inference
Hogg, R.V., Tanis, E.A., & Zimmerman, D.L. — John Wiley & Sons · 9th Ed.
[2]
Introduction to Mathematical Statistics
Hogg, R.V., McKean, J.W., & Craig, A.T. — Pearson · 7th Ed.
[3]
Introduction to the Theory of Statistics
Mood, A.M., Graybill, F.A., & Boes, D.C. — McGraw-Hill · 3rd Ed.
[4]
A First Course in Statistics
McClave, J.T. & Sincich, T. — Pearson / Prentice Hall · 13th Ed.
[5]
Applied Linear Statistical Models
Kutner, M.H., Nachtsheim, C.J., Neter, J., & Li, W. — McGraw-Hill · 5th Ed.
[6]
Introduction to Linear Regression Analysis
Montgomery, D.C., Peck, E.A., & Vining, G.G. — John Wiley & Sons · 5th Ed.
[7]
Basic Econometrics
Gujarati, D.N. & Porter, D.C. — McGraw-Hill · 5th Ed.
[8]
Introductory Econometrics: A Modern Approach
Wooldridge, J.M. — Cengage Learning · 7th Ed.
[9]
Applied Multivariate Statistical Analysis
Johnson, R.A. & Wichern, D.W. — Pearson Prentice Hall · 6th Ed.
[10]
Statistical Inference
Casella, G. & Berger, R.L. — Duxbury Press / Cengage · 2nd Ed.
[11]
Testing Statistical Hypotheses
Lehmann, E.L. & Romano, J.P. — Springer · 3rd Ed.
[12]
Sampling Techniques
Cochran, W.G. — John Wiley & Sons · 3rd Ed.
[13]
An Introduction to Categorical Data Analysis
Agresti, A. — John Wiley & Sons · 3rd Ed.
[14]
Categorical Data Analysis
Agresti, A. — John Wiley & Sons · 3rd Ed.
[15]
Design and Analysis of Experiments
Montgomery, D.C. — John Wiley & Sons · 10th Ed.
[16]
The Design of Experiments
Fisher, R.A. — Oliver & Boyd · 9th Ed.
[17]
Research Methods for Business Students
Saunders, M., Lewis, P. & Thornhill, A. — Pearson · 8th Ed.
[18]
Survey Sampling
Kish, L. — John Wiley & Sons · Classic Ed.