Ch.8: Statistics for Machine Learning
Sen-ching Samson Cheung
Representing Data
Categorical or Nominal
Classes with no ordering
e.g. dessert = {ice-cream, pudding, cake, fruit}
Represented by 1-of-m encoding
Ice-cream = (0,0,0,1), pudding = (0,0,1,0), …
Ordinal
Classes with ordering
e.g. cold=-1, cool=0, warm=+1, hot=+2
Numerical
Real numbers
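The 1-of-m encoding above can be sketched in a few lines. This is a minimal illustration; the function name and the index ordering (i-th class maps to the i-th position, which is the reverse of the ordering shown on the slide) are choices made here, not from the slides.

```python
# Minimal sketch of 1-of-m (one-hot) encoding for a categorical variable.
# The class list and ordering convention are illustrative assumptions.
def one_hot(category, classes):
    """Return the 1-of-m encoding of `category` as a tuple of 0/1 indicators."""
    return tuple(1 if c == category else 0 for c in classes)

desserts = ["ice-cream", "pudding", "cake", "fruit"]
print(one_hot("pudding", desserts))   # exactly one position is 1
```

Exactly one component is 1, so no spurious ordering is imposed on the classes.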
Continuous Distributions
Probability Density Function p(x)
Expectation
Inference on continuous distributions relies on integration (versus summation).
Other important concepts
K-th moment: ⟨x^K⟩ = ∫ x^K p(x) dx
Cumulative Distribution Function: F(x) = P(X ≤ x) = ∫_{−∞}^x p(t) dt
Moment Generating Function: M(t) = ⟨e^{tx}⟩
Mode: argmax_x p(x)
Covariance Matrix: Σ = ⟨(x − ⟨x⟩)(x − ⟨x⟩)^T⟩
Correlation Matrix: R_ij = Σ_ij / √(Σ_ii Σ_jj)
Change of variables: given p(x) and a bijective transformation y = f(x),
p_y(y) = p_x(x) |dx/dy| where x = f^{−1}(y)
Skewness and Kurtosis
Skewness: γ_1 = ⟨(x − μ)³⟩ / σ³
Kurtosis: γ_2 = ⟨(x − μ)⁴⟩ / σ⁴
[Figure: example densities with positive (>0), zero (=0), and negative (<0) skewness]
Empirical Distribution
From data points {x_1, …, x_N} to a distribution
Discrete: p(x) = (1/N) Σ_n 𝕀{x_n = x};  Continuous: p(x) = (1/N) Σ_n δ(x − x_n)
Sample mean and covariance: μ̂ = (1/N) Σ_n x_n,  Σ̂ = (1/N) Σ_n (x_n − μ̂)(x_n − μ̂)^T
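The sample statistics above can be computed directly. This is a 1-D sketch with made-up data; it shows the sample mean and the unbiased sample variance (divide by N − 1), cross-checked against the standard library.

```python
import statistics

# Sketch: sample mean and (unbiased) sample variance of 1-D data.
# The data values are illustrative.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n
# unbiased sample variance divides by n - 1 (see the unbiased-estimator slide later)
var = sum((x - mean) ** 2 for x in data) / (n - 1)
print(mean, var)
```

In higher dimensions the same pattern gives the sample covariance matrix Σ̂ with outer products in place of squared deviations.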
Entropy and Information
(Average) Entropy
H(X) = −Σ_x p(x) log p(x)
It measures the average number of bits (uncertainty) needed to represent a symbol drawn from the distribution p(x).
Conditional Entropy
H(X|Y) = −Σ_{x,y} p(x, y) log p(x|y)
Mutual Information
MI(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
In particular, MI(X, X) = H(X)
Kullback-Leibler Divergence
“Difference” between two distributions
KL(q‖p) = Σ_x q(x) log (q(x)/p(x))
It is always nonnegative because it represents the extra bandwidth needed to compress a source with the wrong distribution:
KL(q‖p) = −Σ_x q(x) log p(x) − H(q)
It is zero if and only if p(x) = q(x).
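The entropy and KL formulas above can be checked numerically. A minimal sketch with example distributions (the fair and biased coins below are made-up inputs), using base-2 logs so the units are bits:

```python
import math

def entropy(p):
    """H(p) = -sum p log2 p, in bits; terms with p = 0 contribute 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl(q, p):
    """KL(q || p) = sum q log2(q/p); assumes p > 0 wherever q > 0."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.5]   # fair coin
p = [0.9, 0.1]   # biased coin
print(entropy(q))      # 1 bit for a fair coin
print(kl(q, q))        # 0 when the distributions match
print(kl(q, p) >= 0)   # KL is never negative
```

The cross-entropy identity on this slide also holds: kl(q, p) equals −Σ q log p minus entropy(q).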
Classical Discrete Distributions
Bernoulli Distribution
Discrete Binary Variable x
Dom(x) = {0,1}
Parameters: p(x=1) = θ
Properties
1. p(x=0) = 1 − θ
2. ⟨x⟩ = θ
3. Var(x) = θ(1 − θ)
Categorical Distribution
Discrete Variable x
Dom(x) = {1,…,C}
Parameters: p(x=c) = θ_c for c = 1, …, C, with Σ_c θ_c = 1
Classical Discrete Distributions
Binomial Distribution
Discrete variable y = Σ_{i=1}^n x_i, where the x_i's are independent Bernoulli(θ) variables
dom(y) = {0, 1, 2, …, n}
Parameters: n, θ
Properties
p(y=k) = (n choose k) θ^k (1 − θ)^{n−k}
⟨y⟩ = nθ
Var(y) = nθ(1 − θ)
[Figure: binomial pmf]
Classical Discrete Distributions
Multinomial Distribution
Discrete vector y = (y_1, …, y_K)
y_i = Σ_{j=1}^n 𝕀{x_j = i}, where {x_1, …, x_n} are i.i.d. K-valued categorical data with parameters {θ_1, …, θ_K}
Properties
p(y) = (n! / (y_1! ⋯ y_K!)) θ_1^{y_1} ⋯ θ_K^{y_K},  ⟨y_i⟩ = nθ_i,  Var(y_i) = nθ_i(1 − θ_i)
Classical Discrete Distributions
Poisson
Models discrete event counts where the average scales with the length of the observation interval
Discrete variable x with dom(x) = {0, 1, 2, …}
Parameters: λ
p(x=k | λ) = (1/k!) e^{−λ} λ^k
⟨x⟩ = λ and Var(x) = λ
[Figure: Poisson pmf]
Classical Continuous Distribution
Uniform
Continuous variable x with dom(x) = [a,b]
p(x) = 1/(b − a),  ⟨x⟩ = (a + b)/2,
Var(x) = (b − a)²/12
Exponential
Continuous variable x with dom(x) = [0, ∞)
p(x) = λ e^{−λx},  ⟨x⟩ = 1/λ,
Var(x) = 1/λ²
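Exponential samples can be drawn from uniform ones by inverting the CDF F(x) = 1 − e^{−λx}, giving x = −log(1 − u)/λ. A Monte Carlo sketch; the rate, seed, and sample count are arbitrary choices made here:

```python
import random
from math import log

# Sketch: inverse-CDF sampling of Exp(lambda) from Uniform[0,1) draws.
# Seed and sample size are illustrative assumptions.
random.seed(0)
lam = 2.0
samples = [-log(1 - random.random()) / lam for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)   # close to <x> = 1/lambda = 0.5
```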
[Figure: uniform pdf]
[Figure: exponential pdf]
Classical Continuous Distribution
Gamma Distribution (α = shape, β = scale)
p(x | α, β) = x^{α−1} e^{−x/β} / (Γ(α) β^α) for x > 0,
where Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt
⟨x⟩ = αβ and Var(x) = αβ²
“Good” priors for exponential and Poisson
Classical Continuous Distribution
Inverse Gamma
p(x | α, β) = (β^α / Γ(α)) x^{−α−1} e^{−β/x} for x > 0
⟨x⟩ = β/(α − 1) for α > 1, and Var(x) = β² / ((α − 1)²(α − 2)) for α > 2
“Good” prior for the variance σ² of a Gaussian
[Figure: inverse-gamma pdf]
Classical Continuous Distribution
Beta distribution
Continuous variable x with dom(x) = [0, 1]
p(x | α, β) = x^{α−1} (1 − x)^{β−1} / B(α, β),
where B(α, β) = Γ(α) Γ(β) / Γ(α + β)
⟨x⟩ = α/(α + β) and Var(x) = αβ / ((α + β)² (α + β + 1))
“Good” priors for Bernoulli, binomial, geometric
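The conjugacy mentioned above reduces to count bookkeeping: a Beta(α, β) prior on the Bernoulli parameter, updated with observed flips, stays Beta with the counts added in. A sketch; the prior pseudo-counts and the flip sequence are made-up examples:

```python
# Sketch of Beta-Bernoulli conjugacy: posterior is Beta(alpha + heads, beta + tails).
# Prior parameters and data are illustrative.
alpha, beta = 2.0, 2.0          # prior pseudo-counts
flips = [1, 1, 0, 1, 0, 1, 1]   # observed Bernoulli data
heads = sum(flips)
tails = len(flips) - heads
alpha_post, beta_post = alpha + heads, beta + tails
post_mean = alpha_post / (alpha_post + beta_post)   # <x> = alpha/(alpha+beta)
print(alpha_post, beta_post, post_mean)
```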
Classical Continuous Distribution
Univariate Gaussian
p(x | μ, σ²) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²)),  ⟨x⟩ = μ,  Var(x) = σ²
Student’s t-distribution
Arises when estimating the mean of a univariate Gaussian where the sample size is small and the variance is unknown
[Figure: Student’s t pdf]
Classical Continuous Distribution
Dirichlet distribution
“Good” prior for the categorical distribution
p(θ_1, …, θ_K | α) = (1/B(α)) Π_i θ_i^{α_i − 1}
for θ_i ≥ 0 with Σ_i θ_i = 1, where B(α) = Π_i Γ(α_i) / Γ(Σ_i α_i)
Properties: ⟨θ_i⟩ = α_i / Σ_j α_j
Multivariate Gaussian
p(x | μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)^T Σ^{−1} (x − μ))
Multivariate Gaussian
Canonical Representation: p(x) ∝ exp(−(1/2) x^T Λ x + η^T x), with precision Λ = Σ^{−1} and η = Σ^{−1} μ
Entropy: H = (1/2) log((2πe)^d |Σ|)
Multivariate Gaussian
Properties
Product of two Gaussians
Linear Transformation
Completing the Square
Partitioned Gaussian
Whitening
1. Centering: subtract the sample mean
2. Normalization: multiply by Σ̂^{−1/2} so that the result has identity covariance
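The two whitening steps can be sketched in one dimension, where Σ̂^{−1/2} is just 1/σ̂; the data values are illustrative:

```python
# Minimal 1-D sketch of whitening: center, then scale to unit variance.
# In d dimensions step 2 multiplies by the matrix Sigma^(-1/2); here d = 1.
data = [2.0, 4.0, 6.0, 8.0]
n = len(data)
mean = sum(data) / n
centered = [x - mean for x in data]            # 1. centering
var = sum(c * c for c in centered) / n
white = [c / var ** 0.5 for c in centered]     # 2. normalization

new_mean = sum(white) / n
new_var = sum(w * w for w in white) / n
print(new_mean, new_var)   # ~0 and ~1 after whitening
```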
Exponential Family
A distribution belongs to the exponential family if it has the following form:
p(x | θ) = h(x) exp(η(θ)^T T(x) − ψ(θ))
T(x) is called the sufficient statistic
η(θ) is called the natural parameter
ψ(θ) is called the log-partition function
Many distributions ∈ exp family
Multivariate Gaussian to standard form
Others include Bernoulli, binomial, Poisson, exponential, Pareto, negative binomial, Weibull, Laplace, chi-square, Gaussian, lognormal, inverse Gaussian, gamma, inverse gamma, beta, multinomial, Dirichlet, Wishart, inverse Wishart, normal-gamma (see Wikipedia)
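As a concrete case, the Bernoulli distribution fits the form above with T(x) = x, natural parameter η = log(θ/(1 − θ)), and log-partition ψ(η) = log(1 + e^η). A sketch verifying that the exponential-family form recovers the original probabilities (θ = 0.3 is an arbitrary example):

```python
from math import exp, log

# Sketch: Bernoulli(theta) in exponential-family form
#   p(x|theta) = exp(eta * x - psi(eta)),  x in {0, 1}.
# The parameter value is illustrative.
theta = 0.3
eta = log(theta / (1 - theta))   # natural parameter (log-odds)
psi = log(1 + exp(eta))          # log-partition function
p1 = exp(eta * 1 - psi)          # should recover p(x=1) = theta
p0 = exp(eta * 0 - psi)          # should recover p(x=0) = 1 - theta
print(p1, p0)
```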
From Data to Parameters
Given data X = {x_1, …, x_N}
Based on a model M
[Figure: graphical model (plate notation): model M and parameter θ generate the observations, plate over N samples]
From Data to Parameters
Given data X = {x_1, …, x_N}, estimate θ in p(x|θ)
1. Bayesian methods
2. Maximum A Posteriori
3. Maximum Likelihood
4. Moment Matching
5. Pseudo Likelihood
ML Estimate of Gaussian
Assuming i.i.d. data, the log-likelihood function is
log p(X | μ, σ²) = −(N/2) log(2πσ²) − (1/(2σ²)) Σ_n (x_n − μ)²
ML estimator of mean
Derivative: ∂/∂μ log p(X | μ, σ²) = (1/σ²) Σ_n (x_n − μ)
Setting to zero: Σ_n (x_n − μ) = 0
Optimal estimator: μ̂ = (1/N) Σ_n x_n
ML Estimate for Gaussian
Optimal variance estimator
Derivative with respect to the precision β = 1/σ²: ∂/∂β log p(X | μ̂, β) = N/(2β) − (1/2) Σ_n (x_n − μ̂)²
Setting it to zero: σ̂² = (1/N) Σ_n (x_n − μ̂)²
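The two closed-form ML estimators derived above are a one-liner each. A sketch with made-up data; note the variance divides by N, not N − 1:

```python
# Sketch: closed-form ML estimates for a univariate Gaussian.
# Data values are illustrative.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
N = len(data)
mu_ml = sum(data) / N                               # sample mean
var_ml = sum((x - mu_ml) ** 2 for x in data) / N    # ML variance: divide by N
print(mu_ml, var_ml)
```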
Unbiased Estimator
Unbiased estimator: ⟨θ̂⟩ = θ
Example
μ̂ = (1/N) Σ_n x_n is unbiased for i.i.d. data because ⟨μ̂⟩ = (1/N) Σ_n ⟨x_n⟩ = μ
The ML variance estimator is biased
For i.i.d. data, ⟨σ̂²⟩ = ((N − 1)/N) σ² ≠ σ²,
but (1/(N − 1)) Σ_n (x_n − μ̂)² is unbiased
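The (N − 1)/N bias factor above can be seen empirically: averaging the ML variance over many small datasets drawn from N(0, 1) lands near (N − 1)/N, not 1. A Monte Carlo sketch; the seed, N, and trial count are arbitrary choices:

```python
import random

# Monte Carlo sketch of the bias of the ML variance estimator.
# For N(0,1) data of size N, <sigma_hat^2> = (N-1)/N, below the true variance 1.
random.seed(1)
N, trials = 5, 20_000
acc = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    m = sum(xs) / N
    acc += sum((x - m) ** 2 for x in xs) / N   # ML estimate: divide by N
avg = acc / trials
print(avg)   # near (N-1)/N = 0.8
```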
Conjugate Priors
If the posterior is of the same parametric form as the prior, the prior is called the conjugate prior of the likelihood function
If the likelihood ∈ exp. family, p(x | θ) = h(x) exp(η(θ)^T T(x) − ψ(θ)), then a prior of the form
p(θ) ∝ exp(η(θ)^T τ − ν ψ(θ))
is always a conjugate prior: the posterior keeps the same form with τ → τ + Σ_n T(x_n) and ν → ν + N
MAP Estimate of Gaussian
For simplicity, assume the univariate case
Posterior: p(μ, σ² | X) ∝ p(X | μ, σ²) p(μ, σ²)
a) Prior knowledge about mean
Conjugate prior for μ: Gaussian
Prior knowledge about the parameters; assume μ and σ² are independent:
p(X, μ, σ²) = p(X | μ, σ²) p(μ) p(σ²)
[Figure: graphical model with parameter nodes μ and σ² pointing to the data X]
What does the posterior look like?
What is the mean of p(μ | X, σ²)?
What about p(σ² | X)?
Assumption: hold μ fixed and place the prior on σ² alone
Then p(σ² | X) ∝ (σ²)^{−N/2} exp(−Σ_n (x_n − μ)² / (2σ²)) p(σ²)
This form is called Gaussian-inverse-gamma; its conjugate prior p(σ²) is inverse-gamma
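For the first question above, the mean of p(μ | X, σ²) has a standard closed form when the prior is Gaussian: posterior precision is the sum of the prior and data precisions, and the posterior mean is the precision-weighted average of the prior mean and the sample mean. A sketch; all numerical values are made-up examples:

```python
# Sketch of the conjugate update for the Gaussian mean with known variance:
# prior mu ~ N(m0, s0^2), likelihood x_n ~ N(mu, sigma^2).
# Data, prior parameters, and sigma^2 are illustrative.
data = [4.8, 5.1, 5.3, 4.9]
sigma2 = 1.0          # known data variance
m0, s02 = 0.0, 10.0   # prior mean and variance
N = len(data)
xbar = sum(data) / N

post_prec = 1.0 / s02 + N / sigma2                    # precisions add
post_var = 1.0 / post_prec
post_mean = post_var * (m0 / s02 + N * xbar / sigma2) # precision-weighted average
print(post_mean, post_var)
```

As N grows the data precision dominates, so the posterior mean is pulled from m0 toward the sample mean x̄.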