Chapter 11, 20.1-3
We’ll need to use a handy inequality: Jensen’s Inequality

For any concave function $f$, we have $f\big(\langle X \rangle_{q(x)}\big) \ge \langle f(X) \rangle_{q(x)}$ for any pdf $q(x)$.

Recall that a concave function $f$ satisfies $f\big(\alpha x + (1-\alpha)y\big) \ge \alpha f(x) + (1-\alpha) f(y)$ for any $\alpha \in [0,1]$.

This two-point convex combination extends directly to weighted sums and integrals, i.e., to expectations under any pdf $q(x)$.

The logarithm is a concave function; applying Jensen's inequality to it gives the lower bounds used below (for example, it implies that the KL divergence is non-negative).
[Figure: illustration of concavity: the chord value $\alpha f(x) + (1-\alpha) f(y)$ lies below the curve value $f\big(\alpha x + (1-\alpha) y\big)$. Image: http://upload.wikimedia.org/wikipedia/en/7/73/ConcaveDef.png]
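As a quick numeric sanity check (my own sketch with a made-up discrete distribution, not from the slides), Jensen's inequality for the concave $\log$ says the log of the mean is at least the mean of the log:

```python
import numpy as np

# Made-up discrete distribution q(x) over a few positive values of X.
x = np.array([0.5, 1.0, 2.0, 4.0])
q = np.array([0.1, 0.4, 0.3, 0.2])      # probabilities, sum to 1

lhs = np.log(np.sum(q * x))             # f(<X>_q) with f = log
rhs = np.sum(q * np.log(x))             # <f(X)>_q

print(lhs, rhs, lhs >= rhs)             # Jensen: log of the mean >= mean of the log
```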

Let’s consider a single observation $v$. We need to find parameters $\theta$ that maximize the data log-likelihood $\log p(v|\theta)$. The variational approach lower-bounds this objective and alternately optimizes the bound over $\theta$ and over a variational distribution $q(h|v)$:

$$\log p(v|\theta) \;=\; \big\langle \log p(v|\theta) \big\rangle_{q(h|v)} \;=\; \Big\langle \log \tfrac{p(v,h|\theta)}{p(h|v,\theta)} \Big\rangle_{q(h|v)} \;=\; \big\langle \log p(v,h|\theta) \big\rangle_{q(h|v)} - \big\langle \log p(h|v,\theta) \big\rangle_{q(h|v)}$$

The first term is called the Energy term – it looks almost like the log-likelihood in the fully observed case!

For the second term, since $\mathrm{KL}\big(q(h|v)\,\|\,p(h|v,\theta)\big) \ge 0$,
$$-\big\langle \log p(h|v,\theta) \big\rangle_{q(h|v)} \;\ge\; -\big\langle \log p(h|v,\theta) \big\rangle_{q(h|v)} - \mathrm{KL}\big(q(h|v)\,\|\,p(h|v,\theta)\big) \;=\; -\big\langle \log q(h|v) \big\rangle_{q(h|v)}$$
This last expression is called the Entropy term – it does not depend on $\theta$. But given $\theta$, can you choose a $q(h|v)$ that makes this lower bound on the data likelihood as tight as possible?
[Figure: EM iterations, alternating between the E-step and the M-step]
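To make the alternation concrete, here is a minimal, self-contained EM sketch (my own illustration, not the course's code) for a 1-D mixture of two unit-variance Gaussians: the E-step sets $q(h|v_n) = p(h|v_n,\theta)$, and the M-step re-estimates $\theta = (\pi, \mu)$ by maximizing the Energy term:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two unit-variance Gaussians (made up for illustration).
v = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

# Parameters theta = (mixing weights pi, component means mu); variances fixed at 1.
pi, mu = np.array([0.5, 0.5]), np.array([-1.0, 1.0])

for _ in range(50):
    # E-step: q(h|v_n) = p(h|v_n, theta), the posterior over which component generated v_n.
    log_p = np.log(pi) - 0.5 * (v[:, None] - mu[None, :]) ** 2   # up to a shared constant
    q = np.exp(log_p - log_p.max(1, keepdims=True))
    q /= q.sum(1, keepdims=True)
    # M-step: maximize the Energy term <log p(v,h|theta)>_q over pi and mu.
    Nk = q.sum(0)
    pi, mu = Nk / len(v), (q * v[:, None]).sum(0) / Nk

print(pi, mu)   # should end up near the true weights (0.4, 0.6) and means (-2, 3)
```

For other models with hidden variables the same loop applies; only the two update rules change.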
This iterative algorithm (K-means) actually minimizes the following cost function, where $N_i$ is the set of points currently assigned to cluster $i$:
$$J(\{m_i\}) = \sum_{i=1}^{K} \sum_{x_n \in N_i} \big\| x_n - m_i \big\|_2^2$$
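A minimal K-means sketch (my own NumPy illustration, not reference code from the course); both the assignment step and the mean-update step can only decrease $J$, so the iteration converges to a local minimum:

```python
import numpy as np

def kmeans(X, K, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=K, replace=False)].copy()    # initial means from the data
    for _ in range(n_iters):
        # Assignment step: send each x_n to the cluster N_i with the nearest mean m_i.
        d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(-1)    # squared distances, (N, K)
        z = d2.argmin(axis=1)
        # Update step: each m_i becomes the mean of the points assigned to it.
        for i in range(K):
            if np.any(z == i):
                m[i] = X[z == i].mean(axis=0)
    J = sum(((X[z == i] - m[i]) ** 2).sum() for i in range(K))  # the cost J above
    return m, z, J
```

For example, `m, z, J = kmeans(np.random.rand(500, 2), K=3)` returns the cluster means, the hard assignments, and the final cost.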
Contribution of the nth datapoint to the ith cluster
Instead of forcing each data point into a single cluster as K-means does, a GMM uses soft assignment – the fraction of a data point belonging to a cluster is measured by the posterior probability of that cluster given the data point.
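A sketch of that soft assignment (assuming, for simplicity, spherical components $\mathcal{N}(\mu_i, \sigma_i^2 I)$; the function and variable names are mine, not the slides'):

```python
import numpy as np

def soft_assignment(X, pi, mu, var):
    """Posterior r[n, i] = p(cluster i | x_n) for spherical Gaussians N(mu_i, var_i * I)."""
    N, D = X.shape
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)             # squared distances, (N, K)
    log_post = np.log(pi) - 0.5 * (d2 / var + D * np.log(2 * np.pi * var))
    log_post -= log_post.max(1, keepdims=True)                        # numerical stability
    r = np.exp(log_post)
    return r / r.sum(1, keepdims=True)                                # each row sums to 1
```

Here $r[n,i]$ plays the role of the "contribution of the nth datapoint to the ith cluster" above, and the soft counts $\sum_n r[n,i]$ replace the hard cluster sizes when $\pi_i$ and $\mu_i$ are re-estimated in the M-step.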
[Figures: GMM fitting illustrations, moore_gmm_01 through moore_gmm_08]