博文

EM 算法一个例子

已有 8768 次阅读 2015-11-14 07:53 |个人分类:统计相关|系统分类:科研笔记

EM 算法一个例子 EM in R

参考以下网页并翻译，其中变量稍作修改。

http://rstudio-pubs-static.s3.amazonaws.com/1001_3177e85f5e4840be840c84452780db52.html

假设：有一枚硬币，以25%的概率面朝上(heads）， 75%的概率面朝下（tails).

如果面朝上，服从均值为1，标准方差为1 的正态分布，记为分布1；

如果面朝下，服从均值为7，标准方差为1的正态分布，记为分布2。

在该问题中，硬币什么时候服从分布1，什么时候服从分布2就是隐变量。即（p1，p2）

(1)#先模拟数据

set.seed(123)
p1=0.25

x<-y<-rep(0,1000)
for (i in 1:1000)
{
if (runif(1)<p1)
{x[i]<-rnorm(1,mean=1,sd=1)
y[i]<-"head"} #以25%的概率产生服从N(1,1)的“正面”数据。
else
{x[i]<-rnorm(1,mean=7,sd=1)
y[i]<-"tail"} #以75%的概率产生服从 N(7,1)的 “反面”数据。
}

(2)目前有是三个部分：

1.观察数据 x.

2.正态分布的参数mu，sd（均值，标准方差）记为 theta=（mu1,mu2,sd1,sd2)

3. 隐变量（混合概率）记为 Z=(p1,p2)

(3)EM算法：

1. 给定当前的参数和观察数据，估计隐变量

2.给定隐变量和观察数据，估计参数。

（3-1）E步

为了使得问题简单，假设我们已知两个正态分布的标准方差都是1.那么未知参数就是两个均值。此时参数theta=(mu1,mu2)

我们先给定参数 theta（mu1，mu2)=(0, 1)

每一个观测值，在给定当前参数和观察数据的情况下，我们可以计算它来自分布1，分布2的概率。

mu1=0;mu2=1;

x[1];dnorm(x[1],mean=mu1,sd=1);dnorm(x[1],mean=mu2,sd=1)

##[1] 7.800554[1] 2.442273e-14[1] 3.617301e-11

因为，3.617301e-11>2.442273e-14, 我们认为 x[1] 在分布2 的概率要比在分布1的概率大一些，即我们认为x[1] 来自分布2. 然而，在计算概率时候，我们还要考虑到混合概率。我们需要按照混合概率对每次观察进行加权。一开始我们假设来自两个分布的概率都是50%， Z=（0.5,0.5）

## consider the mixing probability

z1=0.5;z2=0.5;

t1<-z1*dnorm(x[1],mean=mu1,sd=1)

t2<-z2*dnorm(x[1],mean=mu2,sd=1)

t1/(t1+t2) #[1] 0.0006747089t2/(t1+t2) #[1] 0.9993253

我们发现 x[1] 来自第一个分布的概率约 0.00067，第二个分布的概率0.99933，我们推断x[1]很可能来自第二个分布。同理，我们可以算出所有的观察数据属于第一个分布的概率p1。（即，在给定参数theta和观察数据的情况下，估计隐变量（p1,p2)）

t1<-z1*dnorm(x,mean=mu1,sd=1)

t2<-z2*dnorm(x,mean=mu2,sd=1)

p1<-t1/(t1+t2)

p2<-t2/(t1+t2)

head(p1) #total 1000 numbers， but we show first 6 numbers using head

##[1] 0.0006747089 0.0003162351 0.0004351016 0.0002704738 0.0012503475 0.0029791099

p1hat<-mean(p1)

p2hat<-mean(p2)

## 0.1002794 0.8997206

此时估计的隐变量为(0.1003, 0.8997).

（3-2） M步

接下来我们进行M步。在给定隐变量和观察数据的情况下，估计参数theta。我们使用加权平均来做这一步。

例如对于第一变量x[1], 以0.00067 属于分布1，以0.99933属于分布2.

p1<-t1/(t1+t2) #using as weight of distribution 1

p2<-t2/(t1+t2) #using as weight of distribution 2

mu1_hat<-sum(p1*x)/sum(p1) #evaluate mu1

mu2_hat<-sum(p2*x)/sum(p2) #evaluate mu2

print(c(mu1_hat,mu2_hat)) ##[1] 0.5045618 6.1011529

经过第一次迭代，我们可以估计参数theta（ mu1,mu2）分别为（0.505， 6.101）. 这比初始值（0，1）更接近真实值（1，7）。我们可以简单地重复这个过程，直到我们估计改变非常小为止。我们将这些代码放在一起。

#========generate data==================
set.seed(123)
p1_actual=0.25

x<-y<-rep(0,1000)
for (i in 1:1000)
{
if (runif(1)<p1_actual)
{x[i]<-rnorm(1,mean=1,sd=1)
y[i]<-"head"} # generate the distribution N(1,1)
else
{x[i]<-rnorm(1,mean=7,sd=1)
y[i]<-"tail"} # generate the distribution N(7,1)
}

# set the initial guesses for the distribution parameters
mu1=0
mu2=1
## as well as the initial latent variable parameters
z1=0.5
z2=0.5
## set the iterations number
N=10

for( i in 1:N ) {
## E-step
## Given the observed data (x), as well as the distribution parameters theta(mu1,mu2),
## what are the latent variables(p1,p2)?

t1=z1 * dnorm( x, mu1 )
t2=z2 * dnorm( x, mu2 )
#p1, p2 used as weight
p1<- t1/(t1+t2)
p2<- t2/(t1+t2)## note: p2 = 1 - p1

p1hat <- mean(p1)
p2hat <- mean(p2)

## Given the observed data (x), as well as the latent variables(p1,p2),
## what are the population parameters theta(mu1,mu2)?

mu1 <- sum( p1 * x ) / sum(p1)
mu2 <- sum( p2 * x ) / sum(p2)

## print the current estimates
## iteration;mu1 mu2 p1 p2
print( c(i,mu1, mu2, p1hat,p2hat) )

}

#===================

### i mu1 mu2 p1 p2

#[1] 1.0000000 0.5045618 6.1011529 0.1002794 0.8997206

#[1] 2.0000000 0.9161202 6.9745297 0.2367947 0.7632053

#[1] 3.0000000 1.0020187 7.0115329 0.2448788 0.7551212

#[1] 4.0000000 1.0086520 7.0137279 0.2454253 0.7545747

#[1] 5.0000000 1.0091323 7.0138808 0.2454641 0.7545359

#[1] 6.0000000 1.0091668 7.0138918 0.2454669 0.7545331

#[1] 7.0000000 1.0091693 7.0138926 0.2454671 0.7545329

#[1] 8.0000000 1.0091694 7.0138926 0.2454671 0.7545329

#[1] 9.0000000 1.0091695 7.0138926 0.2454672 0.7545328

#[1] 10.0000000 1.0091695 7.0138926 0.2454672 0.7545328

转载本文请联系原作者获取授权，同时请注明本文来自高建召科学网博客。
链接地址：https://blog.sciencenet.cn/blog-468005-935092.html

上一篇：EM 算法
下一篇：vim 自动添加表头

收藏 IP: 124.191.5.*| 热度|

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

数据加载中...

返回顶部

高建召

扫一扫，分享此博文

cliffgao的个人博客分享 http://blog.sciencenet.cn/u/cliffgao 兴趣：生物信息学、统计、概率

博文

EM 算法一个例子

当前推荐数：0

该博文允许注册用户评论请点击登录评论 (0 个评论)

高建召

全部作者的其他最新博文

全部精选博文导读

相关博文

cliffgao的个人博客分享 http://blog.sciencenet.cn/u/cliffgao 兴趣：生物信息学、统计、概率

博文

EM 算法一个例子

当前推荐数：0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

高建召

全部作者的其他最新博文

全部精选博文导读

相关博文

该博文允许注册用户评论请点击登录评论 (0 个评论)