Naive Bayes Algorithm
1 Naive Bayes Algorithm
1.1 A Spam Filtering Model
Taking spam prediction as an example, we introduce the Bayes algorithm. Suppose the vocabulary contains 50000 words. The input \(x\), \(x \in {\{ 0,1\} ^{50000}}\), is a 50000-dimensional vector, where \(x _i = 1\) means the \(i\)-th word of the vocabulary appears in the email and \(x _i = 0\) means it does not.
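The binary representation above can be illustrated with a tiny made-up vocabulary (a real one has 50000 words; the vocabulary and email text here are assumptions for demonstration only):

```python
import numpy as np

# Toy illustration of the binary word-presence representation.
# Both the vocabulary and the email are made up for this example.
vocab = ["buy", "cheap", "hello", "meeting"]
email = "hello please buy now"

words = set(email.split())
# x_i = 1 if the i-th vocabulary word occurs in the email, else 0.
x = np.array([1 if w in words else 0 for w in vocab])
# x is now [1, 0, 1, 0]: "buy" and "hello" occur, "cheap" and "meeting" do not.
```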
Given whether the email is spam, the probability of the sample \(({x _1},{x _2}, \ldots ,{x _{50000}})\) is \(p({x _1},{x _2}, \ldots ,{x _{50000}}|y)\), which can be decomposed as follows:
$$p({x _1},{x _2},...,{x _{50000}}|y) = p({x _1}|y)p({x _2}|y,{x _1})...p({x _{50000}}|y,{x _1},...,{x _{49999}})$$Note: this identity is easy to understand. The first factor on the right-hand side is the probability of \(x _1\) given \(y\); the second is the probability of \(x _2\) given \(y\) and \(x _1\). Multiplying the two gives the probability of \(({x _1},{x _2})\) given \(y\). Continuing in the same way yields the formula above (the chain rule of probability).
The naive Bayes assumption is that the \(x _i\) are conditionally independent given \(y\), so the expression simplifies to: $$p({x _1},{x _2},...,{x _{50000}}|y) = p({x _1}|y)p({x _2}|y)...p({x _{50000}}|y) = \prod\limits _{i = 1}^{50000} {p({x _i}|y)}$$
1.2 Derivation
To set up the maximum likelihood function, note that our goal is to find the parameter \(\theta\) that maximizes the probability of the correct \(y\) given \(x\) over the training set. The likelihood is the product of these probabilities, so we need to maximize:
$$\ln \prod\limits _{i = 1}^m {p(y = {y^i}|x = {x^i};\theta )p(x = {x^i})} $$Many textbooks instead write the maximum likelihood function as below; the two forms are in fact equivalent. The expression below directly means the probability of observing \(({x^i},{y^i})\) given \(\theta\), and the analysis is the same as above.
$$\ln \prod\limits _{i = 1}^m {p(y = {y^i},x = {x^i};\theta )} $$For brevity, \(\theta\) is omitted from here on.
We define the following parameters:
$${\phi _y} = p(y = 1)$$ $${\phi _{j|y = 1}} = p({x _j} = 1|y = 1)$$ $${\phi _{j|y = 0}} = p({x _j} = 1|y = 0)$$Here \(j\) indexes words in the vocabulary and \(i\) will index training samples. We then obtain the log-likelihood:
$$\ell ({\phi _y},{\phi _{j|y = 1}},{\phi _{j|y = 0}}) = \ln \prod\limits _{i = 1}^m {p({x^i},{y^i})} = \ln \prod\limits _{i = 1}^m {p({x^i}|{y^i})p({y^i})}$$Since \(p({x^i}|{y^i}) = \prod\nolimits _{j = 1}^n {p(x _j^i|{y^i})}\) and each factor is a Bernoulli term, for example $$p(x _j^i|{y^i} = 1) = {({\phi _{j|y = 1}})^{1\{ x _j^i = 1\} }}{(1 - {\phi _{j|y = 1}})^{1\{ x _j^i = 0\} }}$$substituting and expanding the logarithm gives $$\ell ({\phi _y},{\phi _{j|y = 1}},{\phi _{j|y = 0}}) = \sum\limits _{i = 1}^m {[1\{ {y^i} = 1\} (\ln {\phi _y} + \sum\limits _{j = 1}^n {(1\{ x _j^i = 1\} \ln {\phi _{j|y = 1}} + 1\{ x _j^i = 0\} \ln (1 - {\phi _{j|y = 1}}))} ) + 1\{ {y^i} = 0\} (\ln (1 - {\phi _y}) + \sum\limits _{j = 1}^n {(1\{ x _j^i = 1\} \ln {\phi _{j|y = 0}} + 1\{ x _j^i = 0\} \ln (1 - {\phi _{j|y = 0}}))} )]}$$We then take the partial derivative with respect to \(\phi _y\):
$$\frac{{\partial \ell ({\phi _y},{\phi _{j|y = 1}},{\phi _{j|y = 0}})}}{{\partial {\phi _y}}} = \sum\limits _{i = 1}^m {(1\{ {y^i} = 1\} \frac{1}{{{\phi _y}}} - 1\{ {y^i} = 0\} \frac{1}{{1 - {\phi _y}}})} = \sum\limits _{i = 1}^m {\frac{{1\{ {y^i} = 1\} - {\phi _y}}}{{{\phi _y}(1 - {\phi _y})}}} = 0$$Therefore:
$${\phi _y} = \frac{1}{m}\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} }$$We then take the partial derivative with respect to \({\phi _{j|y = 1}}\):
$$\frac{{\partial \ell ({\phi _y},{\phi _{j|y = 1}},{\phi _{j|y = 0}})}}{{\partial {\phi _{j|y = 1}}}} = \sum\limits _{i = 1}^m {1\{ {y^i} = 1\} (1\{ x _j^i = 1\} \frac{1}{{{\phi _{j|y = 1}}}} - 1\{ x _j^i = 0\} \frac{1}{{1 - {\phi _{j|y = 1}}}})}=0$$ $$\frac{{\partial \ell ({\phi _y},{\phi _{j|y = 1}},{\phi _{j|y = 0}})}}{{\partial {\phi _{j|y = 1}}}} = \sum\limits _{i = 1}^m {1\{ {y^i} = 1\} (\frac{{1\{ x _j^i = 1\} - {\phi _{j|y = 1}}}}{{{\phi _{j|y = 1}}(1 - {\phi _{j|y = 1}})}})}=0$$If a word never appears in the training samples, its estimated probability is zero in both classes and the posterior degenerates to 0/0. To avoid this, Laplace smoothing is introduced, which gives:
$${\phi _{j|y = 1}} = \frac{{\sum\limits _{i = 1}^m {1\{ x _j^i = 1,{y^i} = 1\} } + 1}}{{\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} } + 2}}$$Similarly:
$${\phi _{j|y = 0}} = \frac{{\sum\limits _{i = 1}^m {1\{ x _j^i = 1,{y^i} = 0\} } + 1}}{{\sum\limits _{i = 1}^m {1\{ {y^i} = 0\} } + 2}}$$Note: consider what these formulas say. \({\phi _y}\) is just the number of spam emails divided by the total number of samples, and \({\phi _{j|y = 1}}\) is (up to the smoothing terms) the number of spam emails containing word \(j\) divided by the total number of spam emails. These are intuitively obvious estimates, and that is the charm of mathematics: even the simplest formulas rest on a theoretical basis.
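The closed-form estimates can be checked numerically. The sketch below computes \(\phi _y\) and the Laplace-smoothed \(\phi _{j|y=1}\) on a made-up training set (all data and variable names are assumptions for illustration):

```python
import numpy as np

# Toy training set: rows of X are emails as binary word-presence vectors,
# y marks spam (1) or ham (0). All numbers are made up.
X = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array([1, 1, 0, 0])

# phi_y: fraction of spam emails in the sample.
phi_y = np.mean(y == 1)

# Laplace-smoothed estimate of p(x_j = 1 | y = 1) for each word j:
# (number of spam emails containing word j + 1) / (number of spam emails + 2).
phi_j_spam = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
# With 2 spam emails and word counts [2, 1, 1], this gives [0.75, 0.5, 0.5].
```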
1.3 Implementation
We use the training samples to estimate the probability distribution of each word and then predict whether an email is spam. Running the program below produces the correct spam classification.
import numpy as np
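The full listing is elided above. As an illustration only, here is a minimal sketch of a Bernoulli naive Bayes classifier on made-up data; the variable names and numbers are assumptions, not the author's original program:

```python
import numpy as np

# Toy data: rows of X are emails as binary word-presence vectors,
# y marks spam (1) or ham (0). All values are made up for illustration.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
y = np.array([1, 1, 0, 0])

phi_y = np.mean(y == 1)
# Laplace-smoothed estimates of p(x_j = 1 | y), following the formulas derived above.
phi_j_spam = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
phi_j_ham = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)

def predict(x):
    # Compare log joint probabilities ln p(x|y) + ln p(y) for y = 1 and y = 0.
    log_spam = np.log(phi_y) + np.sum(
        np.where(x == 1, np.log(phi_j_spam), np.log(1 - phi_j_spam)))
    log_ham = np.log(1 - phi_y) + np.sum(
        np.where(x == 1, np.log(phi_j_ham), np.log(1 - phi_j_ham)))
    return int(log_spam > log_ham)
```

For example, `predict(np.array([1, 1, 0, 0]))` classifies an email containing only the first two words as spam, since those words occur mostly in the spam rows of the toy data.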
2 Multinomial Event Model
2.1 Derivation
We now model the problem in a different way. For \({x^T} = [{x _1},{x _2}, \ldots ,{x _n}]\), \({x _1} = 1\) means that the first word of the email has index 1 in the vocabulary; each \(x _i\) is the vocabulary index of the \(i\)-th word of the email. We still have:
$$p({x _1},{x _2},...,{x _n}|y) = \prod\limits _{i = 1}^n {p({x _i}|y)}$$We define \({\phi _y} = p(y = 1)\), \({\phi _{i = k|y = 1}} = p({x _i} = k|y = 1)\), and \({\phi _{i = k|y = 0}} = p({x _i} = k|y = 0)\).
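This representation maps an email to a sequence of vocabulary indices. A tiny sketch (the vocabulary and email are made up, and 0-based indices are used for convenience):

```python
# Toy illustration of the word-index representation used in this model.
# Vocabulary and email text are made-up examples; indices here are 0-based.
vocab = {"buy": 0, "cheap": 1, "hello": 2, "meeting": 3}
email = ["buy", "cheap", "buy"]

# x_i is the vocabulary index of the i-th word of the email,
# so the email length (not the vocabulary size) sets the length of x.
x = [vocab[w] for w in email]
# x is [0, 1, 0]: the word "buy" appears twice, at positions 1 and 3.
```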
We then maximize the log-likelihood:
$$\ell ({\phi _y},{\phi _{i = k|y = 1}},{\phi _{i = k|y = 0}}) = \ln \prod\limits _{i = 1}^m {p({x^i},{y^i})} = \ln \prod\limits _{i = 1}^m {p({x^i}|{y^i})p({y^i})}$$ $$\ell ({\phi _y},{\phi _{i = k|y = 1}},{\phi _{i = k|y = 0}}) = \ln \prod\limits _{i = 1}^m {{{(p({x^i}|{y^i} = 1){\phi _y})}^{1\{ {y^i} = 1\} }}{{(p({x^i}|{y^i} = 0)(1 - {\phi _y}))}^{1\{ {y^i} = 0\} }}}$$ $$\ell ({\phi _y},{\phi _{i = k|y = 1}},{\phi _{i = k|y = 0}}) = \sum\limits _{i = 1}^m {[1\{ {y^i} = 1\} (\ln {\phi _y} + \sum\limits _{j = 1}^n {\ln p(x _j^i|{y^i} = 1)} ) + 1\{ {y^i} = 0\} (\ln (1 - {\phi _y}) + \sum\limits _{j = 1}^n {\ln p(x _j^i|{y^i} = 0)} )]}$$Taking the partial derivative with respect to \({\phi _y}\) works exactly as before, so we state the result directly:
$${\phi _y} = \frac{1}{m}\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} }$$We take the partial derivative with respect to \({\phi _{i = k|y = 1}}\) as follows:
$$\frac{{\partial \ell ({\phi _y},{\phi _{i = k|y = 1}},{\phi _{i = k|y = 0}})}}{{\partial {\phi _{i = k|y = 1}}}} = \frac{{\partial \sum\limits _{i = 1}^m {1\{ {y^i} = 1\} \sum\limits _{j = 1}^n {\ln p(x _j^i|{y^i} = 1)} } }}{{\partial {\phi _{i = k|y = 1}}}}$$where:
$$\ln p(x _j^i|{y^i} = 1) = \ln [p{(x _j^i = k|{y^i} = 1)^{1\{ x _j^i = k\} }}p{(x _j^i \ne k|{y^i} = 1)^{1\{ x _j^i \ne k\} }}]$$ $$\ln p(x _j^i|{y^i} = 1) = 1\{ x _j^i = k\} \ln {\phi _{i = k|y = 1}} + (1 - 1\{ x _j^i = k\} )\ln (1 - {\phi _{i = k|y = 1}})$$Therefore:
$$\frac{{\partial \ell ({\phi _y},{\phi _{i = k|y = 1}},{\phi _{i = k|y = 0}})}}{{\partial {\phi _{i = k|y = 1}}}} = \sum\limits _{i = 1}^m {1\{ {y^i} = 1\} \sum\limits _{j = 1}^n {\frac{{1\{ x _j^i = k\} - {\phi _{i = k|y = 1}}}}{{{\phi _{i = k|y = 1}}(1 - {\phi _{i = k|y = 1}})}}} } = 0$$Applying Laplace smoothing, we finally obtain the following formula, where \(V\) is the number of words in the vocabulary:
$${\phi _{i = k|y = 1}} = \frac{{\sum\limits _{i = 1}^m {\sum\limits _{j = 1}^n {1\{ {y^i} = 1,x _j^i = k\} } } + 1}}{{n\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} } + V}}$$Similarly:
$${\phi _{i = k|y = 0}} = \frac{{\sum\limits _{i = 1}^m {\sum\limits _{j = 1}^n {1\{ {y^i} = 0,x _j^i = k\} } } + 1}}{{n\sum\limits _{i = 1}^m {1\{ {y^i} = 0\} } + V}}$$
2.2 Implementation
We use the training samples to estimate the probability distribution of each word and then predict whether an email is spam. Running the program below produces the correct spam classification.
import numpy as np
Because the individual probabilities are very small, the product of many of them underflows floating-point precision, so the program multiplies by a fixed scale factor at this point. Summing log-probabilities instead of multiplying probabilities is another standard way to avoid this underflow.
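To illustrate the log-probability alternative, here is a minimal sketch of the multinomial event model on made-up data (all names and numbers are assumptions, and summing logs replaces the author's fixed scale factor):

```python
import numpy as np

# Multinomial naive Bayes sketch using log-probabilities to avoid underflow.
# Toy data: each email is a list of vocabulary indices, y marks spam (1) or ham (0).
V = 4  # vocabulary size (made up)
emails = [[0, 1, 0], [0, 0, 1], [2, 3, 2], [3, 2, 3]]
y = np.array([1, 1, 0, 0])

phi_y = np.mean(y == 1)

# Count how often each word index occurs in each class.
counts = np.zeros((2, V))
totals = np.zeros(2)
for email, label in zip(emails, y):
    for k in email:
        counts[label, k] += 1
    totals[label] += len(email)

# Laplace-smoothed p(word = k | y): (count + 1) / (total words in class + V),
# matching the closed-form estimates derived above.
phi = (counts + 1) / (totals[:, None] + V)

def predict(email):
    # A sum of logs replaces the long product, which would underflow.
    log_spam = np.log(phi_y) + sum(np.log(phi[1, k]) for k in email)
    log_ham = np.log(1 - phi_y) + sum(np.log(phi[0, k]) for k in email)
    return int(log_spam > log_ham)
```

On this toy data, emails built from the first two word indices are classified as spam and emails built from the last two as ham.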
Appendix
The handwritten formulas and the Word version of this blog post are available at:
https://github.com/zhengchenyu/MyBlog/tree/master/doc/mlearning/贝叶斯算法