Machine Learning - 2.6 - Naive Bayes for Supervised Learning

Naive Bayes Algorithm

1 Naive Bayes Algorithm

1.1 Spam Classification Model

Taking spam prediction as the running example, this section introduces the Naive Bayes algorithm. Suppose the vocabulary contains 50000 words. The input is \(x \in \{0,1\}^{50000}\), a 50000-dimensional vector in which \(x_i = 1\) means the \(i\)-th word of the vocabulary appears in the email, and \(x_i = 0\) means it does not.

Given whether the email is spam, i.e., given \(y\), the probability of observing the sample \((x_1, x_2, \ldots, x_{50000})\) is \(p(x_1, x_2, \ldots, x_{50000}|y)\), which by the chain rule can be written as:

$$p({x _1},{x _2},...,{x _{50000}}|y) = p({x _1}|y)p({x _2}|y,{x _1})...p({x _{50000}}|y,{x _1},...,{x _{49999}})$$

Note: the identity above is easy to understand. The first factor on the right is the probability of \(x_1\) given \(y\); the second is the probability of \(x_2\) given \(y\) and \(x_1\). Multiplying the two gives the probability of \((x_1, x_2)\) given \(y\), and continuing in the same way yields the full formula. Naive Bayes then makes the (strong) assumption that the \(x_i\) are conditionally independent given \(y\), which collapses the chain rule to:

$$p({x _1},{x _2},...,{x _{50000}}|y) = p({x _1}|y)p({x _2}|y)...p({x _{50000}}|y) = \prod\limits _{i = 1}^{50000} {p({x _i}|y)}$$
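
For example, with hypothetical values \(p(x_1 = 1|y = 1) = 0.8\) and \(p(x_2 = 1|y = 1) = 0.1\), the assumption gives \(p(x_1 = 1, x_2 = 1|y = 1) = 0.8 \times 0.1 = 0.08\), regardless of how correlated the two words actually are.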

1.2 Derivation

To set up the maximum likelihood function, recall that our goal is to find the parameters \(\theta\) that, for each training input \(x\), make the probability of the correct label \(y\) as large as possible. The likelihood is the product of these probabilities over the training set, so concretely we need to make the following expression as large as possible (the factor \(p(x = x^i)\) does not depend on \(\theta\)):

$$\ln \prod\limits _{i = 1}^m {p(y = {y^i}|x = {x^i};\theta )p(x = {x^i})} $$

Many textbooks write the maximum likelihood function as below; the two forms are in fact the same. The expression below is directly the probability of observing \((x^i, y^i)\) when \(\theta\) is known, and the analysis is identical to the one above.

$$\ln \prod\limits _{i = 1}^m {p(y = {y^i},x = {x^i};\theta )} $$

To keep the notation light, \(\theta\) is omitted from here on.

We define the parameters as follows (\(j\) indexes words in the vocabulary):

$${\phi _y} = p(y = 1)$$

$${\phi _{j|y = 1}} = p({x _j} = 1|y = 1)$$

$${\phi _{j|y = 0}} = p({x _j} = 1|y = 0)$$

The log-likelihood is then:

$$\ell ({\phi _y},{\phi _{j|y = 1}},{\phi _{j|y = 0}}) = \ln \prod\limits _{i = 1}^m {p({x^i},{y^i})} = \ln \prod\limits _{i = 1}^m {p({x^i}|{y^i})p({y^i})}$$

$$\ell = \ln \prod\limits _{i = 1}^m {{{(p({x^i}|{y^i} = 1){\phi _y})}^{1\{ {y^i} = 1\} }}{{(p({x^i}|{y^i} = 0)(1 - {\phi _y}))}^{1\{ {y^i} = 0\} }}}$$

Substituting \(p(x^i|y^i) = \prod _{j = 1}^n p(x_j^i|y^i)\), with each factor a Bernoulli distribution in the corresponding \(\phi\), and taking the logarithm:

$$\ell = \sum\limits _{i = 1}^m {\left[ 1\{ {y^i} = 1\} \left(\ln {\phi _y} + \sum\limits _{j = 1}^n {(1\{ x _j^i = 1\} \ln {\phi _{j|y = 1}} + 1\{ x _j^i = 0\} \ln (1 - {\phi _{j|y = 1}}))} \right) + 1\{ {y^i} = 0\} \left(\ln (1 - {\phi _y}) + \sum\limits _{j = 1}^n {(1\{ x _j^i = 1\} \ln {\phi _{j|y = 0}} + 1\{ x _j^i = 0\} \ln (1 - {\phi _{j|y = 0}}))} \right) \right]}$$

Taking the partial derivative with respect to \(\phi_y\) and setting it to zero:

$$\frac{{\partial \ell }}{{\partial {\phi _y}}} = \sum\limits _{i = 1}^m {\left(1\{ {y^i} = 1\} \frac{1}{{{\phi _y}}} - 1\{ {y^i} = 0\} \frac{1}{{1 - {\phi _y}}}\right)} = \sum\limits _{i = 1}^m {\frac{{1\{ {y^i} = 1\} - {\phi _y}}}{{{\phi _y}(1 - {\phi _y})}}} = 0$$

So we have:

$${\phi _y} = \frac{1}{m}\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} }$$
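
For example, with a hypothetical training set of \(m = 10\) emails of which 3 are spam, this gives \(\phi_y = 3/10\), the fraction of spam in the sample.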

Next, taking the partial derivative with respect to \(\phi_{j|y=1}\):

$$\frac{{\partial \ell }}{{\partial {\phi _{j|y = 1}}}} = \sum\limits _{i = 1}^m {1\{ {y^i} = 1\} \left(1\{ x _j^i = 1\} \frac{1}{{{\phi _{j|y = 1}}}} - 1\{ x _j^i = 0\} \frac{1}{{1 - {\phi _{j|y = 1}}}}\right)} = \sum\limits _{i = 1}^m {1\{ {y^i} = 1\} \frac{{1\{ x _j^i = 1\} - {\phi _{j|y = 1}}}}{{{\phi _{j|y = 1}}(1 - {\phi _{j|y = 1}})}}} = 0$$

If a word never occurs in the training samples, both class-conditional estimates are zero and the posterior at prediction time degenerates to 0/0. To prevent this, we introduce Laplace smoothing: add 1 to the numerator and add the number of values \(x_j\) can take (here 2) to the denominator. This gives:

$${\phi _{j|y = 1}} = \frac{{\sum\limits _{i = 1}^m {1\{ x _j^i = 1,{y^i} = 1\} } + 1}}{{\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} } + 2}}$$

Similarly:

$${\phi _{j|y = 0}} = \frac{{\sum\limits _{i = 1}^m {1\{ x _j^i = 1,{y^i} = 0\} } + 1}}{{\sum\limits _{i = 1}^m {1\{ {y^i} = 0\} } + 2}}$$

Note: examine what these formulas say. \(\phi_y\) is simply the number of spam emails divided by the total number of samples, and \(\phi_{j|y=1}\) is the number of spam emails containing the \(j\)-th word divided by the total number of spam emails (adjusted by the smoothing terms). These are very intuitive estimates, and that is part of the charm of the mathematics: even such simple formulas rest on a solid theoretical basis.
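
As a sanity check, the closed-form estimates above can be computed in a few vectorized lines. The following is a minimal sketch, assuming a binary matrix X of shape (m, n) and a 0/1 label vector y (both hypothetical names, not taken from the program in the next section, which computes the same quantities with explicit loops):

import numpy as np

def fit_bernoulli_nb(X, y):
    # X: (m, n) binary matrix, X[i, j] = 1 if word j appears in email i
    # y: (m,) 0/1 labels, 1 = spam
    phi_y = y.mean()                                        # p(y = 1)
    # Laplace-smoothed per-word estimates: (count + 1) / (class size + 2)
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / (np.sum(y == 1) + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / (np.sum(y == 0) + 2)
    return phi_y, phi_j_y1, phi_j_y0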

1.3 Implementation

We estimate each word's probabilities from the training samples and then predict whether an email is spam. Running the program below yields the correct spam classification.

import numpy as np
import re

global DEBUG

def constructDic():
    file = open("../res/NaiveBayes/dict.txt")
    # key is the word, value is the sequence number
    dict = {}  # In fact, an ordered map would work as well
    count = 0
    while 1:
        line = file.readline()
        if not line:
            break
        for word in line.split():
            if word.lower() not in dict:
                dict[word.lower()] = count
                count = count + 1
    file.close()
    return dict

def findIndexInDict(word, dict):
    if word in dict:
        return dict[word]
    else:
        return -1

def parseSpamOrHam(words):
    if words[0] == "spam":
        return 1
    elif words[0] == "ham":
        return 0
    else:
        if DEBUG:
            print("error email type in training sample!")
        return -1

def splitEmail(line):
    regEx = re.compile(r'[^a-zA-Z]|\d')
    return list(filter(lambda word: word != "", regEx.split(line)))

# spam is 1, ham is 0
def parseEmails(x_arr, y_arr, dict, dictLen):
    file = open("../res/NaiveBayes/emails.txt")
    while 1:
        line = file.readline()
        if not line:
            break
        words = splitEmail(line)
        y = parseSpamOrHam(words)
        if y == -1:
            continue
        x = np.zeros(dictLen)
        for word in words[1:]:
            if word.lower() in dict:
                index = dict[word.lower()]
                x[index] = 1
            else:
                if DEBUG:
                    print(word, " is not in dict!")
        x_arr.append(x)
        y_arr.append(y)
    file.close()

def calPhiY(y_arr):
    # phi_y = p(y=1): the fraction of spam in the training set
    sum = 0
    for i in y_arr:
        sum = sum + i
    return sum / len(y_arr)

def calPhiXY(y_arr, x_arr, knownY, dictLen):
    # returns the vector of Laplace-smoothed p(x_j = 1 | y = knownY)
    sum = 0
    ret = np.zeros(dictLen)
    for i in range(len(y_arr)):
        if y_arr[i] != knownY:
            continue
        sum = sum + 1
        for j in range(dictLen):
            if x_arr[i][j] == 1:
                ret[j] = ret[j] + 1
    return (ret + 1) / (sum + 2)

def classify(phi, phi_y0, phi_y1, dict, dictLen):
    file = open("../res/NaiveBayes/testEmails.txt")
    while 1:
        line = file.readline()
        if not line:
            break
        words = splitEmail(line)[1:]
        x = np.zeros(dictLen)
        for word in words:
            if word.lower() in dict:
                index = dict[word.lower()]
                x[index] = 1
        # posterior p(y=1|x) via Bayes' rule
        res = calcPro(phi, phi_y1, x) / (calcPro(phi, phi_y1, x) + calcPro(phi, phi_y0, x))
        if res > 0.5:
            print("spam :" + line)
        else:
            print("ham :" + line)
    file.close()

def calcPro(phi, phi_y, x):
    # p(x|y) * p(y), with p(x|y) the product of per-word Bernoulli terms
    ret = 1
    for i in range(len(x)):
        if x[i] == 1:
            ret = ret * phi_y[i]
        else:
            ret = ret * (1 - phi_y[i])
    return ret * phi

if __name__ == "__main__":
    # 0. debug options
    DEBUG = False

    # 1. construct the dictionary
    dict = constructDic()
    dictLen = len(dict)

    # 2. parse the emails to generate the samples x and y
    x_arr = []
    y_arr = []
    parseEmails(x_arr, y_arr, dict, dictLen)

    # 3. learn the parameters from the samples
    phi = calPhiY(y_arr)  # p(y=1)
    phi_y1 = calPhiXY(y_arr, x_arr, 1, dictLen)
    phi_y0 = calPhiXY(y_arr, x_arr, 0, dictLen)

    # 4. classify the test emails
    # p(y=1|x) = p(x|y=1)p(y=1) / (p(x|y=1)p(y=1) + p(x|y=0)p(y=0))
    # p(x|y=1) = prod_i p(x_i | y=1)
    classify(phi, phi_y0, phi_y1, dict, dictLen)
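
One caveat: with a realistic 50000-word vocabulary, the product computed in calcPro would underflow to 0.0 long before the loop finishes. A common remedy, not used in the program above, is to compare log-probabilities instead. Below is a minimal sketch reusing the phi, phi_y1, phi_y0 and binary vector x from step 3; logPosteriorRatio is a hypothetical helper, not part of the program:

import numpy as np

def logPosteriorRatio(phi, phi_y1, phi_y0, x):
    # log p(x, y=1) - log p(x, y=0); a positive value means "spam"
    logP1 = np.log(phi) + np.sum(x * np.log(phi_y1) + (1 - x) * np.log(1 - phi_y1))
    logP0 = np.log(1 - phi) + np.sum(x * np.log(phi_y0) + (1 - x) * np.log(1 - phi_y0))
    return logP1 - logP0

Thanks to Laplace smoothing, no entry of phi_y1 or phi_y0 is exactly 0 or 1, so every logarithm above is finite.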

2 Multinomial Event Model

2.1 Derivation

We now model the problem in a different way. The feature vector is \(x^T = [x_1, x_2, \ldots, x_n]\), where \(n\) is the number of words in the email and \(x_i\) is the vocabulary index of the email's \(i\)-th word; for example, \(x_1 = 1\) means the first word of the email is the word with index 1 in the vocabulary. The conditional independence assumption still gives:

$$p({x _1},{x _2},...,{x _n}|y) = \prod\limits _{i = 1}^n {p({x _i}|y)}$$

We set \({\phi _y} = p(y = 1)\) and \({\phi _{i = k|y = 1}} = p({x _i} = k|y = 1)\), and similarly \({\phi _{i = k|y = 0}} = p({x _i} = k|y = 0)\); the same distribution is shared by every position \(i\).

The log-likelihood is then:

$$\ell ({\phi _y},{\phi _{i = k|y = 1}},{\phi _{i = k|y = 0}}) = \ln \prod\limits _{i = 1}^m {p({x^i},{y^i})} = \ln \prod\limits _{i = 1}^m {p({x^i}|{y^i})p({y^i})}$$

$$\ell = \ln \prod\limits _{i = 1}^m {{{(p({x^i}|{y^i} = 1){\phi _y})}^{1\{ {y^i} = 1\} }}{{(p({x^i}|{y^i} = 0)(1 - {\phi _y}))}^{1\{ {y^i} = 0\} }}}$$

$$\ell = \sum\limits _{i = 1}^m {\left[ 1\{ {y^i} = 1\} \left(\ln {\phi _y} + \sum\limits _{j = 1}^n {\ln p(x _j^i|{y^i} = 1)} \right) + 1\{ {y^i} = 0\} \left(\ln (1 - {\phi _y}) + \sum\limits _{j = 1}^n {\ln p(x _j^i|{y^i} = 0)} \right) \right]}$$

Differentiating with respect to \({\phi _y}\) proceeds exactly as before, so we state the result directly:

$${\phi _y} = \frac{1}{m}\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} }$$

Next we take the partial derivative with respect to \({\phi _{i = k|y = 1}}\), as follows:

$$\frac{{\partial \ell }}{{\partial {\phi _{i = k|y = 1}}}} = \frac{\partial }{{\partial {\phi _{i = k|y = 1}}}}\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} \sum\limits _{j = 1}^n {\ln p(x _j^i|{y^i} = 1)} }$$

where:

$$\ln p(x _j^i|{y^i} = 1) = \ln [p{(x _j^i = k|{y^i} = 1)^{1\{ x _j^i = k\} }}p{(x _j^i \ne k|{y^i} = 1)^{1\{ x _j^i \ne k\} }}]$$

$$\ln p(x _j^i|{y^i} = 1) = 1\{ x _j^i = k\} \ln {\phi _{i = k|y = 1}} + (1 - 1\{ x _j^i = k\} )\ln (1 - {\phi _{i = k|y = 1}})$$

So we have:

$$\frac{{\partial \ell ({\phi _y},{\phi _{i = k|y = 1}},{\phi _{i = k|y = 0}})}}{{\partial {\phi _{i = k|y = 1}}}} = \sum\limits _{i = 1}^m {1\{ {y^i} = 1\} \sum\limits _{j = 1}^n {\frac{{1\{ x _j^i = k\} - {\phi _{i = k|y = 1}}}}{{{\phi _{i = k|y = 1}}(1 - {\phi _{i = k|y = 1}})}}} } = 0$$

Setting the derivative to zero and applying Laplace smoothing (add 1 to the numerator and V to the denominator) yields the formula below, where V is the number of words in the vocabulary.

$${\phi _{i = k|y = 1}} = \frac{{\sum\limits _{i = 1}^m {\sum\limits _{j = 1}^n {1\{ {y^i} = 1,x _j^i = k\} } } + 1}}{{n\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} } + V}}$$

Similarly:

$${\phi _{i = k|y = 0}} = \frac{{\sum\limits _{i = 1}^m {\sum\limits _{j = 1}^n {1\{ {y^i} = 0,x _j^i = k\} } } + 1}}{{n\sum\limits _{i = 1}^m {1\{ {y^i} = 0\} } + V}}$$
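
These estimates are again just smoothed counts: occurrences of word k in spam divided by the total number of words seen in spam. A minimal vectorized sketch, assuming docs is a list of integer index arrays (in the spirit of x_arr in the program below) and y is the 0/1 label list; fit_multinomial_nb is a hypothetical helper, not part of the program:

import numpy as np

def fit_multinomial_nb(docs, y, V):
    # docs: list of integer arrays; docs[i][j] is the vocabulary index of
    #       the j-th word of email i
    # y:    0/1 labels, 1 = spam; V: vocabulary size
    y = np.asarray(y)
    phi_y = y.mean()
    counts = {0: np.zeros(V), 1: np.zeros(V)}
    totals = {0: 0, 1: 0}
    for doc, label in zip(docs, y):
        np.add.at(counts[label], doc, 1)  # per-class word occurrence counts
        totals[label] += len(doc)         # total words seen in this class
    phi_k_y1 = (counts[1] + 1) / (totals[1] + V)
    phi_k_y0 = (counts[0] + 1) / (totals[0] + V)
    return phi_y, phi_k_y1, phi_k_y0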

2.2 Implementation

We again estimate the word probabilities from the training samples and then predict whether an email is spam. Running the program below yields the correct spam classification.

import numpy as np
import re

global DEBUG
global SCALE

def constructDic():
    file = open("../res/NaiveBayes/dict.txt")
    # key is the word, value is the sequence number
    dict = {}  # In fact, an ordered map would work as well
    count = 0
    while 1:
        line = file.readline()
        if not line:
            break
        for word in line.split():
            if word.lower() not in dict:
                dict[word.lower()] = count
                count = count + 1
    file.close()
    return dict

def findIndexInDict(word, dict):
    if word in dict:
        return dict[word]
    else:
        return -1

def parseSpamOrHam(words):
    if words[0] == "spam":
        return 1
    elif words[0] == "ham":
        return 0
    else:
        if DEBUG:
            print("error email type in training sample!")
        return -1

def splitEmail(line):
    regEx = re.compile(r'[^a-zA-Z]|\d')
    return list(filter(lambda word: word != "", regEx.split(line)))

# spam is 1, ham is 0; out-of-vocabulary words map to the reserved index dictLen
def parseEmails(x_arr, y_arr, dict, dictLen):
    file = open("../res/NaiveBayes/emails.txt")
    while 1:
        line = file.readline()
        if not line:
            break
        words = splitEmail(line)
        y = parseSpamOrHam(words)
        if y == -1:
            continue
        words = words[1:]
        x = np.zeros(len(words), int)
        for i in range(len(words)):
            if words[i].lower() in dict:
                x[i] = dict[words[i].lower()]
            else:
                x[i] = dictLen  # reserved slot for out-of-vocabulary words
        x_arr.append(x)
        y_arr.append(y)
    file.close()

def calPhiY(y_arr):
    # phi_y = p(y=1): the fraction of spam in the training set
    sum = 0
    for i in y_arr:
        sum = sum + i
    return sum / len(y_arr)

def calPhiXY(y_arr, x_arr, knownY, dictLen):
    # returns the vector of Laplace-smoothed p(x_j = k | y = knownY);
    # here V = dictLen + 1 because of the reserved out-of-vocabulary slot
    sum = 0
    ret = np.zeros(dictLen + 1)
    for i in range(len(y_arr)):
        if y_arr[i] != knownY:
            continue
        sum = sum + len(x_arr[i])
        for j in range(len(x_arr[i])):
            index = x_arr[i][j]
            ret[index] = ret[index] + 1
    return SCALE * (ret + 1) / (sum + dictLen + 1)

def classify(phi, phi_y0, phi_y1, dict, dictLen):
    file = open("../res/NaiveBayes/testEmails.txt")
    while 1:
        line = file.readline()
        if not line:
            break
        words = splitEmail(line)[1:]
        x = np.zeros(len(words), int)
        for i in range(len(words)):
            if words[i].lower() in dict:
                x[i] = dict[words[i].lower()]
            else:
                x[i] = dictLen  # reserved slot for out-of-vocabulary words
        res = calcPro(phi, phi_y1, x) / (calcPro(phi, phi_y1, x) + calcPro(phi, phi_y0, x))
        if res > 0.5:
            print("spam :" + line)
        else:
            print("ham :" + line)
    file.close()

def calcPro(phi, phi_y, x):
    # (scaled) p(x|y) * p(y); the SCALE factors cancel in the posterior ratio
    ret = 1
    for i in range(len(x)):
        ret = ret * phi_y[x[i]]
    return ret * phi

if __name__ == "__main__":
    # 0. debug options
    DEBUG = True
    SCALE = 10e3

    # 1. construct the dictionary
    dict = constructDic()
    dictLen = len(dict)

    # 2. parse the emails to generate the samples x and y
    x_arr = []
    y_arr = []
    parseEmails(x_arr, y_arr, dict, dictLen)

    # 3. learn the parameters from the samples
    phi = calPhiY(y_arr)  # p(y=1)
    phi_y1 = calPhiXY(y_arr, x_arr, 1, dictLen)
    phi_y0 = calPhiXY(y_arr, x_arr, 0, dictLen)

    # 4. classify the test emails
    # p(y=1|x) = p(x|y=1)p(y=1) / (p(x|y=1)p(y=1) + p(x|y=0)p(y=0))
    # p(x|y=1) = prod_j p(x_j | y=1)
    classify(phi, phi_y0, phi_y1, dict, dictLen)

Because the individual probabilities are very small, multiplying many of them together underflows the range of floating-point numbers, so the program multiplies each per-word probability by a fixed scale factor (SCALE). The factor cancels in the posterior ratio, since the numerator and denominator contain the same number of factors.
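
An alternative to the scale factor is to work in log space: summing logarithms never underflows, no matter how long the email is. A minimal sketch under the same variable names, assuming SCALE is set to 1 so that phi_y0 and phi_y1 hold raw probabilities; calcLogPro is a hypothetical replacement for calcPro:

import numpy as np

def calcLogPro(phi, phi_y, x):
    # log-space version of calcPro: returns log(p(x|y) * p(y))
    return np.log(phi) + np.sum(np.log(phi_y[x]))

# Posterior p(y=1|x) without any SCALE factor:
# res = 1 / (1 + np.exp(calcLogPro(1 - phi, phi_y0, x) - calcLogPro(phi, phi_y1, x)))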

Appendix

Handwritten formulas and the Word version of this post are available at:

https://github.com/zhengchenyu/MyBlog/tree/master/doc/mlearning/贝叶斯算法