Machine Learning - 2.6 - Naive Bayes for Supervised Learning

Naive Bayes Algorithm

1 Naive Bayes Algorithm

1.1 Spam Classification Model

Taking spam prediction as the running example, this section introduces the Naive Bayes algorithm. Suppose the vocabulary contains 50000 words. The input is \(x \in \{0,1\}^{50000}\), a 50000-dimensional vector in which \(x_i = 1\) means the \(i\)-th word of the vocabulary appears in the email, and \(x_i = 0\) means it does not.

Given whether the email is spam, i.e., given \(y\), the probability of observing the sample \((x_1, x_2, \ldots, x_{50000})\) is \(p(x_1, x_2, \ldots, x_{50000}|y)\), which by the chain rule can be written as:

$$p({x _1},{x _2},...,{x _{50000}}|y) = p({x _1}|y)p({x _2}|y,{x _1})...p({x _{50000}}|y,{x _1},...,{x _{49999}})$$

Note: the identity above is easy to understand. The first factor on the right is the probability of \(x_1\) given \(y\); the second is the probability of \(x_2\) given \(y\) and \(x_1\). Multiplying the two gives the probability of \((x_1, x_2)\) given \(y\), and continuing in the same way yields the full formula. Naive Bayes then makes the (strong) assumption that the \(x_i\) are conditionally independent given \(y\), which collapses the chain rule to:

$$p({x _1},{x _2},...,{x _{50000}}|y) = p({x _1}|y)p({x _2}|y)...p({x _{50000}}|y) = \prod\limits _{i = 1}^{50000} {p({x _i}|y)}$$
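
For example, with hypothetical values \(p(x_1 = 1|y = 1) = 0.8\) and \(p(x_2 = 1|y = 1) = 0.1\), the assumption gives \(p(x_1 = 1, x_2 = 1|y = 1) = 0.8 \times 0.1 = 0.08\), regardless of how correlated the two words actually are.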

1.2 Derivation

To set up the maximum likelihood function, recall that our goal is to find the parameters \(\theta\) that, for each training input \(x\), make the probability of the correct label \(y\) as large as possible. The likelihood is the product of these probabilities over the training set, so concretely we need to make the following expression as large as possible (the factor \(p(x = x^i)\) does not depend on \(\theta\)):

$$\ln \prod\limits _{i = 1}^m {p(y = {y^i}|x = {x^i};\theta )p(x = {x^i})} $$

Many textbooks write the maximum likelihood function as below; the two forms are in fact the same. The expression below is directly the probability of observing \((x^i, y^i)\) when \(\theta\) is known, and the analysis is identical to the one above.

$$\ln \prod\limits _{i = 1}^m {p(y = {y^i},x = {x^i};\theta )} $$

To keep the notation light, \(\theta\) is omitted from here on.

We define the parameters as follows (\(j\) indexes words in the vocabulary):

$${\phi _y} = p(y = 1)$$

$${\phi _{j|y = 1}} = p({x _j} = 1|y = 1)$$

$${\phi _{j|y = 0}} = p({x _j} = 1|y = 0)$$

The log-likelihood is then:

$$\ell ({\phi _y},{\phi _{j|y = 1}},{\phi _{j|y = 0}}) = \ln \prod\limits _{i = 1}^m {p({x^i},{y^i})} = \ln \prod\limits _{i = 1}^m {p({x^i}|{y^i})p({y^i})}$$

$$\ell = \ln \prod\limits _{i = 1}^m {{{(p({x^i}|{y^i} = 1){\phi _y})}^{1\{ {y^i} = 1\} }}{{(p({x^i}|{y^i} = 0)(1 - {\phi _y}))}^{1\{ {y^i} = 0\} }}}$$

Substituting \(p(x^i|y^i) = \prod _{j = 1}^n p(x_j^i|y^i)\), with each factor a Bernoulli distribution in the corresponding \(\phi\), and taking the logarithm:

$$\ell = \sum\limits _{i = 1}^m {\left[ 1\{ {y^i} = 1\} \left(\ln {\phi _y} + \sum\limits _{j = 1}^n {(1\{ x _j^i = 1\} \ln {\phi _{j|y = 1}} + 1\{ x _j^i = 0\} \ln (1 - {\phi _{j|y = 1}}))} \right) + 1\{ {y^i} = 0\} \left(\ln (1 - {\phi _y}) + \sum\limits _{j = 1}^n {(1\{ x _j^i = 1\} \ln {\phi _{j|y = 0}} + 1\{ x _j^i = 0\} \ln (1 - {\phi _{j|y = 0}}))} \right) \right]}$$

Taking the partial derivative with respect to \(\phi_y\) and setting it to zero:

$$\frac{{\partial \ell }}{{\partial {\phi _y}}} = \sum\limits _{i = 1}^m {\left(1\{ {y^i} = 1\} \frac{1}{{{\phi _y}}} - 1\{ {y^i} = 0\} \frac{1}{{1 - {\phi _y}}}\right)} = \sum\limits _{i = 1}^m {\frac{{1\{ {y^i} = 1\} - {\phi _y}}}{{{\phi _y}(1 - {\phi _y})}}} = 0$$

So we have:

$${\phi _y} = \frac{1}{m}\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} }$$
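
For example, with a hypothetical training set of \(m = 10\) emails of which 3 are spam, this gives \(\phi_y = 3/10\), the fraction of spam in the sample.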

Next, taking the partial derivative with respect to \(\phi_{j|y=1}\):

$$\frac{{\partial \ell }}{{\partial {\phi _{j|y = 1}}}} = \sum\limits _{i = 1}^m {1\{ {y^i} = 1\} \left(1\{ x _j^i = 1\} \frac{1}{{{\phi _{j|y = 1}}}} - 1\{ x _j^i = 0\} \frac{1}{{1 - {\phi _{j|y = 1}}}}\right)} = \sum\limits _{i = 1}^m {1\{ {y^i} = 1\} \frac{{1\{ x _j^i = 1\} - {\phi _{j|y = 1}}}}{{{\phi _{j|y = 1}}(1 - {\phi _{j|y = 1}})}}} = 0$$

If a word never occurs in the training samples, both class-conditional estimates are zero and the posterior at prediction time degenerates to 0/0. To prevent this, we introduce Laplace smoothing: add 1 to the numerator and add the number of values \(x_j\) can take (here 2) to the denominator. This gives:

$${\phi _{j|y = 1}} = \frac{{\sum\limits _{i = 1}^m {1\{ x _j^i = 1,{y^i} = 1\} } + 1}}{{\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} } + 2}}$$

Similarly:

$${\phi _{j|y = 0}} = \frac{{\sum\limits _{i = 1}^m {1\{ x _j^i = 1,{y^i} = 0\} } + 1}}{{\sum\limits _{i = 1}^m {1\{ {y^i} = 0\} } + 2}}$$

Note: examine what these formulas say. \(\phi_y\) is simply the number of spam emails divided by the total number of samples, and \(\phi_{j|y=1}\) is the number of spam emails containing the \(j\)-th word divided by the total number of spam emails (adjusted by the smoothing terms). These are very intuitive estimates, and that is part of the charm of the mathematics: even such simple formulas rest on a solid theoretical basis.
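
As a sanity check, the closed-form estimates above can be computed in a few vectorized lines. The following is a minimal sketch, assuming a binary matrix X of shape (m, n) and a 0/1 label vector y (both hypothetical names, not taken from the program in the next section, which computes the same quantities with explicit loops):

import numpy as np

def fit_bernoulli_nb(X, y):
    # X: (m, n) binary matrix, X[i, j] = 1 if word j appears in email i
    # y: (m,) 0/1 labels, 1 = spam
    phi_y = y.mean()                                        # p(y = 1)
    # Laplace-smoothed per-word estimates: (count + 1) / (class size + 2)
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / (np.sum(y == 1) + 2)
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / (np.sum(y == 0) + 2)
    return phi_y, phi_j_y1, phi_j_y0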

1.3 Implementation

We estimate each word's probabilities from the training samples and then predict whether an email is spam. Running the program below yields the correct spam classification.

import numpy as np
import re

global DEBUG

def constructDic():
    file = open("../res/NaiveBayes/dict.txt")
    # key is the word, value is the sequence number
    dict = {}  # In fact, an ordered map would work as well
    count = 0
    while 1:
        line = file.readline()
        if not line:
            break
        for word in line.split():
            if word.lower() not in dict:
                dict[word.lower()] = count
                count = count + 1
    file.close()
    return dict

def findIndexInDict(word, dict):
    if word in dict:
        return dict[word]
    else:
        return -1

def parseSpamOrHam(words):
    if words[0] == "spam":
        return 1
    elif words[0] == "ham":
        return 0
    else:
        if DEBUG:
            print("error email type in training sample!")
        return -1

def splitEmail(line):
    regEx = re.compile(r'[^a-zA-Z]|\d')
    return list(filter(lambda word: word != "", regEx.split(line)))

# spam is 1, ham is 0
def parseEmails(x_arr, y_arr, dict, dictLen):
    file = open("../res/NaiveBayes/emails.txt")
    while 1:
        line = file.readline()
        if not line:
            break
        words = splitEmail(line)
        y = parseSpamOrHam(words)
        if y == -1:
            continue
        x = np.zeros(dictLen)
        for word in words[1:]:
            if word.lower() in dict:
                index = dict[word.lower()]
                x[index] = 1
            else:
                if DEBUG:
                    print(word, " is not in dict!")
        x_arr.append(x)
        y_arr.append(y)
    file.close()

def calPhiY(y_arr):
    # phi_y = p(y=1): the fraction of spam in the training set
    sum = 0
    for i in y_arr:
        sum = sum + i
    return sum / len(y_arr)

def calPhiXY(y_arr, x_arr, knownY, dictLen):
    # returns the vector of Laplace-smoothed p(x_j = 1 | y = knownY)
    sum = 0
    ret = np.zeros(dictLen)
    for i in range(len(y_arr)):
        if y_arr[i] != knownY:
            continue
        sum = sum + 1
        for j in range(dictLen):
            if x_arr[i][j] == 1:
                ret[j] = ret[j] + 1
    return (ret + 1) / (sum + 2)

def classify(phi, phi_y0, phi_y1, dict, dictLen):
    file = open("../res/NaiveBayes/testEmails.txt")
    while 1:
        line = file.readline()
        if not line:
            break
        words = splitEmail(line)[1:]
        x = np.zeros(dictLen)
        for word in words:
            if word.lower() in dict:
                index = dict[word.lower()]
                x[index] = 1
        # posterior p(y=1|x) via Bayes' rule
        res = calcPro(phi, phi_y1, x) / (calcPro(phi, phi_y1, x) + calcPro(phi, phi_y0, x))
        if res > 0.5:
            print("spam :" + line)
        else:
            print("ham :" + line)
    file.close()

def calcPro(phi, phi_y, x):
    # p(x|y) * p(y), with p(x|y) the product of per-word Bernoulli terms
    ret = 1
    for i in range(len(x)):
        if x[i] == 1:
            ret = ret * phi_y[i]
        else:
            ret = ret * (1 - phi_y[i])
    return ret * phi

if __name__ == "__main__":
    # 0. debug options
    DEBUG = False

    # 1. construct the dictionary
    dict = constructDic()
    dictLen = len(dict)

    # 2. parse the emails to generate the samples x and y
    x_arr = []
    y_arr = []
    parseEmails(x_arr, y_arr, dict, dictLen)

    # 3. learn the parameters from the samples
    phi = calPhiY(y_arr)  # p(y=1)
    phi_y1 = calPhiXY(y_arr, x_arr, 1, dictLen)
    phi_y0 = calPhiXY(y_arr, x_arr, 0, dictLen)

    # 4. classify the test emails
    # p(y=1|x) = p(x|y=1)p(y=1) / (p(x|y=1)p(y=1) + p(x|y=0)p(y=0))
    # p(x|y=1) = prod_i p(x_i | y=1)
    classify(phi, phi_y0, phi_y1, dict, dictLen)
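
One caveat: with a realistic 50000-word vocabulary, the product computed in calcPro would underflow to 0.0 long before the loop finishes. A common remedy, not used in the program above, is to compare log-probabilities instead. Below is a minimal sketch reusing the phi, phi_y1, phi_y0 and binary vector x from step 3; logPosteriorRatio is a hypothetical helper, not part of the program:

import numpy as np

def logPosteriorRatio(phi, phi_y1, phi_y0, x):
    # log p(x, y=1) - log p(x, y=0); a positive value means "spam"
    logP1 = np.log(phi) + np.sum(x * np.log(phi_y1) + (1 - x) * np.log(1 - phi_y1))
    logP0 = np.log(1 - phi) + np.sum(x * np.log(phi_y0) + (1 - x) * np.log(1 - phi_y0))
    return logP1 - logP0

Thanks to Laplace smoothing, no entry of phi_y1 or phi_y0 is exactly 0 or 1, so every logarithm above is finite.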

2 Multinomial Event Model

2.1 Derivation

We now model the problem in a different way. The feature vector is \(x^T = [x_1, x_2, \ldots, x_n]\), where \(n\) is the number of words in the email and \(x_i\) is the vocabulary index of the email's \(i\)-th word; for example, \(x_1 = 1\) means the first word of the email is the word with index 1 in the vocabulary. The conditional independence assumption still gives:

$$p({x _1},{x _2},...,{x _n}|y) = \prod\limits _{i = 1}^n {p({x _i}|y)}$$

We set \({\phi _y} = p(y = 1)\) and \({\phi _{i = k|y = 1}} = p({x _i} = k|y = 1)\), and similarly \({\phi _{i = k|y = 0}} = p({x _i} = k|y = 0)\); the same distribution is shared by every position \(i\).

The log-likelihood is then:

$$\ell ({\phi _y},{\phi _{i = k|y = 1}},{\phi _{i = k|y = 0}}) = \ln \prod\limits _{i = 1}^m {p({x^i},{y^i})} = \ln \prod\limits _{i = 1}^m {p({x^i}|{y^i})p({y^i})}$$

$$\ell = \ln \prod\limits _{i = 1}^m {{{(p({x^i}|{y^i} = 1){\phi _y})}^{1\{ {y^i} = 1\} }}{{(p({x^i}|{y^i} = 0)(1 - {\phi _y}))}^{1\{ {y^i} = 0\} }}}$$

$$\ell = \sum\limits _{i = 1}^m {\left[ 1\{ {y^i} = 1\} \left(\ln {\phi _y} + \sum\limits _{j = 1}^n {\ln p(x _j^i|{y^i} = 1)} \right) + 1\{ {y^i} = 0\} \left(\ln (1 - {\phi _y}) + \sum\limits _{j = 1}^n {\ln p(x _j^i|{y^i} = 0)} \right) \right]}$$

Differentiating with respect to \({\phi _y}\) proceeds exactly as before, so we state the result directly:

$${\phi _y} = \frac{1}{m}\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} }$$

Next we take the partial derivative with respect to \({\phi _{i = k|y = 1}}\), as follows:

$$\frac{{\partial \ell }}{{\partial {\phi _{i = k|y = 1}}}} = \frac{\partial }{{\partial {\phi _{i = k|y = 1}}}}\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} \sum\limits _{j = 1}^n {\ln p(x _j^i|{y^i} = 1)} }$$

where:

$$\ln p(x _j^i|{y^i} = 1) = \ln [p{(x _j^i = k|{y^i} = 1)^{1\{ x _j^i = k\} }}p{(x _j^i \ne k|{y^i} = 1)^{1\{ x _j^i \ne k\} }}]$$

$$\ln p(x _j^i|{y^i} = 1) = 1\{ x _j^i = k\} \ln {\phi _{i = k|y = 1}} + (1 - 1\{ x _j^i = k\} )\ln (1 - {\phi _{i = k|y = 1}})$$

So we have:

$$\frac{{\partial \ell ({\phi _y},{\phi _{i = k|y = 1}},{\phi _{i = k|y = 0}})}}{{\partial {\phi _{i = k|y = 1}}}} = \sum\limits _{i = 1}^m {1\{ {y^i} = 1\} \sum\limits _{j = 1}^n {\frac{{1\{ x _j^i = k\} - {\phi _{i = k|y = 1}}}}{{{\phi _{i = k|y = 1}}(1 - {\phi _{i = k|y = 1}})}}} } = 0$$

Setting the derivative to zero and applying Laplace smoothing (add 1 to the numerator and V to the denominator) yields the formula below, where V is the number of words in the vocabulary.

$${\phi _{i = k|y = 1}} = \frac{{\sum\limits _{i = 1}^m {\sum\limits _{j = 1}^n {1\{ {y^i} = 1,x _j^i = k\} } } + 1}}{{n\sum\limits _{i = 1}^m {1\{ {y^i} = 1\} } + V}}$$

Similarly:

$${\phi _{i = k|y = 0}} = \frac{{\sum\limits _{i = 1}^m {\sum\limits _{j = 1}^n {1\{ {y^i} = 0,x _j^i = k\} } } + 1}}{{n\sum\limits _{i = 1}^m {1\{ {y^i} = 0\} } + V}}$$
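
These estimates are again just smoothed counts: occurrences of word k in spam divided by the total number of words seen in spam. A minimal vectorized sketch, assuming docs is a list of integer index arrays (in the spirit of x_arr in the program below) and y is the 0/1 label list; fit_multinomial_nb is a hypothetical helper, not part of the program:

import numpy as np

def fit_multinomial_nb(docs, y, V):
    # docs: list of integer arrays; docs[i][j] is the vocabulary index of
    #       the j-th word of email i
    # y:    0/1 labels, 1 = spam; V: vocabulary size
    y = np.asarray(y)
    phi_y = y.mean()
    counts = {0: np.zeros(V), 1: np.zeros(V)}
    totals = {0: 0, 1: 0}
    for doc, label in zip(docs, y):
        np.add.at(counts[label], doc, 1)  # per-class word occurrence counts
        totals[label] += len(doc)         # total words seen in this class
    phi_k_y1 = (counts[1] + 1) / (totals[1] + V)
    phi_k_y0 = (counts[0] + 1) / (totals[0] + V)
    return phi_y, phi_k_y1, phi_k_y0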

2.2 Implementation

We again estimate the word probabilities from the training samples and then predict whether an email is spam. Running the program below yields the correct spam classification.

import numpy as np
import re

global DEBUG
global SCALE

def constructDic():
    file = open("../res/NaiveBayes/dict.txt")
    # key is the word, value is the sequence number
    dict = {}  # In fact, an ordered map would work as well
    count = 0
    while 1:
        line = file.readline()
        if not line:
            break
        for word in line.split():
            if word.lower() not in dict:
                dict[word.lower()] = count
                count = count + 1
    file.close()
    return dict

def findIndexInDict(word, dict):
    if word in dict:
        return dict[word]
    else:
        return -1

def parseSpamOrHam(words):
    if words[0] == "spam":
        return 1
    elif words[0] == "ham":
        return 0
    else:
        if DEBUG:
            print("error email type in training sample!")
        return -1

def splitEmail(line):
    regEx = re.compile(r'[^a-zA-Z]|\d')
    return list(filter(lambda word: word != "", regEx.split(line)))

# spam is 1, ham is 0; out-of-vocabulary words map to the reserved index dictLen
def parseEmails(x_arr, y_arr, dict, dictLen):
    file = open("../res/NaiveBayes/emails.txt")
    while 1:
        line = file.readline()
        if not line:
            break
        words = splitEmail(line)
        y = parseSpamOrHam(words)
        if y == -1:
            continue
        words = words[1:]
        x = np.zeros(len(words), int)
        for i in range(len(words)):
            if words[i].lower() in dict:
                x[i] = dict[words[i].lower()]
            else:
                x[i] = dictLen  # reserved slot for out-of-vocabulary words
        x_arr.append(x)
        y_arr.append(y)
    file.close()

def calPhiY(y_arr):
    # phi_y = p(y=1): the fraction of spam in the training set
    sum = 0
    for i in y_arr:
        sum = sum + i
    return sum / len(y_arr)

def calPhiXY(y_arr, x_arr, knownY, dictLen):
    # returns the vector of Laplace-smoothed p(x_j = k | y = knownY);
    # here V = dictLen + 1 because of the reserved out-of-vocabulary slot
    sum = 0
    ret = np.zeros(dictLen + 1)
    for i in range(len(y_arr)):
        if y_arr[i] != knownY:
            continue
        sum = sum + len(x_arr[i])
        for j in range(len(x_arr[i])):
            index = x_arr[i][j]
            ret[index] = ret[index] + 1
    return SCALE * (ret + 1) / (sum + dictLen + 1)

def classify(phi, phi_y0, phi_y1, dict, dictLen):
    file = open("../res/NaiveBayes/testEmails.txt")
    while 1:
        line = file.readline()
        if not line:
            break
        words = splitEmail(line)[1:]
        x = np.zeros(len(words), int)
        for i in range(len(words)):
            if words[i].lower() in dict:
                x[i] = dict[words[i].lower()]
            else:
                x[i] = dictLen  # reserved slot for out-of-vocabulary words
        res = calcPro(phi, phi_y1, x) / (calcPro(phi, phi_y1, x) + calcPro(phi, phi_y0, x))
        if res > 0.5:
            print("spam :" + line)
        else:
            print("ham :" + line)
    file.close()

def calcPro(phi, phi_y, x):
    # (scaled) p(x|y) * p(y); the SCALE factors cancel in the posterior ratio
    ret = 1
    for i in range(len(x)):
        ret = ret * phi_y[x[i]]
    return ret * phi

if __name__ == "__main__":
    # 0. debug options
    DEBUG = True
    SCALE = 10e3

    # 1. construct the dictionary
    dict = constructDic()
    dictLen = len(dict)

    # 2. parse the emails to generate the samples x and y
    x_arr = []
    y_arr = []
    parseEmails(x_arr, y_arr, dict, dictLen)

    # 3. learn the parameters from the samples
    phi = calPhiY(y_arr)  # p(y=1)
    phi_y1 = calPhiXY(y_arr, x_arr, 1, dictLen)
    phi_y0 = calPhiXY(y_arr, x_arr, 0, dictLen)

    # 4. classify the test emails
    # p(y=1|x) = p(x|y=1)p(y=1) / (p(x|y=1)p(y=1) + p(x|y=0)p(y=0))
    # p(x|y=1) = prod_j p(x_j | y=1)
    classify(phi, phi_y0, phi_y1, dict, dictLen)

Because the individual probabilities are very small, multiplying many of them together underflows the range of floating-point numbers, so the program multiplies each per-word probability by a fixed scale factor (SCALE). The factor cancels in the posterior ratio, since the numerator and denominator contain the same number of factors.
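
An alternative to the scale factor is to work in log space: summing logarithms never underflows, no matter how long the email is. A minimal sketch under the same variable names, assuming SCALE is set to 1 so that phi_y0 and phi_y1 hold raw probabilities; calcLogPro is a hypothetical replacement for calcPro:

import numpy as np

def calcLogPro(phi, phi_y, x):
    # log-space version of calcPro: returns log(p(x|y) * p(y))
    return np.log(phi) + np.sum(np.log(phi_y[x]))

# Posterior p(y=1|x) without any SCALE factor:
# res = 1 / (1 + np.exp(calcLogPro(1 - phi, phi_y0, x) - calcLogPro(phi, phi_y1, x)))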

Appendix

Handwritten formulas and the Word version of this post are available at:

https://github.com/zhengchenyu/MyBlog/tree/master/doc/mlearning/贝叶斯算法