0%

机器学习(二)——K近邻算法

引言

在具体讨论 KNN 算法之前,我们先通过一个具体的例子引入。我们创建一个数据集,并将其可视化出来。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

raw_data_x = [[3.3935, 2.3313],
[3.1101, 1.7815],
[1.3438, 3.3684],
[3.5823, 4.6792],
[2.2804, 2.8670],
[7.4234, 4.6965],
[5.7451, 3.5340],
[9.1722, 2.5111],
[7.7928, 3.4241],
[7.9398, 0.7916]]
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
x_train = np.array(raw_data_x)
y_train = np.array(raw_data_y)
x_new = np.array([8.0936, 3.3657])
plt.scatter(x_train[y_train==0,0], x_train[y_train==0,1], color='g')
plt.scatter(x_train[y_train==1,0], x_train[y_train==1,1], color='r')
plt.scatter(x_new[0], x_new[1], color='b')
plt.show()

散点图
从上面的散点图中,蓝色的点明显更接近红色的点。我们将蓝色的点与其它点的距离求出来,并将最近的 5 个点选出来。

1
2
3
4
5
6
7
8
from math import sqrt
distance = []
for x in x_train:
d = sqrt(np.sum((x_new - x) ** 2))
distance.append(d)
nearsest = np.argsort(distance)
k = 5
topk_y = [y_train[i] for i in nearsest[:k]]

可以看到结果是 [1,1,1,1,1],如果我们将距离最近的 5 个点中,种类数量最多作为蓝色的点的种类,那么蓝色也就是归于红色一类,与我们之前看到的散点图一样。没错这就是 KNN 算法。


具体思想

KNNK-Nearest Neighbors,其思想特别简单,就是设置一个超参数 k,对于一个新样本,我们计算其与所有训练集中数据的距离(这个计算距离的方式也可以选择,欧式距离、马氏距离….),然后选出最近的 k 个,进行投票,即 k 个数据中,类别最多的那一类即为新样本的类别。
是不是很简单,下面附一个手写的 KNN 解决手写数字识别的代码。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
import numpy as np
from math import sqrt
from collections import Counter
from sklearn import datasets
from sklearn.model_selection import train_test_split


class KNNClassifier():

def __init__(self, k):
assert 1 <= k, "k must be valid"
self.k = k
self._x_train = None
self._y_train = None

def fit(self, x_train, y_train):
assert x_train.shape[0] == y_train.shape[0], \
"the size of x_train must be equal to the size of y_train"
assert self.k <= x_train.shape[0], \
"the size of x_train must be at least k"

self._x_train = x_train
self._y_train = y_train

def predict(self, x_new):
assert self._x_train is not None and self._y_train is not None, \
"must fit before predict"
assert x_new.shape[1] == self._x_train.shape[1], \
"the feature number of x_new must be equal to x_train"

y_new = [self._predict(x) for x in x_new]
return np.array(y_new)

def _predict(self, x):
assert x.shape[0] == self._x_train.shape[1], \
"the feature number of x must be equal to x_train"

distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in self._x_train]
nearest = np.argsort(distances)

topk_y = [self._y_train[i] for i in nearest[:self.k]]
votes = Counter(topk_y)

return votes.most_common(1)[0][0]

def score(self, x_test, y_test):
y_predict = self.predict(x_test)
assert y_test.shape[0] == y_predict.shape[0], \
"the size of y_true must be equal to the size of y_predict"
return sum(y_test == y_predict) / len(y_test)

def __repr__(self):
return "KNN(k=%d)" % self.k


if __name__ == '__main__':
digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
knn_classifier = KNNClassifier(k=3)
knn_classifier.fit(x_train, y_train)
print(knn_classifier.score(x_test, y_test))


超参数

我们已经知道,这个模型自己选择的部分一是 k 即选几个邻居,二是计算距离方法。但是实际操作中,可能会遇到平票的情况,而且只是通过类别的个数来判断有时也不是很准确,我们可以设定一个权重,一般为距离的倒数,这样离新样本越近的那个种类所具有的权重也越大,当然我们也可以自己根据具体情况自己设定计算权重的函数。
下面附一个,利用sklearn包的 KNN 分类器寻找最优超参数的代码。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets

digits = datasets.load_digits()
x = digits.data
y = digits.target

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

best_p = -1
best_score = 0.0
best_k = -1
for p in range(1, 6):
for k in range(1, 11):
knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
knn_clf.fit(x_train, y_train)
score = knn_clf.score(x_test, y_test)
if score > best_score:
best_p = p
best_k = k
best_score = score
print("best_p=", best_p)
print("best k=", best_k)
print("best score=", best_score)
-------------本文结束感谢您的阅读-------------