藤原栗子工作室: March 2018

Tuesday, March 27, 2018

Machine Learning_ML_KNN_Nearest Neighbors

tags: `ML` `KNN`

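Official documentation
KNN is a lazy learner: it does not learn a discriminant function from the training set, it simply memorizes the training set.
The training phase therefore costs almost nothing; the work is deferred to prediction time.
Three steps:

Choose a value of k and a distance metric
Find the k training samples nearest to the point to be classified
Assign the class label by majority vote
In short: using the chosen distance metric, a new point is classified by a majority vote among its k nearest neighbors.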
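The three steps can be sketched by hand in a few lines of NumPy. This is only a minimal illustration, not how sklearn implements it: the function name knn_predict and the toy data are made up for this example, and Euclidean distance is assumed as the metric.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    # Step 1: the distance metric (Euclidean here) from x_new to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Step 2: indices of the k nearest training samples
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote over the labels of those samples
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy data (hypothetical): two well-separated groups
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(X_train, y_train, [0.5, 0.5], k=3)
pred_b = knn_predict(X_train, y_train, [5.5, 5.5], k=3)
print(pred_a, pred_b)  # 0 1
```

Note that nothing happens at "training" time here; all the work is done when a prediction is requested, which is exactly what makes KNN a lazy learner.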

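KNN tolerates outliers reasonably well, since the prediction is a majority vote over the k closest points rather than a single one, but the computational cost at prediction time is not small. By default every neighbor carries the same weight; if needed, the weights parameter lets different neighbors count differently.
There is also 『RadiusNeighborsClassifier』, which votes over all neighbors within a fixed radius instead of a fixed number of neighbors; this works better on datasets whose samples are unevenly distributed. It is covered in a separate post.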

Warning
Regarding the Nearest Neighbors algorithms, if it is found that two neighbors, neighbor k+1 and k, have identical distances but different labels, the results will depend on the ordering of the training data.
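As the official documentation notes, if neighbors k and k+1 are at the same distance but carry different labels, the result depends on the order of the training data.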

IMPORT

from sklearn.neighbors import KNeighborsClassifier

CLASS

sklearn.neighbors.KNeighborsClassifier(n_neighbors=5, 
                weights='uniform', algorithm='auto',
                leaf_size=30, p=2, metric='minkowski', 
                metric_params=None, n_jobs=1, **kwargs)

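Parameters

n_neighbors

int, optional (default = 5)
The value of k: how many neighboring points decide the prediction.

weights

str or callable, optional (default = ‘uniform’)
uniform: the default; every neighbor has the same weight
distance: neighbors are weighted by the inverse of their distance, so closer neighbors count more
A user-defined callable is also accepted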
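A small sketch of how the two built-in weights options can disagree. The one-dimensional data below is made up for illustration: the query point has one very close neighbor of class 0 and two farther neighbors of class 1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# hypothetical 1-D training data
X_train = np.array([[0.0], [1.0], [1.1], [5.0]])
y_train = np.array([0, 1, 1, 1])
query = [[0.1]]

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_train, y_train)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_train, y_train)

# uniform: the 3 nearest neighbors vote 1 (class 0) vs 2 (class 1)
pred_uniform = uniform.predict(query)[0]
# distance: votes are 1/0.1 for class 0 vs 1/0.9 + 1/1.0 for class 1
pred_distance = distance.predict(query)[0]
print(pred_uniform, pred_distance)  # 1 0
```

With uniform weights the two class-1 neighbors win by count; with distance weights the single very close class-0 neighbor outweighs them.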

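algorithm

{‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional
The algorithm used to compute the nearest neighbors
auto: chosen automatically from the data passed to fit (usually the right choice)
ball_tree: uses a BallTree
kd_tree: uses a KDTree
brute: brute-force search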
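The algorithm setting changes how the neighbors are searched for, not which neighbors are found. A quick sanity check along those lines, using randomly generated blobs and a few arbitrary query points (both are assumptions for this sketch):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=200, centers=3, random_state=0)
# hypothetical query points
X_query = np.array([[0.0, 0.0], [1.0, 4.0], [-2.0, 3.0]])

results = {}
for algo in ('brute', 'kd_tree', 'ball_tree'):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo).fit(X, y)
    # distances to, and indices of, the 5 nearest training points
    results[algo] = knn.kneighbors(X_query)

# every search structure should report the same neighbor distances
same_as_brute = all(
    np.allclose(results['brute'][0], results[algo][0])
    for algo in ('kd_tree', 'ball_tree')
)
print(same_as_brute)
```

The choice mostly matters for speed on larger datasets, which is why leaving it on auto is usually fine.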

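leaf_size

int, optional (default = 30)
Leaf size passed to the BallTree or KDTree. It does not affect the results, but it does affect query speed and the memory needed to store the tree.
The memory needed to store the tree scales roughly as n_samples / leaf_size.
The default is usually fine.

p

integer, optional (default = 2)
Power parameter for the Minkowski metric
1: Manhattan distance (movement along the axes only)
2: Euclidean distance

metric

string or callable, default ‘minkowski’
『the distance metric to use for the tree』
The default is usually fine; for other needs, see the official DistanceMetric documentation.

minkowski = sum(|x - y|^p)^(1/p)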
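To make the formula concrete, here is a small hand-rolled version. The helper name minkowski and the two points are chosen only for this illustration.

```python
import numpy as np

def minkowski(x, y, p=2):
    """sum(|x - y|^p)^(1/p), as in the formula above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

a, b = [0, 0], [3, 4]
d1 = minkowski(a, b, p=1)  # Manhattan: |3| + |4|
d2 = minkowski(a, b, p=2)  # Euclidean: sqrt(3^2 + 4^2)
print(d1, d2)  # 7.0 5.0
```

With p=1 you walk along the axes (3 + 4 = 7); with p=2 you take the straight line (5).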

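metric_params

dict, optional (default = None)
Additional keyword arguments for the chosen distance metric, if it needs any.
In most cases the default is sufficient.

n_jobs

int, optional (default = 1)
The number of parallel jobs to run for the neighbor search
-1 uses all available CPU cores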

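Methods

fit(X, y)

Fits (trains) the model

get_params([deep])

Returns the model's parameters

kneighbors([X, n_neighbors, return_distance])

Finds the neighbors of each point in X and their distances. If X is omitted, the neighbors of the training points themselves are returned, and a point is then not counted as its own neighbor.
If the training data is passed in explicitly as X, each point's nearest neighbor is itself, so the first distance is always 0.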
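The difference between passing the training data as X and omitting X is easy to miss, so here is a minimal sketch on a tiny one-dimensional dataset (the unsupervised NearestNeighbors estimator is used because no labels are needed):

```python
from sklearn.neighbors import NearestNeighbors

X = [[0], [1], [3], [15], [7]]
nn = NearestNeighbors(n_neighbors=2).fit(X)

# training data passed explicitly: every point finds itself first
dist_with_X, idx_with_X = nn.kneighbors(X)
# X omitted: neighbors of the training points, excluding the point itself
dist_no_X, idx_no_X = nn.kneighbors()

print(idx_with_X[0], dist_with_X[0])
print(idx_no_X[0], dist_no_X[0])
```

For the first point (value 0), the explicit call returns the point itself plus the point with value 1, while the call without X returns the points with values 1 and 3.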

kneighbors_graph([X, n_neighbors, mode])

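Returns the k-neighbors graph as a sparse matrix. With the default mode='connectivity', 1 marks a neighbor and 0 a non-neighbor.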
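Besides the default connectivity mode, the mode argument also accepts 'distance', in which case the entries hold the distance to each neighbor instead of 1. A small sketch on a made-up five-point dataset:

```python
from sklearn.neighbors import NearestNeighbors

X = [[0], [1], [3], [15], [7]]
nn = NearestNeighbors(n_neighbors=2).fit(X)

# 1 where a point is among the 2 nearest neighbors, 0 elsewhere
connectivity = nn.kneighbors_graph(X, mode='connectivity').toarray()
# same sparsity pattern, but storing the distance to each neighbor
distance = nn.kneighbors_graph(X, mode='distance').toarray()

print(connectivity[4])
print(distance[4])
```

For the last point (value 7), the connectivity row marks the points with values 3 and 7, and the distance row holds 4 in the column of the point with value 3. The point's distance to itself is 0, so it is indistinguishable from a non-neighbor once the matrix is made dense.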

predict(X)

Returns the predicted class labels

predict_proba(X)

Returns the probability estimate for each class

score(X, y[, sample_weight])

Returns the mean accuracy
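Mean accuracy is simply the fraction of samples whose predicted label matches the true label. A minimal check of that, assuming randomly generated blobs and an arbitrary train/test split:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=120, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

acc_score = knn.score(X_test, y_test)
# the same number computed by hand from predict
acc_manual = np.mean(knn.predict(X_test) == y_test)
print(acc_score, acc_manual)
```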

set_params(**params)

Sets the model's parameters
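get_params and set_params work as a pair, which is handy when you want to change a setting without rebuilding the estimator. A short sketch:

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
# default value of k
before = knn.get_params()['n_neighbors']

# change two parameters in place
knn.set_params(n_neighbors=3, weights='distance')
after = knn.get_params()
print(before, after['n_neighbors'], after['weights'])  # 5 3 distance
```

Remember to call fit again after changing parameters on a model that was already trained.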

Example

A simple example shows what some of these methods return

X = [[0], [1], [3], [15], [7]]
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=2)
nn.fit(X)
a = nn.kneighbors_graph(X)
a.toarray()

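Line 1: build a dataset
Line 2: import the package; we borrow the unsupervised NearestNeighbors estimator
Line 3: instantiate it
Line 4: fit (train)
Line 5: return the neighbor graph
Line 6: convert the sparse graph to a dense array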

array([[ 1.,  1.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.,  0.],
       [ 0.,  1.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  1.],
       [ 0.,  0.,  1.,  0.,  1.]])

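1 marks a neighbor, 0 a non-neighbor. Because the training data itself is passed as the query, each point is its own neighbor, and with n_neighbors set to 2 every row has exactly two 1s. In the last row, 7 is closer to 3 (distance 4) than to 15 (distance 8), so the 1s fall in the columns for 3 and 7, with a 0 for 15 in between.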

nn.kneighbors(X)

Line 1: returns two arrays, the distances to the neighbors and the neighbors' indices

(array([[ 0.,  1.],
        [ 0.,  1.],
        [ 0.,  2.],
        [ 0.,  8.],
        [ 0.,  4.]]), array([[0, 1],
        [1, 0],
        [2, 1],
        [3, 4],
        [4, 2]], dtype=int64))

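The first array holds the distances to the neighbors. Each point's nearest neighbor is itself, so the first value is 0; the second value is the distance to the nearest other point.
The second array holds the indices of the point itself and of its neighbor; indices start at 0.

The simple example above shows what the methods return. It borrows an unsupervised estimator, so there is no y, but the methods behave the same way on KNeighborsClassifier.
Next, we can put together a full KNN example.

Generating a dataset by hand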

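#  generate the dataset
#  (sklearn.datasets.samples_generator was deprecated and later removed; import from sklearn.datasets instead)
from sklearn.datasets import make_blobs
centers = [[-1, -1], [0.5, 0.5], [2, 2]]
X,y = make_blobs(n_samples=60, centers=centers, random_state=0, cluster_std=0.5)

n_samples: total number of samples
centers: the cluster centers; the samples are distributed around them
cluster_std: standard deviation of each cluster, which controls how far the samples spread from their center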
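Before plotting, a quick numeric look at what make_blobs returned can be useful. This sketch repeats the call above so that it runs on its own; the checks themselves (shape, per-class counts, per-class means) are just one reasonable way to inspect the data.

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = [[-1, -1], [0.5, 0.5], [2, 2]]
X, y = make_blobs(n_samples=60, centers=centers, random_state=0, cluster_std=0.5)

# 60 samples with 2 features each, and one label per sample
print(X.shape, y.shape)
# the samples are split evenly across the three centers
counts = np.bincount(y)
print(counts)
# the mean of each class should land near its center
class_means = np.array([X[y == c].mean(axis=0) for c in range(len(centers))])
print(class_means.round(2))
```

A larger cluster_std would push the samples, and with only 20 points per class also the class means, further from the centers.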

Once the dataset is set up, remember to check it visually first

#  inspect the distribution of the dataset
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
cen = np.array(centers)
#  set the figure size
plt.figure(figsize=(16,9), dpi=100)
#  scatter plot of the dataset
plt.scatter(X[:, 0], X[:, 1], c=y)
#  plot the cluster centers
plt.scatter(cen[:, 0], cen[:, 1], s=100, marker='^', c='orange')
plt.show()

Training the model

#  without special requirements the default parameters already work well, so we only set k
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X,y)

Prediction

#  add a sample of our own
X_test = [[1.5, 1]]
y_test = knn.predict(X_test)
#  return only the neighbor indices, not the distances
neighbors = knn.kneighbors(X_test, return_distance=False)

Inspect y_test

array([1])

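The sample point is assigned to class 1

Inspect the sample's per-class probabilities with knn.predict_proba(X_test)

array([[ 0. ,  0.6,  0.4]])

There are three centers, hence three classes, and y=1 has the highest probability: 3 of the 5 neighbors belong to class 1 and 2 belong to class 2.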
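With the default uniform weights, these probabilities are nothing more than the share of the k neighbors that carry each label. The sketch below rebuilds the model so it runs on its own and recomputes the probabilities by hand from the neighbor indices; the variable names neighbor_labels and manual_proba are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

centers = [[-1, -1], [0.5, 0.5], [2, 2]]
X, y = make_blobs(n_samples=60, centers=centers, random_state=0, cluster_std=0.5)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

X_test = [[1.5, 1]]
neighbors = knn.kneighbors(X_test, return_distance=False)

# labels of the 5 nearest training points
neighbor_labels = y[neighbors[0]]
# share of neighbors per class (3 classes, k = 5)
manual_proba = np.bincount(neighbor_labels, minlength=3) / 5
proba = knn.predict_proba(X_test)[0]
print(neighbor_labels, manual_proba, proba)
```

This also explains why the probabilities can only take values in steps of 1/k when the weights are uniform.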

Visualization

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
cen = np.array(centers)
plt.figure(figsize=(16,9), dpi=100)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter(cen[:, 0], cen[:, 1], s=100, marker='^', c='orange')
#  add the predicted point x
plt.scatter(X_test[0][0], X_test[0][1], marker='x', c=y_test, s=200)
#  draw lines connecting the predicted point to its neighbors
for i in neighbors[0]:
    plt.plot([X[i][0], X_test[0][0]], [X[i][1], X_test[0][1]], 'k--', linewidth=0.5)
plt.show()

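Thursday, March 1, 2018

Machine Learning_ML_OneHotEncoder_One-Hot Encoding

Description

Official documentation
In another post we covered label encoding. It is perfectly safe for the label (the target class), but what about features?
Unfortunately, even plain codes such as 0, 1, 2, 3... affect the model's computations, because the model treats them as ordered magnitudes. In that case we have to switch to one-hot encoding; let's see how sklearn helps us do it.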
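One way to see the problem is to look at distances, which is what a model such as KNN works with. In the sketch below, four hypothetical categories are encoded once as the integers 0 to 3 and once as one-hot vectors; the helper pairwise is written just for this illustration.

```python
import numpy as np

# four categories encoded as integers 0..3 (one column)
codes = np.array([[0], [1], [2], [3]], dtype=float)
# the same four categories as one-hot vectors
one_hot = np.eye(4)

def pairwise(A):
    """Euclidean distance between every pair of rows in A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

d_codes = pairwise(codes)
d_hot = pairwise(one_hot)
print(d_codes[0])  # distances from category 0: 0, 1, 2, 3
print(d_hot[0])    # distances from category 0: 0, then sqrt(2) three times
```

With integer codes, category 0 looks three times further from category 3 than from category 1, an ordering that does not exist in the data. With one-hot encoding every pair of distinct categories is equally far apart.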

IMPORT

from sklearn.preprocessing import OneHotEncoder

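Example

The dataset is as follows (胖 = overweight, 瘦 = thin, 男 = male, 女 = female; blank cells are missing values):

Class	Height	Weight	Sex	Age
胖	175	70	男	35
瘦	160	50	女	31
瘦	175		男	27
胖	180	80	女
胖	180	100	男	18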

#  load the required libraries
import numpy as np
import pandas as pd
#  one-hot encoding and label encoding
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

#  enter the dataset by hand, pretending the missing values have already been imputed
ds = [
    ('胖',175,70,'男',35),
    ('瘦',160,50,'女',31),
    ('瘦',175,75,'男',27),
    ('胖',180,80,'女',27.5),
    ('胖',180,100,'男',18)]

columns = ['label','high','weight','sex','years']

#  load the data into pandas, just to make it feel more realistic
df = pd.DataFrame.from_records(ds,columns=columns)

#  confirm the data was loaded into pandas
df

#  split the data
X = df.iloc[:,1:].values
y = df.iloc[:,0].values
#  check y
y
>>>array(['胖', '瘦', '瘦', '胖', '胖'], dtype=object)

#  label encoding
y_labelencoder = LabelEncoder()
y = y_labelencoder.fit_transform(y)
>>>array([1, 0, 0, 1, 1], dtype=int64)

X_labelencoder = LabelEncoder()
X[:,2] = X_labelencoder.fit_transform(X[:,2])
>>>array([1, 0, 1, 0, 1], dtype=int64)

#  one-hot encoding
#  categorical_features selects the column indices to one-hot encode
#  (this parameter was deprecated in scikit-learn 0.20 and removed in 0.22)
onehotencoder = OneHotEncoder(categorical_features=[2])
X_hot = onehotencoder.fit_transform(X).toarray()
#  check X_hot
X_hot
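On newer scikit-learn versions, where OneHotEncoder no longer accepts categorical_features, one possible replacement is ColumnTransformer. The following is a sketch of that alternative rather than the approach used in this post; the transformer name 'sex_onehot' is arbitrary, and sparse_threshold=0 is set only to get a dense array back.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ds = [
    ('胖', 175, 70, '男', 35),
    ('瘦', 160, 50, '女', 31),
    ('瘦', 175, 75, '男', 27),
    ('胖', 180, 80, '女', 27.5),
    ('胖', 180, 100, '男', 18)]
df = pd.DataFrame.from_records(
    ds, columns=['label', 'high', 'weight', 'sex', 'years'])
features = df[['high', 'weight', 'sex', 'years']]

# one-hot encode only the 'sex' column and pass the other columns through
ct = ColumnTransformer(
    [('sex_onehot', OneHotEncoder(), ['sex'])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
X_hot = ct.fit_transform(features)
print(X_hot)
```

The encoded columns come first (女, then 男), followed by the untouched high, weight and years columns. No LabelEncoder step is needed on the feature, since OneHotEncoder can handle string categories directly.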

Using Pandas

In practice, I still prefer pandas for one-hot encoding.

pd.get_dummies(df[['high', 'weight', 'sex', 'years']])
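For reference, here is the same call as a self-contained sketch, so the shape of the result is easy to see. The data is the hand-entered dataset from this post; only the variable name dummies is new.

```python
import pandas as pd

ds = [
    ('胖', 175, 70, '男', 35),
    ('瘦', 160, 50, '女', 31),
    ('瘦', 175, 75, '男', 27),
    ('胖', 180, 80, '女', 27.5),
    ('胖', 180, 100, '男', 18)]
df = pd.DataFrame.from_records(
    ds, columns=['label', 'high', 'weight', 'sex', 'years'])

# numeric columns are kept as they are; the text column is expanded
dummies = pd.get_dummies(df[['high', 'weight', 'sex', 'years']])
print(list(dummies.columns))
print(dummies)
```

The numeric columns stay in place and the sex column is replaced by two indicator columns, sex_女 and sex_男, with readable names. That readability is a large part of why get_dummies is convenient for exploration.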

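Machine Learning_ML_LabelEncoder

Description

Official documentation
In another post we covered imputation. Once the data has been imputed, the next problem is text features; even the classes themselves may be text. What then?
We encode them, converting the text into numbers. Read on to see how sklearn does this for us.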

IMPORT

from sklearn.preprocessing import LabelEncoder

CLASS

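class sklearn.preprocessing.LabelEncoder

Methods

fit

Fits the encoder

transform

Transforms labels into codes

fit_transform

Fits and transforms in one step

inverse_transform

Converts codes back to the original labels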
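A short sketch of the four methods used one at a time, on the labels from this post's dataset. It also reads the fitted classes_ attribute, which shows which code each label was given (labels are sorted, and the position in classes_ is the code).

```python
from sklearn.preprocessing import LabelEncoder

labels = ['胖', '瘦', '瘦', '胖', '胖']

le = LabelEncoder()
# fit: learn the set of distinct labels
le.fit(labels)
classes = le.classes_.tolist()
# transform: map each label to its position in classes_
codes = le.transform(labels).tolist()
# inverse_transform: map the codes back to the labels
restored = le.inverse_transform(codes).tolist()
print(classes, codes, restored)
```

fit_transform does the fit and transform steps in a single call, which is what the example in this post uses.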

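Example

The dataset is as follows (胖 = overweight, 瘦 = thin, 男 = male, 女 = female; blank cells are missing values):

Class	Height	Weight	Sex	Age
胖	175	70	男	35
瘦	160	50	女	31
瘦	175		男	27
胖	180	80	女
胖	180	100	男	18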

#  load the required libraries
import numpy as np
import pandas as pd
#  label encoding
from sklearn.preprocessing import LabelEncoder

#  enter the dataset by hand, pretending the missing values have already been imputed
ds = [
    ('胖',175,70,'男',35),
    ('瘦',160,50,'女',31),
    ('瘦',175,75,'男',27),
    ('胖',180,80,'女',27.5),
    ('胖',180,100,'男',18)]

columns = ['label','high','weight','sex','years']

#  load the data into pandas, just to make it feel more realistic
df = pd.DataFrame.from_records(ds,columns=columns)

#  confirm the data was loaded into pandas
df

#  split the data
X = df.iloc[:,1:].values
y = df.iloc[:,0].values
#  check y
y
>>>array(['胖', '瘦', '瘦', '胖', '胖'], dtype=object)

#  label encoding
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)
#  check how the target classes were encoded
y
>>>array([1, 0, 0, 1, 1], dtype=int64)
#  try converting back
labelencoder.inverse_transform(y)
>>>array(['胖', '瘦', '瘦', '胖', '胖'], dtype=object)
