Reference : https://www.wikiwand.com/en/Data_analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Reference : http://scikit-learn.org/stable/tutorial/machine_learning_map/
SciKits
Reference : https://scikits.appspot.com/scikits
Built-in datasets are loaded with functions of the form load_<dataset_name>()
from sklearn import datasets
boston = datasets.load_boston()
type(boston)
sklearn.utils.Bunch
The Bunch object in Scikit-Learn is simply a dictionary that exposes dictionary keys as properties so that you can access them with dot notation.
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
- data: the explanatory variables (features)
- target: the response variable
- feature_names: the names of the features
- DESCR: a description of the dataset
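Because a Bunch is a dict subclass, key access and attribute access are interchangeable; a minimal sketch:
# Key access and attribute access return the same underlying array.
print(boston.data is boston['data'])   # True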
print(boston.DESCR)
Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive

    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

**References**

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan, R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
These cover information such as the per-town crime rate and whether the tract bounds the Charles River; the median house price will serve as the target.
boston.data.shape
(506, 13)
boston.target.shape
(506,)
from sklearn import datasets
iris = datasets.load_iris()
The iris dataset is one of the best-known datasets in bioinformatics, taken from the UCI Machine Learning Repository at the University of California, Irvine.
iris.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
- data: the explanatory variables (features)
- target: the response variable
- target_names: the class names of the response variable
- feature_names: the names of the features
- DESCR: a description of the dataset
iris.DESCR
iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
These are the sepal and petal measurements.
iris.data.shape
(150, 4)
iris.target.shape
(150,)
Reference : https://www.wikiwand.com/en/Data_analysis
Data Preparation
boston_df = pd.DataFrame( boston.data )
boston_df.columns = boston.feature_names
boston_df['PRICE'] = boston.target
Check the data dimensions
boston_df.shape
(506, 14)
Look at the first five rows to get a feel for the data
boston_df.head()
 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
Check for missing values
boston_df.isnull().sum()
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
PRICE      0
dtype: int64
Basic descriptive statistics
boston_df.describe()
 | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.593761 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.596783 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.647423 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
Exploratory data visualization
sns.pairplot(boston_df)
<seaborn.axisgrid.PairGrid at 0x1064ee240>
Data Preparation
iris_df = pd.DataFrame( iris.data )
iris_df.columns = iris.feature_names
iris_df['species'] = iris.target_names[iris.target]
Check the data dimensions
iris_df.shape
(150, 5)
Look at the first five rows to get a feel for the data
iris_df.head()
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Check for missing values
iris_df.isnull().sum()
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
species              0
dtype: int64
Basic descriptive statistics
iris_df.describe()
 | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm)
---|---|---|---|---
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
iris_df["species"].value_counts()
setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64
Exploratory data visualization
sns.pairplot(iris_df)
<seaborn.axisgrid.PairGrid at 0x105c69470>
Exploratory data visualization
sns.pairplot(iris_df, hue='species')
<seaborn.axisgrid.PairGrid at 0x1161ad5c0>
Exploratory data visualization
sns.pairplot(iris_df, hue='species', diag_kind="kde")
<seaborn.axisgrid.PairGrid at 0x117338208>
Exploratory data visualization
sns.boxplot(x="species", y="petal length (cm)", data=iris_df)
<matplotlib.axes._subplots.AxesSubplot at 0x118697198>
Choose Data
B_X = boston_df.drop('PRICE',axis=1)
B_y = boston_df['PRICE']
Training and testing
from sklearn.model_selection import train_test_split
B_X_train, B_X_test, B_y_train, B_y_test = train_test_split(B_X , B_y , test_size = 0.2, random_state = 0)
Choose Data
I_X = iris_df.loc[:, ['petal length (cm)','petal width (cm)']]
I_y = iris.target
Training and testing
from sklearn.model_selection import train_test_split
I_X_train, I_X_test, I_y_train, I_y_test = train_test_split(I_X, I_y, test_size = 0.3, random_state = 0)
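A quick sanity check on the split sizes: with test_size = 0.2 the 506 Boston rows split into 404 training and 102 test rows, and with test_size = 0.3 the 150 iris rows split 105/45.
print(B_X_train.shape, B_X_test.shape)   # (404, 13) (102, 13)
print(I_X_train.shape, I_X_test.shape)   # (105, 2) (45, 2)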
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit( B_X_train.values, B_y_train.values )
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
print(lm.intercept_)
38.1386927134
print(lm.coef_)
[ -1.18410318e-01 4.47550643e-02 5.85674689e-03 2.34230117e+00 -1.61634024e+01 3.70135143e+00 -3.04553661e-03 -1.38664542e+00 2.43784171e-01 -1.09856157e-02 -1.04699133e+00 8.22014729e-03 -4.93642452e-01]
pd.DataFrame(list( zip(B_X_train.columns,lm.coef_) ),columns=['features','estimatedCoeffs'])
 | features | estimatedCoeffs
---|---|---
0 | CRIM | -0.118410 |
1 | ZN | 0.044755 |
2 | INDUS | 0.005857 |
3 | CHAS | 2.342301 |
4 | NOX | -16.163402 |
5 | RM | 3.701351 |
6 | AGE | -0.003046 |
7 | DIS | -1.386645 |
8 | RAD | 0.243784 |
9 | TAX | -0.010986 |
10 | PTRATIO | -1.046991 |
11 | B | 0.008220 |
12 | LSTAT | -0.493642 |
from sklearn.feature_selection import f_regression
print(f_regression(B_X_train, B_y_train)[1])
[ 1.24951397e-17 8.53455987e-18 5.27647041e-30 3.33700835e-04 1.67294465e-22 5.64340813e-62 4.20176291e-18 9.08403584e-09 3.66673466e-19 2.94014039e-28 3.07127177e-35 2.18386516e-13 6.88510971e-76]
print(f_regression(B_X_train, B_y_train)[1].round(decimals=5))
[ 0. 0. 0. 0.00033 0. 0. 0. 0. 0. 0. 0. 0. 0. ]
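f_regression returns a tuple of (F-statistics, p-values) from a univariate test of each feature against the target; pairing the p-values with the column names makes them easier to read (a minimal sketch):
F_stats, p_values = f_regression(B_X_train, B_y_train)
print(pd.Series(p_values, index=B_X_train.columns).round(5))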
B_y_predict = lm.predict(B_X_test)
pd.DataFrame( list(zip(B_y_test.values,B_y_predict)), columns=['Measured','Predicted'] )
 | Measured | Predicted
---|---|---
0 | 22.6 | 24.890130 |
1 | 50.0 | 23.724882 |
2 | 23.0 | 29.372133 |
3 | 8.3 | 12.140103 |
4 | 21.2 | 21.446865 |
5 | 19.9 | 19.286453 |
6 | 20.6 | 20.496373 |
7 | 18.7 | 21.361896 |
8 | 16.1 | 18.901879 |
9 | 18.6 | 19.892403 |
10 | 8.8 | 5.148872 |
11 | 17.2 | 16.346841 |
12 | 14.9 | 17.060125 |
13 | 10.5 | 5.609031 |
14 | 50.0 | 40.004621 |
15 | 29.0 | 32.494273 |
16 | 23.0 | 22.460817 |
17 | 33.3 | 36.855865 |
18 | 29.4 | 30.865793 |
19 | 21.0 | 23.154780 |
20 | 23.8 | 24.776560 |
21 | 19.1 | 24.679962 |
22 | 20.4 | 20.593782 |
23 | 29.1 | 30.356250 |
24 | 19.3 | 22.426400 |
25 | 23.1 | 10.228738 |
26 | 19.6 | 17.648142 |
27 | 19.4 | 18.260385 |
28 | 38.7 | 35.530774 |
29 | 18.7 | 20.961253 |
... | ... | ... |
72 | 23.5 | 30.734731 |
73 | 31.2 | 28.828630 |
74 | 23.7 | 25.901463 |
75 | 7.4 | 5.239417 |
76 | 48.3 | 36.712836 |
77 | 24.4 | 23.774485 |
78 | 22.6 | 27.271346 |
79 | 18.3 | 19.294855 |
80 | 23.3 | 28.624284 |
81 | 17.1 | 19.178398 |
82 | 27.9 | 18.975513 |
83 | 44.8 | 37.816238 |
84 | 50.0 | 39.208833 |
85 | 23.0 | 23.714100 |
86 | 21.4 | 24.934828 |
87 | 10.2 | 15.850637 |
88 | 23.3 | 26.096484 |
89 | 23.2 | 16.677999 |
90 | 18.9 | 15.833155 |
91 | 13.4 | 13.065319 |
92 | 21.9 | 24.722807 |
93 | 24.8 | 31.254435 |
94 | 11.9 | 22.171410 |
95 | 24.3 | 20.251676 |
96 | 13.8 | 0.596340 |
97 | 24.7 | 25.445216 |
98 | 14.1 | 15.521760 |
99 | 18.7 | 17.937782 |
100 | 28.1 | 25.306178 |
101 | 19.8 | 22.372220 |
102 rows × 2 columns
plt.scatter(B_y_test.values,B_y_predict,s=2)
plt.plot([B_y_test.values.min(), B_y_test.values.max()], [B_y_test.values.min(), B_y_test.values.max()], 'k--', lw=2)
plt.ylabel('Predicted')
plt.xlabel('Measured')
<matplotlib.text.Text at 0x118f4cc50>
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(B_y_test.values, B_y_predict)
print("MSE : ",mse)
MSE : 33.4507089677
R_2 = lm.score(B_X_train, B_y_train)
print("R-squared : ",R_2)
R-squared : 0.772971872657
adj_R_2 = R_2 - (1 - R_2) * (B_X_train.shape[1] / (B_X_train.shape[0] - B_X_train.shape[1] - 1))
print("Adjusted R-squared : ",adj_R_2)
Adjusted R-squared : 0.765404268412
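For reference, these metrics follow the usual definitions, with $n$ samples and $p$ features; note that lm.score is called on the training data here, so it reports training R-squared. The adjusted R-squared line above is an algebraic rearrangement of the textbook formula:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}$$

$$\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1} = R^2 - \left(1 - R^2\right)\frac{p}{n - p - 1}$$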
from sklearn.svm import SVR
svr = SVR(kernel='rbf')
rbf : the Gaussian radial basis function kernel
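The RBF kernel scores the similarity of two samples by their squared Euclidean distance,

$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^2\right)$$

so features on large raw scales (such as TAX) dominate the distance, which is why the unscaled fit below predicts nearly the same value for every test point.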
svr.fit( B_X_train.values, B_y_train.values )
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
B_y_predict_svr = svr.predict(B_X_test)
pd.DataFrame( list(zip(B_y_test.values,B_y_predict_svr)), columns=['Measured','Predicted'] )
 | Measured | Predicted
---|---|---
0 | 22.6 | 21.360372 |
1 | 50.0 | 21.360428 |
2 | 23.0 | 21.375869 |
3 | 8.3 | 21.088170 |
4 | 21.2 | 21.357921 |
5 | 19.9 | 21.294222 |
6 | 20.6 | 21.318502 |
7 | 18.7 | 20.856544 |
8 | 16.1 | 21.360407 |
9 | 18.6 | 21.360407 |
10 | 8.8 | 21.360407 |
11 | 17.2 | 21.360209 |
12 | 14.9 | 20.876751 |
13 | 10.5 | 21.304793 |
14 | 50.0 | 21.541168 |
15 | 29.0 | 21.360566 |
16 | 23.0 | 21.360436 |
17 | 33.3 | 21.389027 |
18 | 29.4 | 21.360408 |
19 | 21.0 | 21.414940 |
20 | 23.8 | 21.471958 |
21 | 19.1 | 21.346785 |
22 | 20.4 | 21.314222 |
23 | 29.1 | 21.709732 |
24 | 19.3 | 21.360222 |
25 | 23.1 | 21.360407 |
26 | 19.6 | 21.224352 |
27 | 19.4 | 21.360407 |
28 | 38.7 | 21.362649 |
29 | 18.7 | 20.928249 |
... | ... | ... |
72 | 23.5 | 21.360420 |
73 | 31.2 | 21.360407 |
74 | 23.7 | 21.851062 |
75 | 7.4 | 21.360025 |
76 | 48.3 | 21.388638 |
77 | 24.4 | 21.423946 |
78 | 22.6 | 21.438740 |
79 | 18.3 | 20.909480 |
80 | 23.3 | 21.360407 |
81 | 17.1 | 18.860371 |
82 | 27.9 | 21.360405 |
83 | 44.8 | 21.501592 |
84 | 50.0 | 21.395877 |
85 | 23.0 | 21.367203 |
86 | 21.4 | 21.353663 |
87 | 10.2 | 21.360407 |
88 | 23.3 | 21.673876 |
89 | 23.2 | 21.360377 |
90 | 18.9 | 21.360407 |
91 | 13.4 | 21.360402 |
92 | 21.9 | 21.360407 |
93 | 24.8 | 21.360407 |
94 | 11.9 | 21.200132 |
95 | 24.3 | 21.360407 |
96 | 13.8 | 21.343289 |
97 | 24.7 | 21.360407 |
98 | 14.1 | 21.360405 |
99 | 18.7 | 21.379847 |
100 | 28.1 | 21.361395 |
101 | 19.8 | 21.121805 |
102 rows × 2 columns
plt.scatter(B_y_test.values,B_y_predict_svr,s=2)
plt.plot([B_y_test.values.min(), B_y_test.values.max()], [B_y_test.values.min(), B_y_test.values.max()], 'k--', lw=2)
plt.ylabel('Predicted')
plt.xlabel('Measured')
<matplotlib.text.Text at 0x118fd8d68>
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(B_X_train)
B_X_train_s = scaler.transform(B_X_train)
B_X_test_s = scaler.transform(B_X_test)
svr.fit( B_X_train_s, B_y_train.values )
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto', kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
B_y_predict_svr_s = svr.predict(B_X_test_s)
plt.scatter(B_y_test.values,B_y_predict_svr_s,s=2)
plt.plot([B_y_test.values.min(), B_y_test.values.max()], [B_y_test.values.min(), B_y_test.values.max()], 'k--', lw=2)
plt.ylabel('Predicted')
plt.xlabel('Measured')
<matplotlib.text.Text at 0x118ffc908>
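To quantify the improvement from standardization, the test MSE of the scaled SVR can be computed the same way as for the linear model (a minimal sketch; the numeric result is not shown here):
from sklearn.metrics import mean_squared_error
mse_svr_s = mean_squared_error(B_y_test.values, B_y_predict_svr_s)
print("MSE (scaled SVR) : ", mse_svr_s)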
np.set_printoptions(precision = 5, suppress = True)
print("Mean : ",I_X.values.mean(axis=0))
print("Std : ",I_X.values.std(axis=0))
print("Max : ",I_X.values.max(axis=0))
print("Min : ",I_X.values.min(axis=0))
Mean :  [ 3.75867  1.19867]
Std :  [ 1.75853  0.76061]
Max :  [ 6.9  2.5]
Min :  [ 1.   0.1]
from sklearn import preprocessing
standard = preprocessing.StandardScaler()
standard.fit(I_X)
I_X_s = standard.transform(I_X)
print("Mean : ",I_X_s.mean(axis=0))
print("Std : ",I_X_s.std(axis=0))
print("Max : ",I_X_s.max(axis=0))
print("Min : ",I_X_s.min(axis=0))
Mean :  [-0. -0.]
Std :  [ 1.  1.]
Max :  [ 1.78634  1.7109 ]
Min :  [-1.56874 -1.44445]
minmax = preprocessing.MinMaxScaler()
minmax.fit(I_X)
I_X_m = minmax.transform(I_X)
print("Mean : ",I_X_m.mean(axis=0))
print("Std : ",I_X_m.std(axis=0))
print("Max : ",I_X_m.max(axis=0))
print("Min : ",I_X_m.min(axis=0))
Mean :  [ 0.46757  0.45778]
Std :  [ 0.29806  0.31692]
Max :  [ 1.  1.]
Min :  [ 0.  0.]
normal = preprocessing.Normalizer()
normal.fit(I_X)
I_X_n = normal.transform(I_X)
print("Mean : ",I_X_n.mean(axis=0))
print("Std : ",I_X_n.std(axis=0))
print("Max : ",I_X_n.max(axis=0))
print("Min : ",I_X_n.min(axis=0))
Mean :  [ 0.95911  0.26772]
Std :  [ 0.02265  0.08904]
Max :  [ 0.99779  0.4258 ]
Min :  [ 0.90482  0.06652]
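Unlike StandardScaler and MinMaxScaler, which rescale each column, Normalizer rescales each row (sample) to unit length, which is why the column statistics above no longer reach exactly 0 and 1. A minimal check:
# With the default norm='l2', every sample now has unit Euclidean length.
print(np.linalg.norm(I_X_n, axis=1))   # 150 values, all 1.0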
Reference : http://www.saedsayad.com/logistic_regression.htm
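Logistic regression passes a linear combination of the features through the sigmoid (logistic) function, which squashes any real number into the interval (0, 1):

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma(0) = 0.5$$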
def SigmoidFunc(x):
    return 1 / (1 + np.exp(-x))
Sigmoid_x = np.arange(-20, 20, 0.01)
Sigmoid_y = SigmoidFunc(Sigmoid_x)
plt.plot(Sigmoid_x, Sigmoid_y)
plt.axvline(0, ls='dotted', color='black', alpha=0.5)
plt.axhline(y=0, ls='dotted', color='black', alpha=0.5)
plt.axhline(y=0.5, ls = 'dotted', color='black', alpha=0.5)
plt.axhline(y=1, ls='dotted', color='black', alpha=0.5)
plt.yticks([0.0, 0.5, 1.0])
plt.ylim(-0.05, 1.05)
plt.title("Sigmoid Function")
plt.show()
from sklearn.linear_model import LogisticRegression
logit = LogisticRegression()
logit.fit( I_X_train, I_y_train )
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
y_logit_predict = logit.predict( I_X_test )
Compare Predicted and Measured Values
pd.DataFrame( list(zip( I_y_test, y_logit_predict)), columns=['Measured','Predicted'] )
 | Measured | Predicted
---|---|---
0 | 2 | 2 |
1 | 1 | 1 |
2 | 0 | 0 |
3 | 2 | 2 |
4 | 0 | 0 |
5 | 2 | 2 |
6 | 0 | 0 |
7 | 1 | 2 |
8 | 1 | 2 |
9 | 1 | 2 |
10 | 2 | 1 |
11 | 1 | 2 |
12 | 1 | 1 |
13 | 1 | 2 |
14 | 1 | 2 |
15 | 0 | 0 |
16 | 1 | 2 |
17 | 1 | 1 |
18 | 0 | 0 |
19 | 0 | 0 |
20 | 2 | 2 |
21 | 1 | 2 |
22 | 0 | 0 |
23 | 0 | 0 |
24 | 2 | 2 |
25 | 0 | 0 |
26 | 0 | 0 |
27 | 1 | 2 |
28 | 1 | 1 |
29 | 0 | 0 |
30 | 2 | 2 |
31 | 1 | 2 |
32 | 0 | 0 |
33 | 2 | 2 |
34 | 2 | 2 |
35 | 1 | 2 |
36 | 0 | 0 |
37 | 1 | 2 |
38 | 1 | 2 |
39 | 1 | 1 |
40 | 2 | 2 |
41 | 0 | 0 |
42 | 2 | 2 |
43 | 0 | 0 |
44 | 0 | 0 |
Compare Predicted and Measured Values
pd.DataFrame( list(zip(iris.target_names[ I_y_test],
iris.target_names[ y_logit_predict])), columns=['Measured','Predicted'] )
 | Measured | Predicted
---|---|---
0 | virginica | virginica |
1 | versicolor | versicolor |
2 | setosa | setosa |
3 | virginica | virginica |
4 | setosa | setosa |
5 | virginica | virginica |
6 | setosa | setosa |
7 | versicolor | virginica |
8 | versicolor | virginica |
9 | versicolor | virginica |
10 | virginica | versicolor |
11 | versicolor | virginica |
12 | versicolor | versicolor |
13 | versicolor | virginica |
14 | versicolor | virginica |
15 | setosa | setosa |
16 | versicolor | virginica |
17 | versicolor | versicolor |
18 | setosa | setosa |
19 | setosa | setosa |
20 | virginica | virginica |
21 | versicolor | virginica |
22 | setosa | setosa |
23 | setosa | setosa |
24 | virginica | virginica |
25 | setosa | setosa |
26 | setosa | setosa |
27 | versicolor | virginica |
28 | versicolor | versicolor |
29 | setosa | setosa |
30 | virginica | virginica |
31 | versicolor | virginica |
32 | setosa | setosa |
33 | virginica | virginica |
34 | virginica | virginica |
35 | versicolor | virginica |
36 | setosa | setosa |
37 | versicolor | virginica |
38 | versicolor | virginica |
39 | versicolor | versicolor |
40 | virginica | virginica |
41 | setosa | setosa |
42 | virginica | virginica |
43 | setosa | setosa |
44 | setosa | setosa |
print(iris.target_names)
print(logit.predict_proba( I_X_test.iloc[0, :].values.reshape(1, 2)))
print(logit.predict_proba( I_X_test.iloc[1, :].values.reshape(1, 2)))
print(logit.predict_proba( I_X_test.iloc[2, :].values.reshape(1, 2)))
['setosa' 'versicolor' 'virginica']
[[ 0.00368  0.20336  0.79297]]
[[ 0.14155  0.53555  0.3229 ]]
[[ 0.72601  0.23859  0.0354 ]]
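Each row of predict_proba is a probability distribution over the three classes, and predict simply returns the class with the highest probability; a minimal sketch to verify both facts:
proba = logit.predict_proba( I_X_test )
print(proba.sum(axis=1))                                          # every row sums to 1
print((proba.argmax(axis=1) == logit.predict( I_X_test )).all())  # True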
from sklearn.metrics import accuracy_score
print('Accuracy:',accuracy_score( I_y_test, y_logit_predict))
Accuracy: 0.688888888889
from matplotlib.colors import ListedColormap
def DecisionBoundary_plot(X, y, method, h=.02):
    markers = ('o', '*', '^')
    colors = ('yellow', 'magenta', 'cyan')
    colormap = ListedColormap(colors[:len(np.unique(y))])
    # Build a grid that covers the feature space, padded by 1 on each side.
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Predict a class for every grid point, then shade the decision regions.
    Z = method.predict(np.array([xx.ravel(), yy.ravel()]).T)
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.2, cmap=colormap)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    # Overlay the actual samples, one marker and color per class.
    for ix, lab in enumerate(np.unique(y)):
        spec = iris.target_names[lab]
        plt.scatter(x=X[y == lab, 0], y=X[y == lab, 1],
                    c=colormap(ix), marker=markers[ix], label=spec)
DecisionBoundary_plot(X= I_X.values, y= I_y, method=logit)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.legend(loc = 'upper left')
plt.show()
Data Preparation -- Training and testing
I_X_train_s, I_X_test_s, I_y_train_s, I_y_test_s = train_test_split( I_X_s, I_y, test_size = 0.3, random_state = 0)
Model
logit_s = LogisticRegression()
logit_s.fit( I_X_train_s, I_y_train_s )
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
Predict
y_logit_predict_s = logit_s.predict( I_X_test_s )
Accuracy
print('Accuracy(Standard):',accuracy_score( I_y_test_s, y_logit_predict_s))
Accuracy(Standard): 0.8
With standardized data, the accuracy is better.
Plot ( Decision boundary )
DecisionBoundary_plot(X= I_X_s, y= I_y, method=logit_s)
plt.title('Iris (Standard)')
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.legend(loc = 'upper left')
plt.show()
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit( I_X_train_s, I_y_train_s)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform')
y_knn_predict = knn.predict( I_X_test_s )
Compare Predicted and Measured Values
pd.DataFrame( list(zip(iris.target_names[ I_y_test_s],
iris.target_names[ y_knn_predict])), columns=['Measured','Predicted'] )
 | Measured | Predicted
---|---|---
0 | virginica | virginica |
1 | versicolor | versicolor |
2 | setosa | setosa |
3 | virginica | virginica |
4 | setosa | setosa |
5 | virginica | virginica |
6 | setosa | setosa |
7 | versicolor | versicolor |
8 | versicolor | versicolor |
9 | versicolor | versicolor |
10 | virginica | virginica |
11 | versicolor | versicolor |
12 | versicolor | versicolor |
13 | versicolor | versicolor |
14 | versicolor | versicolor |
15 | setosa | setosa |
16 | versicolor | versicolor |
17 | versicolor | versicolor |
18 | setosa | setosa |
19 | setosa | setosa |
20 | virginica | virginica |
21 | versicolor | versicolor |
22 | setosa | setosa |
23 | setosa | setosa |
24 | virginica | virginica |
25 | setosa | setosa |
26 | setosa | setosa |
27 | versicolor | versicolor |
28 | versicolor | versicolor |
29 | setosa | setosa |
30 | virginica | virginica |
31 | versicolor | versicolor |
32 | setosa | setosa |
33 | virginica | virginica |
34 | virginica | virginica |
35 | versicolor | versicolor |
36 | setosa | setosa |
37 | versicolor | versicolor |
38 | versicolor | versicolor |
39 | versicolor | versicolor |
40 | virginica | virginica |
41 | setosa | setosa |
42 | virginica | virginica |
43 | setosa | setosa |
44 | setosa | setosa |
from sklearn.metrics import accuracy_score
print('Accuracy:',accuracy_score( I_y_test_s, y_knn_predict))
Accuracy: 1.0
DecisionBoundary_plot(X= I_X_s, y= I_y, method=knn)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.legend(loc = 'upper left')
plt.show()
from sklearn.svm import SVC
svm = SVC(kernel = 'rbf', random_state = 0, gamma = 0.2)
svm.fit( I_X_train, I_y_train )
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.2, kernel='rbf', max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001, verbose=False)
y_svm_predict = svm.predict( I_X_test )
Compare Predicted and Measured Values
pd.DataFrame( list(zip(iris.target_names[ I_y_test],
iris.target_names[ y_svm_predict])), columns=['Measured','Predicted'] )
 | Measured | Predicted
---|---|---
0 | virginica | virginica |
1 | versicolor | versicolor |
2 | setosa | setosa |
3 | virginica | virginica |
4 | setosa | setosa |
5 | virginica | virginica |
6 | setosa | setosa |
7 | versicolor | versicolor |
8 | versicolor | versicolor |
9 | versicolor | versicolor |
10 | virginica | virginica |
11 | versicolor | versicolor |
12 | versicolor | versicolor |
13 | versicolor | versicolor |
14 | versicolor | versicolor |
15 | setosa | setosa |
16 | versicolor | versicolor |
17 | versicolor | versicolor |
18 | setosa | setosa |
19 | setosa | setosa |
20 | virginica | virginica |
21 | versicolor | versicolor |
22 | setosa | setosa |
23 | setosa | setosa |
24 | virginica | virginica |
25 | setosa | setosa |
26 | setosa | setosa |
27 | versicolor | versicolor |
28 | versicolor | versicolor |
29 | setosa | setosa |
30 | virginica | virginica |
31 | versicolor | versicolor |
32 | setosa | setosa |
33 | virginica | virginica |
34 | virginica | virginica |
35 | versicolor | versicolor |
36 | setosa | setosa |
37 | versicolor | virginica |
38 | versicolor | versicolor |
39 | versicolor | versicolor |
40 | virginica | virginica |
41 | setosa | setosa |
42 | virginica | virginica |
43 | setosa | setosa |
44 | setosa | setosa |
from sklearn.metrics import accuracy_score
print('Accuracy:',accuracy_score( I_y_test, y_svm_predict))
Accuracy: 0.977777777778
DecisionBoundary_plot(X= I_X.values , y= I_y, method=svm)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.legend(loc = 'upper left')
plt.show()
Data Preparation -- Training and testing
Model
svm_s = SVC(kernel = 'rbf', random_state = 0, gamma = 0.2)
svm_s.fit( I_X_train_s, I_y_train_s )
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.2, kernel='rbf', max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001, verbose=False)
Predict
y_svm_predict_s = svm_s.predict( I_X_test_s )
Accuracy
print('Accuracy:',accuracy_score( I_y_test_s, y_svm_predict_s))
Accuracy: 0.977777777778
Plot ( Decision boundary )
DecisionBoundary_plot(X= I_X_s, y=I_y, method=svm_s)
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.legend(loc = 'upper left')
plt.show()
Reference : http://mlwiki.org/index.php/Overfitting
Reference : https://dotblogs.com.tw/dragon229/2013/02/04/89919
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3)
kmeans_fit = kmeans.fit( I_X )
kmeans_result = kmeans_fit.labels_
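KMeans cluster IDs are arbitrary integers, so indexing iris.target_names with them lines up with the true species only by coincidence; a contingency table (a minimal sketch) makes the actual correspondence explicit:
# Rows are the true species, columns are the KMeans cluster IDs.
print(pd.crosstab(iris.target_names[I_y], kmeans_result))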
Compare Original Labels and Cluster Assignments
pd.DataFrame( list(zip(iris.target_names[I_y],
iris.target_names[kmeans_result])), columns=['Original','Cluster'] )
 | Original | Cluster
---|---|---
0 | setosa | setosa |
1 | setosa | setosa |
2 | setosa | setosa |
3 | setosa | setosa |
4 | setosa | setosa |
5 | setosa | setosa |
6 | setosa | setosa |
7 | setosa | setosa |
8 | setosa | setosa |
9 | setosa | setosa |
10 | setosa | setosa |
11 | setosa | setosa |
12 | setosa | setosa |
13 | setosa | setosa |
14 | setosa | setosa |
15 | setosa | setosa |
16 | setosa | setosa |
17 | setosa | setosa |
18 | setosa | setosa |
19 | setosa | setosa |
20 | setosa | setosa |
21 | setosa | setosa |
22 | setosa | setosa |
23 | setosa | setosa |
24 | setosa | setosa |
25 | setosa | setosa |
26 | setosa | setosa |
27 | setosa | setosa |
28 | setosa | setosa |
29 | setosa | setosa |
... | ... | ... |
120 | virginica | versicolor |
121 | virginica | versicolor |
122 | virginica | versicolor |
123 | virginica | versicolor |
124 | virginica | versicolor |
125 | virginica | versicolor |
126 | virginica | virginica |
127 | virginica | versicolor |
128 | virginica | versicolor |
129 | virginica | versicolor |
130 | virginica | versicolor |
131 | virginica | versicolor |
132 | virginica | versicolor |
133 | virginica | versicolor |
134 | virginica | versicolor |
135 | virginica | versicolor |
136 | virginica | versicolor |
137 | virginica | versicolor |
138 | virginica | virginica |
139 | virginica | versicolor |
140 | virginica | versicolor |
141 | virginica | versicolor |
142 | virginica | versicolor |
143 | virginica | versicolor |
144 | virginica | versicolor |
145 | virginica | versicolor |
146 | virginica | versicolor |
147 | virginica | versicolor |
148 | virginica | versicolor |
149 | virginica | versicolor |
150 rows × 2 columns
from sklearn.metrics import silhouette_score
print('Silhouette:',silhouette_score( I_X, kmeans_result))
Silhouette: 0.660276088219
The performance of a clustering algorithm can be assessed with the silhouette coefficient.
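For each sample, let $a$ be the mean distance to the other points in its own cluster and $b$ the mean distance to the points in the nearest neighboring cluster; then

$$s = \frac{b - a}{\max(a, b)}, \qquad -1 \le s \le 1,$$

and values close to 1 indicate compact, well-separated clusters.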
Model
kmeans_fit_m = kmeans.fit( I_X_m )
kmeans_result_m = kmeans_fit_m.labels_
print('Silhouette:',silhouette_score( I_X_m, kmeans_result_m ))
Silhouette: 0.675805601905
Reference : https://quantdare.com/hierarchical-clustering/
from sklearn.cluster import AgglomerativeClustering
hierarchical = AgglomerativeClustering(linkage = 'ward', affinity = 'euclidean', n_clusters = 3)
hierarchical_fit = hierarchical.fit( I_X )
hierarchical_result = hierarchical_fit.labels_
Compare Original Labels and Cluster Assignments
pd.DataFrame( list(zip(iris.target_names[I_y],
iris.target_names[hierarchical_result])), columns=['Original','Cluster'] )
 | Original | Cluster
---|---|---
0 | setosa | versicolor |
1 | setosa | versicolor |
2 | setosa | versicolor |
3 | setosa | versicolor |
4 | setosa | versicolor |
5 | setosa | versicolor |
6 | setosa | versicolor |
7 | setosa | versicolor |
8 | setosa | versicolor |
9 | setosa | versicolor |
10 | setosa | versicolor |
11 | setosa | versicolor |
12 | setosa | versicolor |
13 | setosa | versicolor |
14 | setosa | versicolor |
15 | setosa | versicolor |
16 | setosa | versicolor |
17 | setosa | versicolor |
18 | setosa | versicolor |
19 | setosa | versicolor |
20 | setosa | versicolor |
21 | setosa | versicolor |
22 | setosa | versicolor |
23 | setosa | versicolor |
24 | setosa | versicolor |
25 | setosa | versicolor |
26 | setosa | versicolor |
27 | setosa | versicolor |
28 | setosa | versicolor |
29 | setosa | versicolor |
... | ... | ... |
120 | virginica | setosa |
121 | virginica | setosa |
122 | virginica | setosa |
123 | virginica | setosa |
124 | virginica | setosa |
125 | virginica | setosa |
126 | virginica | setosa |
127 | virginica | setosa |
128 | virginica | setosa |
129 | virginica | setosa |
130 | virginica | setosa |
131 | virginica | setosa |
132 | virginica | setosa |
133 | virginica | setosa |
134 | virginica | setosa |
135 | virginica | setosa |
136 | virginica | setosa |
137 | virginica | setosa |
138 | virginica | setosa |
139 | virginica | setosa |
140 | virginica | setosa |
141 | virginica | setosa |
142 | virginica | setosa |
143 | virginica | setosa |
144 | virginica | setosa |
145 | virginica | setosa |
146 | virginica | setosa |
147 | virginica | setosa |
148 | virginica | setosa |
149 | virginica | setosa |
150 rows × 2 columns
from sklearn.metrics import silhouette_score
print('Silhouette:',silhouette_score( I_X, hierarchical_result))
Silhouette: 0.657185644873