scikit-learn practice
As an aside: this diary entry is actually all copy-pasted from a Jupyter notebook... sorry.
Notebooks are just too convenient (sweat)
First, let's pull in the libraries we'll be using
- numpy : matrix/array computation library
- scipy : scientific computing library
- matplotlib : graph plotting library
- pandas : data wrangling, display, that sort of thing... from my impression using it, does this even need matplotlib? (sweat) (quick check right after the imports below)
- IPython : enhanced interactive Python shell
- mglearn : https://github.com/amueller/mglearn the companion library for the O'Reilly book. Treated as a helper.
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
import pandas as pd
from IPython import display
import mglearn
```
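About that matplotlib question: pandas' built-in plotting is a thin wrapper around matplotlib, so matplotlib is still doing the drawing underneath. A minimal way to see it (just a sketch):

```python
import pandas as pd

ax = pd.Series([1, 3, 2]).plot()  # pandas delegates the actual drawing to matplotlib
print(type(ax))                   # a matplotlib Axes object comes back
```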
Creating sparse matrices
First, make an identity matrix with NumPy
```python
eye = np.eye(4)
print(f"Numpy array\n{eye}")
```

```
Numpy array
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
```
Converting to compressed sparse row (CSR) format
A NumPy array can be converted to a SciPy sparse matrix
```python
sparse_matrix = sparse.csr_matrix(eye)
print(f"CSR matrix\n {sparse_matrix}")
```

```
CSR matrix
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
  (3, 3)    1.0
```
Converting to coordinate list (COO) format
```python
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
print(col_indices)
```

```
[0 1 2 3]
```
```python
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print(f"COO\n {eye_coo}")
```

```
COO
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
  (3, 3)    1.0
```
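The whole point of these formats is that only the nonzero entries get stored. A rough size comparison (a sketch; the exact byte counts depend on dtypes and the SciPy version):

```python
big = sparse.random(1000, 1000, density=0.01, format='csr')
dense_bytes = big.toarray().nbytes                                    # full 1000x1000 float64 grid
csr_bytes = big.data.nbytes + big.indices.nbytes + big.indptr.nbytes  # only the ~10,000 nonzeros
print(dense_bytes, csr_bytes)                                         # roughly 8,000,000 vs ~125,000
```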
Drawing a quick graph with matplotlib
```python
x = np.linspace(-10, 10, 100)
y = np.sin(x)
plt.plot(x, y, marker="x")
```
Trying out pandas
```python
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Location': ['NY', 'Paris', 'Berlin', 'London'],
    'Age': [24, 13, 53, 33]
}
data_pandas = pd.DataFrame(data)
data_pandas
```

|   | Name  | Location | Age |
|---|-------|----------|-----|
| 0 | John  | NY       | 24  |
| 1 | Anna  | Paris    | 13  |
| 2 | Peter | Berlin   | 53  |
| 3 | Linda | London   | 33  |
```python
data_pandas[data_pandas.Age > 30]
```

|   | Name  | Location | Age |
|---|-------|----------|-----|
| 2 | Peter | Berlin   | 53  |
| 3 | Linda | London   | 33  |
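The filter works because `data_pandas.Age > 30` is itself a boolean Series, which pandas then uses as a row mask:

```python
print(data_pandas.Age > 30)  # False, False, True, True -- True for the rows that survive
```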
Loading the Iris dataset
```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
iris_dataset.keys()
```

```
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
```
Let's read out the details.
```python
print(iris_dataset['DESCR'])
```

```
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
```
```python
print(iris_dataset['target_names'])
```

```
['setosa' 'versicolor' 'virginica']
```

```python
print(iris_dataset['feature_names'])
```

```
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```
```python
print(iris_dataset['data'][:5])
```

```
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
```
```python
print(type(iris_dataset['target']))
```

```
<class 'numpy.ndarray'>
```

```python
print(iris_dataset['target'].shape)
```

```
(150,)
```
```python
iris_dataset['target']
```

```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
```
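The "33.3% for each of 3 classes" from the description is easy to confirm with a quick count:

```python
print(np.bincount(iris_dataset['target']))  # [50 50 50] -- 50 samples per class
```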
Splitting the data into training and test sets
Using `train_test_split`, the data gets split 75% for training and 25% for testing by default.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
```
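A quick sanity check on the split sizes: with 150 samples and the default 25% test fraction, this should come out as 112 train / 38 test rows.

```python
print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
print(y_train.shape, y_test.shape)  # (112,) (38,)
```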
First, take a look at the data
Build a `pd.DataFrame` and pass it to `pd.plotting.scatter_matrix` (older pandas exposed this as `pd.scatter_matrix`), and it draws a scatter plot for every pair of features.
Whoa, what even is this... data analysis goes way too fast like this... (sweat)
```python
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# pd.scatter_matrix was removed in newer pandas; pd.plotting.scatter_matrix is the current home
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=0.8, cmap=mglearn.cm3)
```
The Machine Learning
Let's process it with the k-nearest neighbors method... I did stuff like this way back in the day.
```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
```

```
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
```
...it's become ridiculously easy... I used to grind through machine learning books and implement this by sheer willpower (sweat)
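For old times' sake, what `KNeighborsClassifier(n_neighbors=1)` is doing can be sketched by hand in a few lines (a minimal 1-NN with Euclidean distance; `nn_predict` is just an illustrative name, not part of scikit-learn):

```python
import numpy as np

def nn_predict(X_train, y_train, X_new):
    """Minimal 1-nearest-neighbor: each query gets the label of its closest training point."""
    preds = []
    for x in X_new:
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distance to every training sample
        preds.append(y_train[dists.argmin()])              # take the label of the nearest one
    return np.array(preds)
```

Calling `nn_predict(X_train, y_train, X_new)` should agree with `knn.predict(X_new)` here.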
First, build a feature array from some made-up data
```python
X_new = np.array([[5, 2.9, 1, 0.2]])
X_new.shape
```

```
(1, 4)
```
```python
prediction = knn.predict(X_new)
prediction
```

```
array([0])
```

```python
print(iris_dataset['target_names'][prediction])
```

```
['setosa']
```
Looks about right.
Now feed it the test data.
```python
y_pred = knn.predict(X_test)
y_pred
```

```
array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 2])
```
Something came out. Now compare against the ground truth.
```python
np.mean(y_pred == y_test)
```

```
0.9736842105263158
```
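Incidentally, the estimator's own `score` method computes the same mean accuracy in one call:

```python
knn.score(X_test, y_test)  # same value as np.mean(y_pred == y_test)
```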
Wha... (゚o゚;;
You get this far with just that...