scikit-learn practice
As an aside: this diary entry is actually all copy-pasted from a Jupyter notebook... sorry.
Notebooks are just too convenient (sweat)
First, let's pull in the libraries we'll be using
- numpy : matrix/array computation library
- scipy : scientific computing library
- matplotlib : graph plotting library
- pandas : data wrangling, display, that sort of thing... from my impression using it, does this even need matplotlib? (sweat) (quick check right after the imports below)
- IPython : enhanced interactive Python shell
- mglearn : https://github.com/amueller/mglearn the companion library for the O'Reilly book. Treated as a helper.
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import sparse
import pandas as pd
from IPython import display
import mglearn
```
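About that matplotlib question: pandas' built-in plotting is a thin wrapper around matplotlib, so matplotlib is still doing the drawing underneath. A minimal way to see it (just a sketch):

```python
import pandas as pd

ax = pd.Series([1, 3, 2]).plot()  # pandas delegates the actual drawing to matplotlib
print(type(ax))                   # a matplotlib Axes object comes back
```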
Creating sparse matrices
First, make an identity matrix with NumPy
```python
eye = np.eye(4)
print(f"Numpy array\n{eye}")
```

```
Numpy array
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]
```
Converting to compressed sparse row (CSR) format
A NumPy array can be converted to a SciPy sparse matrix
```python
sparse_matrix = sparse.csr_matrix(eye)
print(f"CSR matrix\n {sparse_matrix}")
```

```
CSR matrix
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
  (3, 3)    1.0
```
Converting to coordinate list (COO) format
```python
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
print(col_indices)
```

```
[0 1 2 3]
```
```python
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print(f"COO\n {eye_coo}")
```

```
COO
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
  (3, 3)    1.0
```
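The whole point of these formats is that only the nonzero entries get stored. A rough size comparison (a sketch; the exact byte counts depend on dtypes and the SciPy version):

```python
big = sparse.random(1000, 1000, density=0.01, format='csr')
dense_bytes = big.toarray().nbytes                                    # full 1000x1000 float64 grid
csr_bytes = big.data.nbytes + big.indices.nbytes + big.indptr.nbytes  # only the ~10,000 nonzeros
print(dense_bytes, csr_bytes)                                         # roughly 8,000,000 vs ~125,000
```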
Drawing a quick graph with matplotlib
```python
x = np.linspace(-10, 10, 100)
y = np.sin(x)
plt.plot(x, y, marker="x")
```
Trying out pandas
```python
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Location': ['NY', 'Paris', 'Berlin', 'London'],
    'Age': [24, 13, 53, 33]
}
data_pandas = pd.DataFrame(data)
data_pandas
```

|   | Name  | Location | Age |
|---|-------|----------|-----|
| 0 | John  | NY       | 24  |
| 1 | Anna  | Paris    | 13  |
| 2 | Peter | Berlin   | 53  |
| 3 | Linda | London   | 33  |
```python
data_pandas[data_pandas.Age > 30]
```

|   | Name  | Location | Age |
|---|-------|----------|-----|
| 2 | Peter | Berlin   | 53  |
| 3 | Linda | London   | 33  |
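The filter works because `data_pandas.Age > 30` is itself a boolean Series, which pandas then uses as a row mask:

```python
print(data_pandas.Age > 30)  # False, False, True, True -- True for the rows that survive
```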
Loading the Iris dataset
```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
iris_dataset.keys()
```

```
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
```
Let's read out the details.
```python
print(iris_dataset['DESCR'])
```

```
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
```
```python
print(iris_dataset['target_names'])
```

```
['setosa' 'versicolor' 'virginica']
```

```python
print(iris_dataset['feature_names'])
```

```
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```
```python
print(iris_dataset['data'][:5])
```

```
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
```
```python
print(type(iris_dataset['target']))
```

```
<class 'numpy.ndarray'>
```

```python
print(iris_dataset['target'].shape)
```

```
(150,)
```
```python
iris_dataset['target']
```

```
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
```
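The "33.3% for each of 3 classes" from the description is easy to confirm with a quick count:

```python
print(np.bincount(iris_dataset['target']))  # [50 50 50] -- 50 samples per class
```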
Splitting the data into training and test sets
Using `train_test_split`, the data gets split 75% for training and 25% for testing by default.
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
```
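A quick sanity check on the split sizes: with 150 samples and the default 25% test fraction, this should come out as 112 train / 38 test rows.

```python
print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
print(y_train.shape, y_test.shape)  # (112,) (38,)
```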
First, take a look at the data
Build a `pd.DataFrame` and pass it to `pd.plotting.scatter_matrix` (older pandas exposed this as `pd.scatter_matrix`), and it draws a scatter plot for every pair of features.
Whoa, what even is this... data analysis goes way too fast like this... (sweat)
```python
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# pd.scatter_matrix was removed in newer pandas; pd.plotting.scatter_matrix is the current home
grr = pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=0.8, cmap=mglearn.cm3)
```
The Machine Learning
Let's process it with the k-nearest neighbors method... I did stuff like this way back in the day.
```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
```

```
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
```
...it's become ridiculously easy... I used to grind through machine learning books and implement this by sheer willpower (sweat)
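For old times' sake, what `KNeighborsClassifier(n_neighbors=1)` is doing can be sketched by hand in a few lines (a minimal 1-NN with Euclidean distance; `nn_predict` is just an illustrative name, not part of scikit-learn):

```python
import numpy as np

def nn_predict(X_train, y_train, X_new):
    """Minimal 1-nearest-neighbor: each query gets the label of its closest training point."""
    preds = []
    for x in X_new:
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distance to every training sample
        preds.append(y_train[dists.argmin()])              # take the label of the nearest one
    return np.array(preds)
```

Calling `nn_predict(X_train, y_train, X_new)` should agree with `knn.predict(X_new)` here.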
First, build a feature array from some made-up data
```python
X_new = np.array([[5, 2.9, 1, 0.2]])
X_new.shape
```

```
(1, 4)
```
```python
prediction = knn.predict(X_new)
prediction
```

```
array([0])
```

```python
print(iris_dataset['target_names'][prediction])
```

```
['setosa']
```
Looks about right.
Now feed it the test data.
```python
y_pred = knn.predict(X_test)
y_pred
```

```
array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 2])
```
Something came out. Now compare against the ground truth.
```python
np.mean(y_pred == y_test)
```

```
0.9736842105263158
```
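Incidentally, the estimator's own `score` method computes the same mean accuracy in one call:

```python
knn.score(X_test, y_test)  # same value as np.mean(y_pred == y_test)
```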
Wha... (゚o゚;;
You get this far with just that...