Shuwa System really does put out good books.
Book overview
With data science, I had surveyed the standard techniques, but how do you actually apply them from there?
When you ask what is missing at that point, it tends to be the things textbooks rarely teach: data that, like in the real world, has not been normalized, and how to frame a goal and think your way through to achieving it.
This book is exactly about getting a hands-on feel for that part.
Key points
Inconsistent name spellings, full-width/half-width variation, wobbling date formats, and imputing missing data (see the sketch after this list).
It explains the kinds of predictions you would actually want to make on the job, using machine learning in a practical way: user retention/churn prediction, usage-count prediction, and other common forecasting tasks.
You can learn this on Kaggle *1 as well, but as a Japanese speaker it is enormously helpful to have well-explained, step-by-step instructions in Japanese.
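None of that cleaning is exotic, but seeing it concretely helps. A minimal sketch of the kind of normalization involved (the frame and values here are hypothetical, not from the book):

import unicodedata
import pandas as pd

df = pd.DataFrame({
    'name': ['山田 太郎', '山田　太郎'],    # half-width vs full-width space
    'joined': ['2019/1/5', '2019-01-05'],   # mixed date formats
    'age': [30, None],                      # missing value
})

# NFKC normalization unifies full-width/half-width variants
df['name'] = df['name'].map(lambda s: unicodedata.normalize('NFKC', s))
# element-wise parsing tolerates mixed but parseable date formats
df['joined'] = df['joined'].apply(pd.to_datetime)
# one crude imputation strategy: fill missing ages with the mean
df['age'] = df['age'].fillna(df['age'].mean())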
Working through Chapter 1
Writing up all of them would just spoil the book, so here are only the first 10 knocks.
It explains pandas usage while actually fiddling with the data, so the basics and real practice sit side by side; a very good book, in my personal impression.
Knock 1: Load the data
import pandas as pd
customer_master = pd.read_csv('datas/customer_master.csv')
customer_master.head()
|   | customer_id | customer_name | registration_date | customer_name_kana | email | gender | age | birth | pref |
|---|---|---|---|---|---|---|---|---|---|
| 0 | IK152942 | 平田 裕次郎 | 2019-01-01 00:25:33 | ひらた ゆうじろう | hirata_yuujirou@example.com | M | 29 | 1990/6/10 | 石川県 |
| 1 | TS808488 | 田村 詩織 | 2019-01-01 01:13:45 | たむら しおり | tamura_shiori@example.com | F | 33 | 1986/5/20 | 東京都 |
| 2 | AS834628 | 久野 由樹 | 2019-01-01 02:00:14 | ひさの ゆき | hisano_yuki@example.com | F | 63 | 1956/1/2 | 茨城県 |
| 3 | AS345469 | 鶴岡 薫 | 2019-01-01 04:48:22 | つるおか かおる | tsuruoka_kaoru@example.com | M | 74 | 1945/3/25 | 東京都 |
| 4 | GD892565 | 大内 高史 | 2019-01-01 04:54:51 | おおうち たかし | oouchi_takashi@example.com | M | 54 | 1965/8/5 | 千葉県 |
transaction_1 = pd.read_csv('datas/transaction_1.csv')
transaction_1.head()
|   | transaction_id | price | payment_date | customer_id |
|---|---|---|---|---|
| 0 | T0000000113 | 210000 | 2019-02-01 01:36:57 | PL563502 |
| 1 | T0000000114 | 50000 | 2019-02-01 01:37:23 | HD678019 |
| 2 | T0000000115 | 120000 | 2019-02-01 02:34:19 | HD298120 |
| 3 | T0000000116 | 210000 | 2019-02-01 02:47:23 | IK452215 |
| 4 | T0000000117 | 170000 | 2019-02-01 04:33:46 | PL542865 |
transaction_detail_1 = pd.read_csv('datas/transaction_detail_1.csv')
transaction_detail_1.head()
|   | detail_id | transaction_id | item_id | quantity |
|---|---|---|---|---|
| 0 | 0 | T0000000113 | S005 | 1 |
| 1 | 1 | T0000000114 | S001 | 1 |
| 2 | 2 | T0000000115 | S003 | 1 |
| 3 | 3 | T0000000116 | S005 | 1 |
| 4 | 4 | T0000000117 | S002 | 2 |
Knock 2: Try a UNION
Concatenate two or more datasets vertically.
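As a toy illustration of the operation itself (frames and values made up, not the book's data):

a = pd.DataFrame({'transaction_id': ['T1', 'T2'], 'price': [100, 200]})
b = pd.DataFrame({'transaction_id': ['T3'], 'price': [300]})
# rows of b are appended below a; ignore_index renumbers the index 0..2
pd.concat([a, b], ignore_index=True)

Now the book's data: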
transaction_2 = pd.read_csv('datas/transaction_2.csv')
transaction_2.head()
|   | transaction_id | price | payment_date | customer_id |
|---|---|---|---|---|
| 0 | T0000005113 | 295000 | 2019-06-15 07:20:27 | TS169261 |
| 1 | T0000005114 | 50000 | 2019-06-15 07:35:47 | HI599892 |
| 2 | T0000005115 | 85000 | 2019-06-15 07:56:36 | HI421757 |
| 3 | T0000005116 | 50000 | 2019-06-15 08:40:55 | OA386378 |
| 4 | T0000005117 | 120000 | 2019-06-15 08:44:23 | TS506913 |
transaction = pd.concat([transaction_1, transaction_2], ignore_index=True)
transaction.head()
|   | transaction_id | price | payment_date | customer_id |
|---|---|---|---|---|
| 0 | T0000000113 | 210000 | 2019-02-01 01:36:57 | PL563502 |
| 1 | T0000000114 | 50000 | 2019-02-01 01:37:23 | HD678019 |
| 2 | T0000000115 | 120000 | 2019-02-01 02:34:19 | HD298120 |
| 3 | T0000000116 | 210000 | 2019-02-01 02:47:23 | IK452215 |
| 4 | T0000000117 | 170000 | 2019-02-01 04:33:46 | PL542865 |
transaction_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 transaction_id 5000 non-null object
1 price 5000 non-null int64
2 payment_date 5000 non-null object
3 customer_id 5000 non-null object
dtypes: int64(1), object(3)
memory usage: 156.4+ KB
transaction_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1786 entries, 0 to 1785
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 transaction_id 1786 non-null object
1 price 1786 non-null int64
2 payment_date 1786 non-null object
3 customer_id 1786 non-null object
dtypes: int64(1), object(3)
memory usage: 55.9+ KB
transaction.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6786 entries, 0 to 6785
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 transaction_id 6786 non-null object
1 price 6786 non-null int64
2 payment_date 6786 non-null object
3 customer_id 6786 non-null object
dtypes: int64(1), object(3)
memory usage: 212.2+ KB
transaction_detail_2 = pd.read_csv('datas/transaction_detail_2.csv')
transaction_detail = pd.concat([transaction_detail_1, transaction_detail_2], ignore_index=True)
Knock 3: Join
In a sense this is the same as SQL's JOIN. Note that how='left' is effectively a LEFT OUTER JOIN: every row on the left is kept, and rows without a match get NaN.
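A tiny sketch of what that means in practice (toy frames, hypothetical values):

left = pd.DataFrame({'transaction_id': ['T1', 'T2'], 'quantity': [1, 2]})
right = pd.DataFrame({'transaction_id': ['T1'], 'customer_id': ['A001']})
# T1 picks up its customer_id; T2 has no match, so its row is kept with customer_id = NaN
pd.merge(left, right, on='transaction_id', how='left')

And the book's version: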
join_data = pd.merge(transaction_detail, transaction[['transaction_id', 'payment_date', 'customer_id']], on='transaction_id', how='left')
join_data.head()
|   | detail_id | transaction_id | item_id | quantity | payment_date | customer_id |
|---|---|---|---|---|---|---|
| 0 | 0 | T0000000113 | S005 | 1 | 2019-02-01 01:36:57 | PL563502 |
| 1 | 1 | T0000000114 | S001 | 1 | 2019-02-01 01:37:23 | HD678019 |
| 2 | 2 | T0000000115 | S003 | 1 | 2019-02-01 02:34:19 | HD298120 |
| 3 | 3 | T0000000116 | S005 | 1 | 2019-02-01 02:47:23 | IK452215 |
| 4 | 4 | T0000000117 | S002 | 2 | 2019-02-01 04:33:46 | PL542865 |
join_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7144 entries, 0 to 7143
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 detail_id 7144 non-null int64
1 transaction_id 7144 non-null object
2 item_id 7144 non-null object
3 quantity 7144 non-null int64
4 payment_date 7144 non-null object
5 customer_id 7144 non-null object
dtypes: int64(2), object(4)
memory usage: 390.7+ KB
print(len(transaction_detail))
print(len(transaction))
print(len(join_data))
7144
6786
7144
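The counts match because each detail row finds exactly one transaction. Since a duplicated key on the right would silently multiply rows, it can be worth asserting the expected relationship up front; a sketch using merge's validate argument:

join_data = pd.merge(
    transaction_detail,
    transaction[['transaction_id', 'payment_date', 'customer_id']],
    on='transaction_id',
    how='left',
    validate='many_to_one',  # raises MergeError if transaction_id is not unique on the right
)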
Knock 4: JOIN
Another join. The only difference from the previous knock is whether you merge a column subset, as in data_frame[['select_column', 'select_column']], or the whole frame.
item_master = pd.read_csv('datas/item_master.csv')
join_data = pd.merge(join_data, customer_master, on='customer_id', how='left')
join_data = pd.merge(join_data, item_master, on='item_id', how='left')
join_data.head()
|   | detail_id | transaction_id | item_id | quantity | payment_date | customer_id | customer_name | registration_date | customer_name_kana | email | gender | age | birth | pref | item_name | item_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | T0000000113 | S005 | 1 | 2019-02-01 01:36:57 | PL563502 | 井本 芳正 | 2019-01-07 14:34:35 | いもと よしまさ | imoto_yoshimasa@example.com | M | 30 | 1989/7/15 | 熊本県 | PC-E | 210000 |
| 1 | 1 | T0000000114 | S001 | 1 | 2019-02-01 01:37:23 | HD678019 | 三船 六郎 | 2019-01-27 18:00:11 | みふね ろくろう | mifune_rokurou@example.com | M | 73 | 1945/11/29 | 京都府 | PC-A | 50000 |
| 2 | 2 | T0000000115 | S003 | 1 | 2019-02-01 02:34:19 | HD298120 | 山根 小雁 | 2019-01-11 08:16:02 | やまね こがん | yamane_kogan@example.com | M | 42 | 1977/5/17 | 茨城県 | PC-C | 120000 |
| 3 | 3 | T0000000116 | S005 | 1 | 2019-02-01 02:47:23 | IK452215 | 池田 菜摘 | 2019-01-10 05:07:38 | いけだ なつみ | ikeda_natsumi@example.com | F | 47 | 1972/3/17 | 兵庫県 | PC-E | 210000 |
| 4 | 4 | T0000000117 | S002 | 2 | 2019-02-01 04:33:46 | PL542865 | 栗田 憲一 | 2019-01-25 06:46:05 | くりた けんいち | kurita_kenichi@example.com | M | 74 | 1944/12/17 | 長崎県 | PC-B | 85000 |
Knock 5: Add a calculated column
join_data['price'] = join_data['quantity'] * join_data['item_price']
join_data[['quantity', 'item_price', 'price']].head()
|   | quantity | item_price | price |
|---|---|---|---|
| 0 | 1 | 210000 | 210000 |
| 1 | 1 | 50000 | 50000 |
| 2 | 1 | 120000 | 120000 |
| 3 | 1 | 210000 | 210000 |
| 4 | 2 | 85000 | 170000 |
Knock 6: Aggregate
print(join_data['price'].sum())
print(transaction['price'].sum())
971135000
971135000
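Rather than eyeballing the two totals, the same check is easy to automate; a one-line sketch:

# fail fast if the joins dropped or duplicated any sales
assert join_data['price'].sum() == transaction['price'].sum()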
Knock 7: Compute various statistics
join_data.isnull().sum()
detail_id 0
transaction_id 0
item_id 0
quantity 0
payment_date 0
customer_id 0
customer_name 0
registration_date 0
customer_name_kana 0
email 0
gender 0
age 0
birth 0
pref 0
item_name 0
item_price 0
price 0
dtype: int64
join_data.describe()
|       | detail_id | quantity | age | item_price | price |
|-------|---|---|---|---|---|
| count | 7144.000000 | 7144.000000 | 7144.000000 | 7144.000000 | 7144.000000 |
| mean  | 3571.500000 | 1.199888 | 50.265677 | 121698.628219 | 135937.150056 |
| std   | 2062.439494 | 0.513647 | 17.190314 | 64571.311830 | 68511.453297 |
| min   | 0.000000 | 1.000000 | 20.000000 | 50000.000000 | 50000.000000 |
| 25%   | 1785.750000 | 1.000000 | 36.000000 | 50000.000000 | 85000.000000 |
| 50%   | 3571.500000 | 1.000000 | 50.000000 | 102500.000000 | 120000.000000 |
| 75%   | 5357.250000 | 1.000000 | 65.000000 | 187500.000000 | 210000.000000 |
| max   | 7143.000000 | 4.000000 | 80.000000 | 210000.000000 | 420000.000000 |
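describe() summarizes only numeric columns by default; to get counts, unique values, and the most frequent value for the string columns, you can pass include; a sketch:

# count / unique / top / freq for the object (string) columns
join_data.describe(include=['object'])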
print(join_data['payment_date'].min())
print(join_data['payment_date'].max())
2019-02-01 01:36:57
2019-07-31 23:41:38
Knock 8: Aggregate the data by month
join_data.dtypes
detail_id int64
transaction_id object
item_id object
quantity int64
payment_date object
customer_id object
customer_name object
registration_date object
customer_name_kana object
email object
gender object
age int64
birth object
pref object
item_name object
item_price int64
price int64
dtype: object
join_data['payment_date'] = pd.to_datetime(join_data['payment_date'])
join_data['payment_month'] = join_data['payment_date'].dt.strftime('%Y%m')
join_data[['payment_date', 'payment_month']].head()
|   | payment_date | payment_month |
|---|---|---|
| 0 | 2019-02-01 01:36:57 | 201902 |
| 1 | 2019-02-01 01:37:23 | 201902 |
| 2 | 2019-02-01 02:34:19 | 201902 |
| 3 | 2019-02-01 02:47:23 | 201902 |
| 4 | 2019-02-01 04:33:46 | 201902 |
join_data.groupby('payment_month').sum()['price']
payment_month
201902 160185000
201903 160370000
201904 160510000
201905 155420000
201906 164030000
201907 170620000
Name: price, dtype: int64
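The same monthly rollup works without the intermediate string column by grouping on a monthly Period directly; a sketch:

# group on a Period derived from the datetime column instead of a formatted string
join_data.groupby(join_data['payment_date'].dt.to_period('M'))['price'].sum()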
Knock 9: Aggregate by month and by item
join_data.groupby(['payment_month', 'item_name']).sum()[['price', 'quantity']]
| payment_month | item_name | price | quantity |
|---|---|---|---|
| 201902 | PC-A | 24150000 | 483 |
| 201902 | PC-B | 25245000 | 297 |
| 201902 | PC-C | 19800000 | 165 |
| 201902 | PC-D | 31140000 | 173 |
| 201902 | PC-E | 59850000 | 285 |
| 201903 | PC-A | 26000000 | 520 |
| 201903 | PC-B | 25500000 | 300 |
| 201903 | PC-C | 19080000 | 159 |
| 201903 | PC-D | 25740000 | 143 |
| 201903 | PC-E | 64050000 | 305 |
| 201904 | PC-A | 25900000 | 518 |
| 201904 | PC-B | 23460000 | 276 |
| 201904 | PC-C | 21960000 | 183 |
| 201904 | PC-D | 24300000 | 135 |
| 201904 | PC-E | 64890000 | 309 |
| 201905 | PC-A | 24850000 | 497 |
| 201905 | PC-B | 25330000 | 298 |
| 201905 | PC-C | 20520000 | 171 |
| 201905 | PC-D | 25920000 | 144 |
| 201905 | PC-E | 58800000 | 280 |
| 201906 | PC-A | 26000000 | 520 |
| 201906 | PC-B | 23970000 | 282 |
| 201906 | PC-C | 21840000 | 182 |
| 201906 | PC-D | 28800000 | 160 |
| 201906 | PC-E | 63420000 | 302 |
| 201907 | PC-A | 25250000 | 505 |
| 201907 | PC-B | 28220000 | 332 |
| 201907 | PC-C | 19440000 | 162 |
| 201907 | PC-D | 26100000 | 145 |
| 201907 | PC-E | 71610000 | 341 |
pd.pivot_table(join_data, index='item_name', columns='payment_month', values=['price', 'quantity'], aggfunc='sum')
| item_name | price 201902 | price 201903 | price 201904 | price 201905 | price 201906 | price 201907 | quantity 201902 | quantity 201903 | quantity 201904 | quantity 201905 | quantity 201906 | quantity 201907 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PC-A | 24150000 | 26000000 | 25900000 | 24850000 | 26000000 | 25250000 | 483 | 520 | 518 | 497 | 520 | 505 |
| PC-B | 25245000 | 25500000 | 23460000 | 25330000 | 23970000 | 28220000 | 297 | 300 | 276 | 298 | 282 | 332 |
| PC-C | 19800000 | 19080000 | 21960000 | 20520000 | 21840000 | 19440000 | 165 | 159 | 183 | 171 | 182 | 162 |
| PC-D | 31140000 | 25740000 | 24300000 | 25920000 | 28800000 | 26100000 | 173 | 143 | 135 | 144 | 160 | 145 |
| PC-E | 59850000 | 64050000 | 64890000 | 58800000 | 63420000 | 71610000 | 285 | 305 | 309 | 280 | 302 | 341 |
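The pivot_table call is equivalent to the knock 9 groupby followed by unstack, which makes the relationship between the two explicit; a sketch:

# same numbers as the pivot table: aggregate, then rotate the month level into columns
join_data.groupby(['item_name', 'payment_month'])[['price', 'quantity']].sum().unstack()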
Knock 10: Visualize sales trends by item
graph_data = pd.pivot_table(join_data, index='payment_month', columns='item_name', values='price', aggfunc='sum')
graph_data.head()
| payment_month | PC-A | PC-B | PC-C | PC-D | PC-E |
|---|---|---|---|---|---|
| 201902 | 24150000 | 25245000 | 19800000 | 31140000 | 59850000 |
| 201903 | 26000000 | 25500000 | 19080000 | 25740000 | 64050000 |
| 201904 | 25900000 | 23460000 | 21960000 | 24300000 | 64890000 |
| 201905 | 24850000 | 25330000 | 20520000 | 25920000 | 58800000 |
| 201906 | 26000000 | 23970000 | 21840000 | 28800000 | 63420000 |
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(list(graph_data.index), graph_data['PC-A'], label='PC-A')
plt.plot(list(graph_data.index), graph_data['PC-B'], label='PC-B')
plt.plot(list(graph_data.index), graph_data['PC-C'], label='PC-C')
plt.plot(list(graph_data.index), graph_data['PC-D'], label='PC-D')
plt.plot(list(graph_data.index), graph_data['PC-E'], label='PC-E')
plt.legend()
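As an aside, the five plot calls collapse into a loop over the columns; a minimal equivalent sketch with axis labels added:

# one line per item, same chart as above plus labels
for item in graph_data.columns:
    plt.plot(graph_data.index, graph_data[item], label=item)
plt.legend()
plt.xlabel('payment_month')
plt.ylabel('sales')
plt.show()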