Shuwa System really does put out good books.
Book overview
With data science, I had surveyed the standard techniques, but how do you actually apply them from there?
When you ask what is missing at that point, it tends to be the things textbooks rarely teach: data that, like in the real world, has not been normalized, and how to frame a goal and think your way through to achieving it.
This book is exactly about getting a hands-on feel for that part.
Key points
Inconsistent name spellings, full-width/half-width variation, wobbling date formats, and imputing missing data (see the sketch after this list).
It explains the kinds of predictions you would actually want to make on the job, using machine learning in a practical way: user retention/churn prediction, usage-count prediction, and other common forecasting tasks.
You can learn this on Kaggle *1 as well, but as a Japanese speaker it is enormously helpful to have well-explained, step-by-step instructions in Japanese.
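None of that cleaning is exotic, but seeing it concretely helps. A minimal sketch of the kind of normalization involved (the frame and values here are hypothetical, not from the book):

import unicodedata
import pandas as pd

df = pd.DataFrame({
    'name': ['山田 太郎', '山田　太郎'],    # half-width vs full-width space
    'joined': ['2019/1/5', '2019-01-05'],   # mixed date formats
    'age': [30, None],                      # missing value
})

# NFKC normalization unifies full-width/half-width variants
df['name'] = df['name'].map(lambda s: unicodedata.normalize('NFKC', s))
# element-wise parsing tolerates mixed but parseable date formats
df['joined'] = df['joined'].apply(pd.to_datetime)
# one crude imputation strategy: fill missing ages with the mean
df['age'] = df['age'].fillna(df['age'].mean())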
Working through Chapter 1
Writing up all of them would just spoil the book, so here are only the first 10 knocks.
It explains pandas usage while actually fiddling with the data, so the basics and real practice sit side by side; a very good book, in my personal impression.
Knock 1: Load the data
import pandas as pd
customer_master = pd.read_csv('datas/customer_master.csv')
customer_master.head()
|   | customer_id | customer_name | registration_date | customer_name_kana | email | gender | age | birth | pref |
|---|---|---|---|---|---|---|---|---|---|
| 0 | IK152942 | 平田 裕次郎 | 2019-01-01 00:25:33 | ひらた ゆうじろう | hirata_yuujirou@example.com | M | 29 | 1990/6/10 | 石川県 |
| 1 | TS808488 | 田村 詩織 | 2019-01-01 01:13:45 | たむら しおり | tamura_shiori@example.com | F | 33 | 1986/5/20 | 東京都 |
| 2 | AS834628 | 久野 由樹 | 2019-01-01 02:00:14 | ひさの ゆき | hisano_yuki@example.com | F | 63 | 1956/1/2 | 茨城県 |
| 3 | AS345469 | 鶴岡 薫 | 2019-01-01 04:48:22 | つるおか かおる | tsuruoka_kaoru@example.com | M | 74 | 1945/3/25 | 東京都 |
| 4 | GD892565 | 大内 高史 | 2019-01-01 04:54:51 | おおうち たかし | oouchi_takashi@example.com | M | 54 | 1965/8/5 | 千葉県 |
transaction_1 = pd.read_csv('datas/transaction_1.csv')
transaction_1.head()
|   | transaction_id | price | payment_date | customer_id |
|---|---|---|---|---|
| 0 | T0000000113 | 210000 | 2019-02-01 01:36:57 | PL563502 |
| 1 | T0000000114 | 50000 | 2019-02-01 01:37:23 | HD678019 |
| 2 | T0000000115 | 120000 | 2019-02-01 02:34:19 | HD298120 |
| 3 | T0000000116 | 210000 | 2019-02-01 02:47:23 | IK452215 |
| 4 | T0000000117 | 170000 | 2019-02-01 04:33:46 | PL542865 |
transaction_detail_1 = pd.read_csv('datas/transaction_detail_1.csv')
transaction_detail_1.head()
|   | detail_id | transaction_id | item_id | quantity |
|---|---|---|---|---|
| 0 | 0 | T0000000113 | S005 | 1 |
| 1 | 1 | T0000000114 | S001 | 1 |
| 2 | 2 | T0000000115 | S003 | 1 |
| 3 | 3 | T0000000116 | S005 | 1 |
| 4 | 4 | T0000000117 | S002 | 2 |
Knock 2: Try a UNION
Concatenate two or more datasets vertically.
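As a toy illustration of the operation itself (frames and values made up, not the book's data):

a = pd.DataFrame({'transaction_id': ['T1', 'T2'], 'price': [100, 200]})
b = pd.DataFrame({'transaction_id': ['T3'], 'price': [300]})
# rows of b are appended below a; ignore_index renumbers the index 0..2
pd.concat([a, b], ignore_index=True)

Now the book's data: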
transaction_2 = pd.read_csv('datas/transaction_2.csv')
transaction_2.head()
|   | transaction_id | price | payment_date | customer_id |
|---|---|---|---|---|
| 0 | T0000005113 | 295000 | 2019-06-15 07:20:27 | TS169261 |
| 1 | T0000005114 | 50000 | 2019-06-15 07:35:47 | HI599892 |
| 2 | T0000005115 | 85000 | 2019-06-15 07:56:36 | HI421757 |
| 3 | T0000005116 | 50000 | 2019-06-15 08:40:55 | OA386378 |
| 4 | T0000005117 | 120000 | 2019-06-15 08:44:23 | TS506913 |
transaction = pd.concat([transaction_1, transaction_2], ignore_index=True)
transaction.head()
|   | transaction_id | price | payment_date | customer_id |
|---|---|---|---|---|
| 0 | T0000000113 | 210000 | 2019-02-01 01:36:57 | PL563502 |
| 1 | T0000000114 | 50000 | 2019-02-01 01:37:23 | HD678019 |
| 2 | T0000000115 | 120000 | 2019-02-01 02:34:19 | HD298120 |
| 3 | T0000000116 | 210000 | 2019-02-01 02:47:23 | IK452215 |
| 4 | T0000000117 | 170000 | 2019-02-01 04:33:46 | PL542865 |
transaction_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 transaction_id 5000 non-null object
1 price 5000 non-null int64
2 payment_date 5000 non-null object
3 customer_id 5000 non-null object
dtypes: int64(1), object(3)
memory usage: 156.4+ KB
transaction_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1786 entries, 0 to 1785
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 transaction_id 1786 non-null object
1 price 1786 non-null int64
2 payment_date 1786 non-null object
3 customer_id 1786 non-null object
dtypes: int64(1), object(3)
memory usage: 55.9+ KB
transaction.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6786 entries, 0 to 6785
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 transaction_id 6786 non-null object
1 price 6786 non-null int64
2 payment_date 6786 non-null object
3 customer_id 6786 non-null object
dtypes: int64(1), object(3)
memory usage: 212.2+ KB
transaction_detail_2 = pd.read_csv('datas/transaction_detail_2.csv')
transaction_detail = pd.concat([transaction_detail_1, transaction_detail_2], ignore_index=True)
Knock 3: Join
In a sense this is the same as SQL's JOIN. Note that how='left' is effectively a LEFT OUTER JOIN: every row on the left is kept, and rows without a match get NaN.
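A tiny sketch of what that means in practice (toy frames, hypothetical values):

left = pd.DataFrame({'transaction_id': ['T1', 'T2'], 'quantity': [1, 2]})
right = pd.DataFrame({'transaction_id': ['T1'], 'customer_id': ['A001']})
# T1 picks up its customer_id; T2 has no match, so its row is kept with customer_id = NaN
pd.merge(left, right, on='transaction_id', how='left')

And the book's version: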
join_data = pd.merge(transaction_detail, transaction[['transaction_id', 'payment_date', 'customer_id']], on='transaction_id', how='left')
join_data.head()
|   | detail_id | transaction_id | item_id | quantity | payment_date | customer_id |
|---|---|---|---|---|---|---|
| 0 | 0 | T0000000113 | S005 | 1 | 2019-02-01 01:36:57 | PL563502 |
| 1 | 1 | T0000000114 | S001 | 1 | 2019-02-01 01:37:23 | HD678019 |
| 2 | 2 | T0000000115 | S003 | 1 | 2019-02-01 02:34:19 | HD298120 |
| 3 | 3 | T0000000116 | S005 | 1 | 2019-02-01 02:47:23 | IK452215 |
| 4 | 4 | T0000000117 | S002 | 2 | 2019-02-01 04:33:46 | PL542865 |
join_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7144 entries, 0 to 7143
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 detail_id 7144 non-null int64
1 transaction_id 7144 non-null object
2 item_id 7144 non-null object
3 quantity 7144 non-null int64
4 payment_date 7144 non-null object
5 customer_id 7144 non-null object
dtypes: int64(2), object(4)
memory usage: 390.7+ KB
print(len(transaction_detail))
print(len(transaction))
print(len(join_data))
7144
6786
7144
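The counts match because each detail row finds exactly one transaction. Since a duplicated key on the right would silently multiply rows, it can be worth asserting the expected relationship up front; a sketch using merge's validate argument:

join_data = pd.merge(
    transaction_detail,
    transaction[['transaction_id', 'payment_date', 'customer_id']],
    on='transaction_id',
    how='left',
    validate='many_to_one',  # raises MergeError if transaction_id is not unique on the right
)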
Knock 4: JOIN
Another join. The only difference from the previous knock is whether you merge a column subset, as in data_frame[['select_column', 'select_column']], or the whole frame.
item_master = pd.read_csv('datas/item_master.csv')
join_data = pd.merge(join_data, customer_master, on='customer_id', how='left')
join_data = pd.merge(join_data, item_master, on='item_id', how='left')
join_data.head()
|   | detail_id | transaction_id | item_id | quantity | payment_date | customer_id | customer_name | registration_date | customer_name_kana | email | gender | age | birth | pref | item_name | item_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | T0000000113 | S005 | 1 | 2019-02-01 01:36:57 | PL563502 | 井本 芳正 | 2019-01-07 14:34:35 | いもと よしまさ | imoto_yoshimasa@example.com | M | 30 | 1989/7/15 | 熊本県 | PC-E | 210000 |
| 1 | 1 | T0000000114 | S001 | 1 | 2019-02-01 01:37:23 | HD678019 | 三船 六郎 | 2019-01-27 18:00:11 | みふね ろくろう | mifune_rokurou@example.com | M | 73 | 1945/11/29 | 京都府 | PC-A | 50000 |
| 2 | 2 | T0000000115 | S003 | 1 | 2019-02-01 02:34:19 | HD298120 | 山根 小雁 | 2019-01-11 08:16:02 | やまね こがん | yamane_kogan@example.com | M | 42 | 1977/5/17 | 茨城県 | PC-C | 120000 |
| 3 | 3 | T0000000116 | S005 | 1 | 2019-02-01 02:47:23 | IK452215 | 池田 菜摘 | 2019-01-10 05:07:38 | いけだ なつみ | ikeda_natsumi@example.com | F | 47 | 1972/3/17 | 兵庫県 | PC-E | 210000 |
| 4 | 4 | T0000000117 | S002 | 2 | 2019-02-01 04:33:46 | PL542865 | 栗田 憲一 | 2019-01-25 06:46:05 | くりた けんいち | kurita_kenichi@example.com | M | 74 | 1944/12/17 | 長崎県 | PC-B | 85000 |
Knock 5: Add a calculated column
join_data['price'] = join_data['quantity'] * join_data['item_price']
join_data[['quantity', 'item_price', 'price']].head()
|   | quantity | item_price | price |
|---|---|---|---|
| 0 | 1 | 210000 | 210000 |
| 1 | 1 | 50000 | 50000 |
| 2 | 1 | 120000 | 120000 |
| 3 | 1 | 210000 | 210000 |
| 4 | 2 | 85000 | 170000 |
Knock 6: Aggregate
print(join_data['price'].sum())
print(transaction['price'].sum())
971135000
971135000
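Rather than eyeballing the two totals, the same check is easy to automate; a one-line sketch:

# fail fast if the joins dropped or duplicated any sales
assert join_data['price'].sum() == transaction['price'].sum()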
Knock 7: Compute various statistics
join_data.isnull().sum()
detail_id 0
transaction_id 0
item_id 0
quantity 0
payment_date 0
customer_id 0
customer_name 0
registration_date 0
customer_name_kana 0
email 0
gender 0
age 0
birth 0
pref 0
item_name 0
item_price 0
price 0
dtype: int64
join_data.describe()
|       | detail_id | quantity | age | item_price | price |
|-------|---|---|---|---|---|
| count | 7144.000000 | 7144.000000 | 7144.000000 | 7144.000000 | 7144.000000 |
| mean  | 3571.500000 | 1.199888 | 50.265677 | 121698.628219 | 135937.150056 |
| std   | 2062.439494 | 0.513647 | 17.190314 | 64571.311830 | 68511.453297 |
| min   | 0.000000 | 1.000000 | 20.000000 | 50000.000000 | 50000.000000 |
| 25%   | 1785.750000 | 1.000000 | 36.000000 | 50000.000000 | 85000.000000 |
| 50%   | 3571.500000 | 1.000000 | 50.000000 | 102500.000000 | 120000.000000 |
| 75%   | 5357.250000 | 1.000000 | 65.000000 | 187500.000000 | 210000.000000 |
| max   | 7143.000000 | 4.000000 | 80.000000 | 210000.000000 | 420000.000000 |
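describe() summarizes only numeric columns by default; to get counts, unique values, and the most frequent value for the string columns, you can pass include; a sketch:

# count / unique / top / freq for the object (string) columns
join_data.describe(include=['object'])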
print(join_data['payment_date'].min())
print(join_data['payment_date'].max())
2019-02-01 01:36:57
2019-07-31 23:41:38
Knock 8: Aggregate the data by month
join_data.dtypes
detail_id int64
transaction_id object
item_id object
quantity int64
payment_date object
customer_id object
customer_name object
registration_date object
customer_name_kana object
email object
gender object
age int64
birth object
pref object
item_name object
item_price int64
price int64
dtype: object
join_data['payment_date'] = pd.to_datetime(join_data['payment_date'])
join_data['payment_month'] = join_data['payment_date'].dt.strftime('%Y%m')
join_data[['payment_date', 'payment_month']].head()
|   | payment_date | payment_month |
|---|---|---|
| 0 | 2019-02-01 01:36:57 | 201902 |
| 1 | 2019-02-01 01:37:23 | 201902 |
| 2 | 2019-02-01 02:34:19 | 201902 |
| 3 | 2019-02-01 02:47:23 | 201902 |
| 4 | 2019-02-01 04:33:46 | 201902 |
join_data.groupby('payment_month').sum()['price']
payment_month
201902 160185000
201903 160370000
201904 160510000
201905 155420000
201906 164030000
201907 170620000
Name: price, dtype: int64
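The same monthly rollup works without the intermediate string column by grouping on a monthly Period directly; a sketch:

# group on a Period derived from the datetime column instead of a formatted string
join_data.groupby(join_data['payment_date'].dt.to_period('M'))['price'].sum()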
Knock 9: Aggregate by month and by item
join_data.groupby(['payment_month', 'item_name']).sum()[['price', 'quantity']]
| payment_month | item_name | price | quantity |
|---|---|---|---|
| 201902 | PC-A | 24150000 | 483 |
| 201902 | PC-B | 25245000 | 297 |
| 201902 | PC-C | 19800000 | 165 |
| 201902 | PC-D | 31140000 | 173 |
| 201902 | PC-E | 59850000 | 285 |
| 201903 | PC-A | 26000000 | 520 |
| 201903 | PC-B | 25500000 | 300 |
| 201903 | PC-C | 19080000 | 159 |
| 201903 | PC-D | 25740000 | 143 |
| 201903 | PC-E | 64050000 | 305 |
| 201904 | PC-A | 25900000 | 518 |
| 201904 | PC-B | 23460000 | 276 |
| 201904 | PC-C | 21960000 | 183 |
| 201904 | PC-D | 24300000 | 135 |
| 201904 | PC-E | 64890000 | 309 |
| 201905 | PC-A | 24850000 | 497 |
| 201905 | PC-B | 25330000 | 298 |
| 201905 | PC-C | 20520000 | 171 |
| 201905 | PC-D | 25920000 | 144 |
| 201905 | PC-E | 58800000 | 280 |
| 201906 | PC-A | 26000000 | 520 |
| 201906 | PC-B | 23970000 | 282 |
| 201906 | PC-C | 21840000 | 182 |
| 201906 | PC-D | 28800000 | 160 |
| 201906 | PC-E | 63420000 | 302 |
| 201907 | PC-A | 25250000 | 505 |
| 201907 | PC-B | 28220000 | 332 |
| 201907 | PC-C | 19440000 | 162 |
| 201907 | PC-D | 26100000 | 145 |
| 201907 | PC-E | 71610000 | 341 |
pd.pivot_table(join_data, index='item_name', columns='payment_month', values=['price', 'quantity'], aggfunc='sum')
| item_name | price 201902 | price 201903 | price 201904 | price 201905 | price 201906 | price 201907 | quantity 201902 | quantity 201903 | quantity 201904 | quantity 201905 | quantity 201906 | quantity 201907 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PC-A | 24150000 | 26000000 | 25900000 | 24850000 | 26000000 | 25250000 | 483 | 520 | 518 | 497 | 520 | 505 |
| PC-B | 25245000 | 25500000 | 23460000 | 25330000 | 23970000 | 28220000 | 297 | 300 | 276 | 298 | 282 | 332 |
| PC-C | 19800000 | 19080000 | 21960000 | 20520000 | 21840000 | 19440000 | 165 | 159 | 183 | 171 | 182 | 162 |
| PC-D | 31140000 | 25740000 | 24300000 | 25920000 | 28800000 | 26100000 | 173 | 143 | 135 | 144 | 160 | 145 |
| PC-E | 59850000 | 64050000 | 64890000 | 58800000 | 63420000 | 71610000 | 285 | 305 | 309 | 280 | 302 | 341 |
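The pivot_table call is equivalent to the knock 9 groupby followed by unstack, which makes the relationship between the two explicit; a sketch:

# same numbers as the pivot table: aggregate, then rotate the month level into columns
join_data.groupby(['item_name', 'payment_month'])[['price', 'quantity']].sum().unstack()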
Knock 10: Visualize sales trends by item
graph_data = pd.pivot_table(join_data, index='payment_month', columns='item_name', values='price', aggfunc='sum')
graph_data.head()
| payment_month | PC-A | PC-B | PC-C | PC-D | PC-E |
|---|---|---|---|---|---|
| 201902 | 24150000 | 25245000 | 19800000 | 31140000 | 59850000 |
| 201903 | 26000000 | 25500000 | 19080000 | 25740000 | 64050000 |
| 201904 | 25900000 | 23460000 | 21960000 | 24300000 | 64890000 |
| 201905 | 24850000 | 25330000 | 20520000 | 25920000 | 58800000 |
| 201906 | 26000000 | 23970000 | 21840000 | 28800000 | 63420000 |
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(list(graph_data.index), graph_data['PC-A'], label='PC-A')
plt.plot(list(graph_data.index), graph_data['PC-B'], label='PC-B')
plt.plot(list(graph_data.index), graph_data['PC-C'], label='PC-C')
plt.plot(list(graph_data.index), graph_data['PC-D'], label='PC-D')
plt.plot(list(graph_data.index), graph_data['PC-E'], label='PC-E')
plt.legend()
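As an aside, the five plot calls collapse into a loop over the columns; a minimal equivalent sketch with axis labels added:

# one line per item, same chart as above plus labels
for item in graph_data.columns:
    plt.plot(graph_data.index, graph_data[item], label=item)
plt.legend()
plt.xlabel('payment_month')
plt.ylabel('sales')
plt.show()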