Scraping a Bit More Seriously (Scrapy)

Let's try scraping with Python.
Carelessly scraping a commercial service can run into terms-of-service problems, so for now I tested against my own site.

Preparation

First, install Python and the other prerequisites.
For reference, my Python environment currently looks like this:

Python 3.7.6
Scrapy 2.4.1 (← official link)

Nothing particularly noteworthy here.

Now, install Scrapy.

% pip  install scrapy
Collecting scrapy
  Downloading Scrapy-2.4.1-py2.py3-none-any.whl (239 kB)
     |████████████████████████████████| 239 kB 9.3 MB/s 
... (snip)
Successfully built protego PyDispatcher
Installing collected packages: cssselect, w3lib, parsel, protego, itemadapter, jmespath, itemloaders, zope.interface, pyasn1, pyasn1-modules, service-identity, queuelib, PyDispatcher, incremental, constantly, Automat, hyperlink, PyHamcrest, Twisted, scrapy
Successfully installed Automat-20.2.0 PyDispatcher-2.0.5 PyHamcrest-2.0.2 Twisted-20.3.0 constantly-15.1.0 cssselect-1.1.0 hyperlink-20.0.1 incremental-17.5.0 itemadapter-0.2.0 itemloaders-1.0.4 jmespath-0.10.0 parsel-1.6.0 protego-0.1.16 pyasn1-0.4.8 pyasn1-modules-0.2.8 queuelib-1.5.0 scrapy-2.4.1 service-identity-18.1.0 w3lib-1.22.0 zope.interface-5.2.0

Then check the version:

% scrapy version
Scrapy 2.4.1

Starting the project

First, create the project.

% scrapy startproject white_azalea
New Scrapy project 'white_azalea', using template directory '/Users/armeria/.pyenv/versions/anaconda3-2020.02/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /Users/xxxxx/workspace/scraping/white_azalea

You can start your first spider with:
    cd white_azalea
    scrapy genspider example example.com


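After a quick look, the pieces of the generated project are roughly:

spiders : defines the requests sent to the target site and how the responses are parsed
items : defines the shape of the data to extract
pipeline : processes the Items coming out of the spiders, e.g. saving them to a file

Now let's work toward the spider definition.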

Defining the blog data type

Quietly rewrite items.py.
The generated file comes with a default class name, but the name itself does not seem to matter much.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

class BlogPost(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    date = scrapy.Field()

That should do.

Adding a Spider

This one is done from the command line.

% cd white_azalea 
% scrapy genspider hatenablog white-azalea.hatenablog.jp 
Created spider 'hatenablog' using template 'basic' in module:
  white_azalea.spiders.hatenablog

This should generate a hatenablog.py in the spiders directory.


Its contents look like this:

import scrapy


class HatenablogSpider(scrapy.Spider):
    name = 'hatenablog'
    allowed_domains = ['white-azalea.hatenablog.jp']
    start_urls = ['http://white-azalea.hatenablog.jp/']

    def parse(self, response):
        pass

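It is pretty much what it looks like, so the processing goes into parse.
Elements can be picked up with CSS selector syntax.

The markup of this site looks like this: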

<div class="entry-inner">
  <header class="entry-header">
    <div class="date entry-date first">
      <a
        href="https://white-azalea.hatenablog.jp/archive/2020/12/21"
        rel="nofollow"
      >
        <time datetime="2020-12-21T11:44:18Z" title="2020-12-21T11:44:18Z">
          <span class="date-year">2020</span><span class="hyphen">-</span
          ><span class="date-month">12</span><span class="hyphen">-</span
          ><span class="date-day">21</span>
        </time>
      </a>
    </div>
    <h1 class="entry-title">
      <a
        href="https://white-azalea.hatenablog.jp/entry/2020/12/21/204418"
        class="entry-title-link bookmark"
        >も少し真面目にスクレイピング(Scrapy)</a
      >
    </h1>
    <!-- 略 -->
  </header>

  <div class="entry-content">
    <!-- 略 -->
  </div>

  <footer class="entry-footer">
    <div class="entry-tags-wrapper">
      <div class="entry-tags"></div>
    </div>

    <p class="entry-footer-section">
      <span class="author vcard"
        ><span class="fn" data-load-nickname="1" data-user-name="white-azalea"
          ><span class="user-name-nickname">しろつつじー</span>
          <span class="user-name-paren">(</span
          ><span class="user-name-hatena-id">id:white-azalea</span
          ><span class="user-name-paren">)</span></span
        ></span
      >
      <span class="entry-footer-time"
        ><a href="https://white-azalea.hatenablog.jp/entry/2020/12/21/204418"
          ><time
            data-relative=""
            datetime="2020-12-21T11:44:18Z"
            title="2020-12-21T11:44:18Z"
            class="updated"
            >10分前</time
          ></a
        ></span
      >
    </p>

    <!-- 略 -->
  </footer>
</div>

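One post : article[data-publication-type="entry"] (the enclosing article element is not shown in the excerpt above)
Title : a.entry-title-link.bookmark::text
URL : a.entry-title-link.bookmark::attr(href)
Updated at : time.updated::attr(title)

These selectors are enough to extract the fields.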

import scrapy
from white_azalea.items import BlogPost


class HatenablogSpider(scrapy.Spider):
    name = 'hatenablog'
    allowed_domains = ['white-azalea.hatenablog.jp']
    start_urls = ['https://white-azalea.hatenablog.jp/']

    def parse(self, response):
        """Parse a response.

        Args:
            response (scrapy.http.Response): the Scrapy response object
        """
        # Find each post with a CSS selector
        for post in response.css('article[data-publication-type="entry"]'):
            # Child elements are selected with CSS selectors as well.
            # default='' avoids calling strip() on None when an element is missing.
            yield BlogPost(
                url=post.css('a.entry-title-link.bookmark::attr(href)').get(default='').strip(),
                title=post.css('a.entry-title-link.bookmark::text').get(default='').strip(),
                date=post.css('time.updated::attr(title)').get(default='').strip(),
            )

        # Follow the link to the next page
        next_page_link = response.css('a[rel="next"]::attr(href)').get()
        if next_page_link is None:
            return  # no next page to follow

        # Convert a relative path into an absolute URL
        next_page_link = response.urljoin(next_page_link)

        # On to the next page
        yield scrapy.Request(next_page_link, callback=self.parse)
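
As a minimal sketch of what the relative-to-absolute step does, here is the same resolution using only the standard library. Scrapy's documentation describes response.urljoin as a wrapper over urllib.parse.urljoin; the helper name resolve_next and the page number in the hrefs are made up for illustration, and the base URL merely mirrors the format seen in the crawl log.

```python
from urllib.parse import urljoin

# Assumed example page; the ?page=... format mirrors the crawl log
current_page = "https://white-azalea.hatenablog.jp/?page=1569584075"


def resolve_next(base_url, href):
    """Resolve an href found on base_url into an absolute URL (hypothetical helper)."""
    return urljoin(base_url, href)


# A root-relative path, a query-only reference, and an already absolute URL
relative = resolve_next(current_page, "/?page=1550841650")
query_only = resolve_next(current_page, "?page=1550841650")
absolute = resolve_next(current_page, "https://white-azalea.hatenablog.jp/?page=1550841650")

print(relative)
print(query_only)
print(absolute)
```

All three forms resolve to the same absolute URL, which is why the spider can call urljoin unconditionally without first checking whether the link is relative.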

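Throttling the request rate

Firing off requests without any limit amounts to a DoS attack and can bring legal trouble, so add a limit.
Concretely, enable some of the settings that are commented out in settings.py.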

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3

# (snip)

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

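This waits 3 seconds after each download before starting the next one,
and caches every page once it has been downloaded (HTTPCACHE_EXPIRATION_SECS = 0 means cached pages never expire).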
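
To get a feel for what the delay costs, here is a rough back-of-the-envelope sketch. It assumes RANDOMIZE_DOWNLOAD_DELAY is left at its default (enabled), in which case Scrapy waits between 0.5 and 1.5 times DOWNLOAD_DELAY, that pages are fetched one after another as in this paginated crawl, and that nothing is served from the cache yet. The function name and the page count of 11 are hypothetical, and actual download time is ignored.

```python
def estimate_wait_bounds(page_count, download_delay=3.0):
    """Return (min_seconds, max_seconds) spent waiting between sequential requests.

    Assumes the randomized delay range of 0.5x to 1.5x DOWNLOAD_DELAY.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    gaps = page_count - 1  # no wait is needed before the first request
    return (gaps * download_delay * 0.5, gaps * download_delay * 1.5)


# Hypothetical crawl of 11 index pages with DOWNLOAD_DELAY = 3
low, high = estimate_wait_bounds(page_count=11, download_delay=3.0)
print(f"waiting time: between {low:.0f}s and {high:.0f}s")
```

With the HTTP cache enabled, a second run should be much faster, since cached responses (marked ['cached'] in the log) do not need to hit the site again.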

Running it so far

Running the scrapy crawl hatenablog command gives output like this:

2020-12-21 20:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 https://white-azalea.hatenablog.jp/?page=1569584075>
{'date': '2019-09-15T07:59:24Z',
 'title': 'データサイエンスの勉強(データをざっくり眺める)',
 'url': 'https://white-azalea.hatenablog.jp/entry/2019/09/15/165924'}
2020-12-21 20:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 https://white-azalea.hatenablog.jp/?page=1569584075>
{'date': '2019-09-05T13:35:35Z',
 'title': 'Elixir の薄い同人本読んでみた',
 'url': 'https://white-azalea.hatenablog.jp/entry/2019/09/05/223535'}
2020-12-21 20:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 https://white-azalea.hatenablog.jp/?page=1569584075>
{'date': '2019-08-25T11:33:34Z',
 'title': 'Framework 離れて 1 年',
 'url': 'https://white-azalea.hatenablog.jp/entry/2019/08/25/203334'}
2020-12-21 20:27:11 [scrapy.core.scraper] DEBUG: Scraped from <200 https://white-azalea.hatenablog.jp/?page=1569584075>
{'date': '2019-08-20T15:30:18Z',
 'title': 'Matplotlib の使い方近辺',
 'url': 'https://white-azalea.hatenablog.jp/entry/2019/08/21/003018'}

... (rest omitted)

Incidentally, the results can be written to a file as-is.

scrapy crawl hatenablog -o data.csv

date,title,url
2020-12-20T05:41:55Z,Chrome拡張をやってみる,https://white-azalea.hatenablog.jp/entry/2020/12/20/144155
2020-12-17T10:38:18Z,Markdown や Restructured Text ですべてのドキュメントを解決したい Pandoc,https://white-azalea.hatenablog.jp/entry/2020/12/17/193818
2020-12-16T12:00:05Z,プレーテキストを校正チェックする,https://white-azalea.hatenablog.jp/entry/2020/12/16/210005
2020-12-14T11:13:07Z,Ui Path で簡単スクレイピング,https://white-azalea.hatenablog.jp/entry/2020/12/14/201307
2020-10-23T12:29:26Z,LightningWebComponent(OSS)で Bootstrap を読み込む,https://white-azalea.hatenablog.jp/entry/2020/10/23/212926
2020-10-18T03:14:27Z,拡張可能な 4TB のNASサーバをほぼ２万で,https://white-azalea.hatenablog.jp/entry/2020/10/18/121427

... (rest omitted)
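
As a small sketch of consuming that export afterwards, the following reads the CSV with the standard library and converts the UTC timestamps to Japan time. The two sample rows are copied from the excerpt above; the function name load_posts and the posted_at key are made up for illustration, and reading the real file assumes data.csv is UTF-8 encoded.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))

# Two rows copied from the data.csv excerpt. For the real file, something like
# open("data.csv", encoding="utf-8", newline="") could be passed instead.
SAMPLE = """date,title,url
2020-12-20T05:41:55Z,Chrome拡張をやってみる,https://white-azalea.hatenablog.jp/entry/2020/12/20/144155
2020-12-14T11:13:07Z,Ui Path で簡単スクレイピング,https://white-azalea.hatenablog.jp/entry/2020/12/14/201307
"""


def load_posts(fileobj):
    """Read exported rows and attach a timezone-aware JST datetime to each."""
    posts = []
    for row in csv.DictReader(fileobj):
        # The exported 'date' column is UTC with a trailing 'Z'
        utc = datetime.strptime(row["date"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        posts.append({"title": row["title"], "url": row["url"], "posted_at": utc.astimezone(JST)})
    return posts


posts = load_posts(io.StringIO(SAMPLE))
for post in posts:
    print(post["posted_at"].strftime("%Y-%m-%d %H:%M:%S"), post["title"])
```

The converted times line up with the time component embedded in each entry URL (for example 144155), which is a handy sanity check that the scraped date column really is UTC.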

At this point I'm already fairly satisfied, but let's implement a pipeline as well.

Implementing a pipeline

Opening pipelines.py, the default looks like this:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class WhiteAzaleaPipeline:
    def process_item(self, item, spider):
        return item

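Saving to a file can just be done via the CSV export anyway. The name "pipe" suggests it is meant for processing items in real time as they arrive, but real-time processing isn't needed this time, so here is a rough version.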

class WhiteAzaleaPipeline:
    def process_item(self, item, spider):
        # Saving to a file would work too, but the goal is only to see how a pipeline behaves, so keep it simple
        title = item['title']
        url = item['url']
        postdate = item['date']
        print(f"Title: {title} (at {postdate}, URL: {url})")
        return item

Next, in settings.py set

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'white_azalea.pipelines.WhiteAzaleaPipeline': 300,
}

and run the crawl again; each entry now shows up in the formatted form.

2020-12-21 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://white-azalea.hatenablog.jp/?page=1550841650>
{'date': '2019-02-06T06:13:59Z',
 'title': '微分を使ってパラメータを求める（最急降下法or勾配降下法）',
 'url': 'https://white-azalea.hatenablog.jp/entry/2019/02/06/151359'}
2020-12-21 20:41:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://white-azalea.hatenablog.jp/?page=1549433639> (referer: https://white-azalea.hatenablog.jp/?page=1550841650) ['cached']
Title: ゼロDeep４章のニューラルネットワーク学習 (at 2019-02-03T16:05:30Z, URL: https://white-azalea.hatenablog.jp/entry/2019/02/04/010530)
2020-12-21 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://white-azalea.hatenablog.jp/?page=1549433639>
{'date': '2019-02-03T16:05:30Z',
 'title': 'ゼロDeep４章のニューラルネットワーク学習',
 'url': 'https://white-azalea.hatenablog.jp/entry/2019/02/04/010530'}
Title: Checkstyle の Metrics 近辺の内容を日本語で分かり易く (at 2019-02-02T06:40:51Z, URL: https://white-azalea.hatenablog.jp/entry/2019/02/02/154051)
2020-12-21 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://white-azalea.hatenablog.jp/?page=1549433639>
{'date': '2019-02-02T06:40:51Z',
 'title': 'Checkstyle の Metrics 近辺の内容を日本語で分かり易く',
 'url': 'https://white-azalea.hatenablog.jp/entry/2019/02/02/154051'}
Title: 今日やった事メモ (at 2019-01-27T13:35:33Z, URL: https://white-azalea.hatenablog.jp/entry/2019/01/27/223533)
2020-12-21 20:41:49 [scrapy.core.scraper] DEBUG: Scraped from <200 https://white-azalea.hatenablog.jp/?page=1549433639>
{'date': '2019-01-27T13:35:33Z',
 'title': '今日やった事メモ',
 'url': 'https://white-azalea.hatenablog.jp/entry/2019/01/27/223533'}
Title: 単体テストコードの書きやすい設計入門 (at 2019-01-26T02:28:48Z, URL: https://white-azalea.hatenablog.jp/entry/2019/01/26/112848)
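
A pipeline can also keep state across items. The following is a hedged sketch, not part of the project above: a hypothetical PostsPerYearPipeline that tallies posts by year using the open_spider and close_spider hooks that Scrapy calls on pipeline components. To use it in the project it would additionally need an entry in ITEM_PIPELINES. The stand-alone check at the bottom uses plain dicts in place of BlogPost items and passes spider=None, which is only for illustration.

```python
from collections import Counter


class PostsPerYearPipeline:
    """Hypothetical pipeline that tallies scraped posts by year."""

    def open_spider(self, spider):
        # Called once when the spider starts
        self.per_year = Counter()

    def process_item(self, item, spider):
        # 'date' looks like '2019-02-03T16:05:30Z', so the first four characters are the (UTC) year
        self.per_year[item["date"][:4]] += 1
        return item  # pass the item on unchanged

    def close_spider(self, spider):
        # Called once when the spider finishes
        for year, count in sorted(self.per_year.items()):
            print(f"{year}: {count} posts")


# Stand-alone check using plain dicts in place of BlogPost items
pipeline = PostsPerYearPipeline()
pipeline.open_spider(spider=None)
for date in ["2019-02-03T16:05:30Z", "2019-01-27T13:35:33Z", "2020-12-20T05:41:55Z"]:
    pipeline.process_item({"date": date}, spider=None)
pipeline.close_spider(spider=None)
```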

Note: if you scrape a site whose terms of service forbid it, you do so at your own risk. Stock information, for instance, is quite often off-limits.
