Scraping notes

splash
python

Using Splash

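For sites that are rendered with JavaScript, offer no usable API, and do not update frequently, Splash is more convenient than Selenium.
With Docker, reuse the existing image and container instead of creating new ones each time (to cope with limited PC specs).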

# List containers
docker ps -a
# Start a container
docker start <container ID or name>

Splash (script to work around 504 errors)

Abort CSS requests (they come back as 400) and skip images, which are not needed.

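function main(splash, args)
    -- Abort any request whose URL contains 'css'
    splash:on_request(function(request)
        if request.url:find('css') then
            request.abort()
        end
    end)
    splash.images_enabled = false  -- skip image downloads
    splash.js_enabled = false      -- also turns off JavaScript; remove this line if the page needs JS
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return splash:html()
end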
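
To run a script like the one above from Python, one option is to POST it to Splash's `/execute` endpoint as JSON. The following is a minimal sketch using only the standard library. The helper names `build_execute_request` and `fetch_html` are made up for this example, and the embedded Lua is a shortened version of the script above; it assumes Splash is listening on `http://localhost:8050`.

```python
import json
import urllib.request

# Shortened version of the Lua script from the note (CSS filter omitted for brevity).
LUA_SOURCE = """
function main(splash, args)
    splash.images_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return splash:html()
end
"""


def build_execute_request(splash_url, target_url, lua_source=LUA_SOURCE):
    """Build a POST request for Splash's /execute endpoint (nothing is sent yet)."""
    # Extra JSON keys such as "url" are exposed to the Lua script as args.url.
    payload = {"lua_source": lua_source, "url": target_url}
    return urllib.request.Request(
        splash_url.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_html(splash_url, target_url, timeout=60):
    """Send the request to a running Splash instance and return the HTML as text."""
    request = build_execute_request(splash_url, target_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the container running, `fetch_html("http://localhost:8050", "https://example.com")` should return the rendered page source.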

Other notes (Splash)

# Turn off private mode
splash.private_mode_enabled = false

# Modify headers
headers = {
  ['User-Agent'] = '…..',
  ['cookie'] = '…..',
}
splash:set_custom_headers(headers)
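
When no Lua script is needed, headers can also be supplied from the Python side: the `render.html` endpoint accepts a `headers` field in a JSON POST body. This is a rough sketch under that assumption. `build_render_request` is a hypothetical helper, and the header values are placeholders, not values from this note.

```python
import json
import urllib.request


def build_render_request(splash_url, target_url, headers, wait=0.5):
    """Build a JSON POST for render.html that carries custom headers (not sent yet)."""
    payload = {"url": target_url, "wait": wait, "headers": headers}
    return urllib.request.Request(
        splash_url.rstrip("/") + "/render.html",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical header values, for illustration only.
EXAMPLE_HEADERS = {"User-Agent": "my-scraper/0.1", "Cookie": "session=placeholder"}
```

Passing the result to `urllib.request.urlopen` sends it to a running Splash instance.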

# Block comment (Lua)
--[[
--]]

Other notes (Docker)

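# Remove the container and recreate it
docker rm <container ID or name>
# Create and start a Splash container
docker run -it -p 8050:8050 scrapinghub/splash
# Allow longer Splash timeouts (up to 3600 seconds)
docker run -it -p 8050:8050 scrapinghub/splash --max-timeout=3600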
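
Raising `--max-timeout` only lifts the ceiling; each request still has to ask for a longer `timeout` itself. A small sketch of building such a `render.html` URL follows. `render_html_url` is a name invented here, and the default of 300 seconds is an arbitrary example value that must stay at or below the `--max-timeout` the container was started with.

```python
from urllib.parse import urlencode


def render_html_url(splash_url, target_url, timeout=300, wait=0.5):
    """Return a render.html URL that requests a longer per-request timeout."""
    # timeout and wait are in seconds; wait has to be smaller than timeout.
    query = urlencode({"url": target_url, "timeout": timeout, "wait": wait})
    return splash_url.rstrip("/") + "/render.html?" + query
```

The returned URL can be opened with any HTTP client, for example `urllib.request.urlopen`.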

# Docker maintenance
docker system prune
Whale icon > Preferences  
Bug icon > Reset to factory defaults  
Restart

scrapy-splash

scrapy-splash github

Note: some of the Scrapy settings differ from what is shown on GitHub.

SPLASH_URL = 'http://localhost:8050'
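
For orientation, this is roughly what the settings in the scrapy-splash README look like when placed in a project's `settings.py`. Treat it as a starting point rather than a verified configuration: as noted, parts of a working setup may differ from the README, and the priority numbers are the ones the README suggests.

```python
# settings.py (sketch based on the scrapy-splash README)

SPLASH_URL = 'http://localhost:8050'

# Route requests through Splash and keep cookies in sync.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Avoid storing duplicate Splash arguments in the request queue.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```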
