Hi! In this post I give a detailed guide for scraping web pages that contain JavaScript-generated content and login forms. I'm using the following existing Python-based tools.
- Scrapy: A web scraping framework
- Splash: A lightweight, scriptable headless browser with an HTTP API
- Selenium: A web application test-automation tool
Scrapy provides a simple framework to scrape web sites. However, it does not by itself render dynamic content generated by JavaScript. The Scrapy-Splash plugin was introduced for this: with it, a 'SplashRequest' can be sent to a JavaScript-enabled site, and the rendered dynamic content comes back in the response. However, I had trouble using the corresponding 'SplashFormRequest' to log in to the web site before scraping. Hence, I used Selenium to log in to the website first, and then passed the session cookies to 'SplashRequest' for scraping the rest of the site.
Step 1: Generating a Scrapy Project
You can refer to the Scrapy documentation on how to install it, create a project, and create your spider. In brief:
pip install Scrapy
- Go to the desired directory and run:
scrapy startproject <project name>
- In the generated project, go to the 'spiders/' subdirectory and create a new Python file named after your spider (e.g. YourSpider.py).
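For reference, startproject generates a layout roughly like this (with myproject as a placeholder project name):
myproject/
    scrapy.cfg
    myproject/
        __init__.py
        items.py
        settings.py
        spiders/
            YourSpider.py   <- your new spider file goes here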
Refer to https://docs.scrapy.org/en/latest/intro/tutorial.html#our-first-spider to write your first spider.
import scrapy
from scrapy.spiders import CrawlSpider

class LoginSpider(CrawlSpider):
    name = NAME               # e.g. 'login_spider'
    start_urls = [LOGIN_URL]  # URL of the page with the login form

    def parse(self, response):
        print(response)
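You can already run this skeleton to check the wiring works (assuming NAME was set to, say, 'login_spider'):
scrapy crawl login_spider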
Step 2: Installing Splash
To install Splash I followed an existing tutorial by ScrapingAuthority, which installs Splash through the Docker platform. In brief, the steps are:
sudo apt install docker.io
sudo docker pull scrapinghub/splash
Possible Error and Solution:
If you get the following error:
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
then restart the Docker service and run the Splash container:
# sudo service docker stop
# sudo service docker start
# sudo docker run -p 8050:8050 scrapinghub/splash
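To confirm the Splash server is actually up, you can hit its standard render endpoint directly; it should return the rendered HTML of the requested page:
curl 'http://localhost:8050/render.html?url=https://example.com&wait=1'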

- Install the scrapy-splash Python plugin:
sudo pip install scrapy-splash
- Update the Scrapy project's settings.py with the Splash-related configuration:
DOWNLOADER_MIDDLEWARES = {
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
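The scrapy-splash README also recommends registering its spider middleware, which deduplicates Splash arguments; if you follow the README exactly, add this as well:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}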
- Start the Splash server:
sudo docker run -p 8050:8050 scrapinghub/splash
- Update your spider to fetch the login page through Splash and receive the rendered HTML:
import scrapy
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest

class LoginSpider(CrawlSpider):
    name = NAME
    start_urls = [LOGIN_URL]

    def parse(self, response):
        print(response)
        # Ask Splash to render the page (executing its JavaScript) and
        # return the resulting HTML
        yield SplashRequest(url=LOGIN_URL, callback=self.after_login,
                            endpoint='render.html')

    def after_login(self, response):
        print(response)
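If the target page needs a moment to run its JavaScript, Splash can be told to wait before snapshotting the page; the 'args' dict is forwarded to the render endpoint (the 5-second value here is just an illustration):
yield SplashRequest(url=LOGIN_URL, callback=self.after_login,
                    endpoint='render.html', args={'wait': 5})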
Step 3: Use Selenium to Login to the Web Page
- Install the Selenium Python bindings (a ChromeDriver binary is also needed, since the spider below drives Chrome):
pip install selenium
- Update YourSpider.py as follows.
import time

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy_splash import SplashRequest
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class LoginSpider(CrawlSpider):
    name = NAME
    start_urls = [LOGIN_URL]

    def parse(self, response):
        print(response)
        # Run Chrome headlessly so the spider works without a display
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        browser = webdriver.Chrome('/usr/bin/chromedriver',
                                   chrome_options=chrome_options)
        browser.get('https://ifttt.com/login?wp_=1')
        # Fill in and submit the login form; USERNAME_KEY/PASS_KEY are the
        # ids of the form fields, USERNAME/PASS are your credentials
        username = browser.find_element_by_id(USERNAME_KEY)
        password = browser.find_element_by_id(PASS_KEY)
        username.send_keys(USERNAME)
        password.send_keys(PASS)
        browser.find_element_by_name("commit").click()
        time.sleep(10)  # give the site time to complete the login
        # Hand the authenticated session over to Splash via its cookies
        yield SplashRequest(url=browser.current_url,
                            callback=self.after_login,
                            endpoint='render.html',
                            cookies=browser.get_cookies())

    def after_login(self, response):
        print(response)
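From here, after_login receives the rendered, logged-in page, so the usual Scrapy selectors work on it. A minimal sketch, assuming a hypothetical '.applet-title' CSS class on the page:
def after_login(self, response):
    # response is the JavaScript-rendered HTML returned by Splash
    for title in response.css('.applet-title::text').getall():  # hypothetical selector
        yield {'title': title.strip()}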
Cheers! 🙂