Python – Web Scraping with Scrapy

Scrapy is used for complex projects, for example scraping data from hundreds of pages of a website.

Install Scrapy

pip install scrapy

Create Project

scrapy startproject project_name

Generate Spiders

scrapy genspider spiders_name www.website.com

Example spider: countries.py

Example 1:

import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']  # domain only, without a path or trailing slash
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        title = response.xpath('//h1/text()').get()
        countries = response.xpath('//td/a/text()').getall()

        yield {
            'title': title,
            'countries': countries
        }

Example 2:

import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']  # domain only, without a path or trailing slash
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        countries = response.xpath('//td/a')
        for country in countries:
            name = country.xpath('.//text()').get()
            link = country.xpath('.//@href').get()

            yield {
                'country_name': name,
                'country_link': link
                # 'country_link': response.follow(url=link)
            }
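
The loop in Example 2 can be tried offline before running the spider. The sketch below is only an approximation: it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors (ElementTree supports only a subset of XPath), and the HTML fragment and its links are hypothetical stand-ins for the real country table.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for the country table.
html = """
<table>
  <tr><td><a href="/world-population/china-population/">China</a></td></tr>
  <tr><td><a href="/world-population/india-population/">India</a></td></tr>
</table>
"""

root = ET.fromstring(html)

rows = []
# ".//td/a" plays the role of "//td/a" in the spider: every <a> inside a <td>.
for link in root.findall(".//td/a"):
    rows.append({
        "country_name": link.text,         # like .//text() in the spider
        "country_link": link.get("href"),  # like .//@href in the spider
    })

print(rows)
```

Each dictionary in rows has the same keys as the items yielded by the spider, which makes it easy to check the selector logic without sending any requests.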

Example 3:

# -*- coding: utf-8 -*-
import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()

            # absolute_url = f"https://www.worldometers.info{link}"
            # absolute_url = response.urljoin(link)

            yield response.follow(url=link, callback=self.parse_country, meta={'country_name': name})

    def parse_country(self, response):
        name = response.request.meta['country_name']
        rows = response.xpath(
            "(//table[@class='table table-striped table-bordered table-hover table-condensed table-list'])[1]/tbody/tr")
        for row in rows:
            year = row.xpath(".//td[1]/text()").get()
            population = row.xpath(".//td[2]/strong/text()").get()
            yield {
                'country_name': name,
                'year': year,
                'population': population
            }
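
The links collected in Example 3 are relative, which is why the spider uses response.follow (or the commented-out response.urljoin) rather than requesting the raw href. As a rough illustration of that resolution step, the sketch below uses urljoin from the standard library with the page URL as the base. The relative link is an assumed example of the kind of href found in the table.

```python
from urllib.parse import urljoin

# URL of the page the spider is currently parsing (from start_urls).
page_url = "https://www.worldometers.info/world-population/population-by-country/"

# Assumed example of a relative href extracted with .//@href.
relative_link = "/world-population/china-population/"

# Resolve the relative link against the page URL.
absolute_url = urljoin(page_url, relative_link)
print(absolute_url)
```

Compared with the commented-out f-string approach, joining against the page URL also handles links that do not start with a slash.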

Run Spiders

scrapy crawl spiders_name

# scrapy crawl countries

Building Datasets

scrapy crawl spiders_name -o filename.format

JSON :
# scrapy crawl countries -o data.json
# Alt + Shift + F = format (tidy) the JSON file in the editor, e.g. in VS Code on Windows

CSV :
# scrapy crawl countries -o data.csv

XML :
# scrapy crawl countries -o data.xml
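
Once the data is exported, it can be loaded back with the standard library. The sketch below assumes the JSON export is a list of items shaped like those yielded in Example 3. The rows and the helper name group_by_country are placeholders invented for illustration, not real population figures.

```python
import json
from collections import defaultdict


def group_by_country(items):
    """Group exported rows by country, keeping (year, population) pairs."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["country_name"]].append((item["year"], item["population"]))
    return dict(grouped)


# Placeholder rows in the shape Example 3 yields; the values are invented.
sample_json = """
[
  {"country_name": "Exampleland", "year": "2020", "population": "1,000"},
  {"country_name": "Exampleland", "year": "2019", "population": "990"},
  {"country_name": "Testonia", "year": "2020", "population": "500"}
]
"""

items = json.loads(sample_json)
grouped = group_by_country(items)
print(grouped["Exampleland"])
```

For a real export, replace json.loads(sample_json) with json.load on the opened data.json file. Note that -o appends to an existing file, which can leave a JSON file invalid after a second run; recent Scrapy versions also accept -O to overwrite the file instead.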

UTF-8 problem

If non-ASCII characters appear as \uXXXX escape sequences in the JSON output, open settings.py and add the following line at the end of the file:

FEED_EXPORT_ENCODING = 'utf-8'
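
The effect of this setting on JSON output can be reproduced with the standard json module, which also escapes non-ASCII characters by default. This is only an analogy for what the exporter does, and the item below is a made-up example.

```python
import json

# Made-up item containing a non-ASCII character.
item = {"country_name": "Côte d'Ivoire"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are kept as-is, which is how the
# file looks once it is written as UTF-8.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; the setting only changes how readable the exported file is.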

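Whitespace problem: use normalize-space()

Example: the extracted text comes back padded with whitespace:
string = "\n   Text   \n"

Wrapping the XPath expression in normalize-space() strips the leading and trailing whitespace and collapses repeated whitespace inside the text:
string = response.xpath("normalize-space(//div[@class='num']/a/text())").get()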
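
When the text has already been extracted, a similar cleanup can be done in plain Python. The helper below is a sketch and not part of Scrapy: the name normalize_space is chosen here for illustration, and str.split() treats a slightly wider set of characters as whitespace than XPath does.

```python
def normalize_space(text):
    """Trim the ends and collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


raw = "\n   Text   \n"
print(repr(normalize_space(raw)))  # 'Text'
```

This is handy in a pipeline or callback where the selector expression cannot easily be changed.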
