Scrapy is used for complex projects, e.g. scraping data from hundreds of pages of a website
Install Scrapy
pip install scrapy
Create Project
scrapy startproject project_name
Generate Spiders
scrapy genspider spiders_name www.website.com
Example spider : countries.py
Example 1 :
import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    # Domain names only: no scheme, path, or trailing slash
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        # Page heading and the text of every country link in the table
        title = response.xpath('//h1/text()').get()
        countries = response.xpath('//td/a/text()').getall()
        yield {
            'title': title,
            'countries': countries
        }
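allowed_domains is meant to hold bare domain names. A value with a path or trailing slash, such as 'www.worldometers.info/', may not match the host of follow-up requests, so they can end up filtered as offsite. A minimal sketch, using only the standard library, of deriving the value from a start URL (the helper name domain_of is hypothetical, not part of Scrapy):

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return only the host part of a URL: no scheme, path, or trailing slash."""
    return urlparse(url).netloc


start_url = "https://www.worldometers.info/world-population/population-by-country/"
print(domain_of(start_url))  # www.worldometers.info
```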
Example 2 :
import scrapy


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    # Domain names only: no scheme, path, or trailing slash
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        # Each <a> inside a table cell is one country
        countries = response.xpath('//td/a')
        for country in countries:
            # Relative XPath (leading dot) searches inside the current <a> only
            name = country.xpath('.//text()').get()
            link = country.xpath('.//@href').get()
            yield {
                'country_name': name,
                'country_link': link
                # 'country_link': response.follow(url=link)
            }
Example 3 :
# -*- coding: utf-8 -*-
import scrapy
import logging


class CountriesSpider(scrapy.Spider):
    name = 'countries'
    allowed_domains = ['www.worldometers.info']
    start_urls = ['https://www.worldometers.info/world-population/population-by-country/']

    def parse(self, response):
        countries = response.xpath("//td/a")
        for country in countries:
            name = country.xpath(".//text()").get()
            link = country.xpath(".//@href").get()
            # Alternatives that build the absolute URL by hand:
            # absolute_url = f"https://www.worldometers.info{link}"
            # absolute_url = response.urljoin(link)
            # response.follow accepts the relative link; meta passes the name to the callback
            yield response.follow(url=link, callback=self.parse_country, meta={'country_name': name})

    def parse_country(self, response):
        # Country name sent from parse() through the request meta
        name = response.request.meta['country_name']
        # Rows of the first population table on the country page
        rows = response.xpath(
            "(//table[@class='table table-striped table-bordered table-hover table-condensed table-list'])[1]/tbody/tr")
        for row in rows:
            year = row.xpath(".//td[1]/text()").get()
            population = row.xpath(".//td[2]/strong/text()").get()
            yield {
                'country_name': name,
                'year': year,
                'population': population
            }
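Example 3 passes the relative href straight to response.follow, with response.urljoin shown as a commented-out alternative. As a rough sketch of how a relative link resolves against the page URL, here is the standard-library urljoin, which is assumed to behave similarly. The href value is a hypothetical placeholder, not taken from the site:

```python
from urllib.parse import urljoin

page_url = "https://www.worldometers.info/world-population/population-by-country/"
relative_link = "/world-population/example-population/"  # hypothetical href

# A link starting with "/" replaces the whole path of the page URL
absolute_url = urljoin(page_url, relative_link)
print(absolute_url)
```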
Run Spiders
scrapy crawl spiders_name
# scrapy crawl countries
Building Datasets
scrapy crawl spiders_name -o filename.format
JSON :
# scrapy crawl countries -o data.json
# Alt + Shift + F = format (tidy) the JSON data in the editor
CSV :
# scrapy crawl countries -o data.csv
XML :
# scrapy crawl countries -o data.xml
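Once a file is exported, it can be loaded back for analysis. A minimal sketch, assuming records shaped like the items yielded by Example 3. The function name group_by_country and the sample records are made up for illustration and are not real data:

```python
import json
from collections import defaultdict


def group_by_country(records):
    """Map country_name -> {year: population as int}, skipping incomplete rows."""
    grouped = defaultdict(dict)
    for record in records:
        year = record.get("year")
        population = record.get("population")
        if year is None or population is None:
            continue  # the XPath matched nothing for this row
        # Strip whitespace and thousands separators before converting
        value = int(population.strip().replace(",", ""))
        grouped[record["country_name"]][year.strip()] = value
    return dict(grouped)


# Made-up records in the shape yielded by Example 3 (not real data)
sample_json = """[
  {"country_name": "Exampleland", "year": "2020", "population": "1,234,567"},
  {"country_name": "Exampleland", "year": "2019", "population": "1,200,000"},
  {"country_name": "Exampleland", "year": "2018", "population": null}
]"""
grouped = group_by_country(json.loads(sample_json))
print(grouped)
```

To use a real export, load data.json with json.load (opened with encoding="utf-8") in place of the sample string.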
UTF-8 encoding problem
Open settings.py and add the line below at the end of the file :
FEED_EXPORT_ENCODING = 'utf-8'
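Without this setting, a JSON export may write non-ASCII characters as \uXXXX escape sequences. The effect can be illustrated with the standard json module. This is only an analogy for what the exporter does, not Scrapy's own code:

```python
import json

item = {"country_name": "Côte d'Ivoire"}

# Default behaviour: non-ASCII characters are escaped
escaped = json.dumps(item)
# With ensure_ascii=False the characters are written as-is
readable = json.dumps(item, ensure_ascii=False)

print(escaped)   # {"country_name": "C\u00f4te d'Ivoire"}
print(readable)  # {"country_name": "Côte d'Ivoire"}
```

Both strings decode to the same data; only the readability of the file differs.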
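Whitespace problem : use normalize-space()
Example :
# Raw text extracted from the page, with surrounding whitespace:
string = "\n Text \n"
# normalize-space() trims it and collapses inner whitespace:
string = response.xpath("normalize-space(//div[@class='num']/a/text())").get()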
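XPath's normalize-space() strips leading and trailing whitespace and collapses each internal run of whitespace into a single space. A rough pure-Python equivalent, useful for cleaning text after extraction (the function name normalize_space is hypothetical; note that str.split() treats a few more characters as whitespace than XPath does):

```python
def normalize_space(text: str) -> str:
    """Trim the ends and collapse internal whitespace runs to one space."""
    return " ".join(text.split())


raw = "\n Text \n"
print(repr(normalize_space(raw)))  # 'Text'
```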