Crawl a website with Scrapy and get valuable SEO data
In this part, we will try to replicate a little bit of what Screaming Frog does by extracting valuable SEO data.
Let's see how to do this with Scrapy (each extraction can also be tested interactively, as shown in the scrapy shell sketch after this list):
- Address: in Scrapy this is what we call response.url. From https://doc.scrapy.org/en/latest/topics/request-response.html
- Address count: in SEO you sometimes need to consider the length of your URLs. Here we use the Python function len() to measure the length of our variable, passing the variable to it.
- Content type: it will be response.headers['Content-Type']. This one is a nasty one, as you really need to use the [ ] syntax to get the data you need and to know that this header exists in the first place. This is all in the Scrapy documentation.
- Status code: you will get it with response.status. A bit like the one above, it is hard to find if you don't know it exists.
- Title: response.xpath("//title/text()").extract()
- Title count: nothing more than a len() of the title variable. But as the data extracted by Scrapy is a list, we use len(title[0]) to measure the first element of the list.
- Meta description: response.xpath("//meta[@name='description']/@content").extract_first()
- Meta description count: len(description) if description else 0, which avoids an error when no description is found.
- Meta keywords: response.xpath("//meta[@name='keywords']/@content").extract_first()
- h1: response.xpath('//h1//text()').extract_first()
- h2: response.xpath('//h2//text()').extract_first()
- robot: response.xpath("//meta[@name='robots']/@content").extract_first()
- Loading time: response.meta['download_latency'], from the Scrapy documentation.
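If you want to check these selectors before writing the whole spider, you can try them in the Scrapy shell. Here is a minimal sketch, using the same URL as the spider below (replace it with any page you want to inspect):
scrapy shell "https://ronan-chardonneau.fr"
# then, inside the shell:
response.url                                        # Address
len(response.url)                                   # Address count
response.headers['Content-Type']                    # Content type (returned as bytes)
response.status                                     # Status code
response.xpath("//title/text()").extract()          # Title (a list)
response.xpath("//meta[@name='description']/@content").extract_first()  # Meta description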
How to scrape all the pages of a website with Scrapy?
There are different ways you can do it: one is by using the LinkExtractor module, the other uses what we call rules.
LinkExtractor
Here is the script:
# -*- coding: utf-8 -*-
import scrapy
# the line below imports the module which extracts links from pages
from scrapy.linkextractors import LinkExtractor


class SpidermanSpider(scrapy.Spider):
    name = 'spiderman'
    allowed_domains = ['ronan-chardonneau.fr']
    start_urls = ['https://ronan-chardonneau.fr']

    def parse(self, response):
        # extract the SEO data described above
        address = response.url
        count_address = len(address)
        content_type = response.headers['Content-Type']
        status_code = response.status
        title = response.xpath("//title/text()").extract()
        count_title = len(title[0])
        description = response.xpath("//meta[@name='description']/@content").extract_first()
        count_description = len(description) if description else 0
        keywords = response.xpath("//meta[@name='keywords']/@content").extract_first()
        h1 = response.xpath('//h1//text()').extract_first()
        h2 = response.xpath('//h2//text()').extract_first()
        robot = response.xpath("//meta[@name='robots']/@content").extract_first()
        download_time = response.meta['download_latency']
        yield {
            'Address': address,
            'Address count': count_address,
            'Content Type': content_type,
            'Status code': status_code,
            'Title': title,
            'Title count': count_title,
            'Meta description': description,
            'Meta description count': count_description,
            'Meta keywords': keywords,
            'H1': h1,
            'H2': h2,
            'Robot': robot,
            'Download time': download_time
        }
        # follow every link belonging to the same domain and parse it the same way
        for a in LinkExtractor(allow_domains=['ronan-chardonneau.fr']).extract_links(response):
            yield response.follow(a, callback=self.parse)
The import line and the final for ... yield loop are the only ones you need to care about. The first one, the from ... import ... line, means that something has already been created in Python (this is what we call a module) and that this module saves you from writing many lines of code to do this simple thing. Here the module we call is named LinkExtractor (https://docs.scrapy.org/en/latest/topics/link-extractors.html); its purpose is to extract links from pages, which is exactly what we want.
The second part, the for ... yield loop, tells the spider that every time it sees a link (an a href element), as long as it belongs to the domain we want to parse, it applies the behaviour we defined before. To be honest with you, I didn't code all of this on my own; I took inspiration from forums and from a friend of mine who gave me some help.
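Once the spider is saved inside a Scrapy project (in the spiders folder), you can run it and export every yielded item with Scrapy's built-in feed exports; the output file name below is only an example:
scrapy crawl spiderman -o seo-crawl.csv
# or, for JSON output:
scrapy crawl spiderman -o seo-crawl.json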
Rules
As their name suggests, rules are here to tell the spider what to do. Here is a script example:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider


class SpidermanSpider(CrawlSpider):
    name = 'spidermanrules'
    allowed_domains = ['floss-marketing-school.com']
    start_urls = ['https://floss-marketing-school.com']

    rules = (
        Rule(LinkExtractor(allow_domains=['floss-marketing-school.com']), follow=True, callback="parse_item"),
    )

    def parse_item(self, response):
        address = response.url
        count_address = len(address)
        content_type = response.headers['Content-Type']
        status_code = response.status
        title = response.xpath("//title/text()").extract()
        count_title = len(title[0])
        description = response.xpath("//meta[@name='description']/@content").extract_first()
        if description:
            count_description = len(description)
        else:
            count_description = 0
        keywords = response.xpath("//meta[@name='keywords']/@content").extract_first()
        h1 = response.xpath('//h1//text()').extract_first()
        h2 = response.xpath('//h2//text()').extract_first()
        robot = response.xpath("//meta[@name='robots']/@content").extract_first()
        download_time = response.meta['download_latency']
        yield {
            'Address': address,
            'Address count': count_address,
            'Content Type': content_type,
            'Status code': status_code,
            'Title': title,
            'Title count': count_title,
            'Meta description': description,
            'Meta description count': count_description,
            'Meta keywords': keywords,
            'H1': h1,
            'H2': h2,
            'Robot': robot,
            'Download time': download_time
        }
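One thing to know about CrawlSpider: the callback is named parse_item on purpose, because CrawlSpider uses the parse method internally for its rules, so you should not override parse here. Running this version works exactly like before; the output file name is only an example:
scrapy crawl spidermanrules -o seo-crawl-rules.csv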