Analyzing your SEO data with Scrapy

Ok, there are great solutions out there that give you SEO data about your website. Screaming Frog (https://www.screamingfrog.co.uk/seo-spider/) is probably one great example: it works on all operating systems and provides you data about:

  • broken links
  • page titles + metadata (description, keywords...)
  • robots.txt

and many, many other things... but it is a proprietary solution. As a result you need to pay when your crawl exceeds 500 URLs, and you cannot customize it... unless you pay. In this unit, we will explain how you can create your own crawler with a solution named Scrapy.

Ok, but what is Scrapy then?

Scrapy is a Python framework.

Ok, so what is Python?

Python is a programming language taught from the youngest age (primary school) in the UK and France. In other words, if they can do it, you can do it too.

Ok and what is a framework?

When you are building a computer program you are using a programming language. To build this program you can either write everything from scratch or you can use frameworks. Frameworks are components built by other developers so that you do not have to reinvent the wheel.

So Scrapy is a set of ready-made components that lets you easily create your own spiders in Python.

Still not clear? Let's make a comparison. Imagine that you would like to build a house. You need everything: a plot of land, some wood, some cables... and many other things, and you build everything from scratch; that is like coding in a programming language from scratch. With a framework, you still need to build the house, but you get a plot of land with the lawn already laid, water coming through pipes, electricity as well, and instead of raw wood you have ready-made wooden doors... It is not a house yet: your job is to put the different components where you want them. If we had to compare a Content Management System to a house, it would be more like a kit house: the house is already built, but you can add extensions to it. Hope that's clearer now.

So to sum it up, Scrapy is not a piece of software, so do not expect to find a finished product like Screaming Frog there; you are in fact the one who is going to build the finished product with the Scrapy framework.

Scrapy is a framework to create spiders, and you can use those spiders to scrape whatever you want. It is not just about the main SEO tags: it can be used to scrape directories, ads and many other things; you just need to tell it what to do.

Ok, that's great, let's install Scrapy

As mentioned previously, Scrapy relies on Python, so you need to have Python installed, which should already be the case if you are using a GNU/Linux distribution. If it is not, please visit https://www.python.org/ to learn more.

To install the system packages Scrapy depends on, on GNU/Linux Ubuntu, run:

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

Scrapy itself is then installed with pip, and it is highly recommended to install it within a virtual environment to avoid conflicts with other programs. By default Scrapy would be installed system-wide; by creating a virtual environment you isolate it from the rest of your system, which avoids many, many bugs, so that's the recommended method if you want to keep your sanity.
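Here is one common way to do it, assuming the virtualenv tool is available on your machine (the environment name scrapy-env is just an example):

virtualenv scrapy-env
source scrapy-env/bin/activate
pip install scrapy

Once the environment is activated, the scrapy command becomes available in your terminal; remember to activate it again in every new terminal session.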

Ok let's run Scrapy

Ok, so just create a folder and go into it. In our case we called this folder myscrapy and put it within the Downloads folder (the commands below assume you start from your home folder):

mkdir Downloads/myscrapy

then we go into this folder:

cd Downloads/myscrapy

Scrapy is well organized, so before starting you need to create a project. A project can contain several spiders, each with its own behaviour, and of course each project has its own settings.

So, to create your first project (say, the one you will use for your organization), run:

scrapy startproject my_organization
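If you look inside the newly created folder, the generated layout should look roughly like this (the exact list of files can vary slightly between Scrapy versions):

my_organization/
    scrapy.cfg            # deploy configuration file
    my_organization/      # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # this is where your spiders will live
            __init__.py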

Congratulations, now you have a project :) That's great, but if you are curious and look inside the spiders folder, you will see that no spiders have been created yet. So let's create one. Go into your project folder:

cd my_organization

Then run the following command:

scrapy genspider name_of_your_spider domain-name-you-want-to-crawl.com

Ok, so now you have a spider. You can find it in the folder named spiders, where you should have a file named name_of_your_spider.py.

As you will see, this file does not have much content, and that's normal: you have not yet told your spider what you expect from it.
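For reference, the generated file should look roughly like this (the exact template depends on your Scrapy version, and the class name is derived from the name you passed to genspider):

# -*- coding: utf-8 -*-
import scrapy


class NameOfYourSpiderSpider(scrapy.Spider):
    name = 'name_of_your_spider'
    allowed_domains = ['domain-name-you-want-to-crawl.com']
    start_urls = ['http://domain-name-you-want-to-crawl.com/']

    def parse(self, response):
        # Empty for now: this is where you will tell the spider
        # what to extract from each page it visits.
        pass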

In order to do so you need to modify that Python file, and then run:

scrapy crawl name_of_your_spider

Pay attention: don't put .py at the end, as the crawl command expects the spider's name (the name attribute inside the file), not the filename, and you will get an error message if you do so.

Normally at this stage you won't get anything spectacular, as you still need to tell Scrapy what you want to extract.

So let's tweak the generated spider file and replace its content with something like:

# -*- coding: utf-8 -*-
import scrapy


class TooterSpider(scrapy.Spider):
    name = 'tooter'
    allowed_domains = ['twitter.com']
    start_urls = ['https://twitter.com/yacy_search']

    def parse(self, response):
        # Extract the text of the <title> tag and yield it as an item.
        title = response.xpath("//title/text()").extract()
        yield {'title': title}

Here we are simply asking the spider to extract the title of the page and to yield it as an item, so that we end up with a nice structured file at the end.

As we probably want to get a .csv out of it, just run:

scrapy crawl tooter -o file.csv

and we get a simple file containing the title of the page we wanted. (The output format is deduced from the file extension, so asking for file.json would give you JSON instead.)
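The title is just the beginning: the same pattern extends to the other SEO tags mentioned at the start of this unit. Here is a sketch (the field names and XPath expressions are our own illustrative choices, and example.com is a placeholder domain) that also pulls the meta description and the first h1 of every page, and follows internal links so the whole site gets crawled:

# -*- coding: utf-8 -*-
import scrapy


class SeoSpider(scrapy.Spider):
    # Illustrative example: adapt the name, domain and start URL to your own site.
    name = 'seo'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    def parse(self, response):
        # Each yielded dict becomes one row in the exported .csv file.
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
            'meta_description': response.xpath(
                '//meta[@name="description"]/@content').extract_first(),
            'h1': response.xpath('//h1/text()').extract_first(),
        }
        # Follow every link found on the page; Scrapy automatically
        # drops requests that leave allowed_domains.
        for href in response.xpath('//a/@href').extract():
            yield response.follow(href, callback=self.parse)

Run it with scrapy crawl seo -o seo-audit.csv and you get one row per crawled page, which starts to look like the SEO audit we were after.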
