XPath and CSS selectors in Scrapy
XPath and CSS selectors are your lifesavers in Scrapy. They let you tell Scrapy what you want to extract from the page. The only catch is that you need to know how they work in order to use their full potential, so let's see how they work.
XPath
According to Wikipedia, XPath (XML Path Language) is a query language for selecting nodes from an XML document. OK, but what are nodes then? In an HTML page, every element is a node, and so are its attributes and its text.
Here are some of the selectors you will use when you are starting:
- // : selects all the matching elements, wherever they are in the document. For example, if you write //h1 you are asking for all the h1 elements of the page.
- / : a single / selects a direct child of the element written before it, or the root of the document when it comes first. For example, /html/body//h1 gives you all the h1 elements within the body of the page.
- @ : is used to work on an attribute, for example a class. By writing //h2[@class="sectionname"] you are asking for all the h2 elements whose class attribute is exactly sectionname.
- text() : is used to select the text content of the tag. Very useful.
- extract() : is a Scrapy method, not part of XPath, that returns the selected data as strings.
Below are some examples of selectors you may want to use to get started in SEO:
- response.xpath("//title/text()").extract() : extracts the title text.
- response.xpath("//meta[@name='description']/@content")[0].extract() : selects every meta tag whose name attribute has the value description, takes its content attribute, and extracts the first value. Note that [0] raises an IndexError if the page has no meta description.
- response.xpath('//h1//text()').extract() : extracts all the text inside the h1 tags, including the text of nested tags.
The possibilities are endless. To give you an idea, here is a list of the different combinations you can apply: http://scraping.pro/res/xpath-cheat/xpath_css_dom_recipes.pdf
CSS selectors
Like XPath, CSS selectors can be used in Scrapy. To use them, call response.css instead of response.xpath.
The syntax is of course different. For example, you will use:
- response.css('.sectionname').extract()
to extract all the matching tags, here the tags that have the class sectionname, since . stands for class in CSS.
If you just want the text part, use:
- response.css('.sectionname::text').extract()