Guide to Develop A Simple Web Scraper Using Scrapy
Web scrapers are a great way to collect data for projects. In this example, I will use the Scrapy framework to create a web scraper that collects the image URLs from the results page you get when searching for “headphones” on amazon.com.
To start, let’s check whether Scrapy is already installed. Open the terminal on your macOS device and type:
$ scrapy version
At the time of this post, I am using version 1.5.1, so you should get output similar to this:
Scrapy 1.5.1
If you do not have it installed yet, you can use pip to install it. Here is how to do it:
$ pip install Scrapy
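If you would rather check from Python than from the terminal, the sketch below is one way to do it. It only uses the standard library to see whether a package named scrapy can be found; the helper name scrapy_installed is made up for this example.

```python
import importlib.util


def scrapy_installed() -> bool:
    """Return True if a package named 'scrapy' can be found in this environment."""
    return importlib.util.find_spec("scrapy") is not None


if __name__ == "__main__":
    if scrapy_installed():
        import scrapy

        # Same information as `scrapy version` in the terminal.
        print("Scrapy", scrapy.__version__)
    else:
        print("Scrapy not found; install it with: pip install Scrapy")
```

This checks the Python interpreter you run it with, so if you use virtual environments, run it inside the one you plan to use for the project.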
Time to get to the coding. Scrapy creates a project from a terminal command, which sets you up with a starter set of files. Navigate to the directory where you want to save the project, then type this line in the terminal:
$ scrapy startproject headphones
Scrapy will create the directory with contents like this:
headphones/
    scrapy.cfg
    headphones/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
We will create our first file under the ‘spiders’ directory. I will name it headphone_spider.py. After creating the new file, your directory structure should look like this:
headphones/
    scrapy.cfg
    headphones/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            headphone_spider.py
The first thing to code is the import of scrapy and a class that will scrape the web for us.
import scrapy  # make the Scrapy library available in this file


class HeadphonesSpider(scrapy.Spider):  # our class inherits from scrapy.Spider
    name = "headphones"  # the spider's name; we will need it later to run the crawl
The line name = "headphones" is very important here. When we run our spider from the terminal, we will use this name to start crawling the web. The name must be unique within the project and should be as clear as possible, since you may have multiple spiders later on.
Now it is time for the first function in our class! Web scraping has two important parts. The first is sending a request to the website(s) we want to scrape. We will name our function start_requests, and in it we will define a list of URLs that we want to visit and send a request to each of them.
    def start_requests(self):
        urls = []  # list to enter our urls
        for url in urls:
            # we will explain the callback soon
            yield scrapy.Request(url=url, callback=self.parse)
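The keyword yield makes start_requests a generator function. Like return, it hands a value back to the caller, but the function pauses instead of finishing and resumes at the same spot when the next value is requested. Generators are useful when you will go through a series of items once and not use them again: each item is produced, processed, and then forgotten.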
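To see this behaviour without Scrapy, here is a small plain-Python sketch. The function request_stream and the example.com URLs are made up for illustration, and a tuple stands in for the scrapy.Request object a real spider would yield.

```python
def request_stream(urls):
    """Yield one item per URL, the way start_requests yields one request per URL."""
    for url in urls:
        # A spider would yield scrapy.Request(...) here; a tuple is enough for the demo.
        yield ("GET", url)


stream = request_stream(["https://example.com/a", "https://example.com/b"])

first = next(stream)      # runs the function body up to the first yield
remaining = list(stream)  # resumes after that yield and collects the rest
leftover = list(stream)   # the generator is now used up, so nothing is left

print(first)
print(remaining)
print(leftover)
```

The last line prints an empty list: once the items have been handed out, the generator does not keep them around.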
The keyword argument callback tells Scrapy which function to call when the response for that request arrives. Once the page requested by scrapy.Request(url=url, callback=self.parse) has been downloaded, Scrapy calls the second function in our class, which will be named parse. Here it is:
    def parse(self, response):
        img_urls = response.css('img::attr(src)').extract()
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")
In this function, the first line is where we use Scrapy’s selectors. The response parameter is the page that Scrapy downloaded for our scrapy.Request(). We use .css() as the selector to pick out the data we are looking for. Since we are looking for images, we could enter .css('img'), but this would give us the <img> tags in full, which is more than we need. Since image tags in HTML have a src attribute, we add ::attr(src) to select just the source of each image. So now we have the selector object! The only thing left is to extract the data we found by adding the .extract() function to the end, which returns the matches as a list of strings. After all this, I write all the URLs to a text file.
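To get a feel for what 'img::attr(src)' picks out without running a crawl, here is a rough stand-in built on the standard library's html.parser. It is not Scrapy's selector engine, and the class ImgSrcCollector and the sample markup are made up for this example; the sketch only collects the same kind of values, the src of every <img> tag.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Made-up markup standing in for a downloaded results page.
SAMPLE_HTML = """
<div class="result">
  <img src="https://example.com/img/headphones-1.jpg" alt="Headphones 1">
  <a href="/product/1">Headphones 1</a>
  <img src="https://example.com/img/headphones-2.jpg" alt="Headphones 2">
  <img alt="no source here">
</div>
"""

collector = ImgSrcCollector()
collector.feed(SAMPLE_HTML)
img_urls = collector.sources

for u in img_urls:
    print(u)
```

The link and the image without a src are skipped, so only the two image addresses are printed. In the spider, response.css('img::attr(src)').extract() does this job in a single line.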
Here is the final version of our headphone_spider.py file.
import scrapy


class HeadphonesSpider(scrapy.Spider):
    name = "headphones"  # used on the command line: scrapy crawl headphones

    def start_requests(self):
        # Pages to visit; each one is requested and then handed to parse().
        urls = [
            'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Collect the src attribute of every <img> tag on the page.
        img_urls = response.css('img::attr(src)').extract()
        # Save one URL per line.
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")
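One thing to keep in mind if you add more URLs to the list: parse runs once per response, and it opens urls.txt with mode 'w', which starts the file from scratch each time. With a single URL that is fine. The plain-Python sketch below (no Scrapy involved; the helper names and file names are made up) shows the difference between 'w' and 'a' when the same file is written twice.

```python
import os
import tempfile


def save_urls(path, urls, mode):
    """Write one URL per line, opening the file with the given mode ('w' or 'a')."""
    with open(path, mode) as f:
        for u in urls:
            f.write(u + "\n")


def read_lines(path):
    """Return the file's lines without trailing newlines."""
    with open(path) as f:
        return f.read().splitlines()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "urls.txt")

    # Two writes with 'w': the second call replaces what the first one saved.
    save_urls(path, ["page1-a.jpg", "page1-b.jpg"], "w")
    save_urls(path, ["page2-a.jpg"], "w")
    overwritten = read_lines(path)

    # The same two writes with 'a': the results accumulate.
    os.remove(path)
    save_urls(path, ["page1-a.jpg", "page1-b.jpg"], "a")
    save_urls(path, ["page2-a.jpg"], "a")
    appended = read_lines(path)

print(overwritten)
print(appended)
```

Append mode has its own catch: running the crawl a second time adds to the old file instead of replacing it, so you may want to delete urls.txt before each run.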
To run your spider, go to the project's top-level directory (the one that contains scrapy.cfg), and in the terminal type:
$ scrapy crawl headphones
The name headphones is the name that we set in the class. And it will start crawling!
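What happens once the crawl starts can be pictured roughly like the toy loop below. This is only a mental model: the real Scrapy engine is asynchronous and does much more, and FakeResponse, ToySpider, fake_download and run are all made up for this sketch. Nothing here touches the network.

```python
class FakeResponse:
    """Hypothetical stand-in for a downloaded page (not Scrapy's Response class)."""

    def __init__(self, url, body):
        self.url = url
        self.body = body


class ToySpider:
    name = "toy"

    def __init__(self):
        self.seen = []

    def start_requests(self):
        for url in ["https://example.com/1", "https://example.com/2"]:
            # A dict takes the place of scrapy.Request(url=url, callback=self.parse).
            yield {"url": url, "callback": self.parse}

    def parse(self, response):
        # Record which page was handed to us and how big it was.
        self.seen.append((response.url, len(response.body)))


def fake_download(url):
    """Pretend to fetch a page and return a canned response."""
    return FakeResponse(url, body="<html>page for {}</html>".format(url))


def run(spider):
    # Pull requests one at a time, "download" each, and pass the response to its callback.
    for request in spider.start_requests():
        response = fake_download(request["url"])
        request["callback"](response)


spider = ToySpider()
run(spider)
print(spider.seen)
```

The point to take away is the order of events: start_requests hands out requests, each one is downloaded, and the response is passed to whatever function was given as the callback.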
This was the first project I did with Scrapy, and I used the tutorial page as my guide. Here is my code. In another post, I will show how to allow your spider to skip to other pages and extract data from those pages as well.
Thanks for reading this post!
If you’d like to contact me about anything, send feedback, or want to chat, feel free to:
Send an email: andy@serra.us