Written by Andrew Serra, March 15, 2022

Guide to Develop A Simple Web Scraper Using Scrapy

Projects · Tutorial · Web Scrapers

Web scrapers are a great way to collect data for projects. In this example, I will use the Scrapy framework to create a web scraper that collects the image URLs from the results page when searching for “headphones” on amazon.com.

To start with, let’s check that the Scrapy library is ready to go. Open a terminal on your macOS device and type:

$ scrapy version

At the time of this post, I am using version 1.5.1. You should get an output similar to this:

Scrapy 1.5.1

If you do not have it installed yet, you can use pip to install it. Here is how:

$ pip install Scrapy

Time to get to the coding. Scrapy provides a terminal command that creates a project and sets you up with a set of starter files. Navigate to the directory where you want to save the project, then start the project by typing this line in the terminal:

$ scrapy startproject headphones

Scrapy will create the directory with contents like this:

headphones/
    scrapy.cfg
    headphones/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

We will create our first file under the ‘spiders’ directory. I will name it headphone_spider.py. After creating the new file, your directory structure should look like this:

headphones/
    scrapy.cfg
    headphones/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            headphone_spider.py

The first bit of code will import scrapy and create the class that will scrape the web for us.

import scrapy  # adding scrapy to our file

class HeadphonesSpider(scrapy.Spider):  # our class inherits from scrapy.Spider
    name = "headphones"  # the spider's name; we will need it later to run the crawl

The line name = "headphones" is very important here. When we run our spider from the terminal, we will use the spider’s name to start crawling the web. The name should be as clear as possible, since you may have multiple spiders later on.
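As an aside, if you ever forget which spider names are registered in a project, Scrapy can list them for you; run this from anywhere inside the project directory:

$ scrapy list
headphones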

Now it is time for the first function in our class! Web scraping has two important parts. The first part is to send a request to the website(s) we want to scrape. We will name our function start_requests; in it, we will define a list of URLs that we want to visit and send requests to each of them.

def start_requests(self):
    urls = []  # the list of URLs we want to visit
    for url in urls:
        # the callback is explained below
        yield scrapy.Request(url=url, callback=self.parse)

The keyword yield acts like the return keyword, but it turns the function into a generator. Generators are useful when you are going to iterate over a collection of items exactly once: each item is produced on demand, processed, and then forgotten, so the whole collection never has to be held in memory.
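As a quick illustration in plain Python (separate from the spider, just to show the behavior), compare returning a list with yielding values:

def as_list():
    return [1, 2, 3]  # builds and returns the whole list at once

def as_generator():
    for n in [1, 2, 3]:
        yield n  # produces one value at a time

print(as_list())             # [1, 2, 3]
print(list(as_generator()))  # [1, 2, 3], but produced lazily and consumed once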

The keyword argument callback is used to call another function when a response comes back. After the request made by scrapy.Request(url=url, callback=self.parse) completes, Scrapy calls our second function in the class, which will be named parse. Here it is:

def parse(self, response):
    img_urls = response.css('img::attr(src)').extract()
    with open('urls.txt', 'w') as f:
        for u in img_urls:
            f.write(u + "\n")

In this function, the first line is where we use Scrapy’s selectors. We take the response generated by the scrapy.Request() call as a parameter and use .css() as the selector to parse the data we are looking for. Since we are looking for images, we could write .css('img'), but that would give us the entire <img> tags rather than the data we need. Image tags in HTML have a src attribute, so we use ::attr(src) to select the source of each image. Now we have the selector object! The only thing left is to extract the data we found by adding the .extract() call at the end. After all this, I write all of the URLs to a text file.
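You can experiment with these selectors outside of a spider, too. Here is a small standalone sketch using Scrapy’s Selector class directly; the HTML fragment is made up just for illustration:

from scrapy.selector import Selector

html = '<div><img src="a.png"><img src="b.png" alt="demo"></div>'
sel = Selector(text=html)

print(sel.css('img').extract())             # the full <img> tags
print(sel.css('img::attr(src)').extract())  # just the src values: ['a.png', 'b.png']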


Here is the final version of our headphone_spider.py file.

import scrapy

class HeadphonesSpider(scrapy.Spider):
    name = "headphones"

    def start_requests(self):
        urls = [
            'https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=headphones&rh=i%3Aaps%2Ck%3Aheadphones&ajr=2',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        img_urls = response.css('img::attr(src)').extract()
        with open('urls.txt', 'w') as f:
            for u in img_urls:
                f.write(u + "\n")


To run your spider, go to the top-level directory of the project (the one that contains scrapy.cfg) and type this in the terminal:

$ scrapy crawl headphones

The name `headphones` is the name we set in the class. And it will start crawling!
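Once the crawl finishes, the extracted image URLs end up in urls.txt in the directory you ran the command from (assuming the page loaded successfully and the request was not blocked). A quick sanity check from the terminal:

$ head urls.txt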


This was the first project I have done with Scrapy; I used the tutorial page to guide myself. Here is my code. In another post, I will show how to have your spider skip to other pages and extract data from those pages as well.


Thanks for reading this post!

If you’d like to contact me about anything, send feedback, or want to chat feel free to:

Send an email: andy@serra.us

Contact Me

  • LinkedIn
  • GitHub
  • Twitter
  • Instagram

Tags: data, python, scrapy, web, web scraper
