Scraping Python books data from Amazon using Scrapy Framework


Previously we learned how to scrape Twitter data using BeautifulSoup. But BeautifulSoup is slow, and it leaves many things, such as sending requests and handling duplicates, for us to take care of ourselves.

Here we will see how to scrape data from websites using Scrapy.

I tried scraping Python book details from Amazon.com using Scrapy and found it extremely fast and easy. We will see how to start working with Scrapy, create a scraper, scrape data, and save the data to a database.

The scraper code is available on GitHub. I dumped the data into a MySQL database and developed a mini Django app on top of it, which is available here.

Let’s start building a scraper.

Setup:

First create a virtual environment and activate it.

Once the virtual environment is activated, install the dependencies listed below in it.
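For example, on a Unix-like shell (the `mysqlclient` driver is only needed because this post stores results in MySQL):

```shell
# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install Scrapy, plus a MySQL driver for the pipeline described later
pip install scrapy mysqlclient
```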

 

Now create a Scrapy project.

A new folder with the structure below will be created.
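The project can be created with the `startproject` command (`amazon_books` is a hypothetical project name):

```shell
# Create the Scrapy project
scrapy startproject amazon_books
# This generates a folder like:
# amazon_books/
#     scrapy.cfg          # deploy configuration
#     amazon_books/
#         __init__.py
#         items.py        # Item definitions
#         middlewares.py
#         pipelines.py    # item pipelines
#         settings.py     # project settings
#         spiders/        # spider modules live here
#             __init__.py
```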

Writing spider:

Spiders are classes that we write; Scrapy uses them to scrape data from websites.

Inside the spiders folder, create a spider class BooksSpider  and start writing your code in it.

Define the name of the spider, create a list of starting URLs, and add a parse method.

We will also maintain a list of books already scraped to avoid duplicate requests, although Scrapy can take care of this itself.

Since we will be fetching the top 10 review comments as well, we start with the product-review URL.

After one book's details are scraped, we fetch the related books on the same page and then scrape their data as well.

But before we look at the code in the parse method, which parses data from the page, we should know what an Item class is.

Scrapy Item:

One good thing about Scrapy is that it helps in structuring the data. We can define our Item class in the items.py file. It will work as a container for our data.

Now let’s go back to the parse method.

Parsing the response:

The parse method accepts the response as a parameter. We will use css  or xpath  selectors to fetch the data from the response. We will be fetching the book title, author's name, rating, review count and book ID.

Now since we are interested only in Python books, we will try to filter the other books out. For this I have created a simple utility function is_python_book , which checks whether the word Python, Django or Flask appears in either the title or the comments.

 

Returning scraped item:

Once a book's data is scraped along with its review comments, we set it on the Item and yield it. What happens to the yielded data is explained in the pipeline section below.

So we first make sure that the scraped data belongs to a Python book: if yes, we yield the data for further processing; otherwise it is discarded.

Generating next request:

Once the first page is processed, we need to generate the next URL and issue a new request to parse it. This process goes on until it is manually terminated or some condition in the code is satisfied.

 

Storing scraped data in the Database:

When we yield a processed Item, it is sent for further processing to the Item Pipeline.

We need to define our pipeline class in the pipelines.py file. This class is responsible for further processing of the data, be it cleaning it, storing it in a DB, or storing it in text files.

We can write the connection creation and closing logic in the pipeline's open_spider  and close_spider  methods.
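A sketch of such a pipeline. The original project stores items in MySQL; `sqlite3` is used here only so the example runs with the standard library, and the table layout is an assumption:

```python
import sqlite3


class BookPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the DB connection
        self.conn = sqlite3.connect("books.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS books "
            "(title TEXT, author TEXT, rating TEXT, review_count TEXT)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes: commit and close
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Called for every yielded item: store it and pass it along
        self.conn.execute(
            "INSERT INTO books VALUES (?, ?, ?, ?)",
            (item["title"], item["author"], item["rating"], item["review_count"]),
        )
        return item
```

The pipeline must also be enabled in settings.py (see the Settings section) before Scrapy will call it.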

 

Points to Remember:
  • Be polite to the sites you are scraping. Do not send too many concurrent requests.
  • Respect the robots.txt file.
  • If an API is available, use it instead of scraping the data.
Settings:

Spider-wide settings are defined in the settings.py file.

  • Make sure obeying the robots.txt file is set to True.
  • Add some delay between requests and limit the number of concurrent requests.
  • To process items in the pipeline, enable the pipeline.
  • If you are testing the code and need to hit the same page frequently, enable the cache. It will speed things up for you and be easier on the website as well.
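The points above map to a few lines in settings.py. The setting names are real Scrapy settings; the values and the pipeline path are illustrative:

```python
# Respect the site's robots.txt
ROBOTSTXT_OBEY = True

# Be polite: wait between requests and limit concurrency per domain
DOWNLOAD_DELAY = 2
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Enable the pipeline so yielded items reach it (path is hypothetical)
ITEM_PIPELINES = {
    "amazon_books.pipelines.BookPipeline": 300,
}

# Cache responses while developing, to avoid re-hitting the same pages
HTTPCACHE_ENABLED = True
```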
Avoiding 503 error:

You may encounter a 503 response status code for some requests. This happens because the scraper sends the default value of the user-agent header.

Update the user-agent value in the settings file to something more common.
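For example, in settings.py (the string below is just an example of a common browser user agent; use a current one in practice):

```python
# Replace Scrapy's default user agent with a browser-like value
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/91.0.4472.124 Safari/537.36"
)
```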

 

Feel free to download the code from GitHub and experiment with it. Try to scrape data for books of another genre.

You can see the scraped data in action here: a list of the top Python books on Amazon.


 

Read more : https://doc.scrapy.org/en/latest/intro/tutorial.html

