Python Script 10: Collecting one million website links

I needed a collection of different website links to experiment with a Docker cluster, so I wrote this small script to collect one million website URLs.

The code is also available on GitHub.

Running the script:

Either create a new virtual environment using python3 or use an existing one on your system.

Install the dependencies.

Activate the virtual environment and run the code.

Code:

 

We scrape links from the site http://www.websitelists.in/. If you inspect the page, you will see an anchor tag inside each td tag with the class web_width. We convert the page response into a BeautifulSoup object, find all such elements, and extract the href value from each of them.
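The extraction step described above could look roughly like the sketch below. It is not the original script: the function name extract_links and the sample markup are hypothetical, and the only details taken from the post are the td tag, the web_width class, and the anchor's href.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of the anchor inside every <td class="web_width">."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for cell in soup.find_all("td", class_="web_width"):
        anchor = cell.find("a", href=True)  # skip cells that hold no link
        if anchor is not None:
            links.append(anchor["href"])
    return links


# Hypothetical markup shaped like the description above (not the real page).
sample_html = """
<table>
  <tr><td class="web_width"><a href="http://example.com">example.com</a></td></tr>
  <tr><td class="web_width"><a href="http://example.org">example.org</a></td></tr>
  <tr><td class="other"><a href="http://ignored.example">ignored</a></td></tr>
</table>
"""

if __name__ == "__main__":
    print(extract_links(sample_html))
```

In a real run, the html argument would be the text of the page response instead of sample_html.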


 

Consecutive requests are already more than one second apart on their own, which is pretty slow but good for the server. I still added a one-second delay between requests to avoid the HTTP 429 (Too Many Requests) status.
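One way to add such a pause between requests is sketched below. This is an illustration under assumptions, not the code from the post: fetch_pages, its parameters, and the stub fetcher are all hypothetical names. The fetcher is passed in as a function so the loop can be tried without touching the network.

```python
import time


def fetch_pages(page_urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(page_urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages


if __name__ == "__main__":
    # Stub fetcher standing in for a real HTTP call (hypothetical).
    def fake_fetch(url):
        return "<html>content of " + url + "</html>"

    results = fetch_pages(["page-1", "page-2", "page-3"], fake_fetch, delay_seconds=0.1)
    print(len(results), "pages fetched")
```

With the requests library, the fetcher could be something like lambda url: requests.get(url).text, and delay_seconds would stay at its default of one second.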

The scraped links are dumped into a text file in the same directory as the script.
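Writing the links out could be done along these lines. This is a sketch only: the function name dump_links and the default file name links.txt are assumptions, since the post does not name the output file.

```python
from pathlib import Path


def dump_links(links, path="links.txt"):
    """Append links to a text file, one per line, and return how many were written."""
    links = list(links)
    # Append mode keeps links from earlier pages when called once per page.
    with Path(path).open("a", encoding="utf-8") as handle:
        for link in links:
            handle.write(link + "\n")
    return len(links)
```

Calling dump_links once per scraped page means a crash midway through the run loses only the page in progress, not everything collected so far.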

 


Featured Image Source : http://ehacking.net/

Scraping Python books data from Amazon using Scrapy Framework

We learned how to scrape Twitter data using BeautifulSoup. But scraping with BeautifulSoup is slow, and we have to take care of many things ourselves.

Here we will see how to scrape data from websites using Scrapy.

I tried scraping details of Python books from Amazon.com using Scrapy and found it extremely fast and easy. We will see how to get started with Scrapy, create a scraper, scrape data, and save the data to a database.

The scraper code is available on GitHub. I dumped the data into a MySQL database and built a mini Django app on top of it, which is available here.
