In previous articles we used requests and BeautifulSoup to scrape data. Scraping this way is slow (and using Selenium is even slower). Sometimes we need data quickly, but if we try to speed up scraping with multi-threading or any other technique, we will start getting HTTP status 429, i.e. Too Many Requests. We might get banned from the site as well.
The purpose of this article is to scrape lots of data quickly without getting banned. We will do this using a Docker cluster of Celery and RabbitMQ along with Tor.
To achieve this, we will follow the steps below:
- Install Docker and docker-compose
- Download the boilerplate code to set up the Docker cluster in 5 minutes
- Understand the code
- Experiment with the Docker cluster
- Update the code to download tweets
- Use Tor to avoid getting banned
- Speed up the process by increasing workers and concurrency
Note: This article is for educational purposes only. Do not send too many requests to any server. Respect the robots.txt file. Use an API if possible.
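
As a rough illustration of what respecting robots.txt can look like in code, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, the `my-scraper` user agent, and the helper names `build_parser` and `is_allowed` are all assumptions made up for this example; they are not part of the boilerplate code used later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without a network call.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "my-scraper") -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/public/page"))   # allowed by the rules above
print(is_allowed(parser, "https://example.com/private/page"))  # disallowed by the rules above
print(parser.crawl_delay("my-scraper"))                        # seconds to wait between requests
```

Against a real site you would typically call `parser.set_url("https://<site>/robots.txt")` followed by `parser.read()` instead of parsing an inline string, and then skip any URL for which `can_fetch` returns `False`.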