I recently tried scraping tweets quickly using a Celery, RabbitMQ and Docker cluster. Since I was hitting the same servers repeatedly, I used rotating proxies via the Tor network. It turned out that this is not very fast, and using rotating proxies via Tor is not a nice thing to do.
I was able to scrape approximately 10,000 tweets in 60 seconds, i.e. about 166 tweets per second. Not an impressive number. (But I was able to make Celery, RabbitMQ, a rotating proxy via the Tor network and Postgres work in a Docker cluster.)
The above approach was not very fast, so I compared the three approaches below for sending multiple requests and parsing the responses.
– Celery-RabbitMQ Docker cluster
– Multi-threading
– Scrapy framework
I planned to send requests to 1 million websites, but once I started, I realized it would take a whole day to finish, so I settled for 1000 URLs.
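To give a rough idea of what the multi-threading approach can look like, here is a minimal sketch using only the Python standard library. It is not the code from the full article: the helper names `fetch_status` and `fetch_all`, the worker count and the timeout are all assumptions chosen for illustration.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails.

    Hypothetical helper for illustration only.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; keep the status code.
        return err.code
    except Exception:
        # DNS failures, timeouts, malformed URLs and so on.
        return None


def fetch_all(urls, fetch=fetch_status, max_workers=20):
    """Run fetch(url) for every URL on a thread pool.

    Returns a dict mapping each URL to whatever fetch returned.
    Duplicate URLs collapse into a single key.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Calling `fetch_all(list_of_urls)` returns a status code (or `None`) per URL. The `fetch` parameter is injectable so the pool logic can be tried with a stand-in function before pointing it at real servers.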
Continue reading “Comparing celery-rabbitmq docker cluster, multi-threading and scrapy framework for 1000 requests”
Docker has all the good features of a virtual machine. It helps developers set up an environment on the development machine that is similar to the production environment. Please go through the official Docker site if you want to know more about Docker.
In this article we will see how to develop a hello world Django project and run it in a Docker container instead of a virtual environment.
Please follow this guide to install docker on your machine.
We are using Docker version 17.12.1-ce for this article.
Starting docker container of application:
Continue reading “Using Docker instead of Virtual Environment for Django app development”
In previous articles we used requests and BeautifulSoup to scrape data. Scraping data this way is slow (using Selenium is even slower). Sometimes we need data quickly. But if we try to speed up scraping using multi-threading or any other technique, we will start getting HTTP status 429, i.e. too many requests. We might get banned from the site as well.
The purpose of this article is to scrape lots of data quickly without getting banned, and we will do this by using a Docker cluster of Celery and RabbitMQ along with Tor.
To achieve this we will follow the steps below:
- Install docker and docker-compose
- Download the boilerplate code to setup docker cluster in 5 minutes
- Understanding the code
- Experiment with docker cluster
- Update the code to download tweets
- Using Tor to avoid getting banned.
- Speeding up the process by increasing workers and concurrency
Note: This article is for educational purposes only. Do not send too many requests to any server. Respect the robots.txt file. Use an API if possible.
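As a small illustration of the robots.txt point in the note above, the sketch below uses Python's standard `urllib.robotparser` module to filter a list of URLs against a set of rules. The helper names `build_parser` and `allowed_urls`, the sample rules and the `example.com` URLs are made up for this example and are not taken from the full article.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Build a parser from the text of a robots.txt file (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs the given user agent is allowed to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Sample rules, invented for illustration.
SAMPLE_RULES = "User-agent: *\nDisallow: /private/\n"

parser = build_parser(SAMPLE_RULES)
candidates = [
    "https://example.com/public/page",
    "https://example.com/private/page",
]
permitted = allowed_urls(parser, candidates)
```

For a live site, `RobotFileParser` can also load the rules itself via `set_url()` followed by `read()`; parsing a string as done here keeps the example runnable offline.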
Continue reading “Scraping 10000 tweets in 60 seconds using celery, RabbitMQ and Docker cluster with rotating proxy”