Scraping 10,000 tweets in 60 seconds using Celery, RabbitMQ and a Docker cluster with a rotating proxy


In previous articles we used requests and BeautifulSoup to scrape data. Scraping data this way is slow (using Selenium is even slower). Sometimes we need data quickly, but if we try to speed up scraping with multi-threading or any other technique, we soon start getting HTTP status 429, i.e. too many requests, and we might get banned from the site as well.

The purpose of this article is to scrape a lot of data quickly without getting banned, and we will do this using a Docker cluster of Celery and RabbitMQ workers along with Tor.

To achieve this we will follow the steps below:

  1. Install docker and docker-compose
  2. Download the boilerplate code to setup docker cluster in 5 minutes
  3. Understanding the code
  4. Experiment with docker cluster
  5. Update the code to download tweets
  6. Using Tor to avoid getting banned.
  7. Speeding up the process by increasing workers and concurrency

Note: This article is for educational purposes only. Do not send too many requests to any server. Respect the robots.txt file. Use an API if possible.

Let’s start.

Installing docker and docker-compose:
  • We will be using Docker version 17.12.0-ce, build c97c6d6. Install Docker by following the instructions on Docker's official page.
  • The docker-compose version used is 1.8.0, build unknown. Install docker-compose by following the instructions on its page.
Boilerplate code:

Clone the code from this GitHub repository. The README file will help you get started with the cluster.

The Dockerfile used to build the worker image is based on the python:3 Docker image.

Directory structure of the code:
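The exact layout depends on the version of the repository you clone, but it looks roughly like this (an illustrative sketch, not an exact listing; the file names inside celery_main are assumptions):

.
├── docker-compose.yml
├── Dockerfile
└── celery_main/
    ├── celery_app.py       # Celery app: broker and backend configuration
    ├── tasks.py            # tasks executed by the workers
    └── task_submitter.py   # submits tasks to the queue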

Run the command below to start the Docker cluster:
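sudo docker-compose up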

This will run one container for each worker and one for RabbitMQ. Once each worker reports that it is ready at the end of the output (a line like celery@<container-id> ready.), you are good to go and can start submitting tasks. But before going any further, let's try to understand the code while it is still simple and small.

Understanding the code:
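The Celery app in the boilerplate is created along these lines (a minimal sketch; the file name, broker URL and backend shown here are assumptions and may differ in your clone):

# celery_main/celery_app.py -- illustrative; file name and URLs are assumptions
from celery import Celery

app = Celery(
    'celery_main',                       # first argument: name of the current module
    broker='pyamqp://guest@rabbitmq//',  # second argument: RabbitMQ broker (docker-compose service name)
    backend='rpc://',                    # third argument: backend used to store task results
)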

The first argument to Celery is the name of the current module. This is only needed so that names can be automatically generated when the tasks are defined in the __main__ module.

The second argument is the broker keyword argument, specifying the URL of the message broker you want to use. Here we are using RabbitMQ (which is also Celery's default).

The third argument is backend. A backend in Celery is used for storing the task results.

The task submitter sends the tasks to the workers. We need to call the do_work method with delay() so that it is executed asynchronously. The call returns immediately without waiting for the result. If you try to print the result without waiting, it will print None.
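In code, the submission looks roughly like this (a sketch; the import path follows the boilerplate's layout and may differ in your copy):

# sketch of submitting tasks asynchronously from the task submitter
from celery_main.tasks import do_work   # import path is an assumption

for i in range(10):
    async_result = do_work.delay(i)   # returns an AsyncResult immediately
    print(async_result.result)        # prints None: the task has not finished yet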

We can easily create a task from any callable by using the task() decorator, which is exactly how the worker's task is defined.

bind=True means the first argument to the task will always be the task instance (self). Bound tasks are needed for retries and for accessing information about the current task request.
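Putting it together, the task definition might look something like this (a sketch with an illustrative body, not the exact code from the repository):

from celery_main.celery_app import app   # import path is an assumption

@app.task(bind=True, max_retries=3)
def do_work(self, item):
    try:
        # self is the task instance, so the current request is accessible
        print('task id: {}, argument: {}'.format(self.request.id, item))
        return item * 2
    except Exception as exc:
        # bound tasks can retry themselves, here after 5 seconds
        raise self.retry(exc=exc, countdown=5)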

Experimenting with Docker cluster:

Run the containers using the command sudo docker-compose up . We will not run the containers in detached mode ( -d ) as we need to see the output.

By default this will create one worker. In another terminal, go inside the worker container using the command sudo docker exec -it [container-name] bash . It will start a bash session in the working directory defined by WORKDIR in the Dockerfile.

Run the task submitter using the command python -m celery_main.task_submitter . The task submitter will submit the tasks to the workers and exit without waiting for results. You can see the output (info, debug and warnings) in the previous terminal. Note how many seconds the cluster took to complete 10 tasks.

Now stop all containers, remove them and restart them, but this time with a worker count of 10, using the command sudo docker-compose up --scale worker=10 . Repeat the process and note the time taken to complete the tasks.

Repeat the above step, changing the worker count and the concurrency value in the Dockerfile, to find the values that give the shortest time on your machine.
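The concurrency value is the number of processes each worker container runs; in the Dockerfile it is passed to the celery worker command, along these lines (illustrative; check your Dockerfile for the exact module path):

celery -A celery_main.tasks worker --concurrency=5 --loglevel=info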

Increasing the concurrency value beyond a limit will no longer improve performance, as the workers will spend their time switching context instead of doing actual work. Similarly, increasing the worker count beyond a limit will make your machine unresponsive. Keep an eye on CPU and memory consumption by running top in another terminal.

Let’s start downloading tweets:

Now let's start extending the boilerplate code.

All the code we are going to write below is available on GitHub. You can download and run it to scrape the tweets.

All the Twitter handles are in the handles.txt file placed in the root directory of the code.

Update the task submitter file to read the handles and submit them to the task receiver.
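The updated submitter only needs to read the file and fire one task per handle, roughly like this (a sketch; the task name scrape_tweets and the import path are placeholders for whatever your code uses):

# sketch of the updated task submitter
from celery_main.tasks import scrape_tweets   # hypothetical task name

with open('handles.txt') as f:
    handles = [line.strip() for line in f if line.strip()]

for handle in handles:
    scrape_tweets.delay(handle)   # fire and forget, do not wait for the result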

The task receiver will get the response from Twitter and parse it to extract the tweets available on the first page. For simplicity we are not going to the second page. The code to extract the tweets is shown below:
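The exact parsing code is in the repository; as a rough sketch, assuming the older twitter.com profile markup where each tweet's text sat in a p element with class tweet-text, it looks something like this:

# sketch of the tweet-extraction task -- selector, task name and import path are assumptions
import requests
from bs4 import BeautifulSoup

from celery_main.celery_app import app   # import path is an assumption

@app.task(bind=True)
def scrape_tweets(self, handle):
    response = requests.get('https://twitter.com/{}'.format(handle), timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # 'tweet-text' matches the old profile page markup; adjust the selector
    # to whatever HTML the page actually serves when you run this
    tweets = [p.get_text(strip=True) for p in soup.find_all('p', class_='tweet-text')]
    print('{}: {} tweets on the first page'.format(handle, len(tweets)))
    return tweets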

Now if you run this code, it will start returning HTTP status 429 (too many requests) errors after a few hits.

To avoid this we need to use the Tor network to send the requests from different IPs, and we will also use a different user agent in each request.

Adding Rotating Proxy:

– Clone this git repository.


– Change anything in the code as per your requirements.

– Build the image and use the same name in the docker-compose file.

– You may skip the above steps, as a Docker image with the tag used in docker-compose is already present on Docker Hub.

– Create a file and write the below code in it.

– Create a new file that will contain the list of user agents; we will use one of these, selected randomly, in each request.
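For illustration, the user-agent file can be a plain Python list, and the request code then picks one entry at random and routes the request through the rotating-proxy container (the proxy host and port below are placeholders; use whatever your proxy image exposes):

# user_agents.py -- a few example desktop user-agent strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/604.4.7 (KHTML, like Gecko) Version/11.0.2 Safari/604.4.7',
]

# in the task: pick a random user agent and send the request via the proxy
import random
import requests
from user_agents import USER_AGENTS

PROXIES = {'http': 'http://proxy:5566', 'https': 'http://proxy:5566'}   # placeholder service name and port

def fetch(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, proxies=PROXIES, timeout=10)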

If you run the containers now, the IP will change after every few requests and the user agent will change on each hit, resulting in almost zero 429 status responses.



I was able to download approximately 10,000 tweets in 60 seconds using 15 workers with a concurrency of 5 in each worker. Increasing the worker count beyond 15 started making the machine unresponsive.

You may try different numbers and find the configuration that scrapes the maximum tweets in the minimum time on your machine.

Source code:

The source code is available on GitHub.
