Python Script 7: Scraping tweets using BeautifulSoup

Twitter is one of the most popular social networking services and is used by many of the world's most prominent people. Tweets can be used to perform sentiment analysis.

In this article we will see how to scrape tweets using BeautifulSoup. We are not using the Twitter API, since most APIs have rate limits.


Scraping 10000 tweets in 60 seconds using celery, RabbitMQ and Docker cluster with rotating proxy

In previous articles we used requests and BeautifulSoup to scrape data. Scraping this way is slow (and using Selenium is even slower). Sometimes we need data quickly. But if we try to speed up scraping with multi-threading or any other technique, we will start getting HTTP status 429, i.e. too many requests. We might get banned from the site as well.
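
A common way to cope with the 429 responses mentioned above is to back off before retrying. The following is a minimal sketch, not code from the original articles: it assumes a caller-supplied `fetch` callable that returns a `(status, body)` pair, and the names `fetch_with_backoff`, `max_retries` and `base_delay` are illustrative.

```python
import time


def fetch_with_backoff(fetch, url, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) and retry with exponential backoff on HTTP 429.

    `fetch` is a hypothetical callable returning (status_code, body).
    `sleep` is injectable so the delays can be observed in tests.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_retries:
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    # Still rate limited after all retries: hand the 429 back to the caller.
    return status, body
```

Backing off only reduces the chance of a ban on a single connection; it does not by itself make scraping faster, which is what the rest of the article addresses.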

The purpose of this article is to scrape lots of data quickly without getting banned. We will do this using a Docker cluster of Celery and RabbitMQ along with Tor.

To achieve this, we will follow these steps:

  1. Install Docker and docker-compose
  2. Download the boilerplate code to set up a Docker cluster in 5 minutes
  3. Understand the code
  4. Experiment with the Docker cluster
  5. Update the code to download tweets
  6. Use Tor to avoid getting banned
  7. Speed up the process by increasing workers and concurrency
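
Step 6 relies on sending requests through Tor. As a rough sketch, assuming a local Tor client listening on its default SOCKS port 9050 and the `requests` library installed with SOCKS support (`requests[socks]`), a proxies mapping could be built as shown here. The helper name `tor_proxies` is made up for illustration and is not taken from the article's boilerplate code.

```python
def tor_proxies(host="127.0.0.1", port=9050):
    """Build a proxies mapping that routes HTTP and HTTPS through a SOCKS5 proxy.

    The socks5h scheme asks the proxy to resolve hostnames as well,
    so DNS lookups also go through Tor.
    """
    proxy_url = f"socks5h://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    # Usage with requests (needs a running Tor client and requests[socks]):
    #   import requests
    #   response = requests.get("https://example.com", proxies=tor_proxies())
    print(tor_proxies())
```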

Note: This article is for educational purposes only. Do not send too many requests to any server. Respect the robots.txt file. Use the API if possible.

Let’s start.


Scraping Python books data from Amazon using Scrapy Framework

We learned how to scrape Twitter data using BeautifulSoup. But BeautifulSoup is slow, and we have to take care of many things ourselves.

Here we will see how to scrape data from websites using Scrapy.

I tried scraping details of Python books from Amazon.com using Scrapy and found it extremely fast and easy. We will see how to get started with Scrapy, create a scraper, scrape data and save the data to a database.

The scraper code is available on GitHub. I dumped the data into a MySQL database and built a mini Django app on top of it, which is available here.


py_instagram_dl – The Python Package to Download All pictures of an Instagram User

I created a small script to download all pictures of an Instagram user without using the APIs, as the APIs impose a few limitations such as rate limits.

After a few rounds of tweaking, optimising and beautifying the code, I thought of creating a Python package out of it. If you want to know how to create a distributable Python package, this article will be extremely helpful, as the steps are discussed in great detail.

You can find the py_instagram_dl package listed on PyPI: https://pypi.python.org/pypi/py-instagram-dl.

How to download all pictures of an Instagram user:
  • Create a virtual environment. This is optional but strongly recommended. You may follow this simple, step-by-step pocket guide on Python virtual environments.
  • Install the dependencies. This package needs a few other Python packages to work.
  • Install this package.
  • Use the installed package in your code.

Parameter Options:
The download method has one mandatory and two optional parameters as of now.

Mandatory Parameter:
Parameter 1: A valid Instagram username.

Optional Parameters:
verbose : default value True (boolean) : Decides whether information should be printed on screen. It is recommended to leave this set to True so that, with a large number of downloads, you can see that the script is working and has not frozen.

wait_between_requests : default value 0 (integer) : The time in seconds the script waits before sending the next request to Instagram to download a picture. It is recommended to pass a positive value for this parameter. If you get rate limit exceptions after downloading a few pictures, pass 1 for this parameter, i.e. wait 1 second between requests.

Exceptions:

InvalidUsernameException: Raised when a non-existent username is provided.
RateLimitException: Raised when the rate limit is reached. Use the wait_between_requests parameter to avoid this.
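
The advice above (raise wait_between_requests when rate limits are hit) can be sketched as a small retry wrapper. This is only an illustration under stated assumptions: the `download` callable is passed in by the caller, the `RateLimitException` class defined here is a local stand-in for the package's exception of the same name, and `download_with_increasing_wait` is a hypothetical helper that is not part of py_instagram_dl.

```python
class RateLimitException(Exception):
    """Local stand-in for the package's rate limit exception."""


def download_with_increasing_wait(download, username, waits=(0, 1, 2)):
    """Try download(username, ...) with progressively longer waits.

    `download` is assumed to accept the keyword arguments described
    above: verbose and wait_between_requests.
    Returns the wait value (in seconds) that succeeded.
    """
    for wait in waits:
        try:
            download(username, verbose=True, wait_between_requests=wait)
            return wait
        except RateLimitException:
            # Rate limited: start over with a longer pause between requests.
            continue
    raise RateLimitException(f"still rate limited with waits {waits}")
```

When wiring this to the real package, the exception class imported from the package should replace the stand-in so that the `except` clause actually matches what the package raises.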

 

Source code.