Python Script 7: Scraping tweets using BeautifulSoup

Twitter is one of the most popular social networking services and is used by many prominent people around the world. Tweets can be used for sentiment analysis.

In this article we will see how to scrape tweets using BeautifulSoup. We are not using the Twitter API because most APIs have rate limits.

Setup:

Create a virtual environment. If you are not in the habit of working with virtual environments, please stop immediately and read this article on virtual environments first.

Once the virtual environment is created and activated, install the dependencies in it.

Analysing Twitter Web Requests:

Let's say we want to scrape all the tweets made by the Honourable Prime Minister of India, Shri Narendra Modi. Open your browser (I am using Chrome) and press F12 to open the developer tools.

Now go to the URL https://twitter.com/narendramodi. In the Network tab of the developer tools, you will see the response to the request made to /narendramodi. The response is an HTML page. We will convert this HTML response into a BeautifulSoup object and extract the tweets from it.

If you scroll down the page to load more tweets, you will see more requests being sent, and their responses are not plain HTML but JSON.

Extracting tweets from HTML content:

First, inspect a tweet element on the web page. You will see that every tweet is enclosed in an li HTML tag. The actual tweet text is inside a p tag, which is a descendant of the li tag.

We will first get all the li tags and then the p tag from each li tag. The text contained in the p tag is what we need.
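
The two steps above can be sketched as follows. This is a minimal illustration, not the original script: the sample markup and the class names in it are invented stand-ins for the real page, and the function name extract_tweet_texts is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the profile page markup.
SAMPLE_HTML = """
<ol>
  <li class="tweet"><div><p class="tweet-text">First tweet</p></div></li>
  <li class="tweet"><div><p class="tweet-text">Second tweet</p></div></li>
  <li class="promo"><span>No paragraph here</span></li>
</ol>
"""


def extract_tweet_texts(html):
    """Return the text of the first <p> found inside each <li>."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for li in soup.find_all("li"):
        # find() searches all descendants, not only direct children
        p = li.find("p")
        if p is not None:
            texts.append(p.get_text(strip=True))
    return texts


print(extract_tweet_texts(SAMPLE_HTML))  # ['First tweet', 'Second tweet']
```

On the real page you would most likely need to narrow both searches with attribute or class filters, since a profile page contains many li and p tags that are not tweets.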

Code to start with:

We will start with the start function. First, collect the username from the command line and then send a request to the Twitter page. If there is no exception and the status code returned in the response is 200, i.e. success, proceed; otherwise exit.

Convert the response text into a BeautifulSoup object and check whether the HTML contains a div tag with the class errorpage-topbar. If it does, the username is invalid. This check is not strictly required, because an invalid username returns a 404 status, which the status_code check already catches.
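
A rough sketch of such a start function is shown below. It is written under a few assumptions that are not part of the original script: the HTTP call is passed in as a parameter named http_get (any callable that returns an object with status_code and text attributes, which is the shape requests.get returns), and the function returns None instead of exiting so that it is easy to try out in isolation.

```python
import sys

from bs4 import BeautifulSoup

PROFILE_URL = "https://twitter.com/{username}"


def start(username, http_get):
    """Fetch the profile page; return a BeautifulSoup object, or None on failure.

    http_get is an assumption of this sketch: a callable that takes a URL and
    returns an object exposing .status_code and .text.
    """
    url = PROFILE_URL.format(username=username)
    try:
        response = http_get(url)
    except Exception as exc:
        print(f"Request failed: {exc}", file=sys.stderr)
        return None

    if response.status_code != 200:
        print(f"Unexpected status code: {response.status_code}", file=sys.stderr)
        return None

    soup = BeautifulSoup(response.text, "html.parser")
    # Redundant in practice (an invalid username already yields a 404),
    # kept only to mirror the check described in the text.
    if soup.find("div", class_="errorpage-topbar") is not None:
        print(f"Invalid username: {username}", file=sys.stderr)
        return None
    return soup
```

With the requests library installed, a call could look like start(sys.argv[1], requests.get), where sys.argv[1] is the username given on the command line.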

Extract tweet text:

As discussed, we first find all the li tags and then try to get the tweet text out of each one. Every time a tweet is scraped successfully we print a dot on the screen to show progress; otherwise the user may think the script is doing nothing or has hung.

Tweets sometimes contain images. We discard those for now: we find the image tags inside a tweet and replace their text with an empty string.
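
One way this could look is sketched below. The class names, the sample markup, and the idea that an image shows up as a link inside the p tag are all assumptions made for illustration; inspect the live page to find the real markup.

```python
import sys

from bs4 import BeautifulSoup

# Hypothetical class names, chosen only for this sketch.
TWEET_TEXT_CLASS = "tweet-text"
IMAGE_LINK_CLASS = "twitter-timeline-link"

SAMPLE = """
<li><p class="tweet-text">Hello world <a class="twitter-timeline-link">pic.twitter.com/abc123</a></p></li>
<li><p class="tweet-text">Plain tweet</p></li>
<li><span>not a tweet</span></li>
"""


def get_tweet_text(li):
    """Return the tweet text of one <li> with image link text removed, or None."""
    text_box = li.find("p", class_=TWEET_TEXT_CLASS)
    if text_box is None:
        return None
    text = text_box.get_text()
    for image_link in text_box.find_all("a", class_=IMAGE_LINK_CLASS):
        # Replace the image's text with an empty string, as described above.
        text = text.replace(image_link.get_text(), "")
    return text.strip()


def get_this_page_tweets(soup, out=sys.stdout):
    """Collect tweet texts from every <li>, writing one dot per tweet found."""
    tweets = []
    for li in soup.find_all("li"):
        text = get_tweet_text(li)
        if text:
            tweets.append(text)
            out.write(".")
            out.flush()  # show the dot immediately rather than when the buffer fills
    return tweets


print(get_this_page_tweets(BeautifulSoup(SAMPLE, "html.parser")))
```

The flush call matters here: without it the dots may only appear once the output buffer is full, which defeats the purpose of a progress indicator.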

Scraping more tweets:

So far we have been able to get the tweets from the first page. As we load more pages by scrolling down, we get JSON responses, which need to be parsed slightly differently.

First we check whether there are more tweets. If there are, we find the next pointer and build the next URL. Once the JSON is received, we take out the items_html part and repeat the process of creating a soup and fetching tweets.

We keep doing this until there are no more tweets to scrape. We know when to stop by looking at the has_more_items and min_position values in the JSON response.
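
The loop described above might be structured like this. The sketch deliberately leaves out the exact URL of the JSON endpoint, since it is not given here: fetch_page and parse_items_html are hypothetical hooks supplied by the caller, and max_pages is an added safety cap. Only the three JSON keys (has_more_items, min_position, items_html) come from the text.

```python
import json


def scrape_remaining(first_position, fetch_page, parse_items_html, max_pages=1000):
    """Follow the JSON pages until has_more_items is false.

    fetch_page(position)   -> JSON text of the page starting at that pointer
                              (the caller builds the URL and sends the request).
    parse_items_html(html) -> list of tweet texts found in an HTML fragment.
    """
    tweets = []
    position = first_position
    for _ in range(max_pages):
        if position is None:
            break
        payload = json.loads(fetch_page(position))
        # items_html is an HTML fragment, parsed the same way as the first page
        tweets.extend(parse_items_html(payload.get("items_html", "")))
        if not payload.get("has_more_items"):
            break
        next_position = payload.get("min_position")
        if next_position == position:
            break  # the pointer did not advance, so stop instead of looping forever
        position = next_position
    return tweets
```

Keeping the network call and the HTML parsing behind two small hooks means the loop itself can be exercised with canned JSON strings before it is pointed at the live site.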

Complete script:

Now all the functions are complete. Let's put them together. Download the complete script from GitHub.

Running the script:

Assuming you have installed the dependencies in the virtual environment, let's run the script.

You may want to add a short wait between requests if you run into rate limit errors.
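
A simple way to add such a wait is to wrap whatever function sends the request. The helper below is a sketch; the name throttled and the one-second default are arbitrary choices, not values taken from the original script.

```python
import time


def throttled(func, delay_seconds=1.0, sleep=time.sleep):
    """Wrap func so that every call after the first one waits delay_seconds first."""
    called_before = False

    def wrapper(*args, **kwargs):
        nonlocal called_before
        if called_before:
            sleep(delay_seconds)
        called_before = True
        return func(*args, **kwargs)

    return wrapper
```

For example, with the requests library installed, slow_get = throttled(requests.get, delay_seconds=2) gives a drop-in replacement that pauses two seconds between consecutive requests.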

Dumping data in file:

You might want to dump the data into a text file. I prefer dumping data in JSON format.
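
Writing the collected tweets to a JSON file could look like the sketch below. The function name dump_tweets and the layout of the JSON document are assumptions made for illustration.

```python
import json


def dump_tweets(username, tweets, path):
    """Write the tweets to path as a UTF-8 encoded JSON document."""
    data = {"username": username, "count": len(tweets), "tweets": tweets}
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-English tweet text readable in the file
        json.dump(data, fh, ensure_ascii=False, indent=2)
```

Calling dump_tweets("narendramodi", tweets, "narendramodi_tweets.json") would then produce a file that can be loaded back with json.load.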

Let us know if you face any issues.
