In this article we will discuss how to upload an Excel file and then process its content without storing the file on the server. One approach is to upload the file, store it in the upload directory and then read it. Another approach is to upload the file and read it directly from the POST data, without ever writing it to disk, and then display the data.
We will work with the latter approach here.
You may create a new project or work on existing code.
If you are setting up a new project, create a new virtual environment and install the Django 2.0 and openpyxl packages into it using pip.
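To make the idea concrete, here is a minimal sketch of the second approach: a view that reads the uploaded workbook straight from the request. The form field name excel_file and the template name upload.html are placeholders, not names from any existing project.

```python
# views.py -- a minimal sketch; 'excel_file' and 'upload.html' are
# placeholder names chosen for this example
from django.shortcuts import render
from openpyxl import load_workbook

def upload_excel(request):
    rows = []
    if request.method == 'POST' and request.FILES.get('excel_file'):
        # request.FILES['excel_file'] is a file-like object, so openpyxl
        # can read the workbook directly; nothing is written to disk
        wb = load_workbook(request.FILES['excel_file'])
        ws = wb.active
        for row in ws.iter_rows():
            rows.append([cell.value for cell in row])
    return render(request, 'upload.html', {'rows': rows})
```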
A site map is a list of a website’s content designed to help both users and search engines navigate the site. A site map can be a hierarchical list of pages, an organization chart, or an XML document that provides instructions to search engine crawl bots.
Why sitemaps are required:
XML sitemaps are important for SEO because they make it easier for Google to find your site's pages. This matters because Google ranks web pages, not just websites. There is no downside to having an XML sitemap, and having one can improve your SEO, so we highly recommend them.
Create two different classes in the sitemap.py file, one for static pages and another for dynamic URLs.
Let's assume your website sells products whose details are stored in a database. Once a new product is added to the database, you want that product page to be discoverable by search engines. We need to add all such product pages/URLs to the sitemap.
Define a class StaticSitemap in your sitemap.py file. Define the mandatory items method in it, which returns a list of objects. These objects are passed to the location method, which builds a URL from each object. Here in the items method we return appname:url_name strings, which the location method converts into absolute URLs. Refer to your app's urls.py file for the url names.
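A minimal sketch of such a class, assuming url names like home, about and contact defined in an app called myapp:

```python
# sitemap.py -- static pages sitemap; the url names below are
# placeholders for names defined in your app's urls.py
from django.contrib.sitemaps import Sitemap
from django.urls import reverse

class StaticSitemap(Sitemap):
    changefreq = 'weekly'
    priority = 0.5

    def items(self):
        # each entry is an 'appname:url_name' string
        return ['myapp:home', 'myapp:about', 'myapp:contact']

    def location(self, item):
        # convert the url name into an absolute path
        return reverse(item)
```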
Similarly, we will create a dynamic sitemap by fetching values from the DB.
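A sketch of such a dynamic sitemap, assuming a hypothetical Product model that defines get_absolute_url() and an updated_at field:

```python
# sitemap.py -- dynamic sitemap built from the database; the Product
# model and its updated_at field are assumptions for this example
from django.contrib.sitemaps import Sitemap
from .models import Product

class ProductSitemap(Sitemap):
    changefreq = 'daily'
    priority = 0.8

    def items(self):
        # every product row becomes one sitemap entry
        return Product.objects.all()

    def lastmod(self, obj):
        return obj.updated_at

    # by default, location() calls get_absolute_url() on each object,
    # so Product should define get_absolute_url()
```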
Robots.txt is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned.
Why robots.txt is important:
Before a search engine crawls your site, it looks at your robots.txt file for instructions on which pages it is allowed to crawl and index in search engine results. If you want search engines to ignore any pages on your website, you mention them in your robots.txt file.
Disallow: [URL string not to be crawled]
Steps to add robots.txt to your Django project:
Let's say your project's name is myproject.
Create a directory templates in the root location of your project. Create another directory with the same name as your project inside the templates directory.
Place a text file robots.txt in it.
Your project structure should look something like this.
Add user-agent and disallow URL in it.
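For example (the disallowed paths here are just placeholders):

```
User-agent: *
Disallow: /admin/
Disallow: /accounts/
```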
Now go to your project's urls.py file and add the import statement below.
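The exact statement is not shown in this excerpt; a common way to serve the file is Django's generic TemplateView, sketched here with Django 2.0's path syntax, assuming the template lives at templates/myproject/robots.txt as described above:

```python
# urls.py -- a sketch of serving robots.txt via TemplateView
from django.urls import path
from django.views.generic import TemplateView

urlpatterns = [
    path('robots.txt', TemplateView.as_view(
        template_name='myproject/robots.txt',
        content_type='text/plain')),
]
```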
I was able to scrape approx 10,000 tweets in 60 seconds, i.e. 166 tweets per second. Not an impressive number. (But I was able to make Celery, RabbitMQ, a rotating proxy via the Tor network, and Postgres work together in a Docker cluster.)
The above approach was not very fast, hence I tried to compare the below three approaches for sending multiple requests and parsing the responses.
– Celery-RabbitMQ docker cluster
– Scrapy framework
I planned to send requests to 1 million websites, but once I started I figured out that it would take a whole day to finish, hence I settled for 1000 URLs.
Docker has all the good features of a virtual machine. It helps developers set up an environment on the development machine that is similar to the production environment. Please go through the official Docker site if you want to know more about Docker.
In this article we will see how to develop a hello-world Django project and run it in a Docker container instead of a virtual environment.
Please follow this guide to install docker on your machine.
We are using Docker version 17.12.1-ce for this article.
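As a starting point, here is a minimal Dockerfile sketch for such a hello-world project; the python:3.6 base image and the project layout are assumptions, not requirements:

```dockerfile
# Dockerfile -- minimal sketch for a hello-world Django project
FROM python:3.6
ENV PYTHONUNBUFFERED 1
WORKDIR /app
RUN pip install django==2.0
# copy the Django project (manage.py and the project package) into the image
COPY . /app
EXPOSE 8000
# development server only; not suitable for production
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]
```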
In previous articles we used requests and BeautifulSoup to scrape the data. Scraping data this way is slow (using Selenium is even slower). Sometimes we need data quickly. But if we try to speed up the scraping using multi-threading or some other technique, we will start getting HTTP status 429, i.e. Too Many Requests. We might get banned from the site as well.
The purpose of this article is to scrape lots of data quickly without getting banned, and we will do this by using a Docker cluster of Celery and RabbitMQ along with Tor (a minimal task sketch follows the steps below). To achieve this we will follow the steps below:
– Install docker and docker-compose
– Download the boilerplate code to set up the docker cluster in 5 minutes
– Understand the code
– Experiment with the docker cluster
– Update the code to download tweets
– Use Tor to avoid getting banned
– Speed up the process by increasing workers and concurrency
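Before the walkthrough, here is a minimal sketch of what such a Celery task could look like, fetching a page through a local Tor SOCKS proxy. The broker URL, the rabbitmq/tor hostnames and the ports are assumptions based on a typical docker-compose setup, and requests needs the socks extra (pip install requests[socks]) for the proxy scheme to work:

```python
# tasks.py -- sketch of a Celery task fetching pages via Tor
import requests
from celery import Celery

# 'rabbitmq' and 'tor' are assumed docker-compose service names
app = Celery('scraper', broker='amqp://guest:guest@rabbitmq:5672//')

TOR_PROXIES = {
    'http': 'socks5h://tor:9050',
    'https': 'socks5h://tor:9050',
}

@app.task(bind=True, max_retries=3)
def fetch(self, url):
    try:
        resp = requests.get(url, proxies=TOR_PROXIES, timeout=30)
        return resp.status_code, len(resp.text)
    except requests.RequestException as exc:
        # back off and retry, possibly over a new Tor circuit
        raise self.retry(exc=exc, countdown=5)
```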
Note: This article is for educational purposes only. Do not send too many requests to any server. Respect the robots.txt file. Use an API if possible.
Sometimes we need to know who made what changes to which table. This might be required for legal audit purposes or for simple organisation-level logging.
There are multiple Django apps available online which can help you log model changes, but there is no fun in that. We will see how to do it without a ready-made app, and hence learn something in the process.
Signals let a sender notify a receiver that some event has occurred and some action needs to be performed.
For example, suppose we have some data in the cache as well as in the DB. We read data from the cache and, if it is not found, go to the DB as a fallback. Now whenever the DB is updated, we need to update the cache as well. But we might update the model from multiple views, so it is tedious and not clean to write the cache-update logic in every such view. This is where signals come into the picture.
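A minimal sketch of that cache-update case using Django's post_save signal; the Product model, the cache key and the timeout are assumptions for this example:

```python
# signals.py -- refresh the cache whenever a Product row is saved
from django.core.cache import cache
from django.db.models.signals import post_save
from django.dispatch import receiver

from .models import Product  # hypothetical model

@receiver(post_save, sender=Product)
def refresh_product_cache(sender, instance, **kwargs):
    # runs after every Product.save(), no matter which view triggered it
    cache.set('product:%d' % instance.pk, instance, timeout=300)
```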
The download method has one mandatory and two optional parameters as of now.
Parameter 1: Valid username of Instagram user.
Optional parameter 1: verbose : default value – True (boolean) : decides whether information should be printed on screen. It is recommended to keep this set to True so that, in the case of a large number of downloads, you can make sure the script is working and is not just frozen.
Optional parameter 2: wait_between_requests : default value – 0 (integer) : the time in seconds the script waits between consecutive download requests to Instagram. It is recommended to pass a positive value for this parameter. If you are getting rate-limit exceptions after downloading a few pictures, pass 1 in this parameter, i.e. wait for 1 second between each request.
InvalidUsernameException: raised when a non-existent username is provided. RateLimitException: raised when the rate limit is reached. Use the wait_between_requests parameter to avoid this.
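Putting the parameters together, a hypothetical usage sketch; the module name is an assumption, while the function and parameter names come from the description above:

```python
from instagram_downloader import download  # hypothetical module name

# wait 1 second between requests to avoid RateLimitException
download('some_username', verbose=True, wait_between_requests=1)
```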
In almost every article, we recommended the use of a virtual environment for developing any Python or Django project.
In this article, we will briefly cover virtual environments in Python: what they are, installation and usage.
What is a Virtual Environment:
A virtual environment is an isolated Python environment which can be created using the virtualenv tool. It contains all the packages that a Python project requires. A Python project running in a virtual environment does not use the system-wide installed Python packages.
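A minimal sketch of the typical workflow (the environment name venv and the package versions are just examples):

```
pip install virtualenv           # install the tool itself
virtualenv venv                  # create an isolated environment in ./venv
source venv/bin/activate         # activate it (Windows: venv\Scripts\activate)
pip install django==2.0 openpyxl # packages now install into ./venv only
deactivate                       # leave the environment
```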