Making a multithreaded file downloader in Python

5 November 2020 - 5 minute read

Downloading files from websites one at a time by hand is a waste of time, and waiting for each file to finish isn't great either. Wouldn't it be better to download multiple files at once?

Let's make a multithreaded Python file downloader.

For those who want to go straight to the source code: Gist source code

Imports

First of all, we import the packages we are going to use for this downloader. The only one you will have to install is requests; the others are part of the standard library.

import os, sys, threading
import requests

Class

Next we will make a FileDownloader class. This class will have all the functionality for the downloader. This class can be instantiated somewhere else.

In the constructor, in this case __init__, we first set up some default values.

class FileDownloader():
    def __init__(self, max_threads=10):
        self.sema = threading.Semaphore(value=max_threads)
        self.headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
        self.block_size = 1024

We set up a threading.Semaphore. The semaphore limits how many threads can be active at the same time: with max_threads=10, at most ten downloads run at once and the rest wait their turn.

The headers are sent with each request. The user-agent tells the server that we are, basically, a browser.

The block size sets how many bytes we write to the file on each iteration.
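
To see what the semaphore does on its own, here is a small standalone sketch. It is not part of the downloader: run_demo, worker, and the sleep that stands in for a download are made up for illustration. It counts how many threads are inside the guarded section at the same time.

```python
import threading
import time


def run_demo(max_threads=3, total_workers=10):
    """Start total_workers threads and return the highest number that ran at once."""
    sema = threading.Semaphore(value=max_threads)
    lock = threading.Lock()      # protects the counters in state
    state = {"active": 0, "peak": 0}

    def worker():
        with sema:               # same as sema.acquire() ... sema.release()
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(0.05)     # stand-in for a download
            with lock:
                state["active"] -= 1

    threads = [threading.Thread(target=worker) for _ in range(total_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


peak = run_demo(max_threads=3, total_workers=10)
```

All ten threads are started right away, but peak never goes above max_threads, because the others block in acquire until a slot is free.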

Threaded function

Now comes the function that does the most work: the threaded function, t_getfile, which receives the link, the filename and an optional session.

First we acquire the semaphore to claim one of the available slots. If all slots are taken, the call blocks and the rest of the function won't run until another thread releases its slot.

self.sema.acquire()

Next we need to make sure that the directory for the file exists so that we can write to it.

filepath = os.path.join(os.getcwd(), 'Downloads', str(filename))
os.makedirs(os.path.dirname(filepath), exist_ok=True)

There is a possibility that a download gets aborted: your internet connection failed or the server stopped responding. The file will be incomplete in that case, and we don't want to download the complete file again.

To solve this we need to check if the file is already there in the first place.

if not os.path.isfile(filepath):
    # the file isn't there yet
    # start downloading
    pass
else:
    # the file is here
    # check if it is the full file
    # otherwise continue downloading
    pass

New file

Let's work out the first if statement. The file isn't there yet.

if not os.path.isfile(filepath):
    #the file isn't there yet
    #start downloading
    self.download_new_file(link, filepath, session)

We will call a function that will download the new file.

def download_new_file(self, link, filepath, session):
    print(f"downloading: {filepath}")
    try:
        if session is None:
            # no session: plain request with our browser-like headers
            request = requests.get(link, headers=self.headers, timeout=30, stream=True)
        else:
            # reuse the given requests.Session (cookies, login state)
            request = session.get(link, timeout=30, stream=True)
        self.write_file(request, filepath, 'wb')
    except requests.exceptions.RequestException as e:
        print(e)

If there is no session, we just request the link directly. A session here is a requests.Session() object.

If there is a session, for example for a site you are logged in to, we use the session to get the file.

Continue downloading

But what if the file is already there, but not done yet? Let's work that problem out.

First we need to check the current size of the file in bytes. It tells us how much of the file is already there.

current_bytes = os.stat(filepath).st_size

Next we try to get the original file size from the server. We can do this with a HEAD request, which returns only the headers for the file. Among them is a header called content-length, which tells us how many bytes the original file is.

There is a possibility that the server doesn't send a content-length header. If it doesn't, you won't be able to resume the download.

headers = requests.head(link).headers
if 'content-length' not in headers:
    print(f"server doesn't support content-length for {link}")
    self.sema.release()
    return

total_bytes = int(headers['content-length'])
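
A slightly more defensive version of this lookup, as a sketch: the header can be missing or, with a misbehaving server, not a valid number. The helper name parse_content_length is hypothetical, and it accepts any mapping of header names to values. requests already hands you a case-insensitive headers mapping, so the lowercasing only matters when you pass in a plain dict.

```python
def parse_content_length(headers):
    """Return the content-length as an int, or None if it is missing or invalid."""
    # Normalize header names so 'Content-Length' and 'content-length' both match.
    lowered = {str(name).lower(): value for name, value in headers.items()}
    raw = lowered.get("content-length")
    if raw is None:
        return None
    try:
        size = int(raw)
    except (TypeError, ValueError):
        return None
    return size if size >= 0 else None
```

The caller can then treat None the same way as the missing-header case: print a message, release the semaphore and return.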

Now that we have the number of bytes already downloaded and the original file size in bytes, we can check whether the full file is there.

We do this by checking if current_bytes is lower than total_bytes. If it is, we continue downloading the file. If it isn't, the file is already fully downloaded.

if current_bytes < total_bytes:
    self.continue_file_download(link, filepath, current_bytes, total_bytes)
else:
    print(f"already done: {filename}")
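
Put together, the branching so far comes down to three outcomes. The sketch below pulls that decision into one small function so it can be tried without any network access; decide_action and its return values are names invented for this example, and total_bytes is assumed to be the size reported by the server.

```python
import os


def decide_action(filepath, total_bytes):
    """Return 'new', 'resume', or 'done' for a target file.

    total_bytes is the size reported by the server (content-length).
    """
    if not os.path.isfile(filepath):
        return "new"                      # nothing on disk yet
    current_bytes = os.stat(filepath).st_size
    if current_bytes < total_bytes:
        return "resume"                   # partial file, fetch the rest
    return "done"                         # already complete
```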

We can specify a byte range in the headers to tell the server which bytes we need to complete the file.

def continue_file_download(self, link, filepath, current_bytes, total_bytes):
    print(f"resuming: {filepath}")
    range_header = self.headers.copy()
    # byte ranges are inclusive, so the last byte is total_bytes - 1
    range_header['Range'] = f"bytes={current_bytes}-{total_bytes - 1}"

    try:
        request = requests.get(link, headers=range_header, timeout=30, stream=True)
        self.write_file(request, filepath, 'ab')
    except requests.exceptions.RequestException as e:
        print(e)
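
A note on the Range header: HTTP byte ranges are zero-based and inclusive, and leaving the end off (bytes=N-) asks for everything from byte N onward. The sketch below builds such a header and adds a check that the code in this post does not do: a server that honors the range answers with 206 Partial Content, while a server that ignores it answers with 200 and the whole file, which should not be appended to the partial file. Both helper names are hypothetical.

```python
def build_range_headers(base_headers, current_bytes):
    """Return a copy of base_headers asking for bytes from current_bytes onward."""
    headers = dict(base_headers)          # don't mutate the caller's dict
    headers["Range"] = f"bytes={current_bytes}-"
    return headers


def write_mode_for(status_code):
    """Pick the file mode based on how the server answered a Range request."""
    if status_code == 206:                # Partial Content: only the missing bytes
        return "ab"
    if status_code == 200:                # range ignored: full file, start over
        return "wb"
    raise ValueError(f"unexpected status for a range request: {status_code}")
```

With requests, the status is available as the status_code attribute of the response, so the write mode could be chosen from it before writing.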

The file has now been downloaded. We need to release the semaphore so that a waiting thread can start.

self.sema.release()
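
One thing to watch out for: if anything between acquire() and release() raises an exception that isn't caught, the slot is never given back and later downloads can end up waiting forever. A sketch of a safer shape using try/finally; run_in_slot and the work callable are placeholder names, not part of the class above.

```python
import threading


def run_in_slot(sema, work):
    """Run work() while holding one semaphore slot, releasing it even on errors."""
    sema.acquire()
    try:
        return work()
    finally:
        sema.release()           # always runs, also after an exception or a return
```

Writing `with sema:` around the work does the same thing in less code.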

Writing to a file

In the code blocks above you may have noticed a function called write_file a couple of times. This function, basically, writes the response to the file.

But since we have two different ways of downloading a file, in one go or resuming it, we need to set different write modes.

def write_file(self, content, filepath, writemode):
    # the with block closes the file for us, even if writing fails
    with open(filepath, writemode) as f:
        for chunk in content.iter_content(chunk_size=self.block_size):
            if chunk:
                f.write(chunk)

    print(f"completed file {filepath}")

For the new download we will use the 'wb' mode, which stands for write binary. For the resuming download we will use the 'ab' mode, which stands for append binary.
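
The difference between the two modes is easy to try out without a network connection. In this sketch, write_chunks is a stand-in for write_file that takes a plain list of bytes instead of a response, and the file lives in a temporary directory; both are assumptions made for the demo.

```python
import os
import tempfile


def write_chunks(chunks, filepath, writemode):
    """Write an iterable of bytes objects to filepath using the given mode."""
    with open(filepath, writemode) as f:
        for chunk in chunks:
            if chunk:                     # skip empty chunks
                f.write(chunk)


path = os.path.join(tempfile.mkdtemp(), "demo.bin")
write_chunks([b"hello ", b""], path, "wb")    # new file: 'wb' creates or truncates
write_chunks([b"world"], path, "ab")          # resume: 'ab' adds to the end
with open(path, "rb") as f:
    data = f.read()

write_chunks([b"again"], path, "wb")          # 'wb' again discards the old bytes
with open(path, "rb") as f:
    rewritten = f.read()
```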

Where it begins

We still have to start the thread. We do this in another function, get_file.

The threading.Thread target points to the threaded function, t_getfile. That function should not be called from the outside, only internally through get_file.

def get_file(self, link, filename, session=None):
    """ Downloads the file"""
    thread = threading.Thread(
        target=self.t_getfile, 
        args=(link, filename, session)
    )

    thread.start()
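
Because get_file returns right after starting the thread, the caller has no way of knowing when all downloads are finished. If you need that, one option is to keep the Thread objects around and join them. The sketch below is standalone: start_all, wait_for_all and fake_download are assumptions standing in for the downloader, not part of the class.

```python
import threading


def start_all(target, arg_list):
    """Start one thread per argument tuple and return the Thread objects."""
    threads = []
    for args in arg_list:
        thread = threading.Thread(target=target, args=args)
        thread.start()
        threads.append(thread)
    return threads


def wait_for_all(threads):
    """Block until every thread has finished."""
    for thread in threads:
        thread.join()


results = []
results_lock = threading.Lock()


def fake_download(link, filename):
    # Stand-in for the real threaded function; it only records its arguments.
    with results_lock:
        results.append((link, filename))


jobs = [("https://example.com/a", "a.jpg"), ("https://example.com/b", "b.jpg")]
wait_for_all(start_all(fake_download, jobs))
```

In the class itself, the same idea would mean having get_file return the thread it started, so the caller can collect and join them.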

So to get the whole downloader going, you call the get_file function on an instance of the class with the link and the filename, and optionally a session.

from {the filename you called the downloader} import FileDownloader

downloader = FileDownloader()  # default: 10 threads
# or set the maximum number of threads yourself:
# downloader = FileDownloader(max_threads=50)

for link in links:  # links: your own list of URLs
    filename = 'whatever.jpg'  # filename here, regex or md5, you name it
    downloader.get_file(link, filename)
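
The filename is left up to you. Two common options, sketched here with the standard library only: take the last path segment of the URL, or fall back to an MD5 hash of the link when the URL has no usable name. The function name filename_from_link and the .bin fallback extension are arbitrary choices for this example.

```python
import hashlib
import os
from urllib.parse import unquote, urlparse


def filename_from_link(link):
    """Derive a filename from a URL, falling back to an MD5 hash of the link."""
    path = urlparse(link).path            # leaves out the query string and fragment
    name = os.path.basename(unquote(path))
    if name:
        return name
    # No usable name in the URL (for example it ends in '/'): hash the link instead.
    return hashlib.md5(link.encode("utf-8")).hexdigest() + ".bin"
```

Inside the loop you could then write filename = filename_from_link(link). Keep in mind that two different links can end in the same name, in which case the second download would be treated as a resume of the first.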

Complete Code

For those who want to copy the complete code, I put it in a gist on GitHub. Gist source code