[Python] Crawler download a file via resume breakpoint

Last Updated on 2021-03-04 by Clay

Introduction

Sometimes, while a crawler is downloading a file, a network fluctuation or some other interruption cuts the download short. That does not mean we have to download the whole file again.

If the download can resume from the breakpoint, that is, from the last byte already saved, we only need to fetch the remaining part to finish it.

The basic idea is to first send a request to the website and get the total size of the file. Then we check the download path: if the file already exists, we look at how many bytes have been saved so far and start downloading from the first byte that is still missing.
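The decision described above can be sketched as a small pure function. The name plan_resume and its return values are assumptions made for this illustration and are not part of the script in this article; the function only shows the comparison of the two sizes.

```python
def plan_resume(local_size, total_size):
    """Decide how to continue a download from two sizes given in bytes.

    Hypothetical helper for illustration: returns an (action, headers) pair.
    """
    if local_size == total_size:
        # Every byte is already on disk, so there is nothing to request.
        return 'skip', {}
    if 0 < local_size < total_size:
        # Ask the server for everything from the first missing byte onward.
        return 'resume', {'Range': 'bytes=%d-' % local_size}
    # No local data (size 0), or a local file larger than the remote one:
    # start again from the first byte without a Range header.
    return 'restart', {}
```

With the sizes that appear later in this article, plan_resume(1024, 66825) returns ('resume', {'Range': 'bytes=1024-'}).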

I do the downloading with requests, a third-party HTTP library for Python.

My code

The code is very simple. I will explain it step by step.

import os
import time
import requests

First we import the modules we need. If the requests module is not installed in your environment, install it with the following command:

pip3 install requests

After installing, we try to use the following code to download a file.

startTime = time.time()
url = 'https://img.ltn.com.tw/Upload/news/600/2019/11/10/phpK7nZ7J.jpg'
fileName = 'test.jpg'
content_size = int(requests.get(url, stream=True).headers['Content-Length'])

The URL in this example points to a typhoon picture from a news article. If you paste the URL into your browser, you should see the picture directly.

fileName is the name under which the picture is saved
content_size is the total size of the file in bytes, read from the Content-Length response header
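Not every server sends a Content-Length header (for example, when the response uses chunked transfer encoding), and in that case headers['Content-Length'] raises a KeyError. A defensive way to read the header might look like the sketch below. The helper name read_content_length is an assumption for this illustration; it accepts any mapping with an items() method, such as a plain dict or the headers of a requests response.

```python
def read_content_length(headers):
    """Return Content-Length as an int, or None if it is missing or invalid.

    Hypothetical helper: header names are compared case-insensitively so the
    function also works with a plain dict.
    """
    for name, value in headers.items():
        if name.lower() == 'content-length':
            try:
                return int(value)
            except ValueError:
                # The header is present but does not hold a number.
                return None
    return None
```

When the function returns None, the total size is unknown, so the progress percentage cannot be computed and the script would have to fall back to a plain download.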

if os.path.exists(fileName):
    temp_size = os.path.getsize(fileName)
else:
    temp_size = 0

This step checks whether the picture already exists in the current directory. If it exists, we get its current size; if it does not, the size is set to 0, which means that we have never downloaded this file.

print('Temp:', temp_size)
print('Total:', content_size)

headers = {'Range': 'bytes=%d-' % temp_size}
r = requests.get(url, stream=True, headers=headers)
print('[File size]: %0.2f MB' % (content_size/1024/1024))
print('Status:', r.status_code)

Output:

Temp: 0
Total: 66825
[File size]: 0.06 MB
Status: 206

The Range request header is very important: it tells the server which byte to start sending from. Also pay attention to the status code of the response. If it is 200, the website ignored the Range header and is sending the whole file from the beginning, which means it does not support starting the download from a given byte.

If it is 206 (Partial Content), the server accepted the range and sends only the requested part.
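The script in this article always opens the file in append mode. If the server answers 200, appending the full body to a partial file would corrupt it, so it may be worth choosing the file mode from the status code. The sketch below is one way to do that; the function name file_mode_for_status is an assumption for this illustration. It also covers 416 (Range Not Satisfiable), which a server typically returns when the requested start byte is at or beyond the end of the file, for example when the local file is already complete.

```python
def file_mode_for_status(status_code):
    """Map the HTTP status of a Range request to a file open mode.

    Hypothetical helper: returns None when nothing should be written.
    """
    if status_code == 206:
        # Partial Content: the body continues where the local file ends.
        return 'ab'
    if status_code == 200:
        # Range was ignored: the body is the whole file, so overwrite.
        return 'wb'
    if status_code == 416:
        # Range Not Satisfiable: no bytes are available from that offset.
        return None
    raise ValueError('unexpected HTTP status: %d' % status_code)
```

When the mode is 'wb', the byte counter used for the progress bar should also be reset to 0 before writing.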

with open(fileName, 'ab') as file:
    for data in r.iter_content(chunk_size=1024):
        file.write(data)
        temp_size += len(data)

        print('\r' + '[Download progress]:[%s%s]%.2f%%;' % (
            '█' * int(temp_size * 20 / content_size), ' '*(20-int(temp_size*20/content_size)),
            float(temp_size/content_size*100)), end='')

print('\n' + 'Download finished!')
print('Download time:%.2f s' % (time.time()-startTime))

Output:

[Download progress]:[████████████████████]100.00%
Download finished!
Download time:0.56 s
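The progress bar is easier to read and to check when it is separated from the download loop. The following is a minimal sketch that produces the same 20-character bar; the function name render_progress and the width parameter are assumptions for this illustration. Unlike the inline version, it clamps the ratio at 100% and avoids a division by zero when the total size is 0.

```python
def render_progress(done, total, width=20):
    """Return a one-line text progress bar for done out of total bytes.

    Hypothetical helper that mirrors the format string used in the article.
    """
    # Clamp the ratio so the bar never grows past its width.
    ratio = min(done / total, 1.0) if total > 0 else 1.0
    filled = int(ratio * width)
    return '[Download progress]:[%s%s]%.2f%%;' % (
        '█' * filled, ' ' * (width - filled), ratio * 100)
```

Inside the loop it would be used as print('\r' + render_progress(temp_size, content_size), end='').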

Since we had not downloaded this picture before, the whole file was downloaded in one go. Open the picture after the download to check it.

Next, we delete the image test.jpg and change the code to the following:

with open(fileName, 'ab') as file:
    for data in r.iter_content(chunk_size=1024):
        file.write(data)
        temp_size += len(data)

        print('\r' + '[Download progress]:[%s%s]%.2f%%;' % (
            '█' * int(temp_size * 20 / content_size), ' '*(20-int(temp_size*20/content_size)),
            float(temp_size/content_size*100)), end='')
        break

Output:

[Download progress]:[                    ]1.53%;
Download finished!
Download time:0.51 s

This time we used break to leave the loop right after the first chunk was written, so only the first 1024 bytes (1.53% of the file) were saved and the download is incomplete.

If you open the picture now, it shows up as a broken, mostly missing image.

Then we remove the break and download it again.

Our results this time:

Temp: 1024
Total: 66825
[File size]: 0.06 MB
Status: 206
[Download progress]:[████████████████████]100.00%
Download finished!
Download time:0.58 s

Open the picture and take a look. This time the download started at byte 1024, fetched only the missing part, and produced the complete picture.
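Besides looking at the picture, the result can be checked by comparing the size of the local file with the size reported by the server. The helper below is a small sketch of that check; the name is_download_complete is an assumption for this illustration. A matching size does not prove that every byte is correct, so a checksum would be needed for a stronger guarantee.

```python
import os


def is_download_complete(path, expected_size):
    """Return True if path is a file whose size equals expected_size bytes.

    Hypothetical helper: compares sizes only, not file contents.
    """
    return os.path.isfile(path) and os.path.getsize(path) == expected_size
```

In the script from this article, the call would be is_download_complete(fileName, content_size).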

Complete code

# -*- coding: utf-8 -*-
import os
import time
import requests


startTime = time.time()
url = 'https://img.ltn.com.tw/Upload/news/600/2019/11/10/phpK7nZ7J.jpg'
fileName = 'test.jpg'

# Total size of the remote file in bytes, read from the response headers
content_size = int(requests.get(url, stream=True).headers['Content-Length'])

# Number of bytes already saved locally (0 if the file does not exist yet)
if os.path.exists(fileName):
    temp_size = os.path.getsize(fileName)
else:
    temp_size = 0

print('Temp:', temp_size)
print('Total:', content_size)

# Ask the server to send only the bytes from temp_size onward
headers = {'Range': 'bytes=%d-' % temp_size}
r = requests.get(url, stream=True, headers=headers)
print('[File size]: %0.2f MB' % (content_size/1024/1024))
print('Status:', r.status_code)

# Append each received chunk to the local file and redraw the progress bar
with open(fileName, 'ab') as file:
    for data in r.iter_content(chunk_size=1024):
        file.write(data)
        temp_size += len(data)

        print('\r' + '[Download progress]:[%s%s]%.2f%%;' % (
            '█' * int(temp_size * 20 / content_size), ' '*(20-int(temp_size*20/content_size)),
            float(temp_size/content_size*100)), end='')

print('\n' + 'Download finished!')
print('Download time:%.2f s' % (time.time()-startTime))
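For reuse, the same steps can be wrapped in a function. The version below is only a sketch under a few assumptions: the name download_with_resume, the get parameter (which exists so the function can be tried with a stand-in for requests.get), and the 30-second timeout are all choices made for this illustration. It differs from the script above in that it overwrites the file when the server answers 200 and writes nothing when the server answers 416.

```python
import os


def download_with_resume(url, path, get=None, chunk_size=1024, timeout=30):
    """Download url to path, resuming from the bytes already on disk.

    Hypothetical wrapper: returns the number of bytes in the local file.
    """
    if get is None:
        import requests  # third-party: pip3 install requests
        get = requests.get

    offset = os.path.getsize(path) if os.path.exists(path) else 0
    # Only send a Range header when there is something to resume from.
    headers = {'Range': 'bytes=%d-' % offset} if offset > 0 else {}

    r = get(url, stream=True, headers=headers, timeout=timeout)
    try:
        if r.status_code == 416:
            # No bytes available from this offset; assume the file is complete.
            return offset
        if r.status_code == 206:
            mode = 'ab'             # server honored the range: append
        elif r.status_code == 200:
            mode, offset = 'wb', 0  # server sent the whole file: overwrite
        else:
            r.raise_for_status()    # raises for 4xx and 5xx responses
            raise RuntimeError('unexpected status: %d' % r.status_code)

        with open(path, mode) as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                offset += len(chunk)
        return offset
    finally:
        r.close()
```

With the values from this article, the call would be download_with_resume(url, fileName), and running it again after an interruption continues from the bytes already saved.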

References

https://realpython.com/python-requests/
