[Python] Resumable Downloads in a Crawler: Continuing a File Download from Where It Stopped

Last Updated on 2023-05-30 by Clay

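When downloading files with a crawler, you may run into an unstable network, or need to switch to other work partway through. That does not have to mean throwing away what has already been downloaded: if the program is written to support resumable downloads, it can pick up a partially downloaded file and continue from where it stopped.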

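The basic idea is as follows. We first send a request to the site to learn the file's total size. Then we check whether the file already exists under our download path. If it does, we look at how many bytes it already contains and start downloading from the first byte we do not yet have.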
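As a rough illustration of this idea, the sketch below works out where a resumed download should start from the size of the partial file on disk. The helper name resume_offset_header is hypothetical and not part of the program in this post; it only builds the request header and makes no network request.

```python
import os


def resume_offset_header(path):
    """Return (offset, headers) for resuming the download saved at `path`.

    Hypothetical helper for illustration: the offset is simply the number
    of bytes already on disk, or 0 if the file does not exist yet.
    """
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    # "bytes=N-" asks for everything from byte offset N to the end of the file
    return offset, {'Range': 'bytes=%d-' % offset}
```

The returned dictionary has the same shape as the headers value passed to requests.get in the walkthrough.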

I download with requests, a third-party Python package (it is not part of the standard library). If you would like to learn more about the requests module, see: https://realpython.com/python-requests/

Code Walkthrough

The code is quite simple. Let me walk through it step by step:

import os
import time
import requests

First we import the modules we need. If requests is not installed in your environment, install it with the following command.

pip3 install requests

Once it is installed, let's try downloading a file.

startTime = time.time()
url = 'https://img.ltn.com.tw/Upload/news/600/2019/11/10/phpK7nZ7J.jpg'
fileName = 'test.jpg'
content_size = int(requests.get(url, stream=True).headers['Content-Length'])

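As the download URL, let's use a typhoon image from a Liberty Times (自由時報) news article. If you paste the value of url into a browser, you should see the image directly.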

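fileName is the name under which the image is saved in the current directory.
content_size is the file size reported by the site, read from the Content-Length response header.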
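The walkthrough reads Content-Length directly and assumes the header is present. A slightly more defensive sketch is shown below. The function name describe_headers is hypothetical, and it accepts any mapping of header names to values, so it can be tried without a network connection. A real requests response already exposes case-insensitive headers; the plain dict used here is matched by lowercasing the keys.

```python
def describe_headers(headers):
    """Return (total_size, accepts_ranges) from a mapping of response headers.

    total_size is None when Content-Length is missing or not a number.
    accepts_ranges is True only when the server advertises
    "Accept-Ranges: bytes"; some servers honor Range without saying so.
    """
    lowered = {str(k).lower(): str(v).strip() for k, v in headers.items()}
    length = lowered.get('content-length', '')
    total_size = int(length) if length.isdigit() else None
    accepts_ranges = lowered.get('accept-ranges', '').lower() == 'bytes'
    return total_size, accepts_ranges


print(describe_headers({'Content-Length': '66825', 'Accept-Ranges': 'bytes'}))
# prints (66825, True)
```

With requests, the mapping could come from requests.head(url, allow_redirects=True).headers, which avoids opening a streamed GET just to read the size. Whether a given server answers a HEAD request with Content-Length is not guaranteed.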

if os.path.exists(fileName):
    temp_size = os.path.getsize(fileName)
else:
    temp_size = 0

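This step checks whether the image already exists in the current directory. If it does, we take its current size; if not, temp_size is set to 0, meaning we have never downloaded this file.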

print('Temp:', temp_size)
print('Total:', content_size)

headers = {'Range': 'bytes=%d-' % temp_size}
r = requests.get(url, stream=True, headers=headers)
print('[File size]: %0.2f MB' % (content_size/1024/1024))
print('Status:', r.status_code)

Output:

Temp: 0
Total: 66825
[File size]: 0.06 MB
Status: 206

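The Range value set in headers is the important part: bytes=N- asks the server to send the file starting at byte offset N. Also note the status code. If the server responds with 200, it ignored the Range header and is sending the whole file, which means the site does not support starting from a chosen byte; in that case the response body should not be appended to an existing partial file. If it responds with 206 (Partial Content), the range was honored.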
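The program in this post always appends to the file, whatever the status code is. One way to act on the status code could look like the sketch below. The helper name plan_write and its return convention are assumptions made for illustration; the optional content_range argument is the value of the Content-Range response header, if the caller has it.

```python
import re


def plan_write(status_code, local_size, content_range=None):
    """Decide how to write a response body after sending a Range request.

    Returns (file_mode, start_offset). Hypothetical helper for illustration.
    """
    if status_code == 206:
        # Partial Content: check that the server starts where we asked
        if content_range is not None:
            match = re.match(r'bytes (\d+)-(\d+)/(\d+|\*)$', content_range)
            if match is None or int(match.group(1)) != local_size:
                raise ValueError('unexpected Content-Range: %r' % content_range)
        return 'ab', local_size
    if status_code == 200:
        # Range ignored: the body is the whole file, so overwrite from scratch
        return 'wb', 0
    if status_code == 416:
        # Range Not Satisfiable: often the local file is already complete
        raise ValueError('requested range not satisfiable (file may be complete)')
    raise ValueError('unexpected status code: %d' % status_code)
```

The returned mode would replace the hard-coded 'ab' when opening the file, and the returned offset would replace temp_size as the starting point of the progress counter.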

with open(fileName, 'ab') as file:
    for data in r.iter_content(chunk_size=1024):
        file.write(data)
        temp_size += len(data)

        print('\r' + '[Download progress]:[%s%s]%.2f%%;' % (
        '█' * int(temp_size * 20 / content_size), ' '*(20-int(temp_size*20/content_size)),
        float(temp_size/content_size*100)), end='')

print('\n' + 'Download finished!')
print('Download time:%.2f s' % (time.time()-startTime))

Output:

[Download progress]:[████████████████████]100.00%;
Download finished!
Download time:0.56 s

Since we had never downloaded this image before, the whole image was downloaded in one go. Feel free to open the downloaded image and take a look.
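The progress line printed inside the download loop can also be factored into a small function, which makes the arithmetic easier to check on its own. The sketch below is one possible version; the name render_bar and the clamping of done to total are additions for illustration and are not in the original program.

```python
def render_bar(done, total, width=20):
    """Return a text progress bar of `width` cells followed by a percentage."""
    if total <= 0:
        raise ValueError('total must be positive')
    # Clamp so the bar never overflows if more bytes arrive than expected
    done = min(done, total)
    filled = done * width // total
    percent = done / total * 100
    return '[%s%s]%.2f%%' % ('█' * filled, ' ' * (width - filled), percent)


print(render_bar(1024, 66825))
# prints [                    ]1.53%
```

Inside the loop it could be used as print('\r' + render_bar(temp_size, content_size), end='').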

Next, delete test.jpg and change the code as follows:

with open(fileName, 'ab') as file:
    for data in r.iter_content(chunk_size=1024):
        file.write(data)
        temp_size += len(data)

        print('\r' + '[Download progress]:[%s%s]%.2f%%;' % (
        '█' * int(temp_size * 20 / content_size), ' '*(20-int(temp_size*20/content_size)),
        float(temp_size/content_size*100)), end='')
        break

Output:

[Download progress]:[                    ]1.53%;
Download finished!
Download time:0.51 s

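This time we break out of the loop right after the first chunk is written, so the download does not complete. Only the first 1,024 bytes are saved.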

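Opening test.jpg at this point shows what looks like a badly broken image. Yes, you are seeing that right: only the first part of the file is there.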

Then we remove the break and download again.

This time the result is:

Temp: 1024
Total: 66825
[File size]: 0.06 MB
Status: 206
[Download progress]:[████████████████████]100.00%;
Download finished!
Download time:0.58 s

Open the image and check: did this run fill in the part that was missing and produce the complete image?
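The interrupt-then-resume experiment above can also be reproduced without a network connection, which is convenient for checking the logic. In the sketch below a bytes object stands in for the remote file and a generator stands in for the ranged response; ranged_chunks and resume_into are hypothetical names, and no HTTP is involved.

```python
import os


def ranged_chunks(payload, start, chunk_size=1024):
    """Stand-in for a ranged response: yield payload[start:] in chunks."""
    for i in range(start, len(payload), chunk_size):
        yield payload[i:i + chunk_size]


def resume_into(path, payload, max_chunks=None):
    """Append the missing part of `payload` to `path`; return the new size.

    `max_chunks` simulates an interruption, like the `break` above.
    """
    # Resume from however many bytes are already on disk
    start = os.path.getsize(path) if os.path.exists(path) else 0
    with open(path, 'ab') as f:
        for n, chunk in enumerate(ranged_chunks(payload, start)):
            if max_chunks is not None and n >= max_chunks:
                break
            f.write(chunk)
    return os.path.getsize(path)
```

Calling resume_into(path, payload, max_chunks=1) on a fresh path leaves a 1,024-byte partial file, and a second call without max_chunks appends the rest, so the file on disk ends up identical to payload.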

Complete Code

# -*- coding: utf-8 -*-
import os
import time
import requests


startTime = time.time()
url = 'https://img.ltn.com.tw/Upload/news/600/2019/11/10/phpK7nZ7J.jpg'
fileName = 'test.jpg'

# Ask the server for the total size of the file
content_size = int(requests.get(url, stream=True).headers['Content-Length'])

# Size of the partial file already on disk (0 if nothing has been downloaded)
if os.path.exists(fileName):
    temp_size = os.path.getsize(fileName)
else:
    temp_size = 0

print('Temp:', temp_size)
print('Total:', content_size)

# Request only the bytes from offset temp_size onward
headers = {'Range': 'bytes=%d-' % temp_size}
r = requests.get(url, stream=True, headers=headers)
print('[File size]: %0.2f MB' % (content_size/1024/1024))
print('Status:', r.status_code)  # 206 means the range was honored


# Append the remaining bytes to the existing file
with open(fileName, 'ab') as file:
    for data in r.iter_content(chunk_size=1024):
        file.write(data)
        temp_size += len(data)

        # Redraw a 20-character progress bar on the same line
        print('\r' + '[Download progress]:[%s%s]%.2f%%;' % (
        '█' * int(temp_size * 20 / content_size), ' '*(20-int(temp_size*20/content_size)),
        float(temp_size/content_size*100)), end='')

print('\n' + 'Download finished!')
print('Download time:%.2f s' % (time.time()-startTime))
