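[Python] Crawling the Google Search Engine with Selenium

For a long time I have wanted to write up my web-crawling notes, covering Regular Expressions, IP settings, the User-Agent, and the other tools and bits of knowledge that come up again and again. One reason is that I started this blog to record what I learn; the other is that I currently spend a lot of time on crawling work, and I worry that the skills I am fluent in now will fade if I do less of it later.

Today's note focuses on the Google search engine, using Python + Selenium. I can set the keyword to query and the number of pages to crawl, and the script prints the titles and URLs it captures.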

Preparing Selenium

First, install the "selenium" package and "webdriver_manager":

pip3 install selenium
pip3 install webdriver_manager

Next, install the Chromium driver:

sudo apt-get install chromium-driver

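Once that is done, also install "BeautifulSoup4". The crawl itself does not strictly require "BeautifulSoup4", but this time I rely on its "prettify()" method to print a cleanly laid-out version of the page before processing it with Regular Expressions.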

pip3 install beautifulsoup4
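The script in this post starts Chrome with the system-wide driver and never actually calls webdriver_manager. If you would rather let webdriver_manager download a matching driver, a minimal sketch might look like the following. It assumes Selenium 4 and the webdriver_manager package; `make_browser` is a hypothetical helper name, not part of the original script.

```python
def make_browser(user_agent=None):
    """Start Chrome with a driver that webdriver_manager fetches on demand.

    Hypothetical helper; assumes Selenium 4 and webdriver_manager are installed.
    """
    # Imported inside the function so that merely defining it
    # does not require either package to be present.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    options.add_argument('--incognito')
    if user_agent:
        options.add_argument('user-agent={}'.format(user_agent))

    # install() downloads a driver if needed and returns the path to the binary.
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```

Calling `make_browser()` would then take the place of the `webdriver.Chrome(...)` line; with the apt-installed driver on the PATH this step is not needed.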

Importing the Packages

First, import every package the project uses.

# coding: utf-8
"""
Post the query to Google Search and get the returned results
"""
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

Setting the Parameters

# Browser settings
chrome_options = Options()
chrome_options.add_argument('--incognito')
chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0')
# options= replaces the deprecated chrome_options= keyword
browser = webdriver.Chrome(options=chrome_options)


# Query settings
query = 'US Stock'
browser.get('https://www.google.com/search?q={}'.format(query))
next_page_times = 10

Here I configure the browser, set the query keyword, and set how many times to turn the page.
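The script formats the raw query straight into the URL. A query containing spaces or characters such as & or + is safer when it is percent-encoded first. A minimal sketch using only the standard library, where `build_search_url` is a hypothetical helper name introduced for illustration:

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Return a Google search URL with the query percent-encoded."""
    # urlencode turns spaces into '+' and escapes reserved characters.
    return 'https://www.google.com/search?' + urlencode({'q': query})


print(build_search_url('US Stock'))
# https://www.google.com/search?q=US+Stock
print(build_search_url('C++ & Python'))
# https://www.google.com/search?q=C%2B%2B+%26+Python
```

The result could be passed to `browser.get(...)` in place of the `.format(query)` call.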

chrome_options.add_argument('--incognito')

This line turns on incognito mode.

chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0')

This line sets the User-Agent. The HTML that Google returns varies with the User-Agent we send, and the patterns I use later to extract titles and URLs are tuned to the User-Agent set here.

Running the Crawler

# Crawler
for _page in range(next_page_times):
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    content = soup.prettify()

    # Get titles and urls (raw strings keep the regex escapes intact)
    titles = re.findall(r'<h3 class="[\w\d]{6} [\w\d]{6}">\n +(.+)', content)
    urls = re.findall(r'<div class="r"> *\n *<a href="(.+)" onmousedown', content)

    # Pair them up; zip stops at the shorter list
    for title, url in zip(titles, urls):
        print(title, url)

    # Wait
    time.sleep(5)

    # Turn to the next page ('下一頁' is the next-page link text in the zh-TW interface)
    try:
        # 'link text' is the locator strategy that By.LINK_TEXT names
        browser.find_element('link text', '下一頁').click()
    except Exception:
        # No next-page link: fewer result pages than requested
        print('Search Early Stopping.')
        break


# Close the browser and end the driver session
browser.quit()

There is not much here that needs explaining. I tidy the HTML with soup.prettify(), then extract the titles and URLs with Regular Expressions.
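To see what the two patterns capture, they can be run against a small hand-written sample laid out the way prettify() prints nested tags. This is only an illustration: the class names and URLs below are made up, and real Google markup differs and changes over time.

```python
import re

# Hand-written sample in the shape prettify() produces (one tag per line,
# one extra space of indentation per nesting level). Not real Google output.
SAMPLE = '''<div class="r">
 <a href="https://example.com/first" onmousedown="return f(this)">
  <h3 class="abc123 def456">
   First result title
  </h3>
 </a>
</div>
<div class="r">
 <a href="https://example.com/second" onmousedown="return f(this)">
  <h3 class="abc123 def456">
   Second result title
  </h3>
 </a>
</div>
'''

# Title: the text on the line right after an <h3> with two 6-character classes.
TITLE_RE = re.compile(r'<h3 class="[\w\d]{6} [\w\d]{6}">\n +(.+)')
# URL: the href of the <a> that directly follows <div class="r">.
URL_RE = re.compile(r'<div class="r"> *\n *<a href="(.+)" onmousedown')

titles = TITLE_RE.findall(SAMPLE)
urls = URL_RE.findall(SAMPLE)

for title, url in zip(titles, urls):
    print(title, url)
# First result title https://example.com/first
# Second result title https://example.com/second
```

Because both patterns depend on exact class-name lengths and attribute order, any change in the page markup makes them return empty lists without raising an error.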

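The important part is that I deliberately wait 5 seconds before turning the page. In my tests, turning the page without a short wait sometimes caused problems.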
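A fixed 5-second sleep is the simplest option. If the problem is that the page has not finished loading, an alternative is to poll for a condition and continue as soon as it holds. The helper below is a generic standard-library sketch, not what the original script does, and `wait_until` is a hypothetical name:

```python
import time


def wait_until(predicate, timeout=10.0, poll=0.5):
    """Call predicate() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within {} s'.format(timeout))
        time.sleep(poll)
```

With the script's `browser` object, the predicate could be something like `lambda: browser.find_elements('link text', '下一頁')`, which stays an empty list until the link is present. Selenium also ships its own `WebDriverWait` class in `selenium.webdriver.support.ui` for the same purpose.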

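The try-except at the bottom guards against the search returning fewer pages than requested: for example, I ask to turn the page 10 times, but the search engine returns only 7 pages of results.

Here is part of the returned output (screenshot).

Complete Code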

# coding: utf-8
"""
Post the query to Google Search and get the returned results
"""
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


# Browser settings
chrome_options = Options()
chrome_options.add_argument('--incognito')
chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0')
# options= replaces the deprecated chrome_options= keyword
browser = webdriver.Chrome(options=chrome_options)


# Query settings
query = 'US Stock'
browser.get('https://www.google.com/search?q={}'.format(query))
next_page_times = 10


# Crawler
for _page in range(next_page_times):
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    content = soup.prettify()

    # Get titles and urls (raw strings keep the regex escapes intact)
    titles = re.findall(r'<h3 class="[\w\d]{6} [\w\d]{6}">\n +(.+)', content)
    urls = re.findall(r'<div class="r"> *\n *<a href="(.+)" onmousedown', content)

    # Pair them up; zip stops at the shorter list
    for title, url in zip(titles, urls):
        print(title, url)

    # Wait
    time.sleep(5)

    # Turn to the next page ('下一頁' is the next-page link text in the zh-TW interface)
    try:
        # 'link text' is the locator strategy that By.LINK_TEXT names
        browser.find_element('link text', '下一頁').click()
    except Exception:
        # No next-page link: fewer result pages than requested
        print('Search Early Stopping.')
        break


# Close the browser and end the driver session
browser.quit()

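6 thoughts on “[Python] Crawling the Google Search Engine with Selenium”

Stan 2020-10-06 at 08:29

I'd like to ask the author something. I tried running your code but did not get output like yours (or rather, no output appeared at all). Is the problem that I need to change it to the class positions in my own Google page, or is there something else I have missed? Thank you. (I have only just started teaching myself Python and crawling, so there is a lot I don't understand; please forgive me if this is a silly question.)

  ccs96307 2020-10-06 at 13:57

  Hello, nice to meet you.
  Running into problems when you first start crawling is perfectly normal. If anything, I should learn from your willingness to ask.

  If the program runs without any error message and simply returns no results, the likely cause is that the Google Chrome page has been revised, so its source no longer matches what it was when I wrote this article.
  Most importantly, the Regular Expressions I use to find the page title and the URL no longer fit the page.
  You may need to rewrite the matching rules yourself to get results back from the current browser version.

  On my machine, I changed the following code:

  to:

  After that, my program returned query results again.

  So my guess is that a browser update caused this, and the matching code has to be kept up to date as well.

菜雞 2020-12-09 at 01:56

Today is 2020/12/9, and Chrome seems to have been updated again a few days ago.
This time the url part has an extra data-ved=….. segment; after adjusting for it I could get results again. I'm a beginner, so I don't dare post my code.
If ccs96307 has time, could you verify whether what I said is correct? ^^

  ccs96307 2020-12-09 at 13:47

  Oh, thanks for letting me know!
  I've been busy lately, so I may not have much time to test this.

  That said, although I replied above that a Chrome update might be what changed the page source, the Google search engine itself also goes through a number of large and small revisions every year, and it does not necessarily notify us users.

  That may also be why the page source changes so often. I hope an expert can step in and explain this part =D

    菜雞 2020-12-11 at 07:25

    I'd like to ask ccs96307 or any other expert how to get the complete "snippet" of a search result. When the snippet matches the keyword there is an extra one, so with findall I can only capture part of the snippet and not the whole thing. Has anyone tried this? Thanks.

ccs96307 2020-12-11 at 08:18

Sorry, I don't quite follow what you mean by an extra one when the keyword is matched @@
Do you have an example?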
