
[Python] Use the Selenium package to crawl the Google search engine

For a long time, I have wanted to write an article about crawlers, recording the tools and knowledge I use most often: regular expressions, IP settings, user agents, and so on.

I have done a lot of crawling work recently, but this part of my work may pause for a while, and I am afraid my crawling skills will get rusty.

Today’s note covers crawling the Google search engine with Python + Selenium. I can set the query keywords and the number of pages to crawl, and then print out the title and URL of each query result.


Selenium Preparation

First, we need to install selenium and webdriver_manager.

pip3 install selenium
pip3 install webdriver_manager


We also need to install the Chromium driver.

sudo apt-get install chromium-driver
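
By the way, the code below never actually imports webdriver_manager, but it is a handy alternative to the system-wide driver: it downloads a ChromeDriver that matches your browser automatically. A minimal sketch, assuming Selenium 4's Service API:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Download (and cache) a ChromeDriver matching the installed Chrome,
# then start the browser with it
service = Service(ChromeDriverManager().install())
browser = webdriver.Chrome(service=service)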


After the installation is complete, install the BeautifulSoup4 package. Strictly speaking, it is not necessary for my crawling process, but this time I want to use its prettify() function to print the HTML with a clear layout before applying regular expressions.

pip3 install beautifulsoup4
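
As a quick illustration of what prettify() buys us, here is a toy example (the HTML snippet is made up for demonstration):

from bs4 import BeautifulSoup

html = '<div class="r"><a href="https://example.com"><h3>Title</h3></a></div>'
soup = BeautifulSoup(html, 'html.parser')

# prettify() re-indents the tree with one tag per line, which makes
# line-oriented regular expressions much easier to write
print(soup.prettify())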

Import the Packages

Then import all the packages we need.

# coding: utf-8
"""
Post the query to Google Search and get the return results
"""
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By




Setting Parameters

# Browser settings
chrome_options = Options()
chrome_options.add_argument('--incognito')
chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0')
browser = webdriver.Chrome(options=chrome_options)


# Query settings
query = 'US Stock'
browser.get('https://www.google.com/search?q={}'.format(query))
next_page_times = 10



Here I set the browser options, the query keyword, and the number of times to turn the page, each in its own block.

This option makes the browser run in incognito mode.

chrome_options.add_argument('--incognito')



This line fills in the user agent. The HTML Google returns changes with the user agent, so my later commands for retrieving the titles and URLs are adjusted to the user agent I set here.

chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0')
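
If you want to confirm the user agent actually took effect, a quick check (a minimal sketch; run it after the browser is created) is to ask the browser itself:

# The browser reports whichever user agent it is currently sending
print(browser.execute_script('return navigator.userAgent'))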




Crawl

# Crawler
for _page in range(next_page_times):
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    content = soup.prettify()

    # Get titles and urls (these class-name patterns match the HTML Google
    # returns for the user agent set above, and may change over time)
    titles = re.findall(r'<h3 class="[\w\d]{6} [\w\d]{6}">\n +(.+)', content)
    urls = re.findall(r'<div class="r"> *\n *<a href="(.+)" onmousedown', content)

    for n in range(min(len(titles), len(urls))):
        print(titles[n], urls[n])

    # Wait
    time.sleep(5)

    # Turn to the next page ('下一頁' is the "Next" link text in the
    # Chinese-locale Google UI; adjust it for your locale)
    try:
        browser.find_element(By.LINK_TEXT, '下一頁').click()
    except NoSuchElementException:
        print('Search Early Stopping.')
        browser.close()
        exit()


# Close the browser
browser.close()



The important thing is that I deliberately wait 5 seconds before turning the page. In my testing, if you do not wait for a while, page turning sometimes fails.
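
If the fixed 5-second sleep feels wasteful, Selenium's explicit waits are an alternative: they block only until the target element appears. A minimal sketch, reusing the '下一頁' link text from the crawler above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the "next page" link to show up
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.LINK_TEXT, '下一頁'))
)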

The try-except block at the bottom handles the case where the number of result pages is smaller than requested. For example, I specified 10 page turns, but the search engine returned only 7 pages of results.


Complete Code

# coding: utf-8
"""
Post the query to Google Search and get the return results
"""
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


# Browser settings
chrome_options = Options()
chrome_options.add_argument('--incognito')
chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0')
browser = webdriver.Chrome(options=chrome_options)


# Query settings
query = 'US Stock'
browser.get('https://www.google.com/search?q={}'.format(query))
next_page_times = 10


# Crawler
for _page in range(next_page_times):
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    content = soup.prettify()

    # Get titles and urls (these class-name patterns match the HTML Google
    # returns for the user agent set above, and may change over time)
    titles = re.findall(r'<h3 class="[\w\d]{6} [\w\d]{6}">\n +(.+)', content)
    urls = re.findall(r'<div class="r"> *\n *<a href="(.+)" onmousedown', content)

    for n in range(min(len(titles), len(urls))):
        print(titles[n], urls[n])

    # Wait
    time.sleep(5)

    # Turn to the next page ('下一頁' is the "Next" link text in the
    # Chinese-locale Google UI; adjust it for your locale)
    try:
        browser.find_element(By.LINK_TEXT, '下一頁').click()
    except NoSuchElementException:
        print('Search Early Stopping.')
        browser.close()
        exit()


# Close the browser
browser.close()



