Last Updated on 2021-07-08 by Clay
For a long time, I have wanted to write an article about crawlers, recording the tools and knowledge I use often, such as regular expressions, IP settings, and user agents.
I have done a lot of crawling work lately, but I am afraid this part of my work will pause for a while and my crawling skills will get rusty.
Today's note is based on the Google search engine and Python + Selenium. I can set the keywords to query and the number of pages to crawl, and then print out the title and URL of each query result.
Selenium Preparation
First, we need to install selenium and webdriver_manager.
pip3 install selenium
pip3 install webdriver_manager
We also need to install the chromium driver.
sudo apt-get install chromium-driver
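As an aside, if you do not want to manage the driver binary yourself, webdriver_manager can download a matching driver for you. A minimal sketch, assuming the ChromeDriverManager class that webdriver_manager exposes:

# A minimal sketch: let webdriver_manager fetch a matching driver
# (assumption: webdriver_manager exposing ChromeDriverManager)
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# install() downloads the driver if needed and returns its local path
browser = webdriver.Chrome(ChromeDriverManager().install())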
After the installation is complete, install the BeautifulSoup4 package. Strictly speaking, it is not necessary for my crawling process, but this time I want to use its prettify() function to print the HTML in a clear layout before running regular expressions over it.
pip3 install beautifulsoup4
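To show why prettify() helps here, a small self-contained example (the HTML snippet is made up for illustration): prettify() puts each tag and each piece of text on its own line, which makes line-oriented regular expressions much easier to write.

from bs4 import BeautifulSoup

# A made-up snippet, just to illustrate the output format
html = '<div><h3 class="abc123 def456">US Stock</h3></div>'
print(BeautifulSoup(html, 'html.parser').prettify())
# <div>
#  <h3 class="abc123 def456">
#   US Stock
#  </h3>
# </div>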
Import the Packages
Then import all the packages we need.
# coding: utf-8
"""
Post the query to Google Search and get the return results
"""
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
Setting Parameters
# Browser settings
chrome_options = Options()
chrome_options.add_argument('--incognito')
chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0')
browser = webdriver.Chrome(options=chrome_options)

# Query settings
query = 'US Stock'
browser.get('https://www.google.com/search?q={}'.format(query))
next_page_times = 10
Here I set up the browser options, the query keyword, and the number of page turns separately.
This argument enables incognito mode:
chrome_options.add_argument('--incognito')
This line of code fills in the user agent. The HTML returned by Google changes with the user agent, so the commands I use later to retrieve the titles and URLs are adjusted for the user agent I set here:
chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0')
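If you want to confirm which user agent the browser is actually reporting, you can ask the page directly. A quick check, assuming the browser object created above:

# Quick sanity check: print the user agent the browser reports
print(browser.execute_script('return navigator.userAgent;'))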
Crawl
# Crawler
for _page in range(next_page_times):
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    content = soup.prettify()

    # Get titles and urls
    titles = re.findall(r'<h3 class="[\w\d]{6} [\w\d]{6}">\n\ +(.+)', content)
    urls = re.findall(r'<div class="r">\ *\n\ *<a href="(.+)" onmousedown', content)

    for n in range(min(len(titles), len(urls))):
        print(titles[n], urls[n])

    # Wait
    time.sleep(5)

    # Turn to the next page ('下一頁' is the link text of the "Next" button
    # on a Traditional Chinese Google interface; adjust it to your locale)
    try:
        browser.find_element_by_link_text('下一頁').click()
    except:
        print('Search Early Stopping.')
        browser.close()
        exit()

# Close the browser
browser.close()
The important thing is that I deliberately wait 5 seconds before turning to the next page. In my tests, changing pages without waiting sometimes fails.
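A fixed sleep is the simplest approach; an explicit wait is an alternative that clicks the link as soon as it becomes clickable. A sketch using Selenium's WebDriverWait (the 10-second timeout is an arbitrary choice of mine):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the "next page" link, then click it
WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.LINK_TEXT, '下一頁'))
).click()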
The try-except block at the bottom handles the case where the search returns fewer pages than requested. For example, I may ask to turn the page 10 times while the search engine returns only 7 pages of results.
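To make the title regular expression concrete, here is a tiny demo on a made-up fragment in the shape that prettify() produces (the class names and title are invented; the real class names change whenever Google updates its markup):

import re

# Invented fragment in the shape prettify() produces for this page layout
sample = '<h3 class="abc123 def456">\n      US Stock market news\n'
print(re.findall(r'<h3 class="[\w\d]{6} [\w\d]{6}">\n\ +(.+)', sample))
# ['US Stock market news']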
Complete Code
# coding: utf-8
"""
Post the query to Google Search and get the return results
"""
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Browser settings
chrome_options = Options()
chrome_options.add_argument('--incognito')
chrome_options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0')
browser = webdriver.Chrome(options=chrome_options)

# Query settings
query = 'US Stock'
browser.get('https://www.google.com/search?q={}'.format(query))
next_page_times = 10

# Crawler
for _page in range(next_page_times):
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    content = soup.prettify()

    # Get titles and urls
    titles = re.findall(r'<h3 class="[\w\d]{6} [\w\d]{6}">\n\ +(.+)', content)
    urls = re.findall(r'<div class="r">\ *\n\ *<a href="(.+)" onmousedown', content)

    for n in range(min(len(titles), len(urls))):
        print(titles[n], urls[n])

    # Wait
    time.sleep(5)

    # Turn to the next page ('下一頁' is the link text of the "Next" button
    # on a Traditional Chinese Google interface; adjust it to your locale)
    try:
        browser.find_element_by_link_text('下一頁').click()
    except:
        print('Search Early Stopping.')
        browser.close()
        exit()

# Close the browser
browser.close()