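[Python] Fetching Google News Results Easily with the GoogleNews Package

Google News is Google's news platform. These days it is best known for recommending the news you "want to read" based on what you have already read, presumably the result of some kind of association-based learning. For me, at least, it works quite well. (I really do want to read every article Google recommends XDD. You can imagine how much data I've handed over.)

When you get down to it, Google News simply aggregates material from all over the web. Rather than "Google's news," it is better described as a platform where you can see the news and opinion, from many different viewpoints, that Google has collected. Oh, and don't forget it also tailors recommendations just for you!

Given that, many people's first instinct is: if I want to collect news data, I'll just scrape the news off Google News. After that, the only thing left to focus on is cleaning out the unneeded tags.

I used to think the same way, and it's not that I couldn't make sense of the Google News web page structure well enough to scrape it... The truth is that while googling I came across a very handy package: GoogleNews.

That's right, GoogleNews is an open-source Python package that scrapes the Google News results for a search keyword. We can see exactly how its scraping rules are written, and the code is clean and readable.

Experts can write their own Google News scraper, so that they aren't left stuck when the Google News web page changes. For them this package probably has little to offer, and they can skip it.

But for those of us who aren't good at scraping, this is a great package: a few simple calls are enough to get news information from Google News.

One thing I should make clear, though: this package cannot fetch the full text of an article! More precisely, even if you scrape Google News directly, you still won't get the article body. After all, Google itself only crawls other sites' URLs and gathers those links on its own platform.

Enough talk. Let's look at how to use it!

Introducing GoogleNews

First, here is the usage guide written by the author: https://pypi.org/project/GoogleNews/

To be clear up front, I have not modified this program in any way, and the guide is the developer's own. I'm writing this post simply because I think a clean, convenient package deserves to be shared.

Let's hope the developer keeps the package updated when the web page is redesigned (if Google News ever is)!

First, install it with the following command: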

pip3 install GoogleNews

Then let's look at a simple piece of sample code:

# -*- coding: utf-8 -*-
from GoogleNews import GoogleNews

googlenews = GoogleNews()
googlenews.search('Trump')
result = googlenews.result()  # a list of dicts, one per article
print(len(result))

for n, article in enumerate(result):
    print(n)
    # Print each field name, then its value on the next line
    for key, value in article.items():
        print(key, '\n', value)

    break  # stop after the first article

Output:

10
0

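10 is the total number of articles on the default page (the first page by default).
0 appears because I only print the article at index 0 (that is, the first article).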

title
Republicans can't hide as new week of impeachment drama …

title: the article's headline

media 
The Guardian

date 
7 小時前

date: the publication time (shown relative to now; here, "7 hours ago")

desc 
A mocked-up video depicting US president Donald Trump stabbing and shooting his political opponents and the media has reportedly been …

desc: the article's description snippet

link 
https://www.theguardian.com/us-news/2019/oct/14/fake-video-of-trump-shooting-media-and-opponents-shown-at-presidents-resort

link: the link to the article

img 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyLuZefsE6IBuYw4rI7taPS0-rRwYoEtM7RK_nytVgdYpdggrfXx2MJC9hzLHW3YwvlYEu6mo

img: the article's image
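
Since each result is a plain dict with these six keys, it is easy to reshape. Here's a minimal sketch that keeps a few fields and writes them out as CSV text. The record is copied from the sample output above so the snippet runs offline; the helper name results_to_csv and the choice of columns are my own, not part of the package.

```python
import csv
import io

# One record copied from the sample output above; the keys are the
# fields the package returned in this post.
sample_results = [
    {
        "title": "Republicans can't hide as new week of impeachment drama …",
        "media": "The Guardian",
        "date": "7 小時前",
        "desc": "A mocked-up video depicting US president Donald Trump stabbing and shooting his political opponents and the media has reportedly been …",
        "link": "https://www.theguardian.com/us-news/2019/oct/14/fake-video-of-trump-shooting-media-and-opponents-shown-at-presidents-resort",
        "img": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyLuZefsE6IBuYw4rI7taPS0-rRwYoEtM7RK_nytVgdYpdggrfXx2MJC9hzLHW3YwvlYEu6mo",
    }
]


def results_to_csv(results, fields=("title", "media", "date", "link")):
    """Return CSV text with one row per article, keeping only `fields`.

    Hypothetical helper for illustration; not part of GoogleNews.
    """
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" silently drops keys we did not ask for (desc, img)
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


csv_text = results_to_csv(sample_results)
print(csv_text)
```

With real data you would pass googlenews.result() instead of sample_results.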

The above is what I got by searching for 'Trump'. If you want to see the second page of results, you can do this:

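# -*- coding: utf-8 -*-
from GoogleNews import GoogleNews

googlenews = GoogleNews()
googlenews.search('Trump')   # search() also fetches page 1
googlenews.clear()           # discard the page-1 results
googlenews.getpage(2)        # fetch page 2 instead
result = googlenews.result()
print(len(result))

for n, article in enumerate(result):
    print(n)
    # Print each field name, then its value on the next line
    for key, value in article.items():
        print(key, '\n', value)

    break  # stop after the first article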

Output:

10
0

title 
Inside Trump's Botched Attempt to Hire Trey Gowdy

media 
The New York Times

date 
11 小時前

desc 
Even as the White House confronts a deepening threat to Mr. Trump's presidency, it has struggled to decide how to respond, and who should …

link 
https://www.nytimes.com/2019/10/13/us/politics/trey-gowdy-trump-impeachment.html

img 
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT7PEu3yYM-KbMkqyGIRF-ppy5uxh_DlOMz0t2WhrYVefQt4McbzA7hBQmQ8513dWRXRTETpAT2

As we can see, this is the first article on the second page!

Remember to call googlenews.clear() to clear the previous results, and googlenews.getpage(2) to specify which page to fetch.
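
The clear-then-getpage pattern can be wrapped in a small loop when you want several pages. The following is only a sketch: collect_pages is a hypothetical helper of mine, and FakeNews is an offline stand-in that mimics the four methods used in this post (search, getpage, result, clear) so the example runs without a network. How the real package behaves against the live site may differ.

```python
class FakeNews:
    """Offline stand-in with the same method names used in this post.

    Hypothetical, for demonstration only; it invents three titles per page.
    """

    def __init__(self):
        self.results = []

    def search(self, key):
        self.key = key
        self.getpage()  # like the package version shown here, fetch page 1

    def getpage(self, page=1):
        # Appends to the stored results rather than replacing them
        self.results.extend(
            {"title": f"{self.key} p{page} #{i}"} for i in range(3)
        )

    def result(self):
        return self.results

    def clear(self):
        self.results = []


def collect_pages(client, keyword, pages):
    """Return {page_number: [result dicts]} for the requested pages."""
    collected = {}
    client.search(keyword)  # sets the keyword (and fetches page 1)
    for page in pages:
        client.clear()         # drop whatever the previous fetch stored
        client.getpage(page)   # fetch exactly this page
        collected[page] = list(client.result())  # keep our own copy
    return collected


pages = collect_pages(FakeNews(), "Trump", [1, 2])
print({page: len(items) for page, items in pages.items()})
```

To try it for real, pass GoogleNews() in place of FakeNews().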

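The Chinese Keyword Problem

Readers who have tried the package may already have noticed that it has a small bug when the keyword is in Chinese (here I use "川普", i.e. "Trump", as the keyword):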

'ascii' codec can't encode characters in position 14-15: ordinal not in range(128)

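That is the error the program reports.

This happens simply because, when using urllib in Python 3, the URL we request cannot contain raw Chinese (non-ASCII) characters.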
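
The error can be reproduced without touching the network. This is a sketch under one assumption, labeled in the code: that the HTTP layer builds a request line of the form "GET <path> HTTP/1.1" and encodes it as ASCII before sending. Under that assumption, the two Chinese characters land at positions 14 and 15, which matches the message above.

```python
keyword = "川普"

# Assumption: the HTTP layer encodes a request line shaped like this as ASCII.
# "GET /search?q=" is 14 characters long, so the keyword starts at index 14.
request_line = "GET /search?q=" + keyword + "&tbm=nws&start=0 HTTP/1.1"

error_message = None
try:
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    # Same family of error the package printed for a Chinese keyword
    error_message = str(exc)

print(error_message)
```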

To make Chinese searches work, we may need to modify the package's source code slightly.

The source code is as follows:

from bs4 import BeautifulSoup as soup
import urllib.request


class GoogleNews():
    def __init__(self):
        self.texts = []
        self.links = []
        self.results = []

    def search(self, key):
        self.key = "+".join(key.split(" "))
        self.getpage()

    def getpage(self, page=1):
        self.user_agent='Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0'
        self.headers={'User-Agent':self.user_agent}
        self.url="https://www.google.com/search?q="+self.key+"&tbm=nws&start=%d" % (10*(page-1))
        
        try:
            self.req=urllib.request.Request(self.url, headers=self.headers)
            self.response=urllib.request.urlopen(self.req)
            self.page=self.response.read()
            self.content=soup(self.page, "html.parser")
            result=self.content.find_all("div", class_="g")
        
            for item in result:
                self.texts.append(item.find("h3").text)
                self.links.append(item.find("h3").find("a").get("href"))
                self.results.append({'title':item.find("h3").text,'media':item.find("div", class_="slp").find_all("span")[0].text,'date':item.find("div", class_="slp").find_all("span")[2].text,'desc':item.find("div", class_="st").text,'link':item.find("h3").find("a").get("href"),'img':item.find("img").get("src")})
            
            self.response.close()
        
        except Exception as e:
            print(e)
            pass

    def result(self):
        return self.results

    def gettext(self):
        return self.texts

    def getlinks(self):
        return self.links

    def clear(self):
        self.texts = []
        self.links = []
        self.results = []
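
To see the URL logic from search() and getpage() in isolation, here's a small sketch that mirrors those two steps: spaces in the keyword become "+", and the start offset is 10 results per page. The function name build_search_url is my own; the string pieces are copied from the source above.

```python
def build_search_url(key, page=1):
    """Mirror the URL construction shown in search() and getpage()."""
    query = "+".join(key.split(" "))  # "Donald Trump" -> "Donald+Trump"
    start = 10 * (page - 1)           # page 1 -> 0, page 2 -> 10, ...
    return "https://www.google.com/search?q=" + query + "&tbm=nws&start=%d" % start


print(build_search_url("Donald Trump", page=2))
```

Note that nothing here percent-encodes the keyword, which is exactly why a Chinese keyword ends up raw in the URL.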

Sorry if the source above looks a bit messy. It will probably be laid out better if you copy it into your own IDE.

Most importantly, we need to import string at the top of this code.

import string

Then, directly below this line:

self.url="https://www.google.com/search?q="+self.key+"&tbm=nws&start=%d" % (10*(page-1))

add one more line that calls urllib.request.quote(), so that the two lines read:

self.url="https://www.google.com/search?q="+self.key+"&tbm=nws&start=%d" % (10*(page-1))
self.url = urllib.request.quote(self.url, safe=string.printable)

With that change, we can search in Chinese.
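
To make the effect of that line visible, here's a minimal offline sketch. It calls quote() from urllib.parse, which is where the function is documented; as far as I know, urllib.request.quote used above is the same function re-exported from that module. Passing safe=string.printable leaves every printable ASCII character alone, so only the non-ASCII keyword gets percent-encoded as UTF-8.

```python
import string
from urllib.parse import quote

url = "https://www.google.com/search?q=川普&tbm=nws&start=0"

# Printable ASCII (":", "/", "?", "=", "&", ...) is marked safe and kept as-is;
# the Chinese characters are replaced by their UTF-8 bytes in %XX form.
encoded = quote(url, safe=string.printable)
print(encoded)

# The encoded URL is now pure ASCII, so this no longer raises.
encoded.encode("ascii")
```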

Back to my own code:

# -*- coding: utf-8 -*-
from GoogleNews import GoogleNews

googlenews = GoogleNews()
googlenews.search('川普')
result = googlenews.gettext()
print(len(result))

for n in range(len(result)):
    print(n)
    print(result[n])

Output:

10
0
最狂直播主上線！ 川普開Twitch頻道狂吸年輕票
1
支持者播放迷因影片描繪川普掃射媒體屠殺政敵
2
川普被預測連任有望得票率過半更勝2016年
3
陸美談判露曙光又怎樣？他爆川普慣用伎倆
4
川普：北京已開始購買美國農產品
5
川普總統如參加APEC 張忠謀：兩人一定會見面
6
嗆搞垮土經濟傳川普最快這時會出手
7
大兵回家or另闢戰場？川普下令駐敘利亞美軍開始撤離沙烏地 …
8
川普樂了？超準牛津經濟預測：川普明年將連任
9
拜登子打破沉默駁斥川普攻擊辭去中國公司董事

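And with that, Chinese searches work!

I believe I said earlier that I wouldn't modify the code... I'm really sorry about that. Still, I don't think this really counts as changing the package; it just lets Chinese keywords work properly.

The Future

First, an unrelated aside: I like to drink coffee and read the news over breakfast every morning.

I've long wanted to write a small desktop GUI tool that shows me the latest news. Thanks to this package, I think I can save a lot of time on the early part of that work.

Much of what remains is cleaning tags for each different site. But there's no escaping that!

All in all, this package is really convenient, and I hope the author keeps updating it! (If that turns out to be necessary.)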

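10 thoughts on "[Python] Fetching Google News Results Easily with the GoogleNews Package"

yen 2020-04-23 at 02:22
I'd like to ask: when I used your package before, media and date both came back, but after a while both fields are empty. I'm not sure whether it's being blocked~
1. ccs96307 2020-04-23 at 03:11
  Do you mean that English searches work fine but Chinese searches return no results?
  If so, the possible fix is hard to describe in a comment, so I wrote it up as a separate post.
  It might be worth a look: https://clay-atlas.com/blog/2020/04/23/python-cn-package-googlenews-chinese-error/
2. yen 2020-05-07 at 13:25
  Changing the 1 in getpage(1) to 10 returns exactly the same thing... Google News doesn't seem to have pages; it just keeps scrolling down. How should I change this?

yen 2020-05-07 at 13:23
After changing the page number in getpage(1), what I get seems to be the same as with getpage(1).

Yen 2020-05-07 at 13:25
Changing the 1 in getpage(1) to 10 returns exactly the same thing... Google News doesn't seem to have pages; it just keeps scrolling down. How should I change this?
1. ccs96307 2020-05-07 at 14:12
  At the moment getpage() does seem to behave oddly... You may have to wait for a future update from the developer, or consider scraping Google News yourself with selenium.

tansunit 2021-04-29 at 08:52
I recall the author mentions on the project page that you can pass a parameter to googlenews = GoogleNews() to set the language. For example, adding lang=zh-cn lets me search Simplified Chinese content directly.

What I'd like this tool to solve next is specifying a particular website; my attempts so far have failed.
1. Clay 2021-04-29 at 10:02
  After all, for Google's services and products, the official APIs are usually the better choice.

Wa.01 2023-12-13 at 09:50
The author has now (2023/12/13) updated it to 1.6.12.
https://github.com/Iceloof/GoogleNews
1. Clay 2023-12-14 at 07:37
  Wow, it's still being updated after all this time! I haven't used this package in ages, hahaha. The usage may be a bit different now? I should find time to refresh this post.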
