Last Updated on 2021-06-02 by Clay
Google News is Google's well-known news platform. It recommends news you are likely to want to read, based on your browsing history. (I think it is trained with some kind of relational learning, but I’m not sure.)
I really like the recommendation feature, because I actually do read the news it recommends.
Strictly speaking, though, Google News is a collection of news that Google gathers from other sources; Google only provides the platform.
Because it holds so much news, many people may want to download it for processing. After all, a single query word gets you all kinds of news, which is very convenient!
However, Google’s HTML structure is actually quite complex, so a third-party package can help us.
“GoogleNews” is a great package. It is an open-source Python package, so we can read the parsing rules it uses. (If you are an expert, you can easily write your own crawler.)
But this package is very convenient for people who are not familiar with crawlers, like me. So it’s a useful package.
It is worth noting that the “GoogleNews” package can’t get the full content of a news article. In other words, the URL you get from Google News is just a URL on another platform; Google only collects these URLs on its own platform.
That was a long digression, so let’s get started!
GoogleNews
Let’s take a look at the developer’s PyPI page: https://pypi.org/project/GoogleNews/
We can install the package with the following command:
pip3 install GoogleNews
Let’s look at a simple example:
# -*- coding: utf-8 -*-
from GoogleNews import GoogleNews

googlenews = GoogleNews()
googlenews.search('Trump')
result = googlenews.result()
print(len(result))

for n in range(len(result)):
    print(n)
    for index in result[n]:
        print(index, '\n', result[n][index])

    # Stop after printing the first article
    exit()
Output:
10
0
10 is the default number of articles per page.
0 is the index of the article I printed (the first article).
title
Republicans can't hide as new week of impeachment drama …
title: article title.
media
The Guardian
date
7 小時前
date: the publication time (here “7 小時前” means “7 hours ago”).
desc
A mocked-up video depicting US president Donald Trump stabbing and shooting his political opponents and the media has reportedly been …
desc: description.
link
https://www.theguardian.com/us-news/2019/oct/14/fake-video-of-trump-shooting-media-and-opponents-shown-at-presidents-resort
link: the URL of the article.
img
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQyLuZefsE6IBuYw4rI7taPS0-rRwYoEtM7RK_nytVgdYpdggrfXx2MJC9hzLHW3YwvlYEu6mo
img: the URL of the image shown with the article.
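Since each result is a plain dictionary with the six keys shown above, it is easy to store the results for later processing. Below is a minimal sketch, assuming the keys are exactly title, media, date, desc, link and img as in the output above. The helper name results_to_csv is hypothetical and not part of the package.

```python
import csv
import io

# Field names as they appear in the output above (an assumption about the
# package version used in this post; other versions may differ).
FIELDS = ["title", "media", "date", "desc", "link", "img"]


def results_to_csv(results):
    """Return a list of result dictionaries as CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for item in results:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({field: item.get(field, "") for field in FIELDS})
    return buffer.getvalue()


# Usage (needs the GoogleNews package and a network connection):
#   googlenews.search('Trump')
#   with open('news.csv', 'w', newline='', encoding='utf-8') as f:
#       f.write(results_to_csv(googlenews.result()))
```

The text is built in memory first, so the same function can feed a file, a database import, or a quick print.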
The above is the result of my query for “Trump”. If you want to see the contents of the second page, you can do this:
# -*- coding: utf-8 -*-
from GoogleNews import GoogleNews

googlenews = GoogleNews()
googlenews.search('Trump')
googlenews.clear()
googlenews.getpage(2)
result = googlenews.result()
print(len(result))

for n in range(len(result)):
    print(n)
    for index in result[n]:
        print(index, '\n', result[n][index])

    # Stop after printing the first article
    exit()
Output:
10
0
title
Inside Trump's Botched Attempt to Hire Trey Gowdy
media
The New York Times
date
11 小時前
desc
Even as the White House confronts a deepening threat to Mr. Trump's presidency, it has struggled to decide how to respond, and who should …
link
https://www.nytimes.com/2019/10/13/us/politics/trey-gowdy-trump-impeachment.html
img
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT7PEu3yYM-KbMkqyGIRF-ppy5uxh_DlOMz0t2WhrYVefQt4McbzA7hBQmQ8513dWRXRTETpAT2
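The article above is the first article on the next page.
Be careful to call “googlenews.clear()” first to clear the stored results; otherwise the results of page 1 stay in the list, because every page that is fetched gets appended to it. After that, “googlenews.getpage(2)” fetches the specific page.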
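To gather several pages in one go, these calls can be wrapped in a small loop. This is only a sketch: collect_pages is a hypothetical helper, and it assumes the client behaves like the version of the package used in this post (search() loads page 1, getpage(n) appends page n to the stored results, and result() returns the accumulated list). Other releases of the package may use different method names.

```python
def collect_pages(client, query, pages=2):
    """Collect results from page 1 up to `pages` (hypothetical helper).

    `client` is expected to offer clear(), search(), getpage() and result()
    with the behaviour of the package version used in this post.
    """
    client.clear()            # start from an empty result list
    client.search(query)      # assumption: search() also loads page 1
    for page in range(2, pages + 1):
        client.getpage(page)  # assumption: getpage() appends to the list
    return list(client.result())


# Usage (needs the GoogleNews package and a network connection):
#   from GoogleNews import GoogleNews
#   articles = collect_pages(GoogleNews(), 'Trump', pages=3)
#   print(len(articles))
```

Here clear() is called only once, before the search, so the pages accumulate instead of replacing each other.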
Error with Chinese Queries
Some users may run into the same problem: there is a small bug in this package when we query with Chinese characters (for example “川普”, which means “Trump”).
'ascii' codec can't encode characters in position 14-15: ordinal not in range(128)
This is the error we get. It happens because the URL requested through “urllib” can’t contain Chinese characters. So if we want to query news in Chinese, we have to change some of the source code:
Original code:
from bs4 import BeautifulSoup as soup
import urllib.request

class GoogleNews():

    def __init__(self):
        self.texts = []
        self.links = []
        self.results = []

    def search(self, key):
        self.key = "+".join(key.split(" "))
        self.getpage()

    def getpage(self, page=1):
        self.user_agent='Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:64.0) Gecko/20100101 Firefox/64.0'
        self.headers={'User-Agent':self.user_agent}
        self.url="https://www.google.com/search?q="+self.key+"&tbm=nws&start=%d" % (10*(page-1))
        try:
            self.req=urllib.request.Request(self.url, headers=self.headers)
            self.response=urllib.request.urlopen(self.req)
            self.page=self.response.read()
            self.content=soup(self.page, "html.parser")
            result=self.content.find_all("div", class_="g")
            for item in result:
                self.texts.append(item.find("h3").text)
                self.links.append(item.find("h3").find("a").get("href"))
                self.results.append({
                    'title':item.find("h3").text,
                    'media':item.find("div", class_="slp").find_all("span")[0].text,
                    'date':item.find("div", class_="slp").find_all("span")[2].text,
                    'desc':item.find("div", class_="st").text,
                    'link':item.find("h3").find("a").get("href"),
                    'img':item.find("img").get("src")})
            self.response.close()
        except Exception as e:
            print(e)
            pass

    def result(self):
        return self.results

    def gettext(self):
        return self.texts

    def getlinks(self):
        return self.links

    def clear(self):
        self.texts = []
        self.links = []
        self.results = []
First, we need to import the “string” module at the top of that file:
import string
Then we find the following statement:
self.url="https://www.google.com/search?q="+self.key+"&tbm=nws&start=%d" % (10*(page-1))
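Right after it, add a new line that calls the “urllib.request.quote()” function. With safe=string.printable, every printable ASCII character is left as it is and only the non-ASCII characters are percent-encoded: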
self.url="https://www.google.com/search?q="+self.key+"&tbm=nws&start=%d" % (10*(page-1))
self.url = urllib.request.quote(self.url, safe=string.printable)
Now we can query in Chinese!
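To see what the added line does, the quoting step can be tried on its own, without the package or a network connection. The sketch below first reproduces the encoding failure on a request line like the one that is presumably sent (where exactly the error is raised inside the standard library is an assumption), then applies the same quote() call as the patch. All helper names are hypothetical. build_search_url is an alternative approach, not the package's code: it lets urllib.parse.urlencode escape the query instead of patching the finished URL.

```python
import string
from urllib.parse import quote, urlencode


def ascii_error(text):
    """Return the error message raised when `text` is encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as error:
        return str(error)
    return None


def quote_url(url):
    """Percent-encode only the characters outside printable ASCII."""
    # Same call as the patch: urllib.request.quote is urllib.parse.quote.
    return quote(url, safe=string.printable)


def build_search_url(key, page=1):
    """Hypothetical alternative: let urlencode() escape the query."""
    params = {"q": key, "tbm": "nws", "start": 10 * (page - 1)}
    return "https://www.google.com/search?" + urlencode(params)


if __name__ == "__main__":
    # The two Chinese characters sit at positions 14 and 15 of this line.
    print(ascii_error("GET /search?q=川普"))
    print(quote_url("https://www.google.com/search?q=川普&tbm=nws&start=0"))
    print(build_search_url("川普", page=2))
```

Building the query with urlencode() also handles spaces and characters such as “&” inside the search word, which the string.printable approach leaves untouched.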
# -*- coding: utf-8 -*-
from GoogleNews import GoogleNews

googlenews = GoogleNews()
googlenews.search('川普')
result = googlenews.gettext()
print(len(result))

for n in range(len(result)):
    print(n)
    print(result[n])
Output:
10
0
最狂直播主上線! 川普開Twitch頻道狂吸年輕票
1
支持者播放迷因影片描繪川普掃射媒體屠殺政敵
2
川普被預測連任有望得票率過半更勝2016年
3
陸美談判露曙光又怎樣?他爆川普慣用伎倆
4
川普:北京已開始購買美國農產品
5
川普總統如參加APEC 張忠謀:兩人一定會見面
6
嗆搞垮土經濟傳川普最快這時會出手
7
大兵回家or另闢戰場?川普下令駐敘利亞美軍開始撤離沙烏地 …
8
川普樂了?超準牛津經濟預測:川普明年將連任
9
拜登子打破沉默駁斥川普攻擊辭去中國公司董事
Hi,
I’m unable to extract the complete description (the complete news body). Can you help me get the complete description of the news?
Hello!
As I remember, this package is designed to get only the beginning or a summary of each news item. After all, Google News crawls news information from various news platforms and lets users go to those platforms to read the news.
The package developer can’t prepare a parser for every news platform.
If you want the complete news content, you may need to develop a crawler for the specific news platform you want to download from.
Have a good day.
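For readers with the same question, here is a rough starting point for such a crawler, using only the standard library. It simply collects the text of every <p> element on a page. That the article body lives in <p> tags is an assumption; real news sites differ, so a parser written for the specific site is usually needed. ParagraphExtractor, extract_paragraphs and fetch_html are hypothetical names.

```python
from html.parser import HTMLParser
import urllib.request


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def _flush(self):
        # Join the collected pieces and normalise the whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # also covers a previous <p> that was never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the text of all <p> elements in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.paragraphs


def fetch_html(url, user_agent="Mozilla/5.0"):
    """Download a page as text (needs a network connection)."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage, with `article` being one dictionary from googlenews.result():
#   paragraphs = extract_paragraphs(fetch_html(article['link']))
```

The result will often include captions, menus or footers that also use <p>, so some filtering per site is to be expected.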