philipperemy / google-news-scraper

Licence: MIT license
Google News Scraper for languages like Japanese, Chinese... [VPN Support]

Programming Languages

python

Projects that are alternatives of or similar to google-news-scraper

Kanji Data Media
Japanese language data on kanji and radicals, media files, fonts and related resources from Kanji alive
Stars: ✭ 186 (+111.36%)
Mutual labels:  japanese, japanese-language
newspaperjs
News extraction and scraping. Article Parsing
Stars: ✭ 59 (-32.95%)
Mutual labels:  news, news-aggregator
Genki Study Resources
A collection of exercises for practicing what is taught in Genki: An Integrated Course in Elementary Japanese.
Stars: ✭ 232 (+163.64%)
Mutual labels:  japanese, japanese-language
Nihonoari-App
A little and minimalist Japanese Kana training
Stars: ✭ 66 (-25%)
Mutual labels:  japanese, japanese-language
nippon
Japanese N5-N2 grammar notes~ 🍻
Stars: ✭ 84 (-4.55%)
Mutual labels:  japanese, japanese-language
Topokanji
Topologically ordered lists of kanji for effective learning
Stars: ✭ 108 (+22.73%)
Mutual labels:  japanese, japanese-language
gnewsclient
An easy-to-use python client for Google News feeds.
Stars: ✭ 42 (-52.27%)
Mutual labels:  news, google-news
Yomichan
Japanese pop-up dictionary extension for Chrome and Firefox.
Stars: ✭ 464 (+427.27%)
Mutual labels:  japanese, japanese-language
jmdict-simplified
JMdict, JMnedict, Kanjidic, KRADFILE/RADKFILE in JSON format
Stars: ✭ 96 (+9.09%)
Mutual labels:  japanese, japanese-language
python-doc-ja
Python documentation Japanese translation project
Stars: ✭ 130 (+47.73%)
Mutual labels:  japanese, japanese-language
Languagepod101 Scraper
Python scraper for Language Pods such as Japanesepod101.com 👹 🗾 🍣 Compatible with Japanese, Chinese, French, German, Italian, Korean, Portuguese, Russian, Spanish and many more! ✨
Stars: ✭ 104 (+18.18%)
Mutual labels:  japanese, japanese-language
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+707.95%)
Mutual labels:  news, news-aggregator
The Tab Of Words
A minimal Chrome / Firefox extension to help you learn Japanese words in each new tab.
Stars: ✭ 94 (+6.82%)
Mutual labels:  japanese, japanese-language
Ichiran
Linguistic tools for texts in Japanese language
Stars: ✭ 120 (+36.36%)
Mutual labels:  japanese, japanese-language
Kagome
Self-contained Japanese Morphological Analyzer written in pure Go
Stars: ✭ 554 (+529.55%)
Mutual labels:  japanese, japanese-language
PressCenters.com
News aggregator for the press releases of the Bulgarian government sites written in ASP.NET Core
Stars: ✭ 91 (+3.41%)
Mutual labels:  news, news-aggregator
japanese-pitch-accent-resources
Trying to consolidate japanese phonetic, and in particular pitch accent resources into one list
Stars: ✭ 64 (-27.27%)
Mutual labels:  japanese, japanese-language
unofficial-jisho-api
Encapsulates the official Jisho.org API and also provides kanji, example, and stroke diagram search.
Stars: ✭ 88 (+0%)
Mutual labels:  japanese, japanese-language
Newspaper
News, full-text, and article metadata extraction in Python 3.
Stars: ✭ 11,545 (+13019.32%)
Mutual labels:  news, news-aggregator
Convert-Numbers-to-Japanese
Converts Arabic numerals, or 'western' style numbers, to a Japanese context.
Stars: ✭ 33 (-62.5%)
Mutual labels:  japanese, japanese-language

Google News Scraper - Japanese and Chinese supported

For English articles, Google provides an RSS feed that you can use directly.

Each scraped article has the following fields:

  • title: Title of the article
  • datetime: Publication date
  • content: Full content (text format) - best effort
  • link: URL where the article was published
  • keyword: Google News keyword used to find this article
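
The five fields above map naturally onto a small record type. The following is a minimal sketch, not the scraper's actual code: the `Article` class and its `from_dict` helper are hypothetical names introduced for illustration, and only the field names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """One scraped article; field names follow the list above."""

    title: str
    datetime: str  # publication date, kept as the string the scraper emits
    content: str   # full text, best effort
    link: str
    keyword: str

    @classmethod
    def from_dict(cls, raw: dict) -> "Article":
        # Hypothetical helper: raises KeyError if a field is missing.
        return cls(
            title=raw["title"],
            datetime=raw["datetime"],
            content=raw["content"],
            link=raw["link"],
            keyword=raw["keyword"],
        )
```

Freezing the dataclass keeps each record immutable once built, which is convenient when articles are collected into sets or used as dictionary keys.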

How many articles can I fetch with this scraper?

There is no hard upper bound, but expect on the order of 100,000 articles per day when scraping 24/7 with the VPN enabled.
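
To put that figure in perspective, the arithmetic below converts the daily estimate into a sustained rate. It only restates the estimate quoted above and is not a measured benchmark.

```python
ARTICLES_PER_DAY = 100_000       # estimate quoted above
SECONDS_PER_DAY = 24 * 60 * 60   # scraping around the clock

rate = ARTICLES_PER_DAY / SECONDS_PER_DAY
print(f"{rate:.2f} articles per second")  # prints "1.16 articles per second"
```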

How to get started?

git clone git@github.com:philipperemy/google-news-scraper.git && cd google-news-scraper
virtualenv -p python3 venv && source venv/bin/activate # optional but recommended!
pip install -r requirements.txt
python main_no_vpn.py --keywords hello,toto --language ja  # for VPN support, scroll down!
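
For illustration, flags such as `--keywords hello,toto --language ja` could be handled as shown below. This is a hedged sketch of one way to parse such a command line with `argparse`, not the actual contents of `main_no_vpn.py`. The `parse_args` helper and the default language are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical parser for the two flags used in the command above."""
    parser = argparse.ArgumentParser(description="Google News scraper (sketch)")
    # Comma-separated keywords are split into a list of search terms.
    parser.add_argument(
        "--keywords",
        required=True,
        type=lambda s: [k.strip() for k in s.split(",") if k.strip()],
    )
    # Language code such as "ja"; the default here is an assumption.
    parser.add_argument("--language", default="ja")
    return parser.parse_args(argv)


args = parse_args(["--keywords", "hello,toto", "--language", "ja"])
print(args.keywords, args.language)  # prints "['hello', 'toto'] ja"
```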

Output example

Article 1

{
    "content": "(本文中の野村証券 [...] 生命経済研の熊野英生氏は指摘。  記事の全文 \n保護主義を根拠とする円高説を信じ込むのは禁物であり、実際は米貿易赤字縮小と円安が進むかもしれないとBBHの村田雅志氏は指摘。  記事の全文 \n",
    "datetime": "2015/11/03",
    "keyword": "米国の銀行業務",
    "link": "http://jp.reuters.com/article/idJPL3N12Y5QX20151104",
    "title": "再送-インタビュー:運用高度化、PEやハイイールド債増やす=長門・ゆうちょ銀社長"
}

Article 2

{
    "content": "記事保存 有料会員の方のみご利用になれます。[...] 詳しくは、こちら 電子版トップ速報トップ アルゼンチン、ドル、通貨ペソ、外貨取引 来春の新入社員を募集 記者など4職種 【週末新紙面】宅配+電子版お試し実施中! 天気 プレスリリース検索 アカウント一覧 訂正・おわび",
    "datetime": "2015/12/17",
    "keyword": "アルゼンチン",
    "link": "http://www.nikkei.com/article/DGXLASGM18H1B_Y5A211C1EAF000/",
    "title": "アルゼンチンの通貨ペソ、大幅下落 対ドルで36%安"
}

NOTE: The content field is truncated in the examples above.
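
Records in this shape can be loaded with the standard library. The following is a minimal sketch, assuming each article is available as a JSON string like the examples above and that `datetime` always uses the `YYYY/MM/DD` form seen in the two samples. The `parse_article` helper is hypothetical.

```python
import json
from datetime import datetime


def parse_article(raw_json: str) -> dict:
    """Decode one article and convert its date string to a date object."""
    article = json.loads(raw_json)
    # Assumes the "YYYY/MM/DD" format seen in the sample output.
    article["datetime"] = datetime.strptime(article["datetime"], "%Y/%m/%d").date()
    return article


sample = '{"datetime": "2015/12/17", "keyword": "アルゼンチン"}'
parsed = parse_article(sample)
print(parsed["datetime"].isoformat())  # prints "2015-12-17"
```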

VPN

Scraping Google News usually results in a ban for a few hours. Using a VPN with dynamic IP fetching is a way to overcome this problem.

In my case, I subscribed to this VPN: https://www.expressvpn.com/.

I provide a Python binding for this VPN here: https://github.com/philipperemy/expressvpn-python.

Also make sure that:

Every time the script detects that Google has banned you, it asks the VPN for a fresh IP address and then resumes scraping.
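
The ban-and-resume behaviour described above can be sketched as a retry loop. This is an illustration only, not the scraper's actual code: `fetch`, `rotate_ip`, and the `BannedError` exception are hypothetical placeholders supplied by the caller, and no real ExpressVPN binding call is shown.

```python
from typing import Callable


class BannedError(Exception):
    """Hypothetical signal that Google has blocked the current IP."""


def fetch_with_rotation(
    fetch: Callable[[str], str],
    rotate_ip: Callable[[], None],
    url: str,
    max_rotations: int = 5,
) -> str:
    """Call fetch(url); on a ban, request a new IP and try again."""
    rotations = 0
    while True:
        try:
            return fetch(url)
        except BannedError:
            if rotations >= max_rotations:
                raise  # give up: still banned after the allowed rotations
            rotate_ip()  # e.g. ask the VPN client to reconnect
            rotations += 1
```

Capping the number of rotations is a design choice of this sketch: it stops the loop from running forever if every new IP is also banned.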

Questions/Answers

  • Why didn't you use the RSS feed provided by Google News? It does not exist for Japanese!
  • What is the best way to use this scraper? If you want to scrape a lot of data, I highly recommend subscribing to a VPN, preferably ExpressVPN (I implemented the VPN wrapper and its interaction with this scraper).