
flickz / newspaperjs

Licence: MIT license
News extraction and scraping. Article Parsing

Programming Languages

HTML
javascript

Projects that are alternatives of or similar to newspaperjs

newsemble
API for fetching data from news websites.
Stars: ✭ 42 (-28.81%)
Mutual labels:  scraper, news, webscraping
Newspaper
News, full-text, and article metadata extraction in Python 3. Advanced docs:
Stars: ✭ 11,545 (+19467.8%)
Mutual labels:  scraper, news, news-aggregator
TradeTheEvent
Implementation of "Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading." In Findings of ACL2021
Stars: ✭ 64 (+8.47%)
Mutual labels:  scraper, news
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+1105.08%)
Mutual labels:  news, news-aggregator
MalScraper
Scrape everything you can from MyAnimeList.net
Stars: ✭ 132 (+123.73%)
Mutual labels:  scraper, news
TrollHunter
Twitter Troll & Fake News Hunter - Crawls news websites and twitter to identify fake news
Stars: ✭ 38 (-35.59%)
Mutual labels:  scraper, news
Youtube Projects
This repository contains all the code I use in my YouTube tutorials.
Stars: ✭ 144 (+144.07%)
Mutual labels:  scraper, webscraping
ioweb
Web Scraping Framework
Stars: ✭ 31 (-47.46%)
Mutual labels:  webscraping, webcrawling
Mailinglistscraper
A python web scraper for public email lists.
Stars: ✭ 19 (-67.8%)
Mutual labels:  scraper, webscraping
PressCenters.com
News aggregator for the press releases of the Bulgarian government sites written in ASP.NET Core
Stars: ✭ 91 (+54.24%)
Mutual labels:  news, news-aggregator
robotstxt
robots.txt file parsing and checking for R
Stars: ✭ 65 (+10.17%)
Mutual labels:  scraper, webscraping
civic-scraper
Tools for downloading agendas, minutes and other documents produced by local government
Stars: ✭ 21 (-64.41%)
Mutual labels:  scraper, news
Django Dynamic Scraper
Creating Scrapy scrapers via the Django admin interface
Stars: ✭ 1,024 (+1635.59%)
Mutual labels:  scraper, webscraping
Polite
Be nice on the web
Stars: ✭ 253 (+328.81%)
Mutual labels:  scraper, webscraping
Huginn
Create agents that monitor and act on your behalf. Your agents are standing by!
Stars: ✭ 33,694 (+57008.47%)
Mutual labels:  scraper, webscraping
BookingScraper
🌎 🏨 Scrape Booking.com 🏨 🌎
Stars: ✭ 68 (+15.25%)
Mutual labels:  scraper, webscraping
extractnet
A Dragnet that also extract author, headline, date, keywords from context
Stars: ✭ 52 (-11.86%)
Mutual labels:  news, webscraping
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+6810.17%)
Mutual labels:  scraper, webscraping
Xidel
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
Stars: ✭ 335 (+467.8%)
Mutual labels:  scraper, webscraping
google-news-scraper
Google News Scraper for languages like Japanese, Chinese... [VPN Support]
Stars: ✭ 88 (+49.15%)
Mutual labels:  news, news-aggregator

Newspaperjs

News extraction and scraping. Maximizing the power of Request and Cheerio.

Inspired by "Codelucas - Python Newspaper lib"

Features

  • News URL identification
  • News category extraction
  • Text extraction from HTML
  • Top image extraction from HTML
  • Description extraction from HTML
  • Keyword extraction from HTML
  • Author extraction from HTML

Installation

npm install newspaperjs

Using API

const Build = require('newspaperjs').Build;
const Article = require('newspaperjs').Article;

Building a news source

Building a source extracts its category and article URLs using two simple methods.

.getCategoriesUrl(url{string}, cateOfInterest[array])

Gets all category URLs. When cateOfInterest is specified, only the links for those categories are extracted, if found. Returns a Promise that resolves to an array of category URLs.

Build.getCategoriesUrl('https://www.nytimes.com', ['politics', 'sports', 'technology']).then(categories => {
    console.log(categories);
}).catch(reason => {
    console.log(reason);
});
// [
//   'https://www.nytimes.com/pages/politics',
//   'https://www.nytimes.com/pages/sports',
//   'https://www.nytimes.com/pages/technology'
// ]
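
Because links are only returned for categories that are actually found, it can be useful to check which requested categories came back empty. The helper below is a minimal sketch and not part of newspaperjs: findMissingCategories is a hypothetical name, and it assumes each category name appears as a path segment of its URL, as in the sample output above.

```javascript
// Hypothetical helper (not part of newspaperjs).
// Assumption: a category URL contains the category name as one of its
// path segments, e.g. https://www.nytimes.com/pages/politics
function findMissingCategories(requested, categoryUrls) {
  // Split every URL path into lowercase segments once, up front.
  const segmentsPerUrl = categoryUrls.map((url) =>
    new URL(url).pathname.toLowerCase().split('/')
  );
  // Keep only the requested names that match no URL at all.
  return requested.filter((name) => {
    const needle = name.toLowerCase();
    return !segmentsPerUrl.some((segments) => segments.includes(needle));
  });
}

// Usage idea: call it inside the .then() handler of Build.getCategoriesUrl,
// passing the same cateOfInterest array and the resolved categories.
```

If the site names its sections differently from the requested categories, the path-segment assumption will not hold and the matching rule would need to be adapted.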

.getArticlesUrl(categoriesUrl{string})

Gets all article URLs from a category URL. Returns a Promise that resolves to an array of article URLs.

Build.getArticlesUrl('https://www.nytimes.com/pages/politics').then(result => {
    console.log(result);
}).catch(reason => {
    console.log(reason);
});
// [
//   'https://www.nytimes.com/2017/06/12/us/politics/trump-travel-ban-court-of-appeals.html',
//   'https://www.nytimes.com/aponline/2017/06/12/us/politics/ap-us-trump-lawsuit-the-latest.html',
//   'https://www.nytimes.com/aponline/2017/06/12/us/politics/ap-us-supreme-court-biotech-drugs.html',
//   'https://www.nytimes.com/2017/06/12/us/trump-lawsuit-private-businesses.html',
//   'https://www.nytimes.com/2017/06/12/us/politics/ivanka-trump-comey-donald-trump-fox-and-friends.html',
//   'https://www.nytimes.com/2017/06/12/us/politics/unions-come-into-the-justices-cross-hairs-again.html',
//   'https://www.nytimes.com/2017/06/11/us/politics/ducks-washington-reflecting-pool-unity.html',
//   'https://www.nytimes.com/2017/06/11/us/politics/preet-bharara-trump-contacts.html',
//   'https://www.nytimes.com/2017/06/11/us/politics/jeff-sessions-russia-trump-attorney-general-senate.html',
//   'https://www.nytimes.com/2017/06/11/us/politics/defense-secretary-jim-mattis-trump.html',
// ...]
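
The two methods can be chained to gather article URLs for several categories in one pass. The following is a sketch under stated assumptions rather than a documented newspaperjs feature: collectArticleUrls is a hypothetical helper, and it takes the Build object as a parameter so the same code works with a stub in tests. It relies only on the two Promise-returning methods described above.

```javascript
// Hypothetical helper (not part of newspaperjs).
// `build` is expected to expose getCategoriesUrl(url, cateOfInterest) and
// getArticlesUrl(categoryUrl), both returning Promises of string arrays.
async function collectArticleUrls(build, siteUrl, cateOfInterest) {
  // Step 1: resolve the category URLs for the site.
  const categoryUrls = await build.getCategoriesUrl(siteUrl, cateOfInterest);

  // Step 2: fetch the article URLs of every category in parallel.
  const perCategory = await Promise.all(
    categoryUrls.map((categoryUrl) => build.getArticlesUrl(categoryUrl))
  );

  // Step 3: flatten and drop duplicates (an article may appear in
  // more than one category). Set keeps first-seen order.
  return [...new Set(perCategory.flat())];
}

// Usage idea (requires newspaperjs to be installed):
//   const Build = require('newspaperjs').Build;
//   collectArticleUrls(Build, 'https://www.nytimes.com', ['politics'])
//     .then(console.log)
//     .catch(console.log);
```

Promise.all rejects as soon as one category fails, so a single unreachable category aborts the whole collection; that trade-off keeps the sketch short.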

Extracting and Parsing a News Article

Extract a news article from the given article URL and parse its content.

.Article(url{string})

Extracts and parses a news article. Returns a Promise that resolves to an object with the article's title, text, topImage, date, author, description and keywords.

Article('https://www.nytimes.com/2017/06/10/us/politics/sessions-senate-russia-election.html')
.then(result => {
    console.log(result);
}).catch(reason => {
    console.log(reason);
});
// Logged result:
{
    title: 'Sessions Will Testify in Senate on Russian Meddling in Election',

    text: " AdvertisementBy CHARLIE SAVAGEJUNE 10, 2017\nWASHINGTON — Attorney General Jeff Sessions told Congress on Saturday that he would testify before the Senate Intelligence Committee on Tuesday about issues related to Russia’s interference in the 2016 election.  Mr. Sessions had been scheduled to testify before other committees about the Justice Department’s budget that day, but he will instead appear before the intelligence panel. Mr.Sessions said he would send Rod J. Rosenstein, the deputy attorney general, to testify about the department’s budget before the House and Senate appropriations panels.... ",

    topImage:'https://static01.nyt.com/images/2017/06/11/us/11dcSESSIONS/11dcSESSIONS-facebookJumbo.jpg',

    date: '2017-06-10T20:08:09-04:00',

    author: 'Charlie Savage',

    description: 'Instead of discussing the Justice Department budget, Attorney General Jeff Sessions will face questions from members of Congress who have access to intelligence materials on the Russia inquiry.',

    keywords: [ 'Russian Interference in 2016 US Elections and Ties to Trump Associates', 'Sessions  Jefferson B III', 
    'Justice Department', 
    'United States Politics and Government', 'Attorneys General', 
    'Senate Committee on Intelligence','Trump  Donald J', 'Comey  James B' ]
}
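
When parsing many article URLs, one failed download should not discard the others. The sketch below shows one way to do that with Promise.allSettled (Node.js 12.9 or later). It is an illustration, not part of newspaperjs: parseArticles is a hypothetical helper, and the parser function is passed in as a parameter, on the assumption that it behaves like Article above (takes a URL, returns a Promise of the parsed object).

```javascript
// Hypothetical helper (not part of newspaperjs).
// `article` is any function that takes a URL and returns a Promise of a
// parsed article object, such as require('newspaperjs').Article.
async function parseArticles(article, urls) {
  // The async wrapper turns synchronous throws into rejections as well.
  const outcomes = await Promise.allSettled(
    urls.map(async (url) => article(url))
  );

  const parsed = [];
  const failed = [];
  outcomes.forEach((outcome, index) => {
    if (outcome.status === 'fulfilled') {
      parsed.push(outcome.value);
    } else {
      // Keep the URL next to the error so it can be retried or logged.
      failed.push({ url: urls[index], reason: outcome.reason });
    }
  });
  return { parsed, failed };
}

// Usage idea (requires newspaperjs to be installed):
//   const Article = require('newspaperjs').Article;
//   parseArticles(Article, articleUrls).then(({ parsed, failed }) => {
//     console.log(parsed.map((a) => a.title), failed.length);
//   });
```

All requests start at once in this sketch; for long URL lists, batching or rate limiting would be a sensible addition.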

Author

Authored and maintained by Oluwaseun Omoyajowo. Like to get in touch?

Email: [email protected]

Twitter: @oluwaseunOmoya
