ndaidong / Article Parser

License: MIT
Extract the main article from a given URL with Node.js

article-parser

Extract the main article, main image, and metadata from a URL.

Demo

See the Screenshots section for more info.

Usage

npm install article-parser

Then:

const {
  extract
} = require('article-parser');

const url = 'https://goo.gl/MV8Tkh';

extract(url).then((article) => {
  console.log(article);
}).catch((err) => {
  console.log(err);
});

APIs

Since v4, article-parser focuses only on its main mission: extracting the main readable content from web pages such as blog posts or news entries. It can still retrieve other kinds of content, such as YouTube videos and SoundCloud media, but these are secondary additions.

extract(String url | String html)

Extracts data from the specified URL or from the full HTML content of a page. Returns a Promise that resolves to the article object.

Here is how to use it with async/await:

import {
  extract
} from 'article-parser';

const getArticle = async (url) => {
  try {
    const article = await extract(url);
    return article;
  } catch (err) {
    console.trace(err);
  }
};
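
Because extract returns a Promise, several pages can be processed together. The following is a minimal sketch, not part of article-parser: extractAll is a hypothetical helper, and extractFn is assumed to behave like extract (one string in, a Promise out). Failed inputs are collected instead of aborting the whole batch.

```javascript
// Hypothetical helper (not provided by article-parser).
// `extractFn` is assumed to behave like `extract`: it takes one string
// (a URL or HTML) and returns a Promise.
const extractAll = async (extractFn, inputs) => {
  // Wrapping the call in an async arrow turns synchronous throws
  // into rejections, so allSettled sees every failure.
  const settled = await Promise.allSettled(
    inputs.map(async (input) => extractFn(input))
  );

  const articles = [];
  const failures = [];
  settled.forEach((result, index) => {
    if (result.status === 'fulfilled') {
      articles.push(result.value);
    } else {
      failures.push({ input: inputs[index], error: result.reason });
    }
  });
  return { articles, failures };
};
```

In real use, the extract function imported from article-parser would be passed as the first argument, for example extractAll(extract, listOfUrls).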

Compared to v3, the structure of the article object has also changed. It now looks like this:

{
  "url": URI String,
  "title": String,
  "description": String,
  "image": URI String,
  "author": String,
  "content": HTML String,
  "published": Date String,
  "source": String, // original publisher
  "links": Array, // list of alternative links
  "ttr": Number, // time to read in seconds, 0 = unknown
}
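
Since ttr is given in seconds and 0 means unknown, a consumer will usually want to convert it before display. A small sketch, assuming a hypothetical helper name formatTimeToRead and a label format chosen only for illustration:

```javascript
// Hypothetical display helper (not provided by article-parser).
// Converts `ttr` (seconds) to a short label; 0 or invalid values
// are treated as "unknown", matching the comment in the structure above.
const formatTimeToRead = (ttr) => {
  if (!Number.isFinite(ttr) || ttr <= 0) {
    return 'unknown';
  }
  // Round to the nearest minute, but never show "0 min read".
  const minutes = Math.max(1, Math.round(ttr / 60));
  return `${minutes} min read`;
};
```

For example, formatTimeToRead(article.ttr) could be used when rendering a list of extracted articles.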

Configuration methods

This library also provides methods for customizing the default settings. Leave them unchanged unless you have a specific reason to change them.

  • setParserOptions(Object parserOptions)
  • getParserOptions()
  • setNodeFetchOptions(Object nodeFetchOptions)
  • getNodeFetchOptions()
  • setSanitizeHtmlOptions(Object sanitizeHtmlOptions)
  • getSanitizeHtmlOptions()

The default properties and values are:

Object parserOptions:

{
  wordsPerMinute: 300,
  urlsCompareAlgorithm: 'levenshtein',
}

Read the string-comparison docs for more info about urlsCompareAlgorithm.
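
The name wordsPerMinute suggests that it feeds the ttr estimate. The sketch below shows one plausible way such a value could be computed; the formula, the whitespace-based word count, and the function name estimateTimeToRead are assumptions for illustration, not the library's actual implementation.

```javascript
// Illustrative only: one way a reading time in seconds could be derived
// from a words-per-minute setting. Not article-parser's own code.
const estimateTimeToRead = (text, wordsPerMinute = 300) => {
  // Count whitespace-separated tokens as words.
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  if (words === 0) {
    return 0; // nothing to read, i.e. "unknown"
  }
  return Math.ceil((words / wordsPerMinute) * 60);
};
```

Under this assumption, lowering wordsPerMinute yields a longer estimate for the same text.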

Object nodeFetchOptions:

{
  headers: {
    'user-agent': 'article-parser/4.0.0',
  },
  timeout: 30000,
  redirect: 'follow',
  compress: true,
  agent: false,
}

Read the node-fetch docs for more info.
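
This README does not say whether setNodeFetchOptions merges the given object with the defaults or replaces them. A cautious approach is to build a complete options object first. The sketch below copies the defaults listed above; buildNodeFetchOptions is a hypothetical helper, not part of the library.

```javascript
// Defaults copied from the list above.
const DEFAULT_NODE_FETCH_OPTIONS = {
  headers: {
    'user-agent': 'article-parser/4.0.0',
  },
  timeout: 30000,
  redirect: 'follow',
  compress: true,
  agent: false,
};

// Hypothetical helper: returns a full options object with overrides applied.
// Headers are merged separately so that adding one header does not drop
// the default user-agent.
const buildNodeFetchOptions = (overrides = {}) => ({
  ...DEFAULT_NODE_FETCH_OPTIONS,
  ...overrides,
  headers: {
    ...DEFAULT_NODE_FETCH_OPTIONS.headers,
    ...(overrides.headers || {}),
  },
});
```

The result, for instance buildNodeFetchOptions({ timeout: 5000 }), could then be passed to setNodeFetchOptions.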

Object sanitizeHtmlOptions:

{
  allowedTags: [
    'h1', 'h2', 'h3', 'h4', 'h5',
    'u', 'b', 'i', 'em', 'strong',
    'div', 'span', 'p', 'article', 'blockquote', 'section',
    'pre', 'code',
    'ul', 'ol', 'li', 'dd', 'dl',
    'table', 'th', 'tr', 'td', 'thead', 'tbody', 'tfoot',
    'label',
    'fieldset', 'legend',
    'img', 'picture',
    'br', 'hr',
    'a',
  ],
  allowedAttributes: {
    a: ['href'],
    img: ['src', 'alt'],
  },
}

Read the sanitize-html docs for more info.
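
To keep more markup in the extracted content, the allowed tags and attributes can be extended rather than rewritten by hand. A minimal sketch, assuming a hypothetical helper allowMoreTags that works on any object shaped like sanitizeHtmlOptions:

```javascript
// Hypothetical helper (not provided by article-parser).
// Returns a new options object with extra tags and attributes added;
// the input object is left untouched.
const allowMoreTags = (options, extraTags = [], extraAttributes = {}) => ({
  ...options,
  // A Set removes duplicates while keeping the original order.
  allowedTags: [...new Set([...(options.allowedTags || []), ...extraTags])],
  allowedAttributes: {
    ...(options.allowedAttributes || {}),
    ...extraAttributes,
  },
});
```

Assuming getSanitizeHtmlOptions returns the current settings, as its name suggests, one possible use is setSanitizeHtmlOptions(allowMoreTags(getSanitizeHtmlOptions(), ['figure', 'figcaption'])).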

Screenshots

  • Article Parser demo:

Screenshot_2019-11-29_14-21-30.png

  • Example FaaS with Google Cloud Functions:

Screenshot_2019-11-29_14-38-32.png

Test

git clone https://github.com/ndaidong/article-parser.git
cd article-parser
npm install  # or `yarn install` or `pnpm install`
npm test

License

The MIT License (MIT)
