
croqaz / Clean Mark

Licence: MIT
Convert an article into a clean text



➹ Clean-mark

Convert a blog article into a clean Markdown text file.


Example

For example, this article:

Original article

Is converted into this text file:

Clean text

Usage

$ clean-mark "http://some-website.com/fancy-article"

The output file is named automatically from the URL path name. In the example above, the file will be named fancy-article.md.
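
As a rough illustration of naming a file from the URL path name, here is a minimal sketch. The helper `defaultFileName` is hypothetical and is not taken from clean-mark's source. Using the last path segment and falling back to "index" for an empty path are both assumptions.

```javascript
// Hypothetical helper: derive a default output name from the URL path.
// It mirrors the behaviour described above; it is not clean-mark's own code.
function defaultFileName(articleUrl, extension = 'md') {
  const { pathname } = new URL(articleUrl);
  // Drop empty segments so a trailing slash does not produce an empty name.
  const segments = pathname.split('/').filter(Boolean);
  // Assumption: use the last segment, or "index" when the path is empty.
  const base = segments.length > 0 ? segments[segments.length - 1] : 'index';
  return `${base}.${extension}`;
}

console.log(defaultFileName('http://some-website.com/fancy-article'));
// prints "fancy-article.md"
```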

The file type can be specified:

$ clean-mark "http://some-website.com/fancy-article" -t html

The available types are: HTML, TEXT and Markdown.

The output file and path can also be specified:

$ clean-mark "http://some-website.com/fancy-article" -o /tmp/article

In that case the output will be /tmp/article.md. The extension is added automatically.
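
The way an extension could be appended to the chosen output path can be sketched as follows. This is only an illustration: the function `outputPath` and the type-to-extension table are hypothetical. Only the `.md` extension and the `html` type value appear in the examples above. The `text` and `markdown` type values, and the `.html` and `.txt` extensions, are assumptions.

```javascript
// Assumed mapping from the -t value to a file extension.
// "md" is shown in the examples above; "html" and "txt" are guesses.
const EXTENSIONS = new Map([
  ['markdown', 'md'],
  ['html', 'html'],
  ['text', 'txt'],
]);

// Hypothetical helper: append the extension for the chosen type.
function outputPath(basePath, type = 'markdown') {
  const ext = EXTENSIONS.get(type.toLowerCase());
  if (!ext) {
    throw new Error(`Unknown type: ${type}`);
  }
  return `${basePath}.${ext}`;
}

console.log(outputPath('/tmp/article'));
// prints "/tmp/article.md"
```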

Installation

Simply install with npm:

$ npm install clean-mark --global

Why?

  • to save interesting articles offline, in a highly readable text format
  • it's easy to read on a tablet, or a Kindle (as it is, or exported to PDF)
  • Markdown is easy to export into different formats
  • for offline text analysis of multiple articles, using machine learning / AI

How?

Implementation steps:

  1. Downloads the content of a web page
  2. Meta-scrapes the page details (title, author, date, etc.)
  3. Sanitizes the ugly HTML
  4. Minifies the sanitized HTML
  5. Converts the result into clean Markdown text

This project depends on the A-Extractor project, a database of expressions used for extracting content from blogs and articles.
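
To make the five steps concrete, here is a toy sketch of the pipeline. It is not clean-mark's implementation. The real project relies on A-Extractor expressions, while this example uses a few fixed regular expressions. All function names are hypothetical. The download step assumes the global `fetch` available in Node.js 18 and later, and it is defined but never called here.

```javascript
// Step 1: download the page (assumes the global fetch of Node.js 18+).
async function download(url) {
  const response = await fetch(url);
  return response.text();
}

// Step 2: read a few details from the page head.
function scrapeMeta(html) {
  const title = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const author = /<meta\s+name=["']author["']\s+content=["']([^"']*)["']/i.exec(html);
  return {
    title: title ? title[1].trim() : null,
    author: author ? author[1].trim() : null,
  };
}

// Step 3: drop elements that do not belong in the article text.
function sanitize(html) {
  return html
    .replace(/<(head|script|style|nav|footer)\b[\s\S]*?<\/\1>/gi, '')
    .replace(/<!--[\s\S]*?-->/g, '');
}

// Step 4: collapse runs of whitespace and remove gaps between tags.
function minify(html) {
  return html.replace(/\s+/g, ' ').replace(/>\s+</g, '><').trim();
}

// Step 5: convert headings and paragraphs, then strip any remaining tags.
function toMarkdown(html) {
  return html
    .replace(/<h([1-6])[^>]*>(.*?)<\/h\1>/gi,
      (_, level, text) => `${'#'.repeat(Number(level))} ${text}\n\n`)
    .replace(/<p[^>]*>(.*?)<\/p>/gi, '$1\n\n')
    .replace(/<[^>]+>/g, '')
    .trim();
}

// Steps 2 to 5 chained on an HTML string that was already downloaded.
function cleanPage(html) {
  const meta = scrapeMeta(html);
  const body = toMarkdown(minify(sanitize(html)));
  return { meta, body };
}

const sample = '<html><head><title> Fancy Article </title>' +
  '<meta name="author" content="Jane"></head><body><nav>Menu</nav>' +
  '<h1>Fancy Article</h1>\n  <p>First   paragraph.</p>' +
  '<script>track()</script><p>Second.</p></body></html>';

console.log(cleanPage(sample));
```

For the sample page, `meta` holds the title "Fancy Article" and the author "Jane", and `body` is a `# Fancy Article` heading followed by the two paragraphs, without the menu or the script. Regex-based HTML handling breaks easily on real pages, so this is only meant to show the order of the steps.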

Vision

The goals of the project are:

  1. Good text extraction
  2. Keeping some extra, useless text is preferred over cutting part of the actual article by mistake
  3. Extracting media (images, videos, audio) is not that important
  4. Extraction speed is not that important

Contributing

Clean-mark was tested on all major news sites. On some websites, text or links are cut from the article. In that case, you have to edit the resulting text manually. Please also raise an issue on A-Extractor with the link that doesn't work, and we'll add it to the database so that next time the text is extracted correctly.

Also, see how to contribute.

Similar tools


License

MIT © Cristi Constantin.
