miso-belica / Justext

Licence: bsd-2-clause
Heuristic-based boilerplate removal tool


Projects that are alternatives of or similar to Justext

htmlparser
Delphi HTML parser (code adapted from the original HtmlParser by wr960204)
Stars: ✭ 65 (-84.45%)
Mutual labels:  html-parser
ioBroker.parser
Parse web-site or file and extract data from it.
Stars: ✭ 14 (-96.65%)
Mutual labels:  html-parser
Jsoupxpath
A pure-Java HTML parser supporting the W3C XPath 1.0 standard syntax, based on Jsoup and Antlr4. Maybe it is the best in Java, ha ha. Just try it.
Stars: ✭ 331 (-20.81%)
Mutual labels:  html-parser
html-parser
Html Parser - Html to Pug, Jinja2, Blade Converter | AppSeed
Stars: ✭ 40 (-90.43%)
Mutual labels:  html-parser
sherpa_41
Simple browser engine.
Stars: ✭ 31 (-92.58%)
Mutual labels:  html-parser
ocr
Simple app to extract text from pictures using Tesseract
Stars: ✭ 98 (-76.56%)
Mutual labels:  text-extraction
html-parser
Simple HTML to JSON parser use Regexp and String.indexOf
Stars: ✭ 34 (-91.87%)
Mutual labels:  html-parser
Jodd
Jodd! Lightweight. Java. Zero dependencies. Use what you like.
Stars: ✭ 3,616 (+765.07%)
Mutual labels:  html-parser
any-text
Get text content from any file
Stars: ✭ 19 (-95.45%)
Mutual labels:  text-extraction
Hquery.php
An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
Stars: ✭ 295 (-29.43%)
Mutual labels:  html-parser
Aris
Aris - A fast and powerful tool to write HTML in JS easily. Includes syntax highlighting, templates, SVG, CSS autofixing, debugger support and more...
Stars: ✭ 61 (-85.41%)
Mutual labels:  html-parser
wagtail_textract
Text extraction for Wagtail document search
Stars: ✭ 27 (-93.54%)
Mutual labels:  text-extraction
modest_ex
Elixir library to do pipeable transformations on html strings (with CSS selectors)
Stars: ✭ 31 (-92.58%)
Mutual labels:  html-parser
HtmlMonkey
Lightweight HTML/XML parser written in C#.
Stars: ✭ 37 (-91.15%)
Mutual labels:  html-parser
Htmlquery
htmlquery is golang XPath package for HTML query.
Stars: ✭ 338 (-19.14%)
Mutual labels:  html-parser
AdvancedHTMLParser
Fast Indexed python HTML parser which builds a DOM node tree, providing common getElementsBy* functions for scraping, testing, modification, and formatting. Also XPath.
Stars: ✭ 90 (-78.47%)
Mutual labels:  html-parser
html2any
🌀 parse and convert html string to anything
Stars: ✭ 43 (-89.71%)
Mutual labels:  html-parser
Nlp
[UNMAINTAINED] Extract values from strings and fill your structs with nlp.
Stars: ✭ 367 (-12.2%)
Mutual labels:  text-extraction
Pdftools
Text Extraction, Rendering and Converting of PDF Documents
Stars: ✭ 349 (-16.51%)
Mutual labels:  text-extraction
Htmlparser2
The fast & forgiving HTML and XML parser
Stars: ✭ 3,299 (+689.23%)
Mutual labels:  html-parser

.. _jusText: http://code.google.com/p/justext/
.. _Python: http://www.python.org/
.. _lxml: http://lxml.de/

jusText

.. image:: https://api.travis-ci.org/miso-belica/jusText.png?branch=master
   :target: https://travis-ci.org/miso-belica/jusText

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is `designed <doc/algorithm.rst>`_ to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora. You can `try it online <http://nlp.fi.muni.cz/projects/justext/>`_.
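
To give a feel for why "text containing full sentences" is a useful signal, here is a minimal sketch of a stopword-density check. It is an illustration only and not the jusText implementation: the tiny ``STOPWORDS`` set, the function names, and both thresholds are assumptions made for this example, and the algorithm described in the linked document is more involved.

```python
# Simplified illustration only; NOT the jusText implementation.
# STOPWORDS, the function names, and the thresholds are hypothetical.
STOPWORDS = frozenset({
    "the", "a", "an", "and", "of", "to", "in", "is", "it",
    "that", "for", "on", "with", "this", "are", "was",
})


def stopword_density(text, stopwords=STOPWORDS):
    """Return the fraction of whitespace-separated words that are stopwords."""
    words = text.lower().split()
    if not words:
        return 0.0
    # Strip common punctuation so that "page," still matches "page".
    hits = sum(1 for word in words if word.strip(".,;:!?\"'()") in stopwords)
    return hits / len(words)


def looks_like_content(text, min_length=70, min_density=0.3):
    """Guess whether a block of text reads like full sentences.

    Short blocks and blocks with few function words (typical of menus
    and link lists) are treated as boilerplate.
    """
    return len(text) >= min_length and stopword_density(text) >= min_density
```

Running prose contains many function words, while navigation menus and tag lists contain almost none, so even this crude check separates the two in simple cases.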

This is a fork of the original (currently unmaintained) jusText_ code hosted on Google Code.

Adaptations of the algorithm to other languages:

  • `C++ <https://github.com/endredy/jusText>`_
  • `Go <https://github.com/JalfResi/justext>`_
  • `Java <https://github.com/wizenoze/justext-java>`_

Some libraries using jusText:

  • `chirp <https://github.com/9b/chirp>`_
  • `lazynlp <https://github.com/chiphuyen/lazynlp>`_
  • `off-topic-memento-toolkit <https://github.com/oduwsdl/off-topic-memento-toolkit>`_
  • `pears <https://github.com/PeARSearch/PeARS-orchard>`_
  • `readability calculator <https://github.com/joaopalotti/readability_calculator>`_
  • `sky <https://github.com/kootenpv/sky>`_

Some currently (Jan 2020) maintained alternatives:

  • `dragnet <https://github.com/dragnet-org/dragnet>`_
  • `html2text <https://github.com/Alir3z4/html2text>`_
  • `inscriptis <https://github.com/weblyzard/inscriptis>`_
  • `newspaper <https://github.com/codelucas/newspaper>`_
  • `python-readability <https://github.com/buriy/python-readability>`_
  • `trafilatura <https://github.com/adbar/trafilatura>`_

Installation

Make sure you have Python_ 2.7+/3.4+ and `pip <https://pip.pypa.io/en/stable/>`_ (`Windows <http://docs.python-guide.org/en/latest/starting/install/win/>`_, `Linux <http://docs.python-guide.org/en/latest/starting/install/linux/>`_) installed. Then run:

.. code-block:: bash

   $ [sudo] pip install justext

Dependencies

::

   lxml (version depends on your Python version)

Usage

.. code-block:: bash

   $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
   $ python -m justext -s English -o plain_text.txt english_page.html
   $ python -m justext --help # for more info

Python API

.. code-block:: python

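   import requests
   import justext

   # Download the page; response.content holds the raw bytes.
   response = requests.get("http://planet.python.org/")
   # Split the page into paragraphs, classified with the English stoplist.
   paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
   for paragraph in paragraphs:
       if not paragraph.is_boilerplate:
           # print() with parentheses works on both Python 2.7 and 3.
           print(paragraph.text)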
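
When a single cleaned string is needed rather than printed lines, the non-boilerplate paragraphs can be joined. The sketch below assumes only the two attributes used in the example above (``text`` and ``is_boilerplate``); the helper name ``join_content`` and the ``FakeParagraph`` stand-in are hypothetical and exist only so the snippet runs without a network connection or the library itself.

```python
from collections import namedtuple


def join_content(paragraphs, separator="\n\n"):
    """Join the text of all paragraphs not flagged as boilerplate."""
    return separator.join(
        paragraph.text for paragraph in paragraphs if not paragraph.is_boilerplate
    )


# Hypothetical stand-in mimicking the two attributes used above.
FakeParagraph = namedtuple("FakeParagraph", ["text", "is_boilerplate"])

sample = [
    FakeParagraph("Home | About | Contact", True),
    FakeParagraph("First paragraph of the article.", False),
    FakeParagraph("Second paragraph of the article.", False),
]
cleaned = join_content(sample)
```

With the library installed, the list returned by ``justext.justext(...)`` could be passed to ``join_content`` in place of ``sample``.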

Testing

Run the tests with:

.. code-block:: bash

   $ py.test-2.7 && py.test-3.4 && py.test-3.5 && py.test-3.6 && py.test-3.7 && py.test-3.8

Acknowledgements

.. _Natural Language Processing Centre: http://nlp.fi.muni.cz/en/nlpc
.. _Masaryk University in Brno: http://nlp.fi.muni.cz/en
.. _PRESEMT: http://presemt.eu/
.. _Lexical Computing Ltd.: http://lexicalcomputing.com/
.. _PhD research: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf

This software has been developed at the `Natural Language Processing Centre`_ of `Masaryk University in Brno`_ with financial support from PRESEMT_ and `Lexical Computing Ltd.`_ It also relates to the `PhD research`_ of Jan Pomikálek.
