nikitautiu / learnhtml

License: Apache-2.0
Web content extraction using machine learning

LearnHtml

An HTML web content extraction library that relies mostly on DOM features, supplemented by some textual features. It achieves a tag-level F1-score of 0.96 on the Dragnet dataset.
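To make the metric concrete, here is a minimal sketch of how a tag-level F1-score can be computed, assuming each tag in a page carries a binary label (1 for content, 0 for boilerplate) and that the score is the usual F1 of the content class. The function name `tag_level_f1` and the sample labels are hypothetical and not part of this library; the README does not say exactly how the reported score was computed.

```python
def tag_level_f1(y_true, y_pred):
    """F1 of the positive ("content") class over per-tag binary labels.

    Hypothetical helper for illustration; not part of learnhtml.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # content, predicted content
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # boilerplate, predicted content
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # content, predicted boilerplate

    if tp == 0:
        # No true positives: precision or recall is zero or undefined.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for six tags of one page (assumption, for illustration only).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = tag_level_f1(y_true, y_pred)
print(round(score, 3))
```

Because the score is computed over tags rather than over characters or tokens, a long paragraph and a short caption each count as one example.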

Requirements

First, install the dependencies. For the binary dependencies:

sudo apt-get install recode libxml2-dev libxslt1-dev unzip

Python dependencies:

pip install -r requirements.txt

Build the project and install it locally:

pip install -e .

Running the scripts

./learnhtml/cli/prepare_data.sh <<WHERE_TO_DOWNLOAD_DATA>> <<NUMBER_OF_WORKERS>>

Copyright (C) 2018 Nichita Uțiu
