currentsapi / extractnet

Licence: MIT license
A Dragnet that also extracts the author, headline, date, and keywords from context

ExtractNet

Based on the popular content extraction package Dragnet, ExtractNet extends the machine learning approach to extract other attributes, such as the date, author, and keywords, from news articles.

Example code:

Install the latest released version with:

pip install extractnet

Then pass the downloaded HTML to the extractor to get the content and other metadata:

import requests
from extractnet import Extractor

raw_html = requests.get('https://currentsapi.services/en/blog/2019/03/27/python-microframework-benchmark/.html').text
results = Extractor().extract(raw_html)

Why not just use an existing rule-based extraction method?

We found that some webpages do not provide the real author name and simply populate the author tag with a default value.

For example, ltn.com.tw and udn.com always populate the same author value for every news article, while the real author can only be found within the content.
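
To make the failure mode concrete, here is a minimal sketch of a rule-based extractor that only reads the author meta tag. It uses only the Python standard library; the `MetaAuthorParser` class, the `rule_based_author` helper, and the two sample pages are hypothetical illustrations, not part of ExtractNet or taken from the sites named above.

```python
from html.parser import HTMLParser


class MetaAuthorParser(HTMLParser):
    """Collect the content of every <meta name="author"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr_map.get("name") or "").lower() == "author":
            self.authors.append(attr_map.get("content") or "")


def rule_based_author(raw_html):
    """Return the first meta author value, or None if the tag is absent."""
    parser = MetaAuthorParser()
    parser.feed(raw_html)
    return parser.authors[0] if parser.authors else None


# Two made-up articles from the same site: the meta tag holds a site-wide
# default, while the real byline only appears in the visible text.
page_a = (
    '<html><head><meta name="author" content="Example News"></head>'
    "<body><p>By Alice Chen</p><p>Story text.</p></body></html>"
)
page_b = (
    '<html><head><meta name="author" content="Example News"></head>'
    "<body><p>By Bob Lin</p><p>Another story.</p></body></html>"
)

print(rule_based_author(page_a))
print(rule_based_author(page_b))
```

Both calls return the same default value, so the rule alone cannot tell the two bylines apart; the real names sit in the visible body text.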

Our machine-learning-first approach extracts the correct fields, just like a human reading a website

ExtractNet uses a machine learning approach to extract the relevant data from the visible sections of the webpage, just as a human would.

ExtractNet pipeline

What ExtractNet is and isn't

  • ExtractNet is a platform for extracting any interesting attribute from any webpage; it is not limited to content-based articles.

  • The core of ExtractNet aims to convert unstructured webpages into structured data without relying on hand-crafted rules.

  • ExtractNet does not support boilerplate content extraction.

  • ExtractNet lets users add custom pipelines that return additional data through a list of callback functions.


Performance

Results of the body extraction evaluation:

We use the same body extraction benchmark as article-extraction-benchmark.

Model       | Precision     | Recall        | F1            | Accuracy      | Open Source
AutoExtract | 0.984 ± 0.003 | 0.956 ± 0.010 | 0.970 ± 0.005 | 0.470 ± 0.037 |
Diffbot     | 0.958 ± 0.009 | 0.944 ± 0.013 | 0.951 ± 0.010 | 0.348 ± 0.035 |
ExtractNet  | 0.922 ± 0.011 | 0.933 ± 0.013 | 0.927 ± 0.010 | 0.160 ± 0.027 |
boilerpipe  | 0.850 ± 0.016 | 0.870 ± 0.020 | 0.860 ± 0.016 | 0.006 ± 0.006 |
dragnet     | 0.925 ± 0.012 | 0.889 ± 0.018 | 0.907 ± 0.014 | 0.221 ± 0.030 |
html-text   | 0.500 ± 0.017 | 0.994 ± 0.001 | 0.665 ± 0.015 | 0.000 ± 0.000 |
newspaper   | 0.917 ± 0.013 | 0.906 ± 0.017 | 0.912 ± 0.014 | 0.260 ± 0.032 |
readability | 0.913 ± 0.014 | 0.931 ± 0.015 | 0.922 ± 0.013 | 0.315 ± 0.034 |
trafilatura | 0.930 ± 0.010 | 0.967 ± 0.009 | 0.948 ± 0.008 | 0.243 ± 0.031 |
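
As a quick sanity check on how the F1 column relates to precision and recall, the sketch below applies the standard harmonic-mean formula to a few rows of the table. The `f1_score` helper is written here for illustration; the benchmark itself may aggregate per page rather than from the averaged precision and recall, so treat any agreement as approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the body extraction table.
rows = {
    "AutoExtract": (0.984, 0.956),
    "ExtractNet": (0.922, 0.933),
    "dragnet": (0.925, 0.889),
    "trafilatura": (0.930, 0.967),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```

For these four rows the computed value matches the reported F1 to three decimals.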

Results of author name extraction:
Model                                 | F1
ExtractNet: fasttext embeddings + CRF | 0.904 ± 0.10

List of changes from Dragnet

  • The underlying classifier is CatBoost instead of a decision tree for all attribute extraction, for consistency and a performance boost.

  • Updated CSS features and added a text+CSS latent feature.

  • Includes a CRF model that extracts names from the author block text.

  • Trained on 22,000+ recent webpages collected in late 2020, about 20 times the amount of data used for Dragnet.

GETTING STARTED

Installing and extraction

pip install extractnet

import requests
from extractnet import Extractor

# Download the page, then run the full extraction pipeline on its HTML
raw_html = requests.get('https://apnews.com/article/6e58b5742b36e3de53298cf73fbfdf48').text
results = Extractor().extract(raw_html)
for key, value in results.items():
    print(key)
    print(value)
    print('------------')

Callbacks

ExtractNet also supports callback functions that inject additional features during the extraction process.

A quick look at the usage: each callback can access the raw HTML string provided during extraction. This lets users add extra information, such as the detected language, to the final results.

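import re

import requests
from extractnet import Extractor

# Metadata callbacks receive the raw HTML and return a dict of extra fields
def meta_pre1(raw_html):
    return {'first_value': 0}

def meta_pre2(raw_html):
    return {'first_value': 1, 'second_value': 2}

# Postprocess callbacks also receive the results extracted so far
def find_stock_ticker(raw_html, results):
    matched_ticker = []
    # Collect tokens that look like stock tickers, e.g. $AAPL
    for ticker in re.findall(r'[$][A-Za-z][\S]*', str(results['content'])):
        matched_ticker.append(ticker)
    return {'matched_ticker': matched_ticker}

extract = Extractor(author_prob_threshold=0.1,
                    meta_postprocess=[meta_pre1, meta_pre2],
                    postprocess=[find_stock_ticker])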

The extracted results will contain additional keys such as first_value and second_value. Note that callbacks are executed in the given order (meta_pre1 runs first, followed by meta_pre2), and any result passed from an earlier stage will not be overwritten by a later stage.

raw_html = requests.get('https://apnews.com/article/6e58b5742b36e3de53298cf73fbfdf48').text
results = extract(raw_html)

In this example the value of first_value will remain 0 even though meta_pre2 also returns first_value=1, because meta_pre1 has already assigned first_value as 0.
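
The "earlier callback wins" rule can be sketched in plain Python without ExtractNet installed. The `run_callbacks` function below is a hypothetical stand-in that only mirrors the behaviour described above; it is not ExtractNet's actual implementation.

```python
def run_callbacks(raw_html, callbacks):
    """Run callbacks in order; keys set by earlier callbacks are kept."""
    merged = {}
    for callback in callbacks:
        for key, value in callback(raw_html).items():
            # setdefault only writes the key if no earlier callback set it.
            merged.setdefault(key, value)
    return merged


def meta_pre1(raw_html):
    return {'first_value': 0}


def meta_pre2(raw_html):
    return {'first_value': 1, 'second_value': 2}


merged = run_callbacks('<html></html>', [meta_pre1, meta_pre2])
print(merged)
```

Here `first_value` stays 0 because `meta_pre1` ran first, while `second_value` comes from `meta_pre2`. Reversing the list order would make `first_value` equal to 1 instead.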

Contributing

We love contributions! Open an issue, or fork/create a pull request.

Develop Locally

Since extractnet relies on several C++ modules, you need to compile them before running it locally.

Usually this command is all you need:

make

However, you can try to build it

More details about the code structure

Coming soon

Reference

Content extraction using diverse feature sets

[1] Peters, Matthew E. and D. Lecocq, Content extraction using diverse feature sets

@inproceedings{Peters2013ContentEU,
  title={Content extraction using diverse feature sets},
  author={Matthew E. Peters and D. Lecocq},
  booktitle={WWW '13 Companion},
  year={2013}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}