973 open source projects that are alternatives to or similar to trafilatura

extractnet

A Dragnet that also extracts author, headline, date, and keywords from context

Stars: ✭ 52 (-92.69%)

Mutual labels: text-mining, news, web-scraping, text-cleaning

Autoscraper

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

Stars: ✭ 4,077 (+473.42%)

Mutual labels: scraping, web-scraping

restaurant-finder-featureReviews

Build a Flask web application to help users retrieve key restaurant information and feature-based reviews (generated by applying market-basket model – Apriori algorithm and NLP on user reviews).

Stars: ✭ 21 (-97.05%)

Mutual labels: text-mining, web-scraping

readability-cli

A CLI for Mozilla Readability. Get clean, uncluttered, ready-to-read HTML from any webpage!

Stars: ✭ 41 (-94.23%)

Mutual labels: scraping, readability

Neural-Scam-Artist

Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Stars: ✭ 18 (-97.47%)

Mutual labels: web-scraping, readability

Text-Analysis

Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.

Stars: ✭ 48 (-93.25%)

Mutual labels: text-mining, web-scraping

papercut

Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.

Stars: ✭ 15 (-97.89%)

Mutual labels: scraping, web-scraping

raspagem-de-dados-fatec

📓 Short course on web data scraping with Python, taught at the FATEC Jundiaí Technology Week

Stars: ✭ 22 (-96.91%)

Mutual labels: scraping, web-scraping

Newspaper

News, full-text, and article metadata extraction in Python 3.

Stars: ✭ 11,545 (+1523.77%)

Mutual labels: news, news-aggregator

PressCenters.com

News aggregator for the press releases of the Bulgarian government sites written in ASP.NET Core

Stars: ✭ 91 (-87.2%)

Mutual labels: news, news-aggregator

malay-dataset

Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html

Stars: ✭ 189 (-73.42%)

Mutual labels: text-mining, corpus

google-news-scraper

Google News Scraper for languages like Japanese, Chinese... [VPN Support]

Stars: ✭ 88 (-87.62%)

Mutual labels: news, news-aggregator

Sqrape

Simple Query Scraping with CSS and Go Reflection (MOVED to Gitlab)

Stars: ✭ 144 (-79.75%)

Mutual labels: scraping, web-scraping

Uc Davis Cs Exams Analysis

📈 Regression and Classification with UC Davis student quiz data and exam data

Stars: ✭ 33 (-95.36%)

Mutual labels: text-mining, web-scraping

newspaperjs

News extraction and scraping. Article Parsing

Stars: ✭ 59 (-91.7%)

Mutual labels: news, news-aggregator

selectorlib

A library to read a YML file with Xpath or CSS Selectors and extract data from HTML pages using them

Stars: ✭ 53 (-92.55%)

Mutual labels: scraping, web-scraping

browser-pool

A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.

Stars: ✭ 71 (-90.01%)

Mutual labels: scraping, web-scraping

Elixir Scrape

Scrape any website, article or RSS/Atom Feed with ease!

Stars: ✭ 306 (-56.96%)

Mutual labels: scraping, readability

Scrapple

A framework for creating semi-automatic web content extractors

Stars: ✭ 464 (-34.74%)

Mutual labels: scraping, web-scraping

GNews

A happy and lightweight Python package that provides an API to search for articles on Google News and returns a JSON response.

Stars: ✭ 271 (-61.88%)

Mutual labels: news, rss-feed

nytwit

New York Times Word Innovation Types dataset

Stars: ✭ 21 (-97.05%)

Mutual labels: news, corpus

General News Extractor Js

🤔 A general-purpose extractor for the main text of news web pages, including title, author, and date.

Stars: ✭ 55 (-92.26%)

Mutual labels: news, readability

HungryHippo

🦛 scrapes websites and generates rss feeds

Stars: ✭ 33 (-95.36%)

Mutual labels: news, rss-feed

Awesome Hungarian Nlp

A curated list of NLP resources for Hungarian

Stars: ✭ 121 (-82.98%)

Mutual labels: text-mining, corpus

Learning Social Media Analytics With R

This repository contains code and bonus content which will be added from time to time for the book "Learning Social Media Analytics with R" by Packt

Stars: ✭ 102 (-85.65%)

Mutual labels: text-mining, news

Texthero

Text preprocessing, representation and visualization from zero to hero.

Stars: ✭ 2,407 (+238.54%)

Mutual labels: text-mining, text-preprocessing

info-bot

🤖 A Versatile Telegram Bot

Stars: ✭ 37 (-94.8%)

Mutual labels: news, scraping

Phpscraper

PHP Scraper - a highly opinionated web-interface for PHP

Stars: ✭ 148 (-79.18%)

Mutual labels: scraping, web-scraping

Nlp chinese corpus

Large Scale Chinese Corpus for NLP

Stars: ✭ 6,656 (+836.15%)

Mutual labels: news, corpus

Scrape Linkedin Selenium

`scrape_linkedin` is a Python package that allows you to scrape personal LinkedIn profiles and company pages, turning the data into structured JSON.

Stars: ✭ 239 (-66.39%)

Mutual labels: scraping, web-scraping

readability

Fast readability scores for text data

Stars: ✭ 22 (-96.91%)

Mutual labels: text-mining, readability

Humanoid

Node.js package to bypass CloudFlare's anti-bot JavaScript challenges

Stars: ✭ 88 (-87.62%)

Mutual labels: scraping, web-scraping

Khcoder

KH Coder: for Quantitative Content Analysis or Text Mining

Stars: ✭ 126 (-82.28%)

Mutual labels: text-mining, corpus

SmartReader

SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla

Stars: ✭ 88 (-87.62%)

Mutual labels: readability, article-extractor

ioweb

Web Scraping Framework

Stars: ✭ 31 (-95.64%)

Mutual labels: scraping, web-scraping

Reader

Extract clean(er), readable text from web pages via Mercury Web Parser.

Stars: ✭ 75 (-89.45%)

Mutual labels: web-scraping, readability

top-github-scraper

Scrape top GitHub repositories and users based on keywords

Stars: ✭ 40 (-94.37%)

Mutual labels: scraping, web-scraping

twitter-to-rss

Simple Python script that parses a Twitter feed to generate an RSS feed.

Stars: ✭ 15 (-97.89%)

Mutual labels: rss-feed, readability

Gopa

[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn

Stars: ✭ 277 (-61.04%)

Mutual labels: scraping, web-scraping

Apify Js

Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.

Stars: ✭ 3,154 (+343.6%)

Mutual labels: scraping, web-scraping

Detect Cms

PHP Library for detecting CMS

Stars: ✭ 78 (-89.03%)

Mutual labels: scraping, web-scraping

text-mining-corona-articles

Text Mining for Indonesian Online News Articles About Corona

Stars: ✭ 15 (-97.89%)

Mutual labels: text-mining, web-scraping

Breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

Stars: ✭ 186 (-73.84%)

Mutual labels: text-mining, text-extraction

PythonScrapyBasicSetup

Basic setup with random user agents and IP addresses for Python Scrapy Framework.

Stars: ✭ 57 (-91.98%)

Mutual labels: scraping, web-scraping

california-electricity-capacity-analysis

A Los Angeles Times analysis of California's costly power glut

Stars: ✭ 17 (-97.61%)

Mutual labels: news

proiel-treebank

Official releases of the PROIEL treebank of ancient Indo-European languages

Stars: ✭ 30 (-95.78%)

Mutual labels: corpus

Text-Classification-LSTMs-PyTorch

The aim of this repository is to show a baseline model for text classification by implementing an LSTM-based model coded in PyTorch. To provide a better understanding of the model, a Tweets dataset provided by Kaggle is used.

Stars: ✭ 45 (-93.67%)

Mutual labels: text-mining

corpus-joyce-ulysses-tei

James Joyce's novel Ulysses in TEI XML. Work-in-progress.

Stars: ✭ 18 (-97.47%)

Mutual labels: tei

text-mined-synthesis

Code for the text-mined solid-state reactions dataset

Stars: ✭ 46 (-93.53%)

Mutual labels: text-mining

perke

A keyphrase extractor for Persian

Stars: ✭ 60 (-91.56%)

Mutual labels: text-mining

gochanges

**[ARCHIVED]** website changes tracker 🔍

Stars: ✭ 12 (-98.31%)

Mutual labels: scraping

crawlzone

Crawlzone is a fast asynchronous internet crawling framework for PHP.

Stars: ✭ 70 (-90.15%)

Mutual labels: web-scraping

Goirate

Pillaging the seven seas for torrents, pieces of eight and other bounty.

Stars: ✭ 20 (-97.19%)

Mutual labels: scraping

etf4u

📊 Python tool to scrape real-time information about ETFs from the web and mix them together by proportionally distributing their asset allocations

Stars: ✭ 29 (-95.92%)

Mutual labels: scraping

ariel-news-app

News App developed with Flutter featuring beautiful UI, category-based news, story for faster news reading, inbuilt article viewer, share feature, and more.

Stars: ✭ 31 (-95.64%)

Mutual labels: news

codepen-puppeteer

Use Puppeteer to download pens from Codepen.io as single html pages

Stars: ✭ 22 (-96.91%)

Mutual labels: web-scraping

2017-summer-workshop

Exercises, data, and more for our 2017 summer workshop (funded by the Estes Fund and in partnership with Project Jupyter and Berkeley's D-Lab)

Stars: ✭ 33 (-95.36%)

Mutual labels: web-scraping

linkedin-scraper

Tool to scrape LinkedIn