973 open-source projects that are alternatives to or similar to trafilatura

extractnet
A Dragnet that also extracts author, headline, date, and keywords from context
Stars: ✭ 52 (-92.69%)
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+473.42%)
Mutual labels:  scraping, web-scraping
restaurant-finder-featureReviews
Build a Flask web application to help users retrieve key restaurant information and feature-based reviews (generated by applying market-basket model – Apriori algorithm and NLP on user reviews).
Stars: ✭ 21 (-97.05%)
Mutual labels:  text-mining, web-scraping
readability-cli
A CLI for Mozilla Readability. Get clean, uncluttered, ready-to-read HTML from any webpage!
Stars: ✭ 41 (-94.23%)
Mutual labels:  scraping, readability
Neural-Scam-Artist
Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.
Stars: ✭ 18 (-97.47%)
Mutual labels:  web-scraping, readability
Text-Analysis
Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.
Stars: ✭ 48 (-93.25%)
Mutual labels:  text-mining, web-scraping
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-97.89%)
Mutual labels:  scraping, web-scraping
raspagem-de-dados-fatec
📓 Mini-course on web data scraping with Python, taught at the Technology Week of FATEC Jundiaí
Stars: ✭ 22 (-96.91%)
Mutual labels:  scraping, web-scraping
Newspaper
News, full-text, and article metadata extraction in Python 3. Advanced docs:
Stars: ✭ 11,545 (+1523.77%)
Mutual labels:  news, news-aggregator
PressCenters.com
News aggregator for the press releases of the Bulgarian government sites written in ASP.NET Core
Stars: ✭ 91 (-87.2%)
Mutual labels:  news, news-aggregator
malay-dataset
Text corpus for Bahasa Malaysia, https://malaya.readthedocs.io/en/latest/Dataset.html
Stars: ✭ 189 (-73.42%)
Mutual labels:  text-mining, corpus
google-news-scraper
Google News Scraper for languages like Japanese, Chinese... [VPN Support]
Stars: ✭ 88 (-87.62%)
Mutual labels:  news, news-aggregator
Sqrape
Simple Query Scraping with CSS and Go Reflection (MOVED to Gitlab)
Stars: ✭ 144 (-79.75%)
Mutual labels:  scraping, web-scraping
Uc Davis Cs Exams Analysis
📈 Regression and Classification with UC Davis student quiz data and exam data
Stars: ✭ 33 (-95.36%)
Mutual labels:  text-mining, web-scraping
newspaperjs
News extraction and scraping. Article Parsing
Stars: ✭ 59 (-91.7%)
Mutual labels:  news, news-aggregator
selectorlib
A library to read a YML file with Xpath or CSS Selectors and extract data from HTML pages using them
Stars: ✭ 53 (-92.55%)
Mutual labels:  scraping, web-scraping
browser-pool
A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.
Stars: ✭ 71 (-90.01%)
Mutual labels:  scraping, web-scraping
Elixir Scrape
Scrape any website, article or RSS/Atom Feed with ease!
Stars: ✭ 306 (-56.96%)
Mutual labels:  scraping, readability
Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (-34.74%)
Mutual labels:  scraping, web-scraping
GNews
A happy and lightweight Python package that provides an API to search for articles on Google News and returns a JSON response.
Stars: ✭ 271 (-61.88%)
Mutual labels:  news, rss-feed
nytwit
New York Times Word Innovation Types dataset
Stars: ✭ 21 (-97.05%)
Mutual labels:  news, corpus
General News Extractor Js
🤔 A general-purpose extractor for the main text of news web pages, including title, author, and date.
Stars: ✭ 55 (-92.26%)
Mutual labels:  news, readability
HungryHippo
🦛 scrapes websites and generates rss feeds
Stars: ✭ 33 (-95.36%)
Mutual labels:  news, rss-feed
Awesome Hungarian Nlp
A curated list of NLP resources for Hungarian
Stars: ✭ 121 (-82.98%)
Mutual labels:  text-mining, corpus
Learning Social Media Analytics With R
This repository contains code and bonus content which will be added from time to time for the book "Learning Social Media Analytics with R" by Packt
Stars: ✭ 102 (-85.65%)
Mutual labels:  text-mining, news
Texthero
Text preprocessing, representation and visualization from zero to hero.
Stars: ✭ 2,407 (+238.54%)
Mutual labels:  text-mining, text-preprocessing
info-bot
🤖 A Versatile Telegram Bot
Stars: ✭ 37 (-94.8%)
Mutual labels:  news, scraping
Phpscraper
PHP Scraper - a highly opinionated web-interface for PHP
Stars: ✭ 148 (-79.18%)
Mutual labels:  scraping, web-scraping
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+836.15%)
Mutual labels:  news, corpus
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (-66.39%)
Mutual labels:  scraping, web-scraping
readability
Fast readability scores for text data
Stars: ✭ 22 (-96.91%)
Mutual labels:  text-mining, readability
Humanoid
Node.js package to bypass CloudFlare's anti-bot JavaScript challenges
Stars: ✭ 88 (-87.62%)
Mutual labels:  scraping, web-scraping
Khcoder
KH Coder: for Quantitative Content Analysis or Text Mining
Stars: ✭ 126 (-82.28%)
Mutual labels:  text-mining, corpus
SmartReader
SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla
Stars: ✭ 88 (-87.62%)
Mutual labels:  readability, article-extractor
ioweb
Web Scraping Framework
Stars: ✭ 31 (-95.64%)
Mutual labels:  scraping, web-scraping
Reader
Extract clean(er), readable text from web pages via Mercury Web Parser.
Stars: ✭ 75 (-89.45%)
Mutual labels:  web-scraping, readability
top-github-scraper
Scrape top GitHub repositories and users based on keywords
Stars: ✭ 40 (-94.37%)
Mutual labels:  scraping, web-scraping
twitter-to-rss
Simple Python script to parse a Twitter feed and generate an RSS feed.
Stars: ✭ 15 (-97.89%)
Mutual labels:  rss-feed, readability
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (-61.04%)
Mutual labels:  scraping, web-scraping
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+343.6%)
Mutual labels:  scraping, web-scraping
Detect Cms
PHP Library for detecting CMS
Stars: ✭ 78 (-89.03%)
Mutual labels:  scraping, web-scraping
text-mining-corona-articles
Text Mining for Indonesian Online News Articles About Corona
Stars: ✭ 15 (-97.89%)
Mutual labels:  text-mining, web-scraping
Breadability
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)
Stars: ✭ 186 (-73.84%)
Mutual labels:  text-mining, text-extraction
PythonScrapyBasicSetup
Basic setup with random user agents and IP addresses for Python Scrapy Framework.
Stars: ✭ 57 (-91.98%)
Mutual labels:  scraping, web-scraping
california-electricity-capacity-analysis
A Los Angeles Times analysis of California's costly power glut
Stars: ✭ 17 (-97.61%)
Mutual labels:  news
proiel-treebank
Official releases of the PROIEL treebank of ancient Indo-European languages
Stars: ✭ 30 (-95.78%)
Mutual labels:  corpus
Text-Classification-LSTMs-PyTorch
The aim of this repository is to show a baseline model for text classification by implementing an LSTM-based model in PyTorch. To provide a better understanding of the model, a Tweets dataset provided by Kaggle is used.
Stars: ✭ 45 (-93.67%)
Mutual labels:  text-mining
corpus-joyce-ulysses-tei
James Joyce's novel Ulysses in TEI XML. Work-in-progress.
Stars: ✭ 18 (-97.47%)
Mutual labels:  tei
text-mined-synthesis
Codes for text-mined solid-state reactions dataset
Stars: ✭ 46 (-93.53%)
Mutual labels:  text-mining
perke
A keyphrase extractor for Persian
Stars: ✭ 60 (-91.56%)
Mutual labels:  text-mining
gochanges
**[ARCHIVED]** website changes tracker 🔍
Stars: ✭ 12 (-98.31%)
Mutual labels:  scraping
crawlzone
Crawlzone is a fast asynchronous internet crawling framework for PHP.
Stars: ✭ 70 (-90.15%)
Mutual labels:  web-scraping
Goirate
Pillaging the seven seas for torrents, pieces of eight and other bounty.
Stars: ✭ 20 (-97.19%)
Mutual labels:  scraping
etf4u
📊 Python tool to scrape real-time information about ETFs from the web and mix them together by proportionally distributing their asset allocation
Stars: ✭ 29 (-95.92%)
Mutual labels:  scraping
ariel-news-app
News App developed with Flutter featuring beautiful UI, category-based news, story for faster news reading, inbuilt article viewer, share feature, and more.
Stars: ✭ 31 (-95.64%)
Mutual labels:  news
codepen-puppeteer
Use Puppeteer to download pens from Codepen.io as single html pages
Stars: ✭ 22 (-96.91%)
Mutual labels:  web-scraping
2017-summer-workshop
Exercises, data, and more for our 2017 summer workshop (funded by the Estes Fund and in partnership with Project Jupyter and Berkeley's D-Lab)
Stars: ✭ 33 (-95.36%)
Mutual labels:  web-scraping
linkedin-scraper
Tool to scrape linkedin
Stars: ✭ 74 (-89.59%)
Mutual labels:  scraping
Interesting-Things-on-GitHub
[INACTIVE] News on GitHub, featured by Huei Tan 😄
Stars: ✭ 29 (-95.92%)
Mutual labels:  news
chopper
Chopper is a tool to extract elements from HTML by preserving ancestors and CSS rules
Stars: ✭ 22 (-96.91%)
Mutual labels:  scraping