Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.

Stars: ✭ 15 (-80%)

Mutual labels: scraping

image-collector

Download images from Google Image Search

Stars: ✭ 38 (-49.33%)

Mutual labels: scraping

shup

A POSIX shell script to parse HTML

Stars: ✭ 28 (-62.67%)

Mutual labels: scraping

humanparser

Parse a human name string into salutation, first name, middle name, last name, suffix.

Stars: ✭ 78 (+4%)

Mutual labels: scraping

naos

📉 Uptime and error monitoring CLI

Stars: ✭ 30 (-60%)

Mutual labels: scraping

Zeiver

A Scraper, Downloader, & Recorder for static open directories.

Stars: ✭ 14 (-81.33%)

Mutual labels: scraping

kuwala

Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data sc…

Stars: ✭ 474 (+532%)

Mutual labels: scraping

dust

Archive web pages with all relevant assets or save as a single file HTML

Stars: ✭ 19 (-74.67%)

Mutual labels: scraping

api-flight.com

Main API Flight Git Repository

Stars: ✭ 26 (-65.33%)

Mutual labels: scraping

Babler

Data Collection System For NLP/Speech Recognition

Stars: ✭ 21 (-72%)

Mutual labels: scraping

whatsapp-tracking

Scraping the status of WhatsApp contacts

Stars: ✭ 49 (-34.67%)

Mutual labels: scraping


Webdext

Webdext is a JavaScript library for web data extraction (web scraping). Currently, it only supports extracting data records from a list page (a web page containing 2 or more data records).

To use it, you must run Webdext inside the web page context. There are 2 ways to do that:

Use it as a browser extension (currently, I have only implemented the Chrome extension)
Inject the script into the web page context using a headless browser such as Puppeteer, PhantomJS, or Splash (currently, I have only implemented the runner script for PhantomJS)
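
The second option can be sketched for Puppeteer roughly as follows. This is only an illustration, not part of Webdext: no Puppeteer runner ships with the project, and the helper name `runInPageContext`, the `scriptPath` argument, and the `runInPage` callback are all hypothetical. The callback stands in for whatever call the injected Webdext bundle exposes, since that entry point is not described here. The Puppeteer methods used (`newPage`, `goto`, `addScriptTag`, `evaluate`, `close`) are part of its public API.

```javascript
// Hypothetical helper (not part of Webdext): open a page, inject a script
// file into the page context, then run a function inside that context.
//
// browser    - a Puppeteer Browser, e.g. the result of puppeteer.launch()
// url        - the list page to load
// scriptPath - path to a bundled Webdext script (assumed file name/location)
// runInPage  - function evaluated inside the page; it should call whatever
//              global the injected script defines (assumption, not documented)
async function runInPageContext(browser, url, scriptPath, runInPage) {
  const page = await browser.newPage();
  try {
    // Wait until network activity settles so the list is rendered.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Adds a <script> tag whose content is read from the local file.
    await page.addScriptTag({ path: scriptPath });
    // The function runs in the page, where the injected script is visible.
    return await page.evaluate(runInPage);
  } finally {
    // Always release the tab, even if navigation or extraction fails.
    await page.close();
  }
}
```

With Puppeteer installed, `browser` would come from `puppeteer.launch()`. The value returned by `runInPage` must be serializable, because `page.evaluate` passes it back from the page to Node.js.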

Check the video below to see how it works as a Chrome extension:

Installation and usage

Internals

The intelligent extraction algorithm is heavily based on AutoRM [1] and DAG-MTM [2], though it is not an exact implementation of either.

[1] Shengsheng Shi, Chengfei Liu, Yi Shen, Chunfeng Yuan, Yihua Huang. 2015. AutoRM: An effective approach for automatic Web data record mining. Knowledge-Based Systems, 89, 314–331. doi:10.1016/j.knosys.2015.07.012

[2] Shengsheng Shi, Chengfei Liu, Chunfeng Yuan, Yihua Huang. 2014. Multi-feature and DAG-based multi-tree matching algorithm for automatic web data mining. Proceedings of International Joint Conferences on Web Intelligence and Intelligent Agent Technology, 739–755. doi:10.1109/WI-IAT.2014.24

Author

Sigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk


seagatesoft / webdext
