seagatesoft / webdext

Licence: MIT License
Intelligent Web Data Extractor

Programming Languages

HTML
JavaScript
CSS

Projects that are alternatives of or similar to webdext

web-clipper
Easily download the main content of a web page in html, markdown, and/or epub format from command line.
Stars: ✭ 15 (-80%)
Mutual labels:  scraping
pomp
Screen scraping and web crawling framework
Stars: ✭ 61 (-18.67%)
Mutual labels:  scraping
TorScrapper
A Scraper made 100% in Python using BeautifulSoup and Tor. It can be used to scrape both normal and onion links. Happy Scraping :)
Stars: ✭ 24 (-68%)
Mutual labels:  scraping
Scraper-Projects
🕸 List of mini projects that involve web scraping 🕸
Stars: ✭ 25 (-66.67%)
Mutual labels:  scraping
dmi-instascraper
A GUI for Instaloader to scrape users and hashtags with on Instagram
Stars: ✭ 21 (-72%)
Mutual labels:  scraping
scrapy-zyte-smartproxy
Zyte Smart Proxy Manager (formerly Crawlera) middleware for Scrapy
Stars: ✭ 317 (+322.67%)
Mutual labels:  scraping
AngleParse
HTML parsing and processing tool for PowerShell.
Stars: ✭ 35 (-53.33%)
Mutual labels:  scraping
PyLex
Perform lexical analysis on words, one word at a time.
Stars: ✭ 60 (-20%)
Mutual labels:  scraping
scrapy facebooker
Collection of scrapy spiders which can scrape posts, images, and so on from public Facebook Pages.
Stars: ✭ 22 (-70.67%)
Mutual labels:  scraping
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-80%)
Mutual labels:  scraping
image-collector
Download images from Google Image Search
Stars: ✭ 38 (-49.33%)
Mutual labels:  scraping
shup
A POSIX shell script to parse HTML
Stars: ✭ 28 (-62.67%)
Mutual labels:  scraping
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Stars: ✭ 78 (+4%)
Mutual labels:  scraping
naos
📉 Uptime and error monitoring CLI
Stars: ✭ 30 (-60%)
Mutual labels:  scraping
Zeiver
A Scraper, Downloader, & Recorder for static open directories.
Stars: ✭ 14 (-81.33%)
Mutual labels:  scraping
kuwala
Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We are set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations together in one intuitive interface built with React Flow. In addition we provide third-party data into data sc…
Stars: ✭ 474 (+532%)
Mutual labels:  scraping
dust
Archive web pages with all relevant assets or save as a single file HTML
Stars: ✭ 19 (-74.67%)
Mutual labels:  scraping
api-flight.com
Main API Flight Git Repository
Stars: ✭ 26 (-65.33%)
Mutual labels:  scraping
Babler
Data Collection System For NLP/Speech Recognition
Stars: ✭ 21 (-72%)
Mutual labels:  scraping
whatsapp-tracking
Scraping the status of WhatsApp contacts
Stars: ✭ 49 (-34.67%)
Mutual labels:  scraping

Webdext

Webdext is a JavaScript library for web data extraction (web scraping). Currently, it only supports extracting data records from a list page (a web page containing two or more data records).

To use it, you must run Webdext inside the web page context. There are two ways to do that:

  1. Use it as a browser extension (currently, I have only implemented the Chrome extension)
  2. Inject the script into the web page context using a headless browser or browser automation tool such as Puppeteer, PhantomJS, or Splash (currently, I have only implemented the runner script for PhantomJS)
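
The second approach can be sketched roughly as follows for Puppeteer. This is only an illustration and not a runner that ships with this project: the function name `runInsidePage`, the `scriptPath` argument, and the `runInPage` callback are hypothetical, and the sketch makes no assumption about which file or function Webdext actually exposes. The only Puppeteer calls used are `page.goto`, `page.addScriptTag`, and `page.evaluate`.

```javascript
// Sketch only: `page` is expected to be a Puppeteer Page object.
// `scriptPath` and `runInPage` are hypothetical placeholders; substitute the
// real library file and the real entry point of whatever script you inject.
async function runInsidePage(page, url, scriptPath, runInPage) {
  // Load the target list page and wait for network activity to settle.
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Inject the script file so that it executes in the page's own context.
  await page.addScriptTag({ path: scriptPath });

  // Run a function inside the page. Its return value must be serializable,
  // because it is passed back from the browser to the Node.js process.
  return page.evaluate(runInPage);
}

// Possible usage, assuming Puppeteer is installed:
//
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   const title = await runInsidePage(
//     page, 'https://example.com/', './injected.js', () => document.title);
//   await browser.close();
```

Passing the page object in as a parameter keeps the helper independent of how the browser is launched, so the same function could be reused with an already open page.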

Check the video below to see how it works as a Chrome extension:

DemoVideo

Installation and usage

  1. Chrome Extension
  2. PhantomJS script

Internals

The intelligent extraction algorithm is heavily based on AutoRM [1] and DAG-MTM [2], although it is not an exact implementation of either.

[1] Shengsheng Shi, Chengfei Liu, Yi Shen, Chunfeng Yuan, Yihua Huang. 2015. AutoRM: An effective approach for automatic Web data record mining. Knowledge-Based Systems, 89, 314–331. doi:10.1016/j.knosys.2015.07.012
[2] Shengsheng Shi, Chengfei Liu, Chunfeng Yuan, Yihua Huang. 2014. Multi-feature and DAG-based multi-tree matching algorithm for automatic web data mining. Proceedings of International Joint Conferences on Web Intelligence and Intelligent Agent Technology, 739–755. doi:10.1109/WI-IAT.2014.24
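
To give a rough feel for what finding data records on a list page involves, here is a deliberately simplified sketch. It is not AutoRM, not DAG-MTM, and not the code used by Webdext. It only groups sibling nodes whose subtrees have exactly the same tag structure, whereas real record-mining algorithms tolerate variation between records. The node shape `{ tag, children }` and the function names `signature` and `findRepeatedSiblings` are assumptions made for this illustration.

```javascript
// Simplified illustration only: NOT AutoRM or DAG-MTM.
// A node is a plain object { tag: string, children: Node[] } (hypothetical shape).

// Build a structural signature for a subtree from its tag names.
function signature(node) {
  const kids = (node.children || []).map(signature).join(',');
  return kids ? `${node.tag}(${kids})` : node.tag;
}

// Walk the tree and report every group of siblings that share the same
// signature and has at least `minRecords` members.
function findRepeatedSiblings(root, minRecords = 2) {
  const groups = [];
  const visit = (node) => {
    const bySignature = new Map();
    for (const child of node.children || []) {
      const sig = signature(child);
      if (!bySignature.has(sig)) bySignature.set(sig, []);
      bySignature.get(sig).push(child);
    }
    for (const [sig, nodes] of bySignature) {
      if (nodes.length >= minRecords) {
        groups.push({ signature: sig, count: nodes.length });
      }
    }
    (node.children || []).forEach(visit);
  };
  visit(root);
  return groups;
}

// Example tree: a heading followed by a list of three identically shaped items.
const item = () => ({ tag: 'li', children: [{ tag: 'a' }, { tag: 'span' }] });
const exampleTree = {
  tag: 'body',
  children: [{ tag: 'h1' }, { tag: 'ul', children: [item(), item(), item()] }],
};
console.log(findRepeatedSiblings(exampleTree));
```

Exact signature matching breaks as soon as one record has an optional field, which is one reason the cited approaches rely on similarity measures and tree matching instead.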

Author

Sigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk
