maithilish / scoopi-scraper

License: GPL-3.0
Scoopi Web Scraper is a heavy-duty tool to extract data from HTML pages.

Programming Languages

Java, HTML, TypeScript, Shell, JavaScript, Dockerfile

CodeTab Scoopi Guide: Quickstart and Guide


Scoopi is a tool to extract and transform data from web pages.

Libraries such as JSoup and HtmlUnit make it quite easy to scrape web pages in Java. They do well when scraping data from a limited set of pages, but things get complicated when you start to scrape thousands of pages. Scoopi is built on JSoup and HtmlUnit and offers the following functionality:

  • Scoopi is fully definition driven. The data structure, task workflow and pages to scrape are defined with a set of YML definition files, and no coding skill is required
  • It can be configured to use either JSoup or HtmlUnit as the scraper
  • Queries can be written using either Selectors (JSoup) or XPath (HtmlUnit)
  • Scoopi is a multithreaded application which processes pages in parallel for maximum throughput.
    • Even on a low-end system with a Core 2 Duo processor, it can load, parse and transform around 1000 pages in under two minutes.
  • Scoopi ships as a Docker image so that it can run without any cumbersome installation
  • Scoopi persists pages and data to the file system so that it can recover from a failed state without repeating the tasks already completed
  • Can transform, filter and sort the data before output
  • Ships with built-in appenders such as FileAppender, DBAppender and ListAppender.
  • ScoopiEngine can be embedded in other programs, which can access the scraped data with ListAppender
  • Flexible workflow allows one to change the sequence of steps
  • Scoopi is extensible. Developers can extend the predefined base steps or even create new ones with different functionality and weave them into the workflow
  • Scoopi Cluster
    • In cluster mode, it can scale horizontally by distributing tasks across multiple nodes
    • Designed to run in various environments: on a bare JVM, in Docker containers, or even on high-end container orchestration platforms such as Kubernetes
    • For clustering, Scoopi Cluster uses Hazelcast IMDG, a fault-tolerant distributed in-memory computing platform
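
To make the parallel processing and recovery points above concrete, here is a minimal sketch in plain Java. It is not Scoopi's code or API: the class `ResumableRun`, the `process` method and the one-file-per-page store are hypothetical, chosen only to illustrate the general idea of running page tasks on a thread pool and persisting each result so that a rerun skips work already completed.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: hypothetical names, not part of Scoopi.
public class ResumableRun {

    // Counts pages that were actually processed rather than read back from disk.
    static final AtomicInteger processed = new AtomicInteger();

    // Hypothetical stand-in for loading, parsing and transforming one page.
    static String process(String pageId) {
        processed.incrementAndGet();
        return "data for " + pageId;
    }

    // Processes all pages in parallel; results are stored as one file per page
    // in storeDir, and a page whose file already exists is not processed again.
    // Assumes page ids are simple strings that are valid file names.
    static List<String> run(List<String> pageIds, Path storeDir, int threads)
            throws Exception {
        Files.createDirectories(storeDir);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String pageId : pageIds) {
                Callable<String> task = () -> {
                    Path stored = storeDir.resolve(pageId + ".txt");
                    if (Files.exists(stored)) {
                        // Completed in an earlier run: reuse the persisted result.
                        return new String(Files.readAllBytes(stored), StandardCharsets.UTF_8);
                    }
                    String result = process(pageId);
                    Files.write(stored, result.getBytes(StandardCharsets.UTF_8));
                    return result;
                };
                futures.add(pool.submit(task));
            }
            // Collect results in the same order as the input list.
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("scrape-store");
        List<String> pages = Arrays.asList("page-1", "page-2", "page-3");
        run(pages, dir, 4); // first run processes every page
        run(pages, dir, 4); // second run finds the stored results and skips processing
        System.out.println("pages processed: " + processed.get());
    }
}
```

In this sketch the second call to `run` returns the same results as the first without calling `process` again, which is the behaviour the recovery bullet describes. A real tool would also need to handle partially written files and failed tasks.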

Scoopi Installation

To install and run Scoopi, refer to the Quickstart and Guide.
