
slotix / Dataflowkit

Licence: bsd-3-clause
Extract structured data from websites. Web scraping.

Programming Languages

go
31211 projects - #10 most used programming language
golang
3204 projects

Projects that are alternatives to or similar to Dataflowkit

Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Stars: ✭ 789 (+73.03%)
Mutual labels:  scraper, scraping, crawling
bots-zoo
No description or website provided.
Stars: ✭ 59 (-87.06%)
Mutual labels:  scraper, scraping, crawling
Ferret
Declarative web scraping
Stars: ✭ 4,837 (+960.75%)
Mutual labels:  scraper, scraping, crawling
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (-88.6%)
Mutual labels:  scraper, scraping, crawling
diffbot-php-client
[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library
Stars: ✭ 53 (-88.38%)
Mutual labels:  scraper, scraping, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-62.5%)
Mutual labels:  scraper, scraping, crawling
Headless Chrome Crawler
Distributed crawler powered by Headless Chrome
Stars: ✭ 5,129 (+1024.78%)
Mutual labels:  scraper, scraping, crawling
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+3306.8%)
Mutual labels:  scraper, scraping, crawling
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (-88.82%)
Mutual labels:  scraper, scraping, crawling
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (-3.51%)
Mutual labels:  scraper, scraping, crawling
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-96.71%)
Mutual labels:  scraper, scraping
whatsapp-tracking
Scraping the status of WhatsApp contacts
Stars: ✭ 49 (-89.25%)
Mutual labels:  scraper, scraping
pomp
Screen scraping and web crawling framework
Stars: ✭ 61 (-86.62%)
Mutual labels:  scraping, crawling
scrapy facebooker
Collection of scrapy spiders which can scrape posts, images, and so on from public Facebook Pages.
Stars: ✭ 22 (-95.18%)
Mutual labels:  scraper, scraping
Zeiver
A Scraper, Downloader, & Recorder for static open directories.
Stars: ✭ 14 (-96.93%)
Mutual labels:  scraper, scraping
TorScrapper
A Scraper made 100% in Python using BeautifulSoup and Tor. It can be used to scrape both normal and onion links. Happy Scraping :)
Stars: ✭ 24 (-94.74%)
Mutual labels:  scraper, scraping
scraper
Nodejs web scraper. Contains a command line, docker container, terraform module and ansible roles for distributed cloud scraping. Supported databases: SQLite, MySQL, PostgreSQL. Supported headless clients: Puppeteer, Playwright, Cheerio, JSdom.
Stars: ✭ 37 (-91.89%)
Mutual labels:  scraper, scraping
Scraper-Projects
🕸 List of mini projects that involve web scraping 🕸
Stars: ✭ 25 (-94.52%)
Mutual labels:  scraper, scraping
ARGUS
ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On the websites, ARGUS is able to perform tasks like scraping texts or collecting hyperlinks between websites. See: https://link.springer.com/article/10.1007/s11192-020-03726-9
Stars: ✭ 68 (-85.09%)
Mutual labels:  scraping, crawling
facebook-discussion-tk
A collection of tools to (semi-)automatically collect and analyze data from online discussions on Facebook groups and pages.
Stars: ✭ 33 (-92.76%)
Mutual labels:  scraper, scraping

Dataflow kit

Dataflow kit ("DFK") is a web scraping framework for Gophers. It extracts data from web pages by following the specified CSS selectors.

You can use it in many ways for data mining, data processing or archiving.

The Web Scraping Pipeline

A web scraping pipeline consists of three general components:

  • Downloading an HTML web page (Fetch service);
  • Parsing the HTML page and retrieving the data we're interested in (Parse service);
  • Encoding the parsed data to CSV, MS Excel, JSON, JSON Lines or XML format.

Fetch service

The fetch.d server downloads the content of HTML web pages. Depending on the fetcher type, page content is retrieved by either the Base fetcher or the Chrome fetcher.

The Base fetcher uses the standard Go HTTP client to fetch pages as is. It is faster than the Chrome fetcher, but it cannot render dynamic JavaScript-driven web pages.

The Chrome fetcher is intended for rendering dynamic JavaScript-based content. It sends requests to Chrome running in headless mode.

A fetched web page is passed on to the parse.d service.
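
For intuition, "fetching a page as is" amounts to a plain HTTP GET with Go's standard library. The standalone sketch below illustrates that idea only; it is not DFK's internal code, and the URL is just an example:

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// A plain GET returns the raw HTML without executing any JavaScript,
	// which is exactly the Base fetcher's limitation described above.
	resp, err := http.Get("https://books.toscrape.com/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	html, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("fetched %d bytes, status %s\n", len(html), resp.Status)
}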

Parse service

parse.d is the service that extracts data from a downloaded web page, following the rules listed in a JSON configuration file. Extracted data is returned in CSV, MS Excel, JSON or XML format.

Note: Sometimes the Parse service cannot extract data from pages retrieved by the default Base fetcher; parsing JavaScript-generated pages may return empty results. In that case the Parse service automatically falls back to the Chrome fetcher to render the same dynamic JavaScript-driven content. Have a look at https://scrape.dataflowkit.com/persons/page-0 for a sample of a JavaScript-driven web page.
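
Conceptually, selector-based extraction works like the standalone sketch below, which uses the goquery library. This is an illustration only, not parse.d's internal code; the URL and selector are placeholders taken from the sample configuration shown later in this README, so substitute ones that match your target page:

package main

import (
	"fmt"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// Fetch a page and apply a CSS selector to every matching element.
	resp, err := http.Get("https://example.com")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		panic(err)
	}
	doc.Find(".product-container a").Each(func(i int, s *goquery.Selection) {
		title := strings.TrimSpace(s.Text()) // roughly the "text" extractor plus the "trim" filter
		href, _ := s.Attr("href")            // roughly the "href" extractor
		fmt.Println(title, href)
	})
}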

Dataflow kit benefits:

  • Scraping of JavaScript-generated pages;

  • Data extraction from paginated websites;

  • Processing of infinitely scrolled pages;

  • Scraping of websites behind a login form;

  • Cookies and sessions handling;

  • Following links and processing detail pages;

  • Managing delays between requests per domain;

  • Following robots.txt directives;

  • Saving intermediate data in Diskv or MongoDB. The storage interface is flexible enough to add more storage types easily;

  • Encoding results to CSV, MS Excel, JSON (Lines) or XML formats (a small JSON Lines illustration follows this list);

  • Dataflow kit is fast. It takes about 4-6 seconds to fetch and then parse 50 pages;

  • Dataflow kit is suitable for processing quite large volumes of data. Our tests show that parsing approximately 4 million pages takes about 7 hours.
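
As a side note on output formats: JSON Lines simply writes one JSON object per line, which makes large result sets easy to stream and split. A minimal, self-contained Go illustration (the records are made-up sample data, not DFK output):

package main

import (
	"encoding/json"
	"os"
)

func main() {
	records := []map[string]interface{}{
		{"Title": "book one", "Href": "/book-1"},
		{"Title": "book two", "Href": "/book-2"},
	}
	// json.Encoder writes each value followed by a newline,
	// which is exactly the JSON Lines layout.
	enc := json.NewEncoder(os.Stdout)
	for _, r := range records {
		if err := enc.Encode(r); err != nil {
			panic(err)
		}
	}
}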

Installation

go get -u github.com/slotix/dataflowkit

Usage

Docker

  1. Install Docker and Docker Compose

  2. Start services.

cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose up

This command fetches docker images automatically and starts services.

  3. Launch parsing in a second terminal window by sending a POST request to the parse daemon. Some JSON configuration files for testing are available in the /examples folder.
curl -XPOST  127.0.0.1:8001/parse --data-binary "@$GOPATH/src/github.com/slotix/dataflowkit/examples/books.toscrape.com.json"

Here is a sample JSON configuration file:

{
	"name":"collection",
	"request":{
	   "url":"https://example.com"
	},
	"fields":[
	   {
		  "name":"Title",
		  "selector":".product-container a",
		  "extractor":{
			 "types":["text", "href"],
			 "filters":[
				"trim",
				"lowerCase"
			 ],
			 "params":{
				"includeIfEmpty":false
			 }
		  }
	   },
	   {
		  "name":"Image",
		  "selector":"#product-container img",
		  "extractor":{
			 "types":["alt","src","width","height"],
			 "filters":[
				"trim",
				"upperCase"
			 ]
		  }
	   },
	   {
		  "name":"Buyinfo",
		  "selector":".buy-info",
		  "extractor":{
			 "types":["text"],
			 "params":{
				"includeIfEmpty":false
			 }
		  }
	   }
	],
	"paginator":{
	   "selector":".next",
	   "attr":"href",
	   "maxPages":3
	},
	"format":"json",
	"fetcherType":"chrome",
	"paginateResults":false
}

Read more about scraper configuration JSON files in our GoDoc reference.

Extractors and filters are described at https://godoc.org/github.com/slotix/dataflowkit/extract
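
If you prefer to send the same request from Go instead of curl, here is a minimal sketch. It assumes parse.d is listening on 127.0.0.1:8001 as in step 3 above and that the program is run from the repository root; the "application/json" content type is an assumption, since the curl example posts the file as a raw body:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Read the scraper configuration file (see the sample config above).
	cfg, err := os.ReadFile("examples/books.toscrape.com.json")
	if err != nil {
		panic(err)
	}
	// POST it to the parse.d endpoint used by the curl example.
	resp, err := http.Post("http://127.0.0.1:8001/parse", "application/json", bytes.NewReader(cfg))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Status)
	fmt.Println(string(body)) // parsed results in the format requested by the config
}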

  4. To stop the services, press Ctrl+C and run
cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose down --remove-orphans --volumes

Dataflow kit CLI (demo image)

Click on the image to see the CLI in action.

Manual way

  1. Start the Chrome Docker container
docker run --init -it --rm -d --name chrome --shm-size=1024m -p=127.0.0.1:9222:9222 --cap-add=SYS_ADMIN \
  yukinying/chrome-headless-browser

Headless Chrome is used for fetching web pages to feed the Dataflow kit parser.
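
Optionally, you can verify that the headless Chrome container is reachable before starting the services by querying its DevTools HTTP endpoint on the port mapped above. A small Go sketch (the address simply mirrors the docker run flags):

package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// /json/version is a standard Chrome DevTools endpoint; it reports the
	// browser version and the WebSocket debugger URL.
	resp, err := http.Get("http://127.0.0.1:9222/json/version")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	info, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(info))
}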

  2. Build and run the fetch.d service
cd $GOPATH/src/github.com/slotix/dataflowkit/cmd/fetch.d && go build && ./fetch.d
  3. In a new terminal window, build and run the parse.d service
cd $GOPATH/src/github.com/slotix/dataflowkit/cmd/parse.d && go build && ./parse.d
  4. Launch parsing. See step 3 in the previous section.

Run tests

  • docker-compose -f test-docker-compose.yml up -d
  • ./test.sh
  • To stop the services, run docker-compose -f test-docker-compose.yml down

Front-End

Try the front-end at https://dataflowkit.com/dfk, a point-and-click interface to Dataflow kit services. It generates a JSON config file and sends a POST request to the DFK Parser.

Dataflow kit web scraping framework (demo image)

Click on the image to see Dataflow kit in action.

License

This is Free Software, released under the BSD 3-Clause License.

Contributing

You are welcome to contribute to our project.


Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].