
OmarElgabry / gunaydin

License: MIT
Your good mornings ☀️

Programming Languages

JavaScript, HTML, Shell

Labels

scraping

Projects that are alternatives to or similar to gunaydin

htmltab
Command-line utility to convert HTML tables into CSV files
Stars: ✭ 13 (-18.75%)
Mutual labels:  scraping
scavenger
Scrape and take screenshots of dynamic and static webpages
Stars: ✭ 14 (-12.5%)
Mutual labels:  scraping
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+225%)
Mutual labels:  scraping
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (+218.75%)
Mutual labels:  scraping
anime-scraper
[partially working] Scrape and add anime episode stream URLs to uGet (Linux) or IDM (Windows) ~ Python3
Stars: ✭ 21 (+31.25%)
Mutual labels:  scraping
torchestrator
Spin up Tor containers and then proxy HTTP requests via these Tor instances
Stars: ✭ 32 (+100%)
Mutual labels:  scraping
browser-automation-api
Browser automation API for repetitive web-based tasks, with a friendly user interface. You can use it to scrape content or do many other things like capture a screenshot, generate pdf, extract content or execute custom Puppeteer, Playwright functions.
Stars: ✭ 24 (+50%)
Mutual labels:  scraping
chesf
CHeSF is the Chrome Headless Scraping Framework, a very very alpha code to scrape javascript intensive web pages
Stars: ✭ 18 (+12.5%)
Mutual labels:  scraping
Scrapping
Mastering the art of scrapping 🎓
Stars: ✭ 24 (+50%)
Mutual labels:  scraping
ogpParser
Open Graph Protocol Parser for Node.js
Stars: ✭ 43 (+168.75%)
Mutual labels:  scraping
scrap
Scrapping Facebook with JavaScript.
Stars: ✭ 25 (+56.25%)
Mutual labels:  scraping
copycat
A PHP Scraping Class
Stars: ✭ 70 (+337.5%)
Mutual labels:  scraping
angel.co-companies-list-scraping
No description or website provided.
Stars: ✭ 54 (+237.5%)
Mutual labels:  scraping
Instagram-to-discord
Monitor instagram user account and automatically post new images to discord channel via a webhook. Working 2022!
Stars: ✭ 113 (+606.25%)
Mutual labels:  scraping
rubium
Rubium is a lightweight alternative to Selenium/Capybara/Watir if you need to perform some operations (like web scraping) using Headless Chromium and Ruby
Stars: ✭ 65 (+306.25%)
Mutual labels:  scraping
ha-multiscrape
Home Assistant custom component for scraping (html, xml or json) multiple values (from a single HTTP request) with a separate sensor/attribute for each value. Support for (login) form-submit functionality.
Stars: ✭ 103 (+543.75%)
Mutual labels:  scraping
browser-pool
A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.
Stars: ✭ 71 (+343.75%)
Mutual labels:  scraping
document-dl
Command line program to download documents from web portals.
Stars: ✭ 14 (-12.5%)
Mutual labels:  scraping
go-scrapy
Web crawling and scraping framework for Golang
Stars: ✭ 17 (+6.25%)
Mutual labels:  scraping
sg-food-ml
This script is used to scrap images from the Internet to classify 5 common noodle "mee" dishes in Singapore. Wanton Mee, Bak Chor Mee, Lor Mee, Prawn Mee and Mee Siam.
Stars: ✭ 18 (+12.5%)
Mutual labels:  scraping

Screenshot

Gunaydin!


Your good mornings! "Gunaydin" means "Good Morning" in Turkish 🇹🇷.

Every day, I wake up in the morning, go to work, and the first thing I do is go through a list of websites I keep an eye on, checking if there's anything new. Worse, sometimes I forget to check some of them, and so I miss information.

One way to automate the process is to scrape these websites from time to time and get a list of the latest links (news, products, etc.).

The downside is that I have to explicitly define how to scrape each website, that is, how to find the content in the HTML page. To do so, add a new document to the Template collection. That's all; the rest of the logic is the same for every website.


Demo

Check Demo

Features

  • Scrapes a given list of pages from time to time.
  • Uses request for static pages and nightmare for dynamic pages (see the sketch after this list).
  • Queues async jobs (scraping and saving to the database) using async.
  • Scrapes proxies, rotates between them, and randomly assigns user agents.
  • Logs and tracks events, especially jobs (how many succeeded, how many failed and why, etc.).
  • Logs errors and exceptions using Winston; logs are shipped to Loggly via winston-loggly-bulk.
  • Uses MongoDB, Mongoose, and MongoLab (mLab) for storing and querying data.
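
For illustration, here is a minimal sketch, not the project's actual code, of the two fetch paths: request for a static page, and Nightmare for a page that renders with JavaScript.

    const request = require('request');
    const Nightmare = require('nightmare');

    // Static page: a plain HTTP GET returns the final HTML.
    function fetchStatic(url, done) {
      request({ url, headers: { 'User-Agent': 'Mozilla/5.0' } }, (err, res, body) => {
        if (err) return done(err);
        done(null, body); // raw HTML, ready for extraction
      });
    }

    // Dynamic page: render it in Electron via Nightmare, then grab the DOM.
    function fetchDynamic(url, done) {
      Nightmare({ show: false })
        .goto(url)
        .evaluate(() => document.documentElement.outerHTML)
        .end()
        .then(html => done(null, html))
        .catch(done);
    }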

Installation

Running Locally

Make sure you have Node.js and npm installed.

  1. Clone or Download the repository

    $ git clone https://github.com/OmarElGabry/gunaydin.git
    $ cd gunaydin
    
  2. Install Dependencies

    $ npm install
    
  3. Start the application

    $ npm start
    

Your app should now be running on localhost:3000.

How It Works

Setup Configurations

Everything from setting up user authentication to the database is explained in chat.io; I almost copied and pasted the code.

User & Pages (Model)

Every user document contains all the information about that user, including an array of pages.

Each page is something to be scraped. A page has a list of links, a title, a URL, etc., and a reference to a template (see below). The links might be products, news, etc., depending on the page.
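
As a rough illustration, the two shapes might look like the following Mongoose schemas (field names here are assumptions, not the repository's actual code):

    const mongoose = require('mongoose');

    // Each page to be scraped: its url, the template to scrape it with,
    // and the latest links found on it.
    const pageSchema = new mongoose.Schema({
      title: String,
      url: { type: String, required: true },
      template: { type: mongoose.Schema.Types.ObjectId, ref: 'Template' },
      links: [{ title: String, url: String }]
    });

    // Every user embeds an array of pages.
    const userSchema = new mongoose.Schema({
      username: String,
      pages: [pageSchema]
    });

    module.exports = mongoose.model('User', userSchema);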

Template (Model)

A template represents webpages that share the same layout. Pages with the same layout can be grouped under a single 'Template', which defines one specific way to scrape them.
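
A hypothetical Template document could, for example, record which scraper to use and the CSS selectors that locate the content (all field names below are assumptions):

    const mongoose = require('mongoose');

    const templateSchema = new mongoose.Schema({
      name: String,        // the layout this template covers
      dynamic: Boolean,    // whether the page needs a headless browser
      selectors: {         // how to find the content in the HTML page
        item: String,      // matches each listing on the page
        title: String,     // the link text, relative to an item
        url: String        // the link href, relative to an item
      }
    });

    module.exports = mongoose.model('Template', templateSchema);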

Thinking of a template? Open an issue, and I'll be happy to add it to the list.

Shards (aka Cycles)

Users are split into logical shards. Every time interval, say one hour, the application picks a shard, scrapes all the users' pages in that shard, and then updates their listings in the database.
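
A minimal sketch of that cycle, assuming a stable hash assigns each user to one of N shards and a timer walks through the shards (the User and queue module paths are hypothetical):

    const User = require('./models/user');
    const queue = require('./services/queue');

    const NUM_SHARDS = 24;              // assumed shard count
    const INTERVAL_MS = 60 * 60 * 1000; // one shard per hour

    // Stable hash of a user id into [0, NUM_SHARDS).
    function shardOf(userId) {
      let h = 0;
      for (const ch of String(userId)) h = (h * 31 + ch.charCodeAt(0)) % NUM_SHARDS;
      return h;
    }

    let current = 0;
    setInterval(() => {
      const shard = current;
      current = (current + 1) % NUM_SHARDS;
      // Enqueue every page of every user in the current shard.
      // (Fetching all users is fine for a sketch; a real query would filter by shard.)
      User.find().then(users => {
        users.filter(u => shardOf(u.id) === shard)
             .forEach(u => u.pages.forEach(p => queue.push(p)));
      });
    }, INTERVAL_MS);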

Queue (Service)

A queue is a list of async jobs to be processed by workers. The jobs might be scraping or saving to the database; accordingly, the workers might be scrapers or database workers.

A queue limits the maximum number of simultaneous operations and handles failed jobs by re-pushing them to the queue (up to a maximum of, say, 3 times).

There is a generic Queue class; a queue factory instantiates different queues with different workers and maximum concurrent jobs.
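
A minimal sketch of such a queue on top of the async library (the retry counter stored on the job object is an assumption):

    const async = require('async');

    // Build a queue with a worker, bounded concurrency, and up to 3 retries.
    function makeQueue(worker, concurrency, maxRetries = 3) {
      const q = async.queue(worker, concurrency);
      function push(job) {
        q.push(job, err => {
          if (!err) return;
          job.retries = (job.retries || 0) + 1;
          if (job.retries < maxRetries) push(job); // re-push the failed job
          // otherwise give up; the stats service records the failure
        });
      }
      return { push };
    }

    // e.g. the factory might build different queues per worker type:
    // const scrapeQueue = makeQueue(scraperWorker, 5);
    // const dbQueue = makeQueue(dbWorker, 10);

Keeping the retry bookkeeping inside push() keeps the queue generic; only the worker and the concurrency differ between the scraper and database queues.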

Scrapers (Service)

There are three scrapers: static, dynamic, and a dedicated one for proxies (also dynamic). All scrapers inherit from the generic Scraper class, which provides useful methods to extract data, rotate proxies, randomly assign user agents, and so on.

All scrapers are also workers and inherit from the Worker interface.
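
A sketch of that hierarchy (class and method names are assumptions, and cheerio stands in for whatever extraction the project actually uses):

    const request = require('request');
    const cheerio = require('cheerio'); // assumed here for extraction

    // Generic base class: helpers shared by all scrapers.
    class Scraper {
      constructor(proxies = [], userAgents = ['Mozilla/5.0']) {
        this.proxies = proxies;
        this.userAgents = userAgents;
        this.next = 0;
      }
      rotateProxy() { // round-robin over the proxy pool
        return this.proxies.length
          ? this.proxies[this.next++ % this.proxies.length]
          : undefined;
      }
      randomUserAgent() {
        return this.userAgents[Math.floor(Math.random() * this.userAgents.length)];
      }
      extract(html, selectors) { // pull links out using a template's selectors
        const $ = cheerio.load(html);
        return $(selectors.item)
          .map((i, el) => ({ title: $(el).text().trim(), url: $(el).attr('href') }))
          .get();
      }
    }

    // Worker interface: process(job, done). A static scraper is one such worker.
    class StaticScraper extends Scraper {
      process(page, done) {
        request({
          url: page.url,
          proxy: this.rotateProxy(),
          headers: { 'User-Agent': this.randomUserAgent() }
        }, (err, res, body) => {
          if (err) return done(err);
          done(null, this.extract(body, page.template.selectors));
        });
      }
    }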

Stats (Service)

It keeps track of all events, especially jobs. It then persists them to the database every few hours.
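
A sketch of what that could look like: in-memory counters flushed to the database on a timer (names and the interval are assumptions):

    // Hypothetical in-memory stats tracker, flushed to the database periodically.
    class Stats {
      constructor() {
        this.counters = {}; // e.g. { 'job:scrape:succeeded': 12, 'job:scrape:failed': 1 }
      }
      track(event) {
        this.counters[event] = (this.counters[event] || 0) + 1;
      }
      flush(StatsModel) { // StatsModel: a hypothetical Mongoose model
        const snapshot = { at: new Date(), counters: this.counters };
        this.counters = {}; // start counting fresh after each flush
        return StatsModel.create(snapshot);
      }
    }

    const stats = new Stats();
    // stats.track('job:scrape:failed');
    // setInterval(() => stats.flush(StatsModel), 6 * 60 * 60 * 1000);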

Support

I wrote this script in my free time. If you find it useful, please support the project by spreading the word.

Contribute

Contribute by creating new issues or sending pull requests on GitHub, or send an email to: [email protected]

License

Built under the MIT license.
