
OmarElgabry / gunaydin

License: MIT
Your good mornings ☀️

Programming Languages

JavaScript, HTML, Shell

Labels

scraping

Projects that are alternatives to or similar to gunaydin

htmltab
Command-line utility to convert HTML tables into CSV files
Stars: ✭ 13 (-18.75%)
Mutual labels:  scraping
scavenger
Scrape and take screenshots of dynamic and static webpages
Stars: ✭ 14 (-12.5%)
Mutual labels:  scraping
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (+225%)
Mutual labels:  scraping
proxycrawl-python
ProxyCrawl Python library for scraping and crawling
Stars: ✭ 51 (+218.75%)
Mutual labels:  scraping
anime-scraper
[partially working] Scrape and add anime episode stream URLs to uGet (Linux) or IDM (Windows) ~ Python3
Stars: ✭ 21 (+31.25%)
Mutual labels:  scraping
torchestrator
Spin up Tor containers and then proxy HTTP requests via these Tor instances
Stars: ✭ 32 (+100%)
Mutual labels:  scraping
browser-automation-api
Browser automation API for repetitive web-based tasks, with a friendly user interface. You can use it to scrape content or do many other things like capture a screenshot, generate pdf, extract content or execute custom Puppeteer, Playwright functions.
Stars: ✭ 24 (+50%)
Mutual labels:  scraping
chesf
CHeSF is the Chrome Headless Scraping Framework, a very very alpha code to scrape javascript intensive web pages
Stars: ✭ 18 (+12.5%)
Mutual labels:  scraping
Scrapping
Mastering the art of scrapping 🎓
Stars: ✭ 24 (+50%)
Mutual labels:  scraping
ogpParser
Open Graph Protocol Parser for Node.js
Stars: ✭ 43 (+168.75%)
Mutual labels:  scraping
scrap
Scrapping Facebook with JavaScript.
Stars: ✭ 25 (+56.25%)
Mutual labels:  scraping
copycat
A PHP Scraping Class
Stars: ✭ 70 (+337.5%)
Mutual labels:  scraping
angel.co-companies-list-scraping
No description or website provided.
Stars: ✭ 54 (+237.5%)
Mutual labels:  scraping
Instagram-to-discord
Monitor instagram user account and automatically post new images to discord channel via a webhook. Working 2022!
Stars: ✭ 113 (+606.25%)
Mutual labels:  scraping
rubium
Rubium is a lightweight alternative to Selenium/Capybara/Watir if you need to perform some operations (like web scraping) using Headless Chromium and Ruby
Stars: ✭ 65 (+306.25%)
Mutual labels:  scraping
ha-multiscrape
Home Assistant custom component for scraping (html, xml or json) multiple values (from a single HTTP request) with a separate sensor/attribute for each value. Support for (login) form-submit functionality.
Stars: ✭ 103 (+543.75%)
Mutual labels:  scraping
browser-pool
A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.
Stars: ✭ 71 (+343.75%)
Mutual labels:  scraping
document-dl
Command line program to download documents from web portals.
Stars: ✭ 14 (-12.5%)
Mutual labels:  scraping
go-scrapy
Web crawling and scraping framework for Golang
Stars: ✭ 17 (+6.25%)
Mutual labels:  scraping
sg-food-ml
This script is used to scrap images from the Internet to classify 5 common noodle "mee" dishes in Singapore. Wanton Mee, Bak Chor Mee, Lor Mee, Prawn Mee and Mee Siam.
Stars: ✭ 18 (+12.5%)
Mutual labels:  scraping

Screenshot

Gunaydin!


Your good mornings! "Gunaydin" means "Good Morning" in Turkish 🇹🇷.

Every day, I wake up in the morning, go to work, and the first thing I do is go through a list of websites I keep an eye on, checking if there's anything new. Worse, sometimes I forget to check some of them, and so I miss information.

One way to automate the process is to scrape these websites from time to time and get a list of the latest links (news, products, etc.).

The downside is that I have to explicitly define how to scrape each website, that is, how to find the content in the HTML page. To do so, add a new document to the Template collection. That's all; the rest of the logic is the same for every website.


Demo

Check Demo

Features

  • Scrapes a given list of pages from time to time.
  • Uses request for static pages and nightmare for dynamic pages (see the sketch after this list).
  • Queues async jobs (scraping and saving to the database) using async.
  • Scrapes proxies, rotates between them, and randomly assigns user agents.
  • Logs and tracks events, especially jobs (how many succeeded, how many failed and why, etc.).
  • Logs errors and exceptions using Winston; logs are shipped to Loggly via winston-loggly-bulk.
  • Uses MongoDB, Mongoose, and MongoLab (mLab) for storing and querying data.
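
For illustration, here is a minimal sketch, not the project's actual code, of the two fetch paths: request for a static page, and Nightmare for a page that renders with JavaScript.

    const request = require('request');
    const Nightmare = require('nightmare');

    // Static page: a plain HTTP GET returns the final HTML.
    function fetchStatic(url, done) {
      request({ url, headers: { 'User-Agent': 'Mozilla/5.0' } }, (err, res, body) => {
        if (err) return done(err);
        done(null, body); // raw HTML, ready for extraction
      });
    }

    // Dynamic page: render it in Electron via Nightmare, then grab the DOM.
    function fetchDynamic(url, done) {
      Nightmare({ show: false })
        .goto(url)
        .evaluate(() => document.documentElement.outerHTML)
        .end()
        .then(html => done(null, html))
        .catch(done);
    }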

Installation

Running Locally

Make sure you have Node.js and npm installed.

  1. Clone or Download the repository

    $ git clone https://github.com/OmarElGabry/gunaydin.git
    $ cd gunaydin
    
  2. Install Dependencies

    $ npm install
    
  3. Start the application

    $ npm start
    

Your app should now be running on localhost:3000.

How It Works

Setup Configurations

Everything from setting up user authentication to the database is explained in chat.io; I almost copied and pasted the code.

User & Pages (Model)

Every user document contains all the information about that user, including an array of pages.

Each page is something to be scraped. A page has a list of links, a title, a URL, etc., and a reference to a template (see below). The links might be products, news, etc., depending on the page.
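
As a rough illustration, the two shapes might look like the following Mongoose schemas (field names here are assumptions, not the repository's actual code):

    const mongoose = require('mongoose');

    // Each page to be scraped: its url, the template to scrape it with,
    // and the latest links found on it.
    const pageSchema = new mongoose.Schema({
      title: String,
      url: { type: String, required: true },
      template: { type: mongoose.Schema.Types.ObjectId, ref: 'Template' },
      links: [{ title: String, url: String }]
    });

    // Every user embeds an array of pages.
    const userSchema = new mongoose.Schema({
      username: String,
      pages: [pageSchema]
    });

    module.exports = mongoose.model('User', userSchema);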

Template (Model)

A template represents webpages that share the same layout. Pages with the same layout can be grouped under a single 'Template', which defines one specific way to scrape them.
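
A hypothetical Template document could, for example, record which scraper to use and the CSS selectors that locate the content (all field names below are assumptions):

    const mongoose = require('mongoose');

    const templateSchema = new mongoose.Schema({
      name: String,        // the layout this template covers
      dynamic: Boolean,    // whether the page needs a headless browser
      selectors: {         // how to find the content in the HTML page
        item: String,      // matches each listing on the page
        title: String,     // the link text, relative to an item
        url: String        // the link href, relative to an item
      }
    });

    module.exports = mongoose.model('Template', templateSchema);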

Thinking of a template? Open an issue, and I'll be happy to add it to the list.

Shards (aka Cycles)

Users are split into logical shards. Every time interval, say one hour, the application picks a shard, scrapes all the users' pages in that shard, and then updates their listings in the database.
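
A minimal sketch of that cycle, assuming a stable hash assigns each user to one of N shards and a timer walks through the shards (the User and queue module paths are hypothetical):

    const User = require('./models/user');
    const queue = require('./services/queue');

    const NUM_SHARDS = 24;              // assumed shard count
    const INTERVAL_MS = 60 * 60 * 1000; // one shard per hour

    // Stable hash of a user id into [0, NUM_SHARDS).
    function shardOf(userId) {
      let h = 0;
      for (const ch of String(userId)) h = (h * 31 + ch.charCodeAt(0)) % NUM_SHARDS;
      return h;
    }

    let current = 0;
    setInterval(() => {
      const shard = current;
      current = (current + 1) % NUM_SHARDS;
      // Enqueue every page of every user in the current shard.
      // (Fetching all users is fine for a sketch; a real query would filter by shard.)
      User.find().then(users => {
        users.filter(u => shardOf(u.id) === shard)
             .forEach(u => u.pages.forEach(p => queue.push(p)));
      });
    }, INTERVAL_MS);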

Queue (Service)

A queue is a list of async jobs to be processed by workers. The jobs might be scraping or saving to the database; accordingly, the workers might be scrapers or database workers.

A queue limits the maximum number of simultaneous operations and handles failed jobs by re-pushing them to the queue (up to a maximum of, say, 3 times).

There is a generic Queue class; a queue factory instantiates different queues with different workers and maximum concurrent jobs.
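
A minimal sketch of such a queue on top of the async library (the retry counter stored on the job object is an assumption):

    const async = require('async');

    // Build a queue with a worker, bounded concurrency, and up to 3 retries.
    function makeQueue(worker, concurrency, maxRetries = 3) {
      const q = async.queue(worker, concurrency);
      function push(job) {
        q.push(job, err => {
          if (!err) return;
          job.retries = (job.retries || 0) + 1;
          if (job.retries < maxRetries) push(job); // re-push the failed job
          // otherwise give up; the stats service records the failure
        });
      }
      return { push };
    }

    // e.g. the factory might build different queues per worker type:
    // const scrapeQueue = makeQueue(scraperWorker, 5);
    // const dbQueue = makeQueue(dbWorker, 10);

Keeping the retry bookkeeping inside push() keeps the queue generic; only the worker and the concurrency differ between the scraper and database queues.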

Scrapers (Service)

There are three scrapers: static, dynamic, and a dedicated one for proxies (also dynamic). All scrapers inherit from the generic Scraper class, which provides useful methods to extract data, rotate proxies, randomly assign user agents, and so on.

All scrapers are also workers and inherit from the Worker interface.
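
A sketch of that hierarchy (class and method names are assumptions, and cheerio stands in for whatever extraction the project actually uses):

    const request = require('request');
    const cheerio = require('cheerio'); // assumed here for extraction

    // Generic base class: helpers shared by all scrapers.
    class Scraper {
      constructor(proxies = [], userAgents = ['Mozilla/5.0']) {
        this.proxies = proxies;
        this.userAgents = userAgents;
        this.next = 0;
      }
      rotateProxy() { // round-robin over the proxy pool
        return this.proxies.length
          ? this.proxies[this.next++ % this.proxies.length]
          : undefined;
      }
      randomUserAgent() {
        return this.userAgents[Math.floor(Math.random() * this.userAgents.length)];
      }
      extract(html, selectors) { // pull links out using a template's selectors
        const $ = cheerio.load(html);
        return $(selectors.item)
          .map((i, el) => ({ title: $(el).text().trim(), url: $(el).attr('href') }))
          .get();
      }
    }

    // Worker interface: process(job, done). A static scraper is one such worker.
    class StaticScraper extends Scraper {
      process(page, done) {
        request({
          url: page.url,
          proxy: this.rotateProxy(),
          headers: { 'User-Agent': this.randomUserAgent() }
        }, (err, res, body) => {
          if (err) return done(err);
          done(null, this.extract(body, page.template.selectors));
        });
      }
    }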

Stats (Service)

It keeps track of all events, especially jobs. It then persists them to the database every few hours.
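
A sketch of what that could look like: in-memory counters flushed to the database on a timer (names and the interval are assumptions):

    // Hypothetical in-memory stats tracker, flushed to the database periodically.
    class Stats {
      constructor() {
        this.counters = {}; // e.g. { 'job:scrape:succeeded': 12, 'job:scrape:failed': 1 }
      }
      track(event) {
        this.counters[event] = (this.counters[event] || 0) + 1;
      }
      flush(StatsModel) { // StatsModel: a hypothetical Mongoose model
        const snapshot = { at: new Date(), counters: this.counters };
        this.counters = {}; // start counting fresh after each flush
        return StatsModel.create(snapshot);
      }
    }

    const stats = new Stats();
    // stats.track('job:scrape:failed');
    // setInterval(() => stats.flush(StatsModel), 6 * 60 * 60 * 1000);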

Support

I wrote this script in my free time. If you find it useful, please support the project by spreading the word.

Contribute

Contribute by creating new issues or sending pull requests on GitHub, or send an email to: [email protected]

License

Built under the MIT license.
