paambaati / websight

License: WTFPL
🕷 A simple but *really* fast crawler built with Node.js & TypeScript

Programming Languages

TypeScript
Dockerfile

Projects that are alternatives of or similar to websight

Cs Fundamentals
🎓 Data structures and algorithms
Stars: ✭ 869 (+5693.33%)
Mutual labels:  interview-questions, coding-challenge
123 Essential Javascript Interview Questions
JavaScript Interview Questions
Stars: ✭ 3,634 (+24126.67%)
Mutual labels:  interview-questions, coding-challenge
Coding-Interview-Challenges
This is a repo where I upload code for important interview questions written in Python, C++, and Swift
Stars: ✭ 13 (-13.33%)
Mutual labels:  interview-questions, coding-challenge
pw
Best websites a Programmer should visit
Stars: ✭ 27 (+80%)
Mutual labels:  interview-questions, coding-challenge
Codility lessons
Codility Lesson1~Lesson17 100% solutions with Python3; besides the correct answers, the comments walk through the thought process behind each solution
Stars: ✭ 87 (+480%)
Mutual labels:  interview-questions, coding-challenge
Javascript Interview Questions Developer
A list of JavaScript interview questions 📝
Stars: ✭ 62 (+313.33%)
Mutual labels:  interview-questions, coding-challenge
Codinginterviews
This repository contains coding interview questions that I have encountered in company interviews
Stars: ✭ 2,881 (+19106.67%)
Mutual labels:  interview-questions, coding-challenge
Technical Interview Guide
My learning material for technical interviews!
Stars: ✭ 76 (+406.67%)
Mutual labels:  interview-questions, coding-challenge
Interviews
Everything you need to know to get the job.
Stars: ✭ 54,875 (+365733.33%)
Mutual labels:  interview-questions, coding-challenge
Algorithmic-Problem-Solving
Solutions to algorithmic programming problems from sites like LeetCode.com, HackerRank.com, Codility.com, CodeForces.com, etc., written in Java.
Stars: ✭ 20 (+33.33%)
Mutual labels:  interview-questions, coding-challenge
front-end-interview-guide
Front-end interview handbook covering JS, HTML, CSS, algorithms and data structures, computer systems, computer networks, browsers, performance optimization, front-end engineering, databases, front-end frameworks, mini programs, design patterns, and data visualization
Stars: ✭ 42 (+180%)
Mutual labels:  interview-questions
android-interview-questions
I'm contributing to help others!
Stars: ✭ 24 (+60%)
Mutual labels:  interview-questions
AlgoDaily
just for fun
Stars: ✭ 118 (+686.67%)
Mutual labels:  interview-questions
studyNotes
Study notes on various documents, plus small projects
Stars: ✭ 46 (+206.67%)
Mutual labels:  interview-questions
Frontend-Developer-Interview-Preparation
Things you need to know to crack that frontend developer job [Work in Progress]
Stars: ✭ 113 (+653.33%)
Mutual labels:  interview-questions
epi
😬 Hours upon hours upon hours of awful interview prep
Stars: ✭ 16 (+6.67%)
Mutual labels:  interview-questions
LeetCode
✍️ My LeetCode solutions, ideas, and solution templates, organized by topic (see tags).
Stars: ✭ 123 (+720%)
Mutual labels:  interview-questions
CodingInterview
Solutions to LeetCode and CareerCup coding problems
Stars: ✭ 64 (+326.67%)
Mutual labels:  interview-questions
CodeSignal-Solutions
CodeSignal solutions
Stars: ✭ 112 (+646.67%)
Mutual labels:  interview-questions
iOS-Interview
📚 Comprehensive list of questions and problems to pass an interview for the iOS Developer position
Stars: ✭ 127 (+746.67%)
Mutual labels:  interview-questions

websight

[Badges: CI Status · Test Coverage · Maintainability · WTFPL License]

A simple crawler that fetches all pages in a given website and prints the links between them.

[Screenshot]

📣 Note that this project was purpose-built for a coding challenge (see problem statement) and is not meant for production use (unless you aren't web scale yet).

🛠️ Setup

Before you run this app, make sure you have Node.js installed. yarn is recommended, but can be used interchangeably with npm. If you'd prefer running everything inside a Docker container, see the Docker setup section.

git clone https://github.com/paambaati/websight
cd websight
yarn install && yarn build

👩🏻‍💻 Usage

yarn start <website>

🧪 Tests & Coverage

yarn run coverage

🐳 Docker Setup

docker build -t websight .
docker run -ti websight <website>

📦 Executable Binary

yarn bundle && yarn binary

This produces standalone executable binaries for both Linux and macOS.

🧩 Design

                                            +---------------------+                        
                                            |   Link Extractor    |                        
                                            | +-----------------+ |                        
                                            | |                 | |                        
                                            | |   URL Resolver  | |                        
                                            | |                 | |                        
                                            | +-----------------+ |                        
                    +-----------------+     | +-----------------+ |     +-----------------+
                    |                 |     | |                 | |     |                 |
                    |     Crawler     +---->+ |     Fetcher     | +---->+     Sitemap     |
                    |                 |     | |                 | |     |                 |
                    +-----------------+     | +-----------------+ |     +-----------------+
                                            | +-----------------+ |                        
                                            | |                 | |                        
                                            | |     Parser      | |                        
                                            | |                 | |                        
                                            | +-----------------+ |                        
                                            +---------------------+                        

The Crawler class runs a fast, non-deterministic fetch of all pages (via LinkExtractor), recursively follows the URLs found in them, and saves them in Sitemap. When crawling is complete[1], the sitemap is printed as an ASCII tree.

The LinkExtractor class is a thin orchestrating wrapper around 3 core components (a combined sketch follows this list):

  1. URLResolver includes logic for resolving relative URLs and normalizing them. It also includes utility methods for filtering out external URLs.
  2. Fetcher takes a URL, fetches it, and returns the response as a Stream. Streams can be read in small buffered chunks, which avoids holding very large HTML documents in memory.
  3. Parser parses the HTML stream (returned by Fetcher) in chunks and emits the link event on each page URL and the asset event on each static asset found in the HTML.
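
A minimal sketch of how these three pieces might fit together. The class names follow the prose above, but every signature here is an assumption, not the project's actual API:

```ts
import { EventEmitter } from 'node:events';
import type { Readable } from 'node:stream';

// Assumed interfaces for the components described above.
interface Fetcher {
  fetch(url: string): Promise<Readable>; // resolves to the response body stream
}
interface URLResolver {
  resolve(base: string, href: string): string | null; // relative -> absolute, normalized
  isExternal(url: string): boolean;
}

// Parser consumes the HTML stream in chunks and emits events as URLs appear.
class Parser extends EventEmitter {
  async parse(html: Readable): Promise<void> {
    for await (const chunk of html) {
      void chunk; // feed the chunk to an HTML tokenizer here, and emit
      // this.emit('link', url) or this.emit('asset', url) per URL found
    }
  }
}

// LinkExtractor wires the components together for a single page.
class LinkExtractor {
  constructor(
    private readonly fetcher: Fetcher,
    private readonly resolver: URLResolver,
  ) {}

  async extract(pageUrl: string): Promise<string[]> {
    const parser = new Parser();
    const links: string[] = [];
    parser.on('link', (href: string) => {
      const url = this.resolver.resolve(pageUrl, href);
      if (url && !this.resolver.isExternal(url)) links.push(url);
    });
    await parser.parse(await this.fetcher.fetch(pageUrl));
    return links;
  }
}
```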

1 Crawler.crawl() is an async function that never resolves, because it is technically impossible to detect when we've finished crawling. In most runtimes, we'd have to implement some kind of idle polling to detect completion; in Node.js, however, the main process runs to completion as soon as the event loop has no more tasks to execute. This is why the sitemap is finally printed in the process's beforeExit event.
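
A sketch of that completion trick, with stand-ins (not the project's actual classes) for the crawler and sitemap:

```ts
// Stand-in declarations; the real Crawler and Sitemap live in src/.
declare const crawler: { crawl(url: string): Promise<never> };
declare const sitemap: { print(): void };

void crawler.crawl('https://example.com'); // never resolves, so it isn't awaited

// Node.js fires 'beforeExit' once the event loop has no more work queued,
// i.e. every in-flight fetch and parse has finished.
process.once('beforeExit', () => sitemap.print());
```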

🏎 Optimizations

  1. Streams all the way down.

    The key workloads in this system are HTTP fetches (I/O-bound) and HTML parses (CPU-bound), and either can be time-consuming and/or memory-hungry. To better parallelize the crawls and use as little memory as possible, the got library's streaming API is paired with the very fast htmlparser2.

  2. Keep-Alive connections.

    The Fetcher class uses a global keepAlive agent to reuse sockets, since we're only crawling a single domain. This avoids re-establishing a TCP connection for each request. Both optimizations are sketched below.
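
A minimal sketch of both optimizations together, under stated assumptions: the option shape matches got v11+'s `agent: { https }` form (older versions differ), and only `<a href>` extraction is shown:

```ts
import got from 'got';
import { Agent as HttpsAgent } from 'node:https';
import { Parser } from 'htmlparser2';

// One shared keep-alive agent: sockets to the (single) target host are reused.
const keepAliveAgent = new HttpsAgent({ keepAlive: true });

function crawlPage(url: string, onLink: (href: string) => void): Promise<void> {
  return new Promise((resolve, reject) => {
    // htmlparser2 is a streaming parser: it can be fed chunk by chunk.
    const parser = new Parser({
      onopentag(name, attribs) {
        if (name === 'a' && attribs.href) onLink(attribs.href);
      },
    });
    got.stream(url, { agent: { https: keepAliveAgent } })
      .on('data', (chunk: Buffer) => parser.write(chunk.toString())) // parse while downloading
      .on('end', () => { parser.end(); resolve(); })
      .on('error', reject);
  });
}
```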

⚡️ Limitations

When ramping up for scale, this design exposes a few of its limitations:

  1. No rate-limiting.

    Most modern and large websites have some sort of throttling set up to block bots. A production-grade crawler should implement a politeness policy to make sure it doesn't inadvertently bring down a website, and doesn't run into permanent bans and 429 error responses. A minimal politeness limiter is sketched after this list.

  2. In-memory state management.

    Sitemap().sitemap is an unbounded Map, and can quickly grow and possibly cause the runtime to run out of memory and crash when crawling very large websites. In a production-grade crawler, there should be an external scheduler (a URL frontier) that holds the URLs to crawl next.
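
A minimal sketch of the politeness policy mentioned in item 1, assuming a fixed per-host delay; a real crawler would also honor robots.txt and back off on 429/503 responses:

```ts
// Spaces requests to the target host at least `delayMs` apart, even when
// many fetches are scheduled concurrently.
class PoliteScheduler {
  private lastScheduled = 0;

  constructor(private readonly delayMs = 500) {}

  async wait(): Promise<void> {
    const now = Date.now();
    const wakeAt = Math.max(now, this.lastScheduled + this.delayMs);
    this.lastScheduled = wakeAt;
    await new Promise((resolve) => setTimeout(resolve, wakeAt - now));
  }
}

// Usage: `await scheduler.wait()` before every fetch.
```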
