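A pip-installable Tianyancha scraper API that saves the business registration information of one or more specified companies to Excel/JSON in one step. A batteries-included scraper API for Tianyancha, the best Chinese business data and investigation platform.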

Stars: ✭ 206 (+249.15%)

Mutual labels: crawler, scraper, selenium

Dotnetcrawler

DotnetCrawler is a straightforward, lightweight web crawling/scraping library for Entity Framework Core output based on dotnet core. This library is designed like other strong crawler libraries such as WebMagic and Scrapy, but is extensible for your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c

Stars: ✭ 100 (+69.49%)

Mutual labels: crawler, scraping, crawling

Antch

Antch, a fast, powerful and extensible web crawling & scraping framework for Go

Stars: ✭ 198 (+235.59%)

Mutual labels: crawler, scraping, crawling

papercut

Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.

Stars: ✭ 15 (-74.58%)

Mutual labels: crawler, scraper, scraping

Newspaper

News, full-text, and article metadata extraction in Python 3. Advanced docs:

Stars: ✭ 11,545 (+19467.8%)

Mutual labels: crawler, scraper, crawling

Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

Stars: ✭ 125 (+111.86%)

Mutual labels: crawler, crawling, puppeteer

Scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.

Stars: ✭ 42,343 (+71667.8%)

Mutual labels: crawler, scraping, crawling

diffbot-php-client

[Deprecated - Maintenance mode - use APIs directly please!] The official Diffbot client library

Stars: ✭ 53 (-10.17%)

Mutual labels: scraper, scraping, crawling

double-agent

A test suite of common scraper detection techniques. See how detectable your scraper stack is.

Stars: ✭ 123 (+108.47%)

Mutual labels: scraping, crawling, puppeteer

browser-automation-api

Browser automation API for repetitive web-based tasks, with a friendly user interface. You can use it to scrape content or do many other things like capture a screenshot, generate pdf, extract content or execute custom Puppeteer, Playwright functions.

Stars: ✭ 24 (-59.32%)

Mutual labels: scraping, puppeteer, playwright

browser-pool

A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.

Stars: ✭ 71 (+20.34%)

Mutual labels: scraping, puppeteer, playwright

Bots Zoo

For my different projects, I often have to launch bots using different kinds of browsers (Firefox, Chrome, Headless/not headless) using different automation frameworks (Puppeteer, Selenium, Playwright) in several programming languages.

Since I'm juggling between different frameworks/languages, sometimes it's difficult to remember/find how to set up a particular kind of bot, or how to execute basic commands.

That's why I've decided to centralize examples of simple bots in this repository. I hope it will also benefit other people.

For the moment I only have examples for the following bots:

Playwright (NodeJS): Chromium, Webkit (Safari), Firefox
Playwright extra stealth (NodeJS): Chromium (will be updated when it becomes stable)
Puppeteer (NodeJS): Chromium, Firefox, Android (emulation), iPhone (emulation)
Puppeteer extra stealth (NodeJS): Chromium
Pyppeteer stealth (Python): Chromium
Selenium (NodeJS): Chromium, Firefox
Selenium stealth (Python): Chrome
Undetected Chromedriver (Python): Chrome
Ferrum (Ruby): Chrome
Watir (Ruby): Chrome, Safari (MacOS)
Simple HTTP module/library (NodeJS + Cheerio): Sequential, Parallel, Sequential using Nord VPN, HTTP proxies
Simple HTTP module/library (Python requests/aiohttp + Beautifulsoup): Sequential, Parallel (x2 implementations)
Simple HTTP module/library (Golang standard library + goquery): Sequential, Parallel
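To illustrate the difference between the "Sequential" and "Parallel" simple HTTP bots listed above, here is a minimal sketch. It is not taken from the repository: it uses only the Python standard library (urllib, html.parser and a thread pool) instead of requests/aiohttp and Beautifulsoup, and the names `http_get`, `extract_title`, `crawl_sequential` and `crawl_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def http_get(url):
    """Fetch a page body with urllib (hypothetical helper, no retries)."""
    request = Request(url, headers={"User-Agent": "bots-zoo-sketch/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_title(html):
    """Return the stripped text of the page's first <title>, or ''."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def crawl_sequential(urls, fetch=http_get):
    """Fetch and parse the pages one after another."""
    return [extract_title(fetch(url)) for url in urls]


def crawl_parallel(urls, fetch=http_get, workers=4):
    """Fetch the pages concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [extract_title(html) for html in pool.map(fetch, list(urls))]
```

Both functions accept a `fetch` callable, so the network call can be swapped for a proxy-aware or stubbed fetcher without touching the crawl logic.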

I will continue to add other examples, such as Playwright Firefox/WebKit and Selenium Firefox, both in NodeJS and in other programming languages like Python. I will also provide examples for bot frameworks that provide mechanisms against bot detection solutions.

The headers directory contains data related to HTTP headers. For the moment, it contains:

A list of ~16K user-agents;
Accept headers for the main browsers;
Accept-Encoding headers for the main browsers;
Header names for the main browsers;
Fetch metadata request headers.
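As a rough sketch of how such header data might be consumed, the snippet below builds a request header dictionary from a list of user-agents. The file format is an assumption (one value per line, which the text above does not confirm), and `load_user_agents` and `build_headers` are hypothetical helper names.

```python
import random


def load_user_agents(lines):
    """Return the non-empty lines, stripped (assumes one user-agent per line)."""
    return [line.strip() for line in lines if line.strip()]


def build_headers(user_agents, accept, accept_encoding, rng=None):
    """Pick one user-agent at random and pair it with the given Accept headers."""
    rng = rng or random.Random()
    return {
        "User-Agent": rng.choice(user_agents),
        "Accept": accept,
        "Accept-Encoding": accept_encoding,
    }
```

`load_user_agents` takes any iterable of lines, so an open file object can be passed directly. Passing a seeded `random.Random` as `rng` makes the choice reproducible.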

The browser_apis directory contains data related to JS APIs sometimes used to identify a browser:

language.txt values of navigator.language;
languages.txt values of navigator.languages;
mimeTypes.txt values of navigator.mimeTypes;
oscpus.txt values of navigator.oscpu;
platforms.txt values of navigator.platform;
plugins.txt values of navigator.plugins;
webGLrenderers.txt values of gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);
webGLvendors.txt values of gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL).
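One possible use of these value lists is to check whether the values reported by a browser have been seen before. The sketch below assumes each file holds one value per line, which is not confirmed above; `load_known_values` and `unknown_fields` are hypothetical names.

```python
def load_known_values(lines):
    """Return the set of non-empty, stripped lines (assumes one value per line)."""
    return {line.strip() for line in lines if line.strip()}


def unknown_fields(observed, known):
    """Return the sorted names of fields whose observed value is not in the known set.

    Fields that have no known-value list are ignored.
    """
    return sorted(
        name
        for name, value in observed.items()
        if name in known and value not in known[name]
    )
```

Here `observed` maps a field name such as "platform" to the value collected from the browser, and `known` maps the same field names to the sets loaded from the corresponding files.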

DON'T contact me if you want to ask me how to make your bot(s) undetectable. You can find articles on my website, but I won't provide more details since I'm working for a bot detection company.

antoinevastel / bots-zoo
