antoinevastel / bots-zoo

Licence: MIT License

Bots Zoo

For my various projects, I often have to launch bots that use different browsers (Firefox, Chrome, headless or not), different automation frameworks (Puppeteer, Selenium, Playwright), and several programming languages.

Since I'm juggling different frameworks and languages, it is sometimes difficult to remember or find how to set up a particular kind of bot, or how to execute basic commands.

That's why I've decided to centralize examples of simple bots in this repository. I hope it will also benefit other people.

For the moment, I only have examples for the following bots:

  • Playwright (NodeJS): Chromium, WebKit (Safari), Firefox
  • Playwright extra stealth (NodeJS): Chromium (will be updated when it becomes stable)
  • Puppeteer (NodeJS): Chromium, Firefox, Android (emulation), iPhone (emulation)
  • Puppeteer extra stealth (NodeJS): Chromium
  • Pyppeteer stealth (Python): Chromium
  • Selenium (NodeJS): Chromium, Firefox
  • Selenium stealth (Python): Chrome
  • Undetected Chromedriver (Python): Chrome
  • Ferrum (Ruby): Chrome
  • Watir (Ruby): Chrome, Safari (macOS)
  • Simple HTTP module/library (NodeJS + Cheerio): Sequential, Parallel, Sequential using NordVPN, HTTP proxies
  • Simple HTTP module/library (Python requests/aiohttp + BeautifulSoup): Sequential, Parallel (x2 implementations)
  • Simple HTTP module/library (Golang standard library + goquery): Sequential, Parallel
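
As an illustration of the sequential and parallel HTTP variants listed above, here is a minimal Python sketch. It is not taken from the repository: to stay dependency-free it swaps in the standard library (urllib, html.parser and a thread pool) for requests/aiohttp and BeautifulSoup, and the names fetch, extract_title, crawl_sequential and crawl_parallel are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch(url, timeout=10):
    """Download a page and return its body as text.

    The User-Agent value is a placeholder chosen for this sketch.
    """
    request = Request(url, headers={"User-Agent": "bots-zoo-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_sequential(urls, fetcher=fetch):
    """Fetch the URLs one after another and return their titles."""
    return [extract_title(fetcher(url)) for url in urls]


def crawl_parallel(urls, fetcher=fetch, workers=4):
    """Fetch the URLs concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [extract_title(html) for html in pool.map(fetcher, urls)]
```

Both functions accept a fetcher argument so that the download step can be replaced, for example by a function that routes requests through a proxy or by a stub in tests. Calling crawl_parallel(["https://example.com"]) would return a list with one title per URL.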

I will continue to add other examples, such as Playwright Firefox/WebKit and Selenium Firefox, in NodeJS as well as in other programming languages like Python. I will also provide examples for bot frameworks that provide mechanisms against bot detection solutions.

The headers directory contains data related to HTTP headers. For the moment, it contains:

  • A list of ~16K user-agents;
  • Accept headers for the main browsers;
  • Accept-Encoding headers for the main browsers;
  • Header names for the main browsers;
  • Fetch metadata request headers.
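
One possible way to use this kind of data is to assemble a request header set around a randomly chosen user-agent. The sketch below assumes the user-agent list is a text file with one value per line, which is not confirmed by this README. The function names and the default Accept and Accept-Encoding values are placeholders for illustration and are not taken from the repository.

```python
import random
from pathlib import Path


def parse_lines(text):
    """Return the non-empty, stripped lines of a text blob."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_lines(path):
    """Read a file assumed to hold one value per line (hypothetical layout)."""
    return parse_lines(Path(path).read_text(encoding="utf-8"))


def build_headers(user_agents, accept="*/*",
                  accept_encoding="gzip, deflate", rng=random):
    """Build a header dict with a user-agent picked at random.

    `accept` and `accept_encoding` default to generic placeholder values;
    real browsers send more specific ones.
    """
    return {
        "User-Agent": rng.choice(user_agents),
        "Accept": accept,
        "Accept-Encoding": accept_encoding,
    }
```

Passing a seeded random.Random instance as rng makes the choice reproducible, which can help when comparing runs.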

The browser_apis directory contains data related to JS APIs sometimes used to identify a browser:

  • language.txt: values of navigator.language;
  • languages.txt: values of navigator.languages;
  • mimeTypes.txt: values of navigator.mimeTypes;
  • oscpus.txt: values of navigator.oscpu;
  • platforms.txt: values of navigator.platform;
  • plugins.txt: values of navigator.plugins;
  • webGLrenderers.txt: values of gl.getParameter(debugInfo.UNMASKED_RENDERER_WEBGL);
  • webGLvendors.txt: values of gl.getParameter(debugInfo.UNMASKED_VENDOR_WEBGL).
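
As a rough illustration of how such value lists might be used, the sketch below checks a collected set of browser attributes against sets of known values and reports the attributes whose value is not in the list. It assumes each file holds one value per line, which this README does not state, and the function names and the dictionary layout are hypothetical.

```python
def parse_values(text):
    """Turn a one-value-per-line text blob into a set of known values."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def unknown_attributes(fingerprint, known_values):
    """Return the sorted attribute names whose observed value is not known.

    Attributes without a reference list in `known_values` are ignored.
    """
    return sorted(
        name
        for name, value in fingerprint.items()
        if name in known_values and value not in known_values[name]
    )


# Reference sets, here built from inline strings instead of the real files.
known = {
    "platform": parse_values("Win32\nMacIntel\nLinux x86_64\n"),
    "language": parse_values("en-US\nfr-FR\n"),
}

# A fingerprint as it might be collected from a browser (illustrative values).
observed = {"platform": "Win32", "language": "xx-XX"}
print(unknown_attributes(observed, known))
```

An unknown value is only a hint that something is unusual, since the reference lists may simply be incomplete.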

DON'T contact me to ask how to make your bot(s) undetectable. You can find articles on my website, but I won't provide more details since I work for a bot detection company.
