
GeoffZhu / spider

Licence: other
A web spider framework

Programming Languages

javascript

Projects that are alternatives of or similar to spider

Webster
a reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (+1356%)
Mutual labels:  spider, puppeteer
Chromium for spider
dynamic crawler for web vulnerability scanner
Stars: ✭ 220 (+780%)
Mutual labels:  spider, puppeteer
Puppeteer Walker
a puppeteer walker 🕷 🕸
Stars: ✭ 78 (+212%)
Mutual labels:  spider, puppeteer
Ppspider
A web spider built on Puppeteer, with decorator-based task-queue management and scheduling, convenient data persistence via NeDB or MongoDB, and built-in data visualization and user interaction.
Stars: ✭ 237 (+848%)
Mutual labels:  spider, puppeteer
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+584%)
Mutual labels:  spider, puppeteer
ZSpider
An Electron-based crawler.
Stars: ✭ 37 (+48%)
Mutual labels:  spider, puppeteer
ICP-Checker
Queries ICP filing records for companies or domains, automatically completes the slider CAPTCHA, and saves the results to an Excel sheet. Works with the 2022 version of the MIIT filing system site, so you can skip the endless slider verification and the VIP paywalls some lookup tools put in front of filing data.
Stars: ✭ 119 (+376%)
Mutual labels:  spider
Spider
This Spider project is continuously updated with the crawling techniques I have learned and used.
Stars: ✭ 16 (-36%)
Mutual labels:  spider
jest-puppeteer-istanbul
Collect code coverage information from end-to-end jest puppeteer tests
Stars: ✭ 26 (+4%)
Mutual labels:  puppeteer
jest-puppe-shots
A Jest plugin for creating screenshots of React components with a little help of Puppeteer
Stars: ✭ 86 (+244%)
Mutual labels:  puppeteer
webring
This "webring" was created to encourage Thai artists, designers, and developers to build their own websites and share traffic with one another.
Stars: ✭ 125 (+400%)
Mutual labels:  puppeteer
puppeteer-instagram
Instagram automation driven by headless chrome.
Stars: ✭ 87 (+248%)
Mutual labels:  puppeteer
weibo topic
Collects Weibo posts by topic keyword and from personal accounts, with one-click deletion of your own posts; uses Selenium to obtain cookies and requests to do the processing.
Stars: ✭ 28 (+12%)
Mutual labels:  spider
NScrapy
NScrapy is a cross-platform distributed spider framework for .NET Core which provides an easy way to write your own spider.
Stars: ✭ 88 (+252%)
Mutual labels:  spider
Sina Spider
A Sina Weibo crawler built on Python + Selenium. Saves cookies after a simulated login to keep the session alive, and crawls trending Weibo posts matching a given keyword.
Stars: ✭ 25 (+0%)
Mutual labels:  spider
Social-Media-Automation
Automate social media because you don't have to be active on all of them😉. Best way to be active on all social media without actually being active on them. 😃
Stars: ✭ 186 (+644%)
Mutual labels:  puppeteer
nest-puppeteer
Puppeteer (Headless Chrome) provider for Nest.js
Stars: ✭ 68 (+172%)
Mutual labels:  puppeteer
throughout
🎪 End-to-end testing made simple (using Jest and Puppeteer)
Stars: ✭ 16 (-36%)
Mutual labels:  puppeteer
personal-puppeteer
A personal web page screenshotting service. Basically, it exposes an API that I can use to generate screenshot of any URL.
Stars: ✭ 19 (-24%)
Mutual labels:  puppeteer
preact-typescript-parcel-starter
Starter with Preact - Typescript - Parcel Bundler
Stars: ✭ 51 (+104%)
Mutual labels:  puppeteer

gz-spider

中文文档 (Chinese documentation)

A web spider framework for Node.js, based on Puppeteer & Axios.

Features

  • IP proxy support
  • Automatic retry on failure
  • Puppeteer support
  • Easily compatible with various task queue services
  • Easy multiprocessing

Install

npm i gz-spider --save

Usage

const spider = require('gz-spider');

// Register all your spider code as processers
spider.setProcesser({
  ['getGoogleSearchResult']: async (fetcher, params) => {
    // fetcher.page is the original Puppeteer page
    let resp = await fetcher.axios.get(`https://www.google.com/search?q=${params}`);

    // throw 'Retry' to retry this processer
    // throw 'ChangeProxy' to retry this processer with a new proxy
    // throw 'Fail' to finish this processer immediately with a failure message

    if (resp.status === 200) {
      // Data processing start
      let result = resp.data + 1;
      // Data processing end
      return result;
    } else {
      throw 'Retry';
    }
  }
});

// Get data
spider.getData('getGoogleSearchResult', params).then(result => {
  console.log(result);
});

Config

This framework is divided into three components: fetcher, strategy, and processer.

Fetcher

spider.setFetcher({
  axiosTimeout: 5000,
  proxyTimeout: 180 * 1000,
  proxy() {
    // may also be an async function, e.g. to fetch the proxy config from a remote service
    return {
      host: '127.0.0.1',
      port: '9000'
    }
  }
});
  • axiosTimeout: [Number] Per-request timeout in ms
  • proxyTimeout: [Number] When config.proxy is a [Function], how long a proxy is used before the proxy function is re-run to get a new host + port
  • proxy: [Object | Function] When proxy is a [Function], it may be async, so you can fetch the proxy config from a remote service
    • proxy.host [String]
    • proxy.port [String]
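As a sketch of the function form, the snippet below rotates through a local pool; in practice you might fetch the list from a remote proxy provider instead. The pool addresses and the `rotatingProxy` name are illustrative, not part of gz-spider:

```javascript
// Hypothetical proxy pool; in practice this list might come from a
// remote proxy service rather than being hard-coded.
const proxyPool = [
  { host: '127.0.0.1', port: '9000' },
  { host: '127.0.0.1', port: '9001' }
];
let cursor = 0;

// An async proxy() is re-run every proxyTimeout ms to pick a new proxy.
async function rotatingProxy() {
  const proxy = proxyPool[cursor % proxyPool.length];
  cursor += 1;
  return proxy;
}

// spider.setFetcher({ axiosTimeout: 5000, proxyTimeout: 180 * 1000, proxy: rotatingProxy });
```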

Strategy

spider.setStrategy({
  retryTimes: 2
});
  • retryTimes: [Number] Max retry times for one task
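As a rough mental model of what retryTimes does (an illustrative re-implementation, not gz-spider's actual internals): the processer is re-run whenever it throws 'Retry', up to the configured limit.

```javascript
// Illustrative sketch only: re-run a processer on 'Retry' up to retryTimes.
async function runWithStrategy(processer, input, { retryTimes = 2 } = {}) {
  // retryTimes retries means up to retryTimes + 1 attempts in total
  for (let attempt = 0; attempt <= retryTimes; attempt++) {
    try {
      return await processer(input);
    } catch (err) {
      if (err === 'Retry' && attempt < retryTimes) continue; // try again
      throw err; // 'Fail', or retries exhausted
    }
  }
}
```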

Work with task queue

Process

Get one task -> `spider.getData(processerKey, processerInput)` -> Complete task with processed data

Simulating a task queue with MySQL

  1. Create a table spider-task with at least the columns 'id', 'status', 'processer_key', 'processer_input', 'processer_output'
  2. Write an API that returns one todo task (status = 'todo'), for example GET /spider/task
  3. Write an API that updates the table with the processed data, for example PUT /spider/task
const axios = require('axios');

(async () => {
  while (true) {
    // Get one task
    let resp = await axios.get('http://127.0.0.1:8080/spider/task');

    if (!resp.data.task) break;

    let { id, processerKey, processerInput } = resp.data.task;
    let processerOutput = await spider.getData(processerKey, processerInput);

    // Complete the task with the processed data
    await axios.put('http://127.0.0.1:8080/spider/task', {
      id, processerOutput,
      status: 'success'
    });
  }
})();

License

MIT

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].