
GeoffZhu / spider

Licence: other
A web spider framework

Programming Languages

javascript

Projects that are alternatives of or similar to spider

Webster
a reliable high-level web crawling & scraping framework for Node.js.
Stars: ✭ 364 (+1356%)
Mutual labels:  spider, puppeteer
Chromium for spider
dynamic crawler for web vulnerability scanner
Stars: ✭ 220 (+780%)
Mutual labels:  spider, puppeteer
Puppeteer Walker
a puppeteer walker 🕷 🕸
Stars: ✭ 78 (+212%)
Mutual labels:  spider, puppeteer
Ppspider
A web spider built on Puppeteer, with decorator-based task-queue management and scheduling, convenient data persistence via NeDB or MongoDB, and built-in data visualization and user interaction.
Stars: ✭ 237 (+848%)
Mutual labels:  spider, puppeteer
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (+584%)
Mutual labels:  spider, puppeteer
ZSpider
An Electron-based crawler.
Stars: ✭ 37 (+48%)
Mutual labels:  spider, puppeteer
ICP-Checker
Queries ICP filing records for companies or domains, automatically completes the slider CAPTCHA, and saves the results to an Excel sheet. Works with the 2022 version of the MIIT filing system site, so you can skip the endless slider verification and the VIP paywalls some lookup tools put in front of filing data.
Stars: ✭ 119 (+376%)
Mutual labels:  spider
Spider
This Spider project is continuously updated with the crawling techniques I have learned and used.
Stars: ✭ 16 (-36%)
Mutual labels:  spider
jest-puppeteer-istanbul
Collect code coverage information from end-to-end jest puppeteer tests
Stars: ✭ 26 (+4%)
Mutual labels:  puppeteer
jest-puppe-shots
A Jest plugin for creating screenshots of React components with a little help of Puppeteer
Stars: ✭ 86 (+244%)
Mutual labels:  puppeteer
webring
This "webring" was created to encourage Thai artists, designers, and developers to build their own websites and share traffic with one another.
Stars: ✭ 125 (+400%)
Mutual labels:  puppeteer
puppeteer-instagram
Instagram automation driven by headless chrome.
Stars: ✭ 87 (+248%)
Mutual labels:  puppeteer
weibo topic
Collects Weibo posts by topic keyword and from personal accounts, with one-click deletion of your own posts; uses Selenium to obtain cookies and requests to do the processing.
Stars: ✭ 28 (+12%)
Mutual labels:  spider
NScrapy
NScrapy is a cross-platform distributed spider framework for .NET Core which provides an easy way to write your own spider.
Stars: ✭ 88 (+252%)
Mutual labels:  spider
Sina Spider
A Sina Weibo crawler built on Python + Selenium. Saves cookies after a simulated login to keep the session alive, and crawls trending Weibo posts matching a given keyword.
Stars: ✭ 25 (+0%)
Mutual labels:  spider
Social-Media-Automation
Automate social media because you don't have to be active on all of them😉. Best way to be active on all social media without actually being active on them. 😃
Stars: ✭ 186 (+644%)
Mutual labels:  puppeteer
nest-puppeteer
Puppeteer (Headless Chrome) provider for Nest.js
Stars: ✭ 68 (+172%)
Mutual labels:  puppeteer
throughout
🎪 End-to-end testing made simple (using Jest and Puppeteer)
Stars: ✭ 16 (-36%)
Mutual labels:  puppeteer
personal-puppeteer
A personal web page screenshotting service. Basically, it exposes an API that I can use to generate screenshot of any URL.
Stars: ✭ 19 (-24%)
Mutual labels:  puppeteer
preact-typescript-parcel-starter
Starter with Preact - Typescript - Parcel Bundler
Stars: ✭ 51 (+104%)
Mutual labels:  puppeteer

gz-spider

中文文档 (Chinese documentation)

A web spider framework for Node.js, based on Puppeteer & Axios.

Features

  • IP proxy support
  • Automatic retry on failure
  • Puppeteer support
  • Easily compatible with various task queue services
  • Easy multiprocessing

Install

npm i gz-spider --save

Usage

const spider = require('gz-spider');

// Register all your spider code as processers
spider.setProcesser({
  ['getGoogleSearchResult']: async (fetcher, params) => {
    // fetcher.page is the original Puppeteer page
    let resp = await fetcher.axios.get(`https://www.google.com/search?q=${params}`);

    // throw 'Retry' to retry this processer
    // throw 'ChangeProxy' to retry this processer with a new proxy
    // throw 'Fail' to finish this processer immediately with a failure message

    if (resp.status === 200) {
      // Data processing start
      let result = resp.data + 1;
      // Data processing end
      return result;
    } else {
      throw 'Retry';
    }
  }
});

// Get data
spider.getData('getGoogleSearchResult', params).then(result => {
  console.log(result);
});

Config

This framework is divided into three components: fetcher, strategy, and processer.

Fetcher

spider.setFetcher({
  axiosTimeout: 5000,
  proxyTimeout: 180 * 1000,
  proxy() {
    // may also be an async function, e.g. to fetch the proxy config from a remote service
    return {
      host: '127.0.0.1',
      port: '9000'
    }
  }
});
  • axiosTimeout: [Number] Per-request timeout in ms
  • proxyTimeout: [Number] When config.proxy is a [Function], how long a proxy is used before the proxy function is re-run to get a new host + port
  • proxy: [Object | Function] When proxy is a [Function], it may be async, so you can fetch the proxy config from a remote service
    • proxy.host [String]
    • proxy.port [String]
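As a sketch of the function form, the snippet below rotates through a local pool; in practice you might fetch the list from a remote proxy provider instead. The pool addresses and the `rotatingProxy` name are illustrative, not part of gz-spider:

```javascript
// Hypothetical proxy pool; in practice this list might come from a
// remote proxy service rather than being hard-coded.
const proxyPool = [
  { host: '127.0.0.1', port: '9000' },
  { host: '127.0.0.1', port: '9001' }
];
let cursor = 0;

// An async proxy() is re-run every proxyTimeout ms to pick a new proxy.
async function rotatingProxy() {
  const proxy = proxyPool[cursor % proxyPool.length];
  cursor += 1;
  return proxy;
}

// spider.setFetcher({ axiosTimeout: 5000, proxyTimeout: 180 * 1000, proxy: rotatingProxy });
```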

Strategy

spider.setStrategy({
  retryTimes: 2
});
  • retryTimes: [Number] Max retry times for one task
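As a rough mental model of what retryTimes does (an illustrative re-implementation, not gz-spider's actual internals): the processer is re-run whenever it throws 'Retry', up to the configured limit.

```javascript
// Illustrative sketch only: re-run a processer on 'Retry' up to retryTimes.
async function runWithStrategy(processer, input, { retryTimes = 2 } = {}) {
  // retryTimes retries means up to retryTimes + 1 attempts in total
  for (let attempt = 0; attempt <= retryTimes; attempt++) {
    try {
      return await processer(input);
    } catch (err) {
      if (err === 'Retry' && attempt < retryTimes) continue; // try again
      throw err; // 'Fail', or retries exhausted
    }
  }
}
```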

Work with task queue

Process

Get one task -> `spider.getData(processerKey, processerInput)` -> Complete task with processed data

Simulating a task queue with MySQL

  1. Create a table spider-task with at least the columns 'id', 'status', 'processer_key', 'processer_input', 'processer_output'
  2. Write an API that returns one todo task (status = 'todo'), for example GET /spider/task
  3. Write an API that updates the table with the processed data, for example PUT /spider/task
const axios = require('axios');

(async () => {
  while (true) {
    // Get one task
    let resp = await axios.get('http://127.0.0.1:8080/spider/task');

    if (!resp.data.task) break;

    let { id, processerKey, processerInput } = resp.data.task;
    let processerOutput = await spider.getData(processerKey, processerInput);

    // Complete the task with the processed data
    await axios.put('http://127.0.0.1:8080/spider/task', {
      id, processerOutput,
      status: 'success'
    });
  }
})();

License

MIT

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].