Knovour / json-web-crawler

Licence: other
Use JSON to list all elements (with CSS 3 and jQuery selectors) that you want to crawl.

Projects that are alternatives of or similar to json-web-crawler

Pulsar
Turn large Web sites into tables and charts using simple SQLs.
Stars: ✭ 100 (+488.24%)
Mutual labels:  web-crawler
Zhihu Crawler People
A simple distributed crawler for zhihu && data analysis
Stars: ✭ 182 (+970.59%)
Mutual labels:  web-crawler
doc crawler.py
Explore a website recursively and download all the wanted documents (PDF, ODT…)
Stars: ✭ 22 (+29.41%)
Mutual labels:  web-crawler
Crawlab Lite
Lite version of Crawlab, a lightweight crawler management platform.
Stars: ✭ 122 (+617.65%)
Mutual labels:  web-crawler
Abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (+11435.29%)
Mutual labels:  web-crawler
Antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Stars: ✭ 198 (+1064.71%)
Mutual labels:  web-crawler
Ultimate Dork
Web Crawler
Stars: ✭ 79 (+364.71%)
Mutual labels:  web-crawler
WeReadScan
A crawler that scans your purchased books on WeRead (微信读书) and downloads them as local PDFs
Stars: ✭ 273 (+1505.88%)
Mutual labels:  web-crawler
Crawler Commons
A set of reusable Java components that implement functionality common to any web crawler
Stars: ✭ 173 (+917.65%)
Mutual labels:  web-crawler
Market-Trend-Prediction
A project for a knowledge graph building course. It leverages historical stock prices and integrates social media listening from customers to predict the market trend of the Dow Jones Industrial Average (DJIA).
Stars: ✭ 57 (+235.29%)
Mutual labels:  web-crawler
Proxy
A simple tool for fetching usable proxies from several websites.
Stars: ✭ 124 (+629.41%)
Mutual labels:  web-crawler
Awesome Web Scraper
A collection of awesome web scrapers and crawlers.
Stars: ✭ 147 (+764.71%)
Mutual labels:  web-crawler
Kochat
Opensource Korean chatbot framework
Stars: ✭ 204 (+1100%)
Mutual labels:  web-crawler
Pspider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Stars: ✭ 1,611 (+9376.47%)
Mutual labels:  web-crawler
ant
A web crawler for Go
Stars: ✭ 264 (+1452.94%)
Mutual labels:  web-crawler
Infinitycrawler
A simple but powerful web crawler library for .NET
Stars: ✭ 97 (+470.59%)
Mutual labels:  web-crawler
Nutch
Apache Nutch is an extensible and scalable web crawler
Stars: ✭ 2,277 (+13294.12%)
Mutual labels:  web-crawler
StackOverflow-Crawler
A web crawler that crawls the Stack Overflow website (http://stackoverflow.com/) and finds the currently most popular technologies by reading the tags of the newest questions asked on the site.
Stars: ✭ 25 (+47.06%)
Mutual labels:  web-crawler
Raspagem-de-dados-para-iniciantes
Data scraping for beginners using Scrapy and other basic libraries
Stars: ✭ 113 (+564.71%)
Mutual labels:  web-crawler
Strong Web Crawler
An advanced web crawler based on C#.NET, PhantomJS, and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page's DOM structure.
Stars: ✭ 238 (+1300%)
Mutual labels:  web-crawler

Json Web Crawler

Use JSON to list all elements (with CSS 3 and jQuery selectors) that you want to crawl.

Demo

Usage

npm i json-web-crawler --save

const crawl = require('json-web-crawler');

// html: the page's HTML content as a string; setting: your JSON setting (see Settings)
crawl(html, setting)
  .then(console.log)
  .catch(console.error);

Settings

type

The default type is content.

  • content: crawl a specific container into a single JSON object.
  • list: crawl a list, such as a Google search result page, into multiple entries, one per list item.

container

The DOM element to focus on. If type is list, every element matching the container selector is crawled.

listOption

Optional and available in list type only. Use it when you don't want to crawl the whole list. ** ALL INDEXES START FROM 0 **

  • [ 'limit', 10 ]: the first ten elements only (eq(0) ~ eq(9)).
  • [ 'range', 6, 12 ]: from eq(6) to eq(12 - 1). Without an end value, it continues to the last element.
  • [ 'focus', 0, 3, 7, ... ]: specific elements in the list (eq(0), eq(3), eq(7), ...). You can use -1, -2 to count from the end.
  • [ 'ignore', 1, 2, 5 ]: elements you want to skip. You can use -1, -2 to count from the end.
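
The index rules above can be sketched in plain JavaScript. This is only an illustration of the documented semantics, not the library's own code: `selectIndices` is a hypothetical helper, and how the library treats out-of-range values is an assumption here.

```javascript
// Hypothetical helper, not part of json-web-crawler: returns the indices
// that a listOption would keep from a list of `length` containers.
function selectIndices(length, listOption) {
  const all = Array.from({ length }, (_, i) => i);
  if (!listOption) return all;

  const [mode, ...args] = listOption;
  // Negative values count from the end: -1 is the last element.
  const normalize = (i) => (i < 0 ? length + i : i);

  switch (mode) {
    case 'limit':
      return all.slice(0, args[0]);
    case 'range':
      // The end index is exclusive; without it, run to the last element.
      return all.slice(args[0], args[1]);
    case 'focus':
      // Assumption: indices outside the list are silently dropped.
      return args.map(normalize).filter((i) => i >= 0 && i < length);
    case 'ignore': {
      const skipped = new Set(args.map(normalize));
      return all.filter((i) => !skipped.has(i));
    }
    default:
      throw new Error(`Unknown listOption mode: ${mode}`);
  }
}

console.log(selectIndices(10, ['limit', 3]));     // [ 0, 1, 2 ]
console.log(selectIndices(10, ['focus', 3, -3])); // [ 3, 7 ]
```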

crawl

keyName: { options } => keyName: data

crawl: {
  image: {
    elem: 'img',
    get: 'src'
  }
}

// will become
image: IMAGE_SRC_URL

options

  • elem: element inside the container. If empty or undefined, it will use container or listElems instead
  • noChild (boolean): remove all child elements under $(elem)
  • outOfContainer (boolean): if set, it will use $('html').find(elem)
  • get: return type of element
    • text
    • num
    • length: $element.length
    • attrName: $element.attr('attrName')
    • data-dataName: $element.data('dataName')
    • data-dataName:X: X is optional.
      • If data is an array, data-dataName:0 will return $elem.data('dataName')[0].
      • If data is an object, data-dataName:id will return $elem.data('dataName')['id'].
      • If X is omitted, it will return the whole data.
  • process: use this if you want to do something else after 'get' (string type only)
// You can use some simple functions that exist in lodash.
process: [
  ['match', /regex here/, number],  // => str.match(/regex here/)[number]; returns an array if number is omitted, but later steps will then not work
  ['split', ',', number],           // => str.split(',')[number]; returns an array if number is omitted, but later steps will then not work
  ['replace', 'one', 'two'],
  ['substring', 0, 3],
  ['prepend', 'text'],              // => 'text' + value
  ['append', 'text'],               // => value + 'text'
  ['indexOf', 'text'],              // => returns a number
  ['INDEPENDENT_FUNCTION'],         // like encodeURI, encodeURIComponent, unescape, etc...
  /**
    * Because lodash has `escape` & `unescape` functions with the same names
    * but different behavior, the original `escape` & `unescape` functions
    * are renamed to `encode` & `decode` instead.
    */
],

// Or, if you want to do it yourself, use a function instead
process(value, $elem /* jQuery object */) {
  // do something

  return newValue;
}
  • collect: If the value you want is separated across several elements, use collect to find them all.

    • elems: an array containing multiple element settings.
    • loop (boolean): run through all matched elems (like li) you want to get
    • combineWith: the separator used to join the results; without it, collect will return an array
  • default: the value returned when elem is not found, or the result is null or undefined (process will be ignored)
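
To make the array form of process more concrete, here is a minimal sketch of how such steps could be chained. It is a hypothetical re-implementation for illustration, not the library's code: `steps` and `runProcess` are made-up names, the sample path is invented, and stopping once a step returns a non-string is an assumption based on the note that later steps will not work after an array result.

```javascript
// Hypothetical re-implementation for illustration only; the library's
// internals may differ. Each process step is [name, ...args].
const steps = {
  match: (value, regex, index) => {
    const found = value.match(regex);
    return index === undefined ? found : found && found[index];
  },
  split: (value, separator, index) => {
    const parts = value.split(separator);
    return index === undefined ? parts : parts[index];
  },
  replace: (value, from, to) => value.replace(from, to),
  substring: (value, start, end) => value.substring(start, end),
  prepend: (value, text) => text + value,
  append: (value, text) => value + text,
  indexOf: (value, text) => value.indexOf(text),
};

function runProcess(value, process) {
  return process.reduce((current, [name, ...args]) => {
    // Assumption: every step expects a string, so once a step returns
    // something else (array, number, null) the remaining steps are skipped.
    if (typeof current !== 'string') return current;
    const step = steps[name];
    if (!step) throw new Error(`Unknown process step: ${name}`);
    return step(current, ...args);
  }, value);
}

// Invented sample value: strip the query string, then add the host.
const link = runProcess('/projects/demo/sample?ref=popular', [
  ['split', '?', 0],
  ['prepend', 'https://www.kickstarter.com'],
]);
console.log(link); // https://www.kickstarter.com/projects/demo/sample
```

Keeping each step as a small function keyed by name mirrors the `[name, ...args]` shape of the setting, so adding another step only means adding another entry to the table.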

pageNotFound

If the check matches, it will return a page not found error.

  • elem
  • get
  • check: like process, but with only one step
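
As a rough sketch of what a single check step might do, the snippet below handles only 'equal', the one check that appears in this README's examples. `isPageNotFound` is a hypothetical helper, and strict string equality is an assumption about how the comparison works.

```javascript
// Hypothetical helper, not part of json-web-crawler. Only 'equal' is
// sketched because it is the only check shown in the README examples.
function isPageNotFound(value, check) {
  const [name, expected] = check;
  if (name === 'equal') {
    // Assumption: a plain strict comparison, with no trimming or coercion.
    return value === expected;
  }
  throw new Error(`Unsupported check: ${name}`);
}

console.log(isPageNotFound('404', ['equal', '404'])); // true
```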

Example

Content Type

The Steam Dota 2 page from the demo.

const setting = {
  type: 'content',
  container: '#game_highlights .rightcol',
  crawl: {
    appId: {
      elem: '.glance_tags',
      get:  'data-appid'
    },
    appName: {
      outOfContainer: true,
      elem: '.apphub_AppName',
      get:  'text'
    },
    image: {
      elem: '.game_header_image_full',
      get:  'src'
    },
    reviews: {
      elem: '.game_review_summary:eq(0)',
      get:  'text',
    },
    tags: {
      elem: '.glance_tags',
      collect: {
        elems: [{
          elem: 'a.app_tag:eq(0)',
          get:  'text'
        }, {
          elem: 'a.app_tag:eq(1)',
          get:  'text'
        }, {
          elem: 'a.app_tag:eq(2)',
          get:  'text'
        }],
        combineWith: ', '
      }
    },
    allTags: {
      elem: '.glance_tags a.app_tag',
      collect: {
        loop: true,
        get:  'text',
        combineWith: ', '
      }
    },
    description: {
      elem: '.game_description_snippet',
      get:  'text',
      process(value, $elem) {
        return value.split(', ');
      }
    },
    releaseDate: {
      elem: '.release_date .date',
      get:  'text'
    }
  }
};

List Type

The Kickstarter popular list from the demo.

const setting = {
  pageNotFound: [{
    elem: '.grey-frame-inner h1',
    get:  'text',
    check: ['equal', '404']
  }],
  type: 'list',
  container: '#projects_list .project-card',
  listOption: [ 'limit', 3 ],
  // listOption: [ 'range', 0, 10 ],
  // listOption: [ 'ignore', 0, 2, -1 ],
  // listOption: [ 'focus', 3, -3 ],
  crawl: {
    projectID: {
      get: 'data-pid',
    },
    name: {
      elem: '.project-title',
      get:  'text',
    },
    image: {
      elem: '.project-thumbnail img',
      get:  'src'
    },
    link: {
      elem: '.project-title a',
      get:  'href',
      process: [
        [ 'split', '?', 0 ],
        [ 'prepend', 'https://www.kickstarter.com' ]
      ]
    },
    description: {
      elem: '.project-blurb',
      get:  'text'
    },
    funded: {
      elem: '.project-stats-value:eq(0)',
      get:  'text'
    },
    percentPledged: {
      elem: '.project-percent-pledged',
      get:  'style',
      process: [
        [ 'split', /:\s?/g, 1 ]
      ]
    },
    pledged: {
      elem: '.money.usd',
      get:  'num'
    }
  }
};