wspl / Creeper

Licence: apache-2.0
🐾 Creeper - The Next Generation Crawler Framework (Go)

Projects that are alternatives of or similar to Creeper

Abotx
Cross Platform C# Web crawler framework, headless browser, parallel crawler. Please star this project! +1.
Stars: ✭ 63 (-91.73%)
Mutual labels:  spider, framework, cross-platform
Abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (+157.35%)
Mutual labels:  crawler, spider, cross-platform
Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+1938.71%)
Mutual labels:  crawler, spider, framework
Haipproxy
💖 Highly available distributed IP proxy pool, powered by Scrapy and Redis
Stars: ✭ 4,993 (+555.25%)
Mutual labels:  crawler, spider
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (-42.26%)
Mutual labels:  crawler, spider
Learnpython
Basic Python practice code and various crawler code
Stars: ✭ 451 (-40.81%)
Mutual labels:  crawler, spider
Bilili
🍻 bilibili video (including bangumi) and danmaku downloader
Stars: ✭ 379 (-50.26%)
Mutual labels:  crawler, spider
Fbcrawl
A Facebook crawler
Stars: ✭ 536 (-29.66%)
Mutual labels:  crawler, spider
Go jobs
A look at the Golang job market
Stars: ✭ 526 (-30.97%)
Mutual labels:  crawler, spider
Xxl Crawler
A distributed web crawler framework (XXL-CRAWLER).
Stars: ✭ 561 (-26.38%)
Mutual labels:  crawler, spider
Newcrawler
Free Web Scraping Tool with Java
Stars: ✭ 589 (-22.7%)
Mutual labels:  crawler, spider
Html2article
HTML web page main-content extraction
Stars: ✭ 441 (-42.13%)
Mutual labels:  crawler, spider
Toou 2d
A 2D framework built on the cross-platform Qt Quick (QML) technology
Stars: ✭ 413 (-45.8%)
Mutual labels:  framework, cross-platform
Awesome Crawler
A collection of awesome web crawlers and spiders in different languages
Stars: ✭ 4,793 (+529%)
Mutual labels:  crawler, spider
Gosint
OSINT Swiss Army Knife
Stars: ✭ 401 (-47.38%)
Mutual labels:  crawler, spider
Xsrfprobe
The Prime Cross Site Request Forgery (CSRF) Audit and Exploitation Toolkit.
Stars: ✭ 532 (-30.18%)
Mutual labels:  crawler, spider
Douyin
API of DouYin for Humans used to Crawl Popular Videos and Musics
Stars: ✭ 580 (-23.88%)
Mutual labels:  crawler, spider
Baiduimagespider
A super lightweight Baidu image crawler
Stars: ✭ 591 (-22.44%)
Mutual labels:  crawler, spider
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (-13.91%)
Mutual labels:  crawler, spider
Libgdx
Desktop/Android/HTML5/iOS Java game development framework
Stars: ✭ 19,420 (+2448.56%)
Mutual labels:  framework, cross-platform

About

Creeper is a next-generation crawler that fetches web pages according to a Creeper script. As a cross-platform embedded crawler, it can be used in a news app, a subscription program, and so on.

Warning: This project is still in early-stage development. Please do not use it in a production environment.

Get Started

Installation

$ go get github.com/wspl/creeper

Hello World!

Create hacker_news.crs

page(@page=1) = "https://news.ycombinator.com/news?p={@page}"

news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href

Then, create main.go

package main

import "github.com/wspl/creeper"

func main() {
	c := creeper.Open("./hacker_news.crs")
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}

Build and run. The console will print something like:

title:  Samsung chief Lee arrested as S.Korean corruption probe deepens
site:  reuters.com
link:  http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title:  ReactOS 0.4.4 Released
site:  reactos.org
link:  https://reactos.org/project-news/reactos-044-released
===
title:  FeFETs: How this new memory stacks up against existing non-volatile memory
site:  semiengineering.com
link:  http://semiengineering.com/what-are-fefets/

Script Spec

Town

A town is a lambda-like expression that holds an (im)mutable string. Most of the time, it is used to store a URL.

page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"

When you need a town, use it as if you were calling a function:

news[]: page(ext="Hello World!") -> $("tr.athing")

You might have noticed that the @page parameter is not passed in the call. That is because it is a special parameter.

In a town definition, an expression such as name="something" means that the parameter name has the default value "something".

Incidentally, @page is a parameter that increases automatically when the current page has no more content.
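
As an illustration of the parameter rules above, here is a minimal Go sketch of how a town could be expanded into a concrete URL. The town type and its expand method are hypothetical names made up for this example; they are not part of the Creeper API, and the real implementation may work differently (for instance, it may URL-encode values, which this sketch does not).

```go
package main

import (
	"fmt"
	"strings"
)

// town is a hypothetical stand-in for a Creeper town: a template string
// plus the default values declared in its definition line.
type town struct {
	template string
	defaults map[string]string
}

// expand fills every {name} placeholder in the template. Arguments given
// at the call site take precedence over the declared defaults.
func (t town) expand(args map[string]string) string {
	merged := make(map[string]string, len(t.defaults)+len(args))
	for name, val := range t.defaults {
		merged[name] = val
	}
	for name, val := range args {
		merged[name] = val
	}
	out := t.template
	for name, val := range merged {
		out = strings.ReplaceAll(out, "{"+name+"}", val)
	}
	return out
}

func main() {
	page := town{
		template: "https://news.ycombinator.com/news?p={@page}&ext={ext}",
		defaults: map[string]string{"@page": "1"},
	}
	// @page is omitted here, so its default value is used.
	fmt.Println(page.expand(map[string]string{"ext": "hello"}))
	// Assumption: advancing to the next page amounts to passing a larger @page.
	fmt.Println(page.expand(map[string]string{"@page": "2", "ext": "hello"}))
}
```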

Node

Nodes form a tree structure that represents the data structure you are going to crawl.

news[]: page -> $("tr.athing")
	title: $(".title a.storylink").text
	site: $(".title span.sitestr").text
	link: $(".title a.storylink").href

Like YAML, nodes distinguish hierarchy levels by indentation.
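
To make the indentation rule concrete, the following Go sketch turns indented lines into a tree using a stack of open parents. It is only an illustration under simple assumptions: node, indentOf, and parseTree are hypothetical names, each leading space or tab counts as one indentation unit, and the text after the first colon is ignored. Creeper's own parser is not shown in this document and may behave differently.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical tree node holding only a name and its children.
type node struct {
	name     string
	children []*node
}

// indentOf counts the leading spaces and tabs of a line.
func indentOf(line string) int {
	return len(line) - len(strings.TrimLeft(line, " \t"))
}

// parseTree attaches each line to the nearest preceding line that is
// indented less than it, which is how YAML-like nesting is resolved.
func parseTree(src string) *node {
	type frame struct {
		indent int
		n      *node
	}
	root := &node{name: "root"}
	stack := []frame{{indent: -1, n: root}}
	for _, line := range strings.Split(src, "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		ind := indentOf(line)
		// Pop siblings and deeper nodes until the top is a real parent.
		for stack[len(stack)-1].indent >= ind {
			stack = stack[:len(stack)-1]
		}
		name, _, _ := strings.Cut(strings.TrimSpace(line), ":")
		n := &node{name: name}
		parent := stack[len(stack)-1].n
		parent.children = append(parent.children, n)
		stack = append(stack, frame{indent: ind, n: n})
	}
	return root
}

// dump prints the tree, one node per line, indented by depth.
func dump(n *node, depth int) {
	fmt.Println(strings.Repeat("  ", depth) + n.name)
	for _, c := range n.children {
		dump(c, depth+1)
	}
}

func main() {
	src := "news[]: page -> $(\"tr.athing\")\n" +
		"\ttitle: $(\".title a.storylink\").text\n" +
		"\tsite: $(\".title span.sitestr\").text\n" +
		"\tlink: $(\".title a.storylink\").href\n"
	dump(parseTree(src), 0)
}
```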

Node Name

Every node has a name. title is a field name, representing a general string value. news[] is an array name, representing a parent structure with multiple sub-items.

Page

Page indicates where to fetch the field data from. It can be a town expression or a field reference.

A field reference is an advanced usage of nodes; you can find the details in ./eh.crs.

If a node has both a page and a fun, the page goes on the left of -> and the fun on the right, that is: page -> fun
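
The layout of a node line (name, then an optional page, then the fun) can be sketched in Go as below. splitNode is a hypothetical helper written for this example, not a Creeper function. It assumes the name ends at the first colon and that "->" never appears inside a selector string, which a real parser would have to handle properly.

```go
package main

import (
	"fmt"
	"strings"
)

// splitNode breaks one node line into its name, page, and fun parts.
// When the line has no "->", the page is empty and the rest is the fun.
func splitNode(line string) (string, string, string) {
	name, rest, _ := strings.Cut(strings.TrimSpace(line), ":")
	rest = strings.TrimSpace(rest)
	if page, fun, ok := strings.Cut(rest, "->"); ok {
		return name, strings.TrimSpace(page), strings.TrimSpace(fun)
	}
	return name, "", rest
}

func main() {
	for _, line := range []string{
		`news[]: page -> $("tr.athing")`,
		`title: $(".title a.storylink").text`,
	} {
		name, page, fun := splitNode(line)
		fmt.Printf("name=%q page=%q fun=%q\n", name, page, fun)
	}
}
```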

Fun

A fun represents a data-processing step.

These are all the supported funs:

| Name | Parameters | Description |
| --- | --- | --- |
| $ | (selector: string) | Relative CSS selector (select from parent node) |
| $root | (selector: string) | Absolute CSS selector (select from body) |
| html | | inner HTML |
| text | | inner text |
| outerHTML | | outer HTML |
| attr | (attr: string) | attribute value |
| style | | style attribute value |
| href | | href attribute value |
| src | | src attribute value |
| class | | class attribute value |
| id | | id attribute value |
| calc | (prec: int) | calculate arithmetic expression |
| match | (regexp: string) | match first sub-string via regular expression |
| expand | (regexp: string, target: string) | expand matched strings to target string |
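
The exact semantics of match and expand are not spelled out here. As a rough guide, the Go sketch below shows what the two descriptions could correspond to when built on Go's standard regexp package. matchFirst and expandTo are hypothetical helpers, and the $1 / $2 template syntax is that of Go's Regexp.ExpandString; whether Creeper uses the same syntax is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchFirst returns the first substring of s that matches pattern,
// or an empty string when nothing matches.
func matchFirst(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

// expandTo finds the first match of pattern in s and rewrites it into
// target, where $1, $2, ... refer to the captured groups.
func expandTo(pattern, target, s string) string {
	re := regexp.MustCompile(pattern)
	m := re.FindStringSubmatchIndex(s)
	if m == nil {
		return ""
	}
	return string(re.ExpandString(nil, target, s, m))
}

func main() {
	text := "128 points by alice"
	fmt.Println(matchFirst(`\d+`, text))
	fmt.Println(expandTo(`(\d+) points by (\w+)`, "$2:$1", text))
}
```

Both helpers compile the pattern on every call to keep the sketch short; code that applies the same pattern many times would compile it once and reuse it.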

Author

Plutonist

impl.moe · GitHub @wspl
