
infinitbyte / Gopa

Licence: other
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn

Programming Languages

go

Projects that are alternatives of or similar to Gopa

Colly
Elegant Scraper and Crawler Framework for Golang
Stars: ✭ 15,535 (+5508.3%)
Mutual labels:  crawler, spider, scraping, crawling
Linkedin Profile Scraper
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON. Works in 2020.
Stars: ✭ 171 (-38.27%)
Mutual labels:  crawler, spider, scraping, crawling
Arachnid
Powerful web scraping framework for Crystal
Stars: ✭ 68 (-75.45%)
Mutual labels:  crawler, spider, web-scraping, crawling
flink-crawler
Continuous scalable web crawler built on top of Flink and crawler-commons
Stars: ✭ 48 (-82.67%)
Mutual labels:  crawler, spider, web-crawler, crawling
Antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
Stars: ✭ 198 (-28.52%)
Mutual labels:  crawler, scraping, crawling, web-crawler
Crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Stars: ✭ 440 (+58.84%)
Mutual labels:  crawler, spider, scraping, crawling
Spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
Stars: ✭ 656 (+136.82%)
Mutual labels:  crawler, spider, web-scraping, web-crawler
Dotnetcrawler
DotnetCrawler is a straightforward, lightweight web crawling/scraping library with Entity Framework Core output, based on .NET Core. It is designed like other strong crawler libraries such as WebMagic and Scrapy, but can be extended to fit your custom requirements. Medium link: https://medium.com/@mehmetozkaya/creating-custom-web-crawler-with-dotnet-core-using-entity-framework-core-ec8d23f0ca7c
Stars: ✭ 100 (-63.9%)
Mutual labels:  crawler, scraping, crawling
Scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Stars: ✭ 42,343 (+15186.28%)
Mutual labels:  crawler, scraping, crawling
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-94.58%)
Mutual labels:  crawler, scraping, web-scraping
Pspider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
Stars: ✭ 1,611 (+481.59%)
Mutual labels:  crawler, spider, web-crawler
Crawlab Lite
Lite version of Crawlab, the crawler management platform.
Stars: ✭ 122 (-55.96%)
Mutual labels:  crawler, spider, web-crawler
Abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
Stars: ✭ 1,961 (+607.94%)
Mutual labels:  crawler, spider, web-crawler
Gopa Abandoned
GOPA, a spider written in Go.(NOTE: this project moved to https://github.com/infinitbyte/gopa )
Stars: ✭ 98 (-64.62%)
Mutual labels:  crawler, spider, lightweight
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+1038.63%)
Mutual labels:  scraping, web-scraping, crawling
Geziyor
Geziyor, a fast web crawling & scraping framework for Go. Supports JS rendering.
Stars: ✭ 1,246 (+349.82%)
Mutual labels:  crawler, spider, scraping
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
Stars: ✭ 52 (-81.23%)
Mutual labels:  spider, scraping, crawling
Spidy
The simple, easy to use command line web crawler.
Stars: ✭ 257 (-7.22%)
Mutual labels:  crawler, crawling, web-crawler
Crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
Stars: ✭ 8,392 (+2929.6%)
Mutual labels:  crawler, spider, web-crawler
Awesome Python Primer
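An index of high-quality Chinese-language resources for self-taught Python beginners, including books, docs, and videos, covering crawlers, web, data analysis, and machine learning.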
Stars: ✭ 57 (-79.42%)
Mutual labels:  crawler, spider, scraping
What a Spider!

GOPA, A Spider Written in Go.

Join the chat at https://gitter.im/infinitbyte/gopa

Goal

  • Lightweight, with a low footprint: memory usage should stay below 100 MB
  • Easy to deploy: no runtime or other dependencies required
  • Easy to use: no programming or scripting skills needed, with features that work out of the box

Screenshot

What a Spider! GOPA Spider!

How to use

Requirements

  • Elasticsearch v5.3+

Setup

First, get Gopa. There are two options: download the pre-built package or compile it yourself.

Download Pre Built Package

Go to the Releases page and download the right package for your platform.

Note: the Darwin package is for Mac.

Compile The Package Manually

Requirements

  • Golang 1.9+

Supported platform

For example:

# Install Go first if needed, for example with one of:
#   apt install golang-go
#   brew install golang
mkdir -p ~/go/src/github.com/infinitbyte/
cd ~/go/src/github.com/infinitbyte/
git clone https://github.com/infinitbyte/gopa.git
cd gopa
make

After a few minutes, you should have:

gopa, the main program, a single binary.
gopa.yml, main configuration for gopa.

Required Config

Note: the Elasticsearch version should be >= v5.3.

  • Enable the elastic module in gopa.yml and update the Elasticsearch settings:
elasticsearch:
- name: default
  enabled: true
  endpoint: http://localhost:9200
  index_prefix: gopa-
  basic_auth:
    username: elastic
    password: changeme
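
Before starting Gopa, it can help to confirm that the configured endpoint is reachable and meets the version requirement. The following is a minimal sketch and not part of Gopa. It assumes the endpoint and credentials from the sample config above, and relies on the Elasticsearch root endpoint returning JSON with a version.number field. The helper names atLeast and fetchVersion are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// atLeast reports whether a dotted version such as "5.6.3" is >= major.minor.
// Hypothetical helper for illustration; it is not part of Gopa.
func atLeast(version string, major, minor int) (bool, error) {
	parts := strings.SplitN(version, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("unexpected version %q", version)
	}
	gotMajor, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, err
	}
	gotMinor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false, err
	}
	return gotMajor > major || (gotMajor == major && gotMinor >= minor), nil
}

// fetchVersion asks the Elasticsearch root endpoint for its version number,
// using HTTP basic auth as in the basic_auth section of the sample config.
func fetchVersion(endpoint, user, pass string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return "", err
	}
	req.SetBasicAuth(user, pass)

	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}

	var info struct {
		Version struct {
			Number string `json:"number"`
		} `json:"version"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return "", err
	}
	return info.Version.Number, nil
}

func main() {
	// Values copied from the sample gopa.yml; adjust them to your setup.
	v, err := fetchVersion("http://localhost:9200", "elastic", "changeme")
	if err != nil {
		fmt.Println("Elasticsearch not reachable:", err)
		return
	}
	ok, err := atLeast(v, 5, 3)
	if err != nil {
		fmt.Println("cannot parse version:", err)
		return
	}
	fmt.Printf("Elasticsearch %s, meets the v5.3+ requirement: %t\n", v, ok)
}
```

If the check reports that the endpoint is unreachable, verify the endpoint and basic_auth values in gopa.yml before starting Gopa.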

Start

Besides Elasticsearch, Gopa has no other dependencies. Run ./gopa to start the program.

Gopa can also run as a daemon (note: only available on Linux and Mac):

Example
➜  gopa git:(master) ✗ ./bin/gopa --daemon
  ________ ________ __________  _____
 /  _____/ \_____  \\______   \/  _  \
/   \  ___  /   |   \|     ___/  /_\  \
\    \_\  \/    |    \    |  /    |    \
 \______  /\_______  /____|  \____|__  /
        \/         \/                \/
[gopa] 0.10.0_SNAPSHOT
///last commit: 99616a2, Fri Oct 20 14:04:54 2017 +0200, medcl, update version to 0.10.0 ///

[10-21 16:01:09] [INF] [instance.go:23] workspace: data/gopa/nodes/0 [gopa] started.

Run ./gopa -h to get the full list of command-line options.

Example
➜  gopa git:(master) ✗ ./bin/gopa -h
  ________ ________ __________  _____
 /  _____/ \_____  \\______   \/  _  \
/   \  ___  /   |   \|     ___/  /_\  \
\    \_\  \/    |    \    |  /    |    \
 \______  /\_______  /____|  \____|__  /
        \/         \/                \/
[gopa] 0.10.0_SNAPSHOT
///last commit: 99616a2, Fri Oct 20 14:04:54 2017 +0200, medcl, update version to 0.10.0 ///

Usage of ./bin/gopa:
  -config string
        the location of config file (default "gopa.yml")
  -cpuprofile string
        write cpu profile to this file
  -daemon
        run in background as daemon
  -debug
        run in debug mode, gopa will quit with panic error
  -log string
        the log level,options:trace,debug,info,warn,error (default "info")
  -log_path string
        the log path (default "log")
  -memprofile string
        write memory profile to this file
  -pidfile string
        pidfile path (only for daemon)
  -pprof string
        enable and setup pprof/expvar service, eg: localhost:6060 , the endpoint will be: http://localhost:6060/debug/pprof/ and http://localhost:6060/debug/vars

Stop

It is safe to press Ctrl+C to stop a running Gopa. Gopa handles the rest and saves a checkpoint, so you can restore the job later; the world is still in your hands.

If you are running Gopa as daemon, you may stop it like this:

 kill -QUIT `pgrep gopa`

Configuration

UI

  • Search Console http://127.0.0.1:9000/
  • Admin Console http://127.0.0.1:9000/admin/

API

Architecture

What a Spider! GOPA Spider!

Who uses it?

Do you use GOPA and want to be listed here? Contact me.

License

Released under the Apache License, Version 2.0.
