fgksgf / GitHub-Trending-Crawler

Licence: other

Crawling GitHub Trending Pages every day

Projects that are alternatives of or similar to GitHub-Trending-Crawler

Web spider framework built with puppeteer: supports flexible task queues and task scheduling via decorators, convenient data storage (nedb / mongodb), and data visualization and user interaction.

Stars: ✭ 237 (+330.91%)

Mutual labels: spider

imdb-spider

scrapy spider for scraping imdb {movie_id: [recommended, ...]}

Stars: ✭ 23 (-58.18%)

Mutual labels: spider

server init harden

Server hardening on 1st login as "root"

Stars: ✭ 75 (+36.36%)

Mutual labels: linux-server

Fast Lianjia Crawler

An extremely fast crawler that fetches data directly through the Lianjia API, the fastest in the universe~~ 🚀

Stars: ✭ 247 (+349.09%)

Mutual labels: spider

python-spider

Learn Python web crawling from scratch

Stars: ✭ 31 (-43.64%)

Mutual labels: spider

young-crawler

Distributed web crawler written in Scala with actors

Stars: ✭ 15 (-72.73%)

Mutual labels: spider

Article spider

WeChat official account crawler

Stars: ✭ 235 (+327.27%)

Mutual labels: spider

bilibili-smallvideo

🕷️ Crawls the top 100 short videos on Bilibili

Stars: ✭ 133 (+141.82%)

Mutual labels: spider

weaver

A spider tapestry weaver

Stars: ✭ 72 (+30.91%)

Mutual labels: spider

Spider

News crawler app

Stars: ✭ 24 (-56.36%)

Mutual labels: spider

Magic google

Google search results crawler, get google search results that you need

Stars: ✭ 247 (+349.09%)

Mutual labels: spider

dht-spider

A simple BT magnet-link crawler based on the DHT protocol

Stars: ✭ 16 (-70.91%)

Mutual labels: spider

BaiduSpider

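The project has moved to https://github.com/BaiduSpider/BaiduSpider ! A crawler for Baidu search results. It currently supports Baidu web search, image search, Zhidao search, video search, news search, Wenku search, Jingyan search, and Baike search.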

Stars: ✭ 29 (-47.27%)

Mutual labels: spider

Core

🔞 JAVClub - so your favorite ladies never go missing again

Stars: ✭ 2,728 (+4860%)

Mutual labels: spider

spider

Crawler for China Judgments Online (裁判文书网)

Stars: ✭ 19 (-65.45%)

Mutual labels: spider

Killshot

A Penetration Testing Framework, Information gathering tool & Website Vulnerability Scanner

Stars: ✭ 237 (+330.91%)

Mutual labels: spider

simpyder

Ultra-fast asynchronous coroutine-based Python crawler

Stars: ✭ 74 (+34.55%)

Mutual labels: spider

scrapy helper

Dynamically configurable crawler

Stars: ✭ 84 (+52.73%)

Mutual labels: spider

linux-server-administration-scripts

Simple bash administration scripts for Linux to make your life easier.

Stars: ✭ 47 (-14.55%)

Mutual labels: linux-server

ben-ben-spider

Ben-Ben crawler

Stars: ✭ 36 (-34.55%)

Mutual labels: spider

GitHub-Trending-Crawler

Crawling GitHub Trending Pages every day.

Introduction

Deploying the program on a Linux server is highly recommended. Every day it crawls information about popular GitHub repositories in the languages you are interested in, then creates a markdown file recording that information and generates a word cloud image from the repositories' descriptions.

This crawler is designed to help me keep track of the latest trends in technology and discover new and interesting repositories. In fact, reading the newest markdown file has become part of my daily routine. More importantly, it increases my GitHub contributions :P

The idea was inspired by LJ147.
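
As a rough illustration of the two outputs described above (a markdown record and the word counts a word cloud is drawn from), here is a minimal sketch. It is not the project's actual code: the function names, the dictionary keys, the stop-word list, and the sample repositories are all assumptions made for the example.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical stop-word list; word cloud libraries normally ship their own.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def render_markdown(repos, day):
    """Render one day's repositories as a markdown document.

    `repos` is a list of dicts with the assumed keys
    'name', 'url', 'language' and 'description'.
    """
    lines = [f"# GitHub Trending {day.isoformat()}", ""]
    for repo in repos:
        lines.append(
            f"- [{repo['name']}]({repo['url']}) "
            f"({repo['language']}): {repo['description']}"
        )
    return "\n".join(lines) + "\n"


def word_frequencies(repos):
    """Count words across all descriptions, skipping stop words."""
    counts = Counter()
    for repo in repos:
        words = re.findall(r"[a-z]+", repo["description"].lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Made-up sample data, standing in for crawled results.
sample = [
    {"name": "octo/crawler", "url": "https://example.com/octo/crawler",
     "language": "Python", "description": "A fast crawler for the web"},
    {"name": "octo/cloud", "url": "https://example.com/octo/cloud",
     "language": "Go", "description": "Crawler dashboards in the cloud"},
]

print(render_markdown(sample, date(2020, 2, 22)))
print(word_frequencies(sample).most_common(3))
```

A word cloud library would take the frequency table (or the raw descriptions) as input and render the image; that step is left out here because it depends on the library chosen.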

Requirements

python 3.6+
git
screen
unzip

Configuration

Fork my repo or create your own repo for uploading markdown files.
If you don't have SSH keys, generate a new SSH key and add it to the ssh-agent.
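
To make the role of the upload repository concrete, the following sketch builds the kind of git commands a push step might run once the markdown file and image exist. It is an assumption-laden illustration, not the project's implementation: the function name, file names, commit message, remote name, and branch name are all hypothetical.

```python
import shlex


def build_push_commands(files, message, remote="origin", branch="master"):
    """Return git commands (as argument lists) for publishing `files`.

    The remote, branch and message defaults are illustrative assumptions.
    """
    return [
        ["git", "add", *files],
        ["git", "commit", "-m", message],
        ["git", "push", remote, branch],
    ]


commands = build_push_commands(
    ["2020-02-22.md", "img/2020-02-22.png"],  # hypothetical output files
    "Add 2020-02-22 report",
)
for cmd in commands:
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually execute the commands, each list could be passed to `subprocess.run(cmd, check=True)` from inside the repository directory. Pushing over SSH without a password prompt is why the SSH key setup above matters.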

Usage on Linux

$ sudo apt install -y unzip screen python3-pip
$ sudo apt-get install -y python-tk python3-tk

# the `release` branch is stable and contains only code.
$ wget https://github.com/fgksgf/GitHub-Trending-Crawler/archive/release.zip
$ unzip release.zip
$ cd GitHub-Trending-Crawler-release/
$ mkdir img
$ git init
$ git remote add origin <YourGitHubRepoURL>

# using a virtual environment is highly recommended
$ pip3 install -r requirements.txt

Switch to the repository directory and type screen at the command prompt. A screen session opens that looks exactly like the normal command prompt.
Inside the screen session, you can work just as you would in a normal CLI environment. Because screen is an application, it also has its own commands and parameters.
Now run the program: python3 main.py -p -l
While the program is running, press Ctrl + A and then d to detach the screen session. You can then disconnect your SSH session.
When you want to check the status of the crawler, reconnect to your server via SSH and run screen -r to restore the session. For more information about the screen command, see its documentation.

CLI Options

python3 main.py (-h | --help)
python3 main.py (-v | --version)
python3 main.py [-l | --loop] [-p | --push] [--frequency=<f>]

Options:
  -h --help        Show this screen.
  -v --version     Show version.
  -l --loop        Run this program cyclically.
  -p --push        Use git to push the markdown and the image.
  --frequency=<f>  The frequency of crawling [default: daily].
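
The change log says the project uses docopt for this interface. For readers who want to experiment without installing it, the sketch below mirrors the same options with `argparse` from the standard library instead. It is a substitute technique, not the project's code; the `build_parser` name and the version string are assumptions (the latter taken from the V1.5 change log entry).

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed above."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Version string is an assumption based on the V1.5 change log entry.
    parser.add_argument("-v", "--version", action="version",
                        version="%(prog)s 1.5")
    parser.add_argument("-l", "--loop", action="store_true",
                        help="Run this program cyclically.")
    parser.add_argument("-p", "--push", action="store_true",
                        help="Use git to push the markdown and the image.")
    parser.add_argument("--frequency", default="daily", metavar="<f>",
                        help="The frequency of crawling.")
    return parser


# Same flags as the command shown in the usage section.
args = build_parser().parse_args(["-p", "-l"])
print(args.loop, args.push, args.frequency)
```

`argparse` adds `-h` / `--help` automatically, so it is not declared explicitly. Unlike docopt, this parser does not validate the value given to `--frequency`; which values the project accepts besides `daily` is not stated in this document.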

Change Logs

V1.5 (2020-02-22)

Refactor code with object-oriented methods
Split single python file into several files
Improve exception handling
Add logging feature
Use docopt to enhance command-line usage
Update requirements

