fgksgf / GitHub-Trending-Crawler

Licence: other
Crawling GitHub Trending Pages every day

GitHub-Trending-Crawler

Crawling GitHub Trending Pages every day.

Introduction

Deploying this program on a Linux server is highly recommended. Every day it crawls GitHub for information about the trending repositories in the languages you are interested in. It then creates a markdown file recording that information and generates a word cloud image from the repositories' descriptions.
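
The output step described above can be illustrated with a minimal sketch. It is not the project's actual code: the record fields (`name`, `url`, `description`, `stars`), the function names, and the markdown layout are all assumptions made for illustration, and plain word counts stand in for the word cloud image, since a word cloud is drawn from frequencies like these.

```python
from collections import Counter
from datetime import date
import re

# Hypothetical sample records; the crawler's real data model is not shown in this README.
SAMPLE_REPOS = [
    {"name": "octo/alpha", "url": "https://github.com/octo/alpha",
     "description": "A fast web crawler", "stars": 120},
    {"name": "octo/beta", "url": "https://github.com/octo/beta",
     "description": "Fast data pipeline for crawler output", "stars": 45},
]


def render_markdown(language, repos, day):
    """Build one markdown section listing the repositories for a language."""
    lines = [f"## {language} ({day.isoformat()})", ""]
    for repo in repos:
        lines.append(
            f"- [{repo['name']}]({repo['url']}) ({repo['stars']} stars): "
            f"{repo['description']}"
        )
    return "\n".join(lines) + "\n"


def word_frequencies(repos):
    """Count lowercase words across all descriptions (the input a word cloud needs)."""
    words = []
    for repo in repos:
        words.extend(re.findall(r"[a-z]+", repo["description"].lower()))
    return Counter(words)


if __name__ == "__main__":
    print(render_markdown("python", SAMPLE_REPOS, date(2020, 2, 22)))
    print(word_frequencies(SAMPLE_REPOS).most_common(3))
```

Keeping rendering and counting as pure functions means both can be checked without network access; fetching the trending page would be a separate step.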

This crawler is designed to help me keep track of the latest technology trends and discover new and interesting repositories. In fact, reading the newest markdown file has become part of my daily routine. More importantly, it increases my GitHub contributions :P

The idea was inspired by LJ147.

Requirements

  • python 3.6+
  • git
  • screen
  • unzip

Configuration

Usage on Linux

$ sudo apt install -y unzip screen python3-pip
$ sudo apt-get install -y python-tk python3-tk

# the `release` branch is stable and contains only code
$ wget https://github.com/fgksgf/GitHub-Trending-Crawler/archive/release.zip
$ unzip release.zip
$ cd GitHub-Trending-Crawler-release/
$ mkdir img
$ git init
$ git remote add origin <YourGitHubRepoURL>

# using a virtual environment is highly recommended
$ pip3 install -r requirements.txt

  1. Switch to the repository directory and type screen at the command prompt. A screen session opens that looks exactly like the regular command prompt.

  2. Inside the screen session, you can work as you would in a normal CLI environment. Because screen is itself an application, it also has its own commands and parameters.

  3. Now run the program: python3 main.py -p -l

  4. While the program is running, press Ctrl + A and then d to detach the screen. You can then disconnect your SSH session.

  5. When you want to check the status of the crawler, reconnect to your server via SSH and run screen -r to restore the screen. For more information about the screen command, see its manual page.
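
The -p flag used in step 3 relies on the git remote configured during setup. As a rough sketch of what a commit-and-push cycle could look like, assuming the program shells out to git (the README does not say how it actually pushes), the hypothetical helpers below build the commands separately from running them. The commit message format and the choice of `git push origin HEAD` are illustrative assumptions.

```python
import subprocess
from datetime import date


def git_push_commands(day):
    """Return the git commands for one commit-and-push cycle (nothing is executed here)."""
    # Assumed commit message format, for illustration only.
    message = f"Update trending data for {day.isoformat()}"
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        # Push the current branch to the remote named "origin".
        ["git", "push", "origin", "HEAD"],
    ]


def run_commands(commands, cwd="."):
    """Run each command in order; check=True raises on the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)
```

Calling `run_commands(git_push_commands(date.today()))` inside the repository directory would perform the push. Separating command construction from execution keeps the first half testable without touching a real repository.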

CLI Options

python3 main.py (-h | --help)
python3 main.py (-v | --version)
python3 main.py [-l | --loop] [-p | --push] [--frequency=<f>]

Options:
  -h --help        Show this screen.
  -v --version     Show version.
  -l --loop        Run this program cyclically.
  -p --push        Use git to push the markdown and the image.
  --frequency=<f>  The frequency of crawling [default: daily].
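
To make the option semantics concrete, here is a small sketch of an equivalent parser. The change log states that the project uses docopt; this sketch swaps in the standard-library argparse so that it runs without extra dependencies. The `FREQUENCY_SECONDS` table and the `sleep_interval` helper are hypothetical, and "hourly" is an assumed value, since the README documents only the "daily" default.

```python
import argparse

# Hypothetical mapping; only "daily" is documented in the README.
FREQUENCY_SECONDS = {
    "daily": 24 * 60 * 60,
    "hourly": 60 * 60,  # assumed value, for illustration only
}


def build_parser():
    """Mirror the documented options (-h is added by argparse automatically)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Version string taken from the latest change log entry.
    parser.add_argument("-v", "--version", action="version", version="1.5")
    parser.add_argument("-l", "--loop", action="store_true",
                        help="Run this program cyclically.")
    parser.add_argument("-p", "--push", action="store_true",
                        help="Use git to push the markdown and the image.")
    parser.add_argument("--frequency", default="daily",
                        choices=sorted(FREQUENCY_SECONDS),
                        help="The frequency of crawling.")
    return parser


def sleep_interval(args):
    """Return the seconds to wait between runs, or None for a single run."""
    if not args.loop:
        return None
    return FREQUENCY_SECONDS[args.frequency]
```

For example, `build_parser().parse_args(["-p", "-l"])` sets both flags and leaves the frequency at its default, matching the command used in the usage steps.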

Change Log

V1.5 (2020-02-22)

  • Refactor code with object-oriented methods
  • Split single python file into several files
  • Improve exception handling
  • Add logging feature
  • Use docopt to enhance command-line usage
  • Update requirements