khuyentran1401 / top-github-scraper

Licence: other
Scrape top GitHub repositories and users based on keywords

Programming Languages

HTML
75241 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to top-github-scraper

Scrapple
A framework for creating semi-automatic web content extractors
Stars: ✭ 464 (+1060%)
Mutual labels:  scraping, web-scraper, web-scraping
Scrape Linkedin Selenium
`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
Stars: ✭ 239 (+497.5%)
Mutual labels:  scraping, web-scraper, web-scraping
Detect Cms
PHP Library for detecting CMS
Stars: ✭ 78 (+95%)
Mutual labels:  scraping, web-scraper, web-scraping
Phpscraper
PHP Scraper - a highly opinionated web-interface for PHP
Stars: ✭ 148 (+270%)
Mutual labels:  scraping, web-scraper, web-scraping
papercut
Papercut is a scraping/crawling library for Node.js built on top of JSDOM. It provides basic selector features together with features like Page Caching and Geosearch.
Stars: ✭ 15 (-62.5%)
Mutual labels:  scraping, web-scraping
Gopa
[WIP] GOPA, a spider written in Golang, for Elasticsearch. DEMO: http://index.elasticsearch.cn
Stars: ✭ 277 (+592.5%)
Mutual labels:  scraping, web-scraping
Autoscraper
A Smart, Automatic, Fast and Lightweight Web Scraper for Python
Stars: ✭ 4,077 (+10092.5%)
Mutual labels:  scraping, web-scraping
Web Scraping
Detailed web scraping tutorials for dummies with financial data crawlers on Reddit WallStreetBets, CME (both options and futures), US Treasury, CFTC, LME, SHFE and news data crawlers on BBC, Wall Street Journal, Al Jazeera, Reuters, Financial Times, Bloomberg, CNN, Fortune, The Economist
Stars: ✭ 153 (+282.5%)
Mutual labels:  web-scraper, web-scraping
Humanoid
Node.js package to bypass CloudFlare's anti-bot JavaScript challenges
Stars: ✭ 88 (+120%)
Mutual labels:  scraping, web-scraping
Sqrape
Simple Query Scraping with CSS and Go Reflection (MOVED to Gitlab)
Stars: ✭ 144 (+260%)
Mutual labels:  scraping, web-scraping
ioweb
Web Scraping Framework
Stars: ✭ 31 (-22.5%)
Mutual labels:  scraping, web-scraping
Apify Js
Apify SDK — The scalable web scraping and crawling library for JavaScript/Node.js. Enables development of data extraction and web automation jobs (not only) with headless Chrome and Puppeteer.
Stars: ✭ 3,154 (+7785%)
Mutual labels:  scraping, web-scraping
raspagem-de-dados-fatec
📓 Short course on web data scraping with Python, taught at the FATEC Jundiaí Technology Week
Stars: ✭ 22 (-45%)
Mutual labels:  scraping, web-scraping
trafilatura
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Stars: ✭ 711 (+1677.5%)
Mutual labels:  scraping, web-scraping
selectorlib
A library to read a YML file with Xpath or CSS Selectors and extract data from HTML pages using them
Stars: ✭ 53 (+32.5%)
Mutual labels:  scraping, web-scraping
browser-pool
A Node.js library to easily manage and rotate a pool of web browsers, using any of the popular browser automation libraries like Puppeteer, Playwright, or SecretAgent.
Stars: ✭ 71 (+77.5%)
Mutual labels:  scraping, web-scraping
Daftlistings
A library that enables programmatic interaction with daft.ie. Daft.ie has nationwide coverage and contains about 80% of the total available properties in Ireland.
Stars: ✭ 86 (+115%)
Mutual labels:  web-scraper, web-scraping
Html Metadata
MetaData html scraper and parser for Node.js (supports Promises and callback style)
Stars: ✭ 129 (+222.5%)
Mutual labels:  web-scraper, web-scraping
PythonScrapyBasicSetup
Basic setup with random user agents and IP addresses for Python Scrapy Framework.
Stars: ✭ 57 (+42.5%)
Mutual labels:  scraping, web-scraping
Linkedin-Client
Web scraper for grabbing data from LinkedIn profiles or company pages (personal project)
Stars: ✭ 42 (+5%)
Mutual labels:  web-scraper, web-scraping

Medium article

Top Github Scraper

Scrape top GitHub repositories and users based on keywords.

I used this tool to analyze the top 1k machine learning users and create an interactive map to search for users based on their location.

Setup

Installation

pip install top-github-scraper

Add Credentials

To make sure you can scrape many repositories and users, add your GitHub credentials to a .env file.

touch .env

Add your username and token to the .env file:

GITHUB_USERNAME=yourusername
GITHUB_TOKEN=yourtoken
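
How the package itself reads these values is not shown here. As a minimal sketch, assuming the file holds plain KEY=VALUE lines, the snippet below parses that text and builds an HTTP Basic authorization header from the username and token. The helper names `parse_env` and `basic_auth_header` are hypothetical and are not part of top-github-scraper.

```python
import base64


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def basic_auth_header(username, token):
    """Build an HTTP Basic Authorization header from a username and token."""
    raw = f"{username}:{token}".encode()
    return {"Authorization": "Basic " + base64.b64encode(raw).decode()}


# Placeholder values copied from the example above, not real credentials.
sample_env = "GITHUB_USERNAME=yourusername\nGITHUB_TOKEN=yourtoken\n"
env = parse_env(sample_env)
header = basic_auth_header(env["GITHUB_USERNAME"], env["GITHUB_TOKEN"])
print(sorted(env))
```

Keep the .env file out of version control so the token is not published with your code.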

Usage

View full documentation here.

Get Top Github Repositories' URLs

from top_github_scraper import get_top_repo_urls

get_top_repo_urls(keyword="machine learning", stop_page=10)

Output at top_repo_urls_<keyword>_<sort_by>_<start_page>_<end_page>.json:

[
    "/josephmisiti/awesome-machine-learning",
    "/wepe/MachineLearning",
    "/udacity/machine-learning",
    "/Jack-Cherish/Machine-Learning",
    "/ZuzooVn/machine-learning-for-software-engineers",
    "/rasbt/python-machine-learning-book",
    "/lawlite19/MachineLearning_Python",
    "/lazyprogrammer/machine_learning_examples",
    "/trekhleb/homemade-machine-learning",
    "/ujjwalkarn/Machine-Learning-Tutorials"
]
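
The saved entries are site-relative paths rather than full URLs. A small sketch of turning them into absolute links, assuming they are relative to https://github.com; `to_full_urls` is a hypothetical helper, and the sample list is a two-item excerpt of the output above.

```python
import json


def to_full_urls(paths, base="https://github.com"):
    """Prefix each site-relative repository path with the base URL."""
    return [base + path for path in paths]


# Two entries copied from the example output; in practice, load the JSON file instead.
sample = json.loads('["/josephmisiti/awesome-machine-learning", "/wepe/MachineLearning"]')
for url in to_full_urls(sample):
    print(url)
```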

Get Top Github Repositories' Information

from top_github_scraper import get_top_repos

get_top_repos("machine learning", stop_page=10)

Output for one repository at top_repo_info_<keyword>_<sort_by>_<start_page>_<end_page>.json:

{
        "stargazers_count": 48620,
        "forks_count": 12155,
        "contributors": {
            "login": [
                "josephmisiti",
                "josephmmisiti",
                "hslatman",
                "0asa",
                "ajkl",
                "ipcenas",
                "cogmission",
                "spekulatius",
                "basickarl",
                "NathanEpstein"
            ],
            "url": [
                "https://api.github.com/users/josephmisiti",
                "https://api.github.com/users/josephmmisiti",
                "https://api.github.com/users/hslatman",
                "https://api.github.com/users/0asa",
                "https://api.github.com/users/ajkl",
                "https://api.github.com/users/ipcenas",
                "https://api.github.com/users/cogmission",
                "https://api.github.com/users/spekulatius",
                "https://api.github.com/users/basickarl",
                "https://api.github.com/users/NathanEpstein"
            ],
            "contributions": [
                671,
                105,
                21,
                12,
                11,
                9,
                8,
                7,
                7,
                7
            ]
        }
    }
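
In this output the contributors are stored column-wise: three parallel lists (login, url, contributions) in which position i describes the same person. A sketch of converting that layout into one record per contributor, using the first three entries from the example; `contributor_rows` is a hypothetical helper, not a function of the package.

```python
def contributor_rows(contributors):
    """Convert parallel lists keyed by field name into a list of per-person dicts."""
    keys = list(contributors)
    columns = [contributors[key] for key in keys]
    return [dict(zip(keys, values)) for values in zip(*columns)]


# First three contributors from the example output above.
sample = {
    "login": ["josephmisiti", "josephmmisiti", "hslatman"],
    "url": [
        "https://api.github.com/users/josephmisiti",
        "https://api.github.com/users/josephmmisiti",
        "https://api.github.com/users/hslatman",
    ],
    "contributions": [671, 105, 21],
}

for row in contributor_rows(sample):
    print(row["login"], row["contributions"])
```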

Get Top Github Contributors' Profiles

from top_github_scraper import get_top_contributors

get_top_contributors("machine learning", stop_page=10)

Output at top_contributor_info_<keyword>_<sort_by>_<start_page>_<end_page>.csv:

login url type name company location email hireable bio public_repos public_gists followers following
0 josephmisiti https://api.github.com/users/josephmisiti User Joseph Misiti Math & Pencil "Brooklyn, NY" True Mathematician & Co-founder of Math & Pencil 229 142 2705 275
1 josephmmisiti https://api.github.com/users/josephmmisiti User 0 0 2 0
2 hslatman https://api.github.com/users/hslatman User Herman Slatman DistributIT 133 20 469 67
3 0asa https://api.github.com/users/0asa User Vincent Botta Belgium "Innovation Engineer @evs-broadcast, previously Data Scientist @kensuio, E-Marketing Tools Manager @Diagenode, cofounder @Antibody-Adviser and photographer" 35 15 25 16
4 ajkl https://api.github.com/users/ajkl User Ajinkya Kale [email protected] 58 1 29 4
5 ipcenas https://api.github.com/users/ipcenas User 79 0 1 0
6 cogmission https://api.github.com/users/cogmission User David Ray Third planet from the sun... [email protected] Humanity's freedom and abundance through the pursuit of technological innovation in the area of cognitive applications - Cognition Mission 30 19 54 44
7 spekulatius https://api.github.com/users/spekulatius User Peter Thaleikis @bringyourownideas 127.0.0.1 True Software engineer focused on solutions using open source and simply filling in the gaps to fulfill the requirements. 42 1 232 920
8 basickarl https://api.github.com/users/basickarl User Karl Morrison "Malmö, Sweden" [email protected] The question is: Will you take me seriously 5 1 12 6
9 NathanEpstein https://api.github.com/users/NathanEpstein User Nathan Epstein "New York, NY" [email protected] True 23 12 208 0

Get Top Github Users' Profiles

from top_github_scraper import get_top_users

get_top_users("machine learning", stop_page=10)

Output at top_user_info_<keyword>_<start_page>_<end_page>.csv:

login url type name company location email hireable bio public_repos public_gists followers following
0 rasbt https://api.github.com/users/rasbt User Sebastian Raschka UW-Madison "Madison, WI" "Machine Learning researcher & open source contributor. Author of ""Python Machine Learning."" Asst. Prof. of Statistics @ UW-Madison." 71 5 13888 35
1 tqchen https://api.github.com/users/tqchen User Tianqi Chen "CMU, OctoML" Large scale Machine Learning 28 1 8611 126
2 halfrost https://api.github.com/users/halfrost User halfrost @Alibaba Shanghai China [email protected] 💪天道酬勤,勤能补拙。博观而约取,厚积而薄发。Gopher / Rustacean / iOS Dev. / Machine Learning / Retired acmer / Math / Philosophy / Technical Writer. 22 0 8566 314
3 ageron https://api.github.com/users/ageron User Aurélien Geron Paris Author of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. Former PM of YouTube video classification and founder & CTO of a telco operator. 43 16 8383 2
4 chiphuyen https://api.github.com/users/chiphuyen User Chip Huyen https://snorkel.ai "Mountain View, CA" True Developing tools and best practices for machine learning production. 19 1 7839 15
5 rhiever https://api.github.com/users/rhiever User Randy Olson FOXO BioScience "Vancouver, WA" [email protected] "Chief Data Scientist, @FOXOBioScience. AI, Machine Learning, and Data Visualization specialist. Community leader for /r/DataIsBeautiful." 77 17 5363 13
6 lexfridman https://api.github.com/users/lexfridman User Lex Fridman MIT "Cambridge, MA" "AI researcher working on autonomous vehicles, human-robot interaction, and machine learning at MIT and beyond." 2 0 5031 0
7 eriklindernoren https://api.github.com/users/eriklindernoren User Erik Linder-Norén "Stockholm, Sweden" [email protected] "ML engineer at Apple. Excited about machine learning, basketball and building things." 24 0 3764 11
8 roboticcam https://api.github.com/users/roboticcam User A/Prof Richard Xu 徐亦达教授 University of Technology Sydney Sydney Australia "I am an A/Professor in Machine Learning at UTS. manage a large research team of postdoc, PhD students close to 30 people" 10 0 3561 0
9 ogrisel https://api.github.com/users/ogrisel User Olivier Grisel Inria "Paris, France" [email protected] Machine Learning Engineer a Inria Saclay (Parietal team). 174 93 3237 116
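
Once the CSV is on disk it can be filtered with the standard library. The sketch below assumes the file is comma-separated with an unnamed leading index column, which is a guess based on the rendering above rather than something the documentation states. The two sample rows are abbreviated from the output (bio left empty), and `users_with_min_followers` is a hypothetical helper.

```python
import csv
import io

# Abbreviated sample modeled on the output above; the exact file layout is an assumption.
SAMPLE = """\
,login,url,type,name,company,location,email,hireable,bio,public_repos,public_gists,followers,following
0,rasbt,https://api.github.com/users/rasbt,User,Sebastian Raschka,UW-Madison,"Madison, WI",,,,71,5,13888,35
6,lexfridman,https://api.github.com/users/lexfridman,User,Lex Fridman,MIT,"Cambridge, MA",,,,2,0,5031,0
"""


def users_with_min_followers(csv_file, minimum):
    """Return the logins of users whose follower count is at least `minimum`."""
    reader = csv.DictReader(csv_file)
    return [row["login"] for row in reader if int(row["followers"]) >= minimum]


print(users_with_min_followers(io.StringIO(SAMPLE), 10000))
```

To run this against a real output file, pass an open file object (opened with newline="") in place of the `io.StringIO` sample.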

Parameters

View a full list of parameters here.

How the Data is Scraped

top-github-scraper scrapes the owners and contributors of the top repositories that GitHub returns when you search for a specific keyword.
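
The idea of gathering owners and contributors can be sketched as follows. This is an illustration, not the package's actual implementation: `collect_logins` is a hypothetical helper, and `contributors_by_repo` stands in for whatever contributor lookup the scraper performs.

```python
def collect_logins(repo_paths, contributors_by_repo):
    """Collect unique owner and contributor logins, keeping first-seen order."""
    seen = {}
    for path in repo_paths:
        # A repository path looks like "/owner/name"; the owner is the first segment.
        owner = path.strip("/").split("/")[0]
        seen.setdefault(owner, None)
        for login in contributors_by_repo.get(path, []):
            seen.setdefault(login, None)
    return list(seen)


# Repository paths and logins taken from the example outputs in this README.
repo_paths = ["/josephmisiti/awesome-machine-learning", "/wepe/MachineLearning"]
contributors_by_repo = {
    "/josephmisiti/awesome-machine-learning": ["josephmisiti", "josephmmisiti", "hslatman"],
}

print(collect_logins(repo_paths, contributors_by_repo))
```

Deduplicating matters because a repository owner usually also appears in that repository's contributor list.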

For each user, top-github-scraper scrapes the following 13 data points:

  • login: username
  • url: URL of the user
  • type: Whether this account is a user or an organization
  • name: Name of the user
  • company: User's company
  • location: User's location
  • email: User's email
  • hireable: Whether the user is hireable
  • bio: Short description of the user
  • public_repos: Number of public repositories the user has (including forked repositories)
  • public_gists: Number of public gists the user has (including forked gists)
  • followers: Number of followers the user has
  • following: Number of people the user is following
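
The field names listed above match keys in the user object returned by the GitHub REST API. A sketch of reducing such a profile dictionary to just these fields; `select_user_fields` is a hypothetical helper, and the sample profile is abbreviated from the output shown earlier with one extra key added for illustration.

```python
USER_FIELDS = (
    "login", "url", "type", "name", "company", "location", "email",
    "hireable", "bio", "public_repos", "public_gists", "followers", "following",
)


def select_user_fields(profile):
    """Keep only the listed fields; fields absent from the profile become None."""
    return {field: profile.get(field) for field in USER_FIELDS}


# Abbreviated profile; "id" is an illustrative extra key that gets dropped.
profile = {
    "login": "rasbt",
    "id": 1,
    "url": "https://api.github.com/users/rasbt",
    "type": "User",
    "followers": 13888,
}

record = select_user_fields(profile)
print(record["login"], record["followers"], record["email"])
```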