Find all contributions for a user through the GitHub Archive

OpenSourceContributo.rs

Note about name change: This project was formerly known as githubcontributions.io. GitHub requested that the name of the project be changed in order to avoid confusion about who owns and maintains this project.

This is a utility to find a list of all contributions a user has made to any public repository on GitHub from 2011-01-01 through yesterday.

Data from 2015-01-01 to the present comes from GitHub Archive. Earlier data uses a different schema and was obtained from Google's BigQuery (see below).

As of 2015-08-28, it tracks a total of

% cd /github-archive/processed
% gzip -l *.json.gz | awk 'END{print $2}' | numfmt --to=iec-i --suffix=B --format="%3f"
93GiB
% zcat *.json.gz | wc -l
253027947

events.

db.contributions.stats():

{
  "ns" : "contributions.contributions",
  "count" : 284048099,
  "size" : 113714359272,
  "avgObjSize" : 400,
  "storageSize" : 47820357632,
  "capped" : false,
  "nindexes" : 4,
  "totalIndexSize" : 8810385408,
  "indexSizes" : {
    "_id_" : 2804744192,
    "_user_lower_1" : 2275647488,
    "_event_id_1" : 1029251072,
    "created_at_1" : 2700742656
  },
  "ok" : 1
}

(WiredTiger stats omitted)
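
As a quick sanity check on the figures above, the derived numbers can be recomputed from the raw ones. This is only an illustrative sketch: the values are copied from the stats output, and the variable names are chosen here for readability.

```python
# Values copied from the db.contributions.stats() output above.
count = 284048099
size = 113714359272
storage_size = 47820357632
total_index_size = 8810385408
index_sizes = {
    "_id_": 2804744192,
    "_user_lower_1": 2275647488,
    "_event_id_1": 1029251072,
    "created_at_1": 2700742656,
}

# avgObjSize is the uncompressed size divided by the document count,
# truncated to a whole number of bytes.
avg_obj_size = size // count

# Fraction of the uncompressed size actually occupied on disk.
storage_ratio = storage_size / size

# The per-index sizes should add up to totalIndexSize.
index_sum = sum(index_sizes.values())

print(avg_obj_size, round(storage_ratio, 2), index_sum == total_index_size)
```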

Processing data archives

Processing the data archives involves 3 steps:

  1. Download the raw events files from GitHub Archive into the events directory
  2. Transform the events files by filtering non-contribution events (e.g., starring a repository) and adding necessary indexable keys (e.g., lowercased username)
  3. Load the transformed data into MongoDB

The archive-processor tool in the util directory handles all of this.

The transformed data from step 2 is compressed and saved just in case we need to re-load the entire database (these files are much smaller than the raw data).
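
The transform step (step 2) can be sketched roughly as follows. This is not the archive-processor code, just a minimal illustration under stated assumptions: the list of contribution event types is borrowed from the BigQuery query further down, the key names _user_lower and _event_id are inferred from the index names in the stats output above, and the function names are hypothetical.

```python
import json

# Assumption: the same event types used in the BigQuery query below
# count as "contributions" for the Events API data too.
CONTRIBUTION_TYPES = frozenset({
    "GollumEvent",
    "IssuesEvent",
    "PushEvent",
    "CommitCommentEvent",
    "ReleaseEvent",
    "PublicEvent",
    "MemberEvent",
    "IssueCommentEvent",
})


def transform_event(event):
    """Return a copy of event with indexable keys added, or None to drop it."""
    if event.get("type") not in CONTRIBUTION_TYPES:
        return None  # e.g. WatchEvent (starring a repository)
    login = (event.get("actor") or {}).get("login")
    if not login:
        return None  # no user to attribute the contribution to
    transformed = dict(event)
    transformed["_user_lower"] = login.lower()    # hypothetical key name
    transformed["_event_id"] = event.get("id")    # hypothetical key name
    return transformed


def transform_lines(lines):
    """Transform newline-delimited JSON events, skipping filtered ones."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        transformed = transform_event(json.loads(line))
        if transformed is not None:
            yield json.dumps(transformed)
```

Feeding it the lines of a decompressed events file (for instance via gzip.open(path, "rt")) yields the lines that would be written to the transformed file.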

All of this can be done automatically by setting the correct environment variables, then running archive-processor process, or it can be invoked differently to separate the steps or change the working directories. Run archive-processor --help for details.

Environment Variable    Meaning
GHC_EVENTS_PATH         Contains data from 2015-01-01 to present (.json.gz)
GHC_TIMELINE_PATH       Contains data before 2015-01-01 (.csv.gz)
GHC_TRANSFORMED_PATH    Contains output of "transform" operation (.json.gz)
GHC_LOADED_PATH         Links to files in GHC_TRANSFORMED_PATH when loaded to DB
GHC_LOG_PATH            Each invocation of archive-processor logs to here
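
Before a long processing run, it may help to confirm that these variables are set. The following is a small sketch of such a check, not part of archive-processor itself; it assumes all five variables are required, which the tool may not actually enforce, and the function name is hypothetical.

```python
import os

# Assumption: every variable in the table above must be set.
REQUIRED_VARS = (
    "GHC_EVENTS_PATH",
    "GHC_TIMELINE_PATH",
    "GHC_TRANSFORMED_PATH",
    "GHC_LOADED_PATH",
    "GHC_LOG_PATH",
)


def missing_vars(environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Report anything missing from the current environment.
for name in missing_vars(os.environ):
    print("not set:", name)
```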

BigQuery Data Sets

For the data from 2011-2014 (actually, 2008-08-25 01:07:06 to 2014-12-31 23:59:59), the GitHub Archive project recorded data from the (now deprecated) Timeline API. This is in a different format and has many more quirks than the new GitHub Events API. To obtain this data, the following BigQuery query was used (which took only 47.5s to run):

SELECT
  -- common fields
  created_at, actor, repository_owner, repository_name, repository_organization, type, url,
  -- specific to type
  payload_page_html_url,     -- GollumEvent
  payload_page_summary,      -- GollumEvent
  payload_page_page_name,    -- GollumEvent
  payload_page_action,       -- GollumEvent
  payload_page_title,        -- GollumEvent
  payload_page_sha,          -- GollumEvent
  payload_number,            -- IssuesEvent
  payload_action,            -- MemberEvent, IssuesEvent, ReleaseEvent, IssueCommentEvent
  payload_member_login,      -- MemberEvent
  payload_commit_msg,        -- PushEvent
  payload_commit_email,      -- PushEvent
  payload_commit_id,         -- PushEvent
  payload_head,              -- PushEvent
  payload_ref,               -- PushEvent
  payload_comment_commit_id, -- CommitCommentEvent
  payload_comment_path,      -- CommitCommentEvent
  payload_comment_body,      -- CommitCommentEvent
  payload_issue_id,          -- IssueCommentEvent
  payload_comment_id         -- IssueCommentEvent
FROM (
  TABLE_QUERY(githubarchive:year,'true') -- All the years!
)
WHERE type IN (
  "GollumEvent",
  "IssuesEvent",
  "PushEvent",
  "CommitCommentEvent",
  "ReleaseEvent",
  "PublicEvent",
  "MemberEvent",
  "IssueCommentEvent"
)

If you actually want to use this data, there's no need to run that query; just ask me for the CSVs. When gzipped, they are about 19GB.

Erroneous data

There is a lot of data in the archives that just doesn't make sense. Where I can, I've worked around it, for example by parsing needed data out of the event's URL. Here are some of the issues:

BigQuery exports CSV nulls weird?

Example:

SELECT *
FROM [githubarchive:year.2014]
LIMIT 1000

In the results pane of Google's BigQuery page, the string "null" appears where a real null value is meant, and that string makes its way into the exported CSV. So you should export the table the real way rather than from the results pane, or you will have the string "null" for almost every value.
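
If you already have a CSV with the literal string "null" in it, one possible workaround is to map that string back to a real null while reading. This is a rough sketch, not what this project does; the function name is hypothetical, and it cannot tell an exported null apart from a field whose real value is the text "null" (a commit message, say).

```python
import csv
import io


def read_rows_with_real_nulls(csv_file):
    """Yield each CSV row as a dict, replacing the string "null" with None."""
    for row in csv.DictReader(csv_file):
        yield {
            key: (None if value == "null" else value)
            for key, value in row.items()
        }


# Tiny in-memory example using two of the column names from the query above.
sample = io.StringIO("actor,repository_name\nhut8,null\n")
rows = list(read_rows_with_real_nulls(sample))
```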

PushEvent with no repository name (Timeline API)

Example:

SELECT *
FROM [githubarchive:year.2014]
WHERE payload_head='8824ed4d86f587a2a556248d9abfac790a1cbd3f'
LIMIT 1

It seems like sometimes, the only way to get the real repository name (owner/project) is to parse it from the URL.
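
Parsing the repository name out of a URL might look something like this. It is a minimal sketch that assumes the event URL has the form https://github.com/owner/project/..., and the function name is hypothetical. It returns None when the owner or project segment is empty, as in https://github.com//.

```python
from urllib.parse import urlparse


def repo_from_url(url):
    """Return "owner/project" parsed from a github.com URL, or None."""
    parsed = urlparse(url)
    if parsed.netloc.lower() != "github.com":
        return None
    # A path like "/owner/project/compare/..." splits into
    # ["", "owner", "project", "compare", ...].
    parts = parsed.path.split("/")
    if len(parts) < 3 or not parts[1] or not parts[2]:
        return None
    return parts[1] + "/" + parts[2]
```

The check for empty segments matters because some Timeline API rows carry no usable repository in the URL either, so callers still need to handle a None result.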

PushEvent with no way of figuring out the repository (Timeline API)

Example:

SELECT *
FROM [githubarchive:year.2011]
WHERE payload_head='32b2177f05be005df3542c14d9a9985be2b553f7'
LIMIT 5

For each of these, repository_url is https://github.com// and repository_name is /. They actually push to https://github.com/Jiyambi/WoW-Pro-Guides, but I only know that from reading the commit messages.

Credits

Created by @hut8 and maintained by Tenex Developers (@tenex).
