src-d / Datasets

Licence: other
source{d} datasets ("big code") for source code analysis and machine learning on source code

Projects that are alternatives to or similar to Datasets

Openml R
R package to interface with OpenML
Stars: ✭ 81 (-64.94%)
Mutual labels:  jupyter-notebook, dataset, datasets
Corus
Links to Russian corpora + Python functions for loading and parsing
Stars: ✭ 154 (-33.33%)
Mutual labels:  jupyter-notebook, datasets
Lacmus
Lacmus is a cross-platform application that helps to find people who are lost in the forest using computer vision and neural networks.
Stars: ✭ 142 (-38.53%)
Mutual labels:  jupyter-notebook, dataset
Shape Detection
🟣 Object detection of abstract shapes with neural networks
Stars: ✭ 170 (-26.41%)
Mutual labels:  jupyter-notebook, dataset
Coronawatchnl
Numbers concerning COVID-19 disease cases in The Netherlands by RIVM, LCPS, NICE, ECML, and Rijksoverheid.
Stars: ✭ 135 (-41.56%)
Mutual labels:  jupyter-notebook, dataset
Datasets
🎁 3,000,000+ Unsplash images made available for research and machine learning
Stars: ✭ 1,805 (+681.39%)
Mutual labels:  jupyter-notebook, dataset
Cifar 10.1
Release of CIFAR-10.1, a new test set for CIFAR-10.
Stars: ✭ 166 (-28.14%)
Mutual labels:  jupyter-notebook, dataset
Aesthetics
Image Aesthetics Toolkit - includes Fisher Vector implementation, AVA (Image Aesthetic Visual Analysis) dataset and fast multi-threaded downloader
Stars: ✭ 113 (-51.08%)
Mutual labels:  dataset, datasets
Fifa18 All Player Statistics
A complete catalog of all the players in Fifa 18 and their complete statistics.
Stars: ✭ 185 (-19.91%)
Mutual labels:  jupyter-notebook, dataset
Awesome Json Datasets
A curated list of awesome JSON datasets that don't require authentication.
Stars: ✭ 2,421 (+948.05%)
Mutual labels:  dataset, datasets
Trump Lies
Tutorial: Web scraping in Python with Beautiful Soup
Stars: ✭ 201 (-12.99%)
Mutual labels:  jupyter-notebook, dataset
Contactpose
Large dataset of hand-object contact, hand- and object-pose, and 2.9 M RGB-D grasp images.
Stars: ✭ 129 (-44.16%)
Mutual labels:  jupyter-notebook, dataset
Know Your Intent
State-of-the-art results in intent classification using Semantic Hashing for three datasets: AskUbuntu, Chatbot and WebApplication.
Stars: ✭ 116 (-49.78%)
Mutual labels:  jupyter-notebook, dataset
Gossiping Chinese Corpus
PTT 八卦版問答中文語料
Stars: ✭ 137 (-40.69%)
Mutual labels:  jupyter-notebook, dataset
Protest Detection Violence Estimation
Implementation of the model used in the paper Protest Activity Detection and Perceived Violence Estimation from Social Media Images (ACM Multimedia 2017)
Stars: ✭ 114 (-50.65%)
Mutual labels:  jupyter-notebook, dataset
Motion Sense
MotionSense Dataset for Human Activity and Attribute Recognition ( time-series data generated by smartphone's sensors: accelerometer and gyroscope)
Stars: ✭ 159 (-31.17%)
Mutual labels:  jupyter-notebook, dataset
Automated Resume Screening System
Automated Resume Screening System using Machine Learning (With Dataset)
Stars: ✭ 224 (-3.03%)
Mutual labels:  dataset, datasets
Firstcoursenetworkscience
Tutorials, datasets, and other material associated with textbook "A First Course in Network Science" by Menczer, Fortunato & Davis
Stars: ✭ 111 (-51.95%)
Mutual labels:  jupyter-notebook, datasets
Bertqa Attention On Steroids
BertQA - Attention on Steroids
Stars: ✭ 112 (-51.52%)
Mutual labels:  jupyter-notebook, dataset
Data Science Resources
👨🏽‍🏫You can learn about what data science is and why it's important in today's modern world. Are you interested in data science?🔋
Stars: ✭ 171 (-25.97%)
Mutual labels:  jupyter-notebook, dataset
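
The percentage shown next to each star count is not explained on the page. It appears to be the relative difference from this project's own star count. The sketch below reproduces a few of the listed figures under the assumption that the baseline is 231 stars; that value is inferred from the numbers above and is not stated in the source, and the function name is hypothetical.

```python
def relative_star_difference(stars: int, baseline: int) -> float:
    """Return the percentage difference of `stars` relative to `baseline`,
    rounded to two decimal places."""
    return round((stars - baseline) / baseline * 100, 2)


# Assumption: baseline of 231 stars, inferred from the listed percentages.
BASELINE = 231

# Star counts copied from the list above.
listed = [("Openml R", 81), ("Corus", 154), ("Awesome Json Datasets", 2421)]

for name, stars in listed:
    print(f"{name}: {relative_star_difference(stars, BASELINE):+.2f}%")
# Openml R: -64.94%
# Corus: -33.33%
# Awesome Json Datasets: +948.05%
```

A negative value means the listed project has fewer stars than the baseline, and a positive value means it has more.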

source{d} Datasets

source{d} datasets for source code analysis and machine learning on source code (ML on Code).

This repository contains the tools and scripts needed to reproduce the datasets, as well as the academic papers related to them.

Available datasets

Public Git Archive

  • Public Git Archive
  • Size: 6TB
  • Description: 260k+ top-bookmarked repositories from GitHub, consisting of 136M+ files and ~28 billion lines of code.

Programming Language Identifiers

Code duplicates

Pull Request review comments

  • PR review comments
  • Size: 1.5GB
  • Description: 25.3 million GitHub PR review comments from January 2015 to December 2018.

Commit messages

  • Commit messages
  • Size: 46GB
  • Description: 1.3 billion GitHub commit messages up to March 2019.

Structural commit features

DockerHub Metadata

  • DockerHub Metadata
  • Size: 1.4GB
  • Description: 1.46 million Docker image configuration and manifest files from DockerHub, fetched in June 2019.

DockerHub Packages

  • DockerHub Packages
  • Size: 15GB
  • Description: Lists of native, Python, and Node packages from 419,092 analyzed Docker images on DockerHub, fetched in summer 2019.

Typos

  • Typos
  • Size: 1MB
  • Description: 7,375 typos in source code identifier names found in GitHub repositories.

NuGet Namespaces

  • NugetNamespaces
  • Size: 13MB
  • Description: Information about 681,858 .NET namespaces extracted from 227,839 NuGet packages.

Contributions

Contributions are very welcome; please see CONTRIBUTING.md and the code of conduct.

License

The tools and scripts are licensed under Apache 2.0; see LICENSE.md.
