taivop / Joke Dataset

A dataset of 200k English plaintext jokes.

A dataset of English plaintext jokes

This dataset contains about 208,000 jokes scraped from three sources.

I make no claim on ownership of these files, nor do I necessarily endorse the jokes in them. This dataset is provided for research purposes (see License section below).

Files

Currently the dataset contains jokes from three sources, each in a different file.

----------------------------------------------
reddit_jokes.json |  195K jokes | 7.40M tokens
stupidstuff.json  | 3.77K jokes |  396K tokens
wocka.json        | 10.0K jokes | 1.11M tokens
----------------------------------------------
TOTAL             |  208K jokes | 8.91M tokens
----------------------------------------------
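
The README does not say how the joke and token counts above were computed. As a rough sketch, the following counts joke objects and whitespace-separated tokens in the body field. Both the tokenization rule and the helper names `corpus_stats` and `stats_for_file` are assumptions made for illustration, so the numbers it produces may not match the table exactly.

```python
import json


def corpus_stats(jokes):
    """Return (number of jokes, number of whitespace-separated body tokens)."""
    # Assumption: a token is a whitespace-separated word of the body field only.
    n_tokens = sum(len(joke["body"].split()) for joke in jokes)
    return len(jokes), n_tokens


def stats_for_file(path):
    """Load one dataset file (a flat JSON list of joke objects) and summarize it."""
    with open(path, encoding="utf-8") as f:
        return corpus_stats(json.load(f))


# In-memory sample using a joke object taken from wocka.json.
sample = [
    {
        "title": "Infants vs Adults",
        "body": "Do infants enjoy infancy as much as adults enjoy adultery?",
        "category": "One Liners",
        "id": 17,
    }
]
n_jokes, n_tokens = corpus_stats(sample)
print(f"{n_jokes} joke(s), {n_tokens} tokens")
```

With the dataset files in the working directory, `stats_for_file("wocka.json")` would give the same kind of summary for a whole file.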

Format

Each file is a JSON document containing a flat list of joke objects. Every joke object has a body field; the additional fields vary by source and are described below.

Not all of the jokes are funny; to find the best ones, sort on the relevant additional field, such as score for Reddit or rating for stupidstuff.org.

Note that in many cases the title is part of the joke (especially for Reddit submissions).
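
A minimal sketch of loading a file and sorting on one of the additional fields might look like this. The helper names `load_jokes` and `top_jokes` are hypothetical, and the sample objects use made-up IDs and placeholder bodies rather than real dataset entries.

```python
import json


def load_jokes(path):
    """Read one dataset file, which is a flat JSON list of joke objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def top_jokes(jokes, field, n=10):
    """Return the n jokes with the highest value of `field`, best first."""
    # Skip objects that lack the field, since the fields differ per source.
    ranked = [j for j in jokes if j.get(field) is not None]
    return sorted(ranked, key=lambda j: j[field], reverse=True)[:n]


# Placeholder objects shaped like reddit_jokes.json entries (IDs are made up).
sample = [
    {"title": "placeholder A", "body": "placeholder", "id": "aaa111", "score": 3},
    {"title": "placeholder B", "body": "placeholder", "id": "bbb222", "score": 120},
    {"title": "placeholder C", "body": "placeholder", "id": "ccc333", "score": 47},
]
for joke in top_jokes(sample, "score", n=2):
    print(joke["score"], joke["id"])
```

For the real data, something like `top_jokes(load_jokes("reddit_jokes.json"), "score")` or `top_jokes(load_jokes("stupidstuff.json"), "rating")` should work, assuming the files are in the working directory.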

reddit_jokes.json

Scraped from /r/jokes. Contains all submissions to the subreddit as of 13 February 2017.

These jokes may have additional comments in them (example).

Additional fields:

  • id -- submission ID in the subreddit.
  • score -- post score displayed on Reddit.
  • title -- title of the submission.
{
    "title": "My boss said to me, \"you're the worst train driver ever. How many have you derailed this year?\"",
    "body": "I said, \"I'm not sure; it's hard to keep track.\"",
    "id": "5tyytx",
    "score": 3
}
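
Because the title often carries the setup and the body the punchline, it can help to join the two before further processing. The sketch below shows one way to do that. The helper name `full_text` and the choice of a newline as separator are assumptions, not part of the dataset.

```python
def full_text(joke):
    """Join title and body into one string, tolerating a missing title."""
    title = joke.get("title", "").strip()
    body = joke["body"].strip()
    if not title:
        return body
    if not body:
        return title
    # Assumption: a newline is a reasonable separator between setup and punchline.
    return f"{title}\n{body}"


# The example object from reddit_jokes.json shown above.
reddit_joke = {
    "title": "My boss said to me, \"you're the worst train driver ever. "
             "How many have you derailed this year?\"",
    "body": "I said, \"I'm not sure; it's hard to keep track.\"",
    "id": "5tyytx",
    "score": 3,
}
print(full_text(reddit_joke))
```

Objects without a title field, such as those in stupidstuff.json, fall through to the body alone.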

stupidstuff.json

Scraped from stupidstuff.org.

Additional fields:

  • id -- page ID on stupidstuff.org.
  • category -- see available categories here.
  • rating -- mean user rating on a scale of 1 to 5.
{
    "category": "Blonde Jokes",
    "body": "A blonde is walking down the street with her blouse open, exposing one of her breasts. A nearby policeman approaches her and remarks, \"Ma'am, are you aware that I could cite you for indecent exposure?\" \"Why, officer?\" asks the blonde. \"Because your blouse is open and your breast is exposed.\" \"Oh my goodness,\" exclaims the blonde, \"I must have left my baby on the bus!\"",
    "id": 14,
    "rating": 3.5
}
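
As an illustration of working with the category and rating fields, the sketch below keeps only well-rated jokes and computes the mean rating per category. It assumes every object carries both fields, as the field list above suggests. The helper names, the 4.0 threshold, and the sample objects (placeholder bodies, made-up IDs, one made-up category) are all hypothetical.

```python
from collections import defaultdict


def well_rated(jokes, min_rating=4.0):
    """Keep jokes whose mean user rating is at least min_rating."""
    return [j for j in jokes if j["rating"] >= min_rating]


def mean_rating_by_category(jokes):
    """Map each category to the average rating of its jokes."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [rating sum, count]
    for joke in jokes:
        entry = totals[joke["category"]]
        entry[0] += joke["rating"]
        entry[1] += 1
    return {cat: total / count for cat, (total, count) in totals.items()}


# Placeholder objects shaped like stupidstuff.json entries.
sample = [
    {"category": "Blonde Jokes", "body": "placeholder", "id": 14, "rating": 3.5},
    {"category": "Blonde Jokes", "body": "placeholder", "id": 9001, "rating": 4.5},
    {"category": "Example Category", "body": "placeholder", "id": 9002, "rating": 2.0},
]
print([j["id"] for j in well_rated(sample)])
print(mean_rating_by_category(sample))
```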

wocka.json

Scraped from wocka.com.

Additional fields:

  • id -- page ID on wocka.com.
  • category -- see available categories here.
  • title -- title of the joke.
{
    "title": "Infants vs Adults",
    "body": "Do infants enjoy infancy as much as adults enjoy adultery?",
    "category": "One Liners",
    "id": 17
}

License

I provide this dataset for research purposes and make no ownership claim on any part of it. The question of copyright in the case of jokes is unclear and I recommend not using the dataset commercially.

For removal of copyrighted content, please contact me on GitHub.

Citing

If you use this dataset in academic work, please cite as follows:

@misc{pungas,
        title = {A dataset of English plaintext jokes.},
        url = {https://github.com/taivop/joke-dataset},
        author = {Pungas, Taivo},
        year = {2017},
        publisher = {GitHub},
        journal = {GitHub repository}
}