wb14123 / Couplet Dataset

Licence: agpl-3.0
Dataset for couplets. A database of 700,000 couplets.

Programming Languages

python

Projects that are alternatives to or similar to Couplet Dataset

Inat comp
iNaturalist competition details
Stars: ✭ 444 (-24.62%)
Mutual labels:  dataset
Cluepretrainedmodels
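A collection of high-quality Chinese pre-trained models: state-of-the-art large models, fastest small models, and specialized similarity models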
Stars: ✭ 493 (-16.3%)
Mutual labels:  dataset
Nas Bench 201
NAS-Bench-201 API and Instruction
Stars: ✭ 537 (-8.83%)
Mutual labels:  dataset
Mongodb Json Files
📦 A curated list of JSON / BSON datasets from the web in order to practice / use in MongoDB
Stars: ✭ 456 (-22.58%)
Mutual labels:  dataset
Tensorflow object tracking video
Object Tracking in Tensorflow (Localization, Detection, Classification) developed to participate in the ImageNET VID competition
Stars: ✭ 491 (-16.64%)
Mutual labels:  dataset
Cdap
An open source framework for building data analytic applications.
Stars: ✭ 509 (-13.58%)
Mutual labels:  dataset
Io
Dataset, streaming, and file system extensions maintained by TensorFlow SIG-IO
Stars: ✭ 427 (-27.5%)
Mutual labels:  dataset
Open stt
Open STT
Stars: ✭ 584 (-0.85%)
Mutual labels:  dataset
Doccano
Open source annotation tool for machine learning practitioners.
Stars: ✭ 5,600 (+850.76%)
Mutual labels:  dataset
Awesome Twitter Data
A list of Twitter datasets and related resources.
Stars: ✭ 533 (-9.51%)
Mutual labels:  dataset
Lidar Bonnetal
Semantic and Instance Segmentation of LiDAR point clouds for autonomous driving
Stars: ✭ 465 (-21.05%)
Mutual labels:  dataset
Chinese rumor dataset
Chinese rumor data
Stars: ✭ 470 (-20.2%)
Mutual labels:  dataset
Pokemon.json
Pokemon dataset in JSON.
Stars: ✭ 511 (-13.24%)
Mutual labels:  dataset
Joke Dataset
A dataset of 200k English plaintext jokes.
Stars: ✭ 447 (-24.11%)
Mutual labels:  dataset
Hate Speech And Offensive Language
Repository for the paper "Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017
Stars: ✭ 543 (-7.81%)
Mutual labels:  dataset
Quickdraw Dataset
Documentation on how to access and use the Quick, Draw! Dataset.
Stars: ✭ 4,622 (+684.72%)
Mutual labels:  dataset
Voice datasets
🔊 A comprehensive list of open-source datasets for voice and sound computing (50+ datasets).
Stars: ✭ 494 (-16.13%)
Mutual labels:  dataset
Cvat
Powerful and efficient Computer Vision Annotation Tool (CVAT)
Stars: ✭ 6,557 (+1013.24%)
Mutual labels:  dataset
Total Text Dataset
Total Text Dataset. It consists of 1555 images with more than 3 different text orientations: Horizontal, Multi-Oriented, and Curved, one of a kind.
Stars: ✭ 580 (-1.53%)
Mutual labels:  dataset
Pycococreator
Helper functions to create COCO datasets
Stars: ✭ 530 (-10.02%)
Mutual labels:  dataset

Couplet dataset.

This project fetches couplets from the blog 冯重朴_梨味斋散叶_的博客.

This dataset contains more than 700,000 couplets.

Run the spider:

scrapy runspider sina_spider.py

The spider stores the fetched data in ./output/.

Download the data

A pre-fetched and cleaned dataset is also available and can be used directly with the seq2seq model. You can download it here.

The downloaded data contains 5 files:

  1. train/in.txt: The inputs (first lines) of the couplets. Each line is one input, with words separated by spaces.
  2. train/out.txt: The outputs (second lines) of the couplets. Each line is the output for the same line number in train/in.txt, with words separated by spaces.
  3. test/in.txt: Same format as train/in.txt, but with fewer examples.
  4. test/out.txt: Same format as train/out.txt, but with fewer examples.
  5. vocabs: The vocabulary file. <s> and </s> are added as the first entries; they are used when training the seq2seq model.
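
The file layout above can be read with a few lines of Python. This is a minimal sketch, not part of the original project: the function names read_couplets and read_vocab are hypothetical, the files are assumed to be UTF-8 encoded, and the vocabs file is assumed to hold one entry per line.

```python
def read_couplets(in_path, out_path, encoding="utf-8"):
    """Pair line i of in_path with line i of out_path.

    Returns a list of (input_tokens, output_tokens) tuples, where each
    element is the list of space-separated words on that line.
    """
    with open(in_path, encoding=encoding) as f_in:
        in_lines = [line.rstrip("\n") for line in f_in]
    with open(out_path, encoding=encoding) as f_out:
        out_lines = [line.rstrip("\n") for line in f_out]
    # The two files are described as line-aligned, so a length mismatch
    # most likely means a truncated or mismatched download.
    if len(in_lines) != len(out_lines):
        raise ValueError(
            f"line count mismatch: {len(in_lines)} inputs, {len(out_lines)} outputs"
        )
    return [(a.split(), b.split()) for a, b in zip(in_lines, out_lines)]


def read_vocab(vocab_path, encoding="utf-8"):
    """Map each vocabulary entry to its position in the file.

    Assumes one entry per line (an assumption, not stated in the README);
    blank lines are skipped.
    """
    with open(vocab_path, encoding=encoding) as f:
        entries = [line.strip() for line in f if line.strip()]
    return {token: index for index, token in enumerate(entries)}
```

Calling read_couplets("train/in.txt", "train/out.txt") from inside the extracted download directory would then give one token-list pair per couplet, and read_vocab("vocabs") would give a lookup table from each word to an integer id, with the start and end markers receiving the lowest ids.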