philipperemy / Reuters Full Data Set
Full dataset of Reuters composed of 8,551,441 news titles, links and timestamps (Jan 2007 - Aug 2016). Generate your own up to today!
Stars: ✭ 159
Projects that are alternatives to or similar to Reuters Full Data Set
Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+4086.16%)
Mutual labels: news, dataset
Weihanli.npoi
NPOI Extensions, excel/csv importer/exporter for IEnumerable<T>/DataTable, fluentapi(great flexibility)/attribute configuration
Stars: ✭ 157 (-1.26%)
Mutual labels: dataset
Django Newsfeed
A news curator and newsletter subscription package for Django
Stars: ✭ 155 (-2.52%)
Mutual labels: news
Maskedface Net
MaskedFace-Net is a dataset of human faces with a correctly and incorrectly worn mask based on the dataset Flickr-Faces-HQ (FFHQ).
Stars: ✭ 152 (-4.4%)
Mutual labels: dataset
Snape
Snape is a convenient artificial dataset generator that wraps sklearn's make_classification and make_regression and then adds in 'realism' features such as complex formatting, varying scales, categorical variables, and missing values.
Stars: ✭ 155 (-2.52%)
Mutual labels: dataset
Lapa Dataset
A large-scale dataset for face parsing (AAAI2020)
Stars: ✭ 149 (-6.29%)
Mutual labels: dataset
Covid 19 Timeline
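A timeline of the COVID-19 epidemic's progress since the end of 2019, compiled systematically following the conventions of the Annales school of sociology.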
Stars: ✭ 1,887 (+1086.79%)
Mutual labels: news
Rt gene
RT-GENE: Real-Time Eye Gaze and Blink Estimation in Natural Environments
Stars: ✭ 157 (-1.26%)
Mutual labels: dataset
Isic Archive Downloader
A script to download the ISIC Archive of lesion images
Stars: ✭ 153 (-3.77%)
Mutual labels: dataset
Evoskeleton
Official project website for the CVPR 2020 paper (Oral Presentation) "Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data"
Stars: ✭ 154 (-3.14%)
Mutual labels: dataset
Quickdraw Appendix
Dataset of 25k penises: an appendix to the Quick, Draw! Dataset
Stars: ✭ 153 (-3.77%)
Mutual labels: dataset
Newswatch React Native
📺 A news app using YouTube playlists, built with React Native
Stars: ✭ 155 (-2.52%)
Mutual labels: news
Music Dance Video Synthesis
(ACM MM 20 Oral) PyTorch implementation of Self-supervised Dance Video Synthesis Conditioned on Music
Stars: ✭ 150 (-5.66%)
Mutual labels: dataset
Omr Datasets
Collection of datasets used for Optical Music Recognition
Stars: ✭ 158 (-0.63%)
Mutual labels: dataset
Census Data Downloader
Download U.S. census data and reformat it for humans
Stars: ✭ 149 (-6.29%)
Mutual labels: news
Awesome Biomechanics
A curated, public list collecting resources for biomechanics and human motion: datasets, processing tools, software for simulation, educational videos, lectures, etc.
Stars: ✭ 154 (-3.14%)
Mutual labels: dataset
Dem.net
Digital Elevation model library in C#. 3D terrain models, line/point Elevations, intervisibility reports
Stars: ✭ 153 (-3.77%)
Mutual labels: dataset
Motion Sense
MotionSense Dataset for Human Activity and Attribute Recognition (time-series data generated by smartphone sensors: accelerometer and gyroscope)
Stars: ✭ 159 (+0%)
Mutual labels: dataset
Pytorch Nlp
Basic Utilities for PyTorch Natural Language Processing (NLP)
Stars: ✭ 1,996 (+1155.35%)
Mutual labels: dataset
Reuters-full-data-set
Full unofficial data set of Reuters composed of 8,551,441 news titles, links and timestamps (Jan 2007 - Aug 2016).
NB: To generate it from scratch (from 2007 up to today), see "Generate your own data set" further down.
Using the pre-existing one
git clone https://github.com/philipperemy/Reuters-full-data-set.git
cd Reuters-full-data-set
python3 read.py
ts = 20070228 11:46 AM EST, t = European stocks hit 7-week low amid new sell-off, h= http://www.reuters.com/article/companyNewsAndPR/idUSWEB277620070228
ts = 20070228 11:46 AM EST, t = Schering-Plough announces Ismail Kola as VP and Chief Scientific Officer, h= http://www.reuters.com/article/inPlayBriefing/idUSIN20070228164651SGP20070228
ts = 20070228 11:46 AM EST, t = O'Reilly Automotive forecasts 2007 earnings growth, h= http://www.reuters.com/article/marketsNews/idUSN2845320220070228
ts = 20070228 11:42 AM EST, t = Market Wrap, h= http://www.reuters.com/article/inPlayBriefing/idUSIN20070228164235WRAPX20070228
ts = 20070228 11:42 AM EST, t = Chile's CMPC net profit falls 13 pct in 2006, h= http://www.reuters.com/article/tnBasicIndustries-SP/idUSN2844077020070228
ts = 20070228 11:42 AM EST, t = Toyota Venezuela to halt March ops on currency woes, h= http://www.reuters.com/article/tnBasicIndustries-SP/idUSN2827887820070228
Each pickle file in data represents a day (e.g. 20160102.pkl is for Jan 2, 2016).
Each day holds several news items, gathered in a list.
Each news item is a dict of the form:
ts: timestamp of the form 20070228 11:46 AM EST
title: title of the news
href: link to the article to get the full content
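The layout described above can be read with the standard library alone. The following is a minimal sketch, not the contents of read.py: the helper names load_day and parse_ts are hypothetical, and it assumes every timestamp follows the "20070228 11:46 AM EST" pattern shown above, with the time-zone abbreviation simply dropped.

```python
import pickle
from datetime import datetime
from pathlib import Path


def load_day(path):
    """Load one daily pickle file and return its list of news dicts.

    Only unpickle files you trust: pickle can execute code on load.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


def parse_ts(ts):
    """Parse a timestamp such as '20070228 11:46 AM EST'.

    Assumption: the trailing zone abbreviation is ignored, so the result
    is a naive datetime expressed in whatever zone the source named.
    """
    date_part, time_part, meridiem = ts.split()[:3]
    return datetime.strptime(
        f"{date_part} {time_part} {meridiem}", "%Y%m%d %I:%M %p"
    )


if __name__ == "__main__":
    # File name taken from the example above; adjust to a file you have.
    day_file = Path("data") / "20160102.pkl"
    if day_file.exists():
        for news in load_day(day_file)[:5]:
            print(parse_ts(news["ts"]), news["title"], news["href"])
```

Keeping the parser separate from the loader makes it easy to swap in time-zone-aware handling later if the zone matters for your analysis.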
Generate your own data set
Nothing could be easier. Just run these commands to generate the pickle and CSV files.
I get the data from http://www.reuters.com/resources/archive/us.
git clone https://github.com/philipperemy/Reuters-full-data-set.git
cd Reuters-full-data-set
pip3 install beautifulsoup4 requests
python3 generate.py
python3 dump_to_csv.py DATA_DIR # where DATA_DIR is the directory containing your pickle files from generate.py
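Once the pickle files exist, you may want to walk the whole directory rather than one day at a time. Here is a rough sketch of one way to do that. It assumes generate.py writes the same per-day layout as the bundled data (one YYYYMMDD.pkl file per day, each holding a list of dicts); iter_news and count_per_day are hypothetical helpers, not part of this repository.

```python
import pickle
from pathlib import Path


def iter_news(data_dir):
    """Yield (day, news) pairs from every *.pkl file in data_dir, oldest first."""
    # Assumption: names like 20160102.pkl sort chronologically as strings.
    for path in sorted(Path(data_dir).glob("*.pkl")):
        with open(path, "rb") as f:
            for news in pickle.load(f):
                yield path.stem, news


def count_per_day(data_dir):
    """Return a dict mapping each day (file stem) to its number of news items."""
    counts = {}
    for day, _ in iter_news(data_dir):
        counts[day] = counts.get(day, 0) + 1
    return counts


if __name__ == "__main__":
    import sys

    # Pass the same DATA_DIR you would give to dump_to_csv.py.
    if len(sys.argv) > 1:
        for day, n in count_per_day(sys.argv[1]).items():
            print(day, n)
```

A generator keeps memory use flat, which matters for a collection with millions of titles.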