philipperemy / Reuters Full Data Set

Full dataset of Reuters composed of 8,551,441 news titles, links and timestamps (Jan 2007 - Aug 2016). Generate your own up to today!

Projects that are alternatives to or similar to Reuters Full Data Set

Nlp chinese corpus
Large Scale Chinese Corpus for NLP
Stars: ✭ 6,656 (+4086.16%)
Mutual labels:  news, dataset
Weihanli.npoi
NPOI Extensions, excel/csv importer/exporter for IEnumerable<T>/DataTable, fluentapi(great flexibility)/attribute configuration
Stars: ✭ 157 (-1.26%)
Mutual labels:  dataset
Financial News Dataset
Reuters and Bloomberg
Stars: ✭ 147 (-7.55%)
Mutual labels:  dataset
Django Newsfeed
A news curator and newsletter subscription package for Django
Stars: ✭ 155 (-2.52%)
Mutual labels:  news
Maskedface Net
MaskedFace-Net is a dataset of human faces with a correctly and incorrectly worn mask based on the dataset Flickr-Faces-HQ (FFHQ).
Stars: ✭ 152 (-4.4%)
Mutual labels:  dataset
Snape
Snape is a convenient artificial dataset generator that wraps sklearn's make_classification and make_regression and then adds in 'realism' features such as complex formatting, varying scales, categorical variables, and missing values.
Stars: ✭ 155 (-2.52%)
Mutual labels:  dataset
Lapa Dataset
A large-scale dataset for face parsing (AAAI2020)
Stars: ✭ 149 (-6.29%)
Mutual labels:  dataset
Covid 19 Timeline
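A timeline of the COVID-19 epidemic's progress since the end of 2019, compiled systematically in the style and conventions of a sociological yearbook.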
Stars: ✭ 1,887 (+1086.79%)
Mutual labels:  news
Rt gene
RT-GENE: Real-Time Eye Gaze and Blink Estimation in Natural Environments
Stars: ✭ 157 (-1.26%)
Mutual labels:  dataset
Isic Archive Downloader
A script to download the ISIC Archive of lesion images
Stars: ✭ 153 (-3.77%)
Mutual labels:  dataset
Evoskeleton
Official project website for the CVPR 2020 paper (Oral Presentation) "Cascaded Deep Monocular 3D Human Pose Estimation With Evolutionary Training Data"
Stars: ✭ 154 (-3.14%)
Mutual labels:  dataset
Quickdraw Appendix
Dataset of 25k penises: an appendix to the Quick, Draw! Dataset
Stars: ✭ 153 (-3.77%)
Mutual labels:  dataset
Newswatch React Native
📺 A news app using YouTube playlists, built with React Native
Stars: ✭ 155 (-2.52%)
Mutual labels:  news
Music Dance Video Synthesis
(ACM MM 20 Oral) PyTorch implementation of Self-supervised Dance Video Synthesis Conditioned on Music
Stars: ✭ 150 (-5.66%)
Mutual labels:  dataset
Omr Datasets
Collection of datasets used for Optical Music Recognition
Stars: ✭ 158 (-0.63%)
Mutual labels:  dataset
Census Data Downloader
Download U.S. census data and reformat it for humans
Stars: ✭ 149 (-6.29%)
Mutual labels:  news
Awesome Biomechanics
A curated, public list collecting resources for biomechanics and human motion: datasets, processing tools, software for simulation, educational videos, lectures, etc.
Stars: ✭ 154 (-3.14%)
Mutual labels:  dataset
Dem.net
Digital Elevation model library in C#. 3D terrain models, line/point Elevations, intervisibility reports
Stars: ✭ 153 (-3.77%)
Mutual labels:  dataset
Motion Sense
MotionSense Dataset for Human Activity and Attribute Recognition ( time-series data generated by smartphone's sensors: accelerometer and gyroscope)
Stars: ✭ 159 (+0%)
Mutual labels:  dataset
Pytorch Nlp
Basic Utilities for PyTorch Natural Language Processing (NLP)
Stars: ✭ 1,996 (+1155.35%)
Mutual labels:  dataset

Reuters-full-data-set

Full unofficial data set of Reuters composed of 8,551,441 news titles, links and timestamps (Jan 2007 - Aug 2016).

NB: To generate it from scratch (from 2007 up to today), see the "Generate your own data set" section further down.

Using the pre-existing one

git clone https://github.com/philipperemy/Reuters-full-data-set.git
cd Reuters-full-data-set
python3 read.py

Sample output:

ts = 20070228 11:46 AM EST, t = European stocks hit 7-week low amid new sell-off, h= http://www.reuters.com/article/companyNewsAndPR/idUSWEB277620070228
ts = 20070228 11:46 AM EST, t = Schering-Plough announces Ismail Kola as VP and Chief Scientific Officer, h= http://www.reuters.com/article/inPlayBriefing/idUSIN20070228164651SGP20070228
ts = 20070228 11:46 AM EST, t = O'Reilly Automotive forecasts 2007 earnings growth, h= http://www.reuters.com/article/marketsNews/idUSN2845320220070228
ts = 20070228 11:42 AM EST, t = Market Wrap, h= http://www.reuters.com/article/inPlayBriefing/idUSIN20070228164235WRAPX20070228
ts = 20070228 11:42 AM EST, t = Chile's CMPC net profit falls 13 pct in 2006, h= http://www.reuters.com/article/tnBasicIndustries-SP/idUSN2844077020070228
ts = 20070228 11:42 AM EST, t = Toyota Venezuela to halt March ops on currency woes, h= http://www.reuters.com/article/tnBasicIndustries-SP/idUSN2827887820070228

Each pickle file in data represents one day (e.g. 20160102.pkl is for Jan 2, 2016).
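
Because each file name encodes its date, a date range can be selected by parsing the names. The sketch below assumes the files are named YYYYMMDD.pkl, as in the example above, and sit in a single directory. day_of and files_between are hypothetical helper names, not part of the repository.

```python
from datetime import date, datetime
from pathlib import Path


def day_of(path):
    """Return the calendar date encoded in a file name such as 20160102.pkl."""
    return datetime.strptime(Path(path).stem, "%Y%m%d").date()


def files_between(data_dir, start, end):
    """Return the day files in data_dir dated within [start, end], oldest first."""
    selected = []
    for path in sorted(Path(data_dir).glob("*.pkl")):
        try:
            day = day_of(path)
        except ValueError:
            # Skip any .pkl file whose name is not a YYYYMMDD date.
            continue
        if start <= day <= end:
            selected.append(path)
    return selected


if __name__ == "__main__":
    # "data" is an assumed directory name; adjust it to your checkout.
    for path in files_between("data", date(2016, 1, 1), date(2016, 1, 7)):
        print(day_of(path), path)
```

Sorting by path works here only because YYYYMMDD names sort chronologically when compared as strings.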

Each day's file holds a list of news items.

Each news item is a dict with the following keys:

ts: timestamp of the form 20070228 11:46 AM EST
title: title of the news
href: link to the article to get the full content
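
The structure described above can be read with the standard library alone. This is a minimal sketch, not the repository's read.py: load_day and parse_ts are hypothetical helper names, and the data/20160102.pkl path is an assumption based on the file-naming example. The time zone label is returned as plain text rather than converted, since the only format shown is the sample timestamp.

```python
import pickle
from datetime import datetime
from pathlib import Path


def load_day(path):
    """Load one day's pickle file and return its list of news dicts.

    Only unpickle files you trust: pickle can execute code on load.
    """
    with open(path, "rb") as f:
        return pickle.load(f)


def parse_ts(ts):
    """Split a value like '20070228 11:46 AM EST' into (naive datetime, zone label)."""
    stamp, zone = ts.rsplit(" ", 1)
    return datetime.strptime(stamp, "%Y%m%d %I:%M %p"), zone


if __name__ == "__main__":
    day_file = Path("data") / "20160102.pkl"  # assumed location
    if day_file.exists():
        for item in load_day(day_file)[:5]:
            when, zone = parse_ts(item["ts"])
            print(when.isoformat(), zone, item["title"], item["href"])
```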

Generate your own data set

Run the commands below to generate the pickle and CSV files.

I get the data from http://www.reuters.com/resources/archive/us.

git clone https://github.com/philipperemy/Reuters-full-data-set.git
cd Reuters-full-data-set
pip3 install beautifulsoup4 requests
python3 generate.py
python3 dump_to_csv.py DATA_DIR # where DATA_DIR is the directory containing the pickle files produced by generate.py

