several27 / Fakenewscorpus

Licence: apache-2.0

A dataset of millions of news articles scraped from a curated list of data sources.

Labels

machine-learning database nlp natural-language-processing artificial-intelligence dataset corpus

Projects that are alternatives to or similar to Fakenewscorpus

Nlp bahasa resources

A Curated List of Dataset and Usable Library Resources for NLP in Bahasa Indonesia

Stars: ✭ 158 (-38.04%)

Mutual labels: dataset, corpus, natural-language-processing

Insuranceqa Corpus Zh

🚁 Insurance industry corpus, chatbot

Stars: ✭ 821 (+221.96%)

Mutual labels: dataset, corpus, natural-language-processing

Text2sql Data

A collection of datasets that pair questions with SQL queries.

Stars: ✭ 287 (+12.55%)

Mutual labels: dataset, database, natural-language-processing

Coarij

Corpus of Annual Reports in Japan

Stars: ✭ 55 (-78.43%)

Mutual labels: dataset, corpus, natural-language-processing

Awesome Hungarian Nlp

A curated list of NLP resources for Hungarian

Stars: ✭ 121 (-52.55%)

Mutual labels: dataset, corpus, natural-language-processing

Ua Gec

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

Stars: ✭ 108 (-57.65%)

Mutual labels: dataset, corpus, natural-language-processing

Wikisql

A large annotated semantic parsing corpus for developing natural language interfaces.

Stars: ✭ 965 (+278.43%)

Mutual labels: dataset, database, natural-language-processing

Prosody

Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text

Stars: ✭ 139 (-45.49%)

Mutual labels: dataset, corpus, natural-language-processing

Game Datasets

🎮 A curated list of awesome game datasets, and tools for artificial intelligence in games

Stars: ✭ 261 (+2.35%)

Mutual labels: artificial-intelligence, dataset, database

Medical-Names-Corpus

Medical corpus. Corpus of medical institution names. National drug standard codes.

Stars: ✭ 26 (-89.8%)

Mutual labels: corpus, dataset

Lazynlp

Library to scrape and clean web pages to create massive datasets.

Stars: ✭ 1,985 (+678.43%)

Mutual labels: artificial-intelligence, natural-language-processing

Data Science Resources

👨🏽‍🏫You can learn about what data science is and why it's important in today's modern world. Are you interested in data science?🔋

Stars: ✭ 171 (-32.94%)

Mutual labels: artificial-intelligence, dataset

QANTA Quiz Bowl AI

Stars: ✭ 153 (-40%)

Mutual labels: artificial-intelligence, natural-language-processing

Ai Job Info

Interview experience at major internet companies

Stars: ✭ 145 (-43.14%)

Mutual labels: artificial-intelligence, natural-language-processing

Fixy

Our goal is to build an open source spelling assistant/checker that tackles many different problems in the Turkish NLP literature together, proposes original approaches, and fills the gaps left by existing work. It corrects spelling mistakes in users' texts with a deep learning approach and also performs semantic analysis of the texts, so that errors arising in that context can be detected and corrected as well.

Stars: ✭ 165 (-35.29%)

Mutual labels: artificial-intelligence, natural-language-processing

Awesome Nlp Resources

This repository contains landmark research papers in Natural Language Processing that came out in this century.

Stars: ✭ 145 (-43.14%)

Mutual labels: artificial-intelligence, natural-language-processing

Gun

An open source cybersecurity protocol for syncing decentralized graph data.

Stars: ✭ 15,172 (+5849.8%)

Mutual labels: artificial-intelligence, database

Pyss3

A Python package implementing a new machine learning model for text classification with visualization tools for Explainable AI

Stars: ✭ 191 (-25.1%)

Mutual labels: artificial-intelligence, natural-language-processing

Nlpaug

Data augmentation for NLP

Stars: ✭ 2,761 (+982.75%)

Mutual labels: artificial-intelligence, natural-language-processing

Deepinterests

Deeply interesting

Stars: ✭ 2,232 (+775.29%)

Mutual labels: artificial-intelligence, natural-language-processing

Fake News Corpus

This is an open source dataset composed of millions of news articles, mostly scraped from a curated list of 1001 domains from http://www.opensources.co/. Because the list does not contain many reliable websites, articles from NYTimes and WebHose English News Articles have also been included to better balance the classes. The corpus is mainly intended for training deep learning algorithms for fake news recognition. The dataset is still a work in progress, and for now the public version includes only 9,408,908 articles (745 out of 1001 domains).

Downloading

https://github.com/several27/FakeNewsCorpus/releases/tag/v1.0

How was the corpus created?

The corpus was created by scraping (using scrapy) all the domains listed at http://www.opensources.co/. The raw HTML content was then processed with the newspaper library to extract the article text along with some additional fields (listed below). Each article was assigned the label associated with its domain. All the source code is available at FakeNewsRecognition and will be made more “usable” in the next few months.
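
The domain-level labelling step can be illustrated with a minimal sketch. It is not the project's actual code: the `DOMAIN_LABELS` mapping, the placeholder domains, and the `label_for_url` helper are hypothetical, and the handling of a leading "www." is an assumption about how hosts might be normalised.

```python
from urllib.parse import urlparse

# Hypothetical mapping from domain to type tag. In practice the entries
# would come from the curated domain list, not from this placeholder data.
DOMAIN_LABELS = {
    "fake.example": "fake",
    "satire.example": "satire",
}


def label_for_url(url, domain_labels):
    """Return the tag of the article's domain, or None for an unknown domain."""
    # urlparse lowercases the hostname; it is None for malformed URLs.
    host = urlparse(url).hostname or ""
    # Assumption: treat "www.example.com" and "example.com" as the same domain.
    if host.startswith("www."):
        host = host[len("www."):]
    return domain_labels.get(host)


article_label = label_for_url("http://www.fake.example/2018/01/story.html", DOMAIN_LABELS)
unknown_label = label_for_url("http://other.example/page", DOMAIN_LABELS)
```

Because every article inherits its domain's tag, a single mislabelled domain affects all of its articles, which is worth keeping in mind when reading the limitations further down.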

Formatting

The corpus is formatted as a CSV and contains the following fields:

id
domain
type
url
content
scraped_at
inserted_at
updated_at
title
authors
keywords
meta_keywords
meta_description
tags
summary
source (opensources, nytimes, or webhose)
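
Since the corpus holds millions of articles, loading the whole CSV into memory may not be practical. The sketch below streams the file row by row with Python's standard `csv` module and counts articles per `type`. It assumes the file starts with a header row using the field names listed above; the `count_types` helper and the in-memory sample rows are illustrative only.

```python
import csv
import io
import sys
from collections import Counter


def count_types(csv_file):
    """Count articles per `type` value, reading one row at a time."""
    # Article bodies can exceed the csv module's default field size limit,
    # so raise it. The cap keeps the call valid on platforms with a 32-bit C long.
    csv.field_size_limit(min(sys.maxsize, 2**31 - 1))
    # Assumption: the first row of the file is a header with the field names.
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        counts[row["type"]] += 1
    return counts


# Tiny in-memory stand-in for the real file, with made-up rows.
sample = io.StringIO(
    "id,domain,type,url,content\n"
    "1,a.example,fake,http://a.example/1,first text\n"
    "2,b.example,reliable,http://b.example/2,second text\n"
    "3,a.example,fake,http://a.example/3,third text\n"
)
sample_counts = count_types(sample)
```

For the real download, the same function could be called on a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points at the extracted CSV.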

Available types

More information on http://www.opensources.co

Type	Tag	Count (so far)	Description
Fake News	fake	928,083	Sources that entirely fabricate information, disseminate deceptive content, or grossly distort actual news reports
Satire	satire	146,080	Sources that use humor, irony, exaggeration, ridicule, and false information to comment on current events.
Extreme Bias	bias	1,300,444	Sources that come from a particular point of view and may rely on propaganda, decontextualized information, and opinions distorted as facts.
Conspiracy Theory	conspiracy	905,981	Sources that are well-known promoters of kooky conspiracy theories.
State News	state	0	Sources in repressive states operating under government sanction.
Junk Science	junksci	144,939	Sources that promote pseudoscience, metaphysics, naturalistic fallacies, and other scientifically dubious claims.
Hate News	hate	117,374	Sources that actively promote racism, misogyny, homophobia, and other forms of discrimination.
Clickbait	clickbait	292,201	Sources that provide generally credible content, but use exaggerated, misleading, or questionable headlines, social media descriptions, and/or images.
Proceed With Caution	unreliable	319,830	Sources that may be reliable but whose contents require further verification.
Political	political	2,435,471	Sources that provide generally verifiable information in support of certain points of view or political orientations.
Credible	reliable	1,920,139	Sources that circulate news and information in a manner consistent with traditional and ethical practices in journalism (Remember: even credible sources sometimes rely on clickbait-style headlines or occasionally make mistakes. No news organization is perfect, which is why a healthy news diet consists of multiple sources of information).
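
The counts in the table are far from uniform, which matters when the tags are used as classification targets. As a rough sketch, the share of each tag among the listed counts can be computed as follows. The numbers are copied from the table above; the `class_shares` helper is a hypothetical name.

```python
# Counts per tag, copied from the table above.
TYPE_COUNTS = {
    "fake": 928_083,
    "satire": 146_080,
    "bias": 1_300_444,
    "conspiracy": 905_981,
    "state": 0,
    "junksci": 144_939,
    "hate": 117_374,
    "clickbait": 292_201,
    "unreliable": 319_830,
    "political": 2_435_471,
    "reliable": 1_920_139,
}


def class_shares(counts):
    """Return each tag's fraction of the total number of listed articles."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


shares = class_shares(TYPE_COUNTS)
# Print the tags from most to least frequent.
for tag, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag:<12}{share:6.1%}")
```

A tag with a count of 0, such as `state`, cannot be learned from this version of the corpus at all, so it may be simplest to drop it from the label set before training.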

List of domains

You can find the full list of domains in websites.csv.

Limitations

The dataset was not manually filtered, therefore some of the labels might not be correct and some of the URLs might not point to the actual articles but other pages on the website. However, because the corpus is intended for use in training machine learning algorithms, those problems should not pose a practical issue.

Additionally, once the dataset is finalised (so far only about 80% has been cleaned and published), I do not intend to update it, so it might quickly become outdated for purposes other than content-based algorithms. However, any contributions are welcome!

Contributing

Because I am currently the only person working on this corpus, I’d really appreciate any contributions. If you find wrong labels associated with any articles, weirdly formatted content, or URLs that do not point to articles, feel free to post an issue describing the problem and the exact article id, and I will do my best to respond promptly. Because of its size, I could not host the corpus on GitHub, so unfortunately pull requests cannot be used for now to work on the data collaboratively. However, I’m open to any ideas 🙂

Acknowledgments

Stars: ✭ 255
