
StanGirard / TrollHunter

License: GPL-3.0
Twitter Troll & Fake News Hunter - Crawls news websites and twitter to identify fake news


TrollHunter

TrollHunter is a Twitter Crawler & News Website Indexer. It aims at finding Troll Farmers & Fake News on Twitter.

It is composed of three parts:

  • Twint API to extract information about a tweet or a user
  • News Indexer which indexes all the articles of a website and extracts their keywords
  • Analysis of the tweets and news

Installation

You can either run

pip3 install TrollHunter

or clone the project and run

pip3 install -r requirements.txt

Docker

TrollHunter requires several services to run:

  • ELK (Elasticsearch, Logstash, Kibana)
  • InfluxDb & Grafana
  • RabbitMQ

You can either launch them individually if you already have them setup or use our docker-compose.yml

  • Install Docker
  • Run docker-compose up -d

Setup

Update the .env file with the required values, then export its variables:

export $(cat .env | sed 's/#.*//g' | xargs)

Twitter crawler

Twint

To crawl tweets and extract user information we use Twint, which lets us gather a lot of data without using the Twitter API.

Some of the benefits of using Twint vs Twitter API:

  • Can fetch almost all Tweets (Twitter API limits to last 3200 Tweets only);
  • Fast initial setup;
  • Can be used anonymously and without Twitter sign up;
  • No rate limitations.

While using Twint, we ran into some problems:

  • Poor compatibility with Windows and datetime handling
  • No way to cap the number of tweets retrieved
  • Bugs with some user agents

So we decided to fork the project.

The fork allows us to:

  • get tweets
  • get user information
  • get follows and followers
  • search tweets by hashtag or word

API

For this we use the open-source framework Flask.

Four endpoints are defined:

  • /tweets/<string:user>

    • get all information about a user (tweets, follows, interactions)
  • /search

    • crawl, every 2 hours, the tweets matching a search
  • /stop

    • stop the search
  • /tweet/origin

    • retrieve the origin of a tweet

Some query parameters are available:

  • tweet: set to 0 to skip tweets (default: 1)
  • follow: set to 0 to skip follows (default: 1)
  • limit: number of tweets to retrieve (increments of 20, default: 100)
  • follow_limit: number of following and followers to retrieve (default: 100)
  • since: date selector for tweets (example: 2017-12-27)
  • until: date selector for tweets (example: 2017-12-27)
  • retweet: set to 1 to retrieve retweets (default: 0)
  • search:
    • search terms, format "i search"
    • for a hashtag: (#Hashtag)
    • for multiple: (#Hashtag1 AND|OR #Hashtag2)
  • tweet_interact: set to 1 to parse tweet interactions between users (default: 0)
  • depth: search tweets and info from the list of follows
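As a quick illustration, a request URL combining these parameters could be built like this (the host and port are assumptions; point BASE at wherever the Flask API actually runs):

```python
import urllib.parse

# Assumed host/port for the Flask API; adjust to your deployment.
BASE = "http://localhost:5000"

def tweets_url(user, **params):
    """Build a /tweets/<user> request URL from the query parameters above."""
    query = urllib.parse.urlencode(params)
    return f"{BASE}/tweets/{urllib.parse.quote(user)}?{query}"

# Fetch up to 200 tweets since 2017-12-27, skipping follow data.
url = tweets_url("some_user", tweet=1, follow=0, limit=200, since="2017-12-27")
```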

Twitter Storage

Information retrieved with Twint is stored in Elasticsearch; we do not use Twint's default storage format because we want stronger relationship parsing. There are currently three indices:

  • twitter_user
  • twitter_tweet
  • twitter_interaction

The first two indices store the data as it appears on Twitter. The third is built to store interactions derived from followers/following, conversations and retweets.
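As an illustration only, a document in the interaction index could be shaped roughly like this (the field names are assumptions, not the project's actual mapping):

```python
# Hypothetical document shape for the twitter_interaction index.
# Field names are illustrative, not the project's real mapping.
interaction_doc = {
    "source_user": "alice",      # user who performed the interaction
    "target_user": "bob",        # user on the receiving end
    "interaction": "retweet",    # one of: follow, conversation, retweet
    "tweet_id": "1234567890",    # tweet involved, when applicable
}
```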

Twitter interaction

News Indexer

The second main part of the project is the crawler and indexer of news.

For this, we use the sitemap XML file of news websites to crawl all the articles. From a sitemap file, we extract the sitemap and url tags.

The sitemap tag is a link to a child sitemap xml file for a specific category of articles in the website.

The url tag represents an article/news of the website.
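A minimal sketch of this extraction with Python's standard library (the tag names and namespace follow the Sitemaps.org protocol; the example URL is made up):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_locs(xml_text, tag):
    """Return the <loc> of every <sitemap> or <url> entry in a sitemap file."""
    root = ET.fromstring(xml_text)
    return [entry.findtext("sm:loc", namespaces=NS)
            for entry in root.findall(f"sm:{tag}", NS)]

index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-politics.xml</loc>
    <lastmod>2020-05-01</lastmod>
  </sitemap>
</sitemapindex>"""

child_sitemaps = extract_locs(index_xml, "sitemap")
```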

The root url of a sitemap is stored in a Postgres database together with a trust level for the website (Oriented, Verified, Fake News, ...) and headers. The headers are the tags we want to extract from the url tag, which contain details about the article (title, keywords, publication date, ...).

The headers are also the list of fields used in the ElasticSearch index pattern.

While crawling sitemaps, we insert new child sitemaps into the database with their last modification date, or update that date for the ones already stored. The last modification date is used to crawl only sitemaps that have changed since the last crawl.
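The lastmod check itself is simple; a minimal sketch (the real crawler compares dates stored in Postgres):

```python
from datetime import date

def needs_crawl(lastmod, last_crawl):
    """Recrawl a sitemap only if it changed since the previous crawl,
    or was never crawled at all."""
    return last_crawl is None or lastmod > last_crawl
```

For example, needs_crawl(date(2020, 5, 2), date(2020, 5, 1)) is true, so that sitemap would be fetched again.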

The data extracted from the url tags is assembled into a dataframe and then sent to ElasticSearch, for further use alongside the requests of the Twint API.

At the same time, some sitemaps don't provide keywords for their articles. Hence, we retrieve from ElasticSearch the entries without keywords, download the content of each article, extract its keywords with NLP, and finally update the entries in ElasticSearch.

How it works

  • Insert a sitemap that you want to crawl with insert_sitemap(loc, lastmod, url_headers, id_trust)
  • Then run scheduler_news(), which retrieves all the sitemaps you have inserted in the database
  • You can also run scheduler_keywords() to extract the keywords missing from the URLs already fetched
  • Every URL found is inserted into Elasticsearch

Run

For the crawler/indexer:

from TrollHunter.news_crawler import scheduler_news

scheduler_news(time_interval)

For updating keywords:

from TrollHunter.news_crawler import scheduler_keywords

scheduler_keywords(time_interval, max_entry)

Or see the main usage with Docker.

Grafana

We use Grafana for visualizing and monitoring different events of the crawler/indexer, such as the insertion of a URL into ElasticSearch or the extraction of keywords from an article.

(Grafana dashboard screenshot)

To create new events:

  • Use TrollHunter.loggers.InfluxDBLog()
  • Create a new dashboard in grafana, save as json and add it to docker/grafana-provisioning/dashboards

Text analysis

The text analysis part lives under TrollHunter/texto. It aims to process a text or a set of texts to retrieve useful information that can help determine the "troll" status of a user, or link a text to a news article.

Several classes do the job:

  • Sentiment.py to extract polarity, feeling and subjectivity (integer indicators)
  • Keyword.py to extract keywords/topics from a text input (a tweet, or perhaps a news article)
  • Inicator_average.py to process tweets from a list of users (in a specific format) and produce a mean for all users. This was used to detect patterns that could help qualify a user as a troll or not, by giving a certain trust percentage.

Keyword

Keyword extraction is useful because it helps detect the topics of an input text. To extract keywords from a text, just import the "extract" function from Keyword and call it with the text as input. The "extract" function is a wrapper around two extraction functions:

  • extract_v1: implements the RAKE (Rapid Automatic Keyword Extraction) algorithm. As its name says, it is a fast way to extract keywords. Keywords are lemmatized.
  • extract_v2: implements TextRank keyword extraction. It produces better results than RAKE but is slower.

We run both on the same text and merge their outputs. Because the algorithms differ, the results sometimes differ too, so we merge them into a set of unique keywords to get both visions. Up to 75 keywords are returned: 25 from extract_v1 and 50 from extract_v2 (these numbers can be adjusted by parameter).
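The deduplicating merge could look roughly like this (a simplified stand-in; the real code also caps how many keywords each algorithm contributes):

```python
def merge_keywords(rake_keywords, textrank_keywords):
    """Merge RAKE (extract_v1) and TextRank (extract_v2) outputs into one
    ordered set of unique keywords, case-insensitively."""
    seen = {}
    for kw in list(rake_keywords) + list(textrank_keywords):
        seen.setdefault(kw.lower(), kw)  # keep the first-seen spelling
    return list(seen.values())

merged = merge_keywords(["Election", "fraud"], ["election", "ballot"])
```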

Feelings

Feeling extraction retrieves polarity, feeling and subjectivity as numerical values from a text or a set of texts. To extract them, import the functions get_sentiment_from_tweets, get_polarity and get_subjectivity from Sentiment.py. We use TextBlob for polarity and subjectivity analysis, and SentimentIntensityAnalyzer from nltk.sentiment.vader (the nltk package) for feeling analysis.

Average Indicator

This one computes and extracts averages of useful data from a set of tweets for a user (or a set of users). It consists of one class called "Indicator". Give it a folder containing a set of user CSV files, then call the "get_all_indicator_users" function to apply all our algorithms, compute a mean, and detect patterns. We can, for instance, compare a set of troll users with a set of non-troll users.
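A toy version of that averaging, assuming the per-user CSV holds numeric indicator columns (the column names here are made up, not the project's exact format):

```python
import csv
import io
from statistics import mean

def average_indicators(csv_text, fields=("polarity", "subjectivity")):
    """Average the numeric indicator columns over all tweets in one user's CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {field: mean(float(row[field]) for row in rows) for field in fields}

sample = "polarity,subjectivity\n0.2,0.5\n-0.4,0.7\n"
averages = average_indicators(sample)
```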
