isaacmg / fb_scraper

Licence: Apache-2.0 license
FBLYZE is a Facebook scraping and analysis system.


Projects that are alternatives of or similar to fb_scraper

hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Stars: ✭ 56 (-8.2%)
Mutual labels:  flink
tf-idf-python
Term frequency–inverse document frequency for Chinese novel/documents implemented in python.
Stars: ✭ 98 (+60.66%)
Mutual labels:  tf-idf
SentimentAnalysis
(BOW, TF-IDF, Word2Vec, BERT) Word Embeddings + (SVM, Naive Bayes, Decision Tree, Random Forest) Base Classifiers + Pre-trained BERT on Tensorflow Hub + 1-D CNN and Bi-Directional LSTM on IMDB Movie Reviews Dataset
Stars: ✭ 40 (-34.43%)
Mutual labels:  tf-idf
TextAudit
Implementation approach and demo of a text-moderation module for a short-video app.
Stars: ✭ 63 (+3.28%)
Mutual labels:  tf-idf
flink-training-troubleshooting
No description or website provided.
Stars: ✭ 41 (-32.79%)
Mutual labels:  flink
wink-bm25-text-search
Fast Full Text Search based on BM25
Stars: ✭ 44 (-27.87%)
Mutual labels:  tf-idf
text-classification-cn
Chinese text classification practice based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models.
Stars: ✭ 81 (+32.79%)
Mutual labels:  tf-idf
review-notes
Shared team learning and review notes. Java, Scala, Flink...
Stars: ✭ 27 (-55.74%)
Mutual labels:  flink
flink-connectors
Apache Flink connectors for Pravega.
Stars: ✭ 84 (+37.7%)
Mutual labels:  flink
piglet
A compiler for Pig Latin to Spark and Flink.
Stars: ✭ 23 (-62.3%)
Mutual labels:  flink
devsearch
A web search engine built with Python which uses TF-IDF and PageRank to sort search results.
Stars: ✭ 52 (-14.75%)
Mutual labels:  tf-idf
flink-learn
Learning Flink : Flink CEP,Flink Core,Flink SQL
Stars: ✭ 70 (+14.75%)
Mutual labels:  flink
emma
A quotation-based Scala DSL for scalable data analysis.
Stars: ✭ 61 (+0%)
Mutual labels:  flink
minimal-search-engine
A minimal search engine / PageRank / tf-idf.
Stars: ✭ 18 (-70.49%)
Mutual labels:  tf-idf
Keywords-Abstract-TFIDF-TextRank4ZH
Extract keywords, summaries, and key phrases from Chinese text using tf-idf, TextRank4ZH, and other methods.
Stars: ✭ 26 (-57.38%)
Mutual labels:  tf-idf
coolplayflink
Flink: Stateful Computations over Data Streams
Stars: ✭ 14 (-77.05%)
Mutual labels:  flink
Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Stars: ✭ 52 (-14.75%)
Mutual labels:  flink
FlinkTutorial
FlinkTutorial focuses on big-data Flink stream processing: basics, concepts, principles, hands-on practice, performance tuning, and source-code analysis. Developed in Java, with core code partly in Scala. Follow my blog and GitHub.
Stars: ✭ 46 (-24.59%)
Mutual labels:  flink
soan
Social Analysis based on Whatsapp data
Stars: ✭ 106 (+73.77%)
Mutual labels:  tf-idf
cassandra.realtime
Different ways to process data into Cassandra in realtime with technologies such as Kafka, Spark, Akka, Flink
Stars: ✭ 25 (-59.02%)
Mutual labels:  flink

FBLYZE: a Facebook page and group scraping and analysis system.

Badges: Travis · Codecov · Codefresh build status · Gitter chat (https://gitter.im/fb_scraper/Lobby) · Docker Pulls · Code Health

Getting started tutorial on Medium.

The goal of this project is to implement a Facebook scraping and extraction engine. It is originally based on the scraper from minimaxir, which you can find here. However, our project aims to take this one step further and create a continuous scraping and processing system that can easily be deployed to production. Specifically, for our purposes we want to extract information about upcoming paddling meetups, event information, flow info, and other river-related reports. However, this project should be useful for anyone who needs regular scraping of FB pages or groups.

Instructions

To get the ID of a Facebook group, go here and input the URL of the group you are trying to scrape. For pages, you can just use the part after the slash (e.g., http://facebook.com/paddlesoft would be paddlesoft).
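For pages, the slug can also be pulled out of the URL programmatically. A minimal sketch (the helper name is hypothetical, not part of fb_scraper):

```python
from urllib.parse import urlparse

def page_slug(url):
    """Return the page name after the slash, e.g. 'paddlesoft'."""
    # Strip the path's surrounding slashes and keep the first segment
    return urlparse(url).path.strip("/").split("/")[0]

print(page_slug("http://facebook.com/paddlesoft"))  # paddlesoft
```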

Update: we have switched to using a database for recording information. Please see the documentation for revised instructions.

Docker

We recommend you use our Docker image, as it contains everything you need. For instructions on how to use our Dockerfile, please see the wiki page. Our Dockerfile is tested regularly on Codefresh, so you can easily check whether the build is passing above.

Running Locally

You will need Python 3.5+. If you want to use the examples (located in /data), you will also need Jupyter Notebook and Spark.

  1. Create a file called app.txt and place your app_id and app_secret in it. Alternatively, you can set these in your system environment variables, similar to the way you would for Docker.
  2. Use get_posts.py to pull data from a FB group. So far we have provided five basic functions: you can either do a full scrape or scrape from the last timestamp, and you can choose whether to write the posts to a CSV or send them as Kafka messages. See get_posts.py for more details. Example:
from get_posts import scrape_comments_from_last_scrape, scrape_posts_from_last_scrape
# ID of the Facebook group to scrape
group_id = "115285708497149"
# Pull only posts and comments created since the last recorded scrape
scrape_posts_from_last_scrape(group_id)
scrape_comments_from_last_scrape(group_id)
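The "scrape from the last timestamp" behavior boils down to filtering out posts at or before the most recently recorded scrape time. A rough sketch of that idea (the function and field names are illustrative, not the actual get_posts.py internals):

```python
def posts_since(posts, last_ts):
    """Keep only posts created after the last recorded scrape time.

    posts   -- list of dicts with a 'created_time' epoch-seconds field
    last_ts -- epoch seconds of the previous scrape (0 for a full scrape)
    """
    return [p for p in posts if p["created_time"] > last_ts]

posts = [
    {"id": "1", "created_time": 100},
    {"id": "2", "created_time": 200},
]
print(posts_since(posts, 150))     # only post "2" survives the filter
print(len(posts_since(posts, 0)))  # passing 0 is equivalent to a full scrape
```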
  3. Note that our Kafka messaging system currently only works with the basic JSON data (comparable to the CSV). We are working on adding a new schema for the more complex data; see issue 11. Plans to add authentication for Kafka are in progress.

  4. Currently, the majority of the actual analysis examples are contained in the Examining data using Spark.ipynb notebook located in the data folder. You can open the notebook and specify the name of your CSV.

  5. ElasticSearch occasionally throws an authentication error when trying to save posts. If you get an authentication error when using ES, please add it to issue 15. The ability to connect to Bonsai and elastic.co is in the works.

  6. There are some other use-case examples on my main GitHub page which you can look at as well. However, I have omitted them from this repo since they are mainly in Java and require Apache Flink.

  7. We are also working on automating scraping with Apache Airflow. The DAGs we have created so far are in the dags folder. It is recommended that you use the DAGs in conjunction with our Docker image; this will avoid directory errors.
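Step 1's app.txt can be read with a couple of lines. This sketch assumes one value per line (app_id first, then app_secret) — an assumption about the file format — with an environment-variable fallback; the FB_APP_ID / FB_APP_SECRET names are likewise illustrative:

```python
import os

def load_credentials(path="app.txt"):
    """Return (app_id, app_secret).

    Assumes app.txt holds the app_id on the first line and the
    app_secret on the second (an assumed layout); falls back to the
    illustrative FB_APP_ID / FB_APP_SECRET environment variables.
    """
    if os.path.exists(path):
        with open(path) as f:
            app_id, app_secret = [line.strip() for line in f][:2]
        return app_id, app_secret
    return os.environ["FB_APP_ID"], os.environ["FB_APP_SECRET"]
```

Keeping the environment-variable fallback means the same code path works both locally and inside the Docker container, where secrets are normally injected via the environment.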

Scrape away!
