isaacmg / fb_scraper

Licence: Apache-2.0 license
FBLYZE is a Facebook scraping and analysis system.


Projects that are alternatives of or similar to fb_scraper

hadoopoffice
HadoopOffice - Analyze Office documents using the Hadoop ecosystem (Spark/Flink/Hive)
Stars: ✭ 56 (-8.2%)
Mutual labels:  flink
tf-idf-python
Term frequency–inverse document frequency for Chinese novel/documents implemented in python.
Stars: ✭ 98 (+60.66%)
Mutual labels:  tf-idf
SentimentAnalysis
(BOW, TF-IDF, Word2Vec, BERT) Word Embeddings + (SVM, Naive Bayes, Decision Tree, Random Forest) Base Classifiers + Pre-trained BERT on Tensorflow Hub + 1-D CNN and Bi-Directional LSTM on IMDB Movie Reviews Dataset
Stars: ✭ 40 (-34.43%)
Mutual labels:  tf-idf
TextAudit
Implementation approach and demo of a text-moderation module for a short-video app.
Stars: ✭ 63 (+3.28%)
Mutual labels:  tf-idf
flink-training-troubleshooting
No description or website provided.
Stars: ✭ 41 (-32.79%)
Mutual labels:  flink
wink-bm25-text-search
Fast Full Text Search based on BM25
Stars: ✭ 44 (-27.87%)
Mutual labels:  tf-idf
text-classification-cn
Chinese text classification practice based on the Sogou news corpus, using traditional machine learning methods as well as pre-trained models.
Stars: ✭ 81 (+32.79%)
Mutual labels:  tf-idf
review-notes
Shared team learning and review notes. Java, Scala, Flink...
Stars: ✭ 27 (-55.74%)
Mutual labels:  flink
flink-connectors
Apache Flink connectors for Pravega.
Stars: ✭ 84 (+37.7%)
Mutual labels:  flink
piglet
A compiler for Pig Latin to Spark and Flink.
Stars: ✭ 23 (-62.3%)
Mutual labels:  flink
devsearch
A web search engine built with Python which uses TF-IDF and PageRank to sort search results.
Stars: ✭ 52 (-14.75%)
Mutual labels:  tf-idf
flink-learn
Learning Flink : Flink CEP,Flink Core,Flink SQL
Stars: ✭ 70 (+14.75%)
Mutual labels:  flink
emma
A quotation-based Scala DSL for scalable data analysis.
Stars: ✭ 61 (+0%)
Mutual labels:  flink
minimal-search-engine
A minimal search engine / PageRank / tf-idf.
Stars: ✭ 18 (-70.49%)
Mutual labels:  tf-idf
Keywords-Abstract-TFIDF-TextRank4ZH
Extract keywords, summaries, and key phrases from Chinese text using tf-idf, TextRank4ZH, and other methods.
Stars: ✭ 26 (-57.38%)
Mutual labels:  tf-idf
coolplayflink
Flink: Stateful Computations over Data Streams
Stars: ✭ 14 (-77.05%)
Mutual labels:  flink
Real-time-Data-Warehouse
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Stars: ✭ 52 (-14.75%)
Mutual labels:  flink
FlinkTutorial
FlinkTutorial focuses on big-data Flink stream processing: basics, concepts, principles, hands-on practice, performance tuning, and source-code analysis. Developed in Java, with core code partly in Scala. Follow my blog and GitHub.
Stars: ✭ 46 (-24.59%)
Mutual labels:  flink
soan
Social Analysis based on Whatsapp data
Stars: ✭ 106 (+73.77%)
Mutual labels:  tf-idf
cassandra.realtime
Different ways to process data into Cassandra in realtime with technologies such as Kafka, Spark, Akka, Flink
Stars: ✭ 25 (-59.02%)
Mutual labels:  flink

FBLYZE: a Facebook page and group scraping and analysis system.

Badges: Travis · Codecov · Codefresh build status · Gitter chat (https://gitter.im/fb_scraper/Lobby) · Docker Pulls · Code Health

Getting started tutorial on Medium.

The goal of this project is to implement a Facebook scraping and extraction engine. It is originally based on the scraper from minimaxir, which you can find here. However, our project aims to take this one step further and create a continuous scraping and processing system that can easily be deployed to production. Specifically, for our purposes we want to extract information about upcoming paddling meetups, event information, flow info, and other river-related reports. However, this project should be useful for anyone who needs regular scraping of FB pages or groups.

Instructions

To get the ID of a Facebook group, go here and input the URL of the group you are trying to scrape. For pages, you can just use the part after the slash (e.g., http://facebook.com/paddlesoft would be paddlesoft).
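For pages, the slug can also be pulled out of the URL programmatically. A minimal sketch (the helper name is hypothetical, not part of fb_scraper):

```python
from urllib.parse import urlparse

def page_slug(url):
    """Return the page name after the slash, e.g. 'paddlesoft'."""
    # Strip the path's surrounding slashes and keep the first segment
    return urlparse(url).path.strip("/").split("/")[0]

print(page_slug("http://facebook.com/paddlesoft"))  # paddlesoft
```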

Update: we have switched to using a database for recording information. Please see the documentation for revised instructions.

Docker

We recommend you use our Docker image, as it contains everything you need. For instructions on how to use our Dockerfile, please see the wiki page. Our Dockerfile is tested regularly on Codefresh, so you can easily check whether the build is passing above.

Running Locally

You will need Python 3.5+. If you want to use the examples (located in /data), you will also need Jupyter Notebook and Spark.

  1. Create a file called app.txt and place your app_id and app_secret in it. Alternatively, you can set these in your system environment variables, similar to the way you would for Docker.
  2. Use get_posts.py to pull data from a FB group. So far we have provided five basic functions: you can either do a full scrape or scrape from the last timestamp, and you can choose whether to write the posts to a CSV or send them as Kafka messages. See get_posts.py for more details. Example:
from get_posts import scrape_comments_from_last_scrape, scrape_posts_from_last_scrape
# ID of the Facebook group to scrape
group_id = "115285708497149"
# Pull only posts and comments created since the last recorded scrape
scrape_posts_from_last_scrape(group_id)
scrape_comments_from_last_scrape(group_id)
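The "scrape from the last timestamp" behavior boils down to filtering out posts at or before the most recently recorded scrape time. A rough sketch of that idea (the function and field names are illustrative, not the actual get_posts.py internals):

```python
def posts_since(posts, last_ts):
    """Keep only posts created after the last recorded scrape time.

    posts   -- list of dicts with a 'created_time' epoch-seconds field
    last_ts -- epoch seconds of the previous scrape (0 for a full scrape)
    """
    return [p for p in posts if p["created_time"] > last_ts]

posts = [
    {"id": "1", "created_time": 100},
    {"id": "2", "created_time": 200},
]
print(posts_since(posts, 150))     # only post "2" survives the filter
print(len(posts_since(posts, 0)))  # passing 0 is equivalent to a full scrape
```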
  3. Note that our Kafka messaging system currently only works with the basic JSON data (comparable to the CSV). We are working on adding a new schema for the more complex data; see issue 11. Plans to add authentication for Kafka are in progress.

  4. Currently, the majority of the actual analysis examples are contained in the Examining data using Spark.ipynb notebook located in the data folder. You can open the notebook and specify the name of your CSV.

  5. ElasticSearch occasionally throws an authentication error when trying to save posts. If you get an authentication error when using ES, please add it to issue 15. The ability to connect to Bonsai and elastic.co is in the works.

  6. There are some other use-case examples on my main GitHub page which you can look at as well. However, I have omitted them from this repo since they are mainly in Java and require Apache Flink.

  7. We are also working on automating scraping with Apache Airflow. The DAGs we have created so far are in the dags folder. It is recommended that you use the DAGs in conjunction with our Docker image; this will avoid directory errors.
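Step 1's app.txt can be read with a couple of lines. This sketch assumes one value per line (app_id first, then app_secret) — an assumption about the file format — with an environment-variable fallback; the FB_APP_ID / FB_APP_SECRET names are likewise illustrative:

```python
import os

def load_credentials(path="app.txt"):
    """Return (app_id, app_secret).

    Assumes app.txt holds the app_id on the first line and the
    app_secret on the second (an assumed layout); falls back to the
    illustrative FB_APP_ID / FB_APP_SECRET environment variables.
    """
    if os.path.exists(path):
        with open(path) as f:
            app_id, app_secret = [line.strip() for line in f][:2]
        return app_id, app_secret
    return os.environ["FB_APP_ID"], os.environ["FB_APP_SECRET"]
```

Keeping the environment-variable fallback means the same code path works both locally and inside the Docker container, where secrets are normally injected via the environment.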

Scrape away!
