wsuen / Pygotham2018_graphmining

Large-scale Graph Mining with Spark

PyGotham 2018 talk. See also my tutorial on Medium.

Getting started

This repo includes a Dockerfile for running a Jupyter notebook with PySpark.

Running the notebook

  1. Make sure you have Docker installed.
  2. Run make build to create your Docker image. This may take a while.
  3. Run make run_notebook_volume. This starts a Docker container with a volume containing the notebooks and sample dataset.
  4. Go to 127.0.0.1:8888 to see the notebook server. You may need to enter an authentication token, which is printed in your terminal output.
  5. Open work/notebooks/Graphframes_demo.

Stopping Jupyter notebook

  1. Find the running container with docker ps.
  2. Kill the container with docker kill <container_id>.

About the sample dataset

I also included a small sample dataset that I created from the Common Crawl September 2017 dataset. The data, stored as a Parquet file under notebooks/data/outlinks_pq, has the following columns:

  • parent: full URL of the parent node, the HTML page I pulled links from.
  • parentTLD: top-level domain of the parent.
  • childTLD: top-level domain of the child.
  • child: full URL of the child node, the link found on the parent web page.
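
To make the four columns concrete, here is a minimal pure-Python sketch of how page-level links in this format could be collapsed into weighted domain-level edges. The rows, URLs, and the helper name domain_edges are invented for illustration and are not taken from the dataset; the real parentTLD and childTLD values may use a different granularity.

```python
from collections import Counter

# Hypothetical rows in the parent / parentTLD / childTLD / child layout.
# All values are made up for illustration only.
sample_rows = [
    {"parent": "http://example.com/a", "parentTLD": "example.com",
     "childTLD": "example.org", "child": "http://example.org/x"},
    {"parent": "http://example.com/b", "parentTLD": "example.com",
     "childTLD": "example.org", "child": "http://example.org/y"},
    {"parent": "http://example.com/a", "parentTLD": "example.com",
     "childTLD": "example.com", "child": "http://example.com/b"},
    {"parent": "http://example.org/x", "parentTLD": "example.org",
     "childTLD": "example.net", "child": "http://example.net/"},
]


def domain_edges(rows):
    """Count links between distinct domains, ignoring same-domain links."""
    return Counter(
        (row["parentTLD"], row["childTLD"])
        for row in rows
        if row["parentTLD"] != row["childTLD"]
    )


for (src, dst), weight in sorted(domain_edges(sample_rows).items()):
    print(src, "->", dst, "weight", weight)
```

Working at the domain level rather than the page level keeps the graph much smaller, which is one reasonable starting point before running graph algorithms on it.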

Hopefully this will jumpstart your exploration of web graphs, the label propagation algorithm (LPA), PageRank, and other cool features!
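
As a rough illustration of what label propagation does, here is a small pure-Python sketch on a toy graph. It is not the notebook's code and does not use Spark: it assumes undirected edges, synchronous updates, and ties broken by the smallest label, all of which are simplifications chosen to keep the example deterministic. The domain names and the function name label_propagation are hypothetical.

```python
from collections import Counter, defaultdict


def label_propagation(edges, max_iter=5):
    """Synchronous label propagation; ties go to the smallest label."""
    # Build an undirected adjacency map (assumption: link direction is ignored).
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    # Every node starts in its own community.
    labels = {node: node for node in neighbors}

    for _ in range(max_iter):
        new_labels = {}
        for node, nbrs in neighbors.items():
            # Adopt the label held by the most neighbors.
            counts = Counter(labels[n] for n in nbrs)
            top = max(counts.values())
            new_labels[node] = min(
                label for label, count in counts.items() if count == top
            )
        if new_labels == labels:
            break  # no label changed, so the assignment is stable
        labels = new_labels
    return labels


# Two triangles of made-up domains joined by a single link.
toy_edges = [
    ("a.com", "b.com"), ("b.com", "c.com"), ("a.com", "c.com"),
    ("d.com", "e.com"), ("e.com", "f.com"), ("d.com", "f.com"),
    ("c.com", "d.com"),
]

communities = label_propagation(toy_edges)
for node in sorted(communities):
    print(node, "->", communities[node])
```

On this toy graph the two triangles end up with different labels. Real implementations differ in update order and tie-breaking, so results on real data can vary between runs and libraries.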

References

Adamic, Lada A., and Natalie Glance. "The political blogosphere and the 2004 US election: divided they blog." Proceedings of the 3rd international workshop on Link discovery. ACM, 2005.

Common Crawl dataset (September 2017).

Farine, Damien R., et al. "Both nearest neighbours and long-term affiliates predict individual locations during collective movement in wild baboons." Scientific reports 6 (2016): 27704.

Fortunato, Santo. "Community detection in graphs." Physics reports 486.3-5 (2010): 75-174.

Girvan, Michelle, and Mark EJ Newman. "Community structure in social and biological networks." Proceedings of the national academy of sciences 99.12 (2002): 7821-7826.

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2014.

Raghavan, Usha Nandini, Réka Albert, and Soundar Kumara. "Near linear time algorithm to detect community structures in large-scale networks." Physical review E 76.3 (2007): 036106.

Zachary karate club network dataset -- KONECT, April 2017.

Additional Resources

Spark

  • I like Learning Spark by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia.
  • Also High Performance Spark by Holden Karau and Rachel Warren.

GraphFrames
