harsha2010 / Magellan

Licence: apache-2.0
Geo Spatial Data Analytics on Spark

Magellan: Geospatial Analytics Using Spark

Magellan is a distributed execution engine for geospatial analytics on big data. It is implemented on top of Apache Spark and deeply leverages modern database techniques like efficient data layout, code generation and query optimization in order to optimize geospatial queries.

The application developer writes standard SQL or DataFrame queries to evaluate geometric expressions. The execution engine takes care of laying data out efficiently in memory during query processing, picking the right query plan, and optimizing query execution with cheap and efficient spatial indices, all while presenting a declarative abstraction to the developer.

Magellan is the first library to extend Spark SQL to provide a relational abstraction for geospatial analytics. I see it as an evolution of geospatial analytics engines into the emerging world of big data: it provides developer-friendly abstractions that anyone who understands or uses Apache Spark can leverage, while showcasing an execution engine that is state of the art for geospatial analytics on big data.

Version Release Notes

You can find notes on the various released versions here

Linking

You can link against the latest release using the following coordinates:

groupId: harsha2010
artifactId: magellan
version: 1.0.5-s_2.11

Requirements

v1.0.5 requires Spark 2.1+ and Scala 2.11

Capabilities

The library currently supports reading the following formats:

  • Shapefile
  • GeoJSON
  • OSM-XML

We aim to support the full suite of OpenGIS Simple Features for SQL spatial predicate functions and operators together with additional topological functions.

The following geometries are currently supported:

  • Point
  • LineString
  • Polygon
  • MultiPoint
  • MultiPolygon (treated as a collection of Polygons and read in as a row per polygon by the GeoJSON reader)

The following predicates are currently supported:

  • Intersects
  • Contains
  • Within

The following languages are currently supported:

  • Scala

Reading Data

You can read Shapefile formatted data as follows:

val df = sqlCtx.read.
  format("magellan").
  load(path)
  
df.show()

+-----+--------+--------------------+--------------------+-----+
|point|polyline|             polygon|            metadata|valid|
+-----+--------+--------------------+--------------------+-----+
| null|    null|Polygon(5, Vector...|Map(neighborho ->...| true|
| null|    null|Polygon(5, Vector...|Map(neighborho ->...| true|
| null|    null|Polygon(5, Vector...|Map(neighborho ->...| true|
| null|    null|Polygon(5, Vector...|Map(neighborho ->...| true|
+-----+--------+--------------------+--------------------+-----+

df.select(df("metadata")("neighborho")).show()

+--------------------+
|metadata[neighborho]|
+--------------------+
|Twin Peaks       ...|
|Pacific Heights  ...|
|Visitacion Valley...|
|Potrero Hill     ...|
+--------------------+

To read GeoJSON data, pass the option type = geojson when loading:

val df = sqlCtx.read.
  format("magellan").
  option("type", "geojson").
  load(path)

Scala API

Magellan is hosted on Spark Packages

When launching the Spark Shell, Magellan can be included like any other spark package using the --packages option:

> $SPARK_HOME/bin/spark-shell --packages harsha2010:magellan:1.0.5-s_2.11

A few common imports you will likely need when working with Magellan:

import magellan.{Point, Polygon}
import org.apache.spark.sql.magellan.dsl.expressions._
import org.apache.spark.sql.types._

Data Structures

Point

val points = sc.parallelize(Seq((-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0))).
  toDF("x", "y").
  select(point($"x", $"y").as("point"))

points.show()

+-----------------+
|            point|
+-----------------+
|Point(-1.0, -1.0)|
| Point(-1.0, 1.0)|
| Point(1.0, -1.0)|
+-----------------+

Polygon

case class PolygonRecord(polygon: Polygon)

val ring = Array(Point(1.0, 1.0), Point(1.0, -1.0),
 Point(-1.0, -1.0), Point(-1.0, 1.0),
 Point(1.0, 1.0))
val polygons = sc.parallelize(Seq(
    PolygonRecord(Polygon(Array(0), ring))
  )).toDF()
  
polygons.show()

+--------------------+
|             polygon|
+--------------------+
|Polygon(5, Vector...|
+--------------------+

Predicates

within

points.join(polygons).where($"point" within $"polygon").show()

intersects

points.join(polygons).where($"point" intersects $"polygon").show()

+-----------------+--------------------+
|            point|             polygon|
+-----------------+--------------------+
|Point(-1.0, -1.0)|Polygon(5, Vector...|
| Point(-1.0, 1.0)|Polygon(5, Vector...|
| Point(1.0, -1.0)|Polygon(5, Vector...|
+-----------------+--------------------+
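
The difference between within and intersects in the two examples above comes down to boundary handling. All three example points lie on the polygon's boundary, so under the usual OpenGIS Simple Features semantics they should intersect the polygon without being within it. The following is a minimal, Spark-free sketch of that distinction using ray casting. The names PredicateSketch, Pt, onBoundary and strictlyInside are hypothetical and chosen for illustration; this is not Magellan's implementation.

```scala
object PredicateSketch {
  final case class Pt(x: Double, y: Double)

  // Consecutive vertex pairs of a closed ring (first vertex repeated at the end).
  private def edges(ring: Seq[Pt]): Seq[(Pt, Pt)] = ring.zip(ring.tail)

  // True when p lies on the closed segment a-b (exact comparison, fine for a sketch).
  private def onSegment(p: Pt, a: Pt, b: Pt): Boolean = {
    val cross = (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x)
    cross == 0.0 &&
      p.x >= math.min(a.x, b.x) && p.x <= math.max(a.x, b.x) &&
      p.y >= math.min(a.y, b.y) && p.y <= math.max(a.y, b.y)
  }

  def onBoundary(p: Pt, ring: Seq[Pt]): Boolean =
    edges(ring).exists { case (a, b) => onSegment(p, a, b) }

  // Ray casting: an odd number of edge crossings to the right of p means inside.
  def strictlyInside(p: Pt, ring: Seq[Pt]): Boolean = {
    val crossings = edges(ring).count { case (a, b) =>
      (a.y > p.y) != (b.y > p.y) &&
        p.x < (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x
    }
    !onBoundary(p, ring) && crossings % 2 == 1
  }

  // within: interior only. intersects: interior or boundary.
  def within(p: Pt, ring: Seq[Pt]): Boolean = strictlyInside(p, ring)
  def intersects(p: Pt, ring: Seq[Pt]): Boolean =
    onBoundary(p, ring) || strictlyInside(p, ring)
}
```

With the ring from the Polygon example, PredicateSketch.intersects holds for each of the three corner points while PredicateSketch.within does not; a point such as (0.0, 0.0) satisfies both.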

contains

Because contains is an overloaded expression (Spark SQL already uses contains to check String containment), Magellan uses the binary expression >? to check shape containment.

points.join(polygons).where($"polygon" >? $"point").show()

A Databricks notebook with similar examples is published here for convenience.

Spatial indexes

Starting with v1.0.5, Magellan supports spatial indexes. The supported spatial indexes are based on so-called ZOrder Curves.
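
For readers unfamiliar with Z-order curves, the idea can be sketched in a few lines: repeatedly bisect the longitude and latitude ranges and record one bit per bisection, interleaving the two dimensions. The sketch below assumes that precision counts the interleaved bits and that longitude goes first, as in geohash. ZOrderSketch and encode are hypothetical names, not Magellan's API.

```scala
object ZOrderSketch {
  // Encodes a point as `precision` interleaved bits: even bits bisect
  // longitude, odd bits bisect latitude (the geohash convention).
  def encode(lon: Double, lat: Double, precision: Int): String = {
    var xMin = -180.0
    var xMax = 180.0
    var yMin = -90.0
    var yMax = 90.0
    val bits = new StringBuilder
    for (i <- 0 until precision) {
      if (i % 2 == 0) {
        val mid = (xMin + xMax) / 2
        if (lon >= mid) { bits.append('1'); xMin = mid }
        else { bits.append('0'); xMax = mid }
      } else {
        val mid = (yMin + yMax) / 2
        if (lat >= mid) { bits.append('1'); yMin = mid }
        else { bits.append('0'); yMax = mid }
      }
    }
    bits.toString
  }
}
```

A useful property of this encoding is that the code of a point at a lower precision is a prefix of its code at a higher precision, so nearby points tend to share long prefixes.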

Given a column of shapes, one can index the shapes to a given precision using a geohash indexer by doing the following:

df.withColumn("index", $"polygon" index 30)

This produces a new column called index, which holds a list of ZOrder Curves of precision 30 that together cover the polygon.
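
To make "a list of cells that together cover the shape" concrete, here is a minimal sketch that covers a bounding box (standing in for a polygon) with fixed-precision cells by recursive bisection. It assumes that precision counts interleaved bits with longitude first. CoverSketch, BBox and cover are hypothetical names, and Magellan's own indexer may work differently.

```scala
object CoverSketch {
  final case class BBox(xMin: Double, yMin: Double, xMax: Double, yMax: Double) {
    // Closed-interval overlap test: boxes that merely touch count as overlapping.
    def overlaps(o: BBox): Boolean =
      xMin <= o.xMax && o.xMin <= xMax && yMin <= o.yMax && o.yMin <= yMax
  }

  // Returns the bit-string codes of all cells at `precision` bits that overlap `target`.
  def cover(target: BBox, precision: Int): Seq[String] = {
    def go(cell: BBox, code: String): Seq[String] =
      if (!cell.overlaps(target)) Seq.empty       // prune cells that miss the shape
      else if (code.length == precision) Seq(code) // reached the requested precision
      else {
        val (lo, hi) =
          if (code.length % 2 == 0) { // even bit: split longitude
            val mid = (cell.xMin + cell.xMax) / 2
            (cell.copy(xMax = mid), cell.copy(xMin = mid))
          } else {                    // odd bit: split latitude
            val mid = (cell.yMin + cell.yMax) / 2
            (cell.copy(yMax = mid), cell.copy(yMin = mid))
          }
        go(lo, code + "0") ++ go(hi, code + "1")
      }
    go(BBox(-180.0, -90.0, 180.0, 90.0), "")
  }
}
```

A higher precision yields smaller cells and a tighter cover, at the cost of a longer list per shape.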

Creating Indexes while loading data

The Spatial Relations (GeoJSON, Shapefile, OSM-XML) all have the ability to automatically index the geometries while loading them.

To turn this feature on, pass in the parameter magellan.index = true and optionally a value for magellan.index.precision (default = 30) while loading the data as follows:

spark.read.format("magellan")
  .option("magellan.index", "true")
  .option("magellan.index.precision", "25")
  .load(path)

This creates an additional column called index which holds the list of ZOrder Curves of the given precision that cover each geometry in the dataset.

Spatial Joins

Magellan leverages Spark SQL and supports joins by default. However, these joins are not aware that the columns are geometric, so a join of the form

  points.join(polygons).where($"point" within $"polygon")

will be treated as a Cartesian join followed by a predicate. In some cases, especially when the polygon dataset is small (on the order of 100 to 10,000 polygons), this is fast enough. However, when the number of polygons is much larger than that, you will need spatial joins to scale this computation.

To enable spatial joins in Magellan, add a spatial join rule to Spark by injecting the following code before the join:

  magellan.Utils.injectRules(spark)

Furthermore, during the join, you need to give Magellan a hint of the precision at which to create indices for the join.

You can do this by annotating either of the DataFrames involved in the join with a spatial join hint as follows:

val indexed = df.index(30) // after load, or
val df = spark.read.format(...).load(..).index(30) // during load

Then a join of the form

  points.join(polygons).where($"point" within $"polygon") // or
  
  points.join(polygons index 30).where($"point" within $"polygon")

automatically uses indexes to speed up the join.
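
The reason an index helps can be sketched without Spark: instead of testing every point against every shape, both sides are keyed by grid cell, candidates are matched only within the same cell, and the exact predicate runs on those candidates alone. The sketch below uses a uniform grid and axis-aligned boxes in place of ZOrder Curves and polygons to stay short. All names (IndexedJoinSketch, Pt, Box, cellOf, coveringCells, join) are hypothetical, and this is not how Magellan's join rule is implemented.

```scala
object IndexedJoinSketch {
  final case class Pt(x: Double, y: Double)

  // Axis-aligned box standing in for a polygon, so the exact predicate stays trivial.
  final case class Box(id: String, xMin: Double, yMin: Double, xMax: Double, yMax: Double) {
    def strictlyContains(p: Pt): Boolean =
      p.x > xMin && p.x < xMax && p.y > yMin && p.y < yMax
  }

  type Cell = (Int, Int)

  // Grid cell that a point falls into, for square cells of side `size`.
  def cellOf(p: Pt, size: Double): Cell =
    (math.floor(p.x / size).toInt, math.floor(p.y / size).toInt)

  // All grid cells touched by a box.
  def coveringCells(b: Box, size: Double): Seq[Cell] = {
    val (x0, y0) = cellOf(Pt(b.xMin, b.yMin), size)
    val (x1, y1) = cellOf(Pt(b.xMax, b.yMax), size)
    for (x <- x0 to x1; y <- y0 to y1) yield (x, y)
  }

  // Join on the cell key first, then apply the exact predicate to the candidates.
  def join(points: Seq[Pt], boxes: Seq[Box], size: Double): Seq[(Pt, String)] = {
    val boxesByCell: Map[Cell, Seq[Box]] =
      boxes
        .flatMap(b => coveringCells(b, size).map(c => (c, b)))
        .groupBy(_._1)
        .map { case (c, pairs) => c -> pairs.map(_._2) }
    for {
      p <- points
      b <- boxesByCell.getOrElse(cellOf(p, size), Seq.empty)
      if b.strictlyContains(p)
    } yield (p, b.id)
  }
}
```

The result matches a brute-force nested loop over all pairs, but each point is only compared against the boxes that share its cell.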

Developer Channel

Please visit Gitter to discuss Magellan, obtain help from developers or report issues.

Magellan Blog

For more details on Magellan, thoughts on geospatial analytics, and the optimizations chosen for this project, please visit my blog.
