
locationtech-labs / Geopyspark

Licence: other
GeoTrellis for PySpark

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Geopyspark

Magellan
Geo Spatial Data Analytics on Spark
Stars: ✭ 507 (+203.59%)
Mutual labels:  spark, big-data, geospatial
Logisland
Scalable stream processing platform for advanced realtime analytics on top of Kafka and Spark. LogIsland also supports MQTT and Kafka Streams (Flink being in the roadmap). The platform does complex event processing and is suitable for time series analysis. A large set of valuable ready to use processors, data sources and sinks are available.
Stars: ✭ 97 (-41.92%)
Mutual labels:  spark, big-data
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (+701.2%)
Mutual labels:  spark, big-data
Richdem
High-performance Terrain and Hydrology Analysis
Stars: ✭ 127 (-23.95%)
Mutual labels:  big-data, geospatial
Labs
Research on distributed system
Stars: ✭ 73 (-56.29%)
Mutual labels:  spark, big-data
Spark Website
Apache Spark Website
Stars: ✭ 75 (-55.09%)
Mutual labels:  spark, big-data
Bigdataclass
Two-day workshop that covers how to use R to interact with databases and Spark
Stars: ✭ 110 (-34.13%)
Mutual labels:  spark, big-data
Spark Doc Zh
Chinese translation of the official Apache Spark documentation
Stars: ✭ 1,126 (+574.25%)
Mutual labels:  spark, big-data
Gaffer
A large-scale entity and relation database supporting aggregation of properties
Stars: ✭ 1,642 (+883.23%)
Mutual labels:  spark, big-data
Calcite Avatica
Mirror of Apache Calcite - Avatica
Stars: ✭ 130 (-22.16%)
Mutual labels:  big-data, geospatial
Spark On Lambda
Apache Spark on AWS Lambda
Stars: ✭ 137 (-17.96%)
Mutual labels:  spark, big-data
Geni
A Clojure dataframe library that runs on Spark
Stars: ✭ 152 (-8.98%)
Mutual labels:  spark, big-data
Big Data Engineering Coursera Yandex
Big Data for Data Engineers Coursera Specialization from Yandex
Stars: ✭ 71 (-57.49%)
Mutual labels:  spark, big-data
Setl
A simple Spark-powered ETL framework that just works 🍺
Stars: ✭ 79 (-52.69%)
Mutual labels:  spark, big-data
Rsparkling
RSparkling: Use H2O Sparkling Water from R (Spark + R + Machine Learning)
Stars: ✭ 65 (-61.08%)
Mutual labels:  spark, big-data
Bigdata Notes
A beginner's guide to big data ⭐
Stars: ✭ 10,991 (+6481.44%)
Mutual labels:  spark, big-data
Spark With Python
Fundamentals of Spark with Python (using PySpark), code examples
Stars: ✭ 150 (-10.18%)
Mutual labels:  spark, big-data
Spark
Apache Spark - A unified analytics engine for large-scale data processing
Stars: ✭ 31,618 (+18832.93%)
Mutual labels:  spark, big-data
Docker Spark Cluster
A Spark cluster setup running on Docker containers
Stars: ✭ 57 (-65.87%)
Mutual labels:  spark, big-data
Feast
Feature Store for Machine Learning
Stars: ✭ 2,576 (+1442.51%)
Mutual labels:  spark, big-data

GeoPySpark
==========

.. image:: https://travis-ci.org/locationtech-labs/geopyspark.svg?branch=master
   :target: https://travis-ci.org/locationtech-labs/geopyspark

.. image:: https://readthedocs.org/projects/geopyspark/badge/?version=latest
   :target: https://geopyspark.readthedocs.io/en/latest/?badge=latest

.. image:: https://badges.gitter.im/locationtech-labs/geopyspark.png
   :target: https://gitter.im/geotrellis/geotrellis

GeoPySpark is not currently under active development. We will try to address PRs and issues, but it may take some time, as most of our resources are now devoted to other projects. The project may be revisited in the future, so it is by no means dead.

GeoPySpark is a Python bindings library for `GeoTrellis <http://geotrellis.io>`_, a Scala library for working with geospatial data in a distributed environment. By using `PySpark <http://spark.apache.org/docs/latest/api/python/pyspark.html>`_, GeoPySpark provides an interface into the GeoTrellis framework.

Links
-----

* `Documentation <https://geopyspark.readthedocs.io>`_
* `Gitter <https://gitter.im/geotrellis/geotrellis>`_

A Quick Example
---------------

Here is a quick example of GeoPySpark. In the following code, we take 2011 NLCD data for the state of Pennsylvania and mask it with a Polygon that represents an area of interest. The masked layer is then saved.

If you wish to follow along with this example, you will need to download the NLCD data and unzip it. Running these two commands will complete those tasks for you:

.. code:: console

   curl -o /tmp/NLCD2011_LC_Pennsylvania.zip "https://s3-us-west-2.amazonaws.com/prd-tnm/StagedProducts/NLCD/data/2011/landcover/states/NLCD2011_LC_Pennsylvania.zip?ORIG=513_SBDDG"
   unzip -d /tmp /tmp/NLCD2011_LC_Pennsylvania.zip

.. code:: python

   import geopyspark as gps

   from pyspark import SparkContext
   from shapely.geometry import box

   # Create the SparkContext
   conf = gps.geopyspark_conf(appName="geopyspark-example", master="local[*]")
   sc = SparkContext(conf=conf)

   # Read in the NLCD tif that has been saved locally.
   # This tif represents the state of Pennsylvania.
   raster_layer = gps.geotiff.get(layer_type=gps.LayerType.SPATIAL,
                                  uri='/tmp/NLCD2011_LC_Pennsylvania.tif',
                                  num_partitions=100)

   # Tile the rasters within the layer and reproject them to Web Mercator.
   tiled_layer = raster_layer.tile_to_layout(layout=gps.GlobalLayout(), target_crs=3857)

   # Create a Polygon that covers roughly the north-west section of Philadelphia.
   # This is the region that will be masked.
   area_of_interest = box(-75.229225, 40.003686, -75.107345, 40.084375)

   # Mask the tiles within the layer with the area of interest.
   masked = tiled_layer.mask(geometries=area_of_interest)

   # We will now pyramid the masked TiledRasterLayer so that we can use it
   # in a TMS server later.
   pyramided_mask = masked.pyramid()

   # Save each layer of the pyramid locally so that it can be accessed at a later time.
   for pyramid in pyramided_mask.levels.values():
       gps.write(uri='file:///tmp/pa-nlcd-2011',
                 layer_name='north-west-philly',
                 tiled_raster_layer=pyramid)

For additional examples, check out the `Jupyter notebook demos <./notebook-demos>`_.

Requirements
------------

============ ============
Requirement  Version
============ ============
Java         >=1.8
Scala        >=2.11
Python       3.3 - 3.6
Spark        >=2.1.1
============ ============

Java 8 and Scala 2.11 are needed for GeoPySpark to work, as they are required by GeoTrellis. In addition, Spark needs to be installed and configured with the environment variable ``SPARK_HOME`` set.

You can test to see if Spark is installed properly by running the following in the terminal:

.. code:: console

   echo $SPARK_HOME
   /usr/local/bin/spark

If the output is a path leading to your Spark folder, then Spark has been configured correctly. If ``SPARK_HOME`` is unset or empty, you'll need to set it after noting where Spark is installed on your system. For example, a macOS installation of Spark 2.3.0 via Homebrew would set ``SPARK_HOME`` as follows:

.. code:: bash

   # In ~/.bash_profile

   export SPARK_HOME=/usr/local/Cellar/apache-spark/2.3.0/libexec/
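
The same check can be done programmatically; here is a minimal sketch using only the Python standard library (the ``spark_home`` helper is our own name, not part of GeoPySpark):

```python
import os


def spark_home(env=None):
    """Return the configured SPARK_HOME, or None if it is unset or empty."""
    if env is None:
        env = os.environ
    path = env.get("SPARK_HOME", "").strip()
    return path or None


# Examples with stand-in environments:
assert spark_home({"SPARK_HOME": "/usr/local/bin/spark"}) == "/usr/local/bin/spark"
assert spark_home({"SPARK_HOME": ""}) is None
```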

Installation
------------

Before installing, check the above Requirements_ table to make sure that the requirements are met.

Installing From Pip
~~~~~~~~~~~~~~~~~~~


To install via ``pip`` open the terminal and run the following:

.. code:: console

   pip install geopyspark
   geopyspark install-jar

The first command installs the Python code and the ``geopyspark`` command
from PyPI. The second downloads the backend jar file, which is too large
to be included in the pip package, and installs it to the GeoPySpark
installation directory. For more information about the ``geopyspark``
command, see the `GeoPySpark CLI`_ section.

Installing From Source
~~~~~~~~~~~~~~~~~~~~~~

If you would rather install from source, clone the GeoPySpark repo and enter it:

.. code:: console

   git clone https://github.com/locationtech-labs/geopyspark.git
   cd geopyspark
   make install

This will assemble the backend jar that contains the Scala code, move it to the ``jars`` sub-package, and then run the ``setup.py`` script.

Note: If you have altered the global behavior of ``sbt``, this install may not work as intended.

Uninstalling
------------


To uninstall GeoPySpark, run the following in the terminal:

.. code:: console

   pip uninstall geopyspark
   rm .local/bin/geopyspark

Contact and Support
-------------------

If you need help, have questions, or would like to talk to the developers (let us
know what you're working on!), you can contact us at:

 * `Gitter <https://gitter.im/geotrellis/geotrellis>`_
 * `Mailing list <https://locationtech.org/mailman/listinfo/geotrellis-user>`_

As you may have noticed from the above links, those are links to the GeoTrellis
gitter channel and mailing list. This is because this project is currently an
offshoot of GeoTrellis, and we will be using their mailing list and gitter
channel as a means of contact. However, we will form our own if there is a need
for it.

GeoPySpark CLI
--------------

When GeoPySpark is installed, it comes with a script that can be accessed
from anywhere on your computer. This script manages the GeoPySpark jar file
that must be installed in order for GeoPySpark to work correctly. Here are
the available commands:

.. code:: console

   geopyspark -h, --help                              // return help string and exit
   geopyspark install-jar                             // download the jar file to the default location (the geopyspark install dir)
   geopyspark install-jar -p, --path [download/path]  // download the jar file to the location specified
   geopyspark jar-path                                // return the relative path of the jar file
   geopyspark jar-path -a, --absolute                 // return the absolute path of the jar file

``geopyspark install-jar`` is only needed when installing GeoPySpark through
``pip``, and it **must** be run before using GeoPySpark. If no path is given,
the jar will be installed wherever GeoPySpark was installed.

The second and third commands are for getting the location of the jar file.
These can be used regardless of installation method. However, if installed
through ``pip``, then the jar must be downloaded first or these commands
will not work.
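
If you need the jar location from a Python script (for example, to hand it to ``spark-submit``), one option is to shell out to the CLI. This is a hedged sketch, not part of GeoPySpark itself; it assumes the ``geopyspark`` script is on your ``PATH``:

```python
import subprocess


def jar_path_command(absolute=False):
    """Build the argv list for the CLI's jar-path command."""
    cmd = ["geopyspark", "jar-path"]
    if absolute:
        cmd.append("--absolute")
    return cmd


def jar_path(absolute=False):
    """Run the CLI and return the reported jar path as a string."""
    result = subprocess.run(jar_path_command(absolute),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```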

Developing GeoPySpark
---------------------

Contributing
~~~~~~~~~~~~

Feedback and contributions to GeoPySpark are always welcome. A CLA is required to contribute; see `Contributing <docs/contributing.rst>`_ for more information.

Installing for Developers
~~~~~~~~~~~~~~~~~~~~~~~~~


.. code:: console

   make build
   pip install -e .

``make build`` will assemble the back-end ``jar`` and move it to the ``jars``
sub-package. The second command installs GeoPySpark in "editable" mode,
meaning any changes to the source files will also appear in your system
installation.

Within a virtualenv
^^^^^^^^^^^^^^^^^^^

It's possible that you may run into issues when performing the ``pip install -e .``
described above with a Python virtualenv active. If you're having trouble with
Python finding installed libraries within the virtualenv, try adding the virtualenv
site-packages directory to your PYTHONPATH:

.. code:: console

   workon <your-geopyspark-virtualenv-name>
   export PYTHONPATH=$VIRTUAL_ENV/lib/<your python version>/site-packages

Replace ``<your python version>`` with whatever Python version
``virtualenvwrapper`` is set to. Once you've set ``PYTHONPATH``, re-install
GeoPySpark using the instructions in "Installing for Developers" above.
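
The export above can also be computed; a small sketch (the ``venv_site_packages`` helper is our own, not part of any tool here) that builds the same path from a ``VIRTUAL_ENV`` value and a Python version:

```python
def venv_site_packages(virtual_env, version_info):
    """Build the site-packages path inside a virtualenv for a given Python version."""
    pyver = "python{}.{}".format(version_info[0], version_info[1])
    return "{}/lib/{}/site-packages".format(virtual_env, pyver)


# Example with a stand-in virtualenv path and Python 3.6:
assert venv_site_packages("/home/me/.virtualenvs/gps", (3, 6)) == \
    "/home/me/.virtualenvs/gps/lib/python3.6/site-packages"
```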

Running GeoPySpark Tests
~~~~~~~~~~~~~~~~~~~~~~~~

GeoPySpark uses the `pytest <https://docs.pytest.org/en/latest/>`_ testing
framework to run its unit tests. If you wish to run GeoPySpark's unit tests,
then you must first clone this repository to your machine. Once complete,
go to the root of the library and run the following command:

.. code:: console

   pytest

This will then run all of the tests present in the GeoPySpark library.

**Note**: The unit tests require additional dependencies in order to pass fully.
`pyproj <https://pypi.python.org/pypi/pyproj?>`_, `colortools <https://pypi.python.org/pypi/colortools/0.1.2>`_,
and `matplotlib <https://pypi.python.org/pypi/matplotlib/2.0.2>`_ (only for Python >= 3.4) are needed to
ensure that all of the tests pass.

Make Targets
~~~~~~~~~~~~

 - **install** - install GeoPySpark python package locally
 - **wheel** - build python GeoPySpark wheel for distribution
 - **pyspark** - start pyspark shell with project jars
 - **build** - builds the backend jar and moves it to the jars sub-package
 - **clean** - remove the wheel, the backend jar file, and clean the
   geotrellis-backend directory

Developing GeoPySpark With GeoNotebook
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note: Before beginning this section, it should be noted that ``python-mapnik``, a dependency of GeoNotebook, has been found difficult to install. If problems are encountered during installation, a possible workaround is to run ``make wheel``, ``docker cp`` the wheel into the GeoPySpark Docker container, and install it from there.

`GeoNotebook <https://github.com/OpenGeoscience/geonotebook>`_ is a Jupyter notebook extension that specializes in working with geospatial data. GeoPySpark can be used with this notebook, which allows for a more interactive experience when using the library. For this section, we will be installing both tools in a virtual environment. It is recommended that you start with a new environment before following this guide.

Because there's already documentation on how to install GeoPySpark in a virtual environment, we won't go over it here. GeoNotebook also has a `section on installation <https://github.com/OpenGeoscience/geonotebook#make-a-virtualenv-install-jupyternotebook-install-geonotebook>`_, so that will not be covered here either.

Once you've set up both GeoPySpark and GeoNotebook, go to where you want to save (or have saved) your notebooks and execute this command:

.. code:: console

   jupyter notebook

This will open the Jupyter interface and allow you to work on your notebooks.

It is also possible to develop with both GeoPySpark and GeoNotebook in editable mode. To do so, you will need to re-install and re-register GeoNotebook with Jupyter:

.. code:: console

   pip uninstall geonotebook
   git clone --branch feature/geotrellis https://github.com/geotrellis/geonotebook ~/geonotebook
   pip install -r ~/geonotebook/prerequirements.txt
   pip install -r ~/geonotebook/requirements.txt
   pip install -e ~/geonotebook
   jupyter serverextension enable --py geonotebook
   jupyter nbextension enable --py geonotebook
   make notebook

The default Geonotebook (Python 3) kernel will require the following environment variables to be defined:

.. code:: console

   export PYSPARK_PYTHON="/usr/local/bin/python3"
   export SPARK_HOME="/usr/local/apache-spark/2.1.1/libexec"
   export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${SPARK_HOME}/python/lib/pyspark.zip"

Make sure to set them to values that are correct for your system. The ``make notebook`` command also makes use of the ``PYSPARK_SUBMIT_ARGS`` variable defined in the ``Makefile``.
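
A quick way to sanity-check these variables before launching the kernel; this is a hedged sketch (the ``missing_kernel_vars`` function is our own name) using only built-ins:

```python
def missing_kernel_vars(env):
    """Return the names of required kernel environment variables that are unset or empty."""
    required = ("PYSPARK_PYTHON", "SPARK_HOME", "PYTHONPATH")
    return [name for name in required if not env.get(name)]


# Example with a stand-in environment that is missing PYTHONPATH:
env = {"PYSPARK_PYTHON": "/usr/local/bin/python3", "SPARK_HOME": "/opt/spark"}
assert missing_kernel_vars(env) == ["PYTHONPATH"]
```

In a real session you would call ``missing_kernel_vars(os.environ)`` and fix anything it reports before running ``make notebook``.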

GeoNotebook/GeoTrellis integration is currently in active development and not part of GeoNotebook master. The latest development is on the ``feature/geotrellis`` branch at https://github.com/geotrellis/geonotebook.

Side Note For Developers
^^^^^^^^^^^^^^^^^^^^^^^^

An optional (but recommended!) step for developers is to place these two lines of code at the top of your notebooks.

.. code:: console

   %load_ext autoreload
   %autoreload 2

This will make it so that you don't have to leave the notebook for your changes to take effect. Rather, you just have to re-import the module and it will be updated. However, there are a few caveats when using ``autoreload``, which can be read about `here <http://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html#caveats>`_.

Using ``pip install -e`` in conjunction with ``autoreload`` should cover any changes made, though, and will make the development experience much less painful.
