hi-primus / optimus

License: Apache-2.0
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark


Projects that are alternatives of or similar to optimus

bumblebee
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
Stars: ✭ 120 (-91.12%)
Mutual labels:  dask, data-preparation, data-cleaning, data-profiling, cudf, dask-cudf
foofah
Foofah: programming-by-example data transformation program synthesizer
Stars: ✭ 24 (-98.22%)
Mutual labels:  data-transformation, data-wrangling, data-preparation, data-cleaning
Optimus
🚚 Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark
Stars: ✭ 986 (-27.02%)
Mutual labels:  bigdata, pyspark, data-wrangling, data-cleaning
datatile
A library for managing, validating, summarizing, and visualizing data.
Stars: ✭ 419 (-68.99%)
Mutual labels:  dask, data-exploration, data-profiling
Udacity-Data-Analyst-Nanodegree
Repository for the projects needed to complete the Data Analyst Nanodegree.
Stars: ✭ 31 (-97.71%)
Mutual labels:  data-wrangling, data-cleaning
Sweetviz
Visualize and compare datasets, target values and associations, with one line of code.
Stars: ✭ 1,851 (+37.01%)
Mutual labels:  data-exploration, data-profiling
Data Forge Ts
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Stars: ✭ 967 (-28.42%)
Mutual labels:  data-wrangling, data-cleaning
wrangler
Wrangler Transform: A DMD system for transforming Big Data
Stars: ✭ 63 (-95.34%)
Mutual labels:  data-transformation, data-cleansing
bamboolib binder template
bamboolib - template for creating your own binder notebook
Stars: ✭ 19 (-98.59%)
Mutual labels:  data-transformation, data-exploration
allie
🤖 A machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers).
Stars: ✭ 93 (-93.12%)
Mutual labels:  data-transformation, data-cleaning
big data
A collection of tutorials on Hadoop, MapReduce, Spark, Docker
Stars: ✭ 34 (-97.48%)
Mutual labels:  bigdata, pyspark
anovos
Anovos - An Open Source Library for Scalable feature engineering Using Apache-Spark
Stars: ✭ 77 (-94.3%)
Mutual labels:  bigdata, pyspark
data processing course
Some class materials for a data processing course using PySpark
Stars: ✭ 50 (-96.3%)
Mutual labels:  bigdata, pyspark
prosto
Prosto is a data processing toolkit radically changing how data is processed by heavily relying on functions and operations with functions - an alternative to map-reduce and join-groupby
Stars: ✭ 54 (-96%)
Mutual labels:  data-wrangling, data-preparation
Pandas Profiling
Create HTML profiling reports from pandas DataFrame objects
Stars: ✭ 8,329 (+516.51%)
Mutual labels:  data-exploration, data-profiling
data-algorithms-with-spark
O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian
Stars: ✭ 34 (-97.48%)
Mutual labels:  data-transformation, pyspark
Spark-and-Kafka IoT-Data-Processing-and-Analytics
Final Project for IoT: Big Data Processing and Analytics class. Analyzing U.S. nationwide temperature from IoT sensors in real-time
Stars: ✭ 42 (-96.89%)
Mutual labels:  bigdata, pyspark
Spark Py Notebooks
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks
Stars: ✭ 1,338 (-0.96%)
Mutual labels:  bigdata, pyspark
Tdengine
An open-source big data platform designed and optimized for the Internet of Things (IoT).
Stars: ✭ 17,434 (+1190.45%)
Mutual labels:  bigdata
spark-dgraph-connector
A connector for Apache Spark and PySpark to Dgraph databases.
Stars: ✭ 36 (-97.34%)
Mutual labels:  pyspark

Optimus

Overview

Optimus is an opinionated Python library to easily load, process, plot, and create ML models that run over pandas, Dask, cuDF, Dask-cuDF, Vaex, or Spark.

Some amazing things Optimus can do for you:

  • Process data using a simple API that is easy for newcomers to pick up.
  • More than 100 functions to handle strings and process dates, URLs, and emails.
  • Easily plot data of any size.
  • Out-of-the-box functions to explore and fix data quality.
  • Use the same code to process your data on your laptop or on a remote GPU cluster.

See Documentation

Try Optimus

To launch a live notebook server to test Optimus using Binder or Colab, click one of the following badges:

Binder Colab

Installation (pip):

In your terminal just type:

pip install pyoptimus

By default, Optimus installs pandas as its engine. To install other engines, use the following commands:

Engine Command
Dask pip install pyoptimus[dask]
cuDF pip install pyoptimus[cudf]
Dask-cuDF pip install pyoptimus[dask-cudf]
Vaex pip install pyoptimus[vaex]
Spark pip install pyoptimus[spark]

To install from the repo:

pip install git+https://github.com/hi-primus/[email protected]

To install other engines:

pip install git+https://github.com/hi-primus/[email protected]#egg=pyoptimus[dask]

Requirements

  • Python 3.7 or 3.8

Examples

You can go to 10 minutes to Optimus, where you can find the basics to start working in a notebook.

You can also go to the Examples section to find specific notebooks about data cleaning, data munging, profiling, data enrichment, and how to create ML and DL models.

Here's a handy Cheat Sheet with the most common Optimus operations.

Start Optimus

Start Optimus using "pandas", "dask", "cudf", "dask_cudf", "vaex" or "spark".

from optimus import Optimus
op = Optimus("pandas")

Loading data

Optimus can load data in CSV, JSON, Parquet, Avro, and Excel formats from a local file or from a URL.

# csv
df = op.load.csv("../examples/data/foo.csv")

# json
df = op.load.json("../examples/data/foo.json")

# using a url
df = op.load.json("https://raw.githubusercontent.com/hi-primus/optimus/develop-22.10/examples/data/foo.json")

# parquet
df = op.load.parquet("../examples/data/foo.parquet")

# ...or anything else
df = op.load.file("../examples/data/titanic3.xls")
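With the pandas engine, these loaders behave much like their pandas counterparts. As a rough illustration (an assumption about the file layout, using an inline string rather than the example files):

```python
import io

import pandas as pd

# Hypothetical contents standing in for examples/data/foo.csv
csv_text = "names,rank\nOptimus,10\nBumblebee,7\n"

# With the pandas engine, op.load.csv is roughly equivalent to pd.read_csv
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```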

Also, you can load data from Oracle, Redshift, MySQL and Postgres databases.

Saving Data

# csv
df.save.csv("data/foo.csv")

# json
df.save.json("data/foo.json")

# parquet
df.save.parquet("data/foo.parquet")
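With the pandas engine, the savers mirror the pandas writers. Here is a quick round-trip sketch in plain pandas using a temporary file (an illustration of the idea, not the Optimus API itself):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"names": ["Optimus", "Bumblebee"], "rank": [10, 7]})

# df.save.csv(...) with the pandas engine is roughly df.to_csv(...)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.csv")
    df.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

print(round_trip.equals(df))  # True
```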

You can also save data to Oracle, Redshift, MySQL, and Postgres.

Create dataframes

You can also create a dataframe from scratch:

df = op.create.dataframe({
    'A': ['a', 'b', 'c', 'd'],
    'B': [1, 3, 5, 7],
    'C': [2, 4, 6, None],
    'D': ['1980/04/10', '1980/04/10', '1980/04/10', '1980/04/10']
})
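Under the pandas engine, op.create.dataframe is essentially a thin wrapper over a pandas DataFrame. A sketch of the plain-pandas equivalent of the call above (assuming the default engine):

```python
import pandas as pd

# Plain-pandas equivalent of the op.create.dataframe call above
df = pd.DataFrame({
    'A': ['a', 'b', 'c', 'd'],
    'B': [1, 3, 5, 7],
    'C': [2, 4, 6, None],
    'D': ['1980/04/10', '1980/04/10', '1980/04/10', '1980/04/10'],
})

# Note: the missing value promotes column 'C' to float64
print(df.dtypes['C'])  # float64
```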

Using display, you get a nicely formatted view of your data with extra information such as the column number, column data type, and marked white spaces.

display(df)

Cleaning and Processing

Optimus was created to make data cleaning a breeze. The API was designed to be easy for newcomers and familiar to people coming from pandas. Optimus expands the standard DataFrame functionality by adding .rows and .cols accessors.

For example, you can load data from a URL, then transform it and apply some predefined cleaning functions:

new_df = df\
    .rows.sort("rank", "desc")\
    .cols.lower(["names", "function"])\
    .cols.date_format("date arrival", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("date arrival", "dd-MM-YYYY", output_cols="from arrival")\
    .cols.normalize_chars("names")\
    .cols.remove_special_chars("names")\
    .rows.drop(df["rank"]>8)\
    .cols.rename("*", str.lower)\
    .cols.trim("*")\
    .cols.unnest("japanese name", output_cols="other names")\
    .cols.unnest("last position seen", separator=",", output_cols="pos")\
    .cols.drop(["last position seen", "japanese name", "date arrival", "cybertronian", "nulltype"])
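To get a feel for what such a chain does without Optimus installed, here is a rough plain-pandas analogue of a few of the steps above (sorting, lower-casing, removing special characters, trimming, and dropping rows) on made-up data. The column names are hypothetical and this is not the Optimus API:

```python
import pandas as pd

df = pd.DataFrame({
    "names": ["Optimus ", "  BUMBLEBEE", "ironhide&"],
    "rank": [10, 7, 6],
})

new_df = (
    df.sort_values("rank", ascending=False)                # like .rows.sort("rank", "desc")
      .assign(names=lambda d: d["names"]
              .str.lower()                                 # like .cols.lower
              .str.replace(r"[^a-z0-9 ]", "", regex=True)  # like .cols.remove_special_chars
              .str.strip())                                # like .cols.trim
      .loc[lambda d: d["rank"] <= 8]                       # like .rows.drop(df["rank"] > 8)
)
print(list(new_df["names"]))  # ['bumblebee', 'ironhide']
```

The chained accessor style in Optimus reads top to bottom as a pipeline of column and row operations, which is the main ergonomic difference from stringing together raw pandas calls.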

Need help? 🛠️

Feedback

Feedback is what drives Optimus's future, so please take a couple of minutes to help shape the Optimus roadmap: http://bit.ly/optimus_survey

If you have a suggestion or feature request, please open an issue at https://github.com/hi-primus/optimus/issues

Troubleshooting

If you have issues, see our Troubleshooting Guide.

Contributing to Optimus 💡

Contributions go far beyond pull requests and commits. We are very happy to receive any kind of contribution,
including:

  • Documentation updates, enhancements, designs, or bug fixes.
  • Spelling or grammar fixes.
  • README.md corrections or redesigns.
  • Adding unit or functional tests.
  • Triaging GitHub issues -- especially determining whether an issue still persists or is reproducible.
  • Blogging about, speaking about, or creating tutorials about Optimus and its many features.
  • Helping others in our official chats.

Backers and Sponsors

Become a backer or a sponsor and get your image on our README on GitHub, with a link to your site.

OpenCollective
