All Projects → nalepae → Pandarallel

nalepae / Pandarallel

Licence: bsd-3-clause
A simple and efficient tool to parallelize Pandas operations on all available CPUs

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Pandarallel

Deepgraph
Analyze Data with Pandas-based Networks. Documentation:
Stars: ✭ 232 (-87.71%)
Mutual labels:  parallel, pandas
Tqdm
A Fast, Extensible Progress Bar for Python and CLI
Stars: ✭ 20,632 (+993.38%)
Mutual labels:  parallel, pandas
Xyzpy
Efficiently generate and analyse high dimensional data.
Stars: ✭ 45 (-97.62%)
Mutual labels:  parallel, pandas
Tabula Py
Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
Stars: ✭ 1,351 (-28.4%)
Mutual labels:  pandas
Maps Location History
Get, Concatenate and Process you location history from Google Maps TimeLine
Stars: ✭ 99 (-94.75%)
Mutual labels:  pandas
100 Pandas Puzzles
100 data puzzles for pandas, ranging from short and simple to super tricky (60% complete)
Stars: ✭ 1,382 (-26.76%)
Mutual labels:  pandas
Studybook
Study E-Book(ComputerVision DeepLearning MachineLearning Math NLP Python ReinforcementLearning)
Stars: ✭ 1,457 (-22.79%)
Mutual labels:  pandas
Sspipe
Simple Smart Pipe: python productivity-tool for rapid data manipulation
Stars: ✭ 96 (-94.91%)
Mutual labels:  pandas
Index
Metarhia educational program index 📖
Stars: ✭ 2,045 (+8.37%)
Mutual labels:  parallel
Doazureparallel
A R package that allows users to submit parallel workloads in Azure
Stars: ✭ 102 (-94.59%)
Mutual labels:  parallel
Parallel Process
🏃 Run multiple processes simultaneously
Stars: ✭ 102 (-94.59%)
Mutual labels:  parallel
Scroll
Scroll - making scrolling through buffers fun since 2016
Stars: ✭ 100 (-94.7%)
Mutual labels:  parallel
Yabox
Yet another black-box optimization library for Python
Stars: ✭ 103 (-94.54%)
Mutual labels:  parallel
Stackoverflow Pandas Canonicals
A directory listing of all my pandas canonicals on Stack Overflow to date.
Stars: ✭ 100 (-94.7%)
Mutual labels:  pandas
Twelvedata Python
Twelve Data Python Client - Financial data APIs & WebSockets
Stars: ✭ 111 (-94.12%)
Mutual labels:  pandas
Kglab
Graph-Based Data Science: an abstraction layer in Python for building knowledge graphs, integrated with popular graph libraries – atop Pandas, RDFlib, pySHACL, RAPIDS, NetworkX, iGraph, PyVis, pslpython, pyarrow, etc.
Stars: ✭ 98 (-94.81%)
Mutual labels:  pandas
Stocks
machine learning web app game where the user competes against the AI in picking stocks
Stars: ✭ 108 (-94.28%)
Mutual labels:  pandas
Root
The official repository for ROOT: analyzing, storing and visualizing big data, scientifically
Stars: ✭ 1,377 (-27.03%)
Mutual labels:  parallel
Rslint
A (WIP) Extremely fast JavaScript and TypeScript linter and Rust crate
Stars: ✭ 1,377 (-27.03%)
Mutual labels:  parallel
Sigmoidal ai
Tutoriais de Python, Data Science, Machine Learning e Deep Learning - Sigmoidal
Stars: ✭ 103 (-94.54%)
Mutual labels:  pandas

Pandaral·lel

PyPI version fury.io PyPI license PyPI download month

Without parallelization Without Pandarallel
With parallelization With Pandarallel

Installation

$ pip install pandarallel [--upgrade] [--user]

Requirements

On Windows, Pandaral·lel will works only if the Python session (python, ipython, jupyter notebook, jupyter lab, ...) is executed from Windows Subsystem for Linux (WSL).

On Linux & macOS, nothing special has to be done.

Warning

  • Parallelization has a cost (instantiating new processes, sending data via shared memory, ...), so parallelization is efficient only if the amount of calculation to parallelize is high enough. For very little amount of data, using parallelization is not always worth it.

Examples

An example of each API is available here.

Benchmark

For some examples, here is the comparative benchmark with and without using Pandaral·lel.

Computer used for this benchmark:

  • OS: Linux Ubuntu 16.04
  • Hardware: Intel Core i7 @ 3.40 GHz - 4 cores

Benchmark

For those given examples, parallel operations run approximately 4x faster than the standard operations (except for series.map which runs only 3.2x faster).

API

First, you have to import pandarallel:

from pandarallel import pandarallel

Then, you have to initialize it.

pandarallel.initialize()

This method takes 5 optional parameters:

  • shm_size_mb: Deprecated.
  • nb_workers: Number of workers used for parallelization. (int) If not set, all available CPUs will be used.
  • progress_bar: Display progress bars if set to True. (bool, False by default)
  • verbose: The verbosity level (int, 2 by default)
    • 0 - don't display any logs
    • 1 - display only warning logs
    • 2 - display all logs
  • use_memory_fs: (bool, None by default)
    • If set to None and if memory file system is available, Pandarallel will use it to transfer data between the main process and workers. If memory file system is not available, Pandarallel will default on multiprocessing data transfer (pipe).
    • If set to True, Pandarallel will use memory file system to transfer data between the main process and workers and will raise a SystemError if memory file system is not available.
    • If set to False, Pandarallel will use multiprocessing data transfer (pipe) to transfer data between the main process and workers.

Using memory file system reduces data transfer time between the main process and workers, especially for big data.

Memory file system is considered as available only if the directory /dev/shm exists and if the user has read and write rights on it.

Basically, memory file system is only available on some Linux distributions (including Ubuntu).

With df a pandas DataFrame, series a pandas Series, func a function to apply/map, args, args1, args2 some arguments, and col_name a column name:

Without parallelization With parallelization
df.apply(func) df.parallel_apply(func)
df.applymap(func) df.parallel_applymap(func)
df.groupby(args).apply(func) df.groupby(args).parallel_apply(func)
df.groupby(args1).col_name.rolling(args2).apply(func) df.groupby(args1).col_name.rolling(args2).parallel_apply(func)
df.groupby(args1).col_name.expanding(args2).apply(func) df.groupby(args1).col_name.expanding(args2).parallel_apply(func)
series.map(func) series.parallel_map(func)
series.apply(func) series.parallel_apply(func)
series.rolling(args).apply(func) series.rolling(args).parallel_apply(func)

You will find a complete example here for each row in this table.

Troubleshooting

I have 8 CPUs but parallel_apply speeds up computation only about x4. Why?

Actually Pandarallel can only speed up computation until about the number of cores your computer has. The majority of recent CPUs (like Intel Core i7) uses hyperthreading. For example, a 4-core hyperthreaded CPU will show 8 CPUs to the operating system, but will really have only 4 physical computation units.

On Ubuntu, you can get the number of cores with $ grep -m 1 'cpu cores' /proc/cpuinfo.


I use Jupyter Lab and instead of progress bars, I see these kind of things:
VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=625000), Label(value='0 / 625000')…

Run the following 3 lines, and you should be able to see the progress bars:

$ pip install ipywidgets
$ jupyter nbextension enable --py widgetsnbextension
$ jupyter labextension install @jupyter-widgets/jupyterlab-manager

(You may also have to install nodejs if asked)

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].