All Projects → altescy → pdpcli

altescy / pdpcli

Licence: MIT license
PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline from a configuration file.

Programming Languages

python
139335 projects - #7 most used programming language
Makefile
30231 projects
Jsonnet
166 projects

Projects that are alternatives of or similar to pdpcli

AlphaVantageAPI
An Opinionated AlphaVantage API Wrapper in Python 3.9. Compatible with Pandas TA (pip install pandas_ta). Get your FREE API Key at https://www.alphavantage.co/support/
Stars: ✭ 77 (+413.33%)
Mutual labels:  csv, pandas
Csvs To Sqlite
Convert CSV files into a SQLite database
Stars: ✭ 568 (+3686.67%)
Mutual labels:  csv, pandas
Meza
A Python toolkit for processing tabular data
Stars: ✭ 374 (+2393.33%)
Mutual labels:  csv, pandas
A-Detector
⭐ An anomaly-based intrusion detection system.
Stars: ✭ 69 (+360%)
Mutual labels:  csv, pandas
30 Days Of Python
Learn Python for the next 30 (or so) Days.
Stars: ✭ 1,748 (+11553.33%)
Mutual labels:  csv, pandas
Visidata
A terminal spreadsheet multitool for discovering and arranging data
Stars: ✭ 4,606 (+30606.67%)
Mutual labels:  csv, pandas
Pytablewriter
pytablewriter is a Python library to write a table in various formats: CSV / Elasticsearch / HTML / JavaScript / JSON / LaTeX / LDJSON / LTSV / Markdown / MediaWiki / NumPy / Excel / Pandas / Python / reStructuredText / SQLite / TOML / TSV.
Stars: ✭ 422 (+2713.33%)
Mutual labels:  csv, pandas
Data Forge Ts
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Stars: ✭ 967 (+6346.67%)
Mutual labels:  csv, pandas
Pygraphistry
PyGraphistry is a Python library to quickly load, shape, embed, and explore big graphs with the GPU-accelerated Graphistry visual graph analyzer
Stars: ✭ 1,365 (+9000%)
Mutual labels:  csv, pandas
Django Rest Pandas
📊📈 Serves up Pandas dataframes via the Django REST Framework for use in client-side (i.e. d3.js) visualizations and offline analysis (e.g. Excel)
Stars: ✭ 1,030 (+6766.67%)
Mutual labels:  csv, pandas
California Coronavirus Data
The Los Angeles Times' independent tally of coronavirus cases in California.
Stars: ✭ 188 (+1153.33%)
Mutual labels:  csv, pandas
Rightmove webscraper.py
Python class to scrape data from rightmove.co.uk and return listings in a pandas DataFrame object
Stars: ✭ 125 (+733.33%)
Mutual labels:  csv, pandas
DataProfiler
What's in your data? Extract schema, statistics and entities from datasets
Stars: ✭ 843 (+5520%)
Mutual labels:  csv, pandas
csv2latex
🔧 Simple script in python to convert CSV files to LaTeX table
Stars: ✭ 54 (+260%)
Mutual labels:  csv
vulkn
Love your Data. Love the Environment. Love VULKИ.
Stars: ✭ 43 (+186.67%)
Mutual labels:  pandas
rdf-parser-csvw
CSV on the Web parser
Stars: ✭ 15 (+0%)
Mutual labels:  csv
Information-Retrieval
Information Retrieval algorithms developed in python. To follow the blog posts, click on the link:
Stars: ✭ 103 (+586.67%)
Mutual labels:  pandas
csv2keepassxml
Convert CSV files into KeePass 2 XML files.
Stars: ✭ 31 (+106.67%)
Mutual labels:  csv
tsa-tutorial
Material for the tutorial, "Time series analysis with pandas" at T-Academy
Stars: ✭ 21 (+40%)
Mutual labels:  pandas
Python-for-data-analysis
No description or website provided.
Stars: ✭ 18 (+20%)
Mutual labels:  pandas

PdpCLI

Actions Status Python version PyPI version License

Quick Links

Introduction

PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline powered by pdpipe from a configuration file. You can also extend pipeline stages and data readers / writers by using your own python scripts.

Features

  • Process pandas DataFrame from CLI without wrting Python scripts
  • Support multiple configuration file formats: YAML, JSON, Jsonnet
  • Read / write data files in the following formats: CSV, TSV, JSON, JSONL, pickled DataFrame
  • Import / export data with multiple protocols: S3 / Databse (MySQL, Postgres, SQLite, ...) / HTTP(S)
  • Extensible pipeline and data readers / writers

Installation

Installing the library is simple using pip.

$ pip install "pdpcli[all]"

Tutorial

Basic Usage

  1. Write a pipeline config file config.yml like below. The type fields under pipeline correspond to the snake-cased class names of the PdpipelineStages. Other fields such as stage and columns are the parameters of the __init__ methods of the corresponging classes. Internally, this configuration file is converted to Python objects by colt.
pipeline:
  type: pipeline
  stages:
    drop_columns:
      type: col_drop
      columns:
        - name
        - job

    encode:
      type: one_hot_encode
      columns: sex

    tokenize:
      type: tokenize_text
      columns: content

    vectorize:
      type: tfidf_vectorize_token_lists
      column: content
      max_features: 10
  1. Build a pipeline by training on train.csv. The following command generages a pickled pipeline file pipeline.pkl after training. If you specify a URL of file path, it will be automatically downloaded and cached.
$ pdp build config.yml pipeline.pkl --input-file https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/train.csv
  1. Apply the fitted pipeline to test.csv and get output of a processed file processed_test.jsonl by the following command. PdpCLI automatically detects the output file format based on the file name. In this example, the processed DataFrame will be exported as the JSON-Lines format.
$ pdp apply pipeline.pkl https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/test.csv --output-file processed_test.jsonl
  1. You can also directly run the pipeline from a config file without fitting pipeline.
$ pdp apply config.yml test.csv --output-file processed_test.jsonl
  1. It is possible to override or add parameters by adding command line arguments:
pdp apply config.yml test.csv pipeline.stages.drop_columns.column=name

Data Reader / Writer

PdpCLI automatically detects a suitable data reader / writer based on a given file name. If you need to use the other data reader / writer, add a reader or writer config to config.yml. The following config is an exmaple to use SQL data reader. SQL reader fetches records from the specified database and converts them into a pandas DataFrame.

reader:
    type: sql
    dsn: postgres://${env:POSTGRES_USER}:${env:POSTGRES_PASSWORD}@your.posgres.server/your_database

Config files are interpreted by OmegaConf, so ${env:...} is interpolated by environment variables.

Prepare your SQL file query.sql to fetch data from the database:

select * from your_table limit 1000

You can execute the pipeline with SQL data reader via:

$ POSTGRES_USER=user POSTGRES_PASSWORD=password pdp apply config.yml query.sql

Plugins

By using plugins, you can extend PdpCLI. This plugin feature enables you to use your own pipeline stages, data readers / writers and commands.

Add a new stage

  1. Write your plugin script mypdp.py like below. Stage.register("<stage-name>") registers your pipeline stages, and you can specify these stages by writing type: <stage-name> in your config file.
import pdpcli

@pdpcli.Stage.register("print")
class PrintStage(pdpcli.Stage):
    def _prec(self, df):
        return True

    def _transform(self, df, verbose):
        print(df.to_string(index=False))
        return df
  1. Update config.yml to use your plugin.
pipeline:
    type: pipeline
    stages:
        drop_columns:
        ...

        print:
            type: print

        encode:
        ...
  1. Execute command with --module mypdp and you can see the processed DataFrame after running drop_columns.
$ pdp apply config.yml test.csv --module mypdp

Add a new command

You can also add new commands not only stages.

  1. Add the following script to mypdp.py. This greet command prints out a greeting message with your name.
@pdpcli.Subcommand.register(
    name="greet",
    description="say hello",
    help="say hello",
)
class GreetCommand(pdpcli.Subcommand):
    requires_plugins = False

    def set_arguments(self):
        self.parser.add_argument("--name", default="world")

    def run(self, args):
        print(f"Hello, {args.name}!")
  1. To register this command, you need to create the .pdpcli_plugins file in which module names are listed for each line. Due to module importing order, the --module option is unavailable for command registration.
$ echo "mypdp" > .pdpcli_plugins
  1. Run the following command and get a message like below. By using the .pdpcli_plugins file, it is is not needed to add the --module option to a command line for each execution.
$ pdp greet --name altescy
Hello, altescy!
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].