
JacksonWuxs / Dapy

License: GPL-3.0
Easy-to-use data analysis / manipulation framework for humans

Programming Languages

python

Projects that are alternatives of or similar to Dapy

Data Describe
data⎰describe: Pythonic EDA Accelerator for Data Science
Stars: ✭ 269 (-48.57%)
Mutual labels:  analysis, data-science, pypi
Dream3d
Data Analysis program and framework for materials science data analytics, based on the SIMPL managing framework.
Stars: ✭ 73 (-86.04%)
Mutual labels:  analysis, data-science, data-analysis
Awesome R
A curated list of awesome R packages, frameworks and software.
Stars: ✭ 4,858 (+828.87%)
Mutual labels:  data-science, data-analysis
Articles
A repository for the source code, notebooks, data, files, and other assets used in the data science and machine learning articles on LearnDataSci
Stars: ✭ 350 (-33.08%)
Mutual labels:  data-science, data-analysis
Prettypandas
A Pandas Styler class for making beautiful tables
Stars: ✭ 376 (-28.11%)
Mutual labels:  data-science, data-analysis
Finviz
Unofficial API for finviz.com
Stars: ✭ 493 (-5.74%)
Mutual labels:  analysis, pypi
Quantitative Notebooks
Educational notebooks on quantitative finance, algorithmic trading, financial modelling and investment strategy
Stars: ✭ 356 (-31.93%)
Mutual labels:  data-science, data-analysis
Data Science
Collection of useful data science topics along with code and articles
Stars: ✭ 315 (-39.77%)
Mutual labels:  data-science, data-analysis
Ai Learn
An AI learning roadmap with nearly 200 hands-on cases and projects, free companion materials, and a zero-to-employment path. Topics include Python, mathematics, machine learning, data analysis, deep learning, computer vision, natural language processing, PyTorch, TensorFlow, Caffe, Keras, algorithms, NumPy, pandas, Matplotlib, seaborn, NLP, CV, and other popular areas.
Stars: ✭ 4,387 (+738.81%)
Mutual labels:  data-science, data-analysis
Jupyter pivottablejs
Drag’n’drop Pivot Tables and Charts for Jupyter/IPython Notebook, care of PivotTable.js
Stars: ✭ 428 (-18.16%)
Mutual labels:  data-science, data-analysis
The Elements Of Statistical Learning Python Notebooks
A series of Python Jupyter notebooks that help you better understand "The Elements of Statistical Learning" book
Stars: ✭ 405 (-22.56%)
Mutual labels:  data-science, data-analysis
Knowledge Repo
A next-generation curated knowledge sharing platform for data scientists and other technical professions.
Stars: ✭ 4,956 (+847.61%)
Mutual labels:  data-science, data-analysis
Scikit Mobility
scikit-mobility: mobility analysis in Python
Stars: ✭ 339 (-35.18%)
Mutual labels:  data-science, data-analysis
Kneed
Knee point detection in Python 📈
Stars: ✭ 328 (-37.28%)
Mutual labels:  data-science, data-analysis
Pandas Summary
An extension to pandas dataframes describe function.
Stars: ✭ 361 (-30.98%)
Mutual labels:  data-science, data-analysis
Akshare
AKShare is an elegant and simple financial data interface library for Python, built for human beings! (An open-source financial data interface library.)
Stars: ✭ 4,334 (+728.68%)
Mutual labels:  data-science, data-analysis
Dataexplorer
Automate Data Exploration and Treatment
Stars: ✭ 362 (-30.78%)
Mutual labels:  data-science, data-analysis
Pyod
A Python Toolbox for Scalable Outlier Detection (Anomaly Detection)
Stars: ✭ 5,083 (+871.89%)
Mutual labels:  data-science, data-analysis
Datascience course
A Data Science course in Portuguese
Stars: ✭ 294 (-43.79%)
Mutual labels:  data-science, data-analysis
Pydataroad
Open-source material for the WeChat official account (ID: PyDataLab)
Stars: ✭ 302 (-42.26%)
Mutual labels:  data-science, data-analysis
An open-source framework that fluently implements your ideas for data mining.

DaPy - Enjoy the Tour in Data Mining

Chinese version

Overview

DaPy is a data analysis library designed with ease of use in mind; it lets you implement your ideas smoothly by providing well-designed data structures and a rich set of professional ML models. There are already many well-known data manipulation modules, such as Pandas, but none of them:

  • supports writing code in a chain-programming style;
  • provides thread-safe data containers;
  • offers feature-engineering methods through simple APIs;
  • handles data as easily as Excel, without forcing you to think about the underlying data structures;
  • logs every step to the console, the way MySQL does.

Thus, DaPy is better suited to data analysts, statisticians, and people who work with big data but have limited programming experience than to engineers. Its data structures offer about 70 APIs for data mining, including 40+ data-operation functions, 10+ feature-engineering functions, and 15+ data-exploration functions.

Example

This example briefly demonstrates DaPy's chain programming, operation log, and simple feature-engineering methods. The goal is to train a classifier for the Iris classification task. More details can be found here.
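
As a taste of what such a workflow looks like, here is a minimal sketch built only from calls listed in the Usage section below. The file and column names are placeholders, and it assumes the sheet methods return the sheet itself so that calls can be chained, as the chain-programming design above suggests.

import DaPy

# Load a local copy of the Iris data and prepare it in one chain of calls
# ('iris.csv' and the column name are placeholders; chaining assumes each
#  method returns the sheet)
sheet = (DaPy.read('iris.csv')
              .fillna(method='linear')              # interpolate missing values
              .normalized(col='petal_length'))      # standardize a continuous column

# Every operation above is also logged to the console, MySQL-style
sheet.show(lines=5)   # preview the first and last five records
sheet.info            # per-column summary statistics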

Features of DaPy

We already have plenty of great libraries for data science, so why do we need DaPy?

The answer is that DaPy is designed for data analysts, not for coders. Users only need to focus on how they want to handle their data, and can pay less attention to coding tricks. For example, in contrast to Pandas, DaPy lets you manipulate data row by row, much as you would in SQL (a short sketch follows the list below). Here are just a few of the things that make DaPy simple:

  • A variety of ways to visualize data in the terminal
  • 2D data sheet structures that follow Python syntax habits
  • SQL-like APIs for processing data
  • Thread-safe data containers
  • A variety of functions for preprocessing and feature engineering
  • Flexible I/O tools for loading and saving data (e.g. websites, Excel, SQLite3, SPSS, text)
  • Built-in basic models (e.g. Decision Tree, Multilayer Perceptron, Linear Regression, ...)
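
A short sketch of these points, using only calls documented in the Usage section below; the file and column names are hypothetical.

import DaPy

sheet = DaPy.read('customers.csv')   # hypothetical local file

# Visualize data directly in the terminal
sheet.show(lines=5)                  # first and last five records
sheet.info                           # per-column summary statistics

# SQL-like, row-oriented access
head = sheet[:10]                    # records by index, similar to LIMIT
cols = sheet['age', 'income']        # columns by name, similar to SELECT
groups = sheet.groupby('city')       # compare groups, similar to GROUP BY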

DaPy is also efficient enough for real-world workloads. The following figure shows a benchmark suggesting that DaPy's performance is comparable to that of some existing C-based libraries. Details of the test can be found here.

Performance Test

Install

The latest version, 1.11.1, has been released on PyPI.

pip install DaPy

Some DaPy functions depend on additional packages:

  • xlrd: load data from an .xls file【Necessary】
  • xlwt: export data to an .xls file【Necessary】
  • repoze.lru: speed up loading data from a .csv file【Necessary】
  • savReaderWriter: load data from a .sav file【Optional】
  • bs4.BeautifulSoup: automatically download data from a website【Optional】
  • numpy: dramatically increase the efficiency of ML models【Recommended】
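
If you need these features, the corresponding packages can be installed from PyPI as well, for example:

pip install xlrd xlwt repoze.lru savReaderWriter beautifulsoup4 numpy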

Usage

  • Load & Explore Data
    • Load data from a local CSV, SAV, SQLite3, or XLS file, a MySQL server, or a MySQL dump file: sheet = DaPy.read(file_addr)
    • Display the first five and the last five records: sheet.show(lines=5)
    • Summarize the statistical information of each column: sheet.info
    • Count the distribution of a categorical variable: sheet.count_values('gender')
    • Examine the differences between the labels of a categorical variable: sheet.groupby('city')
    • Calculate the correlation between continuous variables: sheet.corr(['age', 'income'])
  • Preprocessing & Clean Up Data
    • Remove duplicate records: sheet.drop_duplicates(col, keep='first')
    • Use linear interpolation to fill in NaN values: sheet.fillna(method='linear')
    • Remove records in which more than 50% of the variables are NaN: sheet.dropna(axis=0, how=0.5)
    • Remove some meaningless columns (e.g. ID): sheet.drop('ID', axis=1)
    • Sort records by some columns: sheet = sheet.sort('Age', 'DESC')
    • Merge external features from another table: sheet.merge(sheet2, left_key='ID', other_key='ID', keep_key='self', keep_same=False)
    • Merge external records from another table: sheet.join(sheet2)
    • Append records one by one: sheet.append_row(new_row)
    • Append new variables one by one: sheet.append_col(new_col)
    • Get parts of records by index: sheet[:10, 20: 30, 50: 100]
    • Get parts of columns by column name: sheet['age', 'income', 'name']
  • Feature Engineering
    • Transform a datetime into categorical variables: sheet.get_date_label('birth')
    • Transform numerical variables into categorical variables: sheet.get_categories(cols='age', cutpoints=[18, 30, 50], group_name=['Juveniles', 'Adults', 'Wrinkly', 'Old'])
    • Transform categorical variables into dummy variables: sheet.get_dummies(['city', 'education'])
    • Create higher-order interaction terms between selected variables: sheet.get_interactions(n_power=3, col=['income', 'age', 'gender', 'education'])
    • Introduce the rank of each record: sheet.get_ranks(cols='income', duplicate='mean')
    • Standardize continuous variables: sheet.normalized(col='age')
    • Apply a special transformation (e.g. log) to certain variables: sheet.normalized('log', col='salary')
    • Create new variables from business-logic formulas: sheet.apply(func=tax_rate, col=['salary', 'income'])
    • Apply differencing to make a time series stationary: DaPy.diff(sheet.income)
  • Developing Models
    • Choose a model and initialize it: m = MLP(), m = LinearRegression(), m = DecisionTree() or m = DiscriminantAnalysis()
    • Train the model parameters: m.fit(X_train, Y_train)
  • Model Evaluation
    • Evaluate the model with parameter tests: m.report.show()
    • Evaluate the model with visualizations: m.plot_error() or DecisionTree.export_graphviz()
    • Evaluate the model on a test set: DaPy.methods.Performance(m, X_test, Y_test, mode)
  • Saving Result
    • Save the model: m.save(addr)
    • Save the final dataset: sheet.save(addr)
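
To show how these pieces fit together, the sketch below strings the calls above into one hypothetical end-to-end session. The file and column names are placeholders, the model import path is an assumption (only the class names are documented above), and the train/test arrays are assumed to have been prepared separately.

import DaPy
from DaPy.methods import LinearRegression   # import path is an assumption

# ---- Load & explore ----
sheet = DaPy.read('records.csv')             # hypothetical local file
sheet.show(lines=5)                          # first and last five records
sheet.info                                   # per-column summary statistics

# ---- Preprocess & clean up ----
sheet.drop_duplicates('ID', keep='first')
sheet.fillna(method='linear')                # linear interpolation for NaN values
sheet.dropna(axis=0, how=0.5)                # drop records that are more than 50% NaN
sheet.drop('ID', axis=1)                     # drop a meaningless column
sheet = sheet.sort('Age', 'DESC')

# ---- Feature engineering ----
sheet.get_dummies(['city', 'education'])
sheet.get_categories(cols='age', cutpoints=[18, 30, 50],
                     group_name=['Juveniles', 'Adults', 'Wrinkly', 'Old'])
sheet.normalized(col='age')

# ---- Modeling & evaluation ----
# X_train, Y_train, X_test, Y_test are assumed to have been split from the sheet.
m = LinearRegression()
m.fit(X_train, Y_train)
m.report.show()                              # parameter tests
# A held-out test set can be scored with DaPy.methods.Performance(m, X_test, Y_test, mode).

# ---- Save results ----
m.save('model.pkl')                          # hypothetical output paths
sheet.save('cleaned.csv')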

Contributors

Related

The following programs are also great data analysis / manipulation frameworks in Python:

  • Agate: a data analysis library optimized for humans
  • NumPy: the fundamental package for scientific computing with Python
  • Pandas: the Python Data Analysis Library
  • Scikit-Learn: Machine Learning in Python

Further-Info

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].