columbia-applied-data-science / Rosetta

Licence: other
Tools, wrappers, etc. for data science with a concentration on text processing

Projects that are alternatives of or similar to Rosetta

Datascience
Book: "Handling Data with Python" - source code and data released
Stars: ✭ 199 (-0.5%)
Mutual labels:  jupyter-notebook
Medium articles
Scripts/Notebooks used for my articles published on Medium
Stars: ✭ 199 (-0.5%)
Mutual labels:  jupyter-notebook
Image To 3d Bbox
Build a CNN network to predict 3D bounding box of car from 2D image.
Stars: ✭ 200 (+0%)
Mutual labels:  jupyter-notebook
Food2vec
🍔
Stars: ✭ 199 (-0.5%)
Mutual labels:  jupyter-notebook
Binpy
An electronic simulation library written in pure Python
Stars: ✭ 199 (-0.5%)
Mutual labels:  jupyter-notebook
Pyspark And Mllib
Getting started with PySpark and MLlib
Stars: ✭ 199 (-0.5%)
Mutual labels:  jupyter-notebook
Radio
RadIO is a library for data science research of computed tomography imaging
Stars: ✭ 198 (-1%)
Mutual labels:  jupyter-notebook
Spark Practice
Apache Spark (PySpark) Practice on Real Data
Stars: ✭ 200 (+0%)
Mutual labels:  jupyter-notebook
Gpt 2 Colab
retrain gpt-2 in colab
Stars: ✭ 200 (+0%)
Mutual labels:  jupyter-notebook
W Net
w-net: a convolutional neural network architecture for the self-supervised learning of depthmap from pairs of stereo images.
Stars: ✭ 200 (+0%)
Mutual labels:  jupyter-notebook
Neuralnetworks.thought Experiments
Observations and notes to understand the workings of neural network models and other thought experiments using Tensorflow
Stars: ✭ 199 (-0.5%)
Mutual labels:  jupyter-notebook
Basset
Convolutional neural network analysis for predicting DNA sequence activity.
Stars: ✭ 199 (-0.5%)
Mutual labels:  jupyter-notebook
Traffic Sign Detection
Traffic Sign Detection. Code for the paper entitled "Evaluation of deep neural networks for traffic sign detection systems".
Stars: ✭ 200 (+0%)
Mutual labels:  jupyter-notebook
Pysonar
Decentralized Machine Learning Client
Stars: ✭ 199 (-0.5%)
Mutual labels:  jupyter-notebook
Ropsten
Ropsten public testnet PoW chain
Stars: ✭ 199 (-0.5%)
Mutual labels:  jupyter-notebook
Mgcnn
Multi-Graph Convolutional Neural Networks
Stars: ✭ 199 (-0.5%)
Mutual labels:  jupyter-notebook
Machine Learning With Python Cookbook Notes
(Part of) Chris Albon's Machine Learning with Python Cookbook in .ipynb form
Stars: ✭ 197 (-1.5%)
Mutual labels:  jupyter-notebook
Allensdk
code for reading and processing Allen Institute for Brain Science data
Stars: ✭ 200 (+0%)
Mutual labels:  jupyter-notebook
Sdc Lane And Vehicle Detection Tracking
OpenCV in Python for lane line and vehicle detection/tracking in autonomous cars
Stars: ✭ 200 (+0%)
Mutual labels:  jupyter-notebook
Vdom
🎄 Virtual DOM for Python
Stars: ✭ 200 (+0%)
Mutual labels:  jupyter-notebook

Rosetta

Tools for data science with a focus on text processing.

  • Focuses on "medium data", i.e. data too big to fit into memory but too small to necessitate the use of a cluster.
  • Integrates with existing scientific Python stack as well as select outside tools.
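As a rough illustration of the "medium data" idea (this is not code from Rosetta; `streaming_mean` and the sample column names are hypothetical), a file can be processed one row at a time so that only a couple of running totals stay in memory:

```python
import csv
import io


def streaming_mean(lines, column):
    """Return the mean of `column`, reading one CSV row at a time.

    `lines` is any iterable of text lines (an open file, sys.stdin, ...),
    so only the current row and two running totals are held in memory.
    """
    total = 0.0
    count = 0
    for row in csv.DictReader(lines):
        total += float(row[column])
        count += 1
    if count == 0:
        raise ValueError("no data rows found")
    return total / count


# Small in-memory stand-in for a file that would not fit in RAM.
sample = io.StringIO("id,price\n1,10.0\n2,20.0\n3,60.0\n")
print(streaming_mean(sample, "price"))  # 30.0
```

The same function accepts an open file handle in place of the `io.StringIO` object, since both yield lines lazily.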

Examples

  • See the examples/ directory.
  • The docs contain plots of example output.

Packages

cmdutils

  • Unix-like command line utilities. Filters (read from stdin/write to stdout) for files.
  • Focus on stream processing and csv files.
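The filter pattern described above can be sketched with the standard library alone. This is an assumption-laden example, not one of the utilities shipped in cmdutils: the function name `cut_columns` and the script name in the usage comment are hypothetical.

```python
import csv
import sys


def cut_columns(infile, outfile, keep):
    """Copy only the columns named in `keep` from infile to outfile.

    Rows are read and written one at a time, so the filter can sit in
    the middle of a shell pipeline without loading the whole file.
    """
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(
        outfile, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)


if __name__ == "__main__" and sys.argv[1:]:
    # Hypothetical usage: python cut_columns.py name price < in.csv > out.csv
    cut_columns(sys.stdin, sys.stdout, keep=sys.argv[1:])
```

Because it reads stdin and writes stdout, such a filter composes with other Unix tools through pipes.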

parallel

  • Wrappers that make Python multiprocessing easier to use
  • Memory-friendly multiprocessing
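One way to picture a memory-friendly wrapper is a generator around `multiprocessing.Pool.imap`. The sketch below is an assumption about the general approach, not Rosetta's actual API; `imap_stream` and its parameters are hypothetical names.

```python
import multiprocessing


def imap_stream(func, iterable, n_jobs=2, chunksize=1):
    """Lazily yield func(x) for each x in iterable, in input order.

    With n_jobs == 1 no worker processes are started, which keeps
    tracebacks simple while debugging.
    """
    if n_jobs == 1:
        for item in iterable:
            yield func(item)
        return
    with multiprocessing.Pool(processes=n_jobs) as pool:
        # imap returns an iterator over results instead of building the
        # full result list the way Pool.map does.
        yield from pool.imap(func, iterable, chunksize=chunksize)
```

A call such as `list(imap_stream(math.sqrt, range(10), n_jobs=2))` would run in two worker processes. On platforms that start workers by spawning a fresh interpreter, the call should sit under an `if __name__ == "__main__":` guard, and `func` must be picklable (a top-level function rather than a lambda).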

text

  • Stream text from disk to formats used in common ML processes
  • Write processed text to sparse formats
  • Helpers for ML tools (e.g. Vowpal Wabbit, Gensim)
  • Other general utilities
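To make "stream text to a sparse format" concrete, here is a minimal sketch that turns documents into bag-of-words lines in Vowpal Wabbit's `| feature:value` input style. It does not use Rosetta's own classes; `tokenize`, `to_vw_line`, `stream_vw_lines`, and the tokenization rule are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: tokens are runs of lowercase letters and digits, which also
# keeps the characters ":" and "|" (special to Vowpal Wabbit) out of tokens.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def to_vw_line(tokens):
    """Format token counts as one unlabeled Vowpal Wabbit example."""
    counts = Counter(tokens)
    features = " ".join(f"{tok}:{n}" for tok, n in sorted(counts.items()))
    return f"| {features}".rstrip()


def stream_vw_lines(paths):
    """Yield one formatted line per file, holding one document at a time."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            yield to_vw_line(tokenize(handle.read()))


print(to_vw_line(tokenize("The cat saw the other cat.")))
# | cat:2 other:1 saw:1 the:2
```

Only the non-zero counts are written, which is what makes the format sparse: a document mentions a tiny fraction of the vocabulary.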

workflow

  • High-level wrappers that have helped with our workflow and provide additional examples of code use

modeling

  • General ML modeling utilities

Install

Check out the master branch from the rosetta repo. Then, so long as you have pip, run:

cd rosetta
make
make test

If you update the source, reinstall and rerun the tests with

make reinstall
make test

The above make targets use pip, so you can of course do pip uninstall at any time.

Getting the source (above) is the preferred method since the code changes often, but if you don't use Git you can download a tagged release (tarball) here. Then

pip install rosetta-X.X.X.tar.gz

Development

Code

You can get the latest sources with

git clone https://github.com/columbia-applied-data-science/rosetta

Contributing

Feel free to contribute a bug report or a feature request by opening an issue.

The preferred method to contribute is to fork and send a pull request. Before doing this, read CONTRIBUTING.md.

Dependencies

  • Major dependencies on pandas and NumPy.
  • Minor dependencies on Gensim and statsmodels.
  • Some examples need scikit-learn.
  • Minor dependency on docx.
  • Minor dependencies on the Unix utilities pdftotext and catdoc.
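A quick way to see which of these are present on a machine is sketched below. The import names are assumptions (scikit-learn is assumed to import as `sklearn`, and the docx dependency as `docx`), and `check_dependencies` is a hypothetical helper, not part of Rosetta.

```python
import importlib.util
import shutil

# Assumed import names for the Python packages listed above.
PYTHON_MODULES = ["pandas", "numpy", "gensim", "statsmodels", "sklearn", "docx"]
UNIX_TOOLS = ["pdftotext", "catdoc"]


def check_dependencies():
    """Map each dependency name to True if it can be found, else False."""
    status = {}
    for name in PYTHON_MODULES:
        # find_spec locates a module without importing it.
        status[name] = importlib.util.find_spec(name) is not None
    for name in UNIX_TOOLS:
        # which searches PATH for an executable with this name.
        status[name] = shutil.which(name) is not None
    return status


for name, found in check_dependencies().items():
    print(f"{name:12s} {'ok' if found else 'missing'}")
```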

Testing

From the base repo directory, rosetta/, you can run all tests with

make test

Documentation

Documentation for releases is hosted at PyPI. This does NOT auto-update.

History

Rosetta refers to the Rosetta Stone, the ancient Egyptian tablet discovered just over 200 years ago. The tablet contained fragmented text in three different scripts (hieroglyphic, Demotic, and ancient Greek), and the uncovering of its meaning is considered an essential key to our understanding of Ancient Egyptian civilization. We would like this project to give individuals the tools they need to process, and unearth insight from, the ever-growing volumes of textual data of today.
