
src-d / Ml

Licence: other
sourced.ml is a library and a set of command-line tools for building and applying machine learning models on top of Universal Abstract Syntax Trees

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives to or similar to Ml

Text rnn attention
Chinese text classification using an RNN with attention and Word2vec word embeddings
Stars: ✭ 117 (-13.97%)
Mutual labels:  word2vec
Fasttext.js
FastText for Node.js
Stars: ✭ 127 (-6.62%)
Mutual labels:  word2vec
Pytorch word2vec
Use pytorch to implement word2vec
Stars: ✭ 133 (-2.21%)
Mutual labels:  word2vec
Asteval
Minimalistic evaluator of Python expressions using the ast module
Stars: ✭ 116 (-14.71%)
Mutual labels:  ast
Phplrt
PHP Language Recognition Tool
Stars: ✭ 127 (-6.62%)
Mutual labels:  ast
Ast I18n
Easily migrate your existing React codebase to use i18n
Stars: ✭ 129 (-5.15%)
Mutual labels:  ast
Pytorch neg loss
NEG loss implemented in pytorch
Stars: ✭ 116 (-14.71%)
Mutual labels:  word2vec
Expr
Expression language for Go
Stars: ✭ 2,123 (+1461.03%)
Mutual labels:  ast
Ml Projects
ML based projects such as Spam Classification, Time Series Analysis, Text Classification using Random Forest, Deep Learning, Bayesian, Xgboost in Python
Stars: ✭ 127 (-6.62%)
Mutual labels:  word2vec
Babylon
PSA: moved into babel/babel as @babel/parser
Stars: ✭ 1,692 (+1144.12%)
Mutual labels:  ast
Scattertext
Beautiful visualizations of how language differs among document types.
Stars: ✭ 1,722 (+1166.18%)
Mutual labels:  word2vec
Math Engine
Mathematical expression parsing and calculation engine library.
Stars: ✭ 123 (-9.56%)
Mutual labels:  ast
Ast Pretty Print
A pretty printer for AST-like structures
Stars: ✭ 129 (-5.15%)
Mutual labels:  ast
Dna2vec
dna2vec: Consistent vector representations of variable-length k-mers
Stars: ✭ 117 (-13.97%)
Mutual labels:  word2vec
Rewrite
Semantic code search and transformation
Stars: ✭ 134 (-1.47%)
Mutual labels:  ast
Nlcst
Natural Language Concrete Syntax Tree format
Stars: ✭ 116 (-14.71%)
Mutual labels:  ast
Learn Javascript
"A Roaming Guide to Front-End Basics": an in-depth, systematic study of JavaScript fundamentals. Star it if you like it.
Stars: ✭ 128 (-5.88%)
Mutual labels:  ast
Turkish Word2vec
Pre-trained Word2Vec Model for Turkish
Stars: ✭ 136 (+0%)
Mutual labels:  word2vec
Role2vec
A scalable Gensim implementation of "Learning Role-based Graph Embeddings" (IJCAI 2018).
Stars: ✭ 134 (-1.47%)
Mutual labels:  word2vec
Scattertext Pydata
Notebooks for the Seattle PyData 2017 talk on Scattertext
Stars: ✭ 132 (-2.94%)
Mutual labels:  word2vec

MLonCode research playground

This project is no longer maintained; it has evolved into several other projects.

The original README follows below.

This project is the foundation for MLonCode research and development. It abstracts feature extraction and model training, allowing you to focus on higher-level tasks.

Currently, the following models are implemented:

  • BOW - a weighted bag of x, where x can be any of several extracted feature types (illustrated right after this list).
  • id2vec, source code identifier embeddings.
  • docfreq, feature document frequencies (part of TF-IDF).
  • topic modeling over source code identifiers.
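
As a rough illustration (not the library's actual storage format; the feature names below are invented), a weighted bag-of-features for a single document is conceptually a mapping from extracted features to weights:

  # Hypothetical weighted bag-of-features for one file; the real BOW model
  # stores these as sparse matrices and uses TF-IDF weights (see below).
  bow = {
      "id.parse": 3.2,      # stemmed identifier sub-token
      "id.request": 1.7,
      "lit.timeout": 0.4,   # literal-derived feature
  }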

It is written in Python 3 and has been tested on Linux and macOS. source{d} ml is tightly coupled with source{d} engine and delegates all feature extraction parallelization to it.

Here is the list of proof-of-concept projects which are built using sourced.ml:

  • vecino - finding similar repositories.
  • tmsc - listing topics of a repository.
  • snippet-ranger - topic modeling of source code snippets.
  • apollo - source code deduplication at scale.

Installation

Whether you include Spark in your installation or re-use an existing one, sourced-ml needs some native libraries; e.g., on Ubuntu you must first run apt install libxml2-dev libsnappy-dev. Tensorflow is also a requirement, and we support both the CPU and GPU versions. To select one, change the package name in the next section to either sourced-ml[tf] or sourced-ml[tf-gpu]; if you don't, neither Tensorflow version will be installed.
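
For example, to pick a Tensorflow flavour explicitly (the extras names come from the paragraph above):

pip3 install "sourced-ml[tf]"      # CPU Tensorflow
pip3 install "sourced-ml[tf-gpu]"  # GPU Tensorflow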

With Apache Spark included

pip3 install sourced-ml

Use existing Apache Spark

If you already have Apache Spark installed and configured in your environment at $SPARK_HOME, you can re-use it and avoid downloading ~200 MB by using pip "editable installs":

pip3 install -e "$SPARK_HOME/python"
pip3 install sourced-ml

In both cases, you will need to have some native libraries installed; e.g., on Ubuntu: apt install libxml2-dev libsnappy-dev. Some parts require Tensorflow.

Usage

This project exposes two interfaces: an API and a command line. The command-line entry point is

srcml --help
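
The algorithms described below are exposed as subcommands; assuming the usual argparse behaviour, each one prints its own help the same way, e.g.:

srcml repos2coocc --help
srcml repos2bow --help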

Docker image

docker run -it --rm srcd/ml --help

If this first command fails with

Cannot connect to the Docker daemon. Is the docker daemon running on this host?

and you are sure that the daemon is running, then you need to add your user to the docker group; refer to the documentation.
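
On most Linux distributions, adding yourself to the docker group amounts to the following (log out and back in afterwards for it to take effect):

sudo usermod -aG docker $USER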

Contributions

...are welcome! See CONTRIBUTING and CODE_OF_CONDUCT.md.

License

Apache 2.0

Algorithms

Identifier embeddings

We build the source code identifier co-occurrence matrix for every repository.

  1. Read Git repositories.
  2. Classify files using enry.
  3. Extract UAST from each supported file.
  4. Split and stem all the identifiers in each tree (sketched after this section).
  5. Traverse UAST, collapse all non-identifier paths and record all identifiers on the same level as co-occurring. Besides, connect them with their immediate parents.
  6. Write the global co-occurrence matrix.
  7. Train the embeddings using Swivel (requires Tensorflow). Interactively view the intermediate results in Tensorboard using --logs.
  8. Write the identifier embeddings model.

Steps 1-5 are performed with the repos2coocc command, step 6 with id2vec_preproc, step 7 with id2vec_train, and step 8 with id2vec_postproc.
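
Step 4 (identifier splitting and stemming) can be pictured with a minimal, generic sketch. This is not sourced.ml's actual implementation; the helper below is hypothetical and the stemming step (e.g. a Snowball stemmer) is omitted:

  # Hypothetical sketch: split snake_case / camelCase identifiers into
  # lowercase sub-tokens; the real pipeline also stems them.
  import re

  def split_identifier(name):
      tokens = []
      for part in re.split(r"[_\W]+", name):
          tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
      return [t.lower() for t in tokens if t]

  print(split_identifier("parseHTTPResponse_v2"))
  # -> ['parse', 'http', 'response', 'v', '2']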

Weighted Bag of X

We represent every repository as a weighted bag-of-vectors, provided that we have document frequencies ("docfreq") and identifier embeddings ("id2vec").

  1. Clone or read the repository from disk.
  2. Classify files using enry.
  3. Extract UAST from each supported file.
  4. Extract various features from each tree, e.g. identifiers, literals or node2vec-like structural fingerprints.
  5. Group by repository, file or function.
  6. Set the weight of each such feature according to TF-IDF.
  7. Write the BOW model.

Steps 1-7 are performed with the repos2bow command.
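
Step 6's TF-IDF weighting follows the standard formula; a self-contained sketch (not the library's code, with invented feature names) looks like this:

  # Illustrative TF-IDF weighting of extracted features per document.
  import math
  from collections import Counter

  docs = {
      "repo/file_a.py": ["parse", "request", "request", "timeout"],
      "repo/file_b.py": ["parse", "response"],
  }

  n_docs = len(docs)
  docfreq = Counter()                 # document frequency of each feature
  for features in docs.values():
      docfreq.update(set(features))

  bows = {
      doc: {f: tf * math.log(n_docs / docfreq[f])
            for f, tf in Counter(features).items()}
      for doc, features in docs.items()
  }

  print(bows["repo/file_a.py"])
  # "parse" occurs in every document, so its weight is 0; "request" and "timeout" get positive weights.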

Topic modeling

See here.

Glossary

See here.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].