
novoselrok / codesnippetsearch

License: MIT
Neural bag of words code search implementation using PyTorch and data from the CodeSearchNet project.

Programming Languages

Python
139335 projects - #7 most used programming language
Vue
7211 projects
JavaScript
184084 projects - #8 most used programming language
CSS
56736 projects

Projects that are alternatives of or similar to codesnippetsearch

image embeddings
Using efficientnet to provide embeddings for retrieval
Stars: ✭ 107 (+59.7%)
Mutual labels:  embeddings
meemi
Improving cross-lingual word embeddings by meeting in the middle
Stars: ✭ 20 (-70.15%)
Mutual labels:  embeddings
Simple chat bot
Simple nlp chatbot
Stars: ✭ 23 (-65.67%)
Mutual labels:  embeddings
DeepLearningReading
Deep Learning and Machine Learning mini-projects. Current Project: Deepmind Attentive Reader (rc-data)
Stars: ✭ 78 (+16.42%)
Mutual labels:  embeddings
so stupid search
It's my honor to help you move faster and have more time with your family and the sunshine. This tool is for those who often want to search for a string deeply and recursively inside a directory, but without reaching for the big tools (grep, ack, ripgrep): everything should be small, thin, fast, and lazy, without having to think or remember too much…
Stars: ✭ 135 (+101.49%)
Mutual labels:  code-search
CaRE
EMNLP 2019: CaRe: Open Knowledge Graph Embeddings
Stars: ✭ 34 (-49.25%)
Mutual labels:  embeddings
embedding study
Learning character embeddings generated from Chinese pre-trained models; testing the Chinese-language performance of BERT and ELMo
Stars: ✭ 94 (+40.3%)
Mutual labels:  embeddings
dpar
Neural network transition-based dependency parser (in Rust)
Stars: ✭ 41 (-38.81%)
Mutual labels:  embeddings
ClusterTransformer
Topic clustering library built on Transformer embeddings and cosine similarity metrics. Compatible with all BERT base transformers from huggingface.
Stars: ✭ 36 (-46.27%)
Mutual labels:  embeddings
TCE
This repository contains the code implementation used in the paper Temporally Coherent Embeddings for Self-Supervised Video Representation Learning (TCE).
Stars: ✭ 51 (-23.88%)
Mutual labels:  embeddings
embedding evaluation
Evaluate your word embeddings
Stars: ✭ 32 (-52.24%)
Mutual labels:  embeddings
SubGNN
Subgraph Neural Networks (NeurIPS 2020)
Stars: ✭ 136 (+102.99%)
Mutual labels:  embeddings
ar-embeddings
Sentiment Analysis for Arabic Text (tweets, reviews, and standard Arabic) using word2vec
Stars: ✭ 83 (+23.88%)
Mutual labels:  embeddings
Entity Embedding
Reference implementation of the paper "Word Embeddings for Entity-annotated Texts"
Stars: ✭ 19 (-71.64%)
Mutual labels:  embeddings
graphml-tutorials
Tutorials for Machine Learning on Graphs
Stars: ✭ 125 (+86.57%)
Mutual labels:  embeddings
event-embedding-multitask
*SEM 2018: Learning Distributed Event Representations with a Multi-Task Approach
Stars: ✭ 22 (-67.16%)
Mutual labels:  embeddings
cskg
CSKG: The CommonSense Knowledge Graph
Stars: ✭ 86 (+28.36%)
Mutual labels:  embeddings
info-retrieval
Information Retrieval in High Dimensional Data (class deliverables)
Stars: ✭ 33 (-50.75%)
Mutual labels:  embeddings
RadiologyReportEmbedding
Intelligent Word Embeddings of Free-Text Radiology Reports
Stars: ✭ 22 (-67.16%)
Mutual labels:  embeddings
bor
User-friendly, tiny source code searcher written in pure Python.
Stars: ✭ 105 (+56.72%)
Mutual labels:  code-search

CodeSnippetSearch

CodeSnippetSearch is a web application and a web extension that allows you to search GitHub repositories using natural language queries and code itself.

It is based on a neural bag-of-words code search implementation using PyTorch and data from the CodeSearchNet project. The model training code was heavily inspired by the baseline (TensorFlow) implementation in the CodeSearchNet repository. Currently, the Python, Java, Go, PHP, JavaScript, and Ruby programming languages are supported.
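
The core idea of the neural bag-of-words model is to embed every token of a code snippet (or of a natural language query) and pool the token embeddings into a single vector, so that queries and code can be compared in a shared vector space. A minimal sketch of such an encoder in PyTorch (class and parameter names here are illustrative, not the project's actual API):

import torch
import torch.nn as nn

class NBoWEncoder(nn.Module):
    """Neural bag-of-words encoder: embed tokens, then mean-pool them."""

    def __init__(self, vocab_size: int, embedding_dim: int = 128):
        super().__init__()
        # Index 0 is reserved for padding and is ignored when pooling.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) of vocabulary indices
        mask = (token_ids != 0).unsqueeze(-1).float()   # zero out padding positions
        embedded = self.embedding(token_ids) * mask     # (batch, seq_len, dim)
        pooled = embedded.sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return pooled                                   # (batch, dim)

During training, one such encoder per language embeds the code and another embeds the docstring/query, and matching pairs are pushed close together in the shared space; the exact loss is defined in the training code.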

Helpful papers:

Model description

(Figure: model structure)

Project structure

  • code_search: A Python package with scripts to prepare the data, train the language models and save the embeddings
  • code_search_web: CodeSnippetSearch website Django project
  • serialized_data: Store for intermediate objects produced during training (docs, vocabularies, models, embeddings, etc.)
  • codesearchnet_data: Data from the CodeSearchNet project

Data

We are using the data from the CodeSearchNet project. Run the following command to download the required data:

  • $ ./scripts/download_codesearchnet_data.sh

This will download around 20GB of data. An overview of the data structure can be found in the CodeSearchNet project documentation.

Training the models

If possible, perform these steps inside a virtual environment. To install the required dependencies, run: $ ./scripts/install_pip_packages.sh. To install code_search as a package, run: $ ./scripts/install_code_search_package.sh

Preparing the data

The data preparation step is separate from the training step because it is time- and memory-consuming. It prepares all the data needed for training: preprocessing the code docs, building vocabularies, and encoding sequences.
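
As a rough illustration of what building a vocabulary and encoding sequences involves (the special indices, size limits, and function names below are assumptions for the sketch, not the project's exact implementation):

from collections import Counter

PAD, UNK = 0, 1

def build_vocabulary(token_lists, max_size=10000):
    """Map the most frequent tokens to integer ids, reserving 0/1 for padding/unknown."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    vocabulary = {"<pad>": PAD, "<unk>": UNK}
    for token, _ in counts.most_common(max_size - len(vocabulary)):
        vocabulary[token] = len(vocabulary)
    return vocabulary

def encode_sequence(tokens, vocabulary, max_length=200):
    """Convert tokens to ids, then truncate and pad to a fixed length."""
    ids = [vocabulary.get(token, UNK) for token in tokens[:max_length]]
    return ids + [PAD] * (max_length - len(ids))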

The first step is to parse the CodeSearchNet data. We need to parse the *_dedupe_definitions_v2.pkl files from pickle format to jsonl format. We will be using the jsonl format throughout the project, since we can read the file line by line and keep the memory footprint minimal. Reading the evaluation docs requires more than 16GB of memory, because the entire file has to be read into memory (the largest is javascript_dedupe_definitions_v2.pkl at 6.6GB). If you do not have this kind of horsepower, I suggest renting a cloud server with >16GB of memory and running this step there. After you are done, just download the jsonl files to your local machine. Subsequent preparation and training steps should not take more than 16GB of memory.

To parse the CodeSearchNet data run: $ python parse_codesearchnet_data.py
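
Conceptually, the conversion loads the whole pickle into memory once (the expensive part) and then writes one JSON object per line, after which every later step can stream the file. A sketch of that idea, assuming each pickle file holds a list of JSON-serializable records (the real script may differ in details):

import json
import pickle

def pickle_to_jsonl(pickle_path, jsonl_path):
    # The entire pickle has to fit in memory -- this is the memory-hungry step.
    with open(pickle_path, "rb") as f:
        definitions = pickle.load(f)
    with open(jsonl_path, "w") as f:
        for definition in definitions:
            f.write(json.dumps(definition) + "\n")

def read_jsonl(jsonl_path):
    # Streaming read keeps the memory footprint minimal.
    with open(jsonl_path) as f:
        for line in f:
            yield json.loads(line)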

To prepare the data for training, run: $ python prepare_data.py --prepare-all. It uses the Python multiprocessing module to take advantage of multiple cores. If you encounter memory errors or slow performance, you can tweak the number of processes by changing the parameter passed to multiprocessing.Pool.
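
If you do need to tune it, the relevant pattern looks roughly like this (the worker function and process count are placeholders, not the script's real names):

import multiprocessing

def preprocess_doc(doc):
    # Placeholder for tokenizing/encoding a single code doc.
    return doc

if __name__ == "__main__":
    docs = []  # in practice, loaded from the parsed jsonl files
    # Lower the number of processes if you run into memory errors.
    with multiprocessing.Pool(processes=4) as pool:
        prepared = pool.map(preprocess_doc, docs)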

Training and evaluation

You start the training by running: $ python train.py. This will train a separate model for each language, build the code embeddings, evaluate them using MRR (Mean Reciprocal Rank), and output model_predictions.csv. The predictions are evaluated by GitHub and Weights & Biases (W&B) using the NDCG (Normalized Discounted Cumulative Gain) metric to rank the submissions.
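
MRR averages the reciprocal of the rank at which the correct snippet appears for each query. A small sketch of the metric, assuming a square matrix of similarity scores where the matching snippet for query i is snippet i (how the scores are produced is up to the model):

import numpy as np

def mean_reciprocal_rank(similarities: np.ndarray) -> float:
    """similarities[i, j] = score of query i against code snippet j."""
    reciprocal_ranks = []
    for i, scores in enumerate(similarities):
        # 1-based rank of the correct snippet among all candidates.
        rank = int((scores > scores[i]).sum()) + 1
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))

# mean_reciprocal_rank(np.array([[0.9, 0.1], [0.2, 0.8]])) -> 1.0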

Query the trained models

Run $ python search.py "read file lines" and it will output the 3 best-ranked results for each language.
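
Under the hood, a query like this amounts to encoding the query with the trained model, scoring it against the precomputed code embeddings, and returning the top hits. A sketch using cosine similarity (function and variable names are illustrative, not the project's actual API):

import numpy as np

def top_k_snippets(query_embedding, code_embeddings, k=3):
    """Return indices of the k code snippets closest to the query by cosine similarity."""
    query = query_embedding / np.linalg.norm(query_embedding)
    codes = code_embeddings / np.linalg.norm(code_embeddings, axis=1, keepdims=True)
    scores = codes @ query
    return np.argsort(-scores)[:k]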
