millanp95 / DeLUCS

Licence: other
This repository contains all the source files required to run DeLUCS, a deep learning clustering algorithm for DNA sequences.

Programming languages: Python, Jupyter Notebook, Shell

DeLUCS

This repository contains all the source files required to reproduce the results in the original DeLUCS paper (https://doi.org/10.1101/2021.05.13.444008), as well as a detailed guide for running the code.

Computational Pipeline:

1. Build the dataset:

	python build_dp.py --data_path=<PATH_sequence_folder>
  • Input: folders with the sequences in FASTA format
  • Output: pickle file in the form (label, sequence, accession)
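
As a rough illustration of this step, the sketch below reads a directory of FASTA folders into (label, sequence, accession) tuples. It is not the code in `build_dp.py`. It assumes that each subfolder name serves as the label, that the first token of each FASTA header is the accession, and that the dataset is stored with pickle (as the `<PATH_pickle_dataset>` placeholder in the next step suggests). The function names `read_fasta`, `build_dataset` and `save_dataset` are hypothetical.

```python
import pickle
from pathlib import Path


def read_fasta(path):
    """Yield (accession, sequence) pairs from one FASTA file."""
    accession, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if accession is not None:
                    yield accession, "".join(chunks)
                # Assumption: the first whitespace-delimited header token is the accession.
                header = line[1:].split()
                accession = header[0] if header else ""
                chunks = []
            elif accession is not None:
                chunks.append(line.upper())
    if accession is not None:
        yield accession, "".join(chunks)


def build_dataset(data_path):
    """Collect (label, sequence, accession) tuples; each subfolder name is used as the label."""
    records = []
    for folder in sorted(Path(data_path).iterdir()):
        if not folder.is_dir():
            continue
        for fasta_file in sorted(folder.iterdir()):
            if fasta_file.is_file():
                for accession, sequence in read_fasta(fasta_file):
                    records.append((folder.name, sequence, accession))
    return records


def save_dataset(records, output_file):
    """Serialise the records with pickle (assumed storage format)."""
    with open(output_file, "wb") as handle:
        pickle.dump(records, handle)
```

Calling `save_dataset(build_dataset("<PATH_sequence_folder>"), "dataset.p")` would write one list of tuples; the output file name here is only an example.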

2. Compute the mimic sequences.

	python get_pairs.py --data_path=<PATH_pickle_dataset> --k=6 --modify='mutation' --output=<PATH_output_file> --n_mimics=<n mimics per sequence>
  • Input: pickle file in the form (label, sequence, accession)
  • Output: pickle file in the form (pairs, x_test, y_test)
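
To make the idea of a mimic sequence concrete, here is a minimal sketch of what `--modify='mutation'`, `--k` and `--n_mimics` could correspond to: each sequence is copied with random point substitutions, and both the original and the copy are turned into k-mer count vectors. This is an illustration under assumptions, not the logic of `get_pairs.py`. The substitution rate, the plain count-vector representation, and the names `mutate`, `kmer_counts` and `make_pairs` are all hypothetical.

```python
import random
from functools import lru_cache
from itertools import product

ALPHABET = "ACGT"


def mutate(sequence, rate, rng):
    """Return a copy of `sequence` where each A/C/G/T base is substituted with probability `rate`."""
    mutated = []
    for base in sequence:
        if base in ALPHABET and rng.random() < rate:
            # Substitute with one of the three other bases.
            mutated.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


@lru_cache(maxsize=None)
def kmer_index(k):
    """Map every k-mer over A/C/G/T to a position in the count vector (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_counts(sequence, k):
    """Count k-mers with a sliding window; windows containing other symbols (e.g. N) are skipped."""
    index = kmer_index(k)
    counts = [0] * len(index)
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:
            counts[position] += 1
    return counts


def make_pairs(records, k=6, n_mimics=3, rate=1e-2, seed=0):
    """Build (original_vector, mimic_vector) pairs from (label, sequence, accession) records."""
    rng = random.Random(seed)
    pairs = []
    for _label, sequence, _accession in records:
        original = kmer_counts(sequence, k)
        for _ in range(n_mimics):
            mimic = kmer_counts(mutate(sequence, rate, rng), k)
            pairs.append((original, mimic))
    return pairs
```

With `k=6` each vector has 4^6 = 4096 entries, and a dataset of N sequences yields N × `n_mimics` pairs in this sketch.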

3. Train the model.

* For training DeLUCS and testing its performance:
	```
	python EvaluateDeLUCS.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
	```

	* Input: pickle file with the mimics in the form (pairs, x_test, y_test).
	* Output: confusion matrix.
			<!--* File with the misclassified sequences in the form (accession, true_label, predicted_label)-->

* For testing the performance of a single neural network trained in an unsupervised way (labels must be available):
	```
	python EvaluateSingleRun.py --data_dir=<PATH_of_computed_mimics> --out_dir=<OUTPUT_DIR>
	```
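
Because the clusters produced by unsupervised training carry arbitrary indices, a confusion matrix is only readable once each cluster has been matched to a true label. The sketch below shows one common way to do that; it is an illustration, not the code in the evaluation scripts. It assumes integer labels and cluster indices in `0..n_classes-1` and as many clusters as classes. It searches all assignments by brute force, which is only practical for a handful of clusters; for larger problems the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) is the usual substitute. All function names are hypothetical.

```python
from itertools import permutations


def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts samples with true label i that were assigned to cluster j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, cluster in zip(y_true, y_pred):
        matrix[true_label][cluster] += 1
    return matrix


def best_mapping(matrix):
    """Find the label-to-cluster assignment that maximises matched samples (brute force)."""
    n = len(matrix)
    best_perm, best_hits = None, -1
    for perm in permutations(range(n)):
        # perm[i] is the cluster matched to true label i.
        hits = sum(matrix[i][perm[i]] for i in range(n))
        if hits > best_hits:
            best_perm, best_hits = perm, hits
    return best_perm, best_hits


def clustering_accuracy(y_true, y_pred, n_classes):
    """Return (accuracy, aligned_matrix) after matching clusters to labels."""
    matrix = confusion_matrix(y_true, y_pred, n_classes)
    perm, hits = best_mapping(matrix)
    # Reorder the columns so that matched clusters sit on the diagonal.
    aligned = [[row[perm[j]] for j in range(n_classes)] for row in matrix]
    return hits / len(y_true), aligned
```

The labels (`y_test` in the pickle file) are needed only for this evaluation step, not for training.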

Training on your own data

We recommend using the updated version of the code at https://github.com/Kari-Genomics-Lab for training on your own data.

Citation

If you find DeLUCS useful in your research, please consider citing:

@article{10.1371/journal.pone.0261531,
    doi = {10.1371/journal.pone.0261531},
    author = {Millán Arias, Pablo AND Alipour, Fatemeh AND Hill, Kathleen A. AND Kari, Lila},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {DeLUCS: Deep learning for unsupervised clustering of DNA sequences},
    year = {2022},
    month = {01},
    volume = {17},
    url = {https://doi.org/10.1371/journal.pone.0261531},
    pages = {1-25},
    number = {1},
}	