biocore-ntnu / Ncls

Licence: bsd-3-clause
The Nested Containment List for Python. Basically a static interval-tree that is silly fast for both construction and lookups.

Nested containment list

The Nested Containment List (NCLS) is a data structure for interval overlap queries, like the interval tree. It is usually an order of magnitude faster than the interval tree, both for building and for query lookups.
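
To make the idea concrete, here is a toy pure-Python sketch of a nested containment list as described in the original paper. It is not the C/Cython implementation shipped in this library. The names `Node`, `build_nclist` and `query` are hypothetical, and it assumes half-open `[start, end)` intervals. Intervals contained in another interval are moved into that interval's sublist, so within any single list both starts and ends are increasing and a binary search can find the first candidate.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    start: int
    end: int
    id: int
    children: list = field(default_factory=list)  # intervals contained in this one


def build_nclist(intervals):
    """Build a nested list from (start, end, id) tuples, half-open [start, end)."""
    top, stack = [], []
    # Sort by start ascending, then end descending, so containers come first.
    for start, end, id_ in sorted(intervals, key=lambda t: (t[0], -t[1])):
        node = Node(start, end, id_)
        # Pop until the top of the stack contains the new interval.
        while stack and end > stack[-1].end:
            stack.pop()
        (stack[-1].children if stack else top).append(node)
        stack.append(node)
    return top


def first_ending_after(nodes, pos):
    """Binary search: index of the first node whose end is greater than pos."""
    lo, hi = 0, len(nodes)
    while lo < hi:
        mid = (lo + hi) // 2
        if nodes[mid].end > pos:
            hi = mid
        else:
            lo = mid + 1
    return lo


def query(nodes, q_start, q_end, hits=None):
    """Collect (start, end, id) for every interval overlapping [q_start, q_end)."""
    if hits is None:
        hits = []
    i = first_ending_after(nodes, q_start)
    while i < len(nodes) and nodes[i].start < q_end:
        hits.append((nodes[i].start, nodes[i].end, nodes[i].id))
        # Only overlapping intervals can have overlapping children.
        query(nodes[i].children, q_start, q_end, hits)
        i += 1
    return hits


nested = build_nclist([(0, 10, 0), (2, 5, 1), (3, 4, 2), (6, 8, 3), (12, 15, 4)])
print(query(nested, 4, 7))
```

In this sketch a query skips every sublist whose parent does not overlap the query, which is where the structure saves work compared with scanning all intervals.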

The implementation here is a revived version of the one used in the now-defunct PyGr library, which died of bitrot. I have made it less memory-consuming and added wrapper functions that allow batch-querying the NCLS for further speed gains.

It was implemented to be the cornerstone of the PyRanges project, but I have made it available to the Python community as a stand-alone library. Enjoy.

Original paper: https://academic.oup.com/bioinformatics/article/23/11/1386/199545
Cite: http://dx.doi.org/10.1093/bioinformatics/btz615

Cite

If you use this library in published research, please cite:

http://dx.doi.org/10.1093/bioinformatics/btz615

Install

pip install ncls

Usage

from ncls import NCLS

import pandas as pd

starts = pd.Series(range(0, 5))
ends = starts + 100
ids = starts

subject_df = pd.DataFrame({"Start": starts, "End": ends}, index=ids)

print(subject_df)
#    Start  End
# 0      0  100
# 1      1  101
# 2      2  102
# 3      3  103
# 4      4  104

ncls = NCLS(starts.values, ends.values, ids.values)

# python API, slower
it = ncls.find_overlap(0, 2)
for i in it:
    print(i)
# (0, 100, 0)
# (1, 101, 1)

starts_query = pd.Series([1, 3])
ends_query = pd.Series([52, 14])
indexes_query = pd.Series([10000, 100])

query_df = pd.DataFrame({"Start": starts_query.values, "End": ends_query.values}, index=indexes_query.values)

print(query_df)
#        Start  End
# 10000      1   52
# 100        3   14


# everything done in C/Cython; faster
l_idxs, r_idxs = ncls.all_overlaps_both(starts_query.values, ends_query.values, indexes_query.values)
print((l_idxs, r_idxs))
# (array([10000, 10000, 10000, 10000, 10000,   100,   100,   100,   100,
#          100]), array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4]))

print(query_df.loc[l_idxs])
#        Start  End
# 10000      1   52
# 10000      1   52
# 10000      1   52
# 10000      1   52
# 10000      1   52
# 100        3   14
# 100        3   14
# 100        3   14
# 100        3   14
# 100        3   14
print(subject_df.loc[r_idxs])
#    Start  End
# 0      0  100
# 1      1  101
# 2      2  102
# 3      3  103
# 4      4  104
# 0      0  100
# 1      1  101
# 2      2  102
# 3      3  103
# 4      4  104

# return intervals in python (slow/mem-consuming)
intervals = ncls.intervals()
print(intervals)
# [(0, 100, 0), (1, 101, 1), (2, 102, 2), (3, 103, 3), (4, 104, 4)]
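
The outputs shown above are consistent with half-open `[start, end)` intervals: the query `(0, 2)` does not report the subject starting at 2. As a cross-check, here is a brute-force pure-Python sketch that reproduces the same pairs under that assumption. The function `brute_force_overlaps` is hypothetical and not part of the ncls API. It is quadratic, so it is only useful for validating results on small inputs.

```python
def brute_force_overlaps(q_starts, q_ends, q_ids, s_starts, s_ends, s_ids):
    """Return (query_ids, subject_ids) for every overlapping pair.

    Assumes half-open intervals: [a, b) and [c, d) overlap when a < d and c < b.
    """
    left, right = [], []
    for qs, qe, qi in zip(q_starts, q_ends, q_ids):
        for ss, se, si in zip(s_starts, s_ends, s_ids):
            if qs < se and ss < qe:
                left.append(qi)
                right.append(si)
    return left, right


# Same subject and query intervals as in the usage example.
subject_starts = list(range(0, 5))
subject_ends = [s + 100 for s in subject_starts]
subject_ids = subject_starts

l_check, r_check = brute_force_overlaps(
    [1, 3], [52, 14], [10000, 100],
    subject_starts, subject_ends, subject_ids,
)
print(l_check)
print(r_check)
```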

There is also an experimental floating point version of the NCLS called FNCLS. See the examples folder.

Benchmark

Test file of 100 million intervals (created by subsetting gencode gtf with replacement):

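| Library   | Function | Time (s) | Memory (GB) |
|-----------|----------|----------|-------------|
| bx-python | build    | 161.7    | 2.5         |
| ncls      | build    | 3.15     | 0.5         |
| bx-python | overlap  | 148.4    | 4.3         |
| ncls      | overlap  | 7.2      | 0.5         |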

Compared with bx-python, building is about 50 times faster and overlap queries are about 20 times faster. Memory usage is roughly one fifth for building and one ninth for overlap queries.
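
The rounded ratios can be recomputed from the benchmark figures. This is a small arithmetic sketch; the dictionary layout and variable names are only for illustration.

```python
# (time in seconds, memory in GB) taken from the benchmark figures above
bench = {
    ("bx-python", "build"): (161.7, 2.5),
    ("ncls", "build"): (3.15, 0.5),
    ("bx-python", "overlap"): (148.4, 4.3),
    ("ncls", "overlap"): (7.2, 0.5),
}

ratios = {}
for function in ("build", "overlap"):
    bx_time, bx_mem = bench[("bx-python", function)]
    ncls_time, ncls_mem = bench[("ncls", function)]
    # Ratio > 1 means ncls needs less time or memory than bx-python.
    ratios[function] = (bx_time / ncls_time, bx_mem / ncls_mem)
    speedup, mem_saving = ratios[function]
    print(f"{function}: {speedup:.1f}x faster, {mem_saving:.1f}x less memory")
# build: 51.3x faster, 5.0x less memory
# overlap: 20.6x faster, 8.6x less memory
```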

Original paper

Alexander V. Alekseyenko, Christopher J. Lee; Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Bioinformatics, Volume 23, Issue 11, 1 June 2007, Pages 1386–1393, https://doi.org/10.1093/bioinformatics/btl647
