Jdupes: A powerful duplicate file finder and an enhanced fork of 'fdupes'.
Stars: ✭ 790 (+400%)
UMICollapse: Accelerating the deduplication and collapsing process for reads with Unique Molecular Identifiers (UMI). Heavily optimized for scalability and orders of magnitude faster than a previous tool.
Stars: ✭ 31 (-80.38%)
cosmosR: COSMOS (Causal Oriented Search of Multi-Omic Space) is a method that integrates phosphoproteomics, transcriptomics, and metabolomics data sets.
Stars: ✭ 30 (-81.01%)
Rmlint: Extremely fast tool to remove duplicates and other lint from your filesystem
Stars: ✭ 996 (+530.38%)
IntraArchiveDeduplicator: Tool for managing data-deduplication within extant compressed archive files, along with a relatively performant BK tree implementation for fuzzy image searching.
Stars: ✭ 87 (-44.94%)
Mapeathor: Translator of spreadsheet mappings into R2RML, RML or YARRRML
Stars: ✭ 27 (-82.91%)
Kopia: Cross-platform backup tool for Windows, macOS & Linux with fast, incremental backups, client-side end-to-end encryption, compression and data deduplication. CLI and GUI included.
Stars: ✭ 507 (+220.89%)
SDM-RDFizer: An Efficient RML-Compliant Engine for Knowledge Graph Construction
Stars: ✭ 68 (-56.96%)
entity-embed: PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.
Stars: ✭ 96 (-39.24%)
Restic: Fast, secure, efficient backup program
Stars: ✭ 15,105 (+9460.13%)
Fingerprints: Make it easier to compare and cross-reference the names of companies and people by applying strong normalisation.
Stars: ✭ 91 (-42.41%)
zpaqfranz: Deduplicating archiver with encryption and paranoid-level tests. Swiss army knife for the serious backup and disaster recovery manager. Ransomware neutralizer. Win/Linux/Unix
Stars: ✭ 86 (-45.57%)
doctoral-thesis: 📖 Generation and Applications of Knowledge Graphs in Systems and Networks Biology
Stars: ✭ 26 (-83.54%)
Dupandas: 📊 Python package for performing deduplication using flexible text matching and cleaning in pandas DataFrames
Stars: ✭ 20 (-87.34%)
thymeflow: Installer for Thymeflow, a personal knowledge management system.
Stars: ✭ 27 (-82.91%)
Talisman: Straightforward fuzzy matching, information retrieval and NLP building blocks for JavaScript.
Stars: ✭ 584 (+269.62%)
data-product-streaming: Template to deploy a Data Product for data stream processing into a Data Landing Zone of the Data Management & Analytics Scenario (former Enterprise-Scale Analytics). The Data Product template can be used by cross-functional teams to ingest, provide and create new data assets within the platform.
Stars: ✭ 32 (-79.75%)
Libpostal: A C library for parsing/normalizing street addresses around the world. Powered by statistical NLP and open geo data.
Stars: ✭ 3,312 (+1996.2%)
Awesome Single Cell: Community-curated list of software packages and data resources for single-cell, including RNA-seq, ATAC-seq, etc.
Stars: ✭ 1,937 (+1125.95%)
record-linkage-resources: Resources for tackling record linkage / deduplication / data matching problems
Stars: ✭ 67 (-57.59%)
dduper: Fast block-level out-of-band BTRFS deduplication tool.
Stars: ✭ 108 (-31.65%)
assignPOP: Population Assignment using Genetic, Non-genetic or Integrated Data in a Machine-learning Framework. Methods in Ecology and Evolution. 2018;9:439–446.
Stars: ✭ 16 (-89.87%)
cargo-limit: Cargo with less noise: warnings are skipped until errors are fixed, Neovim integration, etc.
Stars: ✭ 105 (-33.54%)
Dupeguru: Find duplicate files
Stars: ✭ 2,385 (+1409.49%)
Spark Lucenerdd: Spark RDD with Lucene's query and entity linkage capabilities
Stars: ✭ 114 (-27.85%)
mail-deduplicate: 📧 CLI to deduplicate mails from mail boxes.
Stars: ✭ 134 (-15.19%)
CogStack-NiFi: Building data processing pipelines for document processing with NLP using Apache NiFi and related services
Stars: ✭ 22 (-86.08%)
Rltk: Record Linkage ToolKit (Find and link entities)
Stars: ✭ 71 (-55.06%)
data-product-batch: Template to deploy a Data Product for Batch data processing into a Data Landing Zone of the Data Management & Analytics Scenario (former Enterprise-Scale Analytics). The Data Product template can be used by cross-functional teams to ingest, provide and create new data assets within the platform.
Stars: ✭ 27 (-82.91%)
Fastcdc Rs: FastCDC implementation in Rust
Stars: ✭ 31 (-80.38%)
morph-kgc: Powerful RDF Knowledge Graph Generation with [R2]RML Mappings
Stars: ✭ 77 (-51.27%)
Borgmatic: Simple, configuration-driven backup software for servers and workstations
Stars: ✭ 902 (+470.89%)
Hudi: Upserts, Deletes And Incremental Processing on Big Data.
Stars: ✭ 2,586 (+1536.71%)
Rdedup: Data deduplication engine, supporting optional compression and public key encryption.
Stars: ✭ 690 (+336.71%)
OpenOmics: A bioinformatics API and web-app to integrate multi-omics datasets & interface with public databases.
Stars: ✭ 22 (-86.08%)
Recordlinkage: A toolkit for record linkage and duplicate detection in Python
Stars: ✭ 532 (+236.71%)
kuwala: Kuwala is the no-code data platform for BI analysts and engineers, enabling you to build powerful analytics workflows. We set out to bring the state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition we provide third-party data into data sc…
Stars: ✭ 474 (+200%)
Alertmanager: Prometheus Alertmanager
Stars: ✭ 4,574 (+2794.94%)
winter: WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing, schema matching, identity resolution, data fusion, and result evaluation.
Stars: ✭ 101 (-36.08%)
lieu: Dedupe/batch geocode addresses and venues around the world with libpostal
Stars: ✭ 73 (-53.8%)
Rudder Server: Privacy and Security focused Segment-alternative, in Golang and React
Stars: ✭ 2,874 (+1718.99%)
CommonCoreOntologies: The Common Core Ontology Repository holds the current released version of the Common Core Ontology suite.
Stars: ✭ 109 (-31.01%)
gencore: Generate duplex/single consensus reads to reduce sequencing noise and remove duplicates
Stars: ✭ 91 (-42.41%)
DataBridge.NET: Configurable data bridge for permanent ETL jobs
Stars: ✭ 16 (-89.87%)
splink: Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (+14.56%)
Lsh: Locality Sensitive Hashing using MinHash in Python/Cython to detect near duplicate text documents
Stars: ✭ 182 (+15.19%)
acid-store: A library for secure, deduplicated, transactional, and verifiable data storage
Stars: ✭ 48 (-69.62%)
Mara Pipelines: A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Stars: ✭ 1,841 (+1065.19%)
yadf: Yet Another Dupes Finder
Stars: ✭ 32 (-79.75%)
Kvdo: A pair of kernel modules which provide pools of deduplicated and/or compressed block storage.
Stars: ✭ 168 (+6.33%)
zingg: Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Stars: ✭ 655 (+314.56%)
SchemaMapper: A .NET class library that allows you to import data from different sources into a unified destination
Stars: ✭ 41 (-74.05%)
Dejavu: Quickly detect already witnessed data.
Stars: ✭ 151 (-4.43%)
R-Learning-Journey: Some of the projects I made when starting to learn R for Data Science at the university
Stars: ✭ 19 (-87.97%)
Airbyte: Airbyte is an open-source EL(T) platform that helps you replicate your data in your warehouses, lakes and databases.
Stars: ✭ 4,919 (+3013.29%)
scarches: Reference mapping for single-cell genomics
Stars: ✭ 175 (+10.76%)
bio2bel: A Python framework for integrating biological databases and structured data sources in Biological Expression Language (BEL)
Stars: ✭ 16 (-89.87%)
Vdo: Userspace tools for managing VDO volumes.
Stars: ✭ 138 (-12.66%)