tommyod / Efficient Apriori

Licence: MIT
An efficient Python implementation of the Apriori algorithm.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Efficient Apriori

Model Describer
model-describer : Making machine learning interpretable to humans
Stars: ✭ 22 (-84.83%)
Mutual labels:  data-science, data-mining, machinelearning
Tsv Utils
eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
Stars: ✭ 1,215 (+737.93%)
Mutual labels:  data-science, data-mining
Tsrepr
TSrepr: R package for time series representations
Stars: ✭ 75 (-48.28%)
Mutual labels:  data-science, data-mining
Matrixprofile
A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms, accessible to everyone.
Stars: ✭ 141 (-2.76%)
Mutual labels:  data-science, data-mining
Etherscan Ml
Python Data Science and Machine Learning Library for the Ethereum and ERC-20 Blockchain
Stars: ✭ 55 (-62.07%)
Mutual labels:  data-science, data-mining
Linkedingiveaway
👨🏽‍🏫You can learn about anything over here. What Giveaways I do and why it's important in today's modern world. Are you interested in Giveaway's?🔋
Stars: ✭ 67 (-53.79%)
Mutual labels:  data-science, data-mining
Vvedenie Mashinnoe Obuchenie
📝 A collection of machine learning resources
Stars: ✭ 1,282 (+784.14%)
Mutual labels:  data-science, data-mining
Tadw
An implementation of "Network Representation Learning with Rich Text Information" (IJCAI '15).
Stars: ✭ 43 (-70.34%)
Mutual labels:  data-science, data-mining
Papers Literature Ml Dl Rl Ai
Highly cited and useful papers related to machine learning, deep learning, AI, game theory, reinforcement learning
Stars: ✭ 1,341 (+824.83%)
Mutual labels:  data-science, data-mining
Vizuka
Explore high-dimensional datasets and how your algo handles specific regions.
Stars: ✭ 100 (-31.03%)
Mutual labels:  data-science, data-mining
Responsible Ai Widgets
This project provides responsible AI user interfaces for Fairlearn, interpret-community, and Error Analysis, as well as foundational building blocks that they rely on.
Stars: ✭ 107 (-26.21%)
Mutual labels:  data-science, machinelearning
Pycm
Multi-class confusion matrix library in Python
Stars: ✭ 1,076 (+642.07%)
Mutual labels:  data-science, data-mining
25daysinmachinelearning
I will update this repository to learn Machine learning with python with statistics content and materials
Stars: ✭ 53 (-63.45%)
Mutual labels:  data-science, machinelearning
Gorse
An open source recommender system service written in Go
Stars: ✭ 1,148 (+691.72%)
Mutual labels:  data-mining, machinelearning
Php Ml
PHP-ML - Machine Learning library for PHP
Stars: ✭ 7,900 (+5348.28%)
Mutual labels:  data-science, data-mining
Dex
Dex : The Data Explorer -- A data visualization tool written in Java/Groovy/JavaFX capable of powerful ETL and publishing web visualizations.
Stars: ✭ 1,238 (+753.79%)
Mutual labels:  data-science, data-mining
Accelerator
The Accelerator is a tool for fast and reproducible processing of large amounts of data.
Stars: ✭ 137 (-5.52%)
Mutual labels:  data-science, data-mining
Clevercsv
CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
Stars: ✭ 887 (+511.72%)
Mutual labels:  data-science, data-mining
Mldm
Lecture course "Machine Learning and Data Mining" at the Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University
Stars: ✭ 35 (-75.86%)
Mutual labels:  data-science, data-mining
Aethos
Automated Data Science and Machine Learning library to optimize workflow.
Stars: ✭ 94 (-35.17%)
Mutual labels:  data-science, machinelearning

Efficient-Apriori

An efficient pure Python implementation of the Apriori algorithm. Works with Python 3.6+.

The Apriori algorithm uncovers hidden structures in categorical data. The classical example is a database containing purchases from a supermarket. Every purchase has a number of items associated with it. We would like to uncover association rules such as {bread, eggs} -> {bacon} from the data. This is the goal of association rule learning, and the Apriori algorithm is arguably the most famous algorithm for this problem. This repository contains an efficient, well-tested implementation of the Apriori algorithm as described in the original paper by Agrawal et al., published in 1994.

The code is stable and in widespread use. It's cited in the book "Mastering Machine Learning Algorithms" by Bonaccorso.
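
To make the level-wise idea behind Apriori concrete, here is a minimal sketch of frequent itemset mining in plain Python. It is an illustration only, not the code used by this package: the function name frequent_itemsets and its return format are assumptions made for this example, and no attention is paid to efficiency.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets with support >= min_support.

    Illustrative sketch only (hypothetical helper, not part of efficient_apriori).
    Returns a dict mapping each frequent itemset (a frozenset) to its support.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    result = {}
    k = 1
    while current:
        result.update({s: support(s) for s in current})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
found = frequent_itemsets(transactions, min_support=0.5)
for itemset, supp in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(supp, 2))
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is ever counted.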

Example

Here's a minimal working example. Notice that in every transaction with eggs present, bacon is present too. Therefore, the rule {eggs} -> {bacon} is returned with 100 % confidence.

from efficient_apriori import apriori
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
itemsets, rules = apriori(transactions, min_support=0.5, min_confidence=1)
print(rules)  # [{eggs} -> {bacon}, {soup} -> {bacon}]
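
The confidence of {eggs} -> {bacon} can be checked by hand. The sketch below uses the standard textbook definitions of support, confidence and lift; the helper function support is written for this example and is not part of the package.

```python
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]


def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset (hypothetical helper)."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


lhs, rhs = {'eggs'}, {'bacon'}

# confidence(lhs -> rhs) = support(lhs and rhs together) / support(lhs)
confidence = support(lhs | rhs, transactions) / support(lhs, transactions)

# lift(lhs -> rhs) = confidence(lhs -> rhs) / support(rhs)
lift = confidence / support(rhs, transactions)

print(confidence, lift)  # 1.0 1.0
```

Eggs appear in two of three transactions, and bacon is present in both, so the confidence is 1. Because bacon is in every transaction anyway, the lift is also 1: knowing that eggs were bought tells us nothing new about bacon.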

If your data is in a pandas DataFrame, you must convert it to a list of tuples. Do you have missing values, or does the algorithm run for a long time? See this comment. More examples are included below.
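
One way to do the conversion is sketched below. The helper rows_to_transactions is hypothetical and written for this example. It assumes missing values appear as None or NaN, and it works on any iterable of row tuples, such as the one returned by df.itertuples(index=False, name=None) in pandas.

```python
def rows_to_transactions(rows):
    """Turn row tuples into a list of transactions, dropping missing values.

    Hypothetical helper for illustration; assumes missing cells are None or NaN.
    """
    def is_missing(value):
        # NaN is the only value that is not equal to itself
        return value is None or value != value

    return [tuple(v for v in row if not is_missing(v)) for row in rows]


# With pandas, the rows could come from: df.itertuples(index=False, name=None)
rows = [('eggs', 'bacon', None),
        ('soup', float('nan'), 'bacon')]
transactions = rows_to_transactions(rows)
print(transactions)  # [('eggs', 'bacon'), ('soup', 'bacon')]
```

Dropping missing cells matters because a placeholder left in the data would be treated as an item like any other, and it could then show up in the returned itemsets and rules.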

Installation

The software is available on GitHub and on PyPI. You may install it using pip.

pip install efficient-apriori

Contributing

You are very welcome to scrutinize the code and make pull requests if you have suggestions and improvements. Your submitted code must be PEP8 compliant, and all tests must pass. Contributors: CRJFisher

More examples

Filtering and sorting association rules

It's possible to filter and sort the returned list of association rules.

from efficient_apriori import apriori
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
itemsets, rules = apriori(transactions, min_support=0.2, min_confidence=1)

# Print out every rule with 2 items on the left hand side,
# 1 item on the right hand side, sorted by lift
rules_rhs = filter(lambda rule: len(rule.lhs) == 2 and len(rule.rhs) == 1, rules)
for rule in sorted(rules_rhs, key=lambda rule: rule.lift):
    print(rule)  # Prints the rule and its confidence, support, lift, ...

Working with large datasets

If you have data that is too large to fit in memory, you may pass a function returning a generator instead of a list. The min_support will most likely have to be a large value, or the algorithm will take very long before it terminates. If you have massive amounts of data, this Python implementation is likely not fast enough, and you should consult more specialized implementations.

from efficient_apriori import apriori

def data_generator(filename):
    """
    Returns a function that creates a fresh generator on every call,
    because the data has to be read several times.
    Use this approach if the data is too large to fit in memory. If not, use a list.
    """
    def data_gen():
        with open(filename) as file:
            for line in file:
                # One transaction per line, items separated by commas
                yield tuple(k.strip() for k in line.split(','))

    return data_gen

transactions = data_generator('dataset.csv')
itemsets, rules = apriori(transactions, min_support=0.9, min_confidence=0.6)
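
The reason for passing a function rather than a generator is that a generator can only be consumed once, while the data has to be read several times. The sketch below shows the difference using an in-memory string in place of a file; make_reader is a hypothetical stand-in for data_generator above.

```python
import io


def make_reader(text):
    """Return a function that creates a fresh generator over the text on every call."""
    def data_gen():
        for line in io.StringIO(text):
            yield tuple(k.strip() for k in line.split(','))

    return data_gen


source = make_reader("eggs,bacon\nsoup,bacon\n")

# Calling the function again starts a new pass over the data
first_pass = list(source())
second_pass = list(source())
print(first_pass == second_pass)  # True

# A single generator object is exhausted after one pass
single = source()
list(single)
print(list(single))  # []
```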

Transactions with IDs

If you need to know which transactions contain each frequent itemset, set the output_transaction_ids parameter to True. This changes the output to contain ItemsetCount objects for each itemset. Each object has a members property, containing the set of ids of the transactions in which the itemset occurs, as well as a count property. The ids are the enumeration of the transactions in the order they appear.

from efficient_apriori import apriori
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]
itemsets, rules = apriori(transactions, output_transaction_ids=True)
print(itemsets)
# {1: {('bacon',): ItemsetCount(itemset_count=3, members={0, 1, 2}), ...
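
The ids can be reproduced by hand. The sketch below enumerates the transactions and collects the positions of those that contain a given itemset; the function members is written for this example and only mirrors what the members property is described as holding.

```python
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]


def members(itemset, transactions):
    """Ids (positions) of the transactions containing every item in itemset."""
    itemset = set(itemset)
    return {i for i, t in enumerate(transactions) if itemset <= set(t)}


print(members(('bacon',), transactions))         # {0, 1, 2}
print(members(('eggs', 'bacon'), transactions))  # {0, 1}
```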