All Projects → chuanconggao → Prefixspan Py

chuanconggao / Prefixspan Py

Licence: mit
The shortest yet efficient Python implementation of the sequential pattern mining algorithm PrefixSpan, closed sequential pattern mining algorithm BIDE, and generator sequential pattern mining algorithm FEAT.

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Prefixspan Py

Data Science Toolkit
Collection of stats, modeling, and data science tools in Python and R.
Stars: ✭ 169 (-21.03%)
Mutual labels:  data-mining
Dataaspirant codes
Complete machine learning model codes
Stars: ✭ 185 (-13.55%)
Mutual labels:  data-mining
Estadistica Con R
Apuntes personales sobre estadística, machine learning y lenguaje de programación R
Stars: ✭ 201 (-6.07%)
Mutual labels:  data-mining
Data Science Resources
👨🏽‍🏫You can learn about what data science is and why it's important in today's modern world. Are you interested in data science?🔋
Stars: ✭ 171 (-20.09%)
Mutual labels:  data-mining
Awesome Machine Learning Interpretability
A curated list of awesome machine learning interpretability resources.
Stars: ✭ 2,404 (+1023.36%)
Mutual labels:  data-mining
Pyss3
A Python package implementing a new machine learning model for text classification with visualization tools for Explainable AI
Stars: ✭ 191 (-10.75%)
Mutual labels:  data-mining
Pipeline
the `pipeline` shell command
Stars: ✭ 168 (-21.5%)
Mutual labels:  data-mining
Zhihu Analysis Python
Social Network Analysis of Zhihu with Python
Stars: ✭ 215 (+0.47%)
Mutual labels:  data-mining
Emuto
manipulate JSON files
Stars: ✭ 180 (-15.89%)
Mutual labels:  data-mining
Tradingview Data Scraper
Extract price and indicator data from TradingView charts to create ML datasets
Stars: ✭ 203 (-5.14%)
Mutual labels:  data-mining
Python practice of data analysis and mining
《Python数据分析与挖掘实战》随书源码与数据
Stars: ✭ 172 (-19.63%)
Mutual labels:  data-mining
2017 Ccf Bdci Aijudge
2017-CCF-BDCI-让AI当法官(初赛):7th/415 (Top 1.68%)
Stars: ✭ 177 (-17.29%)
Mutual labels:  data-mining
Awesome Ensemble Learning
Ensemble learning related books, papers, videos, and toolboxes
Stars: ✭ 195 (-8.88%)
Mutual labels:  data-mining
Lightgbm
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Stars: ✭ 13,293 (+6111.68%)
Mutual labels:  data-mining
Smartproxy
HTTP(S) Rotating Residential proxies - Code examples & General information
Stars: ✭ 205 (-4.21%)
Mutual labels:  data-mining
Welly
Well handling
Stars: ✭ 168 (-21.5%)
Mutual labels:  data-mining
Ail Framework
AIL framework - Analysis Information Leak framework
Stars: ✭ 191 (-10.75%)
Mutual labels:  data-mining
Gwu data mining
Materials for GWU DNSC 6279 and DNSC 6290.
Stars: ✭ 217 (+1.4%)
Mutual labels:  data-mining
Qminer
Analytic platform for real-time large-scale streams containing structured and unstructured data.
Stars: ✭ 206 (-3.74%)
Mutual labels:  data-mining
Instascrape
Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically
Stars: ✭ 202 (-5.61%)
Mutual labels:  data-mining

PyPI version PyPI pyversions PyPI license

Featured on ImportPython Issue 173. Thank you so much for support!

The shortest yet efficient implementation of the famous frequent sequential pattern mining algorithm PrefixSpan, the famous frequent closed sequential pattern mining algorithm BIDE (in closed.py), and the frequent generator sequential pattern mining algorithm FEAT (in generator.py), as a unified and holistic algorithm framework.

  • BIDE is usually much faster than PrefixSpan on large datasets, as only a small subset of closed patterns sharing the equivalent information of all the patterns are returned.

  • FEAT is usually faster than PrefixSpan but slower than BIDE on large datasets.

For simpler code, some general purpose functions have been moved to be part of a new library extratools.

Reference

Research Papers

PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth.
Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Meichun Hsu.
Proceedings of the 17th International Conference on Data Engineering, 2001.
BIDE: Efficient Mining of Frequent Closed Sequences.
Jianyong Wang, Jiawei Han.
Proceedings of the 20th International Conference on Data Engineering, 2004.
Efficient mining of frequent sequence generators.
Chuancong Gao, Jianyong Wang, Yukai He, Lizhu Zhou.
Proceedings of the 17th International Conference on World Wide Web, 2008.

Alternative Implementations

I created this project with the original minimal 15 lines implementation of PrefixSpan for educational purpose. However, as this project grows into a full feature library, its code size also inevitably grows. I have revised and reuploaded the original implementation as a GitHub Gist here for reference.

You can also try my Scala version of PrefixSpan.

Features

Outputs traditional single-item sequential patterns, where gaps are allowed between items.

  • Mining top-k patterns is supported, with respective optimizations on efficiency.

  • You can limit the length of mined patterns. Note that setting maximum pattern length properly can significantly speedup the algorithm.

  • Custom key function, custom filter function, and custom callback function can be applied.

Installation

This package is available on PyPI. Just use pip3 install -U prefixspan to install it.

CLI Usage

You can simply use the algorithms on terminal.

Usage:
    prefixspan-cli (frequent | top-k) <threshold> [options] [<file>]

    prefixspan-cli --help


Options:
    --text             Treat each item as text instead of integer.

    --closed           Return only closed patterns.
    --generator        Return only generator patterns.

    --key=<key>        Custom key function. [default: ]
                       Must be a Python function in form of "lambda patt, matches: ...", returning an integer value.
    --bound=<bound>    The upper-bound function of the respective key function. When unspecified, the same key function is used. [default: ]
                       Must be no less than the key function, i.e. bound(patt, matches) ≥ key(patt, matches).
                       Must be anti-monotone, i.e. for patt1 ⊑ patt2, bound(patt1, matches1) ≥ bound(patt2, matches2).

    --filter=<filter>  Custom filter function. [default: ]
                       Must be a Python function in form of "lambda patt, matches: ...", returning a boolean value.

    --minlen=<minlen>  Minimum length of patterns. [default: 1]
    --maxlen=<maxlen>  Maximum length of patterns. [default: 1000]
  • Sequences are read from standard input. Each sequence is integers separated by space, like this example:
cat test.dat

0 1 2 3 4
1 1 1 3 4
2 1 2 2 0
1 1 1 2 2
  • When dealing with text data, please use the --text option. Each sequence is words separated by space, assuming stop words have been removed, like this example:
cat test.txt

a b c d e
b b b d e
c b c c a
b b b c c
  • The patterns and their respective frequencies are printed to standard output.
prefixspan-cli frequent 2 test.dat

0 : 2
1 : 4
1 2 : 3
1 2 2 : 2
1 3 : 2
1 3 4 : 2
1 4 : 2
1 1 : 2
1 1 1 : 2
2 : 3
2 2 : 2
3 : 2
3 4 : 2
4 : 2
prefixspan-cli frequent 2 --text test.txt

a : 2
b : 4
b c : 3
b c c : 2
b d : 2
b d e : 2
b e : 2
b b : 2
b b b : 2
c : 3
c c : 2
d : 2
d e : 2
e : 2

API Usage

Alternatively, you can use the algorithms via API.

from prefixspan import PrefixSpan

db = [
    [0, 1, 2, 3, 4],
    [1, 1, 1, 3, 4],
    [2, 1, 2, 2, 0],
    [1, 1, 1, 2, 2],
]

ps = PrefixSpan(db)

For details of each parameter, please refer to the PrefixSpan class in prefixspan/api.py.

print(ps.frequent(2))
# [(2, [0]),
#  (4, [1]),
#  (3, [1, 2]),
#  (2, [1, 2, 2]),
#  (2, [1, 3]),
#  (2, [1, 3, 4]),
#  (2, [1, 4]),
#  (2, [1, 1]),
#  (2, [1, 1, 1]),
#  (3, [2]),
#  (2, [2, 2]),
#  (2, [3]),
#  (2, [3, 4]),
#  (2, [4])]

print(ps.topk(5))
# [(4, [1]),
#  (3, [2]),
#  (3, [1, 2]),
#  (2, [1, 3]),
#  (2, [1, 3, 4])]


print(ps.frequent(2, closed=True))

print(ps.topk(5, closed=True))


print(ps.frequent(2, generator=True))

print(ps.topk(5, generator=True))

Closed Patterns and Generator Patterns

The closed patterns are much more compact due to the smaller number.

  • A pattern is closed if there is no super-pattern with the same frequency.
prefixspan-cli frequent 2 --closed test.dat

0 : 2
1 : 4
1 2 : 3
1 2 2 : 2
1 3 4 : 2
1 1 1 : 2

The generator patterns are even more compact due to both the smaller number and the shorter lengths.

  • A pattern is generator if there is no sub-pattern with the same frequency.

  • Due to the high compactness, generator patterns are useful as features for classification, etc.

prefixspan-cli frequent 2 --generator test.dat

0 : 2
1 1 : 2
2 : 3
2 2 : 2
3 : 2
4 : 2

There are patterns that are both closed and generator.

prefixspan-cli frequent 2 --closed --generator test.dat

0 : 2

Custom Key Function

For both frequent and top-k algorithms, a custom key function key=lambda patt, matches: ... can be applied, where patt is the current pattern and matches is the current list of matching sequence (id, position) tuples.

  • In default, len(matches) is used denoting the frequency of current pattern.

  • Alternatively, any key function can be used. As an example, sum(len(db[i]) for i in matches) can be used to find the satisfying patterns according to the number of matched items.

  • For efficiency, an anti-monotone upper-bound function should also be specified for pruning.

    • If unspecified, the key function is also the upper-bound function, and must be anti-monotone.
print(ps.topk(5, key=lambda patt, matches: sum(len(db[i]) for i, _ in matches)))
# [(20, [1]),
#  (15, [2]),
#  (15, [1, 2]),
#  (10, [1, 3]),
#  (10, [1, 3, 4])]

Custom Filter Function

For both frequent and top-k algorithms, a custom filter function filter=lambda patt, matches: ... can be applied, where patt is the current pattern and matches is the current list of matching sequence (id, position) tuples.

  • In default, filter is not applied and all the patterns are returned.

  • Alternatively, any function can be used. As an example, matches[0][0] > 0 can be used to exclude the patterns covering the first sequence.

print(ps.topk(5, filter=lambda patt, matches: matches[0][0] > 0))
# [(2, [1, 1]),
#  (2, [1, 1, 1]),
#  (2, [1, 2, 2]),
#  (2, [2, 2]),
#  (1, [1, 2, 2, 0])]

Custom Callback Function

For both the frequent and the top-k algorithm, you can use a custom callback function callback=lambda patt, matches: ... instead of returning the normal results of patterns and their respective frequencies.

  • When callback function is specified, None is returned.

  • For large datasets, when mining frequent patterns, you can use callback function to process each pattern immediately, and avoid having a huge list of patterns consuming huge amount of memory.

  • The following example finds the longest frequent pattern covering each sequence.

coverage = [[] for i in range(len(db))]

def cover(patt, matches):
    for i, _ in matches:
        coverage[i] = max(coverage[i], patt, key=len)


ps.frequent(2, callback=cover)

print(coverage)
# [[1, 3, 4],
#  [1, 3, 4],
#  [1, 2, 2],
#  [1, 2, 2]]

Tip

I strongly encourage using PyPy instead of CPython to run the script for best performance. In my own experience, it is nearly 10 times faster in average. To start, you can install this package in a virtual environment created for PyPy.

Note that only the earlier version 0.4 works for the latest PyPy3 6.0.0 (compatible with Python 3.5.3). Please install it via pip3 install prefixspan==0.4. Latest version should work for the future PyPy3 (compatible with Python 3.6).

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].