All Projects → jMotif → GI

jMotif / GI

Licence: GPL-2.0 license
Sequitur and RePair grammar induction algorithms implementation

Programming Languages

java
68154 projects - #9 most used programming language
r
7636 projects
shell
77523 projects
HTML
75241 projects

Projects that are alternatives of or similar to GI

clp
Compressed Log Processor (CLP) is a free tool capable of compressing text logs and searching the compressed logs without decompression.
Stars: ✭ 49 (+145%)
Mutual labels:  compression
box
Box - Open Standard Archive Format, a zip killer.
Stars: ✭ 38 (+90%)
Mutual labels:  compression
django-brotli
Django middleware that compresses response using brotli algorithm.
Stars: ✭ 16 (-20%)
Mutual labels:  compression
JFileSync3
File Syncing with encryption and compression (partly) compatible with encfs / boxcryptor (classic) volumes for local folders and WebDAV backends. Based on JFileSync - hence the name.
Stars: ✭ 20 (+0%)
Mutual labels:  compression
x-compressor
x – minimalist data compressor
Stars: ✭ 42 (+110%)
Mutual labels:  compression
rust-huffman-compress
A Rust library for Huffman compression given a propability distribution over arbitrary symbols
Stars: ✭ 18 (-10%)
Mutual labels:  compression
Perfect-Zip
Perfect Zip compression utility.
Stars: ✭ 20 (+0%)
Mutual labels:  compression
QmapCompression
Official implementation of "Variable-Rate Deep Image Compression through Spatially-Adaptive Feature Transform", ICCV 2021
Stars: ✭ 27 (+35%)
Mutual labels:  compression
rocketjob
Ruby's missing background and batch processing system
Stars: ✭ 281 (+1305%)
Mutual labels:  compression
react-native-compressor
The lightweight library for compress image, video, and audio with an awesome experience
Stars: ✭ 157 (+685%)
Mutual labels:  compression
ss21
a fresh attempt at a 4chan userstyle
Stars: ✭ 34 (+70%)
Mutual labels:  sax
upx
Node.js cross-platform wrapper for UPX - the ultimate packer for eXecutables.
Stars: ✭ 27 (+35%)
Mutual labels:  compression
ndzip
A High-Throughput Parallel Lossless Compressor for Scientific Data
Stars: ✭ 19 (-5%)
Mutual labels:  compression
tiny
compress data for better performance
Stars: ✭ 21 (+5%)
Mutual labels:  compression
sanic compress
An extension which allows you to easily compress your Sanic responses with gzip.
Stars: ✭ 26 (+30%)
Mutual labels:  compression
srcset.sh
A command line script that generates multiple responsive versions of an image at width breakpoints -- 320,480,640,768,960,1024,1280,1440 pixels wide -- that match common Mobile and widescreen desktop/laptop viewports using Imagemagick's convert utility and outputs the needed <img/> tag
Stars: ✭ 20 (+0%)
Mutual labels:  compression
deflate-rs
An implementation of a DEFLATE encoder in rust
Stars: ✭ 47 (+135%)
Mutual labels:  compression
acid-store
A library for secure, deduplicated, transactional, and verifiable data storage
Stars: ✭ 48 (+140%)
Mutual labels:  compression
ngx-image-compress
Angular library for uploading and compressing images
Stars: ✭ 65 (+225%)
Mutual labels:  compression
client-compress
A JavaScript based in-browser client-side image compression library
Stars: ✭ 32 (+60%)
Mutual labels:  compression

Sequitur and RePair Grammatical Inference for time series pattern mining

maven build codecov.io Maven Central License

SonarCloud

Implements Sequtur (online) and Re-Pair (off-line) grammar induction algorithms for Grammarviz 2.0 and SAX-VSM-G. This code is released under GPL v.2.0.

More about implemented algorithms:

[1] Nevill-Manning, C.G. and Witten, I.H., "Identifying Hierarchical Structure in Sequences: A linear-time algorithm", Journal of Artificial Intelligence Research, 7, 67-82, (1997).

[2] Larsson, N.J.; Moffat, A., "Offline dictionary-based compression", Data Compression Conference, 1999. Proceedings. DCC '99 , vol., no., pp.296,305, 29-31 Mar 1999.

Citing this work:

If you are using this implementation for you academic work, please cite our Grammarviz 2.0 paper:

[Citation] Senin, P., Lin, J., Wang, X., Oates, T., Gandhi, S., Boedihardjo, A.P., Chen, C., Frankenstein, S., Lerner, M., GrammarViz 2.0: a tool for grammar-based pattern discovery in time series, ECML/PKDD Conference, 2014.

1.0 Building

The code is written in Java and I use maven to build it:

$ mvn package
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Building GI
[INFO]    task-segment: [package]
...
[INFO] Building jar: /media/Stock/git/jmotif-GI.git/target/jmotif-gi-0.3.1-SNAPSHOT.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------

2.0 Sequitur API use

Following the original Eibe Frank's java implementation the code is built using global (static) variables:

String TEST3_STRING = "a b a b c a b c d a b c d e a b c d e f";

SAXRule r = SequiturFactory.runSequitur(TEST3_STRING);

System.out.println(SAXRule.printRules());

which prints the following output:

Number	Name	Level	Occurr.	Usage	Yield	Rule str	Expaneded	Indexes
0	R0	0	0	0	0	R1 R2 R3 R4 R4 f 	a b a b c a b c d a b c d e a b c d e f	[]
1	R1	1	5	2	2	a b 	a b 	[0, 2, 5, 9, 14]
2	R2	1	4	2	3	R1 c 	a b c 	[2, 5, 9, 14]
3	R3	1	3	2	4	R2 d 	a b c d 	[5, 9, 14]
4	R4	1	2	2	5	R3 e 	a b c d e 	[9, 14]

My own addition allows to retrieve the Sequitur rules as an iterable collection of GrammaRuleRecords and to map them back to the discretized time series:

GrammarRules rules = r.toGrammarRulesData();
GrammarRuleRecord rec = rules.get(4);
ArrayList<RuleInterval> intervals = rec.getRuleIntervals();
...

3.0 RePair API use

I've implemented RePair from scratch and it uses the same GrammaRules / GrammaRuleRecord data structures as for Sequitur, so it can be plugged into Grammarviz seamlessly:

String TEST_STRING = "abc abc cba XXX abc abc cba";

RePairGrammar rg = RePairFactory.buildGrammar(TEST_STRING);

System.out.println(rg.toGrammarRules());

which yields:

R0 -> R2 XXX R2 
    R1 -> abc cba  : abc cba, [1, 5]
    R2 -> abc R1  : abc abc cba, [0, 4]

Thanks to the algorithm's design, I was able to parallelize RePair. However, the cost of inter-tread communications and synchronization was the majot showstopper, so the current new implementation (>0.8.5) is single-threaded (but you can still get the parallel one -- it is tagged "old_repair" in the version control).

4.0 Performance comparison

The both implemented GI algorithms, Sequitur and RePair, demonstrate a somewhat similar performance with minor differnces. Specifically:

  • Sequitur implementation is slower than RePair
  • Sequitur tends to produce more rules, but Sequitur rules are less frequent than RePair rules
  • Sequitur rule-corresponding subsequences vary in length more
  • Sequitur rules usually cover more points than RePair
  • Sequitur rule coverage depth is lower than that of RePair

All these may affect the performance of the upstream time series analysis algorithms such as SAX-VSM-G, Grammarviz, and RRA. Here is the table with some numbers collected by running Sequitur and RePair using sliding window of size 150, PAA 6, and the alphabet 4. I used the EXACT numerosity reduction in these runs.

DatasetSizeSequiturRePair
rulestimecov.dpthmax.freq.rulestimecov.dpthmax.freq.
Daily commute17175292812.845362418.353
Dutch power demand350409163826.61247691429.6162
ECG 0606230067418.41174137.414
ECG 108216005391118.344472920.845
ECG 1515000279919.258239525.771
ECG 30053697610178445834.29807649204835.71673
ECG 3085400131413.214143122.115
ECG 3185860867113223427.814225112143529.12942
Insect186676321917.1255841018.332
Respiration, NPRS 4340008813326.5298131227.545
Respiration, NPRS 442412511896628.14010571728.961
TEK145000205527.478237332.6130
TEK165000181425.8100210231.9157
TEK175000190726.5190208232208
Video dataset112512851116.929301721.830
Winding250070310.65225133.55

Made with Aloha!

Made with Aloha!

Versions:

1.0.1

  • Optimized rule pruning algorithm
  • GrammarRules, GrammarRuleRecord, and RuleInterval implement Serializable

1.0.0

0.8.6

  • pre-1.0 release with improved RePair implementation.

0.0.1 - 0.8.5

  • initial code development, parallel repair implementation lifecycle.
Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].