zakimjz / IBMGenerator

Licence: other
IBM Synthetic Data Generator for Itemsets and Sequences

IBMGenerator

Type make to build the executable file 'gen'.

Type ./gen -help for general help.

For itemsets, type ./gen lit -help. For sequences, type ./gen seq -help.

Itemset Datasets

These datasets mimic transactions in a retail environment, where people tend to buy sets of items together; each such set is a so-called potentially maximal frequent itemset. The sizes of these maximal itemsets are clustered around a mean, with a few long itemsets. A transaction may contain one or more of these frequent itemsets. Transaction sizes are also clustered around a mean, though a few transactions may contain many items.

Let D denote the number of transactions, T the average transaction size, I the average size of a maximal potentially frequent itemset, L the number of maximal potentially frequent itemsets, and N the number of items. The data is generated using the following procedure. We first generate L maximal itemsets of average size I by choosing from the N items. We next generate D transactions of average size T by choosing from the L maximal itemsets.
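The two-step procedure above can be sketched in a few lines of Python. This is only an illustration, not the generator's actual code: the function names are hypothetical, sizes are assumed to be Poisson-distributed around their means, patterns are chosen uniformly, and the correlation and confidence options are ignored.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed integer (Knuth's method; fine for small means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def generate_itemsets(D, T, I, L, N, seed=0):
    """Return (patterns, transactions) following the two-step procedure."""
    rng = random.Random(seed)

    # Step 1: L maximal itemsets whose sizes cluster around I,
    # each drawn without replacement from the N items.
    patterns = []
    for _ in range(L):
        size = min(N, max(1, poisson(rng, I)))
        patterns.append(sorted(rng.sample(range(N), size)))

    # Step 2: D transactions whose sizes cluster around T. Each one merges
    # randomly chosen maximal itemsets until the target size is reached
    # (it may overshoot slightly). The attempt limit guarantees termination.
    transactions = []
    for _ in range(D):
        target = max(1, poisson(rng, T))
        chosen = set()
        for _ in range(10 * target):
            if len(chosen) >= target:
                break
            chosen.update(rng.choice(patterns))
        transactions.append(sorted(chosen))
    return patterns, transactions
```

For example, generate_itemsets(D=1000, T=10, I=4, L=100, N=1000) returns 100 patterns and 1000 transactions, each a sorted list of item ids.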

Type: ./gen lit -help

for all the parameters to generate itemset datasets:

Command Line Options:

-ncust number_of_customers (in 1000's) (default: 100)

-slen avg_trans_per_customer (default: 10)

-tlen avg_items_per_transaction (default: 2.5)

-nitems number_of_different_items (in '000s) (default: 10000)

-rept repetition-level (default: 0)

-seq.npats number_of_seq_patterns (default: 5000)

-seq.patlen avg_length_of_maximal_pattern (default: 4)

-seq.corr correlation_between_patterns (default: 0.25)

-seq.conf avg_confidence_in_a_rule (default: 0.75)

-lit.npats number_of_patterns (default: 25000)

-lit.patlen avg_length_of_maximal_pattern (default: 1.25)

-lit.corr correlation_between_patterns (default: 0.25)

-lit.conf avg_confidence_in_a_rule (default: 0.75)

-fname (write to filename.data and filename.pat)

-ascii (Write data in ASCII format; default: False)

-version (to print out version info)

An example run can be:

./gen lit -ntrans 100 -tlen 10 -nitems 1 -npats 1000 -patlen 4 -fname T10I4D100K -ascii

This will generate a data file named "T10I4D100K.data". In fact, it generates three files:

[fname].data -- the actual data file

[fname].conf -- configuration info

[fname].pat -- the embedded patterns
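The file name in the example appears to encode the generation parameters: T10 for -tlen 10, I4 for -patlen 4, and D100K for -ntrans 100 (in thousands). The helper below is a hypothetical convenience, not part of the generator. It assumes names are a run of single capital letters, each followed by a number and an optional K suffix meaning thousands.

```python
import re


def parse_dataset_name(name):
    """Split a name such as 'T10I4D100K' into a {letter: value} mapping."""
    params = {}
    # Each field: one capital letter, a decimal number, and an optional 'K'.
    for letter, number, kilo in re.findall(r"([A-Z])(\d+(?:\.\d+)?)(K?)", name):
        params[letter] = float(number) * (1000 if kilo else 1)
    return params


print(parse_dataset_name("T10I4D100K"))
# {'T': 10.0, 'I': 4.0, 'D': 100000.0}
```

The same function also handles the longer names used for sequence datasets, such as C10T2.5S4I1.25D200K.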

Data Format

The generated file has the following format. Each line contains:

TID TID NITEMS ITEMSET

where TID is a transaction identifier, NITEMS is the number of items in that transaction, and ITEMSET is the set of items making up that transaction. All ITEMSETS are sorted lexicographically. Note that TID is repeated for consistency with the sequence generator.
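A line in this format can be read back with a short Python function. This sketch assumes the file was written with -ascii and that all fields are whitespace-separated integers; the function name and the sample line are made up for illustration.

```python
def parse_itemset_line(line):
    """Parse one 'TID TID NITEMS ITEMSET' line into (tid, items)."""
    fields = [int(field) for field in line.split()]
    tid, tid_repeat, nitems = fields[:3]
    items = fields[3:]
    # The TID appears twice; both copies should agree.
    if tid != tid_repeat:
        raise ValueError(f"TID mismatch: {tid} != {tid_repeat}")
    # NITEMS must match the number of items that follow it.
    if len(items) != nitems:
        raise ValueError(f"expected {nitems} items, found {len(items)}")
    return tid, items


# Hypothetical line: transaction 7 containing the three items 2, 15 and 40.
print(parse_itemset_line("7 7 3 2 15 40"))
# (7, [2, 15, 40])
```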

Sequence Datasets

The generator produces sequence datasets that mimic real-world transactions, where people buy a sequence of sets of items. Some customers may buy only some items from the sequences, or they may buy items from multiple sequences. The input-sequence sizes and event sizes are each clustered around a mean, with a few of them containing many elements.

The datasets are generated using the following process. First, NI maximal events of average size I are generated by choosing from N items. Then, NS maximal sequences of average size S are created by assigning events from the NI maximal events to each sequence. Next, a customer (or input-sequence) with an average of C transactions (or events) is created, and sequences from the NS maximal sequences are assigned to different customer elements, respecting the average transaction size T. Generation stops when D input-sequences have been generated. Default values are NS = 5000, NI = 25000 and N = 10000.
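The process above can be illustrated with a much-simplified Python sketch. It is not the generator's actual algorithm: the function names are hypothetical, sizes are assumed to be Poisson-distributed, patterns are chosen uniformly, and the correlation, confidence and repetition options are ignored. The rule used to spread a pattern over a customer's transactions is also an assumption.

```python
import math
import random


def size_around(rng, mean):
    """Positive integer clustered around `mean` (assumed Poisson, Knuth's method)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return max(1, k)
        k += 1


def generate_sequences(D, C, T, S, I, NS, NI, N, seed=0):
    """Return D customers, each a list of transactions (sorted item lists)."""
    rng = random.Random(seed)

    # NI maximal events of average size I, chosen from the N items.
    events = [sorted(rng.sample(range(N), min(N, size_around(rng, I))))
              for _ in range(NI)]

    # NS maximal sequences, each an ordered list of about S events.
    patterns = [[rng.choice(events) for _ in range(size_around(rng, S))]
                for _ in range(NS)]

    customers = []
    for _ in range(D):
        n_trans = size_around(rng, C)
        transactions = [set() for _ in range(n_trans)]
        budget = T * n_trans  # total items needed for an average size of T
        for _ in range(20 * n_trans):  # attempt limit guarantees termination
            if sum(len(t) for t in transactions) >= budget:
                break
            pattern = rng.choice(patterns)
            # Place the pattern's events into increasing transaction slots so
            # their order is preserved; long patterns are truncated.
            k = min(len(pattern), n_trans)
            slots = sorted(rng.sample(range(n_trans), k))
            for slot, event in zip(slots, pattern):
                transactions[slot].update(event)
        # Transactions that received no event are dropped.
        customers.append([sorted(t) for t in transactions if t])
    return customers
```

Calling generate_sequences(D=200, C=10, T=2.5, S=4, I=1.25, NS=50, NI=250, N=1000) yields 200 customers whose transactions can be written out in the data format described below.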

Type: ./gen seq -help

for all the parameters to generate sequence datasets:

Command Line Options:

-ncust number_of_customers (in 1000's) (default: 100)

-slen avg_trans_per_customer (default: 10)

-tlen avg_items_per_transaction (default: 2.5)

-nitems number_of_different_items (in '000s) (default: 10000)

-rept repetition-level (default: 0)

-seq.npats number_of_seq_patterns (default: 5000)

-seq.patlen avg_length_of_maximal_pattern (default: 4)

-seq.corr correlation_between_patterns (default: 0.25)

-seq.conf avg_confidence_in_a_rule (default: 0.75)

-lit.npats number_of_patterns (default: 25000)

-lit.patlen avg_length_of_maximal_pattern (default: 1.25)

-lit.corr correlation_between_patterns (default: 0.25)

-lit.conf avg_confidence_in_a_rule (default: 0.75)

-fname (write to filename.data and filename.pat)

-ascii (Write data in ASCII format; default: False)

-version (to print out version info)

An example run can be:

./gen seq -ncust 200 -fname C10T2.5S4I1.25D200K -ascii

This will generate a data file named "C10T2.5S4I1.25D200K.data". In fact, it generates four files:

[fname].data -- the actual data file

[fname].conf -- configuration info

[fname].pat -- the embedded patterns

[fname].ntpc -- info on number of trans per customer (ignore this file)

Data Format

The generated file has the following format. Each line contains:

SID TID NITEMS ITEMSET

where SID is the sequence identifier, TID is a transaction/event identifier, NITEMS is the number of items in that transaction, and ITEMSET is the set of items making up that transaction. The TIDs for an SID are listed in temporal order, i.e., TIDs are event ids within that sequence. All ITEMSETS are also sorted lexicographically.
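To work with the sequence data, the lines can be grouped by SID. The sketch below assumes -ascii output with whitespace-separated integer fields; the function name and the sample lines are hypothetical.

```python
from collections import defaultdict


def read_sequences(lines):
    """Group 'SID TID NITEMS ITEMSET' lines into {sid: [(tid, items), ...]}."""
    sequences = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        sid, tid, nitems, *items = map(int, line.split())
        if len(items) != nitems:
            raise ValueError(f"SID {sid}, TID {tid}: expected {nitems} items")
        sequences[sid].append((tid, items))
    # Keep each customer's events in temporal (TID) order.
    for events in sequences.values():
        events.sort(key=lambda event: event[0])
    return dict(sequences)


# Hypothetical data: customer 1 has two events, customer 2 has one.
sample = ["1 1 2 3 8", "1 2 1 5", "2 1 3 1 4 9"]
print(read_sequences(sample))
# {1: [(1, [3, 8]), (2, [5])], 2: [(1, [1, 4, 9])]}
```

To read a generated file, pass an open file object, for example read_sequences(open("C10T2.5S4I1.25D200K.data")).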
