P3ARSEC

This repository contains parallel patterns implementations of some applications contained in the PARSEC benchmark.

All the applications (except x264) have been implemented using the FastFlow pattern-based parallel programming framework. Some benchmarks have also been implemented with the SkePU2 framework, and others with the C++ Actor Framework (CAF). In the following table you can find more details about the pattern used for each benchmark and the file(s) containing the actual implementation, for FastFlow, SkePU2 and CAF. The pattern descriptions reported here are an approximation; exact descriptions will come later. Some benchmarks are implemented using different patterns (the bold pattern is the one used by default). To run a benchmark with a different pattern, refer to the specific section of this document.

Application Used Pattern FastFlow Files SkePU2 Files CAF Files
Blackscholes Map File 1 File 1 File 1
Bodytrack Maps File 1, File 2
Canneal Master-Worker File 1 File 1
Dedup Pipeline of Farms File 1
" Farm File 1
" Farm of Pipelines File 1
" Ordering Farm File 1
Facesim Maps File 1, File 2, File 3, File 4
Ferret Pipeline of Farms File 1 File 1
" Farm of Pipelines File 1 File 1
" Farm File 1
" Farm (Optimized) File 1 File 1
Fluidanimate Maps File 1
Freqmine Maps File 1
Raytrace Map File 1 File 1 File 1
Streamcluster Maps and MapReduce File 1 File 1
Swaptions Map File 1 File 1
Vips Farm File 1
x264 Not available.

These implementations have been engineered to be used with the standard PARSEC tools. Accordingly, you can use and evaluate the parallel pattern implementations together with the Pthreads, OpenMP and TBB versions already present in PARSEC. After following this guide, more details can be found on the PARSEC website.

Download

To download the latest version of P3ARSEC, run the following commands:

wget https://github.com/ParaGroup/p3arsec/archive/v1.0.tar.gz
tar -xvf v1.0.tar.gz 
cd p3arsec-1.0

Then, run:

./install.sh 

This command could take a few minutes to complete, since it downloads the original PARSEC implementations with all the input datasets (around 3GB) and all the needed dependencies.

You can specify the following parameters to the ./install.sh command:

  • --nomeasure: The infrastructure for measuring execution time and energy consumption will not be installed. If this parameter is not specified, you will be able to measure execution time and energy consumption for all the benchmarks (both for those implemented as parallel patterns and for those already present in PARSEC).
  • --fast: Only the PARSEC source code (112MB) and some small test inputs will be downloaded. You can download the input datasets later by running ./install.sh --inputs (see the example after this list).
  • --inputs: Only the PARSEC input files will be downloaded. It should be used only if ./install.sh --fast has already been run.
  • --skeputools: Compiles and installs the SkePU2 source-to-source compiler. This is not mandatory; you only need it if you want to modify the *_skepu.cpp files. This parameter is mainly intended for developers.
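For example, a two-step setup that first fetches only the sources and later adds the full input datasets could look like this (using the flags described above):

./install.sh --fast      # sources (112MB) and small test inputs only
./install.sh --inputs    # later: download the full input datasets (around 3GB)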

Compile

For PARSEC to work properly, some dependencies need to be installed. On Ubuntu systems, you can install them with the following command:

sudo apt-get install git build-essential m4 x11proto-xext-dev libglu1-mesa-dev libxi-dev libxmu-dev libtbb-dev libssl-dev

For Arch Linux, run the following:

sudo pacman -Sy git m4 xorgproto glu libxi libxmu intel-tbb openssl

Similar packages can be found for other Linux distributions.

After that, move into the bin directory, from which the benchmarks you are interested in can be built and installed:

cd bin

The parallel pattern versions of the benchmarks have been integrated with the original PARSEC management system (./parsecmgmt). You can find the full documentation on the PARSEC website or in the README_PARSEC file, which will appear in the directory after the previous commands have been run.

To compile the parallel pattern version of a specific benchmark, it is sufficient to run the following command:

./parsecmgmt -a build -p [BenchmarkName] -c gcc-ff

If you also want to compile the other existing versions of the benchmark, just replace gcc-ff with one of the following:

  • gcc-skepu for the SkePU2 parallel pattern-based implementation.
  • gcc-pthreads for the Pthreads implementation.
  • gcc-openmp for the OpenMP implementation.
  • gcc-tbb for the Intel TBB implementation.
  • gcc-caf for the CAF implementation.

Note that not all these implementations are available for all the benchmarks. For more details on supported implementations, please refer to the original PARSEC documentation (and to the top table in this file for the SkePU2 and FastFlow versions).
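For example, to build both the FastFlow and the Pthreads versions of blackscholes (picked here only as an illustration), you would run:

./parsecmgmt -a build -p blackscholes -c gcc-ff
./parsecmgmt -a build -p blackscholes -c gcc-pthreads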

ATTENTION: If you plan to execute a benchmark with more than 1024 threads, you need to modify the following macros (a sketch of the change follows this list):

  • MAX_THREADS in pkgs/apps/blackscholes/src/c.m4.pthreads file.
  • MAX_NUM_THREADS in pkgs/libs/fastflow/ff/config.hpp file.
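A minimal sketch of this change with sed, assuming both macros are plain numeric #define directives in those files (check the actual definitions before editing, since their exact form may differ):

# Raise both limits to 4096 (the target value here is arbitrary).
sed -i 's/\(#define MAX_THREADS\) *[0-9][0-9]*/\1 4096/' pkgs/apps/blackscholes/src/c.m4.pthreads
sed -i 's/\(#define MAX_NUM_THREADS\) *[0-9][0-9]*/\1 4096/' pkgs/libs/fastflow/ff/config.hpp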

Run

Once you have compiled a benchmark, you can run it with:

./parsecmgmt -a run -p [BenchmarkName] -c gcc-ff -n [ConcurrencyLevel]

As for compilation, you can run the other existing versions by replacing gcc-ff with the name of the desired version. By default, the program is run on a test input. PARSEC provides different input datasets: test, simsmall, simmedium, simlarge, simdev and native. The native input set is the one resembling a real execution scenario, while the others should be used for testing/simulation purposes. The input set can be selected with the -i parameter. For example, to run the parallel pattern implementation of the Canneal benchmark on the native input set:

./parsecmgmt -a run -p canneal -c gcc-ff -i native

All the datasets are present if you ran ./install.sh (or ./install.sh --fast plus ./install.sh --inputs).

ConcurrencyLevel has the same meaning as in the original PARSEC benchmarks: it is the minimum number of threads that will be activated by the application. Accordingly, we have the following values:

  • blackscholes: n+1 threads.
  • canneal: n+1 threads.
  • dedup: n threads for each pipeline stage (3n + 3 threads). (For the pipe of farms version.)
  • ferret: n threads for each pipeline stage (4n + 4 threads). (For the pipe of farms version.)
  • swaptions: n+1 threads.

For instance, running the pipeline of farms version of ferret with -n 4 activates 4·4 + 4 = 20 threads. Some parallel pattern implementations may not follow this rule; for example, the ordered farm implementation of the dedup benchmark activates n+2 threads.

Measuring time and energy consumption

If you want to measure the energy consumption of the benchmarks (and you did not specify the --nomeasure parameter in the ./install.sh script), please run the benchmarks with sudo. In this case, in the output of the program you will find something like:

sudo ./parsecmgmt -a run -p canneal -c gcc-ff -i native
...
roi.time|12.3
roi.joules|TYPE|456.7
...

Here, 12.3 is the execution time (in seconds) and 456.7 is the energy consumption (in joules). Both values consider only the time and energy spent in the Region Of Interest (ROI), i.e. the parallel part of the application, excluding initialisation and cleanup phases (e.g. loading a dataset from disk into main memory). This approach is commonly used in the scientific literature to evaluate PARSEC behaviour.

Energy measurements are provided through the Mammut library. The meaning of the energy consumption value depends on the type of energy counters available on the running architecture. TYPE can be one of the following:

  • CPUS: In this case, 4 values will be printed, TAB-separated (e.g. roi.joules|CPUS|400 300 0 20).

    • The first value represents the energy consumed by all the CPUs/Sockets on the machine.
    • The second value represents the energy consumed by only the cores on the CPUs.
    • The third value represents the energy consumed by the DRAM controllers. This counter may not be available on some architectures. In this case, 0 is printed.
    • The fourth value is architecture dependent. In general it represents the energy consumed by the integrated graphic card. This counter may not be available on some architectures. In this case, 0 is printed.

    This counter is available on newer Intel architectures (Silvermont, Broadwell, Haswell, Ivy Bridge, Sandy Bridge, Skylake, Xeon Phi KNL). If you need more detailed measurements (e.g. separating the consumption of individual sockets), please contact us.

  • PLUG: In this case only one value will be printed, corresponding to the total energy consumption of the machine (measured at the power plug level). This counter is available on:

    • Architectures using a SmartPower.
    • IBM Power8 machines. This support is still experimental. If you need to use it, please contact us.

If energy counters are not present, only execution time will be printed.
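If you want to post-process these values in a benchmarking script, the following sketch extracts them, assuming the roi.* lines appear on the standard output of parsecmgmt as shown above (canneal_native.log is just a local file created here with tee):

sudo ./parsecmgmt -a run -p canneal -c gcc-ff -i native | tee canneal_native.log
grep '^roi.time' canneal_native.log | cut -d'|' -f2     # execution time (seconds)
grep '^roi.joules' canneal_native.log | cut -d'|' -f3   # energy value(s); TAB-separated for CPUS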

Run alternative versions

Some applications (e.g. ferret and dedup) have been implemented according to different pattern compositions. To run a version different from the default one, you first need to remove the existing one (if present). To do so, execute:

./parsecmgmt -a fullclean -c gcc-ff -p [BenchmarkName]
./parsecmgmt -a fulluninstall -c gcc-ff -p [BenchmarkName]

To compile and run the other versions, please refer to the following sections.

Dedup

At line 33 of the Makefile, replace encoder_ff_pipeoffarms.o with:

  • encoder_ff_farm.o if you want to run the farm version.
  • encoder_ff_pipeoffarms.o if you want to run the pipeline of farms version.
  • encoder_ff_farmofpipes.o if you want to run the farm of pipelines version.
  • encoder_ff_ofarm.o if you want to run the ordered farm version.

After that, build and run dedup as usual.
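If you prefer not to edit the Makefile by hand, a sed one-liner restricted to line 33 can perform the substitution. This is only a sketch: it assumes the dedup Makefile is located at pkgs/kernels/dedup/src/Makefile (adjust the path to your tree) and switches to the farm version.

# Replace the default pipeline of farms object with the farm one on line 33.
sed -i '33s/encoder_ff_pipeoffarms\.o/encoder_ff_farm.o/' pkgs/kernels/dedup/src/Makefile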

Ferret

At line 78 of the Makefile, replace ferret-ff-pipeoffarms with:

  • ferret-ff-farm if you want to run the farm version.
  • ferret-ff-farm-optimized if you want to run the farm (optimized) version.
  • ferret-ff-farmofpipes if you want to run the farm of pipelines version.
  • ferret-ff-pipeoffarms if you want to run the pipeline of farms version.

After that, build and run ferret as usual.
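The same kind of line-restricted substitution works here; the sketch below assumes the ferret Makefile is located at pkgs/apps/ferret/src/Makefile.

# Replace the default pipeline of farms target with the optimized farm one on line 78.
sed -i '78s/ferret-ff-pipeoffarms/ferret-ff-farm-optimized/' pkgs/apps/ferret/src/Makefile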

Enforcing performance and power consumption objectives

It is possible to specify requirements on performance (throughput or execution time) and/or on power and energy consumption for all the benchmarks. We provide this possibility by exploiting dynamic reconfiguration of the applications, relying on the Nornir runtime. The runtime will automatically change the number of cores allocated to the application and their clock frequency. To exploit this possibility, you need to put an XML file (called parameters.xml) in the p3arsec root directory, containing the requirements in terms of performance and power consumption. The XML file must have the following format:

<?xml version="1.0" encoding="UTF-8"?>
<nornirParameters>
    <requirements>
        <throughput>100</throughput>
        <powerConsumption>MIN</powerConsumption> 
    </requirements>
</nornirParameters>

In this specific example, we require the application to have a throughput greater than 100 iterations per second. Moreover, since many configurations may provide such a throughput, we require Nornir to choose the configuration with the lowest power consumption among those with a feasible throughput. For more details about the type of parameters that can be specified, please refer to the Nornir documentation. The meaning of iteration (i.e. the way in which we measure the throughput) is application-specific. In the following table we show what we mean by iteration for each benchmark application:

Application Iteration
Blackscholes 1 Stock Option
Bodytrack 1 Frame
Canneal 1 Move
Dedup 1 Chunk
Facesim 1 Frame
Ferret 1 Query
Fluidanimate 1 Frame
Freqmine 1 Call of the FP_growth function
Raytrace 1 Frame
Streamcluster 1 Evaluation for opening a new center
Swaptions 1 Simulation
Vips 1 Image Tile
x264 1 Frame

For example, the XML file shown above would require Blackscholes to process at least 100 stock options per second.

If you want to compile/run applications with dynamic reconfiguration enabled, use the following configurations (to be specified through the -c parameter):

  • gcc-ff-nornir for the FastFlow implementation.
  • gcc-pthreads-nornir for the Pthreads implementation.
  • gcc-openmp-nornir for the OpenMP implementation.
  • gcc-tbb-nornir for the Intel TBB implementation.

ATTENTION: To run the gcc-*-nornir configurations, sudo rights are required, since we need to perform some high-privilege operations such as reading the power consumption, dynamically scaling the clock frequency, etc.
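Putting it together, a possible session for blackscholes under Nornir control (blackscholes and the throughput requirement are just the example used above; parameters.xml must already be in the p3arsec root directory):

./parsecmgmt -a build -p blackscholes -c gcc-ff-nornir
sudo ./parsecmgmt -a run -p blackscholes -c gcc-ff-nornir -i native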

How to Cite

The structure and modelling of the applications are described in the following paper:

@article{10.1145/3132710,
 author = {De Sensi, Daniele and De Matteis, Tiziano and Torquati, Massimo and Mencagli, Gabriele and Danelutto, Marco},
 title = {Bringing Parallel Patterns Out of the Corner: The P3ARSEC Benchmark Suite},
 year = {2017},
 issue_date = {December 2017},
 publisher = {Association for Computing Machinery},
 address = {New York, NY, USA},
 volume = {14},
 number = {4},
 issn = {1544-3566},
 url = {https://doi.org/10.1145/3132710},
 doi = {10.1145/3132710},
 journal = {ACM Trans. Archit. Code Optim.},
 month = {oct},
 articleno = {33},
 numpages = {26},
 keywords = {multicore programming, parsec, Parallel patterns, algorithmic skeletons, benchmarking}
}

Release v1.0 was used in the paper.

Contributors

P3ARSEC has been developed by Daniele De Sensi and Tiziano De Matteis.
