Score-P Workflow

Module and job submission details are for ARCHER2.

Load required modules

Scalasca and Score-P are incompatible with the Cray performance profiling tools (perftools), so those modules must be unloaded first. Use module spider scalasca to find the most up-to-date version of Scalasca.

To load:

module unload perftools perftools-base  # or module unload perftools-lite perftools-base
module load other-software/1.0 scalasca/2.6.1-cray

Add scorep to Makefile

Prefix the compiler wrapper (e.g. ftn for Fortran on ARCHER2) with scorep --user, e.g. in a Makefile for Fortran:

PREP		= scorep --user
COMPILER        = $(PREP) ftn

# ...

$(COMPILER) $(COMPFLAGS) $(PROGDIR)main.F90

The executable is then built using make as usual.

At this point, it is useful to designate the directory containing the executable, any required input files and the submission script as Score-P-specific, and to duplicate it for each subsequent experiment, so that the outputs of different experiments are kept apart.

Add scorep measurements to submission script

Measurements are configured via environment variables. It is generally best to start with a summary measurement, which is the default when no environment variables are set. To list the available configuration variables: scorep-info config-vars --full

It is good practice to specify an experiment directory for Score-P output, e.g. export SCOREP_EXPERIMENT_DIRECTORY=scorep_sum.
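As a rough sketch, the Score-P part of a submission script for a first summary run might look like the following. The two SCOREP_ENABLE_* lines only restate Score-P's defaults and could be omitted; the executable name my_app is hypothetical, and the launch line is left as a comment because it depends on the job.

```shell
# Summary (profile) measurement: Score-P's default mode, set explicitly for clarity.
export SCOREP_ENABLE_PROFILING=true
export SCOREP_ENABLE_TRACING=false

# Keep the results of this run in a named experiment directory.
export SCOREP_EXPERIMENT_DIRECTORY=scorep_sum

# Launch the instrumented executable (hypothetical name; options are job-specific).
# srun ./my_app
```

Setting the directory name explicitly makes the later cube, square and scorep-score commands easier to write, since the path is known in advance.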

Submit job

sbatch <submission_script.sh>

View Summary Report

To create and view a summary analysis report, use the CUBE4 GUI: cube scorep_sum/profile.cubex

Alternatively, to get a hierarchy of metrics, use scalasca -examine, which can also be invoked through the alias square. In this case, pass the experiment directory rather than the cubex file as the argument: square scorep_sum

Filtering for Tracing Experiments

Filtering out, for example, frequently visited but quickly executed regions helps minimise the measurement overhead. Score-P provides information to aid in setting up filtering, and also in allocating memory for the experiment.

Summary analysis result scoring

scorep-score <experiment_directory>/profile.cubex

This summarises the buffer sizes required for various components, alongside their statistics: number of visits, total time, time per visit, and so on. In this report, “com” stands for “combined”.

The analysis can be broken down by function using the -r option.

Pay particular attention to the recommended setting for SCOREP_TOTAL_MEMORY.
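To pick the memory recommendation out of a long report, the output can be filtered with grep. This is a minimal sketch, assuming that the relevant report lines mention the variable name SCOREP_TOTAL_MEMORY literally; the helper name memory_hint is hypothetical.

```shell
# Read a scorep-score report on stdin and print only the lines
# that mention SCOREP_TOTAL_MEMORY (hypothetical helper).
memory_hint() {
    grep 'SCOREP_TOTAL_MEMORY'
}

# Usage, once a summary experiment exists:
# scorep-score scorep_sum/profile.cubex | memory_hint
# scorep-score -r scorep_sum/profile.cubex | memory_hint
```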

Filter Configuration

Filtering options are set in a text file, and can specify routines to include and exclude, e.g.

SCOREP_REGION_NAMES_BEGIN
EXCLUDE
    *_init*
    *set*
SCOREP_REGION_NAMES_END

To report on only specific routines:

SCOREP_REGION_NAMES_BEGIN
	EXCLUDE
		*
	INCLUDE
		target1*
		target2*
SCOREP_REGION_NAMES_END

Using wildcards in this way accounts for name decoration by compilers, for example an appended underscore.

The effect of this filter set can be previewed using scorep-score -f scorep.filt <experiment_directory>/profile.cubex, where scorep.filt is the file containing the filter specification. It predicts the resources required for the tracing experiment.
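The two steps, writing the filter file and previewing its effect, can be sketched as below. The file name scorep.filt and the directory scorep_sum follow the examples used in these notes. The preview is guarded so that the snippet does nothing harmful where Score-P is not loaded or no summary experiment exists yet.

```shell
# Write a filter file that excludes initialisation and setter-like routines.
cat > scorep.filt <<'EOF'
SCOREP_REGION_NAMES_BEGIN
EXCLUDE
    *_init*
    *set*
SCOREP_REGION_NAMES_END
EOF

# Preview the effect of the filter on an existing summary experiment, if available.
if command -v scorep-score >/dev/null 2>&1 && [ -f scorep_sum/profile.cubex ]; then
    scorep-score -f scorep.filt scorep_sum/profile.cubex
fi
```

Comparing this output with the unfiltered scorep-score report shows how much the filter reduces the estimated trace size and memory requirement.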

Apply filter to new summary experiment

Skip this step if filtering is not being used for the trace experiment.

Running a new summary experiment allows parameters to be checked before carrying out the tracing, and gives an indication of measurement overheads.

Set up or modify the job submission script to include the environment variables that will define the trace experiment. Using the estimate from scorep-score, it is best to set the memory allocation for the experiment explicitly, e.g. export SCOREP_TOTAL_MEMORY=375M

Measurements include any available PAPI events, e.g. export SCOREP_METRIC_PAPI=PAPI_L2_DCM,PAPI_L2_DCH,L2_PREFETCH_HIT_L2,L2_PREFETCH_HIT_L3 or “perf” metrics provided by Linux, e.g. export SCOREP_METRIC_PERF=L1-dcache-load-misses,L1-dcache-loads

Off-core counters such as cray_zenl3:::UNC_L3_CACHE_MISS[:ALL] appear to carry a very large overhead: in Leeds Spherical Dynamo testing, they slowed execution by a factor of about 80.

Apply the filter, e.g. export SCOREP_FILTERING_FILE=../config/scorep.filt

Analyse the results as previously.

Trace Experiment

Set up memory allocation and hardware performance counters as above, and add the following to the submission script: export SCOREP_ENABLE_TRACING=true
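Because a trace run with too little memory can be costly, a small sanity check before the launch line may be worthwhile. The following is a sketch only: the function name check_trace_env is hypothetical, and it accepts only the literal value true for SCOREP_ENABLE_TRACING, as used in these notes.

```shell
# Hypothetical pre-flight check for a tracing job script.
# Fails unless tracing is switched on and a memory allocation has been set.
check_trace_env() {
    if [ "${SCOREP_ENABLE_TRACING:-false}" != "true" ]; then
        echo "SCOREP_ENABLE_TRACING is not set to true" >&2
        return 1
    fi
    if [ -z "${SCOREP_TOTAL_MEMORY:-}" ]; then
        echo "SCOREP_TOTAL_MEMORY is not set" >&2
        return 1
    fi
    echo "tracing enabled with SCOREP_TOTAL_MEMORY=${SCOREP_TOTAL_MEMORY}"
}

# Usage in a submission script, before launching the application:
# check_trace_env || exit 1
```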

Instrumentation via Score-P API

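To target a specific region in the code, use the Score-P user macros. The source file must be run through the preprocessor (e.g. by using the .F90 extension) and instrumented with scorep --user: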

#include "scorep/SCOREP_User.inc"

subroutine foo
	! declaration
	SCOREP_USER_REGION_DEFINE( my_region_handle )
	
	SCOREP_USER_REGION_BEGIN( my_region_handle, "foo", SCOREP_USER_REGION_TYPE_COMMON )

	! region of interest
	
	SCOREP_USER_REGION_END( my_region_handle )
end subroutine foo

Scalasca Automated Trace Analysis

Preview Trace Analysis

Scalasca’s scan -n command can be used to verify correct setup of the experiment and suggest some configuration.

scan -n -v <executable> verifies that the required Score-P instrumentation is present (-v = verbose).

It can be used in conjunction with the srun command and its options (scan and its options come first), along with environment variables. It also checks that the requested experiment does not already exist.

Scalasca summary profile measurement

Set up job submission script with:

# Scalasca now needs to be loaded as part of job submission script, as scan is called as prefix to srun.
module -q load other-software/1.0
module load scalasca/2.6.1-cray

# Scalasca/Score-P measurement configuration
#export SCOREP_EXPERIMENT_DIRECTORY=scorep_pretrace_sum  # Scalasca can set automatically if not specified
export SCOREP_FILTERING_FILE=../config/scorep.filt  # Best to filter, to avoid excessive overhead
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_FP_OPS  # If required
#export SCOREP_TOTAL_MEMORY=100M  # Not needed for summary
#export SCOREP_ENABLE_TRACING=true  # False or commented out for summary
export SCAN_ANALYZE_OPTS="--time-correct --verbose"

# Launch application
scan srun <path to executable>

The --time-correct option accounts for any differences between the clocks of the compute nodes used for the job. It is advisable to use it routinely, since synchronised timestamps are necessary for Scalasca’s analyses.

scorep.log records the output sent to stdout and stderr.

Summary results

Examine the results using square. In the general tab at the right of the screen, the advisor can be brought up to look at efficiencies.

Use either scorep-score or square -s to check the configuration needed for the trace experiment.

Trace measurement

Edit or create a job script to enable tracing and adjust the memory requirement. Skip hardware performance counters unless absolutely necessary, as they can add a large overhead.

#export SCOREP_EXPERIMENT_DIRECTORY=scorep_trace
export SCOREP_FILTERING_FILE=../config/scorep.filt
#export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_TOT_CYC,PAPI_FP_OPS
export SCOREP_TOTAL_MEMORY=100M
export SCOREP_ENABLE_TRACING=true
export SCAN_ANALYZE_OPTS="--time-correct --verbose"

Submit job.

The Slurm output file will now show the Scalasca analysis (scout) at the end.

The traces subdirectory of the experiment directory contains the traces, as two files for each thread.

A scout.cubex file is now also produced.
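A quick way to confirm that the trace run and the automated analysis both completed is to check for these outputs. This is a minimal sketch, assuming the traces subdirectory and scout.cubex sit directly inside the experiment directory; the function name check_trace_outputs is hypothetical.

```shell
# Hypothetical helper: check that an experiment directory contains
# both the traces subdirectory and the scout.cubex analysis report.
check_trace_outputs() {
    expt_dir="$1"
    if [ ! -d "$expt_dir/traces" ]; then
        echo "missing $expt_dir/traces" >&2
        return 1
    fi
    if [ ! -f "$expt_dir/scout.cubex" ]; then
        echo "missing $expt_dir/scout.cubex" >&2
        return 1
    fi
    echo "trace outputs present in $expt_dir"
}

# Usage: check_trace_outputs <experiment directory>
```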

Trace results

Use square <experiment directory> to examine the breakdown of results.