Introduction
Intel's VTune Amplifier is a commercial application for performance analysis on 32 and 64-bit x86 based machines, and has both GUI and command-line interfaces. Basic features work on both Intel and AMD hardware, while advanced hardware-based sampling requires an Intel-manufactured CPU.
VTune Amplifier assists in various kinds of code profiling including:
-
stack sampling
-
thread profiling
-
hardware event sampling
The profiler results consist of details such as time spent in each subroutine which can be drilled down to the instruction level.
Installation steps
Download and untar the VTune tarball from Intel’s website:
tar zxf vtune_amplifier_xe_2013_update13.tar.gz && cd vtune_amplifier_xe_2013_update13
In the program’s directory there exist two installer files
- install.sh (CLI installer)
- install-GUI.sh (launches a Qt-based GUI)
The CLI installer can be started as:
./install.sh
The following menu will be presented:
Select option 6 “Installation”. A prerequisites test will be run to check if your system meets the minimum requirements on installed hardware and software libraries. After this step the EULA will be shown.
After the user accepts the license agreement the “Activation” menu, will be presented where the user can choose how to activate the program. The user can activate either by using a product key, a license file or a license manager.
After that a short menu will be shown asking the user whether he wants to participate in the Intel® Software Improvement Program. After that the Pre-installation summary will be shown:
Select 2 to customize the installation options:
Change the installation location to be in the /cm/shared/apps NFS share:
Advanced options about building the drivers can be selected by choosing option 4:
Options related to building the drivers (e.g. selecting compiler, kernel version etc) can be set by choosing option 8 in the previous menu.
The quickest method to install VTune is to install it in /cm/shared and install the drivers as well as the init script on a node running the desired software image. Once the drivers are built the changes in the node’s local disk can be synced to the image. If that is not possible/desirable the drivers can be built manually as we will discuss later.
Disable the NMI Watchdog
The Non Maskable Interrupt (NMI) Watchdog can be used in Linux kernel to periodically detect if the CPU is locked. When CPU-locking occurs, NMI Watchdog service will 1) print debug info 2)reboot the system, sometimes. However NMI Watchdog needs to use hardware performance count, so other performance tools including VTune Amplifier XE 2013 can’t use PMU event-based sampling data collection.
The NMI watchdog can be disabled in two ways:
1. Add argument “nmi_watchdog=0” to the GRUB commandline then reboot he system:
cmsh -c "softwareimage use default-image; set kernelparameters nmi_watchdog=0; commit"
2. Disable NMI_Watchdog, on a running system:
echo 0 > /proc/sys/kernel/nmi_watchdog
This change can be made permanent with:
echo “kernel.nmi_watchdog=0” >> /cm/images/<software image>/etc/sysctl.conf
Setup the Environment
echo “source /cm/shared/apps/vtune/vtune_amplifier_xe/amplxevars.sh” > /cm/images/<your image>/etc/profile.d/vtune.sh && chmod 755 /cm/images/<your image>/etc/profile.d/vtune.sh
The following message should be printed in the user’s shell:
Copyright (C) 2009-2013 Intel Corporation. All rights reserved.
Intel(R) VTune(TM) Amplifier XE 2013 (build 313935)
Manually building the drivers
Switch to the following directory:
/cm/shared/apps/vtune/vtune_amplifier_xe/<driver>/src
where driver is either sepdk or powerdk depending on whether you intend to build the SEP or Power driver.
Run the build-driver script:
./build-driver <options>
Useful options:
--install-dir=path
"path" is an existing, writable directory where the
driver will be copied after it is successfully built;
this defaults to "/cm/shared/apps/vtune/vtune_amplifier_xe/sepdk/src"
--kernel-version=version
"version" is version string of kernel that should
be used for checksum or for building the driver;
this defaults to "2.6.32-358.el6.x86_64"
--c-compiler=c_compiler
"c_compiler" is the C compiler used to compile the kernel;
this defaults to "gcc"
--exit-if-driver-exists
exits if a pre-built driver for the current running
kernel exists in the driver install directory
After that you need to install the boost script:
./boot-script -i --driver-directory /cm/shared/apps/vtune/vtune_amplifier_xe/<driver>/src
The boot-script script will install the drivers in /etc/init.d by default. The INIT scripts are called apwr3_1 and sep3_10. You will have to copy them manually inside the software image or use the CMSH command grabimage (see below) to sync the change to the software image.
Every time the kernel changes the driver’s have to be rebuilt using the procedure described above.
Installing the Intel Xeon Phi SEP driver using the provided scripts
This is the default method supported by Intel. However the changes will not be persistent upon MPSS RPM package upgrades.
First you will need to set some environment variables:
PATH=/cm/local/apps/intel-mic/current/linux-k1om-4.7/bin:$PATH
KERNEL_SRC_DIR
The value of KERNEL_SRC_DIR must be set. It is used during the next step:
EITHER
- ln -s /cm/local/apps/intel-mic/current/kernels/<kernel_version> /tmp/MPSS/mic_linux
OR
- For this you will need to download the MPSS 2.1 and/or MPSS 3.1 source code and place it in /tmp/MPSS/mic_linux:
Steps required:
-
Unpack it and find the package-full_src-k1om.tar.bz2 file.
-
make defconfig-miclinux
-
make -C card/kernel ARCH=k1om modules_prepare
KERNEL_VERSION
The value of KERNEL_VERSION must be set. It can be found with, for example:
cat ${KERNEL_SRC_DIR}/include/config/kernel.release
or
ssh mic0 "uname -r"
The driver can be built, after it has been set, as follows:
-
./build_mic_driver.sh
-
cp sep3_10-k1om-2.6.38.8-gefd324esmp.ko /cm/shared/apps/intel/vtune/vtune_amplifier_xe/bin64/k1om/sep3_10-k1om-2.6.38.8-gefd324esmp.ko
-
cd /cm/shared/apps/intel/vtune/vtune_amplifier_xe/bin64/k1om
-
sh sep_micboot_install.sh
[root@node001 k1om]# ./sep_micboot_install.sh
SEP configuration files have been successfully installed in the configuration directory.
Please run "service mpss restart" to start the SEP service.
-
service mpss restart
Please note, loading/unloading the SEP driver in MIC will cause the SEP_service on the MIC card to stop.
1) You can replace the sep kernel module in the installation package, and re-install the driver again.
cp <your_new_driver.ko> <path_to_sep_install>/bin/k1om
2) Restart the service:
cd <path_to_sep_install>/bin64/k1om
sudo ./sep_micboot_install.sh
sudo service mpss restart
Sync changes made on the node to the software image
[root@demo ~]# cmsh
[demo]% device use node001
[demo->device[node001]]% help imageupdate
[demo->device[node001]]% softwareimage
[demo->softwareimage]% use default-image
[demo->softwareimage[default-image]]% updateprovisioners
Provisioning nodes will be updated in the background.
Installing the Intel Xeon Phi drivers using overlays
tba
Prerequisites:
Intel MKL 2013
Intel C compiler 2013
Intel MPSS
SELinux must be disabled
You will need to load the following environment modules:
module load shared intel/mic/cross/2.1.6720 intel/mkl/64/11.0/2013.5.192 intel/compiler/64/13.1/2013.5.192 intel/mic/runtime/2.1.6720
C source code:
#ifndef MIC_DEV
#define MIC_DEV 0
#endif
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <mkl.h>
#include <math.h>
// An OpenMP simple matrix multiply
void doMult(int size, float (* restrict A)[size],
float (* restrict B)[size], float (* restrict C)[size])
{
#pragma offload target(mic:MIC_DEV) \
in(A:length(size*size)) in( B:length(size*size)) \
out(C:length(size*size))
{
// Zero the C matrix
#pragma omp parallel for default(none) shared(C,size)
for (int i = 0; i < size; ++i)
for (int j = 0; j < size; ++j)
C[i][j] =0.f;
// Compute matrix multiplication.
#pragma omp parallel for default(none) shared(A,B,C,size)
for (int i = 0; i < size; ++i)
for (int k = 0; k < size; ++k)
for (int j = 0; j < size; ++j)
C[i][j] += A[i][k] * B[k][j];
}
}
float nrmsdError(int size, float (* restrict M1)[size],
float (* restrict M2)[size])
{
double sum=0.;
double max,min;
max=min=(M1[0][0]- M2[0][0]);
#pragma omp parallel for
for (int i = 0; i < size; ++i)
for (int j = 0; j < size; ++j) {
double diff = (M1[i][j]- M2[i][j]);
#pragma omp critical
{
max = (max>diff)?max:diff;
min = (min<diff)?min:diff;
sum += diff*diff;
}
}
return(sqrt(sum/(size*size))/(max-min));
}
float doCheck(int size, float (* restrict A)[size],
float (* restrict B)[size],
float (* restrict C)[size],
int nIter,
float *error)
{
float (*restrict At)[size] = malloc(sizeof(float)*size*size);
float (*restrict Bt)[size] = malloc(sizeof(float)*size*size);
float (*restrict Ct)[size] = malloc(sizeof(float)*size*size);
float (*restrict Cgemm)[size] = malloc(sizeof(float)*size*size);
// transpose to get best sgemm performance
#pragma omp parallel for
for(int i=0; i < size; i++)
for(int j=0; j < size; j++) {
At[i][j] = A[j][i];
Bt[i][j] = B[j][i];
}
float alpha = 1.0f, beta = 0.0f; /* Scaling factors */
// warm up
sgemm("N", "N", &size, &size, &size, &alpha,
(float *)At, &size, (float *)Bt, &size, &beta, (float *) Ct, &size);
double mklStartTime=dsecnd();
for(int i=0; i < nIter; i++)
sgemm("N", "N", &size, &size, &size, &alpha,
(float *)At, &size, (float *)Bt, &size, &beta, (float *) Ct, &size);
double mklEndTime=dsecnd();
// transpose in Cgemm to calculate error
#pragma omp parallel for
for(int i=0; i < size; i++)
for(int j=0; j < size; j++)
Cgemm[i][j] = Ct[j][i];
*error = nrmsdError(size, C,Cgemm);
free(At); free(Bt); free(Ct); free(Cgemm);
return (2e-9*size*size*size/((mklEndTime-mklStartTime)/nIter) );
}
int main(int argc, char *argv[])
{
if(argc != 4) {
fprintf(stderr,"Use: %s size nThreads nIter\n",argv[0]);
return -1;
}
int i,j,k;
int size=atoi(argv[1]);
int nThreads=atoi(argv[2]);
int nIter=atoi(argv[3]);
omp_set_num_threads(nThreads);
float (*restrict A)[size] = malloc(sizeof(float)*size*size);
float (*restrict B)[size] = malloc(sizeof(float)*size*size);
float (*restrict C)[size] = malloc(sizeof(float)*size*size);
// Fill the A and B arrays
#pragma omp parallel for default(none) shared(A,B,size) private(i,j,k)
for (i = 0; i < size; ++i) {
for (j = 0; j < size; ++j) {
A[i][j] = (float)i + j;
B[i][j] = (float)i - j;
}
}
double aveDoMultTime=0.;
{
// warm up
doMult(size, A,B,C);
double startTime = dsecnd();
for(int i=0; i < nIter; i++) {
doMult(size, A,B,C);
}
double endTime = dsecnd();
aveDoMultTime = (endTime-startTime)/nIter;
}
#pragma omp parallel
#pragma omp master
printf("%s nThreads %d matrix %d %d runtime %g GFlop/s %g",
argv[0], omp_get_num_threads(), size, size,
aveDoMultTime, 2e-9*size*size*size/aveDoMultTime);
#pragma omp barrier
// do check
float error=0.f;
float mklGflop = doCheck(size,A,B,C,nIter,&error);
printf(" mklGflop %g NRMSD_error %g", mklGflop, error);
printf("\n");
free(A); free(B); free(C);
return 0;
}
Compilation
# compile for host-based OpenMP
icc -mkl -O3 -no-offload -openmp -Wno-unknown-pragmas -std=c99 -vec-report3 \
matrix.c -o matrix.omp
# compile for offload mode
icc -mkl -O3 -offload-build -Wno-unknown-pragmas -std=c99 -vec-report3 \
matrix.c -o matrix.off
# compile to run natively on the Xeon Phi
icc -mkl -O3 -mmic -openmp -L /opt/intel/lib/mic -Wno-unknown-pragmas \
-std=c99 -vec-report3 matrix.c -o matrix.mic -liomp5
Transfer the required libraries to the MIC:
Note that the home directories of the users are mounted on the Xeon Phis:
mkdir ~/tmp
cd /cm/shared/apps/intel/composer_xe/2013.5.192/mkl/lib/mic
cp libmkl_intel_thread.so libmkl_core.so libmkl_intel_lp64.so ~/tmp
cp /cm/shared/apps/intel/composer_xe/2013.5.192/compiler/lib/mic/libiomp5.so ~/tmp
Create a run script:
[foobar@node001 ~]$ cat run.sh
export MKL_MIC_ENABLE=1
export MIC_ENV_PREFIX=PHI
export PHI_OMP_NUM_THREADS=240
export PHI_KMP_AFFINITY="granularity=thread,balanced"
export LD_LIBRARY_PATH=/home/foobar/tmp
nThreads=240
i=500
echo -n "MIC0 "
echo -n "PHI_THREADS " $PHI_OMP_NUM_THREADS " "
./matrix.mic $i $nThreads 5
[foobar@node001 ~]$ chmod 755 run.sh
Start VTune GUI:
[foobar@node001 ~]$ amplxe-gui
Create a new project (called mic in this example) and set the following project properties:
As application we set ssh as we are connecting to the mic0 using SSH.
The select and Analysis type (Knights Corner Platform (for MICs)/ General Exploration):
and select Start from the right-hand side buttons.
Collecting data using the CLI
[foobar@node001 ~]$ amplxe-cl -collect knc-general-exploration -- ssh mic0 ./run.sh
amplxe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: amplxe-cl -r /home/foobar/r001ge -command stop.
OMP_OFFLOAD PHI_THREADS 240 ./matrix.mic nThreads 240 matrix 4096 4096 runtime 1.4514 GFlop/s 94.6939 mklGflop 1026.08 NRMSD_error 0.0594141
amplxe: Executing actions 50 % Generating a report
General Exploration Metrics
---------------------------
Parameter r001ge
------------------------------ ------------------
CPU Time 10267.932
Clockticks 12711700000000.000
CPU_CLK_UNHALTED 12711700000000.000
Instructions Retired 1723800000000
CPI Rate 7.374
Cache Usage 0.0
L1 Misses 24478500000
L1 Hit Ratio 0.969
Estimated Latency Impact 432.275
Vectorization Usage 0.0
TLB Usage 0.0
Hardware Event Count
L2_DATA_READ_MISS_CACHE_FILL 3774500000
L2_DATA_WRITE_MISS_CACHE_FILL 182000000
L2_DATA_READ_MISS_MEM_FILL 2834000000
L2_DATA_WRITE_MISS_MEM_FILL 279500000
Collection and Platform Info
----------------------------
Parameter r001ge
------------------------ ---------------------------------------------
Application Command Line ssh "mic0" "./run.sh"
User Name root
Operating System Intel MIC Platform Software Stack release 2.1
Computer Name node001-mic0
Result Size 90353949
CPU
---
Parameter r001ge
----------------- -----------------------------------------
Name Intel(R) Xeon(R) / Core i7 980X Processor
Frequency 1238000000
Logical CPU Count 244
Summary
-------
Elapsed Time: 43.343
Event summary
-------------
Hardware Event Type Hardware Event Count:Self Hardware Event Sample Count:Self Events Per Sample
----------------------------- ------------------------- -------------------------------- -----------------
L2_DATA_READ_MISS_CACHE_FILL 3774500000 7549 100000
CPU_CLK_UNHALTED 12711700000000 254234 10000000
L2_DATA_WRITE_MISS_CACHE_FILL 182000000 364 100000
INSTRUCTIONS_EXECUTED 1723800000000 34476 10000000
L2_DATA_READ_MISS_MEM_FILL 2834000000 5668 100000
DATA_READ_MISS_OR_WRITE_MISS 24160000000 4832 1000000
L2_DATA_WRITE_MISS_MEM_FILL 279500000 559 100000
DATA_READ_OR_WRITE 778140000000 155628 1000000
EXEC_STAGE_CYCLES 1489800000000 29796 10000000
L1_DATA_HIT_INFLIGHT_PF1 318500000 1274 50000
amplxe: Executing actions 100 % done