ID #1159

How can I use the Intel VTune XE with Bright?



Intel's VTune Amplifier is a commercial application for performance analysis on 32 and 64-bit x86 based machines, and has both GUI and command-line interfaces. Basic features work on both Intel and AMD hardware,  while advanced hardware-based sampling requires an Intel-manufactured CPU.


VTune Amplifier assists in various kinds of code profiling including:


  • stack sampling

  • thread profiling

  • hardware event sampling


The profiler results consist of details such as time spent in each subroutine which can be drilled down to the instruction level.


Installation steps


Download and untar the VTune tarball from Intel’s website:


tar zxf vtune_amplifier_xe_2013_update13.tar.gz && cd vtune_amplifier_xe_2013_update13


In the program’s directory there exist two installer files

  • (CLI installer)
  • (launches a Qt-based GUI)

The CLI installer can be started as:




The following menu will be presented:

Select option 6 “Installation”. A prerequisites test will be run to check if your system meets the minimum requirements on installed hardware and software libraries. After this step the EULA will be shown.

After the user accepts the license agreement the “Activation” menu, will be presented where the user can choose how to activate the program. The user can activate either by using a product key, a license file or a license manager.

After that a short menu will be shown asking the user whether he wants to participate in the Intel® Software Improvement Program. After that the Pre-installation summary will be shown:

Select 2 to customize the installation options:

Change the installation location to be in the /cm/shared/apps NFS share:

Advanced options about building the drivers can be selected by choosing option 4:

Options related to building the drivers (e.g. selecting compiler, kernel version etc) can be set by choosing option 8 in the previous menu.

The quickest method to install VTune is to install it in /cm/shared and install the drivers as well as the init script on a node running the desired software image. Once the drivers are built the changes in the node’s local disk can be synced to the image.  If that is not possible/desirable the drivers can be built manually as we will discuss later.

Disable the NMI Watchdog


The Non Maskable Interrupt (NMI) Watchdog can be used in Linux kernel to periodically detect if the CPU is locked. When CPU-locking occurs, NMI Watchdog service will 1) print debug info 2)reboot the system, sometimes. However NMI  Watchdog needs to use hardware performance count, so other performance tools including VTune Amplifier XE 2013 can’t use PMU event-based sampling data collection.


The NMI watchdog can be disabled in two ways:


1.     Add argument “nmi_watchdog=0” to the GRUB commandline then reboot he system:


cmsh -c "softwareimage use default-image; set kernelparameters nmi_watchdog=0; commit"


2.     Disable NMI_Watchdog, on a  running system:


echo 0 > /proc/sys/kernel/nmi_watchdog


This change can be made permanent with:


echo “kernel.nmi_watchdog=0” >> /cm/images/<software image>/etc/sysctl.conf

Setup the Environment

echo “source /cm/shared/apps/vtune/vtune_amplifier_xe/” > /cm/images/<your image>/etc/profile.d/ && chmod 755 /cm/images/<your image>/etc/profile.d/


The following message should be printed in the user’s shell:


Copyright (C) 2009-2013 Intel Corporation. All rights reserved.

Intel(R) VTune(TM) Amplifier XE 2013 (build 313935)


Manually building the drivers


Switch to the following directory:




where driver is either sepdk or powerdk depending on whether you intend to build the SEP or Power driver.


Run the build-driver script:


./build-driver <options>


Useful options:



     "path" is an existing, writable directory where the

     driver will be copied after it is successfully built;

     this defaults to "/cm/shared/apps/vtune/vtune_amplifier_xe/sepdk/src"



     "version" is version string of kernel that should

     be used for checksum or for building the driver;

     this defaults to "2.6.32-358.el6.x86_64"



     "c_compiler" is the C compiler used to compile the kernel;

     this defaults to "gcc"



     exits if a pre-built driver for the current running

     kernel exists in the driver install directory


After that you need to install the boost script:


./boot-script -i --driver-directory /cm/shared/apps/vtune/vtune_amplifier_xe/<driver>/src


The boot-script script will install the drivers in /etc/init.d by default. The INIT scripts are called apwr3_1 and sep3_10. You will have to copy them manually inside the software image or use the CMSH command grabimage (see below) to sync the change to the software image.


Every time the kernel changes the driver’s have to be rebuilt using the procedure described above.


Installing the Intel Xeon Phi SEP driver using the provided scripts


This is the default method supported by Intel. However the changes will not be persistent upon MPSS RPM package upgrades.


First you will need to set some environment variables:





 The value of KERNEL_SRC_DIR must be set. It is used during the next step:



  • ln -s /cm/local/apps/intel-mic/current/kernels/<kernel_version> /tmp/MPSS/mic_linux




  • For this you will need to download the MPSS 2.1 and/or MPSS 3.1  source code  and place it in /tmp/MPSS/mic_linux:


Steps required:


  • Unpack it and find the package-full_src-k1om.tar.bz2 file.

  • make defconfig-miclinux

  • make -C card/kernel ARCH=k1om modules_prepare



The value of KERNEL_VERSION must be set. It can be found with, for example:


      cat ${KERNEL_SRC_DIR}/include/config/kernel.release


      ssh mic0 "uname -r"

The driver can be built, after it has been set, as follows:


  • ./

  • cp sep3_10-k1om- /cm/shared/apps/intel/vtune/vtune_amplifier_xe/bin64/k1om/sep3_10-k1om-

  • cd /cm/shared/apps/intel/vtune/vtune_amplifier_xe/bin64/k1om

  • sh

[root@node001 k1om]# ./

SEP configuration files have been successfully installed in the configuration directory.

Please run  "service mpss restart" to start the SEP service.

  • service mpss restart

    Please note, loading/unloading the SEP driver in MIC will cause the SEP_service on the MIC card to stop.


    1) You can replace the sep kernel module in the installation package, and re-install the driver again.


      cp <your_new_driver.ko>  <path_to_sep_install>/bin/k1om


    2) Restart the service:

      cd <path_to_sep_install>/bin64/k1om

      sudo ./

      sudo service mpss restart

Sync changes made on the node to the software image


[root@demo ~]# cmsh

[demo]% device use node001

[demo->device[node001]]% help imageupdate

[demo->device[node001]]% softwareimage

[demo->softwareimage]% use default-image

[demo->softwareimage[default-image]]% updateprovisioners

Provisioning nodes will be updated in the background.


Installing the Intel Xeon Phi drivers using overlays







Intel MKL 2013

Intel C compiler 2013

Intel MPSS

SELinux must be disabled


You will need to load the following environment modules:


module load  shared intel/mic/cross/2.1.6720   intel/mkl/64/11.0/2013.5.192 intel/compiler/64/13.1/2013.5.192   intel/mic/runtime/2.1.6720

C source code:


#ifndef MIC_DEV

#define MIC_DEV 0



#include <stdio.h>

#include <stdlib.h>

#include <omp.h>

#include <mkl.h>

#include <math.h>


// An OpenMP simple matrix multiply

void doMult(int size, float (* restrict A)[size],

       float (* restrict B)[size], float (* restrict C)[size])


#pragma offload target(mic:MIC_DEV) \

               in(A:length(size*size)) in( B:length(size*size))    \



   // Zero the C matrix

#pragma omp parallel for default(none) shared(C,size)

   for (int i = 0; i < size; ++i)

     for (int j = 0; j < size; ++j)

       C[i][j] =0.f;


   // Compute matrix multiplication.

#pragma omp parallel for default(none) shared(A,B,C,size)

   for (int i = 0; i < size; ++i)

     for (int k = 0; k < size; ++k)

       for (int j = 0; j < size; ++j)

         C[i][j] += A[i][k] * B[k][j];




float nrmsdError(int size, float (* restrict M1)[size],

       float (* restrict M2)[size])


   double sum=0.;

   double max,min;

   max=min=(M1[0][0]- M2[0][0]);


#pragma omp parallel for

   for (int i = 0; i < size; ++i)

     for (int j = 0; j < size; ++j) {

   double diff = (M1[i][j]- M2[i][j]);

#pragma omp critical


     max = (max>diff)?max:diff;

     min = (min<diff)?min:diff;

     sum += diff*diff;







float doCheck(int size, float (* restrict A)[size],

         float (* restrict B)[size],

         float (* restrict C)[size],

         int nIter,

         float *error)


 float (*restrict At)[size] = malloc(sizeof(float)*size*size);

 float (*restrict Bt)[size] = malloc(sizeof(float)*size*size);

 float (*restrict Ct)[size] = malloc(sizeof(float)*size*size);

 float (*restrict Cgemm)[size] = malloc(sizeof(float)*size*size);


 // transpose to get best sgemm performance

#pragma omp parallel for

 for(int i=0; i < size; i++)

   for(int j=0; j < size; j++) {

     At[i][j] = A[j][i];

     Bt[i][j] = B[j][i];



 float alpha = 1.0f, beta = 0.0f; /* Scaling factors */


 // warm up

 sgemm("N", "N", &size, &size, &size, &alpha,

   (float *)At, &size, (float *)Bt, &size, &beta, (float *) Ct, &size);

 double mklStartTime=dsecnd();

 for(int i=0; i < nIter; i++)

   sgemm("N", "N", &size, &size, &size, &alpha,

     (float *)At, &size, (float *)Bt, &size, &beta, (float *) Ct, &size);

 double mklEndTime=dsecnd();


 // transpose in Cgemm to calculate error

 #pragma omp parallel for

 for(int i=0; i < size; i++)

   for(int j=0; j < size; j++)

     Cgemm[i][j] = Ct[j][i];


 *error = nrmsdError(size, C,Cgemm);


 free(At); free(Bt); free(Ct); free(Cgemm);


 return (2e-9*size*size*size/((mklEndTime-mklStartTime)/nIter) );



int main(int argc, char *argv[])



 if(argc != 4) {

   fprintf(stderr,"Use: %s size nThreads nIter\n",argv[0]);

   return -1;



 int i,j,k;

 int size=atoi(argv[1]);

 int nThreads=atoi(argv[2]);

 int nIter=atoi(argv[3]);




 float (*restrict A)[size] = malloc(sizeof(float)*size*size);

 float (*restrict B)[size] = malloc(sizeof(float)*size*size);

 float (*restrict C)[size] = malloc(sizeof(float)*size*size);


 // Fill the A and B arrays

#pragma omp parallel for default(none) shared(A,B,size) private(i,j,k)

 for (i = 0; i < size; ++i) {

   for (j = 0; j < size; ++j) {

     A[i][j] = (float)i + j;

     B[i][j] = (float)i - j;




 double aveDoMultTime=0.;


   // warm up

   doMult(size, A,B,C);


   double startTime = dsecnd();

   for(int i=0; i < nIter; i++) {

     doMult(size, A,B,C);


   double endTime = dsecnd();

   aveDoMultTime = (endTime-startTime)/nIter;



#pragma omp parallel

#pragma omp master

 printf("%s nThreads %d matrix %d %d runtime %g GFlop/s %g",

    argv[0], omp_get_num_threads(), size, size,

    aveDoMultTime, 2e-9*size*size*size/aveDoMultTime);

#pragma omp barrier


 // do check

 float error=0.f;

 float mklGflop = doCheck(size,A,B,C,nIter,&error);

 printf(" mklGflop %g NRMSD_error %g", mklGflop, error);




 free(A); free(B); free(C);

 return 0;




# compile for host-based OpenMP

icc -mkl -O3 -no-offload -openmp -Wno-unknown-pragmas -std=c99 -vec-report3 \

   matrix.c -o matrix.omp

# compile for offload mode

icc -mkl -O3 -offload-build -Wno-unknown-pragmas -std=c99 -vec-report3 \

matrix.c -o

# compile to run natively on the Xeon Phi

icc -mkl -O3 -mmic -openmp -L  /opt/intel/lib/mic -Wno-unknown-pragmas \

-std=c99 -vec-report3 matrix.c -o matrix.mic -liomp5

Transfer the required libraries to the MIC:


Note that the home directories of the users are mounted on the Xeon Phis:


mkdir ~/tmp

cd /cm/shared/apps/intel/composer_xe/2013.5.192/mkl/lib/mic

cp ~/tmp

cp /cm/shared/apps/intel/composer_xe/2013.5.192/compiler/lib/mic/ ~/tmp


Create a run script:

[foobar@node001 ~]$ cat




export PHI_KMP_AFFINITY="granularity=thread,balanced"

export LD_LIBRARY_PATH=/home/foobar/tmp




echo -n "MIC0 "


  ./matrix.mic $i $nThreads 5

[foobar@node001 ~]$ chmod 755

Start VTune GUI:


[foobar@node001 ~]$ amplxe-gui


Create a new project (called mic in this example) and set the following project properties:

As application we set ssh as we are connecting to the mic0 using SSH.

The select and Analysis type (Knights Corner Platform (for MICs)/ General Exploration):

and select Start from the right-hand side buttons.

Collecting data using the CLI

[foobar@node001 ~]$ amplxe-cl -collect knc-general-exploration -- ssh mic0 ./

amplxe: Collection started. To stop the collection, either press CTRL-C or enter from another console window: amplxe-cl -r /home/foobar/r001ge -command stop.

OMP_OFFLOAD PHI_THREADS  240  ./matrix.mic nThreads 240 matrix 4096 4096 runtime 1.4514 GFlop/s 94.6939 mklGflop 1026.08 NRMSD_error 0.0594141


amplxe: Executing actions 50 % Generating a report                             


General Exploration Metrics


Parameter                       r001ge            

------------------------------  ------------------

CPU Time                        10267.932         

Clockticks                      12711700000000.000

CPU_CLK_UNHALTED               12711700000000.000

Instructions Retired            1723800000000     

CPI Rate                        7.374             

Cache Usage                     0.0               

L1 Misses                      24478500000       

L1 Hit Ratio                   0.969             

Estimated Latency Impact       432.275           

Vectorization Usage             0.0               

TLB Usage                       0.0               

Hardware Event Count                              

L2_DATA_READ_MISS_CACHE_FILL   3774500000        

L2_DATA_WRITE_MISS_CACHE_FILL  182000000         

L2_DATA_READ_MISS_MEM_FILL     2834000000        

L2_DATA_WRITE_MISS_MEM_FILL    279500000         


Collection and Platform Info


Parameter                 r001ge                                       

------------------------  ---------------------------------------------

Application Command Line  ssh "mic0" "./"                        

User Name                 root                                         

Operating System          Intel MIC Platform Software Stack release 2.1

Computer Name             node001-mic0                                 

Result Size               90353949                                     




Parameter          r001ge                                   

-----------------  -----------------------------------------

Name               Intel(R) Xeon(R) / Core i7 980X Processor

Frequency          1238000000                               

Logical CPU Count  244                                      




Elapsed Time:  43.343


Event summary


Hardware Event Type            Hardware Event Count:Self  Hardware Event Sample Count:Self  Events Per Sample

-----------------------------  -------------------------  --------------------------------  -----------------

L2_DATA_READ_MISS_CACHE_FILL                  3774500000                              7549  100000           

CPU_CLK_UNHALTED                          12711700000000                            254234  10000000         

L2_DATA_WRITE_MISS_CACHE_FILL                  182000000                               364  100000           

INSTRUCTIONS_EXECUTED                      1723800000000                             34476  10000000         

L2_DATA_READ_MISS_MEM_FILL                    2834000000                              5668  100000           

DATA_READ_MISS_OR_WRITE_MISS                 24160000000                              4832  1000000          

L2_DATA_WRITE_MISS_MEM_FILL                    279500000                               559  100000           

DATA_READ_OR_WRITE                          778140000000                            155628  1000000          

EXEC_STAGE_CYCLES                          1489800000000                             29796  10000000         

L1_DATA_HIT_INFLIGHT_PF1                       318500000                              1274  50000            

amplxe: Executing actions 100 % done   

Tags: -

Related entries:

You cannot comment on this entry