Assignment 2 - Reduction

DUE Friday, April 29 at 11:59:59 PM Report must be in PDF, Plain text, or Markdown format only


The objective of this assignment is to implement an optimized reduction kernel and analyze basic architectural performance properties.

Naive Reduction

In the first step, you will implement a naive reduction kernel with unoptimized thread indexing (See Reduction slides for code snippets). Recall that we discussed two types of reduction in class: naive and optimized. The naive reduction kernel implementation suffers from significant warp divergence due to naive thread indexing.

Optimized Reduction

In the second step, you will implement an optimized reduction kernel with optimized thread indexing. Recall that we discussed two types of reduction in class: naive and optimized. The optimized version avoids a significant number of warp divergence. The goal is to have the thread indexing behave as shown in slide 33 of the Reduction slides.

Github Classroom

  1. For this lab, we will be using Github Classroom.
    Please join the classroom by clicking the following link: Once you join the classroom, a private github repository will automatically be created with the starter code.
  2. Clone the git repository. There should be 5 files:,, Makefile,, support.h
  3. The size of the input array is a command line argument. If none is given, 1 million is the default size. Note that you only have to accumulate the partial sums into an array, which is copied back to the host and verified to check the final answer. To ensure consistency when grading, do not change the srad seed value.
  4. Complete the unoptimized and optimized reduction kernel by adding your code to There should be no changes necessary for
  5. You will be profiling both version of reduction to answer the questions below.
  6. Verify the code works

Profiling warp divergence

In order to profile the warp divergence of your reduction algorithms, two helper functions are provided in The countWarpDistribution() function samples the active mask of a warp and records the number of active warps to a global counter array warpDistribution. In order to accurately measure the amount of warp divergence we have, we have to place the countWarpDistribution() function in the same basic block as the reduction operation as follows:

  for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2){
    if (t % stride == 0){
      partialSum[2*t] += partialSum[2*t+stride];

The printWarpDistribution() function prints out the warp distribution counters. Since we only need a single printout of the warp distribution counter and every thread in the kernel runs the same code, we limit only the first thread of the first block to call the printout function. We do this by adding the following to the end of the reduction kernel:

  if(threadIdx.x == 0 && blockIdx.x == 0)

This print function outputs the following:

 Warp Distribution: 
W0: 0, W1: 27349, W2: 15628, W3: 0, W4: 15628, W5: 0, W6: 0, W7: 0, W8: 15628, W9: 0, W10: 0, W11: 0, W12: 0, W13: 0, W14: 0, W15: 0, W16: 15628, W17: 0, W18: 0, W19: 0, W20: 0, W21: 0, W22: 0, W23: 0, W24: 0, W25: 0, W26: 0, W27: 0, W28: 0, W29: 0, W30: 0, W31: 0, W32: 15628, 

This gives a distribution of the number of warps with a given number of active threads. For example, W32 means that all 32 threads are active (i.e. no warp divergence). In this example, we also see a good number of warps that only have half of the threads active: W16:15628.

Answer the following questions:

Assume we run reduction with an input size of 1,000,000 and thread block size of 512 threads.

  1. For the naive reduction kernel, how many steps execute without divergence? How many steps execute with divergence?

We perform reduction locally within a threadblock to calculate a partialSum. Therefore, within each block we have 512 threads and our reduction takes 10 steps operating on 1024. In total there will be 977 thread blocks. The first 976 operates on 1024 elements each, and the last operates on 576 elements (using 288 threads). In the first step, there is no warp divergence (all active threads in a block is divisible by 32). For the remaining steps, all warps are divergent.

  1. For the optimized reduction kernel, how many steps execute without divergence? How many steps execute with divergence?

As with the unoptmized case, the first step has no warp divergence. For the first 976 blocks that operate on 1024 elements each, divergence only occurs when they have less than 32 threads active; therefore the first 5 steps are non-divergent (512, 256, 128, 64, 32 active threads). For the last block that operates on the last 576 elements, 288 threads are active in the first step (non-divergent because divisible by 32). In subsequent steps, the number of active threads are 144, 72, 36, ..., which are not divisible by 32, and therefore are divergent.

  1. Which kernel performed better? Use profiling statistics to support your claim.

Obviously, the optimized reduction kernel performed better. This can be verified with the kernel execution time in the real GPU on Bender.

  1. How does the warp occupancy distribution compare between the two Reduction implementations?

Your exact numbers may vary slightly due to implementation differences (and compiler optimizations). But in general, you should observe that the optmized reduction has a greater occurance of W32 compared to W31-1.

  1. Why do GPGPUs suffer from warp divergence?

Warp divergence is mainly caused by the SIMT execution model where 32 threads in a warp must execute the same instruction (all share the same Program Counter). Due to this, if threads diverge and operate on different instructions, the execution becomes serialized.


  1. Commit and push your completed Naive and Optimized Reduction code to the github repository. (You only need to modify
  2. Answer the previous questions by including a report document in the repository in either PDF, Plain text, or Markdown format only. Please name your report FirstName-LastName.pdf or `FirstName-LastName.txt``, etc.

    Please also include your name in the report.