Lab 2 - Reduction

DUE Tuesday, October 26 at 11:59:59 PM

Overview

The objective of this assignment is to implement an optimized reduction kernel and analyze basic architectural performance properties.

Naive Reduction

In the first step, you will implement a naive reduction kernel with unoptimized thread indexing (See Reduction slides for code snippets). Recall that we discussed two types of reduction in class: naive and optimized. The naive reduction kernel implementation suffers from significant warp divergence due to naive thread indexing.

Optimized Reduction

In the second step, you will implement an optimized reduction kernel with optimized thread indexing. Recall that we discussed two types of reduction in class: naive and optimized. The optimized version avoids a significant number of warp divergence. The goal is to have the thread indexing behave as shown in slide 33 of the Reduction slides.

Github Classroom

  1. For this lab, we will be using Github Classroom.
    Please join the classroom by clicking the following link: https://classroom.github.com/a/ep-NH1tL Once you join the classroom, a private github repository will automatically be created with the starter code.
  2. Clone the git repository. There should be 5 files: kernel.cu, main.cu, Makefile, support.cu, support.h
  3. The size of the input array is a command line argument. If none is given, 1 million is the default size. Note that you only have to accumulate the partial sums into an array, which is copied back to the host and verified to check the final answer. To ensure consistency when grading, do not change the srad seed value.
  4. Complete the unoptimized and optimized reduction kernel by adding your code to kernel.cu. There should be no changes necessary for main.cu.
  5. You will be profiling both version of reduction to answer the questions below.
  6. Verify the code works

GPGPU-Sim analysis of warp divergence

Please see the GPGPU-Sim setup guide.

We will now analyze the behavior of the naive and optimized Reduction kernel using GPGPU-Sim.
To aid in analyzing the microarchitectural properties of these programs, it may help to save the output of the GPGPU-Sim run into a file. You may save the output by redirecting the printouts to a file using ./reduction &> outfile.

Since the focus of this lab is on performance and warp divergence, we will focus mainly on these statistics.
At the end of the simulation run, a lot of various statistics are printed out. Specifially, the following statistics are general performance statistics:

gpu_sim_cycle = 1672         // Number of cycles the last kernel took
gpu_sim_insn = 20884         // Number of instructions in the last kernel
gpu_ipc =      12.4904       // The IPC of the last kernel
gpu_tot_sim_cycle = 1672     // Number of cycles for the entire application (An application can consist of multiple kernels)
gpu_tot_sim_insn = 20884     // Number of instructions for the entire application 
gpu_tot_ipc =      12.4904   // The IPC of the entire application run

Another aspect we're concern about is the warp divergence. In the output, there is a section of output that looks similar to this:

Warp Occupancy Distribution:
Stall:564       W0_Idle:3037    W0_Scoreboard:2257      
W1:0    W2:0    W3:0    W4:9    W5:0    W6:0    W7:0    W8:0    
W9:0    W10:0   W11:0   W12:0   W13:0   W14:0   W15:0   W16:310 
W17:0   W18:0   W19:0   W20:0   W21:0   W22:0   W23:0   W24:0   
W25:0   W26:0   W27:0   W28:1   W29:0   W30:0   W31:0   W32:512

This gives a distribution of the number of warps with a given number of active threads. For example, W32 means that all 32 threads are active (i.e. no warp divergence). In this example, we also see a good number of warps that only have half of the threads active: W16:310.
The simulator defines WX (where X = 1 to 32) - The number of cycles when a warp with X active threads is scheduled into the pipeline.

Answer the following questions:

Assume we run reduction with an input size of 1,000,000 and thread block size of 512 threads.

  1. For the naive reduction kernel, how many steps execute without divergence? How many steps execute with divergence?
  2. For the optimized reduction kernel, how many steps execute without divergence? How many steps execute with divergence?
  3. Which kernel performed better? Use profiling statistics to support your claim.
  4. How does the warp occupancy distribution compare between the two Reduction implementations?
  5. Why do GPGPUs suffer from warp divergence?

Submission

  1. Commit and push your completed Naive and Optimized Reduction code to the github repository. (You only need to modify kernel.cu.)
  2. Answer the previous questions by including a report document in the repository. Please name your report FirstName-LastName.pdf or FirstName-LastName.txt or FirstName-LastName.docx, etc.

    Please also include your name in the report.