# Assignment 2 - Reduction

DUE Monday, October 30 at 11:59:59 PM

## Overview

The objective of this assignment is to implement an optimized reduction kernel and analyze basic architectural performance properties.

## Naive Reduction

In the first step, you will implement a naive reduction kernel with unoptimized thread indexing (see the Reduction slides for code snippets). Recall that we discussed two types of reduction in class: naive and optimized. The naive implementation suffers from significant warp divergence due to its thread indexing.
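To make the indexing concrete, here is a minimal sketch of what a naive interleaved-addressing kernel can look like. This is not the official solution: the kernel name, `BLOCK_SIZE`, and the `partialSums` output array are our assumptions, and the actual signatures in your `kernel.cu` may differ.

```cuda
#define BLOCK_SIZE 512  // assumed thread block size

// Naive interleaved addressing: at each step only threads whose index is a
// multiple of 2*stride do work, so the active threads are scattered across
// every warp and neighboring threads take different branches.
__global__ void naiveReduction(float *in, float *partialSums, unsigned n) {
    __shared__ float sdata[BLOCK_SIZE];
    unsigned t = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + t;

    sdata[t] = (i < n) ? in[i] : 0.0f;  // pad out-of-range elements with 0
    __syncthreads();

    for (unsigned stride = 1; stride < blockDim.x; stride *= 2) {
        if (t % (2 * stride) == 0)
            sdata[t] += sdata[t + stride];
        __syncthreads();
    }

    if (t == 0) partialSums[blockIdx.x] = sdata[0];  // one partial sum per block
}
```

Because the `t % (2 * stride) == 0` test selects a scattered subset of each warp, every step of this loop diverges.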

## Optimized Reduction

In the second step, you will implement an optimized reduction kernel with optimized thread indexing. The optimized version avoids a significant amount of warp divergence. The goal is to have the thread indexing behave as shown in slide 33 of the Reduction slides.
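For comparison, here is a sketch of a sequential-addressing version, with the same assumed names as above (check your starter code for the real signatures):

```cuda
#define BLOCK_SIZE 512  // assumed thread block size

// Sequential addressing: the active threads at each step are the contiguous
// range 0..stride-1, so whole warps stay fully active until stride drops
// below the warp size (32).
__global__ void optimizedReduction(float *in, float *partialSums, unsigned n) {
    __shared__ float sdata[BLOCK_SIZE];
    unsigned t = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + t;

    sdata[t] = (i < n) ? in[i] : 0.0f;  // pad out-of-range elements with 0
    __syncthreads();

    for (unsigned stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            sdata[t] += sdata[t + stride];
        __syncthreads();
    }

    if (t == 0) partialSums[blockIdx.x] = sdata[0];  // one partial sum per block
}
```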

## Github Classroom

- For this lab, we will be using GitHub Classroom. Please join the classroom via this link: https://classroom.github.com/a/6DE18gQ3 Once you join, a private GitHub repository containing the starter code will be created automatically.
- Clone the git repository. There should be 5 files: `kernel.cu`, `main.cu`, `Makefile`, `support.cu`, `support.h`.
- The size of the input array is a command-line argument. If none is given, the default size is 1 million.
  *Note that you only have to accumulate the partial sums into an array, which is copied back to the host and verified to check the final answer.* To ensure consistency when grading, do not change the `srand` seed value.
- Complete the unoptimized and optimized reduction kernels by adding your code to `kernel.cu`. There should be no changes necessary in `main.cu`.
- You will be profiling both versions of reduction to answer the questions below.
- Verify that the code works.

## GPGPU-Sim analysis of warp divergence

Please see the GPGPU-Sim setup guide.

We will now analyze the behavior of the naive and optimized reduction kernels using GPGPU-Sim.

To aid in analyzing the microarchitectural properties of these programs, it may help to save the output of the GPGPU-Sim run into a file.
You may save the output by redirecting the printouts to a file, e.g. `./reduction &> outfile`.

Since the focus of this lab is on performance and warp divergence, we will focus mainly on these statistics.

At the end of the simulation run, GPGPU-Sim prints a large set of statistics.
Specifically, the following are the general performance statistics:

```
gpu_sim_cycle = 1672 // Number of cycles the last kernel took
gpu_sim_insn = 20884 // Number of instructions in the last kernel
gpu_ipc = 12.4904 // The IPC of the last kernel
gpu_tot_sim_cycle = 1672 // Number of cycles for the entire application (An application can consist of multiple kernels)
gpu_tot_sim_insn = 20884 // Number of instructions for the entire application
gpu_tot_ipc = 12.4904 // The IPC of the entire application run
```

Another aspect we're concerned about is warp divergence. The output contains a section that looks similar to this:

```
Warp Occupancy Distribution:
Stall:564 W0_Idle:3037 W0_Scoreboard:2257
W1:0 W2:0 W3:0 W4:9 W5:0 W6:0 W7:0 W8:0
W9:0 W10:0 W11:0 W12:0 W13:0 W14:0 W15:0 W16:310
W17:0 W18:0 W19:0 W20:0 W21:0 W22:0 W23:0 W24:0
W25:0 W26:0 W27:0 W28:1 W29:0 W30:0 W31:0 W32:512
```

This gives a distribution of the number of warps with a given number of active threads. For example, `W32` means that all 32 threads are active (i.e., no warp divergence). In this example, we also see a good number of warps with only half of their threads active: `W16:310`.

The simulator defines `WX` (where X = 1 to 32) as the number of cycles during which a warp with X active threads is scheduled into the pipeline.

### Answer the following questions:

Assume we run reduction with an input size of 1,000,000 and thread block size of 512 threads.

- For the naive reduction kernel, how many steps execute without divergence? How many steps execute with divergence?
- For the optimized reduction kernel, how many steps execute without divergence? How many steps execute with divergence?
- Which kernel performed better? Use profiling statistics to support your claim.
- How does the warp occupancy distribution compare between the two Reduction implementations?
- Why do GPGPUs suffer from warp divergence?

## Submission

- Commit and push your completed naive and optimized reduction code to the GitHub repository. (You only need to modify `kernel.cu`.)
- Answer the previous questions in a report document included in the repository, in **PDF, plain text, or Markdown format only**. Please name your report `FirstName-LastName.pdf`, `FirstName-LastName.txt`, or `FirstName-LastName.md`.
- Please also include your name **in** the report.