Assignment 3  Tiled Matrix Multiplication
DUE Wednesday, May 18 at 11:59:59 PM
Overview
The objective of this assignment is to implement a tiled matrix multiplication kernel that can support arbitrary sized matrices.
Tiled Matrix Multiplication

For this lab, we will be using Github Classroom.
Please join the classroom by clicking the following link: https://classroom.github.com/a/j4J9Ecei Once you join the classroom, a private github repository will automatically be created with the starter code.
Simplygit clone
to copy the starter code to Bender. 
Edit the source files
kernel.cu
andmain.cu
to complete the functionality of the matrix multiplication on the device. The two matrices could be any size, but we will not test your code with an output matrix size exceeding 65,536 elements (for example, 256 x 256 input matrices). This is purely a limitation for testing your code in a timely manner. Your code should still be able to run for significantly larger matrices. 
There are three modes of operation for the application. Check
main()
for a description of the modes (repeated below). You will support each of these modes using a Tiled matrix multiplication implementation. No arguments: The application will create two randomly initialized matrices to multiply size (1000x1000). After the device multiplication is invoked, it will compute the correct solution matrix using the CPU, and compare that solution with the devicecomputed solution. If it matches (within a certain tolerance), if will print out "Test PASSED" to the screen before exiting.
 One argument: The application will use the random initialization to create the input matrices (size mxm, where m is the argument. Start your testing with small matrices.
 Three arguments m, k, and n: The application will initialize the two input matrices with random values. A matrix will be of size m x k while the B matrix will be of size k x n, producing a C matrix of size m x n
 Note that if you wish, you may add a mode to accept input matrices from files, or to dump input and output matrices to files to facilitate testing. The first three modes must remain untouched.

Commit and push your completed tiled matrix multiplication code to the private repository.
Answer the following questions:
 On Bender, compare the execution time of a 256 x 256 square matrix multiplication compared to a 1024 x 64 and 64 x 1024 rectangular matrix multiply. All input matricies have 65k entries. What do you observe? Which is faster? Can you explain the observed behavior?
Tip: You may want to comment out the
verify()
function inmain.cu
when timing this question.
On Bender you should observe that the 256 x 256 MM is significantly faster than the 1024 x 64 and 64 x 1024 MM.
For both cases, all input matrix are 65k in size. However, the 256 x 256 MM has significantly fewer operations.
Number of multiplyadd operations for 256 x 256 case: 256 256 256 = 16,777,216
Number of multiplyadd operations for 1024 x 64, 64 x 1024 case: 64 1024 1024 = 67,108,864
 Conceptual Question: For a 64 square tiled matrix multiplication, how many times is each element of the input matrices loaded from global memory? Assume 16x16 tiles.
4
 Conceptual Question: For a 64 square nontiled matrix multiplication, how many times is each element of the input matrices loaded from global memory?
64
 Which tile size resulted in the least number of accesses to global memory? Which tile size resulted in the most number of accesses to global memory? What is the reasoning behind this observation?
Tile size of 32 results in least number of global memory access. Tile size of 8 results in greatest number of global memory access. As tile size increases, we have more reuse in shared memory avoiding global memory access.
 Which tile size performed the fastest, which tile size performed the slowest? Why do you think that is?
On a real GPU the performance trend may differ as the maximum threads per SM is 2048 in Maxwell GPUs. Due to this, a tile size of 32 is the fastest.
Submission
 Commit and push your completed tiled matrix multiplication code to the github repository. (You only need to modify
kernel.cu
andmain.cu
.)  Answer the previous questions by including a report document in the repository in either PDF, Plain text, or Markdown format only. Please name your report
FirstNameLastName.pdf
orFirstNameLastName.txt
orFirstNameLastName.md
, etc.Please also include your name in the report.