ECE4893A/CS4803MPG: Multicore and GPU Programming for Video Games

Fall 2008

Homework #7: The Cell

7A checkpoint due: Friday, Dec. 5 (in class at the start of class)

7B boss battle due: Sunday, Dec. 7 at 5:00 PM (via T-square)

Late policy: The "7A checkpoint" will be graded out of 20 points; the "7B boss battle" will be graded out of 100 points. We will accept late submissions; however, for every day that a submission is overdue (including weekend days), we will subtract 10 points (for 7A) or 20 points from the total (for 7B). (We understand that sometimes multiple assignments hit at once, or other life events intervene, and hence you have to make some tough choices. We'd rather let you turn something in late, with some points off, than have a "no late assignments accepted at all" policy, since the former encourages you to still do the assignment and learn something from it, while the latter just grinds down your soul.)

Note about the unusual Sunday due date: Having anything due during finals week has proved to be wildly unpopular in previous semesters; I thought it would be good to give people more time, and give them flexibility in managing their time, but it seems that it distracts people from studying for finals. In particular, having somethig due near the end of finals week was said to put people with flight plans, family coming into town for graduation, etc. at a severe disadvantage. On the other hand, many students I have talked to have projects due or exams on Friday of dead week, were begging to give them at least the weekend before finals to work on it, so they could spend this week working on their other projects. The Sunday 5:00 date/time is a compromise between those two positions; you an turn it in and then spend Sunday evening studying for your Monday exams. If you do not like the Sunday due date, simply pretend that it is due the preceeding Friday.

Review:

7A Checkoff: Modify the example in /opt/cell_class/Hands-on-30/basicDMA/DMA_getbuf_libspe2-async to replace the strings "Good morning!" and "Guten Morgen!" with your name and the name of your favorite video game, respectively (for instance, "Aaron Lanterman" and "Metal Gear Solid".) Hand in hard copies of two screenshots: (1) a screenshots of your complete VMWare screen showing you running the program in the simulator and (2) a screenshot of a terminal window showing you run it on the CellBuzz cluster. This is checkoff is obviously intended to make sure you understand the mechanics of compiling and running cell code out of the way, and that you're not still trying to figure out how to compile and run code the day before 7B is due. (If you manage to complete 7B by the 7A deadline, you can directly turn in 7B early and not bother with 7A.)

This may seem trivial, and it is; but if I don't assign this as a separate early turn in, believe me, I will be getting a dozen e-mails Saturday night saying "I can't download the image/I can't get VMWare Player to work/I can't log in to Cellbuzz/I forgot to sign up for a Cellbuzz account what should I do now/etc."

Warning about compiling other examples: The example used in 7A seems to compile fine, but I think some of the examples in the cell_class directory may have been made for an earlier version of the SDK, so you need to do some tweaking to get them to work. In particular, IBM seems to have moved the locations of needed files around just to annoy us.

7B Boss Battle: Quaternions are a computationally efficient and conceptually intriguing alternative to rotation matrices. To concatenate rotations represented by quaternions, you just multiply the quaternions, just like you do with rotation matrices. Before continuing, please read over the set of supplemental slides on quaternions that we have prepared (slides, 4-up), particularly the slide on quaternion multiplication.

We will create a Cell program to do some computations with quaternions. We will create two sets of quaternions, which we will call A and B, each containing 1,000 quaternions, numbered 0 through 999. You can "hard code" the number of quaternions as 1000, i.e. declare arrays the size you need, and not worry about "mallocing" variable amounts of space for an arbitary numbers of quaternions.

We will initialize quaternions 0 through 299 of set A as

A(n) = {cos([2*pi*n/300]/2),sin([2*pi*n/300]/2)*(1,0,0)},

quaternions 300 to 599 of set A as

A(n) = {cos([2*pi*(n-300)/300]/2),sin([2*pi*(n-300)/300]/2)*(0,1,0)},

and 600 to 999 of set A as

A(n) = {cos([2*pi*(n-600)/400]/2),sin([2*pi*(n-600)/400]/2)*(0,0,1)}

We will initialize quaternion set B as

B(n) = {cos([2*pi*n/1000]/2),sin([2*pi*n/1000]/2)*(sqrt(1/6),sqrt(1/3),sqrt(1/2)}

Initialize your quaternions on the PPE, and then send that data to one of the SPEs. On the SPE, iterate the computation A(n) = A(n)*B(n), for all n=0...999 quaternions, where the equal sign represents assignment (not mathematical equality) and the * sign represents the quaternion multiplication as defined on slide 3 of the supplemental slides, 2,000 times. Note that A will change as a result of these iterations, but B will remain constant. This models a situation in which A represents the orientations of objects, and B represents an amount by which they rotate over a given time interval. Organize your computation such that the 2,000 iterations is the "outer loop" and the computation on the 1,000 quaternion pairs is the "inner loop" - the idea is that in a real game, some of the other SPEs may be doing other kinds of computations at each iteration of the "outer loop." When your iterations is done, ship the data for the final state of the A quaternions back to the PPE.

Then, on the PPE, compute a 4x4 matrix for each quaternion A(n) that contains a rotation matrix in the upper 3x3 submatrix derived from the quaternion representations (see Slide 7 of supplemental slides.) Put 0s along the bottommost row and rightmost row, except for a 1 in the lower right corner. The idea here is that this is the representation that would be passed to the RSX (if we had access to the RSX). Each row of the matrix should consist of a 128-bit word with the usual x,y,z,w ordering of four 32-bit bytes, since that is how the data would be typically be presented to the GPU. (As an aside, quaternion elements are usually labeled w,x,y,z, so don't let that confuse you; they're all just labels.)

When all the computations are done, have the PPE print out the quaternion and associated rotation matrix for quaternions 0, 500, and 999.

Make extensive and clever use of SIMD intrinsics throughout your code, both on the PPE and the SPE, to make the Cell purr. If you do not use SIMD intrinsics and strictly write old-fashioned scalar code, not only will your Cell code not purr, but you will lose points since getting a hang of using those instrinsics is one of the main points of the assignment.

In setting up your code, you will need to give careful thought to how you want to organize your data. Mike Acton, Engine Director for Insomniac Games, is fond of saying that "data is more important than code," and noting this is particularly true on the Cell. You will need to think about whether want to adopt an "array-of-structures" (AOS) style or a "structure-of-arrays" (SOA) style. An AOS style is natural on a GPU, since you could put the four elements of the quaternions into the (r,g,b,a) or (x,y,z,w) fields of a typical 128-bit GPU field, and you have "free" swizzle operations that would make implementing quaternion multiplication easy with that format. However, on the Cell, data permutation can be costly, so it might be better to take an SOA approach and pack all instance of one quaternion coordinate in one giant array, all instances of a second coordinate in another array, etc., and have your quaternion multiply essentially operate on four quaternion pairs at a time.

Although will you probably want to use the Cell simulator for you initial development and debugging (perhaps lowering the number of rotation iterations, at first, for the sake of speed), your final run of the code should be on the Cellbuzz cluster. When you run your code for the screenshot, run it via "time yourprogramnamehere" at the Linux prompt. This will present some extremely course timing information. (The Cell SDK provides a wide variety of sophisticated profiling tools, but we did not have time to cover them this semester.)

Note we are not requiring you to use multiple SPEs or do any SPE<->SPE communication.

For this part, upload a ZIP file with your code, makefile, executable, and a screenshot showing your code running on the CellBuzz cluster and which shows the desired output and timing information.

Deliverables: See the descriptions above. For both 7A and 7B, please tell us an approximate number of hours you spent working it, as well your thoughts on the homework, particularly suggestions for improving it in the future. (If you don't have any suggestions, that's OK, just tell us an approximate number of hours.)

Be sure to finish sufficiently in advance of the deadline that you will be able to work around any troubles T-square gives you to successfully submit before the deadline. If you have trouble getting T-square to work, please e-mail your compressed file to lanterma@ece.gatech.edu, with "MPG HW #7B" and your full name in the header line; please only use this e-mail submission as a last resort if T-square isn't working.