ECE4893A/CS4803MPG: Multicore and GPU Programming for Video Games

Fall 2009

Homework #8: The Cell

8A checkpoint due: Saturday, Dec. 5 at 11:59 PM (via T-square)

8B boss battle due: Friday, Dec. 11 at 2:20 PM (via T-square)

Late policy: The "8A checkpoint" will be graded out of 20 points; the "8B boss battle" will be graded out of 100 points. We will accept late submissions; however, for every day that a submission is overdue (including weekend days), we will subtract 10 points (for 8A) or 20 points from the total (for 8B). 8B will not be accepted past 2:20 on Sunday, Dec. 13, since I need to turn in grades the following day. (We understand that sometimes multiple assignments hit at once, or other life events intervene, and hence you have to make some tough choices. We'd rather let you turn something in late, with some points off, than have a "no late assignments accepted at all" policy, since the former encourages you to still do the assignment and learn something from it, while the latter just grinds down your soul.)

Note about the unusual due dates: Many students I have talked to have major projects due on Friday of "dead week." I set the "checkpoint" to be due the following today so people slammed with multiple projects that are due Friday have some breathing room, but I strongly recommend handling the checkpoint, which is meant to make sure you can compile and run Cell programs, early. The 2:20 PM due date for the "boss battle" on Friday of finals week is set to correspond to when our final would have ended if we had actually had a final; this seems like a reasonable approach since we're not doing a final. Again, I recommend you try to get the assignment finished early, but this gives you the flexibility to manage your time however seems best to you.

Review:

8A Checkoff: Modify the example in /opt/cell_class/Hands-on-30/basicDMA/DMA_getbuf_libspe2-async to replace the strings "Good morning!" and "Guten Morgen!" with your name and the name of your favorite video game, respectively (for instance, "Aaron Lanterman" and "Metal Gear Solid".) Submit two screenshots: (1) a screenshot of your complete VMWare screen showing you running the program in the simulator and (2) a screenshot of a terminal window showing you run it on the CellBuzz cluster. This is checkoff is obviously intended to make sure you understand the mechanics of compiling and running Cell code out of the way, and that you're not still trying to figure out how to compile and run code the day before 8B is due. (If you manage to complete 8B by the 8A deadline, you can directly turn in 8B early and not bother with 8A.)

This may seem trivial, and it is; but if I don't assign this as a separate early turn in, believe me, I will be getting a dozen e-mails Thursday night saying "I can't download the image/I can't get VMWare Player to work/I can't log in to Cellbuzz/I forgot to sign up for a Cellbuzz account what should I do now/etc."

Warning about compiling other examples: The example used in 8A seems to compile fine, but I think some of the examples in the cell_class directory may have been made for an earlier version of the SDK, so you need to do some tweaking to get them to work. In particular, IBM seems to have moved the locations of needed files around just to annoy us.

8B Boss Battle: Quaternions are a computationally efficient and conceptually intriguing alternative to rotation matrices. To concatenate rotations represented by quaternions, you just multiply the quaternions, just like you do with rotation matrices. Before continuing, please read over the set of supplemental slides on quaternions that we have prepared (slides, 4-up), particularly the slide on quaternion multiplication.

We will create a Cell program to do some computations with quaternions. We will create two sets of quaternions, which we will call A and B, each containing 1,000 quaternions, numbered 0 through 999. You can "hard code" the number of quaternions as 1000, i.e. declare arrays the size you need, and not worry about "mallocing" variable amounts of space for an arbitary numbers of quaternions.

We will initialize quaternions 0 through 399 of set A as

A(n) = {cos([2*pi*n/400]/2),sin([2*pi*n/400]/2)*(1,0,0)},

quaternions 400 to 699 of set A as

A(n) = {cos([2*pi*(n-400)/300]/2),sin([2*pi*(n-400)/300]/2)*(0,1,0)},

and 700 to 999 of set A as

A(n) = {cos([2*pi*(n-700)/300]/2),sin([2*pi*(n-700)/300]/2)*(0,0,1)}

We will initialize quaternion set B as

B(n) = {cos([2*pi*n/1000]/2),sin([2*pi*n/1000]/2)*(sqrt(1/6),sqrt(1/3),sqrt(1/2)}

Initialize your quaternions on the PPE, and then send that data to one of the SPEs. On the SPE, iterate the computation A(n) = A(n)*B(n), for all n=0...999 quaternions, where the equal sign represents assignment (not mathematical equality) and the * sign represents the quaternion multiplication as defined on slide 3 of the supplemental slides, 20 times. Do all computations in single-precision floating point. Note that A will change as a result of these iterations, but B will remain constant. This models a situation in which A represents the orientations of objects, and B represents an amount by which they rotate over a given time interval. Organize your computation such that the 20 iterations is the "outer loop" and the computation on the 1,000 quaternion pairs is the "inner loop" - the idea is that in a real game, some of the other SPEs may be doing other kinds of computations at each iteration of the "outer loop." When your iterations are done, ship the data for the final state of the A quaternions back to the PPE.

Do the same calculation on the PPE. If you are clever you can have the PPE and the SPE calculating simultaneously; alternatively, you can do the calculations sequentially, doing the PPE calculations first and the SPE calculations second or vice-versa. (I'm not picky about this since just getting the DMA working and getting a handle on the SIMD instructions is difficult enough.)

When all the computations are done, first have the PPE print out the final quaternions A(n) for n=0, 600, and 999 that were computued on the PPE, and then have the PPE print out the final quaternions A(n) for n=0, 600, and 999 that were computued on the SPE. Note that these results may differ slightly due to the SPEs handling floating point numbers differently than the PPEs.

Make extensive and clever use of SIMD intrinsics, particularly in your quaternion multiplications, both on the PPE and the SPE, to make the Cell purr. If you do not use SIMD intrinsics and strictly write old-fashioned scalar code, not only will your Cell code not purr, but you will lose points since getting a hang of using those instrinsics is one of the main points of the assignment. Note that although the SIMD intrinsics on PPE and SPE are similar, they are not the same. (You don't need to use SIMD intrinsics in the quaternion initialization phase. You can if you want, but I won't be picky about that since that code is only run once.)

In setting up your code, you will need to give careful thought to how you want to organize your data. Mike Acton, Engine Director for Insomniac Games, is fond of saying that "data is more important than code," and noting this is particularly true on the Cell. You will need to think about whether want to adopt an "array-of-structures" (AOS) style or a "structure-of-arrays" (SOA) style. An AOS style is natural on a GPU, since you could put the four elements of the quaternions into the (r,g,b,a) or (x,y,z,w) fields of a typical 128-bit GPU field, and you have "free" swizzle operations that would make implementing quaternion multiplication easy with that format. However, on the Cell, data permutation can be costly, so it might be better to take an SOA approach and pack all instance of one quaternion coordinate in one giant array, all instances of a second coordinate in another array, etc., and have your quaternion multiply essentially operate on four quaternion pairs at a time.

Although will you probably want to use the Cell simulator for you initial development and debugging (perhaps lowering the number of rotation iterations, at first, for the sake of speed), your final run of the code should be on the CellBuzz cluster.

Note we are not requiring you to use multiple SPEs or do any SPE<->SPE communication.

For this part, upload a ZIP file with your code, makefile, executable, and a screenshot showing your code running on the CellBuzz cluster, showing your results.

Deliverables: See the descriptions above.

Be sure to finish sufficiently in advance of the deadline that you will be able to work around any troubles T-square gives you to successfully submit before the deadline. If you have trouble getting T-square to work, please e-mail your compressed file to lanterma@ece.gatech.edu, with "MPG HW #8A" or "MPG HW #8B" and your full name in the header line; please only use this e-mail submission as a last resort if T-square isn't working.

Playstation 3 option: If you happen to have Linux installed on your own Playstation 3, you are welcome to try using it instead of the CellBuzz cluster. Please contact me if you intent to take that route.

Discussion board: As an experiment, we have set up a "HW #8" discussion board where students can discuss the homework and in particular ask questions that the professors and other students can help answer. (Try to avoid posting significant chunks of code on the discussion board; those are probably best directly e-mailed to the professors.)

Ground rules: You are welcome to discuss high-level implementation issues with your fellow students, but you should avoid actually looking at one another student's code as whole, and under no circumstances should you be copying any portion of another student's code. However, asking another student to focus on a few lines of your code to discuss why you are getting a particular kind of error is reasonable. Basically, these "ground rules" are intended to prevent a student from "freeloading" off another student, even accidentally, since they won't get the full yummy nutritional educational goodness out of the assignment if they do.

Looking at code from homeworks done in previous years is strictly prohibited.