ECE4893A/CS4803MPG: Multicore and GPU Programming for Video Games

Fall 2007

Homework #5: The Cell

5A checkpoint due: Friday, Nov. 7 at 23:59:59 (via T-square)

5B boss battle due: Friday, Nov. 14 at 17:00:00 (via T-square)

Late policy: The "5A checkpoint" will be graded out of 20 points; the "5B boss battle" will be graded out of 100 points. We will accept late submissions; however, for every day that a submission is overdue (including weekend days), we will subtract 10 points (for 4A) or 20 points from the total (for 4B). The latest you can turn it in is Sunday, November 16, at 5:00 PM; this is because we need to turn in grades by noon on the following Monday. (We understand that sometimes multiple assignments hit at once, or other life events intervene, and hence you have to make some tough choices. We'd rather let you turn something in late, with some points off, than have a "no late assignments accepted at all" policy, since the former encourages you to still do the assignment and learn something from it, while the latter just grinds down your soul.)

Review: Remember the following tips from Session 34, which Seunghwa Kang presented:

5A Checkoff: Modify the example in /opt/cell_class/Hands-on/DMA/example1b to replace the strings "Good morning!" and "Guten Morgen!" with your name and the name of your favorite video game, respectively. Upload two screenshots: (1) a screenshots of your complete VMWare screen showing you running the program in the simulator and (2) a screenshot of a terminal window showing you run it on the CellBuzz cluster. This is checkoff is obviously intended to make sure you get the mechanics of compiling and running cell code out of the way before the weekend after classes hits, and that you're not still trying to figure out how to compile and run code the day before 5B is due. (If you manage to complete 5B by the 5A deadline, you can directly turn in 5B early and not bother with 5A. However, I suspect few students will be able to get 5B working by then.)

Warning about compiling the examples: I think the examples in the cell_class directory may have been made for an easlier version of the SDK, so you need to do some tweaking.

When you try compiling, you'll get the complaint:

No rule to make target /opt/ibmcmp/xlc/8.1/include/spu_intrinsics.h

Digging around, I found there is a /opt/ibmcmp/xlc/8.2 directory, not an /opt/ibmcmp/xlc/8.1 directory. I resolved this my changing the 8.1's in the *.d files (which seem to hold dependencies) to 8.2's. (A more elegant approach migh be to just rename the 8.2 directory to 8.1, but I haven't tried it, so I'm not sure if that breaks something else).

Once I fixed that, it then complained that it couldn't find things like:

/opt/cell/toolchain-3.3/sysroot/usr/include/stdio.h

Well, it turns out /opt/cell/sysroot/blah blah blah exists, so if you remove the toolchain-3.3 part (or perhaps put a symbolic link in the cell directory called toolchain-3.3 that points to sysroot), it seems happy.

5B Boss Battle: Here, we will implement a tiny part of a physics engine for 300 Spartan warriors and 700 Theban volunteers tumbling through free space after falling off a cliff. (Most people only talk about the 300 Spartans, but they forget the 700 Thebans, as well as the 1,000 or so slaves that also faught at the Battle of Thermopylae, according to Wikipedia.) On the way down, they all caught a glimpse of the gaze of the Medusa, which turned them to stone. Hence, we can treat them as rigid bodies.

We will treat the Spartans and Thebans as the same. You can "hard code" the number of warriors as 1000, i.e. declare arrays the size you need, and not worry about "mallocing" variable amounts of space for an arbitary numbers of warriors.

We will only worry about the orientations of the warriors, which will be represented via quaternions. We will not worry about their positions, and we will not worry about collisions between them. (I couldn't think of a good narrative hack to explain away why we're not worrying about collisions.)

Let's number the warriors as 0 through 999. Referring to the number of a particular warrior as n, let's get some variety by initializing the quaternion for warriors 0 through 299 as

q = {cos([2*pi*n/300]/2),sin([2*pi*n/300]/2)*(1,0,0)}

for warriors 300 to 599 as

q = {cos([2*pi*(n-300)/300]/2),sin([2*pi*(n-300)/300]/2)*(0,1,0)}

for warriors 600 to 999 as

q = {cos([2*pi*(n-600)/400]/2),sin([2*pi*(n-600)/400]/2)*(0,0,1)}

Each warrior is rotating with a different angular velocity (omega0,omega1,omega2). Let's set the angular velocities as

omega0 = pi*(n/1000) radians per second

omega1 = 2*pi*(n/1000) radians per second

omega2 = 3*pi*(n/1000) radians per second

for each of the 1000 warriors.

We will keep the angular velocites constant through the simulation. You should precompute the angular velocities and store them in variables, and not recompute them on every iteration. The idea is that in the "real game," other forces would modify those angular velocities (for instance, wind, impulse collisions, magic spells, etc.) as gameplay proceeded.

In your lecture notes on physics engines, you will find the differential equation governing rotational dynamics using quaternion representations:

dq
--- = [omega0 i + omega1 j + omega2 k] q / 2.
dt

The above multiplication is quaternion multiplication (see Slide 3 of the Quaternion supplement slides.) Use a 2nd-order Runge-Kutta method (also known as the midpoint method) to numerically implement this differential equation. See your notes from the physics engine lectures or ask Wikipedia for details of the method.

After each iteration, renormalize the quaternions to have unit norm. This will prevent errors from building up. (Typically you would only bother to do this every-so-many iterations; here, for simplicity, we'll do it every iteration.) Then, compute a 4x4 matrix for each of the warriors that contains a rotation matrix in the upper 3x3 submatrix derived for that warrior from the quaternion representations (see Slide 7 of the Quaternion slide suppelment). Put 0s along the bottommost row and rightmost row, except for a 1 in the lower right corner. The idea here is that this is the representation that would be passed to the RSX (if we had access to the RSX). Each row of the matrix should consist of a 128-bit word with the usual x,y,z,w ordering of four 32-bit bytes, since that is how the data would be typically be presented to the GPU. (As an aside, quaternion elements are usually labeled w,x,y,z, so don't let that confuse use; they're all just labels.)

Use a time stepsize of 1/60th of a second. Run the simulation for 10 seconds of "simulated time." The code shouldn't need very much "read-world time" to do these calculations.

Generate your initial quaternions and angular velocities on the PPU, and then send that data to one of the SPUs. Do the simulation on the SPU. At the end of the 10 seconds of "simulated time," send the final result of the 1000 quaternions and the matrix representations back to the PPU. Have the PPU print out the quaternion and matrix for "warrior 0," "warrior 500," and "warrior 999." (Note we are not requiring you to use multiple SPEs or do any SPE<->SPE communication.)

In setting up your code, you will need to give careful thought to how you want to organize your data. Mike Acton, Engine Director for Insomniac Games, is fond of saying that on the Cell, "data is more important than code." In particular, you will need to think aboue whether you want to adopt an "array-of-structures" (AOS) style or a "structure-of-arrays" (SOA) style. An AOS style is natural on a GPU, since you could put the four elements of the quaternions into the (r,g,b,a) or (x,y,z,w) fields of a typical 128-bit GPU field, and you have "free" swizzle operations that would make implementing quaternion multiplication easy with that format. However, on the Cell, data permutation can be costly, so it might be better to take an SOA approach and pack all instance of one quaternion coordinate in one giant array, all instances of a second coordinate in another array, etc., and have your quaternion multiply essentially operate on four quaternion pairs at a time.

Make extensive and clever use of SIMD intrinsics throughout your code to make the Cell purr. If you do not use SIMD intrinsics and strictly write old-fashioned scalar code, not only will your Cell code not purr, but you will lose points since getting a hang of using those instrinsics is one of the main points of the assignment. (Exception: I don't care how efficient or not the initialization on the PPU is. You should focus your SIMD efforts on the our SPU code.)

When you run your code for the screenshot, run it via "time yourprogramnamehere" at the Linux prompt. This will present some extremely course timing information. (The Cell SDK provides a wide variety of sophisticated profiling tools that I was planning to have you use, but I decided that would be too much for the limited time we have left this semester. Just wrapping your head around how to organize the data and how to use the SIMD intrinsics will probably take up enough of your time).

You will find the example code in the /opt/cell_class/Hands-on/euler instructive. You may feel free to use that code, or any other code you see in the /opt/cell_class/Hands-on, as the basis for your work for 5B, or make up something from scratch.

For this part, upload a ZIP file with your code, executable, and a screenshot showing your code running on the CellBuzz cluster and which shows the desired output and timing information. (You will probably want to do your debugging on the simulator, perhaps using a smaller number of warriors.)

Deliverables: See the descriptions above for what to upload. For both 5A and 5B, please tell us an approximate number of hours you spent working it, as well your thoughts on the homework, particularly suggestions for improving it in the future. (If you don't have any suggestions, that's OK, just tell us an approximate number of hours.)

Be sure to finish sufficiently in advance of the deadline that you will be able to work around any troubles T-square gives you to successfully submit before the deadline. If you have trouble getting T-square to work, please e-mail your compressed file to lanterma@ece.gatech.edu, with "MPG HW #5A" or "MPG HW #5B" and your full name in the header line; please only use this e-mail submission as a last resort if T-square isn't working.

When you submit your homework, please tell us an approximate number of hours you spent working on it, as well your thoughts on the homework, particularly suggestions for improving it in the future. (If you don't have any suggestions, that's OK, just tell us an approximate number of hours.) This will help us with future offerings of the class.