===================================================================== ECE 6100 Advanced Computer Architecture Fall 2001 Project 2 Due: 4 December 2001 ===================================================================== Efficient Traffic Analysis using a SIMD Processing ============================= PURPOSE ============================= An introduction to a parallel architecture to become familiar with the Single Instruction, Multiple Data (SIMD) execution model and to gain experience in programming these types of parallel machines. ========================== INTRODUCTION =========================== In large metropolitan areas, such as Atlanta, where traffic congestion has become a serious issue, traffic monitoring and analysis systems are being put in place. For example, cameras distributed along the major Atlanta highways provide real-time information about the traffic flow, which can be monitored by traffic reporters and emergency personnel. In the future, the information may be reported to embedded GPS-based map displays in individual vehicles. Automated processing of the traffic images can efficiently provide timely information to help route traffic around troublespots and speedily direct emergency vehicles to accidents. In this lab, you'll take a gray-scale traffic image and identify vehicles in the midst of clutter and noise in the image. This lab makes use of a simulator of a new SIMD processor called SIMPil, being developed by the PICA research group in ECE at Georgia Tech. In order to perform the laboratory exercises, you first need to obtain the simulator software and documentation, both of which run under Windows. Download the file s16v301.exe from the 6100 web page (http://www.ece.gatech.edu/users/linda/6100/Projects/), download the files traffic.spa, cars.bmp, and cars.blf. To install SIMPil, run s16v301.exe. This will create a "simpil16TMP" directory under the current directory. Run Setup.exe which will perform the installation. Once the simulator is installed, launch it from the Start menu and use Help to view the documentation. Click on "Help Topics" for a table of contents. You should read the "Procedures" section under "SIMPil Simulator", which describes how to use the simulator. Under "SIMPil Architecture", read the "Overview". Then as needed, you can refer to "SIMPil Processor Instruction Set Architecture" and "SIMPil Controller Instruction Set Architecture" for descriptions of instructions, as you work with SIMPil code. Also, when you open and edit a SIMPil program file in the "Assembly Source Editor", you can mark any instruction operator in the editor and click on the "?" button in the tool bar to go directly to documentation and examples of that instruction. ============================ EXERCISES ============================ EXERCISE 1: Running and understanding a SIMPil program. The SIMPil program traffic.spa which you obtained from the ECE6100 web page provides a skeleton of an image processing program which you will flesh out. Launch the simulator, load traffic.spa, and view the state of the simulator using the various views available. Using the "New View" menu button in the tool bar, be sure to open the list of bitmaps file (cars.blf), which will also open the Input Image window, and open the Output Image window before running the program. The Input and Output Image windows will both be blank until the program is run for the first time. These views all need to be open for the image processing program to run successfully. Run the SIMPil program using the Processor Monitor. The Processor Monitor allows you to run the program in "Turbo" mode (quickly), or regular "Run" mode (which is useful when using breakpoints) or to single-step through the instructions (using "Step"). Use System View to click on various processor nodes and view their states in the Processor Monitor. Step through a few iterations of the loops L1 and L2 in THRESHOLD_NODES and DISPLAY_OUTPUT_LEVEL and use System View to observe processor nodes being put to sleep and waking up to get an idea of what these subroutines are doing. 1a) THRESHOLD_NODES: This subroutine loads in a gray-scale image and converts it to a binary image. It does this by dividing the image into small 4x4 pixel squares and mapping each to a processor node. The node keeps track of a single threshold bit (or flag) for each square: 0 if all 16 pixels in the square are below a THRESHOLD value and 1 otherwise. How large is the processor array? ______ x ______ nodes. How large is the input image? ______ x _____ pixels. How do we change the threshold value? _________________________ _______________________________________________________________ 1b) DISPLAY_ARRAY: This subroutine is used for displaying the binary image represented by the processor nodes' binary threshold flags. An image is constructed as a set of 4x4 pixel squares, one for each processor node. All 16 pixels in a square are given either a 255 output level (which is white) or a 0 output level (black), depending on the processor node's threshold flag. These 16 output levels are written to the node's local memory and OS_BITMAP_OUT is then used to display the constructed image. Run traffic.spa and view the output image created. How would you change DISPLAY_ARRAY to create the negative of this image (i.e., black pixels for pixels above threshold and white for the rest)? _______________________________________________________________ _______________________________________________________________ 1c) DISPLAY_REGIONS: This is another subroutine for displaying an output image. This can be used to view regions of an image: if all processor nodes in the same region have the same unique region ID in a certain register (R15), DISPLAY_REGIONS displays a gray-scale image in which each region has a different shade of gray. The first 3 instructions mask off the upper 8 bits of the value to be displayed (R4). Why are only the lower 8 bits used? (Hint: you may need to refer to the section on "Simulator Directives" in the SIMPil Simulator documentation.) __________________________________________________________________ __________________________________________________________________ 1d) Cleaning up the image. The thresholded image contains a great deal of noise in the form of stray dots and regions that should be contiguous but are fragmented. What do ERODE And DILATE do to help clean up noise? ERODE: ___________________________________________________________ __________________________________________________________________ __________________________________________________________________ DILATE: __________________________________________________________ __________________________________________________________________ __________________________________________________________________ ===================================================================== EXERCISE 2: Identifying vehicles in a cluttered image. The code provided in traffic.spa loads a traffic image into the processor array. The image contains vehicles as well as debris and other objects that sometimes will clog up Atlanta's highways. Devise and implement a parallel algorithm that identifies only the vehicles in the traffic image. It should display an output image in which only the vehicles are shown and no other objects. It should use ERODE and DILATE to clean up the image initially. (It may be best to call each one more than once.) Your routine should assign a unique vehicle ID (VID) to each vehicle and distribute the VID to every node in the region covered by the vehicle. If a node is not in any vehicle's region, its VID should be -1. Use DISPLAY_REGIONS with the VID serving as the region ID to display the results. Print the output image displayed and include it in your lab write-up. In your lab write-up, describe the algorithms your subroutines are using. Describe how you are testing the code and provide a convincing argument that it works. Whenever possible, be sure to use a data parallel algorithm rather than sequential code. Document your subroutines (and any other changes you make to the code) within the code listing. It is important to also document any assumptions you are making about the input image or application in the report. ===================================================================== EXERCISE 3: Evaluating performance. In this exercise, you will use Meter View to compute performance statistics. Once you have a working main program which computes and displays the correct regions, open up a Meter View and check the "Meter Concurrency" box. Run your program and then update the Meter View and save the results to a file. The output of Meter View is 1) a histogram of the instruction types and how often they were executed, 2) the total number of instructions executed, which is equal to the total number of cycles the execution took (since the simulator models each processor executing a single instruction on each cycle), and 3) the concurrency meter, which is a table showing the number of processing nodes awake (and executing an instruction) during intervals of a fixed number of cycles. For example, the following may be a portion of the concurrency meter table: Starting I-Cycle Min. # Nodes Avg. # Nodes Max. # Nodes 0 256 256 256 2 256 256 256 4 1 1 1 6 1 128 256 8 256 256 256 It says that on cycles 0 and 1, 256 nodes were awake (no fewer or greater). They were also all awake on cycles 2 and 3. On cycles 4 and 5, only 1 node was awake. On cycles 6 and 7, there was a cycle in which only 1 node was awake and a cycle in which 256 nodes were awake, giving an average over that interval of 128 awake nodes/cycle. On both cycles 8 and 9, all 256 nodes are awake. Based on the statistics you obtained from Meter View, compute the following: 3a) What is the average processor utilization? In other words, what is the overall average number of processors awake (working on an instruction) per cycle? 3b) Compute C = the average processor utilization divided by the number of processors. This is a measure of concurrency and we would like it to be close to 1. DISCUSSION: 3c) What factors can limit the average processor utilization and C? Are there any changes that you can make to your algorithms or their implementations to increase the utilization or C? 3d) Suppose we wanted to compare the performance of the parallel SIMPil machine execution on this application with the performance of our sequential, pipelined MIPS machine. Suppose we take the number of cycles needed by the parallel execution to be the total number of instructions executed (2nd output of Meter View). Suppose we estimate the total number of cycles needed by the sequential execution on the MIPS machine as follows: 1) instruction count: the total number of instructions executed sequentially is the average SIMPil processor utilization (in processor_nodes/cycle) * 1 instruction/processor_node * total number of cycles taken by the parallel machine; 2) CPI: assume that the average number of cycles per instruction for the pipelined MIPS machine is 1 cycle/instruction; 3) so, total cycles for the sequential execution is the product of the instruction count and CPI. Compute and compare the number of cycles needed by the two machines. By what percent does MIPS execute more cycles than SIMPil? There are several approximations and assumptions we are making in comparing the machines this way. Discuss whether our assumptions are sound. In what ways are we overestimating or underestimating instruction count, cycle count, and CPI? (Hints: Will there be additional instructions in the sequential program that are not necessary in the parallel one (and vice versa)? Is there concurrency in the sequential machine execution that we are not taking into account? Is our estimate of CPI valid? How would differences in the algorithms used in writing programs for a parallel machine versus a sequential machine affect our comparison? What about differences in clock rates?) ======================== WHAT TO HAND IN ========================== This assignment is due on Tuesday, 4 December 2001 in class. Hand in a folder containing a diskette and your laboratory report. Your laboratory report should contain the following: 1. Title and Author 2. Acknowledgements 3. Abstract 4. Answers to the questions asked in Exercise1 5. Algorithm description, validation (including testing strategy), any assumptions made, and results. 6. Results, analysis, and discussion of performance evaluation exercise (Exercise 3). Include a hardcopy of the output image. 7. A diskette containing the .SPA file(s) of the SIMPil programs you write for the exercises, including brief documentation. Also include the bitmap files you created in Exercise 2 and the statistics you saved in Exercise 3. Check your floppy for viruses. BE SURE TO SAVE A BACKUP of everything you turn in. LABEL EVERYTHING you turn in with your name. Any project reports slipped under Professor Wills' office/lab door will not be graded! Arrangements for handing in the project report after the deadline must be made with Cory Hawkins (cory@ece.gatech.edu) before the due date.