Parallel Fractal Image Generation
It seems that we have come to a dead end: even when we synchronize the output operations of the nodes, we cannot make sure that their output is actually collated in the right order. We can only be sure that the order in which a node outputs lines is the same as the order in which those lines appear in the complete output file – although they are randomly interspersed with other nodes' output.
Thus, the only way of assuring an in-order output of all the lines is to have one node output all of them – although this seems to violate our requirement that no node should have to store the whole image. However, the node responsible for the output does not have to store all the lines: it is sufficient if it just receives the other nodes' lines that are due for output next, outputs them in order, and then discards them to make room for the next set of lines (called a "chunk" in the following). For example, the first chunk consists of the first lines of all nodes, the second chunk comprises the second lines of all nodes, and so on. Figure 8 illustrates this by showing the chunk number (1 to 15) and the node number (0 to 3) in front of each line:
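This cyclic mapping of image lines to chunks and nodes can be sketched in a few lines of Python (an illustrative snippet; the constants and the helper name are assumptions matching Figure 8, not the paper's code):

```python
# Illustrative sketch of the cyclic line-to-chunk mapping described above
# (constants and helper name are assumptions, not the paper's code).
NODES = 4    # nodes 0 to 3, as in Figure 8
CHUNKS = 15  # chunks 1 to 15, as in Figure 8

def chunk_and_node(line):
    """Map a 0-based image line index to its (chunk, node) pair."""
    return line // NODES + 1, line % NODES

# The first chunk consists of the first line of every node:
for line in range(NODES):
    print(chunk_and_node(line))
```

Line 0 falls into chunk 1 on node 0, line 1 into chunk 1 on node 1, and so on; line 4 starts chunk 2 back on node 0.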
This polling scheme can be realized with the following algorithm, in which requests and responses are implemented using non-blocking communication:
Master node:
  WHILE NOT all chunks are output
    send request for lines of next chunk to other nodes (non-blocking)
    initiate non-blocking reception of other nodes' lines
    REPEAT
      IF NOT all my lines are computed
        compute my next line
      END IF
    UNTIL lines of chunk received
    output all lines of chunk in order (including my line)
    discard chunk
  END WHILE

Every other node:
  WHILE NOT all my lines are sent
    initiate non-blocking reception of master's request
    REPEAT
      IF NOT all my lines are computed
        compute my next line
      END IF
    UNTIL request received
    send requested line to master (non-blocking)
  END WHILE
For each chunk of the image, the master requests the respective lines from the other nodes, then computes its own line for that chunk. If it has received the other lines by the time it finishes computing its own line, it outputs all the lines of the chunk in order. Otherwise, it simply continues computing more of its own lines until the other nodes have responded. As soon as the chunk has been output, it is removed from memory, and the cycle repeats for the next chunk.
The other nodes continuously compute their lines of the image, and whenever they receive a request from the master, they send it the respective line. Since the master requests the lines in the order they are computed, and the other nodes always compute at least one line before checking for the reception of a request, it is ensured that they always have the requested line available. As in the previous algorithms, nodes may finish computing all their lines before the whole image is output. In this case, they just sit idle while they are not handling requests.
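The protocol above can be sketched as a single-machine simulation in which thread-safe queues stand in for MPI's non-blocking sends and receives (all names and constants here are illustrative, not the paper's code):

```python
# Single-machine sketch of the polling protocol; queues stand in for
# MPI non-blocking communication (illustrative, not the paper's code).
import queue
import threading

NODES = 4   # node 0 is the master
CHUNKS = 5  # lines per node

def compute_line(node, chunk):
    # Stand-in for computing one Mandelbrot image line.
    return f"chunk {chunk} node {node}"

def worker(node, requests, responses):
    computed = []  # lines computed so far, in order
    sent = 0
    while sent < CHUNKS:
        # Compute at least one line before checking for a request,
        # so a requested line is always available (see above).
        if len(computed) < CHUNKS:
            computed.append(compute_line(node, len(computed) + 1))
        try:
            chunk = requests.get_nowait()
        except queue.Empty:
            continue
        responses.put((node, computed[chunk - 1]))
        sent += 1

def master():
    requests = [queue.Queue() for _ in range(NODES - 1)]
    responses = queue.Queue()
    workers = [threading.Thread(target=worker,
                                args=(n, requests[n - 1], responses))
               for n in range(1, NODES)]
    for w in workers:
        w.start()
    output = []
    for chunk in range(1, CHUNKS + 1):
        for q in requests:                   # request chunk from all workers
            q.put(chunk)
        lines = {0: compute_line(0, chunk)}  # compute own line meanwhile
        while len(lines) < NODES:            # gather the workers' lines
            node, line = responses.get()
            lines[node] = line
        output.extend(lines[n] for n in range(NODES))  # in-order output
        # the chunk's lines are dropped here before the next iteration
    for w in workers:
        w.join()
    return output

print(master()[:4])
```

The master holds only one chunk at a time, and each worker responds only when polled, mirroring the request/response pattern of the pseudocode.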
One might argue that the polling mechanism is unnecessary communication overhead, and all the nodes could just send their lines to the master as soon as they are finished. However, in that case, the master would need a buffer to store all the lines that were received but not yet due for output. Under some circumstances, the master could then end up storing almost all the lines of the image, which is exactly what we set out to avoid. Using the request/response algorithm, on the other hand, the master needs to buffer only one line from each node at a time.
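The difference in buffer requirements can be made concrete with a little arithmetic (illustrative figures, not measurements from the paper):

```python
# Illustrative worst-case buffer sizes at the master (not from the paper).
def eager_worst_case(nodes, lines):
    # With eager sending, fast workers may deliver all of their lines
    # before the master outputs anything: everything but its own share.
    return lines - lines // nodes

def polling_buffer(nodes):
    # With polling, at most one outstanding line per other node.
    return nodes - 1

print(eager_worst_case(4, 1024))  # 768 lines buffered in the worst case
print(polling_buffer(4))          # only 3 lines buffered
```

For a 4-node cluster rendering a 1024-line image, eager sending can leave the master holding three quarters of the image, while polling caps its buffer at three lines.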
If we implement and run this algorithm on a parallel processing cluster, we finally get the desired result, as shown in the following output file:
Figure 11: Output of polling synchronization algorithm
We now have a working algorithm for the in-order output of sequential data generated by a parallel program. The algorithm was implemented using the Message Passing Interface library [MPI97]. The serial and parallel programs were executed and timed on the University of Montana's supercomputer summit.cs.umt.edu [DM01].
© 2001 Matthias Book