Technical Reports

Non-Viable Path Branch Prediction

Ying Xiong and Soner Onder CS-TR-07-02

In modern superscalar processors, which implement deeper and wider pipelines, accurate branch prediction is crucial for feeding a sufficient number of correct instructions into the highly parallel execution core. In this paper, we show that what the branch predictor learns has significant implications for its ability to make effective use of branch correlation and to exploit longer history lengths.

A Hill-Climbing Approach for Planning with Temporal Uncertainty

Janae N. Foss and Dr. Nilufer Onder CS-TR-06-02

We approach the problem of finding temporally contingent plans, i.e., plans with branches that are based on the duration of an action at execution time, using a hill-climbing algorithm. We find an optimistic plan that is valid when all actions complete quickly. We then use efficient temporal reasoning techniques to determine when the plan may fail. At time points that cause an unsafe situation, we insert temporal contingency branches. We describe our implemented planner PHOCUS-HC and provide experimental results.

UPC Collective Conformance Suite

Lisa Begum and Dr. Charles Wallace CS-TR-06-01

The UPC collective conformance suite is a collection of tests that help determine how closely a given implementation of the UPC collective operations conforms to the specifications. The test suite is based on Version 1.1 of the UPC specification and Version 1.0 of the UPC collectives specification. It has been used on the MuPC and HP UPC platforms. These tests exercise all of the collective operations, with the exception of the deprecated upc_all_sort operation, and notify the tester of behavior that departs from the specifications. This document describes the tests in the suite and explains why they were chosen.

A Specification of the Extensions to the Collective Operations of Unified Parallel C

Zinnu Ryne and Dr. Steven Seidel CS-TR-05-08

Unified Parallel C (UPC) is an extension of the C programming language that provides a partitioned shared-memory model for parallel programming. On such platforms, a simple assignment operation can perform remote memory reads or writes. Nevertheless, since bulk transfers of data are more efficient than byte-level copying, collective operations are needed, and as a result a specification for a standard set of collective operations was proposed in 2003. This specification is now part of the UPC V1.2 language specification. However, these operations have limited functionality. This thesis work explores ways to overcome these limitations by proposing a set of extensions to the standard collective relocalization operations. The proposal includes performance and convenience improvements that provide asynchronous behavior, an in-place option, runtime implementation hints, and a subsetting feature. Also proposed are new one-sided collective operations, which are an alternative to the traditional MPI view of collective operations.

A comparative performance analysis is done to ensure that the reference implementation of the extended collective operations performs within reasonable bounds of the standard collective operations. This provides the support necessary to adopt these extensions, as they give the programmer greater flexibility and power with a negligible performance penalty.
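For illustration only, an asynchronous extension of this kind might split a collective into an initiation call and a completion call. The names and signatures below are hypothetical, shown next to the standard synchronous broadcast (from the UPC collectives specification) for contrast:

    #include <upc.h>
    #include <upc_collective.h>

    /* Standard synchronous broadcast from the UPC collectives specification: */
    void upc_all_broadcast(shared void *dst, shared const void *src,
                           size_t nbytes, upc_flag_t flags);

    /* Hypothetical asynchronous variant (illustrative, not the thesis API):
       initiation returns an opaque handle; completion blocks on it. */
    typedef struct upcx_coll_handle upcx_coll_handle_t;

    upcx_coll_handle_t *upcx_all_broadcast_init(shared void *dst,
                                                shared const void *src,
                                                size_t nbytes, upc_flag_t flags);
    void upcx_all_broadcast_wait(upcx_coll_handle_t *h);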

Implementing UPC's MYSYNC Synchronization Mode Using Pairwise Synchronization of Threads

Prasad Dhamne CS-TR-05-07

UPC (Unified Parallel C) is an extension of ANSI C that provides a partitioned shared memory model for parallel programming. Synchronization between threads (processes) in UPC is done through the use of locks or barriers. We have investigated a new synchronization method that is better suited to situations where threads do not need to synchronize with all of the other threads in the system.

We implemented pairwise synchronization that can be used to synchronize pairs of threads while not disturbing the remaining threads. This project also explored how to synchronize pairs of UPC threads over a Myrinet interconnect. Myrinet is a low-latency, high-bandwidth local area network with a low-level communication layer called GM. We implemented UPC's MYSYNC synchronization mode in collective operations that make use of pairwise synchronization. We compared the performance of the MYSYNC synchronization mode to the ALLSYNC synchronization mode, which is often implemented using a barrier. For performance analysis we used a collectives testbed previously developed at MTU.

Results obtained from the testbed indicate that for collectives such as Broadcast, Scatter, and Gather, MYSYNC synchronization considerably reduces the collective time. Overall testbed execution time for 100 trials with MYSYNC mode improved by 10-15% over ALLSYNC synchronization. For the Permute collective operation the improvement is largest in the pull-based implementation, where it is close to 20%. For Gatherall and Exchange the improvement is minimal, and ALLSYNC mode outperforms MYSYNC in the push-based implementation of Permute. In general, the advantage of MYSYNC diminishes as the number of threads increases.

Implementing Sort in UPC: Performance and Optimization

Kohinoor (Lisa) Begum and Dr. Steven Seidel CS-TR-05-06

Unified Parallel C (UPC) is a parallel extension of ANSI C that is based on a partitioned shared address space. It supports development of parallel applications over diverse hardware platforms and does not require a true shared memory system. It simplifies programming by providing the same syntax for local references as for remote references. This helps the user focus on performance, and allows for the exploitation of locality. Our project is to implement a generic sort function in UPC. It will help programmers focus on the design, implementation, and performance of the entire application rather than on the details of sorting.

upc_all_sort is an implementation of a generic sort function in UPC using a bin sort algorithm. Techniques applied to improve the performance of the sort function include sampling the data to ensure more nearly uniform bucket sizes among threads; an all-to-all exchange of bucket sizes to minimize the amount of dynamically allocated memory; and rebalancing the data at the end of the sort to ensure that each thread finishes with the same number of elements it started with. In this paper, we analyze and compare the performance of sequential and parallel sort, the potential for performance improvements over sequential sort, and the bottlenecks limiting these improvements. The sort function runs 3.9 times faster than the sequential sort for problem sizes of at least 128K when run on 8 threads. Performance scales up with increasing numbers of threads and larger problem sizes: it runs 6.8 times faster than sequential sort for a problem size of 512K when run on 16 threads.

An Overlay Network Simulator in Java

Eric Simonton CS-TR-05-05

It is not too difficult to generate a power-law topology. It is not trivial, however, to maintain that topology while simulating network churn, nor to keep the network unpartitioned in the process. This report describes how to do just that, and do it efficiently.

Eager Misprediction Recovery

Peng Zhou, Dr. Soner Onder, and Dr. Steve Carr CS-TR-05-04

Current trends in modern out-of-order processors involve implementing deeper pipelines and a large instruction window to achieve high performance. However, as pipeline depth increases, the branch misprediction penalty becomes a critical factor in overall processor performance. Current approaches to handling branch mispredictions either incrementally roll back to in-order state by waiting until the mispredicted branch reaches the head of the reorder buffer, or utilize checkpointing at branches for faster recovery. Rolling back to in-order state stalls the pipeline for a significant number of cycles, and checkpointing is costly. This paper proposes a fast recovery mechanism, called Eager Misprediction Recovery (EMR), to reduce the branch misprediction penalty. Upon a misprediction, the processor immediately starts fetching, renaming and executing instructions from the correct path without restoring the map table. Those instructions that access incorrect speculative values wait until the correct data are restored; however, instructions that access correct values continue executing while recovery occurs. Thus, the recovery mechanism hides the latency of long branch recovery with useful instructions. EMR achieves a mean performance improvement very close to that of a recovery mechanism that supports checkpointing at each branch. In addition, EMR provides an average of 9.0% and up to 19.9% better performance than traditional sequential misprediction recovery on the SPEC2000 benchmark suite.

A Mechanism for Protecting Passwords from Key Loggers

Lior Shamir CS-TR-05-03

A simple mechanism that protects passwords from being stolen by malicious key loggers is proposed. The described mechanism is based on feeding potential key loggers with incorrect passwords by generating fake keystrokes. Applications that require their users to type in passwords (e.g., web browsers visiting password-protected web sites) can use this mechanism so that passwords are protected even in the absence of an anti-key logger. Different protection approaches are proposed for thread-based, hook-based, and kernel-based key loggers.
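To make the idea concrete, the sketch below shows one way fake keystrokes might be generated on Win32 (an assumed platform chosen for illustration; the report handles thread-based, hook-based, and kernel-based loggers differently). The application injects random key events that a logger records but the application itself discards, since it knows which events it generated:

    #include <windows.h>
    #include <stdlib.h>
    #include <string.h>

    /* Inject one fake key press and release. A key logger observes these
       events; the protected application filters out the keys it injected,
       so only it can reconstruct the real password. */
    static void inject_fake_key(void)
    {
        WORD vk = (WORD)('A' + rand() % 26);   /* random letter key */
        INPUT in[2];
        memset(in, 0, sizeof in);

        in[0].type = INPUT_KEYBOARD;
        in[0].ki.wVk = vk;                     /* key down */
        in[1].type = INPUT_KEYBOARD;
        in[1].ki.wVk = vk;
        in[1].ki.dwFlags = KEYEVENTF_KEYUP;    /* key up */

        SendInput(2, in, sizeof(INPUT));
    }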

MuON: Epidemic Based Mutual Anonymity

Neelesh Bansod, Ashish Malgi, Dr. Byung Kyu Choi and Dr. Jean Mayo CS-TR-05-02

A mutually anonymous service hides the identity of a client from the service provider and vice-versa. Providing mutual anonymity usually requires a large number of participants. While peer-to-peer (P2P) networks are capable of recruiting a large number of participants, reliable anonymous communication in these architectures, with low bandwidth usage, still needs further investigation. This paper presents MuON, a protocol to achieve mutual anonymity in unstructured P2P networks. MuON leverages epidemic-style data dissemination to deal with the high churn (changes in system membership) characteristic of unstructured P2P networks. The results from our security analysis and simulation show that MuON provides mutual anonymity over unstructured P2P networks while maintaining predictable latencies, high reliability, and low communication overhead.

An Introduction to Flexible Architecture Simulation Tool (FAST) and Architecture Description Language ADL

Dr. Soner Onder CS-TR-05-01

The following document provides an overview of how to use the Flexible Architecture Simulation Tool and an in-depth look at the Architecture Description Language. The system was developed by Dr. Soner Onder and Dr. Rajiv Gupta, and after many revisions it has become a powerful simulation tool. The documentation you are reading is the result of a collaboration between the University of Alberta and Michigan Technological University. Aaron Dittrich, under the direction of Dr. Mike MacGregor, organized pieces of documentation written by Dr. Onder into a technical report (TR03-18, Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada), which later became the basis for this document. The examples presented are centered around the MIPS RISC architecture: first a single-stage MIPS architecture is presented, and then the remainder of the examples center on a simplified MIPS R2000/R3000 architecture.

High Performance Unified Parallel C (UPC) Collectives for Linux/Myrinet Platforms

Alok Mishra and Steven Seidel CS-TR-04-05

Unified Parallel C (UPC) is a partitioned shared memory parallel programming language that is being developed by a consortium of academia, industry and government. UPC is an extension of ANSI C. In this project we implemented a high performance UPC collective communications library of functions to perform data relocalization in UPC programs for Linux/Myrinet clusters. Myrinet is a low latency, high bandwidth local area network. The library was written using Myrinet's low level communication layer called GM. We implemented the broadcast, scatter, gather, gather all, exchange and permute collective functions as defined in the UPC collectives specification document. The performance of these functions was compared to a UPC-level reference implementation of the collectives library. We also designed and implemented a micro-benchmarking application to measure and compare the performance of collective functions. The collective functions implemented in GM usually ran two to three times faster than the reference implementation over a wide range of message lengths and numbers of threads.

TransMig: Implementing a Transparent Strong Migration of Mobile Agents in Java

Ashish Malgi, Neelesh Bansod, and Dr. Byung Kyu Choi CS-TR-04-04

Mobile agents are programs that ship themselves from one machine to another via a programmer-defined primitive. They form an interesting paradigm for developing distributed applications. The key issue in migrating a mobile agent from one machine to another is the transmission of the execution state of the agent. We propose TransMig, an approach to capture and re-establish the state of mobile agents programmed in Java without any modifications to the Java virtual machine. Our approach instruments the original source code and the resultant byte code to capture and re-establish the execution state of mobile agents at the language level. Currently, our prototype implementation of TransMig has been used to provide a programming and runtime environment for mobile agents, but it could easily be extended to any Java program.

Significant Precursors to UPC

Dr. Phillip Merkey and Jaisudha Purushothaman CS-TR-04-03

Designing parallel distributed languages involves deciding the level of abstraction that can be supported through the features provided to the programmer and the level of control the programmer is allowed to wield over the system. C has been one of the most commonly used languages for almost three decades. It has low-level access to information and commands while still retaining the portability and syntax of a high-level language. Its flexibility and powerful pointer manipulation capabilities are some of the reasons why it is still actively maintained. It has been the natural base language on which researchers have developed parallel languages or have added libraries or extensions for parallelism. For example, Cilk, C*, Split-C, AC, PCP, Handel-C, mpC, DPCE, C--, C//, Jade, Parsec, MPI, OpenMP, and ParaC are some of the languages developed as C libraries or extensions. UPC can be viewed as the culmination of previous attempts at finding the right combination of programming features for the parallel distributed shared memory environment. In particular, PCP, Split-C, and AC are considered the immediate predecessors of UPC. This essay discusses these languages and UPC in a comparative manner.

Exploring Linked Lists in UPC

Hiriomi Suenaga, Dr. Phillip Merkey, and Dr. Steven Seidel CS-TR-04-02

Shared memory multiprocessors offer great computational power while introducing new problems such as race conditions and deadlock. Concurrent objects can be accessed and modified by multiple processes simultaneously. In order to avoid corruption of data, we need to construct carefully planned algorithms that minimize the use of synchronization schemes, which inhibit performance. Unified Parallel C (UPC) is a parallel extension of ANSI C that presents memory as a partitioned shared space. Because of this, past studies that implemented concurrent objects on pure shared memory machines cannot be applied directly. In order to achieve high performance while avoiding data corruption or inconsistent results, it is important to understand the underlying architecture and the memory model. In this study we explored three different ways of implementing an elemental data structure, the linked list, on which multiple processes can operate at the same time. Unfortunately, we were not able to show good speedup at this point because of experimentally confirmed limitations of the current compiler; however, we are confident that our approaches will yield significant performance improvements as the compiler develops.

Pairwise Synchronization in UPC

Ernie Fessenden and Dr. Steven Seidel CS-TR-04-01

UPC (Unified Parallel C) is a small extension of C that provides a partitioned shared memory model for parallel programming. Synchronization between threads (processes) in UPC is done through the use of locks or barriers. We have investigated a new synchronization process that we believe to be better suited to situations where threads do not need to synchronize with all of the other threads in the system.

This project explores how to synchronize UPC threads in specialized ways. Our approach is to use UPC locks to implement pairwise synchronization, as sketched below. Using this new approach we implemented the MYSYNC synchronization mode and compared it to the ALLSYNC synchronization mode, which is often implemented using a barrier. Pairwise synchronization can also be used to synchronize arbitrary subsets of threads while not disturbing the remaining threads. Such subset-based synchronization does not presently exist in UPC. We also investigate how to implement the barrier using pairwise synchronization.
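As a rough illustration, one way pairwise synchronization might be built from UPC locks is shown below. This is a minimal sketch under assumed data structures, not the project's actual code; a static-THREADS compilation environment is assumed, and lock initialization (one upc_all_lock_alloc call per entry) is omitted:

    #include <upc.h>

    /* One counter and one lock per ordered pair of threads:
       arrived[i*THREADS+j] counts signals sent to thread i by thread j. */
    shared int arrived[THREADS * THREADS];
    upc_lock_t *shared pairlock[THREADS * THREADS];

    /* Synchronize the calling thread with `peer`: signal the peer, then
       wait until the peer has signaled us. Other threads are undisturbed. */
    void pair_sync(int peer)
    {
        int me = MYTHREAD;

        upc_lock(pairlock[peer * THREADS + me]);      /* signal the peer */
        arrived[peer * THREADS + me]++;
        upc_unlock(pairlock[peer * THREADS + me]);

        for (;;) {                                    /* wait for the peer */
            upc_lock(pairlock[me * THREADS + peer]);
            int n = arrived[me * THREADS + peer];
            if (n > 0)
                arrived[me * THREADS + peer] = n - 1; /* consume one signal */
            upc_unlock(pairlock[me * THREADS + peer]);
            if (n > 0)
                break;
        }
    }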

The MYSYNC synchronization mode has shown better performance than the ALLSYNC synchronization mode for several of the relocalization collective operations. Specifically, for an uneven computational load the MYSYNC implementation of upc_all_broadcast, upc_all_scatter, and upc_all_gather outperforms the ALLSYNC implementation. For upc_all_permute under even and uneven computational loads the MYSYNC implementation shows large improvements in run time. However, for upc_all_exchange and upc_all_gather_all the MYSYNC implementation runs slower than the ALLSYNC implementation.

Illustrative Test Cases for the UPC Memory Model

William Kuchera and Charles Wallace CS-TR-03-02

The memory model underlying UPC is an important but subtle aspect of the language, and everyone involved with it must understand its implications. The only resource currently available with detailed coverage of the memory model is the high-level description in the UPC specification [1]. As mentioned in our previous report [3], it is difficult to tie this description to actual behavior of a UPC program. UPC implementors and application developers must be able to readily distinguish program behavior that complies with the UPC specification from behavior that does not comply. For instance, implementors of UPC on a particular platform need to ensure that the optimizations they employ guarantee compliant behavior. Programmers exploiting relaxed consistency must have a grasp of what possible behaviors their programs may induce.

To this end, we have devised a set of test cases for the UPC memory model. These cases fall into two categories:

Compliance tests. These are examples of behavior that falls outside the UPC specification. They illustrate the consistency guarantees that UPC gives the programmer.

"Dark corner" tests. These are examples of acceptable behavior, according to the UPC specification, that may be surprising to the UPC novice. The UPC memory model is designed to be quite lax and allows some cases that programmers may not anticipate. These examples serve to highlight some of these "dark corners".

Our test cases are at the architecture level rather than the program level; that is, each thread's execution is expressed in terms of a sequence of read, write, fence, notify, and wait instructions, rather than UPC statements. Any references to "ordering," "following," "intervening," etc. refer to this thread-by-thread instruction ordering. Read and write instructions are assumed to be relaxed unless specified otherwise. For each case, we explain why it illustrates compliance or non-compliance, using the operational semantics devised in our previous report [3].
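To give a flavor of these cases, here is a classic architecture-level example in the style described above (an illustration of the category, not necessarily a case from the suite). Let x and y be shared locations, both initially 0, with all accesses relaxed:

    Thread 0:  write x = 1;  read y
    Thread 1:  write y = 1;  read x

The outcome in which both reads return 0 is impossible under sequential consistency, yet it is acceptable relaxed behavior under the UPC memory model: neither thread's write is ordered before the other thread's read. It is thus a "dark corner" rather than a compliance failure; making the accesses strict, or inserting a fence between the write and the read in each thread, rules the outcome out.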

Toward a Programmer-Friendly Formal Specification of the UPC Memory Model

William Kuchera and Dr. Charles Wallace CS-TR-03-01

As part of our efforts to elucidate the UPC memory model, we have closely examined the memory model definition given in the official UPC specification [3] (hereafter, "the UPC spec"). This document gives most of the relevant details about the memory model.

ASM 101: An Abstract State Machine Primer

James K. Huggins and Dr. Charles Wallace CS-TR-02-04

Over the last twenty years, Abstract State Machines (ASMs) have been used to describe and validate a wide variety of computing systems [9]. Numerous introductions to ASMs have been written (see for example [20, 10, 21, 36]). In addition, many ASM papers include brief overviews for the casual reader. In this paper we attempt to provide a gentler introduction, focusing more on the use of the technique than on formal definitions. Our target audience is a senior-level undergraduate class in software engineering, but we hope that this work proves helpful to other readers as well.

While there are references to "software" throughout, the reader should be aware that ASMs are relevant to other kinds of system development as well. For simplicity, our focus is on ASMs with only a single agent; distributed ASMs may be considered in a later paper.

MuPC: A Run Time System for Unified Parallel C

Jeevan Savant CS-TR-02-03

Unified Parallel C is an extension to the C programming language intended for parallel programming. UPC adds a small number of parallel programming constructs to C. These constructs enable users to write parallel programs based on a distributed shared memory model. MuPC (MTU's UPC) is a run time system for UPC, enabling it to run on a wide variety of architectures including networks of workstations, Beowulf clusters and shared memory machines.

Direct and Adjoint Sensitivity Analysis of Chemical Kinetic Systems with KPP: II - Theory and Software Tools

Dr. Adrian Sandu, D. Daescu and G.R. Carmichael CS-TR-02-02

The Kinetic PreProcessor KPP was extended to generate the building blocks needed for the direct and adjoint sensitivity analysis of chemical kinetic systems. An overview of the theoretical aspects of sensitivity calculations and a discussion of the KPP software tools are presented in the companion paper.

In this work the correctness and efficiency of the KPP generated code for direct and adjoint sensitivity studies are analyzed through an extensive set of numerical experiments. Direct decoupled Rosenbrock methods are shown to be cost-effective for providing sensitivities at low and medium accuracies. A validation of the discrete-adjoint evaluated gradients is performed against the finite difference gradients. Consistency between the discrete and continuous adjoint models is shown through step size reduction. The accuracy of the adjoint gradients is measured against a reference gradient value obtained with a standard direct-decoupled method. The accuracy is studied for both constant step size and variable step size integration of the forward model.

Applications of the KPP-1.2 software package to direct and adjoint sensitivity studies, variational data assimilation and parameter identification are considered for the comprehensive chemical mechanism SAPRC-99.

Direct and Adjoint Sensitivity Analysis of Chemical Kinetic Systems with KPP: I - Theory and Software Tools

Dr. Adrian Sandu, Dacian N. Daescu and Gregory R. Carmichael CS-TR-02-01

The analysis of comprehensive chemical reaction mechanisms, parameter estimation techniques, and variational chemical data assimilation applications requires the development of efficient sensitivity methods for chemical kinetics systems. The new release (KPP-1.2) of the Kinetic PreProcessor KPP contains software tools that facilitate direct and adjoint sensitivity analysis. The direct decoupled method, built using BDF formulas, has been the method of choice for direct sensitivity studies. In this work we extend the direct decoupled approach to Rosenbrock stiff integration methods. The need for Jacobian derivatives prevented Rosenbrock methods from being used extensively in direct sensitivity calculations; however, the new automatic differentiation and symbolic differentiation technologies make the computation of these derivatives feasible. Adjoint modeling is presented as an efficient tool to evaluate the sensitivity of a scalar response function with respect to the initial conditions and model parameters. In addition, sensitivity with respect to time-dependent model parameters may be obtained through a single backward integration of the adjoint model. KPP software may be used to completely generate the continuous and discrete adjoint models, taking full advantage of the sparsity of the chemical mechanism. Flexible direct-decoupled and adjoint sensitivity code implementations are achieved with minimal user intervention. In the companion paper [6] we present an extensive set of numerical experiments that validate the KPP software tools for several direct/adjoint sensitivity applications, and demonstrate the efficiency of KPP-generated sensitivity code implementations.
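For reference, a general s-stage Rosenbrock method for y' = f(y) has the standard textbook form (KPP's particular coefficient sets are not reproduced here):

    k_i = h\,f\Bigl(y_n + \sum_{j=1}^{i-1} \alpha_{ij} k_j\Bigr)
          + h\,J \sum_{j=1}^{i} \gamma_{ij} k_j, \qquad i = 1, \dots, s,

    y_{n+1} = y_n + \sum_{i=1}^{s} b_i k_i, \qquad J = \frac{\partial f}{\partial y}(y_n).

Each stage requires only the solution of a linear system with matrix I - h*gamma_ii*J, which is why inexpensive evaluation of the Jacobian (and, for direct sensitivities, of Jacobian derivatives) matters so much.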

A TCP/UDP Protocol Visualization Tool: Visual TCP/UDP Animator (VTA)

Ping Chen, Rong Ge, Yin Ma, Dr. Jean Mayo, Yuping Zhang, and Chunhua Zhao CS-TR-01-10

The TCP/UDP/IP/Ethernet protocol suite is commonly covered within an undergraduate curriculum. Often, this is the first exposure students have to a network protocol. We have found that students have difficulty understanding the unique function of these protocols and how the protocols are related. This is due at least in part to the fact that packet specification and construction takes place within the operating system, while covering a typical operating system protocol stack implementation would be beyond the scope or time constraints of many of these undergraduate courses. In order to address this problem we have written a visualization tool that illustrates the operation of TCP and UDP over the IP and Ethernet protocols using actual packets, captured via "libpcap," live from the network or stored in a file. The tool provides several views of captured data that, when used individually or in combination, illustrate the operation of these protocols.
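For readers unfamiliar with "libpcap," a minimal capture loop in C looks like the following (a generic sketch, not VTA's actual source; "eth0" is an assumed interface name):

    #include <pcap.h>
    #include <stdio.h>

    /* Called once per captured packet; VTA would parse the Ethernet, IP,
       and TCP/UDP headers here to drive its views. */
    static void on_packet(u_char *user, const struct pcap_pkthdr *hdr,
                          const u_char *bytes)
    {
        printf("captured %u bytes\n", hdr->caplen);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *h = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
        if (h == NULL) {
            fprintf(stderr, "%s\n", errbuf);
            return 1;
        }
        pcap_loop(h, 10, on_packet, NULL);   /* process ten packets */
        pcap_close(h);
        return 0;
    }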

Discretizing Aerosol Dynamics with B-Splines

Dr. Adrian Sandu and Christian T. Borden CS-TR-01-09

This paper presents a discretization technique for the particle dynamics equation based on B-spline interpolation of the solution. The method is developed in the general framework recently proposed by the authors, and extends the previous work of Pilinis and Seinfeld to arbitrary order splines and arbitrary knot sequences. Numerical tests include the coagulation-growth of the exponential distribution and of a cosine hill in logarithmic coordinates.

Reduced Atmospheric Chemistry Models by Sensitivity Matrix Approach

Dr. Adrian Sandu CS-TR-01-08

Simulation of chemical kinetics typically claims about 80 to 90% of the total computational time in three-dimensional atmospheric chemistry-transport models; to alleviate this computational burden, reduced chemical models are often employed. Reduced models are built via a dynamical analysis that results in the elimination of the fastest time scales, or by tuning a simpler model to reproduce the input-output relations of the kinetic system.

The sensitivity matrix is a linear entity associated with the dynamics of a nonlinear system. In this communication we show that sensitivity matrices can be used to identify low-dimensional linear evolution manifolds and to construct reduced kinetic models.

The Kinetic PreProcessor KPP - A Software Environment for Solving Chemical Kinetics

Valeriu Damian, Dr. Adrian Sandu, Mirela Damian, Florian Potra and Gregory Carmichael CS-TR-01-07

The KPP kinetic preprocessor is a software tool that assists the computer simulation of chemical kinetic systems. The concentrations of a chemical system evolve in time according to the differential law of mass action kinetics. A computer simulation requires the implementation of the differential system and its numerical integration in time.

KPP translates a specification of the chemical mechanism into FORTRAN or C simulation code that implements the concentration time derivative function and its Jacobian, together with a suitable numerical integration scheme. Sparsity in the Jacobian is carefully exploited in order to obtain computational efficiency.
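In standard notation (a compact summary, not KPP's literal output), the generated code implements the system

    \frac{dc}{dt} = f(t, c), \qquad J(t, c) = \frac{\partial f}{\partial c}(t, c),

where c is the vector of species concentrations, f is the time derivative function assembled from the reaction rates by the law of mass action, and J is the sparse Jacobian exploited above.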

KPP incorporates a library with several widely used atmospheric chemistry mechanisms; users can add their own chemical mechanisms to the library. KPP also includes a comprehensive suite of stiff numerical integrators. The KPP development environment is designed in a modular fashion and allows for rapid prototyping of new chemical kinetic schemes as well as new numerical integration methods.

Rosenbrock-Nystrom Integrators for SSODE of Multibody Dynamics

Dr. Adrian Sandu, D. Negrut, E.J. Haug, F.A. Potra and C. Sandu CS-TR-01-06

When performing dynamic analysis of a constrained mechanical system, a set of index-3 Differential-Algebraic Equations (DAE) describes the time evolution of the system model. The paper presents a state-space based method for the numerical solution of the resulting DAE. A subset of so-called independent generalized coordinates, equal in number to the number of degrees of freedom of the mechanical system, is used to express the time evolution of the mechanical system. The second order state-space ordinary differential equations (SSODE) that describe the time variation of independent coordinates are numerically integrated using a Rosenbrock type formula. For stiff mechanical systems, the proposed algorithm is shown to significantly reduce simulation times when compared to existing state-of-the-art algorithms. The better efficiency is due to the use of an L-stable integrator and a rigorous and general approach to providing the analytical derivatives it requires.

A Communication Library for the Parallelization of Air Quality Models on Structured Grids

Philip Miehe, Dr. Adrian Sandu, Gregory R. Carmichael, Youhua Tang, and Dacian Daescu CS-TR-01-05

PAQMSG is an MPI-based, Fortran 90 communication library for the parallelization of Air Quality Models on structured grids. It consists of distribution, gathering and repartitioning routines for different domain decompositions implementing a master-worker strategy. The library is architecture and application independent and includes optimization strategies for different architectures. This paper presents the library from a user perspective. Results are shown from the parallelization of STEM-III on Beowulf clusters.

The PAQMSG library is available on the web. The communication routines are easy to use, and should allow for an immediate parallelization of existing air quality models. PAQMSG can also be used for constructing new models.

A Spectral Method for Solving Aerosol Dynamics

Dr. Adrian Sandu CS-TR-01-04

This paper presents a discretization of the particle dynamics equation based on spectral polynomial interpolation of the solution. Numerical tests show that the method achieves spectral accuracy in linear coordinates, and is able to reproduce the reference solutions well in logarithmic coordinates.

A Framework for the Numerical Treatment of Aerosol Dynamics

Dr. Adrian Sandu and Christian Borden CS-TR-01-03

This paper presents a general framework for the discretization of the particle dynamics equation by projection methods, which include Galerkin and collocation techniques. Based on this framework, a discretization over piecewise polynomial spaces is discussed. Numerical examples are given using both linear and logarithmic coordinates; the results show that this discretization approach is able to accurately solve the aerosol size distribution using a relatively small number of "size bins."
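In outline (standard projection-method notation, not the paper's exact formulation), the size distribution n(v, t) is approximated in a finite basis and the residual of the dynamics equation is forced to vanish against a set of test functions:

    n(v,t) \approx \sum_{i=1}^{N} c_i(t)\,\phi_i(v), \qquad
    \int R(v,t)\,\psi_j(v)\,dv = 0, \quad j = 1, \dots, N,

where R is the residual obtained by substituting the expansion into the particle dynamics equation; the Galerkin choice takes \psi_j = \phi_j, while collocation takes \psi_j = \delta(v - v_j).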

A Newton-Cotes Quadrature Approach for Solving the Aerosol Coagulation Equation

Dr. Adrian Sandu CS-TR-01-02

This paper presents a direct approach to solving the aerosol coagulation equation. Newton-Cotes formulas are used to discretize the integral terms, and the semi-discrete system is built using collocation. A semi-implicit Gauss-Seidel time integration method is employed. The approach generalizes the semi-implicit method of Jacobson.
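The continuous coagulation equation being discretized has the standard Smoluchowski form (quoted here for orientation; the paper's exact statement may differ), whose two integral terms are the ones treated by the Newton-Cotes formulas:

    \frac{\partial n(v,t)}{\partial t}
      = \frac{1}{2}\int_0^{v} \beta(v-u,u)\,n(v-u,t)\,n(u,t)\,du
      - n(v,t)\int_0^{\infty} \beta(v,u)\,n(u,t)\,du,

where n is the number density and \beta is the coagulation kernel.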

The Location Consistency Memory Model and Cache Protocol: Specification and Verification

Dr. Charles Wallace, Guy Tremblay, and Jose Nelson Amaral CS-TR-01-01

We use the Abstract State Machine methodology to give formal operational semantics for the Location Consistency memory model and cache protocol. With these formal models, we prove that the cache protocol satisfies the memory model, but in a way that is strictly stronger than necessary, disallowing certain behavior allowed by the memory model.

Tools for Distributed Systems Instruction

Meghana Gothe and Dr. Jean Mayo CS-TR-00-02

Teaching a distributed systems class is a difficult task when students come from varied backgrounds. These systems are complex, with component processes distributed across a network. Most students have a sequential programming background, so distributed systems concepts require re-learning, making the teacher's job all the more difficult. Furthermore, distributed systems rely on message passing for communication between component processes, and the large amount of system-specific network interface detail obscures the underlying concepts. Tools that eliminate this obscurity are needed. Students are introduced to numerous protocols that form the foundation of distributed systems programming, but shortage of time prevents them from implementing them all. A simulation of these protocols that students can use for understanding is therefore needed.

With this motivation, we have implemented a tool that provides certain communication primitives and some important distributed system protocols. The communication primitives provide an easy interface that students can use in their assignments. Protocol execution dumps enable students to connect what they observe to the topics taught in class.

Time-Stepping Methods that Favor Positivity for Atmospheric Chemistry Modeling

Dr. Adrian Sandu CS-TR-00-01

Chemical kinetics conserves mass and renders non-negative solutions; a good numerical simulation would ideally produce mass-balanced, positive concentration vectors. Many time-stepping methods are mass conservative; however, unconditional positivity restricts the order of a traditional method to one. The projection method presented in [13] post-processes the solution of a one-step integration method to ensure mass conservation and positivity. It works best when the underlying time-stepping scheme favors positivity. In this paper several Rosenbrock-type methods that favor positivity are presented; they are designed such that their transfer functions, together with several derivatives, are non-negative for all real, negative arguments.

Adjoint Implementation of Rosenbrock Methods Applied to Variational Data Assimilation Problems

Dacian Daescu, Gregory Carmichael and Dr. Adrian Sandu CS-TR-99-06

In the past decade the variational method has become an attractive alternative to the traditional Kalman filter in data assimilation problems for atmospheric chemistry models. From the computational point of view its major advantage is that the gradient of the cost function is computed quickly, at the expense of a few function evaluations, making the optimization process very efficient. The drawbacks are the high storage requirements and the difficulty of implementing adjoint code when sophisticated integrators are used to solve the stiff chemistry. If the sparse structure of the models is carefully exploited, Rosenbrock solvers have proved a reliable alternative to classical schemes. In this paper we present an efficient implementation of the adjoint code for Rosenbrock-type methods which can reduce the storage requirements of the forward model and is suitable for automation.

Positive Numerical Integration Methods for Chemical Kinetic Systems

Dr. Adrian Sandu CS-TR-99-05

Mass action chemical kinetics conserves mass and renders non-negative solutions; a good numerical simulation would ideally produce a mass-balanced, positive numerical concentration vector. Many time-stepping methods are mass conservative; however, unconditional positivity restricts the order of a traditional method to one. The positive projection method presented in the paper ensures mass conservation and positivity. First, a numerical approximation is computed with one step of a mass-preserving traditional scheme. If there are negative components, the nearest vector in the reaction simplex is found using a primal-dual optimization routine; this vector is shown to better approximate the true solution. The technique works best when the underlying time-stepping scheme favors positivity. Projected methods are able to use larger integration time steps, making them more efficient than traditional methods for systems which are unstable outside the positive quadrant.
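In outline (standard notation; the paper's exact statement may differ), the projection step solves

    \min_{c} \; \lVert c - c^{*} \rVert
    \quad \text{subject to} \quad A\,c = A\,c^{*}, \;\; c \ge 0,

where c^* is the provisional step result with negative components and the linear constraints A c = A c^* encode conservation of mass; the nearest feasible point is found with the primal-dual optimization routine mentioned above.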

Stable Predicate Evaluation with Probabilistically Synchronized Clocks

Dr. Jean Mayo and Don Darling CS-TR-99-04

The evaluation of a global state in a distributed system is very difficult, since the individual processes that comprise the system share no common memory or time base, and can only communicate with each other via message passing. Although no common time base exists in these systems, the system clocks are often roughly synchronized within some known upper bound. In previous work, Mayo and Kearns developed a schema for evaluating a class of stable predicates on systems with a rough global time base. These global predicates are comprised of the conjunction of the local process states. The fact that the predicates are stable means that once they become true, they remain true forever. The ability to detect stable predicates in the system allows us to detect critical system states such as program termination and deadlock.

We propose to develop a protocol based on Mayo and Kearns's schema, which will also handle systems where clocks are synchronized with a probabilistic clock synchronization protocol. Also, our protocol will allow processes to enter and leave the system, facilitating its use in dynamic systems. Finally, we plan to implement our protocol on an actual system in order to see how it performs in general use.

Do Blending and Offsetting Commute for Dupin Cyclides?

Dr. Ching-Kuang Shene CS-TR-99-03

CS-TR-99-03 is no longer available. This work has been published in "Computer-Aided Geometric Design," Vol. 17, pp. 891-910, 2000.

Register Assignment for Software Pipelining with Partitioned Register Banks

Jason Hiser, Dr. Philip Sweany, Dr. Steven Carr, and Dr. Steven Beaty CS-TR 99-02

Trends in instruction-level parallelism (ILP) are putting increased demands on a machine's register resources. Computer architects are adding increased numbers of functional units to allow more instructions to be executed in parallel, thereby increasing the number of register ports needed. This increase in ports hampers access time. Additionally, aggressive scheduling techniques, such as software pipelining, are requiring more registers to attain high levels of ILP. These two factors make it difficult to maintain a single monolithic register bank.

One possible solution to the problem is to partition the register banks and connect each functional unit to a subset of the register banks, thereby decreasing the number of register ports required. In this type of architecture, if a functional unit needs access to a register to which it is not directly connected, a delay is incurred to copy the value so that the proper functional unit has access to it. As a consequence, the compiler now must not only schedule code for maximum parallelism but also for minimal delays due to copies.

In this paper, we present a greedy framework for generating code for architectures with partitioned register banks. Our technique is global in nature and can be applied using any scheduling method (e.g, trace scheduling, modulo scheduling) and any register assignment technique. We present an experiment using our partitioning techniques with modulo scheduling and Chaitin/Briggs register assignment. Our results show that it is possible to generate good code for partitioned-register-bank architectures.

Register Assignment for Architectures with Partitioned Register Banks

Jason Hiser, Dr. Philip Sweany and Dr. Steven Carr CS-TR 99-01

Modern computers take advantage of the instruction-level parallelism (ILP) available in programs with advances in both architecture and compiler design. Unfortunately, large amounts of ILP hardware and aggressive instruction scheduling techniques put great demands on a machine's register resources. With increasing ILP, it becomes difficult to maintain a single monolithic register bank and a high clock rate. The number of ports required for such a register bank severely hampers access time. To provide support for large amounts of ILP while retaining a high clock rate, registers can be partitioned among several different register banks. Each bank is directly accessible by only a subset of the functional units, with explicit inter-bank copies required to move data between banks. Therefore, a compiler must deal not only with achieving maximal parallelism via aggressive scheduling, but also with data placement to limit inter-bank copies. Our approach to code generation for ILP architectures with partitioned register resources provides flexibility by representing machine-dependent features as node and edge weights and by remaining independent of scheduling and register allocation methods. Preliminary experimentation with our framework has shown a degradation in execution performance of 10% on average when compared to an unrealizable monolithic-register-bank architecture with the same level of ILP. This compares very favorably with other approaches to the same problem.

An Efficient Intersection Detection Algorithm for Truncated Cylinders

Dr. Ching Kuang Shene and Yuan Zhao CS-TR 98-01

This paper presents an efficient algorithm for detecting the intersection of two truncated cylinders. A truncated cylinder is the solid bounded by a section of a cylinder and two cap disks whose planes are perpendicular to the axis of the cylinder. This paper characterizes the intersection and reduces the problem to the intersection detection of a disk and a truncated cylinder, which, in turn, is reduced to testing whether a degree-four polynomial has a real root in an interval. The method of choice is a case-by-case analysis. This type of geometric analysis has been known for its robustness and efficiency, although the program that implements the algorithm could be larger.

Instruction Manual for Schemm's Configurable Analyzer v0.1

Evan Schemm CS-TR 98-02

In a research environment, it is often useful to check the effects of changes or additions to existing architectures. Building each variation of a design used in a comparative study would be difficult, as well as prohibitively expensive. Usually a simulator is used to generate a trace of the executed instructions that can be run through an analyzer. This analyzer is designed to be configurable for a wide range of target architectures. Different architectural variations are specified via a configuration file, alleviating the need to recompile. The analyzer returns statistics on instruction and data cache hit rates, branch prediction, and total execution time.

An Optimal In-Place Array Rotation Algorithm

Dr. Ching-Kuang Shene CS-TR 97-01

This paper presents a simple optimal algorithm for rotating an array of n elements to the right D positions with n + gcd(n, D) assignments and one extra memory location. Its performance is compared against the HP STL implementation, which uses 3(n - gcd(n, D)) assignments, and a commonly seen algorithm that uses an auxiliary array.
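The algorithm is in the spirit of the classic cycle-following ("juggling") rotation; the sketch below illustrates how an n + gcd(n, D) assignment count arises (a minimal reconstruction from the abstract, not the paper's code):

    /* Rotate a[0..n-1] right by D positions using one temporary.
       Each of the gcd(n, d) cycles moves n/gcd elements with
       n/gcd + 1 assignments, for n + gcd(n, D) assignments in total. */
    static int gcd(int a, int b)
    {
        while (b) { int t = a % b; a = b; b = t; }
        return a;
    }

    void rotate_right(int *a, int n, int D)
    {
        int d = (n - D % n) % n;        /* right by D == left by n - D */
        if (d == 0) return;
        int g = gcd(n, d);
        for (int i = 0; i < g; i++) {
            int tmp = a[i], j = i;      /* start of one cycle */
            for (;;) {
                int k = j + d;
                if (k >= n) k -= n;
                if (k == i) break;
                a[j] = a[k];            /* shift within the cycle */
                j = k;
            }
            a[j] = tmp;                 /* close the cycle */
        }
    }

Note that gcd(n, n - D) = gcd(n, D), so working with the left-rotation step d does not change the assignment bound.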

Geometric Computing in the Undergraduate Computer Science Curricula

Dr. John Lowther and Dr. C.K. Shene CS-TR 97-02

Geometric computing is a rapidly evolving interdisciplinary field involving computer science, engineering and mathematics. It has relationships to many other areas within computer science such as computational geometry, computer-aided design, computer graphics, computer vision, geometric/solid modeling, robotics, and visualization. Due to its interdisciplinary nature, geometric computing also serves as a vehicle for engineering students to approach important topics in manufacturing processes (NC-machining, molding, and milling), geometric design/modeling (feature-based design and constraint solving), and computational metrology (sampling strategy and tolerancing design). Unfortunately, in a typical computer science curriculum, computing with geometry is virtually missing in spite of its tremendous impact on computer science and the other fields mentioned above and its importance in increasing students' employability.

This paper investigates this issue and suggests a possible remedy by designing an elementary and interdisciplinary intermediate level geometric computing course. The rationale, goals and objectives, and the impact and significance of the proposed course are discussed. A review of previous work, including textbooks, software systems, and other media materials, is included. Finally, after presenting the merit of designing this course, this paper proposes a sample ten-week syllabus with software modules for lab use.

A Fast Geometric Conic Reconstruction Algorithm

Dr. Ching-Kuang Shene and Dr. George Paschos CS-TR 97-03

Conics are the first curvilinear feature beyond points and line segments that are commonly used in computer vision. This paper addresses the problem of determining whether two conics, from two views, are the projections of a conic in space. If the answer is affirmative, this conic is recovered from the information given in the images and the camera positions. This problem is important in the study of stereo vision and shape from motion.

Unroll-and-Jam Using Uniformly Generated Sets

Dr. Steven Carr and Yiping Guan CS-TR 97-04

Modern architectural trends in instruction-level parallelism (ILP) have significantly increased the computational power of microprocessors. As a result, the demands on memory have increased. Unfortunately, memory systems have not kept pace. Even hierarchical cache structures are ineffective if programs do not exhibit cache locality. Because of this, compilers need to be concerned not only with finding ILP to utilize machine resources effectively, but also with ensuring that the resulting code has a high degree of cache locality.

One compiler transformation that is essential for a compiler to meet the above objectives is unroll-and-jam, or outer-loop unrolling. Previous work has either used a dependence-based model to compute unroll amounts, significantly increasing the size of the dependence graph, or has applied brute-force techniques. In this paper, we present an algorithm that uses linear-algebra-based techniques to compute unroll amounts; it saves 84% of the dependence-graph space needed by dependence-based techniques while retaining performance, and gives a more elegant solution.
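As a generic illustration of the transformation itself (not an example from the paper), unroll-and-jam unrolls an outer loop and fuses the resulting inner loops, exposing reuse across outer iterations:

    enum { N = 1024 };                  /* illustrative size */

    /* Before: b[j] is loaded once per (i, j) pair. */
    void before(double a[N], const double b[N], const double c[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i] += b[j] * c[i][j];
    }

    /* After unroll-and-jam by 2 (N assumed even): the outer loop is
       unrolled and the two inner bodies are jammed into one loop, so
       each load of b[j] now feeds two outer iterations. */
    void after(double a[N], const double b[N], const double c[N][N])
    {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < N; j++) {
                a[i]     += b[j] * c[i][j];
                a[i + 1] += b[j] * c[i + 1][j];
            }
    }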

On Parametric Representations of Cubic Dupin Cyclides

Dr. Ching-Kuang Shene CS-TR 96-01

Srinivas and Dutta recently constructed a parametric representation for cubic Dupin cyclides. They assumed two skew lines on a cubic Dupin cyclide are available and used the third intersection point of a moving line that touches both skew lines to generate a rational parameterization. In this note, it will be shown that this method is just another form of a popular circle parameterization. A new method based on the geometry of cubic Dupin cyclides will also be presented. Unlike Srinivas and Dutta's method, this new method does not require the cubic Dupin cyclide in a special orientation and a parametric representation is constructed directly without consulting an implicit equation. Implicitizing the parametric equations obtained from this new method yields an implicit equation.

Blending Two Cones with Dupin Cyclides

Dr. Ching-Kuang Shene CS-TR 96-02

This paper provides a complete theory of blending cones with Dupin cyclides and consists of four major contributions. First, a necessary and sufficient condition for two cones to have a blending Dupin cyclide is established. Except for a very special case in which the cones have a double line and parallel axes, two cones have a blending Dupin cyclide if and only if they have planar intersection. Second, based on the intersection structure of the cones, finer characterization results are possible. For example, all blending Dupin cyclides for two cones with a double line, distinct vertices and intersecting axes must be cubic. Third, a new construction algorithm that establishes a correspondence between points on one or two coplanar lines and all constructed blending Dupin cyclides makes the construction easy and well-organized. Fourth, the completeness of the construction algorithm is also proved. As a result, any blending Dupin cyclide can be constructed by the algorithm. Consequently, all blending Dupin cyclides are organized into one to four one-parameter families, each of which is ``parameterized'' by points on a special line. It is shown that each family contains an infinite number of ring cyclides, ensuring the existence of singularity-free blending surfaces. Finally, color plates are provided to illustrate various cyclides.

Eager Prediction

Spiridon A. Messaris and Dr. David A. Poplawski CS-TR 96-03

Speculative execution, in the form of branch prediction, has become a common feature in superscalar processors and has been shown to be essential in achieving high performance. Eager execution, the fetching, issuing and execution of instructions down both paths of a conditional branch, has the potential to improve performance even more, but suffers from several problems when several successive branches are encountered. In this paper we show that eager prediction, defined to be eager execution to a certain level followed by ordinary branch prediction afterwards, produces performance improvements over pure branch prediction alone when sufficient resources are available; pure prediction still wins on machines with small issue window sizes and few functional units. Given a sufficiently large machine, the marginal cost, in terms of functional unit needs and register and memory bandwidth requirements, is shown to be minimal.

Scalar Replacement and the Law of Diminishing Returns

Dr. Steve Carr and Qunyan Wu CS-TR 96-04

Scalar replacement is a transformation that replaces references to arrays with sequences of scalar temporaries to effect register allocation of array values. Scalar replacement algorithms with varying degrees of power have been proposed; however, no study has been done to show what degree of power is necessary in practice. Are simple algorithms enough? In addition, each algorithm may be applied using any number of dependence tests. More powerful tests should yield larger subsets of references that can be scalar replaced. Is this true?

This paper addresses these two questions. Our experiments show that nearly all opportunities for scalar replacement that exist in scientific benchmark code can be handled by simple algorithms. Also, powerful dependence analysis is not needed. Our results show that scalar replacement is an effective transformation that is cheap to implement.
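For readers unfamiliar with the transformation, a generic before/after example (not taken from the paper) shows the idea:

    /* Before: a[i] is re-referenced on every inner iteration, and standard
       data-flow analysis will not promote it to a register. */
    void before(int n, double a[n], const double b[n][n])
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i] = a[i] + b[i][j];
    }

    /* After scalar replacement: the scalar t carries the value across the
       inner loop, making it a candidate for a coloring-based register
       allocator. */
    void after(int n, double a[n], const double b[n][n])
    {
        for (int i = 0; i < n; i++) {
            double t = a[i];
            for (int j = 0; j < n; j++)
                t = t + b[i][j];
            a[i] = t;
        }
    }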

Planar Intersection, Common Inscribed Sphere and Blending with Dupin Cyclides

Dr. Ching-Kuang Shene CS-TR 96-05

In his paper ``Cyclides in Computer Aided Geometric Design'', Computer Aided Geometric Design, Vol. 7 (1990), pp. 221--242, Pratt announced the following result: two cones can be blended by a Dupin cyclide if and only if they have a common inscribed sphere. However, only a proof for the sufficient part is given. This paper fills this gap with a sequence of existential results. More precisely, for two cones with distinct axes and distinct vertices, it will be shown that the existence of a planar intersection, the existence of a common inscribed sphere, the existence of a pair of common tangent planes, and the existence of a blending Dupin cyclide are all equivalent. Moreover, the existence of a pair of common tangent planes is reduced to a test of equal heights, simplifying Piegl's conic intersection detection and calculation algorithm. Degenerate cases will also be addressed.

Rotating an Array In-Place

Dr. Ching-Kuang Shene CS-TR 96-06

A simple algorithm that is capable of rotating an array of n elements D positions to the right using n + gcd(n, D) assignments and no extra space is presented in this paper. Its performance with different GCD algorithms is compared against a commonly seen algorithm that uses an auxiliary array. Techniques used include regression analysis for establishing linear models and the Kolmogorov-Smirnov test for comparing performance data. The algorithm and the comparison techniques could serve as a model for CS2 students to carry out experimental analysis of algorithms.

Blending with Affine and Projective Dupin Cyclides

Dr. Ching-Kuang Shene CS-TR 96-07

This paper presents three new construction algorithms for blending a plane and a quadric cone with affine and projective Dupin cyclides. The first algorithm is capable of constructing an affine Dupin cyclide that blends a quadric cone (along an ellipse) and a plane. The second one is equivalent to the first but is based on geometric methods. The third one is able to construct a projective Dupin cyclide that blends a plane and a quadric cone along a hyperbola or a parabola. Thus, the constructed projective Dupin cyclide ``smooths/covers'' the intersection angle made by the given surfaces. All algorithms use affine or projective transformations to reduce the given configuration to a right cone and a plane so that the blending Dupin cyclide construction can be applied.

Modulo Scheduling with Cache-Reuse Information

Chen Ding, Dr. Steven Carr and Dr. Philip Sweany CS-TR 96-08

Instruction scheduling in general, and software pipelining in particular, face the difficult task of scheduling operations in the presence of uncertain latencies. The largest contributor to these uncertain latencies is the use of the cache memories required to provide adequate memory access speed in modern processors. Scheduling for instruction-level parallel architectures with non-blocking caches usually assigns memory access latency by assuming either that all accesses are cache hits or that all are cache misses. We contend that allowing memory latencies to be set by cache reuse analysis leads to better software pipelining than using either the all-hit or all-miss assumption. Using a simple cache reuse model in our modulo scheduling software pipelining optimization, we achieved a 10% improvement in execution performance over the all-cache-hit assumption and used 18% fewer registers than were required under an all-cache-miss assumption. In addition, we outline refinements to our simple reuse model that should allow modulo scheduling with reuse to achieve improved execution performance over the all-cache-miss assumption as well. Therefore, we conclude that software pipelining algorithms for target architectures with non-blocking cache, but without rotating register files, should use a memory-reuse latency model.

The Unlimited Resource Machine (URM)

Dr. David A. Poplawski CS-TR 95-01

The Unlimited Resource Machine (URM) is a hypothetical instruction set architecture that was designed specifically for the study of the architecture of, and code generation techniques for, machines exhibiting instruction-level parallelism. A set of tools has been implemented to support this machine model, including compilers, a (linking) assembler, simulators, and trace analyzers. The architecture and its tools are being used in a variety of research projects at MTU. This report describes the philosophy behind the URM, the architecture and instruction set, the tools and their interfaces, and how to use the tools to model various ILP architectures.

The Performance of Scalar Replacement on the HP 715/50

Dr. Steve Carr and Qunyan Wu CS-TR 95-02

Most conventional compilers fail to allocate array elements to registers because standard data-flow analysis treats arrays like scalars, making it impossible to analyze the definitions and uses of individual array elements. This deficiency is particularly troublesome for floating-point registers, which are most often used as temporary repositories for subscripted variables. This paper presents an experiment on an HP 715/50 with a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables. The scalar-replaced variables are more likely to be assigned to registers by the coloring-based register allocators found in most compilers than are their unreplaced counterparts. Experimental results show that scalar replacement is extremely effective. On kernels, integer-factor improvements over code generated by a good optimizing compiler of conventional design are possible.
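
A minimal sketch of the transformation, on an illustrative loop that is not taken from the paper:

    /* Before scalar replacement: standard data-flow analysis cannot
       register-allocate a[i], so it is loaded and stored on every
       iteration of the inner loop. */
    void before(double *a, const double *b, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i] = a[i] + b[j];
    }

    /* After scalar replacement: the reuse of a[i] is carried by the
       scalar t, which a coloring-based allocator can keep in a
       register for the whole inner loop. */
    void after(double *a, const double *b, int n) {
        for (int i = 0; i < n; i++) {
            double t = a[i];
            for (int j = 0; j < n; j++)
                t = t + b[j];
            a[i] = t;
        }
    }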

An Examination of the Behavior of Slice Based Cohesion Measures

Sakari Karstu and Dr. Linda M. Ott CS-TR 94-03

Is cohesion really indicative of other software attributes, such as reliability or maintainability, as the software community believes? This question has been difficult to answer because there have been no automated tools for measuring the cohesion of software modules. With the development of an automated tool for collecting cohesion data, it became possible to validate the measured cohesion against the original definition of cohesion using production software.

This study is one of the first to relate cohesion to other attributes of software, in this case revisions made to the software. It is commonly known that the number of revisions is strongly related to the length of the module. As a result of the study, we present data that supports our intuition that, in addition to length, the level of cohesion of a module is a factor in estimating the quality of the software. In particular, the data indicates a correlation between module cohesion and the revisions made to the module, suggesting that highly cohesive modules are less likely to need changes.

Compiler Optimizations for Improving Data Locality

Dr. Steven Carr, Dr. Kathryn S. McKinley (Univ. of Mass. at Amherst), and Dr. Chau-Wen Tseng (Stanford Univ.) CS-TR 94-02 / March 1994

In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations, consisting of loop permutation, loop distribution and loop fusion. We show that these transformations are required to optimize many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For complete applications, we executed the original and transformed versions, simulated cache hit rates, and collected statistics about the inherent characteristics of these programs and our ability to improve them. To our knowledge, these studies are the first with this breadth and depth.
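
The simplest of these compound transformations, loop permutation, can be illustrated directly; the loops below are illustrative and not taken from the paper. Since C arrays are row-major, interchanging the loops turns the spatial reuse of each cache line into actual hits.

    #define N 1024

    /* Before: the inner loop strides down a column, touching a new
       cache line of the row-major array on every iteration. */
    void col_order(double a[N][N]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 0.0;
    }

    /* After loop permutation: the inner loop walks contiguous memory,
       so each cache line is fully used before it is evicted. */
    void row_order(double a[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 0.0;
    }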

Overview of the ROCKET Retargetable C Compiler

Dr. Philip H. Sweany and Steven J. Beaty (Cray Computer Corporation) CS-TR 94-01

This report describes the ROCKET compiler, a highly optimizing C translator which is retargetable for a broad class of ILP architectures.

Improving the Ratio of Memory Operations to Floating-Point in Loops

Dr. Steven Carr and Dr. Ken Kennedy (Rice University) CS-TR 93-08

Over the past decade, microprocessor design strategies have focused on increasing the computational power on a single chip. Because computations often require more data from cache per floating-point operation than a machine can deliver and because operations are pipelined, idle computational cycles are common when scientific applications are run on modern microprocessors. To overcome these bottlenecks, programmers have learned to use a coding style that ensures a better balance between memory references and floating-point operations. In our view, this is a step in the wrong direction because it makes programs more machine-specific. A programmer should not be required to write a new program version for each new machine; instead, the task of specializing a program to a target machine should be left to the compiler.

But is our view practical? Can a sophisticated optimizing compiler obviate the need for the myriad programming tricks that have found their way into practice to improve memory-hierarchy performance? In this paper we attempt to answer that question. To do so, we develop and evaluate techniques that automatically restructure program loops to achieve high performance on specific target architectures. These methods attempt to balance computation and memory accesses and seek to eliminate or reduce pipeline interlock. To do this, they statically estimate the performance of each loop in a particular program and use these estimates to determine whether to apply various loop transformations.
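
One well-known transformation of this kind is unroll-and-jam, which unrolls an outer loop and fuses the resulting inner-loop copies so that loads are shared across iterations. The sketch below is illustrative (it assumes N is even) rather than an example from the paper.

    #define N 1024

    /* Before unroll-and-jam: each inner-loop iteration performs one
       multiply-add but loads both b[i][j] and c[j] from memory. */
    void before(double a[N], double b[N][N], const double c[N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i] += b[i][j] * c[j];
    }

    /* After unroll-and-jam of i by 2: c[j] is loaded once and used
       twice, improving the ratio of floating-point operations to
       memory operations.  Assumes N is even. */
    void after(double a[N], double b[N][N], const double c[N]) {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < N; j++) {
                double cj = c[j];
                a[i]     += b[i][j]     * cj;
                a[i + 1] += b[i + 1][j] * cj;
            }
    }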

Alignment of Three Sequences in Quadratic Space

Dr. Xiaoqiu Huang CS-TR 93-07

We present a dynamic programming algorithm for computing an optimal global alignment of three sequences in quadratic space. The algorithm has been implemented as a portable C program. The program makes it possible to align any three protein sequences simultaneously on workstations with tens of megabytes of main memory. This rigorous program is used as a benchmark to evaluate the performance of a heuristic multiple sequence alignment program on protein sequences from the Protein Identification Resource protein sequence database. Experimental results indicate that the rigorous program may perform significantly better than the heuristic program on three sequences of weak similarity. We also describe algorithms for computing local alignments among three sequences in quadratic space.
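
For concreteness, the textbook sum-of-pairs recurrence for three sequences a, b, c (written here with gap symbol - and pair cost w, under a simplified linear-gap model; the paper's actual cost formulation may differ) is

    $$ D_{i,j,k} = \min \begin{cases}
       D_{i-1,j-1,k-1} + w(a_i,b_j) + w(a_i,c_k) + w(b_j,c_k) \\
       D_{i-1,j-1,k} + w(a_i,b_j) + w(a_i,-) + w(b_j,-) \\
       D_{i-1,j,k-1} + w(a_i,c_k) + w(a_i,-) + w(c_k,-) \\
       D_{i,j-1,k-1} + w(b_j,c_k) + w(b_j,-) + w(c_k,-) \\
       D_{i-1,j,k} + 2\,w(a_i,-) \\
       D_{i,j-1,k} + 2\,w(b_j,-) \\
       D_{i,j,k-1} + 2\,w(c_k,-)
    \end{cases} $$

Since plane i of D depends only on plane i-1, the optimal score can be computed while keeping just two planes, i.e., in quadratic space; recovering the alignment itself within the same bound is presumably done with a Hirschberg-style divide-and-conquer, the standard device for such space reductions.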

A Context Dependent Method for Comparing Sequences

Dr. Xiaoqiu Huang CS-TR 93-06

A scoring scheme is presented to measure the similarity score between two biological sequences, where matches are weighted depending on their context. The scheme generalizes a widely used scoring scheme. A dynamic programming algorithm is developed to compute a largest-scoring alignment of two sequences of lengths m and n in O(mn) time and O(m+n) space. Also developed is an algorithm for computing a largest-scoring local alignment between two sequences in quadratic time and linear space. Both algorithms are implemented as portable C programs. An experiment is conducted to compare protein alignments of the new global alignment program with those of an existing program.

An Algorithm for Identifying Regions of a DNA Sequence that Satisfy a Content Requirement

Dr. Xiaoqiu Huang CS-TR 93-05

We present a dynamic programming algorithm for identifying regions of a DNA sequence that meet a user-specified compositional requirement. Applications of the algorithm include finding C+G-rich regions, locating TA+CG-deficient regions, identifying CpG islands, and finding regions rich in periodic three-base patterns. The algorithm has an advantage over the simple window method in that it shows the exact location of each identified region. The algorithm has been implemented as a portable C program called LCP (Local Content Program). LCP is extremely efficient in computer time and memory; it instantly locates all regions of a long DNA sequence meeting a given requirement. The LCP program was used to analyze the rabbit α-like globin gene cluster sequence.
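
The reduction behind such content requirements can be sketched briefly: to ask for regions whose C+G fraction is at least p, weight each C or G as 1-p and every other base as -p; a region then meets the requirement exactly when its total weight is non-negative. The C scan below finds only the single highest-scoring region and is an illustrative simplification, not the published LCP algorithm, which locates all such regions.

    #include <stdio.h>
    #include <string.h>

    /* Report the highest-scoring region under the C+G >= p weighting.
       A Kadane-style linear scan; illustrative only. */
    void best_cg_region(const char *s, double p) {
        int n = (int)strlen(s);
        int start = 0, best_lo = 0, best_hi = -1;
        double sum = 0.0, best = 0.0;
        for (int i = 0; i < n; i++) {
            sum += (s[i] == 'C' || s[i] == 'G') ? 1.0 - p : -p;
            if (sum > best) {
                best = sum;          /* new best-scoring region */
                best_lo = start;
                best_hi = i;
            }
            if (sum < 0.0) {         /* prefix can't help: restart */
                sum = 0.0;
                start = i + 1;
            }
        }
        if (best_hi >= 0)
            printf("best C+G-rich region: [%d, %d]\n", best_lo, best_hi);
    }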

Broadcast and Complete Exchange Algorithms for Mesh Topologies

Siddhartha Takkella and Dr. Steven Seidel CS-TR 93-04

The availability of multicomputers with a mesh interconnection network has created a need for efficient global communication algorithms for that architecture. The broadcast and the complete exchange are the two communication problems of primary interest in this work. Improved complete exchange algorithms have been developed for meshes of arbitrary size, and for square meshes in particular. Motivated by the efficiency of pipelined broadcast algorithms for hypercubes, effort was concentrated on developing similar algorithms for meshes, resulting in an improved pipelined broadcast algorithm. Performance results from the Intel Delta mesh are also given for one of the new complete exchange algorithms.
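
The pipelining idea itself can be sketched independently of the mesh details. Below is a minimal chain broadcast written against MPI purely for portability (an assumption on our part; the original work predates MPI and targeted Intel message-passing primitives): the root streams fixed-size chunks down the chain 0 -> 1 -> ... -> p-1, so every link stays busy and the total time is roughly (chunks + p - 2) chunk-send times rather than (p - 1) full-message times.

    #include <mpi.h>

    /* Pipelined broadcast along a chain of ranks.  Assumes MPI_Init
       has been called and buf holds the message at rank 0. */
    void chain_bcast(char *buf, int len, int chunk, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        for (int off = 0; off < len; off += chunk) {
            int k = (off + chunk <= len) ? chunk : len - off;
            if (rank > 0)               /* receive chunk from upstream */
                MPI_Recv(buf + off, k, MPI_CHAR, rank - 1, 0, comm,
                         MPI_STATUS_IGNORE);
            if (rank < size - 1)        /* forward chunk downstream */
                MPI_Send(buf + off, k, MPI_CHAR, rank + 1, 0, comm);
        }
    }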

Profiling the Communication Workload of an iPSC/860

Brian VanVoorst and Dr. Steven Seidel CS-TR 93-03

A communication profiler was developed for the Intel iPSC/860 hypercube. The profiler was implemented as a kernel modification so that communication traffic could be monitored transparently. The profiler was used to gather information about the communication traffic in applications that comprised an iPSC/860's daily workload. Since communication is an expensive operation in a MIMD machine, the profiler was designed to show where those costs were concentrated and to show where efforts to improve communication performance should be directed. The 128-node iPSC/860 at the NASA Ames Research Center was profiled for several days. Summaries of data are presented here, along with recommendations that include proposed changes to the long message protocol to reduce transmission latency.

A Compiler Blockable Algorithm for QR Decomposition

Dr. Steve Carr and Dr. Richard Lehoucq (Rice University) CS-TR 93-02

Over the past decade, microprocessor design strategies have focused on increasing the computational power on a single chip. Unfortunately, memory speeds have not kept pace, resulting in an imbalance between computation speed and memory speed. This imbalance is leading machine designers to use more complicated memory hierarchies. In turn, programmers are explicitly restructuring codes to perform well on particular memory systems, leading to machine-specific programs. The compiler should automatically perform restructuring well enough to obviate the need for hand-optimization, thus allowing machine-independent expression of algorithms with the expectation of good performance.

This paper extends previous work on the ability of a compiler to restructure code automatically to improve memory-hierarchy performance. Specifically, it extends work on the automatic derivation of block algorithms from their corresponding point algorithms. Previous work showed that the compiler could not automatically derive the block algorithm for QR factorization found in LAPACK from its corresponding point algorithm, because the block LAPACK version is a new algorithm rather than a reshaping of the point algorithm. This paper presents a compiler-derivable block algorithm. On smaller array sizes, the algorithm is shown to be effective enough to obviate the need for the tediously hand-optimized algorithm found in LAPACK implementations. By relying on the compiler to do the optimization, the algorithm remains machine-independent and retains high performance across multiple architectures without programmer intervention.
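
The flavor of a compiler-derivable block algorithm can be seen in the simplest case, blocked matrix multiplication, which a compiler obtains mechanically by strip-mining two loops and interchanging. This is only a generic illustration of blocking, not the paper's QR algorithm, and it assumes N is a multiple of the tile size B.

    #define N 512
    #define B 64   /* tile size; illustrative */

    /* Point algorithm: each row c[k][..] is reused across all i, but
       for large N it is evicted from cache between uses. */
    void mm_point(double a[N][N], double b[N][N], double c[N][N]) {
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++)
                for (int j = 0; j < N; j++)
                    a[i][j] += b[i][k] * c[k][j];
    }

    /* Block algorithm derived by strip-mining k and j and moving the
       tile loops outward: each B-by-B tile of c stays resident in
       cache while every row of a and b is streamed past it. */
    void mm_block(double a[N][N], double b[N][N], double c[N][N]) {
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = 0; i < N; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            a[i][j] += b[i][k] * c[k][j];
    }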

Please send questions and comments about this CS Web Page to cswebmaster@mtu.edu
Department of Computer Science
Last Updated: Tuesday, February 28, 2007