Chapter 5. Optimizing Resource Management with Multistreaming

Oliver Hoeller, Piranha Bytes
Kurt Pelzer, Piranha Bytes

The vertex stream for animation data (the A-stream) holds the data needed to build up weights for animation skinning with bones in hardware. A varying number of weights per bone can be used, so no more memory is consumed than is actually needed.
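
As a hedged illustration of how a variable weight count might map to a vertex layout, here is a Direct3D 9-style sketch. The helper, and its use of a single element for up to four weights, are assumptions for illustration, not the engine's actual code; six weights, as used in Gothic III, would span two such elements.

    #include <d3d9.h>

    // Build a blend-weight element for the A-stream sized to the number of
    // weights this mesh actually uses (1-4 floats shown). Using the smallest
    // type avoids wasting vertex-buffer memory.
    D3DVERTEXELEMENT9 makeWeightElement(WORD stream, unsigned weightCount)
    {
        static const D3DDECLTYPE types[4] = {
            D3DDECLTYPE_FLOAT1, D3DDECLTYPE_FLOAT2,
            D3DDECLTYPE_FLOAT3, D3DDECLTYPE_FLOAT4
        };
        D3DVERTEXELEMENT9 e;
        e.Stream     = stream;
        e.Offset     = 0;
        e.Type       = static_cast<BYTE>(types[weightCount - 1]);
        e.Method     = static_cast<BYTE>(D3DDECLMETHOD_DEFAULT);
        e.Usage      = static_cast<BYTE>(D3DDECLUSAGE_BLENDWEIGHT);
        e.UsageIndex = 0;
        return e;
    }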

In Gothic III, we use up to six weights per bone for our animation system. An index buffer corresponding to the above vertex buffers is also built up. Indices are stored in a mesh object as a separate stream of information. The bit size of the index entries, whether 16-bit or 32-bit, depends on the number of vertices: 32-bit (DWORD) indices are needed only for meshes with more than 65,535 vertices; otherwise, 16-bit (WORD) indices suffice.
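
Choosing the index size then reduces to a single comparison. A minimal sketch (the helper name is illustrative):

    #include <cstdint>

    enum class IndexFormat { Bits16, Bits32 };

    // 16-bit indices can address vertices 0..65535. Some pipelines reserve
    // 0xFFFF as a strip-cut/sentinel value, so treating 65,535 as the limit
    // is the conservative choice.
    IndexFormat chooseIndexFormat(std::uint32_t vertexCount)
    {
        return (vertexCount <= 65535u) ? IndexFormat::Bits16
                                       : IndexFormat::Bits32;
    }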

Choosing the smallest possible data type helps reduce an application's memory footprint. As an example of using the streams separately, we execute the G-stream (that is, the geometry stream) as a separate z-pass to achieve fast z-rejects in hardware and to use this buffer in conjunction with an occlusion-query system implemented in the render system. It is possible to include the A-stream (the animation stream) in the calculation, but the effort isn't worthwhile in most cases.
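
A sketch of what such a z-only pass might look like on a Direct3D 9-style device. Variable names are illustrative, and the real renderer would batch this per mesh:

    // Bind only the G-stream; the T- and A-streams are left out entirely.
    device->SetStreamSource(0, geometryVB, 0, sizeof(GeometryVertex));
    device->SetIndices(indexBuffer);

    // Disable color writes so the hardware lays down depth as fast as
    // possible; later passes then benefit from early z-rejects.
    device->SetRenderState(D3DRS_COLORWRITEENABLE, 0);
    device->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);

    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                                 vertexCount, 0, triangleCount);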

The number of pixels that differ because of the animation is typically small and thus out of proportion to the additional rendering cost of adding animation. Individual streams are activated by the renderer depending on the type of pass (solid rendering or z-pass). In this chapter, we have shown how current applications can overcome problems caused by the growing amount of geometry data in scenes.

We've discussed a flexible model that gives the application more control over the data and drives the detected hardware optimally by combining two powerful techniques: multistreaming and optimized resource management. For additional information about programming one or more streams in DirectX, see the Microsoft Web site.


One of the most difficult problems in modern real-time graphics applications is the massive amount of data that must be managed. To cope with it, we split each mesh's vertex data into several independent streams that can be bound individually. Here are the streams and their subtasks:

G—Vertex stream for geometry data. Contains vertex position, normal, and vertex color(s).

T—Vertex stream for texture-mapping data. Holds texture coordinate sets and additional information such as tangent vectors for tangent-space normal maps.

A—Vertex stream for animation data. Holds the data used for animation skinning with bones, such as the per-vertex bone weights described above.

Once the streams are handed off for drawing, the pipeline stages work on them in parallel. While the CPU is busy handing off geometry and state information to the GPU, the GPU's vertex processor can process each vertex that arrives, transforming it and so forth. Simultaneously, the rasterizer can convert groups of transformed vertices into fragments (potentially many fragments), queuing them up for processing by the fragment processor.

Notice that the relative amount of work at each stage of the pipeline is typically increasing: a few vertices can result in the generation of many fragments, each of which can be expensive to process.

Given this relative increase in the amount of work done at each stage, it is helpful to view the stages conceptually as a series of nested loops, even though each loop operates in parallel with the others, as just described. These conceptual nested loops work as shown in the pseudocode below.
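
The numbered listing is not reproduced in this excerpt, so the following C-style pseudocode is a reconstruction of the idea:

    // Conceptual view only: on real hardware these "loops" are separate
    // stages running in parallel, but the nesting shows relative frequency.
    for (each frame) {                       // CPU: per-frame work
        for (each primitive) {               // CPU/GPU: per-primitive work
            for (each vertex) {              // vertex processor
                runVertexProgram();
            }
            for (each fragment) {            // fragment processor: innermost,
                runFragmentProgram();        // executed most frequently
            }
        }
    }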

For each operation we perform, we must be mindful of how computationally expensive that operation is and how frequently it is performed. In a normal CPU program, this is fairly straightforward. With an actual series of nested loops (as opposed to the merely conceptual nested loops seen here), it's easy to see that a given expression inside an inner loop is loop-invariant and can be hoisted out to an outer loop and computed less frequently.
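
For example, in ordinary C++ (variable names are illustrative), a loop-invariant product is computed once per row instead of once per element:

    for (int i = 0; i < rows; ++i) {
        // Invariant across the inner loop: hoisted out and computed
        // rows times instead of rows * cols times.
        float rowScale = scale[i] * globalGain;
        for (int j = 0; j < cols; ++j) {
            out[i * cols + j] = in[i * cols + j] * rowScale;
        }
    }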

Inner-loop branching in CPU programs is often avoided for similar reasons; the branch is expensive, and if it occurs in the inner loop, then it occurs frequently. When writing GPU programs, it is particularly crucial to minimize the amount of redundant work.

Naturally, all of the same techniques discussed previously for reducing computational frequency in CPU programs apply to GPU programs as well.

But given the nature of GPU programming, each of the conceptual nested loops above is actually a separate program running on different hardware, and possibly even written in a different programming language. That separation makes it easy to overlook some of these sorts of optimizations.

The first mistake a new GPU programmer is likely to make is to needlessly recompute values that vary linearly or are uniform across the geometric primitives inside a fragment program. Texture coordinates are a prime example. They vary linearly across the primitive being drawn, and the rasterizer interpolates them automatically.

But when multiple related texture coordinates are used (such as offset and neighbor coordinates), a common mistake is to compute the related coordinates from the base coordinate inside the fragment program. This results in a possibly expensive computation being performed very frequently.

It would be much better to move the computation of the related texture coordinates into the vertex program. Though this effectively just shifts the load around (and interpolation in the rasterizer is still a per-fragment operation), the question is how much work is being done at each stage of the pipeline and how often that work must be done.

Either way we do it, the result will be a set of texture coordinates that vary linearly across the primitive being drawn. But interpolation is often a lot less computationally expensive than recomputation of a given value on a per-fragment basis. As long as there are many more fragments than vertices, shifting the bulk of the computation so that it occurs on a per-vertex rather than a per-fragment basis makes sense.
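
A Cg-style sketch of the vertex-program half of this fix. The names baseUV and texelSize are assumptions; texelSize would be supplied as a uniform:

    void vertexMain(float4 position        : POSITION,
                    float2 baseUV          : TEXCOORD0,
                    uniform float4x4 modelViewProj,
                    uniform float2   texelSize,
                    out float4 oPos : POSITION,
                    out float2 oUV0 : TEXCOORD0,   // base coordinate
                    out float2 oUV1 : TEXCOORD1,   // left neighbor
                    out float2 oUV2 : TEXCOORD2)   // right neighbor
    {
        oPos = mul(modelViewProj, position);
        // Computed once per vertex; the rasterizer interpolates all three
        // across the primitive, so the fragment program does no arithmetic.
        oUV0 = baseUV;
        oUV1 = baseUV - float2(texelSize.x, 0.0);
        oUV2 = baseUV + float2(texelSize.x, 0.0);
    }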

It is worth reemphasizing, however, that any value that varies linearly across the domain can be computed in this way, regardless of whether it will eventually be used to index into a texture.

Herein lies one of the keys to understanding GPU programming: the names that "special-purpose" GPU features go by are mostly irrelevant as long as you understand how they correspond to general-purpose concepts.

To take the concept of hoisting loop-invariant code a step further, some values are best precomputed on the CPU rather than on the GPU. Any value that is constant across the geometry being drawn can be factored all the way out to the CPU and passed to the GPU program as a uniform parameter.
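
For instance, suppose a program needs size squared scaled by a constant, where size is already a uniform. A small host-side sketch using the Cg runtime (the parameter names are illustrative):

    // Computed once per draw call on the CPU, rather than once per
    // vertex or fragment on the GPU.
    float sizeSqScaled = size * size * kScale;
    cgSetParameter1f(sizeSqScaledParam, sizeSqScaled);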

Although size and even size squared have semantic meaning, size squared times an arbitrary constant has little, so nothing is lost by passing the combined value as its own uniform. In the more classic sense, "precomputation" means computation that is done offline in advance—the classic storage-versus-computation trade-off. This concept also maps readily onto GPUs: functions with a constant-size domain and range that are constant across runs of an algorithm—even if they vary in complex ways based on their input—can be precomputed and stored in texture maps.

Texture maps can be used for storing functions of one, two, or three variables over a finite domain as 1D, 2D, or 3D textures. Textures are usually indexed by floating-point texture coordinates between 0 and 1. The range is determined by the texture format used; 8-bit texture formats can only store values in the range [0, 1], but floating-point textures provide a much larger range of possible values.
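
A host-side sketch of baking a one-variable function into a 1D lookup texture, in Direct3D 9 style; the function baked here, x^2.2, is an arbitrary stand-in:

    #include <cmath>
    #include <d3d9.h>

    IDirect3DTexture9* bakeFunctionTexture(IDirect3DDevice9* device, UINT width)
    {
        IDirect3DTexture9* tex = 0;
        // A width x 1 single-channel float texture: full float range.
        device->CreateTexture(width, 1, 1, 0, D3DFMT_R32F,
                              D3DPOOL_MANAGED, &tex, 0);
        D3DLOCKED_RECT lr;
        tex->LockRect(0, &lr, 0, 0);
        float* texels = static_cast<float*>(lr.pBits);
        for (UINT i = 0; i < width; ++i) {
            float x = (i + 0.5f) / width;     // sample at texel centers
            texels[i] = std::pow(x, 2.2f);    // the function being baked
        }
        tex->UnlockRect(0);
        return tex;   // bind with linear filtering to interpolate entries
    }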

Textures can store up to four channels, so you can encode as many as four separate functions in the same texture. Texture lookups also provide filtering (interpolation), which you can use to get piecewise-linear approximations to values in between the table entries. As an example, suppose we had a fragment program that we wanted to apply to a checkerboard: half of the fragments of a big quad, the "red" ones, would be processed in one pass, while the other half, the "black" fragments, would be processed in a second pass.

But how would the fragment program determine whether the fragment it was currently processing was red or black? One way would be to use modulo arithmetic on the fragment's position, as sketched below: roughly speaking, place is the location of the fragment modulo 2, and mask selects between the two sets of squares based on whether the x and y parities match.
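
The original listing is not reproduced in this excerpt; a Cg-style reconstruction of the idea might look like this:

    float4 main(float2 winPos : WPOS) : COLOR
    {
        // place.x and place.y are each 0 or 1: the parity of the position.
        float2 place = floor(fmod(winPos, 2.0));
        // mask is 1 on "red" squares (parities differ), 0 on "black" ones.
        float mask = abs(place.x - place.y);
        return float4(mask.xxx, 1.0);
    }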

But clearly this is all a ridiculous amount of work for a seemingly simple task. It's much easier to precompute a checkerboard texture that stores a 0 in black texels and a 1 in red texels. Then we can skip all of the preceding arithmetic, replacing it with a single texture lookup and providing a substantial speedup. What we're left with is sketched below.
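
Assuming a 2 x 2 checkerboard texture called checkerTex (an illustrative name), sampled with wrap addressing and nearest filtering, the program collapses to a single fetch:

    float4 main(float2 winPos : WPOS,
                uniform sampler2D checkerTex) : COLOR
    {
        // Scaling by 0.5 maps each screen pixel onto one texel of the
        // repeating 2 x 2 pattern.
        float mask = tex2D(checkerTex, winPos * 0.5).x;
        return float4(mask.xxx, 1.0);
    }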

However, although table lookups were a win in terms of performance in this case, that's not always going to be true. Many GPU applications—particularly those that use a large number of four-component floating-point texture lookups—are memory-bandwidth-limited, so the introduction of an additional texture read in order to save a small amount of computation might in fact be a loss rather than a win. Furthermore, texture cache coherence is critical; a lookup table that is accessed incoherently will thrash the cache and hurt performance rather than help it. But if enough computation can be "pre-baked," or if the GPU programs in question are compute-limited already, and if the baked results are read from the texture in a spatially coherent way, table lookups can improve performance substantially.

In CPU programming, it is often desirable to avoid branching inside inner loops. This usually involves making several copies of the loop, with each copy acting on a subset of the data and following the execution path specific to that subset. This technique is sometimes called static branch resolution or substreaming.

The same concept applies to GPUs. Because a fragment program conceptually represents an inner loop, applying this technique requires a fragment program containing a branch to be divided into multiple fragment programs without the branch. The resulting programs each account for one code execution path through the original, monolithic program. This technique also requires the application to subdivide the data, which for the GPU means rasterization of multiple primitives instead of one.

A typical example is a 2D grid where data elements on the boundary of the grid require special handling that does not apply to interior elements. In this case, it is preferable to create two separate fragment programs—a short one that does not account for boundary conditions and a longer one that does—and draw a filled-in quad over the interior elements and an outline quad over the boundary elements, as sketched below.
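
A host-side sketch of that decomposition in Direct3D 9 style. The two shader handles and the quad-drawing helpers are hypothetical, shown only to make the two-pass structure concrete:

    // Interior: the short program, no boundary checks, over the inside cells.
    device->SetPixelShader(interiorProgram);
    drawFilledQuad(device, 1, 1, gridSize - 2, gridSize - 2);   // hypothetical helper

    // Boundary: the longer program over a one-cell-thick outline.
    device->SetPixelShader(boundaryProgram);
    drawOutlineQuad(device, 0, 0, gridSize - 1, gridSize - 1);  // hypothetical helper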

An easily overlooked or underutilized feature of GPU programming is the swizzle operator. Because all registers on the GPU are four-vectors but not all instructions take four-vectors as arguments, some mechanism for creating other-sized vectors out of these four-vector registers is necessary. The swizzle operator provides this functionality. It is syntactically similar to the C concept of a structure member access, but it has the additional interesting property that data members can be rearranged, duplicated, or omitted in arbitrary combinations, as shown in the example below.
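
The example the text refers to is not reproduced in this excerpt; in Cg-style syntax it would look something like this:

    float4 v = float4(1.0, 2.0, 3.0, 4.0);

    float3 a = v.xyz;    // select:            a = (1, 2, 3)
    float3 b = v.zyx;    // rearrange:         b = (3, 2, 1)
    float4 c = v.xxyy;   // duplicate:         c = (1, 1, 2, 2)
    float2 d = v.wx;     // omit and reorder:  d = (4, 1)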

The swizzle operator has applications in the computational frequency realm. Computing the second and third texture coordinates from the first is a job best left to the vertex program and rasterizer, as has already been discussed.

But the rasterizer can interpolate four-vectors as easily as it can two-vectors, so there is no reason that these three texture coordinates should have to occupy three separate interpolants. Because the related coordinates share components, all three can actually be packed into, and interpolated as, a single four-vector.
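
Continuing the earlier neighbor-coordinate sketch (where all three coordinates share a row, so only four distinct scalars exist), the packing and unpacking might look like this in Cg-style code:

    // Vertex program side: one four-vector interpolant instead of three.
    float4 packed = float4(baseUV.x,                 // base column
                           baseUV.x - texelSize.x,   // left-neighbor column
                           baseUV.x + texelSize.x,   // right-neighbor column
                           baseUV.y);                // shared row

    // Fragment program side: swizzles reconstruct the three 2D coordinates.
    float2 uvBase  = packed.xw;
    float2 uvLeft  = packed.yw;
    float2 uvRight = packed.zw;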
