CUDA Application Design and Development

€ 46,35
epub eBook
Available immediately (download)
October 2011



As the computer industry retools to leverage massively parallel graphics processing units (GPUs), this book is designed to meet the needs of working software developers who need to understand GPU programming with CUDA and increase efficiency in their projects. CUDA Application Design and Development starts with an introduction to parallel computing concepts for readers with no previous parallel experience, and focuses on issues of immediate importance to working software developers: achieving high performance, maintaining competitiveness, analyzing CUDA benefits versus costs, and determining application lifespan.

The book then details the thought behind CUDA and teaches how to create, analyze, and debug CUDA applications. Throughout, the focus is on software engineering issues: how to use CUDA in the context of existing application code, with existing compilers, languages, software tools, and industry-standard API libraries.

Using an approach refined in a series of well-received articles in Dr. Dobb's Journal, author Rob Farber takes the reader step by step from fundamentals to implementation, moving from language theory to practical coding.

  • Includes multiple examples building from simple to more complex applications in four key areas: machine learning, visualization, vision recognition, and mobile computing
  • Addresses the foundational issues for CUDA development: multi-threaded programming and the GPU's different memory hierarchy
  • Includes teaching chapters designed to give a full understanding of CUDA tools, techniques, and structure
  • Presents CUDA techniques in the context of the hardware they are implemented on as well as other styles of programming that will help readers bridge into the new material


1 Front Cover
2 CUDA Application Design and Development
3 Copyright
4 Dedication
5 Table of Contents
6 Foreword
7 Preface
8 Chapter 1: First Programs and How to Think in CUDA
  8.1 Source Code and Wiki
  8.2 Distinguishing CUDA from Conventional Programming with a Simple Example
  8.3 Choosing a CUDA API
  8.4 Some Basic CUDA Concepts
  8.5 Understanding Our First Runtime Kernel
  8.6 Three Rules of GPGPU Programming
    8.6.1 Rule 1: Get the Data on the GPU and Keep It There
    8.6.2 Rule 2: Give the GPGPU Enough Work to Do
    8.6.3 Rule 3: Focus on Data Reuse within the GPGPU to Avoid Memory Bandwidth Limitations
  8.7 Big-O Considerations and Data Transfers
  8.8 CUDA and Amdahl's Law
  8.9 Data and Task Parallelism
  8.10 Hybrid Execution: Using Both CPU and GPU Resources
  8.11 Regression Testing and Accuracy
  8.12 Silent Errors
  8.13 Introduction to Debugging
  8.14 UNIX Debugging
    8.14.1 NVIDIA's cuda-gdb Debugger
    8.14.2 The CUDA Memory Checker
    8.14.3 Use cuda-gdb with the UNIX ddd Interface
  8.15 Windows Debugging with Parallel Nsight
  8.16 Summary
9 Chapter 2: CUDA for Machine Learning and Optimization
  9.1 Modeling and Simulation
    9.1.1 Fitting Parameterized Models
    9.1.2 Nelder-Mead Method
    9.1.3 Levenberg-Marquardt Method
    9.1.4 Algorithmic Speedups
  9.2 Machine Learning and Neural Networks
  9.3 XOR: An Important Nonlinear Machine-Learning Problem
    9.3.1 An Example Objective Function
    9.3.2 A Complete Functor for Multiple GPU Devices and the Host Processors
    9.3.3 Brief Discussion of a Complete Nelder-Mead Optimization Code
  9.4 Performance Results on XOR
  9.5 Performance Discussion
  9.6 Summary
  9.7 The C++ Nelder-Mead Template
10 Chapter 3: The CUDA Tool Suite: Profiling a PCA/NLPCA Functor
  10.1 PCA and NLPCA
    10.1.1 Autoencoders
      An Example Functor for PCA Analysis
      An Example Functor for NLPCA Analysis
  10.2 Obtaining Basic Profile Information
  10.3 Gprof: A Common UNIX Profiler
  10.4 The NVIDIA Visual Profiler: Computeprof
  10.5 Parallel Nsight for Microsoft Visual Studio
    10.5.1 The Nsight Timeline Analysis
    10.5.2 The NVTX Tracing Library
    10.5.3 Scaling Behavior of the CUDA API
  10.6 Tuning and Analysis Utilities (TAU)
  10.7 Summary
11 Chapter 4: The CUDA Execution Model
  11.1 GPU Architecture Overview
    11.1.1 Thread Scheduling: Orchestrating Performance and Parallelism via the Execution Configuration
    11.1.2 Relevant computeprof Values for a Warp
    11.1.3 Warp Divergence
    11.1.4 Guidelines for Warp Divergence
    11.1.5 Relevant computeprof Values for Warp Divergence
  11.2 Warp Scheduling and TLP
    11.2.1 Relevant computeprof Values for Occupancy
  11.3 ILP: Higher Performance at Lower Occupancy
    11.3.1 ILP Hides Arithmetic Latency
    11.3.2 ILP Hides Data Latency
    11.3.3 ILP in the Future
    11.3.4 Relevant computeprof Values for Instruction Rates
  11.4 Little's Law
  11.5 CUDA Tools to Identify Limiting Factors
    11.5.1 The nvcc Compiler
    11.5.2 Launch Bounds
    11.5.3 The Disassembler
    11.5.4 PTX Kernels
    11.5.5 GPU Emulators
  11.6 Summary
12 Chapter 5: CUDA Memory
  12.1 The CUDA Memory Hierarchy
  12.2 GPU Memory
  12.3 L2 Cache
    12.3.1 Relevant computeprof Values for the L2 Cache
  12.4 L1 Cache
    12.4.1 Relevant computeprof Values for the L1 Cache
  12.5 CUDA Memory Types
    12.5.1 Registers
    12.5.2 Local memory
    12.5.3 Relevant computeprof Values for Local Memory Cache
    12.5.4 Shared Memory
    12.5.5 Relevant computeprof Values for Shared Memory
    12.5.6 Constant Memory
    12.5.7 Texture Memory
    12.5.8 Relevant computeprof Values for Texture Memory
  12.6 Global Memory
    12.6.1 Common Coalescing Use Cases
    12.6.2 Allocation of Global Memory
    12.6.3 Limiting Factors in the Design of Global Memory
    12.6.4 Relevant computeprof Values for Global Memory
  12.7 Summary
13 Chapter 6: Efficiently Using GPU Memory
  13.1 Reduction
    13.1.1 The Reduction Template
    13.1.2 A Test Program for functionReduce.h
    13.1.3 Results
  13.2 Utilizing Irregular Data Structures
  13.3 Sparse Matrices and the CUSP Library
  13.4 Graph Algorithms
  13.5 SoA, AoS, and Other Structures
  13.6 Tiles and Stencils
  13.7 Summary
14 Chapter 7: Techniques to Increase Parallelism
  14.1 CUDA Contexts Extend Parallelism
  14.2 Streams and Contexts
    14.2.1 Multiple GPUs
    14.2.2 Explicit Synchronization
    14.2.3 Implicit Synchronization
    14.2.4 The Unified Virtual Address Space
    14.2.5 A Simple Example
    14.2.6 Profiling Results
  14.3 Out-of-Order Execution with Multiple Streams
    14.3.1 Tip for Concurrent Kernel Execution on the Same GPU
    14.3.2 Atomic Operations for Implicitly Concurrent Kernels
  14.4 Tying Data to Computation
    14.4.1 Manually Partitioning Data
    14.4.2 Mapped Memory
    14.4.3 How Mapped Memory Works
  14.5 Summary
15 Chapter 8: CUDA for All GPU and CPU Applications
  15.1 Pathways from CUDA to Multiple Hardware Backends
    15.1.1 The PGI CUDA x86 Compiler
    15.1.2 The PGI CUDA x86 Compiler
      An x86 core as an SM
    15.1.3 The NVIDIA NVCC Compiler
    15.1.4 Ocelot
    15.1.5 Swan
    15.1.6 MCUDA
  15.2 Accessing CUDA from Other Languages
    15.2.1 SWIG
    15.2.2 Copperhead
    15.2.3 EXCEL
    15.2.4 MATLAB
  15.3 Libraries
    15.3.1 CUBLAS
    15.3.2 CUFFT
    15.3.3 MAGMA
    15.3.4 phiGEMM Library
    15.3.5 CURAND
  15.4 Summary
16 Chapter 9: Mixing CUDA and Rendering
  16.1 OpenGL
    16.1.1 GLUT
    16.1.2 Mapping GPU Memory with OpenGL
    16.1.3 Using Primitive Restart for 3D Performance
  16.2 Introduction to the Files in the Framework
    16.2.1 The Demo and Perlin Example Kernels
      The Demo Kernel
      The Demo Kernel to Generate a Colored Sinusoidal Surface
      Perlin Noise
      Using the Perlin Noise Kernel to Generate Artificial Terrain
    16.2.2 The simpleGLmain.cpp File
    16.2.3 The simpleVBO.cpp File
    16.2.4 The callbacksVBO.cpp File
  16.3 Summary
17 Chapter 10: CUDA in a Cloud and Cluster Environments
  17.1 The Message Passing Interface (MPI)
    17.1.1 The MPI Programming Model
    17.1.2 The MPI Communicator
    17.1.3 MPI Rank
    17.1.4 Master-Slave
    17.1.5 Point-to-Point Basics
  17.2 How MPI Communicates
  17.3 Bandwidth
  17.4 Balance Ratios
  17.5 Considerations for Large MPI Runs
    17.5.1 Scalability of the Initial Data Load
    17.5.2 Using MPI to Perform a Calculation
    17.5.3 Check Scalability
  17.6 Cloud Computing
  17.7 A Code Example
    17.7.1 Data Generation
  17.8 Summary
18 Chapter 11: CUDA for Real Problems
  18.1 Working with High-Dimensional Data
    18.1.1 PCA/NLPCA
    18.1.2 Multidimensional Scaling
    18.1.3 K-Means Clustering
    18.1.4 Expectation-Maximization
    18.1.5 Support Vector Machines
    18.1.6 Bayesian Networks
    18.1.7 Mutual information
  18.2 Force-Directed Graphs
  18.3 Monte Carlo Methods
  18.4 Molecular Modeling
  18.5 Quantum Chemistry
  18.6 Interactive Workflows
  18.7 A Plethora of Projects
  18.8 Summary
19 Chapter 12: Application Focus on Live Streaming Video
  19.1 Topics in Machine Vision
    19.1.1 3D Effects
    19.1.2 Segmentation of Flesh-colored Regions
    19.1.3 Edge Detection
  19.2 FFmpeg
  19.3 TCP Server
  19.4 Live Stream Application
    19.4.1 kernelWave(): An Animated Kernel
    19.4.2 kernelFlat(): Render the Image on a Flat Surface
    19.4.3 kernelSkin(): Keep Only Flesh-colored Regions
    19.4.4 kernelSobel(): A Simple Sobel Edge Detection Filter
    19.4.5 The launch_kernel() Method
  19.5 The simpleVBO.cpp File
  19.6 The callbacksVBO.cpp File
  19.7 Building and Running the Code
  19.8 The Future
    19.8.1 Machine Learning
    19.8.2 The Connectome
  19.9 Summary
  19.10 Listing for simpleVBO.cpp
20 Works Cited
21 Index (A to X)


Rob Farber has served as a scientist in Europe at the Irish Center for High-End Computing as well as at U.S. national labs in Los Alamos, Berkeley, and the Pacific Northwest. He has also been on the external faculty at the Santa Fe Institute, a consultant to Fortune 100 companies, and a co-founder of two computational startups that achieved liquidity events. He is the author of "CUDA Application Design and Development" as well as numerous articles and tutorials that have appeared in Dr. Dobb's Journal, Scientific Computing, The Code Project, and others.


The book by Rob Farber on CUDA Application Design and Development is required reading for anyone who wants to understand and efficiently program CUDA for scientific and visual programming. It provides a hands-on exposure to the details in a readable and easy-to-understand form. Jack Dongarra, Innovative Computing Laboratory, EECS Department, University of Tennessee

GPUs have the potential to take computational simulations to new levels of scale and detail. Many scientists are already realising these benefits, tackling larger and more complex problems that are not feasible on conventional CPU-based systems. This book provides the tools and techniques for anyone wishing to join these pioneers, in an accessible though thorough text that a budding CUDA programmer would do well to keep close to hand. Dr. George Beckett, EPCC, University of Edinburgh

With his book, Farber takes us on a journey to the exciting world of programming multi-core processor machines with CUDA. Farber's pragmatic approach is effective in guiding the reader across challenges and their solutions. Farber's broader presentation of parallel programming with CUDA, ranging from CUDA in Cloud and Cluster environments to CUDA for real problems and applications, helps the reader learn about the unique opportunities this parallel programming language can offer to the scientific community. This book is definitely a must for students, teachers, and developers! Michela Taufer, Assistant Professor, Department of Computer and Information Sciences, University of Delaware

Rob Farber has written an enlightening and accessible book on the application of CUDA to real research tasks, with an eye to developing scalable and distributed GPU applications. He supplies clear and usable code examples combined with insight about _why_ one should use a particular approach. This is an excellent book filled with practical advice for experienced CUDA programmers and ground-up guidance for beginners wondering if CUDA can accelerate their time to solution. Paul A. Navrátil, Manager, Visualization Software, Texas Advanced Computing Center

The book provides a solid introduction to the CUDA programming language starting with the basics and progressively exposing the reader to advanced concepts through the well annotated implementation of real-world applications. It makes a first-rate presentation of CUDA, its use in the implementation of portable and efficient applications and the underlying architecture of GPGPU/CPU systems with particular emphasis on memory hierarchies. This is complemented by a thorough presentation both of the CUDA Tool Suite and of techniques for the parallelisation of applications. Farber's book is a valuable addition to the bookshelves of both the advanced and novice CUDA programmer. Francis Wray, Independent Consultant and Visiting Professor at the Faculty of Computing, Information Systems and Mathematics at the University of Kingston

At a brisk pace, "CUDA Application Design and Development" will take one from the basics of CUDA programming to the level where real-time video processing becomes a stroll in the park. Along the way, the reader can get a clear understanding of how the hybrid CPU-GPU computing idea can be capitalized on, and how a 500-GPU configuration can be used in large scale machine learning problems. Wasting no time on obscure issues of little relevance, the book provides an excellent account of the CUDA execution model, memory access issues, opportunities to increase parallelism in a program, and how advanced profiling can squeeze performance out of a code. Rob provides a snapshot of everything that is relevant in CUDA based GPU computing in a style honed through a long series of Dr. Dobb's articles that have delighted scores of CUDA programmers. His followers will be delighted once again. Dan Negrut, Associate Professor, University of Wisconsin-Madison, NVIDIA CUDA

EAN: 9780123884329
Subtitle: 211:eBook ePub. Language: English.
Publisher: Elsevier Science
Publication date: October 2011
Number of pages: 336
Format: epub eBook
Copy protection: Adobe DRM