CUDA Application Design and Development

€44.49 (previously €45.49)
ePub eBook
Available immediately (download)
October 2011



As the computer industry retools to leverage massively parallel graphics processing units (GPUs), this book meets the needs of working software developers who need to understand GPU programming with CUDA and increase efficiency in their projects. CUDA Application Design and Development starts with an introduction to parallel computing concepts for readers with no previous parallel experience, then focuses on issues of immediate importance to working software developers: achieving high performance, maintaining competitiveness, analyzing CUDA benefits versus costs, and determining application lifespan. The book then details the thinking behind CUDA and teaches how to create, analyze, and debug CUDA applications. Throughout, the focus is on software engineering issues: how to use CUDA in the context of existing application code, with existing compilers, languages, software tools, and industry-standard API libraries. Using an approach refined in a series of well-received articles in Dr. Dobb's Journal, author Rob Farber takes the reader step by step from fundamentals to implementation, moving from language theory to practical coding.

- Includes multiple examples building from simple to more complex applications in four key areas: machine learning, visualization, vision recognition, and mobile computing
- Addresses the foundational issues for CUDA development: multi-threaded programming and the GPU memory hierarchy
- Includes teaching chapters designed to give a full understanding of CUDA tools, techniques, and structure
- Presents CUDA techniques in the context of the hardware they run on, as well as other styles of programming that help readers bridge into the new material


Table of Contents:

Front Cover
CUDA Application Design and Development
Copyright
Dedication
Table of Contents
Foreword
Preface
1 First Programs and How to Think in CUDA
  Source Code and Wiki
  Distinguishing CUDA from Conventional Programming with a Simple Example
  Choosing a CUDA API
  Some Basic CUDA Concepts
  Understanding Our First Runtime Kernel
  Three Rules of GPGPU Programming
    Rule 1: Get the Data on the GPU and Keep It There
    Rule 2: Give the GPGPU Enough Work to Do
    Rule 3: Focus on Data Reuse within the GPGPU to Avoid Memory Bandwidth Limitations
  Big-O Considerations and Data Transfers
  CUDA and Amdahl's Law
  Data and Task Parallelism
  Hybrid Execution: Using Both CPU and GPU Resources
  Regression Testing and Accuracy
  Silent Errors
  Introduction to Debugging
  UNIX Debugging
    NVIDIA's cuda-gdb Debugger
    The CUDA Memory Checker
    Use cuda-gdb with the UNIX ddd Interface
  Windows Debugging with Parallel Nsight
  Summary
2 CUDA for Machine Learning and Optimization
  Modeling and Simulation
    Fitting Parameterized Models
    Nelder-Mead Method
    Levenberg-Marquardt Method
    Algorithmic Speedups
  Machine Learning and Neural Networks
  XOR: An Important Nonlinear Machine-Learning Problem
    An Example Objective Function
    A Complete Functor for Multiple GPU Devices and the Host Processors
    Brief Discussion of a Complete Nelder-Mead Optimization Code
  Performance Results on XOR
  Performance Discussion
  Summary
  The C++ Nelder-Mead Template
3 The CUDA Tool Suite: Profiling a PCA/NLPCA Functor
  PCA and NLPCA
    Autoencoders
      An Example Functor for PCA Analysis
      An Example Functor for NLPCA Analysis
  Obtaining Basic Profile Information
  Gprof: A Common UNIX Profiler
  The NVIDIA Visual Profiler: computeprof
  Parallel Nsight for Microsoft Visual Studio
    The Nsight Timeline Analysis
    The NVTX Tracing Library
    Scaling Behavior of the CUDA API
  Tuning and Analysis Utilities (TAU)
  Summary
4 The CUDA Execution Model
  GPU Architecture Overview
    Thread Scheduling: Orchestrating Performance and Parallelism via the Execution Configuration
    Relevant computeprof Values for a Warp
    Warp Divergence
    Guidelines for Warp Divergence
    Relevant computeprof Values for Warp Divergence
  Warp Scheduling and TLP
    Relevant computeprof Values for Occupancy
  ILP: Higher Performance at Lower Occupancy
    ILP Hides Arithmetic Latency
    ILP Hides Data Latency
    ILP in the Future
    Relevant computeprof Values for Instruction Rates
  Little's Law
  CUDA Tools to Identify Limiting Factors
    The nvcc Compiler
    Launch Bounds
    The Disassembler
    PTX Kernels
    GPU Emulators
  Summary
5 CUDA Memory
  The CUDA Memory Hierarchy
  GPU Memory
  L2 Cache
    Relevant computeprof Values for the L2 Cache
  L1 Cache
    Relevant computeprof Values for the L1 Cache
  CUDA Memory Types
    Registers
    Local Memory
    Relevant computeprof Values for Local Memory Cache
    Shared Memory
    Relevant computeprof Values for Shared Memory
    Constant Memory
    Texture Memory
    Relevant computeprof Values for Texture Memory
  Global Memory
    Common Coalescing Use Cases
    Allocation of Global Memory
    Limiting Factors in the Design of Global Memory
    Relevant computeprof Values for Global Memory
  Summary
6 Efficiently Using GPU Memory
  Reduction
    The Reduction Template
    A Test Program for functionReduce.h
    Results
  Utilizing Irregular Data Structures
  Sparse Matrices and the CUSP Library
  Graph Algorithms
  SoA, AoS, and Other Structures
  Tiles and Stencils
  Summary
7 Techniques to Increase Parallelism
  CUDA Contexts Extend Parallelism
  Streams and Contexts
    Multiple GPUs
    Explicit Synchronization
    Implicit Synchronization
    The Unified Virtual Address Space
    A Simple Example
    Profiling Results
  Out-of-Order Execution with Multiple Streams
    Tip for Concurrent Kernel Execution on the Same GPU
    Atomic Operations for Implicitly Concurrent Kernels
  Tying Data to Computation
    Manually Partitioning Data
    Mapped Memory
    How Mapped Memory Works
  Summary
8 CUDA for All GPU and CPU Applications
  Pathways from CUDA to Multiple Hardware Backends
    The PGI CUDA x86 Compiler
      An x86 Core as an SM
    The NVIDIA NVCC Compiler
    Ocelot
    Swan
    MCUDA
  Accessing CUDA from Other Languages
    SWIG
    Copperhead
    EXCEL
    MATLAB
  Libraries
    CUBLAS
    CUFFT
    MAGMA
    phiGEMM Library
    CURAND
  Summary
9 Mixing CUDA and Rendering
  OpenGL
    GLUT
    Mapping GPU Memory with OpenGL
    Using Primitive Restart for 3D Performance
  Introduction to the Files in the Framework
    The Demo and Perlin Example Kernels
      The Demo Kernel
      The Demo Kernel to Generate a Colored Sinusoidal Surface
      Perlin Noise
      Using the Perlin Noise Kernel to Generate Artificial Terrain
    The simpleGLmain.cpp File
    The simpleVBO.cpp File
    The callbacksVBO.cpp File
  Summary
10 CUDA in a Cloud and Cluster Environments
  The Message Passing Interface (MPI)
    The MPI Programming Model
    The MPI Communicator
    MPI Rank
    Master-Slave
    Point-to-Point Basics
  How MPI Communicates
  Bandwidth
  Balance Ratios
  Considerations for Large MPI Runs
    Scalability of the Initial Data Load
    Using MPI to Perform a Calculation
    Check Scalability
  Cloud Computing
  A Code Example
    Data Generation
  Summary
11 CUDA for Real Problems
  Working with High-Dimensional Data
    PCA/NLPCA
    Multidimensional Scaling
    K-Means Clustering
    Expectation-Maximization
    Support Vector Machines
    Bayesian Networks
    Mutual Information
  Force-Directed Graphs
  Monte Carlo Methods
  Molecular Modeling
  Quantum Chemistry
  Interactive Workflows
  A Plethora of Projects
  Summary
12 Application Focus on Live Streaming Video
  Topics in Machine Vision
    3D Effects
    Segmentation of Flesh-colored Regions
    Edge Detection
  FFmpeg
  TCP Server
  Live Stream Application
    kernelWave(): An Animated Kernel
    kernelFlat(): Render the Image on a Flat Surface
    kernelSkin(): Keep Only Flesh-colored Regions
    kernelSobel(): A Simple Sobel Edge Detection Filter
    The launch_kernel() Method
  The simpleVBO.cpp File
  The callbacksVBO.cpp File
  Building and Running the Code
  The Future
    Machine Learning
    The Connectome
  Summary
  Listing for simpleVBO.cpp
Works Cited
Index

EAN: 9780123884329
Subtitle: eBook ePub. Language: English.
Publisher: Elsevier Science
Publication date: October 2011
Length: 336 pages
Format: ePub eBook
Copy protection: Adobe DRM