1536/64 Parallel Processing Quiz: Test Your CUDA Prowess!

Think you can master CUDA shared memory, DRAM bursts & thread indexing? Jump in!

Difficulty: Moderate
2-5 mins

Use this 1536/64 parallel processing quiz to practice CUDA basics like shared memory, DRAM bursts, and thread indexing. You'll spot gaps fast and get quicker at reasoning about warps, blocks, and memory access. If you've aced the CPU and memory quiz or worked through other computational questions, start here to keep your skills sharp.

What does CUDA stand for?
Concurrent Unprocessed Data Algorithm
Compute Unified Device Architecture
Compute User Driver Application
Coherent Unified Data Accelerator
CUDA is an acronym for Compute Unified Device Architecture, NVIDIA's parallel computing platform and programming model. It enables software developers to use a CUDA-enabled GPU for general purpose processing. CUDA exposes GPU hardware resources to developers in the form of C/C++ extensions, libraries, and tools.
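To make that concrete, here is a minimal sketch of those C/C++ extensions in action; the kernel name, array size, and block size are illustrative, not part of the quiz.

    #include <cuda_runtime.h>

    // __global__ marks a function that runs on the GPU and is launched from host code.
    __global__ void scaleKernel(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1024;
        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));                  // device (global) memory
        cudaMemset(d_data, 0, n * sizeof(float));                // zero-fill so the example works on defined data
        scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);  // <<<blocks, threads>>> launch syntax
        cudaDeviceSynchronize();                                 // wait for the kernel to finish
        cudaFree(d_data);
        return 0;
    }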
In CUDA programming, how many threads make up a single warp?
64 threads
16 threads
32 threads
128 threads
A warp in CUDA consists of 32 threads that execute instructions in lockstep on the GPU's streaming multiprocessor. Warps are the basic scheduling unit in NVIDIA's SIMT (Single Instruction, Multiple Thread) architecture. Properly structuring work into warps can maximize utilization and performance.
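As a rough illustration (the kernel name and launch shape are made up for this sketch), warp and lane indices are usually derived from the built-in warpSize of 32:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void warpInfo() {
        int tid  = threadIdx.x;
        int lane = tid % warpSize;   // position within the 32-thread warp
        int warp = tid / warpSize;   // which warp of the block this thread belongs to
        int blk  = blockIdx.x;
        if (lane == 0)               // one line of output per warp
            printf("block %d, warp %d begins at thread %d\n", blk, warp, tid);
    }

    int main() {
        warpInfo<<<1, 128>>>();      // 128 threads = 4 warps of 32
        cudaDeviceSynchronize();
        return 0;
    }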
What type of GPU memory is on-chip and shared among threads in the same block?
Constant memory
Global memory
Shared memory
Texture memory
Shared memory is on-chip scratchpad memory that can be accessed by all threads in a block with low latency. It is useful for data reuse and communication between threads in a block. Because it resides on the streaming multiprocessor, shared memory offers much higher bandwidth than global memory.
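A small sketch of the idea (the 256-element tile and three-point average are illustrative, and the launch is assumed to use 256-thread blocks):

    __global__ void blur1d(const float *in, float *out, int n) {
        __shared__ float tile[256];              // on-chip, visible to every thread in this block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];    // each thread stages one element from global memory
        __syncthreads();                         // make the whole tile visible before anyone reads it
        // interior of the tile only; halo handling is omitted to keep the sketch short
        if (i > 0 && i < n - 1 && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
            out[i] = (tile[threadIdx.x - 1] + tile[threadIdx.x] + tile[threadIdx.x + 1]) / 3.0f;
    }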
Which GPU memory space refers to off-chip DRAM accessible by all threads?
Local memory
Shared memory
Global memory
Register memory
Global memory refers to the GPU's off-chip DRAM that is accessible by all threads and has high capacity but higher latency. It is used for large data sets and persists across kernel launches. Optimizing global memory access patterns (coalescing) is essential for high performance.
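For example, a typical host-side sketch (names and sizes are arbitrary) allocates global memory with cudaMalloc and moves data to and from it with cudaMemcpy:

    #include <vector>
    #include <cuda_runtime.h>

    int main() {
        const int n = 1 << 20;
        std::vector<float> host(n, 1.0f);
        float *d_buf = nullptr;
        cudaMalloc(&d_buf, n * sizeof(float));   // off-chip DRAM, visible to all threads
        cudaMemcpy(d_buf, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        // ... kernels launched here can all read and write d_buf; its contents persist between launches ...
        cudaMemcpy(host.data(), d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_buf);
        return 0;
    }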
What is a coalesced memory access in CUDA?
Aligning shared memory banks to eliminate conflicts
Merging consecutive global memory accesses from threads in a warp into a single transaction
Executing multiple kernels back-to-back
Using atomic operations for synchronization
Coalesced memory access occurs when threads in a warp access contiguous memory locations, allowing the hardware to combine multiple memory requests into fewer DRAM transactions. This significantly increases memory throughput and reduces latency. Ensuring proper alignment and access patterns is key to achieving coalescing.
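Two toy kernels illustrate the difference (the kernel names are made up, and the strided version is deliberately bad):

    // Coalesced: thread i of a warp touches element i, so the warp's 32 accesses
    // fall into one or two contiguous 128-byte segments.
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Uncoalesced: a large stride scatters the warp's accesses over many segments,
    // so the hardware issues many transactions for the same amount of useful data.
    __global__ void copyStrided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }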
What causes a shared memory bank conflict?
Using too many registers per thread
Multiple threads in a warp accessing different banks
Accessing global memory from host code
Multiple threads in a warp accessing the same bank
A bank conflict occurs when two or more threads in the same warp access different addresses within the same shared memory bank simultaneously. Because each bank can only serve one access per cycle, conflicts serialize accesses and degrade performance. Padding shared memory or reorganizing data accesses can eliminate conflicts.
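The classic fix shows up in a shared-memory matrix transpose. A sketch (assuming 32x32 thread blocks and a square matrix whose width is a multiple of 32) pads each row by one element so column reads hit 32 different banks:

    #define TILE 32

    __global__ void transposeTile(const float *in, float *out, int width) {
        // The "+ 1" pads each row by one float, shifting successive rows by one bank
        // so that a column-wise read maps consecutive threads to different banks.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise write: conflict-free
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column-wise read: conflict-free thanks to the padding
    }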
How is occupancy defined in CUDA?
The ratio of active warps on an SM to the maximum possible
The number of registers used per thread
The number of active blocks per SM
The total global memory bandwidth utilization
Occupancy is the ratio of active warps on a streaming multiprocessor (SM) to the maximum number of possible active warps. Higher occupancy can help hide memory latency, but it is impacted by resource usage such as registers and shared memory. Balancing occupancy with resource usage is crucial for optimal performance.
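The CUDA runtime can report this ratio directly; a sketch (the kernel is just a placeholder) using cudaOccupancyMaxActiveBlocksPerMultiprocessor:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void placeholderKernel(float *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] = 0.0f;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 256, blocksPerSM = 0;
        // How many blocks of this kernel can be resident on one SM at once?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, placeholderKernel, blockSize, 0);

        int activeWarps = blocksPerSM * blockSize / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("occupancy: %d/%d warps (%.0f%%)\n",
               activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
        return 0;
    }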
What is the recommended number of threads per block for achieving high performance on modern NVIDIA GPUs?
1024-2048 threads
128-512 threads
16-32 threads
2048-4096 threads
Using 128 to 512 threads per block often yields high performance on modern NVIDIA GPUs by balancing parallelism against register and shared memory usage. Blocks too small underutilize the SM, while blocks too large can reduce occupancy or exceed resource limits. Tuning block size per kernel is a common optimization step.
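Rather than guessing, the runtime can also suggest a starting point; a sketch (the saxpy kernel is only an example) using cudaOccupancyMaxPotentialBlockSize:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        int minGridSize = 0, blockSize = 0;
        // Ask the runtime for a block size that maximizes occupancy for this kernel.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);
        printf("suggested block size: %d threads\n", blockSize);
        return 0;
    }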
Which compute capability first introduced dynamic parallelism (kernel launches from the device)?
Compute Capability 5.0
Compute Capability 6.0
Compute Capability 2.0
Compute Capability 3.5
Dynamic parallelism, which allows CUDA kernels to launch other kernels from the device, was introduced with Compute Capability 3.5 on the Kepler architecture. This feature simplifies certain algorithms by removing round trips to the host. However, dynamic parallelism can introduce serialization overhead if misused.
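A minimal sketch of a device-side launch (kernel names are illustrative); note that dynamic parallelism requires relocatable device code, e.g. building with nvcc -arch=sm_35 -rdc=true and linking cudadevrt:

    #include <cstdio>

    __global__ void childKernel(int parentBlock) {
        printf("child of block %d, thread %d\n", parentBlock, (int)threadIdx.x);
    }

    __global__ void parentKernel() {
        if (threadIdx.x == 0)                       // one child launch per parent block
            childKernel<<<1, 4>>>(blockIdx.x);      // kernel launched from device code
    }

    int main() {
        parentKernel<<<2, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }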
What is the primary purpose of the __syncthreads() function in CUDA?
Flush registers to global memory
Synchronize threads within a block at a barrier
Initiate a warp-level reduction
Synchronize all threads across the entire grid
__syncthreads() acts as a barrier, ensuring that all threads in a block have reached the same point of execution and that shared memory writes are visible to all. It prevents race conditions when sharing data in shared memory. It does not synchronize across blocks or affect registers.
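A small sketch shows why the barrier matters (a block size of 256 is assumed to match the shared buffer): each block reverses its slice of an array, and without __syncthreads() a thread could read a slot before its neighbor has written it.

    __global__ void reverseEachBlock(float *data) {
        __shared__ float buf[256];                    // assumes blockDim.x == 256
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        buf[t] = data[base + t];                      // every thread writes one slot
        __syncthreads();                              // barrier: all writes visible before any read
        data[base + t] = buf[blockDim.x - 1 - t];     // read a slot written by another thread
    }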
What is the typical DRAM burst length on modern NVIDIA GPUs' global memory interface?
32 bytes
256 bytes
128 bytes
64 bytes
Modern NVIDIA GPUs often use a 128-byte burst length for global memory accesses, meaning data is fetched in 128-byte contiguous chunks. Aligning memory transactions to this boundary improves throughput by reducing wasted bandwidth. Misaligned accesses can lead to split transactions and decreased performance.
Which factor can most directly reduce occupancy on a CUDA streaming multiprocessor?
Using too few registers per thread
Using too many constant memory references
Allocating large amounts of shared memory per block
Choosing a small block size
Allocating large shared memory per block can limit the number of blocks that can be resident on an SM simultaneously, directly reducing occupancy. Shared memory is a scarce on-chip resource, and excessive use per block may prevent additional blocks from launching. Balancing shared memory and register usage is key to maximizing occupancy.
How many shared memory banks exist on NVIDIA GPUs with compute capability 2.x and above?
32 banks
128 banks
16 banks
64 banks
On NVIDIA GPUs with compute capability 2.x and later, shared memory is divided into 32 banks. Each bank can deliver one 32-bit word per cycle, and proper thread access patterns avoid bank conflicts. Understanding the bank count is crucial for designing efficient shared memory algorithms.
Which strategy is most effective for maximizing global memory throughput on a CUDA device?
Prefetching data into local memory
Relying on atomic operations to serialize access
Aligning all accesses to 128-byte boundaries and coalescing
Using random memory access patterns
Maximizing global memory throughput requires aligning memory accesses to the GPU's burst size (typically 128 bytes) and ensuring coalesced accesses by threads in a warp. This minimizes the number of memory transactions and maximizes bus utilization. Misaligned or uncoalesced accesses lead to split transactions and lower bandwidth.
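One common way to get there is vectorized access; a sketch (assuming the element count is a multiple of 4 and the buffer comes from cudaMalloc, which returns generously aligned pointers):

    // Each thread moves 16 bytes at a time, so one warp covers 512 contiguous bytes:
    // four full 128-byte segments with no wasted bandwidth.
    __global__ void scaleVec4(float4 *data, float s, int n4) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = data[i];                 // single 16-byte load
            v.x *= s; v.y *= s; v.z *= s; v.w *= s;
            data[i] = v;                        // single 16-byte store
        }
    }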

Study Outcomes

  1. Understand 1536/64 Parallel Processing Fundamentals -

    Grasp the core principles of CUDA's 1536/64 execution model (e.g., 1536 resident threads per SM scheduled as 64-thread blocks), including how threads are organized into warps and blocks to maximize parallel throughput.

  2. Analyze CUDA Shared Memory Usage -

    Examine real-world quiz scenarios to identify best practices and common pitfalls when allocating and accessing shared memory in CUDA kernels.

  3. Optimize DRAM Burst Transfers -

    Learn how to align and coalesce memory transactions to maximize DRAM burst efficiency and minimize latency in GPU applications.

  4. Apply Thread Indexing Techniques -

    Use various indexing schemes to map threads to data elements, ensuring correct computation and optimal memory access patterns.

  5. Interpret Quiz Feedback for Skill Improvement -

    Review instant feedback on each question to pinpoint knowledge gaps in CUDA parallel processing and create a targeted learning plan.

Cheat Sheet

  1. Maximizing SM Occupancy -

    Occupancy measures how many threads are active per SM versus the hardware limit (e.g., 1536 threads on Fermi GPUs). Calculate occupancy as active warps ÷ max warps (1536/32=48 warps) and tune your block size (e.g., 64 threads=2 warps) to utilize SMs efficiently. Pro tip: use NVIDIA's CUDA Occupancy Calculator to balance registers and shared memory per block (CUDA C Programming Guide).

  2. Minimizing Shared Memory Bank Conflicts -

    Shared memory is divided into 32 banks, and simultaneous accesses by threads in a warp to different addresses within the same bank are serialized. Avoid conflicts by padding rows with an extra element (stride+1) or using diagonal indexing so consecutive threads map to different banks (CUDA C Best Practices Guide). Mnemonic: "Stride +1 keeps banks on the run!"

  3. Optimizing DRAM Bursts with Memory Coalescing -

    Global DRAM serves data in 128-byte bursts covering 32 threads; full coalescing occurs when each thread in a warp accesses consecutive 4-byte words. Align arrays on 128-byte boundaries and leverage vector types like float4 for packed loads/stores (CUDA C Best Practices Guide). Remember: "contiguous threads, contiguous data" for peak bandwidth.

  4. Efficient Thread Indexing Strategies -

    Map multi-dimensional CUDA grids to linear indices via idx = blockIdx.x * blockDim.x + threadIdx.x, and for 2D grids: row = blockIdx.y * blockDim.y + threadIdx.y. This formula, from the NVIDIA CUDA Toolkit documentation, simplifies partitioning of arrays and matrices across threads. Mnemonic aid: "blockIdx multiplies, threadIdx accumulates."

  5. Parallel Reduction Using Shared Memory -

    Load data into shared memory and iteratively halve active threads in a tree pattern while avoiding warp divergence (NVIDIA Developer Blog). Unroll the final warp and use __syncthreads() judiciously to synchronize, yielding near-peak throughput for sum, min, or max operations. Remember: "halve and sync" keeps the reduction in the pink!
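
    A minimal sketch of that pattern (256-thread blocks are assumed; the final-warp unrolling mentioned above is left out to keep the shape clear):

        __global__ void blockSum(const float *in, float *blockSums, int n) {
            __shared__ float sdata[256];                 // assumes blockDim.x == 256
            int t = threadIdx.x;
            int i = blockIdx.x * blockDim.x + t;
            sdata[t] = (i < n) ? in[i] : 0.0f;           // load, padding the tail with zeros
            __syncthreads();

            // "Halve and sync": the active threads shrink 128 -> 64 -> ... -> 1, always as a
            // contiguous range, which keeps whole warps either fully active or fully idle.
            for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
                if (t < stride)
                    sdata[t] += sdata[t + stride];
                __syncthreads();                         // finish this step before starting the next
            }
            if (t == 0) blockSums[blockIdx.x] = sdata[0]; // one partial sum per block
        }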
