1536/64 Parallel Processing Quiz: Test Your CUDA Prowess!

Think you can master CUDA shared memory, DRAM bursts & thread indexing? Jump in!

Difficulty: Moderate
2-5 mins

Use this 1536/64 parallel processing quiz to practice CUDA basics like shared memory, DRAM bursts, and thread indexing. You'll spot gaps fast and get quicker at reasoning about warps, blocks, and memory access. If you've aced the CPU and memory quiz or worked through other computational questions, start here to keep your skills sharp.

What does CUDA stand for?
Concurrent Unprocessed Data Algorithm
Compute Unified Device Architecture
Compute User Driver Application
Coherent Unified Data Accelerator
CUDA is an acronym for Compute Unified Device Architecture, NVIDIA's parallel computing platform and programming model. It enables software developers to use a CUDA-enabled GPU for general purpose processing. CUDA exposes GPU hardware resources to developers in the form of C/C++ extensions, libraries, and tools.
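To make that concrete, here is a minimal sketch of those C/C++ extensions in action; the kernel name, array size, and block size are illustrative, not part of the quiz.

    #include <cuda_runtime.h>

    // __global__ marks a function that runs on the GPU and is launched from host code.
    __global__ void scaleKernel(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1024;
        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));                  // device (global) memory
        cudaMemset(d_data, 0, n * sizeof(float));                // zero-fill so the example works on defined data
        scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);  // <<<blocks, threads>>> launch syntax
        cudaDeviceSynchronize();                                 // wait for the kernel to finish
        cudaFree(d_data);
        return 0;
    }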
In CUDA programming, how many threads make up a single warp?
64 threads
16 threads
32 threads
128 threads
A warp in CUDA consists of 32 threads that execute instructions in lockstep on the GPU's streaming multiprocessor. Warps are the basic scheduling unit in NVIDIA's SIMT (Single Instruction, Multiple Thread) architecture. Properly structuring work into warps can maximize utilization and performance.
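As a rough illustration (the kernel name and launch shape are made up for this sketch), warp and lane indices are usually derived from the built-in warpSize of 32:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void warpInfo() {
        int tid  = threadIdx.x;
        int lane = tid % warpSize;   // position within the 32-thread warp
        int warp = tid / warpSize;   // which warp of the block this thread belongs to
        int blk  = blockIdx.x;
        if (lane == 0)               // one line of output per warp
            printf("block %d, warp %d begins at thread %d\n", blk, warp, tid);
    }

    int main() {
        warpInfo<<<1, 128>>>();      // 128 threads = 4 warps of 32
        cudaDeviceSynchronize();
        return 0;
    }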
What type of GPU memory is on-chip and shared among threads in the same block?
Constant memory
Global memory
Shared memory
Texture memory
Shared memory is on-chip scratchpad memory that can be accessed by all threads in a block with low latency. It is useful for data reuse and communication between threads in a block. Because it resides on the streaming multiprocessor, shared memory offers much higher bandwidth than global memory.
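A small sketch of the idea (the 256-element tile and three-point average are illustrative, and the launch is assumed to use 256-thread blocks):

    __global__ void blur1d(const float *in, float *out, int n) {
        __shared__ float tile[256];              // on-chip, visible to every thread in this block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) tile[threadIdx.x] = in[i];    // each thread stages one element from global memory
        __syncthreads();                         // make the whole tile visible before anyone reads it
        // interior of the tile only; halo handling is omitted to keep the sketch short
        if (i > 0 && i < n - 1 && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
            out[i] = (tile[threadIdx.x - 1] + tile[threadIdx.x] + tile[threadIdx.x + 1]) / 3.0f;
    }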
Which GPU memory space refers to off-chip DRAM accessible by all threads?
Local memory
Shared memory
Global memory
Register memory
Global memory refers to the GPU's off-chip DRAM that is accessible by all threads and has high capacity but higher latency. It is used for large data sets and persists across kernel launches. Optimizing global memory access patterns (coalescing) is essential for high performance.
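For example, a typical host-side sketch (names and sizes are arbitrary) allocates global memory with cudaMalloc and moves data to and from it with cudaMemcpy:

    #include <vector>
    #include <cuda_runtime.h>

    int main() {
        const int n = 1 << 20;
        std::vector<float> host(n, 1.0f);
        float *d_buf = nullptr;
        cudaMalloc(&d_buf, n * sizeof(float));   // off-chip DRAM, visible to all threads
        cudaMemcpy(d_buf, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        // ... kernels launched here can all read and write d_buf; its contents persist between launches ...
        cudaMemcpy(host.data(), d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_buf);
        return 0;
    }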
What is a coalesced memory access in CUDA?
Aligning shared memory banks to eliminate conflicts
Merging consecutive global memory accesses from threads in a warp into a single transaction
Executing multiple kernels back-to-back
Using atomic operations for synchronization
Coalesced memory access occurs when threads in a warp access contiguous memory locations, allowing the hardware to combine multiple memory requests into fewer DRAM transactions. This significantly increases memory throughput and reduces latency. Ensuring proper alignment and access patterns is key to achieving coalescing.
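Two toy kernels illustrate the difference (the kernel names are made up, and the strided version is deliberately bad):

    // Coalesced: thread i of a warp touches element i, so the warp's 32 accesses
    // fall into one or two contiguous 128-byte segments.
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Uncoalesced: a large stride scatters the warp's accesses over many segments,
    // so the hardware issues many transactions for the same amount of useful data.
    __global__ void copyStrided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }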
What causes a shared memory bank conflict?
Using too many registers per thread
Multiple threads in a warp accessing different banks
Accessing global memory from host code
Multiple threads in a warp accessing the same bank
A bank conflict occurs when two or more threads in the same warp access different addresses within the same shared memory bank simultaneously. Because each bank can only serve one access per cycle, conflicts serialize accesses and degrade performance. Padding shared memory or reorganizing data accesses can eliminate conflicts.
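The classic fix shows up in a shared-memory matrix transpose. A sketch (assuming 32x32 thread blocks and a square matrix whose width is a multiple of 32) pads each row by one element so column reads hit 32 different banks:

    #define TILE 32

    __global__ void transposeTile(const float *in, float *out, int width) {
        // The "+ 1" pads each row by one float, shifting successive rows by one bank
        // so that a column-wise read maps consecutive threads to different banks.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise write: conflict-free
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;
        int ty = blockIdx.x * TILE + threadIdx.y;
        out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column-wise read: conflict-free thanks to the padding
    }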
How is occupancy defined in CUDA?
The ratio of active warps on an SM to the maximum possible
The number of registers used per thread
The number of active blocks per SM
The total global memory bandwidth utilization
Occupancy is the ratio of active warps on a streaming multiprocessor (SM) to the maximum number of possible active warps. Higher occupancy can help hide memory latency, but it is impacted by resource usage such as registers and shared memory. Balancing occupancy with resource usage is crucial for optimal performance.
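The CUDA runtime can report this ratio directly; a sketch (the kernel is just a placeholder) using cudaOccupancyMaxActiveBlocksPerMultiprocessor:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void placeholderKernel(float *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] = 0.0f;
    }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 256, blocksPerSM = 0;
        // How many blocks of this kernel can be resident on one SM at once?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, placeholderKernel, blockSize, 0);

        int activeWarps = blocksPerSM * blockSize / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("occupancy: %d/%d warps (%.0f%%)\n",
               activeWarps, maxWarps, 100.0 * activeWarps / maxWarps);
        return 0;
    }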
What is the recommended number of threads per block for achieving high performance on modern NVIDIA GPUs?
1024-2048 threads
128-512 threads
16-32 threads
2048-4096 threads
Using 128 to 512 threads per block often yields high performance on modern NVIDIA GPUs by balancing parallelism against register and shared memory usage. Blocks too small underutilize the SM, while blocks too large can reduce occupancy or exceed resource limits. Tuning block size per kernel is a common optimization step.
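Rather than guessing, the runtime can also suggest a starting point; a sketch (the saxpy kernel is only an example) using cudaOccupancyMaxPotentialBlockSize:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(float a, const float *x, float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        int minGridSize = 0, blockSize = 0;
        // Ask the runtime for a block size that maximizes occupancy for this kernel.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, saxpy, 0, 0);
        printf("suggested block size: %d threads\n", blockSize);
        return 0;
    }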
Which compute capability first introduced dynamic parallelism (kernel launches from the device)?
Compute Capability 5.0
Compute Capability 6.0
Compute Capability 2.0
Compute Capability 3.5
Dynamic parallelism, which allows CUDA kernels to launch other kernels from the device, was introduced with Compute Capability 3.5 on the Kepler architecture. This feature simplifies certain algorithms by removing round trips to the host. However, dynamic parallelism can introduce serialization overhead if misused.
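A minimal sketch of a device-side launch (kernel names are illustrative); note that dynamic parallelism requires relocatable device code, e.g. building with nvcc -arch=sm_35 -rdc=true and linking cudadevrt:

    #include <cstdio>

    __global__ void childKernel(int parentBlock) {
        printf("child of block %d, thread %d\n", parentBlock, (int)threadIdx.x);
    }

    __global__ void parentKernel() {
        if (threadIdx.x == 0)                       // one child launch per parent block
            childKernel<<<1, 4>>>(blockIdx.x);      // kernel launched from device code
    }

    int main() {
        parentKernel<<<2, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }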
What is the primary purpose of the __syncthreads() function in CUDA?
Flush registers to global memory
Synchronize threads within a block at a barrier
Initiate a warp-level reduction
Synchronize all threads across the entire grid
__syncthreads() acts as a barrier, ensuring that all threads in a block have reached the same point of execution and that shared memory writes are visible to all. It prevents race conditions when sharing data in shared memory. It does not synchronize across blocks or affect registers.
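A small sketch shows why the barrier matters (a block size of 256 is assumed to match the shared buffer): each block reverses its slice of an array, and without __syncthreads() a thread could read a slot before its neighbor has written it.

    __global__ void reverseEachBlock(float *data) {
        __shared__ float buf[256];                    // assumes blockDim.x == 256
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        buf[t] = data[base + t];                      // every thread writes one slot
        __syncthreads();                              // barrier: all writes visible before any read
        data[base + t] = buf[blockDim.x - 1 - t];     // read a slot written by another thread
    }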
What is the typical DRAM burst length on modern NVIDIA GPUs' global memory interface?
32 bytes
256 bytes
128 bytes
64 bytes
Modern NVIDIA GPUs often use a 128-byte burst length for global memory accesses, meaning data is fetched in 128-byte contiguous chunks. Aligning memory transactions to this boundary improves throughput by reducing wasted bandwidth. Misaligned accesses can lead to split transactions and decreased performance.
Which factor can most directly reduce occupancy on a CUDA streaming multiprocessor?
Using too few registers per thread
Using too many constant memory references
Allocating large amounts of shared memory per block
Choosing a small block size
Allocating large shared memory per block can limit the number of blocks that can be resident on an SM simultaneously, directly reducing occupancy. Shared memory is a scarce on-chip resource, and excessive use per block may prevent additional blocks from launching. Balancing shared memory and register usage is key to maximizing occupancy.
How many shared memory banks exist on NVIDIA GPUs with compute capability 2.x and above?
32 banks
128 banks
16 banks
64 banks
On NVIDIA GPUs with compute capability 2.x and later, shared memory is divided into 32 banks. Each bank can deliver one 32-bit word per cycle, and proper thread access patterns avoid bank conflicts. Understanding the bank count is crucial for designing efficient shared memory algorithms.
Which strategy is most effective for maximizing global memory throughput on a CUDA device?
Prefetching data into local memory
Relying on atomic operations to serialize access
Aligning all accesses to 128-byte boundaries and coalescing
Using random memory access patterns
Maximizing global memory throughput requires aligning memory accesses to the GPU's burst size (typically 128 bytes) and ensuring coalesced accesses by threads in a warp. This minimizes the number of memory transactions and maximizes bus utilization. Misaligned or uncoalesced accesses lead to split transactions and lower bandwidth.
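One common way to get there is vectorized access; a sketch (assuming the element count is a multiple of 4 and the buffer comes from cudaMalloc, which returns generously aligned pointers):

    // Each thread moves 16 bytes at a time, so one warp covers 512 contiguous bytes:
    // four full 128-byte segments with no wasted bandwidth.
    __global__ void scaleVec4(float4 *data, float s, int n4) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) {
            float4 v = data[i];                 // single 16-byte load
            v.x *= s; v.y *= s; v.z *= s; v.w *= s;
            data[i] = v;                        // single 16-byte store
        }
    }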

Study Outcomes

  1. Understand 1536/64 Parallel Processing Fundamentals -

    Grasp the core principles of CUDA's 1536/64 execution model (e.g., 1536 resident threads per SM scheduled as 64-thread blocks), including how threads are organized into warps and blocks to maximize parallel throughput.

  2. Analyze CUDA Shared Memory Usage -

    Examine real-world quiz scenarios to identify best practices and common pitfalls when allocating and accessing shared memory in CUDA kernels.

  3. Optimize DRAM Burst Transfers -

    Learn how to align and coalesce memory transactions to maximize DRAM burst efficiency and minimize latency in GPU applications.

  4. Apply Thread Indexing Techniques -

    Use various indexing schemes to map threads to data elements, ensuring correct computation and optimal memory access patterns.

  5. Interpret Quiz Feedback for Skill Improvement -

    Review instant feedback on each question to pinpoint knowledge gaps in CUDA parallel processing and create a targeted learning plan.

Cheat Sheet

  1. Maximizing SM Occupancy -

    Occupancy measures how many threads are active per SM versus the hardware limit (e.g., 1536 threads on Fermi GPUs). Calculate occupancy as active warps ÷ max warps (1536/32=48 warps) and tune your block size (e.g., 64 threads=2 warps) to utilize SMs efficiently. Pro tip: use NVIDIA's CUDA Occupancy Calculator to balance registers and shared memory per block (CUDA C Programming Guide).

  2. Minimizing Shared Memory Bank Conflicts -

    Shared memory is divided into 32 banks, and simultaneous accesses by threads in a warp to different addresses within the same bank are serialized. Avoid conflicts by padding rows with an extra element (stride+1) or using diagonal indexing so consecutive threads map to different banks (CUDA C Best Practices Guide). Mnemonic: "Stride +1 keeps banks on the run!"

  3. Optimizing DRAM Bursts with Memory Coalescing -

    Global DRAM serves data in 128-byte bursts covering 32 threads; full coalescing occurs when each thread in a warp accesses consecutive 4-byte words. Align arrays on 128-byte boundaries and leverage vector types like float4 for packed loads/stores (CUDA C Best Practices Guide). Remember: "contiguous threads, contiguous data" for peak bandwidth.

  4. Efficient Thread Indexing Strategies -

    Map multi-dimensional CUDA grids to linear indices via idx = blockIdx.x * blockDim.x + threadIdx.x, and for 2D grids: row = blockIdx.y * blockDim.y + threadIdx.y. This formula, from the NVIDIA CUDA Toolkit documentation, simplifies partitioning of arrays and matrices across threads. Mnemonic aid: "blockIdx multiplies, threadIdx accumulates."

  5. Parallel Reduction Using Shared Memory -

    Load data into shared memory and iteratively halve active threads in a tree pattern while avoiding warp divergence (NVIDIA Developer Blog). Unroll the final warp and use __syncthreads() judiciously to synchronize, yielding near-peak throughput for sum, min, or max operations. Remember: "halve and sync" keeps the reduction in the pink!
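
    A minimal sketch of that pattern (256-thread blocks are assumed; the final-warp unrolling mentioned above is left out to keep the shape clear):

        __global__ void blockSum(const float *in, float *blockSums, int n) {
            __shared__ float sdata[256];                 // assumes blockDim.x == 256
            int t = threadIdx.x;
            int i = blockIdx.x * blockDim.x + t;
            sdata[t] = (i < n) ? in[i] : 0.0f;           // load, padding the tail with zeros
            __syncthreads();

            // "Halve and sync": the active threads shrink 128 -> 64 -> ... -> 1, always as a
            // contiguous range, which keeps whole warps either fully active or fully idle.
            for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
                if (t < stride)
                    sdata[t] += sdata[t + stride];
                __syncthreads();                         // finish this step before starting the next
            }
            if (t == 0) blockSums[blockIdx.x] = sdata[0]; // one partial sum per block
        }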
