The latest version of this sample is hosted on GitHub.
This is the DirectX SDK's Direct3D 11 BasicCompute11 and ComputeShaderSort11 samples updated to use Visual Studio 2012 and the Windows SDK 8.0 without any dependencies on legacy DirectX SDK content. These are samples Win32 desktop console DirectX 11.0 applications for Windows 8, Windows 7, and Windows Vista Service Pack 2 with the DirectX 11.0 runtime.
This is based on the the legacy DirectX SDK (June 2010) Win32 desktop console samples running on Windows Vista, Windows 7, and Windows 8. This is not intended for use with Windows Store apps or Windows RT, although the basic techniques are applicable.
Description
These are Win32 desktop console applications which demonstrate use of DirectCompute (aka Direct3D 11 Compute Shaders). For more complicated examples that combine DirectCompute with 3D rendering, see DirectCompute Graphics Win32 Samples.
This sample shows the basic usage of DirectX11 Compute Shader (aka DirectCompute) by implementing array A + array B.
Setting up the Compute Shader involves the following steps:
The sample and the HLSL code supports two additional build modes controlled by compile-time defines.
USE_STRUCTURED_BUFFERS
: If this is defined, then the sample and shader uses structured buffer types. If this is commented out, then the sample and shader use raw buffer types.
TEST_DOUBLE
: If this is defined, then the sample tests a double-precision data type. Note that double support requires Compute Shader 5.0 and optional driver support for double-precision shaders. If the sample and shader are compiled with TEST_DOUBLE
defined, then the sample will only run on 11-class hardware with doubles support, or it will fall back to using the Reference device.
This sample demonstrates the basic usage of the DirectX 11 Compute Shader 4.0 feature to implement a bitonic sort algorithm. It also highlights the considerations that must be taken to achieve good performance.
Bitonic sort is a simple algorithm that works by sorting the data set into alternating ascending and descending sorted sequences. These sequences can then be combined and sorted to produce larger sequences. This is repeated until you produce one final ascending sequence for the sorted data.
Now let's look at how to implement the bitonic sort in computer shader for a single thread group. To achieve good performance when implementing the sorting algorithm, it is important to limit the amount of memory accesses where possible. Because this algorithm has very few ALU operations and is limited by its memory accesses, we perform portions of the sort in shared memory, which is significantly faster. Unfortunately, there are two problems that must be worked around. First, there is a limited amount of group shared memory and a limited number of threads in a group. And second, in CS4.0, the group shared memory supports random access reads but it does not support random access writes. Even with these limitations, it is possible to create an efficient implementation using group shared memory.
Step 1: Load the group shared memory. Each thread loads one element.
shared_data[GI] = Data[DTid.x];
Step 2: Next, the threads must by synchronized to guarantee that all of the elements are loaded because the next operation will perform a random access read.
GroupMemoryBarrierWithGroupSync();
Step 3: Now each thread must pick the min or max of the two elements it is comparing. The thread cannot compare and swap both elements because that would require random access writes.
unsigned int result = ((shared_data[GI & ~j] <= shared_data[GI | j]) == (bool)(g_iLevelMask & DTid.x))? shared_data[GI ^ j] : shared_data[GI];
Step 4: Again, the threads must be synchronized. This is to prevent any threads from performing the write operation before all threads have completed the read.
GroupMemoryBarrierWithGroupSync();
Step 5: The min or max is now stored in group shared memory and synchronized. (The algorithm loops back to step 3 and must finish all writes before threads start reading.)
shared_data[GI] = result;
GroupMemoryBarrierWithGroupSync();
Step 6: With the memory sorted, the results can be stored back to the buffer.
Data[DTid.x] = shared_data[GI];
The bitonic sort shader we have created works great when the data set is small enough to run with one thread group. Unfortunately, for CS4.0, this means a maximum of 512 elements, which is the largest power of 2 number of threads in a group. To solve this, we can add two additional steps to the algorithm. When we need to sort a section that is too large to be processed by a single group of threads, we transpose the entire data set. With the data transposed, larger sort steps can be performed entirely in shared memory without changing the bitonic sort algorithm. Once the large steps are completed, the data can be transposed back to complete the smaller steps of the sort.
Implementing a transpose in Compute Shader is simple, but making it efficient requires a little bit of care. For best memory performance, it is preferable to access memory in a nice linear and consecutive pattern. Reading a row of data from the source with multiple threads is naturally a linear memory access. However, when that row is written to the destination as a column, the writes are no longer consecutive in memory. To achieve the best performance, a square block of data is first read into group shared memory as multiple contiguous memory reads. Then the shared memory is accessed as column data so that it can be written back as multiple contiguous memory writes. This allows us to shift the burden of the nonlinear access pattern to the high-performance group shared memory.
The code in these samples can be built using Visual Studio 2010 rather than Visual Studio 2012. The changes required are:
Open the project with Visual Studio 2013 and upgrade the VC++ complier and libraries.
July 22, 2014 - Code review updates
September 17, 2013 - Original version cleaned up from DirectX SDK (June 2010) release
Where is the DirectX SDK (2013 Edition)?
Games for Windows and DirectX SDK blog