DirectCompute Basic Win32 Samples


Win32, DirectX, DirectX SDK, DirectCompute
Graphics and 3D
Desktop
en-US
10/12/2015

The latest version of this sample is hosted on GitHub.

This is the DirectX SDK's Direct3D 11 BasicCompute11 and ComputeShaderSort11 samples updated to use Visual Studio 2012 and the Windows SDK 8.0 without any dependencies on legacy DirectX SDK content. These samples are Win32 desktop console DirectX 11.0 applications for Windows 8, Windows 7, and Windows Vista Service Pack 2 with the DirectX 11.0 runtime.

This is based on the legacy DirectX SDK (June 2010) Win32 desktop console samples running on Windows Vista, Windows 7, and Windows 8. This is not intended for use with Windows Store apps or Windows RT, although the basic techniques are applicable.

Description

These are Win32 desktop console applications that demonstrate the use of DirectCompute (aka Direct3D 11 Compute Shaders). For more complicated examples that combine DirectCompute with 3D rendering, see DirectCompute Graphics Win32 Samples.

BasicCompute11

This sample shows the basic usage of the DirectX 11 Compute Shader (aka DirectCompute) by computing array A + array B.

How the Sample Works

Setting up the Compute Shader involves the following steps:

  1. Create a D3D11 device and context. Make sure to check the feature level of the device created. Based on what graphics card is installed in the system, possibilities are:
    1. If an FL11 device has been created, we get full Compute Shader 5.0 capability.
    2. However, if we have an FL10 or FL10.1 device, Compute Shader 4.0/4.1 is potentially available, since CS4.0/4.1 is available on most DirectX 10 cards but not all of them. Call CheckFeatureSupport to see whether CS4.0/4.1 is available. Refer to the sample code to see how this is done.
    3. If we get an FL9.x device, Compute Shader is not available.
  2. Compile and then create the Compute Shader.
  3. Create input resources for the Compute Shader and fill them with data. As we are computing array A + array B, we create two buffers as the input resources.
  4. Create an SRV (shader resource view) for each of the input buffer resources. A shader resource view is used to bind input resources to shaders.
  5. Create an output resource for the Compute Shader.
  6. Create a UAV (unordered access view) for the output resource. An unordered access view is used to bind output resources to Compute Shaders. CS4.0/4.1 can have only one output resource bound to a Compute Shader at a time. CS5.0 doesn't have this limitation.
  7. Execute the Compute Shader by calling Dispatch.
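
The steps above can be sketched as an outline of the core Direct3D 11 calls. This is not runnable as-is: error checking, the buffer and view description structures, and shader bytecode loading are omitted, and the pointer names (pDevice, pCS, pBufA, and so on) are placeholders, not names from the sample.

```cpp
// Outline only; descriptors, HRESULT checks, and bytecode loading omitted.
D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                  nullptr, 0, D3D11_SDK_VERSION,
                  &pDevice, &featureLevel, &pContext);         // step 1

pDevice->CreateComputeShader(bytecode, bytecodeSize,
                             nullptr, &pCS);                   // step 2

pDevice->CreateBuffer(&inputDesc, &initDataA, &pBufA);         // step 3
pDevice->CreateBuffer(&inputDesc, &initDataB, &pBufB);
pDevice->CreateShaderResourceView(pBufA, &srvDesc, &pSrvA);    // step 4
pDevice->CreateShaderResourceView(pBufB, &srvDesc, &pSrvB);

pDevice->CreateBuffer(&outputDesc, nullptr, &pBufOut);         // step 5
pDevice->CreateUnorderedAccessView(pBufOut, &uavDesc, &pUav);  // step 6

pContext->CSSetShader(pCS, nullptr, 0);                        // step 7
ID3D11ShaderResourceView* srvs[2] = { pSrvA, pSrvB };
pContext->CSSetShaderResources(0, 2, srvs);
pContext->CSSetUnorderedAccessViews(0, 1, &pUav, nullptr);
pContext->Dispatch(numGroups, 1, 1);
```

Refer to the sample source for the complete version, including the feature-level and CS4.x capability checks from step 1.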

Build Options

The sample and the HLSL code support two additional build modes controlled by compile-time defines.

 

ComputeShaderSort11

This sample demonstrates the basic usage of the DirectX 11 Compute Shader 4.0 feature to implement a bitonic sort algorithm. It also highlights the considerations that must be taken to achieve good performance.

Bitonic Sort

Bitonic sort is a simple algorithm that works by sorting the data set into alternating ascending and descending sorted sequences. These sequences can then be combined and sorted to produce larger sequences. This is repeated until you produce one final ascending sequence for the sorted data.
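
To make the structure of the algorithm concrete, here is a minimal single-threaded CPU version in C++ (an illustration only; the function name is ours, and the actual sample runs the compare/exchange passes across GPU threads). The outer loop grows the sorted subsequence length k, and the inner loop compares elements a distance j apart, which is exactly the pass structure the shader implements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU illustration of bitonic sort (ascending result).
// The element count must be a power of two.
void bitonic_sort(std::vector<unsigned int>& data) {
    const std::size_t n = data.size();
    for (std::size_t k = 2; k <= n; k *= 2) {          // sorted subsequence length
        for (std::size_t j = k / 2; j > 0; j /= 2) {   // compare distance
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t partner = i ^ j;           // element to compare against
                if (partner > i) {
                    // Bit k of the index selects ascending vs descending order,
                    // producing the alternating sequences that are then merged.
                    bool ascending = (i & k) == 0;
                    if (ascending == (data[i] > data[partner]))
                        std::swap(data[i], data[partner]);
                }
            }
        }
    }
}
```

Note that every pass touches every element with a fixed access pattern regardless of the data, which is what makes the algorithm a good fit for the GPU.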

Bitonic Sort with Compute Shader

Now let's look at how to implement the bitonic sort in a compute shader for a single thread group. To achieve good performance when implementing the sorting algorithm, it is important to limit the number of memory accesses where possible. Because this algorithm has very few ALU operations and is limited by its memory accesses, we perform portions of the sort in group shared memory, which is significantly faster. Unfortunately, there are two problems that must be worked around. First, there is a limited amount of group shared memory and a limited number of threads in a group. Second, in CS4.0, the group shared memory supports random access reads but it does not support random access writes. Even with these limitations, it is possible to create an efficient implementation using group shared memory.

Step 1: Load the group shared memory. Each thread loads one element.

    shared_data[GI] = Data[DTid.x];


Step 2: Next, the threads must be synchronized to guarantee that all of the elements are loaded because the next operation will perform a random access read.

    GroupMemoryBarrierWithGroupSync();


Step 3: Now each thread must pick the min or max of the two elements it is comparing. The thread cannot compare and swap both elements because that would require random access writes.

    unsigned int result = ((shared_data[GI & ~j] <= shared_data[GI | j]) == (bool)(g_iLevelMask & DTid.x))? shared_data[GI ^ j] : shared_data[GI];


Step 4: Again, the threads must be synchronized. This is to prevent any threads from performing the write operation before all threads have completed the read.

    GroupMemoryBarrierWithGroupSync();


Step 5: The min or max is now stored in group shared memory and synchronized. (The algorithm loops back to step 3 and must finish all writes before threads start reading.)

    shared_data[GI] = result;
    GroupMemoryBarrierWithGroupSync();


Step 6: With the memory sorted, the results can be stored back to the buffer.

    Data[DTid.x] = shared_data[GI];

Sorting More Data 

The bitonic sort shader we have created works great when the data set is small enough to run with one thread group. Unfortunately, for CS4.0, this means a maximum of 512 elements, which is the largest power of 2 number of threads in a group. To solve this, we can add two additional steps to the algorithm. When we need to sort a section that is too large to be processed by a single group of threads, we transpose the entire data set. With the data transposed, larger sort steps can be performed entirely in shared memory without changing the bitonic sort algorithm. Once the large steps are completed, the data can be transposed back to complete the smaller steps of the sort.

Transpose

Implementing a transpose in Compute Shader is simple, but making it efficient requires a little bit of care. For best memory performance, it is preferable to access memory in a nice linear and consecutive pattern. Reading a row of data from the source with multiple threads is naturally a linear memory access. However, when that row is written to the destination as a column, the writes are no longer consecutive in memory. To achieve the best performance, a square block of data is first read into group shared memory as multiple contiguous memory reads. Then the shared memory is accessed as column data so that it can be written back as multiple contiguous memory writes. This allows us to shift the burden of the nonlinear access pattern to the high-performance group shared memory.
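
The same tiling idea can be shown in a small CPU sketch (illustrative only; TILE and the function name are ours, and on the GPU the tile lives in group shared memory with one thread per element rather than nested loops):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU illustration of the tiled transpose: read a TILE x TILE block
// row by row (contiguous reads), then write it back row by row at the
// transposed block position (contiguous writes). The non-contiguous
// access happens only inside the small local tile.
constexpr std::size_t TILE = 16;

void transpose_tiled(const std::vector<unsigned int>& src,
                     std::vector<unsigned int>& dst,
                     std::size_t width, std::size_t height) {
    // width and height are assumed to be multiples of TILE
    unsigned int tile[TILE][TILE];
    for (std::size_t by = 0; by < height; by += TILE) {
        for (std::size_t bx = 0; bx < width; bx += TILE) {
            // Contiguous reads from the source block
            for (std::size_t y = 0; y < TILE; ++y)
                for (std::size_t x = 0; x < TILE; ++x)
                    tile[y][x] = src[(by + y) * width + (bx + x)];
            // Contiguous writes to the transposed destination block;
            // the column access hits only the local tile
            for (std::size_t y = 0; y < TILE; ++y)
                for (std::size_t x = 0; x < TILE; ++x)
                    dst[(bx + y) * height + (by + x)] = tile[x][y];
        }
    }
}
```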

Building with Visual Studio 2010

The code in these samples can be built using Visual Studio 2010 rather than Visual Studio 2012. The changes required are:

Building with Visual Studio 2013

Open the project with Visual Studio 2013 and upgrade the VC++ compiler and libraries.

Version History

July 22, 2014 - Code review updates

September 17, 2013 - Original version cleaned up from DirectX SDK (June 2010) release

More Information

DirectCompute

Where is the DirectX SDK?

Where is the DirectX SDK (2013 Edition)?

Games for Windows and DirectX SDK blog