Introduction#
Recent CPUs use a radically different form of instruction-level parallelism. These CPUs deploy a versatile set of vector operations: instructions that operate on 4 or 8 inputs, yielding 4 or 8 results, often in a single cycle. This is known as SIMD: Single Instruction, Multiple Data.
This document provides a practical introduction to SIMD programming in C++ and C#.
Body#
SIMD Concepts#
Efficient SIMD code requires an efficient data layout; this must be done manually.
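As an illustration (not from the original text; the particle structs below are hypothetical), compare an array-of-structures layout with a structure-of-arrays layout: in the latter, four consecutive x values sit next to each other in memory, ready to be loaded into a single SIMD register.
// Array of Structures (AoS): the x values of consecutive particles are 16 bytes apart.
struct ParticleAoS { float x, y, z, w; };
ParticleAoS particlesAoS[256];
// Structure of Arrays (SoA): four consecutive x values are adjacent in memory.
struct ParticlesSoA
{
    float x[256], y[256], z[256], w[256];
};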
Data parallelism#
For a data-parallel algorithm, each of the scalars in a SIMD register holds the data for one ‘thread’. We call the slots in the register lanes. The input data is called a stream.
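A minimal sketch (not from the original text, using SSE intrinsics from <xmmintrin.h>): a single instruction processes all four lanes at once, where a scalar loop would need four separate additions.
// four independent 'threads' of data, one per lane
__m128 a4 = _mm_setr_ps( 1.0f, 2.0f, 3.0f, 4.0f );
__m128 b4 = _mm_setr_ps( 10.0f, 20.0f, 30.0f, 40.0f );
// one instruction adds all four lanes: the result is (11, 22, 33, 44)
__m128 sum4 = _mm_add_ps( a4, b4 );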
A Practical Example: C++#
The following code renders a Mandelbrot fractal:
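The original listing is not reproduced here; the sketch below reconstructs the relevant loops from the variable names used in the steps that follow (dx, dy, SCRWIDTH, SCRHEIGHT, screen and Plot are assumed to come from the surrounding template code):
// sketch of the scalar starting point: one pixel at a time
float yoffs = 0;
for( int y = 0; y < SCRHEIGHT; y++, yoffs += dy )
{
    float xoffs = 0;
    for( int x = 0; x < SCRWIDTH; x++, xoffs += dx )
    {
        float ox = 0, oy = 0;
        for( int i = 0; i < 99; i++ )
        {
            float px = ox, py = oy;
            oy = -(py * py - px * px - 0.55f + xoffs);
            ox = -(px * py + py * px - 0.55f + yoffs);
        }
        int r = min( 255, max( 0, (int)(ox * 255) ) );
        int g = min( 255, max( 0, (int)(oy * 255) ) );
        screen->Plot( x, y, (r << 16) + (g << 8) );
    }
}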
STEP 1: make a backup of the original code
- Wrap the original code in an #if 1 … #else … #endif block, so the original version remains available.
STEP 2: create four streams
Start out by simulating the use of four streams. Working on four pixels at a time means that x increases in steps of 4.
// my version
for( int x = 0; x < SCRWIDTH; x += 4, xoffs += dx * 4 )
{
    float ox[4] = { 0, 0, 0, 0 }, oy[4] = { 0, 0, 0, 0 };
    for( int i = 0; i < 99; i++ )
        for( int lane = 0; lane < 4; lane++ )
        {
            // keep the old values, so both updates use the same input
            float px = ox[lane], py = oy[lane];
            oy[lane] = -(py * py - px * px - 0.55f + xoffs + lane * dx);
            ox[lane] = -(px * py + py * px - 0.55f + yoffs);
        }
}
STEP 3: create the SIMD data structure
union { __m128 ox4; float ox[4]; };
union { __m128 oy4; float oy[4]; };
// use an intrinsic to set the 128-bit vectors to zero
ox4 = oy4 = _mm_setzero_ps();
STEP 4: check functionality
STEP 5: convert the inner loop
We use the SIMD variables ox4 and oy4 to represent the four ox and oy values. Regarding xoffs4: the variable xoffs used to be incremented by dx at every iteration of the pixel loop; xoffs4 instead holds the four offsets xoffs, xoffs + dx, xoffs + 2 * dx and xoffs + 3 * dx for the current group of pixels. The statement oy4 = -(py4 * py4 - px4 * px4 - 0.55f + xoffs4) can then be converted to:
// 0 - (py4 * py4 - px4 * px4 - 0.55f + xoffs4)
oy4 = _mm_sub_ps(_mm_setzero_ps(), (
// py4 * py4 - px4 * px4 - 0.55f + xoffs4
_mm_add_ps(
// py4 * py4 - px4 * px4 - 0.55f
_mm_sub_ps(
// py4 * py4 - px4 * px4
_mm_sub_ps(
// py4 * py4
_mm_mul_ps(py4, py4),
// px4 * px4
_mm_mul_ps(px4, px4)
),
_mm_set1_ps(0.55f)),
xoffs4)
));
Bringing everything together, we get the final vectorized program:
// my version
for( int x = 0; x < SCRWIDTH; x += 4, xoffs += dx * 4 )
{
    union { __m128 ox4; float ox[4]; };
    union { __m128 oy4; float oy[4]; };
    ox4 = oy4 = _mm_setzero_ps();
    __m128 xoffs4 = _mm_setr_ps( xoffs, xoffs + dx, xoffs + dx * 2, xoffs + dx * 3 );
    __m128 yoffs4 = _mm_set_ps1( yoffs );
    for( int i = 0; i < 99; i++ )
    {
        // keep the old values, so both updates use the same input
        __m128 px4 = ox4, py4 = oy4;
        // oy4 = -(py4 * py4 - px4 * px4 - 0.55f + xoffs4)
        oy4 = _mm_sub_ps( _mm_setzero_ps(),
            _mm_add_ps(
                _mm_sub_ps(
                    _mm_sub_ps( _mm_mul_ps( py4, py4 ), _mm_mul_ps( px4, px4 ) ),
                    _mm_set1_ps( 0.55f ) ),
                xoffs4 ) );
        // ox4 = -(px4 * py4 + py4 * px4 - 0.55f + yoffs4)
        ox4 = _mm_sub_ps( _mm_setzero_ps(),
            _mm_add_ps(
                _mm_sub_ps(
                    _mm_add_ps( _mm_mul_ps( px4, py4 ), _mm_mul_ps( py4, px4 ) ),
                    _mm_set1_ps( 0.55f ) ),
                yoffs4 ) );
    }
    for( int lane = 0; lane < 4; lane++ )
    {
        int r = min( 255, max( 0, (int)(ox[lane] * 255) ) );
        int g = min( 255, max( 0, (int)(oy[lane] * 255) ) );
        screen->Plot( x + lane, y, (r << 16) + (g << 8) );
    }
}
Conditional Code & SIMD#
Clearly, anything conditional may cause the scalar flows to diverge: the four lanes no longer execute the same code. In the vectorized Mandelbrot program, this is why the plot loop remained scalar:
for( int lane = 0; lane < 4; lane++ )
Min & Max#
For the specific case of min and max, SSE and AVX offer an efficient solution:
__m128 c4 = _mm_min_ps( a4, b4 );
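Applied to the Mandelbrot plot loop, the clamping of r and g to [0, 255] could, for example, be done with vector min and max before the lanes are extracted (a sketch, assuming ox4 from the program above; the conversion to integers and the Plot call stay scalar):
// clamp ox4 * 255 to the range [0, 255] using vector min/max
__m128 r4 = _mm_mul_ps( ox4, _mm_set1_ps( 255.0f ) );
r4 = _mm_max_ps( r4, _mm_setzero_ps() );      // max( 0, r )
r4 = _mm_min_ps( r4, _mm_set1_ps( 255.0f ) ); // min( 255, r )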
If statement - Most important part#
Now think about how to vectorize Plot, which contains an if statement.
void Surface::Plot( int x, int y, Pixel c )
{
    if ((x >= 0) && (y >= 0) && (x < m_Width) && (y < m_Height))
        m_Buffer[x + y * m_Pitch] = c;
}
SSE and AVX do not have if statements, but they do have comparison instructions. These do not yield ‘quadbools’, but they do return something useful: bitmasks. Here is an example:
__m128 mask = _mm_cmpge_ps( x4, _mm_setzero_ps() ); // if (x4 >= 0)
The mask value is a 128-bit value. After the comparison, each 32-bit lane reflects the result for that lane: 32 zero bits for a ‘false’, and 32 one bits for a ‘true’. We can also combine comparisons:
__m128 mask1 = _mm_cmpge_ps( x4, _mm_setzero_ps() ); // if (x4 >= 0)
__m128 mask2 = _mm_cmpge_ps( y4, _mm_setzero_ps() ); // if (y4 >= 0)
__m128 mask = _mm_and_ps( mask1, mask2 );            // if (x4 >= 0 && y4 >= 0)
The resulting bitmasks can be used in two distinct ways.
The first way is to break the vector instruction flow and switch to scalar code to handle the result of the comparisons (see the sketch below).
- The benefit is that we now at least did the comparisons using vector code. We didn’t solve the actual problem though: the conditional code still breaks our vector flow.
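A minimal sketch of this first approach (not from the original text): _mm_movemask_ps collapses the four lane masks into a 4-bit integer that ordinary scalar code can inspect.
// collect the top bit of each lane into a 4-bit scalar value
int bits = _mm_movemask_ps( mask );
for( int lane = 0; lane < 4; lane++ )
    if (bits & (1 << lane))
    {
        // handle this lane in scalar code
    }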
The second way is to disable functionality for lanes for which the condition is false.
Consider the actual conditional code:
m_Buffer[x + y * m_Pitch] = c;
Now, what if we replaced this address by some other safe location, for instance the address of a dummy variable? We would still perform the write, but this time it would not yield a visible pixel, just as if the pixel happened to be off-screen.
To achieve this, mask the address calculation x + y * m_Pitch with the bitmask: (x + y * m_Pitch) & mask. The address becomes 0 when mask is 0, and is left unchanged when mask is all ones. Let’s compute the full mask for the plot statement:
// if ((x >= 0) && (y >= 0) && (x < m_Width) && (y < m_Height)) mask = all ones, else mask = 0
__m128 x_ge_0 = _mm_cmpge_ps( x4, _mm_setzero_ps() );
__m128 y_ge_0 = _mm_cmpge_ps( y4, _mm_setzero_ps() );
__m128 x_lt_width = _mm_cmplt_ps( x4, _mm_set_ps1( m_Width ) );
__m128 y_lt_height = _mm_cmplt_ps( y4, _mm_set_ps1( m_Height ) );
__m128 mask = _mm_and_ps( _mm_and_ps( _mm_and_ps( x_ge_0, y_ge_0 ), x_lt_width ), y_lt_height );
Then we can calculate the four pixel addresses as follows:
// Multiplying two 32-bit integers yields a 64-bit result, which does not fit in a 32-bit lane.
// The _mm_mullo_epi32 instruction keeps only the lower 32 bits, which is fine in this case.
// Here x4 and y4 are assumed to hold the coordinates as 32-bit integers (__m128i),
// and m_Pitch4 holds m_Pitch in all four lanes.
__m128i address4 = _mm_add_epi32( x4, _mm_mullo_epi32( y4, m_Pitch4 ) );
// There is no _mm_and_epi32 instruction; instead, a bitwise AND on the full 128 bits is done with _mm_and_si128.
// Our mask is a quadfloat, so `*(__m128i*)&mask` reinterprets it on the fly as the correct type.
address4 = _mm_and_si128( address4, *(__m128i*)&mask );
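To complete the picture, here is a sketch (not from the original text) of how the masked addresses could then be used; it repeats the address computation with a union wrapper so the lanes can be read from scalar code, and colour is a hypothetical array holding the colour computed for each lane. The four stores always happen, but lanes that failed the test write to index 0, which acts as the ‘dummy’ safe location (at the cost of occasionally touching pixel 0).
// a union gives scalar access to the four masked addresses
union { __m128i address4; int address[4]; };
address4 = _mm_and_si128( _mm_add_epi32( x4, _mm_mullo_epi32( y4, m_Pitch4 ) ), *(__m128i*)&mask );
for( int lane = 0; lane < 4; lane++ )
    m_Buffer[address[lane]] = colour[lane]; // lanes with mask == 0 write to m_Buffer[0], the 'dummy' slot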
Optimizing and Debugging SIMD Code, Hints#
In the previous sections, we have seen how code can be vectorized, and how to deal with conditional code. In this section we will discuss some common opportunities for improving the efficiency of SIMD code.
Instruction count#
In principle, every intrinsic compiles to a single assembly instruction, so the intrinsic count of a loop is a good first estimate of its cost.
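As an example (a sketch, not from the original text): negating a quadfloat by subtracting it from zero uses two intrinsics, while XOR-ing the sign bit with a precomputed constant uses only one inside the loop.
// two intrinsics: create a zero vector, then subtract
__m128 neg_a = _mm_sub_ps( _mm_setzero_ps(), a4 );
// one intrinsic inside the loop: XOR with the sign bit (the constant can be hoisted out of the loop)
const __m128 signbit4 = _mm_set1_ps( -0.0f );
__m128 neg_b = _mm_xor_ps( a4, signbit4 );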
Floating point versus integer#
Floating point support in SSE and AVX is much better than integer support, so stick to floating point calculations wherever possible.
Reduce the use of _mm_set_ps#
_mm_set_ps is an expensive intrinsic, as it takes four scalar operands that must be combined into a single register.
- Consider caching the result: calculate the quadfloat outside loops (see the sketch below).
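For instance, in the Mandelbrot loop above, the 0.55f quadfloat can be built once instead of on every iteration (a sketch):
const __m128 c4 = _mm_set1_ps( 0.55f ); // built once, outside the loops
for( int i = 0; i < 99; i++ )
{
    // ... use c4 here instead of calling _mm_set1_ps( 0.55f ) every iteration ...
}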
Avoid gather operations#
Reading SIMD operands from scattered memory locations (gathering) is expensive; prefer a data layout that lets you fetch a quadfloat with a single load.
Data alignment#
One thing to keep in mind: a quadfloat in memory must always be stored at an address that is a multiple of 16. Failure to do so will result in a crash.
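A brief sketch (standard C++/SSE, not from the original text) of how that alignment is typically guaranteed: a plain float array carries no 16-byte guarantee, but alignas (or a union with a __m128 member, as used earlier) provides one.
alignas(16) float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f }; // 16-byte aligned
__m128 v4 = _mm_load_ps( data );  // aligned load: requires 16-byte alignment
__m128 w4 = _mm_loadu_ps( data ); // unaligned load: always safe, but slower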
Vectorize bottlenecks only#
Vectorization is hard, so focus your effort on bottlenecks only: a lot of work concentrated in a tiny portion of code, screaming for close-to-the-metal optimization.
C++ debugger#
Support for SIMD is well-integrated in the Visual Studio debugger: you can, for example, effortlessly inspect the individual lane values of SIMD variables.
Evade fancy SIMD libraries#
If you do want a library, consider Agner Fog’s vector class library: http://www.agner.org/optimize/#vectorclass
Closing Remarks#
Jacco Bikker : bikker.j@gmail.com