Introduction#
Recent CPUs use a radically different form of instruction-level parallelism. These CPUs deploy a versatile set of vector operations: instructions that operate on 4 or 8 inputs, yielding 4 or 8 results, often in a single cycle. This is known as SIMD: Single Instruction, Multiple Data.
This document provides a practical introduction to SIMD programming in C++ and C#.
Body#
SIMD Concepts#
Efficient SIMD code requires an efficient data layout; this must be done manually.
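As an illustration (not from the original text; the particle structs below are hypothetical), compare an array-of-structures layout with a structure-of-arrays layout: in the latter, four consecutive x values sit next to each other in memory, ready to be loaded into a single SIMD register.
// Array of Structures (AoS): the x values of consecutive particles are 16 bytes apart.
struct ParticleAoS { float x, y, z, w; };
ParticleAoS particlesAoS[256];
// Structure of Arrays (SoA): four consecutive x values are adjacent in memory.
struct ParticlesSoA
{
    float x[256], y[256], z[256], w[256];
};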
Data parallelism#
For a data-parallel algorithm, each of the scalars in a SIMD register holds the data for one ‘thread’. We call the slots in the register lanes. The input data is called a stream.
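A minimal sketch (not from the original text, using SSE intrinsics from <xmmintrin.h>): a single instruction processes all four lanes at once, where a scalar loop would need four separate additions.
// four independent 'threads' of data, one per lane
__m128 a4 = _mm_setr_ps( 1.0f, 2.0f, 3.0f, 4.0f );
__m128 b4 = _mm_setr_ps( 10.0f, 20.0f, 30.0f, 40.0f );
// one instruction adds all four lanes: the result is (11, 22, 33, 44)
__m128 sum4 = _mm_add_ps( a4, b4 );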
A Practical Example: C++#
The following code renders a Mandelbrot fractal:
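The original listing is not reproduced here; the sketch below reconstructs the relevant loops from the variable names used in the steps that follow (dx, dy, SCRWIDTH, SCRHEIGHT, screen and Plot are assumed to come from the surrounding template code):
// sketch of the scalar starting point: one pixel at a time
float yoffs = 0;
for( int y = 0; y < SCRHEIGHT; y++, yoffs += dy )
{
    float xoffs = 0;
    for( int x = 0; x < SCRWIDTH; x++, xoffs += dx )
    {
        float ox = 0, oy = 0;
        for( int i = 0; i < 99; i++ )
        {
            float px = ox, py = oy;
            oy = -(py * py - px * px - 0.55f + xoffs);
            ox = -(px * py + py * px - 0.55f + yoffs);
        }
        int r = min( 255, max( 0, (int)(ox * 255) ) );
        int g = min( 255, max( 0, (int)(oy * 255) ) );
        screen->Plot( x, y, (r << 16) + (g << 8) );
    }
}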
STEP 1: make a backup of the original code
- Wrap the original code in an #if 1 … #else … #endif block, so the original version remains available.
STEP 2: create four streams
Start out by simulating the use of four streams. Working on four pixels at a time means that x increases in steps of 4.
// my version
for( int x = 0; x < SCRWIDTH; x += 4, xoffs += dx * 4 )
{
    float ox[4] = { 0, 0, 0, 0 }, oy[4] = { 0, 0, 0, 0 };
    for( int i = 0; i < 99; i++ )
        for( int lane = 0; lane < 4; lane++ )
        {
            // keep the old values, so both updates use the same input
            float px = ox[lane], py = oy[lane];
            oy[lane] = -(py * py - px * px - 0.55f + xoffs + lane * dx);
            ox[lane] = -(px * py + py * px - 0.55f + yoffs);
        }
}
STEP 3: create the SIMD data structure
union { __m128 ox4; float ox[4]; };
union { __m128 oy4; float oy[4]; };
// use an intrinsic to set the 128-bit vectors to zero
ox4 = oy4 = _mm_setzero_ps();
STEP 4: check functionality
STEP 5: convert the inner loop
We use the SIMD variables ox4 and oy4 to represent the four ox and oy values. Regarding xoffs4: the variable xoffs used to be incremented by dx at every iteration of the pixel loop; xoffs4 instead holds the four offsets xoffs, xoffs + dx, xoffs + 2 * dx and xoffs + 3 * dx for the current group of pixels. The statement oy4 = -(py4 * py4 - px4 * px4 - 0.55f + xoffs4) can then be converted to:
// 0 - (py4 * py4 - px4 * px4 - 0.55f + xoffs4)
oy4 = _mm_sub_ps(_mm_setzero_ps(), (
// py4 * py4 - px4 * px4 - 0.55f + xoffs4
_mm_add_ps(
// py4 * py4 - px4 * px4 - 0.55f
_mm_sub_ps(
// py4 * py4 - px4 * px4
_mm_sub_ps(
// py4 * py4
_mm_mul_ps(py4, py4),
// px4 * px4
_mm_mul_ps(px4, px4)
),
_mm_set1_ps(0.55f)),
xoffs4)
));
Bringing everything together, we get the final vectorized program:
// my version
for( int x = 0; x < SCRWIDTH; x += 4, xoffs += dx * 4 )
{
    union { __m128 ox4; float ox[4]; };
    union { __m128 oy4; float oy[4]; };
    ox4 = oy4 = _mm_setzero_ps();
    __m128 xoffs4 = _mm_setr_ps( xoffs, xoffs + dx, xoffs + dx * 2, xoffs + dx * 3 );
    __m128 yoffs4 = _mm_set_ps1( yoffs );
    for( int i = 0; i < 99; i++ )
    {
        // keep the old values, so both updates use the same input
        __m128 px4 = ox4, py4 = oy4;
        // oy4 = -(py4 * py4 - px4 * px4 - 0.55f + xoffs4)
        oy4 = _mm_sub_ps( _mm_setzero_ps(),
            _mm_add_ps(
                _mm_sub_ps(
                    _mm_sub_ps( _mm_mul_ps( py4, py4 ), _mm_mul_ps( px4, px4 ) ),
                    _mm_set1_ps( 0.55f ) ),
                xoffs4 ) );
        // ox4 = -(px4 * py4 + py4 * px4 - 0.55f + yoffs4)
        ox4 = _mm_sub_ps( _mm_setzero_ps(),
            _mm_add_ps(
                _mm_sub_ps(
                    _mm_add_ps( _mm_mul_ps( px4, py4 ), _mm_mul_ps( py4, px4 ) ),
                    _mm_set1_ps( 0.55f ) ),
                yoffs4 ) );
    }
    for( int lane = 0; lane < 4; lane++ )
    {
        int r = min( 255, max( 0, (int)(ox[lane] * 255) ) );
        int g = min( 255, max( 0, (int)(oy[lane] * 255) ) );
        screen->Plot( x + lane, y, (r << 16) + (g << 8) );
    }
}
Conditional Code & SIMD#
Clearly, anything conditional may cause the scalar flows to diverge: the four lanes no longer execute the same code. In the vectorized Mandelbrot program, this is why the plot loop remained scalar:
for( int lane = 0; lane < 4; lane++ )
Min & Max#
For the specific case of min and max, SSE and AVX offer an efficient solution:
__m128 c4 = _mm_min_ps( a4, b4 );
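Applied to the Mandelbrot plot loop, the clamping of r and g to [0, 255] could, for example, be done with vector min and max before the lanes are extracted (a sketch, assuming ox4 from the program above; the conversion to integers and the Plot call stay scalar):
// clamp ox4 * 255 to the range [0, 255] using vector min/max
__m128 r4 = _mm_mul_ps( ox4, _mm_set1_ps( 255.0f ) );
r4 = _mm_max_ps( r4, _mm_setzero_ps() );      // max( 0, r )
r4 = _mm_min_ps( r4, _mm_set1_ps( 255.0f ) ); // min( 255, r )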
If statement - Most important part#
Now think about how to vectorize Plot, which contains an if statement.
void Surface::Plot( int x, int y, Pixel c )
{
    if ((x >= 0) && (y >= 0) && (x < m_Width) && (y < m_Height))
        m_Buffer[x + y * m_Pitch] = c;
}
SSE and AVX do not have if statements, but they do have comparison instructions. These do not yield ‘quadbools’, but they do return something useful: bitmasks. Here is an example:
__m128 mask = _mm_cmpge_ps( x4, _mm_setzero_ps() ); // if (x4 >= 0)
The mask value is a 128-bit value. After the comparison, each 32-bit lane reflects the result for that lane: 32 zero bits for a ‘false’, and 32 one bits for a ‘true’. We can also combine comparisons:
__m128 mask1 = _mm_cmpge_ps( x4, _mm_setzero_ps() ); // if (x4 >= 0)
__m128 mask2 = _mm_cmpge_ps( y4, _mm_setzero_ps() ); // if (y4 >= 0)
__m128 mask = _mm_and_ps( mask1, mask2 );            // if (x4 >= 0 && y4 >= 0)
The resulting bitmasks can be used in two distinct ways.
The first way is to break the vector instruction flow and switch to scalar code to handle the result of the comparisons (see the sketch below).
- The benefit is that we now at least did the comparisons using vector code. We didn’t solve the actual problem though: the conditional code still breaks our vector flow.
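A minimal sketch of this first approach (not from the original text): _mm_movemask_ps collapses the four lane masks into a 4-bit integer that ordinary scalar code can inspect.
// collect the top bit of each lane into a 4-bit scalar value
int bits = _mm_movemask_ps( mask );
for( int lane = 0; lane < 4; lane++ )
    if (bits & (1 << lane))
    {
        // handle this lane in scalar code
    }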
The second way is to disable functionality for lanes for which the condition is false.
Consider the actual conditional code:
m_Buffer[x + y * m_Pitch] = c;
Now, what if we replaced this address by some other safe location, for instance the address of a dummy variable? We would still perform the write, but this time it would not yield a visible pixel, just as if the pixel happened to be off-screen.
To achieve this, mask the address calculation x + y * m_Pitch with the bitmask: (x + y * m_Pitch) & mask. The address becomes 0 when mask is 0, and is left unchanged when mask is all ones. Let’s compute the full mask for the plot statement:
// if ((x >= 0) && (y >= 0) && (x < m_Width) && (y < m_Height)) mask = all ones, else mask = 0
__m128 x_ge_0 = _mm_cmpge_ps( x4, _mm_setzero_ps() );
__m128 y_ge_0 = _mm_cmpge_ps( y4, _mm_setzero_ps() );
__m128 x_lt_width = _mm_cmplt_ps( x4, _mm_set_ps1( m_Width ) );
__m128 y_lt_height = _mm_cmplt_ps( y4, _mm_set_ps1( m_Height ) );
__m128 mask = _mm_and_ps( _mm_and_ps( _mm_and_ps( x_ge_0, y_ge_0 ), x_lt_width ), y_lt_height );
Then we can calculate the four pixel addresses as follows:
// Multiplying two 32-bit integers yields a 64-bit result, which does not fit in a 32-bit lane.
// The _mm_mullo_epi32 instruction keeps only the lower 32 bits, which is fine in this case.
// Here x4 and y4 are assumed to hold the coordinates as 32-bit integers (__m128i),
// and m_Pitch4 holds m_Pitch in all four lanes.
__m128i address4 = _mm_add_epi32( x4, _mm_mullo_epi32( y4, m_Pitch4 ) );
// There is no _mm_and_epi32 instruction; instead, a bitwise AND on the full 128 bits is done with _mm_and_si128.
// Our mask is a quadfloat, so `*(__m128i*)&mask` reinterprets it on the fly as the correct type.
address4 = _mm_and_si128( address4, *(__m128i*)&mask );
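To complete the picture, here is a sketch (not from the original text) of how the masked addresses could then be used; it repeats the address computation with a union wrapper so the lanes can be read from scalar code, and colour is a hypothetical array holding the colour computed for each lane. The four stores always happen, but lanes that failed the test write to index 0, which acts as the ‘dummy’ safe location (at the cost of occasionally touching pixel 0).
// a union gives scalar access to the four masked addresses
union { __m128i address4; int address[4]; };
address4 = _mm_and_si128( _mm_add_epi32( x4, _mm_mullo_epi32( y4, m_Pitch4 ) ), *(__m128i*)&mask );
for( int lane = 0; lane < 4; lane++ )
    m_Buffer[address[lane]] = colour[lane]; // lanes with mask == 0 write to m_Buffer[0], the 'dummy' slot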
Optimizing and Debugging SIMD Code, Hints#
In the previous sections, we have seen how code can be vectorized, and how to deal with conditional code. In this section we will discuss some common opportunities for improving the efficiency of SIMD code.
Instruction count#
In principle, every intrinsic compiles to a single assembly instruction, so the intrinsic count of a loop is a good first estimate of its cost.
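As an example (a sketch, not from the original text): negating a quadfloat by subtracting it from zero uses two intrinsics, while XOR-ing the sign bit with a precomputed constant uses only one inside the loop.
// two intrinsics: create a zero vector, then subtract
__m128 neg_a = _mm_sub_ps( _mm_setzero_ps(), a4 );
// one intrinsic inside the loop: XOR with the sign bit (the constant can be hoisted out of the loop)
const __m128 signbit4 = _mm_set1_ps( -0.0f );
__m128 neg_b = _mm_xor_ps( a4, signbit4 );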
Floating point versus integer#
Floating point support in SSE and AVX is much better than integer support, so stick to floating point calculations wherever possible.
Reduce the use of _mm_set_ps#
_mm_set_ps is an expensive intrinsic, as it takes four scalar operands that must be combined into a single register.
- Consider caching the result: calculate the quadfloat outside loops (see the sketch below).
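For instance, in the Mandelbrot loop above, the 0.55f quadfloat can be built once instead of on every iteration (a sketch):
const __m128 c4 = _mm_set1_ps( 0.55f ); // built once, outside the loops
for( int i = 0; i < 99; i++ )
{
    // ... use c4 here instead of calling _mm_set1_ps( 0.55f ) every iteration ...
}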
Avoid gather operations#
Reading SIMD operands from scattered memory locations (gathering) is expensive; prefer a data layout that lets you fetch a quadfloat with a single load.
Data alignment#
One thing to keep in mind: a quadfloat in memory must always be stored at an address that is a multiple of 16. Failure to do so will result in a crash.
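A brief sketch (standard C++/SSE, not from the original text) of how that alignment is typically guaranteed: a plain float array carries no 16-byte guarantee, but alignas (or a union with a __m128 member, as used earlier) provides one.
alignas(16) float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f }; // 16-byte aligned
__m128 v4 = _mm_load_ps( data );  // aligned load: requires 16-byte alignment
__m128 w4 = _mm_loadu_ps( data ); // unaligned load: always safe, but slower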
Vectorize bottlenecks only#
Vectorization is hard, so focus your effort on bottlenecks only: a lot of work concentrated in a tiny portion of code, screaming for close-to-the-metal optimization.
C++ debugger#
Support for SIMD is well-integrated in the Visual Studio debugger: you can, for example, effortlessly inspect the individual lane values of SIMD variables.
Evade fancy SIMD libraries#
If you do want a library, consider Agner Fog’s vector class library: http://www.agner.org/optimize/#vectorclass
Closing Remarks#
Jacco Bikker : bikker.j@gmail.com