For example, what Intel AVX/SSE intrinsics can I use to split an array of complex numbers into two arrays of real and imaginary parts respectively?
So something like [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ...] becomes [1.0, 3.0, 5.0, 7.0...] and [2.0, 4.0, 6.0, 8.0,...]
What set of instructions will maximize the throughput for accomplishing this split?
I would appreciate answers for the following scenarios:
I am aiming for something that is fastest on modern Intel or AMD architectures with AVX/AVX2 capability, but that doesn't significantly compromise performance on other architectures.
I am writing a small FFT library. Deinterleaving comes up twice. For radix-2 and radix-4 implementations, deinterleaving comes up when the stage length is less than the SIMD operand length. Deinterleaving is also useful for separating the complex numbers into real and imaginary components, so that the input can be processed more efficiently as separate arrays of real and imaginary parts.
Note: I believe the equivalent instruction on ARM SVE is UZPx. I am quite confused as to why there isn't an equivalent single instruction in AVX, given how useful this operation is.
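For reference, a minimal sketch of that SVE approach using the ACLE intrinsics (svuzp1/svuzp2); the function name and the full-vector, no-tail assumption are mine and not part of the question:
#include <arm_sve.h>
// Split one vector's worth of interleaved complex floats into separate
// real and imaginary arrays. UZP1 keeps the even-indexed elements of the
// two concatenated inputs (the reals), UZP2 keeps the odd-indexed elements
// (the imaginaries). Tail handling is omitted.
void sve_deinterleave(const float *In, float *Re, float *Im)
{
    svbool_t PG = svptrue_b32();
    svfloat32_t A = svld1_f32(PG, In);             // r0 i0 r1 i1 ...
    svfloat32_t B = svld1_f32(PG, In + svcntw());  // the next vector's worth
    svst1_f32(PG, Re, svuzp1_f32(A, B));           // r0 r1 r2 r3 ...
    svst1_f32(PG, Im, svuzp2_f32(A, B));           // i0 i1 i2 i3 ...
}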
The best I can do for a 256-bit vector of 64-bit doubles is to use unpack twice per 128-bit lane, for a throughput of 2 cycles per vector, and to use unpack 3 times per lane for a 256-bit vector of 32-bit floats, for a throughput of 3 cycles per vector. Can I do better?
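For reference, a minimal sketch of the double-precision baseline described above (the function name is mine, not part of the question); the final lane-crossing permutes are only needed if the original element order has to be preserved:
#include <immintrin.h>
// Split 4 interleaved complex doubles into a vector of reals and a vector
// of imaginaries, using one unpack plus one permute per output register.
void deinterleave4_pd(const double In[8], __m256d *Re, __m256d *Im)
{
    __m256d A = _mm256_loadu_pd(In);           // r0 i0 | r1 i1
    __m256d B = _mm256_loadu_pd(In + 4);       // r2 i2 | r3 i3
    __m256d ReSwz = _mm256_unpacklo_pd(A, B);  // r0 r2 | r1 r3
    __m256d ImSwz = _mm256_unpackhi_pd(A, B);  // i0 i2 | i1 i3
    // AVX2 vpermpd to restore the original element order 0 1 2 3.
    *Re = _mm256_permute4x64_pd(ReSwz, _MM_SHUFFLE(3, 1, 2, 0));
    *Im = _mm256_permute4x64_pd(ImSwz, _MM_SHUFFLE(3, 1, 2, 0));
}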
In an ideal world, you'd be storing the data in a Structure-Of-Arrays layout, so that these swizzles would be unnecessary in the first place.
The next question you need to ask is whether the returned ordering actually matters. For example, if you are multiplying two arrays of complex numbers elementwise, both arrays only need to be swizzled to the same ordering; they don't actually need the correct ordering. For this, you just need unpacklo / unpackhi to convert to SOA, and then the same again to convert back to AOS.
If the original ordering is an absolute requirement (honestly, that isn't always the case in my experience), then yeah, you're going to need unpacklo/unpackhi, and a pair of permutes to convert back and forth.
#include <immintrin.h>

typedef struct {
    float real;
    float imag;
} Complex;

typedef struct {
    __m256 real;
    __m256 imag;
} Complex8;

// returns the original values in the swizzled ordering: 01452367
Complex8 load8(const Complex Input[8])
{
    __m256 A = _mm256_loadu_ps(&Input[0].real);  // r0 i0 r1 i1 | r2 i2 r3 i3
    __m256 B = _mm256_loadu_ps(&Input[4].real);  // r4 i4 r5 i5 | r6 i6 r7 i7
    __m256 Lo = _mm256_unpacklo_ps(A, B);        // r0 r4 i0 i4 | r2 r6 i2 i6
    __m256 Hi = _mm256_unpackhi_ps(A, B);        // r1 r5 i1 i5 | r3 r7 i3 i7
    return (Complex8) {
        _mm256_unpacklo_ps(Lo, Hi),              // r0 r1 r4 r5 | r2 r3 r6 r7
        _mm256_unpackhi_ps(Lo, Hi)               // i0 i1 i4 i5 | i2 i3 i6 i7
    };
}

// returns the original values in their original ordering: 01234567,
// using one lane-crossing permute (AVX2 vpermps) per output register
Complex8 load8_with_proper_ordering(const Complex Input[8])
{
    Complex8 In8 = load8(Input);
    __m256i Indices = _mm256_setr_epi32(0, 1, 4, 5, 2, 3, 6, 7);
    return (Complex8) {
        _mm256_permutevar8x32_ps(In8.real, Indices),
        _mm256_permutevar8x32_ps(In8.imag, Indices)
    };
}
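To illustrate the point above about not needing the original ordering, here is a hedged sketch of using load8 for elementwise complex multiplication (the helpers mul8 and store8 are mine, not from the answer). The multiply works on the swizzled Complex8 directly, and a single round of unpacklo/unpackhi re-interleaves the result back to AOS, so no lane-crossing permutes are required on either side.
// Elementwise complex multiply; the element ordering is irrelevant as long
// as both operands use the same swizzle.
Complex8 mul8(Complex8 a, Complex8 b)
{
    return (Complex8) {
        _mm256_sub_ps(_mm256_mul_ps(a.real, b.real), _mm256_mul_ps(a.imag, b.imag)),
        _mm256_add_ps(_mm256_mul_ps(a.real, b.imag), _mm256_mul_ps(a.imag, b.real))
    };
}

// Interleave a swizzled Complex8 (ordering 01452367) back to AOS storage.
void store8(Complex Output[8], Complex8 v)
{
    __m256 Lo = _mm256_unpacklo_ps(v.real, v.imag);  // r0 i0 r1 i1 | r2 i2 r3 i3
    __m256 Hi = _mm256_unpackhi_ps(v.real, v.imag);  // r4 i4 r5 i5 | r6 i6 r7 i7
    _mm256_storeu_ps(&Output[0].real, Lo);
    _mm256_storeu_ps(&Output[4].real, Hi);
}
Usage: store8(Out, mul8(load8(X), load8(Y))) multiplies two interleaved arrays of 8 complex numbers without ever restoring the original element order.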
Would _mm256_unpacklo_pd() and _mm256_unpackhi_pd() work for you? It is pretty much exactly what your referenced UZPx is doing. Am I missing something? – André Commented Mar 4 at 6:33
Do you need exactly [0 2 4 6; 8 10 12 14], [1 3 5 7; 9 11 13 15], or would something like [0 2 8 10; 4 6 12 14], [1 3 9 11; 5 7 13 15] be ok? – chtz Commented Mar 4 at 12:53
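For what it's worth, the relaxed ordering in the last comment ([0 2 8 10; 4 6 12 14] for the even-indexed elements) is exactly what a single vshufps per output register produces. A minimal sketch (the function name is illustrative, not from the thread):
#include <immintrin.h>
// Deinterleave 8 complex floats into a real and an imaginary register with
// one shuffle per output; both results come back in the swizzled ordering
// 01452367 rather than the original order.
void deinterleave8_swizzled(const float In[16], __m256 *Re, __m256 *Im)
{
    __m256 A = _mm256_loadu_ps(In);                          // r0 i0 r1 i1 | r2 i2 r3 i3
    __m256 B = _mm256_loadu_ps(In + 8);                      // r4 i4 r5 i5 | r6 i6 r7 i7
    *Re = _mm256_shuffle_ps(A, B, _MM_SHUFFLE(2, 0, 2, 0));  // r0 r1 r4 r5 | r2 r3 r6 r7
    *Im = _mm256_shuffle_ps(A, B, _MM_SHUFFLE(3, 1, 3, 1));  // i0 i1 i4 i5 | i2 i3 i6 i7
}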