c++ - What is the fastest way to split even and odd indices with AVX2? - Stack Overflow

admin2025-04-19  0

For example, what Intel AVX/SSE intrinsics can I use to split a array of complex numbers into a two arrays of real and imaginary parts respectively?

So something like [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ...] becomes [1.0, 3.0, 5.0, 7.0...] and [2.0, 4.0, 6.0, 8.0,...]

What set of instructions will maximize the throughput for accomplishing this split?

I appreciate answers for the following scenerios:

  • 256-bit width vectors of 64-bit doubles
  • 256-bit width vectors of 32-bit floats (duplicate of Unpacking real and imaginary parts of complex numbers into separate ymm registers unless there's anything better than what that Q&A came up with.)

I am aiming for something that is fastest on the modern Intel or AMD architectures with AVX/AVX2 capabilities but doesn't significantly compromise performance on the other architectures.

I am writing a small FFT library. Deinterleaving comes up twice. For radix-2 and radix-4 implementation, Deinterleaving comes up when the stage length is less than the SIMD operand length. Deinterleaving is also useful for separating the complex number into real and imaginary components so that the input can be more efficiently processed with separate arrays of real/imaginary parts.


Note: I believe the equivalent instruction on SVE for ARM is UZPx. I am quite confused why there isn't an equivalent single instruction for AVX, given how useful this instruction is.

The best I can do for 256-bit width vector of 64-bit doubles is to use the unpack twice per 128-bit lane for a throughput of 2 cycles per vector. And do unpack 3 times per lane with 256-bit width vector of 32-bit float for a throughput of 3 cycles per vector. Can I do better?

For example, what Intel AVX/SSE intrinsics can I use to split a array of complex numbers into a two arrays of real and imaginary parts respectively?

So something like [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ...] becomes [1.0, 3.0, 5.0, 7.0...] and [2.0, 4.0, 6.0, 8.0,...]

What set of instructions will maximize the throughput for accomplishing this split?

I appreciate answers for the following scenerios:

  • 256-bit width vectors of 64-bit doubles
  • 256-bit width vectors of 32-bit floats (duplicate of Unpacking real and imaginary parts of complex numbers into separate ymm registers unless there's anything better than what that Q&A came up with.)

I am aiming for something that is fastest on the modern Intel or AMD architectures with AVX/AVX2 capabilities but doesn't significantly compromise performance on the other architectures.

I am writing a small FFT library. Deinterleaving comes up twice. For radix-2 and radix-4 implementation, Deinterleaving comes up when the stage length is less than the SIMD operand length. Deinterleaving is also useful for separating the complex number into real and imaginary components so that the input can be more efficiently processed with separate arrays of real/imaginary parts.


Note: I believe the equivalent instruction on SVE for ARM is UZPx. I am quite confused why there isn't an equivalent single instruction for AVX, given how useful this instruction is.

The best I can do for 256-bit width vector of 64-bit doubles is to use the unpack twice per 128-bit lane for a throughput of 2 cycles per vector. And do unpack 3 times per lane with 256-bit width vector of 32-bit float for a throughput of 3 cycles per vector. Can I do better?

Share Improve this question edited Mar 4 at 1:39 Golden Rockefeller asked Mar 4 at 0:02 Golden RockefellerGolden Rockefeller 3332 silver badges10 bronze badges 5
  • 1 Use single/half precisions? – aaa Commented Mar 4 at 0:38
  • Some details on the use-case were mentioned in a comment on the staging-ground post (stackoverflow/staging-ground/…) - radix 2 and 4 for FFTs. It got posted before the OP could respond to clarify if they just need re/im parts to line up with each other (e.g. if they're going to be re-packed later), or if it's necessary that they be in sequential order. – Peter Cordes Commented Mar 4 at 1:22
  • Obligatory note on any question about SIMD on complex numbers: it's more efficient to store real/imaginary parts separately in the first place so you never have to unpack. Because multiply packed floats or doubles sucks: AVX2: What is the best way to multiply and sum 4 complex values with 4 double values? – Peter Cordes Commented Mar 4 at 1:25
  • Why won't _mm256_unpacklo_pd() and _mm256_unpackhi_pd() work for you? It is pretty much exactly what your referenced UZPx is doing. Am I missing something? – André Commented Mar 4 at 6:33
  • Does the result have to be [0 2 4 6; 8 10 12 14], [1 3 5 7; 9 11 13 15] or would something like [0 2 8 10; 4 6 12 14], [1 3 9 11; 5 7 13 15] be ok? – chtz Commented Mar 4 at 12:53
Add a comment  | 

1 Answer 1

Reset to default 1

In an ideal world, you'd be storing data in a Structure-Of-Array layout, so that the swizzles were pointless.

The next question you need to ask, is whether the returned ordering is actually important? For example, if you are multiplying two arrays of complex numbers, both arrays only need to be swizzled to the same ordering, and they don't actually need the correct ordering. For this, you just need unpacklo / unpackhi to convert to SOA, and then the same again to convert back to AOS.

If the original ordering is an absolute requirement (honestly, that isn't always the case in my experience), then yeah, you're going to need unpacklo/unpackhi, and a pair of permutes to convert back and forth.

struct Complex {
    float real;
    float imag;
};

struct Complex8 {
    __m256 real;
    __m256 imag;
};

// returns original values in the following ordering: 04152637
Complex8 load8(const Complex Input[8])
{
    __m256 A = _mm256_loadu_ps(&Input[0].real);
    __m256 B = _mm256_loadu_ps(&Input[1].real);

    return (Complex8) {
         _mm256_unpacklo_ps(A, B),
         _mm256_unpackhi_ps(A, B)
    };
}

// returns original values in the following ordering: 04152637
Complex8 load8_with_proper_ordering(const Complex Input[8])
{
    Complex8 In8 = load8(Input);
    __m256i Indices = _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7);

    return (Complex8) {
        _mm256_permutevar8x32_ps(In8.real, Indices),
        _mm256_permutevar8x32_ps(In8.imag, Indices)
    };
}

转载请注明原文地址:http://conceptsofalgorithm.com/Algorithm/1745066884a283041.html

最新回复(0)