About six months ago I got acquainted with the Elbrus-8SV VLIW processor. By then I already had experience writing assembly code for the TMS320C66 VLIW processor, so I wanted to write something similar for Elbrus, namely an FFT implemented in assembly. Because documentation on the processor's instructions was scarce, I had to start by implementing a simple algorithm in C and studying its assembly output. That work resulted in my previous article.
After writing that article I decided to try implementing FFT in C for Elbrus. The work is not finished yet, but there is already some progress (a comparison with EML is included). In this article I want to share the results obtained so far.
Contents:
Writing the Reverse function
reverse_radix2
Writing the Stage function
stage_radix2
stage_radix2_2x
stage_radix2_readConjSwap
stage_radix2_readConjSwap_2x
stage_radix4
stage_radix4_2x
stage_radix4_readConjSwap
stage_radix4_readConjSwap_2x
Assembling the FFT
Problem statement
Given: a pointer to an input array of complex numbers, the number of elements in it, and a pointer to an output array.
Required: compute the FFT of the input data and write the result to the output array.
We assume that complex numbers have the type float complex (that is, the real and imaginary parts are of type float).
We assume that the number of array elements is a power of 2 (or of 4, where needed).
And, of course, let the arrays be aligned in memory on boundaries that are convenient for us.
Compiler flags
The following lcc flags were used for compilation:
-Wall -O3 -faligned -ffast-math -march=elbrus-v5
Why I compile for elbrus-v5
After I wrote the previous article I lost access to the elbrus-v5 machine (it went away for repair). At first I worked out algorithms «on paper», then I was given access to an elbrus-v6. I was planning to produce code for elbrus-v5, so I added the -march=elbrus-v5 flag to the build script. Access to v5 was later restored, but by then I was used to working on v6, because it was available around the clock.
Before writing this article I decided to remove the -march=elbrus-v5 flag. After recompiling, some of the code ran slower than with the flag. It turned out that this code compiles more efficiently for v5 than for v6 (the instructions are packed more densely). As a result, code compiled for v5 runs faster on v6 than the same code compiled for v6.
This behaviour can be observed in the following functions:
So for now I have settled on compiling for v5.
The differences in compilation can be examined at ce.mentality.rip:
- Paste one of the functions listed above into the left pane.
- Add the following code before the pasted function:
#include <e2kintrin.h>
#include <stdint.h>
typedef struct { float real; float imag; } myComplex;
- In the «Compiler options» field, specify the compiler flags:
-Wall -O3 -faligned -ffast-math -march=elbrus-v5
Then count the number of cycles in the loop (search for «loop_mode»).
With -march=elbrus-v5 there are fewer cycles than with -march=elbrus-v6.
How time was measured
Time was measured with the clock_gettime() function:
struct timespec t0, t1;
clock_gettime(CLOCK_REALTIME, &t0);
/*** code under measurement goes here ***/
clock_gettime(CLOCK_REALTIME, &t1);
int usec = (t1.tv_sec - t0.tv_sec)*1000000 + (t1.tv_nsec - t0.tv_nsec)/1000;
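The same calculation can be wrapped in a small helper. This is only a sketch, not the code used for the measurements in this article: `elapsed_usec` is a hypothetical name, and it works in 64-bit arithmetic so that long intervals cannot overflow an `int`. The comment suggests `CLOCK_MONOTONIC`, which is my substitution for the `CLOCK_REALTIME` used above (a monotonic clock is not affected by wall-clock adjustments).

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Difference t1 - t0 in microseconds, computed in 64-bit arithmetic.
   Hypothetical helper, not part of the original measurement code.
   Typical use (assuming a POSIX system):
       clock_gettime(CLOCK_MONOTONIC, &t0);
       ... code under measurement ...
       clock_gettime(CLOCK_MONOTONIC, &t1);
       int64_t usec = elapsed_usec(t0, t1);                              */
int64_t elapsed_usec(struct timespec t0, struct timespec t1)
{
    assert(t0.tv_nsec >= 0 && t0.tv_nsec < 1000000000L);
    assert(t1.tv_nsec >= 0 && t1.tv_nsec < 1000000000L);
    int64_t sec  = (int64_t)t1.tv_sec  - (int64_t)t0.tv_sec;
    int64_t nsec = (int64_t)t1.tv_nsec - (int64_t)t0.tv_nsec;
    /* Combine first, divide last: nsec may be negative when sec > 0. */
    return (sec * 1000000000LL + nsec) / 1000;
}
```

For short measured regions the result is the same as in the snippet above; the difference only matters for very long intervals.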
The processor's cycle counter was also read:
uint64_t get_clock_count()
{
uint64_t dst;
#pragma asm_inline
asm ("rrd %%clkr, %0" : "=r" (dst));
return dst;
}
...
uint64_t ticks0 = get_clock_count();
/*** code under measurement goes here ***/
uint64_t ticks1 = get_clock_count();
uint64_t ticks = ticks1 - ticks0;
About pragma options
This time I used #pragma prefetch instead of #pragma loop_count(100).
As I understand it, both variants enable APB. The difference is that prefetch does not disable the preliminary data request before the loop, whereas loop_count does. This request slightly reduces the loop execution time. With a large number of iterations the gain is negligible, while the request increases code size.
There is almost no difference, but prefetch looks simpler than loop_count(100).
I also used #pragma ivdep, which tells the compiler that different loop iterations are independent of each other in their memory accesses, so the next iteration can start before the current one has finished.
The option is useful when the loop both reads from and writes to memory.
About loop unrolling
Sometimes unrolling a loop helps to speed it up. Up to 6 instructions can execute in a single cycle, because this processor has 6 execution units. If, for example, a loop iteration consists of 9 instructions, it takes 2 cycles, and two such iterations take 4 cycles. After unrolling by a factor of 2, one iteration of the new loop consists of 2*9=18 instructions and can be expected to take 3 cycles. Thus two iterations of the original loop take 3 cycles instead of 4. Keep in mind, though, that for various reasons it is not always possible to fit 6 instructions into every cycle, for example because some instructions cannot execute on every execution unit.
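The arithmetic above is just a rounded-up division. The sketch below writes it out; `ideal_cycles` is a hypothetical helper, and it assumes the idealized case in which any instruction can go to any execution unit, so it only gives a lower bound.

```c
#include <assert.h>

/* Lower bound on the cycles needed for one loop iteration, assuming every
   instruction can be issued on any of `units` execution units.
   Real schedules can be worse, because this assumption does not always hold. */
int ideal_cycles(int instructions, int units)
{
    assert(instructions >= 0 && units > 0);
    return (instructions + units - 1) / units;   /* ceil(instructions / units) */
}
```

With 6 units, `ideal_cycles(9, 6)` gives 2 for the original loop body and `ideal_cycles(18, 6)` gives 3 for the body unrolled by 2, which is the 4-versus-3 comparison from the paragraph above.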
Ways to unroll a loop:
- Unrolling by the compiler (I add the suffix "_unroll2"/"_unroll3"/"_unroll4" to the names of such functions). To use it, write #pragma unroll(k) before the loop, where k is the unroll factor. The compiler will then unroll the loop exactly k times. It also adds code that checks whether the iteration count of the original loop is a multiple of k; if it is not, the remainder is handled by separate code. The unrolled code performs the same actions, in the same order, as the original code. With #pragma unroll(1) no unrolling is performed. This can be useful because by default the compiler tries to apply unroll(2).
- Manual unrolling by the programmer (I add the suffix "_x2"/"_x3"/"_x4" to the names of such functions). With manual unrolling the programmer writes the code so that one loop iteration performs k iterations of the algorithm. Checking that the number of algorithm iterations is a multiple of k is the programmer's responsibility. If it may not be a multiple of k, those cases have to be handled separately. To keep things simple, the code in this article does not handle them. Unlike compiler unrolling, which repeats all actions exactly in the original order, manually unrolled code can be optimized by replacing actions with others and changing their order.
An interesting variant is manual unrolling of a loop whose body consists of another loop that processes the same array. In this case k iterations of the inner loop can be merged. For example, if an iteration of the original inner loop consists of reading data from memory, processing it, and storing the result to memory, then after unrolling the outer loop by a factor of 2 the store at the end of the first iteration and the read-back before the second one can be removed. What remains is simply a read, processing, processing again, and a store. Functions marked "2x", such as stage_radix2_2x, are examples of this optimization (the mark is "2x" rather than "x2" so that files sort better by name).
What is FFT
Let a set of complex numbers x_0, …, x_{N-1} be given.
The discrete Fourier transform (DFT) maps this set to another set of complex numbers X_0, …, X_{N-1} by the following formula:
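In the usual notation, and with the sign convention that matches the coefficient c = e^(-2*pi*i*t/N) used in the pseudocode of this article, the DFT reads:

```latex
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i \, k n / N}, \qquad k = 0, \dots, N-1
```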
Computing X_0, …, X_{N-1} by this formula takes O(N^2) operations.
The fast Fourier transform (FFT) is an algorithm that computes the DFT much faster (usually in O(N log N) operations).
The best-known FFT is the Cooley-Tukey algorithm. It computes a DFT recursively through DFTs of smaller size (the «divide-and-conquer» method).
For example, if the original array is split into two subarrays (the «radix-2» variant), the original formula can be transformed into the following form:
E_t is the DFT of the N/2 even (Even) elements of the original array (those with indices of the form 2s+0).
O_t is the DFT of the N/2 odd (Odd) elements of the original array (those with indices of the form 2s+1).
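Written out under the same sign convention as the coefficient c = e^(-2*pi*i*t/N) in the pseudocode of this article, the radix-2 split takes the standard form:

```latex
\begin{aligned}
X_t         &= E_t + e^{-2\pi i \, t / N} \, O_t, \\
X_{t + N/2} &= E_t - e^{-2\pi i \, t / N} \, O_t,
\end{aligned}
\qquad t = 0, \dots, N/2 - 1,
\qquad
E_t = \sum_{s=0}^{N/2-1} x_{2s} \, e^{-2\pi i \, s t / (N/2)},
\quad
O_t = \sum_{s=0}^{N/2-1} x_{2s+1} \, e^{-2\pi i \, s t / (N/2)}
```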
The «radix-4» variant
By analogy, the «radix-4» variant transforms the formula into the following form:
EE_t is the DFT of the N/4 elements of the original array with indices of the form 4s+0.
EO_t is the DFT of the N/4 elements of the original array with indices of the form 4s+1.
OE_t is the DFT of the N/4 elements of the original array with indices of the form 4s+2.
OO_t is the DFT of the N/4 elements of the original array with indices of the form 4s+3.
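Under the same sign convention, and using the shorthand W = e^{-2\pi i / N} (my notation, not the article's), the radix-4 split can be written as follows. The index q selects one of the four output quarters.

```latex
X_{t + qN/4} = EE_t
             + (-i)^{q} \, W^{t}  \, EO_t
             + (-1)^{q} \, W^{2t} \, OE_t
             + i^{q}    \, W^{3t} \, OO_t,
\qquad W = e^{-2\pi i / N},
\quad q = 0, 1, 2, 3,
\quad t = 0, \dots, N/4 - 1
```

The factors (-i)^q, (-1)^q and i^q come from W^{rqN/4} = (-i)^{rq} for r = 1, 2, 3.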
Let us write pseudocode for a recursive function that implements the «radix-2» variant:
FFT(IN) // IN - the input data set
{
N = IN.length
if N == 1
return IN
E = FFT(IN[0:N:2]) // FFT of the even elements
O = FFT(IN[1:N:2]) // FFT of the odd elements
for t = 0:N/2
{
c = e^(-2*pi*i*t/N)
OUT[t ] = E[t] + c*O[t]
OUT[t + N/2] = E[t] - c*O[t]
}
return OUT
}
Let us write it a bit closer to real code:
FFT(*IN, N, s)
// IN - pointer to the first element
// N - number of elements
// s - stride between elements
{
if N == 1
return IN[0]
E[0:N/2] = FFT(IN, N/2, 2s) // FFT of the even elements
O[0:N/2] = FFT(IN + s, N/2, 2s) // FFT of the odd elements
for t = 0:N/2
{
c = e^(-2*pi*i*t/N)
OUT[t ] = E[t] + c*O[t]
OUT[t + N/2] = E[t] - c*O[t]
}
return OUT[0:N]
}
Let us place the elements E_t in the first half of OUT and the elements O_t in the second half of OUT:
FFT(*IN, N, s)
// IN - pointer to the first element
// N - number of elements
// s - stride between elements
{
if N == 1
return IN[0]
OUT[0:N/2] = FFT(IN, N/2, 2s) // FFT of the even elements
OUT[N/2:N] = FFT(IN + s, N/2, 2s) // FFT of the odd elements
for t = 0:N/2
{
x = OUT[t ]
y = OUT[t + N/2]
c = e^(-2*pi*i*t/N)
OUT[t ] = x + c*y
OUT[t + N/2] = x - c*y
}
return OUT[0:N]
}
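The pseudocode above maps almost line by line onto C. The sketch below is an illustration rather than the implementation developed in this article. It uses the myComplex struct from the compiler-explorer snippet earlier, and it reads the coefficients from a caller-supplied table w, where w[j] = e^(-2*pi*i*j/N) for j < N/2 and N is the full transform size. The table, the helper names c_add/c_sub/c_mul and the name fft_recursive are my assumptions.

```c
#include <assert.h>

typedef struct { float real; float imag; } myComplex;

static myComplex c_add(myComplex a, myComplex b) { return (myComplex){ a.real + b.real, a.imag + b.imag }; }
static myComplex c_sub(myComplex a, myComplex b) { return (myComplex){ a.real - b.real, a.imag - b.imag }; }
static myComplex c_mul(myComplex a, myComplex b)
{
    return (myComplex){ a.real * b.real - a.imag * b.imag,
                        a.real * b.imag + a.imag * b.real };
}

/* Recursive radix-2 FFT.
   in  - first element, read with stride s
   n   - number of elements at this recursion level (a power of two)
   w   - table for the full size N = n*s: w[j] = e^(-2*pi*i*j/N), j < N/2
   out - n contiguous results                                              */
void fft_recursive(const myComplex *in, int n, int s,
                   const myComplex *w, myComplex *out)
{
    assert(n >= 1 && (n & (n - 1)) == 0);
    if (n == 1) { out[0] = in[0]; return; }

    fft_recursive(in,     n / 2, 2 * s, w, out);          /* E: even elements */
    fft_recursive(in + s, n / 2, 2 * s, w, out + n / 2);  /* O: odd elements  */

    for (int t = 0; t < n / 2; ++t) {
        myComplex x = out[t];
        /* e^(-2*pi*i*t/n) equals w[t*s], because n*s is the full size N. */
        myComplex y = c_mul(w[t * s], out[t + n / 2]);
        out[t]         = c_add(x, y);
        out[t + n / 2] = c_sub(x, y);
    }
}
```

The top-level call passes s = 1. For N = 4 the table is just {1, -i}, and the input {1, 2, 3, 4} gives {10, -2+2i, -2, -2-2i}.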
To simplify the implementation, before the recursive calls we can move all even elements to the beginning of the array and all odd elements to the end. The subarrays then occupy consecutive memory cells instead of being interleaved.
If the number of elements is N = 2^n, this permutation is equivalent to a permutation of the form IN[k] → OUT[rotateRight(k, n)], where rotateRight(k, n) is an operation that «rotates right» the lowest n bits of k, i.e. moves the lowest bit of k from position 0 to position n-1 and shifts the bits at positions n-1, …, 1 one position towards position 0.
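A minimal sketch of this permutation. The function names are illustrative, and plain float elements are used to keep it short.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate the lowest n bits of k right by one position:
   bit 0 moves to position n-1, bits n-1..1 move one position down. */
uint32_t rotate_right(uint32_t k, int n)
{
    assert(n >= 1 && n <= 31 && k < (1u << n));
    return (k >> 1) | ((k & 1u) << (n - 1));
}

/* "Even to the beginning, odd to the end" for 2^n elements. */
void even_to_beginning_odd_to_ending(const float *in, float *out, int n)
{
    uint32_t count = 1u << n;
    for (uint32_t k = 0; k < count; ++k)
        out[rotate_right(k, n)] = in[k];
}
```

For n = 3 the array {0, 1, 2, 3, 4, 5, 6, 7} becomes {0, 2, 4, 6, 1, 3, 5, 7}: an even index 2j lands at j, and an odd index 2j+1 lands at j + N/2.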
The pseudocode becomes:
FFT(*IN, N)
// IN - pointer to the first element
// N - number of elements
{
if N == 1
return IN[0]
// permutation "even to the beginning, odd to the end"
IN = Even2Beginning_Odd2Ending(IN, N)
OUT[0:N/2] = FFT(IN, N/2) // FFT of the first half of the array
OUT[N/2:N] = FFT(IN + N/2, N/2) // FFT of the second half of the array
for t = 0:N/2
{
x = OUT[t ]
y = OUT[t + N/2]
c = e^(-2*pi*i*t/N)
OUT[t ] = x + c*y
OUT[t + N/2] = x - c*y
}
return OUT[0:N]
}
If we now mentally trace the recursion, we can see that by the time the deepest level of recursive calls is reached, all elements of the original array have been rearranged in the order usually called «bit reversal». In this article I will simply call this permutation «Reverse».
What is Reverse
It is a permutation of elements of the form IN[k] → OUT[reverseNumber(k)], where reverseNumber(k) is an operation that reverses the order of the bits of k.
The result of reverseNumber(k) depends on the number of bits in k.
For example:
- with 3 bits, reverseNumber(6) = 3 ( 110 → 011 )
- with 4 bits, reverseNumber(6) = 6 ( 0110 → 0110 )
- with 5 bits, reverseNumber(6) = 12 (00110 → 01100)

In the «radix-4» case we arrive at a similar permutation, in which base-4 digits (pairs of bits) are reversed instead of binary digits (bits).
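As an illustration, digit reversal in base 4 can be written with the same kind of loop as the bitwise version, moving two bits at a time. The name reverseNumber_radix4 is hypothetical and chosen by analogy with reverseNumber.

```c
#include <assert.h>

/* Reverse the order of the `digit_count` lowest base-4 digits (bit pairs)
   of `number`. Illustrative sketch; the name is an assumption. */
int reverseNumber_radix4(int number, int digit_count)
{
    assert(number >= 0 && digit_count >= 0 && digit_count <= 15);
    int answer = 0;
    for (int i = 0; i < digit_count; ++i) {
        answer <<= 2;            /* make room for one base-4 digit */
        answer |= number & 3;    /* take the lowest digit          */
        number >>= 2;
    }
    return answer;
}
```

With 2 digits, 6 (digits 1,2) becomes 9 (digits 2,1). With 3 digits, 27 (digits 1,2,3) becomes 57 (digits 3,2,1).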

Thus the descent into the recursion can be replaced with Reverse.
The way back out of the recursion consists of processing subarrays whose size grows as the recursion unwinds. One step of returning from the recursion, processing all subarrays at once, will be called a «Stage».
Stage(*OUT, N, stage_num)
{
m = 2^stage_num // m - subarray size
for k = 0:N/m // k - subarray index
{
for t = 0:m/2
{
x = OUT[m*k + t ]
y = OUT[m*k + t + m/2]
c = e^(-2*pi*i*t/m)
OUT[m*k + t ] = x + c*y
OUT[m*k + t + m/2] = x - c*y
}
}
}

The pseudocode of the algorithm now looks like this (without recursion):
FFT(*IN, N, *OUT)
{
OUT = Reverse(IN, N)
for stage_num = 1, ..., log2(N)
Stage(OUT, N, stage_num)
}
Thus, FFT = Reverse + log2(N)*Stage.
To solve the problem, two functions need to be written: Reverse and Stage.
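Here is a portable sketch of this decomposition, with the classical in-place Stage. It is an illustration only, not one of the optimized functions developed in this article. Coefficients are read from a caller-supplied table w, with w[j] = e^(-2*pi*i*j/N) for j < N/2. The table layout and all function names are my assumptions.

```c
#include <assert.h>

typedef struct { float real; float imag; } myComplex;

static myComplex c_add(myComplex a, myComplex b) { return (myComplex){ a.real + b.real, a.imag + b.imag }; }
static myComplex c_sub(myComplex a, myComplex b) { return (myComplex){ a.real - b.real, a.imag - b.imag }; }
static myComplex c_mul(myComplex a, myComplex b)
{
    return (myComplex){ a.real * b.real - a.imag * b.imag,
                        a.real * b.imag + a.imag * b.real };
}

/* Reverse the lowest bit_count bits of number (loop-based). */
static int reverse_bits(int number, int bit_count)
{
    int answer = 0;
    for (int i = 0; i < bit_count; ++i) {
        answer = (answer << 1) | (number & 1);
        number >>= 1;
    }
    return answer;
}

/* Reverse: out[reverse_bits(k)] = in[k]. */
void reverse(const myComplex *in, int bit_count, myComplex *out)
{
    for (int k = 0; k < (1 << bit_count); ++k)
        out[reverse_bits(k, bit_count)] = in[k];
}

/* One Stage, in place, over subarrays of size m = 2^stage_num.
   Since w[j] = e^(-2*pi*i*j/n), the coefficient e^(-2*pi*i*t/m) is w[t*(n/m)]. */
void stage(myComplex *out, int n, int stage_num, const myComplex *w)
{
    int m = 1 << stage_num;
    for (int k = 0; k < n / m; ++k)
        for (int t = 0; t < m / 2; ++t) {
            myComplex x = out[m * k + t];
            myComplex y = c_mul(w[t * (n / m)], out[m * k + t + m / 2]);
            out[m * k + t]         = c_add(x, y);
            out[m * k + t + m / 2] = c_sub(x, y);
        }
}

/* FFT = Reverse + log2(N) * Stage, for N = 2^bit_count. */
void fft_iterative(const myComplex *in, int bit_count,
                   const myComplex *w, myComplex *out)
{
    assert(bit_count >= 0 && bit_count < 31);
    reverse(in, bit_count, out);
    for (int stage_num = 1; stage_num <= bit_count; ++stage_num)
        stage(out, 1 << bit_count, stage_num, w);
}
```

A convenient sanity check is a unit impulse at index 1: its DFT is X_k = e^(-2*pi*i*k/N), so the first half of the output must reproduce the table w and the second half its negation.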
Specifics of implementing FFT on Elbrus
The Elbrus processor has the APB mechanism, which allows fast reading of data located in memory at a constant stride. The number of APB read streams is limited to 32.
In the last variant of the algorithm (without recursion), different Stages process subarrays of different sizes:
- the first Stage processes subarrays of length 2 (this can be viewed as 2 read streams with a constant stride: the even and the odd elements)
- the second Stage processes subarrays of length 4 (this can be viewed as 4 read streams with a constant stride)
- the third Stage processes subarrays of length 8 (this can be viewed as 8 read streams with a constant stride)
- and so on
With a sufficiently large number of Stages, the APB read streams run out. To use APB efficiently, let us modify the Stage algorithm so that elements are always read in pairs (as they are now in the first Stage).
This gives the following variant for Elbrus:
Stage(*IN, N, *OUT, stage_num)
{
m = 2^stage_num
for k = 0:m/2
{
c = e^(-2*pi*i*k/m)
for t = 0:N/m
{
x = IN[2*(k*N/m + t) ]
y = IN[2*(k*N/m + t) + 1]
OUT[k*N/m + t ] = x + c*y
OUT[k*N/m + t + N/2] = x - c*y
}
}
}

If you mentally trace the operations, you can see that in this variant each Stage performs the same arithmetic operations on the same pairs of numbers with the same coefficients as in the classical variant. The pairs are simply processed in a different order (on all Stages except the first). In addition, the permutation «even to the beginning, odd to the end» is added at the end of each Stage. As noted above, this permutation is equivalent to a permutation of the form IN[k] → OUT[rotateRight(k, n)], so after passing through all Stages the numbers return to their original positions (the rotation is performed log2(N) times).
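To make the data flow concrete, here is a portable sketch of this pair-reading Stage and of an FFT built from it. It is an illustration under my own assumptions, not the optimized code of this article. The position of a pair is written out explicitly as j = k*(N/m) + t, and the coefficients come from a table w with w[j] = e^(-2*pi*i*j/N) for j < N/2. Because this Stage cannot work in place, it alternates between two caller-supplied buffers. All names are hypothetical.

```c
#include <assert.h>

typedef struct { float real; float imag; } myComplex;

static myComplex c_add(myComplex a, myComplex b) { return (myComplex){ a.real + b.real, a.imag + b.imag }; }
static myComplex c_sub(myComplex a, myComplex b) { return (myComplex){ a.real - b.real, a.imag - b.imag }; }
static myComplex c_mul(myComplex a, myComplex b)
{
    return (myComplex){ a.real * b.real - a.imag * b.imag,
                        a.real * b.imag + a.imag * b.real };
}

static int reverse_bits(int number, int bit_count)
{
    int answer = 0;
    for (int i = 0; i < bit_count; ++i) {
        answer = (answer << 1) | (number & 1);
        number >>= 1;
    }
    return answer;
}

/* One Stage that always reads adjacent pairs and writes the two halves of out. */
void stage_pairs(const myComplex *in, int n, int stage_num,
                 const myComplex *w, myComplex *out)
{
    int m = 1 << stage_num;
    int block = n / m;                       /* pairs sharing one coefficient */
    for (int k = 0; k < m / 2; ++k) {
        myComplex c = w[k * block];          /* e^(-2*pi*i*k/m) */
        for (int t = 0; t < block; ++t) {
            int j = k * block + t;           /* pair index, 0 .. n/2-1 */
            myComplex x = in[2 * j];
            myComplex y = c_mul(c, in[2 * j + 1]);
            out[j]         = c_add(x, y);
            out[j + n / 2] = c_sub(x, y);
        }
    }
}

/* Reverse into buf0, then ping-pong between buf0 and buf1.
   Returns the buffer that holds the final result. */
myComplex *fft_pairs(const myComplex *in, int bit_count, const myComplex *w,
                     myComplex *buf0, myComplex *buf1)
{
    assert(bit_count >= 0 && bit_count < 31);
    int n = 1 << bit_count;
    for (int i = 0; i < n; ++i)
        buf0[reverse_bits(i, bit_count)] = in[i];
    myComplex *src = buf0, *dst = buf1;
    for (int stage_num = 1; stage_num <= bit_count; ++stage_num) {
        stage_pairs(src, n, stage_num, w, dst);
        myComplex *tmp = src; src = dst; dst = tmp;
    }
    return src;
}
```

Every Stage has the same access pattern: two adjacent input elements per step and two output streams half an array apart. The sketch keeps only this regular pattern and does not model APB itself.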
What Stage looks like for the «radix-4» version

It is this variant that will be implemented in this article.
A note on coefficients in the Stage functions
To compute a Stage we need the input data and the coefficients. The input data is in memory to begin with. The coefficients can either be computed in advance and read from memory together with the input data, or be computed on the fly.
Reading coefficients from memory works well when there are few coefficients and they all fit in the cache.
Computing on the fly is justified when there are many coefficients, because in that case the bottleneck is the memory access channel, and not reading coefficients from memory leaves the whole channel for reading input data. The total size of the coefficients (depending on the algorithm) can be several times larger than the size of the input data, so giving up reading coefficients from memory can increase the read speed of the input data several times. With few coefficients, computing on the fly slows things down, because the computation itself requires additional instructions.
The code given in this article implements only the variant that reads coefficients from memory.
Writing the Reverse function
reverse_radix2
1. reverse_radix2_etalon
A reference version, used for checking correctness.
Here reverseNumber is computed with a loop.
int reverseNumber_radix2(int number, int bit_count)
{
int answer = 0;
for(int i = 0; i < bit_count; ++i)
{
answer <<= 1;
answer |= number & 1;
number >>= 1;
}
return answer;
}
void reverse_radix2_etalon(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
for(int64_t i = 0; i < count; ++i)
{
int index = reverseNumber_radix2(i, bit_count);
data_out[index] = data_in[i];
}
}
2. reverse_radix2
The processor has a bitrevd instruction, which reverses the bit order of a 64-bit number. Let us replace reverseNumber_radix2() with __builtin_e2k_bitrevd(). Since all 64 bits are reversed, the result is then shifted right by 64 - bit_count.
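For experimenting on a machine without this instruction, a portable stand-in can be handy. The sketch below uses the classic swap-halves technique. bitrev64 and reverse_low_bits are hypothetical names, and I assume that bitrevd reverses all 64 bits, which is consistent with the right shift by 64 - bit_count in the code of this article.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for bitrevd (assumed semantics: reverse all 64 bits).
   Swaps neighbouring bits, then pairs, nibbles, bytes, 16-bit and 32-bit halves. */
uint64_t bitrev64(uint64_t v)
{
    v = ((v >> 1)  & 0x5555555555555555ULL) | ((v & 0x5555555555555555ULL) << 1);
    v = ((v >> 2)  & 0x3333333333333333ULL) | ((v & 0x3333333333333333ULL) << 2);
    v = ((v >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((v & 0x0F0F0F0F0F0F0F0FULL) << 4);
    v = ((v >> 8)  & 0x00FF00FF00FF00FFULL) | ((v & 0x00FF00FF00FF00FFULL) << 8);
    v = ((v >> 16) & 0x0000FFFF0000FFFFULL) | ((v & 0x0000FFFF0000FFFFULL) << 16);
    return (v >> 32) | (v << 32);
}

/* Same result as the loop-based reverseNumber for 1 <= bit_count <= 64. */
uint64_t reverse_low_bits(uint64_t number, int bit_count)
{
    assert(bit_count >= 1 && bit_count <= 64);
    return bitrev64(number) >> (64 - bit_count);
}
```

bit_count = 0 is excluded on purpose: the shift amount would be 64, which is undefined behaviour in C for a 64-bit operand.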
Data movement diagram

C code
void reverse_radix2(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count; ++i)
{
int64_t index = __builtin_e2k_bitrevd(i) >> shift;
data_out[index] = data_in[i];
}
}
Main loop in assembly
.L1554:
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
}
.L1385:
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
bitrevd,0,sm %b[9], %b[1]
addd,2,sm %b[9], 0x1, %b[7]
shrd,3,sm %b[5], %r0, %b[10]
shld,4,sm %b[12], 0x3, %b[11]
std,5 %r2, %b[13], %b[8]
movad,1 area=0, ind=0, am=1, be=0, %b[0]
}
Theoretical speed: 1 complex number per cycle (1/1) = 8 bytes/cycle
Speed measurements

3. reverse_radix2_x2_bad
Let us try to speed things up by manually unrolling the loop by a factor of 2.
Data movement diagram

C code
void reverse_radix2_x2_bad(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
myComplex *data_out_0 = &data_out[0];
myComplex *data_out_1 = &data_out[count/2];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count; i += 2)
{
int64_t index = __builtin_e2k_bitrevd(i) >> shift;
data_out_0[index] = data_in[i + 0];
data_out_1[index] = data_in[i + 1];
}
}
Main loop in assembly
.L1860:
{
fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
}
.L1655:
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
bitrevd,0,sm %b[20], %b[12]
addd,1,sm %b[20], 0x2, %b[18]
std,2 %r2, %b[17], %b[10]
shrd,3,sm %b[16], %r4, %b[19]
shld,4,sm %b[21], 0x3, %b[13]
std,5 %r0, %b[17], %b[11]
movad,0 area=0, ind=0, am=0, be=0, %b[0]
movad,1 area=0, ind=8, am=1, be=0, %b[1]
}
Theoretical speed: 2 complex numbers per cycle (2/1) = 16 bytes/cycle
Speed measurements

We see a speed-up at the beginning of the graph.
Here two reads come from one place in memory and the writes go to two different places in memory.
The lack of a speed-up along the whole graph is probably because the writes to memory are still performed one after another (into a single memory bank?).
4. reverse_radix2_x2_good
Let us try the opposite: read from two different places and write side by side.
Data movement diagram

C code
void reverse_radix2_x2_good(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
myComplex *data_in_0 = &data_in[0];
myComplex *data_in_1 = &data_in[count/2];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/2; ++i)
{
int64_t index = __builtin_e2k_bitrevd(i) >> shift;
data_out[index + 0] = data_in_0[i];
data_out[index + 1] = data_in_1[i];
}
}
Main loop in assembly
.L2162:
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
}
.L1993:
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
bitrevd,0,sm %b[20], %b[12]
addd,1,sm %b[20], 0x1, %b[18]
std,2 %b[17], %r2, %b[11]
shrd,3,sm %b[16], %r4, %b[19]
shld,4,sm %b[21], 0x3, %b[13]
std,5 %r0, %b[17], %b[10]
movad,1 area=0, ind=0, am=1, be=0, %b[1]
movad,3 area=0, ind=0, am=1, be=0, %b[0]
}
Theoretical speed: 2 complex numbers per cycle (2/1) = 16 bytes/cycle
Speed measurements

We see the desired speed-up along the entire graph.
Strictly speaking, this is not loop unrolling. The previous variant was honest unrolling. Here the algorithm itself has changed (the data is processed in a different order). But I could not come up with a name for it («stream2»?), so all further «unrollings» will be called x4/x8 and so on.
5. reverse_radix2_x2_best
Before moving on to higher unroll factors, let us see what happens if we make one 128-bit store instead of two 64-bit stores.
Data movement diagram

C code
void reverse_radix2_x2_best(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
uint64_t *data_in_0 = (uint64_t*)&data_in[0];
uint64_t *data_in_1 = (uint64_t*)&data_in[count/2];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/2; ++i)
{
int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
*(__v2du*)((void*)data_out + offset) = (__v2du){data_in_0[i], data_in_1[i]};
}
}
Main loop in assembly
.L2350:
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
}
.L2262:
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
bitrevd,0,sm %b[22], %b[21]
qppackdl,1,sm %b[10], %b[11], %b[13]
addd,2,sm %b[22], 0x1, %b[20]
shrd,3,sm %b[25], %r0, %b[24]
shld,4,sm %b[26], 0x3, %b[27]
stqp,5 %r2, %b[29], %b[19]
movad,1 area=0, ind=0, am=1, be=0, %b[1]
movad,3 area=0, ind=0, am=1, be=0, %b[0]
}
Theoretical speed: 2 complex numbers per cycle (2/1) = 16 bytes/cycle
Speed measurements

We see a small speed-up.
From now on we will always write to memory in 128-bit chunks.
6. reverse_radix2_x4
Let us do a similar «pseudo-unrolling», now by a factor of 4.
Data movement diagram

C code
void reverse_radix2_x4(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
uint64_t *data_in_00 = (uint64_t*)&data_in[0 * count/4];
uint64_t *data_in_01 = (uint64_t*)&data_in[1 * count/4];
uint64_t *data_in_10 = (uint64_t*)&data_in[2 * count/4];
uint64_t *data_in_11 = (uint64_t*)&data_in[3 * count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/4; ++i)
{
int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_00[i], data_in_10[i]};
*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_01[i], data_in_11[i]};
}
}
Main loop in assembly
.L2619:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0
}
.L2503:
{
loop_mode
qppackdl,1,sm %b[15], %b[18], %b[20]
shrd,3,sm %b[21], %r4, %b[1]
qppackdl,4,sm %b[9], %b[12], %b[22]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
addd,1,sm %b[4], 0x1, %b[2]
stqp,2 %r2, %b[17], %b[20]
bitrevd,3,sm %b[4], %b[19]
shld,4,sm %b[1], 0x3, %b[15]
stqp,5 %r0, %b[17], %b[22]
movad,0 area=1, ind=0, am=1, be=0, %b[6]
movad,1 area=0, ind=0, am=1, be=0, %b[12]
movad,2 area=1, ind=0, am=1, be=0, %b[3]
movad,3 area=0, ind=0, am=1, be=0, %b[9]
}
Theoretical speed: 4 complex numbers per 2 cycles (4/2) = 16 bytes/cycle
Speed measurements

We see a strong speed-up.
However, the code no longer fits in one cycle. I would like to fix that.
7. reverse_radix2_x4_oneTickVersion
Let us rewrite the code so that it fits in one cycle (by removing the shrd and shld instructions).
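The two shifts can be dropped because reversing a pre-scaled counter already gives the byte offset: bitrev(i << (shift - 3)) equals 8 * (bitrev(i) >> shift) whenever i < 2^bit_count. The sketch below checks this identity with a portable 64-bit bit reversal. bitrev64 and both offset_* functions are hypothetical names, and bitrev64 stands in for bitrevd under the assumption that the instruction reverses all 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for bitrevd (assumed to reverse all 64 bits). */
uint64_t bitrev64(uint64_t v)
{
    v = ((v >> 1)  & 0x5555555555555555ULL) | ((v & 0x5555555555555555ULL) << 1);
    v = ((v >> 2)  & 0x3333333333333333ULL) | ((v & 0x3333333333333333ULL) << 2);
    v = ((v >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((v & 0x0F0F0F0F0F0F0F0FULL) << 4);
    v = ((v >> 8)  & 0x00FF00FF00FF00FFULL) | ((v & 0x00FF00FF00FF00FFULL) << 8);
    v = ((v >> 16) & 0x0000FFFF0000FFFFULL) | ((v & 0x0000FFFF0000FFFFULL) << 16);
    return (v >> 32) | (v << 32);
}

/* Byte offset computed as in the earlier versions: reverse, shift right, scale by 8. */
uint64_t offset_two_shifts(uint64_t i, int bit_count)
{
    assert(bit_count >= 1 && bit_count <= 61);
    int shift = 64 - bit_count;
    return 8 * (bitrev64(i) >> shift);
}

/* Byte offset computed from a pre-scaled counter: one bit reversal, no shifts.
   In a loop, i * delta is simply a running sum incremented by delta. */
uint64_t offset_prescaled(uint64_t i, int bit_count)
{
    assert(bit_count >= 1 && bit_count <= 61);
    int shift = 64 - bit_count;
    uint64_t delta = (1ULL << shift) / 8;
    return bitrev64(i * delta);
}
```

Bit j of i ends up at position bit_count + 2 - j in both cases. In the first version this happens through the right shift and the multiplication by 8, in the second through the placement of the counter before the reversal.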
Data movement diagram

C code
void reverse_radix2_x4_oneTickVersion(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
int64_t delta = (1LL << shift) / 8;
uint64_t *data_in_00 = (uint64_t*)&data_in[0 * count/4];
uint64_t *data_in_01 = (uint64_t*)&data_in[1 * count/4];
uint64_t *data_in_10 = (uint64_t*)&data_in[2 * count/4];
uint64_t *data_in_11 = (uint64_t*)&data_in[3 * count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0, shifted_i = 0; i < count/4; ++i, shifted_i += delta)
{
int64_t offset = __builtin_e2k_bitrevd(shifted_i);
*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_00[i], data_in_10[i]};
*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_01[i], data_in_11[i]};
}
}
Main loop in assembly
.L2891:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0
}
.L2777:
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
bitrevd,0,sm %b[34], %b[25]
qppackdl,1,sm %b[22], %b[23], %b[31]
stqp,2 %r2, %b[29], %b[33]
qppackdl,3,sm %b[10], %b[11], %b[35]
addd,4,sm %b[32], %r4, %b[30]
stqp,5 %r0, %b[29], %b[37]
movad,0 area=1, ind=0, am=1, be=0, %b[1]
movad,1 area=0, ind=0, am=1, be=0, %b[13]
movad,2 area=1, ind=0, am=1, be=0, %b[0]
movad,3 area=0, ind=0, am=1, be=0, %b[12]
}
Theoretical speed: 4 complex numbers per cycle (4/1) = 32 bytes/cycle
Speed measurements

We see a speed-up at the beginning and a slowdown at the end of the graph.
It should have been either better than the previous variant or the same. I do not know how to explain this.
Around this point I thought that no further optimization was possible. But when I looked at the diagrams of the algorithms, a question arose: «what if we unroll by another factor of 2 in the same way?» Is it possible to read from 8 different places simultaneously without losing speed?
8. reverse_radix2_x8
We continue «pseudo-unrolling».
Data movement diagram

C code
void reverse_radix2_x8(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
uint64_t *data_in_000 = (uint64_t*)&data_in[0 * count/8];
uint64_t *data_in_001 = (uint64_t*)&data_in[1 * count/8];
uint64_t *data_in_010 = (uint64_t*)&data_in[2 * count/8];
uint64_t *data_in_011 = (uint64_t*)&data_in[3 * count/8];
uint64_t *data_in_100 = (uint64_t*)&data_in[4 * count/8];
uint64_t *data_in_101 = (uint64_t*)&data_in[5 * count/8];
uint64_t *data_in_110 = (uint64_t*)&data_in[6 * count/8];
uint64_t *data_in_111 = (uint64_t*)&data_in[7 * count/8];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/8; ++i)
{
int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_000[i], data_in_100[i]};
*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_010[i], data_in_110[i]};
*(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_001[i], data_in_101[i]};
*(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_011[i], data_in_111[i]};
}
}
Main loop in assembly
.L3314:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=3, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=3, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=3, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=3, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=3, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=3, abs=24, disp=0
}
.L3146:
{
loop_mode
bitrevd,0,sm %b[32], %b[29]
qppackdl,1,sm %b[20], %b[21], %b[33]
stqp,2 %r2, %b[42], %b[38]
shld,3,sm %b[34], 0x3, %b[40]
qppackdl,4,sm %b[6], %b[7], %b[37]
stqp,5 %r0, %b[42], %b[36]
movad,0 area=3, ind=0, am=1, be=0, %b[1]
movad,1 area=2, ind=0, am=1, be=0, %b[15]
movad,2 area=3, ind=0, am=1, be=0, %b[0]
movad,3 area=2, ind=0, am=1, be=0, %b[14]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qppackdl,0,sm %b[26], %b[27], %b[36]
addd,1,sm %b[30], 0x1, %b[28]
stqp,2 %r4, %b[42], %b[35]
qppackdl,3,sm %b[12], %b[13], %b[34]
shrd,4,sm %b[31], %r6, %b[32]
stqp,5 %r5, %b[42], %b[39]
movad,0 area=1, ind=0, am=1, be=0, %b[7]
movad,1 area=0, ind=0, am=1, be=0, %b[21]
movad,2 area=1, ind=0, am=1, be=0, %b[6]
movad,3 area=0, ind=0, am=1, be=0, %b[20]
}
Theoretical speed: 8 complex numbers per 2 cycles (8/2) = 32 bytes/cycle
Speed measurements

We see a slowdown at the beginning and a speed-up at the end of the graph.
Nevertheless, the system coped with reading from 8 different places.
9. reverse_radix2_x16
And what about a factor of 16?
Data movement diagram

C code
void reverse_radix2_x16(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
uint64_t *data_in_0000 = (uint64_t*)&data_in[ 0 * count/16];
uint64_t *data_in_0001 = (uint64_t*)&data_in[ 1 * count/16];
uint64_t *data_in_0010 = (uint64_t*)&data_in[ 2 * count/16];
uint64_t *data_in_0011 = (uint64_t*)&data_in[ 3 * count/16];
uint64_t *data_in_0100 = (uint64_t*)&data_in[ 4 * count/16];
uint64_t *data_in_0101 = (uint64_t*)&data_in[ 5 * count/16];
uint64_t *data_in_0110 = (uint64_t*)&data_in[ 6 * count/16];
uint64_t *data_in_0111 = (uint64_t*)&data_in[ 7 * count/16];
uint64_t *data_in_1000 = (uint64_t*)&data_in[ 8 * count/16];
uint64_t *data_in_1001 = (uint64_t*)&data_in[ 9 * count/16];
uint64_t *data_in_1010 = (uint64_t*)&data_in[10 * count/16];
uint64_t *data_in_1011 = (uint64_t*)&data_in[11 * count/16];
uint64_t *data_in_1100 = (uint64_t*)&data_in[12 * count/16];
uint64_t *data_in_1101 = (uint64_t*)&data_in[13 * count/16];
uint64_t *data_in_1110 = (uint64_t*)&data_in[14 * count/16];
uint64_t *data_in_1111 = (uint64_t*)&data_in[15 * count/16];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/16; ++i)
{
int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_0000[i], data_in_1000[i]};
*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_0100[i], data_in_1100[i]};
*(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_0010[i], data_in_1010[i]};
*(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_0110[i], data_in_1110[i]};
*(__v2du*)((void*)data_out + offset + 4*16) = (__v2du){data_in_0001[i], data_in_1001[i]};
*(__v2du*)((void*)data_out + offset + 5*16) = (__v2du){data_in_0101[i], data_in_1101[i]};
*(__v2du*)((void*)data_out + offset + 6*16) = (__v2du){data_in_0011[i], data_in_1011[i]};
*(__v2du*)((void*)data_out + offset + 7*16) = (__v2du){data_in_0111[i], data_in_1111[i]};
}
}
Main loop in assembly
.L4040:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=2, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=2, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=2, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=2, abs=4, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=2, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=2, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=2, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=2, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=2, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=2, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=2, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=2, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=2, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=2, abs=24, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=2, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=2, abs=28, disp=0
}
.L3765:
{
loop_mode
qppackdl,1,sm %b[21], %b[24], %b[40]
stqp,2 %r2, %b[37], %b[41]
qppackdl,4,sm %b[7], %b[10], %b[42]
stqp,5 %r0, %b[37], %b[38]
movad,0 area=7, ind=0, am=1, be=0, %b[4]
movad,1 area=6, ind=0, am=1, be=0, %b[18]
movad,2 area=7, ind=0, am=1, be=0, %b[1]
movad,3 area=6, ind=0, am=1, be=0, %b[15]
}
{
loop_mode
qppackdl,1,sm %b[25], %b[28], %b[41]
stqp,2 %r5, %b[37], %b[40]
shrd,3,sm %b[31], %r12, %b[29]
qppackdl,4,sm %b[11], %b[14], %b[38]
stqp,5 %r6, %b[37], %b[42]
movad,0 area=5, ind=0, am=1, be=0, %b[10]
movad,1 area=4, ind=0, am=1, be=0, %b[24]
movad,2 area=5, ind=0, am=1, be=0, %b[7]
movad,3 area=4, ind=0, am=1, be=0, %b[21]
}
{
loop_mode
qppackdl,1,sm %b[17], %b[20], %b[32]
stqp,2 %r7, %b[37], %b[41]
shld,3,sm %b[29], 0x3, %b[35]
qppackdl,4,sm %b[3], %b[6], %b[31]
stqp,5 %r9, %b[37], %b[38]
movad,0 area=1, ind=0, am=1, be=0, %b[14]
movad,1 area=0, ind=0, am=1, be=0, %b[28]
movad,2 area=1, ind=0, am=1, be=0, %b[11]
movad,3 area=0, ind=0, am=1, be=0, %b[25]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qppackdl,0,sm %b[27], %b[30], %b[39]
addd,1,sm %b[2], 0x1, %b[0]
stqp,2 %r10, %b[37], %b[34]
qppackdl,3,sm %b[13], %b[16], %b[36]
bitrevd,4,sm %b[2], %b[29]
stqp,5 %r11, %b[37], %b[33]
movad,0 area=3, ind=0, am=1, be=0, %b[6]
movad,1 area=2, ind=0, am=1, be=0, %b[20]
movad,2 area=3, ind=0, am=1, be=0, %b[3]
movad,3 area=2, ind=0, am=1, be=0, %b[17]
}
Theoretical throughput: 16 complex numbers in 4 cycles (16/4) = 32 bytes/cycle
Speed measurements

We see a substantial speedup.
10. reverse_radix2_x32
How about 32 times?
C code
void reverse_radix2_x32(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
uint64_t *data_in_00000 = (uint64_t*)&data_in[ 0 * count/32];
uint64_t *data_in_00001 = (uint64_t*)&data_in[ 1 * count/32];
uint64_t *data_in_00010 = (uint64_t*)&data_in[ 2 * count/32];
uint64_t *data_in_00011 = (uint64_t*)&data_in[ 3 * count/32];
uint64_t *data_in_00100 = (uint64_t*)&data_in[ 4 * count/32];
uint64_t *data_in_00101 = (uint64_t*)&data_in[ 5 * count/32];
uint64_t *data_in_00110 = (uint64_t*)&data_in[ 6 * count/32];
uint64_t *data_in_00111 = (uint64_t*)&data_in[ 7 * count/32];
uint64_t *data_in_01000 = (uint64_t*)&data_in[ 8 * count/32];
uint64_t *data_in_01001 = (uint64_t*)&data_in[ 9 * count/32];
uint64_t *data_in_01010 = (uint64_t*)&data_in[10 * count/32];
uint64_t *data_in_01011 = (uint64_t*)&data_in[11 * count/32];
uint64_t *data_in_01100 = (uint64_t*)&data_in[12 * count/32];
uint64_t *data_in_01101 = (uint64_t*)&data_in[13 * count/32];
uint64_t *data_in_01110 = (uint64_t*)&data_in[14 * count/32];
uint64_t *data_in_01111 = (uint64_t*)&data_in[15 * count/32];
uint64_t *data_in_10000 = (uint64_t*)&data_in[16 * count/32];
uint64_t *data_in_10001 = (uint64_t*)&data_in[17 * count/32];
uint64_t *data_in_10010 = (uint64_t*)&data_in[18 * count/32];
uint64_t *data_in_10011 = (uint64_t*)&data_in[19 * count/32];
uint64_t *data_in_10100 = (uint64_t*)&data_in[20 * count/32];
uint64_t *data_in_10101 = (uint64_t*)&data_in[21 * count/32];
uint64_t *data_in_10110 = (uint64_t*)&data_in[22 * count/32];
uint64_t *data_in_10111 = (uint64_t*)&data_in[23 * count/32];
uint64_t *data_in_11000 = (uint64_t*)&data_in[24 * count/32];
uint64_t *data_in_11001 = (uint64_t*)&data_in[25 * count/32];
uint64_t *data_in_11010 = (uint64_t*)&data_in[26 * count/32];
uint64_t *data_in_11011 = (uint64_t*)&data_in[27 * count/32];
uint64_t *data_in_11100 = (uint64_t*)&data_in[28 * count/32];
uint64_t *data_in_11101 = (uint64_t*)&data_in[29 * count/32];
uint64_t *data_in_11110 = (uint64_t*)&data_in[30 * count/32];
uint64_t *data_in_11111 = (uint64_t*)&data_in[31 * count/32];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/32; ++i)
{
int64_t offset = 8 * (__builtin_e2k_bitrevd(i) >> shift);
*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_00000[i], data_in_10000[i]};
*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_01000[i], data_in_11000[i]};
*(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_00100[i], data_in_10100[i]};
*(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_01100[i], data_in_11100[i]};
*(__v2du*)((void*)data_out + offset + 4*16) = (__v2du){data_in_00010[i], data_in_10010[i]};
*(__v2du*)((void*)data_out + offset + 5*16) = (__v2du){data_in_01010[i], data_in_11010[i]};
*(__v2du*)((void*)data_out + offset + 6*16) = (__v2du){data_in_00110[i], data_in_10110[i]};
*(__v2du*)((void*)data_out + offset + 7*16) = (__v2du){data_in_01110[i], data_in_11110[i]};
*(__v2du*)((void*)data_out + offset + 8*16) = (__v2du){data_in_00001[i], data_in_10001[i]};
*(__v2du*)((void*)data_out + offset + 9*16) = (__v2du){data_in_01001[i], data_in_11001[i]};
*(__v2du*)((void*)data_out + offset + 10*16) = (__v2du){data_in_00101[i], data_in_10101[i]};
*(__v2du*)((void*)data_out + offset + 11*16) = (__v2du){data_in_01101[i], data_in_11101[i]};
*(__v2du*)((void*)data_out + offset + 12*16) = (__v2du){data_in_00011[i], data_in_10011[i]};
*(__v2du*)((void*)data_out + offset + 13*16) = (__v2du){data_in_01011[i], data_in_11011[i]};
*(__v2du*)((void*)data_out + offset + 14*16) = (__v2du){data_in_00111[i], data_in_10111[i]};
*(__v2du*)((void*)data_out + offset + 15*16) = (__v2du){data_in_01111[i], data_in_11111[i]};
}
}
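The pattern behind the x16 and x32 variants can be written once for any power of two. Below is a minimal portable sketch, not the Elbrus code: `bitrev_low`, `reverse_radix2_streams` and `streams_match_reference` are hypothetical names, elements are plain `uint64_t` payloads instead of `myComplex`, and a software loop stands in for `__builtin_e2k_bitrevd`. It reads from 2^k input streams and writes 2^k adjacent outputs per iteration, which for k = 5 is the same element order as `reverse_radix2_x32`.

```c
#include <stdint.h>
#include <stdlib.h>

/* Reverses the lowest `bits` bits of x (portable stand-in for bitrevd plus shift). */
uint64_t bitrev_low(uint64_t x, int bits)
{
    uint64_t r = 0;
    for (int b = 0; b < bits; ++b) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Plain bit-reversal permutation, one element at a time. */
void reverse_radix2_reference(int bit_count, const uint64_t *in, uint64_t *out)
{
    uint64_t count = 1ull << bit_count;
    for (uint64_t i = 0; i < count; ++i)
        out[bitrev_low(i, bit_count)] = in[i];
}

/* Reads from 2^k input streams and writes 2^k adjacent outputs per iteration. */
void reverse_radix2_streams(int bit_count, int k, const uint64_t *in, uint64_t *out)
{
    uint64_t stream_len = (1ull << bit_count) >> k;
    uint64_t streams = 1ull << k;
    for (uint64_t i = 0; i < stream_len; ++i) {
        /* i < 2^(bit_count - k), so the k lowest bits of base are zero. */
        uint64_t base = bitrev_low(i, bit_count);
        for (uint64_t m = 0; m < streams; ++m)
            out[base + m] = in[bitrev_low(m, (int)k) * stream_len + i];
    }
}

/* Returns 1 if both versions produce the same output, 0 otherwise. */
int streams_match_reference(int bit_count, int k)
{
    uint64_t count = 1ull << bit_count;
    uint64_t *in = malloc(count * sizeof *in);
    uint64_t *a = malloc(count * sizeof *a);
    uint64_t *b = malloc(count * sizeof *b);
    int ok = in && a && b;
    if (ok) {
        for (uint64_t i = 0; i < count; ++i)
            in[i] = i + 1000;              /* distinct payloads */
        reverse_radix2_reference(bit_count, in, a);
        reverse_radix2_streams(bit_count, k, in, b);
        for (uint64_t i = 0; i < count; ++i)
            ok = ok && (a[i] == b[i]);
    }
    free(in);
    free(a);
    free(b);
    return ok;
}
```

Calling `streams_match_reference(10, 5)` compares the 32-stream order against the element-by-element reference for 1024 elements. The sketch only checks the permutation; it says nothing about how the loop is scheduled on real hardware.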
Main loop in assembly
.L5406:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=17, incr=0, ind=0, asz=1, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=2, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=2, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=4, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=6, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=6, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=10, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=10, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=14, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=14, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=18, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=18, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=22, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=22, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=1, abs=24, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=1, abs=26, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=1, abs=26, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=1, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=1, abs=28, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=1, abs=30, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=1, abs=30, disp=0
}
.L4875:
{
loop_mode
qppackdl,1,sm %b[46], %b[47], %b[63]
stqp,2 %r2, %b[61], %b[32]
qppackdl,4,sm %b[38], %b[39], %b[62]
stqp,5 %r4, %b[61], %b[51]
movad,0 area=15, ind=0, am=1, be=0, %b[5]
movad,1 area=14, ind=0, am=1, be=0, %b[13]
movad,2 area=15, ind=0, am=1, be=0, %b[1]
movad,3 area=14, ind=0, am=1, be=0, %b[9]
}
{
loop_mode
qppackdl,1,sm %b[20], %b[26], %b[65]
stqp,2 %r5, %b[61], %b[63]
qppackdl,4,sm %b[8], %b[14], %b[64]
stqp,5 %r6, %b[61], %b[62]
movad,0 area=13, ind=0, am=1, be=0, %b[25]
movad,1 area=12, ind=0, am=1, be=0, %b[32]
movad,2 area=13, ind=0, am=1, be=0, %b[17]
movad,3 area=12, ind=0, am=1, be=0, %b[33]
}
{
loop_mode
qppackdl,1,sm %b[59], %b[60], %b[38]
stqp,2 %r7, %b[61], %b[65]
qppackdl,4,sm %b[55], %b[56], %b[39]
stqp,5 %r9, %b[61], %b[64]
movad,0 area=11, ind=0, am=1, be=0, %b[14]
movad,1 area=10, ind=0, am=1, be=0, %b[26]
movad,2 area=11, ind=0, am=1, be=0, %b[8]
movad,3 area=10, ind=0, am=1, be=0, %b[20]
}
{
loop_mode
qppackdl,1,sm %b[57], %b[58], %b[47]
stqp,2 %r10, %b[61], %b[40]
qppackdl,4,sm %b[54], %b[53], %b[46]
stqp,5 %r11, %b[61], %b[41]
movad,0 area=9, ind=0, am=1, be=0, %b[51]
movad,1 area=8, ind=0, am=1, be=0, %b[56]
movad,2 area=9, ind=0, am=1, be=0, %b[52]
movad,3 area=8, ind=0, am=1, be=0, %b[55]
}
{
loop_mode
addd,0,sm %b[4], 0x1, %b[2]
qppackdl,1,sm %b[22], %b[28], %b[40]
stqp,2 %r12, %b[61], %b[49]
bitrevd,3,sm %b[4], %b[59]
qppackdl,4,sm %b[10], %b[16], %b[41]
stqp,5 %r13, %b[61], %b[48]
movad,0 area=7, ind=0, am=1, be=0, %b[54]
movad,1 area=6, ind=0, am=1, be=0, %b[58]
movad,2 area=7, ind=0, am=1, be=0, %b[53]
movad,3 area=6, ind=0, am=1, be=0, %b[57]
}
{
loop_mode
qppackdl,1,sm %b[35], %b[34], %b[48]
stqp,2 %r14, %b[61], %b[42]
shrd,3,sm %b[59], %r0, %b[49]
qppackdl,4,sm %b[19], %b[27], %b[28]
stqp,5 %r15, %b[61], %b[43]
movad,0 area=5, ind=0, am=1, be=0, %b[10]
movad,1 area=4, ind=0, am=1, be=0, %b[22]
movad,2 area=5, ind=0, am=1, be=0, %b[4]
movad,3 area=4, ind=0, am=1, be=0, %b[16]
}
{
loop_mode
qppackdl,1,sm %b[9], %b[13], %b[19]
stqp,2 %r16, %b[61], %b[50]
shld,3,sm %b[49], 0x3, %b[59]
qppackdl,4,sm %b[1], %b[5], %b[27]
stqp,5 %r17, %b[61], %b[30]
movad,0 area=3, ind=0, am=1, be=0, %b[35]
movad,1 area=2, ind=0, am=1, be=0, %b[43]
movad,2 area=3, ind=0, am=1, be=0, %b[34]
movad,3 area=2, ind=0, am=1, be=0, %b[42]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qppackdl,0,sm %b[11], %b[15], %b[30]
stqp,2 %r18, %b[61], %b[23]
qppackdl,3,sm %b[3], %b[7], %b[49]
stqp,5 %r19, %b[61], %b[31]
movad,0 area=1, ind=0, am=1, be=0, %b[5]
movad,1 area=0, ind=0, am=1, be=0, %b[13]
movad,2 area=1, ind=0, am=1, be=0, %b[1]
movad,3 area=0, ind=0, am=1, be=0, %b[9]
}
Theoretical throughput: 32 complex numbers in 8 cycles (32/8) = 32 bytes/cycle
Speed measurements

We see a slowdown at the start of the graph and a speedup at the end.
Trying to "pseudo-unroll" by a factor of 64 produces sharply less efficient code. The APB can read from at most 32 streams, so to read from 64 streams the compiler inserts ordinary ldd load operations. As a result, throughput drops sharply.
Can it be made faster still?
The only idea that comes to mind is to read in 128-bit chunks instead of 64-bit ones.
11. reverse_radix2_x32x2
Let us try to increase the read speed of the reverse_radix2_x32 version.
In essence, this variant is a genuine 2x unroll.
C code
void reverse_radix2_x32x2(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
__v2di *data_in_00000 = (__v2di*)&data_in[ 0 * count/32];
__v2di *data_in_00001 = (__v2di*)&data_in[ 1 * count/32];
__v2di *data_in_00010 = (__v2di*)&data_in[ 2 * count/32];
__v2di *data_in_00011 = (__v2di*)&data_in[ 3 * count/32];
__v2di *data_in_00100 = (__v2di*)&data_in[ 4 * count/32];
__v2di *data_in_00101 = (__v2di*)&data_in[ 5 * count/32];
__v2di *data_in_00110 = (__v2di*)&data_in[ 6 * count/32];
__v2di *data_in_00111 = (__v2di*)&data_in[ 7 * count/32];
__v2di *data_in_01000 = (__v2di*)&data_in[ 8 * count/32];
__v2di *data_in_01001 = (__v2di*)&data_in[ 9 * count/32];
__v2di *data_in_01010 = (__v2di*)&data_in[10 * count/32];
__v2di *data_in_01011 = (__v2di*)&data_in[11 * count/32];
__v2di *data_in_01100 = (__v2di*)&data_in[12 * count/32];
__v2di *data_in_01101 = (__v2di*)&data_in[13 * count/32];
__v2di *data_in_01110 = (__v2di*)&data_in[14 * count/32];
__v2di *data_in_01111 = (__v2di*)&data_in[15 * count/32];
__v2di *data_in_10000 = (__v2di*)&data_in[16 * count/32];
__v2di *data_in_10001 = (__v2di*)&data_in[17 * count/32];
__v2di *data_in_10010 = (__v2di*)&data_in[18 * count/32];
__v2di *data_in_10011 = (__v2di*)&data_in[19 * count/32];
__v2di *data_in_10100 = (__v2di*)&data_in[20 * count/32];
__v2di *data_in_10101 = (__v2di*)&data_in[21 * count/32];
__v2di *data_in_10110 = (__v2di*)&data_in[22 * count/32];
__v2di *data_in_10111 = (__v2di*)&data_in[23 * count/32];
__v2di *data_in_11000 = (__v2di*)&data_in[24 * count/32];
__v2di *data_in_11001 = (__v2di*)&data_in[25 * count/32];
__v2di *data_in_11010 = (__v2di*)&data_in[26 * count/32];
__v2di *data_in_11011 = (__v2di*)&data_in[27 * count/32];
__v2di *data_in_11100 = (__v2di*)&data_in[28 * count/32];
__v2di *data_in_11101 = (__v2di*)&data_in[29 * count/32];
__v2di *data_in_11110 = (__v2di*)&data_in[30 * count/32];
__v2di *data_in_11111 = (__v2di*)&data_in[31 * count/32];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/32/2; ++i)
{
int64_t offset0 = 8 * (__builtin_e2k_bitrevd(2*i + 0) >> shift);
__v2di mask0 = {0x0706050403020100, 0x0706050403020100};
*(__v2du*)((void*)data_out + offset0 + 0*16) = __builtin_e2k_qpshufb(data_in_10000[i], data_in_00000[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 1*16) = __builtin_e2k_qpshufb(data_in_11000[i], data_in_01000[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 2*16) = __builtin_e2k_qpshufb(data_in_10100[i], data_in_00100[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 3*16) = __builtin_e2k_qpshufb(data_in_11100[i], data_in_01100[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 4*16) = __builtin_e2k_qpshufb(data_in_10010[i], data_in_00010[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 5*16) = __builtin_e2k_qpshufb(data_in_11010[i], data_in_01010[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 6*16) = __builtin_e2k_qpshufb(data_in_10110[i], data_in_00110[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 7*16) = __builtin_e2k_qpshufb(data_in_11110[i], data_in_01110[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 8*16) = __builtin_e2k_qpshufb(data_in_10001[i], data_in_00001[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 9*16) = __builtin_e2k_qpshufb(data_in_11001[i], data_in_01001[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 10*16) = __builtin_e2k_qpshufb(data_in_10101[i], data_in_00101[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 11*16) = __builtin_e2k_qpshufb(data_in_11101[i], data_in_01101[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 12*16) = __builtin_e2k_qpshufb(data_in_10011[i], data_in_00011[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 13*16) = __builtin_e2k_qpshufb(data_in_11011[i], data_in_01011[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 14*16) = __builtin_e2k_qpshufb(data_in_10111[i], data_in_00111[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 15*16) = __builtin_e2k_qpshufb(data_in_11111[i], data_in_01111[i], mask0);
int64_t offset1 = 8 * (__builtin_e2k_bitrevd(2*i + 1) >> shift);
__v2di mask1 = {0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908};
*(__v2du*)((void*)data_out + offset1 + 0*16) = __builtin_e2k_qpshufb(data_in_10000[i], data_in_00000[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 1*16) = __builtin_e2k_qpshufb(data_in_11000[i], data_in_01000[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 2*16) = __builtin_e2k_qpshufb(data_in_10100[i], data_in_00100[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 3*16) = __builtin_e2k_qpshufb(data_in_11100[i], data_in_01100[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 4*16) = __builtin_e2k_qpshufb(data_in_10010[i], data_in_00010[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 5*16) = __builtin_e2k_qpshufb(data_in_11010[i], data_in_01010[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 6*16) = __builtin_e2k_qpshufb(data_in_10110[i], data_in_00110[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 7*16) = __builtin_e2k_qpshufb(data_in_11110[i], data_in_01110[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 8*16) = __builtin_e2k_qpshufb(data_in_10001[i], data_in_00001[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 9*16) = __builtin_e2k_qpshufb(data_in_11001[i], data_in_01001[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 10*16) = __builtin_e2k_qpshufb(data_in_10101[i], data_in_00101[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 11*16) = __builtin_e2k_qpshufb(data_in_11101[i], data_in_01101[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 12*16) = __builtin_e2k_qpshufb(data_in_10011[i], data_in_00011[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 13*16) = __builtin_e2k_qpshufb(data_in_11011[i], data_in_01011[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 14*16) = __builtin_e2k_qpshufb(data_in_10111[i], data_in_00111[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 15*16) = __builtin_e2k_qpshufb(data_in_11111[i], data_in_01111[i], mask1);
}
}
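The index arithmetic of this variant can be checked in portable C. This is a sketch under stated assumptions: `vec128_t`, `pack_low`, `pack_high`, `reverse_radix2_streams_x2` and `x2_matches_reference` are hypothetical names, and I assume that the shuffle with `mask0` yields the lower 64-bit halves of its two data operands and the shuffle with `mask1` yields the upper halves, which is what the `offset0`/`offset1` addressing requires. The exact semantics of `qpshufb` are not modelled here.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical portable stand-in for the 128-bit __v2di type. */
typedef struct { uint64_t lane[2]; } vec128_t;

/* Reverses the lowest `bits` bits of x. */
uint64_t bitrev_low(uint64_t x, int bits)
{
    uint64_t r = 0;
    for (int b = 0; b < bits; ++b) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Assumed effect of the mask0 shuffle: lower halves of both vectors. */
vec128_t pack_low(vec128_t a, vec128_t b)
{
    vec128_t r = {{a.lane[0], b.lane[0]}};
    return r;
}

/* Assumed effect of the mask1 shuffle: upper halves of both vectors. */
vec128_t pack_high(vec128_t a, vec128_t b)
{
    vec128_t r = {{a.lane[1], b.lane[1]}};
    return r;
}

/* 2^k streams (k >= 1), two elements read per stream per iteration. */
void reverse_radix2_streams_x2(int bit_count, int k, const uint64_t *in, uint64_t *out)
{
    uint64_t stream_len = (1ull << bit_count) >> k;
    uint64_t streams = 1ull << k;
    for (uint64_t i = 0; i < stream_len / 2; ++i) {
        uint64_t base0 = bitrev_low(2 * i, bit_count);      /* block for element 2i   */
        uint64_t base1 = bitrev_low(2 * i + 1, bit_count);  /* block for element 2i+1 */
        for (uint64_t m = 0; m < streams; m += 2) {
            vec128_t a, b, lo, hi;
            memcpy(&a, &in[bitrev_low(m, k) * stream_len + 2 * i], sizeof a);
            memcpy(&b, &in[bitrev_low(m + 1, k) * stream_len + 2 * i], sizeof b);
            lo = pack_low(a, b);
            hi = pack_high(a, b);
            memcpy(&out[base0 + m], &lo, sizeof lo);
            memcpy(&out[base1 + m], &hi, sizeof hi);
        }
    }
}

/* Returns 1 if out[bitrev(n)] == in[n] for every n, 0 otherwise. */
int x2_matches_reference(int bit_count, int k)
{
    uint64_t count = 1ull << bit_count;
    uint64_t *in = malloc(count * sizeof *in);
    uint64_t *out = malloc(count * sizeof *out);
    int ok = in && out;
    if (ok) {
        for (uint64_t n = 0; n < count; ++n)
            in[n] = n + 1000;
        reverse_radix2_streams_x2(bit_count, k, in, out);
        for (uint64_t n = 0; n < count; ++n)
            ok = ok && (out[bitrev_low(n, bit_count)] == in[n]);
    }
    free(in);
    free(out);
    return ok;
}
```

With k = 5 the inner loop corresponds to the sixteen store pairs of the listing above: each pass loads one vector from each of two streams and produces one store for `offset0` and one for `offset1`.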
Main loop in assembly
.L7238:
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=17, incr=0, ind=0, asz=1, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=16, incr=0, ind=0, asz=1, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=15, incr=0, ind=0, asz=1, abs=2, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=14, incr=0, ind=0, asz=1, abs=2, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=13, incr=0, ind=0, asz=1, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=12, incr=0, ind=0, asz=1, abs=4, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=11, incr=0, ind=0, asz=1, abs=6, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=10, incr=0, ind=0, asz=1, abs=6, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=9, incr=0, ind=0, asz=1, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=8, incr=0, ind=0, asz=1, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=7, incr=0, ind=0, asz=1, abs=10, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=6, incr=0, ind=0, asz=1, abs=10, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=5, incr=0, ind=0, asz=1, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=4, incr=0, ind=0, asz=1, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=3, incr=0, ind=0, asz=1, abs=14, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=2, incr=0, ind=0, asz=1, abs=14, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=18, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=18, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=1, abs=22, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=1, abs=22, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=1, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=1, abs=24, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=1, abs=26, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=1, abs=26, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=1, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=1, abs=28, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=1, abs=30, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=1, asz=1, abs=30, disp=0
}
.L5673:
{
loop_mode
qpshufb,1,sm %b[42], %b[41], %r0, %b[40]
stqp,2 %r2, %b[4], %b[48]
qpshufb,3,sm %b[40], %b[39], %r0, %b[39]
stqp,5 %r4, %b[4], %b[47]
}
{
loop_mode
qpshufb,1,sm %b[46], %b[45], %r1, %b[40]
stqp,2 %r2, %g16, %b[40]
qpshufb,4,sm %b[44], %b[43], %r1, %b[39]
stqp,5 %r4, %g16, %b[39]
}
{
loop_mode
qpshufb,1,sm %b[46], %b[45], %r0, %b[40]
stqp,2 %r6, %b[4], %b[40]
qpshufb,4,sm %b[44], %b[43], %r0, %b[39]
stqp,5 %r7, %b[4], %b[39]
}
{
loop_mode
qpshufb,1,sm %b[38], %b[37], %r0, %b[38]
stqp,2 %r6, %g16, %b[40]
qpshufb,4,sm %b[38], %b[37], %r1, %b[37]
stqp,5 %r7, %g16, %b[39]
}
{
loop_mode
qpshufb,1,sm %b[36], %b[35], %r0, %b[36]
stqp,2 %r9, %g16, %b[38]
qpshufb,4,sm %b[36], %b[35], %r1, %b[35]
stqp,5 %r9, %b[4], %b[37]
movaqp,0 area=15, ind=0, am=1, be=0, %b[5]
movaqp,1 area=14, ind=0, am=1, be=0, %b[9]
movaqp,2 area=15, ind=0, am=1, be=0, %b[1]
movaqp,3 area=14, ind=0, am=1, be=0, %b[6]
}
{
loop_mode
qpshufb,1,sm %b[34], %b[33], %r0, %b[34]
stqp,2 %r10, %g16, %b[36]
qpshufb,4,sm %b[34], %b[33], %r1, %b[33]
stqp,5 %r10, %b[4], %b[35]
movaqp,0 area=13, ind=0, am=1, be=0, %b[13]
movaqp,1 area=12, ind=0, am=1, be=0, %b[17]
movaqp,2 area=13, ind=0, am=1, be=0, %b[10]
movaqp,3 area=12, ind=0, am=1, be=0, %b[14]
}
{
loop_mode
qpshufb,1,sm %b[32], %b[30], %r0, %b[32]
stqp,2 %r11, %g16, %b[34]
qpshufb,4,sm %b[32], %b[30], %r1, %b[30]
stqp,5 %r11, %b[4], %b[33]
movaqp,0 area=11, ind=0, am=1, be=0, %b[21]
movaqp,1 area=10, ind=0, am=1, be=0, %b[25]
movaqp,2 area=11, ind=0, am=1, be=0, %b[18]
movaqp,3 area=10, ind=0, am=1, be=0, %b[22]
}
{
loop_mode
qpshufb,1,sm %g17, %g18, %r0, %b[32]
stqp,2 %r12, %g16, %b[32]
qpshufb,4,sm %g17, %g18, %r1, %b[30]
stqp,5 %r12, %b[4], %b[30]
movaqp,0 area=9, ind=0, am=1, be=0, %b[29]
movaqp,1 area=8, ind=0, am=1, be=0, %g17
movaqp,2 area=9, ind=0, am=1, be=0, %b[26]
movaqp,3 area=8, ind=0, am=1, be=0, %g18
}
{
loop_mode
qpshufb,1,sm %b[31], %b[28], %r0, %b[34]
stqp,2 %r13, %g16, %b[32]
qpshufb,4,sm %b[31], %b[28], %r1, %b[33]
stqp,5 %r13, %b[4], %b[30]
movaqp,0 area=7, ind=0, am=1, be=0, %b[30]
movaqp,1 area=6, ind=0, am=1, be=0, %b[32]
movaqp,2 area=7, ind=0, am=1, be=0, %b[28]
movaqp,3 area=6, ind=0, am=1, be=0, %b[31]
}
{
loop_mode
qpshufb,1,sm %b[27], %b[24], %r0, %b[38]
stqp,2 %r14, %g16, %b[34]
qpshufb,4,sm %b[27], %b[24], %r1, %b[37]
stqp,5 %r14, %b[4], %b[33]
movaqp,0 area=5, ind=0, am=1, be=0, %b[34]
movaqp,1 area=4, ind=0, am=1, be=0, %b[36]
movaqp,2 area=5, ind=0, am=1, be=0, %b[33]
movaqp,3 area=4, ind=0, am=1, be=0, %b[35]
}
{
loop_mode
qpshufb,1,sm %b[23], %b[20], %r0, %b[42]
stqp,2 %r15, %g16, %b[38]
qpshufb,4,sm %b[23], %b[20], %r1, %b[41]
stqp,5 %r15, %b[4], %b[37]
movaqp,0 area=1, ind=0, am=1, be=0, %b[38]
movaqp,1 area=0, ind=0, am=1, be=0, %b[40]
movaqp,2 area=1, ind=0, am=1, be=0, %b[37]
movaqp,3 area=0, ind=0, am=1, be=0, %b[39]
}
{
loop_mode
addd,0,sm 0x2, %b[2], %b[0]
qpshufb,1,sm %b[19], %b[16], %r0, %b[47]
stqp,2 %r16, %g16, %b[42]
addd,3,sm %b[2], 0x1, %b[45]
qpshufb,4,sm %b[19], %b[16], %r1, %b[46]
stqp,5 %r16, %b[4], %b[41]
movaqp,0 area=3, ind=0, am=1, be=0, %b[42]
movaqp,1 area=2, ind=0, am=1, be=0, %b[44]
movaqp,2 area=3, ind=0, am=1, be=0, %b[41]
movaqp,3 area=2, ind=0, am=1, be=0, %b[43]
}
{
loop_mode
bitrevd,0,sm %b[2], %b[46]
qpshufb,1,sm %b[15], %b[12], %r0, %b[49]
stqp,2 %r17, %g16, %b[47]
bitrevd,3,sm %b[45], %b[45]
qpshufb,4,sm %b[15], %b[12], %r1, %b[47]
stqp,5 %r17, %b[4], %b[46]
}
{
loop_mode
shrd,0,sm %b[46], %r5, %b[45]
qpshufb,1,sm %b[11], %b[8], %r0, %b[48]
stqp,2 %r18, %g16, %b[49]
shrd,3,sm %b[45], %r5, %b[46]
qpshufb,4,sm %b[11], %b[8], %r1, %b[47]
stqp,5 %r18, %b[4], %b[47]
}
{
loop_mode
qpshufb,1,sm %b[7], %b[3], %r0, %b[47]
stqp,2 %r19, %g16, %b[48]
shld,3,sm %b[46], 0x3, %b[2]
qpshufb,4,sm %b[7], %b[3], %r1, %b[46]
stqp,5 %r19, %b[4], %b[47]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpshufb,0,sm %b[40], %b[39], %r1, %b[46]
stqp,2 %r20, %g16, %b[47]
qpshufb,3,sm %b[38], %b[37], %r1, %b[45]
shld,4,sm %b[45], 0x3, %g16
stqp,5 %r20, %b[4], %b[46]
}
Theoretical throughput: 64 complex numbers in 16 cycles (64/16) = 32 bytes/cycle
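The theoretical figures in this article all follow the same arithmetic: elements per loop body, times the element size, divided by the number of cycles in the loop body. A small sketch, assuming 8-byte elements (a `float` real part plus a `float` imaginary part, as stated in the problem setup); `bytes_per_cycle` is a hypothetical helper name.

```c
/* Bytes moved per cycle, assuming 8-byte complex elements (two floats). */
double bytes_per_cycle(int complex_numbers, int cycles)
{
    const double bytes_per_element = 8.0;
    return complex_numbers * bytes_per_element / cycles;
}
```

For example, `bytes_per_cycle(64, 16)` gives the same value as `bytes_per_cycle(16, 4)` and `bytes_per_cycle(32, 8)`, so the x16, x32 and x32x2 variants share one theoretical ceiling and differ only in measured behaviour.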
Speed measurements

We see a slowdown in the middle of the graph.
Summary for reverse_radix2


The reverse_radix2_x32 variant can be considered the winner.
We will use it when implementing the Radix-2 FFT.
reverse_radix4
1. reverse_radix4_etalon
A reference variant used to check correctness.
Here reverseNumber is computed with a loop.
int reverseNumber_radix4(int number, int bit_count)
{
int answer = 0;
for(int i = 0; i < bit_count/2; ++i)
{
answer <<= 2;
answer |= number & 3;
number >>= 2;
}
return answer;
}
void reverse_radix4_etalon(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
for(int64_t i = 0; i < count; ++i)
{
int index = reverseNumber_radix4(i, bit_count);
data_out[index] = data_in[i];
}
}
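To see what this permutation does on a small input, here is a short sketch that reuses the same digit-reversal loop. `reverse_digits_base4` and `print_radix4_permutation` are hypothetical names introduced only for this illustration.

```c
#include <stdio.h>

/* Same digit-reversal loop as reverseNumber_radix4: moves base-4 digits
   from the low end of `number` to the high end of the result. */
int reverse_digits_base4(int number, int bit_count)
{
    int answer = 0;
    for (int i = 0; i < bit_count / 2; ++i) {
        answer = (answer << 2) | (number & 3);
        number >>= 2;
    }
    return answer;
}

/* Prints "input index -> output index" for every element. */
void print_radix4_permutation(int bit_count)
{
    for (int i = 0; i < (1 << bit_count); ++i)
        printf("%2d -> %2d\n", i, reverse_digits_base4(i, bit_count));
}
```

For 16 elements (`bit_count` = 4) an index consists of two base-4 digits, so the permutation simply swaps them: element 1 goes to position 4, and element 6 goes to position 9.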
2. reverse_radix4
The processor has no ready-made instruction that performs the reverseNumber_radix4 operation.
So we execute the bitrevd instruction and then swap adjacent bits.
Data movement in memory

C code
void reverse_radix4(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count; ++i)
{
uint64_t rev = __builtin_e2k_bitrevd(i) >> shift;
int64_t index = ((rev<<1) & 0xAAAAAAAAAAAAAAAA) | ((rev>>1) & 0x5555555555555555);
data_out[index] = data_in[i];
}
}
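The claim that a bit reversal followed by a swap of adjacent bits equals a base-4 digit reversal can be checked on any machine. A minimal sketch, assuming an even `bit_count`; `bitrev64_sw` is a hypothetical software stand-in for `__builtin_e2k_bitrevd`, and the other function names are also invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for bitrevd: reverses all 64 bits. */
uint64_t bitrev64_sw(uint64_t x)
{
    uint64_t r = 0;
    for (int b = 0; b < 64; ++b) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Index as in reverse_radix4: bit reversal, then swap of adjacent bits. */
uint64_t radix4_index_bitrev(uint64_t i, int bit_count)
{
    assert(bit_count >= 2 && bit_count % 2 == 0);
    uint64_t rev = bitrev64_sw(i) >> (64 - bit_count);
    return ((rev << 1) & 0xAAAAAAAAAAAAAAAAull) | ((rev >> 1) & 0x5555555555555555ull);
}

/* Reference: reverse the base-4 digits one at a time. */
uint64_t radix4_index_loop(uint64_t i, int bit_count)
{
    uint64_t answer = 0;
    for (int d = 0; d < bit_count / 2; ++d) {
        answer = (answer << 2) | (i & 3);
        i >>= 2;
    }
    return answer;
}

/* Returns the number of indices on which the two methods disagree. */
int count_mismatches(int bit_count)
{
    int mismatches = 0;
    for (uint64_t i = 0; i < (1ull << bit_count); ++i)
        if (radix4_index_bitrev(i, bit_count) != radix4_index_loop(i, bit_count))
            ++mismatches;
    return mismatches;
}
```

The reason it works: reversing bits also reverses the order of the two bits inside every base-4 digit, and the adjacent-bit swap puts each pair back in its original order.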
Main loop in assembly
.L1601:
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
}
.L1398:
{
loop_mode
shrd,2,sm %b[16], %r0, %b[1]
shld,4,sm %b[17], 0x3, %b[18]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
bitrevd,0,sm %b[2], %b[14]
shr_andd,1,sm %b[1], 0x1, %r5, %b[5]
addd,2,sm %b[2], 0x1, %b[0]
ord,3,sm %b[13], %b[9], %b[15]
shl_andd,4,sm %b[3], 0x1, %r4, %b[11]
std,5 %r2, %b[18], %b[12]
movad,1 area=0, ind=0, am=1, be=0, %b[4]
}
Theoretical throughput: 1 complex number in 2 cycles (1/2) = 4 bytes/cycle
Speed measurements

Note that the code can fit into a single cycle if the instructions are rearranged a little.
3. reverse_radix4_oneTickVersion
Let us rewrite the code so that it fits into one cycle (removing the shrd and shld instructions).
Data movement in memory

C code
void reverse_radix4_oneTickVersion(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count; ++i)
{
uint64_t rev = __builtin_e2k_bitrevd(i);
int64_t offset = ((rev>>(shift-3-1)) & 0x5555555555555555) | ((rev>>(shift-3+1)) & 0xAAAAAAAAAAAAAAAA);
*(myComplex*)((void*)data_out + offset) = data_in[i];
}
}
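The merged expression is easy to get wrong, so here is a portable check that it produces the same byte offset as the two-step form (shift down, swap adjacent bits, multiply by 8). This is a sketch with hypothetical helper names; `bitrev64_sw` again stands in for `__builtin_e2k_bitrevd`, and `bit_count` is assumed even and at most 60.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for bitrevd: reverses all 64 bits. */
uint64_t bitrev64_sw(uint64_t x)
{
    uint64_t r = 0;
    for (int b = 0; b < 64; ++b) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Two-step form: index with swapped adjacent bits, then scaled to bytes. */
uint64_t offset_two_step(uint64_t i, int bit_count)
{
    uint64_t rev = bitrev64_sw(i) >> (64 - bit_count);
    uint64_t index = ((rev << 1) & 0xAAAAAAAAAAAAAAAAull)
                   | ((rev >> 1) & 0x5555555555555555ull);
    return 8 * index;
}

/* Merged form: the shift and the scaling are folded into the two masked shifts. */
uint64_t offset_one_tick(uint64_t i, int bit_count)
{
    assert(bit_count >= 2 && bit_count <= 60 && bit_count % 2 == 0);
    int shift = 64 - bit_count;
    uint64_t rev = bitrev64_sw(i);
    return ((rev >> (shift - 3 - 1)) & 0x5555555555555555ull)
         | ((rev >> (shift - 3 + 1)) & 0xAAAAAAAAAAAAAAAAull);
}

/* Returns 1 if both forms agree for every index, 0 otherwise. */
int offsets_agree(int bit_count)
{
    for (uint64_t i = 0; i < (1ull << bit_count); ++i)
        if (offset_two_step(i, bit_count) != offset_one_tick(i, bit_count))
            return 0;
    return 1;
}
```

Note that the two masks trade places compared with the two-step form: multiplying by 8 shifts everything by three positions, an odd amount, so bits that used to land on odd positions now land on even ones.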
Main loop in assembly
.L1873:
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=5, abs=0, disp=0
}
.L1686:
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
bitrevd,0,sm %b[21], %b[14]
shr_andd,1,sm %b[16], %r0, %r6, %b[1]
addd,2,sm %b[21], 0x1, %b[19]
ord,3,sm %b[17], %b[9], %b[20]
shr_andd,4,sm %b[18], %r4, %r5, %b[11]
std,5 %r2, %b[22], %b[12]
movad,1 area=0, ind=0, am=1, be=0, %b[0]
}
Theoretical throughput: 1 complex number in 1 cycle (1/1) = 8 bytes/cycle
Speed measurements

We see a speedup at the start of the graph.
4. reverse_radix4_x4_bad
Let us try to speed things up by manually unrolling the loop 4 times.
Data movement in memory

C code
void reverse_radix4_x4_bad(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
myComplex *data_out_0 = &data_out[0 * count/4];
myComplex *data_out_1 = &data_out[1 * count/4];
myComplex *data_out_2 = &data_out[2 * count/4];
myComplex *data_out_3 = &data_out[3 * count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count; i += 4)
{
uint64_t rev = __builtin_e2k_bitrevd(i);
int64_t index = ((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555);
data_out_0[index] = data_in[i + 0];
data_out_1[index] = data_in[i + 1];
data_out_2[index] = data_in[i + 2];
data_out_3[index] = data_in[i + 3];
}
}
Main loop in assembly
.L2338:
{
fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=16
}
.L2049:
{
loop_mode
bitrevd,0,sm %b[29], %b[23]
std,2 %r0, %b[26], %b[18]
addd,3,sm %b[29], 0x4, %b[27]
shld,4,sm %b[30], 0x3, %b[24]
std,5 %r5, %b[26], %b[19]
movad,0 area=0, ind=0, am=0, be=0, %b[1]
movad,1 area=0, ind=8, am=1, be=0, %b[11]
movad,2 area=0, ind=0, am=0, be=0, %b[0]
movad,3 area=0, ind=8, am=1, be=0, %b[10]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
shr_andd,1,sm %b[23], %r7, %r9, %b[18]
std,2 %r4, %b[26], %b[8]
ord,3,sm %b[21], %b[22], %b[28]
shr_andd,4,sm %b[25], %r6, %r8, %b[19]
std,5 %r2, %b[26], %b[9]
}
Theoretical throughput: 4 complex numbers in 2 cycles (4/2) = 16 bytes/cycle
Speed measurements

We see a speedup at the start of the graph.
As we remember from reverse_radix2, writing to scattered memory locations performs worse than writing to adjacent ones.
5. reverse_radix4_x4_good
Let us try the opposite: read from different places and write adjacently.
Data movement in memory

C code
void reverse_radix4_x4_good(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
myComplex *data_in_0 = &data_in[0 * count/4];
myComplex *data_in_1 = &data_in[1 * count/4];
myComplex *data_in_2 = &data_in[2 * count/4];
myComplex *data_in_3 = &data_in[3 * count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/4; ++i)
{
uint64_t rev = __builtin_e2k_bitrevd(i);
int64_t index = ((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555);
data_out[index + 0] = data_in_0[i];
data_out[index + 1] = data_in_1[i];
data_out[index + 2] = data_in_2[i];
data_out[index + 3] = data_in_3[i];
}
}
Main loop in assembly
.L2807:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0
}
.L2518:
{
loop_mode
bitrevd,0,sm %b[29], %b[23]
std,2 %r5, %b[26], %b[19]
addd,3,sm %b[29], 0x1, %b[27]
shld,4,sm %b[30], 0x3, %b[24]
std,5 %r0, %b[26], %b[18]
movad,0 area=1, ind=0, am=1, be=0, %b[1]
movad,1 area=0, ind=0, am=1, be=0, %b[0]
movad,2 area=1, ind=0, am=1, be=0, %b[11]
movad,3 area=0, ind=0, am=1, be=0, %b[10]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
shr_andd,1,sm %b[23], %r7, %r9, %b[18]
std,2 %r4, %b[26], %b[9]
ord,3,sm %b[21], %b[22], %b[28]
shr_andd,4,sm %b[25], %r6, %r8, %b[19]
std,5 %b[26], %r2, %b[8]
}
Theoretical throughput: 4 complex numbers in 2 cycles (4/2) = 16 bytes/cycle
Speed measurements

We see the desired speedup along the whole length of the graph.
Strictly speaking, this is not loop unrolling. The previous variant was the genuine unroll. Here the algorithm itself has changed (the data are processed in a different order). But I could not think of a name for it ("stream4"?), so all subsequent "unrolls" will be called x4, x16, and so on.
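That the reordered loop still yields the same permutation can be confirmed with a portable sketch. The names below (`bitrev64_sw`, `reverse_digits_base4`, `reverse_radix4_x4_sw`, `x4_matches_reference`) are hypothetical, elements are `uint64_t` payloads rather than `myComplex`, and `bit_count` is assumed even and at least 2.

```c
#include <stdint.h>
#include <stdlib.h>

/* Portable stand-in for bitrevd: reverses all 64 bits. */
uint64_t bitrev64_sw(uint64_t x)
{
    uint64_t r = 0;
    for (int b = 0; b < 64; ++b) {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/* Reference index: base-4 digits of n in reverse order. */
uint64_t reverse_digits_base4(uint64_t n, int bit_count)
{
    uint64_t answer = 0;
    for (int d = 0; d < bit_count / 2; ++d) {
        answer = (answer << 2) | (n & 3);
        n >>= 2;
    }
    return answer;
}

/* Four input streams, four adjacent outputs per iteration. */
void reverse_radix4_x4_sw(int bit_count, const uint64_t *in, uint64_t *out)
{
    uint64_t count = 1ull << bit_count;
    int shift = 64 - bit_count;
    for (uint64_t i = 0; i < count / 4; ++i) {
        uint64_t rev = bitrev64_sw(i);
        /* The two lowest bits of index are zero because i < count / 4. */
        uint64_t index = ((rev >> (shift - 1)) & 0xAAAAAAAAAAAAAAAAull)
                       | ((rev >> (shift + 1)) & 0x5555555555555555ull);
        for (uint64_t k = 0; k < 4; ++k)
            out[index + k] = in[k * (count / 4) + i];
    }
}

/* Returns 1 if out[digit_reverse(n)] == in[n] for every n, 0 otherwise. */
int x4_matches_reference(int bit_count)
{
    uint64_t count = 1ull << bit_count;
    uint64_t *in = malloc(count * sizeof *in);
    uint64_t *out = malloc(count * sizeof *out);
    int ok = in && out;
    if (ok) {
        for (uint64_t n = 0; n < count; ++n)
            in[n] = n + 1000;
        reverse_radix4_x4_sw(bit_count, in, out);
        for (uint64_t n = 0; n < count; ++n)
            ok = ok && (out[reverse_digits_base4(n, bit_count)] == in[n]);
    }
    free(in);
    free(out);
    return ok;
}
```

The stream number k is the top base-4 digit of the input index, and after digit reversal it becomes the lowest digit of the output index. That is why the four values read in one iteration end up next to each other.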
6. reverse_radix4_x4_best
Instead of four 64-bit stores we will do two 128-bit stores.
Data movement in memory

C code
void reverse_radix4_x4_best(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
uint64_t *data_in_0 = (uint64_t*)&data_in[0 * count/4];
uint64_t *data_in_1 = (uint64_t*)&data_in[1 * count/4];
uint64_t *data_in_2 = (uint64_t*)&data_in[2 * count/4];
uint64_t *data_in_3 = (uint64_t*)&data_in[3 * count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/4; ++i)
{
uint64_t rev = __builtin_e2k_bitrevd(i);
int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_0[i], data_in_1[i]};
*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_2[i], data_in_3[i]};
}
}
Main loop in assembly
.L3099:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=4, abs=16, disp=0
}
.L2975:
{
loop_mode
qppackdl,0,sm %b[10], %b[16], %b[9]
shr_andd,1,sm %b[23], %r5, %r7, %b[0]
qppackdl,3,sm %b[21], %b[22], %b[5]
shr_andd,4,sm %b[25], %r4, %r6, %b[13]
ord,5,sm %b[15], %b[4], %b[26]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
bitrevd,0,sm %b[3], %b[21]
stqp,2 %r0, %b[24], %b[11]
addd,3,sm %b[3], 0x1, %b[1]
shld,4,sm %b[26], 0x3, %b[22]
stqp,5 %r2, %b[24], %b[7]
movad,0 area=1, ind=0, am=1, be=0, %b[10]
movad,1 area=0, ind=0, am=1, be=0, %b[16]
movad,2 area=1, ind=0, am=1, be=0, %b[4]
movad,3 area=0, ind=0, am=1, be=0, %b[15]
}
Theoretical speed: 4 complex numbers per 2 cycles (4/2) = 16 bytes/cycle
Speed measurements

We see a strong speedup.
From here on, we will always write to memory in 128-bit chunks.
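The offset expression used in these loops can be read as a base-4 digit reversal built on top of a bit reversal: reversing all bits also reverses the two bits inside each base-4 digit, and the two shifted-and-masked terms put them back in order. The sketch below is a portable check of that reading, not the author's code. `bitrev64` is a software stand-in assumed to behave like `__builtin_e2k_bitrevd`, and the names `radix4_index_fast`, `radix4_index_ref` and `count_mismatches` are made up for illustration. It works with element indices, so the final multiplication by 8 is left out, and it assumes an even `bit_count`.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in (assumption) for __builtin_e2k_bitrevd: reverses all 64 bits. */
static uint64_t bitrev64(uint64_t x)
{
    uint64_t r = 0;
    for (int b = 0; b < 64; ++b) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

/* Same expression as in the loops above, but returning an element index
   instead of a byte offset (the original multiplies the result by 8). */
static uint64_t radix4_index_fast(uint64_t i, int bit_count)
{
    assert(bit_count >= 2 && bit_count <= 62 && bit_count % 2 == 0);
    int shift = 64 - bit_count;
    uint64_t rev = bitrev64(i);
    return ((rev >> (shift - 1)) & 0xAAAAAAAAAAAAAAAAull)   /* odd bits of each digit  */
         | ((rev >> (shift + 1)) & 0x5555555555555555ull);  /* even bits of each digit */
}

/* Reference: reverse the bit_count/2 base-4 digits of n one digit at a time. */
static uint64_t radix4_index_ref(uint64_t n, int bit_count)
{
    uint64_t r = 0;
    for (int d = 0; d < bit_count / 2; ++d) {
        r = (r << 2) | (n & 3u);
        n >>= 2;
    }
    return r;
}

/* Counts the indices in [0, 2^bit_count) where the two functions disagree. */
static uint64_t count_mismatches(int bit_count)
{
    uint64_t bad = 0;
    for (uint64_t i = 0; i < (1ull << bit_count); ++i)
        if (radix4_index_fast(i, bit_count) != radix4_index_ref(i, bit_count))
            ++bad;
    return bad;
}
```

In the x4 loop `i` only runs up to `count/4`, so its top base-4 digit is zero and the reversed index is always a multiple of 4. That is why four consecutive elements can be written from that position.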
7. reverse_radix4_x16
Let's continue the "pseudo-unrolling".
Diagram of data movement in memory

C code
void reverse_radix4_x16(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
uint64_t *data_in_00 = (uint64_t*)&data_in[ 0 * count/16];
uint64_t *data_in_01 = (uint64_t*)&data_in[ 1 * count/16];
uint64_t *data_in_02 = (uint64_t*)&data_in[ 2 * count/16];
uint64_t *data_in_03 = (uint64_t*)&data_in[ 3 * count/16];
uint64_t *data_in_10 = (uint64_t*)&data_in[ 4 * count/16];
uint64_t *data_in_11 = (uint64_t*)&data_in[ 5 * count/16];
uint64_t *data_in_12 = (uint64_t*)&data_in[ 6 * count/16];
uint64_t *data_in_13 = (uint64_t*)&data_in[ 7 * count/16];
uint64_t *data_in_20 = (uint64_t*)&data_in[ 8 * count/16];
uint64_t *data_in_21 = (uint64_t*)&data_in[ 9 * count/16];
uint64_t *data_in_22 = (uint64_t*)&data_in[10 * count/16];
uint64_t *data_in_23 = (uint64_t*)&data_in[11 * count/16];
uint64_t *data_in_30 = (uint64_t*)&data_in[12 * count/16];
uint64_t *data_in_31 = (uint64_t*)&data_in[13 * count/16];
uint64_t *data_in_32 = (uint64_t*)&data_in[14 * count/16];
uint64_t *data_in_33 = (uint64_t*)&data_in[15 * count/16];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/16; ++i)
{
uint64_t rev = __builtin_e2k_bitrevd(i);
int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_00[i], data_in_10[i]};
*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_20[i], data_in_30[i]};
*(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_01[i], data_in_11[i]};
*(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_21[i], data_in_31[i]};
*(__v2du*)((void*)data_out + offset + 4*16) = (__v2du){data_in_02[i], data_in_12[i]};
*(__v2du*)((void*)data_out + offset + 5*16) = (__v2du){data_in_22[i], data_in_32[i]};
*(__v2du*)((void*)data_out + offset + 6*16) = (__v2du){data_in_03[i], data_in_13[i]};
*(__v2du*)((void*)data_out + offset + 7*16) = (__v2du){data_in_23[i], data_in_33[i]};
}
}
Main loop in assembly
.L3848:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=2, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=2, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=2, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=2, abs=4, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=2, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=2, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=2, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=2, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=2, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=2, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=2, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=2, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=2, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=2, abs=24, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=2, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=2, abs=28, disp=0
}
.L3565:
{
loop_mode
qppackdl,0,sm %b[51], %b[58], %b[1]
stqp,2 %r0, %b[57], %b[42]
qppackdl,3,sm %b[35], %b[50], %b[6]
shr_andd,4,sm %b[60], %r12, %r15, %b[29]
stqp,5 %r2, %b[57], %b[43]
movad,0 area=7, ind=0, am=1, be=0, %b[18]
movad,1 area=6, ind=0, am=1, be=0, %b[26]
movad,2 area=7, ind=0, am=1, be=0, %b[13]
movad,3 area=6, ind=0, am=1, be=0, %b[21]
}
{
loop_mode
ord,0,sm %b[61], %b[31], %b[59]
qppackdl,1,sm %b[54], %b[55], %b[35]
stqp,2 %r5, %b[57], %b[5]
qppackdl,4,sm %b[46], %b[47], %b[34]
stqp,5 %r6, %b[57], %b[10]
movad,0 area=5, ind=0, am=1, be=0, %b[43]
movad,1 area=4, ind=0, am=1, be=0, %b[51]
movad,2 area=5, ind=0, am=1, be=0, %b[42]
movad,3 area=4, ind=0, am=1, be=0, %b[50]
}
{
loop_mode
shld,0,sm %b[59], 0x3, %b[55]
qppackdl,1,sm %b[23], %b[28], %b[5]
stqp,2 %r7, %b[57], %b[39]
addd,3,sm %b[4], 0x1, %b[2]
qppackdl,4,sm %b[15], %b[20], %b[10]
stqp,5 %r9, %b[57], %b[38]
movad,0 area=3, ind=0, am=1, be=0, %b[46]
movad,1 area=2, ind=0, am=1, be=0, %b[54]
movad,2 area=3, ind=0, am=1, be=0, %b[31]
movad,3 area=2, ind=0, am=1, be=0, %b[47]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qppackdl,0,sm %b[19], %b[24], %b[38]
shr_andd,1,sm %b[60], %r13, %r14, %b[59]
stqp,2 %r10, %b[57], %b[11]
qppackdl,3,sm %b[27], %b[32], %b[39]
bitrevd,4,sm %b[4], %b[58]
stqp,5 %r11, %b[57], %b[16]
movad,0 area=1, ind=0, am=1, be=0, %b[20]
movad,1 area=0, ind=0, am=1, be=0, %b[28]
movad,2 area=1, ind=0, am=1, be=0, %b[15]
movad,3 area=0, ind=0, am=1, be=0, %b[23]
}
Theoretical speed: 16 complex numbers per 4 cycles (16/4) = 32 bytes/cycle
Speed measurements

We see a strong speedup.
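The order in which reverse_radix4_x16 pairs its 16 read streams (00 with 10, 20 with 30, then 01 with 11, and so on) can be cross-checked against a plain element-by-element digit reversal. The following is a portable sketch under simplifying assumptions: elements are plain `uint64_t` values, there are no vector types or e2k builtins, and the names `digit_reverse4`, `reverse_radix4_x16_portable` and `x16_matches_reference` are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Reverses the `digits` base-4 digits of n (plain reference implementation). */
static uint32_t digit_reverse4(uint32_t n, int digits)
{
    uint32_t r = 0;
    for (int d = 0; d < digits; ++d) {
        r = (r << 2) | (n & 3u);
        n >>= 2;
    }
    return r;
}

/* Stream-based variant: 16 input streams, 16 consecutive outputs per step.
   `in` and `out` hold count = 4^digits elements, with digits >= 2. */
static void reverse_radix4_x16_portable(int digits, const uint64_t *in, uint64_t *out)
{
    uint32_t count = 1u << (2 * digits);
    uint32_t step = count / 16;               /* length of one input stream */
    for (uint32_t i = 0; i < step; ++i) {
        /* Base position of the 16-element output group for this i. */
        uint32_t base = 16u * digit_reverse4(i, digits - 2);
        for (uint32_t a = 0; a < 4; ++a)      /* top digit of the input index    */
            for (uint32_t b = 0; b < 4; ++b)  /* second digit of the input index */
                out[base + 4 * b + a] = in[(4 * a + b) * step + i];
    }
}

/* Returns 1 if the stream variant matches element-by-element digit reversal. */
static int x16_matches_reference(int digits)
{
    enum { MAX = 4096 };
    static uint64_t in[MAX], out[MAX];
    if (digits < 2 || digits > 6)
        return 0;
    uint32_t count = 1u << (2 * digits);
    for (uint32_t n = 0; n < count; ++n)
        in[n] = 1000u + n;                    /* distinct marker values */
    reverse_radix4_x16_portable(digits, in, out);
    for (uint32_t n = 0; n < count; ++n)
        if (out[digit_reverse4(n, digits)] != in[n])
            return 0;
    return 1;
}
```

Stream `data_in_ab` starts at element `(4*a + b) * count/16` and lands at position `4*b + a` inside each output group. This matches the pairing in the listing above: `data_in_10` ends up right after `data_in_00`, and `data_in_01` is the fifth element.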
An attempt to "pseudo-unroll" by a factor of 64 produces sharply less efficient code. The APB can read from at most 32 streams, so to read from 64 streams the compiler inserts ordinary ldd load operations. As a result, the speed drops sharply.
Let's try reading in 128-bit chunks instead of 64-bit ones.
8. reverse_radix4_x16x2
Let's try to increase the read speed of the reverse_radix4_x16 version.
In essence, this variant is a true unrolling by a factor of 2.
C code
void reverse_radix4_x16x2(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
__v2di *data_in_00 = (__v2di*)&data_in[ 0 * count/16];
__v2di *data_in_01 = (__v2di*)&data_in[ 1 * count/16];
__v2di *data_in_02 = (__v2di*)&data_in[ 2 * count/16];
__v2di *data_in_03 = (__v2di*)&data_in[ 3 * count/16];
__v2di *data_in_10 = (__v2di*)&data_in[ 4 * count/16];
__v2di *data_in_11 = (__v2di*)&data_in[ 5 * count/16];
__v2di *data_in_12 = (__v2di*)&data_in[ 6 * count/16];
__v2di *data_in_13 = (__v2di*)&data_in[ 7 * count/16];
__v2di *data_in_20 = (__v2di*)&data_in[ 8 * count/16];
__v2di *data_in_21 = (__v2di*)&data_in[ 9 * count/16];
__v2di *data_in_22 = (__v2di*)&data_in[10 * count/16];
__v2di *data_in_23 = (__v2di*)&data_in[11 * count/16];
__v2di *data_in_30 = (__v2di*)&data_in[12 * count/16];
__v2di *data_in_31 = (__v2di*)&data_in[13 * count/16];
__v2di *data_in_32 = (__v2di*)&data_in[14 * count/16];
__v2di *data_in_33 = (__v2di*)&data_in[15 * count/16];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/16/2; ++i)
{
uint64_t rev0 = __builtin_e2k_bitrevd(2*i+0);
int64_t offset0 = 8 * (((rev0>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev0>>(shift+1)) & 0x5555555555555555));
__v2di mask0 = {0x0706050403020100, 0x0706050403020100};
*(__v2du*)((void*)data_out + offset0 + 0*16) = __builtin_e2k_qpshufb(data_in_10[i], data_in_00[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 1*16) = __builtin_e2k_qpshufb(data_in_30[i], data_in_20[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 2*16) = __builtin_e2k_qpshufb(data_in_11[i], data_in_01[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 3*16) = __builtin_e2k_qpshufb(data_in_31[i], data_in_21[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 4*16) = __builtin_e2k_qpshufb(data_in_12[i], data_in_02[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 5*16) = __builtin_e2k_qpshufb(data_in_32[i], data_in_22[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 6*16) = __builtin_e2k_qpshufb(data_in_13[i], data_in_03[i], mask0);
*(__v2du*)((void*)data_out + offset0 + 7*16) = __builtin_e2k_qpshufb(data_in_33[i], data_in_23[i], mask0);
uint64_t rev1 = __builtin_e2k_bitrevd(2*i+1);
int64_t offset1 = 8 * (((rev1>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev1>>(shift+1)) & 0x5555555555555555));
__v2di mask1 = {0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908};
*(__v2du*)((void*)data_out + offset1 + 0*16) = __builtin_e2k_qpshufb(data_in_10[i], data_in_00[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 1*16) = __builtin_e2k_qpshufb(data_in_30[i], data_in_20[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 2*16) = __builtin_e2k_qpshufb(data_in_11[i], data_in_01[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 3*16) = __builtin_e2k_qpshufb(data_in_31[i], data_in_21[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 4*16) = __builtin_e2k_qpshufb(data_in_12[i], data_in_02[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 5*16) = __builtin_e2k_qpshufb(data_in_32[i], data_in_22[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 6*16) = __builtin_e2k_qpshufb(data_in_13[i], data_in_03[i], mask1);
*(__v2du*)((void*)data_out + offset1 + 7*16) = __builtin_e2k_qpshufb(data_in_33[i], data_in_23[i], mask1);
}
}
Main loop in assembly
.L4839:
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=2, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=2, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=2, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=2, abs=4, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=2, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=2, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=2, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=2, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=2, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=2, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=2, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=2, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=2, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=2, abs=24, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=2, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=1, asz=2, abs=28, disp=0
}
.L3987:
{
loop_mode
qpshufb,0,sm %b[26], %b[21], %r12, %b[51]
shr_andd,1,sm %b[23], %r14, %r17, %b[1]
stqp,2 %r2, %b[4], %b[13]
qpshufb,3,sm %b[45], %b[42], %r12, %b[50]
shr_andd,4,sm %b[23], %r15, %r16, %b[5]
stqp,5 %r0, %b[4], %b[12]
}
{
loop_mode
qpshufb,0,sm %b[35], %b[32], %r12, %b[54]
shr_andd,1,sm %b[3], %r14, %r17, %b[23]
stqp,2 %r5, %b[4], %b[51]
qpshufb,3,sm %b[29], %b[11], %r12, %b[53]
ord,4,sm %b[7], %b[25], %b[52]
stqp,5 %r6, %b[4], %b[50]
movaqp,0 area=7, ind=0, am=1, be=0, %b[12]
movaqp,1 area=6, ind=0, am=1, be=0, %b[18]
movaqp,2 area=7, ind=0, am=1, be=0, %b[6]
movaqp,3 area=6, ind=0, am=1, be=0, %b[13]
}
{
loop_mode
qpshufb,1,sm %b[22], %b[17], %r12, %b[55]
stqp,2 %r7, %b[4], %b[54]
qpshufb,3,sm %b[22], %b[17], %r13, %b[51]
shld,4,sm %b[52], 0x3, %b[50]
stqp,5 %r9, %b[4], %b[53]
movaqp,0 area=5, ind=0, am=1, be=0, %b[25]
movaqp,1 area=4, ind=0, am=1, be=0, %b[31]
movaqp,2 area=5, ind=0, am=1, be=0, %b[7]
movaqp,3 area=4, ind=0, am=1, be=0, %b[28]
}
{
loop_mode
qpshufb,1,sm %b[45], %b[42], %r13, %b[53]
stqp,2 %r10, %b[4], %b[55]
qpshufb,4,sm %b[35], %b[32], %r13, %b[52]
stqp,5 %r10, %b[50], %b[51]
movaqp,0 area=3, ind=0, am=1, be=0, %b[41]
movaqp,1 area=2, ind=0, am=1, be=0, %b[22]
movaqp,2 area=3, ind=0, am=1, be=0, %b[38]
movaqp,3 area=2, ind=0, am=1, be=0, %b[17]
}
{
loop_mode
addd,0,sm %b[2], 0x1, %b[48]
qpshufb,1,sm %b[49], %b[46], %r13, %b[54]
stqp,2 %r6, %b[50], %b[53]
addd,3,sm 0x2, %b[2], %b[0]
qpshufb,4,sm %b[26], %b[21], %r13, %b[51]
stqp,5 %r7, %b[50], %b[52]
movaqp,0 area=1, ind=0, am=1, be=0, %b[45]
movaqp,1 area=0, ind=0, am=1, be=0, %b[35]
movaqp,2 area=1, ind=0, am=1, be=0, %b[42]
movaqp,3 area=0, ind=0, am=1, be=0, %b[32]
}
{
loop_mode
bitrevd,0,sm %b[2], %b[21]
qpshufb,1,sm %b[39], %b[36], %r13, %b[52]
stqp,2 %r0, %b[50], %b[54]
ord,3,sm %b[5], %b[1], %b[26]
qpshufb,4,sm %b[16], %b[10], %r13, %b[49]
stqp,5 %r5, %b[50], %b[51]
}
{
loop_mode
bitrevd,0,sm %b[48], %b[1]
qpshufb,1,sm %b[16], %b[10], %r12, %b[53]
stqp,2 %r2, %b[50], %b[52]
shld,3,sm %b[26], 0x3, %b[2]
qpshufb,4,sm %b[29], %b[11], %r13, %b[51]
stqp,5 %r11, %b[50], %b[49]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpshufb,0,sm %b[37], %b[34], %r12, %b[11]
stqp,2 %r11, %b[4], %b[53]
qpshufb,3,sm %b[47], %b[44], %r12, %b[10]
shr_andd,4,sm %b[3], %r15, %r16, %b[5]
stqp,5 %r9, %b[50], %b[51]
}
Theoretical speed: 32 complex numbers per 8 cycles (32/8) = 32 bytes/cycle
Speed measurements

We see a slowdown in the middle of the graph.
It is also possible to unroll by a factor of 32. To do that, we write the 64x version and process one half of the rows in a first loop, then the other half of the rows in a second loop. Each loop will use 32 APB read streams.
9. reverse_radix4_x32
Let's do a "pseudo-unrolling" by a factor of 32 using two loops.
C code
void reverse_radix4_x32(int bit_count, myComplex *data_in, myComplex *data_out)
{
int count = 1 << bit_count;
int shift = 64 - bit_count;
uint64_t *data_in_000 = (uint64_t*)&data_in[ 0 * count/64];
uint64_t *data_in_001 = (uint64_t*)&data_in[ 1 * count/64];
uint64_t *data_in_002 = (uint64_t*)&data_in[ 2 * count/64];
uint64_t *data_in_003 = (uint64_t*)&data_in[ 3 * count/64];
uint64_t *data_in_010 = (uint64_t*)&data_in[ 4 * count/64];
uint64_t *data_in_011 = (uint64_t*)&data_in[ 5 * count/64];
uint64_t *data_in_012 = (uint64_t*)&data_in[ 6 * count/64];
uint64_t *data_in_013 = (uint64_t*)&data_in[ 7 * count/64];
uint64_t *data_in_020 = (uint64_t*)&data_in[ 8 * count/64];
uint64_t *data_in_021 = (uint64_t*)&data_in[ 9 * count/64];
uint64_t *data_in_022 = (uint64_t*)&data_in[10 * count/64];
uint64_t *data_in_023 = (uint64_t*)&data_in[11 * count/64];
uint64_t *data_in_030 = (uint64_t*)&data_in[12 * count/64];
uint64_t *data_in_031 = (uint64_t*)&data_in[13 * count/64];
uint64_t *data_in_032 = (uint64_t*)&data_in[14 * count/64];
uint64_t *data_in_033 = (uint64_t*)&data_in[15 * count/64];
uint64_t *data_in_100 = (uint64_t*)&data_in[16 * count/64];
uint64_t *data_in_101 = (uint64_t*)&data_in[17 * count/64];
uint64_t *data_in_102 = (uint64_t*)&data_in[18 * count/64];
uint64_t *data_in_103 = (uint64_t*)&data_in[19 * count/64];
uint64_t *data_in_110 = (uint64_t*)&data_in[20 * count/64];
uint64_t *data_in_111 = (uint64_t*)&data_in[21 * count/64];
uint64_t *data_in_112 = (uint64_t*)&data_in[22 * count/64];
uint64_t *data_in_113 = (uint64_t*)&data_in[23 * count/64];
uint64_t *data_in_120 = (uint64_t*)&data_in[24 * count/64];
uint64_t *data_in_121 = (uint64_t*)&data_in[25 * count/64];
uint64_t *data_in_122 = (uint64_t*)&data_in[26 * count/64];
uint64_t *data_in_123 = (uint64_t*)&data_in[27 * count/64];
uint64_t *data_in_130 = (uint64_t*)&data_in[28 * count/64];
uint64_t *data_in_131 = (uint64_t*)&data_in[29 * count/64];
uint64_t *data_in_132 = (uint64_t*)&data_in[30 * count/64];
uint64_t *data_in_133 = (uint64_t*)&data_in[31 * count/64];
uint64_t *data_in_200 = (uint64_t*)&data_in[32 * count/64];
uint64_t *data_in_201 = (uint64_t*)&data_in[33 * count/64];
uint64_t *data_in_202 = (uint64_t*)&data_in[34 * count/64];
uint64_t *data_in_203 = (uint64_t*)&data_in[35 * count/64];
uint64_t *data_in_210 = (uint64_t*)&data_in[36 * count/64];
uint64_t *data_in_211 = (uint64_t*)&data_in[37 * count/64];
uint64_t *data_in_212 = (uint64_t*)&data_in[38 * count/64];
uint64_t *data_in_213 = (uint64_t*)&data_in[39 * count/64];
uint64_t *data_in_220 = (uint64_t*)&data_in[40 * count/64];
uint64_t *data_in_221 = (uint64_t*)&data_in[41 * count/64];
uint64_t *data_in_222 = (uint64_t*)&data_in[42 * count/64];
uint64_t *data_in_223 = (uint64_t*)&data_in[43 * count/64];
uint64_t *data_in_230 = (uint64_t*)&data_in[44 * count/64];
uint64_t *data_in_231 = (uint64_t*)&data_in[45 * count/64];
uint64_t *data_in_232 = (uint64_t*)&data_in[46 * count/64];
uint64_t *data_in_233 = (uint64_t*)&data_in[47 * count/64];
uint64_t *data_in_300 = (uint64_t*)&data_in[48 * count/64];
uint64_t *data_in_301 = (uint64_t*)&data_in[49 * count/64];
uint64_t *data_in_302 = (uint64_t*)&data_in[50 * count/64];
uint64_t *data_in_303 = (uint64_t*)&data_in[51 * count/64];
uint64_t *data_in_310 = (uint64_t*)&data_in[52 * count/64];
uint64_t *data_in_311 = (uint64_t*)&data_in[53 * count/64];
uint64_t *data_in_312 = (uint64_t*)&data_in[54 * count/64];
uint64_t *data_in_313 = (uint64_t*)&data_in[55 * count/64];
uint64_t *data_in_320 = (uint64_t*)&data_in[56 * count/64];
uint64_t *data_in_321 = (uint64_t*)&data_in[57 * count/64];
uint64_t *data_in_322 = (uint64_t*)&data_in[58 * count/64];
uint64_t *data_in_323 = (uint64_t*)&data_in[59 * count/64];
uint64_t *data_in_330 = (uint64_t*)&data_in[60 * count/64];
uint64_t *data_in_331 = (uint64_t*)&data_in[61 * count/64];
uint64_t *data_in_332 = (uint64_t*)&data_in[62 * count/64];
uint64_t *data_in_333 = (uint64_t*)&data_in[63 * count/64];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/32/2; ++i)
{
uint64_t rev = __builtin_e2k_bitrevd(i);
int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
*(__v2du*)((void*)data_out + offset + 0*16) = (__v2du){data_in_000[i], data_in_100[i]};
*(__v2du*)((void*)data_out + offset + 1*16) = (__v2du){data_in_200[i], data_in_300[i]};
*(__v2du*)((void*)data_out + offset + 2*16) = (__v2du){data_in_010[i], data_in_110[i]};
*(__v2du*)((void*)data_out + offset + 3*16) = (__v2du){data_in_210[i], data_in_310[i]};
*(__v2du*)((void*)data_out + offset + 4*16) = (__v2du){data_in_020[i], data_in_120[i]};
*(__v2du*)((void*)data_out + offset + 5*16) = (__v2du){data_in_220[i], data_in_320[i]};
*(__v2du*)((void*)data_out + offset + 6*16) = (__v2du){data_in_030[i], data_in_130[i]};
*(__v2du*)((void*)data_out + offset + 7*16) = (__v2du){data_in_230[i], data_in_330[i]};
*(__v2du*)((void*)data_out + offset + 8*16) = (__v2du){data_in_001[i], data_in_101[i]};
*(__v2du*)((void*)data_out + offset + 9*16) = (__v2du){data_in_201[i], data_in_301[i]};
*(__v2du*)((void*)data_out + offset + 10*16) = (__v2du){data_in_011[i], data_in_111[i]};
*(__v2du*)((void*)data_out + offset + 11*16) = (__v2du){data_in_211[i], data_in_311[i]};
*(__v2du*)((void*)data_out + offset + 12*16) = (__v2du){data_in_021[i], data_in_121[i]};
*(__v2du*)((void*)data_out + offset + 13*16) = (__v2du){data_in_221[i], data_in_321[i]};
*(__v2du*)((void*)data_out + offset + 14*16) = (__v2du){data_in_031[i], data_in_131[i]};
*(__v2du*)((void*)data_out + offset + 15*16) = (__v2du){data_in_231[i], data_in_331[i]};
}
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < count/32/2; ++i)
{
uint64_t rev = __builtin_e2k_bitrevd(i);
int64_t offset = 8 * (((rev>>(shift-1)) & 0xAAAAAAAAAAAAAAAA) | ((rev>>(shift+1)) & 0x5555555555555555));
*(__v2du*)((void*)data_out + offset + 16*16) = (__v2du){data_in_002[i], data_in_102[i]};
*(__v2du*)((void*)data_out + offset + 17*16) = (__v2du){data_in_202[i], data_in_302[i]};
*(__v2du*)((void*)data_out + offset + 18*16) = (__v2du){data_in_012[i], data_in_112[i]};
*(__v2du*)((void*)data_out + offset + 19*16) = (__v2du){data_in_212[i], data_in_312[i]};
*(__v2du*)((void*)data_out + offset + 20*16) = (__v2du){data_in_022[i], data_in_122[i]};
*(__v2du*)((void*)data_out + offset + 21*16) = (__v2du){data_in_222[i], data_in_322[i]};
*(__v2du*)((void*)data_out + offset + 22*16) = (__v2du){data_in_032[i], data_in_132[i]};
*(__v2du*)((void*)data_out + offset + 23*16) = (__v2du){data_in_232[i], data_in_332[i]};
*(__v2du*)((void*)data_out + offset + 24*16) = (__v2du){data_in_003[i], data_in_103[i]};
*(__v2du*)((void*)data_out + offset + 25*16) = (__v2du){data_in_203[i], data_in_303[i]};
*(__v2du*)((void*)data_out + offset + 26*16) = (__v2du){data_in_013[i], data_in_113[i]};
*(__v2du*)((void*)data_out + offset + 27*16) = (__v2du){data_in_213[i], data_in_313[i]};
*(__v2du*)((void*)data_out + offset + 28*16) = (__v2du){data_in_023[i], data_in_123[i]};
*(__v2du*)((void*)data_out + offset + 29*16) = (__v2du){data_in_223[i], data_in_323[i]};
*(__v2du*)((void*)data_out + offset + 30*16) = (__v2du){data_in_033[i], data_in_133[i]};
*(__v2du*)((void*)data_out + offset + 31*16) = (__v2du){data_in_233[i], data_in_333[i]};
}
}
Main loop in assembly
.L7926:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=17, incr=0, ind=0, asz=1, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=2, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=2, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=4, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=6, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=6, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=10, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=10, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=14, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=14, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=18, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=18, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=22, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=22, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=1, abs=24, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=1, abs=26, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=1, abs=26, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=1, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=1, abs=28, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=1, abs=30, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=1, abs=30, disp=0
}
.L6604:
{
loop_mode
qppackdl,0,sm %b[55], %b[56], %b[1]
shr_andd,1,sm %b[62], %r0, %r21, %b[67]
stqp,2 %r4, %b[58], %b[14]
qppackdl,3,sm %b[26], %b[50], %b[14]
shr_andd,4,sm %b[62], %r1, %r20, %b[68]
stqp,5 %r2, %b[58], %b[19]
movad,0 area=15, ind=0, am=1, be=0, %b[62]
movad,1 area=14, ind=0, am=1, be=0, %b[66]
movad,2 area=15, ind=0, am=1, be=0, %b[60]
movad,3 area=14, ind=0, am=1, be=0, %b[64]
}
{
loop_mode
qppackdl,1,sm %b[44], %b[49], %b[31]
stqp,2 %r5, %b[58], %b[3]
qppackdl,4,sm %b[25], %b[31], %b[26]
stqp,5 %r6, %b[58], %b[16]
movad,0 area=13, ind=0, am=1, be=0, %b[16]
movad,1 area=12, ind=0, am=1, be=0, %b[25]
movad,2 area=13, ind=0, am=1, be=0, %b[3]
movad,3 area=12, ind=0, am=1, be=0, %b[19]
}
{
loop_mode
qppackdl,1,sm %b[63], %b[65], %b[33]
stqp,2 %r7, %b[58], %b[33]
qppackdl,4,sm %b[59], %b[61], %b[28]
stqp,5 %r9, %b[58], %b[28]
movad,0 area=11, ind=0, am=1, be=0, %b[49]
movad,1 area=10, ind=0, am=1, be=0, %b[55]
movad,2 area=11, ind=0, am=1, be=0, %b[44]
movad,3 area=10, ind=0, am=1, be=0, %b[50]
}
{
loop_mode
qppackdl,1,sm %g18, %g19, %b[37]
stqp,2 %r10, %b[58], %b[37]
qppackdl,4,sm %g16, %g17, %b[32]
stqp,5 %r11, %b[58], %b[32]
movad,0 area=9, ind=0, am=1, be=0, %g17
movad,1 area=8, ind=0, am=1, be=0, %g19
movad,2 area=9, ind=0, am=1, be=0, %g16
movad,3 area=8, ind=0, am=1, be=0, %g18
}
{
loop_mode
qppackdl,1,sm %b[52], %b[57], %b[41]
stqp,2 %r12, %b[58], %b[41]
qppackdl,4,sm %b[46], %b[51], %b[36]
stqp,5 %r13, %b[58], %b[36]
movad,0 area=7, ind=0, am=1, be=0, %b[59]
movad,1 area=6, ind=0, am=1, be=0, %b[63]
movad,2 area=7, ind=0, am=1, be=0, %b[57]
movad,3 area=6, ind=0, am=1, be=0, %b[61]
}
{
loop_mode
addd,0,sm %b[4], 0x1, %b[2] ? %pcnt2
qppackdl,1,sm %b[21], %b[27], %b[5]
stqp,2 %r14, %b[58], %b[45]
ord,3,sm %b[68], %b[67], %b[65]
qppackdl,4,sm %b[5], %b[18], %b[18]
stqp,5 %r15, %b[58], %b[40]
movad,0 area=5, ind=0, am=1, be=0, %b[27]
movad,1 area=4, ind=0, am=1, be=0, %b[45]
movad,2 area=5, ind=0, am=1, be=0, %b[21]
movad,3 area=4, ind=0, am=1, be=0, %b[40]
}
{
loop_mode
bitrevd,0,sm %b[4], %b[60]
qppackdl,1,sm %b[64], %b[66], %b[9]
stqp,2 %r16, %b[58], %b[9]
shld,3,sm %b[65], 0x3, %b[56]
qppackdl,4,sm %b[60], %b[62], %b[4]
stqp,5 %r17, %b[58], %b[22]
movad,0 area=3, ind=0, am=1, be=0, %b[46]
movad,1 area=2, ind=0, am=1, be=0, %b[52]
movad,2 area=3, ind=0, am=1, be=0, %b[22]
movad,3 area=2, ind=0, am=1, be=0, %b[51]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qppackdl,0,sm %g22, %g23, %b[10]
stqp,2 %r18, %b[58], %b[15]
qppackdl,3,sm %g20, %g21, %b[15]
stqp,5 %r19, %b[58], %b[10]
movad,0 area=1, ind=0, am=1, be=0, %g23
movad,1 area=0, ind=0, am=1, be=0, %g21
movad,2 area=1, ind=0, am=1, be=0, %g22
movad,3 area=0, ind=0, am=1, be=0, %g20
}
...
.L7272:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=17, incr=0, ind=0, asz=1, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=2, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=2, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=4, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=6, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=6, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=10, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=10, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=14, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=14, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=18, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=18, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=22, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=22, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=1, abs=24, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=1, abs=26, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=1, abs=26, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=1, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=1, abs=28, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=1, abs=30, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=1, asz=1, abs=30, disp=0
}
.L6518:
{
loop_mode
qppackdl,0,sm %b[55], %b[56], %b[1]
shr_andd,1,sm %b[62], %r0, %r21, %b[67]
stqp,2 %r4, %b[58], %b[19]
qppackdl,3,sm %b[26], %b[50], %b[14]
shr_andd,4,sm %b[62], %r1, %r20, %b[68]
stqp,5 %r2, %b[58], %b[14]
movad,0 area=15, ind=0, am=1, be=0, %b[62]
movad,1 area=14, ind=0, am=1, be=0, %b[66]
movad,2 area=15, ind=0, am=1, be=0, %b[60]
movad,3 area=14, ind=0, am=1, be=0, %b[64]
}
{
loop_mode
qppackdl,1,sm %b[44], %b[49], %b[26]
stqp,2 %r5, %b[58], %b[3]
qppackdl,4,sm %b[25], %b[31], %b[31]
stqp,5 %r6, %b[58], %b[16]
movad,0 area=13, ind=0, am=1, be=0, %b[16]
movad,1 area=12, ind=0, am=1, be=0, %b[25]
movad,2 area=13, ind=0, am=1, be=0, %b[3]
movad,3 area=12, ind=0, am=1, be=0, %b[19]
}
{
loop_mode
qppackdl,1,sm %b[63], %b[65], %b[33]
stqp,2 %r7, %b[58], %b[28]
qppackdl,4,sm %b[59], %b[61], %b[28]
stqp,5 %r9, %b[58], %b[33]
movad,0 area=11, ind=0, am=1, be=0, %b[49]
movad,1 area=10, ind=0, am=1, be=0, %b[55]
movad,2 area=11, ind=0, am=1, be=0, %b[44]
movad,3 area=10, ind=0, am=1, be=0, %b[50]
}
{
loop_mode
qppackdl,1,sm %g18, %g19, %b[37]
stqp,2 %r10, %b[58], %b[37]
qppackdl,4,sm %g16, %g17, %b[32]
stqp,5 %r11, %b[58], %b[32]
movad,0 area=9, ind=0, am=1, be=0, %g17
movad,1 area=8, ind=0, am=1, be=0, %g19
movad,2 area=9, ind=0, am=1, be=0, %g16
movad,3 area=8, ind=0, am=1, be=0, %g18
}
{
loop_mode
qppackdl,1,sm %b[52], %b[57], %b[36]
stqp,2 %r12, %b[58], %b[41]
qppackdl,4,sm %b[46], %b[51], %b[41]
stqp,5 %r13, %b[58], %b[36]
movad,0 area=7, ind=0, am=1, be=0, %b[59]
movad,1 area=6, ind=0, am=1, be=0, %b[63]
movad,2 area=7, ind=0, am=1, be=0, %b[57]
movad,3 area=6, ind=0, am=1, be=0, %b[61]
}
{
loop_mode
addd,0,sm %b[4], 0x1, %b[2] ? %pcnt2
qppackdl,1,sm %b[21], %b[27], %b[5]
stqp,2 %r14, %b[58], %b[40]
ord,3,sm %b[68], %b[67], %b[65]
qppackdl,4,sm %b[5], %b[18], %b[18]
stqp,5 %r15, %b[58], %b[45]
movad,0 area=5, ind=0, am=1, be=0, %b[27]
movad,1 area=4, ind=0, am=1, be=0, %b[45]
movad,2 area=5, ind=0, am=1, be=0, %b[21]
movad,3 area=4, ind=0, am=1, be=0, %b[40]
}
{
loop_mode
bitrevd,0,sm %b[4], %b[60]
qppackdl,1,sm %b[64], %b[66], %b[4]
stqp,2 %r16, %b[58], %b[9]
shld,3,sm %b[65], 0x3, %b[56]
qppackdl,4,sm %b[60], %b[62], %b[9]
stqp,5 %r17, %b[58], %b[22]
movad,0 area=3, ind=0, am=1, be=0, %b[46]
movad,1 area=2, ind=0, am=1, be=0, %b[52]
movad,2 area=3, ind=0, am=1, be=0, %b[22]
movad,3 area=2, ind=0, am=1, be=0, %b[51]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qppackdl,0,sm %g22, %g23, %b[15]
stqp,2 %r18, %b[58], %b[10]
qppackdl,3,sm %g20, %g21, %b[10]
stqp,5 %r19, %b[58], %b[15]
movad,0 area=1, ind=0, am=1, be=0, %g23
movad,1 area=0, ind=0, am=1, be=0, %g21
movad,2 area=1, ind=0, am=1, be=0, %g22
movad,3 area=0, ind=0, am=1, be=0, %g20
}
Theoretical speed: 32 complex numbers per 8 cycles (32/8) = 32 bytes/cycle
Speed measurements

We see a slowdown at the beginning of the graph and a speedup at the end.
The overhead of setting up the second loop keeps the speedup from showing along the entire length of the graph.
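The way the 64 read streams are divided between the two loops of reverse_radix4_x32 can be sanity-checked with a small calculation. The sketch below assumes the naming used in the listing above, where the pointer `data_in_abc` is stream number `16*a + 4*b + c`. The helper names `x64_output_slot`, `x64_pass_of_stream` and `x64_streams_in_pass` are hypothetical.

```c
#include <assert.h>

/* For read stream s = 16*a + 4*b + c (the pointer named data_in_abc),
   returns its element position inside the 64-element output group. */
static int x64_output_slot(int s)
{
    int a = s / 16, b = (s / 4) % 4, c = s % 4;
    return 16 * c + 4 * b + a;   /* base-4 digits of s in reverse order */
}

/* Pass 0 fills output elements 0..31 of each group, pass 1 fills 32..63. */
static int x64_pass_of_stream(int s)
{
    return x64_output_slot(s) / 32;
}

/* Number of distinct read streams that a given pass needs. */
static int x64_streams_in_pass(int pass)
{
    int n = 0;
    for (int s = 0; s < 64; ++s)
        if (x64_pass_of_stream(s) == pass)
            ++n;
    return n;
}
```

Each pass ends up with exactly 32 streams: those whose last digit `c` is 0 or 1 go to the first loop, and those with `c` equal to 2 or 3 go to the second. This keeps each loop within the 32-stream APB limit mentioned earlier.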
Summary for reverse_radix4


Either reverse_radix4_x16 or reverse_radix4_x32 can be considered the winner.
The FFT algorithm consists of one Reverse run and several Stage runs. The more Stage runs there are, the smaller the contribution of Reverse speed to the overall FFT speed. So Reverse speed matters more for shorter inputs, where there are fewer Stage runs.
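To put rough numbers on this, the sketch below counts the passes over the data for a given input length. It rests on a loud simplifying assumption that is not taken from the measurements: every pass, Reverse or Stage, is treated as costing the same time. The function names are invented for the illustration.

```c
#include <assert.h>

/* Number of Stage passes for a transform of 2^bit_count points:
   one pass per bit for radix-2, one pass per pair of bits for radix-4. */
static int stage_passes(int bit_count, int radix)
{
    return radix == 4 ? bit_count / 2 : bit_count;
}

/* Share of Reverse among all passes, in percent,
   ASSUMING every pass (Reverse or Stage) costs the same. */
static double reverse_share_percent(int bit_count, int radix)
{
    return 100.0 / (1 + stage_passes(bit_count, radix));
}
```

Under that assumption, a 16-point radix-4 transform (`bit_count = 4`) spends about a third of its passes in Reverse, while a transform of 2^20 points spends about 9%. Real pass costs differ, so these figures only illustrate the trend.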
For the Radix-4 FFT implementation we will use reverse_radix4_x16.
Later it can be replaced with reverse_radix4_x32 to see how the FFT speed changes.
Writing the Stage function
stage_radix2
Diagram of the Stage algorithm for the "radix-2" version.

1. stage_radix2_etalon
Reference variant, used to check the other variants for correctness.
C code
void stage_radix2_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
myComplex *x_in = &data_in[0];
myComplex *y_in = &data_in[1];
myComplex *c_in = coef;
myComplex *out_add = &data_out[0];
myComplex *out_sub = &data_out[data_count/2];
#pragma ivdep
#pragma unroll(1)
// #pragma prefetch
for(int64_t i = 0; i < data_count/2; ++i)
{
myComplex x = x_in[2*i];
myComplex y = y_in[2*i];
myComplex c = c_in[i];
myComplex cy = complex_mul(c, y);
out_add[i] = complex_add(x, cy);
out_sub[i] = complex_sub(x, cy);
}
}
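The listing above relies on `myComplex`, `complex_mul`, `complex_add` and `complex_sub`, which are defined elsewhere in the article. As a reading aid only, here is a minimal sketch of what such helpers might look like, together with a single butterfly. The struct layout and the name `butterfly_radix2` are assumptions, not the author's definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for the article's myComplex: two floats, real first. */
typedef struct { float re, im; } myComplex;

static myComplex complex_add(myComplex a, myComplex b)
{
    return (myComplex){ a.re + b.re, a.im + b.im };
}

static myComplex complex_sub(myComplex a, myComplex b)
{
    return (myComplex){ a.re - b.re, a.im - b.im };
}

static myComplex complex_mul(myComplex a, myComplex b)
{
    return (myComplex){ a.re * b.re - a.im * b.im,
                        a.re * b.im + a.im * b.re };
}

/* One radix-2 butterfly: sum = x + c*y, diff = x - c*y.
   This is the body of the loop in stage_radix2_etalon for one index i. */
static void butterfly_radix2(myComplex x, myComplex y, myComplex c,
                             myComplex *sum, myComplex *diff)
{
    assert(sum && diff);
    myComplex cy = complex_mul(c, y);
    *sum  = complex_add(x, cy);
    *diff = complex_sub(x, cy);
}
```

For example, with x = 1+2i, y = 3+4i and c = i, the product c*y is -4+3i, so the butterfly yields -3+5i and 5-1i.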
Main loop in assembly
.L444:
{
fapb ct=1, dcd=0, fmt=3, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=2, asz=5, abs=0, disp=0
}
.L120:
{
loop_mode
fmuls,0,sm %b[67], %b[6], %b[37]
fsubs,1,sm %b[46], %b[24], %b[58]
staaw,2 %b[62], %aad1[ %aasti3 + _f32s,_lts0 0x4 ]
fmul_adds,3,sm %b[55], %b[13], %b[43], %b[14]
fadds,4,sm %b[46], %b[24], %b[57]
staaw,5 %b[61], %aad2[ %aasti4 + _f32s,_lts0 0x4 ]
movaw,0 area=0, ind=8, am=0, be=0, %b[0]
movaw,1 area=0, ind=12, am=0, be=0, %b[1]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
fmuls,0,sm %b[67], %b[7], %b[62]
fsubs,1,sm %b[35], %b[56], %b[70]
staaw,2 %b[74], %aad1[ %aasti3 ]
incr,2 %aaincr3
fmul_rsubs,3,sm %b[55], %b[12], %b[68], %b[46]
fadds,4,sm %b[35], %b[56], %b[69]
staaw,5 %b[73], %aad2[ %aasti4 ]
incr,5 %aaincr3
movaw,0 area=0, ind=0, am=0, be=0, %b[13]
movaw,1 area=0, ind=4, am=1, be=0, %b[24]
movaw,2 area=0, ind=4, am=1, be=0, %b[61]
movaw,3 area=0, ind=0, am=0, be=0, %b[43]
}
Theoretical speed: 2 complex numbers in 2 cycles (2/2) = 8 bytes/cycle
Speed measurements

2. stage_radix2_etalon_unroll2
This variant came about by accident.
Here the loop is unrolled by a factor of 2 using the unroll pragma.
You can see that the compiler is able to use vector instructions.
C code
void stage_radix2_etalon_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
myComplex *x_in = &data_in[0];
myComplex *y_in = &data_in[1];
myComplex *c_in = coef;
myComplex *out_add = &data_out[0];
myComplex *out_sub = &data_out[data_count/2];
#pragma ivdep
#pragma unroll(2)
// #pragma prefetch
for(int64_t i = 0; i < data_count/2; ++i)
{
myComplex x = x_in[2*i];
myComplex y = y_in[2*i];
myComplex c = c_in[i];
myComplex cy = complex_mul(c, y);
out_add[i] = complex_add(x, cy);
out_sub[i] = complex_sub(x, cy);
}
}
Main loop in assembly
.L1266:
{
fapb ct=0, dcd=0, fmt=3, mrng=12, d=0, incr=1, ind=1, asz=4, abs=0, disp=20
fapb dpl=0, dcd=0, fmt=3, mrng=20, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
}
.L463:
{
loop_mode
pfmuls,0,sm %b[51], %b[32], %b[53]
insfd,1,sm %b[28], %r8, %b[54], %b[1]
pfmuls,2,sm %b[51], %b[13], %b[45]
insfd,3,sm %b[23], %r8, %b[50], %b[0]
pshufb,4,sm %b[9], %b[19], %r0, %b[56]
pfadds,5,sm %b[33], %b[39], %b[10]
movaw,1 area=0, ind=8, am=0, be=0, %b[38]
movaw,3 area=0, ind=12, am=0, be=0, %b[44]
}
{
loop_mode
pfmul_rsubs,0,sm %b[5], %b[15], %b[55], %b[39]
insfd,1,sm %b[20], %r8, %b[24], %b[23]
pfmul_adds,2,sm %b[5], %b[34], %b[47], %b[33]
insfd,3,sm %b[40], %r8, %b[46], %b[28]
pshufb,4,sm %b[12], %b[16], %r0, %b[54]
staad,5 %b[56], %aad1[ %aasti3 + _f32s,_lts0 0x8 ]
movad,1 area=1, ind=0, am=0, be=0, %b[50]
}
{
loop_mode
pfsubs,0,sm %b[8], %b[43], %b[15]
insfd,1,sm %b[9], %r8, %b[19], %b[57]
pfsubs,2,sm %b[31], %b[37], %b[5]
insfd,3,sm %b[12], %r8, %b[16], %b[55]
staad,5 %b[54], %aad2[ %aasti4 + _f32s,_lts0 0x8 ]
movad,0 area=1, ind=8, am=1, be=0, %b[24]
movaw,1 area=0, ind=4, am=0, be=0, %b[34]
movaw,2 area=0, ind=4, am=0, be=0, %b[20]
movaw,3 area=0, ind=8, am=0, be=0, %b[40]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfadds,0,sm %b[8], %b[43], %b[12]
insfd,1,sm %b[36], %r8, %b[42], %b[9]
staad,2 %b[57], %aad1[ %aasti3 ]
incr,2 %aaincr3
pshufb,4,sm %b[26], %b[52], %r0, %b[47]
staad,5 %b[55], %aad2[ %aasti4 ]
incr,5 %aaincr3
movaw,1 area=0, ind=0, am=1, be=0, %b[16]
movaw,2 area=0, ind=0, am=1, be=0, %b[46]
movaw,3 area=0, ind=16, am=0, be=0, %b[19]
}
Theoretical speed: 4 complex numbers in 4 cycles (4/4) = 8 bytes/cycle
Speed measurements

We see a speedup.
The theoretical speed is unchanged compared to the reference variant, but the measured speed went up. The assembly shows that the compiler inserted vector instructions.
3. stage_radix2_simd64
We won't try direct vectorization now; we'll look at it separately later.
For now, let's use the SIMD64 vector instructions to perform several multiplications with a single instruction.
We multiply two complex numbers c and y as follows:
- read the complex numbers c and y from memory into 64-bit registers (the real part lands in one half of the register, the imaginary part in the other half);
- flip the sign of the imaginary part of c with xor (giving conj_c) and multiply conj_c by y element-wise: this yields the partial products for the real part of cy (adding the two halves of the register completes the real part);
- swap the real and imaginary parts of c with shuf (giving swap_c) and multiply swap_c by y element-wise: this yields the partial products for the imaginary part of cy (adding the two halves of the register completes the imaginary part);
- add the halves of the two partial-product registers with the vector horizontal add fhadd, which gives cy.
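The steps above can be sketched in portable scalar C, modelling each 64-bit register as two float lanes. This is an illustration of the arithmetic only, not Elbrus code: the helper names are hypothetical, and the lane order (lane 0 real, lane 1 imaginary) is an assumption.

```c
#include <assert.h>

/* Element-wise product of two 2-lane registers. */
static void lanes_mul(const float a[2], const float b[2], float out[2])
{
    out[0] = a[0] * b[0];
    out[1] = a[1] * b[1];
}

/* Horizontal add: lane 0 = a[0] + a[1], lane 1 = b[0] + b[1]. */
static void lanes_hadd(const float a[2], const float b[2], float out[2])
{
    out[0] = a[0] + a[1];
    out[1] = b[0] + b[1];
}

/* cy = c * y via the conj/swap scheme (lane 0 = real, lane 1 = imaginary). */
static void complex_mul_conj_swap(const float c[2], const float y[2], float cy[2])
{
    assert(c && y && cy);
    const float conj_c[2] = { c[0], -c[1] };  /* xor step: flip sign of imaginary lane */
    const float swap_c[2] = { c[1],  c[0] };  /* shuf step: swap real and imaginary    */
    float re_parts[2], im_parts[2];

    lanes_mul(conj_c, y, re_parts);           /* { cr*yr, -ci*yi } */
    lanes_mul(swap_c, y, im_parts);           /* { ci*yr,  cr*yi } */
    lanes_hadd(re_parts, im_parts, cy);       /* { cr*yr - ci*yi, ci*yr + cr*yi } */
}
```

With c = 2+3i and y = 4+5i the partial products are {8, -15} and {12, 10}, and the horizontal add produces -7+22i, the same as the textbook complex product.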
C code
void stage_radix2_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
uint64_t *x_in = (uint64_t*)&data_in[0];
uint64_t *y_in = (uint64_t*)&data_in[1];
uint64_t *c_in = (uint64_t*)coef;
uint64_t *out_add = (uint64_t*)&data_out[0];
uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/2; ++i)
{
uint64_t x = x_in[2*i];
uint64_t y = y_in[2*i];
uint64_t c = c_in[i];
uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);
uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);
uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);
out_add[i] = __builtin_e2k_pfadds(x, cy);
out_sub[i] = __builtin_e2k_pfsubs(x, cy);
}
}
Main loop in assembly
.L1588:
{
fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
}
.L1388:
{
loop_mode
pfmuls,0,sm %b[35], %b[28], %b[18]
pfmul_hadds,1,sm %b[33], %b[32], %b[22], %b[0]
pshufb,4,sm 0x0, %b[7], %r5, %b[29]
pfadds,5,sm %b[27], %b[10], %b[12]
movad,3 area=0, ind=0, am=1, be=0, %b[1]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
xord,0,sm %b[5], %r0, %b[33]
pfsubs,1,sm %b[27], %b[10], %b[32]
staad,2 %b[36], %aad1[ %aasti3 ]
incr,2 %aaincr0
staad,5 %b[16], %aad2[ %aasti4 ]
incr,5 %aaincr0
movad,0 area=0, ind=8, am=1, be=0, %b[22]
movad,1 area=0, ind=0, am=0, be=0, %b[7]
}
After compilation we see that the loop consists of 8 instructions: xor, shuf, fmul, fmul_fhadd, fadd, fsub, std, std. The fhadd instruction ended up fused with one of the fmul instructions (it turns out Elbrus can do that).
Theoretical speed: 2 complex numbers in 2 cycles (2/2) = 8 bytes/cycle
Speed measurements

We see a small speedup.
One cycle holds 6 instructions, and here we have 8, so the loop body occupies 8/6 of a cycle. Ideally, unrolling the loop by a factor of 3 gives the densest packing (3 * 8/6 = 4 cycles). We will unroll using the unroll pragma.
But first let's look at unrolling by a factor of 2 (2 * 8/6 ≈ 2.67, rounded up to 3 cycles).
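The slot arithmetic used here can be written down as a small helper. This is only a sketch of the estimate: it counts the 6 instruction slots per cycle mentioned in the text and ignores everything else the scheduler has to respect, so it is a lower bound rather than a prediction. The names `min_cycles` and `bytes_per_cycle` are hypothetical.

```c
#include <assert.h>

enum { SLOTS_PER_CYCLE = 6 };   /* instructions per cycle, as stated in the text */

/* Lower bound on cycles for one unrolled loop body:
   ceil(unroll * instr_per_iter / 6). */
static int min_cycles(int instr_per_iter, int unroll)
{
    assert(instr_per_iter > 0 && unroll > 0);
    int total = instr_per_iter * unroll;
    return (total + SLOTS_PER_CYCLE - 1) / SLOTS_PER_CYCLE;
}

/* Theoretical throughput in bytes per cycle; one float complex is 8 bytes. */
static double bytes_per_cycle(int complex_per_iter, int instr_per_iter, int unroll)
{
    assert(complex_per_iter > 0);
    return 8.0 * complex_per_iter * unroll / min_cycles(instr_per_iter, unroll);
}
```

For the SIMD64 loop (8 instructions, 2 complex numbers per iteration) this gives 2, 3 and 4 cycles for unroll factors 1, 2 and 3, that is 8, about 10.67 and 12 bytes per cycle.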
4. stage_radix2_simd64_unroll2
Here the loop is unrolled by a factor of 2 using the unroll pragma.
C code
void stage_radix2_simd64_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
uint64_t *x_in = (uint64_t*)&data_in[0];
uint64_t *y_in = (uint64_t*)&data_in[1];
uint64_t *c_in = (uint64_t*)coef;
uint64_t *out_add = (uint64_t*)&data_out[0];
uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];
#pragma ivdep
#pragma unroll(2)
#pragma prefetch
for(int64_t i = 0; i < data_count/2; ++i)
{
uint64_t x = x_in[2*i];
uint64_t y = y_in[2*i];
uint64_t c = c_in[i];
uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);
uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);
uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);
out_add[i] = __builtin_e2k_pfadds(x, cy);
out_sub[i] = __builtin_e2k_pfsubs(x, cy);
}
}
Main loop in assembly
.L2152:
{
fapb ct=1, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=5, abs=0, disp=0
}
.L1710:
{
loop_mode
pfmul_hadds,0,sm %b[51], %b[10], %b[36], %b[11]
pfmuls,1,sm %b[57], %b[7], %b[44]
pfsubs,2,sm %b[41], %b[25], %b[45]
xord,4,sm %b[42], %r0, %b[52]
pfadds,5,sm %b[41], %b[25], %b[48]
movad,0 area=0, ind=24, am=0, be=0, %b[1]
movad,1 area=0, ind=8, am=0, be=0, %b[0]
}
{
loop_mode
pshufb,0,sm 0x0, %b[32], %r9, %b[41]
pfsubs,1,sm %b[26], %b[17], %b[51]
staad,2 %b[47], %aad1[ %aasti3 + _f32s,_lts0 0x8 ]
pfadds,3,sm %b[26], %b[17], %b[54]
xord,4,sm %b[30], %r0, %b[55]
staad,5 %b[50], %aad2[ %aasti4 + _f32s,_lts0 0x8 ]
movad,0 area=0, ind=0, am=1, be=0, %b[10]
movad,1 area=0, ind=16, am=0, be=0, %b[25]
movad,3 area=0, ind=0, am=0, be=0, %b[36]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmul_hadds,0,sm %b[43], %b[9], %b[46], %b[17]
pfmuls,1,sm %b[52], %b[6], %b[32]
staad,2 %b[53], %aad1[ %aasti3 ]
incr,2 %aaincr3
pshufb,4,sm 0x0, %b[42], %r9, %b[47]
staad,5 %b[56], %aad2[ %aasti4 ]
incr,5 %aaincr3
movad,3 area=0, ind=8, am=1, be=0, %b[26]
}
Theoretical speed: 4 complex numbers in 3 cycles (4/3) = 10.67 bytes/cycle
Speed measurements

We see a speedup.
Now let's look at unrolling by a factor of 3.
5. stage_radix2_simd64_unroll3
Here the loop is unrolled by a factor of 3 using the unroll pragma.
C code
void stage_radix2_simd64_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
uint64_t *x_in = (uint64_t*)&data_in[0];
uint64_t *y_in = (uint64_t*)&data_in[1];
uint64_t *c_in = (uint64_t*)coef;
uint64_t *out_add = (uint64_t*)&data_out[0];
uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];
#pragma ivdep
#pragma unroll(3)
#pragma prefetch
for(int64_t i = 0; i < data_count/2; ++i)
{
uint64_t x = x_in[2*i];
uint64_t y = y_in[2*i];
uint64_t c = c_in[i];
uint64_t conj_c = __builtin_e2k_pxord(c, 1LL<<63);
uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);
uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);
out_add[i] = __builtin_e2k_pfadds(x, cy);
out_sub[i] = __builtin_e2k_pfsubs(x, cy);
}
}
Main loop in assembly
.L2815:
{
fapb ct=0, dcd=0, fmt=4, mrng=24, d=0, incr=2, ind=2, asz=4, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=4, abs=16, disp=32
}
.L2177:
{
loop_mode
pfmuls,0,sm %b[61], %b[23], %b[41]
pfmuls,1,sm %b[73], %b[12], %b[1]
xord,2,sm %b[57], %r0, %b[59]
pfmul_hadds,3,sm %b[78], %b[53], %b[28], %b[0]
xord,4,sm %b[52], %r0, %b[66]
pfadds,5,sm %b[48], %b[34], %b[58]
}
{
loop_mode
pfmul_hadds,0,sm %b[67], %b[25], %b[43], %b[28]
pfsubs,1,sm %b[48], %b[34], %b[68]
staad,2 %b[70], %aad1[ %aasti3 + _f32s,_lts0 0x10 ]
pfsubs,3,sm %b[17], %b[4], %b[61]
xord,4,sm %b[20], %r0, %b[71]
staad,5 %b[60], %aad2[ %aasti4 + _f32s,_lts0 0x10 ]
movad,1 area=0, ind=16, am=0, be=0, %b[53]
}
{
loop_mode
pfmul_hadds,0,sm %b[72], %b[14], %b[3], %b[60]
pfsubs,1,sm %b[39], %b[64], %b[73]
staad,2 %b[75], %aad1[ %aasti3 + _f32s,_lts0 0x8 ]
pfadds,3,sm %b[17], %b[4], %b[67]
pshufb,4,sm 0x0, %b[22], %r10, %b[70]
staad,5 %b[63], %aad1[ %aasti3 ]
incr,5 %aaincr3
movad,0 area=0, ind=0, am=0, be=0, %b[48]
movad,1 area=1, ind=0, am=0, be=0, %b[34]
movad,2 area=0, ind=16, am=0, be=0, %b[25]
movad,3 area=0, ind=8, am=0, be=0, %b[43]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmuls,0,sm %b[66], %b[47], %b[22]
pfadds,1,sm %b[39], %b[64], %b[72]
staad,2 %b[74], %aad2[ %aasti4 + _f32s,_lts0 0x8 ]
pshufb,3,sm 0x0, %b[57], %r10, %b[63]
pshufb,4,sm 0x0, %b[56], %r10, %b[76]
staad,5 %b[69], %aad2[ %aasti4 ]
incr,5 %aaincr3
movad,0 area=1, ind=8, am=1, be=0, %b[17]
movad,1 area=0, ind=8, am=1, be=0, %b[14]
movad,2 area=0, ind=0, am=1, be=0, %b[3]
movad,3 area=0, ind=24, am=0, be=0, %b[4]
}
Theoretical speed: 6 complex numbers in 4 cycles (6/4) = 12 bytes/cycle
Speed measurements

We see a speedup in the middle of the plot, but a slowdown at the start and at the end.
6. stage_radix2_simd128
Now let's try the SIMD128 vector instructions, by analogy with SIMD64.
Unlike SIMD64, here the data has to be shuffled at the start and at the end of the loop body with the shuf instruction. This is needed so that one 128-bit register holds the data of two x values and another holds the data of two y values.
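The two shuffles can be pictured with a plain C model in which a 128-bit register is a pair of complex numbers. This sketch shows only where the data ends up and says nothing about how the shuffle instruction encodes its mask. The types and function names are hypothetical, and the layout of the horizontal-add result (both real parts first, then both imaginary parts) is an assumption suggested by the variable name cy_rrii.

```c
#include <assert.h>

typedef struct { float re, im; } cplx;      /* hypothetical 8-byte complex       */
typedef struct { cplx lo, hi; } vec128;     /* model of one 128-bit register     */

/* Input shuffle: two consecutive loads hold (x0, y0) and (x1, y1);
   the butterfly wants (x0, x1) in one register and (y0, y1) in another. */
static void split_xy(vec128 xy0, vec128 xy1, vec128 *x, vec128 *y)
{
    assert(x && y);
    x->lo = xy0.lo;  x->hi = xy1.lo;
    y->lo = xy0.hi;  y->hi = xy1.hi;
}

/* Output shuffle: turn (re0, re1, im0, im1) into two complex numbers
   (re0, im0) and (re1, im1). The rrii input order is an assumption. */
static void merge_re_im(const float rrii[4], vec128 *cy)
{
    assert(rrii && cy);
    cy->lo = (cplx){ rrii[0], rrii[2] };
    cy->hi = (cplx){ rrii[1], rrii[3] };
}
```

After split_xy, one vector multiply covers both butterflies at once, and merge_re_im puts the products back into the interleaved real/imaginary layout that the add and subtract steps expect.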
C code
void stage_radix2_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *xy1_in = (__v2di*)&data_in[2];
__v2di *c_in = (__v2di*)coef;
__v2di *out_add = (__v2di*)&data_out[0];
__v2di *out_sub = (__v2di*)&data_out[data_count/2];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
__v2di xy0 = xy0_in[2*i];
__v2di xy1 = xy1_in[2*i];
__v2di c = c_in[i];
__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di conj_c = __builtin_e2k_qpxor(c, (__v2di){1LL<<63, 1LL<<63});
__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
__v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
__v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
out_add[i] = __builtin_e2k_qpfadds(x, cy);
out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
}
}
Main loop in assembly
.L3099:
{
fapb ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
}
.L2840:
{
loop_mode
qpshufb,0,sm %b[36], %b[45], %r6, %b[0]
qpfmuls,1,sm %b[28], %b[5], %b[18]
qpfsubs,2,sm %b[16], %b[47], %b[11]
qpshufb,3,sm %b[34], %b[43], %r7, %b[1]
qpxor,4,sm %b[23], %r0, %b[24]
staaqp,5 %b[17], %aad1[ %aasti3 ]
incr,5 %aaincr0
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpshufb,0,sm %b[35], %b[35], %r5, %b[45]
qpfmul_hadds,1,sm %b[42], %b[9], %b[22], %b[27]
qpfadds,2,sm %b[16], %b[47], %b[44]
qpshufb,4,sm %b[25], %b[25], %r9, %b[36]
staaqp,5 %b[50], %aad2[ %aasti4 ]
incr,5 %aaincr0
movaqp,0 area=0, ind=0, am=0, be=0, %b[37]
movaqp,1 area=0, ind=16, am=1, be=0, %b[28]
movaqp,3 area=0, ind=0, am=1, be=0, %b[17]
}
After compilation we see that the loop consists of 11 instructions (the same 8 instructions as in the SIMD64 variant plus 3 additional shuf instructions).
Theoretical speed: 4 complex numbers in 2 cycles (4/2) = 16 bytes/cycle
Speed measurements

We see a speedup.
Before moving on to unrolling, note that the loop can be made one instruction shorter. Later this will allow more efficient unrolling.
7. stage_radix2_simd128_noConj
We take advantage of the fact that Elbrus can fuse certain instructions.
We drop conj_c (which removes the xor instruction) and use fhsub to obtain the real part of cy from its partial products. The imaginary part is obtained with fhadd, as before. Both instructions get fused with the two fmul instructions, so they are effectively "free". The final merge into a single complex number happens in the final shuf, together with the data shuffle that is already there.
This trick was not possible in the SIMD64 version because it had no final shuf.
(The final shuf had to be replaced with perm; the two instructions are similar.)
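The arithmetic behind dropping conj_c can be checked with a scalar model. This is a sketch under two assumptions: lane 0 holds the real part, and the horizontal subtract computes lane 0 minus lane 1. The function name is hypothetical.

```c
#include <assert.h>

/* cy = c * y without building conj_c (lane 0 = real, lane 1 = imaginary). */
static void complex_mul_no_conj(const float c[2], const float y[2], float cy[2])
{
    assert(c && y && cy);
    const float swap_c[2] = { c[1], c[0] };                        /* shuf      */
    const float p_re[2] = { c[0] * y[0],      c[1] * y[1] };       /* fmul      */
    const float p_im[2] = { swap_c[0] * y[0], swap_c[1] * y[1] };  /* fmul      */

    cy[0] = p_re[0] - p_re[1];   /* horizontal subtract: cr*yr - ci*yi */
    cy[1] = p_im[0] + p_im[1];   /* horizontal add:      ci*yr + cr*yi */
}
```

The subtraction takes over the role of the sign flip, so the xor is no longer needed. The price is that the real and imaginary parts come out of two different operations and have to be merged afterwards.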
C code
void stage_radix2_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *xy1_in = (__v2di*)&data_in[2];
__v2di *c_in = (__v2di*)coef;
__v2di *out_add = (__v2di*)&data_out[0];
__v2di *out_sub = (__v2di*)&data_out[data_count/2];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
__v2di xy0 = xy0_in[2*i];
__v2di xy1 = xy1_in[2*i];
__v2di c = c_in[i];
__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy_real = __builtin_e2k_qpfmuls( c, y);
__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
__v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real);
__v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag);
__v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
out_add[i] = __builtin_e2k_qpfadds(x, cy);
out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
}
}
Main loop in assembly
.L3385:
{
fapb ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=5, abs=0, disp=0
}
.L3124:
{
loop_mode
qpfmul_hsubs,0,sm %b[25], %b[28], %r9, %b[16]
qpfmul_hadds,1,sm %b[27], %b[28], %r9, %b[1]
qpfsubs,2,sm %b[14], %b[42], %b[37]
qpshufb,3,sm %b[35], %b[36], %r6, %b[0]
qppermb,4,sm %b[11], %b[26], %r7, %b[38]
staaqp,5 %b[43], %aad1[ %aasti3 ]
incr,5 %aaincr0
movaqp,0 area=0, ind=0, am=0, be=0, %b[30]
movaqp,1 area=0, ind=16, am=1, be=0, %b[29]
movaqp,3 area=0, ind=0, am=1, be=0, %b[19]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpshufb,0,sm %b[33], %b[34], %r0, %b[26]
qpshufb,1,sm %b[23], %b[23], %r5, %b[25]
qpfadds,2,sm %b[14], %b[42], %b[11]
staaqp,5 %b[17], %aad2[ %aasti4 ]
incr,5 %aaincr0
}
After compilation we see that the loop consists of 10 instructions (the xor instruction is gone).
Theoretical speed: 4 complex numbers in 2 cycles (4/2) = 16 bytes/cycle
Speed measurements

The speed is unchanged compared to the previous variant.
Now we move on to unrolling.
Right now the loop body occupies 10/6 of a cycle. Unrolling by a factor of 2 gives 2 * 10/6 ≈ 3.33, rounded up to 4 cycles, which is nothing interesting (one loop iteration processes twice the data in twice the number of cycles).
So we go straight to unrolling by a factor of 3 (3 * 10/6 = 5 cycles).
8. stage_radix2_simd128_noConj_unroll3
Here the loop is unrolled by a factor of 3 using the unroll pragma.
C code
void stage_radix2_simd128_noConj_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *xy1_in = (__v2di*)&data_in[2];
__v2di *c_in = (__v2di*)coef;
__v2di *out_add = (__v2di*)&data_out[0];
__v2di *out_sub = (__v2di*)&data_out[data_count/2];
#pragma ivdep
#pragma unroll(3)
#pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
__v2di xy0 = xy0_in[2*i];
__v2di xy1 = xy1_in[2*i];
__v2di c = c_in[i];
__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy_real = __builtin_e2k_qpfmuls( c, y);
__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
__v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real);
__v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag);
__v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
out_add[i] = __builtin_e2k_qpfadds(x, cy);
out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
}
}
Main loop in assembly
.L3932:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=8, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=2, asz=4, abs=16, disp=32
}
.L3410:
{
loop_mode
qpfmul_hsubs,0,sm %b[19], %b[55], %r14, %b[0]
qpshufb,1,sm %b[26], %b[23], %r12, %b[15]
qpfmul_hsubs,2,sm %b[31], %b[63], %r14, %b[1]
qpshufb,3,sm %b[52], %b[53], %r0, %b[18]
qpfadds,4,sm %b[66], %b[69], %b[6]
qpfadds,5,sm %b[64], %b[42], %b[68]
}
{
loop_mode
qpfmul_hsubs,0,sm %b[34], %b[15], %r14, %b[52]
qpshufb,1,sm %b[34], %b[34], %r11, %b[61]
staaqp,2 %b[35], %aad1[ %aasti3 + _f32s,_lts0 0x10 ]
qpshufb,3,sm %b[44], %b[45], %r12, %b[53]
qpfsubs,4,sm %b[62], %b[40], %b[57]
qpfsubs,5,sm %b[18], %b[65], %b[58]
movaqp,0 area=0, ind=0, am=0, be=0, %b[43]
movaqp,1 area=0, ind=16, am=1, be=0, %b[42]
}
{
loop_mode
qpfmul_hadds,0,sm %b[61], %b[15], %r14, %b[34]
qpshufb,1,sm %b[31], %b[31], %r11, %b[66]
staaqp,2 %b[60], %aad1[ %aasti3 ]
qpshufb,3,sm %b[30], %b[27], %r0, %b[64]
qppermb,4,sm %b[38], %b[56], %r13, %b[67]
qpfadds,5,sm %b[18], %b[65], %b[35]
}
{
loop_mode
qpfmul_hadds,0,sm %b[66], %b[63], %r14, %b[18]
qpshufb,1,sm %b[19], %b[19], %r11, %b[56]
staaqp,2 %b[8], %aad2[ %aasti4 + _f32s,_lts0 0x10 ]
qppermb,3,sm %b[22], %b[5], %r13, %b[38]
qpfsubs,4,sm %b[64], %b[67], %b[31]
staaqp,5 %b[37], %aad2[ %aasti4 ]
movaqp,1 area=2, ind=0, am=1, be=0, %b[27]
movaqp,2 area=1, ind=16, am=1, be=0, %b[30]
movaqp,3 area=1, ind=0, am=0, be=0, %b[15]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfmul_hadds,0,sm %b[56], %b[55], %r14, %b[37]
qpshufb,1,sm %b[10], %b[7], %r12, %b[61]
staaqp,2 %b[59], %aad1[ %aasti3 + _f32s,_lts0 0x20 ]
incr,2 %aaincr3
qpshufb,3,sm %b[16], %b[13], %r0, %b[60]
qppermb,4,sm %b[41], %b[4], %r13, %b[63]
staaqp,5 %b[68], %aad2[ %aasti4 + _f32s,_lts0 0x20 ]
incr,5 %aaincr3
movaqp,0 area=1, ind=0, am=0, be=0, %b[5]
movaqp,1 area=1, ind=16, am=1, be=0, %b[8]
movaqp,2 area=0, ind=0, am=0, be=0, %b[19]
movaqp,3 area=0, ind=16, am=1, be=0, %b[22]
}
Theoretical speed: 12 complex numbers in 5 cycles (12/5) = 19.2 bytes/cycle
Speed measurements

We see a speedup in the middle of the plot, but a slowdown at the start and at the end.
Summary for stage_radix2


The FFT plot is here.
stage_radix2_2x
Diagram of the Stage algorithm for the "radix-2" 2x version.

One pass of stage_radix2_2x does the same work as 2 passes of stage_radix2. Therefore, for easier comparison with stage_radix2, we multiply the speed of stage_radix2_2x by 2 (this is noted on the plot axis and in the console output).
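The claim that one 2x pass equals two ordinary passes can be checked with a portable scalar model. The sketch below mirrors the index arithmetic of stage_radix2_etalon and stage_radix2_2x_etalon from the listings in this article, but the type `cplx` and the helper names are hypothetical stand-ins, not the author's definitions.

```c
#include <assert.h>

typedef struct { float re, im; } cplx;   /* hypothetical stand-in for myComplex */

static cplx c_add(cplx a, cplx b) { return (cplx){ a.re + b.re, a.im + b.im }; }
static cplx c_sub(cplx a, cplx b) { return (cplx){ a.re - b.re, a.im - b.im }; }
static cplx c_mul(cplx a, cplx b)
{
    return (cplx){ a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
}

/* One radix-2 pass: pairs (in[2i], in[2i+1]) go in, sums land in the
   first half of out and differences in the second half. */
static void stage_ref(int n, const cplx *in, cplx *out, const cplx *coef)
{
    assert(n >= 2 && n % 2 == 0);
    for (int i = 0; i < n / 2; ++i) {
        cplx cy = c_mul(coef[i], in[2 * i + 1]);
        out[i]         = c_add(in[2 * i], cy);
        out[n / 2 + i] = c_sub(in[2 * i], cy);
    }
}

/* Two fused passes: coef_a feeds the first butterflies, coef_b the second. */
static void stage_2x_ref(int n, const cplx *in, cplx *out,
                         const cplx *coef_a, const cplx *coef_b)
{
    assert(n >= 4 && n % 4 == 0);
    for (int i = 0; i < n / 4; ++i) {
        cplx cy0  = c_mul(coef_a[2 * i],     in[4 * i + 1]);
        cplx cy1  = c_mul(coef_a[2 * i + 1], in[4 * i + 3]);
        cplx add0 = c_add(in[4 * i],     cy0), sub0 = c_sub(in[4 * i],     cy0);
        cplx add1 = c_add(in[4 * i + 2], cy1), sub1 = c_sub(in[4 * i + 2], cy1);
        cplx t0   = c_mul(coef_b[i],         add1);
        cplx t1   = c_mul(coef_b[n / 4 + i], sub1);
        out[i]             = c_add(add0, t0);
        out[n / 4 + i]     = c_add(sub0, t1);
        out[2 * n / 4 + i] = c_sub(add0, t0);
        out[3 * n / 4 + i] = c_sub(sub0, t1);
    }
}
```

Running stage_ref twice (first with coef_a, then with coef_b on the intermediate buffer) and running stage_2x_ref once should give the same output array. The fused version avoids writing the intermediate buffer to memory and reading it back.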
1. stage_radix2_2x_etalon
Here the stage_radix2_etalon algorithm is manually unrolled by a factor of 2.
C code
void stage_radix2_2x_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
myComplex *x0_in = &data_in[0];
myComplex *y0_in = &data_in[1];
myComplex *x1_in = &data_in[2];
myComplex *y1_in = &data_in[3];
myComplex *c0a_in = &coef_a[0];
myComplex *c1a_in = &coef_a[1];
myComplex *c0b_in = &coef_b[0];
myComplex *c1b_in = &coef_b[data_count/4];
myComplex *out_add0 = &data_out[0*data_count/4];
myComplex *out_add1 = &data_out[1*data_count/4];
myComplex *out_sub0 = &data_out[2*data_count/4];
myComplex *out_sub1 = &data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
// #pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
myComplex x0 = x0_in[4*i];
myComplex y0 = y0_in[4*i];
myComplex c0 = c0a_in[2*i];
myComplex x1 = x1_in[4*i];
myComplex y1 = y1_in[4*i];
myComplex c1 = c1a_in[2*i];
myComplex cy0 = complex_mul(c0, y0);
myComplex cy1 = complex_mul(c1, y1);
myComplex add0 = complex_add(x0, cy0);
myComplex sub0 = complex_sub(x0, cy0);
myComplex add1 = complex_add(x1, cy1);
myComplex sub1 = complex_sub(x1, cy1);
x0 = add0;
y0 = add1;
c0 = c0b_in[i];
x1 = sub0;
y1 = sub1;
c1 = c1b_in[i];
cy0 = complex_mul(c0, y0);
cy1 = complex_mul(c1, y1);
out_add0[i] = complex_add(x0, cy0);
out_sub0[i] = complex_sub(x0, cy0);
out_add1[i] = complex_add(x1, cy1);
out_sub1[i] = complex_sub(x1, cy1);
}
}
Main loop in assembly
.L965:
{
fapb ct=0, dcd=0, fmt=3, mrng=16, d=0, incr=3, ind=2, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=4, asz=3, abs=8, disp=0
}
{
fapb ct=1, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=3, asz=4, abs=16, disp=0
}
.L259:
{
loop_mode
fmul_adds,0,sm %b[27], %b[84], %b[93], %b[45]
fsub_adds,1,sm %b[15], %b[79], %b[88], %b[0]
fsub_rsubs,2,sm %b[15], %b[79], %b[88], %b[1]
fmuls,3,sm %b[56], %b[74], %b[90]
fmuls,4,sm %b[45], %b[34], %b[89]
fmuls,5,sm %b[68], %b[65], %b[88]
}
{
loop_mode
fmul_rsubs,0,sm %b[24], %b[67], %b[95], %b[56]
fadd_adds,1,sm %b[15], %b[79], %b[47], %b[24]
fadd_rsubs,2,sm %b[15], %b[79], %b[47], %b[47]
fmul_rsubs,3,sm %b[73], %b[74], %b[94], %b[15]
fmuls,5,sm %b[63], %b[51], %b[91]
}
{
loop_mode
fmul_rsubs,0,sm %b[27], %b[53], %b[96], %b[58]
fsub_adds,1,sm %b[14], %b[83], %b[58], %b[27]
fsub_rsubs,2,sm %b[14], %b[83], %b[58], %b[53]
fmuls,4,sm %b[54], %b[50], %b[92]
fmuls,5,sm %b[68], %b[85], %b[93]
}
{
loop_mode
fadd_adds,0,sm %b[14], %b[83], %b[60], %b[67]
fadd_rsubs,1,sm %b[14], %b[83], %b[60], %b[68]
staaw,2 %b[2], %aad2[ %aasti6 + _f32s,_lts0 0x4 ]
fsubs,3,sm %b[48], %b[17], %b[63]
fmuls,4,sm %b[63], %b[82], %b[94]
staaw,5 %b[3], %aad1[ %aasti5 + _f32s,_lts0 0x4 ]
movaw,0 area=2, ind=0, am=0, be=0, %b[14]
movaw,1 area=2, ind=4, am=1, be=0, %b[60]
movaw,2 area=0, ind=0, am=0, be=0, %b[2]
movaw,3 area=0, ind=4, am=0, be=0, %b[3]
}
{
loop_mode
staaw,2 %b[26], %aad4[ %aasti8 + _f32s,_lts0 0x4 ]
fmul_adds,3,sm %b[73], %b[52], %b[90], %b[73]
fadds,4,sm %b[48], %b[17], %b[49]
staaw,5 %b[49], %aad3[ %aasti7 + _f32s,_lts0 0x4 ]
movaw,0 area=1, ind=0, am=0, be=0, %b[17]
movaw,1 area=0, ind=12, am=0, be=0, %b[52]
movaw,2 area=0, ind=8, am=0, be=0, %b[26]
movaw,3 area=0, ind=28, am=0, be=0, %b[48]
}
{
loop_mode
fmul_rsubs,1,sm %b[42], %b[34], %b[87], %b[79]
staaw,2 %b[29], %aad2[ %aasti6 ]
incr,2 %aaincr4
fsubs,4,sm %b[80], %b[75], %b[83]
staaw,5 %b[55], %aad1[ %aasti5 ]
incr,5 %aaincr4
movaw,0 area=1, ind=4, am=1, be=0, %b[55]
movaw,1 area=0, ind=0, am=0, be=0, %b[34]
movaw,2 area=0, ind=12, am=0, be=0, %b[29]
movaw,3 area=0, ind=20, am=0, be=0, %b[74]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
fmul_adds,0,sm %b[42], %b[37], %b[89], %b[75]
fmul_adds,1,sm %b[22], %b[85], %b[88], %b[84]
staaw,2 %b[69], %aad4[ %aasti8 ]
incr,2 %aaincr4
fmuls,3,sm %b[43], %b[35], %b[85]
fadds,4,sm %b[80], %b[75], %b[80]
staaw,5 %b[70], %aad3[ %aasti7 ]
incr,5 %aaincr4
movaw,0 area=0, ind=4, am=1, be=0, %b[37]
movaw,1 area=0, ind=8, am=0, be=0, %b[69]
movaw,2 area=0, ind=16, am=1, be=0, %b[42]
movaw,3 area=0, ind=24, am=0, be=0, %b[70]
}
Theoretical speed: 4 complex numbers in 7 cycles (4/7) = 4.57 bytes/cycle
Doubled theoretical speed: 9.14 bytes/cycle
Speed measurements

2. stage_radix2_2x_etalon_unroll2
Here the loop is unrolled by a factor of 2 using the unroll pragma.
C code
void stage_radix2_2x_etalon_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
myComplex *x0_in = &data_in[0];
myComplex *y0_in = &data_in[1];
myComplex *x1_in = &data_in[2];
myComplex *y1_in = &data_in[3];
myComplex *c0a_in = &coef_a[0];
myComplex *c1a_in = &coef_a[1];
myComplex *c0b_in = &coef_b[0];
myComplex *c1b_in = &coef_b[data_count/4];
myComplex *out_add0 = &data_out[0*data_count/4];
myComplex *out_add1 = &data_out[1*data_count/4];
myComplex *out_sub0 = &data_out[2*data_count/4];
myComplex *out_sub1 = &data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(2)
// #pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
myComplex x0 = x0_in[4*i];
myComplex y0 = y0_in[4*i];
myComplex c0 = c0a_in[2*i];
myComplex x1 = x1_in[4*i];
myComplex y1 = y1_in[4*i];
myComplex c1 = c1a_in[2*i];
myComplex cy0 = complex_mul(c0, y0);
myComplex cy1 = complex_mul(c1, y1);
myComplex add0 = complex_add(x0, cy0);
myComplex sub0 = complex_sub(x0, cy0);
myComplex add1 = complex_add(x1, cy1);
myComplex sub1 = complex_sub(x1, cy1);
x0 = add0;
y0 = add1;
c0 = c0b_in[i];
x1 = sub0;
y1 = sub1;
c1 = c1b_in[i];
cy0 = complex_mul(c0, y0);
cy1 = complex_mul(c1, y1);
out_add0[i] = complex_add(x0, cy0);
out_sub0[i] = complex_sub(x0, cy0);
out_add1[i] = complex_add(x1, cy1);
out_sub1[i] = complex_sub(x1, cy1);
}
}
Main loop in assembly
.L2305:
{
fapb ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=16, d=0, incr=2, ind=2, asz=3, abs=0, disp=16
}
{
fapb ct=0, dcd=0, fmt=3, mrng=16, d=0, incr=2, ind=2, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=3, abs=8, disp=32
}
{
fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=3, ind=3, asz=4, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=3, ind=4, asz=4, abs=16, disp=0
}
.L1020:
{
loop_mode
pfmul_rsubs,0,sm %b[71], %b[17], %b[75], %b[1]
insfd,1,sm %b[10], %r6, %b[11], %b[87]
pfmul_rsubs,2,sm %b[101], %b[47], %b[96], %b[0]
insfd,3,sm %b[92], %r6, %b[98], %b[86]
insfd,4,sm %b[86], %r6, %b[87], %b[10]
pfsubs,5,sm %b[29], %b[40], %b[11]
}
{
loop_mode
pfmul_adds,0,sm %b[101], %b[13], %b[80], %b[17]
insfd,1,sm %b[63], %r6, %b[38], %b[12]
pfmul_rsubs,2,sm %b[87], %b[93], %b[85], %b[47]
pfmul_adds,3,sm %b[90], %b[12], %b[97], %b[38]
insfd,4,sm %b[76], %r6, %b[52], %b[13]
}
{
loop_mode
pfadd_adds,0,sm %b[18], %b[3], %b[49], %b[49]
insfd,1,sm %b[82], %r6, %b[95], %b[29]
pfadd_rsubs,2,sm %b[18], %b[3], %b[49], %b[40]
pfadds,3,sm %b[29], %b[40], %b[52]
pshufb,4,sm %b[43], %b[57], %r0, %b[80]
pfmuls,5,sm %b[91], %b[93], %b[76]
}
{
loop_mode
pfsub_rsubs,0,sm %b[18], %b[3], %b[2], %b[55]
insfd,1,sm %b[94], %r6, %b[81], %b[82]
pfsub_rsubs,2,sm %b[37], %b[28], %b[19], %b[41]
pshufb,3,sm %b[24], %b[25], %r0, %b[85]
pshufb,4,sm %b[34], %b[48], %r0, %b[90]
pfmuls,5,sm %b[41], %b[15], %b[81]
}
{
loop_mode
pfsub_adds,0,sm %b[18], %b[3], %b[2], %b[46]
pfmuls,1,sm %b[82], %b[10], %b[92]
pfsub_adds,2,sm %b[37], %b[28], %b[19], %b[32]
pfadds,3,sm %b[32], %b[46], %b[91]
pshufb,4,sm %b[64], %b[42], %r0, %b[93]
movad,0 area=2, ind=0, am=0, be=0, %b[19]
movad,1 area=2, ind=8, am=1, be=0, %b[18]
movad,2 area=2, ind=0, am=0, be=0, %b[3]
movad,3 area=2, ind=8, am=1, be=0, %b[2]
}
{
loop_mode
pfadd_rsubs,0,sm %b[37], %b[28], %b[84], %b[62]
insfd,1,sm %b[89], %r6, %b[100], %b[56]
staad,2 %b[80], %aad1[ %aasti5 + _f32s,_lts0 0x8 ]
pshufb,3,sm %b[8], %b[9], %r0, %b[89]
pshufb,4,sm %b[70], %b[51], %r0, %b[95]
pfmuls,5,sm %b[85], %b[11], %b[94]
movaw,0 area=1, ind=0, am=0, be=0, %b[63]
movaw,1 area=1, ind=4, am=0, be=0, %b[66]
movaw,2 area=1, ind=4, am=0, be=0, %b[80]
movaw,3 area=1, ind=0, am=0, be=0, %b[59]
}
{
loop_mode
pfadd_adds,0,sm %b[37], %b[28], %b[84], %b[68]
insfd,1,sm %b[78], %r6, %b[99], %b[28]
staad,2 %b[90], %aad2[ %aasti6 + _f32s,_lts0 0x8 ]
pfmuls,3,sm %b[85], %b[45], %b[78]
insfd,4,sm %b[83], %r6, %b[68], %b[37]
pfmuls,5,sm %b[89], %b[52], %b[83]
movaw,0 area=0, ind=24, am=0, be=0, %b[96]
movaw,1 area=0, ind=28, am=0, be=0, %b[85]
movaw,2 area=1, ind=24, am=0, be=0, %b[90]
movaw,3 area=1, ind=28, am=0, be=0, %b[84]
}
{
loop_mode
insfd,1,sm %b[79], %r6, %g16, %b[88]
staad,2 %b[93], %aad3[ %aasti7 + _f32s,_lts0 0x8 ]
insfd,4,sm %b[88], %r6, %b[65], %b[65]
pfmuls,5,sm %b[39], %b[58], %b[71]
movaw,0 area=1, ind=8, am=0, be=0, %g16
movaw,1 area=1, ind=12, am=1, be=0, %b[79]
movaw,2 area=1, ind=8, am=0, be=0, %b[72]
movaw,3 area=1, ind=20, am=0, be=0, %b[75]
}
{
loop_mode
pfmul_adds,0,sm %b[87], %b[54], %b[76], %b[82]
insfd,1,sm %b[43], %r6, %b[57], %b[98]
staad,2 %b[95], %aad4[ %aasti8 + _f32s,_lts0 0x8 ]
pfmuls,3,sm %b[82], %b[86], %b[95]
insfd,4,sm %b[34], %r6, %b[48], %b[97]
pfsubs,5,sm %b[30], %b[44], %b[43]
movaw,0 area=0, ind=4, am=0, be=0, %b[93]
movaw,1 area=0, ind=0, am=0, be=0, %b[34]
movaw,2 area=1, ind=16, am=0, be=0, %b[76]
movaw,3 area=1, ind=12, am=1, be=0, %b[87]
}
{
loop_mode
pfmul_rsubs,0,sm %b[88], %b[86], %b[92], %b[42]
insfd,1,sm %b[70], %r6, %b[51], %b[97]
staad,2 %b[98], %aad1[ %aasti5 ]
incr,2 %aaincr4
insfd,4,sm %b[64], %r6, %b[42], %b[98]
staad,5 %b[97], %aad2[ %aasti6 ]
incr,5 %aaincr4
movaw,0 area=0, ind=8, am=0, be=0, %b[48]
movaw,1 area=0, ind=20, am=0, be=0, %b[51]
movaw,2 area=0, ind=0, am=0, be=0, %b[86]
movaw,3 area=0, ind=12, am=0, be=0, %b[92]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmul_adds,0,sm %b[69], %b[60], %b[81], %b[24]
insfd,1,sm %b[24], %r6, %b[25], %b[99]
staad,2 %b[97], %aad4[ %aasti8 ]
incr,2 %aaincr4
insfd,4,sm %b[77], %r6, %b[53], %b[25]
staad,5 %b[98], %aad3[ %aasti7 ]
incr,5 %aaincr4
movaw,0 area=0, ind=16, am=0, be=0, %b[97]
movaw,1 area=0, ind=12, am=1, be=0, %b[98]
movaw,2 area=0, ind=4, am=1, be=0, %b[81]
movaw,3 area=0, ind=8, am=0, be=0, %b[77]
}
As with stage_radix2_etalon_unroll2, the compiler has inserted vector instructions.
Theoretical speed: 8 complex numbers in 11 cycles (8/11) = 5.82 bytes/cycle
Doubled theoretical speed: 11.64 bytes/cycle
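The speed figures in this article follow from the size of one complex number: two 32-bit floats, that is 8 bytes. The arithmetic can be sketched as follows; `theoretical_speed` and `print_speed_example` are hypothetical helper names introduced only for illustration and are not part of the original code.

```c
#include <assert.h>
#include <stdio.h>

/* Bytes of complex data processed per cycle, assuming each complex
   number is a pair of 32-bit floats (8 bytes in total). */
double theoretical_speed(int complex_per_iteration, int cycles_per_iteration)
{
    const double bytes_per_complex = 8.0;
    assert(complex_per_iteration > 0 && cycles_per_iteration > 0);
    return complex_per_iteration * bytes_per_complex / cycles_per_iteration;
}

/* Prints the plain and the doubled figure for a loop body that handles
   8 complex numbers in 11 cycles. The doubling applies only to the 2x
   variants, where one pass replaces two stage_radix2 passes. */
void print_speed_example(void)
{
    double single = theoretical_speed(8, 11);
    printf("%.2f bytes/cycle, doubled: %.2f bytes/cycle\n",
           single, 2.0 * single);
    /* prints "5.82 bytes/cycle, doubled: 11.64 bytes/cycle" */
}
```

This is only the upper bound implied by the schedule of the loop body; it says nothing about memory stalls, which is why the measured speeds are reported separately.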
Speed measurements

3. stage_radix2_2x_simd64
Here the stage_radix2_simd64 algorithm is manually unrolled by a factor of 2.
C code
void stage_radix2_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
uint64_t *x0_in = (uint64_t*)&data_in[0];
uint64_t *y0_in = (uint64_t*)&data_in[1];
uint64_t *x1_in = (uint64_t*)&data_in[2];
uint64_t *y1_in = (uint64_t*)&data_in[3];
uint64_t *c0a_in = (uint64_t*)&coef_a[0];
uint64_t *c1a_in = (uint64_t*)&coef_a[1];
uint64_t *c0b_in = (uint64_t*)&coef_b[0];
uint64_t *c1b_in = (uint64_t*)&coef_b[data_count/4];
uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
uint64_t x0 = x0_in[4*i];
uint64_t y0 = y0_in[4*i];
uint64_t c0 = c0a_in[2*i];
uint64_t x1 = x1_in[4*i];
uint64_t y1 = y1_in[4*i];
uint64_t c1 = c1a_in[2*i];
uint64_t conj_c0 = __builtin_e2k_pxord(c0, 1LL<<63);
uint64_t conj_c1 = __builtin_e2k_pxord(c1, 1LL<<63);
uint64_t swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504);
uint64_t swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504);
uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);
x0 = add0;
y0 = add1;
c0 = c0b_in[i];
x1 = sub0;
y1 = sub1;
c1 = c1b_in[i];
conj_c0 = __builtin_e2k_pxord(c0, 1LL<<63);
conj_c1 = __builtin_e2k_pxord(c1, 1LL<<63);
swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504);
swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504);
cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
}
}
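To make the index pattern of the function above easier to follow, here is a scalar sketch of the same data flow in plain C99, without intrinsics. It assumes that `myComplex` is layout-compatible with `float complex` and that a single stage has the form `out[i] = in[2i] + c[i]*in[2i+1]`, `out[n/2+i] = in[2i] - c[i]*in[2i+1]`. The names `stage_radix2_ref` and `stage_radix2_2x_ref` are hypothetical; this is a reading aid, not the author's reference implementation.

```c
#include <assert.h>
#include <complex.h>

/* One radix-2 stage, scalar version. */
void stage_radix2_ref(int n, const float complex *in, float complex *out,
                      const float complex *coef)
{
    assert(n % 2 == 0);
    for (int i = 0; i < n / 2; ++i) {
        float complex cy = coef[i] * in[2 * i + 1];
        out[i]         = in[2 * i] + cy;
        out[n / 2 + i] = in[2 * i] - cy;
    }
}

/* Two radix-2 stages fused into one pass. The intermediate results
   (add0, add1, sub0, sub1) stay in local variables instead of being
   written to memory and read back by a second pass. */
void stage_radix2_2x_ref(int n, const float complex *in, float complex *out,
                         const float complex *coef_a,
                         const float complex *coef_b)
{
    assert(n % 4 == 0);
    int q = n / 4;
    for (int i = 0; i < q; ++i) {
        /* First stage: two neighbouring butterflies. */
        float complex cy0  = coef_a[2 * i]     * in[4 * i + 1];
        float complex cy1  = coef_a[2 * i + 1] * in[4 * i + 3];
        float complex add0 = in[4 * i]     + cy0;
        float complex sub0 = in[4 * i]     - cy0;
        float complex add1 = in[4 * i + 2] + cy1;
        float complex sub1 = in[4 * i + 2] - cy1;

        /* Second stage: pairs (add0, add1) and (sub0, sub1). */
        float complex t0 = coef_b[i]     * add1;
        float complex t1 = coef_b[q + i] * sub1;
        out[0 * q + i] = add0 + t0;   /* out_add0 */
        out[2 * q + i] = add0 - t0;   /* out_sub0 */
        out[1 * q + i] = sub0 + t1;   /* out_add1 */
        out[3 * q + i] = sub0 - t1;   /* out_sub1 */
    }
}
```

Under these assumptions, calling `stage_radix2_ref` twice (first with `coef_a`, then with `coef_b` on the intermediate buffer) produces the same output as a single call to `stage_radix2_2x_ref`, which is why the speed of the 2x variants is doubled for comparison.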
Main loop in assembly
.L2998:
{
fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=3, abs=8, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=16, disp=0
}
.L2607:
{
loop_mode
pfmul_hadds,0,sm %b[43], %b[20], %b[37], %b[24]
pfmul_hadds,1,sm %b[41], %b[5], %b[39], %b[28]
pfmuls,2,sm %b[75], %b[18], %b[35]
xord,3,sm %b[44], %r0, %b[84]
xord,4,sm %b[2], %r0, %b[81]
pfsubs,5,sm %b[78], %b[49], %b[1]
movad,1 area=0, ind=8, am=0, be=0, %b[0]
}
{
loop_mode
pfmul_hadds,1,sm %b[62], %b[9], %b[60], %b[20]
pfmuls,2,sm %b[83], %b[3], %b[37]
pshufb,3,sm 0x0, %b[71], %r6, %b[41]
pshufb,4,sm 0x0, %b[58], %r6, %b[39]
pfadds,5,sm %b[78], %b[49], %b[5]
}
{
loop_mode
pfmul_hadds,1,sm %b[73], %b[15], %b[53], %b[43]
pfmuls,2,sm %b[84], %b[7], %b[58]
pshufb,4,sm 0x0, %b[44], %r6, %b[60]
pfmuls,5,sm %b[81], %b[11], %b[49]
movad,3 area=0, ind=24, am=0, be=0, %b[9]
}
{
loop_mode
pfsub_adds,0,sm %b[33], %b[26], %b[30], %b[62]
pfsub_rsubs,1,sm %b[33], %b[26], %b[30], %b[53]
staad,2 %b[66], %aad2[ %aasti6 ]
incr,2 %aaincr0
xord,3,sm %b[69], %r0, %b[73]
pshufb,4,sm 0x0, %b[4], %r6, %b[71]
staad,5 %b[57], %aad1[ %aasti5 ]
incr,5 %aaincr0
movad,1 area=2, ind=0, am=1, be=0, %b[44]
movad,3 area=0, ind=0, am=0, be=0, %b[15]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfadd_adds,0,sm %b[33], %b[26], %b[22], %b[78]
pfadd_rsubs,1,sm %b[33], %b[26], %b[22], %b[75]
staad,2 %b[82], %aad4[ %aasti8 ]
incr,2 %aaincr0
xord,3,sm %b[56], %r0, %b[81]
staad,5 %b[79], %aad3[ %aasti7 ]
incr,5 %aaincr0
movad,0 area=1, ind=0, am=1, be=0, %b[30]
movad,1 area=0, ind=0, am=1, be=0, %b[57]
movad,2 area=0, ind=8, am=1, be=0, %b[4]
movad,3 area=0, ind=16, am=0, be=0, %b[66]
}
Theoretical speed: 4 complex numbers in 5 cycles (4/5) = 6.4 bytes/cycle
Doubled theoretical speed: 12.8 bytes/cycle
Speed measurements

4. stage_radix2_2x_simd128
Here the stage_radix2_simd128 algorithm is manually unrolled by a factor of 2.
C code
void stage_radix2_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *xy1_in = (__v2di*)&data_in[2];
__v2di *xy2_in = (__v2di*)&data_in[4];
__v2di *xy3_in = (__v2di*)&data_in[6];
__v2di *c0a_in = (__v2di*)&coef_a[0];
__v2di *c1a_in = (__v2di*)&coef_a[2];
__v2di *c0b_in = (__v2di*)&coef_b[0];
__v2di *c1b_in = (__v2di*)&coef_b[data_count/4];
__v2di *out_add0 = (__v2di*)&data_out[0*data_count/4];
__v2di *out_add1 = (__v2di*)&data_out[1*data_count/4];
__v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4];
__v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/8; ++i)
{
__v2di xy0 = xy0_in[4*i];
__v2di xy1 = xy1_in[4*i];
__v2di c0 = c0a_in[2*i];
__v2di xy2 = xy2_in[4*i];
__v2di xy3 = xy3_in[4*i];
__v2di c1 = c1a_in[2*i];
__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di add0 = __builtin_e2k_qpfadds(x0, cy0);
__v2di sub0 = __builtin_e2k_qpfsubs(x0, cy0);
__v2di add1 = __builtin_e2k_qpfadds(x1, cy1);
__v2di sub1 = __builtin_e2k_qpfsubs(x1, cy1);
xy0 = add0;
xy1 = add1;
c0 = c0b_in[i];
xy2 = sub0;
xy3 = sub1;
c1 = c1b_in[i];
x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
out_add0[i] = __builtin_e2k_qpfadds(x0, cy0);
out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0);
out_add1[i] = __builtin_e2k_qpfadds(x1, cy1);
out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1);
}
}
Main loop in assembly
.L3790:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=32
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0
}
.L3059:
{
loop_mode
qpfmul_hadds,0,sm %b[56], %b[54], %b[70], %b[1]
qpshufb,1,sm %b[9], %b[32], %r15, %b[52]
qpfadds,2,sm %b[88], %b[87], %b[0]
qpshufb,3,sm %b[33], %b[33], %r0, %b[94]
qpshufb,4,sm %b[6], %b[30], %r16, %b[95]
qpfadds,5,sm %b[61], %b[71], %b[92]
}
{
loop_mode
qpfmul_hadds,0,sm %b[74], %b[69], %b[7], %b[33]
qpshufb,1,sm %b[29], %b[29], %r14, %b[56]
qpfsubs,2,sm %b[55], %b[90], %b[30]
qpxor,3,sm %b[60], %r13, %b[54]
qpxor,4,sm %b[27], %r13, %b[57]
qpfsubs,5,sm %b[95], %b[94], %b[96]
movaqp,3 area=1, ind=0, am=0, be=0, %b[6]
}
{
loop_mode
qpfmul_hadds,0,sm %b[56], %b[73], %b[93], %b[29]
qpshufb,1,sm %b[13], %b[36], %r16, %b[59]
qpfsubs,2,sm %b[86], %b[85], %b[7]
qpshufb,4,sm %b[62], %b[62], %r14, %b[61]
qpfadds,5,sm %b[95], %b[94], %b[97]
}
{
loop_mode
qpshufb,1,sm %b[3], %b[3], %r0, %b[69]
qpfmuls,2,sm %b[53], %b[52], %b[68]
qpshufb,3,sm %b[14], %b[43], %r15, %b[62]
qpshufb,4,sm %b[75], %b[76], %r15, %b[63]
staaqp,5 %b[72], %aad1[ %aasti5 ]
incr,5 %aaincr0
movaqp,0 area=2, ind=0, am=1, be=0, %b[36]
movaqp,1 area=1, ind=0, am=1, be=0, %b[13]
movaqp,3 area=1, ind=16, am=1, be=0, %b[56]
}
{
loop_mode
qpfmuls,0,sm %b[41], %b[65], %b[3]
qpshufb,1,sm %b[0], %b[24], %r15, %b[71]
qpfsubs,2,sm %b[59], %b[69], %b[70]
qpshufb,3,sm %b[12], %b[12], %r14, %b[72]
qpshufb,4,sm %b[83], %b[84], %r16, %b[53]
staaqp,5 %b[92], %aad2[ %aasti6 ]
incr,5 %aaincr0
}
{
loop_mode
qpfmuls,0,sm %b[54], %b[64], %b[87]
qpshufb,1,sm %b[35], %b[35], %r0, %b[88]
qpfmuls,2,sm %b[57], %b[71], %b[91]
qpshufb,3,sm %b[22], %b[51], %r16, %b[84]
qpshufb,4,sm %b[39], %b[39], %r0, %b[83]
staaqp,5 %b[96], %aad3[ %aasti7 ]
incr,5 %aaincr0
movaqp,0 area=0, ind=0, am=0, be=0, %b[41]
movaqp,1 area=0, ind=16, am=1, be=0, %b[12]
movaqp,2 area=0, ind=0, am=0, be=0, %b[74]
movaqp,3 area=0, ind=16, am=1, be=0, %b[73]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfmul_hadds,0,sm %b[61], %b[66], %b[89], %b[35]
qpshufb,1,sm %b[50], %b[50], %r14, %b[54]
qpfadds,2,sm %b[55], %b[90], %b[22]
qpxor,3,sm %b[8], %r13, %b[39]
qpxor,4,sm %b[48], %r13, %b[51]
staaqp,5 %b[97], %aad4[ %aasti8 ]
incr,5 %aaincr0
}
Theoretical speed: 8 complex numbers in 7 cycles (8/7) = 9.14 bytes/cycle
Doubled theoretical speed: 18.29 bytes/cycle
Speed measurements

5. stage_radix2_2x_simd128_noConj
Here the stage_radix2_simd128_noConj algorithm is manually unrolled by a factor of 2.
C code
void stage_radix2_2x_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coef_a, myComplex *coef_b)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *xy1_in = (__v2di*)&data_in[2];
__v2di *xy2_in = (__v2di*)&data_in[4];
__v2di *xy3_in = (__v2di*)&data_in[6];
__v2di *c0a_in = (__v2di*)&coef_a[0];
__v2di *c1a_in = (__v2di*)&coef_a[2];
__v2di *c0b_in = (__v2di*)&coef_b[0];
__v2di *c1b_in = (__v2di*)&coef_b[data_count/4];
__v2di *out_add0 = (__v2di*)&data_out[0*data_count/4];
__v2di *out_add1 = (__v2di*)&data_out[1*data_count/4];
__v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4];
__v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/8; ++i)
{
__v2di xy0 = xy0_in[4*i];
__v2di xy1 = xy1_in[4*i];
__v2di c0 = c0a_in[2*i];
__v2di xy2 = xy2_in[4*i];
__v2di xy3 = xy3_in[4*i];
__v2di c1 = c1a_in[2*i];
__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy0_real = __builtin_e2k_qpfmuls( c0, y0);
__v2di cy1_real = __builtin_e2k_qpfmuls( c1, y1);
__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
__v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
__v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
__v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
__v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
__v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di add0 = __builtin_e2k_qpfadds(x0, cy0);
__v2di sub0 = __builtin_e2k_qpfsubs(x0, cy0);
__v2di add1 = __builtin_e2k_qpfadds(x1, cy1);
__v2di sub1 = __builtin_e2k_qpfsubs(x1, cy1);
xy0 = add0;
xy1 = add1;
c0 = c0b_in[i];
xy2 = sub0;
xy3 = sub1;
c1 = c1b_in[i];
x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
cy0_real = __builtin_e2k_qpfmuls( c0, y0);
cy1_real = __builtin_e2k_qpfmuls( c1, y1);
cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
out_add0[i] = __builtin_e2k_qpfadds(x0, cy0);
out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0);
out_add1[i] = __builtin_e2k_qpfadds(x1, cy1);
out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1);
}
}
Main loop in assembly
.L4577:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=32
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0
}
.L3851:
{
loop_mode
qpfmul_hsubs,0,sm %b[53], %b[54], %r16, %b[5]
qpshufb,1,sm %b[39], %b[39], %r15, %b[25]
qpfmul_hadds,2,sm %b[92], %b[54], %r16, %b[1]
qpshufb,3,sm %b[16], %b[8], %r14, %b[24]
qpfsubs,4,sm %b[66], %b[22], %b[9]
qpfsubs,5,sm %b[89], %b[69], %b[0]
}
{
loop_mode
qpfmul_hsubs,0,sm %b[39], %b[71], %r16, %b[58]
qpshufb,1,sm %b[72], %b[75], %r0, %b[64]
qpfmul_hadds,2,sm %b[25], %b[71], %r16, %b[54]
qpshufb,3,sm %b[20], %b[20], %r15, %b[62]
qpfsubs,4,sm %b[86], %b[59], %b[8]
qpfadds,5,sm %b[24], %b[48], %b[61]
movaqp,2 area=1, ind=0, am=0, be=0, %b[53]
movaqp,3 area=1, ind=16, am=1, be=0, %b[16]
}
{
loop_mode
qpfmul_hsubs,0,sm %b[57], %b[64], %r16, %b[78]
qpshufb,1,sm %b[57], %b[57], %r15, %b[88]
staaqp,2 %b[87], %aad1[ %aasti5 ]
incr,2 %aaincr0
qpshufb,3,sm %b[26], %b[34], %r0, %b[81]
qpshufb,4,sm %b[32], %b[40], %r14, %b[84]
qpfsubs,5,sm %b[24], %b[48], %b[85]
movaqp,0 area=2, ind=0, am=1, be=0, %b[39]
movaqp,1 area=1, ind=0, am=1, be=0, %b[25]
movaqp,2 area=0, ind=0, am=0, be=0, %b[71]
movaqp,3 area=0, ind=16, am=1, be=0, %b[68]
}
{
loop_mode
qpfmul_hadds,0,sm %b[88], %b[64], %r16, %b[48]
qpshufb,1,sm %b[51], %b[51], %r15, %b[90]
staaqp,2 %b[63], %aad2[ %aasti6 ]
incr,2 %aaincr0
qpshufb,3,sm %b[76], %b[79], %r14, %b[87]
qppermb,4,sm %b[15], %b[67], %r13, %b[57]
qpfadds,5,sm %b[89], %b[69], %b[40]
movaqp,0 area=0, ind=0, am=0, be=0, %b[32]
movaqp,1 area=0, ind=16, am=1, be=0, %b[24]
}
{
loop_mode
qpfmul_hsubs,0,sm %b[20], %b[83], %r16, %b[63]
qpshufb,1,sm %b[17], %b[42], %r0, %b[69]
staaqp,2 %b[11], %aad3[ %aasti7 ]
incr,2 %aaincr0
qppermb,3,sm %b[52], %b[82], %r13, %b[67]
qpshufb,4,sm %b[21], %b[46], %r14, %b[64]
qpfadds,5,sm %b[86], %b[59], %b[15]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfmul_hadds,0,sm %b[62], %b[83], %r16, %b[11]
qpshufb,1,sm %b[10], %b[2], %r0, %b[52]
staaqp,2 %b[23], %aad4[ %aasti8 ]
incr,2 %aaincr0
qppermb,3,sm %b[3], %b[7], %r13, %b[46]
qppermb,4,sm %b[56], %b[60], %r13, %b[20]
qpfadds,5,sm %b[66], %b[22], %b[21]
}
Theoretical speed: 8 complex numbers in 6 cycles (8/6) = 10.67 bytes/cycle
Doubled theoretical speed: 21.33 bytes/cycle
Speed measurements

Summary for stage_radix2_2x


The speeds have increased compared with the original stage_radix2 versions.
The FFT plot is here.
stage_radix2_readConjSwap
Let us return to the stage_radix2 algorithms. Note that conj_c and swap_c are derived directly from c, which is read from memory and is not used anywhere else.
Optimization: instead of computing conj_c and swap_c, read them directly from memory; c itself no longer needs to be read. This removes two instructions from the loop: xor and shuf.
Let's see what we get.
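The two tables can be prepared once, outside the hot loop. Below is a minimal scalar sketch of that preparation and of the multiplication it enables. It assumes that `myComplex` stores the real part first and the imaginary part second (so the mask `1LL<<63` flips the sign of the imaginary part and the shuffle swaps the two halves); the type `cplx` and the function names are hypothetical and serve only as an illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout of one complex number: real part, then imaginary part. */
typedef struct { float re, im; } cplx;

/* Build the tables that replace the per-iteration xor and shuffle:
   conj_coef[i] = (re, -im), swap_coef[i] = (im, re). */
void make_conj_swap_tables(size_t count, const cplx *coef,
                           cplx *conj_coef, cplx *swap_coef)
{
    assert(coef != NULL && conj_coef != NULL && swap_coef != NULL);
    for (size_t i = 0; i < count; ++i) {
        conj_coef[i].re =  coef[i].re;
        conj_coef[i].im = -coef[i].im;
        swap_coef[i].re =  coef[i].im;
        swap_coef[i].im =  coef[i].re;
    }
}

/* Scalar model of the sequence "multiply, multiply, horizontal add":
   two element-wise products followed by a sum of the halves of each
   product give the complex product c*y. */
cplx mul_with_tables(cplx conj_c, cplx swap_c, cplx y)
{
    cplx r;
    r.re = conj_c.re * y.re + conj_c.im * y.im;  /* c.re*y.re - c.im*y.im */
    r.im = swap_c.re * y.re + swap_c.im * y.im;  /* c.im*y.re + c.re*y.im */
    return r;
}
```

The price of this optimization is memory: two coefficient tables are read instead of one, so the loop trades two ALU instructions for one extra load per iteration.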
1. stage_radix2_readConjSwap_simd64
A development of stage_radix2_simd64: computing conj and swap is replaced by reading them from memory.
C code
void stage_radix2_readConjSwap_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef, myComplex *swap_coef)
{
uint64_t *x_in = (uint64_t*)&data_in[0];
uint64_t *y_in = (uint64_t*)&data_in[1];
uint64_t *conj_c_in = (uint64_t*)conj_coef;
uint64_t *swap_c_in = (uint64_t*)swap_coef;
uint64_t *out_add = (uint64_t*)&data_out[0];
uint64_t *out_sub = (uint64_t*)&data_out[data_count/2];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/2; ++i)
{
uint64_t x = x_in[2*i];
uint64_t y = y_in[2*i];
uint64_t conj_c = conj_c_in[i];
uint64_t swap_c = swap_c_in[i];
uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);
out_add[i] = __builtin_e2k_pfadds(x, cy);
out_sub[i] = __builtin_e2k_pfsubs(x, cy);
}
}
Main loop in assembly
.L326:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
}
.L125:
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmuls,0,sm %b[67], %b[33], %b[45]
pfsubs,1,sm %b[42], %b[62], %b[62]
staad,2 %b[70], %aad1[ %aasti4 ]
incr,2 %aaincr0
pfmul_hadds,3,sm %b[23], %b[45], %b[57], %b[42]
pfadds,4,sm %b[42], %b[62], %b[67]
staad,5 %b[75], %aad2[ %aasti5 ]
incr,5 %aaincr0
movad,0 area=0, ind=0, am=1, be=0, %b[57]
movad,1 area=1, ind=0, am=1, be=0, %b[1]
movad,2 area=0, ind=8, am=1, be=0, %b[23]
movad,3 area=0, ind=0, am=0, be=0, %b[0]
}
Previously there were 8 instructions in the loop; now there are 6.
6 instructions fit perfectly into 1 cycle.
Theoretical speed: 2 complex numbers in 1 cycle (2/1) = 16 bytes/cycle
Speed measurements

2. stage_radix2_readConjSwap_simd128
A development of stage_radix2_simd128: computing conj and swap is replaced by reading them from memory.
Developing stage_radix2_simd128_noConj in the same way leads to this same function.
C code
void stage_radix2_readConjSwap_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef, myComplex *swap_coef)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *xy1_in = (__v2di*)&data_in[2];
__v2di *conj_c_in = (__v2di*)conj_coef;
__v2di *swap_c_in = (__v2di*)swap_coef;
__v2di *out_add = (__v2di*)&data_out[0];
__v2di *out_sub = (__v2di*)&data_out[data_count/2];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
__v2di xy0 = xy0_in[2*i];
__v2di xy1 = xy1_in[2*i];
__v2di conj_c = conj_c_in[i];
__v2di swap_c = swap_c_in[i];
__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
__v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
__v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
out_add[i] = __builtin_e2k_qpfadds(x, cy);
out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
}
}
Main loop in assembly
.L599:
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
}
.L353:
{
loop_mode
qpshufb,1,sm %b[31], %b[40], %r0, %b[16]
qpshufb,3,sm %b[33], %b[42], %r7, %b[0]
qpfsubs,4,sm %b[14], %b[44], %b[21]
staaqp,5 %b[25], %aad1[ %aasti4 ]
incr,5 %aaincr0
movaqp,0 area=0, ind=0, am=1, be=0, %b[13]
movaqp,1 area=1, ind=0, am=1, be=0, %b[1]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfmuls,0,sm %b[19], %b[16], %b[33]
qpfmul_hadds,2,sm %b[11], %b[20], %b[37], %b[22]
qpshufb,3,sm %b[32], %b[32], %r6, %b[42]
qpfadds,4,sm %b[14], %b[44], %b[39]
staaqp,5 %b[43], %aad2[ %aasti5 ]
incr,5 %aaincr0
movaqp,2 area=0, ind=0, am=0, be=0, %b[34]
movaqp,3 area=0, ind=16, am=1, be=0, %b[25]
}
Previously there were 11 instructions in the loop; now there are 9.
Theoretical speed: 4 complex numbers in 2 cycles (4/2) = 16 bytes/cycle
Speed measurements

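The 9 instructions currently need 9/6 of a cycle (6 instructions fit into one cycle), which rounds up to 2 cycles. Unrolling by a factor of 2 gives 2 * 9/6 = 3 cycles.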
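The estimate above can be written as a ceiling division. The sketch below assumes that the only limit is the 6 arithmetic slots per cycle mentioned in the text; it ignores memory accesses and instruction latencies. The name `min_cycles` is hypothetical.

```c
#include <assert.h>

/* Lower bound on the number of cycles per loop body when the only
   constraint is the number of arithmetic slots per cycle (6 here). */
int min_cycles(int alu_ops_per_iteration, int unroll_factor)
{
    const int alu_slots = 6;
    assert(alu_ops_per_iteration > 0 && unroll_factor > 0);
    int ops = alu_ops_per_iteration * unroll_factor;
    return (ops + alu_slots - 1) / alu_slots;  /* ceiling division */
}
```

With 9 instructions per iteration, `min_cycles(9, 1)` is 2 and `min_cycles(9, 2)` is 3. One iteration handles 4 complex numbers, so the unrolled loop processes 8 complex numbers in 3 cycles instead of 4 in 2, and no slots are left idle.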
3. stage_radix2_readConjSwap_simd128_unroll2
Here the loop is unrolled by a factor of 2 using the unroll pragma.
C code
void stage_radix2_readConjSwap_simd128_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef, myComplex *swap_coef)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *xy1_in = (__v2di*)&data_in[2];
__v2di *conj_c_in = (__v2di*)conj_coef;
__v2di *swap_c_in = (__v2di*)swap_coef;
__v2di *out_add = (__v2di*)&data_out[0];
__v2di *out_sub = (__v2di*)&data_out[data_count/2];
#pragma ivdep
#pragma unroll(2)
#pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
__v2di xy0 = xy0_in[2*i];
__v2di xy1 = xy1_in[2*i];
__v2di conj_c = conj_c_in[i];
__v2di swap_c = swap_c_in[i];
__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
__v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
__v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
out_add[i] = __builtin_e2k_qpfadds(x, cy);
out_sub[i] = __builtin_e2k_qpfsubs(x, cy);
}
}
Main loop in assembly
.L992:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=4, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=4, abs=0, disp=32
}
{
fapb ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=4, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=4, abs=16, disp=0
}
.L626:
{
loop_mode
qpfmul_hadds,0,sm %b[27], %b[14], %b[92], %b[1]
qpfmuls,1,sm %b[81], %b[4], %b[9]
qpfsubs,2,sm %b[89], %b[88], %b[34]
qpshufb,3,sm %b[52], %b[51], %r0, %b[0]
qpshufb,4,sm %b[20], %b[33], %r0, %b[8]
qpfadds,5,sm %b[89], %b[88], %b[13]
}
{
loop_mode
qpfmuls,0,sm %b[73], %b[10], %b[88]
qpfsubs,1,sm %b[84], %b[85], %b[89]
staaqp,2 %b[36], %aad1[ %aasti4 ]
qpshufb,3,sm %b[66], %b[65], %r13, %b[80]
qpshufb,4,sm %b[74], %b[74], %r14, %b[81]
staaqp,5 %b[15], %aad2[ %aasti5 ]
movaqp,0 area=0, ind=0, am=0, be=0, %b[27]
movaqp,1 area=0, ind=16, am=1, be=0, %b[14]
movaqp,2 area=0, ind=0, am=0, be=0, %b[47]
movaqp,3 area=0, ind=16, am=1, be=0, %b[48]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfmul_hadds,0,sm %b[46], %b[6], %b[11], %b[66]
qpfadds,1,sm %b[82], %b[83], %b[74]
staaqp,2 %b[91], %aad1[ %aasti4 + _f32s,_lts0 0x10 ]
incr,2 %aaincr3
qpshufb,3,sm %b[32], %b[45], %r13, %b[85]
qpshufb,4,sm %b[7], %b[7], %r14, %b[84]
staaqp,5 %b[78], %aad2[ %aasti5 + _f32s,_lts0 0x10 ]
incr,5 %aaincr3
movaqp,0 area=1, ind=0, am=0, be=0, %b[65]
movaqp,1 area=1, ind=16, am=1, be=0, %b[73]
movaqp,2 area=1, ind=0, am=0, be=0, %b[15]
movaqp,3 area=1, ind=16, am=1, be=0, %b[36]
}
Theoretical speed: 8 complex numbers in 3 cycles (8/3) = 21.33 bytes/cycle
Speed measurements

Summary for stage_radix2_readConjSwap


The FFT plot is here.
stage_radix2_readConjSwap_2x
One pass of stage_radix2_readConjSwap_2x does the same work as 2 passes of stage_radix2. For convenient comparison with stage_radix2, the speed of stage_radix2_readConjSwap_2x is therefore multiplied by 2 (this is noted on the plot axis and in the console output).
1. stage_radix2_readConjSwap_2x_simd64
Here the stage_radix2_readConjSwap_simd64 algorithm is manually unrolled by a factor of 2.
C code
void stage_radix2_readConjSwap_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b)
{
uint64_t *x0_in = (uint64_t*)&data_in[0];
uint64_t *y0_in = (uint64_t*)&data_in[1];
uint64_t *x1_in = (uint64_t*)&data_in[2];
uint64_t *y1_in = (uint64_t*)&data_in[3];
uint64_t *conj_c0a_in = (uint64_t*)&conj_coef_a[0];
uint64_t *conj_c1a_in = (uint64_t*)&conj_coef_a[1];
uint64_t *conj_c0b_in = (uint64_t*)&conj_coef_b[0];
uint64_t *conj_c1b_in = (uint64_t*)&conj_coef_b[data_count/4];
uint64_t *swap_c0a_in = (uint64_t*)&swap_coef_a[0];
uint64_t *swap_c1a_in = (uint64_t*)&swap_coef_a[1];
uint64_t *swap_c0b_in = (uint64_t*)&swap_coef_b[0];
uint64_t *swap_c1b_in = (uint64_t*)&swap_coef_b[data_count/4];
uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
uint64_t x0 = x0_in[4*i];
uint64_t y0 = y0_in[4*i];
uint64_t conj_c0 = conj_c0a_in[2*i];
uint64_t swap_c0 = swap_c0a_in[2*i];
uint64_t x1 = x1_in[4*i];
uint64_t y1 = y1_in[4*i];
uint64_t conj_c1 = conj_c1a_in[2*i];
uint64_t swap_c1 = swap_c1a_in[2*i];
uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);
x0 = add0;
y0 = add1;
conj_c0 = conj_c0b_in[i];
swap_c0 = swap_c0b_in[i];
x1 = sub0;
y1 = sub1;
conj_c1 = conj_c1b_in[i];
swap_c1 = swap_c1b_in[i];
cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
}
}
Main loop in assembly
.L723:
{
fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=3, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=2, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=3, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=3, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=3, abs=24, disp=0
}
.L322:
{
loop_mode
pfmuls,1,sm %b[60], %b[45], %b[34]
pfmul_hadds,2,sm %b[108], %b[47], %b[36], %b[0]
pfmul_hadds,3,sm %b[41], %b[97], %b[111], %b[21]
pfsub_adds,4,sm %b[77], %b[103], %b[25], %b[6]
pfsub_rsubs,5,sm %b[77], %b[103], %b[25], %b[1]
}
{
loop_mode
pfmul_hadds,3,sm %b[76], %b[102], %b[107], %b[53]
pfadd_adds,4,sm %b[77], %b[103], %b[57], %b[50]
pfadd_rsubs,5,sm %b[77], %b[103], %b[57], %b[47]
movad,0 area=0, ind=8, am=0, be=0, %b[56]
movad,1 area=3, ind=0, am=1, be=0, %b[36]
movad,2 area=0, ind=24, am=0, be=0, %b[41]
movad,3 area=2, ind=0, am=1, be=0, %b[25]
}
{
loop_mode
staad,2 %b[10], %aad2[ %aasti9 ]
incr,2 %aaincr0
pfsubs,3,sm %b[32], %b[4], %b[91]
pfmuls,4,sm %b[22], %b[17], %b[92]
staad,5 %b[5], %aad1[ %aasti8 ]
incr,5 %aaincr0
movad,0 area=2, ind=0, am=1, be=0, %b[77]
movad,1 area=1, ind=0, am=0, be=0, %b[76]
movad,2 area=1, ind=0, am=1, be=0, %b[60]
movad,3 area=0, ind=0, am=0, be=0, %b[57]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfadds,0,sm %b[32], %b[4], %b[96]
pfmuls,1,sm %b[89], %b[98], %b[103]
staad,2 %b[54], %aad4[ %aasti11 ]
incr,2 %aaincr0
pfmuls,3,sm %b[48], %b[93], %b[107]
pfmul_hadds,4,sm %b[90], %b[19], %b[94], %b[97]
staad,5 %b[51], %aad3[ %aasti10 ]
incr,5 %aaincr0
movad,0 area=1, ind=8, am=1, be=0, %b[102]
movad,1 area=0, ind=0, am=1, be=0, %b[10]
movad,2 area=0, ind=8, am=1, be=0, %b[5]
movad,3 area=0, ind=16, am=0, be=0, %b[22]
}
Theoretical speed: 4 complex numbers in 4 cycles (4/4) = 8 bytes/cycle
Doubled theoretical speed: 16 bytes/cycle
Speed measurements

2. stage_radix2_readConjSwap_2x_simd64_unroll2
Здесь происходит раскрутка цикла в 2 раза с помощью опции unroll.
C code
void stage_radix2_readConjSwap_2x_simd64_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b)
{
uint64_t *x0_in = (uint64_t*)&data_in[0];
uint64_t *y0_in = (uint64_t*)&data_in[1];
uint64_t *x1_in = (uint64_t*)&data_in[2];
uint64_t *y1_in = (uint64_t*)&data_in[3];
uint64_t *conj_c0a_in = (uint64_t*)&conj_coef_a[0];
uint64_t *conj_c1a_in = (uint64_t*)&conj_coef_a[1];
uint64_t *conj_c0b_in = (uint64_t*)&conj_coef_b[0];
uint64_t *conj_c1b_in = (uint64_t*)&conj_coef_b[data_count/4];
uint64_t *swap_c0a_in = (uint64_t*)&swap_coef_a[0];
uint64_t *swap_c1a_in = (uint64_t*)&swap_coef_a[1];
uint64_t *swap_c0b_in = (uint64_t*)&swap_coef_b[0];
uint64_t *swap_c1b_in = (uint64_t*)&swap_coef_b[data_count/4];
uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(2)
#pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
uint64_t x0 = x0_in[4*i];
uint64_t y0 = y0_in[4*i];
uint64_t conj_c0 = conj_c0a_in[2*i];
uint64_t swap_c0 = swap_c0a_in[2*i];
uint64_t x1 = x1_in[4*i];
uint64_t y1 = y1_in[4*i];
uint64_t conj_c1 = conj_c1a_in[2*i];
uint64_t swap_c1 = swap_c1a_in[2*i];
uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);
x0 = add0;
y0 = add1;
conj_c0 = conj_c0b_in[i];
swap_c0 = swap_c0b_in[i];
x1 = sub0;
y1 = sub1;
conj_c1 = conj_c1b_in[i];
swap_c1 = swap_c1b_in[i];
cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
}
}
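The loop body builds each complex product from two element-wise multiplies and one horizontal add. A scalar model of that pattern is sketched below. The type and function names are invented for illustration. The layout of the coefficient arrays (conj = (re, -im), swap = (im, re)) and the operand order of the horizontal add are assumptions inferred from the variable names cy0_real and cy0_imag, not taken from processor documentation.

```c
/* Hypothetical scalar stand-in for one 64-bit packed register (two floats). */
typedef struct { float lo, hi; } pair_f32;

/* Element-wise multiply, modelling how the code uses pfmuls. */
static pair_f32 model_pfmuls(pair_f32 a, pair_f32 b)
{
    return (pair_f32){ a.lo * b.lo, a.hi * b.hi };
}

/* Horizontal add as the code relies on it (assumed semantics):
   sum of a's halves in the low element, sum of b's halves in the high one. */
static pair_f32 model_pfhadds(pair_f32 a, pair_f32 b)
{
    return (pair_f32){ a.lo + a.hi, b.lo + b.hi };
}

/* c * y, with the coefficient pre-split into conj = (re, -im)
   and swap = (im, re). lo holds the real part, hi the imaginary part. */
static pair_f32 complex_mul_conj_swap(pair_f32 c, pair_f32 y)
{
    pair_f32 conj_c = { c.lo, -c.hi };
    pair_f32 swap_c = { c.hi,  c.lo };
    pair_f32 real_terms = model_pfmuls(conj_c, y);  /* (cr*yr, -ci*yi) */
    pair_f32 imag_terms = model_pfmuls(swap_c, y);  /* (ci*yr,  cr*yi) */
    return model_pfhadds(real_terms, imag_terms);
}
```

With c = 1 + 2i and y = 3 + 4i the model returns (-5, 10), that is -5 + 10i. In the real code the conj and swap arrays are precomputed, so the negation and the swap cost nothing inside the loop.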
Main loop in assembly
.L1964:
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=3, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=2, asz=3, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=7, asz=3, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=6, asz=3, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=5, asz=3, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=1, ind=4, asz=3, abs=24, disp=0
}
.L1045:
{
loop_mode
pfmul_hadds,0,sm %b[71], %b[68], %b[118], %b[31]
pfsub_adds,1,sm %b[85], %b[94], %b[107], %b[1]
pfsub_rsubs,2,sm %b[85], %b[94], %b[107], %b[0]
pfmuls,3,sm %b[54], %b[50], %b[119]
pfmuls,4,sm %b[31], %b[18], %b[117]
pfmuls,5,sm %b[62], %b[103], %b[115]
}
{
loop_mode
pfmul_hadds,0,sm %b[43], %b[116], %g16, %b[85]
pfadd_adds,1,sm %b[85], %b[94], %b[33], %b[71]
pfadd_rsubs,2,sm %b[85], %b[94], %b[33], %b[68]
pfmul_hadds,3,sm %b[108], %b[38], %g17, %b[62]
pfmuls,5,sm %b[95], %b[66], %b[116]
movad,0 area=3, ind=0, am=0, be=0, %b[43]
movad,1 area=3, ind=8, am=1, be=0, %b[54]
movad,2 area=3, ind=0, am=0, be=0, %b[33]
movad,3 area=3, ind=8, am=1, be=0, %b[38]
}
{
loop_mode
pfmul_hadds,0,sm %b[61], %b[76], %g18, %b[107]
pfsub_adds,1,sm %b[23], %b[101], %b[87], %b[94]
pfsub_rsubs,2,sm %b[23], %b[101], %b[87], %b[95]
pfmuls,4,sm %g19, %b[36], %g17
pfmuls,5,sm %b[51], %b[114], %g16
movad,0 area=2, ind=0, am=0, be=0, %b[76]
movad,1 area=2, ind=8, am=1, be=0, %b[87]
movad,2 area=2, ind=0, am=0, be=0, %b[51]
movad,3 area=2, ind=8, am=1, be=0, %b[61]
}
{
loop_mode
pfadd_adds,0,sm %b[23], %b[101], %b[109], %b[108]
pfadd_rsubs,1,sm %b[23], %b[101], %b[109], %b[109]
staad,2 %b[3], %aad2[ %aasti9 + _f32s,_lts0 0x8 ]
pfsubs,3,sm %b[30], %b[64], %b[101]
pfmuls,4,sm %b[84], %b[74], %g18
staad,5 %b[2], %aad1[ %aasti8 + _f32s,_lts0 0x8 ]
movad,0 area=1, ind=16, am=0, be=0, %b[23]
movad,1 area=1, ind=0, am=0, be=0, %b[84]
movad,2 area=1, ind=16, am=0, be=0, %b[2]
movad,3 area=1, ind=0, am=0, be=0, %b[3]
}
{
loop_mode
staad,2 %b[73], %aad4[ %aasti11 + _f32s,_lts0 0x8 ]
pfmul_hadds,3,sm %b[34], %b[50], %b[119], %b[70]
pfadds,4,sm %b[30], %b[64], %b[64]
staad,5 %b[70], %aad3[ %aasti10 + _f32s,_lts0 0x8 ]
movad,0 area=1, ind=8, am=1, be=0, %b[50]
movad,1 area=1, ind=24, am=0, be=0, %g19
movad,2 area=1, ind=8, am=0, be=0, %b[30]
movad,3 area=0, ind=24, am=0, be=0, %b[34]
}
{
loop_mode
pfmul_hadds,1,sm %b[11], %b[104], %b[113], %b[97]
staad,2 %b[96], %aad2[ %aasti9 ]
incr,2 %aaincr4
pfsubs,4,sm %b[24], %b[72], %b[112]
staad,5 %b[97], %aad1[ %aasti8 ]
incr,5 %aaincr4
movad,0 area=0, ind=0, am=0, be=0, %b[11]
movad,1 area=0, ind=8, am=0, be=0, %b[96]
movad,2 area=1, ind=24, am=1, be=0, %b[104]
movad,3 area=0, ind=0, am=0, be=0, %b[73]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmul_hadds,0,sm %b[10], %b[18], %b[117], %b[90]
pfmul_hadds,1,sm %b[46], %b[103], %b[115], %b[103]
staad,2 %b[110], %aad4[ %aasti11 ]
incr,2 %aaincr4
pfmuls,3,sm %b[90], %b[102], %b[111]
pfadds,4,sm %b[24], %b[72], %b[72]
staad,5 %b[111], %aad3[ %aasti10 ]
incr,5 %aaincr4
movad,0 area=0, ind=16, am=0, be=0, %b[18]
movad,1 area=0, ind=24, am=1, be=0, %b[46]
movad,2 area=0, ind=8, am=1, be=0, %b[10]
movad,3 area=0, ind=16, am=0, be=0, %b[24]
}
Theoretical speed: 8 complex numbers in 7 cycles (8/7) = 9.14 bytes/cycle
Doubled theoretical speed: 18.29 bytes/cycle
Speed measurements

3. stage_radix2_readConjSwap_2x_simd64_unroll4
Here the loop is unrolled by a factor of 4 using the unroll pragma.
C code
void stage_radix2_readConjSwap_2x_simd64_unroll4(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b)
{
uint64_t *x0_in = (uint64_t*)&data_in[0];
uint64_t *y0_in = (uint64_t*)&data_in[1];
uint64_t *x1_in = (uint64_t*)&data_in[2];
uint64_t *y1_in = (uint64_t*)&data_in[3];
uint64_t *conj_c0a_in = (uint64_t*)&conj_coef_a[0];
uint64_t *conj_c1a_in = (uint64_t*)&conj_coef_a[1];
uint64_t *conj_c0b_in = (uint64_t*)&conj_coef_b[0];
uint64_t *conj_c1b_in = (uint64_t*)&conj_coef_b[data_count/4];
uint64_t *swap_c0a_in = (uint64_t*)&swap_coef_a[0];
uint64_t *swap_c1a_in = (uint64_t*)&swap_coef_a[1];
uint64_t *swap_c0b_in = (uint64_t*)&swap_coef_b[0];
uint64_t *swap_c1b_in = (uint64_t*)&swap_coef_b[data_count/4];
uint64_t *out_add0 = (uint64_t*)&data_out[0*data_count/4];
uint64_t *out_add1 = (uint64_t*)&data_out[1*data_count/4];
uint64_t *out_sub0 = (uint64_t*)&data_out[2*data_count/4];
uint64_t *out_sub1 = (uint64_t*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(4)
#pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
uint64_t x0 = x0_in[4*i];
uint64_t y0 = y0_in[4*i];
uint64_t conj_c0 = conj_c0a_in[2*i];
uint64_t swap_c0 = swap_c0a_in[2*i];
uint64_t x1 = x1_in[4*i];
uint64_t y1 = y1_in[4*i];
uint64_t conj_c1 = conj_c1a_in[2*i];
uint64_t swap_c1 = swap_c1a_in[2*i];
uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
uint64_t add0 = __builtin_e2k_pfadds(x0, cy0);
uint64_t sub0 = __builtin_e2k_pfsubs(x0, cy0);
uint64_t add1 = __builtin_e2k_pfadds(x1, cy1);
uint64_t sub1 = __builtin_e2k_pfsubs(x1, cy1);
x0 = add0;
y0 = add1;
conj_c0 = conj_c0b_in[i];
swap_c0 = swap_c0b_in[i];
x1 = sub0;
y1 = sub1;
conj_c1 = conj_c1b_in[i];
swap_c1 = swap_c1b_in[i];
cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
out_add0[i] = __builtin_e2k_pfadds(x0, cy0);
out_sub0[i] = __builtin_e2k_pfsubs(x0, cy0);
out_add1[i] = __builtin_e2k_pfadds(x1, cy1);
out_sub1[i] = __builtin_e2k_pfsubs(x1, cy1);
}
}
Main loop in assembly
.L3317:
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=4, disp=64
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=3, ind=1, asz=2, abs=4, disp=96
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=2, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=2, abs=8, disp=32
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=2, asz=2, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=2, asz=2, abs=12, disp=32
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=7, asz=3, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=6, asz=3, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=5, asz=3, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=4, asz=3, abs=24, disp=0
}
.L2286:
{
loop_mode
pfmul_hadds,0,sm %b[13], %b[110], %g16, %b[110]
pfadd_adds,1,sm %b[37], %b[72], %b[111], %b[115]
pfmuls,2,sm %b[18], %b[116], %g17
pfadd_rsubs,3,sm %b[37], %b[72], %b[111], %b[111]
pfmuls,4,sm %b[115], %g20, %g21
pfmuls,5,sm %g18, %b[107], %g19
}
{
loop_mode
pfmul_hadds,0,sm %b[12], %b[103], %g24, %b[118]
pfsub_adds,1,sm %b[30], %b[102], %b[118], %g23
pfmuls,2,sm %b[52], %b[114], %g25
pfsub_rsubs,3,sm %b[30], %b[102], %b[118], %g22
pfmuls,4,sm %g26, %b[99], %g27
pfmuls,5,sm %b[40], %b[80], %b[103]
movad,0 area=5, ind=16, am=0, be=0, %b[12]
movad,1 area=5, ind=0, am=0, be=0, %b[13]
movad,2 area=5, ind=16, am=0, be=0, %b[0]
movad,3 area=5, ind=0, am=0, be=0, %b[1]
}
{
loop_mode
pfmul_hadds,0,sm %b[7], %r0, %r1, %b[102]
pfadd_adds,1,sm %b[30], %b[102], %g29, %g31
pfsubs,2,sm %b[92], %b[109], %r2
pfadd_rsubs,3,sm %b[30], %b[102], %g29, %g30
pfmuls,4,sm %b[21], %b[57], %r3
staad,5 %g28, %aad2[ %aasti9 + _f32s,_lts0 0x8 ]
movad,0 area=5, ind=24, am=0, be=0, %b[21]
movad,1 area=5, ind=8, am=1, be=0, %b[30]
movad,2 area=5, ind=24, am=0, be=0, %b[7]
movad,3 area=5, ind=8, am=1, be=0, %b[18]
}
{
loop_mode
pfmul_hadds,1,sm %b[46], %b[108], %r4, %b[109]
pfadds,2,sm %b[92], %b[109], %r5
pfsubs,4,sm %b[81], %b[100], %b[108]
staad,5 %b[119], %aad1[ %aasti8 + _f32s,_lts0 0x8 ]
movad,0 area=4, ind=16, am=0, be=0, %b[46]
movad,1 area=4, ind=0, am=0, be=0, %b[52]
movad,2 area=4, ind=16, am=0, be=0, %b[37]
movad,3 area=4, ind=0, am=0, be=0, %b[40]
}
{
loop_mode
pfmul_hadds,0,sm %b[49], %b[63], %r6, %b[100]
pfmul_hadds,1,sm %b[6], %b[116], %g17, %b[116]
pfmuls,2,sm %b[71], %b[88], %g17
pfadds,4,sm %b[81], %b[100], %b[101]
staad,5 %b[101], %aad4[ %aasti11 + _f32s,_lts0 0x8 ]
movad,0 area=4, ind=24, am=0, be=0, %b[63]
movad,1 area=4, ind=8, am=1, be=0, %b[71]
movad,2 area=4, ind=24, am=0, be=0, %b[6]
movad,3 area=4, ind=8, am=1, be=0, %b[49]
}
{
loop_mode
pfmul_hadds,0,sm %b[43], %b[114], %g25, %g29
pfsub_adds,1,sm %b[68], %b[112], %b[117], %g28
pfsubs,2,sm %b[84], %r9, %b[114]
pfsubs,4,sm %b[89], %b[106], %r0
staad,5 %r8, %aad3[ %aasti10 + _f32s,_lts0 0x8 ]
movad,0 area=3, ind=16, am=0, be=0, %b[81]
movad,1 area=3, ind=0, am=0, be=0, %b[93]
movad,2 area=3, ind=0, am=0, be=0, %b[43]
movad,3 area=3, ind=16, am=0, be=0, %b[72]
}
{
loop_mode
pfsub_rsubs,1,sm %b[68], %b[112], %b[117], %b[117]
pfmuls,2,sm %b[34], %r2, %g25
pfadds,4,sm %b[89], %b[106], %b[106]
staad,5 %r10, %aad2[ %aasti9 + _f32s,_lts0 0x18 ]
movad,0 area=3, ind=8, am=1, be=0, %b[34]
movad,1 area=3, ind=24, am=0, be=0, %b[96]
movad,2 area=3, ind=8, am=1, be=0, %b[89]
movad,3 area=3, ind=24, am=0, be=0, %b[92]
}
{
loop_mode
pfmul_hadds,0,sm %b[98], %b[99], %g27, %b[107]
pfadd_adds,1,sm %b[68], %b[112], %r11, %b[99]
pfmuls,2,sm %b[75], %r5, %r12
pfmul_hadds,3,sm %b[94], %b[107], %g19, %b[98]
pfmuls,4,sm %b[25], %b[108], %g16
staad,5 %b[113], %aad1[ %aasti8 + _f32s,_lts0 0x18 ]
movad,0 area=2, ind=8, am=0, be=0, %g19
movad,1 area=0, ind=24, am=0, be=0, %b[75]
movad,2 area=2, ind=0, am=0, be=0, %b[25]
movad,3 area=2, ind=8, am=0, be=0, %b[113]
}
{
loop_mode
pfmul_hadds,0,sm %b[91], %g20, %g21, %r9
pfadd_rsubs,1,sm %b[68], %b[112], %r11, %r8
staad,2 %r13, %aad4[ %aasti11 + _f32s,_lts0 0x18 ]
pfadds,3,sm %b[84], %r9, %b[112]
pfmuls,4,sm %b[67], %b[101], %g24
staad,5 %b[105], %aad3[ %aasti10 + _f32s,_lts0 0x18 ]
movad,0 area=2, ind=0, am=0, be=0, %b[67]
movad,1 area=1, ind=24, am=0, be=0, %g20
movad,2 area=2, ind=24, am=0, be=0, %g18
movad,3 area=1, ind=24, am=0, be=0, %b[105]
}
{
loop_mode
pfmul_hadds,0,sm %b[97], %b[88], %g17, %b[68]
pfsub_adds,1,sm %b[62], %r16, %b[110], %r10
staad,2 %r15, %aad2[ %aasti9 ]
pfmul_hadds,3,sm %b[36], %b[77], %b[104], %b[104]
pfmuls,4,sm %b[17], %r0, %r1
staad,5 %r14, %aad1[ %aasti8 ]
movad,0 area=2, ind=16, am=0, be=0, %b[17]
movad,1 area=2, ind=24, am=1, be=0, %g26
movad,2 area=2, ind=16, am=1, be=0, %b[36]
movad,3 area=0, ind=24, am=0, be=0, %b[97]
}
{
loop_mode
pfmul_hadds,0,sm %b[85], %b[57], %r3, %b[110]
pfmul_hadds,1,sm %b[22], %r2, %g25, %b[115]
staad,2 %b[115], %aad4[ %aasti11 ]
pfsub_rsubs,3,sm %b[62], %r16, %b[110], %b[111]
pfmuls,4,sm %b[56], %b[106], %r4
staad,5 %b[111], %aad3[ %aasti10 ]
movad,0 area=1, ind=0, am=0, be=0, %b[22]
movad,1 area=1, ind=8, am=0, be=0, %b[57]
movad,2 area=1, ind=0, am=0, be=0, %b[56]
movad,3 area=1, ind=16, am=0, be=0, %b[77]
}
{
loop_mode
pfmul_hadds,0,sm %b[76], %b[80], %b[103], %r16
pfadd_adds,1,sm %b[62], %r16, %b[118], %r13
staad,2 %g23, %aad2[ %aasti9 + _f32s,_lts0 0x10 ]
incr,2 %aaincr4
pfadd_rsubs,3,sm %b[62], %r16, %b[118], %b[103]
pfmuls,4,sm %b[29], %b[61], %r6
staad,5 %g22, %aad1[ %aasti8 + _f32s,_lts0 0x10 ]
incr,5 %aaincr4
movad,0 area=1, ind=16, am=1, be=0, %b[80]
movad,1 area=0, ind=0, am=0, be=0, %b[29]
movad,2 area=1, ind=8, am=1, be=0, %b[76]
movad,3 area=0, ind=0, am=0, be=0, %b[62]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmul_hadds,0,sm %b[53], %r5, %r12, %r11
pfsub_adds,1,sm %b[35], %b[70], %b[102], %r15
staad,2 %g31, %aad4[ %aasti11 + _f32s,_lts0 0x10 ]
incr,2 %aaincr4
pfsub_rsubs,3,sm %b[35], %b[70], %b[102], %r14
pfmuls,4,sm %g19, %b[75], %b[102]
staad,5 %g30, %aad3[ %aasti10 + _f32s,_lts0 0x10 ]
incr,5 %aaincr4
movad,0 area=0, ind=16, am=0, be=0, %b[85]
movad,1 area=0, ind=8, am=1, be=0, %b[84]
movad,2 area=0, ind=8, am=1, be=0, %b[53]
movad,3 area=0, ind=16, am=0, be=0, %b[88]
}
Theoretical speed: 16 complex numbers in 13 cycles (16/13) = 9.85 bytes/cycle
Doubled theoretical speed: 19.69 bytes/cycle
Speed measurements

4. stage_radix2_readConjSwap_2x_simd128
Here the stage_radix2_readConjSwap_simd128 algorithm is manually unrolled by a factor of 2.
C code
void stage_radix2_readConjSwap_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *xy1_in = (__v2di*)&data_in[2];
__v2di *xy2_in = (__v2di*)&data_in[4];
__v2di *xy3_in = (__v2di*)&data_in[6];
__v2di *conj_c0a_in = (__v2di*)&conj_coef_a[0];
__v2di *conj_c1a_in = (__v2di*)&conj_coef_a[2];
__v2di *conj_c0b_in = (__v2di*)&conj_coef_b[0];
__v2di *conj_c1b_in = (__v2di*)&conj_coef_b[data_count/4];
__v2di *swap_c0a_in = (__v2di*)&swap_coef_a[0];
__v2di *swap_c1a_in = (__v2di*)&swap_coef_a[2];
__v2di *swap_c0b_in = (__v2di*)&swap_coef_b[0];
__v2di *swap_c1b_in = (__v2di*)&swap_coef_b[data_count/4];
__v2di *out_add0 = (__v2di*)&data_out[0*data_count/4];
__v2di *out_add1 = (__v2di*)&data_out[1*data_count/4];
__v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4];
__v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/8; ++i)
{
__v2di xy0 = xy0_in[4*i];
__v2di xy1 = xy1_in[4*i];
__v2di conj_c0 = conj_c0a_in[2*i];
__v2di swap_c0 = swap_c0a_in[2*i];
__v2di xy2 = xy2_in[4*i];
__v2di xy3 = xy3_in[4*i];
__v2di conj_c1 = conj_c1a_in[2*i];
__v2di swap_c1 = swap_c1a_in[2*i];
__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di add0 = __builtin_e2k_qpfadds(x0, cy0);
__v2di sub0 = __builtin_e2k_qpfsubs(x0, cy0);
__v2di add1 = __builtin_e2k_qpfadds(x1, cy1);
__v2di sub1 = __builtin_e2k_qpfsubs(x1, cy1);
xy0 = add0;
xy1 = add1;
conj_c0 = conj_c0b_in[i];
swap_c0 = swap_c0b_in[i];
xy2 = sub0;
xy3 = sub1;
conj_c1 = conj_c1b_in[i];
swap_c1 = swap_c1b_in[i];
x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
out_add0[i] = __builtin_e2k_qpfadds(x0, cy0);
out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0);
out_add1[i] = __builtin_e2k_qpfadds(x1, cy1);
out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1);
}
}
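The shuffle applied to cy0_rrii and cy1_rrii turns the horizontal-add output (r0, r1, i0, i1) back into interleaved complex numbers (r0, i0, r1, i1). The byte-level sketch below replays that mask in portable C. The helper names are hypothetical. It models only the case used here, where both data operands of qpshufb are the same register, and it assumes that mask byte k selects the source byte for result byte k. It does not describe how the instruction treats two different sources.

```c
#include <stdint.h>
#include <string.h>

/* Table lookup over a single 16-byte source: result byte k is the
   source byte whose number is stored in mask byte k. */
static void shuffle_bytes16(const uint8_t src[16], uint64_t mask_lo,
                            uint64_t mask_hi, uint8_t dst[16])
{
    for (int k = 0; k < 16; ++k) {
        uint64_t word = (k < 8) ? mask_lo : mask_hi;
        unsigned index = (unsigned)(word >> (8 * (k % 8))) & 0x0Fu;
        dst[k] = src[index];
    }
}

/* Reorder four floats from (r0, r1, i0, i1) to (r0, i0, r1, i1),
   using the same mask constants as the listing above. */
static void rrii_to_riri(const float rrii[4], float riri[4])
{
    uint8_t src[16], dst[16];
    memcpy(src, rrii, sizeof src);
    shuffle_bytes16(src, 0x0B0A090803020100ULL, 0x0F0E0D0C07060504ULL, dst);
    memcpy(riri, dst, sizeof dst);
}
```

Feeding (1, 2, 3, 4) through rrii_to_riri yields (1, 3, 2, 4): each group of four mask bytes moves one whole float.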
Main loop in assembly
.L4621:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=3, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=3, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=3, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=3, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=24, disp=0
}
.L3922:
{
loop_mode
qpfsubs,2,sm %b[65], %b[78], %b[95]
qpfadds,3,sm %b[28], %b[3], %b[0]
qpshufb,4,sm %b[69], %b[69], %r20, %b[1]
qpfadds,5,sm %b[65], %b[78], %b[96]
}
{
loop_mode
qpfmul_hadds,0,sm %b[74], %b[30], %b[76], %b[17]
qpshufb,1,sm %b[50], %b[55], %r18, %b[28]
qpfsubs,2,sm %b[77], %b[92], %b[97]
qpshufb,4,sm %b[88], %b[91], %r18, %b[21]
qpfadds,5,sm %b[77], %b[92], %b[98]
movaqp,0 area=3, ind=0, am=1, be=0, %b[8]
movaqp,1 area=2, ind=0, am=1, be=0, %b[3]
}
{
loop_mode
qpfmul_hadds,0,sm %b[68], %b[89], %b[71], %b[65]
qpshufb,1,sm %b[90], %b[93], %r19, %b[59]
qpfmuls,2,sm %b[86], %b[28], %b[74]
qpshufb,4,sm %b[62], %b[62], %r20, %b[76]
qpfmuls,5,sm %b[82], %b[87], %b[69]
movaqp,0 area=0, ind=0, am=0, be=0, %b[51]
movaqp,1 area=0, ind=16, am=1, be=0, %b[46]
movaqp,2 area=3, ind=0, am=1, be=0, %b[33]
movaqp,3 area=2, ind=0, am=1, be=0, %b[30]
}
{
loop_mode
qpshufb,0,sm %b[4], %b[29], %r18, %b[85]
qpshufb,1,sm %b[2], %b[81], %r19, %b[71]
qpfadds,2,sm %b[58], %b[54], %b[77]
qpfsubs,3,sm %b[58], %b[54], %b[89]
qpshufb,4,sm %b[27], %b[27], %r20, %b[90]
qpfsubs,5,sm %b[26], %b[1], %b[86]
movaqp,0 area=1, ind=16, am=1, be=0, %b[78]
movaqp,1 area=1, ind=0, am=0, be=0, %b[82]
movaqp,2 area=1, ind=16, am=1, be=0, %b[62]
movaqp,3 area=1, ind=0, am=0, be=0, %b[68]
}
{
loop_mode
qpfmul_hadds,0,sm %b[47], %b[23], %b[94], %b[58]
qpshufb,1,sm %b[52], %b[57], %r19, %b[54]
staaqp,2 %b[95], %aad1[ %aasti8 ]
incr,2 %aaincr0
qpfmuls,3,sm %b[20], %b[21], %b[92]
qpshufb,4,sm %b[0], %b[79], %r18, %b[81]
staaqp,5 %b[96], %aad2[ %aasti9 ]
incr,5 %aaincr0
movaqp,2 area=0, ind=0, am=0, be=0, %b[27]
movaqp,3 area=0, ind=16, am=1, be=0, %b[2]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfmul_hadds,0,sm %b[44], %b[83], %b[49], %b[23]
qpshufb,1,sm %b[6], %b[31], %r19, %b[20]
staaqp,2 %b[97], %aad3[ %aasti10 ]
incr,2 %aaincr0
qpfmuls,3,sm %b[15], %b[81], %b[47]
qpshufb,4,sm %b[19], %b[19], %r20, %b[52]
staaqp,5 %b[98], %aad4[ %aasti11 ]
incr,5 %aaincr0
}
Theoretical speed: 8 complex numbers in 6 cycles (8/6) = 10.67 bytes/cycle
Doubled theoretical speed: 21.33 bytes/cycle
Speed measurements

5. stage_radix2_readConjSwap_2x_simd128_v2
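The code is rearranged to reduce the number of instructions: the first butterfly works directly in the rrii layout, so the loop body needs 10 shuffle/permute instructions instead of 12.
C code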
void stage_radix2_readConjSwap_2x_simd128_v2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coef_a, myComplex *conj_coef_b, myComplex *swap_coef_a, myComplex *swap_coef_b)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *xy1_in = (__v2di*)&data_in[2];
__v2di *xy2_in = (__v2di*)&data_in[4];
__v2di *xy3_in = (__v2di*)&data_in[6];
__v2di *conj_c0a_in = (__v2di*)&conj_coef_a[0];
__v2di *conj_c1a_in = (__v2di*)&conj_coef_a[2];
__v2di *conj_c0b_in = (__v2di*)&conj_coef_b[0];
__v2di *conj_c1b_in = (__v2di*)&conj_coef_b[data_count/4];
__v2di *swap_c0a_in = (__v2di*)&swap_coef_a[0];
__v2di *swap_c1a_in = (__v2di*)&swap_coef_a[2];
__v2di *swap_c0b_in = (__v2di*)&swap_coef_b[0];
__v2di *swap_c1b_in = (__v2di*)&swap_coef_b[data_count/4];
__v2di *out_add0 = (__v2di*)&data_out[0*data_count/4];
__v2di *out_add1 = (__v2di*)&data_out[1*data_count/4];
__v2di *out_sub0 = (__v2di*)&data_out[2*data_count/4];
__v2di *out_sub1 = (__v2di*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/8; ++i)
{
__v2di xy0 = xy0_in[4*i];
__v2di xy1 = xy1_in[4*i];
__v2di conj_c0 = conj_c0a_in[2*i];
__v2di swap_c0 = swap_c0a_in[2*i];
__v2di xy2 = xy2_in[4*i];
__v2di xy3 = xy3_in[4*i];
__v2di conj_c1 = conj_c1a_in[2*i];
__v2di swap_c1 = swap_c1a_in[2*i];
__v2di x0_rrii = __builtin_e2k_qppermb(xy1, xy0, (__v2di){0x1312111003020100, 0x1716151407060504});
__v2di x1_rrii = __builtin_e2k_qppermb(xy3, xy2, (__v2di){0x1312111003020100, 0x1716151407060504});
__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
__v2di add0_rrii = __builtin_e2k_qpfadds(x0_rrii, cy0_rrii);
__v2di sub0_rrii = __builtin_e2k_qpfsubs(x0_rrii, cy0_rrii);
__v2di add1_rrii = __builtin_e2k_qpfadds(x1_rrii, cy1_rrii);
__v2di sub1_rrii = __builtin_e2k_qpfsubs(x1_rrii, cy1_rrii);
__v2di xy0_rrii = add0_rrii;
__v2di xy1_rrii = add1_rrii;
conj_c0 = conj_c0b_in[i];
swap_c0 = swap_c0b_in[i];
__v2di xy2_rrii = sub0_rrii;
__v2di xy3_rrii = sub1_rrii;
conj_c1 = conj_c1b_in[i];
swap_c1 = swap_c1b_in[i];
__v2di x0 = __builtin_e2k_qppermb(xy1_rrii, xy0_rrii, (__v2di){0x0B0A090803020100, 0x1B1A191813121110});
__v2di x1 = __builtin_e2k_qppermb(xy3_rrii, xy2_rrii, (__v2di){0x0B0A090803020100, 0x1B1A191813121110});
y0 = __builtin_e2k_qppermb(xy1_rrii, xy0_rrii, (__v2di){0x0F0E0D0C07060504, 0x1F1E1D1C17161514});
y1 = __builtin_e2k_qppermb(xy3_rrii, xy2_rrii, (__v2di){0x0F0E0D0C07060504, 0x1F1E1D1C17161514});
cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
out_add0[i] = __builtin_e2k_qpfadds(x0, cy0);
out_sub0[i] = __builtin_e2k_qpfsubs(x0, cy0);
out_add1[i] = __builtin_e2k_qpfadds(x1, cy1);
out_sub1[i] = __builtin_e2k_qpfsubs(x1, cy1);
}
}
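The rearrangement works because addition and subtraction are element-wise: as long as both operands use the same layout, the result is the same values in that layout, and the reordering can be postponed. The sketch below checks this on plain float arrays. All names are illustrative and none of this is Elbrus-specific code.

```c
/* Reorder (r0, r1, i0, i1) into (r0, i0, r1, i1). */
static void reorder_rrii_to_riri(const float in[4], float out[4])
{
    out[0] = in[0];
    out[1] = in[2];
    out[2] = in[1];
    out[3] = in[3];
}

/* Element-wise addition of four floats. */
static void add4(const float a[4], const float b[4], float out[4])
{
    for (int k = 0; k < 4; ++k)
        out[k] = a[k] + b[k];
}

/* Returns 1 when "add in rrii layout, then reorder" gives the same
   result as "reorder both operands, then add". */
static int add_commutes_with_reorder(const float x_rrii[4],
                                     const float cy_rrii[4])
{
    float sum_rrii[4], late[4];
    float x_riri[4], cy_riri[4], early[4];

    add4(x_rrii, cy_rrii, sum_rrii);
    reorder_rrii_to_riri(sum_rrii, late);

    reorder_rrii_to_riri(x_rrii, x_riri);
    reorder_rrii_to_riri(cy_rrii, cy_riri);
    add4(x_riri, cy_riri, early);

    for (int k = 0; k < 4; ++k)
        if (late[k] != early[k])
            return 0;
    return 1;
}
```

In the listing above, the inputs of the first butterfly are permuted straight into the rrii layout while x is being extracted, so the horizontal-add result can be used without reshuffling, and the second-stage extraction permutes absorb the layout change.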
Main loop in assembly
.L5345:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=3, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=3, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=3, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=3, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=3, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=24, disp=0
}
.L4696:
{
loop_mode
qpfmul_hadds,0,sm %b[62], %b[79], %b[110], %b[62]
qpshufb,1,sm %b[83], %b[83], %r11, %b[110]
qpfsubs,2,sm %b[119], %b[118], %b[113]
qpfmuls,3,sm %b[73], %b[103], %b[73]
qppermb,4,sm %b[69], %b[12], %r9, %b[115]
qpfadds,5,sm %b[119], %b[118], %b[114]
movaqp,0 area=0, ind=0, am=0, be=0, %b[0]
movaqp,1 area=1, ind=0, am=0, be=0, %b[69]
movaqp,2 area=0, ind=0, am=0, be=0, %b[12]
movaqp,3 area=0, ind=16, am=1, be=0, %b[1]
}
{
loop_mode
qpfmul_hadds,0,sm %b[44], %b[26], %b[108], %b[79]
qppermb,1,sm %b[92], %b[76], %r0, %b[117]
qpfmuls,2,sm %b[57], %b[77], %b[108]
qpfmuls,3,sm %b[80], %b[109], %b[80]
qpshufb,4,sm %b[66], %b[66], %r11, %b[116]
qpfadds,5,sm %b[115], %b[25], %b[66]
movaqp,0 area=0, ind=16, am=1, be=0, %b[57]
movaqp,1 area=1, ind=16, am=1, be=0, %b[76]
movaqp,2 area=3, ind=0, am=1, be=0, %b[26]
movaqp,3 area=2, ind=0, am=1, be=0, %b[44]
}
{
loop_mode
qpfmul_hadds,0,sm %b[106], %b[111], %b[82], %b[92]
qpshufb,1,sm %b[3], %b[14], %r12, %b[107]
qpfmuls,2,sm %b[41], %b[24], %b[106]
qpfadds,3,sm %g16, %b[98], %b[82]
qpshufb,4,sm %b[59], %b[2], %r12, %b[101]
qpfsubs,5,sm %b[115], %b[25], %b[83]
movaqp,0 area=3, ind=0, am=1, be=0, %b[25]
movaqp,1 area=2, ind=0, am=1, be=0, %b[41]
movaqp,2 area=1, ind=0, am=0, be=0, %b[93]
movaqp,3 area=1, ind=16, am=1, be=0, %b[100]
}
{
loop_mode
qpfsubs,0,sm %g18, %b[110], %g17
qppermb,1,sm %b[11], %b[22], %r9, %g16
staaqp,2 %g17, %aad1[ %aasti8 ]
incr,2 %aaincr0
qpfsubs,3,sm %g16, %b[98], %b[11]
qppermb,4,sm %b[13], %b[85], %r10, %b[22]
staaqp,5 %b[112], %aad2[ %aasti9 ]
incr,5 %aaincr0
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfmul_hadds,0,sm %b[99], %b[105], %b[75], %b[19]
qppermb,1,sm %b[84], %b[68], %r10, %b[75]
staaqp,2 %b[113], %aad3[ %aasti10 ]
incr,2 %aaincr0
qpfadds,3,sm %g18, %b[110], %b[110]
qppermb,4,sm %b[19], %b[91], %r0, %g18
staaqp,5 %b[114], %aad4[ %aasti11 ]
incr,5 %aaincr0
}
Theoretical speed: 8 complex numbers in 5 cycles (8/5) = 12.8 bytes/cycle
Doubled theoretical speed: 25.6 bytes/cycle
Speed measurements

Summary for stage_radix2_readConjSwap_2x


Speeds have increased compared with the original stage_radix2_readConjSwap versions.
The FFT chart is available here.
stage_radix4
Diagram of the Stage algorithm for the radix-4 version.

One pass of stage_radix4 does the same work as 2 passes of stage_radix2. For that reason, the stage_radix4 speed is multiplied by 2 to make it easier to compare with stage_radix2 (this is noted on the chart axis and in the console output).
1. stage_radix4_etalon
Reference variant used to check the other versions for correctness.
C code
void stage_radix4_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
myComplex *x_in = &data_in[0];
myComplex *y_in = &data_in[1];
myComplex *z_in = &data_in[2];
myComplex *w_in = &data_in[3];
myComplex *c_in = coefC;
myComplex *d_in = coefD;
myComplex *e_in = coefE;
myComplex *out_0 = &data_out[0*data_count/4];
myComplex *out_1 = &data_out[1*data_count/4];
myComplex *out_2 = &data_out[2*data_count/4];
myComplex *out_3 = &data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
// #pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
myComplex x = x_in[4*i];
myComplex y = y_in[4*i];
myComplex z = z_in[4*i];
myComplex w = w_in[4*i];
myComplex c = c_in[i];
myComplex d = d_in[i];
myComplex e = e_in[i];
myComplex cy = complex_mul(c, y);
myComplex dz = complex_mul(d, z);
myComplex ew = complex_mul(e, w);
myComplex add02 = complex_add( x, dz);
myComplex sub02 = complex_sub( x, dz);
myComplex add13 = complex_add(cy, ew);
myComplex sub13 = complex_sub(cy, ew);
myComplex sub13i = (myComplex){.real = -sub13.imag, .imag = sub13.real};
out_0[i] = complex_add(add02, add13);
out_1[i] = complex_sub(sub02, sub13i);
out_2[i] = complex_sub(add02, add13);
out_3[i] = complex_add(sub02, sub13i);
}
}
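One way to sanity-check the butterfly in this loop body is to set all three coefficients to 1 and compare the four outputs with a direct 4-point DFT. The sketch below does that. The cpx type and its helpers are hypothetical stand-ins for the article's myComplex, complex_add, complex_sub and complex_mul, whose definitions are outside this section, and the forward-transform sign convention is an assumption read off the dataflow.

```c
/* Hypothetical stand-in for the article's myComplex (fields real, imag). */
typedef struct { float real, imag; } cpx;

static cpx cpx_add(cpx a, cpx b) { return (cpx){ a.real + b.real, a.imag + b.imag }; }
static cpx cpx_sub(cpx a, cpx b) { return (cpx){ a.real - b.real, a.imag - b.imag }; }
static cpx cpx_mul(cpx a, cpx b)
{
    return (cpx){ a.real * b.real - a.imag * b.imag,
                  a.real * b.imag + a.imag * b.real };
}

/* Same dataflow as one iteration of the reference loop: in = (x, y, z, w). */
static void radix4_butterfly(const cpx in[4], cpx c, cpx d, cpx e, cpx out[4])
{
    cpx cy = cpx_mul(c, in[1]);
    cpx dz = cpx_mul(d, in[2]);
    cpx ew = cpx_mul(e, in[3]);
    cpx add02 = cpx_add(in[0], dz), sub02 = cpx_sub(in[0], dz);
    cpx add13 = cpx_add(cy, ew),    sub13 = cpx_sub(cy, ew);
    cpx sub13i = { -sub13.imag, sub13.real };   /* multiplication by i */
    out[0] = cpx_add(add02, add13);
    out[1] = cpx_sub(sub02, sub13i);
    out[2] = cpx_sub(add02, add13);
    out[3] = cpx_add(sub02, sub13i);
}

/* Direct 4-point DFT: X[k] = sum over n of a[n] * (-i)^(k*n). */
static void dft4(const cpx a[4], cpx X[4])
{
    static const cpx w4[4] = { {1, 0}, {0, -1}, {-1, 0}, {0, 1} };
    for (int k = 0; k < 4; ++k) {
        X[k] = (cpx){ 0.0f, 0.0f };
        for (int n = 0; n < 4; ++n)
            X[k] = cpx_add(X[k], cpx_mul(a[n], w4[(k * n) % 4]));
    }
}

/* Returns 1 when the butterfly with unit coefficients matches the DFT. */
static int radix4_matches_dft4(void)
{
    const cpx a[4] = { {1, 2}, {3, -1}, {-2, 0.5f}, {0.5f, 4} };
    const cpx one = { 1, 0 };
    cpx got[4], want[4];
    radix4_butterfly(a, one, one, one, got);
    dft4(a, want);
    for (int k = 0; k < 4; ++k)
        if (got[k].real != want[k].real || got[k].imag != want[k].imag)
            return 0;
    return 1;
}
```

The test values are small multiples of 0.5, so every intermediate result is exact in single precision and the comparison can use plain equality. With arbitrary data a tolerance would be needed.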
Main loop in assembly
.L868:
{
fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=4, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=3, asz=3, abs=8, disp=0
}
{
fapb ct=1, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
}
.L236:
{
loop_mode
fmul_adds,0,sm %b[66], %b[72], %b[54], %b[1]
fsub_rsubs,1,sm %b[12], %b[71], %b[55], %b[5]
fsub_adds,2,sm %b[12], %b[71], %b[55], %b[45]
fmuls,4,sm %b[59], %b[58], %b[52]
fsubs,5,sm %b[79], %b[75], %b[53]
movaw,3 area=0, ind=4, am=0, be=0, %b[0]
}
{
loop_mode
fmul_rsubs,0,sm %b[66], %b[60], %b[82], %b[71]
fadd_adds,1,sm %b[24], %b[83], %b[90], %b[72]
fadd_rsubs,2,sm %b[24], %b[83], %b[90], %b[76]
fmuls,4,sm %b[59], %b[70], %b[80]
fadds,5,sm %b[79], %b[75], %b[88]
movaw,1 area=2, ind=4, am=0, be=0, %b[55]
movaw,2 area=0, ind=0, am=0, be=0, %b[12]
movaw,3 area=0, ind=24, am=0, be=0, %b[54]
}
{
loop_mode
fmul_rsubs,0,sm %b[42], %b[17], %b[84], %b[75]
fmul_rsubs,1,sm %b[32], %b[67], %b[85], %b[79]
staaw,2 %b[36], %aad3[ %aasti7 ]
fmuls,3,sm %b[29], %b[46], %b[82]
fmuls,4,sm %b[35], %b[23], %b[83]
staaw,5 %b[39], %aad1[ %aasti5 ]
movaw,0 area=2, ind=0, am=1, be=0, %b[60]
movaw,1 area=1, ind=0, am=0, be=0, %b[24]
movaw,2 area=0, ind=16, am=0, be=0, %b[59]
movaw,3 area=0, ind=28, am=0, be=0, %b[66]
}
{
loop_mode
fmul_adds,0,sm %b[40], %b[46], %b[86], %b[39]
fmul_adds,1,sm %b[32], %b[25], %b[87], %b[67]
staaw,2 %b[11], %aad4[ %aasti8 + _f32s,_lts0 0x4 ]
fmuls,3,sm %b[27], %b[13], %b[84]
fmuls,4,sm %b[35], %b[65], %b[85]
staaw,5 %b[51], %aad2[ %aasti6 + _f32s,_lts0 0x4 ]
movaw,0 area=1, ind=4, am=1, be=0, %b[29]
movaw,1 area=0, ind=0, am=0, be=0, %b[36]
movaw,2 area=0, ind=20, am=0, be=0, %b[17]
movaw,3 area=0, ind=12, am=0, be=0, %b[42]
}
{
loop_mode
fsub_rsubs,0,sm %b[22], %b[81], %b[48], %b[32]
fsub_adds,1,sm %b[22], %b[81], %b[48], %b[35]
staaw,2 %b[7], %aad3[ %aasti7 + _f32s,_lts0 0x4 ]
incr,2 %aaincr3
fsubs,4,sm %b[3], %b[43], %b[46]
staaw,5 %b[47], %aad1[ %aasti5 + _f32s,_lts0 0x4 ]
incr,5 %aaincr3
movaw,1 area=0, ind=4, am=1, be=0, %b[25]
movaw,3 area=0, ind=8, am=1, be=0, %b[11]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
fadd_adds,0,sm %b[10], %b[69], %b[50], %b[7]
fadd_rsubs,1,sm %b[10], %b[69], %b[50], %b[47]
staaw,2 %b[74], %aad4[ %aasti8 ]
incr,2 %aaincr3
fadds,4,sm %b[43], %b[3], %b[48]
staaw,5 %b[78], %aad2[ %aasti6 ]
incr,5 %aaincr3
}
Theoretical throughput: 4 complex numbers (8 bytes each) per 6 cycles, i.e. 4 × 8 / 6 = 5.33 bytes/cycle
Doubled theoretical throughput (one stage_radix4 pass does the work of two stage_radix2 passes): 10.67 bytes/cycle
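The throughput figures quoted after each assembly listing can be reproduced with a small helper. This is only a sketch: the function names are hypothetical, and it assumes that each wide instruction in the loop body issues in one cycle and that one complex number occupies 8 bytes (two floats), which is how the numbers in this article appear to be derived.

```c
/* Size of one single-precision complex number: two 4-byte floats. */
enum { COMPLEX_BYTES = 8 };

/* Bytes processed per cycle when one pass through the loop body
   (cycles_per_iteration wide instructions) handles
   complex_per_iteration complex values. */
static double theoretical_bytes_per_cycle(int complex_per_iteration,
                                          int cycles_per_iteration)
{
    return (double)complex_per_iteration * COMPLEX_BYTES / cycles_per_iteration;
}

/* Radix-2-equivalent figure for a radix-4 stage:
   one radix-4 pass replaces two radix-2 passes. */
static double doubled_bytes_per_cycle(int complex_per_iteration,
                                      int cycles_per_iteration)
{
    return 2.0 * theoretical_bytes_per_cycle(complex_per_iteration,
                                             cycles_per_iteration);
}
```

For the loop above, theoretical_bytes_per_cycle(4, 6) gives about 5.33 and doubled_bytes_per_cycle(4, 6) about 10.67.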
Speed measurements

2. stage_radix4_etalon_unroll2
Here the loop is unrolled by a factor of 2 with #pragma unroll(2).
C code
void stage_radix4_etalon_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
myComplex *x_in = &data_in[0];
myComplex *y_in = &data_in[1];
myComplex *z_in = &data_in[2];
myComplex *w_in = &data_in[3];
myComplex *c_in = coefC;
myComplex *d_in = coefD;
myComplex *e_in = coefE;
myComplex *out_0 = &data_out[0*data_count/4];
myComplex *out_1 = &data_out[1*data_count/4];
myComplex *out_2 = &data_out[2*data_count/4];
myComplex *out_3 = &data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(2)
// #pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
myComplex x = x_in[4*i];
myComplex y = y_in[4*i];
myComplex z = z_in[4*i];
myComplex w = w_in[4*i];
myComplex c = c_in[i];
myComplex d = d_in[i];
myComplex e = e_in[i];
myComplex cy = complex_mul(c, y);
myComplex dz = complex_mul(d, z);
myComplex ew = complex_mul(e, w);
myComplex add02 = complex_add( x, dz);
myComplex sub02 = complex_sub( x, dz);
myComplex add13 = complex_add(cy, ew);
myComplex sub13 = complex_sub(cy, ew);
myComplex sub13i = (myComplex){.real = -sub13.imag, .imag = sub13.real};
out_0[i] = complex_add(add02, add13);
out_1[i] = complex_sub(sub02, sub13i);
out_2[i] = complex_sub(add02, add13);
out_3[i] = complex_add(sub02, sub13i);
}
}
Main loop in assembly
.L2050:
{
fapb ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=4, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=3, asz=4, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=4, abs=16, disp=0
}
.L913:
{
loop_mode
insfd,0,sm %b[67], %r7, %b[70], %b[0]
pfmul_rsubs,1,sm %b[78], %b[5], %g16, %b[30]
pfadd_adds,2,sm %b[50], %b[32], %b[77], %b[5]
pshufb,3,sm %b[22], %b[47], %r0, %b[67]
insfd,4,sm %b[30], %r7, %b[31], %b[1]
pfmuls,5,sm %b[83], %b[58], %g16
}
{
loop_mode
insfd,0,sm %b[24], %r7, %b[49], %b[78]
pfsub_adds,1,sm %b[50], %b[32], %b[81], %b[31]
pfsub_adds,2,sm %b[6], %b[25], %b[71], %b[24]
pshufb,3,sm %b[61], %b[8], %r0, %b[83]
pshufb,4,sm %b[26], %b[33], %r0, %b[87]
pfmuls,5,sm %b[83], %b[3], %b[70]
}
{
loop_mode
insfd,0,sm %b[63], %r7, %b[10], %b[71]
pfsub_rsubs,1,sm %b[50], %b[32], %b[81], %b[36]
pfsub_rsubs,2,sm %b[6], %b[25], %b[71], %b[35]
insfd,3,sm %b[76], %r7, %b[79], %b[10]
pshufb,4,sm %b[37], %b[38], %r0, %b[76]
}
{
loop_mode
insfd,0,sm %b[85], %r7, %b[86], %b[43]
pfadd_rsubs,1,sm %b[50], %b[32], %b[77], %b[39]
pfadd_rsubs,2,sm %b[6], %b[25], %b[75], %b[32]
insfd,3,sm %b[80], %r7, %b[84], %b[40]
pshufb,4,sm %b[34], %b[41], %r0, %b[77]
}
{
loop_mode
pfmuls,0,sm %b[67], %b[43], %b[25]
pfadd_adds,1,sm %b[6], %b[25], %b[75], %b[49]
staad,2 %b[87], %aad1[ %aasti5 + _f32s,_lts0 0x8 ]
pfsubs,3,sm %b[64], %b[21], %b[79]
pshufb,4,sm %b[51], %b[7], %r0, %b[84]
pfadds,5,sm %b[11], %b[20], %b[75]
movad,0 area=2, ind=0, am=0, be=0, %b[6]
movaw,1 area=0, ind=24, am=0, be=0, %b[81]
movaw,3 area=0, ind=24, am=0, be=0, %b[80]
}
{
loop_mode
pfmuls,0,sm %b[83], %b[40], %b[50]
pfmul_adds,1,sm %b[78], %b[45], %b[69], %b[17]
staad,2 %b[76], %aad3[ %aasti7 + _f32s,_lts0 0x8 ]
pfsubs,3,sm %b[11], %b[20], %b[69]
insfd,4,sm %b[59], %r7, %b[17], %b[76]
movad,0 area=1, ind=0, am=0, be=0, %b[45]
movad,1 area=1, ind=8, am=1, be=0, %b[20]
movad,3 area=1, ind=0, am=0, be=0, %b[11]
}
{
loop_mode
insfd,0,sm %b[72], %r7, %b[82], %b[54]
pfmul_adds,1,sm %b[71], %b[42], %g17, %b[60]
staad,2 %b[77], %aad2[ %aasti6 + _f32s,_lts0 0x8 ]
pfmuls,3,sm %b[83], %b[14], %g17
insfd,4,sm %b[73], %r7, %b[74], %b[42]
pfmuls,5,sm %b[67], %b[10], %b[67]
movad,0 area=2, ind=8, am=1, be=0, %b[59]
movaw,1 area=0, ind=4, am=0, be=0, %b[66]
movad,2 area=1, ind=8, am=1, be=0, %b[53]
movaw,3 area=0, ind=4, am=0, be=0, %b[63]
}
{
loop_mode
insfd,0,sm %b[26], %r7, %b[33], %b[83]
pfmul_rsubs,1,sm %b[71], %b[16], %b[52], %b[16]
staad,2 %b[84], %aad4[ %aasti8 + _f32s,_lts0 0x8 ]
insfd,4,sm %b[37], %r7, %b[38], %b[82]
pfadds,5,sm %b[21], %b[64], %b[73]
movaw,0 area=0, ind=0, am=0, be=0, %b[72]
movaw,1 area=0, ind=8, am=0, be=0, %b[77]
movaw,2 area=0, ind=0, am=0, be=0, %b[71]
movaw,3 area=0, ind=8, am=0, be=0, %b[74]
}
{
loop_mode
insfd,0,sm %b[51], %r7, %b[7], %b[85]
pfmul_rsubs,1,sm %b[78], %b[12], %b[27], %b[7]
staad,2 %b[83], %aad1[ %aasti5 ]
incr,2 %aaincr3
insfd,4,sm %b[34], %r7, %b[41], %b[86]
staad,5 %b[82], %aad3[ %aasti7 ]
incr,5 %aaincr3
movaw,0 area=0, ind=28, am=0, be=0, %b[82]
movaw,1 area=0, ind=12, am=0, be=0, %b[84]
movaw,2 area=0, ind=28, am=0, be=0, %b[78]
movaw,3 area=0, ind=12, am=0, be=0, %b[83]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmul_adds,1,sm %b[76], %b[58], %b[70], %b[21]
staad,2 %b[85], %aad4[ %aasti8 ]
incr,2 %aaincr3
insfd,3,sm %b[80], %r7, %b[81], %b[12]
pshufb,4,sm %b[57], %b[15], %r0, %b[81]
staad,5 %b[86], %aad2[ %aasti6 ]
incr,5 %aaincr3
movaw,0 area=0, ind=16, am=0, be=0, %b[27]
movaw,1 area=0, ind=20, am=1, be=0, %b[80]
movaw,2 area=0, ind=16, am=0, be=0, %b[26]
movaw,3 area=0, ind=20, am=1, be=0, %b[70]
}
Theoretical throughput: 8 complex numbers per 10 cycles, i.e. 8 × 8 / 10 = 6.4 bytes/cycle
Doubled theoretical throughput: 12.8 bytes/cycle
Speed measurements

We see a speedup.
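As a rough illustration of what unrolling by 2 means at the source level, here is a minimal sketch on a toy loop. The names scale_plain and scale_unrolled2 are hypothetical, and the transformation the compiler actually performs for the unroll pragma may differ in its details (for example, in how the tail is handled).

```c
#include <stddef.h>

/* Plain loop: one element per iteration. */
static void scale_plain(const float *in, float *out, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = k * in[i];
}

/* The same loop unrolled by 2 by hand: two independent elements per
   iteration, plus a tail step for odd n. */
static void scale_unrolled2(const float *in, float *out, size_t n, float k)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        out[i]     = k * in[i];
        out[i + 1] = k * in[i + 1];
    }
    if (i < n)
        out[i] = k * in[i];
}
```

Both functions produce the same output. The unrolled body simply gives the scheduler two independent iterations to interleave, which is the effect visible in the longer loop body of the listing above.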
3. stage_radix4_simd64
The computation follows the same approach as stage_radix2_simd64.
C code
void stage_radix4_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
uint64_t *x_in = (uint64_t*)&data_in[0];
uint64_t *y_in = (uint64_t*)&data_in[1];
uint64_t *z_in = (uint64_t*)&data_in[2];
uint64_t *w_in = (uint64_t*)&data_in[3];
uint64_t *c_in = (uint64_t*)coefC;
uint64_t *d_in = (uint64_t*)coefD;
uint64_t *e_in = (uint64_t*)coefE;
uint64_t *out_0 = (uint64_t*)&data_out[0*data_count/4];
uint64_t *out_1 = (uint64_t*)&data_out[1*data_count/4];
uint64_t *out_2 = (uint64_t*)&data_out[2*data_count/4];
uint64_t *out_3 = (uint64_t*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/4; ++i)
{
uint64_t x = x_in[4*i];
uint64_t y = y_in[4*i];
uint64_t z = z_in[4*i];
uint64_t w = w_in[4*i];
uint64_t c = c_in[i];
uint64_t d = d_in[i];
uint64_t e = e_in[i];
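uint64_t conj_c = __builtin_e2k_pxord(c, 1ULL<<63); // flip the sign of the upper float (imaginary part); 1ULL avoids shifting into the sign bit of a signed type
uint64_t conj_d = __builtin_e2k_pxord(d, 1ULL<<63);
uint64_t conj_e = __builtin_e2k_pxord(e, 1ULL<<63);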
uint64_t swap_c = __builtin_e2k_pshufb(0, c, 0x0302010007060504);
uint64_t swap_d = __builtin_e2k_pshufb(0, d, 0x0302010007060504);
uint64_t swap_e = __builtin_e2k_pshufb(0, e, 0x0302010007060504);
uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
uint64_t dz_real = __builtin_e2k_pfmuls(conj_d, z);
uint64_t ew_real = __builtin_e2k_pfmuls(conj_e, w);
uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
uint64_t dz_imag = __builtin_e2k_pfmuls(swap_d, z);
uint64_t ew_imag = __builtin_e2k_pfmuls(swap_e, w);
uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);
uint64_t dz = __builtin_e2k_pfhadds(dz_real, dz_imag);
uint64_t ew = __builtin_e2k_pfhadds(ew_real, ew_imag);
uint64_t add02 = __builtin_e2k_pfadds( x, dz);
uint64_t sub02 = __builtin_e2k_pfsubs( x, dz);
uint64_t add13 = __builtin_e2k_pfadds(cy, ew);
uint64_t sub13 = __builtin_e2k_pfsubs(cy, ew);
//uint64_t conj_sub13 = __builtin_e2k_pxord(sub13, 1LL<<63);
//uint64_t sub13i = __builtin_e2k_pshufb(0, conj_sub13, 0x0302010007060504);
uint64_t swap_sub13 = __builtin_e2k_pshufb(0, sub13, 0x0302010007060504);
uint64_t sub13i = __builtin_e2k_pxord(swap_sub13, 1LL<<31);
out_0[i] = __builtin_e2k_pfadds(add02, add13);
out_1[i] = __builtin_e2k_pfsubs(sub02, sub13i);
out_2[i] = __builtin_e2k_pfsubs(add02, add13);
out_3[i] = __builtin_e2k_pfadds(sub02, sub13i);
}
}
Main loop in assembly
.L2675:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=5, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=3, abs=8, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
}
.L2317:
{
loop_mode
pfadds,0,sm %b[70], %b[67], %b[73]
pfadd_adds,1,sm %b[40], %b[47], %b[75], %b[1]
pfadd_rsubs,2,sm %b[40], %b[47], %b[75], %b[0]
pfsubs,3,sm %b[66], %b[63], %b[39]
xord,4,sm %b[51], %r0, %b[58]
xord,5,sm %b[33], %r0, %b[79]
}
{
loop_mode
pfmuls,0,sm %b[60], %b[21], %b[70]
pfsub_rsubs,1,sm %b[40], %b[47], %b[81], %b[67]
pfsub_adds,2,sm %b[40], %b[47], %b[81], %b[33]
pshufb,3,sm 0x0, %b[18], %r8, %b[75]
pshufb,4,sm 0x0, %b[57], %r8, %b[84]
xord,5,sm %b[18], %r0, %b[82]
}
{
loop_mode
pfmuls,0,sm %b[79], %b[13], %b[85]
pfmul_hadds,1,sm %b[78], %b[15], %b[87], %b[57]
staad,2 %b[5], %aad4[ %aasti8 ]
incr,2 %aaincr0
pfmul_hadds,3,sm %b[84], %b[25], %b[74], %b[60]
pshufb,4,sm 0x0, %b[41], %r8, %b[81]
staad,5 %b[4], %aad2[ %aasti6 ]
incr,5 %aaincr0
movad,1 area=0, ind=0, am=1, be=0, %b[47]
movad,2 area=0, ind=0, am=0, be=0, %b[18]
movad,3 area=0, ind=16, am=0, be=0, %b[40]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmuls,0,sm %b[82], %b[54], %b[78]
pfmul_hadds,1,sm %b[77], %b[56], %b[80], %b[41]
staad,2 %b[71], %aad3[ %aasti7 ]
incr,2 %aaincr0
pshufb,3,sm 0x0, %b[31], %r8, %b[74]
xord,4,sm %b[83], %r7, %b[79]
staad,5 %b[37], %aad1[ %aasti5 ]
incr,5 %aaincr0
movad,0 area=2, ind=0, am=1, be=0, %b[25]
movad,1 area=1, ind=0, am=1, be=0, %b[4]
movad,2 area=0, ind=24, am=0, be=0, %b[5]
movad,3 area=0, ind=8, am=1, be=0, %b[15]
}
Theoretical throughput: 4 complex numbers per 4 cycles, i.e. 4 × 8 / 4 = 8 bytes/cycle
Doubled theoretical throughput: 16 bytes/cycle
Speed measurements

We see a speedup.
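To make the data flow of the 64-bit version easier to follow, here is a portable scalar model of the same steps. It is a sketch under assumptions: the pair2f type and all helper names are hypothetical, and the behaviour attributed to each intrinsic is inferred from how stage_radix4_simd64 uses it (real part in the lower 32 bits, imaginary part in the upper 32 bits), not taken from the instruction set documentation.

```c
/* One packed 64-bit value modelled as two floats:
   lo = real part, hi = imaginary part. */
typedef struct { float lo, hi; } pair2f;

/* Element-wise product (the role pfmuls plays in the listing). */
static pair2f mul2(pair2f a, pair2f b) { return (pair2f){ a.lo * b.lo, a.hi * b.hi }; }

/* Horizontal add: lo = sum of a's halves, hi = sum of b's halves (role of pfhadds). */
static pair2f hadd2(pair2f a, pair2f b) { return (pair2f){ a.lo + a.hi, b.lo + b.hi }; }

/* Swap the two halves (role of pshufb with mask 0x0302010007060504). */
static pair2f swap2(pair2f a) { return (pair2f){ a.hi, a.lo }; }

/* Negate the upper half (role of XOR with bit 63). */
static pair2f neg_hi(pair2f a) { return (pair2f){ a.lo, -a.hi }; }

/* Negate the lower half (role of XOR with bit 31). */
static pair2f neg_lo(pair2f a) { return (pair2f){ -a.lo, a.hi }; }

/* Complex product c*y assembled from the steps above. */
static pair2f cmul_packed(pair2f c, pair2f y)
{
    pair2f real_terms = mul2(neg_hi(c), y); /* ( c.re*y.re, -c.im*y.im ) */
    pair2f imag_terms = mul2(swap2(c), y);  /* ( c.im*y.re,  c.re*y.im ) */
    return hadd2(real_terms, imag_terms);   /* ( re(c*y),    im(c*y)   ) */
}

/* Multiplication by i: (re, im) -> (-im, re), as done for sub13i. */
static pair2f mul_i_packed(pair2f a) { return neg_lo(swap2(a)); }
```

For c = 1 + 2i and y = 3 + 4i, cmul_packed yields -5 + 10i, and mul_i_packed of that result yields -10 - 5i, matching the ordinary complex arithmetic of the scalar version.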
4. stage_radix4_simd128
The computation follows the same approach as stage_radix2_simd128.
C code
void stage_radix4_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *zw0_in = (__v2di*)&data_in[2];
__v2di *xy1_in = (__v2di*)&data_in[4];
__v2di *zw1_in = (__v2di*)&data_in[6];
__v2di *c_in = (__v2di*)coefC;
__v2di *d_in = (__v2di*)coefD;
__v2di *e_in = (__v2di*)coefE;
__v2di *out_0 = (__v2di*)&data_out[0*data_count/4];
__v2di *out_1 = (__v2di*)&data_out[1*data_count/4];
__v2di *out_2 = (__v2di*)&data_out[2*data_count/4];
__v2di *out_3 = (__v2di*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/8; ++i)
{
__v2di xy0 = xy0_in[4*i];
__v2di zw0 = zw0_in[4*i];
__v2di xy1 = xy1_in[4*i];
__v2di zw1 = zw1_in[4*i];
__v2di c = c_in[i];
__v2di d = d_in[i];
__v2di e = e_in[i];
__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di conj_c = __builtin_e2k_qpxor(c, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_d = __builtin_e2k_qpxor(d, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_e = __builtin_e2k_qpxor(e, (__v2di){1LL<<63, 1LL<<63});
__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d = __builtin_e2k_qpshufb(d, d, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e = __builtin_e2k_qpshufb(e, e, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
__v2di dz_real = __builtin_e2k_qpfmuls(conj_d, z);
__v2di ew_real = __builtin_e2k_qpfmuls(conj_e, w);
__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
__v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z);
__v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w);
__v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
__v2di dz_rrii = __builtin_e2k_qpfhadds(dz_real, dz_imag);
__v2di ew_rrii = __builtin_e2k_qpfhadds(ew_real, ew_imag);
__v2di cy = __builtin_e2k_qpshufb(cy_rrii, cy_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz = __builtin_e2k_qpshufb(dz_rrii, dz_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew = __builtin_e2k_qpshufb(ew_rrii, ew_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di add02 = __builtin_e2k_qpfadds( x, dz);
__v2di sub02 = __builtin_e2k_qpfsubs( x, dz);
__v2di add13 = __builtin_e2k_qpfadds(cy, ew);
__v2di sub13 = __builtin_e2k_qpfsubs(cy, ew);
//__v2di conj_sub13 = __builtin_e2k_qpxor(sub13, (__v2di){1LL<<63, 1LL<<63});
//__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13 = __builtin_e2k_qpshufb(sub13, sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31});
out_0[i] = __builtin_e2k_qpfadds(add02, add13);
out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i);
out_2[i] = __builtin_e2k_qpfsubs(add02, add13);
out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i);
}
}
Main loop in assembly
.L3309:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
}
.L2728:
{
loop_mode
qpfadd_rsubs,0,sm %b[13], %b[38], %b[21], %b[0]
qpshufb,1,sm %b[44], %b[44], %r9, %b[8]
qpshufb,3,sm %b[16], %b[16], %r10, %b[25]
qpshufb,4,sm %b[17], %b[17], %r10, %b[48]
qpfmuls,5,sm %b[60], %b[49], %b[1]
}
{
loop_mode
qpfsub_rsubs,0,sm %b[13], %b[38], %b[58], %b[44]
qpshufb,1,sm %b[36], %b[36], %r9, %b[61]
qpshufb,3,sm %b[29], %b[52], %r12, %b[16]
qpfsub_adds,4,sm %b[13], %b[38], %b[58], %b[21]
qpfadds,5,sm %b[48], %b[25], %b[17]
}
{
loop_mode
qpfmul_hadds,0,sm %b[10], %b[51], %b[3], %b[13]
qpshufb,1,sm %b[56], %b[56], %r10, %b[36]
qpxor,3,sm %b[34], %r0, %b[38]
qpshufb,4,sm %b[59], %b[59], %r9, %b[52]
qpfmuls,5,sm %b[57], %b[45], %b[29]
}
{
loop_mode
qpfmul_hadds,0,sm %b[55], %b[47], %b[31], %b[10]
qpshufb,1,sm %b[37], %b[43], %r12, %b[3]
staaqp,2 %b[26], %aad4[ %aasti8 ]
incr,2 %aaincr0
qpxor,4,sm %b[52], %r7, %b[56]
qpfmuls,5,sm %b[38], %b[20], %b[51]
}
{
loop_mode
qpfmul_hadds,0,sm %b[61], %b[22], %b[53], %b[52]
qpxor,1,sm %b[4], %r0, %b[55]
staaqp,2 %b[2], %aad2[ %aasti6 ]
incr,2 %aaincr0
qpshufb,3,sm %b[35], %b[41], %r11, %b[47]
qpshufb,4,sm %b[27], %b[50], %r11, %b[43]
qpfsubs,5,sm %b[48], %b[25], %b[57]
movaqp,0 area=1, ind=0, am=1, be=0, %b[38]
movaqp,1 area=0, ind=0, am=0, be=0, %b[37]
movaqp,2 area=1, ind=0, am=1, be=0, %b[26]
movaqp,3 area=0, ind=0, am=0, be=0, %b[31]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfadd_adds,0,sm %b[11], %b[36], %b[19], %b[22]
qpshufb,1,sm %b[6], %b[6], %r9, %b[53]
staaqp,2 %b[46], %aad3[ %aasti7 ]
incr,2 %aaincr0
qpxor,3,sm %b[42], %r0, %b[58]
staaqp,5 %b[23], %aad1[ %aasti5 ]
incr,5 %aaincr0
movaqp,0 area=2, ind=0, am=1, be=0, %b[2]
movaqp,1 area=0, ind=16, am=1, be=0, %b[48]
movaqp,3 area=0, ind=16, am=1, be=0, %b[25]
}
Theoretical throughput: 8 complex numbers per 6 cycles, i.e. 8 × 8 / 6 = 10.67 bytes/cycle
Doubled theoretical throughput: 21.33 bytes/cycle
Speed measurements

We see a speedup.
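The 128-bit version processes two butterflies per iteration, so the inputs first have to be regrouped and the horizontal add leaves the result in an "rrii" order that must be shuffled back. The following scalar model sketches both steps. The quad4f type and the function names are hypothetical, and the lane semantics are inferred from the variable names and shuffle masks in stage_radix4_simd128 rather than from official documentation.

```c
/* One 128-bit register modelled as four floats; f[0] is the lowest lane. */
typedef struct { float f[4]; } quad4f;

/* Gather the first complex number of two registers:
   (x0, y0) and (x1, y1) -> (x0, x1). */
static quad4f low_halves(quad4f xy0, quad4f xy1)
{
    return (quad4f){{ xy0.f[0], xy0.f[1], xy1.f[0], xy1.f[1] }};
}

/* Element-wise product (role of qpfmuls). */
static quad4f mul4(quad4f a, quad4f b)
{
    quad4f r;
    for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] * b.f[i];
    return r;
}

/* Horizontal add of adjacent lanes: pairs of a fill the low half,
   pairs of b fill the high half (role of qpfhadds). */
static quad4f hadd4(quad4f a, quad4f b)
{
    return (quad4f){{ a.f[0] + a.f[1], a.f[2] + a.f[3],
                      b.f[0] + b.f[1], b.f[2] + b.f[3] }};
}

/* Two complex products at once; c and y hold (re0, im0, re1, im1). */
static quad4f cmul2_packed(quad4f c, quad4f y)
{
    quad4f conj_c = {{ c.f[0], -c.f[1], c.f[2], -c.f[3] }};  /* negate imaginary lanes */
    quad4f swap_c = {{ c.f[1],  c.f[0], c.f[3],  c.f[2] }};  /* swap re/im in each number */
    quad4f rrii   = hadd4(mul4(conj_c, y), mul4(swap_c, y)); /* (re0, re1, im0, im1) */
    return (quad4f){{ rrii.f[0], rrii.f[2], rrii.f[1], rrii.f[3] }}; /* (re0, im0, re1, im1) */
}
```

The final reordering is the extra shuffle per product that the 64-bit version does not need, which is one reason the speedup is less than a factor of two.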
5. stage_radix4_simd128_noConj
The instruction count is reduced in the same way as in stage_radix2_simd128_noConj.
C code
void stage_radix4_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *zw0_in = (__v2di*)&data_in[2];
__v2di *xy1_in = (__v2di*)&data_in[4];
__v2di *zw1_in = (__v2di*)&data_in[6];
__v2di *c_in = (__v2di*)coefC;
__v2di *d_in = (__v2di*)coefD;
__v2di *e_in = (__v2di*)coefE;
__v2di *out_0 = (__v2di*)&data_out[0*data_count/4];
__v2di *out_1 = (__v2di*)&data_out[1*data_count/4];
__v2di *out_2 = (__v2di*)&data_out[2*data_count/4];
__v2di *out_3 = (__v2di*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/8; ++i)
{
__v2di xy0 = xy0_in[4*i];
__v2di zw0 = zw0_in[4*i];
__v2di xy1 = xy1_in[4*i];
__v2di zw1 = zw1_in[4*i];
__v2di c = c_in[i];
__v2di d = d_in[i];
__v2di e = e_in[i];
__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d = __builtin_e2k_qpshufb(d, d, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e = __builtin_e2k_qpshufb(e, e, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy_real = __builtin_e2k_qpfmuls( c, y);
__v2di dz_real = __builtin_e2k_qpfmuls( d, z);
__v2di ew_real = __builtin_e2k_qpfmuls( e, w);
__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
__v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z);
__v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w);
__v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real);
__v2di dz_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz_real);
__v2di ew_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew_real);
__v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag);
__v2di dz_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz_imag);
__v2di ew_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew_imag);
__v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz = __builtin_e2k_qppermb(dz_ii, dz_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew = __builtin_e2k_qppermb(ew_ii, ew_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di add02 = __builtin_e2k_qpfadds( x, dz);
__v2di sub02 = __builtin_e2k_qpfsubs( x, dz);
__v2di add13 = __builtin_e2k_qpfadds(cy, ew);
__v2di sub13 = __builtin_e2k_qpfsubs(cy, ew);
//__v2di conj_sub13 = __builtin_e2k_qpxor(sub13, (__v2di){1LL<<63, 1LL<<63});
//__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13 = __builtin_e2k_qpshufb(sub13, sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31});
out_0[i] = __builtin_e2k_qpfadds(add02, add13);
out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i);
out_2[i] = __builtin_e2k_qpfsubs(add02, add13);
out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i);
}
}
Main loop in assembly
.L3939:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=4, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=4, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=4, abs=16, disp=0
}
.L3362:
{
loop_mode
qpfmul_hsubs,0,sm %b[61], %b[63], %r12, %b[1]
qpfmul_hadds,2,sm %b[64], %b[63], %r12, %b[0]
qpshufb,3,sm %b[8], %b[9], %r9, %b[40]
qpshufb,4,sm %b[19], %b[19], %r11, %b[31]
qpfadds,5,sm %b[37], %b[30], %b[26]
movaqp,0 area=1, ind=0, am=1, be=0, %b[17]
movaqp,1 area=0, ind=0, am=0, be=0, %b[7]
movaqp,3 area=0, ind=0, am=0, be=0, %b[6]
}
{
loop_mode
qpfmul_hsubs,0,sm %b[21], %b[42], %r12, %b[44]
qpfmul_hadds,2,sm %b[33], %b[42], %r12, %b[41]
qpshufb,3,sm %b[49], %b[52], %r9, %b[61]
qpshufb,4,sm %b[59], %b[59], %r11, %b[62]
qpfsubs,5,sm %b[37], %b[30], %b[58]
movaqp,0 area=2, ind=0, am=1, be=0, %b[57]
movaqp,1 area=0, ind=16, am=1, be=0, %b[50]
movaqp,3 area=0, ind=16, am=1, be=0, %b[47]
}
{
loop_mode
qpfmul_hsubs,0,sm %b[16], %b[39], %r12, %b[21]
qpshufb,1,sm %b[60], %b[60], %r11, %b[42]
qpfadd_adds,2,sm %b[24], %b[56], %b[28], %b[30]
qpshufb,3,sm %b[14], %b[14], %r11, %b[33]
qpshufb,4,sm %b[51], %b[54], %r10, %b[37]
staaqp,5 %b[34], %aad4[ %aasti8 ]
incr,5 %aaincr0
}
{
loop_mode
qpfmul_hadds,0,sm %b[35], %b[39], %r12, %b[34]
qpxor,1,sm %b[42], %r7, %b[60]
qpfadd_rsubs,2,sm %b[24], %b[56], %b[28], %b[51]
qpshufb,3,sm %b[10], %b[11], %r10, %b[16]
qppermb,4,sm %b[38], %b[25], %r0, %b[54]
staaqp,5 %b[55], %aad2[ %aasti6 ]
incr,5 %aaincr0
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfsub_rsubs,0,sm %b[24], %b[56], %b[60], %b[25]
qpfsub_adds,1,sm %b[24], %b[56], %b[60], %b[11]
staaqp,2 %b[29], %aad3[ %aasti7 ]
incr,2 %aaincr0
qppermb,3,sm %b[4], %b[5], %r0, %b[28]
qppermb,4,sm %b[45], %b[48], %r0, %b[35]
staaqp,5 %b[15], %aad1[ %aasti5 ]
incr,5 %aaincr0
movaqp,3 area=1, ind=0, am=1, be=0, %b[10]
}
Theoretical throughput: 8 complex numbers per 5 cycles, i.e. 8 × 8 / 5 = 12.8 bytes/cycle
Doubled theoretical throughput: 25.6 bytes/cycle
Speed measurements

We see a speedup.
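The noConj idea can be modelled in scalar C as follows: instead of negating the imaginary part of the coefficient before multiplying, the real part is obtained with a horizontal subtract of the plain product. This is a sketch with hypothetical names (quad4f, hsub4, cmul2_noconj). The lane positions, in particular that a horizontal operation with a zero first argument leaves its useful results in the upper two lanes, are an assumption inferred from the qppermb masks in the C code above.

```c
/* One 128-bit register modelled as four floats; f[0] is the lowest lane. */
typedef struct { float f[4]; } quad4f;

/* Element-wise product (role of qpfmuls). */
static quad4f mul4(quad4f a, quad4f b)
{
    quad4f r;
    for (int i = 0; i < 4; ++i) r.f[i] = a.f[i] * b.f[i];
    return r;
}

/* Horizontal add / subtract of adjacent lanes (roles of qpfhadds / qpfhsubs). */
static quad4f hadd4(quad4f a, quad4f b)
{
    return (quad4f){{ a.f[0] + a.f[1], a.f[2] + a.f[3],
                      b.f[0] + b.f[1], b.f[2] + b.f[3] }};
}
static quad4f hsub4(quad4f a, quad4f b)
{
    return (quad4f){{ a.f[0] - a.f[1], a.f[2] - a.f[3],
                      b.f[0] - b.f[1], b.f[2] - b.f[3] }};
}

/* Two complex products without an explicit conjugation step. */
static quad4f cmul2_noconj(quad4f c, quad4f y)
{
    const quad4f zero = {{ 0.0f, 0.0f, 0.0f, 0.0f }};
    quad4f swap_c = {{ c.f[1], c.f[0], c.f[3], c.f[2] }};
    quad4f rr = hsub4(zero, mul4(c, y));      /* (0, 0, re0, re1) */
    quad4f ii = hadd4(zero, mul4(swap_c, y)); /* (0, 0, im0, im1) */
    /* Interleave the upper lanes of rr and ii (role of qppermb). */
    return (quad4f){{ rr.f[2], ii.f[2], rr.f[3], ii.f[3] }};
}
```

In the assembly listing the multiply and the horizontal operation appear as single fused instructions (qpfmul_hsubs and qpfmul_hadds), so dropping the three sign-flip operations shortens the loop body.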
6. stage_radix4_simd128_noConj_unroll3
Here the loop is unrolled by a factor of 3 with #pragma unroll(3).
C code
void stage_radix4_simd128_noConj_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC, myComplex *coefD, myComplex *coefE)
{
__v2di *xy0_in = (__v2di*)&data_in[0];
__v2di *zw0_in = (__v2di*)&data_in[2];
__v2di *xy1_in = (__v2di*)&data_in[4];
__v2di *zw1_in = (__v2di*)&data_in[6];
__v2di *c_in = (__v2di*)coefC;
__v2di *d_in = (__v2di*)coefD;
__v2di *e_in = (__v2di*)coefE;
__v2di *out_0 = (__v2di*)&data_out[0*data_count/4];
__v2di *out_1 = (__v2di*)&data_out[1*data_count/4];
__v2di *out_2 = (__v2di*)&data_out[2*data_count/4];
__v2di *out_3 = (__v2di*)&data_out[3*data_count/4];
#pragma ivdep
#pragma unroll(3)
#pragma prefetch
for(int64_t i = 0; i < data_count/8; ++i)
{
__v2di xy0 = xy0_in[4*i];
__v2di zw0 = zw0_in[4*i];
__v2di xy1 = xy1_in[4*i];
__v2di zw1 = zw1_in[4*i];
__v2di c = c_in[i];
__v2di d = d_in[i];
__v2di e = e_in[i];
__v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di swap_c = __builtin_e2k_qpshufb(c, c, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d = __builtin_e2k_qpshufb(d, d, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e = __builtin_e2k_qpshufb(e, e, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy_real = __builtin_e2k_qpfmuls( c, y);
__v2di dz_real = __builtin_e2k_qpfmuls( d, z);
__v2di ew_real = __builtin_e2k_qpfmuls( e, w);
__v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
__v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z);
__v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w);
__v2di cy_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy_real);
__v2di dz_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz_real);
__v2di ew_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew_real);
__v2di cy_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy_imag);
__v2di dz_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz_imag);
__v2di ew_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew_imag);
__v2di cy = __builtin_e2k_qppermb(cy_ii, cy_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz = __builtin_e2k_qppermb(dz_ii, dz_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew = __builtin_e2k_qppermb(ew_ii, ew_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di add02 = __builtin_e2k_qpfadds( x, dz);
__v2di sub02 = __builtin_e2k_qpfsubs( x, dz);
__v2di add13 = __builtin_e2k_qpfadds(cy, ew);
__v2di sub13 = __builtin_e2k_qpfsubs(cy, ew);
//__v2di conj_sub13 = __builtin_e2k_qpxor(sub13, (__v2di){1LL<<63, 1LL<<63});
//__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13 = __builtin_e2k_qpshufb(sub13, sub13, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31});
out_0[i] = __builtin_e2k_qpfadds(add02, add13);
out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i);
out_2[i] = __builtin_e2k_qpfsubs(add02, add13);
out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i);
}
}
Main loop in assembly
.L5038:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=4, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=4, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=8, disp=128
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=2, abs=8, disp=160
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=2, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=2, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=3, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=4, asz=3, abs=16, disp=32
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=3, asz=3, abs=24, disp=32
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=2, asz=3, abs=24, disp=32
}
.L3992:
{
loop_mode
qpfsub_adds,0,sm %b[106], %b[103], %g16, %g17
qpfmul_hsubs,1,sm %b[10], %b[97], %r9, %b[0]
qpfadds,2,sm %g18, %g19, %g20
qpshufb,3,sm %b[82], %b[81], %r0, %g22
qpshufb,4,sm %b[82], %b[81], %r5, %g21
qpfmul_hadds,5,sm %g23, %g24, %r9, %b[1]
}
{
loop_mode
qpfsub_rsubs,0,sm %g25, %g26, %g27, %b[117]
qppermb,1,sm %g28, %b[117], %r1, %g29
qpfadds,2,sm %b[105], %b[108], %b[108]
qpshufb,3,sm %b[54], %b[54], %r3, %g31
qpshufb,4,sm %b[13], %b[13], %r3, %g30
qpfmul_hsubs,5,sm %b[13], %g21, %r9, %b[105]
}
{
loop_mode
qpfsub_adds,0,sm %b[116], %b[109], %b[113], %r2
qpfmul_hadds,1,sm %r11, %b[43], %r9, %b[3]
qpfsubs,2,sm %g29, %b[110], %r7
qppermb,3,sm %b[3], %b[64], %r1, %g19
qppermb,4,sm %b[118], %r10, %r1, %g18
qpfmul_hadds,5,sm %g30, %g21, %r9, %g21
}
{
loop_mode
qpfsub_rsubs,0,sm %b[106], %b[103], %g16, %b[110]
qpfmul_hsubs,1,sm %b[38], %b[114], %r9, %b[13]
qpfadds,2,sm %g29, %b[110], %b[113]
qppermb,3,sm %b[27], %b[16], %r1, %b[106]
qppermb,4,sm %b[21], %b[17], %r1, %b[103]
qpfsub_rsubs,5,sm %b[116], %b[109], %b[113], %b[109]
}
{
loop_mode
qpfmul_hsubs,0,sm %b[54], %g22, %r9, %r12
qpfmul_hsubs,1,sm %b[57], %b[41], %r9, %b[16]
staaqp,2 %b[101], %aad2[ %aasti6 + _f32s,_lts0 0x10 ]
qpshufb,3,sm %b[59], %b[42], %r5, %b[102]
qppermb,4,sm %b[102], %r12, %r1, %b[101]
qpfmul_hadds,5,sm %b[104], %b[114], %r9, %b[17]
}
{
loop_mode
qpfmul_hadds,0,sm %g31, %g22, %r9, %b[100]
qpshufb,1,sm %b[74], %b[73], %r0, %b[104]
staaqp,2 %b[115], %aad4[ %aasti8 + _f32s,_lts0 0x10 ]
qpshufb,3,sm %b[100], %b[100], %r3, %b[115]
qpshufb,4,sm %b[10], %b[10], %r3, %b[114]
qpfmul_hsubs,5,sm %b[100], %b[102], %r9, %b[10]
}
{
loop_mode
qpfmul_hsubs,0,sm %b[51], %b[112], %r9, %b[115]
qpfsubs,1,sm %g18, %g19, %b[27]
staaqp,2 %r13, %aad2[ %aasti6 + _f32s,_lts0 0x20 ]
qpshufb,3,sm %b[36], %b[36], %r3, %b[102]
qpshufb,4,sm %b[88], %b[87], %r5, %b[116]
qpfmul_hadds,5,sm %b[115], %b[102], %r9, %b[21]
}
{
loop_mode
qpfmul_hadds,0,sm %r14, %b[112], %r9, %g28
qpfsubs,1,sm %b[103], %b[106], %b[38]
staaqp,2 %b[119], %aad2[ %aasti6 ]
incr,2 %aaincr3
qpshufb,3,sm %b[31], %b[26], %r5, %b[112]
qpshufb,4,sm %b[60], %b[60], %r3, %g22
qpfmul_hsubs,5,sm %b[60], %b[116], %r9, %r10
}
{
loop_mode
qpfadd_rsubs,0,sm %b[104], %b[101], %b[113], %b[99]
qppermb,1,sm %b[5], %b[20], %r1, %b[107]
staaqp,2 %b[107], %aad4[ %aasti8 + _f32s,_lts0 0x20 ]
qppermb,3,sm %b[99], %b[2], %r1, %g26
qpshufb,4,sm %b[39], %b[34], %r0, %g25
qpfmul_hadds,5,sm %g22, %b[116], %r9, %b[116]
movaqp,1 area=5, ind=0, am=1, be=0, %b[2]
}
{
loop_mode
qpfmul_hadds,0,sm %b[114], %b[97], %r9, %b[97]
qpshufb,1,sm %b[92], %b[91], %r0, %b[114]
staaqp,2 %b[98], %aad4[ %aasti8 ]
incr,2 %aaincr3
qpshufb,3,sm %b[96], %b[95], %r0, %b[39]
qpshufb,4,sm %b[24], %b[24], %r3, %g23
qpfadd_adds,5,sm %b[104], %b[101], %b[113], %b[113]
movaqp,0 area=4, ind=16, am=1, be=0, %b[5]
movaqp,1 area=4, ind=0, am=0, be=0, %b[20]
movaqp,2 area=5, ind=0, am=1, be=0, %b[98]
movaqp,3 area=4, ind=0, am=1, be=0, %b[34]
}
{
loop_mode
qpfadd_adds,0,sm %b[114], %b[107], %g20, %b[96]
qpshufb,1,sm %r7, %r7, %r3, %g17
staaqp,2 %g17, %aad1[ %aasti5 + _f32s,_lts0 0x10 ]
qpshufb,3,sm %b[96], %b[95], %r5, %g24
qpshufb,4,sm %b[63], %b[46], %r0, %b[95]
qpfadd_rsubs,5,sm %g25, %g26, %b[108], %r13
movaqp,0 area=3, ind=16, am=1, be=0, %b[43]
movaqp,1 area=3, ind=0, am=0, be=0, %b[54]
movaqp,2 area=3, ind=16, am=1, be=0, %b[46]
movaqp,3 area=3, ind=0, am=0, be=0, %b[51]
}
{
loop_mode
qpfadd_rsubs,0,sm %b[114], %b[107], %g20, %b[117]
qpshufb,1,sm %b[40], %b[40], %r3, %g22
staaqp,2 %b[117], %aad3[ %aasti7 + _f32s,_lts0 0x20 ]
qpshufb,3,sm %b[57], %b[57], %r3, %r11
qpshufb,4,sm %b[29], %b[29], %r3, %g29
qpfmul_hsubs,5,sm %b[24], %g24, %r9, %b[60]
movaqp,0 area=2, ind=0, am=0, be=0, %b[24]
movaqp,1 area=2, ind=16, am=1, be=0, %b[40]
movaqp,2 area=2, ind=0, am=0, be=0, %b[29]
movaqp,3 area=2, ind=16, am=1, be=0, %b[57]
}
{
loop_mode
qpfadd_adds,0,sm %g25, %g26, %b[108], %b[105]
qpxor,1,sm %g22, %r4, %g27
staaqp,2 %b[111], %aad1[ %aasti5 + _f32s,_lts0 0x20 ]
qppermb,3,sm %g21, %b[105], %r1, %b[108]
qpxor,4,sm %g29, %r4, %b[111]
staaqp,5 %r2, %aad1[ %aasti5 ]
incr,5 %aaincr3
movaqp,0 area=1, ind=16, am=1, be=0, %b[73]
movaqp,1 area=1, ind=0, am=0, be=0, %b[63]
movaqp,2 area=1, ind=16, am=1, be=0, %b[74]
movaqp,3 area=1, ind=0, am=0, be=0, %b[64]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfsub_adds,0,sm %g25, %g26, %g27, %b[109]
qpxor,1,sm %g17, %r4, %g16
staaqp,2 %b[110], %aad3[ %aasti7 + _f32s,_lts0 0x10 ]
qpshufb,3,sm %b[49], %b[49], %r3, %r14
qpshufb,4,sm %b[70], %b[69], %r5, %b[110]
staaqp,5 %b[109], %aad3[ %aasti7 ]
incr,5 %aaincr3
movaqp,0 area=0, ind=0, am=0, be=0, %b[81]
movaqp,1 area=0, ind=16, am=1, be=0, %b[91]
movaqp,2 area=0, ind=0, am=0, be=0, %b[82]
movaqp,3 area=0, ind=16, am=1, be=0, %b[92]
}
Theoretical throughput: 24 complex numbers per 14 cycles, i.e. 24 × 8 / 14 = 13.71 bytes/cycle
Doubled theoretical throughput: 27.43 bytes/cycle
Speed measurements

We see a speedup in the middle of the plot, but a slowdown at its beginning and end.
Summary for stage_radix4

The FFT plot is here.
stage_radix4_2x
Diagram of the Stage algorithm for the "radix-4" 2x version.

One pass of stage_radix4_2x does the same work as 2 passes of stage_radix4, and one pass of stage_radix4 does the same work as 2 passes of stage_radix2. Therefore, to make the comparison with stage_radix2 easier, we multiply the speed of stage_radix4_2x by 4 (this is noted on the plot axis and in the console output).
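The scaling factor follows from counting passes over the data. A minimal sketch, with hypothetical function names and assuming that the number of points is a power of two whose exponent is divisible by the number of index bits resolved per pass:

```c
/* Number of passes over the data for an FFT of n points when each pass
   resolves log2_radix bits of the index:
   1 for stage_radix2, 2 for stage_radix4, 4 for stage_radix4_2x. */
static int passes_needed(unsigned long n, int log2_radix)
{
    int log2_n = 0;
    while (n > 1) {
        n >>= 1;
        ++log2_n;
    }
    return log2_n / log2_radix;
}

/* Speed of one pass rescaled to radix-2-equivalent work. */
static double radix2_equivalent_speed(double measured_speed, int log2_radix)
{
    return measured_speed * log2_radix;
}
```

For 65536 points this gives 16 passes of stage_radix2, 8 passes of stage_radix4 and 4 passes of stage_radix4_2x, hence the factors 2 and 4 applied to the measured speeds.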
1. stage_radix4_2x_etalon
Here the stage_radix4_etalon algorithm is manually unrolled by a factor of 2.
C code
void stage_radix4_2x_etalon(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
myComplex *x0_in = &data_in[ 0];
myComplex *y0_in = &data_in[ 1];
myComplex *z0_in = &data_in[ 2];
myComplex *w0_in = &data_in[ 3];
myComplex *x1_in = &data_in[ 4];
myComplex *y1_in = &data_in[ 5];
myComplex *z1_in = &data_in[ 6];
myComplex *w1_in = &data_in[ 7];
myComplex *x2_in = &data_in[ 8];
myComplex *y2_in = &data_in[ 9];
myComplex *z2_in = &data_in[10];
myComplex *w2_in = &data_in[11];
myComplex *x3_in = &data_in[12];
myComplex *y3_in = &data_in[13];
myComplex *z3_in = &data_in[14];
myComplex *w3_in = &data_in[15];
myComplex *c0a_in = &coefC_a[0];
myComplex *c1a_in = &coefC_a[1];
myComplex *c2a_in = &coefC_a[2];
myComplex *c3a_in = &coefC_a[3];
myComplex *d0a_in = &coefD_a[0];
myComplex *d1a_in = &coefD_a[1];
myComplex *d2a_in = &coefD_a[2];
myComplex *d3a_in = &coefD_a[3];
myComplex *e0a_in = &coefE_a[0];
myComplex *e1a_in = &coefE_a[1];
myComplex *e2a_in = &coefE_a[2];
myComplex *e3a_in = &coefE_a[3];
myComplex *c0b_in = &coefC_b[0*data_count/16];
myComplex *c1b_in = &coefC_b[1*data_count/16];
myComplex *c2b_in = &coefC_b[2*data_count/16];
myComplex *c3b_in = &coefC_b[3*data_count/16];
myComplex *d0b_in = &coefD_b[0*data_count/16];
myComplex *d1b_in = &coefD_b[1*data_count/16];
myComplex *d2b_in = &coefD_b[2*data_count/16];
myComplex *d3b_in = &coefD_b[3*data_count/16];
myComplex *e0b_in = &coefE_b[0*data_count/16];
myComplex *e1b_in = &coefE_b[1*data_count/16];
myComplex *e2b_in = &coefE_b[2*data_count/16];
myComplex *e3b_in = &coefE_b[3*data_count/16];
myComplex *out_0 = &data_out[ 0*data_count/16];
myComplex *out_1 = &data_out[ 1*data_count/16];
myComplex *out_2 = &data_out[ 2*data_count/16];
myComplex *out_3 = &data_out[ 3*data_count/16];
myComplex *out_4 = &data_out[ 4*data_count/16];
myComplex *out_5 = &data_out[ 5*data_count/16];
myComplex *out_6 = &data_out[ 6*data_count/16];
myComplex *out_7 = &data_out[ 7*data_count/16];
myComplex *out_8 = &data_out[ 8*data_count/16];
myComplex *out_9 = &data_out[ 9*data_count/16];
myComplex *out_10 = &data_out[10*data_count/16];
myComplex *out_11 = &data_out[11*data_count/16];
myComplex *out_12 = &data_out[12*data_count/16];
myComplex *out_13 = &data_out[13*data_count/16];
myComplex *out_14 = &data_out[14*data_count/16];
myComplex *out_15 = &data_out[15*data_count/16];
#pragma ivdep
#pragma unroll(1)
// #pragma prefetch
for(int64_t i = 0; i < data_count/16; ++i)
{
myComplex x0 = x0_in[16*i];
myComplex y0 = y0_in[16*i];
myComplex z0 = z0_in[16*i];
myComplex w0 = w0_in[16*i];
myComplex c0 = c0a_in[4*i];
myComplex d0 = d0a_in[4*i];
myComplex e0 = e0a_in[4*i];
myComplex x1 = x1_in[16*i];
myComplex y1 = y1_in[16*i];
myComplex z1 = z1_in[16*i];
myComplex w1 = w1_in[16*i];
myComplex c1 = c1a_in[4*i];
myComplex d1 = d1a_in[4*i];
myComplex e1 = e1a_in[4*i];
myComplex x2 = x2_in[16*i];
myComplex y2 = y2_in[16*i];
myComplex z2 = z2_in[16*i];
myComplex w2 = w2_in[16*i];
myComplex c2 = c2a_in[4*i];
myComplex d2 = d2a_in[4*i];
myComplex e2 = e2a_in[4*i];
myComplex x3 = x3_in[16*i];
myComplex y3 = y3_in[16*i];
myComplex z3 = z3_in[16*i];
myComplex w3 = w3_in[16*i];
myComplex c3 = c3a_in[4*i];
myComplex d3 = d3a_in[4*i];
myComplex e3 = e3a_in[4*i];
myComplex cy0 = complex_mul(c0, y0);
myComplex cy1 = complex_mul(c1, y1);
myComplex cy2 = complex_mul(c2, y2);
myComplex cy3 = complex_mul(c3, y3);
myComplex dz0 = complex_mul(d0, z0);
myComplex dz1 = complex_mul(d1, z1);
myComplex dz2 = complex_mul(d2, z2);
myComplex dz3 = complex_mul(d3, z3);
myComplex ew0 = complex_mul(e0, w0);
myComplex ew1 = complex_mul(e1, w1);
myComplex ew2 = complex_mul(e2, w2);
myComplex ew3 = complex_mul(e3, w3);
myComplex add02_0 = complex_add( x0, dz0);
myComplex add02_1 = complex_add( x1, dz1);
myComplex add02_2 = complex_add( x2, dz2);
myComplex add02_3 = complex_add( x3, dz3);
myComplex sub02_0 = complex_sub( x0, dz0);
myComplex sub02_1 = complex_sub( x1, dz1);
myComplex sub02_2 = complex_sub( x2, dz2);
myComplex sub02_3 = complex_sub( x3, dz3);
myComplex add13_0 = complex_add(cy0, ew0);
myComplex add13_1 = complex_add(cy1, ew1);
myComplex add13_2 = complex_add(cy2, ew2);
myComplex add13_3 = complex_add(cy3, ew3);
myComplex sub13_0 = complex_sub(cy0, ew0);
myComplex sub13_1 = complex_sub(cy1, ew1);
myComplex sub13_2 = complex_sub(cy2, ew2);
myComplex sub13_3 = complex_sub(cy3, ew3);
myComplex sub13i_0 = (myComplex){.real = -sub13_0.imag, .imag = sub13_0.real};
myComplex sub13i_1 = (myComplex){.real = -sub13_1.imag, .imag = sub13_1.real};
myComplex sub13i_2 = (myComplex){.real = -sub13_2.imag, .imag = sub13_2.real};
myComplex sub13i_3 = (myComplex){.real = -sub13_3.imag, .imag = sub13_3.real};
myComplex out0 = complex_add(add02_0, add13_0);
myComplex out1 = complex_add(add02_1, add13_1);
myComplex out2 = complex_add(add02_2, add13_2);
myComplex out3 = complex_add(add02_3, add13_3);
myComplex out4 = complex_sub(sub02_0, sub13i_0);
myComplex out5 = complex_sub(sub02_1, sub13i_1);
myComplex out6 = complex_sub(sub02_2, sub13i_2);
myComplex out7 = complex_sub(sub02_3, sub13i_3);
myComplex out8 = complex_sub(add02_0, add13_0);
myComplex out9 = complex_sub(add02_1, add13_1);
myComplex out10 = complex_sub(add02_2, add13_2);
myComplex out11 = complex_sub(add02_3, add13_3);
myComplex out12 = complex_add(sub02_0, sub13i_0);
myComplex out13 = complex_add(sub02_1, sub13i_1);
myComplex out14 = complex_add(sub02_2, sub13i_2);
myComplex out15 = complex_add(sub02_3, sub13i_3);
x0 = out0;
y0 = out1;
z0 = out2;
w0 = out3;
c0 = c0b_in[i];
d0 = d0b_in[i];
e0 = e0b_in[i];
x1 = out4;
y1 = out5;
z1 = out6;
w1 = out7;
c1 = c1b_in[i];
d1 = d1b_in[i];
e1 = e1b_in[i];
x2 = out8;
y2 = out9;
z2 = out10;
w2 = out11;
c2 = c2b_in[i];
d2 = d2b_in[i];
e2 = e2b_in[i];
x3 = out12;
y3 = out13;
z3 = out14;
w3 = out15;
c3 = c3b_in[i];
d3 = d3b_in[i];
e3 = e3b_in[i];
cy0 = complex_mul(c0, y0);
cy1 = complex_mul(c1, y1);
cy2 = complex_mul(c2, y2);
cy3 = complex_mul(c3, y3);
dz0 = complex_mul(d0, z0);
dz1 = complex_mul(d1, z1);
dz2 = complex_mul(d2, z2);
dz3 = complex_mul(d3, z3);
ew0 = complex_mul(e0, w0);
ew1 = complex_mul(e1, w1);
ew2 = complex_mul(e2, w2);
ew3 = complex_mul(e3, w3);
add02_0 = complex_add( x0, dz0);
add02_1 = complex_add( x1, dz1);
add02_2 = complex_add( x2, dz2);
add02_3 = complex_add( x3, dz3);
sub02_0 = complex_sub( x0, dz0);
sub02_1 = complex_sub( x1, dz1);
sub02_2 = complex_sub( x2, dz2);
sub02_3 = complex_sub( x3, dz3);
add13_0 = complex_add(cy0, ew0);
add13_1 = complex_add(cy1, ew1);
add13_2 = complex_add(cy2, ew2);
add13_3 = complex_add(cy3, ew3);
sub13_0 = complex_sub(cy0, ew0);
sub13_1 = complex_sub(cy1, ew1);
sub13_2 = complex_sub(cy2, ew2);
sub13_3 = complex_sub(cy3, ew3);
sub13i_0 = (myComplex){.real = -sub13_0.imag, .imag = sub13_0.real};
sub13i_1 = (myComplex){.real = -sub13_1.imag, .imag = sub13_1.real};
sub13i_2 = (myComplex){.real = -sub13_2.imag, .imag = sub13_2.real};
sub13i_3 = (myComplex){.real = -sub13_3.imag, .imag = sub13_3.real};
out_0[i] = complex_add(add02_0, add13_0);
out_1[i] = complex_add(add02_1, add13_1);
out_2[i] = complex_add(add02_2, add13_2);
out_3[i] = complex_add(add02_3, add13_3);
out_4[i] = complex_sub(sub02_0, sub13i_0);
out_5[i] = complex_sub(sub02_1, sub13i_1);
out_6[i] = complex_sub(sub02_2, sub13i_2);
out_7[i] = complex_sub(sub02_3, sub13i_3);
out_8[i] = complex_sub(add02_0, add13_0);
out_9[i] = complex_sub(add02_1, add13_1);
out_10[i] = complex_sub(add02_2, add13_2);
out_11[i] = complex_sub(add02_3, add13_3);
out_12[i] = complex_add(sub02_0, sub13i_0);
out_13[i] = complex_add(sub02_1, sub13i_1);
out_14[i] = complex_add(sub02_2, sub13i_2);
out_15[i] = complex_add(sub02_3, sub13i_3);
}
}
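Each half of the loop body above consists of four independent radix-4 butterflies (indices 0 to 3), and the second half feeds on the outputs of the first. A minimal sketch of one such butterfly, written out separately so it can be checked in isolation, is given here. The `myComplex` type and the `complex_add`, `complex_sub` and `complex_mul` helpers are defined earlier in the article; the definitions below are assumptions reconstructed from how the listing uses them, and `radix4_butterfly` is a hypothetical name.

```c
/* Assumed layout of the article's complex type. */
typedef struct { float real; float imag; } myComplex;

/* Assumed definitions of the article's helpers. */
static myComplex complex_add(myComplex a, myComplex b)
{
    return (myComplex){ .real = a.real + b.real, .imag = a.imag + b.imag };
}

static myComplex complex_sub(myComplex a, myComplex b)
{
    return (myComplex){ .real = a.real - b.real, .imag = a.imag - b.imag };
}

static myComplex complex_mul(myComplex a, myComplex b)
{
    return (myComplex){ .real = a.real * b.real - a.imag * b.imag,
                        .imag = a.real * b.imag + a.imag * b.real };
}

/* One radix-4 butterfly with the same operations as index 0 of the listing.
 * out[0], out[1], out[2], out[3] correspond to out0, out4, out8, out12. */
static void radix4_butterfly(myComplex x, myComplex y, myComplex z, myComplex w,
                             myComplex c, myComplex d, myComplex e,
                             myComplex out[4])
{
    myComplex cy = complex_mul(c, y);
    myComplex dz = complex_mul(d, z);
    myComplex ew = complex_mul(e, w);

    myComplex add02 = complex_add(x, dz);
    myComplex sub02 = complex_sub(x, dz);
    myComplex add13 = complex_add(cy, ew);
    myComplex sub13 = complex_sub(cy, ew);

    /* Multiply sub13 by i: (re, im) -> (-im, re). */
    myComplex sub13i = { .real = -sub13.imag, .imag = sub13.real };

    out[0] = complex_add(add02, add13);
    out[1] = complex_sub(sub02, sub13i);
    out[2] = complex_sub(add02, add13);
    out[3] = complex_add(sub02, sub13i);
}
```

With all three coefficients set to 1, the butterfly reduces to a plain 4-point DFT, which gives a quick sanity check: the inputs 1, 2, 3, 4 should produce 10, -2+2i, -2, -2-2i.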
Main loop in assembly
.L1379:
{
fapb ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=16, d=0, incr=3, ind=2, asz=1, abs=0, disp=16
}
{
fapb ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64
fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=32
}
{
fapb ct=0, dcd=0, fmt=3, mrng=0, d=0, incr=3, ind=4, asz=1, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=96
}
{
fapb ct=0, dcd=0, fmt=3, mrng=16, d=0, incr=3, ind=2, asz=1, abs=6, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=0, d=0, incr=3, ind=3, asz=1, abs=6, disp=0
}
{
fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=15, asz=2, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=8, d=1, incr=2, ind=0, asz=2, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=13, asz=2, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=14, asz=2, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=11, asz=2, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=12, asz=2, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=9, asz=2, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=10, asz=2, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=7, asz=2, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=8, asz=2, abs=24, disp=0
}
{
fapb ct=1, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=5, asz=2, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=3, mrng=8, d=0, incr=2, ind=6, asz=2, abs=28, disp=0
}
.L285:
{
loop_mode
disp %ctpr1, .L285
movaw,0 area=0, ind=24, am=0, be=0, %g17
movaw,1 area=0, ind=28, am=0, be=0, %g16
movaw,2 area=0, ind=8, am=0, be=0, %g19
movaw,3 area=0, ind=12, am=0, be=0, %g18
}
{
loop_mode
movaw,0 area=0, ind=16, am=0, be=0, %g21
movaw,1 area=0, ind=20, am=0, be=0, %g20
movaw,2 area=0, ind=0, am=1, be=0, %g23
movaw,3 area=0, ind=4, am=0, be=0, %g22
}
{
loop_mode
movaw,0 area=0, ind=8, am=0, be=0, %g25
movaw,1 area=0, ind=12, am=0, be=0, %g24
movaw,2 area=1, ind=24, am=0, be=0, %g27
movaw,3 area=1, ind=28, am=0, be=0, %g26
}
{
loop_mode
movaw,0 area=0, ind=0, am=1, be=0, %g29
movaw,1 area=0, ind=4, am=0, be=0, %g28
movaw,2 area=1, ind=16, am=0, be=0, %g31
movaw,3 area=1, ind=20, am=0, be=0, %g30
}
{
loop_mode
movaw,0 area=1, ind=24, am=0, be=0, %r3
movaw,1 area=1, ind=28, am=0, be=0, %r1
movaw,2 area=1, ind=8, am=0, be=0, %r5
movaw,3 area=1, ind=12, am=0, be=0, %r4
}
{
loop_mode
movaw,0 area=1, ind=16, am=0, be=0, %r9
movaw,1 area=1, ind=20, am=0, be=0, %r7
movaw,2 area=1, ind=0, am=1, be=0, %r42
movaw,3 area=1, ind=4, am=0, be=0, %r41
}
{
loop_mode
movaw,0 area=1, ind=8, am=0, be=0, %r44
movaw,1 area=1, ind=12, am=0, be=0, %r43
movaw,2 area=2, ind=24, am=0, be=0, %r46
movaw,3 area=2, ind=28, am=0, be=0, %r45
}
{
loop_mode
movaw,0 area=1, ind=0, am=1, be=0, %r48
movaw,1 area=1, ind=4, am=0, be=0, %r47
movaw,2 area=2, ind=16, am=0, be=0, %r50
movaw,3 area=2, ind=20, am=0, be=0, %r49
}
{
loop_mode
movaw,0 area=2, ind=24, am=0, be=0, %r52
movaw,1 area=2, ind=28, am=0, be=0, %r51
movaw,2 area=2, ind=8, am=0, be=0, %r54
movaw,3 area=2, ind=12, am=0, be=0, %r53
}
{
loop_mode
fmuls,0 %g23, %r3, %r57
fmuls,1 %g22, %r1, %r58
fmuls,2 %g23, %r1, %g23
fmuls,3 %g22, %r3, %g22
movaw,0 area=2, ind=16, am=0, be=0, %r56
movaw,1 area=2, ind=20, am=0, be=0, %r55
movaw,2 area=2, ind=0, am=1, be=0, %r3
movaw,3 area=2, ind=4, am=0, be=0, %r1
}
{
loop_mode
movaw,0 area=2, ind=8, am=0, be=0, %r60
movaw,1 area=2, ind=12, am=0, be=0, %r59
movaw,2 area=3, ind=24, am=0, be=0, %r62
movaw,3 area=3, ind=28, am=0, be=0, %r61
}
{
loop_mode
fmuls,0 %g19, %r46, %r63
fmuls,1 %g18, %r45, %b[0]
fmuls,2 %g19, %r45, %g19
fmuls,3 %g18, %r46, %g18
movaw,0 area=2, ind=0, am=1, be=0, %r46
movaw,1 area=2, ind=4, am=0, be=0, %r45
movaw,2 area=3, ind=16, am=0, be=0, %b[2]
movaw,3 area=3, ind=20, am=0, be=0, %b[1]
}
{
loop_mode
movaw,0 area=3, ind=8, am=0, be=0, %b[4]
movaw,1 area=3, ind=12, am=0, be=0, %b[3]
movaw,2 area=3, ind=8, am=0, be=0, %b[6]
movaw,3 area=3, ind=12, am=0, be=0, %b[5]
}
{
loop_mode
fmuls,0 %r52, %r54, %b[7]
fmuls,1 %r51, %r53, %b[8]
fmuls,2 %r52, %r53, %r52
fmuls,3 %r51, %r54, %r51
movaw,0 area=3, ind=0, am=1, be=0, %r54
movaw,1 area=3, ind=4, am=0, be=0, %r53
movaw,2 area=3, ind=0, am=1, be=0, %b[10]
movaw,3 area=3, ind=4, am=0, be=0, %b[9]
}
{
loop_mode
fmuls,0 %r56, %r44, %b[11]
fmuls,1 %r55, %r43, %b[12]
fmuls,2 %r56, %r43, %r43
fmuls,3 %r55, %r44, %r44
movaw,0 area=4, ind=0, am=1, be=0, %r56
movaw,1 area=4, ind=4, am=0, be=0, %r55
movaw,2 area=4, ind=0, am=1, be=0, %b[14]
movaw,3 area=4, ind=4, am=0, be=0, %b[13]
}
{
loop_mode
fmuls,0 %r60, %r5, %b[15]
fmuls,1 %r59, %r4, %b[16]
fmuls,2 %r60, %r4, %r4
fmuls,3 %r59, %r5, %r5
fmuls,4 %r61, %r49, %r59
fmuls,5 %r61, %r50, %r60
movaw,0 area=5, ind=0, am=1, be=0, %b[17]
movaw,1 area=5, ind=4, am=0, be=0, %r61
movaw,2 area=5, ind=0, am=1, be=0, %b[19]
movaw,3 area=5, ind=4, am=0, be=0, %b[18]
}
{
loop_mode
fmuls,0 %b[2], %r9, %b[20]
fmuls,1 %b[1], %r7, %b[21]
fmuls,2 %b[2], %r7, %r7
fmuls,3 %b[1], %r9, %r9
fmuls,4 %r62, %r50, %r50
fmuls,5 %r62, %r49, %r49
movaw,0 area=6, ind=0, am=1, be=0, %b[1]
movaw,1 area=6, ind=4, am=0, be=0, %r62
movaw,2 area=6, ind=0, am=1, be=0, %b[22]
movaw,3 area=6, ind=4, am=0, be=0, %b[2]
}
{
loop_mode
fmuls,0 %b[5], %g30, %b[23]
fmuls,1 %b[5], %g31, %b[5]
fmuls,2 %b[4], %g27, %b[24]
fmuls,3 %b[3], %g26, %b[25]
fmuls,4 %b[4], %g26, %g26
fmuls,5 %b[3], %g27, %g27
movaw,0 area=7, ind=0, am=1, be=0, %b[4]
movaw,1 area=7, ind=4, am=0, be=0, %b[3]
movaw,2 area=7, ind=0, am=1, be=0, %b[27]
movaw,3 area=7, ind=4, am=0, be=0, %b[26]
}
{
loop_mode
fmuls,0 %b[6], %g31, %g31
fmuls,1 %b[6], %g30, %g30
fsubs,2 %r57, %r58, %r57
fadds,3 %g23, %g22, %g22
fsubs,4 %r63, %b[0], %g23
fadds,5 %g19, %g18, %g18
movaw,0 area=8, ind=0, am=1, be=0, %r58
movaw,1 area=8, ind=4, am=0, be=0, %g19
movaw,2 area=8, ind=0, am=1, be=0, %b[0]
movaw,3 area=8, ind=4, am=0, be=0, %r63
}
{
loop_mode
fsubs,0 %b[7], %b[8], %b[6]
fadds,1 %r52, %r51, %r51
fsubs,2 %b[11], %b[12], %r52
fmuls,3 %r45, %g24, %b[7]
fmuls,4 %r46, %g24, %g24
fmuls,5 %r45, %g25, %r45
movaw,0 area=9, ind=0, am=1, be=0, %b[11]
movaw,1 area=9, ind=4, am=0, be=0, %b[8]
movaw,2 area=9, ind=0, am=1, be=0, %b[28]
movaw,3 area=9, ind=4, am=0, be=0, %b[12]
}
{
loop_mode
fadds,0 %r43, %r44, %r43
fsubs,1 %b[20], %b[21], %r44
fsubs,2 %b[15], %b[16], %b[15]
fsubs,3 %r50, %r59, %r50
fadds,4 %r49, %r60, %r49
fmuls,5 %r46, %g25, %g25
}
{
loop_mode
fadds,0 %r4, %r5, %r4
fmuls,1 %r53, %g16, %g27
fmuls,2 %r53, %g17, %r5
fadds,3 %g26, %g27, %g26
fmuls,4 %b[9], %g21, %r46
fmuls,5 %r54, %g17, %g17
}
{
loop_mode
fadds,0 %r7, %r9, %r7
fsubs,1 %g31, %b[23], %g31
fadds,2 %g30, %b[5], %g30
fmuls,3 %r54, %g16, %g16
fmuls,4 %b[10], %g21, %g21
fmuls,5 %b[9], %g20, %r9
}
{
loop_mode
fsubs,0 %b[24], %b[25], %r53
fmuls,1 %b[10], %g20, %g20
fadds,2 %r52, %r57, %r54
fadds,3 %g24, %r45, %g24
}
{
loop_mode
fadds,0 %r48, %r44, %r45
fsubs,1 %r48, %r44, %r44
fsubs,2 %g22, %r43, %r48
fsubs,3 %r1, %r49, %r59
fadds,4 %r3, %r50, %r60
fadds,5 %r1, %r49, %r1
}
{
loop_mode
fadds,0 %r43, %g22, %g22
fsubs,1 %g18, %r51, %r43
fadds,2 %b[6], %g23, %r49
fadds,3 %r51, %g18, %g18
fsubs,4 %b[6], %g23, %g23
fsubs,5 %r52, %r57, %r51
}
{
loop_mode
fsubs,0 %r3, %r50, %r3
fadds,1 %r47, %r7, %r50
fsubs,2 %r47, %r7, %r7
fsubs,3 %g25, %b[7], %g25
fsubs,4 %g21, %r9, %g21
}
{
loop_mode
fadds,0 %r4, %g26, %r9
fsubs,1 %g26, %r4, %g26
fadds,2 %r42, %g31, %r4
fsubs,3 %g17, %g27, %g17
fadds,4 %g16, %r5, %g16
}
{
loop_mode
fadds,0 %r41, %g30, %g27
fsubs,1 %r42, %g31, %g31
fadds,2 %b[15], %r53, %r5
fsubs,3 %r41, %g30, %g30
}
{
loop_mode
fsubs,0 %b[15], %r53, %r41
fadds,1 %g20, %r46, %g20
fsubs,2 %r44, %r48, %r42
fsubs,3 %r59, %g23, %r46
fadds,4 %r59, %g23, %g23
fadds,5 %r1, %g18, %r47
}
{
loop_mode
fsubs,0 %r45, %r54, %r52
fadds,1 %r44, %r48, %r44
fadds,2 %r45, %r54, %r45
fsubs,3 %r1, %g18, %g18
fsubs,4 %g29, %g21, %r1
fadds,5 %g29, %g21, %g21
}
{
loop_mode
fadds,0 %r50, %g22, %g29
fsubs,1 %r50, %g22, %g22
fadds,2 %r3, %r43, %r48
fadds,3 %r60, %r49, %r50
fsubs,4 %r60, %r49, %r49
fadds,5 %g24, %g16, %r53
}
{
loop_mode
fsubs,0 %r3, %r43, %r3
fadds,1 %g27, %r9, %r43
fsubs,2 %r7, %r51, %r54
fadds,3 %r7, %r51, %r7
fsubs,4 %g25, %g17, %r51
fsubs,5 %g16, %g24, %g16
}
{
loop_mode
fsubs,0 %g27, %r9, %g24
fadds,1 %g31, %g26, %g27
fadds,2 %r4, %r5, %r9
fadds,3 %g25, %g17, %g17
fmuls,4 %r62, %r46, %g25
fmuls,5 %b[1], %r46, %r46
}
{
loop_mode
fadds,0 %g30, %r41, %r57
fsubs,1 %g31, %g26, %g26
fsubs,2 %g30, %r41, %g30
fsubs,3 %r4, %r5, %g31
fmuls,4 %b[8], %g23, %r4
fmuls,5 %b[11], %g23, %g23
}
{
loop_mode
fadds,0 %g28, %g20, %r5
fsubs,1 %g28, %g20, %g20
fmuls,2 %b[22], %r42, %g28
fmuls,3 %b[2], %r42, %r41
fmuls,4 %b[18], %r47, %r42
fmuls,5 %b[19], %r47, %r47
}
{
loop_mode
fmuls,0 %b[4], %r52, %r59
fmuls,1 %b[3], %r52, %r52
fmuls,2 %b[28], %r44, %r60
fmuls,3 %b[12], %r44, %r44
fmuls,4 %r56, %r45, %b[5]
fmuls,5 %r55, %r45, %r45
}
{
loop_mode
fmuls,0 %r63, %g18, %b[6]
fmuls,1 %b[0], %g18, %g18
fmuls,2 %b[11], %r48, %b[7]
fmuls,3 %b[8], %r48, %r48
fmuls,4 %r55, %g29, %r55
fmuls,5 %r56, %g29, %g29
}
{
loop_mode
fmuls,0 %b[3], %g22, %r56
fmuls,1 %b[4], %g22, %g22
fmuls,2 %b[1], %r3, %b[1]
fmuls,3 %r62, %r3, %r3
fmuls,4 %b[12], %r7, %r62
fmuls,5 %b[28], %r7, %r7
}
{
loop_mode
fmuls,0 %b[19], %r50, %b[3]
fmuls,1 %b[18], %r50, %r50
fmuls,2 %b[0], %r49, %b[0]
fmuls,3 %r63, %r49, %r49
fmuls,4 %b[26], %g24, %r63
fmuls,5 %b[27], %g24, %g24
}
{
loop_mode
fmuls,0 %g19, %r57, %b[4]
fmuls,1 %r58, %r57, %r57
fmuls,2 %b[2], %r54, %b[2]
fmuls,3 %b[22], %r54, %r54
fmuls,4 %b[13], %r43, %b[8]
fmuls,5 %b[14], %r43, %r43
}
{
loop_mode
fmuls,0 %r61, %g30, %b[9]
fmuls,1 %b[17], %g30, %g30
fmuls,2 %r58, %g27, %r58
fmuls,3 %g19, %g27, %g19
fmuls,4 %b[14], %r9, %g27
fmuls,5 %b[13], %r9, %r9
}
{
loop_mode
fmuls,0 %b[17], %g26, %b[10]
fmuls,1 %r61, %g26, %g26
fmuls,2 %b[27], %g31, %r61
fmuls,3 %b[26], %g31, %g31
fsubs,4 %g20, %r51, %b[11]
fadds,5 %g20, %r51, %g20
}
{
loop_mode
fadds,0 %g21, %g17, %r51
fadds,1 %r5, %r53, %b[12]
fsubs,2 %r1, %g16, %b[13]
fsubs,3 %g21, %g17, %g17
fsubs,4 %r5, %r53, %g21
fadds,5 %r1, %g16, %g16
}
{
loop_mode
fsubs,0 %b[5], %r55, %r1
fadds,1 %g29, %r45, %g29
fsubs,2 %r59, %r56, %r5
fadds,3 %g22, %r52, %g22
fsubs,4 %b[7], %r4, %r4
fadds,5 %g23, %r48, %g23
}
{
loop_mode
fsubs,0 %b[3], %r42, %r42
fadds,1 %r47, %r50, %r45
fsubs,2 %b[1], %g25, %g25
fadds,3 %r46, %r3, %r3
fsubs,4 %b[0], %b[6], %r46
fadds,5 %g18, %r49, %g18
}
{
loop_mode
fsubs,0 %r61, %r63, %r47
fsubs,1 %r58, %b[4], %r48
fsubs,2 %g28, %b[2], %g28
fadds,3 %r54, %r41, %r41
fsubs,4 %r60, %r62, %r49
fadds,5 %r7, %r44, %r7
}
{
loop_mode
fsubs,0 %g27, %b[8], %g27
fadds,1 %r43, %r9, %r9
fadds,2 %g30, %g26, %g26
fadds,3 %g24, %g31, %g24
fadds,4 %r57, %g19, %g19
}
{
loop_mode
fsubs,0 %b[10], %b[9], %g30
fsubs,1 %r51, %r1, %r43
fadds,2 %r51, %r1, %g22
fadds,3 %g21, %g22, %g31
fsubs,4 %g21, %g22, %g21
}
{
loop_mode
fadds,0 %g17, %r5, %r1
fsubs,1 %g17, %r5, %g17
fadds,2 %b[12], %g29, %r5
}
{
loop_mode
fsubs,0 %b[12], %g29, %g29
fadds,1 %b[13], %g28, %r44
fsubs,2 %b[13], %g28, %g28
fadds,3 %g20, %r7, %r50
fsubs,4 %g16, %r49, %r51
fsubs,5 %g20, %r7, %g20
}
{
loop_mode
fsubs,0 %r48, %r4, %r7
fadds,1 %r47, %r46, %r49
fadds,2 %r48, %r4, %r4
fadds,3 %b[11], %r41, %r52
fadds,4 %g16, %r49, %g16
fsubs,5 %b[11], %r41, %r41
}
{
loop_mode
fsubs,0 %r47, %r46, %r46
fadds,1 %r9, %r45, %r47
fsubs,2 %g27, %r42, %r53
fadds,3 %g19, %g23, %r48
fsubs,4 %g18, %g24, %r54
fsubs,5 %g23, %g19, %g19
}
{
loop_mode
fsubs,0 %r45, %r9, %g23
fadds,1 %g27, %r42, %g27
fadds,2 %g26, %r3, %r9
fadds,3 %g24, %g18, %g18
fsubs,4 %r3, %g26, %g24
}
{
loop_mode
fadds,0 %g30, %g25, %g26
fsubs,1 %g30, %g25, %g25
}
{
loop_mode
fsubs,0 %r1, %r49, %g30
fadds,1 %r1, %r49, %r1
}
{
loop_mode
fsubs,0 %g21, %r46, %r3
fadds,1 %g21, %r46, %g21
fsubs,2 %g20, %r7, %r42
fadds,3 %r51, %g19, %r45
fadds,4 %r50, %r48, %r46
fsubs,5 %r51, %g19, %g19
}
{
loop_mode
fsubs,0 %r43, %g23, %r49
fadds,1 %r43, %g23, %g23
fadds,2 %g20, %r7, %g20
fsubs,3 %r50, %r48, %r7
fadds,4 %g16, %r4, %r43
fsubs,5 %g17, %r54, %r48
}
{
loop_mode
fadds,0 %r5, %r47, %r50
fsubs,1 %r5, %r47, %r5
fadds,2 %g29, %r53, %r47
fsubs,3 %g29, %r53, %g29
fsubs,4 %g16, %r4, %g16
fadds,5 %g17, %r54, %g17
}
{
loop_mode
fsubs,0 %g22, %g27, %r4
fadds,1 %r52, %r9, %r51
fadds,2 %g31, %g18, %r53
fsubs,3 %g28, %g24, %r54
fsubs,4 %r52, %r9, %r9
fsubs,5 %g31, %g18, %g18
}
{
loop_mode
fadds,0 %g28, %g24, %g24
fadds,1 %g22, %g27, %g22
fsubs,2 %r44, %g26, %g26
fadds,3 %r44, %g26, %g27
fsubs,4 %r41, %g25, %g28
fadds,5 %r41, %g25, %g25
}
{
loop_mode
stw,2 %r23, %r0, %g21
stw,5 %r32, %r0, %g30
}
{
loop_mode
stw,2 %r6, %r0, %r3
stw,5 %r39, %r0, %r1
}
{
loop_mode
stw,2 %r22, %r0, %g20
stw,5 %r21, %r0, %r42
}
{
loop_mode
stw,2 %r27, %r0, %r45
stw,5 %r38, %r0, %r49
}
{
loop_mode
stw,2 %r31, %r0, %g23
stw,5 %r34, %r0, %g19
}
{
loop_mode
stw,2 %r17, %r0, %r7
stw,5 %r13, %r0, %r46
}
{
loop_mode
stw,2 %r20, %r0, %r5
stw,5 %r2, %r0, %r50
}
{
loop_mode
stw,2 %r16, %r0, %r47
stw,5 %r29, %r0, %r48
}
{
loop_mode
stw,2 %r28, %r0, %g17
stw,5 %r12, %r0, %g29
}
{
loop_mode
stw,2 %r26, %r0, %g16
stw,5 %r30, %r0, %r43
}
{
loop_mode
stw,2 %r37, %r0, %r4
stw,5 %r14, %r0, %r53
}
{
loop_mode
stw,2 %r18, %r0, %g18
stw,5 %r35, %r0, %r54
}
{
loop_mode
stw,2 %r25, %r0, %g24
stw,5 %r15, %r0, %r51
}
{
loop_mode
stw,2 %r19, %r0, %r9
stw,5 %r40, %r0, %g22
}
{
loop_mode
stw,2 %r24, %r0, %g25
stw,5 %r11, %r0, %g28
}
{
loop_mode
ct %ctpr1 ? %NOT_LOOP_END
alc alcf=1, alct=1
stw,2 %r33, %r0, %g26
addd,3,sm 0x8, %r0, %r0
stw,5 %r36, %r0, %g27
}
Theoretical speed: 16 complex numbers in 77 cycles (16/77) = 1.66 bytes/cycle
Quadrupled theoretical speed: 6.65 bytes/cycle
Speed measurements

2. stage_radix4_2x_simd64
Here the stage_radix4_simd64 algorithm is manually unrolled by a factor of 2.
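The simd64 listing in this section computes every complex product with the same pattern: flip the sign of the imaginary lane of the coefficient (`pxord` with `1LL<<63`), swap its two lanes (`pshufb`), do two lane-wise multiplies (`pfmuls`) and finish with a horizontal add (`pfhadds`). A portable scalar model of that pattern may make it easier to follow. This is only a sketch: the `model_*` names are hypothetical, and the lane semantics are inferred from how the listing uses the builtins and from the scalar etalon version, not taken from compiler documentation.

```c
/* Model of one 64-bit packed register holding two floats.
 * Assumption: lane[0] (low 32 bits) is the real part and lane[1] (high
 * 32 bits) is the imaginary part. */
typedef struct { float lane[2]; } packed2f;

/* Models of the packed operations (hypothetical names, not the builtins). */

/* xor with 1LL<<63: negate the high lane, i.e. complex conjugate. */
static packed2f model_conj(packed2f a)
{
    return (packed2f){ { a.lane[0], -a.lane[1] } };
}

/* Byte shuffle with mask 0x0302010007060504: swap the two lanes. */
static packed2f model_swap(packed2f a)
{
    return (packed2f){ { a.lane[1], a.lane[0] } };
}

/* Lane-wise multiply. */
static packed2f model_mul(packed2f a, packed2f b)
{
    return (packed2f){ { a.lane[0] * b.lane[0], a.lane[1] * b.lane[1] } };
}

/* Horizontal add, assumed: low lane = sum of a's lanes,
 * high lane = sum of b's lanes. */
static packed2f model_hadd(packed2f a, packed2f b)
{
    return (packed2f){ { a.lane[0] + a.lane[1], b.lane[0] + b.lane[1] } };
}

/* Complex product c*y, following the conj/swap/mul/hadd steps of the listing. */
static packed2f model_complex_mul(packed2f c, packed2f y)
{
    packed2f re_terms = model_mul(model_conj(c), y); /* { c.re*y.re, -c.im*y.im } */
    packed2f im_terms = model_mul(model_swap(c), y); /* { c.im*y.re,  c.re*y.im } */
    return model_hadd(re_terms, im_terms);           /* { real part, imag part } */
}

/* Multiplication by i: swap the lanes, then negate the low lane
 * (xor with 1LL<<31 in the listing). */
static packed2f model_mul_i(packed2f a)
{
    packed2f s = model_swap(a);
    s.lane[0] = -s.lane[0];
    return s;
}
```

In this model, (1+2i)·(3+4i) comes out as -5+10i and i·(3+4i) as -4+3i, which matches the scalar `complex_mul` and the `sub13i` construction of the etalon version.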
C code
void stage_radix4_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
uint64_t *x0_in = (uint64_t*)&data_in[ 0];
uint64_t *y0_in = (uint64_t*)&data_in[ 1];
uint64_t *z0_in = (uint64_t*)&data_in[ 2];
uint64_t *w0_in = (uint64_t*)&data_in[ 3];
uint64_t *x1_in = (uint64_t*)&data_in[ 4];
uint64_t *y1_in = (uint64_t*)&data_in[ 5];
uint64_t *z1_in = (uint64_t*)&data_in[ 6];
uint64_t *w1_in = (uint64_t*)&data_in[ 7];
uint64_t *x2_in = (uint64_t*)&data_in[ 8];
uint64_t *y2_in = (uint64_t*)&data_in[ 9];
uint64_t *z2_in = (uint64_t*)&data_in[10];
uint64_t *w2_in = (uint64_t*)&data_in[11];
uint64_t *x3_in = (uint64_t*)&data_in[12];
uint64_t *y3_in = (uint64_t*)&data_in[13];
uint64_t *z3_in = (uint64_t*)&data_in[14];
uint64_t *w3_in = (uint64_t*)&data_in[15];
uint64_t *c0a_in = (uint64_t*)&coefC_a[0];
uint64_t *c1a_in = (uint64_t*)&coefC_a[1];
uint64_t *c2a_in = (uint64_t*)&coefC_a[2];
uint64_t *c3a_in = (uint64_t*)&coefC_a[3];
uint64_t *d0a_in = (uint64_t*)&coefD_a[0];
uint64_t *d1a_in = (uint64_t*)&coefD_a[1];
uint64_t *d2a_in = (uint64_t*)&coefD_a[2];
uint64_t *d3a_in = (uint64_t*)&coefD_a[3];
uint64_t *e0a_in = (uint64_t*)&coefE_a[0];
uint64_t *e1a_in = (uint64_t*)&coefE_a[1];
uint64_t *e2a_in = (uint64_t*)&coefE_a[2];
uint64_t *e3a_in = (uint64_t*)&coefE_a[3];
uint64_t *c0b_in = (uint64_t*)&coefC_b[0*data_count/16];
uint64_t *c1b_in = (uint64_t*)&coefC_b[1*data_count/16];
uint64_t *c2b_in = (uint64_t*)&coefC_b[2*data_count/16];
uint64_t *c3b_in = (uint64_t*)&coefC_b[3*data_count/16];
uint64_t *d0b_in = (uint64_t*)&coefD_b[0*data_count/16];
uint64_t *d1b_in = (uint64_t*)&coefD_b[1*data_count/16];
uint64_t *d2b_in = (uint64_t*)&coefD_b[2*data_count/16];
uint64_t *d3b_in = (uint64_t*)&coefD_b[3*data_count/16];
uint64_t *e0b_in = (uint64_t*)&coefE_b[0*data_count/16];
uint64_t *e1b_in = (uint64_t*)&coefE_b[1*data_count/16];
uint64_t *e2b_in = (uint64_t*)&coefE_b[2*data_count/16];
uint64_t *e3b_in = (uint64_t*)&coefE_b[3*data_count/16];
uint64_t *out_0 = (uint64_t*)&data_out[ 0*data_count/16];
uint64_t *out_1 = (uint64_t*)&data_out[ 1*data_count/16];
uint64_t *out_2 = (uint64_t*)&data_out[ 2*data_count/16];
uint64_t *out_3 = (uint64_t*)&data_out[ 3*data_count/16];
uint64_t *out_4 = (uint64_t*)&data_out[ 4*data_count/16];
uint64_t *out_5 = (uint64_t*)&data_out[ 5*data_count/16];
uint64_t *out_6 = (uint64_t*)&data_out[ 6*data_count/16];
uint64_t *out_7 = (uint64_t*)&data_out[ 7*data_count/16];
uint64_t *out_8 = (uint64_t*)&data_out[ 8*data_count/16];
uint64_t *out_9 = (uint64_t*)&data_out[ 9*data_count/16];
uint64_t *out_10 = (uint64_t*)&data_out[10*data_count/16];
uint64_t *out_11 = (uint64_t*)&data_out[11*data_count/16];
uint64_t *out_12 = (uint64_t*)&data_out[12*data_count/16];
uint64_t *out_13 = (uint64_t*)&data_out[13*data_count/16];
uint64_t *out_14 = (uint64_t*)&data_out[14*data_count/16];
uint64_t *out_15 = (uint64_t*)&data_out[15*data_count/16];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/16; ++i)
{
uint64_t x0 = x0_in[16*i];
uint64_t y0 = y0_in[16*i];
uint64_t z0 = z0_in[16*i];
uint64_t w0 = w0_in[16*i];
uint64_t c0 = c0a_in[4*i];
uint64_t d0 = d0a_in[4*i];
uint64_t e0 = e0a_in[4*i];
uint64_t x1 = x1_in[16*i];
uint64_t y1 = y1_in[16*i];
uint64_t z1 = z1_in[16*i];
uint64_t w1 = w1_in[16*i];
uint64_t c1 = c1a_in[4*i];
uint64_t d1 = d1a_in[4*i];
uint64_t e1 = e1a_in[4*i];
uint64_t x2 = x2_in[16*i];
uint64_t y2 = y2_in[16*i];
uint64_t z2 = z2_in[16*i];
uint64_t w2 = w2_in[16*i];
uint64_t c2 = c2a_in[4*i];
uint64_t d2 = d2a_in[4*i];
uint64_t e2 = e2a_in[4*i];
uint64_t x3 = x3_in[16*i];
uint64_t y3 = y3_in[16*i];
uint64_t z3 = z3_in[16*i];
uint64_t w3 = w3_in[16*i];
uint64_t c3 = c3a_in[4*i];
uint64_t d3 = d3a_in[4*i];
uint64_t e3 = e3a_in[4*i];
uint64_t conj_c0 = __builtin_e2k_pxord(c0, 1LL<<63);
uint64_t conj_c1 = __builtin_e2k_pxord(c1, 1LL<<63);
uint64_t conj_c2 = __builtin_e2k_pxord(c2, 1LL<<63);
uint64_t conj_c3 = __builtin_e2k_pxord(c3, 1LL<<63);
uint64_t conj_d0 = __builtin_e2k_pxord(d0, 1LL<<63);
uint64_t conj_d1 = __builtin_e2k_pxord(d1, 1LL<<63);
uint64_t conj_d2 = __builtin_e2k_pxord(d2, 1LL<<63);
uint64_t conj_d3 = __builtin_e2k_pxord(d3, 1LL<<63);
uint64_t conj_e0 = __builtin_e2k_pxord(e0, 1LL<<63);
uint64_t conj_e1 = __builtin_e2k_pxord(e1, 1LL<<63);
uint64_t conj_e2 = __builtin_e2k_pxord(e2, 1LL<<63);
uint64_t conj_e3 = __builtin_e2k_pxord(e3, 1LL<<63);
uint64_t swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504);
uint64_t swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504);
uint64_t swap_c2 = __builtin_e2k_pshufb(0, c2, 0x0302010007060504);
uint64_t swap_c3 = __builtin_e2k_pshufb(0, c3, 0x0302010007060504);
uint64_t swap_d0 = __builtin_e2k_pshufb(0, d0, 0x0302010007060504);
uint64_t swap_d1 = __builtin_e2k_pshufb(0, d1, 0x0302010007060504);
uint64_t swap_d2 = __builtin_e2k_pshufb(0, d2, 0x0302010007060504);
uint64_t swap_d3 = __builtin_e2k_pshufb(0, d3, 0x0302010007060504);
uint64_t swap_e0 = __builtin_e2k_pshufb(0, e0, 0x0302010007060504);
uint64_t swap_e1 = __builtin_e2k_pshufb(0, e1, 0x0302010007060504);
uint64_t swap_e2 = __builtin_e2k_pshufb(0, e2, 0x0302010007060504);
uint64_t swap_e3 = __builtin_e2k_pshufb(0, e3, 0x0302010007060504);
uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
uint64_t cy2_real = __builtin_e2k_pfmuls(conj_c2, y2);
uint64_t cy3_real = __builtin_e2k_pfmuls(conj_c3, y3);
uint64_t dz0_real = __builtin_e2k_pfmuls(conj_d0, z0);
uint64_t dz1_real = __builtin_e2k_pfmuls(conj_d1, z1);
uint64_t dz2_real = __builtin_e2k_pfmuls(conj_d2, z2);
uint64_t dz3_real = __builtin_e2k_pfmuls(conj_d3, z3);
uint64_t ew0_real = __builtin_e2k_pfmuls(conj_e0, w0);
uint64_t ew1_real = __builtin_e2k_pfmuls(conj_e1, w1);
uint64_t ew2_real = __builtin_e2k_pfmuls(conj_e2, w2);
uint64_t ew3_real = __builtin_e2k_pfmuls(conj_e3, w3);
uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
uint64_t cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2);
uint64_t cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3);
uint64_t dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0);
uint64_t dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1);
uint64_t dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2);
uint64_t dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3);
uint64_t ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0);
uint64_t ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1);
uint64_t ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2);
uint64_t ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3);
uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
uint64_t cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag);
uint64_t cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag);
uint64_t dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag);
uint64_t dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag);
uint64_t dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag);
uint64_t dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag);
uint64_t ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag);
uint64_t ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag);
uint64_t ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag);
uint64_t ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag);
uint64_t add02_0 = __builtin_e2k_pfadds( x0, dz0);
uint64_t add02_1 = __builtin_e2k_pfadds( x1, dz1);
uint64_t add02_2 = __builtin_e2k_pfadds( x2, dz2);
uint64_t add02_3 = __builtin_e2k_pfadds( x3, dz3);
uint64_t sub02_0 = __builtin_e2k_pfsubs( x0, dz0);
uint64_t sub02_1 = __builtin_e2k_pfsubs( x1, dz1);
uint64_t sub02_2 = __builtin_e2k_pfsubs( x2, dz2);
uint64_t sub02_3 = __builtin_e2k_pfsubs( x3, dz3);
uint64_t add13_0 = __builtin_e2k_pfadds(cy0, ew0);
uint64_t add13_1 = __builtin_e2k_pfadds(cy1, ew1);
uint64_t add13_2 = __builtin_e2k_pfadds(cy2, ew2);
uint64_t add13_3 = __builtin_e2k_pfadds(cy3, ew3);
uint64_t sub13_0 = __builtin_e2k_pfsubs(cy0, ew0);
uint64_t sub13_1 = __builtin_e2k_pfsubs(cy1, ew1);
uint64_t sub13_2 = __builtin_e2k_pfsubs(cy2, ew2);
uint64_t sub13_3 = __builtin_e2k_pfsubs(cy3, ew3);
//uint64_t conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63);
//uint64_t conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63);
//uint64_t conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63);
//uint64_t conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63);
//uint64_t sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504);
//uint64_t sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504);
//uint64_t sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504);
//uint64_t sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504);
uint64_t swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504);
uint64_t swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504);
uint64_t swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504);
uint64_t swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504);
uint64_t sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31);
uint64_t sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31);
uint64_t sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31);
uint64_t sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31);
uint64_t out0 = __builtin_e2k_pfadds(add02_0, add13_0);
uint64_t out1 = __builtin_e2k_pfadds(add02_1, add13_1);
uint64_t out2 = __builtin_e2k_pfadds(add02_2, add13_2);
uint64_t out3 = __builtin_e2k_pfadds(add02_3, add13_3);
uint64_t out4 = __builtin_e2k_pfsubs(sub02_0, sub13i_0);
uint64_t out5 = __builtin_e2k_pfsubs(sub02_1, sub13i_1);
uint64_t out6 = __builtin_e2k_pfsubs(sub02_2, sub13i_2);
uint64_t out7 = __builtin_e2k_pfsubs(sub02_3, sub13i_3);
uint64_t out8 = __builtin_e2k_pfsubs(add02_0, add13_0);
uint64_t out9 = __builtin_e2k_pfsubs(add02_1, add13_1);
uint64_t out10 = __builtin_e2k_pfsubs(add02_2, add13_2);
uint64_t out11 = __builtin_e2k_pfsubs(add02_3, add13_3);
uint64_t out12 = __builtin_e2k_pfadds(sub02_0, sub13i_0);
uint64_t out13 = __builtin_e2k_pfadds(sub02_1, sub13i_1);
uint64_t out14 = __builtin_e2k_pfadds(sub02_2, sub13i_2);
uint64_t out15 = __builtin_e2k_pfadds(sub02_3, sub13i_3);
x0 = out0;
y0 = out1;
z0 = out2;
w0 = out3;
c0 = c0b_in[i];
d0 = d0b_in[i];
e0 = e0b_in[i];
x1 = out4;
y1 = out5;
z1 = out6;
w1 = out7;
c1 = c1b_in[i];
d1 = d1b_in[i];
e1 = e1b_in[i];
x2 = out8;
y2 = out9;
z2 = out10;
w2 = out11;
c2 = c2b_in[i];
d2 = d2b_in[i];
e2 = e2b_in[i];
x3 = out12;
y3 = out13;
z3 = out14;
w3 = out15;
c3 = c3b_in[i];
d3 = d3b_in[i];
e3 = e3b_in[i];
conj_c0 = __builtin_e2k_pxord(c0, 1LL<<63);
conj_c1 = __builtin_e2k_pxord(c1, 1LL<<63);
conj_c2 = __builtin_e2k_pxord(c2, 1LL<<63);
conj_c3 = __builtin_e2k_pxord(c3, 1LL<<63);
conj_d0 = __builtin_e2k_pxord(d0, 1LL<<63);
conj_d1 = __builtin_e2k_pxord(d1, 1LL<<63);
conj_d2 = __builtin_e2k_pxord(d2, 1LL<<63);
conj_d3 = __builtin_e2k_pxord(d3, 1LL<<63);
conj_e0 = __builtin_e2k_pxord(e0, 1LL<<63);
conj_e1 = __builtin_e2k_pxord(e1, 1LL<<63);
conj_e2 = __builtin_e2k_pxord(e2, 1LL<<63);
conj_e3 = __builtin_e2k_pxord(e3, 1LL<<63);
swap_c0 = __builtin_e2k_pshufb(0, c0, 0x0302010007060504);
swap_c1 = __builtin_e2k_pshufb(0, c1, 0x0302010007060504);
swap_c2 = __builtin_e2k_pshufb(0, c2, 0x0302010007060504);
swap_c3 = __builtin_e2k_pshufb(0, c3, 0x0302010007060504);
swap_d0 = __builtin_e2k_pshufb(0, d0, 0x0302010007060504);
swap_d1 = __builtin_e2k_pshufb(0, d1, 0x0302010007060504);
swap_d2 = __builtin_e2k_pshufb(0, d2, 0x0302010007060504);
swap_d3 = __builtin_e2k_pshufb(0, d3, 0x0302010007060504);
swap_e0 = __builtin_e2k_pshufb(0, e0, 0x0302010007060504);
swap_e1 = __builtin_e2k_pshufb(0, e1, 0x0302010007060504);
swap_e2 = __builtin_e2k_pshufb(0, e2, 0x0302010007060504);
swap_e3 = __builtin_e2k_pshufb(0, e3, 0x0302010007060504);
cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
cy2_real = __builtin_e2k_pfmuls(conj_c2, y2);
cy3_real = __builtin_e2k_pfmuls(conj_c3, y3);
dz0_real = __builtin_e2k_pfmuls(conj_d0, z0);
dz1_real = __builtin_e2k_pfmuls(conj_d1, z1);
dz2_real = __builtin_e2k_pfmuls(conj_d2, z2);
dz3_real = __builtin_e2k_pfmuls(conj_d3, z3);
ew0_real = __builtin_e2k_pfmuls(conj_e0, w0);
ew1_real = __builtin_e2k_pfmuls(conj_e1, w1);
ew2_real = __builtin_e2k_pfmuls(conj_e2, w2);
ew3_real = __builtin_e2k_pfmuls(conj_e3, w3);
cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2);
cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3);
dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0);
dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1);
dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2);
dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3);
ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0);
ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1);
ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2);
ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3);
cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag);
cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag);
dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag);
dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag);
dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag);
dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag);
ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag);
ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag);
ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag);
ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag);
add02_0 = __builtin_e2k_pfadds( x0, dz0);
add02_1 = __builtin_e2k_pfadds( x1, dz1);
add02_2 = __builtin_e2k_pfadds( x2, dz2);
add02_3 = __builtin_e2k_pfadds( x3, dz3);
sub02_0 = __builtin_e2k_pfsubs( x0, dz0);
sub02_1 = __builtin_e2k_pfsubs( x1, dz1);
sub02_2 = __builtin_e2k_pfsubs( x2, dz2);
sub02_3 = __builtin_e2k_pfsubs( x3, dz3);
add13_0 = __builtin_e2k_pfadds(cy0, ew0);
add13_1 = __builtin_e2k_pfadds(cy1, ew1);
add13_2 = __builtin_e2k_pfadds(cy2, ew2);
add13_3 = __builtin_e2k_pfadds(cy3, ew3);
sub13_0 = __builtin_e2k_pfsubs(cy0, ew0);
sub13_1 = __builtin_e2k_pfsubs(cy1, ew1);
sub13_2 = __builtin_e2k_pfsubs(cy2, ew2);
sub13_3 = __builtin_e2k_pfsubs(cy3, ew3);
//conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63);
//conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63);
//conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63);
//conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63);
//sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504);
//sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504);
//sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504);
//sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504);
swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504);
swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504);
swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504);
swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504);
sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31);
sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31);
sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31);
sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31);
out_0[i] = __builtin_e2k_pfadds(add02_0, add13_0);
out_1[i] = __builtin_e2k_pfadds(add02_1, add13_1);
out_2[i] = __builtin_e2k_pfadds(add02_2, add13_2);
out_3[i] = __builtin_e2k_pfadds(add02_3, add13_3);
out_4[i] = __builtin_e2k_pfsubs(sub02_0, sub13i_0);
out_5[i] = __builtin_e2k_pfsubs(sub02_1, sub13i_1);
out_6[i] = __builtin_e2k_pfsubs(sub02_2, sub13i_2);
out_7[i] = __builtin_e2k_pfsubs(sub02_3, sub13i_3);
out_8[i] = __builtin_e2k_pfsubs(add02_0, add13_0);
out_9[i] = __builtin_e2k_pfsubs(add02_1, add13_1);
out_10[i] = __builtin_e2k_pfsubs(add02_2, add13_2);
out_11[i] = __builtin_e2k_pfsubs(add02_3, add13_3);
out_12[i] = __builtin_e2k_pfadds(sub02_0, sub13i_0);
out_13[i] = __builtin_e2k_pfadds(sub02_1, sub13i_1);
out_14[i] = __builtin_e2k_pfadds(sub02_2, sub13i_2);
out_15[i] = __builtin_e2k_pfadds(sub02_3, sub13i_3);
}
}
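To make the intrinsic code above easier to check, the arithmetic of one butterfly can be written with C99 complex numbers. This is only a reference sketch, not the code from the article. The name `radix4_butterfly_ref` is hypothetical. It assumes each 64-bit value holds one complex number with the real part in the low 32 bits and the imaginary part in the high 32 bits, which is what the sign masks `1LL<<63` and `1LL<<31` suggest.

```c
#include <complex.h>

/* Hypothetical scalar reference for one radix-4 butterfly.
 * x, y, z, w are the four inputs; c, d, e are the twiddle factors
 * applied to y, z, w (mirroring cy, dz, ew in the intrinsic code). */
static void radix4_butterfly_ref(float complex x, float complex y,
                                 float complex z, float complex w,
                                 float complex c, float complex d,
                                 float complex e, float complex out[4])
{
    float complex cy = c * y;
    float complex dz = d * z;
    float complex ew = e * w;

    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;

    /* Swap re/im and negate the new real part: this is i * sub13,
     * the role of the pshufb + pxord(1LL<<31) pair. */
    float complex sub13i = -cimagf(sub13) + crealf(sub13) * I;

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all twiddle factors equal to 1 this reduces to a plain 4-point forward DFT, which gives a simple way to sanity-check the sign conventions.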
Main loop in assembly
.L3676:
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=1, abs=0, disp=16
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=32
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=4, asz=1, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=96
}
{
fapb ct=0, dcd=0, fmt=4, mrng=16, d=0, incr=2, ind=2, asz=1, abs=6, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=3, asz=1, abs=6, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=2, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=2, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=2, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=2, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=2, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=2, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=2, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=2, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=2, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=2, abs=24, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=2, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=2, abs=28, disp=0
}
.L2949:
{
loop_mode
pfmul_hadds,0,sm %b[111], %b[62], %b[104], %b[62]
pfmuls,1,sm %b[63], %b[57], %b[63]
pfmuls,2,sm %b[97], %b[82], %b[97]
pshufb,3,sm 0x0, %b[29], %r26, %b[99]
}
{
loop_mode
pfsubs,1,sm %b[68], %b[67], %b[103]
pfmul_hadds,2,sm %b[100], %b[52], %b[103], %b[100]
pshufb,3,sm 0x0, %b[32], %r26, %b[104]
xord,4,sm %b[86], %r0, %b[105]
xord,5,sm %b[105], %r6, %b[106]
}
{
loop_mode
pfmul_hadds,0,sm %b[80], %b[53], %b[98], %b[80]
pfsubs,1,sm %b[81], %b[79], %b[93]
pfmul_hadds,2,sm %b[93], %b[96], %b[89], %b[89]
pshufb,3,sm 0x0, %b[49], %r26, %b[98]
pfmuls,5,sm %b[105], %b[102], %b[96]
}
{
loop_mode
pfsub_adds,0,sm %b[13], %b[64], %b[106], %b[87]
pfsubs,1,sm %b[91], %b[87], %b[91]
pfadds,2,sm %b[91], %b[87], %b[105]
xord,5,sm %b[21], %r0, %b[108]
}
{
loop_mode
pfmul_hadds,0,sm %b[99], %b[82], %b[97], %b[82]
pfsubs,1,sm %b[94], %b[84], %b[84]
pfadds,2,sm %b[94], %b[84], %b[86]
pshufb,3,sm 0x0, %b[48], %r26, %b[94]
pshufb,4,sm 0x0, %b[86], %r26, %b[97]
xord,5,sm %b[74], %r0, %b[99]
}
{
loop_mode
pfsub_rsubs,0,sm %b[13], %b[64], %b[106], %b[64]
pfsubs,1,sm %b[69], %b[15], %b[106]
pfmuls,3,sm %b[99], %b[77], %b[99]
pfmuls,4,sm %b[108], %b[10], %b[1]
xord,5,sm %b[83], %r0, %b[108]
}
{
loop_mode
pfmul_hadds,0,sm %b[104], %b[60], %b[90], %b[63]
pfadds,1,sm %b[81], %b[79], %b[79]
pfmul_hadds,2,sm %b[98], %b[57], %b[63], %b[60]
pshufb,3,sm 0x0, %b[23], %r26, %b[90]
pfmuls,4,sm %b[108], %b[72], %b[81]
pfmul_hadds,5,sm %b[97], %b[102], %b[96], %b[13]
}
{
loop_mode
pfadd_adds,0,sm %b[107], %b[71], %b[105], %b[75]
pfsubs,1,sm %b[85], %b[75], %b[85]
pfadds,2,sm %b[85], %b[75], %b[96]
xord,5,sm %b[78], %r0, %b[97]
}
{
loop_mode
pfmul_hadds,0,sm %b[94], %b[61], %b[88], %b[93]
xord,1,sm %b[47], %r0, %b[61]
pfadd_rsubs,2,sm %b[101], %b[70], %b[86], %b[94]
pshufb,3,sm 0x0, %b[93], %r26, %b[97]
xord,4,sm %b[30], %r0, %b[88]
pfmuls,5,sm %b[97], %b[92], %b[98]
}
{
loop_mode
pfadd_rsubs,0,sm %b[107], %b[71], %b[105], %b[83]
pfadds,1,sm %b[69], %b[15], %b[91]
pfadds,2,sm %b[62], %b[73], %b[102]
pshufb,3,sm 0x0, %b[91], %r26, %b[104]
pshufb,4,sm 0x0, %b[83], %r26, %b[105]
xord,5,sm %b[109], %r0, %b[69]
}
{
loop_mode
pfmul_hadds,0,sm %b[90], %b[12], %b[3], %b[62]
pshufb,1,sm 0x0, %b[74], %r26, %b[90]
pfadd_adds,2,sm %b[101], %b[70], %b[86], %b[73]
pshufb,3,sm 0x0, %b[84], %r26, %b[74]
pfsubs,4,sm %b[62], %b[73], %b[84]
xord,5,sm %b[76], %r0, %b[86]
}
{
loop_mode
pfadd_adds,0,sm %b[41], %b[80], %b[79], %b[68]
pfadds,1,sm %b[68], %b[67], %b[78]
pfadd_rsubs,2,sm %b[87], %b[89], %b[96], %b[72]
pshufb,3,sm 0x0, %b[106], %r26, %b[81]
pshufb,4,sm 0x0, %b[78], %r26, %b[105]
pfmul_hadds,5,sm %b[105], %b[72], %b[81], %b[67]
}
{
loop_mode
pfadd_rsubs,0,sm %b[41], %b[80], %b[79], %b[90]
pfmul_hadds,1,sm %b[90], %b[77], %b[99], %b[77]
pfadd_adds,2,sm %b[87], %b[89], %b[96], %b[96]
pshufb,3,sm 0x0, %b[103], %r26, %b[98]
xord,4,sm %b[110], %r0, %b[92]
pfmul_hadds,5,sm %b[105], %b[92], %b[98], %b[79]
}
{
loop_mode
pfadd_adds,0,sm %b[20], %b[100], %b[91], %b[85]
pfmuls,1,sm %b[86], %b[66], %b[86]
pfadd_adds,2,sm %b[64], %b[82], %b[102], %b[99]
pshufb,3,sm 0x0, %b[85], %r26, %b[105]
xord,4,sm %b[28], %r0, %b[103]
xord,5,sm %b[104], %r6, %b[104]
}
{
loop_mode
pfadd_rsubs,0,sm %b[20], %b[100], %b[91], %b[74]
xord,1,sm %b[95], %r0, %b[108]
pfadd_rsubs,2,sm %b[64], %b[82], %b[102], %b[84]
pshufb,3,sm 0x0, %b[84], %r26, %b[106]
xord,4,sm %b[45], %r0, %b[91]
xord,5,sm %b[74], %r6, %b[102]
}
{
loop_mode
pfadd_rsubs,0,sm %b[56], %b[60], %b[78], %b[97]
pfmuls,1,sm %b[108], %b[65], %b[111]
pfsub_adds,2,sm %b[107], %b[71], %b[104], %b[108]
xord,3,sm %b[46], %r0, %b[81]
xord,4,sm %b[97], %r6, %b[112]
xord,5,sm %b[81], %r6, %b[113]
}
{
loop_mode
pfadd_adds,0,sm %b[56], %b[60], %b[78], %b[109]
pshufb,1,sm 0x0, %b[110], %r26, %b[78]
pfsub_rsubs,2,sm %b[101], %b[70], %b[102], %b[110]
pshufb,3,sm 0x0, %b[109], %r26, %b[98]
xord,4,sm %b[98], %r6, %b[114]
addd,5,sm 0x8, %b[8], %b[6] ? %pcnt0
}
{
loop_mode
pfsub_rsubs,0,sm %b[20], %b[100], %b[113], %b[71]
pfsubs,1,sm %b[93], %b[63], %b[75]
pfsub_rsubs,2,sm %b[107], %b[71], %b[104], %b[76]
pshufb,3,sm 0x0, %b[76], %r26, %b[104]
xord,4,sm %b[105], %r6, %b[105]
std,5 %r25, %b[8], %b[75]
}
{
loop_mode
pfsub_adds,0,sm %b[20], %b[100], %b[113], %b[63]
pfadds,1,sm %b[93], %b[63], %b[93]
pfsub_adds,2,sm %b[101], %b[70], %b[102], %b[70]
pshufb,3,sm 0x0, %b[95], %r26, %b[94]
xord,4,sm %b[106], %r6, %b[95]
std,5 %r23, %b[8], %b[94]
}
{
loop_mode
pfsub_adds,0,sm %b[56], %b[60], %b[114], %b[83]
pfmuls,1,sm %b[91], %b[68], %b[100]
pfsub_rsubs,2,sm %b[87], %b[89], %b[105], %b[101]
pshufb,3,sm 0x0, %b[36], %r26, %b[91]
xord,4,sm %b[44], %r0, %b[102]
std,5 %r18, %b[8], %b[83]
}
{
loop_mode
pfsub_rsubs,0,sm %b[56], %b[60], %b[114], %b[60]
pfmuls,1,sm %b[103], %b[90], %b[103]
pfsub_adds,2,sm %b[64], %b[82], %b[95], %b[73]
pshufb,3,sm 0x0, %b[45], %r26, %b[106]
xord,4,sm %b[24], %r0, %b[107]
std,5 %r2, %b[8], %b[73]
}
{
loop_mode
pfsub_adds,0,sm %b[87], %b[89], %b[105], %b[64]
pfmuls,1,sm %b[102], %b[85], %b[82]
pfsub_rsubs,2,sm %b[64], %b[82], %b[95], %b[72]
pshufb,3,sm 0x0, %b[44], %r26, %b[87]
xord,4,sm %b[33], %r0, %b[89]
std,5 %r12, %b[8], %b[72]
}
{
loop_mode
pfmul_hadds,0,sm %b[94], %b[65], %b[111], %b[65]
pfmuls,1,sm %b[107], %b[74], %b[95]
pfsub_adds,2,sm %b[41], %b[80], %b[112], %b[94]
pshufb,3,sm 0x0, %b[24], %r26, %b[96]
xord,4,sm %b[40], %r0, %b[102]
std,5 %r16, %b[8], %b[96]
movad,0 area=9, ind=0, am=1, be=0, %b[15]
movad,1 area=8, ind=0, am=1, be=0, %b[12]
movad,2 area=9, ind=0, am=1, be=0, %b[3]
movad,3 area=8, ind=0, am=1, be=0, %b[20]
}
{
loop_mode
pfmul_hadds,0,sm %b[104], %b[66], %b[86], %b[66]
pfmuls,1,sm %b[89], %b[97], %b[86]
pshufb,3,sm 0x0, %b[33], %r26, %b[89]
xord,4,sm %b[36], %r0, %b[99]
std,5 %r22, %b[8], %b[99]
movad,0 area=7, ind=0, am=1, be=0, %b[24]
movad,1 area=6, ind=0, am=1, be=0, %b[32]
movad,2 area=7, ind=0, am=1, be=0, %b[29]
movad,3 area=6, ind=0, am=1, be=0, %b[23]
}
{
loop_mode
pfsub_rsubs,0,sm %b[41], %b[80], %b[112], %b[80]
pfmuls,1,sm %b[102], %b[109], %b[84]
pshufb,3,sm 0x0, %b[40], %r26, %b[102]
xord,4,sm %b[19], %r0, %b[104]
std,5 %r19, %b[8], %b[84]
movad,0 area=5, ind=0, am=1, be=0, %b[33]
movad,1 area=4, ind=0, am=1, be=0, %b[41]
movad,2 area=5, ind=0, am=1, be=0, %b[40]
movad,3 area=4, ind=0, am=1, be=0, %b[36]
}
{
loop_mode
pfadd_adds,0,sm %b[11], %b[62], %b[93], %b[99]
pfmuls,1,sm %b[99], %b[71], %b[111]
pfadd_rsubs,2,sm %b[11], %b[62], %b[93], %b[105]
pshufb,3,sm 0x0, %b[28], %r26, %b[112]
xord,4,sm %b[16], %r0, %b[113]
std,5 %r14, %b[8], %b[108]
movad,0 area=3, ind=8, am=1, be=0, %b[93]
movad,1 area=3, ind=0, am=0, be=0, %b[28]
movad,2 area=3, ind=16, am=0, be=0, %b[108]
movad,3 area=3, ind=24, am=0, be=0, %b[107]
}
{
loop_mode
pfmul_hadds,0,sm %b[87], %b[85], %b[82], %b[82]
pfmuls,1,sm %b[104], %b[63], %b[87]
pfmul_hadds,2,sm %b[96], %b[74], %b[95], %b[85]
pshufb,3,sm 0x0, %b[19], %r26, %b[95]
xord,4,sm %b[37], %r0, %b[96]
std,5 %r24, %b[8], %b[110]
movad,0 area=2, ind=0, am=0, be=0, %b[44]
movad,1 area=2, ind=8, am=0, be=0, %b[74]
movad,2 area=3, ind=8, am=1, be=0, %b[45]
movad,3 area=3, ind=0, am=0, be=0, %b[19]
}
{
loop_mode
pfmul_hadds,0,sm %b[89], %b[97], %b[86], %b[89]
pfmuls,1,sm %b[81], %b[59], %b[86]
pfmuls,2,sm %b[113], %b[83], %b[97]
pshufb,3,sm 0x0, %b[16], %r26, %b[104]
std,5 %r15, %b[8], %b[76]
movad,0 area=2, ind=16, am=0, be=0, %b[76]
movad,1 area=2, ind=24, am=1, be=0, %b[81]
movad,2 area=2, ind=0, am=0, be=0, %b[16]
movad,3 area=2, ind=16, am=0, be=0, %b[48]
}
{
loop_mode
pfmul_hadds,0,sm %b[102], %b[109], %b[84], %b[92]
pfmuls,1,sm %b[92], %b[51], %b[96]
pfmuls,2,sm %b[96], %b[60], %b[102]
pshufb,3,sm 0x0, %b[37], %r26, %b[109]
xord,4,sm %b[7], %r0, %b[110]
std,5 %r17, %b[8], %b[70]
movad,0 area=1, ind=0, am=0, be=0, %b[37]
movad,1 area=1, ind=16, am=0, be=0, %b[49]
movad,2 area=2, ind=8, am=0, be=0, %b[70]
movad,3 area=0, ind=8, am=0, be=0, %b[84]
}
{
loop_mode
pfmul_hadds,0,sm %b[106], %b[68], %b[100], %b[68]
pfmuls,1,sm %b[69], %b[50], %b[101]
pfmul_hadds,2,sm %b[112], %b[90], %b[103], %b[69]
pshufb,3,sm 0x0, %b[75], %r26, %b[103]
std,5 %r20, %b[8], %b[101]
movad,0 area=1, ind=8, am=1, be=0, %b[90]
movad,1 area=1, ind=24, am=0, be=0, %b[75]
movad,2 area=2, ind=24, am=1, be=0, %b[100]
movad,3 area=1, ind=0, am=0, be=0, %b[52]
}
{
loop_mode
pfmul_hadds,0,sm %b[91], %b[71], %b[111], %b[71]
pfmuls,1,sm %b[110], %b[94], %b[87]
pfmul_hadds,2,sm %b[95], %b[63], %b[87], %b[73]
pshufb,3,sm 0x0, %b[7], %r26, %b[91]
xord,4,sm %b[27], %r0, %b[95]
std,5 %r11, %b[8], %b[73]
movad,0 area=0, ind=0, am=0, be=0, %b[7]
movad,1 area=0, ind=24, am=0, be=0, %b[56]
movad,2 area=1, ind=16, am=0, be=0, %b[53]
movad,3 area=1, ind=24, am=0, be=0, %b[63]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmul_hadds,0,sm %b[104], %b[83], %b[97], %b[83]
pfmuls,1,sm %b[88], %b[58], %b[88]
std,2 %r13, %b[8], %b[64]
std,5 %r21, %b[8], %b[72]
movad,0 area=0, ind=8, am=1, be=0, %b[57]
movad,1 area=0, ind=16, am=0, be=0, %b[8]
movad,2 area=1, ind=8, am=1, be=0, %b[64]
movad,3 area=0, ind=0, am=1, be=0, %b[72]
}
Theoretical speed: 16 complex numbers (8 bytes each) in 32 cycles (16/32) = 4 bytes/cycle
Fourfold theoretical speed: 16 bytes/cycle
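The figures above can be reproduced with a few lines of arithmetic. The sketch below assumes one wide instruction per cycle (the loop body contains 32 of them) and counts the 16 `std` stores of 8 bytes each per iteration. The helper names are hypothetical. The factor of 4 is my reading of the "fourfold" figure: the loop performs two radix-4 passes, which is the work of four radix-2 stages.

```c
#include <stddef.h>

/* Bytes of complex data produced per clock cycle by one loop iteration.
 * A single-precision complex number is two floats. */
static double bytes_per_cycle(size_t complex_per_iter, size_t cycles_per_iter)
{
    const size_t bytes_per_complex = 2 * sizeof(float);
    return (double)(complex_per_iter * bytes_per_complex)
         / (double)cycles_per_iter;
}

/* Assumption: scale by the number of radix-2 stages that the fused
 * loop replaces (4 for two radix-4 passes). */
static double radix2_equivalent(double bpc, unsigned radix2_stages)
{
    return bpc * (double)radix2_stages;
}
```

Calling `bytes_per_cycle(16, 32)` gives the 4 bytes/cycle figure, and `radix2_equivalent(..., 4)` gives the 16 bytes/cycle figure.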
Speed measurements

3. stage_radix4_2x_simd128
Here the stage_radix4_simd128 algorithm is manually unrolled by a factor of 2.
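In the 128-bit variants each register carries two complex numbers, so the complex multiply needs one extra reordering step after the horizontal add. The portable sketch below emulates that data flow with a plain struct. It is not the e2k API: `vec4f`, `mul4`, `hadd4` and `cmul_pair` are hypothetical names, and the lane behaviour of the horizontal add (adjacent pairs of the first operand, then of the second) is an assumption inferred from the `_rrii` variable names in the code.

```c
/* Hypothetical stand-in for a 128-bit register: four float lanes,
 * lane 0 in the lowest bytes. Two complex numbers: {re0, im0, re1, im1}. */
typedef struct { float lane[4]; } vec4f;

/* Lane-wise multiply (the role of the packed float multiply). */
static vec4f mul4(vec4f a, vec4f b)
{
    vec4f r;
    for (int k = 0; k < 4; ++k)
        r.lane[k] = a.lane[k] * b.lane[k];
    return r;
}

/* Horizontal add of adjacent pairs: assumed lane order, see lead-in. */
static vec4f hadd4(vec4f a, vec4f b)
{
    vec4f r = {{ a.lane[0] + a.lane[1], a.lane[2] + a.lane[3],
                 b.lane[0] + b.lane[1], b.lane[2] + b.lane[3] }};
    return r;
}

/* Multiply two pairs of complex numbers: result[k] = c[k] * y[k]. */
static vec4f cmul_pair(vec4f c, vec4f y)
{
    /* Flip the sign of both imaginary parts (xor with the sign bit). */
    vec4f conj_c = {{ c.lane[0], -c.lane[1], c.lane[2], -c.lane[3] }};
    /* Swap re/im inside each complex number (byte shuffle). */
    vec4f swap_c = {{ c.lane[1], c.lane[0], c.lane[3], c.lane[2] }};

    vec4f real = mul4(conj_c, y);   /* {cr*yr, -ci*yi, ...} */
    vec4f imag = mul4(swap_c, y);   /* {ci*yr,  cr*yi, ...} */

    vec4f rrii = hadd4(real, imag); /* {re0, re1, im0, im1} */

    /* Reorder back to interleaved {re0, im0, re1, im1}. */
    vec4f riri = {{ rrii.lane[0], rrii.lane[2],
                    rrii.lane[1], rrii.lane[3] }};
    return riri;
}
```

Compared with the 64-bit version, the only additional work is the final reordering, which costs one more shuffle per complex product.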
C code
void stage_radix4_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
__v2di *xy0_in = (__v2di*)&data_in[ 0];
__v2di *zw0_in = (__v2di*)&data_in[ 2];
__v2di *xy1_in = (__v2di*)&data_in[ 4];
__v2di *zw1_in = (__v2di*)&data_in[ 6];
__v2di *xy2_in = (__v2di*)&data_in[ 8];
__v2di *zw2_in = (__v2di*)&data_in[10];
__v2di *xy3_in = (__v2di*)&data_in[12];
__v2di *zw3_in = (__v2di*)&data_in[14];
__v2di *xy4_in = (__v2di*)&data_in[16];
__v2di *zw4_in = (__v2di*)&data_in[18];
__v2di *xy5_in = (__v2di*)&data_in[20];
__v2di *zw5_in = (__v2di*)&data_in[22];
__v2di *xy6_in = (__v2di*)&data_in[24];
__v2di *zw6_in = (__v2di*)&data_in[26];
__v2di *xy7_in = (__v2di*)&data_in[28];
__v2di *zw7_in = (__v2di*)&data_in[30];
__v2di *c0a_in = (__v2di*)&coefC_a[0];
__v2di *c1a_in = (__v2di*)&coefC_a[2];
__v2di *c2a_in = (__v2di*)&coefC_a[4];
__v2di *c3a_in = (__v2di*)&coefC_a[6];
__v2di *d0a_in = (__v2di*)&coefD_a[0];
__v2di *d1a_in = (__v2di*)&coefD_a[2];
__v2di *d2a_in = (__v2di*)&coefD_a[4];
__v2di *d3a_in = (__v2di*)&coefD_a[6];
__v2di *e0a_in = (__v2di*)&coefE_a[0];
__v2di *e1a_in = (__v2di*)&coefE_a[2];
__v2di *e2a_in = (__v2di*)&coefE_a[4];
__v2di *e3a_in = (__v2di*)&coefE_a[6];
__v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16];
__v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16];
__v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16];
__v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16];
__v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16];
__v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16];
__v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16];
__v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16];
__v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16];
__v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16];
__v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16];
__v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16];
__v2di *out_0 = (__v2di*)&data_out[ 0*data_count/16];
__v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16];
__v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16];
__v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16];
__v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16];
__v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16];
__v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16];
__v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16];
__v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16];
__v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16];
__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/32; ++i)
{
__v2di xy0 = xy0_in[16*i];
__v2di zw0 = zw0_in[16*i];
__v2di xy1 = xy1_in[16*i];
__v2di zw1 = zw1_in[16*i];
__v2di c0 = c0a_in[4*i];
__v2di d0 = d0a_in[4*i];
__v2di e0 = e0a_in[4*i];
__v2di xy2 = xy2_in[16*i];
__v2di zw2 = zw2_in[16*i];
__v2di xy3 = xy3_in[16*i];
__v2di zw3 = zw3_in[16*i];
__v2di c1 = c1a_in[4*i];
__v2di d1 = d1a_in[4*i];
__v2di e1 = e1a_in[4*i];
__v2di xy4 = xy4_in[16*i];
__v2di zw4 = zw4_in[16*i];
__v2di xy5 = xy5_in[16*i];
__v2di zw5 = zw5_in[16*i];
__v2di c2 = c2a_in[4*i];
__v2di d2 = d2a_in[4*i];
__v2di e2 = e2a_in[4*i];
__v2di xy6 = xy6_in[16*i];
__v2di zw6 = zw6_in[16*i];
__v2di xy7 = xy7_in[16*i];
__v2di zw7 = zw7_in[16*i];
__v2di c3 = c3a_in[4*i];
__v2di d3 = d3a_in[4*i];
__v2di e3 = e3a_in[4*i];
__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_e3 = __builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63});
__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
__v2di cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
__v2di cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
__v2di dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
__v2di dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
__v2di dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
__v2di dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
__v2di ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
__v2di ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
__v2di ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
__v2di ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
__v2di cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
__v2di cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
__v2di dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
__v2di dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
__v2di dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
__v2di dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
__v2di ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
__v2di ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
__v2di ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
__v2di ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);
__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
__v2di out0 = __builtin_e2k_qpfadds(add02_0, add13_0);
__v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1);
__v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2);
__v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3);
__v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
__v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
__v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
__v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
__v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0);
__v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1);
__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
xy0 = out0;
zw0 = out1;
xy1 = out2;
zw1 = out3;
c0 = c0b_in[i];
d0 = d0b_in[i];
e0 = e0b_in[i];
xy2 = out4;
zw2 = out5;
xy3 = out6;
zw3 = out7;
c1 = c1b_in[i];
d1 = d1b_in[i];
e1 = e1b_in[i];
xy4 = out8;
zw4 = out9;
xy5 = out10;
zw5 = out11;
c2 = c2b_in[i];
d2 = d2b_in[i];
e2 = e2b_in[i];
xy6 = out12;
zw6 = out13;
xy7 = out14;
zw7 = out15;
c3 = c3b_in[i];
d3 = d3b_in[i];
e3 = e3b_in[i];
x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63});
conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63});
conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63});
conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63});
conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63});
conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63});
conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63});
conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63});
conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63});
conj_e3 = __builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63});
swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);
cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
add02_0 = __builtin_e2k_qpfadds( x0, dz0);
add02_1 = __builtin_e2k_qpfadds( x1, dz1);
add02_2 = __builtin_e2k_qpfadds( x2, dz2);
add02_3 = __builtin_e2k_qpfadds( x3, dz3);
sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0);
out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1);
out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2);
out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3);
out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0);
out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1);
out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
}
}
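To make the intrinsic sequence above easier to follow, here is a minimal scalar sketch of the two building blocks it uses: the twiddle multiplication (sign flip, swap, multiply, horizontal add, reorder) and the radix-4 butterfly. The type `vec4f` and the functions `cmul_lanes` and `radix4_butterfly` are hypothetical names introduced only for this illustration. The sketch assumes that one 128-bit register holds two complex numbers as [re0, im0, re1, im1], and that the horizontal add sums adjacent float pairs of each operand, as the `_rrii` suffix in the code suggests. It does not try to model the exact operand semantics of `qpshufb`.

```c
#include <complex.h>

/* Hypothetical stand-in for one 128-bit register:
   4 floats = 2 complex numbers laid out as [re0, im0, re1, im1]. */
typedef struct { float f[4]; } vec4f;

/* Lane-level emulation of the conj / swap / multiply / horizontal-add
   sequence. Assumption: the horizontal add yields [re0, re1, im0, im1]. */
static vec4f cmul_lanes(vec4f c, vec4f y)
{
    /* XOR with the sign bit of the upper float negates the imaginary parts */
    vec4f conj_c = {{ c.f[0], -c.f[1], c.f[2], -c.f[3] }};
    /* byte shuffle that swaps real and imaginary parts */
    vec4f swap_c = {{ c.f[1],  c.f[0], c.f[3],  c.f[2] }};
    vec4f re, im;
    for (int k = 0; k < 4; ++k) {
        re.f[k] = conj_c.f[k] * y.f[k];   /* [cr*yr, -ci*yi, ...] */
        im.f[k] = swap_c.f[k] * y.f[k];   /* [ci*yr,  cr*yi, ...] */
    }
    /* horizontal add of adjacent pairs: real parts first, then imaginary */
    vec4f rrii = {{ re.f[0] + re.f[1], re.f[2] + re.f[3],
                    im.f[0] + im.f[1], im.f[2] + im.f[3] }};
    /* final shuffle restores the interleaved [re0, im0, re1, im1] order */
    vec4f out = {{ rrii.f[0], rrii.f[2], rrii.f[1], rrii.f[3] }};
    return out;
}

/* Radix-4 butterfly on x and the already multiplied terms c*y, d*z, e*w,
   mirroring add02 / sub02 / add13 / sub13 / sub13i in the loop body. */
static void radix4_butterfly(float complex x, float complex cy,
                             float complex dz, float complex ew,
                             float complex out[4])
{
    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;
    /* multiply by i: swap the parts, then negate the new real part */
    float complex sub13i = -cimagf(sub13) + crealf(sub13) * I;
    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

As a quick check, `cmul_lanes` applied to c = (1+2i, 3+4i) and y = (5+6i, 7+8i) gives (-7+16i, -11+52i), and `radix4_butterfly` applied to (1, i, -1, -i) gives (0, 4, 0, 0), which matches a 4-point forward DFT of that sequence.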
Main loop in assembly
.L7295:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=128
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=160
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=192
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=224
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=14, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=14, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=18, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=18, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=2, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=2, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=2, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=2, abs=24, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=2, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=2, abs=28, disp=0
}
.L3969:
{
loop_mode
disp %ctpr1, .L3969
movaqp,0 area=0, ind=0, am=1, be=0, %g17
movaqp,1 area=0, ind=16, am=0, be=0, %g16
movaqp,2 area=0, ind=0, am=1, be=0, %g19
movaqp,3 area=0, ind=16, am=0, be=0, %g18
}
{
loop_mode
movaqp,0 area=1, ind=0, am=1, be=0, %g21
movaqp,1 area=1, ind=16, am=0, be=0, %g20
movaqp,2 area=1, ind=0, am=1, be=0, %g23
movaqp,3 area=1, ind=16, am=0, be=0, %g22
}
{
loop_mode
movaqp,0 area=2, ind=0, am=1, be=0, %g25
movaqp,1 area=2, ind=16, am=0, be=0, %g24
movaqp,2 area=2, ind=0, am=1, be=0, %g27
movaqp,3 area=2, ind=16, am=0, be=0, %g26
}
{
loop_mode
movaqp,0 area=3, ind=0, am=1, be=0, %g29
movaqp,1 area=3, ind=16, am=0, be=0, %g28
movaqp,2 area=3, ind=0, am=1, be=0, %g31
movaqp,3 area=3, ind=16, am=0, be=0, %g30
}
{
loop_mode
movaqp,0 area=4, ind=0, am=1, be=0, %r26
movaqp,1 area=4, ind=16, am=0, be=0, %r9
movaqp,2 area=4, ind=0, am=1, be=0, %r28
movaqp,3 area=4, ind=16, am=0, be=0, %r27
}
{
loop_mode
qpshufb,0 %g19, %g17, %r1, %r33
qpshufb,1 %g18, %g16, %r1, %r34
qpshufb,3 %g18, %g16, %r7, %g16
qpshufb,4 %g19, %g17, %r7, %g17
movaqp,0 area=5, ind=0, am=1, be=0, %r30
movaqp,1 area=5, ind=16, am=0, be=0, %r29
movaqp,2 area=5, ind=0, am=1, be=0, %r32
movaqp,3 area=5, ind=16, am=0, be=0, %r31
}
{
loop_mode
qpshufb,0 %g23, %g21, %r1, %r37
qpshufb,1 %g22, %g20, %r1, %r38
qpshufb,3 %g22, %g20, %r7, %g20
qpshufb,4 %g23, %g21, %r7, %g21
movaqp,0 area=6, ind=0, am=1, be=0, %g19
movaqp,1 area=6, ind=16, am=0, be=0, %g18
movaqp,2 area=6, ind=0, am=1, be=0, %r36
movaqp,3 area=6, ind=16, am=0, be=0, %r35
}
{
loop_mode
qpshufb,0 %g27, %g25, %r1, %g22
qpshufb,1 %g26, %g24, %r1, %g23
qpshufb,3 %g26, %g24, %r7, %g24
qpshufb,4 %g27, %g25, %r7, %g25
movaqp,0 area=8, ind=0, am=1, be=0, %r39
movaqp,1 area=7, ind=0, am=1, be=0, %g26
movaqp,2 area=8, ind=0, am=1, be=0, %r40
movaqp,3 area=7, ind=0, am=1, be=0, %g27
}
{
loop_mode
qpshufb,0 %g31, %g29, %r1, %r41
qpshufb,1 %g30, %g28, %r1, %r42
qpshufb,3 %g30, %g28, %r7, %g28
qpshufb,4 %g31, %g29, %r7, %g29
movaqp,0 area=10, ind=0, am=1, be=0, %r43
movaqp,1 area=9, ind=0, am=1, be=0, %g30
movaqp,2 area=10, ind=0, am=1, be=0, %r44
movaqp,3 area=9, ind=0, am=1, be=0, %g31
}
{
loop_mode
qpxor,0 %r26, %r5, %r45
qpxor,1 %r9, %r5, %r46
qpshufb,3 %r26, %r26, %r6, %r26
qpshufb,4 %r9, %r9, %r6, %r9
movaqp,0 area=12, ind=0, am=1, be=0, %r49
movaqp,1 area=11, ind=0, am=1, be=0, %r47
movaqp,2 area=12, ind=0, am=1, be=0, %r50
movaqp,3 area=11, ind=0, am=1, be=0, %r48
}
{
loop_mode
qpxor,0 %r28, %r5, %r51
qpxor,1 %r27, %r5, %r52
qpfmuls,2 %r45, %r33, %r33
qpshufb,3 %r28, %r28, %r6, %r28
qpshufb,4 %r27, %r27, %r6, %r27
qpfmuls,5 %r26, %r33, %r26
}
{
loop_mode
qpxor,0 %g19, %r5, %r45
qpxor,1 %g18, %r5, %r53
qpfmuls,2 %r46, %r37, %r37
qpshufb,3 %g19, %g19, %r6, %g19
qpshufb,4 %g18, %g18, %r6, %g18
qpfmuls,5 %r9, %r37, %r9
}
{
loop_mode
qpxor,0 %r36, %r5, %r46
qpxor,1 %r35, %r5, %r54
qpfmuls,2 %r51, %g22, %g22
qpshufb,3 %r36, %r36, %r6, %r36
qpshufb,4 %r35, %r35, %r6, %r35
qpfmuls,5 %r28, %g22, %r28
}
{
loop_mode
qpfmuls,0 %r52, %r41, %r41
qpfmuls,1 %r45, %r34, %r45
qpfmuls,2 %r53, %r38, %r51
qpfmuls,3 %r27, %r41, %r27
qpfmuls,4 %g19, %r34, %g19
qpfmuls,5 %g18, %r38, %g18
}
{
loop_mode
qpfmuls,0 %r54, %r42, %r38
qpxor,1 %r29, %r5, %r36
qpfmuls,2 %r46, %g23, %r34
qpfmuls,3 %r35, %r42, %r35
qpshufb,4 %r29, %r29, %r6, %r29
qpfmuls,5 %r36, %g23, %g23
}
{
loop_mode
qpfmuls,2 %r36, %g20, %r36
qpfmuls,5 %r29, %g20, %g20
}
{
loop_mode
qpxor,1 %r30, %r5, %r29
qpfhadds,2 %r33, %r26, %r26
qpxor,4 %r31, %r5, %r42
}
{
loop_mode
qpshufb,0 %r30, %r30, %r6, %r30
qpshufb,1 %r31, %r31, %r6, %r31
qpfmuls,2 %r29, %g16, %r29
qpxor,3 %r32, %r5, %r33
qpshufb,4 %r32, %r32, %r6, %r32
qpfmuls,5 %r42, %g28, %r42
}
{
loop_mode
qpfmuls,0 %r30, %g16, %g16
qpfhadds,1 %r37, %r9, %r9
qpfmuls,2 %r31, %g28, %g28
qpfmuls,3 %r32, %g24, %g24
qpfhadds,4 %g22, %r28, %g22
qpfmuls,5 %r33, %g24, %r30
}
{
loop_mode
qpfhadds,0 %r45, %g19, %g19
qpfhadds,1 %r51, %g18, %g18
qpfhadds,2 %r41, %r27, %r27
qpshufb,3 %g26, %g26, %r6, %r28
qpxor,4 %g26, %r5, %g26
}
{
loop_mode
qpfhadds,0 %r38, %r35, %r31
qpfhadds,2 %r34, %g23, %g23
}
{
loop_mode
qpfhadds,2 %r36, %g20, %g20
qpshufb,3 %r39, %r39, %r6, %r32
qpxor,4 %r39, %r5, %r33
}
{
loop_mode
qpshufb,1 %r26, %r26, %r3, %r26
qpfhadds,2 %r29, %g16, %g16
qpshufb,3 %g22, %g22, %r3, %g22
qpshufb,4 %r40, %r40, %r6, %r29
qpfhadds,5 %r30, %g24, %g24
}
{
loop_mode
qpshufb,0 %r9, %r9, %r3, %r9
qpshufb,1 %g19, %g19, %r3, %g19
qpfhadds,2 %r42, %g28, %g28
qpxor,3 %r40, %r5, %r30
qpshufb,4 %g31, %g31, %r6, %r34
}
{
loop_mode
qpshufb,0 %r27, %r27, %r3, %r27
qpshufb,1 %g18, %g18, %r3, %g18
qpfsubs,2 %r26, %g19, %r35
qpxor,3 %g31, %r5, %g31
qpxor,4 %r43, %r5, %r36
}
{
loop_mode
qpshufb,0 %r31, %r31, %r3, %r31
qpshufb,1 %g23, %g23, %r3, %g23
qpfsubs,2 %r9, %g18, %r37
qpshufb,3 %r43, %r43, %r6, %r38
qpxor,4 %r48, %r5, %r39
}
{
loop_mode
qpfsubs,0 %g22, %g23, %r41
qpshufb,1 %g16, %g16, %r3, %g16
qpfsubs,2 %r27, %r31, %r40
qpshufb,3 %g24, %g24, %r3, %g24
qpxor,4 %r47, %r5, %r42
}
{
loop_mode
qpshufb,0 %g28, %g28, %r3, %g28
qpshufb,1 %g20, %g20, %r3, %g20
qpfadds,2 %r26, %g19, %g19
qpfsubs,3 %g25, %g24, %g24
qpshufb,4 %r48, %r48, %r6, %g25
qpfadds,5 %g25, %g24, %r26
}
{
loop_mode
qpfadds,0 %r9, %g18, %g18
qpfadds,1 %r27, %r31, %r9
qpfadds,2 %g22, %g23, %g22
qpshufb,3 %r47, %r47, %r6, %g23
qpxor,4 %r50, %r5, %r27
}
{
loop_mode
qpfadds,0 %g17, %g16, %r31
qpfadds,1 %g29, %g28, %g17
qpfsubs,2 %g17, %g16, %g16
qpshufb,4 %r50, %r50, %r6, %r43
}
{
loop_mode
qpfadds,0 %g21, %g20, %g29
qpfsubs,1 %g21, %g20, %g20
qpfsubs,2 %g29, %g28, %g28
qpshufb,3 %r35, %r35, %r6, %g21
qpxor,4 %g27, %r5, %r35
}
{
loop_mode
qpshufb,3 %r37, %r37, %r6, %r37
qpxor,4 %g21, %r4, %g21
}
{
loop_mode
qpshufb,3 %r40, %r40, %r6, %r40
qpshufb,4 %r41, %r41, %r6, %r41
}
{
loop_mode
qpfsubs,0 %r31, %g19, %r45
qpfadds,1 %r31, %g19, %g19
qpfadds,2 %g17, %r9, %r31
qpxor,3 %r37, %r4, %r37
qpxor,4 %r41, %r4, %r41
}
{
loop_mode
qpfsubs,0 %g17, %r9, %g17
qpfadds,1 %g29, %g18, %r9
qpfsubs,2 %g29, %g18, %g18
qpxor,3 %r40, %r4, %r40
qpfadds,4 %r26, %g22, %g29
qpfsubs,5 %r26, %g22, %g22
}
{
loop_mode
qpfsubs,2 %g16, %g21, %r26
qpfsubs,3 %g24, %r41, %g21
qpfadds,4 %g24, %r41, %g24
qpfadds,5 %g16, %g21, %g16
}
{
loop_mode
qpfsubs,3 %g20, %r37, %r46
qpfadds,4 %g28, %r40, %g28
qpfsubs,5 %g28, %r40, %r41
}
{
loop_mode
qpshufb,0 %g27, %g27, %r6, %g27
qpshufb,1 %g30, %g30, %r6, %r37
qpfadds,2 %g20, %r37, %g20
}
{
loop_mode
qpshufb,0 %r31, %r9, %r1, %r40
qpshufb,1 %g17, %g18, %r1, %r47
}
{
loop_mode
qpfmuls,0 %r33, %r40, %r33
qpfmuls,1 %r42, %r47, %r40
qpfmuls,2 %r32, %r40, %r32
qpshufb,3 %g29, %g19, %r1, %r48
qpshufb,4 %g22, %r45, %r1, %r50
}
{
loop_mode
qpxor,0 %g30, %r5, %g30
qpshufb,1 %r44, %r44, %r6, %r47
qpfmuls,2 %g23, %r47, %g23
qpshufb,3 %r41, %r46, %r1, %r42
qpshufb,4 %g24, %g16, %r1, %r51
qpfmuls,5 %r38, %r50, %r38
}
{
loop_mode
qpshufb,3 %g21, %r26, %r1, %r52
qpfmuls,4 %g31, %r42, %g31
qpfmuls,5 %r34, %r42, %r34
}
{
loop_mode
qpshufb,0 %g28, %g20, %r1, %r42
qpshufb,1 %g17, %g18, %r7, %g17
qpfmuls,3 %g26, %r48, %g26
qpfmuls,4 %r36, %r50, %r36
qpfmuls,5 %r28, %r48, %r28
}
{
loop_mode
qpfmuls,0 %r43, %r42, %r39
qpxor,1 %r44, %r5, %r42
qpfmuls,2 %r27, %r42, %r27
qpfmuls,3 %r39, %r51, %g18
qpfmuls,4 %g25, %r51, %g25
qpfmuls,5 %r29, %r52, %r29
}
{
loop_mode
qpxor,0 %r49, %r5, %r43
qpshufb,1 %r49, %r49, %r6, %r44
qpfmuls,2 %r42, %g17, %r42
qpfmuls,5 %r30, %r52, %r30
}
{
loop_mode
qpfhadds,0 %r33, %r32, %r31
qpshufb,1 %r31, %r9, %r7, %r9
qpfmuls,2 %r47, %g17, %g17
qpfhadds,5 %g31, %r34, %g31
}
{
loop_mode
qpshufb,0 %g28, %g20, %r7, %g20
qpshufb,1 %r41, %r46, %r7, %g28
qpfmuls,2 %r35, %r9, %r32
qpfhadds,3 %r36, %r38, %r28
qpfhadds,4 %r40, %g23, %g23
qpfhadds,5 %g26, %r28, %g26
}
{
loop_mode
qpfmuls,0 %r43, %g20, %r33
qpfmuls,1 %r44, %g20, %g20
qpfmuls,2 %g27, %r9, %g27
qpshufb,3 %g29, %g19, %r7, %g19
qpshufb,4 %g22, %r45, %r7, %g22
qpfhadds,5 %g18, %g25, %g18
}
{
loop_mode
qpfmuls,0 %g30, %g28, %g28
qpfhadds,1 %r27, %r39, %g29
qpfmuls,2 %r37, %g28, %g25
qpfhadds,5 %r30, %r29, %g30
}
{
loop_mode
qpfhadds,2 %r42, %g17, %g17
qpshufb,3 %g31, %g31, %r3, %g31
qpshufb,4 %g24, %g16, %r7, %g16
}
{
loop_mode
qpshufb,3 %g26, %g26, %r3, %g24
qpshufb,4 %r28, %r28, %r3, %g26
}
{
loop_mode
qpfhadds,0 %r32, %g27, %g27
qpshufb,1 %r31, %r31, %r3, %r9
qpfhadds,2 %r33, %g20, %g20
qpshufb,3 %g23, %g23, %r3, %g23
qpshufb,4 %g18, %g18, %r3, %g18
}
{
loop_mode
qpshufb,0 %g29, %g29, %r3, %g28
qpshufb,1 %g21, %r26, %r7, %g21
qpfhadds,2 %g28, %g25, %g25
qpshufb,3 %g30, %g30, %r3, %g29
qpfadds,4 %g26, %g23, %g23
qpfsubs,5 %g26, %g23, %g30
}
{
loop_mode
qpshufb,1 %g17, %g17, %r3, %g17
qpfadds,3 %g29, %g31, %g29
qpfsubs,5 %g29, %g31, %g26
}
{
loop_mode
qpfsubs,0 %g24, %r9, %g31
qpfadds,1 %g24, %r9, %g24
qpfadds,2 %g22, %g17, %r9
}
{
loop_mode
qpshufb,0 %g20, %g20, %r3, %g20
qpshufb,1 %g27, %g27, %r3, %g27
qpfsubs,2 %g18, %g28, %r26
}
{
loop_mode
qpfadds,0 %g18, %g28, %g18
qpfsubs,1 %g16, %g20, %g16
qpfadds,2 %g16, %g20, %g28
qpshufb,3 %g30, %g30, %r6, %g20
}
{
loop_mode
qpshufb,0 %g25, %g25, %r3, %g25
qpfadds,1 %g19, %g27, %g19
qpfsubs,2 %g19, %g27, %g30
qpshufb,3 %g26, %g26, %r6, %g22
qpxor,4 %g20, %r4, %g20
qpfsubs,5 %g22, %g17, %g17
}
{
loop_mode
qpfsubs,0 %g21, %g25, %g21
qpfadds,1 %r9, %g23, %g25
qpfadds,2 %g21, %g25, %g26
qpxor,3 %g22, %r4, %g22
}
{
loop_mode
qpshufb,0 %g31, %g31, %r6, %g27
qpfsubs,1 %r9, %g23, %g23
}
{
loop_mode
qpfadds,0 %g28, %g18, %g31
qpfsubs,1 %g28, %g18, %g18
}
{
loop_mode
qpshufb,0 %r26, %r26, %r6, %g28
qpfadds,1 %g19, %g24, %r9
qpfsubs,2 %g19, %g24, %g19
qpfsubs,3 %g17, %g20, %g24
qpfadds,4 %g17, %g20, %g17
}
{
loop_mode
qpfadds,0 %g26, %g29, %g20
qpfsubs,1 %g26, %g29, %g26
qpfadds,2 %g21, %g22, %g29
}
{
loop_mode
qpxor,0 %g27, %r4, %g27
qpfsubs,1 %g21, %g22, %g21
stqp,2 %r18, %r0, %g23
}
{
loop_mode
qpfsubs,0 %g30, %g27, %g22
qpfadds,1 %g30, %g27, %g23
stqp,2 %r12, %r0, %g18
stqp,5 %r25, %r0, %g25
}
{
loop_mode
qpxor,0 %g28, %r4, %g18
stqp,2 %r16, %r0, %g31
stqp,5 %r14, %r0, %g17
}
{
loop_mode
qpfsubs,0 %g16, %g18, %g17
qpfadds,1 %g16, %g18, %g16
stqp,2 %r23, %r0, %g19
stqp,5 %r15, %r0, %g24
}
{
loop_mode
stqp,2 %r2, %r0, %r9
}
{
loop_mode
stqp,2 %r24, %r0, %g22
stqp,5 %r22, %r0, %g20
}
{
loop_mode
stqp,2 %r19, %r0, %g26
stqp,5 %r21, %r0, %g21
}
{
loop_mode
stqp,2 %r17, %r0, %g23
stqp,5 %r11, %r0, %g29
}
{
loop_mode
stqp,2 %r20, %r0, %g17
}
{
loop_mode
ct %ctpr1 ? %NOT_LOOP_END
alc alcf=1, alct=1
addd,0,sm 0x10, %r0, %r0
stqp,2 %r13, %r0, %g16
}
Theoretical speed: 32 complex numbers (256 bytes) in 73 cycles, 32·8/73 ≈ 3.51 bytes/cycle
Quadruple theoretical speed: 14.03 bytes/cycle
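The arithmetic behind these two figures can be written out as a small sketch. The helper name `bytes_per_cycle` is hypothetical. The size of 8 bytes per element follows from the `float complex` type chosen in the problem statement, and 73 is the number of wide instructions in the loop body above.

```c
/* Bytes of complex data processed per cycle for a loop that handles
   `complex_count` elements of `bytes_per_complex` bytes each in `cycles`
   wide instructions (one wide instruction per cycle is assumed). */
static double bytes_per_cycle(int complex_count, int bytes_per_complex,
                              int cycles)
{
    return (double)complex_count * bytes_per_complex / cycles;
}
```

With these inputs, `bytes_per_cycle(32, 8, 73)` evaluates to about 3.507, which rounds to 3.51, and four times that value is about 14.03.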
Speed measurements

4. stage_radix4_2x_simd128_unroll2
Here the loop is unrolled by a factor of 2 using the unroll option (#pragma unroll(2)).
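For readers less familiar with the transformation, the sketch below shows by hand what unrolling by 2 amounts to on a deliberately trivial loop. The functions `scale` and `scale_unroll2` are hypothetical and not part of the FFT code. With the pragma, the compiler performs the equivalent rewrite itself, which on a VLIW machine typically gives the scheduler more independent operations to pack into each wide instruction.

```c
#include <stddef.h>

/* Reference loop: one element per iteration. */
static void scale(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * k;
}

/* Hand-written equivalent of unrolling the loop above by 2:
   two independent elements per iteration plus a remainder step. */
static void scale_unroll2(float *dst, const float *src, size_t n, float k)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        dst[i]     = src[i]     * k;
        dst[i + 1] = src[i + 1] * k;
    }
    if (i < n)                      /* odd element left over */
        dst[i] = src[i] * k;
}
```

Both functions produce identical results for any `n`; only the shape of the loop body differs.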
C code
void stage_radix4_2x_simd128_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
__v2di *xy0_in = (__v2di*)&data_in[ 0];
__v2di *zw0_in = (__v2di*)&data_in[ 2];
__v2di *xy1_in = (__v2di*)&data_in[ 4];
__v2di *zw1_in = (__v2di*)&data_in[ 6];
__v2di *xy2_in = (__v2di*)&data_in[ 8];
__v2di *zw2_in = (__v2di*)&data_in[10];
__v2di *xy3_in = (__v2di*)&data_in[12];
__v2di *zw3_in = (__v2di*)&data_in[14];
__v2di *xy4_in = (__v2di*)&data_in[16];
__v2di *zw4_in = (__v2di*)&data_in[18];
__v2di *xy5_in = (__v2di*)&data_in[20];
__v2di *zw5_in = (__v2di*)&data_in[22];
__v2di *xy6_in = (__v2di*)&data_in[24];
__v2di *zw6_in = (__v2di*)&data_in[26];
__v2di *xy7_in = (__v2di*)&data_in[28];
__v2di *zw7_in = (__v2di*)&data_in[30];
__v2di *c0a_in = (__v2di*)&coefC_a[0];
__v2di *c1a_in = (__v2di*)&coefC_a[2];
__v2di *c2a_in = (__v2di*)&coefC_a[4];
__v2di *c3a_in = (__v2di*)&coefC_a[6];
__v2di *d0a_in = (__v2di*)&coefD_a[0];
__v2di *d1a_in = (__v2di*)&coefD_a[2];
__v2di *d2a_in = (__v2di*)&coefD_a[4];
__v2di *d3a_in = (__v2di*)&coefD_a[6];
__v2di *e0a_in = (__v2di*)&coefE_a[0];
__v2di *e1a_in = (__v2di*)&coefE_a[2];
__v2di *e2a_in = (__v2di*)&coefE_a[4];
__v2di *e3a_in = (__v2di*)&coefE_a[6];
__v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16];
__v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16];
__v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16];
__v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16];
__v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16];
__v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16];
__v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16];
__v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16];
__v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16];
__v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16];
__v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16];
__v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16];
__v2di *out_0 = (__v2di*)&data_out[ 0*data_count/16];
__v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16];
__v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16];
__v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16];
__v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16];
__v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16];
__v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16];
__v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16];
__v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16];
__v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16];
__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];
#pragma ivdep
#pragma unroll(2)
#pragma prefetch
for(int64_t i = 0; i < data_count/32; ++i)
{
// First radix-4 stage: load 16 input vectors and the coefficients from the *_a tables
__v2di xy0 = xy0_in[16*i];
__v2di zw0 = zw0_in[16*i];
__v2di xy1 = xy1_in[16*i];
__v2di zw1 = zw1_in[16*i];
__v2di c0 = c0a_in[4*i];
__v2di d0 = d0a_in[4*i];
__v2di e0 = e0a_in[4*i];
__v2di xy2 = xy2_in[16*i];
__v2di zw2 = zw2_in[16*i];
__v2di xy3 = xy3_in[16*i];
__v2di zw3 = zw3_in[16*i];
__v2di c1 = c1a_in[4*i];
__v2di d1 = d1a_in[4*i];
__v2di e1 = e1a_in[4*i];
__v2di xy4 = xy4_in[16*i];
__v2di zw4 = zw4_in[16*i];
__v2di xy5 = xy5_in[16*i];
__v2di zw5 = zw5_in[16*i];
__v2di c2 = c2a_in[4*i];
__v2di d2 = d2a_in[4*i];
__v2di e2 = e2a_in[4*i];
__v2di xy6 = xy6_in[16*i];
__v2di zw6 = zw6_in[16*i];
__v2di xy7 = xy7_in[16*i];
__v2di zw7 = zw7_in[16*i];
__v2di c3 = c3a_in[4*i];
__v2di d3 = d3a_in[4*i];
__v2di e3 = e3a_in[4*i];
__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63});
__v2di conj_e3 = __builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63});
__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
__v2di cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
__v2di cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
__v2di dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
__v2di dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
__v2di dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
__v2di dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
__v2di ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
__v2di ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
__v2di ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
__v2di ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
__v2di cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
__v2di cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
__v2di dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
__v2di dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
__v2di dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
__v2di dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
__v2di ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
__v2di ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
__v2di ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
__v2di ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);
__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
__v2di out0 = __builtin_e2k_qpfadds(add02_0, add13_0);
__v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1);
__v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2);
__v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3);
__v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
__v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
__v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
__v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
__v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0);
__v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1);
__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
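// Second radix-4 stage: the first-stage results stay in registers and become the inputs; coefficients come from the *_b tables
xy0 = out0;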
zw0 = out1;
xy1 = out2;
zw1 = out3;
c0 = c0b_in[i];
d0 = d0b_in[i];
e0 = e0b_in[i];
xy2 = out4;
zw2 = out5;
xy3 = out6;
zw3 = out7;
c1 = c1b_in[i];
d1 = d1b_in[i];
e1 = e1b_in[i];
xy4 = out8;
zw4 = out9;
xy5 = out10;
zw5 = out11;
c2 = c2b_in[i];
d2 = d2b_in[i];
e2 = e2b_in[i];
xy6 = out12;
zw6 = out13;
xy7 = out14;
zw7 = out15;
c3 = c3b_in[i];
d3 = d3b_in[i];
e3 = e3b_in[i];
x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
conj_c0 = __builtin_e2k_qpxor(c0, (__v2di){1LL<<63, 1LL<<63});
conj_c1 = __builtin_e2k_qpxor(c1, (__v2di){1LL<<63, 1LL<<63});
conj_c2 = __builtin_e2k_qpxor(c2, (__v2di){1LL<<63, 1LL<<63});
conj_c3 = __builtin_e2k_qpxor(c3, (__v2di){1LL<<63, 1LL<<63});
conj_d0 = __builtin_e2k_qpxor(d0, (__v2di){1LL<<63, 1LL<<63});
conj_d1 = __builtin_e2k_qpxor(d1, (__v2di){1LL<<63, 1LL<<63});
conj_d2 = __builtin_e2k_qpxor(d2, (__v2di){1LL<<63, 1LL<<63});
conj_d3 = __builtin_e2k_qpxor(d3, (__v2di){1LL<<63, 1LL<<63});
conj_e0 = __builtin_e2k_qpxor(e0, (__v2di){1LL<<63, 1LL<<63});
conj_e1 = __builtin_e2k_qpxor(e1, (__v2di){1LL<<63, 1LL<<63});
conj_e2 = __builtin_e2k_qpxor(e2, (__v2di){1LL<<63, 1LL<<63});
conj_e3 = __builtin_e2k_qpxor(e3, (__v2di){1LL<<63, 1LL<<63});
swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);
cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
add02_0 = __builtin_e2k_qpfadds( x0, dz0);
add02_1 = __builtin_e2k_qpfadds( x1, dz1);
add02_2 = __builtin_e2k_qpfadds( x2, dz2);
add02_3 = __builtin_e2k_qpfadds( x3, dz3);
sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0);
out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1);
out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2);
out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3);
out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0);
out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1);
out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
}
}
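For readers who want to cross-check the intrinsics, the arithmetic of one butterfly in the loop above can be sketched in plain scalar C. This is only a reference sketch, not code from the article: the function name `radix4_butterfly_ref` is hypothetical, and it assumes that each `float complex` keeps its real part in the low 32 bits of a 64-bit lane, so that the `qpxor`/`qpfmuls`/`qpfhadds`/`qpshufb` sequence amounts to an ordinary complex product and the `swap` plus `qpxor` with `1LL<<31` amounts to multiplication by i.

```c
#include <assert.h>
#include <complex.h>

/* Hypothetical scalar reference (not from the article): one radix-4
   butterfly with the same arithmetic as a single lane of the vector loop.
   x, y, z, w are the inputs; c, d, e are the twiddle factors. */
void radix4_butterfly_ref(float complex x, float complex y,
                          float complex z, float complex w,
                          float complex c, float complex d,
                          float complex e, float complex out[4])
{
    assert(out != 0);

    /* Twiddle products (vector code: qpfmuls, qpfhadds, qpshufb). */
    float complex cy = c * y;
    float complex dz = d * z;
    float complex ew = e * w;

    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;

    /* Swap re/im and negate the new real part: multiplication by i. */
    float complex sub13i = -cimagf(sub13) + crealf(sub13) * I;

    out[0] = add02 + add13;   /* corresponds to out_0..out_3   */
    out[1] = sub02 - sub13i;  /* corresponds to out_4..out_7   */
    out[2] = add02 - add13;   /* corresponds to out_8..out_11  */
    out[3] = sub02 + sub13i;  /* corresponds to out_12..out_15 */
}
```

With all twiddle factors equal to 1, the function reduces to a 4-point DFT, which gives a quick way to sanity-check the sign conventions.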
Main loop in assembly
.L11610:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=128
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=160
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=192
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=224
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=256
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=288
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=320
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=352
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=384
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=416
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=448
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=480
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=16, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=1, incr=2, ind=0, asz=1, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=15, asz=1, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=14, asz=1, abs=22, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=13, asz=1, abs=22, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=12, asz=1, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=11, asz=1, abs=24, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=10, asz=1, abs=26, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=9, asz=1, abs=26, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=8, asz=1, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=7, asz=1, abs=28, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=6, asz=1, abs=30, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=5, asz=1, abs=30, disp=0
}
.L7588:
{
loop_mode
disp %ctpr1, .L7588
movaqp,0 area=0, ind=0, am=1, be=0, %g17
movaqp,1 area=0, ind=16, am=0, be=0, %g16
movaqp,2 area=0, ind=0, am=1, be=0, %g19
movaqp,3 area=0, ind=16, am=0, be=0, %g18
}
{
loop_mode
movaqp,0 area=1, ind=0, am=1, be=0, %g21
movaqp,1 area=1, ind=16, am=0, be=0, %g20
movaqp,2 area=1, ind=0, am=1, be=0, %g23
movaqp,3 area=1, ind=16, am=0, be=0, %g22
}
{
loop_mode
movaqp,0 area=2, ind=0, am=1, be=0, %g25
movaqp,1 area=2, ind=16, am=0, be=0, %g24
movaqp,2 area=2, ind=0, am=1, be=0, %g27
movaqp,3 area=2, ind=16, am=0, be=0, %g26
}
{
loop_mode
movaqp,0 area=3, ind=0, am=1, be=0, %g29
movaqp,1 area=3, ind=16, am=0, be=0, %g28
movaqp,2 area=3, ind=0, am=1, be=0, %g31
movaqp,3 area=3, ind=16, am=0, be=0, %g30
}
{
loop_mode
movaqp,0 area=4, ind=0, am=1, be=0, %b[20]
movaqp,1 area=4, ind=16, am=0, be=0, %b[19]
movaqp,2 area=4, ind=0, am=1, be=0, %b[22]
movaqp,3 area=4, ind=16, am=0, be=0, %b[21]
}
{
loop_mode
qpshufb,0 %g19, %g17, %r24, %b[27]
qpshufb,1 %g18, %g16, %r24, %b[28]
qpshufb,3 %g18, %g16, %r7, %g16
qpshufb,4 %g19, %g17, %r7, %g17
movaqp,0 area=5, ind=0, am=1, be=0, %b[24]
movaqp,1 area=5, ind=16, am=0, be=0, %b[23]
movaqp,2 area=5, ind=0, am=1, be=0, %b[26]
movaqp,3 area=5, ind=16, am=0, be=0, %b[25]
}
{
loop_mode
qpshufb,0 %g23, %g21, %r24, %b[31]
qpshufb,1 %g22, %g20, %r24, %b[32]
qpshufb,3 %g22, %g20, %r7, %g20
qpshufb,4 %g23, %g21, %r7, %g21
movaqp,0 area=6, ind=0, am=1, be=0, %g19
movaqp,1 area=6, ind=16, am=0, be=0, %g18
movaqp,2 area=6, ind=0, am=1, be=0, %b[30]
movaqp,3 area=6, ind=16, am=0, be=0, %b[29]
}
{
loop_mode
qpshufb,0 %g27, %g25, %r24, %b[35]
qpshufb,1 %g26, %g24, %r24, %b[36]
qpshufb,3 %g26, %g24, %r7, %g24
qpshufb,4 %g27, %g25, %r7, %g25
movaqp,0 area=7, ind=0, am=1, be=0, %g23
movaqp,1 area=7, ind=16, am=0, be=0, %g22
movaqp,2 area=7, ind=0, am=1, be=0, %b[34]
movaqp,3 area=7, ind=16, am=0, be=0, %b[33]
}
{
loop_mode
qpshufb,0 %g31, %g29, %r24, %b[39]
qpshufb,1 %g30, %g28, %r24, %b[40]
qpshufb,3 %g30, %g28, %r7, %g28
qpshufb,4 %g31, %g29, %r7, %g29
movaqp,0 area=8, ind=0, am=1, be=0, %g27
movaqp,1 area=8, ind=16, am=0, be=0, %g26
movaqp,2 area=8, ind=0, am=1, be=0, %b[38]
movaqp,3 area=8, ind=16, am=0, be=0, %b[37]
}
{
loop_mode
qpshufb,0 %b[22], %b[20], %r24, %b[43]
qpshufb,1 %b[21], %b[19], %r24, %b[44]
qpshufb,3 %b[21], %b[19], %r7, %b[19]
qpshufb,4 %b[22], %b[20], %r7, %b[20]
movaqp,0 area=9, ind=0, am=1, be=0, %g31
movaqp,1 area=9, ind=16, am=0, be=0, %g30
movaqp,2 area=9, ind=0, am=1, be=0, %b[42]
movaqp,3 area=9, ind=16, am=0, be=0, %b[41]
}
{
loop_mode
qpshufb,0 %b[26], %b[24], %r24, %b[47]
qpshufb,1 %b[25], %b[23], %r24, %b[48]
qpshufb,3 %b[25], %b[23], %r7, %b[23]
qpshufb,4 %b[26], %b[24], %r7, %b[24]
movaqp,0 area=10, ind=0, am=1, be=0, %b[22]
movaqp,1 area=10, ind=16, am=0, be=0, %b[21]
movaqp,2 area=10, ind=0, am=1, be=0, %b[46]
movaqp,3 area=10, ind=16, am=0, be=0, %b[45]
}
{
loop_mode
qpshufb,0 %b[30], %g19, %r24, %b[51]
qpshufb,1 %b[29], %g18, %r24, %b[52]
qpshufb,3 %b[29], %g18, %r7, %g18
qpshufb,4 %b[30], %g19, %r7, %g19
movaqp,0 area=11, ind=0, am=1, be=0, %b[26]
movaqp,1 area=11, ind=16, am=0, be=0, %b[25]
movaqp,2 area=11, ind=0, am=1, be=0, %b[50]
movaqp,3 area=11, ind=16, am=0, be=0, %b[49]
}
{
loop_mode
qpshufb,0 %b[34], %g23, %r24, %b[55]
qpshufb,1 %b[33], %g22, %r24, %b[56]
qpshufb,3 %b[33], %g22, %r7, %g22
qpshufb,4 %b[34], %g23, %r7, %g23
movaqp,0 area=12, ind=0, am=1, be=0, %b[30]
movaqp,1 area=12, ind=16, am=0, be=0, %b[29]
movaqp,2 area=12, ind=0, am=1, be=0, %b[54]
movaqp,3 area=12, ind=16, am=0, be=0, %b[53]
}
{
loop_mode
qpxor,0 %g27, %r6, %b[59]
qpxor,1 %g26, %r6, %b[60]
qpxor,3 %b[38], %r6, %b[61]
qpxor,4 %b[37], %r6, %b[62]
movaqp,0 area=13, ind=0, am=1, be=0, %b[34]
movaqp,1 area=13, ind=16, am=0, be=0, %b[33]
movaqp,2 area=13, ind=0, am=1, be=0, %b[58]
movaqp,3 area=13, ind=16, am=0, be=0, %b[57]
}
{
loop_mode
qpxor,0 %g31, %r6, %b[63]
qpxor,1 %g30, %r6, %b[64]
qpfmuls,2 %b[60], %b[31], %b[60]
qpxor,3 %b[42], %r6, %b[65]
qpxor,4 %b[41], %r6, %b[66]
qpfmuls,5 %b[62], %b[39], %b[62]
movaqp,0 area=14, ind=0, am=1, be=0, %b[68]
movaqp,1 area=14, ind=16, am=0, be=0, %b[67]
movaqp,2 area=14, ind=0, am=1, be=0, %b[70]
movaqp,3 area=14, ind=16, am=0, be=0, %b[69]
}
{
loop_mode
qpfmuls,0 %b[63], %b[43], %b[63]
qpfmuls,1 %b[64], %b[47], %b[64]
qpfmuls,2 %b[59], %b[27], %b[59]
qpfmuls,3 %b[65], %b[51], %b[65]
qpxor,4 %b[22], %r6, %b[71]
qpfmuls,5 %b[61], %b[35], %b[61]
movaqp,0 area=15, ind=0, am=1, be=0, %b[73]
movaqp,1 area=15, ind=16, am=0, be=0, %b[72]
movaqp,2 area=15, ind=0, am=1, be=0, %b[75]
movaqp,3 area=15, ind=16, am=0, be=0, %b[74]
}
{
loop_mode
qpxor,0 %b[21], %r6, %b[76]
qpxor,1 %b[45], %r6, %b[77]
qpxor,3 %b[46], %r6, %b[78]
qpxor,4 %b[26], %r6, %b[79]
qpfmuls,5 %b[66], %b[55], %b[66]
movaqp,0 area=16, ind=0, am=1, be=0, %b[81]
movaqp,1 area=16, ind=16, am=0, be=0, %b[80]
movaqp,2 area=16, ind=0, am=1, be=0, %b[83]
movaqp,3 area=16, ind=16, am=0, be=0, %b[82]
}
{
loop_mode
qpfmuls,0 %b[77], %g28, %b[77]
qpfmuls,2 %b[76], %g20, %b[76]
qpfmuls,3 %b[78], %g24, %b[78]
qpxor,4 %b[30], %r6, %b[84]
qpfmuls,5 %b[71], %g16, %b[71]
movaqp,0 area=17, ind=0, am=1, be=0, %b[86]
movaqp,1 area=17, ind=16, am=0, be=0, %b[85]
movaqp,2 area=17, ind=0, am=1, be=0, %b[88]
movaqp,3 area=17, ind=16, am=0, be=0, %b[87]
}
{
loop_mode
qpxor,0 %b[29], %r6, %b[89]
qpxor,1 %b[53], %r6, %b[90]
qpxor,3 %b[54], %r6, %b[91]
qpxor,4 %b[34], %r6, %b[92]
qpfmuls,5 %b[84], %b[28], %b[84]
movaqp,0 area=18, ind=0, am=1, be=0, %b[94]
movaqp,1 area=18, ind=16, am=0, be=0, %b[93]
movaqp,2 area=18, ind=0, am=1, be=0, %b[96]
movaqp,3 area=18, ind=16, am=0, be=0, %b[95]
}
{
loop_mode
qpfmuls,0 %b[90], %b[40], %b[90]
qpxor,1 %b[33], %r6, %b[97]
qpfmuls,2 %b[89], %b[32], %b[89]
qpfmuls,3 %b[92], %b[44], %b[92]
qpxor,4 %b[58], %r6, %b[98]
qpfmuls,5 %b[91], %b[36], %b[91]
movaqp,0 area=19, ind=0, am=1, be=0, %b[100]
movaqp,1 area=19, ind=16, am=0, be=0, %b[99]
movaqp,2 area=19, ind=0, am=1, be=0, %b[102]
movaqp,3 area=19, ind=16, am=0, be=0, %b[101]
}
{
loop_mode
qpxor,0 %b[57], %r6, %b[103]
qpxor,1 %b[25], %r6, %b[104]
qpfmuls,2 %b[97], %b[48], %b[97]
qpxor,3 %b[49], %r6, %b[105]
qpxor,4 %b[50], %r6, %b[106]
qpfmuls,5 %b[98], %b[52], %b[98]
}
{
loop_mode
qpfmuls,0 %b[79], %b[19], %b[79]
qpfmuls,1 %b[104], %b[23], %b[104]
qpfmuls,2 %b[103], %b[56], %b[103]
qpfmuls,3 %b[106], %g18, %b[106]
qpshufb,4 %g27, %g27, %r25, %g27
qpfmuls,5 %b[105], %g22, %b[105]
}
{
loop_mode
qpshufb,0 %g26, %g26, %r25, %g26
qpshufb,1 %b[37], %b[37], %r25, %b[37]
qpshufb,3 %b[38], %b[38], %r25, %b[38]
qpshufb,4 %g31, %g31, %r25, %g31
qpfmul_hadds,5 %g27, %b[27], %b[59], %g27
}
{
loop_mode
qpfmul_hadds,0 %b[37], %b[39], %b[62], %b[27]
qpfmul_hadds,2 %g26, %b[31], %b[60], %g26
qpfmul_hadds,3 %g31, %b[43], %b[63], %g31
qpshufb,4 %g30, %g30, %r25, %g30
qpfmul_hadds,5 %b[38], %b[35], %b[61], %b[31]
}
{
loop_mode
qpshufb,0 %b[41], %b[41], %r25, %b[35]
qpshufb,1 %b[30], %b[30], %r25, %b[30]
qpshufb,3 %b[29], %b[29], %r25, %b[29]
qpshufb,4 %b[42], %b[42], %r25, %b[37]
qpfmul_hadds,5 %g30, %b[47], %b[64], %g30
}
{
loop_mode
qpshufb,0 %b[54], %b[54], %r25, %b[38]
qpshufb,1 %b[53], %b[53], %r25, %b[39]
qpfmul_hadds,2 %b[35], %b[55], %b[66], %b[35]
qpshufb,3 %b[34], %b[34], %r25, %b[34]
qpshufb,4 %b[33], %b[33], %r25, %b[33]
qpfmul_hadds,5 %b[37], %b[51], %b[65], %b[37]
}
{
loop_mode
qpshufb,0 %b[58], %b[58], %r25, %b[41]
qpshufb,1 %b[57], %b[57], %r25, %b[42]
qpfmul_hadds,2 %b[30], %b[28], %b[84], %b[28]
qpfmul_hadds,3 %b[29], %b[32], %b[89], %b[29]
qpfmul_hadds,4 %b[34], %b[44], %b[92], %b[32]
qpfmul_hadds,5 %b[33], %b[48], %b[97], %b[30]
}
{
loop_mode
qpfmul_hadds,0 %b[38], %b[36], %b[91], %b[34]
qpfmul_hadds,1 %b[41], %b[52], %b[98], %b[36]
qpfmul_hadds,2 %b[39], %b[40], %b[90], %b[33]
qpshufb,3 %b[21], %b[21], %r25, %b[21]
qpshufb,4 %b[22], %b[22], %r25, %b[22]
}
{
loop_mode
qpshufb,0 %b[46], %b[46], %r25, %b[39]
qpshufb,1 %b[45], %b[45], %r25, %b[40]
qpfmul_hadds,2 %b[42], %b[56], %b[103], %b[38]
qpshufb,3 %b[26], %b[26], %r25, %b[26]
qpshufb,4 %b[25], %b[25], %r25, %b[25]
qpfmul_hadds,5 %b[21], %g20, %b[76], %g20
}
{
loop_mode
qpfmul_hadds,0 %b[40], %g28, %b[77], %g28
qpshufb,1 %b[50], %b[50], %r25, %b[21]
qpfmul_hadds,2 %b[39], %g24, %b[78], %g24
qpfmul_hadds,3 %b[26], %b[19], %b[79], %b[19]
qpshufb,4 %b[49], %b[49], %r25, %b[41]
qpfmul_hadds,5 %b[22], %g16, %b[71], %g16
}
{
loop_mode
qpxor,0 %b[67], %r6, %b[21]
qpxor,1 %b[68], %r6, %b[23]
qpfmul_hadds,2 %b[21], %g18, %b[106], %g18
qpfmul_hadds,3 %b[41], %g22, %b[105], %g22
qpshufb,4 %g27, %g27, %r22, %g27
qpfmul_hadds,5 %b[25], %b[23], %b[104], %b[22]
}
{
loop_mode
qpshufb,0 %g26, %g26, %r22, %g26
qpshufb,1 %b[27], %b[27], %r22, %b[26]
qpshufb,3 %b[31], %b[31], %r22, %b[25]
qpshufb,4 %g31, %g31, %r22, %g31
}
{
loop_mode
qpxor,0 %b[73], %r6, %b[27]
qpxor,1 %b[72], %r6, %b[31]
}
{
loop_mode
qpshufb,3 %g30, %g30, %r22, %g30
qpshufb,4 %b[37], %b[37], %r22, %b[37]
}
{
loop_mode
qpshufb,0 %b[35], %b[35], %r22, %b[35]
qpshufb,1 %b[28], %b[28], %r22, %b[28]
qpshufb,3 %b[29], %b[29], %r22, %b[29]
qpshufb,4 %b[32], %b[32], %r22, %b[32]
}
{
loop_mode
qpfsubs,0 %g27, %b[28], %b[39]
qpshufb,1 %b[34], %b[34], %r22, %b[34]
qpfadds,2 %g27, %b[28], %g27
qpfsubs,3 %g26, %b[29], %b[40]
qpshufb,4 %b[30], %b[30], %r22, %b[30]
qpfsubs,5 %g31, %b[32], %b[41]
}
{
loop_mode
qpshufb,0 %b[33], %b[33], %r22, %b[28]
qpshufb,1 %b[36], %b[36], %r22, %b[33]
qpfsubs,2 %b[25], %b[34], %b[36]
qpfadds,3 %g26, %b[29], %g26
qpshufb,4 %g20, %g20, %r22, %g20
qpfsubs,5 %g30, %b[30], %b[42]
}
{
loop_mode
qpfsubs,0 %b[37], %b[33], %b[43]
qpshufb,1 %b[38], %b[38], %r22, %b[29]
qpfsubs,2 %b[26], %b[28], %b[38]
qpfadds,3 %g31, %b[32], %g31
qpshufb,4 %g16, %g16, %r22, %g16
qpfadds,5 %g30, %b[30], %g30
}
{
loop_mode
qpshufb,0 %g28, %g28, %r22, %g28
qpshufb,1 %g24, %g24, %r22, %g24
qpfsubs,2 %b[35], %b[29], %b[30]
qpfadds,3 %g21, %g20, %b[32]
qpshufb,4 %b[22], %b[22], %r22, %b[22]
qpfsubs,5 %g21, %g20, %g20
}
{
loop_mode
qpfadds,0 %b[26], %b[28], %b[19]
qpshufb,1 %b[19], %b[19], %r22, %g21
qpfadds,2 %b[25], %b[34], %b[25]
qpfadds,3 %b[24], %b[22], %b[26]
qpshufb,4 %g22, %g22, %r22, %g22
qpfsubs,5 %b[24], %b[22], %b[22]
}
{
loop_mode
qpshufb,0 %g18, %g18, %r22, %g18
qpfadds,1 %b[37], %b[33], %b[24]
qpfadds,2 %b[35], %b[29], %b[28]
qpfadds,3 %g17, %g16, %b[29]
qpfsubs,4 %g17, %g16, %g16
qpfadds,5 %g23, %g22, %g17
}
{
loop_mode
qpfadds,0 %g29, %g28, %b[33]
qpfsubs,1 %g29, %g28, %g28
qpfadds,2 %b[20], %g21, %g29
qpshufb,4 %b[39], %b[39], %r25, %g23
qpfsubs,5 %g23, %g22, %g22
}
{
loop_mode
qpfsubs,0 %g25, %g24, %g24
qpfsubs,1 %g19, %g18, %g25
qpfsubs,2 %b[20], %g21, %g21
qpfsubs,3 %b[32], %g26, %b[34]
qpfadds,4 %b[32], %g26, %g26
qpfadds,5 %g25, %g24, %b[20]
}
{
loop_mode
qpfadds,2 %g19, %g18, %g18
qpshufb,3 %b[40], %b[40], %r25, %g19
qpshufb,4 %b[36], %b[36], %r25, %b[32]
qpfsubs,5 %b[26], %g30, %b[35]
}
{
loop_mode
qpfadds,3 %b[26], %g30, %g30
qpfadds,4 %b[29], %g27, %g27
qpfsubs,5 %b[29], %g27, %b[36]
}
{
loop_mode
qpshufb,0 %b[38], %b[38], %r25, %b[26]
qpshufb,1 %b[42], %b[42], %r25, %b[29]
qpfsubs,2 %g29, %g31, %b[38]
qpshufb,3 %b[41], %b[41], %r25, %b[37]
qpshufb,4 %b[30], %b[30], %r25, %b[30]
}
{
loop_mode
qpshufb,0 %b[43], %b[43], %r25, %b[39]
qpxor,1 %g23, %r23, %g23
qpfsubs,2 %b[33], %b[19], %b[40]
qpfadds,3 %b[20], %b[25], %b[41]
qpfadds,4 %g17, %b[28], %b[25]
qpfsubs,5 %b[20], %b[25], %b[20]
}
{
loop_mode
qpxor,0 %g19, %r23, %g19
qpxor,1 %b[32], %r23, %b[32]
qpfadds,2 %g29, %g31, %g29
qpxor,3 %b[37], %r23, %b[37]
qpxor,4 %b[30], %r23, %b[30]
qpfadds,5 %b[33], %b[19], %g31
}
{
loop_mode
qpxor,0 %b[29], %r23, %b[19]
qpxor,1 %b[26], %r23, %b[26]
qpfadds,2 %g20, %g19, %b[29]
qpfsubs,3 %g17, %b[28], %g17
qpfsubs,4 %g21, %b[37], %b[28]
qpfadds,5 %g21, %b[37], %g21
}
{
loop_mode
qpxor,0 %b[39], %r23, %b[33]
qpfsubs,1 %g20, %g19, %g19
qpfadds,2 %g18, %b[24], %g20
qpfsubs,3 %g18, %b[24], %g18
qpfsubs,4 %g22, %b[30], %g22
qpfadds,5 %g22, %b[30], %b[24]
}
{
loop_mode
qpfsubs,0 %g16, %g23, %b[30]
qpfadds,1 %g16, %g23, %g16
qpfadds,2 %g24, %b[32], %g23
}
{
loop_mode
qpfadds,0 %g28, %b[26], %b[37]
qpfsubs,1 %g24, %b[32], %g24
qpfsubs,2 %g28, %b[26], %g28
}
{
loop_mode
qpfsubs,0 %g25, %b[33], %b[22]
qpfadds,1 %g25, %b[33], %g25
qpfsubs,2 %b[22], %b[19], %b[26]
qpxor,3 %b[75], %r6, %b[32]
qpxor,4 %b[83], %r6, %b[33]
qpfadds,5 %b[22], %b[19], %b[19]
}
{
loop_mode
qpxor,3 %b[74], %r6, %b[39]
qpxor,4 %b[82], %r6, %b[42]
}
{
loop_mode
qpxor,3 %b[86], %r6, %b[43]
qpxor,4 %b[94], %r6, %b[44]
}
{
loop_mode
qpxor,0 %b[85], %r6, %b[45]
qpxor,1 %b[93], %r6, %b[46]
qpxor,3 %b[96], %r6, %b[47]
qpxor,4 %b[102], %r6, %b[48]
}
{
loop_mode
qpxor,0 %b[95], %r6, %b[49]
qpxor,1 %b[101], %r6, %b[50]
qpshufb,3 %b[41], %g27, %r24, %b[51]
qpshufb,4 %g31, %g26, %r24, %b[52]
}
{
loop_mode
qpshufb,0 %b[20], %b[36], %r24, %b[53]
qpshufb,1 %b[40], %b[34], %r24, %b[54]
qpshufb,3 %g20, %g29, %r24, %b[55]
qpshufb,4 %b[25], %g30, %r24, %b[56]
qpfmuls,5 %b[23], %b[51], %b[23]
}
{
loop_mode
qpshufb,0 %g18, %b[38], %r24, %b[57]
qpshufb,1 %g17, %b[35], %r24, %b[58]
qpfmuls,2 %b[43], %b[53], %b[43]
qpshufb,3 %g24, %b[30], %r24, %b[59]
qpshufb,4 %g28, %g19, %r24, %b[60]
qpfmuls,5 %b[27], %b[52], %b[27]
}
{
loop_mode
qpshufb,0 %g23, %g16, %r24, %b[61]
qpshufb,1 %b[37], %b[29], %r24, %b[62]
qpfmuls,2 %b[44], %b[54], %b[44]
qpshufb,3 %b[22], %b[28], %r24, %b[63]
qpshufb,4 %g22, %b[26], %r24, %b[64]
qpfmuls,5 %b[21], %b[55], %b[21]
}
{
loop_mode
qpshufb,0 %g25, %g21, %r24, %b[65]
qpshufb,1 %b[24], %b[19], %r24, %b[66]
qpfmuls,2 %b[45], %b[57], %b[45]
qpfmuls,3 %b[31], %b[56], %b[31]
qpfmuls,4 %b[32], %b[59], %b[32]
qpfmuls,5 %b[33], %b[60], %b[33]
}
{
loop_mode
qpfmuls,0 %b[46], %b[58], %b[46]
qpfmuls,1 %b[47], %b[61], %b[47]
qpfmuls,2 %b[48], %b[62], %b[48]
qpfmuls,3 %b[42], %b[64], %b[42]
qpxor,4 %b[70], %r6, %b[71]
qpfmuls,5 %b[39], %b[63], %b[39]
}
{
loop_mode
qpfmuls,0 %b[50], %b[66], %b[50]
qpxor,1 %b[81], %r6, %b[76]
qpfmuls,2 %b[49], %b[65], %b[49]
}
{
loop_mode
qpxor,4 %b[69], %r6, %b[77]
}
{
loop_mode
qpxor,1 %b[80], %r6, %b[78]
qpxor,3 %b[88], %r6, %b[79]
qpxor,4 %b[100], %r6, %b[84]
}
{
loop_mode
qpxor,0 %b[87], %r6, %b[89]
qpxor,1 %b[99], %r6, %b[90]
qpshufb,3 %g31, %g26, %r7, %g26
qpshufb,4 %b[40], %b[34], %r7, %g31
}
{
loop_mode
qpshufb,0 %b[25], %g30, %r7, %g30
qpshufb,1 %g17, %b[35], %r7, %g17
qpshufb,3 %g28, %g19, %r7, %g19
qpshufb,4 %b[37], %b[29], %r7, %g28
qpfmuls,5 %b[79], %g31, %b[25]
}
{
loop_mode
qpshufb,0 %g22, %b[26], %r7, %g22
qpshufb,1 %b[24], %b[19], %r7, %b[19]
qpfmuls,2 %b[77], %g30, %b[26]
qpfmuls,3 %b[76], %g19, %b[29]
qpfmuls,4 %b[84], %g28, %b[34]
qpfmuls,5 %b[71], %g26, %b[24]
}
{
loop_mode
qpfmuls,0 %b[90], %b[19], %b[37]
qpfmuls,1 %b[89], %g17, %b[40]
qpfmuls,2 %b[78], %g22, %b[35]
qpshufb,3 %b[68], %b[68], %r25, %b[68]
qpshufb,4 %b[67], %b[67], %r25, %b[67]
}
{
loop_mode
qpshufb,0 %b[73], %b[73], %r25, %b[71]
qpshufb,1 %b[72], %b[72], %r25, %b[72]
qpfmul_hadds,3 %b[67], %b[55], %b[21], %b[21]
qpfmul_hadds,5 %b[68], %b[51], %b[23], %b[23]
}
{
loop_mode
qpfmul_hadds,0 %b[72], %b[56], %b[31], %b[31]
qpfmul_hadds,2 %b[71], %b[52], %b[27], %b[27]
qpshufb,3 %b[75], %b[75], %r25, %b[51]
qpshufb,4 %b[83], %b[83], %r25, %b[55]
}
{
loop_mode
qpshufb,0 %b[74], %b[74], %r25, %b[52]
qpshufb,1 %b[82], %b[82], %r25, %b[56]
qpshufb,3 %b[86], %b[86], %r25, %b[67]
qpshufb,4 %b[94], %b[94], %r25, %b[68]
qpfmul_hadds,5 %b[51], %b[59], %b[32], %b[32]
}
{
loop_mode
qpshufb,0 %b[85], %b[85], %r25, %b[51]
qpshufb,1 %b[93], %b[93], %r25, %b[59]
qpfmul_hadds,2 %b[52], %b[63], %b[39], %b[39]
qpshufb,3 %b[96], %b[96], %r25, %b[71]
qpshufb,4 %b[102], %b[102], %r25, %b[72]
qpfmul_hadds,5 %b[55], %b[60], %b[33], %b[33]
}
{
loop_mode
qpshufb,0 %b[95], %b[95], %r25, %b[52]
qpshufb,1 %b[101], %b[101], %r25, %b[55]
qpfmul_hadds,2 %b[56], %b[64], %b[42], %b[42]
qpfmul_hadds,3 %b[67], %b[53], %b[43], %b[43]
qpfmul_hadds,4 %b[68], %b[54], %b[44], %b[44]
qpfmul_hadds,5 %b[71], %b[61], %b[47], %b[47]
}
{
loop_mode
qpfmul_hadds,0 %b[59], %b[58], %b[46], %b[46]
qpfmul_hadds,1 %b[52], %b[65], %b[49], %b[49]
qpfmul_hadds,2 %b[51], %b[57], %b[45], %b[45]
qpshufb,3 %b[70], %b[70], %r25, %b[51]
qpshufb,4 %b[69], %b[69], %r25, %b[52]
qpfmul_hadds,5 %b[72], %b[62], %b[48], %b[48]
}
{
loop_mode
qpshufb,0 %b[81], %b[81], %r25, %b[53]
qpshufb,1 %b[88], %b[88], %r25, %b[54]
qpfmul_hadds,2 %b[55], %b[66], %b[50], %b[50]
qpfmul_hadds,3 %b[52], %g30, %b[26], %g30
qpshufb,4 %b[80], %b[80], %r25, %b[55]
qpfmul_hadds,5 %b[51], %g26, %b[24], %g26
}
{
loop_mode
qpfmul_hadds,0 %b[53], %g19, %b[29], %g19
qpshufb,1 %b[87], %b[87], %r25, %b[24]
qpfmul_hadds,2 %b[54], %g31, %b[25], %g31
qpshufb,3 %b[100], %b[100], %r25, %b[26]
qpshufb,4 %b[99], %b[99], %r25, %b[51]
qpfmul_hadds,5 %b[55], %g22, %b[35], %g22
}
{
loop_mode
qpshufb,0 %b[20], %b[36], %r7, %b[20]
qpshufb,1 %b[41], %g27, %r7, %g27
qpfmul_hadds,2 %b[24], %g17, %b[40], %g17
qpfmul_hadds,3 %b[51], %b[19], %b[37], %b[19]
qpshufb,4 %b[23], %b[23], %r22, %b[23]
qpfmul_hadds,5 %b[26], %g28, %b[34], %g28
}
{
loop_mode
qpshufb,0 %b[27], %b[27], %r22, %b[24]
qpshufb,1 %b[31], %b[31], %r22, %b[25]
qpshufb,3 %b[21], %b[21], %r22, %b[21]
qpshufb,4 %g20, %g29, %r7, %g20
}
{
loop_mode
qpshufb,0 %g18, %b[38], %r7, %g18
qpshufb,1 %g23, %g16, %r7, %g16
}
{
loop_mode
qpshufb,3 %b[32], %b[32], %r22, %g23
qpshufb,4 %b[33], %b[33], %r22, %g29
}
{
loop_mode
qpshufb,0 %b[39], %b[39], %r22, %b[26]
qpshufb,1 %b[42], %b[42], %r22, %b[27]
qpshufb,4 %b[43], %b[43], %r22, %b[29]
}
{
loop_mode
qpfsubs,0 %b[26], %b[27], %b[36]
qpshufb,1 %b[45], %b[45], %r22, %b[31]
qpfsubs,2 %b[23], %b[24], %b[34]
qpshufb,3 %b[44], %b[44], %r22, %b[32]
qpshufb,4 %b[47], %b[47], %r22, %b[33]
qpfsubs,5 %b[21], %b[25], %b[35]
}
{
loop_mode
qpshufb,0 %b[46], %b[46], %r22, %b[37]
qpshufb,1 %b[49], %b[49], %r22, %b[38]
qpfadds,2 %b[23], %b[24], %b[23]
qpfsubs,3 %b[29], %b[32], %b[41]
qpshufb,4 %b[48], %b[48], %r22, %b[39]
qpfsubs,5 %g23, %g29, %b[40]
}
{
loop_mode
qpfadds,0 %b[21], %b[25], %g25
qpshufb,1 %b[50], %b[50], %r22, %b[24]
qpfsubs,2 %b[31], %b[37], %b[43]
qpshufb,3 %g24, %b[30], %r7, %g24
qpshufb,4 %g25, %g21, %r7, %g21
qpfsubs,5 %b[33], %b[39], %b[42]
}
{
loop_mode
qpshufb,0 %b[22], %b[28], %r7, %b[22]
qpshufb,1 %g26, %g26, %r22, %g26
qpfsubs,2 %b[38], %b[24], %b[21]
qpfadds,3 %g23, %g29, %g23
qpshufb,4 %g30, %g30, %r22, %g30
qpfadds,5 %b[26], %b[27], %g29
}
{
loop_mode
qpfadds,0 %b[38], %b[24], %b[24]
qpshufb,1 %g17, %g17, %r22, %g17
qpfadds,2 %b[31], %b[37], %b[25]
qpshufb,3 %g22, %g22, %r22, %g22
qpshufb,4 %g31, %g31, %r22, %g31
qpfadds,5 %b[33], %b[39], %b[26]
}
{
loop_mode
qpshufb,0 %g19, %g19, %r22, %g19
qpshufb,1 %g28, %g28, %r22, %g28
qpfadds,2 %b[29], %b[32], %b[27]
qpfadds,3 %g20, %g30, %b[28]
qpshufb,4 %b[19], %b[19], %r22, %b[19]
qpfsubs,5 %g20, %g30, %g20
}
{
loop_mode
qpfsubs,0 %g27, %g26, %g30
qpfadds,1 %g27, %g26, %g26
qpfadds,2 %g18, %g17, %b[20]
qpfadds,3 %b[20], %g31, %g27
qpfsubs,4 %b[20], %g31, %g31
qpfsubs,5 %g21, %b[19], %b[29]
}
{
loop_mode
qpfsubs,0 %g18, %g17, %g17
qpfsubs,1 %g24, %g19, %g18
qpfadds,2 %g24, %g19, %g19
qpfadds,3 %b[22], %g22, %g24
qpfsubs,4 %b[22], %g22, %g22
qpfadds,5 %g21, %b[19], %g21
}
{
loop_mode
qpfsubs,0 %g16, %g28, %g16
qpfadds,2 %g16, %g28, %b[19]
}
{
loop_mode
qpfsubs,3 %b[28], %g25, %g28
qpfadds,4 %b[28], %g25, %g25
}
{
loop_mode
qpfadds,0 %g26, %b[23], %b[31]
qpshufb,1 %b[35], %b[35], %r25, %b[22]
qpfsubs,2 %g26, %b[23], %g26
qpshufb,3 %b[34], %b[34], %r25, %b[28]
qpshufb,4 %b[40], %b[40], %r25, %b[30]
}
{
loop_mode
qpshufb,0 %b[36], %b[36], %r25, %b[23]
qpshufb,1 %b[21], %b[21], %r25, %b[21]
qpfsubs,2 %b[20], %b[25], %b[27]
qpfadds,3 %g27, %b[27], %b[32]
qpfsubs,4 %g27, %b[27], %g27
qpfadds,5 %g21, %b[24], %b[33]
}
{
loop_mode
qpfadds,0 %b[20], %b[25], %b[20]
qpshufb,1 %b[43], %b[43], %r25, %b[34]
qpfadds,2 %g19, %g23, %b[25]
qpshufb,3 %b[42], %b[42], %r25, %b[35]
qpshufb,4 %b[41], %b[41], %r25, %b[36]
qpfsubs,5 %g21, %b[24], %g21
}
{
loop_mode
qpxor,0 %b[22], %r23, %b[22]
qpxor,1 %b[23], %r23, %b[23]
qpfsubs,2 %b[19], %b[26], %g23
qpfsubs,3 %g19, %g23, %g19
qpfadds,4 %g24, %g29, %b[24]
qpfsubs,5 %g24, %g29, %g24
}
{
loop_mode
qpfadds,0 %b[19], %b[26], %b[19]
qpxor,1 %b[28], %r23, %g29
qpfadds,2 %g20, %b[22], %b[26]
qpxor,3 %b[30], %r23, %b[28]
qpxor,4 %b[35], %r23, %b[30]
stqp,5 %r33, %r0, %g28
}
{
loop_mode
qpxor,0 %b[21], %r23, %g28
qpxor,1 %b[34], %r23, %b[21]
qpfsubs,2 %g30, %g29, %b[34]
qpfadds,3 %g18, %b[28], %b[35]
qpfsubs,4 %g18, %b[28], %g18
qpfadds,5 %g16, %b[30], %b[28]
}
{
loop_mode
qpfadds,0 %g30, %g29, %g29
qpxor,1 %b[36], %r23, %b[36]
qpfsubs,2 %g20, %b[22], %g20
qpfsubs,3 %g16, %b[30], %g16
stqp,5 %r18, %r0, %g26
}
{
loop_mode
qpfsubs,0 %g22, %b[23], %g26
qpfadds,1 %g22, %b[23], %g22
qpfadds,2 %b[29], %g28, %g30
stqp,5 %r28, %r0, %g25
}
{
loop_mode
qpfsubs,0 %g17, %b[21], %g25
qpfadds,1 %g17, %b[21], %g17
qpfsubs,2 %b[29], %g28, %g28
stqp,5 %r2, %r0, %b[31]
}
{
loop_mode
qpfsubs,0 %g31, %b[36], %b[21]
qpfadds,1 %g31, %b[36], %g31
stqp,2 %r16, %r0, %g27
stqp,5 %r20, %r0, %b[32]
}
{
loop_mode
stqp,2 %r37, %r0, %b[27]
stqp,5 %r26, %r0, %b[20]
}
{
loop_mode
stqp,2 %r13, %r0, %g19
stqp,5 %r21, %r0, %b[25]
}
{
loop_mode
stqp,2 %r36, %r0, %g21
stqp,5 %r3, %r0, %g23
}
{
loop_mode
stqp,2 %r9, %r0, %b[33]
stqp,5 %r27, %r0, %b[24]
}
{
loop_mode
stqp,2 %r32, %r0, %g24
stqp,5 %r17, %r0, %b[19]
}
{
loop_mode
stqp,2 %r35, %r0, %b[26]
stqp,5 %r15, %r0, %g29
}
{
loop_mode
stqp,2 %r31, %r0, %g20
stqp,5 %r19, %r0, %b[34]
}
{
loop_mode
stqp,2 %r1, %r0, %b[35]
stqp,5 %r12, %r0, %g18
}
{
loop_mode
stqp,2 %r40, %r0, %g22
stqp,5 %r30, %r0, %g26
}
{
loop_mode
stqp,2 %r38, %r0, %g30
stqp,5 %r4, %r0, %b[28]
}
{
loop_mode
stqp,2 %r34, %r0, %g28
stqp,5 %r39, %r0, %g17
}
{
loop_mode
stqp,2 %r14, %r0, %g16
stqp,5 %r5, %r0, %g31
}
{
loop_mode
ct %ctpr1 ? %NOT_LOOP_END
alc alcf=1, alct=1
stqp,2 %r29, %r0, %g25
addd,3,sm %r0, _f16s,_lts0lo 0x20, %r0
stqp,5 %r11, %r0, %b[21]
}
Theoretical throughput: 64 complex numbers (512 bytes) per 115 cycles, i.e. 64/115 ≈ 0.557 complex numbers per cycle = 4.45 bytes/cycle
Fourfold (4×) theoretical throughput: 17.81 bytes/cycle
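The two figures above can be reproduced with a few lines of C. This is only a sketch of the arithmetic, assuming each complex number occupies 8 bytes (two 4-byte floats, as with `float complex`); the names `BYTES_PER_COMPLEX` and `bytes_per_cycle` are illustrative and do not come from the original code.

```c
#include <assert.h>

/* Assumption: one complex number = two 4-byte floats = 8 bytes. */
enum { BYTES_PER_COMPLEX = 8 };

/* Theoretical throughput, in bytes per cycle, of a loop body that
   processes `elems` complex numbers in `cycles` cycles. */
static double bytes_per_cycle(int elems, int cycles)
{
    assert(elems > 0 && cycles > 0);
    return (double)elems * BYTES_PER_COMPLEX / (double)cycles;
}
```

Calling `bytes_per_cycle(64, 115)` gives roughly 4.45, and multiplying that by 4 gives roughly 17.81, matching the numbers quoted above.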
Speed measurements

5. stage_radix4_2x_simd128_noConj
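Here the stage_radix4_simd128_noConj algorithm is manually unrolled by a factor of 2. Each loop iteration runs two radix-4 passes back to back: the first uses the coefC_a, coefD_a and coefE_a coefficients, and its results are kept in variables and fed directly into the second pass, which uses coefC_b, coefD_b and coefE_b. Only the results of the second pass are stored to data_out.
C code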
void stage_radix4_2x_simd128_noConj(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
__v2di *xy0_in = (__v2di*)&data_in[ 0];
__v2di *zw0_in = (__v2di*)&data_in[ 2];
__v2di *xy1_in = (__v2di*)&data_in[ 4];
__v2di *zw1_in = (__v2di*)&data_in[ 6];
__v2di *xy2_in = (__v2di*)&data_in[ 8];
__v2di *zw2_in = (__v2di*)&data_in[10];
__v2di *xy3_in = (__v2di*)&data_in[12];
__v2di *zw3_in = (__v2di*)&data_in[14];
__v2di *xy4_in = (__v2di*)&data_in[16];
__v2di *zw4_in = (__v2di*)&data_in[18];
__v2di *xy5_in = (__v2di*)&data_in[20];
__v2di *zw5_in = (__v2di*)&data_in[22];
__v2di *xy6_in = (__v2di*)&data_in[24];
__v2di *zw6_in = (__v2di*)&data_in[26];
__v2di *xy7_in = (__v2di*)&data_in[28];
__v2di *zw7_in = (__v2di*)&data_in[30];
__v2di *c0a_in = (__v2di*)&coefC_a[0];
__v2di *c1a_in = (__v2di*)&coefC_a[2];
__v2di *c2a_in = (__v2di*)&coefC_a[4];
__v2di *c3a_in = (__v2di*)&coefC_a[6];
__v2di *d0a_in = (__v2di*)&coefD_a[0];
__v2di *d1a_in = (__v2di*)&coefD_a[2];
__v2di *d2a_in = (__v2di*)&coefD_a[4];
__v2di *d3a_in = (__v2di*)&coefD_a[6];
__v2di *e0a_in = (__v2di*)&coefE_a[0];
__v2di *e1a_in = (__v2di*)&coefE_a[2];
__v2di *e2a_in = (__v2di*)&coefE_a[4];
__v2di *e3a_in = (__v2di*)&coefE_a[6];
__v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16];
__v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16];
__v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16];
__v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16];
__v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16];
__v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16];
__v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16];
__v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16];
__v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16];
__v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16];
__v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16];
__v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16];
__v2di *out_0 = (__v2di*)&data_out[ 0*data_count/16];
__v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16];
__v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16];
__v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16];
__v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16];
__v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16];
__v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16];
__v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16];
__v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16];
__v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16];
__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/32; ++i)
{
__v2di xy0 = xy0_in[16*i];
__v2di zw0 = zw0_in[16*i];
__v2di xy1 = xy1_in[16*i];
__v2di zw1 = zw1_in[16*i];
__v2di c0 = c0a_in[4*i];
__v2di d0 = d0a_in[4*i];
__v2di e0 = e0a_in[4*i];
__v2di xy2 = xy2_in[16*i];
__v2di zw2 = zw2_in[16*i];
__v2di xy3 = xy3_in[16*i];
__v2di zw3 = zw3_in[16*i];
__v2di c1 = c1a_in[4*i];
__v2di d1 = d1a_in[4*i];
__v2di e1 = e1a_in[4*i];
__v2di xy4 = xy4_in[16*i];
__v2di zw4 = zw4_in[16*i];
__v2di xy5 = xy5_in[16*i];
__v2di zw5 = zw5_in[16*i];
__v2di c2 = c2a_in[4*i];
__v2di d2 = d2a_in[4*i];
__v2di e2 = e2a_in[4*i];
__v2di xy6 = xy6_in[16*i];
__v2di zw6 = zw6_in[16*i];
__v2di xy7 = xy7_in[16*i];
__v2di zw7 = zw7_in[16*i];
__v2di c3 = c3a_in[4*i];
__v2di d3 = d3a_in[4*i];
__v2di e3 = e3a_in[4*i];
__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy0_real = __builtin_e2k_qpfmuls( c0, y0);
__v2di cy1_real = __builtin_e2k_qpfmuls( c1, y1);
__v2di cy2_real = __builtin_e2k_qpfmuls( c2, y2);
__v2di cy3_real = __builtin_e2k_qpfmuls( c3, y3);
__v2di dz0_real = __builtin_e2k_qpfmuls( d0, z0);
__v2di dz1_real = __builtin_e2k_qpfmuls( d1, z1);
__v2di dz2_real = __builtin_e2k_qpfmuls( d2, z2);
__v2di dz3_real = __builtin_e2k_qpfmuls( d3, z3);
__v2di ew0_real = __builtin_e2k_qpfmuls( e0, w0);
__v2di ew1_real = __builtin_e2k_qpfmuls( e1, w1);
__v2di ew2_real = __builtin_e2k_qpfmuls( e2, w2);
__v2di ew3_real = __builtin_e2k_qpfmuls( e3, w3);
__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
__v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
__v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
__v2di cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
__v2di cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
__v2di dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
__v2di dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
__v2di dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
__v2di dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
__v2di ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
__v2di ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
__v2di ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
__v2di ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
__v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
__v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
__v2di cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
__v2di cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
__v2di dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
__v2di dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
__v2di dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
__v2di dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
__v2di ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
__v2di ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
__v2di ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
__v2di ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);
__v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
__v2di out0 = __builtin_e2k_qpfadds(add02_0, add13_0);
__v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1);
__v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2);
__v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3);
__v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
__v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
__v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
__v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
__v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0);
__v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1);
__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
xy0 = out0;
zw0 = out1;
xy1 = out2;
zw1 = out3;
c0 = c0b_in[i];
d0 = d0b_in[i];
e0 = e0b_in[i];
xy2 = out4;
zw2 = out5;
xy3 = out6;
zw3 = out7;
c1 = c1b_in[i];
d1 = d1b_in[i];
e1 = e1b_in[i];
xy4 = out8;
zw4 = out9;
xy5 = out10;
zw5 = out11;
c2 = c2b_in[i];
d2 = d2b_in[i];
e2 = e2b_in[i];
xy6 = out12;
zw6 = out13;
xy7 = out14;
zw7 = out15;
c3 = c3b_in[i];
d3 = d3b_in[i];
e3 = e3b_in[i];
x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
cy0_real = __builtin_e2k_qpfmuls( c0, y0);
cy1_real = __builtin_e2k_qpfmuls( c1, y1);
cy2_real = __builtin_e2k_qpfmuls( c2, y2);
cy3_real = __builtin_e2k_qpfmuls( c3, y3);
dz0_real = __builtin_e2k_qpfmuls( d0, z0);
dz1_real = __builtin_e2k_qpfmuls( d1, z1);
dz2_real = __builtin_e2k_qpfmuls( d2, z2);
dz3_real = __builtin_e2k_qpfmuls( d3, z3);
ew0_real = __builtin_e2k_qpfmuls( e0, w0);
ew1_real = __builtin_e2k_qpfmuls( e1, w1);
ew2_real = __builtin_e2k_qpfmuls( e2, w2);
ew3_real = __builtin_e2k_qpfmuls( e3, w3);
cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);
cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
add02_0 = __builtin_e2k_qpfadds( x0, dz0);
add02_1 = __builtin_e2k_qpfadds( x1, dz1);
add02_2 = __builtin_e2k_qpfadds( x2, dz2);
add02_3 = __builtin_e2k_qpfadds( x3, dz3);
sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0);
out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1);
out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2);
out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3);
out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0);
out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1);
out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
}
}
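To make the vector code above easier to follow, here is a scalar sketch of the single butterfly that each pass applies to every lane. It mirrors the names used in the function (`add02`, `sub02`, `add13`, `sub13i`), but the type `cf32` and the helpers `c_add`, `c_sub`, `c_mul`, `c_mul_i` and `radix4_butterfly` are hypothetical and introduced only for illustration; this is not the author's implementation. The multiplication by i corresponds to the swap (`qpshufb`) followed by the sign flip (`qpxor` with `1LL<<31`) in the vector version.

```c
#include <assert.h>

/* Hypothetical scalar stand-in for one complex float lane. */
typedef struct { float re, im; } cf32;

static cf32 c_add(cf32 a, cf32 b) { return (cf32){ a.re + b.re, a.im + b.im }; }
static cf32 c_sub(cf32 a, cf32 b) { return (cf32){ a.re - b.re, a.im - b.im }; }

/* Ordinary complex product: (a.re + i*a.im) * (b.re + i*b.im). */
static cf32 c_mul(cf32 a, cf32 b)
{
    return (cf32){ a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
}

/* i * (re + i*im) = -im + i*re: swap the parts, negate the new real part. */
static cf32 c_mul_i(cf32 a) { return (cf32){ -a.im, a.re }; }

/* One radix-4 butterfly: inputs x, y, z, w and twiddles c, d, e
   (applied to y, z, w respectively), four results written to out[0..3]. */
static void radix4_butterfly(cf32 x, cf32 y, cf32 z, cf32 w,
                             cf32 c, cf32 d, cf32 e, cf32 out[4])
{
    assert(out != 0);
    cf32 cy = c_mul(c, y);
    cf32 dz = c_mul(d, z);
    cf32 ew = c_mul(e, w);

    cf32 add02  = c_add(x, dz);
    cf32 sub02  = c_sub(x, dz);
    cf32 add13  = c_add(cy, ew);
    cf32 sub13i = c_mul_i(c_sub(cy, ew));

    out[0] = c_add(add02, add13);   /* corresponds to out0..out3   */
    out[1] = c_sub(sub02, sub13i);  /* corresponds to out4..out7   */
    out[2] = c_sub(add02, add13);   /* corresponds to out8..out11  */
    out[3] = c_add(sub02, sub13i);  /* corresponds to out12..out15 */
}
```

With all three twiddles equal to 1, the butterfly reduces to a plain 4-point forward DFT of (x, y, z, w), which gives a convenient sanity check for the sign conventions.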
Main loop in assembly
.L15211:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=2, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=128
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=4, disp=160
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=192
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=1, abs=6, disp=224
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=4, asz=1, abs=8, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=3, asz=1, abs=10, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=2, asz=1, abs=12, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=14, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=14, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=18, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=18, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=2, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=2, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=2, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=2, abs=24, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=2, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=2, abs=28, disp=0
}
.L11903:
{
loop_mode
disp %ctpr1, .L11903
movaqp,0 area=0, ind=0, am=1, be=0, %g17
movaqp,1 area=0, ind=16, am=0, be=0, %g16
movaqp,2 area=0, ind=0, am=1, be=0, %g19
movaqp,3 area=0, ind=16, am=0, be=0, %g18
}
{
loop_mode
movaqp,0 area=1, ind=0, am=1, be=0, %g21
movaqp,1 area=1, ind=16, am=0, be=0, %g20
movaqp,2 area=1, ind=0, am=1, be=0, %g23
movaqp,3 area=1, ind=16, am=0, be=0, %g22
}
{
loop_mode
movaqp,0 area=2, ind=0, am=1, be=0, %g25
movaqp,1 area=2, ind=16, am=0, be=0, %g24
movaqp,2 area=2, ind=0, am=1, be=0, %g27
movaqp,3 area=2, ind=16, am=0, be=0, %g26
}
{
loop_mode
movaqp,0 area=3, ind=0, am=1, be=0, %g29
movaqp,1 area=3, ind=16, am=0, be=0, %g28
movaqp,2 area=3, ind=0, am=1, be=0, %g31
movaqp,3 area=3, ind=16, am=0, be=0, %g30
}
{
loop_mode
movaqp,0 area=4, ind=0, am=1, be=0, %r26
movaqp,1 area=4, ind=16, am=0, be=0, %r9
movaqp,2 area=4, ind=0, am=1, be=0, %r28
movaqp,3 area=4, ind=16, am=0, be=0, %r27
}
{
loop_mode
qpshufb,0 %g19, %g17, %r1, %r33
qpshufb,1 %g18, %g16, %r1, %r34
qpshufb,3 %g18, %g16, %r6, %g16
qpshufb,4 %g19, %g17, %r6, %g17
movaqp,0 area=5, ind=0, am=1, be=0, %r30
movaqp,1 area=5, ind=16, am=0, be=0, %r29
movaqp,2 area=5, ind=0, am=1, be=0, %r32
movaqp,3 area=5, ind=16, am=0, be=0, %r31
}
{
loop_mode
qpshufb,0 %g23, %g21, %r1, %r37
qpshufb,1 %g22, %g20, %r1, %r38
qpshufb,3 %g22, %g20, %r6, %g20
qpshufb,4 %g23, %g21, %r6, %g21
movaqp,0 area=6, ind=0, am=1, be=0, %g19
movaqp,1 area=6, ind=16, am=0, be=0, %g18
movaqp,2 area=6, ind=0, am=1, be=0, %r36
movaqp,3 area=6, ind=16, am=0, be=0, %r35
}
{
loop_mode
qpshufb,0 %g27, %g25, %r1, %g22
qpshufb,1 %g26, %g24, %r1, %g23
qpshufb,3 %g26, %g24, %r6, %g24
qpshufb,4 %g27, %g25, %r6, %g25
movaqp,0 area=8, ind=0, am=1, be=0, %r39
movaqp,1 area=7, ind=0, am=1, be=0, %g26
movaqp,2 area=8, ind=0, am=1, be=0, %r40
movaqp,3 area=7, ind=0, am=1, be=0, %g27
}
{
loop_mode
qpshufb,0 %g31, %g29, %r1, %r41
qpshufb,1 %g30, %g28, %r1, %r42
qpshufb,3 %g30, %g28, %r6, %g28
qpshufb,4 %g31, %g29, %r6, %g29
movaqp,0 area=10, ind=0, am=1, be=0, %r43
movaqp,1 area=9, ind=0, am=1, be=0, %g30
movaqp,2 area=10, ind=0, am=1, be=0, %r44
movaqp,3 area=9, ind=0, am=1, be=0, %g31
}
{
loop_mode
qpshufb,0 %r26, %r26, %r5, %r45
qpshufb,1 %r9, %r9, %r5, %r46
qpfmul_hsubs,2 %r26, %r33, %r7, %r26
qpshufb,3 %r28, %r28, %r5, %r47
qpshufb,4 %r27, %r27, %r5, %r48
movaqp,0 area=12, ind=0, am=1, be=0, %r51
movaqp,1 area=11, ind=0, am=1, be=0, %r49
movaqp,2 area=12, ind=0, am=1, be=0, %r52
movaqp,3 area=11, ind=0, am=1, be=0, %r50
}
{
loop_mode
qpfmul_hadds,0 %r46, %r37, %r7, %r37
qpfmul_hsubs,1 %r28, %g22, %r7, %r28
qpfmul_hsubs,2 %r9, %r37, %r7, %r9
qpshufb,3 %r30, %r30, %r5, %r46
qpshufb,4 %r29, %r29, %r5, %r53
qpfmul_hsubs,5 %r30, %g16, %r7, %r30
}
{
loop_mode
qpshufb,0 %g19, %g19, %r5, %r54
qpshufb,1 %g18, %g18, %r5, %r55
qpfmul_hsubs,2 %r27, %r41, %r7, %r27
qpshufb,3 %r36, %r36, %r5, %r56
qpshufb,4 %r35, %r35, %r5, %r57
qpfmul_hsubs,5 %g19, %r34, %r7, %g19
}
{
loop_mode
qpfmul_hsubs,0 %g18, %r38, %r7, %g18
qpfmul_hsubs,1 %r36, %g23, %r7, %r36
qpfmul_hsubs,2 %r35, %r42, %r7, %r35
qpfmul_hadds,3 %r47, %g22, %r7, %g22
qpfmul_hadds,4 %r48, %r41, %r7, %r41
qpfmul_hadds,5 %r56, %g23, %r7, %g23
}
{
loop_mode
qpfmul_hadds,0 %r54, %r34, %r7, %r34
qpfmul_hadds,1 %r55, %r38, %r7, %r38
qpfmul_hadds,2 %r45, %r33, %r7, %r33
qpshufb,3 %r32, %r32, %r5, %r45
qpshufb,4 %r31, %r31, %r5, %r47
qpfmul_hadds,5 %r57, %r42, %r7, %r42
}
{
loop_mode
qpfmul_hsubs,0 %r29, %g20, %r7, %r29
qpfmul_hsubs,1 %r32, %g24, %r7, %r32
qpfmul_hsubs,2 %r31, %g28, %r7, %r31
qpfmul_hadds,3 %r53, %g20, %r7, %g20
qpfmul_hadds,4 %r45, %g24, %r7, %g24
qpfmul_hadds,5 %r46, %g16, %r7, %g16
}
{
loop_mode
qpshufb,0 %r40, %r40, %r5, %r47
qpshufb,1 %g31, %g31, %r5, %r48
qpshufb,3 %g26, %g26, %r5, %r45
qpshufb,4 %r39, %r39, %r5, %r46
qpfmul_hadds,5 %r47, %g28, %r7, %g28
}
{
loop_mode
qpshufb,3 %r43, %r43, %r5, %r53
qpshufb,4 %r50, %r50, %r5, %r54
}
{
loop_mode
nop 1
qpshufb,0 %r49, %r49, %r5, %r55
qpshufb,1 %r52, %r52, %r5, %r56
qpshufb,3 %g27, %g27, %r5, %r57
qpshufb,4 %g30, %g30, %r5, %r58
}
{
loop_mode
nop 1
qpshufb,3 %r44, %r44, %r5, %r59
qpshufb,4 %r51, %r51, %r5, %r60
}
{
loop_mode
qppermb,0 %r33, %r26, %r3, %r26
qppermb,1 %r34, %g19, %r3, %g19
qppermb,3 %r41, %r27, %r3, %r27
qppermb,4 %r37, %r9, %r3, %r9
}
{
loop_mode
qppermb,0 %r38, %g18, %r3, %g18
qppermb,1 %g22, %r28, %r3, %g22
qpfsubs,2 %r26, %g19, %r33
qppermb,3 %g23, %r36, %r3, %g23
qppermb,4 %r42, %r35, %r3, %r28
}
{
loop_mode
qpfadds,0 %r26, %g19, %g19
qppermb,3 %g16, %r30, %r3, %g16
qpfadds,4 %r27, %r28, %r27
qpfsubs,5 %r27, %r28, %r34
}
{
loop_mode
qppermb,0 %g20, %r29, %r3, %g20
qppermb,1 %g24, %r32, %r3, %g24
qppermb,3 %g28, %r31, %r3, %g28
qpfadds,4 %g17, %g16, %r26
qpfsubs,5 %g17, %g16, %g16
}
{
loop_mode
qpfsubs,0 %r9, %g18, %g17
qpfadds,1 %r9, %g18, %g18
qpfsubs,2 %g21, %g20, %r9
qpfadds,3 %g29, %g28, %r28
qpfsubs,4 %g29, %g28, %g28
}
{
loop_mode
qpfadds,0 %g21, %g20, %g20
qpfsubs,1 %g25, %g24, %g21
qpfsubs,2 %g22, %g23, %g29
qpfadds,5 %g22, %g23, %g22
}
{
loop_mode
qpfadds,2 %g25, %g24, %g23
}
{
loop_mode
qpshufb,3 %r34, %r34, %r5, %g24
qpshufb,4 %r33, %r33, %r5, %g25
}
{
loop_mode
qpshufb,0 %g17, %g17, %r5, %g17
qpxor,3 %g24, %r4, %g24
qpxor,4 %g25, %r4, %g25
qpfadds,5 %r26, %g19, %r29
}
{
loop_mode
qpshufb,0 %g29, %g29, %r5, %g29
qpxor,1 %g17, %r4, %g17
qpfsubs,2 %r26, %g19, %g19
qpfadds,3 %r28, %r27, %r26
qpfsubs,4 %r28, %r27, %r27
qpfsubs,5 %g16, %g25, %r28
}
{
loop_mode
qpxor,0 %g29, %r4, %g29
qpfadds,1 %g20, %g18, %r30
qpfsubs,2 %g20, %g18, %g18
qpfadds,3 %g16, %g25, %g16
qpfsubs,4 %g28, %g24, %g20
qpfadds,5 %g28, %g24, %g24
}
{
loop_mode
qpfadds,0 %r9, %g17, %g25
qpfadds,1 %g23, %g22, %g28
qpfsubs,2 %g23, %g22, %g22
}
{
loop_mode
nop 2
qpfsubs,0 %r9, %g17, %g17
qpfsubs,1 %g21, %g29, %g23
qpfadds,2 %g21, %g29, %g21
}
{
loop_mode
qpshufb,0 %r26, %r30, %r1, %g29
qpshufb,1 %r27, %g18, %r1, %r9
}
{
loop_mode
qpshufb,0 %g28, %r29, %r1, %r31
qpshufb,1 %g22, %g19, %r1, %r32
qpfmul_hadds,2 %r55, %r9, %r7, %r33
qpshufb,3 %r27, %g18, %r6, %g18
qpshufb,4 %r26, %r30, %r6, %r26
}
{
loop_mode
qpshufb,0 %g23, %r28, %r1, %r27
qpshufb,1 %g20, %g17, %r1, %r30
qpfmul_hsubs,2 %r49, %r9, %r7, %r9
qpshufb,3 %g24, %g25, %r1, %r34
qpshufb,4 %g24, %g25, %r6, %g24
qpfmul_hadds,5 %r59, %g18, %r7, %g25
}
{
loop_mode
qpshufb,0 %g21, %g16, %r1, %r35
qpfmul_hsubs,1 %r40, %r27, %r7, %r36
qpfmul_hadds,2 %r47, %r27, %r7, %r27
qpfmul_hadds,3 %r56, %r34, %r7, %r34
qpshufb,4 %g20, %g17, %r6, %g17
qpfmul_hsubs,5 %r52, %r34, %r7, %r37
}
{
loop_mode
qpfmul_hsubs,0 %r39, %g29, %r7, %g20
qpfmul_hadds,1 %r46, %g29, %r7, %g29
qpfmul_hsubs,2 %r43, %r32, %r7, %r38
qpfmul_hsubs,3 %r44, %g18, %r7, %g18
qpfmul_hsubs,4 %g27, %r26, %r7, %g27
qpfmul_hadds,5 %r57, %r26, %r7, %r26
}
{
loop_mode
qpfmul_hadds,0 %r45, %r31, %r7, %r39
qpfmul_hadds,1 %r53, %r32, %r7, %r32
qpfmul_hsubs,2 %g26, %r31, %r7, %g26
qpfmul_hadds,3 %r58, %g17, %r7, %r40
qpfmul_hadds,4 %r60, %g24, %r7, %g24
qpfmul_hsubs,5 %r51, %g24, %r7, %r31
}
{
loop_mode
qpfmul_hadds,0 %r54, %r35, %r7, %r41
qpfmul_hadds,1 %r48, %r30, %r7, %r30
qpfmul_hsubs,2 %g31, %r30, %r7, %g31
qpshufb,3 %g22, %g19, %r6, %g19
qpshufb,4 %g28, %r29, %r6, %g22
qpfmul_hsubs,5 %g30, %g17, %r7, %g17
}
{
loop_mode
nop 4
qpshufb,0 %g23, %r28, %r6, %g23
qpshufb,1 %g21, %g16, %r6, %g16
qpfmul_hsubs,2 %r50, %r35, %r7, %g28
}
{
loop_mode
qppermb,3 %r33, %r9, %r3, %g21
qppermb,4 %r34, %r37, %r3, %g30
}
{
loop_mode
qppermb,0 %r27, %r36, %r3, %r9
qppermb,1 %g29, %g20, %r3, %g20
qppermb,3 %r26, %g27, %r3, %g27
qppermb,4 %g25, %g18, %r3, %g18
}
{
loop_mode
qppermb,0 %r39, %g26, %r3, %g25
qppermb,1 %r32, %r38, %r3, %g26
qppermb,3 %r40, %g17, %r3, %g17
qppermb,4 %g24, %r31, %r3, %g24
qpfsubs,5 %g19, %g18, %g29
}
{
loop_mode
qppermb,0 %r30, %g31, %r3, %g31
qppermb,1 %r41, %g28, %r3, %g28
qpfsubs,2 %g25, %g20, %r26
qpfadds,3 %g22, %g27, %r27
qpfadds,4 %g19, %g18, %g18
qpfsubs,5 %g22, %g27, %g19
}
{
loop_mode
qpfsubs,0 %g26, %g21, %g22
qpfsubs,1 %r9, %g31, %g27
qpfsubs,2 %g28, %g30, %r28
qpfadds,3 %g16, %g24, %r29
qpfsubs,4 %g23, %g17, %r30
qpfsubs,5 %g16, %g24, %g16
}
{
loop_mode
qpfadds,0 %g25, %g20, %g20
qpfadds,1 %g26, %g21, %g21
qpfadds,2 %r9, %g31, %g23
qpfadds,3 %g23, %g17, %g17
}
{
loop_mode
nop 1
qpfadds,0 %g28, %g30, %g24
}
{
loop_mode
qpshufb,1 %r26, %r26, %r5, %g25
}
{
loop_mode
qpshufb,0 %g22, %g22, %r5, %g22
qpshufb,1 %g27, %g27, %r5, %g26
qpfsubs,2 %r27, %g20, %g27
}
{
loop_mode
qpshufb,0 %r28, %r28, %r5, %g28
qpxor,1 %g25, %r4, %g25
qpfadds,2 %r27, %g20, %g20
}
{
loop_mode
qpxor,0 %g22, %r4, %g22
qpxor,1 %g26, %r4, %g26
qpfadds,2 %g18, %g21, %g30
qpfsubs,3 %g18, %g21, %g18
qpfadds,4 %g17, %g23, %g21
qpfsubs,5 %g17, %g23, %g17
}
{
loop_mode
qpxor,0 %g28, %r4, %g23
qpfadds,1 %r29, %g24, %g28
qpfsubs,2 %r29, %g24, %g24
}
{
loop_mode
qpfsubs,0 %g19, %g25, %g31
qpfadds,1 %g19, %g25, %g19
qpfadds,2 %g29, %g22, %g25
}
{
loop_mode
qpfsubs,0 %g29, %g22, %g22
qpfsubs,1 %r30, %g26, %g29
qpfadds,2 %r30, %g26, %g26
}
{
loop_mode
qpfsubs,0 %g16, %g23, %r9
qpfadds,1 %g16, %g23, %g16
stqp,2 %r25, %r0, %g30
stqp,5 %r23, %r0, %g27
}
{
loop_mode
stqp,2 %r2, %r0, %g20
stqp,5 %r18, %r0, %g18
}
{
loop_mode
stqp,2 %r16, %r0, %g28
stqp,5 %r19, %r0, %g17
}
{
loop_mode
stqp,2 %r22, %r0, %g21
stqp,5 %r12, %r0, %g24
}
{
loop_mode
stqp,2 %r24, %r0, %g31
stqp,5 %r17, %r0, %g19
}
{
loop_mode
stqp,2 %r15, %r0, %g22
stqp,5 %r14, %r0, %g25
}
{
loop_mode
stqp,2 %r21, %r0, %g29
stqp,5 %r11, %r0, %g26
}
{
loop_mode
ct %ctpr1 ? %NOT_LOOP_END
alc alcf=1, alct=1
addd,0,sm 0x10, %r0, %r0
stqp,2 %r20, %r0, %r9
stqp,5 %r13, %r0, %g16
}
Theoretical speed: 32 complex numbers per 62 cycles, i.e. 32 × 8 bytes / 62 cycles ≈ 4.13 bytes/cycle
Quadrupled theoretical speed: 16.52 bytes/cycle
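The arithmetic behind these two figures can be reproduced with a few lines of C. This is only an illustrative sketch: the helper names `bytes_per_cycle` and `print_stage_speed` are hypothetical, the 8 bytes per element come from the `float complex` type assumed at the start of the article, and the factor of 4 is taken directly from the "quadrupled" figure above.

```c
#include <stdio.h>

/* Bytes processed per clock cycle for a loop body that handles
 * `elements` values of `bytes_per_element` bytes in `cycles` cycles. */
static double bytes_per_cycle(int elements, int bytes_per_element, int cycles)
{
    return (double)elements * bytes_per_element / cycles;
}

/* Hypothetical helper: prints the two figures quoted in the text. */
static void print_stage_speed(void)
{
    /* float complex = two 4-byte floats = 8 bytes per element */
    double single = bytes_per_cycle(32, 8, 62);

    printf("theoretical: %.2f bytes/cycle\n", single);
    printf("quadrupled:  %.2f bytes/cycle\n", 4.0 * single);
}
```

Calling `print_stage_speed()` prints 4.13 and 16.52, matching the numbers above.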
Speed measurements

6. stage_radix4_2x_simd128_noConj_unroll2
Here the loop is unrolled by a factor of 2 using the unroll option (#pragma unroll(2)).
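As a reminder of what unrolling by 2 means, here is a minimal hand-written sketch on a toy loop. The functions `sum_plain` and `sum_unrolled2` are hypothetical and are not part of the FFT code; the sketch shows the idea only, not what lcc actually emits for the pragma.

```c
#include <stddef.h>

/* Plain loop: one element per iteration. */
static float sum_plain(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* The same loop unrolled by 2 by hand: each iteration of the main
 * loop performs the work of two original iterations, and a tail
 * step handles an odd element count. */
static float sum_unrolled2(const float *a, size_t n)
{
    float s = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s += a[i];      /* original iteration i     */
        s += a[i + 1];  /* original iteration i + 1 */
    }
    if (i < n)          /* leftover element when n is odd */
        s += a[i];
    return s;
}
```

The point of a longer loop body is to give the scheduler more independent operations to pack into wide instructions; whether this pays off for this particular stage is what the measurements are for.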
C code
void stage_radix4_2x_simd128_noConj_unroll2(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
__v2di *xy0_in = (__v2di*)&data_in[ 0];
__v2di *zw0_in = (__v2di*)&data_in[ 2];
__v2di *xy1_in = (__v2di*)&data_in[ 4];
__v2di *zw1_in = (__v2di*)&data_in[ 6];
__v2di *xy2_in = (__v2di*)&data_in[ 8];
__v2di *zw2_in = (__v2di*)&data_in[10];
__v2di *xy3_in = (__v2di*)&data_in[12];
__v2di *zw3_in = (__v2di*)&data_in[14];
__v2di *xy4_in = (__v2di*)&data_in[16];
__v2di *zw4_in = (__v2di*)&data_in[18];
__v2di *xy5_in = (__v2di*)&data_in[20];
__v2di *zw5_in = (__v2di*)&data_in[22];
__v2di *xy6_in = (__v2di*)&data_in[24];
__v2di *zw6_in = (__v2di*)&data_in[26];
__v2di *xy7_in = (__v2di*)&data_in[28];
__v2di *zw7_in = (__v2di*)&data_in[30];
__v2di *c0a_in = (__v2di*)&coefC_a[0];
__v2di *c1a_in = (__v2di*)&coefC_a[2];
__v2di *c2a_in = (__v2di*)&coefC_a[4];
__v2di *c3a_in = (__v2di*)&coefC_a[6];
__v2di *d0a_in = (__v2di*)&coefD_a[0];
__v2di *d1a_in = (__v2di*)&coefD_a[2];
__v2di *d2a_in = (__v2di*)&coefD_a[4];
__v2di *d3a_in = (__v2di*)&coefD_a[6];
__v2di *e0a_in = (__v2di*)&coefE_a[0];
__v2di *e1a_in = (__v2di*)&coefE_a[2];
__v2di *e2a_in = (__v2di*)&coefE_a[4];
__v2di *e3a_in = (__v2di*)&coefE_a[6];
__v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16];
__v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16];
__v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16];
__v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16];
__v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16];
__v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16];
__v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16];
__v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16];
__v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16];
__v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16];
__v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16];
__v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16];
__v2di *out_0 = (__v2di*)&data_out[ 0*data_count/16];
__v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16];
__v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16];
__v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16];
__v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16];
__v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16];
__v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16];
__v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16];
__v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16];
__v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16];
__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];
#pragma ivdep
#pragma unroll(2)
#pragma prefetch
for(int64_t i = 0; i < data_count/32; ++i)
{
__v2di xy0 = xy0_in[16*i];
__v2di zw0 = zw0_in[16*i];
__v2di xy1 = xy1_in[16*i];
__v2di zw1 = zw1_in[16*i];
__v2di c0 = c0a_in[4*i];
__v2di d0 = d0a_in[4*i];
__v2di e0 = e0a_in[4*i];
__v2di xy2 = xy2_in[16*i];
__v2di zw2 = zw2_in[16*i];
__v2di xy3 = xy3_in[16*i];
__v2di zw3 = zw3_in[16*i];
__v2di c1 = c1a_in[4*i];
__v2di d1 = d1a_in[4*i];
__v2di e1 = e1a_in[4*i];
__v2di xy4 = xy4_in[16*i];
__v2di zw4 = zw4_in[16*i];
__v2di xy5 = xy5_in[16*i];
__v2di zw5 = zw5_in[16*i];
__v2di c2 = c2a_in[4*i];
__v2di d2 = d2a_in[4*i];
__v2di e2 = e2a_in[4*i];
__v2di xy6 = xy6_in[16*i];
__v2di zw6 = zw6_in[16*i];
__v2di xy7 = xy7_in[16*i];
__v2di zw7 = zw7_in[16*i];
__v2di c3 = c3a_in[4*i];
__v2di d3 = d3a_in[4*i];
__v2di e3 = e3a_in[4*i];
__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
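// Swap the real and imaginary parts of each twiddle factor (needed for the imaginary part of the complex product)
__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});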
__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di cy0_real = __builtin_e2k_qpfmuls( c0, y0);
__v2di cy1_real = __builtin_e2k_qpfmuls( c1, y1);
__v2di cy2_real = __builtin_e2k_qpfmuls( c2, y2);
__v2di cy3_real = __builtin_e2k_qpfmuls( c3, y3);
__v2di dz0_real = __builtin_e2k_qpfmuls( d0, z0);
__v2di dz1_real = __builtin_e2k_qpfmuls( d1, z1);
__v2di dz2_real = __builtin_e2k_qpfmuls( d2, z2);
__v2di dz3_real = __builtin_e2k_qpfmuls( d3, z3);
__v2di ew0_real = __builtin_e2k_qpfmuls( e0, w0);
__v2di ew1_real = __builtin_e2k_qpfmuls( e1, w1);
__v2di ew2_real = __builtin_e2k_qpfmuls( e2, w2);
__v2di ew3_real = __builtin_e2k_qpfmuls( e3, w3);
__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
__v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
__v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
__v2di cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
__v2di cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
__v2di dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
__v2di dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
__v2di dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
__v2di dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
__v2di ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
__v2di ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
__v2di ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
__v2di ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
__v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
__v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
__v2di cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
__v2di cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
__v2di dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
__v2di dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
__v2di dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
__v2di dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
__v2di ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
__v2di ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
__v2di ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
__v2di ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);
__v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
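// Multiply sub13 by i: after the re/im swap, flip the sign bit of the low float in each 64-bit half
__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});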
__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
__v2di out0 = __builtin_e2k_qpfadds(add02_0, add13_0);
__v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1);
__v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2);
__v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3);
__v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
__v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
__v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
__v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
__v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0);
__v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1);
__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
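// Second radix-4 stage: the outputs of the first stage become its inputs, with twiddle factors taken from coefC_b/coefD_b/coefE_b
xy0 = out0;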
zw0 = out1;
xy1 = out2;
zw1 = out3;
c0 = c0b_in[i];
d0 = d0b_in[i];
e0 = e0b_in[i];
xy2 = out4;
zw2 = out5;
xy3 = out6;
zw3 = out7;
c1 = c1b_in[i];
d1 = d1b_in[i];
e1 = e1b_in[i];
xy4 = out8;
zw4 = out9;
xy5 = out10;
zw5 = out11;
c2 = c2b_in[i];
d2 = d2b_in[i];
e2 = e2b_in[i];
xy6 = out12;
zw6 = out13;
xy7 = out14;
zw7 = out15;
c3 = c3b_in[i];
d3 = d3b_in[i];
e3 = e3b_in[i];
x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
cy0_real = __builtin_e2k_qpfmuls( c0, y0);
cy1_real = __builtin_e2k_qpfmuls( c1, y1);
cy2_real = __builtin_e2k_qpfmuls( c2, y2);
cy3_real = __builtin_e2k_qpfmuls( c3, y3);
dz0_real = __builtin_e2k_qpfmuls( d0, z0);
dz1_real = __builtin_e2k_qpfmuls( d1, z1);
dz2_real = __builtin_e2k_qpfmuls( d2, z2);
dz3_real = __builtin_e2k_qpfmuls( d3, z3);
ew0_real = __builtin_e2k_qpfmuls( e0, w0);
ew1_real = __builtin_e2k_qpfmuls( e1, w1);
ew2_real = __builtin_e2k_qpfmuls( e2, w2);
ew3_real = __builtin_e2k_qpfmuls( e3, w3);
cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);
cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
add02_0 = __builtin_e2k_qpfadds( x0, dz0);
add02_1 = __builtin_e2k_qpfadds( x1, dz1);
add02_2 = __builtin_e2k_qpfadds( x2, dz2);
add02_3 = __builtin_e2k_qpfadds( x3, dz3);
sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0);
out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1);
out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2);
out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3);
out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0);
out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1);
out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
}
}
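For readers who want to check the arithmetic without the intrinsics, below is a scalar sketch of what one radix-4 butterfly in the function above computes for a single complex element. It is a reference model under stated assumptions, not the author's code: the names `cmul_split` and `radix4_butterfly_ref` are hypothetical, and the sketch deliberately ignores the lane layout, the deinterleaving shuffles and the memory addressing. It only mirrors the order of operations: the twiddle multiplications, the add02/sub02/add13/sub13 sums, and the multiplication of sub13 by i.

```c
#include <complex.h>

/* Complex product written the way the vector code splits it:
 * the "real" products (c * y element-wise) feed a subtraction,
 * the "imag" products (c with re/im swapped, times y) feed an addition. */
static float complex cmul_split(float complex c, float complex y)
{
    float rr = crealf(c) * crealf(y) - cimagf(c) * cimagf(y);
    float ii = cimagf(c) * crealf(y) + crealf(c) * cimagf(y);
    return rr + ii * I;
}

/* Scalar model of one radix-4 butterfly (hypothetical helper).
 * x, y, z, w are the four inputs; c, d, e are the twiddle factors
 * applied to y, z and w respectively. */
static void radix4_butterfly_ref(float complex x, float complex y,
                                 float complex z, float complex w,
                                 float complex c, float complex d,
                                 float complex e, float complex out[4])
{
    float complex cy = cmul_split(c, y);
    float complex dz = cmul_split(d, z);
    float complex ew = cmul_split(e, w);

    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;

    /* Swap re/im and negate the new real part: this is sub13 * i. */
    float complex sub13i = -cimagf(sub13) + crealf(sub13) * I;

    out[0] = add02 + add13;   /* corresponds to out0..out3   */
    out[1] = sub02 - sub13i;  /* corresponds to out4..out7   */
    out[2] = add02 - add13;   /* corresponds to out8..out11  */
    out[3] = sub02 + sub13i;  /* corresponds to out12..out15 */
}
```

With all twiddle factors equal to 1 and the inputs (0, 1, 0, 0), the sketch returns (1, -i, -1, i), which is the 4-point forward DFT of that sequence. A model like this can serve as a quick sanity check when comparing against the vectorized output.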
Main loop in assembly
.L19508:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=128
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=160
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=192
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=224
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=256
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=288
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=320
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=352
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=384
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=416
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=448
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=480
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=8, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=1, abs=10, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=12, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=1, abs=14, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=16, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=1, abs=18, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=1, incr=2, ind=0, asz=1, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=15, asz=1, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=14, asz=1, abs=22, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=13, asz=1, abs=22, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=12, asz=1, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=11, asz=1, abs=24, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=10, asz=1, abs=26, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=9, asz=1, abs=26, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=8, asz=1, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=7, asz=1, abs=28, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=6, asz=1, abs=30, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=5, asz=1, abs=30, disp=0
}
.L15504:
{
loop_mode
disp %ctpr1, .L15504
movaqp,0 area=0, ind=0, am=1, be=0, %g17
movaqp,1 area=0, ind=16, am=0, be=0, %g16
movaqp,2 area=0, ind=0, am=1, be=0, %g19
movaqp,3 area=0, ind=16, am=0, be=0, %g18
}
{
loop_mode
movaqp,0 area=1, ind=0, am=1, be=0, %g21
movaqp,1 area=1, ind=16, am=0, be=0, %g20
movaqp,2 area=1, ind=0, am=1, be=0, %g23
movaqp,3 area=1, ind=16, am=0, be=0, %g22
}
{
loop_mode
movaqp,0 area=2, ind=0, am=1, be=0, %g25
movaqp,1 area=2, ind=16, am=0, be=0, %g24
movaqp,2 area=2, ind=0, am=1, be=0, %g27
movaqp,3 area=2, ind=16, am=0, be=0, %g26
}
{
loop_mode
movaqp,0 area=3, ind=0, am=1, be=0, %g29
movaqp,1 area=3, ind=16, am=0, be=0, %g28
movaqp,2 area=3, ind=0, am=1, be=0, %g31
movaqp,3 area=3, ind=16, am=0, be=0, %g30
}
{
loop_mode
movaqp,0 area=4, ind=0, am=1, be=0, %b[20]
movaqp,1 area=4, ind=16, am=0, be=0, %b[19]
movaqp,2 area=4, ind=0, am=1, be=0, %b[22]
movaqp,3 area=4, ind=16, am=0, be=0, %b[21]
}
{
loop_mode
qpshufb,0 %g19, %g17, %r7, %b[27]
qpshufb,1 %g18, %g16, %r7, %b[28]
qpshufb,3 %g18, %g16, %r6, %g16
qpshufb,4 %g19, %g17, %r6, %g17
movaqp,0 area=5, ind=0, am=1, be=0, %b[24]
movaqp,1 area=5, ind=16, am=0, be=0, %b[23]
movaqp,2 area=5, ind=0, am=1, be=0, %b[26]
movaqp,3 area=5, ind=16, am=0, be=0, %b[25]
}
{
loop_mode
qpshufb,0 %g23, %g21, %r7, %b[31]
qpshufb,1 %g22, %g20, %r7, %b[32]
qpshufb,3 %g22, %g20, %r6, %g20
qpshufb,4 %g23, %g21, %r6, %g21
movaqp,0 area=6, ind=0, am=1, be=0, %g19
movaqp,1 area=6, ind=16, am=0, be=0, %g18
movaqp,2 area=6, ind=0, am=1, be=0, %b[30]
movaqp,3 area=6, ind=16, am=0, be=0, %b[29]
}
{
loop_mode
qpshufb,0 %g27, %g25, %r7, %b[35]
qpshufb,1 %g26, %g24, %r7, %b[36]
qpshufb,3 %g26, %g24, %r6, %g24
qpshufb,4 %g27, %g25, %r6, %g25
movaqp,0 area=7, ind=0, am=1, be=0, %g23
movaqp,1 area=7, ind=16, am=0, be=0, %g22
movaqp,2 area=7, ind=0, am=1, be=0, %b[34]
movaqp,3 area=7, ind=16, am=0, be=0, %b[33]
}
{
loop_mode
qpshufb,0 %g31, %g29, %r7, %b[39]
qpshufb,1 %g30, %g28, %r7, %b[40]
qpshufb,3 %g30, %g28, %r6, %g28
qpshufb,4 %g31, %g29, %r6, %g29
movaqp,0 area=8, ind=0, am=1, be=0, %g27
movaqp,1 area=8, ind=16, am=0, be=0, %g26
movaqp,2 area=8, ind=0, am=1, be=0, %b[38]
movaqp,3 area=8, ind=16, am=0, be=0, %b[37]
}
{
loop_mode
qpshufb,0 %b[22], %b[20], %r7, %b[43]
qpshufb,1 %b[21], %b[19], %r7, %b[44]
qpshufb,3 %b[21], %b[19], %r6, %b[19]
qpshufb,4 %b[22], %b[20], %r6, %b[20]
movaqp,0 area=9, ind=0, am=1, be=0, %g31
movaqp,1 area=9, ind=16, am=0, be=0, %g30
movaqp,2 area=9, ind=0, am=1, be=0, %b[42]
movaqp,3 area=9, ind=16, am=0, be=0, %b[41]
}
{
loop_mode
qpshufb,0 %b[26], %b[24], %r7, %b[47]
qpshufb,1 %b[25], %b[23], %r7, %b[48]
qpshufb,3 %b[25], %b[23], %r6, %b[23]
qpshufb,4 %b[26], %b[24], %r6, %b[24]
movaqp,0 area=10, ind=0, am=1, be=0, %b[22]
movaqp,1 area=10, ind=16, am=0, be=0, %b[21]
movaqp,2 area=10, ind=0, am=1, be=0, %b[46]
movaqp,3 area=10, ind=16, am=0, be=0, %b[45]
}
{
loop_mode
qpshufb,0 %b[30], %g19, %r7, %b[51]
qpshufb,1 %b[29], %g18, %r7, %b[52]
qpshufb,3 %b[29], %g18, %r6, %g18
qpshufb,4 %b[30], %g19, %r6, %g19
movaqp,0 area=11, ind=0, am=1, be=0, %b[26]
movaqp,1 area=11, ind=16, am=0, be=0, %b[25]
movaqp,2 area=11, ind=0, am=1, be=0, %b[50]
movaqp,3 area=11, ind=16, am=0, be=0, %b[49]
}
{
loop_mode
qpshufb,0 %b[34], %g23, %r7, %b[55]
qpshufb,1 %b[33], %g22, %r7, %b[56]
qpshufb,3 %b[33], %g22, %r6, %g22
qpshufb,4 %b[34], %g23, %r6, %g23
movaqp,0 area=12, ind=0, am=1, be=0, %b[30]
movaqp,1 area=12, ind=16, am=0, be=0, %b[29]
movaqp,2 area=12, ind=0, am=1, be=0, %b[54]
movaqp,3 area=12, ind=16, am=0, be=0, %b[53]
}
{
loop_mode
qpshufb,0 %g27, %g27, %r23, %b[59]
qpshufb,1 %g26, %g26, %r23, %b[60]
qpfmul_hsubs,2 %g27, %b[27], %r40, %g27
qpshufb,3 %b[38], %b[38], %r23, %b[61]
qpshufb,4 %b[37], %b[37], %r23, %b[62]
qpfmul_hsubs,5 %g26, %b[31], %r40, %g26
movaqp,0 area=13, ind=0, am=1, be=0, %b[34]
movaqp,1 area=13, ind=16, am=0, be=0, %b[33]
movaqp,2 area=13, ind=0, am=1, be=0, %b[58]
movaqp,3 area=13, ind=16, am=0, be=0, %b[57]
}
{
loop_mode
qpshufb,0 %g31, %g31, %r23, %b[63]
qpshufb,1 %g30, %g30, %r23, %b[64]
qpfmul_hsubs,2 %b[38], %b[35], %r40, %b[38]
qpshufb,3 %b[42], %b[42], %r23, %b[65]
qpshufb,4 %b[41], %b[41], %r23, %b[66]
qpfmul_hsubs,5 %b[37], %b[39], %r40, %b[37]
movaqp,0 area=14, ind=0, am=1, be=0, %b[68]
movaqp,1 area=14, ind=16, am=0, be=0, %b[67]
movaqp,2 area=14, ind=0, am=1, be=0, %b[70]
movaqp,3 area=14, ind=16, am=0, be=0, %b[69]
}
{
loop_mode
qpfmul_hsubs,0 %g31, %b[43], %r40, %g31
qpfmul_hsubs,1 %g30, %b[47], %r40, %g30
qpfmul_hsubs,2 %b[42], %b[51], %r40, %b[42]
qpfmul_hadds,3 %b[61], %b[35], %r40, %b[35]
qpfmul_hadds,4 %b[62], %b[39], %r40, %b[39]
qpfmul_hadds,5 %b[65], %b[51], %r40, %b[51]
movaqp,0 area=15, ind=0, am=1, be=0, %b[62]
movaqp,1 area=15, ind=16, am=0, be=0, %b[61]
movaqp,2 area=15, ind=0, am=1, be=0, %b[71]
movaqp,3 area=15, ind=16, am=0, be=0, %b[65]
}
{
loop_mode
qpfmul_hadds,0 %b[59], %b[27], %r40, %b[27]
qpfmul_hadds,1 %b[64], %b[47], %r40, %b[47]
qpfmul_hadds,2 %b[60], %b[31], %r40, %b[31]
qpfmul_hadds,3 %b[66], %b[55], %r40, %b[55]
qpshufb,4 %b[21], %b[21], %r23, %b[59]
qpfmul_hsubs,5 %b[41], %b[55], %r40, %b[41]
movaqp,0 area=16, ind=0, am=1, be=0, %b[64]
movaqp,1 area=16, ind=16, am=0, be=0, %b[60]
movaqp,2 area=16, ind=0, am=1, be=0, %b[72]
movaqp,3 area=16, ind=16, am=0, be=0, %b[66]
}
{
loop_mode
qpshufb,0 %b[30], %b[30], %r23, %b[73]
qpshufb,1 %b[29], %b[29], %r23, %b[74]
qpfmul_hsubs,2 %b[29], %b[32], %r40, %b[29]
qpshufb,3 %b[54], %b[54], %r23, %b[75]
qpshufb,4 %b[53], %b[53], %r23, %b[76]
qpfmul_hsubs,5 %b[30], %b[28], %r40, %b[30]
movaqp,0 area=17, ind=0, am=1, be=0, %b[78]
movaqp,1 area=17, ind=16, am=0, be=0, %b[77]
movaqp,2 area=17, ind=0, am=1, be=0, %b[80]
movaqp,3 area=17, ind=16, am=0, be=0, %b[79]
}
{
loop_mode
qpshufb,0 %b[34], %b[34], %r23, %b[81]
qpshufb,1 %b[33], %b[33], %r23, %b[82]
qpfmul_hadds,2 %b[63], %b[43], %r40, %b[43]
qpshufb,3 %b[58], %b[58], %r23, %b[83]
qpshufb,4 %b[57], %b[57], %r23, %b[84]
qpfmul_hsubs,5 %b[54], %b[36], %r40, %b[54]
movaqp,0 area=18, ind=0, am=1, be=0, %b[85]
movaqp,1 area=18, ind=16, am=0, be=0, %b[63]
movaqp,2 area=18, ind=0, am=1, be=0, %b[87]
movaqp,3 area=18, ind=16, am=0, be=0, %b[86]
}
{
loop_mode
qpfmul_hsubs,0 %b[53], %b[40], %r40, %b[53]
qpfmul_hsubs,1 %b[34], %b[44], %r40, %b[34]
qpfmul_hsubs,2 %b[33], %b[48], %r40, %b[33]
qpfmul_hsubs,3 %b[58], %b[52], %r40, %b[58]
qpfmul_hsubs,4 %b[57], %b[56], %r40, %b[57]
qpfmul_hadds,5 %b[75], %b[36], %r40, %b[36]
movaqp,0 area=19, ind=0, am=1, be=0, %b[88]
movaqp,1 area=19, ind=16, am=0, be=0, %b[75]
movaqp,2 area=19, ind=0, am=1, be=0, %b[90]
movaqp,3 area=19, ind=16, am=0, be=0, %b[89]
}
{
loop_mode
qpfmul_hadds,0 %b[73], %b[28], %r40, %b[28]
qpfmul_hadds,1 %b[74], %b[32], %r40, %b[32]
qpfmul_hadds,2 %b[82], %b[48], %r40, %b[48]
qpfmul_hadds,3 %b[84], %b[56], %r40, %b[56]
qpfmul_hadds,4 %b[83], %b[52], %r40, %b[52]
qpfmul_hadds,5 %b[76], %b[40], %r40, %b[40]
}
{
loop_mode
qpfmul_hsubs,0 %b[21], %g20, %r40, %b[21]
qpfmul_hsubs,1 %b[22], %g16, %r40, %b[73]
qpfmul_hadds,2 %b[81], %b[44], %r40, %b[44]
qpfmul_hsubs,3 %b[46], %g24, %r40, %b[74]
qpfmul_hsubs,4 %b[45], %g28, %r40, %b[76]
qpfmul_hsubs,5 %b[25], %b[23], %r40, %b[81]
}
{
loop_mode
qpfmul_hsubs,0 %b[50], %g18, %r40, %b[82]
qpfmul_hsubs,1 %b[49], %g22, %r40, %b[83]
qpfmul_hsubs,2 %b[26], %b[19], %r40, %b[59]
qpshufb,4 %b[22], %b[22], %r23, %b[22]
qpfmul_hadds,5 %b[59], %g20, %r40, %g20
}
{
loop_mode
qpshufb,0 %b[46], %b[46], %r23, %b[46]
qpshufb,1 %b[45], %b[45], %r23, %b[45]
qpshufb,3 %b[26], %b[26], %r23, %b[26]
qpshufb,4 %b[25], %b[25], %r23, %b[25]
qpfmul_hadds,5 %b[22], %g16, %r40, %g16
}
{
loop_mode
qpshufb,0 %b[50], %b[50], %r23, %b[22]
qpshufb,1 %b[49], %b[49], %r23, %b[49]
qpfmul_hadds,2 %b[45], %g28, %r40, %g28
qpfmul_hadds,3 %b[25], %b[23], %r40, %b[23]
qppermb,4 %b[39], %b[37], %r24, %b[25]
qpfmul_hadds,5 %b[26], %b[19], %r40, %b[19]
}
{
loop_mode
nop 2
qpfmul_hadds,0 %b[22], %g18, %r40, %g18
qpfmul_hadds,1 %b[49], %g22, %r40, %g22
qpfmul_hadds,2 %b[46], %g24, %r40, %g24
}
{
loop_mode
qppermb,3 %b[27], %g27, %r24, %g27
qppermb,4 %b[31], %g26, %r24, %g26
}
{
loop_mode
qppermb,0 %b[35], %b[38], %r24, %b[22]
qppermb,1 %b[51], %b[42], %r24, %b[26]
qppermb,3 %b[47], %g30, %r24, %g30
qppermb,4 %b[55], %b[41], %r24, %b[27]
}
{
loop_mode
qppermb,0 %b[32], %b[29], %r24, %b[29]
qppermb,1 %b[43], %g31, %r24, %g31
qppermb,4 %b[28], %b[30], %r24, %b[28]
}
{
loop_mode
qppermb,3 %b[40], %b[53], %r24, %b[30]
qppermb,4 %b[48], %b[33], %r24, %b[31]
qpfsubs,5 %g27, %b[28], %b[32]
}
{
loop_mode
qppermb,0 %b[56], %b[57], %r24, %b[33]
qppermb,1 %b[36], %b[54], %r24, %b[35]
qpfsubs,2 %g26, %b[29], %b[37]
qppermb,3 %b[44], %b[34], %r24, %b[34]
qppermb,4 %b[52], %b[58], %r24, %b[36]
qpfsubs,5 %b[25], %b[30], %b[38]
}
{
loop_mode
qpfsubs,0 %b[27], %b[33], %b[42]
qppermb,1 %g16, %b[73], %r24, %g16
qpfsubs,2 %b[22], %b[35], %b[39]
qpfsubs,3 %b[26], %b[36], %b[41]
qppermb,4 %g20, %b[21], %r24, %g20
qpfsubs,5 %g30, %b[31], %b[40]
}
{
loop_mode
qppermb,0 %g28, %b[76], %r24, %g28
qppermb,1 %g24, %b[74], %r24, %g24
qpfadds,2 %g27, %b[28], %g27
qppermb,3 %b[23], %b[81], %r24, %b[23]
qppermb,4 %b[19], %b[59], %r24, %b[19]
qpfsubs,5 %g31, %b[34], %b[21]
}
{
loop_mode
qpfadds,0 %g26, %b[29], %g26
qppermb,1 %g22, %b[83], %r24, %g22
qpfadds,2 %b[25], %b[30], %b[25]
qpfadds,3 %g30, %b[31], %g30
qppermb,4 %g18, %b[82], %r24, %g18
qpfsubs,5 %g21, %g20, %b[28]
}
{
loop_mode
qpfadds,0 %b[27], %b[33], %b[27]
qpfadds,1 %g17, %g16, %b[29]
qpfsubs,2 %g17, %g16, %g16
qpfadds,3 %b[22], %b[35], %g17
qpfadds,4 %g31, %b[34], %g31
qpfadds,5 %b[26], %b[36], %b[22]
}
{
loop_mode
qpfadds,0 %g21, %g20, %g20
qpfadds,1 %g29, %g28, %g21
qpfsubs,2 %g29, %g28, %g28
qpfadds,3 %b[24], %b[23], %g29
qpfsubs,4 %b[24], %b[23], %b[23]
qpfadds,5 %b[20], %b[19], %b[24]
}
{
loop_mode
qpfadds,0 %g25, %g24, %b[26]
qpfsubs,1 %g25, %g24, %g24
qpfsubs,2 %b[20], %b[19], %g25
qpfsubs,3 %g19, %g18, %g18
qpfadds,5 %g19, %g18, %b[19]
}
{
loop_mode
qpfsubs,2 %g23, %g22, %g19
qpfadds,5 %g23, %g22, %g22
}
{
loop_mode
qpfadds,0 %b[29], %g27, %g27
qpfsubs,2 %b[29], %g27, %b[20]
qpshufb,4 %b[32], %b[32], %r23, %g23
}
{
loop_mode
qpshufb,0 %b[37], %b[37], %r23, %b[29]
qpshufb,1 %b[38], %b[38], %r23, %b[30]
qpfsubs,2 %g20, %g26, %b[33]
qpshufb,3 %b[40], %b[40], %r23, %b[31]
qpshufb,4 %b[42], %b[42], %r23, %b[32]
qpfadds,5 %g29, %g30, %b[34]
}
{
loop_mode
qpfadds,0 %g20, %g26, %g20
qpshufb,1 %b[39], %b[39], %r23, %b[35]
qpfadds,2 %b[26], %g17, %g26
qpshufb,3 %b[21], %b[21], %r23, %b[21]
qpshufb,4 %b[41], %b[41], %r23, %b[36]
qpfsubs,5 %g29, %g30, %g29
}
{
loop_mode
qpxor,0 %b[29], %r22, %g30
qpxor,1 %b[30], %r22, %b[29]
qpfadds,2 %g21, %b[25], %b[31]
qpxor,3 %g23, %r22, %g23
qpxor,4 %b[31], %r22, %b[30]
qpfsubs,5 %g21, %b[25], %g21
}
{
loop_mode
qpfsubs,0 %b[26], %g17, %g17
qpxor,1 %b[35], %r22, %b[32]
qpfsubs,2 %b[24], %g31, %b[26]
qpxor,3 %b[32], %r22, %b[25]
qpxor,4 %b[21], %r22, %b[21]
qpfadds,5 %b[24], %g31, %g31
}
{
loop_mode
qpfsubs,0 %g22, %b[27], %b[35]
qpfadds,1 %b[19], %b[22], %b[36]
qpfadds,2 %g22, %b[27], %g22
qpxor,3 %b[36], %r22, %b[24]
qpfsubs,4 %b[19], %b[22], %b[19]
qpfsubs,5 %g16, %g23, %b[22]
}
{
loop_mode
qpfsubs,0 %b[28], %g30, %b[27]
qpfadds,1 %b[28], %g30, %g30
qpfadds,2 %g28, %b[29], %g23
qpfadds,3 %g16, %g23, %g16
qpfsubs,4 %b[23], %b[30], %b[28]
qpfadds,5 %b[23], %b[30], %b[23]
}
{
loop_mode
qpfsubs,0 %g28, %b[29], %g28
qpfsubs,1 %g24, %b[32], %b[25]
qpfadds,2 %g24, %b[32], %g24
qpfadds,3 %g19, %b[25], %b[30]
qpfsubs,4 %g19, %b[25], %g19
qpfsubs,5 %g25, %b[21], %b[29]
}
{
loop_mode
nop 1
qpfadds,2 %g25, %b[21], %g25
qpfsubs,3 %g18, %b[24], %g18
qpfadds,5 %g18, %b[24], %b[21]
}
{
loop_mode
qpshufb,0 %b[68], %b[68], %r23, %b[24]
qpshufb,1 %b[67], %b[67], %r23, %b[32]
qpshufb,4 %b[62], %b[62], %r23, %b[37]
}
{
loop_mode
qpshufb,0 %b[65], %b[65], %r23, %b[38]
qpshufb,1 %b[61], %b[61], %r23, %b[39]
qpshufb,3 %b[71], %b[71], %r23, %b[40]
qpshufb,4 %b[72], %b[72], %r23, %b[41]
}
{
loop_mode
qpshufb,0 %b[77], %b[77], %r23, %b[42]
qpshufb,1 %b[66], %b[66], %r23, %b[43]
qpshufb,3 %b[78], %b[78], %r23, %b[44]
qpshufb,4 %b[85], %b[85], %r23, %b[45]
}
{
loop_mode
qpshufb,0 %b[86], %b[86], %r23, %b[46]
qpshufb,1 %b[63], %b[63], %r23, %b[47]
qpshufb,3 %b[87], %b[87], %r23, %b[48]
qpshufb,4 %b[90], %b[90], %r23, %b[49]
}
{
loop_mode
qpshufb,0 %b[89], %b[89], %r23, %b[50]
qpshufb,1 %g21, %b[33], %r7, %b[51]
qpshufb,3 %g26, %g27, %r7, %b[52]
qpshufb,4 %b[31], %g20, %r7, %b[53]
}
{
loop_mode
qpshufb,0 %g17, %b[20], %r7, %b[54]
qpshufb,1 %b[35], %g29, %r7, %b[55]
qpfmul_hsubs,2 %b[85], %b[51], %r40, %b[58]
qpshufb,3 %b[36], %g31, %r7, %b[56]
qpshufb,4 %g22, %b[34], %r7, %b[57]
qpfmul_hadds,5 %b[37], %b[53], %r40, %b[37]
}
{
loop_mode
qpshufb,0 %b[19], %b[26], %r7, %b[59]
qpshufb,1 %g23, %g30, %r7, %b[73]
qpfmul_hadds,2 %b[45], %b[51], %r40, %b[45]
qpshufb,3 %g28, %b[27], %r7, %b[74]
qpshufb,4 %g24, %g16, %r7, %b[76]
qpfmul_hsubs,5 %b[62], %b[53], %r40, %b[51]
}
{
loop_mode
qpshufb,0 %g19, %b[28], %r7, %b[53]
qpshufb,1 %b[30], %b[23], %r7, %b[62]
qpfmul_hadds,2 %b[44], %b[54], %r40, %b[44]
qpshufb,3 %b[25], %b[22], %r7, %b[81]
qpshufb,4 %g18, %b[29], %r7, %b[82]
qpfmul_hadds,5 %b[24], %b[52], %r40, %b[24]
}
{
loop_mode
qpshufb,0 %b[21], %g25, %r7, %b[83]
qpfmul_hsubs,1 %b[68], %b[52], %r40, %b[52]
qpfmul_hsubs,2 %b[63], %b[55], %r40, %b[63]
qpfmul_hsubs,3 %b[67], %b[56], %r40, %b[67]
qpfmul_hadds,4 %b[32], %b[56], %r40, %b[32]
qpfmul_hadds,5 %b[39], %b[57], %r40, %b[39]
}
{
loop_mode
qpfmul_hsubs,0 %b[78], %b[54], %r40, %b[54]
qpfmul_hadds,1 %b[47], %b[55], %r40, %b[47]
qpfmul_hadds,2 %b[49], %b[73], %r40, %b[49]
qpfmul_hsubs,3 %b[61], %b[57], %r40, %b[55]
qpfmul_hsubs,4 %b[87], %b[76], %r40, %b[56]
qpfmul_hsubs,5 %b[72], %b[74], %r40, %b[57]
}
{
loop_mode
qpfmul_hsubs,0 %b[77], %b[59], %r40, %b[61]
qpfmul_hadds,1 %b[42], %b[59], %r40, %b[42]
qpfmul_hsubs,2 %b[90], %b[73], %r40, %b[59]
qpfmul_hadds,3 %b[48], %b[76], %r40, %b[48]
qpfmul_hadds,4 %b[41], %b[74], %r40, %b[41]
qpfmul_hsubs,5 %b[71], %b[81], %r40, %b[68]
}
{
loop_mode
qpfmul_hsubs,0 %b[66], %b[53], %r40, %b[66]
qpfmul_hadds,1 %b[43], %b[53], %r40, %b[43]
qpfmul_hsubs,2 %b[89], %b[62], %r40, %b[53]
qpfmul_hadds,3 %b[50], %b[62], %r40, %b[50]
qpfmul_hadds,4 %b[40], %b[81], %r40, %b[40]
qpfmul_hsubs,5 %b[65], %b[82], %r40, %b[62]
}
{
loop_mode
qpfmul_hadds,0 %b[46], %b[83], %r40, %b[46]
qpshufb,1 %b[70], %b[70], %r23, %b[71]
qpfmul_hsubs,2 %b[86], %b[83], %r40, %b[65]
qpshufb,3 %b[64], %b[64], %r23, %b[72]
qpshufb,4 %b[69], %b[69], %r23, %b[73]
qpfmul_hadds,5 %b[38], %b[82], %r40, %b[38]
}
{
loop_mode
qpshufb,0 %b[60], %b[60], %r23, %b[74]
qpshufb,1 %b[80], %b[80], %r23, %b[76]
qpshufb,3 %b[79], %b[79], %r23, %b[77]
qpshufb,4 %b[75], %b[75], %r23, %b[78]
}
{
loop_mode
nop 3
qpshufb,0 %b[88], %b[88], %r23, %b[81]
}
{
loop_mode
qpshufb,1 %b[31], %g20, %r6, %g20
qpshufb,3 %g21, %b[33], %r6, %g21
qpshufb,4 %b[35], %g29, %r6, %g29
}
{
loop_mode
qpshufb,0 %g22, %b[34], %r6, %g22
qpshufb,1 %g28, %b[27], %r6, %g28
qpfmul_hadds,2 %b[71], %g20, %r40, %b[23]
qpshufb,3 %g23, %g30, %r6, %g23
qpshufb,4 %b[30], %b[23], %r6, %g30
qpfmul_hadds,5 %b[76], %g21, %r40, %b[27]
}
{
loop_mode
qpshufb,0 %g19, %b[28], %r6, %g19
qpfmul_hsubs,1 %b[70], %g20, %r40, %g20
qpfmul_hadds,2 %b[73], %g22, %r40, %b[30]
qpfmul_hsubs,3 %b[80], %g21, %r40, %g21
qpfmul_hsubs,4 %b[79], %g29, %r40, %b[28]
qpfmul_hadds,5 %b[77], %g29, %r40, %g29
}
{
loop_mode
qpfmul_hsubs,0 %b[69], %g22, %r40, %g22
qpfmul_hadds,1 %b[72], %g28, %r40, %b[33]
qpfmul_hsubs,2 %b[64], %g28, %r40, %g28
qpfmul_hsubs,3 %b[88], %g23, %r40, %b[31]
qpfmul_hadds,4 %b[81], %g23, %r40, %g23
qpfmul_hsubs,5 %b[75], %g30, %r40, %b[34]
}
{
loop_mode
qpfmul_hadds,0 %b[74], %g19, %r40, %g19
qppermb,1 %b[45], %b[58], %r24, %b[45]
qpfmul_hsubs,2 %b[60], %g19, %r40, %b[35]
qppermb,3 %b[24], %b[52], %r24, %b[24]
qppermb,4 %b[44], %b[54], %r24, %b[44]
qpfmul_hadds,5 %b[78], %g30, %r40, %g30
}
{
loop_mode
qppermb,0 %b[37], %b[51], %r24, %b[37]
qppermb,1 %b[32], %b[67], %r24, %b[32]
qppermb,3 %b[42], %b[61], %r24, %b[42]
qppermb,4 %b[39], %b[55], %r24, %b[39]
}
{
loop_mode
qppermb,0 %b[47], %b[63], %r24, %b[47]
qppermb,1 %b[49], %b[59], %r24, %b[49]
qppermb,3 %b[40], %b[68], %r24, %b[40]
qppermb,4 %b[48], %b[56], %r24, %b[48]
}
{
loop_mode
qppermb,0 %b[43], %b[66], %r24, %b[43]
qppermb,1 %b[50], %b[53], %r24, %b[50]
qppermb,3 %b[38], %b[62], %r24, %b[38]
qppermb,4 %b[41], %b[57], %r24, %b[41]
}
{
loop_mode
qppermb,0 %b[46], %b[65], %r24, %b[46]
qpfsubs,1 %b[24], %b[37], %b[51]
qpfsubs,3 %b[44], %b[45], %b[52]
qpfsubs,4 %b[40], %b[41], %b[53]
}
{
loop_mode
qpfsubs,0 %b[42], %b[47], %b[55]
qpfsubs,1 %b[46], %b[50], %b[56]
qpfsubs,2 %b[32], %b[39], %b[54]
qpfadds,3 %b[24], %b[37], %b[24]
qpfadds,4 %b[44], %b[45], %b[37]
qpfadds,5 %b[32], %b[39], %b[32]
}
{
loop_mode
qpfadds,0 %b[48], %b[49], %b[44]
qpfadds,1 %b[42], %b[47], %b[42]
qpfsubs,2 %b[48], %b[49], %b[39]
qpfadds,3 %b[40], %b[41], %b[40]
}
{
loop_mode
qpfadds,0 %b[38], %b[43], %b[38]
qpfadds,1 %b[46], %b[50], %b[43]
qpfsubs,2 %b[38], %b[43], %b[41]
}
{
loop_mode
qpshufb,4 %g26, %g27, %r6, %g26
}
{
loop_mode
qpshufb,3 %g17, %b[20], %r6, %g17
qpshufb,4 %b[36], %g31, %r6, %g27
}
{
loop_mode
qpshufb,0 %b[19], %b[26], %r6, %g31
qpshufb,1 %g24, %g16, %r6, %g16
qpshufb,3 %b[25], %b[22], %r6, %g24
qpshufb,4 %g18, %b[29], %r6, %g18
}
{
loop_mode
qpshufb,0 %b[21], %g25, %r6, %g25
qppermb,1 %b[23], %g20, %r24, %g20
qppermb,3 %b[27], %g21, %r24, %g21
qppermb,4 %g29, %b[28], %r24, %g29
}
{
loop_mode
qppermb,0 %b[30], %g22, %r24, %g22
qppermb,1 %b[33], %g28, %r24, %g28
qpfsubs,2 %g26, %g20, %b[19]
qppermb,3 %g23, %b[31], %r24, %g23
qppermb,4 %g30, %b[34], %r24, %g30
qpfsubs,5 %g17, %g21, %b[20]
}
{
loop_mode
qppermb,0 %g19, %b[35], %r24, %g19
qpfsubs,1 %g27, %g22, %g21
qpfadds,2 %g26, %g20, %g20
qpshufb,3 %b[51], %b[51], %r23, %g26
qpshufb,4 %b[52], %b[52], %r23, %b[21]
qpfadds,5 %g17, %g21, %g17
}
{
loop_mode
qpfadds,0 %g27, %g22, %g22
qpfsubs,1 %g24, %g28, %g31
qpfadds,2 %g24, %g28, %g24
qpfadds,3 %g31, %g29, %b[22]
qpfsubs,4 %g31, %g29, %g29
qpfadds,5 %g16, %g23, %g27
}
{
loop_mode
nop 1
qpfsubs,0 %g18, %g19, %g18
qpfadds,2 %g18, %g19, %g28
qpfadds,3 %g25, %g30, %g23
qpfsubs,4 %g25, %g30, %g25
qpfsubs,5 %g16, %g23, %g16
}
{
loop_mode
qpfadds,0 %g20, %b[24], %g30
qpshufb,1 %b[54], %b[54], %r23, %g19
qpfsubs,2 %g20, %b[24], %g20
qpfadds,3 %g17, %b[37], %b[23]
qpfsubs,4 %g17, %b[37], %g17
}
{
loop_mode
qpshufb,0 %b[39], %b[39], %r23, %b[24]
qpshufb,1 %b[41], %b[41], %r23, %b[25]
qpfsubs,2 %g22, %b[32], %b[28]
qpshufb,3 %b[55], %b[55], %r23, %b[26]
qpshufb,4 %b[53], %b[53], %r23, %b[27]
qpfsubs,5 %b[22], %b[42], %b[29]
}
{
loop_mode
qpfadds,0 %g22, %b[32], %g22
qpshufb,1 %b[56], %b[56], %r23, %b[30]
qpfsubs,2 %g24, %b[40], %b[33]
qpfadds,3 %b[22], %b[42], %b[22]
qpfsubs,4 %g27, %b[44], %b[31]
qpfadds,5 %g23, %b[43], %b[32]
}
{
loop_mode
qpxor,0 %g26, %r22, %g26
qpxor,1 %g19, %r22, %g19
qpfadds,2 %g27, %b[44], %g27
qpxor,3 %b[21], %r22, %b[21]
qpxor,4 %b[26], %r22, %b[26]
qpfsubs,5 %g23, %b[43], %g23
}
{
loop_mode
qpfadds,0 %g24, %b[40], %g24
qpxor,1 %b[25], %r22, %b[25]
qpfsubs,2 %b[19], %g26, %b[34]
qpfadds,3 %g28, %b[38], %b[35]
qpfsubs,4 %g28, %b[38], %g28
qpfsubs,5 %b[20], %b[21], %b[36]
}
{
loop_mode
qpxor,0 %b[24], %r22, %b[24]
qpxor,1 %b[27], %r22, %b[27]
qpfadds,2 %b[19], %g26, %g26
qpfadds,3 %b[20], %b[21], %b[19]
qpfadds,4 %g29, %b[26], %b[20]
qpfsubs,5 %g29, %b[26], %g29
}
{
loop_mode
qpfsubs,0 %g21, %g19, %b[26]
qpxor,1 %b[30], %r22, %b[21]
qpfadds,2 %g21, %g19, %g19
stqp,5 %r16, %r0, %g17
}
{
loop_mode
qpfadds,0 %g16, %b[24], %g17
qpfsubs,1 %g18, %b[25], %g21
qpfadds,2 %g18, %b[25], %g18
stqp,5 %r20, %r0, %b[23]
}
{
loop_mode
qpfsubs,0 %g16, %b[24], %g16
qpfsubs,1 %g31, %b[27], %b[23]
qpfadds,2 %g31, %b[27], %g31
stqp,5 %r18, %r0, %g20
}
{
loop_mode
qpfadds,0 %g25, %b[21], %g20
qpfsubs,1 %g25, %b[21], %g25
stqp,2 %r2, %r0, %g30
stqp,5 %r36, %r0, %b[29]
}
{
loop_mode
stqp,2 %r25, %r0, %b[22]
stqp,5 %r32, %r0, %b[28]
}
{
loop_mode
stqp,2 %r27, %r0, %g22
stqp,5 %r3, %r0, %b[31]
}
{
loop_mode
stqp,2 %r9, %r0, %b[32]
stqp,5 %r35, %r0, %g23
}
{
loop_mode
stqp,2 %r17, %r0, %g27
stqp,5 %r13, %r0, %b[33]
}
{
loop_mode
stqp,2 %r21, %r0, %g24
stqp,5 %r31, %r0, %g28
}
{
loop_mode
stqp,2 %r15, %r0, %g26
stqp,5 %r26, %r0, %b[35]
}
{
loop_mode
stqp,2 %r19, %r0, %b[34]
stqp,5 %r34, %r0, %g19
}
{
loop_mode
stqp,2 %r30, %r0, %b[26]
stqp,5 %r39, %r0, %g18
}
{
loop_mode
stqp,2 %r5, %r0, %b[19]
stqp,5 %r29, %r0, %g21
}
{
loop_mode
stqp,2 %r11, %r0, %b[36]
stqp,5 %r4, %r0, %g17
}
{
loop_mode
stqp,2 %r38, %r0, %b[20]
stqp,5 %r14, %r0, %g16
}
{
loop_mode
stqp,2 %r28, %r0, %g29
stqp,5 %r1, %r0, %g31
}
{
loop_mode
stqp,2 %r12, %r0, %b[23]
stqp,5 %r37, %r0, %g20
}
{
loop_mode
ct %ctpr1 ? %NOT_LOOP_END
alc alcf=1, alct=1
stqp,2 %r33, %r0, %g25
addd,3,sm %r0, _f16s,_lts0lo 0x20, %r0
}
Theoretical throughput: 64 complex numbers in 106 cycles (64 × 8 bytes / 106 cycles) = 4.83 bytes/cycle
Quadruple theoretical throughput: 19.32 bytes/cycle (4 × 4.83)
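The arithmetic behind these two figures can be checked with a minimal sketch. It assumes that one complex number occupies two floats (8 bytes), as stated in the problem description; the helper name bytes_per_cycle is hypothetical and is not part of the FFT code.

```c
#include <assert.h>
#include <stddef.h>

/* One single-precision complex number: real float + imaginary float. */
#define BYTES_PER_COMPLEX (2 * sizeof(float))

/* Bytes handled per cycle when one pass of the loop body processes
   `complex_count` complex numbers in `cycles` cycles.
   Hypothetical helper, used only to illustrate the arithmetic. */
double bytes_per_cycle(size_t complex_count, size_t cycles)
{
    assert(cycles > 0);
    return (double)(complex_count * BYTES_PER_COMPLEX) / (double)cycles;
}
```

Calling bytes_per_cycle(64, 106) gives roughly 4.83, and multiplying that by 4 gives the quadruple figure of roughly 19.32.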
Speed measurements

7. stage_radix4_2x_simd128_noConj_unroll3
Here the loop is unrolled 3 times using the unroll(3) pragma.
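To make the idea of unrolling by 3 concrete, here is a hand-written analogue on a toy loop. It is only a sketch: in the real function the unrolling is left to the compiler via the pragma, the code lcc generates will look different, and the function names scale_rolled and scale_unrolled3 are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: one element per iteration. */
void scale_rolled(float *data, size_t n, float k)
{
    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; ++i)
        data[i] *= k;
}

/* The same loop unrolled by hand 3 times: three independent
   operations per iteration give the scheduler more work to pack
   into each wide instruction. */
void scale_unrolled3(float *data, size_t n, float k)
{
    assert(data != NULL || n == 0);
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        data[i + 0] *= k;
        data[i + 1] *= k;
        data[i + 2] *= k;
    }
    /* Remainder loop for the case when n is not a multiple of 3. */
    for (; i < n; ++i)
        data[i] *= k;
}
```

Both functions produce identical results; the unrolled one trades code size for fewer loop-control steps per element.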
C code
void stage_radix4_2x_simd128_noConj_unroll3(int data_count, myComplex *data_in, myComplex *data_out, myComplex *coefC_a, myComplex *coefD_a, myComplex *coefE_a, myComplex *coefC_b, myComplex *coefD_b, myComplex *coefE_b)
{
__v2di *xy0_in = (__v2di*)&data_in[ 0];
__v2di *zw0_in = (__v2di*)&data_in[ 2];
__v2di *xy1_in = (__v2di*)&data_in[ 4];
__v2di *zw1_in = (__v2di*)&data_in[ 6];
__v2di *xy2_in = (__v2di*)&data_in[ 8];
__v2di *zw2_in = (__v2di*)&data_in[10];
__v2di *xy3_in = (__v2di*)&data_in[12];
__v2di *zw3_in = (__v2di*)&data_in[14];
__v2di *xy4_in = (__v2di*)&data_in[16];
__v2di *zw4_in = (__v2di*)&data_in[18];
__v2di *xy5_in = (__v2di*)&data_in[20];
__v2di *zw5_in = (__v2di*)&data_in[22];
__v2di *xy6_in = (__v2di*)&data_in[24];
__v2di *zw6_in = (__v2di*)&data_in[26];
__v2di *xy7_in = (__v2di*)&data_in[28];
__v2di *zw7_in = (__v2di*)&data_in[30];
__v2di *c0a_in = (__v2di*)&coefC_a[0];
__v2di *c1a_in = (__v2di*)&coefC_a[2];
__v2di *c2a_in = (__v2di*)&coefC_a[4];
__v2di *c3a_in = (__v2di*)&coefC_a[6];
__v2di *d0a_in = (__v2di*)&coefD_a[0];
__v2di *d1a_in = (__v2di*)&coefD_a[2];
__v2di *d2a_in = (__v2di*)&coefD_a[4];
__v2di *d3a_in = (__v2di*)&coefD_a[6];
__v2di *e0a_in = (__v2di*)&coefE_a[0];
__v2di *e1a_in = (__v2di*)&coefE_a[2];
__v2di *e2a_in = (__v2di*)&coefE_a[4];
__v2di *e3a_in = (__v2di*)&coefE_a[6];
__v2di *c0b_in = (__v2di*)&coefC_b[0*data_count/16];
__v2di *c1b_in = (__v2di*)&coefC_b[1*data_count/16];
__v2di *c2b_in = (__v2di*)&coefC_b[2*data_count/16];
__v2di *c3b_in = (__v2di*)&coefC_b[3*data_count/16];
__v2di *d0b_in = (__v2di*)&coefD_b[0*data_count/16];
__v2di *d1b_in = (__v2di*)&coefD_b[1*data_count/16];
__v2di *d2b_in = (__v2di*)&coefD_b[2*data_count/16];
__v2di *d3b_in = (__v2di*)&coefD_b[3*data_count/16];
__v2di *e0b_in = (__v2di*)&coefE_b[0*data_count/16];
__v2di *e1b_in = (__v2di*)&coefE_b[1*data_count/16];
__v2di *e2b_in = (__v2di*)&coefE_b[2*data_count/16];
__v2di *e3b_in = (__v2di*)&coefE_b[3*data_count/16];
__v2di *out_0 = (__v2di*)&data_out[ 0*data_count/16];
__v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16];
__v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16];
__v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16];
__v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16];
__v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16];
__v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16];
__v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16];
__v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16];
__v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16];
__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];
#pragma ivdep
// Ask the compiler to unroll the loop body 3 times.
#pragma unroll(3)
#pragma prefetch
for(int64_t i = 0; i < data_count/32; ++i)
{
__v2di xy0 = xy0_in[16*i];
__v2di zw0 = zw0_in[16*i];
__v2di xy1 = xy1_in[16*i];
__v2di zw1 = zw1_in[16*i];
__v2di c0 = c0a_in[4*i];
__v2di d0 = d0a_in[4*i];
__v2di e0 = e0a_in[4*i];
__v2di xy2 = xy2_in[16*i];
__v2di zw2 = zw2_in[16*i];
__v2di xy3 = xy3_in[16*i];
__v2di zw3 = zw3_in[16*i];
__v2di c1 = c1a_in[4*i];
__v2di d1 = d1a_in[4*i];
__v2di e1 = e1a_in[4*i];
__v2di xy4 = xy4_in[16*i];
__v2di zw4 = zw4_in[16*i];
__v2di xy5 = xy5_in[16*i];
__v2di zw5 = zw5_in[16*i];
__v2di c2 = c2a_in[4*i];
__v2di d2 = d2a_in[4*i];
__v2di e2 = e2a_in[4*i];
__v2di xy6 = xy6_in[16*i];
__v2di zw6 = zw6_in[16*i];
__v2di xy7 = xy7_in[16*i];
__v2di zw7 = zw7_in[16*i];
__v2di c3 = c3a_in[4*i];
__v2di d3 = d3a_in[4*i];
__v2di e3 = e3a_in[4*i];
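// Regroup the loaded pairs: x*/z* collect the first complex number of each pair, y*/w* the second.
__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});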
__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
// Swap the real and imaginary parts inside every coefficient.
__v2di swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
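// Complex multiplication, step 1: element-wise products that are later reduced to the real (_real) and imaginary (_imag) parts.
__v2di cy0_real = __builtin_e2k_qpfmuls( c0, y0);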
__v2di cy1_real = __builtin_e2k_qpfmuls( c1, y1);
__v2di cy2_real = __builtin_e2k_qpfmuls( c2, y2);
__v2di cy3_real = __builtin_e2k_qpfmuls( c3, y3);
__v2di dz0_real = __builtin_e2k_qpfmuls( d0, z0);
__v2di dz1_real = __builtin_e2k_qpfmuls( d1, z1);
__v2di dz2_real = __builtin_e2k_qpfmuls( d2, z2);
__v2di dz3_real = __builtin_e2k_qpfmuls( d3, z3);
__v2di ew0_real = __builtin_e2k_qpfmuls( e0, w0);
__v2di ew1_real = __builtin_e2k_qpfmuls( e1, w1);
__v2di ew2_real = __builtin_e2k_qpfmuls( e2, w2);
__v2di ew3_real = __builtin_e2k_qpfmuls( e3, w3);
__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
__v2di cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
__v2di cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
__v2di cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
__v2di cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
__v2di dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
__v2di dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
__v2di dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
__v2di dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
__v2di ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
__v2di ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
__v2di ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
__v2di ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
__v2di cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
__v2di cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
__v2di cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
__v2di cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
__v2di dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
__v2di dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
__v2di dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
__v2di dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
__v2di ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
__v2di ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
__v2di ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
__v2di ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);
// Interleave the real parts (_rr) and imaginary parts (_ii) back into complex numbers.
__v2di cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
__v2di ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
// Radix-4 butterfly: sums and differences of the four branches.
__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
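// Together with the swap above, flipping the sign bit of the low float in each 64-bit half multiplies sub13 by i.
__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});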
__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
__v2di out0 = __builtin_e2k_qpfadds(add02_0, add13_0);
__v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1);
__v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2);
__v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3);
__v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
__v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
__v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
__v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
__v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0);
__v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1);
__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
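// Second radix-4 stage: the first-stage results become the inputs, and the coefC_b/coefD_b/coefE_b coefficients are loaded.
xy0 = out0;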
zw0 = out1;
xy1 = out2;
zw1 = out3;
c0 = c0b_in[i];
d0 = d0b_in[i];
e0 = e0b_in[i];
xy2 = out4;
zw2 = out5;
xy3 = out6;
zw3 = out7;
c1 = c1b_in[i];
d1 = d1b_in[i];
e1 = e1b_in[i];
xy4 = out8;
zw4 = out9;
xy5 = out10;
zw5 = out11;
c2 = c2b_in[i];
d2 = d2b_in[i];
e2 = e2b_in[i];
xy6 = out12;
zw6 = out13;
xy7 = out14;
zw7 = out15;
c3 = c3b_in[i];
d3 = d3b_in[i];
e3 = e3b_in[i];
x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
swap_c0 = __builtin_e2k_qpshufb(c0, c0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c1 = __builtin_e2k_qpshufb(c1, c1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c2 = __builtin_e2k_qpshufb(c2, c2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_c3 = __builtin_e2k_qpshufb(c3, c3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d0 = __builtin_e2k_qpshufb(d0, d0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d1 = __builtin_e2k_qpshufb(d1, d1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d2 = __builtin_e2k_qpshufb(d2, d2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_d3 = __builtin_e2k_qpshufb(d3, d3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e0 = __builtin_e2k_qpshufb(e0, e0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e1 = __builtin_e2k_qpshufb(e1, e1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e2 = __builtin_e2k_qpshufb(e2, e2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_e3 = __builtin_e2k_qpshufb(e3, e3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
cy0_real = __builtin_e2k_qpfmuls( c0, y0);
cy1_real = __builtin_e2k_qpfmuls( c1, y1);
cy2_real = __builtin_e2k_qpfmuls( c2, y2);
cy3_real = __builtin_e2k_qpfmuls( c3, y3);
dz0_real = __builtin_e2k_qpfmuls( d0, z0);
dz1_real = __builtin_e2k_qpfmuls( d1, z1);
dz2_real = __builtin_e2k_qpfmuls( d2, z2);
dz3_real = __builtin_e2k_qpfmuls( d3, z3);
ew0_real = __builtin_e2k_qpfmuls( e0, w0);
ew1_real = __builtin_e2k_qpfmuls( e1, w1);
ew2_real = __builtin_e2k_qpfmuls( e2, w2);
ew3_real = __builtin_e2k_qpfmuls( e3, w3);
cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
cy0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy0_real);
cy1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy1_real);
cy2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy2_real);
cy3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, cy3_real);
dz0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz0_real);
dz1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz1_real);
dz2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz2_real);
dz3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, dz3_real);
ew0_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew0_real);
ew1_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew1_real);
ew2_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew2_real);
ew3_rr = __builtin_e2k_qpfhsubs((__v2di){0}, ew3_real);
cy0_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy0_imag);
cy1_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy1_imag);
cy2_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy2_imag);
cy3_ii = __builtin_e2k_qpfhadds((__v2di){0}, cy3_imag);
dz0_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz0_imag);
dz1_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz1_imag);
dz2_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz2_imag);
dz3_ii = __builtin_e2k_qpfhadds((__v2di){0}, dz3_imag);
ew0_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew0_imag);
ew1_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew1_imag);
ew2_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew2_imag);
ew3_ii = __builtin_e2k_qpfhadds((__v2di){0}, ew3_imag);
cy0 = __builtin_e2k_qppermb(cy0_ii, cy0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
cy1 = __builtin_e2k_qppermb(cy1_ii, cy1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
cy2 = __builtin_e2k_qppermb(cy2_ii, cy2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
cy3 = __builtin_e2k_qppermb(cy3_ii, cy3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz0 = __builtin_e2k_qppermb(dz0_ii, dz0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz1 = __builtin_e2k_qppermb(dz1_ii, dz1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz2 = __builtin_e2k_qppermb(dz2_ii, dz2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
dz3 = __builtin_e2k_qppermb(dz3_ii, dz3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew0 = __builtin_e2k_qppermb(ew0_ii, ew0_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew1 = __builtin_e2k_qppermb(ew1_ii, ew1_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew2 = __builtin_e2k_qppermb(ew2_ii, ew2_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
ew3 = __builtin_e2k_qppermb(ew3_ii, ew3_rr, (__v2di){0x1B1A19180B0A0908, 0x1F1E1D1C0F0E0D0C});
add02_0 = __builtin_e2k_qpfadds( x0, dz0);
add02_1 = __builtin_e2k_qpfadds( x1, dz1);
add02_2 = __builtin_e2k_qpfadds( x2, dz2);
add02_3 = __builtin_e2k_qpfadds( x3, dz3);
sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0);
out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1);
out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2);
out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3);
out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0);
out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1);
out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
}
}
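As a reading aid, here is a scalar sketch of what a single complex lane of the vector code above appears to compute. It is an interpretation of the listing, not the author's code: the helper names `cmul_split` and `radix4_butterfly` are hypothetical, and the assumption is that `qpfhsubs`/`qpfhadds` combine the two partial products into the real and imaginary parts, and that the byte swap followed by `qpxor` with the sign bit amounts to multiplication by i.

```c
#include <complex.h>

/* Hypothetical helper: complex product c*y, split the way the vector code
 * splits it. The "real" products (c.re*y.re, c.im*y.im) are combined by
 * subtraction, the "imag" products (c.im*y.re, c.re*y.im) by addition. */
static float complex cmul_split(float complex c, float complex y)
{
    float rr = crealf(c) * crealf(y);   /* low lane of c * y        */
    float ii = cimagf(c) * cimagf(y);   /* high lane of c * y       */
    float ir = cimagf(c) * crealf(y);   /* low lane of swap(c) * y  */
    float ri = crealf(c) * cimagf(y);   /* high lane of swap(c) * y */
    return (rr - ii) + (ir + ri) * I;
}

/* Hypothetical helper: one radix-4 butterfly with twiddles c, d, e applied
 * to y, z, w. Variable names mirror the intrinsic listing. */
static void radix4_butterfly(float complex x, float complex y,
                             float complex z, float complex w,
                             float complex c, float complex d,
                             float complex e, float complex out[4])
{
    float complex cy = cmul_split(c, y);
    float complex dz = cmul_split(d, z);
    float complex ew = cmul_split(e, w);

    float complex add02 = x + dz;
    float complex sub02 = x - dz;
    float complex add13 = cy + ew;
    float complex sub13 = cy - ew;

    /* Swap re/im and negate the new real part: (re, im) -> (-im, re),
     * which is multiplication by i. */
    float complex sub13i = -cimagf(sub13) + crealf(sub13) * I;

    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}
```

With all three twiddles equal to 1, `radix4_butterfly` reduces to a plain 4-point forward DFT of `x, y, z, w`, which is a convenient way to check the signs. In the vector version each `out[k]` corresponds to a group of four output streams (`out_0..out_3`, `out_4..out_7`, and so on), because the loop body handles four such butterflies per iteration.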
Main loop in assembly
.L24812:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=1, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=128
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=2, disp=160
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=192
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=3, disp=224
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=256
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=4, disp=288
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=320
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=5, disp=352
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=384
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=6, disp=416
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=448
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=7, disp=480
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=8, disp=512
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=8, disp=544
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=9, disp=576
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=9, disp=608
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=10, disp=640
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=10, disp=672
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=11, disp=704
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=0, abs=11, disp=736
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=12, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=13, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=13, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=14, disp=128
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=4, asz=0, abs=14, disp=160
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=15, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=15, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=16, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=16, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=17, disp=128
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=3, asz=0, abs=17, disp=160
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=18, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=18, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=19, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=19, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=20, disp=128
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=3, ind=2, asz=0, abs=20, disp=160
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=1, incr=2, ind=0, asz=0, abs=21, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=15, asz=0, abs=21, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=14, asz=0, abs=22, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=13, asz=0, abs=22, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=12, asz=0, abs=23, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=11, asz=0, abs=23, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=10, asz=0, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=9, asz=0, abs=24, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=8, asz=0, abs=25, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=7, asz=0, abs=25, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=6, asz=0, abs=26, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=5, asz=0, abs=26, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=1, incr=2, ind=0, asz=0, abs=27, disp=32
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=15, asz=0, abs=27, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=14, asz=0, abs=28, disp=32
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=13, asz=0, abs=28, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=12, asz=0, abs=29, disp=32
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=11, asz=0, abs=29, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=10, asz=0, abs=30, disp=32
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=9, asz=0, abs=30, disp=32
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=8, asz=0, abs=31, disp=32
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=2, ind=7, asz=0, abs=31, disp=32
}
.L19801:
{
loop_mode
disp %ctpr1, .L19801
ldqp,0 %r58, %r0, %g20
addd,1,sm %r0, _f16s,_lts0lo 0x30, %g22
ldqp,2 %r55, %r0, %g21
movaqp,0 area=0, ind=0, am=1, be=0, %g17
movaqp,1 area=0, ind=16, am=0, be=0, %g16
movaqp,2 area=0, ind=0, am=1, be=0, %g19
movaqp,3 area=0, ind=16, am=0, be=0, %g18
}
{
loop_mode
ldb,0,sm %r58, %g22, %empty, mas=0x20
ldb,2,sm %r55, %g22, %empty, mas=0x20
movaqp,0 area=1, ind=0, am=1, be=0, %g24
movaqp,1 area=1, ind=16, am=0, be=0, %g23
movaqp,2 area=1, ind=0, am=1, be=0, %g26
movaqp,3 area=1, ind=16, am=0, be=0, %g25
}
{
loop_mode
movaqp,0 area=2, ind=0, am=1, be=0, %g27
movaqp,1 area=2, ind=16, am=0, be=0, %g22
movaqp,2 area=2, ind=0, am=1, be=0, %g29
movaqp,3 area=2, ind=16, am=0, be=0, %g28
}
{
loop_mode
movaqp,0 area=3, ind=0, am=1, be=0, %g31
movaqp,1 area=3, ind=16, am=0, be=0, %g30
movaqp,2 area=3, ind=0, am=1, be=0, %b[38]
movaqp,3 area=3, ind=16, am=0, be=0, %b[37]
}
{
loop_mode
movaqp,0 area=4, ind=0, am=1, be=0, %b[40]
movaqp,1 area=4, ind=16, am=0, be=0, %b[39]
movaqp,2 area=4, ind=0, am=1, be=0, %b[42]
movaqp,3 area=4, ind=16, am=0, be=0, %b[41]
}
{
loop_mode
qpshufb,0 %g19, %g17, %r6, %b[47]
qpshufb,1 %g18, %g16, %r6, %b[48]
qpshufb,3 %g18, %g16, %r23, %g16
qpshufb,4 %g19, %g17, %r23, %g17
movaqp,0 area=5, ind=0, am=1, be=0, %b[44]
movaqp,1 area=5, ind=16, am=0, be=0, %b[43]
movaqp,2 area=5, ind=0, am=1, be=0, %b[46]
movaqp,3 area=5, ind=16, am=0, be=0, %b[45]
}
{
loop_mode
qpshufb,0 %g26, %g24, %r6, %b[51]
qpshufb,1 %g25, %g23, %r6, %b[52]
qpshufb,3 %g25, %g23, %r23, %g23
qpshufb,4 %g26, %g24, %r23, %g24
movaqp,0 area=6, ind=0, am=1, be=0, %g19
movaqp,1 area=6, ind=16, am=0, be=0, %g18
movaqp,2 area=6, ind=0, am=1, be=0, %b[50]
movaqp,3 area=6, ind=16, am=0, be=0, %b[49]
}
{
loop_mode
qpshufb,0 %g29, %g27, %r6, %b[55]
qpshufb,1 %g28, %g22, %r6, %b[56]
qpshufb,3 %g28, %g22, %r23, %g22
qpshufb,4 %g29, %g27, %r23, %g27
movaqp,0 area=7, ind=0, am=1, be=0, %g26
movaqp,1 area=7, ind=16, am=0, be=0, %g25
movaqp,2 area=7, ind=0, am=1, be=0, %b[54]
movaqp,3 area=7, ind=16, am=0, be=0, %b[53]
}
{
loop_mode
qpshufb,0 %b[38], %g31, %r6, %b[59]
qpshufb,1 %b[37], %g30, %r6, %b[60]
qpshufb,3 %b[37], %g30, %r23, %g30
qpshufb,4 %b[38], %g31, %r23, %g31
movaqp,0 area=8, ind=0, am=1, be=0, %g29
movaqp,1 area=8, ind=16, am=0, be=0, %g28
movaqp,2 area=8, ind=0, am=1, be=0, %b[58]
movaqp,3 area=8, ind=16, am=0, be=0, %b[57]
}
{
loop_mode
qpshufb,0 %b[42], %b[40], %r6, %b[63]
qpshufb,1 %b[41], %b[39], %r6, %b[64]
qpshufb,3 %b[41], %b[39], %r23, %b[39]
qpshufb,4 %b[42], %b[40], %r23, %b[40]
movaqp,0 area=9, ind=0, am=1, be=0, %b[38]
movaqp,1 area=9, ind=16, am=0, be=0, %b[37]
movaqp,2 area=9, ind=0, am=1, be=0, %b[62]
movaqp,3 area=9, ind=16, am=0, be=0, %b[61]
}
{
loop_mode
qpshufb,0 %b[46], %b[44], %r6, %b[67]
qpshufb,1 %b[45], %b[43], %r6, %b[68]
qpshufb,3 %b[45], %b[43], %r23, %b[43]
qpshufb,4 %b[46], %b[44], %r23, %b[44]
movaqp,0 area=10, ind=0, am=1, be=0, %b[42]
movaqp,1 area=10, ind=16, am=0, be=0, %b[41]
movaqp,2 area=10, ind=0, am=1, be=0, %b[66]
movaqp,3 area=10, ind=16, am=0, be=0, %b[65]
}
{
loop_mode
qpshufb,0 %b[50], %g19, %r6, %b[71]
qpshufb,1 %b[49], %g18, %r6, %b[72]
qpshufb,3 %b[49], %g18, %r23, %g18
qpshufb,4 %b[50], %g19, %r23, %g19
movaqp,0 area=11, ind=0, am=1, be=0, %b[46]
movaqp,1 area=11, ind=16, am=0, be=0, %b[45]
movaqp,2 area=11, ind=0, am=1, be=0, %b[70]
movaqp,3 area=11, ind=16, am=0, be=0, %b[69]
}
{
loop_mode
qpshufb,0 %b[54], %g26, %r6, %b[75]
qpshufb,1 %b[53], %g25, %r6, %b[76]
qpshufb,3 %b[53], %g25, %r23, %g25
qpshufb,4 %b[54], %g26, %r23, %g26
movaqp,0 area=12, ind=0, am=1, be=0, %b[50]
movaqp,1 area=12, ind=16, am=0, be=0, %b[49]
movaqp,2 area=12, ind=0, am=1, be=0, %b[74]
movaqp,3 area=12, ind=16, am=0, be=0, %b[73]
}
{
loop_mode
qpshufb,0 %b[58], %g29, %r6, %b[79]
qpshufb,1 %b[57], %g28, %r6, %b[80]
qpshufb,3 %b[57], %g28, %r23, %g28
qpshufb,4 %b[58], %g29, %r23, %g29
movaqp,0 area=13, ind=0, am=1, be=0, %b[54]
movaqp,1 area=13, ind=16, am=0, be=0, %b[53]
movaqp,2 area=13, ind=0, am=1, be=0, %b[78]
movaqp,3 area=13, ind=16, am=0, be=0, %b[77]
}
{
loop_mode
qpshufb,0 %b[62], %b[38], %r6, %b[83]
qpshufb,1 %b[61], %b[37], %r6, %b[84]
qpshufb,3 %b[61], %b[37], %r23, %b[37]
qpshufb,4 %b[62], %b[38], %r23, %b[38]
movaqp,0 area=14, ind=0, am=1, be=0, %b[58]
movaqp,1 area=14, ind=16, am=0, be=0, %b[57]
movaqp,2 area=14, ind=0, am=1, be=0, %b[82]
movaqp,3 area=14, ind=16, am=0, be=0, %b[81]
}
{
loop_mode
qpshufb,0 %b[66], %b[42], %r6, %b[87]
qpshufb,1 %b[65], %b[41], %r6, %b[88]
qpshufb,3 %b[65], %b[41], %r23, %b[41]
qpshufb,4 %b[66], %b[42], %r23, %b[42]
movaqp,0 area=15, ind=0, am=1, be=0, %b[62]
movaqp,1 area=15, ind=16, am=0, be=0, %b[61]
movaqp,2 area=15, ind=0, am=1, be=0, %b[86]
movaqp,3 area=15, ind=16, am=0, be=0, %b[85]
}
{
loop_mode
qpshufb,0 %b[70], %b[46], %r6, %b[91]
qpshufb,1 %b[69], %b[45], %r6, %b[92]
qpshufb,3 %b[69], %b[45], %r23, %b[45]
qpshufb,4 %b[70], %b[46], %r23, %b[46]
movaqp,0 area=16, ind=0, am=1, be=0, %b[66]
movaqp,1 area=16, ind=16, am=0, be=0, %b[65]
movaqp,2 area=16, ind=0, am=1, be=0, %b[90]
movaqp,3 area=16, ind=16, am=0, be=0, %b[89]
}
{
loop_mode
qpshufb,0 %b[50], %b[50], %r24, %b[95]
qpshufb,1 %b[49], %b[49], %r24, %b[96]
qpfmul_hsubs,2 %b[50], %b[47], %r25, %b[50]
qpshufb,3 %b[74], %b[74], %r24, %b[97]
qpshufb,4 %b[73], %b[73], %r24, %b[98]
qpfmul_hsubs,5 %b[49], %b[51], %r25, %b[49]
movaqp,0 area=17, ind=0, am=1, be=0, %b[70]
movaqp,1 area=17, ind=16, am=0, be=0, %b[69]
movaqp,2 area=17, ind=0, am=1, be=0, %b[94]
movaqp,3 area=17, ind=16, am=0, be=0, %b[93]
}
{
loop_mode
qpshufb,0 %b[54], %b[54], %r24, %b[103]
qpshufb,1 %b[53], %b[53], %r24, %b[104]
qpfmul_hsubs,2 %b[74], %b[55], %r25, %b[74]
qpshufb,3 %b[78], %b[78], %r24, %b[105]
qpshufb,4 %b[77], %b[77], %r24, %b[106]
qpfmul_hsubs,5 %b[73], %b[59], %r25, %b[73]
movaqp,0 area=18, ind=0, am=1, be=0, %b[100]
movaqp,1 area=18, ind=16, am=0, be=0, %b[99]
movaqp,2 area=18, ind=0, am=1, be=0, %b[102]
movaqp,3 area=18, ind=16, am=0, be=0, %b[101]
}
{
loop_mode
qpshufb,0 %b[58], %b[58], %r24, %b[111]
qpshufb,1 %b[57], %b[57], %r24, %b[112]
qpfmul_hsubs,2 %b[54], %b[63], %r25, %b[54]
qpshufb,3 %b[82], %b[82], %r24, %b[113]
qpshufb,4 %b[81], %b[81], %r24, %b[114]
qpfmul_hsubs,5 %b[53], %b[67], %r25, %b[53]
movaqp,0 area=19, ind=0, am=1, be=0, %b[108]
movaqp,1 area=19, ind=16, am=0, be=0, %b[107]
movaqp,2 area=19, ind=0, am=1, be=0, %b[110]
movaqp,3 area=19, ind=16, am=0, be=0, %b[109]
}
{
loop_mode
qpfmul_hadds,0 %b[96], %b[51], %r25, %b[51]
qpfmul_hsubs,1 %b[78], %b[71], %r25, %b[78]
qpfmul_hsubs,2 %b[77], %b[75], %r25, %b[77]
qpfmul_hadds,3 %b[97], %b[55], %r25, %b[55]
qpfmul_hadds,4 %b[98], %b[59], %r25, %b[59]
qpfmul_hsubs,5 %b[58], %b[79], %r25, %b[58]
movaqp,0 area=20, ind=0, am=1, be=0, %b[116]
movaqp,1 area=20, ind=16, am=0, be=0, %b[115]
movaqp,2 area=20, ind=0, am=1, be=0, %b[118]
movaqp,3 area=20, ind=16, am=0, be=0, %b[117]
}
{
loop_mode
qpfmul_hadds,0 %b[95], %b[47], %r25, %b[47]
qpfmul_hsubs,1 %b[57], %b[83], %r25, %b[57]
qpfmul_hsubs,2 %b[82], %b[87], %r25, %b[82]
qpfmul_hadds,3 %b[105], %b[71], %r25, %b[71]
qpfmul_hadds,4 %b[106], %b[75], %r25, %b[75]
qpfmul_hsubs,5 %b[81], %b[91], %r25, %b[81]
movaqp,0 area=21, ind=0, am=1, be=0, %b[96]
movaqp,1 area=21, ind=16, am=0, be=0, %b[95]
movaqp,2 area=21, ind=0, am=1, be=0, %b[98]
movaqp,3 area=21, ind=16, am=0, be=0, %b[97]
}
{
loop_mode
qpfmul_hadds,0 %b[103], %b[63], %r25, %b[63]
qpfmul_hadds,1 %b[112], %b[83], %r25, %b[83]
qpfmul_hadds,2 %b[104], %b[67], %r25, %b[67]
qpfmul_hadds,3 %b[114], %b[91], %r25, %b[91]
qpshufb,4 %b[62], %b[62], %r24, %b[103]
qpfmul_hadds,5 %b[113], %b[87], %r25, %b[87]
movaqp,0 area=22, ind=0, am=1, be=0, %b[105]
movaqp,1 area=22, ind=16, am=0, be=0, %b[104]
movaqp,2 area=22, ind=0, am=1, be=0, %b[112]
movaqp,3 area=22, ind=16, am=0, be=0, %b[106]
}
{
loop_mode
qpshufb,0 %b[100], %b[100], %r24, %b[113]
qpshufb,1 %b[99], %b[99], %r24, %b[114]
qpfmul_hsubs,2 %b[99], %b[52], %r25, %b[99]
qpshufb,3 %b[102], %b[102], %r24, %b[119]
qpshufb,4 %b[101], %b[101], %r24, %b[120]
qpfmul_hadds,5 %b[111], %b[79], %r25, %b[79]
movaqp,0 area=23, ind=0, am=1, be=0, %b[121]
movaqp,1 area=23, ind=16, am=0, be=0, %b[111]
movaqp,2 area=23, ind=0, am=1, be=0, %b[123]
movaqp,3 area=23, ind=16, am=0, be=0, %b[122]
}
{
loop_mode
qpshufb,0 %b[108], %b[108], %r24, %b[124]
qpshufb,1 %b[107], %b[107], %r24, %b[125]
qpfmul_hsubs,2 %b[100], %b[48], %r25, %b[100]
qpshufb,3 %b[110], %b[110], %r24, %b[126]
qpshufb,4 %b[109], %b[109], %r24, %b[127]
qpfmul_hsubs,5 %b[101], %b[60], %r25, %b[101]
movaqp,0 area=24, ind=0, am=1, be=0, %b[35]
movaqp,1 area=24, ind=16, am=0, be=0, %b[36]
movaqp,2 area=24, ind=0, am=1, be=0, %b[33]
movaqp,3 area=24, ind=16, am=0, be=0, %b[34]
}
{
loop_mode
qpshufb,0 %b[116], %b[116], %r24, %b[14]
qpshufb,1 %b[115], %b[115], %r24, %b[13]
qpfmul_hsubs,2 %b[102], %b[56], %r25, %b[102]
qpshufb,3 %b[118], %b[118], %r24, %b[12]
qpshufb,4 %b[117], %b[117], %r24, %b[11]
qpfmul_hsubs,5 %b[107], %b[68], %r25, %b[107]
movaqp,0 area=25, ind=0, am=1, be=0, %b[31]
movaqp,1 area=25, ind=16, am=0, be=0, %b[32]
movaqp,2 area=25, ind=0, am=1, be=0, %b[29]
movaqp,3 area=25, ind=16, am=0, be=0, %b[30]
}
{
loop_mode
qpfmul_hsubs,0 %b[108], %b[64], %r25, %b[108]
qpfmul_hsubs,1 %b[110], %b[72], %r25, %b[110]
qpfmul_hsubs,2 %b[109], %b[76], %r25, %b[109]
qpfmul_hadds,3 %b[119], %b[56], %r25, %b[56]
qpfmul_hadds,4 %b[120], %b[60], %r25, %b[60]
qpfmul_hsubs,5 %b[115], %b[84], %r25, %b[115]
movaqp,0 area=26, ind=0, am=1, be=0, %b[27]
movaqp,1 area=26, ind=16, am=0, be=0, %b[28]
movaqp,2 area=26, ind=0, am=1, be=0, %b[25]
movaqp,3 area=26, ind=16, am=0, be=0, %b[26]
}
{
loop_mode
qpfmul_hadds,0 %b[113], %b[48], %r25, %b[48]
qpfmul_hadds,1 %b[114], %b[52], %r25, %b[52]
qpfmul_hsubs,2 %b[116], %b[80], %r25, %b[113]
qpfmul_hadds,3 %b[127], %b[76], %r25, %b[76]
qpfmul_hsubs,4 %b[118], %b[88], %r25, %b[114]
qpfmul_hsubs,5 %b[117], %b[92], %r25, %b[116]
movaqp,0 area=28, ind=0, am=1, be=0, %b[22]
movaqp,1 area=27, ind=0, am=1, be=0, %b[24]
movaqp,2 area=28, ind=0, am=1, be=0, %b[21]
movaqp,3 area=27, ind=0, am=1, be=0, %b[23]
}
{
loop_mode
qpfmul_hadds,0 %b[124], %b[64], %r25, %b[64]
qpfmul_hadds,1 %b[125], %b[68], %r25, %b[68]
qpfmul_hadds,2 %b[126], %b[72], %r25, %b[72]
qpfmul_hadds,3 %b[11], %b[92], %r25, %b[11]
qpshufb,4 %b[61], %b[61], %r24, %b[88]
qpfmul_hadds,5 %b[12], %b[88], %r25, %b[12]
movaqp,0 area=30, ind=0, am=1, be=0, %b[18]
movaqp,1 area=29, ind=0, am=1, be=0, %b[20]
movaqp,2 area=30, ind=0, am=1, be=0, %b[17]
movaqp,3 area=29, ind=0, am=1, be=0, %b[19]
}
{
loop_mode
qpfmul_hadds,0 %b[13], %b[84], %r25, %b[13]
qpshufb,1 %b[86], %b[86], %r24, %b[80]
qpfmul_hadds,2 %b[14], %b[80], %r25, %b[14]
qpshufb,3 %b[85], %b[85], %r24, %b[84]
qpshufb,4 %b[66], %b[66], %r24, %b[92]
qpfmul_hsubs,5 %b[61], %g23, %r25, %b[61]
movaqp,1 area=31, ind=0, am=1, be=0, %b[16]
movaqp,3 area=31, ind=0, am=1, be=0, %b[15]
}
{
loop_mode
qpshufb,0 %b[65], %b[65], %r24, %b[117]
qpshufb,1 %b[90], %b[90], %r24, %b[118]
qpfmul_hsubs,2 %b[62], %g16, %r25, %b[62]
qpshufb,3 %b[89], %b[89], %r24, %b[119]
qpshufb,4 %b[70], %b[70], %r24, %b[120]
qpfmul_hsubs,5 %b[86], %g22, %r25, %b[86]
}
{
loop_mode
qpshufb,0 %b[69], %b[69], %r24, %b[124]
qpshufb,1 %b[94], %b[94], %r24, %b[125]
qpfmul_hsubs,2 %b[85], %g30, %r25, %b[85]
qpshufb,3 %b[93], %b[93], %r24, %b[126]
qpfmul_hsubs,4 %b[66], %b[39], %r25, %b[66]
qpfmul_hsubs,5 %b[65], %b[43], %r25, %b[65]
}
{
loop_mode
qpfmul_hsubs,0 %b[90], %g18, %r25, %b[90]
qpfmul_hsubs,1 %b[89], %g25, %r25, %b[89]
qpfmul_hadds,2 %b[103], %g16, %r25, %g16
qpfmul_hadds,3 %b[88], %g23, %r25, %g23
qpfmul_hsubs,4 %b[70], %g28, %r25, %b[70]
qpfmul_hsubs,5 %b[69], %b[37], %r25, %b[69]
}
{
loop_mode
qpfmul_hsubs,0 %b[94], %b[41], %r25, %b[88]
qpfmul_hsubs,1 %b[93], %b[45], %r25, %b[93]
qpfmul_hadds,2 %b[80], %g22, %r25, %g22
qpfmul_hadds,3 %b[84], %g30, %r25, %g30
qpfmul_hadds,4 %b[92], %b[39], %r25, %b[39]
qpfmul_hadds,5 %b[119], %g25, %r25, %g25
}
{
loop_mode
qpfmul_hadds,0 %b[117], %b[43], %r25, %b[43]
qpfmul_hadds,1 %b[118], %g18, %r25, %g18
qpfmul_hadds,2 %b[124], %b[37], %r25, %b[37]
qpfmul_hadds,3 %b[126], %b[45], %r25, %b[45]
qppermb,4 %b[51], %b[49], %r7, %b[49]
qpfmul_hadds,5 %b[120], %g28, %r25, %g28
}
{
loop_mode
qppermb,1 %b[59], %b[73], %r7, %b[51]
qpfmul_hadds,2 %b[125], %b[41], %r25, %b[41]
qppermb,3 %b[47], %b[50], %r7, %b[47]
qppermb,4 %b[55], %b[74], %r7, %b[50]
}
{
loop_mode
qppermb,0 %b[71], %b[78], %r7, %b[55]
qppermb,1 %b[75], %b[77], %r7, %b[59]
qppermb,3 %b[63], %b[54], %r7, %b[54]
qppermb,4 %b[67], %b[53], %r7, %b[53]
}
{
loop_mode
nop 2
qppermb,0 %b[79], %b[58], %r7, %b[58]
qppermb,1 %b[83], %b[57], %r7, %b[57]
qppermb,3 %b[87], %b[82], %r7, %b[63]
}
{
loop_mode
qppermb,4 %b[91], %b[81], %r7, %b[67]
}
{
loop_mode
qppermb,0 %b[56], %b[102], %r7, %b[56]
qppermb,1 %b[60], %b[101], %r7, %b[60]
qppermb,3 %b[48], %b[100], %r7, %b[48]
qppermb,4 %b[52], %b[99], %r7, %b[52]
}
{
loop_mode
qppermb,0 %b[68], %b[107], %r7, %b[68]
qppermb,1 %b[76], %b[109], %r7, %b[71]
qpfsubs,2 %b[51], %b[60], %b[73]
qppermb,3 %b[64], %b[108], %r7, %b[64]
qppermb,4 %b[72], %b[110], %r7, %b[72]
qpfsubs,5 %b[49], %b[52], %b[74]
}
{
loop_mode
qppermb,0 %b[12], %b[114], %r7, %b[12]
qppermb,1 %b[11], %b[116], %r7, %b[11]
qpfsubs,2 %b[50], %b[56], %b[75]
qppermb,3 %b[14], %b[113], %r7, %b[14]
qppermb,4 %b[13], %b[115], %r7, %b[13]
qpfsubs,5 %b[47], %b[48], %b[76]
}
{
loop_mode
qpfsubs,0 %b[59], %b[71], %b[77]
qpfsubs,1 %b[53], %b[68], %b[78]
qpfsubs,2 %b[63], %b[12], %b[79]
qpfsubs,3 %b[55], %b[72], %b[81]
qpfsubs,4 %b[58], %b[14], %b[82]
qpfsubs,5 %b[54], %b[64], %b[80]
}
{
loop_mode
qppermb,0 %g16, %b[62], %r7, %g16
qppermb,1 %g23, %b[61], %r7, %g23
qpfsubs,2 %b[67], %b[11], %b[83]
qppermb,3 %g22, %b[86], %r7, %g22
qppermb,4 %g30, %b[85], %r7, %g30
qpfsubs,5 %b[57], %b[13], %b[84]
}
{
loop_mode
qpfadds,0 %b[51], %b[60], %b[51]
qpfadds,1 %b[47], %b[48], %b[47]
qpfadds,2 %b[49], %b[52], %b[48]
qpfadds,3 %b[50], %b[56], %b[49]
qpfadds,4 %b[59], %b[71], %b[50]
qpfadds,5 %b[54], %b[64], %b[52]
}
{
loop_mode
qppermb,0 %b[39], %b[66], %r7, %b[39]
qppermb,1 %b[43], %b[65], %r7, %b[43]
qpfadds,2 %b[53], %b[68], %b[53]
qppermb,3 %g18, %b[90], %r7, %g18
qppermb,4 %g25, %b[89], %r7, %g25
qpfadds,5 %b[55], %b[72], %b[54]
}
{
loop_mode
qpfadds,0 %b[67], %b[11], %b[11]
qpfadds,1 %g24, %g23, %b[55]
qpfsubs,2 %g17, %g16, %b[56]
qpfadds,3 %b[58], %b[14], %b[14]
qpfadds,4 %b[57], %b[13], %b[13]
qpfadds,5 %b[63], %b[12], %b[12]
}
{
loop_mode
qppermb,0 %b[37], %b[69], %r7, %b[37]
qppermb,1 %g28, %b[70], %r7, %g28
qpfsubs,2 %g24, %g23, %g23
qppermb,3 %b[45], %b[93], %r7, %b[45]
qppermb,4 %b[41], %b[88], %r7, %b[41]
qpfadds,5 %g17, %g16, %g16
}
{
loop_mode
qpfsubs,0 %g31, %g30, %g17
qpfadds,1 %g27, %g22, %g24
qpfadds,2 %g31, %g30, %g30
qpfsubs,3 %g27, %g22, %g22
qpfsubs,4 %g26, %g25, %g27
qpfadds,5 %g26, %g25, %g25
}
{
loop_mode
qpfsubs,0 %b[44], %b[43], %g26
qpfadds,1 %b[40], %b[39], %g31
qpfadds,2 %b[44], %b[43], %b[43]
qpfsubs,3 %b[40], %b[39], %b[39]
qpfsubs,4 %g19, %g18, %b[40]
qpfadds,5 %g19, %g18, %g18
}
{
loop_mode
qpfadds,0 %b[38], %b[37], %g19
qpfsubs,1 %g29, %g28, %b[44]
qpfsubs,2 %b[38], %b[37], %b[37]
qpfsubs,3 %b[46], %b[45], %b[45]
qpfadds,4 %b[42], %b[41], %b[46]
qpfadds,5 %b[46], %b[45], %b[38]
}
{
loop_mode
qpfadds,0 %b[55], %b[48], %b[41]
qpfsubs,1 %b[55], %b[48], %b[42]
qpfadds,2 %g29, %g28, %g28
qpfadds,3 %g16, %b[47], %b[48]
qpfsubs,4 %g16, %b[47], %g16
qpfsubs,5 %b[42], %b[41], %g29
}
{
loop_mode
qpfadds,0 %g30, %b[51], %b[47]
qpfadds,1 %g24, %b[49], %b[51]
qpfsubs,2 %g30, %b[51], %g30
qpfsubs,3 %g25, %b[50], %g25
qpfadds,5 %g25, %b[50], %b[55]
}
{
loop_mode
qpfsubs,0 %g24, %b[49], %g24
qpfadds,1 %b[43], %b[53], %b[49]
qpfsubs,2 %g31, %b[52], %b[50]
qpfsubs,3 %g18, %b[54], %g18
qpfadds,5 %g18, %b[54], %b[57]
}
{
loop_mode
qpfsubs,0 %b[43], %b[53], %b[43]
qpfadds,1 %g31, %b[52], %g31
qpfadds,2 %g19, %b[13], %b[52]
qpfsubs,3 %b[38], %b[11], %b[53]
qpshufb,4 %b[75], %b[75], %r24, %b[54]
qpfadds,5 %b[46], %b[12], %b[58]
}
{
loop_mode
qpfadds,0 %g28, %b[14], %b[13]
qpfsubs,1 %g28, %b[14], %g28
qpfsubs,2 %g19, %b[13], %g19
qpfsubs,3 %b[46], %b[12], %b[12]
qpshufb,4 %b[73], %b[73], %r24, %b[59]
qpfadds,5 %b[38], %b[11], %b[11]
}
{
loop_mode
qpshufb,4 %b[74], %b[74], %r24, %b[14]
}
{
loop_mode
qpshufb,4 %b[78], %b[78], %r24, %b[38]
}
{
loop_mode
qpshufb,0 %b[77], %b[77], %r24, %b[46]
qpshufb,1 %b[83], %b[83], %r24, %b[60]
qpshufb,3 %b[76], %b[76], %r24, %b[61]
qpshufb,4 %b[80], %b[80], %r24, %b[62]
}
{
loop_mode
qpshufb,0 %b[81], %b[81], %r24, %b[63]
qpshufb,1 %b[79], %b[79], %r24, %b[64]
qpshufb,3 %b[82], %b[82], %r24, %b[65]
qpshufb,4 %b[84], %b[84], %r24, %b[66]
}
{
loop_mode
qpxor,0 %b[54], %r22, %b[54]
qpxor,1 %b[59], %r22, %b[59]
qpxor,3 %b[14], %r22, %b[14]
qpxor,4 %b[38], %r22, %b[38]
}
{
loop_mode
qpxor,0 %b[46], %r22, %b[46]
qpxor,1 %b[60], %r22, %b[60]
qpfadds,2 %g17, %b[59], %b[67]
qpxor,3 %b[61], %r22, %b[61]
qpxor,4 %b[62], %r22, %b[62]
qpfadds,5 %g26, %b[38], %b[68]
}
{
loop_mode
qpxor,0 %b[63], %r22, %b[63]
qpxor,1 %b[64], %r22, %b[64]
qpfsubs,2 %g17, %b[59], %g17
qpxor,3 %b[65], %r22, %b[65]
qpxor,4 %b[66], %r22, %b[66]
qpfsubs,5 %g23, %b[14], %b[59]
}
{
loop_mode
qpfadds,0 %g27, %b[46], %b[69]
qpfsubs,1 %g22, %b[54], %b[70]
qpfadds,2 %g22, %b[54], %g22
qpfadds,3 %g23, %b[14], %g23
qpfsubs,4 %g26, %b[38], %g26
qpfsubs,5 %b[56], %b[61], %b[14]
}
{
loop_mode
qpfsubs,0 %g27, %b[46], %g27
qpfsubs,1 %b[45], %b[60], %b[38]
qpfadds,2 %g29, %b[64], %b[46]
qpfadds,3 %b[56], %b[61], %b[54]
qpfsubs,4 %b[39], %b[62], %b[56]
qpfadds,5 %b[39], %b[62], %b[39]
}
{
loop_mode
qpfadds,0 %b[45], %b[60], %b[45]
qpfadds,1 %b[40], %b[63], %b[60]
qpfsubs,2 %g29, %b[64], %g29
qpfadds,3 %b[37], %b[66], %b[37]
qpfsubs,4 %b[44], %b[65], %b[62]
qpfsubs,5 %b[37], %b[66], %b[61]
}
{
loop_mode
nop 1
qpfsubs,2 %b[40], %b[63], %b[40]
qpfadds,5 %b[44], %b[65], %b[44]
}
{
loop_mode
qpshufb,0 %b[95], %b[95], %r24, %b[63]
qpshufb,1 %g20, %g20, %r24, %b[64]
qpshufb,3 %b[96], %b[96], %r24, %b[65]
qpshufb,4 %b[104], %b[104], %r24, %b[66]
}
{
loop_mode
qpshufb,0 %b[112], %b[112], %r24, %b[71]
qpshufb,1 %b[105], %b[105], %r24, %b[72]
qpshufb,3 %b[106], %b[106], %r24, %b[73]
qpshufb,4 %b[122], %b[122], %r24, %b[74]
}
{
loop_mode
qpshufb,0 %b[35], %b[35], %r24, %b[75]
qpshufb,1 %b[123], %b[123], %r24, %b[76]
qpshufb,3 %b[36], %b[36], %r24, %b[77]
qpshufb,4 %b[32], %b[32], %r24, %b[78]
}
{
loop_mode
qpshufb,0 %b[29], %b[29], %r24, %b[79]
qpshufb,1 %b[31], %b[31], %r24, %b[80]
qpshufb,3 %b[30], %b[30], %r24, %b[81]
qpshufb,4 %b[26], %b[26], %r24, %b[82]
}
{
loop_mode
qpshufb,0 %b[25], %b[25], %r24, %b[83]
qpshufb,1 %b[24], %b[24], %r24, %b[84]
qpshufb,3 %b[21], %b[21], %r24, %b[85]
qpshufb,4 %b[22], %b[22], %r24, %b[86]
}
{
loop_mode
qpshufb,0 %b[18], %b[18], %r24, %b[87]
qpshufb,1 %b[15], %b[15], %r24, %b[88]
qpshufb,3 %b[19], %b[19], %r24, %b[89]
qpshufb,4 %b[16], %b[16], %r24, %b[90]
}
{
loop_mode
qpshufb,0 %b[51], %b[48], %r6, %b[91]
qpshufb,1 %b[47], %b[41], %r6, %b[92]
qpshufb,3 %g24, %g16, %r6, %b[93]
qpshufb,4 %g30, %b[42], %r6, %b[94]
}
{
loop_mode
qpshufb,0 %b[57], %g31, %r6, %b[99]
qpshufb,1 %b[55], %b[49], %r6, %b[100]
qpfmul_hadds,2 %b[65], %b[91], %r25, %b[65]
qpshufb,3 %g18, %b[50], %r6, %b[101]
qpshufb,4 %g25, %b[43], %r6, %b[102]
qpfmul_hadds,5 %b[75], %b[93], %r25, %b[75]
}
{
loop_mode
qpshufb,0 %b[58], %b[13], %r6, %b[103]
qpshufb,1 %b[11], %b[52], %r6, %b[107]
qpfmul_hadds,2 %b[72], %b[92], %r25, %b[72]
qpshufb,3 %b[12], %g28, %r6, %b[108]
qpshufb,4 %b[53], %g19, %r6, %b[109]
qpfmul_hadds,5 %b[80], %b[94], %r25, %b[80]
}
{
loop_mode
qpshufb,0 %g17, %b[59], %r6, %b[110]
qpshufb,1 %b[67], %g23, %r6, %b[113]
qpfmul_hsubs,2 %b[96], %b[91], %r25, %b[91]
qpshufb,3 %g27, %g26, %r6, %b[114]
qpshufb,4 %b[69], %b[68], %r6, %b[115]
qpfmul_hsubs,5 %b[35], %b[93], %r25, %b[35]
}
{
loop_mode
qpshufb,0 %b[70], %b[14], %r6, %b[93]
qpshufb,1 %g22, %b[54], %r6, %b[96]
qpfmul_hsubs,2 %b[31], %b[94], %r25, %b[31]
qpshufb,3 %b[60], %b[39], %r6, %b[116]
qpshufb,4 %b[45], %b[37], %r6, %b[117]
qpfmul_hsubs,5 %b[105], %b[92], %r25, %b[92]
}
{
loop_mode
qpshufb,0 %b[40], %b[56], %r6, %b[94]
qpshufb,1 %g29, %b[62], %r6, %b[105]
qpfmul_hsubs,2 %b[32], %b[102], %r25, %b[32]
qpshufb,3 %b[38], %b[61], %r6, %b[118]
qpshufb,4 %b[46], %b[44], %r6, %b[119]
qpfmul_hadds,5 %b[78], %b[102], %r25, %b[78]
}
{
loop_mode
qpfmul_hsubs,0 %b[95], %b[99], %r25, %b[95]
qpfmul_hsubs,1 %b[36], %b[101], %r25, %b[36]
qpfmul_hsubs,2 %b[104], %b[100], %r25, %b[102]
qpfmul_hadds,3 %b[63], %b[99], %r25, %b[63]
qpfmul_hadds,4 %b[77], %b[101], %r25, %b[77]
qpfmul_hadds,5 %b[66], %b[100], %r25, %b[66]
}
{
loop_mode
qpfmul_hsubs,0 %b[18], %b[108], %r25, %b[18]
qpfmul_hsubs,1 %b[22], %b[107], %r25, %b[22]
qpfmul_hsubs,2 %b[16], %b[109], %r25, %b[16]
qpfmul_hadds,3 %b[87], %b[108], %r25, %b[87]
qpfmul_hadds,4 %b[86], %b[107], %r25, %b[86]
qpfmul_hadds,5 %b[90], %b[109], %r25, %b[90]
}
{
loop_mode
qpfmul_hsubs,0 %b[24], %b[103], %r25, %b[24]
qpfmul_hadds,1 %b[84], %b[103], %r25, %b[84]
qpfmul_hsubs,2 %b[25], %b[113], %r25, %b[25]
qpfmul_hadds,3 %b[83], %b[113], %r25, %b[83]
qpfmul_hsubs,4 %b[26], %b[115], %r25, %b[26]
qpfmul_hadds,5 %b[82], %b[115], %r25, %b[82]
}
{
loop_mode
qpfmul_hsubs,0 %b[29], %b[96], %r25, %b[29]
qpfmul_hsubs,1 %b[123], %b[110], %r25, %b[99]
qpfmul_hadds,2 %b[79], %b[96], %r25, %b[79]
qpfmul_hadds,3 %b[76], %b[110], %r25, %b[76]
qpfmul_hsubs,4 %b[122], %b[114], %r25, %b[96]
qpfmul_hadds,5 %b[74], %b[114], %r25, %b[74]
}
{
loop_mode
qpfmul_hsubs,0 %b[112], %b[93], %r25, %b[100]
qpfmul_hadds,1 %b[71], %b[93], %r25, %b[71]
qpfmul_hsubs,2 %b[30], %b[116], %r25, %b[30]
qpfmul_hadds,3 %b[81], %b[116], %r25, %b[81]
qpfmul_hsubs,4 %g20, %b[117], %r25, %g20
qpfmul_hadds,5 %b[64], %b[117], %r25, %b[64]
}
{
loop_mode
qpfmul_hsubs,0 %b[106], %b[94], %r25, %b[93]
qpfmul_hadds,1 %b[73], %b[94], %r25, %b[73]
qpfmul_hsubs,2 %b[15], %b[119], %r25, %b[15]
qpfmul_hsubs,3 %b[19], %b[118], %r25, %b[19]
qpfmul_hadds,4 %b[88], %b[119], %r25, %b[88]
qpfmul_hadds,5 %b[89], %b[118], %r25, %b[89]
}
{
loop_mode
nop 5
qpfmul_hadds,0 %b[85], %b[105], %r25, %b[85]
qpfmul_hsubs,2 %b[21], %b[105], %r25, %b[21]
}
{
loop_mode
qpshufb,1 %b[97], %b[97], %r24, %b[94]
qpshufb,3 %g21, %g21, %r24, %b[101]
qpshufb,4 %b[98], %b[98], %r24, %b[103]
}
{
loop_mode
qpshufb,0 %b[121], %b[121], %r24, %b[104]
qpshufb,1 %b[111], %b[111], %r24, %b[105]
qpshufb,3 %b[34], %b[34], %r24, %b[106]
qpshufb,4 %b[33], %b[33], %r24, %b[107]
}
{
loop_mode
qpshufb,0 %b[27], %b[27], %r24, %b[108]
qpshufb,1 %b[28], %b[28], %r24, %b[109]
qpshufb,3 %b[23], %b[23], %r24, %b[110]
qpshufb,4 %b[20], %b[20], %r24, %b[112]
}
{
loop_mode
qpshufb,0 %b[17], %b[17], %r24, %b[113]
qpshufb,1 %b[47], %b[41], %r23, %b[41]
qpshufb,3 %g30, %b[42], %r23, %g30
qpshufb,4 %b[55], %b[49], %r23, %b[42]
}
{
loop_mode
qpshufb,0 %g25, %b[43], %r23, %g25
qpshufb,1 %b[11], %b[52], %r23, %b[11]
qpfmul_hsubs,2 %b[98], %b[41], %r25, %b[43]
qpshufb,3 %b[53], %g19, %r23, %g19
qpshufb,4 %g17, %b[59], %r23, %g17
qpfmul_hsubs,5 %b[33], %g30, %r25, %b[33]
}
{
loop_mode
qpshufb,0 %b[67], %g23, %r23, %g23
qpshufb,1 %g27, %g26, %r23, %g26
qpfmul_hadds,2 %b[106], %g25, %r25, %b[47]
qpshufb,3 %b[69], %b[68], %r23, %g27
qpshufb,4 %b[38], %b[61], %r23, %b[38]
qpfmul_hsubs,5 %b[17], %g19, %r25, %b[17]
}
{
loop_mode
qpshufb,0 %b[45], %b[37], %r23, %b[37]
qpfmul_hadds,1 %b[103], %b[41], %r25, %b[41]
qpfmul_hsubs,2 %b[34], %g25, %r25, %g25
qpfmul_hadds,3 %b[107], %g30, %r25, %g30
qpfmul_hsubs,4 %b[97], %b[42], %r25, %b[34]
qpfmul_hadds,5 %b[94], %b[42], %r25, %b[42]
}
{
loop_mode
qpfmul_hadds,0 %b[110], %b[11], %r25, %b[45]
qpfmul_hsubs,1 %b[23], %b[11], %r25, %b[11]
qpfmul_hadds,2 %b[108], %g23, %r25, %b[49]
qpfmul_hsubs,3 %b[121], %g17, %r25, %b[23]
qpfmul_hadds,4 %b[104], %g17, %r25, %g17
qpfmul_hadds,5 %b[113], %g19, %r25, %g19
}
{
loop_mode
qpfmul_hsubs,0 %b[111], %g26, %r25, %b[52]
qpfmul_hadds,1 %b[105], %g26, %r25, %g26
qpfmul_hsubs,2 %b[27], %g23, %r25, %g23
qpfmul_hsubs,3 %b[28], %g27, %r25, %b[28]
qpfmul_hadds,4 %b[109], %g27, %r25, %g27
qpfmul_hsubs,5 %b[20], %b[38], %r25, %b[20]
}
{
loop_mode
qpfmul_hadds,0 %b[101], %b[37], %r25, %b[27]
qppermb,1 %b[75], %b[35], %r7, %b[35]
qpfmul_hsubs,2 %g21, %b[37], %r25, %g21
qppermb,3 %b[80], %b[31], %r7, %b[31]
qppermb,4 %b[65], %b[91], %r7, %b[38]
qpfmul_hadds,5 %b[112], %b[38], %r25, %b[37]
}
{
loop_mode
qppermb,0 %b[72], %b[92], %r7, %b[53]
qppermb,1 %b[63], %b[95], %r7, %b[55]
qppermb,3 %b[66], %b[102], %r7, %b[59]
qppermb,4 %b[78], %b[32], %r7, %b[32]
}
{
loop_mode
qppermb,0 %b[90], %b[16], %r7, %b[16]
qppermb,1 %b[77], %b[36], %r7, %b[36]
qppermb,3 %b[87], %b[18], %r7, %b[18]
qppermb,4 %b[84], %b[24], %r7, %b[24]
}
{
loop_mode
qppermb,0 %b[86], %b[22], %r7, %b[22]
}
{
loop_mode
qpfsubs,1 %b[38], %b[53], %b[61]
qpfsubs,3 %b[35], %b[31], %b[63]
qpfadds,4 %b[35], %b[31], %b[31]
}
{
loop_mode
qpfsubs,0 %b[36], %b[32], %b[65]
qpfadds,1 %b[38], %b[53], %b[38]
qpfsubs,2 %b[55], %b[59], %b[35]
qpfadds,3 %b[55], %b[59], %b[53]
}
{
loop_mode
qpfsubs,0 %b[24], %b[22], %b[59]
qpfadds,1 %b[36], %b[32], %b[32]
qpfsubs,2 %b[18], %b[16], %b[55]
qpfadds,3 %b[18], %b[16], %b[16]
}
{
loop_mode
qpfadds,0 %b[24], %b[22], %b[22]
qppermb,4 %b[83], %b[25], %r7, %b[18]
}
{
loop_mode
qppermb,4 %b[79], %b[29], %r7, %b[24]
}
{
loop_mode
qppermb,1 %b[76], %b[99], %r7, %b[25]
qppermb,3 %b[71], %b[100], %r7, %b[29]
qppermb,4 %b[74], %b[96], %r7, %b[36]
qpfsubs,5 %b[24], %b[18], %b[66]
}
{
loop_mode
qppermb,0 %b[82], %b[26], %r7, %b[26]
qppermb,1 %b[64], %g20, %r7, %g20
qppermb,3 %b[81], %b[30], %r7, %b[30]
qppermb,4 %b[85], %b[21], %r7, %b[21]
qpfadds,5 %b[24], %b[18], %b[18]
}
{
loop_mode
qppermb,0 %b[88], %b[15], %r7, %b[15]
qppermb,1 %b[89], %b[19], %r7, %b[19]
qppermb,3 %b[73], %b[93], %r7, %b[24]
qpshufb,4 %b[51], %b[48], %r23, %b[48]
}
{
loop_mode
qpshufb,0 %g24, %g16, %r23, %g16
qpshufb,1 %g18, %b[50], %r23, %g18
qpfsubs,2 %b[15], %g20, %b[64]
qpshufb,3 %b[57], %g31, %r23, %g24
qpshufb,4 %b[58], %b[13], %r23, %g31
qpfsubs,5 %b[24], %b[36], %b[51]
}
{
loop_mode
qpshufb,0 %b[12], %g28, %r23, %g28
qpshufb,1 %g22, %b[54], %r23, %g22
qpfsubs,2 %b[29], %b[25], %b[13]
qpshufb,3 %b[70], %b[14], %r23, %b[12]
qpshufb,4 %b[40], %b[56], %r23, %b[14]
qpfadds,5 %b[29], %b[25], %b[25]
}
{
loop_mode
qpfsubs,0 %b[21], %b[19], %b[40]
qpshufb,1 %b[60], %b[39], %r23, %b[39]
qpfsubs,2 %b[30], %b[26], %b[29]
qpshufb,3 %b[46], %b[44], %r23, %b[44]
qppermb,4 %b[41], %b[43], %r7, %b[41]
qpfadds,5 %b[30], %b[26], %b[26]
}
{
loop_mode
qppermb,0 %b[47], %g25, %r7, %g25
qpshufb,1 %g29, %b[62], %r23, %g29
qpfadds,2 %b[15], %g20, %g20
qppermb,3 %g30, %b[33], %r7, %g30
qppermb,4 %b[42], %b[34], %r7, %b[30]
qpfadds,5 %b[24], %b[36], %b[15]
}
{
loop_mode
qpfadds,0 %b[21], %b[19], %b[17]
qppermb,1 %b[45], %b[11], %r7, %b[11]
qpfadds,2 %g18, %g25, %b[21]
qppermb,3 %g17, %b[23], %r7, %g17
qppermb,4 %g19, %b[17], %r7, %g19
qpfsubs,5 %b[48], %b[41], %b[19]
}
{
loop_mode
qppermb,0 %b[49], %g23, %r7, %g23
qppermb,1 %g26, %b[52], %r7, %g26
qpfsubs,2 %g18, %g25, %g18
qppermb,3 %g27, %b[28], %r7, %g27
qppermb,4 %b[37], %b[20], %r7, %b[20]
qpfadds,5 %b[48], %b[41], %b[23]
}
{
loop_mode
qpfadds,0 %g31, %b[11], %b[27]
qppermb,1 %b[27], %g21, %r7, %g21
qpfsubs,2 %g31, %b[11], %g31
qpfsubs,3 %g24, %b[30], %g25
qpfsubs,4 %g16, %g30, %b[24]
qpfadds,5 %g24, %b[30], %g24
}
{
loop_mode
qpfadds,0 %g16, %g30, %g16
qpfsubs,1 %g22, %g23, %b[12]
qpfadds,2 %b[14], %g26, %b[28]
qpfadds,3 %b[12], %g17, %g30
qpfsubs,4 %b[12], %g17, %g17
qpfsubs,5 %b[39], %g27, %b[11]
}
{
loop_mode
qpfadds,0 %g28, %g19, %b[30]
qpfsubs,1 %g28, %g19, %g19
qpfsubs,2 %b[14], %g26, %g26
qpfadds,3 %b[39], %g27, %g27
qpfsubs,4 %g29, %b[20], %g29
qpfadds,5 %g29, %b[20], %g28
}
{
loop_mode
qpfadds,0 %b[44], %g21, %g23
qpfsubs,1 %b[44], %g21, %g21
qpfadds,2 %g22, %g23, %g22
qpfadds,3 %b[23], %b[38], %b[14]
qpfsubs,4 %b[23], %b[38], %b[20]
}
{
loop_mode
qpfadds,0 %b[21], %b[32], %b[23]
qpfsubs,1 %b[21], %b[32], %b[21]
qpfadds,2 %b[27], %b[22], %b[32]
qpfsubs,3 %g24, %b[53], %b[33]
qpfadds,4 %g24, %b[53], %g24
}
{
loop_mode
qpfadds,0 %g16, %b[31], %b[34]
qpfsubs,1 %g16, %b[31], %g16
qpfsubs,2 %b[27], %b[22], %b[22]
qpfsubs,3 %g30, %b[25], %b[27]
qpfadds,4 %g30, %b[25], %g30
}
{
loop_mode
qpfadds,0 %b[30], %b[16], %b[25]
qpfsubs,1 %b[30], %b[16], %b[16]
qpfsubs,2 %b[28], %b[15], %b[31]
qpfadds,3 %g27, %b[26], %b[30]
qpfsubs,4 %g27, %b[26], %g27
qpfsubs,5 %g28, %b[17], %b[26]
}
{
loop_mode
qpfadds,0 %g22, %b[18], %b[36]
qpfsubs,1 %g22, %b[18], %g22
qpfadds,2 %g23, %g20, %b[18]
qpfadds,3 %b[28], %b[15], %b[15]
qpfadds,4 %g28, %b[17], %g28
stqp,5 %r18, %r0, %b[20]
}
{
loop_mode
qpfsubs,0 %g23, %g20, %g20
stqp,2 %r36, %r0, %b[21]
stqp,5 %r2, %r0, %b[14]
}
{
loop_mode
stqp,2 %r16, %r0, %g16
stqp,5 %r30, %r0, %b[33]
}
{
loop_mode
qpshufb,1 %b[61], %b[61], %r24, %g16
stqp,2 %r27, %r0, %b[23]
qpshufb,3 %b[63], %b[63], %r24, %g23
qpshufb,4 %b[35], %b[35], %r24, %b[14]
stqp,5 %r29, %r0, %g24
}
{
loop_mode
qpshufb,0 %b[55], %b[55], %r24, %g24
qpshufb,1 %b[65], %b[65], %r24, %b[17]
stqp,2 %r51, %r0, %b[22]
qpshufb,3 %b[59], %b[59], %r24, %b[20]
qpshufb,4 %b[66], %b[66], %r24, %b[21]
stqp,5 %r20, %r0, %b[34]
}
{
loop_mode
qpshufb,0 %b[13], %b[13], %r24, %b[13]
qpshufb,1 %b[51], %b[51], %r24, %b[22]
stqp,2 %r38, %r0, %b[32]
qpshufb,3 %b[29], %b[29], %r24, %b[23]
qpshufb,4 %b[64], %b[64], %r24, %b[28]
stqp,5 %r21, %r0, %g30
}
{
loop_mode
qpshufb,0 %b[40], %b[40], %r24, %g30
qpxor,1 %g16, %r22, %g16
stqp,2 %r13, %r0, %b[27]
qpxor,3 %g23, %r22, %g23
qpxor,4 %b[14], %r22, %b[14]
stqp,5 %r35, %r0, %g27
}
{
loop_mode
qpxor,0 %b[17], %r22, %g27
qpxor,1 %g24, %r22, %g24
qpfadds,2 %b[19], %g16, %b[21]
qpxor,3 %b[20], %r22, %b[17]
qpxor,4 %b[21], %r22, %b[20]
qpfadds,5 %b[24], %g23, %b[27]
}
{
loop_mode
qpxor,0 %b[13], %r22, %b[13]
qpxor,1 %b[22], %r22, %b[22]
qpfsubs,2 %b[19], %g16, %g16
qpxor,3 %b[23], %r22, %b[23]
qpxor,4 %b[28], %r22, %b[28]
qpfsubs,5 %b[24], %g23, %g23
}
{
loop_mode
qpxor,0 %g30, %r22, %g30
qpfsubs,1 %g19, %g24, %b[14]
qpfadds,2 %g19, %g24, %g19
qpfsubs,3 %g25, %b[14], %b[19]
qpfadds,4 %g25, %b[14], %g25
qpfadds,5 %g31, %b[17], %b[24]
}
{
loop_mode
qpfsubs,0 %g18, %g27, %g24
qpfadds,1 %g18, %g27, %g18
qpfadds,2 %g17, %b[13], %b[17]
qpfsubs,3 %g31, %b[17], %g27
qpfsubs,4 %b[12], %b[20], %g31
qpfadds,5 %b[12], %b[20], %b[12]
}
{
loop_mode
qpfsubs,0 %g17, %b[13], %g17
qpfadds,1 %g26, %b[22], %b[20]
qpfsubs,2 %g26, %b[22], %g26
qpfsubs,3 %b[11], %b[23], %b[13]
qpfadds,4 %b[11], %b[23], %b[11]
qpfsubs,5 %g21, %b[28], %b[23]
}
{
loop_mode
qpfadds,0 %g21, %b[28], %g21
qpfsubs,1 %g29, %g30, %b[22]
qpfadds,2 %g29, %g30, %g29
stqp,5 %r26, %r0, %b[30]
}
{
loop_mode
stqp,2 %r37, %r0, %b[31]
stqp,5 %r44, %r0, %b[25]
}
{
loop_mode
stqp,2 %r49, %r0, %b[16]
stqp,5 %r3, %r0, %g22
}
{
loop_mode
stqp,2 %r28, %r0, %b[15]
stqp,5 %r17, %r0, %b[36]
}
{
loop_mode
stqp,2 %r54, %r0, %g20
stqp,5 %r43, %r0, %b[18]
}
{
loop_mode
stqp,2 %r50, %r0, %b[26]
stqp,5 %r5, %r0, %b[27]
}
{
loop_mode
stqp,2 %r45, %r0, %g28
stqp,5 %r11, %r0, %g23
}
{
loop_mode
stqp,2 %r15, %r0, %b[21]
stqp,5 %r19, %r0, %g16
}
{
loop_mode
stqp,2 %r34, %r0, %g25
stqp,5 %r9, %r0, %b[19]
}
{
loop_mode
stqp,2 %r57, %r0, %g19
stqp,5 %r47, %r0, %b[14]
}
{
loop_mode
stqp,2 %r53, %r0, %b[24]
stqp,5 %r32, %r0, %g24
}
{
loop_mode
stqp,2 %r40, %r0, %g18
stqp,5 %r42, %r0, %g27
}
{
loop_mode
stqp,2 %r4, %r0, %b[12]
stqp,5 %r14, %r0, %g31
}
{
loop_mode
stqp,2 %r1, %r0, %b[17]
stqp,5 %r12, %r0, %g17
}
{
loop_mode
stqp,2 %r56, %r0, %g21
stqp,5 %r39, %r0, %b[11]
}
{
loop_mode
stqp,2 %r46, %r0, %b[23]
stqp,5 %r31, %r0, %b[13]
}
{
loop_mode
stqp,2 %r41, %r0, %b[20]
stqp,5 %r33, %r0, %g26
}
{
loop_mode
addd,0,sm %r0, _f16s,_lts0lo 0x30, %r0
stqp,2 %r52, %r0, %g29
stqp,5 %r48, %r0, %b[22]
}
{
loop_mode
ct %ctpr1 ? %NOT_LOOP_END
alc alcf=1, alct=1
}
{
loop_mode
ldd,0 %r8, _f16s,_lts0lo 0xff30, %b[11], mas=0x4
ldd,2 %r8, _f16s,_lts0hi 0xff38, %b[12], mas=0x4
ldd,3 %r8, _f16s,_lts1lo 0xff40, %b[13], mas=0x4
ldd,5 %r8, _f16s,_lts1hi 0xff48, %b[14], mas=0x4
}
{
loop_mode
ldd,0 %r8, _f16s,_lts0lo 0xff50, %b[15], mas=0x4
ldd,2 %r8, _f16s,_lts0hi 0xff58, %b[16], mas=0x4
ldd,3 %r8, _f16s,_lts1lo 0xff60, %b[17], mas=0x4
ldd,5 %r8, _f16s,_lts1hi 0xff68, %b[18], mas=0x4
}
{
loop_mode
ldd,0 %r8, _f16s,_lts0lo 0xff70, %b[19], mas=0x4
ldd,2 %r8, _f16s,_lts0hi 0xff78, %b[20], mas=0x4
ldd,3 %r8, _f16s,_lts1lo 0xff80, %b[21], mas=0x4
ldd,5 %r8, _f16s,_lts1hi 0xff88, %b[22], mas=0x4
}
{
loop_mode
ldd,0 %r8, _f16s,_lts0lo 0xff90, %b[23], mas=0x4
ldd,2 %r8, _f16s,_lts0hi 0xff98, %b[24], mas=0x4
ldd,3 %r8, _f16s,_lts1lo 0xffa0, %b[25], mas=0x4
ldd,5 %r8, _f16s,_lts1hi 0xffa8, %b[26], mas=0x4
}
{
loop_mode
ldd,0 %r8, _f16s,_lts0lo 0xffb0, %b[27], mas=0x4
ldd,2 %r8, _f16s,_lts0hi 0xffb8, %b[28], mas=0x4
ldd,3 %r8, _f16s,_lts1lo 0xffc0, %b[29], mas=0x4
ldd,5 %r8, _f16s,_lts1hi 0xffc8, %b[30], mas=0x4
}
{
loop_mode
ldd,0 %r8, _f16s,_lts0lo 0xffd0, %b[31], mas=0x4
ldd,2 %r8, _f16s,_lts0hi 0xffd8, %b[32], mas=0x4
ldd,3 %r8, _f16s,_lts1lo 0xffe0, %b[33], mas=0x4
ldd,5 %r8, _f16s,_lts1hi 0xffe8, %b[34], mas=0x4
}
{
loop_mode
ldd,0 %r8, _f16s,_lts0lo 0xfff0, %b[35], mas=0x4
ldd,2 %r8, _f16s,_lts0hi 0xfff8, %b[36], mas=0x4
}
Theoretical speed: 96 complex numbers per 158 cycles (96 × 8 bytes / 158) ≈ 4.86 bytes/cycle
Quadrupled theoretical speed: 19.44 bytes/cycle
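The arithmetic behind these figures can be sketched in a few lines of C. This is only an illustration: the helper names `bytes_per_cycle` and `radix2_equivalent_speed` are hypothetical and do not come from the article's code, and the sketch assumes one complex number occupies two 4-byte floats.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of data processed per cycle, assuming each complex number is a
 * float complex (two 4-byte floats, 8 bytes in total). */
double bytes_per_cycle(size_t complex_count, size_t cycles)
{
    assert(cycles > 0);
    return (double)(complex_count * 2 * sizeof(float)) / (double)cycles;
}

/* A stage that does the work of `passes` radix-2 passes is scaled by that
 * factor so that it can be compared with stage_radix2 directly. */
double radix2_equivalent_speed(size_t complex_count, size_t cycles,
                               unsigned passes)
{
    return bytes_per_cycle(complex_count, cycles) * passes;
}
```

For the loop above, `radix2_equivalent_speed(96, 158, 4)` reproduces the quadrupled figure, since one pass here stands for four radix-2 passes.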
Speed measurements

Summary for stage_radix4_2x


Speeds dropped compared with the original stage_radix4 versions.
The FFT plot is here.
stage_radix4_readConjSwap
One pass of stage_radix4_readConjSwap does the same work as 2 passes of stage_radix2. We therefore multiply the speed of stage_radix4_readConjSwap by 2 to make it easier to compare with stage_radix2 (this is noted on the plot axis and in the console output).
1. stage_radix4_readConjSwap_simd64
The computation is done the same way as in stage_radix2_readConjSwap_simd64.
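As a reminder of what the conj and swap coefficient arrays are for, here is a minimal portable sketch of the complex multiplication they enable. It makes several assumptions that are inferred from the variable names rather than stated in this section: the conj form of a coefficient c is (re, -im), the swap form is (im, re), and the horizontal add returns (a.lo + a.hi, b.lo + b.hi). The type `pair_f32` and the functions `pair_mul`, `pair_hadd` and `mul_conj_swap` are hypothetical stand-ins for a 64-bit register and the packed intrinsics, not Elbrus APIs.

```c
#include <assert.h>
#include <complex.h>

/* Two packed floats standing in for one 64-bit SIMD register:
 * lo holds the real part, hi the imaginary part (assumed layout). */
typedef struct { float lo, hi; } pair_f32;

/* Element-wise product, the role the packed multiply plays. */
pair_f32 pair_mul(pair_f32 a, pair_f32 b)
{
    pair_f32 r = { a.lo * b.lo, a.hi * b.hi };
    return r;
}

/* Horizontal add as assumed here: (a.lo + a.hi, b.lo + b.hi). */
pair_f32 pair_hadd(pair_f32 a, pair_f32 b)
{
    pair_f32 r = { a.lo + a.hi, b.lo + b.hi };
    return r;
}

/* Complex product c*y built from the two precomputed coefficient forms. */
float complex mul_conj_swap(float complex c, float complex y)
{
    pair_f32 conj_c = { crealf(c), -cimagf(c) }; /* (re, -im) */
    pair_f32 swap_c = { cimagf(c),  crealf(c) }; /* (im,  re) */
    pair_f32 yy     = { crealf(y),  cimagf(y) };

    /* lo = re*re - im*im (real part), hi = im*re + re*im (imaginary part) */
    pair_f32 cy = pair_hadd(pair_mul(conj_c, yy), pair_mul(swap_c, yy));
    return cy.lo + cy.hi * I;
}
```

Storing both forms in advance trades extra memory reads for fewer shuffle operations inside the loop, which appears to be the point of the readConjSwap variants.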
C code
void stage_radix4_readConjSwap_simd64(int data_count, myComplex *data_in, myComplex *data_out,
                                      myComplex *conj_coefC, myComplex *conj_coefD, myComplex *conj_coefE,
                                      myComplex *swap_coefC, myComplex *swap_coefD, myComplex *swap_coefE)
{
    // Inputs are interleaved as x, y, z, w; each output quarter is written contiguously.
    uint64_t *x_in = (uint64_t*)&data_in[0];
    uint64_t *y_in = (uint64_t*)&data_in[1];
    uint64_t *z_in = (uint64_t*)&data_in[2];
    uint64_t *w_in = (uint64_t*)&data_in[3];
    uint64_t *conj_c_in = (uint64_t*)conj_coefC;
    uint64_t *conj_d_in = (uint64_t*)conj_coefD;
    uint64_t *conj_e_in = (uint64_t*)conj_coefE;
    uint64_t *swap_c_in = (uint64_t*)swap_coefC;
    uint64_t *swap_d_in = (uint64_t*)swap_coefD;
    uint64_t *swap_e_in = (uint64_t*)swap_coefE;
    uint64_t *out_0 = (uint64_t*)&data_out[0*data_count/4];
    uint64_t *out_1 = (uint64_t*)&data_out[1*data_count/4];
    uint64_t *out_2 = (uint64_t*)&data_out[2*data_count/4];
    uint64_t *out_3 = (uint64_t*)&data_out[3*data_count/4];
    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/4; ++i)
    {
        uint64_t x = x_in[4*i];
        uint64_t y = y_in[4*i];
        uint64_t z = z_in[4*i];
        uint64_t w = w_in[4*i];
        uint64_t conj_c = conj_c_in[i];
        uint64_t conj_d = conj_d_in[i];
        uint64_t conj_e = conj_e_in[i];
        uint64_t swap_c = swap_c_in[i];
        uint64_t swap_d = swap_d_in[i];
        uint64_t swap_e = swap_e_in[i];
        // Twiddle products c*y, d*z, e*w, built from the conj and swap coefficient forms.
        uint64_t cy_real = __builtin_e2k_pfmuls(conj_c, y);
        uint64_t dz_real = __builtin_e2k_pfmuls(conj_d, z);
        uint64_t ew_real = __builtin_e2k_pfmuls(conj_e, w);
        uint64_t cy_imag = __builtin_e2k_pfmuls(swap_c, y);
        uint64_t dz_imag = __builtin_e2k_pfmuls(swap_d, z);
        uint64_t ew_imag = __builtin_e2k_pfmuls(swap_e, w);
        uint64_t cy = __builtin_e2k_pfhadds(cy_real, cy_imag);
        uint64_t dz = __builtin_e2k_pfhadds(dz_real, dz_imag);
        uint64_t ew = __builtin_e2k_pfhadds(ew_real, ew_imag);
        uint64_t add02 = __builtin_e2k_pfadds(x, dz);
        uint64_t sub02 = __builtin_e2k_pfsubs(x, dz);
        uint64_t add13 = __builtin_e2k_pfadds(cy, ew);
        uint64_t sub13 = __builtin_e2k_pfsubs(cy, ew);
        // Multiply sub13 by i: swap the two 32-bit halves, then flip the sign bit of the low float (bit 31).
        //uint64_t conj_sub13 = __builtin_e2k_pxord(sub13, 1LL<<63);
        //uint64_t sub13i = __builtin_e2k_pshufb(0, conj_sub13, 0x0302010007060504);
        uint64_t swap_sub13 = __builtin_e2k_pshufb(0, sub13, 0x0302010007060504);
        uint64_t sub13i = __builtin_e2k_pxord(swap_sub13, 1LL<<31);
        // Radix-4 butterfly outputs.
        out_0[i] = __builtin_e2k_pfadds(add02, add13);
        out_1[i] = __builtin_e2k_pfsubs(sub02, sub13i);
        out_2[i] = __builtin_e2k_pfsubs(add02, add13);
        out_3[i] = __builtin_e2k_pfadds(sub02, sub13i);
    }
}
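For checking results on any machine, the same stage can be modelled with plain `float complex` arithmetic. The sketch below is an assumption-laden reference, not the article's code: `stage_radix4_ref` is a hypothetical name, and it takes ordinary twiddle factors `coefC`, `coefD`, `coefE` instead of the separate conj and swap arrays. It follows the same data layout: inputs interleaved as x, y, z, w, and each output quarter written contiguously.

```c
#include <assert.h>
#include <complex.h>

/* Scalar reference model of one radix-4 stage (hypothetical helper).
 * coefC, coefD, coefE hold the twiddle factors applied to y, z and w. */
void stage_radix4_ref(int data_count,
                      const float complex *data_in, float complex *data_out,
                      const float complex *coefC, const float complex *coefD,
                      const float complex *coefE)
{
    assert(data_count > 0 && data_count % 4 == 0);
    int quarter = data_count / 4;

    for (int i = 0; i < quarter; ++i) {
        float complex x  = data_in[4 * i + 0];
        float complex cy = coefC[i] * data_in[4 * i + 1];
        float complex dz = coefD[i] * data_in[4 * i + 2];
        float complex ew = coefE[i] * data_in[4 * i + 3];

        float complex add02 = x + dz;
        float complex sub02 = x - dz;
        float complex add13 = cy + ew;
        float complex sub13 = cy - ew;

        /* Multiplication by i: (re, im) -> (-im, re). */
        float complex sub13i = -cimagf(sub13) + crealf(sub13) * I;

        data_out[0 * quarter + i] = add02 + add13;
        data_out[1 * quarter + i] = sub02 - sub13i;
        data_out[2 * quarter + i] = add02 - add13;
        data_out[3 * quarter + i] = sub02 + sub13i;
    }
}
```

With all twiddle factors equal to 1 and `data_count == 4`, this reduces to a plain 4-point DFT, which makes a convenient smoke test before comparing against the SIMD version.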
Main loop in assembly
.L640:
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=7, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=5, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=6, asz=3, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=3, asz=3, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=4, asz=4, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=2, asz=3, abs=24, disp=0
}
.L275:
{
loop_mode
pfmul_hadds,0,sm %b[43], %b[59], %b[80], %b[11]
pfadd_adds,1,sm %b[34], %b[17], %b[87], %b[67]
pfadd_rsubs,2,sm %b[34], %b[17], %b[87], %b[54]
pfmuls,3,sm %b[12], %b[55], %b[76]
pfsubs,4,sm %b[42], %b[25], %b[88]
pfadds,5,sm %b[44], %b[27], %b[83]
movad,1 area=0, ind=0, am=1, be=0, %b[70]
movad,2 area=2, ind=0, am=1, be=0, %b[1]
movad,3 area=1, ind=0, am=1, be=0, %b[0]
}
{
loop_mode
pfsub_rsubs,0,sm %b[34], %b[17], %b[99], %b[80]
pfsub_adds,1,sm %b[34], %b[17], %b[99], %b[59]
staad,2 %b[73], %aad4[ %aasti11 ]
incr,2 %aaincr0
pfmuls,3,sm %b[100], %b[64], %b[87]
pshufb,4,sm 0x0, %b[90], %r21, %b[93]
staad,5 %b[60], %aad2[ %aasti9 ]
incr,5 %aaincr0
movad,0 area=3, ind=0, am=1, be=0, %b[44]
movad,1 area=2, ind=0, am=1, be=0, %b[27]
movad,2 area=0, ind=0, am=0, be=0, %b[12]
movad,3 area=0, ind=16, am=0, be=0, %b[43]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmul_hadds,0,sm %b[52], %b[68], %b[91], %b[17]
pfmul_hadds,1,sm %b[9], %b[81], %b[94], %b[34]
staad,2 %b[86], %aad3[ %aasti10 ]
incr,2 %aaincr0
pfmuls,3,sm %b[74], %b[77], %b[90]
xord,4,sm %b[95], %r9, %b[97]
staad,5 %b[65], %aad1[ %aasti8 ]
incr,5 %aaincr0
movad,1 area=1, ind=0, am=1, be=0, %b[96]
movad,2 area=0, ind=8, am=1, be=0, %b[73]
movad,3 area=0, ind=24, am=0, be=0, %b[60]
}
Theoretical speed: 4 complex numbers per 3 cycles (4 × 8 bytes / 3) ≈ 10.67 bytes/cycle
Doubled theoretical speed: 21.33 bytes/cycle
Speed measurements

2. stage_radix4_readConjSwap_simd128
The computation is done the same way as in stage_radix2_readConjSwap_simd128.
C code
void stage_radix4_readConjSwap_simd128(int data_count, myComplex *data_in, myComplex *data_out,
                                       myComplex *conj_coefC, myComplex *conj_coefD, myComplex *conj_coefE,
                                       myComplex *swap_coefC, myComplex *swap_coefD, myComplex *swap_coefE)
{
    // Each 128-bit load holds two complex numbers; one iteration handles two butterflies.
    __v2di *xy0_in = (__v2di*)&data_in[0];
    __v2di *zw0_in = (__v2di*)&data_in[2];
    __v2di *xy1_in = (__v2di*)&data_in[4];
    __v2di *zw1_in = (__v2di*)&data_in[6];
    __v2di *conj_c_in = (__v2di*)conj_coefC;
    __v2di *conj_d_in = (__v2di*)conj_coefD;
    __v2di *conj_e_in = (__v2di*)conj_coefE;
    __v2di *swap_c_in = (__v2di*)swap_coefC;
    __v2di *swap_d_in = (__v2di*)swap_coefD;
    __v2di *swap_e_in = (__v2di*)swap_coefE;
    __v2di *out_0 = (__v2di*)&data_out[0*data_count/4];
    __v2di *out_1 = (__v2di*)&data_out[1*data_count/4];
    __v2di *out_2 = (__v2di*)&data_out[2*data_count/4];
    __v2di *out_3 = (__v2di*)&data_out[3*data_count/4];
    #pragma ivdep
    #pragma unroll(1)
    #pragma prefetch
    for(int64_t i = 0; i < data_count/8; ++i)
    {
        __v2di xy0 = xy0_in[4*i];
        __v2di zw0 = zw0_in[4*i];
        __v2di xy1 = xy1_in[4*i];
        __v2di zw1 = zw1_in[4*i];
        __v2di conj_c = conj_c_in[i];
        __v2di conj_d = conj_d_in[i];
        __v2di conj_e = conj_e_in[i];
        __v2di swap_c = swap_c_in[i];
        __v2di swap_d = swap_d_in[i];
        __v2di swap_e = swap_e_in[i];
        // Regroup the loaded pairs into separate x, y, z, w registers.
        __v2di x = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di y = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di z = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
        __v2di w = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
        __v2di cy_real = __builtin_e2k_qpfmuls(conj_c, y);
        __v2di dz_real = __builtin_e2k_qpfmuls(conj_d, z);
        __v2di ew_real = __builtin_e2k_qpfmuls(conj_e, w);
        __v2di cy_imag = __builtin_e2k_qpfmuls(swap_c, y);
        __v2di dz_imag = __builtin_e2k_qpfmuls(swap_d, z);
        __v2di ew_imag = __builtin_e2k_qpfmuls(swap_e, w);
        // The horizontal adds leave the products in the "rrii" order named in the variables.
        __v2di cy_rrii = __builtin_e2k_qpfhadds(cy_real, cy_imag);
        __v2di dz_rrii = __builtin_e2k_qpfhadds(dz_real, dz_imag);
        __v2di ew_rrii = __builtin_e2k_qpfhadds(ew_real, ew_imag);
        __v2di dz = __builtin_e2k_qpshufb(dz_rrii, dz_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        __v2di add02 = __builtin_e2k_qpfadds(x, dz);
        __v2di sub02 = __builtin_e2k_qpfsubs(x, dz);
        __v2di add13_rrii = __builtin_e2k_qpfadds(cy_rrii, ew_rrii);
        __v2di sub13_rrii = __builtin_e2k_qpfsubs(cy_rrii, ew_rrii);
        __v2di add13 = __builtin_e2k_qpshufb(add13_rrii, add13_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
        // Multiply sub13 by i: one shuffle reorders and swaps re/im, then the sign bit of each low float is flipped.
        //__v2di conj_sub13 = __builtin_e2k_qpxor(sub13_rrii, (__v2di){(1LL<<63) + (1LL<<31), 0});
        //__v2di sub13i = __builtin_e2k_qpshufb(conj_sub13, conj_sub13, (__v2di){0x030201000B0A0908, 0x070605040F0E0D0C});
        __v2di swap_sub13 = __builtin_e2k_qpshufb(sub13_rrii, sub13_rrii, (__v2di){0x030201000B0A0908, 0x070605040F0E0D0C});
        __v2di sub13i = __builtin_e2k_qpxor(swap_sub13, (__v2di){1LL<<31, 1LL<<31});
        // Radix-4 butterfly outputs.
        out_0[i] = __builtin_e2k_qpfadds(add02, add13);
        out_1[i] = __builtin_e2k_qpfsubs(sub02, sub13i);
        out_2[i] = __builtin_e2k_qpfsubs(add02, add13);
        out_3[i] = __builtin_e2k_qpfadds(sub02, sub13i);
    }
}
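The two shuffle masks applied to the "rrii" values are easier to read once decoded into float lanes. The sketch below is a portable model under explicit assumptions: `shuffle16`, `rrii_to_riri` and `rrii_times_i` are hypothetical helpers, and `shuffle16` only models the case used here, where both data operands of the shuffle are the same register and every mask byte is an index in the range 0..15 into that register. It does not claim to reproduce the full qpshufb semantics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Single-source byte shuffle over 16 bytes: result byte k is the source
 * byte whose number is stored in mask byte k (mask_lo covers bytes 0..7,
 * mask_hi covers bytes 8..15). Assumes sizeof(float) == 4. */
void shuffle16(const float src[4], uint64_t mask_lo, uint64_t mask_hi,
               float dst[4])
{
    uint8_t in[16], out[16];
    memcpy(in, src, sizeof in);
    for (int k = 0; k < 16; ++k) {
        uint64_t half = (k < 8) ? mask_lo : mask_hi;
        unsigned idx = (unsigned)(half >> (8 * (k % 8))) & 0xFFu;
        assert(idx < 16); /* only plain byte indices are modelled */
        out[k] = in[idx];
    }
    memcpy(dst, out, sizeof out);
}

/* (re0, re1, im0, im1) -> (re0, im0, re1, im1) */
void rrii_to_riri(const float rrii[4], float riri[4])
{
    shuffle16(rrii, 0x0B0A090803020100ULL, 0x0F0E0D0C07060504ULL, riri);
}

/* (re0, re1, im0, im1) -> i * value in interleaved order:
 * (-im0, re0, -im1, re1). */
void rrii_times_i(const float rrii[4], float riri[4])
{
    shuffle16(rrii, 0x030201000B0A0908ULL, 0x070605040F0E0D0CULL, riri);
    /* Stands in for the XOR with 1<<31 in each 64-bit half, which flips
     * the sign of the low float of each half. */
    riri[0] = -riri[0];
    riri[2] = -riri[2];
}
```

Read this way, the second mask folds the reordering and the re/im swap into a single shuffle, so multiplying by i costs one shuffle and one XOR.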
Main loop in assembly
.L1243:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=1, asz=3, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=7, asz=3, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=6, asz=3, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=5, asz=3, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=4, asz=3, abs=16, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=3, asz=3, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=2, asz=3, abs=24, disp=0
}
.L699:
{
loop_mode
qpfmuls,0,sm %b[59], %b[33], %b[48]
qpfmul_hadds,1,sm %b[32], %b[41], %b[80], %b[6]
qpfadd_adds,2,sm %b[24], %b[77], %b[62], %b[47]
qpshufb,3,sm %b[54], %b[54], %g17, %b[60]
qpshufb,4,sm %b[42], %b[42], %g16, %b[69]
qpfadd_rsubs,5,sm %b[24], %b[77], %b[62], %b[7]
movaqp,1 area=0, ind=0, am=0, be=0, %b[1]
movaqp,3 area=0, ind=0, am=0, be=0, %b[0]
}
{
loop_mode
qpfmul_hadds,0,sm %b[19], %b[38], %b[73], %b[72]
qpfmul_hadds,1,sm %b[29], %b[35], %b[50], %b[41]
qpfsub_rsubs,2,sm %b[24], %b[77], %b[81], %b[54]
qpshufb,3,sm %b[2], %b[3], %r22, %b[32]
qpxor,4,sm %b[69], %g18, %b[79]
qpfsub_adds,5,sm %b[24], %b[77], %b[81], %b[42]
movaqp,1 area=0, ind=16, am=1, be=0, %b[62]
movaqp,3 area=0, ind=16, am=1, be=0, %b[59]
}
{
loop_mode
qpfadds,0,sm %b[76], %b[10], %b[50]
qpfsubs,1,sm %b[76], %b[10], %b[38]
staaqp,2 %b[51], %aad4[ %aasti11 ]
incr,2 %aaincr0
qpshufb,3,sm %b[61], %b[64], %r22, %b[35]
qpshufb,4,sm %b[63], %b[66], %g19, %b[29]
staaqp,5 %b[11], %aad2[ %aasti9 ]
incr,5 %aaincr0
movaqp,1 area=3, ind=0, am=1, be=0, %b[19]
movaqp,3 area=3, ind=0, am=1, be=0, %b[24]
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpfmuls,0,sm %b[70], %b[34], %b[69]
qpfmuls,1,sm %b[67], %b[37], %b[76]
staaqp,2 %b[58], %aad3[ %aasti10 ]
incr,2 %aaincr0
qpshufb,3,sm %b[4], %b[5], %g19, %b[10]
qpshufb,4,sm %b[45], %b[45], %g17, %b[73]
staaqp,5 %b[46], %aad1[ %aasti8 ]
incr,5 %aaincr0
movaqp,0 area=2, ind=0, am=1, be=0, %b[63]
movaqp,1 area=1, ind=0, am=1, be=0, %b[66]
movaqp,2 area=2, ind=0, am=1, be=0, %b[11]
movaqp,3 area=1, ind=0, am=1, be=0, %b[51]
}
Theoretical speed: 8 complex numbers per 4 cycles (8 × 8 bytes / 4) = 16 bytes/cycle
Doubled theoretical speed: 32 bytes/cycle
Speed measurements

Summary for stage_radix4_readConjSwap


The FFT plot is here.
stage_radix4_readConjSwap_2x
One pass of stage_radix4_readConjSwap_2x does the same work as 2 passes of stage_radix4_readConjSwap, and one pass of stage_radix4_readConjSwap does the same work as 2 passes of stage_radix2. We therefore multiply the speed of stage_radix4_readConjSwap_2x by 4 to make it easier to compare with stage_radix2 (this is noted on the plot axis and in the console output).
1. stage_radix4_readConjSwap_2x_simd64
Here the stage_radix4_readConjSwap_simd64 algorithm is manually unrolled by a factor of 2.
C code
void stage_radix4_readConjSwap_2x_simd64(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coefC_a, myComplex *conj_coefD_a, myComplex *conj_coefE_a, myComplex *conj_coefC_b, myComplex *conj_coefD_b, myComplex *conj_coefE_b, myComplex *swap_coefC_a, myComplex *swap_coefD_a, myComplex *swap_coefE_a, myComplex *swap_coefC_b, myComplex *swap_coefD_b, myComplex *swap_coefE_b)
{
uint64_t *x0_in = (uint64_t*)&data_in[ 0];
uint64_t *y0_in = (uint64_t*)&data_in[ 1];
uint64_t *z0_in = (uint64_t*)&data_in[ 2];
uint64_t *w0_in = (uint64_t*)&data_in[ 3];
uint64_t *x1_in = (uint64_t*)&data_in[ 4];
uint64_t *y1_in = (uint64_t*)&data_in[ 5];
uint64_t *z1_in = (uint64_t*)&data_in[ 6];
uint64_t *w1_in = (uint64_t*)&data_in[ 7];
uint64_t *x2_in = (uint64_t*)&data_in[ 8];
uint64_t *y2_in = (uint64_t*)&data_in[ 9];
uint64_t *z2_in = (uint64_t*)&data_in[10];
uint64_t *w2_in = (uint64_t*)&data_in[11];
uint64_t *x3_in = (uint64_t*)&data_in[12];
uint64_t *y3_in = (uint64_t*)&data_in[13];
uint64_t *z3_in = (uint64_t*)&data_in[14];
uint64_t *w3_in = (uint64_t*)&data_in[15];
uint64_t *conj_c0a_in = (uint64_t*)&conj_coefC_a[0];
uint64_t *conj_c1a_in = (uint64_t*)&conj_coefC_a[1];
uint64_t *conj_c2a_in = (uint64_t*)&conj_coefC_a[2];
uint64_t *conj_c3a_in = (uint64_t*)&conj_coefC_a[3];
uint64_t *conj_d0a_in = (uint64_t*)&conj_coefD_a[0];
uint64_t *conj_d1a_in = (uint64_t*)&conj_coefD_a[1];
uint64_t *conj_d2a_in = (uint64_t*)&conj_coefD_a[2];
uint64_t *conj_d3a_in = (uint64_t*)&conj_coefD_a[3];
uint64_t *conj_e0a_in = (uint64_t*)&conj_coefE_a[0];
uint64_t *conj_e1a_in = (uint64_t*)&conj_coefE_a[1];
uint64_t *conj_e2a_in = (uint64_t*)&conj_coefE_a[2];
uint64_t *conj_e3a_in = (uint64_t*)&conj_coefE_a[3];
uint64_t *conj_c0b_in = (uint64_t*)&conj_coefC_b[0*data_count/16];
uint64_t *conj_c1b_in = (uint64_t*)&conj_coefC_b[1*data_count/16];
uint64_t *conj_c2b_in = (uint64_t*)&conj_coefC_b[2*data_count/16];
uint64_t *conj_c3b_in = (uint64_t*)&conj_coefC_b[3*data_count/16];
uint64_t *conj_d0b_in = (uint64_t*)&conj_coefD_b[0*data_count/16];
uint64_t *conj_d1b_in = (uint64_t*)&conj_coefD_b[1*data_count/16];
uint64_t *conj_d2b_in = (uint64_t*)&conj_coefD_b[2*data_count/16];
uint64_t *conj_d3b_in = (uint64_t*)&conj_coefD_b[3*data_count/16];
uint64_t *conj_e0b_in = (uint64_t*)&conj_coefE_b[0*data_count/16];
uint64_t *conj_e1b_in = (uint64_t*)&conj_coefE_b[1*data_count/16];
uint64_t *conj_e2b_in = (uint64_t*)&conj_coefE_b[2*data_count/16];
uint64_t *conj_e3b_in = (uint64_t*)&conj_coefE_b[3*data_count/16];
uint64_t *swap_c0a_in = (uint64_t*)&swap_coefC_a[0];
uint64_t *swap_c1a_in = (uint64_t*)&swap_coefC_a[1];
uint64_t *swap_c2a_in = (uint64_t*)&swap_coefC_a[2];
uint64_t *swap_c3a_in = (uint64_t*)&swap_coefC_a[3];
uint64_t *swap_d0a_in = (uint64_t*)&swap_coefD_a[0];
uint64_t *swap_d1a_in = (uint64_t*)&swap_coefD_a[1];
uint64_t *swap_d2a_in = (uint64_t*)&swap_coefD_a[2];
uint64_t *swap_d3a_in = (uint64_t*)&swap_coefD_a[3];
uint64_t *swap_e0a_in = (uint64_t*)&swap_coefE_a[0];
uint64_t *swap_e1a_in = (uint64_t*)&swap_coefE_a[1];
uint64_t *swap_e2a_in = (uint64_t*)&swap_coefE_a[2];
uint64_t *swap_e3a_in = (uint64_t*)&swap_coefE_a[3];
uint64_t *swap_c0b_in = (uint64_t*)&swap_coefC_b[0*data_count/16];
uint64_t *swap_c1b_in = (uint64_t*)&swap_coefC_b[1*data_count/16];
uint64_t *swap_c2b_in = (uint64_t*)&swap_coefC_b[2*data_count/16];
uint64_t *swap_c3b_in = (uint64_t*)&swap_coefC_b[3*data_count/16];
uint64_t *swap_d0b_in = (uint64_t*)&swap_coefD_b[0*data_count/16];
uint64_t *swap_d1b_in = (uint64_t*)&swap_coefD_b[1*data_count/16];
uint64_t *swap_d2b_in = (uint64_t*)&swap_coefD_b[2*data_count/16];
uint64_t *swap_d3b_in = (uint64_t*)&swap_coefD_b[3*data_count/16];
uint64_t *swap_e0b_in = (uint64_t*)&swap_coefE_b[0*data_count/16];
uint64_t *swap_e1b_in = (uint64_t*)&swap_coefE_b[1*data_count/16];
uint64_t *swap_e2b_in = (uint64_t*)&swap_coefE_b[2*data_count/16];
uint64_t *swap_e3b_in = (uint64_t*)&swap_coefE_b[3*data_count/16];
uint64_t *out_0 = (uint64_t*)&data_out[ 0*data_count/16];
uint64_t *out_1 = (uint64_t*)&data_out[ 1*data_count/16];
uint64_t *out_2 = (uint64_t*)&data_out[ 2*data_count/16];
uint64_t *out_3 = (uint64_t*)&data_out[ 3*data_count/16];
uint64_t *out_4 = (uint64_t*)&data_out[ 4*data_count/16];
uint64_t *out_5 = (uint64_t*)&data_out[ 5*data_count/16];
uint64_t *out_6 = (uint64_t*)&data_out[ 6*data_count/16];
uint64_t *out_7 = (uint64_t*)&data_out[ 7*data_count/16];
uint64_t *out_8 = (uint64_t*)&data_out[ 8*data_count/16];
uint64_t *out_9 = (uint64_t*)&data_out[ 9*data_count/16];
uint64_t *out_10 = (uint64_t*)&data_out[10*data_count/16];
uint64_t *out_11 = (uint64_t*)&data_out[11*data_count/16];
uint64_t *out_12 = (uint64_t*)&data_out[12*data_count/16];
uint64_t *out_13 = (uint64_t*)&data_out[13*data_count/16];
uint64_t *out_14 = (uint64_t*)&data_out[14*data_count/16];
uint64_t *out_15 = (uint64_t*)&data_out[15*data_count/16];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/16; ++i)
{
uint64_t x0 = x0_in[16*i];
uint64_t y0 = y0_in[16*i];
uint64_t z0 = z0_in[16*i];
uint64_t w0 = w0_in[16*i];
uint64_t conj_c0 = conj_c0a_in[4*i];
uint64_t conj_d0 = conj_d0a_in[4*i];
uint64_t conj_e0 = conj_e0a_in[4*i];
uint64_t swap_c0 = swap_c0a_in[4*i];
uint64_t swap_d0 = swap_d0a_in[4*i];
uint64_t swap_e0 = swap_e0a_in[4*i];
uint64_t x1 = x1_in[16*i];
uint64_t y1 = y1_in[16*i];
uint64_t z1 = z1_in[16*i];
uint64_t w1 = w1_in[16*i];
uint64_t conj_c1 = conj_c1a_in[4*i];
uint64_t conj_d1 = conj_d1a_in[4*i];
uint64_t conj_e1 = conj_e1a_in[4*i];
uint64_t swap_c1 = swap_c1a_in[4*i];
uint64_t swap_d1 = swap_d1a_in[4*i];
uint64_t swap_e1 = swap_e1a_in[4*i];
uint64_t x2 = x2_in[16*i];
uint64_t y2 = y2_in[16*i];
uint64_t z2 = z2_in[16*i];
uint64_t w2 = w2_in[16*i];
uint64_t conj_c2 = conj_c2a_in[4*i];
uint64_t conj_d2 = conj_d2a_in[4*i];
uint64_t conj_e2 = conj_e2a_in[4*i];
uint64_t swap_c2 = swap_c2a_in[4*i];
uint64_t swap_d2 = swap_d2a_in[4*i];
uint64_t swap_e2 = swap_e2a_in[4*i];
uint64_t x3 = x3_in[16*i];
uint64_t y3 = y3_in[16*i];
uint64_t z3 = z3_in[16*i];
uint64_t w3 = w3_in[16*i];
uint64_t conj_c3 = conj_c3a_in[4*i];
uint64_t conj_d3 = conj_d3a_in[4*i];
uint64_t conj_e3 = conj_e3a_in[4*i];
uint64_t swap_c3 = swap_c3a_in[4*i];
uint64_t swap_d3 = swap_d3a_in[4*i];
uint64_t swap_e3 = swap_e3a_in[4*i];
uint64_t cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
uint64_t cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
uint64_t cy2_real = __builtin_e2k_pfmuls(conj_c2, y2);
uint64_t cy3_real = __builtin_e2k_pfmuls(conj_c3, y3);
uint64_t dz0_real = __builtin_e2k_pfmuls(conj_d0, z0);
uint64_t dz1_real = __builtin_e2k_pfmuls(conj_d1, z1);
uint64_t dz2_real = __builtin_e2k_pfmuls(conj_d2, z2);
uint64_t dz3_real = __builtin_e2k_pfmuls(conj_d3, z3);
uint64_t ew0_real = __builtin_e2k_pfmuls(conj_e0, w0);
uint64_t ew1_real = __builtin_e2k_pfmuls(conj_e1, w1);
uint64_t ew2_real = __builtin_e2k_pfmuls(conj_e2, w2);
uint64_t ew3_real = __builtin_e2k_pfmuls(conj_e3, w3);
uint64_t cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
uint64_t cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
uint64_t cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2);
uint64_t cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3);
uint64_t dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0);
uint64_t dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1);
uint64_t dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2);
uint64_t dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3);
uint64_t ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0);
uint64_t ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1);
uint64_t ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2);
uint64_t ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3);
uint64_t cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
uint64_t cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
uint64_t cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag);
uint64_t cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag);
uint64_t dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag);
uint64_t dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag);
uint64_t dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag);
uint64_t dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag);
uint64_t ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag);
uint64_t ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag);
uint64_t ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag);
uint64_t ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag);
uint64_t add02_0 = __builtin_e2k_pfadds( x0, dz0);
uint64_t add02_1 = __builtin_e2k_pfadds( x1, dz1);
uint64_t add02_2 = __builtin_e2k_pfadds( x2, dz2);
uint64_t add02_3 = __builtin_e2k_pfadds( x3, dz3);
uint64_t sub02_0 = __builtin_e2k_pfsubs( x0, dz0);
uint64_t sub02_1 = __builtin_e2k_pfsubs( x1, dz1);
uint64_t sub02_2 = __builtin_e2k_pfsubs( x2, dz2);
uint64_t sub02_3 = __builtin_e2k_pfsubs( x3, dz3);
uint64_t add13_0 = __builtin_e2k_pfadds(cy0, ew0);
uint64_t add13_1 = __builtin_e2k_pfadds(cy1, ew1);
uint64_t add13_2 = __builtin_e2k_pfadds(cy2, ew2);
uint64_t add13_3 = __builtin_e2k_pfadds(cy3, ew3);
uint64_t sub13_0 = __builtin_e2k_pfsubs(cy0, ew0);
uint64_t sub13_1 = __builtin_e2k_pfsubs(cy1, ew1);
uint64_t sub13_2 = __builtin_e2k_pfsubs(cy2, ew2);
uint64_t sub13_3 = __builtin_e2k_pfsubs(cy3, ew3);
//uint64_t conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63);
//uint64_t conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63);
//uint64_t conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63);
//uint64_t conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63);
//uint64_t sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504);
//uint64_t sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504);
//uint64_t sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504);
//uint64_t sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504);
// Multiply sub13 by i: swap the real and imaginary lanes, then flip the sign bit of the low lane
uint64_t swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504);
uint64_t swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504);
uint64_t swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504);
uint64_t swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504);
uint64_t sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31);
uint64_t sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31);
uint64_t sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31);
uint64_t sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31);
uint64_t out0 = __builtin_e2k_pfadds(add02_0, add13_0);
uint64_t out1 = __builtin_e2k_pfadds(add02_1, add13_1);
uint64_t out2 = __builtin_e2k_pfadds(add02_2, add13_2);
uint64_t out3 = __builtin_e2k_pfadds(add02_3, add13_3);
uint64_t out4 = __builtin_e2k_pfsubs(sub02_0, sub13i_0);
uint64_t out5 = __builtin_e2k_pfsubs(sub02_1, sub13i_1);
uint64_t out6 = __builtin_e2k_pfsubs(sub02_2, sub13i_2);
uint64_t out7 = __builtin_e2k_pfsubs(sub02_3, sub13i_3);
uint64_t out8 = __builtin_e2k_pfsubs(add02_0, add13_0);
uint64_t out9 = __builtin_e2k_pfsubs(add02_1, add13_1);
uint64_t out10 = __builtin_e2k_pfsubs(add02_2, add13_2);
uint64_t out11 = __builtin_e2k_pfsubs(add02_3, add13_3);
uint64_t out12 = __builtin_e2k_pfadds(sub02_0, sub13i_0);
uint64_t out13 = __builtin_e2k_pfadds(sub02_1, sub13i_1);
uint64_t out14 = __builtin_e2k_pfadds(sub02_2, sub13i_2);
uint64_t out15 = __builtin_e2k_pfadds(sub02_3, sub13i_3);
// Second pass: the first-pass results become the inputs, and the "b" coefficients are loaded
x0 = out0;
y0 = out1;
z0 = out2;
w0 = out3;
conj_c0 = conj_c0b_in[i];
conj_d0 = conj_d0b_in[i];
conj_e0 = conj_e0b_in[i];
swap_c0 = swap_c0b_in[i];
swap_d0 = swap_d0b_in[i];
swap_e0 = swap_e0b_in[i];
x1 = out4;
y1 = out5;
z1 = out6;
w1 = out7;
conj_c1 = conj_c1b_in[i];
conj_d1 = conj_d1b_in[i];
conj_e1 = conj_e1b_in[i];
swap_c1 = swap_c1b_in[i];
swap_d1 = swap_d1b_in[i];
swap_e1 = swap_e1b_in[i];
x2 = out8;
y2 = out9;
z2 = out10;
w2 = out11;
conj_c2 = conj_c2b_in[i];
conj_d2 = conj_d2b_in[i];
conj_e2 = conj_e2b_in[i];
swap_c2 = swap_c2b_in[i];
swap_d2 = swap_d2b_in[i];
swap_e2 = swap_e2b_in[i];
x3 = out12;
y3 = out13;
z3 = out14;
w3 = out15;
conj_c3 = conj_c3b_in[i];
conj_d3 = conj_d3b_in[i];
conj_e3 = conj_e3b_in[i];
swap_c3 = swap_c3b_in[i];
swap_d3 = swap_d3b_in[i];
swap_e3 = swap_e3b_in[i];
cy0_real = __builtin_e2k_pfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_pfmuls(conj_c1, y1);
cy2_real = __builtin_e2k_pfmuls(conj_c2, y2);
cy3_real = __builtin_e2k_pfmuls(conj_c3, y3);
dz0_real = __builtin_e2k_pfmuls(conj_d0, z0);
dz1_real = __builtin_e2k_pfmuls(conj_d1, z1);
dz2_real = __builtin_e2k_pfmuls(conj_d2, z2);
dz3_real = __builtin_e2k_pfmuls(conj_d3, z3);
ew0_real = __builtin_e2k_pfmuls(conj_e0, w0);
ew1_real = __builtin_e2k_pfmuls(conj_e1, w1);
ew2_real = __builtin_e2k_pfmuls(conj_e2, w2);
ew3_real = __builtin_e2k_pfmuls(conj_e3, w3);
cy0_imag = __builtin_e2k_pfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_pfmuls(swap_c1, y1);
cy2_imag = __builtin_e2k_pfmuls(swap_c2, y2);
cy3_imag = __builtin_e2k_pfmuls(swap_c3, y3);
dz0_imag = __builtin_e2k_pfmuls(swap_d0, z0);
dz1_imag = __builtin_e2k_pfmuls(swap_d1, z1);
dz2_imag = __builtin_e2k_pfmuls(swap_d2, z2);
dz3_imag = __builtin_e2k_pfmuls(swap_d3, z3);
ew0_imag = __builtin_e2k_pfmuls(swap_e0, w0);
ew1_imag = __builtin_e2k_pfmuls(swap_e1, w1);
ew2_imag = __builtin_e2k_pfmuls(swap_e2, w2);
ew3_imag = __builtin_e2k_pfmuls(swap_e3, w3);
cy0 = __builtin_e2k_pfhadds(cy0_real, cy0_imag);
cy1 = __builtin_e2k_pfhadds(cy1_real, cy1_imag);
cy2 = __builtin_e2k_pfhadds(cy2_real, cy2_imag);
cy3 = __builtin_e2k_pfhadds(cy3_real, cy3_imag);
dz0 = __builtin_e2k_pfhadds(dz0_real, dz0_imag);
dz1 = __builtin_e2k_pfhadds(dz1_real, dz1_imag);
dz2 = __builtin_e2k_pfhadds(dz2_real, dz2_imag);
dz3 = __builtin_e2k_pfhadds(dz3_real, dz3_imag);
ew0 = __builtin_e2k_pfhadds(ew0_real, ew0_imag);
ew1 = __builtin_e2k_pfhadds(ew1_real, ew1_imag);
ew2 = __builtin_e2k_pfhadds(ew2_real, ew2_imag);
ew3 = __builtin_e2k_pfhadds(ew3_real, ew3_imag);
add02_0 = __builtin_e2k_pfadds( x0, dz0);
add02_1 = __builtin_e2k_pfadds( x1, dz1);
add02_2 = __builtin_e2k_pfadds( x2, dz2);
add02_3 = __builtin_e2k_pfadds( x3, dz3);
sub02_0 = __builtin_e2k_pfsubs( x0, dz0);
sub02_1 = __builtin_e2k_pfsubs( x1, dz1);
sub02_2 = __builtin_e2k_pfsubs( x2, dz2);
sub02_3 = __builtin_e2k_pfsubs( x3, dz3);
add13_0 = __builtin_e2k_pfadds(cy0, ew0);
add13_1 = __builtin_e2k_pfadds(cy1, ew1);
add13_2 = __builtin_e2k_pfadds(cy2, ew2);
add13_3 = __builtin_e2k_pfadds(cy3, ew3);
sub13_0 = __builtin_e2k_pfsubs(cy0, ew0);
sub13_1 = __builtin_e2k_pfsubs(cy1, ew1);
sub13_2 = __builtin_e2k_pfsubs(cy2, ew2);
sub13_3 = __builtin_e2k_pfsubs(cy3, ew3);
//conj_sub13_0 = __builtin_e2k_pxord(sub13_0, 1LL<<63);
//conj_sub13_1 = __builtin_e2k_pxord(sub13_1, 1LL<<63);
//conj_sub13_2 = __builtin_e2k_pxord(sub13_2, 1LL<<63);
//conj_sub13_3 = __builtin_e2k_pxord(sub13_3, 1LL<<63);
//sub13i_0 = __builtin_e2k_pshufb(0, conj_sub13_0, 0x0302010007060504);
//sub13i_1 = __builtin_e2k_pshufb(0, conj_sub13_1, 0x0302010007060504);
//sub13i_2 = __builtin_e2k_pshufb(0, conj_sub13_2, 0x0302010007060504);
//sub13i_3 = __builtin_e2k_pshufb(0, conj_sub13_3, 0x0302010007060504);
swap_sub13_0 = __builtin_e2k_pshufb(0, sub13_0, 0x0302010007060504);
swap_sub13_1 = __builtin_e2k_pshufb(0, sub13_1, 0x0302010007060504);
swap_sub13_2 = __builtin_e2k_pshufb(0, sub13_2, 0x0302010007060504);
swap_sub13_3 = __builtin_e2k_pshufb(0, sub13_3, 0x0302010007060504);
sub13i_0 = __builtin_e2k_pxord(swap_sub13_0, 1LL<<31);
sub13i_1 = __builtin_e2k_pxord(swap_sub13_1, 1LL<<31);
sub13i_2 = __builtin_e2k_pxord(swap_sub13_2, 1LL<<31);
sub13i_3 = __builtin_e2k_pxord(swap_sub13_3, 1LL<<31);
out_0[i] = __builtin_e2k_pfadds(add02_0, add13_0);
out_1[i] = __builtin_e2k_pfadds(add02_1, add13_1);
out_2[i] = __builtin_e2k_pfadds(add02_2, add13_2);
out_3[i] = __builtin_e2k_pfadds(add02_3, add13_3);
out_4[i] = __builtin_e2k_pfsubs(sub02_0, sub13i_0);
out_5[i] = __builtin_e2k_pfsubs(sub02_1, sub13i_1);
out_6[i] = __builtin_e2k_pfsubs(sub02_2, sub13i_2);
out_7[i] = __builtin_e2k_pfsubs(sub02_3, sub13i_3);
out_8[i] = __builtin_e2k_pfsubs(add02_0, add13_0);
out_9[i] = __builtin_e2k_pfsubs(add02_1, add13_1);
out_10[i] = __builtin_e2k_pfsubs(add02_2, add13_2);
out_11[i] = __builtin_e2k_pfsubs(add02_3, add13_3);
out_12[i] = __builtin_e2k_pfadds(sub02_0, sub13i_0);
out_13[i] = __builtin_e2k_pfadds(sub02_1, sub13i_1);
out_14[i] = __builtin_e2k_pfadds(sub02_2, sub13i_2);
out_15[i] = __builtin_e2k_pfadds(sub02_3, sub13i_3);
}
}
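The multiplication pattern used in the listing above (two lane-wise multiplies followed by a horizontal add) can be illustrated in portable C. This is only a sketch: `pair_t`, `pair_mul`, `pair_hadd`, `cmul_conj_swap` and `mul_i` are hypothetical names introduced here, and the lane behaviour of `pfmuls`, `pfhadds`, `pshufb` and `pxord` is assumed from how the listing uses them, not taken from the instruction documentation.

```c
/* Two packed floats emulating one 64-bit register:
   lo = real part, hi = imaginary part (assumed memory layout of float complex). */
typedef struct { float lo, hi; } pair_t;

/* Lane-wise multiply (the role pfmuls plays in the listing). */
static inline pair_t pair_mul(pair_t a, pair_t b) {
    return (pair_t){ a.lo * b.lo, a.hi * b.hi };
}

/* Horizontal add: lo = a.lo + a.hi, hi = b.lo + b.hi (assumed pfhadds behaviour). */
static inline pair_t pair_hadd(pair_t a, pair_t b) {
    return (pair_t){ a.lo + a.hi, b.lo + b.hi };
}

/* Complex product c*y computed from the precomputed conj(c) = (cr, -ci)
   and swap(c) = (ci, cr). */
static inline pair_t cmul_conj_swap(pair_t conj_c, pair_t swap_c, pair_t y) {
    pair_t re_terms = pair_mul(conj_c, y);  /* (cr*yr, -ci*yi) */
    pair_t im_terms = pair_mul(swap_c, y);  /* (ci*yr,  cr*yi) */
    return pair_hadd(re_terms, im_terms);   /* (cr*yr - ci*yi, ci*yr + cr*yi) */
}

/* Multiplication by i: swap the lanes, then negate the low lane
   (the pshufb + pxord pair in the listing). */
static inline pair_t mul_i(pair_t v) {
    return (pair_t){ -v.hi, v.lo };
}
```

For c = 1 + 2i and y = 3 + 4i, `cmul_conj_swap((pair_t){1, -2}, (pair_t){2, 1}, (pair_t){3, 4})` yields (-5, 10). Storing the coefficients in both forms in advance is what removes the shuffles and sign flips from the loop body.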
Main loop in assembly
.L2581:
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=64
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=96
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=7, asz=1, abs=2, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=6, asz=1, abs=2, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=5, asz=1, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=4, asz=1, abs=4, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=3, asz=1, abs=6, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=0, d=0, incr=1, ind=2, asz=1, abs=6, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=16, incr=0, ind=0, asz=1, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=15, incr=0, ind=0, asz=1, abs=8, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=14, incr=0, ind=0, asz=1, abs=10, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=13, incr=0, ind=0, asz=1, abs=10, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=12, incr=0, ind=0, asz=1, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=11, incr=0, ind=0, asz=1, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=10, incr=0, ind=0, asz=1, abs=14, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=9, incr=0, ind=0, asz=1, abs=14, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=8, incr=0, ind=0, asz=1, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=7, incr=0, ind=0, asz=1, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=6, incr=0, ind=0, asz=1, abs=18, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=5, incr=0, ind=0, asz=1, abs=18, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=4, incr=0, ind=0, asz=1, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=3, incr=0, ind=0, asz=1, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=2, incr=0, ind=0, asz=1, abs=22, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=1, incr=0, ind=0, asz=1, abs=22, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=15, asz=1, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=14, asz=1, abs=24, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=13, asz=1, abs=26, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=12, asz=1, abs=26, disp=0
}
{
fapb ct=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=11, asz=1, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=10, asz=1, abs=28, disp=0
}
{
fapb ct=1, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=9, asz=1, abs=30, disp=0
fapb dpl=0, dcd=0, fmt=4, mrng=8, d=0, incr=0, ind=8, asz=1, abs=30, disp=0
}
.L1747:
{
loop_mode
pfadd_rsubs,1,sm %b[63], %b[62], %b[77], %b[86]
pfmul_hadds,2,sm %b[9], %b[117], %b[86], %b[68]
pfmul_hadds,3,sm %b[33], %b[68], %b[88], %b[88]
}
{
loop_mode
pfmuls,1,sm %b[113], %b[103], %b[73]
xord,2,sm %b[107], %r3, %b[90]
pfmul_hadds,3,sm %b[71], %b[73], %b[65], %b[71]
pfmul_hadds,4,sm %b[40], %b[106], %b[101], %b[65]
movad,0 area=16, ind=0, am=1, be=0, %b[1]
movad,1 area=15, ind=0, am=1, be=0, %b[12]
movad,2 area=16, ind=0, am=1, be=0, %b[8]
movad,3 area=15, ind=0, am=1, be=0, %b[9]
}
{
loop_mode
pfsub_adds,2,sm %b[63], %b[62], %b[90], %b[92]
pfadd_adds,3,sm %b[63], %b[62], %b[77], %b[95]
pfmul_hadds,4,sm %b[5], %b[97], %b[105], %b[77]
movad,0 area=14, ind=0, am=1, be=0, %b[16]
movad,1 area=13, ind=0, am=1, be=0, %b[5]
movad,2 area=14, ind=0, am=1, be=0, %b[13]
movad,3 area=13, ind=0, am=1, be=0, %b[17]
}
{
loop_mode
pfsubs,1,sm %b[93], %b[82], %b[85]
pfmul_hadds,2,sm %b[32], %b[64], %b[85], %b[64]
movad,0 area=12, ind=0, am=1, be=0, %b[25]
movad,1 area=11, ind=0, am=1, be=0, %b[24]
movad,2 area=12, ind=0, am=1, be=0, %b[20]
movad,3 area=11, ind=0, am=1, be=0, %b[21]
}
{
loop_mode
pfsubs,0,sm %b[91], %b[89], %b[96]
pfadds,4,sm %b[74], %b[87], %b[93]
pfadds,5,sm %b[93], %b[82], %b[82]
movad,0 area=10, ind=0, am=1, be=0, %b[28]
movad,1 area=9, ind=0, am=1, be=0, %b[33]
movad,2 area=10, ind=0, am=1, be=0, %b[32]
movad,3 area=9, ind=0, am=1, be=0, %b[29]
}
{
loop_mode
pfmul_hadds,0,sm %b[80], %b[103], %b[73], %b[62]
pfadds,1,sm %b[91], %b[89], %b[72]
pfmuls,2,sm %b[72], %b[108], %b[73]
pfsub_rsubs,3,sm %b[63], %b[62], %b[90], %b[63]
pfsubs,5,sm %b[74], %b[87], %b[74]
movad,0 area=8, ind=0, am=1, be=0, %b[41]
movad,1 area=7, ind=0, am=1, be=0, %b[36]
movad,2 area=8, ind=0, am=1, be=0, %b[37]
movad,3 area=7, ind=0, am=1, be=0, %b[40]
}
{
loop_mode
pfsubs,0,sm %b[75], %b[70], %b[80]
pfsubs,5,sm %b[81], %b[78], %b[87]
movad,0 area=6, ind=0, am=1, be=0, %b[49]
movad,1 area=5, ind=0, am=1, be=0, %b[48]
movad,2 area=6, ind=0, am=1, be=0, %b[44]
movad,3 area=5, ind=0, am=1, be=0, %b[45]
}
{
loop_mode
pshufb,0,sm 0x0, %b[85], %r25, %b[84]
pfmuls,1,sm %b[84], %b[102], %b[85]
pfadds,4,sm %b[81], %b[78], %b[90]
pfsubs,5,sm %b[83], %b[69], %b[89]
movad,0 area=4, ind=0, am=0, be=0, %b[52]
movad,1 area=4, ind=16, am=0, be=0, %b[78]
movad,2 area=4, ind=0, am=0, be=0, %b[53]
movad,3 area=4, ind=16, am=0, be=0, %b[81]
}
{
loop_mode
pshufb,1,sm 0x0, %b[96], %r25, %b[98]
pfadds,3,sm %b[83], %b[69], %b[99]
pfsubs,5,sm %b[88], %b[76], %b[97]
movad,0 area=4, ind=24, am=0, be=0, %b[69]
movad,1 area=4, ind=8, am=1, be=0, %b[83]
movad,2 area=4, ind=24, am=0, be=0, %b[91]
movad,3 area=4, ind=8, am=1, be=0, %b[96]
}
{
loop_mode
pfmul_hadds,0,sm %b[55], %b[108], %b[73], %b[75]
pfadds,1,sm %b[75], %b[70], %b[100]
pfadd_adds,2,sm %b[79], %b[66], %b[72], %b[74]
pshufb,3,sm 0x0, %b[74], %r25, %b[101]
pfadd_rsubs,4,sm %b[94], %b[71], %b[82], %b[76]
pfadds,5,sm %b[88], %b[76], %b[104]
movad,0 area=3, ind=0, am=0, be=0, %b[70]
movad,1 area=3, ind=16, am=0, be=0, %b[88]
movad,2 area=3, ind=0, am=0, be=0, %b[56]
movad,3 area=3, ind=16, am=0, be=0, %b[73]
}
{
loop_mode
pfadd_rsubs,0,sm %b[79], %b[66], %b[72], %b[72]
pshufb,1,sm 0x0, %b[80], %r25, %b[107]
pfadd_adds,2,sm %b[86], %b[68], %b[93], %b[103]
pshufb,4,sm 0x0, %b[87], %r25, %b[109]
pfadd_rsubs,5,sm %b[86], %b[68], %b[93], %b[93]
movad,0 area=3, ind=24, am=0, be=0, %b[105]
movad,1 area=3, ind=8, am=1, be=0, %b[106]
movad,2 area=3, ind=24, am=0, be=0, %b[80]
movad,3 area=3, ind=8, am=1, be=0, %b[87]
}
{
loop_mode
pfmul_hadds,0,sm %b[58], %b[102], %b[85], %b[84]
pfadd_adds,1,sm %b[94], %b[71], %b[82], %b[85]
xord,2,sm %b[84], %r3, %b[112]
pfadd_adds,3,sm %b[95], %b[65], %b[90], %b[102]
pshufb,4,sm 0x0, %b[89], %r25, %b[111]
pfadd_rsubs,5,sm %b[95], %b[65], %b[90], %b[90]
movad,0 area=2, ind=0, am=0, be=0, %b[82]
movad,1 area=2, ind=16, am=0, be=0, %b[108]
movad,2 area=2, ind=24, am=0, be=0, %b[110]
movad,3 area=0, ind=24, am=0, be=0, %b[89]
}
{
loop_mode
xord,0,sm %b[98], %r3, %b[116]
pfsub_adds,1,sm %b[94], %b[71], %b[112], %b[94]
pfsub_rsubs,2,sm %b[94], %b[71], %b[112], %b[97]
pfadd_rsubs,3,sm %b[92], %b[77], %b[99], %b[98]
pshufb,4,sm 0x0, %b[97], %r25, %b[115]
pfadd_adds,5,sm %b[92], %b[77], %b[99], %b[99]
movad,0 area=2, ind=24, am=0, be=0, %b[114]
movad,1 area=2, ind=8, am=1, be=0, %b[113]
movad,2 area=1, ind=16, am=0, be=0, %b[71]
movad,3 area=0, ind=8, am=0, be=0, %b[112]
}
{
loop_mode
pfadd_adds,1,sm %b[57], %b[62], %b[100], %b[104]
pfsub_adds,2,sm %b[79], %b[66], %b[116], %b[118]
xord,3,sm %b[101], %r3, %g16
pfadd_rsubs,4,sm %b[63], %b[64], %b[104], %b[117]
pfadd_adds,5,sm %b[63], %b[64], %b[104], %b[119]
movad,0 area=1, ind=0, am=0, be=0, %b[55]
movad,1 area=1, ind=16, am=0, be=0, %b[101]
movad,2 area=1, ind=24, am=0, be=0, %g18
movad,3 area=1, ind=8, am=0, be=0, %g17
}
{
loop_mode
xord,0,sm %b[107], %r3, %b[116]
pfmuls,1,sm %b[67], %b[60], %g19
pfsub_rsubs,2,sm %b[79], %b[66], %b[116], %b[66]
pfsub_rsubs,3,sm %b[86], %b[68], %g16, %b[107]
xord,4,sm %b[109], %r3, %g16
pfsub_adds,5,sm %b[86], %b[68], %g16, %b[86]
movad,0 area=1, ind=24, am=0, be=0, %b[68]
movad,1 area=1, ind=8, am=1, be=0, %b[79]
movad,2 area=2, ind=8, am=0, be=0, %b[109]
movad,3 area=0, ind=16, am=0, be=0, %b[67]
}
{
loop_mode
pfsub_adds,0,sm %b[57], %b[62], %b[116], %b[95]
pfsub_rsubs,3,sm %b[95], %b[65], %g16, %g16
pfsub_adds,4,sm %b[95], %b[65], %g16, %g21
xord,5,sm %b[111], %r3, %g20
movad,0 area=0, ind=0, am=0, be=0, %b[59]
movad,1 area=0, ind=16, am=0, be=0, %b[58]
movad,2 area=2, ind=0, am=1, be=0, %b[65]
movad,3 area=2, ind=16, am=0, be=0, %b[111]
}
{
loop_mode
pfmuls,0,sm %b[106], %b[89], %g24
pfadd_rsubs,2,sm %b[57], %b[62], %b[100], %b[115]
xord,3,sm %b[115], %r3, %g23
pfsub_rsubs,4,sm %b[92], %b[77], %g20, %g20
pfsub_adds,5,sm %b[92], %b[77], %g20, %g22
movad,0 area=0, ind=8, am=1, be=0, %b[100]
movad,1 area=0, ind=24, am=0, be=0, %b[106]
movad,2 area=1, ind=0, am=1, be=0, %b[92]
movad,3 area=0, ind=0, am=1, be=0, %b[77]
}
{
loop_mode
pfmuls,0,sm %b[113], %b[112], %b[113]
pfmuls,3,sm %b[110], %b[71], %b[63]
pfsub_rsubs,4,sm %b[63], %b[64], %g23, %b[110]
pfsub_adds,5,sm %b[63], %b[64], %g23, %b[64]
}
{
loop_mode
pfmuls,0,sm %b[105], %g18, %b[114]
pfmuls,1,sm %b[114], %g17, %g19
pfmul_hadds,2,sm %b[54], %b[60], %g19, %b[60]
pfmuls,4,sm %b[27], %b[76], %b[105]
}
{
loop_mode
pfsub_rsubs,0,sm %b[57], %b[62], %b[116], %b[62]
pfsubs,1,sm %b[84], %b[75], %b[116]
pfmuls,2,sm %b[51], %b[85], %g23
pfmuls,5,sm %b[109], %b[67], %b[109]
}
{
loop_mode
pfmuls,0,sm %b[88], %b[68], %b[88]
pfmuls,1,sm %b[108], %b[79], %b[93]
std,2 %r23, %b[6], %b[103]
pfmuls,3,sm %b[26], %b[72], %b[103]
addd,4,sm 0x8, %b[6], %b[4] ? %pcnt0
std,5 %r19, %b[6], %b[93]
}
{
loop_mode
pfmul_hadds,0,sm %b[96], %b[89], %g24, %b[87]
pfmul_hadds,1,sm %b[87], %b[112], %b[113], %b[89]
std,2 %r2, %b[6], %b[102]
pfmuls,3,sm %b[50], %b[74], %b[90]
std,5 %r20, %b[6], %b[90]
}
{
loop_mode
pfmul_hadds,0,sm %b[91], %g18, %b[114], %b[80]
pfmul_hadds,1,sm %b[80], %g17, %g19, %b[91]
std,2 %r11, %b[6], %b[98]
pfmuls,3,sm %b[14], %b[94], %b[96]
pfmuls,4,sm %b[35], %b[97], %b[98]
std,5 %r22, %b[6], %b[99]
}
{
loop_mode
pfmul_hadds,0,sm %b[42], %b[85], %g23, %b[76]
pfmuls,1,sm %b[47], %b[104], %b[99]
std,2 %r16, %b[6], %b[117]
pfmul_hadds,3,sm %b[19], %b[76], %b[105], %b[85]
pfmuls,4,sm %b[18], %b[118], %b[102]
std,5 %r24, %b[6], %b[119]
}
{
loop_mode
pfadds,0,sm %b[84], %b[75], %b[75]
pfmuls,1,sm %b[23], %b[115], %b[84]
std,2 %r14, %b[6], %b[107]
pfmul_hadds,3,sm %b[22], %b[72], %b[103], %b[72]
pfmuls,4,sm %b[43], %b[66], %b[86]
std,5 %r13, %b[6], %b[86]
}
{
loop_mode
std,2 %r21, %b[6], %g16
pshufb,3,sm 0x0, %b[116], %r25, %b[105]
pfmuls,4,sm %b[15], %b[95], %b[103]
std,5 %r18, %b[6], %g21
}
{
loop_mode
pfmul_hadds,0,sm %b[81], %b[68], %b[88], %b[68]
pfmul_hadds,1,sm %b[73], %b[79], %b[93], %b[73]
std,2 %r17, %b[6], %g20
pfmul_hadds,3,sm %b[46], %b[74], %b[90], %b[79]
pfmul_hadds,4,sm %b[34], %b[97], %b[98], %b[74]
std,5 %r12, %b[6], %g22
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
pfmul_hadds,0,sm %b[83], %b[67], %b[109], %b[64]
pfmuls,1,sm %b[39], %b[62], %b[83]
std,2 %r15, %b[6], %b[110]
pfmul_hadds,3,sm %b[10], %b[94], %b[96], %b[67]
pfmul_hadds,4,sm %b[11], %b[118], %b[102], %b[81]
std,5 %r0, %b[6], %b[64]
}
Theoretical speed: 16 complex numbers in 28 cycles, i.e. 16 × 8 bytes / 28 = 4.57 bytes/cycle
Fourfold theoretical speed: 4 × 4.57 = 18.29 bytes/cycle
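The arithmetic behind these figures can be reproduced with a small helper. This is a sketch only: `bytes_per_cycle` is a hypothetical name, and the 8 bytes per element is the size of one single-precision complex number (two `float` values).

```c
/* Bytes processed per cycle by a loop body that takes `cycles` wide
   instructions and handles `elems` elements of `elem_size` bytes each. */
static inline double bytes_per_cycle(int elems, int elem_size, int cycles) {
    return (double)elems * elem_size / cycles;
}
```

`bytes_per_cycle(16, 8, 28)` evaluates to roughly 4.57, and multiplying that by 4 gives roughly 18.29, matching the two lines above.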
Speed measurements

2. stage_radix4_readConjSwap_2x_simd128
Here the stage_radix4_readConjSwap_simd128 algorithm is manually unrolled by a factor of 2.
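Judging by the listings, the 2x variants run two butterfly passes in one loop body: the results of the first pass stay in registers and are regrouped as inputs of the second pass. A scalar sketch of that data flow, assuming the input grouping `in[4*k .. 4*k+3]` per butterfly and a flat twiddle layout (both hypothetical, as are the names `bfly4` and `stage_radix4_2x_scalar`):

```c
#include <complex.h>

/* One radix-4 butterfly; twiddles c, d, e are applied to y, z, w. */
static inline void bfly4(float complex x, float complex y,
                         float complex z, float complex w,
                         float complex c, float complex d, float complex e,
                         float complex out[4])
{
    float complex cy = c * y, dz = d * z, ew = e * w;
    float complex add02 = x + dz,  sub02 = x - dz;
    float complex add13 = cy + ew, sub13i = I * (cy - ew);
    out[0] = add02 + add13;
    out[1] = sub02 - sub13i;
    out[2] = add02 - add13;
    out[3] = sub02 + sub13i;
}

/* Two fused passes over 16 values. tw_a and tw_b hold {c, d, e} triples:
   tw_a[3*k + 0..2] for butterfly k of pass A, tw_b[3*j + 0..2] for group j
   of pass B. The intermediate array t never goes through memory in the
   real kernel; here it simply stands in for the registers. */
static inline void stage_radix4_2x_scalar(const float complex in[16],
                                          const float complex tw_a[12],
                                          const float complex tw_b[12],
                                          float complex out[16])
{
    float complex t[4][4];
    for (int k = 0; k < 4; ++k)           /* pass A: four independent butterflies */
        bfly4(in[4*k], in[4*k + 1], in[4*k + 2], in[4*k + 3],
              tw_a[3*k], tw_a[3*k + 1], tw_a[3*k + 2], t[k]);
    for (int j = 0; j < 4; ++j) {         /* pass B: output j of every butterfly */
        float complex r[4];
        bfly4(t[0][j], t[1][j], t[2][j], t[3][j],
              tw_b[3*j], tw_b[3*j + 1], tw_b[3*j + 2], r);
        for (int m = 0; m < 4; ++m)
            out[4*m + j] = r[m];          /* index 4*m + j names output stream out_{4m+j} */
    }
}
```

With all twiddles equal to 1 and all inputs equal to 1, the result is 16 in `out[0]` and 0 elsewhere, which is a quick sanity check of the regrouping.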
C code
void stage_radix4_readConjSwap_2x_simd128(int data_count, myComplex *data_in, myComplex *data_out, myComplex *conj_coefC_a, myComplex *conj_coefD_a, myComplex *conj_coefE_a, myComplex *conj_coefC_b, myComplex *conj_coefD_b, myComplex *conj_coefE_b, myComplex *swap_coefC_a, myComplex *swap_coefD_a, myComplex *swap_coefE_a, myComplex *swap_coefC_b, myComplex *swap_coefD_b, myComplex *swap_coefE_b)
{
__v2di *xy0_in = (__v2di*)&data_in[ 0];
__v2di *zw0_in = (__v2di*)&data_in[ 2];
__v2di *xy1_in = (__v2di*)&data_in[ 4];
__v2di *zw1_in = (__v2di*)&data_in[ 6];
__v2di *xy2_in = (__v2di*)&data_in[ 8];
__v2di *zw2_in = (__v2di*)&data_in[10];
__v2di *xy3_in = (__v2di*)&data_in[12];
__v2di *zw3_in = (__v2di*)&data_in[14];
__v2di *xy4_in = (__v2di*)&data_in[16];
__v2di *zw4_in = (__v2di*)&data_in[18];
__v2di *xy5_in = (__v2di*)&data_in[20];
__v2di *zw5_in = (__v2di*)&data_in[22];
__v2di *xy6_in = (__v2di*)&data_in[24];
__v2di *zw6_in = (__v2di*)&data_in[26];
__v2di *xy7_in = (__v2di*)&data_in[28];
__v2di *zw7_in = (__v2di*)&data_in[30];
__v2di *conj_c0a_in = (__v2di*)&conj_coefC_a[0];
__v2di *conj_c1a_in = (__v2di*)&conj_coefC_a[2];
__v2di *conj_c2a_in = (__v2di*)&conj_coefC_a[4];
__v2di *conj_c3a_in = (__v2di*)&conj_coefC_a[6];
__v2di *conj_d0a_in = (__v2di*)&conj_coefD_a[0];
__v2di *conj_d1a_in = (__v2di*)&conj_coefD_a[2];
__v2di *conj_d2a_in = (__v2di*)&conj_coefD_a[4];
__v2di *conj_d3a_in = (__v2di*)&conj_coefD_a[6];
__v2di *conj_e0a_in = (__v2di*)&conj_coefE_a[0];
__v2di *conj_e1a_in = (__v2di*)&conj_coefE_a[2];
__v2di *conj_e2a_in = (__v2di*)&conj_coefE_a[4];
__v2di *conj_e3a_in = (__v2di*)&conj_coefE_a[6];
__v2di *conj_c0b_in = (__v2di*)&conj_coefC_b[0*data_count/16];
__v2di *conj_c1b_in = (__v2di*)&conj_coefC_b[1*data_count/16];
__v2di *conj_c2b_in = (__v2di*)&conj_coefC_b[2*data_count/16];
__v2di *conj_c3b_in = (__v2di*)&conj_coefC_b[3*data_count/16];
__v2di *conj_d0b_in = (__v2di*)&conj_coefD_b[0*data_count/16];
__v2di *conj_d1b_in = (__v2di*)&conj_coefD_b[1*data_count/16];
__v2di *conj_d2b_in = (__v2di*)&conj_coefD_b[2*data_count/16];
__v2di *conj_d3b_in = (__v2di*)&conj_coefD_b[3*data_count/16];
__v2di *conj_e0b_in = (__v2di*)&conj_coefE_b[0*data_count/16];
__v2di *conj_e1b_in = (__v2di*)&conj_coefE_b[1*data_count/16];
__v2di *conj_e2b_in = (__v2di*)&conj_coefE_b[2*data_count/16];
__v2di *conj_e3b_in = (__v2di*)&conj_coefE_b[3*data_count/16];
__v2di *swap_c0a_in = (__v2di*)&swap_coefC_a[0];
__v2di *swap_c1a_in = (__v2di*)&swap_coefC_a[2];
__v2di *swap_c2a_in = (__v2di*)&swap_coefC_a[4];
__v2di *swap_c3a_in = (__v2di*)&swap_coefC_a[6];
__v2di *swap_d0a_in = (__v2di*)&swap_coefD_a[0];
__v2di *swap_d1a_in = (__v2di*)&swap_coefD_a[2];
__v2di *swap_d2a_in = (__v2di*)&swap_coefD_a[4];
__v2di *swap_d3a_in = (__v2di*)&swap_coefD_a[6];
__v2di *swap_e0a_in = (__v2di*)&swap_coefE_a[0];
__v2di *swap_e1a_in = (__v2di*)&swap_coefE_a[2];
__v2di *swap_e2a_in = (__v2di*)&swap_coefE_a[4];
__v2di *swap_e3a_in = (__v2di*)&swap_coefE_a[6];
__v2di *swap_c0b_in = (__v2di*)&swap_coefC_b[0*data_count/16];
__v2di *swap_c1b_in = (__v2di*)&swap_coefC_b[1*data_count/16];
__v2di *swap_c2b_in = (__v2di*)&swap_coefC_b[2*data_count/16];
__v2di *swap_c3b_in = (__v2di*)&swap_coefC_b[3*data_count/16];
__v2di *swap_d0b_in = (__v2di*)&swap_coefD_b[0*data_count/16];
__v2di *swap_d1b_in = (__v2di*)&swap_coefD_b[1*data_count/16];
__v2di *swap_d2b_in = (__v2di*)&swap_coefD_b[2*data_count/16];
__v2di *swap_d3b_in = (__v2di*)&swap_coefD_b[3*data_count/16];
__v2di *swap_e0b_in = (__v2di*)&swap_coefE_b[0*data_count/16];
__v2di *swap_e1b_in = (__v2di*)&swap_coefE_b[1*data_count/16];
__v2di *swap_e2b_in = (__v2di*)&swap_coefE_b[2*data_count/16];
__v2di *swap_e3b_in = (__v2di*)&swap_coefE_b[3*data_count/16];
__v2di *out_0 = (__v2di*)&data_out[ 0*data_count/16];
__v2di *out_1 = (__v2di*)&data_out[ 1*data_count/16];
__v2di *out_2 = (__v2di*)&data_out[ 2*data_count/16];
__v2di *out_3 = (__v2di*)&data_out[ 3*data_count/16];
__v2di *out_4 = (__v2di*)&data_out[ 4*data_count/16];
__v2di *out_5 = (__v2di*)&data_out[ 5*data_count/16];
__v2di *out_6 = (__v2di*)&data_out[ 6*data_count/16];
__v2di *out_7 = (__v2di*)&data_out[ 7*data_count/16];
__v2di *out_8 = (__v2di*)&data_out[ 8*data_count/16];
__v2di *out_9 = (__v2di*)&data_out[ 9*data_count/16];
__v2di *out_10 = (__v2di*)&data_out[10*data_count/16];
__v2di *out_11 = (__v2di*)&data_out[11*data_count/16];
__v2di *out_12 = (__v2di*)&data_out[12*data_count/16];
__v2di *out_13 = (__v2di*)&data_out[13*data_count/16];
__v2di *out_14 = (__v2di*)&data_out[14*data_count/16];
__v2di *out_15 = (__v2di*)&data_out[15*data_count/16];
#pragma ivdep
#pragma unroll(1)
#pragma prefetch
for(int64_t i = 0; i < data_count/32; ++i)
{
__v2di xy0 = xy0_in[16*i];
__v2di zw0 = zw0_in[16*i];
__v2di xy1 = xy1_in[16*i];
__v2di zw1 = zw1_in[16*i];
__v2di conj_c0 = conj_c0a_in[4*i];
__v2di conj_d0 = conj_d0a_in[4*i];
__v2di conj_e0 = conj_e0a_in[4*i];
__v2di swap_c0 = swap_c0a_in[4*i];
__v2di swap_d0 = swap_d0a_in[4*i];
__v2di swap_e0 = swap_e0a_in[4*i];
__v2di xy2 = xy2_in[16*i];
__v2di zw2 = zw2_in[16*i];
__v2di xy3 = xy3_in[16*i];
__v2di zw3 = zw3_in[16*i];
__v2di conj_c1 = conj_c1a_in[4*i];
__v2di conj_d1 = conj_d1a_in[4*i];
__v2di conj_e1 = conj_e1a_in[4*i];
__v2di swap_c1 = swap_c1a_in[4*i];
__v2di swap_d1 = swap_d1a_in[4*i];
__v2di swap_e1 = swap_e1a_in[4*i];
__v2di xy4 = xy4_in[16*i];
__v2di zw4 = zw4_in[16*i];
__v2di xy5 = xy5_in[16*i];
__v2di zw5 = zw5_in[16*i];
__v2di conj_c2 = conj_c2a_in[4*i];
__v2di conj_d2 = conj_d2a_in[4*i];
__v2di conj_e2 = conj_e2a_in[4*i];
__v2di swap_c2 = swap_c2a_in[4*i];
__v2di swap_d2 = swap_d2a_in[4*i];
__v2di swap_e2 = swap_e2a_in[4*i];
__v2di xy6 = xy6_in[16*i];
__v2di zw6 = zw6_in[16*i];
__v2di xy7 = xy7_in[16*i];
__v2di zw7 = zw7_in[16*i];
__v2di conj_c3 = conj_c3a_in[4*i];
__v2di conj_d3 = conj_d3a_in[4*i];
__v2di conj_e3 = conj_e3a_in[4*i];
__v2di swap_c3 = swap_c3a_in[4*i];
__v2di swap_d3 = swap_d3a_in[4*i];
__v2di swap_e3 = swap_e3a_in[4*i];
__v2di x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
__v2di w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
__v2di cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
__v2di cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
__v2di cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
__v2di cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
__v2di dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
__v2di dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
__v2di dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
__v2di dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
__v2di ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
__v2di ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
__v2di ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
__v2di ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
__v2di cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
__v2di cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
__v2di cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
__v2di cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
__v2di dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
__v2di dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
__v2di dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
__v2di dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
__v2di ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
__v2di ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
__v2di ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
__v2di ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
__v2di cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
__v2di cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
__v2di cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
__v2di cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
__v2di dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
__v2di dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
__v2di dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
__v2di dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
__v2di ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
__v2di ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
__v2di ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
__v2di ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);
__v2di cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
__v2di add02_0 = __builtin_e2k_qpfadds( x0, dz0);
__v2di add02_1 = __builtin_e2k_qpfadds( x1, dz1);
__v2di add02_2 = __builtin_e2k_qpfadds( x2, dz2);
__v2di add02_3 = __builtin_e2k_qpfadds( x3, dz3);
__v2di sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
__v2di sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
__v2di sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
__v2di sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
__v2di add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
__v2di add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
__v2di add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
__v2di add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
__v2di sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
__v2di sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
__v2di sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
__v2di sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
__v2di swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
__v2di sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
__v2di sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
__v2di out0 = __builtin_e2k_qpfadds(add02_0, add13_0);
__v2di out1 = __builtin_e2k_qpfadds(add02_1, add13_1);
__v2di out2 = __builtin_e2k_qpfadds(add02_2, add13_2);
__v2di out3 = __builtin_e2k_qpfadds(add02_3, add13_3);
__v2di out4 = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
__v2di out5 = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
__v2di out6 = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
__v2di out7 = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
__v2di out8 = __builtin_e2k_qpfsubs(add02_0, add13_0);
__v2di out9 = __builtin_e2k_qpfsubs(add02_1, add13_1);
__v2di out10 = __builtin_e2k_qpfsubs(add02_2, add13_2);
__v2di out11 = __builtin_e2k_qpfsubs(add02_3, add13_3);
__v2di out12 = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
__v2di out13 = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
__v2di out14 = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
__v2di out15 = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
xy0 = out0;
zw0 = out1;
xy1 = out2;
zw1 = out3;
conj_c0 = conj_c0b_in[i];
conj_d0 = conj_d0b_in[i];
conj_e0 = conj_e0b_in[i];
swap_c0 = swap_c0b_in[i];
swap_d0 = swap_d0b_in[i];
swap_e0 = swap_e0b_in[i];
xy2 = out4;
zw2 = out5;
xy3 = out6;
zw3 = out7;
conj_c1 = conj_c1b_in[i];
conj_d1 = conj_d1b_in[i];
conj_e1 = conj_e1b_in[i];
swap_c1 = swap_c1b_in[i];
swap_d1 = swap_d1b_in[i];
swap_e1 = swap_e1b_in[i];
xy4 = out8;
zw4 = out9;
xy5 = out10;
zw5 = out11;
conj_c2 = conj_c2b_in[i];
conj_d2 = conj_d2b_in[i];
conj_e2 = conj_e2b_in[i];
swap_c2 = swap_c2b_in[i];
swap_d2 = swap_d2b_in[i];
swap_e2 = swap_e2b_in[i];
xy6 = out12;
zw6 = out13;
xy7 = out14;
zw7 = out15;
conj_c3 = conj_c3b_in[i];
conj_d3 = conj_d3b_in[i];
conj_e3 = conj_e3b_in[i];
swap_c3 = swap_c3b_in[i];
swap_d3 = swap_d3b_in[i];
swap_e3 = swap_e3b_in[i];
x0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0706050403020100, 0x0706050403020100});
y0 = __builtin_e2k_qpshufb(xy1, xy0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0706050403020100, 0x0706050403020100});
w0 = __builtin_e2k_qpshufb(zw1, zw0, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0706050403020100, 0x0706050403020100});
y1 = __builtin_e2k_qpshufb(xy3, xy2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0706050403020100, 0x0706050403020100});
w1 = __builtin_e2k_qpshufb(zw3, zw2, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0706050403020100, 0x0706050403020100});
y2 = __builtin_e2k_qpshufb(xy5, xy4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0706050403020100, 0x0706050403020100});
w2 = __builtin_e2k_qpshufb(zw5, zw4, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
x3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0706050403020100, 0x0706050403020100});
y3 = __builtin_e2k_qpshufb(xy7, xy6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
z3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0706050403020100, 0x0706050403020100});
w3 = __builtin_e2k_qpshufb(zw7, zw6, (__v2di){0x0F0E0D0C0B0A0908, 0x0F0E0D0C0B0A0908});
cy0_real = __builtin_e2k_qpfmuls(conj_c0, y0);
cy1_real = __builtin_e2k_qpfmuls(conj_c1, y1);
cy2_real = __builtin_e2k_qpfmuls(conj_c2, y2);
cy3_real = __builtin_e2k_qpfmuls(conj_c3, y3);
dz0_real = __builtin_e2k_qpfmuls(conj_d0, z0);
dz1_real = __builtin_e2k_qpfmuls(conj_d1, z1);
dz2_real = __builtin_e2k_qpfmuls(conj_d2, z2);
dz3_real = __builtin_e2k_qpfmuls(conj_d3, z3);
ew0_real = __builtin_e2k_qpfmuls(conj_e0, w0);
ew1_real = __builtin_e2k_qpfmuls(conj_e1, w1);
ew2_real = __builtin_e2k_qpfmuls(conj_e2, w2);
ew3_real = __builtin_e2k_qpfmuls(conj_e3, w3);
cy0_imag = __builtin_e2k_qpfmuls(swap_c0, y0);
cy1_imag = __builtin_e2k_qpfmuls(swap_c1, y1);
cy2_imag = __builtin_e2k_qpfmuls(swap_c2, y2);
cy3_imag = __builtin_e2k_qpfmuls(swap_c3, y3);
dz0_imag = __builtin_e2k_qpfmuls(swap_d0, z0);
dz1_imag = __builtin_e2k_qpfmuls(swap_d1, z1);
dz2_imag = __builtin_e2k_qpfmuls(swap_d2, z2);
dz3_imag = __builtin_e2k_qpfmuls(swap_d3, z3);
ew0_imag = __builtin_e2k_qpfmuls(swap_e0, w0);
ew1_imag = __builtin_e2k_qpfmuls(swap_e1, w1);
ew2_imag = __builtin_e2k_qpfmuls(swap_e2, w2);
ew3_imag = __builtin_e2k_qpfmuls(swap_e3, w3);
cy0_rrii = __builtin_e2k_qpfhadds(cy0_real, cy0_imag);
cy1_rrii = __builtin_e2k_qpfhadds(cy1_real, cy1_imag);
cy2_rrii = __builtin_e2k_qpfhadds(cy2_real, cy2_imag);
cy3_rrii = __builtin_e2k_qpfhadds(cy3_real, cy3_imag);
dz0_rrii = __builtin_e2k_qpfhadds(dz0_real, dz0_imag);
dz1_rrii = __builtin_e2k_qpfhadds(dz1_real, dz1_imag);
dz2_rrii = __builtin_e2k_qpfhadds(dz2_real, dz2_imag);
dz3_rrii = __builtin_e2k_qpfhadds(dz3_real, dz3_imag);
ew0_rrii = __builtin_e2k_qpfhadds(ew0_real, ew0_imag);
ew1_rrii = __builtin_e2k_qpfhadds(ew1_real, ew1_imag);
ew2_rrii = __builtin_e2k_qpfhadds(ew2_real, ew2_imag);
ew3_rrii = __builtin_e2k_qpfhadds(ew3_real, ew3_imag);
cy0 = __builtin_e2k_qpshufb(cy0_rrii, cy0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy1 = __builtin_e2k_qpshufb(cy1_rrii, cy1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy2 = __builtin_e2k_qpshufb(cy2_rrii, cy2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
cy3 = __builtin_e2k_qpshufb(cy3_rrii, cy3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz0 = __builtin_e2k_qpshufb(dz0_rrii, dz0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz1 = __builtin_e2k_qpshufb(dz1_rrii, dz1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz2 = __builtin_e2k_qpshufb(dz2_rrii, dz2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
dz3 = __builtin_e2k_qpshufb(dz3_rrii, dz3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew0 = __builtin_e2k_qpshufb(ew0_rrii, ew0_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew1 = __builtin_e2k_qpshufb(ew1_rrii, ew1_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew2 = __builtin_e2k_qpshufb(ew2_rrii, ew2_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
ew3 = __builtin_e2k_qpshufb(ew3_rrii, ew3_rrii, (__v2di){0x0B0A090803020100, 0x0F0E0D0C07060504});
add02_0 = __builtin_e2k_qpfadds( x0, dz0);
add02_1 = __builtin_e2k_qpfadds( x1, dz1);
add02_2 = __builtin_e2k_qpfadds( x2, dz2);
add02_3 = __builtin_e2k_qpfadds( x3, dz3);
sub02_0 = __builtin_e2k_qpfsubs( x0, dz0);
sub02_1 = __builtin_e2k_qpfsubs( x1, dz1);
sub02_2 = __builtin_e2k_qpfsubs( x2, dz2);
sub02_3 = __builtin_e2k_qpfsubs( x3, dz3);
add13_0 = __builtin_e2k_qpfadds(cy0, ew0);
add13_1 = __builtin_e2k_qpfadds(cy1, ew1);
add13_2 = __builtin_e2k_qpfadds(cy2, ew2);
add13_3 = __builtin_e2k_qpfadds(cy3, ew3);
sub13_0 = __builtin_e2k_qpfsubs(cy0, ew0);
sub13_1 = __builtin_e2k_qpfsubs(cy1, ew1);
sub13_2 = __builtin_e2k_qpfsubs(cy2, ew2);
sub13_3 = __builtin_e2k_qpfsubs(cy3, ew3);
swap_sub13_0 = __builtin_e2k_qpshufb(sub13_0, sub13_0, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_1 = __builtin_e2k_qpshufb(sub13_1, sub13_1, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_2 = __builtin_e2k_qpshufb(sub13_2, sub13_2, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
swap_sub13_3 = __builtin_e2k_qpshufb(sub13_3, sub13_3, (__v2di){0x0302010007060504, 0x0B0A09080F0E0D0C});
sub13i_0 = __builtin_e2k_qpxor(swap_sub13_0, (__v2di){1LL<<31, 1LL<<31});
sub13i_1 = __builtin_e2k_qpxor(swap_sub13_1, (__v2di){1LL<<31, 1LL<<31});
sub13i_2 = __builtin_e2k_qpxor(swap_sub13_2, (__v2di){1LL<<31, 1LL<<31});
sub13i_3 = __builtin_e2k_qpxor(swap_sub13_3, (__v2di){1LL<<31, 1LL<<31});
out_0[i] = __builtin_e2k_qpfadds(add02_0, add13_0);
out_1[i] = __builtin_e2k_qpfadds(add02_1, add13_1);
out_2[i] = __builtin_e2k_qpfadds(add02_2, add13_2);
out_3[i] = __builtin_e2k_qpfadds(add02_3, add13_3);
out_4[i] = __builtin_e2k_qpfsubs(sub02_0, sub13i_0);
out_5[i] = __builtin_e2k_qpfsubs(sub02_1, sub13i_1);
out_6[i] = __builtin_e2k_qpfsubs(sub02_2, sub13i_2);
out_7[i] = __builtin_e2k_qpfsubs(sub02_3, sub13i_3);
out_8[i] = __builtin_e2k_qpfsubs(add02_0, add13_0);
out_9[i] = __builtin_e2k_qpfsubs(add02_1, add13_1);
out_10[i] = __builtin_e2k_qpfsubs(add02_2, add13_2);
out_11[i] = __builtin_e2k_qpfsubs(add02_3, add13_3);
out_12[i] = __builtin_e2k_qpfadds(sub02_0, sub13i_0);
out_13[i] = __builtin_e2k_qpfadds(sub02_1, sub13i_1);
out_14[i] = __builtin_e2k_qpfadds(sub02_2, sub13i_2);
out_15[i] = __builtin_e2k_qpfadds(sub02_3, sub13i_3);
}
}
Main loop in assembly
.L6737:
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=0, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=64
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=1, disp=96
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=2, disp=128
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=2, disp=160
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=3, disp=192
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=2, ind=1, asz=0, abs=3, disp=224
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=7, asz=0, abs=4, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=7, asz=0, abs=4, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=6, asz=0, abs=5, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=6, asz=0, abs=5, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=5, asz=0, abs=6, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=5, asz=0, abs=6, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=4, asz=0, abs=7, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=4, asz=0, abs=7, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=0, abs=8, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=3, asz=0, abs=8, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=0, abs=9, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=0, d=0, incr=1, ind=2, asz=0, abs=9, disp=32
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=16, incr=0, ind=0, asz=0, abs=10, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=15, incr=0, ind=0, asz=0, abs=10, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=14, incr=0, ind=0, asz=0, abs=11, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=13, incr=0, ind=0, asz=0, abs=11, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=12, incr=0, ind=0, asz=1, abs=12, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=11, incr=0, ind=0, asz=1, abs=12, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=10, incr=0, ind=0, asz=1, abs=14, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=9, incr=0, ind=0, asz=1, abs=14, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=8, incr=0, ind=0, asz=1, abs=16, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=7, incr=0, ind=0, asz=1, abs=16, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=6, incr=0, ind=0, asz=1, abs=18, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=5, incr=0, ind=0, asz=1, abs=18, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=4, incr=0, ind=0, asz=1, abs=20, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=3, incr=0, ind=0, asz=1, abs=20, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=2, incr=0, ind=0, asz=1, abs=22, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=1, incr=0, ind=0, asz=1, abs=22, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=15, asz=1, abs=24, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=14, asz=1, abs=24, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=13, asz=1, abs=26, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=12, asz=1, abs=26, disp=0
}
{
fapb ct=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=11, asz=1, abs=28, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=10, asz=1, abs=28, disp=0
}
{
fapb ct=1, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=9, asz=1, abs=30, disp=0
fapb dpl=0, dcd=0, fmt=5, mrng=16, d=0, incr=0, ind=8, asz=1, abs=30, disp=0
}
.L2988:
{
loop_mode
qpfmul_hadds,0,sm %b[92], %b[83], %b[98], %b[83]
qpshufb,1,sm %b[69], %b[69], %r4, %b[69]
qpshufb,3,sm %b[63], %b[66], %r0, %b[92]
qpfmuls,4,sm %b[48], %b[90], %b[98]
qpfmuls,5,sm %b[21], %b[86], %b[99]
}
{
loop_mode
qpshufb,1,sm %b[62], %b[62], %r4, %b[74]
qpshufb,3,sm %b[72], %b[74], %r3, %b[62]
qpfmuls,5,sm %b[29], %b[92], %b[72]
}
{
loop_mode
qpshufb,1,sm %b[97], %b[97], %r4, %b[97]
qpfmul_hadds,2,sm %b[32], %b[87], %b[102], %b[87]
qpshufb,3,sm %b[115], %b[118], %r3, %b[100]
qpfmuls,5,sm %b[45], %b[62], %b[103]
}
{
loop_mode
qpfmul_hadds,2,sm %b[44], %b[101], %b[93], %b[71]
qpshufb,3,sm %b[71], %b[76], %r3, %b[76]
qpfsubs,4,sm %b[79], %b[77], %b[93]
qpfmuls,5,sm %b[28], %b[100], %b[101]
}
{
loop_mode
qpfmul_hadds,0,sm %b[49], %b[95], %b[84], %b[84]
qpshufb,1,sm %b[96], %b[96], %r4, %b[73]
qpfmul_hadds,2,sm %b[25], %b[91], %b[94], %b[89]
qpshufb,3,sm %b[73], %b[89], %r3, %b[91]
qpfsubs,4,sm %b[69], %b[70], %b[94]
qpfmuls,5,sm %b[13], %b[76], %b[95]
}
{
loop_mode
qpshufb,1,sm %b[108], %b[109], %r3, %b[80]
qpfmul_hadds,2,sm %b[41], %b[85], %b[88], %b[85]
qpshufb,3,sm %b[80], %b[80], %r4, %b[102]
qpfsubs,4,sm %b[74], %b[81], %b[96]
qpfmuls,5,sm %b[24], %b[91], %b[88]
}
{
loop_mode
qpfmul_hadds,0,sm %b[36], %b[90], %b[98], %b[81]
qpshufb,1,sm %b[82], %b[82], %r4, %b[64]
qpfmul_hadds,2,sm %b[17], %b[86], %b[99], %b[82]
qpshufb,3,sm %b[64], %b[67], %r3, %b[67]
qpfadds,4,sm %b[74], %b[81], %b[74]
qpfsubs,5,sm %b[102], %b[97], %b[86]
movaqp,0 area=21, ind=0, am=1, be=0, %b[1]
movaqp,1 area=20, ind=0, am=1, be=0, %b[17]
movaqp,2 area=21, ind=0, am=1, be=0, %b[13]
movaqp,3 area=20, ind=0, am=1, be=0, %b[8]
}
{
loop_mode
qpshufb,1,sm %b[75], %b[75], %r4, %b[72]
qpfmul_hadds,2,sm %b[12], %b[92], %b[72], %b[77]
qpshufb,3,sm %b[78], %b[105], %r3, %b[75]
qpfadds,5,sm %b[79], %b[77], %b[78]
movaqp,0 area=19, ind=0, am=1, be=0, %b[25]
movaqp,1 area=18, ind=0, am=1, be=0, %b[12]
movaqp,2 area=19, ind=0, am=1, be=0, %b[24]
movaqp,3 area=18, ind=0, am=1, be=0, %b[21]
}
{
loop_mode
qpshufb,1,sm %b[83], %b[83], %r4, %b[62]
qpfmul_hadds,2,sm %b[9], %b[62], %b[103], %b[69]
qpshufb,3,sm %b[65], %b[68], %r3, %b[65]
qpfadds,5,sm %b[69], %b[70], %b[68]
movaqp,0 area=17, ind=0, am=1, be=0, %b[29]
movaqp,1 area=16, ind=0, am=1, be=0, %b[33]
movaqp,2 area=17, ind=0, am=1, be=0, %b[28]
movaqp,3 area=16, ind=0, am=1, be=0, %b[9]
}
{
loop_mode
qpshufb,1,sm %b[93], %b[93], %r5, %b[83]
qpfmul_hadds,2,sm %b[5], %b[100], %b[101], %b[70]
qpshufb,3,sm %b[96], %b[96], %r5, %b[90]
qpfadds,5,sm %b[102], %b[97], %b[79]
movaqp,0 area=15, ind=0, am=1, be=0, %b[5]
movaqp,1 area=14, ind=0, am=1, be=0, %b[36]
movaqp,2 area=15, ind=0, am=1, be=0, %b[37]
movaqp,3 area=14, ind=0, am=1, be=0, %b[32]
}
{
loop_mode
qpshufb,1,sm %b[94], %b[94], %r5, %b[92]
qpfmul_hadds,2,sm %b[16], %b[76], %b[95], %b[76]
qpshufb,3,sm %b[86], %b[86], %r5, %b[86]
movaqp,0 area=13, ind=0, am=1, be=0, %b[44]
movaqp,1 area=12, ind=0, am=1, be=0, %b[16]
movaqp,2 area=13, ind=0, am=1, be=0, %b[41]
movaqp,3 area=12, ind=0, am=1, be=0, %b[40]
}
{
loop_mode
qpxor,1,sm %b[92], %r1, %b[88]
qpfmul_hadds,2,sm %b[20], %b[91], %b[88], %b[91]
qpxor,3,sm %b[86], %r1, %b[86]
movaqp,0 area=11, ind=0, am=1, be=0, %b[48]
movaqp,1 area=10, ind=0, am=1, be=0, %b[49]
movaqp,2 area=11, ind=0, am=1, be=0, %b[45]
movaqp,3 area=10, ind=0, am=1, be=0, %b[20]
}
{
loop_mode
qpfadd_adds,0,sm %b[65], %b[62], %b[74], %b[87]
qpshufb,1,sm %b[89], %b[89], %r4, %b[95]
qpshufb,3,sm %b[87], %b[87], %r4, %b[96]
movaqp,0 area=9, ind=0, am=0, be=0, %b[93]
movaqp,1 area=9, ind=16, am=1, be=0, %b[94]
movaqp,2 area=9, ind=0, am=0, be=0, %b[89]
movaqp,3 area=9, ind=16, am=1, be=0, %b[92]
}
{
loop_mode
qpfadd_adds,0,sm %b[75], %b[72], %b[78], %b[71]
qpxor,1,sm %b[90], %r1, %b[100]
qpshufb,3,sm %b[71], %b[71], %r4, %b[101]
movaqp,0 area=8, ind=16, am=1, be=0, %b[90]
movaqp,1 area=8, ind=0, am=0, be=0, %b[98]
movaqp,2 area=8, ind=16, am=1, be=0, %b[97]
movaqp,3 area=8, ind=0, am=0, be=0, %b[99]
}
{
loop_mode
qpfadd_rsubs,0,sm %b[67], %b[64], %b[68], %b[52]
qpshufb,1,sm %b[82], %b[82], %r4, %b[103]
qpshufb,3,sm %b[84], %b[84], %r4, %b[106]
movaqp,0 area=7, ind=16, am=1, be=0, %b[102]
movaqp,1 area=7, ind=0, am=0, be=0, %b[104]
movaqp,2 area=7, ind=0, am=0, be=0, %b[82]
movaqp,3 area=7, ind=16, am=1, be=0, %b[84]
}
{
loop_mode
qpfadd_rsubs,0,sm %b[80], %b[73], %b[79], %b[53]
qpxor,1,sm %b[83], %r1, %b[108]
qpshufb,3,sm %b[85], %b[85], %r4, %b[110]
qpfadds,4,sm %b[106], %b[101], %b[106]
qpfsubs,5,sm %b[106], %b[101], %b[107]
movaqp,0 area=6, ind=0, am=0, be=0, %b[101]
movaqp,1 area=6, ind=16, am=1, be=0, %b[105]
movaqp,2 area=6, ind=0, am=0, be=0, %b[83]
movaqp,3 area=6, ind=16, am=1, be=0, %b[85]
}
{
loop_mode
qpfadd_rsubs,0,sm %b[65], %b[62], %b[74], %b[74]
qpshufb,1,sm %b[69], %b[69], %r4, %b[109]
qpfadd_rsubs,2,sm %b[75], %b[72], %b[78], %b[69]
qpshufb,3,sm %b[81], %b[81], %r4, %b[113]
qpfadds,4,sm %b[96], %b[95], %b[111]
qpfsubs,5,sm %b[96], %b[95], %b[112]
movaqp,0 area=5, ind=16, am=0, be=0, %b[78]
movaqp,1 area=1, ind=16, am=0, be=0, %b[95]
movaqp,2 area=5, ind=16, am=0, be=0, %b[96]
movaqp,3 area=1, ind=16, am=0, be=0, %b[81]
}
{
loop_mode
qpfadd_adds,0,sm %b[67], %b[64], %b[68], %b[57]
qpshufb,1,sm %b[60], %b[61], %r3, %b[79]
qpfadd_adds,2,sm %b[80], %b[73], %b[79], %b[56]
qpshufb,3,sm %b[77], %b[77], %r4, %b[113]
qpfadds,4,sm %b[113], %b[110], %b[110]
qpfsubs,5,sm %b[113], %b[110], %b[114]
movaqp,0 area=5, ind=0, am=1, be=0, %b[60]
movaqp,1 area=0, ind=16, am=0, be=0, %b[68]
movaqp,2 area=5, ind=0, am=1, be=0, %b[77]
movaqp,3 area=0, ind=16, am=0, be=0, %b[61]
}
{
loop_mode
qpfsub_adds,0,sm %b[65], %b[62], %b[100], %b[116]
qpshufb,1,sm %b[76], %b[76], %r4, %b[119]
qpfsub_adds,2,sm %b[75], %b[72], %b[108], %b[113]
qpshufb,3,sm %b[55], %b[54], %r3, %b[118]
qpfadds,4,sm %b[113], %b[103], %g17
qpfsubs,5,sm %b[113], %b[103], %g16
movaqp,0 area=3, ind=16, am=1, be=0, %b[117]
movaqp,1 area=3, ind=0, am=0, be=0, %b[103]
movaqp,2 area=3, ind=16, am=1, be=0, %b[115]
movaqp,3 area=3, ind=0, am=0, be=0, %b[76]
}
{
loop_mode
qpfsub_rsubs,0,sm %b[65], %b[62], %b[100], %b[72]
qpshufb,1,sm %b[91], %b[91], %r4, %b[108]
qpfsub_rsubs,2,sm %b[75], %b[72], %b[108], %b[70]
qpshufb,3,sm %b[70], %b[70], %r4, %b[75]
movaqp,0 area=4, ind=16, am=0, be=0, %b[91]
movaqp,1 area=0, ind=0, am=1, be=0, %b[65]
movaqp,2 area=4, ind=16, am=0, be=0, %b[100]
movaqp,3 area=0, ind=0, am=1, be=0, %b[62]
}
{
loop_mode
qpfsub_rsubs,0,sm %b[67], %b[64], %b[88], %b[59]
qpshufb,1,sm %b[58], %b[59], %r3, %g18
qpfsub_rsubs,2,sm %b[80], %b[73], %b[86], %b[58]
qpshufb,3,sm %b[63], %b[66], %r3, %g19
movaqp,0 area=4, ind=0, am=1, be=0, %g20
movaqp,1 area=1, ind=0, am=1, be=0, %b[66]
movaqp,2 area=4, ind=0, am=1, be=0, %g21
movaqp,3 area=1, ind=0, am=1, be=0, %b[63]
}
{
loop_mode
qpfadd_rsubs,0,sm %g18, %b[108], %b[106], %b[112]
qpshufb,1,sm %b[107], %b[107], %r5, %g22
qpfadd_adds,2,sm %g18, %b[108], %b[106], %g24
qpshufb,3,sm %b[112], %b[112], %r5, %g23
qpshufb,4,sm %b[81], %b[95], %r0, %g25
movaqp,0 area=2, ind=0, am=0, be=0, %b[107]
movaqp,1 area=2, ind=16, am=1, be=0, %g26
movaqp,2 area=2, ind=0, am=0, be=0, %b[106]
movaqp,3 area=2, ind=16, am=1, be=0, %g27
}
{
loop_mode
qpfadd_rsubs,0,sm %b[118], %b[119], %b[111], %b[105]
qpxor,1,sm %g22, %r1, %g22
qpfadd_adds,2,sm %b[118], %b[119], %b[111], %b[111]
qpxor,3,sm %g23, %r1, %g23
qpshufb,4,sm %b[61], %b[68], %r0, %g29
qpfmuls,5,sm %b[105], %g25, %g28
}
{
loop_mode
qpfsub_rsubs,0,sm %g18, %b[108], %g22, %b[108]
qpshufb,1,sm %b[114], %b[114], %r5, %g30
qpfsub_adds,2,sm %g18, %b[108], %g22, %b[101]
qpshufb,3,sm %b[115], %b[117], %r0, %b[114]
qpshufb,4,sm %b[76], %b[103], %r0, %g22
qpfmuls,5,sm %b[101], %g29, %g18
}
{
loop_mode
qpfadd_adds,0,sm %b[79], %b[109], %b[110], %b[100]
qpshufb,1,sm %g16, %g16, %r5, %g16
qpfadd_adds,2,sm %g19, %b[75], %g17, %b[85]
qpshufb,3,sm %b[62], %b[65], %r0, %g31
qpfmuls,4,sm %b[85], %b[114], %r7
qpfmuls,5,sm %b[100], %g22, %r6
}
{
loop_mode
qpfadd_rsubs,0,sm %b[79], %b[109], %b[110], %g17
qpxor,1,sm %g16, %r1, %g16
qpfadd_rsubs,2,sm %g19, %b[75], %g17, %b[110]
qpshufb,3,sm %b[63], %b[66], %r0, %r9
qpfmuls,5,sm %g20, %g31, %g20
}
{
loop_mode
qpfsub_rsubs,0,sm %b[118], %b[119], %g23, %b[118]
qpxor,1,sm %g30, %r1, %g30
qpfsub_adds,2,sm %b[118], %b[119], %g23, %b[91]
qpshufb,3,sm %g27, %g26, %r0, %r26
qpfmuls,5,sm %b[91], %r9, %g23
}
{
loop_mode
qpfsub_adds,0,sm %g19, %b[75], %g16, %b[83]
qpfsub_rsubs,2,sm %b[79], %b[109], %g30, %r27
qpshufb,3,sm %b[106], %b[107], %r0, %r29
qpshufb,4,sm %g27, %g26, %r3, %g26
qpfmuls,5,sm %b[83], %r26, %r28
}
{
loop_mode
qpfsub_rsubs,0,sm %g19, %b[75], %g16, %b[77]
qpfmul_hadds,1,sm %b[94], %g25, %g28, %b[79]
qpfsub_adds,2,sm %b[79], %b[109], %g30, %b[75]
qpfmuls,4,sm %b[77], %g26, %b[94]
qpfmuls,5,sm %g21, %r29, %b[109]
}
{
loop_mode
qpfsub_adds,0,sm %b[67], %b[64], %b[88], %b[64]
qpfmul_hadds,1,sm %b[93], %g29, %g18, %b[68]
qpfsub_adds,2,sm %b[80], %b[73], %b[86], %b[61]
qpshufb,3,sm %b[61], %b[68], %r3, %b[80]
qpshufb,4,sm %b[115], %b[117], %r3, %b[73]
qpfmul_hadds,5,sm %b[104], %g31, %g20, %b[67]
}
{
loop_mode
qpfmul_hadds,0,sm %b[84], %g22, %r6, %b[86]
qpfmul_hadds,2,sm %b[92], %b[114], %r7, %b[84]
qpfmuls,3,sm %b[96], %b[73], %b[88]
qpfmuls,4,sm %b[60], %b[80], %b[92]
qpfmul_hadds,5,sm %b[102], %r9, %g23, %b[60]
}
{
loop_mode
addd,1,sm 0x10, %b[6], %b[4] ? %pcnt0
stqp,2 %r21, %b[6], %b[112]
stqp,5 %r2, %b[6], %g24
}
{
loop_mode
qpshufb,1,sm %b[81], %b[95], %r3, %b[81]
stqp,2 %r20, %b[6], %b[105]
stqp,5 %r24, %b[6], %b[111]
}
{
loop_mode
qpfmul_hadds,0,sm %b[89], %r26, %r28, %b[95]
qpshufb,1,sm %b[69], %b[74], %r0, %b[89]
stqp,2 %r22, %b[6], %b[108]
qpshufb,3,sm %b[56], %b[57], %r0, %b[93]
stqp,5 %r19, %b[6], %b[101]
}
{
loop_mode
qpfmul_hadds,0,sm %b[82], %r29, %b[109], %b[78]
qpfmuls,1,sm %b[78], %b[81], %b[96]
stqp,2 %r25, %b[6], %b[100]
qpshufb,3,sm %b[53], %b[52], %r0, %b[85]
qpfmuls,4,sm %b[51], %b[93], %b[82]
stqp,5 %r23, %b[6], %b[85]
}
{
loop_mode
qpfmul_hadds,0,sm %b[99], %g26, %b[94], %b[94]
stqp,2 %r17, %b[6], %g17
qpshufb,3,sm %b[71], %b[87], %r0, %b[99]
qpfmuls,4,sm %b[35], %b[85], %b[100]
stqp,5 %r12, %b[6], %b[110]
}
{
loop_mode
qpfmul_hadds,0,sm %b[98], %b[80], %b[92], %b[80]
qpfmul_hadds,1,sm %b[97], %b[73], %b[88], %b[73]
stqp,2 %r15, %b[6], %b[118]
qpshufb,3,sm %b[58], %b[59], %r0, %b[88]
qpfmuls,4,sm %b[50], %b[99], %b[91]
stqp,5 %r14, %b[6], %b[91]
}
{
loop_mode
qpshufb,0,sm %b[79], %b[79], %r4, %b[79]
qpshufb,1,sm %b[68], %b[68], %r4, %b[68]
stqp,2 %r13, %b[6], %b[83]
qpshufb,3,sm %b[70], %b[72], %r0, %b[83]
qpfmuls,4,sm %b[31], %b[89], %b[92]
stqp,5 %r16, %b[6], %r27
}
{
loop_mode
alc alcf=1, alct=1
abn abnf=1, abnt=1
ct %ctpr1 ? %NOT_LOOP_END
qpshufb,0,sm %b[84], %b[84], %r4, %b[75]
qpshufb,1,sm %b[86], %b[86], %r4, %b[77]
stqp,2 %r18, %b[6], %b[77]
qpshufb,3,sm %b[113], %b[116], %r0, %b[84]
qpfmuls,4,sm %b[38], %b[83], %b[86]
stqp,5 %r11, %b[6], %b[75]
}
Theoretical speed: 32 complex numbers in 39 cycles (32/39) = 6.56 bytes/cycle
Quadruple theoretical speed: 26.26 bytes/cycle
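The arithmetic behind these two figures can be checked with a short sketch. It assumes 8 bytes per float complex element and takes the factor of four directly from the quadruple figure above; the helper name bytes_per_cycle is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the article): theoretical throughput of a
   loop body that processes `elements` values of `elem_size` bytes each
   in `cycles` processor cycles. */
static double bytes_per_cycle(size_t elements, size_t elem_size, unsigned cycles)
{
    assert(cycles > 0);  /* a loop body takes at least one cycle */
    return (double)(elements * elem_size) / (double)cycles;
}
```

With 32 elements of 8 bytes and 39 cycles, bytes_per_cycle(32, 8, 39) evaluates 256/39, which is about 6.56 bytes per cycle; multiplying by four gives about 26.26.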
Speed measurements

Summary for stage_radix4_readConjSwap_2x


Speeds dropped compared with the original stage_radix4_readConjSwap versions.
The FFT plot can be found here.
Assembling the FFT
fft_radix2
We build fft_radix2 from reverse_radix2_x32 and all the stage_radix2 variants.


fft_radix2_2x
We build fft_radix2_2x from reverse_radix2_x32 and all the stage_radix2_2x variants.


fft_radix2_readConjSwap
We build fft_radix2_readConjSwap from reverse_radix2_x32 and all the stage_radix2_readConjSwap variants.


fft_radix2_readConjSwap_2x
We build fft_radix2_readConjSwap_2x from reverse_radix2_x32 and all the stage_radix2_readConjSwap_2x variants.


fft_radix4
We build fft_radix4 from reverse_radix4_x16 and all the stage_radix4 variants.


fft_radix4_2x
We build fft_radix4_2x from reverse_radix4_x16 and all the stage_radix4_2x variants.


fft_radix4_readConjSwap
We build fft_radix4_readConjSwap from reverse_radix4_x16 and all the stage_radix4_readConjSwap variants.


fft_radix4_readConjSwap_2x
We build fft_radix4_readConjSwap_2x from reverse_radix4_x16 and all the stage_radix4_readConjSwap_2x variants.


Further optimizations of the existing code
The coefficients currently live in several separate arrays (conj/swap, coefC/coefD/coefE).
They could be placed in a single array, interleaved with one another.
This would make reading them from memory more efficient.
It should not make things any worse.
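One possible layout for such an interleaved array is sketched below. It is only an illustration under assumptions: the struct name coef_set, the function interleave_coefs and the field order are hypothetical, and plain float complex elements stand in for the packed 128-bit vectors used in the real code.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical record: all coefficients needed for index i, stored side by side. */
typedef struct {
    float complex conj_c, swap_c;
    float complex conj_d, swap_d;
    float complex conj_e, swap_e;
} coef_set;

/* Copy six separate coefficient arrays into one interleaved array, so that
   a single sequential read delivers every coefficient for index i. */
static void interleave_coefs(coef_set *dst,
                             const float complex *conj_c, const float complex *swap_c,
                             const float complex *conj_d, const float complex *swap_d,
                             const float complex *conj_e, const float complex *swap_e,
                             size_t n)
{
    assert(dst != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i].conj_c = conj_c[i];
        dst[i].swap_c = swap_c[i];
        dst[i].conj_d = conj_d[i];
        dst[i].swap_d = swap_d[i];
        dst[i].conj_e = conj_e[i];
        dst[i].swap_e = swap_e[i];
    }
}
```

The interleaving would be done once, when the coefficient tables are prepared, so its cost stays out of the Stage loops. Whether the single stream is actually faster on Elbrus would still have to be measured.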
Further directions for new code
What to do next:
- A 3x variant for radix2.
  Currently 2x speeds up the computation in radix2, so a 3x variant (and beyond) can be tried.
  Currently 2x slows down the computation in radix4, so a 3x variant is pointless there.
  This is unlikely to yield anything interesting, but it can be done for completeness.
- Stage variants in which the coefficients are computed on the fly instead of being read from memory.
  This should speed up the computation when the data does not fit in the cache.
  In radix4 there is one more case to consider: read only coefC from memory and compute coefD/coefE from coefC.
- Stage variants that take a single constant coefficient as input (or a set of coefficients c/d/e for radix4).
  Note that the first Stage uses the same coefficient for the whole input array. The second Stage uses 2 different coefficients: one for the first half of the array, the other for the second half. The third Stage uses 4 different coefficients, and so on.
  The processing of the input array can be split into parts, each handled with a single coefficient. Get a coefficient (from memory or by computing it on the fly), process a part of the array; get the next coefficient, process the next part.
  This approach will work for the initial Stages, where such parts are long enough. Later Stages have to use the usual ways of obtaining coefficients (reading from memory or computing on the fly at every step).
  The boundary between initial and later Stages will have to be found experimentally.
  This is a separate task.
- Merge Reverse with the first Stage.
  Several Stages can already be merged into one loop (2x). Reverse and the first Stage can be merged in the same way (in the general case, Reverse and several first Stages).
- Input data of type double complex.
  Currently only float complex is implemented.
  A double complex version needs to be written.
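The idea of one constant coefficient per part of the array (the third item above) can be sketched as a pair of nested loops. This is a simplified scalar illustration, not the article's vectorized Stage: the names part_count and scale_by_part_coef are hypothetical, only the multiplication by the coefficient is shown (no butterfly additions), and it assumes that Stage number stage, counting from 0, uses 1 << stage coefficients, each covering an equal contiguous part, as the text describes.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Number of distinct coefficients in a Stage: 1, 2, 4, ... (stage counts from 0). */
static size_t part_count(unsigned stage)
{
    return (size_t)1 << stage;
}

/* Multiply each contiguous part of data[] by its own constant coefficient.
   The coefficient is fetched once per part, outside the inner loop. */
static void scale_by_part_coef(float complex *data, size_t n,
                               const float complex *coef, unsigned stage)
{
    size_t parts = part_count(stage);
    assert(parts <= n && n % parts == 0);
    size_t part_len = n / parts;

    for (size_t p = 0; p < parts; ++p) {
        const float complex c = coef[p];          /* read or computed once per part */
        float complex *part = data + p * part_len;
        for (size_t i = 0; i < part_len; ++i)
            part[i] *= c;                         /* inner loop reads no coefficients */
    }
}
```

As stage grows, part_len shrinks and the per-part setup is amortized over fewer elements, which is why the boundary between the two approaches would need to be found by measurement.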
Conclusion
The stage_radix4_2x example shows how much work the compiler does to pack instructions into cycles efficiently. Doing this by hand would take a lot of time.
I write the C code so that it is easy to read, and the compiler shuffles the operations so that they execute more efficiently. In assembly I would have to choose between clear code and fast code.
Assembly is pleasant for small tasks, where you fully understand what is happening and where. But increase the complexity a little, and you start to tire quickly.
I am not sure I could have solved this task in pure assembly.
Author: LeonidLagunov
