added cmake options: USE_FFTW and USE_SIMD

USE_FFTW: option to use fftw3 in benchmark, default: OFF
USE_SIMD: use SIMD CPU features in library, default: ON
  controls definition of PFFFT_SIMD_DISABLE

Signed-off-by: hayati ayguen <[email protected]>
1 file changed
tree: c0a48591e6a31471be41e26bc30141d013bca709
  1. .gitignore
  2. CMakeLists.txt
  3. fftpack.c
  4. fftpack.h
  5. pffft.c
  6. pffft.h
  7. README.md
  8. test_pffft.c
README.md

PFFFT: a pretty fast FFT.

TL;DR

PFFFT does 1D Fast Fourier Transforms, of single precision real and complex vectors. It tries do it fast, it tries to be correct, and it tries to be small. Computations do take advantage of SSE1 instructions on x86 cpus, Altivec on powerpc cpus, and NEON on ARM cpus. The license is BSD-like.

Why does it exist:

I was in search of a good performing FFT library , preferably very small and with a very liberal license.

When one says “fft library”, FFTW (“Fastest Fourier Transform in the West”) is probably the first name that comes to mind -- I guess that 99% of open-source projects that need a FFT do use FFTW, and are happy with it. However, it is quite a large library , which does everything fft related (2d transforms, 3d transforms, other transformations such as discrete cosine , or fast hartley). And it is licensed under the GNU GPL , which means that it cannot be used in non open-source products.

An alternative to FFTW that is really small, is the venerable FFTPACK v4, which is available on NETLIB. A more recent version (v5) exists, but it is larger as it deals with multi-dimensional transforms. This is a library that is written in FORTRAN 77, a language that is now considered as a bit antiquated by many. FFTPACKv4 was written in 1985, by Dr Paul Swarztrauber of NCAR, more than 25 years ago ! And despite its age, benchmarks show it that it still a very good performing FFT library, see for example the 1d single precision benchmarks here. It is however not competitive with the fastest ones, such as FFTW, Intel MKL, AMD ACML, Apple vDSP. The reason for that is that those libraries do take advantage of the SSE SIMD instructions available on Intel CPUs, available since the days of the Pentium III. These instructions deal with small vectors of 4 floats at a time, instead of a single float for a traditionnal FPU, so when using these instructions one may expect a 4-fold performance improvement.

The idea was to take this fortran fftpack v4 code, translate to C, modify it to deal with those SSE instructions, and check that the final performance is not completely ridiculous when compared to other SIMD FFT libraries. Translation to C was performed with f2c. The resulting file was a bit edited in order to remove the thousands of gotos that were introduced by f2c. You will find the fftpack.h and fftpack.c sources in the repository, this a complete translation of fftpack, with the discrete cosine transform and the test program. There is no license information in the netlib repository, but it was confirmed to me by the fftpack v5 curators that the [same terms do apply to fftpack v4] (http://www.cisl.ucar.edu/css/software/fftpack5/ftpk.html). This is a “BSD-like” license, it is compatible with proprietary projects.

Adapting fftpack to deal with the SIMD 4-element vectors instead of scalar single precision numbers was more complex than I originally thought, especially with the real transforms, and I ended up writing more code than I planned..

The code:

Only two files, in good old C, pffft.c and pffft.h. The API is very very simple, just make sure that you read the comments in pffft.h.

Comparison with other FFTs:

The idea was not to break speed records, but to get a decently fast fft that is at least 50% as fast as the fastest FFT -- especially on slowest computers . I'm more focused on getting the best performance on slow cpus (Atom, Intel Core 1, old Athlons, ARM Cortex-A9...), than on getting top performance on today fastest cpus.

It can be used in a real-time context as the fft functions do not perform any memory allocation -- that is why they accept a ‘work’ array in their arguments.

It is also a bit focused on performing 1D convolutions, that is why it provides “unordered” FFTs , and a fourier domain convolution operation.

Benchmark results (cpu tested: core i7 2600, core 2 quad, core 1 duo, atom N270, cortex-A9)

The benchmark shows the performance of various fft implementations measured in MFlops, with the number of floating point operations being defined as 5Nlog2(N) for a length N complex fft, and 2.5*Nlog2(N) for a real fft. See here for an explanation of these formulas.

MacOS Lion, gcc 4.2, 64-bit, fftw 3.3 on a 3.4 GHz core i7 2600

Built with:

gcc-4.2 -o test_pffft -arch x86_64 -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -DHAVE_VECLIB -framework veclib -DHAVE_FFTW -lfftw3f
input lenreal FFTPackreal vDSPreal FFTWreal PFFFTcplx FFTPackcplx vDSPcplx FFTWcplx PFFFT
6428168596732981872887148981466811108
963298n/a837877273953n/a1568010878
1283507115759266101084233175981642712000
1603391n/a9838107114220n/a1665311187
1923919n/a9868109564297n/a1577012540
25642831317910694131284545195501635013822
3843136n/a10810120613600n/a1610313240
4803477n/a10632120743536n/a1163012522
51237831514111267138383649200021656013580
6403639n/a11164139463695n/a1541613890
7683800n/a11245134953590n/a1580214552
8003440n/a10499133013659n/a1205613268
102439241560511450153393769209631394115467
204845181619511551155324258204131372315042
24004294n/a10685130784093n/a1277713119
409647501659611672158174157196621431614336
819238201622711084125553691181321210213813
92163864n/a10254128703586n/a1211913994
1638438221512310454128223613168741237013881
3276841751451210662110953881147021161911524
26214433171142962699517281011729775710179
1048576291310551473058672661788135205350

Debian 6, gcc 4.4.5, 64-bit, fftw 3.3.1 on a 3.4 GHz core i7 2600

Built with:

gcc -o test_pffft -DHAVE_FFTW -msse2 -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L$HOME/local/lib -I$HOME/local/include/ -lfftw3f -lm
N (input length)real FFTPackreal FFTWreal PFFFTcplx FFTPackcplx FFTWcplx PFFFT
6438407680877743892048011171
9642149633842948162247711238
1283584102401024051202389311947
1924854110951294548542219114121
2564096117031638451202340613653
3844395146511255848841953514651
5125760131661536046082304015360
7684907140201635744611962814020
10245120146291462951202048015754
20485632140801877346931251616091
4096512013653175544726768014456
8192416073961331244371479113312
921642106124134734491728214970
163843976110101431342101145013631
32768426010224109544260681611797
262144373668969961235989659437
1048576279645346453186430785592

MacOS Snow Leopard, gcc 4.0, 32-bit, fftw 3.3 on a 1.83 GHz core 1 duo

Built with:

gcc -o test_pffft -DHAVE_FFTW -DHAVE_VECLIB -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -framework veclib
input lenreal FFTPackreal vDSPreal FFTWreal PFFFTcplx FFTPackcplx vDSPcplx FFTWcplx PFFFT
64745214517062028961335633132300
96877n/a197619781059n/a33332233
1289512808221322791202380337392494
1921002n/a245624291186n/a37012508
25610653205264127931302401339122663
384845n/a27592499948n/a37292504
512900347629562759974405739542645
768910n/a29122737975n/a38372614
10249363583310730091006412438212697
204810573585309128371089388937012513
409610833524309227331039361734622364
8192874325229672363911310627892302
9216898n/a24202290865n/a26762204
16384903289225062421899302627972289
32768965283725502358920292227632240
262144738242215891708610203814361091
1048576528120784588060610206691036

Ubuntu 11.04, gcc 4.5, 32-bit, fftw 3.2 on a 2.66 core 2 quad

Built with:

gcc -o test_pffft -DHAVE_FFTW -msse -mfpmath=sse -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -lm
input lenreal FFTPackreal FFTWreal PFFFTcplx FFTPackcplx FFTWcplx PFFFT
64192036145120219476806467
96187335495187210784295863
128224037735514256079646827
192176545697767228491377061
256204854617447273196387802
384199858616762231392537644
5122095614476802194102407089
7682230577375492045103317010
10242133640085332133107797877
20482011704086651942102407768
4096219468278777175594526827
8192184966566656175278316827
9216187158586416164369096266
16384188362236506166473406982
32768182663906667163174816971
262144154640755977129934153551
1048576110420711730110411491834

Ubuntu 11.04, gcc 4.5, 32-bit, fftw 3.3 on a 1.6 GHz Atom N270

Built with:

gcc -o test_pffft -DHAVE_FFTW -msse -mfpmath=sse -O3 -Wall -W pffft.c test_pffft.c fftpack.c -L/usr/local/lib -I/usr/local/include/ -lfftw3f -lm
N (input length)real FFTPackreal FFTWreal PFFFTcplx FFTPackcplx FFTWcplx PFFFT
644521041133654923181781
964441297129750324081686
1285271525170754326551886
1924981653184953926781942
2565851862215659427772244
3844991870199851125861890
5125622095219454229732194
7685452045213354523652133
10245952133243856926952179
20485872125234752122301707
40964951890183449218761672
81924691548172943817401664
92164681663166344615851531
163844531608176739814761664
327684561420150338713881345
262144309385726262415840
1048576280351739261313797

Windows 7, visual c++ 2010 on a 1.6 GHz Atom N270

Built with:

cl /Ox -D_USE_MATH_DEFINES /arch:SSE test_pffft.c pffft.c fftpack.c

(visual c++ is definitively not very good with SSE intrinsics...)

N (input length)real FFTPackreal PFFFTcplx FFTPackcplx PFFFT
6417310091741159
9616910291881201
12819512421911275
19217813121841276
25619615911861281
38417214091811281
51218716401811313
76817116141761258
102418618121781223
204819017071861099
40961821446177975
819217513451691034
921616512711681023
163841661396165949
327681721311161881
262144136632134629
1048576134698127623

Ubuntu 12.04, gcc-4.7.3, 32-bit, with fftw 3.3.3 (built with --enable-neon), on a 1.2GHz ARM Cortex A9 (Tegra 3)

Built with:

gcc-4.7 -O3 -DHAVE_FFTW -march=armv7-a -mtune=cortex-a9 -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm -I/usr/local/include/ -L/usr/local/lib/ -lfftw3f
input lenreal FFTPackreal FFTWreal PFFFTcplx FFTPackcplx FFTWcplx PFFFT
64549452731512602640
96421272702496571602
128498512815597618652
160521536815586669625
192539571883485597626
256640539975569611671
384499610879499602637
480518507877496661616
5125245911002549678668
640542612955568663645
768557613981491663598
800514353882514360574
10246406401067492683602
2048587640908486640552
2400479368777422376518
4096511614853426640534
8192415584708386622516
9216419571687364586506
16384426577716398606530
32768417572673399572468
262144219380293255431343
1048576202274237265282355

Same platform as above, but this time pffft and fftpack are built with clang 3.2:

clang -O3 -DHAVE_FFTW -march=armv7-a -mtune=cortex-a9 -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm -I/usr/local/include/ -L/usr/local/lib/ -lfftw3f
input lenreal FFTPackreal FFTWreal PFFFTcplx FFTPackcplx FFTWcplx PFFFT
644274528534276021024
96351276843337571963
1283735129963906181054
160426536987375669914
19240457110793885881079
25646553912054456021170
38436661010993435941099
4803565071140335651931
51241159112133846491124
6403986121193373654901
76840961312273836631044
8004113481073353358809
102442764012804136921004
20484146261126371640853
2400399373898319368653
40964046021059357633778
8192332584792308616716
9216322561783299586687
16384344568778314617745
32768342564737314552629
262144201383313227435413
1048576187262251228281409

So it looks like, on ARM, gcc 4.7 is the best at scalar floating point (the fftpack performance numbers are better with gcc), while clang is the best with neon intrinsics (see how pffft perf has improved with clang 3.2).

NVIDIA Jetson TK1 board, gcc-4.8.2. The cpu is a 2.3GHz cortex A15 (Tegra K1).

Built with:

gcc -O3 -march=armv7-a -mtune=native -mfloat-abi=hard -mfpu=neon -ffast-math test_pffft.c pffft.c -o test_pffft_arm fftpack.c -lm
input lenreal FFTPackreal PFFFTcplx FFTPackcplx PFFFT
641735330819943744
961596344819873572
1281807407622553960
1601769408320713845
1921990423320173939
2562191488222544346
3841878449220734012
4801748439819233951
5122030506422674195
6401918475620944184
7682099490720484297
8001822455518804063
10242232535521874420
20482176498320273602
24001741425617103344
40961816391418513349
81921716348117003255
92161735358916533094
163841567348316373244
327681624324016553156
262144101218989831503
104857687611548681341

The performance on the tegra K1 is pretty impressive. I'm not including the FFTW numbers as they as slightly below the scalar fftpack numbers, so something must be wrong (however it seems to be correctly configured and is using neon simd instructions).

When using clang 3.4 the pffft version is even a bit faster, reaching 5.7 GFlops for real ffts of size 1024.