|
View:
New views
9 Messages
—
Rating Filter:
Alert me
|
|
|
Fast factor 2 resampling Hi!
I have worked quite a bit now on writing code for fast factor two resampling based on FIR filters. So the task is similar to the last posting I made. The differences are: - I tried to really optimize for speed; I used gcc's SIMD primitives to design a version that runs really fast on my machine: model name : AMD Athlon(tm) 64 Processor 3400+ stepping : 10 cpu MHz : 2202.916 running in 64bit mode. - I put some more effort into designing the coefficients for the filter; I used octave to do it; the specifications I tried to meet are listed in the coeffs.h file. The resamplers are designed for streaming use; they do smart history keeping. Thus a possible use case I designed them for would be to upsample BEAST at the input devices and downsample BEAST at the output devices. The benefits of using the code for this tasks are: - the filters are linear-phase - the implementation should be fast enough (at least on my machine) - the implementation should be precise enough (near -96 dB error == 16 bit precision) The downside may be the delay of the filters. I put some effort into making this code easy to test, with four kinds of tests: (p) Performance tests measure how fast the code runs I tried on my machine with both: gcc-3.4 and gcc-4.0; you'll see the results below. The speedup gain achieved using SIMD instructions (SSE3 or whatever AMD64 uses) is gcc-4.0 gcc-3.4 -------------+--------------------- upsampling | 2.82 2.85 downsampling | 2.54 2.50 oversampling | 2.70 2.64 where oversampling is first performing upsampling and then performing downsampling. Note that there is a bug in gcc-3.3 which will not allow combining C++ code with SIMD instructions. The other output should be self-explaining (if not, feel free to ask). (a) Accuracy tests, which compare what should be the result with what is the result; you'll see that using SIMD instructions means a small loss of precision, but it should be acceptable. It occurs because the code doesn't use doubles to store the accumulated sum, but floats, to enable SIMD speedup. (g) Gnuplot; much the same like accuracy, but it writes out data which can be plottet by gnuplot. So it is possible to "see" the interpolation error, rather than just get it as output. (i) Impulse response; this one can be used for debugging - it will give the impulse response for the (for sub- and oversampling combined) system, so you can for instance see the delay in the time domain or plot the filter response in the frequency domain. So I am attaching all code and scripts that I produced so far. For compiling, I use g++ -O3 -funroll-loops as options; however, I suppose on x86 machines you need to tell the compiler to generate SSE code. I just tried it briefly on my laptop, and the SSE version there is much slower than the non-SSE version. Currently I can't say why this is. I know that AMD64 has extra registers compared to standard SSE. However, I designed the inner loop (fir_process_4samples_sse) with having in mind not to use more than 8 registers: out0..out3, input, taps, intermediate sum/product whatever. These are 7 registers. Well, maybe AMD64 isn't faster because of more registers, but better adressing mode, or whatever. Maybe its just my laptop being slow, and other non-AMD64-systems will perform better. Maybe we need to write three versions of the inner loop. One for AMD64, one for x86 with SSE and one for FPU. In any case, I invite you to try it out, play with the code, and give feedback about it. Cu... Stefan -- Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan /* SSE optimized FIR Resampling code * Copyright (C) 2006 Stefan Westerfeld * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This library is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * Lesser General Public License for more details. * * You should have received a copy of the GNU Lesser General * Public License along with this library; if not, write to the * Free Software Foundation, Inc., 59 Temple Place, Suite 330, * Boston, MA 02111-1307, USA. */ /* * halfband FIR filter for factor 2 resampling, created with octave * * design method: windowed sinc, using ultraspherical window * * coefficients = 32 * x0 = 1.01065 * alpha = 0.75 * * design criteria (44100 Hz => 88200 Hz): * * passband = [ 0, 18000 ] 1 - 2^-16 <= H(z) <= 1+2^-16 * transition = [ 18000, 26100 ] * stopband = [ 26100, 44100 ] | H(z) | <= -96 dB * * and for 48 kHz => 96 kHz: * * passband = [ 0, 19589 ] 1 - 2^-16 <= H(z) <= 1+2^-16 * transition = [ 19588, 29386 ] * stopband = [ 29386, 48000 ] | H(z) | <= -96 dB * * in order to keep the coefficient number down to 32, the filter * does only "almost" fulfill the spec, but its really really close * (no stopband ripple > -95 dB) */ const double halfband_fir_upsample2_96db_coeffs[32] = { -3.48616530828033e-05, 0.000112877490936198, -0.000278961878372482, 0.000590495306376081, -0.00112566995029848, 0.00198635062559427, -0.00330178798332932, 0.00523534239035401, -0.00799905465189065, 0.0118867161189188, -0.0173508611368417, 0.0251928452706978, -0.0370909694665106, 0.057408291607388, -0.102239638342325, 0.317002929635456, /* here, a 0.5 coefficient will be used */ 0.317002929635456, -0.102239638342325, 0.0574082916073878, -0.0370909694665105, 0.0251928452706976, -0.0173508611368415, 0.0118867161189186, -0.00799905465189052, 0.0052353423903539, -0.00330178798332923, 0.00198635062559419, -0.00112566995029842, 0.000590495306376034, -0.00027896187837245, 0.000112877490936177, -3.48616530827983e-05 }; /* SSE optimized FIR Resampling code * Copyright (C) 2006 Stefan Westerfeld * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public * License as published by the Free Software Foundation; either * version 2 of the License, or (at your option) any later version. * * This library is distributed in the hope that it will be useful, * but WITHOUT ANY WARRANTY; without even the implied warranty of * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU * Lesser General Public License for more details. * * You should have received a copy of the GNU Lesser General * Public License along with this library; if not, write to the * Free Software Foundation, Inc., 59 Temple Place, Suite 330, * Boston, MA 02111-1307, USA. */ #include <malloc.h> #include <stdio.h> #include <math.h> #include <assert.h> #include <vector> #include <sys/time.h> #include <time.h> using std::vector; using std::min; using std::max; using std::copy; /* see: http://ds9a.nl/gcc-simd/ */ typedef float v4sf __attribute__ ((vector_size (16))); //((mode(V4SF))); // vector of four single floats union f4vector { v4sf v; float f[4]; }; template<int N> inline float fir_process_one_sample (const float *input, const float *taps) { double out = 0; for (int i = 0; i < N; i++) out += input[i] * taps[i]; return out; } template<int N> inline void fir_process_4samples_sse (const float *input, const float *taps, float& out0, float& out1, float& out2, float& out3) { assert (N >= 4 && (N % 4) == 0); /* input and taps must be 16-byte aligned */ const f4vector *input_v = reinterpret_cast<const f4vector *> (input); const f4vector *taps_v = reinterpret_cast<const f4vector *> (taps); f4vector out0_v, out1_v, out2_v, out3_v; out0_v.v = input_v[0].v * taps_v[0].v; out1_v.v = input_v[0].v * taps_v[1].v; out2_v.v = input_v[0].v * taps_v[2].v; out3_v.v = input_v[0].v * taps_v[3].v; for (int i = 1; i < (N+4)/4; i++) /* FIXME: upper loop boundary may be wrong; only tested with (N % 4) == 0 */ { out0_v.v += input_v[i].v * taps_v[i*4+0].v; out1_v.v += input_v[i].v * taps_v[i*4+1].v; out2_v.v += input_v[i].v * taps_v[i*4+2].v; out3_v.v += input_v[i].v * taps_v[i*4+3].v; } out0 = out0_v.f[0] + out0_v.f[1] + out0_v.f[2] + out0_v.f[3]; out1 = out1_v.f[0] + out1_v.f[1] + out1_v.f[2] + out1_v.f[3]; out2 = out2_v.f[0] + out2_v.f[1] + out2_v.f[2] + out2_v.f[3]; out3 = out3_v.f[0] + out3_v.f[1] + out3_v.f[2] + out3_v.f[3]; } /* * we require a special ordering of the FIR taps, to get maximum benefit of the SSE SIMD operations * * example: suppose the FIR taps are [ x1 x2 x3 x4 x5 x6 x7 x8 x9 ], then the SSE taps become * * [ x1 x2 x3 x4 0 x1 x2 x3 0 0 x1 x2 0 0 0 x1 <- for input[0] * x5 x6 x7 x8 x4 x5 x6 x7 x3 x4 x5 x6 x2 x3 x4 x5 <- for input[1] * x9 0 0 0 x8 x9 0 0 x7 x8 x9 0 x6 x7 x8 x9 ] <- for input[2] * \------------/\-----------/\-----------/\-----------/ * for out0 for out1 for out2 for out3 * * so that we can compute out0, out1, out2 and out3 simultaneously * from input[0]..input[2] */ vector<float> fir_compute_taps_sse (const vector<float>& no_sse_taps) { const int T = no_sse_taps.size(); vector<float> sse_taps ((T + 4) * 4); /* FIXME: may be wrong; code only tested with (T % 4) == 0 */ for (int j = 0; j < 4; j++) for (int i = 0; i < T; i++) { int k = i + j; sse_taps[(k/4)*16+(k%4)+j*4] = no_sse_taps[i]; } return sse_taps; } template<class T, int ALIGN> class AlignedMem { unsigned char *unaligned_mem; T *data; unsigned int n_elements; void allocate_aligned_data() { assert ((ALIGN % sizeof(T)) == 0); unaligned_mem = static_cast <unsigned char *> (malloc (n_elements * sizeof(T) + (ALIGN-1))); unsigned char *aligned_mem = unaligned_mem; if (ptrdiff_t (unaligned_mem) % ALIGN) aligned_mem += ALIGN - ptrdiff_t (unaligned_mem) % ALIGN; data = reinterpret_cast<T *> (aligned_mem); } /* * no copy constructor */ AlignedMem (const AlignedMem&); /* * no assignment operator */ AlignedMem& operator= (const AlignedMem&); public: AlignedMem (const vector<T>& elements) : n_elements (elements.size()) { allocate_aligned_data(); for (unsigned int i = 0; i < n_elements; i++) new (data + i) T(elements[i]); } AlignedMem (unsigned int n_elements) : n_elements (n_elements) { allocate_aligned_data(); for (unsigned int i = 0; i < n_elements; i++) new (data + i) T(); } ~AlignedMem() { /* * C++ destruction order: last allocated element is deleted first */ while (n_elements) data[--n_elements].~T(); free (unaligned_mem); } T& operator[] (unsigned int pos) { return data[pos]; } const T& operator[] (unsigned int pos) const { return data[pos]; } unsigned int size() { return n_elements; } }; template<unsigned int T, bool USE_SSE> class Upsampler2 { vector<float> no_sse_taps; AlignedMem<float,16> history; AlignedMem<float,16> sse_taps; public: Upsampler2 (float *init_taps) : no_sse_taps (init_taps, init_taps + T), history (2 * T), sse_taps (fir_compute_taps_sse (no_sse_taps)) { assert ((T & 1) == 0); /* even order filter */ } /* * fast SSE optimized convolution */ void process_4samples_aligned (const float *input /* aligned */, float *output) { const int H = (T / 2) - 1; /* half the filter length */ output[0] = input[H]; output[2] = input[H+1]; output[4] = input[H+2]; output[6] = input[H+3]; fir_process_4samples_sse<T> (&input[0], &sse_taps[0], output[1], output[3], output[5], output[7]); } /* * slow non-SSE convolution */ void process_sample_unaligned (const float *input, float *output) { const int H = (T / 2) - 1; /* half the filter length */ output[0] = input[H]; output[1] = fir_process_one_sample<T> (&input[0], &no_sse_taps[0]); } void process_block_aligned (const float *input, unsigned n_input_samples, float *output) { unsigned int i = 0; if (USE_SSE) { while (i + 3 < n_input_samples) { process_4samples_aligned (&input[i], &output[i*2]); i += 4; } } while (i < n_input_samples) { process_sample_unaligned (&input[i], &output[2*i]); i++; } } void process_block_unaligned (const float *input, unsigned n_input_samples, float *output) { const int H = (T / 2) - 1; /* half the filter length */ unsigned int i = 0; if (USE_SSE) { while ((reinterpret_cast<ptrdiff_t> (&input[i]) & 15) && i < n_input_samples) { process_sample_unaligned (&input[i], &output[2*i]); i++; } } process_block_aligned (&input[i], n_input_samples - i, &output[2*i]); } void process_block (const float *input, unsigned int n_input_samples, float *output) { unsigned int history_todo = min (n_input_samples, T); copy (input, input + history_todo, &history[T]); process_block_aligned (&history[0], history_todo, output); if (n_input_samples >= T) { process_block_unaligned (input, n_input_samples - history_todo, &output [2*history_todo]); // build new history from new input copy (input + n_input_samples - T, input + n_input_samples, &history[0]); } else { // build new history from end of old history // (very expensive if n_input_samples tends to be a lot smaller than T often) memmove (&history[0], &history[n_input_samples], sizeof (history[0]) * T); } } }; template<unsigned int T, bool USE_SSE> class Downsampler2 { vector<float> no_sse_taps; AlignedMem<float,16> history_even; AlignedMem<float,16> history_odd; AlignedMem<float,16> sse_taps; public: Downsampler2 (float *init_taps) : no_sse_taps (init_taps, init_taps + T), history_even (2 * T), history_odd (2 * T), sse_taps (fir_compute_taps_sse (no_sse_taps)) { assert ((T & 1) == 0); /* even order filter */ } /* * fast SSE optimized convolution */ template<int ODD_STEPPING> void process_4samples_aligned (const float *input_even /* aligned */, const float *input_odd, float *output) { const int H = (T / 2) - 1; /* half the filter length */ fir_process_4samples_sse<T> (&input_even[0], &sse_taps[0], output[0], output[1], output[2], output[3]); output[0] += 0.5 * input_odd[H*ODD_STEPPING]; output[1] += 0.5 * input_odd[(H+1)*ODD_STEPPING]; output[2] += 0.5 * input_odd[(H+2)*ODD_STEPPING]; output[3] += 0.5 * input_odd[(H+3)*ODD_STEPPING]; } /* * slow non-SSE convolution */ template<int ODD_STEPPING> float process_sample_unaligned (const float *input_even, const float *input_odd) { const int H = (T / 2) - 1; /* half the filter length */ return fir_process_one_sample<T> (&input_even[0], &no_sse_taps[0]) + 0.5 * input_odd[H * ODD_STEPPING]; } template<int ODD_STEPPING> void process_block_aligned (const float *input_even, const float *input_odd, float *output, unsigned int n_output_samples) { unsigned int i = 0; if (USE_SSE) { while (i + 3 < n_output_samples) { process_4samples_aligned<ODD_STEPPING> (&input_even[i], &input_odd[i * ODD_STEPPING], &output[i]); i += 4; } } while (i < n_output_samples) { output[i] = process_sample_unaligned<ODD_STEPPING> (&input_even[i], &input_odd[i * ODD_STEPPING]); i++; } } template<int ODD_STEPPING> void process_block_unaligned (const float *input_even, const float *input_odd, float *output, unsigned int n_output_samples) { const int H = (T / 2) - 1; /* half the filter length */ unsigned int i = 0; if (USE_SSE) { while ((reinterpret_cast<ptrdiff_t> (&input_even[i]) & 15) && i < n_output_samples) { output[i] = process_sample_unaligned<ODD_STEPPING> (&input_even[i], &input_odd[i * ODD_STEPPING]); i++; } } process_block_aligned<ODD_STEPPING> (&input_even[i], &input_odd[i * ODD_STEPPING], &output[i], n_output_samples); } void deinterleave2 (const float *data, unsigned int n_data_values, float *output) { for (unsigned int i = 0; i < n_data_values; i += 2) output[i / 2] = data[i]; } void process_block (const float *input, unsigned int n_input_samples, float *output) { assert ((n_input_samples & 1) == 0); const unsigned int BLOCKSIZE = 1024; f4vector block[BLOCKSIZE / 4]; /* using f4vector ensures 16-byte alignment */ float *input_even = &block[0].f[0]; while (n_input_samples) { unsigned int n_input_todo = min (n_input_samples, BLOCKSIZE * 2); /* * since the halfband filter contains zeros every other sample * and since we're using SIMD instructions, which expect the * data to be consecutively represented in memory, we prepare * a block of samples containing only even-indexed samples * * we keep the deinterleaved data on the stack (instead of per-class * allocated memory), to ensure that even running a lot of these * downsampler streams will not result in cache trashing */ /* * FIXME: this implementation is suboptimal for non-SSE, because it * performs an extra deinterleaving step in any case, but deinterleaving * is only required for SSE instructions */ deinterleave2 (input, n_input_todo, input_even); const float *input_odd = input + 1; /* we process this one with a stepping of 2 */ const unsigned int n_output_todo = n_input_todo / 2; const unsigned int history_todo = min (n_output_todo, T); copy (input_even, input_even + history_todo, &history_even[T]); deinterleave2 (input_odd, history_todo * 2, &history_odd[T]); process_block_aligned <1> (&history_even[0], &history_odd[0], output, history_todo); if (n_output_todo >= T) { process_block_unaligned<2> (input_even, input_odd, &output[history_todo], n_output_todo - history_todo); // build new history from new input copy (input_even + n_output_todo - T, input_even + n_output_todo, &history_even[0]); deinterleave2 (input_odd + n_input_todo - T * 2, T * 2, &history_odd[0]); /* FIXME: can be optimized */ } else { // build new history from end of old history // (very expensive if n_output_todo tends to be a lot smaller than T often) memmove (&history_even[0], &history_even[n_output_todo], sizeof (history_even[0]) * T); memmove (&history_odd[0], &history_odd[n_output_todo], sizeof (history_odd[0]) * T); } n_input_samples -= n_input_todo; input += n_input_todo; output += n_output_todo; } } }; #include "coeffs.h" void exit_usage (int status) { printf ("usage: ssefir <options> [ <block_size> ] [ <test_frequency> ]\n"); printf ("\n"); printf (" p - performance\n"); printf (" a - accuracy\n"); printf (" g - generate output for gnuplot\n"); printf (" i - impulse response\n"); printf ("\n"); printf (" d - downsample\n"); printf (" u - upsample\n"); printf (" s - subsample (= downsample and upsample after that)\n"); printf (" o - oversample (= upsample and downsample after that)\n"); printf ("\n"); printf (" f - fast implementation (sse)\n"); printf ("\n"); printf (" block_size defaults to 128 values\n"); printf (" test_frequency defaults to 440 Hz\n"); printf ("\n"); printf ("examples:\n"); printf (" ssefir puf 256 # check performance of fast upsampling with 256 value blocks\n"); printf (" ssefir au # check accuracy of standard upsampling\n"); printf (" ssefir au 128 500 # check accuracy of standard upsampling using a 500 Hz frequency\n"); exit (status); } enum TestType { TEST_PERFORMANCE, TEST_ACCURACY, TEST_GNUPLOT, TEST_IMPULSE, TEST_NONE } test_type = TEST_NONE; enum ResampleType { RES_DOWNSAMPLE, RES_UPSAMPLE, RES_SUBSAMPLE, RES_OVERSAMPLE, RES_NONE } resample_type = RES_NONE; enum ImplType { IMPL_SSE, IMPL_NORMAL } impl_type = IMPL_NORMAL; unsigned int block_size = 128; double test_frequency = 440.0; double gettime () { timeval tv; gettimeofday (&tv, 0); return double(tv.tv_sec) + double(tv.tv_usec) * (1.0 / 1000000.0); } template <int TEST, int RESAMPLE, int IMPL> int perform_test() { if (TEST == TEST_IMPULSE) /* need to have space for impulse response for all 4 tests */ block_size = 150; /* initialize up- and downsampler */ const int T = 32; const bool USE_SSE = (IMPL == IMPL_SSE); float utaps[T], dtaps[T]; for (int i = 0; i < T; i++) { utaps[i] = halfband_fir_upsample2_96db_coeffs[i] * 2; dtaps[i] = halfband_fir_upsample2_96db_coeffs[i]; } Upsampler2<T,USE_SSE> ups (utaps); Downsampler2<T,USE_SSE> downs (dtaps); f4vector in_v[block_size / 2 + 1], out_v[block_size / 2 + 1], out2_v[block_size / 2 + 1]; float *input = &in_v[0].f[0], *output = &out_v[0].f[0], *output2 = &out2_v[0].f[0]; /* ensure aligned data */ if (TEST == TEST_PERFORMANCE) { for (int i = 0; i < block_size; i++) input[i] = sin (i * test_frequency/44100.0 * 2 * M_PI); double start_time = gettime(); long long k = 0; for (int i = 0; i < 500000; i++) { if (RESAMPLE == RES_DOWNSAMPLE || RESAMPLE == RES_SUBSAMPLE) { downs.process_block (input, block_size, output); if (RESAMPLE == RES_SUBSAMPLE) ups.process_block (output, block_size / 2, output2); } if (RESAMPLE == RES_UPSAMPLE || RESAMPLE == RES_OVERSAMPLE) { ups.process_block (input, block_size, output); if (RESAMPLE == RES_OVERSAMPLE) downs.process_block (output, block_size * 2, output2); } k += block_size; } double end_time = gettime(); if (RESAMPLE == RES_DOWNSAMPLE) { printf (" (performance will be normalized to downsampler output samples)\n"); k /= 2; } else if (RESAMPLE == RES_UPSAMPLE) { printf (" (performance will be normalized to upsampler input samples)\n"); } printf (" total samples processed = %lld\n", k); printf (" processing_time = %f\n", end_time - start_time); printf (" samples / second = %f\n", k / (end_time - start_time)); printf (" which means the resampler can process %.2f 44100 Hz streams simultaneusly\n", k / (end_time - start_time) / 44100.0); printf (" or one 44100 Hz stream takes %f %% CPU usage\n", 100.0 / (k / (end_time - start_time) / 44100.0)); } else if (TEST == TEST_ACCURACY || TEST == TEST_GNUPLOT) { long long k = 0; double phase = 0; double max_diff = 0; for (int b = 0; b < 1000; b++) { int misalign = rand() % 4; int bs = rand() % (block_size - misalign); if (RESAMPLE == RES_DOWNSAMPLE || RESAMPLE == RES_SUBSAMPLE) bs -= bs & 1; for (int i = 0; i < bs; i++) { input[i+misalign] = sin (phase); phase += test_frequency/44100.0 * 2 * M_PI; } if (RESAMPLE == RES_DOWNSAMPLE || RESAMPLE == RES_SUBSAMPLE) { downs.process_block (input + misalign, bs, output); if (RESAMPLE == RES_SUBSAMPLE) ups.process_block (output, bs / 2, output2); } if (RESAMPLE == RES_UPSAMPLE || RESAMPLE == RES_OVERSAMPLE) { ups.process_block (input + misalign, bs, output); if (RESAMPLE == RES_OVERSAMPLE) downs.process_block (output, bs * 2, output2); } /* validate output */ double sin_shift; double freq_factor; unsigned int out_bs; float *check = output; if (RESAMPLE == RES_UPSAMPLE) { sin_shift = 34; freq_factor = 0.5; out_bs = bs * 2; } else if (RESAMPLE == RES_DOWNSAMPLE) { sin_shift = 16.5; freq_factor = 2; out_bs = bs / 2; } else if (RESAMPLE == RES_OVERSAMPLE) { sin_shift = 33.5; freq_factor = 1; check = output2; out_bs = bs; } else if (RESAMPLE == RES_SUBSAMPLE) { sin_shift = 67; freq_factor = 1; check = output2; out_bs = bs; } for (unsigned int i = 0; i < out_bs; i++, k++) if (k > 100) { double correct_output = sin ((k - sin_shift) * 2 * freq_factor * test_frequency/44100.0 * M_PI); if (TEST == TEST_GNUPLOT) printf ("%lld %.17f %.17f\n", k, check[i], correct_output); else max_diff = max (max_diff, check[i] - correct_output); } } double max_diff_db = 20 * log (max_diff) / log (10); if (TEST == TEST_ACCURACY) { printf (" input frequency used to perform test = %.2f Hz (SR = 44100.0 Hz)\n", test_frequency); printf (" max difference between correct and computed output: %f = %f dB\n", max_diff, max_diff_db); } } else if (TEST == TEST_IMPULSE) { input[0] = 1; for (int i = 1; i < block_size; i++) { input[i] = 0; output[i] = 0; output2[i] = 0; } if (RESAMPLE == RES_DOWNSAMPLE || RESAMPLE == RES_SUBSAMPLE) { downs.process_block (input, block_size, output); if (RESAMPLE == RES_SUBSAMPLE) ups.process_block (output, block_size / 2, output2); } if (RESAMPLE == RES_UPSAMPLE || RESAMPLE == RES_OVERSAMPLE) { ups.process_block (input, block_size, output); if (RESAMPLE == RES_OVERSAMPLE) downs.process_block (output, block_size * 2, output2); } float *check = output; if (RESAMPLE == RES_OVERSAMPLE || RESAMPLE == RES_SUBSAMPLE) check = output2; for (int i = 0; i < block_size; i++) printf ("%.17f\n", check[i]); } } template <int TEST, int RESAMPLE> int perform_test() { switch (impl_type) { case IMPL_SSE: printf ("using SSE instructions\n"); return perform_test<TEST, RESAMPLE, IMPL_SSE> (); case IMPL_NORMAL: printf ("using FPU instructions\n"); return perform_test<TEST, RESAMPLE, IMPL_NORMAL> (); } } template <int TEST> int perform_test () { switch (resample_type) { case RES_DOWNSAMPLE: printf ("for factor 2 downsampling "); return perform_test<TEST, RES_DOWNSAMPLE> (); case RES_UPSAMPLE: printf ("for factor 2 upsampling "); return perform_test<TEST, RES_UPSAMPLE> (); case RES_SUBSAMPLE: printf ("for factor 2 subsampling "); return perform_test<TEST, RES_SUBSAMPLE> (); case RES_OVERSAMPLE: printf ("for factor 2 oversampling "); return perform_test<TEST, RES_OVERSAMPLE> (); } } int perform_test() { switch (test_type) { case TEST_PERFORMANCE: printf ("performance test "); return perform_test<TEST_PERFORMANCE> (); case TEST_ACCURACY: printf ("accuracy test "); return perform_test<TEST_ACCURACY> (); case TEST_GNUPLOT: printf ("# gnuplot test "); return perform_test<TEST_GNUPLOT> (); case TEST_IMPULSE: printf ("# impulse response test "); return perform_test<TEST_IMPULSE> (); } } int main (int argc, char **argv) { if (argc >= 2) { for (int i = 0; i < strlen (argv[1]); i++) { switch (argv[1][i]) { case 'p': test_type = TEST_PERFORMANCE; break; case 'a': test_type = TEST_ACCURACY; break; case 'g': test_type = TEST_GNUPLOT; break; case 'i': test_type = TEST_IMPULSE; break; case 'd': resample_type = RES_DOWNSAMPLE; break; case 'u': resample_type = RES_UPSAMPLE; break; case 's': resample_type = RES_SUBSAMPLE; break; case 'o': resample_type = RES_OVERSAMPLE; break; case 'f': impl_type = IMPL_SSE; break; } } } if (argc >= 3) block_size = atoi (argv[2]); if (argc >= 4) test_frequency = atoi (argv[3]); if ((block_size & 1) == 1) block_size++; if (block_size < 2) block_size = 2; if (test_type == TEST_NONE || resample_type == RES_NONE) exit_usage (1); return perform_test(); } # g++-4.0 -funroll-loops -O3 -c -o ssefir.o ssefir.cc # g++-4.0 -funroll-loops -O3 -o ssefir ssefir.o performance test for factor 2 upsampling using FPU instructions (performance will be normalized to upsampler input samples) total samples processed = 64000000 processing_time = 3.582959 samples / second = 17862330.233790 which means the resampler can process 405.04 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.246888 % CPU usage performance test for factor 2 upsampling using SSE instructions (performance will be normalized to upsampler input samples) total samples processed = 64000000 processing_time = 1.268400 samples / second = 50457270.836486 which means the resampler can process 1144.16 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.087401 % CPU usage performance test for factor 2 downsampling using FPU instructions (performance will be normalized to downsampler output samples) total samples processed = 32000000 processing_time = 2.099055 samples / second = 15244955.091818 which means the resampler can process 345.69 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.289276 % CPU usage performance test for factor 2 downsampling using SSE instructions (performance will be normalized to downsampler output samples) total samples processed = 32000000 processing_time = 0.826667 samples / second = 38709658.514582 which means the resampler can process 877.77 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.113925 % CPU usage performance test for factor 2 oversampling using FPU instructions total samples processed = 64000000 processing_time = 7.717842 samples / second = 8292473.356379 which means the resampler can process 188.04 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.531808 % CPU usage performance test for factor 2 oversampling using SSE instructions total samples processed = 64000000 processing_time = 2.863337 samples / second = 22351542.660578 which means the resampler can process 506.84 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.197302 % CPU usage accuracy test for factor 2 upsampling using FPU instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000012 = -98.194477 dB accuracy test for factor 2 upsampling using SSE instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000012 = -98.080477 dB accuracy test for factor 2 downsampling using FPU instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000006 = -104.811480 dB accuracy test for factor 2 downsampling using SSE instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000006 = -104.761055 dB accuracy test for factor 2 oversampling using FPU instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000012 = -98.194477 dB accuracy test for factor 2 oversampling using SSE instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000012 = -98.080477 dB # g++-3.4 -funroll-loops -O3 -c -o ssefir.o ssefir.cc # g++-3.4 -funroll-loops -O3 -o ssefir ssefir.o performance test for factor 2 upsampling using FPU instructions (performance will be normalized to upsampler input samples) total samples processed = 64000000 processing_time = 3.797012 samples / second = 16855359.553811 which means the resampler can process 382.21 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.261638 % CPU usage performance test for factor 2 upsampling using SSE instructions (performance will be normalized to upsampler input samples) total samples processed = 64000000 processing_time = 1.332213 samples / second = 48040368.623546 which means the resampler can process 1089.35 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.091798 % CPU usage performance test for factor 2 downsampling using FPU instructions (performance will be normalized to downsampler output samples) total samples processed = 32000000 processing_time = 2.284448 samples / second = 14007760.860869 which means the resampler can process 317.64 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.314825 % CPU usage performance test for factor 2 downsampling using SSE instructions (performance will be normalized to downsampler output samples) total samples processed = 32000000 processing_time = 0.913186 samples / second = 35042155.471063 which means the resampler can process 794.61 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.125848 % CPU usage performance test for factor 2 oversampling using FPU instructions total samples processed = 64000000 processing_time = 8.265052 samples / second = 7743448.102385 which means the resampler can process 175.59 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.569514 % CPU usage performance test for factor 2 oversampling using SSE instructions total samples processed = 64000000 processing_time = 3.125921 samples / second = 20473965.840909 which means the resampler can process 464.26 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.215395 % CPU usage accuracy test for factor 2 upsampling using FPU instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000012 = -98.194477 dB accuracy test for factor 2 upsampling using SSE instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000012 = -98.080477 dB accuracy test for factor 2 downsampling using FPU instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000006 = -104.811480 dB accuracy test for factor 2 downsampling using SSE instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000006 = -104.761055 dB accuracy test for factor 2 oversampling using FPU instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000012 = -98.194477 dB accuracy test for factor 2 oversampling using SSE instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000012 = -98.080477 dB _______________________________________________ beast mailing list beast@... http://mail.gnome.org/mailman/listinfo/beast |
|
|
Re: Fast factor 2 resamplingOn Mon, 6 Mar 2006, Stefan Westerfeld wrote:
> Hi! > > I have worked quite a bit now on writing code for fast factor two > resampling based on FIR filters. So the task is similar to the last > posting I made. The differences are: > > - I tried to really optimize for speed; I used gcc's SIMD primitives to > design a version that runs really fast on my machine: > > model name : AMD Athlon(tm) 64 Processor 3400+ > stepping : 10 > cpu MHz : 2202.916 > > running in 64bit mode. > > - I put some more effort into designing the coefficients for the filter; > I used octave to do it; the specifications I tried to meet are listed > in the coeffs.h file. hm, can you put up a description about how to derive the coefficients with octave or with some other tool then. so they can be reproduced by someone else? > The resamplers are designed for streaming use; they do smart history > keeping. Thus a possible use case I designed them for would be to > upsample BEAST at the input devices and downsample BEAST at the output > devices. > > The benefits of using the code for this tasks are: > > - the filters are linear-phase *why* exactly is this a benefit? > - the implementation should be fast enough (at least on my machine) > - the implementation should be precise enough (near -96 dB error == 16 > bit precision) what is required to beef this up to -120dB, or provide an alternative implementation. i'm asking because float or 24bit datahandles are not at all unlikely for the future. likewise, a 12bit variant may make sense as well for some handles (maybe even an 8bit variant in case that's still significantly faster than the 12bit version). > The downside may be the delay of the filters. > > I put some effort into making this code easy to test, with four kinds of > tests: > > (p) Performance tests measure how fast the code runs > > I tried on my machine with both: gcc-3.4 and gcc-4.0; you'll see the > results below. The speedup gain achieved using SIMD instructions > (SSE3 or whatever AMD64 uses) is > > gcc-4.0 gcc-3.4 > -------------+--------------------- > upsampling | 2.82 2.85 > downsampling | 2.54 2.50 > oversampling | 2.70 2.64 > > where oversampling is first performing upsampling and then > performing downsampling. Note that there is a bug in gcc-3.3 which > will not allow combining C++ code with SIMD instructions. > > The other output should be self-explaining (if not, feel free to > ask). hm, these figures are pretty much meaningless without knowing: - what exactly was performed that took 2.82 or 2.85 - what is the unit of those figures? milli seconds? hours? dollars? > (a) Accuracy tests, which compare what should be the result with what is > the result; you'll see that using SIMD instructions means a small > loss of precision, but it should be acceptable. It occurs because > the code doesn't use doubles to store the accumulated sum, but > floats, to enable SIMD speedup. what's the cost of using doubles for intermediate values anyway (is that possible at all?) and what does the precision loss mean in dB? > (g) Gnuplot; much the same like accuracy, but it writes out data which > can be plottet by gnuplot. So it is possible to "see" the > interpolation error, rather than just get it as output. > > (i) Impulse response; this one can be used for debugging - it will give > the impulse response for the (for sub- and oversampling combined) > system, so you can for instance see the delay in the time domain or > plot the filter response in the frequency domain. > > So I am attaching all code and scripts that I produced so far. For > compiling, I use g++ -O3 -funroll-loops as options; however, I suppose > on x86 machines you need to tell the compiler to generate SSE code. > > I just tried it briefly on my laptop, and the SSE version there is much > slower than the non-SSE version. Currently I can't say why this is. I > know that AMD64 has extra registers compared to standard SSE. However, > I designed the inner loop (fir_process_4samples_sse) with having in mind > not to use more than 8 registers: out0..out3, input, taps, intermediate > sum/product whatever. These are 7 registers. Well, maybe AMD64 isn't > faster because of more registers, but better adressing mode, or > whatever. > > Maybe its just my laptop being slow, and other non-AMD64-systems will > perform better. > > Maybe we need to write three versions of the inner loop. One for AMD64, > one for x86 with SSE and one for FPU. > > In any case, I invite you to try it out, play with the code, and give > feedback about it. first, i'd like to thank you for working on this. but then, what you're sending here is still pretty rough and looks cumbersome to deal with. can you please provide more details on the exact API you intend to add (best is to have this in bugzilla), and give precise build instructions (best is usually down to the level of shell commands, so the reader just needs to paste those). also, more details of what exctaly your performance tests do and how to use them would be apprechiated. > Cu... Stefan --- ciaoTJ _______________________________________________ beast mailing list beast@... http://mail.gnome.org/mailman/listinfo/beast |
|
|
Re: Fast factor 2 resampling Hi!
On Tue, Mar 28, 2006 at 03:27:14PM +0200, Tim Janik wrote: > On Mon, 6 Mar 2006, Stefan Westerfeld wrote: > >I have worked quite a bit now on writing code for fast factor two > >resampling based on FIR filters. So the task is similar to the last > >posting I made. The differences are: > > > >- I tried to really optimize for speed; I used gcc's SIMD primitives to > > design a version that runs really fast on my machine: > > > > model name : AMD Athlon(tm) 64 Processor 3400+ > > stepping : 10 > > cpu MHz : 2202.916 > > > > running in 64bit mode. > > > >- I put some more effort into designing the coefficients for the filter; > > I used octave to do it; the specifications I tried to meet are listed > > in the coeffs.h file. > > hm, can you put up a description about how to derive the coefficients with > octave or with some other tool then. so they can be reproduced by someone > else? As I have done it, it requires extra octave code (a bunch of .m files implementing the ultraspherical window). I've copypasted the code from a paper, and hacked around until it worked (more or less) in octave. But if we want to include it as octave code in the BEAST distribution, it might be worth investing a little more work into this window so that we can provide a matlab/octave implementation we really understand and then can provide a C implementation as well, so it can be used from BEAST directly. > >The resamplers are designed for streaming use; they do smart history > >keeping. Thus a possible use case I designed them for would be to > >upsample BEAST at the input devices and downsample BEAST at the output > >devices. > > > >The benefits of using the code for this tasks are: > > > >- the filters are linear-phase > > *why* exactly is this a benefit? Linear phase filtering means three things: * we do "real interpolation", in the sense that for factor 2 upsampling, every other sample is exactly kept as it is; this means that we don't have to compute it * we keep the shape of the signal intact, thus operations that modify the shape of the signal (non-linear operations, such as saturation) will sound the same when oversampling them * we have the same delay for all frequencies - not having the same delay for all frequencies may result in audible differences between the original and up/downsampled signal http://en.wikipedia.org/wiki/Group_delay gives a table, which however seems to indicate that "not being quite" linear phase wouldn't lead to audible problems > >- the implementation should be fast enough (at least on my machine) > >- the implementation should be precise enough (near -96 dB error == 16 > > bit precision) > > what is required to beef this up to -120dB, or provide an alternative > implementation. i'm asking because float or 24bit datahandles are not at > all unlikely for the future. Why -120dB? 6 * 24 = 144...? The first factor that influences the precision is of course the filter (and the resampling code doesn't hardcode the filter coefficients). The filter can be tweaked to offer a -144dB (or -120dB) frequency response by redesigning the coefficients (with the octave method I used), it will be longer then (more delay, slower computation). The second factor is the SSE code itself, because SSE limits us to float float precision. My implementation also uses a computation order that is quite fast - but not too good for precision. Usually, for FIR filters its good to compute first the influence of small coefficients and then the influence of larger ones. However I compute the influence of the coefficients in the order they occur in the impulse response. As conclusion: it might be that SSE code - at least as implemented - cannot attain the precision we desire for 24bit audio. How good it gets probably can't be determined without trying it. > likewise, a 12bit variant may make sense as well for some handles (maybe > even an 8bit variant in case that's still significantly faster than the > 12bit version). That should be no problems, simply by designing new coefficients. > >The downside may be the delay of the filters. > > > >I put some effort into making this code easy to test, with four kinds of > >tests: > > > >(p) Performance tests measure how fast the code runs > > > > I tried on my machine with both: gcc-3.4 and gcc-4.0; you'll see the > > results below. The speedup gain achieved using SIMD instructions > > (SSE3 or whatever AMD64 uses) is > > > > gcc-4.0 gcc-3.4 > > -------------+--------------------- > > upsampling | 2.82 2.85 > > downsampling | 2.54 2.50 > > oversampling | 2.70 2.64 > > > > where oversampling is first performing upsampling and then > > performing downsampling. Note that there is a bug in gcc-3.3 which > > will not allow combining C++ code with SIMD instructions. > > > > The other output should be self-explaining (if not, feel free to > > ask). > > hm, these figures are pretty much meaningless without knowing: > - what exactly was performed that took 2.82 or 2.85 > - what is the unit of those figures? milli seconds? hours? dollars? These are speedup gains. A speedup gain is a factor between the "normal" implementation and the SSE implementation. speedup_gain = time_normal / time_sse It has no unit, because the "seconds" unit both times have will disappear when dividing them. If you want to know the times, and the number of samples processed in that time, you should read the RESULTS file. It is much more detailed than the table I gave above. > >(a) Accuracy tests, which compare what should be the result with what is > > the result; you'll see that using SIMD instructions means a small > > loss of precision, but it should be acceptable. It occurs because > > the code doesn't use doubles to store the accumulated sum, but > > floats, to enable SIMD speedup. > > what's the cost of using doubles for intermediate values anyway (is that > possible at all?) > and what does the precision loss mean in dB? The non-SSE implementation does use doubles for intermediate values. The SSE implementation could only use doubles if we rely on some higher version of SSE (I think SSE2 or SSE3). However, the price of doing it would be that the vectorized operations don't do four operations at once, but two. That means it would become a lot slower to use SSE at all. As outlined above, the "real" performance loss is hard to predict. However, I can give you one sample here for the -96dB filter: $ ssefir au accuracy test for factor 2 upsampling using FPU instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000012 = -98.194477 dB $ ssefir auf accuracy test for factor 2 upsampling using SSE instructions input frequency used to perform test = 440.00 Hz (SR = 44100.0 Hz) max difference between correct and computed output: 0.000012 = -98.080477 dB As you see, the variant which uses doubles for intermediate values is not much better than the SSE variant, and both fulfill the spec without problems. However, as dB is a logarithmic measure, care has to be taken when extrapolating what it would mean for a -144dB (or -120dB) filter. And the other aspects that affect precision I mentioned above will also affect the result. > but then, what you're sending here is still pretty rough and > looks cumbersome to deal with. > can you please provide more details on the exact API you intend to add > (best is to have this in bugzilla), and give precise build instructions > (best is usually down to the level of shell commands, so the reader just > needs to paste those). I've uploaded a more recent version of the sources to bugzilla: #336366. It also contains build instructions for the standalong thingy. For ssefir.h, I added documentation comments /** *... */ for those functions/classes that may be interesting for others. I also marked a few more functions protected, so that only the interesting part of the main classes, Upsampler2 and Downsampler2, remains public. > also, more details of what exctaly your performance tests do and how > to use them would be apprechiated. Basically, they do the resampling processing for the same block of data 500000 times. You can modify the block size used. By timing this operation, a throughput can be computed which then can be given as samples per second, or for instance CPU usage for resampling a single 44100 Hz stream. If you run the shell script (or read the test file RESULTS I had attached to the initial mail), you may understand it a bit more, because the output is somewhat verbose. On my system: $ ssefir pu performance test for factor 2 upsampling using FPU instructions (performance will be normalized to upsampler input samples) total samples processed = 64000000 processing_time = 3.667876 samples / second = 17448790.501572 which means the resampler can process 395.66 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.252740 % CPU usage $ ssefir puf performance test for factor 2 upsampling using SSE instructions (performance will be normalized to upsampler input samples) total samples processed = 64000000 processing_time = 1.346511 samples / second = 47530250.673020 which means the resampler can process 1077.78 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.092783 % CPU usage The arguments here are: p = performance u = upsampling f = fast -> SSE implementation run ssefir without args for help. Cu... Stefan -- Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan _______________________________________________ beast mailing list beast@... http://mail.gnome.org/mailman/listinfo/beast |
|
|
Re: Fast factor 2 resamplingOn Tue, 28 Mar 2006, Stefan Westerfeld wrote:
> Hi! > > On Tue, Mar 28, 2006 at 03:27:14PM +0200, Tim Janik wrote: >> On Mon, 6 Mar 2006, Stefan Westerfeld wrote: >>> I have worked quite a bit now on writing code for fast factor two >>> resampling based on FIR filters. So the task is similar to the last >>> posting I made. The differences are: >>> >>> - I tried to really optimize for speed; I used gcc's SIMD primitives to >>> design a version that runs really fast on my machine: >>> >>> model name : AMD Athlon(tm) 64 Processor 3400+ >>> stepping : 10 >>> cpu MHz : 2202.916 >>> >>> running in 64bit mode. >>> >>> - I put some more effort into designing the coefficients for the filter; >>> I used octave to do it; the specifications I tried to meet are listed >>> in the coeffs.h file. >> >> hm, can you put up a description about how to derive the coefficients with >> octave or with some other tool then. so they can be reproduced by someone >> else? > > As I have done it, it requires extra octave code (a bunch of .m files > implementing the ultraspherical window). I've copypasted the code from a > paper, and hacked around until it worked (more or less) in octave. > > But if we want to include it as octave code in the BEAST distribution, > it might be worth investing a little more work into this window so that > we can provide a matlab/octave implementation we really understand and > then can provide a C implementation as well, so it can be used from > BEAST directly. ok, ok, first things first ;) as far as i see, we only have a couple use cases at hand, supposedly comprehensive filter setups are: - 8bit: 48dB - 12bit: 72dB - 16bit: 96dB - 20bit: 120dB - 24bit: 144dB if we have those 5 cases covered by coefficient sets, that'd be good enough to check the stuff in to CVS and have production ready up/down sampling. then, if the octave files and the paper you pasted from permit, it'd be good to put the relevant octave/matlab files into CVS under LGPL, so the coefficient creation process can be reconstructed later on (and by other contributors). last, once all of the above has been settled and there is a valid use case for creating these filters from C, the octave/matlab code can be translated to be available during runtime. do you actually see a use case for this? >>> The resamplers are designed for streaming use; they do smart history >>> keeping. Thus a possible use case I designed them for would be to >>> upsample BEAST at the input devices and downsample BEAST at the output >>> devices. >>> >>> The benefits of using the code for this tasks are: >>> >>> - the filters are linear-phase >> >> *why* exactly is this a benefit? > > Linear phase filtering means three things: > > * we do "real interpolation", in the sense that for factor 2 upsampling, > every other sample is exactly kept as it is; this means that we don't > have to compute it > > * we keep the shape of the signal intact, thus operations that modify > the shape of the signal (non-linear operations, such as saturation) > will sound the same when oversampling them > > * we have the same delay for all frequencies - not having the same > delay for all frequencies may result in audible differences between > the original and up/downsampled signal > > http://en.wikipedia.org/wiki/Group_delay > > gives a table, which however seems to indicate that "not being quite" > linear phase wouldn't lead to audible problems ok, thanks for explaining this. we should have this and similar things available in our docuemntation actually. either on a wiki page on synthesis, or even a real documentation chapter about synthesis. thoughts? >>> - the implementation should be fast enough (at least on my machine) >>> - the implementation should be precise enough (near -96 dB error == 16 >>> bit precision) >> >> what is required to beef this up to -120dB, or provide an alternative >> implementation. i'm asking because float or 24bit datahandles are not at >> all unlikely for the future. > > Why -120dB? 6 * 24 = 144...? yeah, thanks for pointing this out. both are valid use cases, 20bit samples and 24bit samples. > The first factor that influences the precision is of course the filter > (and the resampling code doesn't hardcode the filter coefficients). The > filter can be tweaked to offer a -144dB (or -120dB) frequency response > by redesigning the coefficients (with the octave method I used), it will > be longer then (more delay, slower computation). > > The second factor is the SSE code itself, because SSE limits us to float > float precision. My implementation also uses a computation order that is > quite fast - but not too good for precision. Usually, for FIR filters its > good to compute first the influence of small coefficients and then the > influence of larger ones. However I compute the influence of the > coefficients in the order they occur in the impulse response. > > As conclusion: it might be that SSE code - at least as implemented - > cannot attain the precision we desire for 24bit audio. How good it gets > probably can't be determined without trying it. yeah, right. float might fall short on 20bit or 24bit (definitely the latter, since 32bit floats have only 23bit of mantissa). but as you say, we'll see once we have the other coefficient sets, and even if at 144dB only the slow FPU variant can keep precision, the SSE code will still speed up the most common use case which is 16bit. what worries me a bit though is that you mentioned one of your machines runs the SSE variant slower than the FPU varient. did you investigate more here? >> likewise, a 12bit variant may make sense as well for some handles (maybe >> even an 8bit variant in case that's still significantly faster than the >> 12bit version). > > That should be no problems, simply by designing new coefficients. ok, as i understand, you're going to do this by manually, using octave or mathlab as a tool now, right? >>> gcc-4.0 gcc-3.4 >>> -------------+--------------------- >>> upsampling | 2.82 2.85 >>> downsampling | 2.54 2.50 >>> oversampling | 2.70 2.64 >>> >>> where oversampling is first performing upsampling and then >>> performing downsampling. Note that there is a bug in gcc-3.3 which >>> will not allow combining C++ code with SIMD instructions. >>> >>> The other output should be self-explaining (if not, feel free to >>> ask). >> >> hm, these figures are pretty much meaningless without knowing: >> - what exactly was performed that took 2.82 or 2.85 >> - what is the unit of those figures? milli seconds? hours? dollars? > > These are speedup gains. A speedup gain is a factor between the "normal" > implementation and the SSE implementation. ah, good you point this out. > If you want to know the times, and the number of samples processed in > that time, you should read the RESULTS file. It is much more detailed > than the table I gave above. well, i did read through it now. first, what's oversampling? how's that different from upsampling? and second, reading through the output doesn't lend itself for proper comparisions of the figures in a good way. compiling them into a table would be better for that. and third, since this is the output of a single run, i have no idea how much those figures are affected by processor/sytem/etc. jitter, so even if all the performance figures where next to each other, i'd still have no idea within what ranges they are actually comparable. i.e. a bit of posprocessing on your side would have helped to make the information acquired be properly digestible ;) i agree that the output itself is nicely verbose though. >>> (a) Accuracy tests, which compare what should be the result with what is >>> the result; you'll see that using SIMD instructions means a small >>> loss of precision, but it should be acceptable. It occurs because >>> the code doesn't use doubles to store the accumulated sum, but >>> floats, to enable SIMD speedup. >> >> what's the cost of using doubles for intermediate values anyway (is that >> possible at all?) >> and what does the precision loss mean in dB? > > The non-SSE implementation does use doubles for intermediate values. The > SSE implementation could only use doubles if we rely on some higher > version of SSE (I think SSE2 or SSE3). However, the price of doing it > would be that the vectorized operations don't do four operations at > once, but two. That means it would become a lot slower to use SSE at > all. depending on sse2 also limits portability, e.g. out of 2 laptops here, only 1 has sse2 (both have sse), and out of 2 athlons here only one has sse (and none sse2). the story is different with mmx of course, which is supported by all 4 processors... > As outlined above, the "real" performance loss is hard to predict. > However, I can give you one sample here for the -96dB filter: > max difference between correct and computed output: 0.000012 = -98.194477 dB > accuracy test for factor 2 upsampling using SSE instructions > max difference between correct and computed output: 0.000012 = -98.080477 dB thanks. now lets see those figures for 120dB and 144dB ;) > As you see, the variant which uses doubles for intermediate values is > not much better than the SSE variant, and both fulfill the spec without > problems. have you by any chance benched the FPU variant with doubles against the FPU variant with floats btw? > However, as dB is a logarithmic measure, care has to be taken when > extrapolating what it would mean for a -144dB (or -120dB) filter. > And the other aspects that affect precision I mentioned above will also > affect the result. yeah, agree. >> but then, what you're sending here is still pretty rough and >> looks cumbersome to deal with. >> can you please provide more details on the exact API you intend to add >> (best is to have this in bugzilla), and give precise build instructions >> (best is usually down to the level of shell commands, so the reader just >> needs to paste those). > > I've uploaded a more recent version of the sources to bugzilla: #336366. > It also contains build instructions for the standalong thingy. For > ssefir.h, I added documentation comments > > /** > *... > */ > > for those functions/classes that may be interesting for others. I also > marked a few more functions protected, so that only the interesting part > of the main classes, Upsampler2 and Downsampler2, remains public. thanks for the good work, will have a look at it later. > Cu... Stefan --- ciaoTJ _______________________________________________ beast mailing list beast@... http://mail.gnome.org/mailman/listinfo/beast |
|
|
Re: Fast factor 2 resampling Hi!
On Tue, Mar 28, 2006 at 08:06:07PM +0200, Tim Janik wrote: > On Tue, 28 Mar 2006, Stefan Westerfeld wrote: > >>>- I put some more effort into designing the coefficients for the filter; > >>>I used octave to do it; the specifications I tried to meet are listed > >>>in the coeffs.h file. > >> > >>hm, can you put up a description about how to derive the coefficients with > >>octave or with some other tool then. so they can be reproduced by someone > >>else? > > > >As I have done it, it requires extra octave code (a bunch of .m files > >implementing the ultraspherical window). I've copypasted the code from a > >paper, and hacked around until it worked (more or less) in octave. > > > >But if we want to include it as octave code in the BEAST distribution, > >it might be worth investing a little more work into this window so that > >we can provide a matlab/octave implementation we really understand and > >then can provide a C implementation as well, so it can be used from > >BEAST directly. > > ok, ok, first things first ;) > > as far as i see, we only have a couple use cases at hand, supposedly > comprehensive filter setups are: > - 8bit: 48dB > - 12bit: 72dB > - 16bit: 96dB > - 20bit: 120dB > - 24bit: 144dB > > if we have those 5 cases covered by coefficient sets, that'd be good enough > to check the stuff in to CVS and have production ready up/down sampling. Yes, these sound reasonable. Although picking which filter setup to use may not be as easy as looking at the precision of the input data. For example ogg input data could be resampled with 96dB coefficients for performance reasons, or 8bit input data could be resampled with a higher order filter to get better transition steepness. Anyway, I'll design coefficients for these 5 cases, and if we want to have more settings later on, we still can design new coefficients. > then, if the octave files and the paper you pasted from permit, it'd be good > to put the relevant octave/matlab files into CVS under LGPL, so the > coefficient > creation process can be reconstructed later on (and by other contributors). I've asked the author of the paper, and he said we can put his code in our LGPL project. I still need to put some polishing into the octave code, because I somewhat broke it when porting it from matlab to octave. The original version has somewhat more readable/understandable filter design parameters than my version. I hope I get the octave code right. > last, once all of the above has been settled and there is a valid use case > for creating these filters from C, the octave/matlab code can be translated > to be available during runtime. do you actually see a use case for this? One case I can think of is if we've got a FIR module with a GUI that allows designing custom filters. But even then, ultraspherical coefficient tweaking may be too much work for everyday use. Probably the standard user will rather have "some" window which is somewhat optimal, without a lot of tweaking, rather than an "almost optimal" window - such as ultraspherical, with a lot of manual tweaking. One could also try to automate the tweaking of the two window parameters by using some kind of search algorithm. Then, it would be as easy to use as a normal window. > >Linear phase filtering means three things: > > > >* we do "real interpolation", in the sense that for factor 2 upsampling, > > every other sample is exactly kept as it is; this means that we don't > > have to compute it > > > >* we keep the shape of the signal intact, thus operations that modify > > the shape of the signal (non-linear operations, such as saturation) > > will sound the same when oversampling them > > > >* we have the same delay for all frequencies - not having the same > > delay for all frequencies may result in audible differences between > > the original and up/downsampled signal > > > > http://en.wikipedia.org/wiki/Group_delay > > > > gives a table, which however seems to indicate that "not being quite" > > linear phase wouldn't lead to audible problems > > ok, thanks for explaining this. we should have this and similar things > available in our docuemntation actually. either on a wiki page on synthesis, > or even a real documentation chapter about synthesis. thoughts? Maybe a new doxi file on synthesis details? I could write a few paragraphs on the resampler. > >Why -120dB? 6 * 24 = 144...? > > yeah, thanks for pointing this out. both are valid use cases, 20bit > samples and 24bit samples. Although by the way -120dB should be ok for almost any practical use case, because the human ear probably won't able to hear the difference. Since these are relative values (unlike when talking about integer precisions for samples), even signals which are not very loud will get really good resampling. Thus you have error scenarios like this: a signal with a loud desired signal (sine wave with 0 dB) and a small error signal (sine wave with -120dB). I doubt that the human ear can pick up the error signal. I even doubt it for the -96 dB case. But well, we could perform listening tests to try it out. > yeah, right. float might fall short on 20bit or 24bit (definitely the > latter, > since 32bit floats have only 23bit of mantissa). > but as you say, we'll see once we have the other coefficient sets, and even > if at 144dB only the slow FPU variant can keep precision, the SSE code will > still speed up the most common use case which is 16bit. Yes. We need to try it once I have the coefficient sets. As I argued above, the errors may be well below what the human ear can percieve. > what worries me a bit though is that you mentioned one of your machines > runs the SSE variant slower than the FPU varient. did you investigate > more here? Not yet. > >>likewise, a 12bit variant may make sense as well for some handles (maybe > >>even an 8bit variant in case that's still significantly faster than the > >>12bit version). > > > >That should be no problems, simply by designing new coefficients. > > ok, as i understand, you're going to do this by manually, using octave or > mathlab as a tool now, right? Yes. > >>> gcc-4.0 gcc-3.4 > >>> -------------+--------------------- > >>> upsampling | 2.82 2.85 > >>> downsampling | 2.54 2.50 > >>> oversampling | 2.70 2.64 > >>> > >>> where oversampling is first performing upsampling and then > >>> performing downsampling. Note that there is a bug in gcc-3.3 which > >>> will not allow combining C++ code with SIMD instructions. > >>> > >>> The other output should be self-explaining (if not, feel free to > >>> ask). > >> > >>hm, these figures are pretty much meaningless without knowing: > >>- what exactly was performed that took 2.82 or 2.85 > >>- what is the unit of those figures? milli seconds? hours? dollars? > > > >These are speedup gains. A speedup gain is a factor between the "normal" > >implementation and the SSE implementation. > > ah, good you point this out. > > >If you want to know the times, and the number of samples processed in > >that time, you should read the RESULTS file. It is much more detailed > >than the table I gave above. > > well, i did read through it now. first, what's oversampling? how's that > different from upsampling? Oversampling is first upsampling a 44100 Hz signal to 88200 Hz, and then downsampling it again to 44100 Hz. Its what I first designed the filters for: for oversampling the engine. Thus I benchmarked it as seperate case. > and second, reading through the output doesn't lend itself for proper > comparisions of the figures in a good way. compiling them into a table > would be better for that. Right. > and third, since this is the output of a single run, i have no idea how > much those figures are affected by processor/sytem/etc. jitter, so even > if all the performance figures where next to each other, i'd still have > no idea within what ranges they are actually comparable. Well, the of putting the code in bugs.gnome.org was that you could see yourself how much jitter/... it produces. > >>>(a) Accuracy tests, which compare what should be the result with what is > >>> the result; you'll see that using SIMD instructions means a small > >>> loss of precision, but it should be acceptable. It occurs because > >>> the code doesn't use doubles to store the accumulated sum, but > >>> floats, to enable SIMD speedup. > >> > >>what's the cost of using doubles for intermediate values anyway (is that > >>possible at all?) > >>and what does the precision loss mean in dB? > > > >The non-SSE implementation does use doubles for intermediate values. The > >SSE implementation could only use doubles if we rely on some higher > >version of SSE (I think SSE2 or SSE3). However, the price of doing it > >would be that the vectorized operations don't do four operations at > >once, but two. That means it would become a lot slower to use SSE at > >all. > > depending on sse2 also limits portability, e.g. out of 2 laptops here, only > 1 has sse2 (both have sse), and out of 2 athlons here only one has sse (and > none sse2). the story is different with mmx of course, which is supported by > all 4 processors... But MMX only accelerates integer operations, which doesn't help much for our floating point based data handles. > >As outlined above, the "real" performance loss is hard to predict. > >However, I can give you one sample here for the -96dB filter: > > >max difference between correct and computed output: 0.000012 = -98.194477 > >dB > > >accuracy test for factor 2 upsampling using SSE instructions > > >max difference between correct and computed output: 0.000012 = -98.080477 > >dB > > thanks. now lets see those figures for 120dB and 144dB ;) Later. I want to cleanup the octave files first (see above), before doing the final design of the coefficients. > >As you see, the variant which uses doubles for intermediate values is > >not much better than the SSE variant, and both fulfill the spec without > >problems. > > have you by any chance benched the FPU variant with doubles against the > FPU variant with floats btw? Well, I tried it now: the FPU variant without doubles is quite a bit (15%) faster than the variant which uses doubles as intermediate values. If you want *really cool* speedups, you can use gcc-4.1 with float temporaries -ftree-vectorize and -ffast-math. That auto vectorization thing really works, and replaces the FPU instructions with SSE instructions automagically. Its not much slower than my hand crafted version. But then again, we wanted a FPU variant to have a FPU variant, right? I haven't benchmarked a version which uses double filter coefficients, because it would have required a lot of rewriting to get it work with my current source tree. But here are the tests I described above: ### FPU CODE with temporary DOUBLES. $ g++-4.1 -o ssefir ssefir.cc -O3 -funroll-loops $ ./ssefir pu performance test for factor 2 upsampling using FPU instructions (performance will be normalized to upsampler input samples) total samples processed = 64000000 processing_time = 3.804806 samples / second = 16820831.364426 which means the resampler can process 381.42 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.262175 % CPU usage ### FPU CODE with temporary FLOATS. $ g++-4.1 -o ssefir ssefir.cc -O3 -funroll-loops $ ./ssefir pu performance test for factor 2 upsampling using FPU instructions (performance will be normalized to upsampler input samples) total samples processed = 64000000 processing_time = 3.317954 samples / second = 19288996.585134 which means the resampler can process 437.39 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.228628 % CPU usage ### SSE CODE with temporary FLOATS generated by AUTOVECTORIZER $ g++-4.1 -o ssefir ssefir.cc -O3 -funroll-loops -ffast-math -ftree-vectorize $ ./ssefir pu performance test for factor 2 upsampling using FPU instructions (performance will be normalized to upsampler input samples) total samples processed = 64000000 processing_time = 1.482929 samples / second = 43157831.814407 which means the resampler can process 978.64 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.102183 % CPU usage ### SSE CODE with temporary FLOATS generated by HAND $ g++-4.1 -o ssefir ssefir.cc -O3 -funroll-loops $ ./ssefir puf performance test for factor 2 upsampling using SSE instructions (performance will be normalized to upsampler input samples) total samples processed = 64000000 processing_time = 1.323285 samples / second = 48364483.105296 which means the resampler can process 1096.70 44100 Hz streams simultaneusly or one 44100 Hz stream takes 0.091183 % CPU usage I must admit, I am really impressed with the quality of auto vectorization. Its the first time I can try it out (installed gcc-4.1 today). Hand written code is a somewhat faster, but not much. Another advantage of the hand written code is that it will work with any compiler down to gcc 3.4. > >I've uploaded a more recent version of the sources to bugzilla: #336366. > >[...] > > thanks for the good work, will have a look at it later. I uploaded a new version with my current sources. Cu... Stefan -- Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan _______________________________________________ beast mailing list beast@... http://mail.gnome.org/mailman/listinfo/beast |
|
|
Re: Fast factor 2 resamplingOn Tue, 4 Apr 2006, Stefan Westerfeld wrote:
> Hi! > > On Tue, Mar 28, 2006 at 08:06:07PM +0200, Tim Janik wrote: >> On Tue, 28 Mar 2006, Stefan Westerfeld wrote: >>>>> - I put some more effort into designing the coefficients for the filter; >>>>> I used octave to do it; the specifications I tried to meet are listed >>>>> in the coeffs.h file. >>>> >>>> hm, can you put up a description about how to derive the coefficients with >>>> octave or with some other tool then. so they can be reproduced by someone >>>> else? >>> >>> As I have done it, it requires extra octave code (a bunch of .m files >>> implementing the ultraspherical window). I've copypasted the code from a >>> paper, and hacked around until it worked (more or less) in octave. >>> >>> But if we want to include it as octave code in the BEAST distribution, >>> it might be worth investing a little more work into this window so that >>> we can provide a matlab/octave implementation we really understand and >>> then can provide a C implementation as well, so it can be used from >>> BEAST directly. >> >> ok, ok, first things first ;) >> >> as far as i see, we only have a couple use cases at hand, supposedly >> comprehensive filter setups are: >> - 8bit: 48dB >> - 12bit: 72dB >> - 16bit: 96dB >> - 20bit: 120dB >> - 24bit: 144dB >> >> if we have those 5 cases covered by coefficient sets, that'd be good enough >> to check the stuff in to CVS and have production ready up/down sampling. > > Yes, these sound reasonable. Although picking which filter setup to use > may not be as easy as looking at the precision of the input data. > > For example ogg input data could be resampled with 96dB coefficients for > performance reasons, or 8bit input data could be resampled with a higher > order filter to get better transition steepness. but that'd just be another choice out of those 5, other than the obvious one. or am i misunderstanding you and you want to point out a missing setup? > > Anyway, I'll design coefficients for these 5 cases, and if we want to > have more settings later on, we still can design new coefficients. yeah. > >> then, if the octave files and the paper you pasted from permit, it'd be good >> to put the relevant octave/matlab files into CVS under LGPL, so the >> coefficient >> creation process can be reconstructed later on (and by other contributors). > > I've asked the author of the paper, and he said we can put his code in > our LGPL project. I still need to put some polishing into the octave > code, because I somewhat broke it when porting it from matlab to octave. ugh. i think putting just the link to the paper into our docs or the code comments would be enough, give it is publically available. but we can mirror it on site if we got permission for redistribution and there is reason to believe the original location may be non-permanent in any way. >>> Linear phase filtering means three things: >>> >>> * we do "real interpolation", in the sense that for factor 2 upsampling, >>> every other sample is exactly kept as it is; this means that we don't >>> have to compute it >>> >>> * we keep the shape of the signal intact, thus operations that modify >>> the shape of the signal (non-linear operations, such as saturation) >>> will sound the same when oversampling them >>> >>> * we have the same delay for all frequencies - not having the same >>> delay for all frequencies may result in audible differences between >>> the original and up/downsampled signal >>> >>> http://en.wikipedia.org/wiki/Group_delay >>> >>> gives a table, which however seems to indicate that "not being quite" >>> linear phase wouldn't lead to audible problems >> >> ok, thanks for explaining this. we should have this and similar things >> available in our docuemntation actually. either on a wiki page on synthesis, >> or even a real documentation chapter about synthesis. thoughts? > > Maybe a new doxi file on synthesis details? I could write a few > paragraphs on the resampler. that'd be good. will you check that in to docs/ then? one file that may be remotely suitable is: http://beast.gtk.org/architecture.html but it's probably much better to just start synthesis-details.doxi. > >>> Why -120dB? 6 * 24 = 144...? >> >> yeah, thanks for pointing this out. both are valid use cases, 20bit >> samples and 24bit samples. > > Although by the way -120dB should be ok for almost any practical use > case, because the human ear probably won't able to hear the difference. > > Since these are relative values (unlike when talking about integer > precisions for samples), even signals which are not very loud will get > really good resampling. > > Thus you have error scenarios like this: a signal with a loud desired > signal (sine wave with 0 dB) and a small error signal (sine wave with > -120dB). I doubt that the human ear can pick up the error signal. I > even doubt it for the -96 dB case. But well, we could perform listening > tests to try it out. well, since we're writing a modular synthesis application here, keep in mind that examining just one signal in isolation isn't good enough for all cases. that'd be ok for a media player with one pluggable filter in it's output chain, but our signals are used for various purposes and *may* be strongly amplified. i'm not saying 144dB will be the common use case, but i think it's reasonably within the range of filters we migth want to offer synthesis users. > >> yeah, right. float might fall short on 20bit or 24bit (definitely the >> latter, >> since 32bit floats have only 23bit of mantissa). >> but as you say, we'll see once we have the other coefficient sets, and even >> if at 144dB only the slow FPU variant can keep precision, the SSE code will >> still speed up the most common use case which is 16bit. > > Yes. We need to try it once I have the coefficient sets. As I argued > above, the errors may be well below what the human ear can percieve. i'll keep an eye on it. with our FFT scope. which allows allmost arbitrary signal boosts ;) > >> what worries me a bit though is that you mentioned one of your machines >> runs the SSE variant slower than the FPU varient. did you investigate >> more here? > > Not yet. ok, please keep posting once you've done that then ;) >> well, i did read through it now. first, what's oversampling? how's that >> different from upsampling? > > Oversampling is first upsampling a 44100 Hz signal to 88200 Hz, and then > downsampling it again to 44100 Hz. Its what I first designed the filters > for: for oversampling the engine. Thus I benchmarked it as seperate > case. hm, i still don't have a good idea if we won't need n-times oversampling for the whole engine. basically, because i have a rough idea on what usual input output rates are or could be (44.1k, 48k, 88.2k, 96k), but not what good rates are to run the synthesis engine at (48K, 56K, 66.15K, 64K, 72K)... >>> The non-SSE implementation does use doubles for intermediate values. The >>> SSE implementation could only use doubles if we rely on some higher >>> version of SSE (I think SSE2 or SSE3). However, the price of doing it >>> would be that the vectorized operations don't do four operations at >>> once, but two. That means it would become a lot slower to use SSE at >>> all. >> >> depending on sse2 also limits portability, e.g. out of 2 laptops here, only >> 1 has sse2 (both have sse), and out of 2 athlons here only one has sse (and >> none sse2). the story is different with mmx of course, which is supported by >> all 4 processors... > > But MMX only accelerates integer operations, which doesn't help much for > our floating point based data handles. sure, i'm just pointing out the availability of different technologies here. i.e. mmx vs. sse vs. sse2. and the athlons of course also have 3dnow. to sum it up, SSE seems feasible at the moment, SSE2 not so, out of the available instruction sets. >>> As you see, the variant which uses doubles for intermediate values is >>> not much better than the SSE variant, and both fulfill the spec without >>> problems. >> >> have you by any chance benched the FPU variant with doubles against the >> FPU variant with floats btw? > > Well, I tried it now: the FPU variant without doubles is quite a bit (15%) > faster than the variant which uses doubles as intermediate values. > > If you want *really cool* speedups, you can use gcc-4.1 with float > temporaries -ftree-vectorize and -ffast-math. That auto vectorization > thing really works, and replaces the FPU instructions with SSE > instructions automagically. Its not much slower than my hand crafted > version. But then again, we wanted a FPU variant to have a FPU variant, > right? erm, i can't believe your gcc did that without also specifiying a processor type... and when we get processor specific, we have to provide alternative compilation objects and need a mechanism to clearly identify and select the required instruction sets during runtime. >>> I've uploaded a more recent version of the sources to bugzilla: #336366. >>> [...] >> >> thanks for the good work, will have a look at it later. > > I uploaded a new version with my current sources. rock. > Cu... Stefan --- ciaoTJ _______________________________________________ beast mailing list beast@... http://mail.gnome.org/mailman/listinfo/beast |
|
|
Re: Fast factor 2 resampling Hi!
On Wed, Apr 05, 2006 at 02:05:37AM +0200, Tim Janik wrote: > On Tue, 4 Apr 2006, Stefan Westerfeld wrote: > >>ok, ok, first things first ;) > >> > >>as far as i see, we only have a couple use cases at hand, supposedly > >>comprehensive filter setups are: > >>- 8bit: 48dB > >>- 12bit: 72dB > >>- 16bit: 96dB > >>- 20bit: 120dB > >>- 24bit: 144dB > >> > >>if we have those 5 cases covered by coefficient sets, that'd be good > >>enough > >>to check the stuff in to CVS and have production ready up/down sampling. > > > >Yes, these sound reasonable. Although picking which filter setup to use > >may not be as easy as looking at the precision of the input data. > > > >For example ogg input data could be resampled with 96dB coefficients for > >performance reasons, or 8bit input data could be resampled with a higher > >order filter to get better transition steepness. > > but that'd just be another choice out of those 5, other than the obvious > one. > or am i misunderstanding you and you want to point out a missing setup? No, I was just pointing out to the fact that choosing from these 5 should not (always) be automated for datahandles. In the plain C API, this means that we now have /* --- resampling datahandles with the factor 2 --- */ GslDataHandle* bse_data_handle_new_upsample2 (GslDataHandle *src_handle, int precision_bits); instead of the old API GslDataHandle* bse_data_handle_new_upsample2 (GslDataHandle *src_handle); But actually there is a case that is not covered very well (and can not be covered with the code designed as it is right now), and thats resampling files with small sample rate. When it comes to that, the aliasing area that I've designed into the inaudible area (22050-26100 Hz if we're resampling 44100 Hz recordings) moves down into the audible area (for instance 11025-13050 Hz upsampling a 22050Hz recording with factor 2). If that is a problem we need a non-halfband implementation as well (see below for more cases where we need that), which is not too hard to write but may be significantly slower (factor 2 or so). > >>then, if the octave files and the paper you pasted from permit, it'd be > >>good > >>to put the relevant octave/matlab files into CVS under LGPL, so the > >>coefficient > >>creation process can be reconstructed later on (and by other > >>contributors). > > > >I've asked the author of the paper, and he said we can put his code in > >our LGPL project. I still need to put some polishing into the octave > >code, because I somewhat broke it when porting it from matlab to octave. > > ugh. i think putting just the link to the paper into our docs or the code > comments would be enough, give it is publically available. but we can mirror > it on site if we got permission for redistribution and there is reason to > believe the original location may be non-permanent in any way. I was't speaking about the paper, but only about the octave/matlab source code given in the paper (for ultraspherical windows). He allowed us to redistribute this, and thats what I want to do, so that everybody can reproduce how we designed the filter coefficients. > > >>>Linear phase filtering means three things: > >>> > >>>* we do "real interpolation", in the sense that for factor 2 upsampling, > >>>every other sample is exactly kept as it is; this means that we don't > >>>have to compute it > >>> > >>>* we keep the shape of the signal intact, thus operations that modify > >>>the shape of the signal (non-linear operations, such as saturation) > >>>will sound the same when oversampling them > >>> > >>>* we have the same delay for all frequencies - not having the same > >>>delay for all frequencies may result in audible differences between > >>>the original and up/downsampled signal > >>> > >>> http://en.wikipedia.org/wiki/Group_delay > >>> > >>>gives a table, which however seems to indicate that "not being quite" > >>>linear phase wouldn't lead to audible problems > >> > >>ok, thanks for explaining this. we should have this and similar things > >>available in our docuemntation actually. either on a wiki page on > >>synthesis, > >>or even a real documentation chapter about synthesis. thoughts? > > > >Maybe a new doxi file on synthesis details? I could write a few > >paragraphs on the resampler. > > that'd be good. will you check that in to docs/ then? > one file that may be remotely suitable is: > http://beast.gtk.org/architecture.html > but it's probably much better to just start synthesis-details.doxi. Will do. > >Oversampling is first upsampling a 44100 Hz signal to 88200 Hz, and then > >downsampling it again to 44100 Hz. Its what I first designed the filters > >for: for oversampling the engine. Thus I benchmarked it as seperate > >case. > > hm, i still don't have a good idea if we won't need n-times oversampling > for the whole engine. basically, because i have a rough idea on what usual > input output rates are or could be (44.1k, 48k, 88.2k, 96k), but not what > good rates are to run the synthesis engine at (48K, 56K, 66.15K, 64K, > 72K)... Well, I've at least already thought about accelerated FIR based implementations: you can accelerate every rate change with a raional number (P/Q), i.e. 3/2 upsampling or 5/4 upsampling (because you can build complete coefficient tables, then). However, the code will be quite a bit slower than factor 2 upsampling due to two reasons: * you can not copy samples from the original signal as often as you can do it for factor 2 upsampling * for rational rates, you can not use half band filters, thus you don't have filters with every other coefficient zero So in the worst case, we have a performance loss of factor 2 for the first point and factor 2 for the second point. But of course these are estimates and can't replace real benchmarking of an implementation once it exists. > >>>As you see, the variant which uses doubles for intermediate values is > >>>not much better than the SSE variant, and both fulfill the spec without > >>>problems. > >> > >>have you by any chance benched the FPU variant with doubles against the > >>FPU variant with floats btw? > > > >Well, I tried it now: the FPU variant without doubles is quite a bit (15%) > >faster than the variant which uses doubles as intermediate values. > > > >If you want *really cool* speedups, you can use gcc-4.1 with float > >temporaries -ftree-vectorize and -ffast-math. That auto vectorization > >thing really works, and replaces the FPU instructions with SSE > >instructions automagically. Its not much slower than my hand crafted > >version. But then again, we wanted a FPU variant to have a FPU variant, > >right? > > erm, i can't believe your gcc did that without also specifiying a > processor type... Well, I never have to specify a processor type, because my gcc only supports one: native AMD64 code. But you are of course right in the sense that my gcc always produces code which is perfectly optimized to the processor it will run on, and my gcc always knows about all instructions my processor supports and so on. > and when we get processor specific, we have to provide alternative > compilation objects and need a mechanism to clearly identify and > select the required instruction sets during runtime. I understand why you don't want to support _many_ object files per algorithm. However, it may be reasonable to support _two_ object files per algorithm, one compiled with -msse -ftree-vectorize and one without. This also needs to be done for Bse::Resampler. It may be reasonable to do it for common algorithms at least, like scaling a float block with a float value or adding two float blocks together. And we need a runtime check for checking whether SSE is available. But thats not a problem, because there is one in arts/flow/cpuinfo.* we can simply copypaste. Cu... Stefan -- Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan _______________________________________________ beast mailing list beast@... http://mail.gnome.org/mailman/listinfo/beast |
|
|
Re: Fast factor 2 resamplingOn Mon, 10 Apr 2006, Stefan Westerfeld wrote:
> Hi! > > On Wed, Apr 05, 2006 at 02:05:37AM +0200, Tim Janik wrote: >> On Tue, 4 Apr 2006, Stefan Westerfeld wrote: >>>> ok, ok, first things first ;) >>>> >>>> as far as i see, we only have a couple use cases at hand, supposedly >>>> comprehensive filter setups are: >>>> - 8bit: 48dB >>>> - 12bit: 72dB >>>> - 16bit: 96dB >>>> - 20bit: 120dB >>>> - 24bit: 144dB >>>> >>>> if we have those 5 cases covered by coefficient sets, that'd be good >>>> enough >>>> to check the stuff in to CVS and have production ready up/down sampling. >>> >>> Yes, these sound reasonable. Although picking which filter setup to use >>> may not be as easy as looking at the precision of the input data. >>> >>> For example ogg input data could be resampled with 96dB coefficients for >>> performance reasons, or 8bit input data could be resampled with a higher >>> order filter to get better transition steepness. >> >> but that'd just be another choice out of those 5, other than the obvious >> one. >> or am i misunderstanding you and you want to point out a missing setup? > > No, I was just pointing out to the fact that choosing from these 5 > should not (always) be automated for datahandles. In the plain C API, > this means that we now have > > /* --- resampling datahandles with the factor 2 --- */ > GslDataHandle* bse_data_handle_new_upsample2 (GslDataHandle *src_handle, int precision_bits); > > instead of the old API > > GslDataHandle* bse_data_handle_new_upsample2 (GslDataHandle *src_handle); > > > But actually there is a case that is not covered very well (and can not > be covered with the code designed as it is right now), and thats > resampling files with small sample rate. When it comes to that, the > aliasing area that I've designed into the inaudible area (22050-26100 Hz > if we're resampling 44100 Hz recordings) moves down into the audible > area (for instance 11025-13050 Hz upsampling a 22050Hz recording with > factor 2). sorry, i don't understand. can you elaborate o nwhat you mean with "aliasing area" here? >> and when we get processor specific, we have to provide alternative >> compilation objects and need a mechanism to clearly identify and >> select the required instruction sets during runtime. > > I understand why you don't want to support _many_ object files per > algorithm. However, it may be reasonable to support _two_ object files > per algorithm, one compiled with -msse -ftree-vectorize and one without. > This also needs to be done for Bse::Resampler. It may be reasonable to > do it for common algorithms at least, like scaling a float block with a > float value or adding two float blocks together. > > And we need a runtime check for checking whether SSE is available. But > thats not a problem, because there is one in arts/flow/cpuinfo.* we can > simply copypaste. the runtime CPU check and build system changes to build plugins with and without SSE support are now in CVS HEAD. > Cu... Stefan --- ciaoTJ _______________________________________________ beast mailing list beast@... http://mail.gnome.org/mailman/listinfo/beast |
|
|
Re: Fast factor 2 resampling Hi!
On Tue, Apr 18, 2006 at 05:39:30PM +0200, Tim Janik wrote: > >/* --- resampling datahandles with the factor 2 --- */ > >GslDataHandle* bse_data_handle_new_upsample2 (GslDataHandle *src_handle, > >int precision_bits); > > > >instead of the old API > > > >GslDataHandle* bse_data_handle_new_upsample2 (GslDataHandle *src_handle); > > > > > >But actually there is a case that is not covered very well (and can not > >be covered with the code designed as it is right now), and thats > >resampling files with small sample rate. When it comes to that, the > >aliasing area that I've designed into the inaudible area (22050-26100 Hz > >if we're resampling 44100 Hz recordings) moves down into the audible > >area (for instance 11025-13050 Hz upsampling a 22050Hz recording with > >factor 2). > > sorry, i don't understand. can you elaborate o nwhat you mean with > "aliasing area" here? Suppose this is your original spectrum: # # ### #### # ###### ###### --------------------> frequency | | 0 nyquist Zero padding introduces a spectrum copy # # # # ### ### #### ## #### ############ ############ --------------------> frequency | | | old nyquist 0 | new nyquist We now need to filter this, and using a half band filter for the task means that we'll have an attenuation of -6dB (0.5 on a linear scale) at the old nyquist frequency. Thus, there is an area which starts slightly below the old nyquist frequency and ends slightly above the old nyquist frequency which is not the same as the original signal. Filter #### ##### ###### ####### ######## --------------------> frequency | \---/ | aliasing area 0 | new nyquist Filtered signal: # # ### #### ###### ####### --------------------> frequency | \---/ | aliasing area 0 | new nyquist As you can see, some frequencies have been introduced which were not present in the original signal, and some were reduced which were present in the original signal. As long as the original sample rate is high, this is not a problem, because you can not hear them. But if the original sample rate is low, this is a problem. Cu... Stefan -- Stefan Westerfeld, Hamburg/Germany, http://space.twc.de/~stefan _______________________________________________ beast mailing list beast@... http://mail.gnome.org/mailman/listinfo/beast |
| Free embeddable forum powered by Nabble | Forum Help |