In a never-ending quest to minimize small artifacts in the sound processing, I did something drastic a few months ago: I went from using 100 quarter-Bark bands to a mere 3 wide bands covering the audible range. And to my surprise, it sounded pretty good. Not as detailed as the many-bands approach, but it also presented fewer of those troublesome sonic artifacts.
So it seems that narrow bands produce more artifacts: band crossings are more frequent, and drastic gain changes between adjacent bands are more apparent. That raises the question of how many bands are optimal.
I ran a series of experiments, using progressively more bands, up from those mere 3 bands. And as I increased the number of bands, as you might expect, the sound processing became more delicate, offering more subtle details in the music playback. But also, those sonic artifacts gradually increased.
There doesn’t seem to be a crisp answer to the question. It becomes a matter of competing effects: more detail and subtlety as you increase the number of processing bands, but also more artifacts. And so a tradeoff is necessary.
After an extended session last night, varying the number of processing bands between tests, recording the sine-sweep results, and listening to music with each selection, I am coming to the conclusion that bands spaced about 2 Bark apart are right in the sweet spot. Plenty of detail, and not too many sonic artifacts. But that sweet spot is very broadly defined. There isn’t a huge difference between 1.5 Bark and 3 Bark bandwidths.
That’s kind of interesting, because when you take an audiology test, the test frequencies at half-octave intervals just happen to also be about 2 Bark apart. Maybe the old timers were on to something when they decided on those testing frequencies.
Recall that the Bark Bands are not something hardwired into our hearing. Rather, when a loud tone presents itself, at whatever frequency, our hearing develops self-organizing loudness masking zones that correspond to critical bands, each about 1 Bark wide.
So the question about how many bands to utilize also leads into the question of how band filters are designed and used. The original Crescendo used rectangular bands with relatively steep transitions between adjacent bands. And that kind of design seems to exacerbate the sonic artifacts that occur as a swept sine traverses the bands. Each band will be prepped for different gains, with those gains steadily increasing toward higher frequencies. And so as the sine sweeps into the highest frequencies, the sonic artifacts grow stronger.
You can’t actually discern these artifacts directly during normal music playback. They are buried down in the background while the music captures your attention. But the artifacts can lead to an increased sense of noise in that background behind the music.
The sonic artifacts show up most strongly in a completely quiet setting with only a pure sine wave moving in a logarithmic sweep. And in that situation, the worst of them, at the very highest frequencies, appear to be about 50-60 dB below the carrier level (-50 to -60 dBc).
An artifact presents a whole forest of sidebands, spaced about 300 Hz apart, all up and down the audible spectrum. That 300 Hz corresponds to the processing repetition rate as the incoming sound is chopped into 3 ms segments for analysis and for updating the corrections.
And the intensity of these artifacts also depends strongly on how much vTuning you need for your hearing. Mine, at vTuning 60 dB, is strong enough to make the artifacts clearly present in the sine sweep tests.
So, a little aside on what causes the artifacts… Crescendo dynamics are designed for ear safety. When a loud signal arrives in a band, the gain is immediately reduced to safe levels. The faster that happens, and the louder the signal, the stronger the artifacts.
To avoid grotesque generation of artifacts, we use a 2 ms lookahead interval and begin reducing the gain prior to the arrival of the loud signal. But our fastest reduction takes 3 ms, due to the processing interval, and against that 2 ms lookahead there will still be a bit of a glitch.
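A minimal sketch of that lookahead ramp, with illustrative numbers (the block and lookahead lengths come from the 3 ms and 2 ms figures above; the function name and the dB-linear ramp shape are my assumptions, not Crescendo's actual code):

```python
import numpy as np

FS = 48_000                  # assumed sample rate, Hz
BLOCK = int(0.003 * FS)      # 3 ms processing block (144 samples)
LOOKAHEAD = int(0.002 * FS)  # 2 ms lookahead (96 samples)

def next_gains(current_gain_db, target_gain_db):
    """Hypothetical per-sample gain trajectory for one block: having seen a
    loud arrival in the lookahead buffer, ramp linearly (in dB) from the
    current gain toward the safe target over the 3 ms block, instead of
    applying the full reduction as an instantaneous step."""
    return np.linspace(current_gain_db, target_gain_db, BLOCK)

# Example: a band idling at +30 dB must drop to 0 dB for a loud arrival.
ramp = next_gains(30.0, 0.0)
```

Because the reduction takes 3 ms but the warning arrives only 2 ms early, the tail of the ramp still overlaps the loud signal's onset, which is the residual glitch described above.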
When a band is sitting idle with a quiet input, it is prepped for maximum gain so that you can hear the faintest details. But with no signal present, allowing raw amplification merely boosts the dust in the noise floor. And that isn’t pleasant. So under those circumstances, while the band is prepped for maximum gain if and when a signal arrives, it is throttled to provide no processing at all.
The moment some faint signal finally arrives, the band comes to life at its maximum gain setting. But if the arriving signal is really loud, that gain must be quickly reduced to protect our hearing. And the transition to sharply lower gain produces a broadband click-like sonic artifact. Turning a sound quickly off produces the same kind of artifact as turning a sound quickly on. It is the magnitude of the gain change, not its direction, and the speed at which it happens, that determines how loud the transient artifact becomes.
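A toy demonstration of that last claim, that the size and speed of the gain change sets the artifact level. This is illustrative code, not Crescendo's: we cut a 1 kHz tone by 40 dB, once as an instantaneous step and once spread over 3 ms, and compare the broadband "click" energy each injects well away from the tone:

```python
import numpy as np

FS = 48_000
t = np.arange(FS) / FS                       # 1 second of audio, 1 Hz bins
tone = np.sin(2 * np.pi * 1000.0 * t)

gain_step = np.where(t < 0.5, 1.0, 0.01)     # 40 dB cut, instantaneous
frac = np.clip((t - 0.5) / 0.003, 0.0, 1.0)  # same 40 dB cut, over 3 ms
gain_ramp = 1.0 - 0.99 * frac

def click_energy(x):
    """Energy well above the 1 kHz tone (bins >= 2 kHz), a crude measure of
    broadband click content. A Hann analysis window suppresses the
    segment-edge discontinuities so only the mid-signal gain change counts."""
    X = np.abs(np.fft.rfft(x * np.hanning(x.size)))
    return float(np.sum(X[2000:] ** 2))
```

The stepped gain injects far more high-frequency energy than the 3 ms ramp, even though both end at exactly the same gain.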
So with rectangular processing bands, that transition from off to on can become pretty fast. A better approach is to utilize triangular bandpass shapes that overlap from one band to the next. That way, when a sound is moving toward higher frequencies, the next processing band begins to get a hint of it coming before it actually arrives fully in-band. That allows the gain transition some time to more gracefully adapt to the loud signal, and those transitions now perform the same degree of gain reduction, but spread over a longer period of time. Hence the sonic artifacts drop considerably in their amplitude.
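A sketch of such overlapping triangles on the Bark scale, using the 2-Bark spacing from the sweet-spot discussion above (the exact centers and span are illustrative). Each triangle reaches zero at its neighbors' centers, so the two weights covering any frequency always sum to 1, and a sweeping tone fades into the next band before it leaves the current one:

```python
import numpy as np

SPACING = 2.0                            # Bark between band centers
centers = np.arange(0.0, 26.0, SPACING)  # ~25 Bark spans the audible range

def tri_weight(z_bark, k):
    """Amplitude weight of Bark frequency z_bark in band k: a triangle
    peaked at centers[k], falling to zero at the adjacent centers."""
    return float(np.clip(1.0 - abs(z_bark - centers[k]) / SPACING, 0.0, 1.0))
```

At z = 5 Bark, halfway between the bands centered at 4 and 6 Bark, each band's weight is 0.5 and they sum to 1, so summing the two filtered channels reconstructs the signal at its original amplitude.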
But we use FFT processing to perform all the band splitting. FFTs present a linear frequency space, where each FFT cell occupies exactly the same number of Hz across the entire spectrum. That couldn’t be more misaligned with our hearing. We need something like Bark bands, whose bandwidth grows as the frequency increases.
[An FFT bin bandwidth is also ludicrously narrow at the high frequencies. Too little sound power in a single FFT bin, leading to wildly fluctuating gain estimates. Rather, it turns out we need to know the effective power per Bark band unit, regardless of the measurement bandwidth. And so we need to sum together many FFT bins at the highest frequencies where the sound is at its faintest levels, to get a decent power estimate per unit of Bark frequency.]
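A sketch of that pooling, mapping linear FFT bins onto the Bark scale and summing. The `bark()` function is the well-known Zwicker-Terhardt approximation; the sample rate and FFT size are illustrative assumptions, not necessarily Crescendo's:

```python
import numpy as np

FS, NFFT = 48_000, 2048
freqs = np.fft.rfftfreq(NFFT, 1.0 / FS)  # linear bin centers, ~23 Hz apart

def bark(f_hz):
    """Zwicker-Terhardt approximation to the Bark scale."""
    return 13.0 * np.arctan(0.00076 * f_hz) \
         + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

def band_power(psd, lo_bark, hi_bark):
    """Total power in one Bark band: sum every FFT bin whose center falls
    inside [lo_bark, hi_bark). High bands pool many narrow bins, so a faint
    treble signal yields a stable power-per-Bark estimate instead of a
    single wildly fluctuating bin."""
    z = bark(freqs)
    return float(psd[(z >= lo_bark) & (z < hi_bark)].sum())
```

Note how lopsided the pooling is: the band from 20 to 21 Bark spans over a kilohertz and dozens of FFT bins, while the band from 1 to 2 Bark spans only about 100 Hz and a handful of bins.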
And so, when I design a Bark filter, that triangular bandpass shape is symmetric only in Bark frequency space. But translated back to linear frequency space for the FFT, they become highly asymmetric. But that’s okay, except that you must be very careful about how you interpret the results coming out of each band filter.
With rectangular bands, there is little question about which band owns the signal. It is either in one band and no others, or it isn’t. When summing FFT cell power to measure loudness, a signal near a band edge still gets fully represented by its owning band.
But when you use overlapping triangular bands, that signal is no longer solely owned by one channel. It contributes power to two adjacent channels in varying degree, depending on how far the signal is from channel center frequency.
Signal Reconstruction vs Power Measurement
These filters are really used for two completely separate purposes. On the one hand, they are used to split a musical signal into channels, for later reassembly back into the whole music sound once we know how much gain to apply to each channel. The other purpose of the filters is to measure how much power is present at each frequency. These are not the same kind of channel processing.
With a hardware filter, we have no choice but to do the wrong thing. That is, a hardware filter takes a sound and spits out that portion of the sound that is present in the filter bandwidth around its filtering frequency. But the filter shapes the output. This isn’t a transparent presentation. The filter profile multiplies the incoming sound spectrum.
And so while you can take a bank of parallel abutting filters and add them all together to reconstitute the original sound, you cannot reliably measure total sound power this way.
Imagine a completely quiet environment, and we present a swept sine wave signal. As that signal traverses a band, the sound intensity at the output of a single filter will rise from nothing, to maximum, and then back down to nothing. It is a smooth, rounded, transition as the sine sweeps through the filter bandpass. Hence its power might be seen as non-constant, even though we haven’t changed the amplitude of the input signal at all.
But the magic of digital signal processing allows us to manage this situation better. As the sound traverses our passband, once it crosses our center frequency it also begins to enter the next higher passband. Directly summing the two channels produces the correct output for all sounds: the overlapping filter shapes sum to a flat response across the central portion, so no spectral shaping occurs. And we want the measured power of the signal to be equally unchanging.
So that means that for measuring sound power, we actually want to apply the filter shaping to the PSD, rather than measuring the power of the filtered signal. If a sound is located exactly halfway between two adjacent filters, its amplitude is half its peak amplitude in each of the filters. When we reconstruct the sound, we add the two filtered sub-band signals and get back our original sound at the original amplitude.
But if we did that filtering first, then computed the power in the signal, it wouldn’t show as half the power in one band and half in the next. It would show 1/4 in each band, summing to 1/2 overall. In other words, as the signal at constant amplitude moves toward higher frequencies, its power would appear to fluctuate from peak to half and back again, every time it crosses between filter bands. That isn’t correct.
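The halfway-point example as a few lines of arithmetic, assuming triangular amplitude weights of 0.5 in each of the two straddling bands:

```python
# A unit-power signal sits exactly halfway between two adjacent bands.
w = 0.5                      # amplitude weight in each adjacent band
p_signal = 1.0               # true power of the constant-amplitude signal

# Wrong order: filter the signal first, then measure power per band.
# Squaring the half-amplitude sub-band signals gives 1/4 per band.
p_filtered = (w ** 2) * p_signal        # 0.25 per band -> 0.5 total: a dip

# Right order: form the PSD first, then apply the band weighting to it.
p_weighted = w * p_signal               # 0.5 per band -> 1.0 total: correct
```

The filter-then-measure order makes a constant-amplitude sweep appear to lose half its power at every band crossing; weighting the PSD keeps the total constant.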
The solution is to first form the PSD in the Fourier domain, and then apply the filtering to it, to determine how much power falls in each band. This corresponds to coherent signal addition, as opposed to incoherent signal summing. It is the same underlying signal, and so its contributions to adjacent channels really should be added coherently. But you can’t do that with hardware filters, only with DSP filters, where we can do all kinds of magical things.
Apodization is the term used for the process of “cutting off the feet” of a filter’s stop band response. This is important when making power measurements, because the stop band ripple falsely allows out of band power to contribute to the band’s measured power. We want those stop band ripples to be small.
So we don’t simply take the Bark triangle passbands, jumble them together with our estimated gain settings, and apply it to the incoming signal. That would falsely diminish sound levels in the treble region due to a strong bass component. Removing stop band contribution to each filter helps keep the measured power and consequent gain setting for each passband more isolated to the band’s frequency neighborhood.
The way to apodize a filter response is to transform over to its time domain representation, apply a window to it, then transform back to frequency space. After that, we can apply the filtering to the signal via spectral multiplication.
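A sketch of that round trip, using a simple rectangular passband as a stand-in for the actual Bark triangle (all sizes here are illustrative). We go from the frequency-domain filter to its truncated impulse response, apply a Hann window, and compare the stopband with and without the window:

```python
import numpy as np

N, M = 4096, 256                        # analysis size, impulse-response length
H = np.zeros(N // 2 + 1)
H[200:400] = 1.0                        # ideal passband, sharp edges

h = np.fft.fftshift(np.fft.irfft(H))    # time-domain response, centered
mid = N // 2
h_cut = h[mid - M // 2 : mid + M // 2]  # truncated: abrupt endpoints -> ringing
h_apod = h_cut * np.hanning(M)          # Hann window: smooth decay to zero

def stopband_peak(hh):
    """Worst stopband leakage, measured on a fine zero-padded grid,
    well above the passband edge."""
    Hm = np.abs(np.fft.rfft(hh, N))
    return float(Hm[800:].max())
```

The windowed version's passband is slightly broadened, but its stopband leakage drops by orders of magnitude relative to the abruptly truncated response, which is exactly the tradeoff described below.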
The image just above this paragraph shows the temporal response of one of our Bark bandpass filters, prior to apodization. Notice that the response at each endpoint has an abrupt discontinuity. That produces a ringing response in the spectral domain. All other frequencies, apart from those of interest in the band pass, will contribute to power estimates and cause a falsely higher power measurement in that band.
By applying a Hann window to the raw filter response, we end up with the next graph. See how smoothly the response decays to zero at each end of the time window?
The net effect of windowing, or apodizing, is to slightly broaden the spectral passband, but it also drastically lowers the stopband response. It also smooths out abrupt gain transitions between adjacent bands, acting as a kind of smoothing interpolator.
Had we not performed filter apodization, the spectral artifacts would have been much stronger, even possibly quite objectionable.
PS: A 1/4 Bark bandwidth at the higher frequencies corresponds to about 1/12 octave, or a semitone. Our loudness response versus frequency is not as precise as our pitch perception, and so such narrow loudness analysis is probably unwarranted.