Latest Research Crescendo

For the past several years I have been pondering what a minimum Crescendo would look like. I began, many years ago, thinking that in the interest of highest accuracy I would use 100 quarter-Bark channels of correction. That may well be accurate, but can a human discern the difference on real music between that and a more stripped-down implementation?

I did find that using as few as 3-5 bands, in emergencies, provides at least some benefit. But there is no comparison between that emergency-grade Crescendo and a much more complex implementation.

So how few bands can we get away with? I found along the way that our hearing shares something in common with our sense of vision. It has been known for some time that our ability to discern spatial variations of color is much less acute than our ability to sense intensity variations. That's why the Bayer matrix, found in CCD cameras everywhere, works so well for us. The Bayer matrix trades off color spatial resolution against intensity resolution, offering only half the color resolution that the sensor provides for intensity.

Our hearing is like that too. We can sense pitch variations of only a few cents (a cent is 1/100 of a semitone). But our sense of loudness variation with frequency is much coarser.

My current implementation uses 11 channels, each 2.5 Bark wide, spanning the audible range from 20 Hz to about 16 kHz. There is not much to be found in music above 15 kHz, and most of us wouldn't be able to hear it anyway. That's why perceptual coding of audio cuts off around 15 kHz. And if you have sensorineural loss then, for sure, you won't be able to hear it.
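As a concreteness check, here is a minimal sketch of one plausible layout for those 11 channels, using the Zwicker & Terhardt approximation to the Bark scale. The even spacing and slight overlap are my assumptions for illustration, not necessarily Crescendo's actual layout:

```python
import math

def hz_to_bark(f):
    """Zwicker & Terhardt (1980) approximation of the Bark scale."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def bark_to_hz(z, lo=1.0, hi=20000.0):
    """Invert hz_to_bark by bisection (it is monotonic, so this is safe)."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if hz_to_bark(mid) < z else (lo, mid)
    return 0.5 * (lo + hi)

# 20 Hz .. 16 kHz spans roughly 0.2 .. 24.1 Bark -- about 24 Bark in all.
z_lo, z_hi = hz_to_bark(20.0), hz_to_bark(16000.0)
n_bands, width = 11, 2.5          # 11 channels, each 2.5 Bark wide

# Spread the 11 centers evenly; neighbors land ~2.1 Bark apart, so the
# 2.5-Bark-wide channels overlap slightly.
step = (z_hi - z_lo - width) / (n_bands - 1)
for k in range(n_bands):
    zc = z_lo + width / 2 + k * step
    print(f"band {k:2d}: center {bark_to_hz(zc):7.0f} Hz ({zc:5.2f} Bark)")
```

The audible span of 20 Hz to 16 kHz works out to roughly 24 Bark, so eleven 2.5-Bark channels cover it with a little overlap between neighbors.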

I find that I can barely discern the difference between this 11-band Crescendo and the 100-band version. Music sounds very, very good in the little version.

Crescendo works by taking in the digital audio stream and analyzing the power in each analysis channel, independently for the left and right audio channels. Then, knowing how much power is in each band, it can determine the instantaneous nonlinear compression gain to apply to the audio within that Bark channel. It does this about 300 times per second, on overlapping chunks of the audio stream.
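In skeletal form, the loop looks something like the Python sketch below. The hop size, the 75% overlap, the windowing, and the placeholder 2:1 compression law are all my illustrative assumptions (as are the names band_powers, compression_gains, and process); Crescendo's actual gain law is loudness-based and more elaborate:

```python
import numpy as np

FS = 48_000                  # sample rate (Hz); an assumption for this sketch
HOP = FS // 300              # ~300 gain updates per second -> 160-sample hop
BLOCK = 4 * HOP              # overlapping analysis chunks (75% overlap assumed)

def band_powers(chunk, filters):
    """Estimate the power in each Bark-spaced analysis channel for one chunk.
    `filters` holds each channel's magnitude-squared response, sampled on
    the rFFT bins of a BLOCK-length chunk: shape (n_bands, BLOCK//2 + 1)."""
    spec = np.abs(np.fft.rfft(chunk * np.hanning(len(chunk)))) ** 2
    return filters @ spec

def compression_gains(powers, thresh=1e-6, ratio=2.0):
    """Placeholder nonlinear compressor: above `thresh`, output level rises
    1/ratio dB per dB of input. This only shows the overall shape."""
    over = np.maximum(powers / thresh, 1.0)
    return over ** (0.5 * (1.0 / ratio - 1.0))  # power ratio -> amplitude gain

def process(samples, filters):
    """One audio channel; left and right are processed independently."""
    out = np.zeros_like(samples)
    for start in range(0, len(samples) - BLOCK + 1, HOP):
        chunk = samples[start : start + BLOCK]
        gains = compression_gains(band_powers(chunk, filters))
        # ...apply each band's gain to its portion of the chunk, then
        # overlap-add the result into `out` (resynthesis omitted here)...
    return out
```

At a 48 kHz sample rate, about 300 updates per second corresponds to a hop of 160 samples.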

To estimate the power in each channel we need to use shaped bandpass filters that bear some resemblance to what our cochlea offers. The filters I'm now using for power estimation look like this:

Individual filters are shown in green, while their power-sum is shown in red. Notice how they broaden toward higher frequencies. And notice the asymmetry in their response: steeper on the low-frequency skirt than on the high-frequency skirt.

The skew in these filters occurs because we are taking constant-width, skewed triangular filters from Bark frequency space and mapping them to our conventional Hz space. In Bark frequency space, the filters actually have a steeper roll-off (-20 dB/Bark) on the high-frequency side than on the low side (-10 dB/Bark), to match how our cochlear filters behave. That is to say, in Bark space the filters are skewed in the opposite direction from how they appear in linear Hz space. (This point needs reexamination! See below…)
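A small sketch makes the mapping concrete. The two slopes below are the ones quoted above; everything else (the Bark formula, the function names, the choice of center frequency) is an illustrative assumption:

```python
import numpy as np

def hz_to_bark(f):
    """Zwicker & Terhardt Bark approximation, as in the earlier sketch."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def skewed_skirt_db(f_hz, zc):
    """Skirt gain in dB for a channel centered at `zc` Bark:
    -10 dB/Bark on the low side, -20 dB/Bark on the high side."""
    dz = hz_to_bark(np.asarray(f_hz, dtype=float)) - zc
    return np.where(dz < 0.0, 10.0 * dz, -20.0 * dz)

# Evaluate one channel (centered near 1.3 kHz) on a linear Hz grid.
freqs = np.linspace(20.0, 16000.0, 2000)
response_db = skewed_skirt_db(freqs, zc=10.0)
```

Because each Bark spans more Hz as frequency rises, the steep -20 dB/Bark high side gets stretched over many Hz while the shallow low side gets compressed, which is why the skew appears reversed on a linear Hz plot.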

The next graph shows these same filters, now plotted in log-Hz space. That is closer to, but not the same as, Bark frequency. But here you can see that these filters are close to constant-Q filters, whose frequency-relative shape stays nearly constant as frequency rises.
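You can check the near-constant-Q behavior numerically. Here I use the Zwicker & Terhardt critical-bandwidth approximation as a stand-in for channel width (Crescendo's channels are 2.5 Bark wide, so their Q would be lower, but the trend is the same):

```python
import math

def critical_bw(f):
    """Zwicker & Terhardt critical bandwidth (Hz) at center frequency f (Hz)."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f / 1000.0) ** 2) ** 0.69

for fc in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{fc:5d} Hz: bandwidth {critical_bw(fc):6.0f} Hz, "
          f"Q ~ {fc / critical_bw(fc):4.1f}")
```

Above about 1 kHz the Q hovers between roughly 5 and 7, while the low-frequency bands are relatively wider, so the constant-Q picture holds best toward the top of the spectrum.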

This is one of the major failings of current-day hearing aids, which use WOLA (weighted overlap-add) filtering, akin to FFT processing. Each channel in a WOLA-filtered hearing aid has constant bandwidth, with shape symmetry in Hz space, not Bark space. And so if you have a 16-channel hearing aid, it likely covers (poorly) only about 500 Hz to 5-6 kHz in symmetric, equal-width Hz channels.

Adding insult to injury, a musical sound spectrum rolls off toward higher frequencies, which leads to power starvation in narrow, constant-width analysis channels. The signal-to-noise ratio declines, and the highest-frequency corrections become increasingly worthless.

Our hearing does not use equal-width channels, at least not in our conventional Hz frequency space. It works approximately in Bark-scaled frequency space, and in that space the channels are of approximately equal width. Nor are they at fixed frequency locations; rather, they are self-organizing channels that form around the loudest spectral peaks in the sound.

And the broadening of the channels toward higher frequencies means that, despite the rolloff in the audible spectrum, we sum up more and more power at the highest frequencies. That approximately balances the decline in spectral power, keeping the signal-to-noise ratio roughly constant.
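To see the difference in numbers, here is a sketch that integrates a 1/f ("pink") power spectrum, my stand-in for the rolloff of musical spectra, over 16 constant-width channels versus 16 equal-Bark-width channels spanning the same 500 Hz to 5.5 kHz range as the hearing-aid example above:

```python
import math

def hz_to_bark(f):
    """Zwicker & Terhardt Bark approximation, as in the earlier sketches."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def bark_to_hz(z, lo=1.0, hi=20000.0):
    """Invert hz_to_bark by bisection."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if hz_to_bark(mid) < z else (lo, mid)
    return 0.5 * (lo + hi)

def pink_power(f1, f2):
    """Power of a 1/f ("pink") spectrum integrated from f1 to f2 (Hz)."""
    return math.log(f2 / f1)

lo_f, hi_f, n = 500.0, 5500.0, 16      # the 16-channel hearing-aid example
w = (hi_f - lo_f) / n
const_hz = [pink_power(lo_f + k * w, lo_f + (k + 1) * w) for k in range(n)]

z1, z2 = hz_to_bark(lo_f), hz_to_bark(hi_f)
edges = [bark_to_hz(z1 + k * (z2 - z1) / n) for k in range(n + 1)]
bark_bands = [pink_power(edges[k], edges[k + 1]) for k in range(n)]

print("constant-Hz bands, first vs last:", const_hz[0], const_hz[-1])
print("Bark-width bands, first vs last:", bark_bands[0], bark_bands[-1])
```

The constant-width channels collect roughly eight times less power at the top of that range than at the bottom, while the Bark-width channels stay nearly level.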

So, with fixed channel assignments, even our Crescendo remains a crude approximation of what would really be needed. But even so, it is a major improvement over anything else, and certainly over nothing.

  • DM

[Point to examine: Should we be making our analysis filterbank resemble what the cochlear filter bank is doing?

We aren't trying to simulate a cochlea. Our cochleae will impose their own filtering on top of what we do in Crescendo. So our job should be simply to estimate the power in each Bark channel. There ought not to be asymmetric rolloff on the filter skirts in Bark space.

If anything, it would seem that we are giving a double shot at upward masking by having our filters shaped like the cochlear filters. The cochlea will do its own upward masking. Or will it? Maybe we need the extra masking help here, to overcome the damaged sensors in our cochlea?

Despite this uncertainty, it sounds very, very good. So here again is an indication that our loudness sense is not very acute compared with our sense of tuning.]
