Head-Related Transfer Function (HRTF)
What are the Pinnae?
Take notice of your ears -- examine your outer ear, that flap of cartilage covered with skin.
Perhaps it is pierced. These flaps are called your pinnae. The
pinnae are a rather odd shape -- longer than wide,
open at the front, curled over at the top and some of the rear.
What do the Pinnae do?
Sound that reaches inside your ear canal has to move around and
along your pinnae to get there. Depending on the direction from which
the sound comes, the sound will move across a different part of
the pinnae. If the sound comes directly in, it may arrive largely
unchanged. If it comes from behind you, the sound will move
along your hair on the back of your head, wrap around the back
edge of the pinnae, then
enter the ear canal, bouncing off that little tab in the front.
If coming from above, the sound passes over the hair on top of
your head, wraps around the curved edge on the top, bounces off
the bottom, and into the ear canal. While doing this, the sound
is changed -- certain frequencies are damped, others are
reinforced. Delays in the paths taken by different frequencies
cause minute phase changes. For a given direction of approach,
the sound will be colored in a specific way for a specific person.
This type of sound
coloring, once thoroughly analyzed, can be simulated with digital
filters. A filter which produces a specific frequency and
phase response can be completely described with a mathematical formula
called a transfer function. Since the filters used
to indicate the direction of a sound are affected by the shape and size
of a person's head and ears, these filter functions are called
head-related transfer functions (HRTFs). By selecting the
appropriate pair (one for each ear) of transfer functions for a
sound coming from a given direction and applying it to the sound
being directed at each ear, we may be able to alter the direction
the sound is perceived as coming from -- making it come from
above, below, or behind -- even when the speakers are directly in
front of the listener.
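As a minimal sketch of the idea above, the following Python applies one filter per ear to a single mono source. It assumes each transfer function is available as a short FIR impulse response; the names `fir_filter`, `apply_hrtf_pair`, `LEFT_TAPS`, and `RIGHT_TAPS` are hypothetical, and the tap values are invented for illustration, not measured from any ear.

```python
def fir_filter(signal, taps):
    """Convolve `signal` with the impulse response `taps` (direct-form FIR)."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * signal[n - k]
        out.append(acc)
    return out


def apply_hrtf_pair(mono, left_taps, right_taps):
    """Filter one mono source with a separate impulse response per ear."""
    return fir_filter(mono, left_taps), fir_filter(mono, right_taps)


# Hypothetical impulse responses, NOT measured data: the right ear hears
# the sound one sample later and quieter, as if the source were to the left.
LEFT_TAPS = [1.0, 0.0, 0.0]
RIGHT_TAPS = [0.0, 0.5, 0.25]

left, right = apply_hrtf_pair([1.0, 0.0, 0.0, 0.0], LEFT_TAPS, RIGHT_TAPS)
print(left)   # [1.0, 0.0, 0.0, 0.0]
print(right)  # [0.0, 0.5, 0.25, 0.0]
```

Feeding in a single impulse simply reads back each ear's impulse response, which is a quick way to confirm that the two channels are being colored differently.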
For a fun experiment to hear how this works, play a song on
your stereo, preferably with some high frequencies like a snare
drum. Sit with your back to the speakers, a few feet away. Hold
up your hands, palms cupped and facing backwards towards the
speakers. Place the edges of your hands which are opposite your thumbs
on the side of your head, just in front of your ear. Your fingers are now
cupping over the top of your pinnae, palms open and facing the
speakers. Fiddle with the angle and shape of your hands. You
should be able to find an arrangement in which the music appears to be
coming from your side instead of from behind. Your hands are acting
like artificial ears. For a more extreme effect, shave your
head, duct tape your ears to your head, glue on fake ears in the
opposite direction, and put on a wig facing backwards. Now, sounds in
front of you will sound like they are coming from the back
and vice versa. The only problem is that the interaural time
difference (ITD) cues contradict the HRTF cues,
which may make you dizzy and fall down! Or maybe it's the wig...
Another experiment - scratching top & bottom; note about the
resonant effects of membranes of different sizes and tensions.
Pitfalls -- pinnae uniqueness
Pinnae are like fingerprints -- no two are the same. They uniquely identify you.
Because of this, each person filters sounds differently! Each person's brain is
tuned to recognize and process their own pinnae's filtering. Perhaps now you can see why HRTF
design is so difficult -- no two people have the same HRTF! It is therefore
impossible to make a single HRTF that will work well for everyone. There are three major
approaches to selecting an HRTF:
- Create a custom HRTF for each listener. This requires that the person's pinnae
be measured and that a suitably capable expert design an HRTF to match -- an expensive process.
- Select an HRTF representing a typical ear that works acceptably well for as many people
as possible.
- Select an HRTF based on what the available hardware can support.
Even under ideal circumstances, there will always be some people
who can't hear the effect at all. There are several explanations for this:
- Their ears may be very different from the HRTF you are using.
- They may have damaged their high-frequency hearing by listening
to loud music earlier in life. The most powerful spatial imaging cues
are in the high-frequency range.
- Some people are very sensitive to incorrect lighting: they can spot
a fake photo instantly and can tell in a movie when a real model is
filmed and when a computer-generated model is used. Likewise, some
people are very perceptually sensitive to spatial sound direction. As
their heads make minute motions, affecting the ITD, and their ears twitch,
affecting the HRTF, their brains spot a forgery! They can then filter
out the false cues and detect the real source of the sound: speakers.
Does this mean we should just give up? No! An old adage says:
"Something is better than nothing." Likewise, sounds can be livened up,
moved about, and intriguingly enhanced through the use of HRTFs. For many
people, it will sound positively amazing; a few won't hear
much difference. We're doing this for the first group.
Measurement of a person's own HRTF
Methods of HRTF capture -- canal sampling (incl. ripping from studies)
& model-casting/ray-tracing algorithmic (MUST model porosity
of cartilage & damping effects of ear hair to be an acceptably accurate
approximation)
Topography of Surfaces
Different types of objects require different numbers of
dimensions to describe them. A flat tabletop and a book cover
each have two dimensions. A box, a kitten, and a bowl filled with
soup each have three dimensions. All curved planar surfaces can be
considered to have between two and three dimensions. Some
examples of curved planar surfaces are a mountain, a range of
rolling hills, a crumpled sheet, and the surface of a sea at a
given moment. The inner surface of the pinna that is critical to
determining the pinna's transfer functions is also a curved
planar surface. In mathematics, objects like the pinnae
with "in-between" dimensions are called fractals
(fractional dimensions). Surfaces with higher dimensionality
are described as more topographically complex.
complexity & freq. range of
top/down vs front/back. More people can hear top/down because 8k.
Filter Design
First, let's analyze what range of frequencies and precision we will
need to have. Once this is accomplished, we will try to find the simplest
filter topography (interconnection arrangement) that will yield
acceptable results. Since the candidate topographies may be complex, unfamiliar filters,
we will not have easy access to formulas that help us select
coefficients to tune the filter. Thus for each candidate topography,
we will need to find a way to select coefficients. A filter is
a set of interconnected nodes that multiply a number of inputs by
coefficients, yielding an output. This also describes a neural network. Programming a
neural network involves using back-propagation to converge upon a
set of coefficients that will yield the response we desire. We
can use the same method to program our filters, though be warned --
it is more difficult than it sounds.
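To make the idea of training a filter concrete, here is a hedged sketch of the simplest possible case: a single linear node (a plain FIR filter) whose coefficients are adjusted by gradient descent on the output error. This is the LMS rule, which is what back-propagation reduces to for one linear layer; it stands in for, and is much simpler than, training a complex topography. The function names, the target taps, the step size, and the step count are all assumptions chosen for illustration.

```python
import random


def fir_output(taps, window):
    """Output of one linear node: the weighted sum of recent inputs."""
    return sum(h * x for h, x in zip(taps, window))


def train_fir(target_taps, steps=5000, rate=0.05, seed=0):
    """Learn FIR coefficients that imitate `target_taps` on random input."""
    rng = random.Random(seed)
    n = len(target_taps)
    taps = [0.0] * n      # coefficients being learned
    window = [0.0] * n    # most recent input sample first
    for _ in range(steps):
        window = [rng.uniform(-1.0, 1.0)] + window[:-1]
        desired = fir_output(target_taps, window)
        error = desired - fir_output(taps, window)
        # Gradient step: move each coefficient in proportion to its input.
        taps = [h + rate * error * x for h, x in zip(taps, window)]
    return taps


TARGET = [0.5, -0.3, 0.2]  # hypothetical response we want to imitate
learned = train_fir(TARGET)
print([round(h, 3) for h in learned])  # [0.5, -0.3, 0.2]
```

Here the target is itself a known filter, so the match is exact. With a measured response and a topography containing feedback, the error surface is far less friendly, which is where the difficulty mentioned above comes in.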
What range of frequencies is most important in implementing an HRTF?
A very rough first approximation can be obtained by considering
the size of the pinnae and the speed of sound:
From the top to the bottom of a typical pinna is 2.5 inches.
A sound whose wavelength equals that distance is, roughly, the lowest frequency significantly affected by the pinna in that direction:
(1130 ft/sec) / (2.5 inches / (12 inches/ft)) = 5424 Hz. Most
of the up/down filtering will happen at higher frequencies.
From the front to the back of a typical pinna is 1.5 inches.
By the same reasoning, the lowest frequency significantly affected by the pinna in that direction
is (1130 ft/sec) / (1.5 inches / (12 inches/ft)) = 9040 Hz. Most
of the front/back filtering will happen at higher frequencies.
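The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same rough approximation (frequency = speed of sound / dimension); the function name is hypothetical, and the speed of sound is the 1130 ft/sec figure used in the text.

```python
SPEED_OF_SOUND_FT_PER_S = 1130.0  # value used in the text


def lowest_affected_frequency_hz(size_inches):
    """Frequency whose wavelength equals the given pinna dimension."""
    size_feet = size_inches / 12.0
    return SPEED_OF_SOUND_FT_PER_S / size_feet


top_to_bottom_hz = lowest_affected_frequency_hz(2.5)
front_to_back_hz = lowest_affected_frequency_hz(1.5)
print(round(top_to_bottom_hz), round(front_to_back_hz))  # 5424 9040
```

Plugging in a different ear size shows how quickly these thresholds move: a smaller pinna pushes both cutoffs higher.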
The first thing to notice here is that people with poor hearing often have lost
their ability to hear sounds in this range. Such people are likely to have
poor localization of sound to begin with and probably cannot hear anything
special from HRTFs.
Secondly, let's consider what kind of filter system we would need to do
the needed filtering accurately and in real time. In a system with a 50 kHz sampling rate, a
10-sample-long finite impulse response (FIR) filter can implement a transfer function focusing
on frequencies above 5000 Hz. For good quality, especially in the higher frequencies,
oversampling is needed to improve the precision of the frequency response. 8x oversampling
with interpolation gives a 400 kHz sampling rate and requires a 40-sample-long
FIR filter. Two such filters are needed in a stereo system,
yielding a system that must be able to perform about 2 * 400,000 * 40 = 32 million
24-bit multiplies per second per 3D stream. To handle 10 streams in a
quadraphonic system, you need 10 * 4 * 400,000 * 40 = 640 million 24-bit multiplies
per second. If your system is going to do anything else beyond calculating HRTFs,
you'll need even more computational power.
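The multiply counts above all come from one product: sampling rate times filter length times output channels times streams. A small sketch, with a hypothetical function name and the figures from the text as inputs:

```python
def multiplies_per_second(sample_rate_hz, taps, channels, streams=1):
    """One multiply per tap, per output sample, per channel, per stream."""
    return sample_rate_hz * taps * channels * streams


stereo_one_stream = multiplies_per_second(400_000, 40, channels=2)
quad_ten_streams = multiplies_per_second(400_000, 40, channels=4, streams=10)
print(stereo_one_stream)  # 32000000
print(quad_ten_streams)   # 640000000
```

Since every factor scales the load linearly, halving the filter length or the oversampling ratio halves the cost, which is why the choice of filter structure matters so much.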
A load of this size is obviously intense and expensive to compute. It is worthwhile
to explore other filter topographies. If we can find self-similarity in the
filter response, we can structure the FIR filter hierarchically rather than flat, and reduce
our computational complexity by a log function. Another method is to explore
infinite impulse response (IIR) filters. Such filters use feedback
and can therefore become unstable, but they can yield considerable improvements in processing speed.
They can also be quite difficult to map to an arbitrary filter response, which
is what we'll need to do.
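To illustrate the feedback trade-off, here is a sketch of the simplest IIR filter, a single pole: each output is the current input plus a scaled copy of the previous output. The function name and the feedback values are hypothetical and chosen only to show a decaying case and a runaway case; this is not an HRTF filter.

```python
def one_pole_iir(signal, feedback):
    """Compute y[n] = x[n] + feedback * y[n-1], one multiply per sample."""
    out = []
    previous = 0.0
    for x in signal:
        previous = x + feedback * previous
        out.append(previous)
    return out


impulse = [1.0] + [0.0] * 4
stable = one_pole_iir(impulse, 0.5)    # |feedback| < 1: response decays
unstable = one_pole_iir(impulse, 2.0)  # |feedback| > 1: response grows
print(stable)    # [1.0, 0.5, 0.25, 0.125, 0.0625]
print(unstable)  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

A single multiply per sample produces a response that never quite ends, where an FIR filter would need one tap for every sample of that tail. The cost is that a badly chosen coefficient makes the output grow without bound.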
Multiple Filter Candidate Topography from Surface Topography Using Fractal Auto-correlation
Methods of filter topography extraction - using fractals to extract
topographical similarities, suggesting # of units to use
Mapping surfaces into higher dimensions, yielding lower complexity.
An FIR filter is basically one dimensional (between 1 & 2). Each fold can
increase dimensionality, but reduces size of calculations.
n-dimensional bifolding & p-folding
Coefficient Extraction using Back-Propagation
Methods of coefficient matching - feed to neural networks
& train using modified back-propagation to a wide range of digital filter
topographies. When close match occurs, do quick test of each on subjects.
Further develop promising ones to reduce to essentials, test against
subjects. Notes: This takes a lot of time & back propagation and imagination
can produce an infinite variety of solutions. The more time spent (months with
experts, possibly years to get good ones), the greater likelihood of finding
a kick-ass topography that can be efficiently digitally modeled with wickedly
stark imaging from compelling coefficients.
Filters must be pretty fancy/surprisingly clever to do precise phase work
at frequencies close to R/2 (half the sampling rate R).
Pitfall -- Engineering Time vs. Computational Complexity
As we have seen, the pinnae are extremely complex
mathematical objects and there is absolutely no way to make this work
by fiddling with knobs or guessing.
The pinnae are by far the most difficult physical cue to get right. It can take 100 times longer to make
this work than to get all the rest working in real time.
Good results can be had by ripping curves from
journals and using FIR filters, but at a high computational cost, usually
involving fully customized ASIC chips. This is how Diamond implemented Aureal's (now Creative's)
algorithms -- with 15-stage FIR filters, allowing 4 streams per chip.
For both, best to use own ear initially.
Implementation: algorithmic vs modeled vs sampled (also simulated, but don't mention).
If using speakers, overcompensate