

3D Positional Audio Algorithms

In my ears, movement in space is more important and more exciting than melody or harmony. We are not used to hearing patterns in movement though -- so most people are unaware of this.

When there is motion, listening involves the listener even more -- it draws him in, seduces him and allows him to see things he never saw before. Music can be hallucinogenic, can bring you on an experience more vivid, more real, more meaningful than any drug (even caffeine).

  1. What is positional audio?
  2. What's it good for?
  3. How does positional audio work?
    1. Perceptual Sound Characteristics
    2. Physical Cues
      1. Interaural Intensity Difference (IID)
      2. Interaural Time Difference
      3. Early Reflections
      4. Reverb Wash
      5. Head Related Transfer Functions (HRTF)
  4. 3D Positional Audio Algorithms by X. J. Scott
    1. XJS Type 1 3D Audio
    2. XJS Type 2 3D Audio
    3. XJS Type 3 3D Audio
    4. XJS Type 4 3D Audio

What is positional audio?

Audio is sound and position is location. Positional audio is a technology for placing sounds in space.

Sounds you hear in the world have characteristics associated with them. Your brain uses physical cues to figure out sound characteristics. These physical cues then enable you to recognize what a sound is and where it is coming from. Positional audio technology deals with where. It tries to emulate physical cues so that your brain thinks sound is coming from a certain direction, is a certain distance away, is in a certain sized room, and so forth -- when it is really coming from speakers or headphones.

What's it good for?

I have always been interested in the movement of sounds and melodies through space. In my early compositional work, I would create motives of motions -- a given melody would move through space in a certain way. I discovered that adding spatial motion to a composition not only makes it more alive, but enables me to write much more dense and complex music, and yet still have it be listenable without turning into a 'wall of sound.'

How does positional audio work?

There are many different techniques for creating positional audio. A simple one is called stereo panning, which is often used in systems that feature stereophonic sound (two speakers). You change the loudness of a sound in the right speaker relative to the left one, which causes the sound to appear to be closer to one speaker than the other. Moving sounds through space by changing the volume at which the sound is sent to each speaker is called 'panning'. Another, similar technique is called quadraphonic panning. It is the same idea, but is used in systems that feature quadraphonic sound (four speakers instead of two).
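As a rough illustration of stereo panning, here is a minimal Python sketch. It assumes a constant-power (sine/cosine) pan law, which is one common choice and is not something this article specifies; the function names and the -1.0 to +1.0 pan range are hypothetical.

```python
import math

def constant_power_pan(pan):
    """Return (left_gain, right_gain) for pan in [-1.0, 1.0].

    -1.0 is hard left, 0.0 is centre, +1.0 is hard right.
    The sine/cosine law is an assumed choice; it keeps
    left_gain**2 + right_gain**2 equal to 1 at every position.
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must be between -1.0 and 1.0")
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan to 0 .. pi/2
    return math.cos(angle), math.sin(angle)

def pan_mono_to_stereo(samples, pan):
    """Scale a mono sample list into a list of (left, right) pairs."""
    left_gain, right_gain = constant_power_pan(pan)
    return [(s * left_gain, s * right_gain) for s in samples]
```

Sweeping `pan` from -1.0 to 1.0 over time moves the sound from the left speaker to the right one. A quadraphonic version would need a second pan value for front/back.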

But the way your brain hears sound is much more complex. Your brain uses many different cues to determine where in space a sound is coming from -- whether to the left, the right, above, below, in front, or behind. Among these cues are interaural intensity difference, interaural time difference, early reflections, reverb wash, and head related transfer functions (HRTF).

Let's now look at the characteristics we hear in sounds; afterwards we'll examine the physical cues that our brain uses to figure out those characteristics.

Sound Characteristics

Both when you hear sounds and when you listen to music, your brain figures out all kinds of things about the characteristics of the sounds you are hearing and their environment: for example, whether a sound is to the left or right, in front or behind, above or below, how far away it is, how big the room is, and where nearby objects are. How does your brain perceive these characteristics? It uses certain cues to figure out what it can about each of them. The following table shows which cues help your brain figure out each of several useful characteristics.

Figure 1 - Psychoacoustic effects of various spatialization cues

                   Characteristic
Cue                Left/Right  Front/Back  Up/Down  Distance  Room Size  Location of nearby objects
intensity          x           (quad)
ITD (delay)        x
early reflections                                             x          x
reverb wash                                         x         x
HRTF               x           x           x

Making 3D audio work involves picking which of these cues you need to simulate and how well you need to simulate them in order to fool your brain. Next we will examine more detailed information about each of the cues.

Spatialization Cues

  1. Interaural Intensity Difference (IID)

    [Illustration: IID Model]

    The volume or intensity of a sound when it reaches each ear is taken into consideration by your brain -- that's one cue. In technospeak, that cue is called interaural intensity difference (IID). It can be easily modeled by calculating how far a sound is from each ear. In the real (physical) world, the intensity of the sound decreases with distance according to the formula intensity = k/(distance^2), which means the amplitude falls off as 1/distance. If you model the change in level over distance according to this formula, then you are doing physical modeling of sound intensity.
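The distance-based model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ears are treated as two points 0.18 m apart on the x axis (a hypothetical head width), `k` is an arbitrary constant, and the shadowing effect of the head itself is ignored.

```python
import math

def ear_positions(head_width=0.18):
    """Left and right ear as 2D points; head_width in metres is assumed."""
    half = head_width / 2.0
    return (-half, 0.0), (half, 0.0)

def intensity_at(source, ear, k=1.0):
    """Inverse-square law: intensity = k / distance^2."""
    d = math.dist(source, ear)  # source must not sit exactly on the ear
    return k / (d * d)

def iid_db(source, head_width=0.18):
    """Interaural intensity difference in decibels.

    Positive means the right ear receives the stronger signal.
    """
    left, right = ear_positions(head_width)
    return 10.0 * math.log10(intensity_at(source, right) /
                             intensity_at(source, left))
```

A source straight ahead, such as `(0.0, 1.0)`, gives a difference of zero. A source off to the right gives a positive value that grows as the source gets closer to the head.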

  2. Interaural Time Difference (ITD)

    [Illustration: ITD Model]

    Your brain also uses the time difference between when a sound arrives at your right ear and your left ear when it tries to figure out where a sound is coming from. If it arrives at your left ear first, the sound must have come from your left. If it arrives at your right ear first, it must have come from your right. This cue is called interaural time difference (ITD). Interaural means between the ears.
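A simple way to estimate the interaural time difference is to divide the difference in path length to each ear by the speed of sound. The sketch below assumes point ears 0.18 m apart (a hypothetical head width), straight-line paths that ignore the head, and a speed of sound of 343 m/s, which is roughly the 1130 ft/sec used elsewhere in this article.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature value

def itd_seconds(source, head_width=0.18):
    """Interaural time difference for a 2D source position in metres.

    Positive means the sound reaches the right ear first.
    """
    half = head_width / 2.0
    left_ear, right_ear = (-half, 0.0), (half, 0.0)
    path_difference = math.dist(source, left_ear) - math.dist(source, right_ear)
    return path_difference / SPEED_OF_SOUND

def itd_samples(source, sample_rate=44100, head_width=0.18):
    """The same difference expressed as a delay in whole samples."""
    return round(itd_seconds(source, head_width) * sample_rate)
```

To simulate the cue, delay the signal sent to the far ear by `itd_samples(...)` samples. In this model the largest possible difference, for a source directly to one side, is `head_width / SPEED_OF_SOUND`, a little over half a millisecond.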

    When a sound moves, the time difference changes too. Your brain processes that change along with other cues to tell that an object is moving and track its location.

    Motion of the sound itself is not the only thing that causes a change in the ITD. As you move your head -- even ever so slightly -- the distance from each ear to the sound source changes, causing the ITD to change too. Fortunately, your brain uses the balance center in your inner ear like a gyroscope to determine your orientation and compensate for this movement, altering the sound you hear. To test this out, turn your head while listening to music coming from a boombox. Does it sound like the boombox is moving? Probably not. Now have someone carry the boombox across the room. Does it sound like the boombox is moving? Yes, of course. Each should sound different, although the effect upon the interaural time difference is the same. The difference is that your balance center tells your brain when it is you that is moving.

  3. Early Reflections

    Early reflections are the first echoes you hear from a sound as it bounces off nearby walls and other things that are around the sound. Have you ever noticed how your blind friend can walk around objects she cannot see, and stop before hitting a wall in your house even though she is not using a cane? Her brain analyzes the initial echoes heard to tell her where things are -- very similar to the echo ranging used in sonar and radar.

    Early reflections help you hear how big the room is and where things are around you, especially when the sound is coming from you. Imagine shouting in a canyon. The first reflections you hear of your voice are so far apart that you have a special name for them -- echoes. Because they are so far apart, you know you are in a large space -- and also that there is some sort of large canyon wall for the sound to reflect off of. This type of analysis of the sound happens even when you are in smaller spaces and the echoes are so close together you don't even consciously notice them.

    The time between first hearing the direct sound and hearing the first reflection tells you how far away the nearest wall is -- the longer the delay to the first reflection, the farther you are from a wall and the bigger the space you are in.
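The relationship between wall distance and the delay of the first reflection can be sketched with the "image source" trick: mirror the source across the wall and measure the path from the mirror image to the listener. This is a minimal 2D illustration with a single flat wall at x = `wall_x`. The function names and the 343 m/s speed of sound are assumptions, not part of any algorithm described in this article.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed

def first_reflection_delay(source, listener, wall_x):
    """Seconds between the direct sound and its reflection off one wall.

    The wall is the vertical line x = wall_x. Source and listener are
    (x, y) points in metres on the same side of the wall.
    """
    image = (2.0 * wall_x - source[0], source[1])  # source mirrored in wall
    direct_path = math.dist(source, listener)
    reflected_path = math.dist(image, listener)
    return (reflected_path - direct_path) / SPEED_OF_SOUND

def wall_distance_from_echo(delay_seconds):
    """Canyon case: you are the source, so the sound travels out and back."""
    return SPEED_OF_SOUND * delay_seconds / 2.0
```

With the source and listener in the same spot 10 m from the wall, the delay is 20 m divided by the speed of sound, and `wall_distance_from_echo` recovers the 10 m.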

    Only the first round of echoes is useful after a sound begins; after that the echoes quickly become a blur, turning into a reverb wash.

  4. Reverb Wash

    The reverb wash is made up of the reflections of sound as they continue to bounce around. The combined first bounces off each object are called early reflections -- all the bounces heard after that make up the reverb wash.

    The reverb wash tells you a lot about how far a sound source is from you. This is done through two mechanisms: first, by its loudness relative to the direct sound, and second, by its time delay relative to the direct sound. If the source is nearby, you will first hear the sound, then hear the reflections and wash afterwards. The direct sound will be much louder than the wash. But if a sound is coming from far away in the distance, you will hear the direct sound and the wash at about the same time. Also, the wash and the direct sound will each have about the same volume. In this case, the early reflections will not even be noticeable because they will be mixed up in the wash.
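The loudness part of this cue can be sketched as a direct-to-reverb ratio. The model below is a common simplification and an assumption here: the direct sound's amplitude falls off as 1/distance while the wash stays at a roughly constant level throughout the room. `critical_distance` (a hypothetical parameter, in metres) is the distance at which the two are equally loud.

```python
import math

def direct_to_reverb_db(distance, critical_distance=2.0):
    """Level of the direct sound relative to the reverb wash, in dB.

    Assumes the direct amplitude is proportional to 1/distance and the
    wash level does not depend on distance. Positive values mean the
    direct sound dominates (a nearby source); values near or below zero
    mean the wash is as loud as the direct sound (a distant source).
    """
    if distance <= 0 or critical_distance <= 0:
        raise ValueError("distances must be positive")
    return 20.0 * math.log10(critical_distance / distance)
```

To place a sound farther away in a mix, lower the direct signal relative to the reverb send by the amount this function suggests, and shorten the gap between the direct sound and the start of the wash.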

    The length of the reverb wash in time tells you something about the initial loudness of the sound and the size of the area you are in. If the sound is very loud to begin with, you will be able to hear the reverb longer. If you are in a large space that is well-enclosed, you will hear more sound bouncing around from wall to wall, and so the reverb wash will be longer because of the opportunity to bounce and also because of the long time between bounces, stretching out the wash. Hence, caves and cathedrals have reverb washes that last a long time, but bathrooms and nightclubs have a short reverb.

    Porous walls and soft furniture and thick rugs and crowds wearing fur coats will all absorb sound, especially the high frequencies. This will dampen and dull the reverb wash. On the other hand, hard, smooth, straight walls in an empty room will reflect all frequencies well, giving a brighter, longer reverb wash.
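A standard rule of thumb that captures both effects (bigger rooms ring longer, absorbent surfaces shorten the wash) is Sabine's formula, RT60 = 0.161 * V / A, where V is the room volume in cubic metres and A is the total absorption (each surface area multiplied by its absorption coefficient). It is not part of this article's algorithms; the sketch below only illustrates the trend, and the coefficients in the usage note are made-up illustrative values.

```python
def sabine_rt60(volume_m3, surfaces):
    """Estimate reverb time (seconds for a 60 dB decay) with Sabine's formula.

    surfaces is a list of (area_m2, absorption_coefficient) pairs, where
    a coefficient of 0.0 reflects everything and 1.0 absorbs everything.
    """
    total_absorption = sum(area * alpha for area, alpha in surfaces)
    if total_absorption <= 0:
        raise ValueError("total absorption must be positive")
    return 0.161 * volume_m3 / total_absorption
```

For a 5 m x 4 m x 3 m room (60 cubic metres, 94 square metres of surface), `sabine_rt60(60.0, [(94.0, 0.05)])` gives about 2.1 seconds for hard bare walls, while `sabine_rt60(60.0, [(94.0, 0.4)])` gives about 0.26 seconds for a heavily furnished room. Real absorption coefficients vary with frequency, which is why soft materials dull the wash as well as shortening it.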

  5. Head Related Transfer Function (HRTF)

    What are the Pinnae?
    Take notice of your ears -- examine your outer ear, that flap of cartilage covered with skin. Perhaps it is pierced. These flaps are called your pinnae. The pinnae are a rather odd shape -- longer than wide, open at the front, curled over at the top and some of the rear.

    What do the Pinnae do?
    Sound that reaches inside your ear canal has to move around and along your pinna to get there. Depending on the direction from which the sound comes, the sound will move across a different part of the pinna. If the sound comes directly in, it may arrive largely unchanged. If it comes from behind you, the sound will move along the hair on the back of your head, wrap around the back edge of the pinna, then enter the ear canal, bouncing off that little tab in the front. If coming from above, the sound passes over the hair on top of your head, wraps around the curved edge on the top, bounces off the bottom, and into the ear canal.

    While doing this, the sound is changed -- certain frequencies are dampened, others are reinforced. Delays in the paths taken by different frequencies cause minute phase changes. For a given direction of approach, the sound will be colored in a specific way for a specific person. This type of sound coloring, once thoroughly analyzed, can be simulated with digital filters.

    A filter which produces a specific frequency and phase response can be completely described with a mathematical formula called a transfer function. Since the set of filters used to indicate direction of sound are affected by the shape and size of a person's head and ears, these filter functions are called head related transfer functions (HRTF). By selecting the appropriate pair (one for each ear) of transfer functions for a sound coming from a given direction and applying it to the sound being directed at each ear, we may be able to alter the direction the sound is perceived as coming from -- making it come from above, below, or behind -- even when the speakers are directly in front of the listener.
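Applying a pair of transfer functions to a sound amounts to filtering the same mono signal twice, once per ear. A minimal sketch, assuming each transfer function is available as a short impulse response (a list of FIR coefficients) and using plain convolution; the function names are hypothetical and no real HRTF data is included.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out

def apply_hrtf_pair(mono, left_response, right_response):
    """Filter one mono signal with the left-ear and right-ear responses.

    Returns (left_channel, right_channel). The two responses must be
    the measured or designed pair for the desired direction.
    """
    return convolve(mono, left_response), convolve(mono, right_response)
```

With a real pair of measured responses, playing the two returned channels over headphones should shift the perceived direction. A response of `[1.0]` for both ears leaves the sound unchanged.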

    For a fun experiment to hear how this works, play a song on your stereo, preferably with some high frequencies like a snare drum. Sit with your back to the speakers, a few feet away. Hold up your hands, palms cupped and facing backwards towards the speakers. Place the edges of your hands which are opposite your thumbs on the side of your head, just in front of your ear. Your fingers are now cupping over the top of your pinnae, palms open and facing the speakers. Fiddle with the angle and shape of your hands. You should be able to find an arrangement in which the music appears to be coming from your side instead of from behind. Your hands are acting like artificial ears. For a more extreme effect, shave your head, duct tape your ears to your head, glue on fake ears in the opposite direction, and put on a wig facing backwards. Now, sounds in front of you will sound like they are coming from the back and vice versa. The only problem is that the interaural time difference (ITD) cues contradict the HRTF cues, which may make you dizzy and fall down! Or maybe it's the wig...

    Another experiment - scratching top & bottom; note about the resonant effects of membranes of different sizes and tensions.

    Pitfalls -- pinnae uniqueness
    Pinnae are like fingerprints -- no two are the same. They uniquely identify you. Because of this, each person filters sounds differently! Each person's brain is tuned to recognize and process their own pinnae's filtering. Perhaps now you can see why HRTF design is so difficult -- no two people have the same HRTF! Therefore it is actually impossible to make a single HRTF that will work well for everyone. There are three major approaches to selecting an HRTF:

    1. Create a custom HRTF for each listener. This requires that the person's pinnae be measured and that a suitably capable expert design an HRTF to match -- an expensive process.
    2. Select an HRTF representing a typical ear that works acceptably well for as many people as possible.
    3. Select an HRTF based on what the available hardware can support.
    Even under ideal circumstances, there will always be some people who can't hear the effect at all. There are several explanations for this:
    1. Their ears may be very different from the HRTF you are using.
    2. They may have damaged their high-frequency hearing by listening to loud music previously in life. The most powerful spatial imaging cues are in the high-frequency range.
    3. Some people are very sensitive to incorrect lighting and can spot a fake photo instantly, and can tell in a movie when a real model is filmed, and when a computer generated model is used. Likewise, some people are very perceptually sensitive to spatial sound direction. As their heads make minute motions, affecting the ITD, and their ears twitch, affecting the HRTF, their brains spot a forgery! They can then filter out the false cues and detect the real source of the sound: speakers.
    Does this mean we should just give up? No! An old adage says: "Something is better than nothing." Likewise, sounds can be livened up, moved about, and intriguingly enhanced through the use of HRTFs. For many people, it will sound positively amazing; a few won't hear much difference. We're doing this for the first group.

    Measurement of a person's own HRTF
    methods of HRTF capture -- canal sampling (incl. ripping from studies) & model-casting/ray-tracing algorithmic (MUST model porosity of cartilage & dampening effects of ear hair to be an acceptably accurate approximation)

    Topography of Surfaces
    Different types of objects require different numbers of dimensions to describe them. A flat tabletop and a book cover each have two dimensions. A box, a kitten, or a bowl filled with soup has three dimensions. Rough, folded surfaces can be considered to have between two and three dimensions. Some examples are a mountain, a range of rolling hills, a crumpled sheet, and the surface of a sea at a given moment. The inner surface of the pinna, which is critical to determining its transfer functions, is also such a surface. In mathematics, objects with "in-between" dimensions like this are called fractals (for their fractional dimension). Surfaces with higher dimensionality are described as more topographically complex.

    complexity & freq. range of top/down vs front/back. More people can hear top/down because 8k.

    Filter Design

    First, let's analyze what range of frequencies and precision we will need to have. Once this is accomplished, we will try to find the simplest filter topography (interconnection arrangement) that will yield acceptable results. Since the topographies may be complex, unfamiliar filters, we will not have easy access to formulas that help us select coefficients to tune the filter. Thus for each candidate topography, we will need to find a way to select coefficients. Filters are a set of interconnected nodes that multiply a number of inputs by coefficients, yielding an output. This also describes a neural network. Programming a neural network involves using back-propagation to converge upon a set of coefficients that will yield the response we desire. We can use the same method to program our filters, though be warned -- it is more difficult than it sounds.
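The idea of converging on coefficients by training can be shown in its simplest form: a plain FIR filter is a single layer of multiply-and-add nodes, and gradient descent on its output error (the LMS rule) is what back-propagation reduces to for one linear layer. The sketch below fits a FIR filter to a known target response using random noise as the training input. The target coefficients, learning rate and step count are arbitrary illustrative values, and the more complex topographies discussed here would need a full back-propagation implementation.

```python
import random

def fir_output(coeffs, history):
    """One output sample: each coefficient times one recent input sample."""
    return sum(c * x for c, x in zip(coeffs, history))

def train_fir(target_coeffs, num_taps, steps=5000, rate=0.05, seed=1):
    """Fit FIR coefficients to imitate a target filter by gradient descent.

    At each step a random input sample is pushed into the history, the
    target filter's output is taken as the desired value, and every
    coefficient is nudged in the direction that reduces the error.
    """
    rng = random.Random(seed)
    coeffs = [0.0] * num_taps
    history = [0.0] * max(num_taps, len(target_coeffs))
    for _ in range(steps):
        history = [rng.uniform(-1.0, 1.0)] + history[:-1]
        desired = fir_output(target_coeffs, history)
        error = desired - fir_output(coeffs, history)
        coeffs = [c + rate * error * x for c, x in zip(coeffs, history)]
    return coeffs
```

Calling `train_fir([0.5, -0.3, 0.2], num_taps=3)` should return coefficients very close to the target. In practice the desired output would come from a measured ear response rather than from a known filter.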

    What range of frequencies is most important in implementing an HRTF? A very rough first approximation can be obtained by considering the size of the pinna and the speed of sound:

    From the top to the bottom of a typical pinna is 2.5 inches. This means that the lowest frequency significantly affected by the pinna in that direction is (1130 ft/sec) / (2.5 inches / (12 inches/ft)) = 5424 Hz. Most of the up/down filtering will happen at higher frequencies.

    From the front to the back of a typical pinna is 1.5 inches. This means that the lowest frequency significantly affected by the pinna in that direction is (1130 ft/sec) / (1.5 inches / (12 inches/ft)) = 9040 Hz. Most of the front/back filtering will happen at higher frequencies.
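The two estimates above use the same rule: treat the pinna dimension as one wavelength and divide the speed of sound by it. A small Python sketch of that arithmetic, using the same 1130 ft/sec figure (the function name is hypothetical):

```python
SPEED_OF_SOUND_FT_PER_SEC = 1130.0  # value used in the text above

def lowest_affected_frequency_hz(size_inches):
    """Frequency whose wavelength equals the given pinna dimension."""
    wavelength_ft = size_inches / 12.0
    return SPEED_OF_SOUND_FT_PER_SEC / wavelength_ft

top_to_bottom = lowest_affected_frequency_hz(2.5)  # about 5424 Hz
front_to_back = lowest_affected_frequency_hz(1.5)  # 9040 Hz
```

The same function can be used to see how the estimate shifts for smaller or larger ears.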

    The first thing to notice here is that people with poor hearing have often lost their ability to hear sounds in this range. Such people are likely to have poor localization of sound to begin with and probably cannot hear anything special from HRTFs.

    Secondly, let's consider what kind of filter system we would need to do the needed filtering accurately and in real-time. In a system with a 50 kHz sampling rate, a 10-sample-long finite impulse response (FIR) filter can implement a transfer function focusing on frequencies above 5000 Hz. For good quality, especially in the higher frequencies, oversampling is needed to improve the precision of the frequency response. 8x oversampling with interpolation gives a 400 kHz sampling rate and requires a 40-sample-long FIR filter. Two such filters are needed in a stereo system, yielding a system that must be able to perform about 2 * 400,000 * 40 = 32 million 24-bit multiplies per second per 3D stream. To handle 10 streams in a quadraphonic system, you need 10 * 4 * 400,000 * 40 = 640 million 24-bit multiplies per second. If your system is going to do anything else beyond calculating HRTFs, you'll need even more computational power.
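The load estimate above is a simple product of sample rate, filter length, output channels and streams. A short Python sketch of the same arithmetic, useful for trying other configurations (the function and variable names are hypothetical):

```python
def fir_multiplies_per_second(sample_rate_hz, taps, channels, streams=1):
    """Multiplies per second for one FIR filter per channel per stream."""
    return sample_rate_hz * taps * channels * streams

base_rate_hz = 50_000
oversampling = 8
rate_hz = base_rate_hz * oversampling  # 400,000 Hz

stereo_one_stream = fir_multiplies_per_second(rate_hz, taps=40, channels=2)
quad_ten_streams = fir_multiplies_per_second(rate_hz, taps=40, channels=4,
                                             streams=10)
```

Here `stereo_one_stream` is 32,000,000 and `quad_ten_streams` is 640,000,000, matching the figures in the text.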

    This can obviously be an intense and expensive computational load, so it is worthwhile to explore other filter topographies. If we can find self-similarity in the filter response, we can structure the FIR filter hierarchically rather than flat, and reduce our computational complexity by a log function. Another method is to explore infinite impulse response (IIR) filters. Such filters have feedback and are thus less stable but can yield considerable improvements in processing speed. They can also be quite difficult to map to an arbitrary filter response, which is what we'll need to do.
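To see why feedback saves computation, consider the simplest IIR filter, a one-pole low-pass. It needs two multiplies per sample, yet its impulse response decays forever, so a FIR filter imitating it needs many taps. This sketch is only an illustration of the trade-off. It does not map an arbitrary response, and the coefficient `a` and the tolerance are arbitrary example values.

```python
def one_pole_lowpass(samples, a):
    """IIR filter y[n] = (1 - a) * x[n] + a * y[n-1].

    Two multiplies per sample. Stable only when abs(a) < 1, because
    the fed-back term would otherwise grow without limit.
    """
    out = []
    y = 0.0
    for x in samples:
        y = (1.0 - a) * x + a * y
        out.append(y)
    return out

def fir_taps_to_match(a, tolerance=1e-3):
    """Count FIR taps needed before the IIR impulse response drops
    below the tolerance, i.e. how long a FIR imitation would be."""
    if not 0.0 < abs(a) < 1.0:
        raise ValueError("a must be non-zero with abs(a) < 1")
    taps = 0
    value = 1.0 - a
    while abs(value) >= tolerance:
        taps += 1
        value *= a
    return taps
```

Feeding a single 1.0 followed by zeros into `one_pole_lowpass` shows the endlessly decaying response. `fir_taps_to_match(0.9)` shows that dozens of FIR taps would be needed to approximate what the feedback form does with two multiplies.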

    Multiple Filter Candidate Topography from Surface Topography Using Fractal Auto-correlation
    Methods of filter topography extraction - using fractals to extract topographical similarities, suggesting # of units to use Mapping surfaces into higher dimensions, yielding lower complexity. A FIR is basically one dimensional (between 1 & 2). Each fold can increase dimensionality, but reduces size of calculations. n-dimensional bifolding & p-folding

    Coefficient Extraction using Back-Propagation
    Methods of coefficient matching - feed to neural networks & train using modified back-propagation to a wide range of digital filter topographies. When close match occurs, do quick test of each on subjects. Further develop promising ones to reduce to essentials, test against subjects. Notes: This takes a lot of time & back propagation and imagination can produce an infinite variety of solutions. The more time spent (months with experts, possible years to get good ones), the greater likelihood of finding a kick-ass topography that can be efficiently digitally modeled with wickedly stark imaging from compelling coefficients. Filters must be pretty fancy/surprisingly clever to do precise phase work at close-to R/2 frequencies.

    Pitfall -- Engineering Time vs. Computational Complexity
    As we have seen, the pinnae are extremely complex mathematical objects and there is absolutely no way to make this work by fiddling with knobs or guessing. This is by far the most difficult physical cue to get right. It can take 100 times longer to make it work than to get all the rest working in real-time. You can get good results by ripping curves from journals and using FIR filters, but at a high computational cost, usually involving fully customized ASIC chips. This is how Diamond implemented Aureal's (now Creative's) algorithms -- with 15 stage FIR filters, allowing 4 streams per chip. For both, best to use own ear initially. Implementation: algorithmic vs modeled vs sampled (also simulated, but don't mention).

    If using speakers, overcompensate


1. real time/offline

2. interactive - usually requires real-time, though path generation can be real-time interactive with offline DSP.
Do you need special equipment to play it back?

What can positional audio be used for?

(stuff above should go here so as not to lose interest)

3D Positional Audio Algorithms by X. J. Scott

Figure 2 - Summary of Spatialization features in XJS Algorithms 
ITD (delay)
early reflections
reverb wash
Doppler shift

XJS Type 1 3D Audio -- Pitch bend to simulate Doppler shift; panning and reverb levels to simulate distance and direction. Used in Jacob's Ladder, Fall 1991.

XJS Type 2 3D Audio -- Movement modeled by sampling stereo impulse response of actual physical spaces at a number of positions and cross-convolving among them to simulate motion. Without Doppler shift, this effect is unconvincing. Used to place a roller coaster inside a stairwell in March, 1993.

XJS Type 3 3D Audio -- Physical modeling of an acoustic delay line; filters roughly simulating the response of a typical ear to distinguish front/back; algorithmic (non-modeled) early reflection control to give an impression of the size of the nearby sound space; and a reverb tail/wash to give an impression of distance and of the overall environment. Scales to accept any number of speakers. Adapts to any placement of speakers. Modeled and first demonstrated in csound in January 1995. Ported to run 10 interactive streams in real time on a client's ASIC in November-December 1996.

XJS Type 4 3D Audio -- New method of generating filter transfer functions and other parameters to simulate direction and distance including very accurate reverb and early reflection data. Only a mono delay is used to generate Doppler, rather than the interaural delays of types 2 and 3. Left/Right direction is simply indicated by modulating amplitude. First used in This is Not for Realsies, June 1997.
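As a generic illustration of the pitch-bend approach to Doppler shift mentioned for Type 1, the sketch below converts a source's speed toward the listener into a pitch bend in semitones using the textbook Doppler formula for a moving source and a stationary listener. It is not taken from the XJS algorithms; the names and the 343 m/s speed of sound are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed

def doppler_ratio(radial_speed):
    """Heard frequency divided by emitted frequency.

    radial_speed is the source's speed toward the listener in metres
    per second (negative when moving away). It must stay below the
    speed of sound.
    """
    if radial_speed >= SPEED_OF_SOUND:
        raise ValueError("source must be slower than sound")
    return SPEED_OF_SOUND / (SPEED_OF_SOUND - radial_speed)

def pitch_bend_semitones(radial_speed):
    """The same shift expressed as a pitch bend in equal-tempered semitones."""
    return 12.0 * math.log2(doppler_ratio(radial_speed))
```

An approaching source gives a positive bend and a receding one a negative bend. Updating the bend as the radial speed changes along a path produces the familiar swoop of a passing vehicle.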

(c) 1991-2001 X. J. Scott, All Rights Reserved.