Game audio has always needed compression. The fact remains that there is still not enough disk space, or memory, to hold all of the grandiose audio environments we dream of as raw, uncompressed audio samples. This guide aims to help you select the best audio encoders for your game.
Every codec has strengths and weaknesses. Knowing which one to use in which situation is not always obvious. The first thing to understand is that there is no single correct choice for all platforms. There isn’t even a single correct choice for all types of sounds on a single platform. Therefore, it is not a good idea to lock yourself in the mindset of “one size fits all”.
If you don’t need all the nitty-gritty details of Wwise codec support and only want to know which one to pick, skip to the last section for a quick selection guide.
Compression and performance
Wwise supports 4 software codecs: PCM, ADPCM, Vorbis and Opus. There’s one main advantage for using these codecs, which is consistency across platforms. You will get exactly the same quality, relative performance and behavior on all platforms. They all support Wwise's full feature set for audio sources, which is not the case for hardware codecs.
Additionally, every game console also comes with a hardware audio processor. Each has different capabilities, which are not equivalent between platforms (read further for the limitations). Wwise supports hardware decoding of compressed sources, and sometimes more hardware processing, when available on the platform.
A note on the term “quality”
First, there is an unfortunate fact about audio quality: it is subjective. There is not a single formula or number that can tell us what is good or bad audio. But we have some imperfect proxies for that. When setting the codec’s parameters, the most important one is the Quality/Bitrate. When selecting that setting, you also need to keep in mind what exactly this setting means.
The Quality setting is used in variable bitrate codecs and is usually unit-less. It is only a stand-in for numerous internal rules and parameters that will, in the end, allocate more or fewer bits (like resolution for an image) to different sections of the frequency spectrum. Given that a sound’s frequency spectrum changes over time, some sections of the sound will compress very well, while others won’t. You can easily imagine that a second of near-silence can be compressed a lot, while a second of heavy metal music might not.
The Bitrate is used in constant bitrate codecs and it is very literal: it is the number of bits we need to read per second to reconstruct said second of audio. The more information we have to reconstruct the sound, the closer it will be to the original. One might ask: how can a sound with a variable amount of audio information over time be compressed with the same number of bits every second? Simple! The codec changes the resolution of the spectrum over time so it always takes the same number of bits. Therefore, it will encode the second of near-silence with more bits than necessary, because it has bits to spare for that section. Conversely, the second of heavy metal music will have a lower resolution, because there is a lot of data to reconstruct, but still a finite number of bits to represent it. So, bitrate-limited codecs can actually be wasteful with regard to file size. There are other techniques employed, which go far beyond the scope of this article, but this is a simplified example.
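The trade-off above can be sketched with a toy calculation. All numbers are invented for illustration and not taken from any real codec:

```python
# Toy illustration of constant-bitrate allocation (all numbers invented).
# A CBR stream spends the same bits on every second, regardless of how
# much information that second actually needs.

CBR_KBPS = 96  # assumed constant bitrate, in kilobits per second

# Hypothetical "bits actually needed" for 4 consecutive seconds:
# near-silence, speech, speech, heavy metal.
needed_kbits = [10, 60, 70, 150]

cbr_kbits = [CBR_KBPS] * len(needed_kbits)  # CBR gives every second 96 kbits
wasted = sum(max(0, c - n) for c, n in zip(cbr_kbits, needed_kbits))
starved = sum(max(0, n - c) for c, n in zip(cbr_kbits, needed_kbits))

print(f"CBR total: {sum(cbr_kbits)} kbits")  # 384 kbits
print(f"Over-allocated on quiet seconds: {wasted} kbits")
print(f"Under-allocated on the loud second: {starved} kbits")
```

The quiet seconds receive far more bits than they need, while the loud second is starved, which is exactly where a variable bitrate codec would redistribute the budget.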
Thus, we can also see the “variable rate/constant rate” property as “constant quality/variable quality”. In general, modern codecs have a very advanced psycho-acoustic model that will prevent jarring artifacts even at low quality settings, for many types of sounds. By “jarring” I mean artifacts that are obvious and annoying to hear, that even non-audiophiles would identify as “low quality audio” (to be polite). Obviously, we all want to avoid those artifacts at all costs.
There are also a wide range of compression artifacts that fall into the “loss of reproduction fidelity” category. In other words, the compressed sound is slightly different from the original, and the difference could be audible, but you can only tell by doing A/B comparisons. Any unaware listener might not be able to tell which one is the original. Such compression artifacts are much less important from an overall audio experience point of view, as they will not pull the player out of the game.
Fidelity to the original is more important for some sounds. Music comes to mind, as any filtering or frequency distortion will be obvious, especially to musically-trained listeners. However, on the other end of the spectrum, nobody would notice a low-fidelity reproduction of an explosion or a footstep, as long as no jarring effects appear. Additionally, we must remember that most of the time, sounds are not played in isolation nor at full volume. The end result is that lower fidelity can easily be masked in a heavy mix, in case you need extra space or CPU. Selecting a quality/bitrate setting per category of sounds can be an easy way to start; you can then fine-tune where needed. If you are using a constant bitrate codec, make sure to give it some headroom (a higher bitrate) to cover the heavier parts of your sounds.
Compression ratio
The compression ratio is simply the ratio of the original file size to the compressed file size. For audio codecs, we only include the audio samples part of the file in the calculation, meaning without the WEM container or markers. It also excludes any prior downsampling. The table below compares the compression efficiency of each codec. For variable quality codecs, the audio content and settings undoubtedly play a huge role in the results.
| Codec | Compression ratio | VBR/CBR? |
| --- | --- | --- |
| PCM | 1:1 | Constant |
| ADPCM | 4:1 | Constant |
| Vorbis | 2:1 - ~40:1* | Variable or Constant |
| Opus | 2:1 - ~60:1** | Constant |
| XMA | 6:1 - 15:1 | Constant |
| ATRAC9 | 8:1 - 13:1 | Constant |
(*) Depends on the selected quality and audio content; the upper limits are rarely achievable.
(**) Same comment as for Vorbis, but a much better ratio is achieved with voice content (narrowband) and some harmonic content (music).
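As a sketch of how these ratios relate to raw sizes: the uncompressed baseline is plain PCM, whose size follows directly from the format. The encoded size below is an assumed example, not the output of a real encoder.

```python
# Sketch: relating a compression ratio to raw sizes. The PCM baseline is
# fully determined by the format; the encoded size here is an assumed
# example, not the output of a real encoder.

def pcm_size_bytes(seconds, sample_rate=48_000, channels=2, bytes_per_sample=2):
    """Size of the raw 16-bit PCM samples, excluding any container or markers."""
    return int(seconds * sample_rate * channels * bytes_per_sample)

raw = pcm_size_bytes(10)   # 10 s of stereo 48 kHz 16-bit PCM -> 1,920,000 bytes
encoded = 192_000          # assumed encoded size for this example

ratio = raw / encoded
print(f"Compression ratio: {ratio:.0f}:1")  # 10:1
```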
CPU performance
The next table shows the performance of the software codecs, in terms of throughput. In other words, it is the maximum number of streams that can be decoded in real-time, on one core. This number includes only the decoder itself, without resampling, voice management overhead, interrupts, memory bus bottlenecks, etc. In short, the decoder is isolated in the best possible conditions. The intent is to give an idea of the maximum performance, but the upper limits are unlikely to be achievable in a real game environment. The throughput can be compared to the maximum supported streams for the hardware codecs (below). For Vorbis and Opus, the 3 numbers are for the lowest, default and highest quality settings, giving the whole range of performance.
Maximum stream count
| Platform | ADPCM | Vorbis (low-med-high Q) | WEM Opus (low-med-high Q) |
| --- | --- | --- | --- |
| Mac¹ | 10700 | 7500 - 5900 - 3100 | 1600 - 1200 - 700 |
| Mobile² | 8600 | 5200 - 4300 - 2100 | 1100 - 800 - 500 |
| PC³ | 9500 | 5700 - 4600 - 2300 | 1300 - 1000 - 500 |
| PS4 | 3200 | 1200 - 1000 - 500 | 300 - 200 - 100 |
| PS5 | 7600 | 5900 - 4700 - 2500 (SW), 80 (HW) | 80 (HW) |
| Switch | 2000 | 600 - 500 - 300 | 200 - 150 - 100 (SW), 20 (HW)⁴ |
| XboxOne | 10000 | 1000 - 3400 - 1200 | 300 - 200 - 100 |
| XboxSeriesX | 9200 | 6900 - 5400 - 2800 | 300 (HW) |
1. Mac M1
2. ARM64 Cortex A9
3. Core i7 (circa 2018)
4. Using software Opus (WEM). The hardware decoder uses a slightly different codec, OpusNX. See next section.
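One way to use these throughput numbers is to turn them into a rough per-stream CPU budget. The sketch below uses the PC row at default quality from the table above; these are best-case decoder-only figures, so treat the results as optimistic.

```python
# Sketch: turning "max streams per core" into a per-stream CPU budget.
# Figures are the PC row at default quality from the table above; they
# exclude resampling, voice management, and memory-bus effects.

max_streams = {
    "ADPCM": 9500,
    "Vorbis": 4600,
    "Opus": 1000,
}

for codec, n in max_streams.items():
    print(f"{codec}: ~{100.0 / n:.3f}% of one core per stream")

# Rough budget check: 100 simultaneous Vorbis voices at default quality.
voices = 100
core_pct = voices * 100.0 / max_streams["Vorbis"]
print(f"{voices} Vorbis voices = ~{core_pct:.1f}% of one core")  # ~2.2%
```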
Hardware limitations
There are a few specific Wwise features that are tied to the hardware implementation of codecs. If it isn’t supported by the hardware, Wwise will try to emulate the same thing in software, but that comes with a cost, or sometimes it is not possible at all and an error will be reported. Here is a list of features that might be supported or not depending on the platform:
- Sample-accurate looping: the ability to start and stop a loop at any sample of the same stream, regardless of the duration of the native codec’s blocks. In Wwise, this corresponds to the Loop checkbox.
- Sample-accurate transition: the ability to stop a sound at any sample and jump to another stream, at any position, regardless of the duration of the native codec’s blocks. In Wwise, this is useful for Random/Sequence containers, in the “Sample accurate transition” mode.
- Variable resampling: the ability to change the sampling rate, or pitch, dynamically during the whole duration of the sound.
The hardware feature support is summarized in this table. When not supported, it will be achieved through software, at extra cost.
| Codec | Platform | HW? | SA Loop | SA Trans | Var. pitch | Max Count |
| --- | --- | --- | --- | --- | --- | --- |
| PCM | All | No | Yes | Yes | Yes | |
| ADPCM | All | No | Yes | Yes | Yes | |
| Vorbis | All¹ | No | Yes | Yes | Yes | |
| | PS5 | Yes | Yes | No | Yes | 80 |
| Opus | All¹ | No | Yes | Yes | Yes | |
| | PS5 | Yes | Yes | No | Yes | 80 |
| | XBX | Yes | Yes | Yes | Yes | 300 |
| | Switch | Yes | Yes | No | No | 12 - 24³ |
| XMA | XB1 | Yes | Yes² | No | No | 128 |
| | XBX | Yes | Yes² | No | No | 128 |
| ATRAC9 | PS4 | Yes | Yes | No | No | 60 - 500³ |
| | PS5 | Yes | Yes | No | No | 120 - 1000³ |
1. All platforms, except those listed separately below
2. XMA loops are possible only on 128-sample boundaries
3. Maximum depends on channel count, bit rate and granularity of all active sounds. A reasonable maximum is in the mid-range.
Hardware latency
Hardware acceleration is usually done through a DSP co-processor. As such, it has to process the data like any other CPU, and it is not instantaneous. The real advantage it offers is parallelism with the main CPU. Unfortunately, the DSP itself is much slower than the top-of-the-line CPU that comes with any console.
There are two ways to use that parallelism: same-frame processing or delayed processing. This is controlled by the bLowLatencyHwCodec flag. The first mode (when the flag is set to true) is self-explanatory: there will be no latency between the hardware sources and the software sources. The compressed data is sent to the decoder, and Wwise processes other software sources, or yields the CPU if there are none, until the hardware signals that the results are ready. These are then processed and mixed in the same audio frame, in sync with the other software sources. Hence, no latency is added. This means the processing time reported by the Wwise Profiler will be higher, because it includes the yielded time (during which the CPU can jump to other tasks). But everything will be synchronized.
The second mode, when bLowLatencyHwCodec is false, sends the data to the decompressor on one frame but will fetch the results on the next frame. This has the major advantage of hiding the DSP processing time entirely. However, all sounds are delayed by one frame, which may cause issues if there is a mix of software and hardware sounds that must be synchronized, or if the game has very tight latency requirements (e.g. rhythm games). This is the preferred mode for overall system performance.
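As a worked number, the one-frame delay of the deferred mode can be expressed in milliseconds. The frame length and sample rate below are assumed typical values, not settings read from any particular project:

```python
# Worked number: the extra latency of the deferred (one-frame-late) mode.
# Frame length and sample rate are assumed typical values.

frame_samples = 1024   # assumed audio frame length, in samples
sample_rate = 48_000   # assumed output sample rate, in Hz

added_latency_ms = 1000.0 * frame_samples / sample_rate
print(f"Deferred hardware decoding adds ~{added_latency_ms:.1f} ms")  # ~21.3 ms
```

Around 20 ms is imperceptible for most ambience and SFX, but can matter for tightly synchronized layers or rhythm-game mechanics.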
Bottom line, hardware decoding is a good way of freeing the CPU to do other things, but in absolute terms it doesn’t mean it can process more raw audio than the CPU. But you can mix both hardware and software sources to maximize parallelism.
Is hardware decoding free?
Simply put, no. Not even on the latest generation of consoles. The cost of hardware decoding comes in a different form than raw CPU cycles. The hardware decoder’s physical implementation differs greatly from one platform to another, as do the driver implementation, the OS integration and the available API. All this makes comparing the cost of these codecs difficult. But there are a few common costs.
In a lot of cases, the API to command the decoder will cross the kernel boundary, often incurring a cost. On some platforms, commands are posted to a queue that is protected by some thread synchronization and can therefore stall for a bit. In some cases, the memory containing the compressed input stream needs to be transferred to memory visible by the hardware decoder, which may be separate from the main memory. The same may need to be done for the audio output. It is also common that the format of the output doesn’t match the one of the rest of the sound engine, requiring a conversion step. Some hardware works in batches and the whole batch must be done for the results to be available, while others will return results as they are ready.
In general, because of all these costs, hardware codecs are not a good fit for very small sounds (less than 100ms, as an easy rule of thumb). Therefore, depending on the frequency of use of short sounds, their duration and the platform used, it might be costlier than pure software decoding. Examples of sounds to be careful with are: very dry (short) footsteps, impacts or repetitive firearm sounds made through the Trigger Rate mechanism. Also any granular synthesis made with Random or Sequence Containers through sample-accurate transitions is risky with hardware codecs.
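A toy break-even model can make this rule of thumb concrete. Both cost figures below are invented for illustration; only the shape of the trade-off matters:

```python
# Toy break-even model (all cost figures invented for illustration):
# hardware decoding pays a fixed dispatch cost per sound (kernel call,
# memory transfer, format conversion), while software decoding costs
# roughly in proportion to the sound's duration.

HW_FIXED_OVERHEAD_MS = 0.05   # assumed fixed CPU cost per hardware decode
SW_COST_PER_SECOND_MS = 0.5   # assumed CPU cost to software-decode 1 s of audio

def sw_cost_ms(duration_s):
    """CPU cost of decoding this sound entirely in software."""
    return duration_s * SW_COST_PER_SECOND_MS

# Below this duration, the fixed hardware overhead costs more CPU than
# simply decoding the sound in software would.
break_even_s = HW_FIXED_OVERHEAD_MS / SW_COST_PER_SECOND_MS
print(f"Break-even around {break_even_s * 1000:.0f} ms")
```

With these assumed figures the break-even lands near 100 ms, in line with the rule of thumb above; on a real platform you would measure both costs rather than assume them.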
Additional notes on each codec
PCM
PCM is simply the uncompressed media; therefore, its only advantage is speed. Its main disadvantage is disk and memory space, given that it is uncompressed. The only recommended usage is for heavily used sounds on less capable platforms. Nowadays, there is hardly a platform where this would be preferable to the other options.
ADPCM
ADPCM is well known in the gaming industry as one of the first affordable codecs (CPU-wise) on the old consoles. It has a fixed compression ratio of 4:1 and is very fast to decode. Its main downside is its unpredictable quality. Sounds with obvious transients or very high frequencies will exhibit audible artifacts. Nonetheless, these cases are not the norm, and ADPCM is still often used.
Vorbis
This is a general purpose psychoacoustic codec, meaning it uses the specificities of the human ear and brain to help compress audio while minimizing artifacts. This codec has made its way into gaming initially because of its good audio quality along with good file compression. The compression ratio is very variable and depends on the audio signal. It can go from 2:1 (highest audio quality, on a wide band signal) to 40:1 (lowest quality, on a narrow band signal). The codec implementation that Wwise uses has been optimized over and over, and its average CPU cost is now about 1.5x to 3x that of the ADPCM codec.
Vorbis is a variable bit rate codec, which means the compression ratio depends on the content of the file. In general, the noisier the sound, the less compression can be achieved. Also, the CPU cost is directly tied to the Quality factor used: higher quality means more CPU used to decode.
This format requires a seek table to support seek operations, the “From elapsed time” Virtual Voice option and the Interactive Music feature set. The table is optional; you can avoid its cost if you are not using this functionality, but in general it should be enabled. Note that you can use a large granularity in most cases, to cut down on the cost of this table.
Also, this format carries a small metadata overhead, so very small files are disproportionately affected. As a rule of thumb, sounds shorter than 50ms may produce files larger than necessary.
Opus
Opus is the top of the line when it comes to quality vs file size. It is the successor to Vorbis and benefits from significant improvements on the compression ratio. The perceived quality is slightly higher than Vorbis, according to subjective listener tests (see https://opus-codec.org/comparison/). Therefore, it can achieve better ratios than Vorbis while retaining quality. However, it comes at a high CPU cost, typically 5x to 10x slower than ADPCM.
Opus is not the best format for tiny files. The algorithm needs a few milliseconds of data to set up properly, hence about 80ms of audio is added to the file. Therefore it is wasteful to use Opus for tiny grains of audio of about that length. As a rule of thumb, if your sound is smaller than 200ms, just use Vorbis or ADPCM.
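The impact of that priming audio on short files is easy to quantify. This is pure arithmetic on durations; only the ~80 ms figure comes from the text above:

```python
# Sketch: how much of an encoded Opus file is priming rather than content,
# as a function of sound length. Only the ~80 ms figure comes from the text.

PRIMING_MS = 80.0  # approximate audio the encoder adds to set up

def priming_fraction(sound_ms):
    """Fraction of the encoded duration that is priming, not useful content."""
    return PRIMING_MS / (sound_ms + PRIMING_MS)

for ms in (50, 200, 2000):
    print(f"{ms} ms sound: {priming_fraction(ms):.0%} of the encoded audio is priming")
```

For a 50 ms grain, more than half of the encoded audio is overhead, whereas for a 2-second sound it is a few percent, which is why short sounds are better served by Vorbis or ADPCM.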
XMA
This codec is available only on Xbox platforms. It has been the workhorse of games since the Xbox 360. Its main advantage is being decoded in hardware. However, the hardware isn’t particularly fast, so it has a limit of 128 streams (channels). Also, some audio content doesn’t compress very well with this algorithm, and artifacts can sometimes be heard. Fortunately, this is far from the norm! Finally, the format itself limits loop points to multiples of 128 samples, which precludes sample-accurate stitching of loops without going back to the CPU.
ATRAC9
This codec is available only on the PlayStation platforms. It is always decoded in hardware. Some sounds could generate some audible artifacts under certain encoding settings, but changing said settings could fix the issues. Fortunately, artifacts are not common either.
The guide to the right codec
Finding the right codec for the task is not a simple question. It depends what you want to focus on: Quality? Speed? File Size? Sync with other sounds? It also depends on the platform(s) you are working on, obviously, as it dictates the available choices.
- Best for Quality vs File Size (smallest file for equal quality): Opus, followed by Vorbis.
- Best for Quality vs Speed: PCM aside (obviously), Vorbis offers the best quality for the least CPU in software decoding. If Opus is available in hardware, then it is the best choice.
- Best for raw Speed: PCM, then ADPCM. Obviously, all hardware codecs win, when supported and used in high-latency mode.
My recommendation for a default codec would be Vorbis, or Opus when available in hardware, at mid-quality levels.
It is recommended that you run a sample of your sounds through your selected codecs and settings and carefully listen for artifacts before committing an entire section of the sound design to a specific codec setup.
As a rule of thumb, mixing hardware and software codecs is a good thing to achieve the best throughput i.e. maximize the number of samples processed in the shortest time. While the CPU decodes the software files, the hardware can do its work in parallel. If you do not have any software sources, you are limited by the hardware decoder speed, which is not always fast.
Voice Overs
Modern games usually ship with tons of VO files, so minimizing their size is the goal. There is no better codec than Opus for this, even when it is decoded in software. Opus has a sub-codec entirely dedicated to human voice compression. Usually, only one or two VO lines are active at a time, so the decoding cost remains limited. Obviously, this isn’t even an issue on the 3 platforms that support hardware-decoded Opus.
Gunshots, impacts and other short granular sounds
This type of sound is usually played very often in games, and is often stitched together to create variations from a subset of files. Given that the files are short, Vorbis is the best choice here because it compresses well and is fast to decode on all platforms. Opus could also be considered on the PS5 and XboxSeriesX, but if you are optimizing for file size, this would be wasteful. On low-end mobile, ADPCM might be something to consider if you are in a bind and need extra CPU cycles.
General SFX
For any longer SFX, you should prefer the hardware codecs, when available. If not supported on your platform, then Vorbis is the better option, followed by ADPCM if the CPU is too taxed.
Ambience
Background ambience sounds are usually made of loops and added detailing sounds of various lengths. Depending on the design, there can be a good number of those playing. For all platforms that support hardware codecs, use them. Otherwise, Vorbis will fit the bill quite well.
Music
Opus is an excellent choice here, as it is a codec that was designed with music in mind. When decoded in software, if your music is very complex or has many layers playing at the same time, Opus might be costly on the CPU. Hardware codecs can fit the bill properly, if all the tracks are hardware-decoded. Mixing software- and hardware-decoded music tracks can be tricky to synchronize (see the Hardware Latency section above). Vorbis is the second-best choice.
Last words of advice
The simplest advice I could give about codecs is: don’t fret too much about it! At least in the beginning of your project. You can definitely pick a generic codec for all your sources to start, using a mid-range quality. Or define three or four Conversion ShareSets for wide categories of sounds. Then, when the project is well under way and you start to have an idea of the performance of your design, start profiling and change the codecs where required. Given that Wwise can override the Conversion Settings at any level of the hierarchy, it is easy to use a specialized codec for a particular section of your sound design.
As long as it sounds good and you’re within budget, don’t worry too much about codecs. If not, remember there are plenty of options to tweak to get you going!