In this article, we’ll be introducing and detailing one of the new systems in Wwise 2024.1, a new default memory allocator for Wwise called the AkMemoryArena. Compared to previous versions of Wwise, this new memory allocator can deliver significant improvements to Wwise’s utilization of memory resources, but proper study and configuration of the memory system can help achieve even further optimizations for many projects.
A Review of Memory Reservation and Fragmentation
Wwise 2019.2 included support for a new default memory allocator. Highlights of this allocator included:
- Block-based allocations at all size-classes, to mitigate internal memory fragmentation
- Thread-caching behaviour, allowing individual threads to have little-to-no contention when allocating and freeing memory
- Growing memory resources on demand as needed, removing the need to have pre-sized memory pools
This new memory allocator provided excellent performance, and the ability to map memory on demand greatly improved the flexibility of Wwise.
However, while the improvements were noted, we found that as more developers used it, there were some deficiencies that came to light regarding overall memory utilization. There were three main issues noted:
- The block-based allocation scheme meant that each allocation tended to be larger than required, and, on average, the total “Memory Used” by many titles increased by approximately 10-15%.
- The thread-caching behavior would result in increases in “Memory Reserved” beyond what was expected, simply due to memory allocations happening across more threads. In one particularly severe case, we saw dozens of application-managed threads each instantiate a unique heap that was barely ever used and never released.
- The allocator would request new pages of memory in large spans at a time, but the allocator also expected that subsets of those spans could be unmapped over time – leaving those portions unused – before being released altogether.
While some of these issues are shared with other thread-caching memory allocators, others arose specifically from the conditions under which Wwise operates: the Wwise sound engine is a middleware library. Middleware sits in the middle of the software stack, after all, and therefore often cannot interact with the lowest-level aspects of the platform. This includes things such as direct control of every thread being used for execution, or direct interaction with virtual memory systems for mapping and unmapping of memory. For example, we found that most game engine integrations with Wwise could not support the concept of partial unmapping of memory, which often resulted in large amounts of memory being left reserved but unused.
Over the years, various measures were introduced to keep the amount of memory reserved under control, but these tended to increase the CPU cost of using the allocator, negating many of the benefits that were initially recognized. Even with that effort, there were some core design decisions in the memory allocator that we could not reasonably work around in order to achieve the best possible use of resources for the Wwise sound engine.
Before continuing, it is worth noting that there are definitely many use cases where thread-caching allocators can deliver excellent results, such as in desktop or server applications where large amounts of memory are available. For example, the LLVM project recently moved their default memory allocator for their toolchain on Windows to rpmalloc and realized substantial performance improvements in the process. We simply concluded that this strategy was not the best fit for Wwise, due to the many unique requirements of our software, which are not universally shared by other applications.
That all being said, let’s now discuss the AkMemoryArena, our new system for allocating memory, which we feel does a better job of satisfying Wwise’s requirements.
Overview of AkMemoryArena
In 2024.1, we are introducing AkMemoryArena as a new memory allocator for Wwise. Unlike previous memory allocators offered by Wwise over the years, which were all off-the-shelf solutions adapted for our use, this is the first time that we have built a full memory allocator entirely from the ground up. This was so that we could target all of Wwise’s requirements from the start.
The AkMemoryArena includes the following features:
Ability to Map and Unmap Spans of Memory Dynamically
We still retain the ability to map new spans of memory when required, as well as unmapping those spans when all constituent allocations have been freed. This means that we do not strictly require memory pools to be pre-sized with a hard limit. In our experience, not having a predetermined limit for memory provides much greater flexibility and stability for most developers.
Good CPU and Memory Fragmentation Performance
The AkMemoryArena uses a mixture of different allocation algorithms which seek to deliver a balance between overall CPU performance and memory utilization across a variety of scenarios:
- A block-based allocator is used for small allocations which are less than 256 bytes in size
  - i.e. the “Small-Block Allocator” (or SBA) section of the arena
- A free-list-based allocator, which uses a “Good-fit” allocation policy, for medium-sized allocations
  - i.e. the “Two-level segregated fit” (or TLSF) section of the arena
- Standalone spans for huge-sized allocations
This should give developers the ability to control overall memory fragmentation over time. Also, the allocation algorithms all have constant-time performance characteristics, which should help keep CPU overhead low in most scenarios.
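The routing between the three sections can be pictured as a simple size check. The following is only a sketch of that idea, not the arena's actual implementation: the 256-byte cutoff comes from the list above, while `SelectSection`, `ArenaSection`, and the exact handling of the huge boundary are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Which section of the arena would service a request (names are hypothetical).
enum class ArenaSection { SmallBlock, Tlsf, Huge };

// Allocations below this size go to the Small-Block Allocator (from the article).
constexpr std::size_t kSmallBlockLimit = 256;

// hugeThreshold stands in for a configurable value such as
// AkMemoryArenaSettings::uAllocSizeHuge. Treating a request that is exactly
// equal to the threshold as "huge" is an assumption of this sketch.
ArenaSection SelectSection(std::size_t size, std::size_t hugeThreshold)
{
    assert(hugeThreshold >= kSmallBlockLimit);
    if (size < kSmallBlockLimit) return ArenaSection::SmallBlock;
    if (size < hugeThreshold)    return ArenaSection::Tlsf;
    return ArenaSection::Huge;
}
```

With a hypothetical 1 MiB huge threshold, a 64-byte request would land in the SBA, a 4 KiB request in the TLSF section, and a 2 MiB request in its own standalone span.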
The AkMemoryArena also does not have any thread-local caching, and instead relies on lightweight locking mechanisms to handle synchronization across threads. By sharing the state of the memory allocator across all available threads, this should keep memory reservation predictable, even in multi-threaded scenarios.
The lack of thread-local caching does have a drawback in that frequent memory allocations across multiple threads can hinder performance. To mitigate the frequency of memory allocations, especially during audio rendering across multiple cores, Wwise now also uses some specialized memory allocators for handling short-lived memory allocations, such as TempAlloc and BookmarkAlloc, which utilize their own thread-local state to nearly eliminate the need for any cross-thread synchronization.
Outside of the CPU overhead of memory allocations themselves, we have also found that it is more viable to utilize Huge Pages about 2MiB in size for mapping memory. By utilizing Huge Pages, it is possible to reduce the frequency of translation lookaside buffer misses on the CPU. This typically improves overall CPU performance by about 10%, compared to when using smaller pages that are 4KiB or 16KiB in size.
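To get a feel for why larger pages relieve pressure on the translation lookaside buffer, it helps to count how many pages (and therefore distinct address translations) are needed to cover the same amount of memory. The helper below is a small illustrative calculation; the 64 MiB working set used in the note that follows is an arbitrary example figure, not a measurement from Wwise.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t KiB = 1024;
constexpr std::size_t MiB = 1024 * KiB;

// Number of pages required to cover `bytes` of memory, rounding up.
std::size_t PagesNeeded(std::size_t bytes, std::size_t pageSize)
{
    assert(pageSize > 0);
    return (bytes + pageSize - 1) / pageSize;
}
```

For a 64 MiB region, this gives 16384 pages at 4 KiB, 4096 pages at 16 KiB, and only 32 pages at 2 MiB, so far fewer translations compete for the same TLB entries.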
Simple Game Engine Integration
In order to simplify game engine integrations, and try to remove this as a point of error or uncertainty when allocating memory, we made sure that the callbacks used by the AkMemoryArena to acquire and release spans of memory were as simple as possible.
Each AkMemoryArena requires a pair of user-provided callbacks for managing memory: one to allocate spans of memory, and one to free spans of memory.
The implementation of these callbacks has very few requirements:
- There are no special requirements for how memory is mapped and unmapped
- There are no requirements on memory alignment
- The only data provided when freeing spans is exactly the same data that is returned from previous calls to allocate spans – both the user-provided address to the allocation, as well as a pointer to an arbitrary userData.
As an example of how simple the callbacks can be, a valid implementation of these functions can simply forward some of the parameters of the function to std::malloc() and std::free() without any extra logic. In fact, on Windows and POSIX-based platforms, this is exactly what the default versions of these hooks in the Wwise sound engine are!
Highly Configurable for Different Project Needs
Many projects and teams have varying philosophies on how to approach memory utilization, or teams may find that certain decisions for memory utilization are required due to the nature of the content being developed for their game. For example,
- One team may prefer that all memory across the application is allocated and fully budgeted across every system at startup, to make sure that memory use is extremely predictable
- Another team may instead prefer to have memory mapped and unmapped at a very granular level, so that memory can be used by other systems only as needed
Where appropriate, we tried to make sure that we had an abundance of configuration options to satisfy a variety of different needs and use cases.
For example, while the AkMemoryArenas do not require all memory to be pre-reserved at initialization time, it is possible to configure them so that they effectively behave as such, by setting the initial span to a very large size. This initial span of memory could even be imposed as a soft limit in your callbacks for allocating memory, by issuing a warning every time they are invoked after the first call.
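A soft limit of this kind could be sketched as follows. The function names and the simplified signatures are hypothetical stand-ins rather than the SDK's actual callback types; the sketch only assumes, as described above, that the first invocation requests the initial span and that any later invocation means the arena has grown beyond it.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Counters are atomic because spans may be requested from any thread.
std::atomic<int> g_spanAllocCalls{0};
std::atomic<int> g_softLimitWarnings{0};

// Hypothetical, simplified span-allocation hook (not the real SDK signature).
void* SoftLimitAllocSpan(std::size_t spanSize)
{
    assert(spanSize > 0);
    // fetch_add returns the previous count: 0 means this is the initial span.
    if (g_spanAllocCalls.fetch_add(1) > 0)
    {
        g_softLimitWarnings.fetch_add(1);
        std::fprintf(stderr,
                     "Arena grew past its initial span: %zu more bytes requested\n",
                     spanSize);
    }
    return std::malloc(spanSize);
}

// Matching hypothetical hook for releasing a span.
void SoftLimitFreeSpan(void* span)
{
    std::free(span);
}
```

The allocation still succeeds after the warning, which is what makes the limit "soft": the budget overrun is reported without destabilizing the game.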
Profiler for Monitoring Memory Fragmentation
As a part of building the AkMemoryArenas into Wwise, we wanted to take advantage of the opportunity to set up systems to offer detailed monitoring and profiling of the arenas over time. We expect that this will be valuable to help guide configuration of the AkMemoryArenas for your titles, as well as allow us to provide more detailed support and assistance for identifying issues or optimization opportunities when necessary.
Reduced Memory Usage
In addition to these features, we have found that the total memory reserved, and even total memory used, by the AkMemoryArena tends to be much lower than what was achievable before.
Profiling AkMemoryArena
The new Profiler for the AkMemoryArena provides the following statistics and data for each span in each Arena:
- The Address and UserData returned from each call to the allocation callback, fnMemAllocSpan
  - Note that every row in the main view represents one call to fnMemAllocSpan
- The size of the span
- How much of the span has been allocated, and how much is free
- A fragmentation map, illustrating which regions of the span are allocated
Some statistics are provided for each AkMemoryArena in aggregate, as well. In addition to simple accumulations such as the total memory used and reserved, the Profiler also lists the “Largest Free” available space in the arena. That is, the largest allocation that the arena can support before it has to request a new span. When this value is compared to the total free memory, it can be used as a general indicator of how fragmented the Arena is.
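One way to turn that comparison into a single number is sketched below. This ratio is a commonly used heuristic and an assumption of this example; the Profiler itself simply reports the two values, and `FragmentationIndicator` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Returns 0.0 when all free memory is one contiguous region, and approaches
// 1.0 as the free memory splinters into many small regions.
double FragmentationIndicator(std::size_t largestFree, std::size_t totalFree)
{
    assert(largestFree <= totalFree);
    if (totalFree == 0)
        return 0.0; // nothing free, so nothing to fragment
    return 1.0 - static_cast<double>(largestFree) / static_cast<double>(totalFree);
}
```

For instance, an arena with 1024 KiB free in total but a Largest Free of only 256 KiB would score 0.75, suggesting that most of the free memory cannot serve a single large request.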
All of this is accomplished with a very minor runtime cost, and this data does not require the full application history in order to be calculated. This means that even if a game has been running for multiple hours, the Authoring application will be able to connect to the game and evaluate the state of the memory layout with very little effort.
It should be noted that the trade-off in reducing the CPU overhead of tracking this data is that it provides only a coarse level-of-detail of the state of the AkMemoryArenas. While it may not provide a per-allocation level of granularity – as many specialized tools for analyzing memory do – we still feel that this should be enough data to allow a user to easily evaluate if there is a problem with the memory utilization, use it to direct further inspection if a problem is found, and then also guide decisions for further configuration of the arenas and memory strategies.
Integration into Game Engine
For users who maintain a custom game engine, as opposed to using our pre-made integrations of Wwise into Unity and Unreal, it would be worth reconsidering some aspects of how the memory system can be integrated and configured.
To start, it should be noted that all the existing callbacks for handling individual memory allocations are still available, and their behaviour has not been modified. If you prefer using your own memory allocator for every allocation from Wwise, that option is available.
However, using these callbacks will preclude any usage of the AkMemoryArenas as well as the new Profiler described above. Even if you feel the need to make sure that every memory allocation is accounted for, it may be worth considering simply using the AkMemoryArena instead, due to the tooling available. Leaning on AkMemoryArenas to handle most memory allocations may also relieve the pressure on other global memory systems you may have or help simplify the development of other tools for tracking memory usage in Wwise.
It is also worth noting that even if the AkMemoryArena is in use, it is still possible to record some metadata about each individual memory allocation. The “Debug” memory hooks in the AkMemSettings are available for this purpose.
These memory hooks are active regardless of whether Wwise is using its built-in memory allocator systems, or the individual memory allocation callbacks are used instead. These callbacks may be helpful to use if an in-depth diagnosis of memory fragmentation needs to be performed.
As discussed previously, we did try to keep the initial setup and integration for the AkMemoryArenas as simple as possible. The following demonstrates an example implementation of the callbacks for allocating and freeing memory.
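A minimal sketch of such a pair of callbacks is given here, forwarding directly to std::malloc() and std::free() as described earlier. The function names and parameter lists are assumptions for illustration: they only mirror the contract described above (an address plus an arbitrary userData pointer returned on allocation, and the same two values handed back on free). Consult the SDK headers for the actual callback declarations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical allocation hook: returns the span address and reports an
// arbitrary userData pointer through outUserData.
void* ExampleAllocSpan(std::size_t spanSize, void** outUserData)
{
    assert(outUserData != nullptr);
    *outUserData = nullptr;       // this sketch has nothing extra to remember
    return std::malloc(spanSize); // no special mapping or alignment needed
}

// Hypothetical free hook: receives exactly what the allocation hook returned.
void ExampleFreeSpan(void* address, void* userData)
{
    (void)userData; // unused in this sketch
    std::free(address);
}
```

An engine with its own page allocator could use userData to carry, for example, a handle to the underlying mapping, so that the free callback does not need a lookup table.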
Because the AkMemoryArenas are configured separately, it is even possible to use different callbacks for each AkMemoryArena. This may be applicable if you want to use different low-level memory allocators for the Primary or Media arenas, due to the relative differences in lifetimes and sizes of the allocations used by those systems.
Alternatively, if you want to disable the use of certain memory arenas, you can just set the callbacks to nullptr:
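As a rough illustration of the idea, the sketch below uses stand-in types rather than the SDK's own declarations. The `Sketch*` names are hypothetical, the layout of the settings is an assumption, and the fnMemFreeSpan field name is assumed by analogy with fnMemAllocSpan.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in callback types; the real ones are declared in the Wwise SDK.
using SketchAllocSpanFn = void* (*)(std::size_t spanSize, void** outUserData);
using SketchFreeSpanFn  = void (*)(void* address, void* userData);

struct SketchArenaSettings
{
    SketchAllocSpanFn fnMemAllocSpan = nullptr;
    SketchFreeSpanFn  fnMemFreeSpan  = nullptr; // field name is an assumption
};

// Stand-ins for the Primary, Media, Profiler, and Device arenas.
enum SketchArena { Sketch_Primary, Sketch_Media, Sketch_Profiler, Sketch_Device, Sketch_Count };

struct SketchMemSettings
{
    SketchArenaSettings memoryArenaSettings[Sketch_Count];
};

// Disabling an arena amounts to clearing both of its callbacks.
void DisableArena(SketchMemSettings& settings, SketchArena arena)
{
    assert(arena < Sketch_Count);
    settings.memoryArenaSettings[arena].fnMemAllocSpan = nullptr;
    settings.memoryArenaSettings[arena].fnMemFreeSpan  = nullptr;
}

bool IsArenaEnabled(const SketchMemSettings& settings, SketchArena arena)
{
    return settings.memoryArenaSettings[arena].fnMemAllocSpan != nullptr;
}
```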
This is used to disable the Profiler arena (AkMemoryMgrArena_Profiler) in Release configurations by default, or to disable the Device memory arena (AkMemoryMgrArena_Device) on platforms that do not use device-specific memory for audio processing.
Similarly, depending on how the rest of the integration of Wwise into your game engine is set up, it may be that Wwise never allocates memory to the Media arena at all in normal gameplay. This may be the case if your project exclusively uses APIs such as AK::SoundEngine::SetMedia and AK::SoundEngine::LoadBankMemoryView, instead of AK::SoundEngine::LoadBank or AK::SoundEngine::LoadBankMemoryCopy. Note that some Media allocations are still performed when the Authoring tool transfers new media to the sound engine during profiling, so this option should only be considered if you are targeting the Release configuration of Wwise.
Depending on your game engine’s integration, it may also be worth considering using the Memory Arenas outside of the sound engine as well. For example, it is possible to create your own memory allocation using AK::MemoryMgr::Malloc, load your SoundBank data into that allocation, and then provide that memory to AK::SoundEngine::LoadBankMemoryView. This would mean that the memory would still be managed by the AkMemoryArena, allowing you to gain the benefits of the Profiler and other systems, but still allow for the memory allocation to be owned by your code, and with a lifetime that you control.
Further Configuration and Tuning
The memoryArenaSettings array in AkMemSettings is used to configure other parameters of each Memory Arena. The sound engine sets this array to reasonable defaults so that systems behave well under the most common circumstances. However, because each game has different content and memory requirements, tuning these settings for your game’s content can yield significant improvements to memory utilization. For example, in our testing against simulated game content, we found that we could reduce overall Memory Reserved by 5-10% with a few simple tweaks to the AkMemoryArena settings for each test.
The following are some simple suggestions we recommend considering:
- Set AkMemoryArenaSettings::uTlsfInitSize to match typical memory usage, or target memory budgets. Typically, we have found that a larger “Base”, or initial span, tends to result in less memory fragmentation than many individual spans of memory. A larger initial size can also help clarify total system-wide memory reservation at startup, and help with budgeting of memory in other domains.
- Set AkMemoryArenaSettings::uSbaInitSize to a watermark identified in the Profiler. The SBA has its own “Base” span, which has the extra benefit of a reduced memory footprint for each constituent allocation: approximately 16 bytes per allocation. Setting this to a higher value, so that more ‘small’ allocations go into the SBA Base Span, can directly lower not just Memory Reserved, but also Memory Used.
- Set AkMemoryArenaSettings::uAllocSizeHuge to a lower value to reduce fragmentation in the TLSF spans. A lower value here will ensure that more allocations become standalone “Huge” spans, instead of being in the TLSF spans. Note that this is highly dependent on how the integration of the Memory Arena hook fnMemAllocSpan is set up, as this will also result in more calls to fnMemAllocSpan, and it assumes that external fragmentation from the spans is not an issue.
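Put together, a tuning pass along these lines might look like the following sketch. The struct is a stand-in that only borrows the three field names mentioned above (the real AkMemoryArenaSettings has its own field types and additional members), and every number is an illustrative placeholder rather than a default or a recommendation.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kKiB = 1024;
constexpr std::size_t kMiB = 1024 * kKiB;

// Stand-in for the three tuning fields discussed above (types are assumed).
struct SketchArenaTuning
{
    std::size_t uTlsfInitSize;
    std::size_t uSbaInitSize;
    std::size_t uAllocSizeHuge;
};

// Builds settings from figures you would read off the Profiler for your title.
SketchArenaTuning MakeExampleTuning()
{
    SketchArenaTuning tuning{};
    tuning.uTlsfInitSize  = 32 * kMiB; // placeholder: typical usage or target budget
    tuning.uSbaInitSize   = 2 * kMiB;  // placeholder: SBA watermark seen in the Profiler
    tuning.uAllocSizeHuge = 1 * kMiB;  // placeholder: lower = more standalone spans
    assert(tuning.uAllocSizeHuge > 0);
    return tuning;
}
```

After changing values like these, it is worth re-checking Memory Reserved, Memory Used, and Largest Free in the Profiler to confirm that the change actually helped for your content.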
Other suggestions like these are available in the Wwise SDK Documentation in Configuration and Tuning of AkMemoryArenas.