LWMalloc is an ultra-lightweight dynamic memory allocator designed for embedded systems that is said to outperform the ptmalloc allocator used in glibc, with up to 53% faster execution time and 23% lower memory usage.
Malloc can cause memory fragmentation on embedded systems, potentially leading to crashes after the firmware has run long enough. Garbage collection is one technique for lowering fragmentation, but it’s not always practical on resource-constrained devices, and some developers simply avoid using malloc in their firmware, preferring static memory allocation or memory pools for better reliability. Custom dynamic memory allocators are another option, and that’s what LWMalloc provides.
LWMalloc is described in a paper entitled “LWMalloc: A Lightweight Dynamic Memory Allocator for Resource-Constrained Environments” as follows:
LWMalloc incorporates a lightweight data structure, a deferred coalescing (DC) policy, and dedicated small chunk pools to optimize memory allocation. The lightweight data structure minimizes metadata overhead, ensuring a compact and efficient implementation. The DC policy reduces execution overhead by postponing redundant operations until allocation, maintaining both efficiency and low-response times. Dedicated small chunk pools enable O(1) allocation for small memory requests, which are common in dynamic allocation patterns, by segregating them into fixed-size pools.
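The paper itself is paywalled, so the snippet below is not the authors’ code, just an illustrative sketch of the general fixed-size pool idea that makes O(1) small-chunk allocation possible: free blocks are kept on a singly linked list, so both allocation and deallocation are a single pointer swap.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal fixed-size block pool: O(1) alloc/free, no fragmentation
 * within the pool. Names and sizes are arbitrary for the example;
 * a real pool would add alignment guarantees and debug checks. */
#define BLOCK_SIZE  32          /* must be >= sizeof(void *) */
#define NUM_BLOCKS  16

static uint8_t pool[NUM_BLOCKS][BLOCK_SIZE];
static void *free_list;         /* head of the free-block list */
static int initialized;

static void pool_init(void)
{
    for (int i = 0; i < NUM_BLOCKS; i++) {
        *(void **)pool[i] = free_list;  /* link block into free list */
        free_list = pool[i];
    }
    initialized = 1;
}

void *pool_alloc(void)
{
    if (!initialized)
        pool_init();
    void *p = free_list;
    if (p)
        free_list = *(void **)p;        /* pop the head block */
    return p;                           /* NULL when the pool is empty */
}

void pool_free(void *p)
{
    *(void **)p = free_list;            /* push the block back */
    free_list = p;
}
```

Segregating requests this way is what removes the free-list search that a general-purpose allocator has to do for every call.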
The researchers at Seoul National University of Science and Technology (SEOULTECH) who developed the LWMalloc dynamic memory allocator tested it against ptmalloc and reported the following results and highlights:
- 53% faster execution time
- 23% lower memory usage
- LWMalloc consists of 530 lines of code with a 20 KB footprint, compared to 4,838 lines of code and a 116 KB footprint for ptmalloc
The associated press release also mentions jemalloc, tcmalloc, and mimalloc as alternatives to ptmalloc for improved memory management, but according to the Korean researchers they “suffer from heavy memory consumption, vast library sizes, complexity, and eventual performance degradation”. Access to the full research paper requires a paid IEEE subscription, but I noticed the C code and test program can also be found on GitHub. It uses the standard malloc, calloc, realloc, and free calls, so integrating it into your project would require no application code changes, and the library can be injected at runtime to replace malloc/calloc/realloc/free via LD_PRELOAD.
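In practice, LD_PRELOAD injection would look something like the commands below. The file names are assumptions for illustration; check the GitHub repository for the actual source layout and build instructions.

```shell
# Build the allocator as a shared library (hypothetical file names):
gcc -O2 -shared -fPIC -o liblwmalloc.so lwmalloc.c

# Run an unmodified binary with LWMalloc substituted for glibc's
# ptmalloc; the dynamic linker resolves malloc/calloc/realloc/free
# to the preloaded library first:
LD_PRELOAD=$PWD/liblwmalloc.so ./your_application
```

This works because the dynamic linker searches preloaded objects before libc when resolving symbols, so no recompilation of the application is needed.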
While LWMalloc can benefit any embedded or IoT system with strict memory and performance constraints, SEOULTECH highlights consumer electronics (smart TVs, set-top boxes, home appliances), mobile and wearable devices, automotive systems with real-time constraints, and edge AI computing applications.
Thanks to TLS for the tip.

Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager and starting to write daily news and reviews full time later in 2011.

There’s no support for concurrency in the code, so you’ll need to wrap it with a mutex if you’re using this in code that’s multi-threaded, might be called from an interrupt handler, etc.
(Much of the complexity of fancier allocators comes from optimizing for multi-threaded access. Simply using a mutex is not good for performance.)
It’s always interesting to see new allocators, but each time they claim X% faster and Y% lower memory usage, until they face reality and have to implement thread support, where suddenly the relative performance compared to others is simply divided by the number of threads. The only way to correct this is then to implement thread-local pools and/or larger/more complex structures, and the lower memory usage is gone as well. And finally users ask for memalign() since it’s quite common on embedded systems, and that’s more code complexity.
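The thread-local-pool approach mentioned above can be sketched as a per-thread bump arena: the fast path touches only thread-local state, so it needs no lock. This is purely illustrative; cross-thread free() handling is omitted, and handling it is exactly where real allocators regain their size and complexity.

```c
#include <stddef.h>
#include <stdlib.h>

enum { ARENA_SIZE = 64 * 1024 };

/* Each thread gets its own arena, so arena_alloc() never contends. */
static _Thread_local _Alignas(8) unsigned char arena[ARENA_SIZE];
static _Thread_local size_t arena_used;

void *arena_alloc(size_t n)
{
    n = (n + 7) & ~(size_t)7;          /* round up to 8-byte alignment */
    if (arena_used + n > ARENA_SIZE)
        return malloc(n);              /* fall back to the global heap */
    void *p = arena + arena_used;
    arena_used += n;
    return p;
}
```

The memory cost is visible right in the declaration: every thread reserves its own 64 KB arena whether it allocates or not, which is how the "lower memory usage" claim erodes.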
Another point is that it only relies on sbrk(), which means that it cannot release memory. This can be a problem for long-lived processes that are gracefully replaced because all the unused RAM of the old process cannot be released to be usable by the new one, which results in a much higher memory usage.
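The sbrk() limitation the comment points at is structural: the program break is a single contiguous boundary, so only memory at the very top of the heap can ever be handed back, whereas an mmap-based allocator can munmap() any region independently. A Linux-specific sketch:

```c
#define _DEFAULT_SOURCE
#include <unistd.h>
#include <stdint.h>

/* Grow the heap by one page via sbrk(), then release it again.
 * Release is only possible here because nothing live sits above the
 * page: any in-use allocation below the top pins every page above
 * the lowest live block. Returns 0 if the break ends where it began,
 * -1 on sbrk() failure. */
long grow_and_shrink(void)
{
    void *before = sbrk(0);            /* current program break */
    if (sbrk(4096) == (void *)-1)      /* grow by one page */
        return -1;
    if (sbrk(-4096) == (void *)-1)     /* shrink: top of heap is free */
        return -1;
    void *after = sbrk(0);
    return (uint8_t *)after - (uint8_t *)before;
}
```

For the process-replacement scenario described above, the old process’s scattered free chunks are almost never all at the top, so the break stays high until the process exits.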
Musl’s allocator on the other hand is well balanced (threads, mmap) and not significantly larger (maybe around 20% at first glance).
May I suggest that super small allocators should be tiny at the expense of performance and sometimes even memory usage (most of us don’t care about wasting 8 more bytes when calling strdup() on a file name), and that those focusing on performance must absolutely support threads and be measured in that situation; otherwise performance is totally meaningless nowadays.
Anyway, let’s see if this one lives, for how long, and how it will evolve. It might be usable on an ESP32, though maybe there are simpler and/or more efficient allocators for such targets, I don’t know.