| This document summarizes the common approaches for performance fine tuning with |
| jemalloc (as of 5.1.0). The default configuration of jemalloc tends to work |
| reasonably well in practice, and most applications should not have to tune any |
| options. However, in order to cover a wide range of applications and avoid |
| pathological cases, the default setting is sometimes kept conservative and |
| suboptimal, even for many common workloads. When jemalloc is properly tuned for |
| a specific application / workload, it is common to improve system level metrics |
| by a few percent, or make favorable trade-offs. |
| |
| |
| ## Notable runtime options for performance tuning |
| |
| Runtime options can be set via |
| [malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning). |
| |
| * [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread) |
| |
| Enabling jemalloc background threads generally improves the tail latency for |
| application threads, since unused memory purging is shifted to the dedicated |
| background threads. In addition, unintended purging delay caused by |
| application inactivity is avoided with background threads. |
| |
| Suggested: `background_thread:true` when jemalloc managed threads can be |
| allowed. |
| |
| * [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp) |
| |
| Allowing jemalloc to utilize transparent huge pages for its internal |
| metadata usually reduces TLB misses significantly, especially for programs |
| with large memory footprint and frequent allocation / deallocation |
| activities. Metadata memory usage may increase due to the use of huge |
| pages. |
| |
| Suggested for allocation intensive programs: `metadata_thp:auto` or |
| `metadata_thp:always`, which is expected to improve CPU utilization at a |
| small memory cost. |
| |
| * [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and |
| [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms) |
| |
| Decay time determines how fast jemalloc returns unused pages back to the |
| operating system, and therefore provides a fairly straightforward trade-off |
| between CPU and memory usage. Shorter decay time purges unused pages faster |
| to reduces memory usage (usually at the cost of more CPU cycles spent on |
| purging), and vice versa. |
| |
| Suggested: tune the values based on the desired trade-offs. |
| |
| * [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas) |
| |
| By default jemalloc uses multiple arenas to reduce internal lock contention. |
| However high arena count may also increase overall memory fragmentation, |
| since arenas manage memory independently. When high degree of parallelism |
| is not expected at the allocator level, lower number of arenas often |
| improves memory usage. |
| |
| Suggested: if low parallelism is expected, try lower arena count while |
| monitoring CPU and memory usage. |
| |
| * [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena) |
| |
| Enable dynamic thread to arena association based on running CPU. This has |
| the potential to improve locality, e.g. when thread to CPU affinity is |
| present. |
| |
| Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if |
| thread migration between processors is expected to be infrequent. |
| |
| Examples: |
| |
| * High resource consumption application, prioritizing CPU utilization: |
| |
| `background_thread:true,metadata_thp:auto` combined with relaxed decay time |
| (increased `dirty_decay_ms` and / or `muzzy_decay_ms`, |
| e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`). |
| |
| * High resource consumption application, prioritizing memory usage: |
| |
| `background_thread:true` combined with shorter decay time (decreased |
| `dirty_decay_ms` and / or `muzzy_decay_ms`, |
| e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and lower arena count |
| (e.g. number of CPUs). |
| |
| * Low resource consumption application: |
| |
| `narenas:1,lg_tcache_max:13` combined with shorter decay time (decreased |
| `dirty_decay_ms` and / or `muzzy_decay_ms`,e.g. |
| `dirty_decay_ms:1000,muzzy_decay_ms:0`). |
| |
| * Extremely conservative -- minimize memory usage at all costs, only suitable when |
| allocation activity is very rare: |
| |
| `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0` |
| |
| Note that it is recommended to combine the options with `abort_conf:true` which |
| aborts immediately on illegal options. |
| |
| ## Beyond runtime options |
| |
| In addition to the runtime options, there are a number of programmatic ways to |
| improve application performance with jemalloc. |
| |
| * [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create) |
| |
| Manually created arenas can help performance in various ways, e.g. by |
| managing locality and contention for specific usages. For example, |
| applications can explicitly allocate frequently accessed objects from a |
| dedicated arena with |
| [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve |
| locality. In addition, explicit arenas often benefit from individually |
| tuned options, e.g. relaxed [decay |
| time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if |
| frequent reuse is expected. |
| |
| * [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks) |
| |
| Extent hooks allow customization for managing underlying memory. One use |
| case for performance purpose is to utilize huge pages -- for example, |
| [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp) |
| uses explicit arenas with customized extent hooks to manage 1GB huge pages |
| for frequently accessed data, which reduces TLB misses significantly. |
| |
| * [Explicit thread-to-arena |
| binding](http://jemalloc.net/jemalloc.3.html#thread.arena) |
| |
| It is common for some threads in an application to have different memory |
| access / allocation patterns. Threads with heavy workloads often benefit |
| from explicit binding, e.g. binding very active threads to dedicated arenas |
| may reduce contention at the allocator level. |