@jwbee
Last active March 25, 2025 12:56

Make Ubuntu packages 90% faster by rebuilding them

TL;DR

You can take the same source code package that Ubuntu uses to build jq, compile it again, and realize 90% better performance.

Setting

I use jq for processing GeoJSON files and other open data offered in JSON format. Today I am working with a 500MB GeoJSON file that contains the Alameda County Assessor's parcel map. I want to run a query that prints the city for every parcel worth less than a threshold amount. The program is

.features[] | select(.properties.TotalNetValue < 193000) | .properties.SitusCity

This takes about 5 seconds with the file cached, on a Ryzen 9 9950X system. That seems a bit shabby and I am sure we can do better.
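
For reference, the benchmarks below assume the filter is saved as /tmp/select.jq and the data as /tmp/parcels.geojson; a minimal setup (the GeoJSON file itself is not included here) looks something like this:

% cat > /tmp/select.jq <<'EOF'
.features[] | select(.properties.TotalNetValue < 193000) | .properties.SitusCity
EOF
% taskset -c 2 jq -rf /tmp/select.jq /tmp/parcels.geojson | head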

Step 1: Just rebuild the package

What happens if you grab the jq source code from Launchpad, then configure and rebuild it with no flags at all? Even that is about 2-4% faster than the Ubuntu binary package.
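
The exact commands aren't shown above; assuming the 1.7.1 sources are unpacked to ~/jq-jq-1.7.1 (the path used in the benchmarks), a plain rebuild looks roughly like:

% cd ~/jq-jq-1.7.1
% autoreconf -fi        # only if the tree does not already ship a generated configure script
% ./configure           # no flags at all
% make -j"$(nproc)"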

We are using hyperfine to get repeatable results. The jq process is pinned to logical CPU 2 with taskset, to keep it away from the system interrupts that run on CPU 0 and to rule out CPU migrations.

% hyperfine --warmup 1 --runs 3 -L binary ~/jq-jq-1.7.1/jq,/usr/bin/jq "taskset -c 2 {binary} -rf /tmp/select.jq /tmp/parcels.geojson"
Benchmark 1: taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      4.517 s ±  0.017 s    [User: 3.907 s, System: 0.610 s]
  Range (min … max):    4.497 s …  4.531 s    3 runs

Benchmark 2: taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      4.641 s ±  0.038 s    [User: 4.013 s, System: 0.628 s]
  Range (min … max):    4.601 s …  4.675 s    3 runs

Summary
  taskset -c 2 /home/jwb/jq-jq-1.7.1/jq  -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.03 ± 0.01 times faster than taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson

Step 2: Rebuild with clang and better flags

Next, let's rebuild the program with my favorite compiler, a higher optimization level, LTO, and some flags that I typically want to help with debugging and profiling. Some of them are irrelevant to this case, but I use the same flags for most builds. The flags that seem to make a performance difference are:

  • -O3 vs -O2
  • -flto
  • -DNDEBUG

The last of those eliminates a lot of assertion overhead, which showed up strongly in the profiles.

% CC=clang-18 LDFLAGS="-flto -g -Wl,--emit-relocs -Wl,-z,now -Wl,--gc-sections -fuse-ld=lld" CFLAGS="-flto -DNDEBUG -fno-omit-frame-pointer -gmlt -march=native -O3 -mno-omit-leaf-frame-pointer -ffunction-sections -fdata-sections" ./configure

Benchmark 1: taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      3.853 s ±  0.033 s    [User: 3.245 s, System: 0.608 s]
  Range (min … max):    3.822 s …  3.887 s    3 runs

Benchmark 2: taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      4.631 s ±  0.047 s    [User: 4.012 s, System: 0.619 s]
  Range (min … max):    4.602 s …  4.686 s    3 runs

Summary
  taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.20 ± 0.02 times faster than taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson

Now we are 20% faster than the Ubuntu package with almost no effort.

Step 3: Add TCMalloc

Jq is a complex C program, and C programs of any complexity tend to rely on malloc and free, because the language offers no other cognizable way to deal with memory. Allocation is the top line in the profile by far. What if we use a better allocator, instead of the one that comes in GNU libc? Ubuntu offers a package of TCMalloc, which is actually rather obsolete and not the current TCMalloc effort, but it's an allocator package in their repo, so let's give it a whirl.

Having added -L/usr/lib/x86_64-linux-gnu -ltcmalloc_minimal to the LDFLAGS and rebuilt ...
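
Spelled out, that is the Step 2 configure line with the extra linker flags appended. The package name providing the unversioned libtcmalloc_minimal.so for linking is an assumption here:

% sudo apt install libgoogle-perftools-dev    # assumed to provide libtcmalloc_minimal.so
% CC=clang-18 \
  LDFLAGS="-flto -g -Wl,--emit-relocs -Wl,-z,now -Wl,--gc-sections -fuse-ld=lld -L/usr/lib/x86_64-linux-gnu -ltcmalloc_minimal" \
  CFLAGS="-flto -DNDEBUG -fno-omit-frame-pointer -gmlt -march=native -O3 -mno-omit-leaf-frame-pointer -ffunction-sections -fdata-sections" \
  ./configure
% make clean && make -j"$(nproc)"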

Benchmark 1: taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      3.253 s ±  0.009 s    [User: 2.625 s, System: 0.628 s]
  Range (min … max):    3.245 s …  3.262 s    3 runs

Benchmark 2: taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      4.611 s ±  0.026 s    [User: 4.015 s, System: 0.596 s]
  Range (min … max):    4.591 s …  4.640 s    3 runs

Summary
  taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.42 ± 0.01 times faster than taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson

This is not bad. We are now > 40% faster than the package upstream tried to foist on us.

Step 4: What about just preloading TCMalloc dynamically?

If the allocator is the issue, it stands to reason that we can get some of that benefit with the stock Ubuntu binary just by overriding the libc allocator via a dynamic preload.
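
A hyperfine invocation along these lines (a sketch; hyperfine simply takes the two command strings as arguments) produces the pair of runs below:

% hyperfine --warmup 1 --runs 3 \
    "LD_PRELOAD= taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson" \
    "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson"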

Benchmark 1: LD_PRELOAD= taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      4.601 s ±  0.027 s    [User: 3.966 s, System: 0.634 s]
  Range (min … max):    4.577 s …  4.630 s    3 runs

Benchmark 2: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      4.082 s ±  0.010 s    [User: 3.476 s, System: 0.606 s]
  Range (min … max):    4.071 s …  4.091 s    3 runs

Summary
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.13 ± 0.01 times faster than LD_PRELOAD= taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson

This by itself is good for 13%. Not bad.

Step 5: Dynamically loading other allocators

Ubuntu also ships packages of jemalloc and mimalloc. We can try them all, and it turns out that mimalloc beats all the others. Note: results obtained after setting MIMALLOC_LARGE_OS_PAGES=1 (for mimalloc), MALLOC_CONF="thp:always,metadata_thp:always" (for jemalloc), and GLIBC_TUNABLES=glibc.malloc.hugetlb=1 (for glibc) in the environment.
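
As a sketch, the environment variables are exported up front and the four configurations are handed to hyperfine, roughly like this (LIB is just a shorthand introduced here):

% export MIMALLOC_LARGE_OS_PAGES=1                        # mimalloc: use large (2 MiB) OS pages
% export MALLOC_CONF="thp:always,metadata_thp:always"     # jemalloc: transparent huge pages
% export GLIBC_TUNABLES=glibc.malloc.hugetlb=1            # glibc malloc: THP via madvise
% LIB=/usr/lib/x86_64-linux-gnu
% hyperfine --warmup 1 --runs 3 \
    "LD_PRELOAD= taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson" \
    "LD_PRELOAD=$LIB/libtcmalloc_minimal.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson" \
    "LD_PRELOAD=$LIB/libjemalloc.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson" \
    "LD_PRELOAD=$LIB/libmimalloc.so taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson"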

Benchmark 1: LD_PRELOAD= taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      4.123 s ±  0.040 s    [User: 3.862 s, System: 0.261 s]
  Range (min … max):    4.084 s …  4.165 s    3 runs

Benchmark 2: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      4.130 s ±  0.017 s    [User: 3.505 s, System: 0.624 s]
  Range (min … max):    4.118 s …  4.149 s    3 runs

Benchmark 3: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      3.510 s ±  0.079 s    [User: 3.223 s, System: 0.286 s]
  Range (min … max):    3.452 s …  3.599 s    3 runs

Benchmark 4: LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      3.154 s ±  0.010 s    [User: 2.889 s, System: 0.265 s]
  Range (min … max):    3.145 s …  3.164 s    3 runs

Summary
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.11 ± 0.03 times faster than LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson
    1.31 ± 0.01 times faster than LD_PRELOAD= taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson
    1.31 ± 0.01 times faster than LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so taskset -c 2 /usr/bin/jq  -rf /tmp/select.jq /tmp/parcels.geojson

Enabling THP benefits the glibc allocator, jemalloc, and mimalloc. The speedup of THP+mimalloc is 31% over THP+glibc and 48% over glibc defaults.
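
If you are reproducing this, the madvise-based settings above generally also need the kernel's THP policy to be madvise or always, which you can check with:

% cat /sys/kernel/mm/transparent_hugepage/enabled    # bracketed entry is the active policy, e.g. "always [madvise] never"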

Step 6: Rebuild with mimalloc

It's cool that mimalloc is fast in this case, but dynamic preloads aren't amazing for performance. Let's rebuild the program with mimalloc.
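
The exact link line isn't shown above; one plausible way (the package name and link-order details are assumptions) is to swap the allocator library in LDFLAGS and rebuild:

% sudo apt install libmimalloc-dev            # assumed to provide libmimalloc.so for linking
% CC=clang-18 \
  LDFLAGS="-flto -g -Wl,--emit-relocs -Wl,-z,now -Wl,--gc-sections -fuse-ld=lld -L/usr/lib/x86_64-linux-gnu -lmimalloc" \
  CFLAGS="-flto -DNDEBUG -fno-omit-frame-pointer -gmlt -march=native -O3 -mno-omit-leaf-frame-pointer -ffunction-sections -fdata-sections" \
  ./configure
% make clean && make -j"$(nproc)"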

Benchmark 1: taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      2.428 s ±  0.019 s    [User: 2.161 s, System: 0.267 s]
  Range (min … max):    2.404 s …  2.464 s    10 runs

Benchmark 2: taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson
  Time (mean ± σ):      4.606 s ±  0.039 s    [User: 3.979 s, System: 0.627 s]
  Range (min … max):    4.522 s …  4.640 s    10 runs

Summary
  taskset -c 2 /home/jwb/jq-jq-1.7.1/jq -rf /tmp/select.jq /tmp/parcels.geojson ran
    1.90 ± 0.02 times faster than taskset -c 2 /usr/bin/jq -rf /tmp/select.jq /tmp/parcels.geojson

Jq rebuilt from source with a better allocator is 1.9x faster, nearly twice as fast as the Ubuntu binary package, for this workload. In another application, processing 2.2GB of JSON in 13,000 files (using rush to parallelize, as sketched below), this build of jq does the job in 0.755s vs 1.424s for the Ubuntu package. That is a speedup of nearly 2x again. These are very satisfactory results.
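
A rough sketch of that parallel run (the directory and per-file filter names are placeholders; rush reads file names on stdin and substitutes each one for {}):

% ls /data/json/*.json | rush 'jq -rf /tmp/filter.jq {}' > /dev/null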

@loganwoolf

Sick. I just learned more about compiling in the last 4 minutes than my entire life before that.

@brucehoult

Since you haven't made the test data available, would you be so kind as to build Boehm GC as a transparent malloc() replacement (build with -DREDIRECT_MALLOC=GC_malloc -DIGNORE_FREE) and let us know the results? Also try with "export GC_free_space_divisor=2" as well as the default 3.

https://github.com/ivmai/bdwgc

@karstenba

@loganwoolf would you mind scooching over, looks like we're in the same boat 😂

@bchewy

bchewy commented Mar 19, 2025

poggers

@sysek

sysek commented Mar 19, 2025

just use Gentoo

@poige

poige commented Mar 19, 2025

> It turns out that mimalloc beats all others.
> Note: mimalloc result obtained after setting MIMALLOC_LARGE_OS_PAGES=1 in the environment.

So you've turned on huge pages for one of them and then "turns out" it beats the others. C'mon, there're lies, damned lies, and statistics "performance tests" like that one. Also, you're blaming Ubuntu's default builds but still use all these libraries… built by Ubuntu?

@jwbee

jwbee commented Mar 19, 2025

> > It turns out that mimalloc beats all others.
> > Note: mimalloc result obtained after setting MIMALLOC_LARGE_OS_PAGES=1 in the environment.
>
> So you've turned on huge pages for one of them and then "turns out" it beats the others. C'mon, there're lies, damned lies, and statistics "performance tests" like that one. Also, you're blaming Ubuntu's default builds but still use all these libraries… built by Ubuntu?

The THP tunable for glibc has no observable effect, and this version of tcmalloc is not THP-aware. I thought that jemalloc would automatically use THP on this system, but I re-ran it with a different MALLOC_CONF for your enjoyment.

@michaelmior

michaelmior commented Mar 19, 2025

@poige I didn't see any blame here. Assuming all the results reported are accurate, which I don't see any reason to doubt, it's true that the final build was 90% faster than the default Ubuntu build. I suppose there could have been some discussion of why the Ubuntu builds are likely the way that they are, but there wasn't any blame here.

EDIT I missed the "package upstream tried to foist on us" part. That's not great.

Also mimalloc requires enabling support for huge pages. In tcmalloc and jemalloc, huge pages are always used. That said, there are certainly other configuration options that could be explored to get a further performance boost.

@jwbee

jwbee commented Mar 19, 2025

> > > It turns out that mimalloc beats all others.
> > > Note: mimalloc result obtained after setting MIMALLOC_LARGE_OS_PAGES=1 in the environment.
> >
> > So you've turned on huge pages for one of them and then "turns out" it beats the others. C'mon, there're lies, damned lies, and statistics "performance tests" like that one. Also, you're blaming Ubuntu's default builds but still use all these libraries… built by Ubuntu?
>
> The THP tunable for glibc has no observable effect, and this version of tcmalloc is not THP-aware. I thought that jemalloc would automatically use THP on this system, but I re-ran it with a different MALLOC_CONF for your enjoyment.

Actually, the glibc tunable set to 1, not 2, has some effect. Updated again.

@poige

poige commented Mar 19, 2025

> > > It turns out that mimalloc beats all others.
> > > Note: mimalloc result obtained after setting MIMALLOC_LARGE_OS_PAGES=1 in the environment.
> >
> > So you've turned on huge pages for one of them and then "turns out" it beats the others. C'mon, there're lies, damned lies, and statistics "performance tests" like that one. Also, you're blaming Ubuntu's default builds but still use all these libraries… built by Ubuntu?
>
> The THP tunable for glibc has no observable effect, and this version of tcmalloc is not THP.

You mean this setting has no effect: GLIBC_TUNABLES=glibc.malloc.hugetlb=2? If so (2), then it's not THP, but straight huge pages, and they gotta be pre-allocated first. That operation can be time-consuming, so the testing methodology should be adjusted accordingly. OTOH, to have THP properly working, one might also need tweaking done correctly under /sys/kernel/mm/… There're already 6 revisions of the gist with that setting flipping between 1 and 2, so yeah, hard to tell what's going on.

> … for your enjoyment.

Thanks, but all above just prove what I've already said: "there're lies, damned lies, and statistics "performance tests" like that one".

And I admit that in these very circumstances there can be a clear winning combination, but people tend to generalize everything not realizing there're edge cases and nuances, and thus might end up being misled by the results.

@poige

poige commented Mar 19, 2025

@michaelmior's

> In tcmalloc and jemalloc, huge pages are always used

aha-aha. See, jwbee (the op) has found MALLOC_CONF, so you might have done so too.

@CmdQ

CmdQ commented Mar 19, 2025

Cool stuff!

Would be interesting to also try the MESH allocator. Heard of it?

@michaelopdenacker

Very interesting article, thanks a lot! I'm very excited about hyperfine too.
