"Make one Ubuntu package 90% faster by rebuilding it and switching the memory allocator"
i wish i could slap people in the face over standard tcp/ip for clickbait. it was ONE package, and some of the gains didn't even come from recompilation.
i have to give it to him, i have preloaded jemalloc into one program to swap the malloc implementation and the results have been very pleasant. not in terms of performance (did not measure) but in stabilizing said application's memory usage. it actually fixed a problem that appeared to be a memory leak, but probably wasn't the fault of the app itself (likely memory fragmentation with the standard malloc)
I did research into the glibc memory allocator. Turns out this is not memory fragmentation, but per-thread caches that are never freed back to the kernel! A free() call does not actually return memory to the kernel except in exceptional circumstances. The more threads and CPU cores you have, the worse this problem becomes.
One easy solution is setting the "magic" environment variable MALLOC_ARENA_MAX=2, which limits the number of caches.
Another solution is having the application call malloc_trim() regularly, which purges the caches. But this requires application source changes.
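For reference, both knobs are glibc-specific and can also be exercised from code. A minimal sketch (assuming glibc's <malloc.h> extensions):

    /* Sketch: capping glibc's per-thread arenas and trimming its caches.
       mallopt(M_ARENA_MAX, ...) is the in-code equivalent of the
       MALLOC_ARENA_MAX environment variable; malloc_trim() asks glibc to
       return unused heap pages to the kernel. Both are glibc extensions. */
    #include <malloc.h>
    #include <stdlib.h>

    int main(void) {
        mallopt(M_ARENA_MAX, 2);         /* same effect as MALLOC_ARENA_MAX=2 */

        for (int i = 0; i < 1000; i++) { /* stand-in for real allocation churn */
            void *p = malloc(1 << 20);
            free(p);
        }

        malloc_trim(0);                  /* purge cached memory back to the kernel */
        return 0;
    }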
FWIW i hit this with icinga2. so now they actually preload jemalloc in the service file to mitigate the issue; this may very well be what you're talking about
True, I also believed it for a second. But it's also easy to blame Ubuntu for errors. IMHO they are doing quite a decent job of assembling their packages. In fact they are also compiled with stack fortifications. On the other hand I'm glad they are not compiled with the possibly buggy -O3. It can be nice for something performance-critical, but I definitely don't want a whole system compiled with -O3.
To me it's obviously a scam, because there's no way such an improvement can be achieved globally with a single post's worth of explanation. 90% faster is a micro-benchmark number.
This is neither a micro-benchmark nor a scam, but it is click-bait by not mentioning jq specifically.
Micro-benchmarks would be testing e.g. a single library function or syscall rather than the whole application. This is the whole application, just not one you might care that much for the performance of.
Other applications will of course see different results, but stuff like enabling LTO, tuning THP and picking a suitable allocator are good, universal recommendations.
True that. I mean, it is still interesting that if you have a narrow task, you might achieve a significant speedup from rebuilding the relevant packages. But this is a very niche application.
true, i saw a thread recently on reddit where a guy hand-tuned compilation flags and did PGO profiling for a video encoder app that he uses on a video encode farm.
In his case, even a gain of ~20% was significant. It translated into enough extra capacity to encode a few thousand more video files per year.
I wonder how many prepackaged binary distributions are built with the safest options for the os/hardware and don't achieve the best possible performance.
I bet most of them, tbh.
Many years ago I started building Mozilla and my own linux kernels to my preferences, usually realizing modest performance gains.
The entire purpose of the Gentoo Linux distribution, e.g., is performance gains possible by optimized compilation of everything from source.
the title is clickbait, but it's good to encourage app developers to rebuild, esp when you are cpu bound on a few common utilities, e.g. jq, grep, ffmpeg, ocrmypdf -- common unix utils are built for general use rather than for a specific application
Or, if I understand TFA correctly, don't release debug builds in your release packages.
Reminds me of back in the day, when I was messing around with blender's cmake config files quite a bit. I noticed the fedora package was using the wrong flag -- some sort of debug-only flag intended for developers instead of whatever they thought it was. I mentioned this to the package maintainer, it was confirmed by a package sub-maintainer (or whomever), and the maintainer absolutely refused to change it because the spelling of the two flags was close enough they could just say "go away, contributing blender dev, you have no idea what you're talking about." Wouldn't doubt the fedora package still has the same mistaken flag to this day, and all this occurred something like 15 years ago.
So, yeah, don't release debug builds if you're a distro package maintainer.
Vector instruction sets like AVX-512 will not magically make common software faster. The number of applications that deal with regular operations on large blocks of data is pretty much limited to graphical applications, neural networks and bulk cryptographic operations. Even audio processing doesn't benefit that much from vector operations, because a codec's variable-size packets do not allow for efficient vectorization (the main exception being multi-channel effects processing as used in DAWs).
Thanks for the correction. I hadn't considered bulk memory operations to be part of SIMD, but it makes sense -- they operate on a larger grain than word size, so they can do the same operation with less micro-op overhead.
Engineering is a compromise. The article shows most gains come from specialising the memory allocator. The thing to remember is that some projects are multithreaded, and allocate in one thread, use the data in another, and maybe deallocate in a third. The allocator needs to handle this. So a speedup for one project may be a crash in another.
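For the record, the handoff pattern being described is just this (a minimal pthread sketch, compiled with -pthread; any general-purpose allocator must support it without corruption):

    /* Allocate in one thread, free in another: the cross-thread handoff
       pattern a general-purpose allocator has to handle correctly. */
    #include <pthread.h>
    #include <stdlib.h>

    static void *producer(void *arg) {
        (void)arg;
        return malloc(4096);        /* allocated from this thread's arena/cache */
    }

    int main(void) {
        pthread_t t;
        void *block = NULL;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, &block);    /* hand the pointer to the main thread */
        free(block);                /* freed on a different thread than it was allocated on */
        return 0;
    }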
Also, what about reallocation strategy? Some programs preallocate and never touch malloc again, others constantly release and acquire. How well do they handle fragmentation? What is the uptime (10 seconds or 10 years)? Sometimes the choice of allocators is the difference between long term stability vs short term speed.
I experimented with different allocators while developing a video editor that caches frames, testing with 4K video. At 32 MB per frame and 60 fps, that's almost 2 GB per second per track. You quickly hit allocator limitations, and realise that at least the vanilla glibc allocator offers the best long-term stability. But for short-running benchmarks it's the slowest.
As already pointed out, engineering is a compromise.
Mimalloc is a general purpose allocator like JEMalloc / TCMalloc. Glibc is known to have a pretty bad allocator that modern allocators like MIMalloc & the latest TCMalloc (not the one available by default in Ubuntu) run laps around. While of course speedup may be variable, generally the benchmarks show an across the board speedup (whether that matters for any given application is something entirely different). As for crashes, these are all general purpose multi-thread allocators and behave no differently from glibc (modulo bugs that can exist equally in glibc).
Agree for most short-running apps. I updated my comment to reflect issues with apps that are constantly reallocating and running for longer than 60 seconds. But you are absolutely correct for most short-running apps; 99% of the time it's recommended to replace glibc. However, there is an app or two where glibc stability doesn't trigger a pathological use case, and you have no choice. Hence why it's the default, since there are fewer crashes in pathological cases. And the devs are exhausted dealing with crashing bugs which can be eliminated by using the slower allocator.
Looked at your updated post and it looks like you’re operating under wildly incorrect assumptions.
1. Fragmentation: MIMalloc and the newest TCMalloc definitely handle this better than glibc. This is well established in many many many benchmarks.
2. In terms of process lifetime, MIMalloc (Microsoft Cloud) and TCMalloc (Google Cloud) are designed to be run for massive long-lived services that continually allocate/deallocate over long periods of time. Indeed, they have much better system behavior in that allocating a bunch of objects & then freeing them actually ends up eventually releasing the memory back to the OS (something glibc does not do).
> However, there is an app or two where glibc stability doesn't trigger a pathological use case, and you have no choice.
I’m going to challenge you to please produce an example with MIMalloc or the latest TCMalloc (or heck - even any real data point from some other popular allocators vs vague anecdotes). This just simply is not something these allocators suffer from and would be major bugs the projects would solve.
I tried to use mimalloc and snmalloc and in both cases got crashes I don't get with glibc when interoperating with other libraries (libusb, jack, one that I suspect to be in the Nvidia driver) :(
If you are not properly overriding the allocator consistently for everything within an executable, that’s entirely possible (eg linking against 1 allocator and then linking with a dynamic library that’s using a different one). Without a specific repro it’s hard to distinguish PEBCAK from legit bug. Also it certainly can’t be the Nvidia driver since that’s not running anything in your process.
The explanation here is usually simpler. Any major change like this can lead to crashes.
Say I have three libraries, biz, bar, and bif, and each has bugs. You've used biz. When the system was unstable, you'd debug until it worked (either by finding the real bug, or by an innocuous change).
If you switch libraries, the bugs which go away are the ones which aren't affecting you, since your software works. On the other hand, you have a whole new set of bugs.
This comes up when upgrading libraries too. More bugs are usually fixed than introduced, but there's often a debug cycle.
> Also it certainly can’t be the Nvidia driver since that’s not running anything in your process.
A huge chunk of a modern GPU driver is part of the calling process, loaded like a regular library. Just spot-checking Chrome's GPU process, there are dozens of threads created by a single 80+ MB Nvidia DLL. And this isn't unusual: every GPU driver has massive libraries loaded into the app using the GPU - often including entire copies of LLVM for things like shader compilers.
Challenge yourself to produce some numbers first. If there are many many many benchmarks it shouldn't be too difficult to link one. Just saying something is "well established" doesn't really help without some other context.
but we all can agree that glibc is trash right? :)
I’d be careful extrapolating from just one benchmark but generally if I had to choose I’d pick the new tcmalloc if I could. It seems to be a higher quality codebase.
The problem with the current tcmalloc, which I agree is very good, is the difficulty of integrating it with random builds. It works well for large systems and especially those that are already bazelized, but I couldn't whip up a quick way to link it to `jq` in this gist.
Thanks for that. I have an innate skepticism of benchmarks from a company who wants to show their solution is best, and it seems their latest results are from 2021, but I couldn't find a better comparison myself either.
I do note that rpmalloc (old), Hermes (not public? doesn't compare with mimalloc) and also snmalloc (also Microsoft) have benchmarks of their own showing themselves to be best in some circumstances.
More than that, based on the basic architecture of JEmalloc, mimalloc and tcmalloc, you should always expect their fragmentation behavior to be better for long-running software than the glibc malloc. (At the expense of consuming substantially more memory for very small programs with only a few allocations of a given size). The glibc malloc has nearly pessimal fragmentation behavior, you are very confused here.
> This just simply is not something these allocators suffer from and would be major bugs the projects would solve.
Not OP, but the following logic shows why this claim is bogus. In short: If two non-garbage-collecting memory allocators do anything differently -- other than behave as perfect "mirror images" of each other, so that whenever one allocates byte i, the other allocates byte totalMem-i -- then there exists a program that crashes on one but not the other, and vice versa.
In detail:
If 2 allocators do not allocate exactly the same blocks of memory to the same underlying sequence of malloc() or free() calls, then there exists a program which, if built twice, once using each allocator, and then each executable is run with the same input, will after some time produce different patterns of memory fragmentation.
The first time this difference appears -- let's say, after the first n calls to either malloc() or free() -- the two executables will have the same total number of bytes allocated, but the specific ranges of allocated bytes will be different. The nth such call must be a malloc() call (since if it were a free() call, and allocated ranges were identical after the first n-1 such calls, they would still be identical after the first n, contradicting our assumption that they are different). Then for each executable, this nth malloc() call either allocates a block at or some distance past the end, or it subdivides some existing free block. We can remove the latter possibility (and simplify the proof) by assuming that there is no more memory available past the end of the highest byte thus far allocated (this is allowed, since a computer with that amount of memory could exist).
Now have both programs call free() on every allocated block except the one allocated in operation n. Let the resulting free range at the start of memory (before the sole remaining allocated block) have total length s1 in executable 1 and s2 in executable 2, and let the resulting free range at the end of memory (after that sole remaining allocated block) have length e1 in executable 1 and e2 in executable 2. By assumption, s1≠s2 and e1≠e2. Now have both executables call malloc() twice, namely, on s1 and e1 in descending order. Then, unless s1=e2, executable 1 can satisfy both malloc()s, but executable 2 can satisfy only the first. Similarly, calling malloc() on s2 and e2 in decreasing order will succeed in executable 2 but not executable 1, again unless s1=e2 holds.
What if s1=e2 does hold, though? This occurs when, say, one executable allocates the block 100 bytes from the start of memory, while the other allocates it 100 bytes from the end. In this case, all we need is to keep some second, symmetry-breaking block around at the end in addition to the block allocated by operation n -- that is, a block for which it does not hold that one allocator allocates the mirror-image memory range of the other. (If no such block exists, then the two allocators are perfect mirror images of each other.)
I really have no idea what you’re getting at here. This isn’t embedded where there’s a fixed pool to allocate out of. If there’s insufficient space it’ll get more virtual memory from the OS.
Also, nothing you’ve said actually says that the other allocator will be the worse one. Indeed, glibc is known to hold onto memory longer and have more fragmentation than allocators like mimalloc and tcmalloc so I’m still at a loss to understand how even if what you wrote is correct (which I don’t believe it is) that it follows that glibc is the one that won’t crash. If you’re confident in your proof by construction, please post a repro that we can all take a look at.
> If there’s insufficient space it’ll get more virtual memory from the OS.
Swap space is finite too.
> Also, nothing you’ve said actually says that the other allocator will be the worse one.
I'm not claiming that either is worse. I'm showing mathematically that for any two allocators that behave differently at all (with the one tiny exception of a pair of allocators that are perfect mirror images of each other), it's possible to craft a program that succeeds on one but fails on the other.
I didn't say so explicitly as I thought it was obvious, but the upshot is: It's never completely safe to just change the allocator. Even if 99% of the time one works better than the other, there's provably a corner case where it will fail but the other does not.
I should have been more explicit about the assumptions I make about an allocator:
1. If malloc() is called when there exists a contiguous block of free memory with size >= the argument to malloc(), the call will succeed. I think you'll agree that this is reasonable.
2. Bookkeeping (needed at least for tracking the free list, plus any indexes on top) uses the same amount of malloc()able memory in each allocator. I.e., if malloc(x) at some point in time reduces the number of bytes that are available to future malloc() calls by y >= x bytes under allocator 1, it must reduce the number of bytes available to future malloc() calls by y under allocator 2 if called at the same point in time as well. This may not hold exactly in practice, but it's a very good approximation -- it's possible to store the free list "for free" by using the first few bytes as next and prev pointers in a doubly linked list.
To head another possible objection off at the pass: If the OS allocates lazily (i.e., it doesn't commit backing store at malloc() time, instead waiting till there is an actual access to the page, like Linux does), this doesn't change anything: Address space (even 64-bit address space) is still finite, and that is still being allocated eagerly. In practice, you could craft the differentially crashing program to crash much faster if you call memset() immediately on every freshly malloc()ed block to render this lazy commit ineffective -- then you would only need to exhaust the physical RAM + swap, rather than the complete 64-bit virtual address space.
Swapping out allocators won't cause some programs to crash and others to not crash unless your program is already using up all available RAM or your program has a bug and the allocation pattern happens to be more likely to trigger invalid memory access in a way that crashes.
This allocation-pattern idea is unlikely to show up in any real application except at the absolute limit where you're exhausting RAM and the OOM killer gets involved. Even then I think you're not going to see the allocator be much of a differentiating factor.
Funny, cause the situations where I've had to replace glibc is always that it is a long running server that allocates often. Glibc: Ballooning memory, eventually crash. jemalloc: Stable as a rock.
> I experimented with different allocators while developing a video editor that caches frames, testing with 4K video. At 32 MB per frame and 60 fps, that's almost 2 GB per second per track. You quickly hit allocator limitations, and realise that at least the vanilla glibc allocator offers the best long-term stability. But for short-running benchmarks it's the slowest.
I also work with large (8K) video frames [1]. If you're talking about the frames themselves, 60 allocations per second is nothing. In the case of glibc, it's slow for just one reason: each allocation exceeds DEFAULT_MMAP_THRESHOLD_MAX (= 32 MiB on 64-bit platforms), so (as documented in the mallopt manpage), you can not convince glibc to cache it. It directly requests the memory from the kernel with mmap and returns it with munmap each time. Those system calls are a little slow, and faulting in each page of memory on first touch is in my case slow enough that it's impossible to meet my performance goals.
The solution is really simple: use your own freelist (on top of the general-purpose allocator or mmap, whatever) for just the video frames. It's a really steady number of allocations that are exactly the same size, so this works fine.
[1] in UYVY format, this is slightly under 64 MiB; in I420 format, this is slightly under 48 MiB.
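A minimal sketch of that kind of per-frame freelist (hypothetical names; a real implementation would add a mutex for multi-threaded use and pre-fault the pages):

    /* Fixed-size frame pool sitting in front of malloc/free, so large video
       frames get recycled instead of going through mmap/munmap every time. */
    #include <stdlib.h>

    #define FRAME_SIZE (64UL * 1024 * 1024)  /* e.g. ~64 MiB for an 8K UYVY frame */
    #define POOL_MAX 8                       /* frames kept around for reuse */

    static void *pool[POOL_MAX];
    static int pool_count = 0;

    void *frame_alloc(void) {
        if (pool_count > 0)
            return pool[--pool_count];       /* reuse a previously freed frame */
        return malloc(FRAME_SIZE);           /* fall back to the general allocator */
    }

    void frame_free(void *frame) {
        if (pool_count < POOL_MAX)
            pool[pool_count++] = frame;      /* keep it around for the next frame */
        else
            free(frame);                     /* pool is full: really release it */
    }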
There's one other element I didn't mention in my previous comment, which is a thread handoff. It may be significant because it trashes any thread-specific arena and/or because it introduces a little bit of variability over a single malloc at a time.
For whatever reason the absolute rate on my test machine is much higher than in my actual program (my actual program does other things with a more complex threading setup, has multiple video streams, etc.) but you can see the same effect of hitting the mmap, munmap, and page fault paths that really need not ever be exercised after program start.
In my actual (Rust-based) program, adding like 20 lines of code for the pooling was a totally satisfactory solution and took me less time than switching general-purpose allocator, so I didn't try others. (Also, my program supports aarch64 and iirc the vendored jemalloc in the tikv-jemallocator crate doesn't compile cleanly there.)
Sorry, I'm struggling to make sense of this comment. I don't know C or C compilers very well at all, but I read the full gist and felt I learned a bunch of stuff and got a lot of value from it.
But then when I read this top comment, it makes me concerned I've completely misunderstood the article. From the tone of this comment, I assume that I shouldn't ever do what's talked about in this gist and it's a terrible suggestion that overlooks all these complexities that you understand and have referenced with rhetorical-looking questions.
Any chance you could help me understand if the original gist is good, makes any legitimate points, or has any value at all? Because I thought it did until I saw this was the top comment, and it made me realise I'm not smart enough to be able to tell. You sound like you're smart enough to tell, and you're telling me only bad things.
I'll have a go at explaining: The process described in the article isn't a simple recipe that you can apply to any program to achieve similar results.
`jq` is a command-line program that fires up to do one job, and then dies. For such a program, the only property we really want to optimise is execution speed. We don't care about memory leaks, or how much memory the process uses (within reason). `jq` could probably avoid freeing memory completely, and that would be fine. So using a super-stupid allocator is a big win for `jq`. You could probably write your own and make it run even faster.
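To illustrate (this is not what the gist does), the degenerate version of that idea is a bump allocator whose free() is a no-op; the OS reclaims everything when the process exits. A real drop-in replacement would also need calloc/realloc and thread safety:

    /* Toy arena / "never free" allocator for a run-once-and-exit tool. */
    #include <stddef.h>

    #define ARENA_SIZE (64UL * 1024 * 1024)

    static unsigned char arena[ARENA_SIZE];
    static size_t offset = 0;

    void *bump_alloc(size_t size) {
        size = (size + 15) & ~(size_t)15;   /* keep 16-byte alignment */
        if (size > ARENA_SIZE - offset)
            return NULL;                    /* out of arena space */
        void *p = arena + offset;
        offset += size;
        return p;
    }

    void bump_free(void *p) {
        (void)p;                            /* deliberately does nothing */
    }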
But for a program with different runtime characteristics, the results might be radically different. A long-lived server program might need to avoid memory bloat more than it needs to run fast. Or it might need to ensure stability more than speed or size. Or maybe speed does matter, but it's throughput speed rather than latency. Each of those cases need to be measured differently, and may respond better to different optimisation strategies.
The comment that confused you is just trying to speak a word of caution about applying the article's recipe in a simplistic way. In the real world, optimisation can be quite an involved job.
> The comment that confused you is just trying to speak a word of caution about applying the article's recipe in a simplistic way. In the real world, optimisation can be quite an involved job.
I think that's what confused and irritated me. There's a lot of value and learning in the gist - I've used jq regularly in my previous jobs, this is the real world, and it's valuable to many. But the top comment (at the time I responded) is largely trashing the submission rhetorically, based purely on the title.
I get that the gist won't make _everything_ faster: but I struggle to believe that any HN reader would genuinely believe that's either true, or a point that the author is trying to make. The literal first sentence of the submission clarifies the discussion is purely about JQ.
Anyone can read a submission, ignore any legitimate value in it, pick some cases the submission wasn't trying to address, and then use those cases to rhetorically talk it down. I'm struggling to understand why/how that's bubbling to the top in a place of intellectual curiosity like HN.
Edit: I should practice what I preach. Conversation and feedback which is purely cautionary or negative isn't the world that anyone really wants! Thanks for the response, I really appreciated it :) It was helpful in confirming my understanding that this submission does genuinely improve jq on Ubuntu. Caution is often beneficial and necessary, and I think the original comment I responded to could have made a better world with a single sentence confirming that this gist is actually valuable in the context it defines.
That’s why it’s a bad idea to use one allocator for everything in existence. It’s terrible that everyone pays the cost of thread safety even for single-threaded applications - or even multithreaded applications with disciplined resource management.
While I do agree with the general sentiment, I think the default should be to use the safer ones that can handle multi threaded usage. If someone wants to use a bad allocator like glibc that doesn't handle concurrency well, then they should certainly be free to switch.
So now you’re using std vector and it’s taking a lock for no reason, even though push_back on the same vector across threads isn’t thread safe to begin with.
Haphazard multithreading is not a sane default.
I understand a million decisions have been made so that we can’t go flip that switch back off, but we’ve got to learn these lessons for the future.
Whatever. Allocation is slow enough that I don't care about a noncontended lock. Make things work by default, and if you want to gain performance by not allowing multithreading then that should be possible and easy. But safety first, as in most cases it really doesn't matter. When it comes to mainstream general-purpose allocators it isn't really a tradeoff anyhow, as all of them are nominally threadsafe.
Even glibc claims to be multithreading safe even if it tends to not return or reuse all freed memory.
Write in a language that makes sense for the project. Then people tell you that you should have used this other language, for reasons.
Use a compression algo that makes sense for your data. Then people will tell you why you are stupid and should have used this other algo.
My most recent memory of this was needing to compress specific long json strings to fit in Dynamo. I exhaustively tested every popular algo, and Brotli came out far ahead. But that didn't stop every passerby from telling me that zlib is better.
> Gentoo linux is essentially made specifically for people like this, to be able to optimize one’s own linux rig for one’s specific usecase.
That's true but worth noting that "optimize" here doesn't necessarily refer to performance.
I've been using Gentoo for 20 years and performance was never the reason. Gentoo is great if you know how you want things to work. Gentoo helps you get there.
Reducing the dependency tree gets a bit more complicated once you consider that now you have to satisfy not only runtime dependencies for all packages but also build-time dependencies. There may be ways of cleaning that up after a build, but next time you want to emerge a new package you'll just end up having to re-build the build-time dependencies, so in practice you'll just end up leaving them there. There is an ability to emerge packages to a separate part of the filesystem tree (ROOT="/my/chroot" emerge bla), so that you have one build-time system act as a kind of incubator for a runtime system that gets to be minimal. But you'll end up encountering problems that most other Gentoo users wouldn't encounter, having to do with the separation between build-time dependencies and runtime dependencies not being correctly made in the recipes. Personally, I had been relying on this feature for roughly the last 10 years, but there has been steady deterioration there over the years and I eventually gave up late last year.
This is a good point. I've been using Gentoo since early 2004 (the dreaded Pentium 4 era, lol). Lately, I run into this with dev-lang/tcl only being needed to build dev-db/sqlite. I actually think it's pretty weird that software intended to be as widely used as sqlite, with as large a base of supporting devs, doesn't just make the extra effort to use a plain Makefile.
A long time ago, when I was using it, I preferred Gentoo because of its ergonomics and better exposure to the supply chain.
Slackware was very manual, and some bits drowned in its low-level, long command chains. Gentoo felt easy but highlighted dependencies, with the hard cost of compilation time attached.
Being a newb back then, I enjoyed the user-friendliness combined with access to the machinery beneath.
The satisfaction of a 1 s boot-time speedup, the result of 48 h+ of compilation, was unparalleled, too ;)
Never seen the HN version of the 'install gentoo' meme before, more sophisticated definitely.
> The goal of Gentoo is to have an operating system that builds all programs from source, instead of having pre-built binary packages. While this does allow for advanced speed and customizability, it means that even the most basic components such as the kernel must be compiled from source. It is known through out the Linux community as being a very complex operating system because of its daunting install process. The default Gentoo install boots straight to a command prompt, from which the user must manually partition the disk, download a package known as a "Stage 3 tarball", extract it, and build the system up by manually installing packages. New or inexperienced users will often not know what to do when they boot in to the installer to find there is no graphical display. Members of /g/ will often exaggerate the values of Gentoo, trying to trick new users in to attempting to install it.
Where does that blurb come from, chatgpt? I don't think it's true anymore, last time I checked I think Gentoo had a "normal" liveCD installation for the base system, which you could then recompile on your own if wanted.
GRP (Gentoo packages) existed at least 20 years ago, from my memory, as that's the last time I really used it in anger. I remember packages being available and not having to rice everything, for sure.
I had had Gentoo continuously in use since 2003, and only very recently moved off of it (late 2024) when I tried Void Linux. On Void, buildability from source by end users is not a declared goal nor architectural feature, but you have a pretty decent chance of being able to make it work. You can expect one or two hiccups, but if you have decent all-round Linux experience, chances are you'll be able to jump into the build recipes, fix them, make everything work for what you need it to do, and contribute the fixes back upstream. This is what you get from a relentless focus on minimalism and avoiding overengineering of any kind. It's what I had been missing in Gentoo all those years. With Gentoo, I always ended up having to fiddle with use flags and package masks in ways that wouldn't be useful to other users. The build system is so complex that it had been just too difficult for me, over all these years, to properly learn it and learn to fix problems at the root cause level and contribute them upstream. Void should also be an ideal basis for when you don't want to build the entire system from source, but you just want to mix & match distro-provided binaries with packages you've built from source (possibly on the basis of a modified build recipe to better match your needs or your hardware).
I used Gentoo for a while, but the temptation to endlessly fiddle with everything always led me to eventually break the system. (It's not Gentoo's fault, it's mine.)
Afterwards I moved to ArchLinux, and that has been mostly fine for me.
If you are using a fairly standard processor, then Gentoo shouldn't give you that much of an advantage?
Gentoo lets you do all of the tweaks mentioned here within the system package manager, so you still get security updates for your tweaked build. You can also install Gentoo on top of another system via Gentoo Prefix for use as a userland packages manager:
These are the Arch packages built for x86-64-v2, x86-64-v3 and x86-64-v4, which are basically names for different sets of x86-64 extensions. Selecting the highest level supported by your processor should get you most of the way to -march=native, without the hassle of compiling it yourself.
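As a rough illustration of what those levels mean, GCC and Clang expose runtime feature checks you can map onto them (the grouping below is a simplification of the official x86-64-v2/v3/v4 feature lists):

    /* Rough check of which x86-64 microarchitecture level this CPU can run.
       The feature groups are a simplification of the v2/v3/v4 definitions. */
    #include <stdio.h>

    int main(void) {
        __builtin_cpu_init();
        int v2 = __builtin_cpu_supports("sse4.2") && __builtin_cpu_supports("popcnt");
        int v3 = v2 && __builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")
                    && __builtin_cpu_supports("bmi2");
        int v4 = v3 && __builtin_cpu_supports("avx512f");

        printf("x86-64-v2: %s\n", v2 ? "yes" : "no");
        printf("x86-64-v3: %s\n", v3 ? "yes" : "no");
        printf("x86-64-v4: %s\n", v4 ? "yes" : "no");
        return 0;
    }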
I've broken Arch but never broken Gentoo. I think this is more due to the fact that I ran Arch first and then Gentoo, rather than any real difference between them.
Gentoo is more stable than Arch by default, though. It's not actually a bleeding edge distro, but you can choose to run it that way if you wish. Gentoo is about choice.
> I think this is more due to the fact that I ran Arch first and then Gentoo, rather than any real difference between them.
I can believe that.
> Gentoo is more stable than Arch by default, though. It's not actually a bleeding edge distro, but you can choose to run it that way if you wish. Gentoo is about choice.
I actually had way more trouble with stuff breaking with Ubuntu. That's because every six months, when I did the distro upgrade, lots of stuff broke at once and it was hard to do root cause analysis.
With a rolling distribution, it's usually only one thing breaking at a time.
The Zircon kernel does not support signals, so basic C is not going to work well.
"It is heavily inspired by Unix kernels, but differs greatly. For example, it does not support Unix-like signals, but incorporates event-driven programming and the observer pattern."
As long as there's Bash and Python support, Gentoo Prefix could work as well. Those are basically the only hard requirements for Gentoo, since Portage is written in Python and ebuilds are Bash scripts. The bigger issue would be that the Fuchsia kernel likely isn't POSIX-complete.
Kidding... honestly that was a pretty fun distribution to play around with ~20 years ago. The documentation was really good and it was a great way to learn how a lot of the pieces of a Linux distribution fit together.
I was never convinced that the performance difference was really noticeable, though.
Gentoo was the primary source of heating for my living quarters back in the early 2000s. My tower was highly constrained on memory, and I was on a relentless quest to pare out any modules or dependencies I wasn't actually using. Performance gains were primarily from being able to stay out of slow HDD swap space memory. I doubt there were any gains once amortizing the compilation times, but I ran my compile batches at night, and they kept me nice and warm.
Note that if you do this, then you will opt out of any security updates, not just for jq but also for its regular-expression parsing dependency, Oniguruma. For example, there was a security update for Oniguruma previously; if this sort of thing happens again, you'd be vulnerable, and jq is often used to parse untrusted JSON.
Indeed, there are many methods to have a custom build and still get security updates, including at least one method that is native to Ubuntu and doesn’t need any external tooling. However my warning refers to the method presented in the article, where this isn’t the case.
But isn't there still the kernel of an idea here for a package management system that intelligently decides to build based on platform? Seems like a lot of performance to leave on the table.
Rebuilding from scratch also takes longer than installing a prebuilt package. So while it might be worth it for a heavily used application, in general I doubt it.
Also I think in earlier days the argument to build was so you can optimize the application for the specific capabilities of your system like the supported SIMD instruction set or similar. I think nowadays that is much less of a factor. Instead it would probably be better to do things like that on a package or distribution level (i.e. have one binary distribution package prebuilt by the distribution for different CPU capabilities).
This is generally true but specifically false. The builds described in the gist are still linking Oniguruma dynamically. It is in another package, libonig5, that would be updated normally.
The gist uses the orig tarball only, so it skips the distro patch that selects the distro-packaged libonig over the "vendored" one. At least that's how it appears to me. I only skimmed the situation.
Or do you see something deeper that ensures that the distro libonig is actually the one that gets used?
I'm curious how applicable these are, in general? Feels like pointing out that using interior doors in your house misses out on the security afforded from a vault door. Not wrong, but there is also a reason every door in a bank is not a vault door.
That is, I don't want to devalue the CVE system; but it is also undeniable that there are major differences in impact between findings?
Sure, but jq is very much a "front door" in your analogy. You'd have to look at each individual CVE to assess the risk for your specific case, but for jq, claimed security vulnerabilities are worth paying attention to.
This is certainly true. Also, by replacing the allocator and changing compiler flags, you're possibly immunizing yourself from attacks that rely on some specific memory layout.
By hardwiring the allocator you may end up with binaries that load two different allocators. It is great fun to debug a program that is using jemalloc's free to release memory allocated by glibc. Unless you know what you are doing, it is better to leave it as is.
UBSAN is usually a debug build only thing. You can run it in production for some added safety, but it comes at a performance cost and theoretically, if you test all execution paths on a debug build and fix all complaints, there should be no benefit to running it in production.
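For context, a tiny example of the kind of thing UBSan catches (assuming GCC or Clang; built with -fsanitize=undefined, the overflow below is reported at runtime instead of silently wrapping):

    /* ubsan_demo.c -- build with: cc -fsanitize=undefined ubsan_demo.c */
    #include <limits.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        (void)argv;
        int x = INT_MAX;
        x += argc;          /* argc >= 1, so this signed overflow is undefined behavior */
        printf("%d\n", x);  /* UBSan prints a runtime error report and, by default, continues */
        return 0;
    }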
I think it's time for the C/C++ communities to consider a mindset shift and pivot to having almost all protectors, canaries, sanitizers, and assertions (e.g. via _GLIBCXX_ASSERTIONS) on by default and recommended for use in release builds in production. The opposite (i.e., the current state of affairs) should be discouraged and begrudgingly accepted in a select few cases.
https://www.youtube.com/watch?v=gG4BJ23BFBE is a presentation that best represents my view on the kind of mindset that's long overdue to become the new norm in our industry.
I do not think things like the time command need to be compiled with such things. It is pointless, but your suggestion here is to do it anyway. Why bother?
Assertions in release builds are a bad idea since they can be fairly expensive. It is a better idea to have a different variety of assertion like the verify statements that OpenZFS uses, which are assertions that run even in release builds. They are used in situations where it is extremely important for an assertion to be done at runtime, without the performance overhead of the less important assertions that are in performance critical paths.
Why would I want potentially undefined behaviour in 'time'? I expect it to crash anytime it's about to enter UB. Sure, you may want to minimize such statements between the start/stop of the timer, but I expect any processing of stdout/stderr of the child process to be UB-proofed as much as possible.
I think it's a philosophical difference of opinions and it's one of the things that drive Rust, Go, C# etc. ahead - not merely language ergonomics (I hope Zig ends up as the language that replaces C). The society at large is mostly happy to take a 1-3% perf hit to get rid of buffer overflows and other UB-inducing errors.
But I agree with you on not having "expensive" asserts in releases.
ASLR is not security through obscurity, though. It forces the attacker to get a pointer leak before doing almost anything (even arbitrary-read and arbitrary-write primitives are useless without a leak under ASLR). As someone with a bit of experience in exploit dev, it makes a world of difference and is one of the most influential hardenings, next to maybe stack cookies and W^X.
I'm genuinely curious what was so undesirable about this sibling comment that it was removed:
"ASLR obscures the memory layout. That is security by obscurity by definition. People thought this was okay if the entropy was high enough, but then the ASLR⊕Cache attack was published and now its usefulness is questionable."
Usually when a comment is removed, it's pretty obvious why, but in this case I'm really not seeing it at all. I read up (briefly) on the mentioned attack and can confirm that the claims made in the above comment are at the very least plausible sounding. I checked other comments from that user and don't see any other recent ones that were removed, so it doesn't seem to be a user-specific thing.
I realize this is completely off-topic, but I'd really like to understand why it was removed. Perhaps it was removed by mistake?
Some people use the "flag" button as a "disagree" button or even a "fuck this guy" button. Unfortunately, constructive but unpopular comments get flagged to death on HN all the time.
I had thought that flagging was basically a request for a mod to have a look at something. But based on this case I now suspect that it's possible for a comment to be removed without a mod ever looking at it if enough people flag it.
My point was more that, at least in this case, it looks like a post was hidden without any moderator intervention.
If this is indeed what happened, it seems like a bad thing that it's even possible. Since many, perhaps most people probably don't have showdead enabled, it means that the 'flag' option is effectively a mega-downvote.
I believe most people see security through obscurity as an attempt to hide an insecurity.
ASLR/KASLR intends to make attackers' lives harder by having non-consistent offsets for known data structures. It's not obscuring a security flaw; instead it reduces an attack's 'single run' effectiveness.
The ASLR attack that I believe is being referenced is specific to abuse within the browser, running within a single process. This single attack vector does not mean that KASLR globally is not effective.
Your quote has some choice words, but it's contextually poor.
That attack does not require a web browser. The web browser being able to do it showed it was higher severity than you would have thought if the proof of concept had only been in C, since web browsers run untrusted code all of the time.
The 'attack' there does require you to be able to run code and test within a single process with a single randomized address space, which is the exact vector that the web browser provides.
Most times in C, each fork() (rather than each thread) gets a different address space, so it's actually less severe than you think.
The kernel address space is the same regardless of how many fork() calls have been done. I would assume the exploitation path for a worst case scenario would be involve chaining exploits to do: AnC on userspace, JavaScript engine injection to native code, sandbox escape, AnC on kernel space, kernel native code injection. That would give complete control over a user’s machine just by having the user visit a web page.
I am not sure why anyone would attempt what you described, for the exact reason you stated. It certainly is not what I had in mind.
It's been a few days and a thousand kilometers since I read the paper; I thought it referenced userspace. How is it able to infer kernel addresses that are not mapped in that process?
I assume people downvoted it because “ASLR obscures the memory layout. That is security by obscurity by definition” is just wrong (correct description here: https://news.ycombinator.com/item?id=43408039). It does say [flagged] too, though, so maybe that’s not the whole story…?
No, that other definition is the incorrect one. Security by obscurity does not require that the attacker is ignorant of the fact you're using it. Say I have an IPv6 network with no firewall, simply relying on the difficulty of scanning the address space. I think that people would agree that I'm using security by obscurity, even if the attacker somehow found out I was doing this. The correct definition is simply "using obscurity as a security defense mechanism", nothing more.
No, I would not agree that you would be using security by obscurity in that example. Not all security that happens to be weak or fragile and involves secret information somewhere is security by obscurity – it’s specifically the security measure that has to be secret. Of course, there’s not a hard line dividing secret information between categories like “key material” and “security measure”, but I would consider ASLR closer to the former side than the latter and it’s certainly not security by obscurity “by definition” (aside: the rampant misuse of that phrase is my pet peeve).
> The correct definition is simply "using obscurity as a security defense mechanism", nothing more.
This is just restating the term in more words without defining the core concept in context (“obscurity”).
I'm inclined to agree and would like to point out that if you take a hardline stance that any reliance on the attacker not knowing something makes it security by obscurity then things like keys become security by obscurity. That's obviously not a useful end result so that can't be the correct definition.
It's useful to ask what the point being conveyed by the phrase is. Typically (at least as I've encountered it) it's that you are relying on secrecy of your internal processes. The implication is usually that your processes are not actually secure - that as soon as an attacker learns how you do things the house of cards will immediately collapse.
What is missing from these two representations is the ability for something to become trivially bypassable once you know the trick to it. AnC is roughly that for ASLR.
I'd argue that AnC is a side channel attack. If I can obtain key material via a side channel that doesn't (at least in the general case) suddenly change the category of the corresponding algorithm.
Also IIUC to perform AnC you need to already have arbitrary code execution. That's a pretty big caveat for an attacker.
You are not wrong, but how big of a caveat it is varies. On a client system, it is an incredibly low bar given client side scripting in web browsers (and end users’ tendency to execute random binaries they find on the internet). On a server system, it is incredibly unlikely.
I think the middle ground is to call the effectiveness of ASLR questionable. It is no longer the gold standard of mitigations that it was 10 years ago.
ASLR is not purely security through obscurity because it is based on a solid security principle: increasing the difficulty of an attack by introducing randomness. It doesn't solely rely on the secrecy of the implementation but rather the unpredictability of memory addresses.
Think of it this way - if I guess the ASLR address once, a restart of the process renders that knowledge irrelevant implicitly. If I get your IPv6 address once, you’re going to have to redo your network topology to rotate your secret IP. That’s the distinction from ASLR.
I don't like that example because the damage caused by, and the difficulty of recovering from, a secret leaking is not what determines the classification. There exist keys that, if leaked, would be very time-consuming to recover from. That doesn't make them security by obscurity.
I think the key feature of the IPv6 address example is that you need to expose the address in order to communicate. The entire security model relies on the attacker not having observed legitimate communications. As soon as an attacker witnesses your system operating as intended the entire thing falls apart.
Another way to phrase it is that the security depends on the secrecy of the implementation, as opposed to the secrecy of one or more inputs.
You don’t necessarily need to expose the IPv6 address to untrusted parties though in which case it is indeed quite similar to ASLR in that data leakage of some kind is necessary. I think the main distinguishing factor is that ASLR by design treats the base address as a secret and guards it as such whereas that’s not a mode the IPv6 address can have because by its nature it’s assumed to be something public.
Huh. The IPv6 example is much more confusing than I initially thought. At this point I am entirely unclear as to whether it is actually an example of security through obscurity, regardless of whatever else it might be (a very bad idea to rely on, for one). Rather ironic, given that the poster whose claims I was disputing provided it as an example of something that would be universally recognized as such.
I think it’s security through obscurity because in ASLR the randomized base address is a protected secret key material whereas in the ipv6 case it’s unprotected key material (eg every hop between two communicating parties sees the secret). It’s close though which is why IPv6 mapping efforts are much more heuristics based than ipv4 which you can just brute force (along with port #) quickly these days.
I'm finding this semantic rabbit hole surprisingly amusing.
The problem with that line of reasoning is that it implies that data handling practices can determine whether or not a given scheme is security through obscurity. But that doesn't fit the prototypical example where someone uses a super secret and utterly broken home rolled "encryption" algorithm. Nor does it fit the example of someone being careless with the key material for a well established algorithm.
The key defining characteristic of that example is that the security hinges on the secrecy of the blueprints themselves.
I think a case can also be made for a slightly more literal interpretation of the term where security depends on part of the design being different from the mainstream. For example running a niche OS making your systems less statistically likely to be targeted in the first place. In that case the secrecy of the blueprints no longer matters - it's the societal scale analogue of the former example.
I think the IPv6 example hinges on the semantic question of whether a network address is considered part of the blueprint or part of the input. In the ASLR analogue, the corresponding question is whether a function pointer is part of the blueprint or part of the input.
> The problem with that line of reasoning is that it implies that data handling practices can determine whether or not a given scheme is security through obscurity
Necessary but not sufficient condition. For example, if I’m transmitting secrets across the wire in plain text that’s clearly security through obscurity even if you’re relying on an otherwise secure algorithm. Security is a holistic practice and you can’t ignore secrets management separate from the algorithm blueprint (which itself is also a necessary but not sufficient condition).
Consider that in the ASLR analogy dealing in function pointers is dealing in plaintext.
I think the semantics are being confused due to an issue of recursively larger boundaries.
Consider the system as designed versus the full system as used in a particular instance, including all participants. The latter can also be "the system as designed" if you zoom out by a level and examine the usage of the original system somewhere in the wild.
In the latter case, poor secrets management being codified in the design could in some cases be security through obscurity. For example, transmitting in plaintext somewhere the attacker can observe. At that point it's part of the blueprint and the definition I referred to holds. But that blueprint is for the larger system, not the smaller one, and has its own threat model. In the example, it's important that the attacker is expected to be capable of observing the transmission channel.
In the former case, secrets management (ie managing user input) is beyond the scope of the system design.
If you're building the small system and you intend to keep the encryption algorithm secret, we can safely say that in all possible cases you will be engaging in security through obscurity. The threat model is that the attacker has gained access to the ciphertext; obscuring the algorithm only inflicts additional cost on them the first time they attack a message secured by this particular system.
It's not obvious to me that the same can be said of the IPv6 address example. Flippantly, we can say that the physical security of the network is beyond the scope of our address randomization scheme. Less flippantly, we can observe that there are many realistic threat models where the attacker is not expected to be able to snoop any of the network hops. Then as long as addresses aren't permanent it's not a one time up front cost to learn a fixed procedure.
Function pointer addresses are not meant to be shared - they hold 0 semantic meaning or utility outside a process boundary (modulo kernel). IPv6 addresses are meant to be shared and have semantic meaning and utility at a very porous layer. Pretending like there’s no distinction between those two cases is why it seems like ASLR is security through obscurity when in fact it isn’t. Of course, if your program is trivially leaking addresses outside your program boundary, then ASLR degrades to a form of security through obscurity.
I'm not pretending that there's no distinction. I'm explicitly questioning the extent to which it exists as well as the relevance of drawing such a distinction in the stated context.
> Function pointer addresses are not meant to be shared
Actually I'm pretty sure that's their entire purpose.
> they hold 0 semantic meaning or utility outside a process boundary (modulo kernel).
Sure, but ASLR is meant to defend against an attacker acting within the process boundary so I don't see the relevance.
How the system built by the programmer functions in the face of an adversary is what's relevant (at least it seems to me). Why should the intent of the manufacturer necessarily have a bearing on how I use the tool? I cannot accept that as a determining factor of whether something qualifies as security by obscurity.
If the expectation is that an attacker is unable to snoop any of the relevant network hops then why does it matter that the address is embedded in plaintext in the packets? I don't think it's enough to say "it was meant to be public". The traffic on (for example) my wired LAN is certainly not public. If I'm not designing a system to defend against adversaries on my LAN then why should plaintext on my LAN be relevant to the analysis of the thing I produced?
Conversely, if I'm designing a system to defend against an adversary that has physical access to the memory bus on my motherboard then it matters not at all whether the manufacturer of the board intended for someone to attach probes to the traces.
I think that's why the threat model matters. I consider my SSH keys secure as long as they don't leave the local machine in plaintext form. However if the scenario changes to become "the adversary has arbitrary read access to your RAM" then that's obviously not going to work anymore.
> The correct definition is simply "using obscurity as a security defense mechanism", nothing more.
Also stated as "security happens in layers", and often obscurity is a very good layer for keeping most of the script kiddies away and keeping the logs clean.
My personal favorite example is using a non-default SSH port. Even if you keep it under 1024, so it's still on a root-controlled port, you'll cut down the attacks by an order of magnitude or two. It's not going to keep the NSA or MSS out, but it's still effective in pushing away the common script kiddies. You could even get creative and play with port knocking - that keeps under-1024 ports logs clean.
Except I do know what security by obscurity is and you are out of date on the subject. When you have attacks that make ASLR useless, then it is security by obscurity. Your thinking would have been correct 10 years ago. It is no longer correct today. The middle ground is to say that the benefits of ASLR are questionable, like I said in the comment you downvoted.
ASLR obscures the memory layout. That is security by obscurity by definition. People thought this was okay if the entropy was high enough, but then the ASLR⊕Cache attack was published and now its usefulness is questionable.
ASLR is by definition security through obscurity. That doesn't make it useless, as there's nothing wrong with using obscurity as one layer of defenses. But that doesn't change what it fundamentally is: obscuring information so that an attacker has to work harder.
Is having a secret password security by obscurity? What about a private key?
Security by obscurity is about the bad practice of thinking that obscuring your mechanisms and implementations of security increases your security. It's about people that think that by using their nephew's own super secret unpublished encryption they will be more secure than by using hardened standard encryption libraries.
Security through obscurity is when you run your sshd server on port 1337 instead of 22 without actually securing the server settings down, because you don’t think the hackers know how to portscan that high. Everyone runs on 22, but you obscurely run it elsewhere. “Nobody will think to look.”
ASLR is nothing like that. It’s not that nobody thinks to look, it’s that they have no stable gadgets to jump to. The only way to get around that is to leak the mapping or work with the handful of gadgets that are stable. It’s analogous to shuffling a deck of cards before and after every hand to protect against card counters. Entire cities in barren deserts have been built on the real mathematical win that comes from that. It’s real.
With attacks such as AnC, your logic fails. They can figure out the locations and get plenty of stable gadgets.
Any shuffling of a deck of cards by Alice is pointless if Bob can inspect the deck after she shuffles them. It makes ASLR not very different from changing your sshd port. In both cases, this describes the security:
okay, sure, ASLR can be defeated by hardware leaks. The first rowhammer papers were over ten years ago, it's very old news. It's totally irrelevant to this thread. The fact that there exist designs with hardware flaws which make them incapable of hosting a secure PRNG has no relevance to a discussion about the merits, or lack thereof, of PRNG-based security measures. The systems you're referring to don't have secure PRNGs.
Words have meaning, god damn it! ASLR is not security through obscurity.
Edit: I was operating under the assumption that “AnC” was some new hotness, but no, this is the same stuff that’s always been around, timing attacks on the caches. And there’s still the same solution as there was back then: you wipe the caches out so your adversaries have no opportunity to measure the latencies. It’s what they always should have done on consumer devices running untrusted code.
ASLR is technically a form of security by obscurity. The obscurity here being the memory layout. The reason nobody treated it that way was the high entropy that ASLR had on 64-bit, but the ASLR⊕Cache attack has undermined that significantly. You really do not want ASLR to be what determines whether an attacker takes control of your machine if you care about having a secure system.
The defining characteristic of security through obscurity is that the effectiveness of the security measure depends on the attacker not knowing about the measure at all. That description doesn’t apply to ASLR.
It produces a randomization either at compile time or run time, and the randomization is the security measure, which is obscured based on the idea that nobody can figure it out with ease. It is a poor security measure given the AnC attack that I mentioned. ASLR randomization is effectively this when such attacks are applicable:
You are confusing randomization, a legitimate security mechanism, with security by obscurity. ASLR is not security by obscurity.
Please spend the time on understanding the terminology rather than regurgitating buzz words.
I understand the terminology. I even took a graduate course on the subject. I stand by what I wrote. Better yet, this describes ASLR when the AnC attack applies:
The normal way is to use the dpkg tooling: patch the source, use dch to bump the version with a .1 suffix or something similar so that the OS version always takes precedence, and then rebuild.
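For anyone who hasn't done it, that flow looks roughly like this on Debian/Ubuntu (a sketch, assuming deb-src entries are enabled; whether DEB_CFLAGS_APPEND actually takes effect depends on the package using dpkg-buildflags):

  sudo apt build-dep jq
  apt source jq
  cd jq-*/
  dch --local=+rebuild "rebuilt with custom flags"   # adjust the version scheme to taste
  DEB_CFLAGS_APPEND="-O2 -march=native" dpkg-buildpackage -us -uc -b
  sudo dpkg -i ../jq_*.deb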
It's been a while since I had to deal with this kind of thing, but my memory is that as soon as you go beyond the flags that the upstream developers use (just to be clear, I mean the upstream developers, not the distro packagers), you're buying yourself weird bugs and a whole lot of indifference if they occur.
I haven't used a non-libc malloc before but I suspect the same applies.
Two opposing things are both true at the same time.
If you as an individual avoid being at all different, then you are in the most company and will likely have the most success in the short term.
But it's also true that if we all do that then that leads to monoculture and monoculture is fragile and bad.
It's only because of people building code in different contexts (different platforms, compilers, options, libraries, etc...) that code ever becomes at all robust.
A bug that you mostly don't trigger because your platform or build flags just happens to walk just a hair left of the hole in the ground, was still a bug and the code is still better for discovering and fixing it.
We as individuals all benefit from code being generally robust instead of generally fragile.
I've been building my own emacs for a long time, and have yet to hit any weird bugs. I thought that as long as you avoid any unsafe optimizations, you should be fine? Granted, I also thought that -march=native was the main boost that I was seeing. This post indicates that is not necessarily the case.
I also suspect that any application using floats is more likely to have rough edges?
Complex software usually has some undefined behavior lurking that at higher or even just different optimization levels can trigger the compiler to do unexpected things to the code. It happens all the time in my line of work. If there's an extensive test suite you can run to verify that it still works mostly as expected then it's easier.
This is one where I suspect we don't disagree. But "all the time" can have a very different feel between people.
It also used to happen that just changing processors was likely to find some problems in the code. I have no doubt that still happens, but I'd also expect it has reduced.
Some of this has to be a convergence on far fewer compilers than we used to encounter. I know there are still many c compilers. Seems there are only two common ones, though. Embedded, of course, is a whole other bag of worms.
I thought touching the math optimizations directly was in the "unsafe" bucket. Really the only optimization I was aiming for was -march=native. That and the features like native compilation that have made it to the release.
I do think I saw improvements. But I never got numbers, so I'm assuming most of my feel was wishful thinking. Reality is a modern computer is hella fast for something like emacs.
I did see compilation mode improve when I trimmed down the regexes it watches for to only the ones I knew were relevant for me. That said, I think I've stopped doing that; so maybe that is a lot better?
I've turned on fastmath in python numba compiler while thinking "of course i want faster math, duh". Took me a while to find out it was a cause of many "fun" subtle bugs. Never touching that stuff again.
On the other hand, if your optimization helps consistently across platforms, you could convince upstream developers to implement it directly. (Not necessarily across all platforms – a sizable performance gain on just a single arch might still be enough to tweak configuration for that particular build).
It's been awhile since I looked into this, but it's not necessarily an easy change. glibc malloc has debugging APIs; a distro can't easily replace it without either emulating the API or patching programs that use it.
No need to even patch it out. It's relatively easy to change the default, rebuild the world (distros have a flow for this), and restore glibc for the tiny, tiny handful of individual programs that actually rely on glibc debugging APIs.
From an end-developer perspective: I have no particular familiarity with mimalloc, but I know jemalloc has pretty extensive debugging functionality (not API compatible with glibc malloc, of course).
I'd be curious how the performance compares to this Rust jq clone:
cargo install --locked jaq
(you might also be able to add RUSTFLAGS="-C target-cpu=native" to enable optimizations for your specific CPU family)
"cargo install" is an underrated feature of Rust for exactly the kind of use case described in the article. Because it builds the tools from source, you can opt into platform-specific features/instructions that often aren't included in binaries built for compatibility with older CPUs. And no need to clone the repo or figure out how to build it; you get that for free.
jaq[1] and yq[2] are my go-to options anytime I'm using jq and need a quick and easy performance boost.
As a bonus that people might not be aware of, in the cases where you do want to use the repo directly (either because there isn't a published package or maybe you want the latest commit that hasn't been released), `cargo install` also has a `--git` flag that lets you specify a URL to a repo. I've used this a number of times in the past, especially as an easy way for me to quickly install personal stuff that I throw together and push to a repo without needing to put together any sort of release process or manually copy around binaries to personal machines and keep track of the exact commits I've used to build them.
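e.g. something like this (the repo URL is purely a placeholder):

  cargo install --locked --git https://github.com/yourname/yourtool
  # pin a branch, tag or commit if you want it reproducible
  cargo install --locked --git https://github.com/yourname/yourtool --tag v0.3.0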
Misleading title: the 90% is measured relative to the faster time. It's about 45% faster.
It's actually a little bit interesting, if you are interested in how we use language. You could argue that you now get 90% more work done in the same amount of time, and that would align with other 'speed' units that we commonly use (miles per hour, words per minute, bits per second). However, the convention in computer performance is to measure time for a fixed amount of work. I would guess that this is because generally we have a fixed amount of work and what might vary is how long we wait for it (and that is absolutely true in the case of this blog post), so we put time in the numerator.
It's a very interesting post and very well done, but it's not 90% faster.
I feel like using units of rate (90% faster) and not units of time makes more sense here.
Plus, if you were using units of time, you wouldn’t use the word “faster.” “Takes 45% less time” and “45% faster” are very different assertions, but they both have meaning, both in programming and outside it.
It comes down to convention. When talking about proportional differences, we can fix the unitary example to be the smaller, larger, earlier, or later, object or subject.
I think, generally, we fix on the earlier when talking about the change over time of a characteristic. "This stock went up 100%, this stock went down 50%". In both cases it's the earlier measurement that is taken as the unit. That makes this a 45% reduction in time to do the work, and that's actually what they measured.
When talking about comparisons between two things that aren't time dependent it depends on if we talk in multiples or percents I think. A twin bed is half as big as a king bed. A king bed is twice as big as a twin bed. Both are idiomatic. A king bed is 100% bigger than a twin bed. Yes, you could talk like this. A twin bed is 100% smaller than a king bed. Right away you say wait, a twin bed isn't 0 size! Because we don't talk in terms of the smaller thing when talking about decreasing percents, only increasing. A twin bed is 50% smaller than a king bed (iffy). A twin bed is 50% as big as a king bed. There, that's idiomatic again.
Great read. I hadn't even considered that people might interpret "90% faster" as "10 times as fast", i.e., it will take 100-90=10% of the original time. It seems like a completely incorrect interpretation to me, but obviously there are people who read it with this understanding. Huh.
It is unfortunate (because then how do you interpret 100% faster?), but a common interpretation of X% faster is that it takes X% less time than before. One easy way to check is to have chatgpt give a numeric example for a statement like this; in all the cases I tried, it gave an example along the lines of: if the job took 100 seconds before, it takes 10 seconds now (a 10x speedup). I'm assuming that's representative of common usage, given how much data it's trained on.
After thinking about it more, I think I know why it seems misleading. They are talking about the change in the larger value as a % of the smaller value. That's what is misleading. They are saying it is reduced by 90% (of some other thing). When you say "We reduced THING by N%" it's just assumed that the N% is N% of THING, not OTHER THING.
I think you're right. I think you could say the package (as in the code) is 45% faster or that the package increases parsing rate by 90%. But mixing the two is confusing.
I think it's an interpretation issue. When you say it's 45% faster I interpret that as "the new package handles work at a rate of 145% when compared to the original, that is, 1.45 times as fast".
I would rephrase your comment as "the package takes 45% less time to process a given data set".
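A quick numeric sketch of the two readings, using a made-up 100-second baseline and the title's 90% figure:

  # rate reading: 90% faster = 1.9x the throughput, so the time drops to 1/1.9
  echo "scale=3; 100/1.9" | bc     # 100 s -> ~52.6 s, i.e. takes ~47% less time
  # misreading: "90% faster" taken as "takes 90% less time"
  echo "scale=1; 100*0.1" | bc     # 100 s -> 10 s, which would be a 10x speedup, not 1.9x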
Thanks for this, as an average HN user I didn't click the link and just skimmed the comments thinking how is it possible that they reduced the runtime to 10% of the original. This post clarifies that for me (now on to actually read the blog post).
Reading that such a simple change can get such a big speedup, my first thought is to let the authors of jq know. Maybe there's a caveat to be aware of, maybe they'll test it and end up making it faster for everyone. Useful to drop a quick note pretty much no matter the result, I think?
The article doesn't seem to even consider that option and I don't see any comment here mentioning this either. Am I missing something?
The benchmarking tool being used in this post accounts for that using multiple runs for each invocation, together with a warmup run that is not included in the result metric.
Yeah, after this post I sought out a couple of youtube videos explaining it. It's starting to make a bit more sense now, but the lightbulb hasn't gone off just yet. Appreciate the link.
I wish I had seen this earlier. My mental model was close but not quite there to the point where I needed to think too hard about how to solve problems.
I think jq's syntax is pretty sweet once you get used to it (or if you're already familiar with point-free style, as used in Haskell). The man page is subpar, however.
AFAIK ptmalloc (on which glibc malloc is based) was created decades ago, when both multi-threaded applications and multi-CPU systems were rare (at least in the Linux world), so multi-threaded performance didn't matter. Some improvements have been made in glibc since then, but I don't think it's possible to significantly improve glibc malloc without rewriting it more or less fully. At that point it would make more sense to import some existing malloc implementation.
And speaking of tradeoffs: the default number of arenas in glibc malloc is 8 times the number of CPUs, which is a terrible tradeoff - on many workloads it causes heap fragmentation and memory usage (RSS) many times higher than the allocated memory size, which is why it's common to find advice to set MALLOC_ARENA_MAX to 1 or 2. But such a high number of arenas probably lets glibc look less bad on synthetic benchmarks.
Jemalloc, tcmalloc and mimalloc were all created with a focus on multi-threaded applications from the beginning, and while they don't work better than glibc malloc for single-threaded applications, they don't work worse for that use case either. Probably the main disadvantage of using je/tc/mi malloc for a single-threaded app is the larger code size.
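Both mitigations are runtime-only, for anyone who wants to try them without rebuilding anything; the library path below is where Ubuntu's libjemalloc2 package puts it, adjust for your distro:

  # cap glibc's per-thread arenas
  MALLOC_ARENA_MAX=2 ./your-server
  # or swap the allocator entirely via preload
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./your-server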
Apologies, I did my first post on my phone, so was rather quick. I had found the repository and it is a good read. What I'd love to see is a comprehensive dive on the different options and a good general discussion on when each is superior.
I realize, of course, that this could just be overall optimal. Feels more likely that the allocation patterns of the application using it will impact which is the better allocator?
I wonder why my brain calculates "90% faster" as 10% of the time it used to take (i.e. t1 = 0.1*t0). The claim only makes sense if we calculate the recompiled application's running time as t1 = t0/1.90.
Using preload on the stock Ubuntu binary, to give it mimalloc, they got a 44% speedup.
By rebuilding the binary with different compiler options, but not changing malloc, they got a 20% speedup.
If we naively multiply these speedups, we get about 1.73: roughly 73% faster.
The way it gets to 1.9 is that when you speed up only the program but leave malloc the same, malloc accounts for a bigger share of the remaining run time.
When the faster malloc is applied to a program that is compiled better, it will make a better contribution than the 44% seen when the allocator was preloaded into the slower program.
To do the math right, we would have to look at how much time was saved with just the one change, and how much with the other. If we subtract both of those savings from the original, slow time, we should get a time that corresponds roughly to the 1.9x speedup.
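Plugging in the numbers quoted downthread (4.631 s stock vs a projected 2.431 s with both changes):

  # naive composition of the two individual speedups
  echo "scale=3; 1.44 * 1.20" | bc    # 1.728 -> ~73% faster
  # combined figure from the two times
  echo "scale=3; 4.631 / 2.431" | bc  # 1.904 -> the ~90% in the title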
Ubuntu is still building for CPUs with the x86-64-v1 feature set. On experimental images built for x86-64-v3 (roughly the feature set of 10 year old CPUs, a bit newer on low-power CPUs) they have observed 60% improvement on some benchmarks (and much less in some others). Some distros have switched to -v3, Ubuntu is still holding out to support older hardware. The author is compiling to the actual feature set of their specific CPU, which is nearly guaranteed to be even more than -v3.
Another thing the article adds is LTO, which in my experience also makes a huge difference: it makes your software a lot faster in certain cases, but also makes build times a lot worse. Spending a bit more time in the build process should be an easy call, but might be harder at the scale of a distro.
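For anyone experimenting locally, the flag spellings for recent GCC/Clang are roughly as below; whether a given package's build system actually honors CFLAGS/LDFLAGS varies:

  # target the x86-64-v3 baseline and enable LTO
  CFLAGS="-O2 -march=x86-64-v3 -flto" LDFLAGS="-flto" ./configure
  make -j"$(nproc)"
  # or, like the article, target exactly the machine you're building on
  CFLAGS="-O2 -march=native -flto" LDFLAGS="-flto" ./configure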
I was under the, evidently naive, understanding that there were some shenanigans done in the libc to pick optimal code paths based on the architecture. This would obviously have a startup cost on the first run, but presumably the dynamic linking would work to this advantage by having it in a fast state for most calls?
Not sure where I got that idea, though. :(. Will have to look into that later.
Even if libc does it, other packages wouldn’t. The compiler is what generates code and they generally don’t try to automatically generate for different instruction sets because they don’t know what code is what hotpath and worth optimizing for. You could probably try to build for multiple different CPU features at the top level and do a universal executable like Apple did for their CPU transitions but that’s a lot of expense to pay both in terms of compilation time AND in terms of size.
Debug symbols do not need to be paged in so shouldn’t make much/any difference. Compiling with -O3 can increase code size a lot due to inlining, which can be a little bad.
Right, my humor there was that I'm assuming people specifically did non-debug flags for the build as an optimization. Only for them to still fall short.
And understood a little on -O3 possibly increasing code size. I had thought that was more of a concern for tight environments than for most systems? Of course, I'd have assumed that -march=native would be more impactful, but the post indicates otherwise.
I said in a top level, but it seems the allocator makes the biggest impact for this application? Would be interesting to see which applications should use different allocators. Would be amazing to see a system where the more likely optimal allocator was default for different applications, based on their typical allocation patterns.
> Right, my humor there was that I'm assuming people specifically did non-debug flags for the build as an optimization.
It used to be the case that the presence of debug symbols would affect GCC code generation. Nowadays that should be fixed. I think it still affects the speed of compilation so if you're building the whole system from source you might want to avoid it.
Mario 64 gets flak for some files not being compiled with optimizations on. There is a YouTuber, whose name I forget at the moment, who has been doing optimization and improvements on it and constantly talks about how this is for the better due to the low instruction cache size and the console's slow speed. Compiling with optimizations explodes code size (loop unrolling, inlining, etc) to the point it's a net loss after factoring in the increased load on the IC.
The IC is the instruction cache. Bloated code needs more slow loads into the IC, and even if the performance per cache line is better, the 50x overhead of the added cache loads diminishes all the -O3 advantages.
Right, I meant the humor to be that folks do not use debug flags, with my naive presumption that they did this as an optimization, only for that not to really help.
It's not a 90% speedup, it's ~50% (still quite impressive).
The author seems to be confused, because the original jq is 1.9x slower than the optimized one.
That depends on how you're representing the speedup.
To travel 10 miles, at 60 MPH, takes 10 minutes. Make it 100% faster, at 120 MPH, and that time becomes 5 minutes. Travel just as far in 50% of the time. Or travel just as far 100% faster. The 90% speedup matches the reduction of the time it takes to nearly half (a 90% (projected) speedup, or about a 45% time reduction, as mathed out by kazinator `Projected speedup from both: 4.631/2.431 = 1.905`). Your claim that its closer to 50% is correct from a total time taken perspective, just coming at it from the other direction.
For kicks, I downloaded the file and compared duckdb with the spatial extension against jq. jq was about 2x as fast for interactive use, but if you had multiple queries to run, paying a small cost to create a duckdb database and then querying it would be vastly faster
I also was nerdsniped into trying this and found that after extracting the features array into a newline delimited json file, DuckDB finishes the example query in 500 ms (M1 Mac), querying the 1.3 GB json file directly with read_json!
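For reference, the general shape of that in recent DuckDB looks like this; the file name and query are placeholders, not the article's actual workload:

  # one-off query straight from the JSON file
  duckdb -c "SELECT count(*) FROM read_json('features.ndjson')"
  # or pay the load cost once and query a persistent table repeatedly
  duckdb features.db -c "CREATE TABLE features AS SELECT * FROM read_json('features.ndjson')"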
> What happens if you grab the jq source code from Launchpad, then configure and rebuild it with no flags at all? Even that is about 2-4% faster than the Ubuntu binary package.
I tried this with a 1.3GB file [1], and got Gemini to convert the jq query into a Go program, which takes 6.6s the second time it's run (to cache the geojson file). My laptop isn't particularly fast (i7-1250p), and Go's json handling isn't particularly fast either, so jq's time is not impressing me when run on a Ryzen 9 desktop processor on a 500MB file.
It's surprising how quick this kind of processing can be in go.
It is funny how being a corporate Rails programmer taught me this early in my career. Building Ruby from source was the only way to install the latest Ruby versions (using ruby-build), and installing Ruby from source meant its dependencies like openssl and zlib had to be installed from source too, to match the required versions.
And for a long time, using jemalloc was the only way to keep memory usage constant with multi-threaded Ruby programs like Puma and Sidekiq. This was achieved either through compiling Ruby with jemalloc or by modifying the LD_LIBRARY_PATH.
Some developers also reported 5% or so reduction in response times with jemalloc, iirc.
The problem with this approach, though, is that when a package has a lot of dependencies, like ImageMagick which relies on jpeg, png, ghostscript and a lot of other libraries, you have to take a trial-and-error approach until it succeeds. Fortunately, fixing the dependency errors is the easiest part; sometimes building Python from source would throw errors from headers which are impossible to understand. If you find a Stackoverflow solution then you are good, or you have to go down the rabbit hole, coming out either successful or empty-handed based on your level of expertise.
-march=native + mimalloc (or jemalloc) should be sufficient without causing significant undefined behavior like -O3 or most extra optimization related compiler arguments.
Nope, I'm not sure about it. I remember when I was using Gentoo about 10 years ago, this was the common reason given for using -O2 instead of -O3 in your build flags, and I'm just speaking from that memory.
But in general I have often found great benefit in many cases in obtaining packages from source and compiling them, or downloading official binaries, even if the repository does have the latest version in theory.
(shameless plug incoming): My only obstacle with such "Manually Installed [and/or] Source-Compiled" (MISC) packages was checking for updates. So I made this: https://sr.ht/~tpapastylianou/misc-updater/
Works great, and I've been slowly improving it to add 'upgrader' commands for some of the packages too.
It's not a full replacement for a package manager, nor is it meant to be. But it's made my ability to use MISC packages instead of repository ones a lot more streamlined and easier.
Reading this, I wonder a few things that seem distro quick wins:
1) Why don't distros replace the glibc allocator with one of the better ones?
2) Why don't distros allow for making server-specific builds for only a few packages? You don't have to pay the compilation cost for everything, just a list of 2 or 3 packages.
Is glibc being generalist still true? I might be wrong here, but I had the impression it was open and free but behind most others, and that e.g. tcmalloc is better on most criteria. Do we have benchmarks and comparisons for the generalist case?
How would you do the things mentioned in the article using Guix?
You could write your own custom package definitions, extending the default to change up compile flags and allocators, but then you need to do this for every single package (and maintain them all). I'm not sure Guix gives you much here, though maybe that's fine for one or two packages.
The most pain-free option I can think of is the --tune flag (which is similar to applying -march=native), but packages have to be defined as tunable for it to work (and not many are).
If you want it to be permanent, then you can use a guix home profile (that's a declarative configuration of your home directory) with a patch function in the package list there:
You can also write a 10 line guile script to automatically do it for all dependencies (I sometimes do--for example for emacs). That would cause a massive rebuild, though.
> The most pain-free option I can think of is the --tune flag (which is similar to applying -march=native), but packages have to be defined as tunable for it to work (and not many are).
We did it that way on purpose--from prior experience; otherwise you would get a combinatorial explosion of different package combinations.
If it does help for some package X, please email us a 2 line patch adding (tunable? . #t) to that one package.
If you do use --tune, it will tune everything that is tuneable in the dependency graph. But at least all dependents (not dependencies) will be just grafted--not be rebuilt.
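For reference, --tune is a regular package transformation option, so usage looks something like this (the package name is a placeholder; only packages carrying the tunable? property get rebuilt, everything else is grafted as described above):

  # tune for the CPU guix is running on
  guix install some-tunable-package --tune
  # or name a micro-architecture explicitly
  guix build some-tunable-package --tune=skylake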
In the late 1990s, the most common method for distributing a package was by providing its source code as a tarball. You essentially had to run ./configure and make on every package. I believe ./configure does some of what the OP did in a clever way, checking for locally installed alternatives, etc. It’s still pretty cool, though. I didn’t think a malloc replacement would significantly improve performance.
what makes me jealous is not the double performance, but the fact that he can handle 13000 files in 2 seconds. i am at my ${corporate} job now, and have about 40'000 json files stored locally on the windows laptop. when I need to load them, it takes minutes. i am still not sure if i read the files wrong or if NTFS is a pile of trash when it comes to reading many smallish files
NTFS is slow, especially when you operate on a lot of tiny files (nobody in the Windows world would do that, you'd always put your tiny data blobs into a bigger container file, e.g. asset files in games), but from my corporate experience, it's mostly the _multiple_ "endpoint security" solutions that bog file system performance down.
It's the reason I so far use a Mac at work, which has its own issues, and a lot of them.
NTFS is definitely very slow when it comes to reading many small files. Windows Defender or whatever antivirus you might be using can also further slow this down.
you were right. it didn't occur to me that antivirus would scan everything I open. Moved all files into an excluded directory, reading files is now almost 10 times faster. Thanks kind stranger !
because that'd be a several-hundred-megabyte json :) Anyway, with the suggestion in another comment, i moved the root directory into an antivirus-excluded location, which gave me almost a 10 times performance win
BOLT is only good for ~2% in this scenario, which is about what I would have expected from glancing at the profiles.
-Os is much slower, as I would have expected. Nothing in the perf data suggested high L1i miss rate. The .text of the "size optimized" program is, for some reason, 16x larger.
jaq is nice. It just loses, performance-wise, to the final step in this article, and it can't do the second mentioned workload (yet) so I didn't include it.
It bombs out on the jq program I use for the 2nd corpus that I mentioned. On further investigation, the show-stopping filter is strftime. In the jaq readme this is the only not-yet-checked box in the compatibility list, so perhaps some day soon.
If I had the time to fork jq, I would convert the relevant part to C++ so it doesn't spend literally all of its time dynamically allocating strings it is just about to discard.
GeoJSON is an interchange format that a government agency is using to provide this data, and it updates continuously, so at some point I will be interested in the efficiency of the ETL even if I start doing all my queries against an optimized format.
People will run into subtle bugs and weird behavior when starting to compile things themselves and replacing libraries I think. Not worth the effort for me, fun as an exercise, but would only do it if it's really (really) needed for performance.
> Can’t one recompile the same exact Ubuntu packages you already have on your system with optimal flags for your specific hardware?
Well, in principle yes. But you'd also need to figure out what you mean by 'optimal' flags? The flags might differ between different packages, and not all packages will be compatible with all the flags.
Even worse, in the example in the article they got most of their speedup out of moving to a different allocator, and that's not necessarily compatible with all packages (nor an improvement for all packages).
However, if you still want to do all these things, then Gentoo Linux is the distribution for you. It supports this tinkering directly and you'll have a much happier time than trying to bend Ubuntu to your will.
Since the test data is not available for us to experiment ourselves, I've asked the poster to try it with Boehm GC compiled as a transparent malloc() replacement.
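For anyone curious to try the same experiment, bdwgc can be configured to intercept malloc/free and then be preloaded; a sketch only - the configure flag is from memory, and the jq expression and paths are placeholders:

  # build Boehm GC so malloc/free are redirected to the collector
  ./configure --enable-redirect-malloc && make && sudo make install
  # preload it under the program being measured
  LD_PRELOAD=/usr/local/lib/libgc.so jq '.features | length' big.geojson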
O3 is a bad idea if the package is important. While that’s a dated opinion based on older experiences (Gentoo, various server software, etc), I suspect it’s still the easiest way to get exposed to compiler bugs.
The undefined behaviour is already in the C standard.
But you are right that enabling -O3 is generally going to make your compiler more aggressive about exploiting the undefined behaviour in the pursuit of speed.
In general, if you want to avoid being the guy to run into new and exciting bugs, you want to run with options that are as close as possible to what everyone else is using.
If you're going to offer unsupported FUD, I would prefer you direct the FUD at NDEBUG and the fear that it could have disabled a load-bearing assertion.
Just because it's turned off by default doesn't mean it isn't load bearing. If you didn't ever encounter an ASSERT with side effects that were crucial to proper functioning, you just haven't seen enough ASSERT statements.
Rule #1 of programming: If it can go wrong, it will.
That's before we even consider instances where assert was used to check a condition that should always be verified. In my personal projects I define a "require" macro that I use liberally.
If acknowledging my opinion was dated based on older experiences constitutes “unsupported FUD”, well, have fun with the life experiences that attitude produces for you.
I haven't kept up with Gentoo. How long does it take these days to bootstrap a functional desktop from source?
I remember when it was my main driver back in early '00s, running on a horribly underpowered PII at 266MHz. It would often be compiling 24/7 to keep up with the updates.
Isn't it against the rules to post a link like this where the content is walled behind needing an account to access?
I know most of us have accounts here, but when we're seeing something we want to read, it's not conducive to have to deal with roadblocks before we can see what is being talked about. Same goes for paywalls.
Chrome is optimized right up to its eyebrows. If you are using Google distributions of Chrome binaries, I wouldn't expect improvements from build tweaks.
Ah yes, this is nice but it sounds like Gentoo with extra steps. Probably also some extra benefits. But there is a reason we don’t all run Gentoo. Although given a large and seamless enough cache, maybe we would.
Waste minutes of your time just so anonymous people on the internet can read your comment :).
Not everything in life is an optimization problem. Fun little projects like this is what makes us human (and, arguably, are both educational and entertaining).
I see what you're saying but I am also talking about how computers originally saved humans time and that the OP seems to now be slaving away to save the CPU time. Doesn't that seem a little backwards?
If we're talking about micro-seconds of difference, the trade-off doesn't seem worth it. Even on a mass scale where this is somehow adopted, nobody is going to notice the difference. Maybe if this were in something like eCommerce or web browsing, where the lag translates to profit lost? Or perhaps game engines?
IDK, I just consider human time more precious than a slow package (one that already runs blazingly fast, on a CPU so fast that it barely matters).
Longer waiting times can decrease productivity a lot more than just the runtime increase. If a program takes long enough that the user switches context mentally, or goes to get another coffee or looks at a few posts on social media that now means far more wasted time. Humans are not robots and won’t just keep staring at the screen.
Also a lot of simple optimizations could save seconds or minutes in the lives of millions of people, and across multiple devices/programs that adds up. Microsoft once sped up the booting process of their consoles by 5 seconds, by simply making the boot animation shorter by 5 seconds. Those consoles were sold to millions of people and it took MS 8 years or so to make the fix. That’s many lifetimes wasted once you add it all up.
Today it's 2.2GB of JSON files, tomorrow it's tens of petabytes of archived data that you need to look through for work. A speedup of 2x is nothing to scoff at.
As a broader comment though, this is how we learn and discover things. Just because the specific outcome here is "trivial" doesn't mean the approach or the lessons learnt aren't valuable.
I found it an interesting read despite not having any stake in the outcome.