"Make one Ubuntu package 90% faster by rebuilding it and switching the memory allocator"
i wish i could slap people in the face over standard tcp/ip for clickbait. it was ONE package, and some of the gains didn't even come from recompilation.
i have to give it to him, i have preloaded jemalloc into one program to swap the malloc implementation and the results have been very pleasant. not in terms of performance (did not measure) but in stabilizing said application's memory usage. it actually fixed a problem that appeared to be a memory leak, but probably wasn't the fault of the app itself (likely memory fragmentation with the standard malloc)
I did research into the glibc memory allocator. Turns out this is not memory fragmentation, but per-thread caches that are never freed back to the kernel! A free() call does not actually return memory to the kernel except in exceptional circumstances. The more threads and CPU cores you have, the worse this problem becomes.
One easy solution is setting the "magic" environment variable MALLOC_ARENA_MAX=2, which limits the number of caches.
Another solution is having the application call malloc_trim() regularly, which purges the caches. But this requires application source changes.
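For reference, both knobs are glibc-specific and can also be exercised from code. A minimal sketch (assuming glibc's <malloc.h> extensions):

    /* Sketch: capping glibc's per-thread arenas and trimming its caches.
       mallopt(M_ARENA_MAX, ...) is the in-code equivalent of the
       MALLOC_ARENA_MAX environment variable; malloc_trim() asks glibc to
       return unused heap pages to the kernel. Both are glibc extensions. */
    #include <malloc.h>
    #include <stdlib.h>

    int main(void) {
        mallopt(M_ARENA_MAX, 2);         /* same effect as MALLOC_ARENA_MAX=2 */

        for (int i = 0; i < 1000; i++) { /* stand-in for real allocation churn */
            void *p = malloc(1 << 20);
            free(p);
        }

        malloc_trim(0);                  /* purge cached memory back to the kernel */
        return 0;
    }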
FWIW i hit this with icinga2. so now they actually preload jemalloc in the service file to mitigate the issue; this may very well be what you're talking about
True, I also believed it for a second. But it's also easy to blame Ubuntu for errors. IMHO they are doing quite a decent job of assembling their packages. In fact they are also compiled with stack fortifications. On the other hand I'm glad they are not compiled with the possibly buggy -O3. It can be nice for something performance-critical, but I definitely don't want a whole system compiled with -O3.
To me it's obviously a scam, because there's no way such an improvement can be achieved globally with a single post's worth of explanation. 90% faster is a micro-benchmark number.
This is neither a micro-benchmark nor a scam, but it is click-bait by not mentioning jq specifically.
Micro-benchmarks would be testing e.g. a single library function or syscall rather than the whole application. This is the whole application, just not one you might care that much for the performance of.
Other applications will of course see different results, but stuff like enabling LTO, tuning THP and picking a suitable allocator are good, universal recommendations.
True that. I mean, it is still interesting that if you have a narrow task, you might achieve a significant speedup from rebuilding the relevant packages. But this is a very niche application.
true, i saw a thread recently on reddit where a guy hand-tuned compilation flags and did PGO profiling for a video encoder app that he uses on a video encode farm.
In his case, even a gain of ~20% was significant. It translated into enough extra capacity to encode a few thousand more video files per year.
I wonder how many prepackaged binary distributions are built with the safest options for the os/hardware and don't achieve the best possible performance.
I bet most of them, tbh.
Many years ago I started building Mozilla and my own linux kernels to my preferences, usually realizing modest performance gains.
The entire purpose of the Gentoo Linux distribution, e.g., is performance gains possible by optimized compilation of everything from source.
the title is clickbait, but it's good to encourage app developers to rebuild, esp when you are cpu bound on a few common utilities, e.g. jq, grep, ffmpeg, ocrmypdf -- common unix utils are built for general use rather than for a specific application
Or, if I understand TFA correctly, don't release debug builds in your release packages.
Reminds me of back in the day, when I was messing around with blender's cmake config files quite a bit. I noticed the fedora package was using the wrong flag -- some sort of debug-only flag intended for developers instead of whatever they thought it was. I mentioned this to the package maintainer, it was confirmed by a package sub-maintainer (or whomever), and the maintainer absolutely refused to change it because the spelling of the two flags was close enough they could just say "go away, contributing blender dev, you have no idea what you're talking about." Wouldn't doubt the fedora package still has the same mistaken flag to this day, and all this occurred something like 15 years ago.
So, yeah, don't release debug builds if you're a distro package maintainer.
Vector instruction sets like AVX-512 will not magically make common software faster. The number of applications that deal with regular operations on large blocks of data is pretty much limited to graphical applications, neural networks and bulk cryptographic operations. Even audio processing doesn't benefit that much from vector operations, because a codec's variable-size packets do not allow for efficient vectorization (the main exception being multi-channel effects processing as used in DAWs).
Thanks for the correction. I hadn't considered bulk memory operations to be part of SIMD, but it makes sense -- they operate on a larger grain than word size, so they can do the same operation with less micro-op overhead.
Engineering is a compromise. The article shows most gains come from specialising the memory allocator. The thing to remember is that some projects are multithreaded, and allocate in one thread, use the data in another, and maybe deallocate in a third. The allocator needs to handle this. So a speedup for one project may be a crash in another.
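For the record, the handoff pattern being described is just this (a minimal pthread sketch, compiled with -pthread; any general-purpose allocator must support it without corruption):

    /* Allocate in one thread, free in another: the cross-thread handoff
       pattern a general-purpose allocator has to handle correctly. */
    #include <pthread.h>
    #include <stdlib.h>

    static void *producer(void *arg) {
        (void)arg;
        return malloc(4096);        /* allocated from this thread's arena/cache */
    }

    int main(void) {
        pthread_t t;
        void *block = NULL;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, &block);    /* hand the pointer to the main thread */
        free(block);                /* freed on a different thread than it was allocated on */
        return 0;
    }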
Also, what about reallocation strategy? Some programs preallocate and never touch malloc again, others constantly release and acquire. How well do they handle fragmentation? What is the uptime (10 seconds or 10 years)? Sometimes the choice of allocators is the difference between long term stability vs short term speed.
I experimented with different allocators while developing a video editor that caches frames, testing with 4K video. At 32 MB per frame and 60 fps, that's almost 2 GB per second per track. You quickly hit allocator limitations, and realise that at least the vanilla glibc allocator offers the best long-term stability. But for short-running benchmarks it's the slowest.
As already pointed out, engineering is a compromise.
Mimalloc is a general purpose allocator like JEMalloc / TCMalloc. Glibc is known to have a pretty bad allocator that modern allocators like MIMalloc & the latest TCMalloc (not the one available by default in Ubuntu) run laps around. While of course speedup may be variable, generally the benchmarks show an across the board speedup (whether that matters for any given application is something entirely different). As for crashes, these are all general purpose multi-thread allocators and behave no differently from glibc (modulo bugs that can exist equally in glibc).
Agree for most short-running apps. I updated my comment to reflect issues with apps that are constantly reallocating and running for longer than 60 seconds. But you are absolutely correct for most short-running apps; 99% of the time it's recommended to replace glibc. However, there is an app or two where glibc stability doesn't trigger a pathological use case, and you have no choice. Hence why it's the default, since there are fewer crashes in pathological cases. And the devs are exhausted dealing with crashing bugs which can be eliminated by using the slower allocator.
Looked at your updated post and it looks like you’re operating under wildly incorrect assumptions.
1. Fragmentation: MIMalloc and the newest TCMalloc definitely handle this better than glibc. This is well established in many many many benchmarks.
2. In terms of process lifetime, MIMalloc (Microsoft Cloud) and TCMalloc (Google Cloud) are designed to be run for massive long-lived services that continually allocate/deallocate over long periods of time. Indeed, they have much better system behavior in that allocating a bunch of objects & then freeing them actually ends up eventually releasing the memory back to the OS (something glibc does not do).
> However, there is an app or two where glibc stability doesn't trigger a pathological use case, and you have no choice.
I’m going to challenge you to please produce an example with MIMalloc or the latest TCMalloc (or heck - even any real data point from some other popular allocators vs vague anecdotes). This just simply is not something these allocators suffer from and would be major bugs the projects would solve.
I tried to use mimalloc and snmalloc and in both cases got crashes I don't get with glibc when interoperating with other libraries (libusb, jack, one that I suspect to be in the Nvidia driver) :(
If you are not properly overriding the allocator consistently for everything within an executable, that’s entirely possible (eg linking against 1 allocator and then linking with a dynamic library that’s using a different one). Without a specific repro it’s hard to distinguish PEBCAK from legit bug. Also it certainly can’t be the Nvidia driver since that’s not running anything in your process.
The explanation here is usually simpler. Any major change like this can lead to crashes.
Say I have three libraries, biz, bar, and bif, and each has bugs. You've used biz. When the system was unstable, you'd debug until it worked (either by finding the real bug, or by an innocuous change).
If you switch libraries, the bugs which go away are the ones which aren't affecting you, since your software works. On the other hand, you have a whole new set of bugs.
This comes up when upgrading libraries too. More bugs are usually fixed than introduced, but there's often a debug cycle.
> Also it certainly can’t be the Nvidia driver since that’s not running anything in your process.
A huge chunk of a modern GPU driver is part of the calling process, loaded like a regular library. Just spot-checking Chrome's GPU process, there are dozens of threads created by a single 80+ MB Nvidia DLL. And this isn't unusual: every GPU driver has massive libraries loaded into the app using the GPU - often including entire copies of LLVM for things like shader compilers.
Challenge yourself to produce some numbers first. If there are many many many benchmarks it shouldn't be too difficult to link one. Just saying something is "well established" doesn't really help without some other context.
but we all can agree that glibc is trash right? :)
I’d be careful extrapolating from just one benchmark but generally if I had to choose I’d pick the new tcmalloc if I could. It seems to be a higher quality codebase.
The problem with the current tcmalloc, which I agree is very good, is the difficulty of integrating it with random builds. It works well for large systems and especially those that are already bazelized, but I couldn't whip up a quick way to link it to `jq` in this gist.
Thanks for that. I have an innate skepticism of benchmarks from a company who wants to show their solution is best, and it seems their latest results are from 2021, but I couldn't find a better comparison myself either.
I do note that rpmalloc (old), Hermes (not public? doesn't compare with mimalloc) and also snmalloc (also Microsoft) have benchmarks of their own showing themselves to be best in some circumstances.
More than that, based on the basic architecture of JEmalloc, mimalloc and tcmalloc, you should always expect their fragmentation behavior to be better for long-running software than the glibc malloc. (At the expense of consuming substantially more memory for very small programs with only a few allocations of a given size). The glibc malloc has nearly pessimal fragmentation behavior, you are very confused here.
> This just simply is not something these allocators suffer from and would be major bugs the projects would solve.
Not OP, but the following logic shows why this claim is bogus. In short: If two non-garbage-collecting memory allocators do anything differently -- other than behave as perfect "mirror images" of each other, so that whenever one allocates byte i, the other allocates byte totalMem-i -- then there exists a program that crashes on one but not the other, and vice versa.
In detail:
If 2 allocators do not allocate exactly the same blocks of memory to the same underlying sequence of malloc() or free() calls, then there exists a program which, if built twice, once using each allocator, and then each executable is run with the same input, will after some time produce different patterns of memory fragmentation.
The first time this difference appears -- let's say, after the first n calls to either malloc() or free() -- the two executables will have the same total number of bytes allocated, but the specific ranges of allocated bytes will be different. The nth such call must be a malloc() call (since if it were a free() call, and allocated ranges were identical after the first n-1 such calls, they would still be identical after the first n, contradicting our assumption that they are different). Then for each executable, this nth malloc() call either allocates a block at or some distance past the end, or it subdivides some existing free block. We can remove the latter possibility (and simplify the proof) by assuming that there is no more memory available past the end of the highest byte thus far allocated (this is allowed, since a computer with that amount of memory could exist).
Now have both programs call free() on every allocated block except the one allocated in operation n. Let the resulting free range at the start of memory (before the sole remaining allocated block) have total length s1 in executable 1 and s2 in executable 2, and let the resulting free range at the end of memory (after that sole remaining allocated block) have length e1 in executable 1 and e2 in executable 2. By assumption, s1≠s2 and e1≠e2. Now have both executables call malloc() twice, namely, on s1 and e1 in descending order. Then, unless s1=e2, executable 1 can satisfy both malloc()s, but executable 2 can satisfy only the first. Similarly, calling malloc() on s2 and e2 in decreasing order will succeed in executable 2 but not executable 1, again unless s1=e2 holds.
What if s1=e2 does hold, though? This occurs when, say, one executable allocates the block 100 bytes from the start of memory, while the other allocates it 100 bytes from the end. In this case, all we need is to keep some second, symmetry-breaking block around at the end in addition to the block allocated by operation n -- that is, a block for which it does not hold that one allocator allocates the mirror-image memory range of the other. (If no such block exists, then the two allocators are perfect mirror images of each other.)
I really have no idea what you’re getting at here. This isn’t embedded where there’s a fixed pool to allocate out of. If there’s insufficient space it’ll get more virtual memory from the OS.
Also, nothing you’ve said actually says that the other allocator will be the worse one. Indeed, glibc is known to hold onto memory longer and have more fragmentation than allocators like mimalloc and tcmalloc so I’m still at a loss to understand how even if what you wrote is correct (which I don’t believe it is) that it follows that glibc is the one that won’t crash. If you’re confident in your proof by construction, please post a repro that we can all take a look at.
> If there’s insufficient space it’ll get more virtual memory from the OS.
Swap space is finite too.
> Also, nothing you’ve said actually says that the other allocator will be the worse one.
I'm not claiming that either is worse. I'm showing mathematically that for any two allocators that behave differently at all (with the one tiny exception of a pair of allocators that are perfect mirror images of each other), it's possible to craft a program that succeeds on one but fails on the other.
I didn't say so explicitly as I thought it was obvious, but the upshot is: It's never completely safe to just change the allocator. Even if 99% of the time one works better than the other, there's provably a corner case where it will fail but the other does not.
I should have been more explicit about the assumptions I make about an allocator:
1. If malloc() is called when there exists a contiguous block of free memory with size >= the argument to malloc(), the call will succeed. I think you'll agree that this is reasonable.
2. Bookkeeping (needed at least for tracking the free list, plus any indexes on top) uses the same amount of malloc()able memory in each allocator. I.e., if malloc(x) at some point in time reduces the number of bytes that are available to future malloc() calls by y >= x bytes under allocator 1, it must reduce the number of bytes available to future malloc() calls by y under allocator 2 if called at the same point in time as well. This may not hold exactly in practice, but it's a very good approximation -- it's possible to store the free list "for free" by using the first few bytes as next and prev pointers in a doubly linked list.
To head another possible objection off at the pass: If the OS allocates lazily (i.e., it doesn't commit backing store at malloc() time, instead waiting till there is an actual access to the page, like Linux does), this doesn't change anything: Address space (even 64-bit address space) is still finite, and that is still being allocated eagerly. In practice, you could craft the differentially crashing program to crash much faster if you call memset() immediately on every freshly malloc()ed block to render this lazy commit ineffective -- then you would only need to exhaust the physical RAM + swap, rather than the complete 64-bit virtual address space.
Swapping out allocators won't cause some programs to crash and others to not crash unless your program is already using up all available RAM or your program has a bug and the allocation pattern happens to be more likely to trigger invalid memory access in a way that crashes.
This allocation-pattern idea is unlikely to show up in any real application except at the absolute limit where you're exhausting RAM and the OOM killer gets involved. Even then I think you're not going to see the allocator be much of a differentiating factor.
Funny, cause the situations where I've had to replace glibc is always that it is a long running server that allocates often. Glibc: Ballooning memory, eventually crash. jemalloc: Stable as a rock.
> I experimented with different allocators while developing a video editor that caches frames, testing with 4K video. At 32 MB per frame and 60 fps, that's almost 2 GB per second per track. You quickly hit allocator limitations, and realise that at least the vanilla glibc allocator offers the best long-term stability. But for short-running benchmarks it's the slowest.
I also work with large (8K) video frames [1]. If you're talking about the frames themselves, 60 allocations per second is nothing. In the case of glibc, it's slow for just one reason: each allocation exceeds DEFAULT_MMAP_THRESHOLD_MAX (= 32 MiB on 64-bit platforms), so (as documented in the mallopt manpage), you can not convince glibc to cache it. It directly requests the memory from the kernel with mmap and returns it with munmap each time. Those system calls are a little slow, and faulting in each page of memory on first touch is in my case slow enough that it's impossible to meet my performance goals.
The solution is really simple: use your own freelist (on top of the general-purpose allocator or mmap, whatever) for just the video frames. It's a really steady number of allocations that are exactly the same size, so this works fine.
[1] in UYVY format, this is slightly under 64 MiB; in I420 format, this is slightly under 48 MiB.
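A minimal sketch of that kind of per-frame freelist (hypothetical names; a real implementation would add a mutex for multi-threaded use and pre-fault the pages):

    /* Fixed-size frame pool sitting in front of malloc/free, so large video
       frames get recycled instead of going through mmap/munmap every time. */
    #include <stdlib.h>

    #define FRAME_SIZE (64UL * 1024 * 1024)  /* e.g. ~64 MiB for an 8K UYVY frame */
    #define POOL_MAX 8                       /* frames kept around for reuse */

    static void *pool[POOL_MAX];
    static int pool_count = 0;

    void *frame_alloc(void) {
        if (pool_count > 0)
            return pool[--pool_count];       /* reuse a previously freed frame */
        return malloc(FRAME_SIZE);           /* fall back to the general allocator */
    }

    void frame_free(void *frame) {
        if (pool_count < POOL_MAX)
            pool[pool_count++] = frame;      /* keep it around for the next frame */
        else
            free(frame);                     /* pool is full: really release it */
    }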
There's one other element I didn't mention in my previous comment, which is a thread handoff. It may be significant because it trashes any thread-specific arena and/or because it introduces a little bit of variability over a single malloc at a time.
For whatever reason the absolute rate on my test machine is much higher than in my actual program (my actual program does other things with a more complex threading setup, has multiple video streams, etc.) but you can see the same effect of hitting the mmap, munmap, and page fault paths that really need not ever be exercised after program start.
In my actual (Rust-based) program, adding like 20 lines of code for the pooling was a totally satisfactory solution and took me less time than switching general-purpose allocator, so I didn't try others. (Also, my program supports aarch64 and iirc the vendored jemalloc in the tikv-jemallocator crate doesn't compile cleanly there.)
Sorry, I'm struggling to make sense of this comment. I don't know C or C compilers very well at all, but I read the full gist and felt I learned a bunch of stuff and got a lot of value from it.
But then when I read this top comment, it makes me concerned I've completely misunderstood the article. From the tone of this comment, I assume that I shouldn't ever do what's talked about in this gist and it's a terrible suggestion that overlooks all these complexities that you understand and have referenced with rhetorical-looking questions.
Any chance you could help me understand if the original gist is good, makes any legitimate points, or has any value at all? Because I thought it did until I saw this was the top comment, and it made me realise I'm not smart enough to be able to tell. You sound like you're smart enough to tell, and you're telling me only bad things.
I'll have a go at explaining: The process described in the article isn't a simple recipe that you can apply to any program to achieve similar results.
`jq` is a command-line program that fires up to do one job, and then dies. For such a program, the only property we really want to optimise is execution speed. We don't care about memory leaks, or how much memory the process uses (within reason). `jq` could probably avoid freeing memory completely, and that would be fine. So using a super-stupid allocator is a big win for `jq`. You could probably write your own and make it run even faster.
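To illustrate (this is not what the gist does), the degenerate version of that idea is a bump allocator whose free() is a no-op; the OS reclaims everything when the process exits. A real drop-in replacement would also need calloc/realloc and thread safety:

    /* Toy arena / "never free" allocator for a run-once-and-exit tool. */
    #include <stddef.h>

    #define ARENA_SIZE (64UL * 1024 * 1024)

    static unsigned char arena[ARENA_SIZE];
    static size_t offset = 0;

    void *bump_alloc(size_t size) {
        size = (size + 15) & ~(size_t)15;   /* keep 16-byte alignment */
        if (size > ARENA_SIZE - offset)
            return NULL;                    /* out of arena space */
        void *p = arena + offset;
        offset += size;
        return p;
    }

    void bump_free(void *p) {
        (void)p;                            /* deliberately does nothing */
    }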
But for a program with different runtime characteristics, the results might be radically different. A long-lived server program might need to avoid memory bloat more than it needs to run fast. Or it might need to ensure stability more than speed or size. Or maybe speed does matter, but it's throughput speed rather than latency. Each of those cases need to be measured differently, and may respond better to different optimisation strategies.
The comment that confused you is just trying to speak a word of caution about applying the article's recipe in a simplistic way. In the real world, optimisation can be quite an involved job.
> The comment that confused you is just trying to speak a word of caution about applying the article's recipe in a simplistic way. In the real world, optimisation can be quite an involved job.
I think that's what confused and irritated me. There's a lot of value and learning in the gist - I've used jq regularly in my previous jobs, this is the real world, and it's valuable to many. But the top comment (at the time I responded) is largely trashing the submission rhetorically, based purely on the title.
I get that the gist won't make _everything_ faster: but I struggle to believe that any HN reader would genuinely believe that's either true, or a point that the author is trying to make. The literal first sentence of the submission clarifies the discussion is purely about JQ.
Anyone can read a submission, ignore any legitimate value in it, pick some cases the submission wasn't trying to address, and then use those cases to rhetorically talk it down. I'm struggling to understand why/how that's bubbling to the top in a place of intellectual curiosity like HN.
Edit: I should practice what I preach. Conversation and feedback which is purely cautionary or negative isn't the world that anyone really wants! Thanks for the response, I really appreciated it :) It was helpful in confirming my understanding that this submission does genuinely improve jq on Ubuntu. Caution is often beneficial and necessary, and I think the original comment I responded to could have made a better world with a single sentence confirming that this gist is actually valuable in the context it defines.
That’s why it’s a bad idea to use one allocator for everything in existence. It’s terrible that everyone pays the cost of thread safety even for single-threaded applications - or even multithreaded applications with disciplined resource management.
While I do agree with the general sentiment, I think the default should be to use the safer ones that can handle multi threaded usage. If someone wants to use a bad allocator like glibc that doesn't handle concurrency well, then they should certainly be free to switch.
So now you’re using std vector and it’s taking a lock for no reason, even though push_back on the same vector across threads isn’t thread safe to begin with.
Haphazard multithreading is not a sane default.
I understand a million decisions have been made so that we can’t go flip that switch back off, but we’ve got to learn these lessons for the future.
Whatever. Allocation is slow enough that I don't care about a noncontended lock. Make things work by default, and if you want to gain performance by not allowing multithreading then that should be possible and easy. But safety first, as in most cases it really doesn't matter. When it comes to mainstream general-purpose allocators it isn't really a tradeoff anyhow, as all of them are nominally threadsafe.
Even glibc claims to be multithreading safe even if it tends to not return or reuse all freed memory.
Write in a language that makes sense for the project. Then people tell you that you should have used this other language, for reasons.
Use a compression algo that makes sense for your data. Then people will tell you why you are stupid and should have used this other algo.
My most recent memory of this was needing to compress specific long json strings to fit in Dynamo. I exhaustively tested every popular algo, and Brotli came out far ahead. But that didn't stop every passerby from telling me that zlib is better.
> Gentoo linux is essentially made specifically for people like this, to be able to optimize one’s own linux rig for one’s specific usecase.
That's true but worth noting that "optimize" here doesn't necessarily refer to performance.
I've been using Gentoo for 20 years and performance was never the reason. Gentoo is great if you know how you want things to work. Gentoo helps you get there.
Reducing the dependency tree gets a bit more complicated once you consider that now you have to satisfy not only runtime dependencies for all packages but also build-time dependencies. There may be ways of cleaning that up after a build, but next time you want to emerge a new package you'll just end up having to re-build the build-time dependencies, so in practice you'll just end up leaving them there. There is an ability to emerge packages to a separate part of the filesystem tree (ROOT="/my/chroot" emerge bla), so that you have one build-time system act as a kind of incubator for a runtime system that gets to be minimal. But you'll end up encountering problems that most other Gentoo users wouldn't encounter, having to do with the separation between build-time dependencies and runtime dependencies not being correctly made in the recipes. Personally, I had been relying on this feature for roughly the last 10 years, but there has been steady deterioration there over the years and I eventually gave up late last year.
This is a good point. I've been using Gentoo since early 2004 (the dreaded Pentium 4 era, lol). Lately, I run into this with dev-lang/tcl only being needed to build dev-db/sqlite. I actually think it's pretty weird that software intended to be as widely used as sqlite, with as large a base of supporting devs, doesn't just make the extra effort to use a plain Makefile.
A long time ago, when I was using it, I preferred Gentoo because of its ergonomics and better exposure to the supply chain.
Slackware was very manual, and some bits drowned in its low-level, long command chains. Gentoo felt easy but highlighted dependencies, with the hard cost of compilation time attached.
Being a newb back then, I enjoyed the user-friendliness combined with access to the machinery beneath.
The satisfaction of a 1 s boot-time speedup, the result of 48 h+ of compilation, was unparalleled, too ;)
Never seen the HN version of the 'install gentoo' meme before, more sophisticated definitely.
> The goal of Gentoo is to have an operating system that builds all programs from source, instead of having pre-built binary packages. While this does allow for advanced speed and customizability, it means that even the most basic components such as the kernel must be compiled from source. It is known through out the Linux community as being a very complex operating system because of its daunting install process. The default Gentoo install boots straight to a command prompt, from which the user must manually partition the disk, download a package known as a "Stage 3 tarball", extract it, and build the system up by manually installing packages. New or inexperienced users will often not know what to do when they boot in to the installer to find there is no graphical display. Members of /g/ will often exaggerate the values of Gentoo, trying to trick new users in to attempting to install it.
Where does that blurb come from, chatgpt? I don't think it's true anymore, last time I checked I think Gentoo had a "normal" liveCD installation for the base system, which you could then recompile on your own if wanted.
GRP (Gentoo packages) existed at least 20 years ago, from my memory, as that's the last time I really used it in anger. I remember packages being available and not having to rice everything, for sure.
I had had Gentoo continuously in use since 2003, and only very recently moved off of it (late 2024) when I tried Void Linux. On Void, buildability from source by end users is not a declared goal nor architectural feature, but you have a pretty decent chance of being able to make it work. You can expect one or two hiccups, but if you have decent all-round Linux experience, chances are you'll be able to jump into the build recipes, fix them, make everything work for what you need it to do, and contribute the fixes back upstream. This is what you get from a relentless focus on minimalism and avoiding overengineering of any kind. It's what I had been missing in Gentoo all those years. With Gentoo, I always ended up having to fiddle with use flags and package masks in ways that wouldn't be useful to other users. The build system is so complex that it had been just too difficult for me, over all these years, to properly learn it and learn to fix problems at the root cause level and contribute them upstream. Void should also be an ideal basis for when you don't want to build the entire system from source, but you just want to mix & match distro-provided binaries with packages you've built from source (possibly on the basis of a modified build recipe to better match your needs or your hardware).
I used Gentoo for a while, but the temptation to endlessly fiddle with everything always led me to eventually break the system. (It's not Gentoo's fault, it's mine.)
Afterwards I moved to ArchLinux, and that has been mostly fine for me.
If you are using a fairly standard processor, then Gentoo shouldn't give you that much of an advantage?
Gentoo lets you do all of the tweaks mentioned here within the system package manager, so you still get security updates for your tweaked build. You can also install Gentoo on top of another system via Gentoo Prefix for use as a userland packages manager:
These are the Arch packages built for x86-64-v2, x86-64-v3 and x86-64-v4, which are basically names for different sets of x86-64 extensions. Selecting the highest level supported by your processor should get you most of the way to -march=native, without the hassle of compiling it yourself.
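As a rough illustration of what those levels mean, GCC and Clang expose runtime feature checks you can map onto them (the grouping below is a simplification of the official x86-64-v2/v3/v4 feature lists):

    /* Rough check of which x86-64 microarchitecture level this CPU can run.
       The feature groups are a simplification of the v2/v3/v4 definitions. */
    #include <stdio.h>

    int main(void) {
        __builtin_cpu_init();
        int v2 = __builtin_cpu_supports("sse4.2") && __builtin_cpu_supports("popcnt");
        int v3 = v2 && __builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma")
                    && __builtin_cpu_supports("bmi2");
        int v4 = v3 && __builtin_cpu_supports("avx512f");

        printf("x86-64-v2: %s\n", v2 ? "yes" : "no");
        printf("x86-64-v3: %s\n", v3 ? "yes" : "no");
        printf("x86-64-v4: %s\n", v4 ? "yes" : "no");
        return 0;
    }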
I've broken Arch but never broken Gentoo. I think this is more due to the fact that I ran Arch first and then Gentoo, rather than any real difference between them.
Gentoo is more stable than Arch by default, though. It's not actually a bleeding edge distro, but you can choose to run it that way if you wish. Gentoo is about choice.
> I think this is more due to the fact that I ran Arch first and then Gentoo, rather than any real difference between them.
I can believe that.
> Gentoo is more stable than Arch by default, though. It's not actually a bleeding edge distro, but you can choose to run it that way if you wish. Gentoo is about choice.
I actually had way more trouble with stuff breaking with Ubuntu. That's because every six months, when I did the distro upgrade, lots of stuff broke at once and it was hard to do root cause analysis.
With a rolling distribution, it's usually only one thing breaking at a time.
The Zircon kernel does not support signals, so basic C is not going to work well.
"It is heavily inspired by Unix kernels, but differs greatly. For example, it does not support Unix-like signals, but incorporates event-driven programming and the observer pattern."
As long as there's Bash and Python support, Gentoo Prefix could work as well. Those are basically the only hard requirements for Gentoo, since Portage is written in Python and ebuilds are Bash scripts. The bigger issue would be that the Fuchsia kernel likely isn't POSIX-complete.
Kidding... honestly that was a pretty fun distribution to play around with ~20 years ago. The documentation was really good and it was a great way to learn how a lot of the pieces of a Linux distribution fit together.
I was never convinced that the performance difference was really noticeable, though.
Gentoo was the primary source of heating for my living quarters back in the early 2000s. My tower was highly constrained on memory, and I was on a relentless quest to pare out any modules or dependencies I wasn't actually using. Performance gains were primarily from being able to stay out of slow HDD swap space memory. I doubt there were any gains once amortizing the compilation times, but I ran my compile batches at night, and they kept me nice and warm.
Note that if you do this, then you will opt out of any security updates, not just for jq but also for its regular-expression parsing dependency, Oniguruma. For example, there was a security update for Oniguruma previously; if this sort of thing happens again, you'd be vulnerable, and jq is often used to parse untrusted JSON.
Indeed, there are many methods to have a custom build and still get security updates, including at least one method that is native to Ubuntu and doesn’t need any external tooling. However my warning refers to the method presented in the article, where this isn’t the case.
But isn't there still the kernel of an idea here for a package management system that intelligently decides to build based on platform? Seems like a lot of performance to leave on the table.
Rebuilding from scratch also takes longer than installing a prebuilt package. So while it might be worth it for a heavily used application, in general I doubt it.
Also I think in earlier days the argument to build was so you can optimize the application for the specific capabilities of your system like the supported SIMD instruction set or similar. I think nowadays that is much less of a factor. Instead it would probably be better to do things like that on a package or distribution level (i.e. have one binary distribution package prebuilt by the distribution for different CPU capabilities).
This is generally true but specifically false. The builds described in the gist are still linking Oniguruma dynamically. It is in another package, libonig5, that would be updated normally.
The gist uses the orig tarball only, so it skips the distro patch that selects the distro-packaged libonig over the "vendored" one. At least that's how it appears to me. I only skimmed the situation.
Or do you see something deeper that ensures that the distro libonig is actually the one that gets used?
I'm curious how applicable these are, in general? Feels like pointing out that using interior doors in your house misses out on the security afforded from a vault door. Not wrong, but there is also a reason every door in a bank is not a vault door.
That is, I don't want to devalue the CVE system; but it is also undeniable that there are major differences in impact between findings?
Sure, but jq is very much a "front door" in your analogy. You'd have to look at each individual CVE to assess the risk for your specific case, but for jq, claimed security vulnerabilities are worth paying attention to.
This is certainly true. Also, by replacing the allocator and changing compiler flags, you're possibly immunizing yourself from attacks that rely on some specific memory layout.
By hardwiring the allocator you may end up with binaries that load two different allocators. It is great fun to debug a program that is using jemalloc's free to release memory allocated by glibc. Unless you know what you are doing, it is better to leave it as is.
UBSAN is usually a debug build only thing. You can run it in production for some added safety, but it comes at a performance cost and theoretically, if you test all execution paths on a debug build and fix all complaints, there should be no benefit to running it in production.
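For context, a tiny example of the kind of thing UBSan catches (assuming GCC or Clang; built with -fsanitize=undefined, the overflow below is reported at runtime instead of silently wrapping):

    /* ubsan_demo.c -- build with: cc -fsanitize=undefined ubsan_demo.c */
    #include <limits.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        (void)argv;
        int x = INT_MAX;
        x += argc;          /* argc >= 1, so this signed overflow is undefined behavior */
        printf("%d\n", x);  /* UBSan prints a runtime error report and, by default, continues */
        return 0;
    }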
I think it's time for the C/C++ communities to consider a mindset shift and pivot to having almost all protectors, canaries, sanitizers, and assertions (e.g. via _GLIBCXX_ASSERTIONS) on by default and recommended for use in release builds in production. The opposite (i.e., the current state of affairs) should be discouraged and begrudgingly accepted in a select few cases.
https://www.youtube.com/watch?v=gG4BJ23BFBE is a presentation that best represents my view on the kind of mindset that's long overdue to become the new norm in our industry.
I do not think things like the time command need to be compiled with such things. It is pointless, but your suggestion here is to do it anyway. Why bother?
Assertions in release builds are a bad idea since they can be fairly expensive. It is a better idea to have a different variety of assertion like the verify statements that OpenZFS uses, which are assertions that run even in release builds. They are used in situations where it is extremely important for an assertion to be done at runtime, without the performance overhead of the less important assertions that are in performance critical paths.
Why would I want potentially undefined behaviour in 'time'? I expect it to crash anytime it's about to enter UB. Sure, you may want to minimize such statements between the start/stop of the timer, but I expect any processing of stdout/stderr of the child process to be UB-proofed as much as possible.
I think it's a philosophical difference of opinions and it's one of the things that drive Rust, Go, C# etc. ahead - not merely language ergonomics (I hope Zig ends up as the language that replaces C). The society at large is mostly happy to take a 1-3% perf hit to get rid of buffer overflows and other UB-inducing errors.
But I agree with you on not having "expensive" asserts in releases.
ASLR is not security through obscurity, though. It forces the attacker to get a pointer leak before doing almost anything (even arbitrary-read and arbitrary-write primitives are useless without a leak under ASLR). As someone with a bit of experience in exploit dev, it makes a world of difference and is one of the most influential hardenings, next to maybe stack cookies and W^X.
I'm genuinely curious what was so undesirable about this sibling comment that it was removed:
"ASLR obscures the memory layout. That is security by obscurity by definition. People thought this was okay if the entropy was high enough, but then the ASLR⊕Cache attack was published and now its usefulness is questionable."
Usually when a comment is removed, it's pretty obvious why, but in this case I'm really not seeing it at all. I read up (briefly) on the mentioned attack and can confirm that the claims made in the above comment are at the very least plausible sounding. I checked other comments from that user and don't see any other recent ones that were removed, so it doesn't seem to be a user-specific thing.
I realize this is completely off-topic, but I'd really like to understand why it was removed. Perhaps it was removed by mistake?
Some people use the "flag" button as a "disagree" button or even a "fuck this guy" button. Unfortunately, constructive but unpopular comments get flagged to death on HN all the time.
I had thought that flagging was basically a request for a mod to have a look at something. But based on this case I now suspect that it's possible for a comment to be removed without a mod ever looking at it if enough people flag it.
My point was more that, at least in this case, it looks like a post was hidden without any moderator intervention.
If this is indeed what happened, it seems like a bad thing that it's even possible. Since many, perhaps most people probably don't have showdead enabled, it means that the 'flag' option is effectively a mega-downvote.
I believe most people see security through obscurity as an attempt to hide an insecurity.
ASLR/KASLR intends to make attackers' lives harder by having non-consistent offsets for known data structures. It's not obscuring a security flaw; instead it reduces an attack's 'single run' effectiveness.
The ASLR attack that I believe is being referenced is specific to abuse within the browser, running within a single process. This single attack vector does not mean that KASLR globally is not effective.
Your quote has some choice words, but it's contextually poor.
That attack does not require a web browser. The web browser being able to do it showed it was higher severity than you would have thought if the proof of concept had only been in C, since web browsers run untrusted code all of the time.
The 'attack' there does require you to be able to run code and test within a single process with a single randomized address space, which is the exact vector that the web browser provides.
Most times in C, each fork() (rather than each thread) gets a different address space, so it's actually less severe than you think.
The kernel address space is the same regardless of how many fork() calls have been done. I would assume the exploitation path for a worst case scenario would be involve chaining exploits to do: AnC on userspace, JavaScript engine injection to native code, sandbox escape, AnC on kernel space, kernel native code injection. That would give complete control over a user’s machine just by having the user visit a web page.
I am not sure why anyone would attempt what you described, for the exact reason you stated. It certainly is not what I had in mind.
It's been a few days and a thousand kilometers since I read the paper; I thought it referenced userspace. How is it able to infer kernel addresses that are not mapped in that process?
I assume people downvoted it because “ASLR obscures the memory layout. That is security by obscurity by definition” is just wrong (correct description here: https://news.ycombinator.com/item?id=43408039). It does say [flagged] too, though, so maybe that’s not the whole story…?
No, that other definition is the incorrect one. Security by obscurity does not require that the attacker is ignorant of the fact you're using it. Say I have an IPv6 network with no firewall, simply relying on the difficulty of scanning the address space. I think that people would agree that I'm using security by obscurity, even if the attacker somehow found out I was doing this. The correct definition is simply "using obscurity as a security defense mechanism", nothing more.
No, I would not agree that you would be using security by obscurity in that example. Not all security that happens to be weak or fragile and involves secret information somewhere is security by obscurity – it’s specifically the security measure that has to be secret. Of course, there’s not a hard line dividing secret information between categories like “key material” and “security measure”, but I would consider ASLR closer to the former side than the latter and it’s certainly not security by obscurity “by definition” (aside: the rampant misuse of that phrase is my pet peeve).
> The correct definition is simply "using obscurity as a security defense mechanism", nothing more.
This is just restating the term in more words without defining the core concept in context (“obscurity”).
I'm inclined to agree and would like to point out that if you take a hardline stance that any reliance on the attacker not knowing something makes it security by obscurity then things like keys become security by obscurity. That's obviously not a useful end result so that can't be the correct definition.
It's useful to ask what the point being conveyed by the phrase is. Typically (at least as I've encountered it) it's that you are relying on secrecy of your internal processes. The implication is usually that your processes are not actually secure - that as soon as an attacker learns how you do things the house of cards will immediately collapse.
What is missing from these two representations is the ability for something to become trivially bypassable once you know the trick to it. AnC is roughly that for ASLR.
I'd argue that AnC is a side channel attack. If I can obtain key material via a side channel that doesn't (at least in the general case) suddenly change the category of the corresponding algorithm.
Also IIUC to perform AnC you need to already have arbitrary code execution. That's a pretty big caveat for an attacker.
You are not wrong, but how big of a caveat it is varies. On a client system, it is an incredibly low bar given client side scripting in web browsers (and end users’ tendency to execute random binaries they find on the internet). On a server system, it is incredibly unlikely.
I think the middle ground is to call the effectiveness of ASLR questionable. It is no longer the gold standard of mitigations that it was 10 years ago.
ASLR is not purely security through obscurity because it is based on a solid security principle: increasing the difficulty of an attack by introducing randomness. It doesn't solely rely on the secrecy of the implementation but rather the unpredictability of memory addresses.
Think of it this way - if I guess the ASLR address once, a restart of the process renders that knowledge irrelevant implicitly. If I get your IPv6 address once, you’re going to have to redo your network topology to rotate your secret IP. That’s the distinction from ASLR.
I don't like that example because the damage caused by, and the difficulty of recovering from, a secret leaking is not what determines the classification. There exist keys that, if leaked, would be very time-consuming to recover from. That doesn't make them security by obscurity.
I think the key feature of the IPv6 address example is that you need to expose the address in order to communicate. The entire security model relies on the attacker not having observed legitimate communications. As soon as an attacker witnesses your system operating as intended the entire thing falls apart.
Another way to phrase it is that the security depends on the secrecy of the implementation, as opposed to the secrecy of one or more inputs.
You don’t necessarily need to expose the IPv6 address to untrusted parties though in which case it is indeed quite similar to ASLR in that data leakage of some kind is necessary. I think the main distinguishing factor is that ASLR by design treats the base address as a secret and guards it as such whereas that’s not a mode the IPv6 address can have because by its nature it’s assumed to be something public.
Huh. The IPv6 example is much more confusing than I initially thought. At this point I am entirely unclear as to whether it is actually an example of security through obscurity, regardless of whatever else it might be (a very bad idea to rely on, for one). Rather ironic, given that the poster whose claims I was disputing provided it as an example of something that would be universally recognized as such.
I think it’s security through obscurity because in ASLR the randomized base address is a protected secret key material whereas in the ipv6 case it’s unprotected key material (eg every hop between two communicating parties sees the secret). It’s close though which is why IPv6 mapping efforts are much more heuristics based than ipv4 which you can just brute force (along with port #) quickly these days.
I'm finding this semantic rabbit hole surprisingly amusing.
The problem with that line of reasoning is that it implies that data handling practices can determine whether or not a given scheme is security through obscurity. But that doesn't fit the prototypical example where someone uses a super secret and utterly broken home rolled "encryption" algorithm. Nor does it fit the example of someone being careless with the key material for a well established algorithm.
The key defining characteristic of that example is that the security hinges on the secrecy of the blueprints themselves.
I think a case can also be made for a slightly more literal interpretation of the term where security depends on part of the design being different from the mainstream. For example running a niche OS making your systems less statistically likely to be targeted in the first place. In that case the secrecy of the blueprints no longer matters - it's the societal scale analogue of the former example.
I think the IPv6 example hinges on the semantic question of whether a network address is considered part of the blueprint or part of the input. In the ASLR analogue, the corresponding question is whether a function pointer is part of the blueprint or part of the input.
> The problem with that line of reasoning is that it implies that data handling practices can determine whether or not a given scheme is security through obscurity
Necessary but not sufficient condition. For example, if I’m transmitting secrets across the wire in plain text that’s clearly security through obscurity even if you’re relying on an otherwise secure algorithm. Security is a holistic practice and you can’t ignore secrets management separate from the algorithm blueprint (which itself is also a necessary but not sufficient condition).
Consider that in the ASLR analogy dealing in function pointers is dealing in plaintext.
I think the semantics are being confused due to an issue of recursively larger boundaries.
Consider the system as designed versus the full system as used in a particular instance, including all participants. The latter can also be "the system as designed" if you zoom out by a level and examine the usage of the original system somewhere in the wild.
In the latter case, poor secrets management being codified in the design could in some cases be security through obscurity. For example, transmitting in plaintext somewhere the attacker can observe. At that point it's part of the blueprint and the definition I referred to holds. But that blueprint is for the larger system, not the smaller one, and has its own threat model. In the example, it's important that the attacker is expected to be capable of observing the transmission channel.
In the former case, secrets management (ie managing user input) is beyond the scope of the system design.
If you're building the small system and you intend to keep the encryption algorithm secret, we can safely say that in all possible cases you will be engaging in security through obscurity. The threat model is that the attacker has gained access to the ciphertext; obscuring the algorithm only inflicts additional cost on them the first time they attack a message secured by this particular system.
It's not obvious to me that the same can be said of the IPv6 address example. Flippantly, we can say that the physical security of the network is beyond the scope of our address randomization scheme. Less flippantly, we can observe that there are many realistic threat models where the attacker is not expected to be able to snoop any of the network hops. Then as long as addresses aren't permanent it's not a one time up front cost to learn a fixed procedure.
Function pointer addresses are not meant to be shared - they hold 0 semantic meaning or utility outside a process boundary (modulo kernel). IPv6 addresses are meant to be shared and have semantic meaning and utility at a very porous layer. Pretending like there’s no distinction between those two cases is why it seems like ASLR is security through obscurity when in fact it isn’t. Of course, if your program is trivially leaking addresses outside your program boundary, then ASLR degrades to a form of security through obscurity.
I'm not pretending that there's no distinction. I'm explicitly questioning the extent to which it exists as well as the relevance of drawing such a distinction in the stated context.
> Function pointer addresses are not meant to be shared
Actually I'm pretty sure that's their entire purpose.
> they hold 0 semantic meaning or utility outside a process boundary (modulo kernel).
Sure, but ASLR is meant to defend against an attacker acting within the process boundary so I don't see the relevance.
How the system built by the programmer functions in the face of an adversary is what's relevant (at least it seems to me). Why should the intent of the manufacturer necessarily have a bearing on how I use the tool? I cannot accept that as a determining factor of whether something qualifies as security by obscurity.
If the expectation is that an attacker is unable to snoop any of the relevant network hops then why does it matter that the address is embedded in plaintext in the packets? I don't think it's enough to say "it was meant to be public". The traffic on (for example) my wired LAN is certainly not public. If I'm not designing a system to defend against adversaries on my LAN then why should plaintext on my LAN be relevant to the analysis of the thing I produced?
Conversely, if I'm designing a system to defend against an adversary that has physical access to the memory bus on my motherboard then it matters not at all whether the manufacturer of the board intended for someone to attach probes to the traces.
I think that's why the threat model matters. I consider my SSH keys secure as long as they don't leave the local machine in plaintext form. However if the scenario changes to become "the adversary has arbitrary read access to your RAM" then that's obviously not going to work anymore.
> The correct definition is simply "using obscurity as a security defense mechanism", nothing more.
Also stated as "security happens in layers", and often obscurity is a very good layer for keeping most of the script kiddies away and keeping the logs clean.
My personal favorite example is using a non-default SSH port. Even if you keep it under 1024, so it's still on a root-controlled port, you'll cut down the attacks by an order of magnitude or two. It's not going to keep the NSA or MSS out, but it's still effective in pushing away the common script kiddies. You could even get creative and play with port knocking - that keeps under-1024 ports logs clean.
Except I do know what security by obscurity is and you are out of date on the subject. When you have attacks that make ASLR useless, then it is security by obscurity. Your thinking would have been correct 10 years ago. It is no longer correct today. The middle ground is to say that the benefits of ASLR are questionable, like I said in the comment you downvoted.
ASLR obscures the memory layout. That is security by obscurity by definition. People thought this was okay if the entropy was high enough, but then the ASLR⊕Cache attack was published and now its usefulness is questionable.
ASLR is by definition security through obscurity. That doesn't make it useless, as there's nothing wrong with using obscurity as one layer of defenses. But that doesn't change what it fundamentally is: obscuring information so that an attacker has to work harder.
Is having a secret password security by obscurity? What about a private key?
Security by obscurity is about the bad practice of thinking that obscuring your mechanisms and implementations of security increases your security. It's about people that think that by using their nephew's own super secret unpublished encryption they will be more secure than by using hardened standard encryption libraries.
Security through obscurity is when you run your sshd server on port 1337 instead of 22 without actually securing the server settings down, because you don’t think the hackers know how to portscan that high. Everyone runs on 22, but you obscurely run it elsewhere. “Nobody will think to look.”
ASLR is nothing like that. It’s not that nobody thinks to look, it’s that they have no stable gadgets to jump to. The only way to get around that is to leak the mapping or work with the handful of gadgets that are stable. It’s analogous to shuffling a deck of cards before and after every hand to protect against card counters. Entire cities in barren deserts have been built on the real mathematical win that comes from that. It’s real.
With attacks such as AnC, your logic fails. They can figure out the locations and get plenty of stable gadgets.
Any shuffling of a deck of cards by Alice is pointless if Bob can inspect the deck after she shuffles them. It makes ASLR not very different from changing your sshd port. In both cases, this describes the security:
okay, sure, ASLR can be defeated by hardware leaks. The first rowhammer papers were over ten years ago, it's very old news. It's totally irrelevant to this thread. The fact that there exist designs with hardware flaws which make them incapable of hosting a secure PRNG has no relevance to a discussion about the merits, or lack thereof, of PRNG-based security measures. The systems you're referring to don't have secure PRNGs.
Words have meaning, god damn it! ASLR is not security through obscurity.
Edit: I was operating under the assumption that “AnC” was some new hotness, but no, this is the same stuff that’s always been around, timing attacks on the caches. And there’s still the same solution as there was back then: you wipe the caches out so your adversaries have no opportunity to measure the latencies. It’s what they always should have done on consumer devices running untrusted code.
ASLR is technically a form of security by obscurity. The obscurity here being the memory layout. The reason nobody treated it that way was the high entropy that ASLR had on 64-bit, but the ASLR⊕Cache attack has undermined that significantly. You really do not want ASLR to be what determines whether an attacker takes control of your machine if you care about having a secure system.
The defining characteristic of security through obscurity is that the effectiveness of the security measure depends on the attacker not knowing about the measure at all. That description doesn’t apply to ASLR.
It produces a randomization either at compile time or run time, and the randomization is the security measure, which is obscured based on the idea that nobody can figure it out with ease. It is a poor security measure given the AnC attack that I mentioned. ASLR randomization is effectively this when such attacks are applicable:
You are confusing randomization, a legitimate security mechanism, with security by obscurity. ASLR is not security by obscurity.
Please spend the time on understanding the terminology rather than regurgitating buzz words.
I understand the terminology. I even took a graduate course on the subject. I stand by what I wrote. Better yet, this describes ASLR when the AnC attack applies:
The normal way is to use the dpkg tooling: patch the source, use dch to bump the version with a .1 suffix or something similar so that the OS version always takes precedence, and then rebuild.
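For anyone who hasn't done it, that flow looks roughly like this on Debian/Ubuntu (a sketch, assuming deb-src entries are enabled; whether DEB_CFLAGS_APPEND actually takes effect depends on the package using dpkg-buildflags):

  sudo apt build-dep jq
  apt source jq
  cd jq-*/
  dch --local=+rebuild "rebuilt with custom flags"   # adjust the version scheme to taste
  DEB_CFLAGS_APPEND="-O2 -march=native" dpkg-buildpackage -us -uc -b
  sudo dpkg -i ../jq_*.deb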
It's been a while since I had to deal with this kind of thing, but my memory is that as soon as you go beyond the flags that the upstream developers use (just to be clear, I mean the upstream developers, not the distro packagers), you're buying yourself weird bugs and a whole lot of indifference if they occur.
I haven't used a non-libc malloc before but I suspect the same applies.
Two opposing things are both true at the same time.
If you as an individual avoid being at all different, then you are in the most company and will likely have the most success in the short term.
But it's also true that if we all do that then that leads to monoculture and monoculture is fragile and bad.
It's only because of people building code in different contexts (different platforms, compilers, options, libraries, etc...) that code ever becomes at all robust.
A bug that you mostly don't trigger because your platform or build flags just happens to walk just a hair left of the hole in the ground, was still a bug and the code is still better for discovering and fixing it.
We as individuals all benefit from code being generally robust instead of generally fragile.
I've been building my own emacs for a long time, and have yet to hit any weird bugs. I thought that as long as you avoid any unsafe optimizations, you should be fine? Granted, I also thought that -march=native was the main boost that I was seeing. This post indicates that is not necessarily the case.
I also suspect that any application using floats is more likely to have rough edges?
Complex software usually has some undefined behavior lurking that at higher or even just different optimization levels can trigger the compiler to do unexpected things to the code. It happens all the time in my line of work. If there's an extensive test suite you can run to verify that it still works mostly as expected then it's easier.
This is one where I suspect we don't disagree. But "all the time" can have a very different feel between people.
It also used to happen that just changing processors was likely to find some problems in the code. I have no doubt that still happens, but I'd also expect it has reduced.
Some of this has to be a convergence on far fewer compilers than we used to encounter. I know there are still many c compilers. Seems there are only two common ones, though. Embedded, of course, is a whole other bag of worms.
I thought touching the math optimizations directly was in the "unsafe" bucket. Really the only optimization I was aiming for was -march=native. That and the features like native compilation that have made it to the release.
I do think I saw improvements. But I never got numbers, so I'm assuming most of my feel was wishful thinking. Reality is a modern computer is hella fast for something like emacs.
I did see compilation mode improve when I trimmed down the regexes it watches for to only the ones I knew were relevant for me. That said, I think I've stopped doing that; so maybe that is a lot better?
I've turned on fastmath in python numba compiler while thinking "of course i want faster math, duh". Took me a while to find out it was a cause of many "fun" subtle bugs. Never touching that stuff again.
On the other hand, if your optimization helps consistently across platforms, you could convince upstream developers to implement it directly. (Not necessarily across all platforms – a sizable performance gain on just a single arch might still be enough to tweak configuration for that particular build).
It's been awhile since I looked into this, but it's not necessarily an easy change. glibc malloc has debugging APIs; a distro can't easily replace it without either emulating the API or patching programs that use it.
No need to even patch it out. It's relatively easy to change the default, rebuild the world (distros have a flow for this), and restore glibc for the tiny, tiny handful of individual programs that actually rely on glibc debugging APIs.
From an end-developer perspective: I have no particular familiarity with mimalloc, but I know jemalloc has pretty extensive debugging functionality (not API compatible with glibc malloc, of course).
I'd be curious how the performance compares to this Rust jq clone:
cargo install --locked jaq
(you might also be able to add RUSTFLAGS="-C target-cpu=native" to enable optimizations for your specific CPU family)
"cargo install" is an underrated feature of Rust for exactly the kind of use case described in the article. Because it builds the tools from source, you can opt into platform-specific features/instructions that often aren't included in binaries built for compatibility with older CPUs. And no need to clone the repo or figure out how to build it; you get that for free.
jaq[1] and yq[2] are my go-to options anytime I'm using jq and need a quick and easy performance boost.
As a bonus that people might not be aware of, in the cases where you do want to use the repo directly (either because there isn't a published package or maybe you want the latest commit that hasn't been released), `cargo install` also has a `--git` flag that lets you specify a URL to a repo. I've used this a number of times in the past, especially as an easy way for me to quickly install personal stuff that I throw together and push to a repo without needing to put together any sort of release process or manually copy around binaries to personal machines and keep track of the exact commits I've used to build them.
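e.g. something like this (the repo URL is purely a placeholder):

  cargo install --locked --git https://github.com/yourname/yourtool
  # pin a branch, tag or commit if you want it reproducible
  cargo install --locked --git https://github.com/yourname/yourtool --tag v0.3.0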
Misleading title: the 90% is measured relative to the faster time. It's about 45% faster.
It's actually a little bit interesting, if you are interested in how we use language. You could argue that you now get 90% more work done in the same amount of time, and that would align with other 'speed' units that we commonly use (miles per hour, words per minute, bits per second). However, the convention in computer performance is to measure time for a fixed amount of work. I would guess that this is because generally we have a fixed amount of work and what might vary is how long we wait for it (and that is absolutely true in the case of this blog post), so we put time in the numerator.
It's a very interesting post and very well done, but it's not 90% faster.
I feel like using units of rate (90% faster) and not units of time makes more sense here.
Plus, if you were using units of time, you wouldn’t use the word “faster.” “Takes 45% less time” and “45% faster” are very different assertions, but they both have meaning, both in programming and outside it.
It comes down to convention. When talking about proportional differences, we can fix the unitary example to be the smaller, larger, earlier, or later, object or subject.
I think, generally, we fix on the earlier when talking about the change over time of a characteristic. "This stock went up 100%, this stock went down 50%". In both cases it's the earlier measurement that is taken as the unit. That makes this a 45% reduction in time to do the work, and that's actually what they measured.
When talking about comparisons between two things that aren't time dependent it depends on if we talk in multiples or percents I think. A twin bed is half as big as a king bed. A king bed is twice as big as a twin bed. Both are idiomatic. A king bed is 100% bigger than a twin bed. Yes, you could talk like this. A twin bed is 100% smaller than a king bed. Right away you say wait, a twin bed isn't 0 size! Because we don't talk in terms of the smaller thing when talking about decreasing percents, only increasing. A twin bed is 50% smaller than a king bed (iffy). A twin bed is 50% as big as a king bed. There, that's idiomatic again.
Great read. I hadn't even considered that people might interpret "90% faster" as "10 times as fast", i.e., it will take 100-90=10% of the original time. It seems like a completely incorrect interpretation to me, but obviously there are people who read it with this understanding. Huh.
It is unfortunate (because then how do you interpret 100% faster?), but a common interpretation of X% faster is that it takes X% less time than before. One easy way to check is to have chatgpt give a numeric example for a statement like this; in all the cases I tried, it gave an example along the lines of: if the job took 100 seconds before, it takes 10 seconds now (a 10x speedup). I'm assuming that's representative of common usage, given how much data it's trained on.
After thinking about it more, I think I know why it seems misleading. They are talking about the change in the larger value as a % of the smaller value. That's what is misleading. They are saying it is reduced by 90% (of some other thing). When you say "We reduced THING by N%" it's just assumed that the N% is N% of THING, not OTHER THING.
I think you're right. I think you could say the package (as in the code) is 45% faster or that the package increases parsing rate by 90%. But mixing the two is confusing.
I think it's an interpretation issue. When you say it's 45% faster I interpret that as "the new package handles work at a rate of 145% when compared to the original, that is, 1.45 times as fast".
I would rephrase your comment as "the package takes 45% less time to process a given data set".
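A quick numeric sketch of the two readings, using a made-up 100-second baseline and the title's 90% figure:

  # rate reading: 90% faster = 1.9x the throughput, so the time drops to 1/1.9
  echo "scale=3; 100/1.9" | bc     # 100 s -> ~52.6 s, i.e. takes ~47% less time
  # misreading: "90% faster" taken as "takes 90% less time"
  echo "scale=1; 100*0.1" | bc     # 100 s -> 10 s, which would be a 10x speedup, not 1.9x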
Thanks for this, as an average HN user I didn't click the link and just skimmed the comments thinking how is it possible that they reduced the runtime to 10% of the original. This post clarifies that for me (now on to actually read the blog post).
Reading that such a simple change can get such a big speedup, my first thought is to let the authors of jq know. Maybe there's a caveat to be aware of, maybe they'll test it and end up making it faster for everyone. Useful to drop a quick note pretty much no matter the result, I think?
The article doesn't seem to even consider that option and I don't see any comment here mentioning this either. Am I missing something?
The benchmarking tool being used in this post accounts for that using multiple runs for each invocation, together with a warmup run that is not included in the result metric.
Yeah, after this post I sought out a couple of youtube videos explaining it. It's starting to make a bit more sense now, but the lightbulb hasn't gone off just yet. Appreciate the link.
I wish I had seen this earlier. My mental model was close but not quite there to the point where I needed to think too hard about how to solve problems.
I think jq's syntax is pretty sweet once you get used to it (or if you're already familiar with point-free style, as used in Haskell). The man page is subpar, however.
AFAIK ptmalloc (on which glibc malloc is based) was created decades ago, when both multi-threaded applications and multi-CPU systems were rare (at least in the Linux world), so multi-threaded performance didn't matter. Some improvements have been made in glibc since then, but I don't think it's possible to significantly improve glibc malloc without rewriting it more or less fully. At that point it would make more sense to import some existing malloc implementation.
And speaking of tradeoffs: the default number of arenas in glibc malloc is 8 times the number of CPUs, which is a terrible tradeoff - on many workloads it causes heap fragmentation and memory usage (RSS) many times higher than the allocated memory size, which is why it's common to find advice to set MALLOC_ARENA_MAX to 1 or 2. But such a high number of arenas probably lets glibc look less bad on synthetic benchmarks.
Jemalloc, tcmalloc and mimalloc were all created with a focus on multi-threaded applications from the beginning, and while they don't work better than glibc malloc for single-threaded applications, they don't work worse for that use case either. Probably the main disadvantage of using je/tc/mi malloc for a single-threaded app is the larger code size.
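Both mitigations are runtime-only, for anyone who wants to try them without rebuilding anything; the library path below is where Ubuntu's libjemalloc2 package puts it, adjust for your distro:

  # cap glibc's per-thread arenas
  MALLOC_ARENA_MAX=2 ./your-server
  # or swap the allocator entirely via preload
  LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./your-server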
Apologies, I did my first post on my phone, so was rather quick. I had found the repository and it is a good read. What I'd love to see is a comprehensive dive on the different options and a good general discussion on when each is superior.
I realize, of course, that this could just be overall optimal. Feels more likely that the allocation patterns of the application using it will impact which is the better allocator?
I wonder why my brain calculates "90% faster" as 10% of the time it used to take (i.e. t1 = 0.1*t0). The claim only makes sense if we calculate the recompiled application's running time as t1 = t0/1.90.
Using preload on the stock Ubuntu binary, to give it mimalloc, they got a 44% speedup.
By rebuilding the binary with different compiler options, but not changing malloc, they got a 20% speedup.
If we naively multiply these speedups, we get about 1.73: roughly 73% faster.
The way it gets to 1.9 is that when you speed up only the program but leave malloc the same, malloc accounts for a bigger share of the remaining run time.
When the faster malloc is applied to a program that is compiled better, it will make a better contribution than the 44% seen when the allocator was preloaded into the slower program.
To do the math right, we would have to look at how much time was saved with just the one change, and how much with the other. If we subtract both of those savings from the original, slow time, we should get a time that corresponds roughly to the 1.9x speedup.
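Plugging in the numbers quoted downthread (4.631 s stock vs a projected 2.431 s with both changes):

  # naive composition of the two individual speedups
  echo "scale=3; 1.44 * 1.20" | bc    # 1.728 -> ~73% faster
  # combined figure from the two times
  echo "scale=3; 4.631 / 2.431" | bc  # 1.904 -> the ~90% in the title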
Ubuntu is still building for CPUs with the x86-64-v1 feature set. On experimental images built for x86-64-v3 (roughly the feature set of 10 year old CPUs, a bit newer on low-power CPUs) they have observed 60% improvement on some benchmarks (and much less in some others). Some distros have switched to -v3, Ubuntu is still holding out to support older hardware. The author is compiling to the actual feature set of their specific CPU, which is nearly guaranteed to be even more than -v3.
Another thing the article adds is LTO, which in my experience also makes a huge difference: it makes your software a lot faster in certain cases, but also makes build times a lot worse. Spending a bit more time in the build process should be an easy call, but might be harder at the scale of a distro.
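For anyone experimenting locally, the flag spellings for recent GCC/Clang are roughly as below; whether a given package's build system actually honors CFLAGS/LDFLAGS varies:

  # target the x86-64-v3 baseline and enable LTO
  CFLAGS="-O2 -march=x86-64-v3 -flto" LDFLAGS="-flto" ./configure
  make -j"$(nproc)"
  # or, like the article, target exactly the machine you're building on
  CFLAGS="-O2 -march=native -flto" LDFLAGS="-flto" ./configure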
I was under the, evidently naive, understanding that there were some shenanigans done in the libc to pick optimal code paths based on the architecture. This would obviously have a startup cost on the first run, but presumably the dynamic linking would work to this advantage by having it in a fast state for most calls?
Not sure where I got that idea, though. :(. Will have to look into that later.
Even if libc does it, other packages wouldn’t. The compiler is what generates code and they generally don’t try to automatically generate for different instruction sets because they don’t know what code is what hotpath and worth optimizing for. You could probably try to build for multiple different CPU features at the top level and do a universal executable like Apple did for their CPU transitions but that’s a lot of expense to pay both in terms of compilation time AND in terms of size.
Debug symbols do not need to be paged in so shouldn’t make much/any difference. Compiling with -O3 can increase code size a lot due to inlining, which can be a little bad.
Right, my humor there was that I'm assuming people specifically did non-debug flags for the build as an optimization. Only for them to still fall short.
And understood a little on -O3 possibly increasing code size. I had thought that was more of a concern for tight environments than for most systems? Of course, I'd have assumed that -march=native would be more impactful, but the post indicates otherwise.
I said in a top level, but it seems the allocator makes the biggest impact for this application? Would be interesting to see which applications should use different allocators. Would be amazing to see a system where the more likely optimal allocator was default for different applications, based on their typical allocation patterns.
> Right, my humor there was that I'm assuming people specifically did non-debug flags for the build as an optimization.
It used to be the case that the presence of debug symbols would affect GCC code generation. Nowadays that should be fixed. I think it still affects the speed of compilation so if you're building the whole system from source you might want to avoid it.
Mario 64 gets flak for some files not being compiled with optimizations on. There is a YouTuber, whose name I forget at the moment, who has been doing optimization and improvements on it and constantly talks about how this is for the better due to the low instruction cache size and the console's slow speed. Compiling with optimizations explodes code size (loop unrolling, inlining, etc) to the point it's a net loss after factoring in the increased load on the IC.
The IC is the instruction cache. Bloated code needs more slow loads into the IC, and even if the performance per cache line is better, the 50x overhead of the added cache loads diminishes all the -O3 advantages.
Right, I meant the humor to be that folks do not use debug flags, with my naive presumption that they did this as an optimization, only for that not to really help.
It's not a 90% speedup, it's ~50% (still quite impressive).
The author seems to be confused, because the original jq is 1.9x slower than the optimized one.
That depends on how you're representing the speedup.
To travel 10 miles, at 60 MPH, takes 10 minutes. Make it 100% faster, at 120 MPH, and that time becomes 5 minutes. Travel just as far in 50% of the time. Or travel just as far 100% faster. The 90% speedup matches the reduction of the time it takes to nearly half (a 90% (projected) speedup, or about a 45% time reduction, as mathed out by kazinator `Projected speedup from both: 4.631/2.431 = 1.905`). Your claim that its closer to 50% is correct from a total time taken perspective, just coming at it from the other direction.
For kicks, I downloaded the file and compared duckdb with the spatial extension against jq. jq was about 2x as fast for interactive use, but if you had multiple queries to run, paying a small cost to create a duckdb database and then querying it would be vastly faster
I also was nerdsniped into trying this and found that after extracting the features array into a newline delimited json file, DuckDB finishes the example query in 500 ms (M1 Mac), querying the 1.3 GB json file directly with read_json!
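For reference, the general shape of that in recent DuckDB looks like this; the file name and query are placeholders, not the article's actual workload:

  # one-off query straight from the JSON file
  duckdb -c "SELECT count(*) FROM read_json('features.ndjson')"
  # or pay the load cost once and query a persistent table repeatedly
  duckdb features.db -c "CREATE TABLE features AS SELECT * FROM read_json('features.ndjson')"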
> What happens if you grab the jq source code from Launchpad, then configure and rebuild it with no flags at all? Even that is about 2-4% faster than the Ubuntu binary package.
I tried this with a 1.3GB file [1], and got Gemini to convert the jq query into a Go program, which takes 6.6s the second time it's run (to cache the geojson file). My laptop isn't particularly fast (i7-1250p), and Go's json handling isn't particularly fast either, so jq's time is not impressing me when run on a Ryzen 9 desktop processor on a 500MB file.
It's surprising how quick this kind of processing can be in go.
It is funny how being a corporate Rails programmer taught me this early in my career. Building Ruby from source was the only way to install the latest Ruby versions (using ruby-build), and installing Ruby from source meant its dependencies like openssl and zlib had to be installed from source too, to match the required versions.
And for a long time, using jemalloc was the only way to keep memory usage constant with multi-threaded Ruby programs like Puma and Sidekiq. This was achieved either through compiling Ruby with jemalloc or by modifying the LD_LIBRARY_PATH.
Some developers also reported 5% or so reduction in response times with jemalloc, iirc.
The problem with this approach, though, is that when a package has a lot of dependencies, like ImageMagick which relies on jpeg, png, ghostscript and a lot of other libraries, you have to take a trial-and-error approach until it succeeds. Fortunately, fixing the dependency errors is the easiest part; sometimes building Python from source would throw errors from headers which are impossible to understand. If you find a Stackoverflow solution then you are good, or you have to go down the rabbit hole, coming out either successful or empty-handed based on your level of expertise.
-march=native + mimalloc (or jemalloc) should be sufficient without causing significant undefined behavior like -O3 or most extra optimization related compiler arguments.
Nope, I'm not sure about it. I remember when I was using Gentoo about 10 years ago, this was the common reason given for using -O2 instead of -O3 in your build flags, and I'm just speaking from that memory.
But in general I have often found great benefit in many cases in obtaining packages from source and compiling them, or downloading official binaries, even if the repository does have the latest version in theory.
(shameless plug incoming): My only obstacle with such "Manually Installed [and/or] Source-Compiled" (MISC) packages was checking for updates. So I made this: https://sr.ht/~tpapastylianou/misc-updater/
Works great, and I've been slowly improving it to add 'upgrader' commands for some of the packages too.
It's not a full replacement for a package manager, nor is it meant to be. But it's made my ability to use MISC packages instead of repository ones a lot more streamlined and easier.
Reading this, I wonder a few things that seem distro quick wins:
1) Why don't distros replace the glibc allocator with one of the better ones?
2) Why don't distros allow for making server-specific builds for only a few packages? You don't have to pay the compilation cost for everything, just a list of 2 or 3 packages.
Is glibc being generalist still true? I might be wrong here, but I had the impression it was open and free but behind most others, and that e.g. tcmalloc is better on most criteria. Do we have benchmarks and comparisons for the generalist case?
How would you do the things mentioned in the article using Guix?
You could write your own custom package definitions, extending the default to change up compile flags and allocators, but then you need to do this for every single package (and maintain them all). I'm not sure Guix gives you much here, though maybe that's fine for one or two packages.
The most pain-free option I can think of is the --tune flag (which is similar to applying -march=native), but packages have to be defined as tunable for it to work (and not many are).
If you want it to be permanent, then you can use a guix home profile (that's a declarative configuration of your home directory) with a patch function in the package list there:
You can also write a 10 line guile script to automatically do it for all dependencies (I sometimes do--for example for emacs). That would cause a massive rebuild, though.
> The most pain-free option I can think of is the --tune flag (which is similar to applying -march=native), but packages have to be defined as tunable for it to work (and not many are).
We did it that way on purpose--from prior experience; otherwise you would get a combinatorial explosion of different package combinations.
If it does help for some package X, please email us a 2 line patch adding (tunable? . #t) to that one package.
If you do use --tune, it will tune everything that is tuneable in the dependency graph. But at least all dependents (not dependencies) will be just grafted--not be rebuilt.
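For reference, --tune is a regular package transformation option, so usage looks something like this (the package name is a placeholder; only packages carrying the tunable? property get rebuilt, everything else is grafted as described above):

  # tune for the CPU guix is running on
  guix install some-tunable-package --tune
  # or name a micro-architecture explicitly
  guix build some-tunable-package --tune=skylake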
In the late 1990s, the most common method for distributing a package was by providing its source code as a tarball. You essentially had to run ./configure and make on every package. I believe ./configure does some of what the OP did in a clever way, checking for locally installed alternatives, etc. It’s still pretty cool, though. I didn’t think a malloc replacement would significantly improve performance.
what makes me jealous is not the double performance, but the fact that he can handle 13000 files in 2 seconds. i am at my ${corporate} job now, and have about 40'000 json files stored locally on the windows laptop. when I need to load them, it takes minutes. i am still not sure if i read the files wrong or if NTFS is a pile of trash when it comes to reading many smallish files
NTFS is slow, especially when you operate on a lot of tiny files (nobody in the Windows world would do that, you'd always put your tiny data blobs into a bigger container file, e.g. asset files in games), but from my corporate experience, it's mostly the _multiple_ "endpoint security" solutions that bog file system performance down.
It's the reason I so far use a Mac at work, which has its own issues, and a lot of them.
NTFS is definitely very slow when it comes to reading many small files. Windows Defender or whatever antivirus you might be using can also further slow this down.
you were right. it didn't occur to me that antivirus would scan everything I open. Moved all files into an excluded directory, reading files is now almost 10 times faster. Thanks kind stranger !
because that'd be a several-hundred-megabyte json :) Anyway, with the suggestion in another comment, i moved the root directory into an antivirus-excluded location, which gave me almost a 10 times performance win
BOLT is only good for ~2% in this scenario, which is about what I would have expected from glancing at the profiles.
-Os is much slower, as I would have expected. Nothing in the perf data suggested high L1i miss rate. The .text of the "size optimized" program is, for some reason, 16x larger.
jaq is nice. It just loses, performance-wise, to the final step in this article, and it can't do the second mentioned workload (yet) so I didn't include it.
It bombs out on the jq program I use for the 2nd corpus that I mentioned. On further investigation, the show-stopping filter is strftime. In the jaq readme this is the only not-yet-checked box in the compatibility list, so perhaps some day soon.
If I had the time to fork jq, I would convert the relevant part to C++ so it doesn't spend literally all of its time dynamically allocating strings it is just about to discard.
GeoJSON is an interchange format that a government agency is using to provide this data, and it updates continuously, so at some point I will be interested in the efficiency of the ETL even if I start doing all my queries against an optimized format.
People will run into subtle bugs and weird behavior when starting to compile things themselves and replacing libraries I think. Not worth the effort for me, fun as an exercise, but would only do it if it's really (really) needed for performance.
> Can’t one recompile the same exact Ubuntu packages you already have on your system with optimal flags for your specific hardware?
Well, in principle yes. But you'd also need to figure out what you mean by 'optimal' flags? The flags might differ between different packages, and not all packages will be compatible with all the flags.
Even worse, in the example in the article they got most of their speedup out of moving to a different allocator, and that's not necessarily compatible with all packages (nor an improvement for all packages).
However, if you still want to do all these things, then Gentoo Linux is the distribution for you. It supports this tinkering directly and you'll have a much happier time than trying to bend Ubuntu to your will.
Since the test data is not available for us to experiment ourselves, I've asked the poster to try it with Boehm GC compiled as a transparent malloc() replacement.
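For anyone curious to try the same experiment, bdwgc can be configured to intercept malloc/free and then be preloaded; a sketch only - the configure flag is from memory, and the jq expression and paths are placeholders:

  # build Boehm GC so malloc/free are redirected to the collector
  ./configure --enable-redirect-malloc && make && sudo make install
  # preload it under the program being measured
  LD_PRELOAD=/usr/local/lib/libgc.so jq '.features | length' big.geojson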
O3 is a bad idea if the package is important. While that’s a dated opinion based on older experiences (Gentoo, various server software, etc), I suspect it’s still the easiest way to get exposed to compiler bugs.
The undefined behaviour is already in the C standard.
But you are right that enabling -O3 is generally going to make your compiler more aggressive about exploiting the undefined behaviour in the pursuit of speed.
In general, if you want to avoid being the guy to run into new and exciting bugs, you want to run with options that are as close as possible to what everyone else is using.
If you're going to offer unsupported FUD, I would prefer you direct the FUD at NDEBUG and the fear that it could have disabled a load-bearing assertion.
Just because it's turned off by default doesn't mean it isn't load bearing. If you didn't ever encounter an ASSERT with side effects that were crucial to proper functioning, you just haven't seen enough ASSERT statements.
Rule #1 of programming: If it can go wrong, it will.
That's before we even consider instances where assert was used to check a condition that should always be verified. In my personal projects I define a "require" macro that I use liberally.
If acknowledging my opinion was dated based on older experiences constitutes “unsupported FUD”, well, have fun with the life experiences that attitude produces for you.
I haven't kept up with Gentoo. How long does it take these days to bootstrap a functional desktop from source?
I remember when it was my main driver back in early '00s, running on a horribly underpowered PII at 266MHz. It would often be compiling 24/7 to keep up with the updates.
Isn't it against the rules to post a link like this where the content is walled behind needing an account to access?
I know most of us have accounts here, but when we're seeing something we want to read, it's not conducive to have to deal with roadblocks before we can see what is being talked about. Same goes for paywalls.
Chrome is optimized right up to its eyebrows. If you are using Google distributions of Chrome binaries, I wouldn't expect improvements from build tweaks.
Ah yes, this is nice but it sounds like Gentoo with extra steps. Probably also some extra benefits. But there is a reason we don’t all run Gentoo. Although given a large and seamless enough cache, maybe we would.
Waste minutes of your time just so anonymous people on the internet can read your comment :).
Not everything in life is an optimization problem. Fun little projects like this is what makes us human (and, arguably, are both educational and entertaining).
I see what you're saying but I am also talking about how computers originally saved humans time and that the OP seems to now be slaving away to save the CPU time. Doesn't that seem a little backwards?
If we're talking about micro-seconds of difference, the trade-off doesn't seem worth it. Even on a mass scale where this is somehow adopted, nobody is going to notice the difference. Maybe if this were in something like eCommerce or web browsing, where the lag translates to profit lost? Or perhaps game engines?
IDK, I just consider human time more precious than a slow package (one that already runs blazingly fast, on a CPU so fast that it barely matters).
Longer waiting times can decrease productivity a lot more than just the runtime increase. If a program takes long enough that the user switches context mentally, or goes to get another coffee or looks at a few posts on social media that now means far more wasted time. Humans are not robots and won’t just keep staring at the screen.
Also a lot of simple optimizations could save seconds or minutes in the lives of millions of people, and across multiple devices/programs that adds up. Microsoft once sped up the booting process of their consoles by 5 seconds, by simply making the boot animation shorter by 5 seconds. Those consoles were sold to millions of people and it took MS 8 years or so to make the fix. That’s many lifetimes wasted once you add it all up.
Today it's 2.2GB of JSON files, tomorrow it's tens of petabytes of archived data that you need to look through for work. A speedup of 2x is nothing to scoff at.
As a broader comment though, this is how we learn and discover things. Just because the specific outcome here is "trivial" doesn't mean the approach or the lessons learnt aren't valuable.
I found it an interesting read despite not having any stake in the outcome.