pittma a day ago

Cool stuff! This method is very similar to how AVX-512-optimized RSA implementations work, as they also have to do Very Large Exponentiations. This paper[1] covers how RSA does its windowing, including the formula showing that the window size can be arbitrary. One additional thing AVX-512 RSA implementations do, though, is store the results of the multiplications for the range [0..2^{window-size}) in a table; then, for each window, that result is retrieved from the table[2] and only the shifting/repositioning has to be done (rough Python sketch below).

1. https://dpitt.me/files/sime.pdf (hosted on my domain because it's pulled from a journal)

2. https://github.com/aws/aws-lc/blob/9c8bd6d7b8adccdd8af4242e0...
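
Roughly, that windowed-table approach looks like the following sketch of generic fixed-window modular exponentiation (illustrative only; the real aws-lc/IFMA code adds constant-time table reads and vectorized multi-limb arithmetic, and the function name here is made up):

    # Fixed-window modular exponentiation with a precomputed table.
    # Illustrative sketch, not the actual aws-lc code.
    def mod_exp_fixed_window(base, exponent, modulus, window_size=5):
        # Precompute base^i mod modulus for every i in [0, 2^window_size)
        table = [1] * (1 << window_size)
        for i in range(1, 1 << window_size):
            table[i] = (table[i - 1] * base) % modulus

        result = 1
        nwindows = (exponent.bit_length() + window_size - 1) // window_size
        for w in reversed(range(nwindows)):
            # "Repositioning": square once per bit of the window
            for _ in range(window_size):
                result = (result * result) % modulus
            # Retrieve the precomputed entry selected by this window's bits
            bits = (exponent >> (w * window_size)) & ((1 << window_size) - 1)
            result = (result * table[bits]) % modulus
        return result

    assert mod_exp_fixed_window(7, 12345, 99991) == pow(7, 12345, 99991)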

  • anematode a day ago

    Ooh interesting, I should have looked at this while developing.... Looks like that code could definitely use another version for e.g. Zen 5 where using zmm registers would lead to a 2x multiplication throughput. Also the mask registers are bounced to GPRs for arithmetic but that's suboptimal on Zen 4/5.

    Separately, I'm wondering whether the carries really need to be propagated in one step. (At least I think that's what's going on?) The chance that a carry in leads to an additional carry out beyond what's already there in the high 12 bits is very small, so in my code, I assume that carries only happen once and then loop back if necessary. That reduces the latency in the common case. I guess with a branch there could be timing attack issues, though.
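
    A toy sketch of that optimistic carry handling, assuming the usual 52-bit-limbs-in-64-bit-words layout IFMA code uses (purely illustrative, not the actual code):

        LIMB_BITS = 52
        LIMB_MASK = (1 << LIMB_BITS) - 1

        def propagate_carries(limbs):
            while True:
                # One pass: push each limb's overflow (its high 12 bits)
                # into the next limb.
                for i in range(len(limbs) - 1):
                    limbs[i + 1] += limbs[i] >> LIMB_BITS
                    limbs[i] &= LIMB_MASK
                # Common case: done after one pass. Loop back only in the
                # rare case where an added carry itself overflowed a limb.
                if all(l <= LIMB_MASK for l in limbs[:-1]):
                    return limbs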

    • pittma a day ago

      ymms were used here on purpose! With full-width registers, the IFMA insns have a deleterious effect on frequency, at least in the Icelake timeframe.

      • anematode a day ago

        Yeah, hence a separate version for CPUs which don't have that problem. Although maintaining so many of these RSA kernels does seem like a pain. Didn't realize you wrote that code; super cool that it's used in practice!

        • pittma a day ago

          I am not the original author—this is adapted from an implementation by Shay Gueron, the author of that paper I linked, but I do agree that it's cool!

SunlitCat a day ago

There is something off with

> ...and despite supporting it [AVX512] on consumer CPUs for several generations...

I dunno. Before Rocket Lake (11th gen), AVX512 was only available on enthusiast CPUs, Xeon CPUs, or some mobile processors (which I wouldn't really call consumer CPUs).

With the 12th gen (and that performance/efficiency core concept), they disabled it after a few months on those cores and it was never seen again.

I am pretty sure, though, that after AMD has some success with AVX512, Intel will reintroduce it.

Btw, I am still rocking an Intel i9-11900 CPU in my setup here. ;)

  • oparin10 a day ago

    That tracks. Intel's updated AVX10 whitepaper[1] from a few months back seems to confirm this. It explicitly states 512-bit AVX will be standard for both P and E cores, moving away from 256-bit only configs. This strongly implies AVX-512 is making a proper comeback, not just on servers but future consumer CPUs with E-cores too. Probably trying to catch up with AMD's wider AVX-512 adoption.

    [1] - https://cdrdv2.intel.com/v1/dl/getContent/784343 (PDF)

    • SunlitCat a day ago

      Nice find! I really hope so, as AVX512 is something interesting to play around with, and I am pretty sure it will play a bigger role in the future, especially with AI and all that stuff around it!

  • Aurornis a day ago

    > With the 12th gen (and that performance/efficiency core concept), they disabled it after a few months on those cores and it was never seen again.

    The 12th gen CPUs with performance cores didn't even advertise AVX512 support or have it enabled out of the box. They didn't include AVX512 on the efficiency cores for space reasons, so the entire CPU was considered to not have AVX512.

    It was only through a quirk of some BIOS options that you could disable the efficiency cores and enable AVX512 on the remaining CPU. You had to give up the E-cores as a tradeoff.

    • adgjlsfhk1 a day ago

      IMO this is more a sign of complete dysfunction at Intel. They definitely could have made avx512 instructions trigger a switch to p-cores, and honestly probably could have supported them completely (if slightly slowly) by splitting the same way AMD does on Zen4 and Zen5 C cores. The fact that they shipped P cores and E cores with different instruction sets is what you get when you have separate teams competing with each other rather than working together to make a good product.

      • pitust2 a day ago

        > They definitely could have made avx512 instructions trigger a switch to p-cores

        Not really, no. OS-level schedulers are complicated as is with only P vs E cores to worry about, let alone having to dynamically move tasks because they used a CPU feature (and then moving them back after they don't need them anymore).

        > and honestly probably could have supported them completely by splitting the same way AMD does on Zen4 and Zen5 C cores.

        The issue with AVX512 is not (just) that you need a very wide vector unit, but mostly that you need an incredibly large register file: you go up from 16 * 256 bit = 4096 bits (AVX2) to 32 * 512 bit = 16384 bits (AVX512), and on top of that you need to add a whole bunch of extra registers for renaming purposes.

        • adgjlsfhk1 a day ago

          > The issue with AVX512 is not (just) that you need a very wide vector unit, but mostly that you need an incredibly large register file

          Not necessarily; you need to behave as if you had that many registers, but IMO it would have been way better if the E cores had supported avx512 even if half of the registers didn't actually exist and were just backed by the L2 cache.

          • adgjlsfhk1 19 hours ago

            Also Zen4C has AVX512 support while being only ~35% bigger than Gracemont (although TSMC node advantage means you should possibly add another 10% or so). This isn't really a fair comparison because Zen4c is a very differently optimized core than Intel's E cores, but I do think it shows that AVX-512 can be implemented with a reasonable footprint.

            Or if Intel really didn't want to do that, they needed to get AVX-10 ready for 2020 rather than going back and forth on it for ~8 years.

        • conradev a day ago

          They could enable it on P cores with a separate enablement check and then leave it up to the developer to schedule their code on a P core. I imagine Linux has some API to do that scheduling (because macOS does), not sure about Windows.

        • fc417fc802 a day ago

          So introduce performance and efficiency profiles for threads at the OS level. Why should CPUs have to be heterogeneous with regard to the ISA and other details?

        • charcircuit 15 hours ago

          You don't need to switch entire cores. You could have E cores borrow just the AVX512 circuitry from the P cores.

      • johnklos 9 hours ago

        > They definitely could have made avx512 instructions trigger a switch to p-cores,

        That'd be an OS thing.

        This is a problem that has been solved in the mainframe / supercomputing world and which was discussed in the BSD world a quarter of a century ago. It's simple, really.

        Each CPU offers a list of supported features (cpuctl identify), and the scheduler keeps track of whether a program advertises use of certain features. If it does want features that some CPUs don't support, that process can't be scheduled on those CPUs.
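
        As a toy illustration of that bookkeeping (hypothetical feature sets; nothing like a real scheduler's data structures):

            # Tasks declaring a feature can only run on CPUs advertising it.
            cpu_features = {
                0: {"sse2", "avx2", "avx512f"},   # P-core
                1: {"sse2", "avx2"},              # E-core without AVX-512
            }

            def eligible_cpus(required):
                return [c for c, feats in cpu_features.items() if required <= feats]

            assert eligible_cpus({"avx512f"}) == [0]   # P-core only
            assert eligible_cpus({"avx2"}) == [0, 1]   # either core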

        I remember thinking about this way back when dual Nintendo-cartridge Pentium motherboards came out. To experiment, I ran a Pentium III and a Celery on an adapter card, which, like talking about self-hosting email, infuriated people who told me it couldn't be done. Different clock speeds, different CPU features, et cetera: it worked, and worked well enough to make me wonder what scheduler changes would make using those different features work properly.

  • anematode a day ago

    fixed :) thx

    • SunlitCat a day ago

      Awesome! I always love to read about such stuff, especially if it includes playing around with AVX512!

fitzn a day ago

They won in 3.6 secs, but second place took 3.73 (or 3.74 if being consistent with the winning time). So, did second place also optimize the PoW or were they presumably on an FPGA as well?

The previous submission, which the author describes as some expensive FPGA one, took 4+ seconds. You'd think he'd mention something about how second place in his week was potentially the second-fastest submission of all time, no?

  • underdeserver 19 hours ago

    The image says (dupe). I'm guessing it's the OP's team, trying to submit as themselves multiple times in parallel.

    • yorwba 12 hours ago

      From the article: "Only the first team who is able to connect to and exploit the server, and submit the flag to a Google Form, receives a payout; any subsequent submissions are marked as duplicates."

bluelightning2k a day ago

This is impressive, but it just strikes me as optimizing the wrong thing. A CTF shouldn't be about submission ops. Surely it would be better for everyone if all teams who send the flag within a submission window shared the prize.

  • kimixa a day ago

    I'd also say that this encourages people to sit on exploits instead of reporting them immediately, so they can try to submit it next time if they didn't get it, even without submission timing shenanigans.

    So it may be actively encouraging the "wrong" thing as well.

  • rfoo a day ago

    That would be a different meta-game, and while I haven't thought deeply about it, I believe the likely outcome is that people would just be discouraged and wouldn't consider submitting to kernelCTF.

bawolff a day ago

So if I understand correctly: it's a 4-second proof of work, and there is a payout once a month.

Are there really so many exploits that people are racing to get it every month?

  • Aurornis a day ago

    The server was open every 2 weeks. The proof of work was to slow down connections slightly to reduce incentives to spam as many connection requests as possible.

    Public CTFs are hard because inevitably some team will try something resembling a DDoS as part of racing to the finish.

    Google removed the proof of work step after this.

  • philodeon a day ago

    Yes, the myth of Linux kernel security is just that, a myth.

    • JackSlateur a day ago

      Is it, tho?

      The bug/feature ratio is what matters (systems with no bugs but no features are nice, but we have work to do)

    • jeffbee a day ago

      How does this myth even exist? There has never been a moment in the history of Linux where it lacked an exploitable flaw. The only uncertainty is that for the last couple of days we aren't sure exactly what the exploit is.

      • thephyber a day ago

        The myth has some truth to it.

        The impact of the average Windows exploit was higher than the average Linux exploit because non-NT Windows didn’t use best practices such as multiple accounts + least privilege. And for years there were daemons on Windows that would get exploited with RCE just idling on a network (eg. Conficker).

        It took Microsoft several years of being a laughing stock of security before Bill Gates made the decision to slow new feature development while the company prioritized security. Windows security then increased. I believe that was around the time that Microsoft started reusing the NT kernel in the consumer versions of Windows.

        Also, the myth ignores the fact that cybersecurity risk has components of likelihood (a percentage) and impact (an absolute number, sometimes a dollar value). Conflating these two components invites lots of arguments and confusion, as commonly seen when certain CVEs are assigned non-obvious CVSS scores.

        • bonzini a day ago

          > around the time that Microsoft started reusing the NT kernel in the consumer versions of Windows.

          Much later. Windows XP used the NT kernel in 2001, whereas Conficker was in 2008.

          • aidenn0 20 hours ago

            I had a copy of an XP service-pack burned to a CD because an unpatched install in the dorms at college would be infected by a worm (maybe Nimda?) faster than you could download the patches, thus requiring a fully offline install.

      • aidenn0 a day ago

        This exists because:

        1. When Linux was "coming of age" (around the turn of the millennium), Windows security was really bad.

        2. At the same time, there were far more Windows machines than Linux

        3. #1 and #2 together made Windows such an attractive target, that exploits for Linux were less of a thing.

        • rhet0rica a day ago

          Don't forget: 4. Poor binary compatibility made it seem like worms were impossible, and twenty years ago, the most dangerous cyberattacks were 'pranks' designed to hose as many PCs as possible.

          • whizzter a day ago

            In Linux? No. C/C++-compiled binaries with dynamic linking often had issues due to differing glibc versions, etc., but you could always compile things statically.

            "not ok" by LGPL licence so an issue for games,etc that wanted closed source so people talked about it, but a malicious worm creator could definitely go there since they probably wouldn't care about (L)GPL compliance (remember Linus has always been very hard on patches not breaking userspace ABI's).

            • bobmcnamara 20 hours ago

              > dynamic linking often had issues due to glibc versions diffing

              This was annoying for exploit authors as well. Like an early form of ASLR.

            • rhet0rica a day ago

              Yeah. Emphasis on the "seem" part. A false sense of security brought on by "If binary artifacts for package _foo_ can't be ported between distros, what chance do worms have?", plus the even older memory that the Morris worm had been designed around (and limited to) certain target platforms and architectures: https://en.wikipedia.org/wiki/Morris_worm

        • rfoo a day ago

          I thought when people say "Linux security is a myth" they were comparing it to FreeBSD / OpenBSD. So it's revealing that most are comparing Linux to Windows here.

          • aidenn0 a day ago

            Also, Microsoft was pushing hard into the workstation and server market in the late '90s.

            I remember an ad that showed someone who had painted themselves into a corner with paint in the shade of purple that Sun used in its branding, with a slogan something like "we have the way out."

      • landr0id a day ago

        A myth propelled by people who don't understand security continually saying "Anyone can read the code, therefore it's more secure".

        • mmsc a day ago

          b-b-but, my linux distribution is perfect for multi-tenant/multi-user purposes!!

          • udev4096 a day ago

            Most of the serious security researchers, such as Daniel Micay (lead developer of GrapheneOS), have been quite vocal [0] about how insecure Linux is.

            [0] - https://old.reddit.com/r/GrapheneOS/comments/bddq5u/os_secur...

            • spookie a day ago

              I would really like for him to go into detail about why Flatpak is flawed from day one, or about his MAC criticisms. Fedora has done a lot of work on the former, and there's no mention of it.

              Also why not recognize the strides Wayland made for security in the desktop? Very handwavey take.

              • jorvi a day ago

                It's pretty ironic that he first lauds macOS so much and disparages Linux for having strong NIH, but then hates on systemd, which is heavily inspired by macOS.

                And yeah, I don't understand his hate for Flatpak unless he means the sandbox is too user-hostile. So many things break with Flatpaks, from native messaging hosts (think browser extension talking to a desktop application) to theming to local folder access to pasting from the buffer... it's quite a draconian sandbox.

    • api a day ago

      But we don’t need safe languages.

  • dist-epoch a day ago

    These are local escalation of privilege exploits (becoming root from a regular user), not remote code execution. Escalation of privilege bugs are a dime a dozen.

    • singron a day ago

      I think this also requires user (i.e. unprivileged) namespaces since you have to manipulate traffic control queue disciplines (tc qdisc). You normally need to be root to do this, so it's only useful as an escalation if you can do it within a namespace or you can convince some root daemon to do the qdisc manipulation for you (I think unlikely?).

      User namespaces opened up an enormous surface area to local privilege escalation since it made a ton of root-only APIs available to non-root users.

      I don't think user namespaces are available on android, and it's sometimes disabled on linux distributions, although I think more are starting to enable it.

      Relatedly, kernelCTF just announced they will be disabling user namespaces, which is probably a good idea if they are inundated with exploits: https://github.com/google/security-research/commit/7171625f5...

    • internetter a day ago

      It wouldn't be reasonable to expect it to be an RCE bug. That wouldn't be a kernel bug; it would be in the networking stack or the software running on it.

      • eptcyka a day ago

        Where is the networking stack running on a typical Linux deployment? (:

        • internetter a day ago

          Yes, and most software runs on a host operating system. Vulnerabilities in the software of 3rd party developers aren’t in scope, even if it extends the kernel. Some networking tasks are delegated to the kernel, but almost all of the really risky stuff is not in the core kernel.

mmsc a day ago

Amazing stuff, but it also reads like a comedy due to the obstacles to winning this challenge. Real Rube Goldberg stuff.

mshockwave 19 hours ago

Nitpick: static linking won't give you inlining; it only eliminates the overhead of the PLT. LTO will give you more opportunities for inlining.

  • anematode 9 hours ago

    Consulted with Will to see exactly what he did and fixed, thx!

joelthelion 18 hours ago

It's a bit depressing that after all of these years, professionals only need 3 seconds to own a Linux box...

vlovich123 a day ago

I don't understand the purpose of the race. Why not just pay out per unique exploit?

  • rfoo a day ago

    The boss wants a strictly fixed budget for running those cool programs. The rationale behind these programs (at least partially) is about measuring exploit and mitigation dynamics, not buying bugs. And Linux is just so buggy that if you pay for every 0-day, it's basically out of control. Google tried that once (and tried to drain people's hoarded bugs): they ran a limited-time promotion with no race where every 0-day counted, and got flooded.

    And at the same time you don't want to piss off the community, so here we go.

    • 0x0 a day ago

      If linux kernel security really is so bad that google had to add a proof-of-work to introduce a 4 second race for 0day submissions, I'm surprised they're ok with still using the Linux kernel as the base for Android.

      • udev4096 a day ago

        Android has a vastly improved version of the Linux kernel: https://old.reddit.com/r/GrapheneOS/comments/bddq5u/os_secur...

        • kimixa a day ago

          All of that's talking about the userspace though?

          • rfoo a day ago

            Not all. For example kCFI is kernel space.

            Also, attack surface reduction is a very valid strategy, so it may seem like it's all about userspace (a sandbox for every app, etc.), but it can make a big difference in how much of the kernel attack surface is exposed.

            • kimixa a day ago

              Yes, but the concept of CFI is only mentioned in passing in that entire thread, and the kCFI implementation used is a vanilla kernel feature, not something Android-specific.

              There's a lot to be said that "Distro kernel config choices may not be as secure as possible", but that's not really an "Android"/"Vanilla Linux Kernel" difference.

              • rfoo a day ago

                Well, I don't know of kCFI being enabled on any distro besides Android, because it requires building the kernel with Clang.

                The previous in-kernel CFI implementation (before the kinda-joint effort that became kCFI) was upstreamed by Google, too: https://www.phoronix.com/news/Clang-CFI-Linux-Patches and https://www.phoronix.com/news/Linux-Kernel-Clang-LTO-Patches. Pixel devices also had it long before that. Given that the entire Linux kernel feature was developed out of Android, I find it a little unfair to call it "using a vanilla kernel feature".

                • kimixa a day ago

                  I'd argue that the entire point of using a shared open source kernel is that other users can benefit from additions.

                  Arguing "Who first added a feature" seems to be a losing spiral of needless tribalism. How many features does the Android kernel use that weren't developed by Google? Does that mean they wouldn't have developed those features? Or just that there's no point making a parallel implementation if there's already one existing.

                  • rfoo a day ago

                    The point here is not who first added the feature to Linux kernel. The point is Android cared about security, built a CFI implementation, started shipping it back in 2018, while Linux had other priorities and didn't have it until 2021. And even then almost nobody adopted it.

      • bonzini 20 hours ago

        What's the alternative that is efficient, feature complete(*) and more secure?

        (*) For example, Android uses SELinux to confine apps, virtual machines (pKVM) to run DRM code, and so on. All these increase the overall security of the system and decrease the cost of kernel bugs, so there's a tradeoff that's not easy to evaluate.

      • kevincox a day ago

        What is the alternative? I suspect all modern kernels are more or less just as vulnerable? They did start https://fuchsia.dev/ so maybe they are hedging against this problem? But making a fully-featured OS is a huge undertaking, especially if you need compatibility with existing apps and a wide variety of hardware.

      • imtringued 17 hours ago

        Google isn't doing the race thing. They just pay out to whoever submits a valid submission. The people doing the racing are the submitters, who want to get ahead of their competition and are stockpiling their exploits. If they were altruists, they would just submit their exploits for no remuneration. Hence the race isn't something Google is doing.

        The proof of work isn't there to add a "four second race". It's to prevent ddos like spam.

    • pas a day ago

      is there some more info somewhere about that flood? how many teams? how many submissions? how many unique? how much money G paid out?

supriyo-biswas a day ago

Can anyone tell me how they equated the Python function in the blog to what was shown in Google's PoW implementation? Seems a little difficult to follow.

  • bonzini a day ago

    The "exponent = (p + 1) // 4" in the Google code is 2^1277. To raise a number to such a huge power, you square it 1277 times (and that is what the Python function does). That works because if the initial value is "x", each squaring doubles the number of "x"s, and at the end you have 2^1277 of them.

quasiconnected a day ago

Beautiful use of number theory for the Mersenne-number modulus!

ramshanker a day ago

Awesome.

Just when I was starting to wonder if we finally had a chance to thwart L7 attacks using these PoW tricks.

IshKebab a day ago

Kind of feels like a weird criterion for reducing the number of submissions. In the real world it doesn't matter if an exploit takes 4 seconds or 6. Why not use some other, more interesting criterion, like how long ago the bug was introduced, or some of the many bonus criteria they have?

Just seems weird for it to be a race.

Havoc a day ago

Some level of submission difficulty makes sense to keep the garbage submissions at bay but that seems a bit over the top?

Can't imagine it makes a big difference to them whether google pays out 50k or 2x 50k for high quality bugs relevant to their company

  • jameshart a day ago

    Right? This gives the impression that there might be people out there who have developed valid attacks but who are holding back from revealing them because they haven’t optimized their strategy to ensure they can win the CTF race.

    This seems like a perverse incentive creating suboptimal behavior.

    • Retr0id a day ago

      That's why they removed it.

      • markasoftware a day ago

        no it's not. Google still has a long time between each submission cycle, and people are still holding back their exploits in the hope they'll win the next one. It's just a matter of luck now, rather than a matter of optimizing PoW.

      • jameshart a day ago

        Still a race though. So you can’t rig it by optimizing your proof of work implementation, people are still going to look for an edge. These are CTF experts, exploiting systems is what they do.

  • overfeed a day ago

    > Can't imagine it makes a big difference to them whether google pays out 50k or 2x 50k for high quality bugs

    You're brushing over important implemention details - it's not Google running this program but a specific team inside the company that has a fixed budget, a limited headcount and doing the best with what they have.

    Your argument is similar to "America has trillions in GDP and could afford to provide twice the number of free treatments to kids with cancer" - it sounds reasonable in aggregate, but breaks down in the specifics; the individual hospital systems, teams and care providers are currently working close to their limits.

    • Havoc a day ago

      I wasn't brushing over anything - just didn't know that. idk how google internal finances work...

  • bee_rider a day ago

    At the beginning of the post, they note a default implementation that uses some Google library. So maybe the assumption (which worked out for a while?) was that nobody would bother doing the “over the top” thing and try to beat Google’s library?

    One option could be to just open up, like, 10 slots instead, right? Most folks might use this implementation (since it has been posted anyway). That way coming second place in the race isn’t so bad…

    • internetter a day ago

      I honestly don't see why there shouldn't be infinite slots. Google has the money. Right now they've created a backlog of an unknowable number of exploits, being submitted sequentially one by one. It is possible the backlog grows faster than it is emptied. If they let them all through at once, there would be an upfront cost, but it wouldn't be like that every time, and the pace at which exploitable bugs are fixed would increase.

      The only restriction should be that the first submission of a given bug is the one that gets the reward.

avidiax a day ago

> The proof of work (implemented here) is a certain verifiable delay function (VDF) known as "sloth". VDFs are cryptographic primitives which prove that a nontrivial amount of time has passed by requiring a long, serial computation. This computation outputs a proof which can be (relatively) quickly verified. Because the computation is serial, scaling to more computational resources (such as more CPU or GPU cores) does not reduce the runtime.
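
For reference, a bare-bones sketch of the sloth idea described in that quote, assuming a prime p with p % 4 == 3 so a modular square root is a single big exponentiation (this omits the details of the real construction):

    # The prover iterates modular square roots, which is inherently
    # serial; the verifier undoes it with squarings, which are much
    # cheaper per step.
    p = 2**1279 - 1                       # prime with p % 4 == 3

    def sloth_prove(x, iterations):
        for _ in range(iterations):
            x = pow(x, (p + 1) // 4, p)   # square root of x (or of -x)
        return x

    def sloth_verify(x, y, iterations):
        for _ in range(iterations):
            y = (y * y) % p
        return y in (x % p, (p - x) % p)  # sign ambiguity from each root

    y = sloth_prove(42, 100)
    assert sloth_verify(42, y, 100)

Each proving step costs a full exponentiation by (p + 1) // 4, while each verification step is a single squaring, which is where the prove/verify asymmetry in the quote comes from.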

Maybe this sort of thing is a market opportunity for the CPU makers?

Add two special instructions: one that iterates a sloth given the seed value, and another that cryptographically signs the result to prove that it was produced by this model of CPU. Perhaps have hardware rate limiting to prevent overclocking.

Another idea would be to monitor CPU uptime and sign the uptime + a challenge. The uptime is cleared after signing. This is more like proof of stake, where we are proving that we dedicated n seconds of CPU ownership, while allowing the CPU to be used productively otherwise.