deadcanard's comments

Take this with a hefty grain of salt. This is a one-author paper from an MD whose license was suspended by Massachusetts's medical board. The author also pushes some anti-vax rhetoric on their website. That does not mean the claims in the paper are incorrect, of course, but there are reasons to be skeptical.


Have you tried the link from the article: https://docs.kernel.org/admin-guide/kernel-per-CPU-kthreads....? Also try running "perf stat -d" on your run and see if anything pops out.


Again, I am biased. But the article explains memory translation in fairly simple terms and hammers the main advantages of HPs (better use of the TLB, simpler and smaller page tables). It explains clearly how much memory the TLB can cover, what a page walk is and how much time it takes (before even loading the actual data), and the importance of the cache wrt the page tables. It shows some perf numbers for random vs iterative memory accesses.

I don't think you'll find many articles that detail these points. Now, they might be trivial to you, and that's totally fair. But the goal is to address a wide audience. Additionally, the article does not address how to use HPs; that's left for part 2.

Wrt the other points, I certainly agree they are important topics to explore. I would add that using perf is super important for easy access to the perf counters.


IMO the easiest way (but certainly not the only way) is to allocate a new stack and switch to it with makecontext(). The manpage has a full code example. You just need to change the stack alloc. This approach has a few drawbacks but is hard to beat in terms of simplicity.


I'd argue that understanding what happens on every single memory access qualifies as fundamental.


Wrt code, look at the bench in the article. Even with sequential access, you can get a decent speedup using huge pages. But unless you have a good profile and use PGO, code access likely won't be that sequential. Like everything else, you'll need to measure to know exactly what benefit you might get. As a starting point, you can look at the iTLB load misses with perf stat -d.

Stack access is another story as it's usually local and sequential so it might not be that useful.


URL for your article?


https://www.chaoticafractals.com/manual/getting-started/enab...

Admittedly it's a bit terse, but at least it gives some steps you can use to enable it on Windows. It also benefits other software, such as 7zip. I need to update the page because these days the performance benefits are larger, due to the ever-widening divide between compute and memory speeds, CPUs having bigger large page TLBs, and additional optimisations...


I am biased, but I don't think it's fair to say that your article covers as much. There is more content in the article, written in a way that tries to be approachable. I certainly agree it's wordy, but it's hard to make content that's interesting to a wide audience.

Your article does cover how to use/enable them while the post does not, but that's meant for part 2.


Agreed, "about as much" is a bit unfair.


CAT is indeed a good thing to look at. But there are some important caveats:

1) Unless you have a very small number of cores, it's not possible to reserve a cache slice for every running program (some slices are shared for things like DDIO).

2) It's still not possible to lock specific data in the cache, because any collision will replace the data.

3) The slices are kinda big, so it's hard to be properly fine-grained.

Basically, CAT just prevents other processes from stealing all the cache. It does that by reserving ways (in the cache-associativity sense).


1) Fully agreed, but most HFT apps (with the exception of really simple ones like market data feed handlers, which can easily fit their working set into L2 anyway) will be the only thing running on a host.

2) mutual cache eviction by hash collisions is solvable with a number of tricks (although those methods are not easy and often wasteful). The "DDIO slice" issue used to be a problem back when Intel used ring topology for LLC. These days they are built as a mesh thus minimizing this effect.

3) CAT doesn't recognize threads or processes. COS (class of service) uses CPU cores for way-of-cache assignments

Recent micro-architectures like SKX or CLX have 11 ways of L3, and what often happens is that 1-2 ways get assigned to cpu0 for non-latency-critical workloads, while the rest are assigned to latency-sensitive, isolated cores, usually running a single user-space thread each.
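For illustration, a sketch of what that split can look like through the resctrl filesystem; this is a config fragment that needs root, and the bit masks assume the 11-way L3 mentioned above (cache id 0 and the cpu list are made up for the example):

```shell
mount -t resctrl resctrl /sys/fs/resctrl

# Default group keeps 2 ways (bits 0-1) for housekeeping work on cpu0.
echo "L3:0=003" > /sys/fs/resctrl/schemata

# Dedicated group gets the remaining 9 ways (bits 2-10) on the isolated cores.
mkdir /sys/fs/resctrl/latency
echo "L3:0=7fc" > /sys/fs/resctrl/latency/schemata
echo 2-10 > /sys/fs/resctrl/latency/cpus_list
```

Each directory under /sys/fs/resctrl maps to a COS, which is why the assignment is per-core (via cpus_list) rather than per-process.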


2) Agreed about the solvability and difficulty of avoiding cache collisions. DDIO must write its data somewhere in the L3 cache, and it ends up in the shareable slice. So either you're okay with sharing your cache, or you cannot use those slices if you want exclusive access for your processes. That was my point.

3) CAT does not recognize processes, but resctrl does. Feels like we're kinda nitpicking here...

On your last point: agreed, that gives you 9ish usable slices, which is not very much depending on the number of cores. That was the point I was trying to make.


3) resctrl just uses COS under the hood. The same limitation applies.

> Yeah, that gives you 9ish usable slices which is not very much. Again that was my point

That's 9 ways that you can use exclusively for your latency-sensitive workloads. This is MUCH better than letting all that LLC get trashed by non-critical processes/threads. Typically, after applying such partitioning, we've observed a 15-20% speedup in our apps.

In my area shaving off a few micros that way is a huge deal and definitely worth spending a couple of minutes implementing.


It's because shared_ptr uses an atomic count to synchronise between threads, while the Rust version assumes only one thread.

There is no equivalent in the standard C++ library, though it's very easy to write one. If you use gcc's libstdc++, the internal __shared_ptr is actually templated on the lock policy, so you can do something like:

    template<typename T> using nt_shared_ptr = std::__shared_ptr<T, __gnu_cxx::_S_single>;

Then if you use nt_shared_ptr instead, the code will be much simpler.

That said, your C++ code is incorrect as well: it gives the shared_ptr ownership of a stack variable. During the destruction of the shared_ptr, delete will be called on a stack address, leading to undefined behavior.



You are right. The first version of the code actually used std::shared_ptr<Data>(new Data) (and produced similar code), but then I thought it was unfair that we create the struct on the stack in Rust, so I changed the code. I should have used gcc with -Wall, which detects the problem (clang doesn't).


Then instantiate the smart pointer with a custom no-op deleter, and if you use the nt typedef I mentioned above, you'll get compiler output similar to Rust's.

e.g. nt_shared_ptr<Data> x(&d, [](Data*) -> void {});

https://godbolt.org/z/d3qv4cE1v


It is a rookie mistake to wrap the result of new in a shared_ptr: it makes two heap allocations. You should use std::make_shared, which combines both into a single allocation.


Why call new at all? I work on a several-hundred-thousand-line code base that has a single call to new.

