I understand that part of the "magic" behind the M1 is how it has some cores that are "performance" cores and other cores that are highly efficient "low power" cores.
My question is, how much of the sublime performance of M1 Macs comes from macOS being fine-tuned to take advantage of these two different types of cores?
If you simply get the bare minimum of NetBSD booting on an M1, will it not achieve nearly the same performance unless the OS is fine-tuned to schedule properly across the "performance" cores and the "efficient low power" cores?
I remember reading a recent article [0] about how future Intel chips plan to have similar "perf" and "low power" cores, and part of the presentation included someone from Microsoft saying they spent lots of time on the Windows team making sure Windows could schedule across these properly. So I wonder how much work it really takes.
ARM big.LITTLE[1] SoCs have been a thing for about a decade now, and most operating systems have schedulers that take advantage of each set of cores. macOS isn't doing anything special that Linux et al. aren't doing.
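As a concrete illustration of that on Linux: arm64 big.LITTLE systems publish each core's relative compute capacity in sysfs, which is the same signal the scheduler's energy-aware placement works from. A minimal sketch to read it (the `cpu_capacity` files simply don't exist on most x86 machines, hence the fallback):

```python
import glob
import os

def cpu_capacities():
    """Return {cpu_name: capacity} from sysfs; empty if this platform
    doesn't publish per-core capacities (typical on x86)."""
    caps = {}
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpu_capacity"):
        cpu = os.path.basename(os.path.dirname(path))  # e.g. "cpu0"
        with open(path) as f:
            caps[cpu] = int(f.read().strip())
    return caps

print(cpu_capacities() or "no heterogeneous-capacity info exposed here")
```

On a big.LITTLE SoC this prints a mix of values (the big cores report the largest capacity); on a homogeneous machine it prints the fallback string.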
> macOS isn't doing anything special that Linux et al. aren't doing.
MacOS isn't doing anything Linux and others aren't doing, or MacOS isn't doing anything those others can't do?
That is, do we actually know how well tuned MacOS is for these cores and their capabilities, or is that an assumption? I thought I had read there were some specific instructions in the chip that were either new to it or were more aggressively used by MacOS to get additional energy savings or performance gains.
I don't know of anything really magical but for years Apple has been steadily pushing apps towards APIs that give the OS a lot of latitude to manage energy [1]. Grand Central Dispatch, AVFoundation, etc. Then on iOS BackgroundTasks etc (and iPhones have had little cores for quite a while now). I would imagine a lot of that experience transfers to macOS.
The centralized + draconian approach they take has a lot of problems, but it does help with sweeping changes like this.
Care to share what special things macOS is doing? Because according to Apple's documentation, it doesn't seem like they're doing anything special when it comes to heterogeneous multiprocessing and scheduling that Linux hasn't been doing for quite some time.
At a high level, yes, but at a much lower level it's another story.
When you manufacture your own chips and write your own OS, there are no limits on micro-tuning. You can design them to work together, instead of making the compromises you usually have to.
This kind of tuning might never land in the Linux kernel for being too chip-specific.
Apple has also moved a lot of driver code to another layer, so it doesn't need to live in the kernel, for example.
The core of the argument is the design, and the power to manufacture and update all the parts (device, firmware, drivers, OS). You can design them to work flawlessly together in the bigger picture. You can leave properties out of the kernel to be handled by OS apps. You can build hardware submodules, such as the DCP interface on M1 Macs, a main topic of discussion for Asahi Linux (https://asahilinux.org/2021/08/progress-report-august-2021/). You can add your own instructions for your own purposes, something that is hard to add to the Linux kernel.
In theory, you might be able to do the same with the Linux kernel, but in practice driver development and the rest rely on reverse engineering, black-box testing, or written specs without access to the source code. How time-consuming is that compared to Apple's position? Isn't it more likely that mainline kernel code is accepted when it works, not when it is perfectly optimized? You can't rely on some OS app handling something when Apple has full control over that.
Android and iOS are a better example of this. I'll post a few links which might give an idea.
> You can leave properties out of the kernel to be handled by OS apps.
Doable on Linux as well. If this were better for performance, it most likely would already have been implemented. Besides, this isn't concrete. By concrete I mean things that are known to be implemented on the M1 that, for example, Asahi won't be able to replicate.
This is just hardware. Even so, this example is a non-starter, since the DCP will be supported by Linux.
> You can add your own instructions for your own purposes, something that is hard to add to the Linux kernel.
It's actually not hard. It's trivial if you add compiler support (which Apple most likely would, for, you guessed it, LLVM). There are actually some custom instructions on the M1, AFAIK, mostly used to run x86 code more efficiently.
The top four points are pure hardware. Point 5 is about specific design decisions made in Android, which doesn't mean anything here. Point 7 even says that the most likely source of a performance increase would be custom co-processors, which again is pure hardware. I'm not sure what this link is supposed to achieve, but its arguments are opposed to Apple being better because of software/hardware magic.
This link again mentions the design decisions that make Android less responsive. The main culprit mentioned for why Android uses more RAM is that vendor builds of Android carry a lot more bloat. This has nothing to do with a magic hardware/software combo. This is vendor Android being trash.
If that is the case, though, I wouldn't be surprised if newer Linux and BSD releases gain additional support for per-core-type performance scheduling and the optimizations therein.
It's not entirely new; remember that pretty much all ARM processors that aren't MCUs have big.LITTLE. But there is no doubt additional work to be done in the area.
This answer seems optimistic. But unless you have a single CPU-bound thread of execution with no parallelism, and no other tasks needing runtime, having more cores, even little ones, seems like a win.
Even just pedestrian clock and interrupt processing could exploit the other cores. Or keyboard and mouse handling, whatever. Playing an MP3 while you compile? That other core sure would save the compiler some context switches...
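On Linux that kind of split can even be requested explicitly from user space. A rough sketch using the Linux-only affinity calls (the core IDs here are hypothetical; which numbers map to the little cores varies by SoC):

```python
import os

def pin_to_cpus(cpus):
    """Restrict the current process to the given set of core IDs,
    e.g. to keep a background decoder off the cores a compile is
    using. Returns the resulting affinity set, or None where the
    API isn't available (e.g. macOS)."""
    try:
        os.sched_setaffinity(0, set(cpus))
        return os.sched_getaffinity(0)
    except (AttributeError, OSError):
        return None

print(pin_to_cpus({0}))
```

In practice you'd usually let the scheduler decide and only hint via priorities; hard affinity is the blunt-force version of the idea.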
My Ubuntu VM on my Mac Mini gets outstanding performance which validates the point that macOS isn't essential for the performance. I'm sure however that macOS is very helpful in ensuring the power efficiency on laptops.
Same here. For many years, a MacBook Air 11 was my daily driver. After some time, I wiped Mac OS X and ran a minimal Linux configuration: XMonad, Emacs, Firefox and XTerm.
With a few tweaks, mostly those suggested by powertop, my battery life was indistinguishable from Mac OS. Which is impressive, given that Safari is known to be heavily optimized for low energy usage. I guess I compensated for that with a simpler graphics stack that generated fewer CPU wakeups.
I’m with you up through XMonad, Emacs, and XTerm, but... Firefox? Right now I’m struggling with an attempt to use a circa-2015 Dell XPS 13 as a Linux-based daily driver, and Firefox is nigh unusable with even only a few tabs. 4GB of RAM apparently doesn’t go far enough; swap degrades performance even with an SSD, but turning it off just means stuff dies. I’d love to find out I just set things up wrong, but I’m shocked to discover I was getting better performance out of Windows.
I wonder what else you're running on that machine. I have an i5 X201 from 2010 with 2GB of ram (and an SSD), and I regularly push it with 50-ish Firefox tabs.
However, I'm using i3 instead of GNOME, and Void instead of Debian et al.
It's Lubuntu, so I think the desktop is LXQt; I'd assume it's not that.
About the only unusual thing I can think of is that I'm trying to use dropbox. It dies periodically, so maybe it's hungry, but even without it running, less than a dozen FF tabs can bog down the machine (and I gave up on Chromium entirely).
Would totally welcome any tips from people confident I can do better.
Honestly, I wouldn't discount the desktop or the OS. On a fresh reboot, htop shows my cpu usage across the four cores as 0, 0.7, 0, and 1.3%. That's not a lot of background activity.
I won't claim to have late-model-Ryzen performance; there is an SSD performance hit when the machine uses some of the 16GB of swap I gave it. The website data has to go somewhere. But I haven't found it to become unusable, except when I restore and load all my tabs simultaneously. After it's all downloaded, though, pulling web pages out of swap is pretty fast.
Personally, I found the best things for performance were an SSD, i3+void, and a ton of swap space. Pretty much in that order.
Edit: I looked up the processors of the two machines. Ironically, all else being equal, that X201 is a full 20% faster than yours (2.2 vs 2.66).
macOS/iOS have APIs for marking jobs as background, and those jobs will run on the slow cores. And these APIs are used, AFAIK. I'm not sure if widely used Windows or Linux software routinely marks its threads as background jobs. I know that I never did that in my own software.
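For comparison, the closest portable knob outside Apple's QoS classes is plain POSIX niceness, which is a much blunter instrument. A minimal sketch (just an illustration, not Apple's API):

```python
import os

def run_as_background(work):
    # Demote the whole process to the lowest unprivileged priority.
    # This only hints at CPU scheduling; it carries none of the extra
    # semantics (I/O throttling, timer coalescing, little-core
    # placement) that a macOS background QoS class implies.
    os.nice(19)
    return work()

print(run_as_background(lambda: sum(i * i for i in range(1000))))
```

Note that niceness applies per process (or per thread via other calls), and once raised it can't be lowered again without privileges.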
On both macOS and Linux, process scheduling goes further than just niceness. macOS in particular has a concept of process priorities[1] and I/O policies, and the OS itself defines special priorities and policies for background processes.
Apple's system developers definitely deserve a lot of credit for optimising iOS / macOS Big Sur for their ARM hardware platform. If we could run another OS on it, it would be evident that part of the performance boost of Apple's M1 ARM processor is due to the optimised software it runs.
I used an intel macbook pro with an older version of Mac Os that had lots of background processes and features disabled just to get the performance I wanted.
My M1 Air was noticeably faster even with spotlight indexing and a massive build inside a virtual machine out of the box.
The system software (OS) is highly optimised for the M1 and thus adds greatly to its performance. Note that Apple has been developing iOS / iPadOS on the ARM platform for many years now.
[0] https://www.pcworld.com/article/3629502/intels-alder-lake-wh...