C4 handles a similar subset of C to what selfie does, but is a full order of magnitude smaller: about 600 lines. And it is very readable despite the terseness!
One thing I love is how it reads the whole source file into RAM and operates on it as an array. So many compilers (including my own...) avoid this, which I guess ostensibly lets them handle huge machine-generated source files (but that's an extreme outlier, and really I'd be inclined to let people who depend on that use something else or work around it). The first compiler I saw that did load the whole file was Wouter van Oortmerssen's AmigaE compiler [1], which demonstrated that it was a viable strategy even when your compiler is expected to run on a machine with 512KB RAM. It's one example of where we often over-abstract for dubious benefits.
On a machine with virtual memory, couldn't it fix the source size issue by mmapping the file instead of reading it into RAM, and letting the OS page it in and out of memory?
Yes, that should work, as long as you ensure it gets zero-terminated so you don't have to do length checks all over the place - that'd lose a lot of the benefit. For files whose size is not a multiple of the page size, that's guaranteed. I've never tried to request mapping of more bytes than a file takes, though... I'm going to guess that still does something sane, but I've not tested it.
> I've never tried to request mapping of more bytes than a file takes, though... I'm going to guess that still does something sane, but I've not tested it.
Sadly, the Open Group and Linux man pages say that:
> Memory access within the mapping but beyond the current end of the underlying objects may result in SIGBUS signals being sent to the process.
Joy... Oh, well, I think on modern systems it'd be reasonable to just refuse to deal with source files that are too large to load into RAM, and leave it at that.
Wait, are you saying that one can't mmap a file larger than physical memory? Isn't this exactly what mmap is for? I think what masklinn is saying is that if you mmap a file, you can't read past the end or over-allocate a single mapping. One could have another output mmap region.
It doesn't reliably do something sane. IIRC, when the file size is an exact multiple of the page size, LLVM just falls back to reading the file into an array the ordinary way. That only happens about one time in 4096 (assuming 4KB pages), so there's no practical difference to performance.
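For what it's worth, the "mmap when the tail is guaranteed to be zero, otherwise read normally" strategy discussed above could be sketched roughly like this on a POSIX system. This is an illustration under assumptions, not LLVM's code: `source_buf`, `source_open` and `source_close` are hypothetical names. It relies on POSIX zero-filling the partial page at the end of a mapped file, and it assumes nobody truncates or rewrites the file while it is mapped.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical container for a zero-terminated source text. */
typedef struct {
    const char *data;   /* data[len] is always '\0' */
    size_t len;
    int mapped;         /* 1: data comes from mmap, 0: from malloc */
} source_buf;

/* Returns 0 on success, -1 on failure. */
int source_open(const char *path, source_buf *out)
{
    assert(out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) { close(fd); return -1; }
    size_t len = (size_t)st.st_size;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* If the file does not end exactly on a page boundary, the rest of
       the last page reads as zeros, so data[len] is a valid terminator. */
    if (len > 0 && len % page != 0) {
        void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            close(fd);                    /* the mapping outlives the fd */
            out->data = p; out->len = len; out->mapped = 1;
            return 0;
        }
    }

    /* Fallback (empty file, exact page multiple, or mmap failure):
       read into a heap buffer and terminate it by hand. */
    char *buf = malloc(len + 1);
    if (!buf) { close(fd); return -1; }
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n <= 0) break;
        got += (size_t)n;
    }
    close(fd);
    if (got != len) { free(buf); return -1; }
    buf[len] = '\0';
    out->data = buf; out->len = len; out->mapped = 0;
    return 0;
}

void source_close(source_buf *s)
{
    if (s->mapped) munmap((void *)s->data, s->len);
    else free((void *)s->data);
}
```

Either way the caller gets a pointer it can scan until `'\0'`, so the rest of the compiler doesn't need to know which path was taken.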
https://github.com/rswier/c4/blob/master/c4.c
c4 is what you might call a microscopic C compiler; it's surprising how much it does for the code it has.