To add to that: O(1) indexing into strings outside of the ASCII range is completely pointless and makes absolutely no sense. A language should disallow that and not encourage it.
I disagree that it's pointless. Being able to get indices into a string for fast slicing is useful. Consider regex, where match results are stored as references to slices of the input string.
What you can drop is the idea of the index being an integer with any particular semantic meaning. Rust uses byte indexes with an assertion that the index lies on a codepoint boundary, for instance.
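A minimal sketch of Rust's behavior: byte indices give O(1) slicing, but the indices carry no character-level meaning, and slicing off a codepoint boundary is rejected at runtime.

```rust
fn main() {
    let s = "héllo"; // 'é' is two bytes in UTF-8, at byte offsets 1..3
    // Slicing by byte index is O(1)...
    assert_eq!(&s[0..1], "h");
    assert_eq!(&s[1..3], "é");
    // ...but the index must land on a codepoint boundary.
    // &s[0..2] would panic: byte 2 is inside 'é'.
    assert!(!s.is_char_boundary(2));
    assert!(s.is_char_boundary(3));
}
```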
That isn't an index into a string. It is an opaque indicator of position in a string. That opaque indicator of position happens to be implemented as an index into the underlying array, but it is not itself an index.
I think something like a lexer doesn't need to know about graphemes; it just needs to know which codepoints are XID_start and XID_continue for identifiers, and hardcoded codepoints for everything else.
Obviously, it does make sense: instead of allocating a new string for every token, you can just save a token's start and end indices, and only allocate a new string in the case where the parser actually uses that token in the AST.
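A sketch of that approach: a hypothetical `Token` struct that stores only byte offsets into the source, allocating a `String` lazily. (`char::is_alphanumeric` here is a stand-in for real XID_start/XID_continue checks, which need a Unicode table.)

```rust
// Hypothetical token: just two byte offsets into the source, no allocation.
#[derive(Debug, PartialEq)]
struct Token {
    start: usize, // byte offset of the first byte
    end: usize,   // byte offset one past the last byte
}

impl Token {
    // Allocate only when the parser actually wants the text.
    fn text(&self, src: &str) -> String {
        src[self.start..self.end].to_string()
    }
}

// Scan one identifier starting at byte offset `start`.
fn lex_ident(src: &str, start: usize) -> Token {
    let end = src[start..]
        .char_indices()
        .find(|&(_, c)| !(c.is_alphanumeric() || c == '_'))
        .map(|(i, _)| start + i)
        .unwrap_or(src.len());
    Token { start, end }
}

fn main() {
    let src = "naïve_name = 1";
    let tok = lex_ident(src, 0);
    // The lexer never allocated; the text is materialized on demand.
    assert_eq!(tok.text(src), "naïve_name");
}
```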
But the type signature of that token should make it clear those are byte offsets and not character offsets. You can use them for byte-wise operations like memcpy, or to explicitly decode them into a string using the original encoding of the string. You cannot use them for string-wise operations like substring without explicitly casting or converting them to a string.
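One hedged way to encode that in the type system, using a hypothetical `ByteOffset` newtype: the offsets can't be confused with character counts, and turning them back into text is an explicit, fallible conversion.

```rust
// Hypothetical newtype: the signature says "byte offset", not "character index".
#[derive(Clone, Copy, Debug, PartialEq)]
struct ByteOffset(usize);

// The only string-wise use of the offsets is an explicit decode step,
// which fails if either offset is mid-codepoint or out of bounds.
fn decode(src: &str, start: ByteOffset, end: ByteOffset) -> Option<&str> {
    src.get(start.0..end.0)
}

fn main() {
    let src = "héllo";
    assert_eq!(decode(src, ByteOffset(1), ByteOffset(3)), Some("é"));
    // Byte 2 is inside 'é': the conversion refuses, rather than panicking.
    assert_eq!(decode(src, ByteOffset(1), ByteOffset(2)), None);
}
```

Using `str::get` instead of indexing makes the boundary check part of the conversion's contract rather than a runtime panic.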
It seems like this whole mess comes from conflating bytes and characters. They are not the same, any more than integers and booleans are the same just because you can jnz in most assembly languages.
> You cannot use them for string-wise operations like substring without explicitly casting or converting them to a string.
Why on earth not? Any slice of a string (with a non-pathological internal encoding) with ends on codepoint boundaries is itself a valid string. Losing that information sounds like a whole lot of pointless hassle and potential for harm. It also forces the internal string encoding to be part of the interface, since taking a slice of a string and getting a byte array requires knowing how the string is encoded to use it, but getting another string means that such details are private.
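For instance, in Rust a slice taken on codepoint boundaries is itself a `&str`, usable with every string operation directly, and the caller never needs to know the encoding:

```rust
fn main() {
    let s = "日本語テキスト"; // each of these codepoints is 3 bytes in UTF-8
    // A slice on codepoint boundaries is just another valid string:
    let first = &s[0..9]; // "日本語"
    assert_eq!(first, "日本語");
    // String-wise operations work on it directly; no byte-array detour,
    // and the encoding stays an internal detail.
    assert_eq!(first.chars().count(), 3);
}
```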
It's true that they're byte offsets, but that doesn't matter. The parser doesn't care whether it's on the 10,543rd grapheme or the 10,597th. It just wants to know where its stuff is.