To add to that: O(1) indexing into strings outside of the ASCII range is completely pointless and makes absolutely no sense. A language should disallow that and not encourage it.
I disagree that it's pointless. Being able to get indices into a string for fast slicing is useful. Consider regex, where match results are stored as references to slices of the input string.
What you can drop is the idea of the index being an integer with any particular semantic meaning. Rust uses byte indexes with an assertion that the index lies on a codepoint boundary, for instance.
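A minimal sketch of Rust's behavior: byte indices give O(1) slicing, but the indices carry no character-level meaning, and slicing off a codepoint boundary is rejected at runtime.

```rust
fn main() {
    let s = "héllo"; // 'é' is two bytes in UTF-8, at byte offsets 1..3
    // Slicing by byte index is O(1)...
    assert_eq!(&s[0..1], "h");
    assert_eq!(&s[1..3], "é");
    // ...but the index must land on a codepoint boundary.
    // &s[0..2] would panic: byte 2 is inside 'é'.
    assert!(!s.is_char_boundary(2));
    assert!(s.is_char_boundary(3));
}
```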
That isn't an index into a string. It is an opaque indicator of position in a string. That opaque indicator of position happens to be implemented as an index into the underlying array, but it is not itself an index.
I think something like a lexer doesn't need to know about graphemes; it just needs to know which codepoints are XID_start and XID_continue for identifiers, and hardcoded codepoints for everything else.
Obviously, it does make sense: instead of allocating a new string for every token, you can just save a token's start and end indices, and only allocate a new string in the case where the parser actually uses that token in the AST.
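A sketch of that approach: a hypothetical `Token` struct that stores only byte offsets into the source, allocating a `String` lazily. (`char::is_alphanumeric` here is a stand-in for real XID_start/XID_continue checks, which need a Unicode table.)

```rust
// Hypothetical token: just two byte offsets into the source, no allocation.
#[derive(Debug, PartialEq)]
struct Token {
    start: usize, // byte offset of the first byte
    end: usize,   // byte offset one past the last byte
}

impl Token {
    // Allocate only when the parser actually wants the text.
    fn text(&self, src: &str) -> String {
        src[self.start..self.end].to_string()
    }
}

// Scan one identifier starting at byte offset `start`.
fn lex_ident(src: &str, start: usize) -> Token {
    let end = src[start..]
        .char_indices()
        .find(|&(_, c)| !(c.is_alphanumeric() || c == '_'))
        .map(|(i, _)| start + i)
        .unwrap_or(src.len());
    Token { start, end }
}

fn main() {
    let src = "naïve_name = 1";
    let tok = lex_ident(src, 0);
    // The lexer never allocated; the text is materialized on demand.
    assert_eq!(tok.text(src), "naïve_name");
}
```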
But the type signature of that token should make it clear those are byte offsets and not character offsets. You can use them for byte-wise operations like memcpy, or to explicitly decode them into a string using the original encoding of the string. You cannot use them for string-wise operations like substring without explicitly casting or converting them to a string.
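One hedged way to encode that in the type system, using a hypothetical `ByteOffset` newtype: the offsets can't be confused with character counts, and turning them back into text is an explicit, fallible conversion.

```rust
// Hypothetical newtype: the signature says "byte offset", not "character index".
#[derive(Clone, Copy, Debug, PartialEq)]
struct ByteOffset(usize);

// The only string-wise use of the offsets is an explicit decode step,
// which fails if either offset is mid-codepoint or out of bounds.
fn decode(src: &str, start: ByteOffset, end: ByteOffset) -> Option<&str> {
    src.get(start.0..end.0)
}

fn main() {
    let src = "héllo";
    assert_eq!(decode(src, ByteOffset(1), ByteOffset(3)), Some("é"));
    // Byte 2 is inside 'é': the conversion refuses, rather than panicking.
    assert_eq!(decode(src, ByteOffset(1), ByteOffset(2)), None);
}
```

Using `str::get` instead of indexing makes the boundary check part of the conversion's contract rather than a runtime panic.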
It seems like this whole mess comes from conflating bytes and characters. They are not the same, any more than integers and booleans are the same just because you can jnz in most assembly languages.
> You cannot use them for string-wise operations like substring without explicitly casting or converting them to a string.
Why on earth not? Any slice of a string (with a non-pathological internal encoding) with ends on codepoint boundaries is itself a valid string. Losing that information sounds like a whole lot of pointless hassle and potential for harm. It also forces the internal string encoding to be part of the interface, since taking a slice of a string and getting a byte array requires knowing how the string is encoded to use it, but getting another string means that such details are private.
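For instance, in Rust a slice taken on codepoint boundaries is itself a `&str`, usable with every string operation directly, and the caller never needs to know the encoding:

```rust
fn main() {
    let s = "日本語テキスト"; // each of these codepoints is 3 bytes in UTF-8
    // A slice on codepoint boundaries is just another valid string:
    let first = &s[0..9]; // "日本語"
    assert_eq!(first, "日本語");
    // String-wise operations work on it directly; no byte-array detour,
    // and the encoding stays an internal detail.
    assert_eq!(first.chars().count(), 3);
}
```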
It's true that they're byte offsets, but that doesn't matter. The parser doesn't care whether it's on the 10,543rd grapheme or the 10,597th. It just wants to know where its stuff is.