> These files are sometimes NOT using the ASCII encoding but UTF-8, e.g. Java so...

lmm · on Jan 12, 2023

It's a problem when e.g. searching for caf◌́e doesn't find café.

lelanthran · on Jan 12, 2023

> It's a problem when e.g. searching for caf◌́e doesn't find café.

That doesn't even display in my browser[1]; I tried it in GoLand[2] and it doesn't display there either. So that's the rare 0.0001% case that I wouldn't really worry about: if the code has undisplayable Unicode sequences, there are bigger problems than searching.

[1] Chrome, on Mac

[2] Also on Mac

rjh29 · on Jan 12, 2023

OP is trying to explain that accented letters in Unicode can be written in two ways: either "LATIN SMALL LETTER E" followed by "COMBINING ACUTE ACCENT" (two codepoints) or "LATIN SMALL LETTER E WITH ACUTE" (one codepoint). Both should render identically, but they don't compare equal unless you normalize them first (e.g. to NFC or NFD) or use a Unicode-aware comparison.
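A minimal sketch of that mismatch in Python, using only the standard library's `unicodedata` module. The helper names `nfc` and `contains` and the sample text are made up for illustration; they are not from any tool mentioned in this thread.

```python
import unicodedata

composed = "caf\u00e9"     # ends in LATIN SMALL LETTER E WITH ACUTE (one codepoint)
decomposed = "cafe\u0301"  # ends in LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

def nfc(s: str) -> str:
    """Return s in Unicode Normalization Form C (composed)."""
    return unicodedata.normalize("NFC", s)

def contains(haystack: str, needle: str) -> bool:
    """Hypothetical helper: substring search after normalizing both sides."""
    return nfc(needle) in nfc(haystack)

text = "menu: " + decomposed + " au lait"

naive_hit = composed in text               # plain codepoint-by-codepoint search
normalized_hit = contains(text, composed)  # search after NFC normalization

print(len(composed), len(decomposed))  # 4 5
print(naive_hit, normalized_hit)       # False True
```

Normalizing both sides to NFD would work equally well here; what matters is that the needle and the haystack use the same form.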

To demonstrate this, OP deliberately put a "DOTTED CIRCLE" (◌) in front of the "COMBINING ACUTE ACCENT". Normally there would be no dotted circle.

I wrote an article on this a while ago: https://richardjharris.github.io/unicode-in-five-minutes/

viccuad · on Jan 12, 2023

It displays fine here on Firefox on Linux: https://i.imgur.com/MoRqYL8.png

lmm · on Jan 13, 2023

If you're using a Mac you have an easy way to reproduce the distinction: save a file called café.txt and look at what filename it has in a directory listing. (It will look the same but have a different byte sequence).
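The byte-level difference can be sketched without touching a filesystem. This assumes the directory listing hands back the decomposed (NFD) form of the name. Whether that actually happens depends on the macOS version and filesystem, so treat the NFD step as a stand-in for what the OS might do.

```python
import unicodedata

# The name as typed, in composed form (NFC).
name_nfc = unicodedata.normalize("NFC", "caf\u00e9.txt")

# Assumption: the form a directory listing might return (decomposed, NFD).
name_nfd = unicodedata.normalize("NFD", name_nfc)

bytes_nfc = name_nfc.encode("utf-8")
bytes_nfd = name_nfd.encode("utf-8")

print(bytes_nfc)               # b'caf\xc3\xa9.txt'
print(bytes_nfd)               # b'cafe\xcc\x81.txt'
print(bytes_nfc == bytes_nfd)  # False
```

Both names look like café.txt on screen, but the first encodes é as two bytes (C3 A9) and the second as a plain e followed by the two-byte combining accent (CC 81).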

d4rti · on Jan 12, 2023

It doesn't display on FF or Safari on Mac either.