
> These files are sometimes NOT using the ASCII encoding but UTF-8, e.g. Java source files.

Why is this a problem? In 99.9999% of Java code, the UTF-8 characters aren't going to trip up an ASCII search.



It's a problem when e.g. searching for caf◌́e doesn't find café.


> It's a problem when e.g. searching for caf◌́e doesn't find café.

That doesn't even display in my browser[1]. I tried it in GoLand[2] and it doesn't display there either, so that's the rare 0.0001% case that I wouldn't really worry about: if the code has undisplayable Unicode sequences, there are bigger problems than searching.

[1] Chrome, on Mac

[2] Also on Mac


OP is trying to explain that accents in Unicode can be written in two ways: either "LATIN SMALL LETTER E" followed by "COMBINING ACUTE ACCENT" (two codepoints) or "LATIN SMALL LETTER E WITH ACUTE" (one codepoint). Both should render the same in browsers, but they don't compare equal unless the code normalizes them first (e.g. to NFC or NFD) or uses a Unicode-aware comparison.
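A minimal sketch of the two spellings, using only Python's standard `unicodedata` module. The variable names and the `nfc_equal` helper are made up for illustration; this shows one way to compare the strings, not how any particular search tool does it.

```python
import unicodedata

# Escapes are used so the example does not depend on how this text was saved.
precomposed = "caf\u00e9"   # 'é' as one codepoint: LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == decomposed)            # False: different codepoint sequences
print(len(precomposed), len(decomposed))    # 4 5
print(nfc_equal(precomposed, decomposed))   # True once both are normalized

# A plain substring search misses the other spelling unless both sides are normalized.
haystack = "un " + decomposed + " noir"
print(precomposed in haystack)                                 # False
print(precomposed in unicodedata.normalize("NFC", haystack))   # True
```

Normalizing both the query and the text to the same form before comparing is the usual fix; whether NFC or NFD is the better choice depends on the data.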

To demonstrate this OP explicitly used "DOTTED CIRCLE" (◌) then added the "COMBINING ACUTE ACCENT" to that. Normally there would be no dotted circle.

I wrote an article on this a while ago: https://richardjharris.github.io/unicode-in-five-minutes/


it displays fine here on Firefox on Linux: https://i.imgur.com/MoRqYL8.png


If you're using a Mac, you have an easy way to reproduce the distinction: save a file called café.txt and look at what filename it has in a directory listing. It will look the same but have a different byte sequence.
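To see what "a different byte sequence" means, here is a rough Python sketch that encodes both normalized forms of the name as UTF-8. The `filename_bytes` helper is hypothetical, and which form a given filesystem actually stores is an assumption you would need to check on your own machine.

```python
import os
import unicodedata

name = "caf\u00e9.txt"  # written with an escape to avoid editor encoding surprises

nfc = unicodedata.normalize("NFC", name)  # precomposed 'é'
nfd = unicodedata.normalize("NFD", name)  # 'e' + COMBINING ACUTE ACCENT

print(nfc.encode("utf-8"))  # b'caf\xc3\xa9.txt'
print(nfd.encode("utf-8"))  # b'cafe\xcc\x81.txt'

def filename_bytes(directory: str = ".") -> dict:
    """Hypothetical helper: map each directory entry to its raw encoded bytes."""
    return {entry: os.fsencode(entry) for entry in os.listdir(directory)}
```

Calling `filename_bytes()` in the directory where the file was saved shows which of the two byte sequences the system actually kept.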


It doesn't display in Firefox or Safari on Mac either.



