Python 3 fixes no fundamental issues with python 2 and introduced far more warts...

wbond · on Dec 10, 2016

It most certainly fixes support for Unicode on Windows in terms of filesystems paths, OS function boundaries and the console. Some of these fixes have even taken until 3.6 to get implemented.

As someone who writes cross-platform code, Python 3 was a breath of fresh air after fumbling around in the dark with Python 2.

patrec · on Dec 10, 2016

I can definitely believe that – but windows has basically lost[1] and has a much worse text model to boot. Like Java and unlike python 3, they at least have the excellent excuse that this was not obvious at the time. And under unix the impedance mismatch has definitely increased. Not a good trade.

[1] I wouldn't count them out, but they're definitely on the back foot as bash inclusion shows.

Manishearth · on Dec 10, 2016

> I can now access or count code points in O(1) (neither of which is in any way useful)

Oh, yeah, I agree that Python's unicode model isn't great. I like Ruby's, and Swift has its own cool thing going where it's very explicit about the uselessness of code points.

However, I think that Python 3 having some form of default unicode support is way better than what Python 2 had. It could be improved (backwards-compatibly too!), but it passes my minimum bar for a "modern" language's text story.

patrec · on Dec 10, 2016

In all seriousness, I'd much rather python3 had kept str, phased out the unicode type altogether, got rid of all the harebrained locale crap (sys.{get,set}defaultencoding etc) and just provided tooling (collation, regexp, denormalization etc.) for working with utf-8 encoded byte-'str's. This would probably have been a much smoother transition and ended up with a vastly superior result.

I'm pretty sure the people complaining here about how python2 str only supports ascii and they couldn't paste their smartquotes were bitten either by windows or unnecessarily bad unicode/str interactions due to python not just hardcoding utf-8 auto-conversion. That is the only sane thing to do (Your locale isn't *.UTF-8? Well sucks to be you. By now even the Japanese and Chinese seem to slowly have come around to the utf-8 bandwagon, and they had better reasons than most).

I might be wrong, but I can see basically 3 non-idiotic ways to do text in a programming language:

1. arrays of utf-8 bytes (Rust, Go). Python was close to that already and then messed it up. Indexing indexes into bytes O(1).

Upsides:

- efficient: most text you're going to get is already utf-8 and the rest should be converted on ingress/egress; html/css/most code will be represented fairly efficiently even if the body text is mostly say, Chinese; you can do a lot of text processing by just working on the ascii range (e.g. CSV parsing).

- sane: no BOM, no 32 bit encoding of 21 bit quantities etc; unix-compatible

Downsides:

- can't efficiently access individual logical characters or know the fixed-font width of the text

- normalization is kinda nasty (concatenation etc.), in practice people just tend to ignore that

- hard to constrain to only valid utf-8 without significant downsides

- maybe not that beginner friendly
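To make the trade-offs in option 1 concrete, here is a rough Python 3 sketch that treats text as plain utf-8 bytes. The sample string and the helper `is_valid_utf8` are made up for illustration. It shows that splitting on an ascii delimiter is safe on raw bytes, that byte indexing can cut a character in half, and that concatenating two normalized strings need not give a normalized result.

```python
import unicodedata

# Hypothetical sample: Latin text with one accent, some Japanese, plain ascii.
text = "na\u00efve,\u65e5\u672c\u8a9e,ok"
data = text.encode("utf-8")

n_code_points = len(text)  # counts code points
n_bytes = len(data)        # counts utf-8 bytes

# Bytes of multi-byte utf-8 sequences are all >= 0x80, so an ascii
# delimiter such as b"," can never appear inside a character.
fields = [f.decode("utf-8") for f in data.split(b",")]


def is_valid_utf8(b: bytes) -> bool:
    """Illustrative helper: True if b decodes cleanly as utf-8."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Byte offsets are O(1) to index, but an arbitrary offset may land
# in the middle of a character: here after the first byte of U+00EF.
cut_mid_char = data[:3]
cut_on_boundary = data[:4]

# Concatenation and normalization: each piece is in NFC on its own,
# but the joined string is not.
left, right = "e", "\u0301"  # "e" and COMBINING ACUTE ACCENT
joined = left + right
joined_is_nfc = joined == unicodedata.normalize("NFC", joined)
```

Constraining a byte-string type to "always valid utf-8" would mean checks like `is_valid_utf8` on every slice, which is presumably part of what the "significant downsides" point is getting at.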

2. use some non-array type that doesn't allow for indexing (e.g. ropes), probably using (mostly) utf-8 for internal encoding.

3. arrays of logical characters. That means you need to make up fake characters to handle graphemes that are not directly representable as a single pre-composed code point in unicode. The upside is that this has beginner friendly semantics in a sense and allows indexing on what's meaningful in the domain (graphemes). The downside is that I can't see how to do this without a lot of complexity and some nasty gotchas. This seems to be what perl6 does https://design.perl6.org/S15.html#NFG
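A small Python 3 sketch of why option 3 needs "fake" characters, using only the standard `unicodedata` module. The helper name `nfc_len` and the sample strings are illustrative. Some base-plus-combining-mark sequences collapse to one pre-composed code point under NFC, others have no pre-composed form and stay as two code points even though a reader sees one character.

```python
import unicodedata


def nfc_len(s: str) -> int:
    """Illustrative helper: number of code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", s))


# "e" + COMBINING ACUTE ACCENT has a pre-composed form (U+00E9).
e_acute = "e\u0301"

# "q" + COMBINING ACUTE ACCENT has no pre-composed code point, so NFC
# leaves it as two code points; a grapheme-indexed string type would
# have to invent a synthetic single "character" for it.
q_acute = "q\u0301"

e_len = nfc_len(e_acute)
q_len = nfc_len(q_acute)
```

Python's `str` indexes code points, so `q_acute[0]` gives the bare "q" rather than the whole grapheme. That is the gap the NFG approach linked above tries to close.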