> This should have been the approach to modernising python all along.
The core driver for the Python 3 break was the fix to the text model; that fix is what allowed literally everything else, since it completely broke existing code anyway.
And I, for one, think it's one of the most important improvements of Python 3. The text model of Python 2 is a giant mess and makes it very hard to correctly deal with non-ascii text for any non-trivial software, especially in large teams where not everybody will carefully evaluate the text-ness of their code.
> the text model of Python 2 is a giant mess and makes it very hard to correctly deal with non-ascii text for any non-trivial software
There are counter-arguments to this. Armin Ronacher, author of (among other software) the excellent Flask web framework, thinks that Python 2's system of codecs and byte streams is better in practice [1][2]. Reasons include: you can do byte -> byte conversions with codecs that are no longer possible; you can better handle text encodings besides UTF-8 (and here he describes several embarrassing failures of Python 3 to handle OS paths correctly); and you can write single APIs that handle both byte streams like gzip and text encodings like UTF-8.
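To make the byte-to-byte point concrete (a minimal sketch; zlib is just one such transform, and note that Python 3.2 did bring these transforms back, only behind a less convenient spelling):

```python
# Python 2: bytes-to-bytes transforms hang directly off str.encode():
compressed = "some payload".encode("zlib")
original = compressed.decode("zlib")

# Python 3 dropped that spelling. The transforms returned in 3.2,
# but only via the codecs module, never via bytes.encode():
import codecs
compressed = codecs.encode(b"some payload", "zlib_codec")
original = codecs.decode(compressed, "zlib_codec")
```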
> Armin Ronacher, author of (among other software) the excellent Flask web framework, thinks that Python 2's system of codecs and byte streams is better in practice.
Armin Ronacher works in a very specific context: he has to deal with byte/text interfaces in pretty much all his projects. While I can see where he's coming from, I work at a different level, and at the level at which I work the P2 model is a giant pain in the ass.
Armin is no foe of Python 3. And as noted in the essay, Python 3 has undergone several improvements and feature reintroductions, e.g. PEP 461 reintroduced C-style formatting for bytestrings, making it significantly more convenient to generate binary data (especially ascii-based formats) than it was between 3.0 and 3.4.
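For instance (the request values here are made up, but the mechanics are PEP 461's):

```python
# Python 3.5+ (PEP 461): %-formatting works on bytes again, so building
# ASCII-based wire formats needs no decode/encode dance.
# On 3.0-3.4 this same line raises TypeError.
request = b"GET %s HTTP/1.1\r\nHost: %s\r\n\r\n" % (b"/index.html", b"example.com")
```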
Also note that Armin has repeatedly praised Rust's text model, which is much more similar to P3's than P2's (except with static types and no messy legacy).
> and here he describes several embarrassing failures of Python 3 to handle OS paths correctly
And (fucking surprise) the issue with that is that the text model of FS paths is an embarrassing pile of garbage; Python 2 is convenient because it doesn't try to touch that mess at all and just hands the flaming bag of shit to whoever comes next.
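For the record, Python 3's eventual answer (PEP 383) is to smuggle undecodable bytes through str as lone surrogates. A minimal sketch, assuming a UTF-8 locale and a made-up latin-1-named file:

```python
import os

raw = b"caf\xe9.txt"             # b'\xe9' is not valid UTF-8
name = os.fsdecode(raw)          # 'caf\udce9.txt' -- lone surrogate escape
assert os.fsencode(name) == raw  # round-trips to the exact original bytes
```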
> Also note that Armin has repeatedly praised Rust's text model, which is much more similar to P3's than P2's (except with static types and no messy legacy).
That is incorrect. Rust's text model has (almost) free (and copyless) transmutes from bytes to strings. Python does not. The text model of Rust is much closer to Python 2 than 3 in many ways.
> That is incorrect. […] The text model of Rust is much closer to Python 2 than 3 in many ways.
Rust's text model strictly separates proper strings and bytestrings, defaults to proper strings and requires that strings be properly formed (so much so that it has additional completely separated platform-dependent types for dealing with OS-originated "stuff").
The one "difference" (which is more in the realm of implementation detail than language text model) is that Rust leverages its ownership system to make UTF8 "encoding" and "decoding" free (literally for the former, essentially for the former). The encoding and decoding are still there and explicit operations though.
> Rust's text model has (almost) free (and copyless) transmutes from bytes to strings.
Only for the specific case of input bytes already in the language's internal encoding (which, granted, will be common, as most inputs would be ascii or utf-8), and with the same ownership constraints as the input; that's mostly enabled by Rust's ownership model.
> Python does not.
Python doesn't generally do no-alloc/0-copy operations so that's not overly surprising.
> Only for the specific case of input bytes already in the language's internal encoding (which, granted, will be common, as most inputs would be ascii or utf-8), and with the same ownership constraints as the input; that's mostly enabled by Rust's ownership model.
Except of course on operating systems where text I/O is done entirely in UTF-16. Say, Windows.
Since Python strings have no fixed encoding, but choose "the most efficient one" (heuristically) when decoding, they can cope better than a fixed UTF-8 encoding in these cases.
>> Python does not.
> Python doesn't generally do no-alloc/0-copy operations so that's not overly surprising.
Indeed. Even when the encoding is not changed, the string will always be copied. One could think of an API that does that, though, to optimize all those cases where memory is already owned by a shim in the runtime.
> Since Python strings have no fixed encoding, but choose "the most efficient one" (heuristically) when decoding, they can cope better than a fixed UTF-8 encoding in these cases.
That is wrong. Python can never pick the most efficient encoding unless you decode from latin1.
Rust having strings that are UTF-8 is a guarantee, and as such it allows you to do very efficient operations on them. Python gives you a vague guarantee that it gives you O(1) access to something like a glyph.
These are very different and incompatible text models.
At no point is Python's text model fast or overly useful.
Armin has backed off of this stance since then. And for good reason.
As someone who works with Python text processing extensively, I can tell you that the Python 2.7 text model is broken and dangerous, due to the silent bytes-unicode coercion and misguided use of ascii instead of UTF-8 as the default text encoding. Many people don't realize this and will argue that it's not broken, because they have never fed non-ascii text through their app to watch it blow up! And once they realize that they have a problem, they then have to deal with a rat's nest of silent bytes-unicode coercions happening implicitly all over their app, sometimes impossible to deal with due to library code outside their control.
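The trap in miniature, for anyone who hasn't hit it (Python 2 with its default ASCII codec):

```python
# Works fine for months, because the str operand happens to be ASCII:
greeting = u"Hello, " + "world"  # str silently decoded as ASCII -> unicode

# Then the first real-world input arrives:
name = "caf\xc3\xa9"             # UTF-8 bytes for "café", type str
label = u"Menu: " + name         # UnicodeDecodeError: 'ascii' codec can't
                                 # decode byte 0xc3 in position 3
```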
There is a good discussion to be had on whether a language should prioritize bytes or unicode strings as the main data type, but there is no excuse for the "ticking timebomb" string data type design that pre-3 Python has with strings and the default encoding.
For this reason alone I'm very happy that 2.7 is starting to lose its grip. Its continued support is a problem, and I have no love for people who are trying to hold on to it.
There are many other features in 3 that I can no longer live without - most of them now available through backport modules - but types and asyncio can't be easily backported, and people are starting to use them extensively.
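A minimal illustration of why the backporting fails: async/await is new syntax, not just a new library, so no module on PyPI can graft it onto 2.7 (example written against the 3.5-era API):

```python
import asyncio

async def fetch(i):              # 'async def' is a SyntaxError on Python 2
    await asyncio.sleep(0.1)     # stand-in for real network I/O
    return i * 2

loop = asyncio.get_event_loop()
results = loop.run_until_complete(
    asyncio.gather(*(fetch(i) for i in range(5))))
print(results)                   # [0, 2, 4, 6, 8]
```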
Yeah, but if you are dealing only with a subset of the English language in the U.S., and your API endpoint that you are scraping wants to serve all peoples in all locales in all situations, you are fucked if you want to use Python 3 and its csv module.
You genuinely are better off using Python 2.7.x and its naive approach to text.
I don't understand what you mean by "your API endpoint that you are scraping wants to serve to all peoples in all locales in all situations".
That would mean to me that the API endpoint could be sending me Unicode, in which case Python 3's Unicode-aware CSV is going to work great, and Python 2's csv is fucked. The limitations of Python 2's csv module were one of the key points that moved my company to Python 3.
On Python 3, if you want to be naive about text (not sure why you're celebrating only working in a subset of English, but you have this option), you could open the file as Latin-1 and get the same results as Python 2.
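Concretely (made-up file name):

```python
import csv

# Latin-1 maps every byte 0x00-0xFF to a code point, so this never
# raises on any input and round-trips bytes exactly -- effectively
# Python 2's "naive" byte-string behaviour, one keyword argument away.
with open("data.csv", encoding="latin-1", newline="") as f:
    for row in csv.reader(f):
        print(row)
```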
Many CSVs are made with Excel. Excel's only form of Unicode CSV is tab-separated UTF-16. Python 2's csv can't parse those at all, can it?
> Python 2's csv can't parse those at all, can it?
Nope, not without re-encoding to UTF-8 before parsing (learned that the hard way, and found it's easier to just take Excel files as input).
P2's CSV module works byte-based, and basically only handles ASCII-compatible supersets, assuming your special characters (quote chars, field and record separators) are straight ASCII.
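For comparison, in Python 3 the Excel case is a couple of lines (made-up file name; the utf-16 codec also consumes Excel's BOM):

```python
import csv

with open("export.txt", encoding="utf-16", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        print(row)
```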
I don't think it's honest to post this without context, and without mentioning that in the five years since, most of these things have been remedied; indeed, some had already been remedied at the time of his writing.
Some points Armin makes are valid and remain valid for Linux-ish systems, but they have been refuted countless times for other operating systems; Python is not a Linux-only show. I won't re-iterate all that here.
You should be aware that Armin now has a more-or-less followup post telling people not to do what you just did (i.e., reference his 2011 post as an authoritative "Python 3 is bad" explanation), because both Python 3 and his own opinions have evolved since he wrote that post.
This approach could not have worked for modernizing Python. The whole point of the Python 3 break was to be able to remove warts in the language that could not have been fixed without breaking backwards compatibility. One core part of this is Unicode support -- Python had a horrible story for international text before this.
The fact that there are some parts of "modern" Python which could have been implemented in Python 2.7 backwards-compatibly is irrelevant.
Python 3 is not the language designers worrying about minor subjective issues in Python, like the print keyword or the design of iterators, and deciding that they want to change it all. It is the language designers worrying about major issues like international text, realizing that they regrettably have to break backwards compatibility to fix them, and then taking the opportunity to revamp things like printing and iteration since they're breaking backcompat in some pretty major ways anyway.
> Python had a horrible story for international text before this.
No, Python had a horrible story for text, and people who worked in limited/sheltered domains didn't realize it. I personally lost all kinds of valuable hours of my life fighting with Python 2's "pretend everything is ASCII until it isn't, then fall over dead" model, because I -- a US citizen, working at US companies, and for quite a while dealing only with English-language content -- still ran into non-ASCII characters with regularity.
And here I'm being charitable; I simply refuse to believe that the overwhelming majority of people who used Python 2 never once had to deal with someone copy/pasting text out of Word or another program that used "smart quotes".
It's super-sad that "Unicode" is a prominent stated motivation for Python 3.
Unicode in Python 2 was fundamentally broken in that whether it had UTF-16 semantics or UTF-32 semantics depended on how the interpreter was compiled. That's a terrible, terrible idea. However, they could have fixed it by sticking to one option: UTF-16 (which provided compatibility with some interesting things that Python interoperated with, like Cocoa and, via Jython, Java).
UTF-16 is a sad legacy mistake, but APIs providing Unicode operations of any kind can be built on top. So UTF-16 is a mistake to begin with, but it's not a blocker for supporting all of Unicode, including features targeted at the needs of all writing systems and languages. Java, Windows and the Web Platform (including JS) show that proper i18n can be built on top of the bad but backward-compatible 16-bit code unit foundation.
Now, the _even_ sadder part of Python 3 is that if you decide that UTF-16 is a mistake and want to fix it, UTF-32 is the naive and wrong solution. When a Unicode newbie is told about surrogates, they think that UTF-32 is the answer. But then they waste memory and cache line space (and, if dynamically omitting leading zeros on a per-string basis, the compute and copy cost of promoting to a different unit width when adding one emoji). And once the damage is done, someone points out that grapheme clusters are a thing, so they still didn't get O(1) indexing to user-perceived units.
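Both costs are easy to see in CPython itself (sizes are approximate and version/platform-dependent; this assumes the PEP 393 representation used since 3.3):

```python
import sys

sys.getsizeof("a" * 1000)                 # ~1049: 1 byte/char (latin-1 range)
sys.getsizeof("a" * 1000 + "\u20ac")      # ~2076: one "€" widens all to 2 bytes
sys.getsizeof("a" * 1000 + "\U0001F600")  # ~4080: one emoji widens all to 4 bytes

# And O(1) indexing still lands on code points, not user-perceived units:
s = "e\u0301"   # 'e' + combining acute accent: one grapheme cluster
len(s)          # 2
s[0]            # 'e' -- not the accented character the user sees
```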
The enlightened thing, of course, is to do what Rust does: use UTF-8 and use iterators on top for accessing pieces larger than a code unit (code point, grapheme cluster). (To my taste, Swift strings are too magic and DWIM-y. At least back when I read the Swift book, it didn't even explain the underlying representation. With Rust, the representation is very explicitly known.)
"UTF-16 sucks" is what Python 3 got right. That UTF-32 (with dynamic leading zero omission on a per-string basis) is the answer is what Python 3 got very, very wrong. The correct answers are either UTF-8 (for a new language like Rust) or holding the nose and making stuff work on top UTF-16 (Java, JavaScript) without breaking old programs.
To be clear, I don't agree with py3's Unicode model. I think it sucks for the same reasons you do.
I also think that default Unicode is a major improvement over py2 and is enough to justify breaking the language because modern languages should at least have that.
Python 3 fixes no fundamental issues with Python 2 and introduced far more warts than it removed. The GIL is still there, the crummy runtime is still there, and Unicode is now an even greater mess. I really wonder how many people who bang on about Unicode actually have a good grasp of Unicode and text processing, because Python 3's Unicode design is obviously terrible. I can now access or count code points in O(1) (neither of which is in any way useful) at the cost of a tremendous increase in space and time overhead for any basic text operation on non-ascii text, and of some bizarre hacks to deal with pretending that stdin, stdout and sys.argv are always text.
It most certainly fixes support for Unicode on Windows in terms of filesystem paths, OS function boundaries and the console. Some of these fixes even took until 3.6 to get implemented.
As someone who writes cross-platform code, Python 3 was a breath of fresh air after fumbling around in the dark with Python 2.
I can definitely believe that – but Windows has basically lost [1] and has a much worse text model to boot. Like Java and unlike Python 3, they at least have the excellent excuse that this was not obvious at the time. And under Unix the impedance mismatch has definitely increased. Not a good trade.
[1] I wouldn't count them out, but they're definitely on the back foot as bash inclusion shows.
> I can now access or count code points in O(1) (neither of which is in any way useful)
Oh, yeah, I agree that Python's Unicode model isn't great. I like Ruby's, and Swift has its own cool thing going where it's very explicit about the uselessness of code points.
However, I think that Python 3 having some form of default unicode support is way better than what Python 2 had. It could be improved (backwards-compatibly too!), but it passes my minimum bar for a "modern" language's text story.
In all seriousness, I'd much rather Python 3 had kept str, phased out the unicode type altogether, got rid of all the harebrained locale crap (sys.{get,set}defaultencoding etc.) and just provided tooling (collation, regexp, normalization etc.) for working with utf-8 encoded byte-'str's. This would probably have been a much smoother transition and ended up with a vastly superior result.
I'm pretty sure the people complaining here about how python2 str only supports ascii and they couldn't paste their smart quotes were bitten either by Windows or by unnecessarily bad unicode/str interactions due to Python not just hardcoding utf-8 auto-conversion. That is the only sane thing to do. (Your locale isn't *.UTF-8? Well, sucks to be you. By now even the Japanese and Chinese seem to have slowly come around to the utf-8 bandwagon, and they had better reasons than most.)
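To make the "tooling over utf-8 byte-strs" idea concrete, a toy sketch (it trusts its input to be well-formed UTF-8; real tooling would add validation, collation, normalization, and so on):

```python
def iter_codepoints(buf):
    """Yield the code points of a UTF-8 byte string; no text type needed."""
    i, n = 0, len(buf)
    while i < n:
        b = buf[i]
        if b < 0x80:             # 0xxxxxxx: 1-byte sequence
            length = 1
        elif b >> 5 == 0b110:    # 110xxxxx: 2-byte sequence
            length = 2
        elif b >> 4 == 0b1110:   # 1110xxxx: 3-byte sequence
            length = 3
        else:                    # 11110xxx: 4-byte sequence
            length = 4
        yield buf[i:i + length]
        i += length

list(iter_codepoints("naïve".encode("utf-8")))
# [b'n', b'a', b'\xc3\xaf', b'v', b'e']
```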
I might be wrong, but I can see basically 3 non-idiotic ways to do text in a programming language:
1. arrays of utf-8 bytes (Rust, Go). Python was close to that already and then messed it up. Indexing is into bytes, O(1).
Upsides:
- efficient: most text you're going to get is already utf-8 and the rest should be converted on ingress/egress; html/css/most code will be represented fairly efficiently even if the body text is mostly, say, Chinese; you can do a lot of text processing by just working on the ascii range (e.g. CSV parsing; see the sketch after this list).
- sane: no BOM, no 32 bit encoding of 21 bit quantities etc; unix-compatible
Downsides:
- can't efficiently access individual logical characters or know the fixed-font width of the text
- normalization is kinda nasty (concatenation etc.); in practice people just tend to ignore that
- hard to constrain to only valid utf-8 without significant downsides
- maybe not that beginner friendly
2. use some non-array type that doesn't allow for indexing (e.g. ropes), probably using (mostly) utf-8 for internal encoding.
3. arrays of logical characters. That means you need to make up fake characters to handle graphemes that are not directly representable as a single pre-composed code point in Unicode. The upside is that this has beginner-friendly semantics, in a sense, and allows indexing on what's meaningful in the domain (graphemes). The downside is that I can't see how to do this without a lot of complexity and some nasty gotchas. This seems to be what Perl 6 does: https://design.perl6.org/S15.html#NFG
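A tiny demonstration of the ASCII-range point from option 1 above (a sketch, not a real CSV parser):

```python
# In UTF-8, continuation bytes are always >= 0x80, so an ASCII delimiter
# like b',' can never occur inside a multi-byte sequence. Splitting on
# it is safe without decoding anything:
line = "größe,café,42\n".encode("utf-8")
fields = line.rstrip(b"\n").split(b",")
# [b'gr\xc3\xb6\xc3\x9fe', b'caf\xc3\xa9', b'42']
```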
As mentioned in a comment above, the choice to take 2.7 behaviour when 3 behaves differently means this wannabe-Python '2.8' is neither backward nor forward compatible.