
But people that have that csv issue have deeper, lurking encoding issues they're probably not aware of.

Sure, don't change the stuff that works and cause yourself unnecessary pain, but don't blame Python 3 for your misencoded data.



Oh, we are perfectly well aware of our encoding issues. In NumPy, one of the most important libraries in Python 2 or 3, strings always take one byte per character. And it's not going to change. So the built-in csv module needs to support a reasonable behavior. Which it does not.

If a company tries Python 3 and discovers basic things like CSV produce utter gibberish, they would do well to opt out. And they do--in droves.

My data is not misencoded, you see. It's just misunderstood.


>In NumPy, one of the most important libraries in Python 2 or 3, strings always take one byte per character.

So I won't be able to use NumPy with either of my two native languages. That sounds like a bit of a shortcoming for the majority of the world.


Numpy isn't really used like that though. It's for numerical computation. There might be cases for putting text in there but you can always keep it locally and map it to an int that you use in numpy for mapping (I do that in places).
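For illustration, here's a minimal sketch of that mapping approach (the names are made up): keep the actual text in a plain Python dict and store only small integer codes in the NumPy array.

```python
import numpy as np

labels = ["apple", "banana", "apple", "cherry"]

# Assign each distinct string a small int code; only the codes go into NumPy.
code_for = {}
codes = np.array([code_for.setdefault(s, len(code_for)) for s in labels])

# A reverse lookup recovers the original text when needed.
label_for = {v: k for k, v in code_for.items()}
decoded = [label_for[int(c)] for c in codes]
```

This keeps the numeric array purely numeric, so none of NumPy's string-dtype limitations ever come into play.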


> So the built in csv module needs to support a reasonable behavior. Which it does not.

It could be argued that the csv module's behaviour is reasonable, and NumPy's isn't. (I'm not 100% sure about all the details of this issue.) Hopefully, NumPy will change its behaviour to match Python 3, but if not you could still use the NumPy CSV routines like `loadtxt` or `genfromtxt` [0]. So then this becomes a documentation change to add some warnings to both modules.

> they would do well to opt out. And they do--in droves.

This is simply not true. They would do well to handle strings properly and so avoid bugs in future - something Python 3 actively encourages, and Python 2 obscures. And while I can't speak for every company, our metrics show that our Python 3 code has far fewer customer issues than Python 2, Perl, or Ruby. Now that's business value. (Edit: I mean it's hard to make the comparison - the Perl code is e.g. older - but we're writing code now, and when the interns add new stuff to the Python 3 codebase, it breaks less. All of them are still actively developed, and the Ruby one is about as old as the Python 3 one.)

[0] https://docs.scipy.org/doc/numpy/reference/generated/numpy.l...


I say this as someone who uses the latest version of Python available in every new project or script.

Text encoding issues are absolute garbage in Python 3.x

I fucking hate the way that csv module works with text encodings.

As soon as I can figure out a reliable way to take latin-1 and save it as UTF-8 without breaking everything, I will try to shoehorn in a PR.

Right now, it's fucking awful. My ETL pipeline hates it, I hate it, my boss hates it, and my internal constituents hate it. Because it sucks.

A file I can read in one encoding and write as another should be readable with the encoding I wrote it in. That is not currently the case with the latest version of Python.

And it makes me hate the world.


    with open('some latin-1 file', 'rb') as f:
        text = f.read().decode('latin-1')
    with open('some utf8 file', 'wb') as f:
        f.write(text.encode('utf-8'))
Python 3's string encoding support is super good. I've said it before and I'll say it again: if you use bytes as a string you are Doing It Wrong.

If you use bytes as a string you are Doing It Wrong.

If you use bytes as a string you are Doing It Wrong.
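The same conversion can also be done entirely in text mode, letting `open()` do the decoding and encoding for you (file names here are placeholders; the example writes its own input so it's self-contained):

```python
# Make a small latin-1 input file so the example is self-contained.
with open('in.txt', 'wb') as f:
    f.write('café'.encode('latin-1'))

# Re-encode it to UTF-8 in pure text mode; open() handles both codecs.
with open('in.txt', encoding='latin-1') as src, \
     open('out.txt', 'w', encoding='utf-8') as dst:
    dst.write(src.read())
```

No bytes objects ever appear in your code this way; str in, str out.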


Allow me to rephrase.

I do that operation on a file I get from an API. I know for a fact that the encoding I'm receiving is latin-1.

I run exactly that operation on the file that you wrote out in code.

When I try to read that file back in as UTF-8, I get encoding errors. That does not make for "super good." That makes me want to scream.

I do not have this problem when I use Python 2.7.x


You're definitely making a mistake somewhere, because I just tested it for myself and it worked perfectly fine. I made a latin-1 file, applied the above code with it, and got a correct utf-8 file out. Are you reading the final file back as latin-1? You have to read it as utf-8 of course.
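To illustrate the point: reading UTF-8 output back with the wrong codec doesn't raise an error, it silently produces mojibake, which looks exactly like "encoding errors" even though the file itself is fine.

```python
text = 'café'
utf8_bytes = text.encode('utf-8')

# Decoding UTF-8 bytes as latin-1 never fails; it silently garbles instead.
wrong = utf8_bytes.decode('latin-1')   # 'cafÃ©'
right = utf8_bytes.decode('utf-8')     # 'café'
```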


And what happens when you don't get a choice about what strings you are digesting?


I don't follow. What do you mean?

To be perfectly clear: bytes (b'') is not a string. Again: bytes is NOT a string. It is an array of octets, aka bytes, aka unsigned 8 bit integers. NOT characters. NOT a string.

If you are dealing with bytes that are encoded representations of a string, then you have to know what encoding they use to decode them and treat them as strings.


I'm not sure what you mean. If you don't know what the encoding of the input file is, you have a problem. As far as I know there are libraries to guess the encoding, but it cannot be determined with complete accuracy.
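Short of a detection library, a crude stdlib-only fallback is to try strict decoding against a few candidate encodings (a sketch; note latin-1 accepts any byte sequence, so it only makes sense as a last resort):

```python
def guess_decode(raw: bytes, candidates=('utf-8', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding matched')
```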


I don't see the problem; reading and writing different encodings works fine. The csv module causes no problems either:

  import csv

  with open("in.csv", 'r', encoding='latin-1', newline='') as in_file, \
       open("out.csv", 'w', encoding='utf-8', newline='') as out_file:
      in_csv = csv.reader(in_file)
      out_csv = csv.writer(out_file)
      out_csv.writerows(in_csv)


I have a persistent problem that does almost exactly that.

The resulting file is not readable, and it makes me want to kick puppies and punch kittens.


> It could be argued that the csv module's behaviour is reasonable

I don't see how silently printing a binary literal, if that is indeed what it does, is reasonable. Simply put, b"foo" is not meaningful CSV.
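That is indeed what it does, and it's easy to reproduce: the writer falls back to str() on non-string fields, so a bytes field comes out as its literal repr.

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow([b'foo', 'bar'])

# The bytes field was stringified, so the first cell literally reads b'foo'.
print(buf.getvalue())   # b'foo',bar
```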

What it should do is 1) raise an exception by default, informing the user that they need to be supplying strings and not bytes, and 2) provide an explicit switch to treat binary data as pass-thru, which would be useful in scenarios where you're just reading a file and dumping it elsewhere, and don't want to spend time decoding and then encoding everything.
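Option 1 can even be prototyped in userland today. A hypothetical wrapper (the name and details are made up for illustration) that rejects bytes before they reach the writer:

```python
import csv
import io

class StrictWriter:
    """Hypothetical wrapper: raises on bytes fields instead of
    silently writing their repr into the output."""
    def __init__(self, f, **fmtparams):
        self._writer = csv.writer(f, **fmtparams)

    def writerow(self, row):
        row = list(row)
        for field in row:
            if isinstance(field, bytes):
                raise TypeError('CSV fields must be str, not bytes')
        self._writer.writerow(row)

buf = io.StringIO()
w = StrictWriter(buf)
w.writerow(['ok', 1])        # fine
# w.writerow([b'oops'])      # would raise TypeError
```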


The docs say that "[a] row must be an iterable of strings or numbers" [0]. So I guess an exception could be raised. However, the docs do tell a lie; non-strings are accepted and get converted to strings. You can pass any object in which has a string representation - including a byte array. It actually wouldn't be too hard to introduce a check for a bytes field, https://hg.python.org/cpython/file/3.6/Modules/_csv.c#l1227

    + if (PyBytes_Check(field)) {
    +     append_ok = FALSE;
    +     Py_DECREF(field);
    +     PyErr_SetString(PyExc_TypeError, "Field is bytes");
    + }
    else {
This would then raise a TypeError.

I don't think this is the right solution. It seems weird to have a special case because people aren't watching what they're putting in. Garbage in, garbage out, consenting adults and all that.

[0] https://docs.python.org/3/library/csv.html


Yes, "special cases aren't special enough to break the rules".

But "practicality beats purity".

And "errors should never pass silently"!

It's the same kind of thing that leads to safety warning stickers put on products. You may read it and think that it's something so obvious that consenting adults should know better. But then you look at the statistics about how many people did not, and realize that, yeah, a sticker along the lines of "don't stick your finger into a food processor" is actually a good idea. Especially given how cheap it is, and how expensive reattaching fingers is...

Basically, products should be designed around known human weaknesses, and that includes entrenched modes of thinking by past products. It doesn't mean that new products should accommodate those entrenched modes, especially when they lead to other problems. But they should try to detect them, and issue clear and explicit warnings, to guide the person to the proper way of doing things.


I moved to Python and got back into coding specifically to work with CSV files. From the first time I used code copied and pasted from SO or wherever to import and export CSVs with Python 3, I have not found myself plagued by unreliable behavior or mischievous encoding. Yes, there seems to be a type conversion necessary with NumPy, but it simply is not true that Python 3 has an endemic problem with CSV files!

In fact, the reason I chose Python was because I was able to dive so quickly into real problems like this with no problems whatsoever.

This is a minor issue for someone porting from 2 to 3; it is not a problem with 3 itself.



