Show HN: Npm module that makes easy to convert JSON to CSV (npmjs.com)
64 points by kauegimenes on June 8, 2017 | 63 comments


Off-topic rant I need to get off my chest. Yesterday I troubleshot a bizarre problem: new versions of Excel don't necessarily understand the simplest CSV files. Reason: they forgot that the C in CSV stands for comma, and instead expect whatever character happens to be set as the 'delimiter' character in the Windows regional preferences. Here in New Zealand, it seems that character is semicolon instead of comma. For some reason. I think Excel was finished circa 2000 and now they focus mainly on making it worse.


There was an interesting article about CSV parsing a few years ago; here's the HN discussion: https://news.ycombinator.com/item?id=7796268

To quote the top comment (by user slg): "CSVs are a headache. Like the article says, RFC 4180 doesn't necessarily represent the real world. However, sometimes you just have to reject things that aren't to spec.

Not too long ago I was struggling with one of these CSV issues and received some good advice from Hans Passant [1] on a Stack Overflow question pertaining to my problem (emphasis mine):

'It is pretty important that you don't try to fix it. That will make you responsible for bad data for a long time. Reject the file for being improperly formatted. If they hassle you about it then point out that it is not RFC-4180 compatible. There's another programmer somewhere that can easily fix this.'

It makes perfect sense in hindsight. If you accept a malformed CSV file, people will expect you to accept any malformed data that has a CSV extension. You are taking on a lot of extra responsibility to cover for the lack of work by another programmer. Odds are they can make a change to fix the problem in a fraction of the time it would take you to work around it. You just have to raise the issue.

I realize that rejecting bad files isn't really possible in every circumstance. But I have a feeling it is an option more times than you might initially think."


If you know the file is going to be used in Excel, you can force a delimiter with the `sep` directive. The first characters of the file should be

    sep=,<newline>
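For illustration, a minimal sketch of how you might prepend that directive when generating CSV output (the `withSepDirective` helper is hypothetical, not part of any library mentioned here):

```javascript
// Prepend Excel's `sep=` directive so Excel splits on the given
// delimiter regardless of the regional list-separator setting.
function withSepDirective(csv, delimiter = ',') {
  return `sep=${delimiter}\n${csv}`;
}

console.log(withSepDirective('name,city\nJane,Auckland'));
// sep=,
// name,city
// Jane,Auckland
```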


Bleh. That's so Microsoft.

It took roughly ten seconds to find huge problems with this approach: https://stackoverflow.com/questions/20395699/sep-statement-b...


CSV is a mess. A while back we switched to exporting XLSX using the xlsx module https://www.npmjs.com/package/xlsx


How can CSV be a mess? The implementation is in the name. Excel is a mess.


The mess is in the "V". RFC4180 leaves value interpretation to the application. It doesn't tell you, for example, when a string like 12345 should be interpreted as a number. So we end up sending numbers and dates and other meaningful values into the ether, hoping that the end application understands what you meant to convey.

XLSX has a cell type for IEEE754 doubles, so there is no such ambiguity. It also has a special date type!


That's still not a failing of CSV. If knowing whether a V is an IEEE754 double is critical to your application you shouldn't be using CSV. You wouldn't say nails are a mess because sometimes you need screws.


Knowing whether a value is a number or text or a date is critical to myriad applications, and the CSV specs give no guidance on how to interpret the actual values as numbers or text or dates. You don't have control over both endpoints in most cases, especially when you accept files from clients. Furthermore, they mix presentation and values (is `1.23%` a string or a number? What is the number?)

XLSX separates the value from presentation and gives every value a clear type. If a value is a number, there's a concrete numeric value that is stored separately from the number format. That way there is no guesswork involved, you can figure out exactly what the value is with zero magic.
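That guesswork can be sketched concretely. This hypothetical heuristic (not from any spec or library) shows why a CSV consumer can only guess at types:

```javascript
// A CSV consumer is forced to guess types from raw text.
// Hypothetical heuristic -- the ambiguity is the point.
function guessValue(cell) {
  if (/^-?\d+$/.test(cell)) return Number(cell);   // "12345" -> the number 12345? Or an ID string?
  if (/^-?\d+(\.\d+)?%$/.test(cell)) {
    return Number(cell.slice(0, -1)) / 100;        // "1.23%" -> a number, roughly 0.0123? Or a label?
  }
  return cell;                                     // everything else stays text
}
```

Any such heuristic will misfire on some client's data (leading-zero IDs, date-like strings, and so on), which is exactly the ambiguity XLSX's typed cells avoid.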


You're missing the point. You're just listing use cases for which CSV does not fit.

CSV's strengths are simplicity and ubiquity. It existed long before Excel and will probably outlive it. You can't say it's a mess because it doesn't help you parse "1.23%" reliably and consistently -- that's not CSV's job. To try another analogy: you can't say square pegs are poorly designed because you have round holes.


Then what exactly is CSV good for? I don't know of any use case where I need to programmatically exchange values without meaning in an open format.


CSV is good because it's simple (easy to read, easy to implement, hard to break) and it's ubiquitous. If you're just trying to dump data without knowing/caring what the consuming application will be it's a fine choice. By using XLSX you've both lost that ubiquity and introduced a world of headache -- it's a lot easier to break XLSX export than CSV export.

For concrete examples CSV is best when you want to release the data from your system but really have no idea what the client wants to do. Maybe they just want to curl it and display it, maybe they want to process it with R. It's a very easy way to say "here's your data, my job is done".


But you argued before that CSV is not meant for at least one of your examples. If I want to process it in R, I need to know the data type of the columns. And this is indeed a regular pain point in this workflow.

For displaying it _might_ work, if you just want to dump an ugly mess. If you don't want to do that, you need to know the type of the values, so you can e.g. right-align numbers in columns.

So CSV clearly fails in your examples. You might argue that we don't have a better format with support in so many applications. That might be true, but it doesn't make CSV good, nor does it make any of these failings not failings.


Protocol buffers don't tell you what the data types are in the actual message either.

It's a trade-off, but I prefer to communicate format out of band, not in every single message.


My take: include a schema when you can, when there's a justification, and when it's practical. IMO even JSON is a (very limited, "name only, no type") kind of schema. If it becomes impractical for something you're doing, stop and switch to something less redundant.


You can use jsonexport with rowDelimiter: '\t' for xls-compatible output.


It also behaves differently if you open the CSV from inside Excel rather than double-clicking the file in Explorer.

And we haven't even gotten to file encodings yet.


Just in case anyone doesn't know: be careful with opening CSVs in Excel, or just don't do it.

https://www.owasp.org/index.php/CSV_Excel_Macro_Injection
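A minimal mitigation sketch in the spirit of that OWASP page (the `sanitizeCell` helper below is hypothetical; see the page itself for the full attack and the caveats of this approach):

```javascript
// Cells starting with =, +, - or @ can be interpreted as
// formulas by Excel. Prefixing a single quote forces Excel
// to treat the cell as text. Hypothetical helper, with caveats.
const FORMULA_TRIGGERS = new Set(['=', '+', '-', '@']);

function sanitizeCell(value) {
  const s = String(value);
  return FORMULA_TRIGGERS.has(s[0]) ? `'${s}` : s;
}

console.log(sanitizeCell('=1+1'));   // '=1+1
console.log(sanitizeCell('hello'));  // hello
```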



Let's not forget that when you have "ID" as the first header (i.e. col 1, row 1), it assumes the CSV file is SYLK format.


The secret to using CSV in Excel is to use Data > From Text instead of opening the CSV file directly. This opens the import wizard which allows you to, among other things, change the separator character. It's still not perfect, but it's much better than opening the CSV file directly.


Same in Holland.

The 'fun' thing is that Excel for OS X does not do this, it uses commas.

We used to always just generate CSV files with semicolons since most of our clients were using Dutch Excel on Windows. As some of them moved to OS X, we've mostly been guessing what format to use.


I'm curious why New Zealand would have a semi-colon as the default. Is there a historical or cultural reason?


I checked and New Zealand actually uses a comma as the separator, not a semi-colon. You can verify this by going to Control Panel -> Region -> Change to: English (New Zealand) -> Additional settings... -> List separator

A semi-colon is generally used as the default list separator when the region/locale uses a comma as the decimal separator for numbers. For example, Dutch (Netherlands) uses a comma for the decimal separator (ex. 3,14) whereas in English (US) we use a decimal point (ex. 3.14). If a comma were used as the default list separator in such a region, then all floating point numbers would need to be quoted (ex. "3,14"), which would make the CSV file larger and also less human-readable.


Interesting. Certainly it was only a subset of customers who were complaining, and in the office it was a subset of Excel installations which could reproduce the problem. There's no doubt what the problem was and how to fix it. There is a doubt about exactly which Excel and Windows combinations are problematic, and how prevalent that combination is amongst New Zealand installations.


Nothing comes to mind. I think someone might have had a moment of unnecessary inspired creativity. The main point is, using this setting is a mistake for this context. Let CSV be CSV.


> I think Excel was finished circa 2000 and now they focus mainly on

Not breaking established behaviour?


Do you directly open the file? It used to ask during import.


For a <5 KB library, I recommend removing the underscore dependency. You should make this as lightweight as possible if you want a lot of adoption.


Converting to CSV is mainly a server-side problem, not something you'd want to do in the browser, so this doesn't really feel like an issue to me.


Thanks for the tip, you are welcome to submit a PR if you want to =)


Not OP, but I just submitted a PR removing underscore as a dep. :)



This is now available in v2.0.7 =)

Dependencies

None


Yeah this looks like it's only using helper functions like _.isFunction(), _.isNumber(), and _.extend().


The most popular NPM module for JSON to CSV conversion is json2csv: https://www.npmjs.com/package/json2csv https://github.com/zemirco/json2csv


Here's a graphical comparison of downloads of popular json to csv conversion libraries -

http://npmcharts.com/compare/json2csv,jsonexport,csv,fast-cs...

/shameless plug


json2csv requires you to specify every header, doesn't work well with nested objects, and requires json2csv-stream for large JSON files.


False, headers are optional. Nested objects are still a problem though :(


Thanks for clarifying


Cool. I immediately wondered if it handled a lot of the tricky CSV things like quotes and internal commas, and it looks like it does a good job.

Comparing against Numbers (probably a bad comparison?) I am seeing one slightly different result:

If I have a field that is "hehe"

It's encoded as ""hehe""

It looks like Numbers adds the enclosing quotes:

"""hehe"""

I only see enclosing quotes if there's a comma:

"hehe,hehe" -> """hehe,hehe"""

Anyhow, I'm not sure what the "correct" thing to do here is, if there is one, just a heads up!


CSV is pretty poorly standardized, but if you're going by RFC-4180, Numbers is definitely right, due to the interaction of two rules:

* double quotes aren't allowed inside a field that isn't double-quoted

* double-quotes that do appear in a field have to be escaped by preceding them with another double quote.

So by RFC-4180, I'm pretty sure ""hehe"" shouldn't be possible, and the way to represent "hehe" with the quotes is """hehe""".
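A sketch of an RFC-4180-style field encoder that reproduces the Numbers behaviour described above (the `encodeField` name is mine, not from any library in this thread):

```javascript
// RFC-4180-style field encoding: wrap a field in double quotes
// if it contains a comma, quote, or line break, and escape any
// embedded double quote by doubling it.
function encodeField(value) {
  if (/[",\n\r]/.test(value)) {
    return `"${value.replace(/"/g, '""')}"`;
  }
  return value;
}

console.log(encodeField('"hehe"'));     // """hehe"""
console.log(encodeField('hehe,hehe'));  // "hehe,hehe"
console.log(encodeField('plain'));      // plain
```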


Cool, yep, thanks for pointing out the RFC. I don't envy anyone making CSV parsers and basically assume they won't work when I need to use them :)


Here is all the logic related to this behavior: https://github.com/kauegimenes/jsonexport/blob/master/lib/co...


In the complex object example, what's the use case for turning that nested object into a multi-column CSV? I guess the overarching question is: why turn a single object into a CSV at all? It's just as machine/human readable as the JSON version of it, and there's no particular benefit in using either structure, as it's not data meant to be processed in bulk.

The complex array use case is where an opinionated type of conversion tool is particularly needed, but I wonder why it behaves like this:

   name: 'Robert',
   lastname: 'Miller',
   family: null,
   location: [1231,3214,4214]

    lastname,name,family.type,family.name,nickname,location
    Miller,Robert,,,,1231,3214,4214

Why not have `location_1, location_2, location_3`, instead of having a single location column? The latter implementation makes the data difficult to quickly use in a program (like a spreadsheet).
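A sketch of the flattening behaviour being suggested here (my own hypothetical `flatten` helper, not jsonexport's actual implementation): simple arrays expand into indexed columns instead of spilling into one cell.

```javascript
// Hypothetical flattener: nested objects become dotted column
// names, and simple arrays become indexed columns (location_1,
// location_2, ...) rather than one comma-filled cell.
function flatten(obj, prefix = '') {
  const out = {};
  for (const [key, value] of Object.entries(obj)) {
    const name = prefix ? `${prefix}.${key}` : key;
    if (Array.isArray(value)) {
      value.forEach((v, i) => { out[`${name}_${i + 1}`] = v; });
    } else if (value !== null && typeof value === 'object') {
      Object.assign(out, flatten(value, name));
    } else {
      out[name] = value;
    }
  }
  return out;
}

console.log(flatten({ name: 'Robert', family: null, location: [1231, 3214, 4214] }));
// { name: 'Robert', family: null, location_1: 1231, location_2: 3214, location_3: 4214 }
```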


This only happens if it's a simple array, but if you have something like

   name: 'Robert',
   lastname: 'Miller',
   family: null,
   location: [{city: 'A'}, {city: 'B'}]
you would get two CSV lines, one with location.city A and the other with B.


To answer your first question: a whole bunch of non-programmer types use CSV for everything. SAS and SPSS rule everything around them, and they do input from CSV. Source: I used to make good money doing the turn-x-to-CSV thing for academics all over.

As to the last, I got nothing. It's how I would have rolled.


Hey, why are you picking on me? :)


It's possible to implement the `location_1, location_2, location_3` behavior as an optional feature.

But I'm not sure what name to use for this option.


Maybe some variant on "Extract array elements to separate columns"? Or did you mean what to name the command-line option? Maybe a "--list-treatment=" option, which can be "concat" or "separate"?


From a Python developer's point of view, the design is a bit surprising - why isn't it created as two separate libraries:

1) a CSV writer (and reader?) which takes care of all the CSV dialect craziness; 2) a library which "flattens" nested objects/arrays?

(1) is not opinionated (besides a few API choices), it just has to be correct; it doesn't make much sense to re-implement (1) everywhere.

(2) can be more opinionated, it is easy to disagree with design choices, there is more room for personal preferences.

For example, in Python there is the csv stdlib module, and for (2) there are libraries like https://github.com/scrapinghub/flatson. Why put both in the same library? Is it something idiomatic in the node.js world, with a deeper reason to design libraries this way (e.g. download size), or is it just an oversight?


It could be. The question is: why should it be?

Everything you said makes sense if those are the goals, but if they aren't... well, why should they be?

The design you suggest would be done by some module authors for sure, but there's no reason it needs to be that design.


Converting JSON to CSV is a usable/good approach when analysing NoSQL data. However, I've been trying for a while to find a data analysis/query tool for JSON data itself. I saw that Mixpanel made a query tool/language called JQL (JavaScript Query Language) which seems nice. RQL (the RethinkDB query language) seems nice too. But I can't find a simple tool in which I can import my JSON data and query it. It should be similar to SQLite-browser: simply import your CSV/JSON file, and query it.


(I work at Mixpanel)

Honestly you could probably write a quick pipeline to dump your json data into Mixpanel and then use JQL -- it would be a little hacky but if you have less than a few million rows it shouldn't be too much work (and would still be free at that volume).


What do you think about this syntax? https://github.com/mmckegg/json-query


There's a similar command in pandas called json_normalize (http://pandas.pydata.org/pandas-docs/stable/generated/pandas...)


Nice! If you want to do something similar with Excel xlsx, this works: https://www.npmjs.com/package/jsonexcel Not sure how it handles huge arrays though...


Thanks for sharing. You can also use jsonexport with rowDelimiter: '\t' for xls-compatible output.


This is a new one. Most people want to go the opposite way.

Either way, neat addition.


> Most people want to go the opposite way.

When you want to easily provide something Excel/GSuite/... will take in without using a heavier Excel-compatible library, CSV can be quite decent.


Just yesterday I was looking for the other way around, CSV->JSON.


- Wrap the entire file with { and }

- Wrap each line with [ and ]

- eval the file

;).
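For anyone tempted by the eval route, a slightly less scary sketch for the trivial unquoted case (my own toy `naiveCsvToJson`, which deliberately ignores quoting, embedded commas, and everything else a real parser handles):

```javascript
// Naive CSV -> array-of-objects conversion. Assumes no quoted
// fields, no embedded commas, and \n line endings -- real-world
// data needs a proper CSV parser.
function naiveCsvToJson(text) {
  const [header, ...rows] = text.trim().split('\n');
  const keys = header.split(',');
  return rows.map(row =>
    Object.fromEntries(row.split(',').map((v, i) => [keys[i], v]))
  );
}

console.log(naiveCsvToJson('name,age\nAlice,30'));
// [ { name: 'Alice', age: '30' } ]
```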


It wasn't neatly quoted, which I assume is a must for JSON, but yeah, I ended up going down a different path.



