Dataframes – Julia, R, Python (ajkl.github.io)
106 points by ajinkyakale on Dec 24, 2014 | 36 comments


There was a very interesting design discussion by JMW on the julia-dev forum about nullable arrays and column dtypes:

https://groups.google.com/forum/#!topic/julia-dev/hS1DAUciv3...

Even so, right now Pandas is miles ahead of the Julia equivalent. Pandas was the brainchild of Wes McKinney, an amazing coder who really, really cared about speed (and who recently also made a lot of money selling his startup to Cloudera; good for him!). The things you can do in Pandas with multi-index selects, joins on multiple keys, etc., are outright incredible.
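
A minimal sketch of the kind of thing meant here, with made-up data (not from the article): a multi-index selection and a join on several keys.

  import pandas as pd

  # Toy data with a two-level index (region, year)
  df = pd.DataFrame({
      "region": ["east", "east", "west", "west"],
      "year":   [2013, 2014, 2013, 2014],
      "sales":  [10, 12, 7, 9],
  }).set_index(["region", "year"])

  east_2014 = df.loc[("east", 2014)]   # select on both index levels at once

  targets = pd.DataFrame({
      "region": ["east", "west"],
      "year":   [2014, 2014],
      "target": [11, 8],
  })
  # join the two frames on multiple keys
  merged = df.reset_index().merge(targets, on=["region", "year"])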


I have been learning J for about a year now, and I will try out these examples in it to see how it goes. It offers memory-mapped files and quick array operations. It is APL-based, has been around since 1990, and is open source. Wes McKinney seems to think it is a good way to go:

https://twitter.com/wesmckinn/status/341317411607293953


I've seen a lot about Julia in the last few months, and it seems like a good language (performance and fairly nice syntax). For me, what makes R a very good choice is RStudio. Being able to play with your data there and save it all for later is one of the biggest reasons to use R. The Python equivalent would be Emacs org-mode, which is great, but not as graphical as RStudio.

Julia seems like a good language, and maybe someday I will jump on it, but to the evil minds out there planning to write yet another language: please stop, we already have great languages! I can't keep up with the learning, and it is so damn difficult to even start a project with so many choices!


If you're looking for a nice graphical way to play with data in Python, may I suggest IPython Notebook [0]? It's not always easy to configure, but it's maturing fast and lets you have Python code, Markdown, and graphs in one place, not to mention the 21 other languages available natively or as add-ons [1].

[0]: http://ipython.org/notebook.html

[1]: https://github.com/ipython/ipython/wiki/IPython%20kernels%20... (I'm counting the two Perl kernels as one language and not counting Calico or the example kernel.)


Another great graphical way to play with data in Python is Spyder. It has an RStudio/MATLAB sort of interface.

https://code.google.com/p/spyderlib/

http://en.wikipedia.org/wiki/Spyder_(software)


There is a version of IPython that works with Julia too.


Yeah, you can run IPython Notebook with a Julia kernel.


I just checked out IPython Notebook. It looks like MATLAB Notebook, but in the browser instead of MS Word (I don't know if MATLAB Notebook still exists; it's been a long time since I last used Windows). Anyway, you should try org-mode in Emacs for Python: it is way more versatile, exports to LaTeX and HTML, and has almost all the features of the IPython Notebook. Still, RStudio has a lot of graphical conveniences, like data editing, environment variables, and easy-to-follow documentation. Both org-mode and RStudio are very powerful tools; I was mostly ranting about how many options we have for doing almost the same thing.


RStudio is bringing org-mode to those of us who don't (know|want to learn) Emacs. I like it a lot.


I use IJulia, which is based on IPython notebooks. Try juliabox.org, which is a notebook (and more) offering by the Julia folks. Then there are Juno and Julia Studio if you are inclined towards an RStudio-like interface.


How about PyCharm or Eclipse+PyDev (I've personally heard more praise for the former)? I use Emacs with ESS or python-mode, so I can't comment on the IDEs too much, but being able to use the same platform for both has been convenient for me.


For dataframe-like operations, I've started wondering why more languages don't take the dplyr approach and simply default to using something like SQLite under the hood. Granted, I am no super data genius, but every time I crack a little into the internals of a dataframe implementation, I get the sinking feeling that SQL databases have already done the hard work of indexes and efficient data structures.
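
A rough sketch of the idea in Python, assuming you just want SQLite to do the filtering (table and column names are made up):

  import sqlite3
  import pandas as pd

  df = pd.DataFrame({"user_id": [1, 2, 3], "score": [10, 20, 30]})

  con = sqlite3.connect(":memory:")       # throwaway in-memory database
  df.to_sql("scores", con, index=False)   # let SQLite hold the table

  # push the query down to SQLite, get a dataframe back
  high = pd.read_sql_query(
      "SELECT user_id, score FROM scores WHERE score >= 20", con)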


I thought that before writing dplyr, but now I see that there are big differences. Relational databases are designed to work with large datasets on disk and to accept changes very rapidly. The demands of in-memory data analytics are quite different. Columnar data stores are a better fit, but it's pretty easy to bang out efficient code for in-memory data; it's much harder to work with out-of-memory data.


> large datasets on disk

I saw this benchmark a while back comparing Pandas to SQLite in-memory databases. While Pandas did edge out SQLite in several areas, it was by well under an order of magnitude: http://wesmckinney.com/blog/?p=414

Pretty solid performance plus the ability to work with large datasets on disk seemed like a pretty big win to me. I could imagine a set of SQLite extensions (a la spatialite) that could further optimize for various data.frame use cases. As an added bonus, the same libraries would be very portable between different languages, even languages that don't currently have something like dataframes.

EDIT: What I don't know about is memory efficiency. Perhaps SQLite isn't memory-efficient, but I wouldn't bet against it.


I personally switched from Pandas to SQL. While Postgres is a heavy-duty database for large production operations, it is fully capable of doing day-to-day analysis of CSV files with nice SQL syntax.

There were two reasons for the switch. The first is that SQL syntax is cleaner and better understood by others. The second is that if you get a dataset bigger than memory, you aren't stuck.


That benchmark is only for joins? That's a pretty small part of analytic workflows in my experience.


That's fair. Now I'm curious how a more complete set of benchmarks would look using in-memory SQLite, and what the opportunity for extension would be.


Unfortunately the datasets in that benchmark are less than 3MB each, so everything fits entirely in cache. It doesn't give a good indication of how well the function/implementation scales to the bigger data sizes that really matter (in terms of computation time, memory, cache efficiency, etc.). How much does one really care about 0.018 vs 0.023 seconds?


Check out Python Blaze: a Pandas- and LINQ-style frontend to lots of different backends.

Soon there will be out-of-core (OOC) array ops. http://blaze.pydata.org/docs/dev/index.html http://nbviewer.ipython.org/url/blaze.pydata.org/notebooks/t...


In Julia, you can create symbols with e.g. :user_id as in Ruby, which looks a lot nicer than symbol("user_id") and doesn't require mapping over an array of strings.


Of course you can! I don't know why I didn't think of that.


The example which claims to get "all the rows from the 50th row to the 55th row" is broken, since Python is zero-based whereas Julia and R are one-based. The 50th row in Python is at index 49, so the code is not equivalent between the examples.
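
To illustrate with toy data (assuming the Python example slices by position): matching R's one-based df[50:55, ] requires starting the slice at 49.

  import pandas as pd

  df = pd.DataFrame({"x": range(100)})   # 100 toy rows

  rows_50_to_55 = df.iloc[49:55]   # positions 49..54 = the 50th..55th rows
  off_by_one    = df.iloc[50:55]   # positions 50..54 = the 51st..55th rows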


Glad they added .query() to Pandas in 0.13. In general I find the methods in Pandas/NumPy much more consistent with general programming constructs than most of what I see in R. No doubt Hadley has bolted a lot of great functionality onto R, but the Rcpp dependency/GPL license is a turn-off.
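
For reference, .query() takes an expression string instead of a boolean mask; a small sketch with made-up columns:

  import pandas as pd

  df = pd.DataFrame({"age": [25, 32, 47], "dept": ["eng", "ops", "eng"]})

  # expression-string filtering (pandas >= 0.13)
  senior_eng = df.query("dept == 'eng' and age > 30")

  # the equivalent boolean-mask version
  senior_eng_mask = df[(df["dept"] == "eng") & (df["age"] > 30)]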


What's wrong with Rcpp?

And what's wrong with GPL? Unless you're planning on distributing your code, you shouldn't even need to think about it.


Good showing of three good data languages. It's strange: I was a Python guy for a long time, but Pandas just looks odd to me now, since I switched to R two years ago.

Seems like I need to dive into Julia again. I haven't in over a year.


Interesting. Although after spending a lot of time on data frames, I have grown to like the CSV parsing in Postgres, where I can do a lot of the same things as with data frames, but with clean SQL instead of the sometimes odd data frame syntax.

R also stands out because it is so easy to run a wide variety of statistical methods on a data frame.


It's amazing to think that R (or S before it) has had data frames since the '70s, and only now are other languages implementing them. There are some quirks of course, and Pandas introduced some convenient features. But the R community has also provided its own improvements in the form of data.table and, now, dplyr.


The data.table package by Matt Dowle definitely deserves a mention! It's fast, and I like the indexing functionality it provides. The benchmark timings are pretty impressive.


@ajinkyakale, thanks. What would also be interesting is to benchmark memory usage in addition to runtime.


I should have mentioned you (arun_sriniv) as the co-developer of data.table! Thanks for all the hard work. And yes, memory usage will be interesting, as that is the bottleneck when it comes to large datasets. I am working on something along those lines. Will post something soon :)


No worries :-). And glad to hear you're working on it! Let me know if I can be of any help.


Anyone doing R comparisons should use data.table instead of data.frame, especially for benchmarks. data.table is the best data structure/query language I have found in my career. It's leading the way in the R world and, in my view, in all the data-focused languages.


Data tables are extremely fast, but I think their concision makes them harder to learn, and code that uses them is harder to read after you've written it. It's very reminiscent of APL.


data.table's `DT[i, j, by]` is actually quite consistent and comparable to SQL: i = WHERE, j = SELECT | UPDATE, and by = GROUP BY.

This form is always intact. For example:

  require(data.table)  
  DT = data.table(x=c(3:7), y=1:5, z=c(1,2,1,1,2))

  DT[x >= 5, mean(y), by=z]        ## calculates mean of y while grouped by z on 
                                   ## rows where x >= 5

  DT[x >= 5, y := cumsum(y), by=z] ## updates y in-place with its cumulative sum 
                                   ## while grouped by z on rows where x >= 5
"Harder to read after you've written it" and "harder to learn" are both very subjective and pointless. One could make very similar observations about `dplyr`, but I'll refrain from that here.

I implore readers to take a look at the 100+ reviews from users of the package on crantastic: http://crantastic.org/packages/data-table

Keeping the `i`, `j` and `by` operations together allows optimising for speed and, more importantly, memory usage (all under a consistent syntax). Those are two very important aspects, especially when working on really huge data sets (10-100GB in RAM or more).

Here's a detailed benchmark (only on grouping so far) on datasets from 10 million rows (MB scale) to 2 billion rows (100GB): https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A...


I agree with what Hadley said in some ways. It takes a bit more time to get used to the [i, j, by] notation, and I personally feel it's unlike most R syntax. But I don't see that stopping me from using something as fast as data.table.


ajinkyakale, "harder to learn" doesn't acknowledge the fact that data.table provides many features that, for example, dplyr just doesn't. And in addition, it is fast and memory-efficient.

Rolling joins, for example, are a slightly harder concept to grasp because most of us don't know what a "rolling" join is (unless you work regularly with time series).

Aggregating while joining is hard to grasp not because the syntax is hard, but because the concept is inherently new. It allows us to perform operations in a more straightforward manner, which most people embrace after investing some time to understand it.

Binary-search-based subsetting, e.g. DT[J(4:6)], is again a concept that's new. One could use base R syntax and vector scans to subset. But once you learn the difference between vector scans and binary search, you obviously don't want to vector scan. Now we could say that learning the difference between a "vector scan" and a "binary search" is really hard, but that would be missing the point.

DT[x %in% 4:6] now internally uses binary search by constructing an index automatically! So you can keep using base R syntax.

And dplyr doesn't have any of these features.

In short, a huge part of the "bit more time to get used to it" comes from data.table introducing concepts for faster and more efficient data manipulation that aren't available in other tools/packages. And I say this as a data.table user turned developer.

"harder to read after writing it" is very very subjective. I don't know what to say to that.



