Hacker News

data.table's `DT[i, j, by]` is actually quite consistent, and maps directly onto SQL: i = WHERE, j = SELECT | UPDATE, and by = GROUP BY.

This form always holds. For example:

  require(data.table)
  DT = data.table(x = 3:7, y = 1:5, z = c(1, 2, 1, 1, 2))

  DT[x >= 5, mean(y), by=z]        ## calculates mean of y while grouped by z on 
                                   ## rows where x >= 5

  DT[x >= 5, y := cumsum(y), by=z] ## updates y in-place with its cumulative sum
                                   ## while grouped by z on rows where x >= 5
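To make the i = WHERE, j = SELECT, by = GROUP BY mapping concrete, here are rough SQL counterparts of the two calls above, written as comments (a sketch; the second mapping is approximate, since a grouped cumulative sum in SQL needs a window function and an explicit ordering, whereas data.table uses row order):

  DT[x >= 5, mean(y), by = z]
  ## SELECT z, AVG(y) FROM DT WHERE x >= 5 GROUP BY z;

  DT[x >= 5, y := cumsum(y), by = z]
  ## roughly: SUM(y) OVER (PARTITION BY z ORDER BY <row order>)
  ## applied as an in-place update to the matching rows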
"Harder to read after you've written it" and "harder to learn" are very subjective claims, and not very useful ones. One could make very similar observations about `dplyr`, but I'll refrain from doing so here.
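For what it's worth, the first example above has a direct dplyr counterpart (a sketch, assuming dplyr's filter/group_by/summarise verbs; `mean_y` is just an illustrative column name):

  library(dplyr)
  ## same grouped aggregation, spelled out as a pipeline of verbs
  DT %>% filter(x >= 5) %>% group_by(z) %>% summarise(mean_y = mean(y))

Whether the pipeline or the single `DT[i, j, by]` call is "easier to read" is exactly the kind of subjective judgment in question.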

I'd encourage readers to take a look at the 100+ reviews from users of the package on crantastic: http://crantastic.org/packages/data-table

Keeping the `i`, `j` and `by` operations together, under one consistent syntax, allows data.table to optimise for speed and, more importantly, memory usage. These are two crucial aspects when working on really huge data sets (10-100GB in RAM or more).
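A concrete illustration of why keeping the operations together matters (a sketch; the chained form below builds an intermediate table that the single call avoids exposing):

  library(data.table)
  DT = data.table(x = 3:7, y = 1:5, z = c(1, 2, 1, 1, 2))

  ## one call: i, j and by are visible to data.table at once, so the
  ## grouped mean is computed over just the rows matching x >= 5,
  ## with no separate intermediate data.table handed back to R
  DT[x >= 5, mean(y), by = z]

  ## chained form: [x >= 5] first allocates a new data.table holding
  ## the subset, and only then does the grouping run over that copy
  DT[x >= 5][, mean(y), by = z]

On a 100GB table, the intermediate copy in the chained form is exactly the kind of allocation you cannot afford.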

Here's a detailed benchmark (so far covering only grouping), from 10 million rows (hundreds of MB) up to 2 billion rows (100GB): https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A...


