data.table's `DT[i, j, by]` is actually quite consistent, and is comparable to SQL: `i` = WHERE, `j` = SELECT | UPDATE, and `by` = GROUP BY.
This form is always intact. For example:
require(data.table)
DT = data.table(x=c(3:7), y=1:5, z=c(1,2,1,1,2))
DT[x >= 5, mean(y), by=z] ## calculates mean of y while grouped by z on
## rows where x >= 5
DT[x >= 5, y := cumsum(y), by=z] ## updates y in place with its cumulative sum
                                 ## while grouped by z on rows where x >= 5
"Harder to read after you've written it" and "harder to learn" are all very subjective and pointless. One could make very similar observations about `dplyr`, but I'll refrain from it here.
Keeping the `i`, `j` and `by` operations together allows data.table to optimise for speed and, more importantly, memory usage, all under one consistent syntax. These are two very important aspects when working on really large data sets (10-100GB in RAM or more).
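As a small sketch of why seeing `i`, `j` and `by` together matters: the one-expression form below lets data.table subset, group and aggregate in a single pass, whereas the two-step version materialises an intermediate copy of the filtered table first. (This is an illustrative example using the same toy `DT` as above; the internal optimisation details depend on your data.table version.)

```r
library(data.table)
DT <- data.table(x = 3:7, y = 1:5, z = c(1, 2, 1, 1, 2))

## One expression: filter (i), aggregate (j), group (by) are visible
## to data.table at the same time, so no intermediate table is needed.
res <- DT[x >= 5, .(mean_y = mean(y)), by = z]

## A two-step equivalent that builds a filtered copy of DT first,
## which costs extra memory on large data:
tmp  <- DT[x >= 5]
res2 <- tmp[, .(mean_y = mean(y)), by = z]

stopifnot(identical(res, res2))
```

Both give the same answer; the difference is that the combined form gives the query optimiser the whole picture at once.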
"Harder to read after you've written it" and "harder to learn" are all very subjective and pointless. One could make very similar observations about `dplyr`, but I'll refrain from it here.I implore the readers to take a look at over 100+ reviews on crantastic: http://crantastic.org/packages/data-table from users of the package.
Here's a detailed benchmark (on grouping only, so far) ranging from 10 million rows up to 2 billion rows (100GB): https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A...