Readings in Database Systems, 5th Edition (2015)

makmanalp · on Oct 9, 2017

While this is a seminal work, I definitely wouldn't approach this like I would approach a textbook: it's definitely not meant to be friendly introductory material. That said, once you have a bit of background, it's a goldmine of a survey. You'll notice that each section is a short, few-page long introduction, but the bulk of the material is in the papers themselves, which can be significantly tougher to read. Though it's great that the summaries are friendly and help you contextualize the papers. My tip is to read papers starting with the introduction, and then the conclusion, and then decide if you want to dive into the rest of the paper to track down the evidence for specific claims.

CalChris · on Oct 10, 2017

It's organized as the most likely papers that you'd read in a graduate level database seminar. Readings in Computer Architecture is the same.

godelmachine · on Oct 10, 2017

Hi Chris, I searched far and wide for a PDF / soft copy of Readings in Computer Architecture. I searched for the same thing 6 months ago, but to no avail. Would you kindly provide a PDF or a soft copy? Sincerely,

Fundlab · on Oct 10, 2017

How about your local library?

godelmachine · on Oct 11, 2017

I live in Pune right now. Can't think of any such library. Libraries here aren't equipped well with such technical subjects.

CalChris · on Oct 10, 2017

Go onto Amazon and get the TOC. Take that to Google Scholar.

muramira · on Oct 9, 2017

Yesterday, https://news.ycombinator.com/item?id=15428526 hit number 1 on HN. Having read the two books, I strongly believe that they not only complement each other, but also must be required reading for any data engineer.

fizixer · on Oct 9, 2017

Does it have something along the lines of building your own RDBMS from scratch? (If not, any recommendations?)

edit: google search has potentially promising results, https://www.google.com/search?q=build+your+own+rdbms

lbruck · on Oct 10, 2017

I like "Architecture of a Database System" by Stonebraker, Hamilton, and Hellerstein (http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf) for an overview and then "Transaction Processing: Concepts and Techniques" by Gray and Reuter (https://www.amazon.com/Transaction-Processing-Concepts-Techn...) for the storage-side of things. Granted these are a little old (especially G&R) so extra thought must be given for modern hardware (memory, CPU performance, processor counts, network, disks, etc) as well as distributed processing, replication, and consensus.

makmanalp · on Oct 9, 2017

This book is more about different types of tradeoffs you can make in terms of your system design. I'd recommend looking at grad databases courses instead, e.g:

- http://db.cs.cmu.edu/courses/ - http://daslab.seas.harvard.edu/classes/cs165/ - http://daslab.seas.harvard.edu/classes/cs265/

fizwhiz · on Oct 9, 2017

I really wish this class (or the Harvard class referenced below) were offered as a MOOC w/ some certification. I rarely find classes around OS/databases offered as MOOCs, which is a pity because those are the things I'd love to spend time on.

makmanalp · on Oct 9, 2017

Part of the problem is that doing a decent MOOC takes a TON of preparation and effort, much more so than a regular class. The professor who runs the class (Stratos Idreos) has a billion things that he's working on, so turning it unto a MOOC would require some outside support, probably. That said, releasing the videos might be a possibility, I'll ask and see. I believe Andy Pavlo's class has videos online already.

The other part is that in the Harvard classes specifically, the class discussion is a huge part of the class.

hackermailman · on Oct 9, 2017

This course does, has lectures on youtube http://15721.courses.cs.cmu.edu/spring2017/

jasonwatkinspdx · on Oct 9, 2017

I have a copy of https://www.amazon.com/gp/product/0130402648 and while I don't think it'll win any "best textbook ever" awards it presents all the basic concepts in a straightforward if somewhat simple way.

mamcx · on Oct 9, 2017

I also looking for info on this.

I will be very happy in how build a sqlite-like DB engine.

All the answer so far is "read the sqlite code". As if everyone know low-level C or DB design!

alexnewman · on Oct 9, 2017

Everyone I have met who have worked a long time in the database industry considers stonebreaker to be

- Overrated - Overly Self Promoting - Mostly not credible

That being said I love his work and his history. The red book is super famous. What gives?

carefulfungi · on Oct 10, 2017

Stonebraker is quite down to earth to work with - he cares about practical ideas and systems that can be plausibly constructed. You can decide if one of the primary collaborators of Ingres, Postgres, Streambase, Vertica, VoltDB, Tamr... is over-rated. Certainly a number of commercial efforts didn't find great commercial success but the technology he participated in pioneering has found its way into almost every modern database system in use.

bambu · on Oct 10, 2017

Not credible? Stonebraker is a Turing Award winning MIT professor who's worked in the database industry for ~50 years.

Ask your friends what gives

frankmcsherry · on Oct 10, 2017

Stonebraker has done a lot of fine work, but also has a lot of false negatives; he got down on recursive queries, on data-parallel compute, and generally most other things that he's not directly working on. It ends up with an in-crowd of folks who think he is great, and an out-crowd of folks frustrated by his constant unfounded dumping.

I've personally found it is hard to take him seriously when he says a thing, and you should first check which company he is currently flogging. Doesn't mean he is wrong, .. but check.

alexnewman · on Oct 9, 2017

I should say I have been writing production databases for almost 10 years. The people I talked to have more than 20 years exp

alexnewman · on Oct 9, 2017

also sorry for the last name pun

stavros · on Oct 10, 2017

What's the pun?

stavros · on Oct 9, 2017

Is there an epub of this?

pfooti · on Oct 10, 2017

This seems like what you are looking for.

https://unglue.it/work/153041/

stavros · on Oct 11, 2017

Thank you very much.

pagnol · on Oct 9, 2017

Yes, there is.

0xFFC · on Oct 9, 2017

Is this new edition? What have been changed?

tjr · on Oct 9, 2017

New in 2015. See also: https://news.ycombinator.com/item?id=10694538

jonsen · on Oct 9, 2017

I think the preface tells.

zzzcpan · on Oct 9, 2017

Redbook is too biased, there is just too much perspective from RDBMS people, which is not relevant in modern distributed environments or even outright incorrect.

Redbook inspired list by Christopher Meiklejohn [1] is a better alternative, or Aphyr's course outline [2].

[1] https://github.com/cmeiklejohn/cmeiklejohn.github.io/blob/ma...

[2] https://github.com/aphyr/distsys-class

elvinyung · on Oct 9, 2017

(Disclaimer: I work at Databricks.)

RDBMS techniques are absolutely relevant in modern distributed environments. It has become increasingly clear that MapReduce is too low-level a programming model for query processing, so modern distributed dataflow systems are increasingly hybridizing with RDBMS-like interfaces and optimizations (e.g. Spark dataframes).

makmanalp · on Oct 9, 2017

For OP's benefit, here are some excerpts from the red book that agree with that premise:

> Google MapReduce set back by a decade the conversation about adaptivity of data in motion, by baking blocking operators into the execution model as a fault-tolerance mechanism. It was nearly impossible to have a reasoned conversation about optimizing dataflow pipelines in the mid-to-late 2000’s because it was inconsistent with the Google/Hadoop fault tolerance model. In the last few years the discussion about execution frameworks for big data has suddenly opened up wide, with a quickly-growing variety of dataflow and query systems being deployed that have more similarities than differences

http://www.redbook.io/ch7-queryoptimization.html

Also see Stonebraker's comment at the bottom here:

http://www.redbook.io/ch5-dataflow.html

edit:

To be more charitable, Mapreduce's main concern was fault tolerance (and recovery) and massive scalability, at the cost of all else. Since it's so simple, you could have subtasks die, disappear, and yet you can just respawn them and keep on chugging through the query. You also don't think too hard about job allocation. It's easy to build and use, easy to reason about. You can throw more computers at it when you have a spike of jobs, and it scales fairly predictably. Not many people were really running infrastructure and jobs at the scale google did, and that's quite different from the traditional "data warehouse" style application, and so it wasn't entirely unjustified. The other benefit, of course, is that you can perform arbitrary computation, which is quite different from most RDBMSes which often don't have great UDF support or are frequently highly restricted and, frankly, horrific to deal with.

Of course, they quickly found that "no query optimization" is sort of an extremist and unproductive position, and that you can have a bit of either or both cakes as needed.

frankmcsherry · on Oct 9, 2017

For further reading about the high quality of database query optimization, and how far back MR et al must have set things, recent SIGMOD work managed to get to within 1000x of a single-threaded implementation (and so, not quite within that of data-parallel systems):

https://github.com/frankmcsherry/blog/blob/master/posts/2017...

I don't use databases because they are really quite bad at computation.

In my opinion, the main recent novelty in query planning has been the work on worst-case optimal joins, stuff like EmptyHeaded[1] and the recent FAQ work[2].

[1]: https://arxiv.org/abs/1503.02368

[2]: https://arxiv.org/abs/1504.04044

makmanalp · on Oct 10, 2017

That's not quite what I was going for. I enjoyed your article because I agree with your overall point that distributing stuff has an overhead, and also it points to a definite problem in how some research work is portrayed - possibly having to do with what the incentives are in reviewing and publishing.

However I think you're taking your argument to the extreme in a way that doesn't really apply here. First, from what I understand, graph databases are still far from well understood and don't really represent the best in query optimization. This is in no way representative of RDBMS query optimizers for usual OLAP/OLTP tasks, and not what we're talking about right now. Something like SAP HANA or Redshift or experimental systems like HyPer and MonetDB or even something like impala would represent that literature better here. Or check out MapD, which uses GPUs for parallelism. Or kdb+, which has existed for forever and is well-known to do a great job parallelizing and offers a rich query syntax.

Indeed, when I look at the emptyheaded paper for example, I see SIMD parallelism, query compilation join optimization, all stuff that was first developed in the context of RDBMSes. Surprise, surprise, when you apply tried and true techniques in the context of a new problem, you see drastic improvements. This is pretty much exactly the point that Stonebraker and the others above are making: MapReduce was kinda like the graph databases you tested: they were hyper-focused on one functionality, and missed the memo on decades of many other basic optimizations. They're certainly not the only one guilty of this.

> I don't use databases because they are really quite bad at computation.

Well, if computation's all you need ... I mean, I hope you're kidding here, but there are reasons other than performance that you'd want to have a parallel system, e.g. your working set doesn't fit in memory, or you need to minimize downtime. Granted, these are not problems that are common. Also there's many reasons you want a database over a hand-rolled solution: you need to concurrently serve a lot of queries, including insertions and updates, you have to do well on many different types of queries rather than just a single one, etc etc etc.

Also, /what/ system? Bad at /what/ computation? There's so many different systems for so many different workloads that I can't believe you can seriously make such a statement. If you're saying RDBMSes are bad at graph computations, then sure. That's unsurprising. But that's not what we were talking about! :-/

frankmcsherry · on Oct 10, 2017

> Indeed, when I look at the emptyheaded paper for example, I see SIMD parallelism, query compilation join optimization, all stuff that was first developed in the context of RDBMSes.

The main contribution of EH is not the use of SIMD, it is the implementation of new WCO join execution strategies that hadn't been developed in the past 40 years of RDBMSes.

If you wanted that behavior, with its orders-of-magnitude performance improvements, you could not get it from an existing optimizing RDBMS---not HyPer, nor MonetDB, nor anything else in your list---but you could get it from a more programmable data-parallel system.

> Well, if computation's all you need ...

It is a thing I need, which is exactly the point. If the RDBMSes don't provide the performance (or anything within 100x) you can get from a more programmable system, you need a different solution.

Stonebraker's claim was that MR was a huge step backwards, which is BS to the extent that RDBMSes weren't solving the problems Google (and others) had. No amount of fantasy query optimization was going to take SQL to the performance of MR or MPI codes (even circa 2009, Vertica still had no support for UDFs).

You are of course welcome to list other things that RDBMSes are good at, and that's great. However, Stonebraker's claim isn't that RDBMSes have some value (which everyone I know agrees with), his claim is that MR was a shit model and everyone should be using RDBMSes instead (preferably his).

> If you're saying RDBMSes are bad at graph computations, then sure. That's unsurprising. But that's not what we were talking about! :-/

Remind me what that was, then? It seemed like we were talking about whether there was a heavy pro-RDBMS bias in the redbook, which I think is (i) fair, and (ii) fine. I also think Stonebraker is wrong in his claim that MR set things backwards because (as I referenced) RDBMSes weren't there to be set back from. If anything, it prompted a great deal of new work that led to improvements in areas he was blind to. A concrete example of this (e.g. iterative computation) seems totally on-topic.

alexnewman · on Oct 10, 2017

Heh didn’t Greenplum solve most of the problems google or yahoo had, just at a huge cost. In retrospect now that it’s open source software... I think one of his points was that having horn data unindexed is a big step backwards. I think he’s crazy. Btw frank I love your work on differential data flow!

makmanalp · on Oct 10, 2017

> MR was a shit model and everyone should be using RDBMSes instead

Ah, yes, sorry, I didn't mean to make it sound like I agree with this.

> If you wanted that behavior, with its orders-of-magnitude performance improvements, you could not get it from an existing optimizing RDBMS---not HyPer, nor MonetDB, nor anything else in your list---but you could get it from a more programmable data-parallel system.

Of course! I now realize that you're using terminology in a way that I'm not familiar with, e.g. "computation" meaning something like "arbitrary computation", which upon rereading, makes me understand and agree with a lot more of what you're staying.

What bothered me about your comment was that it sounded a bit like "wow, I found that RDBMSes suck at this specific type of computation that they're not built to deal with, therefore query optimizers suck in general", which seemed like an over-reaching argument.

When the claim is "there exists computations that RDBMS query optimizers suck at", then absolutely, I agree with you to the ends of the earth. If it's also "there's reasons why you want a more MR-like model", again, I agree completely. The point is that having query optimizers and different computational models are separate decisions that don't affect each other - you can have both.

> Stonebraker's claim was that MR was a huge step backwards, which is BS to the extent that RDBMSes weren't solving the problems Google (and others) had.

I guess what I took away from his claim was that the contribution of MR itself was not the problem, but the fact that while creating that model, they ignored a lot of other learnings: e.g. blocking operators can be detrimental, indexes are handy to have. Plus the fact that everyone /else/ who didn't have Google's reasons to forego all those niceties still dove head-first into "let's use MR for everything".

And that's what Elvin is talking about above - you're now seeing examples of tools where lessons from both camps are being applied: "MR-like but be smart enough to not scan everything" (e.g. spark).

> It seemed like we were talking about whether there was a heavy pro-RDBMS bias in the redbook

Ah, for me the main question and discussion above was "are RDBMS techniques even relevant at this point", to which my answer is yes, absolutely. That doesn't mean you have to take every concept from it wholesale: many techniques that developed in one context are applicable in others, regardless of Stonebraker's opinions.

I think also maybe you see everything as firmly in the RDBMS / SQL camp or firmly in the "not" camp, but I really don't think that's the case. E.g. stuff like flink, where we have a lower layer API for arbitrary computations, and higher level APIs for stuff like SQL which compiled down to the lower level language, and get query-optimized different ways in different layers, for example. Or they do some neat join-optimizations. So even with newer computational models there's stuff to learn from old ways, so it's worth it to read the darned book. That's my point here.

frankmcsherry · on Oct 10, 2017

> What bothered me about your comment was that it sounded a bit like "wow, I found that RDBMSes suck at this specific type of computation that they're not built to deal with, therefore query optimizers suck in general", which seemed like an over-reaching argument.

Gotcha. Yes, it was more "I have some things I need to do, and RDBMSes can't do some of those things, which rules them out as a solution". There is for sure lots of great stuff in query optimization, and it makes some queries lots better.

> I guess what I took away from his claim was that the contribution of MR itself was not the problem, but the fact that while creating that model, they ignored a lot of other learnings: e.g. blocking operators can be detrimental, indexes are handy to have. Plus the fact that everyone /else/ who didn't have Google's reasons to forego all those niceties still dove head-first into "let's use MR for everything".

They did not ignore them, they just weren't building a database. MR was much more a scalable HPC replacement than a data management product.

The main reason that the DB community took a huge step backwards is that they (incl Stonebraker) had doubled-down on mediocre compute abstractions, and found they needed to revisit much of what they'd done, because it just didn't work.

> Ah, for me the main question and discussion above was "are RDBMS techniques even relevant at this point", to which my answer is yes, absolutely. That doesn't mean you have to take every concept from it wholesale: many techniques that developed in one context are applicable in others, regardless of Stonebraker's opinions.

I totally agree with you that they are relevant (and am active in the area). Stonebraker takes a much stronger position, and the appeal to his authority was what triggered me. I didn't mean to point that at you as much as it may have turned out.

cgag · on Oct 10, 2017

That post was both technically interesting and hilarious, thanks

alexnewman · on Oct 9, 2017

This is an example of stonebraker being crazy

zzzcpan · on Oct 9, 2017

I agree, but you have to look at RDBMS from distributed systems perspective.

jchanimal · on Oct 9, 2017

The new consensus algorithm families as exemplified by Google’s Spanner and FaunaDB, my employer, very much make the relational model relevant to distributed systems. The important achievement is support for global ACID transactions with performance acceptable for interactive applications. A comparison of the algorithms can be found here: https://fauna.com/blog/distributed-consistency-at-scale-span...