Blogging from the PASS Summit Keynote: Day 3
This is a rough morning. Got in at 2:00 AM, then had no water in my hotel this morning. I doubt I will ever voluntarily stay at the Seattle Hilton again. Still, I'm very much looking forward to this keynote – the illustrious Dr. David DeWitt is going to make my brain hurt today!
First, though, Rick Heiges introduces Rob Farley and Buck Woody, who walk out playing a pretty funny acoustic song about query tuning. Then there's a touching tribute to Wayne Snyder, who is retiring after this year.
SQL Rally Dallas will be held May 11-12. PASS Summit 2012 will be held November 6-9 in Seattle. Check out the early registration prices, valid until November 15.
Dr. David DeWitt takes the stage and starts his keynote, entitled "Big Data: What's the Big Deal?" He says that Big Data is tens of petabytes. I take that as, "Stop saying 'huge' when talking about your 40 GB table." A zettabyte (ZB) can be thought of as a quadrillion megabytes, a trillion gigabytes, or a billion terabytes. No matter your interpretation, that's a lot of data. 35 ZB, the amount of data we should have by 2020, can be represented by a stack of DVDs reaching halfway to Mars. (I sanity-check that unit math right after the list below.) A breakdown of some of the properties managing big data:
- eBay: 10 PB, 256 nodes
- Facebook: 20 PB, 2,700 nodes
- Bing: 150 PB, 40,000 nodes
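Not part of the keynote, but just to convince myself those zettabyte conversions hold up, here's a quick back-of-the-envelope check in Python (assuming decimal SI units, i.e. 1 ZB = 10^21 bytes):

    ZB = 10**21                        # bytes in a zettabyte (SI, decimal)
    MB, GB, TB = 10**6, 10**9, 10**12  # bytes in a megabyte, gigabyte, terabyte
    print(ZB // MB)   # 1,000,000,000,000,000 -> a quadrillion megabytes
    print(ZB // GB)   # 1,000,000,000,000     -> a trillion gigabytes
    print(ZB // TB)   # 1,000,000,000         -> a billion terabytes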
NoSQL does not mean "SQL should never be used" or "SQL is dead"; it means "Not Only SQL." He talks about the benefits of NoSQL and how it trades consistency for availability, while relational databases provide maturity and stability at the cost of flexibility. Look back at the comparison above: eBay, with roughly half the data Facebook has, runs on roughly 10% of the nodes. He explains that relational databases are not going away; we will all still have jobs regardless of how popular NoSQL gets. Dr. DeWitt promises this, and I'm going to hold him to it.
Hadoop and MapReduce offer scalability, a high degree of fault tolerance, relatively easy programmability, efficient data analysis, and lower up-front software/hardware costs (but not necessarily lower TCO).
HDFS is the underlying file system for Hadoop, designed on the assumption that failures are common: write once, read many times. Blocks are actually stored on NTFS, in 64 MB chunks. Each block is replicated twice, for three copies in total: the first copy lives on the node that creates the file, the second on another node under the same switch (same rack), and the third on a node in a different rack. This maximizes fault tolerance at the lowest performance cost. He gives a great explanation of the Hadoop fault tolerance model, MapReduce, HiveQL, Hive vs. PDW, and Sqoop (a bridge between Hadoop and RDBMSs), but I'm not even going to try to reproduce all of that here. If you didn't see the keynote (live or streaming), you should definitely consider the DVDs.
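To give a flavor of the programming model he was describing, here's a minimal single-machine sketch of the MapReduce idea (classic word count) in plain Python. This is my own toy illustration, not anything DeWitt showed and not the actual Hadoop API; real Hadoop distributes the map and reduce tasks across the cluster and handles the shuffle/sort between them for you.

    # Toy, single-process illustration of the MapReduce programming model.
    from collections import defaultdict

    def map_phase(documents):
        """Map: emit a (word, 1) pair for every word in every input document."""
        for doc in documents:
            for word in doc.split():
                yield (word.lower(), 1)

    def shuffle(pairs):
        """Shuffle/sort: group all emitted values by key (the word)."""
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(grouped):
        """Reduce: sum the counts for each word."""
        return {word: sum(counts) for word, counts in grouped.items()}

    docs = ["big data big deal", "data data everywhere"]
    print(reduce_phase(shuffle(map_phase(docs))))
    # {'big': 2, 'data': 3, 'deal': 1, 'everywhere': 1}

The appeal he described is that you only write the map and reduce pieces; the framework worries about splitting the input, scheduling the work across nodes, and re-running tasks when a node fails.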
For next year's keynote, I voted for "Main Memory Database Systems." You can tell him what you want to hear about at [email protected].