SQL Zone is brought to you in partnership with:

Rob Williams is a probabilistic Lean coder of Java and Objective-C. Rob is a DZone MVB and is not an employee of DZone and has posted 171 posts at DZone. You can read more from them at their website. View Full User Profile

The Coming SQL Collapse

10.12.2011
| 8392 views |
  • submit to reddit

I looked at neo4j briefly the other day, and quite predictably thought ‘wow, this looks like a serious tinkertoy: it‘s basically a bunch of nodes where you just blob your attributes.‘ Worse than that, to wrap objects around it, you have to have them explicitly incorporate their node class, which is ugly, smelly, violates every law of separation of concerns and logical vs. physical models. On the plus side, as I started to look at it more, I realized that it was the perfect way to implement a backend for a bayesian inference engine (more on that later). Why? Because inference doesn‘t care particularly about all the droll requirements that are settled for you by SQL, and there are no real set operations to speak of. One of the things the status quo defenders of SQL almost always forget is that they have to write themselves a pass to violate the most basic laws of their preferred screed. The NOSQL thing is going to bring that whole reality to the fore and my prediction is that you are going to see a larger exodus away from SQL than we have seen in the past 2 decades.

Here are some of the specific reasons:

  1. Most interesting models in the future will involve discovery, SQL cannot handle discovery – Have you ever worked on a project that was based on some element of discovery? Let‘s take a monitoring app. You don‘t know what type of devices you are going to encounter or what properties they are going to expose. How do you model that with SQL? If you sat 10 programmers down in an interview and gave them this problem, 10 of them would resort to making a model where they stored device names in one table and attributes in another and then they joined them. ‘Gee, how can you account for types that way?‘ Yeah, you can‘t. But that won‘t stop most programmers. They‘ll add a type field and go about their business. Actually, I have come across this in the wild a few times. Believe it or not, the developers will start patting themselves on the back for doing ‘metaprogramming.‘ When in fact, they are doing nothing of the sort. They are taking a language that has a small number of rules and predicates and removing most of them. As Joe Celko is fond of saying, if you can‘t write SQL against your database when you are done, it‘s not really a SQL database now is it? And of course, you can‘t write SQL against a bunch of name/value pairs that are joined in some arbitrary fashion.

  1. Reasoning requires discovery and also heterogeneity, neither of which SQL supports – ok, so now I want to do some reasoning about something. Suppose I want to predict whether or not you will attend an event that is being put on. Give that problem to a bunch of programmers and the vast majority of them will schematize what they know about you. Now let‘s throw a wrench into the works. Suppose I contract to buy some data from outside but only 1/3 of my population is in that database? SQL is going to require me to create a schema that is a superset of anything I could ever know about you. Then I can juggle all the nulls that will occur for the people who didn‘t fill out those surveys. As we all know, null referencing is the largest source of defects in programming. Things are starting to point to a big mess.
  2. Inference requires representations that SQL is useless for – now when I start training, I am going to have to store the features samples into distributions, either graphs, e.g. a Gaussian Kernel Estimator, or discretized representations like Histograms or clustered representations. SQL will offer us almost nothing worthwhile in this regard. But further, having been down this road, invariably, in the SQL solution, you will end up storing the native values then recreating the distribution each time. Once someone comes along and taps on your shoulder and says ‘well actually, you are killing the database pounding it so ruthlessly‘ or ‘if it takes 10s to decide, nevermind,‘ you will invariably be off to make a cache of the distributions. In the graph database, you won‘t really have to do that. Someone logs in, I navigate right to that person‘s node, then I have their distributions right there, ready to be used to compute probabilities.
  3. Schemas in SQL are strong on typing and set operations, weak on relationships – the relationship semantics in SQL are pretty lame. You can basically aggregate things. In discovery environments, and reasoning ones, relationship semantics are much more complex. For instance, I showed some level of interest in event x, a different level in event y. How can I learn from these differences? In a graph or object database, I can easily incorporate that into the model and navigation of the nodes is all I need to do. Can I do that in SQL? Sure, but again, I would have to do it by bolting semantics onto the language that are not there. Remember RUP? The thinness of its relationship semantics caught up to it pretty quickly and after a short while, everything was using archetypes all over the place. Relationships are of greater and greater importance in models, and will continue to become more important.
  4. SQL is about tables mostly, can do a Tree, Graphs are a nuisance – If I were interviewing you, could you implement a tree in SQL? Properly I mean? I have come across almost no programmers who have ever done it. Joe Celko wrote a whole book on trees. Really there are two ways to implement them: Adjacency and Nested Sets. The latter are very useful and powerful. That said, graphs are much more complicated and the additional amount of mess that has to be managed quickly turns to goo in SQL. Clearly, graph databases have no problem with this.
  5. SQL has retained its position of supremacy out of performance anxiety FUD – Larry Ellison is flying around in jets and sailing yachts because he has been able to keep people very scared that if they don‘t use his db, they will suffer a huge performance failure when their big night comes. The problem is that now that the cloud is here, people can scale in other ways, and they don‘t have to spend to prevent something out of fear, because they can test different setups, and use free ones until they fail. This will materially change the whole dynamic of how/why people will be willing to spend so much money on databases.
  6. SQL portrays itself as the only adult performance solution, then relies on caching to deliver the goods – kinda says it all.
  7. Properly implemented object databases outperform SQL at loading complex graphs, and have for 2 decades – which is what your program does a lot of, if you wrote it properly. Furthermore, those graphs are going to get bigger and more complex as more context is called upon to make the abilities of the app greater.

I still think that an object database would be preferable and will be looking at a few at the same time, but NOSQL is going to make serious inroads and those who are calling these guys simpletons and morons should come out and show us how they can handle some of the cases I have presented here. Also note, in my prior posts in the last month or so, the SQL camp has gotten fat and lazy. They have done nothing about ETL, the amount of work required to evaluate the performance of SQL solutions is still a huge drag, and no attempt has been made to support more sophisticate models. Add to that the fact that languages like R are growing rapidly, and you have a situation where SQL is going to, at the very least, wake to a new Copernican reality: they are not the center of the universe. That should have dawned on them when ORMs took hold, but in fact, those have largely preserved the notion among programmers that the database is the center of the universe, unfortunately.

The SQL defense against NoSQL is that it‘s a flash in the pan and real work can‘t be done with it. That‘s silly. The dumbest part of arguments like this is they assume that the people who are making the solutions are too stupid to know whether or when they will be out of gas. It‘s really pretty simple: when you stop getting the performance you need.

Perfect example of the leave my status quo alone dopiness in the SQL defender camp

 

From http://www.jroller.com/robwilliams/entry/the_coming_sql_collapse

Published at DZone with permission of Rob Williams, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Mason Mann replied on Thu, 2011/10/13 - 5:16am

"My prediction is that you are going to see a larger exodus away from SQL than we have seen in the past 2 decades."

Some knucklehead like you says that every year. It's quite amusing. What you don't get is the insane amount of invested code, tooling, education and infrastructure based on SQL databases. For the vast majority of companies, SQL database performance is more than good enough, compliant enough for toolstacks to work and the impedance mismatch is not painful enough. All that while providing a proven technology that keeps your data safe and accessible. Yes, NoSQL is "nice" and will certainly finds its way into lots of places, just like xml (yuk), flat files, serialization and ad-hoc solutions do. So? SQL will continue to grow and branch out, not collapse. Which is a good thing, because it actually works in practice.

Brian May replied on Thu, 2011/10/13 - 5:39am

It sounds like you are just as entrenched in the nosql camp as those you criticize are in the sql camp.

 There are certainly advantages to nosql over sql, but it is not as simple as one or the other. Very large heterogeneous datasets will be best served using nosql while more well defined data sets will continue to be well served using sql.

Most of your examples of the limitations of sql seem very weak, and strike me as your limitations with sql. I find it especially ironic that you think a relational database is "weak on relationships". If you really can't think of a way to solve the example you set forth in sql you are probably right to stick with nosql, but it is unwise to project that as a weakness of sql in general.

 It also occurs to me that you may not be as familiar with Oracle which does provide many useful features to solve these sorts of problems by extending ansi sql considerably.

Adrian Engler replied on Thu, 2011/10/13 - 5:57am

For the specific problem of discovering and inference, there are, of course many different ways to a solution. Some use SQL to some degree. In many cases, this could be seen as belonging to the areas of OLAP and data warehousing, and many databases offer solutions for that (though to use these solutions, you have to go beyond SQL). In some cases, NoSQL solutions are probably better.

 But the whole argument "SQL isn't good at X, so SQL will collapse" is flawed. Even if it was true that SQL has nothing at all to offer for tasks that involve discovering and inference, which is doubtful, this would only mean the collapse of SQL if discovering and inference was the only thing that counts, and this is, of course, absurd. Many enterprise applications deal with clearly structured data that can be modelled with a relational model quite well.

Storing data is a central part of a majority of applications, and relational databases have proved to be useful for many of these applications. SQL is a widespread standard (though it is often used under the hood and rightly so), and it is not going away soon. Yes, there are cases in which relational databases are not the best solutions, but they are still a good and convenient solution for many common cases, and if the SQL standard is going to lose significance, it is very unlikely that this is in the close future.

dieter von holten replied on Thu, 2011/10/13 - 9:33am

The discussions about technology/language/framework/ide X is better than technology/language/framework/ide Y are rather irrelevant when the area of application is not clearly mentioned. what datahandling-application are you talking about? a flight-booking system? the account-part of a bank? the payment-part of an online shop? the logistics part of that online shop? the sparepart catalogue of your car-producer? 

or one of these collect-all-data-and-find-out-what-to-do-with-it-later like f**k or other social networks?

is this a relevant application? i mean, by number of 'installations', not by number of data-records.

IIRC, in the mid-90s there  were a number of vendors of object-databases, real rdbms-killers, believe me. who of them is still around? why not?

for most real-world applications (in the meaning of use-cases and software) the question is not 'sql or nosql '? - the question is 'red, green or blue sql?'

 

Mason Mann replied on Thu, 2011/10/13 - 11:58am

As evidenced by "http://java.dzone.com/users/deconcoder", Rob Williams enjoys writting hollow (easy to read, but no substance) articles with a controversial tone, presumably to attract people to his blog. The Java world has apparently spawned a John C. Dvorak wannabee.

Rob Williams replied on Thu, 2011/10/13 - 12:12pm in response to: Brian May

Dude, you are confusing semantics with mechanics. Of course I could write SQL code to do that. In the same way you could write source code to make up for the lack of types in languages, or parsers, etc. The difference is that then you have a representation that requires another batch of logic for it to make any sense of itself. I see that as a clear design failure. You can only say how things are aggregated in SQL: 1:M, M:M, 1:1. You can't say anything else about those relationships (you can add some cascade logic, but those are all physical concerns). In things like OWL, RDF, etc., relationship semantics become far more important. BTW, all you people who have run out to call me a clown do realize that you are calling Sir Tim, the founder of this here surfboard we're on, one too, right? Cause his whole Web 2.0 is about doing real semantics, rather than just pushing strings around. As to Oracle providing lots of neat ways to make up for the failures of the language, yeah, dude, in 2011, I will not be writing PL-SQL code, thank you very much. If that makes me a Babar in your eyes, praise be to Zeus.

Rob Williams replied on Thu, 2011/10/13 - 12:11pm

As to the rest of the comments here, I don't know how you guys missed my main argument: yes, SQL can deal with modeling simple simulations. The age of the high priced programmer doing the nth version of a payroll program or a part-stocking app is over, AFAIC. Take anything beyond that and you will start to see what I am talking about. You want specifics, I gave a lot of them. Let's suppose, for instance, instead of just modeling the supply chain, we were also performing optimizations up and down it: figuring out where the next part shortage was going to be, or why things have slowed down in parts of a line. 10 years ago guys like you would have argued for a system where the OLTP system poured its data into a db, then a batch process would pick it up and create some OLAP data out of it, etc. I also believe that is untenable, and what you are going to start seeing instead is models that do both operational and analytical work at one time in one place.

Brian May replied on Thu, 2011/10/13 - 1:32pm in response to: Rob Williams

So if you can write sql to do that what is the problem? It is a bold statement to make that sql is going to collapse, and all you can support that statement with is because you find sql cumbersome to write code that does things?

 

Mason Mann replied on Thu, 2011/10/13 - 7:01pm

"As to the rest of the comments here, I don't know how you guys missed my main argument"

Maybe because the arguments were all as lame as your "please come to my blog" title?

Mason Mann replied on Thu, 2011/10/13 - 7:06pm

Readers should be aware that neo4j, claiming to be 1000x faster than any old SQL database, is so fast because it doesn't do transactions properly (you know, boring and collapsing stuff like redo logs.) End result? Corruption reports like http://lists.neo4j.org/pipermail/user/2011-July/009881.html It's a horrible product with a neat web page and an army of clueless fanboys like Rob.

André Pankraz replied on Fri, 2011/10/14 - 7:12am

OK he likes to write catchy blog headlines. Why not, it helps to get people read your stuff ;)

I summarize: SQL products are bad for schema-less key-value-pairs or graph-based data - on top of that they cannot render web-pages or brew coffee (ok...you never know with this Oracle DB). The rest is blog-sales and fanboy talk. Nothing new - we can calm down now.

 

After some experiences with Google App Engine Datastore I cannot understand why all are so eager to ditch SQL products. May be for small super-innovative sides this key-value databases are thrilling and SQL is somewhat 2000th - OK i'm too old for this stuff ;)

Nicolas Bousquet replied on Fri, 2011/10/14 - 10:23am

The whole point of the post is... RDBMS (with SQL connectivity) seems unsuited your needs that are to process graphs and nodes!

Wow !!! Didn't everybody known that there are graph database exactly for that reason?

Now imagine that I want to make an eCommerce web site

For storing the item pictures, I will use some sort of filesystem (maybe distributed). Maybe even a CDN service specialized exactly for that kind of things.

For the search textbox, I will leverage a search engine with its own indexing, its own database, its own algorithms. Another topic again.

But the price of the book, the list of items in your command, I'd want it in a relationnal database with ACID properties.

And if I want to compute the average kart value of my clients per living location, or other statistical things like that, I will make a datawarehouse and apply Business Intelligence technologies on it. Many of the leveraging RDBMS databases.

Many people try to generalize their particular case to everybody. Problem is many times, other have different needs. They might want to do thing you might even don't know or even thinked of.

And so when you tell them that YOUR choice and YOUR solution is the only one they don't always agree. That's to be expected no?

Robert Craft replied on Thu, 2012/01/26 - 6:07am

Your entire article is based on the fact that you're building a Bayesian inference engine. How many enterprise application out there fit that description? Perhaps a fraction of a percent. Perhaps. Perhaps you should read the article you linked to a bit more carefully. NoSQL has it's merits, but there's no way I'll be implementing an application with 500 to 1000 concurrent users with a NoSQL backend. Sometimes you just don't need to scale to that size, and 'just because we can' isn't a valid argument.

Spring Security

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.