SQL Can Be Slow -- Why Do People Doubt This?
Here's a typical problem that results from "SQL Hegemony" -- all data
must be in a database, and all access must be via SQL. This can also be
called the "SQL Fetish" school of programming.
War Story. On a Data Warehousing project, we had to load and process the organizational hierarchy. SQL doesn't do hierarchies well because they can (and should) involve an join of indefinite depth. One of the DBA's wanted to use a "pure SQL" traversal of the hierarchy.
My opinion was that it was a waste of code. We were writing Java programs. We could -- trivially -- fetch the entire tree into Java objects and work with the hierarchy as a hierarchy.
The DBA finally "won" because of the SQL Hegemony argument -- all access must be in SQL, right? I say "won" because we eventually had to throw all the SQL away and use flat files. A "pure SQL" data warehouse is generally unacceptably slow for loading. Data mart subsets can be done in pure SQL, but loads can't.
Recent Events. "a table called [LOTSADATA] and it has 14.7 million rows. One of the columns in [LOTSADATA] table is BRAND" for which they need to do a select distinct. "The disadvantage of [SELECT DISTINCT] is that the Oracle database engine will do a sort which is an insanely expensive operation.
Question: Are there alternative approaches to obtaining the unique brands ina table?"
Response 1. Duh. Of course there are alternatives. What are you, stupid? You have programming languages. Use them.
Response 2. You're kidding, right? Why ask me? Why not just run it? How hard can it be to benchmark this? What are you, stupid? Seriously.
Response 3. Oh. SQL Hegemony. Folks are actually arguing about the cost of a query and -- it appears -- no one can actually write the eight lines of code required to demonstrate that SELECT ALL is faster than SELECT DISTINCT.
[Sorry for calling you stupid. You're paralyzed by fear, not stupidity. What if SQL isn't the perfect end-all, do-all language? If SQL isn't perfect for all data processing, what other lies have we been living? Is this the end of organized data processing? The collapse of western civilization?
Indeed, I'm repeatedly shocked that the question even comes up. And I'm more shocked that the "appeal to authority" argument has to be used. It's trivial to measure. It appears that it's easier to ask me than to gather data.]
Edit. SQL Hegemony? Yes. Rather than run a demonstration program, written in Java or C# or Python, they argued about the SQL. Doing this with minimalist SQL didn't seem to make anyone's radar. Why not? SQL Hegemony. Rather than consider real alternatives, everyone was reduced to looking for sneaky SQL tricks.
Benchmarking. Here is what I did. It's 5 lines of code for each case. [How hard can this be? Apparently, SQL hegemony makes it impossible for some organizations to do even this.]
Notes.
Results.
select_distinct 0.417096select_all 0.162827
For this data, the SQL SELECT DISTINCT took almost 3x as long as the SELECT ALL. It's just that simple.
Want more speed? Use array fetch features to get more rows in bigger buffers.
Consequences.
This is not rocket science. SQL can be Slow. Don't Argue: Benchmark. Your Mileage May Vary.
SQL databases do locking, transaction management, backup and recovery and a bunch of things well. SQL databases are helpful and necessary. However, SQL isn't always fast.
SQL means Slow Query Language. You've been told.
War Story. On a Data Warehousing project, we had to load and process the organizational hierarchy. SQL doesn't do hierarchies well because they can (and should) involve an join of indefinite depth. One of the DBA's wanted to use a "pure SQL" traversal of the hierarchy.
My opinion was that it was a waste of code. We were writing Java programs. We could -- trivially -- fetch the entire tree into Java objects and work with the hierarchy as a hierarchy.
The DBA finally "won" because of the SQL Hegemony argument -- all access must be in SQL, right? I say "won" because we eventually had to throw all the SQL away and use flat files. A "pure SQL" data warehouse is generally unacceptably slow for loading. Data mart subsets can be done in pure SQL, but loads can't.
Recent Events. "a table called [LOTSADATA] and it has 14.7 million rows. One of the columns in [LOTSADATA] table is BRAND" for which they need to do a select distinct. "The disadvantage of [SELECT DISTINCT] is that the Oracle database engine will do a sort which is an insanely expensive operation.
Question: Are there alternative approaches to obtaining the unique brands ina table?"
Response 1. Duh. Of course there are alternatives. What are you, stupid? You have programming languages. Use them.
Response 2. You're kidding, right? Why ask me? Why not just run it? How hard can it be to benchmark this? What are you, stupid? Seriously.
Response 3. Oh. SQL Hegemony. Folks are actually arguing about the cost of a query and -- it appears -- no one can actually write the eight lines of code required to demonstrate that SELECT ALL is faster than SELECT DISTINCT.
[Sorry for calling you stupid. You're paralyzed by fear, not stupidity. What if SQL isn't the perfect end-all, do-all language? If SQL isn't perfect for all data processing, what other lies have we been living? Is this the end of organized data processing? The collapse of western civilization?
Indeed, I'm repeatedly shocked that the question even comes up. And I'm more shocked that the "appeal to authority" argument has to be used. It's trivial to measure. It appears that it's easier to ask me than to gather data.]
Edit. SQL Hegemony? Yes. Rather than run a demonstration program, written in Java or C# or Python, they argued about the SQL. Doing this with minimalist SQL didn't seem to make anyone's radar. Why not? SQL Hegemony. Rather than consider real alternatives, everyone was reduced to looking for sneaky SQL tricks.
Benchmarking. Here is what I did. It's 5 lines of code for each case. [How hard can this be? Apparently, SQL hegemony makes it impossible for some organizations to do even this.]
def select_distinct():
q1= db.cursor()
q1.execute( "SELECT DISTINCT BRAND FROM LOTSADATA" )
print q1.fetchall()
q1.close()
def select_all():
q2= db.cursor()
q2.execute( "SELECT ALL BRAND FROM LOTSADATA" )
print set( q2.fetchall() )
q2.close()Notes.
- I only simulated 100,000 rows. [I don't have the patience to wait for 15 million rows to be created, loaded and queried.]
- The table only had four columns.
- I used SQLite3 -- which is mostly in-memory -- and runs much, much faster than Oracle.
- The select all is not a specious result based on cache being filled; the results are repeatable in any ordering of the queries.
Results.
select_distinct 0.417096select_all 0.162827
For this data, the SQL SELECT DISTINCT took almost 3x as long as the SELECT ALL. It's just that simple.
Want more speed? Use array fetch features to get more rows in bigger buffers.
Consequences.
This is not rocket science. SQL can be Slow. Don't Argue: Benchmark. Your Mileage May Vary.
SQL databases do locking, transaction management, backup and recovery and a bunch of things well. SQL databases are helpful and necessary. However, SQL isn't always fast.
SQL means Slow Query Language. You've been told.
Tags:
Published at DZone with permission of Steven Lott, author and DZone MVB. (source)(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)





Comments
Dragan Sahpaski replied on Fri, 2010/10/01 - 7:46am
Craig Ringer replied on Fri, 2010/10/01 - 8:11am
... which is a remarkably primitive database. Very good for embedded use on smaller data sets, or when you're willing to do more processing in your code and rely on the database only for simpler work. It's a pretty dumb engine, though, so it doesn't do much for your argument when you've been talking about it being faster to do things in $procedural-language rather than Oracle. It also doesn't help that you failed to demonstrate your faster method for picking out distinct results, though there are indeed faster methods than a sort (and Oracle should use them if tuned correctly with sacrificed goats and chicken gut divination; PostgreSQL does).
As it happens, I agree with you that there are some tasks for which SQL databases are far from ideal. Data warehousing can be a good example, though it's possible to produce column-store based databases that still use SQL and these perform extremely well. A row store SQL database is certainly not great for data warehousing and data mining. (That said, you could've used WITH RECURSIVE, or the Oracle-specific START WITH ... CONNECT BY rather than a multi-self-join for your recursive query. I assume it was a long time ago, before these options were available?).
Anyway: there are jobs that SQL is a poor fit for, I just don't think your argument does a very good job of showing this. So I'll show why your example doesn't hold up on a real database with real data.
I don't have the $lots required for Oracle, and prefer to use PostgreSQL in any case. I'd like to try your test on a somewhat smarter database, so I'll use PostgreSQL running on my local win7 box. You didn't specify the column types or the nature of the dummy data used, nor did you provide the code you used to generate it. On the other hand, for the argument it doesn't matter much, as your argument isn't all that specific to the characteristics of the data. I'll use a 220MB, 420000 row table I have on hand on an idle dev machine. Picking an un-indexed column with a reasonable variety of data in it, I'll compare the time for the database to generate a distinct subset vs the time it takes for a simple seqscan, as per your example.
=> \timing on
=> \o outfile
=> select distinct page_number from booking;
Time: 458.107 ms
=> select page_number from booking;
Time: 450.470 ms
Indeed, the DISTINCT is marginally, trivially faster. Of course, I picked the slowest of three times for DISTINCT and the fastest of three for the straight select; in fact, DISTINCT fared much better with an average of 416ms while the straight seqscan got 460ms.
Once you factor in the extra data transfer of the seqscan, and the time required to identify distinct values ... good luck. PostgreSQL is doing a hash aggregate over a sequential scan of the data, not a sort-and-filter, and you are *not* going to beat it in your code when talking to the database over JDBC/ODBC or even a native protocol.
So, yes, a badly tuned SQL database or a really basic one performs badly. A properly set up and configured database, not so much. That's not to say that there aren't some jobs SQL is bad for or slow for - there are - just that your argument goes nowhere toward demonstrating that.
Benjamin Keil replied on Fri, 2010/10/01 - 11:16am
There are several ways to model hierarchies in SQL that don't require "a join of indefinte depth."
First, there's Joe Celko's famous "Nested Set" model. Each node has a left-value and a right-value. A node N contains another node M just in case N.left < M.left and N.right > M.right.
Another way to go is to concatenate a node's identifier with its ancestors. So X has sub-nodes XY and XZ, and XY dominates XYQ.
And then some databases support the "WITH RECURSION" query modifier that takes care of the multiple-join for you.
Endre Varga replied on Sat, 2010/10/02 - 5:01am
in response to: DerHeiligste
Have you ever actually _used_ the models you recommend? I did. Nested Set works fine for some kind of queries but not for others. It is pretty inflexible.
WITH RECURSION does not help much. It may reduce the code you have to write, but it will be just as slow to execute.
Craig Ringer replied on Sun, 2010/10/03 - 8:43pm
in response to: drewhk
Bill Karwin replied on Sun, 2010/10/03 - 11:09pm
in response to: dragan.sahpaskix@gmail.com
Jon Martin Solaas replied on Mon, 2010/10/04 - 5:08am
Jon Martin Solaas replied on Mon, 2010/10/04 - 5:23am
Jean-pol Landrain replied on Mon, 2010/10/04 - 8:12am
Cloves Almeida replied on Mon, 2010/10/04 - 7:02pm
in response to: jonmartin
My thoughts exactly. And if you get to the point you where have so much data that even other DW techniques won't cut it, throwing in more hardware is usually the cheaper then spending programmer time to do what the RDBMS is supposed to.
As an alternative there are column-based RDBMS that are very suited to the massive aggregation problems of BI.
Bikes Direct replied on Thu, 2011/08/11 - 12:51pm
What is missing from the article is the Python code that processes the (Select All) output through a Python (Sets) to product the distinct brands. -BikesDirect
Thomas Kern replied on Sun, 2012/01/22 - 5:41am
select * from distinct_brands; -- will do the trick
+ update, that table from time to time when new data arrives.