SQL Zone is brought to you in partnership with:

As an author, trainer, and coach Markus Winand specializes in helping developers cope with SQL performance issues. He also published the book "SQL Performance Explained" and tweets his best performance tips via @SQLPerfTips. Markus is a DZone MVB and is not an employee of DZone and has posted 6 posts at DZone. You can read more from them at their website. View Full User Profile

Myth: Select * Is Bad

05.13.2014
| 7445 views |
  • submit to reddit

This is one of the most persistent myths I’ve seen in the field. It’s there for decades. If a myth is alive that long there must be some truth behind it. So, what could be bad about select *? Let’s have a closer look.

We all know that selecting “*” is just a short-hand for selecting all columns. Believe it or not, this makes a big difference to many people. So, lets first rephrase the question using this “finding”:

Why is it bad to select all columns?

In fact, there are a few very good reasons it is bad to select all columns if you don’t need them. And they all boil down to performance. What is surprising, however, is that the performance impact can be huge.

Up to 100x slower when preventing an Index-Only Scan

Broadly speaking, the less columns you ask for, the less data must be loaded from disk when processing your query. However, this relationship is non-linear.

Quite often, selecting from a table involves two steps: (1) use an index to find the address where the selected rows are stored; (2) load the selected rows from the table. Now imagine that you are just selecting columns that are present in the index. Why should the database still perform the second step? In fact, most databases don’t. They can process your query just with the information stored in the index—hence index-only scan.

But why should an index-only scan be 100 times faster? Simple: an ideal index stores the selected rows next to each other. It’s not uncommon that each index page holds about 100 rows—a ballpark figure; it depends on the size of the indexed columns. Nonetheless, it means that one IO operation might fetch 100 rows. The table data, on the other hand, is not organized like that (exceptions). Here it is quite common that a page just contains one of the selected rows—along with many other rows that are of no interest for the particular query. So, the reason an Index-Only Scan can be 100 times faster is that an index access can easily deliver 100 rows per IO while the table access typically just fetches a few rows per IO.

If you select a single column that’s not in the index, the database cannot do an index-only scan. If you select all columns, … , well I guess you know the answer.

Further, some databases store large objects in a separate place (e.g., LOBs in Oracle). Accessing those causes an extra IO too.

Up to 5x slower when bloating server memory footprint

Although databases avoid to store the result in the server’s main memory—instead the deliver each row after loading and forget about it again—it is sometimes inevitable. Sorting, for example, needs to keep all rows—and all selected columns—in memory to do the job. Once again, the more columns you select, the more memory the database needs. In the worst case, the database might even need to do an external sort on disk.

However, most database are extremely well tuned for this kind of workload. Although I’ve seen a sorting speed-up of factor two quite often—just by removing a few unused columns—I cannot remember having got more than factor five. However, it’s not just sorting, hash joins are rather sensitive to memory bloat too. Don’t know what that is? Please read this article.

These are just the two top issues from database perspective. Remember that the client needs to process the data too—which might put a considerable load on garbage collection.

Now that we have established a common understanding of why selecting everything is bad for performance, you may ask why it is listed as a myth? It’s because many people think the star is the bad thing. Further they believe they are not committing this crime because their ORM lists all columns by name anyway. In fact, the crime is to select all columns without thinking about it—and most ORMs readily commit this crime on behalf of their users.

The reason select * actually is bad—hence the reason the myth is very resistant—is because the star is just used as an allegory for “selecting everything without thinking about it”. This is the bad thing. But if you need a more catch phrase to remember the truth behind this myth, take this:

It’s not about the star, stupid!

If you like my way to explain things, you’ll love SQL Performance Explained.

Update 2013-11-03 - Is the star itself also bad?

Besides the performance issues mentioned above that are not caused by the star (asterisk) itself, the star itself might still cause other trouble. E.g. with software that expects the columns in a specific order when you add or drop a column. However, from my observation I’d say these issues are rather well understood in the field and usually easily identify (software stops working) fixed.

The focus of the article is on very subtle issues which are hardly understood, hard to find, and often even hard to fix (e.g. when using ORM tools). The main goal of this article is to stop people thinking about the star itself. Once people start to name the wanted columns explicitly to gain the performance benefit explained above, the issues caused by the star itself are also gone. Hence, I’ve felt no reason to add a discussion about these issues here—that’s just a distraction from the arguments that I wanted to explain with the article.

Published at DZone with permission of Markus Winand, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Greg Brown replied on Tue, 2014/05/13 - 7:27am

There are a couple of other reasons to avoid "select *", though one of them is largely historical:

1) Early JDBC drivers required callers to read column values from a result set in the order they were requested in the select statement. Using "select *" left this order up to the database vendor, making the resulting column order unpredictable.

2) Selecting all columns from a large table may not have a noticeable impact on the database itself, but unless your code actually uses all of those column values, you'll end up serializing a lot more data than you need to.

 

Dave Glasser replied on Tue, 2014/05/13 - 4:58pm

 Assuming that the only columns stored in the index are indexed columns (and I don't know if that's always the case or not, I'm just assuming), wouldn't it be true that the only time you would prevent an index-only scan would be when you only needed data from indexed columns?

If that's true, then as a practical matter, I think that savings would be realized in only a very small percentage of cases.

Markus Winand replied on Wed, 2014/05/14 - 12:11am in response to: Dave Glasser

 You are missing one idea: it is pretty common to extend an index (add columns at the end) for the sole purpose to make it useful for an index-only scan. That's more or less ruled out when selecting everything because it is pretty OK to add a few&small columns for that purpose (huge win, small/moderate cost) but doesn't make too much sense if you need to add all column (moderate win, moderate/high cost).

Dave Glasser replied on Wed, 2014/05/14 - 7:50am in response to: Markus Winand

I wasn't aware of that practice or that it's common.

I've been doing SELECT * for many years as part of my ORM framework. The performance is well within the acceptable range, in fact it's lighting-fast if querying only on indexed columns. For my purposes, I don't think the few milliseconds that might be gained here and there, that wouldn't be perceptible to the user, would be worth the extra code that would have to be written and maintained. But that's just me. In my world, that would be what Donald Knuth called premature optimization.

Markus Winand replied on Thu, 2014/05/15 - 3:44am in response to: Dave Glasser

If you are reducing the database to a dumb persistence layer that is just queried for one row at at time (often with the primary key), then you are right (premature optimization). You are dropping performance somewhere else, no reason to look at index-only scans in the first place.

If you are using the database in a relational way—leveraging the full power of SQL—then you'll have many chances to use index only scans to improve performance by an order of magnitude.

I'd also argue that it has many more benefits to use SQL when it makes sense, but I'll leave this over to this quote:

Just because you're using Hibernate, doesn't mean you have to use it for everything. A point I've been making for about ten years now.

— Gavin King, creator of Hibernate on Google+


Dave Glasser replied on Thu, 2014/05/15 - 8:23am in response to: Markus Winand

 For the record, I'm not using Hibernate. Nor am I only querying for one row at a time with the PK.

My point was that even if I am preventing index-only scans by using SELECT *, the actual performance cost of doing so is so small and imperceptible that it's not worth adding additional, less maintainable code to prevent it.

Sergio Samayoa replied on Wed, 2014/05/21 - 9:49am in response to: Dave Glasser

> doing so is so small and imperceptible 

May be you haven't worked with big tables (50M+) in which you have to use database specific artifacts such hints, index tricks and alike. Or you haven't worked with 1000s of concurrent users in which each nano second counts.

So if works for you now, good!

Regards.

Dave Glasser replied on Wed, 2014/05/21 - 10:32am in response to: Sergio Samayoa

So if works for you now, good!

That's pretty much my entire point.

Raging Infernoz replied on Sat, 2014/05/24 - 6:19pm in response to: Greg Brown

1) happened with the shoddy Microsoft JDBC driver V1 for SQL Server 2000, which also cause grief by having too loose matching of the JDBC URL, which Microsoft should be ashamed of; but was fixed with the V3 of the driver.  Please people, replace the V1 driver with at least the V3 one!

Another column sequence issue affecting multiple databases occurs when a new table schema has new columns inserted rather than appended, so that an upgraded of an earlier version would append these new columns, but they would be in a different sequence; this scenario has been a real pain for backup tools like bcp for SQL server, when I needed to transfer data either way between older upgraded databases and newer built database, so had to ask for and hack the format files!

Select * also makes getting column values by index unreliable because you won't know which column values you will get, or know early if columns are missing, so always specify column names unless you can check column metadata.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.