Bad Habits to Kick : Using SELECT * / omitting the column list
In my last post in this series, I talked about inappropriately using SELECT, OUTPUT and RETURN in stored procedures. Today I wanted to talk about using SELECT * or omitting the column list entirely.
This is a typical operation when developing, debugging, or testing, and I have no qualms about its use in those scenarios. But there are several reasons why you should avoid SELECT * in production code:
- You can be returning unnecessary data that will just be ignored, since you don't usually need every single column. This is wasteful in I/O and network traffic, since you will be reading (and transmitting) all of that data instead of just the columns you need. SQL Server may make the wrong choice for a plan strategy, since it thinks you need all of the columns it may opt for a scan instead of a seek or introduce key lookups that ultimately were not required.
- When you use SELECT * in a join, you can introduce complications when multiple tables have columns with the same name (not only on the joined columns, such as OrderID, which are typically the same, but also peripheral columns like CreatedDate or Status). On a straight query this might be okay, but when you try to order by one of these columns, or use the query in a CTE or derived table, you will need to make adjustments.
- While applications should *not* be relying on ordinal position of columns in the resultset, using SELECT * will ensure that, when you add columns or change column order in the table, the shape of the resultset will change. Ideally, this should only happen intentionally and not as a surprise.
A major roadblock to using the explicit column list is
laziness productivity. For those of you using IntelliSense or 3rd party tools like SQL Prompt, this is much less likely to be a good excuse, since the columns can appear for you in a drop down list (and in some cases you can set up keystrokes that will change * to the explicit column list). For those of you that don't use 3rd party tools, or have disabled IntelliSense due to its chattiness, here's another little tip that is seemingly obscure: expand the table or view, click on "Columns," and drag it onto the query window. This works in both Management Studio:
And in Azure Data Studio:
Voila! The column list appears for you… it is probably not in the format you want, since they are just separated by commas and listed out horizontally, and it probably has a lot of square brackets you don't need. But reformatting that list (and removing columns you don't need) can sure beat typing out the columns you do need manually, especially on wider tables. Though being forced to type them out may increase your selectivity and help you decide to leave out the columns you don't need, so maybe the convenience is a toss-up.
SELECT * in a view
Some adopt the misconception that they can avoid maintenance down the road by simply using SELECT * in a view. Then they can change the base table and not worry about updating the view. Let's try this at home, shall we?
USE [tempdb]; GO CREATE TABLE dbo.foo ( a int, b int, c int ); GO INSERT dbo.foo(a, b, c) SELECT 1, 2, 3; GO CREATE VIEW dbo.view_foo AS SELECT * FROM dbo.foo; GO ALTER TABLE dbo.foo ADD d int; GO SELECT d FROM dbo.view_foo; GO -- Error message : Invalid column name 'd'. UPDATE dbo.foo SET d = 4; GO ALTER TABLE dbo.foo DROP COLUMN c; GO SELECT a, b, c FROM dbo.view_foo; GO -- Even though c is no longer in dbo.foo, this works! -- But it returns 4 (the data for d) under column c. EXEC sys.sp_refreshsqlmodule N'dbo.view_foo'; GO SELECT * FROM dbo.view_foo; GO -- Now this returns columns a, b and d as we expect. DROP VIEW dbo.view_foo; DROP TABLE dbo.foo;
This shows it is not a good idea to use SELECT * in a view, because you can't rely on changes to the underlying table(s) to be reflected in the view and its results. The person making the changes to the table has to know and understand that the view must be refreshed in order to ensure correct results (and you can help enforce that by always creating views WITH SCHEMABINDING – now someone can't change the table without being notified about a dependent view). And, in a circular way, WITH SCHEMABINDING prevents you from using SELECT *. Even without SCHEMABINDING, an explicit column list ensures there is no ambiguity caused by metadata changes (but read Itzik's explanation of why it is still a useful option to always use). When you drop a column from the table, for example, the view will break the next time it is queried, instead of potentially returning bogus data in the wrong column.
INSERT … SELECT
A lot of times we build copies of our tables or otherwise populate a table from another table with the same structure. Depending on the columns, it can be tempting to use code like this:
INSERT dbo.foo SELECT a, b, c FROM dbo.bar; -- or even worse: INSERT dbo.foo SELECT * FROM dbo.bar;
If this kind of syntax is used in production code, take care to populate the column lists. This will prevent you from being bitten later when the source table has a column added, or column order is changed in either table. By explicitly defining what you are selecting and where you are putting it, you insulate yourself from these types of metadata changes:
INSERT dbo.foo ( a, b, c ) SELECT a, b, c FROM dbo.bar;
Now, the code will break if you drop or rename a column, but since that is less frequent and because you *should* be notified of that problem immediately, I think that's okay.
Similar to the above case, we often build copies of our tables using syntax like this:
SELECT * INTO dbo.foo FROM dbo.bar;
I'm actually okay with this usage, mostly because it is used in an ad hoc fashion, and because it will take into account metadata changes that have happened in the meantime anyway (so if the source table has changed, the destination table will change in the exact same way). Just keep in mind a few things when you use SELECT INTO:
- you *will* get certain table elements such as IDENTITY property (unless you say something like SELECT IdentityColumn = IdentityColumn + 0 INTO…);
- you will *not* get extensions to the table such as indexes and foreign keys; and,
- you cannot dictate which filegroup or partition scheme the destination table will belong to (well, until SQL Server 2017).
As demonstrated above, it is very easy to avoid SELECT *, and there are plenty of good reasons to do so. All of which trump the "efficiency" factor, IMHO.
I am working on a series of "Bad habits to kick" articles, in an effort to motivate people to drop some of the things that I hate to see when I inherit code. Up next: avoiding the schema prefix.