January 27, 2010 | SQL Server

Sometimes it's the small things : match column names in subqueries

The behavior of column matching in subqueries is a little peculiar, to say the least.  If you've been bitten by this behavior once, you're unlikely to have been bitten a second time, but for some of us it just takes a while to sink in.

This morning I wasted a good five minutes "troubleshooting" a query that wasn't working out the way I thought it should.  I have a core table called Users (simplified for brevity):

    CREATE TABLE dbo.Users
(
  UserID INT PRIMARY KEY,
  -- ... other columns ...
);

Against this table, I was writing a custom report for a specific set of users, so I wrote something like this (again, simplified for brevity):

    DECLARE @u TABLE (id INT PRIMARY KEY);
 
INSERT @u(id) SELECT 1 UNION ALL SELECT 2;
 
SELECT UserID FROM dbo.Users
  WHERE UserID IN (SELECT UserID FROM @u);

I ran the query, and of course got all of the users in the original Users table.  Why?  Because in the table variable definition, I got really lazy about the name of the ID column, and then forgot that I had been lazy when I referenced it later.  In most cases this won't have any effect, but in a subquery, if the column is not found in the objects referenced there, the parser jumps scope and tries to bind to a column in the parent query.  If it finds none, you get an error (invalid column name); if it finds more than one, you get a different error (ambiguous column name).  But when it finds exactly one, then I may as well have written:

    SELECT UserID FROM dbo.Users
  WHERE UserID IN (SELECT UserID FROM dbo.Users);
 
-- which of course becomes:
 
SELECT UserID FROM dbo.Users;

Note that if @u had no rows, then the outer query would have also returned no rows.  But all it needs is one row to satisfy the condition for every row in the outer table… and in some cases, this can be more difficult to troubleshoot than my simple example here.

It is clear I'm not the first person to get tripped up by this behavior, but they aren't going to fix it:

#126785 : Subquery error not causing main query error

#124775 : Column confusion with IN

#499463 : wrong results from Temporary table in a subquery

According to Microsoft, this behavior is "correct."  If it were me, I would "fix" the language rules so that a subquery that is not explicitly correlated to an outer query, should be able to be parsed and executed on its own without error.  If it can't, then it shouldn't parse as part of a larger query, either.

But that's just my opinion.  While I understand that the current behavior aligns with the T-SQL rules, that doesn't mean that the T-SQL rules make a whole lot of sense in this case.

In the meantime, if you have a situation where a subquery with IN() is not acting like the filter you expected, check your column names before you start re-writing your query and pulling your hair out.

7 comments on this post

    • Denis Gobo - January 27, 2010, 8:17 PM

      Another option is to use EXISTS..in that case you will get an error. Yeah IN and NOT IN is especially fun when you have NULLS in the column

    • AaronBertrand - January 27, 2010, 8:25 PM

      Denis, agreed, IN / NOT IN are problematic when NULLs are possible, and EXISTS is a workaround in both situations.  In fact I use EXISTS almost universally, went with IN because I was using it multiple times in a multi-CTE query and didn't feel like correlating every time.What I don't like about EXISTS in a case like this again comes back to laziness.  I would have to correlate with an EXISTS, e.g.:
      SELECT UserID FROM Users AS u
       WHERE EXISTS (SELECT 1 FROM @u WHERE id = u.UserID)
      Which in and of itself (adding a WHERE clause) isn't such a bad thing, but it makes it harder to prototype the query and replace the subquery with a comma-separated list.  The IN() variety allows me to change it to:
      SELECT UserID FROM Users AS u
       WHERE UserID IN (1,2);
      In the long run there isn't a lot of value in that, but during development it can makemuch quicker work of the logical portions of your query… for example if the real list you want to test is 1000 and it adds significant runtime to your query, testing during development slows down quite a bit.

    • Jack Corbett - January 27, 2010, 10:16 PM

      Aaron,
      I find this behavior to be non-intuitive.  I have been burned by it before so I haven't made that mistake recently.

    • cinahcaM madA - January 28, 2010, 3:27 AM

      I think IN often reads better than EXISTS. It's closer to how we think in natural language. For example:
      "Find all of the information about customers who are in the list of those who placed an order in the past 30 days."
      Compared to:
      "Find all of the information about customers for whom there exists an order placed in the past 30 days."

    • Jonathan Kehayias - January 28, 2010, 6:48 AM

      Maybe I am missing the obvious here, but why not JOIN to the table variable?
      SELECT UserID FROM dbo.Users  
      JOIN @u ON UserID = ID
      Its less code and achieves the same result, and if you had used an incorrect name it would have thrown the exception.  IIRC it is also faster against larger datasets, or maybe that was for using NOT IN, I forget which exactly, but I remember testing this based on a comment I received once.

    • Aaron Bertrand - January 28, 2010, 7:03 AM

      Jonathan,
      Part laziness, part habit.  When I was first learning SQL I got into the habit of building uncorrelated subqueries that could be pulled out and run on their own.  Of course I am not advocating that anyone should use IN at all, and I wasn't really asking for help on how to do this better.  I just know that a LOT of people write code this way all the time, not just when quickly creating a one-off, and so hopefully this makes them aware of this far-from-intuitive behavior.

    • AaronL - January 28, 2010, 8:43 PM

      I've definitely seen developers on my team get burned by this more than once, and have caught it over and over again in my code reviews.  
      For me it is best avoided by one practice – always, always, always alias every table in your statements, and use those aliases in every column reference, even when you're only pulling from one table.  It's really not a significant amount extra work, the aliases can be short, and it will save your butt time and again. You never know when you will return to a query and add in a second table reference, or a subquery, etc.

Comments are closed.