Bad Habits to Kick : Creating the uber-view
In my last post in this series, I talked about using ancient copies of Books Online, and why it can be important to keep your local documentation current. This time I wanted to touch on massive and wasteful views that are re-used a little too much.
The one-size-fits-all view: you've seen them, you've had to use them, you may have even created them. Why do we use them? Convenience, laziness, you name it.
Picture a forums system where you have users. On the user page you will see a bunch of data: demographic information, a picture, reputation score, badges, activity, last login time, questions asked, questions answered, etc. This system probably has tables like this (please allow me to grossly simplify things here):
CREATE TABLE dbo.Users ( UserID INT PRIMARY KEY, Username NVARCHAR(255) NOT NULL UNIQUE, FirstName NVARCHAR(32) NOT NULL, LastName NVARCHAR(32) NOT NULL, ProfilePictureURL VARCHAR(1024), CreatedDate SMALLDATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ); CREATE TABLE dbo.Questions ( QuestionID INT PRIMARY KEY, UserID INT NOT NULL REFERENCES dbo.Users(UserID), Question NVARCHAR(MAX), EventDate SMALLDATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ); CREATE TABLE dbo.Answers ( AnswerID INT PRIMARY KEY, QuestionID INT NOT NULL REFERENCES dbo.Questions(QuestionID), UserID INT NOT NULL REFERENCES dbo.Users(UserID), Answer NVARCHAR(MAX) NOT NULL, EventDate SMALLDATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ); CREATE TABLE dbo.ReputationPoints ( UserID INT NOT NULL REFERENCES dbo.Users(UserID), QuestionID INT REFERENCES dbo.Questions(QuestionID), AnswerID INT REFERENCES dbo.Answers(AnswerID), PointValue INT NOT NULL DEFAULT (10), EventDate SMALLDATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ); CREATE TABLE dbo.BadgesEarned ( BadgeID INT NOT NULL REFERENCES dbo.Badges(BadgeID), UserID INT NOT NULL REFERENCES dbo.Users(UserID), EventDate SMALLDATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP );
There are likely a bunch of other columns and indexes geared for the different ways the application will use the data, but this core schema should be a good enough start to get my point across. Let's start from the user page I was talking about, where you see something like this:
Now, if you were tasked to write a query that retrieved all of the data necessary to display this page, it would probably be something like this:
DECLARE @UserID INT; SELECT @UserID = 1; SELECT u.UserID, u.UserName, u.ProfilePictureURL, u.FirstName, u.LastName, Reputation = COALESCE(r.Reputation, 0), MemberSince = u.CreatedDate, LastActivity = q.EventDate, QuestionsAsked = q.QuestionCount, QuestionsAnswered = q.AnswerCount, BadgesEarned = COUNT(b.BadgeID) FROM dbo.Users AS u LEFT OUTER JOIN ( SELECT UserID, Reputation = SUM(PointValue) FROM dbo.ReputationPoints WHERE UserID = @UserID GROUP BY UserID ) AS r ON u.UserID = r.UserID LEFT OUTER JOIN ( SELECT UserID, EventDate = MAX(EventDate), QuestionCount = MAX(QuestionCount), AnswerCount = MAX(AnswerCount), FROM ( SELECT UserID, EventDate = MAX(EventDate), QuestionCount = 0, AnswerCount = COUNT(*) FROM dbo.Answers WHERE UserID = @UserID GROUP BY UserID UNION ALL SELECT UserID, EventDate = MAX(EventDate), QuestionCount = COUNT(*), AnswerCount = 0 FROM dbo.Questions WHERE UserID = @UserID GROUP BY UserID ) AS s ) AS q ON u.UserID = q.UserID LEFT OUTER JOIN dbo.BadgesEarned AS b ON u.UserID = b.UserID WHERE u.UserID = @UserID;
A lot of people will turn this into a view, so that when they need to get this information from a different stored procedure, they don't have to repeat a lot of the same code. Creating the view only requires that you take out the WHERE clauses:
CREATE VIEW dbo.UserData AS SELECT u.UserID, u.UserName, u.ProfilePictureURL, u.FirstName, u.LastName, Reputation = COALESCE(r.Reputation, 0), MemberSince = u.CreatedDate, LastActivity = q.EventDate, QuestionsAsked = q.QuestionCount, QuestionsAnswered = q.AnswerCount, BadgesEarned = COUNT(b.BadgeID) FROM dbo.Users AS u LEFT OUTER JOIN ( SELECT UserID, Reputation = SUM(PointValue) FROM dbo.ReputationPoints GROUP BY UserID ) AS r ON u.UserID = r.UserID LEFT OUTER JOIN ( SELECT UserID, EventDate = MAX(EventDate), QuestionCount = MAX(QuestionCount), AnswerCount = MAX(AnswerCount), FROM ( SELECT UserID, EventDate = MAX(EventDate), QuestionCount = 0, AnswerCount = COUNT(*) FROM dbo.Answers GROUP BY UserID UNION ALL SELECT UserID, EventDate = MAX(EventDate), QuestionCount = COUNT(*), AnswerCount = 0 FROM dbo.Questions GROUP BY UserID ) AS s ) AS q ON u.UserID = q.UserID LEFT OUTER JOIN dbo.BadgesEarned AS b ON u.UserID = b.UserID;
Turning this into a view on its own is not necessarily a bad thing. If it is truly for re-use in other queries where most or all of the columns (and hence derived tables) are required, then I see nothing wrong with encapsulating these joins into a single view, instead of repeating those joins and derived tables several times across slightly different stored procedures. The problem comes in when the view gets used in stored procedures that need a very small subset of the columns. Another problem arises when some new interface needs a couple of those columns and then some other key piece of information… instead of writing a new query, the view above is modified to add another left join, or derived table, or correlated subquery, or UDF to calculate an additional column value. Not only is this wasteful for the new application page (since it doesn't need everything above), it is also wasteful for the existing application pages already using the view, since they will simply ignore this extra column. This is obviously a rabbit-hole you don't really want to find yourself stuck in a year or two down the road.
We have inherited legacy views like this that return all kinds of information about a particular type of entity, and they get called even when all that the application needs are a couple of columns from the core table (e.g. UserID and UserName, for populating a drop-down list). Even worse is when we go after a small subset of columns in the view and only for a single user (e.g. just get reputation count) – the stored procedure uses the view presumably because the user copied part of the query from another stored procedure that users the view. Obviously it is very expensive to materialize that view (especially those computations in the derived tables) for the entire set of users in the database, and this is quite wasteful if we're only after data that is found all by itself in the Users table. In this case, at least in my humble opinion, it would be better to just have a stored procedure called dbo.User_GetList, which just went after the dbo.Users table directly, or to create a second view, called UserList, that looked like this (since this is all that application page requires):
CREATE VIEW dbo.UserList AS SELECT UserID, UserName FROM dbo.Users;
Now you have two views to maintain, but one of them is fairly trivial, and will help prevent your server from doing a lot of extra work every time the list of users is presented to the application. And perhaps at this point you could move the larger view code above back into a stored procedure, where it can remain parameterized so that even when all of the columns are needed, it is a narrow set of rows that are evaluated. Again, I'm not sure that I can tell you when you will need the view for re-use, but in most cases I've seen, the page that has a smorgasbord of information about an entity only appears once in an application, and it is likely the case that smaller sets of columns are re-used more widely.
We are working on cleaning some of these up, and this work was the motivation behind this post. And you may laugh, or wonder, "who would do such a thing?", but I've seen it often.
Rob has pointed out several flaws with my example, including my syntax errors, as well as the fact that the query optimizer is smart enough to not materialize every OUTER JOIN that isn't needed in the result set (and this join elimination can happen with INNER JOINs as well, FYI). I'm not going to go back and correct those, and I'll admit that I rushed through my initial thinking of this article. But let's move on and try a slightly more extreme example – and one which is actually much more indicative of the cases where I've seen this kind of problem. Picture a case where you are logging all changes to a user, have normalized certain attributes (in this case a user category), and encapsulated formatting logic (in this case date/time output). You might have something like this:
-- normalize category descriptions CREATE TABLE dbo.UserCategories ( UserCategoryID TINYINT PRIMARY KEY, [Description] NVARCHAR(32) NOT NULL UNIQUE ); GO CREATE FUNCTION dbo.GetUserCategory ( @UserCategoryID TINYINT ) RETURNS NVARCHAR(32) AS BEGIN RETURN ( SELECT [Description] FROM dbo.UserCategories WHERE UserCategoryID = @UserCategoryID ); END GO -- encapsulate consistent date formatting CREATE FUNCTION dbo.FixDate ( @d SMALLDATETIME ) RETURNS CHAR(16) AS BEGIN RETURN ( SELECT CONVERT(CHAR(10), @d, 120) + ' ' + CONVERT(CHAR(5), @d, 108) ); END GO -- standard users table CREATE TABLE dbo.Users ( UserID INT PRIMARY KEY, UserName NVARCHAR(255) NOT NULL UNIQUE, UserCategoryID TINYINT NOT NULL REFERENCES dbo.UserCategories(UserCategoryID) /* , other columns */ ); GO -- keep a log of who changed what -- (including initial creation, so always >= 1 row per user) CREATE TABLE dbo.UserChangeLog ( UserID INT NOT NULL REFERENCES dbo.Users(UserID), EventDate SMALLDATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP, Details NVARCHAR(255) NOT NULL ); GO -- now, an uber-view intended for use across -- several different queries CREATE VIEW dbo.UserData AS SELECT u.UserID, u.UserName, /* , other u. columns */ Category = dbo.GetUserCategory(u.UserCategoryID), LastUpdate = dbo.FixDate(a.LastModified) FROM dbo.Users AS u INNER JOIN ( SELECT UserID, LastModified = MAX(EventDate) FROM dbo.UserChangeLog GROUP BY UserID ) AS a ON u.UserID = a.UserID; GO
In this case, you'll see that the entire view is materialized no matter how few columns you ask for in the SELECT. Just turn on the execution plan… you'll see that while the graphical plan is not 100% identical, all of the expensive operators are (click on the exec plan to embiggen):
SELECT UserID FROM dbo.UserData; SELECT * FROM dbo.UserData;
So, unlike in the example I presented earlier, in this case (and thanks Rob for the prodding), I think I can demonstrate that there is extra work being done in the case where you are only after a narrow subset of the columns.
Now, it could be argued that there are plenty of "bad habits" in there that are just as important, or even more important, than the one I am trying to highlight in this post. It could also be argued that folks like Rob or myself, as well as a lot of this site's readers, would never lay out a system like this and expect it to perform. But what I try to keep in mind when I write articles like this is that they are intended for people of all levels – those that may have contributed to this design, or those that may have inherited it and don't understand their current performance issues. And I think it is hard to argue that this type of encapsulation and/or normalization is rare out in the wild, so ultimately it could be an issue that is affecting a lot of environments.
Anyway, is the uber-view always bad? Of course not. My intention here is not to tell you, "you should never do x and that is always true." I just want to point out practices and usage patterns that should be thoroughly thought out as opposed to just blindly implemented.