Working around schema drift in SQL Server

At Stack Overflow, our environment has multiple implementations of a largely – but not 100% – identical schema. (By “schema,” I mostly mean the set of tables within a database.) I sometimes need to retrieve or update data across a large number of sites. Let’s say, pulling or removing information from the Users table, and related tables, across all of our databases.

This gets complicated.

Because:

  • Each of our Q & A sites has its own database, and each database has a table called Users (and some metadata is split out into a separate UserMetadata table).
  • A few sites are special, like Area 51, where the Users table looks a little different (it is wider, because the UserMetadata split isn’t present here for historical reasons).
  • A few others, like Chat, look like Q & A databases. They also contain a table called Users, but there the table is used differently and the column structure is unique.
  • Stack Overflow for Teams is more complex. Many teams share a single database, with the boundary between teams being a schema instead of a database. There is no table called dbo.Users, but a database may contain multiple Users tables, such as Team0001.Users, Team0002.Users, and so on. I’ll deal with this special case in a future post.

IntelliSense highlights a few of these differences straight away:

[Image: IntelliSense reveals schema drift]

Schema has “drifted” in these and other cases, for example when features are only applicable to one site, team, or product. Features like Collectives, Saves, or Staging Ground can lead to new columns or even entire tables that are only present in some databases.

This is okay, and there are many ways to deal with drift – even drift by design. But that’s not the point of this post: even if I could convince everyone that all of our databases should have identical schemas, I can’t snap my fingers and make it happen overnight. In the meantime, I have to deal with these differences.

A fake example.

Imagine I want to collect a list of users with more than one answer and who mention groundhogs in the “About Me” section of their profile. In any Q & A database, I can issue the following query:
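(A sketch of the shape I mean; I’m assuming here that the AboutMe and AnswerCount columns live in UserMetadata, keyed by UserId. The real column layout may differ.)

```sql
-- "Normal" Q & A shape: profile metadata is split out into UserMetadata.
SELECT u.Id, u.DisplayName
  FROM dbo.Users AS u
 INNER JOIN dbo.UserMetadata AS um
    ON um.UserId = u.Id
 WHERE um.AnswerCount > 1
   AND um.AboutMe LIKE N'%groundhog%';
```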

That works fine across “normal” site databases.

But there are exceptions.

In the Area 51 database, that query fails. As mentioned above, there is no UserMetadata table, and AnswerCount is not aggregated anywhere. I have to write the query differently:
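(Again, a sketch under assumed names: the wide Users table carries AboutMe directly, and answers get counted from Posts, where PostTypeId = 2 conventionally means "answer.")

```sql
-- Area 51 shape: Users is wide, and AnswerCount must be derived on the fly.
SELECT u.Id, u.DisplayName
  FROM dbo.Users AS u
 WHERE u.AboutMe LIKE N'%groundhog%'
   AND (SELECT COUNT(*)
          FROM dbo.Posts AS p
         WHERE p.OwnerUserId = u.Id
           AND p.PostTypeId = 2) > 1; -- 2 = answer
```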

In the Chat database, this query makes no sense, because there are no answers. So I want an easy way for this database to be skipped entirely.

The problem.

I don’t want to hard-code lists of databases that happen to support one version of the schema or another right now. This reminds me of browser detection techniques that rely on the user agent string the browser presents, rather than testing for the required functionality. Picture maintaining a list of all possible user agent strings and keeping it up to date. That is simply unmanageable, and I don’t want to do it for database names, either.

I want the process to be dynamic, so I don’t have to update some list somewhere every time a new database gets added, existing schema drift gets corrected, or new schema drift appears. And new drift will appear, like with new features, as mentioned before. We know about this drift; it goes through PRs and reviews, and it is what I call controlled drift. The technique I use relies on the schemas being well controlled, and on being aware of changes as they roll out.

So if someone goes rogue, creates a new table in the Area 51 database called dbo.MoreUserMetadata, and starts stuffing information about groundhogs into a random column there without anyone’s knowledge, that data will go unnoticed. That is one caveat to the solution I’ll propose.

A piece of the puzzle.

I have to repeat this process for a growing list of queries. So, in our central DBA database, I have an Actions table that I use to validate the metadata in each database before attempting to run a given query there.
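A minimal sketch of what such a table could look like; every name here is a stand-in, and the real table surely has more plumbing:

```sql
-- Hypothetical shape: up to two tables to verify, each with an optional
-- column, plus the query to run when all of the checks pass.
CREATE TABLE dbo.Actions
(
  ActionID int IDENTITY(1,1) PRIMARY KEY,
  Table1   nvarchar(256) NOT NULL,
  Column1  nvarchar(128) NULL,  -- NULL means: table check only
  Table2   nvarchar(256) NULL,
  Column2  nvarchar(128) NULL,
  Query    nvarchar(max) NOT NULL
);
```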

I put these queries into the table (minus the statement terminator, since these will later be placed inside a common table expression):
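Continuing with the hypothetical shape above, the two groundhog variants might be stored like this (note the absence of trailing semicolons):

```sql
INSERT dbo.Actions(Table1, Column1, Table2, Column2, Query)
VALUES
(N'dbo.Users', NULL, N'dbo.UserMetadata', N'AboutMe',
 N'SELECT u.Id, u.DisplayName
     FROM dbo.Users AS u
    INNER JOIN dbo.UserMetadata AS um
       ON um.UserId = u.Id
    WHERE um.AnswerCount > 1
      AND um.AboutMe LIKE N''%groundhog%'''),
(N'dbo.Users', N'AboutMe', N'dbo.Posts', NULL,
 N'SELECT u.Id, u.DisplayName
     FROM dbo.Users AS u
    WHERE u.AboutMe LIKE N''%groundhog%''
      AND (SELECT COUNT(*) FROM dbo.Posts AS p
            WHERE p.OwnerUserId = u.Id
              AND p.PostTypeId = 2) > 1');
```

The first row only runs where dbo.UserMetadata (with an AboutMe column) exists; the second only where dbo.Users carries its own AboutMe column and a dbo.Posts table is present.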

As I loop through each database, I build dynamic SQL that checks for the proper metadata before replacing tokens and executing the query. To demonstrate, let’s first set up a few databases that meet our various criteria:
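Something along these lines; database names and shapes are invented for the demo:

```sql
-- Three throwaway databases mimicking the drift described above.
CREATE DATABASE GroundhogQA;    -- "normal" Q & A shape
CREATE DATABASE GroundhogA51;   -- Area 51 shape: wide Users, no UserMetadata
CREATE DATABASE GroundhogChat;  -- Chat shape: Users means something else
GO
EXEC GroundhogQA.sys.sp_executesql N'
  CREATE TABLE dbo.Users(Id int PRIMARY KEY, DisplayName nvarchar(40));
  CREATE TABLE dbo.UserMetadata(UserId int PRIMARY KEY,
    AboutMe nvarchar(max) NULL, AnswerCount int NOT NULL);';

EXEC GroundhogA51.sys.sp_executesql N'
  CREATE TABLE dbo.Users(Id int PRIMARY KEY, DisplayName nvarchar(40),
    AboutMe nvarchar(max) NULL);
  CREATE TABLE dbo.Posts(Id int PRIMARY KEY,
    OwnerUserId int NOT NULL, PostTypeId tinyint NOT NULL);';

EXEC GroundhogChat.sys.sp_executesql N'
  CREATE TABLE dbo.Users(Id int PRIMARY KEY, Handle nvarchar(40));';
```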

I’ll use a cursor to step through each action within each database, and check if it has supporting metadata. I keep it flexible so that I can check one or two tables, and zero or one columns in each table. Why zero? For some queries, I know the columns exist in all databases, so I don’t need to check.
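Here is a simplified sketch of that loop, under the same hypothetical Actions shape as above. I’m glossing over the token replacement the real code performs; in this sketch, the stored query is simply wrapped in a common table expression that tacks on the source database name:

```sql
-- Reconstruction sketch, not the exact production code.
DECLARE @db    sysname,
        @exec  nvarchar(400),
        @ok    bit,
        @t1    nvarchar(256), @c1 nvarchar(128),
        @t2    nvarchar(256), @c2 nvarchar(128),
        @query nvarchar(max), @run nvarchar(max),
        @debug bit = 0;

-- One parameterized check, executed in each target database's context:
DECLARE @check nvarchar(max) = N'SELECT @ok = CASE WHEN
      EXISTS (SELECT 1 FROM sys.tables WHERE object_id = OBJECT_ID(@t1))
  AND (@c1 IS NULL OR EXISTS (SELECT 1 FROM sys.columns
       WHERE object_id = OBJECT_ID(@t1) AND name = @c1))
  AND (@t2 IS NULL OR EXISTS (SELECT 1 FROM sys.tables
       WHERE object_id = OBJECT_ID(@t2)))
  AND (@c2 IS NULL OR EXISTS (SELECT 1 FROM sys.columns
       WHERE object_id = OBJECT_ID(@t2) AND name = @c2))
  THEN 1 ELSE 0 END;';

DECLARE dbs CURSOR LOCAL FAST_FORWARD FOR
  SELECT name FROM sys.databases
  WHERE database_id > 4 AND state = 0;  -- online user databases only

OPEN dbs;
FETCH NEXT FROM dbs INTO @db;

WHILE @@FETCH_STATUS = 0
BEGIN
  -- A three-part name makes sp_executesql run in @db's context:
  SET @exec = QUOTENAME(@db) + N'.sys.sp_executesql';

  DECLARE actions CURSOR LOCAL FAST_FORWARD FOR
    SELECT Table1, Column1, Table2, Column2, Query FROM dbo.Actions;
  OPEN actions;
  FETCH NEXT FROM actions INTO @t1, @c1, @t2, @c2, @query;

  WHILE @@FETCH_STATUS = 0
  BEGIN
    -- Does this database have the metadata this action requires?
    EXEC @exec @check,
      N'@t1 nvarchar(256), @c1 nvarchar(128),
        @t2 nvarchar(256), @c2 nvarchar(128), @ok bit OUTPUT',
      @t1, @c1, @t2, @c2, @ok = @ok OUTPUT;

    IF @ok = 1
    BEGIN
      -- Wrap the stored query in a CTE and tag each row with its source:
      SET @run = N'WITH src AS (' + @query + N')
        SELECT DatabaseName = DB_NAME(), * FROM src;';
      IF @debug = 1 PRINT @run; ELSE EXEC @exec @run;
    END

    FETCH NEXT FROM actions INTO @t1, @c1, @t2, @c2, @query;
  END
  CLOSE actions; DEALLOCATE actions;

  FETCH NEXT FROM dbs INTO @db;
END
CLOSE dbs; DEALLOCATE dbs;
```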

Results.

Even though I have very different schema across (many!) databases, this technique allows me to pull a consolidated result without knowing in advance which ones store the same data in slightly different ways:

[Image: Results in spite of schema drift]

If you are trying to get a handle on what the code does, the @debug parameter provides some insight. Execute the script with @debug = 1 and, instead of the results, you’ll see the code that gets produced and executed by sp_executesql:
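With the sketch above, the printed statement for a "normal" site database would look something like this (illustrative only, since the CTE wrapper is my reconstruction):

```sql
WITH src AS (SELECT u.Id, u.DisplayName
     FROM dbo.Users AS u
    INNER JOIN dbo.UserMetadata AS um
       ON um.UserId = u.Id
    WHERE um.AnswerCount > 1
      AND um.AboutMe LIKE N'%groundhog%')
        SELECT DatabaseName = DB_NAME(), * FROM src;
```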

I’ll follow up soon with details on some of the more special cases.