Impact of UTF-8 support in SQL Server 2019



Problem

SQL Server has long supported Unicode characters in the form of the nchar, nvarchar, and ntext data types, which have been restricted to UTF-16. You could get UTF-8 data into nchar and nvarchar columns, but this was often tedious, even after UTF-8 support through BCP and BULK INSERT was added in SQL Server 2014 SP2. And in the end you still paid the Unicode storage and memory cost of two bytes per character, because all of the data had to be stored as UTF-16, even when some or all of it was ASCII.
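
For reference, that import path looked something like the following sketch; the table name and file path here are hypothetical, and CODEPAGE = '65001' is what tells BULK INSERT that the source file is UTF-8:

-- Hypothetical example of loading a UTF-8 file into an nvarchar column before SQL Server 2019;
-- CODEPAGE = '65001' identifies the source file as UTF-8:
BULK INSERT dbo.SomeUnicodeTable
FROM 'C:\temp\utf8_data.txt'
WITH (CODEPAGE = '65001', FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n');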

In SQL Server 2019, there are new UTF-8 collations that allow you to save storage space while still enjoying the benefits of compatibility and storing your UTF-8 data natively. Similar (but not identical) to Unicode compression, you only pay the additional storage cost for the characters that actually require that space. But what is the actual storage impact, and how does this affect memory grants and query performance?

Solution

There are numerous implications of various collations, code pages, and UTF formats. I feel like an expert could write a 20-part article series and still not be done. In fact, Solomon Rutzky has written about these topics quite a bit – most recently an article about UTF-8 support in SQL Server 2019 – which suggests that you probably should not use this feature (yet), and that your primary motivation for using UTF-8 collations on your columns should be compatibility, not storage space or performance.

Because I know people will still use it in spite of Solomon's advice, I want to focus solely on a specific UTF-8 collation, and how the space and memory requirements differ compared to UTF-16 data stored in traditional Unicode columns. I'll compare with and without compression, and with varying amounts of non-ASCII data, both as a percentage of each column value and as a percentage of the rows in the table.

First, let's take a look at a table that has columns with three different collations, and see what it looks like when we insert data into them. I took a screenshot of this query, because I know that some of these Unicode characters won't translate well by the time they reach your device:

Creating a simple table with columns having different collations, and measuring storage.

There are three columns: the first uses the standard Latin1_General collation, the second uses Latin1_General with supplementary character support (SC), and the third uses a new Latin1_General UTF-8 collation. I inserted a Greek character, an Asian character, and an emoji (the Canadian flag, of course!), both by themselves and then with some additional ASCII characters. Here are the results of LEN() and DATALENGTH() for each of the values:

Results of LEN / DATALENGTH against the same data in different collations.

Clearly, you can see that the lengths are largely the same, with the exception that the emoji counts as four in the first collation, because without supplementary character support the surrogate code units are counted individually (see this post by Greg Low for more on counting bytes versus characters). However, the actual storage is almost always the same or lower when using the UTF-8 collation (again, with one exception; this time the Asian character required one extra byte). I'll save you the suspense: with both row and page compression, and a similar #temp table, all of the results were the same.
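
If you want to reproduce a comparison like this without retyping characters from a screenshot, here is a rough, copy-and-paste-friendly sketch. The temp table name, column names, and exact characters are placeholders I chose; the characters are built with NCHAR() from code points and surrogate pairs so they survive copy and paste:

-- A rough approximation of the screenshotted script (names and characters are my own choices):
CREATE TABLE #collation_demo
(
    A nvarchar(32) COLLATE Latin1_General_100_CI_AI,          -- UTF-16, no supplementary character support
    B nvarchar(32) COLLATE Latin1_General_100_CI_AI_SC,       -- UTF-16 with supplementary character support
    C varchar(32)  COLLATE Latin1_General_100_CI_AI_SC_UTF8   -- UTF-8
);

-- Greek sigma, a CJK character, and the Canadian flag emoji (two supplementary
-- characters, so two surrogate pairs):
DECLARE @s nvarchar(10) = NCHAR(0x03C3) + NCHAR(0x4F60)
                        + NCHAR(0xD83C) + NCHAR(0xDDE8) + NCHAR(0xD83C) + NCHAR(0xDDE6);

INSERT #collation_demo(A, B, C) VALUES(@s, @s, @s);

SELECT LEN(A), DATALENGTH(A), LEN(B), DATALENGTH(B), LEN(C), DATALENGTH(C)
FROM #collation_demo;

DROP TABLE #collation_demo;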

Also, a comment in the screenshotted code above indicates that you still need the N prefix on string literals, even though the destination type is varchar. The reason is that the literal is interpreted first, using the code page of the current database's default collation, and if the N is not there, part of the Unicode data gets lost before it ever reaches the UTF-8 column.

Try this:

DECLARE @t TABLE (t varchar(10) COLLATE Latin1_General_100_CI_AI_SC_UTF8);
INSERT @t(t) VALUES('h'),(N'h'); -- note: the literal here is a non-ASCII character; it may not have survived rendering on this page
SELECT t FROM @t;

Results:

t
----
?
h
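
To see that this really hinges on the current database's default collation, here is a small sketch using a hypothetical database whose default collation is UTF-8; with that default, the same literal survives even without the N prefix. (The literal is a Greek sigma, which may not render in every browser.)

-- Hypothetical database just for this test; its default collation is UTF-8:
CREATE DATABASE Utf8DefaultTest COLLATE Latin1_General_100_CI_AI_SC_UTF8;
GO
USE Utf8DefaultTest;
GO
DECLARE @t TABLE (t varchar(10) COLLATE Latin1_General_100_CI_AI_SC_UTF8);
-- without the N prefix, the literal is interpreted using the database default
-- collation, which here is UTF-8, so nothing is lost:
INSERT @t(t) VALUES('σ'),(N'σ');
SELECT t FROM @t;
-- (drop the database afterward if you try this)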

In playing with this I also discovered another phenomenon, probably completely unrelated to collation, but interesting nonetheless. Varbinary representations of Unicode strings (like 0x3DD8A9DC, the UTF-16 little-endian byte sequence for the pile of poo emoji) can be interpreted differently depending on what else is in the statement. In this example, I'm executing three different batches: (1) inserting the varbinary value directly; (2) inserting the value directly and, in a separate statement, inserting the value after converting to nvarchar; and (3) inserting the value and the converted value in the same statement:

DECLARE @t TABLE (t varchar(10) COLLATE Latin1_General_100_CI_AI_SC_UTF8);
INSERT @t(t) VALUES(0x3DD8A9DC);
SELECT t FROM @t;
GO -- 1

DECLARE @t TABLE (t varchar(10) COLLATE Latin1_General_100_CI_AI_SC_UTF8);
INSERT @t(t) VALUES(0x3DD8A9DC);
INSERT @t(t) VALUES(CONVERT(nvarchar(10),0x3DD8A9DC));
SELECT t FROM @t;
GO -- 2

DECLARE @t TABLE (t varchar(10) COLLATE Latin1_General_100_CI_AI_SC_UTF8);
INSERT @t(t) VALUES(0x3DD8A9DC),(CONVERT(nvarchar(10),0x3DD8A9DC));
SELECT t FROM @t;
GO -- 3

The results had me puzzled:

Puzzling results using VALUES()

In the case where the inserts were performed in separate statements, both values were interpreted correctly. But when using a single VALUES() clause to insert two rows together, both were somehow converted to nvarchar first. This looks like a behavior of the table value constructor, and probably has nothing to do with collation, but I'll have to look at it in a future tip. In the meantime, watch out for this if you are ever changing scripts from one form to the other.
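
One quick way to confirm the single data type that a VALUES() table value constructor settles on is to inspect it with SQL_VARIANT_PROPERTY; this is a sketch, with alias names of my choosing:

-- Both rows report nvarchar as the base type, because data type precedence resolves
-- the mixed varbinary/nvarchar column to nvarchar before the values reach any INSERT:
SELECT v, base_type = SQL_VARIANT_PROPERTY(v, 'BaseType')
FROM (VALUES (0x3DD8A9DC),
             (CONVERT(nvarchar(10), 0x3DD8A9DC))) AS t(v);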

Back to the original investigation; what if we try this out at a larger scale? I wrote a script that generates CREATE TABLE statements for a bunch of tables with various settings for collation, compression, and how much non-ASCII data would actually be stored. Specifically, this creates 81 tables, with combinations of:

  • Compression (row, page, and none);
  • Collation (Latin1_General_100_CI_AI, Latin1_General_100_CI_AI_SC, and Latin1_General_100_CI_AI_SC_UTF8);
  • Percentage of rows containing UTF-8 data (0%, 50%, 100%); and,
  • Number of characters in each row that are UTF-8 data (0 characters, 25 characters, and 50 characters):
CREATE TABLE #cmp(cmp varchar(4));
INSERT #cmp VALUES('ROW'),('PAGE'),('NONE');

CREATE TABLE #coll(coll varchar(8));
INSERT #coll VALUES(''),('_SC'),('_SC_UTF8');

CREATE TABLE #row(rowconf varchar(9));
INSERT #row VALUES('0  % UTF8'),('50 % UTF8'),('100% UTF8');

CREATE TABLE #char(charconf varchar(7));
INSERT #char VALUES('0 UTF8'),('25 UTF8'),('50 UTF8');

SELECT N'CREATE TABLE dbo.' + QUOTENAME(N'UTF8Test' + coll.coll + N'_'
  + cmp.cmp + N'_' + rowconf + N'_' + charconf) + N'
(
    id int IDENTITY(1,1) NOT NULL,
    the_column ' + CASE coll.coll WHEN '_SC_UTF8' THEN N'' ELSE N'n' END + N'varchar(512)'
    + N' COLLATE Latin1_General_100_CI_AI' + coll.coll + N',
    CONSTRAINT ' + QUOTENAME(N'pk_UTF8Test_' + coll.coll + N'_' + cmp.cmp
    + N'_' + rowconf + N'_' + charconf) + N' PRIMARY KEY CLUSTERED(id)
    WITH (DATA_COMPRESSION = ' + cmp.cmp + N')
);'
FROM #cmp AS cmp, #coll AS coll, #row AS rowconf, #char AS charconf;

This script produces 81 rows of output, with table definitions like the following (they are not pretty scripts, of course):

CREATE TABLE dbo.[UTF8Test_ROW_0  % UTF8_0 UTF8]
(
    id int IDENTITY(1,1) NOT NULL,
    the_column nvarchar(200) COLLATE Latin1_General_100_CI_AI,
    CONSTRAINT [pk_UTF8Test__ROW_0  % UTF8_0 UTF8] PRIMARY KEY CLUSTERED(id)
      WITH (DATA_COMPRESSION = ROW)
);

CREATE TABLE dbo.[UTF8Test_SC_ROW_0  % UTF8_0 UTF8]
(
    id int IDENTITY(1,1) NOT NULL,
    the_column nvarchar(200) COLLATE Latin1_General_100_CI_AI_SC,
    CONSTRAINT [pk_UTF8Test__SC_ROW_0  % UTF8_0 UTF8] PRIMARY KEY CLUSTERED(id)
      WITH (DATA_COMPRESSION = ROW)
);

CREATE TABLE dbo.[UTF8Test_SC_UTF8_ROW_0  % UTF8_0 UTF8]
(
    id int IDENTITY(1,1) NOT NULL,
    the_column varchar(200) COLLATE Latin1_General_100_CI_AI_SC_UTF8,
    CONSTRAINT [pk_UTF8Test__SC_UTF8_ROW_0  % UTF8_0 UTF8] PRIMARY KEY CLUSTERED(id)
      WITH (DATA_COMPRESSION = ROW)
);

… 78 more tables …

Copy, paste, execute, and now you have 81 tables; we can generate INSERT statements to populate them in a similar way. There is more logic involved here, and the script is even uglier as a result – we want to insert 10,000 rows into each table, but those rows are a mix of values partially or wholly populated (or not) with Unicode data. I have a Canada flag in here, and added a comment pointing to its location in case it doesn't display properly in your browser:

DECLARE @sql nvarchar(max) = N'SET NOCOUNT ON;';
SELECT @sql += N'
WITH n AS (SELECT n = 1 UNION ALL SELECT n+1 FROM n WHERE n < 10000)
INSERT dbo.' + QUOTENAME(N'UTF8Test' + coll.coll + N'_' + cmp.cmp
  + N'_' + rowconf + N'_' + charconf) + N'(the_column) SELECT b FROM (SELECT
  b = REPLICATE(N''🇨🇦'',' + LEFT(charconf.charconf,2) + N')
  -----------------^ Canada flag is here
  + REPLICATE(N''.'',' + RTRIM(50-LEFT(charconf.charconf,2)) + N')) AS a
  CROSS APPLY (SELECT TOP (' + CONVERT(varchar(11),CONVERT(int,10000
  * LEFT(rowconf.rowconf,3)/100.0)) + N') n FROM n) AS b OPTION (MAXRECURSION 10000);

WITH n AS (SELECT n = 1 UNION ALL SELECT n+1 FROM n WHERE n < 10000)
INSERT dbo.' + QUOTENAME(N'UTF8Test' + coll.coll + N'_' + cmp.cmp
  + N'_' + rowconf + N'_' + charconf) + N'(the_column) SELECT b FROM (SELECT
  b = REPLICATE(N''.'',50)) AS a
  CROSS APPLY (SELECT TOP (' + CONVERT(varchar(11),10000-CONVERT(int,10000
  * LEFT(rowconf.rowconf,3)/100.0)) + N') n FROM n) AS b OPTION (MAXRECURSION 10000);'
FROM #cmp AS cmp, #coll AS coll, #row AS rowconf, #char AS charconf;

PRINT @sql;
--EXEC sys.sp_executesql @sql;

The print won't show you all of the script (unless you have SSMS 18.2 or use other measures like in this tip); instead, the output consists of pairs of INSERT statements, the first in each pair representing the rows that contain UTF-8 data, and the second representing the rows that don't:

WITH n AS (SELECT n = 1 UNION ALL SELECT n+1 FROM n WHERE n < 10000)
INSERT dbo.[UTF8Test_ROW_0  % UTF8_0 UTF8](the_column) SELECT b FROM (SELECT
  b = REPLICATE(N'🇨🇦',0 )
  ----------------^ Canada flag is here
  + REPLICATE(N'.',50)) AS a
CROSS APPLY (SELECT TOP (0) n FROM n) AS b OPTION (MAXRECURSION 10000);

WITH n AS (SELECT n = 1 UNION ALL SELECT n+1 FROM n WHERE n < 10000)
INSERT dbo.[UTF8Test_ROW_0  % UTF8_0 UTF8](the_column) SELECT b FROM (SELECT
  b = REPLICATE(N'.',50)) AS a
CROSS APPLY (SELECT TOP (10000) n FROM n) AS b OPTION (MAXRECURSION 10000);

In the first example, we want 0% of the rows to contain UTF-8 data, and 0 of the characters inside any row to contain UTF-8 data. This is why we insert no rows containing the Canada flags, and 10,000 rows of 50 periods. (I acknowledge that 50 periods will compress unfairly well, but more representative data is harder to automate, and GUIDs would be too far the other way.)

If we take an arbitrary example from later in the script, we can see how the rows are distributed differently – half the rows contain UTF-8 data, and those that do contain 25 Unicode characters and 25 periods:

WITH n AS (SELECT n = 1 UNION ALL SELECT n+1 FROM n WHERE n < 10000)
INSERT dbo.[UTF8Test_ROW_50 % UTF8_25 UTF8](the_column) SELECT b FROM (SELECT
  b = REPLICATE(N'🇨🇦',25)
  ----------------^ Canada flag is here
  + REPLICATE(N'.',25)) AS a
CROSS APPLY (SELECT TOP (5000) n FROM n) AS b OPTION (MAXRECURSION 10000);

WITH n AS (SELECT n = 1 UNION ALL SELECT n+1 FROM n WHERE n < 10000)
INSERT dbo.[UTF8Test_ROW_50 % UTF8_25 UTF8](the_column) SELECT b FROM (SELECT
  b = REPLICATE(N'.',50)) AS a
  CROSS APPLY (SELECT TOP (5000) n FROM n) AS b OPTION (MAXRECURSION 10000);

If you are confident I'm not going to blow up your disk, change this:

PRINT @sql;
--EXEC sys.sp_executesql @sql;

To this:

--PRINT @sql;
EXEC sys.sp_executesql @sql;

Then execute it. On my system this took anywhere from 20 – 40 seconds, and the data and log files were 400 MB and 140 MB respectively (starting from a fairly standard AdventureWorks sample database).
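
If you want to check the impact on your own database files, something like this works (a quick sketch; size in sys.database_files is reported in 8 KB pages):

-- Data and log file sizes for the current database:
SELECT name, type_desc, size_mb = size * 8.0 / 1024
FROM sys.database_files;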

Now, we are ready to spot check and analyze! First, let's make sure all the tables have the right number of rows:

SELECT t.name, p.rows
  FROM sys.tables AS t
  INNER JOIN sys.partitions AS p
  ON t.object_id = p.object_id
  WHERE t.name LIKE N'UTF8%'; -- 81 rows, all with 10,000 rows

Then we can spot check any table where we expect there to be some variance:

SELECT TOP (2) * FROM dbo.[UTF8Test_ROW_50 % UTF8_50 UTF8] ORDER BY id;
SELECT TOP (2) * FROM dbo.[UTF8Test_ROW_50 % UTF8_50 UTF8] ORDER BY id DESC;

SELECT TOP (2) * FROM dbo.[UTF8Test_SC_UTF8_ROW_50 % UTF8_25 UTF8] ORDER BY id;
SELECT TOP (2) * FROM dbo.[UTF8Test_SC_UTF8_ROW_50 % UTF8_25 UTF8] ORDER BY id DESC;

Sure enough, we see what we expect to see (this doesn't prove anything about collation; it just confirms that my script did what I thought it would do):

50 Canada flags and 50 periods in the first table, followed by 25 flags + 25 periods and 50 periods in the second table

Now, what about storage space? I like to look at the page allocations DMV, sys.dm_db_database_page_allocations, especially for relative comparisons. I pulled this simple query from my templates:

SELECT t.name,PageCount = COUNT(p.allocated_page_page_id) 
FROM sys.tables AS t CROSS APPLY
sys.dm_db_database_page_allocations(DB_ID(), t.object_id, 1, NULL, 'LIMITED') AS p
WHERE t.name LIKE N'UTF8%'
GROUP BY t.name
ORDER BY PageCount DESC;

And here were the results:

Number of pages used by each table. Page compression is a real equalizer.

I moved the output into Excel and separated it into three columns, almost arbitrarily. The column on the left is every table that required more than 100 pages, and on the right is every table that used page compression. The middle column is everything with 81 or 89 pages. Now, I might have stacked the deck in favor of compression, since all the values on any given page are likely to be the same, which means that the page counts under compression are likely much lower than they would be with more real-world data. But what this does show is that, given the same data, page compression is an absolute equalizer. The rest is a mixed bag, with no real observable trends, except for illustrating that when more of the data is Unicode, the page counts are higher, regardless of collation (and, for the most part, whether compression is row or none).
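
As an aside, if you would rather avoid the undocumented page-allocation DMV, sys.dm_db_partition_stats is a documented alternative. This sketch is not the query behind the chart above, and the counts won't match a LIMITED allocation scan exactly, but the relative comparison should hold:

-- Documented alternative for per-table page counts:
SELECT t.name, PageCount = SUM(ps.used_page_count)
FROM sys.tables AS t
INNER JOIN sys.dm_db_partition_stats AS ps
  ON ps.[object_id] = t.[object_id]
WHERE t.name LIKE N'UTF8%'
GROUP BY t.name
ORDER BY PageCount DESC;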

How about performance? The thing I'm typically concerned about in cases like this – in addition to the number of pages that would have to be read in a scan, for example – is the memory grant that would be allocated, particularly for queries with sorts. Duration is always something to be interested in, too, but I always feel like free memory is more scarce than patience. I wrote a script to generate a query to run against each table, 10 times:

DECLARE @sql nvarchar(max) = N'DBCC FREEPROCCACHE;
GO
';

;WITH x AS (SELECT name FROM sys.tables WHERE name LIKE N'UTF8%')
SELECT @sql += N'
SELECT TOP 1 c FROM (SELECT TOP 9999 the_column FROM dbo.'
  + QUOTENAME(name) + ' ORDER BY 1) x(c);
GO 10'
FROM x;

PRINT @sql;

In this case I used the PRINT output (copy and paste into a new window) instead of sys.sp_executesql, because the latter can't accept batch separators like GO 10. After running the queries, I went to sys.dm_exec_query_stats to check on memory grants and query duration. I could have analyzed the 81 queries independently, but I decided to simply group them by collation and compression. The query I ran:

WITH x AS 
(
  SELECT coll = CASE WHEN t.name LIKE '%SC_UTF8%' THEN 'UTF8'
      WHEN t.name LIKE '%_SC%' THEN 'SC' ELSE '' END,
    comp = CASE WHEN t.name LIKE N'%_PAGE_%' THEN 'Page'
      WHEN t.name LIKE N'%_ROW_%' THEN 'Row' ELSE 'None' END,
      max_used_grant_kb,max_ideal_grant_kb,max_elapsed_time
  FROM sys.dm_exec_query_stats AS s
  CROSS APPLY sys.dm_exec_sql_text(s.plan_handle) AS st
  INNER JOIN sys.tables AS t
  ON st.[text] LIKE N'SELECT TOP%' + t.name + N'%'
  WHERE t.name LIKE N'UTF8%'
)
SELECT coll, comp,
  max_used_grant = AVG(max_used_grant_kb*1.0),
  ideal_grant    = AVG(max_ideal_grant_kb*1.0),
  max_time       = AVG(max_elapsed_time*1.0)
FROM x GROUP BY coll,comp
ORDER BY coll, comp;

This resulted in two interesting charts. The first one shows that the memory grants for UTF-8 data were slightly smaller:

Results of memory grants for various queries against UTF-16 and UTF-8 data.

The second chart, unfortunately, shows that the average duration of the UTF-8 queries was 50% higher or more:

Results of duration for various queries against UTF-16 and UTF-8 data.

Summary

The new UTF-8 collations can provide benefits in storage space, but if page compression is used, the benefit is no better than what you already get with the older collations. And while memory grants might be slightly lower, potentially allowing for more concurrency, the runtime of those queries was significantly longer. After this small bit of investigation, I wouldn't say that there is an obvious situation where I would rush to switch to UTF-8 collations.

Now, again, some of this is surely influenced to a degree by my simplistic and well-compressed sample data, so I'm not going to pretend that this has any universal relevance to all data and workloads. It does, however, show some ways you could perform testing with your own data and in your specific environment.

In another experiment, I plan to try out clustered columnstore indexes, to see how traditional nvarchar and UTF-16 there might compare against the new UTF-8 collations. After writing the current article, I suspect that writing the next one will lead to me becoming a bigger fan of columnstore, though not any fonder of UTF-8 collations.

Next Steps

Read on for related tips and other resources:






About the author
Aaron Bertrand (@AaronBertrand) is a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. He is editor-in-chief of the performance-related blog SQLPerformance.com, and also blogs at sqlblog.org.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Monday, June 20, 2022 - 11:52:37 AM - Dave Boltman
Super-useful, thanks!

One small nit to pick, is that UTF-8 *is* also Unicode. Although Microsoft originally added to the confusion by having Notepad (IIRC) give the option of saving text files in ANSI or Unicode or UTF-8, presenting the choice as Unicode vs. UTF-8 is misleading. MS seems to have cleaned up their act in Notepad, and a better way of contrasting this choice would be UTF-16 vs. UTF-8 (where UTF is short for *Unicode* Transformation Format in both cases).

I'm in no way minimising how useful your article is. Thank you for taking the time in creating this article!

Tuesday, November 26, 2019 - 5:52:54 PM - Solomon Rutzky

Hello again. Sorry for the delay, but I finally finished the blog post to, once and for all, clear up any confusion surrounding how many bytes can be taken up by characters across the various string datatypes: "How Many Bytes Per Character in SQL Server: a Completely Complete Guide" ( https://sqlquantumleap.com/2019/11/22/how-many-bytes-per-character-in-sql-server-a-completely-complete-guide/ ). I hope that helps.

Take care,
Solomon...


Tuesday, October 15, 2019 - 8:44:56 PM - Solomon Rutzky

Hi Aaron. Thanks for posting these tests. A few points need to be clarified:

  1. Regarding my UTF-8 post, I have been needing to update that for several months now. Unfortunately I have been busy (plus a week in the hospital and then recovery time), so I just haven't had a chance. To summarize, there have been several improvements to the UTF-8 handling so I am much more confident in it now, but the overall advice hasn't changed: using UTF-8 should primarily be for compatibility with existing app code that is already UTF-8 or can be with little modification (i.e. no need to change datatypes or prefix string literals with "N"). And, when using UTF-8, it's best to go all-in and use it as the database's default collation and the collation for all columns. Using UTF-8 in only a few columns can be done successfully, but it requires extensive testing as the opportunities for problems goes way up.
  2. There is no difference, storage-wise, between Latin1_General_100_CI_AI and Latin1_General_100_CI_AI_SC. And there really shouldn't be any performance difference either. The only difference when adding the _SC flag is the ability for the collation to properly handle supplementary characters (i.e. UTF-16) rather than only BMP characters (i.e. the first 65,536 code points / UCS-2). This impacts the ability to create characters using the NCHAR() function with values above 0xFFFF / 65535 (it returns NULL for values above that limit when used in a database with a non-"_SC" default collation), and how built-in functions such as LEN(), CHARINDEX(), SUBSTRING(), etc interpret them (1 character for "_SC" collations, else 2).
  3. Regarding the example with the Canadian flag and the following statement: "you can see that the lengths are largely the same, with the exception of how the emoji requires four bytes in the first collation" -- In column "A", it isn't taking up 4 bytes; it is reporting that it is 4 characters. And that is correct, at least when looking at it from a UCS-2 point of view. Your chart shows it taking up 8 bytes, which is correct for the four 2-byte code units. The reason it shows as 4 characters there yet only 2 in columns B and C is that column A is using a collation that doesn't recognize supplementary characters; it only sees the surrogate pairs used in UTF-16 to create the supplementary characters. But the character displays the same as the other column because the bytes are the same (well, the same as in column B) and it is the UI and font that interpret those bytes. The following example might help visualize the difference as well as show how to include the character in your example scripts so that you don't need to use screen shots, and so that it will be easier for people to copy and paste your example code to try for themselves:

    -- The following work when the current DB has a default collation matching the column title:
    SELECT NCHAR(0x1F1E8) + NCHAR(0x1F1E6) AS ["_SC" and "_140_" Collations],
           NCHAR(0xD83C) + NCHAR(0xDDE8) + NCHAR(0xD83C) + NCHAR(0xDDE6) AS [AllCollations];

    The other issue here with the Canadian flag emoji is that it is not a single Unicode character / code point. In order to create many of the variations of emojis that exist, multiple code points are used in combination. The various country flags are examples of this construct. This emoji is actually 2 supplementary characters, which requires 2 surrogate pairs, hence four 2-byte code units.
  4. Nothing against Greg, but his "Think that varchar(10) means 10 characters?" post is largely incorrect and should not be used as a reference. I am working on a post to explain bytes per character and will let you know when I am done (hopefully today).
  5. Regarding the note that the "N" prefix is still needed for string literals, and "SQL Server will try to interpret the value of the string first, and if the N is not there, part of the Unicode data gets lost." -- this is misleading. SQL Server has no inherent behavior. Interpretation of strings is handled by the current database's default collation. If you had been using a database that had a "_UTF8" collation, then both rows would have correctly inserted the Unicode characters.
  6. Regarding your 3 tests for inserting the "pile of poo" emoji:
    1. The binary literal of "0x3DD8A9DC" is the UTF-16 Little Endian encoding of it. That is neither the code point nor even the UTF-16 surrogate pair. It's actually just a series of bytes. When you convert that to UTF-8, it merely sees the first byte 0x3D which is "=", then it sees the next two bytes of 0xD8A9 which make up the "ة" character, and the final byte of 0xDC which is an incomplete UTF-8 sequence so nothing is returned.
    2. As you suspected, you get different results when using the table value constructor (i.e. test 3) since that is a derived table and so it is returning a column of 2 rows and that column can only have 1 datatype, which is NVARCHAR due to datatype precedence. In this particular case, the 0x3DD8A9DC value is implicitly converted to NVARCHAR such that there can be a consistent datatype for the column, and that happens before the actual INSERT, hence the UTF-8 column sees a Unicode character, not a series of raw bytes.
  7. Regarding the link to the other MSSQLTIPS article, "SQL Server differences of char, nchar, varchar and nvarchar data types": that article is also largely incorrect and should also not be used as a reference.

I hope this helps. Take care, Solomon...


Tuesday, October 15, 2019 - 5:41:58 PM - Brian

Something to note with your "INSERT @t(t) VALUES" example above: if you use the collation Latin1_General_100_CI_AI_SC you get the same result as with the UTF8. This was tested on SQL Server 2016, which does not have that UTF8 collation.

My thoughts on that are that "VALUES(a),(b)" requires all of the data in a and b to be the same data type.  So casting b as NVARCHAR (in your example) results in a being implicitly cast to NVARCHAR.  Which I believe is the same conclusion you came to.

This is testable with a query like this:
DECLARE @t TABLE (t varchar(10) COLLATE Latin1_General_100_CI_AI_SC);
INSERT @t(t) VALUES('hello'),(CONVERT(INT,'3'));
SELECT t FROM @t;
GO -- 3

Which gives an error telling me it failed to convert 'hello' to data type int even though the table holds VARCHAR(10) values.  














