Introducing Big Data Clusters

Well, they bet the farm, and they lost.

SQL Server has evolved over the years to accommodate increasing volumes and varieties of data (including XML, JSON, spatial, and graph data). It’s even added in-memory capabilities to process and analyze huge amounts of data faster. However, relational systems and big data stores have been in their own siloes, making it difficult for enterprises to join and analyze all available data. With SQL Server 2019, it’s finally possible to have a platform that embraces both unstructured big data and relational data, as well as leverage scale-out compute.

When first announced, SQL Server 2019 Big Data Clusters were met with a bit of fanfare – and for good reason. The scale-out data virtualization platform increases flexibility and reduces time to value from data. In this blog post, I show you how.

SQL Server Big Data Clusters architecture

First, let’s look under the hood. SQL Server Big Data Clusters consist of the SQL Server 2019 database engine, Spark, and Hadoop Distributed File System (HDFS) running on Kubernetes. These components run side-by-side in a unified platform, enabling you to read, write, and process big data from either T-SQL or Spark. As a result, you can easily combine and analyze high-value relational data with high-volume big data. And because Big Data Clusters run on Kubernetes, you get a predictable, fast, and elastically scalable deployment.

Use cases for SQL Server Big Data Clusters

One of the great things about SQL Server Big Data Clusters is the number of ways it allows you to interact with your data. Here are a few examples:

Data virtualization
With data virtualization, there's no overhead or security implications tied with moving or copying the data. It reduces maintenance needs and storages costs while providing more accurate, higher data quality results. The querying is done in a high-performance manner. SQL Server Big Data Clusters leverage SQL Server PolyBase to query external data sources such as Oracle, Teradata, and Hadoop.
Data lake
SQL Server Big Data Clusters include a scalable HDFS storage pool, which can be used to store big data from one or more sources. Once the data is stored in HDFS, you can analyze and query it, as well as combine it with relational data. Also new for SQL Server 2019 Big Data Clusters, SQL Server natively reads HDFS le types like CSV and Apache Parquet.
Scale-out data mart
The scale-out data mart for SQL Server instances allows you to ingest data from a variety of sources and distribute it across data pool nodes, which then act as a cache for analysis. Parallel queries against the curated data are more efficient because data is distributed across the data pool instances.
Integrated Artificial Intelligence (AI) and Machine Learning
SQL Server 2019 is the first commercial database with built-in AI. With SQL Server Big Data Clusters, or data stored in the SQL Server master instance, you can use AI and machine learning tasks on data stored in HDFS storage pools and the data pools. You can do this with Spark or the built-in AI tools in SQL Server.

Managing, monitoring, and securing big data

The SQL Server Big Data Clusters can be managed and monitored via a combination of command line tools, APIs, and dynamic management views. In addition, Azure Data Studio can be used to perform a variety of tasks on Big Data Clusters. Azure Data Studio provides a UX-guided deployment experience, and monitoring and troubleshooting experiences for Big Data Clusters (including dashboards and a set of notebooks to help with troubleshooting, repair, and so on). Integrated security and high availability are built-in management experiences in Big Data Clusters.

The SQL Server 2019 extension provides:

Built-in snippets for common management tasks
The ability to browse HDFS, upload and preview files, and create directories
The ability to create, open, and run Jupyter-compatible notebooks
Data virtualization wizard to simplify the creation of external data sources to enable virtualization

Summary

SQL Server Big Data Clusters allow you to use SQL Server to bring high-value relational data and high-volume big data together on a unified, scalable data platform. The many use cases offer flexibility and reduce your time to value. To learn more about what you can do with Microsoft SQL 19, check out the free Packt guide, Introducing Microsoft SQL 2019. If you’re ready to jump to a fully managed cloud solution, check out The Essential Guide to Data in the Cloud.

By: Aaron Bertrand

I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, write at Simple Talk, SQLPerformance, and MSSQLTips, and have had the honor of speaking at more conferences than I can remember. In non-tech life, I am a husband, a father of two, a huge hockey and football fan, and my pronouns are he/him.