Content area
The development of the modern database management systems is leading to providing users and developers of information systems with new and powerful tools for storing and processing large volumes of data. Microsoft SQL Server 2019 provides extremely interesting new features with Big Data Clusters. This paper aims to present the capacity of this tool and to suggest options for using it to store and process large volumes of heterogeneous data.
Abstract. The development of the modern database management systems is leading to providing users and developers of information systems with new and powerful tools for storing and processing large volumes of data. Microsoft SQL Server 2019 provides extremely interesting new features with Big Data Clusters. This paper aims to present the capacity of this tool and to suggest options for using it to store and process large volumes of heterogeneous data.
Keywords. Big Data, Microsoft SQL Server, Big Data Clusters, Machine Learning, Artificial Intelligence.
(ProQuest: ... denotes formulae omitted.)
1. Introduction
Microsoft SQL Server is a contemporary relational database management system. The development of information technology in recent decades has led to a dramatic increase in the amount of stored and processed data, while also increasing the variety of data types. As a typical relational DBMS, Microsoft SQL Server is not designed to store Big Data, nor to store and process unstructured data, such as media files.
Microsoft SQL Server 2019 is expanding its data platform to cover big and unstructured data by integrating Apache Spark and HDFS into the Big Data Cluster [1].
Apache Spark is a platform for large-scale distributed data processing. Spark combines SQL, machine learning, graph computation, and stream processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming [2].
Hadoop Distributed File System (HDFS) is a highly fault-tolerant distributed file system designed to run on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS was originally built as infrastructure for the Apache Nutch web search engine project [3].
Big Data Cluster uses a scalable storage layer that integrates SQL Server and HDFS to scale to petabytes of data storage. Integrated with SQL Server Spark enables the use of open source data processing libraries and large-scale processing and analyze high-volume data in a distributed, in-memory compute layer.
2. Big Data Clusters Architecture
The SQL Server Big Data Cluster is a group of Linux containers organized by Kubernetes. Kubernetes is responsible for the status of the SQL Server Big Data Clusters. Kubernetes builds and configures the cluster nodes, assigns pods to nodes, and monitors the state of the cluster.
The following figure shows the components of a big data cluster for SQL Server.
Controller
The controller is responsible for the management and security of the cluster. It includes the control service, the configuration store, and other services such as Kibana, Elastic Search, and InfluxDB.
Compute pool
The compute pool supplies computational resources to the cluster. The compute pool contains several SQL Server pods (group of one or more containers). They can use data from SQL Server or external sources - Oracle, HDFS, or others. The pods are divided into SQL Compute instances for different processing tasks.
Data pool
The data pool is a group of SQL Server engine pods that is used either to cache data from an external source or to store incoming streaming data. Data can be delivered from SQL queries or Spark jobs. In either case, the data is partitioned and distributed across the SQL Server instances in the pool.
Storage pool
The storage pool is a group of pods comprised of SQL Server engine, HDFS data node, and Spark containers. All the storage nodes in a SQL Server big data cluster are members of an HDFS cluster. This provides the scalable storage tier along with the collocated compute for SQL Server and Spark right next to the data.
3.Scenarios for using Big Data Clusters
SQL Server Big Data Clusters has capabilities to store and process big data. There are many options to use data from external data sources, store big data in HDFS, and query data from multiple different sources. Then data can be used for data analysis, machine learning, and artificial intelligence. The following section provide more information about some of these scenarios.
Querying data from different sources
By using PolyBase, SQL Server Big Data Clusters can query external data sources without moving or copying data. PolyBase was introduced in SQL Server 2016 and it was used to process Transact-SQL queries that read data from external data sources. In SQL Server 2019 Polybase was enhanced to provide different data source connectors including Oracle, SQL Server, MongoDB (CosmosDB), and Teradata. These data sources can be accessed without installing any other software.
We can write T-SQL query to create external data source using CREATE EXTERNAL DATA SOURCE command:
...
There are seven different prefixes and several different ways to specify the path for each supported external data source.
The next step is to create the necessary external file formats, defining external data, by command CREATE EXTERNAL FILE FORMAT.
Then, in the external data source we can create external tables using CREATE EXTERNAL TABLE command:
...
There are the two main methods to ingest data into the data pool of a SQL Server 2019 Big Data Clusters:
* Using Transact-SQL. We can load data with Transact-SQL command INSERT;
* Using Spark jobs. We can create a Spark streaming job that loads data from the storage pool (HDFS) into the external table you created in the data pool.
Finally, we can query data from created external tables into data pool along with data from local tables. This option can be useful when we need to integrate data from different systems, for example to integrate data from heterogeneous systems [5], to combine dynamic data with data from relational database [6], and in many other cases [7], [8].
The big data cluster includes also a scalable HDFS storage pool. This storage pool can be used to store big data, potentially ingested from multiple external sources. When the big data is stored in this pool, there are many options to query or analyze the big data along with relational data from SQL Server database.
Artificial Intelligence and Machine Learning
SQL Server 2019 Big Data Clusters provides a comprehensive machine learning and artificial intelligence platform with all the tools and services required to ingest, store, prepare, and analyze data [9].
Big Data Clusters enable artificial intelligence and machine learning tasks on the data stored in HDFS storage pools and the data pools. We can use Spark as well as built-in artificial intelligence tools in SQL Server, using R, Python, Scala, or Java.
SQL Server 2019 Big Data Clusters provide all the tools to prepare data for analysis and to train the machine learning models. We can train a machine learning model using Spark, SparkML, and SQL Server Machine Learning Services. Then we can expose the model as a machine learning application using SQL Server ML with Transact-SQL or as an application with a REST interface in the application pool [10].
There are the two main group of tools for machine learning in SQL Server 2019 Big Data Clusters:
* SQL Server ML Services;
* Spark and Spark ML.
We can use SQL Server Machine Learning Services to prepare data, do feature engineering, train and deploy machine learning models inside the database. Machine Learning Services is a feature in SQL Server that gives the capability to execute Python and R scripts with data from relational database. We can use the Microsoft Python and R packages for predictive analytics and machine learning. The scripts are executed where the data resides, without moving data outside SQL Server. In addition to the Microsoft packages revoscalepy and microsoftml for Python, and RevoScaleR, MicrosoftML, olapR, and sqlrutils for R, which are included in Machine Learning Services, we can use also the open-source packages and frameworks, such as PyTorch, TensorFlow, and scikit-learn.
MLlib is Spark's machine learning library. At a high level, it provides the following features:
* ML Algorithms - classification, regression, clustering, and collaborative filtering;
* Featurization - feature extraction, transformation, dimensionality reduction, and selection;
* Pipelines - constructing, evaluating, and tuning ML Pipelines;
* Persistence - saving and load algorithms, models, and Pipelines;
* Utilities - linear algebra, statistics, data handling, etc.
Spark ML is a new package introduced in Spark 1.2 that aims to provide a uniform set of high-level APIs that help users create and set up machine learning pipelines.
Data scientists can choose either to use SQL Server Machine Learning Services in the master instance to run R, Python, or Java model training scripts or to use batch scoring over the data in HDFS with Spark.
The important thing is that the entire sequence of actions takes place in the environment of the SQL Server Big Data Cluster. There is no need to transfer the data to an external machine learning server or anywhere outside the big data cluster. The available power of the hardware underlying the big data cluster is accessible for data processing and computing resources can be scaled up and down according to needs.
4.Conclusion
SQL Server 2019 big data clusters offers interesting new possibilities to use SQL Server to bring the highest value enterprise data from a relational database and the high-volume big data together on one common, scalable data platform. Companies can use the power of PolyBase to virtualize their data stores, create data lakes, and create scalable data marts in a secure environment without needing to implement ETL tools. This makes data-driven applications and analysis much more efficient and productive.
References
1.Microsoft Corporation, "Microsoft SQL Server 2019 Big Data Clusters Technical white paper," 9 2018. [Online]. Available: https://info.microsoft.com/ww-landing-SQLDB-Microsoft-SQL-ServerWhitePaper.html.
2. Apache Software Foundation, "Apache Spark 2.4.5," 2018. [Online]. Available: https://spark.apache.org/docs/latest/.
3. Apache Software Foundation, "HDFS Architecture Guide," 22 8 2019. [Online]. Available: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
4. Microsoft Corporation, "What are SQL Server Big Data Clusters?," 2019. [Online]. Available: https://docs.microsoft.com/en-us/sql/big-data-cluster/big-data-cluster-overview?view=sql-serverver15.
5. Tsaneva M., "A Practical Approach For Integrating Heterogeneous Systems," Business management, no. 2, p. 11, 2019.
6. Milev P., "Technological Issues of Storing Dynamic Data in a Relational Database on Research Projects," Trakia Journal of Sciences, vol. 13, pp. 22-25, 2015.
7. Lazarova V., "The Expansion of the E-business Information Systems in the Cloud- Problems and Solutions," Ikonomiceski i Sotsialni Alternativi, no. 4, pp. 34-47, December 2015.
8. Mihova V., "Common Architecture Design of a Business Information System for Performance Management of the Business Applications," in 3rd International conference on application of information and communication technology and statistics in economy and education ICAICTSEE2013, Sofia, Bulgaria, 2013.
9. Blendea M., "Build an intelligent analytics platform with SQL Server 2019 Big Data Clusters," Microsoft, 2019. [Online]. Available: https://cloudblogs.microsoft.com/sqlserver/2019/11/11/buildan-intelligent-analytics-platform-with-sql-server-2019-big-data-clusters/.
10. Ward B., SQL Server 2019 Revealed Including Big Data Clusters and Machine Learning, North Richland Hills, Texas: Apress, 2019.
Copyright International Conference on Application of Information and Communication Technology and Statistics and Economy and Education (ICAICTSEE) 2019