Using Microsoft SQL Server 2019 Big Data Clusters

Abstract. Modern database management systems increasingly give users and developers of information systems powerful new tools for storing and processing large volumes of data. Microsoft SQL Server 2019 introduces such capabilities with Big Data Clusters. This paper presents the capabilities of this tool and suggests options for using it to store and process large volumes of heterogeneous data.

Keywords. Big Data, Microsoft SQL Server, Big Data Clusters, Machine Learning, Artificial Intelligence.

1. Introduction

Microsoft SQL Server is a contemporary relational database management system. The development of information technology in recent decades has led to a dramatic increase in the amount of stored and processed data, while also increasing the variety of data types. As a typical relational DBMS, Microsoft SQL Server is not designed to store Big Data, nor to store and process unstructured data, such as media files.

Microsoft SQL Server 2019 is expanding its data platform to cover big and unstructured data by integrating Apache Spark and HDFS into the Big Data Cluster [1].

Apache Spark is a platform for large-scale distributed data processing. Spark combines SQL, machine learning, graph computation, and stream processing. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming [2].

Hadoop Distributed File System (HDFS) is a highly fault-tolerant distributed file system designed to run on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS was originally built as infrastructure for the Apache Nutch web search engine project [3].
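The fault tolerance described above rests on a mechanism this section does not spell out: HDFS splits each file into fixed-size blocks and stores several replicas of every block on different nodes. The following is a minimal sketch of that idea in Python, not the actual HDFS placement policy (which is rack-aware). The function names, node names, and round-robin placement are assumptions made for illustration; the 128 MB block size and replication factor of 3 mirror common HDFS defaults.

```python
import math

MB = 1024 * 1024


def place_blocks(file_size, block_size, nodes, replication=3):
    """Return {block_index: [node, ...]} using simple round-robin placement.

    Hypothetical helper: real HDFS placement also considers racks and load.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for b in range(n_blocks):
        # Put the replicas of block b on `replication` consecutive nodes.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_after_failure(placement, failed):
    """True if every block keeps at least one replica on a surviving node."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


# Illustrative cluster of four nodes storing one 300 MB file.
nodes = ["node1", "node2", "node3", "node4"]
layout = place_blocks(300 * MB, 128 * MB, nodes)

for block, replicas in layout.items():
    print(block, replicas)
print(readable_after_failure(layout, {"node1", "node2"}))
```

In this toy layout the file stays readable after any two nodes fail, because each block has three replicas on distinct nodes. That is the property that makes low-cost, failure-prone hardware acceptable.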

Big Data Cluster uses a scalable storage layer that integrates SQL Server and HDFS to scale to petabytes of data storage. Integrated with SQL Server, Spark enables the use of open-source data processing libraries and the large-scale processing and analysis of high-volume data in a distributed, in-memory compute layer.

2. Big Data Clusters Architecture

The SQL Server Big Data Cluster is a group of Linux containers organized by Kubernetes. Kubernetes is...
