Balance Costs and Performance Between MPP Databases and Apache Spark
Different Designs for Different Functions
Apache Spark and massively parallel processing (MPP) analytical databases are designed for different purposes. The first generation of “big data” architectures relied on the distributed Hadoop and MapReduce framework for analytical processing. This framework provided a breakthrough in that it increased the amount of data that could be processed, but it operated in batch mode, which limited its applicability to interactive analyses. Spark removed the batch-processing limitation of MapReduce, making interactive analyses on big data practical. It also provided capabilities for streaming analyses and machine learning, but it does not include its own persistent storage layer.
Distributed MPP systems are designed for scalable, high-performance analytical database operations. These database systems spread processing across multiple compute resources to provide scalability and enhance performance while maintaining transactional consistency, with support for data updates and deletes. Many applications require the transactional consistency or repeatability (for example, customer billing or financial systems) that the relational database technology underlying MPP systems provides. These systems also use a variety of optimization techniques to deliver very high performance across a wide variety of analyses, whether they involve small numbers of records or very large ones. And while the best implementations of MPP systems are not limited to SQL processing, the wide availability of SQL skills and tools makes them easier to deploy and integrate into an organization’s information architecture.
Using MPP databases together with Spark-based systems can harness the strengths of both approaches. Exploratory Spark-based analytics on big data can help an organization understand which data and analyses are most important. Using Spark, organizations can extract, transform and load (ETL) subsets of large data sources for more efficient MPP database operations in production environments. Then, where necessary, Spark can combine data from many sources, including MPP databases. Deploying both types of systems also gives an organization flexibility to optimize the cost and performance trade-offs of each.
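To make that ETL pattern concrete, the following is a minimal PySpark sketch of filtering a large lake dataset down to a production-relevant subset and loading it into an MPP database over JDBC. The paths, table names, connection URL and credentials are hypothetical placeholders, and the driver details will vary by product:

```python
# A minimal sketch of lake-to-MPP ETL. All locations and credentials are
# illustrative; the appropriate JDBC driver must be on Spark's classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-to-mpp-etl").getOrCreate()

# Read the full event history from the data lake (object storage path assumed).
events = spark.read.parquet("s3a://example-lake/events/")

# Keep only the recent, analysis-relevant subset before it reaches the database.
recent = (events
          .filter(events.event_date >= "2023-01-01")
          .select("customer_id", "event_date", "event_type", "amount"))

# Write the subset to the MPP database; URL and driver depend on the product.
(recent.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://mpp-host:5432/analytics")
       .option("dbtable", "prod.recent_events")
       .option("user", "etl_user")
       .option("password", "***")
       .mode("append")
       .save())
```

Because only the filtered subset crosses the network, the MPP database stores and serves far less data than the lake holds, which is the cost and performance trade-off the pattern exploits.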
Spark Accelerates Big Data Processing
Big data systems, often referred to as data lakes, are typically based on industry-standard file systems rather than optimized database structures. For instance, many of these systems were initially based on the Hadoop Distributed File System (HDFS), and even with the emergence of file formats optimized for analysis, including Optimized Row Columnar (ORC) and Parquet, analyses were still limited in responsiveness by the batch-processing nature of MapReduce. As cloud storage became more affordable, object stores became the preferred storage mechanism; at about the same time, the Spark framework became the preferred mechanism for big data processing because it was significantly faster than MapReduce.
There is a large community participating in the development and application of Spark, and this open-source nature is part of its appeal. The big data community tends to adopt open-source technologies, which makes it easier for organizations to find skilled resources and share code for their projects. This in turn helps address the most common challenge that organizations cite in their data lake processes: nearly one-quarter (23%) of participants in our research identified a lack of skilled resources as a problem. There is also a large ecosystem of products and services available for use with Spark. Consequently, some organizations have attempted to apply Spark to almost every data-related operation.
However, Spark was designed primarily as a big data analytical processing framework, not as an optimized database system. It helps organizations pull data from many sources, combine it and perform sophisticated analyses, including training machine learning models, and it has become one of the most common computing frameworks for transforming the data that organizations collect. And while certain database operations have been added to Spark, and some vendors have built data products or services around it, these features are not its primary function. Note that the common database benchmarks typically associated with analytical workloads (TPC-H and TPC-DS) do not publish results using Spark.
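As an illustration of that combining role, here is a brief PySpark sketch that joins a CRM extract landed in the lake with order history pulled from an operational database and summarizes the result. The sources, schemas and connection details are assumptions for illustration:

```python
# A sketch of Spark as a combining-and-transforming framework: two
# unrelated sources, one join, one aggregation. Locations are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("combine-sources").getOrCreate()

# Source 1: CRM extract landed in the lake as CSV.
customers = (spark.read.option("header", True)
                  .csv("s3a://example-lake/crm/customers.csv"))

# Source 2: order history read directly from an operational database.
orders = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://ops-host:5432/orders")
               .option("dbtable", "public.orders")
               .option("user", "reader")
               .option("password", "***")
               .load())

# Combine and summarize: revenue per customer segment.
summary = (orders.join(customers, "customer_id")
                 .groupBy("segment")
                 .agg(F.sum("order_total").alias("revenue")))
summary.show()
```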
Efficient Analytics Need Optimized Databases
MPP analytical databases employ several techniques designed to enable performant analytics, such as highly compressed columnar storage formats and other disk-based structures that are optimized to accelerate query processing. Most of these systems also include sophisticated caching schemes based on query patterns to minimize the amount of data that must be retrieved from disk. After many years of development, these systems are finely tuned to optimize compute costs and can manage all types of workloads, including traditional business intelligence as well as advanced geospatial, time series and machine learning use cases. By contrast, Spark’s cost-based optimization is a relatively recent addition, introduced in version 2.2 in 2017.
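Spark’s cost-based optimizer also illustrates the maturity gap in practice: it is disabled by default and relies on statistics the user must compute explicitly. A minimal sketch, assuming a table named sales (a placeholder) is already registered in the catalog:

```python
# Enabling Spark's cost-based optimizer (CBO). The table name is a
# placeholder; the table is assumed to exist in the session catalog.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cbo-demo")
         .config("spark.sql.cbo.enabled", "true")              # turn on cost-based optimization
         .config("spark.sql.cbo.joinReorder.enabled", "true")  # allow the CBO to reorder joins
         .getOrCreate())

# The CBO only helps if statistics exist; Spark does not gather them
# automatically, so each table must be analyzed explicitly.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS customer_id, amount")
```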
Sophisticated MPP databases provide standard SQL as well as other types of processing. This improves overall usability, since SQL is widely known and in many cases is the preferred language for working with data. In many ways, SQL is more “open” than open-source technologies, since there is a large community of skilled resources and many vendor tools from which an organization can draw. Relational MPP databases can also support a variety of storage formats optimized for different types of workloads, including semi-structured formats such as JSON. Some distributed databases can also query formats like Parquet that are stored in HDFS or object storage rather than in the database itself. And finally, with the rise of interest in AI/ML, MPP databases have added their own machine learning capabilities and integrated with data science languages and frameworks such as R and Python.
Responsive analytical environments also rely on enterprise workload management capabilities, and these can get very complicated. Many big data systems rely on brute force to address this challenge, but that approach consumes many parallel resources to deliver responsiveness. It is expensive and does not scale efficiently, causing contention and poor performance when multiple concurrent users access the system. Proper workload management requires years of development to isolate and allocate resources for different types of workloads, but with these capabilities in place, organizations can simultaneously execute a variety of workloads and meet the service level agreements (SLAs) the organization requires.
Using Spark and MPP Databases Together
Spark provides an excellent ETL framework for extracting and preparing data. This is valuable because accessing and preparing data for analysis was the most frequently reported time-consuming task in analytical processes, cited by more than two-thirds (69%) of the participants in our research. Our research also shows that nearly three-quarters (72%) of organizations use their data lakes and data warehouses in conjunction with each other, and Spark can help facilitate this relationship by extracting summarized or aggregated information from the data lake. At least one-half of the participants in our research report working with unstructured data in their data lakes, and here too Spark can convert unstructured or loosely structured data into a format ready for further analysis. It can also be used to prepare streaming data to feed into the database. Finally, Spark’s parallel data ingestion engine can pull in multiple data sources from outside the data lake and organize the data for persistence either in the data lake or in the database.
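The conversion of loosely structured data is perhaps the most common of these preparation steps. The following sketch parses raw web-server log lines into typed columns and persists them in a columnar format; the log location and line format are assumptions for illustration:

```python
# Loosely structured to analysis-ready: parse raw log lines into columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-prep").getOrCreate()

# Each row of the raw input is a single text line in a column named "value".
raw = spark.read.text("s3a://example-lake/raw/access_logs/")

# Extract the fields of interest with a regular expression (format assumed).
pattern = r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3})'
parsed = raw.select(
    F.regexp_extract("value", pattern, 1).alias("client_ip"),
    F.regexp_extract("value", pattern, 2).alias("timestamp"),
    F.regexp_extract("value", pattern, 3).alias("method"),
    F.regexp_extract("value", pattern, 4).alias("path"),
    F.regexp_extract("value", pattern, 5).cast("int").alias("status"),
)

# Persist in an analysis-friendly columnar format for downstream loading.
parsed.write.mode("overwrite").parquet("s3a://example-lake/curated/access_logs/")
```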
Using an MPP database in conjunction with Spark provides performance where performance matters. Interactive analyses such as business intelligence, visualization or ad hoc analyses must be highly responsive to support line-of-business operations. When ETL processes are too slow to meet business requirements, ELT (extract, load and transform) approaches, in which data is transformed using the MPP database, can accelerate data preparation tasks. MPP systems can also be used for AI/ML where faster modeling is necessary to explore more alternatives during model development, and MPP-based AI/ML scoring can provide a higher level of responsiveness in production, for instance in fraud detection or ad targeting workloads where sub-second response times are necessary.
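To sketch the ELT pattern: once Spark (or any loader) has landed raw data in a staging table, the transform runs inside the database itself, where the MPP engine can parallelize it. This example assumes a PostgreSQL-compatible MPP target; the table names and connection details are placeholders, and psycopg2 is one of several possible client libraries:

```python
# The "T" of ELT executed in the database rather than in Spark.
# Connection details and table names are hypothetical.
import psycopg2

conn = psycopg2.connect(host="mpp-host", dbname="analytics",
                        user="etl_user", password="***")
with conn, conn.cursor() as cur:
    # Aggregate the staged raw orders into a production table; the MPP
    # engine spreads this work across its nodes.
    cur.execute("""
        INSERT INTO prod.daily_revenue (day, revenue)
        SELECT order_date, SUM(order_total)
        FROM staging.raw_orders
        GROUP BY order_date
    """)
conn.close()
```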
Spark and MPP databases can also be used to manage “hot” and “cold” data. Not all data is of equal value or accessed equally often, so MPP systems can be used for “hot” data that is needed more frequently, providing faster access, while Spark-based systems can be used for “cold” data that is accessed less frequently. And an MPP database can be used to analyze combined information from both systems when necessary.
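One of several ways to combine the two tiers is to union them in Spark; depending on the products involved, the combination could equally run in the MPP database via external tables. A sketch of the Spark-side approach, with all locations and schemas illustrative:

```python
# Hot/cold split: recent rows live in the MPP database, older history
# sits in the lake; Spark unions them for full-timeline analyses.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hot-cold").getOrCreate()

hot = (spark.read.format("jdbc")
            .option("url", "jdbc:postgresql://mpp-host:5432/analytics")
            .option("dbtable", "prod.recent_events")
            .option("user", "reader")
            .option("password", "***")
            .load())

cold = spark.read.parquet("s3a://example-lake/archive/events/")

# The two sides are assumed to share a schema; unionByName guards
# against differences in column order.
full_history = hot.unionByName(cold)
full_history.groupBy("event_type").count().show()
```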
Unlock Benefits by Using Them Together
Using MPP systems to accelerate common workloads provides predictable performance for most analyses. Organizations can typically identify the workloads that line-of-business personnel use repeatedly (most database systems can track usage and response times), so executing the most common and most important workloads in MPP databases can help ensure that query performance is predictable and meets the necessary SLAs. And for operations that require low latency, MPP systems are critical to providing responsiveness.
Spark-based systems provide organizations with excellent data access and preparation capabilities, including acquiring, combining, transferring and transforming data. These capabilities accelerate the most time-consuming part of the analytical process. Spark also provides organizations with the capability to develop machine learning models over large amounts of data. Experimentation and exploration on these larger volumes of information can lead to valuable insights that can then be operationalized with higher performance in MPP systems.
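A minimal MLlib sketch of that exploration step, training a model directly over lake-resident data; the path, feature columns and label are hypothetical:

```python
# Exploratory model training over lake data with Spark MLlib.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("lake-ml").getOrCreate()

# Curated feature table in the lake; "churned" is assumed to be a 0/1 label.
df = spark.read.parquet("s3a://example-lake/curated/churn_features/")

# Assemble the assumed feature columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["tenure_months", "monthly_spend", "support_tickets"],
    outputCol="features")
train = assembler.transform(df)

model = LogisticRegression(labelCol="churned", featuresCol="features").fit(train)
print(model.coefficients)
```

Once a model like this proves its value, its scoring logic can be moved to the MPP database’s in-database ML facilities for the low-latency production use described earlier.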
Finally, as the comedian Steven Wright said, “You can’t have it all. Where would you put it?”
In a perfect world, organizations could store all the data of any type for all of history and access it instantaneously. Unfortunately, the laws of physics, a never-ending thirst for data, and cost constraints make that utopian vision unrealistic. Therefore, organizations can use the combination of Apache Spark and MPP database systems to optimize the trade-offs in cost and performance to retain and analyze as much data as possible while providing near-instantaneous access for the majority of analyses conducted.
ISG Software Research
ISG Software Research is the most authoritative and respected market research and advisory services firm focused on improving business outcomes through optimal use of people, processes, information and technology. Since our beginning, our goal has been to provide insight and expert guidance on mainstream and disruptive technologies. In short, we want to help you become smarter and find the most relevant technology to accelerate your organization's goals.
About ISG Software Research
ISG Software Research provides expert market insights on vertical industries, business, AI and IT through comprehensive consulting, advisory and research services with world-class industry analysts and client experience. Our ISG Buyers Guides offer comprehensive ratings and insights into technology providers and products. Explore our research at www.isg-research.net.
About ISG Research
ISG Research provides subscription research, advisory consulting and executive event services focused on market trends and disruptive technologies driving change in business computing. ISG Research delivers guidance that helps businesses accelerate growth and create more value. For more information about ISG Research subscriptions, please email contact@isg-one.com.
About ISG
ISG (Information Services Group) (Nasdaq: III) is a leading global technology research and advisory firm. A trusted business partner to more than 900 clients, including more than 75 of the world’s top 100 enterprises, ISG is committed to helping corporations, public sector organizations, and service and technology providers achieve operational excellence and faster growth. The firm specializes in digital transformation services, including AI and automation, cloud and data analytics; sourcing advisory; managed governance and risk services; network carrier services; strategy and operations design; change management; market intelligence and technology research and analysis. Founded in 2006 and based in Stamford, Conn., ISG employs 1,600 digital-ready professionals operating in more than 20 countries—a global team known for its innovative thinking, market influence, deep industry and technology expertise, and world-class research and analytical capabilities based on the industry’s most comprehensive marketplace data.
For more information, visit isg-one.com.