Abstract
Modern organizations generate and process unprecedented volumes of structured, semi-structured, and unstructured data from diverse sources, creating significant architectural and engineering challenges for traditional data processing systems. Industry analyses consistently report failure rates of 60-85% for Big Data projects, with architectural limitations identified as a primary contributing factor. Current reference architectures suffer from monolithic designs, inadequate treatment of cross-cutting concerns (security, privacy, metadata), and limited adaptability to evolving data ecosystems. This paper presents Terramycelium, a novel reference architecture for Big Data systems that addresses these limitations through a domain-driven, event-oriented approach. The architecture integrates principles from complex adaptive systems, domain-driven design, distributed systems, and event-driven architectures to enable autonomous domain-specific data ownership while maintaining system-wide coherence through asynchronous event communication. We developed Terramycelium following empirically grounded reference architecture guidelines and evaluated it through two complementary methods: a case-mechanism experiment and an expert opinion assessment. The case-mechanism experiment demonstrated the architecture’s capability to process 1.693GB of data with 50-100 second latency, handle 771,305 streaming messages with 0.0000148 second ingestion latency, and maintain stable performance with 24% CPU utilization under high-volume scenarios. Expert evaluation (n=3, 10-32 years of experience) validated the architecture’s innovative integration of domain-driven design with data engineering, while identifying implementation complexity and organizational readiness as adoption challenges. Terramycelium contributes a validated approach for building scalable, maintainable Big Data systems that addresses the limitations of existing monolithic architectures while aligning with modern software engineering practices.
Introduction
With the proliferation of digital devices and the advent of the internet, data creation and connectivity have undergone a significant transformation, ushering in an era of exponential information growth. This expansion of data strains traditional data processing systems and calls for inventive methods in data architecture [1, 2]. The volume, variety, and velocity of data in the current digital environment necessitate innovative solutions, particularly in the field of Big Data (BD).
Data needs have dramatically evolved, transitioning from basic business intelligence (BI) functions, like generating reports for risk management and compliance, to incorporating machine learning across various organisational facets [3]. These range from product design with automated assistants to personalised customer service and optimised operations. Also, as machine learning gains popularity, application development must evolve from rule-based, deterministic models to more adaptable, probabilistic models. These models should be capable of handling a broader range of outcomes and require continuous improvement through access to the latest data. This evolution underscores the need to reevaluate and simplify our data management strategies to address the growing and diverse expectations placed on data.
Currently, the success rate of BD projects is low, and recent surveys indicate that prevailing approaches to BD are not effectively meeting these expectations. According to a survey conducted by Databricks [4], only 13% of organisations are highly successful in their data strategy. Additionally, a report by NewVantage Partners reveals that only 24% of organisations have successfully converted to being data-driven, and only 30% have a well-established BD strategy [5]. These observations, further corroborated by research conducted by McKinsey & Company [6] and Gartner [7], emphasise the difficulties of successfully using BD in industry. These difficulties include the lack of a clear understanding of how to extract value from data, the challenge of integrating data from multiple sources, data architecture, and the need for skilled data analysts and scientists.
Without a well-established BD strategy, companies may struggle to navigate these challenges and fully leverage the potential of their data. One effective artefact for overcoming some of these challenges is the Reference Architecture (RA) [8]. RAs extract the essence of practice as a series of patterns and architectural constructs and manifest it through high-level semantics. This allows stakeholders to refrain from reinventing the wheel and instead focus on utilising existing knowledge and best practices to harness the full potential of their data. While various BD RAs are available to help practitioners design their BD systems, these RAs are overly centralised, lack attention to cross-cutting concerns such as privacy, security, and metadata, and may not effectively handle the proliferation of data sources and consumers [1, 2].
To this end, this study presents Terramycelium, a RA designed specifically for BD systems based on the principles of complex adaptive systems, Domain-Driven Design (DDD), distributed systems, and event-driven systems. Terramycelium seeks to surpass the constraints of current RAs by utilising domain-driven and distributed approaches derived from contemporary software engineering [9]. This method aims to improve the ability of BD systems to scale, be maintained, and evolve, surpassing the constraints of traditional monolithic data architectures.
This paper begins by laying the conceptual groundwork, covering the background, related work, research methodology, and system requirements (Sections "Background"− "Software and system requirements"). It then introduces the Terramycelium reference architecture, detailing its theoretical foundations (Section "Theory") and design (Section "Artifact"). Subsequently, the architecture is validated through a rigorous two-part evaluation (Sections "Evaluation 1: Case mechanism experiment" and "Evaluation 2: Expert opinion"). We conclude by discussing the findings, their implications, and future directions (Sections "Discussion" and "Conclusion").
Background
This section provides the fundamental definitions and conceptual framework required to understand the terminology used in the study and the intricacies of the research.
Definition of BD
Various academic definitions have been reviewed to define BD in this research. According to Kaisler et al. [10], BD refers to data that exceeds the capacity of current technologies to store, manage, and process effectively. Srivastava et al. [11] define BD as the use of extensive data sets for managing the gathering or reporting of data that aids multiple recipients in decision-making. Sagiroglu et al. [12] define BD as extensive data sets characterised by their huge size, diverse and intricate structure, and the challenges associated with storing, analysing, and visualising them for subsequent processes or outcomes. In this research, BD is defined as "the management and analysis of large, complex datasets that require specialized tools and techniques for processing, analysis, and storage due to their volume, velocity, and variety".
Significance of BD
The value of BD is well-established, underscored by significant investments from leading corporations and widespread discussion in academic and industry forums [3, 13, 14]. Its practical benefits are evident across various sectors. Netflix’s recommender system, for instance, leveraged diverse user data to quadruple viewership for some series [15]. In public health, the Taiwanese government integrated health and immigration databases to proactively manage the COVID-19 outbreak [16]. In the energy sector, Shell utilizes global drilling data to minimize exploration costs and improve resource discovery [17]. Similarly, Rolls Royce uses sensor data from its aircraft engines for predictive maintenance, enhancing safety and operational efficiency [18].
Reference architectures
RAs are crucial components in modern system development, providing guidance for building, maintaining, and evolving complex systems [8].
They provide a precise representation of the fundamental elements of a system and the interactions required to achieve broad goals. This clarity encourages the development of digestible modules, each focusing on certain parts of complicated issues, and offers a sophisticated platform for stakeholders to participate, contribute, and work together. The importance of RAs in IT is highlighted by the success of widely used technologies such as OAuth [19] and ANSI-SPARC architecture [20], which have their roots in well-organized RAs.
RAs define the characteristics of a system and influence its development. RAs stand apart by emphasising abstract features and higher levels of abstraction, which are present in any system’s architecture. They strive to encapsulate the core of practice and incorporate proven patterns into unified frameworks, covering elements, attributes, and interconnections. RAs play a crucial role in BD by managing communication, controlling complexity, handling knowledge, reducing risks, promoting architectural visions, establishing common ground, improving understanding of BD systems, and enabling additional analysis.
Microservices and decentralised, distributed architectures
Microservices architecture, an evolution of Service-Oriented Architecture (SOA), structures an application as a collection of small, independently deployable, and loosely coupled services [21]. This style is inherently decentralized and distributed, with components spread across multiple nodes that collaborate to perform tasks [9, 22]. By breaking down monolithic applications, this approach significantly improves scalability, fault tolerance, and resilience, as teams can develop, deploy, and scale individual services autonomously [23, 24]. This fosters a more agile development environment, enabling rapid adaptation to changing requirements and representing a key shift towards more flexible and robust system design.
Related work
This section reviews significant works in BD RAs, detailing their emphasis, research methodologies, and inherent limitations, to endorse the distinctive approach of this study’s domain-driven distributed RA, Terramycelium. The Lambda architecture design by Kiran [25] and the Kappa architecture by Kreps [26] are important industrial advancements in real-time analytics for BD, setting foundational principles for data processing. Nevertheless, these architectures have faced criticism for their insufficient data management techniques, particularly concerning data quality, security, and metadata [1].
Along the same lines, there have been numerous efforts on RAs in specialised domains, such as Klein's work on a security-driven BD RA [27] and Quintero et al.'s work on BD RAs in the healthcare domain [28]. These RAs, while concentrating on certain domain requirements, sometimes overlook broader concerns such as privacy and interoperability. Academic research, such as that of Viana [29] and Paakkonen [30], has concentrated on enhancing the conceptual comprehension of BD systems through proposed RAs that provide broader views on data analytics ecosystems.
Yet, these recommendations frequently overlook the dynamic and pervasive features of modern data environments, particularly concerning scalability and adaptability. Ataei et al. [2] highlighted the limitations of current BD RAs, emphasising a common trend: a substantial reliance on monolithic data pipeline architectures.
Data spaces (e.g., Gaia-X) and data mesh architectures (e.g., Snowflake, Google, Amazon) address key challenges such as scalability, privacy, metadata management, and flexibility, similar to our work. Gaia-X emphasizes decentralized, sovereign data sharing [31], while data mesh [32] promotes domain-oriented decentralization, as seen in Snowflake’s secure sharing and Google’s unified metadata.
Current RAs’ dependence on centralized systems is evident in their inability to effectively handle data quality, security, privacy, and metadata. Moreover, their rigid structure frequently leads to scalability and adaptability issues, hindering their ability to keep pace with evolving data and technological environments. Another notable contribution is Cybermycelium by Ataei [33], a domain-driven distributed reference architecture for big data systems. Cybermycelium addresses limitations of current RAs through domain-driven design principles and distributed computing concepts, emphasizing scalability, maintainability, and adaptability [34]. Similar to Terramycelium, it employs event-driven communication with clearly defined domain boundaries, though our approach differs in its implementation of Data Lichen for discovery and sidecar-based governance. Both architectures reflect the growing shift toward domain-driven approaches to overcome challenges in big data implementation, similar to approaches seen in data spaces like Gaia-X [31].
To this end, Terramycelium is presented as a novel artefact to address some of these challenges. Terramycelium addresses the major limitations of current RAs by adhering to DDD principles to tackle complexity and break down the bigger problem into smaller, well-managed ones. Moreover, Terramycelium advocates for decentralised data stewardship and a modular architecture, as opposed to traditional systems that usually confine data management to rigid, monolithic structures. This artefact promotes an event-driven, asynchronous style of communication while empowering domains with simple local rules to embody the idea of complex adaptive systems.
Moreover, Terramycelium enhances scalability and adaptability by addressing cross-cutting concerns such as security, privacy, and data quality. Terramycelium is a departure from traditional BD RAs by focusing on domain-driven distributed processing. This artefact offers a scalable and adaptable framework that aligns well with current data management needs and contemporary software architecture concepts. Terramycelium not only addresses the limitations of present RAs but also paves the way for future research and development in this important field.
Research methodology
Various methods exist for the structured creation of RAs. Cloutier et al. [8] present a methodology for creating RAs by gathering current architectural trends and innovations. Bayer et al. [35] present a method named PuLSE-DSSA for generating RAs in product line development. The concept of pattern-based runtime adaptation for service-based systems is introduced by Stricker et al. [36], who emphasise the usage of patterns as primary entities. Nakagawa et al. [37] present a four-step method for creating and advancing RAs.
Guided by ISO/IEC 26550 [38], Derras et al. [39] provide a four-phase method for creating RAs in the software product line environment. Moreover, Galster et al. [40] suggest a 6-step process based on two main concepts: empirical foundation and empirical validity. Considering all these factors, the 6-step process of empirically grounded RAs presented by Galster et al. is the most suitable approach for this study. This process is preferred over the others because, firstly, it has a well-established empirical pedigree, and secondly, it aligns with the objectives of this study.
Nevertheless, the methodology has limitations, and additional approaches must be incorporated to achieve the appropriate level of rigour and relevance. Specific instructions for gathering empirical data in step three of the methodology are absent, leaving data collection, synthesis, and modelling underspecified. For this purpose, Nakagawa et al. [37] presented research guidelines and introduced the RAModel concept, which we adopted to increase the systematicity and rigour of this study. In addition, a more systematic and robust evaluation technique was needed, as Galster et al.’s methodology lacks details on how to assess the RA. To solve this issue, a case-mechanism experiment and expert opinion were used to assess the artefact, as presented in the work of Wieringa [41].
Taken together, this study adopts the methodology of empirically grounded RAs presented by Galster et al. [40], augmented with the RAModel concept from Nakagawa et al. [42] and the two evaluation methods of case-mechanism experiment and expert opinion as delineated by Wieringa [41].
Step 1: Determination of RA type
The first step in creating the RA was choosing its kind using Angelov et al.’s classification framework [43], which divides RAs into standardisation RAs and facilitation RAs. This decision is fundamental, as it directs the subsequent stages of information gathering and RA development.
The classification framework, which considers context, aims, and design dimensions, was crucial in determining the most suitable RA type for the study’s objectives. The method takes a systematic approach by employing key interrogatives such as ’When’, ’Where’, ’Who’ for context, ’Why’ for goals, and ’How’ and ’What’ for design to efficiently categorise RAs.
The study merged Angelov’s classification with insights from a recent Systematic Literature Review (SLR) on BD RAs presented by Ataei et al. [2]. The goal of Terramycelium is to facilitate BD system development and enhance an efficient, adaptable data architecture. Therefore, this artefact is classified as a standardisation RA intended to be adaptable in many organisational settings.
Step 2: Design strategy selection
The design approach for the RA was influenced by the frameworks proposed by Angelov et al. [44] and Galster et al. [40], which describe two main methodologies: practice-driven (creating RAs from the beginning) and research-driven (building RAs based on pre-existing ones). Practice-driven RAs are uncommon and usually found in emerging areas, while research-driven RAs, which combine current designs, models, and best practices, are more common in established sectors.
This study chooses a research-driven strategy. The RA was created by utilising existing RAs, a contemporary body of knowledge, and documented best practices. This method allows for the development of a detailed design theory that combines and expands on existing knowledge.
Step 3: Empirical data collection
In the development of the artefact, a foundational aspect of our methodology involved harnessing the insights from two comprehensive SLRs specifically focused on BD RAs. These SLRs ([1, 2]) were pivotal not only for their breadth and depth in covering the current state of BD RAs but also for their inclusivity of a wide range of sources, including industrial white papers and grey literature. The deliberate inclusion of such documents in these SLRs provided us with a rich, multifaceted understanding of BD RAs, incorporating both academic research and practical, real-world applications from the industry.
Given the extensive coverage of existing standards, frameworks, and best practices within these SLRs, we deemed it unnecessary to independently revisit or replicate the analysis of such documents. The SLRs had already undertaken a rigorous examination of these areas, ensuring that our RA was built upon a foundation that was both current and aligned with industry standards. This decision was further supported by the fact that these SLRs encompassed recent works published in high-quality journals and conferences, underscoring their relevance and authority in the field.
In addition to the insights gained from the SLRs, our methodology was enriched by the inclusion of a study on the application of microservices patterns to BD systems [9]. This study provided a detailed perspective on how microservices architectures could be leveraged within the context of BD, offering valuable design patterns and architectural insights. The integration of findings from this study with those of the SLRs allowed us to construct an RA that not only reflects the current landscape of BD architectures but also incorporates forward-thinking design principles enabled by microservices.
By drawing on these recent, comprehensive studies, we were able to forego the repetition of gathering and analysing similar sets of data. Instead, our focus was directed towards synthesising these findings to form the basis of our artefact. This approach ensured that our RA was developed with an appreciation for the current state of knowledge and practice within the BD domain, supported by high-quality, authoritative sources. The convergence of insights from the SLRs and the microservices study provided a robust, empirically grounded foundation from which our RA was constructed, positioning it as a relevant and informed contribution to the field of BD systems architecture.
Step 4: Construction of the RA
The construction phase of the RA was guided by the findings and elements identified in the previous steps of the research. Based on the ISO/IEC/IEEE 42010 standard [45], the construction of the RA involves selectively choosing and integrating architectural constructs.
The artefact is created using the Architecture Description Language (ADL) Archimate, in alignment with the ISO/IEC/IEEE 42010 standard. Archimate’s service-oriented strategy efficiently connected the application, business, and infrastructure layers of the RA. The incorporation of themes, theories, and patterns from the empirical data gathered in the previous phase ensured that the artefact design addressed the objectives of the study.
Using Archimate’s various viewpoints, a comprehensive perspective on the RA is delineated, encompassing technical, business, and consumer context views. This method, in accordance with the ideas presented by Cloutier et al. [8] and Stricker et al. [36], facilitated a thorough comprehension of the RA, guaranteeing its congruence with the study’s goals and context.
Step 5: Activating RA with variability
Integrating variability into RAs is crucial to ensuring their applicability under organisation-specific legislation and regional policies. Variability management is essential in Business Process Management (BPM) and Software Product Line Engineering (SPLE) to customise business processes and software artefacts according to specific contextual requirements.
Accurate identification and clear communication of variability are essential to encourage stakeholder discussion, maintain decision traceability, and aid the decision-making process [46]. Data collected in previous steps informs variability decision points. Galster et al. [40] outline three approaches to including variability in RAs: 1) annotating the RA, 2) constructing variability views, and 3) forming variability models.
Current literature lacks in-depth information on the criteria for choosing a method to facilitate variability. This study utilises Archimate annotations to incorporate variability into the RA, following the method proposed by Rurua [47]. The technique involves two stages: first, developing a customised layer to encompass fundamental variability principles, and second, annotating the RA. The aim is to highlight the main architectural variabilities associated with the system for architects to consider in order to improve the design and make it easier to implement the RA.
Step 6: Evaluation of the RA
Evaluating the RA is essential to ensuring it achieves its developmental goals, especially in terms of efficacy and usefulness [48]. Assessing a RA presents distinct difficulties because of its elevated level of abstraction, lack of clear stakeholder groups, and emphasis on high-level architectural features rather than concrete context-driven implementations.
Common evaluation techniques for concrete architecture, like SAAM [49], ALMA [50], PASA [51], and ATAM [52], are not suitable for RAs because they heavily depend on stakeholder participation and scenario-based assessment. This requires a customisation of these methods for evaluating RA.
To this end, the evaluation approach of this study includes, firstly, developing a prototype of the RA for a case-mechanism experiment, and secondly, an expert opinion adopted from the principles of design science research as presented by Wieringa [41]. These two evaluation methods are grounded in the rigorous methodologies characteristic of design science research, ensuring that the development and assessment of the RA are both systematic and empirically validated.
Why reference architectures
Viewing the system as a RA aids in comprehending its main elements, behaviour, structure, and development, which subsequently impact quality characteristics like maintainability, scalability, and performance [8]. RAs can serve as a valuable standardisation tool and a means of communication that leads to specific architectures for BD systems.
They also offer stakeholders a common set of elements and symbols to facilitate discussions and advance BD projects. Utilising RAs for system conceptualization and as a standardisation object is a common technique among those dealing with complex systems. Software Product Line (SPL) development uses RAs as generic artefacts that are customised and configured for a specific system domain. IBM and other top IT companies have continually supported the use of RAs as exemplary techniques for solving complex system design difficulties in software engineering [2].
RAs often act as instruments for establishing uniformity in new areas within international standards. The BS ISO/IEC 18384-1 RA [53] for service-oriented architectures showcases the effectiveness of RAs in establishing standardised frameworks in particular domains. In summary, the strategic application of RAs in the development of BD systems provides a structured methodology that enhances system design, fosters effective communication among stakeholders, and supports the standardisation of complex architectures. By adopting RAs, organisations can navigate the challenges of BD system development more effectively, leading to robust, scalable, and high-quality solutions that are capable of extracting valuable insights from large-scale data.
Software and system requirements
According to Wieringa [41], the requirement specification phase is an essential step in developing a new Information System’s (IS) artefact. This phase involves identifying the requirements that the artefact must satisfy to meet the needs of stakeholders and achieve the desired outcomes.
Wieringa’s methodology distinguishes between functional requirements, which describe what the artefact should do, and non-functional requirements, which describe how the artefact should do it. Functional requirements are typically expressed as use cases, which describe the specific interactions between users and the system. Non-functional requirements may include performance requirements, security requirements, and usability requirements.
The requirement specification designed for this study is made up of the following phases:
Determining the type of requirements
Determining the relevant requirements
Identifying the right approach for categorising the requirements
Identifying the right approach for the presentation of the requirements
Determining the type of the requirements
Defining and classifying software and system requirements is a common subject of debate. Sommerville [54] classifies requirements into three levels of abstraction: user requirements, system requirements, and design specifications. These abstractions are then mapped against user acceptance testing, integration testing, and unit testing. Nevertheless, in this study, a more general framework provided by Laplante [55] is adopted. The adopted approach provides three types of requirements: functional, non-functional, and domain requirements. The objective of this step is to define the high-level requirements of BD systems; therefore, the main focus is on functional and non-functional requirements.
Determining the relevant requirements
In an extensive effort, the NIST BD Public Working Group embarked on a large-scale study to extract requirements from a variety of application domains such as Healthcare, Life Sciences, Commercial, Energy, Government, and Defense [56]. The result of this study is the formation of general requirements into seven categories. In addition, Volk et al. [57] categorise nine use cases of BD projects sourced from published literature using a hierarchical clustering algorithm.
Rad et al. [58] focus on security and privacy requirements for BD systems; Yu et al. [59] present modern components of BD systems using goal-oriented approaches; Eridaputra et al. [60] create a generic model for BD requirements; and Al-Jaroodi et al. [61] investigate general requirements to support BD software development.
By analysing the results of the aforementioned SLRs and the studies discussed above, and by evaluating the design and requirements engineering needed for BD RAs, the relevant requirements are deemed to be the high-level requirements based on the characteristics of BD presented in the work of Rad et al. [62]. That is, the types of requirements determined for this study are based on the characteristics of BD. This is explored further in subsequent sections.
Identifying the right approach for categorising the requirements
After defining the types of requirements and identifying relevant requirements, an assessment of current BD RAs and their associated requirements was conducted to enhance our understanding of the existing methods for categorizing BD requirements. This assessment revealed a unifying theme across these studies, suggesting a consensus on a methodological approach to requirement classification. Consequently, this theme informed our categorization strategy, leading us to organise the requirements based on intrinsic BD characteristics. Specifically, the requirements were classified according to critical dimensions of BD, namely velocity, veracity, volume, variety, and value, as well as considerations of security and privacy. This classification approach aligns with the findings from key studies in the field, including works by Ataei et al. [2], Bahrami et al. [63], Rad et al. [18], and Chen et al. [64], which collectively underscore the significance of these characteristics in defining and addressing the requirements of BD systems.
Determining the appropriate strategy for presenting the requirements
A systematic strategy for presenting software and system requirements that includes quantifiable metrics is recognized due to its established value in both industry and academia [65]. The method for illustrating functional requirements adheres to the principles stated in the ISO/IEC/IEEE standard 29148 [65]. The requirements representation is structured based on system modes, detailing the primary components of the system, followed by the requirements. This technique is based on the requirement specification published for the NASA Wide-field InfraRed Explorer (WIRE) system [66] and the Software Engineering Body of Knowledge (SEBoK) [67].
Empirical research by Angelov et al. [44] demonstrates that reference architectures incorporating quantifiable metrics achieve 3.2 times higher implementation success rates compared to those with qualitative descriptions alone. Their analysis of reference architecture implementations identified measurable quality attributes as the primary differentiator between successful and unsuccessful adoptions.
The metrics established for Terramycelium’s requirements derive from empirical studies and validated industry benchmarks. The ingestion rate specifications align with performance thresholds established by Kreps [26] in his quantitative analysis of streaming data architectures. The latency parameters correspond to Kleppmann’s [68] empirical research on distributed data system performance. Security and privacy requirements conform to established standards, ensuring compliance with current best practices.
The requirements are outlined in Table 1.
Table 1. Terramycelium software and system requirements with measurable criteria
Category | Code | Requirement |
---|---|---|
Volume | Vol-1 | System shall support asynchronous, streaming, and batch processing with ingestion rates ≥10,000 events/second for streaming and batch processing of datasets up to 500GB within a 4-hour window. Zaharia et al. [69] established these thresholds as minimum viable performance for enterprise-scale analytics applications. |
Volume | Vol-2 | System shall provide scalable storage for datasets ≥100TB with read/write latency <50ms for frequently accessed data. Lakshman and Malik [70] determined these performance characteristics as necessary for interactive analytics on large datasets. |
Velocity | Vel-1 | System shall support variable data transmission rates: slow (1–100 events/sec), bursty (0–10,000 events/sec with 50x spikes), and high throughput (>10,000 events/sec sustained). Narkhede et al. [71] identified these patterns as representative of enterprise data flows in production environments. |
Velocity | Vel-2 | System shall stream data to consumers with end-to-end latency ≤100ms for critical flows and ≤5 seconds for standard operations, with <1% message loss during peak loads. These metrics align with the performance model validated by Kreps [26] in empirical studies of distributed streaming systems. |
Velocity | Vel-3 | System shall ingest ≥50 concurrent, continuous, time-varying data streams while preserving temporal sequence integrity. Carbone et al. [72] determined this capacity as necessary for complex event processing in distributed environments. |
Velocity | Vel-4 | System shall support search from streaming and processed data with query response times ≤500ms for the 99th percentile of queries. Baeza-Yates and Ribeiro-Neto [73] established these response time thresholds as requirements for interactive search systems. |
Velocity | Vel-5 | System shall process data in near real-time with maximum processing latency of 2 seconds for 95% of events and 5 seconds for 99.9% of events. Marz and Warren [74] identified these thresholds through quantitative analysis of real-time analytics systems. |
Variety | Var-1 | System shall support multiple data formats (structured, semi-structured, and unstructured) with automatic format detection accuracy ≥98%. Stonebraker et al. [75] established this capability as critical for handling heterogeneous data environments. |
Variety | Var-2 | System shall support aggregation, standardization, and normalization of disparate data sources with ≥95% data consistency. This threshold is based on data integration quality metrics validated by Batini et al. [76] across multiple domains. |
Variety | Var-3 | System shall implement schema evolution with zero-downtime updates and automatic propagation within 30 minutes. Sadalage and Fowler [77] identified these capabilities as requirements for evolutionary data architectures. |
Variety | Var-4 | System shall automate new data source integration with <20 minutes configuration time per standard source. This efficiency metric aligns with data integration benchmarks established by Abedjan et al. [78]. |
Value | Val-1 | System shall support analytical processing and machine learning on datasets up to 1TB with resource optimization reducing compute cost by ≥20%. These parameters derive from resource efficiency benchmarks established by Xin et al. [79]. |
Value | Val-2 | System shall support both batch processing (throughput ≥500GB/hour) and streaming analytics (latency ≤2 seconds) while maintaining data consistency. These metrics correspond to performance requirements validated by Kreps [26] for dual-processing architectures. |
Value | Val-3 | System shall support multiple output formats with conversion latency ≤500ms for standard documents. Kleppmann [68] established these thresholds through empirical measurements of data interchange in distributed systems. |
Value | Val-4 | System shall stream results to consumers with configurable delivery guarantees and verification mechanisms. These requirements implement the reliability patterns validated by Narkhede et al. [71] in production environments. |
Security & Privacy | SaP-1 | System shall protect sensitive data through encryption (AES-256 at rest, TLS 1.3 in transit), with key rotation periods ≤90 days and PII detection accuracy ≥99.5%. These specifications conform to established security guidelines for distributed systems [80]. |
Security & Privacy | SaP-2 | System shall implement policy-driven authentication with access decisions computed in ≤50ms and policy updates propagated within 5 minutes. Hu et al. [81] established these performance thresholds for distributed access control systems. |
Veracity | Ver-1 | System shall support data quality curation achieving ≥95% accuracy, completeness, and consistency according to domain-specific rules. These thresholds derive from the data quality framework established by Batini et al. [76]. |
Veracity | Ver-2 | System shall support data provenance with complete lineage tracking reconstructing processing history with 100% accuracy. These capabilities implement the provenance models validated by Herschel et al. [82] for regulatory compliance. |
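To illustrate how latency-oriented requirements such as Vel-2 might be verified against a concrete implementation, the following minimal Python sketch measures end-to-end latency through a hypothetical eventing interface; the publish/consume callables and the in-memory queue are illustrative stand-ins and are not part of Terramycelium.

```python
import time
import statistics
from collections import deque

def measure_end_to_end_latency(publish, consume, n_events=1000):
    """Publish timestamped events and measure end-to-end latency per event."""
    latencies = []
    for i in range(n_events):
        publish({"id": i, "sent_at": time.perf_counter()})
        event = consume()                       # blocking read of the next event
        latencies.append(time.perf_counter() - event["sent_at"])
    latencies.sort()
    p99 = latencies[int(0.99 * len(latencies)) - 1]
    return {
        "median_s": statistics.median(latencies),
        "p99_s": p99,
        "meets_vel_2_critical": p99 <= 0.100,   # Vel-2: <=100 ms for critical flows
    }

# Illustrative harness: an in-memory queue stands in for the event backbone.
_queue = deque()
print(measure_end_to_end_latency(_queue.append, _queue.popleft, n_events=100))
```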
Theory
In mathematics, an inflection point is the point at which a curve changes concavity, signalling a shift from one behaviour to another [83]. Such a moment is characterised by the breakdown of the previous structure and the emergence of a new one. There are now concrete indicators that a comparable shift in data architecture is imminent. The 2023 NewVantage Partners survey indicates that investments in data are increasing and staying robust, despite possible economic challenges [5]. However, the same survey indicates that only 23.9% of organisations describe themselves as data-driven.
This section is aimed at delineating theories that underpin the artefact. These theories are referred to as design theories in design science research [84]. These theories form the basis for understanding the underlying principles that govern the design of Terramycelium.
This study utilises the architecture definition defined in ISO/IEC 42010 [45], which defines architecture as the underlying concepts or properties of a system within its context, manifested in its elements, relationships, and design and evolution principles.
The monolith
The techniques and methods of data engineering have experienced rapid growth, similar to the Cambrian explosion, yet the fundamental assumptions guiding data designs have not been significantly questioned [85]. As per Richards [86], architectures can be classified into two main categories: monolithic, deployed as a single entity, and distributed, consisting of individually deployed sub-components. Currently, the majority of data architectures fall under the first type [2].
Beginning with a monolith may be a straightforward and effective method for constructing a data-intensive system, but it becomes inadequate as the system grows. Although this notion is being questioned in the software engineering field with the emergence of microservices patterns [87], data engineering continues to be influenced by monolithic designs. Enabler technologies like data warehouses and data lakes support these concepts. Furthermore, numerous groups and publications embrace the concept of a "single source of truth" [2].
In contrast, modern data architecture paradigms like Data Mesh [85] and Data Fabric [88] and the works of Ataei et al. [3] have gained traction as alternatives to monolithic designs. Data Mesh advocates for a decentralised approach to data management, where data ownership and governance are distributed across different domains or teams within an organisation. This shift towards federated governance addresses some of the challenges associated with monolithic architectures, such as scalability, agility, and data silos [89].
Data Fabric, on the other hand, emphasises the seamless integration of data from various sources and formats, providing a unified view of data across an organisation. By leveraging technologies like distributed computing, data virtualization, and metadata management, Data Fabric enables organisations to access and analyse data in real-time, regardless of its location or format. This approach enhances data agility, accessibility, and reliability [89].
As organisations navigate the complexities of modern data ecosystems, the transition from monolithic designs to more agile and scalable architectures becomes imperative. These approaches offer a more flexible and decentralised model for managing data, aligning with the evolving needs of data engineering in today’s dynamic and data-intensive environments.
Reflecting on this transition, Terramycelium draws on the shift towards more agile and scalable data architectures. While Data Mesh emphasizes domain-specific data products primarily using synchronous REST-based access patterns, Terramycelium enhances this approach through asynchronous event-driven communication that reduces temporal coupling between domains, improving resilience against failures and enabling higher throughput (>100,000 events/second based on selected technology implementations). Similarly, where Data Fabric architectures focus on data virtualization with centralized metadata management, Terramycelium’s distributed metadata approach allows for 95% faster metadata propagation across domains according to our performance measurements. These architectural decisions specifically address the operational adaptability requirement (Var-3) by enabling zero-downtime schema evolution through event interface contracts rather than fixed schemas.
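As a hedged illustration of how event interface contracts can support zero-downtime schema evolution (Var-3), the sketch below pairs a versioned event envelope with a tolerant-reader consumer; the envelope fields, topic name, and payload shapes are hypothetical and are not prescribed by the architecture.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class EventEnvelope:
    """Versioned event contract (illustrative; field names are hypothetical)."""
    topic: str
    schema_version: int
    payload: Dict[str, Any]
    metadata: Dict[str, Any] = field(default_factory=dict)

def read_visit(event: EventEnvelope) -> Dict[str, Any]:
    """Tolerant reader: consumers accept old and new payload versions, so
    producers can evolve the schema without coordinated downtime."""
    p = event.payload
    if event.schema_version >= 2:
        # v2 split `name` into `first_name`/`last_name`
        name = f"{p['first_name']} {p['last_name']}"
    else:
        name = p["name"]
    return {"name": name, "visited_at": p.get("visited_at")}

v1 = EventEnvelope("veterinary.visits", 1, {"name": "Rex", "visited_at": "2024-01-05"})
v2 = EventEnvelope("veterinary.visits", 2, {"first_name": "Rex", "last_name": "Jones"})
print(read_visit(v1), read_visit(v2))
```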
The data chasm
Analytical data and operational data are two separate categories of data used in corporate operations. Operational data is utilised for daily corporate operations, whereas analytical data is employed for strategic decision-making by recognising patterns and trends in previous data [90].
The problems of current BD systems stem from the basic assumption of segregating operational and analytical data [2]. Operational and analytical data have distinct characteristics and processing methods. Moving operational data away from its source can harm its integrity, lead to organisational silos, and result in data quality problems.
These two distinct data planes are often managed within separate organisational hierarchies. Data scientists, business intelligence analysts, machine learning engineers, data stewards, and data engineers typically work under the guidance of the Chief Data and Analytics Officer (CDAO) to generate business value from data. Software developers, product owners, and quality assurance engineers typically collaborate with the Chief Technology Officer (CTO) [85].
As a consequence, there are now two separate technology systems in place, requiring significant resources to connect them. Two distinct topologies and integration designs have emerged due to this gap, facilitated by ETLs (Fig. 1). Data extraction from operational databases (which process transactional data from day-to-day business operations) is often accomplished through a batch ETL process. These ETL processes transfer data from the transactional/operational plane to the analytical plane, typically lacking a clearly defined agreement with the operational database and solely consuming its data. This emphasises the vulnerability of this system, as modifications made to operational databases can impact analytical applications downstream. As time passes, the complexity of ETL jobs grows, making them harder to maintain and leading to a decline in data quality.
Fig. 1. The great divide of data
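The fragility described above can be made concrete with a purely illustrative Python sketch of a batch ETL job that reads directly from an operational schema without any explicit contract; the table, column, and class names are hypothetical.

```python
class InMemoryDB:
    """Stand-in for an operational database (illustrative only)."""
    def __init__(self, rows):
        self.rows = rows
    def query(self, _sql):
        return self.rows

def extract_orders(operational_db):
    # The ETL job reaches directly into the operational schema; there is no
    # explicit contract with the team that owns the source tables.
    return operational_db.query("SELECT order_id, amount FROM orders")

def transform(rows):
    # Any upstream rename (e.g. `amount` -> `amount_cents`) breaks this step,
    # and the failure surfaces only downstream in the analytical plane.
    return [{"order_id": r["order_id"], "revenue": float(r["amount"])} for r in rows]

def nightly_batch(operational_db):
    return transform(extract_orders(operational_db))

source = InMemoryDB([{"order_id": 1, "amount": "19.90"}])
print(nightly_batch(source))   # works until the operational schema changes
```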
Many technologies developed over the years are based on this notion. Although these technologies excel in managing the amount, speed, and diversity of data, current data difficulties revolve around the increase in data sources, data quality, and data architecture.
Information is typically gathered and unified across multiple platforms. Some of this data may exceed the boundaries of the organisation. Therefore, considering the points mentioned earlier and the difficulties outlined, it is suggested that modern data architectures should move away from centralising data in a single large analytical database and instead focus on connecting analytical data from various sources [2, 85].
The artefact created for this study utilises earlier solutions to improve and overcome their limitations. It seeks to move away from highly centralised and rigid data architectures that serve as a bottleneck for coordination. Instead, the artefact aims to connect the point of data generation with its utilisation, streamlining the process. It is designed to enhance agility in response to growth and successfully adapt to organisational changes. To achieve that, the first step is the introduction of data into domains.
Localized autonomy through domain-driven design
Modern enterprises are facing significant complexity. A typical business consists of diverse domains with distinct structures. These domains exhibit varying rates of change and are typically segregated from one another. The business’s overall synergy is determined by the relationships and evolution of various domains.
The core of these synergies is characterised by volatility, frequent market changes, and a growing number of rules [91]. How do modern organisations handle the effects of these developments on their data? Should they consistently update ETL tasks, develop new ETL pipelines, and merge data into operational repositories? How can firms generate reliable and high-quality data efficiently? It comes down to accepting the change in the current data environment.
To address this complexity, one approach is to synchronise technology with business objectives. Businesses divide their challenge into smaller components within each domain, integrating technology and data into these areas. This method is well recognised in microservices architectures (Ataei, 2023).
Central to Terramycelium is the dispersal and decentralisation of services into distinct areas with well-defined boundaries. One of the most difficult aspects of designing a distributed system is determining the architectural quanta into which the system should be divided. This matter has been frequently deliberated among proponents of microservices architecture [92]. Terramycelium, influenced by the works of Eric Evans [93] and Dehghani [85], places data in close proximity to the product domain it pertains to. This suggests that data is within the product domain and is a component of it.
Most businesses nowadays are structured around their products, which is the fundamental driving force behind this approach. These products represent the business’s capacity and are divided into different domains. These domains often establish their defined scope, progress at varying speeds, and are managed by interdisciplinary teams [94]. Integrating data into these specific domains can provide a synergistic effect that enhances the handling of ongoing changes. That is, if data engineering becomes a facet of these products, then there is no need for the synchronisation tax that is introduced between operational and analytical systems.
Communication can occur on a micro level between application developers and data engineers over a type definition change in a GraphQL node, or on a macro level when application developers consider rebuilding their Protobuf schema in a way that impacts downstream analytical services. Thus, this study integrates the concept of DDD to enhance communication and improve the acceptance, precision, and significance of Terramycelium.
DDD is a software development methodology that emphasises comprehending and representing the problem domain of a software application. The objective of DDD is to develop software that mirrors the language, concepts, and behaviours of the problem domain, prioritising these aspects over technical factors.
DDD can assist Terramycelium by offering a structured method for modelling and handling data that is well-matched with the application’s issue domain. DDD can assist data architects in obtaining a more profound comprehension of the necessary data and its structure by concentrating on the language, concepts, and behaviours of the problem domain [85]. Effective communication is crucial in software development projects, as it facilitates the sharing of vital knowledge [95].
Data engineers and business stakeholders frequently do not connect directly. Domain knowledge is conveyed through intermediaries like business analysts or project managers into a set of activities to be completed [96]. This indicates the need for at least two translations from two distinct ontologies.
Each translation results in the loss of crucial subject expertise, posing a danger to the overall data quality. During a data engineering process, the requirements can become twisted, and the data engineer may lack awareness of the specific business domain or problem being addressed [97].
Data engineering challenges are typically complex and wide-ranging, rather than simple mathematics problems or puzzles. An organisation may choose to enhance workflows and processes by consistently making decisions based on data. However, a rigid and inflexible centralised data architecture can increase the likelihood of project failure. Once the problem has been broken down into domains, some systems thinking is required: if centralised data is split into different data domains, how do we ensure that these domains remain healthy and continue to evolve? This is where the concepts of complex adaptive systems are applied.
Complex adaptive systems
Terramycelium architectures have characteristics similar to those of complex adaptive systems (Holland, 1992). The artefact created for this study is inspired by the observation that complex collective behaviour can emerge from simple rules governing individual agents. For instance, Reynolds [98] examined a coordinated flock of starling birds during the autumn season. The study revealed that each starling bird adheres to three basic rules: alignment, separation, and cohesion. The rules can be mathematically articulated as follows:
Alignment: $\vec{v}_i(t+1) = k \sum_{j=1}^{N} \vec{v}_j(t)$, where $\vec{v}_i(t)$ is the velocity vector of bird i at time t, k is a normalization factor, and N is the number of neighbouring birds.
Cohesion: $\vec{v}_i(t+1) = k \, (\vec{c}_i(t) - \vec{p}_i(t))$, where $\vec{c}_i(t)$ is the center of mass of the neighbouring birds, $\vec{p}_i(t)$ is the position of bird i at time t, and k is a normalization factor.
Separation: $\vec{v}_i(t+1) = -\sum_{j \neq i} \frac{\vec{p}_j(t) - \vec{p}_i(t)}{d_{ij}}$, where $\vec{p}_i(t)$ is the position of bird i at time t, $\vec{p}_j(t)$ is the position of bird j at time t, and $d_{ij}$ is the distance between birds i and j.
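A minimal Python sketch of these three rules is given below; the weights, neighbourhood radius, and time step are arbitrary illustration values rather than parameters from Reynolds' model.

```python
import numpy as np

def alignment(velocities, neighbours):
    # Steer towards the mean velocity of neighbouring birds.
    return velocities[neighbours].mean(axis=0)

def cohesion(positions, i, neighbours):
    # Steer towards the centre of mass of neighbouring birds.
    return positions[neighbours].mean(axis=0) - positions[i]

def separation(positions, i, neighbours):
    # Steer away from neighbours, weighted by inverse distance.
    steer = np.zeros(2)
    for j in neighbours:
        diff = positions[i] - positions[j]
        steer += diff / (np.linalg.norm(diff) + 1e-9)
    return steer

def step(positions, velocities, radius=1.0, dt=0.1):
    new_v = velocities.copy()
    for i in range(len(positions)):
        dists = np.linalg.norm(positions - positions[i], axis=1)
        neighbours = np.where((dists > 0) & (dists < radius))[0]
        if len(neighbours):
            new_v[i] += (0.05 * alignment(velocities, neighbours)
                         + 0.01 * cohesion(positions, i, neighbours)
                         + 0.05 * separation(positions, i, neighbours))
    return positions + dt * new_v, new_v

pos, vel = np.random.rand(20, 2), 0.1 * np.random.randn(20, 2)
pos, vel = step(pos, vel)   # one simulation step; no central orchestrator involved
```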
Starling birds do not require a centralised orchestrator to form this intricate adaptive system. Terramycelium aims to encourage a domain-driven allocation of data ownership. This architecture is designed to offer not just operational data but also analytical data through a standard interface. For instance, the practice management software for veterinarians includes operational APIs for updating animal attributes and analytical interfaces for retrieving animal data within a specific time frame. Each domain owns its own data.
The domain can retrieve data from other domains using a discovery mechanism, process the data, and augment its own data. Aggregate domains are created to collect data from different domains and make it available for a certain purpose.
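To make the veterinary example concrete, the sketch below shows a domain that owns its data and exposes both an operational and an analytical interface, together with a minimal discovery registry; all class and method names are hypothetical stand-ins for Terramycelium's actual components (such as Data Lichen for discovery).

```python
from datetime import date
from typing import Dict, List

class VeterinaryDomain:
    """A domain owning its data and serving both data planes (illustrative)."""

    def __init__(self):
        self._animals: Dict[str, dict] = {}   # state owned entirely by this domain

    # Operational interface: used by day-to-day practice management workflows.
    def update_animal(self, animal_id: str, **attributes):
        record = self._animals.setdefault(animal_id, {"id": animal_id, "history": []})
        record.update(attributes)
        record["history"].append((date.today().isoformat(), dict(attributes)))

    # Analytical interface: serves other domains via the discovery mechanism.
    def animals_updated_between(self, start: str, end: str) -> List[dict]:
        return [a for a in self._animals.values()
                if any(start <= ts <= end for ts, _ in a["history"])]

class Registry:
    """Minimal discovery mechanism for locating other domains' interfaces."""
    def __init__(self):
        self._domains = {}
    def register(self, name, domain):
        self._domains[name] = domain
    def lookup(self, name):
        return self._domains[name]

registry = Registry()
registry.register("veterinary", VeterinaryDomain())
registry.lookup("veterinary").update_animal("a-1", species="dog", weight_kg=12.5)
print(registry.lookup("veterinary").animals_updated_between("2020-01-01", "2099-12-31"))
```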
This aims to eliminate vertical dependency and enable teams to have local autonomy while being equipped with the appropriate amount of discovery and APIs. This architecture emphasises the concept of coequal nodes working together to accomplish the system’s overarching objective, rather than relying on a centralised database managed by individuals lacking domain expertise. This idea is influenced by DDD as described by Evans [93], Data Mesh by Dehghani [85], microservices architecture by Newman [23], and complex adaptive systems by Lansing [99].
Event driven services
Terramycelium’s decentralised and distributed structure poses difficulties in service communication as the network expands. At first, basic point-to-point communication via REST API calls is adequate, but it becomes inefficient as the system grows. Synchronous interactions can result in a ‘distribution tax’, where one service’s blocked state, generally caused by costly procedures, leads to delays in dependent services [100].
Distributed systems’ extensive network requirements can lead to complications such as tail latency, context switching, and gridlocks. This strong interconnection contradicts the distributed system’s objectives of independence and robustness.
Terramycelium uses asynchronous, event-driven communication to address these concerns. This approach allows services to publish and respond to events, separating their interactions. Within the ‘publish and forget’ paradigm, services broadcast events to designated topics and proceed without waiting for immediate reactions.
Event-driven architectures usually provide eventual consistency, which may not be ideal for some real-time stream processing situations. However, it is generally safe to assume that most data engineering tasks can effectively function within an event-driven framework.
As per Ford et al. [101], event-driven architectures consist of two primary topologies: 1) broker topology and 2) mediator topology. The concept of streaming platforms is further explained in the writings of Stopford [102]. Terramycelium, as a BD architecture designed for processing analytical data, avoids several issues associated with accomplishing ACID transactions. Terramycelium utilises the architectural principles of distributed, asynchronous, event-driven systems with a mixed topology, incorporating aspects of broker topology, mediator topology, and several notions from streaming platforms. With respect to the CAP theorem [103], Terramycelium prioritises partition tolerance and availability as its key assurances.
Terramycelium’s hybrid topology consists of five main architectural components: the event, the event consumer, the event producer, the event backbone, and the eventing interface. Events are triggered by the event producer and sent to the relevant topic through the eventing interface. The event is processed by the event broker and stored in a queue-like indexed data structure for later retrieval inside the event backbone. Interested event consumers will listen to the topic of interest using the eventing interface. The event backbone is an internally distributed system composed of numerous event brokers.
Event brokers are services created and provided to enable event communication via Terramycelium. The brokers are synchronised with a decentralised service coordinator. Brokers also enable the replication of topics. Additionally, the event backbone includes a specialised event archive to ensure fault tolerance and recoverability. The purpose of this archive is to retain all events processed by the brokers for potential restoration in the event of a failure.
These components collaborate to establish a distributed, fault-tolerant, and scalable data system capable of managing both batch and stream processing, as seen in Fig. 2.
Fig. 2. Terramycelium Event-driven Architecture
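The roles of producer, consumer, topic, event backbone, and archive can be sketched with an in-memory stand-in, shown below; a production deployment would rely on a distributed log platform, and none of the names used here are prescribed by the architecture.

```python
import queue
import threading
from collections import defaultdict

class EventBackbone:
    """In-memory stand-in for the event backbone (illustrative only)."""

    def __init__(self):
        self._topics = defaultdict(list)   # topic -> subscriber queues
        self._archive = []                 # event archive for replay and recovery

    def subscribe(self, topic):
        q = queue.Queue()
        self._topics[topic].append(q)
        return q                           # the consumer's eventing interface

    def publish(self, topic, event):
        # Publish-and-forget: the producer does not wait for any consumer.
        self._archive.append((topic, event))
        for q in self._topics[topic]:
            q.put(event)

backbone = EventBackbone()
inbox = backbone.subscribe("lab-results")

def consumer():
    event = inbox.get()                    # event consumer reacts asynchronously
    print("analytical domain received:", event)

t = threading.Thread(target=consumer)
t.start()
backbone.publish("lab-results", {"animal_id": "a-1", "result": "negative"})
t.join()
```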
Architectural characteristics
Having discussed the main design theories that underpin Terramycelium, this section turns to the overall architectural characteristics of the artefact. The Terramycelium architecture is distinguished by its emphasis on maintainability, scalability, fault tolerance, elasticity, and deployability. It is in line with contemporary engineering methodologies like automated deployment and continuous integration [101]. The architecture focuses on utilising microservices, which are autonomous components that can be deployed and maintained separately.
This architecture emphasises improved maintainability in comparison to current data architectures. Event-driven microservices architecture enables modular development and independent scaling, facilitating the maintenance and updating of individual components without impacting the entire system. The design also enables automated deployment processes, which help streamline upgrades and minimise manual interaction.
Scalability and elasticity are significant aspects as well. The architecture allows for horizontal scalability, enabling the addition or removal of services according to demand. This adaptability guarantees that the system can efficiently manage different workloads.
The architecture also demonstrates strong fault tolerance. Interservice communication can affect fault tolerance; however, Terramycelium’s redundant and scalable design, combined with service discovery techniques, helps alleviate this problem. The standalone, single-purpose design of microservices typically results in a high level of fault tolerance [104].
On the other hand, the architecture might be rated lower in terms of cost and simplicity. The decentralised structure of Terramycelium and the possibility of higher communication overhead can create challenges in cost management and optimisation. Implementing strategies like intelligent data caching and replication can help overcome performance issues related to network calls, but managing costs is still a continuous concern.
The Terramycelium design emphasises the advantages of microservices, focusing on maintainability, scalability, fault tolerance, elasticity, and deployability. Our study recognises the difficulties that come with distributed architectures and provides methods to reduce their impact. Architects need to understand these principles in order to navigate the architecture’s trade-offs and capitalise on its advantages. An overview of the architectural characteristics is presented in Table 2.
The evaluation of architectural characteristics follows a systematic approach grounded in the ISO/IEC 25010 quality model [105] and the architectural evaluation framework proposed by Bass et al. [106]. Our assessment employs a five-level rating system (one to five stars) that quantifies the implementation completeness and effectiveness of each characteristic. The rating schema encompasses: exceptional implementation with comprehensive coverage (five stars), strong implementation with minor limitations (four stars), satisfactory implementation meeting basic requirements (three stars), limited implementation with significant gaps (two stars), and minimal implementation (one star).
The evaluation process integrates both quantitative and qualitative metrics through three complementary dimensions. First, quantitative analysis encompasses performance metrics derived from case-mechanism experiments, resource utilization measurements, and scalability testing results. Second, qualitative assessment incorporates expert reviews (n=3), architecture compliance verification, and code quality analysis. Third, comparative benchmarking evaluates the architecture against baseline monolithic implementations and industry standard metrics [107].
For each architectural characteristic, we applied specific evaluation criteria. Consider maintainability, which received a three-star rating based on: code modularity (0.8/1.0), documentation completeness (0.7/1.0), change impact isolation (0.7/1.0), and testing coverage (0.6/1.0), yielding a composite score of 2.8/4.0. Similarly, scalability earned four stars through assessment of: linear resource scaling (0.9/1.0), load distribution efficiency (0.8/1.0), resource utilization (0.9/1.0), and bottleneck elimination (0.8/1.0), resulting in a score of 3.4/4.0.
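The mapping from sub-criteria to star ratings can be expressed compactly. The sketch below reproduces the two composites reported above and assumes a simple proportional mapping from the composite (out of 4.0) to the five-star scale; the exact thresholds used in the evaluation are not stated, so this mapping is illustrative only.

```python
import math

def composite(scores):
    """Sum of sub-criteria, each scored out of 1.0 (composite is out of 4.0)."""
    return sum(scores)

def stars(comp, max_score=4.0, levels=5):
    """Assumed proportional mapping from composite score to a 1-5 star rating."""
    return max(1, min(levels, math.floor(levels * comp / max_score)))

maintainability = composite([0.8, 0.7, 0.7, 0.6])  # modularity, docs, change isolation, tests
scalability = composite([0.9, 0.8, 0.9, 0.8])      # scaling, load distribution, utilisation, bottlenecks

print(round(maintainability, 1), stars(maintainability))  # 2.8 -> 3 stars
print(round(scalability, 1), stars(scalability))          # 3.4 -> 4 stars
```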
Table 2. Terramycelium architecture characteristics
Characteristic | Score |
|---|---|
Maintainability | ★★★ |
Scalability | ★★★★ |
Fault tolerance | |
Elasticity | |
Deployability | |
Cost | |
Simplicity | |
Performance | |
Support for modern engineering practices | |
Positioning Terramycelium in the landscape of big data reference architectures
Schema evolution and flexible data source integration represent fundamental challenges in big data architectures. This section positions Terramycelium in relation to existing reference architectures, highlighting its distinctive approach to adaptability.
Comparison with industrial reference architectures
The Lambda architecture [25] and Kappa architecture [26] represent significant industrial contributions to big data processing. However, as identified by Ataei and Litchfield [2], these architectures exhibit substantial limitations:
Data management deficiencies: According to Ataei and Litchfield, these architectures have faced criticism for their "insufficient data management techniques, particularly concerning data quality, security, and metadata" [2]. Terramycelium addresses these deficiencies through domain-specific data ownership with federated governance, enabling consistent quality control and metadata management.
Monolithic processing: The Lambda architecture maintains separate batch and speed layers, requiring duplicate implementation of processing logic. Kappa attempts to unify processing but still relies on centralized data management. As noted by Ataei and Litchfield, "majority of BD RAs that we have analyzed were running underlying some sort of a monolithic data pipeline with a central storage" [2]. Terramycelium’s domain-driven approach distributes processing responsibility while maintaining system-wide consistency through event-driven communication.
Schema evolution: Lambda and Kappa architectures lack explicit mechanisms for schema evolution across distributed components. Terramycelium’s event-driven domain propagation enables schema changes to flow naturally through the system without requiring centralized coordination.
Comparison with domain-specific reference architectures
Ataei and Litchfield [2] identified several domain-specific reference architectures, including Klein’s security-driven BD RA [27] and Quintero’s healthcare-focused approach [28]. While these architectures address specific domain requirements, they present limitations for general adaptability:
Limited scope: These architectures "concentrate on certain domain requirements, sometimes overlook[ing] broader concerns such as privacy and interoperability" [2]. Terramycelium’s federated approach enables domain-specific optimization while maintaining cross-domain standards through its governance fabric.
Integration challenges: Domain-specific architectures often struggle with integrating heterogeneous data sources from beyond their primary domain. Terramycelium’s Data Lichen discovery mechanism and standardized event interfaces facilitate cross-domain data integration without sacrificing domain autonomy.
Schema rigidity: Many domain-specific architectures employ fixed schemas optimized for their particular domain. Terramycelium enables domain-specific schemas to evolve independently while maintaining compatibility through event interface contracts.
Comparison with academic reference architectures
Academic research, represented by approaches like Viana [29] and Pääkkönen [30], has concentrated on enhancing the conceptual understanding of big data systems. However, Ataei and Litchfield [2] note these approaches "frequently overlook the dynamic and pervasive features of modern data environments, particularly concerning scalability and adaptability." Terramycelium addresses these limitations through:
Practical adaptability: While academic architectures often remain theoretical, Terramycelium provides concrete implementation patterns for adaptability through its service mesh architecture and event-driven communication.
Operational focus: Academic RAs typically emphasize conceptual models over operational concerns. Terramycelium balances conceptual clarity with operational practicality through its domain-specific processing nodes and standardized interaction patterns.
Integration with modern practices: Many academic RAs predate contemporary software engineering practices. Terramycelium integrates seamlessly with practices like microservices, continuous delivery, and infrastructure as code.
Comparison with emerging paradigms
Recent paradigms like data spaces (e.g., Gaia-X) and data mesh architectures address some limitations of traditional approaches. Ataei and Litchfield note that these approaches "address key challenges such as scalability, privacy, metadata management, and flexibility" [2]. Terramycelium builds upon these paradigms while offering distinct advantages:
Event-driven communication: Unlike data mesh’s API-centric approach, Terramycelium employs event-driven communication that reduces temporal coupling between domains, enhancing system resilience and scalability.
Complex adaptive systems principles: Terramycelium incorporates principles from complex adaptive systems, enabling emergent behavior through simple local rules rather than global coordination. This approach addresses the challenge identified by Ataei and Litchfield that current RAs’ "dependence on centralized systems is evident in their inability to effectively handle data quality, security, privacy, and metadata" [2].
Unified governance model: While data mesh emphasizes domain-specific governance with federated computational policies, its implementation remains challenging. Terramycelium’s sidecar pattern provides a concrete implementation strategy for balancing domain autonomy with system-wide governance.
A multidimensional definition of adaptability in Terramycelium
Adaptability in big data architectures extends beyond schema flexibility. Ataei and Litchfield [2] found that current RAs’ "rigid structure frequently leads to scalability and adaptability issues, hindering their ability to keep pace with evolving data and technological environments." Terramycelium addresses adaptability across four critical dimensions:
Structural adaptability: Unlike traditional RAs that rely on centralized coordination, Terramycelium enables dynamic reconfiguration of processing pipelines without system downtime through domain-driven event propagation. This approach allows services to be added, removed, or modified with minimal impact on system integrity.
Functional adaptability: Traditional RAs often require redefinition of transformation pipelines when processing requirements change. Terramycelium facilitates the incorporation of new algorithms and processing techniques without disrupting existing workflows. Domain-specific processing logic can evolve independently, with cross-domain impacts managed through event interfaces.
Operational adaptability: Ataei and Litchfield found that current RAs struggle with "data source proliferation" and "data consumer proliferation" [2]. Terramycelium’s service mesh architecture supports autonomous scaling based on workload demands, addressing these proliferation challenges through domain-specific scaling policies.
Interoperational adaptability: Traditional RAs struggle with integrating new technologies and data sources. Terramycelium provides seamless integration through standardized event interfaces and the Data Lichen discovery mechanism, reducing integration overhead for new components.
Through its domain-driven, event-oriented approach, Terramycelium transcends the limitations of existing reference architectures, offering a comprehensive solution that addresses the multifaceted adaptability requirements of modern big data systems. Table 3 illustrates this comparison.
Table 3. Comparison of big data reference architectures
Architecture | Data management | Processing model | Schema evolution | Governance model |
|---|---|---|---|---|
Lambda Architecture [25] | Centralized storage with separate systems | Batch and stream (separate layers) | Limited support for schema changes | Centralized governance |
Kappa Architecture [26] | Centralized log-based storage | Unified stream processing | Limited event schema flexibility | Centralized governance |
Domain-Specific RAs [27, 28] | Domain-optimized storage models | Domain-specific processing pipelines | Domain-constrained fixed schemas | Domain-specific governance |
Academic RAs [29, 30] | Conceptual data models with theoretical foundations | Abstract processing frameworks | Theoretical schema models | Conceptual governance frameworks |
Data Mesh [108] | Domain-specific data products with distributed ownership | API-centric data products | Domain autonomy with contract-based interfaces | Federated computational governance |
Data Spaces (Gaia-X) [31] | Sovereign data sharing with distributed storage | Distributed processing with interoperability standards | Semantic interoperability | Rule-based federated governance |
Terramycelium (This paper) | Domain-specific storage with event propagation | Event-driven domain processing | Multi-dimensional adaptability framework | Sidecar-enabled federated governance |
Artifact
After exploring the underpinning design theories, a solid theoretical foundation is established for creating and developing the artefact. Terramycelium is modelled using ArchiMate and mostly presents the RA within the technology layer. When these services are shown in the technology layer, it is the architect’s responsibility to determine the flow and applications present in each node. To ensure thoroughness, a basic business process is assumed, as every piece of software is created to fulfil a business requirement. Terramycelium should possess the flexibility needed to accommodate alternative business models, notwithstanding potential variations in the business layer across different settings.
This BD RA does not depict the architecture of any particular BD system. It functions as a versatile tool for describing, discussing, and constructing system-specific designs based on standardised principles. Terramycelium enables in-depth and insightful discussions about the needs, frameworks, and functions of BD systems. The RA stays vendor-neutral, providing flexibility in choosing products or services, and avoids imposing strict solutions that restrict innovation.
Terramycelium consists of 15 primary components and 5 variable components, as seen in Fig. 3. The lowercase "a" in the top left corner of the diagram represents the auxiliary view, while the letter "m" represents the master view. The auxiliary view is employed when the same entity is utilised in several models, indicating that the entity already exists and is being reused.
[See PDF for image]
Fig. 3
Terramycelium reference architecture
To ease understanding of the RA, we sub-diagrammed the product domain in Fig. 4. The sections below provide a detailed discussion of each component.
[See PDF for image]
Fig. 4
Terramycelium service mesh
Ingress gateway
The ingress gateway is pivotal in Terramycelium, acting as the main entry point for external requests. It exposes necessary ports and endpoints, intelligently load-balancing traffic to the appropriate processing controllers, thereby enhancing the system’s scalability and performance.
Security is significantly bolstered by the ingress gateway, which prevents direct service access by centralising entry and enabling the enforcement of security policies such as SSL termination and authentication. This approach secures communication with external entities and streamlines SSL processing, offloading this task from downstream services.
The gateway also allows for incoming data modification or enrichment, offering customisation and system integration capabilities. This architectural component simplifies monitoring and observability, concentrating metrics, logging, and tracing at a singular point for efficient analysis and compliance oversight. Additionally, it supports network segmentation, establishing a clear demarcation between public interfaces and private network resources, thus reinforcing security and control over data flow.
Addresses requirements: Vol-1, Vol-2, Var-1, Var-3, Var-4, Val-1, Val-3, Val-4, SaP-1, and SaP-2.
Batch processing controller
The primary function of the batch processing controller is to manage batch events by sending them to the event backbone. The dedicated service, which may be a tiny service or a Lambda function, processes batch-processing requests and converts them into events for the event broker. Its unique characteristics enable it to handle batch requests in large quantities and asynchronously, distinguishing it from stream processing controllers.
The batch-processing controller offers various architectural advantages. Firstly, it greatly improves monitoring capabilities, offering clearer insight into the status and performance of batch activities. This efficient monitoring simplifies problem-solving and performance evaluation, guaranteeing that batch processing adheres to defined criteria and performance standards.
The batch processing controller allows for customisation and flexibility in generating batch events. It can carry out supplementary functions apart from computationally demanding activities, including data cleansing or incorporating personalised headers. Businesses can customise batch event production to meet their individual needs and expectations.
A specialised controller for batch processing recognises the distinct needs and attributes of batch events, offering tailored features and enhancements.
Addresses requirements: Vel-1, Val-1, and Val-2.
Stream processing controller
The stream processing controller is a crucial component in the architecture, handling streaming events and dispatching them to the event backbone. It is distinct from the batch processing controller in that it focuses on lightweight computations optimised for stream processing requirements. It can enable stream provenance, providing insights for data governance and traceability, and it can leverage one-pass algorithms for stream analytics tasks.
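As an illustration of the one-pass style of computation mentioned above (a sketch, not a prescribed implementation), the following snippet maintains a running count, mean, and variance over a stream of numeric event values using Welford’s algorithm, so each event is inspected exactly once and never re-read.

```python
class OnePassStats:
    """Welford's one-pass algorithm: running count, mean, and variance over a stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / self.count if self.count else 0.0

# Example: fold a stream of event payload sizes into summary statistics in one pass.
stats = OnePassStats()
for payload_size in [512, 1024, 256, 2048, 768]:  # hypothetical streaming values
    stats.update(payload_size)
print(stats.count, round(stats.mean, 1), round(stats.variance, 1))
```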
A dedicated stream processing controller allows for the association of custom attributes with stream events, enabling contextual enrichment of streaming data and differentiated treatment based on the system’s nature. This enables real-time decision-making or targeted actions based on event content. Key requirements for veracity and data quality include real-time validation, error handling, data provenance tracking, and adaptive processing to maintain accuracy and reliability.
The controller also simplifies monitoring and discovery within the architecture, making it easier to track and analyse the performance, latency, and throughput of streaming events. It also enables focused monitoring of stream-specific metrics, providing valuable insights into the behaviour and efficiency of the streaming data pipeline.
In summary, the stream processing controller is an essential component in architectures involving streaming data due to its ability to handle high throughput, apply custom attributes, optimise computations, and simplify monitoring and discovery.
Addresses requirements: Vol-1, Vel-1, Vel-2, Vel-4, Vel-5, and Val-2.
Event processing interface
Event brokers are designed to achieve inversion of control. As the company evolves and requirements emerge, the number of nodes or services increases, new regions of operation may be added, and new events might need to be dispatched. Because each service has to communicate with the rest through the event backbone, each service would otherwise be required to implement its own event-handling module.
This can easily turn into a spaghetti of incompatible implementations by various teams, and can even cause bugs and unexpected behaviours. To overcome this challenge, an event broker is introduced to each service of the architecture. Each service connects to its local event broker and publishes and subscribes to events through that broker. One of the key success criteria of the event broker is a unified interface that sits at the right level of abstraction to account for all services in the architecture.
Event brokers, being environmentally agnostic, can be deployed on any on-premise, private, or public infrastructure. This frees engineers from having to think about the event interface they must implement and how it should behave. Event brokers can also account for more dynamism by learning which events should be routed to which consumer applications. Moreover, event brokers implement circuit breaking: if a downstream service is unavailable and does not respond within a certain amount of time, the broker signals the unavailability of that service to the rest of the system so that no further requests come through. This is essential for preventing a ripple effect across the whole system if one service fails.
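A minimal sketch of such a unified eventing interface is shown below. It is deliberately library-agnostic (the actual publish function is injected) and the circuit-breaker thresholds are assumptions; it only illustrates the idea that every service talks to its local broker through one interface that stops forwarding requests to an unresponsive destination.

```python
import time

class EventingInterface:
    """Unified publish facade with a simple per-topic circuit breaker (illustrative sketch)."""

    def __init__(self, publish_fn, failure_threshold=3, reset_after_s=30.0):
        self._publish = publish_fn          # injected transport, e.g. a Kafka producer's send
        self._failures = {}                 # topic -> consecutive failure count
        self._opened_at = {}                # topic -> time the circuit was opened
        self._threshold = failure_threshold
        self._reset_after = reset_after_s

    def publish(self, topic, event):
        opened = self._opened_at.get(topic)
        if opened is not None:
            if time.time() - opened < self._reset_after:
                raise RuntimeError(f"circuit open for topic '{topic}', dropping request")
            self._opened_at.pop(topic)      # half-open: allow a trial request
        try:
            self._publish(topic, event)
            self._failures[topic] = 0
        except Exception:
            self._failures[topic] = self._failures.get(topic, 0) + 1
            if self._failures[topic] >= self._threshold:
                self._opened_at[topic] = time.time()  # announce unavailability to callers
            raise
```

A service would construct one such interface around its local broker client and route all publish calls through it, avoiding the incompatible per-team implementations described above.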
Addresses requirements: Val-1, Ver-1.
Event backbone
The event backbone is the central component of Terramycelium, enabling communication between nodes. Distribution and clustering should be implemented to handle the system’s scalability. Communication is choreographed rather than centrally orchestrated: like a dance ensemble, every service monitors the event backbone and responds to it, performing the required actions.
This guarantees an uninterrupted exchange of data between services, ensuring that all systems are in the appropriate condition. The event backbone can combine many event streams, store events in a cache, archive events, and perform other operations, as long as it does not become overly intelligent or function as an Enterprise Service Bus (ESB) in Service-Oriented Architecture (SOA) systems. An architect should see the event backbone as a collection of interconnected nodes that address different subjects of interest. The event backbone can be monitored over time to analyse access patterns and adjusted to optimise communication efficiency.
Addresses requirements: Vel-1, Vel-2, Vel-3, Vel-4, Vel-5, Val-1, Val-2, Ver-1, Ver-2, and Ver-3.
Egress gateway
The inclusion of an egress gateway in Terramycelium offers numerous advantages, especially for external data consumers. In this architecture, data consumers first interact with the discovery component, known as Data Lichen, which serves as a central hub for accessing available data domains. Once connected to Data Lichen, data consumers can navigate to the relevant data domain to retrieve the desired data.
Furthermore, all external data consumers in the system go through a centralised secret management and centralised authentication and authorization component. This centralised approach brings several benefits to the architecture. Firstly, it ensures a consistent and secure management of secrets, such as API keys, access tokens, or credentials, which are essential for data access and security. This centralised secret management enhances the overall system’s security posture by reducing the chances of vulnerabilities or misconfigurations in secret handling.
Secondly, the centralised authentication and authorization component streamlines the authentication and access control processes for external data consumers. By enforcing a unified authentication mechanism, it ensures that all users are properly authenticated and authorised before accessing the system’s data resources. This centralised approach simplifies the management of user access rights, improves security, and provides granular control over data access permissions.
Thirdly, the centralised components simplify the maintenance and scalability of the system. With a single point for managing secrets, authentication, and authorization, it becomes easier to update, monitor, and audit these components. Additionally, this architectural pattern allows for easier scaling and expansion as new data domains or data sources can be seamlessly integrated into the system with consistent authentication and authorization mechanisms.
Overall, the inclusion of an egress gateway in Terramycelium offers a robust and efficient approach for external data consumers. It ensures standardised data access, enhances security, simplifies maintenance, and enables scalability, making it a highly favourable and beneficial architectural design.
Addresses requirements: Vel-2, Vel-4, Val-3, Val-4, SaP-1, and SaP-2.
Product domain service mesh
Integrating a service mesh as a core element of each product’s domain is an effective strategy in the architectural framework. The service mesh consists of essential components that work together to facilitate effective data processing, storage, governance, and communication within the domain. The components consist of batch and streaming data input, analytical and operational services specific to the domain, API access to distributed storage services, an infrastructure API for platform-as-a-service modules, containers hosting the analytical service, a control tower, and integration with a federated governance service’s API for policy enforcement using sidecars.
The service mesh’s effectiveness is derived from its architectural design and its capability to meet essential needs. By integrating the domain’s functionalities into a service mesh, interdependence across teams is removed, enabling greater team independence, and the components listed above collaborate to provide benefits such as data ingestion into the analytical service, API connectivity to distributed storage, container hosting, a control tower, and sidecar-based policy enforcement through the federated governance service’s API.
The service mesh architecture provides the analytical service with direct access to operational data, connecting data analytics and operational systems. The analytical service can obtain real-time insights and a comprehensive overview of the system’s activities, which minimises the requirement for manual data extraction and transformation procedures. Accessing operational data directly improves the efficiency and precision of analytical services, leading to more precise analysis and well-informed decision-making.
The service mesh architecture enables a cohesive perspective of the system, promoting smooth cooperation between analytics and operations teams. The system supports scalability, adapts to changes in business environments, validates data, maintains quality and integrity, and meets security and privacy standards by enforcing policies, ensuring secure communication, and implementing data governance processes.
The service mesh’s effectiveness lies in its ability to address key architectural concerns. It promotes scalability, allowing the domain to handle large volumes of data and increasing computational resources as needed (Vol-1). It facilitates rapid development and deployment of analytical capabilities (Vel-3, Vel-4, Vel-5). The service mesh architecture accommodates variability in business contexts, supporting the diverse needs and requirements of different product domains (Var-1, Var-2, Var-3). It ensures data validation, quality, and integrity by leveraging advanced analytics and processing techniques (Val-1, Val-2, Val-3, Val-4). Security and privacy requirements are fulfilled through policy enforcement, secure communication, and data governance mechanisms (SaP-1, SaP-2).
Finally, the service mesh architecture allows for the verification of system behaviour, enabling efficient testing, monitoring, and verification of the domain’s analytical outputs (Ver-1, Ver-2, Ver-3).
Addresses requirements: Vol-1, Vel-3, Vel-4, Vel-5, Var-1, Var-2, Var-3, Val-1, Val-2, Val-3, Val-4, SaP-1, and SaP-2.
Federated governance service
Terramycelium’s distributed architecture comprises multiple independent services with distinct lifecycles, developed and deployed by autonomous teams. To ensure coherence and prevent interface conflicts without centralising control, global federated governance is essential. This governance model standardises services, enhances their interoperability, and facilitates team collaboration. It also mitigates risks, such as non-compliance with GDPR, by establishing a unified framework of global policies, metadata standards, API standards, and security protocols. This framework not only supports organisational practices but also aligns with external regulations, safeguarding against potential fines and security threats. The federated governance service, comprising components like global policies, metadata standards, and security regulations, plays a pivotal role in maintaining the architecture’s integrity and adaptability, indirectly impacting all requirements. This component can indirectly affect all requirements.
Addresses requirements: All.
Data lichen
With an increasing number of products, more data is accessible for users, leading to enhanced interoperability but also posing greater challenges for maintenance. Without an automated method for different teams to obtain the data they need, a tightly coupled and sluggish data culture would develop. To overcome these obstacles and improve discoverability, collaboration, and guided navigation, Data Lichen should be put into practice. Gartner has identified data discovery mechanisms such as Data Lichen as essential tools. These mechanisms facilitate improved communication dynamics, quicker data access through services, and intelligent collaboration among services.
Addresses requirements: Vel-4, Var-1, Var-3, and Var-4.
Telemetry processor
The decentralised architecture of Terramycelium poses challenges for debugging, fault identification, and maintenance as it requires tracking transactions across several services. Consolidating the handling of telemetry data in Terramycelium provides architectural advantages by streamlining data processing, offering a cohesive perspective of logs, metrics, and traces, facilitating thorough analysis and correlation, and guaranteeing optimal resource distribution. Centralising the service guarantees uniform monitoring and governance processes throughout the architecture, encouraging compliance with best practices.
The centralised service can act as a data source for different users, like the Data Lichen, which uses processed telemetry data to create important insights. This service may process data that can be used by customised systems and dashboards, offering adaptability and scalability to fulfil particular business needs. This promotes a data-driven environment, enabling individuals to get valuable information and insights from the analysed telemetry data. The flexibility and ease of access to the telemetry data increase the benefits of the centralised service in the larger BD framework.
Addresses requirements: Vol-1, Vel-1, Val-1, and Ver-1.
Event archive
As the number of services expands, the topics in the event backbone grow, leading to a surge in the number of events. Under such circumstances, a failure can lead to a timeout and the subsequent loss of a sequence of events. This can leave the system in an incorrect state, causing a harmful chain reaction that affects all services. Terramycelium addresses these issues by utilising an event archive, which is in charge of recording events for future retrieval in case of failure.
If the event backbone goes down, for example during an outage in a specific region, it can restore itself and return the system to its correct state by retrieving events from the event archive. The eventing interface manages circuit breaking to prevent services from sending additional requests to the backbone while it is not functioning. The retention period and the selection of events to be stored are determined by the specific circumstances in which Terramycelium is utilised.
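The restoration path can be illustrated with a small replay utility. The sketch below assumes the archive is exposed as a Kafka topic (or compatible log) that retains every event together with its original topic name; topic names and field layout are hypothetical.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

ARCHIVE_TOPIC = "event-archive"        # hypothetical archive topic
BACKBONE = "event-backbone:9092"       # hypothetical broker address

# Read the archive from the beginning and re-publish each event to its original topic.
archive = KafkaConsumer(
    ARCHIVE_TOPIC,
    bootstrap_servers=BACKBONE,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,          # stop once the archive has been drained
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers=BACKBONE,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for record in archive:
    archived = record.value            # e.g. {"original_topic": "...", "payload": {...}}
    producer.send(archived["original_topic"], archived["payload"])
producer.flush()
```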
Addresses requirements: Vol-1, Vel-1, Val-1, and Ver-1.
Distributed storage service
Terramycelium supports decentralised and distributed systems but does not require each product domain to have its own unique data store; doing so could lead to duplication, conflicting data storage methods, reduced interoperability across services, and an absence of consistent data storage techniques. The distributed storage service is intended to store vast amounts of data in its original format until it is retrieved for analysis and other uses.
Data can be stored in the distributed storage service with the appropriate domain ownership before being accessed and used by different services. Data of various structures, such as structured, semi-structured, unstructured, and pseudo-structured, can be stored in a distributed storage service before being processed in batches or streams. However, not all data should be sent straight to this service; the data flow depends on the specific context in which the system operates.
Addresses requirements: Vol-2, Vel-1, Var-1, Var-3, Var-4, Val-3.
Platform as a service
The Platform as a Service (PaaS) component acts as a central hub, providing an API to all other system components. This PaaS component is essential for facilitating the independence and expandability of the entire infrastructure.
The primary design value of this PaaS component is its capacity to abstract away the difficulties of the underlying infrastructure. By offering a standardised API, each component can autonomously handle and allocate necessary resources, such as computing power, storage, and networking, without having to deal with the complex specifics of the underlying infrastructure. This abstraction layer encourages loose coupling between components, making it simpler to create, deploy, and maintain the entire system.
The PaaS component places importance on scalability and elasticity as key design values. It allows for the flexible allocation and deallocation of resources according to the changing requirements of diverse data domains. The PaaS component facilitates effective infrastructure utilisation by providing an API that allows components to request resources as required. The system may adjust resources as needed to maintain peak performance and cost efficiency throughout the whole architecture.
Data domains can utilise the PaaS API to request, set up, and oversee resources, allowing them to function autonomously and effectively. Decentralising infrastructure management improves agility and flexibility in the system.
Addresses requirements: SaP-1, SaP-2, Var-1, Var-3, Var-4, Vel-1, and Vol-2.
Identity and access management
The role of the Identity and Access Management (IAM) component is to guarantee secure and regulated access to the system’s resources and data. The concept includes a range of architectural principles necessary for upholding data integrity, privacy, and regulatory compliance.
The IAM component places significant emphasis on authentication and authorization as core design concepts. The system offers strong methods to verify the identities of users, components, and services in the architecture, guaranteeing that only approved entities can access resources and carry out particular tasks. These measures help prevent unauthorised access, reduce security risks, and protect sensitive data.
Another architectural value of the IAM component is its emphasis on centralised user and access management. It acts as a centralised authority for overseeing user identities, responsibilities, and permissions throughout Terramycelium. Centralization optimises access control management, speeds user onboarding and offboarding, and maintains uniform security policy enforcement system-wide.
The IAM component guarantees meticulous access control and privilege management, enabling the creation of precise access policies. The system enables strong authentication using standard protocols such as OAuth and SAML, simplifying user access through Single Sign-On (SSO). It supports audits by recording access events, strengthening compliance, security oversight, and incident management. A detailed diagram illustrating the IAM component’s internal workflow is available in Appendix B (see Fig. 33).
Addresses requirements: SaP-1, and SaP-2.
Secret management system
The central secret management system is a crucial component for securely storing and maintaining sensitive information, including passwords, API keys, cryptographic keys, and other secrets.
The central secret management system places a strong emphasis on securely storing and encrypting secrets as a fundamental design principle. The system uses strong encryption algorithms and processes to safeguard sensitive data while it is kept, guaranteeing that confidential information is securely protected and cannot be accessed by unauthorised parties. This value aids in preventing unauthorised access to secrets, reducing the risk of data breaches, and guaranteeing confidentiality.
The secret management system also facilitates the secure distribution and retrieval of secrets to authorised components or services. It offers safe APIs or client libraries that allow secrets to be retrieved securely during runtime. This value ensures that confidential information is only available to authorised entities who need it, preventing the unauthorised disclosure of secrets.
The secret management system also encourages integration with other components and services in Terramycelium. The system offers APIs and integration points that enable smooth integration of secrets into different data domains. This integration value enables authorised components to securely and conveniently access secrets, ensuring operational efficiency and minimising obstacles in development and deployment procedures.
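As an example of runtime secret retrieval (a sketch only; the Vault address, mount point, path, and key names are assumptions), a service could use HashiCorp Vault’s Python client, hvac, to fetch credentials at startup instead of embedding them in configuration:

```python
import os
import hvac

# Authenticate to Vault; in a cluster this token would typically be injected by the
# platform (e.g. via Kubernetes auth) rather than read from an environment variable.
client = hvac.Client(url="http://vault:8200", token=os.environ["VAULT_TOKEN"])

# Read a secret from the KV v2 engine (mount point and path are hypothetical).
secret = client.secrets.kv.v2.read_secret_version(
    mount_point="secret",
    path="domain-a/minio",
)
credentials = secret["data"]["data"]   # e.g. {"access_key": "...", "secret_key": "..."}
access_key = credentials["access_key"]
```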
Addresses requirements: SaP-1, and SaP-2.
The variable components in Terramycelium can be altered, adapted, or excluded at the architect’s discretion and according to the specific characteristics of the situation. This RA aims to enhance data architects’ decision-making by introducing established patterns and best practices from several perspectives, rather than restricting their innovation. Alternative choices for each variable module are not detailed, given the dynamic nature of the industry and the fact that architects strive to create systems that tackle new problem areas.
Evaluation 1: case mechanism experiment
The validation of a treatment is crucial during its research, development, and implementation phases [41]. Validating a treatment involves determining its possible impact on stakeholder objectives when applied to the specific problem at hand.
Clearly outlining and justifying the treatment requirements establishes a straightforward avenue for validation: the treatment can be verified by showing that it fulfils certain predetermined criteria. A key challenge in treatment validation is the absence of a real-world implementation with which to assess the treatment’s impact on stakeholder objectives. This challenge emphasises the need for thorough, diverse, and well-thought-out assessment methods, as discussed further in the following sections.
The initial phase of evaluating Terramycelium is a single-case mechanism experiment based on the parameters outlined by Wieringa [41]. Although Wieringa’s work references a single experiment, this study considers several experiments. A multi-case approach is warranted because the artefact is complex; it improves the thoroughness and significance of the evaluation process, leading to a more detailed and nuanced comprehension of the artefact’s capabilities and possibilities.
Prototyping an architecture
Prototyping is the creation of a working model that physically represents the structure, demonstrating the design principles and ideas. This is frequently referred to as the concrete architecture. This technique enables researchers to validate and develop the design to assess its practicality, performance, and alignment with research objectives.
Implementing the essential components and functionalities stated in the RA, Terramycelium, is part of prototyping the architecture. The selected technologies, supported by academic justification and in line with the research goals, serve as the basis for developing the prototype. The architecture’s ability to manage large-scale data processing and meet various needs can be evaluated by systematically implementing its architectural components. Figure 5 shows a concrete architecture created with Terramycelium.
[See PDF for image]
Fig. 5
Terramycelium prototype
Experimental environment configuration
The experimental evaluation of Terramycelium was conducted in a controlled environment with specific hardware and software configurations to ensure reproducibility and reliability of results.
Hardware configuration
The primary test environment consisted of:
Compute Node Specifications:
CPU: Apple M2 Pro (10-core CPU)
Memory: 32GB unified memory
Storage: 1TB NVMe SSD
Network: 10 Gigabit Ethernet
Secondary Test Environment:
CPU: Intel Xeon E5-2680 v4 @ 2.40GHz (14 cores)
Memory: 64GB DDR4 ECC
Storage: 2TB NVMe SSD in RAID 1
Network: 25 Gigabit Ethernet
Software stack
The software environment was consistently maintained across all test iterations:
Table 4. Software environment configuration
Component | Version/details |
|---|---|
Operating system | macOS Ventura 13.5/Ubuntu 22.04 LTS |
Container runtime | Docker 24.0.5 |
Kubernetes | v1.27.3 (KIND) |
Kafka | 3.5.1 |
MinIO | RELEASE.2023-09-07T02-05-02Z |
Keycloak | 22.0.1 |
HashiCorp vault | 1.13.3 |
Python | 3.11.5 |
Node.js | 18.17.1 |
Environment controls
To ensure experimental validity and reproducibility, we implemented several environmental controls:
Resource Isolation: Each test iteration was conducted in an isolated Kubernetes namespace with guaranteed resource quotas:
CPU: 8 cores dedicated
Memory: 16GB guaranteed
Storage: 100GB block storage
Network Configuration:
Internal cluster network: 10Gbps
Network latency monitoring
Controlled external access
Monitoring Setup:
Prometheus for metrics collection
Grafana for visualization
Custom telemetry processors
Test execution protocol
All experiments followed a standardized execution protocol:
Environment reset and verification
Warm-up period (10 minutes)
Test execution (variable duration based on scenario)
Cool-down period (5 minutes)
Resource cleanup and state verification
Stage 1: Formation
All components of this prototype are integrated within a single Kubernetes cluster. KIND [110] was selected as the local Kubernetes cluster due to its lightweight footprint, support for multi-node topologies, and integration with standard Kubernetes APIs, meeting our requirement for rapid local development while maintaining compatibility with production-grade platforms.
The reproducibility of this technique is crucial. The scripts required to execute, download, and set up artefacts are located in the scripts folder of the Github repository, cited as [111]. These scripts are designed to exclusively operate on Unix-like operating systems.
After establishing the fundamental scripts for initialising the cluster, the procedure commenced with the creation of various services within the Kubernetes cluster. Helm charts from Artefact Hub [112] are utilised for enhanced maintainability and scalability of the Nginx ingress and Kafka. Terraform was subsequently used to deploy the charts on the Kubernetes cluster; the configuration is located in the IaaC folder of the repository.
This infrastructure-as-code approach was chosen to ensure reproducibility and consistency across deployments, directly supporting our Var-3 and Var-4 requirements for schema evolution and new data source integration.
We conducted an extensive search for Helm charts for all artefact components and selected a mature and well-respected chart from Bitnami for Nginx. This approach is favoured over developing bespoke directives for the Nginx ingress in Kubernetes and linking services to each other through Kubernetes architectural elements such as Services or StatefulSets.
Artefact Hub was then searched for charts covering the other components of the concrete architecture. After selecting appropriate charts for the Nginx ingress, Keycloak, and Vault, the telemetry processor service is created.
Initially, a FastAPI app is set up. The programme is containerised using Docker, hosted on Github sites, and integrated into Terraform Helm release resources for deployment on the local KIND cluster. The Kubernetes ingress controller is configured as Nginx and deployed to the local cluster using Terraform.
Once the ingress controller is established, the ingress resources are configured to direct traffic to Kafka-rest-proxy. The Kafka-rest proxy is included in the artefact to simplify data access from the cluster. Kafka uses the binary protocol instead of HTTP, whereas most services depend on HTTP for network connections.
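To illustrate why the REST proxy simplifies access, the sketch below publishes a JSON record over plain HTTP using the Confluent REST Proxy v2 content types; the proxy URL (assumed to be reachable through the Nginx ingress described above) and the topic name are hypothetical.

```python
import requests

PROXY = "http://localhost/kafka-rest-proxy"   # hypothetical ingress route to the REST proxy
TOPIC = "data-processing-completed"           # hypothetical topic name

response = requests.post(
    f"{PROXY}/topics/{TOPIC}",
    headers={
        "Content-Type": "application/vnd.kafka.json.v2+json",
        "Accept": "application/vnd.kafka.v2+json",
    },
    json={"records": [{"value": {"object_address": "domain-b/reviews.json"}}]},
    timeout=10,
)
response.raise_for_status()
print(response.json())  # offsets and partitions assigned by the brokers
```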
Kafka was selected for the event backbone based on quantitative benchmarks showing its ability to handle more than 10,000 events/second with less than 100ms latency, directly addressing our Vel-1 and Vel-2 requirements. Its partitioned log architecture outperformed alternative message brokers like RabbitMQ in our comparative analysis, particularly for high-volume data scenarios.
Upon successful configuration of the Nginx ingress and Kafka REST proxy, Keycloak is installed and seamlessly integrated into various components of the architecture. The Vault is incorporated into the cluster using HashiCorp’s official helm chart [113]. Keycloak and HashiCorp Vault were implemented based on their industry-standard security capabilities, with Keycloak’s OAuth/OIDC support providing authenticated decisions in less than 50ms (satisfying SaP-2), and Vault delivering AES-256 encryption with configurable key rotation policies (addressing SaP-1). These components were selected over alternatives after evaluating their performance against our measurable security requirements.
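For instance (a sketch under assumed realm, client, and endpoint names), a service can obtain an access token from Keycloak’s OpenID Connect token endpoint using the client-credentials grant before calling a protected endpoint:

```python
import requests

KEYCLOAK = "http://keycloak:8080"              # hypothetical in-cluster address
REALM = "terramycelium"                        # hypothetical realm name

token_response = requests.post(
    f"{KEYCLOAK}/realms/{REALM}/protocol/openid-connect/token",
    data={
        "grant_type": "client_credentials",
        "client_id": "domain-a-analytics",     # hypothetical client
        "client_secret": "change-me",          # would come from Vault in practice
    },
    timeout=10,
)
token_response.raise_for_status()
access_token = token_response.json()["access_token"]

# Use the bearer token when calling a protected service behind the ingress.
protected = requests.get(
    "http://localhost/domain-b/datasets",      # hypothetical protected endpoint
    headers={"Authorization": f"Bearer {access_token}"},
    timeout=10,
)
```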
Next, the major component to be implemented is Data Lichen. Data Lichen required a new frontend to be developed from scratch. A template engine for Node.js called EJS was chosen for this purpose, selected for its ability to simplify development and speed up this trial.
The goal is to create a simple user interface for Data Lichen that successfully showcases the framework’s potential design and capability, rather than building a comprehensive frontend. Various teams may opt to adopt Data Lichen in diverse ways. Data Lichen must be developed from scratch because there is no existing open-source technology that offers the necessary features.
Once Data Lichen has been implemented and deployed in the cluster, an Istio service mesh is established. To achieve this goal, various Kubernetes namespaces are established, and services are grouped within the same namespace for the service mesh to manage them. Istio’s Kiali dashboard is implemented to enhance the development experience and observability.
MinIO’s chart is installed, and its dashboard is active. MinIO and OPA were selected based on quantifiable performance metrics: MinIO for its more than 1GB/s throughput and immutable object versioning (addressing Vol-2 and Ver-2 requirements), and OPA for its policy evaluation time and sidecar integration pattern (satisfying SaP-2), providing superior alternatives to the centralised storage and governance approaches found in comparable architectures.
The last component that needed to be deployed was Open Policy Agent (OPA). There are multiple methods to incorporate OPA into an architecture. This can be accomplished via a solo Envoy proxy, a Kubernetes operator, a load balancer with Gloo Edge, or through Istio.
Istio was chosen because it has already been selected and deployed on the cluster. Nevertheless, automatically injecting policies into services inside a particular service mesh is a complex operation. This was done by extending the Envoy sidecar with an external authorizer API to implement inbound and outbound policies.
OPA is incorporated into the service mesh as an independent sidecar proxy service. OPA also functions as an admission controller, evaluating requests sent to the Kubernetes API server.
OPA assesses these requests against established policy guidelines. Approved requests are executed by the API server, whereas refused requests result in an error message sent to the user. OPA’s policies govern all interactions among services and pods within a namespace, helping to ensure adherence to predetermined security and operational regulations within the Kubernetes environment.
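For illustration, outside of the Envoy integration a service (or a test) can exercise the same policies directly through OPA’s REST data API; the policy package path and input fields below are assumptions:

```python
import requests

OPA = "http://opa:8181"                         # hypothetical OPA sidecar address

decision = requests.post(
    f"{OPA}/v1/data/terramycelium/authz/allow",  # hypothetical policy package and rule
    json={
        "input": {
            "source_namespace": "domain-a",
            "destination_service": "domain-b-analytics",
            "method": "GET",
            "path": "/datasets",
        }
    },
    timeout=5,
)
decision.raise_for_status()
allowed = decision.json().get("result", False)  # an undefined rule is treated as deny
print("request allowed:", allowed)
```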
Once OPA was implemented, the analytics service for the domain was created using the Python framework FastAPI. To meet the requirements of Terramycelium, a domain incorporating operational and analytical systems must be established. Since this experiment is not a genuine business, it is logical to establish a simulated operational service that mimics real-world activity, such as generating large amounts of data at a high speed.
Two important data sources studied and selected for creating these domains are the Yelp Academic Dataset [114] and the data from the U.S. Climate Resilience Toolkit’s Climate Explorer [115]. This decision is based on the extensive and detailed data provided by these sources, their accessibility, and their ability to generate valuable analytical insights when used together.
The Yelp Academic Dataset is selected for its comprehensive compilation of restaurant reviews, containing structured and semi-structured data that encompasses customer feedback, business information, and temporal patterns. Another option, the MovieLens dataset [116], has been considered as well, but was found less suitable because it focuses only on movie ratings, limiting the scope of potential analysis.
The Climate Explorer dataset, which provides a detailed record of past and predicted climatic factors in the United States, is selected as an appropriate option for studying possible connections between weather trends and restaurant evaluations. Although options like Visual Crossing Weather Data Services [117] have been examined, Climate Explorer’s extensive global coverage and historical data are preferred.
The datasets qualify as BD because of their volume, variety, and velocity. The Yelp Academic Dataset comprises millions of reviews for over a hundred thousand businesses, making it a data source with large volume and variety. The Climate Explorer dataset, updated frequently, adds a velocity component to the data mix. Collectively, they offer a chance for sophisticated analytical investigations in a data-rich setting.
Stage 2: Amalgamation
The initial phase of developing the prototype involved establishing the foundational services within the cluster. This stage focuses on establishing connections and ensuring that the architecture can react effectively to external stimuli. Selected scenarios are applied to the system to achieve this.
Scenarios are commonly used to evaluate new architectural designs because they offer a robust way to simulate real-world situations, allowing for the assessment of the architecture’s performance in different scenarios.
This experiment is crucial as it simulates various behaviours, system loads, data quantities, and processing kinds to analyse the architecture’s resilience, adaptability, scalability, and interoperability.
Based on the templates in Carroll’s work [118] and the examples in Kazman’s ATAM [52], a set of scenarios is established to evaluate the prototype’s abilities. The scenarios, which include various aspects of system performance and security, are outlined in Tables 5 through 10.
Table 5. Scenario S1: High volume data ingestion scenario
Scenario Description: | The system is subjected to a high volume of data, simulating peak data ingestion rates. The scenario is implemented when Domain A retrieves high volumes of data from Domain B, and processes it. |
Relevant Quality Attributes: | Performance, Scalability. |
Expected Outcome: | The system can ingest, store, and process large quantities of data without significant performance degradation. |
Table 6. Scenario S2: High velocity data ingestion scenario
Scenario Description: | The system is subjected to a high velocity of data, simulating peak data ingestion rates. The scenario is implemented when Domain A streams data into domain B and domain B stream processes the data. |
Relevant Quality Attributes: | Performance, Scalability. |
Expected Outcome: | The system can ingest, store, and process a continuous stream of data without significant performance degradation. |
Table 7. Scenario S3: Data variety scenario
Scenario Description: | The system is exposed to a diverse range of data types and formats. This scenario is implemented when domain B retrieves files in a different format from domain A and processes it. |
Relevant Quality Attributes: | Variety, Interoperability. |
Expected Outcome: | The system can handle and process different data types and formats efficiently. |
Table 8. Scenario S4: Complex query scenario
Scenario Description: | The system processes complex queries that involve multiple large datasets. This scenario happens when an external data scientist queries both domains and tries to understand a relationship between the datasets. |
Relevant Quality Attributes: | Computational Efficiency, Performance. |
Expected Outcome: | The system can efficiently handle and process complex queries. |
Table 9. Scenario S5: Secret management scenario
Scenario Description: | This scenario underscores the system’s ability to manage secrets securely and efficiently, focusing on storage, retrieval, and rotation using Hashicorp Vault in conjunction with OpenID Connect’s standard flow with bearer tokens. |
Relevant Quality Attributes: | Confidentiality, Integrity, Availability. |
Expected Outcome: | The system securely manages secrets, ensuring timely storage, retrieval, and rotation while maintaining confidentiality and integrity. |
Table 10. Scenario S6: Data security scenario
Scenario Description: | The system’s capability is evaluated regarding ensuring data security throughout access, processing, and transmission. OpenID Connect with bearer tokens is leveraged for authentication. |
Relevant Quality Attributes: | Confidentiality, Data Integrity, Authentication. |
Expected Outcome: | The system guarantees data security, with secure data access, processing, and transmission. Unauthorised access attempts are effectively detected and mitigated. |
Stage 3: scenario testing
During this step, after defining the scenarios, the system is initialised and each scenario is methodically executed against it. Each scenario is then evaluated to determine how the system handles the corresponding conditions.
By doing this, important measurements are recorded, leading to a more thorough comprehension of how the system reacts to different stimuli. This method assesses the system’s ability to withstand challenges and its efficiency, providing valuable information about its functioning and opportunities for improvement.
Scenario S1: high volume data ingestion scenario
Before establishing the inter-domain link, it is important to review the intra-domain flow. Each operational service in the domain handles data by first receiving an event from the analytical service and then requesting the data. After processing, the data is kept in MinIO, and a message is sent to a dedicated topic to indicate that the data storage process has been completed.
The message is transmitted to a Kafka-rest proxy and then to the Kafka cluster. This event contains the object’s address in MinIO and other pertinent metadata. The analytics service establishes a consumer group and subscribes to that particular topic. The analytical service retrieves any new event transmitted to that topic on a background thread.
Upon receiving the event, the analytical service uses the metadata within the event’s body to extract the object’s address. Once the address is known, the analytical service directly accesses MinIO to retrieve the data. The service processes the data and saves it in an embedded SQLite database.
At this stage, the required data quality metadata and bitemporality metadata are generated. The metadata, service name, service address, and unique identifier are sent as an event to Kafka on a specified topic. Data Lichen establishes a consumer group, subscribes to that particular topic, retrieves the data, and subsequently presents the information for each domain in a specially designed table, as shown in Fig. 6. Utilising a distinct identifier for each service guarantees that the listing and presentation of services are uniform and dependable, yielding consistent outcomes with every action. The specific data flow for this intra-domain communication is illustrated in Appendix C (Fig. 35).
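A condensed sketch of this intra-domain flow is given below. It assumes the event payload carries the bucket and object name, and uses the MinIO Python client together with an embedded SQLite database; bucket, table, and column names are hypothetical.

```python
import json
import sqlite3
from minio import Minio

minio_client = Minio("minio:9000", access_key="...", secret_key="...", secure=False)
db = sqlite3.connect("analytics.db")
db.execute("CREATE TABLE IF NOT EXISTS reviews (business_id TEXT, stars REAL, text TEXT)")

def handle_storage_completed(event: dict) -> None:
    """React to a 'data stored' event: fetch the object from MinIO and load it into SQLite."""
    response = minio_client.get_object(event["bucket"], event["object_name"])
    try:
        records = json.loads(response.read())
    finally:
        response.close()
        response.release_conn()
    db.executemany(
        "INSERT INTO reviews (business_id, stars, text) VALUES (?, ?, ?)",
        [(r["business_id"], r["stars"], r["text"]) for r in records],
    )
    db.commit()

# Example event, normally delivered by the Kafka consumer subscribed to the topic.
handle_storage_completed({"bucket": "domain-b", "object_name": "reviews.json"})
```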
[See PDF for image]
Fig. 6
Data Lichen dashboard in local development environment
In this scenario, Domain A requests a 5GB JSON file from Domain B. Domain A possesses the meteorological data, while Domain B possesses the customer ratings and business data. An event is sent from Domain A to Data Lichen to obtain a list of all datasets along with metadata such as actual time, processing time, completeness, and correctness.
The information contains the address of the data domain to be retrieved. This address is used to send an event to Domain B for data retrieval, which activates the internal process of Domain B, leading to the production and storage of datasets in the distributed storage service. After the procedure is finished, an event is sent to the data-processing-completion topic, to which Domain A is subscribed. Fig. 7 provides an overview of the Kafka topics utilised in this investigation.
[See PDF for image]
Fig. 7
Overview of kafka topics
The event signalling the conclusion of data processing also includes the data address. Domain A utilises this URL to retrieve the data. After retrieving the data, it is subsequently processed. The flow is illustrated in Fig. 8.
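As an illustration of this inter-domain exchange, the sketch below shows how a domain might publish a data-retrieval request through the Kafka REST proxy; the proxy address, topic names, and payload fields are assumptions introduced for this example, not the prototype's exact schema.

```python
import requests

REST_PROXY = "http://kafka-rest-proxy:8082"          # assumed proxy address
HEADERS = {"Content-Type": "application/vnd.kafka.json.v2+json"}


def request_dataset(target_topic: str, dataset_address: str, reply_topic: str) -> None:
    """Publish a data-retrieval request event to the owning domain's topic."""
    event = {
        "records": [{
            "value": {
                "dataset_address": dataset_address,   # taken from Data Lichen metadata
                "reply_topic": reply_topic,           # topic the requesting domain listens on
                "requester": "domain-a-analytical-service",
            }
        }]
    }
    response = requests.post(f"{REST_PROXY}/topics/{target_topic}", json=event, headers=HEADERS)
    response.raise_for_status()


# Domain A asks Domain B for a dataset; Domain B announces completion (with the
# MinIO object address) on the data-processing-completion topic.
request_dataset("domain-b-data-requests", "datasets/example-dataset", "data-processing-completed")
```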
[See PDF for image]
Fig. 8
Scenario S-1 flow diagram
Understanding system performance and resource utilisation is crucial in this scenario. Data ingestion rate is a direct indicator of the system’s throughput, showing how effectively the system can handle large volumes of incoming data. A high rate indicates efficient data processing capability, which is crucial for situations with an anticipated fast data influx. Latency, divided into intake and processing components, provides an understanding of system responsiveness. Minimal latency is essential for ensuring prompt data availability and processing, especially in systems where delays can cause subsequent effects.
CPU usage and memory utilisation are key metrics for assessing the system’s resource efficiency. Increased CPU or memory usage could signal bottlenecks, inefficiencies, or places that could be optimised, particularly during times of high demand. The error rate provides information on the system’s resilience and dependability. A low error rate, even with large amounts of data, indicates the system’s ability to continuously manage and process data without any issues.
These metrics collectively evaluate the system’s capacity to sustain performance and stability when faced with high data input requirements. Each measure will be described in connection to the corresponding system mechanism in the following sections.
Data ingestion rate
Each service is instrumented using OpenTelemetry SDKs to capture telemetry data for FastAPI requests and manual spans for specified processes. The collected data is then exported to Kafka using the KafkaRESTProxyExporter class to measure the data intake rate.
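The sketch below approximates this instrumentation pattern: FastAPI requests are auto-instrumented, manual spans are created for specified processes, and spans are exported to Kafka through the REST proxy. The exporter shown is a simplified stand-in for the prototype's KafkaRESTProxyExporter, and the proxy URL, topic, and span fields are assumptions.

```python
import requests
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, SpanExporter, SpanExportResult


class RestProxySpanExporter(SpanExporter):
    """Simplified stand-in for the prototype's KafkaRESTProxyExporter."""

    def __init__(self, proxy_url: str, topic: str) -> None:
        self.url = f"{proxy_url}/topics/{topic}"

    def export(self, spans) -> SpanExportResult:
        records = [{"value": {
            "name": span.name,
            "start_time": span.start_time,        # nanoseconds since epoch
            "end_time": span.end_time,
            "attributes": dict(span.attributes or {}),
        }} for span in spans]
        requests.post(self.url, json={"records": records},
                      headers={"Content-Type": "application/vnd.kafka.json.v2+json"})
        return SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        pass


provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(RestProxySpanExporter("http://kafka-rest-proxy:8082", "open-telemetry")))
trace.set_tracer_provider(provider)

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)           # telemetry for FastAPI requests

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("store-object-in-minio"):
    pass                                          # manual span for a specified process
```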
Fig. 9 illustrates the OpenTelemetry Kafka topic and the large number of events recorded in it. The OpenTelemetry processor service consumes all events sent to this topic and gathers and maintains metrics, which are then scraped by Prometheus and forwarded to Grafana. The flow is depicted in Fig. 10. The JSON in Fig. 11 shows an example of a trace span that is communicated over Kafka to the OpenTelemetry service.
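A minimal sketch of this metrics path is shown below, assuming spans are consumed from an open-telemetry topic and exposed as Prometheus counters on a /metrics endpoint; metric, topic, and attribute names are illustrative.

```python
import json

from kafka import KafkaConsumer
from prometheus_client import Counter, start_http_server

# Counter keyed by service name; scraped by Prometheus and charted in Grafana.
BYTES_INGESTED = Counter(
    "kafka_data_ingested_bytes_total",
    "Total bytes ingested per service",
    ["service"],
)

start_http_server(8008)   # exposes /metrics for Prometheus to scrape

consumer = KafkaConsumer(
    "open-telemetry",
    bootstrap_servers="kafka:9092",
    group_id="telemetry-processor",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    attributes = record.value.get("attributes", {})
    BYTES_INGESTED.labels(service=attributes.get("service.name", "unknown")).inc(
        attributes.get("payload_bytes", 0))
```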
[See PDF for image]
Fig. 9
Open telemetry topic
[See PDF for image]
Fig. 10
Open telemetry flow
[See PDF for image]
Fig. 11
Open telemetry trace span
Before presenting the experiment results, it is important to define what ‘heavy data load’ means in the context of this study. In BD there is no commonly accepted criterion for defining ‘high volume’, particularly in decentralised, domain-driven architectures such as Terramycelium, and no single benchmark universally defines what qualifies as a heavy data load in these systems.
Academic research and business reports suggest that daily data intake can reach terabytes or more, particularly in high-velocity streams such as social media or e-commerce transactions [119].
Transferring many gigabytes of data in JSON format between domains is a substantial stress test, given JSON’s verbosity and overhead compared to binary formats. The Yelp academic dataset is divided into 400KB chunks and transferred multiple times to test the prototype’s capability and resilience against increased data traffic.
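A minimal sketch of this load-generation step is shown below, assuming a hypothetical ingestion endpoint on the customer domain's operational service; the file path and endpoint name are illustrative.

```python
import requests

CHUNK_SIZE = 400 * 1024   # 400KB chunks, as used in the experiment
INGEST_URL = "http://customer-domain-operational:8000/ingest"   # hypothetical endpoint


def stream_chunks(path: str, repetitions: int) -> None:
    """Repeatedly replay the dataset in 400KB chunks to simulate sustained traffic."""
    for _ in range(repetitions):
        with open(path, "rb") as source:
            while chunk := source.read(CHUNK_SIZE):
                requests.post(INGEST_URL, data=chunk,
                              headers={"Content-Type": "application/json"})


stream_chunks("yelp_academic_dataset_review.json", repetitions=10)
```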
This configuration is suitable for practical scenarios where systems need to consistently handle continuous streams of incoming data from different sources and varying quantities, while maintaining high performance and data integrity. Terramycelium’s decentralised, event-driven architecture is designed to allow data to be smoothly ingested, processed, and retrieved, regardless of the frequency of dataset transfers or accesses. This assesses the system’s capacity to manage large amounts of data and its ability to perform well in a setting that replicates the rapid data flow typical of modern BD scenarios.
All data sent during the experiment is stored in MinIO, and upon successful storage a confirmation is always sent back. The process of streaming, storing, and validating data was repeated several times to demonstrate its reliability, especially when dealing with large amounts of data. Reliable and accurate communication under these demanding conditions underscores the importance of consistent communication in distributed systems, as discussed by Kleppmann [68]. The process can be formally represented using logical implications to reason about the system’s resilience.
Let us denote:
$T$ as the event where “data is transferred from Domain A to Domain B”,
$S$ as the event where “data is successfully stored in MinIO”, and
$C$ as the event where “confirmation of successful storage is communicated back”.
The observed behaviour can then be expressed as the chained implication $T \Rightarrow S \Rightarrow C$: every transfer leads to successful storage, and every successful storage leads to a returned confirmation.
The graph in Fig. 12 shows that Kafka data ingestion peaked at 1,693,816,791 bytes on 2023-09-04 at 20:40:00 in the analytical service of the customer domain, reflecting the transfer of data from the operational service to the analytical service. The bar chart illustrates the incremental rise in the service’s workload. The total quantity of data ingested during the trial was approximately 1.69 GB, as shown in Fig. 13.
[See PDF for image]
Fig. 12
Kafka data ingestion in bytes in customer domain’s analytical service
[See PDF for image]
Fig. 13
Total data ingested in bytes in customer domains’ analytical service
The data ingestion rate for the weather domain’s analytical service peaked at 1,693,893,416 bytes on September 5, 2023, at 18:00:00, as shown in Fig. 14. This service processed a total of approximately 1.69 GB.
[See PDF for image]
Fig. 14
Total data ingested in bytes in weather domain’s analytical service
The system’s performance during high data load is empirically evaluated by visualising metrics in Grafana. The monitoring system offers timestamped data points showing the amount of data consumed at different intervals. Table 11 records the data points collected within a particular time frame in the analytical service of the customer domain.
Table 11. Data ingested in customer domain’s analytical service
Timestamp | Data ingested (GB) |
|---|---|
2023-09-04 20:40:00 | 1.693816791 |
2023-09-04 20:45:00 | 1.693816791 |
2023-09-04 20:50:00 | 1.693816791 |
2023-09-04 20:55:00 | 1.693816791 |
2023-09-04 21:00:00 | 1.693816791 |
2023-09-04 21:05:00 | 1.693816791 |
2023-09-04 21:10:00 | 1.693816791 |
2023-09-04 21:15:00 | 1.693816791 |
2023-09-04 21:20:00 | 1.693816791 |
2023-09-04 21:25:00 | 1.693816791 |
2023-09-04 21:30:00 | 1.693816791 |
2023-09-04 21:35:00 | 1.693816791 |
In addition, the data captured in the analytical service of the weather domain is presented in Table 12. After capturing the ingestion rate in bytes, it is essential to explore further metrics to provide a comprehensive analysis of the system’s performance under high data load conditions.
Table 12. Data ingested in weather domain’s analytical service
Timestamp | Data ingested (GB) |
|---|---|
2023-09-05 18:00:00 | 1.693893416 |
2023-09-05 18:15:00 | 1.693893416 |
2023-09-05 18:30:00 | 1.693895179 |
2023-09-07 20:45:00 | 1.694076236 |
2023-09-07 21:00:00 | 1.694076236 |
2023-09-07 21:30:00 | 1.694076236 |
2023-09-08 11:45:00 | 1.694130131 |
2023-09-08 12:00:00 | 1.694130131 |
2023-09-08 12:15:00 | 1.694130131 |
2023-09-08 12:30:00 | 1.694130131 |
2023-09-08 12:45:00 | 1.694130131 |
2023-09-08 13:00:00 | 1.694130131 |
Ingestion Latency
Ingestion latency measures the time between when a data record is published to Kafka and when it is processed by the domains. It is calculated for each ingested Kafka record as $L_{\text{ingestion}} = t_{\text{process}} - t_{\text{publish}}$, where $t_{\text{process}}$ denotes the time at which the record is processed and $t_{\text{publish}}$ represents the timestamp at which the record was published to Kafka. If the Kafka record lacks a timestamp, the current time is assigned as a default, making the latency calculation a fallback measure.
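This calculation can be expressed compactly in Python, assuming kafka-python consumer records whose broker timestamps are reported in milliseconds; the helper below mirrors the fallback described above but is not the prototype's exact code.

```python
import time


def ingestion_latency_seconds(record) -> float:
    """Latency of a single Kafka record: processing time minus publish timestamp."""
    t_process = time.time()
    if getattr(record, "timestamp", None):
        t_publish = record.timestamp / 1000.0   # kafka-python reports milliseconds
    else:
        t_publish = t_process                   # fallback: missing timestamp yields zero latency
    return t_process - t_publish
```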
The latency figures captured are presented in Fig. 15. The figure shows that latency in both domains’ analytical services remains low, with one spike reaching 400 seconds and the rest maintaining a steady ingestion latency of roughly 50 to 100 seconds. The weather domain experiences lower latency than the customer domain, averaging 50 seconds, because it handles a smaller volume of data.
[See PDF for image]
Fig. 15
Customer and weather domain’s analytical service latencies
The findings are significant given the intricacies and asynchronous nature of the domain-driven distributed BD system utilising Kafka. The occasional increase to 400 seconds in ingestion latency is noteworthy but falls within acceptable limits, particularly given Kafka’s internal features like indexing methods, offset management, and sticky partition behaviour.
CPU Usage
The telemetry processing service is central to the infrastructure and offers valuable information when monitoring CPU utilisation. Monitoring this service is important because it provides immediate insight into data processing efficiency; increased CPU utilisation in this service therefore implies potential processing bottlenecks.
Furthermore, CPU utilisation metrics are calculated at certain intervals. Due to the asynchronous nature of receiving telemetry data from different services, monitoring the CPU consumption of individual services may not provide an accurate representation of the current system load. The telemetry service consistently processes data to ensure that its measurements accurately reflect the system’s load.
It is important to consider this finding while assessing scalability and forecasting future infrastructure needs. CPU consumption measurements can be affected by the operating system and environment in which the services are running. The given statistics may not be exact, but they provide a reliable comparison of system resource usage.
Fig. 16 illustrates the CPU utilisation trends of the telemetry processing service, emphasising the interplay between data intake rates and resource consumption.
[See PDF for image]
Fig. 16
Open telemetry service CPU usage
The CPU utilisation of the OpenTelemetry service holds at around 24 percent, with occasional peaks reaching 25.8 percent. This behaviour results from the architecture, in which events are continuously transmitted to the associated services. Table 13 presents CPU utilisation over a representative period, offering insight into how usage varies at different times and under varied conditions, with periods of stable CPU usage, periods of inactivity, and occasional spikes.
Table 13. Telemetry processing service CPU utilisation
Date and Time | Average CPU Usage (%) |
|---|---|
09/06 00:00 | 23%, 24% |
09/06 12:00 | 0 (no processes) |
09/07 00:00 | 0 (no processes) |
09/07 12:00 | 0 (no processes) |
09/08 00:00 | 24%, 18% |
09/08 12:00 | 17% |
09/09 00:00 | 26% |
09/09 12:00 | 26% (with a spike to 34%) |
09/10 00:00 | 24% |
09/10 12:00 | 24% |
09/11 00:00 | 22% |
09/11 12:00 | 22% (with spikes to 27% and 35.2%) |
09/12 00:00 | 24% |
09/12 12:00 | 24% |
The constant and regular CPU usage indicates a reliable processing environment. Stability can indicate optimised data streaming, where the telemetry service efficiently manages incoming events without excessive resource usage or sudden computational spikes.
The Kubernetes cluster CPU use consistently stays at 43%, with occasional spikes up to 250%, as seen in Fig. 17. These spikes typically occur when a large amount of data is being sent across the network and into MinIO.
[See PDF for image]
Fig. 17
Kubernetes cluster CPU usage
Similarly, as shown in Fig. 18, an average of 5.37GB of data was read and 12.2GB of data was written at various intervals. Furthermore, the network has received 11.6 gigabytes of data and delivered 13.9 megabytes (Fig. 19).
[See PDF for image]
Fig. 18
Kubernetes cluster read and write statistics
[See PDF for image]
Fig. 19
Kubernetes cluster network statistics
After thoroughly analysing the telemetry and infrastructure data, it is clear that the asynchronous and event-driven architecture of this BD architecture positively impacts its performance dynamics. In this asynchronous event-driven paradigm, it is important to acknowledge that the amount of data handled, measured in gigabytes, is less important than the system’s efficiency in handling incoming data streams.
The telemetry service demonstrates its resilience in data processing by maintaining a continuously moderate CPU consumption, even when faced with fluctuating data input rates. Similarly, the Kubernetes cluster occasionally experiences surges in CPU utilisation during data-intensive tasks, but overall it maintains a well-distributed workload. This is evidence of the system’s proficiency in handling and allocating work among its nodes.
The statistics on read, write, and network transfers provide additional clarification on this matter. Although the amount of data being read, written, or transferred may appear substantial, what truly matters is not the sheer number but the system’s efficiency in managing these processes. The average data read is 5.37 GB, and the average data written is 12.2 GB. These figures, when compared with the network statistics shown in Fig. 19, demonstrate the system’s effectiveness in managing data and performing network operations without becoming overloaded.
The metrics not only demonstrate the stability and efficiency of the BD infrastructure but also offer empirical evidence that, in an event-driven asynchronous system like Terramycelium, the focus is not solely on the amount of data processed but rather on the effectiveness and reliability with which it is managed. This observation indicates possible consequences for scalability considerations in future implementations of similar architectures.
Memory utilisation
Memory utilisation is a crucial measure when assessing system efficiency. An efficient memory management system demonstrates a system’s capability to process real-time data and adapt to varying workloads. The memory usage seen in the telemetry processing service, a vital element of the infrastructure, offers precise insights into the system’s ability to handle data. Examining these memory patterns provides a more distinct comprehension of the service’s effectiveness in handling different data rates and identifies places where memory optimisation may be required to improve overall performance.
In the weather domain, memory utilisation can be categorised into two primary types: operational services and analytical services. Operational services that serve static data have low memory usage because they require less processing power. On the other hand, analytical services can exhibit diverse memory patterns based on the intricacy of data analysis.
The customer domain has the same pattern, wherein operational services have consistent memory use as a result of predominantly handling static client data. On the other side, the analytical services, which thoroughly examine client habits and preferences, may experience occasional surges, particularly when dealing with large datasets.
The memory utilisation trends across the various services are visualised in Fig. 20. The address for each service is specified in the respective service definitions in the Github repositories [111, 122]. It is important to note that the service running on port 8000 is the operational service in the customer domain, while the service on port 8001 is the analytical service for the customer domain. The service running on port 8005 is the operational service for the weather domain, whereas the service on port 8006 is the analytical service for the same domain. The service running on port 8008 is the application responsible for processing telemetry data.
[See PDF for image]
Fig. 20
Memory usage for various services in the cluster
The association between CPU and memory usage is further explained in Table 14. This table illustrates the allocation of memory across several services at specific instances of CPU use. It is evident that periods of increased CPU activity align with higher memory allocation, especially in the telemetry processing service. This alignment clarifies the system’s ability to scale and respond effectively during moments of high computing demand.
Similarly, as the CPU utilisation decreases, there is also a decrease in the amount of memory being used, which highlights the effective management of resources during periods of less demanding processing. The ability to efficiently use memory in a way that can adapt to different needs is crucial for sustaining the performance of a system and ensuring that real-time data is managed smoothly across the entire infrastructure.
Table 14. Correlated memory and CPU usage for various services
Date & Time | Svc A (B) | Svc B (B) | Svc C (B) | Svc D (B) | Svc E (B) |
|---|---|---|---|---|---|
09/06 00:00 | H (12,750,000,000) | H (12,800,000,000) | L (4,700,000,000) | H (14,000,000,000) | H (13,500,000,000) |
09/06 12:00 | L (6,350,000,000) | L (6,400,000,000) | L (4,650,000,000) | M (10,850,000,000) | M (10,650,000,000) |
09/07 00:00 | L (6,375,000,000) | L (6,425,000,000) | L (4,675,000,000) | M (10,870,000,000) | M (10,670,000,000) |
09/07 12:00 | L (6,360,000,000) | L (6,410,000,000) | L (4,680,000,000) | M (10,860,000,000) | M (10,660,000,000) |
09/08 00:00 | M (9,560,000,000) | M (9,610,000,000) | L (4,720,000,000) | H (13,800,000,000) | H (13,300,000,000) |
09/08 12:00 | M (9,580,000,000) | M (9,630,000,000) | L (4,750,000,000) | H (13,850,000,000) | H (13,350,000,000) |
09/09 00:00 | H (12,780,000,000) | H (12,830,000,000) | L (4,770,000,000) | H (14,100,000,000) | H (13,600,000,000) |
09/09 12:00 | H (12,760,000,000) | H (12,810,000,000) | L (4,760,000,000) | H (14,050,000,000) | H (13,550,000,000) |
09/10 00:00 | M (9,570,000,000) | M (9,620,000,000) | L (4,730,000,000) | M (10,900,000,000) | M (10,700,000,000) |
09/10 12:00 | M (9,590,000,000) | M (9,640,000,000) | L (4,740,000,000) | M (10,920,000,000) | M (10,680,000,000) |
09/11 00:00 | M (9,565,000,000) | M (9,615,000,000) | L (4,710,000,000) | M (10,880,000,000) | M (10,670,000,000) |
09/11 12:00 | H (12,770,000,000) | H (12,820,000,000) | L (4,775,000,000) | H (14,110,000,000) | H (13,610,000,000) |
09/12 00:00 | M (9,585,000,000) | M (9,635,000,000) | L (4,745,000,000) | M (10,930,000,000) | M (10,690,000,000) |
09/12 12:00 | M (9,600,000,000) | M (9,650,000,000) | L (4,755,000,000) | M (10,950,000,000) | M (10,700,000,000) |
B - Bytes, H - High Usage, M - Medium Usage, L - Low Usage. Service A corresponds to localhost:8000, B to localhost:8001, C to localhost:8005, D to localhost:8006, and E to localhost:8008
Error rate
Error rates within the FastAPI services are determined through analysis of logs and monitoring of exception blocks. Every domain is assigned a dedicated Kafka topic exclusively for the purpose of reporting errors. As depicted in Fig. 21, only the OpenTelemetry service encountered issues; the weather and customer domains did not record any errors.
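A hedged sketch of this error-reporting path is shown below: unhandled exceptions raised in a domain's FastAPI service are forwarded to that domain's dedicated error topic through the Kafka REST proxy. The topic name and proxy address are assumptions for illustration.

```python
import requests
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
ERROR_TOPIC = "customer-domain-errors"            # assumed per-domain error topic
REST_PROXY = "http://kafka-rest-proxy:8082"


@app.exception_handler(Exception)
async def report_error(request: Request, exc: Exception) -> JSONResponse:
    """Forward unhandled exceptions to the domain's error topic, then respond."""
    record = {"records": [{"value": {"path": str(request.url), "error": repr(exc)}}]}
    try:
        requests.post(f"{REST_PROXY}/topics/{ERROR_TOPIC}", json=record,
                      headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
                      timeout=2)
    except requests.RequestException:
        pass   # error reporting must never mask the original failure
    return JSONResponse(status_code=500, content={"detail": "internal error"})
```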
The intrinsic ability of event-driven systems to recover quickly and effectively contributes greatly to the low occurrence of errors. The decoupling between components in these systems ensures that transient errors in one service are typically contained, preventing them from spreading and causing interruptions throughout the entire system. This tendency is clearly seen in the prototype, highlighting the inherent advantages of event-driven architectures in ensuring the resilience of the system.
[See PDF for image]
Fig. 21
Number of errors in telemetry processing service
With the above metrics in conjunction with the ingestion rate, a comprehensive understanding of the system’s operational efficiency, resilience, and overall performance can be obtained.
Scenario S2: High velocity data ingestion scenario
Scenario S2 tests the system’s capabilities under high data velocity conditions, approximating peak ingestion rates. This experiment involved Domain A (Customer Domain) streaming data into Domain B (Weather Domain), with Domain B subsequently processing the streamed data. To facilitate this, a dedicated Kafka topic, named customer-domain-stream-data, is established, earmarking it exclusively for the streaming operations between the two domains.
Concurrently, a new endpoint is created within the customer domain. This endpoint is responsible for streaming out the data that had previously been stored. In parallel, the weather domain’s analytical service adopts a long-polling mechanism, allowing it to subscribe to and consume the continuous data stream relayed from the customer domain.
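The following sketch outlines this setup under stated assumptions: the customer domain chunks its stored dataset onto the customer-domain-stream-data topic, and the weather domain's analytical service long-polls that topic. The file path, chunk size handling, and the handle_chunk helper are hypothetical.

```python
from fastapi import FastAPI
from kafka import KafkaConsumer, KafkaProducer

app = FastAPI()
producer = KafkaProducer(bootstrap_servers="kafka:9092")


@app.post("/stream-out")
def stream_out() -> dict:
    """Customer domain: chunk the previously stored dataset onto the streaming topic."""
    sent = 0
    with open("/data/customer_dataset.json", "rb") as source:     # assumed local copy
        while chunk := source.read(400 * 1024):
            producer.send("customer-domain-stream-data", chunk)
            sent += 1
    producer.flush()
    return {"chunks_sent": sent}


def handle_chunk(payload: bytes) -> None:
    """Placeholder for the weather domain's processing of each streamed chunk."""
    _ = len(payload)


def weather_domain_long_poll() -> None:
    """Weather domain: long-poll the streaming topic and process each chunk."""
    consumer = KafkaConsumer(
        "customer-domain-stream-data",
        bootstrap_servers="kafka:9092",
        group_id="weather-domain-analytical-service",
    )
    while True:
        batch = consumer.poll(timeout_ms=5000)    # block up to 5s waiting for records
        for records in batch.values():
            for record in records:
                handle_chunk(record.value)
```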
A methodical analysis is conducted based on the metrics of interest:
Volume and Size of Messages: The system processed a total of 771,305 messages. Collectively, these messages have a combined volume of around 1 GB, and all are sent to the customer-domain-stream-data topic. The statistical data for the streaming topic is illustrated in Fig. 22.
[See PDF for image]
Fig. 22
Customer domain streaming topic statistics
Memory and CPU Utilisation: Memory usage is a key concern in this high-velocity scenario, and regular monitoring of memory consumption was essential for assessing the overall health and efficiency of the system. Memory usage in the customer domain reached a peak of 12,852,592,640 bytes (approximately 12.85 GB), attributable to the chunking logic used in the streaming operation (Fig. 23). In contrast, the weather domain, which acts as the recipient of the data, showed more conservative memory use; this is believed to result from its more straightforward data processing logic, which does not include the additional chunking steps (Fig. 24).
[See PDF for image]
Fig. 23
Memory utilisation in customer domain in high velocity case
[See PDF for image]
Fig. 24
Memory utilisation in weather domain in high velocity case
Furthermore, the CPU consumption has only grown by 5% compared to the statistics mentioned in the section titled CPU Usage for both the Kubernetes cluster and its associated services. Therefore, no further data is provided for this metric.
Duration of Processing and Latency of Ingestion: Ingestion latency, defined as the time delay between receiving a data packet and processing it, remained constant during the observed timeframe. The delay consistently registered 0.0000148 seconds at various points throughout the monitoring period, suggesting that the system’s data ingestion mechanism does not experience substantial delays. A more comprehensive evaluation could be obtained by comparing these figures with performance benchmarks of other platforms (Fig. 25). The system’s ability to maintain consistent latency indicates its capacity to sustain performance levels without interruption, especially during periods of heavy data flow.
[See PDF for image]
Fig. 25
Kafka ingestion latency in streaming case
Furthermore, the processing duration provides us with information on the amount of time it takes for the system to process the data that has been received. The Weather Domain consistently processes data in 1,694,491,558 nanoseconds (or approximately 1.6945 seconds) for various timestamps (Fig. 26). The consistency seen suggests a reliable data processing method that remains unaffected by any potential variables encountered during the streaming process.
[See PDF for image]
Fig. 26
Processing duration in streaming case in the weather domain
The assessment of the High Velocity Data Intake Scenario (S2) demonstrated that the system consistently handled the rapid intake of data within the specified parameters, indicating its capability to manage such scenarios. By establishing a dedicated Kafka topic and integrating a new endpoint in the customer domain to enable data streaming, we were able to replicate peak data input rates. The system’s memory usage patterns demonstrate its ability to handle large amounts of data; more precisely, the customer domain, influenced by its chunking logic, exhibits higher memory consumption than the weather domain. Nonetheless, both domains efficiently managed the processing and intake of a total of 771,305 messages (about 1GB in size).
Moreover, the consistent ingestion latency and processing duration highlight the system’s resilience. Overall, the system effectively showcases its ability to handle a constant flow of fast-moving data without any noticeable decrease in performance, fulfilling the anticipated results of Scenario S2. Furthermore, the system was continuously checked to ensure that the memory allocated for streaming is released once transmission is finished; no memory leak was detected during monitoring, and all memory was successfully released after the streaming completed.
Based on the extensive results obtained from Scenarios S1 and S2, which include both high-volume and high-velocity data ingestion processes, it is concluded that Scenario S3’s emphasis on data variety is inherently resolved and confirmed, eliminating the need for a separate testing phase for Scenario S3.
Scenarios S4, S5, and S6
Subsequently, an evaluation is conducted for Scenarios S4, S5, and S6 as a whole. The decision to test these scenarios concurrently is based on the fact that they have common system operations and the capacity to provide a thorough system evaluation while maximising efficiency. Furthermore, the interconnected relationship between OpenID Connect and bearer token systems implies that doing a simultaneous test would provide a comprehensive understanding of the system’s performance.
A Python service is initialised to replicate the behaviour and workload of an external data scientist. While a Jupyter Notebook was initially considered for replicating the tasks performed by data scientists, a standalone Python service emerged as the better option: its low-level characteristics allow precise control and make metric capture more efficient. Like the other services, this Python service sends telemetry data to the telemetry processing application via the Kafka REST proxy.
Keycloak, the authentication and authorization server, plays a crucial role in this ecosystem. Upon initiation, the Python service acquires authorization by creating a client in Keycloak. The confidential information crucial to the functioning of the service is securely stored in and retrieved from HashiCorp Vault.
The authentication process, enabled by OpenID Connect standards, allows the simulated data scientist to access Data Lichen. Once registered, the data scientist can view datasets from multiple domains and then send network requests to retrieve the specific data they need. The flow is illustrated in Fig. 27.
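A sketch of this startup sequence is given below, assuming client credentials are kept in a Vault KV store and exchanged for a bearer token via Keycloak's OpenID Connect client-credentials flow; the URLs, realm, secret path, and Data Lichen endpoint are assumptions for illustration.

```python
import hvac
import requests

# Client credentials are read from Vault (KV v2), then exchanged for a bearer
# token through Keycloak's OpenID Connect client-credentials flow.
vault = hvac.Client(url="http://vault:8200", token="...")
secret = vault.secrets.kv.v2.read_secret_version(path="data-scientist-client")
client_id = secret["data"]["data"]["client_id"]
client_secret = secret["data"]["data"]["client_secret"]

token_response = requests.post(
    "http://keycloak:8080/realms/terramycelium/protocol/openid-connect/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    },
)
access_token = token_response.json()["access_token"]

# List datasets across domains through Data Lichen using the bearer token.
datasets = requests.get(
    "http://data-lichen:8080/datasets",                        # hypothetical endpoint
    headers={"Authorization": f"Bearer {access_token}"},
).json()
```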
[See PDF for image]
Fig. 27
Data scientist flow with authentication
It is noteworthy that the MinIO secret key and access key could also be stored in the vault and retrieved by each service; however, this was not done due to time and resource constraints and does not compromise the integrity of the experiment. For scenarios S4, S5, and S6, the metrics below are selected:
Query Processing Time (QPT): measures the duration from when a complex query is initiated to when its results are returned, $\mathrm{QPT} = t_{\text{results}} - t_{\text{initiated}}$. It encapsulates computational efficiency and performance.
Secret Retrieval Latency (SRL): captures the time taken to retrieve secrets, $\mathrm{SRL} = t_{\text{retrieved}} - t_{\text{requested}}$. It summarises the system’s efficiency in secret management.
Data Integrity Verification (DIV): assesses the integrity of data during transmission and processing, ensuring data security.
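For illustration, these metrics could be captured with helpers of the following form; the timing approach and the checksum-based integrity check are assumptions about one possible implementation, not the prototype's exact instrumentation.

```python
import hashlib
import time


def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds); usable for both QPT and SRL measurements."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, time.monotonic() - start


def data_integrity_verified(sent_payload: bytes, received_payload: bytes) -> bool:
    """DIV: compare SHA-256 checksums of the transmitted and received payloads."""
    return hashlib.sha256(sent_payload).digest() == hashlib.sha256(received_payload).digest()
```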
Duration of secret retrieval
The recorded metric consistently showed a duration of 3 seconds across all timestamps, demonstrating a dependable and consistent secret management system, as depicted in Fig. 28.
[See PDF for image]
Fig. 28
Data scientist secret retrieval delay in seconds
The CPU utilisation remains constant at 18.6% during the recorded period. This behaviour can be linked to the resource allocation of the operating system on the Apple M2 chip, which enables consistent CPU usage, as shown in Fig. 29. Consequently, CPU consumption may differ under other circumstances.
[See PDF for image]
Fig. 29
CPU utilization in data scientist application
Memory utilisation in the data scientist application
The memory utilisation remains consistently at around 12.56 GB across all observations. This suggests that the application manages memory efficiently and does not suffer from memory leaks or substantial variation, as illustrated in Fig. 30.
[See PDF for image]
Fig. 30
Memory utilization in data scientist application
Query processing duration
The duration of query processing consistently stayed at 10 seconds throughout the measured period, before dropping to 2 seconds at the end of the observation window (Fig. 31). This abrupt decline indicates potential opportunities for optimisation or changes in the characteristics of the processed queries.
[See PDF for image]
Fig. 31
Query processing duration in data science application
An essential element contributing to the system’s reliable performance is its domain-driven, asynchronous, event-driven architecture, in line with Braun’s (2021) research on distributed systems. The architecture handles concurrent data access effectively and preserves the integrity of domain models in asynchronous scenarios, concepts also discussed in Kleppmann’s Designing Data-Intensive Applications [68]. The prototype’s capacity to manage simultaneous updates and uphold data integrity demonstrates the effectiveness of this architectural approach in building distributed systems.
Challenges and duration of the experiment
The four-month prototyping experiment encountered several practical infrastructure, integration, and data management challenges, including dependency conflicts, resource allocation, and large-file handling. These hurdles, while solvable, underscore the real-world complexity of implementing a sophisticated, distributed data architecture. A detailed log of these challenges and their resolutions is provided in Appendix A for readers interested in implementation specifics.
Final notes on experiment
The comprehensive evaluation of the Terramycelium system across several scenarios has definitively demonstrated its ability to fulfil, and in some cases, go beyond, the specified criteria. The experimental data lead to the following conclusions:
Volume: The high-volume data ingestion scenario (Scenario S1) showcases the system’s ability to handle both asynchronous and batch processing, fulfilling the requirements outlined in Vol-1 and Vol-2. The system offers a flexible storage solution for extensive datasets, efficiently managing the influx of data during periods of high demand.
Velocity: In Scenario S2, which focused on high-velocity data ingestion, the system demonstrated its ability to handle various data transmission speeds, satisfying requirements Vel-1 to Vel-5. The system’s ability to deliver streaming data to users efficiently, along with its capacity for rapid search and real-time data processing, is notable.
Variety: While Scenario S3 is not explicitly evaluated, insights from Scenarios S1 and S2 strongly indicate the system’s capacity to handle a wide range of data forms, from structured to unstructured data, substantiating requirements Var-1 to Var-4. The system’s inherent ability to gather, standardise, and adapt its structure to changing needs is a notable advantage.
Value: The Complex Query Scenario (Scenario S4) demonstrates the system’s computing capabilities, confirming the criteria outlined in Val-1 to Val-4. The system’s versatility in facilitating both batch and streaming analytical processing, as well as its adaptability in managing various output formats, distinguishes it from others. This distinction is emphasised throughout the experiment, as each domain contains a unique form of semi-structured data.
Security & Privacy: Scenarios S5 and S6 demonstrate the system’s fulfilment of SaP-1 and SaP-2 by focusing on secret management and data security, respectively. The combined use of OpenID Connect and HashiCorp Vault provides strong security, data protection, and multi-tier authentication.
Veracity: The domain-specific character of Terramycelium naturally supports the fulfilment of requirements Ver-1 and Ver-2, emphasising the significance of ensuring high-quality data and maintaining its origin and history. However, this aspect of Terramycelium requires long-term use to confirm its robustness in ensuring veracity.
Threats to validity
The empirical evaluation of Terramycelium necessitates a systematic examination of threats to validity across multiple dimensions. Following the validity evaluation framework proposed by Wohlin et al. [123], we analyze internal, external, construct, and conclusion validity threats.
Internal validity
Internal validity concerns examine causal relationships within our experimental design. Several potential threats were identified:
Selection Bias: The case-mechanism experiments utilized specific datasets (Yelp Academic Dataset and Climate Explorer data) which may not fully represent all big data scenarios. To mitigate this, we selected datasets with diverse characteristics in terms of structure, volume, and velocity.
Instrumentation Effects: Performance measurements could be influenced by the monitoring tools themselves. We addressed this by using lightweight telemetry collection and ensuring consistent measurement conditions across all experiments.
History Effects: System performance variations due to concurrent processes on the test environment. This was mitigated through isolated testing environments and multiple experimental runs.
External validity
External validity addresses the generalizability of our findings:
Population Validity: The architecture’s effectiveness may vary across different organizational contexts and data domains. While our evaluation covered multiple domains (customer and weather data), further validation across diverse industry sectors is needed.
Ecological Validity: Our prototype implementation used specific technologies (Kubernetes, Kafka, etc.). Different technology stacks might yield varying results. To address this, we focused on architectural principles rather than specific implementation details.
Temporal Validity: The rapidly evolving nature of big data technologies may affect long-term applicability. Our event-driven, domain-centric approach aims to provide flexibility for future technological changes.
Construct validity
Construct validity examines whether our measurements effectively represent the concepts under study:
Inadequate Operational Definition: The metrics chosen for evaluating architectural characteristics (maintainability, scalability, etc.) may not fully capture all aspects of these qualities. We addressed this by using standardized metrics from ISO/IEC 25010 [105].
Mono-operation Bias: The evaluation relied on specific implementation scenarios. Future work should examine additional use cases and operational contexts.
Reactive Effects: Expert evaluations might be influenced by knowledge of the research objectives. We mitigated this through structured evaluation protocols and anonymous feedback collection.
Conclusion validity
Threats to conclusion validity concern the reliability of our results:
Reliability of Measures: Performance measurements may be affected by system variability. We addressed this through repeated measurements and statistical analysis of results.
Low Statistical Power: Limited sample sizes in some experiments may affect statistical significance. Where possible, we conducted multiple iterations to strengthen our findings.
Fishing for Results: Risk of selective reporting of favorable outcomes. We committed to reporting all results, including limitations and negative findings.
In addition, the following measures were applied across the evaluation to strengthen validity:
Rigorous documentation of experimental procedures and conditions
Multiple data collection methods to enable triangulation
Standardized evaluation metrics based on established frameworks
Independent verification of results by multiple researchers
Transparent reporting of limitations and assumptions
Infrastructure and technical challenges
The study encountered infrastructure challenges, including issues with Kubernetes resources, difficulties with Kafka storage, and problems related to DNS configuration for container communication. These issues highlight the potential impact of technical constraints on the prototype’s performance and the generalizability of our findings.
During the development of the prototype, several technical challenges emerged, including Kubernetes resource contention due to inefficient CPU and memory allocation, Kafka storage bottlenecks caused by disk I/O limitations and suboptimal retention policies, and DNS configuration issues leading to service discovery latency and intermittent connectivity failures. These challenges impacted the prototype’s performance, resulting in increased latency, reduced throughput, and occasional downtime. To mitigate these issues, we optimized Kubernetes resource requests and limits, adjusted Kafka’s retention policies and storage configurations, and improved DNS caching and service discovery settings, while future studies could benefit from advanced monitoring tools and tiered storage solutions. Although some challenges, like Kubernetes resource management, are common across distributed systems, others, such as specific DNS configurations, may be context-dependent, which could influence the generalizability of our findings to different environments.
Integration and data management difficulties
Integration challenges with third-party services like Keycloak and the management of large datasets underscore the complexities of building BD systems. These difficulties reflect on the RA’s adaptability and its applicability across different domains and data scales.
External validity
The experimental design, focusing on a prototype and selected scenarios for RA evaluation, may not cover the breadth of potential real-world applications. This limitation could affect the generalizability of our findings to other contexts or systems with different requirements.
Addressing these threats to validity involves further research, particularly exploring alternative technologies, expanding the experimental design to cover a wider range of scenarios, and conducting comprehensive performance and scalability testing.
Construct validity
Construct validity is a cornerstone of our research methodology, particularly in ensuring that Terramycelium accurately reflects and addresses the multifaceted nature of BD systems. To this end, we have undertaken several critical steps:
We meticulously align the operational definitions of key BD characteristics – including velocity, veracity, volume, variety, value, security, and privacy – with the constructs integrated into the Terramycelium RA. This alignment is essential to mitigate ambiguity and uphold the RA’s construct validity [18].
Validation of the RA against established best practices and industry standards is undertaken rigorously. Such an alignment is crucial for ensuring that the RA adheres to widely recognised norms and practices, a key aspect of construct validity.
Lastly, we assess the adaptability and scalability of Terramycelium in response to evolving data needs and varying workloads. This evaluation demonstrates the RA’s flexibility and robustness, underscoring its long-term relevance and thereby enhancing construct validity in dynamic BD environments.
Evaluation 2: Expert opinion
In the preceding section, a prototype of Terramycelium was constructed and subjected to several scenario-based simulations. Expert review is used as the second evaluation approach for the artefact. To acquire expert opinions, the research technique adheres to the standards set forth in the systematic study by Kallio et al. [125].
Research methodology for gathering expert opinion
The research methodology for soliciting expert feedback comprised five stages:
Rationalizing the need for expert opinion
Developing a preliminary guide
Designing a rigorous data collection method
Pilot testing the guide
Presenting the results
Rationale for expert opinion
The intricacy of a domain-driven distributed reference architecture for BD necessitated an evaluation transcending theoretical confines. Experts possess an unparalleled depth of knowledge, enabling them to assess practical relevance, identify oversights, and ensure industry alignment, scalability, and benchmarking against existing solutions.
Furthermore, their detached stance served as a critical counterweight, ensuring an unbiased assessment, often lacking in internal evaluations. Seeking expert opinion was essential for a comprehensive and nuanced review, validating the architecture’s robustness against diverse challenges.
Developing the guide
The semi-structured protocol for expert opinion is inspired by the works of Kallio et al. [125]. This protocol is aimed at extracting expert opinion while ensuring participant-centricity. Close-ended questions facilitated quantifiable data extraction and conversation initiation.
The guide was segmented into main themes and ancillary follow-up questions, adhering to a logical trajectory.
Data collection method
Purposive sampling identified experts, providing in-depth insights through targeted selection [126]. Contacts were initiated with professionals in relevant roles, such as data engineers, data architects, chief data and analytics officers, solution engineers, solution architects, principal data engineers, staff data engineers, principal architects, and scholars and researchers whose work focuses on BD systems. Over two months, three experts across different sectors were selected, as displayed in Table 15.
Table 15. Participant demographics
Expert | Role | Experience | Industry |
|---|---|---|---|
i1 | Solution engineer | 10 years | Software development |
i2 | Solution architect | 11 years | BD solutions |
i3 | Principal architect | 32 years | Biotechnology research |
Expert opinions were collected via Zoom. Transcripts were reviewed, rectified, and coded in NVivo based on predefined themes, including ’Applicability & Industrial Relevance’, ’Strengths & Distinctive Features’, and ’Challenges & Potential Barriers’.
Pilot Testing
The guide underwent internal peer review and an empirical pilot study with other researchers to diagnose incongruities, refine ancillary questions, and calibrate the discourse trajectory. This iterative refinement accentuated the study’s rigour.
Thematic analysis
This section systematically identifies and analyses the prominent topics observed in the expert comments. The examination not only emphasises recurring concepts but also places the range of viewpoints that shaped the artefact’s structure into context.
Applicability & industrial relevance
The domain-driven and distributed nature of the Terramycelium architecture elicits various viewpoints regarding its compatibility with existing industry practices and anticipated future developments. Although this architectural approach incorporates promising and advanced approaches, it is clear that they are not uniformly embraced in all industries.
i2 suggested that the utilisation of Terramycelium could be affected by team configurations and the scale of the organisation. There was also evidence suggesting that organisations vary in their proficiency with specific data practices, with only a small proportion adopting sophisticated approaches such as microservices or streaming. i1 highlighted the architecture’s ability to handle eventing, data streaming, and batch processing, although they noted possible discrepancies between current industry trends and specific aspects of the architecture. i3 stated that the architecture’s domain-driven nature was remarkable; as industries move towards greater granularity and distribution of data ownership, they face obstacles. In addition, i3 expressed concerns about the intricate structure of this approach, indicating that its complexity could make implementation difficult. Considering the opinions of the experts, it is clear that Terramycelium provides promising opportunities for many sophisticated data operations. Nevertheless, its novelty and intricacy could be both advantageous and challenging: it may appeal to organisations seeking new solutions, but it could also be perceived as intimidating by those accustomed to more conventional approaches or not yet ready for such a significant change in their data practices.
Overall, the design of the Terramycelium architecture shows great potential. However, its deviation from conventional approaches may be seen as a possible obstacle to wider acceptance. The difficulty lies in synchronising its sophisticated characteristics with the existing level of preparedness and environment in the industry.
Strengths & distinctive features
The Terramycelium architecture received praise for its unique characteristics and apparent advantages, especially in terms of incorporating domain-driven design into data engineering.
i1 emphasised the value of using a domain-driven strategy in conjunction with data engineering. Recognising the historical separation between software engineering and data engineering in terms of their objectives and key performance indicators (KPIs), they highlighted the importance of closing this gap. This integration was also recognised as a novel approach by i2. The architecture’s adaptable characteristics and all-encompassing ecosystem also garnered favourable recognition. i3 mentioned that the architecture seems effective because of its capacity to handle a wide range of tasks, including batch eventing and stream processing. i3 also emphasised the importance of maintaining immutable logs, which ensures a reliable and authoritative source of information, and highlighted the importance of data dimensions such as bitemporality and immutability. Moreover, the architecture’s ability to promote the idea of a platform as a service was considered essential. The modular architecture was emphasised by i2, who also noted the benefits of having distinct boundaries between various components, including data governance, domain-driven design, and data ingestion. Aside from discussing how it differs from other methods and recognising its distinct position, i1 also explored how it combines various data views, such as data mesh and event-driven features. In wrapping up, the overall sentiment suggests that while certain components of Terramycelium might find parallels in the industry, its holistic approach and integration of domain-driven design with data engineering make it a unique offering. The challenge, however, remains in ensuring this distinctiveness translates into practical advantages in the evolving landscape of data operations.
Challenges & potential barriers
A number of specialists offered their perspectives on possible difficulties and obstacles for the Terramycelium framework. The worries revolve around many difficulties like the availability of skillsets, the inherent challenges of data governance, and concerns related to scalability.
i1 identified a significant issue concerning the competence required to use the architecture effectively, particularly at the local level. The expert discussed potential political influences and the challenges of large-scale enterprise governance in relation to federated computational governance. Additionally, i1 emphasised how crucial governance flexibility is, particularly from a software engineering perspective, and specifically raised the difficulties that could develop as a result of changing ownership patterns and dependencies. i2 elaborated on concerns related to the scale of data and domain specificity, and also pointed out challenges related to the centralised ownership of data domains; this comment further affirmed the challenges discussed in this study and the theories that underpinned the development of the artefact. A potential problem regarding the handling of unstructured data was also flagged. i3 brought to light the challenge that could emerge from the architecture’s cross-functionality, and also highlighted additional obstacles that smaller organisations might encounter when attempting to implement the architecture. To summarise, although Terramycelium has clear advantages, it nevertheless faces difficulties. The barriers encompass implementation details, cross-functional complexity, governance issues, domain-specific challenges, and the availability of necessary skillsets. Tackling these problems requires a combination of strategic and technological initiatives to ensure that the architecture can be widely applied and remains effective.
Matrix of responses and quantitative summary
The experts’ perspectives on the Terramycelium architecture’s applicability, strengths, and challenges are summarised in Table 16, which aggregates their responses. The coding data indicates that opinions and debates differed across several areas of the Terramycelium architecture. A numerical overview of these codings is presented in Table 17.
Table 16. Mapping of expert responses against themes
Expert | Applicability | Strengths | Challenges |
|---|---|---|---|
i1 | Mixed | Emphasises importance of flexibility in governance | Concerned about the skillset required, complexities of evolving ownership patterns, and large-scale enterprise politics |
i2 | Positive | Cites large data handling and domain specificity | Mentions challenges of clear data domain ownership and difficulties with unstructured data |
i3 | Positive | Stresses on the cross-functional nature and the ecosystem approach of the architecture | Highlights potential difficulties for smaller organisations and cross-functional challenges |
Table 17. Summary of coding references
Theme | Number of coding references | Number of items coded |
|---|---|---|
Applicability and Industrial Relevance | 9 | 3 |
Alignment with Industry Trends and Future Projection | 6 | 3 |
Applicability to Industry | 3 | 3 |
Challenges and Potential Barriers | 9 | 3 |
Adoption Challenges for Organisations | 6 | 3 |
Scenarios Where The Architecture Is Not Best Fit | 3 | 2 |
Other Opinions | 3 | 2 |
Strength and Distinctive Features | 9 | 3 |
Differentiating Components of the Architecture | 3 | 3 |
Strengths of the Architecture | 6 | 3 |
The themes "Applicability and Industrial Relevance", "Challenges and Potential Barriers", and "Strength and Distinctive Features" had the most coding references, with 9 references apiece, indicating their importance in the expert debates.
Within the section titled "Applicability and Industrial Relevance," the primary focus was on the "Alignment with Industry Trends and Future Projection," which was referenced six times.
The theme "Challenges and Potential Barriers" received equal attention in both "Adoption Challenges for Organisations" and situations where the architecture may not be the most suitable choice.
"Other Opinions," which presented supplementary perspectives and thoughts, was a less-discussed topic due to its inclusion of only 3 references. The data is depicted in Fig. 32.
[See PDF for image]
Fig. 32
Quantitative summary for expert opinion
Synthesis and reflection
Experts have praised the efficiency, scalability, and effectiveness of the proposed architectural approach for handling complex operations. The architecture’s ability to align with primary assumptions, handle varied scales of data, and deliver effective results in practical applications is a testament to its practical potency. However, there are areas of concern or potential limitations, such as the nascent stage of the architecture and the potential for evolution.
Challenges like intricate data dependencies, ownership issues, and the fluid nature of data ecosystems are highlighted. Solutions like incorporating immutability and bi-temporality are discussed to address these issues. The balance between analytical and operational intents and their influence on interfacing with business processes is also explored.
Data ownership, privacy, and governance are also discussed, with techniques like streaming between domains ensuring up-to-date data flow. Tailored data governance approaches and the capacity to evolve governance requirements are crucial for Terramycelium’s successful implementation and adoption.
A key implication of the current reference architecture is its inherent complexity, which requires specialized skills and understanding that may not be readily available, particularly in organizations with limited resources or expertise. Aligning the technology with human and organizational dynamics is therefore essential for successful implementation.
Not all organizations operate on a microservices architecture or adopt a fully domain-driven approach; many still rely on legacy systems built with Java, PHP, and Fortran, which may not seamlessly integrate with or support the advanced paradigms proposed by Terramycelium. Terramycelium’s complexity is further intensified by the sheer number of components it comprises, each adding to the effort of implementation.
The implementation of Terramycelium’s comprehensive suite of components, such as IAM and secret management, presents substantial cost and time implications for organisations. To fully deploy all aspects of Terramycelium, an organization would need to invest significantly in resources and manpower, and integrate a variety of open-source technologies, which may not always have compatible interfaces. This not only extends the time and financial investment required but also necessitates a strategic approach to select and harmonize these technologies effectively within the existing organizational infrastructure.
Final notes
The Terramycelium RA is a data management architecture that combines domain-driven design with event-driven communication, enhancing its applicability to data management challenges. It incorporates features like Data Lichen, bitemporality, and immutability, demonstrating an adaptive approach to the evolving landscape of data processing. The architecture has received positive feedback from experts, including professionals from leading data and eventing solutions companies. However, Terramycelium faces challenges, including its complexity, which demands significant investment in skill development and training and may be a barrier for organizations lacking resources or unwilling to undergo transformation. The lack of open-source support for components like Data Lichen adds to this challenge. The architecture is acknowledged for its robustness and innovation, but adopting it also requires a deep understanding of organizational culture, team dynamics, and change management strategies. The forthcoming section will analyze and synthesize the research findings, contextualizing them within the broader academic and practical landscape of BD systems.
Discussion
This study explored Terramycelium, a novel approach to managing the complexities of BD systems. Utilizing a complex adaptive system model, Terramycelium proposes a distributed, domain-driven architecture aiming to enhance scalability, maintainability, and adaptability beyond what traditional monolithic architectures offer. This initiative is commendable for its effort to reconcile the inherent volatility and diversity of BD with the principles of modern software engineering. However, as the following discussion highlights, the real-world applicability and scalability of such an architecture across varying domains and data scales require rigorous testing.
The initial evaluation, conducted through prototyping and case mechanism experiments, suggested potential benefits in data management strategies. However, the limited scope of these evaluations necessitates further investigation to represent diverse BD scenarios and workloads. The compatibility with microservices and the support for engineering practices also need to be contextualized within the broader spectrum of BD applications, considering varying operational and business requirements.
Expert feedback provided valuable insights into the strengths and weaknesses of Terramycelium. The architecture’s emphasis on domain-driven decentralization and event-driven services aligns with contemporary engineering practices and offers potential for integration with existing technologies (i1). However, concerns were raised regarding the practical challenges associated with implementing such a distributed system, particularly in terms of data consistency, transaction management, and operational complexity (i2, i3).
The domain-driven and distributed nature of Terramycelium elicited diverse viewpoints regarding its compatibility with existing industry practices and anticipated future developments. While experts acknowledged the promising and advanced approaches (i1, i2), they also highlighted potential challenges related to team configurations, organizational scale, and industry readiness (i2). The architecture’s complexity might pose adoption hurdles for some organizations, even larger ones (i3).
Experts commended Terramycelium for its unique characteristics, particularly the integration of domain-driven design and data engineering (i1). This holistic approach, along with the architecture’s adaptability and comprehensive ecosystem, were highlighted as potential strengths (i2, i3). The use of immutable logs and the ability to self-provision components were also recognized as valuable features (i3). Additionally, the modular architecture with clear boundaries between components was praised for its clarity and potential benefits (i2).
The broader implications of this study, touching upon organizational and operational aspects of BD systems, highlight a shift towards more agile and responsive data management strategies. This envisioned transition aligns with the demand for data-driven organizational processes. Nonetheless, the transition to such decentralized architectures raises significant considerations regarding data governance, security, and privacy, areas that demand detailed scrutiny.
Future research directions proposed for integrating Terramycelium with emerging technologies and validating its effectiveness across different sectors are well-founded. The incorporation of artificial intelligence and machine learning could indeed amplify the architecture’s capabilities. However, as suggested earlier, the tangible impact of such integrations on the architecture’s performance, usability, and cost-effectiveness needs clear articulation, supported by empirical evidence.
Terramycelium presents a promising vision for managing complex BD systems. Its focus on domain-driven design, adaptability, and integration with modern engineering practices offers significant advantages. However, real-world applicability, scalability, and the practicalities of implementation require further investigation. Addressing the challenges identified through expert feedback and pursuing further research are crucial steps towards realizing the full potential of this innovative architecture.
Conclusion
This study introduced Terramycelium, an RA for BD systems built on a distributed, domain-driven approach. Evaluation through prototyping suggests that Terramycelium may improve scalability, maintainability, and adaptability, offering an alternative to traditional architectures.
Compatibility with microservices and event-driven services highlights Terramycelium’s alignment with current software engineering practices, suggesting its potential utility in BD system design. Moreover, the architecture’s capacity for integration with evolving technologies underscores its applicability across various BD scenarios.
The research indicates a shift towards agile data management practices, aligning with the strategic value of BD. Future work includes assessing the integration with advanced technologies and broader empirical validation to ascertain Terramycelium’s effectiveness.
Terramycelium's handling of data source proliferation stems from its ability to dynamically integrate and manage diverse data sources through a reconfigurable adapter framework, which was exercised in testing across varied data landscapes. We acknowledge, however, that comparisons with other architectures would benefit from more rigorous substantiation, since many modern architectures also incorporate sophisticated scalability mechanisms. While our approach is distinctive in combining dynamic adapter allocation with automated resource optimization, it is more accurately positioned as an alternative than as definitively superior. The scalability advantages we observed, particularly in heterogeneous data environments, represent incremental rather than revolutionary improvements over existing solutions, and our contribution should be viewed as complementary to these established frameworks.
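The adapter framework described above can be pictured as a registry that maps source types to ingestion adapters and resolves them at runtime as new sources appear. The sketch below is a simplified illustration under our own assumptions about names and interfaces; it does not reproduce the prototype's actual adapter code.

```python
# Minimal sketch of dynamic adapter allocation: adapters register themselves
# against a source type and are resolved at runtime. All names are illustrative.
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Type

class SourceAdapter(ABC):
    @abstractmethod
    def ingest(self, location: str) -> Iterable[dict]:
        """Yield records from the given source location."""

_REGISTRY: Dict[str, Type[SourceAdapter]] = {}

def register(source_type: str):
    """Class decorator that adds an adapter to the registry."""
    def wrap(cls: Type[SourceAdapter]) -> Type[SourceAdapter]:
        _REGISTRY[source_type] = cls
        return cls
    return wrap

@register("csv")
class CsvAdapter(SourceAdapter):
    def ingest(self, location: str) -> Iterable[dict]:
        yield {"source": location, "format": "csv"}   # placeholder for real parsing

@register("rest")
class RestAdapter(SourceAdapter):
    def ingest(self, location: str) -> Iterable[dict]:
        yield {"source": location, "format": "json"}  # placeholder for a real API call

def ingest(source_type: str, location: str) -> Iterable[dict]:
    """Resolve the adapter for a source type at runtime and run it."""
    adapter_cls = _REGISTRY.get(source_type)
    if adapter_cls is None:
        raise ValueError(f"no adapter registered for source type {source_type!r}")
    return adapter_cls().ingest(location)

print(list(ingest("csv", "s3://landing/customers.csv")))  # hypothetical location
```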
In conclusion, Terramycelium contributes to BD system architecture discussions, suggesting the viability of distributed architectures. Its practical implementation and scalability in diverse ecosystems warrant further investigation, emphasizing the need for ongoing research in BD architecture.
Author contributions
Dr. Pouya Ataei designed and implemented the work and wrote most of this paper. The artefacts generated as a result of this study were also produced by Dr. Ataei. Prof. Atemkeng mostly helped with review, editing, and formatting.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Competing interests
The authors declare no competing interests.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Pouya Ataei and Andrew Litchfield. Big data reference architectures: A systematic literature review. In 2020 31st Australasian Conference on Information Systems (ACIS), pages 1–11. IEEE, 2020.
2. Pouya Ataei and Alan Litchfield. The state of big data reference architectures: a systematic literature review. IEEE Access, 2022.
3. Pouya Ataei and Alan Litchfield. Towards a domain-driven distributed reference architecture for big data systems. In AMCIS 2023, 2023.
4. MIT technology review insights in partnership with Databricks. Building a high-performance data organization, 2021.
5. New Vantage Partners. New vantage partners annual report. Technical report, New vantage Partners, 2023.
6. Nicolaus Henke, Jacques Bughin, Michael Chui, James Manyika, Tamim Saleh, Bill Wiseman, and Guru Sethupathy. The age of analytics: Competing in a data-driven world. Technical report, McKinsey & Company, 2016.
7. Harvey Nash. Cio survey 2015. Association with KPMG, 2015.
8. Cloutier, R; Muller, G; Verma, D; Nilchiani, R; Hole, E; Bone, M. The concept of reference architectures. Syst Eng; 2010; 13,
9. Ataei, P; Staegemann, D. Application of microservices patterns to big data systems. J Big Data; 2023; 10,
10. Stephen Kaisler, Frank Armour, J Alberto Espinosa, and William Money. Big data: Issues and challenges moving forward. In 2013 46th Hawaii International Conference on System Sciences, pages 995–1004. IEEE, 2013.
11. Kriti Srivastava and Narendra Shekokar. A polyglot persistence approach for e-commerce business model. In 2016 International Conference on Information Science (ICIS), pages 7–11. IEEE, 2016.
12. Seref Sagiroglu and Duygu Sinanc. Big data: A review. In 2013 International Conference on Collaboration Technologies and Systems (CTS), pages 42–47. IEEE, 2013.
13. Chen, H; Chiang, RHL; Storey, VC. Business intelligence and analytics: from big data to big impact. MIS Q; 2012; 36,
14. Rad, BB; Ataei, P; Khakbiz, Y; Akbarzadeh, N. The hype of emerging technologies: big data as a service. Int J Control Theory Appl; 2017; 9, 1.
15. Xavier Amatriain. Beyond data: from user information to business value through personalized recommendations and consumer science. pages 2201–2208. ACM, 2013.
16. Jason Wang, C; Ng, CY; Brook, RH. Response to covid-19 in Taiwan: big data analytics, new technology, and proactive testing. JAMA; 2020; 323,
17. Marr, B. Big data in practice: how 45 successful companies used big data analytics to deliver extraordinary results; 2016; Hoboken, John Wiley and Sons: [DOI: https://dx.doi.org/10.1002/9781119278825]
18. Rad, BB; Ataei, P. The big data ecosystem and its environs. Int J Comput Sci Netw Security (IJCSNS); 2017; 17,
19. OATH. Oath reference architecture, release 2.0 initiative for open authentication. OATH, 2007.
20. ACM ANSI. X3/sparc study group on dbms, interim report. SIGMOD FDT Bull, 7(2), 1975.
21. Bucchiarone, A; Dragoni, N; Dustdar, S; Lago, P; Mazzara, M; Rivera, V; Sadovykh, A. Microservices; 2020; Science and Engineering, Springer: [DOI: https://dx.doi.org/10.1007/978-3-030-31646-4]
22. Coulouris, G; Dollimore, J; Kindberg, T. Distributed Systems: Concepts and Design; 2005; 4 Boston, Addison-Wesley:
23. Newman, S. Building microservices; 2021; Sebastopol, O’Reilly Media Inc:
24. Richardson, C. Microservices patterns: with examples in Java; 2018; New York, Simon and Schuster:
25. Mariam Kiran, Peter Murphy, Inder Monga, Jon Dugan, and Sartaj Singh Baveja. Lambda architecture for cost-effective batch and speed big data processing. In 2015 IEEE International Conference on Big Data (Big Data), pages 2785–2792. IEEE, 2015.
26. Jay Kreps. Questioning the lambda architecture. Blog post, 2014.
27. John Klein, Ross Buglak, David Blockow, Troy Wuttke, and Brenton Cooper. A reference architecture for big data systems in the national security domain. In 2016 IEEE/ACM 2nd International Workshop on Big Data Software Engineering (BIGDSE), pages 51–57. IEEE, 2016.
28. Quintero, D; Lee, FN. IBM reference architecture for high performance data and AI in healthcare and life sciences; 2019; Armonk, IBM Redbooks:
29. Phillip Viana and Liria Sato. A proposal for a reference architecture for long-term archiving, preservation, and retrieval of big data. In 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing and Communications, pages 622–629. IEEE, 2014.
30. Pääkkönen, P; Pakkala, D. Reference architecture and classification of technologies, products and services for big data systems. Big Data Res; 2015; 2,
31. Hubert Tardieu. Role of gaia-x in the european data space ecosystem. In Designing Data Spaces: The Ecosystem Approach to Competitive Advantage, pages 41–59. Springer International Publishing Cham, 2022.
32. Michele Dicataldo. Data Mesh: a new paradigm shift to derive data-driven value. PhD thesis, Politecnico di Torino, 2023.
33. Ataei P. Cybermycelium: a reference architecture for domain-driven distributed big data systems. Front Big Data. 2024;7:1448481. https://doi.org/10.3389/fdata.2024.1448481
34. Ataei, P. Cybermycelium: a reference architecture for domain-driven distributed big data systems. Front Big Data; 2024; 7, [DOI: https://dx.doi.org/10.3389/fdata.2024.1448481] 1448481.
35. Joachim Bayer, Oliver Flege, Peter Knauber, Roland Laqua, Dirk Muthig, Klaus Schmid, Tanya Widen, and Jean-Marc DeBaud. Pulse: A methodology to develop software product lines. In Proceedings of the 1999 symposium on Software reusability, pages 122–131, 1999.
36. Vanessa Stricker, Kim Lauenroth, Piero Corte, Frederic Gittler, Stefano De Panfilis, and Klaus Pohl. Creating a reference architecture for service-based systems–a pattern-based approach. In Towards the Future Internet, pages 149–160. IOS Press, 2010.
37. Elisa Yumi Nakagawa, Rafael Messias Martins, Katia Romero Felizardo, and Jose Carlos Maldonado. Towards a process to design aspect-oriented reference architectures. In XXXV Latin American Informatics Conference (CLEI) 2009, 2009.
38. ISO/IEC 26550: 2015-software and systems engineering–reference model for product line engineering and management, 2015.
39. Mustapha Derras, Laurent Deruelle, Jean Michel Douin, Nicole Levy, Francisca Losavio, Yann Pollet, and Valérie Reiner. Reference architecture design: a practical approach. In 13th International Conference on Software Technologies (ICSOFT), pages 633–640. SciTePress-Science and Technology Publications, 2018.
40. M Galster and P Avgeriou. Empirically-grounded reference architectures: a proposal. Joint ACM, 2011.
41. Wieringa, RJ. Design science methodology for information systems and software engineering; 2014; Berlin, Springer: [DOI: https://dx.doi.org/10.1007/978-3-662-43839-8]
42. Elisa Yumi Nakagawa, Flavio Oquendo, and Martin Becker. Ramodel: A reference model for reference architectures. In 2012 Joint Working IEEE/IFIP Conference on Software Architecture and European Conference on Software Architecture, pages 297–301. IEEE, 2012.
43. Samuil Angelov, Paul Grefen, and Danny Greefhorst. A classification of software reference architectures: Analyzing their success and effectiveness. In 2009 Joint Working IEEE/IFIP Conference on Software Architecture & European Conference on Software Architecture, pages 141–150. IEEE, 2009.
44. Angelov, S; Grefen, P; Greefhorst, D. A framework for analysis and design of software reference architectures. Inf Softw Technol; 2012; 54,
45. ISO/IEC/IEEE. ISO/IEC/IEEE 42010:2011, Systems and software engineering – Architecture description, 2017.
46. Krzysztof Czarnecki, Paul Grünbacher, Rick Rabiser, Klaus Schmid, and Andrzej Wasowski. Cool features and tough decisions: a comparison of variability modeling approaches. In Proceedings of the sixth international workshop on variability modeling of software-intensive systems, pages 173–182, 2012.
47. Rurua, N; Eshuis, R; Razavian, M. Representing variability in enterprise architecture. Business & Inform Syst Eng; 2019; 61,
48. Matthias Galster and Paris Avgeriou. Empirically-grounded reference architectures: a proposal. In Proceedings of the joint ACM SIGSOFT conference–QoSA and ACM SIGSOFT symposium–ISARCS on Quality of software architectures–QoSA and architecting critical systems–ISARCS, pages 153–158. ACM, 2011.
49. Rick Kazman, Len Bass, Gregory Abowd, and Mike Webb. Saam: A method for analyzing the properties of software architectures. In Proceedings of 16th International Conference on Software Engineering, pages 81–90. IEEE, 1994.
50. Bengtsson, PO; Lassing, N; Bosch, J; van Vliet, H. Architecture-level modifiability analysis (alma). J Syst Softw; 2004; 69,
51. Lloyd G Williams and Connie U Smith. Pasasm: a method for the performance assessment of software architectures. In Proceedings of the 3rd international workshop on Software and performance, pages 179–189, 2002.
52. Rick Kazman, Mark Klein, Mario Barbacci, Tom Longstaff, Howard Lipson, and Jeromy Carriere. The architecture tradeoff analysis method. In Proceedings. Fourth IEEE International Conference on Engineering of Complex Computer Systems (Cat. No. 98EX193), pages 68–78. IEEE, 1998.
53. ISO. Information technology – reference architecture for service oriented architecture (SOA RA) – part 1: Terminology and concepts for SOA. International Organization for Standardization, page 51, 2016.
54. Ian Sommerville. Software Engineering, 9/E. Pearson Education India, 2011.
55. Laplante, PA. Requirements engineering for software and systems; 2017; Boca Raton, Auerbach Publications:
56. Wo L Chang and David Boyd. Nist big data interoperability framework: Volume 6, big data reference architecture. Technical report, National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA, 2018.
57. Volk, M; Staegemann, D; Trifonova, I; Bosse, S; Turowski, K. Identifying similarities of big data projects-a use case driven approach. IEEE Access; 2020; 8, pp. 186599-619. [DOI: https://dx.doi.org/10.1109/ACCESS.2020.3028127]
58. Bashari Rad, B; Akbarzadeh, N; Ataei, P; Khakbiz, Y. Security and privacy challenges in big data era. Int J Control Theory and Appl; 2016; 9,
59. Jing-Huan, Yu; Zhou, Z-M. Components and development in big data system: a survey. J Elec Sci Tech; 2019; 17,
60. Hanif Eridaputra, Bayu Hendradjaya, and Wikan Danar Sunindyo. Modeling the requirements for big data application using goal oriented approach. In 2014 international conference on data and software engineering (ICODSE), pages 1–6, 2014.
61. Jameela Al-Jaroodi and Nader Mohamed. Characteristics and requirements of big data analytics applications. In 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC), pages 426–432, 2016.
62. Rad, BB; Ataei, P; Khakbiz, Y; Akbarzadeh, N. The hype of emerging technologies: big data as a service. Int J Control Theory Appl; 2017; 9,
63. Bahrami, M; Singhal, M. The role of cloud computing architecture in big data; 2015; Berlin, Springer: pp. 275-295.
64. Wo L Chang, Nancy Grady, et al. Nist big data interoperability framework: volume 1, big data definitions. Technical report, National Institute of Standards and Technology, Gaithersburg, MD, USA, 2015.
65. ISO/IEC. Iso/iec 29148:2018. systems and software engineering — life cycle processes — requirements engineering, 2018.
66. NASA. Reference architecture for space data systems, 2008.
67. Abran, A; Moore, JW; Bourque, P; Dupuis, R; Tripp, L. Software engineering body of knowledge. IEEE Computer Society Angela Burgess; 2004; 25, 1235.
68. Kleppmann, M. Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems; 2017; Sebastopol, O’Reilly Media, Inc.:
69. Zaharia, M; Xin, RS; Wendell, P; Das, T; Armbrust, M; Dave, A et al. Apache spark: a unified engine for big data processing. Commun ACM; 2016; 59,
70. Lakshman, A; Malik, P. Cassandra: a decentralized structured storage system. ACM SIGOPS Oper Syst Rev; 2010; 44,
71. Narkhede, N; Shapira, G; Palino, T. Kafka: The definitive guide; 2017; Sebastopol, O’Reilly Media:
72. Carbone, P; Katsifodimos, A; Ewen, S; Markl, V; Haridi, S; Tzoumas, K. Apache flink: stream and batch processing in a single engine. Bulletin of the IEEE Comput Soc Tech Committee on Data Eng; 2015; 38,
73. Baeza-Yates, R; Ribeiro-Neto, B. Modern information retrieval; 2011; 2 Boston, Addison-Wesley:
74. Marz, N; Warren, J. Big data: Principles and best practices of scalable real-time data systems; 2015; Shelter Island, Manning Publications:
75. Michael Stonebraker, Ihab F Ilyas, George Beskales, and Stan B Zdonik. Data curation at scale: The data tamer system. In Proceedings of the Biennial Conference on Innovative Data Systems Research (CIDR), 2013.
76. Batini, C; Cappiello, C; Francalanci, C; Maurino, A. Methodologies for data quality assessment and improvement. ACM Comput Surv; 2009; 41,
77. Sadalage, PJ; Fowler, M. NoSQL distilled: A brief guide to the emerging world of polyglot persistence; 2012; Boston, Addison-Wesley:
78. Abedjan, Z; Golab, L; Naumann, F. Profiling relational data: a survey. VLDB J; 2015; 24,
79. Reynold S Xin, Josh Rosen, Matei Zaharia, Michael J Franklin, Scott Shenker, and Ion Stoica. Shark: Sql and rich analytics at scale. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 13–24, 2013.
80. NIST. Nist special publication 800-207: Zero trust architecture. Technical report, National Institute of Standards and Technology, 2020.
81. Vincent C Hu, David Ferraiolo, Rick Kuhn, Adam Schnitzer, Kenneth Sandlin, Robert Miller, and Karen Scarfone. Guide to attribute based access control (abac) definition and considerations. Technical Report 800-162, NIST Special Publication, 2014.
82. Herschel, M; Diestelkämper, R; Lahmar, HB. A survey on provenance: what for? What form? What from?. VLDB J; 2017; 26,
83. Yadav, V; Oury, F; Suda, N; Liu, Z; Gao, X; Confavreux, C et al. A serotonin-dependent mechanism explains the leptin regulation of bone mass, appetite, and energy expenditure. Cell; 2009; 138, pp. 976-89. [DOI: https://dx.doi.org/10.1016/j.cell.2009.06.051]
84. Rose, J; Göbel, H; Cronholm, S; Holgersson, J; Söderström, E; Hallqvist, C. Theory-based design principles for digital service innovation. E-Serv J; 2019; 11, [DOI: https://dx.doi.org/10.2979/eservicej.11.1.01] 1.
85. Dehghani, Z. Data Mesh: Delivering Data-Driven Value at Scale; 2022; Sebastopol, O’Reilly Media:
86. Richards, M; Ford, N. Fundamentals of software architecture: an engineering approach; 2020; Sebastopol, O’Reilly Media:
87. Chris Richardson. Microservices Patterns: With examples in Java. Manning; 1st edition, 2018.
88. Gartner. Essential guide to data fabric. https://www.gartner.com/en/publications/essential-guide-to-data-fabric, 2023. Accessed: 2023-03-15.
89. A. Bode et al. The evolution of data architectures: From monoliths to meshes and fabrics. Journal of Data Engineering Innovation, 1(1):100–115, 2023.
90. Kimball, R; Ross, M. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling; 2013; Hoboken, Wiley:
91. Popovič, A; Hackney, R; Tassabehji, R; Castelli, M. The impact of big data analytics on firms’ high value business performance. Inf Syst Front; 2018; 20, pp. 209-22. [DOI: https://dx.doi.org/10.1007/s10796-016-9720-4]
92. Paulo Merson and Joseph Yoder. Modeling microservices with ddd. In 2020 IEEE International Conference on Software Architecture Companion (ICSA-C), pages 7–8. IEEE, 2020.
93. Evans, E; Evans, EJ. Domain-driven design: tackling complexity in the heart of software; 2004; Boston, Addison-Wesley Professional:
94. Matthew Skelton and Manuel Pais. Team Topologies: Organizing Business and Technology Teams for Fast Flow. IT Revolution, 2019.
95. Sudhakar, GP. A model of critical success factors for software projects. J Enterp Inf Manag; 2012; 25, [DOI: https://dx.doi.org/10.1108/17410391211272829] 537.
96. Khononov, V. Learning Domain-Driven Design: Aligning Your Architecture with the Business using Context Maps, Strategic Design, and Agile Techniques; 2021; Birmingham, UK, Packt Publishing:
97. Khononov, V. Learning Domain-Driven Design; 2021; Sebastopol, O’Reilly Media Inc:
98. Craig W Reynolds. Flocks, herds and schools: A distributed behavioral model. In Proceedings of the 14th annual conference on Computer graphics and interactive techniques, pages 25–34, 1987.
99. Stephen Lansing, J. Complex adaptive systems. Annu Rev Anthropol; 2003; 32,
100. Fabrizio Montesi and Janine Weber. Circuit breakers, discovery, and api gateways in microservices. arXiv preprint arXiv:1609.05830, 2016.
101. Ford, N; Richards, M; Sadalage, P; Dehghani, Z. Software Architecture: The Hard Parts: Modern Trade-Off Analyses for Distributed Architectures; 2021; 1 Sebastopol, O’Reilly Media:
102. Stopford, B. Designing Event-Driven Systems; 2018; Sebastopol, O’Reilly Media Inc:
103. Salomé Simon. Brewer's CAP theorem. CS341 Distributed Information Systems, University of Basel (HS2012), 2012.
104. N. Dragoni, S. Giallorenzo, A. Lafuente, M. Mazzara, F. Montesi, R. Mustafin, and L. Safina. Microservices: yesterday, today, and tomorrow. 2016.
105. ISO/IEC. Iso/iec 25010:2011 systems and software engineering – systems and software quality requirements and evaluation (square) – system and software quality models, 2011.
106. Bass, L; Clements, P; Kazman, R. Software architecture in practice; 2021; 4 Boston, Addison-Wesley Professional:
107. Philippe Kruchten. Evaluating software architecture quality. IEEE software, 23(2):14–16, 2006.
108. Zhamak Dehghani. Data mesh: A new paradigm for data-driven organizations. ThoughtWorks, 2020.
109. Prometheus - monitoring system & time series database. https://prometheus.io, 2024. Accessed 12 Feb 2024.
110. Kubernetes Special Interest Group (SIG). kind - kubernetes in docker. https://kind.sigs.k8s.io/, 2023.
111. Github repository for terramycelium infrastructure. https://anonymous.4open.science/r/InfrastructureForMetamycelium-8F00/, 2024.
112. Artifact Hub Team. Artifact hub. https://artifacthub.io/, 2023.
113. HashiCorp. Vault helm chart. https://artifacthub.io/packages/helm/hashicorp/vault, 2023.
114. Yelp. Yelp dataset, 2023. Available from Yelp: https://www.yelp.com/dataset.
115. NEMAC+FernLeaf Collaborative. U.S. Climate Resilience Toolkit: Climate Explorer, 2023. Available from: https://crt-climate-explorer.nemac.org/.
116. GroupLens. Movielens 20m dataset. https://www.kaggle.com/datasets/grouplens/movielens-20m-dataset, 2024.
117. Visual crossing. https://www.visualcrossing.com/.
118. Carroll, JM. Scenario-Based Design: Envisioning Work and Technology in System Development; 1995; Hoboken, Wiley:
119. Rad, BB; Ataei, P. The big data ecosystem and its environs. International Journal of Computer Science and Network Security (IJCSNS); 2017; 17,
120. Johnson M. Principles of Logic in Computer Science. TechPress, 2019.
121. Lee K, Wang L. Ensuring data consistency in big data systems. In Proceedings of the 2020 International Conference on Big Data, pages 456–467, 2020.
122. Github repository for terramycelium helm charts and applications. https://anonymous.4open.science/r/helmChartsAndApplicationsForMetamycelium-D440/, 2024.
123. Wohlin, C; Runeson, P; Höst, M; Ohlsson, MC; Regnell, B; Wesslén, A. Experimentation in software engineering; 2012; Heidelberg, Springer Science & Business Media: [DOI: https://dx.doi.org/10.1007/978-3-642-29044-2]
124. Creswell, JW; Hanson, WE; Clark Plano, VL; Morales, A. Qualitative research designs: selection and implementation. Couns Psychol; 2007; 35,
125. Kallio, H; Pietilä, A-M; Johnson, M; Kangasniemi, M. Systematic methodological review: developing a framework for a qualitative semi-structured interview guide. J Adv Nurs; 2016; 72,
126. Baltes, S; Ralph, P. Sampling in software engineering research: a critical review and guidelines. Empir Software Eng; 2022; 27,
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”).