Purpose
This paper aims to address the pressing challenges in research data management within institutional repositories, focusing on the escalating volume, heterogeneity and multi-source nature of research data, with the goal of enhancing the data services provided by institutional repositories and modernising their role in the research ecosystem.
Design/methodology/approach
The authors analyse the evolution of data management architectures through literature review, emphasising the advantages of data lakehouses. Using the design science research methodology, the authors develop an end-to-end data lakehouse architecture tailored to the needs of institutional repositories. This design is refined through interviews with data management professionals, institutional repository administrators and researchers.
Findings
The authors present a comprehensive framework for data lakehouse architecture, comprising five fundamental layers: data collection, data storage, data processing, data management and data services. For each layer, the framework articulates the implementation steps, delineates the dependencies between them and identifies potential obstacles with corresponding mitigation strategies.
Practical implications
The proposed data lakehouse architecture provides a practical and scalable solution for institutional repositories to manage research data. It offers a range of benefits, including enhanced data management capabilities, expanded data services, improved researcher experience and a modernised institutional repository ecosystem. The paper also identifies and addresses potential implementation obstacles and provides valuable guidance for institutions embarking on the adoption of this architecture. The implementation in a university library showcases how the architecture enhances data sharing among researchers and empowers institutional repository administrators with comprehensive oversight and control of the university’s research data landscape.
Originality/value
This paper enriches the theoretical knowledge and provides a comprehensive research framework and paradigm for scholars in research data management. It details a pioneering application of the data lakehouse architecture in an academic setting, highlighting its practical benefits and adaptability to meet the specific needs of institutional repositories.
Introduction
Research data is generated during and as an end product of research and is usually retained by the scientific community as it is required to validate research findings (Corti et al., 2019). With advances in information and communication technologies and technical infrastructure, the generation of research data has increased significantly in both quantity and variety (Sheikh et al., 2023). Research data can take various forms, from large, well-structured data sets generated by Big Science to fragmented, smaller data sets generated by Small Science (Scaramozzino et al., 2012). Additionally, these data can be generated by different processes for different purposes. Consequently, they can occur as structured data, such as spreadsheets and tables, or as semi-structured and unstructured data in formats such as JSON, XML and HTML files, text sources, images, audio and video recordings, personal notes, emails and more (Khan et al., 2023).
These data sets serve as the basis for scientific discoveries and enable researchers to explore complex phenomena, decipher hidden patterns and drive innovation across disciplines. Against this backdrop, universities and institutions are recognising the value of research data management and institutional repositories are being established to collect, organise, store, preserve and share research data produced in an institution (Asadi et al., 2019; Francke et al., 2017). The FAIR Data Principles (Findable, Accessible, Interoperable and Reusable) provide essential guidelines for ensuring that research data is organised and made available in ways that promote data reuse and interoperability across different platforms and disciplines (Michel et al., 2016). The Digital Curation Centre’s Data Curation Lifecycle provides a structured approach to managing the lifecycle of research data, guiding institutions in curating and preserving data effectively (Higgins, 2008). While institutional repositories have traditionally focused on storing academic outputs such as publications and theses, research data has typically been managed within dedicated data repositories (Springer Nature, 2024). However, advancements in institutional repository technology and the growing demands for integration of data with publications have led an increasing number of academic institutions to plan the provision of research data services through their institutional repositories (Asadi et al., 2019). Institutional repositories comprise various components, such as data, metadata, technologies and systems, stakeholders, ownership and data ethics and administration (Joo et al., 2019). These components form the foundation for a broad range of repository types, from generalist repositories that handle data across various disciplines to disciplinary repositories (Wikipedia, 2024) designed to manage, curate and preserve data sets particular to specialised fields of study. Additionally, the emergence of trusted data repositories (Mehnert et al., 2019), which are accredited through rigorous certification processes to ensure they meet high standards of data security, integrity and accessibility, further enhances the reliability and trustworthiness of data management within academic institutions. Open-source solutions like InvenioRDM (Carson, 2024) provide robust support for transparency, data sharing compliance and metadata management. Additionally, platforms like RSpace (Macdonald and Macneil, 2015) offer electronic research notebook capabilities that enhance collaboration and ensure seamless data management across research teams. Platforms such as the Australian Research Data Commons (Barker et al., 2019) exemplify how national research data infrastructures can facilitate data sharing, collaboration and long-term preservation. Similarly, frameworks like the Django Globus Portal Framework (Saint et al., 2023) enable researchers to rapidly develop customisable and scalable data portals. These platforms integrate cloud services for secure data sharing and management, offering centralised support to institutions. By enhancing interoperability and ensuring best practices, they enable efficient management and dissemination of large data sets across distributed environments, advancing collaborative research. Thompson and Murillo (2024) emphasise the importance of platform usability and functionality in supporting open science, especially for generalist repositories that cater to diverse academic disciplines. 
Their study highlights how metadata, community features and analytics can enhance the accessibility and sharing of research data. Esser et al. (2024) emphasise the importance of building trust in research data repositories by developing transparent workflows and engaging researchers closely. Their study illustrates how repositories can enhance trust and security, particularly when handling sensitive data. Esteva et al. (2024) further explore the role of standardised metrics, such as data usage and benchmarking, in enhancing transparency within data repositories. Their work illustrates how the implementation of usage metrics can foster better understanding and tracking of data interactions over time, thereby promoting more informed data management practices. The diversity in repository functionality underscores the importance of selecting the appropriate repository type based on specific research needs and data types, a decision that is further informed by Stall et al.’s (2023) comparative analysis of generalist repositories. A significant body of previous research has identified potential challenges facing institutional repositories, including low participation and low motivation for data sharing from researchers (Yang and Li, 2015). Further research has found that user participation is, in turn, significantly impacted by the perceived availability of institutional repositories (Bishop and Collier, 2022; Jeng and He, 2022; Kim, 2022; Yoon and Kim, 2017).
Institutional repositories serve not only as storage, management and preservation spaces for research data but also as platforms for efficient data distribution (Donner, 2023; Joo and Schmidt, 2021; Joo et al., 2019). However, existing institutional repositories often resemble mere warehouses for research data, offering limited services such as uploading, retrieving and labelling. This limitation hampers the collaborative potential between scholars and institutions (Yan et al., 2022). To realise their full potential, institutional repositories need to be developed into more sophisticated platforms that support a wider range of data services (Bashir et al., 2022).
Previous literature has defined the scope of research data services to include data storage and backup, data documentation, data publishing with DOIs and more (Sheikh et al., 2023; Soleimani et al., 2021; Curdt, 2019). Other literature focuses on the role and responsibilities of libraries and librarians in providing data services (Tammaro et al., 2019; Rai and Delhi, 2015), the skills and competencies required (Andrikopoulou et al., 2022), and the training and instruction context (Xu et al., 2022). However, there is limited research on the technical foundations of these data services and their implementation.
In this context, this paper introduces an end-to-end architecture for research data management using data lakehouses based on the DAMA methodology (DAMA International, 2017). To complement the proposed architecture, we analyse the potential implementation obstacles and provide corresponding solutions through interviews with 15 data management professionals, institutional repository administrators and researchers.
Literature review
Data warehouses
The concept of data warehouses was first introduced by IBM researchers B. A. Devlin and P. T. Murphy, who coined the term business data warehouse. They proposed the idea of a centralised repository that could store all of an organisation’s data in one place to support decision-making (Devlin and Murphy, 1988). According to a formal definition by Inmon, “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process” (W. H. Inmon, 1995). A data warehouse is a sizable database that allows for the structured and organised storage and integration of data from various sources. Its main purpose is to support the decision-making process through effective data analysis. By centralising data in a well-structured manner, a data warehouse enables efficient data retrieval and facilitates accurate and insightful analysis (Chandra and Gupta, 2018). As a result, decision-makers can rapidly and effectively extract valuable insights from large data sets (Reddy et al., 2010). Data warehousing is the process of collecting, organising and storing information from various sources in a data warehouse. It empowers information specialists (such as leaders, administrators or analysts) to make faster and smarter decisions. Therefore, data warehouse systems are an essential business intelligence tool used by organisations in their data management practices (W. H. Inmon, 1996; Theodoratos and Sellis, 1997; Mallach, 2000; Al-Debei, 2011).
There have been extensive studies on the architecture and design of data warehouses since their emergence. Kimball (1996) introduced a bottom–up approach to building data warehouses. Known as the dimensional modelling approach, Kimball’s approach emphasised the importance of modelling data in a way that reflected business processes and user needs. This approach also focused on delivering business value quickly by building small, focused data marts rather than a centralised data warehouse. Golfarelli et al. (1998) proposed a semi-automated method for building data warehouses using existing schemas to describe relational databases and presented a framework for designing data warehouses based on the Dimensional Fact Model. Jarke et al. (2002) conducted a comparative analysis of data warehouses’ best practices and current trends, and addressed the issue of data lineage, which involved tracing data items in a data warehouse back to their sources. Ponniah (2004) explained the various architectural approaches of data warehousing, such as the traditional data warehousing model, the hub-and-spoke model and the federated data warehouse model. He also discussed the various hardware and software options that could be used to implement a data warehouse architecture. Zimanyi (2008) examined advanced data warehouse design from traditional to spatial and temporal applications, covering different stages of the design process such as requirements specification, conceptual, logical and physical design. Yessad and Labiod (2016) evaluated three data warehouse modelling approaches: Inmon, Kimball and Data Vault and provided recommendations for choosing the most suitable approach.
Data warehouse systems are designed for storing and querying structured data. Therefore, they have difficulties with the cross-analysis of unstructured data. As data grows in diversity, data warehouses face challenges in meeting storage and query demands (Terrizzano et al., 2015). To address this issue, data lakes have emerged as a data management technology. Data lakes are capable of storing unstructured data and supporting a variety of application scenarios, including predictive analytics, cross-domain analysis and real-time analysis (Khine and Wang, 2018).
Data lakes
The concept of data lakes was first introduced by James Dixon, the founder of Pentaho, in his blog in 2010 (Dixon, 2010). As a platform for storing and processing large amounts of data based on cloud computing, data lakes serve as centralised repositories that allow users to store raw, unprocessed data in their native format, including structured, semi-structured and unstructured data, all of which can be saved at scale. Organisations use data lakes to make better decisions by using visualisations, machine learning and other technologies for big data analytics (Dhayne et al., 2019; Harby and Zulkernine, 2022). Compared to data warehouses, data lakes provide enhanced flexibility and scalability, making them well-equipped to handle the storage and management demands of vast and diverse volumes of data.
Since their inception, data lakes have attracted considerable attention among researchers. Inmon (2016) showed how to structure data lakes and different data ponds for maximum business value and emphasised four key ingredients for data lake success: metadata, integration mapping, context and metaprocess. M. Farid et al. (2016) proposed a constraint-based, large-scale automated metadata system for discovering and enforcing expression integrity constraints from large amounts of data with limited schema information. Nargesian et al. (2019) reviewed the latest technologies in data lake data management and the classic issues, such as data extraction, data cleaning, data integration, data versioning and metadata management. Zhang and Ives (2020) further studied the search function of data lakes, so that researchers and analysts could find tables, schemas, workflows and data sets for their tasks. Giebler et al. (2021) introduced a comprehensive data lake architecture, including a methodology for choosing appropriate concepts to instantiate each aspect.
Data lakes are repositories that store large amounts of raw data from various sources. They are designed to enable fast and flexible data analysis and processing. However, the large amount of heterogeneous data stored in data lakes poses significant challenges to data management (Giebler et al., 2019). Additionally, if data usage, definition, security and metadata management are overlooked, data lakes can result in data swamps (Nargesian et al., 2019). The lack of standardisation in data format and structures within data lakes makes it difficult to implement consistent security measures across all data types. This can create security gaps, which can be exploited by attackers or malicious insiders to breach the system and access sensitive information (Ravat and Zhao, 2019).
Data lakehouses
Data lakehouses are innovative data architectures that merge the benefits of both data warehouses and data lakes to provide efficient data storage, querying and analysis (Cherradi, 2024; Hassan, 2024). Data lakehouses overcome the fragmentation between data warehouses and data lakes by combining the enterprise-level data analysis capabilities of data warehouses with the flexibility, data diversity and rich ecosystem of data lakes. Data lakehouses also incorporate machine learning and artificial intelligence algorithms to extract insights from data and improve decision-making (Shiyal, 2021). Data lakehouses assist organisations in establishing data assets and achieving data-driven operations. Tekiner and Pierce (2021) discuss the open data lakehouse architecture on Google Cloud, which enhances scalability and supports real-time analytics, making it an ideal solution for managing diverse data sets while ensuring efficient performance in distributed environments. Levandoski et al. (2024) further elaborate on the evolution of BigLake, which extends BigQuery’s cloud-native architecture to support multi-cloud environments, unifying data lake and enterprise data warehousing workloads. This approach enhances flexibility, security and performance, especially for AI/ML workloads and unstructured data across cloud platforms. This empowers organisations to innovate using data, providing full support for the future large-scale deployment of business intelligence (Armbrust et al., 2021). Table 1 summarises the major differences between data warehouses, data lakes and data lakehouses:
Diversity of data storage: data lakehouses are capable of supporting different types of data, including structured, semi-structured and unstructured data from various sources and formats without requiring a predefined schema or a rigid structure. Through unified storage of raw, processed, cleaned and modelled data, they can deliver high-timeliness, precise and high-performance query services for both historical and real-time data (Jain et al., 2023; Shiyal, 2021).
Data analytics: the unified storage of data lakehouses supports various analytical businesses such as reporting, batch processing and data mining. Data lakehouses also facilitate massive data storage and processing, and enable efficient distributed computing, allowing organisations to execute diverse analytical tasks concurrently, thus enhancing analytical velocity and precision. Furthermore, by applying techniques such as natural language processing, computer vision and deep learning, they can extract valuable insights from unstructured data (Begoli et al., 2021; L’Esteve, 2022; Nambiar and Mundra, 2022).
Data quality: one of the fundamental challenges with data lakes is that data are deposited into them without effective tracking and monitoring, resulting in data swamps, where the data becomes outdated, inconsistent and unreliable. Data lakehouses are capable of overseeing vast quantities of data, by offering more efficient data management, cleaning and enrichment capabilities. With a more streamlined data ingestion process, the data lakehouse supports the establishment of proper data governance policies, including the implementation of essential metadata and curation practices (Harby and Zulkernine, 2022; Nambiar and Mundra, 2022; Nargesian et al., 2019).
Data security: owing to the diverse nature of data stored within data lakes, it is difficult to unify the management of access control, encryption and other security mechanisms for different data sources, thus making them vulnerable to data leaks and security threats. Data lakehouses implement unified security measures across all data types to safeguard data and prevent tampering or leakage during transmission and storage. Currently, the security of data lakehouse architectures is continually being refined and improved (Khine and Wang, 2018; Marty, 2015; Nambiar and Mundra, 2022; Oreščanin and Hlupić, 2021).
Data storage costs: data warehouses save costs by minimising redundancy and integrating diverse data sources. In contrast, data lakes use big data file systems and Spark to store and process data on cost-effective hardware. Data lakehouses merge the benefits of both, achieving data storage structures and management functions akin to those in data warehouses while using the low-cost storage of data lakes (L’Esteve, 2022).
The advantages of data lakehouses make them a promising technology for institutional repositories seeking to expand their data services. However, research on their practical application and usage in institutional repositories is currently limited due to their novelty. Begoli et al. (2021) notably advanced the field by introducing a lakehouse architecture adept at managing sensitive biomedical data, with their implementation validated through a rigorous case study at Oak Ridge National Laboratory. This work not only showcased the feasibility of lakehouses in handling complex, multimodal data sets but also demonstrated their capability in meeting stringent compliance and security requirements. However, their research primarily focuses on biomedical data, which highlights a gap in the lakehouse model’s applicability to the broader academic and research data environments that encompass a diverse array of data types. Our study seeks to bridge this gap by adapting the lakehouse architecture to meet the varied needs of institutional repositories in academic settings. By doing so, we aim to develop a comprehensive framework that enhances data accessibility, supports researcher collaboration and expands data services within these environments.
Research methodology
Following the guidance of the design science research methodology (Peffers et al., 2007), we designed the high-level architecture of research data management by adapting and extending the Data Management Association (DAMA) methodology (DAMA International, 2017). We extracted the most relevant components and techniques from DAMA that align with the operational needs of data lakehouses in the context of institutional repositories. Furthermore, we incorporated components derived from practical experience in data management, such as the data catalogue. Following this, we delineated five layers: the data collection layer, data storage layer, data processing layer, data management layer and data services layer. Subsequently, we analysed the implementation sequence of each layer within the data lifecycle, as well as the internal logical order within each layer. In this way, we ensured the proposed architecture not only supports but optimises the management of diverse data types, ranging from raw data collections to processed analytical insights. Then, we conducted several rounds of interviews with data management professionals, institutional repository administrators and researchers to get their feedback on the architecture and identify the potential obstacles during implementation. The process of this paper is shown in Figure 1.
We conducted semi-structured interviews with 15 data management professionals, institutional repository administrators and researchers (Table 2) in China from May to July 2023. We divided the 15 interviewees into three groups based on their years of work experience: junior (less than five years, seven scholars), intermediate (6–10 years, five scholars) and senior (more than 10 years, three scholars). The weighting factors assigned to these groups were obtained through expert judgement, with values of 0.2, 0.35 and 0.45, respectively. The interviews were conducted in person and via video conference. The first author, a native Chinese speaker, conducted all interviews. The details of the interview questions can be seen in Appendix 1.
The interviews were transcribed verbatim and then translated into English by a professional translator. For the analysis of these transcripts, we utilised NVivo 15 (Dhakal, 2022), a qualitative data analysis software, which supported our application of the Thematic Content Analysis method (Vaismoradi et al., 2016). In using NVivo, our initial step was to code the interview content into nodes that represent different themes, thereby organising the data into categorisable elements. This step was crucial for establishing a clear thematic structure, which was continuously refined through an iterative review and merging of these nodes. Throughout this process, we assessed the relative importance of each theme by considering their potential impact, severity and the difficulty of addressing the issues they presented, concepts that were later quantitatively evaluated using a specific weighting system in our analysis. NVivo’s advanced query and visualisation tools, such as the “Word Frequency” query and the “Coding Comparison” feature, were instrumental in this phase. These tools not only helped in identifying dominant themes but also ensured inter-coder reliability, critical for the validity of our thematic analysis. This approach facilitated a deeper interpretation of the data, setting the stage for the subsequent quantitative assessment.
After the interview transcripts had been analysed, each obstacle was rated for impact scope, severity and difficulty of solving using the Fibonacci sequence: 13 for critical, eight for high, five for medium and three for low levels. Each score was then aggregated using weights of 0.3 for impact scope, 0.4 for severity and 0.3 for difficulty, determined through expert judgement. These scores were subsequently used in fuzzy comprehensive evaluation (FCE) to evaluate each obstacle comprehensively (Yager, 1988). The FCE scores for each obstacle were then averaged across the three scholar groups, using their respective weights, to provide a holistic assessment of research data management obstacles, reflecting perspectives across different career stages. Table 3 presents the obstacles in research data management extracted from the interviews and their weighted scores after FCE. The details of the calculation can be seen in Appendix 2.
The interviewees also provided feedback on the proposed architecture and technical suggestions for the identified obstacles, which were taken into consideration in the findings and implementation sections of this paper. The details of the feedback can be seen in Appendix 3.
Findings: an end-to-end data lakehouse architecture
To facilitate the implementation of data lakehouses for institutional repositories, we propose an end-to-end data lakehouse architecture with five layers: data collection layer, data storage layer, data processing layer, data management layer and data services layer (Figure 2). The conceptual research data management model of the institutional repositories based on this architecture is shown in Figure 3.
Table 4 below maps the proposed architecture components to the identified research data management obstacles, clarifying how each component addresses specific challenges. Additional insights into this correlation will be provided in the description of each architecture component.
Data collection layer
Researchers use a Web platform to upload their data sets, which are initially stored in a file server before being transferred to the data lakehouse. The data collection layer is responsible for collecting data from the file server. Given the heterogeneous nature of research data, they should be collected according to their types (Armbrust et al., 2021). Structured and semi-structured data can be collected using methods such as application programming interfaces (APIs), extract, transform and load (ETL) processes, and file imports. Unstructured data, on the other hand, can be collected through techniques such as batch writing, message queues and extraction. The data collection layer lays the groundwork for the following layers while ensuring consistency throughout the entire data flow. To meet challenges such as inconsistency in data formats, duplication, delay and loss, corresponding strategies need to be considered.
Data format inconsistency
Research data often originates from multiple sources, each with their specific format, structure and encoding. This heterogeneity presents difficulties in integrating and harmonising the data (Harby and Zulkernine, 2022). To address this challenge (indicated as O1 in Table 4), a target data format for structured data, as well as file format and encoding standards for semi-structured and unstructured data should be established. In addition, it is essential to communicate the preferred data formats and structures to researchers as guidelines for data uploading to ensure synchronisation. During the data collection process, ETL tools can be used to convert different data formats into the predefined standard format (L’Esteve, 2022).
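To illustrate how format normalisation might work at this stage, the following minimal Python sketch converts heterogeneous CSV and JSON uploads into a single predefined schema and writes them as Parquet, a format commonly used for lakehouse storage. The column names, target schema and use of pandas with pyarrow are illustrative assumptions, not part of the proposed architecture.

```python
# Minimal sketch of format normalisation at the collection layer.
# Assumes pandas and pyarrow are installed; paths, column names and the
# target schema are hypothetical illustrations.
from pathlib import Path
import pandas as pd

TARGET_COLUMNS = ["dataset_id", "title", "created_at"]  # assumed standard schema

def normalise(source: Path) -> pd.DataFrame:
    """Read a heterogeneous source file and coerce it to the target schema."""
    if source.suffix == ".csv":
        df = pd.read_csv(source)
    elif source.suffix == ".json":
        df = pd.read_json(source)
    else:
        raise ValueError(f"Unsupported format: {source.suffix}")
    df = df.reindex(columns=TARGET_COLUMNS)                       # keep only agreed columns
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")  # coerce types
    return df

def ingest(source: Path, staging_dir: Path) -> Path:
    """Write the normalised data as Parquet for downstream lakehouse storage."""
    out = staging_dir / f"{source.stem}.parquet"
    normalise(source).to_parquet(out, index=False)
    return out
```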
Data duplication
Researchers may unintentionally contribute to data duplication by uploading updated versions of previously deposited data sets without proper version control in place. Also, manual data entry errors can lead to the same data set being uploaded multiple times or duplicate records being created (indicated as O2 in Table 4). Therefore, it is recommended to perform file-level data deduplication at the collection layer. For structured data, deduplication methods such as comparison and matching algorithms and data aggregation algorithms should be used (Kaur et al., 2018). Semi-structured data need to be converted into structured data for deduplication. For unstructured data, deduplication methods based on content similarity should be used, such as hash functions and similarity algorithms to compare and match data content, identifying analogous data for deduplication (H. Farid, 2021; Sadowski and Levin, 2007).
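As an illustration of file-level deduplication based on content hashing, the following Python sketch fingerprints each collected file with SHA-256 and skips files whose content has already been seen. The in-memory registry is a simplification; an actual deployment would persist fingerprints alongside the lakehouse metadata.

```python
# Minimal sketch of file-level deduplication using content hashes.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large data sets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

seen: dict[str, Path] = {}  # illustrative in-memory registry of fingerprints

def is_duplicate(path: Path) -> bool:
    """Return True if a byte-identical file has already been collected."""
    fingerprint = sha256_of(path)
    if fingerprint in seen:
        return True
    seen[fingerprint] = path
    return False
```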
Data delay and data loss
Large or complex data sets may require additional time for collection and transfer, resulting in potential delays or loss. The data consistency validation and deduplication process can introduce additional delays (indicated as O3 in Table 4). To mitigate these issues, administrators should prioritise data sets based on their size and complexity and collect them accordingly. A collection plan should be established that considers the researchers’ data upload schedule and frequency, avoiding collection during peak periods. Moreover, incorporating checks for timestamps, integrity and completeness of the collected data into the collection pipeline can help monitor data delays and loss. For structured data, this involves ensuring that each data batch adheres to database schemas and integrity constraints, while for semi-structured and unstructured data, checking that files are complete and uncorrupted becomes more relevant. These checks help monitor data delays and loss, thus maintaining the overall data quality and reliability within the data lakehouse.
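A minimal sketch of such collection checks is shown below, flagging batches with missing files or delays beyond an agreed threshold. The batch manifest fields and the six-hour threshold are hypothetical examples of what a collection pipeline might monitor.

```python
# Minimal sketch of batch-level checks for delay and loss at collection time.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Batch:
    expected_files: int
    received_files: int
    uploaded_at: datetime  # timezone-aware timestamp recorded by the upload platform

MAX_DELAY = timedelta(hours=6)  # assumed service-level threshold

def check_batch(batch: Batch) -> list[str]:
    """Return a list of issues so administrators can prioritise follow-up."""
    issues = []
    if batch.received_files < batch.expected_files:
        missing = batch.expected_files - batch.received_files
        issues.append(f"possible loss: {missing} file(s) missing from batch")
    if datetime.now(timezone.utc) - batch.uploaded_at > MAX_DELAY:
        issues.append("collection delay exceeds the agreed threshold")
    return issues
```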
Data storage layer
The data storage layer of the data lakehouse provides the data foundation, integrating data stored in data warehouses and data lakes and enabling interoperability between them through ETL tools. To optimise storage costs, historical research data initially stored in data warehouses can be migrated to data lakes. Simultaneously, unstructured and semi-structured research data, such as videos and images uploaded by researchers in fields like biomedical sciences, can be extracted, transformed and cleansed before being moved into data warehouses for statistical analysis, reporting and other data-driven services.
As research data continues to grow in volume, institutional repositories often face limitations in allocating adequate storage resources (indicated as O4 in Table 4). Proactive measures can be taken to address this challenge by implementing data compression methodologies (Ait Errami et al., 2023). Additionally, by categorising data according to their attributes such as usage patterns and access frequency, institutional repositories can optimise storage allocation (Ansari et al., 2019). When faced with insufficient storage capacity, there are two viable solutions: one is to increase the storage capacity of individual nodes through vertical scaling; and another is to increase the number of nodes through horizontal scaling (Mazumdar et al., 2019).
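The following sketch illustrates one way such attribute-based tiering could be expressed, assigning data sets to hot, warm or cold tiers based on access recency and download frequency. The thresholds and tier names are hypothetical policy choices rather than prescriptions of the architecture.

```python
# Minimal sketch of attribute-based storage tiering (hot vs. warm vs. cold).
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class StoredDataset:
    dataset_id: str
    last_accessed: datetime   # timezone-aware
    monthly_downloads: int

def assign_tier(ds: StoredDataset) -> str:
    """Frequently used data stays warehouse-backed; stale data moves to cheaper lake storage."""
    stale = datetime.now(timezone.utc) - ds.last_accessed > timedelta(days=180)
    if stale and ds.monthly_downloads == 0:
        return "cold"   # candidate for migration to low-cost data lake storage
    if ds.monthly_downloads > 50:
        return "hot"    # keep in warehouse-backed storage for fast queries
    return "warm"
```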
Data processing layer
Data processing serves as the bedrock of data services, underpinning the reliability and effectiveness of the entire data lakehouse architecture. In this section, we distil the five most essential steps from the DAMA methodology (DAMA International, 2017) and propose a sequence for their implementation, as illustrated in Figure 2 and described in detail below.
Metadata management
Metadata management plays a crucial role in research data management by providing valuable information about the data, enabling effective understanding and proper utilisation of the data assets. Metadata provides descriptive information about the research data, including their origin, content, structure and context (DAMA International, 2017). Metadata can also capture information about the data measurement methodologies and units, data quality and data transformations applied during the research process. This helps researchers understand the data accurately, facilitating reliable analysis. Managing metadata for structured data involves ensuring that schema definitions are maintained and accessible to support SQL-based querying and transaction consistency. This includes managing schema evolution to handle changes in data structure without disrupting existing processes (Amorim et al., 2017). For semi-structured and unstructured data, metadata management involves tagging data with relevant contextual information, such as file types, content descriptions and usage rights, which are critical for content discovery and retrieval in a schema-less environment (Sawadogo and Darmont, 2021). By standardising and harmonising metadata across different data sets, metadata management enables the integration of data sets from diverse sources within the data lakehouse, enabling researchers to gain holistic insights, retrieve potentially overlooked data and conduct comprehensive analyses, thereby addressing O6 as indicated in Table 4.
To incorporate metadata management into an institutional repository, key metadata elements from different data types should be extracted accordingly. This involves incorporating automated tools or workflows that extract metadata from diverse data sources and populate metadata repositories or catalogues. Furthermore, implementing metadata maintenance practices, such as conducting periodic reviews and updates, is essential to ensure that the metadata remains up to date.
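As an example of such automated extraction, the sketch below captures basic technical and administrative metadata for a deposited file. The field names are illustrative; a production workflow would map them to an established metadata schema such as Dublin Core or DataCite.

```python
# Minimal sketch of automated metadata extraction during ingestion.
import mimetypes
from datetime import datetime, timezone
from pathlib import Path

def extract_metadata(path: Path, depositor: str) -> dict:
    """Capture basic descriptive and technical metadata for a deposited file."""
    mime, _ = mimetypes.guess_type(path.name)
    return {
        "file_name": path.name,
        "media_type": mime or "application/octet-stream",
        "size_bytes": path.stat().st_size,
        "depositor": depositor,
        "harvested_at": datetime.now(timezone.utc).isoformat(),
    }
```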
Data modelling
Data modelling is the process of providing a structured and organised representation of the data. By clearly defining how data is structured and interconnected, data modelling aids in overcoming difficulties in finding or retrieving data (indicated as O6 in Table 4). Well-defined data models allow for the creation of more effective indexing and search strategies, ensuring that data can be located quickly and accurately. Data modelling involves identifying the different entities that make up the data, the relationships between those entities and the attributes of each entity, providing a conceptual and logical framework for data storage and retrieval (DAMA International, 2017). After the data model is established, it is important to test, maintain and iterate on the model to preserve its accuracy. For structured data, this typically involves using a relational model to define tables, enforce consistency constraints and apply mathematical models to optimise database performance. These methods focus on maintaining data accuracy and efficiency in storage and retrieval processes. For semi-structured and unstructured data, the model may focus more on leveraging formats like XML or using the Text Encoding Initiative guidelines, which provide schemas that describe data structures in a way that retains semantic information while also organising it for computational use. This approach often includes managing XML schemas and possibly using concepts like the One Document Does it all format for detailed documentation and schema generation (Flanders and Jannidis, 2015).
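To make the entity-relationship idea concrete, the following sketch expresses a simplified conceptual model of repository entities as Python dataclasses. In a relational implementation these would map to tables with foreign-key constraints; the entities and attributes shown are illustrative only.

```python
# Minimal sketch of a conceptual model for repository entities and relationships.
from dataclasses import dataclass, field

@dataclass
class Researcher:
    orcid: str
    name: str

@dataclass
class Dataset:
    dataset_id: str
    title: str
    discipline: str
    creators: list[Researcher] = field(default_factory=list)  # Dataset-Researcher relationship
    derived_from: list[str] = field(default_factory=list)     # links to parent dataset_ids
```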
There is a bidirectional dependency between data modelling and metadata management: data modelling needs the support of metadata records, while metadata management relies on the conceptual framework and structure established by data modelling. Moreover, data modelling and metadata management enable more efficient and precise data cleaning by providing a structured understanding of the data. Metadata management and data modelling play crucial roles in providing clear visibility into data lineage and the data catalogue. This is because the documenting information is captured during metadata management and the relationships and dependencies between different data entities are defined during data modelling.
Data cleaning
Once the metadata has been established, it is essential to undertake data cleaning. Data cleaning is important for maintaining data quality. It includes tasks such as deduplication, imputation of missing values, handling of outliers, data type verification, data standardisation and data normalisation. These measures serve to identify and rectify erroneous data, thereby ensuring accuracy, completeness and consistency for all data types (Rahm and Do, 2000). For structured data, cleaning processes are typically more straightforward due to the organised nature of the data, which allows for automated scripts and rule-based systems to effectively identify and correct errors. This includes leveraging integrity constraints to detect anomalies, applying functional dependencies to maintain data consistency and utilising conditional formatting for systematic data correction. For semi-structured and unstructured data, this might include using techniques such as natural language processing to understand content, using pattern recognition to identify inconsistencies and applying heuristic methods to make sense of the data, which often lacks a fixed schema or form (Chu et al., 2016).
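The sketch below illustrates rule-based cleaning for structured (tabular) research data: deduplication, type verification, imputation of missing values and a simple outlier rule. The column name and the three-sigma threshold are assumptions for illustration.

```python
# Minimal sketch of rule-based cleaning for structured (tabular) research data.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                                                   # deduplication
    df["measurement"] = pd.to_numeric(df["measurement"], errors="coerce")       # type verification
    df["measurement"] = df["measurement"].fillna(df["measurement"].median())    # impute missing values
    z = (df["measurement"] - df["measurement"].mean()) / df["measurement"].std()
    return df[z.abs() <= 3]                                                     # drop extreme outliers
```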
Data lineage
Data lineage delineates the path of data migration from their origin to their destination. From a data processing perspective, clear data lineage can capture and maintain changes in each attribute in the data model throughout the entire processing cycle. This ensures that users are using data derived from a primary source. From a data dissemination perspective, a particular data set may contribute to other derivative data sets over time. Data lineage can track the propagation history of research data sets, thus ensuring their accuracy and avoiding misuse (Bose, 2002). Data lineage can be established by identifying the source and target of the data, documenting their flow and transformation and analysing their dependencies and impacts. For structured data, this involves modelling, capturing SQL queries and mapping schema changes to allow for precise tracking of how data evolves and flows through different systems, which is crucial for data integrity, audit trails and compliance. The use of SQL query logs helps in understanding the transformations data undergo, and schema mapping assists in managing changes that might affect downstream data processing and reporting. For unstructured data, lineage tracking includes capturing document versioning, storage location and modification metadata, focusing on versioning and metadata to maintain a history of changes, thereby supporting data integrity and retrieval. Storage location data is also critical for ensuring that data is accessible and secure (Potočeková, 2023). Institutional repositories should use proper tools to preserve and maintain the data lineage.
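A minimal sketch of lineage capture is shown below: each transformation records its inputs, output and the operation applied, and upstream sources can be traced recursively. It assumes an acyclic lineage graph, and the identifiers and in-memory log are hypothetical simplifications.

```python
# Minimal sketch of lineage capture: record source/target pairs per transformation.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LineageEdge:
    source_ids: tuple[str, ...]
    target_id: str
    operation: str          # e.g. an ETL job name or a logged SQL statement
    recorded_at: datetime

lineage_log: list[LineageEdge] = []

def record(source_ids: tuple[str, ...], target_id: str, operation: str) -> None:
    lineage_log.append(LineageEdge(source_ids, target_id, operation,
                                   datetime.now(timezone.utc)))

def upstream_of(target_id: str) -> set[str]:
    """Trace a derived data set back to its primary sources (assumes no cycles)."""
    direct = {s for e in lineage_log if e.target_id == target_id for s in e.source_ids}
    return direct | {u for s in direct for u in upstream_of(s)}
```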
Data catalogue
A data catalogue provides a centralised and organised view of the data sets (Ehrlinger et al., 2021). It can enhance the discoverability and usability of the research data. Furthermore, in an institutional repository with diverse data sets, a data catalogue can help identify data sets that can be integrated for further analysis or cross-disciplinary research. This will afford institutional repository administrators a comprehensive overview of the data stored in the repository, thereby addressing O5 indicated in Table 4. Creating a data catalogue requires a thorough understanding of requirements, the identification of available data types and the complexity of data relationships. This process needs the integration of both top–down and bottom–up approaches. The top–down approach involves structuring the catalogue hierarchy according to academic or disciplinary themes through the utilisation of metadata, data modelling and data lineage techniques, which is vital for structured data that adheres to strict schema requirements. The bottom–up approach needs the modification and updating of catalogue fields in response to feedback garnered from practical applications, which is particularly relevant for semi-structured and unstructured data where flexible, content-driven categorisation is needed. Once a data catalogue has been created, regular maintenance and updates are necessary to record newly added or deleted data and update the metadata accordingly to ensure its completeness and accuracy (Ehrlinger et al., 2021).
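The following sketch illustrates a catalogue entry that combines the top-down (disciplinary hierarchy) and bottom-up (content-driven tags) views, together with a simple keyword search. Field names are illustrative and would normally be backed by the metadata and lineage services described above.

```python
# Minimal sketch of a catalogue entry and keyword search over it.
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    dataset_id: str
    title: str
    discipline: str                                   # top-down, disciplinary hierarchy
    tags: list[str] = field(default_factory=list)     # bottom-up, content-driven keywords

catalogue: list[CatalogueEntry] = []

def search(term: str) -> list[CatalogueEntry]:
    """Case-insensitive keyword match across title, discipline and tags."""
    term = term.lower()
    return [e for e in catalogue
            if term in e.title.lower()
            or term in e.discipline.lower()
            or any(term in t.lower() for t in e.tags)]
```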
Data management layer
This layer functions as a gatekeeper after the data processing layer and before the data services layer. In the context of a data lakehouse, data management is critical as it bridges structured data warehouse management practices with the scalability and flexibility of the data lake architectures. Data management within the data lakehouse architecture should include data curation and preservation, data quality, data security and data standards as four indispensable parts that are integral throughout the whole lifecycle.
Data curation and preservation
In the proposed architecture, data processing entails technical manipulation of data, while data curation predominantly centres on organising and maintaining research data from a business standpoint, with the goal of enhancing its usability and reusability. In fact, data curation and preservation play critical roles in enhancing the long-term sustainability and accessibility of data. Implementing effective data curation and preservation involves a series of practical steps to ensure that data remains reliable, accessible and usable over time (Dappert et al., 2014; Osswald and Strathmann, 2012):
Extraction of key information: implement automated tools and expert review processes to extract critical information from scientific texts, such as research articles. This information, often encompassing data sets, methodologies and findings, should be transformed into a structured electronic format that enhances searchability and usability.
Disciplinary categorisation: organise data according to specific academic disciplines. This involves creating a taxonomy that reflects the various fields of study represented in the repository. Such categorisation aids in better data retrieval, allowing researchers to find relevant data sets efficiently based on disciplinary tags.
Version control: implement a version control system to manage changes to data over time. This system should track revisions and updates, allowing users to access both current and previous versions of data. This is crucial for transparency in data modifications and aids in reproducing and verifying research results.
Integrity checks: integrity checks are vital for verifying the correctness and completeness of data. This involves regular audits and validations against corruption or unauthorised alterations. Institutional repositories often use cryptographic hashing and checksums to validate the integrity of stored data over time.
Data migration and transformation: in a data lakehouse, plan for periodic migration of data to newer storage formats and platforms to ensure compatibility with evolving data analytics tools and technologies. This includes converting data to newer formats, updating metadata and revalidating data integrity post-migration.
Preservation protocols: preservation protocols involve detailed documentation and the application of standard practices to ensure data longevity. This includes the use of metadata standards for describing data, applying persistent identifiers to data sets and establishing rules for data replication and backup strategies.
Data curation and preservation form the foundation of data quality within the data lakehouse architecture.
Data quality
Data quality in a data lakehouse not only refers to the traditional aspects of accuracy, completeness, consistency, integrity, reasonability, timeliness, uniqueness and validity of data (DAMA International, 2017) but also emphasises the integration and harmonisation of diverse data types coming from both structured warehouses and unstructured data lakes. This multidimensional data quality framework affects the overall value and usability of the data within a lakehouse environment. From a quality management perspective, institutional repositories should delineate data quality objectives and metrics predicated on these aspects. Subsequently, use advanced data quality assessment techniques designed for heterogeneous data environments to identify potential data quality issues, such as inconsistency, errors and anomalies. This is crucial to ensure that data, regardless of its origin, meets the strict standards required for comprehensive analytics and decision-making processes. For structured data, these techniques include validation of data consistency, accuracy checks against predefined rules or standards, and verification of data completeness. Common metrics used include data validity, entity resolution accuracy and completeness ratios to ensure data structures adhere strictly to defined schemas and business rules. For semi-structured and unstructured data, quality assessment focuses on relevance, interpretability and accuracy of the data content. Techniques involve analysing metadata for relevance to the task at hand, utilising natural language processing tools to evaluate the interpretability of text data and using statistical models to estimate data accuracy. This might include sentiment analysis, entity recognition accuracy and text coherence measures to evaluate how well data meets quality standards for specific uses (Kiefer, 2016; Zaveri et al., 2013). Thereafter, use the steps in the data processing layer to mitigate the issues. Ultimately, use real-time monitoring tools to track the status of data quality indicators. This continuous monitoring helps to promptly identify and address any emerging data quality issues as data volumes and types continue to expand (indicated as part of O7 in Table 4), ensuring sustained data integrity and reliability. With enhanced reusability, high-quality research data can foster greater engagement among researchers.
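As a simple illustration of such metrics for structured data, the sketch below computes a completeness ratio for a required column and a validity ratio against one business rule. The column names and the rule are assumptions; real assessments would cover the full set of dimensions listed above.

```python
# Minimal sketch of two data quality metrics for structured data.
import pandas as pd

def completeness_ratio(df: pd.DataFrame, column: str) -> float:
    """Share of non-missing values in a required column."""
    return float(df[column].notna().mean())

def validity_ratio(df: pd.DataFrame) -> float:
    """Share of rows satisfying a simple business rule (here: a plausible collection year)."""
    valid = df["collection_year"].between(1900, 2100)
    return float(valid.mean())
```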
Data security
In the context of a data lakehouse, where the confluence of diverse types of data amplifies the potential security vulnerabilities (indicated as part of O7 in Table 4), robust data security measures are crucial. Data security involves protecting data from unauthorised access, modification, disclosure or destruction that can lead to data leakage and damage, and impact the confidentiality, integrity and availability of the data (DAMA International, 2017). Institutional repositories can ensure data security by implementing data backup and recovery, data classification, access control and anomaly detection:
Data backup and recovery: timely backups can safeguard data integrity. Institutional repositories should devise regular backup strategies and incorporate data backup and recovery as cyclical tasks (Ashiq et al., 2020). To enhance data disaster recovery, institutional repositories should opt for off-site backups, such as retaining a duplicate of the data in cloud storage, which offers both security and flexibility. To improve backup efficiency and reduce backup duration and capacity, incremental backups should be used, whereby only modified data is backed up instead of the entire data sets. Regular backups for structured data should encompass the data, schema and metadata to ensure accurate reconstruction, crucial for maintaining transaction integrity. For semi-structured and unstructured data, focus on incremental backups to handle large volumes and diverse formats efficiently, reducing storage requirements and enhancing backup processes. In alignment with the US Office of Science and Technology Policy’s guidelines, the proposed architecture ensures robust data security measures to safeguard research data from unauthorised access and to maintain long-term data integrity [White House Office of Science and Technology Policy (OSTP), 2022].
Data classification: to enhance the controllability of data, it is imperative to classify data based on their sensitivity and the likelihood that they might be sought after for malicious purposes. Classifications are used to determine which roles have access to the data (DAMA International, 2017). Structured data can be classified at a granular level, such as individual database columns, allowing precise access control based on user roles and data sensitivity. Semi-structured and unstructured data classification, relying on content or metadata, may use machine learning for automated categorisation based on text patterns or image features, facilitating appropriate access controls.
Access control and anomaly detection: access control can ensure that only authorised users can access data in the lakehouse. Meanwhile, access behaviour needs to be tracked and audited. Institutional repositories need to monitor operations such as data access, backup and encryption to detect potential security issues.
Data standards
“Data standards are technical specifications that describe how data should be stored or exchanged for the consistent collection and interoperability of that data across different systems, sources, and users” (US. Office of Government Information Services, 2021). They define common formats, protocols and metadata schemas that enable data from different sources to be compatible and interoperable. By adhering to unified data standards, different data sets within an institutional repository can be organised and represented in a consistent and harmonised manner. This promotes interoperability, enhancing the data lakehouse’s ability to serve as a centralised platform for integrating, analysing and sharing data across various research disciplines and projects, thus supporting a broader range of scientific inquiries and computational workflows (addressing O8 indicated in Table 4).
The process of constructing data standards involves a meticulous approach tailored to the unique characteristics of the data. Beginning with a comprehensive review of existing standards, the development phase necessitates a deep understanding of the distinctive attributes of data originating from various sources (D. et al., 2016). Then, for structured and semi-structured data, it is important to define their format, size, precision and range attributes; moreover, naming and versioning rules should be formulated with consistency. For unstructured data, institutional repositories need to develop a content model that defines the structure of the unstructured data, including the types of content, attributes and relationships with the other data sets. A classification system for categorising unstructured data based on its metadata, content and usage is needed to formulate robust data content standards for unstructured data. After implementation, data standards must be consistently applied at every stage of data management. To ensure the effectiveness and adaptability of data standards, institutional repositories should regularly evaluate and refine them based on the evolving needs of researchers and the latest technological advancements in data management.
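One lightweight way to enforce such a standard on deposited records is schema validation; the sketch below uses the third-party jsonschema library against a simplified, hypothetical standard. The required fields, identifier pattern and permitted formats are illustrative only.

```python
# Minimal sketch of enforcing a data standard on deposited records with JSON Schema.
from jsonschema import validate, ValidationError  # third-party: pip install jsonschema

DATASET_STANDARD = {
    "type": "object",
    "required": ["dataset_id", "title", "format", "version"],
    "properties": {
        "dataset_id": {"type": "string", "pattern": "^DS-[0-9]{6}$"},
        "title": {"type": "string", "minLength": 3},
        "format": {"enum": ["csv", "json", "parquet", "tiff", "mp4"]},
        "version": {"type": "string", "pattern": "^v[0-9]+\\.[0-9]+$"},
    },
}

def conforms(record: dict) -> bool:
    """Return True if the record satisfies the (illustrative) repository data standard."""
    try:
        validate(instance=record, schema=DATASET_STANDARD)
        return True
    except ValidationError:
        return False
```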
Data services layer
Once research data has passed through the data processing and management layer, they are prepared to be made available with a clear catalogue, high quality and robust security measures. A well-organised data catalogue provides an overview of the available data resources, enhancing the efficiency of data services for institutional repository administrators, researchers and other stakeholders. Traditional institutional repositories offer basic data services, such as uploading, downloading and assigning DOIs, enabling the reuse of data sets in their original uploaded format. However, institutional repositories based on the data lakehouse architecture can provide more advanced data services, including sharing and analytics services. These services not only improve the accuracy and reusability of the data sets but also generate new insights through data mining of related data resources in the repository.
Sharing services
Sharing services refer to the various methods and processes that enable research data to transition from a closed and exclusive state to an open state. They mainly target stakeholders such as researchers and other data repositories. Institutional repositories should manage the sharing process to ensure rational data consumption, thus making the sharing service more sustainable. They can also monitor data usage frequency and prioritise maintenance and development accordingly. Sharing services should include data marts, data APIs, data exchange, data authorisation, etc. Data marts and data APIs serve as effective means to offer data sharing services to internal researchers, whereas data exchange primarily facilitates integration with other institutional repositories, thereby addressing O9 indicated in Table 4:
Data marts: a data mart is a subset of the data lakehouse that is specifically designed and optimised for a particular user group or domain (Harby and Zulkernine, 2022). It provides curated and pre-processed data sets tailored to specific research or analytical needs. A data mart facilitates easier data discovery and access by organising data in a meaningful manner. Furthermore, a data mart offers the advantage of isolating sensitive data from the rest, thus contributing to safeguarding against unauthorised access.
Data APIs: data APIs furnish a programmable and easy-to-integrate interface by abstracting and encapsulating the underlying data storage and processing logic. This enables various applications and services to easily access and transmit data, facilitating enhanced collaboration and value creation. Data APIs are also scalable and can expand their processing capabilities as the load increases to ensure prompt response to user requests and efficient utilisation of data resources (Armbrust et al., 2021). A minimal sketch of such an interface follows this list.
Data exchange: data exchange is the sharing of data between the institutional repository and other data repositories, creating new research opportunities (Ait Errami et al., 2023). Unlike data marts and data APIs, data exchange is more commonly used in scenarios that necessitate the sharing of large amounts of data and it typically involves the use of ETL tools.
Data authorisation: data authorisation grants researchers access to a portion or all of the data assets. It also ensures that only authorised users have access to the data marts, data APIs and data exchange.
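The sketch below, referenced in the data APIs item above, shows a minimal read-only interface over the catalogue using Flask. The route paths, the in-memory store and the absence of authorisation checks are simplifications for illustration; a production API would enforce the data authorisation rules described above.

```python
# Minimal sketch of a read-only data API over the catalogue, assuming Flask is installed.
from flask import Flask, jsonify, abort

app = Flask(__name__)

DATASETS = {  # stand-in for queries against the lakehouse catalogue
    "DS-000001": {"title": "Sensor readings 2023", "discipline": "Environmental science"},
}

@app.get("/datasets")
def list_datasets():
    """List identifiers of data sets available through the sharing service."""
    return jsonify(sorted(DATASETS.keys()))

@app.get("/datasets/<dataset_id>")
def get_dataset(dataset_id: str):
    """Return catalogue metadata for one data set, or 404 if it is unknown."""
    record = DATASETS.get(dataset_id)
    if record is None:
        abort(404)
    return jsonify(record)

if __name__ == "__main__":
    app.run(port=8080)
```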
Analytics services
Analytics services within the data lakehouse enable researchers to analyse research data and derive insights. These processes involve recognising the academic value of the data and uncovering hidden patterns, trends and correlations to draw meaningful conclusions. By treating all data in a data lakehouse holistically, analytics services aid researchers in discovering new and compelling content that may have remained undiscovered otherwise, fostering cross-disciplinary innovation. Analytics services can also include capabilities to track data usage and generate usage reports. This allows institutional repository administrators to monitor access patterns, usage statistics and user interactions with the shared data. Moreover, analytics services can provide institutional repository administrators with visualised overviews of the data assets, which allows them to gain a comprehensive understanding of the data holdings and make informed decisions regarding data management, curation and resource allocation (addressing O10 indicated in Table 4). Analytics services should include data analytics applications, data visualisation, business intelligence and machine learning:
Data analytics applications: data analytics applications are topic-specific applications that facilitate the analysis of data. Through these applications, institutional repository administrators can quickly grasp both the overall state of the data and its details, and thereby formulate more reasonable data management strategies and decisions.
Data visualisation: data visualisation tools and techniques are used to present data in a visual format, such as charts, graphs and interactive dashboards. Visualisations help researchers and other stakeholders to understand complex patterns and relationships within the data. Time series plots, scatter plots and bar charts can reveal temporal or relational aspects of the data. Interactive and dynamic visualisations can promote user engagement. Metrics, such as download counts, citation networks or usage statistics, can provide an overview of research impact for institutional repository administrators.
Business intelligence: business intelligence encompasses the integration and analysis of structured and semi-structured data from multiple sources within the data lakehouse to furnish academic insights and support decision-making (Al-Debei, 2011). By using exploratory data analysis, business intelligence mines information from data to fulfil the data requirements of various levels, disciplines and user groups.
Machine learning: machine learning is a powerful way of summarising various forms of unstructured data. This capability assists researchers in discovering related content and exploring the research landscape within specific areas of interest.
Implementation: use case in an institutional repository
We selected a university library as the practical application scenario to implement the designed architecture. The primary goal of this implementation is to facilitate seamless data integration, fostering enhanced circulation and sharing among researchers across diverse disciplinary domains. By doing so, it aims to provide robust data support for academic research and educational endeavours benefiting around 20,000 faculty and students. The current research data management infrastructure is inadequate for managing the scale and diversity of data formats and sources, which include structured, semi-structured and unstructured data. This heterogeneity results in format inconsistencies, data duplication and significant challenges in data access and retrieval. Additionally, administrators struggle to manage and curate these varied data sets due to the absence of unified data standards and an effective metadata management system, leading to a fragmented understanding of the repository’s content. The lack of a comprehensive data catalogue further hampers data retrieval, limiting cross-disciplinary research and making it difficult for researchers to uncover trends or patterns. Moreover, researchers often face challenges in identifying and accessing relevant data services, as awareness of the repository’s data holdings and service offerings is insufficient.
As summarised in Table 5, Google Cloud Platform (GCP) tools are seamlessly integrated across various layers to enhance data management across the lifecycle. In the data collection layer, Dataflow and Pub/Sub collaborate to process and normalise data streams efficiently, with Pub/Sub capturing real-time data for Dataflow to ensure its readiness for further processing. The data storage layer uses Cloud Storage to manage vast amounts of unstructured and semi-structured data securely, while BigQuery supports high-performance querying of structured data and BigLake offers a unified platform that integrates data lakes and warehouses for diverse data types. The data processing layer features Dataplex at the core, managing metadata, ensuring clear data lineage and maintaining a comprehensive data catalogue for effective governance. BigQuery ML and Vertex AI perform advanced data modelling, and Dataprep is used to clean data, ensuring high quality for analysis. In the data management layer of our lakehouse architecture, Dataplex oversees overall data management, coordinating various GCP tools to ensure a cohesive system. BigQuery manages schema standards crucial for structural consistency, while Dataflow and Vertex AI handle data curation, including accurate labelling of data sets. For maintaining data quality, BigQuery conducts routine scans. Cloud Storage ensures data preservation through its robust versioning system. Data security is enforced with identity and access management (IAM) and specialised sensitive data protection services, securing the architecture against unauthorised access and breaches. Finally, the data services layer uses Cloud Functions and BigQuery to enable robust sharing services that distribute data and insights across academic departments and research groups. Looker, Data Studio and BigQuery’s BI and ML functionalities, along with Vertex AI, provide potent tools for data visualisation, business intelligence and predictive analytics, empowering the academic community with valuable data-driven insights.
During the pilot phase, our primary focus was on the governance of historical research data. By centralising data storage and streamlining access through the data services layer, researchers can engage with historical data sets more effectively than before. Tools like BigQuery and Vertex AI have facilitated deeper data analysis, allowing scholars to uncover trends and patterns that were previously obscured in disparate systems. This integration not only accelerates the research process but also enhances the quality of academic output by providing richer, more comprehensive data sets. The platform has also empowered institutional repository administrators, providing them with advanced tools and an integrated overview of the university’s current research data landscape. This visibility enables them to manage and curate collections more effectively, ensuring that valuable data resources are readily accessible to researchers and fully used for academic purposes.
As we progressed beyond the pilot phase, our next focus was on optimising the management of new research data. We refined GCP configurations for data collection, storage, processing, management and services tailored to academic roles and activities. These configurations were designed to support data management professionals, repository administrators and researchers by ensuring seamless integration and effective utilisation of research data across disciplines. The setup included Pub/Sub for efficient data stream segregation, Dataflow for automated data preprocessing, Cloud Storage for unstructured data storage, BigQuery for structured data management, Dataplex for governance and metadata management and Vertex AI and BigQuery ML for advanced analytics and machine learning. Specific steps for implementation are detailed below, outlining how these components work together to ensure scalability, compliance and data integrity, as well as easy access and seamless data sharing across academic disciplines over time.
Data collection layer
Pub/Sub topics were organised by academic field and aligned with specific research projects and departments to efficiently segregate data streams according to the needs of various research groups. Topics like “biology-data-stream” and “engineering-data-stream” were created using the Cloud Console. Subscription settings were configured to manage flow rates and ensure timely message processing without data loss.
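Although the topics above were created through the Cloud Console, the same set-up could be scripted with the google-cloud-pubsub client; the project ID, topic names and acknowledgement deadline below are illustrative assumptions, not the institution's actual configuration.

```python
# Sketch: create discipline-specific topics and matching subscriptions.
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"  # placeholder project
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

for field in ("biology", "engineering"):
    topic_path = publisher.topic_path(PROJECT_ID, f"{field}-data-stream")
    publisher.create_topic(request={"name": topic_path})

    sub_path = subscriber.subscription_path(PROJECT_ID, f"{field}-ingest")
    subscriber.create_subscription(
        request={
            "name": sub_path,
            "topic": topic_path,
            "ack_deadline_seconds": 60,  # allow time for downstream processing
        }
    )
```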
Dataflow was customised to preprocess data upon ingestion via Pub/Sub. The Dataflow pipelines validated, normalised and enriched the data before passing it to the storage layer. Specific transformations, such as formatting text data according to academic standards (e.g. American Psychological Association, Modern Language Association), were applied to facilitate further analysis. A pipeline was implemented using the Python SDK to automate this preprocessing, including validation through regex matching to ensure that data formats met academic standards, normalisation by converting all text data to UTF-8 and enrichment by adding metadata such as timestamps and source identifiers to each data set.
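A condensed sketch of such a preprocessing pipeline, written with the Apache Beam Python SDK, is shown below; the validation pattern, field names, topic and destination table are illustrative stand-ins rather than the pipeline actually deployed.

```python
# Sketch: validate, normalise and enrich records arriving from Pub/Sub,
# then hand them to the storage layer (here, an existing BigQuery table).
import json
import re
from datetime import datetime, timezone

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

TITLE_RE = re.compile(r"^[\w\s\-.,:;()]+$")  # placeholder validation rule


def preprocess(message: bytes):
    record = json.loads(message.decode("utf-8", errors="replace"))  # normalise to UTF-8
    if not TITLE_RE.match(record.get("title", "")):
        return []  # drop records that fail validation
    record["ingested_at"] = datetime.now(timezone.utc).isoformat()  # enrichment
    record["source"] = "biology-data-stream"
    return [record]


def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/biology-data-stream")
            | "Preprocess" >> beam.FlatMap(preprocess)
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:research.biology_raw",  # assumed pre-existing table
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```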
Data storage layer
Cloud Storage buckets were used to store large volumes of unstructured data, including video recordings of experiments, images from digital archives and extensive research data sets. Data lifecycle management policies were implemented to transition older data sets to colder storage tiers based on access frequency.
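The lifecycle policies described here can be expressed with the google-cloud-storage client, roughly as in the following sketch; the bucket name and age thresholds are assumptions made for illustration.

```python
# Sketch: move objects to colder storage tiers as they age.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("research-archive")  # placeholder bucket

bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=180)   # ~6 months
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=365)   # ~1 year
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=1095)   # ~3 years
bucket.patch()  # apply the updated lifecycle configuration
```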
BigQuery was deployed to store structured research data that required frequent querying, such as experimental results, publication records and metadata. The tables were partitioned and clustered by research project identifiers and date stamps to optimise query performance and reduce costs.
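A minimal sketch of the partitioning and clustering set-up with the BigQuery Python client follows; the table name and schema fields are hypothetical examples of experimental-result data.

```python
# Sketch: create a table partitioned by date and clustered by project identifier.
from google.cloud import bigquery

client = bigquery.Client()
schema = [
    bigquery.SchemaField("project_id", "STRING"),
    bigquery.SchemaField("recorded_on", "DATE"),
    bigquery.SchemaField("result", "STRING"),
]
table = bigquery.Table("my-project.research.experimental_results", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="recorded_on")  # daily partitions
table.clustering_fields = ["project_id"]  # cluster rows within each partition
client.create_table(table)
```

Queries that filter on `recorded_on` and `project_id` then scan only the relevant partitions and clusters, which is the cost and performance benefit the text refers to.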
BigLake was integrated to enhance querying capabilities across both structured data in BigQuery and unstructured data in Cloud Storage. This integration enabled seamless access and complex queries across all data types, supporting interdisciplinary research and comprehensive data analysis.
Data processing layer
Dataplex managed and governed data across BigQuery and Google Cloud Storage, ensuring that it was catalogued, discoverable and securely accessible across departments. It automated metadata management and ensured compliance with academic research data privacy regulations. An asset management system was configured to catalogue data across both platforms, with automated metadata harvesting set up to maintain an updated view of data assets. Policy tags were applied to enforce access controls based on user roles.
BigQuery ML was used to create and maintain structured data models within the institutional repository, defining relationships between data sets, researchers and projects. SQL-based schema definitions structured data into relational tables, enforcing consistency with primary and foreign keys to optimise search queries and indexing. For semi-structured and unstructured data, schema definitions using formats like XML or text-based encoding organised the data semantically, ensuring efficient data retrieval and storage across diverse formats.
Vertex AI integrated machine learning and AI capabilities into the data workflow, enabling complex analyses such as image recognition in biological research and pattern detection in data sets. Data sets were imported from BigQuery and prepared through feature engineering, including normalisation and encoding, to transform data into usable features for machine learning models. Various models, including deep learning and regression, were trained with hyperparameter tuning to optimise performance. After evaluation, models were deployed via HTTP endpoints integrated into the institutional repository, enabling real-time predictions and classifications. Continuous monitoring ensured model effectiveness, with updates applied as needed. This implementation empowered researchers to leverage advanced machine learning for predictive analytics and deeper insights into their data.
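The training-to-deployment flow described above might look roughly as follows with the Vertex AI Python SDK; the dataset, target column and training budget are illustrative, and AutoML tabular training stands in here for whichever model family a given project actually uses.

```python
# Sketch: train a tabular model on data exported to BigQuery and deploy it
# behind an endpoint for real-time predictions.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

dataset = aiplatform.TabularDataset.create(
    display_name="citation-trends",
    bq_source="bq://my-project.research.publication_metrics",  # placeholder table
)

job = aiplatform.AutoMLTabularTrainingJob(
    display_name="citation-forecast",
    optimization_prediction_type="regression",
)
model = job.run(
    dataset=dataset,
    target_column="citations_next_year",  # placeholder target column
    budget_milli_node_hours=1000,
)

endpoint = model.deploy(machine_type="n1-standard-4")  # serves online predictions
```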
Dataprep was configured to clean and prepare data before further analysis. Through the Cloud Console, data cleaning tasks such as missing value imputation, outlier detection and format standardisation were automated. Data from both Cloud Storage and BigQuery was processed in Dataprep, ensuring high-quality, clean data sets were available for analytics and machine learning.
Data management layer
Dataflow was configured for complex data transformations needed for data curation, including deduplication and normalisation to ensure uniformity and accuracy across data sets. Batch processing jobs were set up using Dataflow, with the Python SDK used to script transformations that met data quality requirements, such as deduplication algorithms and data anonymisation processes where necessary.
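A batch deduplication pass of the kind described above could be sketched as follows with the Beam Python SDK; the file pattern and the choice of DOI as the deduplication key are assumptions for illustration.

```python
# Sketch: batch job that keeps one record per identifier before curation.
import json

import apache_beam as beam


def run():
    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://repo-staging/records-*.json")
            | "Parse" >> beam.Map(json.loads)
            | "KeyByDOI" >> beam.Map(lambda r: (r.get("doi"), r))  # assumed dedup key
            | "Group" >> beam.GroupByKey()
            | "KeepFirst" >> beam.Map(lambda kv: next(iter(kv[1])))  # one record per key
            | "Serialise" >> beam.Map(json.dumps)
            | "Write" >> beam.io.WriteToText(
                "gs://repo-curated/records", file_name_suffix=".json")
        )


if __name__ == "__main__":
    run()
```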
BigQuery managed both data quality and schema standards to ensure consistency and integrity in research data. Regular SQL-based integrity checks validated data accuracy, with automated alerts for any anomalies. Schema management defined data types, constraints and naming conventions, ensuring organised and efficient data storage and retrieval. As academic standards evolved, BigQuery’s schema evolution allowed seamless updates without disrupting queries. Automated schema validation ensured all new data adhered to these standards, aligning data management with evolving research needs.
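As a small sketch of such a routine integrity check with the BigQuery client, the snippet below flags rows that break a simple completeness rule; the table, required fields and alerting mechanism are placeholders.

```python
# Sketch: flag rows that violate a simple completeness rule.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT COUNT(*) AS bad_rows
    FROM `my-project.research.publications`
    WHERE title IS NULL OR doi IS NULL
"""
bad_rows = next(iter(client.query(query).result())).bad_rows
if bad_rows:
    # In practice this would raise an alert (e.g. email or a monitoring event).
    print(f"Data quality alert: {bad_rows} rows are missing required fields")
```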
Cloud Storage was used to implement robust data preservation protocols, including automatic archiving of data versions and backups, which were essential for preserving historical research data and ensuring compliance with grant requirements. Versioning was enabled using the gsutil command, which allowed for automatic tracking and storage of multiple versions of each object. Backups were configured by setting up regular snapshot schedules, ensuring that all data could be restored to a previous state in case of accidental deletion or corruption.
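Equivalent to the gsutil command mentioned above, versioning can also be toggled and inspected from Python; the bucket name and prefix below are illustrative.

```python
# Sketch: enable object versioning and list stored versions of existing objects.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("research-archive")  # placeholder bucket
bucket.versioning_enabled = True
bucket.patch()  # persist the change

# Each overwrite or delete now keeps the previous generation recoverable.
for blob in bucket.list_blobs(prefix="datasets/2023/", versions=True):
    print(blob.name, blob.generation, blob.time_created)
```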
Identity and Access Management (IAM) was implemented to control access to sensitive research data. Role-based access controls were set up to ensure that only authorised users, such as researchers, administrators and specific project leads, could access or modify data sets. Custom roles were also defined for different user groups, ensuring that permissions were tightly aligned with institutional policies. For sensitive data protection, encryption was enforced both at rest and in transit using Google Cloud’s built-in encryption mechanisms. Data in BigQuery and Google Cloud Storage was encrypted by default, and additional layers of encryption were applied for highly sensitive data sets. Access to sensitive data was further restricted using IAM policies, ensuring only users with specific permissions could view or manage such information.
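As a sketch, a role binding of the kind described here can be attached to a bucket with the storage client; the bucket name, group address and role are examples only, not the institution's actual policy.

```python
# Sketch: grant read-only access on a bucket to a specific researcher group.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("research-sensitive-data")  # placeholder bucket

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"group:biology-researchers@example.edu"},  # assumed group
    }
)
bucket.set_iam_policy(policy)
```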
Data services layer
Cloud Functions were customised to respond to data events, such as notifying researchers of data updates or triggering data processing tasks after new uploads. Python functions were written to automate workflows, including triggering data refreshes in BigQuery when new data was uploaded to Cloud Storage. These functions were deployed to respond to HTTP requests or Pub/Sub messages.
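A minimal sketch of such a function (first-generation, background-triggered on object finalisation) that loads newly uploaded newline-delimited JSON files into BigQuery is shown below; the destination table is hypothetical.

```python
# Sketch: Cloud Function triggered by google.storage.object.finalize events.
from google.cloud import bigquery


def on_upload(event, context):
    """Load a newly uploaded file from Cloud Storage into BigQuery."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        autodetect=True,  # infer the schema from the uploaded file
    )
    client.load_table_from_uri(
        uri, "my-project.research.new_uploads", job_config=job_config
    ).result()  # wait for the load job so failures surface in the function logs
```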
Looker Studio dashboards were customised for different academic roles, offering researchers relevant data visualisations and providing administrators with tools to monitor data usage and storage efficiency. Interactive features enabled users to drill down into data for deeper insights, promoting a data-driven research environment. Looker Studio was connected to BigQuery data sets, and interactive dashboards were designed for various user groups, with real-time data visualisations set up to provide actionable insights.
Vertex AI was integrated to provide advanced predictive analytics within the dashboards, allowing researchers to forecast data trends and identify patterns in their data sets. These insights were embedded into Looker Studio, enabling users to seamlessly access predictions and analyses alongside their existing research data visualisations.
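Predictions surfaced in the dashboards can be fetched from a deployed endpoint roughly as follows; the endpoint ID and instance fields are placeholders chosen for illustration.

```python
# Sketch: query a deployed Vertex AI endpoint for a forecast.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint("1234567890")  # placeholder endpoint ID

response = endpoint.predict(
    instances=[{"downloads_last_12m": 420, "citations_last_12m": 37}]  # assumed features
)
print(response.predictions)
```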
The outcome of the implementation is an enhanced data governance system that centralises research data, improves accessibility for researchers and enables deeper analysis through tools like BigQuery and Vertex AI. This leads to better data management, more effective data sharing across disciplines, and advanced data insights, allowing researchers to uncover patterns and trends with greater ease. As the system evolves, the implementation will reach its maturity phase when new data management processes are fully optimised, user adoption is widespread and all the data workflows are automated. Continuous updates to align with evolving academic standards and technologies will ensure the system remains efficient and scalable.
Conclusions
This paper introduced an end-to-end data lakehouse architecture for research data management in institutional repositories. The proposed architecture addresses the challenges of increasing data volume, heterogeneity and the multi-source nature of research data. To ensure practicality, the paper investigated potential implementation obstacles and offered corresponding solutions. This provides valuable guidance for institutions looking to implement the proposed architecture. During the pilot implementation phase at a university library, the current emphasis on governing historical research data has already demonstrated how this architecture enhances data accessibility for researchers and the overall data landscape for administrators.
The innovations of this study can be categorised into theoretical and practical aspects. Theoretically, this architecture pioneers the integration of data lakehouses into the data management framework, a concept not yet addressed by the current iterations of DAMA methodology. This novel approach enriches existing theoretical paradigms and provides a robust research framework for scholars. Practically, the proposed architecture stands as a seminal guideline for modernising data management practices across institutional repositories. It specifically addresses the critical needs for scalability and robust data handling capabilities, which are increasingly vital as data volumes and complexity continue to grow. The implementation of this architecture promises enhanced data accessibility, integrity and utility, thereby facilitating a more efficient and effective research process. By leveraging data lakehouses, institutional repositories can be modernised to provide a wider range of data services and support the evolving needs of researchers and administrators.
In conclusion, this research contributes a significant theoretical advancement to the field of research data management and offers a concrete, actionable framework that institutional repositories can adopt to enhance their data management capabilities.
Limitations and future research
Theoretical limitations of the paper arise from its exclusive emphasis on technical aspects, potentially overshadowing the exploration of critical social and organisational factors influencing data management success. Factors like user adoption, data sharing policies and incentives for data sharing are essential considerations that warrant more thorough examination. Moreover, the paper draws insights from interviews conducted with professionals in China. However, there is a need for caution regarding the generalisability of these findings to diverse cultural and institutional contexts, as the extent of their applicability beyond the specific studied context remains uncertain.
On the practical front, the paper would benefit from a more comprehensive exploration of the cost-benefit implications associated with implementing a data lakehouse architecture for research data management. This entails a nuanced analysis of hardware, software and staff costs, coupled with a thorough evaluation of the anticipated benefits in terms of improved data management efficiency and effectiveness.
Addressing these considerations in future studies would contribute to a more comprehensive understanding of effective strategies for research data management and preservation.
This research was funded by [Projects of the National Social Science Foundation of China] grant number [22BGL011]: Research on the Influence Mechanism and Implementation Path of Key Core Technology Breakthroughs under the New National System and [Xi’an Science and Technology Program Soft Science Project] grant number [24RKYJ0011]: Research on the Mechanisms and Pathways for Market-Oriented Allocation of Technological Elements in Xi’an to Achieve Breakthroughs in Key Core Technologies.
Figure 1.Process of the design science research methodology
Figure 2.End-to-end data lakehouse architecture
Figure 3.Conceptual research data management model of the institutional repositories
Figure A1.Weighted scores calculation for obstacles
Table 1.
Major differences between data warehouses, data lakes and data lakehouses
| Parameters | Data warehouses | Data lakes | Data lakehouses |
|---|---|---|---|
| Diversity of data storage | Mainly analytics-related data, structured data | Store unstructured, semi-structured or structured data | Store unstructured, semi-structured or structured data |
| Data analytics | Time-consuming to introduce new content | Helps with fast ingestion of new data | Helps with the fast ingestion of new data and allows for real-time data processing |
| Data quality | High | Low | High |
| Data security | Matured | Maturing | Maturing |
| Data storage costs | Expensive and time-consuming for large data volumes | Inexpensive, quick and adaptable | Inexpensive, quick and adaptable |
Source: Authors’ own work
Table 2.
Information of interviewees
| Interviewee | Role (years of IR experience) | No. of interviews | Duration |
|---|---|---|---|
| 1 | Data management professional (3–5) | 2 | 45 mins each |
| 2 | Institutional repository administrator (less than 2) | 1 | 30 mins |
| 3 | Data management professional (6–10) | 3 | 40 mins each |
| 4 | Researcher (11–20) | 1 | 50 mins |
| 5 | Data management specialist (6–10) | 2 | 35 mins each |
| 6 | Institutional repository administrator (less than 2) | 1 | 25 mins |
| 7 | Data governance analyst (11–20) | 2 | 50 mins each |
| 8 | Research data librarian (3–5) | 1 | 30 mins |
| 9 | Data management professional (6–10) | 2 | 40 mins each |
| 10 | Repository administrator (3–5) | 1 | 35 mins |
| 11 | Data management specialist (more than 20) | 3 | 45 mins each |
| 12 | Researcher (3–5) | 1 | 40 mins |
| 13 | Data governance analyst (6–10) | 2 | 55 mins each |
| 14 | Research data librarian (less than 2) | 1 | 30 mins |
| 15 | Data management professional (6–10) | 2 | 40 mins each |
Note: IR = Institutional repository
Source: Authors’ own work
Table 3.
Research data management obstacles and weighted scores from interviews
| Category | Obstacles | w-score |
|---|---|---|
| Data collection | O1: Data format inconsistency | 8.36 |
| | O2: File-level data duplication | 8.52 |
| | O3: Delay and loss | 8.91 |
| Data storage | O4: Insufficient storage capacity | 7.44 |
| Data processing | O5: Data admins lack an overview of the data stored in the repository (unclear data catalogue) | 7.70 |
| | O6: Difficulty in finding or retrieving the data (lack of metadata management and data modelling) | 7.73 |
| Data management | O7: High requirement for data quality and security as data types and volumes expand | 8.30 |
| | O8: Lack of a unified standard for research data of different types | 8.10 |
| Data services | O9: Difficulty in identifying the appropriate service offerings for different stakeholders | 8.05 |
| | O10: Researchers and other data repositories have limited knowledge of the data in the institutional repository | 7.81 |
Source: Authors’ own work
Table 4.
Mapping between proposed architecture and research data management obstacles
| Obstacles | Proposed architecture components |
|---|---|
| O1: Data format inconsistency | Data collection layer: Data format inconsistency |
| O2: File-level data duplication | Data collection layer: Data duplication |
| O3: Delay and loss | Data collection layer: Data delay and data loss |
| O4: Insufficient storage capacity | Data storage layer |
| O5: Data admins lack an overview of the data stored in the repository (unclear data catalogue) | Data processing layer: Data catalogue |
| O6: Difficulty in finding or retrieving the data (lack of metadata management and data modelling) | Data processing layer: Metadata management and data modelling |
| O7: High requirement for data quality and security as data types and volumes expand | Data management layer: Data curation and preservation, data quality and data security |
| O8: Lack of a unified standard for research data of different types | Data management layer: Data standards |
| O9: Difficulty in identifying the appropriate service offerings for different stakeholders | Data services layer: Sharing services |
| O10: Researchers and other data repositories have limited knowledge of the data in the institutional repository | Data services layer: Analytics services |
Source: Authors’ own work
Table 5.
Summary of GCP tool utilisation
| Lakehouse architecture | Google Cloud tools | Description |
|---|---|---|
| Data collection layer | Dataflow | Processes and normalises data streams |
| | Pub/Sub | Message queue feeding Dataflow |
| Data storage layer | BigLake | Data foundation that integrates data stored in the data warehouses and data lakes |
| | Cloud Storage | Unstructured and semi-structured data storage |
| | BigQuery | Structured data storage |
| Data processing layer | Dataplex | Metadata management, data lineage and data catalogue |
| | BigQuery ML, Vertex AI | Data modelling |
| | Dataprep | Data cleaning |
| Data management layer | Dataplex | Overall data management |
| | BigQuery (schema management) | Data standards |
| | Dataflow, Vertex AI (labelling) | Data curation |
| | BigQuery (scan) | Data quality |
| | Cloud Storage (versioning) | Data preservation |
| | IAM, sensitive data protection | Data security |
| Data services layer | Cloud Functions, BigQuery | Sharing services |
| | Looker Studio, BigQuery (BI, ML), Vertex AI | Analytics services |
Source: Authors’ own work
Table A1.
Interviewee feedback on the proposed data lakehouse architecture
| Interviewee ID | Layer addressed | Feedback on architecture | Interview quote |
|---|---|---|---|
| Interviewee 1 | Data collection | Praises the adaptability to diverse data types | “Love how it handles all sorts of data types—it just gets it” |
| Interviewee 10 | Data collection | Comments on the need for more robust data validation | “I'd like to see stronger validation right at the point of entry. Let's catch issues before they're in the system” |
| Interviewee 2 | Data processing | Appreciates the integration of advanced analytics | “The analytics setup is spot-on, really pulls the insights we need” |
| Interviewee 13 | Data management | Satisfied with the data security measures | “The security measures are thorough, really makes you feel your data’s safe” |
| Interviewee 8 | Data services | Suggests improvements in user access controls | “We need more granular control over who can see what. It’s a bit too open for my taste” |
| Interviewee 6 | Data services | Likes the flexibility of service offerings | “It’s really versatile with the services, gives us what we need, when we need it” |
| Interviewee 7 | Data services | Happy with collaborative tools integration | “Integrating tools for collaboration was a smart move, makes it so much easier to work together” |
| Interviewee 4 | Overall | Concerns about long-term scalability | “It works now, but will it handle what we're planning for in five years?” |
Source: Authors’ own work
Table A2.
Interviewee technical suggestions for implementing data lakehouse architecture
| Interviewee ID | Layer addressed | Technical suggestion | Interview quote |
|---|---|---|---|
| Interviewee 7 | Data collection | Use Pub/Sub for real-time data collection | “Pub/Sub would be perfect for capturing real-time data, making our data collection more dynamic and responsive” |
| Interviewee 9 | Data processing | Use Dataplex for metadata management | “Dataplex could manage our metadata more effectively, ensuring data is organized and accessible across departments” |
| Interviewee 11 | Data processing | Adopt Vertex AI for ML tasks | “Vertex AI can automate many of our machine learning workflows, making our data analysis more efficient and effective” |
| Interviewee 4 | Data processing | Use BigQuery ML for real-time analytics | “BigQuery ML can transform how we process large datasets, enabling real-time analytics that drive faster research insights” |
| Interviewee 5 | Data management | Apply IAM for data security | “Implementing IAM would give us robust control over data access, enhancing the security across our data layers” |
| Interviewee 6 | Data management | Integrate Dataprep for data cleaning | “Dataprep will help us clean and prepare data automatically, ensuring high quality and consistency for analysis” |
| Interviewee 2 | Data services | Leverage BigQuery for advanced data analytics | “BigQuery allows us to perform sophisticated analytics directly on our stored data, speeding up our research outputs” |
| Interviewee 1 | Data services | Implement Looker Studio for data visualisation | “Using Looker Studio to visualize our data helps make complex analyses more accessible and understandable for all researchers” |
Source: Authors’ own work
© Emerald Publishing Limited.
