Background
For clinical care and research, knowledge graphs holding patient data can be enriched by extracting parameters from the graph and using them as inputs to pure functions that compute new patient features. Systematic and transparent methods for enriching knowledge graphs with newly computed patient features are of interest. When enriching the patient data in knowledge graphs this way, existing ontologies and well-known data resource standards can help promote semantic interoperability.
Results
We developed and tested a new data processing pipeline for extracting, computing, and returning newly computed results to a large knowledge graph populated with electronic health record and patient survey data. We show that RDF data resource types already specified by Health Level 7's FHIR RDF effort can be programmatically validated and then used by this new data processing pipeline to represent newly derived patient-level features.
Conclusions
Knowledge graph technology can be augmented with standards-based semantic data processing pipelines for deploying and tracing the use of pure functions to derive new patient-level features from existing data. Semantic data processing pipelines enable research enterprises to report on new patient-level computations of interest with linked metadata that details the origin and background of every new computation.
Background
In the domains of biomedicine and health, an increasing number of machine-executable pure functions are used for making computations. A pure function is a stateless deterministic mapping that always returns the same computed value (or result) when given identical inputs. A simple example is the Body Mass Index (BMI) function, which has these two common forms:
$$BMI = \frac{weight\ (kg)}{height\ (m)^{2}}$$
$$BMI = 703 \times \frac{weight\ (lbs)}{height\ (inches)^{2}}$$
The inputs needed to compute with a pure function, like weight and height for the BMI function, are called "arguments" or "parameters" [1]. What makes pure functions "pure" is that, when implemented in software code, they exclusively map inputs to outputs and have no other software effects [1].
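For illustration, a minimal Python sketch of the BMI function as a pure function might look like the following (the function and parameter names are ours and are not drawn from any cited implementation):

def bmi_metric(weight_kg: float, height_m: float) -> float:
    """Compute Body Mass Index from weight in kilograms and height in meters."""
    return weight_kg / (height_m ** 2)

def bmi_imperial(weight_lbs: float, height_in: float) -> float:
    """Compute Body Mass Index from weight in pounds and height in inches."""
    return 703 * weight_lbs / (height_in ** 2)

# A pure function always returns the same result for the same inputs and has no other effects.
assert bmi_metric(70, 1.75) == bmi_metric(70, 1.75)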
Examples of other pure functions include functions for computing numerical scores from questionnaire data, as with the Generalized Anxiety Disorder 7-item Scale [2], mathematical regression functions for estimating risks [3] and some functions arising from machine learning [4]. This paper explores how to leverage semantic web technologies to track and organize the use of biomedical pure functions to enrich existing patient data knowledge graphs.
In biomedicine, the use of machine learning methods to generate mathematical-statistical functions is increasing, as is the creation of functions composed of production rules with conditional "IF–THEN" logic. As a consequence, the overall number of biomedical pure functions is growing rapidly [5]. Many of these biomedical pure functions are used to compute patient-level properties (also called patient features) from existing patient data.
Generally speaking, data enrichment is adding information to an existing dataset. The focus of this paper is specifically on data enrichment that arises from the use of parameters available in a knowledge graph to produce more and different data via new computations. In this case, enriching a knowledge graph comes about by first extracting input parameters to pure functions from a graph, then computing new features using pure functions outside of the graph environment, then returning those newly computed features as additional semantic data back to the knowledge graph from which the input parameters came.
Datasets originating from Electronic Health Record (EHR) databases or patient data registries can be enriched by adding newly computed patient-level features [6, 7]. Researchers typically use these added patient-level features to stratify patient records, explore hypothetical cause and effect relationships, and answer a wide range of biomedical research questions. To better support this kind of patient dataset enrichment, this study explores how to take advantage of linked data, knowledge graphs, and existing Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR) data resource types to represent some of the semantics of pure functions and the patient-level computations they enable [8, 9].
Regarding linked data, the Resource Description Framework (RDF) specifies a format where any domain of interest can be represented using a pattern of triples, where each triple relates a subject resource to an object resource via a predicate relationship [10, 11]. The linked data in any RDF dataset can be serialized in a variety of formats and visualized as a directed knowledge graph [12]. RDF knowledge graphs can be queried using SPARQL [13]. These knowledge graphs can also be used to compute novel inferences using various logical reasoners [14].
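As a minimal illustration of these building blocks (the URIs and property names below are invented for this example), a single triple can be created and then queried with SPARQL using the Python rdflib library:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")  # hypothetical namespace for this example
g = Graph()

# One triple: a subject (a patient), a predicate (a property), and an object (a value).
g.add((EX.patient123, EX.hasBodyWeightKg, Literal(70.0, datatype=XSD.decimal)))

# The same graph can then be queried with SPARQL.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?patient ?weight WHERE { ?patient ex:hasBodyWeightKg ?weight . }
""")
for patient, weight in results:
    print(patient, weight)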
Some biomedical research and healthcare provider organizations have begun converting their patient-level electronic health record (EHR) data from original proprietary data formats into RDF triples for knowledge graphs. As an example of this, the Centre for Addiction and Mental Health (CAMH) in Toronto, Ontario, Canada routinely converts its EHR data into RDF triples and then adds these triples to a patient-oriented knowledge graph built using the open source Blue Brain Nexus platform [15]. CAMH is Canada's largest mental health teaching hospital. Within CAMH, computational models are used to understand and help treat mental illnesses. To support this research and accelerate discovery using knowledge graph technology, CAMH integrates vast amounts of semantic data from genetics, brain imaging, and EHR records [16]. The EHR data in the CAMH knowledge graph currently represents patient data using a variety of HL7 FHIR RDF data resource types. Additional data models used in the CAMH graph include the Neuroimaging Data Model (NIDM), the Provenance Ontology (PROV-O), and various schema from schema.org [16]. For this study, CAMH wished to build out a computational pipeline for enriching these patient-level data by computing with pure functions.
Several challenges must be met when enriching patient data in knowledge graphs by computing with pure functions. A critical challenge is to safeguard patient data privacy and security. One approach is to keep patient data within the local IT security zone of the organization that collects them rather than shuttle them to and from computing systems outside of the organization [17]. To keep computation local, instances of biomedical pure functions must be deployed inside the same technical environments where patient data reside. This study put technology for computing with pure functions inside CAMH's technical environment alongside their existing patient-oriented knowledge graph.
In addition to upholding data security, a second challenge is for creators of pure functions to be able to manage the growing number of them used by their organizations. As others note, effective management of pure functions requires metadata sufficient to make pure functions findable, accessible, interoperable and reusable (the FAIR principles) [18, 19]. To help with this, the Knowledge Systems Laboratory (KSL) at the University of Michigan (U-M) has previously created the Knowledge Object Implementation Ontology (KOIO) [20]. We have demonstrated how KOIO can assist developers to produce Knowledge Objects containing software implementations of pure functions with corresponding tests and documentation [21]. KSL currently works with researchers in the global Mobilizing Computable Biomedical Knowledge community (MCBK) on realizing the FAIR principles for pure functions [22,23,24]. A relevant example of this work is a Knowledge Object containing an implementation of the BMI function with metadata [25]. Its metadata, serialized as Turtle RDF triples in Fig. 1 below on the left with a corresponding RDF graph view of the key triples on the right, describe and depict an implementation and software tests for the BMI function. A software implementation of the BMI function is located in a JavaScript file called bmi.js.
[Figure 1 omitted: see PDF]
A third challenge to enriching knowledge graph data is to represent pure functions, newly computed patient features, and their provenance in a common way. Ideally, all new RDF triples added to the graph would conform to one or more well-known RDF schemas. For this, we ultimately selected the emerging RDF standard coming from the HL7 FHIR community [26]. Unlike KOIO which we developed but is not well-known, HL7 FHIR RDF provided this project with community-endorsed standard data resource types for representing pure functions and new computations produced by using them [27].
Leveraging our prior work with knowledge graphs and linked metadata about pure functions, we stood up a local data processing pipeline for Extracting, Transforming, and Loading (ETL) standardized HL7 FHIR RDF data resources representing pure functions, new computations generated from using pure functions, and the provenance of these computations. We report the results of a technical feasibility study along these lines.
Research questions
In this study, we sought to do the following:
1. Determine the essential information required to effectively trace and document semantic data enrichment processes involving pure functions.
2. Determine the primary components of ETL pipelines for semantic data enrichment processes involving pure functions.
3. Demonstrate what is required to adopt and conform to existing HL7 FHIR RDF Library, Observation, and Provenance data resource standards in the context of semantic data enrichment processes involving pure functions.
Methods
Project initiation and management
This project was initiated by team members at CAMH and KSL. The research team includes experienced health data and computer scientist-engineers plus library and information scientists familiar with the health and healthcare domains. Over more than three years, with some intermittency due to the COVID-19 global pandemic, the research team met many times via online video conference to discuss and work on the design, development, and trial demonstration of the planned ETL pipeline for enriching CAMH's patient data knowledge graph. CAMH team members took the lead on software development for the ETL pipeline. KSL team members took the lead on metadata development and HL7 FHIR RDF Library, Observation, and Provenance data resource conformance testing. Project documentation was created and stored either in the Google Docs platform at U-M or in an instance of Confluence made available to the project by CAMH.
Technical methods overview
This study utilized a number of technical methods. First, we leveraged existing knowledge graph technology. Second, we did formal data modeling to determine the information required to trace and document semantic data enrichment. Third, we adopted and validated relevant HL7 FHIR RDF data resource types. Fourth, we developed and packaged an example pure function with a corresponding API mechanism and metadata about the function. Fifth, we developed and tested an ETL pipeline for semantic data enrichment by leveraging the prior four efforts.
Knowledge graph availability via CAMH
Throughout the project, CAMH provided access to development and production instances of its Knowledge Graph loaded with patient data. CAMH's Knowledge Graph integrates multimodal data, including Electronic Health Record data, patient questionnaire responses from a local instance of REDCap, interpretations of neuroimaging results, laboratory observations, plus sleep, fitness, and other biometric data. All of the patient data in the CAMH Knowledge Graph are represented using HL7 FHIR resource types except for the neuroimaging results data, which are represented in the graph using the Neuroimaging Data Model ontology (NIDM). The CAMH Knowledge Graph takes a significant step towards providing a self-serve data platform for mental health researchers [16].
Preliminary work to outline the information space of interest
We began by examining information needed to trace pure functions and their use to enrich patient data in knowledge graphs. This work included several rounds of data resource modeling.
We started by outlining the information space of interest and developing our own "homegrown" RDF data resource models based on earlier releases of KOIO (1.0 and 2.0) and on our previous work to identify 13 categories of metadata relevant for describing pure functions [22].
At the outset, to guide and document our iterative data resource modeling efforts, we collaboratively developed a set of competency questions. Our intent was that the data resources for tracing and documenting pure functions and their use would contain answers for each competency question. We borrowed the method of using competency questions from the field of ontology development, where Competency Question-driven Ontology Authoring has been previously described [28]. As we proceeded in this work, our data resource modeling drew on concepts and relationships from two other relevant ontologies: the Function Ontology (FnO) [29] and the Provenance Ontology (PROV-O) [30].
As a proof of concept using our own homegrown data resource models, we manually created test instances of RDF data resources serialized in JSON-LD. These test data resources were loaded into a Knowledge Graph development environment. Once loaded, we used SPARQL queries to produce answers to the competency questions we developed.
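A minimal sketch of this kind of proof-of-concept step is shown below, using rdflib (version 6 or later, which parses JSON-LD natively); the vocabulary terms and identifiers are invented for the example and are not taken from our homegrown data resource models:

from rdflib import Graph

# A hand-written JSON-LD test instance describing one computed patient feature
# (all terms and identifiers here are illustrative only).
test_resource = """
{
  "@context": {"ex": "http://example.org/"},
  "@id": "ex:computation-001",
  "ex:producedBy": {"@id": "ex:bmi-function-v1"},
  "ex:forPatient": {"@id": "ex:patient123"},
  "ex:hasValue": 22.9
}
"""

g = Graph()
g.parse(data=test_resource, format="json-ld")

# A competency question such as "Which function produced this computed value?"
# becomes a SPARQL query over the loaded test instance.
answers = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?function WHERE { ex:computation-001 ex:producedBy ?function . }
""")
print([str(row.function) for row in answers])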
Adoption of HL7 FHIR RDF data resource types
After outlining the information space of interest, we learned that the HL7 and W3C Semantic Web Health Care and Life Sciences communities had, as part of the 5th Release of FHIR, developed common, openly available semantic RDF data models for each FHIR data resource type [31]. We used our competency questions to determine whether the combined content of the HL7 FHIR RDF Library, Provenance, and Observation data resource types was sufficient for our purposes [31]. Finding that the content of these three HL7 FHIR RDF data resource types was sufficient, we set aside our homegrown data resource models. Adopting HL7 FHIR RDF data resource types provided an opportunity to demonstrate that RDF generated by a semantic patient data enrichment ETL pipeline can conform to an openly available data resource standard for describing pure functions and computations resulting from using them.
HL7 FHIR RDF data resource validation using ShEx.js
Adopting HL7 FHIR RDF allowed us to use Shape Expressions (ShEx) to programmatically validate instances of FHIR RDF data resources [32, 33]. This validation focuses on the structure that HL7 resources must have. With technical support from the HL7 FHIR RDF community, we stood up a virtual server, loaded it with the ShEx.js tool [33], and performed validation tests on the FHIR RDF Observation and Provenance data resource types produced by the ETL pipeline. (The FHIR RDF Library data resource type is not produced by the ETL pipeline and was not validated.) Validation using ShEx.js initially indicated the presence of several errors. After correcting these errors, additional validation tests were done until conformance with the HL7 FHIR RDF standard was achieved. The ETL pipeline software was then developed to produce conformant HL7 FHIR RDF Observation and Provenance resources. More information and relevant files can be found at: https://github.com/kgrid/fhir-rdf-validation.
Example pure function used
As an example drawn from the domain of mental health for this study, KSL team members implemented a pure function in Python for computing a total Patient Health Questionnaire-9 (PHQ-9) score and its standard interpretation from the answers individual patients provide to the questionnaire [34]. PHQ-9 scores are a common measure used to screen for the severity of depressive symptoms [35].
The PHQ-9 pure function exists outside of the CAMH knowledge graph, where it can also be accessed by other applications through a corresponding RESTful API service. The pure function accepts the individual numeric results from the nine items comprising the PHQ-9 as its input parameters. It simply computes the total PHQ-9 score and provides an interpretation of that total score. A file with technical metadata about the origin and characteristics of the PHQ-9 pure function was also developed. These metadata were later reformatted to conform to the HL7 FHIR RDF Library data resource type specification for describing software libraries, but they were not validated using ShEx.js. At CAMH, a software script was developed to load the HL7 FHIR RDF Library metadata from the Knowledge Object into the CAMH Knowledge Graph. This script exists outside of the ETL pipeline because it only needs to run once to load and record metadata about each pure function. It can be run on an ad-hoc basis whenever a new pure function is to be used for semantic data enrichment at CAMH.
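A minimal Python sketch conveying the general shape of such a scoring function appears below; it is our simplified illustration rather than the deployed implementation, and its severity bands follow the standard PHQ-9 interpretation [35]:

from typing import Dict, List

def score_phq9(item_scores: List[int]) -> Dict[str, object]:
    """Pure function: compute the total PHQ-9 score and its standard interpretation.

    item_scores holds the nine item responses, each an integer from 0 to 3.
    """
    if len(item_scores) != 9 or any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("PHQ-9 requires nine item scores, each between 0 and 3")
    total = sum(item_scores)
    if total <= 4:
        interpretation = "Minimal or no depression"
    elif total <= 9:
        interpretation = "Mild depression"
    elif total <= 14:
        interpretation = "Moderate depression"
    elif total <= 19:
        interpretation = "Moderately severe depression"
    else:
        interpretation = "Severe depression"
    return {"total_score": total, "interpretation": interpretation}

# Identical inputs always yield the identical result.
print(score_phq9([1, 2, 1, 2, 1, 2, 1, 2, 2]))  # total 14, "Moderate depression"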
Development of the ETL pipeline using Apache NiFi and Python code
To establish the ETL pipeline and complete the technical work for this project, several existing technologies were used. When building the ETL pipeline, Apache NiFi [36] was leveraged for its ability to automate the flow of data between existing software systems. CAMH already used Apache NiFi to routinely check REDCap for new PHQ-9 responses. At CAMH, whenever new PHQ-9 responses are detected, Apache NiFi inserts them into the CAMH knowledge graph as a PHQ-9 response data object. We used Apache NiFi as a tool for implementing the ETL pipeline.
In our case, the CAMH Knowledge Graph emits Server-Sent Events (SSEs) when data is inserted, updated, or deleted. In our ETL pipeline implementation, Apache NiFi was configured to monitor these SSEs specifically for PHQ-9 responses being added to the graph. Upon detecting such an event, Apache NiFi first retrieves new PHQ-9 responses for a patient from the graph and then transmits these responses via an API request to the pure function for computation.
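A minimal sketch of this extract-and-compute step is shown below; the endpoint URLs, vocabulary, and identifiers are hypothetical, and in CAMH's actual pipeline this logic runs inside Apache NiFi against FHIR RDF resources rather than the simplified terms used here:

import requests

SPARQL_ENDPOINT = "http://nexus.local/sparql"         # hypothetical graph query endpoint
PHQ9_FUNCTION_API = "http://ko-api.local/phq9/score"  # hypothetical pure function API

# Extract: fetch the nine item answers for one newly inserted PHQ-9 response.
query = """
    PREFIX ex: <http://example.org/>
    SELECT ?answer WHERE {
        ex:phq9-response-001 ex:hasItem ?item .
        ?item ex:itemNumber ?n ; ex:answerValue ?answer .
    } ORDER BY ?n
"""
rows = requests.post(
    SPARQL_ENDPOINT,
    data=query,
    headers={"Content-Type": "application/sparql-query",
             "Accept": "application/sparql-results+json"},
).json()
item_scores = [int(b["answer"]["value"]) for b in rows["results"]["bindings"]]

# Compute: send the item scores to the pure function's RESTful API.
result = requests.post(PHQ9_FUNCTION_API, json={"item_scores": item_scores}).json()
print(result)  # e.g. {"total_score": 14, "interpretation": "Moderate depression"}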
Each time new PHQ-9 scoring and interpretation computations are made, the ETL pipeline then produces a single new conformant HL7 FHIR RDF Observation resource representing the new computations and a corresponding single new conformant HL7 FHIR RDF Provenance resource with semantic information describing how each computation came about and linking to the specifics of the pure function used to compute it.
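A deliberately simplified rdflib sketch of this transform step is shown below; the flattened property names are stand-ins for this illustration and do not reproduce the nested structure of the conformant FHIR R5 RDF resources that the pipeline actually emits:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

FHIR = Namespace("http://hl7.org/fhir/")
EX = Namespace("http://example.org/")  # hypothetical identifiers for this example

g = Graph()
obs = EX["observation-phq9-total-001"]
prov = EX["provenance-phq9-total-001"]

# New Observation: the computed PHQ-9 total score for one patient.
g.add((obs, RDF.type, FHIR.Observation))
g.add((obs, EX.subject, EX.patient123))
g.add((obs, EX.value, Literal(14, datatype=XSD.integer)))

# Matching Provenance: links the Observation to the pure function (Library)
# and to the PHQ-9 responses it was computed from.
g.add((prov, RDF.type, FHIR.Provenance))
g.add((prov, EX.target, obs))
g.add((prov, EX.usedLibrary, EX["phq9-function-v1"]))
g.add((prov, EX.usedEntity, EX["phq9-response-001"]))

# Load: serialize the new triples and post them back to the knowledge graph.
print(g.serialize(format="turtle"))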
Testing the data enrichment ETL pipeline
Manual tests were performed at each step in the pipeline to confirm that the pipeline functioned properly. Because Blue Brain Nexus will accept essentially any RDF, SPARQL queries were performed to confirm that the data had been inserted correctly with the expected RDF data structures. Blue Brain Nexus supports conformant SPARQL queries and provides its users with immediate query results.
Results
To address Research Question 1, we found the essential information required to effectively trace and document semantic data enrichment processes involving pure functions could be represented in two ways. First, as a list of competency questions (Table 1) where each question is associated with one or more target stakeholder group(s) thought to be most interested in it. Second, we found that a combination of three HL7 FHIR RDF data resource types (Library, Observation, and Provenance) included the answers to the 13 competency questions in Table 1. A mapping between the 13 competency questions and elements of the Library, Observation, and Provenance data resource types is provided in the far right column of Table 1.
[Table 1 omitted: see PDF]
To address Research Question 2, we found a number of components were needed to establish an ETL pipeline for semantic data enrichment. These components are depicted in Fig. 2.
[Figure 2 omitted: see PDF]
Starting on the left of Fig. 2 above, existing data sources send data to the CAMH Knowledge Graph. In 2019, CAMH initially created this Knowledge Graph by deploying an instance of the Blue Brain Nexus knowledge graph platform in an on-premise Kubernetes cluster. When the ETL pipeline surrounded by the dotted line is triggered, it fetches and processes data from the graph. In general, the technical components needed to support a semantic data enrichment ETL pipeline like the one we developed are a knowledge graph, a listener (like Apache NiFi), the pipeline's software code (which we implemented in five stages using Python), and a deployed instance of one or more executable pure functions (which we accessed via a local API).
The computational sequence supported by these technical components is shown from top to bottom in Fig. 3 immediately below. The sequence starts with a patient providing responses to the items of the PHQ-9 in REDCap. The Apache NiFi listener checks REDCap for newly posted patient PHQ-9 survey responses every 10 minutes. When one or more instances of these responses are found, they are loaded into the Blue Brain Nexus Knowledge Graph. Next, Blue Brain Nexus generates a specific server-sent event message, which is detected by Apache NiFi, causing it to trigger the ETL pipeline. After querying data available in the graph, the pipeline's code, which also runs in Apache NiFi, posts a request for a summary PHQ-9 score and its interpretation to the KO-API. Next, the pipeline's code generates corresponding FHIR RDF Observation and Provenance messages containing the new computations and information about how the new computations were produced, respectively. Finally, the pipeline's code loads the new FHIR RDF Observation and Provenance resources into the graph, thereby enriching it with more data.
[Figure 3 omitted: see PDF]
To address Research Question 3, we learned the following about adopting and making any pipeline's outputs conform to existing HL7 FHIR RDF data resource standards. First, we created instances of the Observation and Provenance resources, serialized them in Turtle, and tested them for conformance with the HL7 FHIR RDF standards using the ShEx.js validation tool.
To test these two HL7 FHIR RDF data resource types using ShEx.js, a common schema detailing all of the requirements for each type of data resource is required. We used the openly available HL7 FHIR RDF ShEx schemas created by Sharma et al. for this purpose [37]. As an example of these schemas, the initial lines of the Observation resource schema that we used look like this:
[Schema listing omitted: see PDF]
Next is a complete Observation resource we created that passed its ShEx.js validation:
[Observation resource listing omitted: see PDF]
We confirmed that the Observation resource above passed its validation using ShEx.js when we received the following truncated output from ShEx.js with no "failure list" of errors. To show the difference, below that is the truncated output from a failed validation test with a "FailureList" of errors. The first failure shown in part below is a "TypeMismatch" for the Observation resource's status field, meaning that the status field did not contain an entry allowed by the HL7 FHIR RDF standard.
We performed iterative validation tests like these for our Observation and Provenance resource examples until our examples of both resources passed their ShEx.js validation. At that point, the code for Stage 4 of the ETL pipeline (see Fig. 2) was programmed to produce conformant new instances of the Observation and Provenance data resources.
Next, we learned how to use SPARQL queries like the one below over an enriched knowledge graph to confirm that the HL7 FHIR RDF Observation and Provenance data resources loaded into the CAMH Knowledge Graph by the ETL pipeline at its final stage (Stage 5) contain information sufficient to answer our competency questions.
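For illustration, a query of the following general shape traces a computed score back to the pure function and responses that produced it; the property paths and prefixes shown are simplified assumptions rather than the exact nested FHIR RDF structure used in the CAMH graph:

# Illustrative competency-question query; posted to the graph's SPARQL endpoint.
COMPETENCY_QUERY = """
PREFIX fhir: <http://hl7.org/fhir/>
PREFIX ex:   <http://example.org/>

SELECT ?patient ?score ?function ?responses WHERE {
    ?observation a fhir:Observation ;
                 ex:subject ?patient ;
                 ex:value   ?score .
    ?provenance  a fhir:Provenance ;
                 ex:target      ?observation ;
                 ex:usedLibrary ?function ;
                 ex:usedEntity  ?responses .
}
"""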
As shown in Fig. 4 further below, because we are querying a graph, the SPARQL queries we developed take advantage of semantic data links between the new Observation and Provenance resources and the existing patient, PHQ-9 questionnaire response, and (software) Library resources.
[Figure 4 omitted: see PDF]
When subgraphs like that in Fig. 4 are queried with SPARQL queries like the one above, the CAMH Knowledge Graph produces rows of query output like the example rows in Table 2 below.
[Table 2 omitted: see PDF]
This study addressed three research questions. The results above list essential information needed to trace and document semantic data enrichment, show the components of an ETL pipeline for enriching an RDF data store, detail what is required for such an ETL pipeline to be able to produce conformant HL7 FHIR RDF data resources, and show how to query such data resources using SPARQL.
Discussion
We conducted this technical study to show how biomedical research organizations like CAMH that already use knowledge graphs can further enrich their graph data by computing with pure functions. To a significant degree, we overcame the three challenges noted above. To uphold patient data security and integrity, the ETL pipeline we developed operates entirely inside CAMH's IT security zone. Patient data never leaves CAMH. To assist in managing growing pure function use over time, we showed how every computation made with a pure function can be associated with key provenance information, including information about the implemented version of any pure function. CAMH values this provenance information for computations about patients because it enables their researchers to understand when and how each computation was produced. Finally, to show how existing semantic data resource standards can be leveraged when enriching graph data with pure functions, we programmed the pipeline to generate valid HL7 FHIR RDF Observation and Provenance data resources.
For our example pure function in this study, we selected a simple but widely used scoring and interpretation function relevant to clinical depression. This function combines patient reported data from the Patient Health Questionnaire-9 instrument to generate a summary PHQ-9 score and an interpretation of that PHQ-9 score [38]. In this example, computing the PHQ-9 summary score and its interpretation in a standard, reliable, and consistent way enables CAMH to enrich their patient data knowledge graph automatically every time new PHQ-9 responses are logged.
While CAMH's implementation of the pipeline uses SPARQL queries to fetch PHQ-9 data stored in Blue Brain Nexus, organizations are not required to migrate their existing data to a knowledge graph platform. The modular design of pure functions allows the pipeline architecture to be source-agnostic as to its input parameters. Organizations can query PHQ-9 data directly from their current systems so long as the query returns the information required by the pure function signature. However, a key requirement is a versioned data management approach in which individual data points are assigned unique and persistent identifiers, as called for by the FAIR principles. This is what enables the pipeline to maintain a provenance chain that can trace computational results back to their source observations, regardless of the underlying storage systems.
Because the pure function is designed independently of the pipeline, it can be reused seamlessly across different applications, pipelines, and computing environments, provided the input data conform to the expected function signature. This reusability ensures computational consistency across systems and facilitates validation and testing efforts.
Additionally, the semantic data enrichment ETL pipeline we developed can now be fitted with other locally executable pure functions beyond PHQ-9 scoring. In biomedicine overall, computing with pure functions is increasingly important, as hundreds of medical calculators, biomedical equations, scoring tools, and computable guidelines exist that are widely and frequently used. In future work, we look forward to fitting semantic data enrichment pipelines like the one developed for this study with a wide array of biomedical pure functions relevant to mental health research. Being source-agnostic allows organizations to leverage their existing data infrastructure while gaining the provenance benefits demonstrated in our implementation, making the pipeline broadly applicable across diverse healthcare IT environments for any computable biomedical function that follows similar architectural patterns.
Several ongoing developments suggest that the number of biomedical pure functions is likely to increase. The most obvious of these may be the recent growth in small and large-scale machine learning and deep learning (ML/DL) models for biomedicine. The medical informatics literature is increasingly filled with reports of new statistical-mathematical pure functions arising from ML/DL [5]. At the individual patient level, computations using a growing number of ML/DL pure functions inform and assist with disease categorization and subcategorization, outlier detection, outcomes prediction, and many other tasks. Other work is ongoing to improve widely used biomedical equations for estimating patient-level features, including physiological functioning and health risks. As more and higher-quality patient data become available [39], and as the use of race as a proxy for genetics is set aside [40], widely used biomedical equations are being reworked. As examples, consider how once widely used equations for predicting kidney function (e.g., the Cockcroft-Gault equation) and for estimating cardiovascular disease risk (e.g., the Framingham Risk Score) have been replaced by newer equations. As these changes occur, being able to trace how computations were made enables organizations to distinguish computations made with older and newer versions of the pure functions they use. For clinicians and patients, being able to trust that computations have been properly made is critical. For researchers, understanding how individual patient-level computations come about is necessary for reproducing data analyses. For risk managers, tracing the origins of select high-stakes biomedical computations may be desirable.
It is also true that biomedical pure functions can be used in combination to arrive at "compound computations." For example, estimates of kidney function may serve as inputs to computable phenotype functions that determine the stages of chronic kidney disease. It is easy to imagine more elaborate examples where many pure functions are used in conjunction to compute new patient-level features. In one of our past projects, more than 40 pure functions were used to compute a battery of estimates of the risks and benefits of preventive medical services [21]. In clinical medicine, now is the time to develop reliable procedures for producing and reporting singular and compound computations arising from the use of pure functions. This work shows how knowledge graphs and related semantic web technology can assist in this regard.
This work began as an information modeling exercise. As the modeling work unfolded, it became apparent that our "homegrown" models overlapped considerably in their content with existing data resource models from the HL7 FHIR community. Adopting HL7 FHIR's RDF data resource models brought several advantages. First, relinquishing "homegrown" information models makes this work applicable to a wider audience. Second, adopting HL7 FHIR RDF models made it possible to complete a demonstration of automated data resource conformance testing and validation using the newly created ShEx toolkit. This obviated any need to develop a unique data conformance testing and validation mechanism of our own. Third, showing how HL7 FHIR RDF can be used in a practical application like our ETL pipeline may help others to do similar things.
The focus of this work is exclusively on semantic data enrichment using domain-specific pure functions implemented external to knowledge graphs. This kind of data enrichment is quite different from using logical reasoners to produce new inferences from existing graph data. A critical limitation of this work is that it did not implement the example pure function or its separate API using semantic technologies. We look forward to future work along these lines where we describe the inputs and outputs of pure functions using semantic data and deploy pure functions using semantic API technology.
This work is further limited in the following ways. No attempt was made to develop the ETL pipeline using other tooling even though Jupyter Notebooks and other available tools could be used to build similar pipelines. Only one pure function implementation was used, and the data enrichment ETL pipeline that resulted is likely over-fit to the example PHQ-9 pure function.
We also encountered limitations related to FHIR RDF. A limitation in the FHIR RDF R5 vocabulary is that not all of the FHIR RDF R5 URIs resolve to a web page. This makes it harder to understand the meaning of some FHIR RDF R5 data. Through our discussions with the FHIR RDF R5 implementation team, we understand work to address this issue is ongoing. We have provided some links to the actual definitions from the FHIR vocabulary for terms used in the above query:
One other thing we noted is that in FHIR RDF's R5 release, the use of the property fhir:value was changed to fhir:v for primitive values to avoid any conflicts with other uses of fhir:value throughout the FHIR standard.
Conclusions
Using open source technologies, an existing patient knowledge graph, and HL7 FHIR RDF Library, Observation, and Provenance linked data resources, a software pipeline was built to enrich a patient-oriented knowledge graph with computations arising from the use of a pure function. This approach to data enrichment is an example of bringing "compute" to a knowledge graph environment using widely available technologies and in conformance with an emerging data standard (HL7 FHIR RDF). This work enables secure and traceable computation for healthcare and biomedical research.
Data availability
The data and software supporting the conclusions of this article are included within the article and in additional files to which links are provided.
Abbreviations
ETL:
Extract, Transform, and Load
FHIR:
Fast Healthcare Interoperability Resources
HL7:
Health Level 7
PHQ-9:
Patient Health Questionnaire 9
RDF:
Resource Description Framework
Bartosz Milewski. Chapter 3 "Pure Functions, Laziness, I/O, and Monads" in "Basics of Haskell". School of Haskell. 2013. Online at: https://www.schoolofhaskell.com/school/starting-with-haskell/basics-of-haskell/3-pure-functions-laziness-io. Retrieved 2024-06-19.
Spitzer RL, Kroenke K, Williams JB, Löwe B. A brief measure for assessing generalized anxiety disorder: the GAD-7. Arch Intern Med. 2006;166(10):1092–7.
Lu T, Silveira PP, Greenwood CM. Development of risk prediction models for depression combining genetic and early life risk factors. Front Neurosci. 2023;18(17):1143496.
Su C, Xu Z, Pathak J, Wang F. Deep learning in mental health outcome research: a scoping review. Transl Psychiatry. 2020;10(1):116.
Federico CA, Trotsyuk AA. Biomedical data science, artificial intelligence, and ethics: navigating challenges in the face of explosive growth. Ann Rev Biomed Data Sci. 2024;10:7.
Shivade C, Raghavan P, Fosler-Lussier E, Embi PJ, Elhadad N, Johnson SB, Lai AM. A review of approaches to identifying patient phenotype cohorts using electronic health records. J Am Med Inform Assoc. 2014;21(2):221–30.
Miotto R, Li L, Kidd BA, Dudley JT. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Sci Rep. 2016;6(1):1.
Braunstein ML. Health Informatics on FHIR: How HL7's API is transforming healthcare. Springer; 2022. https://link.springer.com/book/10.1007/978-3-030-91563-6#bibliographic-information.
Duda SN, Kennedy N, Conway D, Cheng AC, Nguyen V, Zayas-Cabán T, Harris PA. HL7 FHIR-based tools and initiatives to support clinical research: a scoping review. J Am Med Inform Assoc. 2022;29(9):1642–53.
Resource Description Framework, W3C At: https://www.w3.org/RDF/. Accessed 20 June 2024.
Ajileye T, Motik B. Materialisation and data partitioning algorithms for distributed RDF systems. J Web Semantics. 2022;1(73):100711.
JSON for Linking Data. At: https://json-ld.org/. Accessed 20 June 2024.
Quilitz B, Leser U. Querying distributed RDF data sources with SPARQL. In: The Semantic Web: Research and Applications. 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1–5, 2008, Proceedings. Berlin, Heidelberg: Springer; 2008. p. 524–38.
Mishra RB, Kumar S. Semantic web reasoners and languages. Artif Intell Rev. 2011;35:339–68.
Sy MF, Roman B, Kerrien S, Mendez DM, Genet H, Wajerowicz W, Dupont M, Lavriushev I, Machon J, Pirman K, Neela MD. Blue brain nexus: an open, secure, scalable system for knowledge graph management and data-driven science. Semantic Web. 2023;14(4):697–727.
Rotenberg DJ, Chang Q, Potapova N, Wang A, Hon M, Sanches M, Bogetic N, Frias N, Liu T, Behan B, El-Badrawi R. The CAMH neuroinformatics platform: a hospital-focused Brain-CODE implementation. Front Neuroinform. 2018;6(12):77.
Jeelani OF, Njie M, M Korzhuk V. Methods and algorithms of ensuring data privacy in AI-based healthcare systems and technologies. In: Conference Proceedings, Paris, France; 2024 Apr 11. Vol. 11, p. 12.
Wilkinson MD, Dumontier M, Aalbersberg IjJ, Appleton G, Axton M, Baak A, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1): 160018.
Barker M, Chue Hong NP, Katz DS, Lamprecht AL, Martinez-Ortiz C, Psomopoulos F, et al. Introducing the FAIR principles for research software. Sci Data. 2022;9(1): 622.
Knowledge Systems Lab. KOIO - The knowledge object implementation ontology. GitHub - kgrid/koio: Available from: https://github.com/kgrid/koio. Cited 2025 Apr 14.
Flynn A, Taksler G, Caverly T, Beck A, Boisvert P, Boonstra P, et al. CBK model composition using paired web services and executable functions: a demonstration for individualizing preventive services. Learn Health Syst. 2023;7(2): e10325.
Alper BS, Flynn A, Bray BE, Conte ML, Eldredge C, Gold S, et al. Categorizing metadata to help mobilize computable biomedical knowledge. Learning Health Systems. 2022;6(n/a): e10271.
Flynn A, Conte M, Boisvert P, Richesson R, Landis-Lewis Z, Friedman C. Linked metadata for FAIR digital objects carrying computable knowledge. Res Ideas Outcomes. 2022;10(8):e94438.
McCusker J, McIntosh LD, Shaffer C, Boisvert P, Ryan J, Navale V, Topaloglu U, Richesson RL. Guiding principles for technical infrastructure to support computable biomedical knowledge. Learn Health Syst. 2023;7(3): e10352.
Knowledge Systems Lab. BMI Calculator Knowledge Object. Available from: https://github.com/kgrid/koio/tree/master/examples/bmi_calculator_v_3. Cited 2025 Apr 14.
Prud’hommeaux E, Collins J, Booth D, Peterson KJ, Solbrig HR, Jiang G. Development of a FHIR RDF data transformation and validation framework and its evaluation. J Biomed Inform. 2021;1(117):103755.
HL7 FHIR RDF Representation. Accessed at: https://hl7.org/fhir/rdf.html December 5, 2024.
Ren Y, Parvizi A, Mellish C, Pan JZ, Van Deemter K, Stevens R. Towards competency question-driven ontology authoring. In: The Semantic Web: Trends and Challenges. 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25–29, 2014, Proceedings. Springer International Publishing; 2014. p. 752–67.
IDLab. The Function Ontology. The function ontology. Available from: https://fno.io/. Cited 2025 April 14.
PROV-O: The PROV Ontology. Available from: https://www.w3.org/TR/prov-o/. Cited 2025 Apr 14.
Health Level 7. Resource description framework representation. Available from: https://www.hl7.org/fhir/rdf.html. Cited 2025 April 14.
Solbrig HR, Prud’hommeaux E, Grieve G, McKenzie L, Mandel JC, Sharma DK, Jiang G. Modeling and validating HL7 FHIR profiles using semantic web shape expressions (ShEx). J Biomed Inform. 2017;67(1):90–100.
Prud'hommeaux E. ShEx.js. Available from: https://github.com/shexjs/shex.js. Cited 2024 Dec 5.
Knowledge Systems Lab. PHQ9 Knowledge Object. Available from: https://github.com/kgrid-objects/CAMH/blob/main/collection/99999-CAMH2-v1.0/PHQ9_algorithm.js. Cited 2025 April 14.
Kroenke K, Spitzer RL, Williams JBW. The PHQ-9. J Gen Intern Med. 2001;16(9):606–13.
Apache Software Foundation. Apache NiFi. Apache NiFi. Available from: https://nifi.apache.org/. Cited 2025 Apr 14.
Sharma DK, Prud’hommeaux E, Booth D, Nanjo C, Jiang G. Shape expressions (ShEx) schemas for the FHIR R5 specification. J Biomed Inform. 2023;148(Dec): 104534.
Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16(9):606–13.
International Data Corporation. 2018. Digitization of the world from edge to core. Available from: https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf. Cited 2025 Apr 14.
National Academies of Sciences and Medicine. Using population descriptors in genetics and genomics research: a new framework for an evolving field. Washington, DC: The National Academies Press; 2023. Available from: https://nap.nationalacademies.org/catalog/26902/using-population-descriptors-in-genetics-and-genomics-research-a-new.