The Harmonized CRDC Data Model (CRDC-H)
The goal of the Center for Cancer Data Harmonization (CCDH) is to support the harmonization of equivalent data elements in disparate models across NCI’s Cancer Research Data Commons (CRDC) repositories (nodes) to enable cross-node querying and multi-modal analytics. Individual nodes’ data models have been developed largely independently to fit specific data types and/or use cases. The CCDH is tasked with defining a shared data model for use across the CRDC, leveraging existing standards where possible to support interoperability with external data.
The CCDH Harmonized Data Model (CRDC-H) and its terminological infrastructure are being designed to meet the needs of systems like the Cancer Data Aggregator (CDA) that support integrated search and metadata-based analyses across datasets in the CRDC. We view the CRDC-H as a continuously evolving artifact. To become and remain useful, the CRDC-H must be able to evolve and extend to meet new needs, while still serving as a stable semantic anchor for existing content.
The version 1.0 release of the CRDC-H is a snapshot of that evolving model, covering administrative, biospecimen, and clinical data entities from multiple data commons: PDC, GDC, ICDC, and HTAN. The CRDC-H is natively expressed in the LinkML modeling language, which allows us to leverage the existing LinkML tool ecosystem. These tools generate a number of useful artifacts: the browsable, searchable model documentation available in these pages (see navigation bar on the left); representations of the model in CSV and OWL; representations used for validating data, such as JSON Schema and ShEx; and artifacts for interfacing with other technologies, such as GraphQL and JSON-LD. The CRDC-H model repository contains tools for converting the spreadsheets where CRDC-H content is developed into formal LinkML, and holds the resulting LinkML model and its downstream artifacts for public use. By hosting the CRDC-H LinkML model on GitHub, we can also leverage GitHub tools such as issue tracking and pull requests to provide versioning and maintain a history of changes to the model over time.
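To make the spreadsheet-to-LinkML step more concrete, the following is a minimal sketch of the general idea, not the repository's actual conversion tooling. It assumes a hypothetical spreadsheet export with the columns entity, attribute, range, and description, and folds the rows into a dictionary shaped like a LinkML schema (top-level id and name, with classes that carry attributes). The rows, the schema id, and the function name rows_to_schema are illustrative assumptions. The result is printed as JSON, which is also readable as YAML, the format LinkML schemas are normally written in.

```python
import csv
import io
import json

# Hypothetical spreadsheet export. The column names and rows are
# illustrative assumptions, not the real CRDC-H source sheets.
SHEET = """entity,attribute,range,description
Specimen,id,string,Identifier for the specimen
Specimen,specimen_type,string,High-level type of the specimen
Subject,id,string,Identifier for the subject
"""


def rows_to_schema(csv_text, schema_id, schema_name):
    """Fold one-row-per-attribute CSV text into a LinkML-shaped dict."""
    schema = {"id": schema_id, "name": schema_name, "classes": {}}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Create the class entry the first time an entity name appears.
        cls = schema["classes"].setdefault(row["entity"], {"attributes": {}})
        cls["attributes"][row["attribute"]] = {
            "range": row["range"],
            "description": row["description"],
        }
    return schema


# The schema id and name below are placeholders chosen for this sketch.
schema = rows_to_schema(SHEET, "https://example.org/sketch-schema", "sketch_schema")
print(json.dumps(schema, indent=2))
```

A structure like this would still need the prefixes, imports, and type definitions that a complete LinkML schema declares before the LinkML generators could turn it into JSON Schema, OWL, or the other artifacts listed above.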