Continuous multimodal data supply chain and expandable clinical decision support for oncology

Introduction
Oncology data is multidimensional and diverse, encompassing a vast array of information such as patient characteristics, tumor and staging information, and imaging data1. The advent of electronic medical records (EMR) and emerging data sources has caused a transformative surge in health information2. This data deluge often exceeds human cognitive limits for decision-making3 and has led oncology professionals to spend more time navigating the EMR to piece together fragmented health data from disparate sources than engaging with patients, which exacerbates burnout4.
Fortunately, rapid advancements in computational techniques, notably machine learning and artificial intelligence (AI), herald new possibilities for harnessing extensive and intricate medical data for individualized, data-driven care5. These technologies have demonstrated potential in refining imaging6 and pathology diagnostics7, prognosticating clinical outcomes, optimizing radiation treatment planning8, and accelerating drug development9,10. AI has also significantly impacted foundational research in oncology11.
However, challenges related to validation and generalizability12 mean that the current methodologies for data management and model development fall short of the maturity required for broad-scale AI adoption. Transitioning from the present ad-hoc data aggregation and curation approach to a dynamic “metadata supply chain” is essential for providing contextualized, robust data in real-time13. By capturing pivotal data in real-time, this metadata supply chain can lay the groundwork for a clinical decision support system that vividly maps patient journeys, potentially transforming clinical workloads. Recent advancements have also led to the development of other multimodal frameworks, integrating diverse data types such as clinical, genomic, and imaging data to enhance precision in patient care, exemplified by the MEDomics framework developed by Morin et al.14.
In this study, our objective was to present our collaborative endeavor to establish a comprehensive data supply chain in oncology. This system seamlessly integrates clinical, genomic, and imaging data, representing a persistent, flexible, and expandable model. The infrastructure holds the potential to expedite the development of clinical decision support systems and AI applications for risk stratification, diagnosis, and treatment in oncology, paving the way for individualized, patient-centered care.
Results
Data collection and infrastructure are illustrated in Fig. 1, with detailed procedures described in the Methods section. Through this process, at the time of analysis, the database (DB) contained feature sets for 171,128 individuals diagnosed with 11 different cancers at a single academic cancer center between January 2006 and March 2022 (Table 1). For each individual, 817 essential features in the common columns and a median of 61 features (range: 38–109) in the cancer-type-specific columns were updated continuously on a daily basis. To facilitate the extraction of structured information from unstructured medical text documents, Natural Language Processing (NLP) techniques were applied during data processing.

The original data from the EMR/OCS, which contains clinical data for all patients visiting the hospital, serves as the source data for the cancer-specific YCDL database. After the original data is transferred to the DW server, data marts are created from the DW tables in the DSC database, grouped by related topics. Separate databases are established for each cancer type, named “DSC_cancertype,” to prevent excessive time spent on complex SQL query execution. The DSC database condenses data from 18 tables and 433 columns by integrating relevant tables from the DW database and joining with code master or terminology tables to include code-code name columns for immediate comprehension of codes. Similarly, the YCDL_DB maintains separate physical databases for each cancer type, where data is loaded. A patient-centric data model was developed, underpinned by patient identification numbers dispensed by the hospital information system, serving as a linchpin for linking anonymized datasets. All data processing, transfer, and storage were performed within the network infrastructure of the hospital. The YCDL site allows the execution of individual Data Manipulation Language (DML) statements to load YCDL data. The transfer of data from the original source to the DW/DSC DB is automated, with ETL processes running daily at 10 AM, transferring cancer-specific target data.
During the quality control (QC) process, we established a comprehensive set of 143 human-driven logical comparisons, including 70 focused on identifying missing data, 41 ensuring temporal validity (e.g., the completion date of radiotherapy should coincide with or follow its initiation date), 15 pinpointing outlier data (such as age at menarche between 8 and 20 years), 13 selecting the relevant values among multiple time points, and 4 dedicated to spotting duplicated or inconsistent data. The QC logic outcomes showed consistent results across 11 different cancer types, comprising a total of 1,523 datasets. We initially set the estimated daily QC case count to 10%, which translated to approximately 81 cases per day.
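For illustration, the sketch below shows how such logical comparisons could be expressed programmatically. It is a minimal example assuming a pandas DataFrame with hypothetical column names; it is not the production QC code, which runs against the YCDL database inside the hospital network.

```python
# Minimal sketch of the human-driven QC logic described above; column names
# (initial_diagnosis_date, rt_start_date, rt_end_date, age_at_menarche) are
# hypothetical stand-ins for YCDL fields.
import pandas as pd

def run_qc_checks(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows that violate example missing-data, temporal, and outlier rules."""
    flags = pd.DataFrame(index=df.index)

    # Missing-data check: an essential feature must be present.
    flags["missing_diagnosis_date"] = df["initial_diagnosis_date"].isna()

    # Temporal-validity check: radiotherapy completion must not precede its start.
    flags["rt_dates_inverted"] = df["rt_end_date"] < df["rt_start_date"]

    # Outlier check: a recorded age at menarche is expected between 8 and 20 years.
    flags["menarche_age_outlier"] = (
        df["age_at_menarche"].notna() & ~df["age_at_menarche"].between(8, 20)
    )
    return flags

# Rows with any raised flag would be routed to the QC manager for review, e.g.:
# review_queue = cohort_df[run_qc_checks(cohort_df).any(axis=1)]
```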
We generated survival graphs for each of the 11 distinct cancers in our dataset, segmented by tumor stage (Fig. 2). As expected, survival rates varied substantially by stage for all cancers except prostate cancer; in general, higher stages were associated with lower survival rates.

Kaplan–Meier survival curves for patients with cancer in YCDL target data, stratified by tumor stage. Survival rates were compared across 11 distinct cancer types: A breast cancer, B colorectal cancer, C lung cancer, D gastric cancer, E liver cancer, F melanoma, G kidney cancer, H prostate cancer, I thyroid cancer, J pancreatic cancer, and K biliary tract cancer. The curves illustrate stage-dependent survival variations, with generally lower survival rates at higher stages, except for prostate cancer where no significant variation was observed.
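The stage-stratified curves in Fig. 2 were derived from the YCDL survival fields (time from initial diagnosis to death or last follow-up; see Methods). A minimal sketch of how such curves can be generated is shown below, assuming the lifelines package and hypothetical column names.

```python
# Illustrative sketch of stage-stratified Kaplan-Meier curves; the column names
# (survival_months, death_observed, overall_stage) are hypothetical stand-ins
# for the YCDL survival fields.
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

def plot_km_by_stage(df, time_col="survival_months", event_col="death_observed"):
    ax = plt.subplot(111)
    kmf = KaplanMeierFitter()
    for stage, grp in df.groupby("overall_stage"):
        kmf.fit(grp[time_col], event_observed=grp[event_col], label=f"Stage {stage}")
        kmf.plot_survival_function(ax=ax, ci_show=True)  # 95% confidence bands
    ax.set_xlabel("Time since initial diagnosis (months)")
    ax.set_ylabel("Survival probability")
    return ax

# Example usage: plot_km_by_stage(ycdl_lung_df); plt.show()
```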
The efficacy of our data framework in rapidly generating and evaluating clinical hypotheses was demonstrated in a study focusing on rectal cancer, published in 202215. Following the initiation of the study design in December 2020 and subsequent approval from the institutional review board, researchers requested baseline data on patients, tumors, and treatments, as well as peripheral blood neutrophil and lymphocyte counts, spanning the period from the initial diagnosis to the respective dates of primary rectal surgery for study participants. Data abstraction for 1386 individuals was efficiently executed using our framework, encompassing a total of 14 distinct clinical features. All features were validated through meticulous chart reviews by researchers, with head-to-head comparisons ensuring the accuracy and reliability of the data prior to its use in the study. User feedback further supported its reliability and effectiveness. This proficiency enabled researchers to commence a pilot analysis in January 2021, merely a month after the initial data acquisition.
The results of evaluating the accuracy of the NLP models used in our ETL process are as follows. For the first analysis, the median number of features for surgical pathology and molecular pathology was 26 (range: 20–33) and 13 (range: 9–16), respectively. The median accuracy and missing rate for surgical pathology were 92.6% (range: 86.5–98.8%) and 4.9% (range: 0.5–10.7%), respectively. For molecular pathology, the median accuracy and missing rate were 98.7% (range: 92–100%) and 0.6% (range: 0–8%), respectively (Supplementary Table 1). For the second analysis, the NLP classification model was evaluated on 1000 individual CT reports with multilabel selection across the categories Complete Response/No Evidence of Disease (CR/NED), Partial Response (PR), Stable Disease (SD), Progressive Disease (PD), and Indeterminate (232), achieving an AUROC of 0.956 and an F1 score of 0.823. The model showed 72.3% (95% CI, 59.5–85.1) accuracy in predicting the day of disease progression within a ±30-day window and 55.3% (95% CI, 41.1–69.5) accuracy in predicting the best response category and its timing within a ±45-day window. Notably, the model was more accurate in predicting CR/NED at 72% compared to SD and PR, which had accuracies of 27.3% and 15.4%, respectively.
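As an illustration of how these classification metrics can be computed, the sketch below uses scikit-learn on toy multilabel arrays; the label set matches the categories above, but the data and the 0.5 threshold are hypothetical and not taken from the study.

```python
# Toy example of macro-averaged AUROC and F1 for multilabel RECIST categories;
# y_true and y_prob are illustrative placeholders, not study data.
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

labels = ["CR/NED", "PR", "SD", "PD", "Indeterminate"]
y_true = np.array([[1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0],
                   [0, 0, 1, 1, 0],
                   [0, 0, 0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.1, 0.1, 0.0],
                   [0.1, 0.8, 0.2, 0.1, 0.1],
                   [0.0, 0.1, 0.7, 0.6, 0.2],
                   [0.1, 0.0, 0.2, 0.1, 0.9]])

auroc = roc_auc_score(y_true, y_prob, average="macro")               # macro AUROC
f1 = f1_score(y_true, (y_prob >= 0.5).astype(int), average="macro")  # macro F1 at a 0.5 cutoff
print(f"AUROC={auroc:.3f}, F1={f1:.3f}")
```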
We successfully developed a clinical decision support system with four layouts. In the upper-left layout, three selected image series are displayed alongside their corresponding three-dimensional tumor visualizations, using DICOM files of individually and manually contoured lesions (Supplementary Movie 1, Fig. 3a). In the middle-upper layout, the output of longitudinal tumor tracking is presented as a graph (Fig. 3b). A section intended to predict individual patient outcomes has been reserved for the future integration of candidate models (Fig. 3c). The lower layout presents a comprehensive overview of a patient’s healthcare journey, allowing readers to intuitively understand the chronological sequence of events and the progression of the patient’s treatment (Fig. 3d, Supplementary Movie 2). This offers holistic and interactive patient summaries on a graphical timeline anchored by real-time data captured within our framework. Users can easily assess patient data in a temporal context with a single click, and the depth of information can be fine-tuned using zoom features and pop-up boxes.

a Three-dimensional display of overall disease burden, with individual lesions contoured manually or automatically in advance. b Longitudinal tumor tracking output in the form of a graph. c Section displaying survival curves for assessing and predicting individual patient outcomes by integrating any potential model. d Comprehensive overview of a patient’s cancer journey including treatment history, follow-up, and disease status.
To assess satisfaction with using the CDSS alongside the EMR, we surveyed 33 oncology healthcare providers, including professors, residents, and physician assistants, on five randomly selected cases (breast, colorectal, lung, gastric, and liver cancer). The median EMR usage experience among the participants was 9 years (range: 1–19 years). The satisfaction scores for patient chart assessment using the EMR along with the CDSS are detailed in Table 2. The results showed that, in almost all areas, the scores averaged above 4 out of 5 points, with 5 being the highest possible score. While those with longer EMR usage experience (≥10 years) reported lower satisfaction with the user interface, there were no significant differences in other aspects based on years of EMR usage experience.
Discussion
We successfully developed a cancer-specific information technology (IT) infrastructure designed to facilitate the longitudinal collection of comprehensive health data, an accomplishment realized through extensive cross-departmental collaboration. Using our IT infrastructure, we created a database that is automatically updated on a regular basis, with each patient record containing over 800 unique features. Manually collating such an expansive array of features is challenging. To ensure data integrity, we initially implemented rigorous data QC methods, starting with manual logic applications and subsequently transitioning to an automated management system. This approach, conducted within closed-loop systems, led to a steady enhancement in data precision. While we did not verify the accuracy of all features, the simpler ETL processes demonstrated high accuracy, whereas more complex NLP tasks such as RECIST categorization highlighted the need for expert correction and ongoing model improvement and maintenance. For example, the suboptimal results for SD and PR suggest the limitations of evaluating RECIST criteria solely from textual information without incorporating imaging analysis. Another potential issue is the unbalanced dataset across individual categories, as SD and PR cases represent the smallest absolute numbers, which may affect predictive performance. Of practical significance, our system highlights the potential for real-time capture of disease state and treatment data, exemplified by a proof-of-concept for rapid clinical hypothesis testing, and offers a holistic view of a patient’s journey with a single click. This not only alleviates the clinical burden but also optimizes the research workflow. Survey results from oncology healthcare providers revealed generally high user satisfaction and strong expectations for the potential of the CDSS (Table 2). Our evaluation employs questionnaires designed for diverse healthcare professionals and encompasses multiple usability metrics, namely efficiency, effectiveness, and the ability to identify user errors. Nonetheless, further evaluations across different institutional and national settings, incorporating methods such as user trials, interviews, and heuristic evaluations16, are essential to fully validate the system’s utility and generalizability.
Understanding the crucial role of a reliable automatic data supply chain, several research groups have collaboratively developed frameworks to capture and transport oncologic data14,17. Morin et al.14 introduced MEDomics, an information technology infrastructure that integrates seamlessly with multiple EHR DBs to ensure uniform data collection. Their research amassed data from nearly 175,000 patients with cancer at the University of California, San Francisco, between 2010 and 2019. Employing rule-based selection techniques, they identified individuals with high-quality data, narrowing them down to 3782 breast cancer and 2054 lung cancer cases. Lower-quality data were more prevalent among individuals located farther from the institution, a trend associated with increased mortality rates. Jung et al.17 showcased ROOT, an auto-updating data warehouse that consolidates comprehensive clinical data of 67,617 individuals diagnosed with head and neck, thoracic, and esophageal cancers at the Samsung Medical Center in Korea between 2008 and 2020. These endeavors underscore the importance of data governance and the active participation of all stakeholders. Considering geographic disparities and practice variations, in-house development might be best positioned to cater to the specific needs of end users.
Building an automated data warehouse using oncology EMR data poses inherent challenges because of the varying degrees of data completeness, inconsistencies, and conflicting or evolving records18. In this context, a nationwide initiative was launched to create a comprehensive cancer data library aimed at standardizing terminology and classification within our country. Concurrently, institutional efforts aimed to gather extensive feedback and integrate preexisting registries from diverse cancer groups. The recent proposal of the Operational Ontology for Oncology (O3) seeks to achieve multi-institutional and multi-stakeholder consensus, lowering the barriers to collaborative information aggregation19. Our next task involves identifying differences and similarities between our defined features and the variables proposed by O3 and, where possible, updating the necessary parts. To manage the vast variability of data sources and types, we devised algorithms that harness structured data from diverse origins and process unstructured content using ETL procedures. ETL operations present unique challenges, especially when dealing with data components that involve multiple ETL-related complications. Collaboration with team members well-versed in treatment workflows and medical informatics, combined with close cooperation with the IT department, was pivotal in understanding the system functionality and nuances of data interpretation. Both data governance and ethical deliberation are instrumental in ensuring data security and patient privacy.
In the absence of formalized frameworks, challenges may arise in query fulfillment and data management20. However, our data supply chain addresses this issue through an end-to-end workflow for data quality assurance, ensuring continual evaluation and improvement. Conflicting, missing, or incorrect data were identified through human-driven logical comparisons and rectified by making logical corrections or adjusting the algorithms. Since its implementation, the quality assurance workflow has been continuously refined, accumulating data checks across multiple cycles. This iterative process enhances data quality and reduces the need for human intervention. Engagement with various groups familiar with the data sources and limitations is essential.
Our YCDL framework has numerous potential clinical and research applications. Although limited data have evaluated clinically relevant outcomes in oncology care, emerging evidence suggests that CDSS using EMR data can positively impact care quality21. The most actively researched area is non-knowledge-based CDSS, which leverages machine learning and AI to predict patient outcomes22, as follows: A recent randomized controlled trial by Hong et al.23 demonstrated accurate triaging of patients with cancer and reduced acute care rates using an EMR-based machine learning algorithm. Coombs et al.24 showed that a proposed machine learning tool using real-world EMR data could identify patients with cancer at risk for a 60-day emergency department visit. Another potential application is the generation and rapid testing of clinical hypotheses, as suggested by Morin et al.14, which would not have been feasible using traditional data approaches. The YCDL enabled the collection of a vast amount of data, including laboratory results and patient features, thereby facilitating the first pilot analysis. Additionally, automatic flagging of eligible patients for clinical trials shows promise24.
Our study demonstrates that data consolidation and a continuous multimodal data supply chain can automatically generate visual timelines and enhance decision support—characteristics of a knowledge-based CDSS22. With advancements in systemic drugs, patients with stage IV cancer now live longer and have complex treatment histories25. A quick overview of a patient’s cancer journey allows physicians to efficiently characterize both the disease and the individual, potentially reducing burnout and ensuring quality care26. Commercial clinical decision support software, such as NAVIFY Oncology Hub27, Syapse28, and Flatiron Assist29, is undergoing evaluation for integration into the EMR system to provide a comprehensive view of a patient’s journey. Chen et al. demonstrated the potential of AI-assisted summarization tools using medical records, particularly when the input data is accurate30. With emerging local therapies31, AI can play a significant role in detecting and segmenting normal tissues and tumors32, as well as tracking lesions over time in relation to treatment33. However, further research on tumor auto-segmentation is warranted.
This study has several limitations that should be considered when interpreting our results. First, our method represents the experience of a single institution, and large-scale adjustments may be necessary for implementation elsewhere. Our system was not developed with direct consideration for interoperability standards such as HL7 and FHIR. Given that our ETL processes are based on our hospital’s data, we have primarily focused on optimizing performance within our own hospital environment. However, as the national FHIR standard evolves and is finalized, we plan to expand our system accordingly to ensure compliance. Second, the data supply chain approach is designed as an expandable infrastructure that accommodates updated ontologies and evolving demands. Establishing strong leadership in data governance, implementing sharing agreements, and promoting open science practices are essential for a robust metadata supply chain. This requires dedicated departments with assured job security. Collaborative efforts such as workshops and knowledge transfers promote an understanding of the benefits offered by the metadata supply chain and AI technologies. Future work will incorporate additional cancer types such as brain tumors and rare malignancies. Once the ETL process is finalized, we aim to make it publicly accessible. Our hospital primarily diagnoses and follows up with patients within our institution; however, inter-hospital data sharing may become necessary in certain cases. The NLP models used in our ETL process, such as logic-based segmentation, demonstrated sufficiently good accuracy, and we believe that applying language models could further enhance these outcomes. While large language models have made remarkable progress in terms of performance and are likely to perform well in data center environments, their adoption may be limited by concerns over cost-effectiveness. In such scenarios, smaller language models—with their reduced computational requirements and reliance on less training data—could represent a practical and efficient option for addressing specific tasks34. Additionally, our study did not demonstrate whether multimodal data is superior to single-modal data in predicting patient outcomes35. Finally, the current version of the YCDL framework captures only survival and recurrence data despite the growing recognition of quality of life and toxicity profiles as critical outcomes.
In conclusion, this study underscores the critical role of developing a streamlined data integration framework to organize and visualize the substantial volume of oncology patient data, supporting and enhancing clinical decision-making. The integration of real-time updates into our framework is particularly significant, enabling the incorporation of evolving treatment trends and up-to-date information. This collaborative effort to establish a robust data infrastructure highlights its potential to advance personalized care, accelerate the adoption of AI-driven applications, and refine clinical workflows. Furthermore, adopting comprehensive data supply chains and AI technologies requires a commitment to strong data governance, the embrace of open science principles, and strengthened collaboration within the medical community.
Methods
Development of multimodal data supply chain
Our research was conducted in accordance with the Declaration of Helsinki after approval of the protocol by the Institutional Review Board of Severance Hospital (Reference Number: 4-2021-1241). The need for informed consent was waived by the ethics committee because the study involved retrospective analysis of anonymized data, posing minimal risk to participants. First, we established a development server using a Windows-based, 12-core computer with 64 GB of memory, Serial Attached SCSI (SAS) disk drives of 100 GB and 2 TB, and four RTX 5000 GPUs. Operational servers, constituting High Availability (HA) systems, included a database (DB) server (two units) with a 10-core CPU, 128 GB memory, OS SSD 100 GB of storage/SCL 2019, DB Safer, Hiware, EMS, and Backup (DB), and a web-based server with a 12-core CPU, 64 GB memory, OS SSD 100 GB of storage/Hiware, EMS, and Backup (File).
The dataflow and computational modules are illustrated in Fig. 3. The original data from the EMR/OCS, which contains clinical data for all patients visiting the hospital, serves as the source data for the cancer-specific YCDL database. Data access is strictly managed to ensure security and privacy. Only authorized personnel can access the data, and even then, they must go through two layers of identity verification within the hospital’s internal network, which is isolated from the internet to prevent external threats. For research purposes, all data provided is either anonymized or pseudonymized to prevent the identification of individual patients, ensuring the protection of personal information while enabling research activities. Data are managed in compliance with ISO 27001 (international certification) and ISMS (domestic certification) standards for data protection regulations. The first data transfer from source data is governed by three conditions: the Target Patient Number (AlsUnitNo), a de-identified number assigned to each patient based on a diagnosis code matching the ICD-10 code of the primary cancer; the Target Encounter Number (AlsChosNo), a de-identified number generated during patient visits, which is selected as the encounter number for the primary cancer if the patient number from the first condition matches the target diagnosis code related to the primary cancer; and the creation of a Main Target Table, which distinguishes cancer types by the target patient and encounter numbers. If the target patient number exists in the tables on the DW server, all data related to the target encounter number are transferred. Cancer types are not distinguished in the DW tables. Additionally, the latest data from code master or terminology tables are also transferred. To improve the data quality and mitigate the risks associated with erroneous or omitted data, we tailored the selection approaches for each cancer type. The selection was based on the International Classification of Diseases for Oncology (ICD-O) and physician-assigned ICD-M codes, as well as validity criteria designated by the cancer registration program. A comprehensive breakdown of the selection methodologies for each cancer type is provided in Table 3. We received authorization for access to all digital records from the EMR system and billing data from the Oncology Care System.
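To make the first-transfer conditions concrete, the sketch below shows how the target patient and encounter numbers could be selected with a diagnosis-code query. It is a hedged illustration: the table and column names other than AlsUnitNo and AlsChosNo, the connection details, and the ICD-10 pattern are hypothetical.

```python
# Hedged sketch of the first data-transfer condition: select de-identified
# target patient numbers (AlsUnitNo) and encounter numbers (AlsChosNo) whose
# diagnosis codes match the primary cancer's ICD-10 codes, to populate the
# Main Target Table. Identifiers other than AlsUnitNo/AlsChosNo are hypothetical.
import pyodbc

BREAST_ICD10_PATTERN = "C50%"   # illustrative ICD-10 prefix for breast cancer

SELECT_TARGETS = """
SELECT DISTINCT d.AlsUnitNo, d.AlsChosNo
FROM dw.diagnosis AS d
WHERE d.icd10_code LIKE ?
"""

def build_main_target_rows(conn_str: str):
    """Return (patient, encounter) pairs used to populate the Main Target Table."""
    conn = pyodbc.connect(conn_str)
    try:
        cur = conn.cursor()
        return cur.execute(SELECT_TARGETS, BREAST_ICD10_PATTERN).fetchall()
    finally:
        conn.close()
```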
After the original data is transferred to the DW server, data marts are created from the DW tables in the Data Science Center (DSC) database, a proprietary name, grouped by related topics. Separate databases are established for each cancer type, named “DSC_cancertype,” to prevent excessive time spent on complex SQL query execution. The DSC database condenses the data from 18 tables and 433 columns by integrating relevant tables from the DW database and joining with code master or terminology tables to include code-code name columns for immediate comprehension of codes. Similarly, the YCDL_DB maintains separate physical databases for each cancer type. Each cancer-specific database has tables where data is loaded. In this process, a patient-centric data model was developed, underpinned by the patient identification numbers dispensed by the hospital information system. This served as a linchpin for linking the anonymized datasets. In the clinical data extraction stage, we developed an Extract-Transform-Load (ETL) process, which includes NLP, for each feature (Fig. 4). It facilitated the daily movement of data from the DSC source DB to the YCDL target DB. The DSC DB is a repository that contains unstructured and semi-structured data, including medical record text, imaging files, and next-generation sequencing (NGS) results. The YCDL site enables the execution of individual database queries to extract data from the source system and load it into the YCDL target system. Examples of data extraction from DSC DB to YCDL are provided in the Supplementary Figs. 1-9, using SQL queries, MS-SQL user-defined functions, and Python user-defined functions. All data processing, transfer, and storage were performed within the network infrastructure of the hospital. The transfer of data from the original source to the DW/DSC DB is automated, with ETL processes running daily, transferring cancer-specific target data. Access to sensitive data is strictly controlled through a role-based permission system, which assigns access levels based on user roles and responsibilities. To ensure system reliability and data integrity, fail-safe mechanisms are in place, including automated daily backups, geographically redundant storage, and real-time system monitoring to detect and address issues proactively. The user management system further enforces security through role-based access control, enabling administrators to define and customize specific roles, permissions, and access rights tailored to each user group’s needs.

In the clinical data extraction stage, we developed an ETL process, which includes Natural Language Processing (NLP), for each feature. The DSC DB serves as a reservoir containing raw medical text, (semi-) unstructured data, imaging files, next-generation sequencing (NGS) results, and Extensible Markup Language (XML) formats. In the initial phase of data processing, we tailored the database corpus from the DSC DB, optimizing the extraction and management of medical terminology, abbreviations, and recurrent misspellings (e.g., within pathology reports). Subsequently, the procured data underwent transformation through a specialized ETL algorithm designed to harmonize terminology based on assertions and the interrelationships of medical concepts. NLP was instrumental in utilizing CT and MRI interpretation counts from follow-up visits as criteria for individual selection.
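As a concrete illustration of a feature-level ETL step of the kind described above, the sketch below extracts a raw pathology text field from the DSC source DB, applies a simple rule-based transform, and loads the structured value into the YCDL target column. The table, column, and pattern names are hypothetical stand-ins for the institution-specific queries shown in Supplementary Figs. 1–9.

```python
# Minimal sketch of one feature-level ETL step (extract -> transform -> load).
# Schema names (dsc_breast.pathology_report, ycdl_breast.pathology, tumor_size_cm)
# are hypothetical placeholders for the YCDL schema.
import re
import pyodbc

def etl_tumor_size(src: pyodbc.Connection, dst: pyodbc.Connection) -> None:
    src_cur, dst_cur = src.cursor(), dst.cursor()
    rows = src_cur.execute(
        "SELECT AlsUnitNo, report_text FROM dsc_breast.pathology_report"
    ).fetchall()
    for unit_no, text in rows:
        # Transform: pull a "tumor size: 2.3 cm"-style mention out of free text.
        m = re.search(r"tumou?r size[:\s]*([\d.]+)\s*cm", text or "", flags=re.I)
        size_cm = float(m.group(1)) if m else None  # missing values stay NULL
        dst_cur.execute(
            "UPDATE ycdl_breast.pathology SET tumor_size_cm = ? WHERE AlsUnitNo = ?",
            size_cm, unit_no,
        )
    dst.commit()
```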
In the initial phase of data processing, we tailored the database corpus from the DSC DB, optimizing the extraction and management of medical terminology, abbreviations, and recurrent misspellings (e.g., within pathology reports). Subsequently, the procured data underwent transformation through a specialized ETL algorithm designed to harmonize terminology based on assertions and the interrelationships of medical concepts. To enhance user convenience in handling extensive cancer patient data, we developed a model based on imaging reports to determine the best responses and the timing of disease progression36. Imaging reports from 6574 patients were gathered, amounting to 97,119 CT readings. Among these, 9000 CT reports corresponding to 2859 patients were randomly subjected to multilabel manual labeling by four radiology experts, based on the RECIST version 1.1 classification (CR/NED, PR, SD, PD). The pretrained BERT-base-uncased model was employed and fine-tuned for the downstream task of multilabel classification. A total of 6765 reports were used for training, while the remaining 1000 reports were divided equally between the validation and test sets. The subsequent preprocessing phase employed tokenization techniques to structure the extracted data. SQL queries were harnessed to mine data from the primary DSC DB, facilitated by a DML management interface. For certain datasets requiring intricate extraction protocols, bespoke ETL strategies were devised using Python scripts crafted for each specific operation (Supplementary Fig. 10).
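The following sketch illustrates a fine-tuning setup of this kind using the Hugging Face transformers library; the source does not specify the training framework, and the example report, label order, and target vector are hypothetical. Data loading and the training loop are omitted.

```python
# Illustrative sketch of multilabel fine-tuning of bert-base-uncased for RECIST
# categories; the library choice (Hugging Face transformers), toy report, and
# toy target are assumptions, not the study's actual pipeline.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["CR/NED", "PR", "SD", "PD"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # sigmoid activations + BCE loss
)

report = "Decreased size of the hepatic metastasis; no new lesions."  # toy input
enc = tokenizer(report, truncation=True, return_tensors="pt")
target = torch.tensor([[0.0, 1.0, 0.0, 0.0]])   # toy multilabel target (PR)

out = model(**enc, labels=target)   # out.loss would drive each fine-tuning step
probs = torch.sigmoid(out.logits)   # per-label probabilities at inference time
```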
The overall process of NGS analysis has been detailed in our previous publication37. Targeted DNA and RNA sequencing was performed using the TruSight Tumor 170 (TST170, Illumina, San Diego, CA) and TruSight Oncology 500 (TSO500, Illumina) panels. Secondary analysis, including read alignment, variant calling, and variant annotation, utilized the TST170 Local App and the TSO500 Local App, respectively. For tertiary analysis, which involves additional annotation, variant filtering, prioritization, and producing interpretable output, we utilized an in-house pipeline designed to discard false positive variants, germline variants, and SNPs, ensuring the accuracy and reliability of the results. Variant interpretations were manually reviewed by institutional pathologists in accordance with guidelines from the Association for Molecular Pathology, the American Society of Clinical Oncology, and the College of American Pathologists38. Pathogenic variants categorized as Tier 1 were automatically and systematically collected and transferred to YCDL. A substantial proportion of the procedural steps were automated using OncoSTATION (Geninus, Seoul, Korea), as shown in Supplementary Fig. 11.
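The in-house tertiary pipeline itself is not reproduced here; as a simplified, hypothetical illustration of its final filtering step, the sketch below keeps only Tier 1 pathogenic variants for automatic transfer to YCDL, using assumed field names.

```python
# Hedged sketch of the final tertiary-analysis step: discard germline variants
# and common SNPs, then keep only Tier 1 (pathogenic) variants for transfer to
# YCDL. Field names and upstream annotation are hypothetical simplifications.
from typing import Iterable, List

def select_tier1_variants(variants: Iterable[dict]) -> List[dict]:
    kept = []
    for v in variants:
        if v.get("is_germline") or v.get("is_common_snp"):
            continue                        # filtered out during tertiary analysis
        if v.get("amp_tier") == "Tier 1":   # AMP/ASCO/CAP tier assigned at review
            kept.append(v)
    return kept

# Example record (toy): {"gene": "ERBB2", "hgvs_p": "p.L755S", "amp_tier": "Tier 1",
#                        "is_germline": False, "is_common_snp": False}
```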
Data quality control and accuracy assessment
After development, we implemented this system with our electronic health data, beginning with records from 2006. The profiles were updated using electronic health records, ensuring a comprehensive view of relevant oncological components over time. The present analysis is based on data collected up to March 2022. Key constituents of these profiles included demographics, diagnoses, clinical examination reports, pathology reports, treatment histories, and encounter specifics (Tables 4 and 5). The structures of these individual profiles were categorized into common, cancer-specific, and index columns. The common features held universal information across multiple cancer types (e.g., age, sex, and cancer diagnosis date) and accounted for 817 features, which was nearly 80% of the total. The cancer-specific features contained data relevant only to specific cancer types (for instance, pulmonary function tests in lung cancer) and comprised approximately 20% of the tables (Table 6). Data feature definitions and formats for all features across all cancer types are included in Supplementary Data 1, with necessary translations from Korean to English.
We developed a web-based computational platform for QC of data that scrutinizes potential data defects both automatically and manually on a daily basis, focusing on minimizing the role of the human component (Fig. 5). All data extracted and stored in the YCDL_cancer data repository were continuously evaluated and optimized to establish high-quality data outputs, adhering to standardized data and terminology. Programs for logical checks were configured to evaluate the distribution and continuity of data extracted by the SCL. Based on the QC results, the ETL code was continuously modified, thereby refining the QC logic to enhance the quality and accuracy of the automation. We examined four data quality measures (completeness, timeliness and usefulness, consistency, and accuracy) for all variables, in accordance with established data standards and pertinent aspects of data quality (Table 7). For instance, the logic was set such that the birth date of individuals would precede the date of the initial diagnosis. The analyses revealed that the batch processing method accurately identified erroneous data points, aligned with the established logic. Each piece of data was meticulously reviewed and optimized by a Quality Control Manager. Significant discrepancies or inaccuracies prompted an in-depth examination of the source data and respective ETL processes. Moreover, a hierarchy of data sources was established to resolve conflicts. The QC steps were continuously iterated within distinct closed-loop systems, adhered to the operational ontology, and were executed by independent QC personnel. This methodology gradually enhanced the accuracy of the cleansed target data with minimal intervention (Supplementary Fig. 12). We assessed the completeness of each individual’s accumulated features, including fundamental characteristics such as date of birth, initial diagnosis date, age, diagnosis code (ICD), TNM and overall stages, and the ICD-O morphology code.

We developed a web-based computational platform for data quality control (QC) that scrutinizes potential data defects both automatically and manually on a daily basis, focusing on minimizing the role of the human component. All data extracted and stored in the YCDL_cancer data repository were continuously evaluated and optimized to establish high-quality data outputs, adhering to standardized data and terminology. Programs for logical checks were configured to evaluate the distribution and continuity of data extracted by the SCL. Based on the QC results, the ETL code was continuously modified, thereby refining the QC logic to enhance the quality and accuracy of the automation. The analyses revealed that the batch processing method accurately identified erroneous data points, aligned with the established logic. Each piece of data was meticulously reviewed and optimized by a Quality Control Manager.
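As an illustration of the completeness measure and the temporal rule described above, the sketch below computes per-feature completeness for the fundamental characteristics and flags birth dates that do not precede the initial diagnosis; the column names are hypothetical stand-ins for the YCDL schema.

```python
# Sketch of the completeness measure for the fundamental features and the
# birth-date-precedes-diagnosis rule; column names are hypothetical.
import pandas as pd

FUNDAMENTAL = ["birth_date", "initial_diagnosis_date", "age", "icd_code",
               "tnm_stage", "overall_stage", "icdo_morphology_code"]

def completeness_report(df: pd.DataFrame) -> pd.Series:
    """Fraction of individuals with a non-missing value for each fundamental feature."""
    return df[FUNDAMENTAL].notna().mean().sort_values()

def birthdate_violations(df: pd.DataFrame) -> pd.DataFrame:
    """Rows where the birth date does not precede the initial diagnosis date."""
    return df[df["birth_date"] >= df["initial_diagnosis_date"]]
```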
To verify the accuracy of the NLP models used in our ETL process, we first evaluated the segmentation accuracy of logic-based NLP in both surgical and molecular pathology reports. We randomly selected 50 items from the top 30% in data completeness for each cancer type, where data completeness was defined as the ratio of columns filled with segmented data to the total number of pathology report features for each cancer type. Assessing the timing of the best response category is critical in understanding clinical outcomes during retrospective cancer patient data analysis. Rapid access to this information improves the efficiency of evaluating disease progression and treatment responses. Accordingly, we assessed the NLP model’s accuracy in automatically categorizing RECIST criteria36, predicting the timing of disease progression events within a ±30-day window, and identifying the best response and its timing within a ±45-day window.
The developed data warehouse showcased survival graphs by tumor stage and demonstrated the framework’s ability to expedite data collection for quick clinical hypothesis testing. Kaplan–Meier survival graphs were generated for all cancer types according to tumor stage, with 95% confidence intervals. Survival time was defined as the time interval between the initial diagnosis and death or the last follow-up. To demonstrate the efficiency of our data framework as a proof-of-concept for swiftly generating and evaluating clinical hypotheses, we present a detailed chronological progression of a previously published retrospective study. The clinical question chosen by one of the authors was whether the peripheral blood neutrophil-to-lymphocyte ratio before, during, or after neoadjuvant chemoradiotherapy for locally advanced rectal cancer is associated with an increased risk of distant metastases after primary rectal cancer surgery.
CDSS development and evaluation
To underscore the capabilities of our data framework for clinical applications, we developed a CDSS with a comprehensive and modular architecture to support efficient digital content creation and management, utilizing data from the YCDL server. The current User Interface (UI) front-end components are as follows: (1) patient information, (2) DICOM image visualization for PACS-integrated three-dimensional tumor display (viewing and interactive visualization of medical images and segmentation information in DICOM format), and (3) a longitudinal view of the complete patient journey (timeline visualization of patient data over time, highlighting key events and data points). Additionally, components for survival prediction based on data from previously treated patients, personalized news/journals, and clinical trial information are being developed for integration. The CDSS front end was built using JavaScript frameworks such as Svelte or React. It provides an interactive UI for users and handles the visualization of data received from the CDSS backend. Key functionalities include user interaction management and processing of the data visualization modules. The CDSS backend is implemented using FastAPI, a high-performance web framework for building APIs with Python. It processes data requests from the front end and performs computations for the various modules. Key functionalities include handling RESTful API requests, DICOM data processing, web scraping services, and visualization of DICOM data using the VTK library. The front-end modules include a 2D/3D Visualization Module and a Timeline Module. The 2D/3D Visualization Module visualizes DICOM images received from the backend as VTK images, supporting both 2D and 3D visualization of medical imaging data. The Timeline Module, based on EMR data, visualizes patient data as a timeline, arranging data chronologically to show key events and data points, and supporting interactions such as zooming and dragging. The backend modules include an AAA Module and a DICOM Image Transformation Module. The AAA Module handles authentication, authorization, and accounting, using JWT (JSON Web Tokens) for secure processing. The DICOM Image Transformation Module converts and processes DICOM images into VTK images using the VTK library. An overview of the described architecture is depicted in Supplementary Fig. 13.
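To make the backend design concrete, the sketch below shows a minimal FastAPI route of the kind described, returning timeline events for the Timeline Module behind a bearer-token dependency; the route path, payload shape, and token handling are hypothetical simplifications, and full JWT verification and database access are omitted.

```python
# Minimal, hypothetical sketch of a CDSS backend endpoint serving timeline data
# to the front-end Timeline Module; in production the token would be verified
# as a JWT by the AAA Module and events would be read from the YCDL database.
from typing import List
from fastapi import Depends, FastAPI
from fastapi.security import OAuth2PasswordBearer
from pydantic import BaseModel

app = FastAPI()
oauth2_scheme = OAuth2PasswordBearer(tokenUrl="token")  # rejects requests without a bearer token

class TimelineEvent(BaseModel):
    date: str        # ISO date of the event
    category: str    # e.g., "diagnosis", "surgery", "chemotherapy", "imaging"
    label: str       # short description rendered on the timeline

@app.get("/patients/{patient_id}/timeline", response_model=List[TimelineEvent])
def get_timeline(patient_id: str, token: str = Depends(oauth2_scheme)) -> List[TimelineEvent]:
    # Placeholder payload; a real implementation would query events for patient_id.
    return [TimelineEvent(date="2021-01-05", category="diagnosis",
                          label="Breast cancer, stage II (example)")]
```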
The data flow is as follows: users initiate requests through the web UI. The front end processes the user’s request and routes it to the appropriate front-end module. The backend processes data requests and interacts with the required backend modules, which process EMR and DICOM data to generate or retrieve the necessary information. The processed data is returned through each layer, ultimately displaying the results in the user interface. This design bolsters the accessibility of the system, guarantees platform independence, and ensures that users can access services across various device types.
Manual tumor segmentation data are required to use the three-dimensional tumor display with a longitudinal tumor-tracking function. If deep learning-based tumor auto-segmentation algorithms are developed, these models can be integrated into the pipeline39,40. The PACS-integrated method enables physicians to comprehensively track changes in overall trajectory patterns over a long period, fosters an environment that better explains the disease course to patients, and facilitates communication with referring physicians. Longitudinal changes in the overall disease burden were automatically generated using the prepared manual contours and displayed as graphs. The images were de-identified; however, if another image of the same patient was transferred later, the new images were allocated the same de-identified number, facilitating tumor tracking.
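As an illustration of how the longitudinal disease-burden graph can be derived from the prepared contours, the sketch below sums the contoured tumor volume at each imaging date and plots it over time; it assumes the segmentations have already been converted to binary arrays with known voxel spacing, and all names are hypothetical.

```python
# Sketch of the longitudinal disease-burden graph: total contoured tumor volume
# per imaging date, assuming binary masks (numpy arrays) and voxel spacing in mm
# have already been derived from the DICOM segmentations; names are hypothetical.
import numpy as np
import matplotlib.pyplot as plt

def total_volume_ml(mask, spacing_mm):
    """Total segmented volume in millilitres for one imaging study."""
    voxel_ml = float(np.prod(spacing_mm)) / 1000.0   # mm^3 per voxel -> mL
    return float(mask.sum()) * voxel_ml

def plot_disease_burden(studies):
    """studies: list of (imaging_date, binary_mask, voxel_spacing_mm) tuples."""
    dates = [d for d, _, _ in studies]
    volumes = [total_volume_ml(m, s) for _, m, s in studies]
    plt.plot(dates, volumes, marker="o")
    plt.xlabel("Imaging date")
    plt.ylabel("Total contoured tumor volume (mL)")
    plt.title("Longitudinal disease burden")
    plt.show()
```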
To investigate the effectiveness of the CDSS, we conducted a mock simulation with physicians (n = 33), comparing EMR-only assessments to assessments using both the EMR and CDSS. The simulation incorporated user evaluations to gather physician feedback on the patient assessment process. For the five randomly selected cases of patients with breast cancer, colorectal cancer, lung cancer, stomach cancer, and liver cancer, participants were asked to complete a survey for each case after assessing it using the EMR and CDSS. Using the evaluation framework from Kim et al.41, we investigated a total of 12 measures. For system quality, we assessed ease of system use, results understanding, terminology understanding, usefulness of the system, and user interface. For information quality, we evaluated information accuracy, information timeliness, information reliability, and up-to-dateness. For support factors, we examined decision support, processing time, and task satisfaction. All measures were coded on a 6-point scale, with 5 being the highest score and 0 being the lowest.
Responses