
Data Science Platforms for Patient-Centred Drug Discovery By Benjamin Glicksberg 



Bringing a new drug to market is a complex, lengthy endeavour beset with challenges. The traditional drug development process can span over a decade and requires substantial investment, on the order of hundreds of millions to billions of dollars. Furthermore, there is a high failure rate for novel drugs across the stages of development. Lastly, even if a drug gets approved, there is no guarantee that it will be equally beneficial to patients with different demographic and clinical characteristics. Despite the numerous successful treatments available, it is clear that new strategies should be developed to overcome these challenges to generate a higher likelihood of success and more treatment options for patients. With recent advancements in machine learning techniques and computing power, data science is poised to not only streamline drug discovery but also identify more personalised medicine applications.


Precision medicine aims to provide the right therapy for the right patient at the right time. It is becoming increasingly apparent that medicines do not work the same way for everyone. They can have varying levels of safety and efficacy for individuals with different characteristics. Some of this variability can be linked to genetics and is, therefore, especially relevant in diseases with high levels of genetic contribution. In certain cancers, for instance, there may be particular causal genetic mutations that drive disease pathogenesis. As such, therapies that target those specific disruptions will be beneficial to individuals with that particular genetic background. So-called complex diseases, like type 2 diabetes, are polygenic, often having multiple genetic components of smaller effects. Unlike targeting a single, likely causal genetic signal, personalising medicine for these types of diseases requires a multifaceted approach. The challenge is further compounded as the manifestations of these diseases can often be varied between individuals. Emerging biostatistical techniques and increased availability of human genetic data across complex diseases are facilitating strategies to identify the best therapies for individuals across heterogeneous diseases.


Progressive diseases, such as Alzheimer’s disease, develop over time and are associated with the complex physiological process of ageing, leading to another set of unique problems in developing novel therapies. Most importantly, enough data must be collected as the disease develops and progresses to understand not only the physiological changes but also which components, such as genetics, could drive these changes. Data spanning years, with genomic data at discrete states, are exorbitantly costly and operationally difficult to collect. While various initiatives and databases, like the UK Biobank and the All of Us Research Program, try to accumulate such data, large gaps still remain for the comprehensive study of longitudinal, complex diseases.

Osteoarthritis, like other progressive diseases, has varying degrees of severity, often stratified by stages, which reflect key milestones in pathophysiological, molecular, and morphological changes. These stages are distinctly marked by the presence of biological traits that develop throughout the course of the disease. For instance, moderate osteoarthritis is characterised by joint space narrowing, while severe osteoarthritis is characterised by large reductions in joint space and significant osteophyte growth. Therapeutics aimed at treating osteoarthritis can have different strategies by targeting various stages of the disease. While “reversing” the disease course is ideal, this is extraordinarily challenging, and strategies of this kind are limited across medicine as of now. Many therapeutic strategies, instead, focus on delaying the progression or growth of a key biophysical property or preventing or prolonging conversion to more advanced stages of the disease. In order to pursue therapeutic strategies at various stages of the disease course, it is imperative to effectively characterise disease progression according to pathophysiological properties and relevant biological pathways.


To bridge the gaps that currently exist in the drug development space for complex, progressive diseases, it is essential to develop precision therapeutics based on personalised characteristics, such as genomics. The goal of precision medicine is not only to identify effective targets based on these characteristics but also to determine who should receive the resulting therapies and when. Clinicogenomic data, which couples longitudinal patient data with genomics, is essential for this goal. Such patient data, like clinical diagnoses and imaging, is necessary to disentangle the complex mechanisms of disease progression over time. Unfortunately, it is often the case that no comprehensive dataset meets these requirements, at least for diverse patient populations. As the field grows, it is imperative that representative clinicogenomic biobanks be developed from a network of consented patients across the country. Both clinics and patients must agree that the data collected during regularly scheduled visits can be utilised for analyses.

For genetically defined progressive diseases, it is imperative to study changes in progression rates that coincide with relevant endpoints for registrational trials rather than disease risk. It is also necessary to study the underlying data modalities that are assessed as part of clinical trials. In many progressive diseases, such relevant endpoints must be extracted from imaging data. Chronic obstructive pulmonary disease (COPD), for instance, relies upon high-resolution CT scans to detect structural alterations in the lungs, like fibrosis. In osteoarthritis, structural endpoints include joint space narrowing, which can be detected in X-rays. Clinicogenomic biobanks are poised to facilitate such analyses, as these data are often collected during routine care. These biobanks often consist of DNA and longitudinal clinical data from Electronic Health Records (EHR), text from patient notes, and imaging data.


Multi-modal data science platforms can be used for novel drug target discovery and patient stratification. The key basis for these applications is the focus on disentangling the progression of a disease rather than incidence. Specifically, all individuals in such a cohort should have the disease of interest, allowing the focus to be on why some individuals progress more quickly than others rather than why an individual develops the disease in the first place.

The nuance here is critical, as certain genetic variants may play a large role in influencing progression but are “hidden” by the overwhelmingly large signal from genes that confer disease risk. Characterising only “cases” over time by progression (covered more thoroughly in the next section) allows for more nuanced analyses that could reveal key genetic modifiers. The genetic markers identified from these analyses could be used as the basis for drug targets: if these factors control progression, perhaps targeting them with therapeutics at key time points could curtail further advancement.

In addition to novel target discovery, decoding these markers can also be used for patient stratification. The concept of patient stratification aligns with the premise of personalised medicine: diseases may manifest via different mechanisms in individuals and, therefore, require tailored treatment selections. The goal of patient stratification is to identify which individuals would be most likely to respond to treatments. This goal can be achieved in concert with drug development: if therapies are developed targeting certain genetic markers, patients can be stratified by the carrier status of these markers. Risk scores can be generated per patient from their genetic status to stratify patients into biomarker-positive and biomarker-negative groups. It can then be hypothesised that patients who are biomarker-positive should respond better to the developed treatment. While this strategy doesn’t necessarily benefit those who are biomarker-negative, it at least allows for more informed treatment decisions; there certainly is a benefit in not giving a treatment that is ineffective and/or carries an increased risk of adverse events for a given individual.
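The carrier-status stratification described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular platform: the patient records, the threshold `min_copies`, and the function name `stratify_by_carrier_status` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical toy records: number of copies (0, 1 or 2) of a made-up
# target variant carried by each patient. This is not real data.
genotypes: Dict[str, int] = {"P1": 0, "P2": 1, "P3": 2, "P4": 0}


def stratify_by_carrier_status(
    dosages: Dict[str, int], min_copies: int = 1
) -> Dict[str, List[str]]:
    """Split patients into biomarker-positive and biomarker-negative groups."""
    groups: Dict[str, List[str]] = {"positive": [], "negative": []}
    for patient_id, copies in sorted(dosages.items()):
        # A patient counts as biomarker-positive when they carry at least
        # `min_copies` copies of the variant (an assumed, illustrative rule).
        label = "positive" if copies >= min_copies else "negative"
        groups[label].append(patient_id)
    return groups


groups = stratify_by_carrier_status(genotypes)
print(groups)
```

In practice the rule for calling a patient biomarker-positive would be set by the biology of the target and validated clinically; the single-variant threshold here only shows the shape of the idea.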

Of course, the success of discovering novel drug targets of progression and the associated patient stratification strategies using multi-modal patient biobanks depends on the quality and composition of the underlying data. Furthermore, the jump from dataset to biomarker often requires intricate analyses based on biostatistics and machine learning that address underlying challenges in the data.


The power of real-world data, or patient data collated from routine care, is readily apparent. Real-world data has unlocked a world of research beyond the confines of clinical trials and prospective studies. That being said, there are certain issues that can arise in the analysis of such data if not taken into proper context. Patient data are often not uniform in many datasets. Data collected from multiple clinics in different health systems in diverse geographic settings may have slightly different data type representations.

For instance, cardiac MRIs that are collected as part of hypertrophic cardiomyopathy monitoring can be captured on machines from different manufacturers, which, in turn, have slightly different output formats. Visualisations can come in different resolutions and scales. Furthermore, the practising physician may acquire images under slightly different protocols, based on their experience and at their discretion. Depending on the scale at which they need to analyse the pathology, they may choose to view specific regions, fields of view, or axes (planes) of the heart. Put together, while there are many similarities in the data collected within a single disease, there is actually a large amount of underlying heterogeneity that needs to be taken into account.

Real-world data are subject to many potential biases, especially when the data are this heterogeneous. One issue is information bias: does each site define disease presence and stage in the same way? Another issue relates to bias by indication, which can be reflected in the decision of which imaging protocol is selected and when. In more advanced stages of the disease, the treating physician may want to perform a 3D visualisation at a finer level of detail, where cheaper, more “crude” imaging would have sufficed in earlier stages. Similarly, the clinician may decide to perform imaging more frequently to detect minute changes over a short period of time that may reflect conversion to a more severe disease state. Therefore, the mere frequency of imaging available or the number of slices in a given scan can “leak” information due to its relationship with the disease state. It is also challenging to compare data across sampling protocols, as doing so is not exactly comparing apples to apples.
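One simple way to look for the imaging-frequency leak described above is to compare how often patients are scanned at each disease stage. The sketch below is illustrative only: the records are made-up toy values and `mean_scans_by_stage` is a hypothetical helper. A large gap between stages suggests that scan count acts as a proxy for disease state and should be kept out of model inputs.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (patient_id, disease_stage, number_of_scans).
records = [
    ("P1", "early", 2),
    ("P2", "early", 3),
    ("P3", "late", 7),
    ("P4", "late", 9),
]


def mean_scans_by_stage(rows):
    """Average number of scans per patient within each disease stage."""
    by_stage = defaultdict(list)
    for _patient_id, stage, n_scans in rows:
        by_stage[stage].append(n_scans)
    return {stage: mean(counts) for stage, counts in by_stage.items()}


scan_rates = mean_scans_by_stage(records)
print(scan_rates)
```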

These are just a few of the many potential biases that can exist, and careful steps need to be taken to separate true signal from noise in real-world data. Data quality control and standardisation should be performed at all steps of the process. Provider sites should be examined before enrolment to ensure data are captured electronically and in an accessible format. Ideally, the format of the data should also be interoperable, such as the HL7 Fast Healthcare Interoperability Resources (FHIR) standard for EHRs. Robust electronic phenotyping should be conducted to ensure the images taken match the diagnosed disease stage, as errors in data collection can occur. This verification step should ideally be performed by independent clinical experts. Internal bias checking is also imperative. One check is to apply strict inclusion/exclusion criteria to ensure patients have a sufficient amount of data across modalities and time. Relatedly, one should make sure that there are no systematic differences between patients with and without comprehensive data, as such differences may reflect a bias in access to healthcare. All biomedical images that are analysed should be standardised, with extraneous information removed before modelling. If different machine types are involved, some kind of calibration should be performed to align output values. In real-world datasets, the quality and robustness of data are often tied to the uniformity of screening practices. Regardless, proper quality control of any real-world dataset is imperative for robust insights.
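As one hedged illustration of the calibration step, the sketch below rescales measurements within each machine type so that values from different scanners share a common scale. Z-scoring is only one simple option and is an assumption here, not a method named in the article; the scanner names, intensity values, and the function `zscore_within_scanner` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical toy measurements: (scanner_type, raw_intensity).
measurements = [
    ("scanner_A", 100.0), ("scanner_A", 110.0), ("scanner_A", 120.0),
    ("scanner_B", 1000.0), ("scanner_B", 1100.0), ("scanner_B", 1200.0),
]


def zscore_within_scanner(rows):
    """Rescale each value to mean 0 and unit standard deviation per scanner type."""
    by_scanner = defaultdict(list)
    for scanner, value in rows:
        by_scanner[scanner].append(value)
    stats = {s: (mean(v), pstdev(v)) for s, v in by_scanner.items()}

    calibrated = []
    for scanner, value in rows:
        mu, sigma = stats[scanner]
        # Guard against a scanner whose readings are all identical.
        z = (value - mu) / sigma if sigma > 0 else 0.0
        calibrated.append((scanner, z))
    return calibrated


calibrated = zscore_within_scanner(measurements)
for scanner, z in calibrated:
    print(scanner, round(z, 3))
```

This approach assumes the patient mix is similar across scanners. If one machine type is used mainly for advanced disease, per-scanner rescaling would remove real biological differences along with the technical ones.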


As mentioned, many progressive diseases advance in stages that are characterised by specific pathophysiological changes, which are often captured in biomedical images. While it is useful to analyse gross progression on this scale, such as a time-to-event analysis of conversion to later stages of the disease, studying more nuanced aspects of each stage will allow for a more refined understanding through a genetic lens. Accordingly, deep phenotyping refers to the comprehensive, detailed, and systematic analysis of phenotypic abnormalities (observable traits or characteristics) in an individual, often within the context of a particular disease. Deep phenotyping often involves assessing the intricate interplay of molecular, physiological, and environmental factors that give rise to observed clinical symptoms or traits. Not all of these aspects can be directly measured with conventional technologies, but real-world data can partially serve this purpose by identifying observable patterns reflective of underlying endophenotypes.

Put together, one primary goal of deep phenotyping is to move from an abstract, qualitative, or broad description to a more quantitative representation or assessment. For instance, the intermediate stage of dry age-related macular degeneration is characterised by the presence of large drusen (tiny yellow or white lipid deposits under the retina), mild vision loss or changes, and/or potential pigmentary changes. There is, of course, a lot of heterogeneity contained within this classification, even within the pre-specified ranges for inclusion (e.g., >125 µm drusen size). Additionally, patients can have different combinations of the three features, which may reflect divergent etiologies and, therefore, potential novel drug targets. This enhanced granularity is particularly valuable in the realms of personalised medicine and genomics, as it allows researchers to link specific genetic variants with detailed phenotypic outcomes instead of the overall disease stage.

Deep phenotyping can represent patients along more nuanced and multi-faceted lines. As mentioned, deep phenotyping has incredible potential in diseases with image profiling, where many pathophysiological changes are observed and tracked. CT scans, for instance, can help visualise tumours, which are used to grade colon cancer stages. Exploring tumour biomarkers as quantitative measures such as size and tissue layer location, rather than presence/absence, can facilitate more nuanced genotype/phenotype associations, which will be explored in more detail below. However, manually quantifying the hundreds of thousands to millions of images for these features by experts would be exorbitantly costly and time-intensive. Therefore, machine learning techniques like computer vision can be used to achieve this at scale.

Computer vision relies on a subset of machine learning called deep learning, which allows the processing of images through multiple layers of neural network connections. Recent advancements in computer vision have made a huge impact on everyday life, from facial recognition to automated driving. In medicine, computer vision can be used to analyse images automatically for tasks such as classification, i.e., determining if an image is of a certain class, or segmentation, i.e., identifying and outlining certain features of interest.

Colon cancer can be used as an illustrative example of the utility of imaging biomarkers. The stages of colon cancer are in part separated by tumour size and location within tissue layers. Therefore, it would be invaluable to quantify and localise tumours within CT scans automatically and at scale. In order to build models to achieve this task, experts have to provide training examples for the machine to learn from. These examples contain manually labelled images for the features of interest and examples for which no feature is present (negative controls).

Successful application of these segmentation models allows for the quantification of the “real world” size of these features in all images for all patients across time. In this way, not only can size be quantified and tracked, but changes over time can be calculated, forming progression phenotypes, both within and across individuals. Clear patterns often emerge that differentiate patients along these lines: some individuals have tumours that grow rapidly, while others have tumours that stay the same size for prolonged periods of time. Furthermore, the localisation of tumours can also be compared and contrasted across individuals to define another phenotype. The rate at which the tumour invades the various layers of the colon can be quantified and compared via computer vision. Additionally, the rate at which cancer spreads to other organs, if at all, can also be compared.
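A progression phenotype of the kind described above can be as simple as the slope of a measured feature over time. The sketch below fits an ordinary least-squares line to each patient's size measurements. It assumes the sizes have already been extracted by a segmentation model; the toy values and the function name `growth_rate` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy series per patient:
# (years since first scan, segmented tumour diameter in mm).
series: Dict[str, List[Tuple[float, float]]] = {
    "P1": [(0.0, 10.0), (1.0, 14.0), (2.0, 18.0)],  # fast grower
    "P2": [(0.0, 10.0), (1.0, 10.0), (2.0, 10.0)],  # stable
}


def growth_rate(points: List[Tuple[float, float]]) -> float:
    """Ordinary least-squares slope of size against time (mm per year)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two time points")
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in points)
    return sxy / sxx


progression = {pid: growth_rate(pts) for pid, pts in series.items()}
print(progression)
```

A straight-line slope is a deliberate simplification. Real progression is often non-linear and irregularly sampled, so mixed-effects or time-to-event models may be more appropriate.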

Put together, these traits are just some of the unique ways by which a heterogeneous disease like colon cancer can be investigated. Computer vision and longitudinal imaging data can form progression phenotypes across various dimensions, each of which allows for a personalised understanding of nuances that are highly variable between patients. Coupling these progression phenotypes with genetics can allow for the identification of signals that can explain the underpinnings of the heterogeneity.


Generating these multifaceted progression phenotypes is just a necessary first step for personalised medicine drug discovery. Most genetic associations with diseases are identified as those of susceptibility, or in other words, genetic signals that differentiate those with a disease from those without. Separate from overall disease development, there is growing evidence that there are also genetic signals that control the progression of disease, which may differ from those that are associated with susceptibility. As mentioned, the genetic variants that mediate progression may be fruitful drug targets that often remain hidden due to the difficulty of modelling complex progression patterns. Deep phenotyping of imaging biomarkers can enable the generation of phenotypes that track progression along the various axes or elements that constitute complex progressive diseases. Modelling how these endophenotypes, or components, change over the course of the disease allows for more refined genetic association analyses of progression. Performing genetic analyses like Genome-Wide Association Studies (GWAS) on these progression phenotypes can reveal signals that mediate severity and are not apparent when comparing cases vs. controls. Many of these signals are novel, but some can overlap with susceptibility genes, indicating multiple functions of those variants. The hits identified from progression GWAS comparisons can then be funnelled into subsequent selection and screening steps to determine the feasibility of moving forward with drug development based on these genetic signals.
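At its core, an association analysis on a progression phenotype tests each variant in turn. The sketch below regresses a toy progression phenotype on allele dosage for two made-up variants and ranks them by variance explained. It is a deliberately stripped-down illustration: a real GWAS would also adjust for covariates such as ancestry, age, and sex, compute p-values, and correct for multiple testing. All names and values are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: one progression phenotype value per patient
# (e.g. feature growth per year) and allele dosages (0, 1 or 2).
phenotype: List[float] = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
dosages: Dict[str, List[int]] = {
    "variant_1": [0, 0, 1, 1, 2, 2],
    "variant_2": [1, 0, 2, 0, 1, 2],
}


def effect_and_r2(x: List[int], y: List[float]) -> Tuple[float, float]:
    """Least-squares effect size per allele copy and variance explained (r^2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("variant or phenotype has no variation")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, (sxy ** 2) / (sxx * syy)


results = {name: effect_and_r2(d, phenotype) for name, d in dosages.items()}
# Rank variants by how much phenotype variance they explain.
ranked = sorted(results, key=lambda name: results[name][1], reverse=True)
for name in ranked:
    beta, r2 = results[name]
    print(f"{name}: beta={beta:.3f}, r2={r2:.3f}")
```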

Apart from uncovering new drug targets and developing drugs based on them, another prime objective of data science platforms is response-based patient stratification, or discerning which patients would most likely benefit from these therapeutics. The hypothesis behind this aim is that individuals with disruptions in a specific genetic signal, targeted by a therapeutic, stand to gain the most benefit from it. These genetic signals, typically originating at the Single Nucleotide Polymorphism (SNP) level, reside in genes – collections of nucleotides often numbering in the tens of thousands or more. Genes can then be categorised based on various pathways or cascades of interconnected biological functions. Individuals have slight variations in SNPs, some of which have been associated with causing issues, while others confer no functional or observable differences. Patients can accordingly be characterised by having genetic variations in known (i.e., from the literature) or discovered (i.e., via a data science platform) genetic signals relating to the disease. This disease burden can be reflected as a Polygenic Risk Score (PRS), which is based on the cumulative effect of multiple genetic variants. PRS can be conceived for patient stratification across multiple dimensions, from disease susceptibility to progression phenotype and beyond. This stratification can also be used for predicting therapeutic response. For instance, PRS can be further characterised by relevant biological pathway burden to match the purported mechanism of developed drug targets. Patients with a high PRS in the targeted pathways may be prioritised to receive the therapeutic, while those with a low PRS may fare better with an alternative medication. In this way, clinicogenomic data science platforms can be a two-sided coin: seamless interconnectivity between genetic discovery and application.
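In its simplest form, a polygenic score is a weighted sum of a patient's allele dosages, and a pathway-specific score restricts that sum to variants in the pathway of interest. The sketch below shows both. The variant names, weights, and pathway labels are hypothetical, and real scores involve further steps, such as variant selection and normalisation, that are omitted here.

```python
from typing import Dict, Optional

# Hypothetical per-variant effect weights (e.g. from an association
# analysis) and the pathway each variant is assigned to.
weights: Dict[str, float] = {"snp_a": 0.25, "snp_b": 0.5, "snp_c": -0.25}
pathway_of: Dict[str, str] = {
    "snp_a": "inflammation",
    "snp_b": "inflammation",
    "snp_c": "lipid",
}

# Hypothetical allele dosages (0, 1 or 2) per patient.
patients: Dict[str, Dict[str, int]] = {
    "P1": {"snp_a": 2, "snp_b": 1, "snp_c": 0},
    "P2": {"snp_a": 0, "snp_b": 0, "snp_c": 2},
}


def polygenic_score(
    dosage: Dict[str, int],
    weights: Dict[str, float],
    pathway_of: Dict[str, str],
    pathway: Optional[str] = None,
) -> float:
    """Weighted sum of dosages, optionally restricted to a single pathway."""
    return sum(
        weight * dosage.get(variant, 0)
        for variant, weight in weights.items()
        if pathway is None or pathway_of[variant] == pathway
    )


for pid, dosage in patients.items():
    overall = polygenic_score(dosage, weights, pathway_of)
    inflammation = polygenic_score(dosage, weights, pathway_of, "inflammation")
    print(pid, overall, inflammation)
```

A patient such as the toy `P1`, whose burden sits in the pathway a therapeutic targets, is the kind of individual the stratification above would prioritise.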


The future is both exciting and promising for the role of data science platforms in precision drug discovery. There are many other ways in which data science platforms can support, refine, and enhance the drug discovery and application processes beyond what has been described so far. There is continued and growing interest in companion diagnostics, or FDA-approved screening processes that officially designate who should be prescribed a therapeutic based on some personalised condition. While genetics can fit this need, it is only the beginning. Human biology is a multi-scale collection of systems at various molecular and cellular layers. Other -omics, like transcriptomics, proteomics, and beyond, can be used to personalise medicine in a fuller sense, which will undoubtedly be more successful than any single level in isolation. On a practical level, data science platforms can also help refine clinical trial inclusion and exclusion criteria, that is, who should or should not enter a clinical trial. Retrospective real-world data can be used to perform data-driven calculations of the collections of patient features and endophenotypes that are most associated with an outcome of interest. In this way, a platform can be a continuous learning machine where past data help inform future studies. We are only on the cusp of this data-driven renaissance in drug development. The fusion and sustained assimilation of multi-omic data with refined, nuanced longitudinal phenotypes are poised to catapult personalised pharmaceutical innovation, surmounting many of the limitations that challenge the field today.

© Data Science Talent Ltd, 2024. All Rights Reserved.