Big Data in Drug Development and Discovery

The Innovation Issue Feature: Part 1

The Next Step for Innovation: Data Integration

Integrating data across the drug life cycle, from development to clinical trials, commercialization and packaging, is crucial to making medicine more precise and effective. Data implemented across the supply chain will offer more targeted options for patients and contribute to an organization's bottom line by reducing or even eliminating the guesswork involved in bringing drugs to market and ensuring patient adherence. Ultimately, the chief challenge in harnessing big data as a tool for innovation is its vastness.

Pharma is not the only industry working toward data integration as a means for improvement of processes and products — nearly every industry can benefit from practical data analysis. However, as far as patient safety and public health are concerned, being able to integrate data into all phases of the supply chain offers advantages throughout, starting in the lab and — most importantly — ending with the patient.

Given the demands of patient-centric medicine, real-world data (RWD) must be implemented to offer more effective therapies targeted to the patient population taking them. As biologics grow in popularity, and cell and gene therapy candidates become viable options, harnessing data in the lab seems to be an increasingly important element in the drug development process.

Data in Development

Outlets for RWD capture include electronic health records (EHRs). The use of EHRs is well established; however, pooling records into larger shared databases that labs can monitor might be key to the next wave of innovation and to making health records a tool for progress. One such database is the Clinical Practice Research Datalink, funded by the NHS National Institute for Health Research (NIHR) and the Medicines and Healthcare products Regulatory Agency (MHRA) in the UK. Information generated from this source has resulted in over 2,000 published works related to drug safety, as well as overall clinical guidelines and best practices.1

Similarly, the Italian Medicines Agency, known as Agenzia Italiana del Farmaco (AIFA), has devised a registry system for RWD collection in order to confirm that medicines are meeting their necessary targets; this includes approximately 120 registries, with 80 medications monitored across more than 50 indications.1 The information from this database is widely shared: the data is available in over 21 regions and in more than 1,000 hospitals. Over 24,000 clinicians, 1,500 pharmacists and 32 marketing authorization holders are able to access the system, which helps sustain its relevance. These databases are significant and an obvious next step in generating information for initial lab research.

Big data will quantify the needs of various demographic populations. RWD especially will be a tool for innovation, as it leads to a clearer understanding of how disease is expressed across patient groups — and may also serve in ascertaining risk and diagnosis. For example, the Taiwan Health Insurance Database consists of more than 782 million outpatient visits. These visits were plotted in the Cancer Associations Map Animation (CAMA).1,2 CAMA is so extensive that it is actually able to predict risk factors and modifiers. When put into practice, CAMA may also be able to help researchers discover new approaches to treatments and form early-stage hypotheses faster. This is one example of how targeted data application can lead to innovation in therapies and will likely change the way a disease is understood and studied.

Cause and Effect Research — Rethinking Diseases through Data

Data mining in the lab could become the ultimate diagnostic tool; if the data is available and robust enough to enable predictions, previously unknown at-risk groups will become more treatable. By mining data for early risk factors, a potential health issue can be detected even before symptoms occur. Again, for this to be successful, the data must be available, which is why shared EHRs that transcend international borders are a fundamental tool in the next wave of innovation.

An example of this was gleaned by monitoring the records of 25 million patients from the U.S. Veterans Administration. It was discovered that those suffering from periodontal disease were 1.4 times more likely to also have rheumatoid arthritis.3 This surprising relationship had previously been noted on a smaller scale, with a correlation hypothesized; the large-scale big data study was able to confirm it. The data sample matched previous findings, further substantiating the connection between the two. Though this shows that a higher incidence of one disease can be a marker for another, it has yet to be determined whether aggressive periodontal treatment will affect the occurrence of rheumatoid arthritis.3
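
A figure such as "1.4 times more likely" is a ratio of how often a condition appears in two groups. The following is a minimal sketch of that arithmetic, not the cited study's method; the function name and the patient counts are hypothetical and were chosen only so that the ratio comes out near 1.4.

```python
def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Ratio of how often the outcome appears in the exposed vs. unexposed group."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    return risk_exposed / risk_unexposed


# Hypothetical counts, invented for illustration only (not from the cited study):
# "exposed" = patients with periodontal disease, "cases" = rheumatoid arthritis.
rr = relative_risk(
    exposed_cases=140, exposed_total=10_000,
    unexposed_cases=100, unexposed_total=10_000,
)
print(round(rr, 2))
```

A ratio above 1 indicates that the two conditions co-occur more often than expected; on its own it says nothing about whether one causes the other.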

This cause and effect realization brought about by access to big data analysis is not an isolated case. In another study, which modeled time-stamped relationships of 41.2 million classifications of diseases in 1.6 million patients, it was discovered that diabetes preceded the diagnosis of Helicobacter pylori infection (a bacterium linked to ulcers), sparking questions about the link between the infection and diabetes.4
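
The idea of a time-stamped relationship can be sketched as follows: for each patient who has both diagnoses on record, check which one was recorded first, then count the orderings across patients. This is an illustrative simplification rather than the cited study's model; the records, labels and function name are all hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical time-stamped diagnoses: (patient_id, diagnosis, date recorded).
records = [
    ("p1", "diabetes", date(2010, 3, 1)),
    ("p1", "h_pylori", date(2012, 6, 15)),
    ("p2", "h_pylori", date(2011, 1, 10)),
    ("p2", "diabetes", date(2013, 9, 5)),
    ("p3", "diabetes", date(2009, 7, 20)),
    ("p3", "h_pylori", date(2009, 11, 2)),
    ("p4", "diabetes", date(2014, 2, 2)),  # only one diagnosis: skipped below
]


def count_orderings(records, first, second):
    """Count, per patient, which of two diagnoses was recorded earlier."""
    # Keep the earliest date on which each diagnosis appears for each patient.
    earliest = {}
    for patient_id, diagnosis, when in records:
        key = (patient_id, diagnosis)
        if key not in earliest or when < earliest[key]:
            earliest[key] = when

    counts = Counter()
    for patient_id in {pid for pid, _ in earliest}:
        a = earliest.get((patient_id, first))
        b = earliest.get((patient_id, second))
        if a is None or b is None:
            continue  # patient does not have both diagnoses
        if a < b:
            counts[f"{first}_first"] += 1
        elif b < a:
            counts[f"{second}_first"] += 1
        else:
            counts["same_day"] += 1
    return counts


orderings = count_orderings(records, "diabetes", "h_pylori")
print(orderings)
```

A consistent skew toward one ordering is what prompts follow-up questions; the order in which diagnoses are recorded can also reflect when patients were tested, so it does not establish causation.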

New findings stemming from big data analyses have the potential to completely upend nosology — the system by which diseases are classified — which has traditionally relied on correspondences in symptoms and anatomy. A recent study examined pairwise genetic and environmental correlations of 29 complex diseases using medical records of 128,989 U.S. families.5 While the data largely clustered into groups that matched the conventional nosology reported in the current International Classification of Diseases (ICD-9), some results fell outside this taxonomy. Most notably, migraine — which is most closely linked with eye inflammation in a cluster of central nervous system and sensory organ diseases in ICD-9 — exhibited a tighter genetic association with inflammatory conditions like irritable bowel syndrome, cystitis and urethritis, suggesting that the existing nosology requires revision.
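
One way to picture the comparison described above is to take each disease, find the disease it correlates with most strongly, and check whether the two share a category in the existing taxonomy. The sketch below does only that; it is not the clustering method used in the cited study, and the correlation values, category labels and function name are hypothetical.

```python
# Hypothetical pairwise genetic correlations (invented values, illustration only).
correlations = {
    ("migraine", "eye_inflammation"): 0.20,
    ("migraine", "irritable_bowel_syndrome"): 0.55,
    ("migraine", "cystitis"): 0.48,
    ("eye_inflammation", "irritable_bowel_syndrome"): 0.10,
    ("eye_inflammation", "cystitis"): 0.12,
    ("irritable_bowel_syndrome", "cystitis"): 0.50,
}

# Hypothetical category labels standing in for a conventional taxonomy.
category = {
    "migraine": "nervous_system_and_sense_organs",
    "eye_inflammation": "nervous_system_and_sense_organs",
    "irritable_bowel_syndrome": "digestive",
    "cystitis": "genitourinary",
}


def strongest_partner(disease):
    """Return the disease most strongly correlated with `disease`, and the value."""
    best, best_r = None, float("-inf")
    for (a, b), r in correlations.items():
        if disease in (a, b) and r > best_r:
            best = b if a == disease else a
            best_r = r
    return best, best_r


# Diseases whose closest partner sits in a different category of the taxonomy.
mismatches = [
    d for d in category
    if category[strongest_partner(d)[0]] != category[d]
]
print(mismatches)
```

Diseases that land in the mismatch list are the ones whose data-driven neighbors disagree with their assigned category, which is the kind of signal that would motivate revisiting a classification.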


Diagnosis via Algorithm

Data can be used as a tool to identify previously undiagnosed patients, even before their symptoms manifest. Through the use of data mining and algorithms, research may be able to identify high-risk individuals, especially for disease indications that are less obvious. Data mining is also the least invasive way to determine a diagnosis.

As a next step in drug development, AI applied to EHRs and other data sets could elucidate early signs of disease and previously unknown causal disease relationships, especially in rare or novel diseases and in early stages.1

The Future of Drug Development — Deep Learning

Artificial intelligence must be incorporated into the lab in order to make data mining for drug development a real possibility. If successful, AI can be used to diagnose disease and predict drug efficacy and toxicity.5 Deep-learning AI in drug discovery will be able to extract key features from large data sets and can be used to create leads and predict outcomes. The company that is able to elevate AI in the lab will benefit from substantially lower materials costs and reduced risk.

Granted, data must be qualified to be meaningful.6 As wearables become part of everyday life, with smartphones recording personal health data daily, the amount of available data is increasing at a tremendous pace. Harnessing this data will be the next frontier in innovation.

Companies that are able to curate these data sets and apply the data will have an extreme competitive advantage in drug discovery and development, as well as scale-up and manufacturing. The search for the right algorithm or AI is the new arms race in the pharmaceutical and biotechnology sector, as data mining will deepen our understanding of disease and lead to better therapies for a wider range of patients.  

Read Part 2: Big Data in Drug Manufacturing
Read Part 3: Big Data in Drug Packaging and Distribution



  1. Singh, Gurparkash, Duane Schulthess, Nigel Hughes, Bart Vannieuwenhuyse, Dipak Kalra. “Real world big data for clinical research and drug development.” Drug Discovery Today. 23: 652–660 (2018).
  2. Iqbal, U., C.K. Hsu, P.A. Nguyen, R. Lu, et al. “Cancer–disease associations: A visualization and animation through medical big data.” Comput. Methods Programs Biomed. 127: 44–51 (2016).
  3. Grasso, Michael A., Angela C. Comer, Dana D. DiRenzo, Yelena Yesha, Naphtali D. Rishe. “Using Big Data to Evaluate the Association between Periodontal Disease and Rheumatoid Arthritis.” AMIA Annu. Symp. Proc. 2015: 589–593 (2015).
  4. Hanauer, David A., Naren Ramakrishnan. “Modeling temporal relationships in large scale clinical associations.” Journal of the American Medical Informatics Association. 20(2): 332–341 (2013).
  5. Wang, Kanix, Hallie Gaitsch, Hoifung Poon, Nancy J. Cox, Andrey Rzhetsky. “Classification of common human diseases derived from shared genetic and environmental determinants.” Nat. Genet. 49: 1319–1325 (2017).
  6. Basak, Sayan, Sukant Khurana. “Artificial Intelligence for modern drug development.” Medium. 13 May 2018. Web.

David Alvaro, Ph.D.

David is Scientific Editorial Director for That’s Nice and the Pharma’s Almanac content enterprise, responsible for directing and generating industry, scientific and research-based content, including client-owned strategic content. Before joining That’s Nice, David served as a scientific editor for the multidisciplinary scientific journal Annals of the New York Academy of Sciences. He received a B.A. in Biology from New York University and a Ph.D. in Genetics and Development from Columbia University.