The pharmaceutical industry is expected to spend more than $3 billion on artificial intelligence by 2025 – up from $463 million in 2019. AI clearly adds value, but advocates say it is not yet living up to its potential.
There are many reasons the reality hasn’t yet matched the hype, but limited datasets are a big one.
Given the sheer volume of data collected every day – from steps walked to electronic medical records – a scarcity of data is one of the last barriers one might expect.
The traditional big data/AI approach uses hundreds or even thousands of data points to characterize something like a human face. For that training to be reliable, thousands of datasets are required so that the AI can recognize a face regardless of gender, age, ethnicity or medical condition.
For facial recognition, examples are readily available. Drug development is a different story altogether.
“When you imagine all the different ways you could tweak a drug… the dense quantity of data that covers the entire range of possibilities is less abundant,” Adityo Prakash, co-founder and CEO of Verseon, told BioSpace.
“Small changes make a big difference in what a drug does inside our bodies, so you need really refined data on all the possible types of changes.”
This could require millions of example datasets, which Prakash said not even the largest pharmaceutical companies have.
Limited Predictive Capabilities
AI can be quite useful when “the rules of the game” are known, he continued, citing protein folding as an example. Protein folding is the same across multiple species and thus can be leveraged to surmise the likely structure of a functional protein because biology follows certain rules.
Drug design, however, uses completely novel combinations and is less amenable to AI “because you don’t have enough data to cover all the possibilities,” Prakash said.
Even when datasets are used to make predictions about similar things, such as small molecule interactions, the predictions are limited. This is because negative data has not been published, he said, and AI needs negative data to make sound predictions.
Additionally, “Many times much of what’s published is not reproducible.”
Small datasets, questionable data and a lack of negative data combine to limit AI’s predictive capabilities.
Too Much Noise
Noise within the large datasets that are available presents another challenge. PubChem, one of the largest public databases, contains more than 300 million bioactivity data points from high-throughput screens, said Jason Rolfe, co-founder and CEO of Variational AI.
“However, this data is both imbalanced and noisy,” he told BioSpace. “Typically, over 99% of the tested compounds are inactive.”
Of the fewer than 1% of compounds that do appear active in a high-throughput screen, the vast majority are false positives, Rolfe said. This is due to aggregation, assay interference, reactivity or contamination.
X-ray crystallography can identify the exact spatial arrangement of a ligand and its protein target, and the resulting structures may be used to train AI for drug discovery. But despite great strides in predicting crystal structures, the protein deformations that are induced by drugs are not well predicted.
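To see why this imbalance matters, consider a minimal Python sketch. The 1% active fraction is the upper bound Rolfe cites; the screen size of 100,000 compounds and the function name are illustrative assumptions, not figures from any real screen. A trivial model that labels every compound inactive would score 99% accuracy while finding no active compounds at all.

```python
def always_inactive_metrics(n_compounds: int, active_fraction: float) -> dict:
    """Score a trivial model that predicts 'inactive' for every compound.

    n_compounds and active_fraction are hypothetical inputs for illustration.
    """
    n_active = round(n_compounds * active_fraction)
    n_inactive = n_compounds - n_active

    # The model never predicts 'active', so it has no true positives.
    true_positives = 0
    true_negatives = n_inactive

    accuracy = (true_positives + true_negatives) / n_compounds
    # Recall: share of truly active compounds the model actually finds.
    recall = true_positives / n_active if n_active else 0.0
    return {"accuracy": accuracy, "recall": recall, "n_active": n_active}


# Assumed screen: 100,000 compounds, 1% of them active.
metrics = always_inactive_metrics(100_000, 0.01)
print(f"accuracy = {metrics['accuracy']:.2%}, recall = {metrics['recall']:.2%}")
```

High accuracy on such data says little about whether a model can find active compounds, which is why the balance of a dataset matters as much as its size.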
Likewise, molecular docking (which simulates the binding of drugs to target proteins) is notoriously inaccurate, Rolfe said.
“The correct spatial arrangements of the drug and its protein target are only accurately predicted about 30% of the time, and predictions of pharmacological activity are even less reliable.”
With an astronomically large number of drug-like molecules possible, even AI algorithms that can accurately predict binding between ligands and proteins are faced with a daunting challenge.
“This entails acting against the primary target without disrupting the tens of thousands of other proteins in the human body, lest they induce side effects or toxicity,” Rolfe said. Currently, AI algorithms are not up to that task.
He recommended using physics-based models of drug-protein interactions to improve accuracy but noted they are computationally intensive, requiring approximately 100 hours of central processing unit (CPU) time per drug, which may constrain their usefulness when researching large numbers of molecules.
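A rough back-of-the-envelope calculation in Python shows how quickly that cost adds up. Only the figure of roughly 100 CPU-hours per drug comes from Rolfe; the library size of one million molecules and the 10,000-core cluster are hypothetical values chosen for illustration, and the sketch assumes the work parallelizes perfectly.

```python
CPU_HOURS_PER_MOLECULE = 100   # approximate figure cited by Rolfe
LIBRARY_SIZE = 1_000_000       # hypothetical number of candidate molecules
CLUSTER_CORES = 10_000         # hypothetical cluster size

HOURS_PER_YEAR = 24 * 365

# Total compute needed to simulate every molecule once.
total_cpu_hours = CPU_HOURS_PER_MOLECULE * LIBRARY_SIZE
total_cpu_years = total_cpu_hours / HOURS_PER_YEAR

# Elapsed time if the work is spread evenly across all cores
# (assumes perfect parallel scaling, which real clusters rarely achieve).
wall_clock_days = total_cpu_hours / CLUSTER_CORES / 24

print(f"total CPU-hours: {total_cpu_hours:,}")
print(f"total CPU-years: {total_cpu_years:,.0f}")
print(f"wall-clock days on {CLUSTER_CORES:,} cores: {wall_clock_days:,.0f}")
```

Under these assumptions, a single pass over the library takes more than a year of elapsed time even on a large cluster.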
That said, computer-based physics simulations are a step toward overcoming the current limitations of AI, Prakash noted.
“They can give you, in an artificial way, virtually generated data about how two things will interact. Physics-based simulations, however, will not give you insights into degradation inside the body.”
Another challenge relates to siloed data systems and disconnected datasets.
“Many facilities are still using paper batch records, so useful data isn’t…readily available electronically,” Moira Lynch, senior innovation leader at Thermo Fisher Scientific, told BioSpace.
Compounding the challenge, “Data that is available electronically are from disparate sources and in disparate formats and stored in disparate locations.”
According to Jaya Subramaniam, head of life sciences product & strategy at Definitive Healthcare, those datasets are also limited in their scope and coverage.
The two primary reasons, she said, are disaggregated data and de-identified data. “No one entity has a complete set of any one type of data, whether that is claims, EMR/EHR or lab-diagnostics.”
Furthermore, patient privacy laws require de-identified data, making it difficult to track an individual’s journey from diagnosis through final outcome. As a result, pharma companies are slower to reach insights.
Despite the availability of unprecedented quantities of data, relevant, usable data remains quite limited. Only when these hurdles are overcome can the power of AI be truly unleashed.