World-first software could revolutionise how we understand disease

Researcher: Honorary Associate Professor Denis Bauer
Writer: Fran Molloy
Date: 23 September 2020
Faculty: Faculty of Science and Engineering

An Australian machine-learning program is the first in the world to handle genomic datasets with a trillion data points, helping scientists decode the mysteries of inherited illness, says Macquarie University Honorary Associate Professor Denis Bauer.

New research has confirmed that the VariantSpark program, developed in Australia by a team including Honorary Associate Professor Denis Bauer from Applied BioSciences at Macquarie University, is the first machine-learning framework able to handle genetic datasets with one trillion data points.

VariantSpark is also the first software framework to explore how multiple genes interact with each other, which may help explain genetic influence on diseases like diabetes or Alzheimer’s.

Last year, this remarkably clever machine-learning program was released on the Amazon Web Services Marketplace, promising to almost double the speed at which researchers could do tricky data analysis – like finding genes linked to disease.

Bauer worked with a team from the CSIRO to develop the software and, this month, published new research confirming that it is far more effective than other available programs.

More complex than galaxies

Genomic datasets are unimaginably complex. Each human cell that contains a nucleus includes a copy of the person’s genome, comprising over three billion DNA base pairs in two strands that, if unravelled, would stretch for two metres.

DNA datasets for disease research typically contain samples from hundreds of thousands of people, generating billions of variables.

The numbers involved in genetics analysis have been called astronomical, but they are actually even larger, Bauer says.

“Typically there is more data in a genomic dataset than there is in astronomy, which was the traditional place where you would find really huge numbers,” she says.

VariantSpark allows researchers to sift through the genetic data behind ‘polygenic’ human diseases, potentially revolutionising our ability to treat hereditary illness.

“Polygenic traits are those things about us that are affected by more than one gene – like height and skin colour,” says Bauer.

“The vast majority of genetic disease is polygenic, where two or more genes influence whether someone gets diabetes, for example, or how severely it presents.”

She says VariantSpark builds on a machine-learning method called Random Forest: by recognising sets of interacting features within the data, it can sort that data accurately and very quickly.
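
To give a feel for the Random Forest idea, here is a minimal single-machine sketch in plain Python. It is not VariantSpark's code or API, which the article describes only as a distributed framework; the toy genotypes, the invented disease rule (a person is a "case" only when they carry both variant 0 and variant 1) and all function names are assumptions made for illustration. Each tree is grown on a random resample of the people and considers a random handful of variants at each split, and the forest then ranks variants by how much they helped separate cases from controls.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels, features):
    """Best (gain, feature, threshold) among the candidate features, or None."""
    parent = gini(labels)
    best = None
    for f in features:
        for t in (0, 1):  # genotypes are 0, 1 or 2 copies of a variant
            left = [y for x, y in zip(rows, labels) if x[f] <= t]
            right = [y for x, y in zip(rows, labels) if x[f] > t]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            gain = parent - child
            if gain > 1e-12 and (best is None or gain > best[0]):
                best = (gain, f, t)
    return best

def grow(rows, labels, depth, n_try, rng, importance):
    """Grow one tree recursively; a leaf is the majority label at that node."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or gini(labels) == 0.0:
        return majority
    features = rng.sample(range(len(rows[0])), n_try)  # random subset of variants
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    gain, f, t = split
    importance[f] += gain * len(labels)  # credit the variant used for this split
    left = [i for i, x in enumerate(rows) if x[f] <= t]
    right = [i for i, x in enumerate(rows) if x[f] > t]
    return (f, t,
            grow([rows[i] for i in left], [labels[i] for i in left],
                 depth - 1, n_try, rng, importance),
            grow([rows[i] for i in right], [labels[i] for i in right],
                 depth - 1, n_try, rng, importance))

def random_forest(rows, labels, n_trees=25, depth=3, n_try=3, seed=0):
    """Train n_trees trees on bootstrap resamples; return trees and importances."""
    rng = random.Random(seed)
    importance = [0.0] * len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        trees.append(grow([rows[i] for i in idx], [labels[i] for i in idx],
                          depth, n_try, rng, importance))
    return trees, importance

# Toy data: 400 people, 10 variants, invented rule involving variants 0 and 1.
data_rng = random.Random(1)
genotypes = [[data_rng.choice((0, 1, 2)) for _ in range(10)] for _ in range(400)]
phenotype = [1 if g[0] > 0 and g[1] > 0 else 0 for g in genotypes]

trees, importance = random_forest(genotypes, phenotype)
ranked = sorted(range(10), key=lambda f: importance[f], reverse=True)
print("variants ranked by importance:", ranked)
```

In this toy setting the two interacting variants should come out at the top of the ranking, while the eight unrelated variants pick up only small, chance contributions. Real genomic datasets have millions of variants rather than ten, which is where the distributed computing Bauer describes comes in.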

Huge possibilities of genetics

In 2003, the Human Genome Project delivered the complete sequence of three billion DNA letters representing the genetic blueprint of Homo sapiens.

The international project took 13 years of scientific collaboration between nearly 3000 biologists, engineers, computer scientists, mathematicians and other specialists in a ‘big science’ approach to research, making significant inroads into many other fields along the way.

The outcome has revolutionised fields from plant biology to infectious disease and has altered the future of medicine.

Since 2003, the field of genomics has expanded rapidly and has made major breakthroughs in the diagnosis and treatment of many rare diseases including some cancers.

However, there’s still a long way to go to understand the genetic links for some of the most complex and prevalent disorders such as diabetes and heart disease.

Dastardly combinations

“Polygenic epistatic traits are those where genes build on each other, they can mask each other’s influence, or they can combine to produce a completely new characteristic,” says Bauer.

“There are some examples where we know that if a person has two genes associated with a particular disease, their probability of contracting it is much higher; three genes could be even worse.”

However, genes don’t behave in such simple ways as always adding to each other, she says.

“These are complex interactions; you can have genes A and B contributing and gene C taking away or modulating the impact of a disease,” she says.
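
The contrast Bauer draws can be made concrete with a toy model. The sketch below compares a purely additive rule with an epistatic one in which genes A and B raise risk only in combination and gene C dampens the result. The function names and every percentage are invented for illustration; they are not figures from the research.

```python
def additive_risk(a, b, c):
    """Hypothetical additive model (risk in %): each gene shifts risk on its own."""
    return 10 + 20 * a + 20 * b - 10 * c

def epistatic_risk(a, b, c):
    """Hypothetical epistatic model (risk in %): A and B matter only together,
    and C halves whatever risk is present."""
    risk = 10          # invented baseline risk
    if a and b:
        risk += 40     # A and B combine to produce a large jump
    if c:
        risk //= 2     # C modulates (here: halves) the risk
    return risk

# 1 means the person carries the gene variant, 0 means they do not.
for a, b, c in [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 1)]:
    print(f"A={a} B={b} C={c}  additive={additive_risk(a, b, c)}%  "
          f"epistatic={epistatic_risk(a, b, c)}%")
```

Under the additive rule, carrying A alone already raises risk. Under the epistatic rule, A or B alone changes nothing, so looking at one gene at a time would miss the effect entirely.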

“With this level of complexity, if humans worked on it in real time, it might take us 50 years to actually decode all of the complex interactions related to just one aspect of the genome; but using machine learning, we're able to process this information really, really fast.”

Genomic analysis is filled with knotty variations that add to its complexity, she adds. Mutations in a genome can occur at different frequencies that are specific to certain populations, so the interpretation of a genomic change may depend on the population.

“In some cases it won’t be just two or three variations, it will be thousands of variations of these modulators that contribute to a complex disease – that’s not something that humans can actually process.”

Open access

“Until now, when we have looked at the genome, the tools that we had couldn’t explain the inheritance pattern that we saw with complex disease,” Bauer says.

“Looking at the whole genome, and finding interactions between genes among the three billion letters in the genome, and comparing these across cohorts of thousands of people, is so computationally intensive that it was impossible to do until we were able to develop this method, and use distributed computing to apply it.”

As DNA sampling increases, our global databases contain increasingly vast amounts of information that could help us understand and treat serious disease.

“There's a lot of information that could explain not only genetic disease, but also our responses to treatment – who is likely to have adverse reactions to treatments, or even vaccines, for example.”

The program is available for free on Amazon Web Services (AWS), the world’s leading cloud software host, which means that users don’t have to invest in high-end computing power.

Instead, they can upload their genetic data into a secure area and rent processing time on the AWS platform, where VariantSpark runs the complex algorithms that analyse their data.

“Using AWS allows us to scale it up and down depending on the data set,” says Bauer.

Her team can provide consulting services to help researchers analyse their data effectively, she says, but making the framework freely available means the global research community can accelerate progress towards solving some of humanity’s most devastating diseases.

Denis Bauer is an Honorary Associate Professor in the Department of Biomedical Sciences at Macquarie University
