The Children’s Cancer Institute has containerised the key bioinformatics pipelines that underpin its research into personalised treatments for children - research that could one day reduce childhood cancer rates to zero.
Part of the problem the Institute faces is that children often respond differently to cancer treatments than adults, with typical treatments either not working at all or producing adverse side effects.
In trying to interpret the hundreds of terabytes of genomics data produced for a single patient, bioinformaticians at the Institute had built a complicated web of processes and applications, many of them reliant on the outputs of other applications, research assistant Sabrina Yan said on the sidelines of the DockerCon Live Summit.
“We run a processing pipeline - a whole genome and RNA sequence process pipeline - that gives the sequencing information from a kid.
“So we sequence their healthy cells and we sequence their tumour cells, we analyse them together and what we do is we find the mutations that are causing the cancer.
“That helps us determine what treatments or what clinical trials might be most effective for the kid.”
Yan said that although the data pipeline worked well, it was tied to a single platform and required specific programming and data-wrangling tools - often the older versions that were in use when the workflow was first created.
To avoid having to re-engineer the entire process from scratch each time the researchers wanted to trial the pipeline on a new cloud instance or different platform, Yan worked with Kamile Taouk, a bioinformatics engineering student and intern at UNSW, to take all of the tools used in the pipeline and individually containerise them using Docker.
Each tool was containerised together with its dependencies “so that we could hook them up any way we want,” Yan said.
While Yan and Taouk agree the work has been worth it in the long run, Taouk said the biggest issue was that almost all apps they encountered within the pipeline were “very heavily dependent on very specific versions of so many different apps [that] they would just build upon so many other different apps”.
“'Dockerising' was quite difficult because we had to preserve every single version of every single dependency in one instance just to ensure that that app was working," Taouk said.
“These apps get updated semi-regularly, but we have to ensure that our Dockers survive”.
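In practice, 'Dockerising' a tool this way means writing a Dockerfile that pins the base image and every dependency to the exact versions the original workflow was validated against. The sketch below is illustrative only - the Institute has not published its Dockerfiles, and the tool shown (samtools 1.9 on an Ubuntu 20.04 base) is simply a placeholder for the kind of sequencing utility such a pipeline relies on.

    # Illustrative Dockerfile: pin the base image and a single tool to exact versions.
    # samtools 1.9 is a placeholder, not a tool confirmed by the Institute.
    FROM ubuntu:20.04

    ENV DEBIAN_FRONTEND=noninteractive
    ARG SAMTOOLS_VERSION=1.9

    # Build dependencies for the pinned release
    RUN apt-get update && apt-get install -y --no-install-recommends \
            ca-certificates wget bzip2 build-essential \
            zlib1g-dev libbz2-dev liblzma-dev libncurses5-dev \
        && rm -rf /var/lib/apt/lists/*

    # Fetch and compile the exact release the workflow was built against
    RUN wget https://github.com/samtools/samtools/releases/download/${SAMTOOLS_VERSION}/samtools-${SAMTOOLS_VERSION}.tar.bz2 \
        && tar -xjf samtools-${SAMTOOLS_VERSION}.tar.bz2 \
        && cd samtools-${SAMTOOLS_VERSION} \
        && ./configure && make && make install \
        && cd .. && rm -rf samtools-${SAMTOOLS_VERSION}*

    ENTRYPOINT ["samtools"]

Because every version is frozen inside the image, rebuilding or redeploying it months later yields exactly the same tool, which is what lets the team treat each container as an interchangeable building block.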
The pair, along with five additional medical interns, spent the summer gradually working through each app, with individual tools taking days or weeks to 'Dockerise'.
“Some of them are very memory hungry, some of them are very finicky, some of them are a lot more stable than others,” Yan said.
“And so you could spend one day 'Dockerising' a tool and it's done in a handful of hours, or sometimes it could take a week and you're just getting this one tool done.
“The idea behind the whole team working on it was eventually you slog through this process and then you have a Dockerfile setup where anyone can run it on any system and we know we have an identical setup.”
Taouk described the new pipeline as “ridiculously efficient” now that each container keeps its own copy of the dependency versions it needs, with the developers able to specify exactly which version is used so the pipeline runs successfully on any machine, every time.
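Once such an image is built and tagged, running it is the same single command on a laptop, an on-premises server or a cloud instance - along the lines of the (again hypothetical) example below, where the pinned tag is what fixes the tool and all of its dependencies.

    # Hypothetical image name, tag and data paths; the pinned tag guarantees
    # an identical environment on every machine that pulls it.
    docker build -t cci/samtools:1.9 .
    docker run --rm -v /data/patient01:/data cci/samtools:1.9 flagstat /data/tumour.bam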
The containerised pipeline also makes it easier for the Institute to share data and collaborate with hospitals and other research institutes.
“If there's some amazing [patient outcome] predictor that comes out, like using some kind of regression or deep learning, if we wanted to add that, being able to 'Dockerise' a complex tool into a single Docker app makes it less complicated to add that into the pipeline in the future, if that's something we'd like to do,” Yan said.
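Slotting such a tool in would then largely be a matter of building one more image and wiring its input to an existing container's output - roughly as sketched below, where both image names and their options are entirely hypothetical.

    # Hypothetical chaining of two containerised steps: the variant calls produced
    # by one container become the input of a new outcome-predictor container.
    docker run --rm -v /data/patient01:/data cci/variant-caller:2.1 \
        --tumour /data/tumour.bam --normal /data/normal.bam --out /data/variants.vcf
    docker run --rm -v /data/patient01:/data cci/outcome-predictor:0.1 \
        --variants /data/variants.vcf --out /data/prediction.json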