Big Data presents many opportunities for translational research, and many challenges.

The Innovative Medicines Initiative (IMI) project, Unbiased Biomarkers in the Prediction of Disease (U-BIOPRED), is wallowing in data. Is it ‘Big Data’?

A purist definition of ‘Big Data’ is data that cannot be stored easily using standard databases. Others view ‘Big Data’ as any complex dataset. U-BIOPRED is certainly generating a complex dataset.

The aim of U-BIOPRED is to define new phenotypes for severe asthma. There have been previous efforts in this direction. U-BIOPRED is, however, collecting a wider array of data.

The clinical cohort consists of 1,025 adult and paediatric patients. Subsets of subjects are undergoing endobronchial sampling, CT scanning, measurement of exhaled breath volatile organics, and intensive daily physiological monitoring. Subjects will have a 1-year follow-up visit, and those who exacerbate will undergo an exacerbation visit.

Blood, sputum, urine, and endobronchial samples are being evaluated with high-throughput proteomics, lipidomics, transcriptomics, genomics, and physionomics. A total of 175,000 samples are being generated, 1,500 variables measured, and an estimated 3 million data points generated.

Three million data points alone do not make the U-BIOPRED dataset ‘Big Data’. It qualifies as ‘Big Data’ because of the variety of data types being integrated.

Beginner’s mind

“In the beginner’s mind there are many possibilities, but in the expert’s there are few.” – Suzuki Shunryu, Zen Mind, Beginner’s Mind

U-BIOPRED is an IMI project. When the IMI started there were many criticisms. Many of the criticisms were appropriate. Add to that the inherent challenges in launching such a complex study and you see why there was often little hope that U-BIOPRED would deliver. At one point in time there were 46 active issues on the U-BIOPRED issue registry. How did U-BIOPRED project members persevere?

By taking a ‘beginner’s mind’ approach. A focus on the envisioned end result drove the project forward despite the doubts of many experts.

Big Data often means ‘Big Problems’. Breathe in. By taking stock of all the issues… Breathe out… and then working together to see them resolved, the U-BIOPRED consortium pushed forward. Did it work?

Yes.

U-BIOPRED has nearly completed recruitment of the baseline cohort; 3 academic centers and 2 companies are using the exact same SOPs and protocols for laboratory models; and a stock of GMP virus has been produced for the exacerbation model.

See the data for what it is.

“Accept these things as they are, and try to understand why they’re that way.” – Leo Babauta, Zen Habits

U-BIOPRED is undertaking both a biased and an unbiased approach to data analysis. Proteomics, for example, is being carried out with one approach that looks for known protein species and another that captures all protein species in a given sample. The latter will truly be a fingerprint of data with no a priori connection to a conception of disease.

Big Data analysis approaches are data-driven approaches. This runs counter to the standard hypothesis-driven approach, in which a specific question is examined. Data-driven analysis carries its own attendant problems, such as finding spurious correlations simply because so many comparisons are being made.
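To see how easily too many comparisons throw up spurious correlations, here is a minimal sketch in Python. It is only an illustration, not the analysis U-BIOPRED is running. It assumes 100 hypothetical subjects (an arbitrary number), borrows the figure of 1,500 variables from above, fills everything with pure random noise, and uses the Fisher z approximation to test each correlation. It then compares an uncorrected 5% threshold with a Bonferroni-corrected one, which is one common (and conservative) way of handling the problem.

```python
import math
import random
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_hits(n_subjects, n_variables, alpha, seed=0):
    """Count noise variables that 'significantly' correlate with a noise outcome."""
    rng = random.Random(seed)
    # The outcome is random noise: no variable is truly related to it.
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
    # Two-sided critical value for the chosen significance level.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_variables):
        variable = [rng.gauss(0, 1) for _ in range(n_subjects)]
        r = pearson_r(variable, outcome)
        # Fisher z approximation: atanh(r) * sqrt(n - 3) is roughly standard normal.
        z = math.atanh(r) * math.sqrt(n_subjects - 3)
        if abs(z) > z_crit:
            hits += 1
    return hits


n_variables = 1500  # figure borrowed from the text; the data here is simulated
uncorrected = count_hits(100, n_variables, alpha=0.05)
bonferroni = count_hits(100, n_variables, alpha=0.05 / n_variables)
print("uncorrected hits:", uncorrected)
print("Bonferroni hits:", bonferroni)
```

At a 5% threshold, roughly 5% of the 1,500 unrelated variables (on the order of 75) would be expected to look ‘significant’ by chance alone. The Bonferroni threshold will typically leave none, at the cost of making real effects harder to detect.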

One concept in Buddhist meditation is that you work on accepting things as they are. Live in the moment. Analyzing Big Data is about finding patterns in the data that you did not expect to find and accepting them for what they are.

Personalized Medicine is not lonely

There are lots of discussions on the meaning of ‘Personalized Medicine’. Yes, personalized medicine is, like ‘Big Data’, another buzzword.

By collecting lots of complicated data on a group of individuals, U-BIOPRED will be identifying what makes individuals unique, while at the same time identifying a broader range of potential connections.

Such a concept is not lost on the founders of PatientsLikeMe, who are creating a patient-driven health data-sharing platform. U-BIOPRED has also long recognized the value of patient input.

Patients take part in U-BIOPRED consortium meetings, provide feedback on protocols, and most importantly remind everyone of the shared vision of U-BIOPRED. Do we need this patient input?

Absolutely, and in the future even more so.

The collection of more and more data on more and more people bumps up against privacy concerns. Thus, the voice of patients becomes even more relevant as we balance the value of a dataset against the importance of privacy concerns.

Patients need to know the value of good quality data. It’s obvious to most people that they shouldn’t lie to the research nurse. But what happens when individual data collection intensifies?

That rectangular electronic device in your pocket or attached to your belt, known as a mobile telephone, makes minute-by-minute personal data collection feasible.

There is already a buzzword for this: ‘quantified self’. Knowing where you were when, what your heart rate was, and how fast you moved creates an opportunity to have Star Trek-like health monitoring. The data collected from such efforts will easily meet any definition of ‘Big Data’.

An appreciation of the importance of quality data becomes paramount. If your heart rate and movements are being collected for a study, you should not stuff your mobile phone under the collar of your cat and set the ring tone to bark. You shouldn’t do that anyway, but the point is that patients are becoming important parts of the whole chain of translational research data collection.

Maximizing the value gained

eTRIKS is now actively engaged in supporting U-BIOPRED. First, it is providing a repository for the data being generated by U-BIOPRED, using the transMART platform. eTRIKS is also providing support for getting the U-BIOPRED data into a format that can be uploaded into transMART. The data in transMART is arranged in separate tables for different data types. Not all U-BIOPRED data types are specifically available in transMART; however, the types that are present can be used to load the U-BIOPRED data. This will make the data generated accessible and readily analyzable. A big value, but there is more.

U-BIOPRED aims to integrate various types of data into ‘handprints’. Work is being done to automate the process using ETL scripts. ETL stands for Extract, Transform, Load. In other words, getting data out of the database in which it was collected, transforming it to a standard format, and then loading it into the new database. Sounds simple.

It’s not.

This step is by far the most time-consuming in data curation. It may also be the most valuable. Without data that has already been transformed, this process has to take place again for each question that uses a different data type. Once it is complete, comparisons can be made across data types with a few form field entries and a click of a button. Even this process can be more efficient.

Standards make it easier to transform and load data. If data is already collected under a standard with which the knowledge management system is compliant, the ETL process is simple. This is why driving the creation of standards is an important part of eTRIKS. Ideally this is done a priori. When it is not, it is good to consider transforming data into an established standard.
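For readers who like to see things concretely, the Extract, Transform, Load steps described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the eTRIKS scripts and not the transMART schema. The source export, the column names (`subj`, `visit_dt`, `fev1_ml`), the target table `observations`, and the values are all invented for the example, and an in-memory SQLite database stands in for the new database.

```python
import csv
import io
import sqlite3

# Hypothetical export from the system in which the data was collected.
# Column names and values are invented for illustration.
SOURCE_CSV = """subj,visit_dt,fev1_ml
P001,03/02/2012,2450
P002,15/02/2012,3100
"""


def extract(text):
    """Extract: read the raw rows out of the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: rename fields, use ISO dates, and convert mL to litres."""
    standardized = []
    for row in rows:
        day, month, year = row["visit_dt"].split("/")
        standardized.append({
            "subject_id": row["subj"],
            "visit_date": f"{year}-{month}-{day}",
            "fev1_l": int(row["fev1_ml"]) / 1000,
        })
    return standardized


def load(rows, conn):
    """Load: write the standardized rows into the target database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations "
        "(subject_id TEXT, visit_date TEXT, fev1_l REAL)"
    )
    conn.executemany(
        "INSERT INTO observations VALUES (:subject_id, :visit_date, :fev1_l)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the real target database
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT subject_id, visit_date, fev1_l FROM observations ORDER BY subject_id"
).fetchall()
print(loaded)
```

Even in this toy version, nearly all of the decisions sit in `transform`: what to call each field, which date format to use, which units to store. Those are exactly the decisions a shared standard would settle up front.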

Data locked away on a DVD in someone’s storage cabinet reaches only a fraction of its potential. The promise of Big Data in translational research is that by combining data from multiple studies we will be able to answer questions and gain insights never before possible. Putting the data into a knowledge management platform, under a set of standards, sets it up to become valuable legacy data. Yet, if no one maintains that knowledge management system, then the value is still lost.

Guaranteeing legacy

One of the aims of the eTRIKS project is to come up with a plan for making the eTRIKS platform sustainable beyond the 5 years of the project. This of course depends on many factors coming together: success in supporting projects, enough interest in using the system, and financial support.

When the IMI funding of U-BIOPRED ends in a little less than 2 years, much of the dataset’s potential will have yet to be realized.

The future should benefit from the past. Legacy can be guaranteed, but only if there is a thriving community of all stakeholders that believes in the value of collaborating on data.

Are you willing to invest in legacy?

Wayne Kubick

Excellent article, Scott. Nice characterization of how the Big Data buzzword applies to what eTRIKS and U-BIOPRED are doing, and thanks for stating the importance of standards to enable meaningful interpretation of shared data, which in turn makes sharing of data a more compelling value proposition.
- http://horizon2020consulting.com/ Scott Wagers
  
  Wayne, thanks for your comment. It is important to highlight that what you all are doing at CDISC is really driving the field of standards forward, which is why we are glad you are one of our partners in eTRIKS.
Julie Evans

Thanks, Scott, for putting a very complex subject into an understandable blog that is quite relevant to the work I do. The sentence that catches my eye the most is “The promise of Big Data in translational research is that by combining data from multiple studies we will be able to answer questions and gain insights never before possible.” I often use data aggregation as one of the most valuable benefits of using data standards and standard information models. Have you come across examples of answers to questions and insights gained that have never been possible before? I’d love to be able to use real life examples when I explain the benefits.
- http://horizon2020consulting.com/ Scott Wagers
  
  Julie, thanks for your comment. Indeed, one of our goals is to collect real examples where standards and data integration have improved operational efficiency and project data legacy. I am currently working on getting some examples, so stay tuned.

The Zen of Big Data: U-BIOPRED in action
