Large-Scale Health Data Analytics with OHDSI

Data analytics is increasingly being brought to bear to treat human disease, but as more and more health data is stored in computer databases, one significant challenge is how to perform analyses across these disparate databases. In this post I take a look at the Observational Health Data Sciences and Informatics (or OHDSI, pronounced “Odyssey”) program that was formed to address this challenge, and which today accounts for 1.26 billion patient records collectively stored across 64 databases in 17 countries.

Randomized Controlled Trials and Observational Studies

Consider a researcher who has a new theory they want to test: does aspirin cause bone loss that can lead to osteoporosis? Traditionally, the main instrument for answering questions of this kind has been the randomized controlled trial (RCT). A RCT is an experiment that places the patients in the trial into one of two groups: a group that receives the treatment, and a control group that receives a placebo. By looking at the statistical effect of the treatment compared to the placebo on the outcome, you can learn if the treatment has an effect.

RCTs are considered to be the gold standard, however it is not always possible to run a RCT due to cost (they are expensive), not meeting recruitment targets (trials that are too small don’t provide reliable evidence), or for ethical reasons (withholding a treatment may be unethical under some circumstances).

An alternative to the RCT is the observational study, where the researcher does not have control over which patient receives which particular treatment. This is because the researcher is not running an experiment; they are observing what is already happening in the world. The data may come from many sources: electronic medical records, pharmacy records, monitoring devices, insurance and billing records, and so on. When aggregated across millions of patients, real world data offers an opportunity to untangle relationships between treatment and outcome. The application of analytics to real world data produces real world evidence about whether a treatment has an effect on an outcome.

Since observational studies are not randomized, there is an extra burden to eliminate bias from the study. However, a report in the New England Journal of Medicine from 2000 “found little evidence that estimates of treatment effects in observational studies reported after 1984 are either consistently larger than or qualitatively different from those obtained in randomized, controlled trials.”

Data Silos

Observational studies bring an integration challenge too: how is it possible to bring disparate data sources from multiple organizations together so that clinical questions can be asked of them? Patient confidentiality and privacy laws mean that it is simply not possible to create a central database containing a copy of all the data needed for a study.

This brings us to OHDSI, an international network of collaborators working on the problem of how to apply large-scale analytics to patient healthcare data using open-source tools. The program’s central insight is to define a common data model for patient data so that participating organizations using the model can use the same tools and queries on their own databases.

A Real World Study

Going back to our hypothetical researcher, the first step in conducting an observational study with OHDSI is to solicit interest on the OHDSI forums about the question they are interested in answering (“is there a link between aspirin and osteoporosis?”). If there is positive interest from institutions with healthcare databases, the researcher would create a protocol for the study, which includes information about the study, such as the objective, the institutes involved, date of publication, and so on.

Crucially, the protocol includes code that runs an analysis on data in the OHDSI Common Data Model format and creates a results file. The results are high-level summary statistics that contain no patient-level data. All the steps in the process are conducted in the open (in the StudyProtocols GitHub repository), so the protocol and code can be scrutinized, and worked on in a collaborative fashion.

Next, the participating institutions download the protocol and run the analysis on their databases, then return the results (which contain no patient-level data) to the researcher carrying out the study. Finally, the results are aggregated and written up in a paper for publication.

The Common Data Model and Hadoop

The OHDSI Common Data Model (CDM) is defined as a set of SQL tables. The CDM is platform-neutral, so it doesn’t dictate any particular database technology. Community members have written implementations of the model for different databases, including Oracle, PostgreSQL, and Microsoft SQL Server.

With Cloudera’s work with OHDSI, the CDM now works with Apache Hadoop. In particular, there is a set of Apache Hive table definitions to store CDM data in Apache Parquet format, or optionally in Apache Kudu, and which can be queried using Apache Impala.

The CDM defines 30 or so tables, divided into a few domains (see the diagram below). The two main domains are the standardized vocabularies, and the standardized clinical data. The standardized vocabularies contain a large number of healthcare concepts, such as drugs and health conditions. As the name suggests the concepts in the standardized vocabularies are derived from industry standards and existing databases, such as SNOMED-CT and RxNorm. The CDM integrates the different vocabularies in one database schema, and allows concepts to be mapped to one another, and for parent-child relationships between concepts to be expressed.

The vocabulary tables are curated by the OHDSI community and updated periodically from the upstream sources. This centralized, shared resource is one of the valuable things that OHDSI provides.

To briefly illustrate, let’s find the entry for aspirin in the concept table using the Impala shell:

  > SELECT concept_id, concept_name, vocabulary_id, concept_class_id> FROM concept> WHERE concept_name = ‘Aspirin’ and standard_concept = ‘S’; +————+————–+—————+——————+ concept_id concept_name vocabulary_id concept_class_id +————+————–+—————+——————+ 1112807    Aspirin      RxNorm        Ingredient       +————+————–+—————+——————+Fetched 1 row(s) in 0.47s

FROM concept


concept_id concept_name vocabulary_id concept_class_id
1112807    Aspirin      RxNorm        Ingredient      

Fetched 1 row(s) in 0.47s

  > SELECT concept_id, concept_name, vocabulary_id, concept_class_id> FROM concept> WHERE concept_name = ‘Aspirin’ and standard_concept = ‘S’; +————+————–+—————+——————+ concept_id concept_name vocabulary_id concept_class_id +————+————–+—————+——————+ 1112807     Aspirin       RxNorm         Ingredient       +————+————–+—————+——————+Fetched 1 row(s) in 0.47s

FROM concept


concept_id concept_name vocabulary_id concept_class_id
1112807     Aspirin       RxNorm         Ingredient      

Fetched 1 row(s) in 0.47s

We’ve found the single, standardized concept for aspirin, with concept ID 1112807. Of course, there are many drugs on the market that contain aspirin as an ingredient; we can find them by using the concept_ancestor table that records parent-child relationships between concepts. This query prints the first three direct descendants:

  > SELECT c.concept_id, c.concept_name> FROM concept_ancestor ca, concept c > WHERE ca.ancestor_concept_id = 1112807  > AND ca.descendant_concept_id = c.concept_id > AND max_levels_of_separation = 1> LIMIT 3; +————+———————————————————-+ concept_id concept_name                                             +————+———————————————————-+ 19124734   Aspirin 363 MG                                           40003486   Acetaminophen / Aspirin / Caffeine / Codeine Oral Tablet 40012610   Aspirin / Phenylpropanolamine Oral Solution               +————+———————————————————-+Fetched 3 row(s) in 2.31s

FROM concept_ancestor ca, concept c

AND ca.descendant_concept_id = c.concept_id



40003486   Acetaminophen / Aspirin / Caffeine / Codeine Oral Tablet


The second domain of interest is the standardized clinical data. Unlike the vocabulary tables which are small and remain relatively static, the clinical data tables hold observational data pertaining to patients (all linked from the person table), so they grow as the number of patients in the system grows, and as the number of observations grows. The clinical data tables need to scale, so they are ideal candidates for being stored in a Hadoop system.

Continuing the example, we can use the drug_exposure table to find the number of people in the system who have taken aspirin:

  > SELECT COUNT(DISTINCT person_id)> FROM drug_exposure> WHERE drug_concept_id IN (>     SELECT descendant_concept_id >     FROM concept_ancestor >     WHERE ancestor_concept_id = 1112807> ); +—————————+ count(distinct person_id) +—————————+ 244                       +—————————+Fetched 1 row(s) in 0.58s

FROM drug_exposure

     SELECT descendant_concept_id

     WHERE ancestor_concept_id = 1112807


count(distinct person_id)

Fetched 1 row(s) in 0.58s

This is a small public test dataset of 1000 people, so the result says that just under one quarter of the people in the dataset have taken aspirin.

This query should give you a hint of what’s involved when writing a protocol for an observational study. It’s not just a case of finding a single database record for a drug, but finding many candidate drug products, and using domain knowledge to devise criteria for which to include in the study. The same considerations apply to finding patients with the condition being studied.

Since the clinical data tables are regular Hive tables, they can be populated using any of the usual tools in the Hadoop ecosystem. For example, Odysseus Data Services have used Apache Spark to perform ETL into the CDM tables.


Having a CDM is in itself liberating for many organizations, since they can perform SQL queries to answer the questions they want, as the previous section showed. Those that want higher-level or easier-to-use tools can find them at the OHDSI GitHub repository—this is another reason to use the CDM—it opens up a swath of tooling that works on your data.

We have also done some work at Cloudera to get Impala working with some of the more popular OHDSI components and tools, including:

  • DatabaseConnector for connecting to databases using JDBC via R,

  • SqlRender for translating from a standard SQL dialect to a database-specific one,

  • Achilles to produce summary statistics for data in the CDM,

  • ATLAS for patient cohort generation, and

  • FeatureExtraction for generating features for a cohort.

The screenshot below shows the process of creating a cohort in the ATLAS web interface.

The web interface makes it easy to define cohorts without having to write SQL, and the cohort definitions can be exported (as JSON or SQL) for use in other tools, and as input for running observational studies.

Patient Level Prediction with Machine Learning

The question we started out with was “does aspirin cause bone loss that can lead to osteoporosis?”, which can be described as population-level effect estimation. In this type of analysis we are trying to understand if a single drug has an effect on a particular outcome.

Another type of analysis is creating a model for a particular outcome using a much larger range of features. So, for example, we might try to build a model that predicts osteoporosis given all drug interactions the patient has had (rather than just aspirin), all medical diagnoses, observations, and measurements, as well as general patient information, such as age and gender. Such a model could be used to predict if a particular patient is at risk of developing osteoporosis in the future.

This is a classical machine learning problem, and there is an OHDSI tool called PatientLevelPrediction that can run algorithms like logistic regression and random forests in R on CDM data.

No work has yet been done to use Spark ML as an approach to implement machine learning algorithms for patient-level prediction, but it would be a natural next step to scale to large numbers of patients and observations.

Future Work and Getting Involved

OHDSI has a number of working groups that bring together groups of people who are interested in a particular topic within OHDSI. The Hadoop Working Group, for example, is for people who are interesting in developing and deploying OHDSI tools on Hadoop.

Another exciting direction is the possibility of adding genomic data to the CDM, a move that would enable it as a platform for precision medicine. A new working group has recently been set up to discuss this.

There are many OHDSI projects, and it can be daunting for someone new to the project to get started. The Broadsea project can make things easier; it’s a set of Docker containers for running the main parts of the OHDSI stack for development purposes. Broadsea and the OHDSI forums are a good starting point for anyone who wants to learn more or get involved with the project. Taken together—OHDSI’s active community, a large set of federated healthcare data, Hadoop integration, a free and open source environment of open science—OHDSI makes for a great place to innovate in large scale healthcare analytics.

Further Reading

Thanks to Shawn Dolley for reviewing this article.