Hail: Scalable Genomics Analysis with Apache Spark

Technology-focused discussions about genomics usually highlight the huge growth in DNA sequencing since the beginning of the century, growth that has outpaced Moore's law and resulted in the $1000 genome. However, future growth is projected to be even more dramatic. In the paper "Big Data: Astronomical or Genomical?", the authors estimate that "between 100 million and as many as 2 billion human genomes could be sequenced by 2025", requiring between 2 and 40 exabytes of storage. (An exabyte is 1,000 petabytes.)

The storage costs are one thing, but the real question is ‘how are we going to analyze all this data?’ A year ago, the first author wrote about how running the Genome Analysis Toolkit (GATK) on Spark solves one piece of the puzzle: getting from raw DNA reads to variants (which characterize how an individual’s genome differs from the reference genome)—and doing this at scale.

The pipeline for producing variants can be referred to as the upstream pipeline, and is often treated as a black box in the sense that the output of this pipeline, the variant call data, is the starting point for many researchers. By contrast, downstream processing, which takes variants and transforms them in some way, is much more open-ended. This is due to the large number of applications in areas such as genetic association studies, pharmacogenomics (how genetic variation affects drug efficacy), ancestry and genealogy, and relatedness checking.

One of the most important downstream analyses is finding genetic trait associations. Association studies look for statistical associations between genetic variation and phenotypic traits, that is, observable characteristics of an individual such as hair color or disease. With the increasing availability of whole-genome sequence data, it's possible to look for variants from across the whole genome that may be associated with a disease, rather than relying only on commonly known variants, as in a traditional genome-wide association study (GWAS).

The challenge for downstream processing is scale. Tools that can cope with a few hundred or even a few thousand genomes, such as the well-known 1000 Genomes dataset, can’t handle datasets that are one or more orders of magnitude larger. These datasets are now becoming commonplace, thanks to the multiple sequencing efforts taking place around the world like the 100,000 Genomes Project in the UK and the Precision Medicine Initiative in the US.

Enter Hail

This is where Hail comes in. Hail was written from the outset to use Apache Spark so it could take advantage of Spark's ability to scale to thousands of nodes and petabytes of data. Hail is released under the MIT open source license, and its development is led by members of the Neale Lab at the Broad Institute's Analytic and Translational Genetics Unit (ATGU). The ATGU was also the source of PLINK, a widely used predecessor to Hail developed in the Purcell Lab.

Hail is already being used by a number of institutes around the world, and even though it is a young project (it started in late 2015), it is already being cited in scientific papers. In addition, Hail was recently used to run QC and basic analysis of over 120,000 exomes and 15,000 genomes for the Genome Aggregation Database (gnomAD), created by the MacArthur Lab, a sibling of the Neale Lab in the ATGU. Hail made it possible to complete the gnomAD QC in less than a week to meet a conference deadline ("put simply, we would have been screwed without it"). The gnomAD dataset is provided in the VDS format described below, so researchers can optionally bypass VCF altogether.

Hail has a simple, but powerful, conceptual data model called a variant dataset, or VDS for short. A VDS is a huge matrix, where rows are keyed by variant and columns by sample. Each sample is from an individual, of course, but an individual may have many samples taken from them for sequencing (this is common in cancer diagnosis, for example, since tumor samples often differ a lot, both from normal cells and from other cancer cells). This is why we talk in terms of samples rather than individuals.

To give an example, the base at position 48,258,198 on human chromosome 16 is observed to be a C in some instances and a T in others. The different possible bases for a variant are called alleles; C and T in this case.

Now consider a sample in the dataset that has a C at that position in both copies of chromosome 16 (one copy is from the mother, one from the father). We say that its genotype for this variant is C/C.

Another sample might have a genotype of C/T, indicating that the individual it comes from has C on one copy of chromosome 16 and T on the other. The third possibility is T/T. In addition to information about alleles, the genotype object carries fields pertaining to the quality of the genotype call, including a genotype quality score (GQ) and read depth (DP). (The precise meaning of these fields is defined in the Variant Call Format (VCF) specification.)
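To make these fields concrete, here is a small sketch in plain Python (not Hail's API) that decodes one sample's entry from a VCF record; the record and its values are made up for illustration, but the field names GT, DP, and GQ come from the VCF specification.

# Hypothetical VCF record fragment: the FORMAT column plus one sample's entry.
# GT = genotype, DP = read depth, GQ = conditional genotype quality.
format_col = "GT:DP:GQ"
sample_entry = "0/1:35:99"          # made-up values for illustration

def parse_genotype(format_col, entry, ref, alts):
    """Map a sample's colon-separated entry onto the FORMAT field names."""
    fields = dict(zip(format_col.split(":"), entry.split(":")))
    alleles = [ref] + alts
    # GT indexes into the allele list: 0 = REF, 1 = first ALT, and so on.
    calls = [alleles[int(i)] for i in fields["GT"].replace("|", "/").split("/")]
    return {"genotype": "/".join(calls),
            "read_depth": int(fields["DP"]),
            "genotype_quality": int(fields["GQ"])}

print(parse_genotype(format_col, sample_entry, ref="C", alts=["T"]))
# {'genotype': 'C/T', 'read_depth': 35, 'genotype_quality': 99}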

Going back to the VDS matrix, if you read across a row, you can compare how the samples in your dataset differ for that variant. This is useful to find the allele frequencies for a given variant, for example. Whereas if you read down a column, you can read all the variants for that sample, perhaps to reconstruct an individual’s genome. Of course, you can also process the whole matrix to analyze variants from all samples across the whole genome, akin to a fine-mapped GWAS.
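For example, reading across the row for the C/T variant above, the alternate-allele frequency is just the fraction of alternate alleles among all called alleles. A toy illustration in plain Python (not Hail's API, and with made-up genotypes):

# One row of the matrix: the genotype of each sample at a single variant.
# None marks a sample with no call at this position (made-up data).
row = ["C/C", "C/T", "C/T", "T/T", "C/C", None, "C/C"]

def alt_allele_frequency(genotypes, alt="T"):
    """Fraction of called alleles that are the alternate allele."""
    called = [gt for gt in genotypes if gt is not None]
    alleles = [a for gt in called for a in gt.split("/")]
    return alleles.count(alt) / len(alleles)

print(alt_allele_frequency(row))   # 4 alternate alleles out of 12 called -> 0.333...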

The data in a VDS is stored in Parquet files, a compact binary format that is readily accessed with many tools in the Hadoop ecosystem. Hail uses Spark to read and write data, but it is also possible to read a VDS from SQL using Apache Impala (incubating). Coupled with Hail’s ability to attach arbitrary metadata, or annotations, to samples and variants, using SQL allows a VDS to be joined with other relational data sources like Ensembl (a genomics database), or electronic medical records.
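As a rough sketch of what that interoperability might look like, the PySpark snippet below reads Parquet files and joins them with a gene annotation table, using Spark SQL rather than Impala. The paths and column names are hypothetical and do not reflect the actual internal layout of a VDS.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vds-parquet-sketch").getOrCreate()

# Hypothetical paths and column names, purely for illustration.
variants = spark.read.parquet("/data/my_study_variants.parquet")
genes = spark.read.parquet("/data/ensembl_gene_annotations.parquet")

# Join variant-level rows against a relational annotation source by position.
annotated = variants.join(genes, on=["contig", "start"], how="left")
annotated.createOrReplaceTempView("annotated_variants")
spark.sql("SELECT contig, start, gene_name FROM annotated_variants LIMIT 10").show()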

There is also an experimental Apache Kudu storage backend for VDS data; something that becomes interesting when samples are being ingested in a continuous fashion.

Hail is Interactive

Hail uses Spark's interactive shell, which makes it very easy to have a look at the data and try out new ideas for analyzing it. To get a brief taste, consider the following Python session, based on the Hail tutorial, that we ran using Cloudera Data Science Workbench.

We start by importing the Hail Python module and creating a HailContext, which is analogous to Spark's SparkContext, to access Hail functions.

>>> import hail
>>> hc = hail.HailContext()

The first thing to do in Hail is load the data, typically from a VCF file, which is the industry standard file format for variant data. VCF is a text format, but Hail can load VCF files compressed with BGZF, a splittable extension of gzip that is commonly used in bioinformatics.

>>> vds = (hc.import_vcf('Hail_Tutorial-v2/1000Genomes_248samples_coreExome10K.vcf.bgz')
...        .split_multi()
...        .sample_qc()
...        .variant_qc())

Importing the file means loading it as a Spark RDD, although that is not exposed directly to Hail users.

After importing, we run some simple pre-processing on the data. The second line splits multi-allelic variants, a technicality for simplifying later analysis. The third and fourth lines calculate a bunch of standard quality control (QC) metrics on the samples and variants, which we’ll use next.
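To illustrate conceptually what splitting multi-allelic variants means (this is not Hail's implementation), a site with two alternate alleles becomes two bi-allelic variants, one per alternate allele:

# A hypothetical multi-allelic variant: one reference allele, two alternates.
variant = {"contig": "16", "pos": 48258198, "ref": "C", "alts": ["T", "G"]}

def split_multi_allelic(v):
    """Emit one bi-allelic variant per alternate allele (conceptual sketch)."""
    return [{"contig": v["contig"], "pos": v["pos"], "ref": v["ref"], "alts": [alt]}
            for alt in v["alts"]]

for bi in split_multi_allelic(variant):
    print(bi)
# {'contig': '16', 'pos': 48258198, 'ref': 'C', 'alts': ['T']}
# {'contig': '16', 'pos': 48258198, 'ref': 'C', 'alts': ['G']}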

During the course of an analysis, new data fields are usually generated that pertain to variants or samples. These fields are called annotations, and are attached to variants or samples (or just globally) using a nested naming scheme. There is no requirement to specify the schema up-front; new annotations can be added as needed as the analysis progresses. For example, the sample_qc() operation that we just ran added the following fields for each sample in the dataset.

>>> print(vds.sample_schema)
Struct {
    qc: Struct {
        callRate: Double,
        nCalled: Int,
        nNotCalled: Int,
        nHomRef: Int,
        nHet: Int,
        nHomVar: Int,
        nSNP: Int,
        nInsertion: Int,
        nDeletion: Int,
        nSingleton: Int,
        nTransition: Int,
        nTransversion: Int,
        dpMean: Double,
        dpStDev: Double,
        gqMean: Double,
        gqStDev: Double,
        nNonRef: Int,
        rTiTv: Double,
        rHetHomVar: Double,
        rInsertionDeletion: Double
    }
}

Let's focus on one of these fields: qc.gqMean, the mean conditional genotype quality (GQ), which is informally a quality score over all the variant calls in a sample. We can use it to see how sample quality varies across the dataset by plotting its distribution. To do this, we first export the samples as a Pandas dataframe.

>>> sampleqc_table = vds.samples_keytable().to_pandas()

The dataframe is small enough to fit in memory, since it only has metadata about each sample, and no variant or genotype information.

Now we can use matplotlib to produce a histogram of quality scores. Notice how we refer to the GQ score using the key sa.qc.gqMean; the sa prefix is needed to refer to the sample annotations.

>>> import pandas as pd
>>> import numpy as np   # needed for np.arange below
>>> import matplotlib.pyplot as plt
>>> def plotgq():
...     plt.hist(sampleqc_table["sa.qc.gqMean"], bins=np.arange(0, 105, 5))
...     plt.xlabel("Mean Sample GQ")
...     plt.ylabel("Frequency")
...     plt.xlim(0, 105)
...     plt.axvline(20, color='r')
>>> plotgq()


We want to remove low-quality samples from the analysis, and we’ve chosen GQ = 20 as the cutoff, shown by the vertical red line. Smaller GQ values indicate lower quality, so we want a filter that keeps all samples with a GQ of 20 or more.

>>> vds_sGQ = vds.filter_samples_expr('sa.qc.gqMean >= 20')

The filter_samples_expr function uses Hail’s expression language to specify the filtering condition. The expression language makes it very easy to perform simple computations and aggregations on a VDS, without having to learn all the complexities of Spark’s API (although advanced users can use Hail’s Spark API directly).
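The same expression language is used for annotation and aggregation as well as filtering. As a sketch (assuming the Hail 0.1 methods annotate_variants_expr and query_variants, and the AF field that variant_qc() adds under va.qc; output not shown), flagging and counting common variants might look like this:

>>> # Flag variants with alternate-allele frequency above 5%, then count them (sketch only).
>>> vds_flagged = vds_sGQ.annotate_variants_expr('va.common = va.qc.AF > 0.05')
>>> [n_common] = vds_flagged.query_variants(['variants.filter(v => va.common).count()'])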

If we perform a count on the VDS before and after filtering, we see that the number of samples has gone down from 284 to 252.

>>> vds.count()
{u'nGenotypes': 3112924L, u'nSamples': 284, u'nVariants': 10961L}
>>> vds_sGQ.count()
{u'nGenotypes': 2762172L, u'nSamples': 252, u'nVariants': 10961L}

Suppose we would like to perform a GWAS on this data, to identify variants linked to a specific discrete phenotype. Here we use a made-up trait: whether or not a person has purple hair. Samples having the trait of interest are case samples, while those without are controls. Other aspects of our data may confound the GWAS, creating false-positive associations if not addressed. Since our data comes from the 1000 Genomes Project, which is international in scope, ancestry is likely one of these factors. Principal Component Analysis (PCA) is often used to identify latent population structure in such data, and writing the code for this is trivial in Hail.
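As an aside, the idea behind PCA here can be illustrated outside Hail: build a samples-by-variants matrix of alternate-allele counts, center each variant column, and take the top components of its SVD. A toy NumPy sketch with made-up data:

import numpy as np

# Toy dosage matrix: rows are samples, columns are variants, entries are
# alternate-allele counts (0, 1, or 2). The values are made up.
dosages = np.array([[0., 1., 2., 0.],
                    [0., 1., 2., 1.],
                    [2., 0., 0., 2.],
                    [2., 1., 0., 2.]])

centered = dosages - dosages.mean(axis=0)       # center each variant column
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s                                  # principal-component scores per sample

print(scores[:, :2])  # top two PCs: samples with similar genotypes land close together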

Again following the Hail tutorial, first we annotate the samples with their population and phenotype metadata, similar to how we added the sample QC measures above.

>>> vds_pop = vds_sGQ.annotate_samples_table('Hail_Tutorial-v2/1000Genomes.sample_annotations',
...                                          root='sa.pheno',
...                                          sample_expr='Sample',
...                                          config=TextTableConfig(impute=True))

Here we can see the results of the annotations:

>>> expressions = [
...     'samples.map(s => sa.pheno.SuperPopulation).collect().toSet()',
...     'samples.filter(s => sa.pheno.PurpleHair).count()',
...     'samples.filter(s => !sa.pheno.PurpleHair).count()' ]
>>> [super_populations, case_count, control_count] = vds_pop.query_samples(expressions)
>>> print('Super populations = %s' % super_populations)
Super populations = set([u'SAS', u'EAS', u'AMR', u'AFR', u'EUR'])
>>> print('Case count = %s' % case_count)
Case count = 122
>>> print('Control count = %s' % control_count)
Control count = 130
>>> print('Total samples = %s' % vds_pop.num_samples)
Total samples = 252

We see that our 252 QC-ed samples are divided into 122 cases and 130 controls, and that they come from individuals of five different ancestries. Now we run a PCA, which will cluster the samples without regard to the labeling we just provided, to see if ancestry does in fact cause latent structure in our data.

>>> vds_pca = vds_pop.pca(scores='sa.pca')

We then plot our samples based only on the first two principal components identified by the PCA.

>>> import matplotlib.patches as mpatches
>>> def plotpca():
...     pca_table = vds_pca.samples_keytable().to_pandas()
...     colors = {'AFR': 'black', 'AMR': 'red', 'EAS': 'green', 'EUR': 'blue', 'SAS': 'cyan'}
...     plt.scatter(pca_table["sa.pca.PC1"], pca_table["sa.pca.PC2"],
...                 c=pca_table["sa.pheno.SuperPopulation"].map(colors), alpha=.5)
...     plt.xlabel("PC1")
...     plt.ylabel("PC2")
...     legend_entries = [mpatches.Patch(color=c, label=pheno) for pheno, c in colors.items()]
...     plt.legend(handles=legend_entries)
...     plt.show()
>>> plotpca()

We see clear structure in our data due to super-population, with East Asians (EAS), Africans (AFR), Europeans (EUR), and South Asians (SAS) clustering tightly when plotted against the top two principal components. The American (AMR) samples are more spread out along these axes, which is not surprising given the high levels of admixture within the Americas. To move forward with a clean case-control GWAS, we should either select only samples from a single population or include ancestry as a proper covariate in our analysis.
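When we do go on to the association test itself, it is essentially a one-liner in Hail. The following is only a sketch, assuming Hail 0.1's logreg method and the annotation names used in this walkthrough (sa.pheno.PurpleHair for the phenotype and the PCA scores under sa.pca); it is not run here.

>>> # Wald-test logistic regression of the phenotype on genotype,
>>> # with the top principal components as ancestry covariates (sketch only).
>>> vds_gwas = vds_pca.logreg(test='wald',
...                           y='sa.pheno.PurpleHair',
...                           covariates=['sa.pca.PC1', 'sa.pca.PC2'])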

Hail The Future

In this very short example, we’ve done some basic QC for samples and prepared for a GWAS, and already we can see the fluent exploratory style that Hail enables. In a more extensive and realistic analysis we’d do QC on genotypes and variants as well, before fully carrying out association testing to see if there’s a statistical link between genetic variation and phenotype. The Hail tutorial has a very readable in-depth analysis that does exactly this.

Hail is powerful because it is extensible. The VDS matrix of variants and samples is designed to be enriched by annotating it with new datasets—even datasets that haven’t been conceived of yet. On the processing side, Hail already has a wide range of built-in functions for analyzing genetic data, but it’s possible to add new functions if you need them, taking advantage of Spark’s powerful APIs. Hail is multi-lingual too: you can use Python or Scala to work with it, or even drop into SQL if you want.

Hail was founded on the expectation that genomics datasets will grow massively in the coming years. At Cloudera we are confident that Hail will attract users who need a scalable datastore and interactive analysis platform for genetic data, and we look forward to seeing it being adopted more widely.

To learn more about Apache Spark at Cloudera, visit our website.