Christian Fougner's notes

Visualizing mutational spectra using heatmaps

In my previous post, I gave a brief introduction to mutational signature and spectrum analysis using the deconstructSigs framework, including how to modify the source code to allow for the analysis of murine samples. deconstructSigs allows for the visualisation of mutational spectra as barcharts, giving a granular view of the trinucleotide context for variants in an individual sample. Unfortunately, this form of visualisation works poorly for the purpose of comparing the mutational spectra of multiple samples in a cohort. One solution to this is to display mutational spectra as a heatmap, as presented in PMK Westcott et al. 2015. This post will show how this can be done in R.

The work presented here builds on my previous post, and uses deconstructSigs to find the trinucleotide context for variants in a sample. The R-packages gplots and ComplexHeatmap are also required. All lines of code that will need to be modified to fit your own data (filenames, number of samples, etc.) are tagged with #MODIFY.

First, load the required packages:

library(deconstructSigs)
library(gplots)
library(ComplexHeatmap)
        

Find the trinucleotide contexts for the variants in your sample:

# Read variant lists (skip appropriate number of lines if read in as a .vcf file))
S123_14_6_table<-read.table("S123_14_6.txt", sep='\t') #MODIFY filename
S131_14_9_table<-read.table("S131_14_9.txt", sep='\t') #MODIFY filename
S132_14_5_table<-read.table("S132_14_5.txt", sep='\t') #MODIFY filename


# Create data frame of variant lists in the long format with columns for sample name
# chromosome number, variant position/coordinate, reference and alternate bases.
df1<-data.frame(sample=c(rep('S123_14_6', times=(length(S123_14_6_table$V3)))), chr=S123_14_6_table$V1, pos=S123_14_6_table$V2, ref=S123_14_6_table$V3, alt=S123_14_6_table$V4) #MODIFY filename and row/column name/numbers
df2<-data.frame(sample=c(rep('S131_14_9', times=(length(S131_14_9_table$V3)))), chr=S131_14_9_table$V1, pos=S131_14_9_table$V2, ref=S131_14_9_table$V3, alt=S131_14_9_table$V4) #MODIFY filename and row/column name/numbers
df3<-data.frame(sample=c(rep('S132_14_5', times=(length(S132_14_5_table$V3)))), chr=S132_14_5_table$V1, pos=S132_14_5_table$V2, ref=S132_14_5_table$V3, alt=S132_14_5_table$V4) #MODIFY filename and row/column name/numbers

# Bind data frames
alldfs<-rbind(df1, df2, df3) #MODIFY number of dataframes

## Create input to whichSignatures
sigs.input <- mut.to.sigs.input(mut.ref = alldfs,
                                sample.id = "sample",
                                chr = "chr",
                                pos = "pos",
                                ref = "ref",
                                alt = "alt")
        

At this point, the output - sigs.input - should look something like this:

            A[C>A]A	A[C>A]C	A[C>A]G	A[C>A]T	C[C>A]A	C[C>A]C	C[C>A]G	C[C>A]T	G[C>A]A	G[C>A]C	G[C>A]G	G[C>A]T	T[C>A]A	T[C>A]C	T[C>A]G	T[C>A]T	A[C>G]A	A[C>G]C	A[C>G]G	A[C>G]T	C[C>G]A	C[C>G]C	C[C>G]G	C[C>G]T	G[C>G]A	G[C>G]C	G[C>G]G	G[C>G]T	T[C>G]A	T[C>G]C	T[C>G]G	T[C>G]T	A[C>T]A	A[C>T]C	A[C>T]G	A[C>T]T	C[C>T]A	C[C>T]C	C[C>T]G	C[C>T]T	G[C>T]A	G[C>T]C	G[C>T]G	G[C>T]T	T[C>T]A	T[C>T]C	T[C>T]G	T[C>T]T	A[T>A]A	A[T>A]C	A[T>A]G	A[T>A]T	C[T>A]A	C[T>A]C	C[T>A]G	C[T>A]T	G[T>A]A	G[T>A]C	G[T>A]G	G[T>A]T	T[T>A]A	T[T>A]C	T[T>A]G	T[T>A]T	A[T>C]A	A[T>C]C	A[T>C]G	A[T>C]T	C[T>C]A	C[T>C]C	C[T>C]G	C[T>C]T	G[T>C]A	G[T>C]C	G[T>C]G	G[T>C]T	T[T>C]A	T[T>C]C	T[T>C]G	T[T>C]T	A[T>G]A	A[T>G]C	A[T>G]G	A[T>G]T	C[T>G]A	C[T>G]C	C[T>G]G	C[T>G]T	G[T>G]A	G[T>G]C	G[T>G]G	G[T>G]T	T[T>G]A	T[T>G]C	T[T>G]G	T[T>G]T
S123_14_6	9	1	4	2	22	7	2	9	7	0	2	5	4	3	4	7	5	0	1	0	0	1	2	1	4	1	0	1	7	1	2	1	2	0	0	0	6	3	3	3	0	0	0	0	3	2	3	1	7	5	41	11	25	24	134	44	22	16	76	14	18	12	45	14	5	1	9	3	2	2	3	3	4	3	9	1	1	2	2	1	2	0	4	0	1	2	17	2	0	0	3	1	3	0	5	0
S131_14_9	40	9	14	26	71	28	34	51	19	5	6	4	52	36	12	31	8	2	3	0	5	3	2	7	8	8	2	5	21	14	8	12	12	3	2	4	11	7	6	14	1	3	2	1	11	7	4	14	34	22	137	22	77	119	491	157	84	43	276	64	41	40	173	48	8	3	30	9	7	10	24	6	11	18	27	13	4	2	10	2	4	2	19	0	7	8	42	9	3	1	27	2	7	2	11	1
S132_14_5	43	17	14	10	53	15	19	38	18	6	4	10	30	19	13	25	10	0	3	1	6	5	7	7	4	1	1	8	10	6	5	4	11	5	2	3	15	3	2	8	4	0	5	0	14	5	5	10	23	21	116	16	69	100	394	111	79	41	218	65	36	33	106	26	10	0	21	9	7	5	15	7	11	10	23	17	1	2	11	2	2	2	8	3	6	4	44	5	2	2	29	1	2	3	14	4
        

Note that this tutorial can now be followed without access to the source files by importing the output shown above to a data frame.

Next, log scale and center the data, and reformat it so that it can be used to create a heatmap:

bound<-data.frame(matrix(0, ncol=24, nrow=4))
rownames(bound)<-c('A','C','G','T')
colnames(bound)<-rep(c('A','C','G','T'), times=6)

for (a in 1:3) #MODIFY second value to number of samples
{

  df<-data.frame(matrix(0, ncol=24, nrow=4))
  rownames(df)<-c('A','C','G','T')
  colnames(df)<-rep(c('A','C','G','T'), times=6)
  for (n in 1:24){
    start<-(4*n)-3
    end<-(4*n)
    df[,n]<-as.numeric(sigs.input[a, start:end])
  }

  df<-df+1
  df<-log10(df)
  df<-scale(df, center=TRUE, scale=TRUE)

  bound<-rbind(bound, df)
}

bound<-bound[-(1:4),]
        

As mentioned, the data should be log scaled and centered. We do this because the frequency differences in mutation type and trinucleotide context are large, and a large degree of granularity would be lost using a linear scale. We center the data around zero, because the purpose of the heatmap is to contrast the relative proportion of mutations types in given trinucleotide contexts between samples. If we were to to use absolute values instead of center scaling, the total number of variants in the sample would become the most prominent feature of the data.

An important point to note is that most heatmapping software has center and log scaling included, so why are we doing it manually? This is because the heatmapping software would do the scaling across the entire dataset, whereas we now want to do the scaling on a per sample basis. Also, note that we add 1 to every point in the data frame - this is of course because log(0) is undefined, og negative infinity. Including such data in a heatmap would be problematic, so this transformation bypasses this issue.

The next code block defines a few features needed to annotate the heatmap:

cols<-c(rep("deepskyblue", times=4), rep("black", times=4),
        rep("red", times=4), rep("magenta4", times=4),
        rep("forestgreen", times=4), rep("salmon", times=4))

my_palette <- colorRampPalette(c("yellow", "orange", "red"))
sampleNum<-length(levels(alldfs$sample))

sampleNames<-c('S123_14_6', 'S131_14_9', 'S132_14_5') #MODIFY sample names
sndf<-data.frame(sampleNames)
        

The following code block creates the actual heatmap. The first function creates the heatmap, and the following functions add annotations.

heatmap.2(as.matrix(bound),
          trace='none',
          dendrogram='none',
          Rowv=FALSE,
          Colv=FALSE,
          scale='none',
          colsep=c(4, 8, 12, 16, 20),
          rowsep=seq(4, (sampleNum*4), by=4),
          sepcolor = '#333333',
          key=TRUE,
          denscol=rgb(0,0,0,0),
          key.title='Heatmap scale (log10)',
          ylab = '3\' base',
          xlab='5\' base',
          key.xlab = NA,
          key.ylab = NA,
          keysize= 1.5,
          srtCol = 0,
          ColSideColors = cols,
          adjCol = c(0.5,1),
          key.ytickfun=function() {breaks <- list(3)},
          cexRow = 1,
          cexCol = 1,
          col=my_palette,
          labRow=rep(c('A','C','G', 'T'), times = sampleNum),
          na.color='yellow',
          lwid=c(1.5,4), #MODIFY size
          lhei=c(1.1,4) #MODIFY size

)

pos1 <- structure(list(x = c(0.262, 0.891), #MODIFY coordinates
                       y = 0.98), #MODIFY coordinates
                  .Names = c("x", "y"))

text(x=seq(pos1$x[1], pos1$x[2], len=6), y=rep(pos1$y[1],6)  ,
     srt=0, xpd=TRUE, adj = 0,
     labels=c('C>A', 'C>G', 'C>T', 'T>A', 'T>C', 'T>G') )


pos2 <- structure(list(x = 0.0608, #MODIFY coordinates
                       y = c(0.7433, -0.225)), #MODIFY coordinates
                  .Names = c("x", "y"))
text(y=seq(pos2$y[1], pos2$y[2], len=4), x=rep(pos2$x[1],4)  ,
     srt=0, xpd=TRUE, adj = 0,
     labels=sampleNames)
        

As you'll notice, the coordinates for the sample name and substitution type are hard coded. Heatmap.2 doesn't have the functionality to add more than one x-axis annotation and one y-axis annotation, so you'll have to experiment with the coordinates to make the annotations fit as well as possible. Let me know if you find a better solution for this!

The figure created will look something like this:

mutation spectrum heatmap

26.11.2016