#LabNote: Exome-seq Low Input Crash Test

Targeted capture is one of the most popular NGS applications. However, because tissue is often scarce, it can be challenging to meet quality requirements and obtain the 200 ng of input DNA that most protocols still require.

In fact, the question that we frequently hear is: how low can you go?

To answer this question, we assessed the performance of the SureSelectXT Human All Exon v6 protocol with varying amounts of both starting DNA and prepped library used as hybridization input, looking at library complexity and coverage efficiency.

The initial test was performed with high-quality sample ID663, whose DNA was diluted to obtain 200, 100 and 10 ng of input (663_a, 663_b and 663_c, Table 1). After library preparation, different amounts of capture input (750, 250, 100 and 50 ng) were used for hybridization. Subsequently, two other high-quality and eight low-quality samples were tested as well.

Sequencing was performed on a HiSeq 2500 in 125 bp paired-end mode. On average, we generated 32.7 M reads per sample (min 29.0 M, max 39.8 M). BWA-MEM was used to align all the raw data to the hg19 reference. SAMtools, GATK, and Picard were used for sorting SAM/BAM files, local realignment, and duplicate marking, respectively. The average read alignment rate was 99.82%. Picard was also used to collect high-level alignment metrics for each BAM file and the set of metrics specific to hybrid selection analysis.
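Picard writes its metrics as tab-separated text files, so values such as the duplication rate can be collected across samples with a few lines of code. The following is a minimal Python sketch, assuming the usual Picard layout (a `## METRICS CLASS` line followed by a header row and data rows). The function name `parse_picard_metrics` and the example values are made up for illustration and are not taken from this experiment.

```python
def parse_picard_metrics(text):
    """Return the metrics rows of a Picard metrics file as a list of dicts.

    Assumes the header row directly follows the '## METRICS CLASS' line
    and that the data rows end at the first blank line.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS") and i + 1 < len(lines):
            header = lines[i + 1].split("\t")
            rows = []
            for row in lines[i + 2:]:
                if not row.strip():
                    break  # a blank line closes the metrics section
                rows.append(dict(zip(header, row.split("\t"))))
            return rows
    return []


# Abbreviated, made-up example file (NOT data from this experiment).
example_text = "\n".join([
    "## htsjdk.samtools.metrics.StringHeader",
    "# MarkDuplicates (abbreviated example)",
    "",
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
    "\t".join(["LIBRARY", "PERCENT_DUPLICATION", "ESTIMATED_LIBRARY_SIZE"]),
    "\t".join(["toy_lib", "0.1", "1000000"]),
    "",
    "## HISTOGRAM\tjava.lang.Double",
])

example_rows = parse_picard_metrics(example_text)
example_dup = float(example_rows[0]["PERCENT_DUPLICATION"])
```

The same function should work for hybrid selection metrics files, since they share the layout; only the column names differ.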

Table 1. Alignment and hybrid selection metrics for high-quality sample ID663.

As expected, decreasing the amount of input DNA (<100 ng) and/or of library used for hybridization (<250 ng) increases the number of duplicated reads (Table 1). Consequently, the mean coverage and the number of identified variants drop (Table 1).

QUALITY vs QUANTITY

As already mentioned, we also tested real samples: two high-quality (HQ) and eight low-quality (LQ) inputs. In every case we processed 200 ng, the minimum quantity required by the Agilent protocol. Although the amount of starting material was the same, the quantity of prepped material available for hybridization differed between samples (Table 2). Post-capture libraries were deeply sequenced, and the analysis was performed on 32 M randomly picked reads per sample (except for HQ_1, for which we produced only 29.9 M). In addition, the full read set (~90 M reads) was analyzed for LQ_4 and LQ_7 to evaluate the effect of sequencing depth on duplication levels.
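The note does not say which tool was used to pick the 32 M reads. One common way to draw a fixed-size random subset in a single pass is reservoir sampling, sketched here in Python. The function name and the toy read-pair IDs are illustrative only. For paired-end data the unit being sampled should be the read pair, so that mates stay together.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Pick k records uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir


# Toy usage: pick 5 of 100 hypothetical read-pair IDs.
toy_pairs = (f"pair_{n}" for n in range(100))
toy_subset = reservoir_sample(toy_pairs, k=5, seed=42)
```

Fixing the seed makes the subset reproducible, which helps when the same downsampled set has to be re-analyzed later.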

Table 2. Duplication metrics for high- and low-quality samples.

We observed a strong correlation (R² = 0.97) between the amount of input used for capture and the number of duplicated reads (Figure 1).

What emerged from this experiment is that input quality counts as much as quantity. All samples satisfied the 200 ng starting material requirement, yet their quality determined the final performance right from the start. The quantity of hybridization input obtained after library preparation can therefore be used as an early marker of how a sample will perform.

Fig 1. Quantity of input used in hybridization strongly correlates (R² = 0.97) with the extent of duplicated reads.
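For a straight-line fit, R² is the square of the Pearson correlation between the two variables. Below is a minimal Python sketch of that calculation. The per-sample values behind Figure 1 are not reproduced in this note, so the numbers in the example are toy values chosen only to show the call, and the function name is an assumption.

```python
def r_squared(x, y):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Toy values only (NOT the Table 2 data): duplicates fall as input rises.
toy_input_ng = [1, 2, 3, 4, 5]
toy_dup_pct = [10, 8, 6, 4, 2]
toy_r2 = r_squared(toy_input_ng, toy_dup_pct)
```

Note that R² discards the sign of the relationship, so a high value says that duplication tracks hybridization input closely, not in which direction.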

The proportion of duplicated reads also depends on sequencing depth. For a given sample the percentage is not fixed: it grows as the number of reads increases. At 32 M reads, LQ_4 and LQ_7 had 11.9% and 27.7% duplicates, respectively. With nearly triple the depth (~90 M reads), the percentage roughly doubled, reaching 26.6% for LQ_4 and 48.9% for LQ_7.
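This depth dependence can be approximated with a simple saturation model of the Lander-Waterman type, C = X(1 - exp(-N/X)), where N is the number of reads sequenced, C the number of unique reads, and X the number of distinct molecules in the library. This is the form Picard documents for its library size estimate. The Python sketch below fits X from the 32 M duplication rate and extrapolates to 90 M. It is an illustration under strong assumptions (every molecule equally likely to be sampled, no optical duplicates) and not the analysis performed here; the function names are hypothetical.

```python
import math


def unique_fraction(t):
    """Expected unique/total reads at t = N / X reads per library molecule."""
    return -math.expm1(-t) / t


def estimate_library_size(n_reads, dup_rate):
    """Solve C = X * (1 - exp(-N / X)) for X by bisection on t = N / X."""
    target = 1.0 - dup_rate
    lo, hi = 1e-9, 50.0  # unique_fraction decreases steadily over this range
    for _ in range(200):
        mid = (lo + hi) / 2
        if unique_fraction(mid) > target:
            lo = mid  # too few duplicates predicted, so t must be larger
        else:
            hi = mid
    return n_reads / ((lo + hi) / 2)


def predicted_dup_rate(n_reads, library_size):
    """Duplicate fraction expected when sequencing n_reads from the library."""
    return 1.0 - unique_fraction(n_reads / library_size)


# Duplication rates at 32 M reads, as reported above.
lq4_size = estimate_library_size(32e6, 0.119)
lq7_size = estimate_library_size(32e6, 0.277)
lq4_at_90m = predicted_dup_rate(90e6, lq4_size)
lq7_at_90m = predicted_dup_rate(90e6, lq7_size)
```

Under these assumptions the model extrapolates to roughly 29% (LQ_4) and 56% (LQ_7) at 90 M reads. That is close to, but somewhat above, the observed 26.6% and 48.9%, which is a reminder that it is a rough guide rather than a precise predictor.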

TAKEN TOGETHER

The results indicate that 100 ng of input is the minimum quantity to process to achieve adequate results. Alternatively, at least 500 ng of library should be used for the capture. However, these recommendations are based on the analysis of a sample under very good starting conditions. Because organoid or FFPE samples can be problematic in terms of both quality and quantity, we cannot guarantee that these minimum requirements will hold under real-life conditions.

If you cannot meet the minimum input requirements, the main question becomes: how deep should you go?

The percentage of duplicates increases with sequencing depth. In practice, the same information is read many times without significantly increasing the real coverage (net of duplicates) or "enlarging" the portion of the target covered. This is because the information is limited upstream of sequencing: the reads are sampled from an input with reduced complexity/heterogeneity.

If you have difficult, one-shot samples, the best approach may be to run a test sequencing first and consider data integration once you have assessed the cost/benefit from the preliminary data.