Exome-seq saturation analysis: when you should STOP sequencing

27 January 2020

Sequencing coverage is a central concern in exome-seq, since even a small difference in the percentage of target covered may significantly affect the diagnostic outcome.

As sequencing costs have declined, the depth recommended for germline variant detection has risen from the initial 30X to 100X. Intuitively, the more breadth and depth of coverage you have, the greater the chance of success.

But is there an upper limit?

To answer this question, we performed a sequencing saturation analysis. We examined the effect of increasing sequencing output on various parameters, including the depth of coverage at target regions, a parameter commonly used to evaluate whole-exome sequencing (WES) performance.

The library was prepared from high-quality input using the SureSelect Human All Exon V7 protocol (Agilent Technologies, Santa Clara, CA) and sequenced on a NovaSeq 6000 instrument (Illumina, San Diego, CA) in 150 bp paired-end mode, producing 838M reads. The full dataset was randomly subsampled with seqtk to obtain 50M, 120M, 250M, 400M and 600M paired-end reads. BWA-MEM was then used to align the subsampled data, as well as the full dataset, to the hg38 reference. SAMtools, GATK, and Picard were used for sorting SAM/BAM files, local realignment, and duplicate marking, respectively. Picard was also used to collect high-level alignment metrics for each BAM file and the set of metrics specific to hybrid selection.
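The random subsampling step can be illustrated with a small sketch. This is not seqtk's code, only a minimal reservoir-sampling example under the assumption that each read pair is handled as one unit so that mates never get separated. The function name `reservoir_sample`, the toy read names, and the seed value are hypothetical.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 11) -> List[T]:
    """Draw k items uniformly at random from a stream in a single pass."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Toy data: 1000 read pairs, each pair kept together as one tuple.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(1000)]
subset = reservoir_sample(pairs, k=100, seed=11)
print(len(subset), "pairs kept")
```

Sampling whole pairs, or equivalently sampling both mate files with the same seed, keeps the two mates of each fragment in sync for the aligner.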

Library complexity and coverage efficiency were evaluated at sequencing depths ranging from 50M reads to the full dataset of 838M (Table 1). As expected, overall coverage increased with sequencing depth. However, so did the duplicate rate, which rose steadily and reached 50% in the full dataset. Since duplicates are removed during data analysis, their presence reduces the usable coverage. Keep this progressive increase in duplicates in mind when estimating the sequencing effort needed to reach a specific coverage in an experiment.
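To make the cost of duplicates concrete, here is a rough back-of-the-envelope sketch using the read counts and % Dups values from Table 1. It assumes, as a simplification, that the duplicate rate applies to all sequenced reads; the variable names are illustrative only.

```python
# Read counts (millions) and duplicate rates (%) taken from Table 1.
total_m = [50, 120, 250, 400, 600, 838]
dup_pct = [5.8, 12.8, 23.3, 33.5, 41.3, 50.0]

# Approximate non-duplicate reads at each depth (simplifying assumption:
# the duplicate rate applies to every sequenced read).
unique_m = [t * (1 - d / 100) for t, d in zip(total_m, dup_pct)]

# Fraction of each additional batch of reads that is non-duplicate.
marginal = [
    (unique_m[i + 1] - unique_m[i]) / (total_m[i + 1] - total_m[i])
    for i in range(len(total_m) - 1)
]

for t, u in zip(total_m, unique_m):
    print(f"{t:>4}M sequenced -> ~{u:.1f}M non-duplicate")
for i, m in enumerate(marginal):
    print(f"{total_m[i]}M -> {total_m[i + 1]}M: {m:.0%} of added reads are new")
```

Under this approximation, the share of newly sequenced reads that are not duplicates shrinks with every step, which is why doubling the sequencing effort does not double the usable coverage.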

 

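Table 1. Alignment and hybrid selection metrics for a high-quality sample at different sequencing depths.

| Sample | Insert size (bp) | % Dups | Fold enrichment | Mean bait coverage (X) | Target bases 10X | Target bases 20X | Target bases 30X |
|--------|------------------|--------|-----------------|------------------------|------------------|------------------|------------------|
| 50M    | 248              | 5.8    | 40.4            | 83.9                   | 96%              | 90%              | 81%              |
| 120M   | 248              | 12.8   | 40.1            | 184.6                  | 97%              | 96%              | 94%              |
| 250M   | 248              | 23.3   | 39.1            | 334.8                  | 98%              | 97%              | 97%              |
| 400M   | 248              | 33.5   | 38.7            | 480.8                  | 98%              | 97%              | 97%              |
| 600M   | 248              | 41.3   | 38.4            | 589.8                  | 98%              | 98%              | 97%              |
| 838M   | 248              | 50     | 38.1            | 711.1                  | 98%              | 98%              | 98%              |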

 

The key parameter for evaluating WES performance is the depth of coverage at target regions, since a minimum site coverage of more than 10-fold is generally required to reliably identify germline variants. Boosting the sequencing depth expands the fraction of target bases covered at least 10X, from 96% at 50M to 97% at 120M reads (Table 1). However, in our experiment this expansion continues only up to 250M reads, where the saturation threshold is reached (98% of target bases are covered at least 10X); beyond that point no additional information is acquired. In effect, the library complexity is exhausted, and you will only get more of the same. At this point, enlarging the covered target and gaining extra data would require preparing and sequencing a second library.
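The saturation point described above can be located programmatically. The following is a minimal sketch, not the procedure used for this analysis: the helper `saturation_point` and its `min_gain` threshold are hypothetical, and the input values are the whole-percent figures from the 10X column of Table 1.

```python
from typing import Sequence

# Read counts (millions) and % of target bases covered at least 10X (Table 1).
depths_m = [50, 120, 250, 400, 600, 838]
pct_10x = [96, 97, 98, 98, 98, 98]


def saturation_point(
    depths: Sequence[float], values: Sequence[float], min_gain: float = 0.5
) -> float:
    """Return the first depth after which no later step gains >= min_gain."""
    if not depths or len(depths) != len(values):
        raise ValueError("depths and values must be non-empty and equal length")
    for i in range(len(depths)):
        # Gains of every step from this depth onward.
        later_gains = [values[j + 1] - values[j] for j in range(i, len(values) - 1)]
        if all(g < min_gain for g in later_gains):
            return depths[i]
    return depths[-1]


print(f"10X breadth saturates at about {saturation_point(depths_m, pct_10x)}M reads")
```

With the Table 1 values this returns 250, matching the saturation threshold reported in the text. Because the table is rounded to whole percentages, the result depends on the chosen `min_gain`; unrounded metrics would give a finer picture.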