What is VQSR?

VQSR stands for Variant Quality Score Recalibration. In a nutshell, it is a sophisticated filtering technique applied to the variant callset: it uses machine learning to model the technical profile of variants in a training set, then uses that model to filter probable artifacts out of the callset.

What is hard filtering?

Hard filtering consists of choosing specific thresholds for one or more annotations and throwing out any variants whose annotation values fall above or below the set thresholds.
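
As a minimal sketch of the idea, the Python snippet below applies fixed thresholds to annotation values. The annotation names (QD, FS), the cutoffs, and the variant records are illustrative assumptions, not recommended settings.

```python
# Hypothetical thresholds: each annotation maps to (min_allowed, max_allowed).
# None means that side is unbounded. The values are illustrative only.
THRESHOLDS = {
    "QD": (2.0, None),   # assumed cutoff: drop variants with QD below 2.0
    "FS": (None, 60.0),  # assumed cutoff: drop variants with FS above 60.0
}


def passes_hard_filter(annotations, thresholds=THRESHOLDS):
    """Return True if every annotation that is present lies within its thresholds."""
    for name, (low, high) in thresholds.items():
        value = annotations.get(name)
        if value is None:
            continue  # annotation missing: this sketch does not filter on it
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True


# Made-up variant records for demonstration.
variants = [
    {"id": "var1", "QD": 12.5, "FS": 3.1},
    {"id": "var2", "QD": 1.2, "FS": 4.0},
    {"id": "var3", "QD": 20.0, "FS": 75.0},
]
kept = [v["id"] for v in variants if passes_hard_filter(v)]
print(kept)
```

Here only var1 survives: var2 fails the lower bound on QD and var3 fails the upper bound on FS.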

How does VQSR work?

The core algorithm in VQSR is a Gaussian mixture model that aims to classify variants based on how their annotation values cluster given a training set of high-confidence variants. Then the VQSR tools use this model to assign a new confidence score to each variant, called VQSLOD.
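
To give a feel for how a model-based score differs from a fixed threshold, here is a heavily simplified Python sketch. It is not the VQSR implementation: it assumes a single annotation and fits one Gaussian per class instead of a multi-dimensional Gaussian mixture, and all training values are made up. The score is the log-odds (natural log in this sketch) that a value came from the "good" model rather than the "bad" one.

```python
import math
import statistics


def fit_gaussian(values):
    """Fit a single 1-D Gaussian (mean, standard deviation) to the values."""
    return statistics.mean(values), statistics.stdev(values)


def log_density(x, mean, stdev):
    """Natural log of the Gaussian probability density at x."""
    z = (x - mean) / stdev
    return -0.5 * z * z - math.log(stdev * math.sqrt(2.0 * math.pi))


def log_odds(x, good_model, bad_model):
    """Log-odds that x came from the 'good' model rather than the 'bad' one."""
    return log_density(x, *good_model) - log_density(x, *bad_model)


# Made-up values of one annotation for trusted and for poorly fitting variants.
good_training = [14.0, 15.5, 16.0, 17.2, 15.1, 16.4]
bad_training = [1.0, 2.5, 3.0, 1.8, 2.2, 4.0]

good_model = fit_gaussian(good_training)
bad_model = fit_gaussian(bad_training)

for value in (15.0, 3.0):
    print(value, round(log_odds(value, good_model, bad_model), 2))
```

A value close to the trusted cluster gets a positive score and a value close to the other cluster gets a negative one, so a single cutoff on the score can replace separate cutoffs on each annotation.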

What is QD in GATK?

The QD is the QUAL score normalized by allele depth (AD) for a variant. For a single sample, the HaplotypeCaller calculates the QD by taking QUAL/AD.
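
The single-sample calculation can be sketched in a few lines of Python. The function name and input values are hypothetical, and summing AD over all alleles to get the depth is an assumption of this sketch.

```python
def quality_by_depth(qual, allele_depths):
    """Return QUAL divided by the total allele depth, or None if the depth is 0."""
    depth = sum(allele_depths)  # assumption: AD is summed over all alleles
    if depth == 0:
        return None  # QD is undefined without any supporting reads
    return qual / depth


# Made-up example: QUAL of 600 with 12 reference and 18 alternate reads.
print(quality_by_depth(600.0, [12, 18]))  # 600 / 30
```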

What is variant filtering?

Variant filtering is a secondary NGS analysis step that consists of identifying high-confidence variants and removing those that were falsely called. The variant filtering step has historically received little in-depth testing, even though it has been shown to significantly improve the precision of variant calls.

What is QUAL in a VCF file?

QUAL = -10*log10(posterior probability of a homozygous-reference genotype (GT=0/0)). That is, QUAL = GP(GT=0/0), where GP is the posterior genotype probability on the Phred scale. QUAL = 20 means there is a 99% probability that there is a variant at the site. The GP values are also given on the Phred scale in the VCF file.
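
The relationship can be checked with a short Python sketch that inverts the formula. The function name is hypothetical.

```python
def qual_to_variant_probability(qual):
    """Convert a Phred-scaled QUAL into the probability that a variant exists."""
    # Invert QUAL = -10*log10(p) to recover the homozygous-reference probability.
    p_hom_ref = 10 ** (-qual / 10.0)
    return 1.0 - p_hom_ref


for qual in (10, 20, 30):
    print(qual, round(qual_to_variant_probability(qual), 4))
```

Each increase of 10 in QUAL divides the probability of a homozygous-reference site by 10.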

What is base recalibration?

Base quality score recalibration (BQSR) is a process in which we apply machine learning to empirically model systematic errors in the base quality scores reported by the sequencer and adjust the quality scores accordingly. For example, we can identify that, for a given run, whenever we called two A nucleotides in a row, the next base we called had a 1% higher rate of error.
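
The following Python sketch illustrates the general idea of measuring error rates empirically per context. It is not the BQSR algorithm itself: it assumes a single covariate (the two preceding bases), uses add-one smoothing chosen for this sketch, and runs on made-up counts.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each (context, reported_q) bin to its empirical Phred quality.

    `observations` is an iterable of (context, reported_q, is_error) tuples,
    where `context` is the two preceding bases (a hypothetical covariate).
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [errors, total]
    for context, reported_q, is_error in observations:
        bin_counts = counts[(context, reported_q)]
        bin_counts[0] += int(is_error)
        bin_counts[1] += 1

    result = {}
    for key, (errors, total) in counts.items():
        # Add-one smoothing (an assumption of this sketch) avoids log10(0).
        error_rate = (errors + 1) / (total + 2)
        result[key] = -10.0 * math.log10(error_rate)
    return result


# Made-up data: bases reported as Q30, with extra errors after an "AA" context.
observations = (
    [("AA", 30, True)] * 10
    + [("AA", 30, False)] * 988
    + [("CG", 30, False)] * 998
)
for key, quality in sorted(empirical_qualities(observations).items()):
    print(key, round(quality, 1))
```

In this toy data the bases that follow "AA" were reported as Q30 but behave like roughly Q20, which is the kind of discrepancy recalibration is meant to correct.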

What is in a VCF file?

VCF is a text file format (most often stored compressed). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome. Including genotype information on samples for each position is optional.
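
The layout can be illustrated with a minimal Python parser. This is a sketch rather than a full VCF reader: it assumes tab-separated input, handles only the eight fixed columns plus optional per-sample genotype fields, and the example lines are made up.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf(lines):
    """Split VCF text lines into meta lines, sample names, and data records."""
    meta, samples, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line)  # meta-information line
        elif line.startswith("#CHROM"):
            columns = line.lstrip("#").split("\t")
            samples = columns[9:]  # present only when genotypes are included
        elif line:
            fields = line.split("\t")
            record = dict(zip(FIXED_COLUMNS, fields[:8]))
            if samples:
                keys = fields[8].split(":")  # FORMAT column, e.g. GT:GQ
                record["samples"] = {
                    name: dict(zip(keys, value.split(":")))
                    for name, value in zip(samples, fields[9:])
                }
            records.append(record)
    return meta, samples, records


# Made-up example with one meta line, the header line, and one data line.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30\tGT:GQ\t0/1:45",
]
meta, samples, records = parse_vcf(example)
print(samples, records[0]["POS"], records[0]["samples"]["sample1"]["GT"])
```

For a compressed file, the same function could be fed lines from `gzip.open(path, "rt")`, where `path` is a placeholder for the file location.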

What is allele balance?

Allele balance is the proportion of reads covering a variant’s location that support the variant. For example, if a variant’s location is covered by 100 reads, of which 25 support the variant and 75 do not, then the variant has an allele balance of 25/100 = 0.25.
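
The same calculation as a small Python sketch. The function name is hypothetical, and returning None for zero coverage is a choice made here.

```python
def allele_balance(supporting_reads, total_reads):
    """Fraction of the reads covering a site that support the variant."""
    if total_reads == 0:
        return None  # undefined when no reads cover the site
    return supporting_reads / total_reads


print(allele_balance(25, 100))  # 0.25, matching the example above
```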

What is variant prioritization?

Variant prioritization is a simple procedure, commonly used in clinical studies, to reduce the number of genetic variants that need to be evaluated manually.

What is GQ in a VCF file?

GQ stands for genotype quality. The value of GQ is simply the difference between the second-lowest PL and the lowest PL (which is always 0). For example, if the second-lowest PL is 20, then GQ = 20 - 0 = 20. Note that the value of GQ is capped at 99 for practical reasons, so even if the calculated GQ is higher, the value emitted to the VCF will be 99.
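
A minimal Python sketch of this rule follows. The function name and the PL values are made up for illustration.

```python
def genotype_quality(pls, cap=99):
    """Return the second-lowest PL minus the lowest PL, capped at `cap`."""
    lowest, second_lowest = sorted(pls)[:2]
    return min(second_lowest - lowest, cap)


print(genotype_quality([20, 0, 150]))    # difference is 20, below the cap
print(genotype_quality([0, 120, 1800]))  # difference is 120, capped at 99
```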

How are Phred scores calculated?

A Phred quality score Q is calculated from the estimated probability P that a base call is incorrect: Q = -10 * log10(P). A base call with a quality score of Q40 therefore means that one base call in 10,000 is predicted to be incorrect.

Phred Quality Score    Probability of Incorrect Base Call    Base Call Accuracy
40                     1 in 10,000                           99.99%
50                     1 in 100,000                          99.999%
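
Rows like the ones in this table can be regenerated with the standard Phred relationship Q = -10 * log10(P), where P is the probability of an incorrect base call. The Python sketch below uses hypothetical function names.

```python
import math


def error_probability_to_phred(p_error):
    """Phred score for a given probability of an incorrect base call."""
    return -10.0 * math.log10(p_error)


def phred_to_error_probability(q):
    """Probability of an incorrect base call for a given Phred score."""
    return 10 ** (-q / 10.0)


for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p):,} incorrect, accuracy {100 * (1 - p):.3f}%")
```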