WISARD official site

Case tutorial

Filtering

This section covers:

Summary for filtering
- Filtering sequence
- Windowed filter
Filtering variants
- Include/exclude variants by Mendelian error
- Include/exclude variants by physical range
- Include/exclude variants by genotype calling rate
- Include/exclude variants by the test of missing rate
- Selecting/removing a subset of variants from analysis
- Selecting a subset of chromosomes from analysis
- Include/exclude variants with minor allele
- Include/exclude variants with genetic distance
- Include/exclude variants with HWE test
- Exclude variants with specific condition
Default filtering of WISARD
Filtering genotype with specific condition
Using regular expressions for filtering
Selecting/removing samples with specific phenotypic conditions

Summary for filtering

Different studies need different views of their data, so it is crucial for analysis tools to provide a variety of ways to obtain them. WISARD supports various kinds of filtering for variants, such as:

Inclusion/exclusion of a list of individuals/variants/families
Quality filtering based on allele frequency/genotyping rate/Hardy-Weinberg equilibrium
Inclusion/exclusion of specific chromosomes/regions
Elimination/detection of spurious relationships
Discrimination criteria for specific analyses
...or finding out which variants/samples satisfy such conditions

Filtering sequence

Dataset filtering proceeds in two phases, which can be told apart by two simple rules:

If the filtering option starts with rem or sel, it is applied in the first phase.
If it starts with fil or inc, it is applied in the second phase.

Under these rules, second-phase filtering options cannot control first-phase options, and filtering options within the same phase cannot control each other. In the example below, since both --filmispheno and --filmac are second-phase options, the resulting dataset may still contain variants with MAC < 2.

An example of filtering options that cannot control each other:
C:\Users\WISARD> wisard --bed test_miss0 --filmispheno --filmac [0,2]

To prevent this situation, either (1) run WISARD multiple times with the filters in the desired order, or (2) run WISARD with first-phase filtering options, supplying in advance the list of markers/samples to be filtered.
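The prefix rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this section, not WISARD's actual implementation; the function names `filtering_phase` and `same_phase_groups` are hypothetical, and options that the rule does not mention are simply reported as `None`.

```python
def filtering_phase(option):
    """Return 1 or 2 according to the prefix rule, or None if the rule says nothing."""
    name = option.lstrip("-")
    if name.startswith(("rem", "sel")):
        return 1  # first-phase option
    if name.startswith(("fil", "inc")):
        return 2  # second-phase option
    return None   # not covered by the two rules in this section


def same_phase_groups(options):
    """Group options by phase; options in one group cannot control each other."""
    groups = {}
    for opt in options:
        groups.setdefault(filtering_phase(opt), []).append(opt)
    return groups


if __name__ == "__main__":
    # Both options from the example land in phase 2.
    print(same_phase_groups(["--filmispheno", "--filmac"]))
```

If two filters that must be applied in order land in the same group, that is a hint to split them over separate WISARD runs.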

Windowed filter

By specifying the --window option, it is possible to additionally filter the markers adjacent to the markers to be filtered. Note that this option computes adjacency based on position, as described in the example below.

Example 1 : An example of physical map

         178      223        311
--+-------+--+-----*--------+-+---
 123        192            295

Suppose the variant marked '*' (position 223) is the variant to be filtered. With the option --window 80, three additional markers at positions 178, 192 and 295 will be filtered, because the window spans +-80 (143~303). The range can also be assigned asymmetrically. For example, --window [-35,100] will additionally filter the markers at 192, 295 and 311, because the window spans from -35 (position 188) to +100 (position 323). For a detailed explanation of the parameter of the --window option, see the documentation of that option.
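The window arithmetic in Example 1 can be reproduced with a short Python sketch. It assumes that both window endpoints are inclusive, which the text does not state explicitly, and the helper names `window_bounds` and `markers_in_window` are hypothetical; this is not WISARD's own code.

```python
def window_bounds(position, window):
    """Return (low, high) for a symmetric size or an asymmetric (lower, upper) pair."""
    if isinstance(window, (int, float)):
        lower, upper = -window, window
    else:
        lower, upper = window
    return position + lower, position + upper


def markers_in_window(positions, target, window):
    """List the other markers that fall inside the window around `target`."""
    low, high = window_bounds(target, window)
    # Assumption: endpoints are treated as inclusive.
    return [p for p in positions if p != target and low <= p <= high]


if __name__ == "__main__":
    positions = [123, 178, 192, 223, 295, 311]  # the physical map of Example 1
    print(markers_in_window(positions, 223, 80))          # --window 80
    print(markers_in_window(positions, 223, (-35, 100)))  # --window [-35,100]
```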

Filtering variants

Include/exclude variants by Mendelian error

NOTE!

These options accept a range-type parameter.

Exclude variants whose Mendelian error rate is in [0, 0.1):
C:\Users\WISARD> wisard --filmendelvar "[0,0.1)" --ped test_miss0.ped

Include variants whose Mendelian error rate is at most 0.01:
C:\Users\WISARD> wisard --incmendelvar "<=0.01" --ped test_miss0.ped
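To make the range notation used by these examples concrete, the following Python sketch turns an expression such as "[0,0.1)" or "<=0.01" into a membership test. It assumes the usual mathematical reading (square bracket = inclusive, parenthesis = exclusive) and covers only the forms that appear on this page; the function `range_predicate` is hypothetical and is not part of WISARD.

```python
import re

# Interval form such as "[0,0.1)" or "(0.8,0.9]".
_INTERVAL = re.compile(r"^([\[\(])\s*([^,\s]+)\s*,\s*([^,\s]+)\s*([\]\)])$")
# Comparison form such as "<0.75", "<=0.01", ">5" or ">=35".
_COMPARE = re.compile(r"^(<=|>=|<|>)\s*(\S+)$")


def range_predicate(expr):
    """Return a function telling whether a value lies in the given range."""
    expr = expr.strip()
    m = _INTERVAL.match(expr)
    if m:
        left, right = m.group(1), m.group(4)
        lo, hi = float(m.group(2)), float(m.group(3))

        def inside(x):
            above = x >= lo if left == "[" else x > lo
            below = x <= hi if right == "]" else x < hi
            return above and below

        return inside
    m = _COMPARE.match(expr)
    if m:
        op, bound = m.group(1), float(m.group(2))
        return {
            "<": lambda x: x < bound,
            "<=": lambda x: x <= bound,
            ">": lambda x: x > bound,
            ">=": lambda x: x >= bound,
        }[op]
    raise ValueError(f"unrecognised range expression: {expr!r}")


if __name__ == "__main__":
    in_range = range_predicate("[0,0.1)")
    print(in_range(0.0), in_range(0.1))
```

Under this reading, a rate of exactly 0.1 falls outside "[0,0.1)", while a rate of exactly 0.01 satisfies "<=0.01".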

Include/exclude variants by physical range

Variants can be included in or excluded from the analysis by their physical position, using --incrange or --filrange. This filtering requires a file that contains the ranges to be included or excluded. Below is an example of a range definition file. As shown in the example, each line represents a physical range to be included or filtered, and the values are the chromosome, start position and end position, respectively.

Example 2 : An example of range definition file

1	1	5
7	35	38
X	90	95

Include variants located in the regions listed in 'test_phypos.txt':
C:\Users\WISARD> wisard --incrange test_phypos.txt --ped test_miss0.ped

Exclude variants located in the regions listed in 'test_phypos.txt':
C:\Users\WISARD> wisard --filrange test_phypos.txt --ped test_miss0.ped

Alternatively, a list of desired ranges can be given directly in the following form: items are separated by commas (,), and each item consists of the chromosome followed by the range to be filtered (inclusive).

Include variants located in the regions given directly on the command line:
C:\Users\WISARD> wisard --incrange 1[1,5],7[35,38],X[90,95] --ped test_miss0.ped
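The two ways of giving ranges carry the same information, which the Python sketch below illustrates by parsing both into the same list of (chromosome, start, end) tuples. The parsing rules are inferred from Example 2 and the command above; treating the file ranges as inclusive is an assumption (the text states it only for the inline form), and the function names are hypothetical.

```python
import re


def parse_range_lines(lines):
    """Parse range-definition lines: chromosome, start and end, whitespace-separated."""
    ranges = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom, start, end = fields
        ranges.append((chrom, int(start), int(end)))
    return ranges


def parse_inline_ranges(spec):
    """Parse the inline form, e.g. '1[1,5],7[35,38],X[90,95]'."""
    items = re.findall(r"([^\[\],]+)\[(\d+),(\d+)\]", spec)
    # Rebuild the string to make sure nothing was silently skipped.
    if ",".join(f"{c}[{s},{e}]" for c, s, e in items) != spec:
        raise ValueError(f"malformed range list: {spec!r}")
    return [(c, int(s), int(e)) for c, s, e in items]


def in_any_range(chrom, pos, ranges):
    """True if the position lies in one of the ranges (both ends inclusive)."""
    return any(c == chrom and s <= pos <= e for c, s, e in ranges)


if __name__ == "__main__":
    from_file = parse_range_lines(["1\t1\t5", "7\t35\t38", "X\t90\t95"])
    inline = parse_inline_ranges("1[1,5],7[35,38],X[90,95]")
    print(from_file == inline, in_any_range("7", 38, inline))
```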

Include/exclude variants by genotype calling rate

Run an analysis after removing variants whose genotype calling rate is under 75%:
C:\Users\WISARD> wisard --filgvar "<0.75" --bed test_miss2.bed

Run an analysis after selecting variants whose genotype calling rate is >80% and <=90%:
C:\Users\WISARD> wisard --incgvar "(0.8,0.9]" --bed test_miss2.bed

Include/exclude variants by the test of missing rate

Run an analysis after removing variants if the p-value of the test is < 0.1:
C:\Users\WISARD> wisard --filmistest "<0.1" --bed test_miss2.bed

Run an analysis after selecting variants if the p-value of the test is > 0.05:
C:\Users\WISARD> wisard --incmistest "(0.05,1]" --bed test_miss2.bed

Selecting/removing a subset of variants from analysis

Removing the variants listed in `test_variant_list.txt` from input `test_miss0.ped`:
C:\Users\WISARD> wisard --remvariant test_variant_list.txt --ped test_miss0.ped

Selecting only the variants `SNP10` and `SNP20` from input `test_miss0.ped`:
C:\Users\WISARD> wisard --selvariant SNP10,SNP20 --ped test_miss0.ped

Randomly selecting 10% of the variants in the dataset:
C:\Users\WISARD> wisard --varresize 0.1 --ped test_miss0.ped

Picking variants with a window of at least 1000 bp:
C:\Users\WISARD> wisard --varwindow 1000 --ped test_miss0.ped

Randomly selecting 50 variants from the dataset:
C:\Users\WISARD> wisard --varresize 50 --ped test_miss0.ped

Selecting a subset of chromosomes from analysis

Selecting variants residing on chromosomes 1 to 10 from input `test_miss0.ped`:
C:\Users\WISARD> wisard --chr 1-10 --ped test_miss0.ped

Selecting variants residing on chromosomes 3 and X from input `test_miss0.ped`:
C:\Users\WISARD> wisard --chr 3,X --ped test_miss0.ped

Selecting variants residing on autosomes from input `test_miss0.ped`:
C:\Users\WISARD> wisard --autoonly --ped test_miss0.ped

Selecting variants residing on sex chromosomes from input `test_miss0.ped`:
C:\Users\WISARD> wisard --sexonly --ped test_miss0.ped

Include/exclude variants with minor allele

NOTE!

These options accept a range-type parameter.

Exclude variants with a minor allele frequency lower than 2%:
C:\Users\WISARD> wisard --filfreq "<0.02" --ped test_miss0.ped

Include variants with a minor allele frequency greater than or equal to 5%:
C:\Users\WISARD> wisard --incfreq [0.05,0.5] --ped test_miss0.ped

Exclude variants with a minor allele count of 0 to 2:
C:\Users\WISARD> wisard --filmac [0,2] --ped test_miss0.ped

Include variants with a minor allele count greater than 5:
C:\Users\WISARD> wisard --incmac ">5" --ped test_miss0.ped

Include/exclude variants with genetic distance

NOTE!

These options accept a range-type parameter.

NOTE!

These options are only valid for datasets that include genetic distance!

Exclude variants whose genetic distance is > 1.5:
C:\Users\WISARD> wisard --filgdist ">1.5" --ped test_miss0.ped

Include variants whose genetic distance is greater than or equal to 0 but lower than 1:
C:\Users\WISARD> wisard --incgdist "[0,1)" --ped test_miss0.ped

Include/exclude variants with HWE test

NOTE!

These options accept a range-type parameter.

Exclude variants whose HWE test p-value is under 1e-7:
C:\Users\WISARD> wisard --filhwe "<1e-7" --ped test_miss0.ped

Include variants whose HWE test p-value is greater than 0.05:
C:\Users\WISARD> wisard --inchwe "(0.05,1]" --ped test_miss0.ped

Exclude variants with specific condition

Selecting only SNVs from the dataset:
C:\Users\WISARD> wisard --snvonly --indel --bed test_miss0 --bim test_miss0_indel.bim

Selecting only indels from the dataset:
C:\Users\WISARD> wisard --indelonly --indel --bed test_miss0 --bim test_miss0_indel.bim

NOTE!

A dataset with indels requires the --indel option to run!

Removing all samples with a missing phenotype from input `test_miss0_pheNA.ped`:
C:\Users\WISARD> wisard --filmispheno --mispheno NA --ped test_miss0_pheNA.ped

Removing all non-QC-passed variants from the input VCF file:
C:\Users\WISARD> wisard --vcfqc --vcf test_miss0.vcf

Removing all variants whose QUAL value is between 0 and 30:
C:\Users\WISARD> wisard --filqual [0,30] --vcf test_miss0.vcf

Selecting all variants whose QUAL value is greater than or equal to 35:
C:\Users\WISARD> wisard --incqual ">=35" --vcf test_miss0.vcf

Default filtering of WISARD

By default, WISARD uses the given dataset 'as is' unless certain essential conditions are met. In other words, unless some of your data satisfies a specific condition, no variants or individuals are dropped. If, however, specific variants or individuals should be removed, resizing or reformatting the data for each run can be quite tedious. To meet this need, WISARD provides several data filtering schemes.

NOTE!

When these options are given specific IIDs/variants directly instead of the name of a file listing them, whitespace between or within the IIDs/variants is not permitted!

Filtering genotype with specific condition

Include only phased genotypes from the input VCF file:
C:\Users\WISARD> wisard --vcf test_miss0.vcf --phasedonly

Include only unphased genotypes from the input VCF file:
C:\Users\WISARD> wisard --vcf test_miss0.vcf --unphasedonly

Set genotypes to NA if RD is less than 10 or GQ is less than or equal to 30:
C:\Users\WISARD> wisard --vcf test.vcf --filgeno "[RD < 10] OR [GQ <= 30]"

Using regular expressions for filtering

A regular expression is a structured string that represents a pattern to search for. WISARD supports regular expressions in various types of filters when the --regex option is added. As shown in the example below, it is recommended to enclose the regular expression in double quotes (") on Windows or single quotes (') otherwise, in order to avoid problems with special characters.

Generating a BED-formatted dataset consisting only of non-dbSNP variants:
C:\Users\WISARD> wisard --bed sample --regex --remvariant "^(rs)" --makebed --out sample_nonRSonly
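The effect of the pattern ^(rs) can be checked outside WISARD before running a filter. The Python sketch below applies the same pattern with Python's `re` module to a few made-up variant names; it assumes that WISARD's regular expression dialect treats this simple pattern the same way, which this page does not confirm.

```python
import re

# Same pattern as in the example: names that start with "rs".
pattern = re.compile(r"^(rs)")

# Hypothetical variant names, for illustration only.
variants = ["rs12345", "chr1:1000", "SNP10", "rs9"]

# Mimic --remvariant: drop every variant whose name matches the pattern.
kept = [name for name in variants if not pattern.search(name)]
removed = [name for name in variants if pattern.search(name)]

if __name__ == "__main__":
    print("kept:", kept)
    print("removed:", removed)
```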

Selecting/removing samples with specific phenotypic conditions

WISARD can select or remove samples based on their phenotype values. Multiple conditions can be given, subject to a few simple rules:

Unless an alternative phenotype is given with --sampvar and --pname, only the default phenotype in the given dataset will be considered.
In that case, the column name of the default phenotype must be 'PHENOTYPE'.
If there are multiple selecting/removing conditions, they should be separated by commas (,) with no whitespace.
Reserved phenotype/covariate column names cannot be used as part of a condition.
Each condition should specify a definite, non-conflicting 'range' or an exact 'value'.
Each condition should include only ONE phenotype, i.e. a condition cannot be expressed as a comparison of two or more phenotypes.

Below are some examples of invalid values for --filpheno.

Example 3 : Examples of invalid --filpheno assignment

--filpheno 130<HEIGHT<120 # Range is contradictory (lower bound exceeds upper bound)
--filpheno REGION<3 # REGION is a factor but is given a 'range'
--filpheno 120<HEIGHT<180, 33<BMI<40 # Whitespace after the separator (,)
--filpheno 120x<HEIGHT # Non-numeric character (x) included in the range
--filpheno HEIGHT>BMI # Condition mixes two phenotypes
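The rules and the invalid examples above can be turned into a small pre-flight check. The Python sketch below is a rough approximation that only understands the shapes seen in Example 3 (a phenotype name compared with numeric bounds using < or >); the syntax WISARD really accepts may be broader. The function names are hypothetical, and the set of factor phenotypes has to be supplied by the caller.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
NAME = re.compile(r"[A-Za-z_]\w*")


def check_condition(cond, factors=frozenset()):
    """Return a description of the first problem found in one condition, or None."""
    if any(ch.isspace() for ch in cond):
        return "whitespace is not allowed"
    parts = re.split(r"[<>]", cond)
    ops = re.findall(r"[<>]", cond)
    names = [p for p in parts if NAME.fullmatch(p)]
    bounds = [p for p in parts if NUMBER.fullmatch(p)]
    if len(names) + len(bounds) != len(parts):
        return "non-numeric character in range"
    if len(names) != 1:
        return "exactly one phenotype is required"
    if names[0] in factors:
        return "factor phenotype used with a range"
    if len(parts) == 3 and parts[1] == names[0] and ops[0] == ops[1]:
        first, last = float(parts[0]), float(parts[2])
        low, high = (first, last) if ops[0] == "<" else (last, first)
        if low >= high:
            return "conflicting range"
    return None


def check_filpheno(argument, factors=frozenset()):
    """Check a comma-separated list of conditions; return (condition, problem) pairs."""
    problems = []
    for cond in argument.split(","):
        problem = check_condition(cond, factors)
        if problem:
            problems.append((cond, problem))
    return problems


if __name__ == "__main__":
    for arg in ["130<HEIGHT<120", "REGION<3", "120<HEIGHT<180, 33<BMI<40",
                "120x<HEIGHT", "HEIGHT>BMI", "120<HEIGHT<180,33<BMI<40"]:
        print(arg, "->", check_filpheno(arg, factors={"REGION"}))
```

An empty result means only that none of the problems listed in Example 3 was detected, not that WISARD is guaranteed to accept the value.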

Last modified : 2017-09-13 11:05:56