Different studies need different subsets of their data,
so it is crucial for an analysis tool to provide a range of methods for selecting them.
WISARD supports a variety of filters, such as:
Inclusion/exclusion of a list of individuals/variants/families
Quality filtering based on allele frequency/genotyping rate/Hardy-Weinberg equilibrium
Inclusion/exclusion of specific chromosomes/regions
Elimination/detection of spurious relationships
Discrimination criteria for specific analyses
...or finding out which variants/samples satisfy such conditions
Filtering sequence
Dataset filtering is performed in two phases, which can be told apart by two simple rules:
If the filtering option starts with rem or sel, it is applied in the first phase.
If it starts with fil or inc, it is applied in the second phase.
Under these rules, second-phase filtering options cannot control first-phase options,
and filtering options within the same phase cannot control each other either.
In the example below, since both --filmispheno and --filmac are second-phase options,
the resulting dataset may still contain variants with MAC < 2.
To prevent this, either (1) run WISARD multiple times,
applying the filters in the desired order, or (2) determine beforehand the list of markers/samples to be filtered
and pass it to WISARD through first-phase filtering options.
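The interaction described above can be illustrated with a toy model. This is only a sketch of why two filters in the same phase cannot control each other, not WISARD's implementation: the data, the function name, and the "keep variants with MAC >= 2" threshold are all made up for illustration, and both filters are assumed to be evaluated on the same unfiltered dataset.

```python
# Toy dataset (hypothetical): alternate-allele counts per sample and variant.
genotypes = {
    "s1": {"v1": 1, "v2": 1},
    "s2": {"v1": 1, "v2": 0},
    "s3": {"v1": 0, "v2": 1},
}
# None marks a missing phenotype.
phenotype = {"s1": 1.0, "s2": None, "s3": 0.0}


def minor_allele_count(samples, variant):
    """Minor allele count of a diploid variant over the given samples."""
    alt = sum(genotypes[s][variant] for s in samples)
    return min(alt, 2 * len(samples) - alt)


all_samples = sorted(genotypes)
all_variants = ["v1", "v2"]

# Same phase (assumed): both decisions are made on the unfiltered dataset.
kept_samples = [s for s in all_samples if phenotype[s] is not None]
kept_variants = [v for v in all_variants
                 if minor_allele_count(all_samples, v) >= 2]
same_phase_macs = {v: minor_allele_count(kept_samples, v) for v in kept_variants}

# Sequential runs: drop samples first, then recompute MAC on what is left.
sequential_variants = [v for v in all_variants
                       if minor_allele_count(kept_samples, v) >= 2]

print(same_phase_macs)      # {'v1': 1, 'v2': 2}
print(sequential_variants)  # ['v2']
```

In the same-phase case, v1 passes the MAC check on the full dataset but ends up with MAC 1 once the sample with a missing phenotype is removed, which is the situation the two workarounds are meant to avoid.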
Windowed filter
With the --window option, markers adjacent to the markers being filtered can be filtered as well.
Note that this option determines adjacency by physical position,
as described in the example below.
Suppose the variant marked with '*' is the variant to be filtered.
With the option --window 80,
three additional markers, at positions 178, 192 and 295, will be filtered because the window spans +-80 (positions 143 to 303).
The range can also be asymmetric. For example, --window [-35,100] will additionally
filter the markers at 192, 295 and 311, because the window spans from -35 (position 188) to +100 (position 323).
For a detailed explanation of the parameters of the --window option, see this page.
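The window arithmetic above can be checked with a short sketch. The marker positions 178, 192, 295 and 311 come from the text, and the filtered variant's position, 223, is the one implied by the quoted window bounds (143 = 223 - 80, 323 = 223 + 100). The function name is hypothetical, and treating the window boundaries as inclusive is an assumption.

```python
def window_neighbors(positions, target, below, above):
    """Return markers other than `target` lying within
    [target - below, target + above] (boundaries assumed inclusive)."""
    lo, hi = target - below, target + above
    return [p for p in positions if p != target and lo <= p <= hi]


# Positions mentioned in the text; 223 is the variant marked with '*'.
positions = [178, 192, 223, 295, 311]
target = 223

symmetric = window_neighbors(positions, target, 80, 80)    # --window 80
asymmetric = window_neighbors(positions, target, 35, 100)  # --window [-35,100]

print(symmetric)   # [178, 192, 295]
print(asymmetric)  # [192, 295, 311]
```

The marker at 311 falls outside the symmetric window (upper bound 303), while the marker at 178 falls outside the asymmetric one (lower bound 188).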
Variants can be included in or excluded from the analysis by their physical position, using --incrange or --filrange.
This filtering requires a file that contains the ranges to be included or excluded.
Below is an example of a range definition file. As shown in the example, each line represents one physical range to be included or filtered,
and its values are the chromosome, start position and end position, respectively.
Alternatively, a list of ranges can be given directly in the following form: items are separated by commas (,), and each item consists of the chromosome followed by the range to be filtered (inclusive).
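The range logic can be sketched as follows. This is a minimal illustration rather than WISARD's parser: it assumes the three values on each line are separated by whitespace and that both ends of a range are inclusive, and the function names and sample data are hypothetical.

```python
def parse_ranges(text):
    """Parse lines of 'chromosome start end' into (chrom, start, end) tuples.
    Whitespace-separated fields are an assumption of this sketch."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chrom, start, end = line.split()
        ranges.append((chrom, int(start), int(end)))
    return ranges


def in_any_range(chrom, pos, ranges):
    """True if (chrom, pos) lies inside at least one range, ends included."""
    return any(c == chrom and s <= pos <= e for c, s, e in ranges)


# Hypothetical range definitions and variants, for illustration only.
range_text = """1 1000 2000
2 500 900
"""
ranges = parse_ranges(range_text)
variants = [("1", 1500), ("1", 2500), ("2", 500), ("3", 700)]

inside = [v for v in variants if in_any_range(v[0], v[1], ranges)]
outside = [v for v in variants if not in_any_range(v[0], v[1], ranges)]

print(inside)   # [('1', 1500), ('2', 500)]
print(outside)  # [('1', 2500), ('3', 700)]
```

In this model, an include-style filter would keep the `inside` variants and an exclude-style filter would keep the `outside` ones.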
Include/exclude variants by genotype calling rate
Include/exclude variants by the test of missing rate
Selecting/removing a subset of variants from analysis
By default, WISARD uses the given dataset 'as is' unless certain essential conditions are met.
In other words, unless part of your data satisfies such a condition, no variants or individuals will be dropped.
If, however, specific variants or individuals should be removed, resizing or reformatting the data for each run can be quite annoying.
To meet this need, WISARD provides several data filtering schemes.
NOTE!
When these options are given specific IIDs/variants directly instead of the name of a file that lists them, no whitespace is permitted between or within the IIDs/variants!
A regular expression is a structured string
that represents a pattern to search for. WISARD supports regular expressions in various types of filters through the --regex option.
As shown in the example below, it is recommended to enclose the regular expression in double quotes (") on Windows or in single quotes (') otherwise,
in order to avoid problems with special characters.
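As a rough picture of what pattern-based selection does, the sketch below filters a list of identifiers with Python's `re` module. It is not WISARD's matching code: the identifiers and the function name are hypothetical, and requiring the pattern to match the whole identifier is an assumption, since the regex flavor and matching mode WISARD uses are not stated here.

```python
import re


def select_ids(ids, pattern):
    """Keep identifiers that the pattern matches in full
    (whole-string matching is an assumption of this sketch)."""
    rx = re.compile(pattern)
    return [i for i in ids if rx.fullmatch(i)]


# Hypothetical variant identifiers, for illustration only.
variant_ids = ["rs123", "rs4567", "chr1:1000", "exm-rs99"]

selected = select_ids(variant_ids, r"rs\d+")
print(selected)  # ['rs123', 'rs4567']
```

Characters such as `\`, `*` and `|` in a pattern like this are also meaningful to most command-line shells, which is why quoting the expression on the command line matters.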
Selecting/removing samples with specific phenotypic conditions
WISARD can select or remove samples based on their phenotype values.
Multiple conditions can be given at once, subject to a few simple rules:
Unless an alternative phenotype is given with --sampvar and --pname, only the default phenotype in the given dataset is considered.
In that case, the column name of the default phenotype must be 'PHENOTYPE'.
If there are multiple selecting/removing conditions, they must be separated by commas (,) with no whitespace.
Reserved phenotype/covariate column names cannot be used as part of a condition.
Each condition must specify a definite, non-conflicting 'range' or an exact 'value'.
Each condition must involve only ONE phenotype; e.g. a condition cannot be expressed as a comparison of two or more phenotypes.
Below are some examples of invalid values for --filpheno.