WISARD[wɪzərd]
Workbench for Integrated Superfast Association study with Related Data
 

Filtering

This section is about

  • Summary for filtering
    • Filtering sequence
    • Windowed filter
  • Filtering variants
    • Include/exclude variants by Mendelian error
    • Include/exclude variants by physical range
    • Include/exclude variants by genotype calling rate
    • Include/exclude variants by the test of missing rate
    • Selecting/removing a subset of variants from analysis
    • Selecting a subset of chromosomes from analysis
    • Include/exclude variants with minor allele
    • Include/exclude variants with genetic distance
    • Include/exclude variants with HWE test
    • Exclude variants with specific condition
  • Default filtering of WISARD
  • Filtering genotype with specific condition
  • Using regular expressions for filtering
  • Selecting/removing samples with specific phenotypic conditions

          Summary for filtering

          Different studies need different subsets of their data, so an analysis tool must provide several ways to select them. WISARD supports various kinds of filtering, such as:

          • Inclusion/exclusion of a list of individual/variant/family
          • Quality filtering based on allele frequency/genotyping rate/Hardy-Weinberg equilibrium
          • Include/exclude specific chromosome/region
          • Eliminate/detect spurious relationships
          • Discrimination criteria for specific analyses
          • ...or find out which variants/samples satisfy such conditions

          Filtering sequence

          Dataset filtering proceeds in two phases, which can be distinguished by two simple rules:

          • If the filtering option starts with rem or sel, it is applied in the first phase.
          • If it starts with fil or inc, it is applied in the second phase.

          According to these rules, second-phase filtering options cannot control first-phase options, and filtering options within the same phase cannot control each other either. In the example below, both --filmispheno and --filmac are second-phase options, so the resulting dataset may still contain variants with MAC < 2.

          An example of filtering options that cannot control each other C:\Users\WISARD> wisard --bed test_miss0 --filmispheno --filmac [0,2]

          To prevent this situation, either (1) run WISARD multiple times, applying the filters in the desired order, or (2) determine beforehand which markers/samples are to be filtered and pass them as a list to the first-phase filtering options.
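
The situation described above can be illustrated with a toy Python sketch. It is not WISARD code: it simply assumes that both second-phase filters are evaluated on the dataset as loaded, and the genotype values, sample layout, and the helper name `minor_allele_count` are made up for illustration.

```python
def minor_allele_count(genotypes):
    """MAC for one variant from 0/1/2 alternate-allele counts (None = missing)."""
    called = [g for g in genotypes if g is not None]
    alt = sum(called)
    return min(alt, 2 * len(called) - alt)


# Rows are variants, columns are four hypothetical samples s1..s4.
genotypes = {
    "SNP1": [1, 1, 1, 0],
    "SNP2": [2, 1, 1, 2],
}
phenotype_missing = [True, True, False, False]  # s1 and s2 lack a phenotype

# Assumption: both filters look at the original data, not at each other's result.
keep_variant = {v: not (0 <= minor_allele_count(g) <= 2)   # like --filmac [0,2]
                for v, g in genotypes.items()}
keep_sample = [not missing for missing in phenotype_missing]  # like --filmispheno

result = {v: [g for g, keep in zip(gs, keep_sample) if keep]
          for v, gs in genotypes.items() if keep_variant[v]}

print(result)                              # {'SNP1': [1, 0]}
print(minor_allele_count(result["SNP1"]))  # 1
```

SNP1 passes the MAC filter on the full data (MAC 3), but once the two samples without a phenotype are dropped its MAC is 1, which is the kind of leftover the text warns about.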

          Windowed filter

          With the --window option, markers adjacent to a filtered marker can be filtered as well. Note that this option computes adjacency from physical position, as described in the example below.

          Example 1 : An example of physical map
                   178      223        311
          --+-------+--+-----*--------+-+---
           123        192            295

          Suppose the variant marked '*' (position 223) is the variant to be filtered. With the option --window 80, three additional markers at positions 178, 192 and 295 will be filtered, because the window spans ±80 (143~303). The range can also be asymmetric. For example, --window [-35,100] will additionally filter the markers at 192, 295 and 311, because the window spans from -35 (position 188) to +100 (position 323). For a detailed explanation of the parameter of the --window option, see the documentation of that option.
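
The arithmetic of Example 1 can be checked with a short Python sketch. The function name `window_filtered` is hypothetical, and treating the window bounds as inclusive is an assumption; the page does not say how WISARD handles a marker lying exactly on the boundary.

```python
def window_filtered(positions, target, window):
    """Return the other marker positions that fall inside the window around target.

    `window` is either a single number w (symmetric: target-w .. target+w)
    or a (lower, upper) pair of offsets such as (-35, 100).
    Bounds are treated as inclusive, which is an assumption of this sketch.
    """
    if isinstance(window, (int, float)):
        lower, upper = -abs(window), abs(window)
    else:
        lower, upper = window
    start, end = target + lower, target + upper
    return [p for p in positions if p != target and start <= p <= end]


markers = [123, 178, 192, 223, 295, 311]
print(window_filtered(markers, 223, 80))          # [178, 192, 295]
print(window_filtered(markers, 223, (-35, 100)))  # [192, 295, 311]
```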

          Filtering variants

          Include/exclude variants by Mendelian error

          NOTE!
          The parameter of this option supports the range type
          Exclude variants whose Mendelian error rate is in [0,0.1) C:\Users\WISARD> wisard --filmendelvar "[0,0.1)" --ped test_miss0.ped
          Include variants whose Mendelian error rate is <=0.01 C:\Users\WISARD> wisard --incmendelvar "<=0.01" --ped test_miss0.ped
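
How a range-type parameter might be read can be sketched in Python as follows. This is only an interpretation inferred from the examples on this page (standard interval notation, with square brackets inclusive and parentheses exclusive, plus the comparison forms <, <=, >, >=); the function names are hypothetical and WISARD's own parser may accept more or fewer forms.

```python
import re


def parse_range(spec):
    """Turn a range string into (low, low_inclusive, high, high_inclusive)."""
    spec = spec.strip()
    inf = float("inf")
    # Comparison form: <x, <=x, >x, >=x
    m = re.fullmatch(r"(<=|>=|<|>)\s*(\S+)", spec)
    if m:
        op, value = m.group(1), float(m.group(2))
        if op.startswith("<"):
            return (-inf, False, value, op == "<=")
        return (value, op == ">=", inf, False)
    # Interval form: [a,b], (a,b), [a,b), (a,b]
    m = re.fullmatch(r"([\[(])\s*([^,\s]+)\s*,\s*([^,\s]+)\s*([\])])", spec)
    if m:
        return (float(m.group(2)), m.group(1) == "[",
                float(m.group(3)), m.group(4) == "]")
    raise ValueError(f"unrecognised range: {spec!r}")


def in_range(value, spec):
    """True if value lies inside the range described by spec."""
    low, low_inc, high, high_inc = parse_range(spec)
    above = value >= low if low_inc else value > low
    below = value <= high if high_inc else value < high
    return above and below


print(in_range(0.05, "[0,0.1)"))  # True
print(in_range(0.1, "[0,0.1)"))   # False
print(in_range(0.01, "<=0.01"))   # True
```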

          Include/exclude variants by physical range

          Variants can be included in or excluded from the analysis by physical position, using --incrange or --filrange. This filtering requires a file containing the ranges to be included or excluded. Below is an example of a range definition file. Each line represents one physical range to be included or filtered, and its values are the chromosome, start position and end position, respectively.

          Example 2 : An example of range definition file
          1 1 5
          7 35 38
          X 90 95
          Include variants located in the regions defined in 'test_phypos.txt' C:\Users\WISARD> wisard --incrange test_phypos.txt --ped test_miss0.ped
          Exclude variants located in the regions defined in 'test_phypos.txt' C:\Users\WISARD> wisard --filrange test_phypos.txt --ped test_miss0.ped

          Alternatively, the desired ranges can be given directly as a comma(,)-separated list, in which each item consists of the chromosome followed by the range to be filtered (inclusive).

          Include variants located in the regions given directly C:\Users\WISARD> wisard --incrange 1[1,5],7[35,38],X[90,95] --ped test_miss0.ped
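
The two ways of writing ranges describe the same data, as the following Python sketch shows. The function names are hypothetical, and treating the ranges in the file as inclusive at both ends is an assumption carried over from the inline form.

```python
import re


def parse_range_file_lines(lines):
    """Parse 'chromosome start end' lines into (chrom, start, end) tuples."""
    ranges = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # this sketch silently skips blank or malformed lines
        chrom, start, end = fields
        ranges.append((chrom, int(start), int(end)))
    return ranges


def parse_inline_ranges(spec):
    """Parse the inline form, e.g. '1[1,5],7[35,38],X[90,95]'."""
    found = re.findall(r"([^,\[\]]+)\[(\d+),(\d+)\]", spec)
    return [(chrom, int(start), int(end)) for chrom, start, end in found]


def in_any_range(chrom, pos, ranges):
    """True if (chrom, pos) lies inside any range; both ends inclusive."""
    return any(c == chrom and s <= pos <= e for c, s, e in ranges)


from_file = parse_range_file_lines(["1 1 5", "7 35 38", "X 90 95"])
inline = parse_inline_ranges("1[1,5],7[35,38],X[90,95]")
print(from_file == inline)               # True
print(in_any_range("7", 36, from_file))  # True
print(in_any_range("7", 39, from_file))  # False
```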

          Include/exclude variants by genotype calling rate

          Run an analysis after removing variants whose genotype calling rate is under 75% C:\Users\WISARD> wisard --filgvar "<0.75" --bed test_miss2.bed
          Run an analysis after selecting variants whose genotype calling rate is >80% and <=90% C:\Users\WISARD> wisard --incgvar "(0.8,0.9]" --bed test_miss2.bed

          Include/exclude variants by the test of missing rate

          Run an analysis after removing variants if the p-value of the test is < 0.1 C:\Users\WISARD> wisard --filmistest "<0.1" --bed test_miss2.bed
          Run an analysis after selecting variants if the p-value of the test is > 0.05 C:\Users\WISARD> wisard --incmistest "(0.05,1]" --bed test_miss2.bed

          Selecting/removing a subset of variants from analysis

          Removing variants listed in `test_variant_list.txt` from input `test` C:\Users\WISARD> wisard --remvariant test_variant_list.txt --ped test_miss0.ped
          Selecting only variants `SNP10` and `SNP20` from input `test` C:\Users\WISARD> wisard --selvariant SNP10,SNP20 --ped test_miss0.ped
          Randomly select 10% of the variants in the dataset C:\Users\WISARD> wisard --varresize 0.1 --ped test_miss0.ped
          Randomly select 50 variants from the dataset C:\Users\WISARD> wisard --varresize 50 --ped test_miss0.ped
          Pick variants with a window of at least 1000bp C:\Users\WISARD> wisard --varwindow 1000 --ped test_miss0.ped

          Selecting a subset of chromosomes from analysis

          Selecting variants residing on chromosomes 1 to 10 from input `test` C:\Users\WISARD> wisard --chr 1-10 --ped test_miss0.ped
          Selecting variants residing on chromosomes 3 and X from input `test` C:\Users\WISARD> wisard --chr 3,X --ped test_miss0.ped
          Selecting variants residing on autosomes from input `test` C:\Users\WISARD> wisard --autoonly --ped test_miss0.ped
          Selecting variants residing on sex chromosomes from input `test` C:\Users\WISARD> wisard --sexonly --ped test_miss0.ped

          Include/exclude variants with minor allele

          NOTE!
          The parameter of this option supports the range type
          Exclude variants with minor allele frequency lower than 2% C:\Users\WISARD> wisard --filfreq "<0.02" --ped test_miss0.ped
          Include variants with minor allele frequency greater than or equal to 5% C:\Users\WISARD> wisard --incfreq [0.05,0.5] --ped test_miss0.ped
          Exclude variants with a minor allele count of 0 to 2 C:\Users\WISARD> wisard --filmac [0,2] --ped test_miss0.ped
          Include variants with a minor allele count greater than 5 C:\Users\WISARD> wisard --incmac ">5" --ped test_miss0.ped
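
For reference, the quantities these options filter on can be computed as in the Python sketch below. It uses the usual definitions for diploid genotypes coded as 0/1/2 copies of one allele; the function name `allele_summary` is hypothetical, and how WISARD itself treats missing genotypes, founders, or sex chromosomes is not covered here.

```python
def allele_summary(genotypes):
    """Compute call rate, MAC and MAF for one variant.

    genotypes: per-sample allele counts (0, 1 or 2), or None if missing.
    """
    called = [g for g in genotypes if g is not None]
    n_alleles = 2 * len(called)          # diploid: two alleles per called sample
    alt = sum(called)
    mac = min(alt, n_alleles - alt)      # the rarer of the two alleles
    maf = mac / n_alleles if n_alleles else float("nan")
    call_rate = len(called) / len(genotypes) if genotypes else float("nan")
    return {"call_rate": call_rate, "mac": mac, "maf": maf}


summary = allele_summary([0, 1, 2, None, 0, 0, 1, 0, 0, 0])
print(summary["mac"])  # 4
```

In this toy variant 9 of 10 samples are called, so the call rate is 0.9 and the MAF is 4/18.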

          Include/exclude variants with genetic distance

          NOTE!
          The parameter of this option supports the range type
          NOTE!
          This option is only valid for datasets that have genetic distance!
          Exclude variants whose genetic distance is > 1.5 C:\Users\WISARD> wisard --filgdist ">1.5" --ped test_miss0.ped
          Include variants whose genetic distance is greater than or equal to 0 but lower than 1 C:\Users\WISARD> wisard --incgdist "[0,1)" --ped test_miss0.ped

          Include/exclude variants with HWE test

          NOTE!
          The parameter of this option supports the range type
          Exclude variants whose HWE test p-value is under 1e-7 C:\Users\WISARD> wisard --filhwe "<1e-7" --ped test_miss0.ped
          Include variants whose HWE test p-value is greater than 0.05 C:\Users\WISARD> wisard --inchwe "(0.05,1]" --ped test_miss0.ped

          Exclude variants with specific condition

          Selecting SNVs from dataset C:\Users\WISARD> wisard --snvonly --indel --bed test_miss0 --bim test_miss0_indel.bim
          Selecting indels from dataset C:\Users\WISARD> wisard --indelonly --indel --bed test_miss0 --bim test_miss0_indel.bim
          NOTE!
          Dataset with indels requires --indel option to run!
          Removing all samples with a missing phenotype from input `test` C:\Users\WISARD> wisard --filmispheno --mispheno NA --ped test_miss0_pheNA.ped
          Removing all non QC-passed variants from input VCF file C:\Users\WISARD> wisard --vcfqc --vcf test_miss0.vcf
          Removing all variants whose QUAL value is between 0 and 30 C:\Users\WISARD> wisard --filqual [0,30] --vcf test_miss0.vcf
          Selecting all variants whose QUAL value is greater than or equal to 35 C:\Users\WISARD> wisard --incqual ">=35" --vcf test_miss0.vcf

          Default filtering of WISARD

          By default, WISARD uses the given dataset 'as is' unless certain essential conditions are met. In other words, unless some of your data satisfies a specific condition, no variants/individuals will be dropped. If, however, specific variants or individuals should be removed, resizing or reformatting the data for each run can be quite annoying. To meet this need, WISARD provides several data filtering schemes.

          NOTE!
          When these options are given specific IIDs/variants instead of the name of a file listing them, no whitespace is permitted between or within the IIDs/variants!

          Filtering genotype with specific condition

          Include only phased genotypes from input VCF file C:\Users\WISARD> wisard --vcf test_miss0.vcf --phasedonly
          Include only unphased genotypes from input VCF file C:\Users\WISARD> wisard --vcf test_miss0.vcf --unphasedonly
          Set to NA if RD is less than 10 or GQ is less than or equal to 30 C:\Users\WISARD> wisard --vcf test.vcf --filgeno "[RD < 10] OR [GQ <= 30]"

          Using regular expressions for filtering

          A regular expression is a structured string that represents a pattern to search for. WISARD supports regular expressions in various types of filters when the --regex option is added. As shown in the example below, it is recommended to enclose the regular expression in double quotes (") on Windows or single quotes (') otherwise, in order to avoid problems with special characters.

          Generating a BED-formatted dataset consisting only of the non-dbSNP variants C:\Users\WISARD> wisard --bed sample --regex --remvariant "^(rs)" --makebed --out sample_nonRSonly
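
What the pattern ^(rs) selects can be previewed with Python's `re` module, as in the sketch below. The variant IDs are made up, the function name is hypothetical, and WISARD's regular expression dialect and matching rules may differ from Python's, so this only illustrates the pattern itself.

```python
import re


def remove_matching(variant_ids, pattern):
    """Drop every ID in which the pattern is found (like --regex --remvariant)."""
    regex = re.compile(pattern)
    return [v for v in variant_ids if regex.search(v) is None]


ids = ["rs123", "chr1:1000", "rs9", "exm-rs55", "SNP10"]
print(remove_matching(ids, r"^(rs)"))  # ['chr1:1000', 'exm-rs55', 'SNP10']
```

Because of the ^ anchor, only IDs that begin with "rs" are removed; "exm-rs55" is kept.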

          Selecting/removing samples with specific phenotypic conditions

          WISARD can select or remove samples according to their phenotype values. Multiple conditions can be given, subject to some simple rules:

          1. Unless an alternative phenotype is given with --sampvar and --pname, only the default phenotype in the given dataset is considered.
          2. In the above case, the column name of the default phenotype must be 'PHENOTYPE'.
          3. If there are multiple selecting/removing conditions, they must be separated by commas (,) with no whitespace.
          4. Reserved phenotype/covariate column names cannot be used as part of a condition.
          5. Each condition must specify a definite, non-conflicting 'range' or an exact 'value'.
          6. Each condition must involve only ONE phenotype; e.g. a condition cannot be expressed as a comparison of two or more phenotypes.

          Below are some examples of invalid values of --filpheno.

          Example 3 : Examples of invalid --filpheno assignment
          --filpheno 130<HEIGHT<120 # Range is contradictory
          --filpheno REGION<3 # REGION is a factor but is given as a 'range'
          --filpheno 120<HEIGHT<180, 33<BMI<40 # Whitespace after the separator (,)
          --filpheno 120x<HEIGHT # Non-numeric character (x) included in range
          --filpheno HEIGHT>BMI # Condition mixes two phenotypes
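
The rules above can be turned into a rough pre-flight check, sketched here in Python. It is not WISARD's parser: it only understands the < and > forms that appear in Example 3, the function names are hypothetical, and the set of factor columns has to be supplied by the caller.

```python
import re

NUMBER = r"-?\d+(?:\.\d+)?"
NAME = r"[A-Za-z_]\w*"


def check_condition(cond, factor_columns=frozenset()):
    """Return a reason string if one condition looks invalid, else None."""
    parts = re.split(r"[<>]", cond)
    ops = re.findall(r"[<>]", cond)
    names = [p for p in parts if re.fullmatch(NAME, p)]
    numbers = [p for p in parts if re.fullmatch(NUMBER, p)]
    if len(names) + len(numbers) != len(parts):
        return "bound is not numeric"
    if len(names) != 1:
        return "exactly one phenotype is required"
    if names[0] in factor_columns:
        return "factor phenotype used with a range"
    if len(parts) == 3 and parts[1] == names[0]:
        low, high = float(parts[0]), float(parts[2])
        if ops == ["<", "<"] and low >= high:
            return "range is contradictory"
        if ops == [">", ">"] and low <= high:
            return "range is contradictory"
    return None


def check_filpheno(arg, factor_columns=frozenset()):
    """Return a list of problems found in a --filpheno style argument."""
    if any(ch.isspace() for ch in arg):
        return ["whitespace is not allowed"]
    problems = []
    for cond in arg.split(","):
        reason = check_condition(cond, factor_columns)
        if reason:
            problems.append(f"{cond}: {reason}")
    return problems


print(check_filpheno("120<HEIGHT<180,33<BMI<40"))  # []
print(check_filpheno("130<HEIGHT<120"))            # ['130<HEIGHT<120: range is contradictory']
```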


          Last modified : 2017-09-13 11:05:56