WISARD[wɪzərd]
Workbench for Integrated Superfast Association study with Related Data

Filtering

This section covers:

  • Summary for filtering
    • Filtering sequence
    • Windowed filter
  • Filtering samples
    • Selecting/removing a subset of samples from analysis
    • Exclude/include samples by genotype calling rate
    • Exclude/include samples or families by Mendelian error
    • Exclude samples with sex status
    • Exclude samples with specific condition
  • Default filtering of WISARD
  • Filtering genotype with specific condition
  • Using regular expressions for filtering
  • Selecting/removing samples with specific phenotypic conditions

          Summary for filtering

          Different studies need different subsets of their data, so an analysis tool must offer several ways to obtain them. WISARD supports various kinds of filtering for samples and variants, such as:

          • Inclusion/exclusion of a list of individual/variant/family
          • Quality filtering based on allele frequency/genotyping rate/Hardy-Weinberg equilibrium
          • Include/exclude specific chromosome/region
          • Eliminate/detect spurious relationship
          • Some discrimination criterion for specific analysis
          • ...or find out which variants/samples satisfy such condition

          Filtering sequence

          Dataset filtering is done in two phases, which can be told apart by two simple rules:

          • If the filtering option starts with rem or sel, it is applied in the first phase.
          • If it starts with fil or inc, it is applied in the second phase.

          Under this rule, second-phase options cannot affect the outcome of first-phase options, and options within the same phase cannot affect each other. In the example below, both --filmispheno and --filmac are second-phase options, so neither filter sees the result of the other and the resulting dataset may still contain variants with MAC < 2.

          An example of filtering options that cannot control each other:
            C:\Users\WISARD> wisard --bed test_miss0 --sampvar test_miss0_phen.txt --pname medi01n --filmispheno --filmac [0,2]

          To avoid this, either (1) run WISARD several times, applying the filters in the desired order, or (2) determine the markers/samples to be filtered beforehand and pass them as a list to a first-phase filtering option.
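The prefix rule above can be sketched in a few lines of Python. This is only an illustration of the naming rule as stated on this page; the function name `filter_phase` is hypothetical, and the sketch says nothing about how WISARD handles options internally.

```python
from typing import Optional

def filter_phase(option: str) -> Optional[int]:
    """Return 1 or 2 according to the prefix rule, or None if the rule does not apply."""
    name = option.lstrip("-")
    if name.startswith(("rem", "sel")):
        return 1  # first-phase filtering option
    if name.startswith(("fil", "inc")):
        return 2  # second-phase filtering option
    return None   # not covered by the two rules (e.g. an input option)

options = ["--remsamp", "--selfam", "--filmispheno", "--filmac", "--incgind"]
phases = {opt: filter_phase(opt) for opt in options}

# Both options of the example fall into the same phase,
# so neither can act on the result of the other.
same_phase = phases["--filmispheno"] == phases["--filmac"]
print(phases, same_phase)
```

Checking the phases of a planned option set in this way shows at a glance which filters would have to be split across separate runs.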

          Windowed filter

          With the --window option, markers adjacent to a filtered marker can be filtered as well. Note that adjacency is computed from physical position, as shown in the example below.

          Example 1 : An example of physical map
                   178      223        311
          --+-------+--+-----*--------+-+---
           123        192            295

          Suppose the marker shown as '*' (position 223) is the variant to be filtered. With --window 80, the three markers at positions 178, 192 and 295 are filtered in addition, because the window spans +-80 (positions 143 to 303). The window can also be asymmetric. For example, --window [-35,100] additionally filters the markers at 192, 295 and 311, because the window runs from -35 (position 188) to +100 (position 323). For a detailed explanation of the parameter of the --window option, see the documentation of that option.
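The window arithmetic of Example 1 can be reproduced with a short Python sketch. The function `window_hits` is hypothetical, and treating the window bounds as inclusive is an assumption; the positions are those of the physical map above.

```python
def window_hits(positions, target, window):
    """Return the other markers that fall inside the window around `target`.

    `window` is either a single number w (symmetric: target-w .. target+w)
    or a pair (lo, hi) of offsets such as (-35, 100).
    Bounds are treated as inclusive, which is an assumption.
    """
    if isinstance(window, (int, float)):
        lo, hi = -window, window
    else:
        lo, hi = window
    return [p for p in sorted(positions)
            if p != target and target + lo <= p <= target + hi]

markers = [123, 178, 192, 223, 295, 311]
print(window_hits(markers, 223, 80))          # [178, 192, 295]
print(window_hits(markers, 223, (-35, 100)))  # [192, 295, 311]
```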

          Filtering samples

          Selecting/removing a subset of samples from analysis

          Samples can be filtered by (1) IID, (2) FID and IID, (3) a random portion, or (4) FID.

          Do an analysis after removing individuals `SAMP1_8` and `SAMP1_9`:
            C:\Users\WISARD> wisard --remsamp SAMP1_8,SAMP1_9 --ped test_miss0.ped
          Do an analysis after selecting only the individuals listed in `test_sample_list.txt`:
            C:\Users\WISARD> wisard --selsamp test_sample_list.txt --ped test_miss0.ped
          Do an analysis after randomly selecting 80% of the samples:
            C:\Users\WISARD> wisard --sampresize 0.8 --ped test_miss0.ped
          Do an analysis after randomly selecting eleven samples:
            C:\Users\WISARD> wisard --sampresize 11 --ped test_miss0.ped
          Do an analysis after selecting the samples that belong to family `FAM_1`, `FAM_4` or `FAM_7`:
            C:\Users\WISARD> wisard --selfam FAM_1,FAM_4,FAM_7 --ped test_miss0.ped
          Do an analysis after removing the individuals whose FID is listed in `test_family_list.txt`:
            C:\Users\WISARD> wisard --remfam test_family_list.txt --ped test_miss0.ped
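The two --sampresize examples above suggest that a value below 1 is read as a proportion and an integer as a count. The Python sketch below illustrates that reading only; the function `resize_samples`, the rounding of the proportion and the handling of the value 1 are assumptions, not WISARD's documented behavior.

```python
import random

def resize_samples(samples, size, seed=None):
    """Randomly keep a subset of `samples`.

    0 < size < 1 is read as a proportion, an integer size >= 1 as a count.
    Rounding and the treatment of exactly 1 are assumptions of this sketch.
    """
    if 0 < size < 1:
        k = round(len(samples) * size)
    elif size >= 1 and float(size).is_integer():
        k = int(size)
    else:
        raise ValueError(f"unsupported size: {size!r}")
    if k > len(samples):
        raise ValueError("cannot select more samples than available")
    rng = random.Random(seed)       # seeded only to make the sketch repeatable
    return rng.sample(samples, k)   # sampling without replacement

samples = [f"SAMP1_{i}" for i in range(1, 21)]     # 20 hypothetical IIDs
print(len(resize_samples(samples, 0.8, seed=1)))   # 16
print(len(resize_samples(samples, 11, seed=1)))    # 11
```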

          Exclude/include samples by genotype calling rate

          Samples can be filtered by genotype calling rate (also called genotyping rate).

          Do an analysis after removing the samples whose genotype calling rate is under 80%:
            C:\Users\WISARD> wisard --filgind "<0.8" --bed test_miss2.bed
          Do an analysis after selecting the samples whose genotype calling rate is >=80% and <99%:
            C:\Users\WISARD> wisard --incgind "[0.8,0.99)" --bed test_miss2.bed

          Exclude/include samples or families by Mendelian error

          Samples or families can be filtered by the proportion of genotypes with a Mendelian transmission error.

          NOTE!
          These options accept range-type parameters.
          Exclude families with Mendelian error rate >=50%:
            C:\Users\WISARD> wisard --filmendelfam ">=0.5" --ped test_miss0.ped
          Include families with Mendelian error rate >10% and <=100%:
            C:\Users\WISARD> wisard --incmendelfam "(0.1,1]" --ped test_miss0.ped
          Exclude samples with Mendelian error rate >20%:
            C:\Users\WISARD> wisard --filmendelsamp ">0.2" --ped test_miss0.ped
          Include samples with Mendelian error rate <10%:
            C:\Users\WISARD> wisard --incmendelsamp "<0.1" --ped test_miss0.ped
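The range-type parameters used above come in two shapes: a comparison such as ">=0.5" and an interval such as "[0.8,0.99)", where a square bracket includes the bound and a parenthesis excludes it. The following Python sketch shows how such strings can be read; `make_range_test` is a hypothetical helper written for this page and handles only the forms that appear in the examples, not necessarily everything WISARD accepts.

```python
import re

_OPS = {
    "<":  lambda x, v: x < v,
    "<=": lambda x, v: x <= v,
    ">":  lambda x, v: x > v,
    ">=": lambda x, v: x >= v,
}

def make_range_test(spec):
    """Turn a range string into a function that tests a number against it."""
    spec = spec.strip()
    # Comparison form, e.g. "<0.8" or ">=0.5"
    m = re.fullmatch(r"(<=|>=|<|>)\s*([0-9.]+)", spec)
    if m:
        op, bound = _OPS[m.group(1)], float(m.group(2))
        return lambda x: op(x, bound)
    # Interval form, e.g. "[0.8,0.99)" or "(0.1,1]"
    m = re.fullmatch(r"([\[(])\s*([0-9.]+)\s*,\s*([0-9.]+)\s*([\])])", spec)
    if m:
        lo, hi = float(m.group(2)), float(m.group(3))
        lo_closed, hi_closed = m.group(1) == "[", m.group(4) == "]"

        def test(x):
            above = x >= lo if lo_closed else x > lo
            below = x <= hi if hi_closed else x < hi
            return above and below
        return test
    raise ValueError(f"unrecognized range: {spec!r}")

in_range = make_range_test("[0.8,0.99)")
print(in_range(0.8), in_range(0.99))      # True False
print(make_range_test("(0.1,1]")(0.1))    # False
print(make_range_test(">=0.5")(0.5))      # True
```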

          Exclude samples with sex status

          Samples can be filtered by sex status.

          Exclude male samples:
            C:\Users\WISARD> wisard --filmale --ped test_miss0.ped
          Exclude female samples:
            C:\Users\WISARD> wisard --filfemale --ped test_miss0.ped
          Exclude samples of unknown sex (neither male nor female):
            C:\Users\WISARD> wisard --filnosex --bed test_miss0.bed --fam test_miss0_missex.fam
          NOTE!
          Since WISARD checks the sex code of samples that have offspring, a missing sex code is not allowed for those samples!

          Exclude samples with specific condition

          Samples can be filtered by founder status or by simple phenotype status.

          Exclude non-founder samples:
            C:\Users\WISARD> wisard --filnf --ped test_miss0.ped
          Exclude missing founder samples:
            C:\Users\WISARD> wisard --filmf --ped test.ped
          Exclude case samples:
            C:\Users\WISARD> wisard --filcase --ped test_miss0.ped
          Exclude control samples:
            C:\Users\WISARD> wisard --filcontrol --ped test_miss0.ped
          Remove all samples with a missing phenotype from input `test_miss0`:
            C:\Users\WISARD> wisard --sampvar test_miss0_phen.txt --pname medi01n --filmispheno --ped test_miss0.ped

          Default filtering of WISARD

          By default, WISARD uses the given dataset 'as is': no variants or individuals are dropped unless they meet specific conditions. If, however, particular variants or individuals must be removed, resizing or reformatting the data for every run quickly becomes tedious. To meet this need, WISARD provides several data filtering schemes.

          NOTE!
          When these options are given specific IIDs/variants directly, rather than the name of a file listing them, no whitespace is permitted between or within the IIDs/variants!

          Filtering genotype with specific condition

          Include only phased genotypes from the input VCF file:
            C:\Users\WISARD> wisard --vcf test_miss0.vcf --phasedonly
          Include only unphased genotypes from the input VCF file:
            C:\Users\WISARD> wisard --vcf test_miss0.vcf --unphasedonly
          Set a genotype to NA if RD is less than 10 or GQ is less than or equal to 30:
            C:\Users\WISARD> wisard --vcf test.vcf --filgeno "[RD < 10] OR [GQ <= 30]"

          Using regular expressions for filtering

          A regular expression is a structured string that describes a pattern to search for. WISARD supports regular expressions in various types of filters through the --regex option. As the example below shows, it is recommended to enclose the regular expression in double quotes (") on Windows, or in single quotes (') otherwise, to avoid problems with special characters.

          Generate a BED-formatted dataset consisting only of the samples whose IID starts with SAMP1:
            C:\Users\WISARD> wisard --bed test_miss0.bed --regex --selsamp "^(SAMP1).+" --makebed --out sample_KAREonly
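Before running a filter, it can help to try the pattern on a few IIDs. The sketch below uses Python's re module purely for illustration; the regular expression dialect that WISARD uses may differ in details, and the IIDs are hypothetical.

```python
import re

# The pattern from the example above: "SAMP1" at the start of the IID,
# followed by at least one more character.
pattern = re.compile(r"^(SAMP1).+")

iids = ["SAMP1_8", "SAMP1_9", "SAMP10_2", "SAMP2_1", "SAMP1", "XSAMP1_3"]
selected = [iid for iid in iids if pattern.search(iid)]
print(selected)  # ['SAMP1_8', 'SAMP1_9', 'SAMP10_2']
```

Note that the pattern also matches IIDs such as SAMP10_2 and does not match the bare IID SAMP1, because `.+` requires at least one character after the prefix.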

          Selecting/removing samples with specific phenotypic conditions

          WISARD can select or remove samples according to their phenotype values. Several conditions can be given at once, subject to some simple rules:

          1. Unless an alternative phenotype is given with --sampvar and --pname, only the default phenotype of the given dataset is considered.
          2. In that case, the column name of the default phenotype must be 'PHENOTYPE'.
          3. Multiple selecting/removing conditions must be separated by a comma (,) with no whitespace.
          4. Reserved phenotype/covariate column names cannot be used as part of a condition.
          5. Each condition must specify a definite, non-conflicting 'range' or an exact 'value'.
          6. Each condition must involve only ONE phenotype, i.e. a condition cannot compare two or more phenotypes.

          Below are some examples of invalid values for --filpheno.

          Example 2 : Examples of invalid --filpheno assignment
          --filpheno 130<HEIGHT<120 # Range is conflicting
          --filpheno REGION<3 # REGION is a factor but is given as a 'range'
          --filpheno 120<HEIGHT<180, 33<BMI<40 # Whitespace after the separator (,)
          --filpheno 120x<HEIGHT # Non-numeric character (x) in the range
          --filpheno HEIGHT>BMI # Condition mixes two phenotypes
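The reasons given in Example 2 can be checked mechanically. The Python sketch below is a hypothetical pre-check, not WISARD's parser: it only knows the `<` and `>` forms that appear in the example, needs to be told which phenotypes are factors, detects a conflicting range only in the `lo<NAME<hi` form, and does not know the reserved column names of rule 4.

```python
import re

NUM = r"[-+]?\d+(?:\.\d+)?"   # a plain decimal number
NAME = r"[A-Za-z_]\w*"        # a phenotype column name (assumed form)

def check_condition(cond, factors=()):
    """Return None if `cond` looks valid, otherwise a short reason."""
    if re.search(r"\s", cond):
        return "whitespace"
    tokens = re.split(r"[<>]", cond)
    names = [t for t in tokens if re.fullmatch(NAME, t)]
    numbers = [t for t in tokens if re.fullmatch(NUM, t)]
    if len(names) + len(numbers) != len(tokens):
        return "non-numeric or malformed bound"
    if len(names) != 1:
        return "must use exactly one phenotype"
    if names[0] in factors:
        return "range on a factor"
    m = re.fullmatch(rf"({NUM})<{NAME}<({NUM})", cond)
    if m and float(m.group(1)) >= float(m.group(2)):
        return "conflicting range"
    return None

def check_filpheno(value, factors=()):
    """Check every comma-separated condition; return {condition: reason}."""
    problems = {}
    for cond in value.split(","):
        reason = check_condition(cond, factors)
        if reason:
            problems[cond] = reason
    return problems

print(check_filpheno("130<HEIGHT<120"))
print(check_filpheno("REGION<3", factors={"REGION"}))
print(check_filpheno("120<HEIGHT<180, 33<BMI<40"))
print(check_filpheno("120x<HEIGHT"))
print(check_filpheno("HEIGHT>BMI"))
print(check_filpheno("120<HEIGHT<180,33<BMI<40"))   # no problems: {}
```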


          Last modified : 2017-09-13 11:11:03