WISARD[wɪzərd]
Workbench for Integrated Superfast Association study with Related Data

Filtering

This section is about

  • Summary for filtering
    • Filtering sequence
    • Windowed filter
  • Filtering samples
    • Selecting/removing a subset of samples from analysis
    • Exclude/include samples by genotype calling rate
    • Exclude/include samples or families by Mendelian error
    • Exclude samples with sex status
    • Exclude samples with specific condition
  • Filtering variants
    • Include/exclude variants by Mendelian error
    • Include/exclude variants by physical range
    • Include/exclude variants by genotype calling rate
    • Include/exclude variants by the test of missing rate
    • Selecting/removing a subset of variants from analysis
    • Selecting a subset of chromosomes from analysis
    • Include/exclude variants with minor allele
    • Include/exclude variants with genetic distance
    • Include/exclude variants with HWE test
    • Exclude variants with specific condition
  • Default filtering of WISARD
  • Filtering genotype with specific condition
  • Using regular expression to filtering
  • Selecting/removing samples with specific phenotypic conditions

          Summary for filtering

          Different studies place different demands on their data, so an analysis tool needs to offer a range of filtering methods. WISARD supports various kinds of filtering, such as:

          • Inclusion/exclusion of a list of individual/variant/family
          • Quality filtering based on allele frequency/genotyping rate/Hardy-Weinberg equilibrium
          • Include/exclude specific chromosome/region
          • Eliminate/detect spurious relationship
          • Some discrimination criterion for specific analysis
          • ...or find out which variants/samples satisfy such condition

          Filtering sequence

          Dataset filtering is applied in two phases, which can be told apart by two simple rules:

          • If the filtering option starts with rem or sel, it is applied in the first phase.
          • If it starts with fil or inc, it is applied in the second phase.

          Under these rules, second-phase filtering options cannot control first-phase options, and filtering options within the same phase cannot control each other. In the example below, both --filmispheno and --filmac are second-phase options, so the resulting dataset may still contain variants with MAC < 2.

          An example of filtering options that cannot control each other C:\Users\WISARD> wisard --bed sample_data --filmispheno --filmac [0,2]

          To prevent this situation, either (1) run WISARD multiple times, applying the filters in the desired order, or (2) run WISARD with first-phase filtering options, supplying in advance the list of markers/samples to be filtered.
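          The prefix rule above can be sketched in a few lines of Python. This is only an illustration of the naming convention described here, not WISARD's own code; the function name `filter_phase` is hypothetical, and options that match neither prefix group are reported as `None`.

```python
def filter_phase(option):
    """Return 1 or 2 for the phase implied by the option prefix, else None.

    Hypothetical helper: it only encodes the rem/sel vs. fil/inc naming rule.
    """
    name = option.lstrip("-")
    if name.startswith(("rem", "sel")):
        return 1
    if name.startswith(("fil", "inc")):
        return 2
    return None


options = ["--remsamp", "--selvariant", "--filmispheno", "--filmac", "--incfreq"]
for opt in options:
    print(opt, "-> phase", filter_phase(opt))
```

          Because --filmispheno and --filmac both fall in phase 2, the sketch makes it easy to see in advance which options of a command cannot control each other.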

          Windowed filter

          With the --window option, markers adjacent to a marker being filtered can be filtered as well. Note that this option determines adjacency from physical position, as illustrated in the example below.

          Example 1 : An example of physical map
                   178      223        311
          --+-------+--+-----*--------+-+---
           123        192            295

          Suppose the marker marked '*' (position 223) is the variant to be filtered. With --window 80, three additional markers at positions 178, 192 and 295 are filtered because the window spans ±80 (143~303). The range can also be asymmetric. For example, --window [-35,100] additionally filters the markers at 192, 295 and 311 because the window spans from -35 (position 188) to +100 (position 323). For a detailed explanation of the parameter of the --window option, see this page.
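          The window arithmetic in this example can be checked with a small Python sketch. The function name `window_hits` is hypothetical, and treating the window boundaries as inclusive is an assumption; the text does not say how WISARD handles a marker lying exactly on the boundary.

```python
def window_hits(positions, target, window):
    """List markers (other than the target) that fall inside the window.

    window is either a single number (symmetric, +-window) or a
    (lower, upper) pair of offsets relative to the target position.
    Boundaries are treated as inclusive (an assumption).
    """
    if isinstance(window, (int, float)):
        lower, upper = -abs(window), abs(window)
    else:
        lower, upper = window
    lo, hi = target + lower, target + upper
    return [p for p in positions if p != target and lo <= p <= hi]


markers = [123, 178, 192, 223, 295, 311]
print(window_hits(markers, 223, 80))          # symmetric window 143..303
print(window_hits(markers, 223, (-35, 100)))  # asymmetric window 188..323
```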

          Filtering samples

          Selecting/removing a subset of samples from analysis

          It filters sample(s) by (1) IID, (2) FID and IID, (3) a random portion, or (4) FID.

          Do an analysis after removing individual `IND0001` and `IND0003` C:\Users\WISARD> wisard --remsamp IND0001,IND0003 --ped test.ped
          Do an analysis after selecting individuals only listed in `sellist_ind.txt` C:\Users\WISARD> wisard --selsamp sellist_ind.txt --ped test.ped
          Do an analysis after randomly selecting 80% of samples C:\Users\WISARD> wisard --sampresize 0.8 --ped test.ped
          Do an analysis after randomly selecting eleven samples C:\Users\WISARD> wisard --sampresize 11 --ped test.ped
          Do an analysis after selecting samples included in the family `FAM01`, `FAM04` or `FAM07` C:\Users\WISARD> wisard --selfam FAM01,FAM04,FAM07 --ped test.ped
          Do an analysis after removing individuals whose FID is listed in `remlist_fam.txt` C:\Users\WISARD> wisard --remfam remlist_fam.txt --ped test.ped
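          The two --sampresize examples above suggest that a value below 1 is read as a proportion and an integer as a count. The Python sketch below illustrates that reading only; the function name `resize`, the rounding of the proportion, and the `seed` argument are assumptions made for the illustration and are not taken from WISARD.

```python
import random


def resize(ids, amount, seed=None):
    """Randomly keep a subset of ids.

    Assumption: 0 < amount < 1 means a proportion, anything else a count.
    How WISARD rounds a fractional sample size is not documented here.
    """
    rng = random.Random(seed)
    if 0 < amount < 1:
        n = int(round(len(ids) * amount))
    else:
        n = int(amount)
    n = max(0, min(n, len(ids)))  # never ask for more than is available
    return rng.sample(ids, n)


samples = [f"IND{i:04d}" for i in range(1, 21)]  # 20 hypothetical IIDs
print(len(resize(samples, 0.8, seed=1)))  # proportion of the samples
print(len(resize(samples, 11, seed=1)))   # fixed number of samples
```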

          Exclude/include samples by genotype calling rate

          It filters sample(s) by calling rate (or genotyping rate).

          Do an analysis after removing samples whose genotype calling rate is under 80% C:\Users\WISARD> wisard --filgind "<0.8" --bed test_miss2.bed
          Do an analysis after selecting samples whose genotype calling rate is >=70% and <99% C:\Users\WISARD> wisard --incgind "[0.7,0.99)" --bed test_miss2.bed
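          The range expressions used in these examples, such as "<0.8" and "[0.7,0.99)", can be read as standard interval notation: square brackets include the endpoint and parentheses exclude it. The Python sketch below parses the two forms seen on this page under that assumption. The name `parse_range` is hypothetical, and WISARD may accept further forms that are not covered here.

```python
import re

_INTERVAL = re.compile(r"^([\[\(])\s*([^,\s]+)\s*,\s*([^,\s]+)\s*([\]\)])$")
_COMPARE = re.compile(r"^(<=|>=|<|>)\s*(\S+)$")


def parse_range(text):
    """Turn '<0.8' or '[0.7,0.99)' into a predicate value -> bool."""
    text = text.strip()
    m = _INTERVAL.match(text)
    if m:
        left, right = m.group(1), m.group(4)
        low, high = float(m.group(2)), float(m.group(3))

        def in_interval(x):
            above = x >= low if left == "[" else x > low
            below = x <= high if right == "]" else x < high
            return above and below

        return in_interval
    m = _COMPARE.match(text)
    if m:
        op, bound = m.group(1), float(m.group(2))
        return {
            "<": lambda x: x < bound,
            "<=": lambda x: x <= bound,
            ">": lambda x: x > bound,
            ">=": lambda x: x >= bound,
        }[op]
    raise ValueError(f"unrecognized range expression: {text!r}")


# Hypothetical per-sample calling rates.
call_rate = {"S1": 0.65, "S2": 0.80, "S3": 0.95, "S4": 0.99}
remove = parse_range("<0.8")       # as in --filgind "<0.8"
keep = parse_range("[0.7,0.99)")   # as in --incgind "[0.7,0.99)"
print([s for s, r in call_rate.items() if not remove(r)])
print([s for s, r in call_rate.items() if keep(r)])
```

          A fil-type option drops whatever matches the range, while an inc-type option keeps only what matches, which is why the two list comprehensions differ by a `not`.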

          Exclude/include samples or families by Mendelian error

          It filters samples or families by the proportion of genotypes having Mendelian transmission errors.

          NOTE!
          These options accept a range-type parameter.
          Exclude families with Mendelian error rate >=50% C:\Users\WISARD> wisard --filmendelfam ">=0.5" --ped test_miss0.ped
          Include families with Mendelian error rate >10% and <=25% C:\Users\WISARD> wisard --incmendelfam "(0.1,0.25]" --ped test_miss0.ped
          Exclude samples with Mendelian error rate >20% C:\Users\WISARD> wisard --filmendelsamp ">0.2" --ped test_miss0.ped
          Include samples with Mendelian error rate <10% C:\Users\WISARD> wisard --incmendelsamp "<0.1" --ped test_miss0.ped
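          As a rough illustration of what a Mendelian transmission error is, the Python sketch below checks father/mother/child trios at a biallelic autosomal variant, with genotypes coded as the number of copies of one allele (0, 1 or 2) and `None` for a missing call. The function names and the choice of denominator (fully genotyped trios only) are assumptions; the text does not state how WISARD computes the rate.

```python
def transmissible(genotype):
    """Alleles (0 or 1) that a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_error(father, mother, child):
    """True if the child genotype cannot arise from the two parents."""
    if None in (father, mother, child):
        return False  # missing calls are not counted as errors here
    possible = {a + b for a in transmissible(father) for b in transmissible(mother)}
    return child not in possible


def error_rate(trios):
    """Fraction of fully genotyped trios that show an error."""
    checked = [t for t in trios if None not in t]
    if not checked:
        return 0.0
    return sum(is_mendelian_error(*t) for t in checked) / len(checked)


trios = [(0, 0, 1), (1, 1, 2), (2, 0, 1), (2, 2, 0)]  # (father, mother, child)
print(error_rate(trios))
```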

          Exclude samples with sex status

          It filters sample(s) by sex status.

          Exclude male samples C:\Users\WISARD> wisard --filmale --ped test.ped
          Exclude female samples C:\Users\WISARD> wisard --filfemale --ped test.ped
          Exclude samples of unknown sex (neither male nor female) C:\Users\WISARD> wisard --filnosex --ped test.ped

          Exclude samples with specific condition

          It filters sample(s) by founder status or simple phenotype status.

          Exclude non-founder samples C:\Users\WISARD> wisard --filnf --ped test.ped
          Exclude missing founder samples C:\Users\WISARD> wisard --filmf --ped test.ped
          Exclude case samples C:\Users\WISARD> wisard --filcase --ped test.ped
          Exclude control samples C:\Users\WISARD> wisard --filcontrol --ped test.ped
          Removing all samples having their phenotype missing from input `test` C:\Users\WISARD> wisard --filmispheno --ped test.ped

          Filtering variants

          Include/exclude variants by Mendelian error

          NOTE!
          These options accept a range-type parameter.
          Exclude variants with Mendelian error rate >=0% and <10% C:\Users\WISARD> wisard --filmendelvar "[0,0.1)" --ped test_miss0.ped
          Include variants with Mendelian error rate <=1% C:\Users\WISARD> wisard --incmendelvar "<=0.01" --ped test_miss0.ped

          Include/exclude variants by physical range

          Variants can be included in or excluded from the analysis by physical position, using --incrange or --filrange. This filtering requires a file that lists the ranges to be included or excluded. Below is an example of a range definition file. Each line represents one physical range to be included or filtered, and its values are the chromosome, start position and end position, respectively.

          Example 2 : An example of range definition file
          1 17293840 19288428
          3 8284585 9928717
          X 2847274 2994827
          Include variants located in specific region of 'range_sel.txt' C:\Users\WISARD> wisard --incrange range_sel.txt --ped test.ped
          Exclude variants located in specific region of 'range_rem.txt' C:\Users\WISARD> wisard --filrange range_rem.txt --ped test.ped

          Alternatively, a list of desired ranges can be given directly in the following form.

          Include variants located in the regions defined on the command line C:\Users\WISARD> wisard --incrange 1[17293840,19288428],3[8284585,9928717],X[2847274,2994827] --ped test.ped
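          Both ways of giving ranges, the three-column file and the inline `chr[start,end]` list, carry the same information. The Python sketch below parses each form and tests whether a variant position falls inside any range. All function names are hypothetical, and treating both endpoints as inclusive is an assumption.

```python
import re


def parse_range_lines(lines):
    """Parse 'chromosome start end' lines from a range definition file."""
    ranges = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        chrom, start, end = fields
        ranges.append((chrom, int(start), int(end)))
    return ranges


_INLINE = re.compile(r"([^,\[\]]+)\[(\d+),(\d+)\]")


def parse_inline_ranges(text):
    """Parse the inline form, e.g. '1[100,200],X[5,9]'."""
    return [(c, int(s), int(e)) for c, s, e in _INLINE.findall(text)]


def in_any_range(chrom, pos, ranges):
    """True if (chrom, pos) lies inside any range, endpoints included."""
    return any(c == chrom and s <= pos <= e for c, s, e in ranges)


file_lines = ["1 17293840 19288428", "3 8284585 9928717", "X 2847274 2994827"]
inline = "1[17293840,19288428],3[8284585,9928717],X[2847274,2994827]"
ranges = parse_range_lines(file_lines)
print(ranges == parse_inline_ranges(inline))
print(in_any_range("3", 9000000, ranges), in_any_range("2", 9000000, ranges))
```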

          Include/exclude variants by genotype calling rate

          Do an analysis after removing variants whose genotype calling rate is under 90% C:\Users\WISARD> wisard --filgvar "<0.9" --ped test_miss0.ped
          Do an analysis after selecting variants whose genotype calling rate is >10% and <=50% C:\Users\WISARD> wisard --incgvar "(0.1,0.5]" --ped test_miss0.ped

          Include/exclude variants by the test of missing rate

          Do an analysis after removing variants if the p-value of test < 0.05 C:\Users\WISARD> wisard --filmistest "<0.05" --ped test_miss0.ped
          Do an analysis after selecting variants if the p-value of test > 0.05 C:\Users\WISARD> wisard --incmistest "(0.05,1]" --ped test_miss0.ped

          Selecting/removing a subset of variants from analysis

          Removing variants listed in `remlist_variant.txt` from input `test` C:\Users\WISARD> wisard --remvariant remlist_variant.txt --ped test.ped
          Selecting variants `rs8385` and `rs93851` only from input `test` C:\Users\WISARD> wisard --selvariant rs8385,rs93851 --ped test.ped
          Randomly select 10% of variants from dataset C:\Users\WISARD> wisard --varresize 0.1 --ped test.ped
          Randomly select one thousand of variants from dataset C:\Users\WISARD> wisard --varresize 1000 --ped test.ped

          Selecting a subset of chromosomes from analysis

          Selecting variants residing on chromosomes 1 to 10 from input `test` C:\Users\WISARD> wisard --chr 1-10 --ped test.ped
          Selecting variants residing on chromosomes 3 and X from input `test` C:\Users\WISARD> wisard --chr 3,X --ped test.ped
          Selecting variants residing on autosomes from input `test` C:\Users\WISARD> wisard --autoonly --ped test.ped
          Selecting variants residing on sex chromosomes from input `test` C:\Users\WISARD> wisard --sexonly --ped test.ped

          Include/exclude variants with minor allele

          NOTE!
          These options accept a range-type parameter.
          Exclude variants with minor allele frequency lower than 2% C:\Users\WISARD> wisard --filfreq "<0.02" --ped test_miss0.ped
          Include variants with minor allele frequency greater than or equal to 5% C:\Users\WISARD> wisard --incfreq [0.05,0.5] --ped test_miss0.ped
          Exclude variants with minor allele count of 0 to 2 C:\Users\WISARD> wisard --filmac [0,2] --ped test_miss0.ped
          Include variants with minor allele count greater than 5 C:\Users\WISARD> wisard --incmac ">5" --ped test_miss0.ped
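          For reference, minor allele count (MAC) and minor allele frequency (MAF) at a biallelic variant can be computed as sketched below in Python, with genotypes coded as the number of copies of one allele (0, 1 or 2) and `None` for a missing call. This is the textbook definition under the assumption of diploid genotypes; the function name is hypothetical, and WISARD's handling of founders, sex chromosomes or missing calls may differ.

```python
def minor_allele_stats(genotypes):
    """Return (MAC, MAF) for one variant, ignoring missing genotypes."""
    called = [g for g in genotypes if g is not None]
    total = 2 * len(called)           # two alleles per called sample
    if total == 0:
        return 0, 0.0
    alt = sum(called)                 # copies of the counted allele
    mac = min(alt, total - alt)       # the rarer allele is the minor one
    return mac, mac / total


mac, maf = minor_allele_stats([0, 1, 2, None, 0])
print(mac, maf)
```

          Since MAC depends on which samples remain in the dataset, removing samples can change it, which is the situation described in the filtering sequence section.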

          Include/exclude variants with genetic distance

          NOTE!
          These options accept a range-type parameter.
          NOTE!
          These options are only valid for datasets that have genetic distance!
          Exclude variants with genetic distance > 3 C:\Users\WISARD> wisard --filgdist ^>3 --ped test.ped
          Include variants with genetic distance greater than or equal to 0 but lower than 1 C:\Users\WISARD> wisard --incgdist [0,1) --ped test.ped

          Include/exclude variants with HWE test

          NOTE!
          These options accept a range-type parameter.
          Exclude by p-value of HWE test under 1e-7 C:\Users\WISARD> wisard --filhwe "<1e-7" --ped test_miss0.ped
          Include by p-value of HWE test greater than 0.05 C:\Users\WISARD> wisard --inchwe "(0.05,1]" --ped test_miss0.ped

          Exclude variants with specific condition

          Selecting SNVs from dataset C:\Users\WISARD> wisard --snvonly --ped test.ped
          Selecting indels from dataset C:\Users\WISARD> wisard --indelonly --ped test.ped
          Removing all samples having their phenotype missing from input `test` C:\Users\WISARD> wisard --filmispheno --ped test.ped
          Removing all non QC-passed variants from input VCF file C:\Users\WISARD> wisard --vcfqc --vcf sample.vcf
          Removing all variants having its QUAL value is within from 0 to 30 C:\Users\WISARD> wisard --filqual [0,30] --vcf sample.vcf
          Selecting all variants having its QUAL value is greater or equal than 50 C:\Users\WISARD> wisard --incqual ^>=50 --vcf sample.vcf

          Default filtering of WISARD

          By default, WISARD uses the given dataset 'as is' unless certain essential conditions are met. In other words, unless some of your data satisfies a specific condition, no variants/individuals will be dropped. If, however, specific variants or individuals should be removed, resizing or reformatting the data for each run can be quite tedious. To meet this need, WISARD provides several data filtering schemes.

          NOTE!
          When these options are given specific IIDs/variants instead of the name of a file listing them, no whitespace is permitted between or within the IIDs/variants!

          Filtering genotype with specific condition

          Include only phased genotypes from input VCF file C:\Users\WISARD> wisard --vcf sample.vcf --phasedonly
          Include only unphased genotypes from input VCF file C:\Users\WISARD> wisard --vcf sample.vcf --unphasedonly
          Set to NA if RD is less than 10 or GQ is less than or equal to 30 C:\Users\WISARD> wisard --vcf test.vcf --filgeno '[RD < 10] OR [GQ <= 30]'

          Using regular expression to filtering

          A regular expression is a structured string that describes a pattern to search for. WISARD supports regular expressions in various types of filters when the --regex option is added. It is recommended to enclose the regular expression in double quotes (") on Windows or single quotes (') otherwise, to avoid problems with special characters.

          Generating a BED-formatted dataset consisting only of the samples whose IID starts with KARE C:\Users\WISARD> wisard --bed sample --regex --selsamp '^(KARE).+' --makebed --out sample_KAREonly
          Generating a BED-formatted dataset consisting only of the non-dbSNP variants C:\Users\WISARD> wisard --bed sample --regex --remvariant '^(rs)' --makebed --out sample_nonRSonly
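          To preview which identifiers such patterns would match before running WISARD, the two expressions above can be tried in Python's `re` module, as sketched below. The identifiers are made up, the helper names are hypothetical, and it is an assumption that WISARD's regular expression engine behaves like Python's for these simple patterns.

```python
import re


def select_ids(ids, pattern):
    """Keep the ids that match the pattern (as with a sel-type option)."""
    rx = re.compile(pattern)
    return [i for i in ids if rx.search(i)]


def remove_ids(ids, pattern):
    """Drop the ids that match the pattern (as with a rem-type option)."""
    rx = re.compile(pattern)
    return [i for i in ids if not rx.search(i)]


samples = ["KARE0001", "KARE0002", "HAPMAP01", "KARE"]
variants = ["rs8385", "rs93851", "chr1:12345", "exm1001"]
print(select_ids(samples, r"^(KARE).+"))
print(remove_ids(variants, r"^(rs)"))
```

          Note that `^(KARE).+` requires at least one character after the prefix, so an IID that is exactly `KARE` is not selected.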

          Selecting/removing samples with specific phenotypic conditions

          WISARD provides a dedicated scheme for selecting/removing samples by their phenotype values. Multiple conditions can be given, subject to a few simple rules:

          1. Unless an alternative phenotype is given with --sampvar and --pname, only the default phenotype in the given dataset is considered.
          2. In the above case, the column name of the default phenotype must be 'PHENOTYPE'.
          3. If there are multiple selecting/removing conditions, they should be separated by commas (,) with no whitespace.
          4. Reserved phenotype/covariate column names cannot be used as part of a condition.
          5. Each condition should specify a definite, non-conflicting 'range' or an exact 'value'.
          6. Each condition should include only ONE phenotype, i.e. a condition cannot be expressed as a comparison of two or more phenotypes.

          Below are some examples of invalid values of --filpheno.

          Example 3 : Examples of invalid --filpheno assignment
          --filpheno 130<HEIGHT<120 # Range is conflicted
          --filpheno REGION<3 # REGION is factor but represented 'range'
          --filpheno 120<HEIGHT<180, 33<BMI<40 # Whitespace in separator(,)
          --filpheno 120x<HEIGHT # Non-numeric character(x) included in range
          --filpheno HEIGHT>BMI # Condition is represented with mixture of phenotypes
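          The invalid cases listed in Example 3 can be reproduced with a small Python checker. This is a sketch of rules 3, 5 and 6 only: it handles range-style conditions, does not know the reserved column names from rule 4, and does not cover exact-value conditions, whose syntax is not shown on this page. The function names and the `factors` argument (a set of column names known to be factors) are hypothetical.

```python
import re

_NAME = re.compile(r"^[A-Za-z_]\w*$")


def _is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_condition(cond, factors=()):
    """Return None if a single condition looks valid, else a reason."""
    tokens = re.split(r"(<=|>=|<|>)", cond)
    operands, ops = tokens[0::2], tokens[1::2]
    names = [t for t in operands if _NAME.match(t)]
    bounds = [t for t in operands if not _NAME.match(t)]
    if any(not _is_number(b) for b in bounds):
        return "non-numeric bound"
    if len(names) != 1:
        return "exactly one phenotype is required"
    if ops and names[0] in factors:
        return "factor used with a range"
    if len(operands) == 3:
        # Two-sided range: the phenotype must sit between the two bounds.
        if not _NAME.match(operands[1]) or ops[0][0] != ops[1][0]:
            return "malformed range"
        first, last = float(operands[0]), float(operands[2])
        low, high = (first, last) if ops[0][0] == "<" else (last, first)
        if low >= high:
            return "conflicting range"
    elif len(operands) != 2:
        return "malformed condition"
    return None


def check_filpheno(argument, factors=()):
    """Check a comma-separated --filpheno style argument."""
    if any(ch.isspace() for ch in argument):
        return ["whitespace is not allowed"]
    return [check_condition(c, factors) for c in argument.split(",")]


for arg in ["130<HEIGHT<120", "REGION<3", "120<HEIGHT<180, 33<BMI<40",
            "120x<HEIGHT", "HEIGHT>BMI", "120<HEIGHT<180,33<BMI<40"]:
    print(arg, "->", check_filpheno(arg, factors={"REGION"}))
```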


          Last modified : 2017-08-29 13:02:49