WISARD[wɪzərd] Workbench for Integrated Superfast Association study with Related Data
While the effect of a common variant on a phenotype of interest can be detected by testing its marginal effect, rare variants suffer from a high false-negative rate, and the statistical methods developed for common variants cannot be applied to them directly. To improve the statistical efficiency of rare variant association analysis, a set of rare variants is tested for association simultaneously. A set file that lists the variants belonging to the same gene must therefore be provided. If a set contains only a single variant, the result is not valid, and such sets must be filtered out of the analysis.
For rare variant analysis, statistical power is affected by several factors, including the definition of the set and the homogeneity of the effects of the rare variants on the phenotype. The most efficient statistic depends on these factors, so several statistics should be considered at the same time. WISARD provides various statistics for rare variant analysis.
Detailed power comparisons for various situations can be found in Ladouceur et al. (PLoS Genetics 2012).
WISARD automatically determines whether each phenotype is quantitative or dichotomous. By default, if only the values 1 and 2 are observed for a phenotype, WISARD treats it as dichotomous; otherwise it is treated as quantitative. With some options, phenotypes coded with other values can also be treated as dichotomous.
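The default rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not WISARD's implementation; the function name and the way missing values are marked (`None`) are assumptions made for the example.

```python
def is_dichotomous(values, missing=(None,)):
    """Return True if every observed phenotype value is 1 or 2.

    `missing` lists the markers that count as unobserved; using None
    for missing values is an assumption of this sketch.
    """
    observed = {v for v in values if v not in missing}
    # A phenotype with no observed values is not called dichotomous here.
    return len(observed) > 0 and observed <= {1, 2}


print(is_dichotomous([1, 2, 2, None, 1]))   # only 1 and 2 observed
print(is_dichotomous([0.7, 1, 2, 3.1]))     # other values observed
```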
Because of the high false-negative rate, rare variant association is tested on a set of rare variants simultaneously, and thus the sets of rare variants must be defined. WISARD supports four set file formats, which can be selected by using the --set option.
NOTE! |
--set option is mandatory for running gene-level analysis! |
In the type-I format, each line consists of two columns, the gene set name (e.g. SET_A) and the variant name, separated by whitespace (space or tab). The gene set name may be a gene name.
NOTE! |
Variants which belong to the same gene should be contiguously placed! |
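As an illustration of the type-I layout, the following Python sketch reads two-column lines into a dictionary and rejects files in which the variants of one set are not contiguous. The function name and the variant names are hypothetical, and the sketch only mirrors the description above; it is not how WISARD itself reads the file.

```python
def parse_type1(lines):
    """Parse 'SET_NAME VARIANT' lines into {set_name: [variants]}."""
    sets = {}
    previous = None
    for line in lines:
        fields = line.split()          # splits on spaces and tabs
        if not fields:
            continue                   # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        name, variant = fields
        # A set name that reappears after another set breaks contiguity.
        if name != previous and name in sets:
            raise ValueError(f"variants of {name} are not contiguous")
        sets.setdefault(name, []).append(variant)
        previous = name
    return sets


example = ["SET_A rs1", "SET_A\trs2", "SET_B rs3"]
print(parse_type1(example))
```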
The type-II file format is the same as the set definition used in PLINK. Each set must start with a set name, which cannot contain any spaces. The name is followed by a list of the variants in that gene set, and the keyword END marks the end of that particular set.
NOTE! |
Do not use END as a name of variant! |
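The type-II layout described above (set name, variant list, END) can be read with a small token-based parser. The Python sketch below is an illustration only; the function name and the sample set and variant names are made up for the example.

```python
def parse_type2(text):
    """Parse PLINK-style set definitions: NAME variant... END."""
    sets = {}
    current = None
    for token in text.split():
        if current is None:
            current = token            # first token of a block is the set name
            sets[current] = []
        elif token == "END":
            current = None             # END closes the current set
        else:
            sets[current].append(token)
    if current is not None:
        raise ValueError(f"set {current} is not terminated by END")
    return sets


example = """
SET_A
rs1 rs2
END
SET_B
rs3
rs4
END
"""
print(parse_type2(example))
```

Because END is treated as a keyword, a variant that is literally named END would close its set early, which is why such a name must be avoided.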
The type-III file format is similar to the type-I format, but all variants of a set are listed on a single line. The type-III file format is the same as the set definition used in EPACTS.
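A minimal Python sketch of reading the one-line-per-set layout follows. It assumes that the first whitespace-separated field is the set name and the remaining fields are variant names; the function name and the sample names are hypothetical.

```python
def parse_type3(lines):
    """Parse lines of the form 'SET_NAME variant1 variant2 ...'."""
    sets = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue                   # skip blank lines
        name, variants = fields[0], fields[1:]
        if name in sets:
            raise ValueError(f"set {name} is defined more than once")
        sets[name] = variants
    return sets


print(parse_type3(["SET_A rs1 rs2", "SET_B\trs3 rs4 rs5"]))
```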
The type-IV file format differs from the other three types. It defines a set of variants by assigning a specific region to each set. Sets may overlap, and a variant located in an overlapping region is assigned to every set that covers that region.
Many analysis toolsets, such as Rvtests, use an existing format for representing gene information.
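The region-based assignment can be illustrated as follows. The exact column layout of the type-IV file is not given here, so this Python sketch works on in-memory tuples; the tuple layout (set name, chromosome, start, end), the inclusive coordinates and all names are assumptions made for the example.

```python
def assign_by_region(regions, variants):
    """Assign each variant to every region that contains its position.

    regions:  list of (set_name, chromosome, start, end), ends inclusive
    variants: list of (variant_name, chromosome, position)
    """
    sets = {name: [] for name, _, _, _ in regions}
    for var_name, chrom, pos in variants:
        for name, r_chrom, start, end in regions:
            if chrom == r_chrom and start <= pos <= end:
                sets[name].append(var_name)   # overlaps give multiple hits
    return sets


regions = [("GENE_A", "1", 100, 500), ("GENE_B", "1", 400, 900)]
variants = [("rs1", "1", 150), ("rs2", "1", 450), ("rs3", "2", 450)]
print(assign_by_region(regions, variants))
```

In this example rs2 lies in the overlap of the two regions and is therefore listed in both sets, while rs3 is on another chromosome and belongs to neither.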
Rarer variants are often assumed to be functionally more important for phenotypes, and thus WISARD provides several ways to weight each variant by using MAF.
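To make MAF-based weighting concrete, the Python sketch below implements two schemes that are common in the rare variant literature: the inverse standard deviation weight in the spirit of Madsen and Browning, and the Beta(1, 25) density weight popularized by SKAT. Whether these match the schemes WISARD offers is an assumption; the function names are hypothetical.

```python
import math


def inverse_sd_weight(maf):
    """Weight 1 / sqrt(maf * (1 - maf)); larger for rarer variants."""
    return 1.0 / math.sqrt(maf * (1.0 - maf))


def beta_density_weight(maf, a=1.0, b=25.0):
    """Beta(a, b) density evaluated at the MAF (0 < maf < 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm
                    + (a - 1.0) * math.log(maf)
                    + (b - 1.0) * math.log(1.0 - maf))


for maf in (0.001, 0.01, 0.05):
    print(maf, inverse_sd_weight(maf), beta_density_weight(maf))
```

Both functions decrease as the MAF grows within the rare range, so rarer variants contribute more to a weighted statistic.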
User-defined weights can be loaded with the --weight option. This option is often used to weight each variant by its importance in terms of protein structure; tools such as SIFT or PolyPhen can provide such scores. In this file, each line should have two columns, where the first column is the variant name and the second column is the weight. The two columns should be separated by whitespace (space or tab).
NOTE! |
The user-defined weight file should contain weights for all variants within a dataset to be analyzed! |
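Before running an analysis it can be useful to check that the weight file really covers every variant. The following Python sketch parses the two-column layout and reports missing variants; the function name and the sample variant names are hypothetical, and the check is independent of WISARD itself.

```python
def load_weights(lines, dataset_variants):
    """Parse 'VARIANT WEIGHT' lines and require a weight for every variant."""
    weights = {}
    for line in lines:
        fields = line.split()          # splits on spaces and tabs
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"expected two columns, got: {line!r}")
        weights[fields[0]] = float(fields[1])
    missing = [v for v in dataset_variants if v not in weights]
    if missing:
        raise ValueError(f"no weight for variants: {missing}")
    return weights


print(load_weights(["rs1 0.9", "rs2\t0.15"], ["rs1", "rs2"]))
```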
The combined multivariate and collapsing (CMC) test was suggested by Li and Leal (Am J Hum Genet 2008) and can be applied to dichotomous phenotypes. This approach is useful for case-control designs. It assumes that the presence of rare alleles increases or decreases the disease risk and that the number of rare alleles is not important.
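The collapsing idea, in which only the presence of a rare allele matters, can be sketched in Python as below. This shows only the collapsing step followed by a plain 2x2 chi-square statistic; the full CMC method combines collapsed groups in a multivariate test, which is not reproduced here. Function names and the toy data are assumptions of the example.

```python
def collapse_rare_alleles(genotypes):
    """Return 1 for a subject carrying any rare allele in the set, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]


def chi_square_2x2(carrier, is_case):
    """Pearson chi-square statistic (no continuity correction)."""
    a = sum(1 for c, d in zip(carrier, is_case) if c and d)          # carrier cases
    b = sum(1 for c, d in zip(carrier, is_case) if not c and d)      # non-carrier cases
    c_ = sum(1 for c, d in zip(carrier, is_case) if c and not d)     # carrier controls
    d_ = sum(1 for c, d in zip(carrier, is_case) if not c and not d) # non-carrier controls
    n = a + b + c_ + d_
    denom = (a + b) * (c_ + d_) * (a + c_) * (b + d_)
    if denom == 0:
        return 0.0                     # a margin is empty: no information
    return n * (a * d_ - b * c_) ** 2 / denom


genotypes = [[0, 1, 0], [0, 0, 0], [2, 0, 0], [0, 0, 0]]   # rows are subjects
is_case = [True, False, True, False]
carrier = collapse_rare_alleles(genotypes)
print(carrier, chi_square_2x2(carrier, is_case))
```

Note that the third subject carries two rare alleles but is collapsed to the same indicator value as the first subject, which reflects the assumption that the number of rare alleles is not important.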
Example code
The weighted-sum test (WST) was suggested by Madsen and Browning (PLoS Genetics 2009) and can be applied to dichotomous phenotypes. WST tests whether weighted rare allele counts are associated with the phenotype, and it is efficient if the effects of all rare variants on the phenotype are in the same direction. Significance is calculated by comparing the weighted rare allele counts between cases and controls.
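A compact Python sketch of the weighted-sum idea is given below: variant weights are estimated from the controls, each subject receives a weighted rare allele score, and the rank sum of the cases is compared with its permutation distribution. This is a simplified illustration (an empirical one-sided p-value instead of the normal approximation of the original paper), not WISARD's code; all function names and the toy data are assumptions.

```python
import math
import random


def wst_scores(genotypes, is_case):
    """Weighted rare allele score per subject, weights from controls."""
    n = len(genotypes)
    n_controls = sum(1 for c in is_case if not c)
    scores = [0.0] * n
    for j in range(len(genotypes[0])):
        m_u = sum(row[j] for row, c in zip(genotypes, is_case) if not c)
        q = (m_u + 1) / (2 * n_controls + 2)      # smoothed control frequency
        w = math.sqrt(n * q * (1 - q))
        for i, row in enumerate(genotypes):
            scores[i] += row[j] / w
    return scores


def rank_sum_of_cases(scores, is_case):
    """Sum of the (mid)ranks of the case scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        midrank = (pos + end) / 2 + 1             # average rank within ties
        for k in range(pos, end + 1):
            ranks[order[k]] = midrank
        pos = end + 1
    return sum(r for r, c in zip(ranks, is_case) if c)


def wst_permutation_p(genotypes, is_case, n_perm=1000, seed=0):
    """One-sided empirical p-value; weights are re-estimated per permutation."""
    rng = random.Random(seed)
    observed = rank_sum_of_cases(wst_scores(genotypes, is_case), is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        stat = rank_sum_of_cases(wst_scores(genotypes, labels), labels)
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


genotypes = [[1, 0], [1, 1], [0, 1], [0, 0], [0, 0], [0, 0]]
is_case = [True, True, True, False, False, False]
print(wst_permutation_p(genotypes, is_case, n_perm=200))
```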
Example code
The kernel-based adaptive cluster (KBAC) test was suggested by Liu and Leal and can be applied to dichotomous phenotypes. The KBAC test categorizes sets of rare variants according to the pattern of rare alleles, and it may be efficient if there is a joint interaction between rare variants.
The collapsing-based test was suggested by Morris and Zeggini (Genet Epi 2010) and can be applied to dichotomous and quantitative phenotypes. It checks whether weighted rare allele counts are associated with the phenotype, and it is efficient if the effects of all rare variants on the phenotype are in the same direction.
The collapsing-based test incorporates the weighted rare allele count as a covariate; logistic regression is used for dichotomous phenotypes and linear regression for quantitative phenotypes.
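For a quantitative phenotype, the regression step can be sketched with ordinary least squares on a single covariate. The Python example below computes the weighted rare allele count per subject and the slope of the phenotype on that count, without other covariates; the function names and the toy data are assumptions, and the logistic case is not shown.

```python
import math


def weighted_burden(genotypes, weights):
    """Weighted rare allele count for each subject (rows of genotypes)."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def simple_linear_regression(x, y):
    """Return (slope, standard error) of y regressed on x with intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


genotypes = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 0]]
burden = weighted_burden(genotypes, [1.0, 1.0])
phenotype = [0.2, 1.1, 0.9, 2.3, 1.8]
print(simple_linear_regression(burden, phenotype))
```

A slope that is large relative to its standard error suggests that the weighted rare allele count is associated with the phenotype.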
Example code
The variable-threshold (VT) test was suggested by Price et al. (Am J Hum Genet 2010). In rare variant analysis the definition of a rare variant is unclear, and different MAF thresholds are used. For this reason, the VT test selects the MAF threshold that maximizes the significance. The VT method can be applied to dichotomous and quantitative phenotypes, and it may be useful if rarer variants have strong effects on the phenotype.
The final p-value of the VT test is calculated by permutation, and the number of iterations should be chosen according to the significance level. For instance, if you are interested in the 0.05 significance level, we suggest iterating at least 1/0.05 * 10 = 200 times.
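The rule of thumb above (ten times the reciprocal of the significance level) is easy to compute for other levels. The Python helper below is only a convenience sketch; its name and the `factor` parameter are hypothetical.

```python
import math


def min_permutations(alpha, factor=10):
    """Smallest iteration count of at least factor / alpha."""
    # round() guards against floating-point noise such as 200.00000000000003
    return math.ceil(round(factor / alpha, 9))


for alpha in (0.05, 0.01, 0.001):
    print(alpha, min_permutations(alpha))
```

For gene-wide testing with a multiple-testing-corrected level, the same rule implies a much larger number of iterations, since the count grows with the reciprocal of the level.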
Example codes
The SNP-set/sequence kernel association test (SKAT) was suggested by Wu et al. (Am J Hum Genet 2011) and can be applied to dichotomous and quantitative phenotypes. SKAT is efficient if rare variants with positive and negative effects on the phenotype are grouped into one set, and its results are usually similar to those of the C-alpha test.
The SKAT statistic approximately follows a mixture of chi-square distributions if the sample size is sufficiently large, and p-values for SKAT are calculated with the numerical algorithm of Liu et al. (Com Stat Data Anal 2009).
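The form of the SKAT statistic can be illustrated for a quantitative phenotype under an intercept-only null model: each variant contributes the squared score between its genotypes and the phenotype residuals, multiplied by its weight. The Python sketch below computes only this statistic, omits scaling constants and the mixture chi-square p-value, and is not WISARD's implementation; the function name and the toy data are assumptions.

```python
def skat_q(genotypes, phenotype, weights):
    """Q = sum_j w_j * (sum_i g_ij * (y_i - mean(y)))**2."""
    n = len(phenotype)
    mean_y = sum(phenotype) / n
    residuals = [y - mean_y for y in phenotype]   # null model: intercept only
    q = 0.0
    for j, w in enumerate(weights):
        score_j = sum(row[j] * r for row, r in zip(genotypes, residuals))
        q += w * score_j ** 2                     # squaring removes the sign
    return q


genotypes = [[1, 0], [0, 1], [0, 0], [1, 1]]
phenotype = [1.0, 2.0, 3.0, 6.0]
print(skat_q(genotypes, phenotype, [1.0, 1.0]))
```

Because each variant's score is squared before summation, variants with effects in opposite directions do not cancel each other, which is the reason SKAT remains efficient for sets with mixed effect directions.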
Example codes
The SNP-set/sequence kernel association test-optimal (SKAT-o) is an extension of SKAT and was suggested by Lee et al. (Biostatistics 2012). It is a mixture of a burden-type test and SKAT, and it can be applied to dichotomous and quantitative phenotypes. Its p-value comes from an integration of the mixture probability distribution function of that statistic. SKAT-o is robust to the direction of the effects of rare variants on the phenotype.
WISARD borrows the idea of optimal selection weights from the SKAT package of R. By default, the weights for optimal selection are $0$, $0.1^2$, $0.2^2$, $0.3^2$, $0.4^2$, $0.5^2$, $0.5$ and $1$. The weights can be altered in two ways: dividing the range from 0 to 1 into a given number of equal segments (--skatondiv) or assigning user-defined weights (--skatodivs).
NOTE! |
The SKAT-o method will not be performed on a gene that has a single variant. |
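The role of the optimal selection weights can be sketched as follows: for each weight rho, the burden-type and SKAT statistics are mixed as (1 - rho) * Q_SKAT + rho * Q_burden, and the most significant mixture is selected. The Python example below builds the default grid and an equally divided grid; the assumption that an equal division includes both endpoints 0 and 1, and all names, are choices made for this sketch rather than documented WISARD behavior.

```python
# Default grid as listed in the text above.
DEFAULT_RHOS = [0.0, 0.1 ** 2, 0.2 ** 2, 0.3 ** 2, 0.4 ** 2, 0.5 ** 2, 0.5, 1.0]


def equal_division(n_segments):
    """Grid from 0 to 1 split into n_segments equal parts (endpoints kept)."""
    return [i / n_segments for i in range(n_segments + 1)]


def q_rho(q_skat, q_burden, rho):
    """Mixture statistic: rho = 0 gives SKAT, rho = 1 gives the burden test."""
    return (1.0 - rho) * q_skat + rho * q_burden


print(equal_division(4))
print([q_rho(5.0, 9.0, rho) for rho in DEFAULT_RHOS])
```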
Q-test was suggested by Lee et al. and combines the collapsing-based test and SKAT. Q-test can be applied to quantitative phenotypes. Q-test has properties similar to those of SKAT-o, but it is a Wald-type test, whereas SKAT-o is a score-type test. Q-test may be more efficient if the number of rare variants is not large.
Example codes
WISARD can conduct gene-level analysis with longitudinal data, and this analysis is available only for SKAT and SKAT-o. In longitudinal data, repeated measurements are correlated. The correlation matrix can be estimated with statistical software such as R or SAS under the assumption that genotypes have no effect on the phenotype. WISARD then loads the correlation matrix and calculates the score-type statistic for the gene-level test.
Example codes
NOTE! |
Correlation matrix should be symmetric! |
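A correlation matrix exported from other software can be checked for symmetry before it is loaded. The Python sketch below works on a matrix held as a list of rows; the function name and the tolerance are assumptions of the example.

```python
def is_symmetric(matrix, tol=1e-8):
    """True if the matrix is square and matrix[i][j] equals matrix[j][i]."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False                   # not square
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))


good = [[1.0, 0.3, 0.1], [0.3, 1.0, 0.2], [0.1, 0.2, 1.0]]
bad = [[1.0, 0.3], [0.4, 1.0]]
print(is_symmetric(good), is_symmetric(bad))
```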