WISARD [wɪzərd] Workbench for Integrated Superfast Association study with Related Data
This section describes statistics for family-based samples.
The most efficient statistic depends on the ascertainment condition, the type of phenotype (dichotomous or quantitative), the presence of covariates, and the presence or absence of population stratification.
Summary of available statistics:
Test name | Speed | Phenotype | Covariates | Description |
---|---|---|---|---|
Transmission Disequilibrium Test (TDT) | (fast) | Dichotomous | Can't adjust | It is always robust against population stratification, but because parental genotypes are not used, it is often statistically inefficient. |
Sibship TDT (SDT) | (fast) | Dichotomous | Can't adjust | The original TDT is inapplicable if parental genotypes are unknown. SDT overcomes this by using variant data from unaffected sibs (Spielman and Ewens Am J Hum Genet 1998). |
MQLS | (slow) | Dichotomous | Can't adjust | An extended Cochran-Armitage test for family-based samples (Thornton and McPeek Am J Hum Genet 2007). It is for dichotomous phenotypes and is usually efficient for ascertained families. Covariate effects cannot be adjusted directly, but a modification enables adjustment. |
Family QLS (FQLS) | (moderate) | Dichotomous/continuous | Can't adjust | An extended MQLS that is more efficient than MQLS when each family is ascertained through probands. Covariate effects cannot be adjusted directly, but a modification enables adjustment. |
EMMAX/GEMMA | (fast) | Continuous | Adjust | Wald tests for a linear mixed model whose variance-covariance matrix is parameterized by the kinship coefficient matrix. They are for quantitative phenotypes and are usually efficient for randomly selected samples. |
Generalized score test | (fast) | Continuous | Adjust | Generalized score test for the linear mixed model used by EMMAX/GEMMA. It can be applied to quantitative phenotypes and is usually efficient for randomly selected samples. |
MFQLS | (fast) | Continuous & multivariate | Adjust | An extended MQLS for joint analysis of multiple phenotypes and multiple variants. |
TDT (Spielman et al Am J Hum Genet 1993) is an association test for family-based samples. It tests whether heterozygous parents transmit one allele to affected offspring more often than the other.
TDT is always robust against population stratification. For large-scale genetic data, several methods such as genomic control and EIGENSTRAT are also robust against population stratification. However, if the number of variants is not sufficiently large, those methods are no longer robust against population stratification, whereas TDT still is. TDT does not utilize founders' genotypes, so it is often statistically inefficient. For this reason, TDT is often used for candidate gene analysis.
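The classical TDT statistic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not WISARD's implementation: the function name `tdt_statistic` and the counts `b` and `c` (heterozygous parents transmitting one allele or the other to an affected child) are hypothetical names chosen here.

```python
import math


def tdt_statistic(b, c):
    """Classical McNemar-type TDT statistic and p-value.

    b: number of heterozygous parents transmitting allele 1 to an affected child
    c: number of heterozygous parents transmitting allele 2 to an affected child
    (both names are illustrative, not WISARD identifiers)
    """
    if b < 0 or c < 0:
        raise ValueError("transmission counts must be non-negative")
    if b + c == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    stat = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom:
    # P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


stat, p_value = tdt_statistic(60, 40)  # stat is 4.0, p_value is about 0.0455
```

Only heterozygous parents contribute to `b` and `c`, which is why the test discards much of the sample and can be inefficient.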
WISARD can perform TDT with the --tdt option.
tdt.res is a result of the Transmission Disequilibrium Test (TSV).
Column | Format | Modifier | Description |
---|---|---|---|
CHR | integer | NONE | Chromosome of the variant |
VARIANT | integer | NONE | Variant name |
POS | integer | NONE | Position of the variant |
ALT | integer | NONE | Alternative allele |
ANNOT | integer | NONE | Annotation of the variant |
PHENO | string | --sampvar,--pname | Tested phenotype |
STAT | real | NONE | Statistic from TDT |
P_TDT | real | NONE | p-value from TDT |
TDT can be performed only when the genotypes of both the parents and the child are available. However, parental genotypes are sometimes unavailable. SDT overcomes this by using variant data from unaffected sibs (Spielman and Ewens Am J Hum Genet 1998).
Its general statistical properties are similar to those of TDT, and WISARD can perform SDT with the --sdt option.
sdt.res is a result of the Sibship Disequilibrium Test (TSV).
Column | Format | Modifier | Description |
---|---|---|---|
CHR | real | NONE | Chromosome of the variant |
VARIANT | real | NONE | Variant name |
POS | real | NONE | Position of the variant |
ALT | real | NONE | Alternative allele |
ANNOT | real | NONE | Annotation of the variant |
PHENO | string | --sampvar,--pname | Tested phenotype |
STAT | real | NONE | Statistic from SDT |
P_TDT | real | NONE | p-value from SDT |
MQLS is an extended Cochran-Armitage test proposed for family-based samples (Thornton and McPeek Am J Hum Genet 2007). It is a score test based on quasi-likelihood. MQLS is the same in the presence and in the absence of population stratification except for the choice of relationship matrix: in the presence of population stratification the genetic relationship matrix should be incorporated, and in the absence of population stratification the kinship coefficient matrix should be used.
For MQLS, affected and unaffected individuals are coded as 1 and 0 respectively, and missing phenotypes are coded as the prevalence. This scheme reflects that an individual with a missing phenotype may be affected with probability equal to the prevalence. Individuals with missing genotypes are either excluded from the analysis, or their missing genotypes are replaced with 2$\times$MAF.
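The phenotype coding described above can be sketched as follows. This is only an illustration of the coding scheme, not WISARD code; the function name `code_phenotypes` and the use of `None` for a missing phenotype are assumptions made for this example.

```python
def code_phenotypes(status, prevalence):
    """Code phenotypes for an MQLS-style analysis.

    status: list with 1 (affected), 0 (unaffected) or None (missing);
            the None convention is an assumption of this sketch.
    prevalence: population prevalence of the trait, between 0 and 1.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie in [0, 1]")
    coded = []
    for s in status:
        if s is None:
            # Missing phenotype: affected with probability equal to prevalence.
            coded.append(float(prevalence))
        elif s in (0, 1):
            coded.append(float(s))
        else:
            raise ValueError("status must be 1, 0 or None")
    return coded


coded = code_phenotypes([1, 0, None], prevalence=0.1)  # [1.0, 0.0, 0.1]
```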
Because MQLS is an extension of the Cochran-Armitage test to family-based designs, it is efficient for case-control designs. However, if some covariates need to be adjusted or the samples are randomly selected, some modification is necessary. In such cases, the residuals from a linear mixed model can be used as the response even though the phenotypes are dichotomous (Won and Lange Stat in Med 2013).
Example codes
The family-based quasi-likelihood score (FQLS) test is an extended MQLS and is more efficient than MQLS if each family is ascertained through probands. When each family is ascertained through probands, the ascertainment bias depends on each member's relationship to the probands and is therefore heterogeneous. This heterogeneity is substantial for large families, and FQLS adjusts for it with a liability model. If heritability is large, the power improvement is substantial.
FQLS can be applied to both dichotomous and continuous phenotypes, and modification is necessary if some covariate effects need to be adjusted or the phenotypes are quantitative.
WISARD performs FQLS when the --fqls option is given.
By default, --fqls requires the additional options below and imputes missing genotypes as 2*maf, where maf is the minor allele frequency of the given variant among founders.
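The default imputation rule can be sketched as below. This is a minimal illustration, not WISARD's code: it assumes genotypes are stored as counts of the minor allele (0, 1 or 2) with `None` for a missing genotype, and the names `founder_maf` and `impute_missing` are hypothetical.

```python
def founder_maf(genotypes, is_founder):
    """Minor allele frequency estimated from founders with observed genotypes.

    genotypes: minor-allele counts (0, 1, 2) or None if missing (assumed coding).
    is_founder: booleans of the same length as genotypes.
    """
    observed = [g for g, f in zip(genotypes, is_founder) if f and g is not None]
    if not observed:
        raise ValueError("no founder with an observed genotype")
    # Each founder carries two alleles at the variant.
    return sum(observed) / (2 * len(observed))


def impute_missing(genotypes, is_founder):
    """Replace each missing genotype with 2 * maf, the expected allele count."""
    fill = 2.0 * founder_maf(genotypes, is_founder)
    return [fill if g is None else float(g) for g in genotypes]


imputed = impute_missing([0, 1, None, 2, None],
                         [True, True, True, False, False])
# founders contribute 0 and 1, so maf is 0.25 and each missing value becomes 0.5
```

Estimating the frequency from founders only avoids counting the same founder allele several times through related offspring.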
By default, FQLS computes the offset from the following factors: pedigree structure, proband status, heritability and prevalence. Among those factors, prevalence should be omitted when the phenotype is not dichotomous.
WISARD provides two ways to estimate the offset: assume that each family member is a potential proband, or use exact information on who the probands are. If proband status is available for each individual, it can be incorporated into the FQLS analysis. To do so, a sample variable file (given with --sampvar) is required by default. As introduced in the sample variable section, a number of column names are 'reserved' for other uses, and proband status is one of them. By default, if the provided sample variable file has a column named 'PROBAND', WISARD automatically detects it and uses it as the proband information for each sample. To assign proband status correctly, the conditions below must be satisfied.
For family-based association analysis, the kinship coefficient matrix should be used as the relationship matrix with the --kinship option. If population stratification exists, the genetic relationship matrix must be incorporated instead.
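To make the kinship coefficient matrix concrete, the sketch below computes it from a pedigree with the standard recursive rule. It is an illustration only and says nothing about how WISARD computes the matrix internally; the function name `kinship_matrix` and the `(id, father, mother)` input layout are assumptions of this example.

```python
def kinship_matrix(pedigree):
    """Kinship coefficients from a pedigree.

    pedigree: list of (individual_id, father_id, mother_id) tuples, with
    parents listed before their children and None for an unknown parent
    (an input layout assumed for this sketch).
    Returns a list of lists phi, indexed in the order of `pedigree`.
    """
    n = len(pedigree)
    index = {}
    phi = [[0.0] * n for _ in range(n)]
    for i, (iid, father, mother) in enumerate(pedigree):
        for parent in (father, mother):
            if parent is not None and parent not in index:
                raise ValueError("parent %r must be listed before %r" % (parent, iid))
        f = index[father] if father is not None else None
        m = index[mother] if mother is not None else None
        # Kinship with every earlier individual j is the average of the
        # parents' kinship with j (an unknown parent contributes 0).
        for j in range(i):
            value = 0.0
            if f is not None:
                value += 0.5 * phi[f][j]
            if m is not None:
                value += 0.5 * phi[m][j]
            phi[i][j] = phi[j][i] = value
        # Self-kinship is 1/2 plus half the kinship between the parents.
        parents_phi = phi[f][m] if f is not None and m is not None else 0.0
        phi[i][i] = 0.5 * (1.0 + parents_phi)
        index[iid] = i
    return phi


# Nuclear family: two unrelated founders and two full sibs.
phi = kinship_matrix([("F", None, None), ("M", None, None),
                      ("C1", "F", "M"), ("C2", "F", "M")])
# parent-child and sib-sib kinship are both 0.25; self-kinship is 0.5
```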
NOTE!
--retestthr and --availonly cannot be used simultaneously!
For family-based designs, GEMMA can be used both in the presence and in the absence of population stratification. In the presence of population stratification, the statistics and code are exactly the same as for GEMMA with population-based samples in the presence of population stratification (see the association analysis under the presence of population stratification for example code). For family-based designs, the only difference in statistics and WISARD code between the presence and absence of population stratification is the choice of relationship matrix: the kinship coefficient matrix should be used instead of the IBS matrix in the absence of population stratification.
EMMAX and GEMMA are known to be the most efficient approaches for quantitative phenotypes (Kang et al Nat Genet 2010), and if the polygenic effect is substantially large (e.g. height), the improvement they offer can be substantial. If it is not clear whether the polygenic effect is large, the heritability of the quantitative phenotype can be used as a guide instead. Even though parameter estimation in a linear mixed model is usually computationally intensive, the computationally efficient algorithms proposed by both approaches enable genome-wide association analysis in a short time. GEMMA provides a computationally much more efficient strategy, and by default WISARD calculates GEMMA.
Example codes
For family-based designs, the generalized score test can be used both in the presence and in the absence of population stratification. In the presence of population stratification, the statistics and code are exactly the same as for population-based samples in the presence of population stratification (see the association analysis under the presence of population stratification for example code). For family-based designs, the only difference in statistics and WISARD code between the presence and absence of population stratification is the choice of relationship matrix: the kinship coefficient matrix should be used instead of the IBS matrix in the absence of population stratification.
WISARD supports the generalized score test for linear mixed models. The linear mixed model for EMMAX/GEMMA and for the generalized score test is the same. EMMAX/GEMMA are Wald tests, and Wald tests are known to be statistically more efficient than score tests. However, EMMAX/GEMMA are more sensitive to departures from normality than the generalized score test, so if non-normality is expected, the generalized score test may be a reasonable choice.
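As a rough sketch of what the two kinds of test compare, a linear mixed model of this type can be written as below. The notation ($y$ for the phenotype vector, $X$ for covariates, $g$ for the genotype vector of the tested variant, $\Phi$ for the relationship matrix) is assumed here for illustration, and the score statistic shown is the textbook form, which may differ in detail from WISARD's generalized score test.

$$
y = X\beta + g\gamma + u + \varepsilon, \qquad u \sim N(0, \sigma_g^2 \Phi), \qquad \varepsilon \sim N(0, \sigma_e^2 I), \qquad H_0 : \gamma = 0
$$

$$
T_{\mathrm{Wald}} = \frac{\hat{\gamma}^2}{\widehat{\mathrm{Var}}(\hat{\gamma})}, \qquad
T_{\mathrm{score}} = \frac{(g^\top \hat{P} y)^2}{g^\top \hat{P} g}, \qquad
P = V^{-1} - V^{-1} X (X^\top V^{-1} X)^{-1} X^\top V^{-1}, \qquad V = \sigma_g^2 \Phi + \sigma_e^2 I
$$

In this form the Wald statistic uses estimates from the model that includes the variant, while the score statistic uses variance components estimated only under $H_0$. Both are compared with a chi-square distribution with one degree of freedom.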
Example codes
NOTE!
If the dataset is not family-based, the generalized score test will not be performed!
For family-based designs, MFQLS can be used both in the presence and in the absence of population stratification. In the presence of population stratification, the statistics and code are exactly the same as for population-based samples in the presence of population stratification (see the association analysis under the presence of population stratification for example code). For family-based designs, the only difference in statistics and WISARD code between the presence and absence of population stratification is the choice of relationship matrix: the kinship coefficient matrix should be used instead of the IBS matrix in the absence of population stratification.
MFQLS is an extended MQLS for multiple phenotypes and variants. MQLS can be applied to datasets with multiple phenotypes or multiple variants, such as a gene set, and WISARD supports applying MQLS to such analyses.
NOTE!
When WISARD is executed with --fqls and --mqls concurrently, multiple phenotypes/variants cannot be used!
Example codes