WISARD[wɪzərd] Workbench for Integrated Superfast Association study with Related Data
This section describes the relationship matrices that WISARD can estimate and use.
If we denote the number of samples by n, the relationship matrix is an n$\small{\times}$n matrix, and each element indicates the genetic similarity between a pair of individuals. In genetic analysis, the relationship matrix is used to parameterize the phenotypic correlation; if it is misspecified, type-1 errors, type-2 errors, or both cannot be appropriately controlled. Applying different relationship matrices to the same statistic can change its meaning, so the matrix should be selected carefully. The following statistical analyses can be conducted after the relationship matrix is chosen:
The relationship matrix can be estimated from the pedigree structure or from large-scale genotypes, and WISARD provides several functions to estimate the various relationship matrices. The relationship matrices supported by WISARD, $\small{\Phi=(\phi_{ij})_{n\times n}}$, are
If we let $\small{\pi_{ij}}$ be a kinship coefficient between individual $\small{i}$ and $\small{j}$,
WISARD can calculate the kinship coefficient matrix by using the --kinship option.
Let $\small{x_{ik}}$ be the genotype (0/1/2) of individual $\small{i}$ at locus $\small{k}$. If we assume that there are $\small{m}$ markers and that the minor allele frequency (MAF) of marker $\small{k}$ is $\small{p_k}$,
WISARD calculates the genetic relationship matrix by default.
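The equation for this matrix is not shown here, so the following is only a minimal sketch. It assumes the commonly used standardized form, in which each element is the average over markers of $\small{(x_{ik}-2p_k)(x_{jk}-2p_k)/\{2p_k(1-p_k)\}}$. The function name, the sample-based frequency estimate, and the skipping of monomorphic markers are illustrative assumptions, not WISARD's documented behavior.

```python
def genetic_relationship_matrix(genotypes):
    """Sketch of a standardized genetic relationship matrix.

    genotypes: list of n lists, each holding m allele counts (0/1/2),
    with no missing values (an assumption of this sketch).
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Frequency of the counted allele at each marker, estimated from the sample.
    p = [sum(g[k] for g in genotypes) / (2.0 * n) for k in range(m)]
    # Monomorphic markers have zero variance and are skipped (assumption).
    used = [k for k in range(m) if 0.0 < p[k] < 1.0]
    if not used:
        raise ValueError("no polymorphic markers available")
    grm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            total = 0.0
            for k in used:
                total += ((genotypes[i][k] - 2.0 * p[k])
                          * (genotypes[j][k] - 2.0 * p[k])
                          / (2.0 * p[k] * (1.0 - p[k])))
            # The matrix is symmetric, so fill both triangles at once.
            grm[i][j] = grm[j][i] = total / len(used)
    return grm
```

For three individuals with genotypes `[0, 1, 2]`, `[2, 1, 0]` and `[1, 1, 1]`, the second marker is monomorphic and is skipped, and the first two individuals receive a negative off-diagonal element because they carry opposite homozygous genotypes.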
If we let $\small{r_{ijk}}$ be the number of alleles shared between individuals $\small{i}$ and $\small{j}$ at locus $\small{k}$,
WISARD can calculate the IBS matrix by using the --ibs option.
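As a rough illustration of the IBS definition (shared alleles divided by twice the number of variants), the sketch below assumes that the number of shared alleles at a locus can be taken as $\small{2-|x_{ik}-x_{jk}|}$ for unphased 0/1/2 genotypes. The function name and the absence of missing-genotype handling are assumptions made for this example.

```python
def ibs_matrix(genotypes):
    """Sketch of an IBS matrix from 0/1/2 genotypes without missing values."""
    n = len(genotypes)
    m = len(genotypes[0])
    ibs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Assumed allele-sharing count per locus: 2 - |x_ik - x_jk|.
            shared = sum(2 - abs(a - b)
                         for a, b in zip(genotypes[i], genotypes[j]))
            # Divide by 2 x (number of variants), as in the definition.
            ibs[i][j] = ibs[j][i] = shared / (2.0 * m)
    return ibs
```

Under this convention every diagonal element equals 1, and two individuals with opposite homozygous genotypes at every locus receive 0.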
If the absolute difference between an element of the kinship coefficient matrix and the corresponding element of the genetic relationship matrix is less than some threshold, the element of the kinship coefficient matrix is used; otherwise, the element of the genetic relationship matrix is used. WISARD can calculate this hybrid relationship matrix by using the --hybrid option.
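The element-wise selection rule can be sketched as follows. This is an illustration only: the function name is hypothetical, and both input matrices are assumed to be on the same scale and in the same sample order. The default threshold of 0.01 is the value this document gives for --hybrid.

```python
def hybrid_matrix(kinship, grm, threshold=0.01):
    """Sketch: keep the kinship element when it is close to the GRM element."""
    n = len(kinship)
    hybrid = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if abs(kinship[i][j] - grm[i][j]) < threshold:
                # The two estimates agree: trust the pedigree-based value.
                hybrid[i][j] = kinship[i][j]
            else:
                # They disagree: fall back to the genotype-based value.
                hybrid[i][j] = grm[i][j]
    return hybrid
```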
A nonparametric correlation structure. WISARD can calculate this matrix by using either the --ktau or the --empktau option.
If a pre-computed sample relationship matrix is available from WISARD or from another toolset, the heavy computational burden of estimating the relationship can be avoided by reusing it. WISARD provides the --cor option for this purpose, and it supports various formats.
WISARD provides several useful options for calculating the relationship matrix.
NOTE! In the above example, each element in the output file is an element of the kinship coefficient matrix.
NOTE! This option is useful for the genetic relationship matrix and the IBS matrix.
NOTE! This option is meaningless for a kinship coefficient matrix.
The validity of a statistical analysis depends on the choice of relationship matrix, so the matrix should be selected carefully.
In particular, the choice of relationship matrix depends on the presence of population substructure.
Under the absence of population substructure,
Under the presence of population substructure,
WISARD can generate the relationship matrix in two different formats: matrix format and pairwise-element format.
The relationship matrix is expressed in matrix form, delimited by arbitrary whitespace (tabs or spaces). WISARD accepts files with and without a header. For a file without a header, the numbers of rows and columns must both equal the sample size; for a file with a header, this requirement does not apply. In the latter case, the headers must be the IIDs of the individuals, and WISARD constructs the relationship matrix by matching individual IIDs. If some individuals are missing from the header, their elements in the relationship matrix are assumed to be 0.
NOTE! By default, WISARD writes the relationship matrix output file in matrix format with a header.
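The IID-matching rule for a matrix file with a header can be illustrated with the sketch below. It assumes one particular layout that this document does not spell out: the first line lists the IIDs, and the following rows appear in the same order as the header, without row labels. The function name and this layout are assumptions, not a description of WISARD's parser.

```python
def read_matrix_with_header(text, sample_iids):
    """Sketch: build an n x n matrix for sample_iids from a headed matrix file.

    text: file contents; the first line holds IIDs, and the remaining rows are
    assumed to follow the header order (a hypothetical layout).
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    if len(body) != len(header):
        raise ValueError("matrix must have one row per header entry")
    column = {iid: idx for idx, iid in enumerate(header)}
    n = len(sample_iids)
    # Individuals absent from the header keep the value 0, as described above.
    matrix = [[0.0] * n for _ in range(n)]
    for a, iid_a in enumerate(sample_iids):
        for b, iid_b in enumerate(sample_iids):
            if iid_a in column and iid_b in column:
                matrix[a][b] = float(body[column[iid_a]][column[iid_b]])
    return matrix
```

Because elements are matched by IID, the file may list individuals in a different order from the dataset, or cover only a subset of them.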
This format is more flexible than the matrix format and can be produced by WISARD with the --corpair option. In this format, each line describes a single element of the relationship matrix: the IIDs of a pair of individuals and their relationship coefficient. Note that WISARD does not allow individuals with the same IID but different FIDs; all individuals must have distinct IIDs.
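A minimal sketch of reading the pairwise-element format is given below. It assumes three whitespace-separated columns per line (IID, IID, coefficient), with no header line. The function name, the column order, and the choice to skip pairs that involve unknown individuals are assumptions made for illustration.

```python
def read_pairwise(text, sample_iids):
    """Sketch: build a symmetric matrix from 'IID1 IID2 value' lines."""
    index = {iid: k for k, iid in enumerate(sample_iids)}
    if len(index) != len(sample_iids):
        # All individuals must have distinct IIDs.
        raise ValueError("duplicated IID in sample list")
    n = len(sample_iids)
    matrix = [[0.0] * n for _ in range(n)]
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError("expected 3 columns: " + line)
        iid_a, iid_b, value = fields
        if iid_a not in index or iid_b not in index:
            continue  # pair involves an individual outside the dataset
        i, j = index[iid_a], index[iid_b]
        # One line defines the element in both directions.
        matrix[i][j] = matrix[j][i] = float(value)
    return matrix
```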
In recent versions of GCTA, a Genetic Relationship Matrix (GRM) is exported in a binary format consisting of three files. WISARD can both read and export this format. With the --cor option, WISARD recognizes the format automatically. To export sample relatedness in this format, use the --corgrm option. To read the produced GRM files from R, please refer to the method suggested by the GCTA authors.
NOTE! This option cannot be used together with --corpair!
EPACTS also supports its own genetic relationship format, with the file extension kin. WISARD can both read and export this format. With the --cor option, WISARD recognizes the format automatically. To export sample relatedness in this format, use the --corepacts option.
NOTE! This option cannot be used together with --corpair!
In the absence of population substructure, the kinship coefficient matrix should be used for genetic association analysis of family-based samples. For independent samples, the kinship coefficient matrix becomes an identity matrix, so users do not need to specify it: the identity matrix is used as the relationship matrix by default in most genetic analyses.
If the pedigree structure is misspecified or population substructure exists, family-based association analysis based on the misspecified kinship coefficient matrix produces invalid results; in such cases, the IBS matrix or the genetic relationship matrix is a better choice.
WISARD assumes that individuals with different FIDs are independent, and it can calculate a kinship coefficient matrix as long as FID, MID and PID are correctly specified. If some individuals are inbred, calculating the kinship coefficients is computationally intensive.
Example codes
NOTE! Each element in the output file is $\small{2\phi_{ij}}$, where $\small{\phi_{ij}}$ is the kinship coefficient between a pair of individuals.
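To make the pedigree-based calculation concrete, here is a minimal sketch of the standard recursive kinship algorithm for a single family. It is not WISARD's implementation. The dictionary input, the function name, and the use of `None` for unknown parents are assumptions. Following the note above, the returned elements are scaled to $\small{2\phi_{ij}}$.

```python
from functools import lru_cache


def kinship_matrix(pedigree):
    """Sketch: 2 x kinship coefficients for one family.

    pedigree: dict mapping IID -> (father IID or None, mother IID or None).
    Every non-None parent must itself be a key of the dict.
    """
    ids = list(pedigree)

    @lru_cache(maxsize=None)
    def depth(i):
        # Number of generations between i and the most distant founder above it.
        parents = [p for p in pedigree[i] if p is not None]
        return 0 if not parents else 1 + max(depth(p) for p in parents)

    @lru_cache(maxsize=None)
    def phi(i, j):
        if i is None or j is None:
            return 0.0  # unknown parents are treated as unrelated
        if i == j:
            father, mother = pedigree[i]
            # Self-kinship grows with the kinship between the two parents.
            return 0.5 * (1.0 + phi(father, mother))
        # Expand the deeper individual, which cannot be an ancestor of the other.
        if depth(i) < depth(j):
            i, j = j, i
        father, mother = pedigree[i]
        return 0.5 * (phi(father, j) + phi(mother, j))

    return ids, [[2.0 * phi(a, b) for b in ids] for a in ids]
```

For a nuclear family with two founders and two full siblings, this gives 1 on the diagonal, 0 between the parents, and 0.5 for every parent-offspring and sibling pair. Inbreeding shows up as diagonal values above 1, which hints at why inbred pedigrees need more recursion.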
WISARD can calculate the kinship coefficient matrix for the X chromosome, which is required for some genetic association analyses of X-chromosome variants. The kinship coefficient matrix for the X chromosome depends on gender, so the gender of each individual must be correctly specified.
In the presence of population substructure, the kinship coefficient matrix does not reflect the genetic similarity between individuals, and incorporating it into the statistical analysis leads to invalid results.
The hybrid relationship matrix can be considered a mixture of the kinship coefficient matrix and the genetic relationship matrix. If the absolute difference between an element of the kinship coefficient matrix and the corresponding element of the genetic relationship matrix is less than some threshold, the element of the kinship coefficient matrix is used; otherwise, the element of the genetic relationship matrix is used. The threshold is related to the number of individuals and variants, but an algorithm for determining a probability-based threshold has not been proposed yet. The default threshold is 0.01, and a different value can be given as the argument of the --hybrid option.
WISARD can compute and apply the hybrid relationship matrix via the --hybrid option as follows:
The IBS (Identity By State) matrix is defined as the number of alleles shared between two individuals divided by 2$\small{\times}$the total number of variants. Because this measure is efficient to calculate and easy to interpret, many analysis toolkits such as EMMAX and PLINK can calculate it.
WISARD can calculate the IBS matrix by using the --ibs option.
WISARD can compute a nonparametric relationship matrix based on Kendall's tau correlation coefficient. The genotypes of each individual are standardized, and their product is formed as follows:
and the Kendall's tau for each pair of individuals is taken as the corresponding element of the relationship matrix. While this relationship matrix may be useful in some situations, further validation is required, and this option is generally not recommended.
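For orientation, the sketch below computes a Kendall's tau between the genotype vectors of two individuals across variants. It uses the tau-b variant, which corrects for ties (frequent with 0/1/2 genotypes). Which variant WISARD uses, and how it standardizes genotypes beforehand, is not stated here, so both the variant and the function name are assumptions.

```python
import math


def kendall_tau_b(x, y):
    """Sketch: Kendall's tau-b between two equal-length genotype vectors."""
    m = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for a in range(m):
        for b in range(a + 1, m):
            dx = x[a] - x[b]
            dy = y[a] - y[b]
            if dx == 0:
                ties_x += 1  # pair tied in the first individual
            if dy == 0:
                ties_y += 1  # pair tied in the second individual
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    pairs = m * (m - 1) // 2
    denom = math.sqrt((pairs - ties_x) * (pairs - ties_y))
    # Return 0 when one vector is constant (a choice made for this sketch).
    return (concordant - discordant) / denom if denom > 0 else 0.0
```

The double loop over variant pairs is quadratic in the number of variants for every pair of individuals, so a direct implementation like this quickly becomes expensive for large marker sets.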
NOTE! --empktau is computationally very intensive and is NOT recommended if the number of variants is very large, for instance, 2000K!
WISARD can apply a user-defined correlation structure, supplied as a file, to an analysis.
Note that this `correlation structure` is defined across samples, not across markers.
This is done by specifying the --cor option when running WISARD, as shown below:
In the above example, `my.cor`, the user-defined correlation file, should be in one of the formats below.
In the matrix form, the correlation file is given as a matrix delimited by arbitrary whitespace.
WISARD accepts both (1) a simple matrix form with no header and (2) a matrix form with a header.
In the former case, the dimension of the matrix must be equal to the sample size of the final dataset.
In addition, the row/column order is assumed to be the same as the sample order of the final dataset.
The latter case does not have these limitations, because WISARD automatically matches pairs and reorders the matrix.
This means that each header entry must be the IID of the corresponding sample.
NOTE! By default, WISARD produces the sample relatedness matrix in this form, with a header.
Unlike the matrix form, the paired form allows more flexibility in the given correlation file because of its structure.
As seen in the example below, each line gives one correlation coefficient between two samples, and the same value applies in both directions.
If your input dataset is filtered or has missing samples, unmatched or filtered entries are automatically removed from further analysis.
NOTE! If the elements of the given correlation matrix were derived jointly from multiple samples, the assumptions behind the matrix may no longer hold after filtering!
NOTE! This type can be produced by WISARD with --corpair.
GCTA, a toolset similar to WISARD, also provides
comprehensive functions for association tests on related genetic data.
It can compute the relationship across samples from genotypes, called the genetic relationship matrix (GRM),
and the resulting GRM is stored in a binary format.
WISARD accepts this GRM format and uses it in subsequent analyses via the --cor option.
Note that the last extension of the GRM output files should be omitted when providing them as input to WISARD.
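For readers who want to inspect such files outside WISARD or GCTA, the sketch below reads a binary GRM in Python. It assumes the layout described in the GCTA documentation: a `.grm.id` text file with FID and IID per line, and a `.grm.bin` file holding the lower triangle (including the diagonal) as 4-byte floats, row by row. The little-endian byte order, the function name, and the meaning of `prefix` (the path without `.grm.id` or `.grm.bin`) are assumptions of this example and may differ from what WISARD expects for --cor.

```python
import struct


def read_gcta_grm(prefix):
    """Sketch: read <prefix>.grm.id and <prefix>.grm.bin into a full matrix."""
    with open(prefix + ".grm.id") as handle:
        ids = [tuple(line.split()[:2]) for line in handle if line.strip()]
    n = len(ids)
    count = n * (n + 1) // 2  # lower triangle including the diagonal
    with open(prefix + ".grm.bin", "rb") as handle:
        raw = handle.read()
    if len(raw) != 4 * count:
        raise ValueError("unexpected .grm.bin size for %d samples" % n)
    # Assumed encoding: little-endian 4-byte floats.
    values = struct.unpack("<%df" % count, raw)
    grm = [[0.0] * n for _ in range(n)]
    pos = 0
    for i in range(n):
        for j in range(i + 1):
            # Mirror each lower-triangle value into the upper triangle.
            grm[i][j] = grm[j][i] = values[pos]
            pos += 1
    return ids, grm
```

The third file of the set, which in GCTA stores the number of variants used for each pair, is not needed to reconstruct the matrix and is ignored here.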