WISARD[wɪzərd]
Workbench for Integrated Superfast Association study with Related Data
 

Relationship Matrix

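This section describes the relationship matrices that WISARD supports, how to compute and export them, and how to choose among them.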

Relationship matrices

If we denote the number of samples by $\small{n}$, the relationship matrix is an $\small{n\times n}$ matrix, and each element of the relationship matrix indicates the genetic similarity between a pair of individuals. In genetic analysis, the relationship matrix is used to parameterize the phenotypic correlation, and if it is misspecified, the type-1 error, the type-2 error, or both cannot be appropriately controlled. The same statistic can have a different meaning depending on which relationship matrix is incorporated, so the matrix should be selected carefully. The following statistical analyses can be conducted after the relationship matrix is chosen:

  • Heritability estimation
  • Estimation of phenotypic variance attributable to the observed genotypes
  • Association analysis under population stratification
  • Family-based association analysis

A relationship matrix can be estimated from the pedigree structure or from large-scale genotypes, and WISARD provides several functions to estimate the various relationship matrices. The relationship matrices supported by WISARD, $\small{\Phi=(\phi_{ij})_{n\times n}}$, are:

Kinship coefficient matrix

If we let $\small{\pi_{ij}}$ be the kinship coefficient between individuals $\small{i}$ and $\small{j}$,

$\phi_{ij}=2\pi_{ij}$

WISARD can calculate the kinship coefficient matrix by using the --kinship option.

Genetic relationship matrix

Let $\small{x_{ik}}$ be the genotype (0/1/2) of individual $\small{i}$ at locus $\small{k}$. If we assume that there are $\small{m}$ markers and that the minor allele frequency (MAF) of marker $\small{k}$ is $\small{p_k}$,

$\phi_{ij}=\left\{\begin{array}{cc} \frac{1}{m}\sum^m_{k=1} \frac{(x_{ik}-2p_k)(x_{jk}-2p_k)}{2p_k(1-p_k)}, & i \ne j \\ 1+\frac{1}{m}\sum^m_{k=1} \frac{x_{ik}^2 - (1+2p_k)x_{ik}+2p_k^2}{2p_k(1-p_k)}, & i = j \end{array}\right.$

WISARD calculates the genetic relationship matrix by default.
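
The formula above can be sketched in a few lines of Python. This is only an illustration of the definition, not WISARD's implementation: the function name and the list-of-lists input layout are assumptions, and the sketch assumes that there are no missing genotypes and that every $\small{p_k}$ lies strictly between 0 and 1.

```python
def genetic_relationship_matrix(genotypes, freqs):
    """Illustrative GRM following the formula above.

    genotypes[i][k] is the 0/1/2 genotype x_ik of individual i at marker k,
    freqs[k] is the allele frequency p_k (assumed 0 < p_k < 1, no missing data).
    """
    n, m = len(genotypes), len(freqs)
    phi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            total = 0.0
            for k, p in enumerate(freqs):
                xi, xj = genotypes[i][k], genotypes[j][k]
                denom = 2.0 * p * (1.0 - p)
                if i == j:
                    # diagonal term: x^2 - (1 + 2p) x + 2p^2
                    total += (xi * xi - (1.0 + 2.0 * p) * xi + 2.0 * p * p) / denom
                else:
                    # off-diagonal term: product of centered genotypes
                    total += (xi - 2.0 * p) * (xj - 2.0 * p) / denom
            value = total / m
            if i == j:
                value += 1.0  # the diagonal is 1 + average
            phi[i][j] = phi[j][i] = value
    return phi
```

The matrix is symmetric, so only the upper triangle is computed and then mirrored.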

IBS matrix

If we let $\small{r_{ijk}}$ be the number of alleles shared between individuals $\small{i}$ and $\small{j}$ at locus $\small{k}$,

$\phi_{ij}=\frac{1}{2m} \sum^m_{k=1} r_{ijk} $

WISARD can calculate the IBS matrix by using the --ibs option.
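
As a minimal sketch of this definition, the Python function below counts shared alleles as $\small{r_{ijk}=2-|x_{ik}-x_{jk}|}$ for 0/1/2 genotypes. That counting rule is a common convention assumed here for illustration; the page does not state how WISARD counts shared alleles or handles missing genotypes, and the function name is hypothetical.

```python
def ibs_matrix(genotypes):
    """Illustrative IBS matrix: shared alleles divided by 2 * number of markers.

    genotypes[i][k] is the 0/1/2 genotype of individual i at marker k.
    Assumption: r_ijk = 2 - |x_ik - x_jk|, and no genotype is missing.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    phi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(2 - abs(a - b) for a, b in zip(genotypes[i], genotypes[j]))
            phi[i][j] = phi[j][i] = shared / (2.0 * m)
    return phi
```

Under this rule every diagonal element equals 1, because an individual shares both alleles with itself at every marker.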

Hybrid relationship matrix

If the absolute difference between the corresponding elements of the kinship coefficient matrix and the genetic relationship matrix is less than some threshold, the element of the kinship coefficient matrix is used; otherwise, the element of the genetic relationship matrix is used. WISARD can calculate the hybrid relationship matrix by using the --hybrid option.

Kendall's tau correlation

A nonparametric correlation structure. WISARD can calculate this relationship matrix by using either the --ktau or the --empktau option.

Pre-computed or user-defined relationship matrix

If a pre-computed matrix of sample relationships is available from WISARD or another toolset, the heavy computational burden of computing the relationship matrix can be avoided by reusing it. WISARD provides the --cor option for this task, and it accepts various formats.


Useful options

WISARD provides several useful options for calculating relationship matrix.

  • --makecor: makes WISARD write the relationship matrix to an output file.
    Compute and export kinship coefficient (assumes autosome)
    C:\Users\WISARD> wisard --bed test_miss0.bed --kinship --makecor
    NOTE!
    In the above example, each element in the output file is an element of the kinship coefficient matrix.
  • --thread: calculation of the genetic relationship matrix is often computationally very intensive. To reduce the computation time, multi-threaded analysis can be conducted by using the --thread option.
    NOTE!
    This option is useful for the genetic relationship matrix and the IBS matrix.
  • --x: makes WISARD build a relationship matrix for the X chromosome. Each element of the relationship matrix for the X chromosome is affected by gender, and thus the gender of each individual must be correctly specified.
  • --x2: also makes WISARD build a relationship matrix for the X chromosome, but with a different structure from that of the --x option.
  • --cormaf: all relationship matrices except the kinship coefficient matrix are built from variants. This option makes WISARD use only the variants whose MAF lies in a given range. By default, WISARD uses variants whose MAF ranges from 5% to 50% (the maximum possible), but the range can be adjusted with the --cormaf option.
    NOTE!
    This option is not applicable to the kinship coefficient matrix.
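
To illustrate what restricting variants to an MAF range means, the Python sketch below selects the markers whose sample MAF falls inside a range. The function name is hypothetical, the default range simply mirrors the 5% to 50% default described above, and treating both bounds as inclusive is an assumption rather than documented WISARD behavior.

```python
def select_markers_by_maf(genotypes, maf_range=(0.05, 0.5)):
    """Return indices of markers whose sample MAF lies in maf_range.

    genotypes[i][k] is the 0/1/2 genotype of individual i at marker k.
    Assumption: both bounds are inclusive and no genotype is missing.
    """
    n = len(genotypes)
    low, high = maf_range
    keep = []
    for k in range(len(genotypes[0])):
        freq = sum(row[k] for row in genotypes) / (2.0 * n)  # frequency of the counted allele
        maf = min(freq, 1.0 - freq)                          # fold to the minor allele
        if low <= maf <= high:
            keep.append(k)
    return keep
```

A monomorphic marker has an MAF of 0 and is always dropped by the default range, which also avoids a zero denominator in the genetic relationship matrix formula.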


Selecting the relationship matrix

The validity of a statistical analysis depends on the choice of relationship matrix, so the matrix should be selected carefully. In particular, the choice depends on the presence of population substructure.

In the absence of population substructure,

  • The kinship coefficient matrix should be used for genetic association analysis if family-based samples are used. For independent samples, the kinship coefficient matrix becomes an identity matrix, which is used for all genetic analyses by default.

In the presence of population substructure,

  • The presence of population substructure can be detected with an MDS plot (see the population stratification page for details).
  • The kinship coefficient matrix does not reflect the genetic similarity between individuals, and incorporating it into the statistical analysis leads to invalid results. The genetic relationship matrix or the IBS matrix is often used for genetic association analysis instead. The most appropriate choice depends on the purpose of the analysis.
  • If the dataset does not contain a sufficiently large number of markers, the genetic relationship matrix or IBS matrix can be over- or under-estimated. Therefore, it should be confirmed that a sufficiently large number of markers is available.


Relationship matrix format

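WISARD can generate a relationship matrix in two different formats of its own: matrix format and pairwise-element format. The GCTA GRM and EPACTS kin formats described at the end of this section are also supported.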

Matrix format

The relationship matrix is expressed in matrix form, delimited by arbitrary whitespace (tabs or spaces). WISARD accepts files with and without a header. For a file without a header, the numbers of rows and columns must both equal the sample size; for a file with a header, this requirement does not need to be satisfied. In the latter case, the header entries must be the IIDs of the individuals, and WISARD constructs the relationship matrix by matching IIDs. If some individuals are missing from the header, their elements in the relationship matrix are assumed to be 0.

NOTE!
By default, WISARD writes the relationship matrix to an output file in matrix format with a header.

Element-wise format

This format is more flexible than the matrix format and can be produced by WISARD by using the --corpair option. In this format, each line represents a single element of the relationship matrix: the IIDs of a pair of individuals, followed by their relationship coefficient. Note that WISARD does not allow individuals with the same IID but different FIDs; all individuals must have different IIDs.

Example 1 : Example of elementwise format
IND00001 IND00001 1.01
IND00001 IND00002 0.37
IND00001 IND00003 0.48
IND00002 IND00002 1.03
IND00002 IND00003 0.55
IND00003 IND00003 1.00
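
A file in this layout can be turned back into a full symmetric matrix with a short Python sketch like the one below. The function name is hypothetical, and two behaviors are assumptions made for illustration: pairs that are not listed are left at 0, and lines naming an IID outside the requested sample list are skipped.

```python
def read_pairwise(lines, sample_ids):
    """Build a symmetric matrix from 'IID1 IID2 value' lines.

    sample_ids fixes the row/column order of the returned matrix.
    Assumptions: unlisted pairs stay 0.0; unknown IIDs are skipped.
    """
    index = {iid: pos for pos, iid in enumerate(sample_ids)}
    n = len(sample_ids)
    phi = [[0.0] * n for _ in range(n)]
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        first, second, value = fields
        if first not in index or second not in index:
            continue  # entry for a sample that is not in the dataset
        i, j = index[first], index[second]
        phi[i][j] = phi[j][i] = float(value)  # each pair is stored once, used twice
    return phi
```

Passing the six lines of Example 1 with the sample order IND00001, IND00002, IND00003 yields a 3 by 3 matrix whose (1, 2) and (2, 1) elements are both 0.37.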

GCTA GRM format

In recent versions of GCTA, a Genetic Relationship Matrix (GRM) is exported in a binary format consisting of three files. WISARD can both read and export this format. When such a file is given with the --cor option, WISARD automatically recognizes its format. To export sample relatedness in this format, use the --corgrm option. To read the produced GRM files from R, please refer to the method suggested by the GCTA authors.

NOTE!
This option cannot be used together with --corpair!

EPACTS kin format

EPACTS also supports its own format for genetic relationships, with the file extension kin. WISARD can both read and export this format. When such a file is given with the --cor option, WISARD automatically recognizes its format. To export sample relatedness in this format, use the --corepacts option.

NOTE!
This option cannot be used together with --corpair!


Kinship coefficient matrix

In the absence of population substructure, the kinship coefficient matrix should be used for genetic association analysis if family-based samples are used. For independent samples, the kinship coefficient matrix becomes an identity matrix, and users do not need to specify it because the identity matrix is used as the relationship matrix by default for most genetic analyses.

If the pedigree structure is misspecified or population substructure exists, family-based association analysis based on the resulting kinship coefficient matrix produces invalid results. In such cases, the IBS matrix or the genetic relationship matrix is a better choice.

WISARD assumes that individuals with different FIDs are independent, and it can calculate a kinship coefficient matrix as long as FID, MID and PID are correctly specified. If the pedigree contains inbred individuals, the calculation of kinship coefficients is computationally intensive.
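
To make the pedigree-based calculation concrete, the Python sketch below computes the kinship coefficient $\small{\pi_{ij}}$ with the standard recursive definition and returns $\small{\phi_{ij}=2\pi_{ij}}$. It is not WISARD's algorithm: the function name and the dictionary layout (IID mapped to father and mother IIDs) are assumptions, and the sketch requires that every listed parent has its own entry.

```python
from functools import lru_cache

def kinship_matrix(pedigree):
    """pedigree maps IID -> (father IID, mother IID); None marks an unknown parent.

    Returns (ids, phi) with phi[i][j] = 2 * kinship coefficient of ids[i], ids[j].
    Assumption: every non-None parent is itself a key of pedigree.
    """
    ids = list(pedigree)

    @lru_cache(maxsize=None)
    def depth(iid):
        # founders have depth 0; a child is one level below its deepest parent
        parents = [p for p in pedigree[iid] if p is not None]
        return 0 if not parents else 1 + max(depth(p) for p in parents)

    @lru_cache(maxsize=None)
    def pi(a, b):
        if a is None or b is None:
            return 0.0  # unknown parents are treated as unrelated
        if a == b:
            father, mother = pedigree[a]
            return 0.5 * (1.0 + pi(father, mother))  # 1/2 * (1 + inbreeding)
        if depth(a) < depth(b):
            a, b = b, a  # always expand the individual deeper in the pedigree
        father, mother = pedigree[a]
        return 0.5 * (pi(father, b) + pi(mother, b))

    phi = [[2.0 * pi(a, b) for b in ids] for a in ids]
    return ids, phi
```

For a nuclear family with two founders and two children, this gives 1 on the diagonal, 0.5 for parent-child and sibling pairs, and 0 between the two founders. The self term shows where inbreeding enters: the diagonal exceeds 1 only when an individual's parents are related.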

Example codes

  • Generate an output file with the kinship coefficient matrix
    Compute and export kinship coefficient (assumes autosome)
    C:\Users\WISARD> wisard --bed test_miss0.bed --kinship --makecor
    NOTE!
    Each element in the output file is $\small{\phi_{ij}=2\pi_{ij}}$, where $\small{\pi_{ij}}$ is the kinship coefficient between a pair of individuals.
  • Generate the kinship coefficient matrix for the X chromosome
    Compute and export kinship coefficient (assumes X-chromosome)
    C:\Users\WISARD> wisard --bed test_miss0.bed --kinship --x --makecor
    Compute and export kinship coefficient, in another way (assumes X-chromosome)
    C:\Users\WISARD> wisard --bed test_miss0.bed --kinship --x2 --makecor

    WISARD can calculate the kinship coefficient matrix for the X chromosome, which is required for some genetic association analyses with variants on the X chromosome. The kinship coefficient matrix for the X chromosome depends on gender, and thus the gender of each individual must be correctly specified.



Genetic relationship matrix

In the presence of population substructure, the kinship coefficient matrix does not reflect the genetic similarity between individuals, and incorporating it into the statistical analysis leads to invalid results.

  • The presence of population substructure can be detected by principal component analysis applied to the genetic relationship matrix (see the population stratification page for details).
  • In the presence of population substructure, some statistical methods use the genetic relationship matrix to explain the similarity between individuals. However, if the dataset does not contain a sufficiently large number of markers, the genetic relationship matrix or IBS matrix can be over- or under-estimated. Therefore, it should be confirmed that a sufficiently large number of markers is available to calculate the genetic relationship matrix.
  • For large-scale genetic data, calculation of the genetic relationship matrix is computationally very intensive. To reduce the computation time, multi-threaded analysis can be conducted by using the --thread option.
    Accelerate calculation of genetic relationship matrix using four threads
    C:\Users\WISARD> wisard --ped test.ped --thread 4
  • Each element of the genetic relationship matrix indicates the genetic similarity of a pair of individuals, but the estimate can be sensitive to outliers. Outliers can occur if the number of individuals is not sufficiently large or if rare variants are used to estimate the genetic relationship matrix. Alternatively, the sample median over the markers can be used as follows:
    $\phi_{ij}=\left\{\begin{array}{cc} \mathrm{median} \left(\frac{(x_{ik}-2p_k)(x_{jk}-2p_k)}{2p_k(1-p_k)}\right), & i \ne j \\ 1+\mathrm{median}\left( \frac{x_{ik}^2 - (1+2p_k)x_{ik}+2p_k^2}{2p_k(1-p_k)}\right), & i = j \end{array}\right.$
    It can be calculated with WISARD by using the --medcor option.
    Generate genetic relationship matrix with median and export
    C:\Users\WISARD> wisard --bed test_miss0.bed --medcor --makecor --out res_corr_med
  • MAF estimates for rare variants are highly variable, and including such variants in the calculation of the genetic relationship matrix can produce unstable results. By default, WISARD uses variants whose MAF ranges from 5% to 50% (the maximum possible), but the range can be adjusted with the --cormaf option.
    Computing genetic relationship matrix with markers having MAF>=1%
    C:\Users\WISARD> wisard --bed test_miss0.bed --cormaf [0.01,0.5] --makecor --out res_corr_over1per
  • The genetic relationship matrix for the X chromosome can be built with WISARD by using all variants.
    Compute and export the genetic relationship matrix for X-chromosome-related statistics, using the markers in autosomes
    C:\Users\WISARD> wisard --bed test_miss0.bed --x2 --makecor --out res_corr_x2
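
The median-based estimator described in this section can be sketched in Python as follows. It replaces the mean over markers with the median of the same per-marker terms. This is an illustration of the formula only, with a hypothetical function name; it assumes no missing genotypes and allele frequencies strictly between 0 and 1, and it says nothing about how --medcor is implemented internally.

```python
from statistics import median

def median_relationship_matrix(genotypes, freqs):
    """Median-based relationship matrix following the formula in this section.

    genotypes[i][k] is the 0/1/2 genotype x_ik, freqs[k] is p_k (0 < p_k < 1).
    """
    n = len(genotypes)
    phi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            terms = []
            for k, p in enumerate(freqs):
                xi, xj = genotypes[i][k], genotypes[j][k]
                denom = 2.0 * p * (1.0 - p)
                if i == j:
                    terms.append((xi * xi - (1.0 + 2.0 * p) * xi + 2.0 * p * p) / denom)
                else:
                    terms.append((xi - 2.0 * p) * (xj - 2.0 * p) / denom)
            value = median(terms)  # per-marker terms summarized by their median
            if i == j:
                value += 1.0
            phi[i][j] = phi[j][i] = value
    return phi
```

A single marker with an extreme term, for example one with a very small $\small{p_k}$, shifts the mean but usually leaves the median unchanged, which is the motivation given above.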


Hybrid relationship matrix

The hybrid relationship matrix can be considered a mixture of the kinship coefficient matrix and the genetic relationship matrix. If the absolute difference between the corresponding elements of the kinship coefficient matrix and the genetic relationship matrix is less than some threshold, the element of the kinship coefficient matrix is used; otherwise, the element of the genetic relationship matrix is used. The appropriate threshold depends on the number of individuals and variants, but no algorithm for determining a probability-based threshold has been proposed yet. The default threshold is 0.01, and a different value can be given as an argument to the --hybrid option.

WISARD can compute and apply the hybrid relationship matrix via the --hybrid option as follows:

Computing hybrid relationship matrix
C:\Users\WISARD> WISARD --ped test_miss0.ped --hybrid --makecor --out res_corr_hybrid
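
The element-wise selection rule described above can be written as a short Python sketch. The function name and the list-of-lists inputs are assumptions made for illustration; the default threshold of 0.01 is the one stated in this section.

```python
def hybrid_matrix(kinship, grm, threshold=0.01):
    """Element-wise hybrid of a kinship-based matrix and a GRM.

    Uses the kinship element when |kinship - grm| < threshold,
    and the GRM element otherwise. Both inputs must have the same shape.
    """
    return [
        [k if abs(k - g) < threshold else g for k, g in zip(k_row, g_row)]
        for k_row, g_row in zip(kinship, grm)
    ]
```

For example, a pair with pedigree value 0.5 and genotype-based estimate 0.37 takes the value 0.37, while a diagonal element with values 1.0 and 1.005 keeps the pedigree value 1.0.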


IBS matrix

The IBS (Identity By State) matrix is defined as the number of alleles shared between two individuals divided by 2$\small{\times}$the total number of variants. Because this measure is efficient to calculate and easy to interpret, many analysis toolkits such as EMMAX or PLINK can calculate it.

WISARD can calculate the IBS matrix by using the --ibs option.

Calculate IBS matrix and export it
C:\Users\WISARD> wisard --ped test_miss0.ped --ibs --makecor --out res_corr_ibs


Kendall's tau correlation

WISARD can compute a nonparametric relationship matrix based on Kendall's tau correlation coefficient. The genotype of each individual at each locus is standardized as follows:

$\frac{(x_{ik}-2p_k)}{\sqrt{2p_k(1-p_k)}}$

and Kendall's tau for each pair of individuals is used as the corresponding element of the relationship matrix. While this relationship matrix may be useful in some situations, further validation is required, and this option is generally not recommended.

Compute nonparametric sample relationship with raw genotypes and export it
C:\Users\WISARD> wisard --bed test_miss0.bed --ktau --makecor --out res_corr_ktau
Compute and export nonparametric sample relatedness with normalized genotypes
C:\Users\WISARD> wisard --bed test_miss0.bed --empktau --makecor --out res_corr_normktau
NOTE!
--empktau is computationally very intensive and is NOT recommended if the number of variants is very large, for instance 2,000K!
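
As a rough illustration of the idea, the Python sketch below standardizes genotypes with the formula given in this section and computes Kendall's tau between the marker vectors of two individuals. The tau-b variant, which corrects for ties, is an assumption: this page does not state which variant WISARD uses or how ties are handled, and both function names are hypothetical.

```python
from math import sqrt

def standardize(genotypes_i, freqs):
    """Standardize one individual's genotypes: (x_ik - 2 p_k) / sqrt(2 p_k (1 - p_k))."""
    return [(x - 2.0 * p) / sqrt(2.0 * p * (1.0 - p)) for x, p in zip(genotypes_i, freqs)]

def kendall_tau_b(u, v):
    """Kendall's tau-b between two equal-length sequences (O(m^2) pair scan)."""
    concordant = discordant = ties_u = ties_v = 0
    m = len(u)
    for a in range(m):
        for b in range(a + 1, m):
            du, dv = u[a] - u[b], v[a] - v[b]
            if du == 0:
                ties_u += 1
            if dv == 0:
                ties_v += 1
            if du * dv > 0:
                concordant += 1
            elif du * dv < 0:
                discordant += 1
    pairs = m * (m - 1) // 2
    denom = sqrt((pairs - ties_u) * (pairs - ties_v))
    if denom == 0:
        return 0.0  # a constant sequence carries no rank information
    return (concordant - discordant) / denom
```

The pair scan is quadratic in the number of markers, which gives a sense of why a rank-based matrix becomes expensive for very large marker sets.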


Pre-computed or user-defined relatedness matrix

WISARD can apply a user-defined correlation structure to an analysis by reading it from a file.
Note that this `correlation structure` is defined across samples, not across markers.
To do so, specify the --cor option when running WISARD, as below:

Assigning user-defined correlation structure to analysis
C:\Users\WISARD> wisard --ped example.ped --cor my.cor

In the above example, the user-defined correlation file `my.cor` should be in one of the formats below.

Matrix form correlation

In the matrix form, the correlation file is given as a matrix delimited by arbitrary whitespace.
WISARD accepts both (1) a simple matrix form with no header and (2) a matrix form with a header. In the former case, the dimension of the matrix must equal the sample size of the final dataset, and the row/column order is assumed to be the same as the sample order of the final dataset. The latter case does not have these restrictions, because WISARD automatically matches samples and reorders the matrix. For this to work, each header entry must be the IID of the corresponding sample.

NOTE!
By default, WISARD produces the sample relatedness matrix in this form, with a header.

Paired form correlation

Unlike the matrix form, the paired form allows more flexibility in the given correlation file because of its structure.
As seen in the example below, each line represents one correlation coefficient between two samples, and the coefficient is the same regardless of the order of the pair.
Even if some filtering or missingness applies to your input dataset, unmatched or filtered entries are automatically removed from further analysis.

NOTE!
If the values in the given correlation matrix depend on the full set of samples, the assumptions behind the matrix may no longer hold after filtering!

NOTE!
This format can be produced by WISARD with the --corpair option.

Example 2 : Example of paired-form correlation input
IND00001 IND00001 0.88
IND00001 IND00002 0.37
IND00001 IND00003 0.48
IND00002 IND00002 0.95
IND00002 IND00003 0.55
IND00003 IND00003 0.99

GRM-type correlation

GCTA, a toolset similar to WISARD, also provides comprehensive functions for association tests on related genetic data. It can compute the relationship across samples from their genotypes, called the genetic relationship matrix (GRM), and the resulting GRM is stored in a binary format. WISARD accepts this GRM format and uses it in subsequent analyses via the --cor option. Note that the last extension of the GRM output files should be omitted when they are given as an input to WISARD.

Produce a GRM matrix using GCTA
C:\Users\WISARD> gcta64 --bfile test --autosome --make-grm --out test
The above command produces three files: test.grm.id, test.grm.N.bin and test.grm.bin. To use the GRM from WISARD, run a command like the one below, giving the file name without the last extensions (.id, .N.bin and .bin).
Utilize produced GRM matrix to QLS test from WISARD
C:\Users\WISARD> WISARD --bed test.bed --qls --cor test.grm
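
For readers who want to inspect such files outside WISARD, the Python sketch below reads a GCTA binary GRM into a full matrix. It assumes the layout used by GCTA's own R reading function: the .grm.id file lists FID and IID per line, and the .grm.bin file stores the lower triangle including the diagonal, row by row, as 4-byte floats. Little-endian byte order and the function name are assumptions made here; the .grm.N.bin file, which holds per-pair marker counts, is not read.

```python
import struct

def read_grm_bin(prefix):
    """Read prefix.grm.id and prefix.grm.bin into (ids, full symmetric matrix).

    Assumptions: 4-byte little-endian floats, lower triangle stored row by row
    with the diagonal included, as in GCTA's documented R reader.
    """
    with open(prefix + ".grm.id") as handle:
        ids = [tuple(line.split()[:2]) for line in handle if line.strip()]
    n = len(ids)
    count = n * (n + 1) // 2  # number of lower-triangle elements
    with open(prefix + ".grm.bin", "rb") as handle:
        values = struct.unpack("<%df" % count, handle.read(4 * count))
    grm = [[0.0] * n for _ in range(n)]
    position = 0
    for i in range(n):
        for j in range(i + 1):
            grm[i][j] = grm[j][i] = values[position]
            position += 1
    return ids, grm
```

With the files from the GCTA command above, the call would be `read_grm_bin("test")`, using the same prefix without the last extensions.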



Last modified : 2017-08-29 13:17:43