In some cases, an input file is compressed because the original file is too large.
WISARD can load gzipped files directly.
In this case, no extra option is required.
WISARD automatically recognizes whether the input is gzipped and then retrieves the data.
NOTE!
To use this functionality, zlib must be supported.
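One common way to recognize gzipped input without an extra option is to inspect the first two bytes of the file, which are always 0x1f 0x8b for a gzip stream. The Python sketch below illustrates that idea only; it is not WISARD's implementation, and the helper name `open_maybe_gzip` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open `path` as text, inflating it transparently if it is gzipped.

    Hypothetical helper for illustration; WISARD's own detection logic
    may differ.
    """
    # Peek at the first two bytes to decide how to open the file.
    with open(path, "rb") as probe:
        is_gzipped = probe.read(2) == GZIP_MAGIC
    if is_gzipped:
        return gzip.open(path, "rt")
    return open(path, "r")
```

With such a helper, the same calling code can read my.ped and my.ped.gz without knowing which one it was given.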
Load dataset from standard input stream
Standard input is often used to pass data from one program to another, as in a pipeline.
WISARD also supports pipelining, but only for the single main file of a given data format. The main file for each format is:
PED for PLINK PED format
BED for PLINK binary format
To use pipelining in WISARD, the argument of the main input option must be a single dash character (-), as shown in the example below.
In the above example, my.ped is compressed as my.ped.gz and there is a paired my.map file in the same directory.
The command first inflates my.ped.gz and prints it to standard output, so it would normally be printed to the console.
However, the '|' character after the command catches that output and redirects it to WISARD, so it can be retrieved as the input of --ped.
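The dash convention itself is simple to express in code. The following Python sketch shows how a program might treat a single dash as "read from standard input" and anything else as a file path; `open_main_input` is a hypothetical name, and the sketch does not reproduce WISARD's source.

```python
import sys


def open_main_input(arg):
    """Return a readable text stream for the main input argument.

    A single dash means "read from standard input"; any other value is
    treated as a file path. Hypothetical helper, for illustration only.
    """
    if arg == "-":
        return sys.stdin
    return open(arg, "r")
```

Because the stream is chosen in one place, the rest of a reader can process piped and on-disk input identically.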
Generate dataset
Even when no dataset is available, it is possible to generate one ab initio or from a seed dataset.
There are three ways to generate a dataset.
Generate null genotypes with predefined structure (Using --sim option)
A dataset can be generated ab initio under the predefined family structure.
Currently WISARD provides three types of pedigree structure:
--sim indep (Independent dataset)
--sim extended (Extended family, consisting of ten members)
--sim trio (Trio family, consisting of three members)
Generate null genotypes with given family structure (Using --sim and --fam option)
If there is a family structure to be simulated,
WISARD can simulate null genotypes with the given family structure.
Any kind of dataset can be supplied, but the genotypes it contains will be ignored,
and the final sample size will be the number of samples in the dataset times --nfam,
provided that no sample filtering is applied.
Generate dataset containing statistically significant variables
Sorry, this section is still under curation.
Because the format specification is loose and the data come from many different sources, the PED/RAW file format inevitably varies in many ways.
To handle this, WISARD accepts a very wide range of variations in the PED/RAW format, such as:
By default, WISARD assumes that the input dataset was generated from humans.
This can be changed to another species using the --species option.
Currently, WISARD supports the species below for analysis.
--species human : For the dataset from humans (Homo sapiens)
--species mouse : For the dataset from mice (Mus musculus)
--species rat : For the dataset from rats (Rattus rattus)
--species rabbit : For the dataset from rabbits (Oryctolagus cuniculus)
--species sheep : For the dataset from sheep (Ovis aries)
--species cow : For the dataset from cows (Bos taurus)
--species horse : For the dataset from horses (Equus caballus)
--species dog : For the dataset from dogs (Canis familiaris)
--species rice : For the dataset from rice (Oryza sativa)
Allowing indels for input
By default, WISARD only allows single nucleotide polymorphisms (SNPs) as input.
To lift this limitation and allow indels in the input, the --indel option must be specified.
Allowing partial or no MAP file
A MAP file normally consists of four columns: chromosome, variant name, genetic distance, and physical position.
However, some non-standard MAP files have only some of those columns. In this case, the options below can be applied.
--nopos indicates there is no corresponding column for physical position.
--nogdist indicates there is no corresponding column for genetic distance.
These options can be used together on one line; specifying both assumes that there are only two columns (chromosome and variant name) in the given MAP file.
Alternatively, it is possible to have no MAP file at all,
in which case WISARD automatically generates appropriate variant information.
Variants are then named automatically from MARKER_1 to MARKER_p,
where p is the number of variants retrieved.
Since the variant names are generated by a fixed rule, other options requiring variant names can still be applied.
NOTE!
Some options that refer to chromosome, genetic distance, or physical position will behave unexpectedly with --nomap!
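The fixed naming rule described above (MARKER_1 through MARKER_p) can be sketched in a few lines of Python. The function name `default_variant_names` is hypothetical and the sketch only illustrates the rule as documented.

```python
def default_variant_names(p):
    """Return the names MARKER_1 .. MARKER_p for p variants."""
    if p < 0:
        raise ValueError("number of variants must be non-negative")
    # Names are 1-based, matching the documented MARKER_1 .. MARKER_p range.
    return [f"MARKER_{i}" for i in range(1, p + 1)]
```

Because the names are deterministic, a variant list prepared in advance (for example, one naming MARKER_2) would refer to the same variant on every run.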
Mapping number-coded genotype to character-coded genotype
In some PED, TPED, or LGEN datasets, genotypes are coded as numbers, where 1, 2, 3, and 4 correspond to A, C, G, and T, respectively.
Adding the --1234 option to the command converts the genotype notation from numbers to characters.
For details, refer to the example below.
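The number-to-character mapping can be illustrated with a short Python sketch. It is not WISARD's code: the function name `recode_1234` is hypothetical, and leaving any other code (such as a missing genotype code) unchanged is an assumption made here.

```python
NUMBER_TO_BASE = {"1": "A", "2": "C", "3": "G", "4": "T"}


def recode_1234(alleles):
    """Map number-coded alleles 1/2/3/4 to A/C/G/T.

    Codes outside 1-4 (for example a missing code) are passed through
    unchanged; that pass-through is an assumption of this sketch.
    """
    return [NUMBER_TO_BASE.get(a, a) for a in alleles]
```

For instance, the allele columns `1 2 4 4` would become `A C T T` under this mapping.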
Mapping arbitrary characters to genotype
In some cases, the coded genotypes cannot be read directly, e.g. 1/2/3/4 for A/C/G/T.
To recode such an arbitrarily coded dataset, the --acgt option might be helpful.
The above code converts the dataset as follows:
Flip genotype
In order to flip strands, --flip can be used.
--flip option with no argument flips genotype from A/C/G/T to T/G/C/A, respectively.
Note that --flip option is essentially equivalent to --acgt tgca.
If a different flip sequence is desired, add an argument consisting of four characters.
The four characters correspond to A/C/G/T, respectively.
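The default behavior of --flip (with no argument) is the usual strand complement, A to T, C to G, G to C, and T to A. The Python sketch below illustrates only that default case; `flip_strand` is a hypothetical name, and passing unrecognized codes through unchanged is an assumption.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip_strand(alleles):
    """Flip each allele to its complement on the opposite strand.

    Codes other than A/C/G/T (for example a missing code) are left
    unchanged; that pass-through is an assumption of this sketch.
    """
    return [COMPLEMENT.get(a, a) for a in alleles]
```

Since complementing twice returns the original base, flipping an already flipped dataset restores it.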
Flip a subset of dataset
By assigning a list of markers via --varsubset,
it is possible to designate the subset to be flipped.
This option affects the --flip, --acgt, and --1234 options.
NOTE!
With an argument, --flip is equivalent to --acgt!
Changing in/out phenotype/genotype missing character
By default, the value indicating a missing phenotype is -9.
However, it is sometimes coded as NA, <NA>, NONE, or any of many other values.
To process this kind of dataset appropriately, the --mispheno option can specify an arbitrary value for the missing phenotype.
In a similar manner, a non-standard missing genotype character can be specified with --misgeno.
Note, however, that the default missing genotype character differs for each file format.
The following is a list of the default missing genotype codes for each file format:
PED/Transposed PED/LGEN format: 0 (ASCII digit 0)
BED format: Invisible and fixed (cannot be changed)
VCF: . (ASCII dot)
PLINK RAW: -9 (ASCII hyphen and ASCII digit 9)
Thus, be aware of the default missing character of the given dataset when using --misgeno.
NOTE!
When using --merge and --misgeno together, WISARD can show unexpected behavior, because --misgeno is applied to all datasets being merged.
It should be noted that --mispheno and --misgeno are only applied to the input.
In other words, every data export option starting with 'make', such as --makeped, uses the default missing genotype and missing phenotype characters of its format.
In this case, --outmispheno and --outmisgeno should be used to change the default coding for dataset export.
Changing parental missing notation for FAM file
Many familial relationship files use 0 to denote a missing parental relationship, which is also the notation for founders.
To change this notation to another string such as NA or <NA>, use --misparent.
Let the FAM file 'test_miss0_parNA.fam' look like below.
Without the --misparent option, the above dataset cannot be retrieved correctly, because WISARD expects 0 in order to recognize a founder sample as a founder.
To read this BED file correctly, the command below is required.
Allowing skip/ignore specific columns for FAM file
In rare cases, the input is formatted in a non-standard way, such as lacking a specific column.
For example, some PED datasets omit the genetic distance field in their MAP file.
Because this is not a valid MAP file format, a special option is required to read it appropriately.
Below is a list of options providing such functionality for specific columns.
--nofid assumes there is no column for FID (i.e., five-column format for FAM)
NOTE!
With this option, the dataset is treated as independent since there is no familial information! Hence, NO PARENT INFORMATION is allowed!
--singleparent allows either the paternal or the maternal ID to be missing (i.e., single parent). By default, WISARD does not allow a single parent since it is generally unlikely to happen and breaks the usual structure of a pedigree. This option bypasses that limitation.
--sepid assumes there is no column for FID, but the IID column holds both FID and IID joined by a specific separator
--noparent assumes there are no columns for paternal and maternal IIDs
--nosex assumes there is no column for sex. Since this option sets every sample's sex to NA,
--imputesex is required to perform an analysis using sex.
--nopheno assumes there is no column for phenotype. Since this option removes the default phenotype,
--sampvar and --pname are required to perform an analysis using phenotype.
In addition to skipping such columns, it is also possible to ignore them.
In other words, even if those columns are available in the input, they can be ignored with the options below.
--ignorefid ignores FID and sets each sample's FID to its IID. With this option, all samples become independent samples regardless of their original state.
--ignoreparent ignores parental information. With this option, all samples' pedigree information is discarded.
Changing notation for phenotype of dichotomous phenotype
To recognize a dichotomous phenotype properly, the case/control (or affected/unaffected) status must by default be coded as 2 or 1, respectively.
However, some datasets code dichotomous phenotypes as 1/0 or 1/-1. To retrieve this kind of dataset correctly, use --1case if 1=case and 0=control, or --cact otherwise.
NOTE!
This option is applied to alternative phenotypes!
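To make the 1=case / 0=control recoding concrete, the Python sketch below translates such values into the default 2=case / 1=control notation. Whether WISARD performs the conversion in this way internally is not stated in this document, so this is purely an illustration; `recode_1case` is a hypothetical name.

```python
def recode_1case(values):
    """Recode 1=case / 0=control into the default 2=case / 1=control.

    Any other value (for example the missing code -9) is passed through
    unchanged, which is an assumption of this sketch.
    """
    mapping = {"1": "2", "0": "1"}
    return [mapping.get(v, v) for v in values]
```

Reading a 1/0-coded phenotype column without such a recoding would treat every case as a control and every control as missing, which is why the notation must be declared.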
Changing notation for gender coding
By default, WISARD recognizes the sex of a sample as 2=female and 1=male.
To alter this notation, use --1sex or --mafe.
If the dataset is encoded as 1=female and 0=male, use --1sex. Otherwise, use --mafe.
Changing allele separator in PED/TPED/LGEN
By default, the two alleles of a genotype must be separated by whitespace.
However, some PED/TPED/LGEN files use a different separator between the two alleles of a genotype.
Below is an example of such data.
Such a file can be retrieved using the --sepallele option.
NOTE!
This option is only applicable when the input is a PED, TPED, or LGEN file!
Alternatively, in rare cases, there may be no allele separator at all in PED/TPED/LGEN files:
This kind of input can be retrieved using the --consecallele option.
NOTE!
This option cannot handle indels, so use it with care!
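The two situations (a custom allele separator, or no separator at all) can be illustrated with a small Python sketch. It is not WISARD's parser: `split_genotype` is a hypothetical name, and the "/" separator used in the usage note is only an assumed example, since the original sample data is not shown here.

```python
def split_genotype(token, sep=None):
    """Split one genotype token into its two alleles.

    With `sep`, the token is split on that separator (custom-separator
    case). Without it, the token must be exactly two characters, one per
    allele (no-separator case).
    """
    if sep is None:
        # Without a separator, the only unambiguous split is one
        # character per allele, so multi-character alleles cannot work.
        if len(token) != 2:
            raise ValueError(f"cannot split {token!r} without a separator")
        return token[0], token[1]
    parts = token.split(sep)
    if len(parts) != 2:
        raise ValueError(f"expected two alleles in {token!r}")
    return parts[0], parts[1]
```

Here `split_genotype("A/G", sep="/")` and `split_genotype("AG")` both yield the alleles A and G, whereas a token such as `ACG` cannot be split without a separator, which shows why consecutive alleles and indels do not mix.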
Skip lines at the beginning of the dataset
Some datasets have extra information at the beginning of the file.
To load such a dataset with ordinary tools, this information must first be removed
so that the tool can read the dataset properly.
With WISARD, such a dataset can be loaded without modifying it,
by skipping a specific number of lines at the beginning of the dataset
using the --nskip option. For example, the dataset 'test_miss0_comment.ped' has the following contents.
The above dataset can be retrieved with the --nskip 3 option, as in the command below.
NOTE!
Currently this option is only applicable to the PED/RAW/TPED file formats!
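The effect of skipping a fixed number of leading lines can be sketched in Python as follows. This is an illustration of the idea rather than WISARD's reader; `read_after_skip` is a hypothetical name, and the comment lines in the usage note are invented placeholders.

```python
from itertools import islice


def read_after_skip(lines, nskip):
    """Return the lines that remain after dropping the first `nskip`."""
    if nskip < 0:
        raise ValueError("nskip must be non-negative")
    # islice discards the first nskip items without loading them all.
    return list(islice(lines, nskip, None))
```

Given three leading comment lines followed by genotype records, `read_after_skip(lines, 3)` returns only the genotype records, which is the situation --nskip 3 addresses.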
Generating specific format from input dataset
WISARD can generate a dataset in a specific format.
Currently, the formats below can be generated.
PLINK PED file : Generates .ped (genotype & pedigree, sex and single phenotype) and .map (variant info. without allele info.)
Binary PED file : Generates .bed (binary-coded genotype), .fam (pedigree, sex and phenotype) and .bim (variant info. with allele info.)
Transposed PED file : Generates .tped (variant info. and genotype) and .tfam (pedigree, sex and single phenotype)
LGEN file : Generates .lgen (FID, IID and genotype) and .map (variant info. without allele info.)
VCF file : Generates .vcf (genotype and others) and .fam (pedigree, sex and phenotype). For details, see this page.
RAW file : Generates .raw (minor allele, number-coded genotype, pedigree, sex and single phenotype)
GEN file : Generates .gen (allele info. and probability-coded genotype) and .sample (FID, IID and multiple phenotypes and covariates)
Binary GEN file : Generates .bgen (equivalent to .gen but binary-coded) and .sample (FID, IID and multiple phenotypes and covariates)
In addition to these widely used file formats, subsets of the entire dataset can also be generated (listed below).
Phenotype file : A first-line header of phenotype names, followed by one record of phenotype values per sample in the dataset. (FID, IID and phenotypes)
Covariates file : Same as the phenotype file, but with records of covariate values.
Genotype file : A plain n-by-p matrix file, where n is the number of samples and p is the number of variants.