WISARD[wɪzərd]
Workbench for Integrated Superfast Association study with Related Data

Data Loading

This section covers the following topics:

  • Load dataset into WISARD
    • Load zipped dataset with gzip
    • Load dataset from standard input stream
    • Generate dataset
  • Additional options for input
    • Set the species to be analyzed
    • Allowing indels for input
    • Allowing partial or no MAP file
    • Mapping number-coded genotype to character-coded genotype
    • Mapping arbitrary characters to genotype
    • Flip genotype
    • Changing in/out phenotype/genotype missing character
    • Changing parental missing notation for FAM file
    • Allowing skip/ignore specific columns for FAM file
    • Changing notation for phenotype of dichotomous phenotype
    • Changing notation for gender coding
    • Changing allele separator in PED/TPED/LGEN
    • Skip some of lines in the dataset from the first
  • Generating specific format from input dataset
    • Generate summary after data loading

      Having problems loading a dataset? Please e-mail biznok@gmail.com for support!

      Load dataset into WISARD

      For all supported file formats, WISARD provides four ways to obtain input:

      1. Load plain dataset (unzipped) from file (no extra option is required)
      2. Load zipped dataset with gzip from file
      3. Load plain or zipped dataset from standard input stream
      4. Automatically generate dataset

      Load zipped dataset with gzip

      In some cases, an input file is compressed because the original file is too large. WISARD can load a gzipped file directly, and no extra option is required. WISARD automatically recognizes whether the input is gzipped or not, then retrieves the data.

      Load zipped PED file and perform regression analysis:
      C:\Users\WISARD> wisard --ped test_miss0.ped.gz --regression --out res_zipped_regr
      NOTE!
      In order to use this functionality, WISARD must be built with zlib support.
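
      How such automatic recognition can work is sketched below in Python. This is an illustration only, not WISARD's actual implementation: it assumes that detection relies on the two-byte gzip magic number (0x1f 0x8b), and the helper name open_maybe_gzip is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def open_maybe_gzip(path):
    """Open a text file, inflating it transparently if it is gzip-compressed.

    Hypothetical helper: the file is probed by content, not by its extension.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")  # text mode, decompressed on the fly
    return open(path, "r")
```

      Probing the content rather than the file name means that a compressed file is still handled correctly when it lacks the .gz extension.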

      Load dataset from standard input stream

      Standard input is often used to redirect data from one program to another, as in a pipeline. WISARD also supports pipelining, but only for the single main file of a given data format. The main file for each format is:

      • PED for PLINK PED format
      • BED for PLINK binary format

      In order to use pipelining in WISARD, the argument of the main input option must be a single dash character (-), as shown in the example below.

      Perform regression analysis using dataset from standard input stream:
      C:\Users\WISARD> gzip -dc test_miss0.ped.gz | wisard --ped - --map test_miss0.map --regression

      In the above example, test_miss0.ped is compressed as test_miss0.ped.gz, and the paired test_miss0.map file is in the same directory. The command first inflates test_miss0.ped.gz and writes it to standard output, so it would normally be printed to the console. However, the '|' character after the command captures that output and redirects it to WISARD, where it is read as the input of --ped.
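
      The dash convention is common among command-line tools. A minimal Python sketch of the idea follows; the function name open_input and its stdin parameter are hypothetical and exist only for illustration, and this is not WISARD's own code.

```python
import sys


def open_input(arg, stdin=None):
    """Return a readable text stream for an input argument.

    A single dash selects standard input; anything else is treated as a
    file path. The `stdin` parameter exists only to make the sketch testable.
    """
    if arg == "-":
        return stdin if stdin is not None else sys.stdin
    return open(arg, "r")
```

      Because only one stream can be piped in, companion files such as the MAP file still have to be given as ordinary paths, which matches the restriction described above.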

      Generate dataset

      Even when no dataset is available, it is possible to generate one, either ab initio or from a seed dataset. There are three possible ways of dataset generation.

      • Generate null genotypes with predefined structure (Using --sim option)
        A dataset can be generated ab initio under a predefined family structure. Currently WISARD provides three types of pedigree structure:
          --sim indep (Independent dataset)
          --sim extended (Extended family, consisting of ten members)
          --sim trio (Trio family, consisting of three members)
        Simulate ten extended families with 1,000 variants and generate Binary PED file:
        C:\Users\WISARD> wisard --sim extended --nsim 10 --szvar 1000 --makebed --out res

      • Generate null genotypes with given family structure (Using --sim and --fam option)
        If there is a family structure to be simulated, WISARD can simulate null genotypes with the given family structure. Any kind of dataset can be supplied, but the genotypes it contains will be ignored, and the final sample size will be the number of samples in the dataset times --nfam, unless sample filtering is applied.
        Simulate 1,000 variants using the family structure defined in 'test_miss0.fam', make five replicates and make Binary PED file:
        C:\Users\WISARD> wisard --sim --fam test_miss0.fam --nsim 5 --szvar 1000 --makebed --out res_my5

      • Generate a dataset containing statistically significant variables
        Sorry, this section is still under curation.

      Additional options for input

      Because its format specification is loose and data come from various sources, the PED/RAW file format is inevitably diverse in many ways. To cope with this, WISARD accepts a very wide range of variations of the PED/RAW format, such as:

      • A file with mixture of whitespaces
      • PED file with various encoding of variant
      • PED file with indels/multiple (via --indel)

      Set the species to be analyzed

      By default, WISARD assumes that the input dataset was generated from humans. This can be changed to another species using the --species option. Currently, WISARD supports the species below.

      • --species human : For the dataset from humans (Homo sapiens)
      • --species mouse : For the dataset from mice (Mus musculus)
      • --species rat : For the dataset from rats (Rattus rattus)
      • --species rabbit : For the dataset from rabbits (Oryctolagus cuniculus)
      • --species sheep : For the dataset from sheep (Ovis aries)
      • --species cow : For the dataset from cows (Bos taurus)
      • --species horse : For the dataset from horses (Equus caballus)
      • --species dog : For the dataset from dogs (Canis familiaris)
      • --species rice : For the dataset from rice (Oryza sativa)

      Allowing indels for input

      By default, WISARD only allows Single Nucleotide Polymorphisms as input. To lift this limitation and allow indels in the input, the --indel option must be specified.

      Retrieve dataset including indels:
      C:\Users\WISARD> wisard --bed test_miss0.bed --indel

      Allowing partial or no MAP file

      A MAP file normally consists of four columns: chromosome, variant name, genetic distance and physical position. However, some non-standard MAP files have only some of those columns. In this case, the options below can be applied.

      • --nopos indicates there is no corresponding column for physical position.
      • --nogdist indicates there is no corresponding column for genetic distance.

      The above options can be combined in one command; the example below assumes that the given MAP file has only two columns (chromosome and variant name).

      Input 'test_nopos_nogdist.map' assuming an absence of two columns:
      C:\Users\WISARD> wisard --ped test_miss0.ped --map test_nopos_nogdist.map --nopos --nogdist

      Alternatively, it is possible to have NO MAP file at all, in which case WISARD automatically generates appropriate variant information.

      Input 'test_miss0.ped' with no variant information:
      C:\Users\WISARD> wisard --ped test_miss0.ped --nomap

      In this case, variants are automatically named from MARKER_1 to MARKER_p, where p is the number of variants retrieved. Since the variant names are generated by a fixed rule, other options that require variant names can still be applied.

      NOTE!
      Some options that refer to chromosome, genetic distance, or physical position will behave unexpectedly with --nomap!

      Mapping number-coded genotype to character-coded genotype

      In some PED, TPED or LGEN datasets, genotypes are coded as numbers, where 1, 2, 3, and 4 correspond to A, C, G, and T, respectively. Adding the --1234 option to the command converts the genotype notation from numbers to characters, as in the example below.

      Example 1 : Number-coded PED file
      S1 S1 0 0 1 1 1 2 1 3 2 4
      S2 S2 0 0 2 1 1 1 1 3 2 2
      S3 S3 0 0 2 2 1 1 1 3 2 2
      S4 S4 0 0 1 2 1 1 1 1 4 4
      Recode 1/2/3/4 coded dataset as A/C/G/T dataset and make PED:
      C:\Users\WISARD> wisard --ped test_miss0_1234.ped --1234 --makeped --out sample_1234toACGT
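
      The effect of this recoding on a single PED line can be sketched in Python as below. The function name recode_1234 is hypothetical, and the sketch assumes the standard six leading PED columns and that any other code (such as the missing genotype 0) is left unchanged; it is not WISARD's implementation.

```python
NUM_TO_BASE = {"1": "A", "2": "C", "3": "G", "4": "T"}


def recode_1234(ped_line, n_leading=6):
    """Convert number-coded alleles on one PED line to A/C/G/T.

    The first `n_leading` columns (FID, IID, PAT, MAT, SEX, PHENOTYPE) are
    kept as they are; alleles outside 1-4 pass through untouched.
    """
    fields = ped_line.split()
    head, alleles = fields[:n_leading], fields[n_leading:]
    return " ".join(head + [NUM_TO_BASE.get(a, a) for a in alleles])


# First line of Example 1
print(recode_1234("S1 S1 0 0 1 1 1 2 1 3 2 4"))
```

      Only the allele columns are translated, so the sex and phenotype codes, which are also numbers, are not affected.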

      Mapping arbitrary characters to genotype

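      In some cases, genotypes are coded with characters that cannot be read directly; the 1/2/3/4 coding for A/C/G/T above is one instance. To recode such an arbitrarily coded dataset, the --acgt option can be used. Its argument consists of four characters, which correspond to A, C, G, and T, respectively.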

      Example 2 : Arbitrarily coded PED file
      S1 S1 0 0 1 1 Q W Q E R E
      S2 S2 0 0 2 1 W W Q Q E E
      S3 S3 0 0 2 2 W W E E R E
      S4 S4 0 0 1 2 W Q E E R E
      Recode Q/W/E/R coded dataset to A/C/G/T dataset and make PED:
      C:\Users\WISARD> wisard --bed test_miss0 --bim test_miss0_qwer.bim --acgt QWER --makeped --out sample_recoded

      The above command converts the dataset as follows:

      Example 3 : Converted PED file with --acgt
      S1 S1 0 0 1 1 A C A G T G
      S2 S2 0 0 2 1 C C A A G G
      S3 S3 0 0 2 2 C C G G T G
      S4 S4 0 0 1 2 C A G G T G
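
      The mapping from Example 2 to Example 3 can be reproduced with the short Python sketch below. The function name recode_alleles is hypothetical; the sketch assumes that the four characters of the code map, in order, to A, C, G and T, and that unlisted characters are left as they are.

```python
def recode_alleles(ped_line, code, n_leading=6):
    """Map four arbitrary allele characters to A/C/G/T on one PED line."""
    if len(code) != 4:
        raise ValueError("the code must consist of exactly four characters")
    table = dict(zip(code, "ACGT"))  # e.g. Q->A, W->C, E->G, R->T
    fields = ped_line.split()
    head, alleles = fields[:n_leading], fields[n_leading:]
    return " ".join(head + [table.get(a, a) for a in alleles])


# First line of Example 2 becomes the first line of Example 3
print(recode_alleles("S1 S1 0 0 1 1 Q W Q E R E", "QWER"))
```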

      Flip genotype

      To flip strands, use --flip. With no argument, the --flip option flips genotypes from A/C/G/T to T/G/C/A, respectively. Note that --flip is essentially equivalent to --acgt tgca. If a different flip mapping is desired, add an argument consisting of four characters, which correspond to A, C, G, and T, respectively.

      • Flip a subset of dataset
        By assigning a list of markers via --varsubset, it is possible to designate the subset to be flipped. This option affects the --flip, --acgt, and --1234 options.

      NOTE!
      With an argument, this option is equivalent to --acgt!
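
      Strand flipping, optionally restricted to a subset of variants, can be illustrated with the Python sketch below. The names flip_variant and flip_subset and the dictionary-based data layout are assumptions made for the example, not WISARD internals.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip_variant(a1, a2):
    """Return the strand-flipped alleles; other codes (e.g. missing '0') are kept."""
    return COMPLEMENT.get(a1, a1), COMPLEMENT.get(a2, a2)


def flip_subset(genotypes, subset=None):
    """Flip genotypes given as {variant name: (allele1, allele2)}.

    When `subset` is None every variant is flipped; otherwise only the
    variants whose names are in `subset` are flipped.
    """
    return {
        name: (flip_variant(*g) if subset is None or name in subset else g)
        for name, g in genotypes.items()
    }
```

      For instance, flip_subset({"rs1": ("A", "C"), "rs2": ("G", "T")}, subset={"rs2"}) leaves rs1 untouched and turns rs2 into ("C", "A").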

      Changing in/out phenotype/genotype missing character

      By default, the missing phenotype value is fixed as -9. However, it is sometimes coded as NA, <NA>, NONE, or one of many other values. To process this kind of dataset appropriately, use the --mispheno option to specify the value that denotes a missing phenotype.

      Reading 'test_miss0_pheNA.ped' with phenotype missing code NA:
      C:\Users\WISARD> wisard --ped test_miss0_pheNA.ped --mispheno NA
      Example 4 : PED file with phenotype missing code NA
      S1 S1 0 0 1 NA A C A G T G
      S2 S2 0 0 2 1 C C A A G G
      S3 S3 0 0 2 NA C C G G T G
      S4 S4 0 0 1 2 C A G G T G

      In a similar manner, a non-standard missing genotype character can be specified with --misgeno. Note, however, that the default missing genotype character differs for each file format. The default missing genotype codes are:

      • PED/Transposed PED/LGEN format: 0(ASCII number 0)
      • BED format: Invisible and fixed (Cannot change)
      • VCF: .(ASCII dot)
      • PLINK RAW: -9(ASCII hyphen and ASCII number 9)

      Thus, be careful of the default missing character of given dataset when using --misgeno.

      NOTE!
      When --merge and --misgeno are used together, WISARD can show unexpected behavior, because --misgeno is applied to all datasets being merged.

      It should be noted that --mispheno and --misgeno are applied only to the input. In other words, every data export option that starts with 'make', such as --makeped, uses the default missing genotype and missing phenotype characters of its format. To change the default coding for dataset export, use --outmispheno and --outmisgeno.

      Changing parental missing notation for FAM file

      Many familial relationship files use 0 to denote a missing parent, and the same notation marks founders. To change this notation to another string such as NA or <NA>, use --misparent. Suppose the FAM file 'test_miss0_parNA.fam' looks like below.

      Example 5 : Contents of 'test_miss0_parNA.fam'
      FAM_1 SAMP1_1 NA NA 1 2
      FAM_1 SAMP2_1 NA NA 2 1
      FAM_1 SAMP3_1 SAMP1_1 SAMP2_1 1 2
      FAM_1 SAMP4_1 NA NA 1 2
      FAM_1 SAMP5_1 NA NA 2 2
      ...

      Without the --misparent option, the above dataset cannot be retrieved correctly, because WISARD expects 0 in order to recognize a founder sample as a founder. To read this FAM file correctly, the command below is required.

      Perform TDT using the dataset with 'test_miss0_parNA.fam':
      C:\Users\WISARD> wisard --bed test_miss0 --fam test_miss0_parNA.fam --misparent NA --tdt --out res_parentNA_tdt
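
      Why the missing-parent code matters can be seen in the Python sketch below, which applies a founder test to the lines of Example 5. The function name is_founder and its missing_parent parameter are hypothetical, and the rule used (both parental IDs equal to the missing code) is an assumption made for illustration.

```python
def is_founder(fam_line, missing_parent="0"):
    """Treat a sample as a founder when both parental IDs are missing."""
    fields = fam_line.split()
    pat, mat = fields[2], fields[3]
    return pat == missing_parent and mat == missing_parent


line = "FAM_1 SAMP1_1 NA NA 1 2"
print(is_founder(line))                       # default code 0: 'NA' is read as a parent ID
print(is_founder(line, missing_parent="NA"))  # with the code changed to NA
```

      With the default code, the string NA would be interpreted as the ID of a parent that does not exist in the dataset, which is why the pedigree cannot be reconstructed without --misparent.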

      Allowing skip/ignore specific columns for FAM file

      In rare cases, the input is formatted in a non-standard way, such as lacking a specific column. For example, some datasets omit the genetic distance field from their MAP file. Because such a file is not in a valid format, a special option is required to read it appropriately. Below is a list of options providing this functionality for specific columns of the FAM file.

      • --nofid assumes there is no column for FID (i.e., five-column format for FAM)
        Example 6 : Example FAM file that does not have an FID column
        S1 0 0 1 2
        S2 0 0 2 2
        S3 0 0 1 1
        S4 0 0 1 2
        Read the dataset, assuming that the FAM file does not have an FID column:
        C:\Users\WISARD> wisard --bed test_miss0.bed --fam test_miss0_nofid.fam --nofid
        NOTE!
        With this option, the dataset is treated as independent since there is no familial information! Hence, NO PARENT INFORMATION is allowed!
      • --singleparent allows either the paternal or the maternal ID to be missing (i.e., single parent). By default, WISARD does not allow a single parent, since it is generally unlikely to happen and breaks the usual structure of a pedigree. This option bypasses that limitation.
        Example 7 : Example FAM file that has a single parent
        F1 S1 S0 0 1 2
        F1 S2 0 0 2 2
        F1 S3 S1 S2 1 1
        F1 S4 S1 S2 1 2
        Read the dataset although there is a single parent in the FAM file:
        C:\Users\WISARD> wisard --bed test_miss0.bed --fam test_miss0_singleparent.fam --singleparent
      • --sepid assumes there is no column for FID, but that the IID column contains both FID and IID joined by a specific separator
        Example 8 : Example FAM file that does not have an FID column, with the actual FID and IID fused in the IID column
        F1_S1 0 0 1 2
        F1_S2 0 0 2 2
        F1_S3 0 0 1 1
        F1_S4 0 0 1 2
        Assign an alternative FAM file in which FID and IID are concatenated with a colon (:) character:
        C:\Users\WISARD> wisard --bed test_miss0.bed --fam test_miss0_fidiid.fam --sepid ":"
      • --noparent assumes there are no columns for paternal and maternal IID
      • --nosex assumes there is no column for sex. Since this option sets the sex of all samples to NA, --imputesex is required to do an analysis using sex.
      • --nopheno assumes there is no column for phenotype. Since this option removes the default phenotype, --sampvar and --pname are required to do an analysis using phenotype.
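
      As an illustration of the fused-ID case handled by --sepid, the Python sketch below splits the first column of a five-column FAM line into FID and IID. The function name split_fused_id is hypothetical, and splitting at the first occurrence of the separator is an assumption of this sketch.

```python
def split_fused_id(fam_line, sep):
    """Expand a five-column FAM line with a fused FID/IID into six fields."""
    fields = fam_line.split()
    # Split only at the first separator; raises ValueError if it is absent
    fid, iid = fields[0].split(sep, 1)
    return [fid, iid] + fields[1:]


print(split_fused_id("F1:S1 0 0 1 2", ":"))
print(split_fused_id("F1_S1 0 0 1 2", "_"))
```

      The separator passed to the function has to match the one actually used in the file; a line that does not contain it raises an error rather than being silently misread.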

      In addition to skipping such columns, it is also possible to ignore them. In other words, even if those columns are available in the input, they can be ignored with the options below.

      • --ignorefid ignores FID, and sets FID to its IID. By using this option, all samples become independent samples regardless of their original state.
      • --ignoreparent ignores parental information. By using this option, all samples' pedigree information will be discarded.

      Changing notation for phenotype of dichotomous phenotype

      By default, a dichotomous phenotype is recognized properly only when case/control (or affected/unaffected) status is coded as 2 and 1, respectively. However, some datasets code dichotomous phenotypes as 1/0 or 1/-1. To retrieve this kind of dataset correctly, use --1case if 1=case and 0=control, or --cact otherwise.

      NOTE!
      Note that this option is applied to alternative phenotype!
      Perform logistic regression with 1=case,0=control:
      C:\Users\WISARD> wisard --bed test_miss0.bed --sampvar test_miss0_phen.txt --pname medi01 --1case --regression --out res_regr_1case
      Perform logistic regression with 1=case,-1=control:
      C:\Users\WISARD> wisard --bed test_miss0.bed --sampvar test_miss0_phen.txt --pname medi1m1 --cact 1,-1 --regression --out res_regr_1-1
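
      Conceptually, both options translate a custom case/control coding into the default 2/1 coding. The Python sketch below shows this translation; the function name recode_case_control is hypothetical, and treating any other value as missing (-9) is an assumption of this sketch rather than documented WISARD behavior.

```python
def recode_case_control(values, case, control, missing="-9"):
    """Translate custom case/control codes to the default 2=case, 1=control.

    Values that match neither code are marked as missing (an assumption
    made for this illustration).
    """
    recoded = []
    for v in values:
        if v == case:
            recoded.append("2")
        elif v == control:
            recoded.append("1")
        else:
            recoded.append(missing)
    return recoded


print(recode_case_control(["1", "0", "1"], case="1", control="0"))     # 1=case, 0=control
print(recode_case_control(["1", "-1", "NA"], case="1", control="-1"))  # 1=case, -1=control
```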

      Changing notation for gender coding

      By default, WISARD recognizes the sex of a sample as 2=female and 1=male. To alter this notation, use --1sex or --mafe. If the dataset is encoded as 1=female and 0=male, use --1sex; otherwise, use --mafe.

      Perform logistic regression with 1=female,0=male:
      C:\Users\WISARD> wisard --bed test_miss0.bed --fam test_miss0_1sex.fam --1sex --regression --out res_regr_1sex
      Perform logistic regression with M=male,F=female:
      C:\Users\WISARD> wisard --bed test_miss0.bed --fam test_miss0_MFsex.fam --mafe M,F --regression --out res_regr_MF

      Changing allele separator in PED/TPED/LGEN

      By default, the two alleles of a genotype must be separated by whitespace. However, some PED/TPED/LGEN files use a different separator between the two alleles of a genotype. Below is an example of such data.

      Example 9 : Non-standard allele separator(,) PED file
      S1 S1 0 0 1 -9 A,C A,G T,G
      S2 S2 0 0 2 1 C,C A,A G,G
      S3 S3 0 0 2 -9 C,C G,G T,G
      S4 S4 0 0 1 2 C,A G,G T,G

      In this case, this kind of file can be retrieved using --sepallele option.

      Reading the above PED file:
      C:\Users\WISARD> wisard --ped test_miss0_comma.ped --sepallele ,
      NOTE!
      This option is only applicable when the input is a PED, TPED, or LGEN file!

      Alternatively, in rare cases, there may be no allele separator at all in PED/TPED/LGEN files:

      Example 10 : No allele separator PED file
      S1 S1 0 0 1 -9 AC AG TG
      S2 S2 0 0 2 1 CC AA GG
      S3 S3 0 0 2 -9 CC GG TG
      S4 S4 0 0 1 2 CA GG TG

      This kind of input can be retrieved using --consecallele option.

      Reading the above PED file:
      C:\Users\WISARD> wisard --ped test_miss0_consec.ped --consecallele
      NOTE!
      This option cannot handle indels, so use it with care!
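
      The two non-standard layouts, and the reason the separator-free layout cannot represent indels, can be sketched in Python as follows. The function name split_genotype is hypothetical; the sketch assumes one genotype per whitespace-delimited token and single-character alleles when no separator is present.

```python
def split_genotype(token, sep=None):
    """Split one genotype token into its two alleles.

    With `sep` given, the token is split at that separator (e.g. 'A,C').
    With `sep` omitted, the token must be exactly two characters ('AC'),
    because there is no way to tell where a longer allele would end.
    """
    if sep is None:
        if len(token) != 2:
            raise ValueError("cannot split %r without a separator" % token)
        return token[0], token[1]
    a1, a2 = token.split(sep)  # raises ValueError unless exactly two parts
    return a1, a2


print(split_genotype("A,C", ","))  # separator-based layout (Example 9)
print(split_genotype("AC"))        # consecutive layout (Example 10)
```

      A token such as ATG is ambiguous without a separator (A/TG or AT/G), so the sketch rejects it.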

      Skip some of lines in the dataset from the first

      Some datasets have extra information at the beginning of the file. To load such a dataset with ordinary toolsets, that information must first be removed so that the toolset can read the dataset properly. With WISARD, such a dataset can be loaded without modification by skipping a specific number of lines at the beginning of the file, using the --nskip option. For example, suppose the dataset 'test_miss0_comment.ped' has the following contents.

      Example 11 : An example PED file containing a three-line header
      ##########################
      # This data is an example
      ##########################
      S1 S1 0 0 1 NA A C A G T G
      S2 S2 0 0 2 1 C C A A G G
      S3 S3 0 0 2 NA C C G G T G
      S4 S4 0 0 1 2 C A G G T G

      The above dataset can be retrieved with the --nskip 3 option, as in the command below.

      Convert test_miss0_comment.ped to PLINK binary PED format, accounting for the header:
      C:\Users\WISARD> wisard --ped test_miss0_comment.ped --makebed --out test_converted --nskip 3
      NOTE!
      Currently this option is only applicable to the PED/RAW/TPED file formats!
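
      Skipping a fixed number of leading lines can be sketched in Python as below, using the contents of Example 11. The function name read_records is hypothetical, and ignoring blank lines is an extra assumption of this sketch.

```python
from itertools import islice


def read_records(lines, nskip=0):
    """Return whitespace-split records after skipping the first `nskip` lines."""
    return [ln.split() for ln in islice(lines, nskip, None) if ln.strip()]


example = [
    "##########################",
    "# This data is an example",
    "##########################",
    "S1 S1 0 0 1 NA A C A G T G",
    "S2 S2 0 0 2 1 C C A A G G",
]
records = read_records(example, nskip=3)
print(len(records), records[0][0])
```

      Note that the lines are skipped by position, not by content, so the number given must match the length of the header exactly.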

      Generating specific format from input dataset

      WISARD can export the loaded dataset in specific formats. Currently, the formats below can be generated.

      • PLINK PED file : Generates .ped(genotype & pedigree, sex and single phenotype) and .map(variant info. without allele info.)
      • Binary PED file : Generates .bed(binary-coded genotype), .fam(pedigree, sex and phenotype) and .bim(variant info. with allele info.)
      • Transposed PED file : Generates .tped(Variant info. and genotype) and .tfam(pedigree, sex and single phenotype)
      • LGEN file : Generates .lgen(FID, IID and genotype) and .map(variant info. without allele info.)
      • VCF file : Generates .vcf(genotype and others) and .fam(pedigree, sex and phenotype).
      • RAW file : Generates .raw(minor allele, number-coded genotype, pedigree, sex and single phenotype)
      • GEN file : Generates .gen(allele info. and probability-coded genotype) and .sample(FID, IID and multiple phenotypes and covariates)
      • Binary GEN file : Generates .bgen(equivalent to .gen but binary-coded) and .sample(FID, IID and multiple phenotypes and covariates)

      In addition to the widely used file formats, subsets of the entire dataset can also be generated (listed below).

      • Phenotype file : First-line header for phenotype names and following phenotype value records equivalent to the number of samples in the dataset. (FID, IID and phenotypes)
      • Covariates file : Same as phenotype file but covariates value records.
      • Genotype file : A plain matrix file with n by p dimension, where n is number of samples and p is number of variants.

      Generate summary after data loading

      After a dataset is successfully loaded, it is possible to generate a summary of which variants and samples were actually loaded.

      Export list of variants after filtering:
      C:\Users\WISARD> wisard --bed test_miss2.bed --filgvar [0,0.8] --listvariant --out res_varlist
      variant.lst is a list of the variant IDs that are actually included in the final dataset (TSV).
      Column  Format  Modifier  Description
      NAME    string  NONE      Variant name
      MAJOR   string  NONE      Major allele
      MINOR   string  NONE      Minor allele
      Export list of samples after filtering:
      C:\Users\WISARD> wisard --bed test_miss2.bed --filgind [0,0.8] --listsample --out res_samplist
      sample.lst is a list of the sample IDs that are actually included in the final dataset (TSV).
      Column     Format  Modifier  Description
      FID        string  NONE      Family ID
      IID        string  NONE      Individual ID
      PAT        string  NONE      Paternal ID
      MAT        string  NONE      Maternal ID
      SEX        string  NONE      Sex
      PHENOTYPE  string  NONE      Phenotype
      Export list of founders after filtering:
      C:\Users\WISARD> wisard --bed test_miss2.bed --filgind [0,0.8] --listfounder --out res_fndlist
      founder.lst is a list of the founder sample IDs that are actually included in the final dataset (TSV).
      Column     Format  Modifier  Description
      FID        string  NONE      Family ID
      IID        string  NONE      Individual ID
      SEX        string  NONE      Sex
      PHENOTYPE  string  NONE      Phenotype


      Last modified : 2017-09-13 11:43:37