WISARD[wɪzərd]
Workbench for Integrated Superfast Association study with Related Data

Convert/split/merge

This section is about

  • Conversion
    • PLINK PED format
    • Binary PED format
    • Transposed PED format
    • Long file format
    • Number-coded format
    • Variant Calling Format (VCF)
    • GEN file format
    • Binary GEN file format
  • Conversion-related options
    • Notation for missing phenotype
    • Notation for missing genotype
    • Notation for case/control phenotype
  • Other exportable dataset
  • Splitting input dataset
  • Merging multiple files

        Using WISARD, a dataset can be exported in a specific format. The formats described below are currently supported.

        Conversion

        WISARD can convert a loaded dataset into many other formats. The supported file formats and their descriptions are given below. In the example below, <NA> indicates that a sample's parents are missing, which means the sample is a founder.

        Example 1 : Example pedigree file
        FAM_ID SAMPLE_ID FATHER_ID MOTHER_ID SEX AFFECTED
        FAM1 SAMP1 <NA> <NA> MALE YES
        FAM1 SAMP2 <NA> <NA> FEMALE NO
        FAM1 SAMP3 SAMP1 SAMP2 MALE YES
        FAM1 SAMP4 <NA> <NA> MALE NO
        FAM1 SAMP5 <NA> <NA> FEMALE YES
        FAM1 SAMP6 SAMP4 SAMP5 FEMALE YES
        FAM1 SAMP7 SAMP3 SAMP6 MALE NO
        FAM1 SAMP8 SAMP3 SAMP6 FEMALE YES
        FAM1 SAMP9 SAMP3 SAMP6 MALE YES
        Example 2 : Example genotype file
        VARIANT_ID CHROMOSOME POSITION GENOMIC_DIST MAJOR MINOR GT_SAMP1 GT_SAMP2 ...
        var138831 3 1728238 0.3 T C C,T T,T
        var3238172 3 48174765 0.8 G C G,G G,G
        var72158 4 36104958 0.05 A G A,A G,G
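
The founder rule can be checked directly against Example 1. The following is a minimal sketch, not part of WISARD: it assumes whitespace-separated columns in the order shown in the example header, and the function names are hypothetical.

```python
# Rows of Example 1: FAM_ID SAMPLE_ID FATHER_ID MOTHER_ID SEX AFFECTED
PEDIGREE_TEXT = """\
FAM1 SAMP1 <NA> <NA> MALE YES
FAM1 SAMP2 <NA> <NA> FEMALE NO
FAM1 SAMP3 SAMP1 SAMP2 MALE YES
FAM1 SAMP4 <NA> <NA> MALE NO
FAM1 SAMP5 <NA> <NA> FEMALE YES
FAM1 SAMP6 SAMP4 SAMP5 FEMALE YES
FAM1 SAMP7 SAMP3 SAMP6 MALE NO
FAM1 SAMP8 SAMP3 SAMP6 FEMALE YES
FAM1 SAMP9 SAMP3 SAMP6 MALE YES
"""

MISSING_PARENT = "<NA>"  # notation used in Example 1


def parse_pedigree(text):
    """Turn whitespace-separated pedigree lines into dictionaries."""
    keys = ("family", "sample", "father", "mother", "sex", "affected")
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == len(keys):  # skip blank or malformed lines
            rows.append(dict(zip(keys, fields)))
    return rows


def founders(rows):
    """A sample is a founder when both parents are missing."""
    return [
        row["sample"]
        for row in rows
        if row["father"] == MISSING_PARENT and row["mother"] == MISSING_PARENT
    ]


if __name__ == "__main__":
    print(founders(parse_pedigree(PEDIGREE_TEXT)))
    # ['SAMP1', 'SAMP2', 'SAMP4', 'SAMP5']
```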

        PLINK PED format

        The option --makeped generates .ped (genotype, pedigree, sex and a single phenotype) and .map (marker information without allele information).

        Binary PED format

        The option --makebed generates .bed (binary-coded genotype), .fam (pedigree, sex and phenotype), and .bim (marker information with allele information).

        NOTE!
        By default, the BED file is generated in SNP-major format. It can be written in individual-major format instead with the --sampmajor option.
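
Whether a given .bed file is SNP-major or individual-major can be read from its header. The sketch below assumes that WISARD writes the standard PLINK 1 binary PED header (two magic bytes 0x6c 0x1b followed by a mode byte, 0x01 for SNP-major and 0x00 for individual-major); this page does not confirm that, and the function name is hypothetical.

```python
import os
import tempfile

BED_MAGIC = b"\x6c\x1b"  # first two bytes of a PLINK 1 binary PED file


def bed_layout(path):
    """Return 'SNP-major' or 'individual-major' based on the third header byte."""
    with open(path, "rb") as handle:
        header = handle.read(3)
    if len(header) < 3 or header[:2] != BED_MAGIC:
        raise ValueError("not a PLINK 1 binary PED file: " + path)
    if header[2] == 1:
        return "SNP-major"
    if header[2] == 0:
        return "individual-major"
    raise ValueError("unknown mode byte: %d" % header[2])


if __name__ == "__main__":
    # Write a tiny fake header so that the sketch runs without real data.
    fd, demo_path = tempfile.mkstemp(suffix=".bed")
    os.write(fd, BED_MAGIC + b"\x01")
    os.close(fd)
    print(bed_layout(demo_path))
    os.remove(demo_path)
```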

        Transposed PED format

        The option --maketped generates .tped (marker information and genotype) and .tfam (pedigree, sex and a single phenotype).

        Long file format

        The option --makelgen generates .lgen (FID, IID and genotype) and .map (marker information without allele information).

        Number-coded format

        This format generates .raw (minor allele, number-coded genotype, pedigree, sex, and a single phenotype). The available codings are additive (--makeraw), dominant (--makedom) and recessive (--makerec).

        NOTE!
        By default, this format includes a header. The header can be omitted with the --outnoheader option!
        NOTE!
        By default, this format exports six mandatory columns (FID, IID, paternal and maternal IID, sex, and default phenotype). The phenotype can be exported separately with the --outphenoonly option!
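
To illustrate the three codings, the sketch below counts minor alleles for genotypes written as in Example 2. It assumes the usual genetic-model definitions (additive: number of minor alleles; dominant: 1 if at least one minor allele is present; recessive: 1 if both alleles are minor) and treats 0 as the missing allele. The page does not spell out WISARD's exact definitions, so these are assumptions, and the function names are hypothetical.

```python
def count_minor(genotype, minor, missing="0"):
    """Count minor alleles in a genotype such as 'C,T'; None if missing."""
    alleles = genotype.split(",")
    if len(alleles) != 2 or missing in alleles:
        return None
    return sum(1 for allele in alleles if allele == minor)


def encode(genotype, minor, model="additive"):
    """Number-code one genotype under the assumed model definitions."""
    if model not in ("additive", "dominant", "recessive"):
        raise ValueError("unknown model: " + model)
    n_minor = count_minor(genotype, minor)
    if n_minor is None:
        return None
    if model == "additive":
        return n_minor
    if model == "dominant":
        return 1 if n_minor >= 1 else 0
    return 1 if n_minor == 2 else 0  # recessive


if __name__ == "__main__":
    # var138831 in Example 2: major allele T, minor allele C
    for genotype in ("C,T", "T,T"):
        print(genotype, [encode(genotype, "C", m)
                         for m in ("additive", "dominant", "recessive")])
```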

        Variant Calling Format (VCF)

        !!! Experimental function !!!

        The option --makevcf generates .vcf (genotype and other information) and .fam (pedigree, sex and phenotype). For details, see this page.

        GEN file format

        The option --makegen generates .gen (allele information and probability-coded genotype) and .sample (FID, IID, and multiple phenotypes and covariates).

        Binary GEN file format

        The option --makebgen generates a binary GEN format dataset, consisting of .bgen (equivalent to .gen but binary-coded) and .sample (FID, IID, and multiple phenotypes and covariates).

        NOTE!
        By default, the genotype probability data of a Binary GEN file are not compressed but stored in a computationally efficient form. To reduce the size of the dataset, use the --zipbgen option!

        Conversion-related options

        During conversion, details of the exported dataset can be adjusted with the following options:

        Notation for missing phenotype

        By default, a missing phenotype is exported as -9. The --outmispheno option changes this notation.

        Convert a PED file whose missing phenotype is coded as NA into a PED file with the missing phenotype coded as UNKNOWN
        C:\Users\WISARD> wisard --ped test_miss0_pheNA.ped --mispheno NA --makeped --outmispheno UNKNOWN --out conv_mispheno_UNKNOWN

        Notation for missing genotype

        The default notation for a missing genotype is 0 for most of the file formats supported by WISARD, or '.' for a VCF file. The --outmisgeno option changes this notation.

        Convert a BED file into a PED file with the missing genotype coded as N
        C:\Users\WISARD> wisard --bed test_miss2.bed --makeped --outmisgeno N --out conv_misgeno_N
        NOTE!
        File formats with a fixed notation for missing genotypes are unaffected by this option!

        Notation for case/control phenotype

        By default, a dichotomous phenotype is exported as 2 (case/affected) and 1 (control/unaffected). The --out1case and --outcact options can alter this.

        Other exportable dataset

        In addition to the widely used file formats, the following subsets can also be generated:

        • Phenotype file: A first-line header of phenotype names, followed by one record of phenotype values per sample in the dataset (FID, IID and phenotypes).
        • Covariates file: Same as the phenotype file, but with covariate values.
        • Genotype file: A plain n by p matrix file, where n is the number of samples and p is the number of markers.

        Splitting input dataset

        Sometimes, especially with large-scale NGS data, the whole dataset is too large to handle at once. In such cases, splitting the dataset by a specific criterion allows more flexible handling. WISARD provides splitting through the --split option.

        Splitting large dataset by chromosome
        C:\Users\WISARD> wisard --bed test_miss0.bed --split --makebed --out split_dataset

        The above command produces chromosome-wise Binary PED files from test_miss0.bed.

        Splitting large dataset by family
        C:\Users\WISARD> wisard --bed test_miss0.bed --famsplit --makebed --out split_dataset

        The above command produces family-wise Binary PED files from test_miss0.bed.
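
Conceptually, both kinds of split group records by one key: the chromosome for variants, or the family ID for samples. The sketch below shows that grouping on the variants of Example 2 and a few samples of Example 1. It only illustrates the idea and does not reflect how WISARD implements splitting or names its output files; the function name is hypothetical.

```python
from collections import defaultdict

# (VARIANT_ID, CHROMOSOME, POSITION) taken from Example 2
VARIANTS = [
    ("var138831", "3", 1728238),
    ("var3238172", "3", 48174765),
    ("var72158", "4", 36104958),
]

# (FAM_ID, SAMPLE_ID) taken from Example 1
SAMPLES = [("FAM1", "SAMP1"), ("FAM1", "SAMP2"), ("FAM1", "SAMP3")]


def split_records(records, key_index, value_index):
    """Group record values by the field at key_index, keeping input order."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[value_index])
    return dict(groups)


if __name__ == "__main__":
    print(split_records(VARIANTS, key_index=1, value_index=0))
    # {'3': ['var138831', 'var3238172'], '4': ['var72158']}
    print(split_records(SAMPLES, key_index=0, value_index=1))
    # {'FAM1': ['SAMP1', 'SAMP2', 'SAMP3']}
```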

        Merging multiple files

        WISARD can load multiple files through the --merge option. A merge involves two components: (1) the base dataset, specified with the ordinary input-related options, and (2) the datasets to be merged into the base dataset. The --merge option specifies the second component and tells WISARD that there are datasets to be merged. The option requires an argument, which can be either (1) a sequence of paths separated by commas, with no other separator, or (2) the path of a file that lists the files to be merged. Form (1) can specify only one dataset, while form (2) can define multiple datasets. Every path given to --merge must satisfy the conditions below.

        1. Every file path must be an absolute path or a valid relative path.
        2. Each dataset must be given as a set of paths in the following order:
          ( Binary PED file format : bed, bim and fam )
          ( PED file format : ped and map )
          ( Long file format : lgen and map )
          ( Transposed PED file format : tped and tfam )
          ( VCF format : vcf and fam )
        3. The extension of the first file must match the first extension required by its format.
        4. If the following paths of a dataset have exactly the same name apart from the extension, they can be omitted.
        5. In case (2), the file must consist of lines, and each line must describe a single dataset.

        NOTE!
        Unlike other options, --merge does not export the merged dataset by itself; an export-related option must also be given.
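
The path conditions can be illustrated by expanding one line of a merge list into its full set of paths. This is a sketch of one reading of the conditions, not WISARD's parser: it assumes that paths on a line are separated by whitespace (as in the merge list example on this page) and that omitted companion files share the first file's name. The function and table names are hypothetical.

```python
import os

# Companion extensions that follow the first file, in the required order
COMPANIONS = {
    ".bed": [".bim", ".fam"],
    ".ped": [".map"],
    ".lgen": [".map"],
    ".tped": [".tfam"],
    ".vcf": [".fam"],
}


def expand_dataset(line):
    """Return the full path list for one dataset line of a merge list."""
    paths = line.split()
    if not paths:
        raise ValueError("empty dataset line")
    stem, extension = os.path.splitext(paths[0])
    if extension not in COMPANIONS:
        raise ValueError("unsupported first extension: " + extension)
    defaults = [stem + companion for companion in COMPANIONS[extension]]
    given = paths[1:]
    if len(given) > len(defaults):
        raise ValueError("too many paths for one dataset")
    # Paths that were written out keep their position; the rest are inferred.
    return [paths[0]] + given + defaults[len(given):]


if __name__ == "__main__":
    print(expand_dataset("japan_sequenced.vcf japan_sequenced.fam"))
    print(expand_dataset("china_peking_cohort1.ped"))
```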

        For more flexible merging, the --mergemode option can also be used. The merge modes below are currently available. Note that the merge mode applies only to overlapping genotypes.

        1. Consensus mode (default): If the genotype in the base dataset is missing, it is replaced by the genotype from the dataset being merged. If a non-missing or already replaced genotype is discordant with a following genotype, it is marked as missing and never replaced afterwards.
        2. Replace missing only: If the genotype in the base dataset is missing, it is replaced. Otherwise nothing is done.
        3. Replace all non-missing: If the genotype in the dataset to be merged is non-missing, it replaces the existing genotype. Otherwise nothing is done.
        4. Do not replace at all: The base genotype is preserved for every overlap, whether it is conflicting or missing.
        5. Replace without condition: Any overlapping genotype is replaced.
        An example of --merge
        C:\Users\WISARD> wisard --bed korean_base.bed --merge merge_list.txt --makebed --out merged
        Example 3 : Contents of merge_list.txt
        japan_sequenced.vcf japan_sequenced.fam
        china_peking_cohort1.ped china_peking_cohort1.map
        The example below illustrates the differences among the merge modes.
        Example 4 : Difference across merge mode
        dataset1  dataset2  dataset3  dataset4  |  mode1  mode2  mode3  mode4  mode5
        -----------------------------------------------------------------------------
        0/0       A/A       A/G       A/A       |  0/0    A/A    A/A    0/0    A/A
        0/0       0/0       G/T       G/T       |  G/T    G/T    G/T    0/0    G/T
        A/C       0/0       A/T       A/C       |  0/0    A/C    A/C    A/C    A/C
        A/C       A/C       A/C       0/0       |  A/C    A/C    A/C    A/C    0/0
        0/0       0/0       A/T       A/C       |  0/0    A/T    A/C    0/0    A/C
        For efficiency, no report of conflicts or replacements is produced during merging unless the --mergereport option is given.
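
The five merge modes can be written as a small function and checked against Example 4. This is a sketch of the rules as described on this page, not WISARD's code. It assumes that dataset1 is the base dataset, that the other datasets are merged in column order, and that 0/0 denotes a missing genotype. The mode numbers follow the order of the list of modes; whether --mergemode accepts these numbers as arguments is not stated here. Genotypes are compared as plain strings, so A/C and C/A would count as discordant in this sketch.

```python
MISSING = "0/0"  # missing-genotype notation used in Example 4


def merge_genotype(base, incoming, mode):
    """Merge one overlapping genotype.

    base: genotype in the base dataset
    incoming: genotypes of the datasets to be merged, in merge order
    mode: 1 to 5, following the order of the merge-mode list
    """
    if mode not in (1, 2, 3, 4, 5):
        raise ValueError("unknown merge mode: %r" % (mode,))
    if mode == 4:  # do not replace at all
        return base
    if mode == 5:  # replace without condition: the last dataset wins
        return incoming[-1] if incoming else base
    current = base
    for genotype in incoming:
        if mode == 1:  # consensus
            if genotype == MISSING:
                continue
            if current == MISSING:
                current = genotype
            elif genotype != current:
                return MISSING  # discordant: marked missing, never replaced
        elif mode == 2:  # replace missing only
            if current == MISSING and genotype != MISSING:
                current = genotype
        else:  # mode 3: replace with every non-missing genotype
            if genotype != MISSING:
                current = genotype
    return current


# Rows of Example 4: (dataset1..dataset4, expected result for mode1..mode5)
EXAMPLE_4 = [
    (["0/0", "A/A", "A/G", "A/A"], ["0/0", "A/A", "A/A", "0/0", "A/A"]),
    (["0/0", "0/0", "G/T", "G/T"], ["G/T", "G/T", "G/T", "0/0", "G/T"]),
    (["A/C", "0/0", "A/T", "A/C"], ["0/0", "A/C", "A/C", "A/C", "A/C"]),
    (["A/C", "A/C", "A/C", "0/0"], ["A/C", "A/C", "A/C", "A/C", "0/0"]),
    (["0/0", "0/0", "A/T", "A/C"], ["0/0", "A/T", "A/C", "0/0", "A/C"]),
]

if __name__ == "__main__":
    for datasets, expected in EXAMPLE_4:
        merged = [merge_genotype(datasets[0], datasets[1:], mode)
                  for mode in range(1, 6)]
        print(datasets, "->", merged, "matches table:", merged == expected)
```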

        Last modified : 2017-09-13 11:14:52