WISARD official site

Convert/split/merge

This section covers:

Conversion
- PLINK PED format
- Binary PED format
- Transposed PED format
- Long file format
- Number-coded format
- Variant Call Format (VCF)
- GEN file format
- Binary GEN file format
Conversion-related options
- Notation for missing phenotype
- Notation for missing genotype
- Notation for case/control phenotype
Other exportable dataset
Splitting input dataset
Merging multiple files

WISARD can export a dataset in a specific file format. The formats described below are currently supported.

Conversion

WISARD can convert a loaded dataset into many other formats. The supported file formats and their descriptions are given below. In the examples below, <NA> indicates that a sample's parents are missing, which means the sample is a founder.

Example 1 : Example pedigree file

FAM_ID  SAMPLE_ID  FATHER_ID  MOTHER_ID  SEX     AFFECTED
FAM1    SAMP1      <NA>       <NA>       MALE    YES
FAM1    SAMP2      <NA>       <NA>       FEMALE  NO
FAM1    SAMP3      SAMP1      SAMP2      MALE    YES
FAM1    SAMP4      <NA>       <NA>       MALE    NO
FAM1    SAMP5      <NA>       <NA>       FEMALE  YES
FAM1    SAMP6      SAMP4      SAMP5      FEMALE  YES
FAM1    SAMP7      SAMP3      SAMP6      MALE    NO
FAM1    SAMP8      SAMP3      SAMP6      FEMALE  YES
FAM1    SAMP9      SAMP3      SAMP6      MALE    YES

Example 2 : Example genotype file

VARIANT_ID  CHROMOSOME  POSITION  GENOMIC_DIST  MAJOR  MINOR  GT_SAMP1  GT_SAMP2  ...
var138831   3           1728238   0.3           T      C      C,T       T,T
var3238172  3           48174765  0.8           G      C      G,G       G,G
var72158    4           36104958  0.05          A      G      A,A       G,G

PLINK PED format

The --makeped option generates .ped (genotype, pedigree, sex, and a single phenotype) and .map (marker information without allele information).
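
To make the layout concrete, the following Python sketch builds .ped and .map rows from the first two samples of Example 1 and the variants of Example 2. It follows the usual PLINK column conventions (0 for a missing parent, 1/2 for male/female, 1/2 for control/case). The function names and data layout are hypothetical and only illustrate the format; they are not WISARD's implementation.

```python
# Sketch: build PLINK-style .ped and .map rows from the example tables.
# ped_line/map_line are illustrative helpers, not part of WISARD.

SEX_CODE = {"MALE": "1", "FEMALE": "2"}   # PLINK convention: 1 = male, 2 = female
PHENO_CODE = {"YES": "2", "NO": "1"}      # 2 = case/affected, 1 = control/unaffected

def ped_line(fid, iid, father, mother, sex, affected, genotypes):
    """Return one .ped row: six pedigree columns, then two alleles per variant."""
    parents = [p if p is not None else "0" for p in (father, mother)]  # 0 = no parent
    alleles = [a for gt in genotypes for a in gt.split(",")]
    return " ".join([fid, iid, *parents, SEX_CODE[sex], PHENO_CODE[affected], *alleles])

def map_line(chrom, variant_id, genomic_dist, position):
    """Return one .map row: chromosome, variant ID, genetic distance, position."""
    return " ".join([str(chrom), variant_id, str(genomic_dist), str(position)])

# (variant ID, chromosome, position, genomic distance) as listed in Example 2
variants = [
    ("var138831", 3, 1728238, 0.3),
    ("var3238172", 3, 48174765, 0.8),
    ("var72158", 4, 36104958, 0.05),
]

samp1 = ped_line("FAM1", "SAMP1", None, None, "MALE", "YES", ["C,T", "G,G", "A,A"])
samp2 = ped_line("FAM1", "SAMP2", None, None, "FEMALE", "NO", ["T,T", "G,G", "G,G"])
map_rows = [map_line(chrom, vid, dist, pos) for vid, chrom, pos, dist in variants]

print(samp1)
print(samp2)
print("\n".join(map_rows))
```

Note that the .map rows carry no allele columns, which is why the MAJOR and MINOR columns of Example 2 are not used here.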

Binary PED format

The --makebed option generates .bed (binary-coded genotype), .fam (pedigree, sex, and phenotype), and .bim (marker information with allele information).

NOTE!

By default, the BED file is generated in SNP-major format. It can be transposed to individual-major format with the --sampmajor option.

Transposed PED format

The --maketped option generates .tped (marker information and genotype) and .tfam (pedigree, sex, and a single phenotype).

Long file format

The --makelgen option generates .lgen (FID, IID, and genotype) and .map (marker information without allele information).

Number-coded format

This format generates .raw (minor allele, number-coded genotype, pedigree, sex, and a single phenotype). Available codings are additive (--makeraw), dominant (--makedom), and recessive (--makerec).

NOTE!

By default, this format includes a header. The header can be omitted with the --outnoheader option.

NOTE!

By default, this format exports six mandatory columns (FID, IID, paternal IID, maternal IID, sex, and default phenotype). The phenotype can be exported separately with the --outphenoonly option.
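
The three codings can be illustrated with a short Python sketch. It assumes the conventional genetic-model definitions: additive is the count of minor alleles, dominant is 1 when at least one minor allele is present, and recessive is 1 only when both alleles are minor. The documentation above does not spell these definitions out, so treat them as an assumption; the function names are hypothetical.

```python
# Sketch of number-coded genotypes under the conventional genetic models.
# Assumption: WISARD's --makeraw/--makedom/--makerec follow these definitions.

def minor_count(genotype, minor):
    """Count minor alleles in a genotype written as 'A1,A2' (as in Example 2)."""
    return sum(1 for allele in genotype.split(",") if allele == minor)

def number_code(genotype, minor, model="additive"):
    """Return the numeric code of a genotype for the chosen model."""
    n = minor_count(genotype, minor)
    if model == "additive":
        return n                 # 0, 1, or 2 copies of the minor allele
    if model == "dominant":
        return int(n >= 1)       # 1 if the minor allele is present at all
    if model == "recessive":
        return int(n == 2)       # 1 only for two copies of the minor allele
    raise ValueError(f"unknown model: {model}")

# var138831 (minor allele C) for SAMP1, and var72158 (minor allele G) for SAMP2
for gt, minor in [("C,T", "C"), ("G,G", "G")]:
    codes = [number_code(gt, minor, m) for m in ("additive", "dominant", "recessive")]
    print(gt, minor, codes)
```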

Variant Call Format (VCF)

!!! Experimental function !!!

The --makevcf option generates .vcf (genotype and other information) and .fam (pedigree, sex, and phenotype). For details, see this page.

GEN file format

The --makegen option generates .gen (allele information and probability-coded genotype) and .sample (FID, IID, and multiple phenotypes and covariates).

Binary GEN file format

The --makebgen option generates a binary GEN format dataset, made of .bgen (equivalent to .gen but binary-coded) and .sample (FID, IID, and multiple phenotypes and covariates).

NOTE!

By default, the genotype probability data of a Binary GEN file is not compressed but stored in a computationally efficient form. To reduce the size of the dataset, the --zipbgen option can be used to compress it.

Conversion-related options

During conversion, the following options control details of the exported dataset:

Notation for missing phenotype

By default, a missing phenotype is exported as -9. The --outmispheno option changes this notation.

Rewrite a PED file, changing the notation for missing phenotype from NA to UNKNOWN
C:\Users\WISARD> wisard --ped test_miss0_pheNA.ped --mispheno NA --makeped --outmispheno UNKNOWN --out conv_mispheno_UNKNOWN

Notation for missing genotype

The default notation for a missing genotype is 0 for most file formats supported by WISARD, or '.' for VCF files. The --outmisgeno option changes this notation.

Convert a BED file into a PED file, changing the notation for missing genotype to N
C:\Users\WISARD> wisard --bed test_miss2.bed --makeped --outmisgeno N --out conv_misgeno_N

NOTE!

File formats with a fixed notation for missing genotype are unaffected by this option.
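
The effect of these two options can be pictured as a substitution applied while each value is written out. The Python sketch below is only an illustration of that idea under the defaults stated above (-9 for phenotype, 0 for genotype); the function names are hypothetical and None stands in for a missing value.

```python
# Sketch: substitute a configurable notation for missing values on export.
# render_phenotype/render_genotype are illustrative names, not WISARD functions.

def render_phenotype(value, missing="-9"):
    """Return the phenotype as text, using `missing` when the value is absent."""
    return missing if value is None else str(value)

def render_genotype(alleles, missing="0"):
    """Return the two alleles as text, using `missing` for each absent allele."""
    return " ".join(missing if allele is None else allele for allele in alleles)

print(render_phenotype(None))                     # default notation
print(render_phenotype(None, missing="UNKNOWN"))  # as with --outmispheno UNKNOWN
print(render_genotype((None, None)))              # default notation
print(render_genotype((None, None), missing="N")) # as with --outmisgeno N
```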

Notation for case/control phenotype

By default, a dichotomous phenotype is exported as 2 (case/affected) and 1 (control/unaffected). The --out1case and --outcact options change this notation.

Other exportable dataset

In addition to the widely used file formats, the following subsets can also be generated:

Phenotype file: A first-line header with the phenotype names, followed by one record of phenotype values per sample in the dataset (FID, IID, and phenotypes).
Covariates file: Same as the phenotype file, but with covariate values.
Genotype file: A plain n-by-p matrix file, where n is the number of samples and p is the number of markers.

Splitting input dataset

Sometimes, especially with large-scale NGS data, the entire dataset is too large to handle at once. In that situation, splitting the dataset by a specific criterion allows more flexible handling. WISARD provides this functionality via the --split option.

Splitting a large dataset by chromosome
C:\Users\WISARD> wisard --bed test_miss0.bed --split --makebed --out split_dataset

The command above produces one Binary PED fileset per chromosome from test_miss0.bed.

Splitting a large dataset by family
C:\Users\WISARD> wisard --bed test_miss0.bed --famsplit --makebed --out split_dataset

The command above produces one Binary PED fileset per family from test_miss0.bed.
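
Conceptually, both kinds of split are a grouping of records by a key, followed by one export per group. The Python sketch below shows the grouping step for the variants of Example 2; the helper name and record layout are hypothetical and say nothing about how WISARD performs the split internally.

```python
# Sketch: group records by a key, as a chromosome-wise or family-wise split would.
from collections import defaultdict

def split_by(records, key):
    """Group records (dicts) by the value stored under `key`, keeping input order."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)

variants = [
    {"id": "var138831", "chromosome": 3},
    {"id": "var3238172", "chromosome": 3},
    {"id": "var72158", "chromosome": 4},
]

by_chromosome = split_by(variants, "chromosome")
for chrom, group in sorted(by_chromosome.items()):
    print(chrom, [variant["id"] for variant in group])
```

The same helper applied to sample records with a family ID key gives the family-wise grouping.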

Merging multiple files

WISARD supports loading multiple files via the --merge option. Merging in WISARD involves two components: (1) the base dataset, assigned by the ordinary input-related options, and (2) the datasets to be merged with the base dataset. The --merge option specifies the second component and tells WISARD that there are one or more datasets to merge in. The option requires an argument, which can be either (1) a comma-separated sequence of paths with no other separator, or (2) the path of a file that lists the files to be merged. Form (1) can assign only one dataset, while form (2) can define multiple datasets. Every path given to --merge must satisfy the conditions below.

Every file path must be an absolute path or a valid relative path
Each dataset must be represented by a set of paths in the given sequence

For the first file, the extension must match the one required for its format
If the following paths of a dataset have exactly the same name except for the extension, they can be omitted
In case of (2), the file must consist of lines, and each line must indicate a single dataset

NOTE!

Unlike other options, --merge does not export the merged dataset by itself. An export-related option must also be assigned.

For more flexible merging, the --mergemode option can also be used. The merging modes below are currently available. Note that the merging strategy applies only to overlapping genotypes.

Consensus mode (default): If the genotype in the base dataset is missing, it is replaced. If the non-missing or replaced genotype is not concordant with a subsequent genotype, it is marked as NA and never replaced.
Replace missing only: If the genotype in the base dataset is missing, it is replaced. Otherwise, nothing is done.
Replace all non-missing: If the genotype in the dataset to be merged is non-missing, it replaces the existing genotype. Otherwise, nothing is done.
Do not replace at all: All data is preserved, whatever the overlap (conflicting, missing, or otherwise).
Replace without condition: Any overlapping genotype is replaced.

An example of --merge
C:\Users\WISARD> wisard --bed korean_base.bed --merge merge_list.txt --makebed --out merged

Example 3 : Contents of merge_list.txt

japan_sequenced.vcf japan_sequenced.fam
china_peking_cohort1.ped china_peking_cohort1.map
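
As a rough illustration of the file form of the --merge argument, the Python sketch below reads such a list into one group of paths per dataset. It assumes that paths on a line are separated by whitespace, as they appear in Example 3; the function name is hypothetical, and the sketch does not apply the rule about omitting paths that differ only in extension.

```python
# Sketch: read a merge list where each non-empty line describes one dataset.
# Assumption: paths within a line are separated by whitespace, as in Example 3.

def parse_merge_list(text):
    """Return a list of datasets, each a list of the paths given on one line."""
    datasets = []
    for line in text.splitlines():
        paths = line.split()
        if paths:                      # skip blank lines
            datasets.append(paths)
    return datasets

merge_list = """\
japan_sequenced.vcf japan_sequenced.fam
china_peking_cohort1.ped china_peking_cohort1.map
"""

datasets = parse_merge_list(merge_list)
for number, paths in enumerate(datasets, start=1):
    print(number, paths)
```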

The example below illustrates the differences among the merge modes.

Example 4 : Difference across merge mode

dataset1   dataset2   dataset3   dataset4   |   mode1   mode2   mode3   mode4   mode5
--------------------------------------------------------------------------------------
  0/0        A/A        A/G        A/A      |    0/0     A/A     A/A     0/0     A/A
  0/0        0/0        G/T        G/T      |    G/T     G/T     G/T     0/0     G/T
  A/C        0/0        A/T        A/C      |    0/0     A/C     A/C     A/C     A/C
  A/C        A/C        A/C        0/0      |    A/C     A/C     A/C     A/C     0/0
  0/0        0/0        A/T        A/C      |    0/0     A/T     A/C     0/0     A/C
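
The rows of Example 4 can be reproduced with a small Python sketch of the five modes as described above. This is a reading of the documented behaviour, not WISARD's code: dataset1 is taken as the base, 0/0 as the missing genotype, and the mode numbers 1 to 5 follow the column labels of Example 4, which are not necessarily the values accepted by --mergemode. The function names are hypothetical.

```python
# Sketch of the five merge modes, checked against the rows of Example 4.
# Mode numbers follow the table columns; they are not confirmed --mergemode values.

MISSING = "0/0"

def same_genotype(a, b):
    """Compare two genotypes without regard to allele order."""
    return sorted(a.split("/")) == sorted(b.split("/"))

def merge_genotype(base, incoming, mode):
    """Merge one overlapping genotype from `incoming` datasets into `base`."""
    current, conflicted = base, False
    for gt in incoming:
        if mode == 1:      # consensus: fill missing, mark conflicts as missing for good
            if conflicted or gt == MISSING:
                continue
            if current == MISSING:
                current = gt
            elif not same_genotype(current, gt):
                current, conflicted = MISSING, True
        elif mode == 2:    # replace missing only
            if current == MISSING:
                current = gt
        elif mode == 3:    # replace with every non-missing incoming genotype
            if gt != MISSING:
                current = gt
        elif mode == 4:    # do not replace at all
            pass
        elif mode == 5:    # replace without condition
            current = gt
        else:
            raise ValueError(f"unknown mode: {mode}")
    return current

# Each row: dataset1..dataset4, then the expected result for mode1..mode5.
TABLE = [
    (["0/0", "A/A", "A/G", "A/A"], ["0/0", "A/A", "A/A", "0/0", "A/A"]),
    (["0/0", "0/0", "G/T", "G/T"], ["G/T", "G/T", "G/T", "0/0", "G/T"]),
    (["A/C", "0/0", "A/T", "A/C"], ["0/0", "A/C", "A/C", "A/C", "A/C"]),
    (["A/C", "A/C", "A/C", "0/0"], ["A/C", "A/C", "A/C", "A/C", "0/0"]),
    (["0/0", "0/0", "A/T", "A/C"], ["0/0", "A/T", "A/C", "0/0", "A/C"]),
]

results = [
    [merge_genotype(row[0], row[1:], mode) for mode in range(1, 6)]
    for row, _ in TABLE
]
for (row, expected), got in zip(TABLE, results):
    print(row, got, "matches" if got == expected else "differs")
```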

For efficiency, conflicts and replacements are not reported during merging unless the --mergereport option is assigned.

Last modified : 2017-09-13 11:14:52