Input Data
The LD Reference Panel
BridgePRS requires representative genotype data in PLINK binary (bed) format to estimate LD. A thinned LD reference panel (19734 SNPs) for use with the quick start tutorial is included in the BridgePRS download (data/1000G_sample). Three full-size 1000 Genomes panels covering AFR, EUR, EAS, SAS, and AMR super-populations with different SNP sets are available for download: (1) All HapMap Variants,(2) All 1000G variants with MAF>5% in any of the five 1000G super-populations,(3) All 1000G variants with MAF>1% in any of the five 1000G super-populations.
To run bridgePRS using a custom LD reference panel see customisation.
Config Files
For the target and base populations BridgePRS requires inputs be supplied in population configuration files. The following arguments are required in the target configuration file:
Variable Name | Description |
---|---|
POP | Population name label, eg. AFR |
LDPOP | Name of 1000G population used to estimate LD (AFR, AMR, EAS, EAS, SAS) |
LD_PATH | Path to LD reference data |
SUMSTATS_FILE | Sumstats filename used for genomewide summary statistics, can be gzipped |
SUMSTATS_PREFIX | Sumstats prefix used for summary statistics split by chr, eg. height_chr |
SUMSTATS_SUFFIX | sumstats suffix, eg .txt.gz , gives sumstats files height_chr1.txt.gz -- height_chr22.txt.gz |
SUMSTATS_SIZE | GWAS sample size (required for multi-ancestry) |
GENOTYPE_PREFIX | Path and prefix to plink bed files of test samples |
PHENOTYPE_FILE | Filename of phenotype(s) and covariates of test data |
The following optional inputs can also be supplied:
Variable Name | Description |
---|---|
VALIDATION_FILE | Filename of phenotype(s) and covariates of validation data |
SNP_FILE | File listing SNP ids to use in analysis, eg. QCed SNPs |
MAX_CLUMP_SIZE | Maximum number of SNPs per clump |
COVARIATES | List of covariates (comma separated) to use, eg: COVARIATES=PC1,PC2,PC3 |
bridgePRS requires GWAS summary statistics with SNP id (to match to LD reference data), reference allele, alternate allele, p-value, and effect size (linear regression coefficient or log odds). Columns can be supplied in any order. Column names are specified in one of two ways as described below:
Variable Name | Description |
---|---|
SUMSTATS_FIELDS | List of summary statistic column names in the order listed below, e.g, SUMSTATS_FIELDS=ID,REF,A1,P,BETA |
SSF-SNPID | Column name of SNP id (default ID) |
SSF-REF | Column name of non-effect allele (default REF) |
SSF-ALT | Column name of effect allele (default ALT) |
SSF-P | Column name of P-value (default P) |
SSF-BETA | Column name of effect size (default BETA) |
Creating a Config File:
Creating a Configuration File
The following command will validate the command line data and create a target configuration file
./bridgePRS tools check-pop -o test --pop AFR --ld_path ./data/1000G_sample/ --sumstats_prefix ./data/pop_AFR/sumstats/AFR.chr
--sumstats_size 10000
--genotype_prefix ./data/pop_AFR/genotypes/chr
--phenotype_file ./data/pop_AFR/phenotypes/AFR_test.dat
./bridgePRS tools check-pops -o test --pop AFR EUR --ld_path ./data/1000G_sample/ --sumstats_prefix ./data/pop_AFR/sumstats/AFR.chr ./data/pop_EUR/sumstats/EUR.chr
--sumstats_size 10000 100000
--genotype_prefix ./data/pop_AFR/genotypes/chr
--phenotype_file ./data/pop_AFR/phenotypes/AFR_test.dat
This command will create target and base configuration files can be observed below:
Config Files: Target/Base Population Data:
POP=AFR
LDPOP=AFR
LD_PATH=$BRIDGEDIR/data/1000G_sample
SUMSTATS_PREFIX=test/save/sumstats/ss.AFR.
SUMSTATS_SUFFIX=.out.gz
SUMSTATS_FIELDS=ID,REF,A1,P,BETA
SUMSTATS_SIZE=10000
SNP_FILE=test/save/snps.afr_valid.txt
GENOTYPE_PREFIX=$BRIDGEDIR/data/pop_AFR/genotypes/chr
PHENOTYPE_FILE=tests/save/AFR.test_phenos.dat
VALIDATION_FILE=test/save/AFR.valid_phenos.dat
POP=EUR
LDPOP=EUR
LD_PATH=$BRIDGEDIR/data/1000G_sample
SUMSTATS_PREFIX=test/save/sumstats/ss.EUR.
SUMSTATS_SUFFIX=.out.gz
SUMSTATS_FIELDS=ID,REF,A1,P,BETA
SUMSTATS_SIZE=100000
SNP_FILE=test/save/snps.afr_valid.txt
GENOTYPE_PREFIX=$BRIDGEDIR/data/pop_AFR/genotypes/chr
PHENOTYPE_FILE=$BRIDGEDIR/data/pop_AFR/phenotypes/AFR_test.dat
File Specifications
1) Sumstats Data
GWAS summary statistics are provided using a prefix to one or many (per chromosome) files with the --sumstats_prefix
argument and the
--sumstats_suffix
argument when nevecessary. GWAS summary statistics must be provided as a whitespace delimited file containing
the results of an association study for a given phenotype. BridgePRS has no problem reading in a gzipped base file
(need to have a .gz suffix) or splitting the file by chromosome if necessary. An example of a sumstats file with default column headers is shown:
Defaults | #CHR | ID | REF | A1 | P | BETA |
---|---|---|---|---|---|---|
Variable | --ssf-snpid | --ssf-ref | --ssf-alt | --ssf-p | --ssf-beta | |
Data | 1 | rs121 | T | G | 0.0413692 | 0.9472871 |
Data | 1 | rs497 | C | A | 0.328347 | -1.193074 |
Data | 1 | rs271 | G | G | 0.0132225 | 0.413687 |
2) Genotype Files
Genotype files must be in binary plink format (bed, bim, fam).
3) Phenotype Files
Phenotype files are provided to BridgePRS using
--phenotype_files
for test data and --
. This must be a tab / space delimited file
and missing data must be represented by either NA
or -9
(only
for binary traits). The first two column of the phenotype file should
be the FID and the IID, and the rest can be phenotypes/covariates:
FID | IID | y | y.binary | PC1 | PC2 |
---|---|---|---|---|---|
afr1_1 | afr2_1 | 24.4 | 1 | 0.53 | 0.950 |
afr1_2 | afr2_2 | 4.10 | 0 | 0.59 | 0.450 |
afr1_3 | afr2_3 | 37.2 | 1 | 0.73 | -0.13 |
afr1_4 | afr2_4 | 5.40 | 0 | 0.44 | -0.55 |
The phenotype of interest can be specified with the --phenotype
flag and the covariates can be given as a comma separated list
after the --covariates
flag:
4) QC SNP List
To select only SNPS that have passed QC, you can include a single column text file using the --snp_file
flag.