Plink

We will be working with plink, a free, open-source whole-genome association analysis toolset, designed to perform a range of basic, large-scale analyses in a computationally efficient manner. Even though the focus of plink is on analysis of genotype/phenotype data, it is widely used in popgen as it has many features for data manipulation, it offers basic statistics, and many popgen tools assume input files to be in plink format (e.g. fastStructure, ADMIXTURE, etc.).

Plink is a tool created for genome-wide association studies (GWAS) and research in population genetics. Plink parses each command line as a collection of flags (each of which starts with two dashes), plus parameters (which immediately follow a flag). Because Plink was developed for GWAS medical studies, many statistics and pieces of information from plink files will be not used in our analysis, such as pedigree or phenotype.

Plink file formats

plink can either read
1. text-format files (.ped + .map) or
2. binary files (.bed + .bim + .fam).

Because reading large text files can be time-consuming, it is recommended to use binary files.

plink data formats (Marees et al. 2017)

Text-format plink files

Text plink data consist of two files (which should have matching names and they should be stored together):
1. .ped contains information on the individuals and their genotypes;
2. .map contains information on the genetic markers.

Binary plink files

Binary plink data consist of three files (one binary and two text files which should have matching names and which should be stored together):
1. .bed contains individual identifiers (IDs) and genotypes,
2. .fam contain information on the individuals,
3. .bim contains information on the genetic markers.

Analysis using covariates often requires the fourth file, containing the values of these covariates for each individual.

Task

Make directory called plink_exercise in ~/popgen_intro.
Make binary files from all_snp text-format plink files and output them in newly created ~/popgen_intro/plink_exercise.

Tip

Check plink manual on how to make BED file format.

Click for answer

cd ~/popgen_intro
mkdir plink_exercise
plink1.9 --file all_snp --make-bed --out ~/popgen_intro/plink_exercise/hgdp

Now go to plink_exercise and check which files are there:

$ cd ~/popgen_intro/plink_exercise/
$ ls

hgdp.bed  hgdp.bim  hgdp.fam  hgdp.log

As expected we got 3 files that go together in binary file format in plink (i.e. bed + bim + fam), but also log.

Let's take a look at it:

$ cat hgdp.log

PLINK v1.90b6.6 64-bit (12 Oct 2018)
Options in effect:
  --file all_snp
  --make-bed
  --out /home/bajiv90/popgen_intro/plink_exercise/hgdp

Hostname: evop-login
Working directory: /home/bajiv90/popgen_intro
Start time: Mon Apr 25 18:19:33 2022

Random number seed: 1650903573
16042 MB RAM detected; reserving 8021 MB for main workspace.
Scanning .ped file... done.
Performing single-pass .bed write (62679 variants, 942 people).
--file: /home/bajiv90/popgen_intro/plink_exercise/hgdp-temporary.bed +
/home/bajiv90/popgen_intro/plink_exercise/hgdp-temporary.bim +
/home/bajiv90/popgen_intro/plink_exercise/hgdp-temporary.fam written.
62679 variants loaded from .bim file.
942 people (621 males, 321 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 942 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.995297.
62679 variants and 942 people pass filters and QC.
Note: No phenotypes present.
--make-bed to /home/bajiv90/popgen_intro/plink_exercise/hgdp.bed +
/home/bajiv90/popgen_intro/plink_exercise/hgdp.bim +
/home/bajiv90/popgen_intro/plink_exercise/hgdp.fam ... done.

End time: Mon Apr 25 18:19:35 2022

plink by default creats .log file with information on how linked bed, bim, fam or map and ped files were created. This is very useful when trying to understand your own work after some time, or understanding the steps of your collaborators.

If we take a look at .fam file, we will see that FID (column 1) and IID (column 2) are the same. However, for popgen studies, FID could be used to store the information about the population from which the individual was sampled. We can find population information for each individual in the metainformation file that we have in ~/popgen_intro/HGDP_metainformation.txt.

In order to update sample information in our plink files we should first make a text file with old and new FID and IID, and then provide it to --update-ids when making new plink files with updated IDs.

--update-ids expects input with the following four fields:

Old FID
Old IID
New FID
New IID

To create this file we can use the command join which will join lines of two files on a common field.

Tip: Process Substitution

Piping (|) the stdout of a command into the stdin of another is a powerful technique. But, what if we need to pipe the stdout of multiple commands? This is where process substitution (<(command) or >(command)) comes in.

Process substitution feeds the output of a process (or processes) into the stdin of another process. In other words process substitution treats the output (or input) of commands as files. The syntax for process substitution is: <(list) or >(list) where each list is a command or a pipeline of commands. The effect of process substitution is to make each list act like a file.
To substitute a command pipeline for an input file the syntax is: command ... <(list) ...
To substitute a command pipeline for an output file the syntax is:command ... >(list) ...
E.g. diff -u <( ls dir1 | sort) <( ls dir2 | sort )

cd ~/popgen_intro/plink_exercise/

join -j 1 -o 1.1,1.2,2.2,2.1 <(sort -k1 hgdp.fam) <(sort -k1 ../HGDP_metainformation.txt) > hgdp_updateIDs.txt

plink1.9 --bfile hgdp --update-ids hgdp_updateIDs.txt --make-bed --out hgdp_newFIDs

Filtering with plink

We can generate some simple summary statistics using plink e.g. rates of missing data in the file, diversity within and between the samples, etc., and use some of those statistics to filter our dataset to include only high-quality entries.

We can use --missing to produces sample-based (.imiss) and variant-based (.lmiss) files with missing data reports.

cd ~/popgen_intro/plink_exercise/

plink1.9 --bfile hgdp_newFIDs --missing --out hgdp_newFIDs_miss

We can visualize the missing data reports in R. Let's download newly created files together with the metainformation file to our local machine, and then run the following R commands to plot them (remember to specify correct path to files):

# load necessary packages
library("tidyverse")

# load files
indINFO   <- read_table("HGDP_metainformation.txt", col_types = "cffffddffffffiii")
imiss     <- read_table("hgdp_newFIDs_miss.imiss", col_types = "fcfiid")
lmiss     <- read_table("hgdp_newFIDs_miss.lmiss", col_types = "fciid")

# add metainformation to imiss
imiss <- left_join(imiss, indINFO, by="IID")

# order levels for FID to be ordered by Region, Country and FID
imiss$FID <- factor(imiss$FID, levels = unique(imiss[order(imiss$Region, imiss$Country, imiss$FID),]$FID))

# ploting boxplots per population colored by region
imiss %>% 
  ggplot(aes(x=FID, y=F_MISS, fill=Region)) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 90))

# Let's keep only modern human populations
# and plot it again so we can see better which of modern populations has a lot of individuals that are missing a lot of data
imiss %>% 
  filter(!FID %in% c("Href","Clint","Denisova", "Gorilla","Macaque","Marmoset","Orang","Vindija")) %>% 
  ggplot(aes(x=FID, y=F_MISS, fill=Region)) + 
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 90))

# Plot histograms of Missingness per Chromosome
lmiss %>% 
  ggplot(aes(x=CHR, y=F_MISS, fill=CHR)) + 
  geom_boxplot() +
  theme(legend.position = "none")

# Now let's see only those that have F_MISS < 0.1
lmiss %>% 
  filter(F_MISS < 0.1) %>% 
  ggplot(aes(x=CHR, y=F_MISS, fill=CHR)) + 
  geom_boxplot() +
  theme(legend.position = "none")

Task

Let's create new plink files in which we will keep only individuals from modern populations, and in which we will remove SNPs and individuals with more than 10% missingness. To do so, we will need to specify which individuals we want to keep (with option --keep) or remove (with option --remove). One way to make list of individuals which we want to keep is to find all individuals that contain "HGDP" in their name.

Steps:
1. Create a new txt file that will contain only lines with individuals that contain "HGDP" in IID column of hgdp_newFIDs.fam
2. Use plink to create new binary files that will have only individuals specified in the txt file you created in the previous step, and remove SNPs and individuals with more than 10% missingness.

Click for answers

grep HGDP hgdp_newFIDs.fam > ind_keep.txt
plink1.9 --bfile hgdp_newFIDs --make-bed --mind 0.1 --geno 0.1 --keep ind_keep.txt --out hgdp_newFIDs_modernPops_MindGeno

Question

How many individuals and how many SNPs were removed?

Click for answers

--keep: 934 out of 942 people remaining.
--mind: 0 people removed due to missing genotype data.
--geno: 8 variants removed due to missing genotype data.