TeaGVD: Tea plant Genomic Variations Database

Pipeline of SNP/InDel Calling

Remove low-quality sequences

The raw reads were trimmed by sickle to remove low-quality sequences.

sickle pe -t sanger -g -f ${id}_1.fastq.gz -r ${id}_2.fastq.gz -o ${id}.1.trimmed.fastq.gz -p ${id}.2.trimmed.fastq.gz -s ${id}.s.trimmed.fastq.gz

Sequence alignment and SNP/InDel Calling

WGS germplasms

The trimmed WGS reads were aligned to tea pant reference genome using Burrows-Wheeler Aligner (BWA) and PCR duplicates were filtered by Sambamba. After filtering low-quality alignments, SNP and InDel were identified by freebayes.

# for each sample
bwa mem -t ${cpu} -M -R "@RG\tID:${id}\tSM:${id}\tPL:illumina\tLB:${id}" ${ref}  ${id}.1.trimmed.fastq.gz ${id}.2.trimmed.fastq.gz -o ${id}.sam
samtools sort -@ 8 -m 4G -o ${id}.sorted.bam ${id}.sam
sambamba markdup -r --tmpdir temp/ ${id}.sorted.bam ${id}.rmdup.bam --overflow-list-size 1000000 --hash-table-size 1000000
# for all samples
fasta_generate_regions.py ${ref}.fai 100000 > fa.regions
ls *rmdup.bam > bam.fofn
freebayes-parallel fa.regions ${cpu} --fasta-reference ${ref} --bam-list bam.fofn  > web.wgs.vcf
vcftools --vcf web.wgs.vcf --max-missing 0.5 --minQ 30 --maf 0.05 --max-meanDP 20 --recode --recode-INFO-all --out web.wgs.g5q30maf05dp20
bcftools filter -g 10 web.wgs.g5q30maf05dp20.vcf.gz | awk '/^#/ || (/=snp/ || /ins/ || /del/) && $5!~/,/' > web.wgs.g5q30maf05dp20gap10.biallelic.vcf

RNA-seq germplasms

The trimmed RNA-seq reads were mapped to the reference genome using HISAT2. SNP and InDel calling were performed by HaplotypeCaller of GATK.

# for each sample
hisat2 --threads 8 -x ${ref} -1 ${id}.1.trimmed.fastq.gz -2 ${id}.2.trimmed.fastq.gz –S ${id}.alignments.sam
samtools view -Su ${id}.alignments.sam -@ 8 -o ${id}.alignments.bam
samtools sort  -@ 8 -o ${id}.alignments.sorted.bam ${id}.alignments.bam
gatk AddOrReplaceReadGroups I=${id}.alignments.sorted.bam O=${id}.rg_added_sorted.bam SO=coordinate RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=${id}
gatk MarkDuplicates I=${id}.rg_added_sorted.bam O=${id}.dedupped.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=${id}.output.metrics
gatk SplitNCigarReads -R ${ref} -I ${id}.dedupped.bam -O ${id}.split.bam
gatk BaseRecalibrator -R ${ref} -I ${id}.split.bam  -O ${id}.recal_data.table  --known-sites web.wgs.g5q30maf05dp20.loci.ref.vcf
gatk ApplyBQSR -R ${ref} -I ${id}.split.bam -bqsr ${id}.recal_data.table -O ${id}.bsqr.bam
gatk HaplotypeCaller -R ${ref} -I ${id}.bsqr.bam -O ${id}.g.vcf --minimum-mapping-quality 30 -ERC GVCF --dont-use-soft-clipped-bases
bgzip ${id}.g.vcf
# for all samples
gatk CombineGVCFs -R ${ref} --variant SRR10052054.g.vcf.gz …… -O all_sample.g.vcf.gz
gatk GenotypeGVCFs -R ${ref} -V all_sample.g.vcf.gz -O web.rna.raw.vcf.gz
gatk VariantFiltration -V web.rna.raw.vcf.gz -O web.rna.filter1.vcf.gz -R ${ref} --filter-expression 'QD < 2.0 || FS > 60.0 || MQ <40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' --filter-name "Filter"
vcftools --gzvcf web.rna.g2maf05.recode.vcf.gz --max-missing 0.5 --maf 0.05 --recode --recode-INFO-all --out web.rna.g5maf05
bcftools filter -g 10 web.rna.g5maf05biallelic.recode.vcf | awk '$0~/^#/ || $5!~/,/' > web.rna.g5maf05gap10.biallelic.vcf

SNP/InDel annotation

The identified genomic variations were annotated by SNPEff based on the gene annotation file of the tea plant genome.

snpEff CSS_V2 web.wgs.g5q30maf05dp20gap10.biallelic.vcf > web.wgs.g5q30maf05dp20gap10.biallelic.vcf.eff
snpEff CSS_V2 web.rna.g5maf05gap10.biallelic.vcf > web.rna.g5maf05gap10.biallelic.recode.vcf.eff

Population genetic analysis

The SNP density, nucleotide diversity (θπ) and Tajima's D statistics of 219 WGS germplasms were calculated by VCFtools.

vcftools --gzvcf web.wgs.g5q30maf05dp20.chr.snp.vcf.gz --window-pi ${win} --out web.wgs.g5q30maf05dp20.chr.snp.${win}
vcftools --gzvcf web.wgs.g5q30maf05dp20.chr.snp.vcf.gz --TajimaD ${win} --out web.wgs.g5q30maf05dp20.chr.snp.${win}

GWAS

Genome-wide association study (WGAS) was performed by EMMAX software to find genetic variations or gene associated with a particular metabolic trait. The threshold of significant candidate loci (Lead SNPs) was determined by GEC software.

vcftools --gzvcf web.rna.g5maf05gap10.biallelic.vcf.gz --plink --out web.rna.g5maf05gap10.biallelic
plink --file web.rna.g5maf05gap10.biallelic --recode12 --transpose --out web.rna.g5maf05gap10.biallelic
emmax-kin -v -d 10 web.rna.g5maf05gap10.biallelic
emmax -v -d 10 -t web.rna.g5maf05gap10.biallelic –p trait.emmax.in -k web.rna.g5maf05gap10.biallelic.BN.kinf -o trait.emmax
java -Xmx10g -jar gec.jar --effect-number --plink-binary web.rna.g5maf05gap10.biallelic --genome --out gec.out

Methods