The raw reads were trimmed by sickle to remove low-quality sequences.
sickle pe -t sanger -g -f ${id}_1.fastq.gz -r ${id}_2.fastq.gz -o ${id}.1.trimmed.fastq.gz -p ${id}.2.trimmed.fastq.gz -s ${id}.s.trimmed.fastq.gz
The trimmed WGS reads were aligned to tea pant reference genome using Burrows-Wheeler Aligner (BWA) and PCR duplicates were filtered by Sambamba. After filtering low-quality alignments, SNP and InDel were identified by freebayes.
# for each sample bwa mem -t ${cpu} -M -R "@RG\tID:${id}\tSM:${id}\tPL:illumina\tLB:${id}" ${ref} ${id}.1.trimmed.fastq.gz ${id}.2.trimmed.fastq.gz -o ${id}.sam samtools sort -@ 8 -m 4G -o ${id}.sorted.bam ${id}.sam sambamba markdup -r --tmpdir temp/ ${id}.sorted.bam ${id}.rmdup.bam --overflow-list-size 1000000 --hash-table-size 1000000 # for all samples fasta_generate_regions.py ${ref}.fai 100000 > fa.regions ls *rmdup.bam > bam.fofn freebayes-parallel fa.regions ${cpu} --fasta-reference ${ref} --bam-list bam.fofn > web.wgs.vcf vcftools --vcf web.wgs.vcf --max-missing 0.5 --minQ 30 --maf 0.05 --max-meanDP 20 --recode --recode-INFO-all --out web.wgs.g5q30maf05dp20 bcftools filter -g 10 web.wgs.g5q30maf05dp20.vcf.gz | awk '/^#/ || (/=snp/ || /ins/ || /del/) && $5!~/,/' > web.wgs.g5q30maf05dp20gap10.biallelic.vcf
The trimmed RNA-seq reads were mapped to the reference genome using HISAT2. SNP and InDel calling were performed by HaplotypeCaller of GATK.
# for each sample hisat2 --threads 8 -x ${ref} -1 ${id}.1.trimmed.fastq.gz -2 ${id}.2.trimmed.fastq.gz –S ${id}.alignments.sam samtools view -Su ${id}.alignments.sam -@ 8 -o ${id}.alignments.bam samtools sort -@ 8 -o ${id}.alignments.sorted.bam ${id}.alignments.bam gatk AddOrReplaceReadGroups I=${id}.alignments.sorted.bam O=${id}.rg_added_sorted.bam SO=coordinate RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=${id} gatk MarkDuplicates I=${id}.rg_added_sorted.bam O=${id}.dedupped.bam CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=${id}.output.metrics gatk SplitNCigarReads -R ${ref} -I ${id}.dedupped.bam -O ${id}.split.bam gatk BaseRecalibrator -R ${ref} -I ${id}.split.bam -O ${id}.recal_data.table --known-sites web.wgs.g5q30maf05dp20.loci.ref.vcf gatk ApplyBQSR -R ${ref} -I ${id}.split.bam -bqsr ${id}.recal_data.table -O ${id}.bsqr.bam gatk HaplotypeCaller -R ${ref} -I ${id}.bsqr.bam -O ${id}.g.vcf --minimum-mapping-quality 30 -ERC GVCF --dont-use-soft-clipped-bases bgzip ${id}.g.vcf # for all samples gatk CombineGVCFs -R ${ref} --variant SRR10052054.g.vcf.gz …… -O all_sample.g.vcf.gz gatk GenotypeGVCFs -R ${ref} -V all_sample.g.vcf.gz -O web.rna.raw.vcf.gz gatk VariantFiltration -V web.rna.raw.vcf.gz -O web.rna.filter1.vcf.gz -R ${ref} --filter-expression 'QD < 2.0 || FS > 60.0 || MQ <40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0' --filter-name "Filter" vcftools --gzvcf web.rna.g2maf05.recode.vcf.gz --max-missing 0.5 --maf 0.05 --recode --recode-INFO-all --out web.rna.g5maf05 bcftools filter -g 10 web.rna.g5maf05biallelic.recode.vcf | awk '$0~/^#/ || $5!~/,/' > web.rna.g5maf05gap10.biallelic.vcf
The identified genomic variations were annotated by SNPEff based on the gene annotation file of the tea plant genome.
snpEff CSS_V2 web.wgs.g5q30maf05dp20gap10.biallelic.vcf > web.wgs.g5q30maf05dp20gap10.biallelic.vcf.eff snpEff CSS_V2 web.rna.g5maf05gap10.biallelic.vcf > web.rna.g5maf05gap10.biallelic.recode.vcf.eff
The SNP density, nucleotide diversity (θπ) and Tajima's D statistics of 219 WGS germplasms were calculated by VCFtools.
vcftools --gzvcf web.wgs.g5q30maf05dp20.chr.snp.vcf.gz --window-pi ${win} --out web.wgs.g5q30maf05dp20.chr.snp.${win} vcftools --gzvcf web.wgs.g5q30maf05dp20.chr.snp.vcf.gz --TajimaD ${win} --out web.wgs.g5q30maf05dp20.chr.snp.${win}
Genome-wide association study (WGAS) was performed by EMMAX software to find genetic variations or gene associated with a particular metabolic trait. The threshold of significant candidate loci (Lead SNPs) was determined by GEC software.
vcftools --gzvcf web.rna.g5maf05gap10.biallelic.vcf.gz --plink --out web.rna.g5maf05gap10.biallelic plink --file web.rna.g5maf05gap10.biallelic --recode12 --transpose --out web.rna.g5maf05gap10.biallelic emmax-kin -v -d 10 web.rna.g5maf05gap10.biallelic emmax -v -d 10 -t web.rna.g5maf05gap10.biallelic –p trait.emmax.in -k web.rna.g5maf05gap10.biallelic.BN.kinf -o trait.emmax java -Xmx10g -jar gec.jar --effect-number --plink-binary web.rna.g5maf05gap10.biallelic --genome --out gec.out