SNP Mining by Genome Resequencing of 30 Apple Varieties in Shandong Province

In this article, we carried out genome resequencing and SNP mining for cultivated apples in Shandong Province for the sake of the rapid identification of apple varieties, germplasm evaluation, and utilization. Genomic DNA was extracted immediately from leaves of each sample, and Paired-end Illumina genomic libraries were prepared and sequenced on an Illumina Hiseq 4 000 platform following the manufacturer's instructions. Resequencing of the 31 apple genomes generated a total of 363 Gb high-quality cleaned sequences, with an average of 12.5 Gb per accession that represented approximately 15.9x coverage of the apple genome. The data volume fully meets the needs of downstream analysis and SNP mining. When we used the nucleotide mismatch parameter from 1~12, the mapping rate gradually increased to saturation. There was a highly significant correlation (p<0.0001) between the total mapping rate, mapping rate of pair-end data, and mismatch parameter. Univariate fourth-order equation (regression coefficient r>0.99) were predicted. As the mismatch rate increases, the accuracy of mapping decreases; the genome coverage gradually increases, and heterozygous sites' accuracy gradually increases. In this study, two algorithms were used in SNP mining. The intersection was further taken based on the 'chromosome+site information' as the eigenvalues to obtain a highly reliable single nucleotide variant dataset. A total of 374 404 SNP locus were detected. On average, one variation can be identified from 1 896 bp. The accuracy of the Sanger verification test is as high as 98.1%. Annotation analysis shows that among the 373 763 SNPs, 25 047 (6.7%) are located in the gene coding region, 143 269 (38.27%) are located in the intergenic region, and 179 426 (47.92%) are located in the 2 kb region upstream or downstream of the corresponding genes. Among the coding region SNPs, 13 422 are non-synonymous, while 11 625 are synonymous variations. The ratio of non-synonymous to synonymous SNP is 1.15: 1. Using the filtered 4DTV sites, population clustering analysis results constructed using neighbor-joining algorithms are in line with the trend of the classification of cultivated apples in Shandong province.

Regarding the germplasm resources research, population genotyping based on resequencing provides novel methods for protecting, identifying, evaluating, and innovating fruit tree germplasm resources. Genomics studies provide unique advantages such as high-throughput, big data, and GWAS analysis among phenotype and genotype (Chen et al., 2015;Chen et al., 2018). Genomics and bioinformatics studies should be combined to carry out the high-throughput genotyping of the related germplasm resources of cultivated apples. The underlying SNP and corresponding annotation database should be constructed together. There will be an excellent job in apple omics (Genomic and Proteomic) breeding. In the post-genome era, SNP array represents the future direction of low-cost genotyping due to its unique advantages. China has carried out the research, development, and application of SNP array in field crops such as wheat, corn, soybean, rice, cotton, and some cruciferous vegetable crops. Especially in wheat, Jia Jizeng et al. built a high-resolution (Affymetrix 660k) SNP array. These arrays have demonstrated well in identifying and evaluating germplasm resources, population genotyping, association analysis or functional gene mapping, and molecular marker-assisted breeding. The application prospects should not be underestimated (Zhou et al., 2018). Compared with resequencing, the tech of chip or array is easier to perform in that no reference genome comparison is required to achieve high throughput. Also, the chip or array has high detection accuracy of more than 99.9%, while the detection cost is relatively as low as about 1 000 CNY per sample. Researchers have designed three resolution of SNP array for apple breeding such as 8 K, 20 K, and 480 K, using which the application of genotyping and association analysis for the popular cultivated apples in Europe and America have been developed (Chagné et al., 2012;Bianco et al., 2014;Bianco et al., 2016).
In China, the fruit trees for which SNP array have been developed were strawberries, pears, and peaches; the density is 90 K, 200 K and 9 K, respectively (Verde et al., 2012;Bassil et al., 2015;Li et al., 2019 ). As apple is an important fruit tree, it's breeding urgently needs targeted, low-cost, and high-throughput genotyping methods. However, the SNP array research for the unique apple varieties in China has not been carried out yet. In this study, based on the resequencing research that has been carried out, the SNP array site mining research was carried out. On the one hand, it can be used for rapid identification of apple varieties, evaluation and selection of germplasm resources; it can also be used for genome-wide association analysis, functional gene positioning and molecular marker-assisted breeding.

Statistics of apple genome resequencing for each accession
Upon the raw data was got, the adapter sequence and duplicated reads caused by PCR library building was removed. The volume of cleaned data is 363 G. Calculated with 720 M base pairs of the apple genome, the highest genome coverage was 21.02 × , the lowest genome coverage was 10.63×, and the average sample coverage reached 16.29×; it fully met the needs of re-sequencing analysis and SNP site mining (Table 1).

Effect of mismatch parameters on the mapping rate
Taking C18-06A sample Marshal (Qingdao No.1) as an example, the mismatch rate parameter mismatch required by the BWA software is the number of allowable mismatched bases between the data read and the reference genome, because the sequencing read length in this study is 150 bp in this study, the parameter value was increased from 1 (0.66%) to 12 (8.00%), and a series of comparison files were obtained. Then use the flag stat function of SAMtools to count the specific conditions of the comparison rate of all read data, paired data and single-ended data (Table 2; Figure 1); first, as the mismatch rate gradually increases, the comparison rate also gradually rises. High, but the upward trend gradually decreases until it is close to saturation.
The mapping rate of all reads and pair-end reads showed a trend of approaching saturation (Figure 1), while the mapping rate of single-ended data gradually decreased to a minimum value. As is shown in Figure 1, the total mapping rate is positively correlated with the mismatch rate, which is in line with the fourth-order equation: y=-3E-05x 4 +0.0 011x 3 -0.0 145x 2 +0.0 864x+0.7 418 (regression coefficient R² = 0.9 995). The paired mapping rate of the reading segment is positively correlated with the mismatch rate, which is in line with the fourth-order equation: y = -3E-05x 4 + 0.0 012x 3 -0.0 149x 2 + 0.0 863x + 0.7 126 (the regression coefficient R² = 0.9 994). The single-end mapping rate is negative correlated with the mismatch rate, which is in line with the fourth-order equation: y = 2E-06x 4 -7E-05x 3 + 0.0 009x 2 -0.0 054x + 0.0 212 (Regression coefficient is R² = 0.9 993).  Figure 1 Effect of mismatch parameters on the mapping of total data, paired data and single-ended data

Effect of mismatch parameters on accurate of heterozygous SNP calling
Next, we compared the influence of different mismatch parameters on the accuracy of locus detection. Take Chr11 on chromosome 11 as an example ( Figure 2): As the allowable mismatch rate increases, in the Chr11 area in the figure, comparable data Gradually increase, the sequencing coverage gradually increased from 11 to 19, and homozygous loci under low coverage were identified as heterozygous loci under high coverage, indicating that the increase in mismatch rate is beneficial to heterozygous Site detection. It can be seen that as the mismatch rate increases, the comparison stringency decreases; the genome coverage gradually increases, which is more conducive to the detection of heterozygous sites. Many plants have the characteristics of distant hybridization, self-incompatibility, high genomic heterozygosity and extensive genetic drift, such as apple, Brassica, corn and other crops. For the SNP site mining of such crops, on the one hand, it is necessary to improve the data coverage of the whole genome by sequencing, and on the other hand, it is to select the most suitable mismatch parameter; it is of reference significance for the SNP site mining of crops with higher heterozygosity.

Comparison and integration of sites under the two analysis procedures
According to the standard procedure of next-generation sequencing combined with the BCF tools: bwa-sam-bampileup-bc fools' algorithm, a total of 28 997 212 SNVs were identified, including 26 758 563 single-base SNPs, and 1 060 691 short inserts. 117 795 short deletions. This algorithm can detect 3 types of variation so the sensitivity of variation detection is as high as one variation can be identified from every 27 loci on average.
According to the standard procedure of next-generation sequencing combined with the in-house algorithm: the bwa-sam-bam-pileup-column algorithm identified a total of 1 147 801 variation. This algorithm is designed for the detection of single-base SNPs, whereas the sensitivity of variation detection is pretty low. On average, one variation can be identified from every 618 loci.
Combining the obtained by the two algorithms, and further taking the intersection based on the "chromosome + site coordinate" as the feature value, a highly reliable single-base SNP variation data set is obtained. A total of 374 404 variation were identified. Because of the intersection, these variations are all single-base SNPs. On average, one variation can be identified from every 1 896 loci. Primers were designed for 1 000 randomly selected homologous SNPs and amplified by PCR, and then the amplified products were sequenced by Sanger. The results showed that the coincidence of the selected SNP sites on the two platforms was as high as 98.1%.

Distribution of SNPs among the apple genomes
Annotation analysis of SNPs showed that out of the total 373 763 SNP loci, 143 269 (38.27%) were located in the intergenic region, 25 047 (6.7%) were located in the gene coding region, and 143 269 (38.27%) were located in the gene coding region. Located in the intergenic region, 179 426 (47.92%) were located in the 2 kb region upstream or downstream of the gene. Among all the SNPs in the coding region, 13 422 are non-synonymous variations and 11 625 are synonymous variations (Table 3; Figure 3). The ratio of non-synonymous and synonymous SNPs is 1.15: 1. Non-synonymous SNPs, also called missense SNPs, change from encoding one amino acid to another to form a phenotypic modification; synonymous SNPs are also called silent variations, although there are base variations, they still encode the same amino acid and cannot be formed Phenotypic modification. Compared with other cultivated field crops and fruit tree crops, apples have a lower percentage of variation in the genome that can form corresponding phenotypic modifications (Duan et al., 2017).

Construction of population clustering evolutionary tree
The evolutionary tree was derived using the minimal evolution method (Figure 4) (Rzhetsky and Nei, 1992). The figure shows the best evolutionary tree with a total branch length of 0.7 665, which is drawn with reference to the ratio of evolutionary distance. On the whole, the phylogenetic tree shows that the main cultivated apples collected in this study in Shandong Province are mainly divided into four types: Fuji, Marshal, Golden Crown (Jin Guan), Gala, and other hybrid combinations. It is worth mentioning that: (1) C18-23 samples are wild resources that were bred in Xinjiang wild apples, and the earliest divergence occurred in the evolutionary history; (2) C18-2, C18-3, C18-4, C18-5 and C18-6, according to the information provided by the resource nursery, these samples are all marshals, and they were successfully gathered together in this experiment. Similarly, C18-8, C18-9, C18-10, C18-11, C18-12, C18-13-1 and C18-13-2. According to the information provided by the resource nursery, these samples are all Fuji-series. We successfully got together during the experiment. The above samples are all taken from the National Apple Resource Nursery (Xingcheng) of the Institute of Fruit Trees, Chinese Academy of Agricultural Sciences, which has many years of accurate genealogical data registered for these germplasm resources. This result indirectly proves the reliability of the SNP data in this experiment; (3) The clustering results of the remaining samples are also in line with expectations. Figure 4 Evolutionary relationships of taxa 2 Discussion

How to determine the ideal mismatch parameters
Through the comparative analysis of a series of mismatch parameters, this study found that with the increase in the mismatch rate, the number of Reads that can be compared to the genome gradually increased, showing a trend of gradual saturation. Increasing the mismatch rate to a certain level is conducive to improving sequencing coverage and facilitating site mining. But it is meaningless to increase the mismatch rate indefinitely when it is close to saturation.
If so, in order to further determine the best Bwa alignment mismatch parameters for different samples , the project team members independently developed a method to select the best alignment mismatch rate parameters: NCBI (Pruitt et al., 2005) downloaded the reference genome sequence of cultivated apples and established a local Blast database of the apples. Then randomly select 1 000 reads from the sequencing data for local Blast, sort the mapping ratio in the Blast results, and then calculate the Identity similarity of the 550 th read, thereby determining the BWA mismatch parameters.
The size of the best mismatch parameter measures to a certain extent the distance of the test sample relative to the reference genome. For example, the test sample No. 23 is a new hybrid of Xinjiang wild apple (M. sieversii in Xinjiang) and red meat apple (Malus domestica 'Redlove Era'), and the genetic relationship is far from the reference genome apple Golden Delicious (M. domestica) The farthest, correspondingly, the mismatch parameter obtained by our calculation is the largest, which is 7; while the reference sample No. 1 and the reference genome belong to the Jinshuai line, the relationship is closest, and the corresponding mismatch parameter obtained by our calculation is the smallest, which is 4.
This method of selecting the best alignment mismatch rate has been adopted in the previous article (Duan et al., 2017;Duan, 2017). Choosing accurate mismatch parameters on the one hand can obtain a sufficiently high sequencing coverage and ensure the accuracy of the site; on the other hand, it can compare as many reads as possible with a minimum amount of calculation, avoiding excessive calculations and improving analysis efficiency.

The necessity of data integration
In order to enhance the reliability of SNP loci, this study integrated the sequencing data of 31 newly collected samples with the sequencing data of 23 cultivated apples sequenced previously (Duan et al., 2017;Duan, 2017) ), the purpose of data integration: one is to enhance the reliability of the site, and the other is to compare the polymorphisms between the cultivated apple populations in the province and the cultivated apple populations abroad (a separate article is published).
Increasing the number of reference samples and data integration has the following advantages: it increases the diversity of samples; because the previous re-sequencing involves 23 main cultivated apple types worldwide; on the other hand, it improves the reliability of mining sites; Through the integration, the diversity of the large data set is actually used to test the diversity of the quantum set; the future application scope of the selected site is enhanced.

SNP calling strategies for highly heterozygous species
First, for species with high heterozygosity, the assembly of their genomes is difficult, and the corresponding assembly quality is generally not high. For example, multiple published fruit tree genomes have high heterozygosity. At present, next-generation sequencing is widely used. When reads with a read length of 100~150 bp are used to assemble Contigs, the overlap relationship between contigs of highly heterozygous genomes is not easy to clarify, resulting in a low N50 and a large number of failures in the genome. The overlapping area (gap) (Pryszcz and Gabaldón, 2016). Correspondingly, when second-generation sequencing is used for SNP detection, the highest possible sequencing depth and as long as possible read reads are required, such as the hiseq 4 000 platform that is currently widely used. The average sequencing depth of this study reached 16 X, and the read length was 150 bp. Based on the above strategy, the SNP loci in this paper have been sequenced by Sanger, and the coincidence rate is as high as 98.1%. This is higher than the accuracy of re-sequencing SNP loci in maize and cotton populations, and these two crops have a good genetic research foundation, and their genome assembly quality is better than that of apples.
Furthermore, when the mismatch rate is increased, more sites will be identified as heterozygous sites, which indicates that high coverage is beneficial to the detection of heterozygous sites. Many plants have high genomic heterozygosity due to factors such as distant hybridization, self-incompatibility, etc., and there are obvious and extensive genetic drift, such as Malus Mill., Brassica, Maize, etc. crop. For the SNP site mining of such crops, on the one hand, it is necessary to improve the data coverage of the sequencing in the whole genome, and on the other hand, it is to select the most suitable mismatch parameter. This has reference significance for crops with higher heterozygosity.
Finally, an improved algorithm should be used in site selection. Try to avoid using only one detection process. At present, in the existing SNP detection process, the upstream analysis process is BWA combined with SAM tools, namely BWA-Sam-bam-pileup. Different algorithms are used to select SNP after the generated pileup file, and the file format is mainly VCF, hapmap or list format. At this time, two or more analysis processes are used to normalize the generated data into a VCF file. By using the coordinate information of the chromosome to take the intersection, a more reliable site can be obtained.

Variety collection
This study selected 31 cultivated apple varieties with a wide range of types, covering the four major apple lines Fuji, Marshal, Golden Crown, Gala and some new hybrid lines. The pedigree and origin of the samples are from the Institute of Fruit Research, Chinese Academy of Agricultural Sciences National Apple Resource Nursery (Xingcheng); covers the scion types of the main cultivated apples in Shandong Province. From the point of view of the area where the materials are obtained, the area of the materials is all over the main apple cultivation areas in Shandong Province. From the pedigree information, the experimental materials are sufficiently representative in terms of diversity (Table 4).
From June 15th to 20th, 2018, most of the leaves were treated with liquid nitrogen immediately after they were collected from the raw top tip leaves of the year. Only the leaves of the 23, 25, and 26 samples were dried with silica gel. All leaf samples were extracted according to the standard DNB extraction method. The extracted DNA samples were tested on agarose gel for quality, and after meeting the sequencing requirements, the library was constructed by the paired-end PE150 strategy and delivered to BGI to complete the sequencing on the Hiseq-4 000 platform.

Preprocessing and statistical analysis of sequencing data
The original data needs to go through a Perl sequencing script (written by the research team of this research group) to remove sequencing duplication caused by PCR. Specifically, for paired Reads with different sequencing position information IDs, any pair 1 or pair 2 with identical base data at the same time in the 15~135 bp interval is defined as a sequencing duplication caused by PCR, and the data is filtered out. The command line is: "drop_dup_both_end.pl raw_fq1 raw_fq2".
The data from which PCR sequencing duplicates have been removed is filtered by Trimmomatic3.0 software to remove 1, sequencing adapters, and 2, low-quality reads. In this way, the net data is finally obtained. The command line is "trimmomatic PE -thReads 75 fq.1 fq.2".
Including total sequencing data statistics, sequencing depth statistics, read comparison rate statistics and the determination of comparison mismatch parameters. The command line is "fastqc -q trimmed_fq1 trimmed_fq2".

Determine of mismatch values
Take the sample C18-06A (Qingdao 1, Golden delicious) as an example, for the mismatch rate parameter mismatch required by the BWA software: the value of allowable mismatched bases between the data read and the reference genome, because apples have distant hybridization and heterozygosity Higher, therefore, increase the parameter value from 0.66% to 8.00%, which corresponds to a read length of 150 bp and is 1~12. A series of comparison files were obtained to compare the comparison rate to coverage and SNP detection.
In order to determine the appropriate BWA ) alignment mismatch parameters, firstly download the reference genome sequence of cultivated apple from NCBI (www.ncbi.genome.com) and establish a local Blast database of this species. Randomly extract 1 000 reads from the sequencing data for local Blast, and then count the similarity of the 550th read after sorting the Mapping ratio to determine the mismatch value of BWA.

Data mapping and SNP mining
In this study, the genome of the cultivated apple 'Jinshuai' published in 2017 (Daccord et al., 2017) was used as the reference sequence, and all the 31 collected in this experiment and the 23 cultivated in the previous article (Duan et al., 2017) were used. For apples, the resequencing data of a total of 54 cultivated apples were compared with the reference genome by BWA (Li and Durbin, 2009) (mismatch ranging from 4 to 7). The pileup file is converted by SAM tools ). Next, two different processes are used to detect SNP site information: (1) BWA-sam-bam-pileup-bc fools' algorithm, using SAM tools combined with BCF tools to convert Pileup files to obtain SNP data sets in VCF file format for each sample.
(2) According to the second-generation sequencing standard process combined with the self-developed In-House algorithm, namely: the bwa-sam-bam-pileup-column algorithm, a SNP data set similar to hapmap is obtained.
(3) Using an improved intersection algorithm: the SNP site information obtained by the above two methods is based on the method of taking the intersection of chromosome coordinates, and then to a higher quality SNP site.
SNP validation method: 6 samples were selected, and 1,000 homogenous SNP sites (that is, non-heterozygous) were randomly intercepted from chromosome 11, and 50 bp sequences on both 3' and 5' flank were designed based on this. Then primers were constructed, and PCR amplification experiments were performed. Finally, the amplified products were testified by 3730 capillary electrophoresis.

Annotation of SNP loci
Compared with other annotation software, SNPEff is powerful. It can get the gene region where the variation site anchor and the detailed gene region information conducive to subsequent functional gene mining and mapping. Due to the use of the java platform, it is to learn and use. The manual http://SNPeff.sourceforge.net/ SNPEff_manual.html describes the annotation method in detail. The command line for the annotation in this study is as follows: To modify SNPeff software settings: "vim user path/SNPEff/SNPeff-4.3.1t-1/SNPEff.config"; To add genome information: "# apple genome version GDDH13 GDDH13.genome: Apple"; To build local library: "SNPEff build -gff3 -v GDDH13"; SNP annotation: "SNPEff -v -stats prefix.html GDDH13 prefix.vcf> prefix.ann"; the output html file is a graphical interpretation of the site annotation results presented in the form of a web page, and the output an file is a text listing the detailed results of each SNP annotation.

4DTV loci filtering and cluster analysis
In the protein coding region of the gene, there are some amino acids corresponding to the third codon that can use any 4 kinds of bases, and no amino acid changes will be formed. Such a site is called a quadruple degeneration site (4DTV). This kind of unintentional variation has almost no selective pressure, and its variation rate can be used as a "clock" to estimate evolution, which is particularly suitable for building evolutionary trees and analyzing population genetic structure (Fazio et al., 2014). This study used the Perl script written by the team to screen the entire set of SNP data in the CDS region according to the following rules: minimum allele frequency (MAF) ≥ 5%, and the data loss rate corresponding to each locus≤10%. Finally, 24 326 quadruple degenerate sites (4DTV) were screened. Finally, the location is input into the Mega X software, and the close-neighbor-interchange (CNI) algorithm (Kumar et al., 2018) is used on the first search level. Thus, the phylogenetic tree of the population is constructed.

Authors' contributions
Duan N.B., and Ma Y.M. are the executives of the experimental design and experimental research of this study; Xie K., Bai J., Yang Y.Y., Pu Y.Y., and Gong Y.C. completed DNA extraction, PCR amplification and Sanger sequencing; Duan Naibin completed sequencing data preprocessing and genome comparison , SNP site mining, gene annotation and paper writing; Ma Y.M., Wang K., and Wang X.M. are responsible for sample collection and revision of some paper chapters; Duan Naibin is the project designer and person in charge, directing experimental design, data analysis, paper writing and Modification. All authors read and approved the final manuscript.