SNP Identification from RAD-Seq Data in Faba Bean ( Vicia faba L.)

Faba bean is an important edible legume crop in China. Due to its huge genome size and no available reference genome, SNP marker is very limited in faba bean. To identify genome-wide SNP markers, we obtained 35.47 Gb data from eight landraces by RAD-sequencing, with an average of 4.77 Gb data for each accession. A total of 245443516 reads were generated, the single accession has an average of 30680439.5 reads, and the average length of the reads reaches to 144 bp. The Q20 and Q30 values were over 97.89% and 93.83%, respectively. The GC content between the reads varied from 38.05% to 40.09%. Using a special Bayesian method under the situation of no reference genome, we identified 3722 SNPs among the eight landraces. Regarding the single accession, the detected SNPs varied from 3278 to 3578, and the homozygous SNPs number was larger than that of heterozygous SNPs for most of accessions. For the SNP types, T:A->C:G type has the largest proportion (38.8%), followed by C:G->T:A (28.0%) and the smallest is T:A->A:T (7.5%). 31 SNPs were selected to convert into KASP markers, and they showed a success rate of 66.7% through amplifying on 46 accessions. The SNPs in this study provide a strong genetic tool for germplasm identification, gene mapping and molecular breeding in faba bean.

Keywords Faba bean; SNP; RAD-Sequencing; KASP Faba bean (Vicia faba L.), belonging to Leguminosae Papilionidae Vicia genus, is an important edible legume crop in temperate and subtropical regions. Faba bean has been cultivated in China for over 2100 years (Wang et al., 2011;Yang et al., 2016). Faba bean is rich in protein and cellulose, and could be digested and absorbed easily, its dry seeds can be used as food, feed or processing into leisure food, while its fresh grains can be used as vegetables. Faba bean also plays an important role in crop rotation and nourishing system in the cropping structure adjustment due to the biological nitrogen fixation from the roots. In recent years, the annual planting area of faba bean in China is about 925,000 hm 2 and the annual yield is about 1,595,000 tons, ranking first both on production area and total yield in the world (http://faostat3.fao.org/). Faba bean is a diploid crop (2n=2x=12), with a genome of about 13 Gb (Johnston et al., 1999). As one of the largest legume genomes, faba bean genome size is approximately 25 times larger than that of the model legume Medicago truncatula. Due to the larger genome, the genome research such as whole genome sequencing and marker development has been hindered seriously, thus causing limited genetic gain through the use of molecular markers. Recently, with the rapid development of next-generation sequencing technology and the significant decrease of sequencing cost, high-throughput sequencing has been widely used in the complex and larger genome crops such as wheat for marker development and gene mapping (Poland et al., 2012;Wang et al., 2014). In faba bean, Yang et al. (2012) identified 125,559 SSR sequences and developed 28503 SSR markers by sequencing the mixed genome of 247 germplasms using the 454 sequencing technology; Kaur et al. (2012) identified 304,680 specific genes and developed 802 SSR markers through transcriptome sequencing of two faba bean varieties using the 454 sequencing technology; Webb et al. (2016) detected 845 SNPs from transcriptome sequencing data of two faba inbred lines and constructed a genetic map containing 687 SNP markers. These markers resources laid a solid foundation for faba bean genetic breeding. Single nucleotide polymorphism (SNP), the DNA sequence polymorphism caused by a nucleotide mutation on the genome, has become an ideal next generation molecular marker due to its advantages including wide distribution on the genome, high density, high stability and high-throughput. However, the SNP marker resource is relatively limited in faba bean until now excepting for the 845 SNPs reported in Vignal et al. (2002).
Reduced representation sequencing such as restriction site-associated DNA sequencing (RAD-Seq) is a next-generation sequencing technique that reduces genome complexity by digestion of genomic DNA by restriction enzymes, followed by the sequencing of fragments within a given size range, and thus to obtain partial sequences representing the whole genome information (Baird et al., 2008;Rowe et al., 2011). Due to its moderate sequencing depth, low cost and independent of reference genome, RAD-Seq has been widely used in marker development, genetic map construction and target gene mapping in multiple non-model species (Yang et al., 2013;Hegarty et al., 2013;Xu et al., 2014). In the current study, RAD-seq was used to sequence eight faba bean landraces to identify genome wide SNPs based on a special SNP detection technology under lack of reference genome (Xu et al., 2014), and also analyze faba bean SNPs characteristics. The obtained SNPs markers will provide a new resource for future gene mapping, genetic map construction and molecular marker-assisted breeding.

Sequencing data statistics
In the current study, eight faba bean accessions were sequenced using the Illumina Hiseq platform and a total of 35.47 Gb data was obtained. Total 245443516 high quality reads were generated, and the average length of each reads was 144 bp. For each accession, the smallest sequencing data was 3.83 Gb, the largest was 4.77 Gb, and the average sequencing data was 4.43 Gb; the read number ranged from 26415662 to 32822210, with the average of 30680439.5; Q20 and Q30 were greater than 97.89% and 93.83%, respectively, and the GC content varied from 38.05% to 40.09% (Table 1).

Genome wide SNPs identification in faba bean
3722 SNPs were identified according to Xu et al. (2014) where a special Bayesian method was used to detect bottle gourd SNPs without reference genome. For single accession, the least 3278 SNPs were identified in FB076 while the most 3578 SNPs were identified in FB079. Among the eight accessions, the homozygous SNPs number ranged from 1579 to 2033 and the heterozygous SNPs number varied from 1245 to 1804, the number of homozygous SNPs was larger than that of the heterozygous SNPs in most accessions excluded FB080 and FB056 (Table 2). Among the 6 SNP mutational types, T:A->C:G mutational types accounted for the largest proportion (mean 38.8%), followed by C:G->T:A (mean 28.0%), and the smallest proportion was T:A->A:T (mean 7.50%) ( Table 3).

SNPs validation
To verify the effectiveness of SNPs detected in this study, 56 SNPs were selected to develop KASP markers after filtering the SNPs with missing rate ≤20%, MAF value ≥0.05 and SNP occurrence frequency ≥40. Finally, 31 SNPS were converted into KASP markers (Table 4), with a conversion success rate of 55.3%. These 31 KASP markers were used to genotype 46 faba bean germplasms, and the results showed that 22 markers detected successful signal, of which 14 markers showed no polymorphism, 4 markers revealed two genotypes and other 4 markers displayed three genotypes ( Figure 1).

Discussion
With the cost reducing rapidly, genome resequencing technique has been widely used to identify genome-wide SNPs in many crops. However, due to larger genome size and lack of reference genome, the SNP mining research in faba bean lags behind other legume crops such as soybean, common bean and cowpea (Wu et al., 2010;Song et al., 2015;Muňoz-Amatriaín et al., 2017). Ocana et al. (2015) and Webb et al. (2016) have identified 39060 SNPs and 845 SNPs by transcriptome sequencing of different inbred lines, respectively. Webb et al. (2016) also converted the SNPs into KASP markers and constructed the first genetic map based on SNPs markers in faba bean.
In the current work, eight faba bean germplasm were sequenced using the RAD-Seq technique, and each one got about 4.77Gb data, which covering about 36.7% of the whole genome, indicating a low-depth genome coverage.
To improve the accuracy of SNP identification, a special Bayesian method developed by Xu et al. (2014) was used to detect SNPs and finally a total of 3722 SNP markers were identified. The number of SNPs detected in different accessions varied highly, and the maximum difference between two accessions reaches to 300 SNPs. Further analysis showed that there was very weak correlation between the sequence data and the identified SNPs in a single accession, for example, FB076 with a larger sequencing data volume identified the least SNPS (Table 1 and  Table 2). Among the 6 SNP mutation types, T:A->C:G mutation types accounted for the largest proportion (mean 38.8%), followed by C:G->T:A (mean 28.0%), and the smallest proportion was T:A->A:T(mean 7.50%). However, the study in Ocana et al. (2015) showed that C/T type SNPs accounted for the largest proportion (18.7%), following by A/G (16.5%), T/C (15.3%) and G/A (14.2%), while other SNPs accounted for only 5%, indicating that the SNPs distribution on the genome is not evenly and different SNPs types varied highly in different cultivars. Different from Ocana et al. (2015) and Webb et al. (2016) where the SNPs detecting from transcriptome sequencing, the SNPs in the current study derived from genome sequence, thus could identify the SNPs both in inter-genes and intra-genes regions.  Figure 1 The genotype signals detedted by the KASP markers in this study Note A: failure; B: single allele genotype; C: two alleles genotype; D: three alleles genotype Faba bean is an often cross-pollinated crop with a natural crossing rate of 4%~84% (Kaur et al., 2012). Among the SNPs identified in this study, about 46% SNPs were heterozygous mutation (Table 2), indicating that higher heterozygous rate appeared in all the 8 landraces, which reminding that heterozygous genetic background should be paid more attention when analyzing the genetic diversity of faba bean germplasms. In this study, 31 SNPs were converted into KASP markers, of which 66.7% markers detected corresponding genotypes in different germplasm, which was close to the success rate of 67.6% for the KASP markers in Webb et al. (2016). As the KASP conversion rate in faba bean SNP is relatively low, the major reason is lack of reference genome, thus affecting the accuracy of SNP calling. Anyway, these SNPs provided a useful resource for genetic diversity assessment, gene mapping and molecular breeding in faba bean.

Materials
The faba bean germplasms used in this study was collected and preserved by Lishui Agricultural and Forestry Science Research Institute, among which 8 landraces (FB017, FB032, FB036, FB056, FB076, FB080, FB081) were used to sequence, and 46 germplasm were used to verify the accuracy of SNPs markers.

DNA extraction
The one week old young leaves were taken and grinded in liquid nitrogen, and then genomic DNA was extracted with a DNA extraction kit (Tiangen Biochemical Technology (Beijing) Co., LTD.). DNA quality was measured by agarose gel electrophoresis, and DNA concentration was detected by NanoDrop 2000 ultramicro spectrophotometer.

RAD library construction and sequencing
The RAD library was constructed according to the method of Baird et al. (2008). Genomic DNA was digested with EcoRI, and the 3' end of the digested fragment was ligated to a unique nucleotide multiplex identifier (MID) for barcoding. The genomic DNA libraries were enriched by PCR amplification and the PCR fragments were purified by agarose gel electrophoresis. Finally the single-end sequencing was performed using Illumina HiSeq 2000 platform.

SNP identification
The raw sequences were generated by Illumina base calling software CASAVA v1.8.2 (http://support.illumina.com/sequencing/sequencing_software/casava.ilmn), and quality analysis was performed using the Trimmomatic (http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic) under the default parameters. After removing the reads ≤ 85 bp, the high quality reads were clustered into read tags (RAD-tags) using ustacks (Catchen et al., 2011) based on sequence similarity. According to the SNP calling method in Xu et al. (2014), the RAD-tags were then collapsed into clusters using cstacks under default parameters for SNP calling, finally the SNP genotypes were corrected again by the Bayesian algorithm (Hohenlohe et al., 2010).

KASP marker development and genotyping
KASP primers were designed using the Kraken™ software (LGC, Biosearch Technologies) based on the 100 bp fragment covering the SNP site. Each KASP marker includes two allele-specific forward primers and one common reverse primer. KASP genotyping was performed using the IntelliQube workstation (LGC, Biosearch Technologies) in the central laboratory of Zhejiang Academy of Agricultural Sciences. The PCR reaction was performed in a 1.6 μL volume containing 0.8 μL DNA template, 0.8 μL mixture of 2 x Master mix and Primer mix. PCR program includes 36 cycles following this protocol: 15 min at 95°C; 10 touchdown cycles of 20 s at 94°C and 60 s at 61°C~55°C (with the annealing temperature for each cycle reduced by 0.6°C per cycle); and 26 cycles of 20 s at 94°C and 60 s at 55°C. The genotype signals were detected by the own analysis system in IntelliQube.

Authors' contributions
Zhu Yujie translated the original Chinese paper into English version. Liu Tingfu, Li Hanmei and Wang Linlin contributed to germplasm collection, purification and field management. Wang Baogen and Wu Suqi contributed to DNA extraction and genotype analysis; Li Sujuan contributed to primer design and KASP genotyping. Wu Xinyi and Li Guojing conceived and designed the research and Wu Xinyi wrote the manuscript. All authors read and approved the final manuscript.