Identification and Evaluation of the Expansin Genes in Salix purpurea Genome

Expansin is an indispensable cell wall protein, involved in plant growth and resistance to stress. Based on the sequencing genome data, a total of 34 expansin genes were identified on 14 chromosomes in Salix purpurea, which were grouped into 4 subfamilies. The analysis of the physicochemical properties showed that most expansin proteins were in weakly alkaline, had a higher aliphatic index and more stable protein structure. The gene sequence was more conservative among the members in the same subfamily with the amino acid identity between 34.2% and 99.4%, while that was between 15.0% and 39.8% among the members in the different subfamilies. Also, the same subfamily genes had similar exon-intron structures, but the different subfamily genes showed that much differently. The codon usage analysis revealed that there are 5 high-frequency codons in Salix purpurea expansin genes. The codon usage bias was much affected by natural selection. The synteny analysis showed that the expansin genes divided in corresponce to whole-genome duplication events in willow. The results here would help us to understand well the role and status of expansin genes in the growth and development of Salix purpurea. It will be helpful for further study on the expansins functions.


Background
Expansins are important components of plant cell wall that can loosen plant cell wall and increase cell flexibility by breaking the hydrogen bonding between cellulose microfibrils and hemicellulose. Therefore, expansin plays an important role in various growth processes of plants, such as seed germination , root elongation (Noh et al., 2013), fruit ripening (Jiang et al., 2019), drought resistance (Chen et al., 2016), heavy metals toxicity (Zhang et al., 2018a), and salt tolerance (Lu et al., 2013).
Typical expansin in plants consists of three domains, namely the N-terminal signal peptide, the catlysis domain in the middle, and the C-terminal cellulose-binding domain. According to the differences of gene sequence and structural composition, expansins can be divided into four subfamilies: EXPA, EXPB, EXLA and EXLB. In recent years, along with the development of genome sequencing, the expansin genes in more plant species have been identified, such as grape (Dal santo et al., 2013) and poplar  of woody plants, and Arabidopsis (Sampedro et al., 2006), maize (Zhang et al., 2014), rice  of herbaceous plants. Clearly, the expansin genes are not only different in quantity in different species, but also various in gene sequence composition and codon usage pattern.
Willow is an important woody tree species for ecological conservation with rich variety resources. The study here supposed to identify the expansin genes based on the sequenced genome of willow variety Salix purpurea, and to understand the characteristics of the willow expansin genes via bioinformatics analysis. The final results would provide some useful information for further exploring the expansin gene function and the effect of the expansin genes on willow growth, development and resistance. 1Results

Identification and chromosome location of the expansin genes in Salix purpurea
The amino acid sequence of the expansin genes in Populus trichocarpa was used for searching the homologous genes in Salix purpurea genome. The further analysis of the expansin-specific domains showed that 44 expansin gene sequences were present in Salix purpurea, of which 9 sequences were duplicate genes and 1 sequence had incomplete gene structure (since of incomplete sequencing). Finally, 34 expansin genes were determined in Salix purpurea.
According to amino acid homology with the poplar expansins, the willow expansin genes were then named individually (two homologous genes were distinguished by a, b). They were classified into four subfamilies: SpEXPA, SpEXPB, SpEXLA, and SpEXLB. Of them, the SpEXPA subfamily contained 26 members, and the SpEXPB, SpEXLA and SpEXLB subfamilies had 3, 2 and 3 members, respectively ( Figure 1). Considering the physical location of the willow expansin genes, 34 genes and 9 duplicate genes were mapped on 14 chromosomes, where chr04 and chr16 contained 4 expansin genes individually, chr01, chr03, chr13 and chr17 had 3 genes, chr02, chr08, chr09 and chr10 had 2 genes, and chr05, chr06, chr14 and chr19 had 1 gene. Besides, the SpEXPB2 and SpEXLB4 were located on the unmapped scaffolds (Figure 2), and chr07, chr11, chr12, chr15, and chr18 did not contain any expansin genes. Thus, the distribution of expansin genes on willow chromosomes was not random.

Analysis of the expansin gene composition in Salix purpurea
Statistics showed that the total exons of each expansin gene was in length of 663-861bp, with an average of 774 bp. And the total length of introns varied greatly from 185 bp to 1558 bp. Mostly, the SpEXPA subfamily genes contain 2 introns, while SpEXPB, SpEXLA and SpEXLB subfamily genes contain 3, 4 and 4 introns. All the expansin genes had a signal peptide with 17-36 amino acids (Table 1). Thus, the structure of willow expansin genes was subfamily characteristically, as same as the other species. The sequence identity in nucleotides among the willow expansin genes was in range of 31.6%~99.0%, which was really higher than that in amino acids (15.5%~99.4%). Of them, the sequence identity in nucleotides among the subfamily members was from 43.9% to 99.0%, and that in amino acids was from 34.2% to 99.4%. While, the sequence identity in nucleotides and in amino acids among different subfamily members were in range of 31.6%~54.6% and 15.0%~39.8%, respectively ( Figure 3A). It also showed the expansin gene subfamily characteristically. Further phylogenetic tree analysis showed that the subfamily genes all clustered together ( Figure 3B). Therefore, sequence alignment can be used for classification of expansin gene subfamily and identification of homologous genes in different species, and promote the understanding of the evolution history of expansin genes via further study of the gene function diversity among different subfamily genes.

Structural analysis of the expansin genes in Salix purpurea
Of the 26 EXPA subfamily genes, 20 genes contained 2 introns (17 genes had the intron splicing mode of AB, 3 genes were in AF, CB and CF), 4 genes contained 1 intron (in B or C splicing mode), and 2 genes had 3 introns (in ACB or ACF splicing mode). All EXPB subfamily genes contained 3 introns (in ABF splicing mode excepting for SpEXPB1 in ACB). SpEXLA and SpEXLB subfamily genes generally contained 4 introns (in ACBF splicing mode excepting for SpEXLB1 in ACF). Obviously, the subfamily expansin genes of willow not only had similar exon-intron structure but also had the similar intron splicing mode, which also really showed subfamily characteristically ( Figure 4). Several other genes displayed a little variation, probably suffered from different evolution pressure.

Codon usage pattern of Salix purpurea expansin genes
A total of 34 willow expansin genes contain 8773 codons. The relative frequency of synonymous codon (RFSC) of some amino acids was really different (Table 2). For example, the RFSC values of the three synonymous codons AUU/AUC/AUA encoding Ile were 45.25%, 37.00% and 17.75%, respectively. Further, 5 high frequency codons (synonymous codons with frequency higher than 60% or 1.5 times higher than the average frequency (Lin et al., 2002) were determined in willow expansin genes, which were CCU (coding Pro), GCU (coding Ala), GAU (coding Asp), AGA and AGG (coding Arg). The less amounts of high frequency codons indicated the expansin genes expressed weakly in willow. The probable modification of the codons might be expected in future to advance the gene expression efficiency. The correlation analysis between GC12 and GC3 of codons ( Figure 5A) showed that the expansin genes loci were evenly distributed at both sides of the standard curve, and the correlation was not significant (P=0.282) by two-tailed test. The regression coefficient was 0.089 1, indicating a significant difference of the base composition was present at the first and second site of codon comparing with the third site. The mutation pressure (ε) accounts for 8.91% of the varition, and the natural selection pressure (1-ε) contributes 91.09%. Thus, natural selection had a great influence on codon usage bias.
The ENC ratio of the willow expansin genes ranged from -0.03 to 0.24. Of them, the genes with -0.05~0.05 accounts for 30.00% and the genes with 0.10~0.20 accounts for 47.06%, indicating a great difference was present between the observed ENC value and the expected value. Correlation analysis between ENC value and GC3s displayed that most of the genes were distributed farther under the standard curve It indicated that natural selection was the main influencing element for gene codon bias in evolution. A few genes distributed on the standard curve indicated that the mutation effect could partly cause the effect on codon bias ( Figure 5B).
A Parity Rule 2 (PR2)-bias plot was used to evaluate the selection of pyrimidine (C and T) and purine (A and G) at the third base of codon. The vector dots were unevenly distributed in four regions, majorly at the down-left of the plan. It indicated that the frequency of T at the 3 rd base of codons was higher than that of A, and the frequency of C was higher than that of G. In other words, the pyrimidine was more frequently used than purine. Natural selection had great influence on codon usage bias, and played an important role in evolution as some other factors ( Figure 5C).
Based on the correspondence analysis of the relative synonymous codon usage (RSCU, the relative probability of a particular codon among the synonymous codons for the amino acid that could avoid the effect of amino acid composition on codon usage), the 1 st axis contributed the largest variation of 13.43%, followed by the 2 nd , 3 rd , and 4 th vector axes with the variation of 12.05%, 9.32% and 8.46%, respectively. A significant negative correlation was defined for the 1st axis with CAI, CBI and FOP with the correlation coefficient of -0.556, -0.617 and -0.576, respectively. No significant correlation was revealed for the 1 st axis with ENC and GC3 with the correlation coefficient of 0.22, 0.088, respectively. Obviously, CAI, CBI and FOP had great influence on the codon bias of willow expansin genes, along with the other factors accordingly ( Figure 5D).

The synteny analysis of expansin genes in Salix purpurea
Based on the BLASTX analysis, 17 pairs of willow expansin genes were detected having a syntenic relationship (including 4 pairs of tandem duplications), even the cross-synteny among several expansin genes such as SpEXPA1/SpEXPA7/SpEXPA15, and SpEXPA2/SpEXPA12/SpEXPA13 (Figure 6). They were mostly distributed on chromosome 4, 8, 10, 16 and 19 and displayed as paralogs or orthologs, where the duplicate fragments probably existed.  (Table 3). Probably they had experienced a gradual evolution process via several times of gene replication and expansion.

Physicochemical properties of expansin proteins in Salix purpurea
The molecular weight of the expansins in Salix purpurea was predicted in range of 23.61-31.70kD with an average value of 27.91. The most expansins showed slightly alkaline and had the pI value >7.0 excepting for SpEXPB1, SpEXLB3, and SpEXLB4, with an average of 8.66. The aliphatic index of the proteins ranged from 61.72 to 82.51, with an average value of 70.61. The relatively high aliphatic index ensured them function normally in various environment conditions, especially SpEXPA23 belonged to thermophilic protein (the aliphatic index was 82.51). The grand average of hydropathicity of willow expasins was from -0.403 to 0.051, indicating them as amphiphilic proteins (positive value means hydrophobicity while negative value indicates hydrophilicity). And, the instability index was between 22.93 and 46.40 with an average value of 34.51(more than 40 was unstable). Thus, 85.29% of the expansins had the stable protein structure, which was necessary for their function in plant growth and resistance (Table 4).

Discussion
In this study, 34 expansin genes belong to four subfamilies were identified based on the sequenced genome of Salix purpurea. The genes in a subfamily had a conservative intron-exon structure and high sequence identity. Codon analysis showed the willow expansin genes having low codon usage bias, mainly resulted from the natural selection. And, codon composition also displayed characteristically by species and genus, In addition, the formation of willow expansin gene family might be in relation to several duplication events of the genome. All the information here partly revealed the evolution history and background of the willow genome. Similarly, Harikrishnan detected a large amount of duplication fragments in willow genome through RNA sequencing (Harikrishnan et al., 2015), with an average Ka/Ks (nonsynonymous substitution/synonymous substitution) of 0.23. Probably the willow genome had experienced strong purification selection process in evolution, which was consistent with our results based on a single gene family analysis here.
More studies on the expansin gene function had been carried out in several laboratories via gene expression (Castillo et al., 2018) and gene transformation experiment (Zenoni et al., 2011). The results really confirmed the important role of expansin genes in plant growth and development and environmental adaptation process from different perspectives. Bioinformatics analysis also provided the evidences, including the expansin genes in Salix purpurea, that the sequence identity of the subfamily genes was generally over 80%, and the proteins had three typical domains. The synteny analysis of the willow expansin genes showed their Ka/Ks value all less than 1, indicating that the willow expansin genes had suffered from a serious purification selection pressure in evolution and the gene with nonsynonymous mutation were really eliminated. Also, the structure stability and conservation of the expansin genes would be necessary for their function process, as same as the performance of the indispensable housekeeping genes in plants. The expansin genes obviously display species and genus characteristics. Based on the 11 sequenced plant genomes, more expansin genes were identified in monocots plants of rice (58)  and maize (88) (Zhang et al., 2014b), while less gene members were detected in dicots plants of willow (34), poplar (36) , Arabidopsis (36) (Sampedro and Cosgrove, 2005), potato (36) (Chen et al., In 2019), grape (29) (Dalsanto et al., 2013), jujube (30) (Hou et al., 2019), apple (36) (Zhang et al., 2014a), soybean(65) (Zhu et al., 2014), and cucumber (35) (Hao et al., 2015). Possibly, the monocots are at the top of evolution, and had experienced several times of duplication events in evolution that might cause more expansin genes retained. And the enriched amount of expansin genes is consistent with the evolution of species and advanced performance of plants, which really determined the important status of expansin genes in plant genome. In addition, the frequency of synonymous codons usage in expansin genes also varied by species. The amount of high-frequency codons in monocots of maize and rice were 24 and 26, respectively, while 5, 6, 5 and 9 in dicots of willow, poplar, grape, and Arabidopsis (Zhang et al., 2018b). Meanwhile, 4 of 5 high-frequency codons (CCU, GCU, AGA and AGG) were shared in Salix purpurea and Populus trichocarpa,3 of them (CCU, GCU, AGA) had the RFSC values close to the threshold values of high-frequency codons in dicots of grape and Arabidopsis but significantly reduced in monocots of maize and rice. On the contrary, some high-frequency codons in monocots were used less in dicots. Some reports demonstrated that more high-frequency codons might be improvable on gene expression (Liu, et al., 2004). Therefore, codon usage bias of the expansin genes in monocots and dicots also pretends the gene expression variation even the status in genome.
Of the all expansin genes in plant genome, the proportion of EXPB subfamily members was small in several dicots species (ranging from 8.57% to 16.67%), while that in monocots was relatively high such as 32.76% in rice and 54.55% in maize (Li et al., 2016). Considering of the mechanism, the expansins could bind then break the covalent bonding between cellulose and hemicellulose in cell wall, and finally loosen cell wall components (Cosgrove, 2015). Some experiments reported that the cell wall components in monocots and dicots varied greatly. The hemicellulose in dicots mainly consists of xyloglucan, while the hemicellulose in monocots mainly consists of glucurono-arabinoxylan polysaccharide that was more likely to be the target of EXPB (Cosgrove, 2000). Considering of the evolution process, all EXPB genes in monocot of rice were involved in segmental or tandem replication but only a few EXPB genes in dicots of soybean and Arabidopsis were involved (Zhu et al., 2014). The difference in duplication and expansion in evolution could also lead to the divergency of expansin genes varied in species and genera (Kong et al., 2007). In terms of function, the EXPB genes were more involved in reproductive-related pollen tube elongation (Cosgrove et al., 1997) and fruit ripening (Brummell et al, 1999), while the EXPA genes were mostly involved in growth and development of plants. They might partly explained that the woody dicots plants need high proportion of EXPA genes for wood production, while poaceae monocots plants need high proportion of EXPB genes for seed harvest. It would be interested to be further discussed in future.

Genes and genome sequences
The genomic sequence of Salix purpurea was downloaded from the phytozome database, and the expansin genes sequences of Populus trichocarpa was obtained from the EXPANSIN website.

Identification of the expansin genes in Salix purpurea
The nucleotide sequence of the expansin genes in Populus trichocarpa was used as a probe to search the sequenced genome library of Salix purpurea, and defined the candidate expansin genes with E-value 0.001 of BlastP alignment. The gene domain of Pollen_allerg_1 (PF03330) and DPBB_1 (PF01357) were verified by using the hidden Markov model of Pfam, NCBI conserved domain database (CDD) and SMART. The willow expansin genes were finally named according to the sequence identity with that in Populus trichocarpa. Based on the physical information of sequenced genome, each expansin gene was individually mapped on the chromosomes by the TBtools software. The Neighbor-Joining (NJ) method of MEGA7 was used to construct the phylogenetic tree based on the amino acid sequence of the expansin genes in Salix purpurea and Populus trichocarpa, and bootstrap was used to evaluate the phylogenetic tree with the repetition value of 1000.

Sequence analysis of expansin genes
The sequence identity between genes was conducted by the bioinformatics software DNAMAN v7.0. The box-plot map of the expansin genes in nucleotide and in amino acid identity was drawn by using Origin9. SignalP v4.1 was used to predict the signal peptide. ProtParam proteomics online tool was used to obtain the molecular weight, protein isoelectric point, aliphatic index and other related physicochemical parameters of expansins. The intron-exon structure map of the gene was drawn by online tool GSDS.

Codon usage bias analysis
The codon usage bias of expansin genes in Salix purpurea was analyzed by CodonW v1.4.4 and CHIPS and CUSP programs of EMBOSS. The neutrality plot analysis was carried out with GC3 and GC12 as horizontal and longitudinal coordinates, two-dimensional scatter plot of ENC was drawn with GC3S as abscissa and ENC as ordinate. PR2-plot analysis was carried out with G3/(G3+C3) as abscissa and A3/(A3+T3) as ordinate. All the genes were plotted in a 59-dimensional vector space based on RSCU values. The correspondence analysis was conducted bewteen one-axis and two-axis, and the correlation analysis was conducted by SPSS 20.0 software.

The synteny analysis of expansin genes
All mapped expansin genes of Salix purpurea were conducted for synteny analysis by using Circos. Two or more genes in a <200 kb fragment were defined as tandem duplications (Holub, 2001). The Ka/Ks (non-synonymous substitution/synonymous substitution of each site) ratio was calculated for each pair of the duplicated genes by using ClustalW alignment tool and KaKs_Calculator2.0 software. The divergence time in evolution was estimated using the formula T=Ks/2r (r=1.5×10-8, the divergence rate of plant nuclear gene) (Koch et al., 2000).

Authors' contributions
Yang Ruixia and Liu Xinru designed and carried out the study and contributed equally to this work. Lan Baoliang, Wang Hao, and Liu Xiao participated the data analysis and manuscript writing. Xu Jichen is the corresponding author for the project and gave the guidance on experiment data analysis and manuscript writing.