Cattle are a significant source of nutrition and livelihood for humans worldwide. Cattle belong to the Ruminantia, which occupy diverse terrestrial environments. They are renowned for their ability to convert low-quality forage efficiently into energy, fat, muscle and milk. Humans have exploited these biological processes since domestication, which began in the Near East some 8000-10000 years ago (Willham 1986; Elsik et al. 2009). Since then, over 800 cattle breeds have been established, representing an important world heritage and a scientific resource for understanding the genetics of complex traits. DNA methylation is a chemical modification of DNA that can be inherited without changing the DNA sequence (Amiri Roudbar et al. 2015). It involves the addition of a methyl group to DNA, typically occurs at CpG dinucleotides (Vercelli, 2016) and is an important part of gene regulation (Irizarry et al. 2009; Wu et al. 2010). The DNA of most vertebrates, especially mammals, is depleted in CpG dinucleotides. The remaining CpGs are clustered in regions referred to as CpG islands (CGIs). Interest in CGIs has grown because they are enriched in gene promoters (Hackenberg et al. 2010) and because, through altered DNA methylation, they play important roles in the regulation of gene expression and in gene silencing during processes such as X-chromosome inactivation, imprinting and silencing of intragenomic parasites (Takai and Jones, 2002; Su et al. 2010). A fuller understanding of CGIs could therefore help in discovering the epigenetic causes of cancer (Han and Zhao, 2009; Wu et al. 2010; Saito et al. 2016). Because of these crucial roles, multiple algorithms (either species-specific or general) have been developed for identifying CGIs in genomes.
The first algorithm was proposed by Gardiner-Garden and Frommer in 1987; it defines a CGI as a region of at least 200 bp with a GC content greater than 50% and an observed-to-expected CpG ratio (ObsCpG/ExpCpG) greater than 0.6 (Gardiner-Garden and Frommer, 1987; Glass et al. 2007; Tan et al. 2011; Yang et al. 2016). Many algorithms and applications were designed around these three features, and newer algorithms that address the shortcomings of their predecessors were also introduced (Rice et al. 2000; Takai and Jones, 2002; Wang and Leung, 2004). These algorithms depend on fixed cutoffs and leave out important CpG clusters associated with epigenetic marks relevant to development and disease; because they were developed mainly for studies of the human genome, they are also not applicable to non-vertebrate genomes (Irizarry et al. 2009). To solve these problems, Wu et al. (2010) proposed an alternative, extensible approach to detecting CGIs based on hidden Markov models. The main advantage of this approach over others is that it summarizes the evidence for CGI status as probability scores. This provides flexibility in the definition of a CGI and facilitates the creation of CGI lists for other species. Its utility was demonstrated by generating the first CGI lists for invertebrates and by producing CGI lists that substantially increase the overlap with known epigenetic marks (Wu et al. 2010). Hidden Markov models (HMMs) have been shown to be very effective in representing biological sequences and have also been used successfully for modeling speech signals (Rabiner, 1989). HMMs have thus become increasingly popular in computational molecular biology and in many state-of-the-art studies. From 2009 onwards, the results of whole-genome sequencing projects in domestic cattle have been released (Elsik et al. 2009).
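The three Gardiner-Garden and Frommer criteria described above can be illustrated with a minimal Python sketch. The function names and the way the thresholds are passed are illustrative assumptions, not part of any published tool; the observed-to-expected ratio is computed in the usual way as (number of CpG × sequence length) / (number of C × number of G).

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def obs_exp_cpg(seq: str) -> float:
    """Observed-to-expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    n_c = seq.count("C")
    n_g = seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0  # no CpG is possible without both C and G
    return seq.count("CG") * len(seq) / (n_c * n_g)


def meets_ggf_criteria(seq: str, min_len: int = 200,
                       min_gc: float = 0.5, min_oe: float = 0.6) -> bool:
    """Check a candidate region against the three classical thresholds."""
    return (len(seq) >= min_len
            and gc_content(seq) > min_gc
            and obs_exp_cpg(seq) > min_oe)
```

A region passes only if all three conditions hold at once, which is why such rules are sensitive to the chosen cutoffs.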
CGIs and their correlation with genomic features have been compared in some mammalian genomes (mostly human and mouse) and non-mammalian genomes (fish) (Glass et al. 2007; Han et al. 2008; Han and Zhao, 2008; Irizarry et al. 2009; Wu et al. 2010), but to our knowledge there has been no dedicated study of CGIs and their correlation with genomic features in cattle. Therefore, the main objective of this study was to investigate the presence of CGIs at the DNA sequence level in the cattle genome and to compare it with other vertebrate genomes.
MATERIALS AND METHODS
Genome sequences and other genome information
Assembled genome sequences of cattle were downloaded from the National Center for Biotechnology Information (NCBI). Statistics and other genome information at the chromosome level are given in Table 1. CGIs of sheep, goat, dromedary camel, bactrian camel and alpaca were predicted with the HMM algorithm. CGI predictions for the other vertebrate genomes (human, mouse, dog, horse, chicken and zebrafish), generated with the HMM algorithm of Wu et al. (2010), were downloaded from the website www.rafalab.Jhsph.edu/CGI/. Recombination rate data for cattle (window size, 1 Mb) were obtained from Weng et al. (2014).
Algorithm for detection of CGIs
CGIs were identified with the HMM algorithm (Irizarry et al. 2009; Wu et al. 2010). The foundation of this algorithm is the stochastic modeling of bases in the genome. It assumes that each part of the genome is in one of two states (CGI or baseline). The genome is divided into non-overlapping segments of length L (in bp). A length of L = 16 was used because at this length the association of the identified CGIs with epigenetic markers is higher. The numbers of C, G and CpG in each segment of length L are used as input to the model, and the hidden state of segment s is denoted Y(s), with Y(s) = 1 for CGI and Y(s) = 0 for baseline. Y(s) is assumed to be a stationary first-order Markov chain. The state is chosen on the basis of two HMMs. The first classifies GC content as high or low, assuming a binomial distribution approximated by a normal density. The second models the number of CpGs with a Poisson distribution whose mean is as follows:
$$a_i \times L \times p(s)^2 / 4$$
where $a_i$ is the O/E ratio for the CGI state ($i = 1$) or the baseline state ($i = 0$), and $p(s)$ is the GC content of segment $s$, calculated in the first step.
As a final step, the algorithm obtains the posterior probability of each state and creates lists of CGIs using different specificity cutoff values. A cutoff value of 0.9 was chosen on the basis of the association of CGIs with epigenetic markers. For the mathematical details of this algorithm, see Wu et al. (2010).
Table 1 Statistics and distribution of CpG islands and genome in chromosomes of cattle
CGI: CpG island and GC: guanine-cytosine.
This algorithm is implemented in an R add-on package called makeCGI.
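The CpG-count layer of the two-state model described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not the makeCGI implementation: the GC-content HMM is omitted and the raw GC fraction of each segment is used as p(s), and the O/E values `a` and the state persistence probabilities `stay` are hypothetical placeholders rather than fitted parameters. Posterior probabilities are obtained with the standard forward-backward recursion.

```python
import math


def segment_counts(seq, L=16):
    """Split seq into non-overlapping length-L segments.

    Returns (GC fraction, CpG count) per segment; CpGs spanning a
    segment boundary are ignored in this simplified sketch.
    """
    seq = seq.upper()
    out = []
    for start in range(0, len(seq) - L + 1, L):
        seg = seq[start:start + L]
        gc = (seg.count("C") + seg.count("G")) / L
        out.append((gc, seg.count("CG")))
    return out


def expected_cpg(a_i, L, p):
    """Poisson mean a_i * L * p(s)^2 / 4 for one state."""
    return a_i * L * p * p / 4


def poisson_pmf(k, lam):
    if lam <= 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def cgi_posterior(segments, a=(0.2, 0.8), stay=(0.99, 0.9), L=16):
    """Posterior P(Y(s) = 1) for each segment (forward-backward).

    a    : hypothetical O/E ratios for baseline (0) and CGI (1)
    stay : hypothetical probabilities of remaining in state 0 / 1
    """
    states = (0, 1)
    emis = [[poisson_pmf(k, expected_cpg(a[i], L, gc)) for i in states]
            for gc, k in segments]
    trans = [[stay[0], 1 - stay[0]], [1 - stay[1], stay[1]]]
    # Stationary distribution of the chain as the initial distribution.
    pi1 = trans[0][1] / (trans[0][1] + trans[1][0])
    init = [1 - pi1, pi1]

    n = len(segments)
    fwd = []
    for t in range(n):
        if t == 0:
            f = [init[j] * emis[0][j] for j in states]
        else:
            f = [emis[t][j] * sum(fwd[-1][i] * trans[i][j] for i in states)
                 for j in states]
        total = sum(f)
        fwd.append([x / total for x in f])  # rescale to avoid underflow

    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        b = [sum(trans[i][j] * emis[t + 1][j] * bwd[t + 1][j] for j in states)
             for i in states]
        total = sum(b)
        bwd[t] = [x / total for x in b]

    post = []
    for t in range(n):
        w = [fwd[t][i] * bwd[t][i] for i in states]
        post.append(w[1] / sum(w))
    return post
```

Segments whose posterior probability exceeds the chosen cutoff (0.9 in this study) would then be merged into candidate CGIs; lowering or raising the cutoff changes the list without refitting the model.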
CGI mapping to different genomic regions
The method of Han et al. (2008) was used to identify CGIs in different genomic regions (genes, intergenic regions, intragenic regions and transcriptional start site (TSS) regions). Briefly, the locations of CGIs were compared with the coordinates of the different genomic regions, taken from the cow gene annotation in the NCBI and UCSC databases, using BEDTools (Quinlan and Hall, 2010). CGIs that overlapped any gene were classified as gene-associated CGIs; CGIs whose whole sequences were in intergenic regions were classified as intergenic CGIs; CGIs whose sequences lay within gene regions were classified as intragenic CGIs; and CGIs that overlapped TSSs were classified as TSS CGIs.
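The classification rules above amount to interval-overlap tests. The following Python sketch illustrates that logic for a single chromosome; it is not the BEDTools workflow used in the study. Half-open (BED-style) coordinates are assumed, "intragenic" is interpreted here as lying entirely within one gene, and the function and label names are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def classify_cgi(cgi, genes, tss_sites):
    """Return the set of region labels for one CGI.

    cgi       : (start, end) of the CpG island
    genes     : list of (start, end) gene intervals on the same chromosome
    tss_sites : list of TSS positions on the same chromosome
    """
    start, end = cgi
    hits = [g for g in genes if overlaps(start, end, g[0], g[1])]
    labels = set()
    if hits:
        labels.add("gene-associated")
        # Assumption: intragenic means fully contained in a gene.
        if any(g[0] <= start and end <= g[1] for g in hits):
            labels.add("intragenic")
    else:
        labels.add("intergenic")
    if any(start <= t < end for t in tss_sites):
        labels.add("TSS")
    return labels
```

Under these rules the categories are not mutually exclusive: an intragenic CGI is also gene-associated, and a TSS CGI may be either gene-associated or intergenic depending on the annotation.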
RESULTS AND DISCUSSION
CGIs and CGI density in cattle genome
The CGIs detected by the HMM algorithm on all chromosomes of the cattle genome are summarized in Table 1. Because chromosome size varied, CGI density was measured as the average number of CGIs per Mb. The number of predicted CGIs, and correspondingly the CGI density, varied among chromosomes. Chromosome 25 had the largest number of CGIs (4556) and the highest CGI density (106.20 CGIs/Mb). The CGI density of the smallest chromosome (Chr 25) was about five times that of the largest chromosome (Chr 1), which points to a high density of CGIs on small chromosomes. Similarly, previous analyses of CGIs in the chicken genome revealed a high concentration of CGIs on microchromosomes (Hillier et al. 2004; Rao et al. 2013). These results suggest that some other genomic factors might also have played important roles in the course of CGI evolution (Han and Zhao, 2008). Figure 1 shows box plots of the lengths of detected CGIs for all chromosomes. The average length and variance of CGIs differed little between chromosomes. These results were similar to those of other studies (Hackenberg et al. 2006; Chuang et al. 2011; Chuang et al. 2012). The total number of predicted CGIs for cattle was 90668. Han et al. (2008) analyzed CGIs in 9 mammalian genomes and predicted 58327 CGIs in cattle. That number is considerably lower than the one reported in the current study, probably because of the different approaches used for predicting CGIs.
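The density measure used in this section is a simple ratio, sketched below in Python. The function name is an illustrative assumption. As a consistency check, the values reported above for chromosome 25 (4556 CGIs at 106.20 CGIs/Mb) imply a chromosome length of roughly 42.9 Mb.

```python
def cgi_density_per_mb(n_cgi: int, chrom_size_bp: int) -> float:
    """Average number of CGIs per megabase of chromosome sequence."""
    return n_cgi / (chrom_size_bp / 1_000_000)


# Back-calculation from the reported chromosome 25 values:
# size (Mb) = number of CGIs / density (CGIs per Mb)
implied_size_mb = 4556 / 106.20
```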
Correlation between CGI density and other genomic features of cattle genome
A significant negative correlation (r=-0.49, p=0.006) was observed between CGI density and log10 (chromosome size) (Figure 2(A)). Highly significant positive correlations were also found between CGI density and GC content (r=0.83, p=2×10^-8) (Figure 2(B)) and between CGI density and ObsCpG/ExpCpG (r=0.90, p=8.3×10^-12) (Figure 2(C)) across chromosomes. These results indicate that chromosome GC content is probably a main genetic factor affecting CGI density. In addition, CGI density was significantly positively correlated with gene density (r=0.74, p=2.8×10^-6) (Figure 2(D)). The pattern of significant correlations between CGI density and genomic features at the chromosome level, such as chromosome size, GC content and ObsCpG/ExpCpG, was similar to that in other studies (Han et al. 2008; Han and Zhao, 2008). The significant positive correlation between CGI density and gene density at the chromosome level agreed with the findings of Han and Zhao (2009) in the dog genome (r=0.63, p=8.0×10^-6). Most CGIs are sites of transcription initiation, including thousands that are remote from currently annotated promoters. Shared DNA sequence features adapt CGIs for promoter function by destabilizing nucleosomes and attracting proteins that create a transcriptionally permissive chromatin state.
Figure 1 Lengths of CGIs in chromosomes of cattle by HMM algorithm
Small difference in the average length and variance of CGI between chromosomes
Figure 2 Correlations between CGI densities with genomic features in cattle genome
(A) CGI density (per Mb) versus log10 (chromosome size)
(B) CGI density (per Mb) versus chromosome GC content (%)
(C) CGI density (per Mb) versus chromosome ObsCpG/ExpCpG
(D) CGI density (per Mb) versus Gene density (per Mb)
The Y chromosomes were excluded because of insufficient data
CGIs are therefore generically equipped to influence local chromatin structure and simplify regulation of gene activity (Deaton and Bird, 2011). Table 2 shows the mean CGI density of chromosomes grouped by size (<50, 50-100, 100-150 and >150 Mb). As chromosome size increased, CGI density decreased. The CGI density in the smallest chromosome group was about twice that in the largest chromosome group. Previous studies of GC content and CGIs in the chicken genome (Hillier et al. 2004; Rao et al. 2013) and in some mammalian genomes (Han et al. 2008) revealed a high density of CGIs on microchromosomes. The higher CGI density found in the smallest chromosome group than in the largest in the current study is consistent with those studies.
Table 2 CGIs densities in chromosomes with different sizes in cattle genome. The Y chromosomes were excluded because of insufficient data
These results raise the possibility that gene density on microchromosomes approaches the maximum value known for vertebrates (McQueen et al. 1996). The densities of CGIs overlapping or within genes (gene-associated CGIs) and of CGIs in the intergenic regions (intergenic CGIs) of the cattle genome were correlated with genomic features. Highly significant correlations were found between gene-associated CGI density and log10 (chromosome size) (r=-0.69, p=2.4×10^-5), GC content of the chromosomes (r=0.90, p=1.1×10^-11) and ObsCpG/ExpCpG of the chromosomes (r=0.95, p=8.9×10^-16). However, the correlations of intergenic CGI density with log10 (chromosome size) (r=-0.39, p=0.03), GC content of the chromosomes (r=0.47, p=0.008) and ObsCpG/ExpCpG of the chromosomes (r=0.65, p=8.5×10^-5) were weaker and less significant than those for gene-associated CGI density. These findings support the view that CGIs can function as gene markers. The significant positive correlations of CGI density with GC content and gene density indicate that CGIs depend on both local genomic features and gene number (Han and Zhao, 2009). For a more detailed investigation, cattle gene annotation data were used to search for CGIs in different regions of the genome. Table 3 shows the significant correlations between CGI densities in all genomic regions of cattle and genomic features (log10 (chromosome size), GC content and ObsCpG/ExpCpG) at the chromosome level. According to the gene annotation in the NCBI and UCSC databases, 61095, 29613, 48759 and 12315 CGIs overlapped with genes (gene-associated CGIs), intergenic regions (intergenic CGIs), intragenic regions (intragenic CGIs) and transcriptional start sites (TSS CGIs), respectively. The number of CGIs and the corresponding CGI density in intergenic regions were markedly lower than those in the intragenic and gene regions and showed weaker, less significant correlations with other genomic features (Table 3).
Other researchers found similar results for CGIs in various genomic regions (Han et al. 2008; Han and Zhao, 2009; Medvedeva et al. 2010). These observations imply that CGIs are an important gene feature and can be used to identify transcripts in the cattle genome.
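The chromosome-level correlations reported in this section are Pearson coefficients between CGI density and a genomic feature, with one data point per chromosome. The Python sketch below shows the computation with toy values that are purely illustrative and are not taken from Table 1. The t statistic is included because it links r to the reported p values; the sample size n = 30 (29 autosomes plus X, with Y excluded) is an inference and is not stated explicitly in the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Toy example (hypothetical values, NOT the cattle data):
sizes_mb = [158, 120, 80, 43]
densities = [20.0, 30.0, 50.0, 100.0]
r_toy = pearson_r([math.log10(s) for s in sizes_mb], densities)

# Assumed n = 30 chromosomes for the reported r = -0.49.
t_reported = t_statistic(-0.49, 30)
```

With r = -0.49 and the assumed n = 30, the t statistic is about -2.97 on 28 degrees of freedom, which is of the magnitude expected for the reported p = 0.006.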
CGI density and recombination rate in cattle genome
A set of recombination rate data for cattle (window size, 1 Mb) was obtained from Weng et al. (2014). A significant correlation (r=0.61, p=0.00031) was detected between CGI density and recombination rate (Figure 3(A)). Because the recombination rate increases from centromeric toward telomeric regions (Jensen-Seaman et al. 2004; Han et al. 2008; Poissant et al. 2010; Weng et al. 2014), the trend of CGI density along the length of the chromosomes was examined. Interestingly, CGI density tended to be higher in the telomeric regions (Figure 3(B)). This feature may explain the positive correlation between CGI density and recombination rate. Several GC-related measures (GC-rich repeats, CpG dinucleotide sites and CpG islands) were positively correlated with recombination rate in previous studies (Han et al. 2008; Tortereau et al. 2012; Rao et al. 2013).
Table 3 Correlation between CpG island (CGI) density in various cattle genomic regions and genomic features
Figure 3 (A) Correlation between CGI density and recombination rate (cM/Mb) in the cow genome
(B) Distribution of CGI density (per Mb) on chromosome 28
The data demonstrate a trend of higher CGI density in telomeric regions
A similar trend was found for other chromosomes
The same result was observed in this study (Figure 3(A)). It is difficult to establish the causality of the observed relationships between GC-related measures such as CGIs and recombination rate, i.e., which parameter drives which. Further analyses of recombination are needed to identify the underlying molecular mechanism.
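To relate CGI density to recombination rate in 1 Mb windows, CGIs first have to be counted per window along each chromosome. A minimal Python sketch of that binning step is given below. Assigning each CGI to the window containing its start coordinate is an assumption made for illustration, and the function name is hypothetical.

```python
def window_counts(cgi_starts, chrom_size_bp, window_bp=1_000_000):
    """Count CGIs per non-overlapping window along one chromosome.

    cgi_starts    : 0-based start positions of CGIs (each < chrom_size_bp)
    chrom_size_bp : chromosome length in bp
    window_bp     : window size in bp (1 Mb to match the recombination data)
    """
    n_windows = -(-chrom_size_bp // window_bp)  # ceiling division
    counts = [0] * n_windows
    for start in cgi_starts:
        counts[start // window_bp] += 1
    return counts
```

For full 1 Mb windows the count equals the CGI density per Mb; the last window of a chromosome is usually shorter and would need to be rescaled or dropped before correlating with recombination rate.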
Comparison of CGIs in cattle genome and other vertebrate genomes
To assess differences in CGI density between cattle and other vertebrate genomes, CGI density was examined in eleven other vertebrate genomes: sheep, goat, dromedary camel, bactrian camel, alpaca, human, mouse, horse, dog, chicken and zebrafish (Figure 4).
Figure 4 Comparison of CGI density (per Mb) in various vertebrate genomes. CGI density varied among genomes
Predicted CGI density ranged from 23.05 (human) to 85.17 (zebrafish). The variation in CGI density between cattle, chicken and other mammals (including ruminants) was not very high. According to previous studies, the cattle genome shares many similarities with other ruminant genomes, as they have a close phylogeny, syntenic maps and orthologous genes (Elsik et al. 2009; Archibald et al. 2010; Jirimutu et al. 2012; Dong et al. 2013; Wu et al. 2014). This probably explains the similar CGI densities among these genomes. Among mammals, the dog genome had the largest CGI density, notably larger than that of cattle. Han and Zhao (2009) studied the contrasting features of CpG islands in the promoter and other regions of the dog genome. They reported a remarkably higher CGI density in the dog genome than in the human and mouse genomes. However, the dog genome had fewer promoter-associated CGIs than the human and mouse genomes, and the abundance of CGIs in the dog genome was largely contributed by non-coding regions, including intergenic and intronic regions. The zebrafish genome had a larger CGI density than cattle and the other ruminant genomes. CpGs are progressively depleted from fish to mammals. This is mostly consistent with the idea of a greater degree of CpG loss with organism complexity (Irizarry et al. 2009). Variation in the number of CGIs and in CGI density is low in warm-blooded vertebrates such as mammals, but very high in fish (Han et al. 2008). Fish genomes differ in both the number of CGIs and CGI density. In their study of fish genomes, Han and Zhao (2008) found that the number of CGIs and the CGI density varied greatly among four fish genomes.
They concluded that this feature might be caused by genetic factors (sequence composition evolution) and environmental factors, such as water temperature, flow speed and the extent of light at different water depths, during the long evolutionary period after the divergence of the common ancestor of fishes.
A systematic comparative genomic analysis of CGIs and CGI density was conducted in the cattle genome. The number of CGIs and the CGI density showed low variation among chromosomes. This study reveals significant correlations between CGI density and genomic features such as chromosome size, GC content, ObsCpG/ExpCpG, gene density and recombination rate in cattle. The results indicate a close relationship between cattle evolution and CGI density. Compared with cold-blooded vertebrates such as fishes, cattle are characterized by a lower number of CGIs and a lower CGI density. This is mostly consistent with the idea of a greater degree of CpG loss with organism complexity.