Linkage analysis and association studies have made significant progress in uncovering the genetic basis of common phenotypes and complex diseases (McCarthy et al. 2008; Chen, 2011). Next generation sequencing (NGS) technology has dramatically increased the capacity for DNA sequencing (Londin et al. 2013). Genomic studies in farm animals will increase our understanding of the genetic basis of traits, and their results can be used in breeding programs to reduce the occurrence of diseases and to improve product quality and production efficiency. Sequencing of farm animal genomes is expected to have a significant effect on the sustainable production of animals (Andersson, 2001; Bisht and Panda, 2014). The present review aims to summarize current knowledge about genome wide association studies (GWAS) and NGS and their application in animal breeding.
Genome wide association study (GWAS)
The ability to predict genetic risk factors for human disease and economically important traits in animals, such as growth rate and production, requires an understanding of the genetic loci responsible for the phenotype and of the genetic architecture of traits (Korte and Farlow, 2013). A genetic association study is a statistical method for identifying genes or loci that regulate complex traits; it uses linkage disequilibrium (LD) to connect a phenotypic trait with genetic polymorphisms. Mapping methods fall mainly into two categories: candidate gene studies and whole genome studies (Jiang, 2013). A candidate gene study examines the relationship between known genes and traits (Liu et al. 2008; Bisht and Panda, 2014). In contrast to candidate gene studies and linkage analysis, GWAS surveys the entire genome systematically to detect genetic variants that confer susceptibility to diseases or affect complex traits (Hirschhorn and Daly, 2005; Huang, 2015). In general, identifying genetic variants associated with complex traits requires a large number of variants and samples (Huang, 2015). A GWAS tests genotyped single nucleotide polymorphisms (SNPs) across the genome for association with a phenotype (Zeng et al. 2015). The literature contains numerous examples of GWAS that explain the genetic background of traits. Missing genotypes, genetic heterogeneity, low LD, small effect sizes, low allele frequencies and the genetic architecture of complex traits all pose challenges for GWAS (Korte and Farlow, 2013).
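The core of a single-marker association test can be sketched as a regression of the phenotype on allele count. The sketch below is only illustrative: it assumes an additive model with genotypes coded 0, 1 or 2, unrelated individuals and no covariates, and the function name additive_snp_effect is hypothetical rather than taken from any GWAS package.

```python
from statistics import mean


def additive_snp_effect(genotypes, phenotypes):
    """Least-squares slope and intercept of phenotype on allele count (0, 1, 2).

    Illustrative only: real GWAS software also models covariates,
    relatedness and population structure.
    """
    if len(genotypes) != len(phenotypes):
        raise ValueError("genotypes and phenotypes must have the same length")
    g_bar = mean(genotypes)
    y_bar = mean(phenotypes)
    # Sum of squares of the genotype codes around their mean
    sxx = sum((g - g_bar) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("SNP is monomorphic in this sample")
    # Cross-product between genotype codes and phenotypes
    sxy = sum((g - g_bar) * (y - y_bar) for g, y in zip(genotypes, phenotypes))
    slope = sxy / sxx  # estimated effect of one extra copy of the allele
    intercept = y_bar - slope * g_bar
    return slope, intercept


# Toy data (made up for illustration): each extra allele adds one unit
effect, baseline = additive_snp_effect([0, 1, 2, 0, 1, 2],
                                       [1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
print(effect, baseline)
```

In a real GWAS this estimate is computed for every SNP and accompanied by a test statistic, which is why sample size and marker density matter so much for power.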
In addition, major technical and analytical challenges remain for GWAS, including multiple testing correction, missing loci or blocks, low power to identify loci with small effects, the risk of spurious findings due to population stratification, overestimation of haplotype effects, poor model fitting, insufficient sample size, low-density SNP coverage, detection of rare variants and unknown copy number variation (CNV) effects (Kadarmideen, 2014), as well as the failure to explain the full genetic variance of complex traits (Manolio et al. 2009; Clarke and Cooper, 2010; Gibson, 2010; Kadarmideen, 2014).
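To illustrate why multiple testing correction is a challenge, the following sketch applies a Bonferroni correction, which is one common (and conservative) choice and not necessarily the method used in the studies cited above. The function names and the SNP identifiers are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def significant_snps(p_values, alpha=0.05):
    """Return the SNP identifiers whose p-value passes the Bonferroni threshold.

    p_values: dict mapping SNP identifier -> p-value (illustrative input format).
    """
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [snp for snp, p in p_values.items() if p < threshold]


# With one million tested SNPs, each test must be far more stringent than 0.05
print(bonferroni_threshold(0.05, 1_000_000))
```

The more SNPs are tested, the smaller the per-test threshold becomes, which is one reason why loci with small effects are hard to detect without very large samples.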
GWAS relies on LD between SNPs and causative genes (Schmid and Bennewitz, 2017). LD is the non-random association of alleles at different loci within a population. LD is affected by factors such as selection, mutation, migration, population structure and recombination rate (Zhu et al. 2013). The efficiency of quantitative trait locus (QTL) mapping studies, e.g. GWAS, and of marker-assisted selection (MAS) depends on the LD in the population (Sellner et al. 2007). LD is an ideal parameter for detecting, with high accuracy, genetic association between markers and genes or causal loci for complex traits (Jiang, 2013).
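The notion of non-random association can be made concrete with the standard LD coefficient D and the squared correlation r². The sketch below assumes phased haplotypes at two biallelic loci, with alleles coded 0 or 1; the function name ld_statistics and the input format are illustrative.

```python
def ld_statistics(haplotypes):
    """Compute D and r^2 for two biallelic loci.

    haplotypes: list of (a, b) pairs, one per phased haplotype,
    where a and b are 0/1 allele codes at the first and second locus.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes given")
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("at least one locus is monomorphic")
    return d, d * d / denom


# Alleles always co-occur: complete LD
print(ld_statistics([(1, 1), (1, 1), (0, 0), (0, 0)]))
# All four haplotypes equally frequent: no LD
print(ld_statistics([(1, 1), (1, 0), (0, 1), (0, 0)]))
```

An r² close to 1 means that a genotyped marker is a good proxy for an ungenotyped causal variant, which is the property that chip-based GWAS depends on.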
Missing heritability refers to the part of the genetic variance that cannot be explained by all significant single-nucleotide polymorphisms (SNPs). A significant proportion of heritability is not explained by the common genetic variants detected in GWAS (Manolio et al. 2009). The missing heritability hypothesis holds that GWAS may miss unknown variants with large effects on the phenotype because their frequency is much lower than that of the variants captured by SNP chips (Huang, 2015). The use of NGS data for GWAS is likely to help with this problem. Sequencing enables the detection of low-frequency and rare variants with medium to large effects, and at least part of the missing heritability is expected to be explained with this technology (Feng, 2015). Applying sequencing technology to a large number of samples with appropriate phenotypes provides a great opportunity to uncover the missing heritability and the genetic architecture of complex traits (Luo et al. 2011). Many reasons have been suggested for missing heritability, including a large number of unknown variants with small effects, rare variants that are poorly captured by available genotyping arrays and probably have large effects, structural variation, and inadequate accounting for the common environment shared among relatives (Manolio et al. 2009).
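A simple calculation helps to show why rare variants with large effects can still contribute little to the explained variance. The sketch below uses the standard quantitative-genetics expression 2p(1 - p)a² for the additive variance contributed by one biallelic locus. This expression is not given in the sources cited above and assumes Hardy-Weinberg equilibrium and a purely additive allele effect a; the function name and the example values are hypothetical.

```python
def additive_variance(allele_freq, allele_effect):
    """Additive genetic variance of one biallelic locus: 2 * p * (1 - p) * a^2.

    Assumes Hardy-Weinberg equilibrium and a purely additive effect
    (no dominance or epistasis).
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 2.0 * allele_freq * (1.0 - allele_freq) * allele_effect ** 2


# Hypothetical common variant with a small effect (in phenotypic SD units)
common = additive_variance(0.5, 0.1)
# Hypothetical rare variant with a five-fold larger effect
rare = additive_variance(0.01, 0.5)
print(common, rare)
```

Under these assumptions the two hypothetical variants contribute almost the same variance, so a rare variant that is absent from a SNP chip can hide a large individual effect behind a small contribution to heritability.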
Copy number variation (CNV)
CNV is an important source of genetic diversity that provides structural information in genomics (Hou et al. 2011). CNV refers to a change in the number of copies of a region of the genome, ranging from one kb to several Mb (Henrichsen et al. 2009), although the size range is defined differently in various sources. CNVs are the result of DNA deletion, duplication, insertion and rearrangement. Because most CNVs contain gene coding regions and regulatory elements, they play an important role in the regulation of gene expression (Conrad et al. 2010). CNVs have been reported to have a higher mutation rate than SNPs (Zogopoulos et al. 2007), and they can account for a significant part of the genetic variation in diseases or traits (McCarroll, 2008). CNVs have been shown to overlap with genes, and CNVs are correlated with gene expression levels and with some clinical phenotypes (Stranger et al. 2007). GWAS is a good tool for the simultaneous survey of SNPs and CNVs (McCarroll, 2008), and such a survey could help to explain genetic variability and heritability. In the past few years, considerable progress has been made in identifying CNVs in domestic animals. In the future, the development of accurate tools for CNV detection and their application in combination with QTL and gene expression data will be necessary to identify the impact of structural variation on many phenotypes (Clop et al. 2012).
Application of GWAS in animal breeding and genetics
The advent of genome sequencing and of the approaches it enables, including GWAS, whole genome prediction (WGP) and genomic selection, has changed the pattern of animal breeding (Kadarmideen, 2014). Combining allelic and phenotypic information through GWAS facilitates the discovery of genetic loci associated with important traits (D’Agostino and Tripodi, 2017). Improving genomic selection through GWAS enhances biological knowledge about trait expression, provides information on the genetic architecture of quantitative traits and makes gene mapping a hot topic in livestock genetics (Goddard et al. 2016). The use of GWAS in animal breeding and genetics has expanded since the genomes of domestic animals were sequenced and a large number of SNPs were discovered through sequencing. A variety of commercial SNP chips are available for cattle, sheep, poultry, horses, dogs and pigs. Although the use of GWAS in domestic animals is still in its infancy, promising results have been reported, particularly in the analysis of the mechanisms underlying quantitative traits. SNP chips are now widely used in GWAS to identify QTL for traits in domesticated animals (Zhang et al. 2012). The use of SNP arrays has considerably affected the theory and practice of animal breeding and genetics and will continue to play an important role in the future (Fan et al. 2010). Much progress has been made in GWAS in domestic animals, and some genes have been identified for important traits (Zhang et al. 2012). Compared to SNP chips, sequencing can provide almost all information about variants, including SNPs, CNVs, insertions and deletions. As the cost of sequencing falls, it may become possible to sequence every individual in a population and to perform GWAS with this technique (Zhang et al. 2012). Some recent literature on the application of GWAS in animal breeding and genetics is presented in Table 1.
Next generation sequencing and GWAS
NGS technology makes it possible to identify many kinds of variants, including SNPs and structural variation, and to search for rare variants (Chen, 2011). Understanding of the genetic basis of diseases is expanding due to the increased use of NGS. Perhaps the biggest success of NGS is the discovery of variants for rare diseases with Mendelian inheritance (Londin et al. 2013). While chip-based GWAS continues to progress, sequencing technology is developing rapidly and the cost of sequencing is decreasing (Feng, 2015). With the advent of whole-genome sequencing (WGS) technology and the increasing capacity to detect rare variants, GWAS using WGS is expected to provide more opportunities to explore variants with larger effect sizes and causal effects (Huang, 2015). Unlike chip-based GWAS, sequencing allows the direct analysis of causal genes and variants rather than relying on their linkage disequilibrium with markers (Feng, 2015). NGS technology has a significant impact on our ability to find variants related to diseases and traits (Edwards et al. 2014). With progress in the methods for sequencing entire genomes, new avenues have been opened for understanding the structure of DNA (Feuk et al. 2006).
First-generation sequencing technology
First-generation sequencing began with the sequencing of bacteriophage phiX174 by Frederick Sanger in 1977 (Sanger, 1977). Sanger sequencing was the basis for the modern sequencing methods currently in use (Gabaldón and Alioto, 2016).
Second-generation sequencing technology
The general principles of NGS are similar to those of capillary electrophoresis (Sanger) sequencing, in which sequencing occurs by synthesis, but in NGS millions of fragments are sequenced simultaneously instead of a single fragment of DNA (Gabaldón and Alioto, 2016). Five hundred million to billions of bases of raw sequence can be generated in a single run of a second-generation sequencing platform (Pareek et al. 2011). Illumina (sequencing by synthesis), SOLiD (sequencing by ligation), Roche (pyrosequencing chemistry) and Ion Torrent (semiconductor detection of H+) are second-generation sequencing techniques. All second-generation sequencing techniques rely on the polymerase chain reaction (PCR) to amplify DNA. The major challenge of second-generation techniques is their short reads, which complicate genome assembly and alignment (Pareek et al. 2011).
Third-generation sequencing technology
Third-generation sequencing technologies have several features: 1) they can detect a single nucleotide change using new optical and electrical single-molecule techniques, 2) they do not require amplification by PCR, which reduces sequencing time and cost, and 3) their read length is long (1000 bp to 50 kb) (Steinbock and Radenovic, 2015). The third-generation sequencing techniques are explained in detail as follows:
I. PacBio technique
The first third-generation technique considered here is PacBio, known as single-molecule real-time (SMRT) sequencing, which has been in use since 2011 (Steinbock and Radenovic, 2015). This technique is provided by Pacific Biosciences and has a longer read length than second-generation sequencing (SGS) technology. The highly contiguous assemblies obtained in de novo sequencing projects using the PacBio technique can close gaps in current reference assemblies and identify structural variation (SV) in personal genomes.
II. Helicos technique
The Helicos single-molecule sequencing technique provides a distinctive view of genome biology through direct sequencing of nucleic acids. Sample preparation is simple and does not require any ligation or amplification by PCR, and DNA or RNA is hybridized directly within the flow cell.
Table 1 Recent literature on genome wide association study in domestic animals
This eliminates many intermediate stages that may cause distortion or loss of the sample (Milos, 2010). The Helicos sequencing technique does not depend on PCR (Schuster, 2007; Blow, 2008; Arif et al. 2010). For RNA sequencing, this method does not need to convert RNA to cDNA and offers a broad and unbiased view of the transcriptome (Ozsolak et al. 2009; Arif et al. 2010). The Helicos reading length is about 800-1000 bp (Ku and Roukos, 2013). In this method, millions of single DNA molecules are captured in two flow cells. These strands serve as templates for sequencing by synthesis. Polymerase and one fluorescently labeled nucleotide are then added. The polymerase catalyzes the specific incorporation of the fluorescent nucleotide into the complementary strands in all samples. The strands are then washed to remove free nucleotides. The incorporated nucleotides are imaged and their positions are recorded. The fluorescent groups are then cleaved from the strands, while the incorporated nucleotides remain. The process is repeated for the other nucleotides (A/T/C/G) (Blow, 2008; Arif et al. 2010).
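The add, wash, image and cleave cycle described above can be sketched as a toy simulation. This is a deliberately simplified model and not the vendor's algorithm: it follows a single template strand, assumes that at most one base is incorporated per cycle, ignores errors, and uses the hypothetical function name sequence_by_synthesis.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# One labeled nucleotide type is offered per cycle, in a fixed repeating order
CYCLE_ORDER = ("A", "T", "C", "G")


def sequence_by_synthesis(template):
    """Simulate cyclic single-nucleotide addition on one template strand.

    Returns the synthesized (complementary) strand and the number of
    nucleotide-addition cycles that were needed.
    """
    if any(base not in COMPLEMENT for base in template):
        raise ValueError("template may contain only A, C, G and T")
    read = []
    position = 0   # next template base waiting to be copied
    cycles = 0
    while position < len(template):
        for nucleotide in CYCLE_ORDER:
            if position == len(template):
                break
            cycles += 1
            # Incorporation happens only if the offered nucleotide pairs
            # with the current template base; this is the imaged event
            if COMPLEMENT[template[position]] == nucleotide:
                read.append(nucleotide)
                position += 1
            # Unincorporated nucleotides are washed away before the next cycle
    return "".join(read), cycles


print(sequence_by_synthesis("TACG"))
```

The simulation shows why many cycles are needed per sequenced base: in most cycles the offered nucleotide does not match the next template position and nothing is incorporated.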
III. Nanopore technique
The single-molecule techniques used in the nanopore method also allow further studies, such as those of DNA-protein and protein-protein interactions (Feng et al. 2015). The idea of using nanopores for DNA sequencing was introduced in the 1990s (Deamer and Akeson, 2000). Recently this method has attracted considerable attention due to its fast sequencing, low cost and long read length (5 kb), and because it needs no DNA amplification, enzyme attachment or modified nucleotides (Steinbock and Radenovic, 2015). The main advantages of the nanopore technique are very long reads, high throughput and modest requirements, which simplify its use (Feng et al. 2015). It has been suggested that an entire genome could be sequenced in about 15 minutes at very low cost. Nanopore sequencing is based on the principle that a single DNA molecule can be detected as it passes through a very small channel (Ku and Roukos, 2013). In nanopore sequencing, double-stranded DNA is first converted to single-stranded DNA (ssDNA) with the aid of a polymerase, which also slows the movement of the ssDNA through the nanopore. The nanopore has a constriction within its channel that allows the ssDNA sequence to be read. As the ssDNA translocates through the nanopore, it produces a signal. Each signal level represents a nucleotide, and the DNA sequence is decoded by detecting these levels (Steinbock and Radenovic, 2015).
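The decoding step, in which each signal level is mapped to a nucleotide, can be illustrated with a toy nearest-level classifier. The reference levels below are arbitrary, hypothetical numbers chosen for illustration and are not measured currents; real base callers work on noisy signals and use far more elaborate statistical models.

```python
# Hypothetical reference signal level for each nucleotide (arbitrary units)
REFERENCE_LEVELS = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def decode_levels(levels):
    """Assign each observed signal level to the nucleotide with the closest reference level."""
    bases = []
    for observed in levels:
        closest = min(REFERENCE_LEVELS,
                      key=lambda base: abs(REFERENCE_LEVELS[base] - observed))
        bases.append(closest)
    return "".join(bases)


# Slightly noisy observations around the reference levels
print(decode_levels([1.1, 1.9, 3.2, 3.9]))
```

The closer the reference levels are to one another relative to the noise, the more decoding errors occur, which is one way to think about the error rates of single-molecule methods.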
Restrictions of sequencing technologies
NGS is becoming the premier tool in genetic diagnostics. However, concerns have been raised that the complexity and volume of whole-genome sequence data may make it difficult to interpret the relationship between genetic variants and diseases (Goldstein et al. 2013). NGS technology can generate millions of genetic variants that are densely distributed across the genome (Luo et al. 2012) and therefore produces large amounts of data. Current computational methods are not able to harness the full potential of the genome and epigenome data from NGS, so new and upgraded tools and systems are needed (Chaitankar et al. 2016). The biggest constraint on NGS is the bioinformatics methods for storing and analyzing the data (Trapnell and Salzberg, 2009; Blaby-Haas and de Crécy-Lagard, 2011; DePristo et al. 2011; Hinchcliffe and Webster, 2011; Nielsen et al. 2011; Londin et al. 2013). In addition, the large number of rare variants, sequencing errors and missing data are important challenges for association tests on NGS data. These challenges strongly affect the Type I error rate and the power of tests for phenotype-genotype association (Luo et al. 2011). Unlike the highly accurate genotypes used in chip-based GWAS, deep sequencing produces millions of short DNA fragments, which require precise and effective statistical algorithms for mapping and genotype calling (Chen, 2011). In a GWAS, hundreds of samples are collected and thousands to millions of variants are genotyped across the genome (Chen, 2011; Risch, 2000). Therefore, weak experimental design and sample collection can cause difficulties in the subsequent analysis (Gabaldón and Alioto, 2016).
Categories of sequencing projects
Genome sequencing projects can generally be divided into two categories: 1) de novo sequencing, where the goal is to obtain a high quality genome sequence that can be used as a reference for the species, and 2) resequencing, where a reference genome is available and the goal is to map sequence variation among individuals. These variations may include all or some of the following: single nucleotide polymorphisms, rare variants, simple somatic mutations, deletions and insertions, copy number variations and other structural variation (Gabaldón and Alioto, 2016). De novo genome sequencing is the sequencing of a new genome for which there is no reference sequence for alignment. The quality of de novo sequencing data depends on contig size and continuity and on the variety of fragment sizes included in the library. Researchers can produce high quality de novo assemblies using NGS reads and the available assembly tools. The depth required for de novo sequencing is determined by several factors, including the sequencing method and strategy, read length, assembly method and the complexity or repetitive regions of the genome (Chen, 2011). Studies have shown that the sequencing depths required to detect most SNPs and indels are 15X and 33X for homozygous and heterozygous genotypes, respectively (Bentley et al. 2008; Ajay et al. 2011; Gabaldón and Alioto, 2016).
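The depth figures above can be translated into a number of reads with the usual relation depth = (number of reads × read length) / genome size. The sketch below is only a planning aid under simplifying assumptions: reads are taken to be uniformly distributed and fully mappable, and the genome size of 3 Gb and read length of 150 bp are hypothetical example values that do not come from the cited studies.

```python
def mean_depth(n_reads, read_length, genome_size):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_for_depth(target_depth, read_length, genome_size):
    """Smallest whole number of reads giving at least the target average depth.

    All arguments are integers; ceiling division avoids floating-point rounding.
    """
    return -(-target_depth * genome_size // read_length)


GENOME_SIZE = 3_000_000_000   # hypothetical genome of 3 Gb
READ_LENGTH = 150             # hypothetical short-read length in bp

for depth in (15, 33):        # depths quoted for homozygous and heterozygous calls
    print(depth, reads_for_depth(depth, READ_LENGTH, GENOME_SIZE))
```

Because heterozygous calls need roughly twice the depth of homozygous calls, they also need roughly twice the reads, which directly affects the cost of resequencing large numbers of animals.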
Whole-genome and whole-exome sequencing
Whole-genome sequencing (WGS) and whole-exome sequencing (WES) are powerful and relatively unbiased methods for the discovery of genetic variation (Chaitankar et al. 2016). Instead of sequencing the entire genome, targeted sequencing of coding regions, such as exome sequencing, produces valuable results for identifying disease-related variants (Londin et al. 2013). In WES, the protein-coding regions of the genome are selected and sequenced. This method can efficiently identify variants for a wide range of applications such as population genetics, genetic diseases and cancer studies. WGS provides a special opportunity for surveying germline and somatic variation, but at present the large amounts of data and the high computational requirements limit its use in routine biological and genetic studies. WES, in contrast, focuses on sequencing the protein-coding regions (exons) and therefore produces much less data (Chaitankar et al. 2016). WES covers about 1.5 percent of the human genome (Lander et al. 2001; Huang, 2015) and has a low cost. In the past few years, WES has been conducted on a larger scale than WGS because of its lower cost, although WGS can discover more variants for complex traits (Huang, 2015). Despite its clear advantages, WES has shortcomings, for example in CNV detection (Londin et al. 2013). Exome sequencing is based on the idea that mutations affecting the phenotype lie in the coding regions of the genome. However, very little is known about the distribution of functional variants (Goldstein et al. 2013). Relying only on exome sequencing is therefore not sufficient, and the entire genome of affected individuals must be sequenced to find all effective variants (Londin et al. 2013).
Application of genome sequencing in animal science
Genome sequencing can transform food security and sustainable agriculture, including food safety and public, animal and plant health, by reducing the risk of diseases and by advancing agriculture through the breeding of animals and plants (FAO, 2016). Farm animals are valuable resources and are often used as models in studies of physiology and pathology, due to the strong similarities in reproductive physiology and nutrition between farm animals and humans. Thus, farm animals are a unique resource for human research (Bisht and Panda, 2014). However, the production of farm animals is more important still, because it provides food for human society. The development of animal genomics is an outcome of the development of human genomics that resulted from genome sequencing projects (Kadarmideen, 2014). In recent years, genomic evaluation that maps small-effect QTLs using many markers has been applied in dairy cattle and can greatly increase reliability, especially for young animals (Bisht and Panda, 2014). NGS makes it possible to detect SNPs in the genome and to develop SNP chips for the genome-wide evaluation of SNPs against desired phenotypes (Kranis et al. 2013; Pértille et al. 2016).
Table 2 Recent literature on genome sequencing in livestock and poultry
Early chips have limited genome coverage and do not cover all effective SNPs. NGS technology is powerful enough to detect causal polymorphisms, but its use in animal breeding has been impractical due to its high cost (Elshire et al. 2011; Glaubitz et al. 2014; Pértille et al. 2016), although sequencing costs have decreased sharply in the past decade. In addition, the capacity and performance of information technology have expanded enormously, making it possible to store and transfer large volumes of information (FAO, 2016). As sequencing methods develop and costs fall, the widespread use of whole genome sequencing in animal breeding becomes likely. NGS leads to a better understanding of the genome, transcriptome and epigenome of livestock (Sharma et al. 2017). Genotyping is becoming a common tool in poultry breeding (Pértille et al. 2016). Many scientists use genomic information to identify genes associated with diseases in cattle, sheep, goats and horses and to create disease-resistant animals (Bisht and Panda, 2014). Some industrialized countries use WGS in the food sector and in the prevention and control of animal diseases. Genome sequencing is also used for the inspection of food imports and exports (FAO, 2016). The genome sequences of a number of species, including poultry, cattle, horses, pigs and chimps, have been completed, which supports many developments in animal breeding (Bisht and Panda, 2014). Sequencing can be used for all pathogens in the prevention and control of zoonotic diseases (FAO, 2016). NGS has also been used to identify breed-specific variants, signatures of selection and mutations in livestock (Sharma et al. 2017). Recent advances in genotyping and sequencing technologies have led to a rapid evolution of beef cattle evaluation methods. As a result, new tools have become available for the efficient production of high quality meat (Rolf et al. 2010).
RNA sequencing is also widely used in farm animal species such as chicken, where it contributes to the understanding of the mechanisms of animal development and is applied in functional genomics (Dunisławska et al. 2017). The development of new techniques and software in this field, together with a precise understanding of genomic structure and of the relationship between genotype and phenotype, has made it possible to design effective strategies for improving livestock breeding. Recent literature on genome sequencing in livestock and poultry is summarized in Table 2.
Linkage analysis and GWAS play a major role in understanding the genetic basis of traits and diseases, and they have been used with much success. However, these methods have shortcomings, such as the failure of genetic association studies to explain the full genetic variance. NGS technology can partly overcome the shortcomings of GWAS. Many NGS studies concern human research, especially the relationship between genotype and the occurrence of various diseases. Human research has always come first and has served as a model for genotype-phenotype studies in animals. It is therefore expected that NGS will have a significant impact on livestock production and health in the next few years. Currently the use of NGS, especially in animals, is limited by the high cost of WGS, the huge amount of data produced by these methods and the computational burden of analyzing the data. WES is more cost effective than WGS; however, sequencing all exons gives much less information than sequencing the entire genome. Thus, although a few studies have used genome-wide association studies and NGS to improve the efficiency and quality of animal products, high cost, followed by the huge volume of information and the associated computational problems, remains the most important factor limiting the use of NGS technology in farm animals. The development of sequencing methods and falling sequencing costs, together with progress in hardware and computational methods, are expected to have a significant impact on animal breeding and genetics.