A genome is defined as the complete set of DNA for an organism (Klug et al. 2015). The field of genomics utilizes high-throughput sequencing of the DNA base pairs for a specific organism (Hugenholtz and Tyson 2008). Genome sequencing began in 1986, and by 2000, nine organisms (bacteriophages, bacteria, plants, fruit fly, human) had their genomes completely sequenced (Adams et al. 2000; Venter et al. 2001). The completion of the Human Genome Project and advancements in bioinformatics inspired the beginning of metagenomics (National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications 2007).
Before the advent of metagenomics, microbes could not be studied without first being isolated and cultured (Vietes et al. 2009). Metagenomics is the study, through DNA sequencing, of microbial communities and their genetic material sampled directly from the environment. The purpose of metagenomics is to determine the genome of an organism in its environment, while also observing the biochemical interactions and exchanges (Wooley and Ye 2009). Metagenomics provides information about biocatalysts and enzymes, genomic links to phylogeny and function, and evolutionary profiles for the role, function, assembly, and structure of microbial communities (Thomas et al. 2012).
The classic approach to metagenomics involves cloning environmental DNA into vectors with assistance from bioengineered host strains; the clones are then screened for various biomarkers or functions. The sequence-driven approach, screening for selected genes, is rarely used in current metagenomics studies, but the function-driven approach, screening for metabolic functions, is still widely used for screening enzymes (Riesenfeld et al. 2004).
Shotgun metagenomics has replaced the classical method and is now the commonly used approach; it involves direct sequencing of environmental DNA. Shotgun metagenomics enables the evaluation of microbial diversity and the detection of microbial abundance in different environments (Tringe et al. 2005). In the shotgun technique, DNA is extracted from all of the cells in a community and sheared into small fragments that are sequenced independently. The result is a set of DNA sequences that correspond to genomic locations across the many genomes present in the sample (Sharpton 2014).
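The shearing step described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the two "genomes", the taxon names, the fixed read length, and the function names are hypothetical, and real reads vary in length and contain sequencing errors.

```python
import random

def shear(genome, read_len, n_reads, rng):
    """Sample n_reads fixed-length fragments at random positions of one genome."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

def shotgun(community, read_len=8, reads_per_genome=5, seed=0):
    """Pool and shuffle reads from every genome, discarding their origin."""
    rng = random.Random(seed)
    pool = []
    for genome in community.values():
        pool.extend(shear(genome, read_len, reads_per_genome, rng))
    rng.shuffle(pool)  # the sequencer does not know which genome a read came from
    return pool

# Hypothetical two-member community (toy sequences, not real genomes).
community = {
    "taxon_a": "ATGCGTACGTTAGCCGATAGGCTA",
    "taxon_b": "TTGACCGGTAACCGGTTAAGGCCA",
}
reads = shotgun(community)
print(len(reads), "reads of mixed origin")
```

Because the pooled reads carry no label for their source genome, the later assembly and binning steps are needed to recover that information.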
Several steps are necessary to complete a metagenomic analysis: sampling, fractionation, DNA extraction, DNA sequencing, assembly, binning, annotation, statistical analysis, and data storing/sharing (Thomas et al. 2012). The first step in metagenomics is to collect the community sample from the environment. If the environment is a host, fractionation of the sample should be utilized to rid the sample of any host DNA. Once the sample is deemed pure, DNA extraction takes place; samples with relatively small amounts of DNA can be amplified.
There are two major gene sequencing strategies: Sanger and next-generation sequencing (NGS) (Mardis 2008; Scholz et al. 2012). The Sanger method has a low error rate and is best suited for low-diversity environments (Goltsman et al. 2009). NGS is a modern, high-throughput method that can be utilized in a broad range of complex environments (Scholz et al. 2012).
After DNA from the sample is fragmented and sequenced, the areas of overlap are noted and used to establish the order of the fragments, thus assembling the genome (National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications 2007). Assembly is not a required step for all metagenomic analyses, but it is necessary to recover a genome or coding sequence. There are two different strategies for assembly: co-assembly and de novo assembly (Thomas et al. 2012). Co-assembly, also known as reference-based assembly, should be utilized when a reference genome is available; the assembly can then be built upon the preexisting genome. This method performs best with prokaryotic and less-complex eukaryotic communities, but the success of assembly depends on the quality of the reference genome (Martin 2012). De novo assembly is employed for communities that do not have a reference genome and, as a result, provides the initial set of transcripts for the organism. While de novo assembly can be extremely efficient for prokaryotes and moderately complex eukaryotes, its major drawback is that complex eukaryotes are not easily assembled due to the sheer volume of data that must be processed (Martin 2012; Robertson et al. 2010).
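The idea of ordering fragments by their areas of overlap can be sketched as a greedy merge of the two reads that share the longest suffix-to-prefix overlap. This is a minimal illustration, not the algorithm used by any particular assembler; the function names, the `min_len` threshold, and the toy reads are assumptions, and the sketch ignores sequencing errors, repeats, and reverse complements.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining overlaps: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)

# Three toy reads that tile a 12-base sequence.
toy_reads = ["ATGCGT", "CGTACG", "ACGTTA"]
assembled = greedy_assemble(toy_reads)
print(assembled)
```

The pairwise comparison makes the cost grow quickly with the number of reads, which gives a small-scale sense of why de novo assembly of large, complex data sets is demanding.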
Binning is defined as the process of sorting DNA sequences into clusters (“bins”) that potentially represent an individual genome or the genomes of similar organisms (Thomas et al. 2012). Annotation is the classification of predicted genes into constructed gene families (Mathé et al. 2002; National Research Council (US) Committee on Metagenomics: Challenges and Functional Applications 2007). The first step in annotation is to identify the gene(s) of interest, and the second step is to “assign gene function and taxonomic neighbors.” Annotation can be performed on reconstructed genomes as well as on entire-community analyses (Thomas et al. 2012). The annotation process is often considered a major bottleneck for metagenomic studies, as there is a pressing need to update and merge annotation databases (Seshadri et al. 2007; Yandell and Ence 2012).
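As a rough illustration of binning, contigs can be grouped by a simple composition signal. The sketch below uses GC content and splits the contigs into two bins at the largest gap; this choice of feature, the two-bin limit, the function names, and the toy contigs are all assumptions made for illustration, and practical binning relies on richer signals than a single number per contig.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def bin_by_gc(contigs):
    """Split contigs into two bins at the largest gap in sorted GC content.

    Assumes at least two contigs are supplied.
    """
    ranked = sorted(contigs, key=gc_content)
    gaps = [
        gc_content(ranked[i + 1]) - gc_content(ranked[i])
        for i in range(len(ranked) - 1)
    ]
    cut = gaps.index(max(gaps)) + 1
    return ranked[:cut], ranked[cut:]

# Toy contigs: two AT-rich and two GC-rich sequences.
toy_contigs = ["ATATATTAGC", "GCGCCGGCAT", "TTAATATACG", "CCGGGCGCTA"]
low_gc_bin, high_gc_bin = bin_by_gc(toy_contigs)
print(low_gc_bin, high_gc_bin)
```

Each resulting bin is only a candidate genome or group of similar genomes, in line with the word "potentially" in the definition above.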
Analysis, data storage, and data sharing are the final steps of the metagenomic method. Much like annotation, analysis of metagenomic data is complex and not always feasible. There are, however, several tools available for analyzing DNA sequences, including MEGAN, CAMERA, RAST, and EMG (Huson et al. 2007; Seshadri et al. 2007; Meyer et al. 2008; Hunter et al. 2014). Metagenomic analyses provide opportunities for determining the type and number of microbes present in an environmental sample, the functions of those microbes, community structure and diversity, and the sequenced genome of a particular species (Sharpton 2014). Upon completion, partial completion, or recovery of a genome, the information can be stored in a number of different databases and made publicly available to supplement further research.
This workflow has allowed metagenomics to expand the reach of genetics and microbiology. Due to the reduction of sequencing costs and ever-increasing research, metagenomics quickly became a mainstream technique with the capability to produce vast amounts of sequence data (Metzker 2010). While metagenomics is effective, traditional sequence analyses cannot process and store the genetic information when massive amounts of data are produced (Mitchell et al. 2015).
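One common way to summarize the type and number of microbes in a sample is to convert per-read taxonomic assignments into relative abundances and a diversity index. The sketch below computes the Shannon index, H = -Σ p_i ln p_i, from a hypothetical list of assignments; the taxon labels and function names are illustrative and do not reflect the output format of any tool named above.

```python
import math
from collections import Counter

def relative_abundance(assignments):
    """Map each taxon to its share of the assigned reads."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

def shannon_index(abundance):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero abundance."""
    return -sum(p * math.log(p) for p in abundance.values() if p > 0)

# Hypothetical per-read assignments for one sample.
assignments = ["taxon_a", "taxon_a", "taxon_b", "taxon_b"]
abundance = relative_abundance(assignments)
diversity = shannon_index(abundance)
print(abundance, diversity)
```

A sample dominated by a single taxon yields an index of zero, and the index rises as reads are spread more evenly across more taxa.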
Another limitation of metagenomics is that the current interpretation of sequenced data relies on the assumption “that similar DNA sequences imply similar protein function.” The use of metagenomics for microbial diversity studies is limited by the incomplete annotation of protein-coding genes and by the fact that many microbial and viral genes have yet to be annotated (Gilbert et al. 2011; Lam et al. 2015). Eukaryotic genomes are also of particular concern due to their massive size and genes that are difficult to annotate (Liolios et al. 2006). Genome annotation is hindering several metagenomics projects: genomes are sequenced at a much higher rate than they are annotated, which leaves them incomplete. Two solutions are available to make gene annotation more efficient: outsourcing to a major database or streamlining portable annotation pipelines (e.g. MAKER) (Cantarel et al. 2008).
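The similarity assumption quoted above can be made concrete with a toy annotation transfer: a query sequence inherits the label of its most similar reference, or stays unannotated when nothing is similar enough. This is a sketch under stated assumptions; the k-mer Jaccard similarity, the 0.5 threshold, the reference labels, and the sequences are hypothetical and stand in for the alignment-based searches used in practice.

```python
def kmers(seq, k=4):
    """Set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k=4):
    """Shared k-mers divided by total distinct k-mers of the two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

def annotate(query, references, threshold=0.5):
    """Return the label of the most similar reference, or None below threshold."""
    best_label, best_score = None, 0.0
    for label, ref_seq in references.items():
        score = jaccard(query, ref_seq)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None

# Hypothetical reference database with made-up labels.
references = {
    "hypothetical_kinase": "ATGGCTAAGCTTGGA",
    "hypothetical_transporter": "ATGCCCGGGTTTAAA",
}
print(annotate("ATGGCTAAGCTTGGT", references))  # near-identical to one reference
print(annotate("CCCCCCCCCCCC", references))     # resembles nothing in the database
```

The second query shows the limitation in miniature: a gene with no similar database entry receives no functional label at all.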
The pressing limitations of metagenomics include the low resolution of microbial communities, inability to classify the source of fragments, and inability to provide metabolic information for genes without similar structure and function (Warnecke and Hugenholtz 2007). When metagenomics is paired with various complementary techniques, several solutions are made available for the questions and problems that metagenomics alone cannot solve.
Metagenomics is most useful when paired with other technologies that provide assistance in areas where metagenomic methods alone are lacking.
As previously stated, two main limitations of metagenomics are the low resolution of microbial communities and the inability to classify the source of fragments. Fluorescence-activated cell sorting (FACS) is a flow cytometry method that is used when an extremely high purity of a population is desired (Basu et al. 2010). Microfluidics is the science of controlling and manipulating fluids in channels; this allows for the profiling and identification of individual cells in varied environments (Wadsworth et al. 2015; Whitesides 2006).
FACS and microfluidics, when paired with shotgun metagenomics, provide a solution by quickly sorting large volumes of cells into specific populations based on any number of characteristics (size, DNA content, and other cell properties). Sorting the cells into similar categories yields enough biomass for direct extraction of the DNA or RNA (Brehm-Stecher and Johnson 2004; Warnecke and Hugenholtz 2007; Weibel et al. 2007).
Transcriptomic and proteomic analyses are useful for observing expressed metabolic potential in isolated microbes, which addresses a third major limitation of metagenomics. These analyses can also compensate for the inability of metagenomics to provide metabolic information on genes without similar structure and function (Warnecke and Hugenholtz 2007). Previous studies have also demonstrated that metatranscriptomic and metaproteomic techniques can distinguish between strain-specific proteins that differ by only one amino acid (Ortseifen et al. 2016; Völker and Hecker 2005).
Stable isotope probing (SIP) is useful for identifying the microbes in a sample that consume growth substrates enriched with stable isotopes (Dumont and Murrell 2005). Combining SIP with metagenomics connects metabolic capacity to phylogeny in microbial communities through the use of biomarkers that can trace and sort isotopically labeled genes (Uhlik et al. 2013).
From the early 2000s to the present day, metagenomics has been the leading method for genome sequencing (Doolittle and Zhaxybayeva 2010). In its fifteen years of active use, several adaptations have been made, such as moving from the sequence-driven approach to next-generation/whole-genome sequencing (Mardis 2008; Wackett 2013). Adjustments and advancements are necessary to provide up-to-date and applicable techniques in the area of metagenomics. Metagenomics is not in danger of being replaced, but it is notably more efficient when complemented with other technologies. Future prospects of metagenomics include further development of bioinformatics tools (Vakhlu et al. 2008), improved bioremediation (Uhlik et al. 2013), and holistic ecosystems biology (Teeling and Glöckner 2012).
A previously mentioned limitation of metagenomics, that genome sequences are produced at a much faster rate than the data can be processed and analyzed, can be addressed with the further development of bioinformatics tools. Servers have recently been developed to provide an integrated database of metagenomic information that allows for large-volume data analysis (Hiraoka et al. 2016; Kunin et al. 2008; Vakhlu et al. 2008).
Stemming from the need for improved bioinformatics, EBI metagenomics (EMG) was created to overcome the inefficiency of sequence analysis programs. EMG is a no-cost, recently designed, large-scale sequence analysis platform (Hunter et al. 2014). EMG has the capacity to analyze and archive massive amounts of sequencing data, unlike many of the classic analysis methods (e.g. BLAST) (Mitchell et al. 2015).
Stable isotope probing and metagenomics are powerful tools when used in conjunction. SIP and metagenomics can be used to assess the bioremediative potential of microbes and to improve current bioremediation strategies. SIP-metagenomics is also useful for the identification of microbes that would be suitable for bioaugmentation (Coyotzi et al. 2016). SIP is used to determine whether a microbe that is able to metabolize a specific contaminant is present in the environment. Metagenomic gene function analyses are applied to the SIP data to help predict which contaminants will be degraded by the microbes. With this microbe-contaminant relationship data, bioremediation techniques can be fine-tuned because researchers better understand what is happening in the environment and which method would be most effective (Manefield et al. 2004; Uhlik et al. 2013).
A new approach to ecosystems biology recently emerged with the goal of combining metagenomics, biodiversity, metatranscriptomics, and metaproteomics. With the ecosystems biology approach, researchers aim to identify and sequence microbial communities that were once deemed too complex for standalone metagenomic sequencing, and to gain an understanding of the diversity of species that can be found in one sample (Pignatelli et al. 2008; Teeling and Glöckner 2012; Wilmes and Bond 2006).
In conclusion, metagenomics has brought astounding advancements to the discipline of genetics by providing the capability to sequence an organism’s entire genome from an environmental sample. Metagenomics, like most scientific techniques, has its weaknesses, but the methods have successfully adapted over the past fifteen years in order to stay relevant. Metagenomics operates at maximum effectiveness when paired with varied complementary techniques. With continual research and countless genomes still to be sequenced, metagenomics has a promising future.