Marine Environmental Genomics: Unlocking the Ocean's Secrets

In 1944, Oswald Avery, Colin MacLeod, and Maclyn McCarty demonstrated that DNA was the chemical basis of heredity and the genetic cornerstone of life on Earth (Avery et al., 1944). Some 30 years later, Frederick Sanger, Steve Nicklen, and Alan Coulson developed the dideoxy termination sequencing reaction to allow accurate and rapid determination of the sequence of long stretches of DNA (Sanger et al., 1977). Another 30 years later, we find that automated techniques, novel sequencing approaches, and technological advancements are again transforming our vision of the distribution and diversity of organisms. We have sequenced a human genome, several other animal and plant genomes, and over 500 complete microbial genomes. Sequencing the environment was the next big challenge, and marine microbiologists rose to that challenge. Here we review the current state and future prospects for marine environmental genomics.

THE FIRST OCEAN METAGENOMES

An early example of the way sequencing technology changed our view of marine microbial communities was the discovery and analysis of free-living archaea in the ocean's surface waters (DeLong, 1992; Fuhrman et al., 1992). Until ocean water was sampled using polymerase chain reaction (PCR) and fluorescent-based hybridization techniques, archaea had been considered specialists of extreme environments, including those with low pH, high temperature, high salinity, and limited or no oxygen. In the marine realm, archaea were thought to be restricted to the deep-sea vents, anoxic muds, and other limited locales. In contrast, the oxygenated, moderate-pH surface or deep waters were thought to harbor only bacteria (DeLong, 1992; Fuhrman et al., 1992). Because these free-living archaea remained recalcitrant to culturing, more genetic information was required to understand their functional contribution to the ecosystem.
Large insert (approximately 40,000 base pairs, or 40 kb) fosmid libraries were constructed and probed for the presence of archaea (Stein et al., 1996). ... Crenarchaeota, and provided insight into the evolution, ancestry, and metabolic potential of this organism (Stein et al., 1996).
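The scale of such a screening effort can be roughed out with a back-of-envelope calculation. The sketch below is illustrative only: it takes the article's figures of roughly 40 kb inserts and archaea making up less than 5% of cells, and assumes (these values are not from the article) a 2 Mb archaeal genome, a single 1.5 kb 16S rDNA gene per genome, and that the fraction of cells approximates the fraction of clones. The function names are hypothetical.

```python
import math


def hit_probability(archaeal_fraction, insert_bp, genome_bp, gene_bp, copies=1):
    """Chance that one random insert is archaeal AND carries a complete gene.

    A random insert contains the whole gene only if it starts within
    (insert_bp - gene_bp) bases upstream of the gene start.
    """
    per_genome = copies * max(insert_bp - gene_bp, 0) / genome_bp
    return archaeal_fraction * min(per_genome, 1.0)


def clones_to_screen(p_hit, confidence=0.95):
    """Clones needed to find at least one hit with the given probability,
    treating every clone as an independent draw."""
    if not 0.0 < p_hit < 1.0:
        raise ValueError("p_hit must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_hit))


# From the article: ~40 kb inserts, archaea < 5% of cells (5% used as an upper bound).
# Assumed for illustration: 2 Mb genome, 1.5 kb 16S gene, one copy per genome.
p = hit_probability(0.05, 40_000, 2_000_000, 1_500)
print(f"probability of a hit per clone: {p:.5f}")
print(f"expected clones per hit: {1 / p:.0f}")
print(f"clones for a 95% chance of at least one hit: {clones_to_screen(p)}")
```

Under these assumptions the estimate comes out on the order of a thousand clones per hit, which is at least consistent with a screen that had to examine thousands of clones; different genome sizes or gene copy numbers would shift the figure.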
Apart from identifying free-living archaea, subsequencing fosmid-clone libraries can be used to determine the taxonomic extent of newly identified proteins, such as proteorhodopsins. A new type of rhodopsin, a purple pigment photoprotein that harvests biochemical energy from green light, was thought to be restricted to archaea. However, community sequencing approaches revealed this gene adjacent to a bacterial 16S rDNA gene (Béjà et al., 2000).

Fosmid libraries have proved fruitful for identifying new sources of genetic information and remain the method of choice for isolating complete genes that perform biological functions of interest to the biotechnological community (Vergin et al., 1998; Robertson and Steer, 2004; Hårdeman and Sjöling, 2007). In addition, it has long been known that genes that perform related functions cluster together along the chromosome (Overbeek et al., 1999), and sequencing large contiguous DNA fragments contained on fosmids can yield complete pathways (DeLong et al., 2006). For example, this approach was used to identify the pathways whereby archaea oxidize methane anaerobically (Hallam et al., 2004).

The application of high-throughput ... into the high-throughput, post-genomic era (Venter et al., 2004). The Sargasso Sea study took an approach, now familiar from complete genome sequencing projects, that eschewed large insert libraries and selective subsequencing in favor of complete sequencing of smaller inserts (Figure 1). Theory held that with enough sequencing, whole genomes could be assembled from environments, and large inserts were no longer needed for detailed understanding of the complexities of marine microbial life. Indeed, complete genomes were assembled from the individual reads; alas, they had a decidedly nonmarine origin (Falkowski and Vargas, 2004; DeLong, 2005; Mahenthiralingam et al., 2006). These data revolutionized our view of marine microbiology, altered our perception of how the data would be handled and analyzed in the future, and created a furor among biologists, as widely used databases overflowed with sequences simply labeled as hypothetical proteins from the Sargasso Sea (Tress et al., 2006).

An immediate observation from the Sargasso Sea sampling was an abundance of genes involved in photosynthesis.
Rather than being from chlorophyll-based systems, many of these genes were rhodopsin-like photoreceptors (Venter et al., 2004). Furthermore, many of the photorhodopsin-like genes identified, 782 in total, were distinct from the proteins identified in earlier work (Béjà et al., 2000), suggesting that many more organisms in the ocean are capable of harvesting light than first imagined, and productivity estimates from satellite chlorophyll measurements may misjudge the amount of light being captured by marine microbes.
The Sargasso Sea environmental genome was a milestone in marine genome analyses, and it continues to be mined by researchers in a surprising number of areas, especially in biology and computer science. These data allow hypotheses to be generated and tested (e.g., Rodriguez-Brito et al., 2006). However, the view provided by the Sargasso Sea data set was dwarfed by the first release of the Global Ocean Sampling (GOS) expedition data set (Rusch et al., 2007; Yooseph et al., 2007), ... systems (Tyson et al., 2004), and using large insert libraries, like those described above (Béjà, 2004). Second, similar species are found in widely geographically separated samples: bacteria similar to SAR11 and SAR86 were found in almost every sample (Rusch et al., 2007). This "marine-ness" of the samples has been reported elsewhere (Massana et al., 2000; Tringe et al., 2005; Angly et al., 2006).
However, apart from the ubiquitous microbes, there were also differences in the samples from the pole to the equator, with tropical and temperate communities composed of different organisms.
Further, an increase in diversity towards the equator was found in the microbial communities present in both the GOS data (Rusch et al., 2007) and a survey of nine targeted oceanic regions (Pommier et al., 2007), which reflects trends seen in the ecology of macrobiota.
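Comparisons of diversity between samples are often summarized with a single index. The sketch below computes the Shannon index, H' = -Σ p_i ln p_i, purely as an illustration of what "more diverse" can mean numerically. It is an assumption that an index of this kind is relevant here: the text does not say which measure the cited studies used, and the per-taxon counts are invented for the example.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical read counts per taxon, invented purely for illustration.
temperate_sample = [90, 5, 3, 2]
tropical_sample = [30, 25, 20, 15, 10]

for name, counts in [("temperate", temperate_sample), ("tropical", tropical_sample)]:
    richness = sum(c > 0 for c in counts)
    print(f"{name}: richness={richness}, H'={shannon_index(counts):.3f}")
```

The index rises both when more taxa are present and when their abundances are more even, so a sample dominated by one organism scores low even if several rare taxa are detected.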
These two observations, that sequences will not assemble but many bacteria ... the large number of viral-like proteins (Yooseph et al., 2007). Previous studies have shown the dramatic amount of gene transfer that is likely to occur in the open ocean, approximately 100 transduction events per day per liter of water (Jiang and Paul, 1998). In addition to spreading genetic variation, microbial mortality by viruses may eliminate the most successful isolates as soon as they have reached appreciable numbers (the kill-the-winner hypothesis [Thingstad and Lignell, 1997]). Therefore, the viral component of the microbial community appears to hamper efforts of technologists to assemble complete microbial genomes from environmental samples (see Breitbart et al., this issue).
In terms of the distribution of microbes, is it true that "everything is everywhere and the environment selects" (Pommier et al., 2007, quoting ...

Metagenomic studies also target novel functions in different environments and, increasingly, statistical techniques are being used to discern the differences among environments (Tringe et al., 2005; Rodriguez-Brito et al., 2006).
Whale carcasses that sink to the bottom of the ocean form ecological "islands" that undergo a prolonged breakdown (Smith and Baco, 2003). The environmental genomics study of three whale falls shows they harbor many fewer species than the open ocean. Although they might be considered an ideal location for finding enzymes that break down fats in cold water (e.g., cold-water esterases used in laundry detergents), none were reported in these metagenomes.
Nonetheless, environmental genomics will likely be widely used in future gene discovery applications to harness natural biological processes for industrial applications (Li and Qin, 2005). Identifying changes in metabolic potential of microbial communities within various environments will help identify important areas of biogeochemical activity.
Technological shifts are yet again upending our view of marine microbiology. Industrialization of an alternative method of sequencing, called pyrosequencing, which does not rely on Sanger's dideoxy terminators, has emerged as a contender in environmental genomics studies (Margulies et al., 2005; Angly et al., 2006; Edwards et al., 2006; Prosser et al., 2007). The advantage of pyrosequencing is in the adaptations that enable hundreds of thousands of sequences to be interrogated simultaneously and cheaply. Recent studies on marine samples suggest that there are several orders of magnitude more species than previously imagined in the ocean (Sogin et al., 2006), and they are complemented by major groups that were thought not to occur in the ocean, such as single-stranded viral sequences (Angly et al., 2006). Therefore, in addition to the commonly sampled organisms that are found in the 16S rDNA libraries and shotgun sequences, it is becoming increasingly apparent that the ocean harbors a "rare biosphere" that may be the source of the genetic material for the variation observed.
The advent of cheap, fast sequencing through pyrosequencing offers the ability to use metagenomics to answer important questions in marine ecology and geochemistry rather than just provide generalized observations. For example, disease has been steadily increasing in marine environments, and many important commercial species, such as oysters and mussels, are being affected and lost for commercial purposes (Barber, 2004; Harvell et al., 2004). ... pers. comm., 2007).
The coral reef studies highlight an outstanding problem in environmental genomics: the association of "metadata" with genomic data. It is critical to identify not only where these sequences are from but also what is happening around them (i.e., obtain a richer set of metadata). These data are essential for truly understanding the role of microbes in the environment and will lead to new ways of exploring genome sequences (Lombardot et al., 2006; Field et al., in press).

ACKNOWLEDGMENTS
We thank Mya Breitbart, Ed DeLong, and Mary Ann Moran for critical comments on this manuscript.

REFERENCES
The 16S rDNA gene, whose product is required for protein translation, was, and remains, the most authoritative determinant of the presence of bacteria or archaea in a sample. Because archaea accounted for less than 5% of the microbial cells in the oceans (though still numbering in the millions or more cells per milliliter of seawater), and because very few of the 40 kb fragments contained 16S rDNA genes, thousands of individual clones were screened before a single clone that contained an archaeal 16S rDNA gene was found (Stein et al., 1996). Approximately 2 kb portions of the 40 kb insert from the single archaeal clone were subcloned and sequenced, representing one of the first sequenced marine community genomes. These sequenced fragments revealed the true origin of this DNA fragment, from a ...

... viruses are ubiquitous, have much smaller genomes than other organisms, and are readily fractionated from bacteria, archaea, and eukarya, much of the pioneering work on random community genomics from marine environments was performed on phages, those viruses that infect bacteria (see Breitbart et al., this issue). The much-heralded and oft-debated Sargasso Sea random community genome publication represents a "line in the sand," demarking the entrance of marine microbiology ...

Figure 1. Metagenomes have been made in different ways. From left to right: First, large insert libraries were made, screened for a gene of interest, and that particular isolate was subcloned and sequenced. Second, random small insert libraries were sequenced using high-throughput Sanger sequencing. Then, more recently, random uncloned fragments were sequenced using high-throughput pyrosequencing.
... contains the equivalent of about two human genomes, approximately 6.3 × 10^9 bp of sequences. The size of this data set prevents all but the most ardent computationalists from analyzing the data. However, several patterns are beginning to emerge from the initial publications and synthesis with other metagenomic data. First, in general, even with the deep sequence coverage provided by the data set, assembly of significantly long, contiguous regions of sequences (contigs) from the small fragments failed. Assembly of long contigs from previous environmental genome projects (apart from the contaminated Sargasso Sea sample) was aided by sampling low-complexity environments, such as acid mine drainage ...

... are ubiquitous, appears contradictory. The third observation from the GOS data set hints at possible causes for the conflicting results: the high number of viral-like sequences within the bacterial samples. Large numbers of viral-like sequences were previously observed in marine bacterial/archaeal metagenomic libraries (DeLong et al., 2006). The viral signatures obviously provide flexibility in the genomes, as denoted by variation between sequence reads (Rusch et al., 2007), and also in ...
... Baas-Becking L.G.M. (1934) Geobiologie of Inleiding Tot de Milieukunde. W.P. Van Stockum & Zoon N.V., den Haag) or are there localized enrichments for specialist microbes? Environmental genomics studies on microbial use of organic matter indicate that generalist bacteria are capable of utilizing multiple carbon sources as they become available, suggesting that bacteria may exploit a wide range of environments (Mary Ann Moran, University of Georgia, pers. comm., 2007). However, the Sargasso and GOS sequences were all collected from ocean surface waters, at depths ranging from 0.1 to 30 m. In contrast to covering a wide geographic range in one depth zone, a vertical transect (0-4000 m deep) taken at a single location reveals the changes in microbial communities that occur as light attenuates and pressure increases (DeLong et al., 2006). Low-light apparatus replaces the high-light photosynthetic apparatus before all photosynthesis is lost as light disappears, demonstrating environmental selection through specialization in light adaptation. Further adaptation appears to occur in the very deep-water samples, which contain large numbers of transposases, indicative of slow growth rates or the need to adapt to changing conditions, such as influxes of nutrients. However, deep waters are very geochemically and physically stable (DeLong et al., 2006) and, presumably, so are the microbial communities that inhabit them.
Coral reefs are particularly vulnerable to disease, resulting in both the loss of individual species and altered community structure and function (see Rosenberg et al., this issue). Recent studies of microbial communities on coral reefs using metagenomics found dramatic effects caused by adjacent human populations. As the influence of human activity increases, the microbial communities shift from a balanced heterotrophic/autotrophic mix toward an overwhelmingly heterotrophic population that besieges the corals, according to recent work of author Dinsdale and 13 colleagues. This group conducted one of the first studies to measure contributions of each trophic level, from viruses through microbes to corals, algae, fishes, and sharks, to the ecosystem (additional data from Stuart Sandin, Scripps Institution of Oceanography, ...

The last 60-plus years gave us the identification of DNA as the genetic material, the means to sequence that material, and now the technological leaps to sequence DNA cheaply and efficiently. The speed with which environmental genomics has impacted marine microbiology reinforces how far the field has come in increasing our understanding of the smallest, but perhaps most important, inhabitants of the ocean. However, it is clear that the more we learn, the more we realize how much we don't yet know. Technological advances over the next few years will include single-cell sequencing, long reads from single DNA molecules, and an explosion of synthetic DNA approaches to reconstructing sequences in the ocean. Together, these tools will help answer some of the remaining questions (see Box 1).