An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography

Katherine S. Pollard (1,2,3)

  1. Integrative Program in Quantitative Biology, University of California, San Francisco, San Francisco, California 94158, USA
  2. Gladstone Institutes, San Francisco, California 94158, USA
  3. Institute for Human Genetics, Institute for Computational Health Sciences, and Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, California 94158, USA

Corresponding author: kpollard{at}gladstone.ucsf.edu

Abstract

We present the Metagenomic Intra-species Diversity Analysis System (MIDAS), which is an integrated computational pipeline for quantifying bacterial species abundance and strain-level genomic variation, including gene content and single-nucleotide polymorphisms (SNPs), from shotgun metagenomes. Our method leverages a database of more than 30,000 bacterial reference genomes that we clustered into species groups. These cover the majority of abundant species in the human microbiome but only a small proportion of microbes in other environments, including soil and seawater. We applied MIDAS to stool metagenomes from 98 Swedish mothers and their infants over one year and used rare SNPs to track strains between hosts. Using this approach, we found that although species compositions of mothers and infants converged over time, strain-level similarity diverged. Specifically, early colonizing bacteria were often transmitted from an infant’s mother, while late colonizing bacteria were often transmitted from other sources in the environment and were enriched for spore-formation genes. We also applied MIDAS to 198 globally distributed marine metagenomes and used gene content to show that many prevalent bacterial species have population structure that correlates with geographic location. Strain-level genetic variants present in metagenomes clearly reveal extensive structure and dynamics that are obscured when data are analyzed at a coarser taxonomic resolution.

Footnotes

  • [Supplemental material is available for this article.]

  • Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.201863.115.

  • Freely available online through the Genome Research Open Access option.

  • Received November 12, 2015.
  • Accepted September 8, 2016.

This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
