Next generation sequencing is a powerful method that is growing in popularity for metagenomic and transcriptomic analysis in environmental microbiology. Compared to Sanger sequencing, next generation sequencing allows the complete genomic content of a sample to be sequenced without the need to make clone libraries. Using this technique, microbial community analysis can be performed in a matter of days instead of weeks or months.
One problem with next generation sequencing projects is the handling of massive amounts of sequencing data that must be organized, cleaned up, assembled, and analyzed. Read lengths from the 454/Roche instrument are between 100 and 400 bp, and sequencing an entire genome can generate millions of sequence fragments that must be assembled.
For example, researchers at Bielefeld University in Germany used a single sequencing run on the Genome Sequencer FLX system to completely assemble and characterize the genome of Corynebacterium kroppenstedtii. In 7.5 hrs, they generated over 500,000 shotgun reads with greater than 100 million bases that were assembled into a contiguous genomic sequence with a total size of 2,446,804 bp. Can you imagine the bioinformatics required to assemble that much information in one day? It is a primary concern for next generation sequencing labs using this innovative technology for microbial community analysis in rare environmental samples. Easy to use computing programs are desperately needed to make data interpretation manageable and fast.
This is exactly the purpose of the PANGEA (Pipeline for analysis of next generation amplicons) program. This month, in the July issue of The ISME Journal (4, 852-861, July 2010), Adriana Giongo from the lab of Eric Triplett at the University of Florida in Gainesville published a study demonstrating a new set of computing tools that make analysis of next generation sequencing data faster and easier. PANGEA is written in Perl and can be run on Mac OS X, Windows, or Linux.
What is PANGEA and what is it for?
PANGEA is used to compile the huge datasets generated after 454/Roche next generation amplicon sequencing. In this publication, PANGEA was used to demonstrate assembly and annotation of 16S rRNA libraries for the purpose of microbial identification in a metagenomic analysis. To make next generation sequencing more cost effective and higher throughput, barcoding techniques are used to tag the 5′ end of DNA samples so that multiple libraries may be run simultaneously and then later organized and assembled. PANGEA takes the sequencing data directly from the 454 and performs all the necessary steps to generate files used for sequence identification by BLAST. The program includes statistical analysis to look at diversity of communities as well.
Although this study is a microbial analysis, PANGEA may be used to perform the bioinformatics for any barcoded amplicon sequencing project and can identify the origin of the sequences if an appropriate database for the gene of interest is available.
The source code is freely available at pangea-16s.sourceforge.net and Microgator.org.
Samples:
To demonstrate the utility of PANGEA, the authors performed two different studies. The first was an analysis of fecal DNA from rats and the second was a study of surface soils collected at Hawaii Volcanoes National Park in May 2008.
Sequencing:
Sequences were generated using the GS FLX 454 DNA pyrosequencer (454 Life Sciences, CT). A total of 89,847 reads were obtained from the rat fecal samples and 275,529 reads were obtained from the soil samples. A script called Trim2 (Huang et al., 2003) was used to remove short reads and bases with low quality scores.
A perl script called barcode.pl was used to separate the reads into their respective groups and then remove the barcode from the read and insert a number at the beginning of each sequence.
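The barcode separation step can be illustrated with a minimal Python sketch. This is not the barcode.pl source; the function name, the barcode values, and the choice to attach the sample number to the read ID are assumptions made for illustration. It also assumes all barcodes have the same length, so no barcode is a prefix of another.

```python
from collections import defaultdict

def split_by_barcode(reads, barcodes):
    """Group reads by 5' barcode, strip the barcode, and tag each read.

    reads:    iterable of (read_id, sequence) pairs
    barcodes: dict mapping barcode string -> sample number (hypothetical)
    Returns (groups, unassigned), where groups maps sample number to a
    list of (tagged_id, trimmed_sequence) pairs.
    """
    groups = defaultdict(list)
    unassigned = []
    for read_id, seq in reads:
        for barcode, sample in barcodes.items():
            if seq.startswith(barcode):
                # Remove the barcode and prefix the ID with the sample number.
                groups[sample].append((f"{sample}_{read_id}", seq[len(barcode):]))
                break
        else:
            # No barcode matched the start of this read.
            unassigned.append((read_id, seq))
    return dict(groups), unassigned
```

Reads that match no barcode are returned separately rather than discarded, so they can be counted when checking library quality.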
Megablast, which is part of the BLAST package at NCBI, was used to analyze the final trimmed and barcode-separated sequences. These sequences were then phylogenetically classified using TaxCollector, which attaches the taxonomic information of the closest bacterial relative to each sequence according to the best match in a modified bacterial RDP-II database.
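Picking the best database match for each read can be sketched as follows. The sketch assumes tab-separated BLAST output in the common 12-column layout (query ID in column 1, subject ID in column 2, bit score in column 12). How PANGEA itself parses the Megablast results is not described here, and the function name is hypothetical.

```python
def best_hits(lines):
    """Return {query_id: (subject_id, bit_score)} with the top hit per query.

    Assumes 12-column tab-separated BLAST output; comment lines starting
    with '#' and blank lines are skipped.
    """
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Keep only the highest-scoring subject seen so far for this query.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best
```

If the subject IDs in the database carry taxonomic labels, the best hit for each read yields its classification directly.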
Statistical Analysis:
The numbers of sequences were normalized using selector.pl and the Shannon diversity index was determined using a script called shannon.pl.
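For reference, the Shannon diversity index is H' = -Σ p_i ln(p_i), where p_i is the proportion of sequences assigned to taxon i. A minimal Python sketch of the calculation follows; it is not the shannon.pl source, and the function name and input format (one taxon label per sequence) are assumptions.

```python
import math
from collections import Counter

def shannon_index(labels):
    """Shannon diversity H' = -sum(p_i * ln(p_i)) over observed taxa.

    labels: iterable with one taxon label per classified sequence.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    # Taxa with zero counts never appear in the Counter, so log(0) cannot occur.
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```

A sample in which every sequence belongs to one taxon gives H' = 0, and a sample spread evenly across k taxa gives H' = ln(k).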
Results:
High throughput sequencing of two 16S rRNA gene datasets was performed to demonstrate the utility of PANGEA for rapid characterization of microbial communities. The rat dataset, which contained ~89,000 sequences, was analyzed in only 24 hrs using a MacBook Pro with a 2.4 GHz Intel Core 2 Duo and 4 GB of 667 MHz DDR2 SDRAM.
One of the benefits of PANGEA over other programs, such as CD-HIT, is that PANGEA identifies sequences before clustering instead of after. This is important because it allows every sequence to be classified at the genus and species level first whereas clustering before identification results in loss of representation of smaller sequence reads. If sequences are clustered first, the clusters are created according to sequence length and some sequences may be placed into categories that are not a good fit. In this experiment with rat fecal samples, fewer and sometimes different genera were observed using the CD-HIT method vs. the PANGEA method of data organization.
Another major benefit of PANGEA is that it normalizes the data before community analysis so that the number of sequences in each sample within a barcoded set is identical. This minimizes the effects of variation between the number of sequencing reads for each barcoded data set. Variation in number of reads may be due to errors in quantification of the genomic DNA at the start or problems with samples lacking a barcode during the library construction. Normalization of the data sets is crucial for accurate diversity measurements using the Shannon method.
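One common way to equalize read numbers is to randomly subsample every sample down to the size of the smallest one. The Python sketch below illustrates that idea. Whether selector.pl uses exactly this approach is an assumption, and the function name and fixed random seed are hypothetical choices made for reproducibility.

```python
import random

def normalize_depth(samples, seed=0):
    """Subsample every sample to the size of the smallest sample.

    samples: dict mapping sample name -> list of reads (or read labels)
    Returns a new dict in which every list has the same length.
    """
    rng = random.Random(seed)
    depth = min(len(reads) for reads in samples.values())
    # Draw without replacement so no read is counted twice.
    return {name: rng.sample(reads, depth) for name, reads in samples.items()}
```

After this step every barcoded sample contributes the same number of sequences, so differences in a diversity index reflect community composition rather than sequencing depth.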
The authors conclude their paper with a thorough analysis of the differences between PANGEA and another data analysis tool called the RDP Pipeline. One major advantage they mention is that PANGEA is a standalone tool that does not depend on a web interface. Uploading very large datasets to a web-based program is not only slow, but also raises concerns about confidentiality. In addition, users do not need to wait in line for their project to be finished because they have complete control of the analysis and the databases used. Another advantage of PANGEA over the RDP Pipeline is that PANGEA is fully automated and does not require the user to keep feeding the program files after each step.
The PANGEA workflow and the command lines for the program are described in more detail in the paper, which is available online.
Summary:
With the explosion in the use of next generation sequencing in metagenomic projects comes the need for bioinformatic tools that can organize huge datasets accurately. When files containing data for 300,000,000 sequences are generated in a single run, it is easy to make mistakes and lose critical information if the tools for analysis are not efficient. After reading this report by Adriana, it is clear that PANGEA is an easy-to-use program for handling barcoded amplicon next generation sequencing datasets.

