Saturday, July 14, 2018

Searching for new Type I reaction centre proteins in metagenomes

Testing my new-found metagenome-searching skills, I decided to look for Type I reaction centre core subunits from Heliobacteria. This is because there are fewer than a handful of PshA sequences from this fascinating group of organisms, and only one complete, published genome sequence.

Judging from the massive phylogenetic distance between the PshA core subunit of the reaction centre from Heliobacteria and its closest relative (the PscA from Chlorobi/Acidobacteria), one must assume that significant biodiversity once existed spanning this distance, even if one or the other obtained phototrophy via horizontal gene transfer.

I limited my search to about 2000 metagenomes. I narrowed down my selection to those with “microbial dark matter” in the metagenome title. I am not sure, however, whether all of these belong to a single project or come from different, independent labs or projects.

I have always wondered, however, whether these humongous datasets contain any novel phototrophs still unknown to science.

I used the PshA sequence from Heliobacterium modesticaldum as query.

The BLAST search did not retrieve any new sequences from Heliobacteria or Acidobacteria, but it did retrieve quite a few sequences from phototrophic Chlorobi and Cyanobacteria; see the attached figures. No sequences outside the known phyla of phototrophs were found, which is kind of sad. I had great expectations.

PscA from phototrophic Chlorobi
255 complete or almost complete sequences were obtained, which I then used to build a Maximum Likelihood tree. I did not have a look at fragmented sequences.

There was one almost complete sequence of a PsaA subunit from a new strain close to Gloeobacter.

It had 82% sequence identity to the PsaA of G. violaceus and G. kilaueensis. In comparison, the PsaA of these last two share 88% sequence identity. As another point of comparison, the level of sequence identity for PsaA between a red alga, C. merolae, and A. thaliana is 82%.
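For readers who want to reproduce this kind of comparison, here is a minimal sketch of how percent identity can be computed from two already-aligned sequences. The function name and the toy sequences are made up for illustration, and the convention used (matches divided by the columns where neither sequence has a gap) is an assumption; other conventions exist, such as dividing by the full alignment length, and they give slightly different numbers.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Both inputs must come from the same pairwise or multiple alignment,
    so they have equal length and use '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0  # columns with a residue in both sequences
    matches = 0   # compared columns where the residues are identical
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned fragments (not real PsaA sequences).
toy_identity = percent_identity("MKT-AV", "MKSLAV")
print(toy_identity)  # 4 matches over 5 gap-free columns
```

The toy pair has five gap-free columns and four matches, so it prints 80.0.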

PsaA, the early branches. ML tree. The metagenome sequences are in bold.
At this level of sequence divergence, it should be a new genus/species. I name this strain Protogloeobacter cardonensis. Kidding.

The metagenome where this particular sequence was found is the following:

Hot spring sediment bacterial and archeal communities from British Columbia, Canada, to study Microbial Dark Matter (Phase II) - Larsen N4 metaG (Released on 2016-05-27)

There were also quite a few sequences from the early-branching hot spring Synechococcus type. In addition, there was a PsaA/PsaB pair from another Gloeomargarita strain and a PsaA/PsaB pair of far-red light acclimation isoforms from a form of Fischerella.

If you want the sequences or would like to see the full tree, let me know.

Friday, July 6, 2018

The atypical D1 sequence of Gloeobacter kilaueensis: looking for another one in metagenomes

The evolution of D1 proteins is complicated. It is characterized by many gene duplication events occurring at every taxonomic level. Some of these duplications could potentially predate the most recent common ancestor of all described cyanobacteria.
See our previous work on this:
Some of the earliest duplications, we suggested, gave rise to the atypical D1 forms, of which we have described three: what I have called Group 0, Group 1, and Group 2 D1.
Group 0 is made of a single sequence, found exclusively in the genome of Gloeobacter kilaueensis. G. kilaueensis additionally has 5 standard D1 forms. There may be a D1 fragment encoded in the genome of the early-branching Synechococcus sp. PCC 7336; have a look at this:
Group 1 is the super-rogue D1 also known as chlorophyll f synthase (or PsbA4).
Group 2 is the rogue D1: function unknown/unconfirmed.
A recent preprint by Grettenberger et al. described a new type of early-branching cyanobacterium, which was named Aurora. The genome of this cyanobacterium was assembled from a metagenome of a microbial mat found in Lake Vanda in Antarctica. It is more than 90% complete. This strain seems to be distantly related to Gloeobacter. As far as I understand, however, it is not clear whether this strain is an early-branching cyanobacterium sister to Gloeobacter, or whether it predates Gloeobacter and is therefore a sister branch to all described cyanobacteria.
This is the preprint:
Aurora vandensis has a PSII with a subunit composition similar to that of Gloeobacter. Only one D1 was reported in the preprint, and this is a standard form of D1, a Group 4.
Excited by this, I wondered if I could find another Group 0 sequence in the available metagenomes. Another G0, similar to that from G. kilaueensis.
So, I ran a BLAST search against all JGI environmental metagenomes: 12361 in total. I left out metagenomes categorized as “engineered” or “host-associated”.
To run a BLAST search on so many metagenomes directly on the JGI site, it is necessary to split the data into sets of at most 500 metagenomes. That gives 25 sets of metagenomes that needed to be BLASTed.
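The batching arithmetic is easy to check with a short sketch. I did the splitting by hand on the JGI site, so the code below is only an illustration: the function name and the placeholder identifiers are invented, and nothing here talks to JGI.

```python
def split_into_sets(items, max_size=500):
    """Split a list into consecutive chunks of at most max_size items."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [items[i:i + max_size] for i in range(0, len(items), max_size)]


# Placeholder identifiers; the real JGI metagenome IDs are not reproduced here.
metagenome_ids = [f"metagenome_{n}" for n in range(12361)]
metagenome_sets = split_into_sets(metagenome_ids)

print(len(metagenome_sets), len(metagenome_sets[-1]))
```

With 12361 metagenomes this gives 24 full sets of 500 plus a final set of 361, so 25 sets in total.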
My query sequence was the very atypical G0 sequence from G. kilaueensis.
In the first set I obtained more than 30000 hits, which must include D1, D2, L, and M subunits; both complete and partial sequences. The cut-off E-value was 1e-5.
None of the 25 sets produced a sequence similar to the G0 sequence. Nothing close to it. The closest identity was 54%, usually to other standard forms of D1. No sequence alignment included the C-terminus, which is kind of special in the G0 sequence. Some of the metagenome sets gave a top hit to super-rogue D1 sequences, but the level of sequence identity between G0 and the other atypical forms is also just over 50%. This makes sense if the phylogenetic tree that we published in the paper above is correct, as it would imply that the G0 sequence is as close to the other atypical sequences, as it is to the standard forms of D1.
This is because we suggested based on the phylogeny of D1, that Group 1 to Group 4 would make a monophyletic group to the exclusion of the G0 sequence. But, phylogenetic trees are susceptible to artifacts, so having more G0 sequences could potentially improve the D1 phylogeny.
Each search for each of the metagenome sets produced more than 30k hits: that means that I could have obtained more than 750k hits in these 12361 metagenomes! But not a second G0 sequence?
I have to say that I did not examine every sequence in detail (of course)… waaay too many. So there may have been a partial sequence close to G0 that did not score high due to its very short length. If there was another G. kilaueensis somewhere else I would have expected at least some identical sequences, but nothing at all!
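One way to avoid missing short fragments would be to triage the hits by percent identity instead of by score. The sketch below is not what I actually did. It assumes the hits are available as tab-separated text in the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), which may not match what the JGI site exports. The 70% identity threshold, the 10-residue C-terminal window, the query length of 360, and the example rows are all made-up placeholders.

```python
import csv
import io

# Column order of the standard 12-column BLAST tabular output (assumed here).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def candidate_hits(tabular_text, query_length, min_identity=70.0,
                   max_evalue=1e-5, cterm_window=10):
    """Return hits above an identity threshold, ignoring alignment length.

    Each result is (subject id, percent identity, alignment length,
    covers_cterm), where covers_cterm is True if the alignment reaches
    the last cterm_window residues of the query.
    """
    hits = []
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        pident = float(rec["pident"])
        evalue = float(rec["evalue"])
        if evalue > max_evalue or pident < min_identity:
            continue
        covers_cterm = int(rec["qend"]) >= query_length - cterm_window
        hits.append((rec["sseqid"], pident, int(rec["length"]), covers_cterm))
    # Highest identity first, so short but near-identical fragments surface.
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Made-up rows: a long 54% hit and a short 92.5% fragment at the C-terminus.
example = (
    "G0\tcontig_A\t54.0\t300\t138\t0\t1\t300\t1\t300\t1e-90\t350\n"
    "G0\tfragment_B\t92.5\t40\t3\t0\t321\t360\t1\t40\t2e-12\t75.0\n"
)
shortlist = candidate_hits(example, query_length=360)
print(shortlist)
```

In this toy input only the short fragment is kept, even though its bit score is far lower than that of the long 54% hit.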
I thought that Gloeobacter was not that uncommon after all:
Would anyone be interested in repeating this search? :)
This is the link to the G0 sequence: https://www.ncbi.nlm.nih.gov/protein/AGY58976.1
Now, with the recent eruption of Kilauea this unique strain of Gloeobacter may have just gone extinct.