An ancient ecospecies of Helicobacter pylori

Genome collection

We collected a total of 9,188 Helicobacter whole-genome sequences from public and private sources, including 4,210 H. pylori and one H. acinonychis genomes publicly available in Enterobase⁵⁰ (as of 6 June 2022), 1,011 samples from the Helicobacter pylori genome project²⁵ (https://doi.org/10.5281/zenodo.10048320)⁵¹ and 350 samples available in NCBI, BIGs and Figshare; and 3,615 H. pylori novel genomes from different geographic regions around the world and one novel H. acinonychis. The novel sequences included 2,133 isolates collected by Y.Y. at the Department of Environmental and Preventive Medicine, Faculty of Medicine, Oita University, Japan, 244 strains from Iran collected by A.Y. and 142 genomes from different parts of the world, including 89 from the Swedish Kalixanda cohort⁵². Lastly, 1,096 worldwide DNA samples and one H. acinonychis sample were contributed by Mark Achtman. These sequences have also been deposited in Enterobase at the following workspace: https://enterobase.warwick.ac.uk/a/108555.

Samples from the Yamaoka laboratory were sequenced at Novogene Co., Ltd., Beijing, China with the Illumina NOVA PE150 platform and assembled using the SPAdes genome assembler v.3.15.3 (ref. ⁵³) by downsampling read depth to 100 base pairs, specifying a genome size of 1.6 megabase pairs and enabling the option –careful. Of those samples obtained from M.A., 916 were sequenced using either the Illumina MiSeq platform at the University of Gothenburg, Sweden or the Illumina NOVA PE150 platform at Novogene Co., Ltd, UK; a further 180 were sequenced at the University of Warwick, UK. Remaining sequences were sequenced on the Illumina MiSeq platform at Karolinska Institutet and the University of Gothenburg and assembled using the BACTpipe pipeline (https://doi.org/10.5281/zenodo.4742358)⁵⁴.

We then filtered out redundant genomes, defined as those sequences with SNP distance below 200, and removed low-quality genomes based on assembly fragmentation (over 500 contigs), coverage to the 26695 H. pylori reference strain (below 70%) and contamination (above 90% H. pylori) as predicted by Kraken v.2.1.2 (ref. ⁵⁵), to obtain a final total of 6,866 genomes corresponding to the 2,916 genomes first reported in this work and 3,950 that are publicly available (Supplementary Table 2).

Sequence alignment and core genome variant calling

Single-nucleotide polymorphisms from the core genome (core SNPs) were called using an algorithm based on MUMmer v.3.20 (ref. ⁵⁶), as previously described⁵⁷. We first aligned each genome sequence of the entire dataset to the H. pylori 26695 reference strain (NC_000915.1) using nucmer v.3.1. Next, snp-sites v.2.5.2 was used to call all variants from the whole-genome alignment obtained. Variants present in at least 99% of genomes were finally extracted using VCFtools v.0.1.17, generating a total of 866,840 core SNPs.

Population assignment

To assign a population to the final dataset, we first defined a reference subset of 285 strains consistently assigned into one of 19 H. pylori populations/subpopulations in previous reports^15,33,58,59 and that we confirmed by running fineSTRUCTURE³² based on haplotype data from core SNPs, computed for this subset as mentioned above, and using 200,000 iterations of burn-in and Markov chain Monte Carlo. This subset was then considered as a donor panel to paint each sample of the entire dataset using ChromoPainter v.2 (ref. ³²). Genomes were labelled with a population based on the largest fraction of their genome painted by a population.

PCA

PCA on the whole dataset was performed using SNPs extracted from the global alignment file, following linkage disequilibrium pruning to remove linked SNPs (window size, 50 base pairs; step size, ten variants; r² threshold, 0.1), using the software PLINK (v.1.9)⁶⁰.

The analysis and plotting (for this section and the following) were done using R v.4.3.1 and python v.3.10.6, as well as the R packages ggplot2 v.3.3.6 and tidyverse v.1.3.2 and python library numpy v.1.23.2.

Phylogenetic analysis

For reconstruction of the various phylogenetic trees, we used coding sequences that were aligned with strain 26695 (see above). When looking at specific genes (vacA, ureA, ureB), gene sequences were first obtained from the individual strains annotation file then aligned using MAFFT (v.7.505, option –auto)⁶¹. The tree shown in Extended Data Fig. 2a was built using SNPs from all coding sequences. The trees shown in Fig. 1b,c and Extended Data Fig. 2b,c were built using SNPs from the undifferentiated (B panels) and differentiated (C panels) genes, respectively. The trees shown in Supplementary Figs. 1–4 were built using SNPs from specific genes. In addition, for ureA (Supplementary Fig. 3) and ureB (Supplementary Fig. 4), the sequences were separated into two types because some strains had two copies of the gene. The choice of copy type was based on the similarity between sequences (based on tree clustering and BLAST results, in particular against H. cetorum). Using the various alignments of nucleotide sequences, maximum-likelihood trees were constructed using FastTree software⁶² (v.2.1.10, option -nt). The trees were then rooted based on a given outgroup normally used for H. pylori: hpAfrica2 and H. acinonychis for trees looking at the whole genome or at undifferentiated genes, and Hardy strains for trees looking at differentiated genes. In the case of those trees looking at individual genes, we used H. cetorum as an outgroup, these genes having been chosen after their sequences had been BLASTED against the H. cetorum genome (see below). Rooting was done using the R package ape⁶³ (v.5.7-1, root function). Plotting was done using the R package ggtree v.3.2.1.

The population-level trees shown in Fig. 2b were built via a neighbour-joining algorithm (R package ape, function nj), using a matrix of the average distance between populations represented in both Hardy and Ubiquitous ecospecies. As an equivalent of H. acinonychis (Hardy), we used strains from hpAfrica2. Distances between strains were calculated using the dist.dna function (option model, ‘raw’) from the ape R⁶⁴ package. The trees were rooted with hpAfrica2/H. acinonychis as outgroups, using the same root function as previously.

FineSTRUCTURE

To further investigate the population structure of Hardy and Ubiquitous strains, we analysed 295 strains assigned by ChromoPainter to hspIndigenousSAmerica, hspSiberia and hspIndigenousNAmerica H. pylori populations by running fineSTRUCTURE with 200,000 iterations of burn-in and Markov chain Monte Carlo, using as input the haplotype data prepared with SNPs from the core genome of those 295 strains, considering only those variants present in more than 99% of the samples.

GWAS, FST and separation into undifferentiated, intermediate and differentiated genes

Considering the ecospecies as a trait and using the 244 strains from hspSiberia and hspIndigenousNAmerica, we performed GWAS to determine which biallelic core SNPs were significantly associated with the ecospecies. Although Hardy strains were also found in hspIndigenousSAmerica, we chose to remove this population from GWAS analysis due to the small number (two of 49) of Hardy strains. Ubiquitous strains were coded as 0 (198 strains) and Hardy strains as 1 (46 strains). GWAS was performed using the R package bugwas (v.0.0.0.9000)⁶⁵, which considers the population structure using PCA, followed by GEMMA (v.0.93) to perform GWAS analysis. Using a standard significance threshold of −log(P) = 5, 4,609 out of 285,792 core biallelic SNPs were significantly associated with the Hardy clade.

To reinforce the results obtained by GWAS, we calculated per-site FST between Ubiquitous and Hardy ecospecies with the R package PopGenome⁶⁶ using the same set of strains and SNPs. We considered a SNP to be differentiated between Hardy and Ubiquitous ecospecies if it was significantly associated with the ecospecies by GWAS (−log(P) > 10), and highly differentiated between the two groups based on its FST value (FST > 0.9). Of the core biallelic SNPs we found 2,568 differentiated coding SNPs and 175 differentiated intergenic SNPs. We considered a SNP to be undifferentiated if −log(P) < 10 and FS < 0.5 (265,621 coding and 8,950 intergenic SNPs). All other SNPs (7,756 coding and 591 intergenic) were considered intermediate. Following separation of SNPs into three classes, we also distinguished three types of gene based on those present in the 26695 genome: differentiated (100 genes; Supplementary Table 3), containing at least five differentiated SNPs; undifferentiated (1,034 genes), with only undifferentiated SNPs; and the remaining genes (443 genes), which we considered intermediate.

For each strain, we calculated the number of differentiated sites that have the Hardy allele (major allele among the Hardy strains) and compared this number with both the genetic distance to Ubiquitous strains and the number of Hardy blocks. For a given strain, the distance to Ubiquitous strains is the average number of differences between the sequences of this strain and sequences of Ubiquitous strains from hspIndigenousSAmerica, hspSiberia and hspIndigenousNAmerica. The sequences were those aligned on the 26695 sequence, and gaps were removed.

Hardy blocks were defined based on differentiated SNPs: for each strain, if two adjacent differentiated SNPs had the same allele and were part of the same gene, we considered that they were part of the same Hardy block; otherwise, they were from different blocks.

Pangenome analysis

First, we estimated the gene content of each sample with prokka v.1.4.6 software⁶⁷ using the proteome of 26695 as reference. Then, .gff files were used as input for Panaroo’s v.1.2.8 (ref. ⁶⁸) pangenome pipeline using the strict mode, merging paralogues based on sequence identity of 0.95, length difference of 0.90 and core threshold of 0.95. For this analysis, a smaller dataset was used that consisted of all strains from hspSiberia and hspIndigenousNAmerica (that is, all Hardy strains were included and all Ubiquitous strains from the same populations), as well as randomly chosen strains from the other populations (size of the sample dataset, 721 strains). Then, a hierarchical clustering based on the presence/absence of pangenes was conducted with the pheatmap v.1.0.12 package in R v.4.3.1 using the complete linkage method.

For detection of cagA, vacA and ureAB homologues in diverse Helicobacter spp., non-pylori Helicobacter genomes were recovered from either GenBank or Enterobase (Supplementary Table 4) and annotated using Prokka. Strains/species were considered if they encoded an intact copy of UreA1B1 (indicating gastric tropism) and available metadata indicated isolation from humans or animals. Metagenome-derived genomes from non-animal hosts were excluded. Host diets were identified using Wikipedia.

For identification of putative homologues, pangenome analysis was performed together with a subset of H. pylori strains (Supplementary Table 2) using panaroo with 70% sequence identity and 75% sequence coverage cut-off. For vacA, this was supplemented with additional manual inspection (for example, incomplete genomes) using Mauve⁶⁹ and Tablet⁷⁰ and data from the literature (for example, H. cetorum⁷¹).

Comparison with H. cetorum and phylogenetic analysis of differentiated genes

We used H. cetorum as an outgroup in the study of differentiated genes, specifically H. cetorum strain MIT99-5656 (downloaded 14 February 2023 from NCBI (https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000259275.1/). Gene sequences were obtained from the GenBank file provided.

A phylogenetic analysis was performed on differentiated genes, with the H. cetorum genome included. First, we obtained H. cetorum gene sequences using BLASTing (blastn v.2.11.0)⁷² of a Hardy and a Ubiquitous version of each differentiated gene against the H. cetorum genome. For those genes that returned at least one hit, the phylogenetic tree of the gene was generated using FastTree and rooted on the H. cetorum sequence (see above).

Comparison of genome structure

Significant structural variations, including inversions, gaps, repeats and gene cluster rearrangement, are readily visualized using a dot plot. For investigation of genome structure similarity and the difference between Hardy and Ubiquitous groups, we used the Gepard program⁷³ (v.1.40 jar file from https://github.com/univieCUBE/gepard) to make the dot plot. Different comparisons were considered: Hardy versus Hardy, Hardy versus Ubiquitous and Ubiquitous versus Ubiquitous and against H. cetorum. Publicly available genome sequences were downloaded from NCBI GenBank (https://www.ncbi.nlm.nih.gov/genbank/). The Gepard internal DNA substitution matrix (edna.mat) was selected to generate the alignment and plot. The lower colour limit was 50, to reduce noise and emphasize significant regions. Window size and word length were default values of 0 and 10, respectively.

Functional enrichment analysis

To determine whether particular types of genes were over-represented among the 100 highly differentiated genes, we performed functional enrichment analysis using the website DAVID⁷⁴. Three hypothetical proteins were removed because they lacked a unique identifier. The background set of genes chosen (see Supplementary Table 3 for a list of genes tested and their categories) for the comparison was based on those genes present in H. pylori 26695. A term was considered as significant at P < 0.05 following Benjamini correction for multiple tests.

Calculation of dN/dS and ANI

For each strain in the dataset, we estimated its dN/dS and dS values to an outgroup population using the Yang and Nielsen method (YN00) in PAML (v.4.9) software^75,76, whereas ANI was calculated with FastANI (v.1.34 (ref. ⁷⁷)). The dN/dS, dS and ANI values shown in plots were averaged over pairwise comparisons with each of the different outgroup strains. Codons that coded for a stop in at least one of the strains were removed from all strains, and dN/dS and ANI values were calculated pairwise using the coding sequences from those aligned to the reference genome (26695, global alignment). In addition, we separated the values between undifferentiated and differentiated coding sequences. We used three different outgroups: hpAfrica2, H. acinonychis (Hardy strains) and H. pylori Hardy strains, with outgroup population/ecospecies not shown in the plots (only population/ecospecies of the ‘focal’ strain).

Ethics and inclusion statement

The Helicobacter genomics consortium includes gastroenterologists and researchers from several developing countries. Its aim is to characterize the genetic diversity of Helicobacter in human populations across the world, and its correlations with gastric disease. In Bangladesh, Bhutan, Congo DR, Dominican Republic, Indonesia, Japan, Myanmar, Nepal, Sri Lanka, Thailand and Vietnam, all preparation for endoscopy surveying was performed by local researchers (persons in the consortium and PhD students at Oita University). Ethical permission for the collection of human gastric biopsy material was obtained for all cohorts, including informed consent from participating individuals. Endoscopy was performed by local physicians and Y.Y. Culturing of bacteria, DNA extraction, next-generation sequencing and basic genetic analysis in these countries were performed at Oita University, principally by doctoral students who were locally recruited in several of the countries and are coauthors of the paper. In addition, the doctors and scientists involved in this consortium are actively involved in the research process and are kept up to date with its findings. This training and dissemination programme will help to spread both genomics knowledge and best practice for treating gastric illness, from Japan, where there has been considerable success in mitigating the burden of gastric cancer and other conditions, to other less developed nations. The study protocol (Iranian strains) was approved by the Institutional Ethical Review Committee of the Research Institute for Gastroenterology and Liver Diseases at Shahid Beheshti University of Medical Sciences, Tehran, Iran (no. IR.SBMU.RIGLD.REC.1395.878). All experiments were performed in accordance with relevant guidelines and regulations recommended by the institution. Written informed consent was obtained from all enroled subjects and/or their legal guardians before sample collection. The sampling of DNA used to generate the new Siberian genomes has been detailed previously¹⁵. The remaining new genomes were also sourced from previously collected H. pylori DNA (for example, refs. ^33,34,59). The study protocol for the Swedish Kalixanda genomes was approved by Umeå University ethics committee, and the study was conducted in accordance with the Helsinki Declaration. The science of microbiology has not generally viewed human-derived microbes as belonging to the individuals they came from and routine publication of genetic sequences from bacterial pathogens has enabled many public health applications. In a few bacteria, however, Helicobacter pylori being one of them, the tight coupling between human and bacterial population structure makes unexpected inferences about human hosts possible from bacterial data, which may be far from the uses envisioned when consent was obtained for sample collection. We will strive to ensure that the design of future studies is built around the needs of communities in which they are performed and that consent procedures provide accurate information on how the samples will be used, informed by recent scientific advances. Our finding of a highly distinct Hardy ecospecies has potential implications for infected individuals in many Indigenous communities known to have a high gastric disease burden. However, the pathogenicity profile, either in single or mixed infection, is currently unknown. In keeping with the TRUST code of conduct, we are maintaining and developing contacts with researchers working in communities where Hardy strains have been isolated with the intention of consulting representatives of these communities to ascertain and accommodate their interests in collaborative research on the functional characterization of these strains, including associations with gastric disease.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link