Predicting Immunological Cross-reactivity of Pathogen Strains: From Genotype to Antigenic Phenotype

Dan Haydon, SVN Vishwanathan, David Paton, Liz Fry, and Wilna Vosloo

The objective of the proposed research is to develop bioinformatic algorithms that predict the immunological cross-reactivity of different strains of the same pathogen from the amino-acid sequences of their antigenic proteins. Estimating cross-reactivity between strains is important in determining how effectively naturally or vaccinally acquired immunity to past antigens protects against disease caused by novel pathogen strains. We intend to use foot-and-mouth disease virus (FMDV) as a model to develop these algorithms but, in principle, the approach could be extended to any pathogen, provided the genes determining its antigenicity are known and empirical data on immunological cross-reactivity exist with which to train the algorithms.

Predicting the antigenic similarities of different viral strains from sequence data alone, without the need for time-consuming and costly animal experimentation, is a long-standing goal of immunology. For FMDV, the existence of seven serotypes and many antigenically significant subtypes makes the choice of a vaccine strain with a good antigenic match to circulating field strains essential for efficient control of disease by vaccination. Thus, the ability to predict antigenic matching from capsid sequence data alone is an epidemiological imperative and would represent an advance of great veterinary significance. FMDV is a diverse and rapidly evolving RNA virus, but the circumstances under which this genetic dynamism generates significant antigenic novelty remain poorly understood. Developing a more sophisticated and quantitative understanding of the relationship between capsid genotype and antigenic phenotype is essential if the long-term consequences of viral capsid evolution, antigenic novelty, and vaccine use are to be anticipated.

To accomplish these goals, significant advances will need to be made on a number of technical, theoretical, and algorithmic fronts. This project is positioned at the interface of three themes identified as priorities by the EBS committee: systems biology, theoretical biology, and bioinformatics. It will make use of a great deal of existing knowledge regarding the antigenicity of FMDV capsids, and will rely on the substantial amounts of sequence data available for this pathogen, acquired during previous outbreaks including UK-2001, in which the potential of such pathogens to cause agricultural catastrophes was amply demonstrated. We believe that by combining the twin information sources of capsid sequence data and pairwise antigenic comparisons we will generate completely novel applications of data analysis that open both explanatory and predictive avenues of virological research.

The antigenic characteristics of FMDV are largely determined by the capsid proteins (only ~730 amino acids in length). This, together with the large amount of capsid gene sequence data available for FMDV, the antigenic diversity known to be reflected in these sequence data, and the economic importance of the disease, makes FMDV an ideal virus for which to develop the immunological models and bioinformatic tools required for estimating antigenic relationships from capsid sequence data. The general approach developed here will be of particular relevance to many other RNA virus species (for example, influenza), and especially to those within the epidemiologically important picornavirus family (e.g. HAV, polio, and rhinoviruses).

Irrespective of the genetic processes through which RNA virus genetic diversity is generated and maintained, RNA virus populations are indisputably genetically diverse 3. FMDV occurs as seven immunologically distinct serotypes, within each of which exist numerous genetic and antigenic variants. Primary processing of the residue polyprotein yields the P1 product (~730 aa), which is further processed by virally encoded proteases to yield the four capsid proteins (VP4, 2, 3 and 1), sixty copies of each of which are assembled to form the icosahedral virus particle. The three-dimensional capsid structures of strains from different serotypes have been determined and, together with studies using monoclonal antibodies (Mabs) and mutagenesis, much has been learned about the location and structure of epitopes (areas of the capsid comprising 5-10 amino acids to which antibody binds) across the capsid structure. Furthermore, the large amount of capsid sequence data available permits estimation of nucleotide substitution rates at different codon positions, and of the selective forces that might bear on these mutations across the capsid genes.

A long-standing goal of FMDV research has been the prediction of antigenic characteristics of different viral strains directly from their capsid sequences. This would be of direct value to the control of disease by vaccination, for which it is necessary to choose appropriately matched vaccine strains for different outbreak strains. Current approaches assume that antigenic similarity can be determined from the identification of shared epitopes. The principal method used to identify different viral epitopes involves the generation and sequencing of escape mutants from monoclonal antibodies. While this is a powerful approach, it has a number of disadvantages, the most limiting of which is that it is generally not known to what extent particular monoclonal antibodies to specific epitopes are representative of the full 'in vivo' polyclonal response (but see 21).

Here we propose a more informatics-based approach that makes use of machine-learning algorithms. Our proposed methods would make use of published amino acid alignments from complete viral capsid gene sequences and matrices containing the pairwise antigenic cross-reactivities of the different strains for which sequences exist. Currently it is standard procedure to compare FMDV strains by means of so-called 'r-values'. An r-value compares the antigenic similarity of strain i with strain j by examining the cross-reactivity of antibodies generated in vivo against strain i (through experimental inoculation of cattle with strain i) to strain j, as determined through an ELISA reaction. The r-value is normalized by dividing the reactivity of strain i with j by the reactivity of i with itself (using an otherwise identical ELISA procedure). These r-values are routinely calculated between many vaccine strains and selected circulating field strains. However, r-value estimation has fairly low repeatability and, currently, it is not possible to obtain permission to sequence many modern commercial vaccine strains from their respective manufacturers.
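
As a minimal illustration of the normalization just described, the r-value of strain j relative to strain i can be written as r(i, j) = reactivity(i, j) / reactivity(i, i). The Python sketch below applies this to a matrix of raw readings. The function name `r_values` and all numbers are hypothetical, chosen only for illustration, and no particular ELISA read-out or dilution scheme is assumed.

```python
import numpy as np

def r_values(raw):
    """Normalize raw cross-reactivity readings into r-values.

    raw[i, j] is assumed to hold the measured reactivity of antiserum
    raised against strain i when tested against strain j.
    """
    raw = np.asarray(raw, dtype=float)
    homologous = np.diag(raw)              # reactivity of each serum with its own strain
    if np.any(homologous <= 0):
        raise ValueError("homologous reactivity must be positive")
    return raw / homologous[:, None]       # divide row i by raw[i, i]

# Hypothetical readings for three strains (illustrative numbers only).
raw = np.array([[2.0, 1.0, 0.5],
                [0.8, 1.6, 0.4],
                [0.6, 0.6, 1.2]])
r = r_values(raw)
```

Because each row is normalized by its own homologous reaction, the resulting matrix has ones on the diagonal and need not be symmetric: in this toy example r(0, 2) and r(2, 0) differ.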

We believe it is important that newer technologies be brought to bear on the problem of comparing the antigenic characteristics of viral strains. Methods that are easier, less expensive, and more accurate are urgently required. One way to stimulate such research is to demonstrate the value of comparative pairwise antigenic data.

Overall Objectives

  • To develop two different algorithms that can predict the immunological cross-reactivity of different but related pathogen strains from complete amino acid sequences of their antigenically relevant proteins.
  • To use simulated data to evaluate the extent and nature of data required to train such algorithms.
  • To use simulated data to evaluate the predictive accuracy and power of the two different algorithms.
  • To test the ability of these two algorithms to predict the cross-reactivity of different strains of FMDV using pre-existing amino acid alignments and cross-reactivity data.

General Approach

To predict the immunological cross-reactivity of different pairs of pathogen strains from antigen sequence data we need to develop algorithms that can relate alignments of amino acid sequence data to the pairwise immunological cross-reactivity of the proteins defined by these sequences. Let S be an amino acid alignment containing N sequences (strains) each of n residues. Let C be the N x N matrix containing the immunological cross-reactivities of the antigens specified by each amino acid sequence. The training data for the algorithms we will develop will be the matrices S and C. In this proposal we will generate three different types of data, some of which will be used to train algorithms, and some of which will be used to test the algorithms' predictive performance once trained.
  • Entirely simulated data: We will generate random but varyingly related amino acid sequences, and specify different sets of assumptions regarding the relationship between amino acid sequence and antigenicity (detailed below) that will enable us to model polyclonal antibody responses to these different hypothetical viral strains. These assumptions will enable us to judge the cross-reactivity of the polyclonal response induced by one strain when presented with other strains. This will enable us to generate numerous pairs of S and C matrices with which to develop and train the proposed algorithms and test their performance at predicting the cross-reactivity of additional strains using the same antigenic assumptions.
  • Empirically informed simulated data: We will use actual FMDV capsid alignments and simulated sequences closely related to them. We will develop antigenic assumptions based on the very considerable pre-existing knowledge regarding the locations and sensitivity of FMDV epitopes to amino-acid substitution, and use these rules in generating polyclonal immune responses to each strain, and estimating pair-wise cross-reactivities of these responses to other strains.
  • Actual empirical data: We will use pre-existing capsid sequence and associated cross-reactivity data (Fig. 1) to test whether algorithms trained on the empirically informed simulated data, generated under different assumptions about the underlying antigenicity, successfully predict the empirically observed cross-reactivity values of pairs of FMDV strains from their capsid amino acid sequences alone.
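
To make the S and C matrices concrete, the following Python sketch generates one entirely simulated pair. It is a toy under loudly stated assumptions, not the simulation model the project will develop: every strain is mutated independently from a single random ancestor, epitopes are hypothetical fixed blocks of five contiguous residues, and C[i, j] is taken to be the fraction of strain i's epitopes that are unchanged in strain j. The function names, epitope positions, and parameter values are all illustrative.

```python
import numpy as np

AMINO_ACIDS = np.array(list("ACDEFGHIKLMNPQRSTVWY"))

def simulate_alignment(n_strains, n_residues, mutation_rate, rng):
    """Return S, an (N x n) array of residues derived from one random ancestor."""
    ancestor = rng.choice(AMINO_ACIDS, size=n_residues)
    S = np.tile(ancestor, (n_strains, 1))
    # Assumption: each site of each strain mutates independently.
    mutate = rng.random(S.shape) < mutation_rate
    S[mutate] = rng.choice(AMINO_ACIDS, size=int(mutate.sum()))
    return S

def simulate_cross_reactivity(S, epitopes):
    """Return C, where C[i, j] is the fraction of epitopes identical in strains i and j."""
    n_strains = S.shape[0]
    C = np.zeros((n_strains, n_strains))
    for sites in epitopes:
        block = S[:, sites]                                    # residues at this epitope
        intact = (block[:, None, :] == block[None, :, :]).all(axis=2)
        C += intact                                            # count matching epitopes
    return C / len(epitopes)

rng = np.random.default_rng(0)
S = simulate_alignment(n_strains=6, n_residues=60, mutation_rate=0.1, rng=rng)
# Hypothetical epitopes: four blocks of five contiguous sites.
epitopes = [np.arange(start, start + 5) for start in (5, 20, 35, 50)]
C = simulate_cross_reactivity(S, epitopes)
```

Under this particular rule C is symmetric with a unit diagonal, which empirical r-value matrices need not be; richer antigenic assumptions of the kind described above would relax this.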

We will explore the use of two different types of algorithm: neural networks and kernel-based methods. Each method requires training data (both the S and C data). Once the algorithms are trained, blind trials can be conducted to assess how accurately C can be predicted when only S is supplied. Currently, we do not anticipate that there is enough sufficiently accurate data (for FMDV or any other pathogen) to train our algorithms using real empirical data (nor is it within the scope of this study to generate such data ourselves). However, there is certainly enough real data to test the algorithms trained on 'realistically simulated FMDV data'. While there is a significant possibility that our algorithms might successfully predict real FMDV cross-reactivities with useful accuracy, this is not the main objective of this proposal. The real prize of this research is to develop the algorithmic technology and establish the plausibility of making such predictions using simulated data and, in addition, to develop an understanding of the volume and type of data required to do so. By so doing, we hope to motivate the generation of larger and more accurate empirical data sets on strain cross-reactivities.
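
The train-then-blind-test workflow can be sketched for the kernel-based case as follows. Kernel ridge regression with a Gaussian kernel is used here purely as a stand-in for "a kernel-based method"; the proposal does not commit to it. Each strain pair is represented by a per-site mismatch vector, which is an assumed encoding, and the toy data, the rule linking the first ten sites to cross-reactivity, the function names, and the values of `gamma` and `lam` are all hypothetical.

```python
import numpy as np

def pair_features(S, pairs):
    """Per-site mismatch indicators (1.0 where the two strains differ) for each pair."""
    i, j = np.asarray(pairs).T
    return (S[i] != S[j]).astype(float)

def pair_targets(C, pairs):
    """Cross-reactivity values C[i, j] for each pair."""
    i, j = np.asarray(pairs).T
    return C[i, j]

def rbf_kernel(A, B, gamma):
    """Gaussian kernel between the rows of A and the rows of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit(X, y, gamma, lam):
    """Kernel ridge regression: solve (K + lam * I) alpha = y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_train, alpha, X_new, gamma):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Toy data: integer-coded residues mutated from one ancestor (assumption).
rng = np.random.default_rng(1)
N, n = 12, 40
ancestor = rng.integers(0, 20, size=n)
mutated = rng.random((N, n)) < 0.15
S = np.where(mutated, rng.integers(0, 20, size=(N, n)), ancestor)
# Hypothetical antigenic rule: only the first 10 sites affect cross-reactivity.
A = S[:, :10]
C = 1.0 - (A[:, None, :] != A[None, :, :]).mean(axis=2)

# Train on pairs among strains 0-8; blind-test on sera from held-out strains 9-11.
train_pairs = [(i, j) for i in range(9) for j in range(9)]
test_pairs = [(i, j) for i in range(9, N) for j in range(N)]
gamma, lam = 1.0 / n, 0.1
X_train = pair_features(S, train_pairs)
alpha = fit(X_train, pair_targets(C, train_pairs), gamma, lam)
predicted = predict(X_train, alpha, pair_features(S, test_pairs), gamma)
rmse = np.sqrt(np.mean((predicted - pair_targets(C, test_pairs)) ** 2))
```

The held-out error `rmse` is the kind of quantity the simulation studies would track while varying the number of strains, the noise in C, and the antigenic assumptions, in order to gauge how much training data such a method needs.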