GW_protein_pI

class GWProt.GW_protein_pI.GW_protein_pI(name: str, seq: str, pI_list, coords=None, ipdm=None, scaled_flag=False, distribution=None)

This subclass of GW_protein contains everything needed to run fused GW on proteins using isoelectric points, as well as versions with downsampling, distortion scaling and sequence alignment.

Parameters
  • name – Simply for ease of use

  • coords – The coordinates of the alpha-Carbons of the protein, ordered sequentially

  • seq – A string giving the sequence of the protein

  • ipdm – The intra-protein distance matrix of a protein. The (i,j)th entry is the (possibly scaled) distance between residues i and j. This is mutable and can change if distortion scaling is used.

  • scaled_flag – Records whether the ipdm is the exact distance between residues or if it has been scaled.

  • distribution – Numpy array of the weighting of the residues, must sum to 1. Default is a uniform distribution.

Variables

pI_list – Estimated isoelectric point of each residue based on Solomon values.
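As a minimal sketch of how the coords, ipdm and distribution parameters relate, the snippet below builds a distance matrix and a uniform weighting in plain Python. The helper names build_ipdm and uniform_distribution are hypothetical and are not part of GWProt; the class itself expects NumPy arrays and may compute these differently.

```python
import math

def build_ipdm(coords):
    """Pairwise Euclidean distances between alpha-carbon coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def uniform_distribution(n):
    """Equal weight on every residue; the weights sum to 1."""
    return [1.0 / n] * n

# Three made-up alpha-carbon positions, in sequence order.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
ipdm = build_ipdm(coords)            # ipdm[i][j] = distance between residues i and j
weights = uniform_distribution(len(coords))
```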

Downsampling

Unlike GW_protein, this stores the isoelectric point values with each protein [1]. Most of the methods are the same as those of GW_protein. The key difference is that when downsampling, we can combine the isoelectric points of adjacent residues that are grouped together, giving an estimated isoelectric point for each segment. Thus, unlike the FGW methods in GW_protein, we can run FGW on downsampled proteins in a meaningful way. The combination uses an algorithm modeled on the one in the Sequence Manipulation Suite, which is based on the Henderson–Hasselbalch equation.
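The following is a rough sketch of a Henderson–Hasselbalch-style estimate for a segment: sum the fractional charges of the ionizable groups at a given pH, then bisect for the pH where the net charge is zero. The pK values and the function names net_charge and estimate_pI are assumptions made for illustration; GWProt's own table and termini handling may differ.

```python
# Illustrative pK values (assumed for this sketch, not GWProt's table).
POSITIVE_PK = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PK = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PK, C_TERM_PK = 8.6, 3.6

def net_charge(seq, pH):
    """Net charge of a peptide at the given pH (Henderson-Hasselbalch)."""
    charge = 1.0 / (1.0 + 10 ** (pH - N_TERM_PK))    # N-terminus, positive
    charge -= 1.0 / (1.0 + 10 ** (C_TERM_PK - pH))   # C-terminus, negative
    for aa in seq:
        if aa in POSITIVE_PK:
            charge += 1.0 / (1.0 + 10 ** (pH - POSITIVE_PK[aa]))
        elif aa in NEGATIVE_PK:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PK[aa] - pH))
    return charge

def estimate_pI(seq, tol=1e-4):
    """Bisect on pH in [0, 14]; net charge decreases as pH increases."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Applied to the residues of one downsampling segment, a routine of this kind yields a single estimated isoelectric point for that segment.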

For uniform downsampling, this is done automatically in the downsample_n method.

GWProt.GW_protein_pI.GW_protein_pI.downsample_n(self, n: int = inf, pI_combination: bool = True, left_sample: bool = False, mean_sample: bool = False) GW_protein_pI

This method creates a new GW_protein object by downsampling self. It divides self into n evenly sized segments, then creates a GW_protein object whose residues are formed from those segments. Depending on the parameters, this can be regular downsampling (picking one residue from each segment and copying its data) or it can combine the coordinate data and/or isoelectric values of the residues in a segment.

Parameters
  • n – The maximum number of residues in the output protein. If this is larger than the size of self, then there is no downsampling.

  • pI_combination – Whether to combine the isoelectric points of nearby residues when downsampling. If False then the values in pI_list of the returned GW_protein are a subset of those of self.pI_list. If True then pI_alg is used to estimate the isoelectric point of nearby residues.

  • left_sample – Whether to use the left-most (lowest index) or median residue from each segment. left_sample == True uses the left-most, left_sample == False uses the median.

  • mean_sample – Whether to average the coordinates of the residues in a segment. mean_sample == False uses the coordinates of the residue determined by left_sample, mean_sample == True uses the average of the coordinates in a segment.

Returns

A new GW_protein object created by downsampling from self.
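To make the roles of n, left_sample and mean_sample concrete, here is a small sketch of segment-based downsampling of coordinates. The functions segment_bounds and downsample_coords are hypothetical, and the exact way GWProt splits a protein into segments or picks the median is an assumption here.

```python
def segment_bounds(length, n):
    """Split range(length) into n near-equal (start, end) half-open segments."""
    n = length if n >= length else int(n)   # n larger than the protein: no downsampling
    return [(i * length // n, (i + 1) * length // n) for i in range(n)]

def downsample_coords(coords, n, left_sample=False, mean_sample=False):
    out = []
    for start, end in segment_bounds(len(coords), n):
        if mean_sample:
            # Average each coordinate axis over the residues in the segment.
            segment = coords[start:end]
            out.append(tuple(sum(axis) / len(segment) for axis in zip(*segment)))
        else:
            # Left-most residue, or the (lower) median residue of the segment.
            pick = start if left_sample else (start + end - 1) // 2
            out.append(coords[pick])
    return out

# Nine made-up residues on a line, downsampled to three.
coords = [(float(i), 0.0, 0.0) for i in range(9)]
left = downsample_coords(coords, 3, left_sample=True)
median = downsample_coords(coords, 3)
mean = downsample_coords(coords, 3, mean_sample=True)
```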

For downsampling to specified indices, we first need to 'smooth out' the isoelectric values, so that when we downsample, we account for the isoelectric points of nearby discarded residues.

GWProt.GW_protein_pI.GW_protein_pI.convolve_pIs(self, kernel_list: list[int] = [1, 2, 3, 2, 1], origin: int = 2, inplace: bool = False) list[float]

This method applies a convolution process to the GW_protein_pI object which smooths out the isoelectric points associated with each residue by combining them with those of nearby residues. The intended use is to apply it before downsampling so that the isoelectric points of discarded residues are still accounted for. That is done automatically with downsample_n(pI_combination = True), so this is most useful when applied before run_FGW_seq_aln(), as that method discards unaligned residues.

The convolution works as follows: for each residue we make a virtual oligopeptide of copies of that residue and its neighbors, then use the Henderson–Hasselbalch-based algorithm to estimate the oligopeptide's isoelectric point. The number of copies of each neighbor is the corresponding entry of kernel_list, where the current residue is at position origin. The isoelectric contributions of the protein's N- and C-termini are accounted for similarly.

We recommend that kernel_list be unimodal and symmetric about index origin, for instance kernel_list = [1,2,3,2,1] with origin = 2.

Parameters
  • kernel_list – The list of how many copies of nearby residues we use when smoothing the isoelectric points

  • origin – The index in the kernel_list of the current residue

  • inplace – Whether this modifies self.pI_list or returns a new list

Returns

For inplace == False, a new list of the smoothed isoelectric point values. For inplace == True, self.pI_list is modified and nothing is returned.

Then we can downsample:

GWProt.GW_protein_pI.GW_protein_pI.downsample_by_indices(self, indices: list[int]) GW_protein

This creates a new GW_protein object consisting of the residues of self in the input indices.

Parameters

indices – The indices to keep.

Returns

A new GW_protein object

Computing FGW and LGD

As GW_protein_pI objects only use isoelectric points, the FGW methods are streamlined:

GWProt.GW_protein_pI.GW_protein_pI.run_FGW(prot1: GW_protein_pI, prot2: GW_protein_pI, alpha: float = 0.5, correspondence: bool = False) Union[float, tuple[float, numpy.array]]

This calculates the fused Gromov-Wasserstein distance between two proteins. The computation is done with the Python ot library.

Parameters
  • prot1 – The first protein

  • prot2 – The second protein

  • alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight, alpha = 1 is equivalent to regular GW.

  • correspondence – Whether to return the computed correspondence

Returns

The FGW distance, together with the optimal correspondence if correspondence is True.
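To illustrate the role of alpha, the sketch below evaluates a fused GW objective for a fixed coupling T in plain Python. It is not the solver: run_FGW minimizes over couplings using the ot library. The function fgw_cost is hypothetical, and the absolute difference of isoelectric points as the feature cost and the squared loss on distances are assumptions of this sketch.

```python
def fgw_cost(D1, D2, pI1, pI2, T, alpha=0.5):
    """Fused GW objective for a given coupling T (not the minimizer)."""
    n, m = len(D1), len(D2)
    # Feature term: isoelectric-point mismatch of matched residues.
    feature = sum(abs(pI1[i] - pI2[j]) * T[i][j]
                  for i in range(n) for j in range(m))
    # Geometric term: distortion of pairwise distances under the coupling.
    geometric = sum((D1[i][k] - D2[j][l]) ** 2 * T[i][j] * T[k][l]
                    for i in range(n) for j in range(m)
                    for k in range(n) for l in range(m))
    return (1 - alpha) * feature + alpha * geometric

D = [[0.0, 1.0], [1.0, 0.0]]          # made-up two-residue distance matrix
T = [[0.5, 0.0], [0.0, 0.5]]          # identity-like coupling
same = fgw_cost(D, D, [5.0, 9.0], [5.0, 9.0], T)
shifted = fgw_cost(D, D, [5.0, 9.0], [6.0, 9.0], T, alpha=0.5)
geometry_only = fgw_cost(D, D, [5.0, 9.0], [6.0, 9.0], T, alpha=1.0)
```

With alpha = 1 the feature term drops out, which matches the statement that alpha = 1 is equivalent to regular GW.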

GWProt.GW_protein_pI.GW_protein_pI.FGW_lgd(prot1: GW_protein_pI, prot2: GW_protein_pI, alpha: float, T: array)

This calculates the local geometric distortion (LGD), i.e. the contribution of each residue to the sum in the FGW cost, using the correspondence T. This is output as two np.arrays, the first for prot1 and the second for prot2.

Parameters
  • prot1 – The first GW_protein

  • prot2 – The second GW_protein

  • diff_mat – The difference matrix in the feature space

  • T – The correspondence to be used

  • alpha – The trade-off constant between the fused cost and the geometric cost

Returns

lgd1, lgd2; the LGD values for the two proteins
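One way to read "the contribution of each residue to the sum in the FGW cost" is sketched below: every term of the objective that involves the pair (i, j) is credited to residue i of prot1 and residue j of prot2. The function fgw_lgd is hypothetical, and the feature cost, squared loss and normalization are assumptions that may not match GWProt.

```python
def fgw_lgd(D1, D2, pI1, pI2, T, alpha):
    """Per-residue share of a fused GW objective under coupling T."""
    n, m = len(D1), len(D2)
    lgd1, lgd2 = [0.0] * n, [0.0] * m
    for i in range(n):
        for j in range(m):
            if T[i][j] == 0:
                continue
            # Distance distortion seen from the matched pair (i, j).
            geometric = sum((D1[i][k] - D2[j][l]) ** 2 * T[k][l]
                            for k in range(n) for l in range(m))
            term = T[i][j] * ((1 - alpha) * abs(pI1[i] - pI2[j])
                              + alpha * geometric)
            lgd1[i] += term   # credited to residue i of the first protein
            lgd2[j] += term   # and to residue j of the second protein
    return lgd1, lgd2

D1 = [[0.0, 1.0], [1.0, 0.0]]
D2 = [[0.0, 2.0], [2.0, 0.0]]
T = [[0.5, 0.0], [0.0, 0.5]]
lgd1, lgd2 = fgw_lgd(D1, D2, [5.0, 9.0], [6.0, 9.0], T, alpha=0.5)
```

Under this reading, each of the two output lists sums to the total objective value for T.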

GWProt.GW_protein_pI.GW_protein_pI.run_FGW_seq_aln(prot1: GW_protein_pI, prot2: GW_protein_pI, alpha: float, allow_mismatch: bool = True, BLOSUM='62', gap_open=0, gap_extend=0) float

This calculates the fused Gromov-Wasserstein distance between two proteins, restricted to their aligned residues. It first performs a sequence alignment, then downsamples each protein to its aligned residues, then applies FGW.

Parameters
  • prot1 – The first protein

  • prot2 – The second protein

  • alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight, alpha = 1 is equivalent to regular GW.

Returns

The FGW distance
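The step from an alignment to the residues that are kept can be sketched as below, assuming the alignment is available as two gapped strings of equal length with "-" for gaps. The function aligned_indices is hypothetical, and the reading of allow_mismatch (keep aligned pairs even when the residues differ) is an assumption, since the parameter is not described above.

```python
def aligned_indices(aln1, aln2, allow_mismatch=True):
    """Indices of residues in each protein that are aligned to each other."""
    idx1, idx2 = [], []
    i = j = 0                       # positions in the ungapped sequences
    for a, b in zip(aln1, aln2):
        if a != "-" and b != "-" and (allow_mismatch or a == b):
            idx1.append(i)
            idx2.append(j)
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return idx1, idx2

keep1, keep2 = aligned_indices("AC-DE", "ACGD-")
strict1, strict2 = aligned_indices("AC", "AG", allow_mismatch=False)
```

The resulting index lists are the kind of input that downsample_by_indices takes, after which FGW is run on the two reduced proteins.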

The local geometric distortion (LGD) quantifies the contribution of each residue to the FGW distance, providing a residue-level measure of structural conservation or flexibility based on isoelectric point differences.

[1] We note that these are rather naive estimates of the isoelectric points. More sophisticated ones can be computed by other software packages using the 3-dimensional structure of a protein. These could then be used with GW_protein.GW_protein.run_FGW_data_lists.