GW_protein_pI
- class GWProt.GW_protein_pI.GW_protein_pI(name: str, seq: str, pI_list, coords=None, ipdm=None, scaled_flag=False, distribution=None)
This subclass of GW_protein contains everything needed to run fused GW on proteins using isoelectric points, as well as versions with downsampling, distortion scaling and sequence alignment.
- Parameters
name – Simply for ease of use
coords – The coordinates of the alpha-Carbons of the protein, ordered sequentially
seq – A string giving the sequence of the protein
ipdm – The intra-protein distance matrix of a protein. The (i,j)th entry is the (possibly scaled) distance between residues i and j. This is mutable and can change if distortion scaling is used.
scaled_flag – Records whether the ipdm is the exact distance between residues or if it has been scaled.
distribution – Numpy array of the weighting of the residues, must sum to 1. Default is a uniform distribution.
- Variables
pI_list – Estimated isoelectric point of each residue based on Solomon values.
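As a rough illustration of the ipdm and distribution parameters described above, the following sketch builds a pairwise distance matrix from alpha-carbon coordinates and a uniform weighting. The helper names and the sample coordinates are hypothetical and are not part of GWProt; the library may compute these quantities differently.

```python
import math

def intra_protein_distance_matrix(coords):
    """Return the matrix of pairwise Euclidean distances between residues."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def uniform_distribution(n):
    """Return n equal weights that sum to 1."""
    return [1.0 / n] * n

# Three hypothetical alpha-carbon positions on a line (made-up data)
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
ipdm = intra_protein_distance_matrix(coords)
weights = uniform_distribution(len(coords))
```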
Downsampling
Unlike GW_protein, this stores the isoelectric point values with each protein [1]. Most of the methods are the same as those of GW_protein. The key difference is that when downsampling, we can combine the isoelectric points of adjacent residues which are grouped together, giving an estimated isoelectric point for each segment. Thus, unlike the FGW methods in GW_protein, we can run FGW on downsampled proteins in a meaningful way. The combination uses an algorithm based on the one in the Sequence Manipulation Suite, which relies on the Henderson-Hasselbalch equation.
For uniform downsampling, this is done automatically in the downsample_n method.
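To make the idea of a Henderson-Hasselbalch-based estimate concrete, here is a minimal sketch that finds the pH at which the net charge of a segment is zero, using bisection. The pK values, the function names, and the choice to treat each segment as a free peptide with its own termini are all assumptions made for illustration; the table and algorithm used by the library may differ.

```python
# Illustrative pK values only (assumption); the library's table may differ.
PK_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PK_N_TERM, PK_C_TERM = 8.6, 3.6

def net_charge(segment, pH):
    """Net charge of a segment at a given pH via Henderson-Hasselbalch."""
    positive = 1.0 / (1.0 + 10 ** (pH - PK_N_TERM))   # N-terminus
    negative = 1.0 / (1.0 + 10 ** (PK_C_TERM - pH))   # C-terminus
    for residue in segment:
        if residue in PK_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (pH - PK_POSITIVE[residue]))
        elif residue in PK_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PK_NEGATIVE[residue] - pH))
    return positive - negative

def estimate_segment_pI(segment, tolerance=1e-4):
    """Bisect on pH in [0, 14]; the net charge decreases as pH increases."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if net_charge(segment, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2
```

With this sketch, a lysine-rich segment gets a basic estimate and an aspartate-rich segment an acidic one, which is the qualitative behaviour one would expect from combining the residues of a segment.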
- GWProt.GW_protein_pI.GW_protein_pI.downsample_n(self, n: int = inf, pI_combination: bool = True, left_sample: bool = False, mean_sample: bool = False) GW_protein_pI
This method makes a new GW_protein object created by downsampling from self. This is done by dividing self into n evenly sized segments, then creating a GW_protein object whose residues are formed by those segments. Depending on the parameters this can be done with regular downsampling (simply picking one residue from each segment and copying its data) or by combining the coordinate data and/or isoelectric values of the residues in a segment.
- Parameters
n – The maximum number of residues in the output protein. If this is larger than the size of self, then there is no downsampling.
pI_combination – Whether to combine the isoelectric points of nearby residues when downsampling. If False then the values in pI_list of the returned GW_protein are a subset of those of self.pI_list. If True then pI_alg is used to estimate the isoelectric point of nearby residues.
left_sample – Whether to use the left-most (lowest index) or median residue from each segment. left_sample == True uses the left-most, left_sample == False uses the median.
mean_sample – Whether to average the coordinates of the residues in a segment. mean_sample == False uses the coordinates of the residue determined by left_sample, mean_sample == True uses the average of the coordinates in a segment.
- Returns
A new GW_protein object created by downsampling from self.
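The segment-based downsampling described for downsample_n can be sketched as follows. This is an illustration of the idea only: the helper names are hypothetical, and the exact way the library places segment boundaries or picks the median residue is an assumption.

```python
def split_into_segments(length, n):
    """Split indices 0..length-1 into at most n contiguous, near-even segments."""
    n = min(n, length)  # asking for more segments than residues means no downsampling
    return [list(range(i * length // n, (i + 1) * length // n)) for i in range(n)]

def downsample_coords(coords, n, left_sample=False, mean_sample=False):
    """Return one coordinate per segment (assumed behaviour, not the library's code)."""
    result = []
    for segment in split_into_segments(len(coords), n):
        if mean_sample:
            # Average the coordinates of every residue in the segment
            points = [coords[i] for i in segment]
            result.append(tuple(sum(axis) / len(points) for axis in zip(*points)))
        else:
            # Copy the left-most or the median residue of the segment
            pick = segment[0] if left_sample else segment[len(segment) // 2]
            result.append(coords[pick])
    return result
```

For ten residues and n = 3, this produces segments of sizes 3, 3 and 4, and each output residue comes either from a single representative or from the segment average.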
For downsampling to specified indices, we first need to 'smooth out' the isoelectric values, so that when we downsample, we account for the isoelectric points of nearby discarded residues.
- GWProt.GW_protein_pI.GW_protein_pI.convolve_pIs(self, kernel_list: list[int] = [1, 2, 3, 2, 1], origin: int = 2, inplace: bool = False) list[float]
This method applies a convolution process to the GW_protein_pI object which smooths out the isoelectric points associated to each residue by combining them with those of nearby residues. The intended use is that this could be applied before downsampling so that the isoelectric points of discarded residues are still accounted for. That is done automatically with downsample_n(pI_combination = True), so this is most useful when applied before run_FGW_seq_aln(), as that method discards unaligned residues.
The convolution works as follows: for each residue we make a virtual oligopeptide of copies of that residue and its neighbors, then use the Henderson–Hasselbalch-based algorithm to estimate the oligopeptide's isoelectric point. The number of copies of each neighbor is the corresponding entry of kernel_list, where the current residue is at position origin. The isoelectric contributions of the protein's N- and C-termini are accounted for similarly.
We recommend that kernel_list is symmetric about index origin and unimodal, for instance [1,2,3,2,1] with origin = 2.
- Parameters
kernel_list – The list of how many copies of nearby residues we use when smoothing the isoelectric points
origin – The index in the kernel_list of the current residue
inplace – Whether this modifies self.pI_list or returns a new list
- Returns
For inplace == False, a new GW_protein_pI object with the smoothed isoelectric point values. For inplace == True, nothing is returned.
Then we can downsample:
- GWProt.GW_protein_pI.GW_protein_pI.downsample_by_indices(self, indices: list[int]) GW_protein
This creates a new GW_protein object consisting of the residues of self in the input indices.
- Parameters
indices – The indices to keep.
- Returns
A new GW_protein object
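Conceptually, keeping a set of indices means restricting the sequence, the isoelectric values and the distance matrix to those residues. The sketch below shows that restriction on plain Python lists; the function name is hypothetical, and it does not address how the library handles the residue weighting afterwards.

```python
def select_by_indices(seq, pI_list, ipdm, indices):
    """Restrict sequence, pI values and distance matrix to the given indices."""
    sub_seq = "".join(seq[i] for i in indices)
    sub_pI = [pI_list[i] for i in indices]
    # Keep only the rows and columns of the kept residues
    sub_ipdm = [[ipdm[i][j] for j in indices] for i in indices]
    return sub_seq, sub_pI, sub_ipdm
```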
Computing FGW and LGD
As GW_protein_pI objects only use isoelectric points, the FGW methods are streamlined:
- GWProt.GW_protein_pI.GW_protein_pI.run_FGW(prot1: GW_protein_pI, prot2: GW_protein_pI, alpha: float = 0.5, correspondence: bool = False) Union[float, tuple[float, numpy.array]]
This calculates the fused Gromov-Wasserstein distance between two proteins. The computation is done with the Python ot library.
- Parameters
prot1 – The first protein
prot2 – The second protein
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight, alpha = 1 is equivalent to regular GW.
correspondence – Whether to return the computed correspondence
- Returns
Returns the FGW distance, and also the optimal correspondence if correspondence is True.
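To illustrate the role of alpha, the sketch below evaluates a fused Gromov-Wasserstein objective for a fixed correspondence T, with the geometric term weighted by alpha and the feature term by 1 - alpha, as described above. It does not perform the optimisation over T that the ot library carries out. The function name, the squared-difference geometric loss and the absolute difference of isoelectric points as feature cost are assumptions for illustration.

```python
def fgw_cost(C1, C2, pI1, pI2, T, alpha=0.5):
    """Evaluate the FGW objective for a given coupling T (no optimisation)."""
    n, m = len(C1), len(C2)
    # Feature term: assumed cost |pI_i - pI_j| weighted by the coupling
    feature = sum(abs(pI1[i] - pI2[j]) * T[i][j]
                  for i in range(n) for j in range(m))
    # Geometric term: how much T distorts pairwise distances
    geometric = sum((C1[i][k] - C2[j][l]) ** 2 * T[i][j] * T[k][l]
                    for i in range(n) for j in range(m)
                    for k in range(n) for l in range(m))
    return (1 - alpha) * feature + alpha * geometric

# Two made-up two-residue proteins with different spacings
C1 = [[0.0, 5.0], [5.0, 0.0]]
C2 = [[0.0, 3.0], [3.0, 0.0]]
identity = [[0.5, 0.0], [0.0, 0.5]]
swap = [[0.0, 0.5], [0.5, 0.0]]
```

Here the identity coupling pays only the geometric cost, while the swapped coupling of a protein with itself pays only the feature cost, so setting alpha = 1 removes the isoelectric term entirely.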
- GWProt.GW_protein_pI.GW_protein_pI.FGW_lgd(prot1: GW_protein_pI, prot2: GW_protein_pI, alpha: float, T: array)
This calculates the local geometric distortion (LGD), i.e. the contribution of each residue to the sum in the FGW cost, using the correspondence T. This is output as two np.arrays, one for prot1, the second for prot2.
- Parameters
prot1 – The first GW_protein
prot2 – The second GW_protein
diff_mat – The difference matrix in the feature space
T – The correspondence to be used
alpha – The trade-off constant between the fused cost and the geometric cost
- Returns
lgd1, lgd2; the LGD values for the two proteins
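One way to read "the contribution of each residue to the sum in the FGW cost" is to split the objective by residue, as in the sketch below. This is an interpretation under stated assumptions (squared-difference geometric loss, no extra normalisation), and the function name is hypothetical; the library's exact definition may differ.

```python
def fgw_lgd(C1, C2, diff_mat, T, alpha):
    """Split the FGW objective for coupling T into per-residue contributions."""
    n, m = len(C1), len(C2)
    lgd1, lgd2 = [0.0] * n, [0.0] * m
    for i in range(n):
        for j in range(m):
            # Distortion seen by the pair (i, j) against every other coupled pair
            geometric = sum((C1[i][k] - C2[j][l]) ** 2 * T[k][l]
                            for k in range(n) for l in range(m))
            term = T[i][j] * ((1 - alpha) * diff_mat[i][j] + alpha * geometric)
            lgd1[i] += term  # charged to residue i of the first protein
            lgd2[j] += term  # charged to residue j of the second protein
    return lgd1, lgd2
```

Under this reading, the entries of each output sum to the total FGW cost for T, so a residue with a large value accounts for a large share of the distance.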
- GWProt.GW_protein_pI.GW_protein_pI.run_FGW_seq_aln(prot1: GW_protein_pI, prot2: GW_protein_pI, alpha: float, allow_mismatch: bool = True, BLOSUM='62', gap_open=0, gap_extend=0) float
This calculates the fused Gromov-Wasserstein distance between two proteins when applied just to aligned residues. It first applies sequence alignment, downsamples up to the aligned residues, then applies FGW.
- Parameters
prot1 – The first protein
prot2 – The second protein
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight, alpha = 1 is equivalent to regular GW.
- Returns
The FGW distance
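The step of downsampling to the aligned residues can be pictured as extracting, from a gapped pairwise alignment, the indices of residues that are matched to each other. The sketch below assumes the alignment is already given as two equal-length strings with '-' for gaps; the function name and the interpretation of allow_mismatch (keeping aligned pairs whose residues differ) are assumptions, not a description of the library's internals.

```python
def aligned_indices(aligned1, aligned2, allow_mismatch=True):
    """Return the residue indices in each protein that are aligned to each other."""
    idx1, idx2 = [], []
    i = j = 0  # positions in the ungapped sequences
    for a, b in zip(aligned1, aligned2):
        if a != "-" and b != "-" and (allow_mismatch or a == b):
            idx1.append(i)
            idx2.append(j)
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return idx1, idx2
```

The two index lists have equal length and could then be passed to an index-based downsampling step for each protein before computing FGW.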
The local geometric distortion (LGD) quantifies the contribution of each residue to the FGW distance, providing a residue-level measure of structural conservation or flexibility based on isoelectric point differences.
- [1] We note that these are rather naive estimates of the isoelectric points. More sophisticated ones can be computed by other software packages using the 3-dimensional structure of a protein. This could then be used with GW_protein.GW_protein.run_FGW_data_lists.