GW_protein

This class has the core functionalities of GWProt. A GW_protein object contains all the data used to compute the GW distance.

class GWProt.GW_protein.GW_protein(name: str, seq: str, coords=None, ipdm=None, scaled_flag: bool = False, distribution=None)

This class contains everything needed to run GW and FGW on proteins, as well as versions with distortion scaling and sequence alignment.

Parameters

name – A string for ease of use
coords – The coordinates of the CA atoms of the protein, ordered sequentially
seq – A string giving the sequence of the protein
ipdm – The intra-protein distance matrix of a protein. The (i,j)th entry is the (possibly scaled) distance between residues i and j. This is mutable and can change if distortion scaling is used.
scaled_flag – Records whether the ipdm is the exact distance between residues or if it has been scaled.
distribution – np.array of the weighting of the residues, must sum to 1. Default is a uniform distribution.
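
To make the relationship between coords, ipdm and distribution concrete, here is a minimal NumPy sketch. It is not GWProt's own code: the coordinates are made up, and it assumes the unscaled ipdm is the plain Euclidean distance between CA atoms.

```python
import numpy as np

# Hypothetical CA coordinates for a 4-residue chain, one row per residue.
coords = np.array([
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [3.8, 3.8, 0.0],
    [0.0, 3.8, 0.0],
])

# Pairwise Euclidean distances: entry (i, j) is the distance between residues i and j.
diffs = coords[:, None, :] - coords[None, :, :]
ipdm = np.sqrt((diffs ** 2).sum(axis=-1))

# Uniform weighting over the residues; the weights sum to 1.
distribution = np.full(len(coords), 1.0 / len(coords))
```

The resulting matrix is symmetric with a zero diagonal, and its side length equals the number of residues.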

Basic Methods

We have basic ways to create and compare GW_protein objects.

The usual way to make a GW_protein object is by loading it from a .pdb (Protein Data Bank) file.

GWProt.GW_protein.GW_protein.make_protein_from_pdb(pdb_file: str, chain_id: Optional[str] = None, name=None) → GW_protein

Creates a GW_protein object with the coordinate and sequence data from the pdb_file. This gives a uniform distribution.

Parameters

pdb_file – Filepath to the pdb file
chain_id – Which chain(s) to use, None uses all chains

Returns

A new GW_protein object

Residues that are missing from the file, or that lack an alpha-carbon, are skipped. Note that all indices within a GW_protein object refer to the residues actually loaded, so they may not agree with the residue indices in the pdb file.

GWProt.GW_protein.GW_protein.validate(self) → bool

Checks if a GW_protein object passes basic consistency tests.

Returns: True if it passes, raises an assertion error otherwise.

GWProt.GW_protein.GW_protein.__eq__(self, other): Compares the seq, the ipdm, distribution, and the coords if both are defined. This does NOT compare the name or scaled_flag.

GWProt.GW_protein.GW_protein.__len__(self)

Returns: the number of amino acids in the protein

Intra-Protein Distance Matrix Scaling

Next we have methods to manipulate the intra-protein distance matrix for distortion scaling.

GWProt.GW_protein.GW_protein.scale_ipdm(self, scaler: ~typing.Callable[[float], float] = <built-in function sqrt>, inplace: bool = False)

This method scales all entries of the intra-protein distance matrix.

Parameters

scaler – A function with which to scale the intraprotein distance matrix. It must send 0 to 0, be strictly monotonic increasing, and concave down. Default is the square root function.
inplace – Whether to modify self.ipdm or output a new GW_protein object.

Returns

The scaled ipdm if inplace == False, and None if inplace == True.
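
The three conditions on scaler can be spot-checked numerically before scaling. The sketch below is an illustration only: check_scaler is a hypothetical helper, not part of GWProt, and a check on a finite grid cannot prove the conditions hold everywhere.

```python
import numpy as np

# Toy intra-protein distance matrix (made-up values).
ipdm = np.array([
    [0.0, 4.0, 9.0],
    [4.0, 0.0, 16.0],
    [9.0, 16.0, 0.0],
])

def check_scaler(scaler, grid):
    """Spot-check the scaler conditions on a uniform, increasing grid of distances."""
    values = np.array([scaler(float(x)) for x in grid])
    sends_zero_to_zero = bool(scaler(0.0) == 0.0)
    increasing = bool(np.all(np.diff(values) > 0))
    # On a uniform grid, concave down means non-positive second differences.
    concave = bool(np.all(np.diff(values, n=2) <= 1e-12))
    return sends_zero_to_zero and increasing and concave

grid = np.linspace(0.0, 50.0, 201)
sqrt_ok = check_scaler(np.sqrt, grid)       # the default scaler passes
square_ok = check_scaler(np.square, grid)   # squaring is convex, so it fails

# Entrywise scaling with the default square root.
scaled = np.sqrt(ipdm)
```

A concave scaler such as the square root shrinks large distances proportionally more than small ones: here 16 becomes 4, while 4 becomes 2.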

GWProt.GW_protein.GW_protein.reset_ipdm(self) → None: This method recalculates the ipdm inplace based on the coordinates. Raises an error if self.coords is None.

Downsampling

Then we have two methods for downsampling. Downsampling reduces the number of residues used, which speeds up computations but can reduce accuracy.

GWProt.GW_protein.GW_protein.downsample_by_indices(self, indices: list[int]) → GW_protein

This creates a new GW_protein object consisting of the residues of self in the input indices.

Parameters: indices – The indices to keep.
Returns: A new GW_protein object
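
As a rough picture of what keeping a subset of residues means for the stored data, the following NumPy sketch restricts a toy sequence and distance matrix to chosen indices. The data is invented and this is not GWProt's implementation; how the distribution is handled is left out because it is not specified here.

```python
import numpy as np

seq = "MKTAY"
# Toy distance matrix where the distance between residues i and j is |i - j|.
positions = np.arange(5.0)
ipdm = np.abs(positions[:, None] - positions[None, :])

indices = [0, 2, 4]

# Keep only the chosen residues in the sequence...
sub_seq = "".join(seq[i] for i in indices)
# ...and the matching rows and columns of the distance matrix.
sub_ipdm = ipdm[np.ix_(indices, indices)]
```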

GWProt.GW_protein.GW_protein.downsample_n(self, n: int = inf, left_sample: bool = False, mean_sample: bool = False) → GW_protein

This method makes a new GW_protein object by downsampling self. It divides self into n evenly sized segments, then creates a GW_protein object whose residues are formed from those segments.

Parameters

n – The maximum number of residues in the output protein. If this is larger than len(self), then there is no downsampling.
left_sample – Whether to use the left-most (lowest index) or the median residue from each segment. left_sample == True uses the left-most; left_sample == False uses the median.
mean_sample – Whether to average the coordinates of the residues in a segment. mean_sample == False uses the coordinates of the residue determined by left_sample; mean_sample == True uses the average of the coordinates in the segment.

Returns

A new GW_protein object created by downsampling from self.
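
The segment-based sampling can be sketched as follows. This is an assumption-laden illustration: segment_representatives and segment_means are hypothetical helpers, and np.array_split is only one plausible way to form near-equal contiguous segments, so the exact indices chosen by GWProt may differ.

```python
import numpy as np

def segment_representatives(length, n, left_sample=False):
    """Pick one residue index per segment: left-most or (approximate) median."""
    # At most n contiguous segments; no downsampling when n >= length.
    segments = np.array_split(np.arange(length), min(n, length))
    if left_sample:
        return [int(seg[0]) for seg in segments]
    return [int(seg[len(seg) // 2]) for seg in segments]

def segment_means(coords, n):
    """Average the coordinates within each segment (the mean_sample idea)."""
    segments = np.array_split(np.arange(len(coords)), min(n, len(coords)))
    return np.array([coords[seg].mean(axis=0) for seg in segments])

left = segment_representatives(10, 3, left_sample=True)
median = segment_representatives(10, 3, left_sample=False)
toy_coords = np.arange(30.0).reshape(10, 3)
means = segment_means(toy_coords, 3)
```

With 10 residues and n = 3, this split gives segments of sizes 4, 3 and 3, so the output has 3 residues regardless of which sampling rule is used.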

Computing GW

The methods for computing the Gromov-Wasserstein distance use the CAJAL library, also created by the CámaraLab, for efficient computation.

GWProt.GW_protein.GW_protein.run_GW(prot1: GW_protein, prot2: GW_protein, correspondence: bool = False) → Union[float, tuple[float, numpy.array]]

Computes the GW distance between prot1 and prot2, and also returns the correspondence if correspondence is True.

Parameters

prot1 – The first protein
prot2 – The second protein
correspondence – Whether to return the computed correspondence

Returns

The GW distance if correspondence is False; a tuple of the GW distance and the optimal correspondence if correspondence is True.

This is a wrapper for the following two functions:

GWProt.GW_protein.GW_protein.make_cajal_cell(self) → GW_cell

This method makes a cajal.gw_cython.GW_cell object from the CAJAL library.

Returns: A cajal.gw_cython.GW_cell object representing self.

GWProt.GW_protein.GW_protein.run_GW_from_cajal(cajal_cell1: GW_cell, cajal_cell2: GW_cell, correspondence: bool = False) → Union[float, tuple[float, numpy.array]]

This is a wrapper for the CAJAL code to compute the GW distance between cajal_cell1 and cajal_cell2. It also outputs the computed correspondence if correspondence is True.

Parameters

cajal_cell1 – The first GW_cell
cajal_cell2 – The second GW_cell
correspondence – Whether to return the computed correspondence

Returns

The GW distance if correspondence is False; a tuple of the GW distance and the optimal correspondence if correspondence is True.
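
A returned correspondence can be read as a matrix with one row per residue of the first protein and one column per residue of the second. The sketch below uses a hand-made matrix rather than real output, and assumes the correspondence is a transport plan whose row and column sums equal the two (here uniform) distributions.

```python
import numpy as np

# Hypothetical correspondence between a 3-residue and a 4-residue protein.
T = np.array([
    [6, 2, 0, 0],
    [0, 4, 3, 1],
    [0, 0, 3, 5],
]) / 24.0

# For a transport plan, the marginals recover the residue weightings.
row_marginal = T.sum(axis=1)   # weights of the first protein's residues
col_marginal = T.sum(axis=0)   # weights of the second protein's residues

# For each residue of the first protein, the residue it is most strongly matched to.
best_match = T.argmax(axis=1)

# Row-normalised view: how each residue's mass is split across the other protein.
row_conditional = T / row_marginal[:, None]
```

Taking the argmax of each row discards the soft structure of the plan, so the row-normalised view is often more informative when a residue is matched to several partners.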

GWProt.GW_protein.GW_protein.run_GW_seq_aln(prot1: GW_protein, prot2: GW_protein, allow_mismatch: bool = True, BLOSUM='62', gap_extend=0, gap_open=0) → float

This calculates the Gromov-Wasserstein distance between two proteins when applied just to aligned residues. It first applies sequence alignment, downsamples to the aligned residues, then applies GW. ssearch36 must be in the PATH to use this method.

Parameters

prot1 – The first protein
prot2 – The second protein

Returns

The GW distance

Because this uses CAJAL, other CAJAL functionality is also available.

Computing FGW and UGW

FGW computations are done with the POT package. Multiple ways of inputting the feature space data are included. The first is the most general, as it can use any user-inputted feature difference matrix. However, a new difference matrix must be supplied for every pair of proteins.

GWProt.GW_protein.GW_protein.run_FGW_diff_mat(prot1: GW_protein, prot2: GW_protein, diff_mat: array, alpha: float = 1, correspondence: bool = False) → Union[float, tuple[float, numpy.array]]

This calculates the fused Gromov-Wasserstein distance between two proteins.

Parameters

prot1 – The first protein
prot2 – The second protein
diff_mat – A user-inputted matrix of the differences in the feature space between the residues of the two proteins. Of shape (len(prot1),len(prot2)). diff_mat[i,j] is the difference in features between the ith residue of prot1 and the jth residue of prot2.
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight, alpha = 1 is equivalent to regular GW.
correspondence – Whether to return the correspondence

Returns

The FGW distance if correspondence is False; a tuple of the FGW distance and the optimal correspondence if correspondence is True.
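
The role of alpha can be illustrated by evaluating a fused objective for a fixed correspondence. This sketch assumes the common squared-loss form, (1 - alpha) times the feature term plus alpha times the geometric term; fgw_objective is a hypothetical helper, and the value GWProt returns may be normalised differently.

```python
import numpy as np

def fgw_objective(A, B, diff_mat, T, alpha):
    """Blend of feature cost and geometric distortion for a fixed plan T."""
    # gap_sq[i, j, k, l] = (A[i, k] - B[j, l]) ** 2
    gap_sq = (A[:, None, :, None] - B[None, :, None, :]) ** 2
    geometric = np.einsum("ijkl,ij,kl->", gap_sq, T, T)
    feature = np.sum(diff_mat * T)
    return (1.0 - alpha) * feature + alpha * geometric

# Toy 2-residue proteins (made-up distance matrices).
A = np.array([[0.0, 1.0], [1.0, 0.0]])
B = np.array([[0.0, 2.0], [2.0, 0.0]])
# diff_mat has shape (len(prot1), len(prot2)).
diff_mat = np.array([[1.0, 0.0], [0.0, 3.0]])
# Plan matching residue i of the first protein to residue i of the second.
T = np.diag([0.5, 0.5])

pure_geometric = fgw_objective(A, B, diff_mat, T, alpha=1.0)
pure_feature = fgw_objective(A, B, diff_mat, T, alpha=0.0)
balanced = fgw_objective(A, B, diff_mat, T, alpha=0.5)
```

At alpha = 1 the feature matrix has no influence, which matches the statement that alpha = 1 is equivalent to regular GW.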

The second uses a linear feature space. This is suitable for scalar features including isoelectric point, solvent-accessible surface area, charge, and hydrophobicity.

GWProt.GW_protein.GW_protein.run_FGW_data_lists(prot1: GW_protein, prot2: GW_protein, data1: list[float], data2: list[float], alpha: float = 1, correspondence=False) → Union[float, tuple[float, numpy.array]]

This calculates the fused Gromov-Wasserstein distance between two proteins. It takes in a list of floats for each protein, representing the value in the feature space for each residue. The ijth entry in the associated distance matrix is abs(data1[i] - data2[j]).

Parameters

prot1 – The first protein
prot2 – The second protein
data1 – The data used in the first protein
data2 – The data used in the second protein
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight, alpha=1 is equivalent to regular GW.
correspondence – Whether to return the computed correspondence

Returns

The FGW distance if correspondence is False; a tuple of the FGW distance and the correspondence if correspondence is True.
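
The difference matrix described above, with entries abs(data1[i] - data2[j]), can be built with NumPy broadcasting. The feature values here are invented for illustration.

```python
import numpy as np

# Hypothetical scalar feature per residue (for example a hydrophobicity score).
data1 = [0.5, -1.0, 2.0]   # one value per residue of the first protein
data2 = [1.5, 0.0]         # one value per residue of the second protein

# Entry (i, j) is abs(data1[i] - data2[j]); shape is (len(data1), len(data2)).
diff_mat = np.abs(np.asarray(data1)[:, None] - np.asarray(data2)[None, :])
```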

The third uses a dictionary giving difference values between different types of amino acids.

GWProt.GW_protein.GW_protein.run_FGW_dict(prot1: GW_protein, prot2: GW_protein, d: dict[str, dict[str, float]], alpha: float = 1, correspondence: bool = False) → Union[float, tuple[float, numpy.array]]

This calculates the fused Gromov-Wasserstein distance between two proteins.

Parameters

prot1 – The first protein
prot2 – The second protein
d – The dictionary used for the fused distances based on the protein sequences. Of the form d['A']['B'] == float
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight, alpha = 1 is equivalent to regular GW.
correspondence – Whether to return the correspondence

Returns

The FGW distance if correspondence is False; a tuple of the FGW distance and the correspondence if correspondence is True.
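
A dictionary of the form d['A']['B'] == float can be expanded into a per-residue difference matrix by looking up each pair of amino acids. The dictionary values and sequences below are made up; a real dictionary would need an entry for every pair of residue types that occurs.

```python
import numpy as np

# Hypothetical difference values between three amino acid types.
d = {
    "A": {"A": 0.0, "G": 1.0, "K": 3.0},
    "G": {"A": 1.0, "G": 0.0, "K": 2.5},
    "K": {"A": 3.0, "G": 2.5, "K": 0.0},
}

seq1 = "AGK"   # sequence of the first protein
seq2 = "KA"    # sequence of the second protein

# Entry (i, j) is the difference between residue i of seq1 and residue j of seq2.
diff_mat = np.array([[d[a][b] for b in seq2] for a in seq1])
```

Unlike a hand-built difference matrix, the same dictionary can be reused for any pair of proteins.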

UGW computations are done with the POT package. Runtime and computed alignments are very sensitive to choice of rho and epsilon and suitable values can vary depending on the proteins.

GWProt.GW_protein.GW_protein.run_UGW(prot1: GW_protein, prot2: GW_protein, rho: float = 3, epsilon: float = 3, correspondence: bool = False) → Union[float, tuple[float, numpy.array]]

Computes the unbalanced GW distance between prot1 and prot2, and also returns the correspondence if correspondence is True.

Parameters

prot1 – The first protein
prot2 – The second protein
rho – Marginal relaxation term - trade off between geometric distortion and Kullback-Leibler divergence on marginals, higher rho means less mass is lost
epsilon – Regularization parameters for entropic approximation
correspondence – Whether to return the computed correspondence

Returns

The unbalanced GW distance if correspondence is False; a tuple of the unbalanced GW distance and the optimal correspondence if correspondence is True.
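
One way to see how much mass an unbalanced correspondence has dropped is to compare its marginals with the original distribution. The sketch below uses a hand-made correspondence and a generalised Kullback-Leibler divergence; generalized_kl is a hypothetical helper, and the exact penalty used inside POT may take a different form.

```python
import numpy as np

def generalized_kl(p, q):
    """KL(p | q) = sum(p * log(p / q)) - sum(p) + sum(q), with 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) - p.sum() + q.sum())

# Uniform distribution on a 2-residue protein.
mu = np.array([0.5, 0.5])

# Hypothetical unbalanced correspondence: it transports less than the full mass.
T = np.array([
    [0.4, 0.0],
    [0.0, 0.3],
])

row_marginal = T.sum(axis=1)
mass_kept = T.sum()                          # a balanced plan would keep mass 1
penalty = generalized_kl(row_marginal, mu)   # zero only when the marginal equals mu
```

Inspecting mass_kept and the per-residue marginals for a few values of rho is a simple check on whether a chosen setting discards more of the protein than intended.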

Similarly we have three versions of fused unbalanced GW.

GWProt.GW_protein.GW_protein.run_FUGW_diff_mat(prot1: GW_protein, prot2: GW_protein, diff_mat: array, alpha: float = 1, rho: float = 3, epsilon: float = 3, correspondence: bool = False) → Union[float, tuple[float, numpy.array]]

This calculates the fused unbalanced Gromov-Wasserstein distance between two proteins.

Parameters

prot1 – The first protein
prot2 – The second protein
diff_mat – A user-inputted matrix of the differences in the feature space between the residues of the two proteins. Of shape (len(prot1),len(prot2)). diff_mat[i,j] is the difference in features between the ith residue of prot1 and the jth residue of prot2.
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight.
rho – Marginal relaxation term - trade off between geometric distortion and Kullback-Leibler divergence on marginals, higher rho means less mass is lost
epsilon – Regularization parameters for entropic approximation
correspondence – Whether to return the computed correspondence

Returns

The FUGW distance if correspondence is False; a tuple of the FUGW distance and the optimal correspondence if correspondence is True.

GWProt.GW_protein.GW_protein.run_FUGW_data_lists(prot1: GW_protein, prot2: GW_protein, data1: list[float], data2: list[float], rho: float = 3, epsilon: float = 3, alpha: float = 1, correspondence=False) → Union[float, tuple[float, numpy.array]]

This calculates the fused unbalanced Gromov-Wasserstein distance between two proteins. It takes in a list of floats for each protein, representing the value in the feature space for each residue. The ijth entry in the associated distance matrix is abs(data1[i] - data2[j]).

Parameters

prot1 – The first protein
prot2 – The second protein
data1 – The data used in the first protein
data2 – The data used in the second protein
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight.
rho – Marginal relaxation term - trade off between geometric distortion and Kullback-Leibler divergence on marginals, higher rho means less mass is lost
epsilon – Regularization parameters for entropic approximation
correspondence – Whether to return the computed correspondence

Returns

The FUGW distance if correspondence is False; a tuple of the FUGW distance and the correspondence if correspondence is True.

GWProt.GW_protein.GW_protein.run_FUGW_dict(prot1: GW_protein, prot2: GW_protein, d: dict[str, dict[str, float]], alpha: float = 1, correspondence: bool = False) → Union[float, tuple[float, numpy.array]]

This calculates the fused unbalanced Gromov-Wasserstein distance between two proteins.

Parameters

prot1 – The first protein
prot2 – The second protein
d – The dictionary used for the fused distances based on the protein sequences. Of the form d['A']['B'] == float
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight
rho – Marginal relaxation term - trade off between geometric distortion and Kullback-Leibler divergence on marginals, higher rho means less mass is lost
epsilon – Regularization parameters for entropic approximation
correspondence – Whether to return the correspondence

Returns

The FUGW distance if correspondence is False; a tuple of the FUGW distance and the correspondence if correspondence is True.

It is not recommended to use fused or unbalanced methods on downsampled proteins, because the feature data of the excluded residues is lost.

Computing Local Geometric Distortion (LGD)

The local geometric distortion (LGD) quantifies the contribution of each residue to the GW or FGW distance, providing a residue-level measure of structural conservation or flexibility.

GWProt.GW_protein.GW_protein.GW_lgd(prot1: GW_protein, prot2: GW_protein, T: array) → tuple[numpy.array, numpy.array]

This calculates the local geometric distortion (LGD), i.e. the contribution of each residue to the sum in the GW cost, using the correspondence T. The result is output as two np.arrays, one for prot1 and one for prot2.

Parameters

prot1 – The first GW_protein
prot2 – The second GW_protein
T – The correspondence to be used

Returns

lgd1, lgd2; the LGD values for the two proteins

GWProt.GW_protein.GW_protein.FGW_lgd(prot1: GW_protein, prot2: GW_protein, T: array, diff_mat: array, alpha: float) → tuple[numpy.array, numpy.array]

This calculates the local geometric distortion (LGD), i.e. the contribution of each residue to the sum in the FGW cost, using the correspondence T. The result is output as two np.arrays, one for prot1 and one for prot2.

Parameters

prot1 – The first GW_protein
prot2 – The second GW_protein
diff_mat – The difference matrix in the feature space
T – The correspondence to be used
alpha – The trade-off constant between the fused cost and the geometric cost

Returns

lgd1, lgd2; the LGD values for the two proteins

Note

np.sum(lgd1) != c, where c is the GW cost; rather, math.sqrt(np.sum(lgd1))/2 == c,

and similarly for lgd2 and for FGW.
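
The relationship in this note can be checked on a toy example. The sketch below assumes the LGD of a residue is the sum of all squared-distortion terms in the GW sum that involve that residue as the first index; lgd_sketch is a hypothetical helper and not GWProt's implementation.

```python
import math
import numpy as np

def lgd_sketch(A, B, T):
    """Per-residue share of sum_{ijkl} (A[i,k] - B[j,l])**2 * T[i,j] * T[k,l]."""
    # gap_sq[i, j, k, l] = (A[i, k] - B[j, l]) ** 2
    gap_sq = (A[:, None, :, None] - B[None, :, None, :]) ** 2
    lgd1 = np.einsum("ijkl,ij,kl->i", gap_sq, T, T)   # one value per residue of prot1
    lgd2 = np.einsum("ijkl,ij,kl->j", gap_sq, T, T)   # one value per residue of prot2
    return lgd1, lgd2

# Toy 2-residue proteins and a correspondence matching residue i to residue i.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
B = np.array([[0.0, 2.0], [2.0, 0.0]])
T = np.diag([0.5, 0.5])

lgd1, lgd2 = lgd_sketch(A, B, T)

# Following the note: the cost is half the square root of the summed LGD values.
c = math.sqrt(np.sum(lgd1)) / 2
```

Both arrays sum to the same total, since each is a different grouping of the same terms.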

Miscellaneous Methods

GWProt.GW_protein.GW_protein.run_ssearch_indices(prot1: GW_protein, prot2: GW_protein, allow_mismatch: bool = True, BLOSUM='62', gap_open=0, gap_extend=0) → tuple[list[int], list[int]]

Runs a local sequence alignment and returns the indices of the two proteins which are aligned. ssearch36 must be in the PATH to use this method.

Parameters

prot1 – First protein
prot2 – Second protein
allow_mismatch – Whether to include residues which are aligned but not the same type of amino acid

Returns

Two lists of indices, those of prot1 and prot2 which are aligned

Explicitly, this runs the command

$ ssearch36 -s BP62 -p -T 1 -b 1 -f 0 -g 0 -z -1 -m 9C

for the Smith-Waterman algorithm in the Fasta Package.
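
To show what the two returned index lists represent, the sketch below extracts aligned indices from a pair of gapped alignment strings. It does not call ssearch36 or parse its output: aligned_indices is a hypothetical helper, and the alignment strings are invented.

```python
def aligned_indices(aln1, aln2, start1=0, start2=0, allow_mismatch=True):
    """Return the residue indices paired by a gapped alignment ('-' marks a gap)."""
    idx1, idx2 = [], []
    i, j = start1, start2   # current residue index in each ungapped sequence
    for a, b in zip(aln1, aln2):
        if a != "-" and b != "-":
            # Keep the pair if mismatches are allowed or the residues are identical.
            if allow_mismatch or a == b:
                idx1.append(i)
                idx2.append(j)
        # Advance only in sequences that contributed a residue to this column.
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return idx1, idx2

# Hypothetical local alignment of two short sequences.
aln1 = "MK-TAY"
aln2 = "MRQT-Y"

with_mismatch = aligned_indices(aln1, aln2, allow_mismatch=True)
exact_only = aligned_indices(aln1, aln2, allow_mismatch=False)
```

The two lists always have equal length, so they can be used to downsample both proteins to matched residues.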

GWProt.GW_protein.GW_protein.get_eccentricity(self, p: float = 2) → array

This calculates the eccentricity of residues in a protein with exponent p.

Parameters: p – The exponent, 0 < p <= np.inf
Returns: The eccentricities of each residue, as a np.array

Eccentricity is defined in [1], Definition 5.3. Intuitively, it quantifies how far each residue is from the rest of the residues in a protein. Within a given protein, residues with higher eccentricity often have higher LGD when aligned to other proteins, so this could be used for normalization.
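
As an illustration of the definition, the following sketch computes a p-eccentricity from a distance matrix and a weighting: the weighted p-th power mean of the distances from each residue, and the maximum distance for p = np.inf. eccentricity_sketch is a hypothetical helper; whether GWProt weights by the distribution in exactly this way is an assumption.

```python
import numpy as np

def eccentricity_sketch(ipdm, distribution, p=2.0):
    """Weighted p-th power mean of the distances from each residue to all others."""
    if np.isinf(p):
        # Assumes every residue has positive weight, so the supremum is the row maximum.
        return ipdm.max(axis=1)
    return ((ipdm ** p) @ distribution) ** (1.0 / p)

# Three residues on a line at positions 0, 1 and 2 (toy data), uniformly weighted.
ipdm = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
])
distribution = np.full(3, 1.0 / 3.0)

ecc_1 = eccentricity_sketch(ipdm, distribution, p=1.0)
ecc_2 = eccentricity_sketch(ipdm, distribution, p=2.0)
ecc_inf = eccentricity_sketch(ipdm, distribution, p=np.inf)
```

In this example the middle residue has the lowest eccentricity for every p, matching the intuition that central residues are closer to the rest of the protein.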

References

1: Mémoli, F. (2011). Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4), 417-487.