GW_protein
This class provides the core functionality of GWProt. A GW_protein object contains all the data used to compute the GW distance.
- class GWProt.GW_protein.GW_protein(name: str, seq: str, coords=None, ipdm=None, scaled_flag: bool = False, distribution=None)
This class contains everything needed to run GW and FGW on proteins, as well as variants that use distortion scaling and sequence alignment.
- Parameters
name – A string for ease of use
coords – The coordinates of the CA atoms of the protein, ordered sequentially
seq – A string giving the sequence of the protein
ipdm – The intra-protein distance matrix of a protein. The (i,j)th entry is the (possibly scaled) distance between residues i and j. This is mutable and can change if distortion scaling is used.
scaled_flag – Records whether the ipdm is the exact distance between residues or if it has been scaled.
distribution – np.array of the weighting of the residues, must sum to 1. Default is a uniform distribution.
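As a rough illustration of how these fields relate, the sketch below builds an intra-protein distance matrix and a uniform distribution from CA coordinates. It is not the GWProt implementation: the helper names make_ipdm and uniform_distribution and the coordinates are made up, and plain lists stand in for np.array to keep the example dependency-free.

```python
import math

def make_ipdm(coords):
    """Pairwise Euclidean distances between CA coordinates (hypothetical helper)."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def uniform_distribution(n):
    """Equal weight on each of the n residues; the weights sum to 1."""
    return [1.0 / n] * n

# Three made-up CA positions, ordered sequentially.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]

ipdm = make_ipdm(coords)                        # ipdm[i][j] = distance between residues i and j
distribution = uniform_distribution(len(coords))
```

The matrix is symmetric with zeros on the diagonal, and the weights sum to 1 as the distribution parameter requires.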
Basic Methods
We have basic ways to create and compare GW_protein objects.
The usual way to make a GW_protein object is by loading it from a .pdb (Protein Data Bank) file.
- GWProt.GW_protein.GW_protein.make_protein_from_pdb(pdb_file: str, chain_id: Optional[str] = None, name=None) GW_protein
Creates a GW_protein object with the coordinate and sequence data from the pdb_file. This gives a uniform distribution.
- Parameters
pdb_file – Filepath to the pdb file
chain_id – Which chain(s) to use, None uses all chains
- Returns
A new GW_protein object
Residues that are missing from the file, or that lack an alpha-carbon, are skipped. Note that all indices within a GW_protein object refer to the residues actually loaded, so they may not agree with the residue numbering in the pdb file.
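The index shift can be pictured with a small sketch. The records below are made up and the bookkeeping is only an assumption about how one might track the mapping; GWProt itself is not shown to expose such a table.

```python
# Hypothetical (PDB residue number, residue name, has CA atom) records for one chain.
# Residue 12 is absent from the file and residue 13 lacks its alpha-carbon.
records = [(10, "MET", True), (11, "LYS", True), (13, "GLY", False), (14, "ALA", True)]

# Keep only residues that have an alpha-carbon, in file order.
loaded = [(num, name) for num, name, has_ca in records if has_ca]

# Internal index -> PDB residue number, for translating results back to the file.
index_to_pdb = {i: num for i, (num, _name) in enumerate(loaded)}
```

Here internal index 2 corresponds to PDB residue 14, so per-residue outputs should be mapped back through such a table before being compared with the file.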
- GWProt.GW_protein.GW_protein.validate(self) bool
Checks if a GW_protein object passes basic consistency tests.
- Returns
True if it passes; raises an assertion error otherwise.
- GWProt.GW_protein.GW_protein.__eq__(self, other)
Compares the seq, the ipdm, the distribution, and the coords if both are defined. This does NOT compare the name or scaled_flag.
- GWProt.GW_protein.GW_protein.__len__(self)
- Returns
the number of amino acids in the protein
Intra-Protein Distance Matrix Scaling
Next we have methods to manipulate the intra-protein distance matrix for distortion scaling.
- GWProt.GW_protein.GW_protein.scale_ipdm(self, scaler: Callable[[float], float] = math.sqrt, inplace: bool = False)
This method scales all entries of the intra-protein distance matrix.
- Parameters
scaler – A function with which to scale the intraprotein distance matrix. It must send 0 to 0, be strictly monotonic increasing, and concave down. Default is the square root function.
inplace – Whether to modify self.ipdm or output a new GW_protein object.
- Returns
The scaled ipdm if inplace == False, and None if inplace == True.
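The effect of distortion scaling can be sketched on a small matrix. This is an illustration with a made-up helper scale_matrix, not the library method. The conditions on scaler matter because a concave, increasing function that sends 0 to 0 is subadditive, so the scaled matrix still satisfies the triangle inequality.

```python
import math

def scale_matrix(ipdm, scaler=math.sqrt):
    """Return a new matrix with scaler applied to every entry (hypothetical helper)."""
    return [[scaler(d) for d in row] for row in ipdm]

# Distances between three points on a line at positions 0, 9 and 25.
ipdm = [[0.0, 9.0, 25.0],
        [9.0, 0.0, 16.0],
        [25.0, 16.0, 0.0]]

scaled = scale_matrix(ipdm)  # square root compresses long distances more than short ones
```

After scaling, the largest distance is 5 rather than 25 while the smallest nonzero distance only drops from 9 to 3, so long-range differences weigh less in the GW cost.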
Downsampling
There are two methods for downsampling. Downsampling reduces the number of residues used, which speeds up computations but can reduce accuracy.
- GWProt.GW_protein.GW_protein.downsample_by_indices(self, indices: list[int]) GW_protein
This creates a new GW_protein object consisting of the residues of self in the input indices.
- Parameters
indices – The indices to keep.
- Returns
A new GW_protein object
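Conceptually, keeping a subset of residues amounts to selecting the matching rows and columns of the distance matrix and the matching letters of the sequence. The sketch below shows only that selection, with a made-up helper submatrix and made-up data; how GWProt treats the distribution of the kept residues is not stated here, so it is left out.

```python
def submatrix(ipdm, indices):
    """Rows and columns of ipdm restricted to the given indices (hypothetical helper)."""
    return [[ipdm[i][j] for j in indices] for i in indices]

# Made-up 4-residue protein.
seq = "MKGA"
ipdm = [[0.0, 3.8, 5.5, 9.0],
        [3.8, 0.0, 3.8, 6.1],
        [5.5, 3.8, 0.0, 3.8],
        [9.0, 6.1, 3.8, 0.0]]

keep = [0, 2, 3]                           # indices to keep
sub_seq = "".join(seq[i] for i in keep)    # sequence of the kept residues
sub_ipdm = submatrix(ipdm, keep)           # distances among the kept residues
```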
- GWProt.GW_protein.GW_protein.downsample_n(self, n: int = inf, left_sample: bool = False, mean_sample: bool = False) GW_protein
This method makes a new GW_protein object by downsampling from self. This is done by dividing self into n evenly sized segments, then creating a GW_protein object whose residues are formed by those segments.
- Parameters
n – The maximum number of residues in the output protein. If this is larger than len(self), then there is no downsampling.
left_sample – Whether to use the left-most (lowest index) or median residue from each segment. left_sample == True uses the left-most, left_sample == False uses the median.
mean_sample – Whether to average the coordinates of the residues in a segment. mean_sample == False uses the coordinates of the residue determined by left_sample, mean_sample == True uses the average of the coordinates in a segment.
- Returns
A new GW_protein object created by downsampling from self.
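The segment-based scheme can be sketched as follows. The exact partition and tie-breaking used by GWProt are not documented here, so the helpers segment_bounds, representative_indices and segment_means are assumptions that show one plausible reading of the three options.

```python
def segment_bounds(length, n):
    """Split range(length) into min(n, length) contiguous, near-equal segments."""
    n = min(n, length)
    return [(k * length // n, (k + 1) * length // n) for k in range(n)]

def representative_indices(length, n, left_sample=False):
    """One index per segment: the left-most, or the (lower) middle one."""
    picks = []
    for start, stop in segment_bounds(length, n):
        picks.append(start if left_sample else (start + stop - 1) // 2)
    return picks

def segment_means(coords, n):
    """Average the coordinates within each segment (the mean_sample option)."""
    means = []
    for start, stop in segment_bounds(len(coords), n):
        seg = coords[start:stop]
        means.append(tuple(sum(c[axis] for c in seg) / len(seg) for axis in range(3)))
    return means

left = representative_indices(10, 3, left_sample=True)   # left-most residue of each segment
middle = representative_indices(10, 3)                    # middle residue of each segment
unchanged = representative_indices(4, 20)                 # n > length: nothing is dropped
```

With 10 residues and n = 3 this partition gives segments of sizes 3, 3 and 4, so the left-most choice keeps residues 0, 3, 6 and the middle choice keeps residues 1, 4, 7.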
Computing GW
The methods for computing the Gromov-Wasserstein distance use the CAJAL library, also created by the CámaraLab, for efficient computation.
- GWProt.GW_protein.GW_protein.run_GW(prot1: GW_protein, prot2: GW_protein, correspondence: bool = False) Union[float, tuple[float, numpy.array]]
Computes the GW distance, and also the correspondence if correspondence is True.
- Parameters
prot1 – The first protein
prot2 – The second protein
correspondence – Whether to return the computed correspondence
- Returns
The GW distance, together with the optimal correspondence if correspondence is True.
This is a wrapper for the following two functions:
- GWProt.GW_protein.GW_protein.make_cajal_cell(self) GW_cell
This method makes a cajal.gw_cython.GW_cell object from the CAJAL library.
- Returns
A cajal.gw_cython.GW_cell object representing self.
- GWProt.GW_protein.GW_protein.run_GW_from_cajal(cajal_cell1: GW_cell, cajal_cell2: GW_cell, correspondence: bool = False) Union[float, tuple[float, numpy.array]]
This is a wrapper for the CAJAL code to compute the GW distance between cajal_cell1 and cajal_cell2. It also outputs the computed correspondence if correspondence is True.
- Parameters
cajal_cell1 – The GW_cell of the first protein
cajal_cell2 – The GW_cell of the second protein
correspondence – Whether to return the computed correspondence
- Returns
The GW distance, together with the optimal correspondence if correspondence is True.
- GWProt.GW_protein.GW_protein.run_GW_seq_aln(prot1: GW_protein, prot2: GW_protein, allow_mismatch: bool = True, BLOSUM='62', gap_extend=0, gap_open=0) float
This calculates the Gromov-Wasserstein distance between two proteins when applied just to aligned residues. It first applies sequence alignment, downsamples to the aligned residues, then applies GW. ssearch36 must be in the PATH to use this method.
- Parameters
prot1 – The first protein
prot2 – The second protein
- Returns
The GW distance
Because these methods use CAJAL, other functionality from CAJAL can also be used.
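The correspondence returned when correspondence is True can be read as a matrix of matched mass between residues. A minimal sketch, assuming (as is standard for a balanced GW coupling) that it has shape (len(prot1), len(prot2)) and that its row and column sums equal the two distributions; the matrix T is made up and plain lists stand in for np.array.

```python
# Made-up correspondence between a 2-residue and a 3-residue protein,
# both with uniform distributions (weights 1/2 and 1/3 respectively).
T = [[1 / 3, 1 / 6, 0.0],
     [0.0, 1 / 6, 1 / 3]]

row_marginal = [sum(row) for row in T]          # mass assigned to each residue of prot1
col_marginal = [sum(col) for col in zip(*T)]    # mass assigned to each residue of prot2

# A hard assignment: for each residue of prot1, the residue of prot2 receiving most mass.
best_match = [max(range(len(row)), key=row.__getitem__) for row in T]
```

Reading off the largest entry per row is only one way to summarize a correspondence; mass that is split across several residues, as in the middle column here, is discarded by that summary.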
Computing FGW and UGW
FGW computations are done with the POT package. Multiple ways of inputting the feature space data are included. The first is the most general, as it can use any user-inputted feature difference matrix. However, a new difference matrix must be supplied for every pair of proteins.
- GWProt.GW_protein.GW_protein.run_FGW_diff_mat(prot1: GW_protein, prot2: GW_protein, diff_mat: array, alpha: float = 1, correspondence: bool = False) Union[float, tuple[float, numpy.array]]
This calculates the fused Gromov-Wasserstein distance between two proteins.
- Parameters
prot1 – The first protein
prot2 – The second protein
diff_mat – A user-inputted matrix of the differences in the feature space between the residues of the two proteins. Of shape (len(prot1), len(prot2)). diff_mat[i,j] is the difference in features between the ith residue of prot1 and the jth residue of prot2.
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight, alpha = 1 is equivalent to regular GW.
correspondence – Whether to return the correspondence
- Returns
The FGW distance, together with the optimal correspondence if correspondence is True.
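The role of alpha can be illustrated by evaluating a fused objective for a fixed correspondence. This is only a sketch of the usual form of the FGW objective, with alpha weighting the geometric term as described above; the normalization that GWProt and POT apply to the reported distance may differ, and the function name fgw_objective and all data are made up.

```python
def fgw_objective(D1, D2, M, T, alpha):
    """alpha * geometric distortion + (1 - alpha) * feature cost, for a fixed T."""
    n, m = len(D1), len(D2)
    geometric = sum((D1[i][k] - D2[j][l]) ** 2 * T[i][j] * T[k][l]
                    for i in range(n) for j in range(m)
                    for k in range(n) for l in range(m))
    feature = sum(M[i][j] * T[i][j] for i in range(n) for j in range(m))
    return alpha * geometric + (1 - alpha) * feature

D1 = [[0.0, 1.0], [1.0, 0.0]]   # distance matrix of the first protein
D2 = [[0.0, 2.0], [2.0, 0.0]]   # distance matrix of the second protein
M = [[1.0, 0.0], [0.0, 3.0]]    # feature difference matrix (diff_mat)
T = [[0.5, 0.0], [0.0, 0.5]]    # fixed correspondence

only_geometry = fgw_objective(D1, D2, M, T, alpha=1.0)   # feature term ignored
only_features = fgw_objective(D1, D2, M, T, alpha=0.0)   # geometric term ignored
balanced = fgw_objective(D1, D2, M, T, alpha=0.5)
```

At alpha = 1 the feature matrix has no influence, which is why that setting reduces to regular GW.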
The second uses a linear feature space. This is suitable for scalar features including isoelectric point, solvent-accessible surface area, charge, and hydrophobicity.
- GWProt.GW_protein.GW_protein.run_FGW_data_lists(prot1: GW_protein, prot2: GW_protein, data1: list[float], data2: list[float], alpha: float = 1, correspondence=False) Union[float, tuple[float, numpy.array]]
This calculates the fused Gromov-Wasserstein distance between two proteins. It takes in a list of floats for each protein, representing the value in the feature space for each residue. The (i,j)th entry in the associated difference matrix is abs(data1[i] - data2[j]).
- Parameters
prot1 – The first protein
prot2 – The second protein
data1 – The data used in the first protein
data2 – The data used in the second protein
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight, alpha = 1 is equivalent to regular GW.
correspondence – Whether to return the computed correspondence
- Returns
The FGW distance, together with the correspondence if correspondence is True.
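The difference matrix implied by two data lists follows directly from the formula abs(data1[i] - data2[j]). A small sketch with made-up per-residue values and a made-up helper name:

```python
def diff_matrix_from_lists(data1, data2):
    """Entry [i][j] is abs(data1[i] - data2[j]) (hypothetical helper)."""
    return [[abs(a - b) for b in data2] for a in data1]

# Made-up scalar feature values, one per residue.
data1 = [1.0, 0.0, -1.0]   # protein with 3 residues
data2 = [0.0, 1.0]         # protein with 2 residues

diff_mat = diff_matrix_from_lists(data1, data2)   # shape (len(data1), len(data2))
```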
The third uses a dictionary giving difference values between different types of amino acids.
- GWProt.GW_protein.GW_protein.run_FGW_dict(prot1: GW_protein, prot2: GW_protein, d: dict[str, dict[str, float]], alpha: float = 1, correspondence: bool = False) Union[float, tuple[float, numpy.array]]
This calculates the fused Gromov-Wasserstein distance between two proteins.
- Parameters
prot1 – The first protein
prot2 – The second protein
d – The dictionary used for the fused distances based on the protein sequences. Of the form d['A']['B'] == float
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight, alpha = 1 is equivalent to regular GW.
correspondence – Whether to return the correspondence
- Returns
The FGW distance, together with the correspondence if correspondence is True.
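A dictionary of the form d['A']['B'] == float determines a difference matrix once the two sequences are known. The sketch below shows that lookup with a made-up three-letter alphabet and a toy 0/1 dictionary; a real dictionary would cover all amino acid types, and the helper name is hypothetical.

```python
def diff_matrix_from_dict(seq1, seq2, d):
    """Entry [i][j] is d[seq1[i]][seq2[j]] (hypothetical helper)."""
    return [[d[a][b] for b in seq2] for a in seq1]

# Toy dictionary: 0 for identical residue types, 1 otherwise.
alphabet = "AGK"
d = {a: {b: (0.0 if a == b else 1.0) for b in alphabet} for a in alphabet}

diff_mat = diff_matrix_from_dict("AGK", "GK", d)
```

Because the matrix depends only on the residue types, the same dictionary can be reused for every pair of proteins.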
UGW computations are done with the POT package. Runtime and computed alignments are very sensitive to the choice of rho and epsilon, and suitable values can vary depending on the proteins.
- GWProt.GW_protein.GW_protein.run_UGW(prot1: GW_protein, prot2: GW_protein, rho: float = 3, epsilon: float = 3, correspondence: bool = False) Union[float, tuple[float, numpy.array]]
Computes the unbalanced GW distance, and also the correspondence if correspondence is True.
- Parameters
prot1 – The first protein
prot2 – The second protein
rho – Marginal relaxation term: the trade-off between geometric distortion and Kullback-Leibler divergence on the marginals. A higher rho means less mass is lost.
epsilon – Regularization parameter for the entropic approximation
correspondence – Whether to return the computed correspondence
- Returns
The unbalanced GW distance, together with the optimal correspondence if correspondence is True.
Similarly we have three versions of fused unbalanced GW.
- GWProt.GW_protein.GW_protein.run_FUGW_diff_mat(prot1: GW_protein, prot2: GW_protein, diff_mat: array, alpha: float = 1, rho: float = 3, epsilon: float = 3, correspondence: bool = False) Union[float, tuple[float, numpy.array]]
This calculates the fused unbalanced Gromov-Wasserstein distance between two proteins.
- Parameters
prot1 – The first protein
prot2 – The second protein
diff_mat – A user-inputted matrix of the differences in the feature space between the residues of the two proteins. Of shape (len(prot1), len(prot2)). diff_mat[i,j] is the difference in features between the ith residue of prot1 and the jth residue of prot2.
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight.
rho – Marginal relaxation term: the trade-off between geometric distortion and Kullback-Leibler divergence on the marginals. A higher rho means less mass is lost.
epsilon – Regularization parameter for the entropic approximation
correspondence – Whether to return the computed correspondence
- Returns
The fused unbalanced GW distance, together with the optimal correspondence if correspondence is True.
- GWProt.GW_protein.GW_protein.run_FUGW_data_lists(prot1: GW_protein, prot2: GW_protein, data1: list[float], data2: list[float], rho: float = 3, epsilon: float = 3, alpha: float = 1, correspondence=False) Union[float, tuple[float, numpy.array]]
This calculates the fused unbalanced Gromov-Wasserstein distance between two proteins. It takes in a list of floats for each protein, representing the value in the feature space for each residue. The (i,j)th entry in the associated difference matrix is abs(data1[i] - data2[j]).
- Parameters
prot1 – The first protein
prot2 – The second protein
data1 – The data used in the first protein
data2 – The data used in the second protein
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight.
rho – Marginal relaxation term: the trade-off between geometric distortion and Kullback-Leibler divergence on the marginals. A higher rho means less mass is lost.
epsilon – Regularization parameter for the entropic approximation
correspondence – Whether to return the computed correspondence
- Returns
The fused unbalanced GW distance, together with the correspondence if correspondence is True.
- GWProt.GW_protein.GW_protein.run_FUGW_dict(prot1: GW_protein, prot2: GW_protein, d: dict[str, dict[str, float]], alpha: float = 1, correspondence: bool = False) Union[float, tuple[float, numpy.array]]
This calculates the fused unbalanced Gromov-Wasserstein distance between two proteins.
- Parameters
prot1 – The first protein
prot2 – The second protein
d – The dictionary used for the fused distances based on the protein sequences. Of the form d['A']['B'] == float
alpha – The trade-off parameter in [0,1] between fused term and geometric term. A higher value of alpha means more geometric weight.
rho – Marginal relaxation term: the trade-off between geometric distortion and Kullback-Leibler divergence on the marginals. A higher rho means less mass is lost.
epsilon – Regularization parameter for the entropic approximation
correspondence – Whether to return the correspondence
- Returns
The fused unbalanced GW distance, together with the correspondence if correspondence is True.
Using fused or unbalanced methods on downsampled proteins is not recommended, as the data from the excluded residues is lost.
Computing Local Geometric Distortion (LGD)
The local geometric distortion (LGD) quantifies the contribution of each residue to the GW or FGW distance, providing a residue-level measure of structural conservation or flexibility.
- GWProt.GW_protein.GW_protein.GW_lgd(prot1: GW_protein, prot2: GW_protein, T: array) tuple[numpy.array, numpy.array]
This calculates the local geometric distortion (LGD), i.e. the contribution of each residue to the sum in the GW cost, using the correspondence T. This is output as two np.arrays, one for prot1, the second for prot2.
- Parameters
prot1 – The first GW_protein
prot2 – The second GW_protein
T – The correspondence to be used
- Returns
lgd1, lgd2; the LGD values for the two proteins
- GWProt.GW_protein.GW_protein.FGW_lgd(prot1: GW_protein, prot2: GW_protein, T: array, diff_mat: array, alpha: float) tuple[numpy.array, numpy.array]
This calculates the local geometric distortion (LGD), i.e. the contribution of each residue to the sum in the FGW cost, using the correspondence T. This is output as two np.arrays, one for prot1, the second for prot2.
- Parameters
prot1 – The first GW_protein
prot2 – The second GW_protein
diff_mat – The difference matrix in the feature space
T – The correspondence to be used
alpha – The trade-off constant between the fused cost and the geometric cost
- Returns
lgd1, lgd2; the LGD values for the two proteins
Note
np.sum(lgd1) != c, where c is the GW cost; rather math.sqrt(np.sum(lgd1))/2 == c, and similarly for lgd2, and for FGW.
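The relation in the note can be checked on a toy example. The sketch below assumes the LGD of a residue is its share of the quadruple sum in the GW cost, which matches the description above but is not taken from the GWProt source; the function gw_lgd_sketch and the data are made up, and plain lists stand in for np.array.

```python
import math

def gw_lgd_sketch(D1, D2, T):
    """Per-residue share of sum_{i,j,k,l} (D1[i][k] - D2[j][l])**2 * T[i][j] * T[k][l]."""
    n, m = len(D1), len(D2)
    lgd1 = [0.0] * n
    lgd2 = [0.0] * m
    for i in range(n):
        for j in range(m):
            if T[i][j] == 0.0:
                continue  # unmatched pairs contribute nothing
            for k in range(n):
                for l in range(m):
                    term = (D1[i][k] - D2[j][l]) ** 2 * T[i][j] * T[k][l]
                    lgd1[i] += term   # credited to residue i of the first protein
                    lgd2[j] += term   # and to residue j of the second protein
    return lgd1, lgd2

D1 = [[0.0, 1.0], [1.0, 0.0]]
D2 = [[0.0, 3.0], [3.0, 0.0]]
T = [[0.5, 0.0], [0.0, 0.5]]

lgd1, lgd2 = gw_lgd_sketch(D1, D2, T)
gw_cost = math.sqrt(sum(lgd1)) / 2   # the relation stated in the note
```

Both lgd1 and lgd2 sum to the same total, so either can be used to recover the cost.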
Miscellaneous Methods
- GWProt.GW_protein.GW_protein.run_ssearch_indices(prot1: GW_protein, prot2: GW_protein, allow_mismatch: bool = True, BLOSUM='62', gap_open=0, gap_extend=0) tuple[list[int], list[int]]
Runs a local sequence alignment and returns the indices of the residues of the two proteins that are aligned. ssearch36 must be in the PATH to use this method.
- Parameters
prot1 – First protein
prot2 – Second protein
allow_mismatch – Whether to include residues which are aligned but not the same type of amino acid
- Returns
Two lists of indices, those of prot1 and prot2 which are aligned
Explicitly, this runs the command
$ ssearch36 -s BP62 -p -T 1 -b 1 -f 0 -g 0 -z -1 -m 9C
for the Smith-Waterman algorithm in the Fasta Package.
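To make the returned index lists concrete, the sketch below extracts aligned indices from a pair of gapped alignment strings. It does not call ssearch36 or parse its output; the function aligned_indices, its start offsets, and the alignment strings are all made up for illustration.

```python
def aligned_indices(aln1, aln2, start1=0, start2=0, allow_mismatch=True):
    """Indices of residues paired by a gapped alignment ('-' marks a gap)."""
    idx1, idx2 = [], []
    i, j = start1, start2   # a local alignment may start partway into each sequence
    for a, b in zip(aln1, aln2):
        if a != "-" and b != "-" and (allow_mismatch or a == b):
            idx1.append(i)
            idx2.append(j)
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return idx1, idx2

aln1 = "MK-GAL"
aln2 = "MRTGA-"

with_mismatch = aligned_indices(aln1, aln2)                       # keeps the K/R pair
exact_only = aligned_indices(aln1, aln2, allow_mismatch=False)    # drops the K/R pair
```

The two lists always have equal length, which is what allows both proteins to be downsampled to the aligned residues before running GW.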
- GWProt.GW_protein.GW_protein.get_eccentricity(self, p: float = 2) array
This calculates the eccentricity of residues in a protein with exponent p.
- Parameters
p – The exponent, 0 < p <= np.inf
- Returns
The eccentricities of each residue, as a np.array
Eccentricity is defined in [1], Definition 5.3. Intuitively, it quantifies how far each residue is from the rest of the residues in a protein. Within a given protein, residues with higher eccentricity often have higher LGD when aligned to other proteins, so eccentricity could be used to normalize LGD values.
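As a rough sketch of the definition, the eccentricity of a residue is taken here to be the weighted p-norm of its distances to all residues, with the maximum distance for p = np.inf. This is one reading of the cited definition rather than the GWProt code; the helper eccentricity_sketch and the data are made up, and plain lists stand in for np.array.

```python
import math

def eccentricity_sketch(ipdm, distribution, p=2.0):
    """(sum_j w_j * d(i, j)**p) ** (1/p) for each residue i; max distance if p is infinite."""
    if math.isinf(p):
        return [max(d for d, w in zip(row, distribution) if w > 0) for row in ipdm]
    return [sum(w * d ** p for d, w in zip(row, distribution)) ** (1.0 / p)
            for row in ipdm]

# Three residues on a line at positions 0, 3 and 6, with uniform weights.
ipdm = [[0.0, 3.0, 6.0],
        [3.0, 0.0, 3.0],
        [6.0, 3.0, 0.0]]
distribution = [1 / 3, 1 / 3, 1 / 3]

ecc_1 = eccentricity_sketch(ipdm, distribution, p=1.0)          # mean distance to the others
ecc_inf = eccentricity_sketch(ipdm, distribution, p=math.inf)   # distance to the farthest residue
```

The middle residue has the lowest eccentricity for every p, matching the intuition that central residues are close to the rest of the protein.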
References
- [1] Mémoli, F. (2011). Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics, 11(4), 417-487.