LGD_Comparison

class GWProt.lgd_comparison.LGD_Comparison(prot_list: list[GWProt.GW_protein.GW_protein], RAM: bool = True, transport_dir: Optional[str] = None)

This class streamlines computing local geometric distortions for a dataset of proteins and analyzing correspondences.

Parameters

prot_list – A list of GW_protein.GW_protein objects
RAM – Whether to store the computed correspondences in RAM versus saving to files. Default is in RAM.
transport_dir – If RAM == False, a filepath to save the correspondences in.

Variables

name_list – A list of the names of the GW_proteins in prot_list, equivalent to [p.name for p in prot_list]. These names serve as the keys of the dicts below.
raw_lgd_dict – A dict storing all computed LGD values, where raw_lgd_dict[prot1.name][prot2.name] is the local geometric distortion of the residues in prot1 when aligning it to prot2.
dist_dict – A dict storing all computed GW or FGW distances, where dist_dict[prot1.name][prot2.name] is the GW or FGW distance between prot1 and prot2.
transport_dict – A dict storing all computed correspondences; only used if RAM == True. Here transport_dict[prot1.name][prot2.name] is the correspondence aligning prot1 to prot2.
transport_dir – A filepath to the directory where correspondences are stored if RAM == False.

This module stores the correspondences and local geometric distortion (LGD) values between all pairs of proteins, which can be memory intensive. Setting RAM = False saves these to files, but significantly slows down computations due to file I/O.

Calculating LGDs and Distances

These methods run all pairwise computations. For each pair of proteins, they run GW or FGW and store the distance, the correspondence, and the associated LGD values. One of these must be run before any of the later methods are called.

As these involve a large number of computations, they can be time consuming on large datasets.

GWProt.lgd_comparison.LGD_Comparison.GW_compute_lgd(self, processes: Optional[int] = None) → None

This method runs all pairwise GW computations. This can be done in parallel with processes number of processes.

Parameters: processes – How many parallel processes to run, default is 1.

GWProt.lgd_comparison.LGD_Comparison.FGW_compute_lgd_data_lists(self, data_list_dict: dict, alpha: float, processes: Optional[int] = None) → None

This method runs all pairwise FGW computations with GW_protein.run_FGW_data_lists. This can be done in parallel with processes number of processes.

Parameters

data_list_dict – A dictionary of {name: data_list} for GW_protein.run_FGW_data_lists
processes – How many parallel processes to run, default is 1.
alpha – The value of alpha to use for FGW.

GWProt.lgd_comparison.LGD_Comparison.FGW_compute_lgd_dict(self, diff_dict: dict, alpha: float, processes: Optional[int] = None) → None

This method runs all pairwise FGW computations with GW_protein.run_FGW_dict. This can be done in parallel with processes number of processes.

Parameters

diff_dict – A dictionary for GW_protein.run_FGW_dict
processes – How many parallel processes to run, default is 1.
alpha – The value of alpha to use for FGW.

These methods must be run after computing the LGDs:

GWProt.lgd_comparison.LGD_Comparison.get_GW_dmat(self) → array

This method returns the GW or FGW distances of self in the form of a distance matrix. Rows and columns follow the order of self.prot_list.

Returns: np.array
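The relationship between the nested dist_dict and a distance matrix can be sketched as follows. This is an illustration only, not GWProt's implementation: the helper dist_dict_to_matrix, the protein names, and the distance values are all hypothetical, and the sketch assumes that the dict holds an entry for every ordered pair of distinct proteins.

```python
import numpy as np


def dist_dict_to_matrix(dist_dict, name_list):
    """Arrange nested pairwise distances into a square matrix ordered by name_list.

    Hypothetical helper for illustration; assumes dist_dict[a][b] exists
    for every ordered pair of distinct names a, b.
    """
    n = len(name_list)
    dmat = np.zeros((n, n))  # the diagonal stays 0: a protein is at distance 0 from itself
    for i, name_i in enumerate(name_list):
        for j, name_j in enumerate(name_list):
            if i != j:
                dmat[i, j] = dist_dict[name_i][name_j]
    return dmat


# Toy data: made-up names and distances, not real GW output.
names = ["protA", "protB", "protC"]
toy_dist_dict = {
    "protA": {"protB": 0.5, "protC": 1.5},
    "protB": {"protA": 0.5, "protC": 2.0},
    "protC": {"protA": 1.5, "protB": 2.0},
}
dmat = dist_dict_to_matrix(toy_dist_dict, names)
```

Entry dmat[i, j] then corresponds to the pair (names[i], names[j]), which mirrors the indexing by prot_list order described above.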

GWProt.lgd_comparison.normalize_lgd_dict(raw_dict: dict[str, dict[str, numpy.array]], code: tuple[float, float, float, float] = (1, 0, 0, 0, 0)) → dict[str, numpy.array]

This method takes in a dictionary of raw local geometric distortions and outputs a dictionary of weighted averages.

Parameters

raw_dict – A dictionary of raw local geometric distortions of the format raw_dict[name1][name2] == lgd.
code – A tuple of exponents to be used for weighting. code[0] is the exponent for each local geometric distortion value, code[1] is the exponent for each local geometric distortion value when summed to form the row total, code[2] is the exponent of that row total, code[3] is the exponent of the number of residues in the protein, and code[4] is the exponent of the number of other proteins. The default is (1,0,0,0,0), which corresponds to the simple sum. (1,0,0,0,-1) is the mean.

Return dict

A dictionary of local geometric distortions of the format lgd_dict[name] == lgd

This method combines the raw LGD arrays (one for each protein pair) into a single LGD array for each protein.

For each protein, it returns the normalized_lgd calculated below, where the rows of mat are its different LGD arrays in the raw LGD dict:

a, b, c, d, e = code
normalized_lgd = np.sum(mat**a * (np.sum(mat**b, axis=1) ** c)[:, np.newaxis] * mat.shape[1] ** d, axis=0) * mat.shape[0] ** e
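To make the effect of code concrete, the formula above can be wrapped in a small function and applied to a toy matrix. The function name combine_lgd_rows and the numbers are hypothetical and chosen for illustration; only the expression itself comes from the formula above.

```python
import numpy as np


def combine_lgd_rows(mat, code=(1, 0, 0, 0, 0)):
    """Apply the weighting formula above to a matrix whose rows are raw LGD arrays."""
    a, b, c, d, e = code
    # Per-row total of mat**b, raised to the power c: one weight per row.
    row_weights = np.sum(mat**b, axis=1) ** c
    weighted = mat**a * row_weights[:, np.newaxis] * mat.shape[1] ** d
    # Sum over rows (the other proteins), then scale by the number of rows ** e.
    return np.sum(weighted, axis=0) * mat.shape[0] ** e


# Toy matrix: 2 other proteins (rows), 3 residues (columns); values are made up.
mat = np.array([[1.0, 2.0, 3.0],
                [3.0, 4.0, 5.0]])

simple_sum = combine_lgd_rows(mat)                        # array([4., 6., 8.])
mean_lgd = combine_lgd_rows(mat, (1, 0, 0, 0, -1))        # array([2., 3., 4.])
row_normalized = combine_lgd_rows(mat, (1, 1, -1, 0, -1))
```

With code (1, 1, -1, 0, -1), each row is divided by its own total before averaging, so every comparison contributes equally and the entries of row_normalized sum to 1.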

Transferring LGDs

GWProt.lgd_comparison.LGD_Comparison.raw_transferred_lgd(self, lgd_dict: dict[str, numpy.array]) → dict[str, dict[str, numpy.array]]

This method computes all of the transferred local geometric distortions.

Parameters: lgd_dict – A dictionary of the form {name: np.array} with keys the protein names in self.name_list, where the arrays represent the local geometric distortion of each residue.
Return dict: A dictionary of all transferred local geometric distortions, where raw_transferred_lgd[prot1.name][prot2.name] is the local geometric distortion of the residues in prot1 based on the transferred local geometric distortion from prot2.

The output is in the same format as self.raw_lgd_dict, so it must be normalized (for example with normalize_lgd_dict) before further use.

Analysis Helper Methods

We also provide helper methods not part of the LGD_Comparison class but useful for analyzing LGD values.

GWProt.lgd_comparison.get_percentile_of_dict(lgd_dict: dict[str, numpy.array]) → dict[str, numpy.array]

This method replaces each local geometric distortion array with the percentile values of the local geometric distortion for that protein.

Parameters: lgd_dict – A dict of the local geometric distortion levels for each protein.
Returns: A dict of the percentiles of the local geometric distortion levels at each residue for each protein.
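A minimal sketch of a per-protein percentile transform is given below. It assumes one particular definition, namely that the percentile of a residue is the percentage of residues in the same protein whose LGD is less than or equal to its own. The library may use a different convention, for instance in how ties are handled, and the function name percentile_of_dict_sketch and the toy values are hypothetical.

```python
import numpy as np


def percentile_of_dict_sketch(lgd_dict):
    """Replace each LGD array with within-protein percentiles (0 to 100).

    Assumed definition: percentage of residues in the same protein
    whose LGD is <= the residue's own LGD.
    """
    out = {}
    for name, lgd in lgd_dict.items():
        lgd = np.asarray(lgd, dtype=float)
        # Pairwise comparison: entry [i, j] is True when lgd[i] >= lgd[j].
        at_or_below = lgd[:, np.newaxis] >= lgd[np.newaxis, :]
        out[name] = 100.0 * at_or_below.mean(axis=1)
    return out


# Toy input: a made-up protein with four residues.
toy_lgd = {"protA": np.array([0.2, 0.8, 0.5, 0.1])}
toy_percentiles = percentile_of_dict_sketch(toy_lgd)
# toy_percentiles["protA"] is array([ 50., 100.,  75.,  25.])
```

Because the percentiles are computed separately for each protein, they put proteins with very different overall LGD scales on a common 0 to 100 scale.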

GWProt.lgd_comparison.get_AP_scores(lgd_dict: dict[str, numpy.array], true_region_dict: dict[str, list[bool]], upper=False) → dict[str, float]

This method takes an LGD dict and, for each protein, calculates the average precision of using the local geometric distortions to predict user-specified regions of that protein.

Parameters

lgd_dict – A dict of the local geometric distortion levels for each protein.
true_region_dict – A dict of the true regions to be predicted.
upper – Whether to predict the regions based on high local geometric distortion (True) or low local geometric distortion (False).

Returns

A dict of the average precision scores for each protein.
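For readers unfamiliar with average precision, the following sketch shows one standard way to compute it for a single protein: rank the residues by LGD, then average the precision at each rank where a true residue appears. This is not taken from the GWProt source. The function name average_precision_sketch is hypothetical, ties are broken by residue order, and the library may rely on a different implementation.

```python
import numpy as np


def average_precision_sketch(lgd, true_region, upper=False):
    """Average precision of ranking residues by LGD to recover true_region.

    upper=True ranks high LGD first; upper=False ranks low LGD first.
    Returns NaN when true_region contains no True entries.
    """
    lgd = np.asarray(lgd, dtype=float)
    truth = np.asarray(true_region, dtype=bool)
    # argsort is ascending, so negate the scores to rank high LGD first.
    order = np.argsort(-lgd if upper else lgd, kind="stable")
    hits = truth[order]
    if not hits.any():
        return float("nan")
    # Precision after each ranked residue, averaged over the true residues.
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())


# Toy protein: the true region is the two residues with the lowest LGD.
toy_scores = [0.9, 0.1, 0.8, 0.2]
toy_truth = [False, True, False, True]
ap_low = average_precision_sketch(toy_scores, toy_truth, upper=False)   # 1.0
ap_high = average_precision_sketch(toy_scores, toy_truth, upper=True)   # 5/12
```

In the toy example the true residues have the lowest LGD, so ranking by low LGD recovers them perfectly, while ranking by high LGD places them last and yields a much lower score.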