LGD_Comparison
- class GWProt.lgd_comparison.LGD_Comparison(prot_list: list[GWProt.GW_protein.GW_protein], RAM: bool = True, transport_dir: Optional[str] = None)
This class streamlines computing local geometric distortions for a dataset of proteins and analyzing correspondences.
- Parameters
prot_list – A list of GW_protein.GW_protein objects.
RAM – Whether to store the computed correspondences in RAM rather than saving them to files. Default is True (in RAM).
transport_dir – If RAM == False, a filepath to the directory in which the correspondences are saved.
- Variables
name_list – A list of the names of the GW_protein objects in prot_list, equivalent to [p.name for p in prot_list]. These names serve as the keys for the dicts.
raw_lgd_dict – A dict storing all computed LGD values, where raw_lgd_dict[prot1.name][prot2.name] is the local geometric distortion of the residues in prot1 when aligning it to prot2.
dist_dict – A dict storing all computed GW or FGW distances, where dist_dict[prot1.name][prot2.name] is the GW or FGW distance between prot1 and prot2.
transport_dict – A dict storing all computed correspondences; only used if RAM == True. Here transport_dict[prot1.name][prot2.name] is the correspondence aligning prot1 to prot2.
transport_dir – A filepath to the directory where correspondences are stored if RAM == False.
This module stores the correspondences and local geometric distortion (LGD) values between all pairs of proteins, which can be memory intensive. Setting RAM = False saves these to files, but significantly slows down computations due to file I/O.
Calculating LGDs and Distances
These methods run all pairwise computations. For each pair of proteins, they run GW or FGW and store the distance, the correspondence, and the associated LGD values. One of these methods must be run before any of the later methods are called.
As these involve a large number of computations, they can be time consuming on large datasets.
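The workflow described above can be sketched as follows. This is a minimal sketch, not a definitive recipe: it assumes `prot_list` is an already-built list of GW_protein objects, and the helper name `compute_all_lgds` is hypothetical. The import path is taken from the qualified names used on this page.

```python
def compute_all_lgds(prot_list, processes=None, ram=True, transport_dir=None):
    """Hypothetical helper: run all pairwise GW computations and return the results.

    Assumes prot_list is a list of GW_protein objects built elsewhere.
    """
    # Imported lazily so that defining this helper does not itself require GWProt.
    from GWProt.lgd_comparison import LGD_Comparison

    # With ram=False, transport_dir should point at a directory for the correspondences.
    comparison = LGD_Comparison(prot_list, RAM=ram, transport_dir=transport_dir)

    # One of the *_compute_lgd methods must run before the dicts are populated.
    comparison.GW_compute_lgd(processes=processes)

    # Both dicts are keyed by protein name, then by the name of the other protein.
    return comparison.raw_lgd_dict, comparison.dist_dict
```

Swapping `GW_compute_lgd` for one of the FGW variants would additionally require the `alpha` value and the corresponding data dictionary.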
- GWProt.lgd_comparison.LGD_Comparison.GW_compute_lgd(self, processes: Optional[int] = None) None
This method runs all pairwise GW computations. This can be done in parallel with processes number of processes.
- Parameters
processes – How many parallel processes to run, default is 1.
- GWProt.lgd_comparison.LGD_Comparison.FGW_compute_lgd_data_lists(self, data_list_dict: dict, alpha: float, processes: Optional[int] = None) None
This method runs all pairwise FGW computations with GW_protein.run_FGW_data_lists. This can be done in parallel with processes number of processes.
- Parameters
data_list_dict – A dictionary of {name: data_list} for GW_protein.run_FGW_data_lists.
processes – How many parallel processes to run, default is 1.
alpha – The value of alpha to use for FGW.
- GWProt.lgd_comparison.LGD_Comparison.FGW_compute_lgd_dict(self, diff_dict: dict, alpha: float, processes: Optional[int] = None) None
This method runs all pairwise FGW computations with GW_protein.run_FGW_dict. This can be done in parallel with processes number of processes.
- Parameters
diff_dict – A dictionary for GW_protein.run_FGW_dict.
processes – How many parallel processes to run, default is 1.
alpha – The value of alpha to use for FGW.
These methods must be run after computing the LGDs:
- GWProt.lgd_comparison.LGD_Comparison.get_GW_dmat(self) array
This method returns the GW or FGW distances of self in the form of a distance matrix. The indexing is that of self.prot_list.
- Returns
np.array
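To illustrate the indexing convention, here is a small sketch of how a nested distance dict can be laid out as a matrix whose rows and columns follow a list of names. It is not the library's implementation: the function name `dist_dict_to_matrix` is hypothetical, and the sketch assumes the distance is symmetric and may be stored under either key order.

```python
import numpy as np

def dist_dict_to_matrix(dist_dict, name_list):
    """Hypothetical helper: arrange dist_dict[name1][name2] into an n x n array."""
    n = len(name_list)
    dmat = np.zeros((n, n))  # diagonal stays 0: a protein is at distance 0 from itself
    for i, name_i in enumerate(name_list):
        for j, name_j in enumerate(name_list):
            if i == j:
                continue
            row = dist_dict.get(name_i, {})
            # Assumption: the pair may be stored under either key order.
            if name_j in row:
                dmat[i, j] = row[name_j]
            else:
                dmat[i, j] = dist_dict[name_j][name_i]
    return dmat

toy_dists = {"a": {"b": 1.0, "c": 2.0}, "b": {"c": 3.0}}
toy_dmat = dist_dict_to_matrix(toy_dists, ["a", "b", "c"])
```

Row `i` and column `j` of the result correspond to the `i`-th and `j`-th names in the list, mirroring the indexing by `self.prot_list` described above.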
- GWProt.lgd_comparison.normalize_lgd_dict(raw_dict: dict[str, dict[str, numpy.array]], code: tuple[float, float, float, float] = (1, 0, 0, 0, 0)) dict[str, numpy.array]
This method takes in a dictionary of raw local geometric distortions and outputs a dictionary of weighted averages.
- Parameters
raw_dict – A dictionary of raw local geometric distortions of the format raw_dict[name1][name2] == lgd.
code – A tuple of exponents to be used for weighting. code[0] is the exponent for each local geometric distortion value, code[1] is the exponent for each local geometric distortion value to be summed for the row local geometric distortion, code[2] is the exponent of the total local geometric distortion in a row, code[3] is the exponent of the number of residues in the protein, and code[4] is the exponent for the number of other proteins. The default is (1,0,0,0,0), which corresponds to the simple sum; (1,0,0,0,-1) is the mean.
- Return dict
A dictionary of local geometric distortions of the format lgd_dict[name] == lgd.
This method combines the raw LGD arrays (one for each protein pair) into a single LGD array for each protein.
For each protein, it returns the normalized_lgd calculated below, where the rows of mat are its different LGD arrays in the raw LGD dict:
a, b, c, d, e = code
normalized_lgd = np.sum(mat**a * (np.sum(mat**b, axis=1) ** c)[:, np.newaxis] * mat.shape[1] ** d, axis=0) * mat.shape[0] ** e
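The formula above can be tried on a toy matrix. The sketch below wraps it in a standalone function (the name `combine_lgd_rows` is hypothetical and not part of GWProt) and checks the two codes mentioned in the parameter description: the simple sum and the mean.

```python
import numpy as np

def combine_lgd_rows(mat, code=(1, 0, 0, 0, 0)):
    """Hypothetical helper applying the weighting formula to one protein.

    mat has one row per other protein and one column per residue.
    """
    a, b, c, d, e = code
    mat = np.asarray(mat, dtype=float)
    # Per-row weight: the row total of mat**b, raised to the power c.
    row_weights = np.sum(mat**b, axis=1) ** c
    weighted = mat**a * row_weights[:, np.newaxis] * mat.shape[1] ** d
    # Sum over the other proteins, then scale by the number of rows to the power e.
    return np.sum(weighted, axis=0) * mat.shape[0] ** e

toy_mat = [[1.0, 2.0, 3.0],
           [3.0, 4.0, 5.0]]
lgd_sum = combine_lgd_rows(toy_mat)                      # column-wise sum
lgd_mean = combine_lgd_rows(toy_mat, (1, 0, 0, 0, -1))   # column-wise mean
```

With the default code every weight equals 1, so the result is the per-residue sum over all rows; setting the last exponent to -1 divides that sum by the number of rows.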
Transferring LGDs
- GWProt.lgd_comparison.LGD_Comparison.raw_transferred_lgd(self, lgd_dict: dict[str, numpy.array]) dict[str, dict[str, numpy.array]]
This method computes all of the transferred local geometric distortions.
- Parameters
lgd_dict – A dictionary of the form {name: np.array} whose keys are the protein names in self.name_list and whose arrays represent the local geometric distortion of each residue.
- Return dict
A dictionary of all transferred local geometric distortions, where raw_transferred_lgd[prot1.name][prot2.name] is the local geometric distortion of the residues in prot1 based on the transferred local geometric distortion from prot2.
This is in the same format as self.raw_lgd_dict, so it must be normalized before further use.
Analysis Helper Methods
We also provide helper methods not part of the LGD_Comparison class but useful for analyzing LGD values.
- GWProt.lgd_comparison.get_percentile_of_dict(lgd_dict: dict[str, numpy.array]) dict[str, numpy.array]
This method replaces each local geometric distortion array with the percentile values of the local geometric distortion for that protein.
- Parameters
lgd_dict – A dict of the local geometric distortion levels for each protein.
- Returns
A dict of the percentiles of the local geometric distortion levels at each residue for each protein.
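One way to read the percentile conversion is sketched below. The exact percentile convention used by the library is not stated here, so this sketch assumes a rank-based definition: each residue's percentile is the share of residues in the same protein whose value is less than or equal to its own. The function name `percentile_of_dict` is hypothetical.

```python
import numpy as np

def percentile_of_dict(lgd_dict):
    """Hypothetical helper: per-protein rank-based percentiles (0-100] of LGD values."""
    result = {}
    for name, lgd in lgd_dict.items():
        lgd = np.asarray(lgd, dtype=float)
        sorted_lgd = np.sort(lgd)
        # Number of residues with value <= each residue's value (ties share a rank).
        counts = np.searchsorted(sorted_lgd, lgd, side="right")
        result[name] = 100.0 * counts / len(lgd)
    return result

toy_percentiles = percentile_of_dict({"p": [0.1, 0.4, 0.2, 0.4]})
```

Because the percentiles are computed within each protein, the resulting values are comparable across proteins of different sizes and overall distortion levels.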
- GWProt.lgd_comparison.get_AP_scores(lgd_dict: dict[str, numpy.array], true_region_dict: dict[str, list[bool]], upper=False) dict[str, float]
This method takes an LGD dict and calculates, for each protein, the average precision of using the local geometric distortions to predict user-specified regions in the protein.
- Parameters
lgd_dict – A dict of the local geometric distortion levels for each protein.
true_region_dict – A dict of the true regions to be predicted.
upper – Whether to predict the regions based on high local geometric distortion (True) or low local geometric distortion (False).
- Returns
A dict of the average precision scores for each protein.
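For readers unfamiliar with average precision, the sketch below shows the standard ranking-based definition applied per protein. It is an illustration under stated assumptions, not the library's code: the function names are hypothetical, ties are broken by residue order, and a protein with no true residues is given NaN.

```python
import numpy as np

def average_precision(scores, is_true, upper=False):
    """Hypothetical helper: average precision of ranking residues by LGD."""
    scores = np.asarray(scores, dtype=float)
    is_true = np.asarray(is_true, dtype=bool)
    if not is_true.any():
        return float("nan")  # assumption: AP is undefined without positives
    # upper=True ranks high LGD first; upper=False ranks low LGD first.
    ranking_scores = scores if upper else -scores
    order = np.argsort(-ranking_scores, kind="stable")
    hits = is_true[order]
    # Precision after each ranked residue, averaged over the true residues.
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())

def ap_scores(lgd_dict, true_region_dict, upper=False):
    """Hypothetical helper: one average precision score per protein name."""
    return {name: average_precision(lgd_dict[name], true_region_dict[name], upper)
            for name in lgd_dict}

toy_lgd = {"p": [0.9, 0.1, 0.8, 0.2]}
toy_regions = {"p": [False, True, False, True]}
ap_low = ap_scores(toy_lgd, toy_regions, upper=False)
ap_high = ap_scores(toy_lgd, toy_regions, upper=True)
```

In this toy case the true residues have the two lowest LGD values, so predicting from low LGD ranks them first, while predicting from high LGD ranks them last.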