pySIPFENN Core

pysipfenn.core.pysipfenn

class pysipfenn.Calculator(autoLoad=True)

Bases: object

The pySIPFENN Calculator automatically initializes all functionality, including identifying and loading all available models defined statically in the models.json file. It exposes methods for calculating predefined structure-informed descriptors (feature vectors) and for predicting properties with models that use them.

Parameters:

autoLoad (bool) – Automatically load all available models. Default: True.
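
A minimal usage sketch, assuming pySIPFENN is installed and the ONNX models can be downloaded. It chains only methods documented in this section; the helper name predict_directory and its out_csv argument are illustrative and not part of the package.

```python
def predict_directory(directory, descriptor="KS2022", out_csv="results.csv"):
    """Run all loaded models on every structure file in `directory`."""
    # Imported lazily so that defining this helper does not require pySIPFENN.
    from pysipfenn import Calculator

    calc = Calculator(autoLoad=False)  # skip automatic model loading
    calc.downloadModels()              # models already on disk are skipped
    calc.loadModels()                  # load every model detected as available

    # One list of predictions per structure, in naturally sorted file order.
    predictions = calc.runFromDirectory(directory=directory, descriptor=descriptor)
    calc.writeResultsToCSV(out_csv)
    return predictions
```

Calling predict_directory('myStructures') would then return the list of predictions and write them to results.csv; the directory name here is a placeholder.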

models

Dictionary with all model information based on the models.json file in the modelsSIPFENN directory. The keys are the network names and the values are dictionaries with the model information.

loadedModels

Dictionary with all loaded models. The keys are the network names and the values are the loaded PyTorch models.

descriptorData

List of all descriptor data created during the last predictions run. The order of the list corresponds to the order of atomic structures given to models as input. The order of the list of descriptor data for each structure corresponds to the order of networks in the toRun list.

predictions

List of all predictions created during the last predictions run. The order of the list corresponds to the order of atomic structures given to models as input. The order of the list of predictions for each structure corresponds to the order of networks in the toRun list.

inputFiles

List of all input file names used during the last predictions run. The order of the list corresponds to the order of atomic structures given to models as input.

calculate_KS2022(structList, mode='serial', max_workers=8)

Calculates KS2022 descriptors for a list of structures. The calculation can be done in serial or parallel mode. In parallel mode, the number of workers can be specified. The results are stored in the descriptorData attribute. The function returns the list of descriptors as well.

Parameters:
  • structList (List[Structure]) – List of structures to calculate descriptors for. The structures must be initialized with the pymatgen Structure class.

  • mode (str) – Mode of calculation. Defaults to ‘serial’. Options are ‘serial’ and ‘parallel’.

  • max_workers (int) – Number of workers to use in parallel mode. Defaults to 8.

Return type:

list

Returns:

List of KS2022 descriptors (feature vectors), one per structure.

calculate_KS2022_dilute(structList, baseStruct='pure', mode='serial', max_workers=8)

Calculates KS2022 descriptors for a list of dilute structures (based either on pure elements or on custom base structures, e.g. TCP endmember configurations) that contain a single alloying atom. Speed increases are substantial compared to the KS2022 descriptor, which is more general and can be used on any structure. The calculation can be done in serial or parallel mode. In parallel mode, the number of workers can be specified. The results are stored in the descriptorData attribute. The function returns the list of descriptors as well.

Parameters:
  • structList (List[Structure]) – List of structures to calculate descriptors for. The structures must be dilute structures (based either on pure elements or on custom base structures, e.g. TCP endmember configurations) that contain a single alloying atom. The structures must be initialized with the pymatgen Structure class.

  • baseStruct (Union[str, List[Structure]]) – Non-diluted references for the dilute structures. Defaults to ‘pure’, which assumes that the structures are based on pure elements and generates references automatically. Alternatively, a list of structures can be provided, which can be either pure elements or custom base structures (e.g. TCP endmember configurations).

  • mode (str) – Mode of calculation. Defaults to ‘serial’. Options are ‘serial’ and ‘parallel’.

  • max_workers (int) – Number of workers to use in parallel mode. Defaults to 8.

Return type:

list

Returns:

List of KS2022 descriptors (feature vectors), one per structure.

calculate_Ward2017(structList, mode='serial', max_workers=4)

Calculates Ward2017 descriptors for a list of structures. The calculation can be done in serial or parallel mode. In parallel mode, the number of workers can be specified. The results are stored in the descriptorData attribute. The function returns the list of descriptors as well.

Parameters:
  • structList (List[Structure]) – List of structures to calculate descriptors for. The structures must be initialized with the pymatgen Structure class.

  • mode (str) – Mode of calculation. Defaults to ‘serial’. Options are ‘serial’ and ‘parallel’.

  • max_workers (int) – Number of workers to use in parallel mode. Defaults to 4.

Return type:

list

Returns:

List of Ward2017 descriptors (feature vectors), one per structure.

downloadModels(network='all')

Downloads ONNX models. By default, all available models are downloaded, and any model already available on disk is skipped. If a specific network is given, only that network is downloaded, possibly overwriting the existing copy. If the network name is not recognized, a message is printed.

Parameters:

network (str) – Name of the network to download. Defaults to ‘all’.

Return type:

None
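
The three cases described above can be summarized as a small decision function. This is only an illustration of the documented behavior, not the package's implementation: plan_download, known, and on_disk are hypothetical names, and the printed message is made up.

```python
def plan_download(network, known, on_disk):
    """Return the network names that would be fetched for a given request.

    `known` stands in for the keys of models.json and `on_disk` for the
    models already present locally; both are hypothetical inputs.
    """
    if network == "all":
        # Default: fetch everything that is not already available.
        return [name for name in known if name not in on_disk]
    if network in known:
        # Explicit request: fetch it even if a copy exists (may overwrite).
        return [network]
    # Unrecognized name: report it and fetch nothing.
    print(f"Network {network} not recognized")
    return []

known = ["SIPFENN_Krajewski2020_NN9", "SIPFENN_Krajewski2022_NN30"]
on_disk = {"SIPFENN_Krajewski2020_NN9"}
to_fetch = plan_download("all", known, on_disk)
```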

downloadModels_legacyMxNet(network='all')

Legacy function. Downloads MxNet models.

Parameters:

network (str) – Name of the network to download. Defaults to ‘all’.

Return type:

None

findCompatibleModels(descriptor)

Finds all models compatible with a given descriptor based on the descriptor definitions loaded from the models.json file.

Parameters:

descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions to see available modules or add yours. Available default descriptors are: ‘Ward2017’, ‘KS2022’, ‘KS2022_dilute’.

Return type:

List[str]

Returns:

List of compatible models.
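
Conceptually, the lookup is a filter over the model definitions. The sketch below assumes each entry stores its descriptor under a key named "descriptor"; that key, the helper find_compatible, and the network names netA and netB are assumptions made for illustration.

```python
def find_compatible(models, descriptor):
    """Return the names of networks whose entry lists the given descriptor."""
    return [name for name, info in models.items()
            if info.get("descriptor") == descriptor]

# Hypothetical entries in the shape described for the models attribute:
# network name -> dictionary with model information.
models = {
    "netA": {"descriptor": "Ward2017"},
    "netB": {"descriptor": "KS2022"},
}
compatible = find_compatible(models, "KS2022")
```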

get_resultDicts()

Returns a list of dictionaries with the predictions for each network. The keys of the dictionaries are the names of the networks. The order of the dictionaries is the same as the order of the input structures passed through runModels() functions.

Return type:

List[dict]

Returns:

List of dictionaries with the predictions.

get_resultDictsWithNames()

Returns a list of dictionaries with the predictions for each network. The keys of the dictionaries are the names of the networks and the names of the input structures. The order of the dictionaries is the same as the order of the input structures passed through runModels() functions. Note that this function requires self.inputFiles to be set. This is done automatically by runFromDirectory() and runFromDirectory_dilute(), but not by runModels() or runModels_dilute(): there the input structures are passed directly to the function, so names have to be provided separately by assigning them to self.inputFiles.

Return type:

List[dict]

Returns:

List of dictionaries with the predictions.
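
The difference between the two result formats can be sketched with plain Python. This is a stand-in, not the package's code: result_dicts is a hypothetical helper, the "name" key for the structure name is an assumption, and the network names and values are made up.

```python
def result_dicts(to_run, predictions, input_files=None):
    """Pair each structure's predictions with the network names.

    When `input_files` is given, each dictionary also carries the structure
    name under the key "name" (an assumed key, chosen for illustration).
    """
    dicts = [dict(zip(to_run, preds)) for preds in predictions]
    if input_files is not None:
        if len(input_files) != len(predictions):
            raise ValueError("need exactly one name per structure")
        dicts = [{"name": n, **d} for n, d in zip(input_files, dicts)]
    return dicts

to_run = ["netA", "netB"]                      # placeholder network names
predictions = [[0.12, 0.15], [-0.30, -0.28]]   # one row per structure
plain = result_dicts(to_run, predictions)
named = result_dicts(to_run, predictions, ["1-Fe.cif", "2-Al.cif"])
```

After a call to runModels(), the equivalent of the second form requires assigning a list of names to the Calculator's inputFiles attribute first, as noted above.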

loadModelCustom(networkName, modelName, descriptor, modelDirectory='.')

Load a custom ONNX model from a custom directory specified by the user. The primary use case for this function is to load models that are not included in the package and cannot be placed in the package directory because of write permissions (e.g. on restrictive HPC systems) or storage allocations.

Parameters:
  • modelDirectory (str) – Directory where the model is located. Defaults to the current directory.

  • networkName (str) – Name of the network. This is the name used to refer to the ONNX network. It has to be unique, not contain any spaces, and correspond to the name of the ONNX file (excluding the .onnx extension).

  • modelName (str) – Name of the model. This is the name that will be displayed in the model selection menu. It can be any string desired.

  • descriptor (str) – Descriptor/feature vector used by the model. pySIPFENN currently supports the following descriptors: KS2022, KS2022_dilute, and Ward2017.

Return type:

None
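
Before calling loadModelCustom(), it can help to verify the naming constraints listed above. The helper below is a hypothetical pre-check, not part of pySIPFENN; its loaded argument stands in for the names of networks that are already registered.

```python
import tempfile
from pathlib import Path

def check_custom_model(networkName, modelDirectory=".", loaded=()):
    """Return the expected ONNX path or raise ValueError for a bad name."""
    if any(ch.isspace() for ch in networkName):
        raise ValueError("networkName must not contain spaces")
    if networkName in loaded:
        raise ValueError("networkName must be unique")
    # The file name must match the network name, without the .onnx extension.
    path = Path(modelDirectory) / f"{networkName}.onnx"
    if not path.is_file():
        raise ValueError(f"expected ONNX file at {path}")
    return path

# Demonstration with an empty placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "myNet.onnx").touch()
    found = check_custom_model("myNet", tmp).name
```

If the check passes, the same networkName and modelDirectory can be passed to loadModelCustom() together with a display name and one of the supported descriptors.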

loadModels(network='all')

Load model/models into memory of the Calculator class. The models are loaded from the modelsSIPFENN directory inside the package. Its location can be seen by calling print() on the Calculator. The models are stored in the loadedModels attribute as a dictionary with the network string as key and the PyTorch model as value.

Note

This function only works with models that are stored in the modelsSIPFENN directory inside the package, are in ONNX format, and have corresponding entries in models.json. For all others, you will need to use loadModelCustom().

Parameters:

network (str) – Default is ‘all’, which loads all models detected as available. Alternatively, a specific model can be loaded by its corresponding key in models.json. E.g. ‘SIPFENN_Krajewski2020_NN9’ or ‘SIPFENN_Krajewski2022_NN30’. The key is the same as the network argument in downloadModels().

Return type:

None

makePredictions(models, toRun, dataInList)

Makes predictions using PyTorch networks listed in toRun and provided in models dictionary.

Parameters:
  • models (Dict[str, Module]) – Dictionary of models to use. Keys are network names and values are PyTorch models.

  • toRun (List[str]) – List of networks to run. Must be a subset of models.keys().

  • dataInList (List[Union[List[float], array]]) – List of data to make predictions for. Each element of the list should be a descriptor accepted by all networks in toRun. Can be a list of lists of floats or a list of numpy arrays.

Return type:

List[list]

Returns:

List of predictions. Each element of the list is a list of predictions for all networks that were run. The order of the predictions is the same as the order of the networks in toRun.
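
The nesting of the returned list (outer index: input descriptor, inner index: network in toRun) can be illustrated without PyTorch. In this sketch plain callables stand in for the models, and make_predictions is a hypothetical stand-in that only reproduces the documented output shape and ordering.

```python
def make_predictions(models, toRun, dataInList):
    """Apply each network in toRun to each descriptor, keeping both orders."""
    missing = [net for net in toRun if net not in models]
    if missing:
        raise KeyError(f"toRun is not a subset of models: {missing}")
    return [[models[net](descriptor) for net in toRun]
            for descriptor in dataInList]

# Plain callables stand in for PyTorch models; names and values are made up.
models = {"sum": sum, "max": max, "unused": len}
out = make_predictions(models, ["max", "sum"], [[1.0, 2.0], [3.0, 0.5]])
```

Here out[0] holds the results for the first descriptor, ordered as in toRun ("max" first, then "sum"), and the model not listed in toRun is never called.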

makePredictions_legacyMxNet(mxnet_networks, dataInList)

Loads legacy MxNet network models and makes predictions using them. Requires mxnet to be installed. This may be challenging on Windows systems but is easy on Linux systems.

Parameters:
  • mxnet_networks (List[str]) – List of networks to use.

  • dataInList (List[List[float]]) – List of data to make predictions for. Each element of the list should be a list of descriptors.

Return type:

List[list]

Returns:

List of predictions. Each element of the list is a list of predictions for all networks that were run.

Note

MxNet support is deprecated as of V0.10.0 and will be removed in future versions. Compatibility with legacy networks is not guaranteed. Use at your own risk.

runFromDirectory(directory, descriptor, mode='serial', max_workers=4)

Runs all loaded models on a list of Structures it automatically imports from a specified directory. The directory must contain only atomic structures in formats such as ‘poscar’, ‘cif’, ‘json’, ‘mcsqs’, etc., or a mix of these. The structures are automatically sorted using the natsort library, so the order of the structures in the directory, as defined by the operating system, is not important. Natural sorting, for example, will sort the structures in the following order: ‘1-Fe’, ‘2-Al’, ‘10-xx’, ‘11-xx’, ‘20-xx’, ‘21-xx’, ‘11111-xx’, etc. This is useful when the structures are named using a numbering system. The order of the predictions is the same as the order of the input structures. The order of the networks in a prediction is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.

Parameters:
  • directory (str) – Directory containing the structures to run the models on. The directory must contain only atomic structures in formats such as ‘poscar’, ‘cif’, ‘json’, ‘mcsqs’, etc., or a mix of these. The structures are automatically sorted as described above.

  • descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors.

  • mode (str) – Computation mode. ‘serial’ or ‘parallel’. Default is ‘serial’. Parallel mode is not recommended for small datasets.

  • max_workers (int) – Number of workers to use in parallel mode. Default is 4. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.

Return type:

List[list]

Returns:

List of predictions. Each element of the list is a list of predictions for all networks that were run. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.
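
To see why natural sorting matters for numbered file names, the sketch below compares it with plain lexicographic sorting. The key function is a small standard-library stand-in written for this example, not the natsort library itself; the file names are the ones used in the description above.

```python
import re

def natural_key(name):
    """Split a name into text and digit chunks so numbers compare as integers."""
    # re.split with a capturing group alternates text (even index) and
    # digit runs (odd index), so only odd-index chunks are converted.
    chunks = re.split(r"(\d+)", name)
    return [int(c) if i % 2 else c.lower() for i, c in enumerate(chunks)]

names = ["10-xx", "1-Fe", "11111-xx", "2-Al", "21-xx", "11-xx", "20-xx"]
lexicographic = sorted(names)
natural = sorted(names, key=natural_key)
```

With plain sorting, '10-xx' lands directly after '1-Fe' and before '2-Al'; with the natural key, the names follow their numeric prefixes.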

runFromDirectory_dilute(directory, descriptor, baseStruct='pure', mode='serial', max_workers=8)

Runs all loaded models on a list of dilute Structures it automatically imports from a specified directory. The directory must contain only atomic structures in formats such as ‘poscar’, ‘cif’, ‘json’, ‘mcsqs’, etc., or a mix of these. The structures are automatically sorted using the natsort library, so the order of the structures in the directory, as defined by the operating system, is not important. Natural sorting, for example, will sort the structures in the following order: ‘1-Fe’, ‘2-Al’, ‘10-xx’, ‘11-xx’, ‘20-xx’, ‘21-xx’, ‘11111-xx’, etc. This is useful when the structures are named using a numbering system. The order of the predictions is the same as the order of the input structures. The order of the networks in a prediction is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.

Parameters:
  • directory (str) – Directory containing the structures to run the models on. The directory must contain only atomic structures in formats such as ‘poscar’, ‘cif’, ‘json’, ‘mcsqs’, etc., or a mix of these. The structures are automatically sorted as described above. The structures must be dilute structures, i.e. they must contain only one alloying element.

  • descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors.

  • baseStruct (Union[str, List[Structure]]) – Non-diluted references for the dilute structures. Defaults to ‘pure’, which assumes that the structures are based on pure elements and generates references automatically. Alternatively, a list of structures can be provided, which can be either pure elements or custom base structures (e.g. TCP endmember configurations).

  • mode (str) – Computation mode. ‘serial’ or ‘parallel’. Default is ‘serial’. Parallel mode is not recommended for small datasets.

  • max_workers (int) – Number of workers to use in parallel mode. Default is 8. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.

Return type:

List[list]

Returns:

List of predictions. Each element of the list is a list of predictions for all networks that were run. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.

runModels(descriptor, structList, mode='serial', max_workers=4)

Runs all loaded models on a list of Structures using the specified descriptor. Supports serial and parallel computation modes. In parallel mode, max_workers sets the number of processes that featurize the structures (90-99+% of the computational cost); the models are then run in series.

Parameters:
  • descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions to see available modules or add yours. Available default descriptors are: ‘Ward2017’, ‘KS2022’.

  • structList (List[Structure]) – List of pymatgen Structure objects to run the models on.

  • mode (str) – Computation mode. ‘serial’ or ‘parallel’. Default is ‘serial’. Parallel mode is not recommended for small datasets.

  • max_workers (int) – Number of workers to use in parallel mode. Default is 4. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.

Return type:

List[list]

Returns:

List of predictions. Each element of the list is a list of predictions for all networks that were run. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list. If a network is not compatible with the selected descriptor, it will not be included in the list.
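
The split between the two modes can be sketched as follows: featurization is mapped over the structures by a pool of workers, and the result keeps the input order either way. This is an illustration under stated assumptions, not the package's code: featurize_all is a hypothetical helper, and a thread pool is swapped in for the worker processes described above so that the example stays self-contained.

```python
from concurrent.futures import ThreadPoolExecutor

def featurize_all(structList, featurize, mode="serial", max_workers=4):
    """Featurize every structure, preserving input order in both modes."""
    if mode == "serial":
        return [featurize(s) for s in structList]
    if mode == "parallel":
        # 0 is treated as a single worker; None leaves the pool size to the
        # executor's default.
        workers = 1 if max_workers == 0 else max_workers
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() yields results in input order, not completion order.
            return list(pool.map(featurize, structList))
    raise ValueError("mode must be 'serial' or 'parallel'")

structs = ["Fe", "FeNi", "Al3Ni"]  # placeholder "structures"
serial = featurize_all(structs, len)
parallel = featurize_all(structs, len, mode="parallel", max_workers=2)
```

Because the pool adds start-up overhead, the serial path tends to be the better choice for small inputs, which matches the recommendation in the parameter description.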

runModels_dilute(descriptor, structList, baseStruct='pure', mode='serial', max_workers=4)

Runs all loaded models on a list of Structures using the specified descriptor. A critical difference from runModels() is that this function supports the KS2022_dilute descriptor, which can only be used on dilute structures (based either on pure elements or on custom base structures, e.g. TCP endmember configurations) that contain a single alloying atom. Speed increases are substantial compared to the KS2022 descriptor, which is more general and can be used on any structure. Supports serial and parallel modes in the same way as runModels().

Parameters:
  • descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions to see available modules or add yours. The available default descriptor is ‘KS2022_dilute’. ‘KS2022’ should also work, but is not recommended, as it negates the speed increase of the dilute descriptor.

  • structList (List[Structure]) – List of pymatgen Structure objects to run the models on. Must be dilute structures as described above.

  • baseStruct (Union[str, List[Structure]]) – Non-diluted references for the dilute structures. Defaults to ‘pure’, which assumes that the structures are based on pure elements and generates references automatically. Alternatively, a list of structures can be provided, which can be either pure elements or custom base structures (e.g. TCP endmember configurations).

  • mode (str) – Computation mode. ‘serial’ or ‘parallel’. Default is ‘serial’. Parallel mode is not recommended for small datasets.

  • max_workers (int) – Number of workers to use in parallel mode. Default is 4. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.

Return type:

List[list]

Returns:

List of predictions. Each element of the list is a list of predictions for all networks that were run. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list. If a network is not compatible with the selected descriptor, it will not be included in the list.

updateModelAvailability()

Updates availability of models based on the pysipfenn.modelsSIPFENN directory contents. Works only for current ONNX model definitions. Legacy support for MxNet models is retained in other functions, but they have to be manually added here.

Return type:

None

writeDescriptorsToCSV(descriptor, file='descriptorData.csv')

Writes the descriptor data to a CSV file. The first column is the name of the structure. If the self.inputFiles attribute is populated automatically by runFromDirectory() or set manually, the names of the structures will be used. Otherwise, the names will be ‘1’, ‘2’, ‘3’, etc. The remaining columns are the descriptor values. The order of the columns is the same as the order of the labels in the descriptor definition file.

Parameters:
  • descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors, such as ‘KS2022’ and ‘Ward2017’.

  • file (str) – Name of the file to write the results to. If the file already exists, it will be overwritten. If the file does not exist, it will be created. The file must have a ‘.csv’ extension to be recognized correctly.

Return type:

None

writeResultsToCSV(file)

Writes the results to a CSV file. The first column is the name of the structure. If the self.inputFiles attribute is populated automatically by runFromDirectory() or set manually, the names of the structures will be used. Otherwise, the names will be ‘1’, ‘2’, ‘3’, etc. The remaining columns are the predictions for each network. The order of the columns is the same as the order of the networks in self.network_list_available.

Parameters:

file (str) – Name of the file to write the results to. If the file already exists, it will be overwritten. If the file does not exist, it will be created. The file must have a ‘.csv’ extension to be recognized correctly.

Return type:

None
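
The column layout described for the results file can be sketched with the standard csv module. This is an approximation for illustration only: results_to_csv is a hypothetical helper, the header label "name" is an assumption, and the network names and values are made up.

```python
import csv
import io

def results_to_csv(to_run, predictions, input_files=None):
    """Return CSV text with one row per structure and one column per network."""
    # Fall back to '1', '2', '3', ... when no structure names are available.
    names = input_files or [str(i) for i in range(1, len(predictions) + 1)]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", *to_run])  # header label "name" is assumed
    for name, preds in zip(names, predictions):
        writer.writerow([name, *preds])
    return buffer.getvalue()

text = results_to_csv(["netA", "netB"], [[0.12, 0.15], [-0.3, -0.28]])
```

Passing input_files=['1-Fe.cif', '2-Al.cif'] would replace the numeric fallback names in the first column.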

pysipfenn.showDocs()

Open the offline documentation in a web browser.

pysipfenn.ward2ks2022(ward2017)

Converts a Ward2017 descriptor to a KS2022 descriptor (which is a subset of it).

Parameters:

ward2017 (ndarray) – Ward2017 descriptor. Must be a 1D NumPy array of length 271.

Return type:

ndarray

Returns:

KS2022 descriptor array.
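
Since KS2022 is described as a subset of Ward2017, the conversion amounts to selecting positions from the longer vector. The sketch below shows that idea with plain lists; select_subset is a hypothetical helper, and the positions in keep are made up and do not reflect the real mapping between the two descriptors.

```python
WARD2017_LENGTH = 271  # length required by the parameter description

def select_subset(ward2017, keep):
    """Return the entries of a Ward2017-length vector at positions `keep`."""
    if len(ward2017) != WARD2017_LENGTH:
        raise ValueError(
            f"expected {WARD2017_LENGTH} values, got {len(ward2017)}")
    return [ward2017[i] for i in keep]

ward = [float(i) for i in range(WARD2017_LENGTH)]  # made-up descriptor values
subset = select_subset(ward, keep=[0, 1, 5, 270])  # hypothetical positions
```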