pySIPFENN Core

pySIPFENN

class Calculator(autoLoad=True, verbose=True)[source]

Bases: object

pySIPFENN Calculator automatically initializes all functionalities including identification and loading of all available models defined statically in the models.json file. It exposes methods for calculating predefined structure-informed descriptors (feature vectors) and predicting properties using models that utilize them.

Parameters:
  • autoLoad (bool) – Automatically load all available ML models based on the models.json file. This will require significant memory and time if they are available, so for featurization and other non-model-requiring tasks, it is recommended to set this to False. Defaults to True.

  • verbose (bool) – Print initialization messages and several other non-critical messages during runtime procedures. Defaults to True.
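
A minimal usage sketch (assuming pySIPFENN is installed; the second call requires the default models to be present on disk):

    from pysipfenn import Calculator

    # Featurization-only use: skip loading the ML models to save memory and time.
    c = Calculator(autoLoad=False, verbose=True)

    # Full use: load every model listed in models.json that is present on disk.
    c = Calculator(autoLoad=True)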

models[source]

Dictionary with all model information based on the models.json file in the modelsSIPFENN directory. The keys are the network names and the values are dictionaries with the model information.

loadedModels[source]

Dictionary with all loaded models. The keys are the network names and the values are the loaded pytorch models.

descriptorData[source]

List of all descriptor data created during the last predictions run. The order of the list corresponds to the order of atomic structures given to models as input. The order of the list of descriptor data for each structure corresponds to the order of networks in the toRun list.

predictions[source]

List of all predictions created during the last predictions run. The order of the list corresponds to the order of atomic structures given to models as input. The order of the list of predictions for each structure corresponds to the order of networks in the toRun list.

inputFiles[source]

List of all input file names used during the last predictions run. The order of the list corresponds to the order of atomic structures given to models as input.

appendPrototypeLibrary(customPath)[source]

Parses a custom prototype library YAML file and permanently appends it to the internal prototypeLibrary of the pySIPFENN package. The appended prototypes are persisted for future use and, by default, are loaded automatically when the Calculator object is instantiated, similar to your custom models.

Parameters:

customPath (str) – Path to the prototype library YAML file to be appended to the internal self.prototypeLibrary of the Calculator object.

Return type:

None

Returns:

None
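
A minimal sketch; "myPrototypes.yaml" is a hypothetical file following the same YAML schema as the default prototype library shipped in pysipfenn/misc:

    from pysipfenn import Calculator

    c = Calculator(autoLoad=False)
    # Permanently appends the custom prototypes; they will auto-load with future Calculator instances
    c.appendPrototypeLibrary(customPath="myPrototypes.yaml")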

calculate_KS2022(structList, mode='serial', max_workers=8)[source]

Calculates KS2022 descriptors for a list of structures. The calculation can be done in serial or parallel mode. In parallel mode, the number of workers can be specified. The results are stored in the descriptorData attribute. The function returns the list of descriptors as well.

Parameters:
  • structList (List[Structure]) – List of structures to calculate descriptors for. The structures must be initialized with the pymatgen Structure class.

  • mode (str) – Mode of calculation. Defaults to ‘serial’. Options are 'serial' and 'parallel'.

  • max_workers (int) – Number of workers to use in parallel mode. Defaults to 8. If None, the number of workers will be set to the number of available CPU cores. If set to 0, 1 worker will be used.

Return type:

list

Returns:

List of KS2022 descriptor (feature vector) for each structure.
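
For example, a minimal sketch featurizing a 2-atom BCC Fe cell built with pymatgen (the lattice parameter is illustrative):

    from pymatgen.core import Structure, Lattice
    from pysipfenn import Calculator

    c = Calculator(autoLoad=False)  # models are not needed for featurization alone
    bccFe = Structure(Lattice.cubic(2.87), ["Fe", "Fe"],
                      [[0, 0, 0], [0.5, 0.5, 0.5]])
    descriptors = c.calculate_KS2022([bccFe], mode="serial")
    print(len(descriptors))  # one KS2022 feature vector per input structure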

calculate_KS2022_dilute(structList, baseStruct='pure', mode='serial', max_workers=8)[source]

Calculates KS2022 descriptors for a list of dilute structures (either based on pure elements or on custom base structures, e.g. TCP endmember configurations) that contain a single alloying atom. Speed increases are substantial compared to the KS2022 descriptor, which is more general and can be used on any structure. The calculation can be done in serial or parallel mode. In parallel mode, the number of workers can be specified. The results are stored in the self.descriptorData attribute. The function returns the list of descriptors as well.

Parameters:
  • structList (List[Structure]) – List of structures to calculate descriptors for. The structures must be dilute structures (either based on pure elements or on custom base structures, e.g. TCP endmember configurations) that contain a single alloying atom. The structures must be initialized with the pymatgen Structure class.

  • baseStruct (Union[str, List[Structure]]) – Non-diluted references for the dilute structures. Defaults to 'pure', which assumes that the structures are based on pure elements and generates references automatically. Alternatively, a list of structures can be provided, which can be either pure elements or custom base structures (e.g. TCP endmember configurations).

  • mode (str) – Mode of calculation. Defaults to 'serial'. Options are 'serial' and 'parallel'.

  • max_workers (int) – Number of workers to use in parallel mode. Defaults to 8. If None, the number of workers will be set to the number of available CPU cores. If set to 0, 1 worker will be used.

Return type:

List[ndarray]

Returns:

List of KS2022 descriptor (feature vector) np.ndarray for each structure.
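
A minimal sketch, assuming a 16-atom BCC Fe supercell with a single Cr substitution as the dilute structure:

    from pymatgen.core import Structure, Lattice
    from pysipfenn import Calculator

    c = Calculator(autoLoad=False)
    base = Structure(Lattice.cubic(2.87), ["Fe", "Fe"],
                     [[0, 0, 0], [0.5, 0.5, 0.5]])
    base.make_supercell([2, 2, 2])   # 16-atom pure Fe reference lattice
    dilute = base.copy()
    dilute[0] = "Cr"                 # single alloying atom makes it a dilute structure

    # 'pure' tells pySIPFENN to generate the non-diluted references automatically
    descriptors = c.calculate_KS2022_dilute([dilute], baseStruct="pure", mode="serial")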

calculate_KS2022_randomSolutions(baseStructList, compList, minimumSitesPerExpansion=50, featureConvergenceCriterion=0.005, compositionConvergenceCriterion=0.01, minimumElementOccurrences=10, plotParameters=False, printProgress=False, mode='serial', max_workers=8)[source]

Calculates KS2022 descriptors corresponding to random solid solutions occupying base structure / lattice sites for a list of compositions through the method described in the descriptorDefinitions.KS2022_randomSolutions submodule. The results are stored in the descriptorData attribute. The function returns the list of descriptors in numpy format as well.

Parameters:
  • baseStructList (Union[str, Structure, List[str], List[Structure], List[Union[Composition, str]]]) – The base structure to generate a random solid solution (RSS). It does _not_ need to be a simple Bravais lattice, such as the BCC lattice, but can be any Structure object, or a list of them if you need to define them on a per-case basis. In addition to Structure objects, you can use “magic” strings corresponding to one of the structures in the library you can find under the pysipfenn.misc directory or loaded under the self.prototypeLibrary attribute. The magic strings include, but are not limited to: 'BCC', 'FCC', 'HCP', 'DHCP', 'Diamond', and so on. You can invoke them by their name, e.g. BCC, or by passing self.prototypeLibrary['BCC']['structure'] directly. If you pass a list to baseStructList, you are allowed to mix-and-match Structure objects and magic strings.

  • compList (Union[str, List[str], Composition, List[Composition], List[Union[Composition, str]]]) – The composition to populate the supercell with until KS2022 descriptor converges. You can use pymatgen’s Composition objects or strings of valid chemical formulas (symbol - atomic fraction pairs), like 'Fe0.5Ni0.3Cr0.2', 'Fe50 Ni30 Cr20', or 'Fe5 Ni3 Cr2'. You can either pass a single entity, in which case it will be used for all structures (use to run the same composition for different base structures), or a list of entities, in which case pairs will be used in the order of the list. If you pass a list to compList, you are allowed to mix-and-match Composition objects and composition strings.

  • minimumSitesPerExpansion (int) – The minimum number of sites that the base structure will be expanded to (doubling dimension-by-dimension) before it is used as an expansion step/batch in each iteration of adding local chemical environment information to the global ensemble. The optimal value will depend on the number of species and their relative fractions in the composition. Generally, low values (<20ish) will result in slower convergence, as some extreme local chemical environments will have a strong influence on the global ensemble, and too-high values (>150ish) will result in a needlessly slow computation for not-complex compositions, as at least two iterations will be processed. The default value is 50 and works well for simple cases.

  • featureConvergenceCriterion (float) – The maximum difference between any feature belonging to the current iteration (statistics based on the global ensemble of local chemical environments) and the previous iteration (before the last expansion), expressed as a fraction of the maximum value of each feature found in the OQMD database at the time of SIPFENN creation (see the KS2022_randomSolutions.maxFeaturesInOQMD array). The default value is 0.005, corresponding to 0.5% of the maximum value.

  • compositionConvergenceCriterion (float) – The maximum average difference between any element fraction belonging to the current composition (net of all expansions) and the target composition (comp). The default value is 0.01, corresponding to 1% deviation, whose interpretation will depend on the number of elements in the composition.

  • minimumElementOccurrences (int) – The minimum number of times all elements must occur in the composition before it is considered converged. This setting prevents the algorithm from converging before very dilute elements, like C in low-carbon steel, have had a chance to occur. The default value is 10.

  • plotParameters (bool) – If True, the convergence history will be plotted using plotly. The default value is False, but tracking the convergence parameters is recommended; they are accessible in the metas attribute of the Calculator under the key 'RSS'.

  • printProgress (bool) – If True, the progress will be printed to the console. The default value is False.

  • mode (str) – Mode of calculation. Options are serial (default) and parallel.

  • max_workers (int) – Number of workers to use in parallel mode. Defaults to 8.

Return type:

List[ndarray]

Returns:

A list of numpy.ndarray objects containing the KS2022 descriptor, just like the ordinary KS2022. Please note the stochastic nature of this algorithm. The result will likely vary slightly between runs and parameter choices, so if convergence is critical, verify it with a test matrix of minimumSitesPerExpansion, featureConvergenceCriterion, and compositionConvergenceCriterion values.
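
A minimal sketch (the composition and convergence settings are illustrative; remember the stochastic nature noted above):

    from pysipfenn import Calculator

    c = Calculator(autoLoad=False)
    descriptors = c.calculate_KS2022_randomSolutions(
        "BCC",               # "magic" string resolved through the prototype library
        "Fe70 Ni20 Cr10",    # composition string parsed into a pymatgen Composition
        featureConvergenceCriterion=0.005,
        compositionConvergenceCriterion=0.01,
        printProgress=True,
        mode="serial",
    )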

calculate_Ward2017(structList, mode='serial', max_workers=4)[source]

Calculates Ward2017 descriptors for a list of structures. The calculation can be done in serial or parallel mode. In parallel mode, the number of workers can be specified. The results are stored in the self.descriptorData attribute. The function returns the list of descriptors as well.

Parameters:
  • structList (List[Structure]) – List of structures to calculate descriptors for. The structures must be initialized with the pymatgen Structure class.

  • mode (str) – Mode of calculation. Defaults to ‘serial’. Options are 'serial' and 'parallel'.

  • max_workers (int) – Number of workers to use in parallel mode. Defaults to 4. If None, the number of workers will be set to the number of available CPU cores. If set to 0, 1 worker will be used.

Return type:

list

Returns:

List of Ward2017 descriptor (feature vector) for each structure.

destroy()[source]

Deallocates all loaded models and clears all data from the Calculator object.

Return type:

None

downloadModels(network='all')[source]

Downloads ONNX models. By default, all available models are downloaded. If a model is already available on disk, it is skipped. If a specific network is given, only that network is downloaded, possibly overwriting the existing one. If the network name is not recognized, an error message will be printed.

Parameters:

network (str) – Name of the network to download. Defaults to 'all'.

Return type:

None

findCompatibleModels(descriptor)[source]

Finds all models compatible with a given descriptor based on the descriptor definitions loaded from the models.json file.

Parameters:

descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions to see available modules or add yours. Available default descriptors are: 'Ward2017', 'KS2022'.

Return type:

List[str]

Returns:

List of strings corresponding to compatible models.

get_resultDicts()[source]

Returns a list of dictionaries with the predictions for each network. The keys of the dictionaries are the names of the networks. The order of the dictionaries is the same as the order of the input structures passed through runModels() functions.

Return type:

List[dict]

Returns:

List of dictionaries with the predictions.

get_resultDictsWithNames()[source]

Returns a list of dictionaries with the predictions for each network. The keys of the dictionaries are the names of the networks and the names of the input structures. The order of the dictionaries is the same as the order of the input structures passed through the runModels() functions. Note that this function requires self.inputFiles to be set; this happens automatically when using runFromDirectory() or runFromDirectory_dilute(), but not when using runModels() or runModels_dilute(), as in those cases the input structures are passed directly to the function and the names have to be provided separately by assigning them to self.inputFiles.

Return type:

List[dict]

Returns:

List of dictionaries with the predictions.
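
A minimal sketch, assuming predictions were already generated with runModels(), so the names have to be supplied manually (the labels below are hypothetical):

    # One label per structure passed to runModels(), in the same order
    c.inputFiles = ["caseA", "caseB", "caseC"]
    resultDicts = c.get_resultDictsWithNames()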

loadModelCustom(networkName, modelName, descriptor, modelDirectory='.')[source]

Load a custom ONNX model from a custom directory specified by the user. The primary use case for this function is to load models that are not included in the package and cannot be placed in the package directory because of write permissions (e.g. on restrictive HPC systems) or storage allocations.

Parameters:
  • modelDirectory (str) – Directory where the model is located. Defaults to the current directory.

  • networkName (str) – Name of the network. This is the name used to refer to the ONNX network. It has to be unique, not contain any spaces, and correspond to the name of the ONNX file (excluding the .onnx extension).

  • modelName (str) – Name of the model. This is the name that will be displayed in the model selection menu. It can be any string desired.

  • descriptor (str) – Descriptor/feature vector used by the model. pySIPFENN currently supports the following descriptors: 'KS2022', and 'Ward2017'.

Return type:

None
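
A minimal sketch; the directory, file, and model names below are hypothetical:

    from pysipfenn import Calculator

    c = Calculator(autoLoad=False)
    # Expects "/scratch/myModels/MyNet.onnx" to exist and to accept KS2022 feature vectors
    c.loadModelCustom(
        networkName="MyNet",
        modelName="My custom formation energy model",
        descriptor="KS2022",
        modelDirectory="/scratch/myModels",
    )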

loadModels(network='all')[source]

Load model/models into memory of the Calculator class. The models are loaded from the modelsSIPFENN directory inside the package. Its location can be seen by calling print() on the Calculator. The models are stored in the self.loadedModels attribute as a dictionary with the network string as key and the PyTorch model as value.

Note:

This function only works with models that are stored in the modelsSIPFENN directory inside the package, are in ONNX format, and have corresponding entries in models.json. For all others, you will need to use loadModelCustom().

Parameters:

network (str) – Default is 'all', which loads all models detected as available. Alternatively, a specific model can be loaded by its corresponding key in models.json. E.g. 'SIPFENN_Krajewski2020_NN9' or 'SIPFENN_Krajewski2022_NN30'. The key is the same as the network argument in downloadModels().

Raises:

ValueError – If the network name is not recognized or if the model is not available in the modelsSIPFENN directory.

Return type:

None

Returns:

None. It updates the loadedModels attribute of the Calculator class.
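
A minimal sketch using one of the network keys mentioned above (defined in models.json):

    from pysipfenn import Calculator

    c = Calculator(autoLoad=False)
    c.downloadModels(network="SIPFENN_Krajewski2022_NN30")  # fetch just this network
    c.loadModels(network="SIPFENN_Krajewski2022_NN30")      # then load it into memory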

makePredictions(models, toRun, dataInList)[source]

Makes predictions using PyTorch networks listed in toRun and provided in models dictionary. Shared among all “predict” functions.

Parameters:
  • models (Dict[str, Module]) – Dictionary of models to use. Keys are network names and values are PyTorch models loaded from ONNX with loadModels() / loadModelCustom() or manually (fairly simple!).

  • toRun (List[str]) – List of networks to run. It must be a subset of models.keys().

  • dataInList (List[Union[List[float], array]]) – List of data to make predictions for. Each element of the list should be a descriptor accepted by all networks in toRun. It can be a list of lists of floats or a list of numpy ndarray objects.

Return type:

List[list]

Returns:

List of predictions. Each element of the list is a list of predictions for all run networks. The order of the predictions is the same as the order of the networks in toRun.

parsePrototypeLibrary(customPath='default', verbose=False, printCustomLibrary=False)[source]

Parses the prototype library YAML file in the misc directory, interprets them into pymatgen Structure objects, and stores them in the self.prototypeLibrary dict attribute of the Calculator object. You can use it also to temporarily append a custom prototype library (by providing a path) which will live as long as the Calculator. For permanent changes, use appendPrototypeLibrary().

Parameters:
  • customPath (str) – Path to the prototype library YAML file. Defaults to the magic string "default", which loads the default prototype library included in the package in the misc directory.

  • verbose (bool) – If True, it prints the number of prototypes loaded. Defaults to False, but note that Calculator class automatically initializes with verbose=True.

  • printCustomLibrary (bool) – If True, it prints the name and POSCAR of each prototype being added to the prototype library. Has no effect if customPath is 'default'. Defaults to False.

Return type:

None

Returns:

None

runFromDirectory(directory, descriptor, mode='serial', max_workers=4)[source]

Runs all loaded models on a list of Structures it automatically imports from a specified directory. The directory must contain only atomic structures in formats such as 'poscar', 'cif', 'json', 'mcsqs', etc., or a mix of these. The structures are automatically sorted using natsort library, so the order of the structures in the directory, as defined by the operating system, is not important. Natural sorting, for example, will sort the structures in the following order: '1-Fe', '2-Al', '10-xx', '11-xx', '20-xx', '21-xx', '11111-xx', etc. This is useful when the structures are named using a numbering system. The order of the predictions is the same as the order of the input structures. The order of the networks in a prediction is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.

Parameters:
  • directory (str) – Directory containing the structures to run the models on. The directory must contain only atomic structures in formats such as 'poscar', 'cif', 'json', 'mcsqs', etc., or a mix of these. The structures are automatically sorted as described above.

  • descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors.

  • mode (str) – Computation mode. 'serial' or 'parallel'. Default is 'serial'. Parallel mode is not recommended for small datasets.

  • max_workers (int) – Number of workers to use in parallel mode. Default is 4. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.

Return type:

List[list]

Returns:

List of predictions. Each element of the list is a list of predictions for all run networks. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.
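
A minimal sketch; 'myStructures' is a hypothetical directory containing only structure files (POSCAR, CIF, etc.):

    from pysipfenn import Calculator

    c = Calculator()  # autoLoad=True, so the models are ready to run
    predictions = c.runFromDirectory(directory="myStructures", descriptor="KS2022")
    results = c.get_resultDictsWithNames()  # names come from the imported file names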

runFromDirectory_dilute(directory, descriptor, baseStruct='pure', mode='serial', max_workers=8)[source]

Runs all loaded models on a list of dilute Structures it automatically imports from a specified directory. The directory must contain only atomic structures in formats such as 'poscar', 'cif', 'json', 'mcsqs', etc., or a mix of these. The structures are automatically sorted using natsort library, so the order of the structures in the directory, as defined by the operating system, is not important. Natural sorting, for example, will sort the structures in the following order: '1-Fe', '2-Al', '10-xx', '11-xx', '20-xx', '21-xx', '11111-xx', etc. This is useful when the structures are named using a numbering system. The order of the predictions is the same as the order of the input structures. The order of the networks in a prediction is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.

Parameters:
  • directory (str) – Directory containing the structures to run the models on. The directory must contain only atomic structures in formats such as 'poscar', 'cif', 'json', 'mcsqs', etc., or a mix of these. The structures are automatically sorted as described above. The structures must be dilute structures, i.e. they must contain only one alloying element.

  • descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors.

  • baseStruct (str) – Non-diluted references for the dilute structures. Defaults to 'pure', which assumes that the structures are based on pure elements and generates references automatically. Alternatively, a list of structures can be provided, which can be either pure elements or custom base structures (e.g. TCP endmember configurations).

  • mode (str) – Computation mode. 'serial' or 'parallel'. Default is 'serial'. Parallel mode is not recommended for small datasets.

  • max_workers (int) – Number of workers to use in parallel mode. Default is 8. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.

Return type:

None

Returns:

List of predictions. Each element of the list is a list of predictions for all run networks. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.

runModels(descriptor, structList, mode='serial', max_workers=4)[source]

Runs all loaded models on a list of Structures using the specified descriptor. Supports serial and parallel computation modes. If parallel is selected, max_workers determines the number of processes handling the featurization of structures (90-99+% of the computational intensity); the models are then run in series.

Parameters:
  • descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions to see available modules or add yours. Available default descriptors are: 'Ward2017', 'KS2022'.

  • structList (List[Structure]) – List of pymatgen Structure objects to run the models on.

  • mode (str) – Computation mode. 'serial' or 'parallel'. Default is 'serial'. Parallel mode is not recommended for small datasets.

  • max_workers (int) – Number of workers to use in parallel mode. Default is 4. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.

Return type:

List[List[float]]

Returns:

List of predictions. Each element of the list is a list of predictions for all run networks. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list. If a network is not compatible with the selected descriptor, it will not be included in the list.
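
A minimal sketch of the typical prediction workflow (the structure is built with pymatgen for illustration):

    from pymatgen.core import Structure, Lattice
    from pysipfenn import Calculator

    c = Calculator()  # loads all available models
    bccFe = Structure(Lattice.cubic(2.87), ["Fe", "Fe"],
                      [[0, 0, 0], [0.5, 0.5, 0.5]])
    predictions = c.runModels(descriptor="KS2022", structList=[bccFe], mode="serial")
    print(c.get_resultDicts())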

runModels_dilute(descriptor, structList, baseStruct='pure', mode='serial', max_workers=4)[source]

Runs all loaded models on a list of Structures using specified descriptor. A critical difference from runModels() is that this function will call dilute-specific featurizer, e.g. KS2022_dilute when 'KS2022' is provided as input, which can only be used on dilute structures (both based on pure elements and on custom base structures, e.g. TCP endmember configurations) that contain a single alloying atom. Speed increases are substantial compared to the KS2022 descriptor, which is more general and can be used on any structure. Supports serial and parallel modes in the same way as runModels().

Parameters:
  • descriptor (str) – Descriptor to use for predictions. Must be one of the descriptors which support the dilute structures (i.e. *_dilute). See pysipfenn.descriptorDefinitions to see available modules or add yours here. Available default dilute descriptors are now: 'KS2022'. The 'KS2022' can also be called from runModels() function, but is not recommended for dilute alloys, as it negates the speed increase of the dilute structure featurizer.

  • structList (List[Structure]) – List of pymatgen Structure objects to run the models on. Must be dilute structures as described above.

  • baseStruct (Union[str, List[Structure]]) – Non-diluted references for the dilute structures. Defaults to ‘pure’, which assumes that the structures are based on pure elements and generates references automatically. Alternatively, a list of structures can be provided, which can be either pure elements or custom base structures (e.g. TCP endmember configurations).

  • mode (str) – Computation mode. 'serial' or 'parallel'. Default is 'serial'. Parallel mode is not recommended for small datasets.

  • max_workers (int) – Number of workers to use in parallel mode. Default is 4. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.

Return type:

List[List[float]]

Returns:

List of predictions. Each element of the list is a list of predictions for all run networks. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list. If a network is not compatible with the selected descriptor, it will not be included in the list.

runModels_randomSolutions(descriptor, baseStructList, compList, minimumSitesPerExpansion=50, featureConvergenceCriterion=0.005, compositionConvergenceCriterion=0.01, minimumElementOccurrences=10, plotParameters=False, printProgress=False, mode='serial', max_workers=8)[source]

A top-level convenience wrapper for the calculate_KS2022_randomSolutions function. It passes all the arguments to that function directly (except for descriptor) and uses its result to run all applicable models. The result is a list of predictions for all run networks.

Parameters:
  • descriptor (str) – Descriptor to use for predictions. Must be one of the descriptors which support the random solid solution structures. See pysipfenn.descriptorDefinitions to see available modules or add yours here. As of v0.15.0, the only available descriptor is 'KS2022' through its KS2022_randomSolutions submodule.

  • baseStructList (Union[str, Structure, List[str], List[Structure], List[Union[Composition, str]]]) – See calculate_KS2022_randomSolutions for details. You can mix-and-match Structure objects and magic strings, either individually (to use the same entity for all calculations) or in a list.

  • compList (Union[str, List[str], Composition, List[Composition], List[Union[Composition, str]]]) – See calculate_KS2022_randomSolutions for details. You can mix-and-match Composition objects and composition strings, either individually (to use the same entity for all calculations) or in a list.

  • minimumSitesPerExpansion (int) – See calculate_KS2022_randomSolutions.

  • featureConvergenceCriterion (float) – See calculate_KS2022_randomSolutions.

  • compositionConvergenceCriterion (float) – See calculate_KS2022_randomSolutions.

  • minimumElementOccurrences (int) – See calculate_KS2022_randomSolutions.

  • plotParameters (bool) – See calculate_KS2022_randomSolutions.

  • printProgress (bool) – See calculate_KS2022_randomSolutions.

  • mode (str) – Computation mode. 'serial' or 'parallel'. Default is 'serial'. Parallel mode is not recommended for small datasets.

  • max_workers (int) – Number of workers to use in parallel mode. Defaults to 8.

Return type:

List[List[float]]

Returns:

List of predictions. They will correspond to the order of the networks in self.toRun established by the findCompatibleModels() function. If a network is not available, it will not be included in the list.
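
A minimal sketch with illustrative inputs, assuming c is a Calculator with models loaded:

    predictions = c.runModels_randomSolutions(
        descriptor="KS2022",
        baseStructList="FCC",
        compList="Ni80 Cr20",
        mode="serial",
    )
    print(c.get_resultDicts())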

updateModelAvailability()[source]

Updates availability of models based on the pysipfenn.modelsSIPFENN directory contents. Works only for current ONNX model definitions.

Return type:

None

writeDescriptorsToCSV(descriptor, file='descriptorData.csv')[source]

Writes the descriptor data to a CSV file. The first column is the name of the structure. If the self.inputFiles attribute is populated automatically by runFromDirectory() or set manually, the names of the structures will be used. Otherwise, the names will be '1', '2', '3', etc. The remaining columns are the descriptor values. The order of the columns is the same as the order of the labels in the descriptor definition file.

Parameters:
  • descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors, such as 'KS2022' and 'Ward2017'. It provides the labels for the descriptor values.

  • file (str) – Name of the file to write the results to. If the file already exists, it will be overwritten. If the file does not exist, it will be created. The file must have a '.csv' extension to be recognized correctly.

Return type:

None

writeDescriptorsToNPY(descriptor, file='descriptorData.npy')[source]

Writes the descriptor data to a numpy file (.NPY). The order of the columns in the numpy array corresponds to the order of the labels in the descriptor definition files in the descriptorDefinitions directory. To match the data with the labels, load the labels from the descriptor definition file and use the same index to access the corresponding data in the numpy array. The order of the rows corresponds to the input structures of the last run.

Parameters:
  • descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors, such as 'KS2022' and 'Ward2017'.

  • file (str) – Name of the file to write the results to. If the file already exists, it will be overwritten. If the file does not exist, it will be created. The file must have a '.npy' extension to be recognized correctly. Default is 'descriptorData.npy'.

Return type:

None

writeResultsToCSV(file)[source]

Writes the results to a CSV file. The first column is the name of the structure. If the self.inputFiles attribute is populated automatically by runFromDirectory() or set manually, the names of the structures will be used. Otherwise, the names will be '1', '2', '3', etc. The remaining columns are the predictions for each network. The order of the columns is the same as the order of the networks in self.network_list_available.

Parameters:

file (str) – Name of the file to write the results to. If the file already exists, it will be overwritten. If the file does not exist, it will be created. The file must have a '.csv' extension to be recognized correctly.

Return type:

None
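
A minimal sketch persisting the outputs of the last run (file names are illustrative):

    # c is a Calculator after a successful runModels() / runFromDirectory() call
    c.writeResultsToCSV("results.csv")
    c.writeDescriptorsToCSV("KS2022", file="descriptorData.csv")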

overwritePrototypeLibrary(prototypeLibrary)[source]

Destructively overwrites the prototype library with a custom one. Used by the appendPrototypeLibrary() function to persist its changes. The other main use is to restore the default one to the original state based on a backup made earlier (see tests for an example).

Return type:

None

string2prototype(c, prototype)[source]

Converts a prototype string to a pymatgen Structure object.

Parameters:
  • c (Calculator) – Calculator object with the prototypeLibrary.

  • prototype (str) – Prototype string.

Return type:

Structure

Returns:

Structure object.

ward2ks2022(ward2017)[source]

Converts a Ward2017 descriptor to a KS2022 descriptor (which is its subset). It removes: mean_WCMagnitude_Shell1, mean_WCMagnitude_Shell2, mean_WCMagnitude_Shell3, mean_NeighDiff_shell1_SpaceGroupNumber, var_NeighDiff_shell1_SpaceGroupNumber, min_NeighDiff_shell1_SpaceGroupNumber, max_NeighDiff_shell1_SpaceGroupNumber, range_NeighDiff_shell1_SpaceGroupNumber, mean_SpaceGroupNumber, maxdiff_SpaceGroupNumber, dev_SpaceGroupNumber, max_SpaceGroupNumber, min_SpaceGroupNumber, most_SpaceGroupNumber, and CanFormIonic, for physicality and performance improvements.

Parameters:

ward2017 (ndarray) – Ward2017 descriptor. Must be a 1D np.ndarray of length 271.

Return type:

ndarray

Returns:

KS2022 descriptor array.

wrapper_KS2022_dilute_generate_descriptor(args)[source]

Wraps the KS2022_dilute.generate_descriptor function for parallel processing.

wrapper_KS2022_randomSolutions_generate_descriptor(args)[source]

Wraps the KS2022_randomSolutions.generate_descriptor function for parallel processing.

modelAdjusters

class LocalAdjuster(calculator, model, targetData, descriptorData=None, device='cpu', descriptor=None, useClearML=False, taskName='LocalFineTuning')[source]

Bases: object

Adjuster class taking a Calculator and operating on local data provided to model as a pair of descriptor data (provided in several ways) and target values (provided in several ways). It can then adjust the model with some predefined hyperparameters or run a fairly typical grid search, which can be interpreted manually or uploaded to the ClearML platform. Can use CPU, CUDA, or MPS (Mac M1) devices for training.

Parameters:
  • calculator (Calculator) – Instance of the Calculator class with the model to be adjusted, defined and loaded. It can contain the descriptor data already in it, so that it does not have to be provided separately.

  • model (str) – Name of the model to be adjusted in the Calculator. E.g., SIPFENN_Krajewski2022_NN30.

  • targetData (Union[str, ndarray]) – Target data to be used for training the model. It can be provided as a path to a NumPy .npy/.NPY or CSV .csv/.CSV file, or directly as a NumPy array. It has to be the same length as the descriptor data.

  • descriptorData (Union[None, str, ndarray]) – Descriptor data to be used for training the model. It can be left unspecified (None) to use the data in the Calculator, or provided as a path to a NumPy .npy/.NPY or CSV .csv/.CSV file, or directly as a NumPy array. It has to be the same length as the target data. Default is None.

  • device (Literal['cpu', 'cuda', 'mps']) – Device to be used for training the model. It has to be one of the following: "cpu", "cuda", or "mps". Default is "cpu".

  • descriptor (Optional[Literal['Ward2017', 'KS2022']]) – Name of the feature vector provided in the descriptorData. It can be optionally provided to check if the descriptor data is compatible.

  • useClearML (bool) – Whether to use the ClearML platform for logging the training process. Default is False.

  • taskName (str) – Name of the task to be used. Default is "LocalFineTuning".
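
A minimal instantiation sketch; the data file names are hypothetical, and LocalAdjuster is assumed to be importable from the top-level pysipfenn namespace (otherwise import it from the modelAdjusters module):

    from pysipfenn import Calculator, LocalAdjuster

    c = Calculator()  # loads the models, including the one to be adjusted
    adjuster = LocalAdjuster(
        calculator=c,
        model="SIPFENN_Krajewski2022_NN30",
        descriptorData="myDescriptors.npy",  # hypothetical local KS2022 feature vectors
        targetData="myTargets.csv",          # hypothetical matching target values
        descriptor="KS2022",
        device="cpu",
    )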

calculator[source]

Instance of the Calculator class being operated on.

model[source]

The original model to be adjusted.

adjustedModel[source]

A PyTorch model after the adjustment. Initially set to None.

descriptorData[source]

NumPy array with descriptor data to use as input for the model.

targetData[source]

NumPy array with target data to use as output for the model.

adjust(validation=0.2, learningRate=1e-05, epochs=50, batchSize=32, optimizer='Adam', weightDecay=1e-05, lossFunction='MAE', verbose=True)[source]

Takes the original model, copies it, and adjusts the model on the provided data. The adjusted model is stored in the adjustedModel attribute of the class and can then be persisted to the original Calculator or used for plotting. The default hyperparameters are selected for fine-tuning the model rather than retraining it, so as to adjust it slowly (1% of the typical learning rate) and not overfit it (50 epochs).

Parameters:
  • learningRate (float) – The learning rate to be used for the adjustment. Default is 1e-5, which is 1% of the typical learning rate of the Adam optimizer.

  • epochs (int) – The number of times to iterate over the data, i.e., how many times the model will see the data. Default is 50, which is on the higher side for fine-tuning. If the model does not retrain fast enough but already converged, consider lowering this number to reduce the time and possibly overfitting to the training data.

  • batchSize (int) – The number of points passed to the model at once. Default is 32, which is a typical batch size for smaller datasets. If the dataset is large, consider increasing this number to speed up the training.

  • optimizer (Literal['Adam', 'AdamW', 'Adamax', 'RMSprop']) – Algorithm to be used for optimization. Default is Adam, which is a good choice for most models and one of the most popular optimizers. Other options are AdamW, Adamax, and RMSprop.

  • lossFunction (Literal['MSE', 'MAE']) – Loss function to be used for optimization. Default is MAE (Mean Absolute Error / L1), which is more robust to outliers than MSE (Mean Squared Error).

  • validation (float) – Fraction of the data to be used for validation. Default is the common 0.2 (20% of the data). If set to 0, the model will be trained on the whole dataset without validation, and you will not be able to check for overfitting or gauge the model’s performance on unseen data.

  • weightDecay (float) – Weight decay to be used for optimization. Default is 1e-5, which should work well if the data is abundant enough relative to the model complexity. If the model is overfitting, consider increasing this number to regularize the model more.

  • verbose (bool) – Whether to print information, such as loss, during the training. Default is True.

Returns:

(1) the adjusted model, (2) training loss list of floats, and (3) validation loss list of floats. The adjusted model is also stored in the adjustedModel attribute of the class.

Return type:

A tuple with 3 elements
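
A minimal sketch continuing from the LocalAdjuster instance above (the hyperparameters shown are the documented defaults):

    adjustedModel, trainingLoss, validationLoss = adjuster.adjust(
        validation=0.2,
        learningRate=1e-5,
        epochs=50,
        batchSize=32,
        optimizer="Adam",
        lossFunction="MAE",
    )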

highlightCompositions(compositions)[source]

Highlights data points that correspond to certain chemical compositions, so that they can be distinguished at later steps. The strings you provide will be interpreted when matching to the data, so HfMo, Hf1Mo1, Hf2Mo2, and Hf50 Mo50 will all be considered equal. They will be plotted in red by plotStarting() and plotAdjusted(). Please note that this will be overwritten the next time you call adjust(), so you may need to perform it again.

Parameters:

compositions (List[str]) – A list of strings with chemical formulas. They will be interpreted, so any valid formula pointing to the same composition will be parsed in the same fashion. Currently, the composition needs to be exact, i.e. Hf33 Mo33 Ni33 will match to HfMoNi but Hf28.6 Mo71.4 will not match to Hf2 Mo5. This can be implemented if there is interest.

Return type:

None

highlightPoints(pointsIndices)[source]

Highlights data points at certain indices, so that they can be distinguished at later steps. They will be plotted in red by plotStarting() and plotAdjusted(). Please note that this will be overwritten the next time you call adjust(), so you may need to perform it again.

Parameters:

pointsIndices (List[int]) – A list of point indices to highlight. Please note that in Python lists indices start from 0.

Return type:

None

matrixHyperParameterSearch(validation=0.2, epochs=20, batchSize=64, lossFunction='MAE', learningRates=(1e-06, 1e-05, 0.0001), optimizers=('Adam', 'AdamW', 'Adamax'), weightDecays=(1e-05, 0.0001, 0.001), verbose=True, plot=True)[source]

Performs a grid search over the hyperparameters provided to find the best combination. By default, it will (a) plot the training history with plotly in your browser and (b) print the best hyperparameters found. If the ClearML platform was set to be used for logging (at the class initialization), the results will be uploaded there as well. If the default values are used, it will test 27 combinations of learning rates, optimizers, and weight decays. The method will then adjust the model to the best hyperparameters found, corresponding to the lowest validation loss if validation is used, or the lowest training loss if validation is not used (validation=0). Note that validation is used by default.

Parameters:
  • validation (float) – Same as in the adjust method. Default is 0.2.

  • epochs (int) – Same as in the adjust method. Default is 20 to keep the search time reasonable on most CPU-only machines (around 1 hour). For most cases, a good starting number of epochs is 100-200, which should complete in 10-30 minutes on most modern GPUs or Mac M1-series machines (w. device set to MPS).

  • batchSize (int) – Same as in the adjust method. Default is 64.

  • lossFunction (Literal['MSE', 'MAE']) – Same as in the adjust method. Default is MAE, i.e. Mean Absolute Error or L1 loss.

  • learningRates (List[float]) – List of floats with the learning rates to be tested. Default is (1e-6, 1e-5, 1e-4). See the adjust method for more information.

  • optimizers (List[Literal['Adam', 'AdamW', 'Adamax', 'RMSprop']]) – List of strings with the optimizers to be tested. Default is ("Adam", "AdamW", "Adamax"). See the adjust method for more information.

  • weightDecays (List[float]) – List of floats with the weight decays to be tested. Default is (1e-5, 1e-4, 1e-3). See the adjust method for more information.

  • verbose (bool) – Same as in the adjust method. Default is True.

  • plot (bool) – Whether to plot the training history after all the combinations are tested. Default is True.

Return type:

Tuple[Module, Dict[str, Union[float, str]]]

plotAdjusted()[source]

Plot the adjusted model on the target data. By default, it will plot in your browser.

Return type:

None

plotStarting()[source]

Plot the starting model (before adjustment) on the target data. By default, it will plot in your browser.

Return type:

None

class OPTIMADEAdjuster(calculator, model, provider='mp', targetPath=('attributes', '_mp_stability', 'gga_gga+u', 'formation_energy_per_atom'), targetSize=1, device='cpu', descriptor='KS2022', useClearML=False, taskName='OPTIMADEFineTuning', maxResults=10000, endpointOverride=None)[source]

Bases: LocalAdjuster

Adjuster class operating on data provided by the OPTIMADE API. Primarily geared towards tuning or retraining of the models based on other atomistic databases, or their subsets, accessed through OPTIMADE, to adjust the model to a different domain, which in the context of DFT datasets could mean adjusting the model to predict properties with the DFT settings used by that database or focusing its attention on specific chemistry such as, for instance, all compounds of Sn or all perovskites. It accepts an OPTIMADE query as an input and then operates based on the LocalAdjuster class.

It will set up the environment for the adjustment, letting you progressively build up the training dataset by OPTIMADE queries which get featurized and their results will be concatenated, i.e., you can make one big query or several smaller ones and then adjust the model on the whole dataset when you are ready.

For details on more advanced uses of the OPTIMADE API client, please refer to the documentation.

Parameters:
  • calculator (Calculator) – Instance of the Calculator class with the model to be adjusted, defined and loaded. Unlike in the LocalAdjuster, the descriptor data will not be passed, since it will be fetched from the OPTIMADE API.

  • model (str) – Name of the model to be adjusted in the Calculator. E.g., SIPFENN_Krajewski2022_NN30.

  • provider (Literal['aiida', 'aflow', 'alexandria', 'cod', 'ccpnc', 'cmr', 'httk', 'matcloud', 'mcloud', 'mcloudarchive', 'mp', 'mpdd', 'mpds', 'mpod', 'nmd', 'odbx', 'omdb', 'oqmd', 'jarvis', 'pcod', 'tcod', 'twodmatpedia']) – Strings with the name of the provider to be used for the OPTIMADE queries. The type-hinting gives a list of providers available at the time of writing this code, but it is by no means limited to them. For the up-to-date list, along with their current status, please refer to the OPTIMADE Providers Dashboard. The default is "mp" which stands for the Materials Project, but we do not recommend any particular provider over any other. One has to be picked to work out of the box. Your choice should be based on the data you are interested in.

  • targetPath (List[str]) – List of strings with the path to the target data in the OPTIMADE response. This will be dependent on the provider you choose, and you will need to identify it by looking at the response; the easiest way to do this is by inspecting the provider's structures endpoint (e.g., the JARVIS, Alexandria PBEsol, MP, or our in-house MPDD endpoints). Examples include ('attributes', '_mp_stability', 'gga_gga+u', 'formation_energy_per_atom') for the GGA+U formation energy per atom in MP, ('attributes', '_alexandria_scan_formation_energy_per_atom') for the SCAN formation energy per atom in Alexandria, ('attributes', '_alexandria_formation_energy_per_atom') for the GGAsol formation energy per atom in Alexandria, ('attributes', '_jarvis_formation_energy_peratom') for the optb88vdw formation energy per atom in JARVIS, or ('attributes', '_mpdd_formationenergy_sipfenn_krajewski2020_novelmaterialsmodel') for the formation energy predicted by the SIPFENN_Krajewski2020_NovelMaterialsModel for every structure in MPDD. Default is the MP example.

  • targetSize (int) – The length of the target data to be fetched from the OPTIMADE API. This is typically 1 for a single scalar property, but it can be more. Default is 1.

  • device (Literal['cpu', 'cuda', 'mps']) – Same as in the LocalAdjuster. Default is "cpu" which is available on all systems. If you have a GPU, you can set it to "cuda", or to "mps" if you are using a Mac M1-series machine, in order to speed up the training process by orders of magnitude.

  • descriptor (Literal['Ward2017', 'KS2022']) – Not the same as in the LocalAdjuster. Since the descriptor data will be calculated for each structure fetched from the OPTIMADE API, this parameter is needed to specify which descriptor to use. At the time of writing this code, it can be either "Ward2017" or "KS2022". Special versions of KS2022 cannot be used since assumptions cannot be made about the data fetched from the OPTIMADE API and only general symmetry-based optimizations can be applied. Default is "KS2022".

  • useClearML (bool) – Same as in the LocalAdjuster. Default is False.

  • taskName (str) – Same as in the LocalAdjuster. Default is "OPTIMADEFineTuning", and you are encouraged to change it, especially if you are using the ClearML platform.

  • maxResults (int) – The maximum number of results to be fetched from the OPTIMADE API for a given query. Default is 10000 which is a very high number for most re-training tasks. If you are fetching a lot of data, it’s possible the query is too broad, and you should consider narrowing it down.

  • endpointOverride (Optional[List[str]]) – List of URL strings with the endpoint to be used for the OPTIMADE queries. This is an advanced option allowing you to ignore the provider parameter and directly specify the endpoint to be used. It is useful if you want to use a specific version of the provider’s endpoint or narrow down the query to a sub-database (Alexandria has two different endpoints for PBEsol and SCAN, for instance). You can also use it to query unofficial endpoints. Make sure to (a) include protocol (http:// or https://) and (b) not include version (/v1/), nor the specific endpoint (/structures) as the client will add them. I.e., you want https://alexandria.icams.rub.de/pbesol rather than alexandria.icams.rub.de/pbesol/v1/structures. Default is None which has no effect.

fetchAndFeturize(query, parallelWorkers=1, verbose=True)[source]

Automatically (1) fetches data from OPTIMADE API provider specified in the given OPTIMADEAdjuster instance, (2) filters the data by checking if the target property is available, and (3) featurizes the incoming data with the descriptor calculator selected for OPTIMADEAdjuster instance (KS2022 by default). It effectively prepares everything for the adjustments to be made as if local data was loaded and some metadata was added on top of that.

Parameters:
  • query (str) – A valid OPTIMADE API query as defined at the specification page for the OPTIMADE consortium. These can be made very elaborate by stacking several filters together, but generally retain good readability and are easy to interpret thanks to explicit structure written in English. Here are two quick examples: 'elements HAS "Hf" AND elements HAS "Mo" AND NOT elements HAS ANY "O","C","F","Cl","S"' / 'elements HAS "Hf" AND elements HAS "Mo" AND elements HAS "Zr"'.

  • parallelWorkers (int) – How many workers to use at the featurization step. See KS2022 for more details. On most machines, 4-12 should be the optimal number.

  • verbose (bool) – Prints information about progress and results of the process. It is set to True by default.

Return type:

None
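
A minimal sketch; the query is one of the examples above, and OPTIMADEAdjuster is assumed to be importable from the top-level pysipfenn namespace (otherwise import it from the modelAdjusters module):

    from pysipfenn import Calculator, OPTIMADEAdjuster

    c = Calculator()  # the model to be adjusted must be loaded
    adjuster = OPTIMADEAdjuster(c, model="SIPFENN_Krajewski2022_NN30", provider="mp")
    adjuster.fetchAndFeturize(
        'elements HAS "Hf" AND elements HAS "Mo" AND elements HAS "Zr"',
        parallelWorkers=4,
    )
    adjustedModel, trainingLoss, validationLoss = adjuster.adjust()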

modelExporters

class CoreMLExporter(calculator)[source]

Bases: object

Export models to the CoreML format to allow for easy loading and inference in CoreML in other projects, particularly valuable for Apple devices, as pySIPFENN models can be run using the Neural Engine accelerator with minimal power consumption and neat optimizations.

Note: Some of the dependencies (coremltools) are not installed by default. If you need them, you have to install pySIPFENN in dev mode like: pip install "pysipfenn[dev]", or like pip install -e ".[dev]".

Parameters:

calculator (Calculator) – A Calculator object with loaded models.

calculator[source]

A Calculator object with loaded models.

export(model, append='')[source]

Export a loaded model to CoreML format. Models will be saved as {model}.mlpackage in the current working directory. Models will be annotated with the feature vector name (Ward2017 or KS2022) and the output will be named “property”. The latter behavior will be adjusted in the future when model output name and unit will be added to the model JSON metadata.

Parameters:
  • model (str) – The name of the model to export (must be loaded in the Calculator) and it must have a descriptor (Ward2017 or KS2022) defined in the calculator.models dictionary created when the Calculator was initialized.

  • append (str) – A string to append to the exported model name after the model name. Useful for adding a version number or other information to the exported model name.

Return type:

None

Returns:

None

exportAll(append='')[source]

Export all loaded models to CoreML format with the export function. append can be passed to the export function to append to all exported model names.

Return type:

None

class ONNXExporter(calculator)[source]

Bases: object

Export models to the ONNX format (what they ship in by default) to allow (1) exporting modified pySIPFENN models, (2) simplifying the models using the ONNX optimizer, and (3) converting them to FP16 precision, cutting the size in half.

Note: Some of the dependencies (onnxconverter_common and onnxsim) are not installed by default. If you need them, you have to install pySIPFENN in dev mode like: pip install "pysipfenn[dev]", or like pip install -e ".[dev]".

Parameters:
  • calculator (Calculator) – A Calculator object with loaded models that has loaded PyTorch models (happens automatically when the autoLoad argument is kept to its default value of True when initializing the Calculator). During the initialization, the loaded PyTorch models are converted back to ONNX (in memory) to be then persisted to disk.

calculator[source]

A Calculator object with ONNX loaded models.

simplifiedDict[source]

A boolean dictionary of models that have been simplified.

fp16Dict[source]

A boolean dictionary of models that have been converted to FP16.

export(model, append='')[source]

Export a loaded model to ONNX format.

Parameters:
  • model (str) – The name of the model to export (must be loaded in the Calculator).

  • append (str) – A string to append to the exported model name after the model name, simplification marker, and FP16 marker. Useful for adding a version number or other information to the exported model name.

Return type:

None

Returns:

None

exportAll(append='')[source]

Export all loaded models to ONNX format with the export function. append string can be passed to the export function to append to the exported model name.

Return type:

None

simplify(model)[source]

Simplify a loaded model using the ONNX optimizer.

Parameters:

model (str) – The name of the model to simplify (must be loaded in the Calculator).

Return type:

None

Returns:

None

simplifyAll()[source]

Simplify all loaded models with the simplify function.

toFP16(model)[source]

Convert a loaded model to FP16 precision.

Parameters:

model (str) – The name of the model to convert to FP16 (must be loaded in the Calculator).

Return type:

None

Returns:

None

toFP16All()[source]

Convert all loaded models to FP16 precision with the toFP16 function.
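
A minimal sketch (requires the optional dev dependencies noted above; the exporter is assumed to be importable from the modelExporters module):

    from pysipfenn import Calculator
    from pysipfenn.core.modelExporters import ONNXExporter

    c = Calculator()           # models must be loaded
    exporter = ONNXExporter(c)
    exporter.simplifyAll()     # run the ONNX optimizer on every loaded model
    exporter.toFP16All()       # convert every loaded model to FP16, halving its size
    exporter.exportAll(append="v1")  # write the ONNX files, with markers and the appended string, to the current directory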

class TorchExporter(calculator)[source]

Bases: object

Export models to the PyTorch PT format to allow for easy loading and inference in PyTorch in other projects.

Parameters:

calculator (Calculator) – A Calculator object with loaded models.

calculator[source]

A Calculator object with loaded models.

export(model, append='')[source]

Export a loaded model to PyTorch PT format. Models are exported in eval mode (no dropout) and saved in the current working directory.

Parameters:
  • model (str) – The name of the model to export (must be loaded in the Calculator) and it must have a descriptor (Ward2017 or KS2022) defined in the Calculator.models dictionary created when the Calculator was initialized.

  • append (str) – A string to append to the exported model name after the model name. Useful for adding a version number or other information to the exported model name.

Return type:

None

Returns:

None

exportAll(append='')[source]

Exports all loaded models to PyTorch PT format with the export function. append can be passed to the export function to append to all exported model names.

Return type:

None
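
A minimal sketch, analogous to the other exporters (import path assumed to be the modelExporters module):

    from pysipfenn import Calculator
    from pysipfenn.core.modelExporters import TorchExporter

    c = Calculator()            # models must be loaded
    exporter = TorchExporter(c)
    exporter.exportAll()        # saves each loaded model in PyTorch PT format in the current working directory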