pySIPFENN Core
pySIPFENN
- class Calculator(autoLoad=True, verbose=True)[source]
Bases: object
pySIPFENN Calculator automatically initializes all functionalities, including identification and loading of all available models defined statically in the models.json file. It exposes methods for calculating predefined structure-informed descriptors (feature vectors) and predicting properties using models that utilize them.
- Parameters:
autoLoad (bool) – Automatically load all available ML models based on the models.json file. This will require significant memory and time if they are available, so for featurization and other non-model-requiring tasks, it is recommended to set this to False. Defaults to True.
verbose (bool) – Print initialization messages and several other non-critical messages during runtime procedures. Defaults to True.
- models[source]
Dictionary with all model information based on the models.json file in the modelsSIPFENN directory. The keys are the network names and the values are dictionaries with the model information.
- loadedModels[source]
Dictionary with all loaded models. The keys are the network names and the values are the loaded pytorch models.
- descriptorData[source]
List of all descriptor data created during the last predictions run. The order of the list corresponds to the order of atomic structures given to models as input. The order of the list of descriptor data for each structure corresponds to the order of networks in the toRun list.
- predictions[source]
List of all predictions created during the last predictions run. The order of the list corresponds to the order of atomic structures given to models as input. The order of the list of predictions for each structure corresponds to the order of networks in the toRun list.
- inputFiles[source]
List of all input file names used during the last predictions run. The order of the list corresponds to the order of atomic structures given to models as input.
- appendPrototypeLibrary(customPath)[source]
Parses a custom prototype library YAML file and permanently appends it to the internal prototypeLibrary of the pySIPFENN package. The appended prototypes are persisted for future use and, by default, are loaded automatically when instantiating the Calculator object, similar to your custom models.
- Parameters:
customPath (str) – Path to the prototype library YAML file to be appended to the internal self.prototypeLibrary of the Calculator object.
- Return type:
None
- Returns:
None
- calculate_KS2022(structList, mode='serial', max_workers=8)[source]
Calculates KS2022 descriptors for a list of structures. The calculation can be done in serial or parallel mode. In parallel mode, the number of workers can be specified. The results are stored in the descriptorData attribute. The function returns the list of descriptors as well.
- Parameters:
structList (List[Structure]) – List of structures to calculate descriptors for. The structures must be initialized with the pymatgen Structure class.
mode (str) – Mode of calculation. Defaults to 'serial'. Options are 'serial' and 'parallel'.
max_workers (int) – Number of workers to use in parallel mode. Defaults to 8. If None, the number of workers will be set to the number of available CPU cores. If set to 0, 1 worker will be used.
- Return type:
list
- Returns:
List of KS2022 descriptors (feature vectors), one for each structure.
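The max_workers rule described above can be sketched as follows. This only illustrates the documented behavior and is not pySIPFENN's actual implementation; the helper name resolve_workers is hypothetical.

```python
import os
from typing import Optional


def resolve_workers(max_workers: Optional[int] = 8) -> int:
    """Hypothetical helper mirroring the documented max_workers rule."""
    if max_workers is None:
        # None: use the number of available CPU cores.
        # os.cpu_count() may return None, hence the fallback to 1.
        return os.cpu_count() or 1
    if max_workers == 0:
        # 0: documented to behave like a single worker.
        return 1
    return max_workers
```

For small inputs, the serial mode avoids the process start-up overhead entirely, so this resolution only matters when mode='parallel'.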
- calculate_KS2022_dilute(structList, baseStruct='pure', mode='serial', max_workers=8)[source]
Calculates KS2022 descriptors for a list of dilute structures (based either on pure elements or on custom base structures, e.g. TCP endmember configurations) that contain a single alloying atom. Speed increases are substantial compared to the KS2022 descriptor, which is more general and can be used on any structure. The calculation can be done in serial or parallel mode. In parallel mode, the number of workers can be specified. The results are stored in the self.descriptorData attribute. The function returns the list of descriptors as well.
- Parameters:
structList (List[Structure]) – List of structures to calculate descriptors for. The structures must be dilute structures (based either on pure elements or on custom base structures, e.g. TCP endmember configurations) that contain a single alloying atom. The structures must be initialized with the pymatgen Structure class.
baseStruct (Union[str, List[Structure]]) – Non-diluted references for the dilute structures. Defaults to 'pure', which assumes that the structures are based on pure elements and generates references automatically. Alternatively, a list of structures can be provided, which can be either pure elements or custom base structures (e.g. TCP endmember configurations).
mode (str) – Mode of calculation. Defaults to 'serial'. Options are 'serial' and 'parallel'.
max_workers (int) – Number of workers to use in parallel mode. Defaults to 8. If None, the number of workers will be set to the number of available CPU cores. If set to 0, 1 worker will be used.
- Return type:
List[ndarray]
- Returns:
List of KS2022 descriptors (feature vectors), one np.ndarray for each structure.
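Because the dilute featurizer expects exactly one alloying atom, a quick pre-check can catch unsuitable inputs early. The sketch below covers only the baseStruct='pure' case and works on a plain list of element symbols (for example, one collected from the sites of a structure). The helper is_dilute_pure is hypothetical and not part of pySIPFENN.

```python
from collections import Counter
from typing import Sequence


def is_dilute_pure(symbols: Sequence[str]) -> bool:
    """True if every site holds the same element except one alloying atom."""
    counts = Counter(symbols)
    if len(counts) != 2 or len(symbols) <= 2:
        # Pure structures, multi-solute structures, and two-site cells
        # (where host and solute cannot be told apart) are rejected.
        return False
    # With two elements and more than two sites, a minimum count of 1
    # means exactly one element occupies exactly one site.
    return min(counts.values()) == 1
```

Structures built on custom base structures need a comparison against the corresponding reference instead, which this sketch does not attempt.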
- calculate_KS2022_randomSolutions(baseStructList, compList, minimumSitesPerExpansion=50, featureConvergenceCriterion=0.005, compositionConvergenceCriterion=0.01, minimumElementOccurrences=10, plotParameters=False, printProgress=False, mode='serial', max_workers=8)[source]
Calculates KS2022 descriptors corresponding to random solid solutions occupying base structure / lattice sites for a list of compositions through the method described in the descriptorDefinitions.KS2022_randomSolutions submodule. The results are stored in the descriptorData attribute. The function returns the list of descriptors in numpy format as well.
- Parameters:
baseStructList (Union[str, Structure, List[str], List[Structure], List[Union[Composition, str]]]) – The base structure to generate a random solid solution (RSS). It does _not_ need to be a simple Bravais lattice, such as BCC lattice, but can be any Structure object or a list of them, if you need to define them on a per-case basis. In addition to Structure objects, you can use "magic" strings corresponding to one of the structures in the library you can find under the pysipfenn.misc directory or loaded under the self.prototypeLibrary attribute. The magic strings include, but are not limited to: 'BCC', 'FCC', 'HCP', 'DHCP', 'Diamond', and so on. You can invoke them by their name, e.g. BCC, or by passing self.prototypeLibrary['BCC']['structure'] directly. If you pass a list to baseStructList, you are allowed to mix-and-match Structure objects and magic strings.
compList (Union[str, List[str], Composition, List[Composition], List[Union[Composition, str]]]) – The composition to populate the supercell with until the KS2022 descriptor converges. You can use pymatgen's Composition objects or strings of valid chemical formulas (symbol - atomic fraction pairs), like 'Fe0.5Ni0.3Cr0.2', 'Fe50 Ni30 Cr20', or 'Fe5 Ni3 Cr2'. You can either pass a single entity, in which case it will be used for all structures (use to run the same composition for different base structures), or a list of entities, in which case pairs will be used in the order of the list. If you pass a list to compList, you are allowed to mix-and-match Composition objects and composition strings.
minimumSitesPerExpansion (int) – The minimum number of sites that the base structure will be expanded to (doubling dimension-by-dimension) before it is used as the expansion step/batch in each iteration of adding local chemical environment information to the global ensemble. The optimal value will depend on the number of species and their relative fractions in the composition. Generally, low values (below about 20) will result in slower convergence, as some extreme local chemical environments will have a strong influence on the global ensemble, and too high values (above about 150) will result in a needlessly slow computation for simple compositions, as at least two iterations will be processed. The default value is 50 and works well for simple cases.
featureConvergenceCriterion (float) – The maximum difference between any feature belonging to the current iteration (statistics based on the global ensemble of local chemical environments) and the previous iteration (before last expansion) expressed as a fraction of the maximum value of each feature found in the OQMD database at the time of SIPFENN creation (see the KS2022_randomSolutions.maxFeaturesInOQMD array). The default value is 0.005, corresponding to 0.5% of the maximum value.
compositionConvergenceCriterion (float) – The maximum average difference between any element fraction belonging to the current composition (net of all expansions) and the target composition (comp). The default value is 0.01, corresponding to 1% deviation, whose interpretation will depend on the number of elements in the composition.
minimumElementOccurrences (int) – The minimum number of times all elements must occur in the composition before it is considered converged. This setting prevents the algorithm from converging before very dilute elements, like C in low-carbon steel, have had a chance to occur. The default value is 10.
plotParameters (bool) – If True, the convergence history will be plotted using plotly. The default value is False, but tracking the convergence history is recommended; it is accessible in the metas attribute of the Calculator under the key 'RSS'.
printProgress (bool) – If True, the progress will be printed to the console. The default value is False.
mode (str) – Mode of calculation. Options are serial (default) and parallel.
max_workers (int) – Number of workers to use in parallel mode. Defaults to 8.
- Return type:
List[ndarray]
- Returns:
A list of numpy.ndarray objects containing the KS2022 descriptor, just like the ordinary KS2022. Please note the stochastic nature of this algorithm. The result will likely vary slightly between runs and parameters, so if convergence is critical, verify it with a test matrix of minimumSitesPerExpansion, featureConvergenceCriterion, and compositionConvergenceCriterion values.
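The three stopping conditions described in the parameters can be sketched with plain Python lists and dictionaries. This is an interpretation of the documented criteria, not the code from the KS2022_randomSolutions submodule: the function names are hypothetical, the composition check assumes "average difference" means the mean absolute difference in element fractions, and the reference maxima are assumed to be non-zero.

```python
from typing import Dict, Sequence


def features_converged(current: Sequence[float], previous: Sequence[float],
                       max_in_reference: Sequence[float],
                       criterion: float = 0.005) -> bool:
    # Largest per-feature change between iterations, scaled by that
    # feature's maximum in the reference data (assumed non-zero).
    worst = max(abs(c - p) / m
                for c, p, m in zip(current, previous, max_in_reference))
    return worst <= criterion


def composition_converged(current: Dict[str, float], target: Dict[str, float],
                          criterion: float = 0.01) -> bool:
    # Mean absolute deviation of element fractions from the target.
    elements = set(current) | set(target)
    mean_diff = sum(abs(current.get(e, 0.0) - target.get(e, 0.0))
                    for e in elements) / len(elements)
    return mean_diff <= criterion


def occurrences_sufficient(counts: Dict[str, int], minimum: int = 10) -> bool:
    # Every element must have appeared at least `minimum` times.
    return all(n >= minimum for n in counts.values())
```

A loop over expansions would presumably stop only once all three checks pass, which is why a very dilute element can keep the iteration going even after the features have stabilized.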
- calculate_Ward2017(structList, mode='serial', max_workers=4)[source]
Calculates Ward2017 descriptors for a list of structures. The calculation can be done in serial or parallel mode. In parallel mode, the number of workers can be specified. The results are stored in the self.descriptorData attribute. The function returns the list of descriptors as well.
- Parameters:
structList (List[Structure]) – List of structures to calculate descriptors for. The structures must be initialized with the pymatgen Structure class.
mode (str) – Mode of calculation. Defaults to 'serial'. Options are 'serial' and 'parallel'.
max_workers (int) – Number of workers to use in parallel mode. Defaults to 4. If None, the number of workers will be set to the number of available CPU cores. If set to 0, 1 worker will be used.
- Return type:
list
- Returns:
List of Ward2017 descriptors (feature vectors), one for each structure.
- destroy()[source]
Deallocates all loaded models and clears all data from the Calculator object.
- Return type:
None
- downloadModels(network='all')[source]
Downloads ONNX models. By default, all available models are downloaded. If a model is already available on disk, it is skipped. If a specific network is given, only that network is downloaded, possibly overwriting the existing one. If the network name is not recognized, a message will be printed.
- Parameters:
network (str) – Name of the network to download. Defaults to 'all'.
- Return type:
None
- findCompatibleModels(descriptor)[source]
Finds all models compatible with a given descriptor based on the descriptor definitions loaded from the models.json file.
- Parameters:
descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions to see available modules or add yours. Available default descriptors are: 'Ward2017', 'KS2022'.
- Return type:
List[str]
- Returns:
List of strings corresponding to compatible models.
- get_resultDicts()[source]
Returns a list of dictionaries with the predictions for each network. The keys of the dictionaries are the names of the networks. The order of the dictionaries is the same as the order of the input structures passed through runModels() functions.
- Return type:
List[dict]
- Returns:
List of dictionaries with the predictions.
- get_resultDictsWithNames()[source]
Returns a list of dictionaries with the predictions for each network. The keys of the dictionaries are the names of the networks and the names of the input structures. The order of the dictionaries is the same as the order of the input structures passed through runModels() functions. Note that this function requires self.inputFiles to be set, which is done automatically when using runFromDirectory() or runFromDirectory_dilute() but not when using runModels() or runModels_dilute(), as the input structures are passed directly to the function and names have to be provided separately by assigning them to self.inputFiles.
- Return type:
List[dict]
- Returns:
List of dictionaries with the predictions.
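The note about self.inputFiles can be turned into a small wrapper for in-memory structures. This is a sketch that relies only on the method and attribute names documented here; named_results is a hypothetical helper, and calc is expected to be a Calculator with models already loaded.

```python
from typing import Any, List, Sequence


def named_results(calc: Any, descriptor: str,
                  structures: Sequence[Any],
                  names: Sequence[str]) -> List[dict]:
    """Run models on in-memory structures and return name-tagged results."""
    if len(names) != len(structures):
        raise ValueError("Need exactly one name per structure.")
    calc.runModels(descriptor=descriptor, structList=list(structures))
    # runModels() does not populate inputFiles, so assign the names manually
    # before asking for the name-tagged dictionaries.
    calc.inputFiles = list(names)
    return calc.get_resultDictsWithNames()
```

Keeping the length check in the wrapper avoids silently mismatched names when a structure fails to load upstream.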
- loadModelCustom(networkName, modelName, descriptor, modelDirectory='.')[source]
Load a custom ONNX model from a custom directory specified by the user. The primary use case for this function is to load models that are not included in the package and cannot be placed in the package directory because of write permissions (e.g. on restrictive HPC systems) or storage allocations.
- Parameters:
modelDirectory (str) – Directory where the model is located. Defaults to the current directory.
networkName (str) – Name of the network. This is the name used to refer to the ONNX network. It has to be unique, not contain any spaces, and correspond to the name of the ONNX file (excluding the .onnx extension).
modelName (str) – Name of the model. This is the name that will be displayed in the model selection menu. It can be any string desired.
descriptor (str) – Descriptor/feature vector used by the model. pySIPFENN currently supports the following descriptors: 'KS2022' and 'Ward2017'.
- Return type:
None
- loadModels(network='all')[source]
Load model/models into memory of the Calculator class. The models are loaded from the modelsSIPFENN directory inside the package. Its location can be seen by calling print() on the Calculator. The models are stored in the self.loadedModels attribute as a dictionary with the network string as key and the PyTorch model as value.
- Note:
This function only works with models that are stored in the modelsSIPFENN directory inside the package, are in ONNX format, and have corresponding entries in models.json. For all others, you will need to use loadModelCustom().
- Parameters:
network (str) – Default is 'all', which loads all models detected as available. Alternatively, a specific model can be loaded by its corresponding key in models.json, e.g. 'SIPFENN_Krajewski2020_NN9' or 'SIPFENN_Krajewski2022_NN30'. The key is the same as the network argument in downloadModels().
- Raises:
ValueError – If the network name is not recognized or if the model is not available in the modelsSIPFENN directory.
- Return type:
None
- Returns:
None. It updates the loadedModels attribute of the Calculator class.
- makePredictions(models, toRun, dataInList)[source]
Makes predictions using PyTorch networks listed in toRun and provided in models dictionary. Shared among all “predict” functions.
- Parameters:
models (Dict[str, Module]) – Dictionary of models to use. Keys are network names and values are PyTorch models loaded from ONNX with loadModels() / loadModelCustom() or manually (fairly simple!).
toRun (List[str]) – List of networks to run. It must be a subset of models.keys().
dataInList (List[Union[List[float], array]]) – List of data to make predictions for. Each element of the list should be a descriptor accepted by all networks in toRun. Can be a list of lists of floats or a list of numpy np.ndarray objects.
- Return type:
List[list]
- Returns:
List of predictions. Each element of the list is a list of predictions for all run networks. The order of the predictions is the same as the order of the networks in toRun.
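The ordering contract of the returned list can be illustrated without PyTorch. In the sketch below, plain Python callables stand in for the loaded networks, and make_predictions_sketch is a hypothetical name; the real method additionally handles tensor conversion, which is omitted here.

```python
from typing import Any, Callable, Dict, List, Sequence


def make_predictions_sketch(models: Dict[str, Callable[[Any], Any]],
                            to_run: Sequence[str],
                            data_in_list: Sequence[Any]) -> List[list]:
    # to_run must be a subset of the loaded model names.
    missing = [name for name in to_run if name not in models]
    if missing:
        raise KeyError(f"Networks not loaded: {missing}")
    # Outer list follows the input order; inner list follows to_run.
    return [[models[name](data) for name in to_run] for data in data_in_list]
```

Indexing is therefore result[i][j] for the i-th input descriptor and the j-th entry of toRun.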
- parsePrototypeLibrary(customPath='default', verbose=False, printCustomLibrary=False)[source]
Parses the prototype library YAML file in the misc directory, interprets its entries into pymatgen Structure objects, and stores them in the self.prototypeLibrary dict attribute of the Calculator object. You can also use it to temporarily append a custom prototype library (by providing a path), which will live as long as the Calculator. For permanent changes, use appendPrototypeLibrary().
- Parameters:
customPath (str) – Path to the prototype library YAML file. Defaults to the magic string "default", which loads the default prototype library included in the package in the misc directory.
verbose (bool) – If True, it prints the number of prototypes loaded. Defaults to False, but note that the Calculator class automatically initializes with verbose=True.
printCustomLibrary (bool) – If True, it prints the name and POSCAR of each prototype being added to the prototype library. Has no effect if customPath is 'default'. Defaults to False.
- Return type:
None
- Returns:
None
- runFromDirectory(directory, descriptor, mode='serial', max_workers=4)[source]
Runs all loaded models on a list of Structures it automatically imports from a specified directory. The directory must contain only atomic structures in formats such as 'poscar', 'cif', 'json', 'mcsqs', etc., or a mix of these. The structures are automatically sorted using the natsort library, so the order of the structures in the directory, as defined by the operating system, is not important. Natural sorting, for example, will sort the structures in the following order: '1-Fe', '2-Al', '10-xx', '11-xx', '20-xx', '21-xx', '11111-xx', etc. This is useful when the structures are named using a numbering system. The order of the predictions is the same as the order of the input structures. The order of the networks in a prediction is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.
- Parameters:
directory (str) – Directory containing the structures to run the models on. The directory must contain only atomic structures in formats such as 'poscar', 'cif', 'json', 'mcsqs', etc., or a mix of these. The structures are automatically sorted as described above.
descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors.
mode (str) – Computation mode. 'serial' or 'parallel'. Default is 'serial'. Parallel mode is not recommended for small datasets.
max_workers (int) – Number of workers to use in parallel mode. Default is 4. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.
- Return type:
List[list]
- Returns:
List of predictions. Each element of the list is a list of predictions for all run networks. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.
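The natural ordering described above differs from plain string sorting, where '10-xx' would come before '2-Al'. The sketch below is a standard-library approximation of that ordering for illustration only; pySIPFENN itself uses the natsort library, and the helper names here are hypothetical.

```python
import re
from typing import List, Sequence


def natural_key(name: str) -> List[object]:
    # re.split with a capturing group alternates non-digit and digit runs,
    # so odd positions always hold digits and can be compared as integers.
    tokens = re.split(r"(\d+)", name)
    return [int(tok) if i % 2 else tok.lower() for i, tok in enumerate(tokens)]


def natural_sorted(names: Sequence[str]) -> List[str]:
    return sorted(names, key=natural_key)
```

Applied to the file names from the example above in shuffled order, natural_sorted returns them in the documented sequence.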
- runFromDirectory_dilute(directory, descriptor, baseStruct='pure', mode='serial', max_workers=8)[source]
Runs all loaded models on a list of dilute Structures it automatically imports from a specified directory. The directory must contain only atomic structures in formats such as 'poscar', 'cif', 'json', 'mcsqs', etc., or a mix of these. The structures are automatically sorted using the natsort library, so the order of the structures in the directory, as defined by the operating system, is not important. Natural sorting, for example, will sort the structures in the following order: '1-Fe', '2-Al', '10-xx', '11-xx', '20-xx', '21-xx', '11111-xx', etc. This is useful when the structures are named using a numbering system. The order of the predictions is the same as the order of the input structures. The order of the networks in a prediction is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.
- Parameters:
directory (str) – Directory containing the structures to run the models on. The directory must contain only atomic structures in formats such as 'poscar', 'cif', 'json', 'mcsqs', etc., or a mix of these. The structures are automatically sorted as described above. The structures must be dilute structures, i.e. they must contain only one alloying element.
descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors.
baseStruct (str) – Non-diluted references for the dilute structures. Defaults to 'pure', which assumes that the structures are based on pure elements and generates references automatically. Alternatively, a list of structures can be provided, which can be either pure elements or custom base structures (e.g. TCP endmember configurations).
mode (str) – Computation mode. 'serial' or 'parallel'. Default is 'serial'. Parallel mode is not recommended for small datasets.
max_workers (int) – Number of workers to use in parallel mode. Default is 8. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.
- Return type:
None
- Returns:
List of predictions. Each element of the list is a list of predictions for all run networks. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list.
- runModels(descriptor, structList, mode='serial', max_workers=4)[source]
Runs all loaded models on a list of Structures using specified descriptor. Supports serial and parallel computation modes. If parallel is selected, max_workers determines number of processes handling the featurization of structures (90-99+% of computational intensity) and models are then run in series.
- Parameters:
descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions to see available modules or add yours. Available default descriptors are: 'Ward2017', 'KS2022'.
structList (List[Structure]) – List of pymatgen Structure objects to run the models on.
mode (str) – Computation mode. 'serial' or 'parallel'. Default is 'serial'. Parallel mode is not recommended for small datasets.
max_workers (int) – Number of workers to use in parallel mode. Default is 4. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.
- Return type:
List[List[float]]
- Returns:
List of predictions. Each element of the list is a list of predictions for all run networks. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list. If a network is not compatible with the selected descriptor, it will not be included in the list.
- runModels_dilute(descriptor, structList, baseStruct='pure', mode='serial', max_workers=4)[source]
Runs all loaded models on a list of Structures using the specified descriptor. A critical difference from runModels() is that this function will call a dilute-specific featurizer, e.g. KS2022_dilute when 'KS2022' is provided as input, which can only be used on dilute structures (based either on pure elements or on custom base structures, e.g. TCP endmember configurations) that contain a single alloying atom. Speed increases are substantial compared to the KS2022 descriptor, which is more general and can be used on any structure. Supports serial and parallel modes in the same way as runModels().
- Parameters:
descriptor (str) – Descriptor to use for predictions. Must be one of the descriptors which support the dilute structures (i.e. *_dilute). See pysipfenn.descriptorDefinitions to see available modules or add yours here. Available default dilute descriptors are now: 'KS2022'. The 'KS2022' descriptor can also be called from the runModels() function, but this is not recommended for dilute alloys, as it negates the speed increase of the dilute structure featurizer.
structList (List[Structure]) – List of pymatgen Structure objects to run the models on. Must be dilute structures as described above.
baseStruct (Union[str, List[Structure]]) – Non-diluted references for the dilute structures. Defaults to 'pure', which assumes that the structures are based on pure elements and generates references automatically. Alternatively, a list of structures can be provided, which can be either pure elements or custom base structures (e.g. TCP endmember configurations).
mode (str) – Computation mode. 'serial' or 'parallel'. Default is 'serial'. Parallel mode is not recommended for small datasets.
max_workers (int) – Number of workers to use in parallel mode. Default is 4. Ignored in serial mode. If set to None, will use all available cores. If set to 0, will use 1 core.
- Return type:
List[List[float]]
- Returns:
List of predictions. Each element of the list is a list of predictions for all run networks. The order of the predictions is the same as the order of the input structures. The order of the networks is the same as the order of the networks in self.network_list_available. If a network is not available, it will not be included in the list. If a network is not compatible with the selected descriptor, it will not be included in the list.
- runModels_randomSolutions(descriptor, baseStructList, compList, minimumSitesPerExpansion=50, featureConvergenceCriterion=0.005, compositionConvergenceCriterion=0.01, minimumElementOccurrences=10, plotParameters=False, printProgress=False, mode='serial', max_workers=8)[source]
A top-level convenience wrapper for the calculate_KS2022_randomSolutions function. It passes all the arguments to that function directly (except for descriptor) and uses its result to run all applicable models. The result is a list of predictions for all run networks.
- Parameters:
descriptor (str) – Descriptor to use for predictions. Must be one of the descriptors which support random solid solution structures. See pysipfenn.descriptorDefinitions to see available modules or add yours here. As of v0.15.0, the only available descriptor is 'KS2022' through its KS2022_randomSolutions submodule.
baseStructList (Union[str, Structure, List[str], List[Structure], List[Union[Composition, str]]]) – See calculate_KS2022_randomSolutions for details. You can mix-and-match Structure objects and magic strings, either individually (to use the same entity for all calculations) or in a list.
compList (Union[str, List[str], Composition, List[Composition], List[Union[Composition, str]]]) – See calculate_KS2022_randomSolutions for details. You can mix-and-match Composition objects and composition strings, either individually (to use the same entity for all calculations) or in a list.
minimumSitesPerExpansion (int) – See calculate_KS2022_randomSolutions.
featureConvergenceCriterion (float) – See calculate_KS2022_randomSolutions.
compositionConvergenceCriterion (float) – See calculate_KS2022_randomSolutions.
minimumElementOccurrences (int) – See calculate_KS2022_randomSolutions.
plotParameters (bool) – See calculate_KS2022_randomSolutions.
printProgress (bool) – See calculate_KS2022_randomSolutions.
mode (str) – Computation mode. 'serial' or 'parallel'. Default is 'serial'. Parallel mode is not recommended for small datasets.
- Return type:
List[List[float]]
- Returns:
List of predictions. They will correspond to the order of the networks in self.toRun established by the findCompatibleModels() function. If a network is not available, it will not be included in the list.
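How single entities and lists of base structures and compositions line up can be sketched as below. The documentation states that a single entity is reused for all calculations and that lists are paired in order; the length check and the helper name pair_inputs are assumptions made for this illustration, not the library's code.

```python
from typing import Any, List, Tuple


def pair_inputs(base_structs: Any, comps: Any) -> List[Tuple[Any, Any]]:
    """Pair base structures with compositions, reusing single entities."""
    bases = list(base_structs) if isinstance(base_structs, (list, tuple)) else [base_structs]
    comp_list = list(comps) if isinstance(comps, (list, tuple)) else [comps]
    # A single entity is repeated to match the length of the other input.
    if len(bases) == 1:
        bases = bases * len(comp_list)
    if len(comp_list) == 1:
        comp_list = comp_list * len(bases)
    if len(bases) != len(comp_list):
        raise ValueError("Lists of base structures and compositions differ in length.")
    return list(zip(bases, comp_list))
```

For example, passing 'BCC' with two composition strings yields two (structure, composition) pairs that share the same base structure.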
- updateModelAvailability()[source]
Updates availability of models based on the pysipfenn.modelsSIPFENN directory contents. Works only for current ONNX model definitions.
- Return type:
None
- writeDescriptorsToCSV(descriptor, file='descriptorData.csv')[source]
Writes the descriptor data to a CSV file. The first column is the name of the structure. If the self.inputFiles attribute is populated automatically by runFromDirectory() or set manually, the names of the structures will be used. Otherwise, the names will be '1', '2', '3', etc. The remaining columns are the descriptor values. The order of the columns is the same as the order of the labels in the descriptor definition file.
- Parameters:
descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors, such as 'KS2022' and 'Ward2017'. It provides the labels for the descriptor values.
file (str) – Name of the file to write the results to. If the file already exists, it will be overwritten. If the file does not exist, it will be created. The file must have a '.csv' extension to be recognized correctly.
- Return type:
None
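The CSV layout described above (a name column followed by descriptor values, with '1', '2', '3', ... as fallback names) can be sketched with the standard library. The helper name, the header row with a 'name' column, and the toy labels are assumptions for illustration; the actual file written by pySIPFENN may differ in detail.

```python
import csv
import io
from typing import Optional, Sequence


def descriptors_to_csv_text(descriptor_data: Sequence[Sequence[float]],
                            labels: Sequence[str],
                            input_files: Optional[Sequence[str]] = None) -> str:
    # Fall back to '1', '2', '3', ... when no structure names are available.
    names = (list(input_files) if input_files
             else [str(i + 1) for i in range(len(descriptor_data))])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", *labels])  # header row is an assumption
    for name, row in zip(names, descriptor_data):
        writer.writerow([name, *row])
    return buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch side-effect free; the same writer calls work on a file opened with newline=''.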
- writeDescriptorsToNPY(descriptor, file='descriptorData.npy')[source]
Writes the descriptor data to a numpy file (.NPY). The order of the columns in the numpy array corresponds to the order of the labels in the descriptor definition files in the descriptorDefinitions directory. To match the data with the labels, load the labels from the descriptor definition file and use the same index to access the corresponding data in the numpy array. The order of the rows corresponds to the last run input structures.
- Parameters:
descriptor (str) – Descriptor to use. Must be one of the available descriptors. See pysipfenn.descriptorDefinitions for a list of available descriptors, such as 'KS2022' and 'Ward2017'.
file (str) – Name of the file to write the results to. If the file already exists, it will be overwritten. If the file does not exist, it will be created. The file must have a '.npy' extension to be recognized correctly. Default is 'descriptorData.npy'.
- Return type:
None
- writeResultsToCSV(file)[source]
Writes the results to a CSV file. The first column is the name of the structure. If the self.inputFiles attribute is populated automatically by runFromDirectory() or set manually, the names of the structures will be used. Otherwise, the names will be '1', '2', '3', etc. The remaining columns are the predictions for each network. The order of the columns is the same as the order of the networks in self.network_list_available.
- Parameters:
file (str) – Name of the file to write the results to. If the file already exists, it will be overwritten. If the file does not exist, it will be created. The file must have a '.csv' extension to be recognized correctly.
- Return type:
None
- overwritePrototypeLibrary(prototypeLibrary)[source]
Destructively overwrites the prototype library with a custom one. Used by the appendPrototypeLibrary() function to persist its changes. The other main use is to restore the default one to the original state based on a backup made earlier (see tests for an example).
- Return type:
None
- string2prototype(c, prototype)[source]
Converts a prototype string to a pymatgen Structure object.
- Parameters:
c (Calculator) – Calculator object with the prototypeLibrary.
prototype (str) – Prototype string.
- Return type:
Structure
- Returns:
Structure object.
- ward2ks2022(ward2017)[source]
Converts a Ward2017 descriptor to a KS2022 descriptor (which is its subset). For physicality and performance improvements, it removes 15 features: mean_WCMagnitude_Shell1, mean_WCMagnitude_Shell2, mean_WCMagnitude_Shell3, mean_NeighDiff_shell1_SpaceGroupNumber, var_NeighDiff_shell1_SpaceGroupNumber, min_NeighDiff_shell1_SpaceGroupNumber, max_NeighDiff_shell1_SpaceGroupNumber, range_NeighDiff_shell1_SpaceGroupNumber, mean_SpaceGroupNumber, maxdiff_SpaceGroupNumber, dev_SpaceGroupNumber, max_SpaceGroupNumber, min_SpaceGroupNumber, most_SpaceGroupNumber, and CanFormIonic.
- Parameters:
ward2017 (
ndarray
) – Ward2017 descriptor. Must be a 1D np.ndarray of length 271.
- Return type:
ndarray
- Returns:
KS2022
descriptor array.
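The subsetting described for ward2ks2022 amounts to dropping 15 entries from a 271-element vector, leaving 271 - 15 = 256. A minimal sketch with plain lists follows; the index positions in REMOVED_INDICES are hypothetical placeholders, since the real positions are defined inside pySIPFENN and are not listed in this reference.

```python
WARD2017_LENGTH = 271

# Hypothetical positions of the 15 removed features. The real indices live in
# pySIPFENN and are NOT reproduced here.
REMOVED_INDICES = frozenset(range(15))


def drop_features(ward2017, removed=REMOVED_INDICES):
    """Return a copy of the vector without the entries at the removed positions."""
    if len(ward2017) != WARD2017_LENGTH:
        raise ValueError(f"expected {WARD2017_LENGTH} values, got {len(ward2017)}")
    return [value for i, value in enumerate(ward2017) if i not in removed]


ks2022_like = drop_features([float(i) for i in range(WARD2017_LENGTH)])
```

The length check mirrors the documented requirement that the input be a 1D array of length 271.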
modelAdjusters
- class LocalAdjuster(calculator, model, targetData, descriptorData=None, device='cpu', descriptor=None, useClearML=False, taskName='LocalFineTuning')[source]
Bases:
object
Adjuster class taking a Calculator and operating on local data provided to the model as a pair of descriptor data and target values, each of which can be provided in several ways. It can then adjust the model with some predefined hyperparameters or run a fairly typical grid search, which can be interpreted manually or uploaded to the ClearML platform. Can use CPU, CUDA, or MPS (Mac M1) devices for training.
- Parameters:
calculator (
Calculator
) – Instance of theCalculator
class with the model to be adjusted, defined and loaded. It can contain the descriptor data already in it, so that it does not have to be provided separately.model (
str
) – Name of the model to be adjusted in theCalculator
. E.g.,SIPFENN_Krajewski2022_NN30
.targetData (
Union
[str
,ndarray
]) – Target data to be used for training the model. It can be provided as a path to a NumPy.npy
/.NPY
or CSV.csv
/.CSV
file, or directly as a NumPy array. It has to be the same length as the descriptor data.descriptorData (
Union
[None
,str
,ndarray
]) – Descriptor data to be used for training the model. It can be left unspecified (None
) to use the data in theCalculator
, or provided as a path to a NumPy.npy
/.NPY
or CSV.csv
/.CSV
file, or directly as a NumPy array. It has to be the same length as the target data. Default isNone
.device (
Literal
['cpu'
,'cuda'
,'mps'
]) – Device to be used for training the model. It has to be one of the following: "cpu", "cuda", or "mps". Default is "cpu".
descriptor (
Optional
[Literal
['Ward2017'
,'KS2022'
]]) – Name of the feature vector provided in the descriptorData. It can be optionally provided to check if the descriptor data is compatible.useClearML (
bool
) – Whether to use the ClearML platform for logging the training process. Default isFalse
.taskName (
str
) – Name of the task to be used. Default is"LocalFineTuning"
.
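The parameters above accept data either as arrays or as paths with .npy/.NPY or .csv/.CSV extensions, and require descriptor and target data of equal length. The sketch below shows one way such input checks could look; detect_format and check_lengths are hypothetical helpers written for illustration and are not part of pySIPFENN.

```python
from pathlib import Path


def detect_format(path):
    """Classify a data file by its extension, ignoring case."""
    suffix = Path(path).suffix.lower()
    if suffix == ".npy":
        return "npy"
    if suffix == ".csv":
        return "csv"
    raise ValueError(f"unsupported data file extension: {suffix!r}")


def check_lengths(descriptor_rows, target_rows):
    """Raise if descriptor and target data do not describe the same number of points."""
    if len(descriptor_rows) != len(target_rows):
        raise ValueError(
            f"{len(descriptor_rows)} descriptor rows vs {len(target_rows)} target rows"
        )


kind = detect_format("targets.NPY")
check_lengths([[0.1, 0.2], [0.3, 0.4]], [-0.5, -0.7])
```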
- adjust(validation=0.2, learningRate=1e-05, epochs=50, batchSize=32, optimizer='Adam', weightDecay=1e-05, lossFunction='MAE', verbose=True)[source]
Takes the original model, copies it, and adjusts the copy on the provided data. The adjusted model is stored in the adjustedModel attribute of the class and can then be persisted to the original Calculator or used for plotting. The default hyperparameters are selected for fine-tuning the model rather than retraining it, so as to adjust it slowly (1% of the typical learning rate) and not overfit it (50 epochs).
- Parameters:
learningRate (
float
) – The learning rate to be used for the adjustment. Default is 1e-5, which is 1% of a typical learning rate of the Adam optimizer.
epochs (
int
) – The number of times to iterate over the data, i.e., how many times the model will see the data. Default is 50, which is on the higher side for fine-tuning. If training is slow but the model has already converged, consider lowering this number to reduce the training time and the risk of overfitting to the training data.
batchSize (
int
) – The number of points passed to the model at once. Default is32
, which is a typical batch size for smaller datasets. If the dataset is large, consider increasing this number to speed up the training.optimizer (
Literal
['Adam'
,'AdamW'
,'Adamax'
,'RMSprop'
]) – Algorithm to be used for optimization. Default is Adam, which is a good choice for most models and one of the most popular optimizers. Other options are AdamW, Adamax, and RMSprop.
lossFunction (
Literal
['MSE'
,'MAE'
]) – Loss function to be used for optimization. Default isMAE
(Mean Absolute Error / L1) that is more robust to outliers thanMSE
(Mean Squared Error).validation (
float
) – Fraction of the data to be used for validation. Default is the common0.2
(20% of the data). If set to0
, the model will be trained on the whole dataset without validation, and you will not be able to check for overfitting or gauge the model’s performance on unseen data.weightDecay (
float
) – Weight decay to be used for optimization. Default is1e-5
that should work well if data is abundant enough relative to the model complexity. If the model is overfitting, consider increasing this number to regularize the model more.verbose (
bool
) – Whether to print information, such as loss, during the training. Default isTrue
.
- Returns:
(1) the adjusted model, (2) a list of training losses (floats), and (3) a list of validation losses (floats). The adjusted model is also stored in the adjustedModel attribute of the class.
- Return type:
A tuple with 3 elements
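A call to adjust() with the documented defaults spelled out, and its three-element result unpacked, might look like the sketch below. So that it runs without pySIPFENN installed, it uses StubAdjuster, a stand-in written for this example whose loss values are placeholders; with the real library, the adjuster argument would be a LocalAdjuster instance.

```python
class StubAdjuster:
    """Stand-in with the documented adjust() signature; NOT the real LocalAdjuster."""

    def adjust(self, validation=0.2, learningRate=1e-5, epochs=50, batchSize=32,
               optimizer="Adam", weightDecay=1e-5, lossFunction="MAE", verbose=True):
        # Placeholder loss curves, one value per epoch (no training happens here).
        training = [1.0 / (epoch + 1) for epoch in range(epochs)]
        validating = [1.2 / (epoch + 1) for epoch in range(epochs)]
        return "adjusted-model-placeholder", training, validating


def fine_tune(adjuster):
    """Run adjust() with the documented default hyperparameters and summarize the result."""
    model, training_loss, validation_loss = adjuster.adjust(
        validation=0.2,
        learningRate=1e-5,
        epochs=50,
        batchSize=32,
        optimizer="Adam",
        weightDecay=1e-5,
        lossFunction="MAE",
        verbose=False,
    )
    return {
        "model": model,
        "epochs_run": len(training_loss),
        "final_training_loss": training_loss[-1],
        "final_validation_loss": validation_loss[-1],
    }


summary = fine_tune(StubAdjuster())
```

Comparing the final training and validation losses is a quick way to spot overfitting, which is why the validation fraction is nonzero by default.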
- highlightCompositions(compositions)[source]
Highlights data points that correspond to certain chemical compositions, so that they can be distinguished at later steps. The strings you provide will be interpreted when matching to the data, so HfMo, Hf1Mo1, Hf2Mo2, and Hf50 Mo50 will all be considered equal. They will be plotted in red by plotStarting() and plotAdjusted(). Please note that this will be overwritten the next time you call adjust(), so you may need to perform it again.
- Parameters:
compositions (
List
[str
]) – A list of strings with chemical formulas. They will be interpreted, so any valid formula pointing to the same composition will be parsed in the same fashion. Currently, the composition needs to match exactly, i.e., Hf33 Mo33 Ni33 will match HfMoNi, but Hf28.6 Mo71.4 will not match Hf2 Mo5. Approximate matching can be implemented if there is interest.
- Return type:
None
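The exact-match behavior described above can be illustrated by reducing each formula to exact atomic fractions and comparing the results. This is a simplified sketch, not pySIPFENN's parser: the normalize helper is hypothetical and handles only flat formulas without parentheses.

```python
import re
from fractions import Fraction

# An element symbol followed by an optional integer or decimal amount.
_TOKEN = re.compile(r"([A-Z][a-z]?)\s*(\d+(?:\.\d+)?)?")


def normalize(formula):
    """Map a flat formula string to exact atomic fractions per element."""
    amounts = {}
    for element, amount in _TOKEN.findall(formula):
        # A missing amount means one atom, as in "HfMo".
        amounts[element] = amounts.get(element, Fraction(0)) + Fraction(amount or "1")
    total = sum(amounts.values())
    return {element: amount / total for element, amount in amounts.items()}


same = normalize("HfMo") == normalize("Hf50 Mo50")
ternary = normalize("Hf33 Mo33 Ni33") == normalize("HfMoNi")
rounded = normalize("Hf28.6 Mo71.4") == normalize("Hf2 Mo5")
```

Here same and ternary are True, while rounded is False: 28.6% is only a rounded form of 2/7, so an exact comparison rejects it, which matches the limitation noted for the compositions parameter.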
- highlightPoints(pointsIndices)[source]
Highlights data points at certain indices, so that they can be distinguished at later steps. They will be plotted in red by plotStarting() and plotAdjusted(). Please note that this will be overwritten the next time you call adjust(), so you may need to perform it again.
- Parameters:
pointsIndices (
List
[int
]) – A list of point indices to highlight. Note that Python list indices start from 0.
- Return type:
None
- matrixHyperParameterSearch(validation=0.2, epochs=20, batchSize=64, lossFunction='MAE', learningRates=(1e-06, 1e-05, 0.0001), optimizers=('Adam', 'AdamW', 'Adamax'), weightDecays=(1e-05, 0.0001, 0.001), verbose=True, plot=True)[source]
Performs a grid search over the hyperparameters provided to find the best combination. By default, it will (a) plot the training history with plotly in your browser, and (b) print the best hyperparameters found. If the ClearML platform was set to be used for logging (at the class initialization), the results will be uploaded there as well. If the default values are used, it will test 27 combinations of learning rates, optimizers, and weight decays. The method will then adjust the model to the best hyperparameters found, corresponding to the lowest validation loss if validation is used, or the lowest training loss if validation is not used (validation=0). Note that validation is used by default.
- Parameters:
validation (
float
) – Same as in theadjust
method. Default is0.2
.epochs (
int
) – Same as in theadjust
method. Default is20
to keep the search time reasonable on most CPU-only machines (around 1 hour). For most cases, a good starting number of epochs is 100-200, which should complete in 10-30 minutes on most modern GPUs or Mac M1-series machines (with device set to MPS).
batchSize (
int
) – Same as in theadjust
method. Default is 64.
lossFunction (
Literal
['MSE'
,'MAE'
]) – Same as in theadjust
method. Default isMAE
, i.e. Mean Absolute Error or L1 loss.learningRates (
List
[float
]) – List of floats with the learning rates to be tested. Default is(1e-6, 1e-5, 1e-4)
. See theadjust
method for more information.optimizers (
List
[Literal
['Adam'
,'AdamW'
,'Adamax'
,'RMSprop'
]]) – List of strings with the optimizers to be tested. Default is("Adam", "AdamW", "Adamax")
. See theadjust
method for more information.weightDecays (
List
[float
]) – List of floats with the weight decays to be tested. Default is(1e-5, 1e-4, 1e-3)
. See theadjust
method for more information.verbose (
bool
) – Same as in theadjust
method. Default isTrue
.plot (
bool
) – Whether to plot the training history after all the combinations are tested. Default isTrue
.
- Return type:
Tuple
[Module
,Dict
[str
,Union
[float
,str
]]]
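The count of 27 combinations follows from the three default tuples (3 x 3 x 3), and the selection rule is simply the lowest loss. The sketch below enumerates the grid with itertools.product. The grid_search and toy_evaluate functions are hypothetical; toy_evaluate returns a made-up score in place of an actual training run.

```python
from itertools import product

LEARNING_RATES = (1e-6, 1e-5, 1e-4)
OPTIMIZERS = ("Adam", "AdamW", "Adamax")
WEIGHT_DECAYS = (1e-5, 1e-4, 1e-3)


def grid_search(evaluate, learning_rates=LEARNING_RATES,
                optimizers=OPTIMIZERS, weight_decays=WEIGHT_DECAYS):
    """Evaluate every combination and return the lowest-loss entry plus all results."""
    results = []
    for lr, opt, wd in product(learning_rates, optimizers, weight_decays):
        results.append({
            "learningRate": lr,
            "optimizer": opt,
            "weightDecay": wd,
            "loss": evaluate(lr, opt, wd),
        })
    best = min(results, key=lambda entry: entry["loss"])
    return best, results


def toy_evaluate(lr, opt, wd):
    """Placeholder score standing in for a training run; the numbers are meaningless."""
    return abs(lr - 1e-5) * 1e5 + wd * 10 + (0.0 if opt == "AdamW" else 0.05)


best, results = grid_search(toy_evaluate)
```

Since the cost grows multiplicatively with each added value, trimming any one of the three tuples is the quickest way to shorten a search.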
- class OPTIMADEAdjuster(calculator, model, provider='mp', targetPath=('attributes', '_mp_stability', 'gga_gga+u', 'formation_energy_per_atom'), targetSize=1, device='cpu', descriptor='KS2022', useClearML=False, taskName='OPTIMADEFineTuning', maxResults=10000, endpointOverride=None)[source]
Bases:
LocalAdjuster
Adjuster class operating on data provided by the OPTIMADE API. It is primarily geared towards tuning or retraining the models based on other atomistic databases, or their subsets, accessed through OPTIMADE, to adjust the model to a different domain. In the context of DFT datasets, this could mean adjusting the model to predict properties with the DFT settings used by that database, or focusing its attention on a specific chemistry like, for instance, all compounds of Sn and all perovskites. It accepts an OPTIMADE query as an input and then operates based on the LocalAdjuster class.
It will set up the environment for the adjustment, letting you progressively build up the training dataset with OPTIMADE queries whose results are featurized and concatenated, i.e., you can make one big query or several smaller ones and then adjust the model on the whole dataset when you are ready.
For details on more advanced uses of the OPTIMADE API client, please refer to the documentation.
- Parameters:
calculator (
Calculator
) – Instance of theCalculator
class with the model to be adjusted, defined and loaded. Unlike in theLocalAdjuster
, the descriptor data will not be passed, since it will be fetched from the OPTIMADE API.model (
str
) – Name of the model to be adjusted in theCalculator
. E.g.,SIPFENN_Krajewski2022_NN30
.provider (
Literal
['aiida'
,'aflow'
,'alexandria'
,'cod'
,'ccpnc'
,'cmr'
,'httk'
,'matcloud'
,'mcloud'
,'mcloudarchive'
,'mp'
,'mpdd'
,'mpds'
,'mpod'
,'nmd'
,'odbx'
,'omdb'
,'oqmd'
,'jarvis'
,'pcod'
,'tcod'
,'twodmatpedia'
]) – String with the name of the provider to be used for the OPTIMADE queries. The type-hinting gives a list of providers available at the time of writing this code, but it is by no means limited to them. For the up-to-date list, along with their current status, please refer to the OPTIMADE Providers Dashboard. The default is "mp", which stands for the Materials Project. We do not recommend any particular provider over any other; a default simply has to be picked for the class to work out of the box. Your choice should be based on the data you are interested in.
targetPath (
List
[str
]) – List of strings with the path to the target data in the OPTIMADE response. This will be dependent on the provider you choose, and you will need to identify it by looking at the response. The easiest way to do this is by opening the provider's endpoint (e.g., that of JARVIS, Alexandria PBEsol, MP, or our in-house MPDD) and inspecting a response. Examples include ('attributes', '_mp_stability', 'gga_gga+u', 'formation_energy_per_atom')
for GGA+U formation energy per atom in MP, or('attributes', '_alexandria_scan_formation_energy_per_atom')
for the SCAN formation energy per atom in Alexandria, or('attributes', '_alexandria_formation_energy_per_atom')
for theGGAsol
formation energy per atom in Alexandria, or('attributes', '_jarvis_formation_energy_peratom')
for the optb88vdw formation energy per atom in JARVIS, or('attributes', '_mpdd_formationenergy_sipfenn_krajewski2020_novelmaterialsmodel')
for the formation energy predicted by the SIPFENN_Krajewski2020_NovelMaterialsModel for every structure in MPDD. Default is the MP example.targetSize (
int
) – The length of the target data to be fetched from the OPTIMADE API. This is typically1
for a single scalar property, but it can be more. Default is1
.device (
Literal
['cpu'
,'cuda'
,'mps'
]) – Same as in theLocalAdjuster
. Default is"cpu"
which is available on all systems. If you have a GPU, you can set it to"cuda"
, or to"mps"
if you are using a Mac M1-series machine, in order to speed up the training process by orders of magnitude.descriptor (
Literal
['Ward2017'
,'KS2022'
]) – Not the same as in theLocalAdjuster
. Since the descriptor data will be calculated for each structure fetched from the OPTIMADE API, this parameter is needed to specify which descriptor to use. At the time of writing this code, it can be either"Ward2017"
or"KS2022"
. Special versions ofKS2022
cannot be used since assumptions cannot be made about the data fetched from the OPTIMADE API and only general symmetry-based optimizations can be applied. Default is"KS2022"
.useClearML (
bool
) – Same as in theLocalAdjuster
. Default isFalse
.taskName (
str
) – Same as in theLocalAdjuster
. Default is"OPTIMADEFineTuning"
, and you are encouraged to change it, especially if you are using the ClearML platform.maxResults (
int
) – The maximum number of results to be fetched from the OPTIMADE API for a given query. Default is10000
which is a very high number for most re-training tasks. If you are fetching a lot of data, it’s possible the query is too broad, and you should consider narrowing it down.endpointOverride (
Optional
[List
[str
]]) – List of URL strings with the endpoint to be used for the OPTIMADE queries. This is an advanced option allowing you to ignore theprovider
parameter and directly specify the endpoint to be used. It is useful if you want to use a specific version of the provider’s endpoint or narrow down the query to a sub-database (Alexandria has two different endpoints for PBEsol and SCAN, for instance). You can also use it to query unofficial endpoints. Make sure to (a) include the protocol (http:// or https://) and (b) include neither the version (/v1/) nor the specific endpoint (/structures), as the client will add them. I.e., you want https://alexandria.icams.rub.de/pbesol rather than alexandria.icams.rub.de/pbesol/v1/structures. Default is None, which has no effect.
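A targetPath such as the MP default is a sequence of keys to follow into the nested response entry. The sketch below shows that traversal under stated assumptions: extract_target is a hypothetical helper, and the entry dictionary and its value of -0.5 are invented for illustration rather than taken from a real provider response.

```python
MP_PATH = ('attributes', '_mp_stability', 'gga_gga+u', 'formation_energy_per_atom')


def extract_target(entry, target_path):
    """Follow the keys in target_path; return None if any level is missing."""
    node = entry
    for key in target_path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Invented entry mimicking the nesting implied by MP_PATH.
entry = {
    "id": "hypothetical-entry",
    "attributes": {
        "_mp_stability": {"gga_gga+u": {"formation_energy_per_atom": -0.5}},
    },
}

value = extract_target(entry, MP_PATH)
missing = extract_target({"attributes": {}}, MP_PATH)
```

Returning None for an incomplete path makes it straightforward to skip entries that lack the target property.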
- fetchAndFeturize(query, parallelWorkers=1, verbose=True)[source]
Automatically (1) fetches data from the OPTIMADE API provider specified in the given OPTIMADEAdjuster instance, (2) filters the data by checking if the target property is available, and (3) featurizes the incoming data with the descriptor calculator selected for the OPTIMADEAdjuster instance (KS2022 by default). It effectively prepares everything for the adjustments to be made as if local data was loaded and some metadata was added on top of that.
- Parameters:
query (
str
) – A valid OPTIMADE API query as defined at the specification page for the OPTIMADE consortium. These can be made very elaborate by stacking several filters together, but generally retain good readability and are easy to interpret thanks to their explicit structure written in English. Here are two quick examples: 'elements HAS "Hf" AND elements HAS "Mo" AND NOT elements HAS ANY "O","C","F","Cl","S"' and 'elements HAS "Hf" AND elements HAS "Mo" AND elements HAS "Zr"'.
parallelWorkers (
int
) – How many workers to use at the featurization step. See KS2022 for more details. On most machines, 4-12 should be the optimal number.
verbose (
bool
) – Prints information about progress and results of the process. It is set toTrue
by default.
- Return type:
None
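Query strings of the form shown in the two examples can be assembled programmatically when several related queries are needed. The build_query helper below is a hypothetical convenience covering only the elements HAS and NOT elements HAS ANY clauses used in those examples, not the full OPTIMADE filter grammar.

```python
def build_query(required, excluded=()):
    """Build a filter requiring every listed element and excluding any of the others."""
    clauses = [f'elements HAS "{element}"' for element in required]
    if excluded:
        listed = ",".join(f'"{element}"' for element in excluded)
        clauses.append(f"NOT elements HAS ANY {listed}")
    return " AND ".join(clauses)


hf_mo_query = build_query(["Hf", "Mo"], excluded=["O", "C", "F", "Cl", "S"])
hf_mo_zr_query = build_query(["Hf", "Mo", "Zr"])
```

The two resulting strings reproduce the examples given for the query parameter, so either could be passed to fetchAndFeturize as is.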
modelExporters
- class CoreMLExporter(calculator)[source]
Bases:
object
Export models to the CoreML format to allow for easy loading and inference in CoreML in other projects. This is particularly valuable for Apple devices, as pySIPFENN models can be run using the Neural Engine accelerator with minimal power consumption and neat optimizations.
Note: Some of the dependencies (coremltools) are not installed by default. If you need them, you have to install pySIPFENN in dev mode like: pip install "pysipfenn[dev]", or like pip install -e ".[dev]".
- Parameters:
calculator (
Calculator
) – ACalculator
object with loaded models.
- export(model, append='')[source]
Export a loaded model to
CoreML
format. Models will be saved as{model}.mlpackage
in the current working directory. Models will be annotated with the feature vector name (Ward2017
orKS2022
) and the output will be named “property”. The latter behavior will be adjusted in the future when model output name and unit will be added to the model JSON metadata.- Parameters:
model (
str
) – The name of the model to export (must be loaded in theCalculator
) and it must have a descriptor (Ward2017
orKS2022
) defined in thecalculator.models
dictionary created when theCalculator
was initialized.append (
str
) – A string to append to the exported model name after the model name. Useful for adding a version number or other information to the exported model name.
- Return type:
None
- Returns:
None
- class ONNXExporter(calculator)[source]
Bases:
object
Export models to the ONNX format (what they ship in by default) to allow (1) exporting modified pySIPFENN models, (2) simplifying the models using the ONNX optimizer, and (3) converting them to FP16 precision, cutting the size in half.
Note: Some of the dependencies (onnxconverter_common and onnxsim) are not installed by default. If you need them, you have to install pySIPFENN in dev mode like: pip install "pysipfenn[dev]", or like pip install -e ".[dev]".
- Parameters:
calculator (
Calculator
) – A Calculator object with loaded PyTorch models (happens automatically when the autoLoad argument is kept to its default value of True when initializing the Calculator). During initialization, the loaded PyTorch models are converted back to ONNX (in memory) so that they can be persisted to disk.
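The "cutting the size in half" claim for FP16 follows from storage width alone: a 32-bit float takes 4 bytes and a 16-bit float takes 2. The standard-library sketch below checks this arithmetic and shows the precision cost on a single value. The parameter count is a hypothetical round number, and graph metadata stored alongside the weights is ignored.

```python
import struct

N_WEIGHTS = 1_000_000  # hypothetical parameter count, for illustration only

# Bytes needed to store the weights at each precision ("f" = FP32, "e" = FP16).
fp32_bytes = struct.calcsize("<f") * N_WEIGHTS
fp16_bytes = struct.calcsize("<e") * N_WEIGHTS

# Round-trip one value through FP16 to see the rounding it introduces.
original = 0.1
as_fp16 = struct.unpack("<e", struct.pack("<e", original))[0]
rounding_error = abs(as_fp16 - original)
```

The rounding error is small but nonzero, so it is worth comparing predictions from the FP16 and FP32 models on a few known structures before relying on the smaller file.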
- export(model, append='')[source]
Export a loaded model to ONNX format.
- Parameters:
model (
str
) – The name of the model to export (must be loaded in theCalculator
).append (
str
) – A string to append to the exported model name after the model name, simplification marker, and FP16 marker. Useful for adding a version number or other information to the exported model name.
- Return type:
None
- Returns:
None
- exportAll(append='')[source]
Export all loaded models to ONNX format with the export function. An append string can be passed to the export function to be appended to the exported model name.
- Return type:
None
- simplify(model)[source]
Simplify a loaded model using the ONNX optimizer.
- Parameters:
model (
str
) – The name of the model to simplify (must be loaded in theCalculator
).- Return type:
None
- Returns:
None
- class TorchExporter(calculator)[source]
Bases:
object
Export models to the
PyTorch PT
format to allow for easy loading and inference in PyTorch in other projects.- Parameters:
calculator (
Calculator
) – ACalculator
object with loaded models.
- export(model, append='')[source]
Export a loaded model to
PyTorch PT
format. Models are exported in eval mode (no dropout) and saved in the current working directory.- Parameters:
model (
str
) – The name of the model to export (must be loaded in theCalculator
) and it must have a descriptor (Ward2017
orKS2022
) defined in theCalculator.models
dictionary created when theCalculator
was initialized.append (
str
) – A string to append to the exported model name after the model name. Useful for adding a version number or other information to the exported model name.
- Return type:
None
- Returns:
None