.. _descriptors:

Descriptors
===========

Descriptors allow users to define a controlled vocabulary with which to describe the physical context of an AI task.
Each descriptor defines a term in that vocabulary and consists of a name, a data type, and bounds on that data type.
If you are familiar with the GEMD data model, descriptors are roughly equivalent to ``AttributeTemplates``.

The AI Engine currently supports six kinds of descriptors:

-  `Real Descriptor <#real-descriptor>`__
-  `Integer Descriptor <#integer-descriptor>`__
-  `Categorical Descriptor <#categorical-descriptor>`__
-  `Chemical Formula Descriptor <#chemical-formula-descriptor>`__
-  `Molecular Structure Descriptor <#molecular-structure-descriptor>`__
-  `Formulation Descriptor <#formulation-descriptor>`__

Real Descriptor
---------------

:class:`~citrine.informatics.descriptors.RealDescriptor` is used to represent continuous variables.
Each Real Descriptor must provide a lower and an upper bound, which are used both to validate input data and as a prior when making predictions.
If you are not sure what bounds to use, you may want to look at the attribute templates to see if another user has already defined suitable bounds.
It is better to err on the side of broader bounds than narrower ones.

Additionally, each Real Descriptor defines the units of every variable associated with that descriptor.
Any GEMD-compatible unit string may be used.
If a variable is dimensionless, you can use an empty string.

Integer Descriptor
------------------

:class:`~citrine.informatics.descriptors.IntegerDescriptor` is used to represent discrete, dimensionless variables.
Each Integer Descriptor must provide a lower and an upper bound, which are used both to validate input data and as a prior when making predictions.
If you are not sure what bounds to use, you may want to look at the attribute templates to see if another user has already defined suitable bounds.
It is better to err on the side of broader bounds than narrower ones.
Integer Descriptors are dimensionless, so units cannot be specified.

Categorical Descriptor
----------------------

:class:`~citrine.informatics.descriptors.CategoricalDescriptor` is used to represent variables that can take one of a set of values, i.e., categories.
All of the possible categories must be known ahead of time and specified in the Categorical Descriptor.

Chemical Formula Descriptor
---------------------------

:class:`~citrine.informatics.descriptors.ChemicalFormulaDescriptor` is used to represent variables that should be interpreted as chemical formulas.
The Chemical Formula Descriptor has no parameters other than a name.

Molecular Structure Descriptor
------------------------------

:class:`~citrine.informatics.descriptors.MolecularStructureDescriptor` is used to represent variables that should be interpreted as molecular structures.
Both SMILES and InChI are supported.
The Molecular Structure Descriptor has no parameters other than a name.

Formulation Descriptor
----------------------

:class:`~citrine.informatics.descriptors.FormulationDescriptor` is used to represent variables that contain information about mixtures of other materials.
The Formulation Descriptor has no parameters other than a name (its key), which must be either 'Formulation' or 'Flat Formulation'.
The key 'Formulation' should be used when referring to mixtures found directly in the training data, whose ingredients may themselves be mixtures.
The key 'Flat Formulation' should be reserved for mixtures composed only of raw ingredients, as produced by a :class:`~citrine.informatics.predictors.simple_mixture_predictor.SimpleMixturePredictor`.
The two allowed formulation descriptors can be obtained from the helper methods ``FormulationDescriptor.hierarchical()`` and ``FormulationDescriptor.flat()``, which produce descriptors with the keys 'Formulation' and 'Flat Formulation', respectively.

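
The descriptor kinds described in the preceding sections can be constructed roughly as follows.
This is a minimal sketch rather than a canonical recipe: the keys, bounds, units, and category names are made up for illustration, and the keyword arguments (``lower_bound``, ``upper_bound``, ``units``, ``categories``) are assumed to match the installed version of ``citrine``.

.. code:: python

    from citrine.informatics.descriptors import (
        CategoricalDescriptor,
        ChemicalFormulaDescriptor,
        FormulationDescriptor,
        IntegerDescriptor,
        MolecularStructureDescriptor,
        RealDescriptor,
    )

    # Continuous variable with units; all values here are illustrative only.
    density = RealDescriptor("density", lower_bound=0, upper_bound=100, units="g/cm^3")

    # Dimensionless continuous variable: units is an empty string.
    aspect_ratio = RealDescriptor("aspect ratio", lower_bound=0, upper_bound=10, units="")

    # Discrete variable: bounds only, no units.
    layers = IntegerDescriptor("layer count", lower_bound=1, upper_bound=20)

    # Every allowed category is listed up front.
    phase = CategoricalDescriptor("phase", categories=["solid", "liquid", "gas"])

    # These two take nothing but a name.
    formula = ChemicalFormulaDescriptor("formula")
    solvent = MolecularStructureDescriptor("solvent")

    # Formulation descriptors come from the helper methods instead of a free-form name.
    hierarchical = FormulationDescriptor.hierarchical()
    flat = FormulationDescriptor.flat()

    print(hierarchical.key, "/", flat.key)

Constructing descriptors this way requires no connection to the platform, so it is a convenient way to check names and bounds before they are used elsewhere.
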

Platform Vocabularies
=====================

A set of descriptors defines a controlled vocabulary with which to describe AI tasks.
The :class:`~citrine.builders.descriptors.PlatformVocabulary` class is provided to collect a set of descriptors, associate them with short, convenient names, and provide them via a familiar dictionary interface.

While descriptors cannot be independently saved on the platform for reuse, ``AttributeTemplates`` can be.
Therefore, common descriptors can be saved to the data platform as attribute templates, effectively sharing them with other users.
:meth:`~citrine.builders.descriptors.PlatformVocabulary.from_templates` facilitates this pattern by automatically downloading attribute templates and converting them into descriptors.

Attribute templates must be associated with a namespace via custom identifiers (the ``uids`` field).
When calling ``from_templates``, a scope is provided to select one of those namespaces.
The descriptors are then associated with the names from that namespace.

.. code:: python

    from citrine import Citrine
    from citrine.resources.property_template import PropertyTemplate
    from citrine.builders.descriptors import PlatformVocabulary
    # RealBounds is provided by the GEMD package
    from gemd.entity.bounds import RealBounds

    # create a session with citrine using your API key
    # (API_KEY is assumed to be defined elsewhere)
    session = Citrine(api_key=API_KEY)

    # create project
    project = session.projects.register('Example project')

    # create a property template for density
    project.property_templates.register(PropertyTemplate(
        name="density",
        uids={"my_templates": "rho"},
        bounds=RealBounds(lower_bound=0, upper_bound=100, default_units="g/cm^3")
    ))

    # create a property template for temperature
    project.property_templates.register(PropertyTemplate(
        name="temperature",
        uids={"my_templates": "T"},
        bounds=RealBounds(lower_bound=0, upper_bound=1000000, default_units="kelvin")
    ))

    # create a PlatformVocabulary from the templates
    pv = PlatformVocabulary.from_templates(project=project, scope="my_templates")

    # see the terms in the platform vocabulary
    print(list(pv))
    # returns ["rho", "T"]

    # access a descriptor from the platform vocabulary
    print(pv["T"])
    # returns RealDescriptor(key="temperature", lower_bound=0, upper_bound=1000000, units="K")
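
The dictionary interface can also be tried without a platform connection by building a vocabulary directly from descriptors.
The sketch below assumes that ``PlatformVocabulary`` accepts a keyword-only ``entries`` mapping from short names to descriptors; check this against the API reference for your installed version. The short names and bounds simply mirror the template example above.

.. code:: python

    from citrine.builders.descriptors import PlatformVocabulary
    from citrine.informatics.descriptors import RealDescriptor

    # Map short, convenient names to full descriptors
    # (the ``entries`` keyword is an assumption about the constructor).
    pv = PlatformVocabulary(entries={
        "rho": RealDescriptor("density", lower_bound=0, upper_bound=100, units="g/cm^3"),
        "T": RealDescriptor("temperature", lower_bound=0, upper_bound=1000000, units="kelvin"),
    })

    # Standard mapping operations: length, membership, lookup, and iteration.
    print(len(pv))
    print("rho" in pv)
    print(pv["T"].key)
    for name, descriptor in pv.items():
        print(name, "->", descriptor.key)

Keeping the short names separate from the descriptor keys means the code that builds predictors can refer to ``pv["rho"]`` while the descriptor itself keeps its full, human-readable key.
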