3. Data Extraction

GEM Tables are a component of the Citrine Platform data service that extracts data from GEMD’s complex and expressive graphical representations into a tabular structure (like a CSV file) that is easier to consume in analytical contexts. A GEM Table is defined on a set of material histories, and the rows in the resulting GEM Table are in 1-to-1 correspondence with those material histories. Columns correspond to data about the material histories, such as the temperature measured in a kiln used at a specific manufacturing step.

3.1. Defining rows and columns

A Row object describes a mapping from a list of Datasets to rows of a table by selecting a set of Material Histories. Each Material History corresponds to exactly one row, though the Material Histories may overlap such that the same objects contribute data to multiple rows. For example, the material histories of two distinct Material Runs will map to exactly two rows, even if those histories are identical except for the Process Runs of their final baking step. Currently, the only way to define rows is through MaterialRunByTemplate, which produces one row per Material Run associated with any of a list of material templates.

The set of tags is optional and is intended for fine-grained curation. When tags are provided, a terminal material that does not contain any of them is filtered out.

from citrine.gemtables.rows import MaterialRunByTemplate
from gemd.entity.link_by_uid import LinkByUID
row_def = MaterialRunByTemplate(
      templates=[LinkByUID(scope="templates", id="finished cookie")],
      tags=["foo::bar"]
      )

A Variable object specifies how to select a piece of data from each Material History. Thus, it performs the first part of a mapping from the set of Material Histories to columns in the GEM Table. A Variable is addressed locally (within a definition) by a name. A Variable is also labeled with headers, which is a list of strings that can express a hierarchical relationship with other variables. The headers are listed in decreasing hierarchical order: the first string indicates the broadest classification, and each subsequent string indicates a refinement of those classifications preceding it. In the example below, a hardness measurement might also be performed on the object denoted by the Product header. One might assign headers = ["Product", "Hardness"] to that measurement in order to relate it with the Density measurement of the same physical object. Assuming they are next to each other, the Density and Hardness columns in the resulting table would be grouped under a larger Product column.

from citrine.gemtables.variables import AttributeByTemplateAfterProcessTemplate
from gemd.entity.link_by_uid import LinkByUID
final_density = AttributeByTemplateAfterProcessTemplate(
      name="final density",
      headers=["Product", "Density"],
      attribute_template=LinkByUID(scope="templates", id="cookie density"),
      process_template=LinkByUID(scope="templates", id="apply glaze"))
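The grouping that headers express can be illustrated without the platform. The following is a minimal sketch in plain Python; the helper `group_by_headers` and its sample data are hypothetical, and the code only mimics the hierarchy described above rather than how the Citrine Platform renders tables.

```python
def group_by_headers(header_lists):
    """Nest header lists into a tree, broadest classification first."""
    tree = {}
    for headers in header_lists:
        node = tree
        for level in headers:
            # Descend one level, creating the branch on first use.
            node = node.setdefault(level, {})
    return tree

# Hypothetical variables: two measurements on the same physical object.
grouped = group_by_headers([
    ["Product", "Density"],
    ["Product", "Hardness"],
])
print(list(grouped))             # top-level groups
print(list(grouped["Product"]))  # columns grouped under "Product"
```

Because both header lists start with "Product", the two measurements end up under a single top-level group.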

A Column object describes how to transform a Variable into a primitive value (e.g., a real number, an integer, or a string) that can be entered into a cell in a table. This is necessary because GEMD Attributes are more general than primitive values; they often convey uncertainty estimates, for example. All columns are linked to a variable through the data_source field, which must equal the name of some variable in the table configuration.

from citrine.gemtables.columns import MeanColumn, StdDevColumn
final_density_mean = MeanColumn(data_source="final density", target_units="g/cm^3")
final_density_std = StdDevColumn(data_source="final density", target_units="g/cm^3")

The data_source parameter is a reference to a Variable for this Column to describe, so the value of data_source must match the name of a Variable.
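Since a mismatch between data_source and a variable name is easy to introduce, a small check before previewing can help. This is a sketch only: `unmatched_data_sources` is a hypothetical helper, and the stand-in objects below are not Citrine classes; they carry just the `name` and `data_source` attributes described above.

```python
from types import SimpleNamespace

def unmatched_data_sources(variables, columns):
    """Return the data_source values that do not name any variable."""
    names = {v.name for v in variables}
    return [c.data_source for c in columns if c.data_source not in names]

# Stand-ins with only the attributes the check needs (not Citrine classes).
variables = [SimpleNamespace(name="final density")]
columns = [
    SimpleNamespace(data_source="final density"),
    SimpleNamespace(data_source="final hardness"),  # no such variable
]
missing = unmatched_data_sources(variables, columns)
print(missing)
```

The same function should work on any objects that expose those two attributes.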

3.2. Defining tables

The TableConfig object defines how to build a GEM Table. It specifies a list of UUIDs for Datasets to query in generating the table, a list of Row objects that define material histories to use as rows, a list of Variable objects that specify how to extract data from those material histories, and a list of Column objects to transform those variables into columns.

from citrine.resources.table_config import TableConfig
from uuid import UUID
table_config = TableConfig(
      name="cookies",
      description="Cookie densities",
      datasets=[UUID("7d040451-7cfb-45ca-9e0e-4b2b7010edd6"),
                  UUID("7cfb45ca-9e0e-4b2b-7010-edd67d040451")],
      variables=[final_density],
      rows=[row_def],
      columns=[final_density_mean, final_density_std])

Note the inclusion of two Datasets above. In general, you should have at least two Datasets referenced because Objects and Templates are generally associated with different Datasets.

In addition to defining variables, rows, and columns individually, there are convenience methods that simultaneously add multiple elements to an existing Table Config. One such method is add_all_ingredients(), which creates variables and columns for every potential ingredient in a process. The user provides a link to a process template that has a non-empty set of allowed_names (the allowed names of the ingredient runs and specs in the process). This creates an id variable/column and a quantity variable/column for each allowed name. The user specifies the dimension to report the quantity in: mass fraction, volume fraction, number fraction, or absolute quantity. If the quantities are reported in absolute amounts, then there is also a column for the units.

The code below takes the table_config object defined in the preceding code block and adds the ingredient amounts for a “batter mixing” process with known uid “3a308f78-e341-f39c-8076-35a2c88292ad”. Assume that the process template is accessible from a known Project, project.

from citrine.gemtables.variables import IngredientQuantityDimension
from gemd.entity.link_by_uid import LinkByUID

table_config = table_config.add_all_ingredients(
    process_template=LinkByUID('id', '3a308f78-e341-f39c-8076-35a2c88292ad'),
    project=project,
    quantity_dimension=IngredientQuantityDimension.MASS
)

If the process template’s allowed names include, e.g., “flour”, then there will now be columns named “batter mixing~flour~id” and “batter mixing~flour~mass fraction~mean”.
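To anticipate which columns will appear, the naming pattern can be sketched in plain Python. The pattern here is extrapolated from the two column names quoted above and is an assumption, not a documented rule; `expected_ingredient_columns` is a hypothetical helper, "sugar" is a made-up allowed name, and the units column that accompanies absolute quantities is not covered.

```python
def expected_ingredient_columns(process_name, allowed_names, quantity_label):
    """Build the id and quantity column names for each allowed ingredient name."""
    columns = []
    for ingredient in allowed_names:
        # Assumed pattern: header levels joined with "~".
        columns.append("~".join([process_name, ingredient, "id"]))
        columns.append("~".join([process_name, ingredient, quantity_label, "mean"]))
    return columns

names = expected_ingredient_columns(
    "batter mixing", ["flour", "sugar"], "mass fraction"
)
for name in names:
    print(name)
```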

3.3. Previewing tables

Accessing the table_configs property of a Project returns a TableConfigCollection object, which facilitates access to the collection of all TableConfigs visible to a Project. Via such an object, one can preview a draft TableConfig on an explicit set of Material Histories, defined by their terminal materials.

For example:

table_configs = project.table_configs
preview = table_configs.preview(
      table_config=table_config,
      preview_materials=[
            LinkByUID(scope="products", id="best cookie ever"),
            LinkByUID(scope="products", id="worst cookie ever")]
 )

The preview returns a dictionary with two keys:

- csv: a preview of the table in comma-separated-values format.
- warnings: a list of string-valued warnings that describe possible issues with the Table Config, e.g., that one of the columns is completely empty.

For example, if you wanted to print the warnings and then load the preview into a pandas dataframe, you could:

from io import StringIO
import pandas as pd

preview = table_configs.preview(table_config=table_config, preview_materials=preview_materials)
print("\n\n".join(preview["warnings"]))
data_frame = pd.read_csv(StringIO(preview["csv"]))

or even wrap it in a method that displays multi-row headers:

def resp_to_pandas(resp):
    import warnings
    from io import StringIO
    import pandas as pd
    import numpy as np

    if resp["warnings"]:
        warnings.warn("\n\n".join(resp["warnings"]))

    df = pd.read_csv(StringIO(resp["csv"]))

    headers = [x.split('~') for x in df]
    for header in headers:
        header.extend([''] * (max(len(x) for x in headers) - len(header)))

    return pd.DataFrame(df.values, columns=[x for x in np.array(headers).T])
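Where pandas is not available, the same header splitting can be done with the standard library. The following is a minimal sketch; `split_headers` is a hypothetical helper and `fake_resp` is a made-up response that only imitates the csv and warnings keys described above.

```python
import csv
from io import StringIO

def split_headers(resp):
    """Return (padded header levels per column, data rows) from a preview response."""
    reader = csv.reader(StringIO(resp["csv"]))
    header_row = next(reader)
    levels = [name.split("~") for name in header_row]
    depth = max(len(parts) for parts in levels)
    # Pad shorter headers with empty strings so every column has the same depth.
    padded = [parts + [""] * (depth - len(parts)) for parts in levels]
    return padded, list(reader)

# Made-up response in the shape described above.
fake_resp = {
    "csv": "Product~Density~mean,Product~Density~std,name\n"
           "1.2,0.1,best cookie ever\n",
    "warnings": [],
}
headers, rows = split_headers(fake_resp)
print(headers)
print(rows)
```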

3.4. Building and downloading tables

After iteratively adjusting the TableConfig with the preview method above, the definition can be registered to save it.

table_config = table_configs.register(table_config)
print("Definition registered as {}".format(table_config.definition_uid))

Registered Table Configs can be built into GEM Tables. For example:

table = project.tables.build_from_config(table_config)
project.tables.read(table, "./my_table.csv")

The above will build a table, wait for the build job to complete, and save the table locally.

However, GEM Tables are sometimes large and time-consuming to build, so the build process can be performed asynchronously with the initiate_build method. For example:

job = project.tables.initiate_build(table_config)

The return type of the initiate_build method is a JobSubmissionResponse that contains a unique identifier for the submitted job.

The table id and version can be used to get a GemTable resource that provides access to the table.

You can also pass the JobSubmissionResponse to the get_by_build_job method to obtain the GemTable resource directly. Just like the FileLink resource, GemTable does not literally contain the table but does expose a read method that will download it.

For example, once the above initiate_build method has completed:

# Get the table resource as an object
table = project.tables.get_by_build_job(job)
# Download the table
project.tables.read(table=table, local_path="./my_table.csv")

3.5. Available Row Definitions

Currently, GEM Tables provide a single way to define rows: by the MaterialTemplate of the terminal materials of the material histories that correspond to each row.

3.5.1. MaterialRunByTemplate

The MaterialRunByTemplate class defines rows through a list of MaterialTemplates. Every MaterialRun that is assigned to any template in the list is used as the terminal material of a Material History to be mapped to a row. This is helpful when the rows correspond to classes of materials that are defined through their templates. For example, there could be a MaterialTemplate called “Cake” that is used in all of the cakes and another called “Brownies” that is used in all of the brownies. By including one or both of those templates, you can define a table of Cakes, Brownies, or both.
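The selection rule can be illustrated with plain Python. This sketch imitates the described behavior on made-up dictionaries and is not the platform's implementation; `select_terminal_materials` and the sample runs are hypothetical. The optional tag filter follows the rule that a terminal material without any of the given tags is dropped.

```python
def select_terminal_materials(material_runs, template_ids, tags=None):
    """Pick the runs that would each become one row."""
    selected = []
    for run in material_runs:
        if run["template"] not in template_ids:
            continue
        # With tags given, keep only runs carrying at least one of them.
        if tags is not None and not set(tags) & set(run["tags"]):
            continue
        selected.append(run["name"])
    return selected

# Made-up material runs, each assigned to one template.
runs = [
    {"name": "chocolate cake", "template": "Cake", "tags": ["foo::bar"]},
    {"name": "fudge brownie", "template": "Brownies", "tags": []},
    {"name": "sugar cookie", "template": "Cookie", "tags": ["foo::bar"]},
]
both = select_terminal_materials(runs, ["Cake", "Brownies"])
tagged = select_terminal_materials(runs, ["Cake", "Brownies"], tags=["foo::bar"])
print(both)
print(tagged)
```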

3.6. Available Variable Definitions

There are several ways to define variables that take their values from Attributes and identifiers in GEMD objects.

Attributes
- AttributeByTemplate: for when the attribute occurs once per material history
- AttributeByTemplateAndObjectTemplate: for when the attributes are distinguished by the object that they are contained in
- AttributeByTemplateAfterProcessTemplate: for when measurements are distinguished by the process that precedes them
- AttributeInOutput: for when attributes occur both in a process output and one or more of its inputs
- IngredientQuantityByProcessAndName: for the specific case of the volume fraction, mass fraction, number fraction, or absolute quantity of an ingredient
- IngredientQuantityInOutput: for the quantity of an ingredient between the terminal material and a given set of processes (useful for ingredients used in multiple processes)
- LocalAttribute: for retrieving the attribute from the terminal material or its attached process or measurements (useful for attributes found on multiple materials)
- LocalIngredientQuantity: for the quantity of an ingredient used in the process creating the terminal material (useful for ingredients used in multiple processes)
Identifiers
- TerminalMaterialInfo: for fields defined on the material at the terminus of the Material History, like the name of the material
- TerminalMaterialIdentifier: for the id of the Material History, which can be used as a unique identifier for the rows
- IngredientIdentifierByProcessTemplateAndName: for the id of the material being used in an ingredient, which can be used as a key for looking up that input material
- IngredientIdentifierInOutput: for the id of a material used in an ingredient between the terminal material and a given set of processes (useful for ingredients used in multiple processes)
- LocalIngredientIdentifier: for the id of a material used in an ingredient used in the process creating the terminal material (useful for ingredients used in multiple processes)
- IngredientLabelByProcessAndName: for a Boolean that indicates whether an ingredient is assigned a given label
- IngredientLabelsSetByProcessAndName: for the set of labels belonging to an ingredient in a process
- IngredientLabelsSetInOutput: for the set of labels belonging to an ingredient between the terminal material and a given set of processes (useful for ingredients used in multiple processes)
- LocalIngredientLabelsSet: for the set of labels belonging to an ingredient used in the process creating the terminal material (useful for ingredients used in multiple processes)
Compound Variables
- XOR: for combining multiple variable definitions into one variable, when only one of those definitions yields a result for a given tree (logical exclusive OR)
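The exclusive-OR idea behind the XOR variable can be sketched in plain Python. `xor_value` is a hypothetical helper, and what the platform does when several definitions yield a result is not stated here; leaving the cell empty in that case is purely an assumption of this sketch.

```python
def xor_value(candidates):
    """Return the single non-empty candidate, or None otherwise."""
    present = [value for value in candidates if value is not None]
    # Assumption: zero or several results leave the cell empty in this sketch.
    return present[0] if len(present) == 1 else None

# Each list holds the results of two hypothetical variable definitions
# evaluated on one material history.
print(xor_value([None, 1.2]))
print(xor_value([None, None]))
```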

3.7. Available Column Definitions

There are several ways to define columns, depending on the type of the attribute that is being used as the data source for the column.

Numeric attribute values, like ContinuousValue and IntegerValue
- MeanColumn: for the mean value of the numeric distribution
- StdDevColumn: for the standard deviation of the numeric distribution, or empty if the value is nominal
- QuantileColumn: for a user-defined quantile of the numeric distribution, or empty if the value is nominal
- OriginalUnitsColumn: for getting the units, as entered by the data author, from the specific attribute value; valid for continuous values only
Enumerated attribute values, like CategoricalValue
- MostLikelyCategoryColumn: for getting the mode
- MostLikelyProbabilityColumn: for getting the probability of the mode
Composition and chemical formula attribute values, like CompositionValue
- FlatCompositionColumn: for flattening the composition into a chemical-formula-like string
- ComponentQuantityColumn: for getting the (optionally normalized) quantity of a specific component, by name
- NthBiggestComponentNameColumn: for getting the name of the n-th biggest component (by quantity)
- NthBiggestComponentQuantityColumn: for getting the (optionally normalized) quantity of the n-th biggest component (by quantity)
Molecular structure attribute values, like MolecularValue
- MolecularStructureColumn: for getting molecular structures in a line notation
String- and Boolean-valued fields, like identifiers and non-attribute fields
- IdentityColumn: for simply casting the value to a string, which doesn’t work on values from Attributes
Collections of values
- ConcatColumn: for concatenating the results of a list- or set-valued result, such as is returned by IngredientLabelsSetInOutput
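The idea behind the n-th biggest component columns can be sketched in plain Python. This is an illustration under assumptions, not the platform's implementation: `nth_biggest_component` is a hypothetical helper, n is taken to start at 1, and normalization is taken to mean dividing by the total of all quantities.

```python
def nth_biggest_component(quantities, n, normalize=False):
    """Return (name, quantity) of the n-th biggest component; n starts at 1 here."""
    ranked = sorted(quantities.items(), key=lambda item: item[1], reverse=True)
    name, quantity = ranked[n - 1]
    if normalize:
        # Assumed normalization: fraction of the total quantity.
        quantity = quantity / sum(quantities.values())
    return name, quantity

# Made-up composition with distinct quantities (ties are not handled here).
dough = {"flour": 3.0, "sugar": 2.0, "butter": 1.0}
print(nth_biggest_component(dough, 1))
print(nth_biggest_component(dough, 1, normalize=True))
```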

3.8. Compatibility with AI Engine

The Citrine Platform automatically converts the values found in a GEM Table into the format used by the AI Engine for predictor training and default asset creation. This includes generating descriptors from the variables found in the table configuration and extracting individual values from the cells of the GEM Table.

In most cases, descriptors are generated based on the bounds (children of the BaseBounds class) found on the attribute template referenced by a GEM Table variable. The key of the descriptor is derived by concatenating the entries of the table variable's headers field. The exception is the FormulationDescriptor, which follows the special rule described below. The mappings from variables in a GEM Table to descriptors are as follows:

- RealBounds are converted to a RealDescriptor
- IntegerBounds are converted to an IntegerDescriptor
- CategoricalBounds are converted to a CategoricalDescriptor
- CompositionBounds are converted to a ChemicalFormulaDescriptor
- MolecularStructureBounds are converted to a MolecularStructureDescriptor
- A FormulationDescriptor with key ‘Formulation’ is generated whenever an ingredient quantity variable (e.g., IngredientQuantityInOutput, IngredientQuantityByProcessAndName, or LocalIngredientQuantity) is present in the table configuration
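The mapping above can be written as a lookup table. The following sketch works on class names as strings and does not call the Citrine library; `descriptor_for` is a hypothetical helper, and joining the headers with "~" is an assumption, since the text only says that the key comes from concatenating them.

```python
BOUNDS_TO_DESCRIPTOR = {
    "RealBounds": "RealDescriptor",
    "IntegerBounds": "IntegerDescriptor",
    "CategoricalBounds": "CategoricalDescriptor",
    "CompositionBounds": "ChemicalFormulaDescriptor",
    "MolecularStructureBounds": "MolecularStructureDescriptor",
}

def descriptor_for(bounds_type, headers, separator="~"):
    """Return (descriptor type name, key) for a variable's bounds and headers."""
    # Assumption: headers are joined with "~"; the separator is not documented here.
    return BOUNDS_TO_DESCRIPTOR[bounds_type], separator.join(headers)

print(descriptor_for("RealBounds", ["Product", "Density"]))
```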

When using a GEM Table as a data source for predictor training, the generated descriptors are associated with individual cell values in each row of data. The following value types (children of the BaseValue class) are compatible with each type of descriptor:

- RealDescriptor: values of type NominalReal, NormalReal, and UniformReal
- IntegerDescriptor: values of type NominalInteger and UniformInteger
- MolecularStructureDescriptor: values of type Smiles and InChI
- CategoricalDescriptor: values of type NominalCategorical and DiscreteCategorical
- ChemicalFormulaDescriptor: values of type EmpiricalFormula, or values of type NominalComposition when all quantity keys are valid atomic symbols
- FormulationDescriptor: all values extracted by ingredient quantity, identifier, and label variables are used to represent the formulation
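The condition on NominalComposition values can be checked ahead of time. This is a minimal sketch on a plain dictionary of quantities; `is_formula_compatible` is a hypothetical helper, and the symbol set below is the standard list of 118 element symbols, which may differ from the set the platform accepts.

```python
ATOMIC_SYMBOLS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni
Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe
Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au
Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf
Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())

def is_formula_compatible(quantities):
    """True when every quantity key is an atomic symbol."""
    return all(key in ATOMIC_SYMBOLS for key in quantities)

print(is_formula_compatible({"Na": 0.5, "Cl": 0.5}))
print(is_formula_compatible({"flour": 0.8, "sugar": 0.2}))
```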