3. Data Extraction
GEM Tables are a component of the Citrine Platform data service that extracts data from GEMD’s complex and expressive graphical representations into a tabular structure (like a CSV file) that is easier to consume in analytical contexts. A GEM Table is defined on a set of material histories, and the rows in the resulting GEM Table are in 1-to-1 correspondence with those material histories. Columns correspond to data about the material histories, such as the temperature measured in a kiln used at a specific manufacturing step.
3.1. Defining rows and columns
A Row object describes a mapping from a list of Datasets to rows of a table by selecting a set of Material Histories.
Each Material History corresponds to exactly one row, though the Material Histories may overlap such that the same objects contribute data to multiple rows.
For example, the material histories of two distinct Material Runs will map to exactly two rows, even if their histories are identical except for the Process Runs of their final baking step.
Currently, the only way to define rows is through MaterialRunByTemplate, which produces one row per Material Run associated with any of a list of material templates.
The set of tags is optional and is intended for fine-grained curation: if tags are given, a terminal material that doesn’t contain any of them will be filtered out.
from citrine.gemtables.rows import MaterialRunByTemplate
from gemd.entity.link_by_uid import LinkByUID
row_def = MaterialRunByTemplate(
    templates=[LinkByUID(scope="templates", id="finished cookie")],
    tags=["foo::bar"]
)
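The selection rule described above can be sketched in plain Python. This is only an illustration of the filtering logic, not the platform's implementation: the `Run` record and the `select_terminal_runs` helper are hypothetical names, and Material Runs are modeled as simple records with a template id and a list of tags.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    """Hypothetical stand-in for a Material Run (not a gemd class)."""
    name: str
    template_id: str
    tags: List[str] = field(default_factory=list)


def select_terminal_runs(runs: List[Run],
                         template_ids: List[str],
                         tags: Optional[List[str]] = None) -> List[Run]:
    """Keep runs assigned to any listed template; if tags are given,
    also require that the run carries at least one of them."""
    selected = []
    for run in runs:
        if run.template_id not in template_ids:
            continue
        if tags and not set(tags) & set(run.tags):
            continue  # filtered out: none of the requested tags present
        selected.append(run)
    return selected


runs = [
    Run("cookie A", "finished cookie", ["foo::bar"]),
    Run("cookie B", "finished cookie", ["other::tag"]),
    Run("dough", "cookie dough", ["foo::bar"]),
]
rows = select_terminal_runs(runs, ["finished cookie"], tags=["foo::bar"])
print([r.name for r in rows])  # one row per selected run
```

Omitting `tags` in this sketch keeps every run that matches a template, which mirrors the optional nature of the tags argument.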
A Variable object specifies how to select a piece of data from each Material History.
Thus, it performs the first part of a mapping from the set of Material Histories to columns in the GEM Table.
A Variable is addressed locally (within a definition) by a name.
A Variable is also labeled with headers, which is a list of strings that can express a hierarchical relationship with other variables.
The headers are listed in decreasing hierarchical order: the first string indicates the broadest classification, and each subsequent string indicates a refinement of those classifications preceding it.
In the example below, a hardness measurement might also be performed on the object denoted by the Product header.
One might assign headers = ["Product", "Hardness"] to that measurement in order to relate it with the Density measurement of the same physical object.
Assuming they are next to each other, the Density and Hardness columns in the resulting table would be grouped under a larger Product column.
from citrine.gemtables.variables import AttributeByTemplateAfterProcessTemplate
from gemd.entity.link_by_uid import LinkByUID
final_density = AttributeByTemplateAfterProcessTemplate(
    name="final density",
    headers=["Product", "Density"],
    attribute_template=LinkByUID(scope="templates", id="cookie density"),
    process_template=LinkByUID(scope="templates", id="apply glaze"))
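To make the header hierarchy concrete, the following sketch groups variables by their first header, which is how the Density and Hardness columns would end up under a shared Product column. It uses plain lists of strings rather than Variable objects; `group_by_top_header` and the "Oven" variable are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def group_by_top_header(headers_list: List[List[str]]) -> Dict[str, List[str]]:
    """Group the remaining header levels under the first (broadest) header,
    preserving the order in which the variables were given."""
    groups: Dict[str, List[str]] = {}
    for headers in headers_list:
        top, rest = headers[0], headers[1:]
        groups.setdefault(top, []).append(" / ".join(rest))
    return groups


variable_headers = [
    ["Product", "Density"],
    ["Product", "Hardness"],
    ["Oven", "Temperature"],  # hypothetical third variable
]
layout = group_by_top_header(variable_headers)
for top, children in layout.items():
    print(top, "->", children)
```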
A Column object describes how to transform a Variable into a primitive value (e.g., a real number, an integer, or a string) that can be entered into a cell in a table.
This is necessary because GEMD Attributes are more general than primitive values; they often convey uncertainty estimates, for example.
All columns are linked to a variable through the data_source field, which must equal the name of some variable in the table configuration.
from citrine.gemtables.columns import MeanColumn, StdDevColumn
final_density_mean = MeanColumn(data_source="final density", target_units="g/cm^3")
final_density_std = StdDevColumn(data_source="final density", target_units="g/cm^3")
The data_source parameter is a reference to a Variable for this Column to describe, so the value of data_source must match the name of a Variable.
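Because a mismatch between data_source and a variable name is easy to introduce while editing, a quick local check can help before previewing. The sketch below works on plain strings; `find_unmatched_columns` is a hypothetical helper and not part of the citrine package, and "final hardness" is a made-up source with no matching variable.

```python
from typing import Iterable, List


def find_unmatched_columns(variable_names: Iterable[str],
                           column_sources: Iterable[str]) -> List[str]:
    """Return every data_source that does not match a variable name."""
    known = set(variable_names)
    return [src for src in column_sources if src not in known]


variable_names = ["final density"]
# Two columns read from "final density"; the third has no matching variable
column_sources = ["final density", "final density", "final hardness"]
unmatched = find_unmatched_columns(variable_names, column_sources)
print(unmatched)
```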
3.2. Defining tables
The TableConfig object defines how to build a GEM Table.
It specifies a list of UUIDs for Datasets to query in generating the table, a list of Row objects that define material histories to use as rows, a list of Variable objects that specify how to extract data from those material histories, and a list of Column objects to transform those variables into columns.
from citrine.resources.table_config import TableConfig
from uuid import UUID
table_config = TableConfig(
    name="cookies",
    description="Cookie densities",
    datasets=[UUID("7d040451-7cfb-45ca-9e0e-4b2b7010edd6"),
              UUID("7cfb45ca-9e0e-4b2b-7010-edd67d040451")],
    variables=[final_density],
    rows=[row_def],
    columns=[final_density_mean, final_density_std])
Note the inclusion of two Datasets above. In general, you should have at least two Datasets referenced because Objects and Templates are generally associated with different Datasets.
In addition to defining variables, rows, and columns individually, there are convenience methods that simultaneously add multiple elements to an existing Table Config.
One such method is add_all_ingredients(), which creates variables and columns for every potential ingredient in a process.
The user provides a link to a process template that has a non-empty set of allowed_names (the allowed names of the ingredient runs and specs in the process).
This creates an id variable/column and a quantity variable/column for each allowed name.
The user specifies the dimension to report the quantity in: mass fraction, volume fraction, number fraction, or absolute quantity.
If the quantities are reported in absolute amounts, then there is also a column for the units.
The code below takes the table_config object defined in the preceding code block and adds the ingredient amounts for a “batter mixing” process with known uid “3a308f78-e341-f39c-8076-35a2c88292ad”.
Assume that the process template is accessible from a known Project, project.
from citrine.gemtables.variables import IngredientQuantityDimension
table_config = table_config.add_all_ingredients(
    process_template=LinkByUID('id', '3a308f78-e341-f39c-8076-35a2c88292ad'),
    project=project,
    quantity_dimension=IngredientQuantityDimension.MASS
)
If the process template’s allowed names include, e.g., “flour”, then there will now be columns “batter mixing~flour~id” and “batter mixing~flour~mass fraction~mean”.
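The naming pattern in that example can be sketched as a join over "~". This is inferred only from the two column names quoted above, so treat it as an assumption rather than a specification: `expected_ingredient_columns` is a hypothetical helper, "sugar" is a made-up allowed name, and the sketch does not cover absolute quantities or the additional units column.

```python
from typing import List


def expected_ingredient_columns(process_name: str,
                                allowed_names: List[str],
                                dimension_label: str = "mass fraction") -> List[str]:
    """List the column names suggested by the pattern in the text:
    one id column and one quantity column per allowed ingredient name."""
    names = []
    for ingredient in allowed_names:
        names.append("~".join([process_name, ingredient, "id"]))
        names.append("~".join([process_name, ingredient, dimension_label, "mean"]))
    return names


columns = expected_ingredient_columns("batter mixing", ["flour", "sugar"])
for name in columns:
    print(name)
```

Comparing such a list against the header of a preview is one way to spot a missing allowed name early.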
3.3. Previewing tables
Accessing table_configs on a Project returns a TableConfigCollection object, which facilitates access to the collection of all TableConfigs visible to a Project.
Via such an object, one can preview a draft TableConfig on an explicit set of Material Histories, defined by their terminal materials.
For example:
table_configs = project.table_configs
preview = table_configs.preview(
table_config=table_config,
    preview_materials=[
        LinkByUID(scope="products", id="best cookie ever"),
        LinkByUID(scope="products", id="worst cookie ever")]
)
The preview returns a dictionary with two keys:
- csv: a preview of the table in the comma-separated-values format
- warnings: a list of string-valued warnings that describe possible issues with the Table Config, e.g., that one of the columns is completely empty
For example, if you wanted to print the warnings and then load the preview into a pandas dataframe, you could:
from io import StringIO
import pandas as pd
preview = table_configs.preview(table_config=table_config, preview_materials=preview_materials)
print("\n\n".join(preview["warnings"]))
data_frame = pd.read_csv(StringIO(preview["csv"]))
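or even wrap it in a function that displays multi-row headers:
def resp_to_pandas(resp):
    import warnings
    from io import StringIO
    import pandas as pd
    import numpy as np
    # Surface any Table Config warnings before parsing the CSV
    if resp["warnings"]:
        warnings.warn("\n\n".join(resp["warnings"]))
    df = pd.read_csv(StringIO(resp["csv"]))
    # Column names use '~' to separate header levels
    headers = [x.split('~') for x in df]
    # Pad shorter headers with '' so every column has the same depth
    depth = max(len(x) for x in headers)
    for header in headers:
        header.extend([''] * (depth - len(header)))
    # One array per header level produces the multi-row header
    return pd.DataFrame(df.values, columns=[x for x in np.array(headers).T])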
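The header handling inside resp_to_pandas can be traced with the standard library alone. The sketch below applies the same split, pad, and transpose steps to a small hand-written CSV; the column "row id" and the data values are made up for illustration.

```python
import csv
from io import StringIO

# Hand-written sample in the shape of a preview's "csv" value
csv_text = (
    "batter mixing~flour~id,batter mixing~flour~mass fraction~mean,row id\n"
    "flour-lot-1,0.45,r1\n"
)

reader = csv.reader(StringIO(csv_text))
flat_names = next(reader)  # first line holds the flat column names

# Split each name into its header levels
parts = [name.split("~") for name in flat_names]

# Pad shorter headers with '' so every column has the same depth
depth = max(len(p) for p in parts)
padded = [p + [""] * (depth - len(p)) for p in parts]

# Transpose: one list per header level, one entry per column
header_rows = [list(level) for level in zip(*padded)]
for level in header_rows:
    print(level)
```

Each printed list corresponds to one row of the multi-row header, from the broadest level to the most specific.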
3.4. Building and downloading tables
After iteratively adjusting the TableConfig with the preview method above, the definition can be registered to save it.
table_config = table_configs.register(table_config)
print("Definition registered as {}".format(table_config.definition_uid))
Registered Table Configs can be built into GEM Tables. For example:
table = project.tables.build_from_config(table_config)
project.tables.read(table, "./my_table.csv")
The above will build a table, wait for the build job to complete, and save the table locally.
However, GEM Tables are sometimes large and time-consuming to build, so the build process can be performed asynchronously with the initiate_build method.
For example:
job = project.tables.initiate_build(table_config)
The return type of the initiate_build method is a JobSubmissionResponse that contains a unique identifier for the submitted job.
Once the job has completed, the id and version of the resulting table can be used to get a GemTable resource that provides access to the table.
You can also use the JobSubmissionResponse to return the GemTable resource directly with the get_by_build_job method.
Just like the FileLink resource, GemTable does not literally contain the table but does expose a read method that will download it.
For example, once the build started by the initiate_build call above has completed:
# Get the table resource as an object
table = project.tables.get_by_build_job(job)
# Download the table
project.tables.read(table=table, local_path="./my_table.csv")
3.5. Available Row Definitions
Currently, GEM Tables provide a single way to define rows: by the MaterialTemplate of the terminal materials of the material histories that correspond to each row.
3.5.1. MaterialRunByTemplate
The MaterialRunByTemplate class defines rows through a list of MaterialTemplates.
Every MaterialRun that is assigned to any template in the list is used as the terminal material of a Material History to be mapped to a row.
This is helpful when the rows correspond to classes of materials that are defined through their templates.
For example, there could be a MaterialTemplate called “Cake” that is used in all of the cakes and another called “Brownies” that is used in all of the brownies.
By including one or both of those templates, you can define a table of Cakes, Brownies, or both.
3.6. Available Variable Definitions
There are several ways to define variables that take their values from Attributes and identifiers in GEMD objects.
- Attributes
  - AttributeByTemplate: for when the attribute occurs once per material history
  - AttributeByTemplateAndObjectTemplate: for when the attributes are distinguished by the object that they are contained in
  - AttributeByTemplateAfterProcessTemplate: for when measurements are distinguished by the process that precedes them
  - AttributeInOutput: for when attributes occur both in a process output and one or more of its inputs
  - IngredientQuantityByProcessAndName: for the specific case of the volume fraction, mass fraction, number fraction, or absolute quantity of an ingredient
  - IngredientQuantityInOutput: for the quantity of an ingredient between the terminal material and a given set of processes (useful for ingredients used in multiple processes)
  - LocalAttribute: for retrieving the attribute from the terminal material or its attached process or measurements (useful for attributes found on multiple materials)
  - LocalIngredientQuantity: for the quantity of an ingredient used in the process creating the terminal material (useful for ingredients used in multiple processes)
- Identifiers
  - TerminalMaterialInfo: for fields defined on the material at the terminus of the Material History, like the name of the material
  - TerminalMaterialIdentifier: for the id of the Material History, which can be used as a unique identifier for the rows
  - IngredientIdentifierByProcessTemplateAndName: for the id of the material being used in an ingredient, which can be used as a key for looking up that input material
  - IngredientIdentifierInOutput: for the id of a material used in an ingredient between the terminal material and a given set of processes (useful for ingredients used in multiple processes)
  - LocalIngredientIdentifier: for the id of a material used in an ingredient used in the process creating the terminal material (useful for ingredients used in multiple processes)
  - IngredientLabelByProcessAndName: for a Boolean that indicates whether an ingredient is assigned a given label
  - IngredientLabelsSetByProcessAndName: for the set of labels belonging to an ingredient in a process
  - IngredientLabelsSetInOutput: for the set of labels belonging to an ingredient between the terminal material and a given set of processes (useful for ingredients used in multiple processes)
  - LocalIngredientLabelsSet: for the set of labels belonging to an ingredient used in the process creating the terminal material (useful for ingredients used in multiple processes)
- Compound Variables
  - XOR: for combining multiple variable definitions into one variable, when only one of those definitions yields a result for a given tree (logical exclusive OR)
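The exclusive-OR combination can be pictured with a small sketch over per-row results, where `None` stands for "this definition yielded nothing". The `xor_value` helper is hypothetical, and raising an error when several definitions yield a result is an assumption made for this sketch; how the platform handles that case is not described here.

```python
from typing import Any, Optional, Sequence


def xor_value(results: Sequence[Optional[Any]]) -> Optional[Any]:
    """Return the single non-empty result, or None if every input is empty.

    Assumption: more than one non-empty result is treated as an error here;
    the platform's actual behavior in that case is not described in the text.
    """
    present = [r for r in results if r is not None]
    if len(present) > 1:
        raise ValueError("more than one variable yielded a result")
    return present[0] if present else None


# Hypothetical per-row results from two alternative variable definitions
row_1 = xor_value([1.25, None])
row_2 = xor_value([None, 1.31])
row_3 = xor_value([None, None])
print(row_1, row_2, row_3)
```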
3.7. Available Column Definitions
There are several ways to define columns, depending on the type of the attribute that is being used as the data source for the column.
- Numeric attribute values, like ContinuousValue and IntegerValue
  - MeanColumn: for the mean value of the numeric distribution
  - StdDevColumn: for the standard deviation of the numeric distribution, or empty if the value is nominal
  - QuantileColumn: for a user-defined quantile of the numeric distribution, or empty if the value is nominal
  - OriginalUnitsColumn: for getting the units, as entered by the data author, from the specific attribute value; valid for continuous values only
- Enumerated attribute values, like CategoricalValue
  - MostLikelyCategoryColumn: for getting the mode
  - MostLikelyProbabilityColumn: for getting the probability of the mode
- Composition and chemical formula attribute values, like CompositionValue
  - FlatCompositionColumn: for flattening the composition into a chemical-formula-like string
  - ComponentQuantityColumn: for getting the (optionally normalized) quantity of a specific component, by name
  - NthBiggestComponentNameColumn: for getting the name of the n-th biggest component (by quantity)
  - NthBiggestComponentQuantityColumn: for getting the (optionally normalized) quantity of the n-th biggest component (by quantity)
- Molecular structure attribute values, like MolecularValue
  - MolecularStructureColumn: for getting molecular structures in a line notation
- String- and Boolean-valued fields, like identifiers and non-attribute fields
  - IdentityColumn: for simply casting the value to a string, which doesn’t work on values from Attributes
- Collections of values
  - ConcatColumn: for concatenating the results of a list- or set-valued result, such as is returned by IngredientLabelsSetInOutput
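As an illustration of how the numeric columns reduce a value to a single cell, the sketch below models a nominal value and a normally distributed value and extracts the mean and standard deviation. The `Nominal` and `Normal` classes are hypothetical stand-ins, not gemd types, and using the nominal value itself as its mean is an assumption of this sketch; the empty standard deviation for nominal values follows the StdDevColumn description above.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Nominal:
    """Hypothetical stand-in for a nominal (point) value."""
    value: float


@dataclass
class Normal:
    """Hypothetical stand-in for a normally distributed value."""
    mean: float
    std: float


Value = Union[Nominal, Normal]


def mean_cell(value: Value) -> float:
    """Cell content for a mean-style column (assumes nominal mean = value)."""
    return value.value if isinstance(value, Nominal) else value.mean


def std_cell(value: Value) -> Optional[float]:
    """Cell content for a standard-deviation-style column.
    None represents an empty cell, as for nominal values."""
    return None if isinstance(value, Nominal) else value.std


for v in (Nominal(1.2), Normal(1.3, 0.05)):
    print(mean_cell(v), std_cell(v))
```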
3.8. Compatibility with AI Engine
The Citrine Platform automatically converts the values found in a GEM Table into the format used by the AI Engine for predictor training and default asset creation. This includes generating descriptors from the variables found in the table configuration and extracting individual values from the cells of the GEM Table.
In most cases, descriptors are generated based on the bounds (children of the BaseBounds class) found on the attribute template referenced by a GEM Table variable.
The key of the descriptor is derived from concatenation of the headers field of the table variable.
An exception to this is the FormulationDescriptor, which follows the special rule described below.
The mappings from variables in a GEM Table to descriptors are as follows:
- RealBounds are converted to a RealDescriptor
- IntegerBounds are converted to an IntegerDescriptor
- CategoricalBounds are converted to a CategoricalDescriptor
- CompositionBounds are converted to a ChemicalFormulaDescriptor
- MolecularStructureBounds are converted to a MolecularStructureDescriptor
- A FormulationDescriptor with key ‘Formulation’ is generated whenever an ingredient quantity variable (e.g., IngredientQuantityInOutput, IngredientQuantityByProcessAndName, or LocalIngredientQuantity) is present in the table configuration
When using a GEM Table as a data source for predictor training, the generated descriptors are associated with individual cell values in each row of data.
The following value types (children of the BaseValue class) are compatible with each type of descriptor:
- RealDescriptor: values of type NominalReal, NormalReal, and UniformReal
- IntegerDescriptor: values of type NominalInteger and UniformInteger
- MolecularStructureDescriptor: values of type Smiles and InChI
- CategoricalDescriptor: values of type NominalCategorical and DiscreteCategorical
- ChemicalFormulaDescriptor: values of type EmpiricalFormula, or values of type NominalComposition when all quantity keys are valid atomic symbols
- FormulationDescriptor: all values extracted by ingredient quantity, identifier, and label variables are used to represent the formulation
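The two mappings in this section can be condensed into lookup tables for quick reference. The sketch below uses class names as plain strings and does not import or call the platform; `descriptor_for` and `is_compatible` are hypothetical helpers. It leaves out the FormulationDescriptor, which is generated from ingredient variables rather than from bounds, and it does not check the atomic-symbol condition on NominalComposition.

```python
from typing import Optional

# Bounds type on the attribute template -> generated descriptor type
BOUNDS_TO_DESCRIPTOR = {
    "RealBounds": "RealDescriptor",
    "IntegerBounds": "IntegerDescriptor",
    "CategoricalBounds": "CategoricalDescriptor",
    "CompositionBounds": "ChemicalFormulaDescriptor",
    "MolecularStructureBounds": "MolecularStructureDescriptor",
}

# Descriptor type -> value types listed as compatible in the text
COMPATIBLE_VALUES = {
    "RealDescriptor": {"NominalReal", "NormalReal", "UniformReal"},
    "IntegerDescriptor": {"NominalInteger", "UniformInteger"},
    "MolecularStructureDescriptor": {"Smiles", "InChI"},
    "CategoricalDescriptor": {"NominalCategorical", "DiscreteCategorical"},
    # NominalComposition qualifies only when all quantity keys are valid
    # atomic symbols; that condition is not checked in this sketch.
    "ChemicalFormulaDescriptor": {"EmpiricalFormula", "NominalComposition"},
}


def descriptor_for(bounds_type: str) -> Optional[str]:
    """Look up the descriptor type generated for a bounds type."""
    return BOUNDS_TO_DESCRIPTOR.get(bounds_type)


def is_compatible(bounds_type: str, value_type: str) -> bool:
    """Check whether a value type is listed for the descriptor that the
    given bounds type maps to."""
    descriptor = descriptor_for(bounds_type)
    return value_type in COMPATIBLE_VALUES.get(descriptor, set())


print(descriptor_for("RealBounds"))
print(is_compatible("RealBounds", "NormalReal"))
print(is_compatible("IntegerBounds", "NormalReal"))
```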