4.7. Data Sources

Data sources are used by modules to pull data from outside of the AI engine. For example, a predictor may need to define external training data.

4.7.1. CSV Data Source

The CSVDataSource draws data from a CSV file stored on the data platform and annotates it by mapping each column name to a Descriptor. The file is referenced via a FileLink. Each FileLink references an explicit version of the CSV file. Uploading a new file with the same name will produce a new version of the file with a new FileLink.

The columns in the CSV are extracted and parsed according to a mapping of column header names to user-created descriptors. Columns in the CSV that are not mapped to a descriptor are ignored.

Assume that a file data.csv exists with the following contents:

Chemical Formula,Gap,Crystallinity
Bi2Te3,0.153,Single crystalline
Mg2Ge,0.567,Single crystalline
GeTe,0.7,Amorphous
Sb2Se3,1.15,Polycrystalline

That file could be used as the training data for a predictor as:

from citrine.informatics.data_sources import CSVDataSource
from citrine.informatics.predictors import AutoMLPredictor
from citrine.informatics.descriptors import RealDescriptor, CategoricalDescriptor, ChemicalFormulaDescriptor

# `dataset` is assumed to be an existing dataset object on the platform
file_link = dataset.files.upload("./data.csv", "bandgap_data.csv")

data_source = CSVDataSource(
    file_link = file_link,
    # `column_definitions` maps a column header to a descriptor
    # the column header and the descriptor key do not need to be identical
    column_definitions = {
        "Chemical Formula": ChemicalFormulaDescriptor(key="formula"),
        "Gap": RealDescriptor(key="Band gap", lower_bound=0, upper_bound=20, units="eV"),
        "Crystallinity": CategoricalDescriptor(key="Crystallinity", categories=[
            "Single crystalline", "Amorphous", "Polycrystalline"])
    }
)

predictor = AutoMLPredictor(
    name = "Band gap predictor",
    description = "Predict the band gap from the chemical formula and crystallinity",
    inputs = [
        # referencing `data_source.column_definitions` is one way to ensure that the
        # descriptors in the training data match the descriptors in the predictor definition
        data_source.column_definitions["Chemical Formula"],
        data_source.column_definitions["Crystallinity"]
    ],
    outputs = [data_source.column_definitions["Gap"]],
    training_data = [data_source]
)

An optional list of identifiers can be specified. Identifiers uniquely identify a row and are used in the context of formulations to link from an ingredient to its properties. Each identifier must correspond to a column header name, and no two rows may contain the same value in that column. An identifier column may also appear as a key in the mapping from column header names to descriptors, in which case the column is both parsed as data and used as an identifier.
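Because duplicate values are not allowed in an identifier column, it can be worth checking a file locally before uploading it. The following is a minimal sketch that uses only the Python standard library. `find_duplicate_identifiers` is a hypothetical helper, not part of the citrine package, and the inline CSV text stands in for the data.csv file shown above.

```python
import csv
from collections import Counter
from io import StringIO

# Stand-in for the contents of data.csv shown earlier in this section
CSV_TEXT = """Chemical Formula,Gap,Crystallinity
Bi2Te3,0.153,Single crystalline
Mg2Ge,0.567,Single crystalline
GeTe,0.7,Amorphous
Sb2Se3,1.15,Polycrystalline
"""

def find_duplicate_identifiers(csv_text, identifier_column):
    """Hypothetical helper: return values that occur more than once in a column."""
    reader = csv.DictReader(StringIO(csv_text))
    if identifier_column not in (reader.fieldnames or []):
        raise KeyError(f"No column named {identifier_column!r}")
    counts = Counter(row[identifier_column] for row in reader)
    return sorted(value for value, n in counts.items() if n > 1)

# Every formula is distinct, so this column could serve as an identifier
duplicates = find_duplicate_identifiers(CSV_TEXT, "Chemical Formula")
# "Single crystalline" repeats, so this column could not
crystallinity_duplicates = find_duplicate_identifiers(CSV_TEXT, "Crystallinity")
print(duplicates)
print(crystallinity_duplicates)
```

An empty result means the column satisfies the uniqueness requirement. Any values returned would need to be resolved before the column is listed as an identifier.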

Identifiers are required in two circumstances, both of which are relevant only if the CSV data source represents formulation data:

Ingredient properties are featurized using a MeanPropertyPredictor. In this case, the link from identifier to row is used to compute mean ingredient property values.
Mixtures that contain mixtures are simplified to simple mixtures that contain only leaf ingredients using a SimpleMixturePredictor. In this case, links from each mixture’s ingredients to its row (which may also be a mixture) are used to recursively crawl hierarchical blends of blends and construct a recipe that contains only leaf ingredients.
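Both cases can be illustrated without the platform. The sketch below is not the SimpleMixturePredictor or MeanPropertyPredictor implementation. It is a plain-Python illustration that assumes each recipe maps an ingredient id to a fractional quantity, that the data contain no cycles, and that the mean of interest is a quantity-weighted arithmetic mean. The names `flatten`, `recipes`, and `densities`, along with the example ids and values, are hypothetical.

```python
def flatten(ingredient_id, recipes, scale=1.0):
    """Expand a mixture into leaf ingredients, scaling quantities on the way down.

    `recipes` maps an identifier to its recipe (ingredient id -> fraction).
    An identifier with no recipe is treated as a leaf ingredient.
    """
    recipe = recipes.get(ingredient_id)
    if not recipe:
        return {ingredient_id: scale}
    leaves = {}
    for child_id, fraction in recipe.items():
        # Follow the link from the child's id to its own row, which may be a mixture
        for leaf_id, amount in flatten(child_id, recipes, scale * fraction).items():
            leaves[leaf_id] = leaves.get(leaf_id, 0.0) + amount
    return leaves

# Hypothetical data: a blend that contains another blend
recipes = {
    "brine": {"water": 0.8, "salt": 0.2},
    "diluted brine": {"brine": 0.5, "water": 0.5},
}
leaf_recipe = flatten("diluted brine", recipes)

# Hypothetical leaf properties, looked up by identifier
densities = {"water": 1.0, "salt": 2.16}
total = sum(leaf_recipe.values())
mean_density = sum(leaf_recipe[i] * densities[i] for i in leaf_recipe) / total

print(leaf_recipe)
print(mean_density)
```

The identifier is what makes both steps possible: it is the key used to look up a child's own recipe during flattening and to look up each leaf ingredient's property value when averaging.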

Note: to build a formulation from a CSV data source an IngredientsToFormulationPredictor must be present in the workflow. Additionally, each ingredient id used as a key in the predictor’s map from ingredient id to its quantity must exist in an identifier column.

As an example, consider the following saline solution data.

Ingredient id	water quantity	salt quantity	density
hypertonic saline	0.93	0.07	1.08
isotonic saline	0.99	0.01	1.01
water			1.0
salt			2.16

Hypertonic and isotonic saline are mixtures formed by mixing water and salt. Ingredient identifiers are given by the first column. A CSV data source and an IngredientsToFormulationPredictor can be configured to construct formulations from tabular data. This predictor automatically generates an output formulation descriptor named ‘Formulation’. The following example illustrates this process:

from citrine.informatics.data_sources import CSVDataSource
from citrine.informatics.descriptors import FormulationDescriptor, RealDescriptor
from citrine.informatics.predictors import IngredientsToFormulationPredictor

file_link = dataset.files.upload(file_path="./saline_solutions.csv", dest_name="saline_solutions.csv")

# create descriptors for each ingredient quantity (volume fraction)
water_quantity = RealDescriptor(key='water quantity', lower_bound=0, upper_bound=1, units="")
salt_quantity = RealDescriptor(key='salt quantity', lower_bound=0, upper_bound=1, units="")

# create a descriptor to hold density data
density = RealDescriptor(key='density', lower_bound=0, upper_bound=1000, units='g/cc')

data_source = CSVDataSource(
    file_link = file_link,
    column_definitions = {
        'water quantity': water_quantity,
        'salt quantity': salt_quantity,
        'density': density
    },
    identifiers=['Ingredient id']
)

IngredientsToFormulationPredictor(
    name='Ingredients to formulation predictor',
    description='Constructs a mixture from ingredient quantities',
    # map from ingredient id to its quantity
    id_to_quantity={
        'water': water_quantity,
        'salt': salt_quantity
    },
    # label water as a solvent and salt a solute
    labels={
        'solvent': {'water'},
        'solute': {'salt'}
    }
)
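The requirement that every ingredient id used in the predictor's map appears in an identifier column can be checked locally before anything is registered. The sketch below uses only the standard library. `missing_ingredient_ids` is a hypothetical helper, not part of the citrine package, and the inline text assumes the saline solution table is stored in comma-separated form.

```python
import csv
from io import StringIO

# Assumed comma-separated form of the saline solution table above
SALINE_CSV = """Ingredient id,water quantity,salt quantity,density
hypertonic saline,0.93,0.07,1.08
isotonic saline,0.99,0.01,1.01
water,,,1.0
salt,,,2.16
"""

def missing_ingredient_ids(csv_text, identifier_column, ingredient_ids):
    """Hypothetical helper: ingredient ids that have no row in the identifier column."""
    rows = csv.DictReader(StringIO(csv_text))
    known_ids = {row[identifier_column] for row in rows}
    return sorted(set(ingredient_ids) - known_ids)

# Keys of the `id_to_quantity` map used in the predictor above
missing = missing_ingredient_ids(SALINE_CSV, "Ingredient id", ["water", "salt"])
# A misspelled or absent id is reported rather than silently accepted
missing_with_typo = missing_ingredient_ids(SALINE_CSV, "Ingredient id", ["water", "sugar"])
print(missing)
print(missing_with_typo)
```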

4.7.2. GEM Table Data Source

A GemTableDataSource references a GEM Table. As explained in more detail in the documentation, GEM Tables provide a structured version of on-platform data. A GEM Table is specified by the display table uuid, a version number, and an optional formulation descriptor. A formulation descriptor must be specified if formulations should be built from the data source. If specified, any formulations emitted by the data source are stored using the provided descriptor. The example below assumes that the uuid and the version of the desired GEM Table are known.

from citrine.informatics.data_sources import GemTableDataSource
from citrine.informatics.predictors import AutoMLPredictor
from citrine.informatics.descriptors import RealDescriptor, CategoricalDescriptor, ChemicalFormulaDescriptor

data_source = GemTableDataSource(
    table_id = "842434fd-11fe-4324-815c-7db93c7ed81e",
    table_version = "2"
)

predictor = AutoMLPredictor(
    name = "Band gap predictor",
    description = "Predict the band gap from the chemical formula and crystallinity",
    inputs = [
        ChemicalFormulaDescriptor("terminal~formula"),
        CategoricalDescriptor("terminal~crystallinity", categories=[
            "Single crystalline", "Amorphous", "Polycrystalline"])
    ],
    outputs = [RealDescriptor("terminal~band gap", lower_bound=0, upper_bound=20, units="eV")],
    training_data = [data_source]
)

Note that the descriptor keys above are the headers of the variables, not of the columns in the table. The last term in a column header is a suffix associated with the specific column definition rather than with the variable, and it should be omitted from the descriptor key.
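As an illustration of the rule above, the helper below derives a descriptor key from a column header by dropping the header's final term. It is a sketch under assumptions: `descriptor_key_from_header` is hypothetical, terms are assumed to be separated by `~` as in the keys above, and the `Mean` suffix in the example header is invented for illustration.

```python
def descriptor_key_from_header(column_header, separator="~"):
    """Hypothetical helper: drop the last separator-delimited term of a header."""
    key, found, _suffix = column_header.rpartition(separator)
    if not found:
        raise ValueError(
            f"Header {column_header!r} has no {separator!r}-separated suffix"
        )
    return key

# Example header with an assumed column-definition suffix
key = descriptor_key_from_header("terminal~band gap~Mean")
print(key)
```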

4.7.3. Experiment Data Source

An ExperimentDataSource references a snapshot of the Experiment Results in a Branch that are fit for training. This snapshot is created when one updates the data on a Branch and chooses to include Experiment Results in the training data via the web application. There is only one Experiment Data Source per Branch, though it is versioned. The version increments every time a new or updated Experiment Result is chosen as training data via the web application.

One can reference an Experiment Data Source from a branch:

eds = branch.experiment_datasource

The .read() method will return a string in a CSV-friendly format for convenient export or further analysis:

# Write to CSV:
with open('experiment_datasource.csv', 'w') as f:
    f.write(eds.read())

# Convert to a Pandas DataFrame
import pandas as pd
from io import StringIO

# read_csv accepts a file-like object, so wrap the CSV string in StringIO
eds_io = StringIO(eds.read())
eds_dataframe = pd.read_csv(eds_io)