.. _data-sources:

Data Sources
============

Data sources are used by modules to pull data from outside of the AI engine.
For example, a predictor may need to define external training data.

CSV Data Source
---------------

The :class:`~citrine.informatics.data_sources.CSVDataSource` draws data from a CSV file stored on the data platform and annotates it by mapping each column name to a :class:`~citrine.informatics.descriptors.Descriptor`.
The file is referenced via a :class:`~citrine.resources.file_link.FileLink`.
Each FileLink references an explicit version of the CSV file.
Uploading a new file with the same name will produce a new version of the file with a new FileLink.

The columns in the CSV are extracted and parsed according to a mapping of column header names to user-created descriptors.
Columns in the CSV that are not mapped to a descriptor are ignored.

Assume that a file ``data.csv`` exists with the following contents:

.. code::

    Chemical Formula,Gap,Crystallinity
    Bi2Te3,0.153,Single crystalline
    Mg2Ge,0.567,Single crystalline
    GeTe,0.7,Amorphous
    Sb2Se3,1.15,Polycrystalline

That file could be used as the training data for a predictor as:

.. code:: python

    from citrine.informatics.data_sources import CSVDataSource
    from citrine.informatics.predictors import AutoMLPredictor
    from citrine.informatics.descriptors import RealDescriptor, CategoricalDescriptor, ChemicalFormulaDescriptor

    # `dataset` is assumed to be an existing dataset on the data platform
    file_link = dataset.files.upload(file_path="./data.csv", dest_name="bandgap_data.csv")

    data_source = CSVDataSource(
        file_link = file_link,
        # `column_definitions` maps a column header to a descriptor
        # the column header and the descriptor key do not need to be identical
        column_definitions = {
            "Chemical Formula": ChemicalFormulaDescriptor(key="formula"),
            "Gap": RealDescriptor(key="Band gap", lower_bound=0, upper_bound=20, units="eV"),
            "Crystallinity": CategoricalDescriptor(key="Crystallinity", categories=[
                "Single crystalline", "Amorphous", "Polycrystalline"])
        }
    )

    predictor = AutoMLPredictor(
        name = "Band gap predictor",
        description = "Predict the band gap from the chemical formula and crystallinity",
        inputs = [
            # referencing `data_source.column_definitions` is one way to ensure that the
            # descriptors in the training data match the descriptors in the predictor definition
            data_source.column_definitions["Chemical Formula"],
            data_source.column_definitions["Crystallinity"]
        ],
        outputs = [data_source.column_definitions["Gap"]],
        training_data = [data_source]
    )

An optional list of identifiers can be specified.
Identifiers uniquely identify a row and are used in the context of formulations to link from an ingredient to its properties.
Each identifier must correspond to a column header name.
No two rows within an identifier column can contain the same value.
Identifier columns may overlap with the keys of the mapping from column header names to descriptors, so a single column can be both parsed as data and used as an identifier.

Identifiers are required in two circumstances.
These circumstances are only relevant if the CSV data source represents formulation data.

1. Ingredient properties are featurized using a :class:`~citrine.informatics.predictors.mean_property_predictor.MeanPropertyPredictor`.
   In this case, the link from identifier to row is used to compute mean ingredient property values.
2. Mixtures that contain mixtures are simplified to simple mixtures that contain only leaf ingredients using a :class:`~citrine.informatics.predictors.simple_mixture_predictor.SimpleMixturePredictor`.
   In this case, links from each mixture's ingredients to its row (which may also be a mixture) are used to recursively crawl hierarchical blends of blends and construct a recipe that contains only leaf ingredients.

Note: to build a formulation from a CSV data source, an :class:`~citrine.informatics.predictors.ingredients_to_formulation_predictor.IngredientsToFormulationPredictor` must be present in the workflow.
Additionally, each ingredient id used as a key in the predictor's map from ingredient id to its quantity must exist in an identifier column.

As an example, consider the following saline solution data.

+-------------------+----------------+---------------+---------+
| Ingredient id     | water quantity | salt quantity | density |
+===================+================+===============+=========+
| hypertonic saline | 0.93           | 0.07          | 1.08    |
+-------------------+----------------+---------------+---------+
| isotonic saline   | 0.99           | 0.01          | 1.01    |
+-------------------+----------------+---------------+---------+
| water             |                |               | 1.0     |
+-------------------+----------------+---------------+---------+
| salt              |                |               | 2.16    |
+-------------------+----------------+---------------+---------+

Hypertonic and isotonic saline are mixtures formed by mixing water and salt.
Ingredient identifiers are given by the first column.
A CSV data source and an :class:`~citrine.informatics.predictors.ingredients_to_formulation_predictor.IngredientsToFormulationPredictor` can be configured to construct formulations from tabular data.
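
As a rough illustration of the two identifier use cases described earlier, the sketch below works on plain Python dictionaries rather than on a CSV data source.
It is not how the platform implements these predictors: the helpers ``flatten`` and ``mean_property`` are hypothetical, the ``diluted saline`` row is invented for the example, and treating the mean property as a quantity-weighted average is an assumption.

.. code:: python

    # Rows keyed by identifier, mirroring the saline table.
    # "diluted saline" is a hypothetical blend of blends added for illustration.
    rows = {
        "hypertonic saline": {"recipe": {"water": 0.93, "salt": 0.07}, "density": 1.08},
        "isotonic saline": {"recipe": {"water": 0.99, "salt": 0.01}, "density": 1.01},
        "water": {"recipe": {}, "density": 1.0},
        "salt": {"recipe": {}, "density": 2.16},
        "diluted saline": {"recipe": {"isotonic saline": 0.5, "water": 0.5}, "density": None},
    }

    def flatten(identifier, rows):
        """Return a recipe for `identifier` that contains only leaf ingredients."""
        recipe = rows[identifier]["recipe"]
        if not recipe:
            # A row without a recipe is a leaf ingredient.
            return {identifier: 1.0}
        leaves = {}
        for ingredient, quantity in recipe.items():
            # Follow the identifier to the ingredient's own row, which may be a mixture.
            for leaf, fraction in flatten(ingredient, rows).items():
                leaves[leaf] = leaves.get(leaf, 0.0) + quantity * fraction
        return leaves

    def mean_property(identifier, rows, key):
        """Quantity-weighted mean of property `key` over the leaf ingredients."""
        leaves = flatten(identifier, rows)
        total = sum(leaves.values())
        return sum(q * rows[leaf][key] for leaf, q in leaves.items()) / total

    simple_recipe = flatten("diluted saline", rows)
    mean_density = mean_property("hypertonic saline", rows, "density")

Both helpers rely on every ingredient name resolving to exactly one row, which is why identifier values must be unique within their column.
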
The :class:`~citrine.informatics.predictors.ingredients_to_formulation_predictor.IngredientsToFormulationPredictor` automatically generates an ``output`` formulation descriptor named 'Formulation'.
The following example illustrates this process:

.. code:: python

    from citrine.informatics.data_sources import CSVDataSource
    from citrine.informatics.descriptors import RealDescriptor
    from citrine.informatics.predictors import IngredientsToFormulationPredictor

    # `dataset` is assumed to be an existing dataset on the data platform
    file_link = dataset.files.upload(file_path="./saline_solutions.csv", dest_name="saline_solutions.csv")

    # create descriptors for each ingredient quantity (volume fraction)
    water_quantity = RealDescriptor(key='water quantity', lower_bound=0, upper_bound=1, units="")
    salt_quantity = RealDescriptor(key='salt quantity', lower_bound=0, upper_bound=1, units="")

    # create a descriptor to hold density data
    density = RealDescriptor(key='density', lower_bound=0, upper_bound=1000, units='g/cc')

    data_source = CSVDataSource(
        file_link = file_link,
        column_definitions = {
            'water quantity': water_quantity,
            'salt quantity': salt_quantity,
            'density': density
        },
        # each value in this column uniquely identifies a row
        identifiers=['Ingredient id']
    )

    predictor = IngredientsToFormulationPredictor(
        name='Ingredients to formulation predictor',
        description='Constructs a mixture from ingredient quantities',
        # map from ingredient id to its quantity
        id_to_quantity={
            'water': water_quantity,
            'salt': salt_quantity
        },
        # label water as a solvent and salt as a solute
        labels={
            'solvent': {'water'},
            'solute': {'salt'}
        }
    )

GEM Table Data Source
---------------------

A :class:`~citrine.informatics.data_sources.GemTableDataSource` references a GEM Table.
As explained in more detail in the :doc:`documentation <../data_extraction>`, GEM Tables provide a structured version of on-platform data.
GEM Tables are specified by the display table UUID, version number, and an optional formulation descriptor.
A formulation descriptor must be specified if formulations should be built from the data source.
If specified, any formulations emitted by the data source are stored using the provided descriptor.

The example below assumes that the UUID and the version of the desired GEM Table are known.

.. code:: python

    from citrine.informatics.data_sources import GemTableDataSource
    from citrine.informatics.predictors import AutoMLPredictor
    from citrine.informatics.descriptors import RealDescriptor, CategoricalDescriptor, ChemicalFormulaDescriptor

    data_source = GemTableDataSource(
        table_id = "842434fd-11fe-4324-815c-7db93c7ed81e",
        table_version = "2"
    )

    predictor = AutoMLPredictor(
        name = "Band gap predictor",
        description = "Predict the band gap from the chemical formula and crystallinity",
        inputs = [
            ChemicalFormulaDescriptor("terminal~formula"),
            CategoricalDescriptor("terminal~crystallinity", categories=[
                "Single crystalline", "Amorphous", "Polycrystalline"])
        ],
        outputs = [RealDescriptor("terminal~band gap", lower_bound=0, upper_bound=20, units="eV")],
        training_data = [data_source]
    )

Note that the descriptor keys above are the headers of the *variable*, not of the column in the table.
The last term in the column header is a suffix associated with the specific column definition rather than the variable.
It should be omitted from the descriptor key.

Experiment Data Source
----------------------

An :class:`~citrine.resources.experiment_datasource.ExperimentDataSource` references a snapshot of the Experiment Results in a Branch that are fit for training.
This snapshot is created when one updates the data on a Branch and chooses to include Experiment Results in the training data via the web application.
There is only one Experiment Data Source per Branch, though it is versioned.
The version increments every time a new or updated Experiment Result is chosen as training data via the web application.

One can reference an Experiment Data Source from a branch:

.. code:: python

    eds = branch.experiment_datasource

The ``.read()`` method will return a string in a CSV-friendly format for convenient export or further analysis:

.. code:: python

    from io import StringIO

    import pandas as pd

    # Write to CSV:
    with open('experiment_datasource.csv', 'w') as f:
        f.write(eds.read())

    # Convert to a Pandas DataFrame
    # `read_csv` accepts a file-like object, so pass the StringIO buffer itself
    eds_io = StringIO(eds.read())
    eds_dataframe = pd.read_csv(eds_io)
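
If pandas is not available, the same CSV-formatted string can be parsed with the standard library.
The following is a minimal sketch: ``csv_text`` is a hypothetical stand-in for the value returned by ``eds.read()``, and its column names are invented for illustration, since the real headers depend on the Experiment Results in the Branch.

.. code:: python

    import csv
    from io import StringIO

    # Hypothetical stand-in for the string returned by `eds.read()`
    csv_text = "Band gap,Crystallinity\n0.153,Single crystalline\n0.7,Amorphous\n"

    # DictReader treats the first line as the column headers
    # and yields one dictionary of strings per data row
    records = list(csv.DictReader(StringIO(csv_text)))

    # Values are read as strings, so numeric columns need an explicit conversion
    band_gaps = [float(record["Band gap"]) for record in records]
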