Data Client

The DataClient is the section of PyCC that allows you to manage your data on Citrination. To access the data client, instantiate CitrinationClient and read the data attribute:

from citrination_client import CitrinationClient
from os import environ

client = CitrinationClient(environ["CITRINATION_API_KEY"], environ["CITRINATION_SITE"])
data_client = client.data

Uploading Files

The DataClient class exposes a method, .upload, which allows you to upload a file or a directory to a dataset on Citrination using the “default” ingester. This method is useful for uploading JSON files that follow the PIF schema, as well as files that do not require additional processing with a custom ingester.

Attention

For files that require additional processing, see the sections on upload_with_ingester and upload_with_template_csv_ingester.

This method is parameterized with the following values:

  • dataset_id - The integer value of the ID of the dataset to which you will be uploading

  • source_path - The path to the file or directory which you want to upload

  • dest_path (optional) - The name of the file or directory as it should appear in Citrination.

Uploading a File To Citrination

The following Python snippet demonstrates the approach for uploading a file with the relative path characterizations/CdTe1.json to dataset 1 on Citrination.

# ... client initialization left out
data_client = client.data

file_path = "characterizations/CdTe1.json"
dataset_id = 1
data_client.upload(dataset_id, file_path)

In the web UI, this file will appear as CdTe1.json nested in a characterizations folder.

Uploading a File To Citrination Under a New Name

You can specify the name a file should have once uploaded to Citrination by passing the third parameter, dest_path, to the .upload method.

# ... client initialization left out
data_client = client.data

file_path = "characterizations/CdTe1.json"
dataset_id = 1
# Pass in the third parameter to upload()
data_client.upload(dataset_id, file_path, "CadTel1.json")

In the web UI, this file will appear as CadTel1.json at the top level of the dataset.

Uploading a Directory To Citrination

If you pass a directory into the upload() method, all the files in the directory will be recursively uploaded to the dataset. Their paths, relative to the directory specified, will remain intact.

Attention

Files uploaded this way will be prefixed with the name of the parent directory that you originally specify. In other words, if upload() is called with the source path my_folder, the dataset on Citrination will contain files prefixed with my_folder/.

The following code sample uploads the characterizations/ folder to dataset 1 on Citrination.

# ... client initialization left out
data_client = client.data

directory_path = "characterizations/"
dataset_id = 1
data_client.upload(dataset_id, directory_path)

Uploading a Directory To Citrination Under a New Name

You can also specify that a folder be renamed when it is uploaded to Citrination. The following code sample uploads the contents of the characterizations/ directory to a directory called january_characterizations on Citrination.

# ... client initialization left out
data_client = client.data

directory_path = "characterizations/"
dataset_id = 1
data_client.upload(dataset_id, directory_path, "january_characterizations/")
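
To preview how a directory upload should be laid out before calling upload(), the naming rule described above can be sketched in plain Python. The helper name preview_dest_paths is hypothetical and is not part of PyCC. It only mirrors the documented behavior (relative paths are kept and prefixed with the directory name, or with dest_path when one is given) and assumes that Citrination paths use forward slashes.

```python
import os


def preview_dest_paths(source_dir, dest_path=None):
    """Return the paths the files are expected to have on Citrination.

    Hypothetical helper (not part of PyCC). It assumes the documented rule:
    relative paths are kept and prefixed with the directory name, or with
    dest_path when one is given.
    """
    source_dir = os.path.normpath(source_dir)
    prefix = (dest_path or os.path.basename(source_dir)).strip("/")
    expected = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            relative = os.path.relpath(os.path.join(root, name), source_dir)
            # Assumption: Citrination paths always use forward slashes
            expected.append(prefix + "/" + relative.replace(os.sep, "/"))
    return sorted(expected)
```

Calling preview_dest_paths("characterizations/") lists the expected destination of every file under that folder, and passing "january_characterizations/" as the second argument shows the renamed layout. The result is only a local estimate; the web UI remains the authority on what was actually uploaded.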

Selecting a Custom Ingester

Finding an Ingester by ID

The list_ingesters method returns all of the custom ingesters available on your Citrination deployment. It accepts no parameters, and returns an instance of IngesterList, which itself contains instances of Ingester.

The unique field of an Ingester is its id, so a particular Ingester can be located via the IngesterList#find_by_id method.

# ... client initialization left out
data_client = client.data

ingester_list = data_client.list_ingesters()
csv_ingester = ingester_list.find_by_id("citrine/ingest template_csv_converter")

Finding an Ingester by Searching its Attributes

If you don’t know the id of an ingester, you can also search through an IngesterList via the where method, which supports searching through a variety of attributes that can be found in the Ingester.SEARCH_FIELDS constant.

# ... client initialization left out
data_client = client.data

# All ingesters available on your Citrination deployment
ingester_list = data_client.list_ingesters()
# Find all ingesters whose `name` attributes contain the phrase "xrd"
# Note that the `where` method returns a new IngesterList
xrd_ingesters = ingester_list.where({ "name": "xrd" })

# How many ingesters had names that contained "xrd"?
print(xrd_ingesters.ingester_count)
# 2

# Quick summary of what those ingesters are
print(xrd_ingesters)
# <IngesterList ingester_count=2 ingesters=[
#   "<Ingester id='citrine/ingest bruker_xrd_xy_prod'
#              display_name='Citrine: Bruker XRD .XY'
#              description='Converts Bruker V8 .XY files to PIF .json format.'
#              num_arguments='3'>",
#   "<Ingester id='citrine/ingest xrdml_xrd_converter'
#              display_name='Citrine: XRD .xrdml'
#              description='Converter for .xrdml files from XRD measurements'
#              num_arguments='2'>"
# ]>

# Supposing we want to go with the one whose display_name is `Citrine: XRD .xrdml`,
# there are several ways to do this:
# 1. Indexing into the xrd_ingesters' `ingesters` list:
xrdml_ingester = xrd_ingesters.ingesters[1]

# 2. Alternatively, using a where clause with the matching `display_name`, `id`,
#    or other searchable attribute:
xrdml_ingester = xrd_ingesters.where({ "display_name": "Citrine: XRD .xrdml" }).ingesters[0]

# 3. Since xrd_ingesters is an `IngesterList`, we could also use `find_by_id` to
#    avoid having to index into the `ingesters` list:
xrdml_ingester = xrd_ingesters.find_by_id("citrine/ingest xrdml_xrd_converter")

Attention

Note that the IngesterList#where method returns a new instance of IngesterList, and requires indexing into the IngesterList#ingesters attribute to ultimately select an ingester.

Viewing an Ingester’s Arguments

Once you have an Ingester, you can check its optional and required arguments via the Ingester#arguments attribute. As seen below, each argument has a name, desc (description), type, and required key. If you provide ingester_arguments for an ingester when using the upload_with_ingester method, make sure that the name and value in each of your dictionaries match the name and type of the corresponding argument found in Ingester#arguments.

# ... client initialization left out
data_client = client.data

ingester_list = data_client.list_ingesters()
csv_ingester = ingester_list.find_by_id("citrine/ingest template_csv_converter")
formulation_ingester = ingester_list.find_by_id("citrine/ingest formulation_csv_converter")
xrdml_ingester = ingester_list.find_by_id("citrine/ingest xrdml_xrd_converter")

# Here we can see that the Template CSV ingester accepts no arguments
print(csv_ingester.arguments)
# []

# Here we can see that the Formulation CSV ingester accepts one optional argument
print(formulation_ingester.arguments)
# [{ 'name': 'check_ingredient_names',
#    'desc': 'Whether to check that the names of the ingredients in the formulations are present in this upload',
#    'type': 'Boolean',
#    'required': False }]

# Here we can see that the Citrine: XRD .xrdml ingester requires 2 arguments,
# one named `sample_id` and the other named `chemical_formula`, both of which
# should be strings.
print(xrdml_ingester.arguments)
# [{ 'name': 'sample_id',
#    'desc': 'An ID to uniquely identify the material referenced in the file.',
#    'type': 'String',
#    'required': True },
#  { 'name': 'chemical_formula',
#    'desc': 'The chemical formula of the material referenced in the file.',
#    'type': 'String',
#    'required': True }]
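
Before uploading, it can help to confirm that every required argument has been supplied. The following is a minimal sketch of such a check; missing_required_arguments is a hypothetical helper, not part of PyCC, and it works on plain lists of dictionaries shaped like the Ingester#arguments output above and the name/value dictionaries expected by upload_with_ingester.

```python
def missing_required_arguments(argument_specs, ingester_arguments):
    """Return the names of required ingester arguments that were not supplied.

    Hypothetical helper (not part of PyCC).
    argument_specs: list of dicts shaped like Ingester#arguments
    ingester_arguments: list of {"name": ..., "value": ...} dicts
    """
    supplied = {argument["name"] for argument in ingester_arguments}
    return [
        spec["name"]
        for spec in argument_specs
        if spec["required"] and spec["name"] not in supplied
    ]


# Argument specs copied from the Citrine: XRD .xrdml output shown above
xrdml_specs = [
    {"name": "sample_id",
     "desc": "An ID to uniquely identify the material referenced in the file.",
     "type": "String",
     "required": True},
    {"name": "chemical_formula",
     "desc": "The chemical formula of the material referenced in the file.",
     "type": "String",
     "required": True},
]

print(missing_required_arguments(xrdml_specs, [{"name": "sample_id", "value": "1212"}]))
# ['chemical_formula']
```

In real use you would pass xrdml_ingester.arguments as the first argument instead of the copied list.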

Uploading Data Using a Custom Ingester

The upload_with_ingester method allows for custom ingesters to be used when uploading a file.

This method is parameterized with the following values:

  • dataset_id - The integer value of the ID of the dataset to which you will be uploading

  • source_path - The path to the file that you want to upload and for the ingester to then process

  • ingester - The custom Ingester you want to use

  • ingester_arguments (optional) - Any ingester arguments you want to apply to the ingester - this should be a list of dicts that contain name and value keys

  • dest_path (optional) - The name of the file or directory as it should appear in Citrination.

Ingesting Without Ingester Arguments

The following Python snippet demonstrates 2 approaches for uploading a file with the relative path data/formulation.csv to dataset 1 on Citrination, one with a specified destination path and one without (similar to how the upload method works). Both approaches utilize the Formulation CSV ingester with no ingester arguments.

# ... client initialization left out
data_client = client.data

file_path = "data/formulation.csv"
dataset_id = 1

ingester_list = data_client.list_ingesters()
formulation_ingester = ingester_list.find_by_id("citrine/ingest formulation_csv_converter")

# Printing the formulation_ingester's arguments, we can see that it takes one
# argument that is optional - so we can elect to omit it
print(formulation_ingester.arguments)
# [{ 'name': 'check_ingredient_names',
#    'desc': 'Whether to check that the names of the ingredients in the formulations are present in this upload',
#    'type': 'Boolean',
#    'required': False }]

# To ingest the file using the file_path as the destination path
data_client.upload_with_ingester(
    dataset_id, file_path, formulation_ingester
)
# To ingest the file using a different destination path
data_client.upload_with_ingester(
    dataset_id, file_path, formulation_ingester, dest_path='formulation.csv'
)

In the web UI, this file will appear as either formulation.csv nested in a data folder, or formulation.csv in the top level of the dataset depending on whether or not the destination path was provided.

Ingesting With Ingester Arguments

The following Python snippet demonstrates 2 approaches for uploading a file with the relative path experiments/data.xrdml to dataset 1 on Citrination, one with a specified destination path and one without (similar to how the upload method works). Both approaches utilize the Citrine: XRD .xrdml ingester with a set of arguments provided.

# ... client initialization left out
data_client = client.data

file_path = "experiments/data.xrdml"
dataset_id = 1

ingester_list = data_client.list_ingesters()
xrdml_ingester = ingester_list.find_by_id("citrine/ingest xrdml_xrd_converter")

# Printing the ingester's arguments, we can see it requires an argument with the
# name `sample_id`, and another with the name `chemical_formula`, both of which
# should be strings.
print(xrdml_ingester.arguments)
# [{ 'name': 'sample_id',
#    'desc': 'An ID to uniquely identify the material referenced in the file.',
#    'type': 'String',
#    'required': True },
#  { 'name': 'chemical_formula',
#    'desc': 'The chemical formula of the material referenced in the file.',
#    'type': 'String',
#    'required': True }]

ingester_arguments = [
    { "name": "sample_id", "value": "1212" },
    { "name": "chemical_formula", "value": "NaCl" },
]

# To ingest the file using the file_path as the destination path
data_client.upload_with_ingester(
    dataset_id, file_path, xrdml_ingester, ingester_arguments
)
# To ingest the file using a different destination path
data_client.upload_with_ingester(
    dataset_id, file_path, xrdml_ingester, ingester_arguments, 'data.xrdml'
)

In the web UI, this file will appear as either data.xrdml nested in an experiments folder, or data.xrdml in the top level of the dataset depending on whether or not the destination path was provided.
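
Since source_path refers to a single file, uploading several files that each need their own arguments means calling upload_with_ingester once per file. One possible way to organize that is sketched below. The helper upload_each_with_ingester and the arguments_by_path mapping are hypothetical names, not part of PyCC; the only client call used is upload_with_ingester with the positional parameters listed above.

```python
def upload_each_with_ingester(data_client, dataset_id, ingester, arguments_by_path):
    """Call upload_with_ingester once per file, each with its own arguments.

    Hypothetical helper (not part of PyCC).
    arguments_by_path: dict mapping a file path to its list of
    {"name": ..., "value": ...} ingester argument dicts.
    Returns the file paths in the order they were uploaded.
    """
    uploaded = []
    for file_path in sorted(arguments_by_path):
        data_client.upload_with_ingester(
            dataset_id, file_path, ingester, arguments_by_path[file_path]
        )
        uploaded.append(file_path)
    return uploaded
```

For example, arguments_by_path could map "experiments/a.xrdml" and "experiments/b.xrdml" to argument lists with different sample_id values, so that each file is ingested with the metadata that belongs to it.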

Uploading Using the Template CSV Ingester

The upload_with_template_csv_ingester method abstracts away the logic of finding the Template CSV ingester, since it is one of the more commonly used ingesters. The same work can be accomplished by finding the Template CSV ingester and using the upload_with_ingester method.

This method is parameterized with the following values:

  • dataset_id - The integer value of the ID of the dataset to which you will be uploading

  • source_path - The path to the file that you want to upload and for the Template CSV to then process

  • dest_path (optional) - The name of the file or directory as it should appear in Citrination.

The following Python snippet demonstrates 2 approaches for uploading a file with the relative path experiments/data.csv to dataset 1 on Citrination, one with a specified destination path and one without (similar to how the upload method works). Both approaches use the Template CSV ingester, which accepts no ingester arguments.

# ... client initialization left out
data_client = client.data

file_path = "experiments/data.csv"
dataset_id = 1

# To ingest the file using the file_path as the destination path
data_client.upload_with_template_csv_ingester(
    dataset_id, file_path
)
# To ingest the file using a different destination path
data_client.upload_with_template_csv_ingester(
    dataset_id, file_path, dest_path='data.csv'
)

In the web UI, this file will appear as either data.csv nested in an experiments folder, or data.csv in the top level of the dataset depending on whether or not the destination path was provided.

Checking the Ingest Status of a Dataset

The get_ingest_status method can be used to check the ingestion status of a dataset. It returns the string Processing when data is being ingested or indexed, and returns the string Finished when no data is being processed.

# ... client initialization left out
data_client = client.data

dataset = data_client.create_dataset()
file_path = 'test_data/template_example.csv'
data_client.upload_with_template_csv_ingester(dataset.id, file_path)

# After uploading, the status will initially be `Processing`
print(data_client.get_ingest_status(dataset.id))
# Processing

# After data has finished processing, the status will be `Finished`
print(data_client.get_ingest_status(dataset.id))
# Finished

Attention

Note that this method does not distinguish between successful and failed data ingestions. It only reports whether or not data is currently being processed for the dataset.
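
Because the status changes from Processing to Finished at some later point, scripts usually need to poll for it. The following is a minimal sketch of such a loop. The helper wait_for_ingest and its poll_seconds and timeout_seconds parameters are hypothetical and not part of PyCC; the only client call it relies on is get_ingest_status, as described above.

```python
import time


def wait_for_ingest(data_client, dataset_id, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll get_ingest_status until it returns "Finished" or the timeout passes.

    Hypothetical helper (not part of PyCC). Returns True once the dataset
    reports "Finished" and False if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if data_client.get_ingest_status(dataset_id) == "Finished":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

A typical call would be wait_for_ingest(data_client, dataset.id) right after an upload. A return value of True only means that processing has stopped, not that the ingestion succeeded.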

Retrieving Files

There are two mechanisms for retrieving data from datasets on Citrination:

  1. Request download URLs for previously uploaded files

  2. Request the contents of a single record in PIF JSON format

File Download URLs

The DataClient class provides two methods for retrieving files from a dataset:

  • get_dataset_files()

  • get_dataset_file()

These two methods will each return URLs which can be used to download one or more files in a dataset.

# ... client initialization left out
data_client = client.data
dataset_id = 1

# Gets a single file named exactly my_file.json

dataset_file = data_client.get_dataset_file(dataset_id, "my_file.json")

dataset_file.url  # url that can be used to download the file
dataset_file.path # the filepath as it appears in Citrination

# Gets all the files in a dataset, organized by version,
# represented as a list of DatasetFile objects

dataset_files = data_client.get_dataset_files(dataset_id)
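
The returned URLs still have to be fetched to obtain the file contents. One way to do that with only the Python standard library is sketched below. The helper download_dataset_files is hypothetical and not part of PyCC; it assumes each item exposes the url and path attributes shown above and that path uses forward slashes.

```python
import os
import shutil
import urllib.request


def download_dataset_files(dataset_files, target_dir):
    """Save each file under target_dir, keeping its Citrination path.

    Hypothetical helper (not part of PyCC).
    dataset_files: iterable of objects with `url` and `path` attributes.
    Returns the list of local paths that were written.
    """
    saved = []
    for dataset_file in dataset_files:
        # Assumption: Citrination paths are separated by forward slashes
        local_path = os.path.join(target_dir, *dataset_file.path.split("/"))
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        with urllib.request.urlopen(dataset_file.url) as response:
            with open(local_path, "wb") as out:
                shutil.copyfileobj(response, out)
        saved.append(local_path)
    return saved
```

For example, download_dataset_files(dataset_files, "downloads") would write every file of the dataset below a local downloads folder, and download_dataset_files([dataset_file], "downloads") would fetch a single file.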

PIF Retrieval

A PIF record on Citrination can be retrieved using the DataClient#get_pif method. The record will be returned as a PyPif Pif object. The dataset_version and pif_version arguments are optional - by default the PIF returned will be the current version of the PIF from the current version of the dataset.

# ... client initialization left out
data_client = client.data
dataset_id = 1
pif_uid = "abc123"

# Retrieves the latest version of the PIF whose uid is "abc123" from the latest
# version of dataset 1
data_client.get_pif(dataset_id, pif_uid)

# Retrieves the latest version of the PIF whose uid is "abc123" from version 3
# of dataset 1
data_client.get_pif(dataset_id, pif_uid, dataset_version=3)

# Retrieves version 2 of the PIF whose uid is "abc123" from the latest version
# of dataset 1
data_client.get_pif(dataset_id, pif_uid, pif_version=2)

# Retrieves version 2 of the PIF whose uid is "abc123" from version 3 of
# dataset 1
data_client.get_pif(dataset_id, pif_uid, dataset_version=3, pif_version=2)

To get the metadata of a PIF, use the DataClient#get_pif_with_metadata method. This method behaves like DataClient#get_pif, but returns a dictionary instead of a PyPif Pif object. The resulting dictionary will have two keys: “pif”, which will point to a PyPif Pif object, and “metadata”, which will be a dictionary with “dataset_id”, “dataset_version”, “uid”, “version”, and “updated_at” keys.

DataClient#get_pif_with_metadata has the same method signature as DataClient#get_pif, with optional dataset_version and pif_version arguments.

# ... client initialization left out
data_client = client.data
dataset_id = 105924
pif_uid = "1DF1C8EB706363E40546253D5D025D90"

get_pif_with_metadata = data_client.get_pif_with_metadata(dataset_id, pif_uid)

print(get_pif_with_metadata)
# {'metadata': {
#     'uid': '1DF1C8EB706363E40546253D5D025D90',
#     'version': 1,
#     'dataset_id': '105924',
#     'dataset_version': 1,
#     'updated_at': '2017-07-04T19:41:40.139Z'},
#  'pif': <pypif.obj.system.chemical.chemical_system.ChemicalSystem at 0x1131b8a50>}
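
As a small illustration of reading that dictionary, the sketch below builds a one-line summary from the “metadata” keys listed above. The function describe_pif_result is a hypothetical helper, not part of PyCC, and the sample dictionary reuses the metadata from the output above with None standing in for the Pif object.

```python
def describe_pif_result(result):
    """Build a one-line summary from a get_pif_with_metadata result.

    Hypothetical helper (not part of PyCC). Only the "metadata" keys
    documented above are read; the "pif" entry is left untouched.
    """
    metadata = result["metadata"]
    return (
        "PIF {uid} (version {version}) from dataset {dataset_id} "
        "(version {dataset_version}), updated {updated_at}"
    ).format(**metadata)


# Sample result; None is a placeholder for the PyPif Pif object
sample_result = {
    "metadata": {
        "uid": "1DF1C8EB706363E40546253D5D025D90",
        "version": 1,
        "dataset_id": "105924",
        "dataset_version": 1,
        "updated_at": "2017-07-04T19:41:40.139Z",
    },
    "pif": None,
}

print(describe_pif_result(sample_result))
# PIF 1DF1C8EB706363E40546253D5D025D90 (version 1) from dataset 105924 (version 1), updated 2017-07-04T19:41:40.139Z
```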

Dataset Manipulation

The DataClient class allows you to create datasets, update their names, descriptions, and permissions, and create new versions of them.

Creating a new version of a dataset bumps the version number on Citrination. All files uploaded after this point will be uploaded to the new version.

The example below demonstrates that files in an old version of a dataset are not included in the file count for the current version.

# ... client initialization left out
data_client = client.data

# Creates a new dataset (permissions default to private)
dataset = data_client.create_dataset("My New Dataset")
dataset_id = dataset.id

# Uploads a file to it
data_client.upload(dataset_id, "my_file.json")

print(data_client.matched_file_count(dataset_id))
# -> 1

# Create a new dataset version
data_client.create_dataset_version(dataset_id)

# No files in the new version
print(data_client.matched_file_count(dataset_id))
# -> 0

It is also possible to toggle a dataset between being publicly accessible and private to your own user:

# ... client initialization left out
data_client = client.data

# Creates a new dataset (permissions default to private)
dataset = data_client.create_dataset("My New Dataset")
dataset_id = dataset.id

# Make the dataset public
data_client.update_dataset(dataset_id, public=True)

# Make the dataset private again
data_client.update_dataset(dataset_id, public=False)