Data Client¶
The DataClient
is the section of PyCC that allows you to manage your data
on Citrination. To access the data client, instantiate CitrinationClient
and read the data
attribute:
from citrination_client import CitrinationClient
from os import environ
client = CitrinationClient(environ["CITRINATION_API_KEY"], environ["CITRINATION_SITE"])
data_client = client.data
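The initialization above fails with a bare KeyError if either environment variable is unset. The following is a minimal sketch of a friendlier check. `read_citrination_settings` is a hypothetical helper and not part of PyCC; it only reads the two variable names used above.

```python
import os

REQUIRED_VARIABLES = ("CITRINATION_API_KEY", "CITRINATION_SITE")


def read_citrination_settings(env=None):
    """Return (api_key, site), or raise a ValueError naming what is missing.

    Hypothetical helper, not part of PyCC.
    """
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way.
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return env["CITRINATION_API_KEY"], env["CITRINATION_SITE"]
```

The returned pair is in the same order as the positional arguments used above, so it could be passed along as `CitrinationClient(*read_citrination_settings())`.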
Uploading Files¶
The DataClient
class exposes a method, .upload
which allows you to
upload a file or a directory to a dataset on Citrination using the “default”
ingester. This method is useful for uploading JSON files that follow the PIF
schema, as well as files that do not require additional processing with a custom
ingester.
Attention
For files that require additional processing, see the sections on
upload_with_ingester
and upload_with_template_csv_ingester
below.
This method is parameterized with the following values:
dataset_id - The integer value of the ID of the dataset to which you will be uploading
source_path - The path to the file or directory which you want to upload
dest_path (optional) - The name of the file or directory as it should appear in Citrination.
Uploading a File To Citrination¶
The following Python snippet demonstrates the approach for uploading a file with the relative path characterizations/CdTe1.json
to dataset 1 on Citrination.
# ... client initialization left out
data_client = client.data
file_path = "characterizations/CdTe1.json"
dataset_id = 1
data_client.upload(dataset_id, file_path)
In the web UI, this file will appear as CdTe1.json
nested in a characterizations
folder.
Uploading a File To Citrination Under a New Name¶
You can specify the name a file should have once uploaded to Citrination by passing dest_path as the third parameter to the .upload
method.
# ... client initialization left out
data_client = client.data
file_path = "characterizations/CdTe1.json"
dataset_id = 1
# Pass in the third parameter to upload()
data_client.upload(dataset_id, file_path, "CadTel1.json")
In the web UI, this file will appear as CadTel1.json
at the top level of the dataset.
Uploading a Directory To Citrination¶
If you pass a directory into the upload()
method, all the files in the
directory will be recursively uploaded to the dataset. Their paths (relative to the directory specified) will remain intact.
Attention
Files uploaded this way will be prefixed with the name of the parent
directory that you originally specify. In other words, if upload()
is
called with the source path my_folder
, the dataset on Citrination will
contain files prefixed with my_folder/
.
The following code sample uploads the characterizations/
folder to dataset 1 on Citrination.
# ... client initialization left out
data_client = client.data
directory_path = "characterizations/"
dataset_id = 1
data_client.upload(dataset_id, directory_path)
Uploading a Directory To Citrination Under a New Name¶
You can also specify that a folder be renamed when it is uploaded to Citrination. The following code sample uploads the contents of the characterizations/
directory to a directory called january_characterizations
on Citrination.
# ... client initialization left out
data_client = client.data
directory_path = "characterizations/"
dataset_id = 1
data_client.upload(dataset_id, directory_path, "january_characterizations/")
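The naming rules described in the upload sections above can be previewed locally before anything is sent. The sketch below is an illustration only: `preview_upload_names` is a hypothetical helper, not part of PyCC, it never contacts Citrination, and it assumes the simple relative paths used in the examples above.

```python
import os


def preview_upload_names(source_path, dest_path=None):
    """Predict the names files should have in the dataset after upload().

    Hypothetical local helper, not part of PyCC.
    """
    if os.path.isfile(source_path):
        # A single file keeps its relative path unless dest_path renames it.
        return [dest_path if dest_path is not None else source_path]

    source_root = source_path.rstrip("/")
    # A directory keeps its own name as the prefix unless dest_path renames it.
    dest_root = (dest_path if dest_path is not None else source_root).rstrip("/")

    names = []
    for current_dir, _subdirs, filenames in os.walk(source_root):
        for filename in filenames:
            local_path = os.path.join(current_dir, filename)
            relative = os.path.relpath(local_path, source_root)
            names.append(dest_root + "/" + relative.replace(os.sep, "/"))
    return sorted(names)
```

Calling `preview_upload_names("characterizations/", "january_characterizations/")` lists every file under the local folder with the new prefix, which makes it easier to spot an unintended rename before uploading.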
Selecting a Custom Ingester¶
Finding an Ingester by ID¶
The list_ingesters
method returns all of the custom ingesters available on
your Citrination deployment. It accepts no parameters, and returns an instance
of IngesterList
, which itself contains instances of Ingester
.
The unique field of an Ingester
is its id
, so a particular Ingester
can be located via the IngesterList#find_by_id
method.
# ... client initialization left out
data_client = client.data
ingester_list = data_client.list_ingesters()
csv_ingester = ingester_list.find_by_id("citrine/ingest template_csv_converter")
Finding an Ingester by Searching its Attributes¶
If you don’t know the id
of an ingester, you can also search through an
IngesterList
via the where
method, which supports searching through a
variety of attributes that can be found in the Ingester.SEARCH_FIELDS
constant.
# ... client initialization left out
data_client = client.data
# All ingesters available on your Citrination deployment
ingester_list = data_client.list_ingesters()
# Find all ingesters whose `name` attributes contain the phrase "xrd"
# Note that the `where` method returns a new IngesterList
xrd_ingesters = ingester_list.where({ "name": "xrd" })
# How many ingesters had names that contained "xrd"?
print(xrd_ingesters.ingester_count)
# 2
# Quick summary of what those ingesters are
print(xrd_ingesters)
# <IngesterList ingester_count=2 ingesters=[
# "<Ingester id='citrine/ingest bruker_xrd_xy_prod'
# display_name='Citrine: Bruker XRD .XY'
# description='Converts Bruker V8 .XY files to PIF .json format.'
# num_arguments='3'>",
# "<Ingester id='citrine/ingest xrdml_xrd_converter'
# display_name='Citrine: XRD .xrdml'
# description='Converter for .xrdml files from XRD measurements'
# num_arguments='2'>"
# ]>
# Supposing we want to go with the one whose display_name is `Citrine: XRD .xrdml`,
# there are several ways to do this:
# 1. Indexing into the xrd_ingesters' `ingesters` list:
xrdml_ingester = xrd_ingesters.ingesters[1]
# 2. Alternatively, using a where clause with the matching `display_name`, `id`,
# or other searchable attribute:
xrdml_ingester = xrd_ingesters.where({ "display_name": "Citrine: XRD .xrdml" }).ingesters[0]
# 3. Since xrd_ingesters is an `IngesterList`, we could also use `find_by_id` to
# avoid having to index into the `ingesters` list:
xrdml_ingester = xrd_ingesters.find_by_id("citrine/ingest xrdml_xrd_converter")
Attention
Note that the IngesterList#where method returns a new instance of IngesterList, and requires indexing into the IngesterList#ingesters attribute to ultimately select an ingester.
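Because `where` can match several ingesters, or none, indexing with `[0]` may silently pick the wrong one or raise an IndexError. A minimal sketch of a stricter lookup follows. `find_single_ingester` is a hypothetical helper, not part of PyCC; it relies only on the `where` method and the `ingester_count` and `ingesters` attributes shown above.

```python
def find_single_ingester(ingester_list, criteria):
    """Return the only ingester matching criteria, or raise a ValueError.

    Hypothetical helper, not part of PyCC.
    """
    matches = ingester_list.where(criteria)
    if matches.ingester_count != 1:
        raise ValueError(
            "Expected exactly 1 ingester for {!r}, found {}".format(
                criteria, matches.ingester_count
            )
        )
    return matches.ingesters[0]
```

With the listing above, `find_single_ingester(ingester_list, {"name": "xrd"})` would be expected to raise, since two ingesters matched that search, while a more specific `display_name` search would return a single `Ingester`.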
Viewing an Ingester’s Arguments¶
Once you have an Ingester
, you can check its optional and required arguments
via the Ingester#arguments
attribute. As seen below, each argument has a name
,
desc
(description), type
, and required
key. If you need to provide
ingester_arguments
for an ingester when using the upload_with_ingester
method,
you will want to ensure that the name
and value
of your dictionaries match
up to the name
and type
of those arguments found in Ingester#arguments
.
# ... client initialization left out
data_client = client.data
ingester_list = data_client.list_ingesters()
csv_ingester = ingester_list.find_by_id("citrine/ingest template_csv_converter")
formulation_ingester = ingester_list.find_by_id("citrine/ingest formulation_csv_converter")
xrdml_ingester = ingester_list.find_by_id("citrine/ingest xrdml_xrd_converter")
# Here we can see that the Template CSV ingester accepts no arguments
print(csv_ingester.arguments)
# []
# Here we can see that the Formulation CSV ingester accepts one optional argument
print(formulation_ingester.arguments)
# [{ 'name': 'check_ingredient_names',
# 'desc': 'Whether to check that the names of the ingredients in the formulations are present in this upload',
# 'type': 'Boolean',
# 'required': False }]
# Here we can see that the Citrine: XRD .xrdml ingester requires 2 arguments,
# one named `sample_id` and the other named `chemical_formula`, both of which
# should be strings.
print(xrdml_ingester.arguments)
# [{ 'name': 'sample_id',
# 'desc': 'An ID to uniquely identify the material referenced in the file.',
# 'type': 'String',
# 'required': True },
# { 'name': 'chemical_formula',
# 'desc': 'The chemical formula of the material referenced in the file.',
# 'type': 'String',
# 'required': True }]
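The argument descriptions above can also be used to check your values before uploading. The sketch below builds the list of `name`/`value` dictionaries from a plain mapping and rejects unknown or missing names. `build_ingester_arguments` is a hypothetical helper, not part of PyCC, and it does not check value types.

```python
def build_ingester_arguments(argument_specs, values):
    """Turn {name: value} into a list of {"name", "value"} dicts.

    Hypothetical helper, not part of PyCC. `argument_specs` is the list
    found in Ingester#arguments; `values` maps argument names to values.
    """
    known = {spec["name"] for spec in argument_specs}
    unknown = sorted(set(values) - known)
    if unknown:
        raise ValueError("Unknown ingester arguments: " + ", ".join(unknown))

    missing = [
        spec["name"]
        for spec in argument_specs
        if spec["required"] and spec["name"] not in values
    ]
    if missing:
        raise ValueError("Missing required ingester arguments: " + ", ".join(missing))

    # Preserve the order in which the ingester declares its arguments.
    return [
        {"name": spec["name"], "value": values[spec["name"]]}
        for spec in argument_specs
        if spec["name"] in values
    ]


# The argument list printed above for the Citrine: XRD .xrdml ingester
xrdml_specs = [
    {"name": "sample_id",
     "desc": "An ID to uniquely identify the material referenced in the file.",
     "type": "String",
     "required": True},
    {"name": "chemical_formula",
     "desc": "The chemical formula of the material referenced in the file.",
     "type": "String",
     "required": True},
]
print(build_ingester_arguments(
    xrdml_specs, {"sample_id": "1212", "chemical_formula": "NaCl"}
))
# [{'name': 'sample_id', 'value': '1212'}, {'name': 'chemical_formula', 'value': 'NaCl'}]
```

In practice the first argument would be `xrdml_ingester.arguments` rather than a hand-written list.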
Uploading Data Using a Custom Ingester¶
The upload_with_ingester
method allows for custom ingesters to be used when
uploading a file.
This method is parameterized with the following values:
dataset_id - The integer value of the ID of the dataset to which you will be uploading
source_path - The path to the file that you want to upload and for the ingester to then process
ingester - The custom Ingester you want to use
ingester_arguments (optional) - Any ingester arguments you want to apply to the ingester. This should be a list of dicts that contain name and value keys
dest_path (optional) - The name of the file or directory as it should appear in Citrination.
Ingesting Without Ingester Arguments¶
The following Python snippet demonstrates 2 approaches for uploading a file with
the relative path data/formulation.csv
to dataset 1 on Citrination, one
with a specified destination path and one without (similar to how the upload
method works). Both approaches utilize the Formulation CSV
ingester with
no ingester arguments
.
# ... client initialization left out
data_client = client.data
file_path = "data/formulation.csv"
dataset_id = 1
ingester_list = data_client.list_ingesters()
formulation_ingester = ingester_list.find_by_id("citrine/ingest formulation_csv_converter")
# Printing the formulation_ingester's arguments, we can see that it takes one
# argument that is optional - so we can elect to omit it
print(formulation_ingester.arguments)
# [{ 'name': 'check_ingredient_names',
# 'desc': 'Whether to check that the names of the ingredients in the formulations are present in this upload',
# 'type': 'Boolean',
# 'required': False }]
# To ingest the file using the file_path as the destination path
data_client.upload_with_ingester(
dataset_id, file_path, formulation_ingester
)
# To ingest the file using a different destination path
data_client.upload_with_ingester(
dataset_id, file_path, formulation_ingester, dest_path='formulation.csv'
)
In the web UI, this file will appear as either formulation.csv
nested in a
data
folder, or formulation.csv
in the top level of the dataset
depending on whether or not the destination path was provided.
Ingesting With Ingester Arguments¶
The following Python snippet demonstrates 2 approaches for uploading a file with
the relative path experiments/data.xrdml
to dataset 1 on Citrination, one
with a specified destination path and one without (similar to how the upload
method works). Both approaches utilize the Citrine: XRD .xrdml
ingester with
a set of arguments provided
.
# ... client initialization left out
data_client = client.data
file_path = "experiments/data.xrdml"
dataset_id = 1
ingester_list = data_client.list_ingesters()
xrdml_ingester = ingester_list.find_by_id("citrine/ingest xrdml_xrd_converter")
# Printing the ingester's arguments, we can see it requires an argument with the
# name `sample_id`, and another with the name `chemical_formula`, both of which
# should be strings.
print(xrdml_ingester.arguments)
# [{ 'name': 'sample_id',
# 'desc': 'An ID to uniquely identify the material referenced in the file.',
# 'type': 'String',
# 'required': True },
# { 'name': 'chemical_formula',
# 'desc': 'The chemical formula of the material referenced in the file.',
# 'type': 'String',
# 'required': True }]
ingester_arguments = [
{ "name": "sample_id", "value": "1212" },
{ "name": "chemical_formula", "value": "NaCl" },
]
# To ingest the file using the file_path as the destination path
data_client.upload_with_ingester(
dataset_id, file_path, xrdml_ingester, ingester_arguments
)
# To ingest the file using a different destination path
data_client.upload_with_ingester(
dataset_id, file_path, xrdml_ingester, ingester_arguments, 'data.xrdml'
)
In the web UI, this file will appear as either data.xrdml
nested in an
experiments
folder, or data.xrdml
in the top level of the dataset
depending on whether or not the destination path was provided.
Uploading Using the Template CSV Ingester¶
The upload_with_template_csv_ingester
method abstracts away the logic of
finding the Template CSV ingester, since it is one of the more commonly used
ingesters. The same work can be accomplished by finding the Template CSV
ingester and using the upload_with_ingester
method.
This method is parameterized with the following values:
dataset_id - The integer value of the ID of the dataset to which you will be uploading
source_path - The path to the file that you want to upload and for the Template CSV to then process
dest_path (optional) - The name of the file or directory as it should appear in Citrination.
The following Python snippet demonstrates 2 approaches for uploading a file with
the relative path experiments/data.csv
to dataset 1 on Citrination, one
with a specified destination path and one without (similar to how the upload
method works). Both approaches use the Template CSV ingester, which accepts no ingester arguments.
# ... client initialization left out
data_client = client.data
file_path = "experiments/data.csv"
dataset_id = 1
# To ingest the file using the file_path as the destination path
data_client.upload_with_template_csv_ingester(
dataset_id, file_path
)
# To ingest the file using a different destination path
data_client.upload_with_template_csv_ingester(
dataset_id, file_path, dest_path='data.csv'
)
In the web UI, this file will appear as either data.csv
nested in an
experiments
folder, or data.csv
in the top level of the dataset
depending on whether or not the destination path was provided.
Checking the Ingest Status of a Dataset¶
The get_ingest_status
method can be used to check the ingestion status of
a dataset. It returns the string Processing
when data is being ingested
or indexed, and returns the string Finished
when no data is being processed.
# ... client initialization left out
data_client = client.data
dataset = client.data.create_dataset()
file = 'test_data/template_example.csv'
client.data.upload_with_template_csv_ingester(dataset.id, file)
# After uploading, the status will initially be `Processing`
print(client.data.get_ingest_status(dataset.id))
# Processing
# After data has finished processing, the status will be `Finished`
print(client.data.get_ingest_status(dataset.id))
# Finished
Attention
Note that this method does not distinguish between successful and failed data ingestions - it is simply whether or not data is currently being processed for the dataset.
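Since ingestion takes time, a script usually needs to poll the status rather than check it once. The following is a minimal polling sketch. `wait_for_ingest` is a hypothetical helper, not part of PyCC; it uses only `get_ingest_status` and the `Finished` string described above, and its default timeout and polling interval are arbitrary assumptions.

```python
import time


def wait_for_ingest(data_client, dataset_id, timeout=600, poll_interval=5):
    """Poll until get_ingest_status reports "Finished".

    Hypothetical helper, not part of PyCC. Returns True once the status is
    "Finished", or False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if data_client.get_ingest_status(dataset_id) == "Finished":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A return value of True only means that nothing is being processed any more; as noted above, it does not say whether the ingestion succeeded.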
Retrieving Files¶
There are two mechanisms for retrieving data from datasets on Citrination:
Request download URLs for previously uploaded files
Request the contents of a single record in PIF JSON format
File Download URLs¶
The DataClient
class provides two methods for retrieving files
from a dataset:
get_dataset_files()
get_dataset_file()
These two methods will each return URLs which can be used to download one or more files in a dataset.
# ... client initialization left out
data_client = client.data
dataset_id = 1
# Gets a single file named exactly my_file.json
dataset_file = data_client.get_dataset_file(dataset_id, "my_file.json")
dataset_file.url # url that can be used to download the file
dataset_file.path # the filepath as it appears in Citrination
# Gets all the files in a dataset, organized by version,
# represented as a list of DatasetFile objects
dataset_files = data_client.get_dataset_files(dataset_id)
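The returned URLs still have to be fetched to obtain the file contents. A minimal sketch of saving them to disk with the standard library follows. `download_dataset_files` is a hypothetical helper, not part of PyCC; it assumes only the `url` and `path` attributes shown above, and the `fetch` parameter exists so the download step can be swapped out.

```python
import os
from urllib.request import urlretrieve


def download_dataset_files(dataset_files, target_dir, fetch=urlretrieve):
    """Download each file into target_dir, mirroring its Citrination path.

    Hypothetical helper, not part of PyCC. Returns the local paths written.
    """
    target_root = os.path.abspath(target_dir)
    written = []
    for dataset_file in dataset_files:
        local_path = os.path.abspath(os.path.join(target_root, dataset_file.path))
        # Refuse paths that would escape target_dir (for example "../x").
        if os.path.commonpath([target_root, local_path]) != target_root:
            raise ValueError("Unsafe file path: " + dataset_file.path)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        fetch(dataset_file.url, local_path)
        written.append(local_path)
    return written
```

For example, `download_dataset_files(dataset_files, "downloads")` would place each file under a local `downloads` folder using the same relative path it has in the dataset.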
PIF Retrieval¶
A PIF record on Citrination can be retrieved using the DataClient#get_pif
method.
The record will be returned as a PyPif Pif object. The dataset_version
and
pif_version
arguments are optional - by default the PIF returned will be
the current version of the PIF from the current version of the dataset.
# ... client initialization left out
data_client = client.data
dataset_id = 1
pif_uid = "abc123"
# Retrieves the latest version of the PIF whose uid is "abc123" from the latest
# version of dataset 1
data_client.get_pif(dataset_id, pif_uid)
# Retrieves the latest version of the PIF whose uid is "abc123" from version 3
# of dataset 1
data_client.get_pif(dataset_id, pif_uid, dataset_version=3)
# Retrieves version 2 of the PIF whose uid is "abc123" from the latest version
# of dataset 1
data_client.get_pif(dataset_id, pif_uid, pif_version=2)
# Retrieves version 2 of the PIF whose uid is "abc123" from version 3 of
# dataset 1
data_client.get_pif(dataset_id, pif_uid, dataset_version=3, pif_version=2)
To get the metadata of a PIF, use the DataClient#get_pif_with_metadata
method.
This method behaves like DataClient#get_pif
, but returns a dictionary
instead of a PyPif Pif object. The resulting dictionary will have two keys:
“pif”, which will point to a PyPif Pif object, and “metadata”, which will be
a dictionary with “dataset_id”, “dataset_version”, “uid”, “version”, and
“updated_at” keys.
DataClient#get_pif_with_metadata
has the same method signature as DataClient#get_pif
,
with optional dataset_version
and pif_version
arguments.
# ... client initialization left out
data_client = client.data
dataset_id = 105924
pif_uid = "1DF1C8EB706363E40546253D5D025D90"
get_pif_with_metadata = data_client.get_pif_with_metadata(dataset_id, pif_uid)
print(get_pif_with_metadata)
# {'metadata': {
# 'uid': '1DF1C8EB706363E40546253D5D025D90',
# 'version': 1,
# 'dataset_id': '105924',
# 'dataset_version': 1,
# 'updated_at': '2017-07-04T19:41:40.139Z'},
# 'pif': <pypif.obj.system.chemical.chemical_system.ChemicalSystem at 0x1131b8a50>}
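The metadata dictionary can be read like any other Python dictionary. The sketch below turns it into a one-line summary. `describe_pif_metadata` is a hypothetical helper, not part of PyCC, and it assumes that `updated_at` always uses the timestamp format shown in the output above.

```python
from datetime import datetime


def describe_pif_metadata(result):
    """Summarize the "metadata" part of a get_pif_with_metadata result.

    Hypothetical helper, not part of PyCC.
    """
    metadata = result["metadata"]
    # Example value: '2017-07-04T19:41:40.139Z' (the trailing Z marks UTC).
    updated_at = datetime.strptime(metadata["updated_at"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return "PIF {} (version {}) in dataset {} (version {}), updated {}".format(
        metadata["uid"],
        metadata["version"],
        metadata["dataset_id"],
        metadata["dataset_version"],
        updated_at.date().isoformat(),
    )
```

Passing the dictionary printed above would give a summary ending in `updated 2017-07-04`.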
Dataset Manipulation¶
The DataClient
class allows you to create datasets, update their names, descriptions, and permissions, and create new versions of them.
Creating a new version of a dataset bumps the version number on Citrination. All files uploaded after this point will be uploaded to the new version.
The example below demonstrates that files in an old version of a dataset are not included in the file count.
# ... client initialization left out
data_client = client.data
# Creates a new dataset (permissions default to private)
dataset = data_client.create_dataset("My New Dataset")
dataset_id = dataset.id
# Uploads a file to it
data_client.upload(dataset_id, "my_file.json")
print(data_client.matched_file_count(dataset_id))
# -> 1
# Create a new dataset version
data_client.create_dataset_version(dataset_id)
# No files in the new version
print(data_client.matched_file_count(dataset_id))
# -> 0
It is also possible to toggle a dataset between being publicly accessible and private to your own user:
# ... client initialization left out
data_client = client.data
# Creates a new dataset (permissions default to private)
dataset = data_client.create_dataset("My New Dataset")
dataset_id = dataset.id
# Make the dataset public
data_client.update_dataset(dataset_id, public=True)
# Make the dataset private again
data_client.update_dataset(dataset_id, public=False)