2.1. Extractors

Extractors are operators that read data from files. The are essentially Casters because the input is always a string, and the output is usually another datatype. Below are examples of using extractors in Pipe objects.

2.1.1. CSVExtractor

The CSVExtractor is used to read a csv file and return the data in a pandas DataFrame.

from piperoni.operators.extract.extract_file.csv_ import CSVExtractor
from piperoni.operators.pipe import Pipe

extractor_pipe = Pipe(
    [
        CSVExtractor()
    ]
)

extracted_data = extractor_pipe("path/to/file.csv")

2.1.2. ExcelExtractor

The ExcelExtractor is used to read an excel file and return the data in a pandas DataFrame.

from piperoni.operators.extract.extract_file.excel import ExcelExtractor
from piperoni.operators.pipe import Pipe

extractor_pipe = Pipe(
    [
        ExcelExtractor()
    ]
)

extracted_data = extractor_pipe("path/to/file.xlsx")

2.1.3. JSONExtractor

The JSONExtractor is used to read a json file and return the data in a pandas DataFrame.

from piperoni.operators.extract.extract_file.json_ import JSONExtractor
from piperoni.operators.pipe import Pipe

extractor_pipe = Pipe(
    [
        JSONExtractor()
    ]
)

extracted_data = extractor_pipe("path/to/file.json")

2.1.4. Custom Extractors

You may very often need to make your own custom extractor which will be a superclass of any of the above classes or the base FileExtractor. You need to define your own transform method that reads the data, transforms it, and returns it in some fashion.

Below is an example of a custom extractor using the base FileExtractor. It reads a csv separated by tabs, downscopes to a few specific columns, and returns a pandas DataFrame:

from piperoni.operators.extract.extract_file.base import FileExtractor
from piperoni.operators.pipe import Pipe
from pandas import DataFrame

import pandas as pd

class MyCustomExtractor(FileExtractor):
    def transform(self, path: str) -> DataFrame:
        """Parses a cif file. This is a demo, and does not really work. """
        raw_file = pd.read_csv(path, sepstr = "\t")
        data_dict = {}
        data_dict["uid"] = raw_file.iloc[2,1]
        data_dict["journal"] = raw_file.iloc[1,8]
        data_dict["sites"] = raw_file.iloc[1,10:21]
        output_df = DataFrame(data = data_dict)
        return output_df

extractor_pipe = Pipe(
    [
        MyCustomExtractor()
    ]
)

extracted_data = extractor_pipe("path/to/file.csv")