Classes

TonicTextual Class

class tonic_textual.api.TonicTextual(base_url: str, api_key: str | None = None, verify: bool = True)

Wrapper class for invoking Tonic Textual API

Parameters:
  • base_url (str) – The URL of your Tonic Textual instance. Do not include a trailing slash.

  • api_key (str) – Your API key. This argument is optional. Instead of providing the API key here, it is recommended that you set it in your environment as the value of TONIC_TEXTUAL_API_KEY.

  • verify (bool) – Whether SSL certificate verification is performed. This is enabled by default.

Examples

>>> TonicTextual("http://localhost:3000")
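The constructor parameters above lend themselves to a small setup helper. The following is a minimal sketch, not part of the library: client_settings is a hypothetical function that strips any trailing slash from the URL and reads the key from TONIC_TEXTUAL_API_KEY, the environment variable named above.

```python
import os


def client_settings(base_url, verify=True, environ=None):
    """Build keyword arguments for TonicTextual (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return {
        # The reference asks for a URL without a trailing slash.
        "base_url": base_url.rstrip("/"),
        # None when the variable is not set.
        "api_key": environ.get("TONIC_TEXTUAL_API_KEY"),
        "verify": verify,
    }
```

With the package installed, the result could be passed on as TonicTextual(**client_settings("http://localhost:3000/")).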
create_dataset(dataset_name: str)

Creates a dataset. A dataset is a collection of one or more files that Tonic Textual scans and redacts.

Parameters:

dataset_name (str) – The name of the dataset. Dataset names must be unique.

Returns:

The newly created dataset.

Return type:

Dataset

Raises:

DatasetNameAlreadyExists – Raised if a dataset with the same name already exists.
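Because dataset names must be unique, a common pattern is to fall back to get_dataset when the name is already taken. The sketch below assumes only the create_dataset and get_dataset methods described in this reference. get_or_create_dataset is a hypothetical helper, and the exception class is passed in so that the block stays self-contained.

```python
def get_or_create_dataset(client, dataset_name, already_exists_error):
    """Return the named dataset, creating it first if needed (hypothetical helper).

    client is expected to expose create_dataset and get_dataset.
    already_exists_error is the exception class to treat as "name taken",
    for example DatasetNameAlreadyExists.
    """
    try:
        return client.create_dataset(dataset_name)
    except already_exists_error:
        # The name is in use, so fetch the existing dataset instead.
        return client.get_dataset(dataset_name)
```

In real use, the third argument would be DatasetNameAlreadyExists imported from tonic_textual.classes.tonic_exception.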

delete_dataset(dataset_name: str)
download_redacted_file(job_id: str, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, custom_models: List[str] = [], random_seed: int | None = None, num_retries: int = 6) → bytes

Downloads a redacted file.

Parameters:
  • job_id (str) – The ID of the redaction job.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.

  • generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types not specified in generator_config.

  • custom_models (List[str] = []) – A list of custom model names to use to identify values to redact. To see the list of custom models that you have access to, use the get_custom_models function.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

  • num_retries (int = 6) – The number of times to attempt to download the file. If the file is not yet ready for download, there is a 10-second pause before each retry.

Returns:

The redacted file as a byte array.

Return type:

bytes

get_custom_models() → List[CustomModel]

Returns all of the custom models that the user owns.

Returns:

A list of all of the custom models that the user owns.

Return type:

List[CustomModel]

get_dataset(dataset_name: str) → Dataset

Gets the dataset for the specified dataset name.

Parameters:

dataset_name (str) – The name of the dataset.

Return type:

Dataset

Examples

>>> dataset = tonic.get_dataset("llama_2_chatbot_finetune_v5")
get_files(dataset_id: str) → List[DatasetFile]

Gets all of the files in the dataset.

Returns:

A list of all of the files in the dataset.

Return type:

List[DatasetFile]

llm_synthesis(string: str, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction) → RedactionResponse

Deidentifies a string by redacting sensitive data and replacing these values with values generated by an LLM.

Parameters:
  • string (str) – The string to redact.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.

  • generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types not specified in generator_config.

Returns:

The de-identified string.

Return type:

RedactionResponse

redact(string: str, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, custom_models: List[str] = [], random_seed: int | None = None) → RedactionResponse

Redacts a string. Depending on the configured handling for each sensitive data type, values can be either redacted, synthesized, or ignored.

Parameters:
  • string (str) – The string to redact.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.

  • generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.

  • custom_models (List[str] = []) – A list of custom model names to use to identify values to redact. To see the list of custom models that you have access to, use the get_custom_models function.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The redacted string along with ancillary information.

Return type:

RedactionResponse

Examples

>>> textual.redact("John Smith is a person", generator_config={"NAME_GIVEN": "Redaction"}, generator_default="Off")  # only redacts NAME_GIVEN
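The pattern in the example above (handle a few entity types and switch everything else off) can be wrapped in a small helper. This is a sketch only: handle_only is a hypothetical name, and the allowed values are the three strings listed for generator_config.

```python
ALLOWED_HANDLING = {"Redaction", "Synthesis", "Off"}


def handle_only(entity_types, handling="Redaction"):
    """Build redact() keyword arguments that affect only the given entity types."""
    if handling not in ALLOWED_HANDLING:
        raise ValueError(f"handling must be one of {sorted(ALLOWED_HANDLING)}")
    return {
        # Apply the chosen handling to each listed entity type...
        "generator_config": {entity: handling for entity in entity_types},
        # ...and leave every other type untouched.
        "generator_default": "Off",
    }
```

A call such as textual.redact(text, **handle_only(["NAME_GIVEN"])) would then pass the same arguments as the example.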
redact_json(json_data: str | dict, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, custom_models: List[str] = [], random_seed: int | None = None) → RedactionResponse

Redacts the values in a JSON blob. Depending on the configured handling for each sensitive data type, values can be either redacted, synthesized, or ignored.

Parameters:
  • json_data (Union[str, dict]) – The JSON whose values will be redacted. This can be either a JSON string or a Python dictionary.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it.

  • generator_default (PiiState = PiiState.Redaction) – The default redaction used for all types not specified in generator_config.

  • custom_models (List[str] = []) – A list of custom model names to use to identify values to redact. To see the list of custom models that you have access to, use the get_custom_models function.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The redacted string along with ancillary information.

Return type:

RedactionResponse
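Since redact_json accepts either a JSON string or a dictionary, it can be useful to validate the input locally before making the request. The following is a sketch of such a pre-check using only the standard library. normalize_json_input is a hypothetical helper and does not reflect how the library itself processes its input.

```python
import json


def normalize_json_input(json_data):
    """Return json_data as a JSON string, raising ValueError if it is invalid."""
    if isinstance(json_data, str):
        # json.loads raises json.JSONDecodeError (a ValueError) on bad input.
        json.loads(json_data)
        return json_data
    # Dictionaries are serialized; non-serializable values raise TypeError.
    return json.dumps(json_data)
```

Catching a malformed string here avoids a round trip that would otherwise end in InvalidJsonForRedactionRequest.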

start_file_redaction(file: IOBase, file_name: str) → str

Uploads a file and starts a redaction job for it.

Parameters:
  • file (io.IOBase) – The opened file, available for reading, which will be uploaded and redacted.

  • file_name (str) – The name of the file

Returns:

The job ID, which can be used to download the redacted file once it is ready.

Return type:

str
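start_file_redaction and download_redacted_file are meant to be used together: the first returns a job ID and the second retrieves the result for that ID. The sketch below chains the two calls. redact_file is a hypothetical helper, and client stands for any object that exposes the two methods with the signatures documented here.

```python
import os


def redact_file(client, in_path, out_path, num_retries=6):
    """Upload in_path for redaction and write the redacted bytes to out_path."""
    # Open in binary mode so that any file type can be uploaded.
    with open(in_path, "rb") as source:
        job_id = client.start_file_redaction(source, os.path.basename(in_path))

    # download_redacted_file retries on its own while the job is still running.
    redacted = client.download_redacted_file(job_id, num_retries=num_retries)

    with open(out_path, "wb") as target:
        target.write(redacted)
    return job_id
```

For large files, a higher num_retries gives the job more time to finish before the download gives up.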

unredact(redacted_string: str, random_seed: int | None = None) → str

Removes the redaction from a provided string. Returns the string with the original values.

Parameters:
  • redacted_string (str) – The redacted string from which to remove the redaction.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The string with the redaction removed.

Return type:

str

unredact_bulk(redacted_strings: List[str], random_seed: int | None = None) → List[str]

Removes redaction from a list of strings. Returns the strings with the original values.

Parameters:
  • redacted_strings (List[str]) – The list of redacted strings from which to remove the redaction.

  • random_seed (Optional[int] = None) – An optional value to use to override Textual’s default random number seeding. Can be used to ensure that different API calls use the same or different random seeds.

Returns:

The list of strings with the redaction removed.

Return type:

List[str]
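For long lists, it may be convenient to send the strings to unredact_bulk in smaller groups. The following sketch shows one way to do that. unredact_in_batches and its batch_size default are hypothetical, and this reference does not state any limit on the list size.

```python
def unredact_in_batches(client, redacted_strings, batch_size=100, random_seed=None):
    """Call client.unredact_bulk on successive slices and keep the input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    restored = []
    for start in range(0, len(redacted_strings), batch_size):
        batch = redacted_strings[start:start + batch_size]
        # The same seed is passed to every call so that the batches are consistent.
        restored.extend(client.unredact_bulk(batch, random_seed=random_seed))
    return restored
```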

Dataset Class

class tonic_textual.classes.dataset.Dataset(id: str, name: str, files: List[Dict], client: HttpClient)

Class to represent and provide access to a Tonic Textual dataset.

Parameters:
  • id (str) – Dataset identifier.

  • name (str) – Dataset name.

  • files (List[Dict]) – Serialized DatasetFile objects representing the files in a dataset.

  • client (HttpClient) – The HTTP client to use.

describe()

Prints the dataset name, identifier, and the list of files.

Examples

>>> workspace.describe()
Dataset: your_dataset_name [dataset_id]
Number of Files: 2
Number of Rows: 1000
fetch_all_df() → pd.DataFrame

Fetches all of the data in the dataset as a pandas dataframe.

Returns:

Dataset data in a pandas dataframe.

Return type:

pd.DataFrame

fetch_all_json() → str

Fetches all of the data in the dataset as JSON.

Returns:

Dataset data in JSON format.

Return type:

str

get_failed_files() → List[DatasetFile]

Gets all of the files in the dataset that encountered an error when they were processed. These files are effectively ignored.

Returns:

The list of files that had processing errors.

Return type:

List[DatasetFile]

get_processed_files() → List[DatasetFile]

Gets all of the files in the dataset for which processing is complete. The data in these files is returned when data is requested.

Returns:

The list of processed dataset files.

Return type:

List[DatasetFile]

get_queued_files() → List[DatasetFile]

Gets all of the files in the dataset that are waiting to be processed.

Returns:

The list of dataset files that await processing.

Return type:

List[DatasetFile]

get_running_files() → List[DatasetFile]

Gets all of the files in the dataset that are currently being processed.

Returns:

The list of files that are being processed.

Return type:

List[DatasetFile]
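The four status methods above can be combined into a quick progress summary. This is a minimal sketch: status_counts is a hypothetical helper, and dataset stands for any object that exposes the four get_*_files methods.

```python
def status_counts(dataset):
    """Count the files of a dataset by processing state."""
    return {
        "processed": len(dataset.get_processed_files()),
        "queued": len(dataset.get_queued_files()),
        "running": len(dataset.get_running_files()),
        "failed": len(dataset.get_failed_files()),
    }
```

When both "queued" and "running" are zero, no further files are waiting to be processed.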

reset()
upload_then_add_file(file_path: str, file_name: str | None = None)

Uploads a file to the dataset.

Parameters:
  • file_path (str) – The absolute path of the file to upload.

  • file_name (str) – The name of the file to save to Tonic Textual.

Raises:

DatasetFileMatchesExistingFile – Raised if the file content matches an existing file.
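When uploading several files, duplicates can be skipped instead of stopping the whole run. The sketch below assumes only the upload_then_add_file signature shown above. upload_all is a hypothetical helper, and the exception class is passed in as an argument to keep the block self-contained.

```python
import os


def upload_all(dataset, paths, duplicate_error):
    """Upload each path and return (uploaded, skipped) lists of absolute paths."""
    uploaded, skipped = [], []
    for path in paths:
        # The method expects an absolute path.
        abs_path = os.path.abspath(path)
        try:
            dataset.upload_then_add_file(abs_path, os.path.basename(abs_path))
            uploaded.append(abs_path)
        except duplicate_error:
            # Content already exists in the dataset, so record it and continue.
            skipped.append(abs_path)
    return uploaded, skipped
```

In real use, duplicate_error would be DatasetFileMatchesExistingFile from tonic_textual.classes.tonic_exception.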

DatasetFile Class

class tonic_textual.classes.datasetfile.DatasetFile(id: str, name: str, num_rows: int, num_columns: int, processing_status: str, processing_error: str)

Class to store the metadata for a dataset file.

Parameters:
  • id (str) – The identifier of the dataset file.

  • name (str) – The file name of the dataset file.

  • num_rows (int) – The number of rows in the dataset file.

  • num_columns (int) – The number of columns in the dataset file.

  • processing_status (str) – The status of the dataset file in the processing pipeline. Possible values are ‘Completed’, ‘Failed’, ‘Cancelled’, ‘Running’, and ‘Queued’.

  • processing_error (str) – If the dataset file processing failed, a description of the issue that caused the failure.

  • uploaded_timestamp (str) – Timestamp in UTC when dataset file was uploaded to the dataset.

describe()

Prints the dataset file metadata. Includes the identifier, file name, number of rows, and number of columns.
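The processing_status and processing_error fields can be used to report on failed files. The sketch below assumes that each DatasetFile exposes its constructor parameters as attributes of the same name (name, processing_status, processing_error), which this reference does not confirm. failure_report is a hypothetical helper.

```python
def failure_report(files):
    """Map the name of each failed file to its processing error."""
    return {
        f.name: f.processing_error
        for f in files
        # 'Failed' is one of the documented processing_status values.
        if f.processing_status == "Failed"
    }
```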

TonicException Class

exception tonic_textual.classes.tonic_exception.DatasetFileMatchesExistingFile(errors)

Raised when the content in a file to upload matches the content in an existing file in the dataset.

exception tonic_textual.classes.tonic_exception.DatasetNameAlreadyExists(errors)

Raised when there is an attempt to create a dataset with a name that already exists.

exception tonic_textual.classes.tonic_exception.ErrorWhenDownloadFile(errors)

Raised when the server returns a 500 error during an attempt to download a file.

exception tonic_textual.classes.tonic_exception.FileNotReadyForDownload(msg)

Raised when you make a request to download a file that is not yet ready for download.

exception tonic_textual.classes.tonic_exception.InvalidJsonForRedactionRequest(msg)

Raised when the JSON redaction request contains invalid JSON.

exception tonic_textual.classes.tonic_exception.LicenseInvalid(errors)

Raised when either your license has expired or you have exceeded your allowed word limit.