Parse API

TonicTextual Parse Class

Pipeline Class

class tonic_textual.classes.pipeline.Pipeline(name: str, id: str, client: HttpClient)

Class to represent and provide access to a Tonic Textual pipeline.

Parameters:
  • name (str) – Pipeline name.

  • id (str) – Pipeline identifier.

  • client (HttpClient) – The HTTP client to use.

describe() str

Returns the name and id of the pipeline.

enumerate_files(lazy_load_content=True) PipelineFileEnumerator

Enumerate the files in the pipeline.

Parameters:

lazy_load_content (bool) – Whether to lazily load the content of the files. Default is True.

Returns:

An enumerator for the files in the pipeline.

Return type:

PipelineFileEnumerator
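As a minimal sketch of how enumerate_files might be used, the helper below collects the describe() text of each parsed file. It assumes the returned enumerator supports ordinary Python iteration in addition to the documented next() method; list_parsed_files is a hypothetical helper name, not part of the library.

```python
from typing import List


def list_parsed_files(pipeline) -> List[str]:
    """Collect the describe() text of every file in a pipeline.

    `pipeline` is expected to behave like the Pipeline class documented
    above, that is, it must provide enumerate_files(lazy_load_content=...).
    """
    descriptions = []
    # Assumption: the enumerator can be used directly in a for loop.
    for parsed_file in pipeline.enumerate_files(lazy_load_content=True):
        descriptions.append(parsed_file.describe())
    return descriptions
```

Keeping lazy_load_content=True here is deliberate: the helper only needs each file's description, not its content.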

get_delta(pipeline_run1: PipelineRun, pipeline_run2: PipelineRun) FileParseResultsDiffEnumerator

Enumerates the files in the diff between two pipeline runs.

Parameters:
  • pipeline_run1 (PipelineRun) – The first pipeline run.

  • pipeline_run2 (PipelineRun) – The second pipeline run.

Returns:

An enumerator for the files in the diff between the two runs.

Return type:

FileParseResultsDiffEnumerator
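The following sketch shows one way to summarize a delta: it counts the changed files by status. It assumes that the diff enumerator can be iterated in a for loop and that each item offers the deconstruct() method documented later on this page; summarize_delta is a hypothetical helper name.

```python
from collections import Counter


def summarize_delta(pipeline, run1, run2) -> Counter:
    """Count the files in a delta by status name (Added, Deleted, ...)."""
    counts = Counter()
    # Assumption: the diff enumerator supports Python iteration.
    for diff in pipeline.get_delta(run1, run2):
        # deconstruct() yields the diff status and the parsed file.
        status, _parsed_file = diff.deconstruct()
        counts[status.name] += 1
    return counts
```

The two runs would typically come from get_runs(); the order in which that list is returned is not stated here, so the sketch takes the runs as explicit arguments.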

get_runs() List[PipelineRun]

Get the runs for the pipeline.

Returns:

A list of PipelineRun objects.

Return type:

List[PipelineRun]

run() str

Run the pipeline.

Returns:

The ID of the job.

Return type:

str

upload_file(file: IOBase, file_name: str, csv_config: SolarCsvConfig | None = None) str

Upload a file to the pipeline.

Parameters:
  • file (io.IOBase) – The file to upload.

  • file_name (str) – The name of the file.

  • csv_config (SolarCsvConfig) – Optional configuration for the CSV file. Default is None.

Returns:

This function does not return any value.

Return type:

None
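A minimal sketch of uploading a local file and then starting a run is shown below. Opening the file in binary mode is an assumption (the signature only requires an io.IOBase object), and upload_and_run is a hypothetical helper name.

```python
from pathlib import Path


def upload_and_run(pipeline, path: str) -> str:
    """Upload one local file to the pipeline, then start a run.

    Returns the job ID reported by run().
    """
    file_path = Path(path)
    # Assumption: binary mode is suitable for every supported file type.
    with file_path.open("rb") as handle:
        pipeline.upload_file(handle, file_path.name)
    return pipeline.run()
```

The file name passed to upload_file is taken from the path, so the uploaded file keeps its local name.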

File Enumerators

class tonic_textual.classes.pipeline_file_enumerator.PipelineFileEnumerator(job_id: str, client: HttpClient, lazy_load_content=True)

Enumerates the files in a pipeline.

Parameters:
  • job_id (str) – The job identifier.

  • client (HttpClient) – The HTTP client to use.

  • lazy_load_content (bool) – Whether to lazy load the content of the files. Default is True.

next() FileParseResult

class tonic_textual.classes.file_parse_result_diff_enumerator.FileParseResultsDiffEnumerator(job_id1: str, job_id2: str, client: HttpClient)

Enumerates the files in a diff between two jobs.

Parameters:
  • job_id1 (str) – The first job identifier.

  • job_id2 (str) – The second job identifier.

  • client (HttpClient) – The HTTP client to use.

next() FileParseResultsDiff

Pipeline File Results

class tonic_textual.classes.parse_api_responses.file_parse_result.FileParseResult(response: Dict, client: HttpClient, lazy_load_content=False, document: Dict | None = None)

A class representing the result of a parsed file.

Parameters:
  • response (Dict) – The response from the API.

  • client (HttpClient) – The HTTP client to use.

  • lazy_load_content (bool) – Whether to lazy load the content of the file. Default is False.

describe() str

Returns the parsed file path.

download_results() str

Downloads the results file.

Returns:

The results file.

Return type:

str

get_all_entities() List[SingleDetectionResult]

Returns a list of all the detected entities in the file.

Returns:

A list of detected entities in the file.

Return type:

List[SingleDetectionResult]

get_chunks(max_chars=15000, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Off, metadata_entities: List[str] = [], include_metadata=True) List

Returns a list of chunks of text from the document. Sensitive entities in the chunks are handled according to generator_config and generator_default.

Parameters:
  • max_chars (int = 15_000) – The maximum number of characters in each chunk.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.

  • generator_default (PiiState = PiiState.Off) – The default redaction used for all types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.

  • include_metadata (bool = True) – If True, the metadata is included in the chunk.

Returns:

A list of strings containing the chunks of text.

Return type:

List[str]
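To illustrate how generator_config and generator_default combine, here is a hedged sketch that synthesizes two entity types and redacts everything else. The import path for PiiState and the entity type labels "NAME_GIVEN" and "NAME_FAMILY" are assumptions made for illustration; chunks_with_synthesized_names is a hypothetical helper name.

```python
from enum import Enum
from typing import List

try:
    # Assumed import path for the library's enum; adjust if it differs.
    from tonic_textual.enums.pii_state import PiiState
except ImportError:
    class PiiState(str, Enum):
        """Stand-in with the three documented values, for illustration only."""
        Redaction = "Redaction"
        Synthesis = "Synthesis"
        Off = "Off"


def chunks_with_synthesized_names(parsed_file, max_chars: int = 2000) -> List[str]:
    """Return text chunks with names synthesized and other entities redacted."""
    # The entity type labels below are illustrative assumptions.
    generator_config = {
        "NAME_GIVEN": PiiState.Synthesis,
        "NAME_FAMILY": PiiState.Synthesis,
    }
    return parsed_file.get_chunks(
        max_chars=max_chars,
        generator_config=generator_config,
        generator_default=PiiState.Redaction,  # applies to all other types
        include_metadata=False,
    )
```

A smaller max_chars than the default is used here only to show the parameter; the right value depends on the downstream consumer of the chunks.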

get_entities(generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, allow_overlap: bool = False) List[SingleDetectionResult]

Returns a list of entities in the document. The entities are filtered according to generator_config and generator_default.

Parameters:
  • generator_default (PiiState) – The default redaction used for all types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.

  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.

Returns:

A list of the detected entities. Each item in the list contains the entity type, source start index, source end index, the entity text, and the replacement text.

Return type:

List[SingleDetectionResult]

get_markdown(generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Off) str

Returns the file in Markdown format, redacted or synthesized based on generator_config and generator_default.

Parameters:
  • generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.

  • generator_default (PiiState = PiiState.Off) – The default redaction used for all types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.

Returns:

The file in markdown format, redacted or synthesized based on generator_config and generator_default.

Return type:

str
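As a small sketch, the helper below writes the Markdown for one parsed file to disk. Any keyword arguments are passed straight through to get_markdown; with none given, the default from the signature (PiiState.Off) applies. save_markdown is a hypothetical helper name.

```python
from pathlib import Path


def save_markdown(parsed_file, out_path: str, **generator_options) -> Path:
    """Write the Markdown rendering of a parsed file to out_path.

    generator_options may carry generator_config and generator_default.
    """
    markdown = parsed_file.get_markdown(**generator_options)
    target = Path(out_path)
    # UTF-8 is chosen here as a safe default for the output file.
    target.write_text(markdown, encoding="utf-8")
    return target
```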

get_tables() List[Table]

Returns a list of tables found in the document. This applies to CSV, XLSX, PDF, and image files.

Returns:

A list of the tables found in the document.

Return type:

List[Table]

is_sensitive(sensitive_entity_types: List[str], start: int = 0, end: int = -1) bool

Returns True if the element contains sensitive data, False otherwise.

Parameters:
  • sensitive_entity_types (List[str]) – A list of sensitive entity types to check for.

  • start (int = 0) – The start index to check for sensitive data.

  • end (int = -1) – The end index to check for sensitive data.

Returns:

True if the element contains sensitive data, False otherwise.

Return type:

bool
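One possible use of is_sensitive is to pick out the files that contain particular entity types, sketched below. The helper name files_containing is hypothetical, and the entity type labels a caller passes (for example "EMAIL_ADDRESS") are assumed to match the labels that the service reports.

```python
from typing import Iterable, List


def files_containing(parsed_files: Iterable, entity_types: List[str]) -> List[str]:
    """Return describe() text for each file that contains the given entity types."""
    flagged = []
    for parsed_file in parsed_files:
        # start and end keep their documented defaults (0 and -1).
        if parsed_file.is_sensitive(entity_types):
            flagged.append(parsed_file.describe())
    return flagged
```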

class tonic_textual.classes.parse_api_responses.file_parse_results_diff.FileParseDiffAction(value)

Enum that stores the possible states of a file parse result diff.

Added = 1

The file was added, so it is new.

Deleted = 2

The file was deleted.

Modified = 3

The file was modified.

NonModified = 4

The file was not modified.

class tonic_textual.classes.parse_api_responses.file_parse_results_diff.FileParseResultsDiff(status: FileParseDiffAction, file: FileParseResult)

Stores a file parse result and its diff action.

Parameters:
  • status (FileParseDiffAction) – The action of the file parse result.

  • file (FileParseResult) – The file parse result.

deconstruct() Tuple[FileParseDiffAction, FileParseResult]

Returns the status and the file parse result of the diff.

describe() str

Returns the status and the file path of the diff as a string.