Overview

Textual’s Pipeline API allows you to extract text and entity metadata from the files that Textual pipelines process.

Creating and deleting a pipeline

To create a pipeline, use the create_pipeline method. To delete a pipeline, use the delete_pipeline method.

from tonic_textual.parse_api import TonicTextualParse

textual = TonicTextualParse("<TONIC-TEXTUAL-URL>")
pipeline = textual.create_pipeline("pipeline name")
textual.delete_pipeline(pipeline.id)

Getting pipelines

The Pipeline class represents a pipeline in Textual. A pipeline is a collection of jobs that process files and extract text and entities from them. To get the list of all available pipelines, use the get_pipelines method.

from tonic_textual.parse_api import TonicTextualParse

textual = TonicTextualParse("<TONIC-TEXTUAL-URL>")
pipelines = textual.get_pipelines()
latest_pipeline = pipelines[-1]
print(latest_pipeline.describe())

This produces results similar to the following:

--------------------------------------------------------
Name: pipeline demo
ID: 056e6cc7-0a1d-3ab4-5e61-919fb5475b31
--------------------------------------------------------

Alternatively, use the get_pipeline_by_id method to get a specific pipeline.

pipeline_id = '056e6cc7-0a1d-3ab4-5e61-919fb5475b31'
pipeline = textual.get_pipeline_by_id(pipeline_id)

Uploading files

To upload a file to a pipeline, use the upload_file method.

pipeline = textual.create_pipeline("pipeline name")

# Read the file in binary mode, then upload its bytes under the given file name
file_path = "<PATH-TO-FILE>"
with open(file_path, "rb") as file_content:
    file_bytes = file_content.read()
pipeline.upload_file(file_bytes, "<FILE-NAME>")

Enumerating files in a pipeline

A pipeline’s enumerate_files method returns an enumerator over all of the files that the pipeline processed. By default, it enumerates the files from the most recent job run of the pipeline; to enumerate the files from a specific job run, pass the job run ID as an argument.

for file in pipeline.enumerate_files():
    print(file.describe())
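
To target a specific job run instead of the most recent one, you can pass the run identifier. This is a minimal sketch under the assumption that enumerate_files accepts the job run ID as its first argument; check the SDK reference for the exact parameter name.

# "<JOB-RUN-ID>" is a placeholder; the argument position is an assumption
for file in pipeline.enumerate_files("<JOB-RUN-ID>"):
    print(file.describe())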

Parsing documents

Files can also be parsed in a one-off, on-demand fashion without using pipelines. In this approach, you send a file to the Textual service and receive a parsed result in return.

Note that files should be read using the ‘rb’ access mode, which opens the file for reading in binary format. You can optionally set a timeout in the parse_file call to stop waiting for the parsed result after a given number of seconds. You can also set the TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS environment variable to enforce an SDK-wide timeout.
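
The sketch below shows an on-demand parse of a local file. It assumes that parse_file accepts the file bytes followed by a file name, that the timeout is an optional keyword argument, and that the returned object behaves like the FileParseResult described below; verify these details against the SDK reference.

from tonic_textual.parse_api import TonicTextualParse

textual = TonicTextualParse("<TONIC-TEXTUAL-URL>")

# Read the file in binary ('rb') mode so the raw bytes are sent to the service
with open("<PATH-TO-FILE>", "rb") as file_content:
    file_bytes = file_content.read()

# Argument order and the timeout keyword are assumptions; check the SDK reference
parsed = textual.parse_file(file_bytes, "<FILE-NAME>", timeout=60)
print(parsed.get_markdown())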

In addition to reading files from your local system, you can also pass a bucket and key pair to parse files stored in S3. This uses the boto3 library to fetch the file from S3, and therefore requires that the correct AWS credentials are set up. Usage is similar to the above.
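
A minimal sketch of the S3 variant follows. The method name parse_s3_file and its bucket/key argument order are assumptions, and boto3 must be able to resolve valid AWS credentials for the call to succeed; confirm the details against the SDK reference.

# Method name and argument order are assumptions; check the SDK reference
parsed = textual.parse_s3_file("<BUCKET-NAME>", "<OBJECT-KEY>")
print(parsed.get_markdown())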

Extracting text and entities

A FileParseResult object represents a parsed file and can be used to extract its text and entities. The files returned by enumerate_files are FileParseResult objects.

for file in pipeline.enumerate_files():
    markdown = file.get_markdown()      # returns the Markdown content of the file
    entities = file.get_all_entities()  # returns all entities in the file
    chunks = file.get_chunks()          # chunks the file into paragraphs and returns them
    results = file.download_results()   # downloads the results file as a UTF-8 string