This function sends files from a Google Cloud Storage bucket to the Google Document AI v1 API for asynchronous (offline) processing. The output is delivered to your bucket as JSON files.

dai_async(
  files,
  dest_folder = NULL,
  bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
  proj_id = get_project_id(),
  proc_id = Sys.getenv("DAI_PROCESSOR_ID"),
  skip_rev = "true",
  loc = "eu",
  token = dai_token()
)
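
As the defaults above indicate, the bucket and processor can be supplied through the environment variables GCS_DEFAULT_BUCKET and DAI_PROCESSOR_ID, so that routine calls only need the filepaths. A minimal sketch, using hypothetical bucket and processor values:

# Hypothetical values; substitute your own bucket name and processor id
Sys.setenv(GCS_DEFAULT_BUCKET = "my-dai-bucket")
Sys.setenv(DAI_PROCESSOR_ID = "1234567890abcdef")

# With these set, bucket and proc_id need not be passed explicitly
dai_async("my_document.pdf")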

Arguments

files

A vector or list of PDF filepaths in a Google Cloud Storage bucket. Filepaths must include all parent bucket folder(s) except the bucket name.

dest_folder

The name of the Google Cloud Storage bucket subfolder where you want the JSON output.

bucket

The name of the Google Cloud Storage bucket where the files to be processed are located.

proj_id

A Google Cloud project id.

proc_id

A Document AI processor id.

skip_rev

Whether to skip human review; "true" or "false".

loc

A two-letter region code; "eu" or "us".

token

An authentication token generated by dai_auth() or another auth function.
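
A fully explicit call combining the arguments above might look as follows; the bucket, project, and processor ids are hypothetical:

dai_async(
  files = c("for_processing/pdfs/doc1.pdf", "for_processing/pdfs/doc2.pdf"),
  dest_folder = "processed",
  bucket = "my-dai-bucket",       # hypothetical bucket name
  proj_id = "my-gcp-project",     # hypothetical project id
  proc_id = "1234567890abcdef",   # hypothetical processor id
  skip_rev = "true",
  loc = "eu",
  token = dai_token()
)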

Value

A list of HTTP responses.
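
Because the return value is a list of HTTP responses, a quick way to check that all submissions were accepted is to inspect their status codes. A minimal sketch, assuming the responses are httr response objects:

# Submit a batch and tabulate the response status codes
resps <- dai_async(my_files)
codes <- sapply(resps, httr::status_code)
table(codes)  # all 200s means every document was accepted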

Details

Requires a Google Cloud access token (google_token) and a certain amount of configuration in RStudio; see the package vignettes for details. For long PDF documents, the JSON output is divided into separate files ("shards") of 20 pages each. The maximum length of a single PDF document is 2,000 pages, and the maximum number of pages in active processing is 10,000. The function waits 10 seconds between each document submission, so if you are processing many files, I recommend using RStudio's "Jobs" functionality (see the sketch below).
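
One way to use the "Jobs" functionality is to put the call in a small script and launch it with rstudioapi::jobRunScript(). The script name and its contents below are hypothetical:

# submit_batch.R (hypothetical script) might contain:
#   library(daiR)
#   my_files <- readRDS("my_files.rds")
#   dai_async(my_files, dest_folder = "processed")

# Launch the script as a background job from the console:
rstudioapi::jobRunScript("submit_batch.R", name = "DAI batch submission")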

NOTE: The function in its current form is a placeholder for a future function that allows for real batch processing (as opposed to an iterated single submission).

Examples

if (FALSE) {
# With daiR configured on your system, several parameters are automatically
# provided, and you can pass simple calls, such as:
dai_async("my_document.pdf")

# NB: Include all parent bucket folders (but not the bucket name) in the filepath:
dai_async("for_processing/pdfs/my_document.pdf")

# Bulk process by passing a vector of filepaths in the files argument:
dai_async(my_files)

# Specify a bucket subfolder for the json output:
dai_async(my_files, dest_folder = "processed")
}
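
If you first need to build the files vector from the bucket contents, one possibility is to list the bucket with googleCloudStorageR and keep only the PDF filepaths. A sketch, assuming the default bucket and a hypothetical "for_processing" folder:

if (FALSE) {
# List the bucket contents and keep only PDF filepaths
contents <- googleCloudStorageR::gcs_list_objects(Sys.getenv("GCS_DEFAULT_BUCKET"))
my_files <- grep("^for_processing/.*\\.pdf$", contents$name, value = TRUE)

dai_async(my_files, dest_folder = "processed")
}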