Sends files from a Google Cloud Storage (GCS) bucket to the Google Cloud Document AI v1 API for asynchronous (offline) processing. The output is delivered to the same bucket as JSON files containing the OCRed text and additional data.

dai_async(
  files,
  dest_folder = NULL,
  bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
  proj_id = get_project_id(),
  proc_id = Sys.getenv("DAI_PROCESSOR_ID"),
  proc_v = NA,
  skip_rev = "true",
  loc = "eu",
  token = dai_token()
)

Arguments

files

a vector or list of PDF filepaths in a GCS bucket. Filepaths must include all parent bucket folder(s) but not the bucket name

dest_folder

the name of the GCS bucket subfolder where you want the JSON output

bucket

the name of the GCS bucket where the files to be processed are located

proj_id

a GCS project id

proc_id

a Document AI processor id

proc_v

one of: 1) a processor version name, 2) "stable" for the latest processor version from the stable channel, or 3) "rc" for the latest processor version from the release candidate channel.

skip_rev

whether to skip human review; "true" or "false"

loc

a two-letter region code; "eu" or "us"

token

an access token generated by dai_auth() or another auth function

Value

A list of HTTP responses

Details

Requires a Google Cloud access token and some configuration of the .Renviron file; see the package vignettes for details. Currently, a dai_async() call can contain a maximum of 50 files (a multi-page PDF counts as one file). You cannot have more than 5 batch requests and 10,000 pages undergoing processing at any one time. The maximum PDF document length is 2,000 pages. With long PDF documents, Document AI divides the JSON output into separate files ('shards') of 20 pages each. If you want longer shards, use dai_tab_async(), which accesses a different API endpoint that allows shards of up to 100 pages.
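
Because of the 50-file cap, a larger collection must be submitted in batches. A minimal sketch in base R, assuming my_files is a character vector of bucket filepaths; the batch size comes from the limit above, while the pause between submissions is a guess rather than a package feature:

# split the filepaths into chunks of at most 50
batches <- split(my_files, ceiling(seq_along(my_files) / 50))

responses <- vector("list", length(batches))
for (i in seq_along(batches)) {
  responses[[i]] <- dai_async(batches[[i]], dest_folder = "processed")
  # crude pause so that successive batch requests do not pile up;
  # adjust (or poll for completion) to stay within the concurrency limits
  Sys.sleep(60)
}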

Examples

if (FALSE) {
# with daiR configured on your system, several parameters are automatically provided,
# and you can pass simple calls, such as:
dai_async("my_document.pdf")

# NB: Include all parent bucket folders (but not the bucket name) in the filepath:
dai_async("for_processing/pdfs/my_document.pdf")

# Bulk process by passing a vector of filepaths in the files argument:
dai_async(my_files)

# Specify a bucket subfolder for the JSON output:
dai_async(my_files, dest_folder = "processed")
}
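
A fuller sketch of a bulk run, from listing the bucket contents to checking on the job afterwards. It assumes the googleCloudStorageR package is installed and authorized against the same bucket, and that your installed version of daiR provides the dai_status() helper for inspecting responses from dai_async(); treat it as an illustration rather than canonical usage.

library(googleCloudStorageR)

# list the bucket contents and keep only the PDFs in the processing folder
contents <- gcs_list_objects(bucket = Sys.getenv("GCS_DEFAULT_BUCKET"))
my_files <- grep("^for_processing/.*\\.pdf$", contents$name, value = TRUE)

# submit for asynchronous processing, directing the JSON output to a subfolder
resp <- dai_async(my_files, dest_folder = "processed")

# check on the job later using the HTTP responses returned above
dai_status(resp)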