Sends files from a Google Cloud Storage (GCS) bucket to the Google Document AI v1 API for asynchronous (offline) processing. The output is delivered to the same bucket as JSON files containing the OCRed text and additional data.
dai_async(
files,
dest_folder = NULL,
bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
proj_id = get_project_id(),
proc_id = Sys.getenv("DAI_PROCESSOR_ID"),
proc_v = NA,
skip_rev = "true",
loc = "eu",
token = dai_token()
)
files: a vector or list of PDF filepaths in a GCS Storage bucket. Filepaths must include all parent bucket folder(s) except the bucket name.
dest_folder: the name of the GCS Storage bucket subfolder where you want the JSON output.
bucket: the name of the GCS Storage bucket where the files to be processed are located.
proj_id: a GCS project id.
proc_id: a Document AI processor id.
proc_v: one of 1) a processor version name, 2) "stable" for the latest processor from the stable channel, or 3) "rc" for the latest processor from the release candidate channel.
skip_rev: whether to skip human review; "true" or "false".
loc: a two-letter region code; "eu" or "us".
token: an access token generated by dai_auth() or another auth function.
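For orientation, a fully specified call might look like the sketch below. The bucket, project, and processor values are hypothetical placeholders, not real identifiers:

# A fully specified call (all identifier values are hypothetical placeholders):
resp <- dai_async(
  files = "for_processing/pdfs/my_document.pdf",
  dest_folder = "processed",
  bucket = "my-dai-bucket",
  proj_id = "my-project-12345",
  proc_id = "abc123def456",
  proc_v = "stable",
  skip_rev = "true",
  loc = "eu",
  token = dai_token()
)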
Returns a list of HTTP responses.
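Since the returned objects are httr responses, you can run a quick sanity check on them. A minimal sketch, assuming the httr package and a single-file call:

# Check that the batch request was accepted (assumes httr is installed):
library(httr)
resp <- dai_async("my_document.pdf")
status_code(resp[[1]])  # 200 indicates the request was accepted

Note that a 200 only means the job was submitted; the processing itself runs offline. Recent versions of daiR also include dai_status() for checking whether an async job has finished; consult your installed version's help.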
Requires a GCS access token and some configuration of the .Renviron file; see the package vignettes for details. Currently, a dai_async() call can contain a maximum of 50 files (but a multi-page PDF counts as one file). You cannot have more than 5 batch requests and 10,000 pages undergoing processing at any one time. Maximum PDF document length is 2,000 pages. With long PDF documents, Document AI divides the JSON output into separate files ('shards') of 20 pages each. If you want longer shards, use dai_tab_async(), which accesses another API endpoint that allows for shards of up to 100 pages.
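To stay within the 50-file-per-call limit when processing a large document set, you can chunk the file vector and issue one call per chunk. The following is a minimal sketch, not part of the package API; my_files and the sleep interval are illustrative assumptions:

# Split a filepath vector into chunks of at most 50 and submit one call each:
chunks <- split(my_files, ceiling(seq_along(my_files) / 50))
responses <- list()
for (i in seq_along(chunks)) {
  responses[[i]] <- dai_async(chunks[[i]], dest_folder = "processed")
  Sys.sleep(60)  # crude pacing to help respect the 5-concurrent-batch limit
}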
if (FALSE) { # \dontrun{
# with daiR configured on your system, several parameters are automatically provided,
# and you can pass simple calls, such as:
dai_async("my_document.pdf")
# NB: Include all parent bucket folders (but not the bucket name) in the filepath:
dai_async("for_processing/pdfs/my_document.pdf")
# Bulk process by passing a vector of filepaths in the files argument:
dai_async(my_files)
# Specify a bucket subfolder for the json output:
dai_async(my_files, dest_folder = "processed")
} # }
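After the job completes, the JSON output lands in the bucket (in dest_folder, if you specified one). A minimal sketch for retrieving a result and extracting its text, assuming the googleCloudStorageR package; the output filename below is hypothetical, as actual names depend on the job:

# Download one JSON output file and extract its text (filename is hypothetical):
library(googleCloudStorageR)
gcs_get_object("processed/my_document-0.json", saveToDisk = "my_document.json")
text <- text_from_dai_file("my_document.json")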