This function sends files from a Google Cloud Storage bucket to the Google Document AI v1 API for asynchronous (offline) processing. The output is delivered to the bucket as JSON files.
dai_async(
  files,
  dest_folder = NULL,
  bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
  proj_id = get_project_id(),
  proc_id = Sys.getenv("DAI_PROCESSOR_ID"),
  skip_rev = "true",
  loc = "eu",
  token = dai_token()
)
| Argument | Description |
|---|---|
| files | A vector or list of PDF filepaths in a Google Cloud Storage bucket. Filepaths must include all parent folders but not the bucket name itself. |
| dest_folder | The name of the bucket subfolder where you want the JSON output. |
| bucket | The name of the Google Cloud Storage bucket where the files to be processed are located. |
| proj_id | A Google Cloud project ID. |
| proc_id | A Document AI processor ID. |
| skip_rev | Whether to skip human review; "true" or "false". |
| loc | A two-letter region code; "eu" or "us". |
| token | An authentication token; defaults to the output of dai_token(). |
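The defaults for bucket and proc_id are read from environment variables (see the usage above). As a minimal sketch, you can set them once per session as below; the bucket and processor values are placeholders, and for a persistent setup you would put the same variables in your .Renviron instead.

# Sketch: set the environment variables that dai_async() reads its defaults from.
# The values shown here are hypothetical.
Sys.setenv(
  GCS_DEFAULT_BUCKET = "my-bucket",
  DAI_PROCESSOR_ID   = "my-processor-id"
)

# With these set, bucket and proc_id can be omitted from the call:
# dai_async("for_processing/pdfs/my_document.pdf")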
A list of HTTP responses.
Requires a Google Cloud access token (google_token) and a certain amount of configuration in RStudio; see the package vignettes for details. For long PDF documents, the JSON output is divided into separate files (shards) of 20 pages each. The maximum PDF document length is 2,000 pages, and the maximum number of pages in active processing is 10,000. The function waits 10 seconds between each document submission, so I recommend using RStudio's "Jobs" functionality if you are processing a large number of files.
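To illustrate the point about large batches, here is a minimal sketch of submitting files in smaller chunks so that earlier jobs can clear active processing before more pages are added. The my_files vector, the chunk size, and the pause length are assumptions for illustration, not part of the package.

# Sketch: submit a large vector of GCS filepaths in chunks of 50.
chunks <- split(my_files, ceiling(seq_along(my_files) / 50))

responses <- list()
for (i in seq_along(chunks)) {
  responses[[i]] <- dai_async(chunks[[i]], dest_folder = "processed")
  # Optional pause between chunks to stay well under the 10,000-page
  # active-processing ceiling.
  Sys.sleep(300)
}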
NOTE: The function in its current form is a placeholder for a future function that allows for real batch processing (as opposed to an iterated single submission).
if (FALSE) {
  # With daiR configured on your system, several parameters are provided
  # automatically, and you can pass simple calls, such as:
  dai_async("my_document.pdf")

  # NB: Include all parent bucket folders (but not the bucket name) in the filepath:
  dai_async("for_processing/pdfs/my_document.pdf")

  # Bulk process by passing a vector of filepaths in the files argument:
  dai_async(my_files)

  # Specify a bucket subfolder for the json output:
  dai_async(my_files, dest_folder = "processed")
}
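Once processing finishes, the JSON shards sit in the bucket and can be pulled down with googleCloudStorageR and parsed with jsonlite. The following is a sketch under the assumption that the output landed in a "processed" folder; the actual object names are generated by Document AI and will differ on your system.

# Sketch: list the JSON output in the bucket, download one shard, and
# read the extracted text. Object names are illustrative only.
library(googleCloudStorageR)
library(jsonlite)

objs <- gcs_list_objects(bucket = Sys.getenv("GCS_DEFAULT_BUCKET"))
json_files <- grep("^processed/.*\\.json$", objs$name, value = TRUE)

gcs_get_object(json_files[1],
               bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
               saveToDisk = "shard1.json")

# Document AI v1 output stores the full extracted text in the `text` field.
doc <- fromJSON("shard1.json")
cat(substr(doc$text, 1, 500))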