Last updated 10 February 2024
Follow the instructions here for the GUI method or here for the command line method. See also the GCS concept cheatsheet for an overview of recommended environment variables.
Pass a single-page pdf or image file to Document AI and get the output immediately:
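A minimal sketch of the synchronous route, assuming authentication and a default processor are already configured through environment variables. `ocr_page` is a hypothetical wrapper written for illustration, not a daiR function; `dai_sync()` and `get_text()` are the package's own functions.

```r
# Hypothetical helper (not part of daiR): OCR one single-page file synchronously.
ocr_page <- function(file) {
  stopifnot(file.exists(file))
  # Send the file to Document AI and wait for the response.
  resp <- daiR::dai_sync(file)
  # Pull the recognised text out of the response object.
  daiR::get_text(resp)
}

## Not run:
# text <- ocr_page("sample.pdf")  # "sample.pdf" is a placeholder file name
```

Wrapping the two calls in a function only makes the missing-file case fail early; calling `dai_sync()` and `get_text()` directly works the same way.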
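Batch processing requires configuration of googleCloudStorageR. Send larger batches for offline processing in three steps (upload the files to your bucket, start the processing job, download and parse the output):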
## Not run:
library(googleCloudStorageR)
library(purrr)
my_pdfs <- c("sample1.pdf", "sample2.pdf")
map(my_pdfs, ~ gcs_upload(.x, name = basename(.x)))
## Not run:
library(daiR)
resp <- dai_async(my_pdfs)
dai_status(resp) # to check the progress
The output will be delivered to the same bucket as JSON files.
## Not run:
# Get a dataframe with the bucket contents
contents <- gcs_list_objects()
# Get the names of the JSON output files
jsons <- grep("\\.json$", contents$name, value = TRUE)
# Download them
map(jsons, ~ gcs_get_object(.x, saveToDisk = basename(.x)))
# Extract the text from the JSON files and save it as .txt files
local_jsons <- basename(jsons)
map(local_jsons, ~ get_text(.x, type = "async", save_to_file = TRUE))
Assuming your PDFs were named sample1.pdf and sample2.pdf, there will now be two files named sample1-0.txt and sample2-0.txt in your working directory.
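To check the result, the text files can be read back into R. This is a sketch using base R only; it assumes the `-0.txt` suffix seen in the file names above and that the files sit in the current working directory. The names `txt_files` and `texts` are illustrative.

```r
# List the extracted text files in the working directory.
# The "-0.txt" suffix follows the file names mentioned above (an assumption
# about how every output file is named).
txt_files <- list.files(pattern = "-0\\.txt$")

# Read each file into a single string, one list element per file.
texts <- lapply(txt_files, function(f) {
  paste(readLines(f, warn = FALSE), collapse = "\n")
})
names(texts) <- txt_files
```

If no matching files are found, `texts` is simply an empty list, which is a quick signal that the download or extraction step did not run.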