Last updated 10 February 2024

Set up a Google Cloud Services account

Follow the instructions here for the GUI method or here for the command line method. See also the GCS concept cheatsheet for an overview of recommended environment variables.

Process synchronously

Pass a single-page PDF or image file to Document AI and get the output immediately:

## Not run:
library(daiR)
myfile <- "sample.pdf"
text <- get_text(dai_sync(myfile))
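
Because a synchronous call sends the file straight to the API, it can help to check the input locally first. The following is a minimal sketch, not part of daiR: `check_sync_input()` is a hypothetical helper, and its list of extensions is an assumption for illustration, so consult the Document AI documentation for the formats actually accepted.

```r
# Hypothetical helper (not a daiR function): basic local checks
# before a file is passed to dai_sync().
# The default `allowed` extensions are an assumption for illustration.
check_sync_input <- function(path,
                             allowed = c("pdf", "png", "jpg", "jpeg",
                                         "tif", "tiff", "gif", "bmp")) {
  if (!file.exists(path)) {
    stop("File not found: ", path)
  }
  ext <- tolower(tools::file_ext(path))
  if (!ext %in% allowed) {
    stop("Unexpected file type: .", ext)
  }
  # Return the path invisibly so the call can be nested
  invisible(path)
}

# Possible usage (requires a configured account, so left commented out):
# text <- get_text(dai_sync(check_sync_input("sample.pdf")))
```

The helper does not check the page count; it only catches missing files and unexpected extensions before a request is made.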

Process asynchronously

Asynchronous processing requires a configured googleCloudStorageR. Send larger batches for offline processing in three steps:

1. Upload files to your Google Cloud Storage bucket:

## Not run:
library(googleCloudStorageR)
library(purrr)
my_pdfs <- c("sample1.pdf", "sample2.pdf")
map(my_pdfs, ~ gcs_upload(.x, name = basename(.x)))
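
Since the upload uses `basename()` for the object name, two local files with the same file name in different folders would end up under the same name in the bucket, and the later upload would most likely replace the earlier one. The sketch below shows a purely local check for such clashes; the paths are made up for illustration.

```r
# Hypothetical local paths, for illustration only
my_pdfs <- c("scans/a/sample1.pdf", "scans/b/sample2.pdf", "scans/c/sample1.pdf")

# Object names as they would appear in the bucket when name = basename(.x)
object_names <- basename(my_pdfs)

# Local files whose object name repeats an earlier one
clashes <- my_pdfs[duplicated(object_names)]
clashes
```

If `clashes` is not empty, renaming the files or passing distinct values to `name` avoids the collision.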

2. Tell Document AI to process them:

## Not run:
resp <- dai_async(my_pdfs)
dai_status(resp) # to check the progress

The output will be delivered to the same bucket as JSON files.

3. Download the JSON output and extract the text:

## Not run:
# Get a dataframe with the bucket contents
contents <- gcs_list_objects()
# Get the names of the JSON output files
jsons <- grep("\\.json$", contents$name, value = TRUE)
# Download them
map(jsons, ~ gcs_get_object(.x, saveToDisk = basename(.x)))
# Extract the text from the JSON files and save it as .txt files
local_jsons <- basename(jsons)
map(local_jsons, ~ get_text(.x, type = "async", save_to_file = TRUE))
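
The filtering step can be tried offline on a mock bucket listing. This is a sketch under stated assumptions: the data frame below only imitates the `name` column returned by `gcs_list_objects()`, and the folder and file names in it are invented. It uses an anchored regular expression so that only names ending in `.json` are kept.

```r
# Mock of a bucket listing; all object names are made up for illustration
contents <- data.frame(
  name = c("sample1.pdf",
           "sample2.pdf",
           "1234567890/0/sample1-0.json",
           "1234567890/1/sample2-0.json",
           "notes.json.bak"),
  stringsAsFactors = FALSE
)

# Literal dot followed by "json" at the very end of the name
jsons <- grep("\\.json$", contents$name, value = TRUE)

# Names the files would have locally when saved with basename()
local_jsons <- basename(jsons)
local_jsons
```

The anchored pattern skips `notes.json.bak`, which a looser pattern such as `"json"` would have picked up.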

Assuming your PDFs were named sample1.pdf and sample2.pdf, there will now be two files named sample1-0.txt and sample2-0.txt in your working directory.
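
To confirm that the text files were written, the expected names can be derived from the JSON file names and checked against the working directory. This is a minimal sketch that assumes, in line with the example above, that each text file takes the name of its JSON file with the extension changed to `.txt`; the JSON names are hypothetical.

```r
# Hypothetical local JSON file names, as left by the download step
local_jsons <- c("sample1-0.json", "sample2-0.json")

# Assumed naming rule: same stem, .txt extension
expected_txt <- sub("\\.json$", ".txt", local_jsons)

# Any expected text files not found in the working directory
missing_txt <- expected_txt[!file.exists(expected_txt)]
missing_txt
```

An empty `missing_txt` suggests every JSON file produced a text file.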