Last updated 10 February 2024

daiR has many different functionalities, but the core one is to provide access to the Google Document AI API so you can OCR your documents. That procedure is fairly straightforward: you make a processing call with either dai_sync() or dai_async() – depending on whether you want synchronous or asynchronous processing – and then you retrieve the plaintext with get_text().

Synchronous processing

The quickest and easiest way to OCR with Document AI (DAI) is through synchronous processing. You simply pass an image file or a pdf (of up to 5 pages) to the processor and get the result into your R environment within seconds.

We can try with a sample pdf from the CIA’s Freedom of Information Act Electronic Reading Room:

library(daiR)
setwd(tempdir())
url <- "https://www.cia.gov/readingroom/docs/AGH%2C%20LASLO_0011.pdf"
download.file(url, "CIA.pdf")

We send it to Document AI with dai_sync() and store the HTTP response in an object.

resp <- dai_sync("CIA.pdf")

We then pass the response object to get_text(), which extracts the text identified by Document AI.

get_text(resp)

What if we have many documents? dai_sync() is not vectorized, but you can iterate with it over vectors of filepaths. For the sake of illustration, let’s download a second PDF.

url <- "https://www.cia.gov/readingroom/docs/1956-11-26.pdf"
download.file(url, "CIA2.pdf")

We now want to apply the functions dai_sync() and get_text() iteratively over the files CIA.pdf and CIA2.pdf. In such cases you probably want to preserve the extracted text in .txt files along the way. You can do this by setting the parameter save_to_file in get_text() to TRUE. This function also has a parameter outfile_stem which allows you to specify the namestem of the .txt file. We can get the stem from each file by combining fs::path_ext_remove() and basename().

I further recommend adding a small pause so as not to run into rate limit issues. A print statement is also useful for keeping track of where you are in case of an error or interruption. A sample script might thus look like this:

## NOT RUN
myfiles <- list.files(pattern = "\\.pdf$")

for (i in seq_along(myfiles)) {
  print(paste("Processing file", i, "of", length(myfiles)))
  resp <- dai_sync(myfiles[i])
  stem <- fs::path_ext_remove(basename(myfiles[i]))
  get_text(resp, save_to_file = TRUE, outfile_stem = stem)
  Sys.sleep(2)
}

If you now run list.files(), you will see that the code generated two new files, CIA.txt and CIA2.txt.

Synchronous processing is very convenient, but it has two limitations. One is that OCR accuracy may be slightly reduced compared with asynchronous processing, because dai_sync() converts the source file to a lightweight, grayscale image before passing it to DAI. The other is scaling: if you have a large PDF or many files, it is usually preferable to process them asynchronously.

Asynchronous processing

In asynchronous (offline) processing, you don’t send DAI the actual document, but rather its location on Google Storage so that DAI can process it “in its own time”. While slightly slower than synchronous OCR, it allows for batch processing and makes the process less vulnerable to interruptions (like laptop battery death or inadvertent closing of your console). Unlike dai_sync(), dai_async() is vectorized, so you can send multiple files with a single call.

The first step is to use the package googleCloudStorageR to upload the source file(s) to a Google Storage bucket where DAI can find them. The following assumes that you have already configured Google Storage and set up a default bucket, as described in the vignette on Google Storage.

Let’s upload our two CIA documents. I am assuming the filepaths are still stored in the vector myfiles we created earlier.
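
Here is a minimal sketch of the upload step. It assumes a default bucket has already been set, and it uses the name argument of gcs_upload() so that each file is stored in the bucket under its basename:

library(googleCloudStorageR)
library(purrr)

## Upload each local file, storing it under its basename in the default bucket
map(myfiles, ~ gcs_upload(.x, name = basename(.x)))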

Let’s check that our files made it safely:

contents <- gcs_list_objects()
contents

We can now use dai_async() to tell Document AI to process these files. At this stage it is crucial to know that dai_async() takes as its main argument the filenames in the bucket, NOT the filenames or filepaths on your local drive. In this particular example there is no difference, but that is not always the case. A common error scenario is to upload files from a vector of full local filepaths (e.g. files <- c("/path/to/file1.pdf", "/path/to/file2.pdf")), saving them in the bucket under their basenames (file1.pdf and file2.pdf). If you then pass the same vector to dai_async(), the processing fails because Document AI cannot find /path/to/file1.pdf in the bucket, only file1.pdf.

It is therefore good practice to use the output of gcs_list_objects() to create the vector that you pass to dai_async(). From the vignette on Google Storage, we remember that if we store the output of gcs_list_objects() in a dataframe named contents, the filenames will be in contents$name.

resp <- dai_async(contents$name)

If your call returned “status: 200”, it was accepted by the API. Note that this does NOT mean that the processing itself was successful, only that the request went through. For example, if there are errors in your filepaths, DAI will create empty JSON files in the folder you provided. If you see JSON files of around 70 bytes each in the destination folder, you know there was something wrong with your filenames. Other things too can cause the processing to fail, for example a corrupt file or a format that DAI cannot handle.

You can check the status of a job with dai_status(). Just pass the response object from your dai_async() call into the parentheses, like so:
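
dai_status(resp)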

This will tell you whether the job is “RUNNING”, “FAILED”, or “SUCCEEDED”. It won’t tell you how much time remains, but in my experience, processing takes about 5-20 seconds per page. To find out when it’s done, you can either rerun dai_status() till it says “SUCCEEDED”, or you can use the function dai_notify(), which will check the status for you in the background and beep when the job is finished.
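
For example, with the response object from our dai_async() call still stored in resp:

dai_notify(resp)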

When the processing is done, there will be JSON output files waiting for you in the bucket. Let’s take a look.

contents <- gcs_list_objects()
contents

Output file names look cryptic, but there’s a logic to them, namely: "<job_number>/<document_number>/<filename>-<shard_number>.json". The output for CIA.pdf will thus take the form "<job_number>/0/CIA-0.json", with <job_number> changing from one processing call to the next.

These JSON files contain the extracted text plus a wealth of other data, such as the location of each word on the page and a binary version of the original image. In order to get to this information we need to download them to our local drive. Because these output files have unpredictable names, it is often easiest to simply search for all files ending in *.json using grep() or stringr::str_detect().

jsons <- grep("\\.json$", contents$name, value = TRUE)

We can then download them with gcs_get_object().

map(jsons, ~ gcs_get_object(.x, saveToDisk = basename(.x)))

If you now run list.files() again, you should see CIA-0.json and CIA2-0.json in your working directory.
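
As a quick sanity check, you can inspect their sizes with base R; as noted above, output files of only around 70 bytes suggest that the processing of the corresponding document failed:

json_files <- list.files(pattern = "\\.json$")
## Suspiciously small files (~70 bytes) indicate failed processing
file.size(json_files)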

To get the text from a DAI JSON file, we can use get_text(), but we have to specify type = "async" so that the function knows it is being served a JSON file and not a response object.

get_text("CIA-0.json", type = "async")

To get the text from several JSON files, we just iterate over them, setting save_to_file to TRUE. Unlike in the dai_sync() example earlier, we don’t need to specify outfile_stem, because get_text() has the names of the JSON files and uses their stems to create the .txt files.

local_jsons <- list.files(pattern = "\\.json$")
map(local_jsons, ~ get_text(.x, type = "async", save_to_file = TRUE))

Running list.files() one last time, you should have two new files named CIA-0.txt and CIA2-0.txt.

Large batches

Although dai_async() takes batches of files, it is constrained by Google’s rate limits. Currently, a dai_async() call can contain a maximum of 50 files (a multi-page PDF counts as one file), and you cannot have more than 5 batch requests and 10,000 pages undergoing processing at any one time.

Therefore, if you’re looking to process a large batch, you need to spread the dai_async() calls out over time. While you can split up your corpus into sets of 50 files and batch process those, the simplest solution is to make a function that sends files off individually with a small wait in between. Say we have a vector called big_batch containing thousands of filenames. First we would make a function like this:

process_slowly <- function(file) {
  dai_async(file)
  Sys.sleep(10)
}

Then we would iterate it over our file vector:

## NOT RUN
map(big_batch, process_slowly)

This will hold up your console for a while, so it may be worth doing in the background as an RStudio job.
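
One way to do this, sketched here on the assumption that you have saved the code above (including the necessary library() calls) in a script called process_batch.R and that you are working in RStudio with the rstudioapi package installed:

## NOT RUN
## Launch the script as a background RStudio job, leaving the console free
rstudioapi::jobRunScript("process_batch.R")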

Finding the optimal wait time for Sys.sleep() may require some trial and error. As a rule of thumb, it should approximate the time it takes for DAI to process one of your files. This, in turn, depends on the size of the files: a 100-page PDF will take a lot longer to process than a single-page one. In my experience, a 10-second interval is ample for a batch of single-page PDFs; multi-page PDFs require proportionally more time. If your files vary in size, calibrate the wait time to the largest file, or you may get 429 (“Too Many Requests”) errors halfway through the iteration.
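
One way to avoid hand-tuning is to scale the pause to the length of each file. Here is a rough sketch; it assumes the pdftools package is installed, and the two seconds per page used here is only a guess that may need adjusting:

## NOT RUN
## A size-aware variant that pauses longer after longer documents
process_slowly <- function(file) {
  n_pages <- pdftools::pdf_length(file)
  dai_async(file)
  Sys.sleep(max(10, 2 * n_pages))
}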

Although this procedure is relatively slow, it need not add much to the overall processing time. DAI starts processing the first files it receives right away, so when your loop ends, DAI will be mostly done with the OCR as well.

Merging shards

If you have long PDFs, DAI will break the output into shards, meaning that, for a single PDF file, you may get back multiple JSON files named *-1.json, *-2.json, etc.

To weave the text back together again, you can use daiR’s merge_shards() function. It works on .txt files, not JSON files, so you need to extract the text from the JSONs first. You also need to keep the name stem – turning document-1.json into document-1.txt and so forth – so that merge_shards() knows which pieces belong together. This is the default behaviour of get_text(), so as long as you don’t touch the outfile_stem parameter, you should be fine.

Here is a sample workflow:

## NOT RUN
shards <- c("longdoc-1.json", "longdoc-2.json", "longdoc-3.json")
map(shards, ~ get_text(.x, type = "async", save_to_file = TRUE))
merge_shards()