Last updated 14 April 2021
Google Document AI (DAI) is a server-based OCR engine that extracts text from pdf files. Released in November 2020, it is much more powerful than static libraries such as tesseract. Short of corpus-specific, self-trained processors, DAI offers some of the best OCR capabilities currently available to the general public. At the time of writing, DAI is more expensive than Amazon’s Textract, but promises to support many more languages.
DAI is accessed through an API, but this API currently has no official R client library. This is where the daiR package comes in; it provides a light wrapper for DAI’s REST API, making it possible to submit documents to DAI from within R. In addition, daiR comes with pre- and postprocessing tools intended to make the whole text extraction process easier.
Google Document AI is closely connected with Google Storage, as the latter serves as a drop-off and pick-up point for files you want processed in DAI. An R workflow for DAI processing consists of three core steps:
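1. Upload your files to a Google Storage bucket.
2. Using daiR, tell DAI to process the files in your bucket. DAI will return its output to your Storage bucket in the form of json files.
3. Download the json files from your Storage bucket to your computer.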
A previous vignette covered the setting up of a Google Cloud service account and interacting with Google Storage. Here we pick up from where that vignette left off, and assume that the following things are in place:
To use Document AI, we need to complete a few more steps.
First, we must activate the API. Go to the Google Cloud Console and open the navigation menu on the left-hand side. Click on “APIs and services”. Then click on “Enable APIs and Services”, type “document ai” in the search field, click on “Cloud Document AI API”, and then “Enable”.
Open the navigation menu on the left again. Scroll down, almost to the bottom, till you see “Document AI” (under the group heading “Artificial intelligence”). Click on “Document AI”.
Now click the blue button labelled “Create processor”. On the next page, choose the “Document OCR” processor type. A pane should open on your right where you can choose a name for the processor. Call it what you like; the name is mainly for your own reference. Select a location (where you want your files to be processed), then click “Create”.
You should now see a page listing the processor’s Name, ID, Status and other attributes. The main thing you want here is the ID. Select it and copy it to the clipboard.
Then open your .Renviron file and add DAI_PROCESSOR_ID="<your processor id>" on a separate line. Save .Renviron and restart RStudio.
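To confirm that R can see the variable after the restart, a small check along these lines may help. It is only a sketch: `check_processor_id()` is a hypothetical helper, not part of daiR, and the commented-out `usethis::edit_r_environ()` call assumes that the usethis package is installed.

```r
# One way to open .Renviron for editing (assumes the usethis package):
# usethis::edit_r_environ()

# Hypothetical helper: stop early if the processor id is missing.
check_processor_id <- function(var = "DAI_PROCESSOR_ID") {
  id <- Sys.getenv(var, unset = "")
  if (!nzchar(id)) {
    stop(var, " is not set; add it to .Renviron and restart R.", call. = FALSE)
  }
  invisible(id)
}
```

Calling `check_processor_id()` at the top of a script returns the id invisibly, or fails with a readable message if the variable is missing.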
That’s it. If these things are in place, you can start processing right after loading the package.
A note on access tokens: Unlike some other GCS wrappers, daiR does not authenticate on startup and store access tokens in the environment. Instead it generates tokens on a per-call basis. If you prefer to generate one token per session, you can use dai_token() to store your token in an object and pass that object directly into the API call functions using the latter’s token = parameter. This also means you can use auth functions from pretty much any other GCS wrapper to generate your token.
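The one-token-per-session pattern could look roughly like this. The wrapper `ocr_with_one_token()` is a hypothetical name introduced for illustration; it assumes that daiR is installed, that the setup above is complete, and that `files` holds paths to small pdfs or images.

```r
# Sketch: reuse a single token for several synchronous calls.
# ocr_with_one_token() is a made-up wrapper, not a daiR function.
ocr_with_one_token <- function(files) {
  token <- daiR::dai_token()          # generate the token once
  lapply(files, function(f) {
    daiR::dai_sync(f, token = token)  # hand it on via `token =`
  })
}

# Usage (requires valid Google Cloud credentials):
# responses <- ocr_with_one_token(c("page1.pdf", "page2.pdf"))
```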
Now let’s try this thing out.
The quickest and easiest way to OCR with DAI is through synchronous processing. You simply pass an image file or a pdf (of up to 5 pages) to the processor and get the result into your R environment within seconds.
We can try with a sample pdf from the CIA’s Freedom of Information Act Electronic Reading Room:
setwd(tempdir())
download.file("https://www.cia.gov/readingroom/docs/AGH%2C%20LASLO_0011.pdf", destfile = "CIA.pdf", mode = "wb")
We send it to Document AI with dai_sync() and store the HTTP response in an object, for example:
response1 <- dai_sync("CIA.pdf")
Then we extract the text from the response object.
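As a rough sketch of what this step involves: the body of a Document AI processing response is json in which the recognized text sits under `document` and then `text`. The helper below is hypothetical (not a daiR function) and works on the parsed body. Obtaining that body with `httr::content()` assumes that the response object is an httr response.

```r
# Hypothetical helper: pull the text out of a parsed Document AI response.
# `parsed` is a nested list such as httr::content(response1) would return.
text_from_parsed <- function(parsed) {
  text <- parsed[["document"]][["text"]]
  if (is.null(text)) {
    stop("No text found; the request may have failed.", call. = FALSE)
  }
  text
}

# Usage (needs the httr package and a real response object):
# text <- text_from_parsed(httr::content(response1))
# cat(text)
```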
Synchronous processing is very convenient, but has two limitations. One is that OCR accuracy may be slightly reduced compared with asynchronous processing, because dai_sync() converts the source file to a lightweight, grayscale image before passing it to DAI. The other is scaling: if you have a large pdf or many files, it is usually easier to process them asynchronously.
In asynchronous (offline) processing, you don’t send DAI the actual document, but rather its location on Google Storage so that DAI can process it “in its own time”. While slower than synchronous OCR, it allows for batch processing. The dai_async() function is vectorized, so you can send multiple files with a single call. For this vignette, however, we’ll just use a single document, the same as in the previous example.
The first step is to upload the source file(s) to a Google Storage bucket where DAI can find them.
Let’s check that our file made it safely:
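A minimal sketch of the upload and the check, assuming the googleCloudStorageR package is installed and authenticated and that the bucket name is stored in the `GCS_DEFAULT_BUCKET` environment variable. The wrapper `upload_and_check()` is a hypothetical convenience function, not part of either package.

```r
# Sketch: upload one file and confirm that it shows up in the bucket.
# upload_and_check() is a made-up helper for illustration.
upload_and_check <- function(file, bucket = Sys.getenv("GCS_DEFAULT_BUCKET")) {
  object_name <- basename(file)
  googleCloudStorageR::gcs_upload(file, bucket = bucket, name = object_name)
  contents <- googleCloudStorageR::gcs_list_objects(bucket = bucket)
  object_name %in% contents$name  # TRUE if the upload is visible
}

# Usage (requires valid Google Cloud credentials):
# upload_and_check("CIA.pdf")
```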
We’re now ready to send it off to Document AI with daiR’s workhorse function, dai_async(), as follows:
response2 <- dai_async("CIA.pdf")
A few words about this function. Its core parameter, files, tells DAI what to process. You can submit .pdf, .gif, or .tiff files, and your files vector can contain a mixture of these three file formats.
You can also specify a dest_folder: the name of the bucket folder where you want the output. It defaults to the root of the bucket, but you can specify a subfolder instead. If the folder does not exist already, it will be created.
The function also takes a location parameter (loc), which defaults to “eu” but can be set to “us”. It has nothing to do with where you are based, but with which of Google’s servers will process your files. The parameter skip_rev can be ignored by most; it is for passing selected documents to human review in business workflows. The remaining parameters default to things that are defined by your environment variables (provided you followed the recommendations above).
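To illustrate how these parameters fit together, here is a sketch of a thin wrapper that checks the inputs before calling dai_async(). The wrapper `submit_to_dai()` and its checks are assumptions made for illustration; only files, dest_folder and loc are passed on to daiR.

```r
# Sketch: validate inputs, then submit them with explicit parameters.
# submit_to_dai() is a hypothetical wrapper, not a daiR function.
submit_to_dai <- function(files, dest_folder, loc = c("eu", "us")) {
  loc <- match.arg(loc)  # only "eu" or "us" are accepted
  ok <- tolower(tools::file_ext(files)) %in% c("pdf", "gif", "tiff")
  if (!all(ok)) {
    stop("Unsupported file type: ", paste(files[!ok], collapse = ", "),
         call. = FALSE)
  }
  daiR::dai_async(files, dest_folder = dest_folder, loc = loc)
}

# Usage (requires valid Google Cloud credentials):
# response <- submit_to_dai("CIA.pdf", dest_folder = "processed", loc = "us")
```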
Back to our processing. If your call returned “status: 200”, it was accepted by the API. This does not necessarily mean that the processing was successful, because the API has no way of knowing right away if the filepaths you provided exist in your bucket. If there were errors in your filepaths, your HTTP request would get a 200, but your files would not actually process. They would turn up as empty files in the folder you provided. So if you see json files of around 70 bytes each in the destination folder, you know there was something wrong with your filenames.
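Once output files have been downloaded, this symptom is easy to test for locally. The sketch below uses a made-up helper, `flag_failed_outputs()`, and an assumed cut-off of 100 bytes, chosen only because it sits a little above the roughly 70 bytes mentioned here.

```r
# Sketch: list downloaded json files that are suspiciously small.
# The 100-byte threshold is an assumption, not a documented limit.
flag_failed_outputs <- function(paths, threshold = 100) {
  sizes <- file.size(paths)  # NA for files that do not exist
  paths[!is.na(sizes) & sizes < threshold]
}

# Usage:
# flag_failed_outputs(list.files(pattern = "\\.json$"))
```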
You can check the status of a job with dai_status(). Just pass the response object from your dai_async() call into the parentheses, and it will tell you whether the job is finished. It won’t tell you how much time remains, but in my experience, processing takes about 5-20 seconds per page.
Once dai_status() says “SUCCEEDED”, the json output files are waiting for you in the bucket.
Output file names look cryptic, but there’s a logic to them, namely:
"<job_number>/<document_number>/<filename>-<shard_number>.json"
Our file will thus take the form "<job_number>/0/CIA-0.json", with <job_number> changing from one processing call to the next. Let us store the name in a vector for simplicity:
## NOT RUN
our_file <- "<job_number>/0/CIA-0.json"
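Since the job number is not known in advance, it can be handy to look the name up in a bucket listing instead of typing it. A minimal sketch, in which `find_output()` is a hypothetical helper and `object_names` is assumed to be a character vector of object names such as a bucket listing would provide:

```r
# Sketch: find the output file(s) that belong to a given source file.
find_output <- function(object_names, source_file) {
  stem <- tools::file_path_sans_ext(basename(source_file))
  file_part <- basename(object_names)
  # Strip the "-<shard_number>.json" suffix and compare with the stem.
  is_match <- grepl("-[0-9]+\\.json$", file_part) &
    sub("-[0-9]+\\.json$", "", file_part) == stem
  object_names[is_match]
}

# Usage (assumes googleCloudStorageR is loaded and authenticated):
# our_file <- find_output(gcs_list_objects()$name, "CIA.pdf")
```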
Now let’s download it and save it under a simpler name:
gcs_get_object(our_file, saveToDisk = "CIA.json", overwrite = TRUE)
Finally we extract the text from the downloaded json file.
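One way to do this without relying on a particular helper is to read the json directly. Document AI output follows Google’s Document format, in which the recognized text is stored in a top-level `text` field. The sketch below assumes the jsonlite package is installed; `text_from_json()` is a hypothetical name.

```r
# Sketch: read a Document AI output file and return its text.
text_from_json <- function(path) {
  doc <- jsonlite::fromJSON(path, simplifyVector = FALSE)
  doc[["text"]]
}

# Usage:
# text <- text_from_json("CIA.json")
# cat(text)
```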
Although dai_async() takes batches of files, it is constrained by Google’s rate limits. Currently, a dai_async() call can contain a maximum of 50 files (a multi-page pdf counts as one file), and you cannot have more than 5 batch requests and 10 000 pages undergoing processing at any one time.
Therefore, if you’re looking to process a large batch, you need to spread the dai_async() calls out over time. The simplest solution is to make a function that sends files off individually with a small wait in between. Say we have a vector called big_batch containing thousands of filenames. First we would make a function like this:
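A minimal sketch of such a function follows. The 10-second default and the `send` argument are assumptions added for illustration; `send` defaults to `daiR::dai_async` and exists only so that the pacing logic can be tried out with a stand-in function.

```r
# Sketch: submit one file, then pause before returning.
process_slowly <- function(file, wait = 10, send = daiR::dai_async) {
  response <- send(file)  # hand the file to DAI
  Sys.sleep(wait)         # give DAI time before the next call
  invisible(response)
}

# Dry run with a stand-in sender and no waiting:
# process_slowly("CIA.pdf", wait = 0, send = function(f) message("sent ", f))
```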
Then we would iterate it over our file vector:
## NOT RUN
library(purrr)
map(big_batch, process_slowly)
This will hold up your console for a while, so it may be worth doing in the background as an RStudio job.
Finding the optimal wait time for the Sys.sleep() call may require some trial and error. As a rule of thumb, it should approximate the time it takes for DAI to process one of your files. This, in turn, depends on the size of the files, since a 100-page pdf will take a lot longer to process than a single-page one. In my experience, a 10-second interval works fine for a batch of single-page pdfs. Multi-page pdfs require proportionally more time. If your files vary in size, calibrate the wait time to the largest file, or you may get 429s (HTTP code for “rate limit exceeded”) halfway through the iteration.
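The rule of thumb can be written down as a small calculation. This is only a sketch: `estimate_wait()` is a hypothetical helper, the page counts are assumed to be known already, and the 10 seconds per page is the figure from the paragraph above, not a guarantee.

```r
# Sketch: scale the wait to the largest file in the batch.
estimate_wait <- function(pages, seconds_per_page = 10) {
  max(pages) * seconds_per_page
}

# A batch whose largest file has 12 pages:
estimate_wait(c(1, 3, 12))  # 120 seconds
```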
Although this procedure is relatively slow, it need not add much to the overall processing time. DAI starts processing the first files it receives right away, so when your loop ends, DAI will be mostly done with the OCR as well.
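Since a single dai_async() call accepts up to 50 files, another option is to submit the files in chunks instead of one at a time. The sketch below only does the splitting; `chunk_files()` is a hypothetical helper, and the calls would still need to be paced so that no more than 5 batch requests run at once.

```r
# Sketch: split a vector of filenames into chunks of at most `size`.
chunk_files <- function(files, size = 50) {
  split(files, ceiling(seq_along(files) / size))
}

# 120 filenames become chunks of 50, 50 and 20:
chunks <- chunk_files(sprintf("file_%03d.pdf", 1:120))
lengths(chunks)
```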