Note: Some of the functions below, notably get_ids_by_type() and get_versions_by_type(), are currently only available in the development version of daiR. Install it from GitHub with devtools::install_github("hegghammer/daiR") to use them.

Google Document AI offers a range of different processors, each optimized for a specific task. For most use cases, the default settings do the job, but there may be situations when you want to use a specific processor type or version. This vignette explains how to do that in daiR.

First, let’s clarify some concepts.

  • A processor type is a category of processors, such as OCR_PROCESSOR. Google keeps adding new types; the official documentation lists the types currently available. Many of them are designed for very specific kinds of documents (like the US tax form “W2”) and are thus not relevant to the average user. At the time of writing there are three main generic processor types, namely OCR_PROCESSOR, FORM_PARSER_PROCESSOR, and LAYOUT_PARSER_PROCESSOR.

  • A processor version is an instance of a processor type, reflecting the fact that different instances have finished training at different points in time. A version can be referenced either by its full name (aka “Version ID”), such as pretrained-ocr-v1.0-2020-09-23, or by its alias, such as stable or rc (“release candidate”).

  • A processor id is the user-specific identifier of a processor, made up of 16 random characters (for example 3e03a16deqac44a9). When you create a processor for use with Document AI, that processor gets a unique id. Note that one and the same processor can be available in multiple versions.

When you process documents with dai_sync() and dai_async(), you don’t normally need to specify a processor, because the functions default to the stable version of the processor whose id is stored in the environment variable DAI_PROCESSOR_ID (see the Configuration vignette). However, you can use the parameters proc_id and proc_v to specify a different processor and version.
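The fallback logic can be pictured with a small sketch. This is not daiR's actual implementation: resolve_proc_id() is a hypothetical helper that only illustrates the idea of preferring an explicitly supplied id and otherwise reading DAI_PROCESSOR_ID.

```r
# Hypothetical helper, not part of daiR: illustrates the fallback logic only.
resolve_proc_id <- function(proc_id = NULL) {
  # An explicitly supplied id takes precedence.
  if (!is.null(proc_id)) {
    return(proc_id)
  }
  # Otherwise fall back to the environment variable.
  env_id <- Sys.getenv("DAI_PROCESSOR_ID")
  if (!nzchar(env_id)) {
    stop("No processor id supplied and DAI_PROCESSOR_ID is not set.")
  }
  env_id
}

# An explicit (placeholder) id is returned unchanged.
resolve_proc_id("1234567890abcdef")
```

Calling resolve_proc_id() without arguments would return whatever is stored in DAI_PROCESSOR_ID, or stop with an error if the variable is empty.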

To see which processors you have at your disposal at any one time, you can use the function get_processors().

my_processors <- get_processors()

The function returns a data frame with various metadata for the available processors. If you run this right after setup and you followed the Configuration vignette, you should have only one processor (of type OCR_PROCESSOR), but the data frame will have several rows, one for each version.
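To narrow the result down to one processor type, ordinary subsetting is probably enough. The sketch below uses a mock data frame so that it runs without credentials. The column names id, type and displayName are assumptions made for illustration, so check names(my_processors) on your own output before adapting it.

```r
# Mock stand-in for the output of get_processors().
# The column names are assumed, not taken from daiR.
my_processors <- data.frame(
  id = c("abcdefgh12345678", "abcdefgh12345678", "1234567890abcdef"),
  type = c("OCR_PROCESSOR", "OCR_PROCESSOR", "FORM_PARSER_PROCESSOR"),
  displayName = c("my-ocr", "my-ocr", "my-forms"),
  stringsAsFactors = FALSE
)

# Count how many rows each processor type has.
type_counts <- table(my_processors$type)

# Collect the unique ids of the OCR processors.
ocr_ids <- unique(my_processors$id[my_processors$type == "OCR_PROCESSOR"])
```

Because one processor can span several rows, unique() keeps the id from being repeated.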

Now let’s say you wanted to add a processor of the type FORM_PARSER_PROCESSOR. Then you just need to think of a display name for that processor and pass it to the function create_processor() like so:

## NOT RUN
create_processor("<unique_display_name>", type = "FORM_PARSER_PROCESSOR")

The function will create a processor and output the id in the console. But how do you retrieve the id for a processor that you created some time ago? You can’t run create_processor() again, as it would create yet another processor. There are several ways to do this, but the easiest is with the function get_ids_by_type(). It takes the processor type as its main argument, for example like this:

get_ids_by_type("FORM_PARSER_PROCESSOR")

Assuming you have only one processor of this type, the function will return an id which you can then pass to dai_sync()/dai_async() via the proc_id parameter. If you have more than one processor of the type in question, it is better to run get_processors() and pick the right id from the resulting data frame.
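If you want a script to fail early rather than silently use the wrong processor, a small guard can help. The sketch below assumes that get_ids_by_type() returns a character vector of ids. require_single_id() is a hypothetical helper, not a daiR function.

```r
# Hypothetical helper, not part of daiR.
# Assumes `ids` is a character vector of processor ids.
require_single_id <- function(ids) {
  if (length(ids) == 0) {
    stop("No processor of this type found.")
  }
  if (length(ids) > 1) {
    stop("Several processors of this type found; pick one from get_processors().")
  }
  ids[[1]]
}

## NOT RUN
# proc_id <- require_single_id(get_ids_by_type("FORM_PARSER_PROCESSOR"))

# With a single placeholder id the value is passed through unchanged.
require_single_id("abcdefgh12345678")
```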

A processor is usually available in more than one version, but the range of available versions varies from one processor to another. To find out which versions are available for a given processor, you can use the function get_versions_by_type(), like this:

get_versions_by_type("FORM_PARSER_PROCESSOR")

This function will output both the aliases and the full names of the available versions. Pick the name or alias of the version you want to use and pass it to dai_sync()/dai_async() with the proc_v parameter, either as an alias (like rc) or as a full name (like pretrained-ocr-v1.0-2020-09-23 for an OCR processor).

A sample dai_sync() call using a specified processor might look like this:

## NOT RUN
resp <- dai_sync("document.pdf", proc_id = "abcdefgh12345678", proc_v = "rc")