Last updated 10 February 2024
Command line users: See here for GCS setup
using gcloud CLI.
This is a rundown for beginners of how to set up a Google Cloud Services account so you can start using Google Document AI.
The biggest hurdle to using any Google API is authentication. It can be daunting, partly because it involves abstract new concepts like “service accounts”, “Oauth2.0”, and “scopes”, and partly because the website you have to work with – the Google Cloud Console – is a nightmare to navigate.
There are several different ways to authenticate to GCS from R, and
you may have seen that different R packages do it slightly differently
(see e.g. gmailr and bigrquery). This vignette takes
an opinionated approach and recommends using service accounts and json
key files. It involves a little bit more hands-on configuration than
some of the setup wizards in other R packages, but my thinking is that
if you are computer-literate enough to consider using an API, you are
probably also able to navigate the GCS and set environmental variables
in R. But know that there are several
authentication strategies, and you can use
daiR with most
of them so long as you obtain an authentication token.
There are nine main steps to go through before you can start using
If you have one already, you can use that. Or you can create a burner account for your GCS work.
While logged in to your gmail account, go to the Google Cloud Console. Agree to the terms of service and click “Try for free”.
Accept the terms again, and add an address and a credit card. This last part is a prerequisite for using GCS. But Google Document AI is not particularly expensive to use; see the FAQ page for updated prices.
The largest “unit” of your GCS activities is your project. You can think of it as your root folder, since you will most likely only ever need one unless you are a business or a developer (though in principle you can have as many projects as you like).
When you activate GCS, you are assigned a project named “My first project”. Click on “My first project” in the top blue bar, just to the right of “Google cloud services”. You’ll see a screen like this:
Note that your project has an ID, usually consisting of an
adjective, a noun, and a number. You’ll need this soon, so I recommend
opening RStudio and storing it as a vector with
project_id <- "<your project id>".
Return to the Google Cloud Console and look at the left column. Toward the top you see an entry called “Billing”. Click it. You’ll get to a screen saying “This project has no billing account”. Click “link a billing account” and set the billing account to “My billing account”.
All this is necessary for you to be able to use Google Document AI (DAI) and related services programmatically.
Now we need to create a service account. Bring out the navigation menu on the left hand side by clicking the little circle with the three horizontal lines in the top left of the screen. Click on “APIs and services.” Then click on “credentials” in the left pane. You should see this:
Click on “create credentials” in the top middle, then choose service account. Give it any name you like (e.g. “my_rstudio_service_account”) and a description (e.g. “Interacting with GCS through R”) and click “create”.
In section 2 titled “Grant this service account access to project”, add “Basic > Owner” to the service account’s roles.
Click “continue”, then “done” at the bottom. You should now see your service account listed at the bottom.
Now we need to generate a JSON file containing the credentials for
this service account. Click the small edit icon on the bottom right. On
the next page, click “add key”, choose “create new key”, select JSON
format, and click “create”. This should prompt a save file window. Save
the file to your hard drive. You can change the name to something more
memorable if you like (but keep the
.json extension). Also,
take note of where you stored it. Now we are done in the Google Cloud
Console and can finally start working in RStudio.
daiR to automatically authenticate you when you
load the package so you don’t have to fiddle with authentication every
time. For this to happen, you need to tell R where it can find the JSON
file with the credentials. We do this by storing the path to the json
file as an environmental variable in your
Start by writing the following in your RStudio console:
This will open a pane with your
.Renviron file. If you
haven’t modified it before, it is probably empty.
All you need to do is add a line with the following:
GCS_AUTH_FILE='<full path to the json file you stored earlier>'.
Make sure all the slashes in the filepath are forward slashes. It should
look something like this:
Save the file, close it, and restart RStudio. If you try to load
daiR now with
library(daiR), you will be
auto-authenticated, but there are a few more steps left before you can
Head back to your browser and the GCS website. We now need to activate the Google Document AI API for your account. Open the navigation menu on the left hand side. Click on “APIs and services”. Then click on “Enable APIs and Services”, type “document ai” in the search field, click on “Cloud Document AI API”, and then “Enable”.
You now need to think of a globally unique display name for the
processor, to be supplied with the
function. Processor names are shared across the GCS ecosystem, so simple
ones like “processor” or “my-processor” are probably already taken.
Instead try something along the lines of “ocr-general-9567”.
If successful, the
create_processor() function will
return a processor id, a 16-character sequence of random numbers and
letters. Take good note of it, for while the name (“ocr-general-9567”)
is just a display name to help you distinguish between processors when
you have several, the id is what really identifies it to the Document AI
## NOT RUN
id <- create_processor("<my-unique-processor-name>")
It is possible to create more than one processor, but most users only
need one at any given time. To view your processors, run
daiR::get_processors(). Should you wish to delete one, use
There are several different
types of processors available, but most users will want the generic
OCR_PROCESSOR), and this is what
daiR::create_processor() defaults to. The other main type
to note is
FORM_PARSER_PROCESSOR, which can identify
Processor types also come in different versions, with each version representing a snapshot of the underlying machine learning model. There will typically be a couple of stable versions and a more recent, pre-release version of any one processor type. The difference in performance between versions is rarely substantial, so you do not need to worry about specifying a processor version at this stage.
Every Document AI processing request needs to contain a processor id,
so I strongly recommend storing the id you just generated in an
environment variable called
DAI_PROCESSOR_ID. All of
daiR’s processing functions are set up to look for this
.Renviron file by calling
DAI_PROCESSOR_ID="<your processor id>" on a separate
.Renviron may now look something like this:
.Renviron and restart RStudio. That’s it: You’re
now ready to start
processing. Note that in order to process asynchronously (best for
large batches), you will need to familiarise yourself with Google Cloud
Storage as well, so you may want to look at the vignette on Google
Cloud Storage right away.
If you ever get lost in the jungle of GCS terms, check out the concept cheatsheet.
It is also possible to create processors manually in the Google Cloud Console.↩︎