Last updated 10 February 2024

What exactly is Google Document AI?

Google Document AI (DAI) is a server-based OCR engine that extracts text from pdf files. Released in November 2020, it’s off-the-shelf accuracy is generally higher than static libraries such as tesseract. Short of corpus-specific, self-trained processors, DAI offers some of the best OCR capabilities currently available to the general public. At the time of writing, DAI is more expensive than Amazon’s Textract, but promises to support many more languages.

DAI is accessed through an API, but this API currently has no official R client library. This is where the daiR package comes in; it provides a light wrapper for DAI’s REST API, making it possible to submit documents to DAI from within R. In addition, daiR comes with pre- and postprocessing tools intended to make the whole text extraction process easier.

Google Document AI is closely connected with Google Storage, as the latter serves as a drop-off and pick-up point for files you want processed in DAI.

How much does Google Document AI cost to use?

Google currently charges USD 1.50 per 1,000 pages for use of the default processor, “Document OCR”. Specialized processors (such as form parsers for business use) cost more; see Google’s documentation for details.

How does Document AI perform compared with free solutions such as Tesseract?

It depends on the document. For clean pages in English, there is usually no difference to speak of. But for documents with noise or non-latin script, Document AI tends to have substantially better off-the-shelf accuracy, as described in this paper.

Which filetypes can I submit?

Document AI currently handles .pdf, .jpg/.jpeg, .png, .tiff/.tif, .bmp, and .webp files. See Google’s documentation for more details.

How much can I process at once?

Rate limits:

  • Synchronous processing: 120 requests per minute per processor
  • Asynchronous processing: 5 concurrent requests per processor, max 5000 files per request.

File size limits:

  • Synchronous processing: 20 MB per file, maximum 15 pages
  • Asynchronous processing: 1 GB per file, maximum 500 pages

These limits are for the “Document OCR” processor. They may change in the future. See Google’s documentation here and here for details.

Which languages does Document AI cover?

The general processor (“Document OCR”) currently handles over 60 languages, including Arabic, Chinese, Greek, Hebrew, Hindi, Japanese, Persian, and Russian. Notable languages not covered by the “Document OCR” processor include Swahili and Urdu. Some of the specialised Document AI processors (such as “Custom Document Extractor”) handle more or other languages. See Google’s documentation for more details.

Can I trust the output to be perfect?

No. Document AI is very good, but it does make mistakes. Be especially cautious when processing multi-column text.

How many different processors are there?

Google Document AI ecosystem contains three categories of processors: General, specialized, and custom. The specialized and custom ones are mainly for high-accuracy parsing of particular types of forms such as US driver’s licences or tax documents.

There are two types of general processors: “Document OCR” and “Form Parser” (Two others, “Intelligent Document Quality Processor” and “Document splitter”, are in the works but currently not publicly available). Of these, “Document OCR” is intended for general OCR tasks. The daiR package currently calls only “Document OCR”. The “Intelligent Document Quality Processor” will be added when it becomes publicly available.

Finally, each type comes in several versions, some stable and others in beta. The current stable version of “Document OCR” was trained in September 2020, while the latest release candidate was trained in November 2022. See Google’s documentation for an updated list. With daiR you can choose between different versions of “Document OCR”.

Can I finetune Document AI on my own data?

No, you cannot finetune the general processors with your own data the way you can with Google Translate. Document AI does offer something called uptraining, but it is only for specialized processors (form parsers and such). You can also train a custom document classifier which sorts documents into different categories prior to processing.

Can Document AI process handwriting?

Yes, for most languages. See Google’s documentation for a detailed overview.

Can I process sensitive data with Google Document AI?

It depends on the data in question and on your threat model. Google says it complies with strict privacy and security protocols and that it will not use customer data to train its models. It further notes that

“For batch operations, the stored document is typically deleted right after the processing is done, with a failsafe Time to live (TTL) from a few hours to up to 7 days. For online (immediate response) operations, the document data is processed in memory and not persisted to disk. Google also temporarily logs some metadata about your Document AI API requests (such as the time the request was received and the size of the request) to improve our service and combat abuse.”

Do I need to use the daiR package to access Document AI?

No, you can also access it from Python and several other programming languages (see here for an overview). You can also test process individual documents manually on the Document AI website (scroll down a little, to the section titled “Try Document AI in your environment”). daiR is designed mainly for users whose main workflow is already in R.

Is there an easier way to set up a Google Cloud Services account for use with daiR?

Unfortunately not. But there is a command line way involving gcloud CLI if you prefer that.