Last updated 10 February 2024
Google Document AI (DAI) is a server-based OCR engine that extracts text
from PDF files. Released in November 2020, its off-the-shelf accuracy is
generally higher than that of static libraries such as tesseract.
Short of corpus-specific, self-trained processors, DAI offers some of
the best OCR capabilities currently available to the general public. At
the time of writing, DAI is more expensive than Amazon’s Textract, but promises to
support many more languages.
DAI is accessed through an API, but this API currently has no official R
client library. This is where the daiR package comes in; it provides a
light wrapper for DAI’s REST API, making it possible to submit documents
to DAI from within R. In addition, daiR comes with pre- and
postprocessing tools intended to make the whole text extraction process
easier.
Google Document AI is closely connected with Google Storage, as the latter serves as a drop-off and pick-up point for files you want processed in DAI.
Google currently charges USD 1.50 per 1,000 pages for use of the default processor, “Document OCR”. Specialized processors (such as form parsers for business use) cost more; see Google’s documentation for details.
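As a rough illustration of what this rate means for a given job, the sketch below estimates cost from a page count. It is written in Python purely for illustration (the arithmetic is the same in R), and it assumes that the USD 1.50 per 1,000 pages figure quoted above still applies and that billing is strictly proportional to page count. The function name `estimate_ocr_cost` is hypothetical, and real invoices may differ.

```python
# Hypothetical helper: estimate the cost of a "Document OCR" job.
# Assumption: the rate quoted above (USD 1.50 per 1,000 pages) and
# strictly proportional billing. Check Google's pricing page for
# current figures before relying on the result.
RATE_USD_PER_1000_PAGES = 1.50


def estimate_ocr_cost(pages: int, rate_per_1000: float = RATE_USD_PER_1000_PAGES) -> float:
    """Return the estimated cost in USD for processing `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages / 1000 * rate_per_1000


if __name__ == "__main__":
    # Print estimates for a few illustrative corpus sizes.
    for n in (500, 10_000, 250_000):
        print(f"{n:>7} pages: about USD {estimate_ocr_cost(n):.2f}")
```

Passing a different `rate_per_1000` gives a comparable estimate for the more expensive specialized processors.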
It depends on the document. For clean pages in English, there is usually no difference to speak of. But for documents with noise or non-Latin script, Document AI tends to have substantially better off-the-shelf accuracy, as described in this paper.
Document AI currently handles .pdf, .jpg/.jpeg, .png, .tiff/.tif, .bmp,
and .webp files. See Google’s documentation for more details.
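Before uploading a folder of files, it can help to screen out formats that Document AI will reject. The following is a minimal sketch in Python (shown for illustration; the same check is easy to write in R) that sorts filenames by extension. It assumes the list of extensions given above is still current, and the names `SUPPORTED_EXTENSIONS`, `is_supported`, and `split_by_support` are hypothetical helpers, not part of daiR or of Google's client libraries.

```python
from pathlib import Path

# Assumption: the extensions listed above are the ones currently accepted.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpg", ".jpeg", ".png", ".tiff", ".tif", ".bmp", ".webp",
}


def is_supported(filename: str) -> bool:
    """Return True if the file extension is in the supported set (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def split_by_support(filenames):
    """Split filenames into (supported, unsupported) lists, preserving order."""
    supported = [f for f in filenames if is_supported(f)]
    unsupported = [f for f in filenames if not is_supported(f)]
    return supported, unsupported


if __name__ == "__main__":
    ok, rejected = split_by_support(["scan.PDF", "page1.tif", "notes.docx"])
    print("Supported:", ok)
    print("Unsupported:", rejected)
```

The check looks only at the extension, so it says nothing about whether a file is within the size and page limits.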
Rate limits:
File size limits:
These limits are for the “Document OCR” processor. They may change in the future. See Google’s documentation here and here for details.
The general processor (“Document OCR”) currently handles over 60 languages, including Arabic, Chinese, Greek, Hebrew, Hindi, Japanese, Persian, and Russian. Notable languages not covered by the “Document OCR” processor include Swahili and Urdu. Some of the specialized Document AI processors (such as “Custom Document Extractor”) handle more or other languages. See Google’s documentation for more details.
No. Document AI is very good, but it does make mistakes. Be especially cautious when processing multi-column text.
The Google Document AI ecosystem contains three categories of processors: general, specialized, and custom. The specialized and custom ones are mainly for high-accuracy parsing of particular types of forms such as US driver’s licences or tax documents.
There are two types of general processors: “Document OCR” and “Form
Parser” (two others, “Intelligent Document Quality Processor” and
“Document splitter”, are in the works but not yet publicly available).
Of these, “Document OCR” is intended for general OCR tasks. The daiR
package currently calls only “Document OCR”. The “Intelligent Document
Quality Processor” will be added when it becomes publicly available.
Finally, each type comes in several versions, some stable and others
in beta. The current stable version of “Document OCR” was trained in
September 2020, while the latest release candidate was trained in
November 2022. See Google’s documentation for an updated list. With daiR
you can choose between different versions of “Document OCR”.
No, you cannot fine-tune the general processors with your own data the way you can with Google Translate. Document AI does offer something called uptraining, but it is only for specialized processors (form parsers and such). You can also train a custom document classifier, which sorts documents into different categories prior to processing.
Yes, for most languages. See Google’s documentation for a detailed overview.
It depends on the data in question and on your threat model. Google says it complies with strict privacy and security protocols and that it will not use customer data to train its models. It further notes that
“For batch operations, the stored document is typically deleted right after the processing is done, with a failsafe Time to live (TTL) from a few hours to up to 7 days. For online (immediate response) operations, the document data is processed in memory and not persisted to disk. Google also temporarily logs some metadata about your Document AI API requests (such as the time the request was received and the size of the request) to improve our service and combat abuse.”
Do I need the daiR package to access Document AI?
No, you can also access it from Python and several other programming
languages (see here for
an overview). You can also test process individual documents manually on
the Document AI website (scroll down a
little, to the section titled “Try Document AI in your environment”).
daiR is designed mainly for users whose main workflow is already in R.
daiR?
Unfortunately not. But there is a command-line route using the gcloud CLI, if you prefer that.