Observations

Text detection breaks recognized text into lines, which are further broken up into words.
Text detection alone won’t recognize that a single sentence spans multiple lines; each line comes back as an independent item.
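To make the line/word breakdown concrete, here is a minimal sketch that walks a text detection response and pairs each line with its words. It assumes the response is the usual dictionary with a `Blocks` list, where `LINE` blocks point to their `WORD` blocks through `CHILD` relationships. The function name `lines_with_words` is mine, not something from Textract.

```python
def lines_with_words(response):
    """Return [(line_text, [word, ...]), ...] from a text detection response.

    Assumes `response` is a dict with a "Blocks" list, as returned by a
    text detection call.
    """
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    result = []
    for block in response["Blocks"]:
        if block["BlockType"] != "LINE":
            continue
        words = []
        # A LINE block lists its WORD blocks under a CHILD relationship.
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for child_id in rel["Ids"]:
                child = blocks_by_id.get(child_id)
                if child is not None and child["BlockType"] == "WORD":
                    words.append(child["Text"])
        result.append((block["Text"], words))
    return result
```

Nothing in this structure links one line to the next, which is why a sentence that wraps across lines shows up as separate, unrelated entries.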
Using the Text Analysis functionality with the FORMS feature type, Textract tries to find key-value pairs, and seems to do well at this.
1. Colons seem to be a strong indicator for Textract, as in Key: Value Line.
2. This also helps preserve context between lines that text detection alone returns as independent, unrelated lines.
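The key-value pairs can be pulled out of an analysis response along these lines. This is a sketch under the assumption that the response contains `KEY_VALUE_SET` blocks, where a block with entity type `KEY` points to its value block through a `VALUE` relationship and to its own words through `CHILD` relationships. The helper names `_text_of` and `key_value_pairs` are hypothetical.

```python
def _text_of(block, blocks_by_id):
    """Join the text of a block's WORD children (and any checkbox status)."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                # Checkboxes carry a status instead of text.
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def key_value_pairs(response):
    """Return [(key_text, value_text), ...] from a FORMS analysis response."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    pairs = []
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        key_text = _text_of(block, blocks_by_id)
        value_parts = []
        # The KEY block references its VALUE block(s) by ID.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(_text_of(blocks_by_id[value_id], blocks_by_id))
        pairs.append((key_text, " ".join(p for p in value_parts if p)))
    return pairs
```

For the SSN use case, the result could then be filtered on keys that look like an SSN label, which is where the key-to-value link earns its keep.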
The Text Analysis functionality can also identify tables and cells, but I didn’t see any examples in the few samples that I reviewed.
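For when a sample with a table does turn up, here is a rough sketch of rebuilding rows from the response. It assumes `TABLE` blocks list their `CELL` blocks as `CHILD` relationships, and that each cell carries 1-based `RowIndex` and `ColumnIndex` fields. It has not been run against real table output from these samples, and `table_rows` is a made-up name.

```python
def table_rows(response):
    """Return one list of rows (each a list of cell strings) per TABLE block."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] != "CHILD":
                continue
            for cell_id in rel["Ids"]:
                cell = blocks_by_id[cell_id]
                if cell["BlockType"] != "CELL":
                    continue
                # Collect the words inside this cell.
                words = [
                    blocks_by_id[word_id]["Text"]
                    for cell_rel in cell.get("Relationships", [])
                    if cell_rel["Type"] == "CHILD"
                    for word_id in cell_rel["Ids"]
                    if blocks_by_id[word_id]["BlockType"] == "WORD"
                ]
                cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        if not cells:
            tables.append([])
            continue
        n_rows = max(row for row, _ in cells)
        n_cols = max(col for _, col in cells)
        # Missing cells are filled with empty strings so rows stay rectangular.
        tables.append([[cells.get((r, c), "") for c in range(1, n_cols + 1)]
                       for r in range(1, n_rows + 1)])
    return tables
```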
The Textract Timings.xlsx file has timing information for a set of files submitted for processing in quick succession.
1. Asynchronous jobs, which are required for multi-page documents, seem to take a minimum of about 15 to 20 seconds.

The end times in the timing data came from the timestamps of the output files written to S3.
1. For Textract, an output file is not required.
2. Textract automatically retains the results of a job for 7 days, and they can be retrieved with the Job ID.
3. Can you get the information quicker by querying/polling the Textract output directly?
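One way to test the polling question is to skip the S3 output file and ask Textract for the job result directly. The sketch below assumes a boto3-style Textract client whose `get_document_analysis` call takes `JobId` and an optional `NextToken` and returns a `JobStatus` field; for plain text detection jobs the matching call would be `get_document_text_detection`. The function name, the poll interval, and the timeout are my own choices, not measured recommendations.

```python
import time


def wait_for_job(client, job_id, poll_seconds=2.0, timeout_seconds=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll an asynchronous analysis job and return all of its blocks.

    `client` is assumed to behave like boto3.client("textract").
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout_seconds
    while True:
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            break
        if clock() >= deadline:
            raise TimeoutError(f"Job {job_id} still in progress after timeout")
        sleep(poll_seconds)

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Job {job_id} ended with status {status}")

    # Large results are paginated; follow NextToken until it runs out.
    blocks = list(response.get("Blocks", []))
    next_token = response.get("NextToken")
    while next_token:
        response = client.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
    return blocks
```

Recording the clock before the start call and again when `wait_for_job` returns would give an end time that does not depend on the S3 write, which could then be compared with the figures in the timings file.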
You could break multi-page documents into pages and submit the pages using synchronous calls.
1. This is probably faster for smaller documents, but maybe not for larger ones.
  1. What are the metrics for that approach?
2. How likely is it that you would lose context from information broken across 2 pages? (Probably not very likely, I would think, if we are only concerned with SSNs.)
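To get metrics for the per-page approach, something like the following could time each synchronous call. It is only a sketch: it assumes the document has already been split into one image per page (the splitting step is not shown), and that the client behaves like a boto3 Textract client with `detect_document_text(Document={"Bytes": ...})`. The function name `detect_pages` and the worker count are hypothetical, and real runs would be subject to whatever request-rate limits apply to the account.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def detect_pages(client, page_images, max_workers=4):
    """Submit each page as a synchronous call and time it.

    `page_images` is a list of bytes objects, one image per page.
    Returns [(page_number, response, elapsed_seconds), ...] in page order.
    """
    def process(item):
        page_number, image_bytes = item
        start = time.perf_counter()
        response = client.detect_document_text(Document={"Bytes": image_bytes})
        elapsed = time.perf_counter() - start
        return page_number, response, elapsed

    # pool.map keeps results in input order even when calls finish out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process, enumerate(page_images, start=1)))
```

Comparing the wall-clock time of the whole `detect_pages` call against the 15 to 20 second asynchronous minimum, for documents of different page counts, would show where the break-even point sits.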

Version	Old Version 1	New Version 2
Changes made by	Stephen Lashinski (Unlicensed)	Stephen Lashinski (Unlicensed)
Saved on	Jan 25, 2022	Jan 25, 2022