Observations

Textract:

  1. Text detection breaks recognized text into lines, which are further broken up into words.

  2. Text detection alone won’t recognize that a sentence is made up of multiple lines.

  3. With the Text Analysis functionality and FORMS recognition enabled, Textract tries to find key-value pairs, and it seems to do this well.

    1. Colons seem to be a strong indicator for Textract, as in Key: Value Line.

    2. This also helps preserve context that is lost when the text detection functionality alone returns the same content as independent lines.

  4. The Text Analysis functionality can also identify tables and cells, but I didn’t see any examples in the few samples that I reviewed.

  5. The Textract Timings.xlsx file has timing information for a set of files submitted for processing in quick succession.

    1. Asynchronous jobs, which are required for multi-page documents, seem to take a minimum of about 15 to 20 seconds.
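The line/word hierarchy and the key-value pairs noted above can be read out of a Textract response roughly as follows. This is a minimal sketch, assuming `response` is the dict returned by a Textract analysis call with FORMS enabled, where each entry in `Blocks` has a `BlockType`, an `Id`, and `Relationships`. The function names (`lines_with_words`, `key_value_pairs`) are made up for illustration and are not part of any library.

```python
from typing import Dict, List, Tuple


def _related_ids(block: dict, rel_type: str) -> List[str]:
    """Return the block Ids listed under one relationship type."""
    ids: List[str] = []
    for rel in block.get("Relationships", []):
        if rel.get("Type") == rel_type:
            ids.extend(rel.get("Ids", []))
    return ids


def _word_text(block: dict, by_id: Dict[str, dict]) -> str:
    """Join the text of a block's WORD children, in the order listed."""
    children = [by_id[i] for i in _related_ids(block, "CHILD") if i in by_id]
    return " ".join(c.get("Text", "") for c in children if c.get("BlockType") == "WORD")


def lines_with_words(response: dict) -> List[Tuple[str, List[str]]]:
    """Return (line text, [word texts]) for every LINE block."""
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    result = []
    for block in blocks:
        if block.get("BlockType") == "LINE":
            words = [by_id[i].get("Text", "") for i in _related_ids(block, "CHILD") if i in by_id]
            result.append((block.get("Text", ""), words))
    return result


def key_value_pairs(response: dict) -> Dict[str, str]:
    """Map each KEY block's text to the text of the VALUE block(s) it points at."""
    blocks = response.get("Blocks", [])
    by_id = {b["Id"]: b for b in blocks}
    pairs: Dict[str, str] = {}
    for block in blocks:
        if block.get("BlockType") == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            values = [by_id[i] for i in _related_ids(block, "VALUE") if i in by_id]
            pairs[_word_text(block, by_id)] = " ".join(_word_text(v, by_id) for v in values)
    return pairs
```

Because a key block and its value block each point at their own words, the pairing survives even when the key and the value sit on what text detection alone would report as separate lines.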

Things to be determined:

  1. The end times in the timing data came from the timestamps of the output files written to S3.

    1. For Textract, an output file is not required.

    2. Textract automatically saves the information from a Job for 7 days, and it can be retrieved with the Job ID.

    3. Can you get the information more quickly by querying/polling the Textract output directly?

  2. You could break multi-page documents into pages and submit the pages using synchronous calls.

    1. This is probably faster for smaller documents, but maybe not for larger ones.

      1. What are the metrics for that approach?

    2. How likely is it that you would lose context from information broken across 2 pages? (Probably not very, if we are only concerned with SSN, I would think.)
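One way to explore the polling question above is to ask Textract for the job result directly by Job ID instead of waiting for an output file on S3. The following is a sketch only, assuming `client` is a boto3 Textract client (for example `boto3.client("textract")`) and that the job was started as an asynchronous document analysis. The function name `wait_for_blocks`, the poll interval, and the timeout are illustrative choices, not measured recommendations.

```python
import time
from typing import Callable, List


def wait_for_blocks(
    client,
    job_id: str,
    poll_seconds: float = 2.0,
    timeout_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Poll a Textract analysis job by Job ID and return all of its blocks."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status == "SUCCEEDED":
            break
        if status != "IN_PROGRESS":
            # FAILED or PARTIAL_SUCCESS: surface it rather than guess at a recovery.
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Textract job {job_id} still in progress after {timeout_seconds}s")
        sleep(poll_seconds)

    # Large results are paginated; follow NextToken until it is absent.
    blocks = list(response.get("Blocks", []))
    next_token = response.get("NextToken")
    while next_token:
        response = client.get_document_analysis(JobId=job_id, NextToken=next_token)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
    return blocks
```

Recording the time at which this function returns, next to the S3 output timestamp for the same job, would give a direct comparison of the two ways of measuring the end time.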

Comprehend:

  1. Comprehend is able to look across lines of information sent from Textract. For example, in the image file 2020021298_ori.tif, the Textract output for the second page has 2 separate lines:

    1. JONATHAN D. CLEMENTE as ATTORNEY in FACT for HOWARD R. SMITH and DORIS

    2. V. SMITH, Husband and Wife

    Separating those 2 lines (and every other pair of successive lines in the Textract output) with a newline character and passing the result to Comprehend yields a name:

      DORIS

      V. SMITH

      DORIS and V. SMITH are still separated by the newline character, but they are identified together as a single NAME.
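The newline-joining step can be sketched as below. The notes do not say which Comprehend call was used, so this assumes the PII detection response shape, in which each entry in `Entities` carries a `Type`, a `BeginOffset`, and an `EndOffset` into the submitted text. The helper names `join_lines` and `spans_of_type` are hypothetical.

```python
from typing import Dict, List


def join_lines(lines: List[str]) -> str:
    """Join successive Textract lines with a newline character."""
    return "\n".join(lines)


def spans_of_type(text: str, response: Dict, wanted_type: str = "NAME") -> List[str]:
    """Slice out each entity of the wanted type using its character offsets."""
    return [
        text[entity["BeginOffset"]:entity["EndOffset"]]
        for entity in response.get("Entities", [])
        if entity.get("Type") == wanted_type
    ]


# In real use, under the assumption above, the response would come from
# something like:
#   response = boto3.client("comprehend").detect_pii_entities(
#       Text=text, LanguageCode="en")
```

Because the offsets index into the joined text, an entity that spans the line break comes back with the newline still inside it, which matches the behavior observed above.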