Observations

Textract:

  1. Text detection breaks recognized text into lines, which are further broken up into words.

  2. Text detection alone won’t recognize that a sentence is made up of multiple lines.

  3. Using the Text Analysis functionality with FORM recognition, Textract tries to find key-value pairs, and seems to do well at this.

    1. Colons seem to be a strong indicator for Textract, as in a "Key: Value" line.

    2. This also helps preserve context across what are merely independent lines from the text detection functionality alone.

  4. The Text Analysis functionality can also identify tables and cells, but I didn’t see any examples in the few samples that I reviewed.

  5. The Textract Timings.xlsx file has timing information for a set of files submitted for processing in quick succession.

    1. There seems to be a minimum of about 15 – 20 seconds for asynchronous jobs, which are required for multi-page documents.
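
  The key-value observation above can be sketched in code. This is a minimal example, assuming the documented block layout of an AnalyzeDocument response requested with FeatureTypes=["FORMS"]: each KEY block in a KEY_VALUE_SET is paired with its VALUE block through the Relationships field, and text is assembled from child WORD blocks. The function names are illustrative, not part of any API.

  ```python
  # Sketch: extract key-value pairs from a Textract AnalyzeDocument
  # response requested with FeatureTypes=["FORMS"].

  def _child_text(block, blocks_by_id):
      """Assemble text from a block's child WORD blocks."""
      words = []
      for rel in block.get("Relationships", []):
          if rel["Type"] == "CHILD":
              for cid in rel["Ids"]:
                  child = blocks_by_id[cid]
                  if child["BlockType"] == "WORD":
                      words.append(child["Text"])
      return " ".join(words)

  def key_value_pairs(blocks):
      """Return {key_text: value_text} from an AnalyzeDocument block list."""
      by_id = {b["Id"]: b for b in blocks}
      pairs = {}
      for b in blocks:
          if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", []):
              key_text = _child_text(b, by_id)
              for rel in b.get("Relationships", []):
                  if rel["Type"] == "VALUE":
                      for vid in rel["Ids"]:
                          pairs[key_text] = _child_text(by_id[vid], by_id)
      return pairs
  ```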

Things to be determined:

  1. The end times in the timing data came from the timestamps of output files written to S3.

    1. For Textract, an output file is not required.

    2. Textract automatically retains the results of a job for 7 days; within that window the results can be retrieved with the Job ID.

    3. Can you get the information quicker by querying/polling the Textract output directly?

  2. You could break multi-page documents into pages and submit the pages using synchronous calls.

    1. This is probably faster for smaller documents, but maybe not for larger ones.

      1. What are the metrics for that approach?

    2. How likely is it that you would lose context from information broken across 2 pages? (Probably not very, if we are only concerned with SSN, I would think.)
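
  On the polling question above (item 1.3): a rough sketch, assuming a boto3 Textract client, that starts an asynchronous text-detection job and polls GetDocumentTextDetection for the result directly, so no output file on S3 is needed. The poll interval and helper names are assumptions, and bucket/key are placeholders supplied by the caller.

  ```python
  import time

  def extract_lines(pages):
      """Collect LINE block text from a list of Textract result pages."""
      return [
          block["Text"]
          for page in pages
          for block in page.get("Blocks", [])
          if block["BlockType"] == "LINE"
      ]

  def detect_text_async(textract, bucket, key, poll_seconds=5):
      """Start an async job, poll until it finishes, return all result pages."""
      job = textract.start_document_text_detection(
          DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
      )
      job_id = job["JobId"]
      pages, token = [], None
      while True:
          kwargs = {"JobId": job_id}
          if token:
              kwargs["NextToken"] = token
          resp = textract.get_document_text_detection(**kwargs)
          if resp["JobStatus"] == "IN_PROGRESS":
              time.sleep(poll_seconds)
              continue
          if resp["JobStatus"] == "FAILED":
              raise RuntimeError("Textract job failed")
          pages.append(resp)
          token = resp.get("NextToken")  # results are paginated
          if not token:
              return pages
  ```

  Comparing timestamps around `detect_text_async` against the S3 output-file timestamps would answer whether polling is faster.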

Comprehend:

  1. Comprehend is able to look across lines of information sent from Textract. For example, for the image file 2020021298_ori.tif, the Textract output for the second page has 2 separate lines:

    1. JONATHAN D. CLEMENTE as ATTORNEY in FACT for HOWARD R. SMITH and DORIS

    2. V. SMITH, Husband and Wife

      Joining those 2 lines (and every other pair of successive lines in the Textract output) with a newline character, and putting the result into Comprehend, yields output identifying a name:

      DORIS

      V. SMITH

      DORIS and V. SMITH are still separated by the newline character, but together they are identified as a single NAME entity.

  2. Synchronous Comprehend Detect PII Entities calls are limited to 5000 UTF-8 characters, whereas the asynchronous StartPiiEntitiesDetectionJob uses an S3 object with no given size limit as the source.

  3. Initial timing tests with 3 separate files, ranging from 10 to 269 pages in length, all took about 8 minutes 8 – 9 seconds to complete. A couple of those were running concurrently.

  4. Later, a series of files was submitted with gaps in the start times so that as many as 10 (the maximum concurrency for our account) were running concurrently.

    1. The first couple of files again took a little over 8 minutes to complete.

    2. Later files, except for the 269-pager, took about 6 minutes 7 – 9 seconds to complete.

    3. The 269-page document (372K characters of Textract output) took 8 minutes 8 seconds again.

    4. While pipelining does seem to show some benefit, the minimum time observed, even with 1- or 2-page documents, was a little over 6 minutes.
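
  The line-joining observation in item 1 can be sketched as follows, assuming a boto3 Comprehend client and English text. Entity text is recovered by slicing the joined string with the BeginOffset/EndOffset fields the API returns; the helper names are illustrative.

  ```python
  def join_lines(textract_lines):
      """Join successive Textract lines with newline characters."""
      return "\n".join(textract_lines)

  def entity_texts(text, entities, entity_type="NAME"):
      """Slice entity spans (BeginOffset/EndOffset) of one type out of the text."""
      return [
          text[e["BeginOffset"]:e["EndOffset"]]
          for e in entities
          if e["Type"] == entity_type
      ]

  def detect_names(comprehend, textract_lines):
      text = join_lines(textract_lines)
      resp = comprehend.detect_pii_entities(Text=text, LanguageCode="en")
      return entity_texts(text, resp["Entities"], "NAME")
  ```

  An entity such as "DORIS\nV. SMITH" comes back as one span even though the newline is still inside it.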

Things to be determined:

  1. Just as Textract input can be split into 1-page documents for synchronous processing, Comprehend input could be split into 5000-character UTF-8 chunks and run synchronously.

    1. We would probably want some overlap between successive chunks, in case PII spanning a 5000-character boundary would otherwise be split and go undetected.

    2. We would want to determine how many 5000-character chunks it takes before we just choose to use the asynchronous version.

  2. Comprehend allows you to send a batch of documents for processing as part of 1 job.

    1. If the InputDataConfig parameter and its S3Uri point to a folder with multiple files, then all of those files will be used as input.

    2. Does that help with processing time?
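
The chunking idea in item 1 could look like the sketch below. It assumes the 5000-character synchronous limit noted above and a hypothetical 200-character overlap (both tunable), producing chunks suitable for repeated synchronous DetectPiiEntities calls.

```python
MAX_CHARS = 5000  # synchronous DetectPiiEntities limit noted above
OVERLAP = 200     # assumed overlap so boundary-spanning PII appears whole in some chunk

def chunk_text(text, max_chars=MAX_CHARS, overlap=OVERLAP):
    """Split text into chunks of at most max_chars, each overlapping the last."""
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks
```

Counting the chunks for a typical document would also answer item 1.2: the point at which the per-call overhead outweighs the asynchronous job's minimum time.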