An integrated search solution powered by ML

by Andrés Villarroel and Santiago Vazquez, Cloud Engineering

AWS Textract is an AWS service that lets developers detect and extract text (including handwriting), forms, and tables from PDF, PNG, and JPG files.

The service is built on deep learning models, which AWS keeps adjusting incrementally using the millions of images and videos that its image and video recognition services process every day.

It is used for:

  • OCR
  • Extracting text by means of NLP (Natural Language Processing): it creates word tags to predict and suggest similar patterns
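
As a rough illustration of the OCR use case, the sketch below collects the text lines from a Textract-style response. It assumes the response is a dictionary with a `Blocks` list in which `LINE` blocks carry a `Text` field, as returned by the boto3 `detect_document_text` call; the `extract_lines` helper and the sample data are ours and purely illustrative.

```python
from typing import Any, Dict, List


def extract_lines(response: Dict[str, Any]) -> List[str]:
    """Return the text of every LINE block in a Textract-style response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Hand-written sample shaped like a Textract response (not real service output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Cloud Engineer"},
        {"BlockType": "WORD", "Text": "Cloud"},
        {"BlockType": "WORD", "Text": "Engineer"},
        {"BlockType": "LINE", "Text": "5 years of AWS experience"},
    ]
}

if __name__ == "__main__":
    print("\n".join(extract_lines(sample_response)))
```

Only `LINE` blocks are kept because `WORD` blocks repeat the same text one word at a time.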

Benefits

  • Low cost
  • Endpoints are trained by AWS and consumed via API
  • Scalable and can be integrated with other AWS native services, such as A2I (Augmented AI)

Textract sample

NLP (natural language processing) is a branch of artificial intelligence that aims to convert text into structured data, so that software can comprehend human language and reproduce it through the analysis, understanding, and generation of natural language.

Used for:

  • Translations
  • Autofill/predictive text
  • Chatbots
  • Spell checking
  • Looking for similar words in search engines
  • Analyzing and understanding voice-to-text messages
  • Spam filters
  • Virtual assistants like Google Assistant, Siri, Alexa

AWS Comprehend

It is an NLP service, trained and managed by AWS with millions of data points collected from diverse sources. Its learning can be enhanced with AutoML and training on custom data.

Used for:

  • Opinion, sentiment and semantic analysis
  • Entity recognition
  • Medical data (detecting acronyms, shortened names of viruses and bacteria, etc.)
  • Language detection
  • Key phrases and words
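
To make the key-phrase use case more concrete, here is a minimal sketch that filters a Comprehend-style key-phrase result by confidence. It assumes the response shape of the boto3 `detect_key_phrases` call (a `KeyPhrases` list whose items have `Text` and `Score`); the `confident_key_phrases` helper, the 0.9 threshold, and the sample values are illustrative assumptions.

```python
from typing import Any, Dict, List


def confident_key_phrases(response: Dict[str, Any], min_score: float = 0.9) -> List[str]:
    """Return key phrases scoring at least min_score, most confident first."""
    phrases = [
        phrase
        for phrase in response.get("KeyPhrases", [])
        if phrase.get("Score", 0.0) >= min_score
    ]
    phrases.sort(key=lambda phrase: phrase["Score"], reverse=True)
    return [phrase["Text"] for phrase in phrases]


# Hand-written sample shaped like a Comprehend response (not real service output).
sample_key_phrases = {
    "KeyPhrases": [
        {"Text": "Terraform", "Score": 0.97},
        {"Text": "a team", "Score": 0.62},
        {"Text": "AWS Lambda", "Score": 0.99},
    ]
}

if __name__ == "__main__":
    print(confident_key_phrases(sample_key_phrases))
```

Dropping low-score phrases before indexing is one way to keep generic fragments out of the search results; the right threshold depends on the documents.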

Benefits

  • It makes it possible to quickly track down and find relevant information in texts
  • It can be trained on the data that needs to be processed
  • It can be integrated with other AWS services such as Translate, Transcribe, Polly, Lex, etc.

Project Architecture

To create an integrated search solution that can extract information and then classify, comprehend, and index it for later use, we worked on the following: https://github.com/aws-samples/amazon-textract-comprehend-OCRimage-search-and-analyze

Issues

  • The methods that interact with Textract had to be adjusted so that they worked with PDF files
  • Service limits
  • Textract and Comprehend jobs must operate asynchronously
  • Lambdas would time out while waiting for Textract and Comprehend jobs to finish
  • A Comprehend sentiment/entity analysis job delivers its result as an output.tar.gz file, which was a problem when the data had to be indexed into ES
  • Splitting the Lambda into 3 functions with different triggers caused permission/role/CloudFormation/variable issues
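
One way to avoid the time-outs described above is to have the Lambda only start the job and return, instead of waiting for it. The sketch below is a minimal illustration of that idea, not the project's actual code: it assumes the standard S3 event layout and the boto3 `start_document_text_detection` call, while the `handler` signature with an injectable `textract_client` is our own addition so the function can be exercised without AWS access.

```python
from typing import Any, Dict
from urllib.parse import unquote_plus


def handler(event: Dict[str, Any], context: Any, textract_client: Any = None) -> Dict[str, str]:
    """Start an asynchronous Textract job for the uploaded object and return at once."""
    if textract_client is None:
        # Imported lazily so the sketch can run locally with a stand-in client.
        import boto3

        textract_client = boto3.client("textract")

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # S3 event keys are URL-encoded

    response = textract_client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # No polling here: the function ends as soon as the job is accepted.
    return {"JobId": response["JobId"], "Bucket": bucket, "Key": key}
```

Because the function never waits for the job, its running time no longer depends on how long Textract takes; a later step has to pick up the finished result.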

Final Project Architecture for the solution

FLOW

  • When a PDF file is uploaded to the bucket, a trigger executes a Lambda function, which in turn starts a Textract job
  • When that job concludes, the result is a TXT file in the bucket, which in turn triggers a new Lambda that calls the Comprehend API
  • Once the Comprehend job finishes, the resulting file in the bucket is a .TAR.GZ, which triggers the last Lambda; it extracts the content of the output.tar.gz file and indexes it into the ES cluster
  • Finally, the resulting content can be queried from Kibana
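
The last step of the flow, unpacking output.tar.gz before indexing, can be sketched as follows. This is a minimal example under stated assumptions: we assume the archive contains one or more text files in JSON Lines format (one JSON object per line), and both `read_comprehend_archive` and `build_sample_archive` are hypothetical helpers with made-up sample data, not code from the project.

```python
import io
import json
import tarfile
from typing import Any, Dict, List


def read_comprehend_archive(archive_bytes: bytes) -> List[Dict[str, Any]]:
    """Parse every JSON line found in the files of a .tar.gz archive held in memory."""
    records: List[Dict[str, Any]] = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and other non-file entries
            extracted = archive.extractfile(member)
            if extracted is None:
                continue
            for raw_line in extracted.read().decode("utf-8").splitlines():
                if raw_line.strip():
                    records.append(json.loads(raw_line))
    return records


def build_sample_archive() -> bytes:
    """Build a tiny in-memory archive with made-up content, for local testing only."""
    payload = b'{"File": "cv1.txt", "KeyPhrases": [{"Text": "AWS Lambda", "Score": 0.99}]}\n'
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="output")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


if __name__ == "__main__":
    for record in read_comprehend_archive(build_sample_archive()):
        print(record["File"], [phrase["Text"] for phrase in record["KeyPhrases"]])
```

Working on bytes in memory avoids writing to the Lambda's temporary storage; each parsed record can then be sent to the ES cluster as a document.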

DEMO DEPLOYMENT

  1. First, run the zip_generator.sh script. This generates the ZIP files for the Lambda functions and for the layer.
  2. Create an S3 bucket and upload the generated ZIPs.
  3. Run template-generator.sh. This replaces the value BUCKET_CODE in the CloudFormation template (stack.yaml) with the name of the bucket created for the code.
  4. Deploy on AWS the CloudFormation template.yaml that was generated in step 3.
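
The placeholder substitution in step 3 amounts to a text replacement. The sketch below shows the idea in Python; it is not the content of template-generator.sh, and the `render_template` helper, the sample template fragment, and the bucket name are all hypothetical.

```python
def render_template(template_text: str, bucket_name: str, placeholder: str = "BUCKET_CODE") -> str:
    """Replace every occurrence of the placeholder with the code bucket name."""
    if placeholder not in template_text:
        raise ValueError(f"placeholder {placeholder!r} not found in template")
    return template_text.replace(placeholder, bucket_name)


# Made-up fragment for illustration; the real stack.yaml will look different.
sample_template = "Code:\n  S3Bucket: BUCKET_CODE\n  S3Key: lambda.zip\n"

if __name__ == "__main__":
    print(render_template(sample_template, "my-code-bucket"))
```

Failing loudly when the placeholder is missing helps catch the case where the template was already rendered or the wrong file was passed in.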

For the analysis, 2 options were considered:

  1. Analyze KEY PHRASES from 5 different CVs, which would allow us to navigate through a repository of candidate CVs and make specific searches according to experience or a given technology:

Result search

  2. Analyze the SENTIMENT of song lyrics, to determine whether each was positive, negative, neutral, or mixed.

Analysis of results
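
For the sentiment option, a result can be reduced to a label and its confidence as in the sketch below. It assumes the response shape of the boto3 `detect_sentiment` call (a `Sentiment` label in upper case plus a `SentimentScore` dictionary keyed by `Positive`, `Negative`, `Neutral`, and `Mixed`); the `summarize_sentiment` helper and the sample scores are illustrative, not results from our tests.

```python
from typing import Any, Dict, Tuple


def summarize_sentiment(response: Dict[str, Any]) -> Tuple[str, float]:
    """Return the predicted sentiment label and the confidence score for that label."""
    label = response["Sentiment"]  # e.g. "POSITIVE"
    # Score keys are capitalized ("Positive"), while the label is upper case.
    score = response["SentimentScore"][label.capitalize()]
    return label, score


# Hand-written sample shaped like a Comprehend response (not real service output).
sample_sentiment = {
    "Sentiment": "MIXED",
    "SentimentScore": {"Positive": 0.30, "Negative": 0.25, "Neutral": 0.05, "Mixed": 0.40},
}

if __name__ == "__main__":
    label, score = summarize_sentiment(sample_sentiment)
    print(f"{label}: {score:.2f}")
```

Keeping the score next to the label makes it possible to filter out weak classifications later, for example in a Kibana query.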

USE CASES

Depending on settings, this search solution could be useful for:

  • Looking for specific documents in a Data Lake
  • Analyzing and classifying different files/documents
  • Extracting text/data/key phrases and words from images and PDFs
  • Sentiment analysis in emails, tweets, social network comments, websites, business proposals, employee feedback or quality assessments, insurance claims, service companies, etc.
  • When combined with other ML tools, it can be integrated with voice-to-text applications/solutions, automatic translations, transcriptions, etc.

There is room for improvement in the code (converting from CloudFormation to Terraform) and in the integration with other ML services (Alexa, Lex, Translate, Transcribe). In addition, indexing has to be reviewed (e.g. how often the indexing is revised).

AWS solutions like Comprehend, Textract, S3 and Lambda are very versatile and accessible. Their applications could be customized to different industries and organizations, built to retrain themselves and optimize operations and analytics.

We are an AWS Premier Consulting Partner company. Since 2009 we’ve been delivering business outcomes and we want to share our experience with you. Enjoy!