An integrated search solution powered by ML

Oct 18, 2021

by Andrés Villarroel and Santiago Vazquez, Cloud Engineering

AWS Textract is an AWS service that lets developers detect and extract text (including handwriting), forms, and tables from PDF, PNG, and JPG files.

The service is built on deep learning models that AWS trains and keeps refining with the millions of images and videos its image and video recognition services process every day.

It is used for:

OCR
Extracting text by means of NLP (Natural Language Processing): it creates word tags to predict and suggest similar patterns
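
As a rough illustration of the OCR use case, the sketch below pulls the detected lines of text out of a Textract response. It assumes the response shape returned by Textract's text detection calls (a "Blocks" list whose entries carry a "BlockType" and, for lines and words, a "Text" field). The function name and the sample data are hypothetical and hand-written for this example, not real service output.

```python
from typing import Any, Dict, List


def extract_lines(textract_response: Dict[str, Any]) -> List[str]:
    """Return the text of every LINE block in a Textract response."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Hand-written sample shaped like a Textract response (not real output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Cloud Engineer"},
        {"BlockType": "WORD", "Text": "Cloud"},
        {"BlockType": "WORD", "Text": "Engineer"},
        {"BlockType": "LINE", "Text": "5 years of AWS experience"},
    ]
}

print("\n".join(extract_lines(sample_response)))
```

Keeping only LINE blocks avoids repeating each word, since Textract reports the same text again at WORD level.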

Benefits

Low cost
Models are trained by AWS and consumed through an API
It is scalable and can be integrated with other native AWS services

NLP (natural language processing) is a branch of artificial intelligence that aims to convert text into structured data, so that machines can comprehend and reproduce human language through the analysis, understanding, and generation of natural language.

Used for:

Translations
Autofill/predictive text
Chatbots
Spell checking
Looking for similar words in search engines
Analyzing and understanding voice messages converted to text
Spam filters
Virtual assistants like Google Assistant, Siri, Alexa

AWS Comprehend

It is an NLP service trained and managed by AWS on millions of data points collected from diverse sources. Its learning can be enhanced with AutoML and training on custom data.

Used for:

Opinion, sentiment and semantic analysis
Entity recognition
Medical data (detecting acronyms, shortened names of viruses and bacteria, etc.)
Language detection
Key phrases and words

Benefits

It makes it possible to quickly track down relevant information in texts
It can be trained according to the data that needs to be processed
It can be integrated with other AWS services (such as Translate, Transcribe, Polly, Lex, etc.)
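
To make the key phrase use case concrete, here is a minimal sketch that calls Comprehend's DetectKeyPhrases operation and keeps only the phrases above a confidence threshold. It assumes a boto3-style client exposing detect_key_phrases(Text=..., LanguageCode=...). The helper name, the 0.9 threshold, and the FakeComprehend stand-in (used so the example runs without AWS credentials) are all hypothetical.

```python
from typing import Any, List


def top_key_phrases(client: Any, text: str, language_code: str = "en",
                    min_score: float = 0.9) -> List[str]:
    """Call DetectKeyPhrases and keep the phrases scored >= min_score."""
    response = client.detect_key_phrases(Text=text, LanguageCode=language_code)
    return [
        phrase["Text"]
        for phrase in response["KeyPhrases"]
        if phrase["Score"] >= min_score
    ]


class FakeComprehend:
    """Stand-in for boto3.client("comprehend") so the sketch runs offline."""

    def detect_key_phrases(self, Text: str, LanguageCode: str) -> dict:
        # Hand-written response; a real call returns scores from the service.
        return {
            "KeyPhrases": [
                {"Text": "AWS Lambda", "Score": 0.99},
                {"Text": "a team", "Score": 0.55},
                {"Text": "Terraform", "Score": 0.97},
            ]
        }


print(top_key_phrases(FakeComprehend(), "Built pipelines on AWS Lambda"))
```

In a real deployment the client would come from boto3.client("comprehend"); passing it in as a parameter keeps the filtering logic easy to test.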

Project Architecture

We needed an integrated search solution that could extract information and then classify, comprehend, and index it for later use. We worked with the following project: https://github.com/aws-samples/amazon-textract-comprehend-OCRimage-search-and-analyze

Issues

The methods that interact with Textract had to be adjusted to work with PDF files
Service limits
Textract and Comprehend jobs must operate asynchronously
Lambdas would time out while waiting for Textract and Comprehend jobs to finish
A Comprehend sentiment/entity analysis job delivers its result as an output.tar.gz file, which was an issue when the data had to be indexed into ES (Elasticsearch)
Splitting the Lambda into three functions with different triggers caused permission, role, CloudFormation, and variable issues
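
One way to avoid the time-out issue is to have the Lambda start the job and return immediately instead of waiting for the result. The sketch below is an assumption about how such a handler could look, not the project's actual code: it reads the bucket and key from an S3 event and calls Textract's asynchronous StartDocumentTextDetection operation. The optional textract parameter, the FakeTextract stand-in, and the sample event are hypothetical and only there so the example runs without AWS access. How the finished result reaches the bucket is outside the scope of this sketch.

```python
from typing import Any, Dict
from urllib.parse import unquote_plus


def handler(event: Dict[str, Any], context: Any,
            textract: Any = None) -> Dict[str, str]:
    """Start a Textract job for the uploaded object and return at once."""
    if textract is None:
        import boto3  # imported lazily so the sketch also runs without boto3
        textract = boto3.client("textract")

    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])  # S3 event keys are URL-encoded

    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # No polling here: the job keeps running after the Lambda returns.
    return {"bucket": bucket, "key": key, "job_id": response["JobId"]}


class FakeTextract:
    """Stand-in client that records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.calls = []

    def start_document_text_detection(self, **kwargs: Any) -> Dict[str, str]:
        self.calls.append(kwargs)
        return {"JobId": "job-123"}


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "docs-bucket"},
                "object": {"key": "cvs/jane+doe.pdf"}}}
    ]
}
print(handler(sample_event, None, textract=FakeTextract()))
```

Because the handler only submits the job, its run time stays short no matter how long Textract takes to process the document.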

Final Project Architecture for the solution

FLOW

When a PDF file is uploaded to the bucket, a trigger runs a Lambda function, which in turn starts a Textract job
When that job finishes, the result is written to the bucket as a TXT file, which triggers a second Lambda that calls the Comprehend API
Once the Comprehend job finishes, the resulting .tar.gz file is written to the bucket, triggering the last Lambda, which extracts the content of output.tar.gz and indexes it into the ES cluster
Finally, the indexed content can be queried from Kibana
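
The last step of the flow can be sketched as follows: unpack the archive in memory, parse its content, and build a request body for the Elasticsearch bulk API. This is an illustration under assumptions, not the project's code. It assumes the archive holds a file with one JSON document per line. The function names, the "documents" index name, and the sample archive built by make_sample_archive are hypothetical.

```python
import io
import json
import tarfile
from typing import Any, Dict, List


def read_comprehend_output(archive_bytes: bytes) -> List[Dict[str, Any]]:
    """Unpack output.tar.gz in memory and parse its JSON-lines payload."""
    records: List[Dict[str, Any]] = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            payload = archive.extractfile(member).read().decode("utf-8")
            records.extend(
                json.loads(line) for line in payload.splitlines() if line.strip()
            )
    return records


def to_bulk_body(records: List[Dict[str, Any]], index: str) -> str:
    """Build an Elasticsearch _bulk body: an action line, then the document."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(record))
    return "".join(line + "\n" for line in lines)


def make_sample_archive() -> bytes:
    """Create a tiny tar.gz in memory (hand-written data, not real output)."""
    payload = (
        b'{"File": "cv1.txt", "KeyPhrases": '
        b'[{"Text": "AWS Lambda", "Score": 0.99}]}\n'
    )
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="output")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


records = read_comprehend_output(make_sample_archive())
print(to_bulk_body(records, index="documents"))
```

In the Lambda, archive_bytes would be the body of the S3 object that triggered the function, and the bulk body would be sent to the cluster's _bulk endpoint.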

DEMO DEPLOYMENT

1. Run the zip_generator.sh script. This generates the ZIP files for the Lambda functions and for the layer.
2. Create an S3 bucket and upload the generated ZIPs.
3. Run template-generator.sh. This replaces the value BUCKET_CODE in the CloudFormation template (stack.yaml) with the name of the bucket created for the code.
4. Deploy the CloudFormation template.yaml generated in step 3 on AWS.
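
The substitution performed by template-generator.sh can be pictured with the short sketch below. It is not the script itself (which is a shell script and is not reproduced in this post), only a Python illustration of the same idea. The function name, the bucket name, and the template fragment are hypothetical.

```python
def render_template(template_text: str, bucket_name: str,
                    placeholder: str = "BUCKET_CODE") -> str:
    """Replace every placeholder occurrence with the code bucket's name."""
    if placeholder not in template_text:
        raise ValueError(f"placeholder {placeholder!r} not found in template")
    return template_text.replace(placeholder, bucket_name)


# Hypothetical fragment; the real stack.yaml is not shown in this post.
stack_fragment = "Code:\n  S3Bucket: BUCKET_CODE\n  S3Key: lambda.zip\n"
print(render_template(stack_fragment, "my-code-bucket"))
```

Failing loudly when the placeholder is missing helps catch the case where the wrong template file is passed in.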

For the analysis, two options were considered:

1. Key phrase analysis of 5 different CVs, which would let us navigate a repository of candidate CVs and run specific searches by experience or by a given technology:

Result search

2. Sentiment analysis of song lyrics, to determine whether each one was positive, negative, neutral, or mixed.

Analysis of results
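
For the sentiment option, a result can be turned into a short label for display or indexing. The sketch below assumes the response shape of Comprehend's sentiment detection (a "Sentiment" label plus a "SentimentScore" map with Positive, Negative, Neutral, and Mixed entries). The function name and the sample values are hypothetical and are not results from our test.

```python
from typing import Any, Dict


def summarize_sentiment(response: Dict[str, Any]) -> str:
    """Format the label Comprehend chose together with its confidence."""
    label = response["Sentiment"]  # POSITIVE, NEGATIVE, NEUTRAL, or MIXED
    # SentimentScore keys are capitalized: Positive, Negative, Neutral, Mixed.
    score = response["SentimentScore"][label.capitalize()]
    return f"{label} ({score:.0%})"


# Hand-written sample shaped like a sentiment response (not real output).
sample_sentiment = {
    "Sentiment": "MIXED",
    "SentimentScore": {
        "Positive": 0.30,
        "Negative": 0.25,
        "Neutral": 0.05,
        "Mixed": 0.40,
    },
}

print(summarize_sentiment(sample_sentiment))
```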

USE CASES

Depending on settings, this search solution could be useful for:

Looking for specific documents in a Data Lake
Analyzing and classifying different files/documents
Extracting text, data, key phrases, and words from images and PDFs
Sentiment analysis in emails, tweets, social network comments, websites, business proposals, employee feedback or quality assessments, insurance claims, service companies, etc.
When combined with other ML tools, it can be integrated with voice-to-text applications/solutions, automatic translation, transcription, etc.

The solution could be improved in the code (converting from CloudFormation to Terraform) and through integration with other ML services (Alexa, Lex, Translate, Transcribe). Indexing also needs to be reviewed (e.g. how frequently the indexing is revisited).

AWS solutions like Comprehend, Textract, S3 and Lambda are very versatile and accessible. Their applications could be customized to different industries and organizations, built to retrain themselves and optimize operations and analytics.

Written by EDRANS Stories