Document Search Engine On AWS

Indexer and Scraper

See the Medium blog post for a detailed explanation:

https://medium.com/@m.zaradzki/build-your-own-document-search-engine-using-amazon-web-services-82d5b165d96c

Notes on using non-default packages in Lambda Node.js

To use non-default packages such as cheerio (an HTML DOM parser), you need to upload your Lambda code to AWS as a zip file. If the zip is too large because of the packages, you won't be able to edit or test the code from the console.

However, in that case you can use lambda-local to emulate Lambda locally. See this link: https://github.com/ashiina/lambda-local
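
For example, after zipping index.js together with node_modules for deployment, the same handler can be exercised locally with lambda-local. A minimal sketch, assuming a recent lambda-local version; the handler file, exported handler name, and test event are assumptions:

```javascript
// Run the Lambda handler locally instead of testing in the AWS console.
const lambdaLocal = require('lambda-local');

lambdaLocal.execute({
  event: { url: 'https://example.com/page.html' }, // hypothetical test event
  lambdaPath: './index.js',                        // file containing the handler
  lambdaHandler: 'handler',                        // exported handler name
  timeoutMs: 10000
}).then(result => console.log(result))
  .catch(err => console.error(err));
```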

Useful links

What we have so far as Lambda functionality

  • Function that writes updates to DynamoDB (see the sketch after this list)
  • Function that writes files to S3
  • Function that reads file meta-data from S3
  • Function that reads HTML
  • Function that downloads files from the web
  • Function that writes/reads messages to/from SQS
  • Function that queries CloudSearch
  • Function that indexes documents into CloudSearch
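
As an illustration of the first item, here is a minimal sketch of such a handler; the table name and item fields are hypothetical:

```javascript
// Minimal Lambda handler that writes one update item to DynamoDB.
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = (event, context, callback) => {
  const params = {
    TableName: 'Documents',            // hypothetical table name
    Item: {
      docId: event.docId,              // partition key
      url: event.url,
      scrapedAt: new Date().toISOString()
    }
  };
  dynamo.put(params, (err, data) => {
    if (err) callback(err);
    else callback(null, data);
  });
};
```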

Description (changing quickly)

  • legiscrap0 writes all scraped links to SQS
  • legiscrap_manager1 picks up one SQS message and delegates it to legiscrap1 (triggered by a CloudWatch CRON rule)
  • legiscrap1 fetches an HTML page or attachment online and saves it to S3
  • docIndexer listens to S3 file-addition events and sends index commands to CloudSearch (see the sketch after this list)
  • docSearcher could be exposed through an API to query CloudSearch
  • indexCleaner scans CloudSearch documents and checks that they all have a matching S3 file (triggered by a CloudWatch CRON rule)
  • indexCatcher scans S3 and checks that all files are in the CloudSearch index (triggered by a CloudWatch CRON rule)
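
For the docIndexer step, a rough sketch of the S3-event-to-CloudSearch flow; the domain endpoint and document field names are assumptions, not the project's actual values:

```javascript
// Sketch of docIndexer: on an S3 "ObjectCreated" event, read the file
// and send an "add" batch to the CloudSearch document endpoint.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();
const csd = new AWS.CloudSearchDomain({
  endpoint: 'doc-mydomain-xxxx.eu-west-1.cloudsearch.amazonaws.com' // placeholder
});

exports.handler = (event, context, callback) => {
  const record = event.Records[0];
  const bucket = record.s3.bucket.name;
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

  s3.getObject({ Bucket: bucket, Key: key }, (err, data) => {
    if (err) return callback(err);
    const batch = [{
      type: 'add',
      id: key.replace(/[^a-z0-9_-]/gi, '_'),   // CloudSearch ids are restricted
      fields: {
        content: data.Body.toString('utf-8'),  // assumed index fields
        s3key: key
      }
    }];
    csd.uploadDocuments({
      contentType: 'application/json',
      documents: JSON.stringify(batch)
    }, callback);
  });
};
```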

Note that:

  • indexCatcher uses an SQS queue to keep track of its position in the S3 bucket as it processes it in chunks (see the sketch below)
  • indexCleaner uses an SQS queue to keep track of its position in the CloudSearch index
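
A sketch of this position-tracking pattern for indexCatcher, assuming the queue holds the last S3 continuation token; the queue URL and bucket name are placeholders:

```javascript
// Read the last S3 continuation token from SQS, list the next chunk of
// the bucket, then push the new token back for the next CRON invocation.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();
const s3 = new AWS.S3();
const QUEUE_URL = 'https://sqs.eu-west-1.amazonaws.com/123456789012/indexCatcherPosition'; // placeholder

exports.handler = (event, context, callback) => {
  sqs.receiveMessage({ QueueUrl: QUEUE_URL, MaxNumberOfMessages: 1 }, (err, data) => {
    if (err) return callback(err);
    const msg = (data.Messages || [])[0];

    const params = { Bucket: 'my-documents-bucket', MaxKeys: 100 }; // placeholder bucket
    if (msg) params.ContinuationToken = msg.Body; // no message => start from the beginning

    s3.listObjectsV2(params, (err, listing) => {
      if (err) return callback(err);
      // ... check each key in listing.Contents against the CloudSearch index ...
      if (msg) sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle }, () => {});
      if (listing.IsTruncated) {
        sqs.sendMessage({ QueueUrl: QUEUE_URL, MessageBody: listing.NextContinuationToken }, callback);
      } else {
        callback(null, 'scan complete');
      }
    });
  });
};
```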

Browser code credentials

To invoke Lambda from the browser, the page needs to provide credentials in the form of an Identity Pool managed by AWS Cognito. See: http://docs.aws.amazon.com/cognito/latest/developerguide/identity-pools.html. The pool allows you to control permissions for both authenticated and unauthenticated users via specific roles.
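
A minimal browser-side sketch; the identity pool id and payload are placeholders, and the AWS SDK for JavaScript is assumed to be loaded via a script tag:

```javascript
// Obtain unauthenticated Cognito credentials, then invoke a Lambda directly.
AWS.config.region = 'eu-west-1';
AWS.config.credentials = new AWS.CognitoIdentityCredentials({
  IdentityPoolId: 'eu-west-1:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' // placeholder pool id
});

const lambda = new AWS.Lambda();
lambda.invoke({
  FunctionName: 'docSearcher',                   // from the list above
  Payload: JSON.stringify({ query: 'tax law' })  // hypothetical payload
}, (err, data) => {
  if (err) console.error(err);
  else console.log(JSON.parse(data.Payload));
});
```

The unauthenticated role attached to the pool must grant lambda:InvokeFunction on the target function for this call to succeed.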
