oliver006 / Elasticsearch Gmail

Index your Gmail Inbox with Elasticsearch

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Elasticsearch Gmail

Elasticsearch Hn
Index & Search Hacker News using Elasticsearch and the HN API
Stars: ✭ 92 (-95.32%)
Mutual labels:  elasticsearch, tutorial, tornado
Docker Elk Tutorial
docker-elk-tutorial + django + logging
Stars: ✭ 69 (-96.49%)
Mutual labels:  elasticsearch, tutorial
Elasticsearch Test Data
Generate and upload test data to Elasticsearch for performance and load testing
Stars: ✭ 194 (-90.12%)
Mutual labels:  elasticsearch, tornado
maildir2gmail
Maildir 2 Gmail
Stars: ✭ 14 (-99.29%)
Mutual labels:  gmail, gmail-inbox
Elasticsearch tutorial
An action-packed, example-based ElasticSearch tutorial
Stars: ✭ 133 (-93.23%)
Mutual labels:  elasticsearch, tutorial
Docker Elasticsearch
Dockerfile for a base Elasticsearch image to be extended by others (allow to install plug-ins, change configuration, etc.)
Stars: ✭ 162 (-91.75%)
Mutual labels:  elasticsearch
React Letter
✉️ Display e-mails in your React.js projects. (Targets Gmail rendering.)
Stars: ✭ 164 (-91.65%)
Mutual labels:  gmail
Dcos Commons
DC/OS SDK is a collection of tools, libraries, and documentation for easy integration of technologies such as Kafka, Cassandra, HDFS, Spark, and TensorFlow with DC/OS.
Stars: ✭ 162 (-91.75%)
Mutual labels:  elasticsearch
Flask Sqlalchemy Tutorial
🐍 ℹ️ Create and manage data in your Flask app via a SQL database.
Stars: ✭ 162 (-91.75%)
Mutual labels:  tutorial
Pegparser
💡 Build your own programming language! A C++17 PEG parser generator supporting parser combination, memoization, left-recursion and context-dependent grammars.
Stars: ✭ 164 (-91.65%)
Mutual labels:  filter
Adventurelookup
Adventure Lookup Main Repository
Stars: ✭ 164 (-91.65%)
Mutual labels:  elasticsearch
Dashbuilder
Dashboard composition tooling based on the Uberfire framework
Stars: ✭ 163 (-91.7%)
Mutual labels:  elasticsearch
Wipi
A blog system with separated front and back ends and server-side rendering, built with Next.js + NestJS + TypeScript + MySQL
Stars: ✭ 163 (-91.7%)
Mutual labels:  elasticsearch
Muuri React
The layout engine for React
Stars: ✭ 163 (-91.7%)
Mutual labels:  filter
Pifpaf
Python fixtures and daemon managing tools for functional testing
Stars: ✭ 161 (-91.8%)
Mutual labels:  elasticsearch
D3tutorial
📊📈 A D3 v6 tutorial - interactive bar chart and multiple coordinated views (MCV)
Stars: ✭ 163 (-91.7%)
Mutual labels:  tutorial
Cmake Tutorial
The official CMake tutorial
Stars: ✭ 162 (-91.75%)
Mutual labels:  tutorial
Express Cassandra
Cassandra ORM/ODM/OGM for Node.js with optional support for Elassandra & JanusGraph
Stars: ✭ 163 (-91.7%)
Mutual labels:  elasticsearch
Spring Boot Leaning
The most complete course code for Spring Boot 2.X
Stars: ✭ 2,008 (+2.24%)
Mutual labels:  elasticsearch
Haskell Scotty Realworld Example App
Exemplary real world backend API built with Haskell/Scotty https://realworld.io
Stars: ✭ 162 (-91.75%)
Mutual labels:  tutorial

Elasticsearch For Beginners: Indexing your Gmail Inbox (and more: Supports any mbox and MH mailboxes)

What's this all about?

I recently looked at my Gmail inbox and noticed that I have well over 50k emails, taking up about 12GB of space, but there is no good way to tell which emails take up the most space, who sent them, who emails me the most, etc.

The goal of this tutorial is to load an entire Gmail inbox into Elasticsearch using bulk indexing and then start querying the cluster to get a better picture of what's going on.

Prerequisites

Set up Elasticsearch and make sure it's running at http://localhost:9200

A quick way to run Elasticsearch is with Docker (the CORS settings aren't strictly needed, but they come in handy if you want to use e.g. dejavu to explore the index):

docker run --name es -d -p 9200:9200 -e http.port=9200 -e http.cors.enabled=true -e 'http.cors.allow-origin=*' -e http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization -e http.cors.allow-credentials=true -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch-oss:7.6.1
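
To confirm the node is up before indexing, here is a quick check, as a minimal sketch using Tornado's synchronous HTTP client (which requirements.txt already pulls in):

from tornado.httpclient import HTTPClient

# quick sanity check that Elasticsearch answers on localhost:9200
client = HTTPClient()
response = client.fetch("http://localhost:9200/")
print(response.code, response.body.decode()[:200])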

I use Python and Tornado for the scripts that import and query the data, plus beautifulsoup4 for stripping HTML/JS/CSS (if you want to use the body indexing flag).

Install the dependencies by running:

pip3 install -r requirements.txt

Aight, where do we start?

First, go to Google Takeout and download your Gmail mailbox; depending on the number of emails you have accumulated, this might take a while. There's also a small sample.mbox file included in the repo for you to play around with while you're waiting for Google to prepare your download.

The downloaded archive is in mbox format, and Python's standard library provides the mailbox module for working with it, so that part is easy.
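
For instance, listing the subject lines of the bundled sample file takes just a few lines:

import mailbox

# print the subject of every message in the bundled sample file
mbox = mailbox.mbox('sample.mbox')
for msg in mbox:
    print(msg['subject'])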

You can run the code (assuming Elasticsearch is running at localhost:9200) with the sample mbox file like this:

$ python3 src/index_emails.py --infile=sample.mbox
[I index_emails:173] Starting import from file sample.mbox
[I index_emails:101] Upload: OK - upload took: 1033ms, total messages uploaded:      3
[I index_emails:197] Import done - total count 16
$

Note: All examples focus on Gmail inboxes. Substitute any --infile= parameters with --indir= pointing to an MH directory to make them work with MH mailboxes instead.

The Source Code

The overall program will look something like this:

mbox = mailbox.mbox('emails.mbox')  # or mailbox.MH('inbox/')

for msg in mbox:
    item = convert_msg_to_json(msg)
    upload_item_to_es(item)

print("Done!")

Ok, tell me more about the details

The full Python code is here: src/index_emails.py

Turn mailbox into JSON

First, we have to turn the messages into JSON so we can insert them into Elasticsearch. Here is some sample code that was very useful when it came to normalizing and cleaning up the data.

A good first step:

def convert_msg_to_json(msg):
    result = {'parts': []}
    for (k, v) in msg.items():
        # header values may come back as bytes or str depending on the message
        result[k.lower()] = v.decode('utf-8', 'ignore') if isinstance(v, bytes) else v

You also want to parse and normalize the From, To, Cc, and Bcc email addresses:

for k in ['to', 'cc', 'bcc']:
    if not result.get(k):
        continue
    emails_split = (result[k]
                    .replace('\n', '').replace('\t', '')
                    .replace('\r', '').replace(' ', '')
                    .split(','))
    result[k] = [normalize_email(e) for e in emails_split]

if "from" in result:
    result['from'] = normalize_email(result['from'])
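
The normalize_email helper isn't shown in this README; a minimal sketch based on email.utils.parseaddr (an assumption, the real implementation may differ) could be:

import email.utils

def normalize_email(e):
    # extract the bare address from forms like '"Bob" <bob@example.com>'
    _, addr = email.utils.parseaddr(e)
    return addr.lower()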

Elasticsearch expects timestamps in milliseconds since the epoch, so let's convert the date accordingly:

if "date" in result:
    tt = email.utils.parsedate_tz(result['date'])
    result['date_ts'] = int(calendar.timegm(tt) - tt[9]) * 1000
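
For example, a typical Date header converts like this:

import calendar
import email.utils

tt = email.utils.parsedate_tz('Fri, 10 Jan 2014 12:00:00 -0500')
# tt[:9] is the parsed time tuple, tt[9] the UTC offset in seconds (-18000 here)
date_ts = int(calendar.timegm(tt) - tt[9]) * 1000
print(date_ts)  # milliseconds since the epoch, matching Elasticsearch's epoch_millis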

We also need to split up and normalize the labels:

labels = []
if "x-gmail-labels" in result:
    labels = [l.strip().lower() for l in result["x-gmail-labels"].split(',')]
    del result["x-gmail-labels"]
result['labels'] = labels

Email size is also interesting, so let's break that out:

# json_msg here is the dict produced by convert_msg_to_json (called result above)
parts = json_msg.get("parts", [])
json_msg['content_size_total'] = 0
for part in parts:
    json_msg['content_size_total'] += len(part.get('content', ""))

Index the data with Elasticsearch

The simplest approach is one PUT request per item:

@gen.coroutine
def upload_item_to_es(item):
    es_url = "http://localhost:9200/gmail/email/%s" % item['message-id']
    request = HTTPRequest(es_url, method="PUT", body=json.dumps(item), request_timeout=10)
    response = yield http_client.fetch(request)
    if response.code not in [200, 201]:
        print("\nfailed to add item %s" % item['message-id'])

However, Elasticsearch provides a better method for importing large chunks of data: bulk indexing. Instead of making an HTTP request per document and indexing each one individually, we batch them in chunks of e.g. 1,000 documents and then index them.
Bulk messages are of the format:

cmd\n
doc\n
cmd\n
doc\n
...

where cmd is the control message for each doc we want to index. For our example, cmd would look like this:

cmd = {'index': {'_index': 'gmail', '_type': 'email', '_id': item['message-id']}}
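
Serialized, each document therefore occupies two lines in the bulk body; for example (the message-id and fields below are illustrative):

{"index": {"_index": "gmail", "_type": "email", "_id": "<abc123@example.com>"}}
{"message-id": "<abc123@example.com>", "subject": "Hello", "from": "bob@example.com"}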

The final code looks something like this:

upload_data = list()
for msg in mbox:
    item = convert_msg_to_json(msg)
    upload_data.append(item)
    if len(upload_data) == 100:
        upload_batch(upload_data)
        upload_data = list()

if upload_data:
    upload_batch(upload_data)

and

def upload_batch(upload_data):

    upload_data_txt = ""
    for item in upload_data:
        cmd = {'index': {'_index': 'gmail', '_type': 'email', '_id': item['message-id']}}
        upload_data_txt += json.dumps(cmd) + "\n"
        upload_data_txt += json.dumps(item) + "\n"

    # recent Elasticsearch versions require an explicit ndjson content type
    request = HTTPRequest("http://localhost:9200/_bulk", method="POST",
                          headers={"Content-Type": "application/x-ndjson"},
                          body=upload_data_txt, request_timeout=240)
    response = http_client.fetch(request)
    result = json.loads(response.body)
    # 'errors' is always present in a bulk response; check its boolean value
    if result.get('errors'):
        print(result['errors'])

Ok, show me some data!

Once all your emails are indexed, we can start running queries. Note that the query examples below use an older Elasticsearch DSL; on recent versions (such as the 7.x Docker image above), top-level filter and search_type=count have been replaced by bool query filters and size=0 aggregations, so you may need to adapt them.

Filters

If you want to search for emails from the last 6 months, you can use the range filter and search for gte the current time (now) minus 6 months:

curl -XGET 'http://localhost:9200/gmail/email/_search?pretty' -d '{
"filter": { "range" : { "date_ts" : { "gte": "now-6M" } } } }
'

or you can filter for all emails from 2013 by using gte and lt:

curl -XGET 'http://localhost:9200/gmail/email/_search?pretty' -d '{
"filter": { "range" : { "date_ts" : { "gte": "2013-01-01T00:00:00.000Z", "lt": "2014-01-01T00:00:00.000Z" } } } }
'

You can also quickly query for certain fields via the q parameter. This example shows you all your Amazon shipping info emails:

curl "localhost:9200/gmail/email/_search?pretty&q=from:[email protected]"
Aggregation queries

Aggregation queries let us bucket data by a given key and count the number of messages per bucket. For example, the number of messages grouped by recipient:

curl -XGET 'http://localhost:9200/gmail/email/_search?pretty&search_type=count' -d '{
"aggs": { "emails": { "terms" : { "field" : "to",  "size": 10 }
} } }
'

Result:

"aggregations" : {
"emails" : {
  "buckets" : [ {
       "key" : "[email protected]",
       "doc_count" : 1920
  }, { "key" : "[email protected]",
       "doc_count" : 1326
  }, { "key" : "[email protected]",
       "doc_count" : 263
  }, { "key" : "[email protected]",
       "doc_count" : 232
  }
  ...
  ]
}

This one gives us the number of emails per label:

curl -XGET 'http://localhost:9200/gmail/email/_search?pretty&search_type=count' -d '{
"aggs": { "labels": { "terms" : { "field" : "labels",  "size": 10 }
} } }
'

Result:

"hits" : {
  "total" : 51794,
},
"aggregations" : {
"labels" : {
  "buckets" : [       {
       "key" : "important",
       "doc_count" : 15430
  }, { "key" : "github",
       "doc_count" : 4928
  }, { "key" : "sent",
       "doc_count" : 4285
  }, { "key" : "unread",
       "doc_count" : 510
  },
  ...
   ]
}

Using a date histogram, you can also count how many emails you sent and received per year:

curl -s "localhost:9200/gmail/email/_search?pretty&search_type=count" -d '
{ "aggs": {
    "years": {
      "date_histogram": {
        "field": "date_ts", "interval": "year"
}}}}
'

Result:

"aggregations" : {
"years" : {
  "buckets" : [ {
    "key_as_string" : "2004-01-01T00:00:00.000Z",
    "key" : 1072915200000,
    "doc_count" : 585
  }, {
...
  }, {
    "key_as_string" : "2013-01-01T00:00:00.000Z",
    "key" : 1356998400000,
    "doc_count" : 12832
  }, {
    "key_as_string" : "2014-01-01T00:00:00.000Z",
    "key" : 1388534400000,
    "doc_count" : 7283
  } ]
}

Write aggregation queries to work out how much you spent on Amazon/Steam (this assumes order totals have been parsed into an order_details object at index time; a hypothetical sketch of that step follows the query):

GET _search
{
  "query": { "match_all": {} },
  "size": 0,
  "aggs": {
    "group_by_company": {
      "terms": { "field": "order_details.merchant" },
      "aggs": {
        "total_spent": {
          "sum": { "field": "order_details.order_total" }
        },
        "postage": {
          "sum": { "field": "order_details.postage" }
        }
      }
    }
  }
}
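
The order_details fields used above aren't produced by the indexer as shown. A hypothetical extraction step (the field names, merchant value, and regex are illustrative assumptions, not part of the project) might look like:

import re

def extract_order_details(item):
    # hypothetical helper: scan the message body for an order total and
    # store it under order_details so the aggregation above has data to sum
    body = " ".join(p.get('content', '') for p in item.get('parts', []))
    m = re.search(r'Order Total:\s*\$([0-9,]+\.[0-9]{2})', body)
    if m:
        item['order_details'] = {
            'merchant': 'amazon',  # illustrative; derive from the From address in practice
            'order_total': float(m.group(1).replace(',', '')),
        }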

Todo

  • more interesting queries
  • schema tweaks
  • multi-part message parsing
  • blurb about performance
  • ...

Feedback

Open a pull request or an issue!
