
codelucas / Cracking The Da Vinci Code With Google Interview Problems And Nlp In Python

License: MIT
A guide on how to crack combinatorics puzzles shown in The Da Vinci Code movie using CS fundamentals and NLP

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Cracking The Da Vinci Code With Google Interview Problems And Nlp In Python

Text Classification Keras
📚 Text classification library with Keras
Stars: ✭ 53 (-29.33%)
Mutual labels:  nlp-machine-learning
Javascript Interview Questions Developer
A list of JavaScript interview questions 📝
Stars: ✭ 62 (-17.33%)
Mutual labels:  interview-questions
Intent classifier
Stars: ✭ 67 (-10.67%)
Mutual labels:  nlp-machine-learning
Interviewjam
Interview Questions & Coding Challenges
Stars: ✭ 55 (-26.67%)
Mutual labels:  interview-questions
Feguide
[Front-end interview questions + front-end learning + interview guide] A collection covering the core knowledge most front-end engineers need to master. This project helps front-end developers who are job hunting review the fundamentals; even if you aren't looking for a job, you can still work through these interview questions to consolidate your front-end skills.
Stars: ✭ 1,101 (+1368%)
Mutual labels:  interview-questions
Aiops platform
An Artificial Intelligence Platform for IT Operations.
Stars: ✭ 63 (-16%)
Mutual labels:  nlp-machine-learning
Devops Exercises
Linux, Jenkins, AWS, SRE, Prometheus, Docker, Python, Ansible, Git, Kubernetes, Terraform, OpenStack, SQL, NoSQL, Azure, GCP, DNS, Elastic, Network, Virtualization. DevOps Interview Questions
Stars: ✭ 20,905 (+27773.33%)
Mutual labels:  interview-questions
Coding Ninjas Data Structures And Algorithms In Python
Solved problems and assignments of DSA course taught by Coding Ninjas team
Stars: ✭ 70 (-6.67%)
Mutual labels:  interview-questions
How To Mine Newsfeed Data And Extract Interactive Insights In Python
A practical guide to topic mining and interactive visualizations
Stars: ✭ 61 (-18.67%)
Mutual labels:  nlp-machine-learning
Ctci 6th Edition
Cracking the Coding Interview 6th Ed. Solutions
Stars: ✭ 9,328 (+12337.33%)
Mutual labels:  interview-questions
Interview Guide
Coding/technical interview guide: data structures, algorithms, complexity analyses, interview questions
Stars: ✭ 54 (-28%)
Mutual labels:  interview-questions
Argument Reasoning Comprehension Task
The Argument Reasoning Comprehension Task: Source codes & Datasets
Stars: ✭ 57 (-24%)
Mutual labels:  nlp-machine-learning
Leetcode
👏🏻 leetcode solutions for Humans™
Stars: ✭ 1,129 (+1405.33%)
Mutual labels:  interview-questions
Iosinterviewquestions
An ongoing series of basic iOS interview questions, with answers
Stars: ✭ 53 (-29.33%)
Mutual labels:  interview-questions
Nlp Paper
Dialogue and speech topics within natural language processing: curated papers (with reading notes), model reproductions, and data processing (code in both TensorFlow and PyTorch versions)
Stars: ✭ 67 (-10.67%)
Mutual labels:  nlp-machine-learning
Java
Repository for Java code and algorithms. Star the repo too.
Stars: ✭ 53 (-29.33%)
Mutual labels:  interview-questions
Algorithms Cheatsheet Resources
🤓All the geeky stuffs you need to know at one place!
Stars: ✭ 60 (-20%)
Mutual labels:  interview-questions
Interview Prep
Everything you need to know to get the job
Stars: ✭ 69 (-8%)
Mutual labels:  interview-questions
Algorithm qa
Python implementations of instructor Zuo Chengyun's (左程云) optimal algorithm solutions
Stars: ✭ 68 (-9.33%)
Mutual labels:  interview-questions
Php Interview Best Practices In China
📙 A compilation of PHP interview knowledge points
Stars: ✭ 1,133 (+1410.67%)
Mutual labels:  interview-questions

Cracking the Da Vinci Code with Google interview problems and NLP in Python

Written by Lucas Ou-Yang

I was rewatching The Da Vinci Code the other day and came across an incredible scene near the start where Robert and Sophie, the two lead protagonists playing detective, stumble across an anagram puzzle in the Louvre Museum in Paris. It was a dark night and their lives depended on cracking the code quickly! Silas, the ruthless Opus Dei zealot, was out for blood.

Some vocabulary for those who aren't familiar:

An anagram is a word, phrase, or name formed by rearranging the letters of another word. For example: car => rac. These two words are anagrams of each other.
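As a quick sanity check, two strings are anagrams exactly when their sorted letters match. Here is a minimal sketch (the is_anagram helper is mine, not part of the original code):

def is_anagram(a, b):
    # anagrams have identical multisets of characters
    return sorted(a.lower()) == sorted(b.lower())

print(is_anagram('car', 'rac'))  # True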

In the Louvre that night, there was a dying man on the floor and beside him was a seemingly meaningless string of text.

O, DRACONIAN DEVIL! OH, LAME SAINT

If we looked these two phrases up online, in any encyclopedia or reference manual, nothing meaningful would show up. Robert quickly arrived at the correct conclusion in the movie: these two phrases were anagrams disguising two hidden, more meaningful phrases. (What likely gave it away is the different spelling of OH and O between the two phrases.)

For those of us plebs who don't have Ph.D.s in symbology, how can we systematically find all the meaningful anagrams of a phrase like "O, DRACONIAN DEVIL!"?

Thankfully, we have computers and algorithms to help! Let's tackle this problem step by step:

Disclaimer: For the following algorithm we are assuming that:

  • Spaces count as characters in the anagram
  • Punctuation marks don't count
  • The anagrams are all in English

It wouldn't be hard to build an algorithm that doesn't need the above assumptions, but I omitted that to simplify this post.

A brute-force strategy would be to compute all possible permutations of "O DRACONIAN DEVIL" first and then examine each one. Because the text is 17 characters long, the number of permutations is 17! (seventeen factorial), about 355 trillion. There is no way a human, or even a small group of computers, could enumerate all the possibilities, let alone pick out the correct one. We need something smarter.
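For a sense of scale, the count falls straight out of the factorial; a quick check (not from the original post):

import math

# 17 characters, including the two spaces, give 17! orderings
print(math.factorial(17))  # 355687428096000, roughly 355 trillion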

Computing all permutations is overkill. For example, VLDEI is a permutation of DEVIL, but it's a meaningless string; it's not even valid English! Hence, we can change our permutation algorithm to bail out of a branch as soon as the current word in our phrase isn't a valid English word. We can use the same insight to bail out when our current partial word isn't a prefix of any valid English word.

How do we determine whether a string is a valid English word or a prefix of one? We can use a trie data structure holding the 10,000 most commonly used English words to test prefixes, and a hashtable of the same 10,000 words to test complete words. We choose the 10K most common words to keep the trie small: a smaller trie means a tighter definition of "valid English word", which prunes the permutation search more aggressively.
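To make the trie concrete, here is a minimal sketch of the nested-dict representation the code below uses, with a prefix lookup (the sample words and the has_prefix helper are illustrative):

trie = {}
for word in ('devil', 'devote', 'da'):
    node = trie
    for letter in word:
        node = node.setdefault(letter, {})
    node['__end'] = True  # marks the end of a complete word

def has_prefix(trie, prefix):
    # walk the trie letter by letter; failing early prunes the search
    node = trie
    for letter in prefix:
        if letter not in node:
            return False
        node = node[letter]
    return True

print(has_prefix(trie, 'dev'))  # True: prefix of 'devil' and 'devote'
print(has_prefix(trie, 'vld'))  # False: this branch can be pruned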

The big-O runtime is still O(n!) ~ 17! ~ 355 trillion, because in the worst case the word and prefix checks never prune anything. In practice, though, this algorithm is much more efficient: it ran in ~140 seconds on "O DRACONIAN DEVIL" and produced 1,458 valid permutations, a far saner number than 355 trillion. By adding print statements we can see that these heuristics skipped 65 million recursive calls. WOW!

Here is an algorithm that enumerates all the sane permutations of "O, DRACONIAN DEVIL!".

import string
import time

TRIE_END = '__end'

INPUT_COMMON_ENGLISH_WORDS = '/Users/lucasou-yang/google-10000-english-usa.txt'
INPUT_TRIGRAMS = '/Users/lucasou-yang/trigrams.txt'


def build_trie():
    with open(INPUT_COMMON_ENGLISH_WORDS, 'r') as english_words:
        data = [line.strip() for line in english_words.readlines()
                if line.strip()]
        root = {}
        for word in data:
            current_dict = root
            for letter in word:
                current_dict = current_dict.setdefault(letter, {})
            current_dict[TRIE_END] = True
        print('Trie contains %s words' % len(data))
        return root


def build_english_dict():
    with open(INPUT_COMMON_ENGLISH_WORDS, 'r') as english_words:
        return set(line.strip() for line in english_words.readlines()
                   if line.strip() and len(line.strip()) > 1)


def find_bigrams(input_list):
    # pairs each element with its successor (helper, unused in the final run)
    return list(zip(input_list, input_list[1:]))


def is_trie_prefix(trie, word):
    if not word or not word.strip():
        return False

    current_dict = trie
    for letter in word:
        if letter in current_dict:
            current_dict = current_dict[letter]
        else:
            return False
    return True


TRIE = build_trie()
ENGLISH_DICT = build_english_dict()

skipped_recursion = 0


def recurse(input_letters, current_permu, solution, final_len, word_end_index):
    global skipped_recursion
    # current permutation matches original anagram length, see if words
    # qualify as real sentence and dedupe, if valid, add to solution set
    if len(current_permu) == final_len:
        last_word = ''.join(current_permu[word_end_index:])
        candidate_word = ''.join(current_permu)
        if last_word in ENGLISH_DICT and \
                candidate_word not in solution:
            solution.add(candidate_word)  # set.add mutates in place; don't reassign
            # print(candidate_word.upper())
        return

    seen_at_start = set()  # dedupe repeated characters in recursive depth
    for idx in range(len(input_letters)):
        cur = input_letters.pop(idx)
        if cur in seen_at_start:
            skipped_recursion += 1
            input_letters.insert(idx, cur)
            continue

        seen_at_start.add(cur)
        current_permu.append(cur)
        # spaces get special treatment since they distinguish words,
        # if a word is not valid english (via trie) we stop recursing
        if cur == ' ':
            # cut off extra space
            candidate_word = ''.join(current_permu[word_end_index:-1])
            if candidate_word not in ENGLISH_DICT:
                # don't recurse - recent word chunk not valid
                skipped_recursion += 1
                pass
            else:
                recurse(
                    input_letters,
                    current_permu,
                    solution,
                    final_len,
                    len(current_permu)
                )
        else:
            if is_trie_prefix(TRIE, ''.join(current_permu[word_end_index:])):
                recurse(
                    input_letters,
                    current_permu,
                    solution,
                    final_len,
                    word_end_index
                )
            else:
                # don't recurse - word isn't valid, skip onto next one
                skipped_recursion += 1
                pass

        input_letters.insert(idx, cur)
        current_permu.pop()


def transform(s):
    # remove punctuation, make lowercase (three-argument str.maketrans builds
    # a deletion table; the two-argument translate form was Python 2 only)
    s = s.translate(str.maketrans('', '', string.punctuation))
    return s.lower()


start = time.time()

# orig_string = 'OH, LAME SAINT'
orig_string = 'O, DRACONIAN DEVIL'
xformed_string = transform(orig_string)
xformed_string = list(xformed_string)

# duplicate characters are deduped at each recursion depth via seen_at_start,
# so pre-sorting the input isn't required here
final_anagram_len = len(xformed_string)
solution = set()

recurse(xformed_string, [], solution, final_anagram_len, 0)
print('After permuting the anagram %s has %s combinations' %
    (orig_string, len(solution)))

end = time.time()
print('Time elapsed is %s' % (end - start))
print('We skipped recursive calls %s times' % skipped_recursion)

Running the code returns 1,458 unique candidates for the decoded anagram of "O, DRACONIAN DEVIL!".

If Robert and Sophie had all afternoon to examine the 1,458 phrases, I guess that would be fine... but we can do better! How can we algorithmically extract the most meaningful candidates from this result set of 1,458 anagrams? It would be nice to narrow the candidate set down to, say, fewer than 10.

We already know that all 1,458 candidates are composed of valid English words, since we did that filtering previously. So, having applied syntactic filtering, now would be a good time to apply semantic filtering.

How about a strategy that filters out the candidates that form meaningless phrases? Can "VOID NONE RADICAL" or "VON RADIO ICELAND" possibly mean anything? Once again the text isn't valid English: the words "von", "radio", "iceland", "void", and "none" are all real English words, but combined they are meaningless.

Several solutions to semantic filtering come up:

  • Perform NLP analysis on every phrase, remove all the phrases that aren't in the form of valid English grammar.
  • Perform NLP analysis in the form of common bigram or trigram filtering: keep only the anagrams that form collocated trigrams. Collocations are expressions of multiple words which commonly co-occur, for example "New York" or "going to". Alternatively, use IDF and keep the anagrams whose words have high IDF scores. (We can't use TF-IDF, since TF wouldn't make sense here: a list of permutations isn't a document.)

There are likely other strategies, but I ended up implementing #2. It worked amazingly well. Here is the code (I loaded the 1 million most common trigrams from the Corpus of Contemporary American English, found on this website, into a hashtable). The trigram approach lends itself very well to this problem because our original anagram "O, DRACONIAN DEVIL!" has two spaces and three words, which makes it a trigram. We don't strictly need trigrams to solve this; bigrams would also work.

# orig_string = 'OH, LAME SAINT'
orig_string = 'O, DRACONIAN DEVIL'
transformed_string = transform(orig_string)

orig = sorted(transformed_string)
FINAL_ANAGRAM_LEN = len(orig)
solution = set()

recurse(orig, [], solution, len(orig), 0)
print('After permuting the anagram %s has %s combinations' %
    (orig_string, len(solution)))


common_ngrams_dict = {}
# now we need to build a map of the English language's most common trigrams.
# common n-grams are referred to as collocations - words that, together, are
# unusually common in English texts. we use this technique to filter the
# nonsense anagram combinations down to a sane number. this file has just
# over 1M trigrams, about 17MB (that's fine)
with open(INPUT_TRIGRAMS, 'r') as ngrams_data:
    clean_lines = [s.strip() for s in ngrams_data.read().splitlines()]
    for line in clean_lines:
        freq, w1, w2, w3 = line.split()
        common_ngrams_dict[(w1, w2, w3)] = int(freq)


print('ngram dictionary contains %s trigrams' % len(common_ngrams_dict))

filtered_solution = []
for anagram in solution:
    trigram = tuple(anagram.split())
    if trigram in common_ngrams_dict:
        filtered_solution.append(anagram)


print('After filtering anagrams without common trigrams, the anagram %s has %s combinations' %
    (orig_string, len(filtered_solution)))


for anagram in filtered_solution:
    print(anagram.upper())

Iterating over all 1,458 anagram candidates and merely checking whether each existed in the most-common-trigrams hashmap yielded exactly one result!!!

LEONARDO DA VINCI is the anagram. The grail puzzle is solved, for now at least!

"O, DRACONIAN DEVIL" <==> "LEONARDO DA VINCI"

In the movie, Robert and Sophie now sprint over to the Mona Lisa and move on to the next puzzle!
