Patent Project for Big Data for Competitive Advantage (DSBA 6140)
Introduction
This readme outlines the steps in Python to use topic modeling on US patents for 3M and seven competitors.
There are five sections of the code:
- Modules & Working Directory
- Load Dataset, Set Column Names and Sample (Explore) Data
- Data Wrangling (Tokenize, Clean, TF-IDF)
- Topic Modeling (K-Means, LDA, Topic Word Cloud)
- K-Means Clustering on the Topic Probabilities
This code was created as a collection of several online references. Each reference is labelled with a [number] tag that will be used through this document to cite references.
Somewhat technical:
- [1] Introduction to Bag-of-Words modeling in Python
- [2] Applications and Challenges of Text Mining with Patents
- [3] Topic Modeling Visualizations
- [4] Stemming & Lemmatization
Very technical:
- [5] NLTK Homepage
- [6] gensim Homepage
- [7] LDA Modeling for Python
- [8] Topic Modeling (& K-Means) with Gensim
- [9] Constructing a broad-coverage lexicon for text mining in the patent domain
- [10] Document Clustering with Similarity
- [11] DBSCAN
- [12] Identifying Bigrams using gensim
1. Modules & Set Working Directory
Import Modules
For Python, call modules that will be used in this script.
# from nltk
from nltk.corpus import stopwords # Import the stop word list
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer # Lemmatization
# modules that are used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import re
import csv
import logging
from scipy.spatial.distance import cdist, pdist
from wordcloud import WordCloud
# import gensim
import gensim
from gensim import corpora, models, similarities
Home Directory
Set home directory
# home directory - modify to your own homedirectory
os.chdir("/home/ryanceros/Dropbox/Project - Big Data Analytics/WordCloud")
2. Load Dataset, Set Column Names and Sample Data
Load Dataset and Set Column Names
The dataset is named "updatedCompanies.csv". It includes nearly 33,000 patents for eight companies (3M and seven competitors).
# sample dataset
dataset = 'updatedCompanies.csv'
# Load dataset
rawData = pd.read_csv(dataset, header=None)
# Rename the columns as CSV does not contain headers
rawData.columns = ["PatentNumber","CompanyName","Company","PatentAssignee",
"YearGranted","YearApplied","Year","PatentClass1","PatentClass2","PatentClassClean",
"ClassName","PatentTitle","PatentAbstract"]
# Check Shape
exampleData.shape
# ((32405, 13)
exampleData.columns.values
#array(['PatentNumber', 'CompanyName', 'PatentAssignee', 'YearGranted',
# 'YearApplied', 'PatentClass1', 'PatentClass2', 'ClassName',
# 'PatentTitle', 'PatentAbstract'], dtype=object)
Run StemLemma.py (Stemmer, Lemmatization functions)
[4] was modified to create the script StemLemma.py
exec(open("StemLemma.py").read())
Sample (Explore) Data
Using an example (num = 560), explore fields including the Abstract Bag-of-Words.
num=560
print "Company Name: %s" % (exampleData["Company"][num])
print("")
print("Patent Title: " + exampleData["PatentTitle"][num])
print("")
print("Class Name: " + exampleData["ClassName"][num])
print("")
print("Class Number (Left 3): %s " % exampleData["PatentClassClean"][num])
print("")
print("Abstract: " + exampleData["PatentAbstract"][num])
print("")
print("Abstract Bag of Words: " + patent_to_words(exampleData["PatentAbstract"][num]))
3. Data Wrangling
Clean and Tokenize patents into lists (each patent is a words array)
This section first cleans and tokensizes the patent abstracts into unigrams. Next, using gensim's Phrases function, we create additional bigrams to include in the topic modeling. We used [12] as a reference to create the bigrams.
# Get the number of reviews based on the dataframe column size
num_patents = exampleData["PatentAbstract"].size
# Initialize an empty list to hold the clean reviews
clean_abstracts = []
# Loop over each review; create an index i that goes from 0 to the length
# of the patent list
for i in xrange( 0, num_patents ):
# Call our function for each one, and add the result to the list of
patent = patent_to_words(exampleData["PatentAbstract"][i])
array = patent.split()
clean_abstracts.append(array)
# Identify Bigrams using gensim's Phrases function
bigram = models.Phrases(clean_abstracts)
final_abstracts = []
for i in xrange(0,num_patents):
sent = clean_abstracts[i]
temp_bigram = bigram[sent]
final_abstracts.append(temp_bigram)
Convert tokenized document to dictionary and document-term matrix
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(final_abstracts)
# convert tokenized documents into a document-term matrix (bag-of-words)
corpus = [dictionary.doc2bow(text) for text in final_abstracts]
Term Frequency and Inverse Document Frequency (TF-IDF)
#TF IDF
tfidf = models.TfidfModel(corpus, normalize=True)
corpus_tfidf = tfidf[corpus]
4. Topic Modeling
K-Means to Determine Number of Topics
Run KMeans.py to create KMeans function and to determine the number of topics. This section used the method laid out in [8].
exec(open("KMeans.py").read())
Generate LDA Model using gensim
gensim
is a text mining module. [6] is the official website of gensim and provides several introductory tutorials on using the modules. [7] and [8] were used to create this section.
# generate LDA model
NUM_TOPICS = 5
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
level=logging.INFO)
# Project to LDA space
%time ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word = dictionary, passes=100)
ldamodel.print_topics(NUM_TOPICS)
docTopicProbMat = ldamodel.get_document_topics(corpus,minimum_probability=0)
listDocProb = list(docTopicProbMat)
Put LDA Probabilities into a Matrix and then DataFrame
This step cleans up the output of LDA (topic probabilities for each document) and converts it to a pandas dataframe to make analysis easier.
probMatrix = np.zeros(shape=(num_patents,NUM_TOPICS))
for i,x in enumerate(listDocProb): #each document i
for t in x: #each topic j
probMatrix[i, t[0]] = t[1]
df = pd.DataFrame(probMatrix)
Generate Word Clouds for each Topic
This step creates word clouds for each of the topics. This section was created referencing [8].
final_topics = ldamodel.show_topics(num_words = 20)
curr_topic = 0
for line in final_topics:
line = line.strip()
scores = [float(x.split("*")[0]) for x in line.split(" + ")]
words = [x.split("*")[1] for x in line.split(" + ")]
freqs = []
for word, score in zip(words, scores):
freqs.append((word, score))
wordcloud = WordCloud(max_font_size=40).generate_from_frequencies(freqs)
plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
curr_topic += 1
"Synthetic Materials":
"Data & Information":
"Chemistry":
"Electrical":
"Energy":
Plot Probability heatmap (Before Topic Clusters)
This step creates a topic probability heatmap for all documents. It is unordered so the probabilities will be scattered since they are based on the order of the dataset. [3] was referenced to create this part of the code.
plt.pcolor(df.transpose(), norm=None, cmap='Blues')
topic_labels = ['Synthetic_Material',
'Data_Information',
'Chemistry',
'Electrical',
'Energy_Turbine'
]
plt.yticks(np.arange(df.shape[1])+0.5, topic_labels)
plt.colorbar(cmap='Blues')
5. K-Means Clustering on the Topics
K-Means to Determine Number of Topics
This section samples a range of potential number of topics (1 to 11). The within sum of squares is calculated and the "elbow" rule is used to deduce that k should be 5. The within sum of squares is exported as a csv file.
k_range = range(1,11)
k_means_var = [KMeans(n_clusters=k).fit(df) for k in k_range]
centroids = [X.cluster_centers_ for X in k_means_var]
k_euclid = [cdist(df, cent, 'euclidean') for cent in centroids]
dist = [np.min(ke,axis=1) for ke in k_euclid]
wcss = [sum(d**2) for d in dist]
dfwcss = pd.DataFrame(wcss)
dataset = 'WCSS.csv'
dfwcss.to_csv(dataset, quotechar='\"', quoting=csv.QUOTE_NONNUMERIC,delimiter=',')
#tss = sum(pdist(df)**2)/df.shape[0]
#bss = tss - wcss
Run K-Means with 5 Clusters; Create a Total Patent Dataset
k = 5
kmeans = KMeans(n_clusters=k).fit(df)
clusters = kmeans.labels_
dfclusters = pd.DataFrame(clusters)
# Append Patent Number, Company and Class Name
df.columns = ['Synthetic_Material',
'Data_Information',
'Chemistry',
'Electrical',
'Energy_Turbine']
df["PatentNumber"] = exampleData["PatentNumber"]
df["Company"] = exampleData["Company"]
df["ClassName"] = exampleData["ClassName"]
df["PatentTitle"] = exampleData["PatentTitle"]
df["PatentAssignee"] = exampleData["PatentAssignee"]
df["Year"] = exampleData["Year"]
df["Cluster"] = dfclusters
Export Patent Topic Dataset and Rerun Probability Plot ordered by Topic
This step saves the patent dataset (with LDA topic probabilities and K-means cluster). This step also reruns the probability heat map but this time ordered by the topic cluster.
# Save in a new directory
os.chdir("/home/ryanceros/Dropbox/Project - Big Data Analytics/WordCloud/CompetitorLDA")
dataset = 'ProbDocUpdated.csv'
df.to_csv(dataset, quotechar='\"', quoting=csv.QUOTE_NONNUMERIC,delimiter=',')
with open("topicProb.csv", "wb") as f:
writer = csv.writer(f)
writer.writerows(final_topics)
newPlot = df.sort(['Cluster'], ascending=[1])
newPlot2=newPlot[topic_labels]
plt.pcolor(newPlot2.transpose(), norm=None, cmap='Blues')
plt.yticks(np.arange(5)+0.5, topic_labels)
plt.colorbar(cmap='Blues')