
hibayesian / spark-word2vec

License: Apache-2.0
A parallel implementation of word2vec based on Spark


Projects that are alternatives of or similar to spark-word2vec

fastdata-cluster
Fast Data Cluster (Apache Cassandra, Kafka, Spark, Flink, YARN and HDFS with Vagrant and VirtualBox)
Stars: ✭ 20 (-16.67%)
Mutual labels:  spark
spark-druid-olap
Sparkline BI Accelerator provides fast ad-hoc query capability over Logical Cubes. This has been folded into our SNAP Platform (http://bit.ly/2oBJSpP), an integrated BI platform on Apache Spark.
Stars: ✭ 286 (+1091.67%)
Mutual labels:  spark
yuzhouwan
Code Library for My Blog
Stars: ✭ 39 (+62.5%)
Mutual labels:  spark
Spark-Ar
Resources for Spark AR
Stars: ✭ 43 (+79.17%)
Mutual labels:  spark
ODSC India 2018
My presentation at ODSC India 2018 about Deep Learning with Apache Spark
Stars: ✭ 26 (+8.33%)
Mutual labels:  spark
data processing course
Some class materials for a data processing course using PySpark
Stars: ✭ 50 (+108.33%)
Mutual labels:  spark
spark-stringmetric
Spark functions to run popular phonetic and string matching algorithms
Stars: ✭ 51 (+112.5%)
Mutual labels:  spark
shamash
Autoscaling for Google Cloud Dataproc
Stars: ✭ 31 (+29.17%)
Mutual labels:  spark
reach
Load embeddings and featurize your sentences.
Stars: ✭ 17 (-29.17%)
Mutual labels:  word2vec
spark-gradle-template
Apache Spark in your IDE with gradle
Stars: ✭ 39 (+62.5%)
Mutual labels:  spark
sparkar-volts
An extensive non-reactive Typescript framework that eases the development experience in Spark AR
Stars: ✭ 15 (-37.5%)
Mutual labels:  spark
swordfish
Open-source distributed workflow scheduling tool that also supports streaming tasks.
Stars: ✭ 35 (+45.83%)
Mutual labels:  spark
openverse-catalog
Identifies and collects data on CC-licensed content across web crawl data and public APIs.
Stars: ✭ 27 (+12.5%)
Mutual labels:  spark
experiments
Code examples for my blog posts
Stars: ✭ 21 (-12.5%)
Mutual labels:  spark
Search Ads Web Service
Online search advertisement platform & Realtime Campaign Monitoring [Maybe Deprecated]
Stars: ✭ 30 (+25%)
Mutual labels:  spark
splink
Implementation of Fellegi-Sunter's canonical model of record linkage in Apache Spark, including EM algorithm to estimate parameters
Stars: ✭ 181 (+654.17%)
Mutual labels:  spark
awesome-AI-kubernetes
❄️ 🐳 Awesome tools and libs for AI, Deep Learning, Machine Learning, Computer Vision, Data Science, Data Analytics and Cognitive Computing that are baked in the oven to be Native on Kubernetes and Docker with Python, R, Scala, Java, C#, Go, Julia, C++ etc
Stars: ✭ 95 (+295.83%)
Mutual labels:  spark
spark-kubernetes
spark on kubernetes
Stars: ✭ 80 (+233.33%)
Mutual labels:  spark
sent2vec
How to encode sentences in a high-dimensional vector space, a.k.a., sentence embedding.
Stars: ✭ 99 (+312.5%)
Mutual labels:  word2vec
spark-util
low-level helpers for Apache Spark libraries and tests
Stars: ✭ 16 (-33.33%)
Mutual labels:  spark

Spark-Word2Vec

Spark-Word2Vec creates vector representations of words in a text corpus. It is based on the implementation of word2vec in Spark MLlib. Several optimization techniques are used to make the algorithm more scalable and accurate.

Highlights

  • Two models, CBOW and Skip-gram, are supported in our implementation.
  • Both hierarchical softmax and negative sampling are supported for training the model.
  • The sub-sampling trick can be used to achieve both faster training and significantly better representations of uncommon words.
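As a sketch of how these training options might be selected, the snippet below configures the estimator; the setter names for CBOW, negative sampling, and sub-sampling are assumptions for illustration and may differ from this project's actual API (check the source for the exact parameter names):

```scala
import org.apache.spark.ml.feature.Word2Vec

// Hypothetical configuration sketch. setVectorSize/setMinCount/setWindowSize
// follow the MLlib estimator; the commented setters below are assumed names
// for this project's extensions and may not match its real API.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(100)
  .setMinCount(5)
  .setWindowSize(5)
  // .setCBOW(1)        // assumed: use CBOW instead of Skip-gram
  // .setNegative(5)    // assumed: negative sampling with 5 noise words
  // .setSample(1e-3)   // assumed: sub-sampling threshold for frequent words
```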

Examples

Scala API

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession
  .builder
  .appName("Word2Vec example")
  .master("local[*]")
  .getOrCreate()

// Input data: each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n")
}

spark.stop()

Requirements

Spark-Word2Vec is built against Spark 2.1.1.
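For reference, a minimal build.sbt for a project that depends on Spark 2.1.1 might look like the following; the Scala version is an assumption based on what Spark 2.1.1 distributions were typically built with:

```scala
// build.sbt — versions here are assumptions matching the Spark 2.1.1 era
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.1.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.1.1" % "provided"
)
```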

Build From Source

sbt package
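The packaged jar can then be supplied to an application with spark-submit; the artifact path and class name below are illustrative only and depend on your Scala version and the project's version setting:

```shell
# Illustrative invocation — jar name and main class are assumptions.
spark-submit \
  --master "local[*]" \
  --jars target/scala-2.11/spark-word2vec_2.11-1.0.jar \
  --class com.example.Word2VecApp \
  your-app.jar
```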

License

Spark-Word2Vec is available under the Apache License 2.0.

Contact & Feedback

If you encounter bugs, feel free to submit an issue or a pull request. You can also reach out by email:
