huichen / wordvector_be

Licence: other

Web service: returns similar keywords using Tencent's 8-million-word word vector model and the Spotify annoy engine

Projects that are alternatives of or similar to wordvector be

Milvus

An open-source vector database for embedding similarity search and AI applications.

Stars: ✭ 9,015 (+9698.91%)

Mutual labels: nearest-neighbor-search, similarity-search

awesome-vector-search

Collections of vector search related libraries, service and research papers

Stars: ✭ 460 (+400%)

Mutual labels: nearest-neighbor-search, similarity-search

Ocamlapi

Path-based http request routing in Ocaml.

Stars: ✭ 19 (-79.35%)

Mutual labels: http-server

cpp-rest-api

RESTFul Web service by C++, implemented basic REST endpoints and RESTVerbs (GET,POST,PUT,DELETE).

Stars: ✭ 13 (-85.87%)

Mutual labels: http-server

tipi

Tipi - the All-in-one Web Server for Ruby Apps

Stars: ✭ 214 (+132.61%)

Mutual labels: http-server

scikit-hubness

A Python package for hubness analysis and high-dimensional data mining

Stars: ✭ 41 (-55.43%)

Mutual labels: nearest-neighbor-search

pycameresp

Motion detection with image notification for Esp32CAM and Esp32 flasher with GUI based on esptool.py.

Stars: ✭ 40 (-56.52%)

Mutual labels: http-server

Crow

A Fast and Easy to use microframework for the web.

Stars: ✭ 1,718 (+1767.39%)

Mutual labels: http-server

edap

No description or website provided.

Stars: ✭ 22 (-76.09%)

Mutual labels: http-server

open-rest-es6-boilerplate

open-rest boilerplate project with es6

Stars: ✭ 24 (-73.91%)

Mutual labels: http-server

rust-spa-auth

Example application using a Vue frontend with Rust backend that has authentication + authorization.

Stars: ✭ 45 (-51.09%)

Mutual labels: http-server

gofile

HTTP/1.1 directory listing and file server using TCP sockets for fun

Stars: ✭ 59 (-35.87%)

Mutual labels: http-server

Kvantum

An intellectual (HTTP/HTTPS) web server with support for server side templating (Crush, Apache Velocity and JTwig)

Stars: ✭ 17 (-81.52%)

Mutual labels: http-server

go-sse

Fully featured, spec-compliant HTML5 server-sent events library

Stars: ✭ 165 (+79.35%)

Mutual labels: http-server

httpfs

A static file server written in Go, with drag-and-drop file upload and no third-party package dependencies. Supports Windows, Linux, and Darwin.

Stars: ✭ 28 (-69.57%)

Mutual labels: http-server

matador

Take your application by the horns

Stars: ✭ 59 (-35.87%)

Mutual labels: http-server

ValaSimpleHTTPServer

Simple HTTP server made in vala

Stars: ✭ 49 (-46.74%)

Mutual labels: http-server

waycup

A minimal tool that hides your online assets from online security scanners, researchers and hackers.

Stars: ✭ 100 (+8.7%)

Mutual labels: http-server

visualsearch

Visual Search is a little app to find and cluster similar images using Tagbox

Stars: ✭ 31 (-66.3%)

Mutual labels: similarity-search

dhash-vips

vips-powered ruby gem to measure images similarity, implementing dHash and IDHash algorithms

Stars: ✭ 75 (-18.48%)

Mutual labels: similarity-search

wordvector_be

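This project implements an HTTP service in Go. It uses Tencent's 8-million-word word vector model to return similar keywords together with the cosine similarity between keywords. The index is built with Spotify's annoy engine.

Installation

1. First install the Go package for annoy. Follow annoy's documentation for Go; you do not need every step there, only the commands below: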

swig -go -intgosize 64 -cgo -c++ src/annoygomodule.i
mkdir -p $GOPATH/src/annoyindex
cp src/annoygomodule_wrap.cxx src/annoyindex.go \
  src/annoygomodule.h src/annoylib.h src/kissrandom.h test/annoy_test.go $GOPATH/src/annoyindex

2. Next, download Tencent's model file. Using aria2c is recommended:

go get github.com/huichen/wordvector_be
cd $GOPATH/src/github.com/huichen/wordvector_be
mkdir data
cd data/
aria2c -c https://ai.tencent.com/ailab/nlp/data/Tencent_AILab_ChineseEmbedding.tar.gz
tar zxvf Tencent_AILab_ChineseEmbedding.tar.gz

3. Export Tencent's txt model file to a leveldb database. Enter the gen_wordvector_leveldb directory and run:

go run main.go

The generated database is written to the data/tencent_embedding_wordvector.db directory.

4. Create the annoy index file and the metadata databases. Enter the gen_annoy_index directory and run:

go run main.go

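Your machine needs about 10 GB of memory. In a little under 30 minutes, the index file is generated at data/tencent_embedding.ann. The annoy index is keyed by integer ids and does not store the mapping between keywords and ids. That mapping is kept in two leveldb databases, data/tencent_embedding_index_to_keyword.db and data/tencent_embedding_keyword_to_index.db, for later lookup.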
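The two-way mapping described above can be sketched in Go as follows. This is an in-memory stand-in, not the project's leveldb code: the type keywordIndex and its methods are hypothetical names, and the neighbor ids in main are placeholders for what an annoy query would return.

```go
package main

import "fmt"

// keywordIndex keeps both directions of the keyword <-> id mapping,
// mirroring the two leveldb databases (held in memory here for brevity).
type keywordIndex struct {
	keywordToID map[string]int
	idToKeyword []string
}

func newKeywordIndex() *keywordIndex {
	return &keywordIndex{keywordToID: make(map[string]int)}
}

// add assigns the next integer id to keyword, or returns its existing id.
func (k *keywordIndex) add(keyword string) int {
	if id, ok := k.keywordToID[keyword]; ok {
		return id
	}
	id := len(k.idToKeyword)
	k.keywordToID[keyword] = id
	k.idToKeyword = append(k.idToKeyword, keyword)
	return id
}

// translate maps neighbor ids back to keywords, dropping unknown ids.
func (k *keywordIndex) translate(ids []int) []string {
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		if id >= 0 && id < len(k.idToKeyword) {
			out = append(out, k.idToKeyword[id])
		}
	}
	return out
}

func main() {
	idx := newKeywordIndex()
	for _, w := range []string{"美白", "淡斑", "美白产品"} {
		idx.add(w)
	}
	queryID := idx.keywordToID["美白"] // keyword -> id, used to query the index
	neighborIDs := []int{0, 1, 2}     // placeholder for ids returned by annoy
	fmt.Println(queryID, idx.translate(neighborIDs))
}
```

A request therefore needs both directions: keyword to id before the index query, and id to keyword when formatting the response.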

Usage

Once all packages and data files are ready, start the service:

go build
./wordvector_be

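Open http://localhost:3721/get.similar.keywords/?keyword=美白&num=20 in a browser. The response looks like the following, where word is the keyword and similarity is the cosine similarity between the keyword vectors. The closer the value is to 1, the more similar the keywords are.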

{
  "keywords": [
    {
      "word": "美白",
      "similarity": 1
    },
    {
      "word": "淡斑",
      "similarity": 0.8916605
    },
    {
      "word": "美白产品",
      "similarity": 0.8722978
    },
    {
      "word": "美白效果",
      "similarity": 0.8654123
    },
    {
      "word": "想美白",
      "similarity": 0.86464494
    },
...

For more functions, see the comments in main.go.

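Parameter tuning

You may have noticed that the similar words returned by this program differ slightly from Tencent's official examples. This is because the program uses an approximate nearest neighbor algorithm, which does not guarantee 100% recall. The main tunable parameters are:

numTrees: in gen_annoy_index/main.go. The number of trees in the random forest used for approximate nearest neighbor search. More trees give higher recall, but also a longer (one-time) index build and higher request latency.
kSearch: in main.go. The length of the search stack. A larger value makes each request slower but raises recall.

By default the program uses numTrees = 10 and kSearch = 10000, which gives good recall. In a load test with 100 concurrent HTTP requests, latency averaged 76 ms with a variance of 65 ms. If you can afford more build time, increase numTrees. If you need higher concurrency or lower latency, you can lower kSearch moderately, but doing so also lowers recall. Choose the trade-off that fits your workload.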
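One way to judge a given numTrees and kSearch setting is to compare the approximate results against an exact brute-force search on a sample of queries. The Go sketch below shows the idea on toy data. It is not part of the project: the function names are hypothetical, the vectors are made up, and the approximate ids in main are a placeholder for what the annoy index would return.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// exactTopK scores every vector and returns the ids of the k most similar.
func exactTopK(vectors [][]float64, query []float64, k int) []int {
	scores := make([]float64, len(vectors))
	ids := make([]int, len(vectors))
	for i, v := range vectors {
		scores[i] = cosine(v, query)
		ids[i] = i
	}
	sort.SliceStable(ids, func(x, y int) bool {
		return scores[ids[x]] > scores[ids[y]]
	})
	if k > len(ids) {
		k = len(ids)
	}
	return ids[:k]
}

// recall is the fraction of exact neighbors found in the approximate result.
func recall(exact, approx []int) float64 {
	if len(exact) == 0 {
		return 0
	}
	seen := make(map[int]bool, len(approx))
	for _, id := range approx {
		seen[id] = true
	}
	hits := 0
	for _, id := range exact {
		if seen[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(exact))
}

func main() {
	vectors := [][]float64{{1, 0}, {0.9, 0.1}, {0, 1}, {-1, 0}}
	exact := exactTopK(vectors, []float64{1, 0}, 2)
	approx := []int{0, 2} // placeholder for ids returned by an approximate index
	fmt.Println(exact, recall(exact, approx))
}
```

Brute force is far too slow to serve requests over millions of words, but running it offline on a few hundred sample keywords gives a recall figure to weigh against the measured latency.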
