HaoyuHu / gosimhash

Licence: other
A simhasher for Chinese documents, implemented in Go as a straightforward translation of yanyiwu/gosimhash

Projects that are alternatives to or similar to gosimhash

git-forensics-plugin
Jenkins plug-in that mines and analyzes data from a Git repository
Stars: ✭ 19 (+11.76%)
Mutual labels:  jenkins
simhash-js
Simhash implementation in Javascript
Stars: ✭ 35 (+105.88%)
Mutual labels:  simhash
cn.jenkins.io
Chinese version of the website
Stars: ✭ 30 (+76.47%)
Mutual labels:  jenkins
megalinter
🦙 Mega-Linter analyzes 48 languages, 22 formats, 19 tooling formats, excessive copy-pastes, spelling mistakes and security issues in your repository sources with a GitHub Action, other CI tools or locally.
Stars: ✭ 534 (+3041.18%)
Mutual labels:  jenkins
movie-db-java-on-azure
Sample movie database app built using Java on Azure
Stars: ✭ 28 (+64.71%)
Mutual labels:  jenkins
jobs done10
Travis like .yaml file for generating Jenkins jobs
Stars: ✭ 19 (+11.76%)
Mutual labels:  jenkins
gradle-jenkins-jobdsl-plugin
A plugin for Gradle to manage Jenkins Job DSL projects.
Stars: ✭ 48 (+182.35%)
Mutual labels:  jenkins
jenkins-scriptlets
Useful groovy scripts that can be used while using Jenkins-CI for workflow automation
Stars: ✭ 16 (-5.88%)
Mutual labels:  jenkins
plot-plugin
Jenkins plot plugin
Stars: ✭ 54 (+217.65%)
Mutual labels:  jenkins
mattermost-plugin-jenkins
A Mattermost plugin to interact with Jenkins
Stars: ✭ 25 (+47.06%)
Mutual labels:  jenkins
calendar-view-plugin
Jenkins Calendar View Plugin: Shows past and future builds in a calendar view
Stars: ✭ 17 (+0%)
Mutual labels:  jenkins
jenkins-shared-library-example
Example for a Jenkins shared library with unit tests
Stars: ✭ 35 (+105.88%)
Mutual labels:  jenkins
updatebot
a simple bot for updating dependencies in source code
Stars: ✭ 30 (+76.47%)
Mutual labels:  jenkins
terraform-github-repository-webhooks
Terraform module to provision webhooks on a set of GitHub repositories
Stars: ✭ 20 (+17.65%)
Mutual labels:  jenkins
solutions-terraform-jenkins-gitops
Demonstrates the use of Jenkins and Terraform to manage Infrastructure as Code using GitOps practices
Stars: ✭ 49 (+188.24%)
Mutual labels:  jenkins
github-oauth-plugin
Jenkins authentication plugin using GitHub OAuth as the source.
Stars: ✭ 97 (+470.59%)
Mutual labels:  jenkins
aws-pipeline
Build a CI/CD for Microservices and Serverless Functions in AWS ☁️
Stars: ✭ 32 (+88.24%)
Mutual labels:  jenkins
easy-jenkins
Easily deploy a Jenkins CI/CD infrastructure via docker-compose
Stars: ✭ 29 (+70.59%)
Mutual labels:  jenkins
CIAnalyzer
A tool collecting multi CI services build data and export it for creating self-hosting build dashboard.
Stars: ✭ 52 (+205.88%)
Mutual labels:  jenkins
AnyStatus
A remote control for your CI/CD pipelines and more
Stars: ✭ 38 (+123.53%)
Mutual labels:  jenkins

GoSimhash for Chinese Documents

Usage

go get github.com/HaoyuHu/gosimhash

Usage of Package

import (
	"fmt"

	"github.com/HaoyuHu/gosimhash"
)

func getSimhash() {
	hasher := gosimhash.NewSimpleSimhasher()
	defer hasher.Free()

	// "Today's weather is really suitable for outdoor sports"
	sentence := "今天的天气确实适合户外运动"
	// "This year's climate is really terrible"
	another := "今年的气候确实很糟糕"
	topN := 5  // number of top-weighted keywords used to build the hash
	limit := 3 // distance threshold for treating two hashes as duplicates

	// Make simhashes as uint64 values, like: 0xfa596a42bb35f945
	first := hasher.MakeSimhash(&sentence, topN)
	second := hasher.MakeSimhash(&another, topN)
	dist1 := gosimhash.CalculateDistanceBySimhash(first, second)
	duplicated := gosimhash.IsSimhashDuplicated(first, second, limit)
	fmt.Println(dist1, duplicated)

	// Make simhashes as binary strings, like: "10101110101111010101..."
	firstStr := hasher.MakeSimhashBinString(&sentence, topN)
	secondStr := hasher.MakeSimhashBinString(&another, topN)
	dist2, err := gosimhash.CalculateDistanceBySimhashBinString(firstStr, secondStr)
	if err != nil {
		fmt.Println(err)
	}
	duplicated, err = gosimhash.IsSimhashBinStringDuplicated(firstStr, secondStr, limit)
	if err != nil {
		fmt.Println(err)
	}
	fmt.Println(dist2, duplicated)
}
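
For intuition about what the distance and duplicate checks compute: simhash fingerprints are conventionally compared by Hamming distance, the number of bit positions in which two hashes differ. The sketch below is a standalone illustration using only the Go standard library, not this package's implementation. The helper names (hammingDistance, isDuplicate, parseBinHash) are hypothetical, and treating the limit as inclusive is an assumption.

```go
package main

import (
	"fmt"
	"math/bits"
	"strconv"
)

// hammingDistance counts the bit positions at which a and b differ.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// isDuplicate reports whether two hashes differ in at most limit bits.
// Assumption: the comparison is inclusive; the library may differ.
func isDuplicate(a, b uint64, limit int) bool {
	return hammingDistance(a, b) <= limit
}

// parseBinHash converts a binary string such as "1011" into a uint64.
func parseBinHash(s string) (uint64, error) {
	return strconv.ParseUint(s, 2, 64)
}

func main() {
	a := uint64(0xfa596a42bb35f945)
	b := a ^ 0b1011 // flip three of the low bits

	fmt.Println(hammingDistance(a, b)) // 3
	fmt.Println(isDuplicate(a, b, 3))  // true
	fmt.Println(isDuplicate(a, b, 2))  // false

	v, err := parseBinHash("1011")
	fmt.Println(v, err) // 11 <nil>
}
```

A small limit such as 3 only matches near-identical texts; raising it trades precision for recall.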

You can also customize the hash algorithm used by simhash (siphash and jenkins are currently supported) and the dictionaries used by jieba.

import (
	"github.com/HaoyuHu/gosimhash"
	"github.com/HaoyuHu/gosimhash/utils"
)
...
sip := utils.NewSipHasher([]byte(gosimhash.DEFAULT_HASH_KEY))
// jenkins := utils.NewJenkinsHasher()

hasher := gosimhash.NewSimhasher(sip, "./dict/jieba.dict.utf8", "./dict/hmm_model.utf8", "", "./dict/idf.utf8", "./dict/stop_words.utf8")
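
To see why the token hash function is pluggable, it may help to look at the general simhash scheme: each keyword is hashed to 64 bits, each bit casts a vote weighted by the keyword's weight, and the sign of each vote total becomes one bit of the fingerprint. The following is a minimal sketch of that scheme under stated assumptions, not this package's code. The Hasher type, WeightedToken struct, and simhash function are hypothetical names, the weights are made-up values, and FNV-1a from the standard library stands in for siphash or jenkins.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Hasher maps a token to a 64-bit value (hypothetical type for this sketch).
type Hasher func(token string) uint64

// fnvHash is a stand-in for siphash or jenkins, using FNV-1a.
func fnvHash(token string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(token))
	return h.Sum64()
}

// WeightedToken pairs a keyword with its weight (for example a TF-IDF score).
type WeightedToken struct {
	Token  string
	Weight float64
}

// simhash folds the weighted token hashes into one 64-bit fingerprint.
func simhash(tokens []WeightedToken, hash Hasher) uint64 {
	var votes [64]float64
	for _, t := range tokens {
		h := hash(t.Token)
		for i := 0; i < 64; i++ {
			if h&(uint64(1)<<uint(i)) != 0 {
				votes[i] += t.Weight // a set bit votes for 1
			} else {
				votes[i] -= t.Weight // a clear bit votes for 0
			}
		}
	}
	var out uint64
	for i, v := range votes {
		if v > 0 {
			out |= uint64(1) << uint(i)
		}
	}
	return out
}

func main() {
	// Illustrative keywords and weights, not real extraction output.
	tokens := []WeightedToken{{"天气", 2.5}, {"户外", 1.8}, {"运动", 1.2}}
	fmt.Printf("%#x\n", simhash(tokens, fnvHash))
}
```

Because the fingerprint depends only on the per-token hash and the weights, swapping the hash function changes the resulting values but not the comparison logic.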

Usage of Command

See the example in example/example.go:

cd example
go build

./example -help