yu-iskw / kuromoji-for-bigquery

Licence: other
Tokenize Japanese text on BigQuery with Kuromoji in Apache Beam/Google Dataflow at scale

Programming Languages

java
Makefile

Projects that are alternatives of or similar to kuromoji-for-bigquery

bigquery-to-datastore
Export a whole BigQuery table to Google Datastore with Apache Beam/Google Dataflow
Stars: ✭ 56 (+409.09%)
Mutual labels:  bigquery, google-cloud, apache-beam, google-dataflow
Magnolify
A collection of Magnolia add-on modules
Stars: ✭ 81 (+636.36%)
Mutual labels:  bigquery, google-cloud
Ethereum Etl
Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
Stars: ✭ 956 (+8590.91%)
Mutual labels:  bigquery, google-cloud
DataflowTemplate
Mercari Dataflow Template
Stars: ✭ 46 (+318.18%)
Mutual labels:  google-cloud, apache-beam
Spark Bigquery Connector
BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Stars: ✭ 126 (+1045.45%)
Mutual labels:  bigquery, google-cloud
ob google-bigquery
This service is meant to simplify running Google Cloud operations, especially BigQuery tasks. This means you do not have to worry about installation, configuration or ongoing maintenance related to an SDK environment. This can be helpful to those who would prefer not to be responsible for those activities.
Stars: ✭ 43 (+290.91%)
Mutual labels:  bigquery, google-cloud
Scio
A Scala API for Apache Beam and Google Cloud Dataflow.
Stars: ✭ 2,247 (+20327.27%)
Mutual labels:  bigquery, google-cloud
bqv
The simplest tool to manage views of BigQuery.
Stars: ✭ 22 (+100%)
Mutual labels:  bigquery, google-cloud
DataflowTemplates
Convenient Dataflow pipelines for transforming data between cloud data sources
Stars: ✭ 22 (+100%)
Mutual labels:  bigquery, apache-beam
go-bqloader
bqloader is a simple ETL framework to load data from Cloud Storage into BigQuery.
Stars: ✭ 16 (+45.45%)
Mutual labels:  bigquery, google-cloud
iris3
An upgraded and improved version of the Iris automatic GCP-labeling project
Stars: ✭ 38 (+245.45%)
Mutual labels:  bigquery, google-cloud
blockchain-etl-streaming
Streaming Ethereum and Bitcoin blockchain data to Google Pub/Sub or Postgres in Kubernetes
Stars: ✭ 57 (+418.18%)
Mutual labels:  apache-beam, google-dataflow
bigquery-kafka-connect
☁️ nodejs kafka connect connector for Google BigQuery
Stars: ✭ 17 (+54.55%)
Mutual labels:  bigquery, google-cloud
argon
Campaign Manager 360 and Display & Video 360 Reports to BigQuery connector
Stars: ✭ 31 (+181.82%)
Mutual labels:  bigquery, google-cloud
etlflow
EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for writing various different tasks, jobs on GCP and AWS.
Stars: ✭ 38 (+245.45%)
Mutual labels:  bigquery
vault-demo
Walkthroughs and scripts for my @hashicorp Vault talks
Stars: ✭ 67 (+509.09%)
Mutual labels:  google-cloud
sg-food-ml
This script is used to scrape images from the Internet to classify 5 common noodle "mee" dishes in Singapore: Wanton Mee, Bak Chor Mee, Lor Mee, Prawn Mee and Mee Siam.
Stars: ✭ 18 (+63.64%)
Mutual labels:  google-cloud
google translate diff
Google Translate API wrapper translates only changes between revisions of big texts
Stars: ✭ 51 (+363.64%)
Mutual labels:  google-cloud
30Days-of-GCP
Resources for the 30 Days of GCP program
Stars: ✭ 26 (+136.36%)
Mutual labels:  google-cloud
notionproxy
Notion as a web site, inspired by react-notion-x.
Stars: ✭ 24 (+118.18%)
Mutual labels:  google-cloud

kuromoji-for-bigquery

kuromoji-for-bigquery tokenizes text in a BigQuery table with Kuromoji and Apache Beam, then stores the tokenized result in another BigQuery table.

It scales horizontally on distributed systems, because Apache Beam can run on Google Dataflow, Apache Spark, Apache Flink and other runners.

Overview

Requirements

  • Maven
  • Java 1.8+
  • Google Cloud Platform account

Version Info

  • Apache Beam: 2.34.0
  • Kuromoji: 0.7.7

How to Use

Command Line Options

Required Options

  • --project: Google Cloud Project
  • --inputDataset: Input BigQuery dataset ID
  • --inputTable: Input BigQuery table ID
  • --tokenizedColumn: Name of the column to tokenize in the input table
  • --outputDataset: Output BigQuery dataset ID
  • --outputTable: Output BigQuery table ID
  • --schema: BigQuery schema used to select columns from the input table. (Format: id:integer,name:string,value:float,ts:timestamp)
  • --tempLocation: The Cloud Storage path to use for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.
  • --gcpTempLocation: A GCS path for storing temporary files in GCP.
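
The --schema format and the gs:// requirement above can be checked before a job is submitted. The following is a minimal sketch using only the Java standard library. SchemaOptionSketch, parseSchema and isGcsPath are hypothetical names chosen for illustration and are not part of kuromoji-for-bigquery; upper-casing the type names and rejecting duplicate columns are also assumptions, not documented behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of kuromoji-for-bigquery. */
final class SchemaOptionSketch {

    /** Parses "name:type,name:type" into an ordered map of column name to type. */
    static Map<String, String> parseSchema(String schema) {
        if (schema == null || schema.trim().isEmpty()) {
            throw new IllegalArgumentException("--schema must not be empty");
        }
        Map<String, String> columns = new LinkedHashMap<>();
        for (String field : schema.split(",")) {
            String[] parts = field.trim().split(":");
            if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
                throw new IllegalArgumentException("Expected name:type but got '" + field + "'");
            }
            // Assumption: types are normalised to upper case and duplicates are rejected.
            if (columns.put(parts[0], parts[1].toUpperCase(Locale.ROOT)) != null) {
                throw new IllegalArgumentException("Duplicate column: " + parts[0]);
            }
        }
        return columns;
    }

    /** Returns true when the path begins with gs:// and names something after it. */
    static boolean isGcsPath(String path) {
        return path != null && path.startsWith("gs://") && path.length() > "gs://".length();
    }

    public static void main(String[] args) {
        // Uses the example values that appear in this README.
        System.out.println(parseSchema("id:integer,name:string,value:float,ts:timestamp"));
        System.out.println(isGcsPath("gs://test_yu/test-log/"));
    }
}
```

A check like this only catches formatting mistakes early; whether the listed columns actually exist in the input table is still decided by BigQuery when the job runs.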

Optional Options

  • --outputColumn: Output column for tokenized result in output table. (Default: token)
  • --kuromojiMode: Kuromoji Mode. (NORMAL, SEARCH, or EXTENDED) (Default: NORMAL)
  • --createDisposition: Create Disposition option for BigQuery. (CREATE_NEVER or CREATE_IF_NEEDED)
  • --writeDisposition: Write Disposition option for BigQuery. (WRITE_TRUNCATE, WRITE_APPEND or WRITE_EMPTY)
  • --runner: Apache Beam runner.
    • When you don't set this option, it will run on your local machine, not Google Dataflow.
    • e.g. DataflowRunner
  • --numWorkers: The number of workers when you run it on top of Google Dataflow.
  • --workerMachineType: Google Dataflow worker instance type
    • e.g. n1-standard-1, n1-standard-4
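
The documented defaults (token for --outputColumn, NORMAL for --kuromojiMode) and the local-run behaviour when --runner is omitted can be summarised in a few lines. This is a minimal sketch in plain Java, not how the project itself parses its flags: OptionDefaultsSketch, parseArgs and runsLocally are hypothetical names, and the strict --name=value check is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the documented defaults; not the project's own parser. */
final class OptionDefaultsSketch {

    /** The three Kuromoji modes listed in this README. */
    static final List<String> KUROMOJI_MODES = Arrays.asList("NORMAL", "SEARCH", "EXTENDED");

    /** Collects --name=value arguments on top of the documented defaults. */
    static Map<String, String> parseArgs(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("outputColumn", "token");   // documented default
        options.put("kuromojiMode", "NORMAL");  // documented default
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // Assumption: every argument has the form --name=value with a non-empty name.
            if (!arg.startsWith("--") || eq < 3) {
                throw new IllegalArgumentException("Expected --name=value but got '" + arg + "'");
            }
            options.put(arg.substring(2, eq), arg.substring(eq + 1));
        }
        String mode = options.get("kuromojiMode");
        if (!KUROMOJI_MODES.contains(mode)) {
            throw new IllegalArgumentException("Unknown kuromojiMode: " + mode);
        }
        return options;
    }

    /** Without --runner the pipeline runs on the local machine, as described above. */
    static boolean runsLocally(Map<String, String> options) {
        return !options.containsKey("runner");
    }

    public static void main(String[] args) {
        Map<String, String> options = parseArgs(new String[] {"--tokenizedColumn=text"});
        System.out.println(options.get("kuromojiMode") + " " + runsLocally(options));
    }
}
```

Passing --runner=DataflowRunner to parseArgs makes runsLocally return false, which mirrors the note that the job only runs on Google Dataflow when a runner is set explicitly.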

Run the command

# compile
mvn clean package

# Run kuromoji-for-bigquery via the compiled JAR file
java -jar $(pwd)/target/kuromoji-for-bigquery-bundled-0.2.2.jar \
  --project=test-project-id \
  --schema=id:integer \
  --inputDataset=test_input_dataset \
  --inputTable=test_input_table \
  --outputDataset=test_output_dataset \
  --outputTable=test_output_table \
  --tokenizedColumn=text \
  --outputColumn=token \
  --kuromojiMode=NORMAL \
  --tempLocation=gs://test_yu/test-log/ \
  --gcpTempLocation=gs://test_yu/test-log/ \
  --maxNumWorkers=10 \
  --workerMachineType=n1-standard-2

Versions

| kuromoji-for-bigquery | Apache Beam | Kuromoji |
|-----------------------|-------------|----------|
| 0.1.0                 | 2.1.0       | 0.7.7    |
| 0.2.x                 | 2.20.0      | 0.7.7    |
| 0.3.x                 | 2.34.0      | 0.7.7    |

License

Copyright (c) 2017 Yu Ishikawa.