linkedin / Avro2tf

Licence: other

Avro2TF is designed to fill the gap of making users' training data ready to be consumed by deep learning training frameworks.

Avro2TF

Deep learning has been successfully applied to multiple AI systems at LinkedIn related to recommendation and search. One important lesson we have learned along the way is the value of good deep learning platforms that help our modeling engineers become more efficient and productive. Avro2TF is part of this effort to reduce the complexity of data processing and improve the velocity of advanced modeling.

In addition to advanced deep learning techniques, LinkedIn has been at the forefront of machine learning innovation for years. We have many different ML approaches that consume large amounts of data every day, and efficiency and accuracy are the most important measurements for these approaches. To effectively support deep learning at LinkedIn, we first need to address the data processing issues.

Most of the datasets used by our ML algorithms (e.g., LinkedIn's large-scale personalization engine Photon-ML) are in Avro format. Each record in an Avro dataset is essentially a sparse vector and can be easily consumed by most modern classifiers. However, the format cannot be used directly by TensorFlow, the leading deep learning package. The main blocker is that the sparse vector is not in the same format as a Tensor. We believe this is not only a LinkedIn problem: many companies have vast amounts of ML data in a similar sparse vector format, and the Tensor format is still relatively new to many of them.

Avro2TF bridges this gap by providing a scalable Spark-based transformation and extension mechanism that efficiently converts the data into TF records that TensorFlow can readily consume. With this technology, developers can improve their productivity by focusing on model building rather than data conversion.

Compatibility Notes

We recommend running Avro2TF with Scala 2.11 or above.

Build

How to build

Avro2TF is built using Gradle. To build Avro2TF, run:

./gradlew build

This automatically runs the tests. To build without running tests, run:

./gradlew build -x test

Usage

Avro2TF reads raw user input data in any format supported by Spark and generates tensorized training data in Avro or TFRecord format.

Avro2TF exposes to users a JSON or HOCON config to specify the tensors that a modeler wants to use in training. For each tensor, a user should specify two kinds of information:

What existing features are used to construct the tensor.
The expected name, dtype, and shape of the tensor.

Use Avro2TF in a gradle project

Specify the repository that hosts the Avro2TF job:

repositories {
    maven {
        url  "https://dl.bintray.com/linkedin/maven"
    }
}

Declare the Avro2TF dependency:

dependencies {
    compile 'com.linkedin.avro2tf:avro2tf:2.0.1'
}

Input Data Requirements

We support all data formats that Spark can read, including the popular Avro and ORC formats. Categorical/sparse features must be represented in NTV (name-term-value) format.

A single categorical/sparse feature has the type Array[NTV]. We treat Array[NTV] as a special primitive type. Thus, the supported input primitive types include:

long
float
double
String
bytes (for multimedia data such as image, audio, and video)
boolean
Array[NTV]

Arrays of primitive types with any rank are supported.
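
As a rough illustration of the NTV layout, the sketch below models one categorical/sparse feature as a list of name-term-value records in Python. The `NTV` class and the sample values are hypothetical and only meant to show the shape of the data; the field names mirror the `name`, `term`, and `value` fields used in the example schema later in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NTV:
    """One name-term-value entry of a categorical/sparse feature (illustrative only)."""
    name: str
    term: str
    value: float


# A single categorical/sparse feature is an array of NTV records.
words: List[NTV] = [
    NTV(name="word", term="great", value=1.0),
    NTV(name="word", term="movie", value=2.0),
]

# Pulling out only the terms is roughly what a column expression
# such as "words.term" selects from an array of records.
terms = [entry.term for entry in words]
print(terms)
```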

Supported Data Types of Output Tensor

In Avro2TF, the supported data types (dtype) of output tensors are: int, long, float, double, string, boolean, bytes. A sparseVector tensor has two fields: indices and values.
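
To make the indices/values layout concrete, here is a minimal Python sketch that turns NTV-style entries into a sparse vector. The `feature_index` mapping and the `to_sparse_vector` helper are assumptions made for illustration; how Avro2TF actually assigns indices is not described in this section.

```python
from typing import Dict, List, Tuple


def to_sparse_vector(
    entries: List[Tuple[str, str, float]],
    feature_index: Dict[Tuple[str, str], int],
) -> Dict[str, list]:
    """Map (name, term, value) entries to parallel `indices` and `values` lists.

    Entries whose (name, term) pair is missing from `feature_index` are skipped.
    """
    indices: List[int] = []
    values: List[float] = []
    for name, term, value in entries:
        idx = feature_index.get((name, term))
        if idx is not None:
            indices.append(idx)
            values.append(value)
    return {"indices": indices, "values": values}


# Hypothetical feature mapping: (name, term) -> integer index.
feature_index = {("word", "great"): 0, ("word", "movie"): 7}

sparse = to_sparse_vector(
    [("word", "great", 1.0), ("word", "movie", 2.0), ("word", "unseen", 3.0)],
    feature_index,
)
print(sparse)
```

Keeping `indices` and `values` as parallel lists matches the two fields named above and lines up with how a sparse tensor is usually described: positions plus the values stored at those positions.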

The table below lists the data types you get after loading tensors serialized by Avro2TF into TensorFlow. In TensorFlow, bytes is represented as tf.string, and users can later decode it into images, audio, etc. The sparseVector type is converted to tf.SparseTensor.

Data type of serialized tensor in Avro2TF	Data type of deserialized tensor in TensorFlow
int	tf.int32
long	tf.int64
float	tf.float32
double	tf.float64
String	tf.string
bytes	tf.string
boolean	tf.bool
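
The mapping in the table can be captured as a small lookup, which may be handy when checking a config before training. This is only a sketch: the `tf_dtype_name` helper is hypothetical, and the TensorFlow dtypes are kept as plain strings so the snippet runs without TensorFlow installed.

```python
# Mapping taken from the table above; values are TensorFlow dtype names as strings.
AVRO2TF_TO_TF_DTYPE = {
    "int": "tf.int32",
    "long": "tf.int64",
    "float": "tf.float32",
    "double": "tf.float64",
    "string": "tf.string",
    "bytes": "tf.string",
    "boolean": "tf.bool",
}


def tf_dtype_name(avro2tf_dtype: str) -> str:
    """Return the TensorFlow dtype name for an Avro2TF dtype (case-insensitive)."""
    key = avro2tf_dtype.lower()
    if key not in AVRO2TF_TO_TF_DTYPE:
        raise ValueError(f"Unsupported Avro2TF dtype: {avro2tf_dtype!r}")
    return AVRO2TF_TO_TF_DTYPE[key]


print(tf_dtype_name("long"))
print(tf_dtype_name("String"))
```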

Avro2TF Configuration

The table below lists all available configuration names and their meanings.

Name	Required?	Default Value	Meaning
features	yes	/	A list of tensor configs. Each config includes inputFeatureInfo and outputTensorInfo. Must not be empty.
labels	no	None	A list of tensor configs.
inputFeatureInfo	no	{"columnExpr" : ${outputTensorInfo.name}}	Specify input features used to construct a tensor.
columnExpr	no	None	Spark SQL column expression. If neither columnExpr nor columnConfig exists, columnExpr defaults to {"columnExpr" : ${outputTensorInfo.name}}.
columnConfig	no	None	Only for extracting NTV features. A user should not specify both columnExpr and columnConfig.
transformConfig	no	None	Specify transformations applied on input features, available transformers: hashing and tokenization.
hashInfo	no	None	Specify hashing related information.
hashBucketSize	yes	/	The bucket size of the hash function.
numHashFunctions	no	1	The number of hash functions used (only salt will be different).
combiner	no	SUM	How to merge the values of repeated indices in a sparse vector (AVG, SUM, MAX).
tokenization	no	None	Tokenization related configs.
removeStopWords	no	false	Whether to remove stop words during tokenization.
outputTensorInfo	yes	/	Info on expected output tensor.
name	yes	/	Name of output tensor.
dtype	yes	/	The expected dtype of output tensor.
shape	no	[]	The expected shape of output tensor, examples: []: scalar; [-1] : 1D array of any length; [6]: 1D array with size 6; [2, 3]: matrix with 2 rows and 3 columns.
isSparse	no	false	Indicates whether the output tensor is a sparse tensor.
isDocumentFeature	no	true	To indicate whether the output tensor is a document feature or query feature.
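
The rules in the table (required keys, defaults, and the columnExpr/columnConfig exclusivity) can be sketched as a small pre-flight check in Python. The helper names `normalize_tensor_config` and `normalize_config` are hypothetical and not part of Avro2TF; the sketch only mirrors what the table states and does not claim to reproduce Avro2TF's own validation. For convenience it returns an empty list when no labels are given.

```python
import copy
from typing import Any, Dict, List


def normalize_tensor_config(tensor: Dict[str, Any]) -> Dict[str, Any]:
    """Check required keys and fill in the documented defaults for one tensor config."""
    cfg = copy.deepcopy(tensor)

    out = cfg.get("outputTensorInfo")
    if out is None:
        raise ValueError("outputTensorInfo is required")
    for key in ("name", "dtype"):
        if key not in out:
            raise ValueError(f"outputTensorInfo.{key} is required")

    # Documented defaults for the optional output fields.
    out.setdefault("shape", [])
    out.setdefault("isSparse", False)
    out.setdefault("isDocumentFeature", True)

    info = cfg.setdefault("inputFeatureInfo", {})
    if "columnExpr" in info and "columnConfig" in info:
        raise ValueError("specify either columnExpr or columnConfig, not both")
    if "columnExpr" not in info and "columnConfig" not in info:
        # Default: read the column named after the output tensor.
        info["columnExpr"] = out["name"]

    hash_info = info.get("transformConfig", {}).get("hashInfo")
    if hash_info is not None:
        if "hashBucketSize" not in hash_info:
            raise ValueError("hashInfo.hashBucketSize is required")
        hash_info.setdefault("numHashFunctions", 1)
        hash_info.setdefault("combiner", "SUM")

    return cfg


def normalize_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Apply normalize_tensor_config to every feature and label."""
    features: List[Dict[str, Any]] = config.get("features", [])
    if not features:
        raise ValueError("features must not be empty")
    return {
        "features": [normalize_tensor_config(t) for t in features],
        "labels": [normalize_tensor_config(t) for t in config.get("labels", [])],
    }


normalized = normalize_config(
    {"features": [{"outputTensorInfo": {"name": "review", "dtype": "string"}}]}
)
print(normalized["features"][0]["inputFeatureInfo"])
```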

Avro2TF Config Example

Suppose your input data has the following schema:

{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "type": [
        "int",
        "null"
      ],
      "name": "label"
    },
    {
      "type": [
        "string",
        "null"
      ],
      "name": "review"
    },
    {
      "type": [
        {
          "type": "array",
          "items": [
            {
              "type": "record",
              "name": "words",
              "namespace": "topLevelRecord",
              "fields": [
                {
                  "type": [
                    "string",
                    "null"
                  ],
                  "name": "name"
                },
                {
                  "type": [
                    "string",
                    "null"
                  ],
                  "name": "term"
                },
                {
                  "type": [
                    "float",
                    "null"
                  ],
                  "name": "value"
                }
              ]
            },
            "null"
          ]
        },
        "null"
      ],
      "name": "words"
    },
    {
      "type": [
        {
          "type": "array",
          "items": [
            {
              "type": "record",
              "name": "wideFeatures",
              "namespace": "topLevelRecord",
              "fields": [
                {
                  "type": [
                    "string",
                    "null"
                  ],
                  "name": "name"
                },
                {
                  "type": [
                    "string",
                    "null"
                  ],
                  "name": "term"
                },
                {
                  "type": [
                    "float",
                    "null"
                  ],
                  "name": "value"
                }
              ]
            },
            "null"
          ]
        },
        "null"
      ],
      "name": "wideFeatures"
    }
  ]
}

An example Avro2TF config would look like this:

{
  "features": [
    {
      "inputFeatureInfo": {
        "columnExpr": "words.term[0]"
      },
      "outputTensorInfo": {
        "name": "firstWord",
        "dtype": "long",
        "shape": []
      }
    },
    {
      "inputFeatureInfo": {
        "columnExpr": "review",
        "transformConfig": {
          "tokenization": {
            "removeStopWords": true
          }
        }
      },
      "outputTensorInfo": {
        "name": "wordSeq",
        "dtype": "long",
        "shape": [
          -1
        ]
      }
    },
    {
      "inputFeatureInfo": {
        "columnExpr": "review",
        "transformConfig": {
          "hashInfo": {
            "hashBucketSize": 1000,
            "numHashFunctions": 4
          }
        }
      },
      "outputTensorInfo": {
        "name": "wordSeq_hashed",
        "dtype": "long",
        "shape": [
          4
        ]
      }
    },
    {
      "inputFeatureInfo": {
        "columnExpr": "words.term"
      },
      "outputTensorInfo": {
        "name": "words_term",
        "dtype": "long",
        "shape": [
          -1
        ]
      }
    },
    {
      "inputFeatureInfo": {
        "columnConfig": {
          "words": {
            "whitelist": [
              "*"
            ]
          },
          "wideFeatures": {
            "blacklist": [
              "wideFeatures"
            ]
          }
        }
      },
      "outputTensorInfo": {
        "name": "words_wideFeatures_sparse",
        "dtype": "float",
        "shape": [],
        "isSparse" : true
      }
    },
    {
      "inputFeatureInfo": {
        "columnConfig": {
          "words": {
            "whitelist": [
              "*"
            ]
          },
          "wideFeatures": {
            "blacklist": [
              "wideFeatures"
            ]
          }
        }
      },
      "outputTensorInfo": {
        "name": "words_wideFeatures_dense",
        "dtype": "float",
        "shape": [],
        "isSparse" : false
      }
    },
    {
      "inputFeatureInfo": {
        "columnConfig": {
          "words": {
            "whitelist": [
              "*"
            ]
          },
          "wideFeatures": {
            "blacklist": [
              "wideFeatures"
            ]
          }
        },
        "transformConfig": {
          "hashInfo": {
            "hashBucketSize": 100,
            "combiner": "AVG"
          }
        }
      },
      "outputTensorInfo": {
        "name": "words_wideFeatures_hash",
        "dtype": "float",
        "shape": [],
        "isSparse" : true
      }
    }
  ],
  "labels": [
    {
      "inputFeatureInfo": {
        "columnExpr": "label"
      },
      "outputTensorInfo": {
        "name": "response",
        "dtype": "int",
        "shape": []
      }
    }
  ]
}
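
The wordSeq_hashed entry above uses a hashBucketSize of 1000 with 4 hash functions and an output shape of [4], which is consistent with one bucket index per hash function. The Python sketch below illustrates that idea, along with the AVG/SUM/MAX combiner for repeated indices. It is not Avro2TF's implementation: MD5 with an integer salt is a stand-in chosen only because it is deterministic, and both helper names are hypothetical.

```python
import hashlib
from typing import Dict, List


def hash_to_buckets(text: str, hash_bucket_size: int, num_hash_functions: int = 1) -> List[int]:
    """Return one bucket index per hash function; only the salt differs between functions.

    MD5 is a stand-in here, not the hash function Avro2TF uses.
    """
    buckets = []
    for salt in range(num_hash_functions):
        digest = hashlib.md5(f"{salt}:{text}".encode("utf-8")).digest()
        buckets.append(int.from_bytes(digest[:8], "big") % hash_bucket_size)
    return buckets


def combine(indices: List[int], values: List[float], combiner: str = "SUM") -> Dict[int, float]:
    """Merge the values of repeated indices using SUM, AVG, or MAX."""
    grouped: Dict[int, List[float]] = {}
    for index, value in zip(indices, values):
        grouped.setdefault(index, []).append(value)
    reducers = {"SUM": sum, "MAX": max, "AVG": lambda vs: sum(vs) / len(vs)}
    reducer = reducers[combiner]
    return {index: reducer(vs) for index, vs in sorted(grouped.items())}


ids = hash_to_buckets("a great movie", hash_bucket_size=1000, num_hash_functions=4)
print(ids)  # four bucket indices, each in [0, 1000)

merged = combine([3, 3, 5], [1.0, 3.0, 2.0], combiner="AVG")
print(merged)
```

Hashing trades exact feature identity for a fixed index range, so collisions are expected; the combiner decides how colliding values are merged.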

Avro2TF Job Parameters

  --input-paths                     Required.   A list of comma separated paths for input.
  --working-dir                     Required.   The path to working directory where the output should be saved.
  --input-date-range                Optional.   The input date range in the format of yyyymmdd-yyyymmdd.
  --input-days-range                Optional.   The input days range in the format of startOffset-endOffset.
  --num-output-files                Optional.   The number of output files with the default set to -1.
  --min-parts                       Optional.   The minimum number of partitions for input data; if below this threshold, repartition will be triggered.
  --shuffle                         Optional.   Whether to shuffle the converted training data with the default set to true.
  --external-feature-list-path      Optional.   The path to user supplied feature mapping files.
  --avro2tf-config-path             Required.   The Avro2TF configuration in JSON format.
  --execution-mode                  Optional.   Whether to prepare training, validation, or test data.
  --enable-cache                    Optional.   Whether to cache the intermediate Spark DataFrame result with default set to false.
  --skip-conversion                 Optional.   Whether to skip the conversion step with default set to false.
  --output-format                   Optional.   The output format of tensorized data, e.g. Avro or TFRecord.
  --extra-columns-to-keep           Optional.   A list of comma separated column names to specify extra columns to keep.
  --tensors-sharing-feature-lists   Optional.   Groups of output tensor names separated by semicolon; tensors in the same group are separated by comma. Tensors within the same group share the same feature list.
  --enable-filter-zero              Optional.   Whether to enable filter zero for all sparse vector output. Default is false.
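
As a sketch of how these parameters fit together, the Python snippet below assembles an argument list and formats the value for --tensors-sharing-feature-lists (groups separated by semicolons, tensors within a group by commas). The helper names, the paths, and the choice of tensor groups are all hypothetical; how the job itself is launched is not covered here.

```python
from typing import Dict, List, Sequence


def format_tensor_groups(groups: Sequence[Sequence[str]]) -> str:
    """Join tensors in a group with commas and groups with semicolons."""
    return ";".join(",".join(group) for group in groups)


def build_job_args(params: Dict[str, str]) -> List[str]:
    """Turn {"input-paths": "..."} into ["--input-paths", "..."] and check required flags."""
    required = ("input-paths", "working-dir", "avro2tf-config-path")
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    args: List[str] = []
    for name, value in params.items():
        args.extend([f"--{name}", value])
    return args


# All paths below are hypothetical placeholders.
args = build_job_args(
    {
        "input-paths": "/data/reviews/train,/data/reviews/extra",
        "working-dir": "/tmp/avro2tf-out",
        "avro2tf-config-path": "/tmp/avro2tf-config.json",
        "output-format": "TFRecord",
        "tensors-sharing-feature-lists": format_tensor_groups(
            [["wordSeq", "words_term"], ["firstWord"]]
        ),
    }
)
print(" ".join(args))
```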

Avro2TF Examples

Please take a look at our Avro2TF Official Tutorial! :)
