Data Prep

A collection of libraries, a pipeline plugin, and a CDAP service for performing data cleansing, transformation, and filtering using a set of data manipulation instructions (directives). These instructions are either generated using an interative visual tool or are manually created.

Data Prep defines few concepts that might be useful if you are just getting started with it. Learn about them here
The Data Prep Transform is separately documented.
Data Prep Cheatsheet

New Features

More here on upcoming features.

User Defined Directives, also known as UDD, allow you to create custom functions to transform records within CDAP DataPrep or a.k.a Wrangler. CDAP comes with a comprehensive library of functions. There are however some omissions, and some specific cases for which UDDs are the solution. Additional information on how you can build your custom directives here.
- Migrating directives from version 1.0 to version 2.0 here
- Information about Grammar here
- Various TokenType supported by system here
- Custom Directive Implementation Internals here
A new capability that allows CDAP Administrators to restrict the directives that are accessible to their users. More information on configuring can be found here

Demo Videos and Recipes

Videos and Screencasts are best way to learn, so we have compiled simple, short screencasts that shows some of the features of Data Prep. Additional videos can be found here

Videos

[SCREENCAST] Creating Lookup Dataset and Joining
[SCREENCAST] Restricted Directives
[SCREENCAST] Parse Excel files in CDAP
[SCREENCAST] Parse File As AVRO File
[SCREENCAST] Parsing Binary Coded AVRO Messages
[SCREENCAST] Parsing Binary Coded AVRO Messages & Protobuf messages using schema registry
[SCREENCAST] Quantize a column - Digitize
[SCREENCAST] Data Cleansing capability with send-to-error directive
[SCREENCAST] Building Data Prep from the GitHub source
[VOICE-OVER] End-to-End Demo Video
[SCREENCAST] Ingesting into Kudu
[SCREENCAST] Realtime HL7 CCDA XML from Kafka into Time Parititioned Parquet
[SCREENCAST] Parsing JSON file
[SCREENCAST] Flattening arrays
[SCREENCAST] Data cleansing with send-to-error directive
[SCREENCAST] Publishing to Kafka
[SCREENCAST] Fixed length to JSON

Recipes

Available Directives

These directives are currently available:

Directive	Description
Parsers
JSON Path	Uses a DSL (a JSON path expression) for parsing JSON records
Parse as AVRO	Parsing an AVRO encoded message - either as binary or json
Parse as AVRO File	Parsing an AVRO data file
Parse as CSV	Parsing an input record as comma-separated values
Parse as Date	Parsing dates using natural language processing
Parse as Excel	Parsing excel file.
Parse as Fixed Length	Parses as a fixed length record with specified widths
Parse as HL7	Parsing Health Level 7 Version 2 (HL7 V2) messages
Parse as JSON	Parsing a JSON object
Parse as Log	Parses access log files as from Apache HTTPD and nginx servers
Parse as Protobuf	Parses an Protobuf encoded in-memory message using descriptor
Parse as Simple Date	Parses date strings
Parse XML To JSON	Parses an XML document into a JSON structure
Parse as Currency	Parses a string representation of currency into a number.
Parse as Datetime	Parses strings with datetime values to CDAP datetime type
Output Formatters
Write as CSV	Converts a record into CSV format
Write as JSON	Converts the record into a JSON map
Write JSON Object	Composes a JSON object based on the fields specified.
Format as Currency	Formats a number as currency as specified by locale.
Transformations
Changing Case	Changes the case of column values
Cut Character	Selects parts of a string value
Set Column	Sets the column value to the result of an expression execution
Find and Replace	Transforms string column values using a "sed"-like expression
Index Split	(Deprecated)
Invoke HTTP	Invokes an HTTP Service (Experimental, potentially slow)
Quantization	Quantizes a column based on specified ranges
Regex Group Extractor	Extracts the data from a regex group into its own column
Setting Character Set	Sets the encoding and then converts the data to a UTF-8 String
Setting Record Delimiter	Sets the record delimiter
Split by Separator	Splits a column based on a separator into two columns
Split Email Address	Splits an email ID into an account and its domain
Split URL	Splits a URL into its constituents
Text Distance (Fuzzy String Match)	Measures the difference between two sequences of characters
Text Metric (Fuzzy String Match)	Measures the difference between two sequences of characters
URL Decode	Decodes from the `application/x-www-form-urlencoded` MIME format
URL Encode	Encodes to the `application/x-www-form-urlencoded` MIME format
Trim	Functions for trimming white spaces around string data
Encoders and Decoders
Decode	Decodes a column value as one of `base32`, `base64`, or `hex`
Encode	Encodes a column value as one of `base32`, `base64`, or `hex`
Unique ID
UUID Generation	Generates a universally unique identifier (UUID) .Recommended to use with Wrangler version 4.4.0 and above due to an important bug fix CDAP-17732
Date Transformations
Diff Date	Calculates the difference between two dates
Format Date	Custom patterns for date-time formatting
Format Unix Timestamp	Formats a UNIX timestamp as a date
DateTime Transformations
Current DateTime	Generates the current datetime using the given zone or UTC by default
Datetime To Timestamp	Converts a datetime value to timestamp with the given zone
Format Datetime	Formats a datetime value to custom date time pattern strings
Timestamp To Datetime	Converts a timestamp value to datetime
Lookups
Catalog Lookup	Static catalog lookup of ICD-9, ICD-10-2016, ICD-10-2017 codes
Table Lookup	Performs lookups into Table datasets
Hashing & Masking
Message Digest or Hash	Generates a message digest
Mask Number	Applies substitution masking on the column values
Mask Shuffle	Applies shuffle masking on the column values
Row Operations
Filter Row if Matched	Filters rows that match a pattern for a column
Filter Row if True	Filters rows if the condition is true.
Filter Row Empty of Null	Filters rows that are empty of null.
Flatten	Separates the elements in a repeated field
Fail on condition	Fails processing when the condition is evaluated to true.
Send to Error	Filtering of records to an error collector
Send to Error And Continue	Filtering of records to an error collector and continues processing
Split to Rows	Splits based on a separator into multiple records
Column Operations
Change Column Case	Changes column names to either lowercase or uppercase
Changing Case	Change the case of column values
Cleanse Column Names	Sanatizes column names, following specific rules
Columns Replace	Alters column names in bulk
Copy	Copies values from a source column into a destination column
Drop Column	Drops a column in a record
Fill Null or Empty Columns	Fills column value with a fixed value if null or empty
Keep Columns	Keeps specified columns from the record
Merge Columns	Merges two columns by inserting a third column
Rename Column	Renames an existing column in the record
Set Column Header	Sets the names of columns, in the order they are specified
Split to Columns	Splits a column based on a separator into multiple columns
Swap Columns	Swaps column names of two columns
Set Column Data Type	Convert data type of a column
NLP
Stemming Tokenized Words	Applies the Porter stemmer algorithm for English words
Transient Aggregators & Setters
Increment Variable	Increments a transient variable with a record of processing.
Set Variable	Sets a transient variable with a record of processing.
Functions
Data Quality	Data quality check functions. Checks for date, time, etc.
Date Manipulations	Functions that can manipulate date
DDL	Functions that can manipulate definition of data
JSON	Functions that can be useful in transforming your data
Types	Functions for detecting the type of data

Performance

Initial performance tests show that with a set of directives of high complexity for transforming data, DataPrep is able to process at about ~106K records per second. The rates below are specified as records/second.

Directive Complexity	Column Count	Records	Size	Mean Rate
High (167 Directives)	426	127,946,398	82,677,845,324	106,367.27
High (167 Directives)	426	511,785,592	330,711,381,296	105,768.93

Contact

Mailing Lists

CDAP User Group and Development Discussions:

[email protected]

The cdap-user mailing list is primarily for users using the product to develop applications or building plugins for appplications. You can expect questions from users, release announcements, and any other discussions that we think will be helpful to the users.

IRC Channel

CDAP IRC Channel: #cdap on irc.freenode.net

Slack Team

CDAP Users on Slack: cdap-users team

License and Trademarks

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Apache, Apache HBase, and HBase are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Cheap and reliable Node.js hosting starts at $3/month, and $1/month static HTML hosting

data-integrations / wrangler

Programming Languages

Labels

Projects that are alternatives of or similar to wrangler