
databrickslabs / smolder

License: Apache-2.0
HL7 Apache Spark Datasource

Programming Languages

scala

Projects that are alternatives of or similar to smolder

hl7v2-fhir-converter
Converts HL7 v2 Messages to FHIR Resources
Stars: ✭ 40 (+21.21%)
Mutual labels:  hl7, hl7v2
Xsql
Unified SQL Analytics Engine Based on SparkSQL
Stars: ✭ 176 (+433.33%)
Mutual labels:  spark, datasource
HL7-dotnetcore
Lightweight HL7 C# parser and composer compatible with .Net Core and .Net Standard
Stars: ✭ 150 (+354.55%)
Mutual labels:  hl7, hl7v2
HL7
PHP library for Parsing, Generation and Sending HL7 v2 messages
Stars: ✭ 135 (+309.09%)
Mutual labels:  hl7, hl7v2
fhirpath
FHIRPath implementation in Python.
Stars: ✭ 25 (-24.24%)
Mutual labels:  hl7
spark-word2vec
A parallel implementation of word2vec based on Spark
Stars: ✭ 24 (-27.27%)
Mutual labels:  spark
shamash
Autoscaling for Google Cloud Dataproc
Stars: ✭ 31 (-6.06%)
Mutual labels:  spark
yuzhouwan
Code Library for My Blog
Stars: ✭ 39 (+18.18%)
Mutual labels:  spark
spark-demos
Collection of different demo applications using Apache Spark
Stars: ✭ 15 (-54.55%)
Mutual labels:  spark
incubator-linkis
Linkis helps easily connect to various back-end computation/storage engines(Spark, Python, TiDB...), exposes various interfaces(REST, JDBC, Java ...), with multi-tenancy, high performance, and resource control.
Stars: ✭ 2,459 (+7351.52%)
Mutual labels:  spark
kafka-compose
🎼 Docker compose files for various kafka stacks
Stars: ✭ 32 (-3.03%)
Mutual labels:  spark
spark-sql-flow-plugin
Visualize column-level data lineage in Spark SQL
Stars: ✭ 20 (-39.39%)
Mutual labels:  spark
BigData-News
A real-time big data system for news sites, built on Spark 2.2
Stars: ✭ 36 (+9.09%)
Mutual labels:  spark
spark-kubernetes
spark on kubernetes
Stars: ✭ 80 (+142.42%)
Mutual labels:  spark
tpch-spark
TPC-H queries in Apache Spark SQL using native DataFrames API
Stars: ✭ 63 (+90.91%)
Mutual labels:  spark
Search Ads Web Service
Online search advertisement platform & Realtime Campaign Monitoring [Maybe Deprecated]
Stars: ✭ 30 (-9.09%)
Mutual labels:  spark
docker-spark
Apache Spark docker container image (Standalone mode)
Stars: ✭ 34 (+3.03%)
Mutual labels:  spark
frovedis
Framework of vectorized and distributed data analytics
Stars: ✭ 59 (+78.79%)
Mutual labels:  spark
sentry-spark
Apache Spark Sentry Integration
Stars: ✭ 14 (-57.58%)
Mutual labels:  spark
Python Master Courses
Life is short, I use Python
Stars: ✭ 61 (+84.85%)
Mutual labels:  spark

A library for burning through electronic health record data using Apache Spark™

Smolder provides an Apache Spark™ SQL data source for loading EHR data from HL7v2 message formats. Additionally, Smolder provides helper functions that can be used on a Spark SQL DataFrame to parse HL7 message text, and to extract segments, fields, and subfields, from a message.

Project Support

Please note that all projects in the /databrickslabs GitHub account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.

Building and Testing

This project is built using sbt and Java 8.

Start an sbt shell using the sbt command.

Note: the build targets Spark 3.0.0 and Scala 2.12.8 by default. To build against a different Spark or Scala version, set the SPARK_VERSION and SCALA_VERSION environment variables.

To compile the main code:

compile

To run all Scala tests:

test

To test a specific suite:

testOnly *HL7FileFormatSuite

To create a JAR that can be run as part of an Apache Spark job or shell, run:

package

The JAR can be found under target/scala-<major-version>.

Getting Started

To load HL7 messages into an Apache Spark SQL DataFrame, simply invoke the hl7 reader:

scala> val df = spark.read.format("hl7").load("path/to/hl7/messages")
df: org.apache.spark.sql.DataFrame = [message: string, segments: array<struct<id:string,fields:array<string>>>]

The returned schema contains the message header in the message column. The message segments are nested in the segments column, which is an array. Each element of this array contains two nested fields: the string id of the segment (e.g., PID for a patient identification segment) and an array of the segment's fields.
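To make the schema concrete, here is a plain-Scala sketch (not Smolder's actual parser, and using an illustrative message) of how a raw HL7 message maps onto these two columns:

```scala
// Illustrative HL7 message; segments are separated by carriage returns.
val raw = "MSH|^~\\&|||||20201020150800||ADT^A01|11374301|P|2.4\r" +
  "PID|1||12345||Heller^Keneth"
val lines = raw.split('\r')
// The MSH line is what lands in the `message` column.
val message = lines.head
// Each remaining line becomes one element of the `segments` array:
// the segment id plus that segment's fields.
val segments = lines.tail.map { line =>
  val parts = line.split('|')
  (parts.head, parts.tail.toSeq)
}
```

Here segments.head pairs the id "PID" with the segment's fields, one of which holds the patient's name.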

Parsing message text from a DataFrame

Smolder can also be used to parse raw message text. This is useful if an HL7 message feed lands in an intermediate source first (e.g., a Kafka stream). To do this, we can use Smolder's parse_hl7_message helper function. First, we start with a DataFrame containing HL7 message text:

scala> val textMessageDf = ...
textMessageDf: org.apache.spark.sql.DataFrame = [value: string]

scala> textMessageDf.show()
+--------------------+                                                          
|               value|
+--------------------+
|MSH|^~\&|||||2020...|
+--------------------+
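For reference, each row of such a DataFrame holds one complete message as a single string, with segments separated by carriage returns. A minimal illustrative example (message contents are made up):

```scala
// One complete HL7 message as a single string, segments separated by
// carriage returns (contents are illustrative).
val rawMessage = Seq(
  "MSH|^~\\&|||||20201020150800||ADT^A01|11374301|P|2.4",
  "PID|1||12345||Heller^Keneth"
).mkString("\r")
```

A DataFrame with one such string per row (e.g., from spark.read.text, or a Kafka value column cast to string) is what parse_hl7_message expects.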

Then, we can import the parse_hl7_message function from the com.databricks.labs.smolder.functions object and apply it to the column we want to parse:

scala> import com.databricks.labs.smolder.functions.parse_hl7_message
import com.databricks.labs.smolder.functions.parse_hl7_message

scala> val parsedDf = textMessageDf.select(parse_hl7_message($"value").as("message"))
parsedDf: org.apache.spark.sql.DataFrame = [message: struct<message: string, segments: array<struct<id:string,fields:array<string>>>>]

This yields the same schema as our hl7 data source.

Extracting fields from an HL7 message segment

While Smolder provides an easy-to-use schema for HL7 messages, we also provide helper functions in com.databricks.labs.smolder.functions to extract subfields of a message segment. For instance, let's say we want to get the patient's name, which is the 5th field in the patient ID (PID) segment. Since field indices are 0-based, we pass an index of 4 to the segment_field function:

scala> import com.databricks.labs.smolder.functions.segment_field
import com.databricks.labs.smolder.functions.segment_field

scala> val nameDf = df.select(segment_field("PID", 4).alias("name"))
nameDf: org.apache.spark.sql.DataFrame = [name: string]

scala> nameDf.show()
+-------------+
|         name|
+-------------+
|Heller^Keneth|
+-------------+

If we then want to get the patient's first name, we can use the subfield function:

scala> import com.databricks.labs.smolder.functions.subfield
import com.databricks.labs.smolder.functions.subfield

scala> val firstNameDf = nameDf.select(subfield($"name", 1).alias("firstname"))
firstNameDf: org.apache.spark.sql.DataFrame = [firstname: string]

scala> firstNameDf.show()
+---------+
|firstname|
+---------+
|   Keneth|
+---------+
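The component logic at work here can be sketched in plain Scala (a sketch of the semantics, not Smolder's implementation): the field is split on HL7's standard `^` component delimiter, with the same 0-based indexing used by segment_field:

```scala
// "Heller^Keneth" holds family and given name as ^-separated components.
val name = "Heller^Keneth"
// 0-based component index 1 selects the given name.
val firstName = name.split('^')(1)
```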

License and Contributing

Smolder is made available under an Apache 2.0 license, and we welcome contributions from the community. Please see our contributor guidance for information about how to contribute to the project. To ensure that contributions to Smolder are properly licensed, we follow the Developer Certificate of Origin (DCO) for all contributions to the project.

Note that the project description data, including the texts, logos, images, and/or trademarks, for each open source project belongs to its rightful owner. If you wish to add or remove any projects, please contact us at [email protected].