
deltaDNA / sql-cookbook

License: Apache-2.0
Common SQL recipes and best practices


Projects that are alternatives of or similar to sql-cookbook

old nesta daps
[archived]
Stars: ✭ 16 (-76.47%)
Mutual labels:  data-mining
tree-hugger
A light-weight, extendable, high level, universal code parser built on top of tree-sitter
Stars: ✭ 96 (+41.18%)
Mutual labels:  data-mining
XCloud
Official Code for Paper <XCloud: Design and Implementation of AI Cloud Platform with RESTful API Service> (arXiv1912.10344)
Stars: ✭ 58 (-14.71%)
Mutual labels:  data-mining
python-notebooks
A collection of Jupyter Notebooks used in conferences or just to have some snippets.
Stars: ✭ 14 (-79.41%)
Mutual labels:  data-mining
2018-JData-Unicom-RiskUser
2018 JData-Unicom competition: identifying risk users based on mobile network communication behaviour (baseline 0.77)
Stars: ✭ 20 (-70.59%)
Mutual labels:  data-mining
vertica-swoole-adapter
Provides a DB layer for Swoole-based applications to communicate to HP Vertica databases.
Stars: ✭ 14 (-79.41%)
Mutual labels:  vertica
LeetCode
At present contains scraped data from around 1500 problems present on the site. More to follow....
Stars: ✭ 45 (-33.82%)
Mutual labels:  data-mining
emperor-os
(newly released v2.5 LTS, 2022-06-25) Focused on developing an all-in-one operating system for programming, design and data science. Emperor-OS has over 500 apps and important tools.
Stars: ✭ 32 (-52.94%)
Mutual labels:  data-mining
The-Purchase-and-Redemption-Forecast-Challenge-baseline
Solution to the Tianchi "Purchase and Redemption Forecast: Challenge the Baseline" competition; online score 143.5
Stars: ✭ 78 (+14.71%)
Mutual labels:  data-mining
Loan-Approval-Prediction
Loan Application Data Analysis
Stars: ✭ 61 (-10.29%)
Mutual labels:  data-mining
iww
AI-based web wrapper for web content extraction
Stars: ✭ 61 (-10.29%)
Mutual labels:  data-mining
dee2
Digital Expression Explorer 2 (DEE2): a repository of uniformly processed RNA-seq data
Stars: ✭ 32 (-52.94%)
Mutual labels:  data-mining
FEATHER
The reference implementation of FEATHER from the CIKM '20 paper "Characteristic Functions on Graphs: Birds of a Feather, from Statistical Descriptors to Parametric Models".
Stars: ✭ 34 (-50%)
Mutual labels:  data-mining
Data-Mining-and-Warehousing
Data Mining algorithms for IDMW632C course at IIIT Allahabad, 6th semester
Stars: ✭ 19 (-72.06%)
Mutual labels:  data-mining
chainRec
Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", in Proc. of 2018 ACM Conference on Recommender Systems (RecSys'18), Vancouver, Canada, Oct. 2018.
Stars: ✭ 52 (-23.53%)
Mutual labels:  data-mining
heidi
heidi : tidy data in Haskell
Stars: ✭ 24 (-64.71%)
Mutual labels:  data-mining
HFT-Prediction
Machine learning approach to high frequency trading, MLP & RNN used
Stars: ✭ 19 (-72.06%)
Mutual labels:  data-mining
csmath-2021
This mathematics course is taught to first-year Ph.D. students of computer science and related areas @zju
Stars: ✭ 30 (-55.88%)
Mutual labels:  data-mining
candis
🎀 A data mining suite for gene expression data.
Stars: ✭ 28 (-58.82%)
Mutual labels:  data-mining
tf-idf-python
Term frequency–inverse document frequency for Chinese novel/documents implemented in python.
Stars: ✭ 98 (+44.12%)
Mutual labels:  data-mining

sql-cookbook

Common SQL recipes and best practices

Purpose

This repository aims to provide a number of best practices and ideas on how to construct queries that return interesting data in a way that works well with our platform.

Use these queries as you please for educational purposes, and adjust them to suit your specific needs.

Everyone is welcome to contribute: just fork the project and file a pull request.

Reference

These queries can be run within the deltaDNA platform in the data mining section. Alternatively, by adding either _live or _dev to the table names, you can use these queries in your favourite DB tool via direct access as well.

Data mining reference.

Direct SQL Access reference.

The queries are executed by an HP Vertica DBMS, so the SQL dialect is Vertica's; it is described in the HP Vertica documentation.

Prerequisites

We assume some basic knowledge of SQL select statements, as well as a basic knowledge of the data structure within the deltaDNA platform: what events you are sending in and what they denote.

In this cookbook we aim to show how SQL queries can best be used to query the data within the deltaDNA platform. The theory applies to both the data mining feature and direct access.

Data Mining

Our data mining feature enables you to write an SQL query that retrieves data from the exposed tables and shows you the result on screen. These results can then be shown as a graph, a pie chart or as plain data. The queries can be stored and used in custom dashboards and email reports. You can even use data mining queries to create a list of userIDs that you can then use within the platform as a target list.
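
As a minimal sketch of such a target-list query (assuming the gameStarted event used later in this cookbook), the following selects every user who started the game in the last seven days:

-- Distinct users who sent a gameStarted event in the last 7 days;
-- the result can be used as a target list within the platform.
select distinct userId
from events
where eventName = 'gameStarted'
and eventDate > current_date - 7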

Direct Access

Direct Access is the name of our direct database connection. We provide an endpoint where you can connect directly to the data warehouse as if it were a simple PostgreSQL (Postgres) database. This way you can connect to the data warehouse via tools like Tableau or DBeaver, or directly from Python or R. This allows for very flexible use of the data, but it does require some more work on your part. What direct access is not intended for is mirroring the data somewhere else; if you'd like to do that, please download the raw data exports from our S3 archive and use those instead.

Data mining and Direct Access, the differences

Data mining is part of the platform and is linked to the game and the environment. Within data mining you'll have the events table, which is the table for the environment you're in (dev or live) for your game. Direct access connects to the game; this means you don't have an events table there, but both an events_dev and an events_live table.
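
For example, the same count written for each context (a sketch):

-- In data mining, the environment is implicit in the events table:
select count(*) from events

-- Via direct access, you pick the environment explicitly:
select count(*) from events_live
select count(*) from events_dev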

The data warehouse

The data warehouse consists of a Vertica database. We currently expose this via direct access as a Postgres database, since the SQL dialects are similar but Postgres is supported by more clients than Vertica.

Vertica is a column store database that is especially good at analytics. This is a good thing, but there are some things to keep in mind.

There are a lot of columns, which is why the events table is also called a wide table. For the way the data is stored this is not an issue, since a value is only stored for the rows that actually have one in that column; naturally there will be quite a few empty values in this table. And because all events are stored in this one table, there will be a lot of rows when it is represented as a table.

When querying the data it makes a big difference how many columns are within scope. Columns you are not interested in should not be queried. This is why select count(*) from events is usually faster than select * from events limit 1. And when you do select a row with all columns, you'll quickly find it hard to spot the values you are actually interested in, since you will keep having to scroll horizontally.
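
To see the difference yourself, compare these two queries (a sketch; actual timings will vary per game):

-- Only needs row counts, touches no column data: usually fast
select count(*) from events

-- Touches every column of one row: noticeably slower on a wide table
select * from events limit 1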

Tips

Use analytic functions instead of joins

  • Use case: the last known clientVersion for each user (the clientVersionLast user metric)

You could get the last clientVersion known for a user from the user_metrics table and join it back onto the events table:

select events.eventName, user_metrics.fieldClientVersionLast, count(distinct events.userId) as users
from events
join user_metrics on user_metrics.user_id = events.userId
where events.eventDate > current_date - 14
group by events.eventName, user_metrics.fieldClientVersionLast

Instead, you can use an analytic function to find the last value of the clientVersion field for each user.

with dataCTE as (
    select eventName,
           -- take the last non-null clientVersion over the user's whole history,
           -- not just up to the current row (hence the explicit window frame)
           last_value(clientVersion ignore nulls) over (
               partition by userId order by eventId
               rows between unbounded preceding and unbounded following
           ) as lastClientVersion,
           userId
    from events
    where eventDate > current_date - 14
)
select eventName, lastClientVersion, count(distinct userId) as users
from dataCTE
group by eventName, lastClientVersion

Query Performance

A column store database like Vertica stores a table by splitting the data into multiple indexed data stores, where each data store is a chunk of one column, split further by the partitions the table is defined to have. In addition, a data store can be ordered to improve performance for certain types of queries.

If you are able to define a subset of items in the column or partition definition, you can massively increase query execution performance.

So only selecting the columns you are interested in means that instead of retrieving the data for all columns, only the relevant columns are touched. Try running:

select eventTimestamp, userId, eventName from events order by 1

instead of:

select * from events order by eventTimestamp

In the case of the events table it helps to limit the number of dates your query is looking at.

where eventDate = current_date - 10

Or if you can limit the number of events to look at:

where eventName = 'gameStarted'

Or if you define the users you are interested in:

where userId in ('ThisUser', 'AnotherUser')

Alternatively, you can sample the data based on the userId, using a simple operation that tells you whether the user should be included in your search, like only looking at users whose ID starts with an 'a':

where left(userId, 1) = 'a'

Another, more random, way to query a subset of users is to run a hash function on the userId (which returns a number) and then take the modulo of the result; this gives you a pseudo-random bucket number. Keeping only the rows where this returns 0 gives you 1/100th of all users and makes the query nearly 100x as fast. Using this when debugging a complicated query is recommended. (Pro tip: try changing the 100 to 10000 for a smaller subset, and the 0 to a different number to test with a different sample!)

where mod(hash(userId), 100) = 0
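
For instance, applying the pro tip above (a sketch):

-- Roughly 1/10000th of all users, testing with bucket 42 instead of 0:
where mod(hash(userId), 10000) = 42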

Spooling

The next step in running any query is returning the results. This depends on bandwidth and might cause performance issues on a slow internet connection. Returning 100 rows will not be a problem, and most DB tools, like DBeaver, as well as data mining itself, limit the number of returned rows to something below 1000. When running queries from R or Python this is an easy pitfall, especially when running on your laptop via a mobile data connection in the park.
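
When querying from R or Python it is therefore worth capping the result set yourself. A minimal sketch, where the aggregation keeps the result small before it ever leaves the warehouse:

-- Aggregate first, then cap the rows sent over the wire:
select eventName, count(*) as eventCount
from events
group by eventName
order by eventCount desc
limit 100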

Filtering in a query

Filter as early as possible in a query to make the next step quicker. So instead of:

with data as (select eventName, userId, platform from events)
select * from data
where eventName = 'gameStarted'

do:

with data as (select eventName, userId, platform from events
              where eventName = 'gameStarted')
select * from data