txytju / air-quality-prediction

Licence: other
Repository of KDD Cup, 2018.

Programming Languages

Jupyter Notebook
11667 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to air-quality-prediction

awesome-time-series
Resources for working with time series and sequence data
Stars: ✭ 178 (+270.83%)
Mutual labels:  time-series-prediction
Springboard-Data-Science-Immersive
No description or website provided.
Stars: ✭ 52 (+8.33%)
Mutual labels:  time-series-prediction
GAR
Code and resources for papers "Generation-Augmented Retrieval for Open-Domain Question Answering" and "Reader-Guided Passage Reranking for Open-Domain Question Answering", ACL 2021
Stars: ✭ 38 (-20.83%)
Mutual labels:  seq2seq-model
lstm-electric-load-forecast
Electric load forecast using Long-Short-Term-Memory (LSTM) recurrent neural network
Stars: ✭ 56 (+16.67%)
Mutual labels:  time-series-prediction
keras-chatbot-web-api
Simple keras chat bot using seq2seq model with Flask serving web
Stars: ✭ 51 (+6.25%)
Mutual labels:  seq2seq-model
Predictive-Maintenance
time-series prediction for predictive maintenance
Stars: ✭ 28 (-41.67%)
Mutual labels:  time-series-prediction
learningspoons
nlp lecture-notes and source code
Stars: ✭ 29 (-39.58%)
Mutual labels:  seq2seq-model
exact
EXONA: The Evolutionary eXploration of Neural Networks Framework -- EXACT, EXALT and EXAMM
Stars: ✭ 43 (-10.42%)
Mutual labels:  time-series-prediction
awesome-energy-forecasting
list of papers, code, and other resources
Stars: ✭ 31 (-35.42%)
Mutual labels:  time-series-prediction
Neural-Machine-Translation
Several basic neural machine translation models implemented by PyTorch & TensorFlow
Stars: ✭ 29 (-39.58%)
Mutual labels:  seq2seq-model
text simplification
Text Simplification Model based on Encoder-Decoder (includes Transformer and Seq2Seq) model.
Stars: ✭ 66 (+37.5%)
Mutual labels:  seq2seq-model
DARNN
A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction
Stars: ✭ 90 (+87.5%)
Mutual labels:  time-series-prediction
Awesome Chatbot
Awesome Chatbot Projects, Corpus, Papers, Tutorials. Chinese Chatbot =>:
Stars: ✭ 1,785 (+3618.75%)
Mutual labels:  seq2seq-model
neural-chat
An AI chatbot using seq2seq
Stars: ✭ 30 (-37.5%)
Mutual labels:  seq2seq-model
Gluon Ts
Probabilistic time series modeling in Python
Stars: ✭ 2,373 (+4843.75%)
Mutual labels:  time-series-prediction
Traffic-Prediction-Open-Code-Summary
Summary of open source code for deep learning models in the field of traffic prediction
Stars: ✭ 58 (+20.83%)
Mutual labels:  time-series-prediction
time series classification prediction
Different deep learning architectures are implemented for time series classification and prediction purposes.
Stars: ✭ 17 (-64.58%)
Mutual labels:  time-series-prediction
financial-ts-prediction-with-deeplearning
(Work In Progress) Implementation of "Financial Time Series Prediction Using Deep Learning"
Stars: ✭ 15 (-68.75%)
Mutual labels:  time-series-prediction
fireTS
A python multi-variate time series prediction library working with sklearn
Stars: ✭ 62 (+29.17%)
Mutual labels:  time-series-prediction

1. Introduction

  • KDD Cup 2018 data mining competition; the main task is to predict air quality (aq) in Beijing and London for the next 48 hours.
  • Seq2seq and xgboost models are used, ranking 31st on the final leaderboard.

2. Data Exploration and Preprocess

2.1 Exploratory analysis of data

2.2 Data Preprocess

Preprocess the data, then split the dataset into training, validation and aggregation sets.

  1. Data preprocess

    Steps of data preprocess:

    1. Remove duplicated data. Some hourly records are duplicated; remove them.
    2. Missing value processing. If hourly data are missing for all stations for 5 or more hours in a row, every (X, y) pair that contains those missing hours in X or y is dropped. If data are missing for all stations for fewer than 5 hours in a row, the gap is filled by linear interpolation between the data before and after it. When only some specific stations report NaN, data from the nearest station are used to pad (a sketch of these steps follows after the split description below).
  2. Split the data

    All data points that are valid after preprocessing are split into 3 parts: a training set, a validation set and an aggregation set.

    The training set is used for training the single models; usually, data from 20170101-20180328 are used as the training set.

    The validation set is used to select the best checkpoint of each single model. All best single models are then aggregated on the validation set and finally evaluated on the aggregation set. The aggregation model is used for the final prediction.
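
A minimal sketch of the preprocessing step, assuming the hourly data for one pollutant live in a pandas DataFrame with a DatetimeIndex and one column per station. `station_coords` (station name to (longitude, latitude)) and `_distance` are hypothetical helpers, and the sketch folds the all-station and station-specific gap handling into one pass:

```python
import math
import pandas as pd

MAX_GAP = 5  # gaps of >= 5 consecutive hours stay NaN; their (X, y) pairs are dropped later

def _distance(a, b):
    # Euclidean distance on (longitude, latitude) pairs; adequate at city scale.
    return math.hypot(a[0] - b[0], a[1] - b[1])

def preprocess(aq: pd.DataFrame, station_coords: dict) -> pd.DataFrame:
    """aq: hourly values of one pollutant, DatetimeIndex, one column per station."""
    # 1. Remove duplicated hourly records.
    aq = aq[~aq.index.duplicated(keep="first")]
    # 2. Linearly interpolate gaps shorter than MAX_GAP hours.
    aq = aq.interpolate(method="time", limit=MAX_GAP - 1, limit_area="inside")
    # 3. Pad station-specific gaps with values from the nearest station.
    for station in aq.columns:
        neighbors = sorted(
            (s for s in aq.columns if s != station),
            key=lambda s: _distance(station_coords[station], station_coords[s]))
        for neighbor in neighbors:
            missing = aq[station].isna()
            if not missing.any():
                break
            aq.loc[missing, station] = aq.loc[missing, neighbor]
    return aq
```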

2.3 Oversampling

  1. Why oversampling?

    Symmetric mean absolute percentage error (SMAPE) is used as the evaluation metric in this competition. In SMAPE, relative error matters rather than absolute error, as shown in the formula:

    SMAPE = (1/n) * Σ_t |F_t - A_t| / ((|A_t| + |F_t|) / 2)

    where A_t is the actual value and F_t is the forecast at hour t.

    However, loss functions such as L1, L2 and Huber loss are applied in the different models, and they all aim at decreasing absolute error rather than relative error. So if the models are trained on the original data with these loss functions, they are optimized to fit data points with large values rather than data points with small values, which leads to a larger SMAPE on the validation set and the test set.

  2. Oversampling Strategies

    Training data from 20170101-20180328 are used. The oversampling steps are as follows (a NumPy sketch follows this list):

    1. The PM2.5 mean of y is calculated for every (X, y) pair, and all data points in the training set are sorted by it in ascending order.
    2. The smallest oversample_part fraction of all data points is picked, repeated repeats times, and appended to the original training dataset. So (1 + repeats * oversample_part) times the original amount of training data is finally used to generate the training batches (X, y), which helps shift the optimization target from those loss functions towards SMAPE.

    oversample_part and repeats are hyperparameters whose suitable values can be found by random search or grid search. Oversampling led to a 0.02~0.04 improvement in SMAPE on the validation set.
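
A minimal NumPy sketch of this strategy, assuming X and y are aligned arrays of training windows and that PM2.5 sits at a known position in the last axis of y (both assumptions of this sketch, not the repository's actual layout):

```python
import numpy as np

def oversample(X, y, oversample_part=0.2, repeats=2, pm25_index=0):
    """Repeat the (X, y) pairs whose target windows have the smallest PM2.5 mean."""
    # Mean PM2.5 over each sample's target window, then an ascending sort.
    pm25_mean = y[..., pm25_index].reshape(len(y), -1).mean(axis=1)
    order = np.argsort(pm25_mean)
    # Take the smallest `oversample_part` fraction and append it `repeats` times.
    smallest = order[:int(len(y) * oversample_part)]
    idx = np.concatenate([np.arange(len(y))] + [smallest] * repeats)
    return X[idx], y[idx]
```

For example, oversample(X_train, y_train, oversample_part=0.2, repeats=2) enlarges the training set by a factor of 1 + 2 * 0.2 = 1.4; the actual values are the ones found by the random or grid search described above.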

3. Models

3.1 seq2seq

Seq2seq is a machine learning model that uses an encoder and a decoder to learn sequential feature patterns from data. It is used in many machine learning applications, especially NLP applications such as machine translation. In this project, seq2seq is applied to generate time series forecasts at two granularities: a Day model and an Hour model. The basic graph of the seq2seq model is as follows.
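
Since the original figure is not reproduced here, the sketch below shows one plausible Keras encoder-decoder of this shape; all horizons and layer sizes are illustrative assumptions rather than the repository's actual configuration:

```python
from tensorflow import keras
from tensorflow.keras import layers

INPUT_HOURS, OUTPUT_HOURS = 120, 48          # look-back and forecast horizons (assumed)
N_FEATURES, N_TARGETS, HIDDEN = 35, 6, 128   # likewise assumed

# Encoder: read the historical window and summarize it in the LSTM state.
encoder_inputs = keras.Input(shape=(INPUT_HOURS, N_FEATURES))
_, state_h, state_c = layers.LSTM(HIDDEN, return_state=True)(encoder_inputs)

# Decoder: unroll OUTPUT_HOURS steps from the encoder state; during training
# the previous true target is fed in at each step (teacher forcing).
decoder_inputs = keras.Input(shape=(OUTPUT_HOURS, N_TARGETS))
decoder_seq = layers.LSTM(HIDDEN, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])
outputs = layers.TimeDistributed(layers.Dense(N_TARGETS))(decoder_seq)

model = keras.Model([encoder_inputs, decoder_inputs], outputs)
model.compile(optimizer="adam", loss=keras.losses.Huber())  # L1/L2/Huber all fit the text
```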

  1. Day model

    The air conditions appear to be strongly cyclical from day to day, as shown in the 3rd part of bj_aq_data_exploration and below. So the basic seq2seq model is the Day model: it predicts only the mean value of each aq parameter over the next 2 days, and then overlays the within-day 24-hour trend to generate the final hourly prediction (a sketch of the overlay step follows this list).

    [Figure: daily cycles of PM2.5, PM10, O3 and NO2]

    The computation graph of the Day model is as follows.

  2. Hour model: predicting 2 days together

  3. Hour model: predicting 1 day at a time
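
A minimal sketch of the Day model's overlay step, under an additive reading of "overlay the trend": day_means are the per-day means predicted by the model, and diurnal_profile (the average hour-of-day deviation from the daily mean, estimated from training data) is a hypothetical input of this sketch:

```python
import numpy as np

def overlay_daily_trend(day_means, diurnal_profile):
    """day_means: (n_days, n_targets); diurnal_profile: (24, n_targets)."""
    # Spread each predicted daily mean across its 24 hours...
    hourly = np.repeat(day_means, 24, axis=0)
    # ...then add the average within-day trend on top.
    return hourly + np.tile(diurnal_profile, (len(day_means), 1))
```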

3.2 xgboost

3.3 models aggregation
