jngseattle / Bordercrossing

Galvanize Capstone Project - Forecasting Wait Times at the US/Canada Border

Forecasting Wait Times at the US/Canada Border

Background

Travel delays are a frustrating reality of driving and are particularly pronounced when crossing the US/Canada border. Reliable and accurate predictions of wait times are not available to travelers. This project predicts the wait times at the Peace Arch and Pacific Highway border crossings and provides users with a more reliable forecast of wait times for dates in 2016.

Executive Summary

Deliverables

  • Model predictions of wait times for Peace Arch and Pacific Highway crossings
  • Estimates for missing northbound data due to known sensor issues
  • Borderforecaster.com web app for displaying wait time predictions

Results

  • Model predictions are more accurate than a baseline model of 12-month averages by day of week
  • The most important features are time of day and weather, which capture daily and yearly seasonality, respectively
  • The holidays that drive wait times differ by travel direction, with the most important holidays being Victoria Day and Good Friday

Data Source

JSON API

Crossings

  • Peace Arch
  • Pacific Highway
  • Sumas
  • Lynden

Lanes

  • Car
  • Nexus
  • Truck
  • Bus
  • FAST

This project focused on car lane data at Peace Arch and Pacific Highway crossings only.

Goals

Improve predictions compared to publicly available tools

Users can view real-time wait times from WSDOT or US Customs, but real-time data is useful only to travelers who are already near the border.

Alternatively, users can view an average of wait times for a day of week from the University of California; however, variations by day of year are disregarded.

Provide predictions for northbound crossings

The UC data only provides predictions for southbound crossings. Northbound crossings are omitted because of gaps in the data caused by where the sensors are placed. According to the data steward, data below a certain threshold are reported as zero.

The chart below shows volume in red and wait time in blue. Notice that between 12pm and 4pm, even though volume is at a peak, the wait time displays zero.

Compare this to southbound data which shows more reasonable wait times throughout the day. Even when the wait time drops, rarely does it drop to zero.

Pre-processing

Imputing false zeros

For northbound data, the false zeros in the chart above had to be imputed before any predictions could be made. The data was imputed with decision tree models that used the volume and wait time values of neighboring records as features. The imputer consisted of 3 separate decision tree models, applied depending on which neighboring values were available:

  • Both lead + lag values
  • Lead values only
  • Lag values only

Because of the large spans of false zeros, the imputer was applied iteratively, filling in missing values in step-wise fashion.

The imputer was trained on southbound data with data below a configurable threshold set to zero. To validate the approach, the model was cross-validated on a separate southbound crossing where false zeros were emulated by removing data below a threshold.
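
The step-wise filling described above can be sketched as follows. This is only an outline of the control flow under stated assumptions: the fitted decision trees are replaced by a hypothetical stand-in that averages whichever neighbors are known, volume features are left out, and all function names are invented for the example.

```python
from typing import Callable, List, Optional

# None marks a false zero that still needs an imputed value.
Series = List[Optional[float]]


def pick_case(lag: Optional[float], lead: Optional[float]) -> Optional[str]:
    """Choose which of the three imputer variants applies to one record."""
    if lag is not None and lead is not None:
        return "both"
    if lead is not None:
        return "lead"
    if lag is not None:
        return "lag"
    return None  # no neighbor known yet; wait for a later pass


def stand_in_predict(case: str, lag: Optional[float], lead: Optional[float]) -> float:
    """Hypothetical stand-in for the decision trees: average the known neighbors.

    A real version would dispatch on `case` to one of three fitted models.
    """
    known = [v for v in (lag, lead) if v is not None]
    return sum(known) / len(known)


def impute_iteratively(
    values: Series,
    predict: Callable[[str, Optional[float], Optional[float]], float] = stand_in_predict,
) -> List[float]:
    """Fill gaps from the edges inward, one pass at a time."""
    filled = list(values)
    while any(v is None for v in filled):
        snapshot = list(filled)  # each pass only sees values known before it started
        progressed = False
        for i, value in enumerate(snapshot):
            if value is not None:
                continue
            lag = snapshot[i - 1] if i > 0 else None
            lead = snapshot[i + 1] if i + 1 < len(snapshot) else None
            case = pick_case(lag, lead)
            if case is not None:
                filled[i] = predict(case, lag, lead)
                progressed = True
        if not progressed:
            raise ValueError("no known values to impute from")
    return filled


result = impute_iteratively([10.0, None, None, None, 20.0])
```

In this toy run the first pass fills the two edge gaps from their single known neighbor, and the second pass fills the middle gap using both.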

Smoothing and resampling

Because the raw data is noisy, it was smoothed using LOWESS with a window size of 1 hour.

Once smoothed, the data was resampled at a 30-minute grain to reduce processing time without degrading the end-user experience.
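
A minimal sketch of the resampling step is shown below (the LOWESS smoothing is not reproduced here). It assumes that readings falling in the same 30-minute bucket are averaged; the aggregation used in the project is not stated, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def floor_to_30min(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 30-minute bucket."""
    return ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)


def resample_30min(readings: Iterable[Tuple[datetime, float]]) -> Dict[datetime, float]:
    """Average all readings per 30-minute bucket (mean is an assumption)."""
    buckets = defaultdict(list)
    for ts, wait in readings:
        buckets[floor_to_30min(ts)].append(wait)
    return {bucket: sum(vals) / len(vals) for bucket, vals in sorted(buckets.items())}


readings = [
    (datetime(2016, 1, 4, 8, 0), 10.0),
    (datetime(2016, 1, 4, 8, 5), 20.0),
    (datetime(2016, 1, 4, 8, 35), 30.0),
]
resampled = resample_30min(readings)
```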

Feature Engineering

Date and time features

For each record, the following date and time features were constructed:

  • Time of day
  • Year
  • Month
  • Week
  • Day of week
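
The features listed above could be derived from a timestamp roughly as follows. The encodings (minutes since midnight, ISO week number, Monday as 0) and the feature names are assumptions for illustration; the project may have encoded them differently.

```python
from datetime import datetime
from typing import Dict


def date_features(ts: datetime) -> Dict[str, int]:
    """Build date and time features for one record."""
    return {
        "minute_of_day": ts.hour * 60 + ts.minute,  # time of day
        "year": ts.year,
        "month": ts.month,
        "week": ts.isocalendar()[1],                # ISO week number
        "day_of_week": ts.weekday(),                # Monday = 0 ... Sunday = 6
    }


features = date_features(datetime(2016, 5, 23, 14, 30))
```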

Holidays

Major US and Canadian holidays were added as features, along with lead and lag effects.

Lead holiday features were added to account for traveler behavior ahead of a holiday, e.g., the Friday before Labor Day. Lag holiday features were added to account for traveler behavior after a holiday, e.g., the Sunday after Thanksgiving.
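
One way to express lead and lag holiday features is as indicator flags keyed by the day offset from the holiday. In the sketch below, the 3-day window, the feature naming scheme and the function name are assumptions; the source does not say how far the lead and lag effects extend.

```python
from datetime import date
from typing import Dict


def holiday_features(day: date, holidays: Dict[str, date], max_offset: int = 3) -> Dict[str, int]:
    """Flag the holiday itself plus the days leading up to and following it."""
    feats = {}
    for name, holiday in holidays.items():
        delta = (day - holiday).days
        if delta == 0:
            feats[name] = 1
        elif -max_offset <= delta < 0:
            feats[f"{name}_lead_{-delta}"] = 1  # days before the holiday
        elif 0 < delta <= max_offset:
            feats[f"{name}_lag_{delta}"] = 1    # days after the holiday
    return feats


holidays_2015 = {
    "labor_day": date(2015, 9, 7),
    "us_thanksgiving": date(2015, 11, 26),
}
friday_before_labor_day = holiday_features(date(2015, 9, 4), holidays_2015)
sunday_after_thanksgiving = holiday_features(date(2015, 11, 29), holidays_2015)
```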

Weather

Weather data was pulled from Weather Underground for Blaine, WA using the following fields:

  • Temperature (min/max/mean)
  • Rain/Snow/Thunderstorm/Fog
  • Precipitation

Lead and lag weather features were added to account for changes in traveler behavior after a weather event, or in anticipation of a weather event.

Trend

Wait time has decreased over time, as shown in the chart below.

To model trend, a difference in daily average wait time was added as a feature. Multiple difference features were included over multiple weeks to capture both long-term and short-term trends. Note that each difference feature is quantized in 1-week intervals to account for weekly seasonality.
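
The idea of week-quantized differences can be illustrated as below. The exact definition is not given in the text, so this sketch assumes each feature is the daily average on a reference day minus the daily average a whole number of weeks earlier; the lags (1, 2 and 4 weeks) and all names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Sequence


def weekly_trend_features(
    daily_avg: Dict[date, float],
    day: date,
    weeks_back: Sequence[int] = (1, 2, 4),
) -> Dict[str, float]:
    """Difference in daily average wait time over whole-week lags.

    Comparing days exactly N weeks apart keeps both on the same day of week,
    so weekly seasonality cancels out of the difference.
    """
    feats = {}
    for weeks in weeks_back:
        earlier = day - timedelta(weeks=weeks)
        feats[f"trend_{weeks}w"] = daily_avg[day] - daily_avg[earlier]
    return feats


# Toy series: the daily average falls by one minute per day.
daily_avg = {date(2015, 1, 1) + timedelta(days=i): 40.0 - i for i in range(29)}
trend = weekly_trend_features(daily_avg, date(2015, 1, 29))
```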

Excluded features

Feature                                 Why excluded
School calendars                        no improvement
Lag daily averages of wait times        overfit
Rolling daily averages of wait times    overfit
North vs. south volume imbalance        overfit

Modeling

Baseline

A baseline model was defined as the average over the last 12 months by day of week. The baseline is motivated by the day of week predictions referenced above from the University of California, and by predictions using Random Forest, which tended to predict the same values as the baseline model.

Predictions from the baseline model served as a measuring stick for the quality of the model.
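
A baseline of this kind might look like the sketch below. It assumes that "last 12 months" means a 365-day window and that averages are also grouped by time of day so the baseline yields an intraday curve; neither detail is confirmed by the text, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, time, timedelta
from typing import Dict, Iterable, Tuple

Key = Tuple[int, time]  # (day of week, time of day)


def baseline_profile(
    history: Iterable[Tuple[datetime, float]],
    as_of: datetime,
    window_days: int = 365,
) -> Dict[Key, float]:
    """Average wait time per (day of week, time of day) over the trailing window."""
    start = as_of - timedelta(days=window_days)
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for ts, wait in history:
        if start <= ts < as_of:
            key = (ts.weekday(), ts.time())
            totals[key][0] += wait
            totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def baseline_predict(profile: Dict[Key, float], ts: datetime) -> float:
    """Look up the historical average for the same weekday and time slot."""
    return profile[(ts.weekday(), ts.time())]


history = [
    (datetime(2013, 1, 7, 8, 0), 100.0),  # Monday, but older than 12 months: ignored
    (datetime(2015, 1, 5, 8, 0), 10.0),   # Monday
    (datetime(2015, 1, 12, 8, 0), 20.0),  # Monday
]
profile = baseline_profile(history, as_of=datetime(2015, 2, 1))
monday_8am = baseline_predict(profile, datetime(2015, 2, 2, 8, 0))
```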

Extra Trees

Random Forest was the first model attempted, but it was never able to beat the baseline model. A different tree-based ensemble from scikit-learn, Extra Trees, was used instead, yielding better results and more variance in predictions compared to the baseline.

Once trend features were added to the model, Extra Trees consistently beat the baseline predictions for different crossings, directions and years.

A Gradient Boosting model was tested, but the predictive accuracy was only marginally better than Extra Trees. The significantly higher processing cost of Gradient Boosting, due to the inability to parallelize model training, favored Extra Trees.

Ensembling

To further improve the predictive accuracy, the Extra Trees predictions were ensembled with the baseline predictions. Ensembling was performed using a harmonic mean with equal weights.

Different weights were attempted, but since optimal weights varied depending on the data set (year, crossing and direction), equal weights were used to better generalize the model.
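
For two predictions a and b, the equal-weight harmonic mean is 2ab / (a + b). A minimal sketch is shown below; the function name and the handling of zero predictions (returning 0, the limiting value) are assumptions for the example.

```python
from typing import List, Sequence


def harmonic_ensemble(model_pred: Sequence[float], baseline_pred: Sequence[float]) -> List[float]:
    """Combine two prediction series with an equal-weight harmonic mean."""
    combined = []
    for m, b in zip(model_pred, baseline_pred):
        if m < 0 or b < 0:
            raise ValueError("wait times must be non-negative")
        # The harmonic mean tends to 0 when either input is 0.
        combined.append(0.0 if m == 0 or b == 0 else 2 * m * b / (m + b))
    return combined


ensembled = harmonic_ensemble([10.0, 20.0, 0.0], [30.0, 20.0, 5.0])
```

Compared with an arithmetic mean, the harmonic mean sits closer to the smaller of the two predictions (15 rather than 20 for inputs of 10 and 30).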

Preventing Overfitting

For any given data set, it was possible to improve the model via hyperparameter tuning. However, this came at the expense of poorer predictive accuracy on other data sets, e.g., a different year.

To keep the model generalizable, the Extra Trees model was loosely tuned with 96 estimators as the only non-default parameter.
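
As a rough illustration of such a loosely tuned model, the sketch below fits scikit-learn's ExtraTreesRegressor with 96 estimators. The choice of the regressor class, the n_jobs and random_state settings, and the toy data are assumptions for the example and are not taken from the project.

```python
from sklearn.ensemble import ExtraTreesRegressor

# Toy rows of [minute_of_day, day_of_week]; the real feature set is far richer.
X_train = [[480, 0], [510, 0], [720, 2], [750, 2], [1020, 4], [1050, 4]]
y_train = [12.0, 15.0, 30.0, 28.0, 45.0, 40.0]  # made-up wait times in minutes

model = ExtraTreesRegressor(
    n_estimators=96,  # the only non-default hyperparameter named in the text
    n_jobs=-1,        # assumption: trees are independent, so fit them in parallel
    random_state=0,   # assumption: fixed seed for repeatable results
)
model.fit(X_train, y_train)
predictions = model.predict([[495, 0], [1035, 4]])
```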

What about ARIMA?

There are a few factors that make ARIMA a poor fit for this problem:

  1. Multiple seasonalities, e.g., daily, weekly and yearly
  2. Non-linear exogenous factors
  3. Slow to train with a large number of exogenous factors

An attempt at using ARIMA yielded predictions that repeated the same seasonal pattern without variation.

Website

The website is a responsive site using Flask and Bootstrap. For charting, the Chartist JavaScript library was used. Data is persisted in a PostgreSQL database.

Users can select a date, crossing location and direction to view intraday wait times. For dates before 2016, predictions were generated on a weekly basis to emulate a production system where the model is retrained as new data is collected.

For dates from 2016 onwards, predictions were generated at one time to emulate long-term predictions.
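
The weekly retraining schedule could be laid out as a series of walk-forward windows, as sketched below. The function name is hypothetical, and the actual training and prediction calls are left out; the sketch assumes the model for each window is trained only on data before the window starts.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def weekly_windows(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield (window_start, window_end) pairs covering [start, end) a week at a time.

    For each pair, a model would be trained on data before window_start and
    used to predict the half-open range [window_start, window_end).
    """
    window_start = start
    while window_start < end:
        window_end = min(window_start + timedelta(days=7), end)
        yield window_start, window_end
        window_start = window_end


windows = list(weekly_windows(date(2015, 1, 1), date(2015, 1, 20)))
```

Generating all predictions at one time corresponds to a single window that spans the whole forecast period.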

The website is hosted on AWS at http://borderforecaster.com.

Results

R-squared

Below is a chart of R-squared calculated for predictions on Peace Arch southbound data. The chart shows R-squared for both baseline and model when trained on a weekly or yearly basis. When trained yearly, predictions for an entire year are generated all at once. When trained weekly, predictions for the year are generated a week at a time with the model retrained for each week of predictions.

As should be expected, predictions trained weekly are better than predictions trained yearly. In both cases, the model makes a better prediction than the baseline for each year. The strength of the model is evidenced by the fact that the model when trained yearly beats the baseline when trained weekly for all years except 2015.

2015 shows the most dramatic improvements in R-squared due to the ability of the model to handle changes in trend.
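
For reference, R-squared under its standard definition (one minus the ratio of residual to total sum of squares) can be computed as below. Whether the project used exactly this formulation is an assumption, and the sample values are made up.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [10.0, 20.0, 30.0]
score = r_squared(actual, [12.0, 18.0, 33.0])
mean_only = r_squared(actual, [20.0, 20.0, 20.0])
```

A model that always predicts the mean of the observed values scores 0, and a perfect model scores 1.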

Feature importance

Overview

Time of day is the most important feature, corresponding to daily seasonality.

Weather is the second most important class of features, driven by temperature and precipitation features. These likely act as a proxy for yearly seasonality corresponding to seasons of the year. When weather features are scaled according to their frequency of occurrence, snow and thunderstorms stand out as the most important weather features.

Holidays

Differences between northbound and southbound crossings are most pronounced when comparing holiday features.

Interestingly, the two holidays with the highest importance are Canadian holidays: Victoria Day for northbound traffic and Good Friday for southbound traffic. Note that although Good Friday is recognized as a holiday in the US, it is more broadly observed in Canada.

There is evidence of bidirectional holiday traffic. For example, southbound travel on the Sunday before Civic Day is paired with northbound travel on Civic Day. Similarly, northbound travel 2 days before Christmas is paired with southbound travel 2 days after Christmas.

The only day which displays high importance in both directions is the Saturday before Labor Day.
