This repository is separated into four parts:

1. Data Preprocessing

./crawl.py

Output: ./input/raw_*

Crawl historical stock prices from https://finance.yahoo.com/

./feature engineering.py

Input: ./input/feature_projection
Output: ./input/feature_label_*, selected_feature_*

Format basic information into a (sample_n, feature_m) matrix, with one row per sample and one column per feature:

	[ feature_1_sample_1, feature_2_sample_1, ... feature_m_sample_1]

	...

	[ feature_1_sample_n, feature_2_sample_n, ... feature_m_sample_n]

Before filtering, each sample has a feature dimension of 80 * 11 (880 financial ratios)

Delete a feature if more than N% of its values are missing (N can be set to 1, 5, or 10)
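The missing-value filter can be sketched in plain Python as follows. This is only an illustration: the function name `drop_sparse_features`, the list-of-lists matrix layout, and the toy data are assumptions, not code from `./feature engineering.py`.

```python
import math


def drop_sparse_features(matrix, max_missing_pct):
    """Return (filtered_matrix, kept_column_indices).

    A column is dropped when MORE than max_missing_pct percent of its
    values are missing (None or NaN); otherwise it is kept.
    """
    n_samples = len(matrix)
    n_features = len(matrix[0])

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    kept = []
    for j in range(n_features):
        missing = sum(is_missing(row[j]) for row in matrix)
        if 100.0 * missing / n_samples <= max_missing_pct:
            kept.append(j)

    filtered = [[row[j] for j in kept] for row in matrix]
    return filtered, kept


# Toy (sample_n=4, feature_m=3) matrix; the values are made up.
nan = math.nan
features = [
    [1.0, nan, 0.3],
    [1.2, nan, 0.4],
    [0.9, 2.0, nan],
    [1.1, 2.1, 0.5],
]
filtered, kept = drop_sparse_features(features, max_missing_pct=25)
print(kept)  # [0, 2]
```

Column 1 is 50% missing and is dropped; column 2 is exactly 25% missing and is kept, since only features with more than N% missing are deleted.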

Since shifting requires a complete time window, if a type of feature (such as P/E or asset turnover) does not have all 11 data points, delete every feature of that type

Some stocks may only have data from 2005 to 2014

Some start in April 2007; we regard them as 2006
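One possible reading of the complete-window rule is sketched below. The dictionary layout (ratio name mapped to its series of values), the function name `keep_complete_ratios`, and the sample values are hypothetical; only the window length of 11 comes from the description above.

```python
import math

WINDOW = 11  # number of data points required per ratio type


def keep_complete_ratios(ratios, window=WINDOW):
    """Keep only ratio types whose series has `window` non-missing values."""

    def complete(series):
        return len(series) == window and not any(
            v is None or math.isnan(v) for v in series
        )

    return {name: series for name, series in ratios.items() if complete(series)}


# Made-up series for a single stock.
sample = {
    "P/E": [15.0] * 11,                          # complete, kept
    "Asset turnover": [0.8] * 10 + [math.nan],   # one missing value, dropped
    "Current ratio": [1.5] * 9,                  # too short, dropped
}
print(sorted(keep_complete_ratios(sample)))  # ['P/E']
```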

./dataCheck.py

./learning.py

Input: ./input/feature_label_*
Output: ./output/result_*, ./output/tickers_*

Use financial ratios (2006 to Dec. 2014) to train the model

Increase the number of training samples by using the rolling windows 2006-2011, 2007-2012, 2008-2013, 2009-2014, 2010-2015
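The list of training windows above follows a simple pattern: the start year advances by one while the span stays fixed. A minimal sketch, where the helper name `rolling_windows` is an assumption for illustration:

```python
def rolling_windows(first_start, last_start, span):
    """Return (start_year, end_year) pairs, one per start year."""
    return [(start, start + span) for start in range(first_start, last_start + 1)]


windows = rolling_windows(2006, 2010, span=5)
print(windows)
# [(2006, 2011), (2007, 2012), (2008, 2013), (2009, 2014), (2010, 2015)]
```

Each window would then contribute its own set of (features, label) rows to the training data.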

Labels are based on the Sortino ratio
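For reference, the Sortino ratio divides the mean return in excess of a target by the downside deviation, so only returns below the target count as risk. The sketch below uses the standard textbook definition; the target of 0, the labeling threshold, and the function names are assumptions, since the exact rule used in `./learning.py` is not stated here.

```python
import math


def sortino_ratio(returns, target=0.0):
    """Mean excess return over `target` divided by downside deviation."""
    excess = [r - target for r in returns]
    mean_excess = sum(excess) / len(excess)
    # Downside deviation: root mean square of the below-target shortfalls,
    # averaged over ALL periods (periods above target contribute zero).
    downside_sq = [min(0.0, e) ** 2 for e in excess]
    downside_dev = math.sqrt(sum(downside_sq) / len(excess))
    if downside_dev == 0.0:
        # No period fell below target: the ratio is undefined; treat as +inf
        # for positive mean excess and 0 otherwise.
        return math.inf if mean_excess > 0 else 0.0
    return mean_excess / downside_dev


def label_by_sortino(returns, threshold):
    """Hypothetical labeling rule: 1 if the ratio exceeds `threshold`, else 0."""
    return 1 if sortino_ratio(returns) > threshold else 0


# Made-up periodic returns for one stock.
example_returns = [0.02, -0.01, 0.03, -0.02]
ratio = sortino_ratio(example_returns)
print(round(ratio, 4), label_by_sortino(example_returns, threshold=0.3))
```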

In summary, gradient boosting gives the best performance (the highest precision for label 1 on average)

./mean_variance_optimization.py

Use a brute-force approach: randomly pick 15 stocks from the stock sets and build a mean-variance portfolio with a no-short constraint

Delete the portfolio with the worst CVaR
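The random sampling and CVaR-based elimination can be sketched as below. This is a simplified illustration, not the code in `./mean_variance_optimization.py`: the mean-variance weighting step is replaced by equal weights (which are long-only by construction) to keep the example dependency-free, CVaR is computed historically at an assumed 95% level, and all function names and the synthetic return data are hypothetical.

```python
import random


def cvar(returns, alpha=0.95):
    """Historical CVaR: average loss over the worst (1 - alpha) share of periods."""
    losses = sorted((-r for r in returns), reverse=True)  # largest loss first
    tail = max(1, round(len(losses) * (1 - alpha)))
    return sum(losses[:tail]) / tail


def portfolio_returns(returns_by_ticker, tickers):
    """Equal-weight portfolio returns (stand-in for mean-variance weights)."""
    n_periods = len(returns_by_ticker[tickers[0]])
    return [
        sum(returns_by_ticker[t][i] for t in tickers) / len(tickers)
        for i in range(n_periods)
    ]


def drop_worst_cvar(returns_by_ticker, size=15, n_candidates=20, seed=0):
    """Sample random portfolios, score each by CVaR, and drop the worst one.

    Returns (kept, dropped), where each entry is a (cvar, tickers) pair.
    """
    rng = random.Random(seed)
    universe = sorted(returns_by_ticker)
    candidates = [
        tuple(sorted(rng.sample(universe, size))) for _ in range(n_candidates)
    ]
    scored = [(cvar(portfolio_returns(returns_by_ticker, c)), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0])  # lowest (best) CVaR first
    return scored[:-1], scored[-1]


# Synthetic returns for 30 hypothetical tickers over 60 periods.
data_rng = random.Random(42)
data = {
    f"T{i:02d}": [data_rng.gauss(0.001, 0.02) for _ in range(60)]
    for i in range(30)
}
kept, dropped = drop_worst_cvar(data)
print(len(kept), "portfolios kept; dropped CVaR:", round(dropped[0], 4))
```

In a fuller version, `portfolio_returns` would take weights from a constrained optimizer (weights non-negative and summing to 1) instead of assuming equal weights.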