
mratsim / Apartment-Interest-Prediction

Licence: other
Predict people's interest in renting specific NYC apartments. The challenge combines structured data, geolocalization, time data, free text and images.

Programming Languages

Jupyter Notebook
11667 projects
python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Apartment-Interest-Prediction

HumanOrRobot
A solution for the Kaggle competition `Human or Robot`
Stars: ✭ 16 (-5.88%)
Mutual labels:  kaggle, kaggle-competition, xgboost, lightgbm
Open Solution Home Credit
Open solution to the Home Credit Default Risk challenge 🏡
Stars: ✭ 397 (+2235.29%)
Mutual labels:  kaggle, xgboost, lightgbm
Kaggle-Competition-Sberbank
Top 1% rankings (22/3270) code sharing for Kaggle competition Sberbank Russian Housing Market: https://www.kaggle.com/c/sberbank-russian-housing-market
Stars: ✭ 31 (+82.35%)
Mutual labels:  kaggle, xgboost, lightgbm
Kaggle
Kaggle Kernels (Python, R, Jupyter Notebooks)
Stars: ✭ 26 (+52.94%)
Mutual labels:  kaggle-competition, xgboost, lightgbm
stackgbm
🌳 Stacked Gradient Boosting Machines
Stars: ✭ 24 (+41.18%)
Mutual labels:  xgboost, lightgbm, gradient-boosting
autogbt-alt
An experimental Python package that reimplements AutoGBT using LightGBM and Optuna.
Stars: ✭ 76 (+347.06%)
Mutual labels:  kaggle, lightgbm, gradient-boosting
MSDS696-Masters-Final-Project
Earthquake Prediction Challenge with LightGBM and XGBoost
Stars: ✭ 58 (+241.18%)
Mutual labels:  kaggle, xgboost, lightgbm
Awesome Gradient Boosting Papers
A curated list of gradient boosting research papers with implementations.
Stars: ✭ 704 (+4041.18%)
Mutual labels:  xgboost, lightgbm, gradient-boosting
Benchmarks
Comparison tools
Stars: ✭ 139 (+717.65%)
Mutual labels:  kaggle, xgboost, lightgbm
Machine Learning Workflow With Python
A comprehensive walkthrough of ML techniques with Python: define the problem, specify inputs & outputs, data collection, exploratory data analysis, data preprocessing, model design, training, evaluation
Stars: ✭ 157 (+823.53%)
Mutual labels:  kaggle, kaggle-competition, gradient-boosting
Lightgbm
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Stars: ✭ 13,293 (+78094.12%)
Mutual labels:  kaggle, lightgbm, gradient-boosting
My Data Competition Experience
A summary of my experience finishing in the Top 5 of multiple machine learning and big data competitions, packed with practical tips; you're welcome.
Stars: ✭ 271 (+1494.12%)
Mutual labels:  kaggle-competition, xgboost, lightgbm
Awesome Decision Tree Papers
A collection of research papers on decision, classification and regression trees with implementations.
Stars: ✭ 1,908 (+11123.53%)
Mutual labels:  xgboost, lightgbm, gradient-boosting
Auto ml
[UNMAINTAINED] Automated machine learning for analytics & production
Stars: ✭ 1,559 (+9070.59%)
Mutual labels:  xgboost, lightgbm, gradient-boosting
decision-trees-for-ml
Building Decision Trees From Scratch In Python
Stars: ✭ 61 (+258.82%)
Mutual labels:  xgboost, lightgbm, gradient-boosting
Data Science Competitions
The goal of this repo is to provide solutions to all Data Science competitions (Kaggle, Data Hack, Machine Hack, Driven Data, etc.).
Stars: ✭ 572 (+3264.71%)
Mutual labels:  kaggle, kaggle-competition, xgboost
Mlbox
MLBox is a powerful Automated Machine Learning python library.
Stars: ✭ 1,199 (+6952.94%)
Mutual labels:  kaggle, xgboost, lightgbm
docker-kaggle-ko
A Docker image dedicated to machine learning / deep learning (PyTorch, TensorFlow). It adds Korean fonts, Korean NLP packages (konlpy), morphological analyzers, timezone settings, and more.
Stars: ✭ 46 (+170.59%)
Mutual labels:  kaggle, xgboost, lightgbm
fast retraining
Show how to perform fast retraining with LightGBM in different business cases
Stars: ✭ 56 (+229.41%)
Mutual labels:  kaggle, xgboost, lightgbm
kaggle-recruit-restaurant
🏆 Kaggle 8th place solution
Stars: ✭ 102 (+500%)
Mutual labels:  kaggle, lightgbm

Apartment Interest Prediction

Predict people's interest in renting specific apartments. The challenge combines structured data, geolocalization, time data, free text and images.

Overview of my solution

This solution features Gradient Boosted Trees (XGBoost and LightGBM) and does not use stacking, due to lack of time.

Feature engineering

Features can be activated and deactivated with a single comment in main.py.

Time features

From the datetime field I created several features:

  • Year, month, day, hour, day of the week
  • Bank holiday, school holiday
  • Elapsed time since publication

Furthermore, day, month, and hour are cyclical. To tell the classifier that after Sunday (day 6) comes Monday (day 0), I've projected the time information onto a circle by taking the cos and sin.
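A minimal sketch of that projection (the helper name and column names are illustrative, not the repo's actual code):

import numpy as np
import pandas as pd

def encode_cyclical(df: pd.DataFrame, col: str, period: int) -> pd.DataFrame:
    # Map the value onto a circle so that period - 1 and 0 end up next to each other
    df[col + '_cos'] = np.cos(2 * np.pi * df[col] / period)
    df[col + '_sin'] = np.sin(2 * np.pi * df[col] / period)
    return df

# Day of week has period 7, month 12, hour 24:
# df = encode_cyclical(df, 'dayofweek', 7)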

Geo-localization features

From the latitude and longitude, I created clusters using Density-based clustering (HDBSCAN).

I would have preferred DBSCAN with epsilon set to 200 meters, but unfortunately Scikit-learn's DBSCAN is not properly optimized: computing the 40,000 (train set) or 70,000 (test set) pairwise haversine distances goes KABOOM on my memory. (HDBSCAN creates clusters fully automatically from density, but NYC is too dense.)
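A minimal sketch of that clustering step, assuming the hdbscan package and a DataFrame with latitude/longitude columns (min_cluster_size is illustrative):

import numpy as np
import hdbscan  # pip install hdbscan

# The haversine metric expects coordinates in radians
coords = np.radians(df[['latitude', 'longitude']].values)
clusterer = hdbscan.HDBSCAN(min_cluster_size=25, metric='haversine')
df['geo_cluster'] = clusterer.fit_predict(coords)  # -1 marks noise points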

From the public kernels I've also taken the coordinates of Central Park, Brooklyn, Queens ... to compute the distance of each apartment from those centers.
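The distance feature boils down to a haversine computation like this (a sketch; the landmark coordinates are approximate and only for illustration):

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

# Approximate centre of Central Park, for illustration only
df['dist_central_park'] = haversine_km(df['latitude'], df['longitude'], 40.7829, -73.9654)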

Apartment features

Apartment features (cat, dog, doorman, laundry in building ...) were deduplicated and encoded using a 4-letter encoding scheme to reduce duplication further. Furthermore, Scikit-learn's CountVectorizer was used to one-hot-encode them and expose their frequency to the classifier.
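A minimal sketch of that encoding, assuming the listing features are stored as a list of tags per row (the column name is an assumption; max_features=200 echoes the feature-selection snippet below):

from sklearn.feature_extraction.text import CountVectorizer

# Join the deduplicated feature tags into one token string per listing
feat_strings = df['features'].apply(
    lambda tags: ' '.join(t.lower().replace(' ', '_') for t in tags))

vectorizer = CountVectorizer(max_features=200)
feature_matrix = vectorizer.fit_transform(feat_strings)  # sparse count/one-hot matrix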

Description features (NLP / Text-mining)

The description field was one of my main focuses. I did:

  • Clean up the HTML tags from the description with BeautifulSoup
  • Latent Semantic Analysis, by stacking a LightGBM kernel on a description preprocessed with TfIdf + TruncatedSVD (see the sketch after this list)
  • Sentiment Analysis with TextBlob (unused at the end)
  • Extraction of metro lines and metro/transport related vocabulary
  • Check the number of words and lines
  • Check the presence of "REDACTED"
  • Check the number of caps and exclamation points
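A minimal sketch of the HTML clean-up and LSA steps (the TfIdf/SVD parameters mirror the feature-selection snippet further down; the rest is illustrative):

from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Strip HTML tags from each listing description
clean_desc = df['description'].fillna('').apply(
    lambda html: BeautifulSoup(html, 'html.parser').get_text(separator=' '))

# Latent Semantic Analysis: TF-IDF followed by a low-rank SVD projection
tfidf = TfidfVectorizer(max_features=2**16, min_df=2, stop_words='english', use_idf=True)
lsa = TruncatedSVD(n_components=2)
desc_lsa = lsa.fit_transform(tfidf.fit_transform(clean_desc))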

Categorical features

On price, number of bathrooms, and bedrooms, the usual combinations (price per room, etc.) were computed. Address, manager, and building id were numerically encoded.

Furthermore, for manager and building id, various other encoding schemes were tested (Bayesian target label encoding, low/mid/high interest counts from the Kaggle Forum, manager skill and building hype).

In the end, after multiple leaks on cross-validation, I simply binned managers/buildings by their frequency (top 1%, 2%, 5% ...). This way target labels were not used, ensuring no leak, and performance seemed to be similar to Bayesian encoding.
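A minimal sketch of that frequency binning, matching the top_*_manager column names used in the feature-selection snippet below (the repo's actual implementation may differ):

counts = df['manager_id'].value_counts()
for p in [1, 2, 5, 10, 15, 20, 25, 30, 50]:
    # Managers whose listing count puts them in the top p% by frequency
    top_ids = counts.index[: max(1, int(len(counts) * p / 100))]
    df['top_' + str(p) + '_manager'] = df['manager_id'].isin(top_ids).astype(int)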

Outliers removal

Detected outliers were corrected in the test set (117 bathrooms :O). Prices > 13,000 were clipped.
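In pandas terms this is roughly the following (a sketch; the exact correction applied to the 117-bathroom listing is an assumption):

df['price'] = df['price'].clip(upper=13000)      # cap extreme prices
df.loc[df['bathrooms'] > 10, 'bathrooms'] = 1    # fix the 117-bathroom outlier (assumed value)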

Images

Like many others, I didn't process the images at all, besides using the magic leak (folder creation time). The biggest issue was that the number of images per apartment was irregular: some had floor plans, others had furniture, others had nothing.

I did extract metadata from the images to add resolution, image height, and width to my model. Unfortunately, the JSON file was 800MB (1.4GB in CSV) with thousands of sparse columns, and Pandas couldn't load that on my machine. The workarounds would have been to a. buy more RAM or b. use a dictionary structure, but that was clunky and time consuming.
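Reading the width/height itself is cheap with Pillow (a sketch assuming local image paths; the memory problem was in assembling the resulting sparse table, not in this step):

from PIL import Image

def image_size(path):
    # Only the header is read; the pixel data is not fully decoded
    with Image.open(path) as img:
        return img.size  # (width, height)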

Example metadata are available in my 000_Data_Exploration.ipynb notebook.

Overview of the architecture

I ran early into scalability issues and cross-validation issues with Scikit-learn.

In Sklearn, you can use Pipelines to apply modifications to the train and test sets independently, but it's not trivial to use pipelines on a validation set (split from the train set) that you will use as input for XGBoost or LightGBM early stopping. Furthermore, most features are not inherently leaky and do not need to be recomputed for each fold as Sklearn does. Lastly, Sklearn has no caching framework.
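For context, the early-stopping pattern that requires a separately transformed validation split looks roughly like this (a sketch using a recent LightGBM API; parameters are illustrative, and X, y stand for the engineered features and the interest-level target):

import lightgbm as lgb
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = lgb.LGBMClassifier(n_estimators=5000, learning_rate=0.02)
model.fit(X_tr, y_tr,
          eval_set=[(X_val, y_val)],
          eval_metric='multi_logloss',
          callbacks=[lgb.early_stopping(50)])  # stop when the validation loss plateaus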

So:

I wrote my own code so that adding each feature is easy and independent; check the pipe function in star_command.py. Now each transformation can be applied with:

# Feature extraction - sequence of transformations
tr_pipeline = feat_extraction_pipe(
    tr_remove_outliers,
    tr_numphot,
    tr_numfeat,
    tr_numdescwords,
    tr_desc_mining,
    tr_datetime,
    tr_split_bath_toilets,
    tr_tfidf_lsa_lgb)
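Conceptually the pipe is just left-to-right function composition; a guess at the idea (not the actual star_command.py code):

from functools import reduce

def feat_extraction_pipe(*transformations):
    # Compose the tr_* functions left to right; in this sketch each one
    # takes a DataFrame and returns the transformed DataFrame
    def run(df):
        return reduce(lambda acc, tr: tr(acc), transformations, df)
    return run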

Feature selection was done the same way, with a framework that can deal with dataframes and sparse arrays. There is even a glimpse of feature selection on multiple processes, but it was slower due to Python's Global Interpreter Lock. Each feature can be chained with Scikit-learn transformers like TfIdf or PCA. Multiple features can be declared at the same time.

select_feat = [
  ("dedup_features", CountVectorizer(max_features=200)),
  ("description", [TfidfVectorizer(max_features=2**16,
                         min_df=2, stop_words='english',
                         use_idf=True),
                TruncatedSVD(2), # 2 or 3
                # Normalizer(copy=False) # Not needed for trees ensemble and Leaky on CV
                ]),
  #("description",[HTMLPreprocessor(),NLTKPreprocessor(),
  #                TfidfVectorizer(tokenizer=identity, preprocessor=None, lowercase=False)]
  #),
  ("description", CountVectorizer(vocabulary=vocab_metro,binary=True)),
  ("description", CountVectorizer(vocabulary=vocab_metro_lines,binary=True, lowercase=False)),
  ("redacted", None),
  (['top_' + str(p) + '_manager' for p in [1,2,5,10,15,20,25,30,50]], None),
  (['top_' + str(p) + '_building' for p in [1,2,5,10,15,20,25,30,50]], None)
  ]

Each transformation can be cached in a "database" with shelve and retrieved easily with a key; see transformers_nlp_tfidf.py. And finally, I wrote my own cross-validation and out-of-fold prediction code.
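The caching idea in a nutshell (a sketch; the actual keying scheme lives in transformers_nlp_tfidf.py):

import shelve

def cached(key, compute, db_path='feature_cache'):
    # Return a previously computed transformation, or compute and store it once
    with shelve.open(db_path) as db:
        if key not in db:
            db[key] = compute()
        return db[key]

# usage: desc_lsa = cached('description_tfidf_svd', lambda: lsa.fit_transform(tfidf.fit_transform(clean_desc)))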

Thank you for your attention
