
alexeygrigorev / cikm-cup-2016-cross-device

Licence: other
Solution for the Cross-Device linking challenge from CIKM CUP 2016

CIKM CUP 2016 Track 1: Cross-Device Linking

The goal of the competition is to find browsing logs which belong to the same user.

More details at http://cikmcup.org and https://competitions.codalab.org/competitions/11171

My solution

  • Convert user ids into integers so they occupy less RAM
  • Split the train data into 2 folds based on connected components
  • Use Elasticsearch More-Like-This queries to find the top candidate pairs
  • Split logs into sessions (using 30 minute intervals) and compute the user "profile" (log features):
    • Number of sessions
    • Clicks within session
    • Duration of breaks between sessions
    • Starts and ends of sessions
    • Title-based, domain-based and URL-based similarities within sessions
  • For candidates retrieved with Elasticsearch, compute the following features:
    • Absolute difference between the profile features
    • Cosine similarity between domains, full URLs and titles
  • Train an XGBoost model to predict whether a candidate pair corresponds to the same user
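
The session-splitting step can be sketched roughly as follows. This is a minimal illustration, not the repository's code: it assumes that "30 minute intervals" means an idle gap of more than 30 minutes starts a new session, that timestamps are in seconds, and the names `split_into_sessions` and `profile_features` are hypothetical. Only three of the profile features listed above are computed.

```python
from typing import Dict, List

# Assumption: a gap longer than 30 minutes between clicks starts a new session.
SESSION_GAP_SECONDS = 30 * 60


def split_into_sessions(timestamps: List[int],
                        gap: int = SESSION_GAP_SECONDS) -> List[List[int]]:
    """Group click timestamps (in seconds) into sessions separated by idle gaps."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # idle gap exceeded: start a new session
    return sessions


def profile_features(timestamps: List[int]) -> Dict[str, float]:
    """Compute a few illustrative per-user profile features from one log."""
    sessions = split_into_sessions(timestamps)
    clicks = [len(s) for s in sessions]
    # Break = time from the last click of one session to the first of the next.
    breaks = [nxt[0] - cur[-1] for cur, nxt in zip(sessions, sessions[1:])]
    return {
        "num_sessions": len(sessions),
        "mean_clicks_per_session": sum(clicks) / len(clicks) if clicks else 0.0,
        "mean_break_seconds": sum(breaks) / len(breaks) if breaks else 0.0,
    }
```

For example, `profile_features([0, 60, 120, 4000, 4100, 10000])` groups the clicks into three sessions, since both gaps (3880 s and 5900 s) exceed the 1800 s threshold.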

Files:

  • 1_prepare_data.py: preprocesses the data
  • 2_data_to_elastic.py: indexes the log data into Elasticsearch
  • 3_candidates_elastic.py: uses Elasticsearch to retrieve the top 70 candidates for each user
  • 4_session_vectorizers.py: "trains" count vectorizers for URLs, domains and titles for user sessions
  • 5_user_profiles.py: extracts profile information from each user log
  • 6_pair_features.py: computes features for each candidate pair
  • 7_model.py: trains the XGBoost model and creates the submission file
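
As an illustration of the pair-feature step (`6_pair_features.py` in the list above), the sketch below computes absolute differences between two user profiles and a cosine similarity between their visited domains. It is a hedged approximation under stated assumptions: plain token counts from `collections.Counter` stand in for the count vectorizers, and the function names and the `absdiff_` / `cos_domains` feature names are hypothetical, not taken from the repository.

```python
import math
from collections import Counter
from typing import Dict, List


def cosine(tokens_a: List[str], tokens_b: List[str]) -> float:
    """Cosine similarity between two bags of tokens (e.g. domains or URLs)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty log has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def pair_features(profile_a: Dict[str, float], profile_b: Dict[str, float],
                  domains_a: List[str], domains_b: List[str]) -> Dict[str, float]:
    """Build a feature dict for one candidate pair of users."""
    shared = sorted(profile_a.keys() & profile_b.keys())
    feats = {f"absdiff_{k}": abs(profile_a[k] - profile_b[k]) for k in shared}
    feats["cos_domains"] = cosine(domains_a, domains_b)
    return feats
```

Feature dicts like these, one per candidate pair, could then be assembled into a matrix and passed to a binary classifier; the same `cosine` helper would apply to full URLs and title tokens.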

Presentation

  • This solution was presented at the Berlin Machine Learning meetup. See the slides here.