jphall663 / Gwu_data_mining

Materials for GWU DNSC 6279 and DNSC 6290.


DNSC 6279 ("Data Mining") provides exposure to various data preprocessing, statistics, and machine learning techniques that can be used both to discover relationships in large data sets and to build predictive models. Techniques covered will include basic and analytical data preprocessing, regression models, decision trees, neural networks, clustering, association analysis, and basic text mining. Techniques will be presented in the context of data-driven organizational decision making using statistical and machine learning approaches.

DNSC 6290 ("Machine Learning") is a follow-up course to DNSC 6279 that expands on both the theoretical and practical aspects of subjects covered in the prerequisite course while optionally introducing new material. Techniques covered may include feature engineering, penalized regression, neural networks and deep learning, ensemble models including stacked generalization and super learner approaches, matrix factorization, model validation, and model interpretation. Classes will be taught as workshops where groups of students apply lecture materials to the ongoing Kaggle Advanced Regression and Digit Recognizer contests.

Course Topics

Topics
Section 00: Intro and History
Section 01: Basic Data Prep
Section 02: Analytical Data Prep
Section 03: Regression
Section 04: Decision Trees and Ensembles
Section 05: Neural Networks
Section 06: Clustering
Section 07: Association Rules
Section 08: Text Mining
Section 09: Matrix Factorization
Section 10: Model Interpretability

Some external reference material

Course Syllabi (Outdated/Unofficial)

Pre-requisite Courses

  • DNSC 6279 ("Data Mining"): Stochastics for Analytics I, Statistics for Analytics, or equivalent (JUD/DAD), MSBA Program Candidacy or instructor approval.

  • DNSC 6290 ("Machine Learning"): Stochastics for Analytics I, Statistics for Analytics, or equivalent (JUD/DAD), Data Mining, MSBA Program Candidacy or instructor approval.

Instructor

Mr. Patrick Hall

E-mail: [email protected]

Twitter: @jpatrickhall

Linkedin: https://www.linkedin.com/in/jpatrickhall/

Course Location

Location: Duques Hall, Room 255 Thursdays 6:10-8:40 PM

Office Hours: Funger Hall, Room 415 Thursdays 5:00 - 6:00 PM

Copyrights and Licenses

Some teaching materials are copyrighted by the instructor. Some copyrights are owned by other individuals and entities.

Most code examples are copyrighted by the instructor and provided with an MIT license, meaning they can be used for almost anything as long as the copyright and license notice are preserved. Some code examples are copyrighted by other entities and are usually provided with an Apache Version 2 license. These code examples can also be used for nearly any purpose, even commercially, as long as the copyright and license notice are preserved.

Recommended Textbooks

DNSC 6279 ("Data Mining")
DNSC 6290 ("Machine Learning")

Reading Assignments

The student is responsible for studying and understanding all assigned materials. If reading generates questions that are not discussed in class, the student has the responsibility of addressing the instructor privately or raising the issue in an appropriate digital medium.

Blackboard

Some materials for this class have personal or corporate copyrights or licenses that prevent them from being shared on GitHub. Those materials or other internal information will be shared with students via Blackboard.

Grading

DNSC 6279 ("Data Mining")
  • The course grade will be based on team homework assignments, a midterm and final exam, and a team project. Each grading component is described in detail below.

  • Homework Assignments: You will be given several homework assignments during the semester. Homework assignments will typically require the use of software. A typical homework assignment will consist of a few problems with several parts. Homework assignments may be completed in groups of 2-4 students. You may be given up to several weeks to complete the assignment. Late homework assignments may be rejected. In preparing your homework assignments, please follow these guidelines:

    • Ensure any submitted computer program solutions are commented and runnable in a standard Python, R, or SAS environment.
    • Ensure any written solutions are typed or easily readable by anyone.
    • Ensure a clear logical flow and mark your answers.
    • Print/type your name(s) on the top right hand corner of every page or in a header of any papers submitted.
  • Midterm and Final Exam: A midterm exam will address content from the first half of the class and a final exam will address content from the second half of the class. The final exam will be scheduled during finals week. Graduate final exams are scheduled by the university late in the semester, and the final exam date will be announced at that time. No make-up midterm or final exams will be given. The exams are individual assignments. If you are taking the class remotely and cannot attend the exams in person, make arrangements with the instructor immediately.

  • Project: The project is designed to serve as an exercise in applying one or more of the data mining techniques covered in the course to analyze real-life data sets. A primary objective is to understand the complexities that arise in mining large, real-life data sets that are often inconsistent, incomplete, and unclean. Students can use a variety of software tools to perform the analysis, including standard Python, R, or SAS packages. This is a semester-long project, and students may work individually or in teams of 2-4. The deliverables include a formal project proposal (due mid-semester) and a final report or presentation (due at the end of the semester). As the project for this class, students may select:

  • Grading Weights

    • Group homework assignments: 25%
    • Midterm exam: 30%
    • Final exam: 30%
    • Group semester Project: 15%
  • Grading Scale

Numeric grade: Letter grade
94-100: A
90-93.99: A-
87-89.99: B+
84-86.99: B
80-83.99: B-
77-79.99: C+
74-76.99: C
70-73.99: C-
<= 69.99: F
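
As a rough illustration of how the DNSC 6279 grading weights and grading scale above combine, here is a minimal Python sketch. The function names, the sample scores, and the choice to treat each band's lower bound as inclusive with no rounding are assumptions made for illustration; this is not the instructor's official grading procedure.

```python
# Weights copied from the "Grading Weights" list for DNSC 6279.
WEIGHTS = {
    "homework": 0.25,
    "midterm": 0.30,
    "final": 0.30,
    "project": 0.15,
}

# Lower bounds copied from the "Grading Scale" list, highest band first.
SCALE = [
    (94.0, "A"), (90.0, "A-"), (87.0, "B+"), (84.0, "B"),
    (80.0, "B-"), (77.0, "C+"), (74.0, "C"), (70.0, "C-"),
]


def course_grade(scores):
    """Weighted numeric grade from component scores on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(numeric):
    """Map a numeric grade to a letter; anything below 70 is an F."""
    for lower, letter in SCALE:
        if numeric >= lower:  # assumption: lower bound is inclusive
            return letter
    return "F"


if __name__ == "__main__":
    # Hypothetical component scores, for illustration only.
    sample = {"homework": 92, "midterm": 85, "final": 88, "project": 95}
    numeric = course_grade(sample)
    print(round(numeric, 2), letter_grade(numeric))
```

With the hypothetical scores shown, the weighted sum is 23 + 25.5 + 26.4 + 14.25 = 89.15, which falls in the B+ band.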
DNSC 6290 ("Machine Learning")
  • In-class Participation: As this will be a 6-week, workshop-based course, student attendance and participation in class is expected.

  • Kaggle Performance: Lecture materials and hands-on workshop materials will be geared toward application to the Kaggle Advanced Regression and Digit Recognizer contests. Students are expected to participate in these contests as individuals or in groups and to do reasonably well.

  • Public GitHub Contributions: Students are expected to write code and generate other artifacts (e.g. notebooks, visualizations, markdown) and to store them in a publicly accessible GitHub repository (or other public location, e.g. a personal website).

  • Grading Weights

    • In-class participation: 1/3
    • Kaggle Performance: 1/3
    • Public GitHub Contributions: 1/3

Academic Integrity

If you are struggling with an assignment or class materials, require extra time for an assignment, or simply require additional assistance, see the instructor immediately.

Cheating and plagiarism will not be tolerated. Any case will automatically result in loss of all the points for the assignment, and may be a reason for a failing grade and/or grounds for dismissal. In case of a group assignment, all group members will receive a zero grade.

Any suspected case of cheating or plagiarism or behavior in violation of the rules of this course will be reported to the Office of Academic Integrity. Students are expected to know and understand all college policies, especially the code of academic integrity.

Disability Services

Please contact the Disability Support Services to establish eligibility and to coordinate reasonable accommodation.

Attendance

Regular attendance is expected, except for remote students. All students are held responsible for all of the work of the courses in which they are registered, and all absences must be excused by the instructor before provision is made to make up the work missed.

Class Policy Changes

The instructor reserves the right to revise any item on this syllabus, including, but not limited to any class policy, course outline or schedule, grading policy, tests, etc. Note that the requirements for deliverables may be clarified and expanded in class, via email, on GitHub, or on Blackboard. Students are expected to complete the deliverables incorporating such additions.

Software

  • Anaconda Python: Python is an approachable, general-purpose programming language with excellent add-on libraries for math and data analysis. Anaconda Python is a commercial distribution of Python that bundles these add-on packages (and many other packages) together with convenient development utilities like the Spyder IDE.

  • H2o.ai is a package of high performance functions and algorithms for preprocessing data and training statistical and machine learning models. It can be accessed without the need for coding through a standalone, web browser client or by installing additional coding interfaces for R and/or Python.

  • PySpark is a convenient, Python-based way to use the extremely powerful and scalable Spark platform. (Spark is becoming the new standard commercial data engineering tool.)

  • R is a tremendously popular language for data analysis, with thousands of user contributed packages for different types of data analysis tasks.

  • RStudio is the standard IDE for the R language.

  • SAS 9.4 and Enterprise Miner is a commercial package for preprocessing data and training statistical and machine learning models. Enterprise Miner allows for the construction of complex data mining workflows without writing code. Enterprise Miner is a proprietary commercial product and not freely available. You may access Enterprise Miner through the SAS on Demand for Academics portal or by contacting the GWU Instructional Technology Lab.

  • SAS 9.4 University Edition is a free edition of SAS' proprietary commercial data analysis software. SAS University Edition contains the newest version of several SAS software packages along with learning tools and utilities for new users. It also requires a virtual machine player which you may need to install separately.

  • TensorFlow + Keras are two of several popular deep learning toolkits and libraries; this particular combination will work on Windows. TensorFlow is a lower-level library for performing mathematical operations. It is GPU-enabled. (GPU support is optional but helpful for this class.) Keras is a higher level library that makes TensorFlow easier to use for building and training common deep learning architectures. They are both available as Python packages.

  • XGBoost is an optimized and highly accurate library for gradient boosted regression and classification. Python and R packages are available for XGBoost. (I have found XGBoost is easiest to install as an R package, but if you get stuck with Python and Windows, you can try following the directions in this blog post.)
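
One quick way to see which of the Python packages listed above are already available in your environment is sketched below. The import names (h2o, pyspark, tensorflow, keras, xgboost) are assumptions about how each tool is exposed to Python, and the helper name `installed` is hypothetical. The check only looks the modules up; it does not import them or verify their versions.

```python
import importlib.util

# Assumed top-level import names for the Python tools in the list above.
PACKAGES = ["h2o", "pyspark", "tensorflow", "keras", "xgboost"]


def installed(names):
    """Return {module name: True/False} depending on whether it can be found."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in names
    }


if __name__ == "__main__":
    for name, found in installed(PACKAGES).items():
        print(f"{name:12s} {'found' if found else 'missing'}")
```

Anything reported as missing would still need to be installed by following that tool's own instructions.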

Using Git for this Material

You are welcome to use git and/or GitHub to save and manage your own copies of class materials.

The easiest way to do so is to download this entire repository as a zip file. However, you will need to download a new copy whenever the repository changes. To download the course repository, navigate to the course GitHub repository (i.e. this page), click the 'Clone or Download' button, and then select 'Download Zip'.

If you would like to take advantage of the version control capabilities of git, follow these steps.

Install required software
Fork and pull materials

Navigate to the course GitHub repository (i.e. this page) and click the 'Fork' button.

Enter the following statements on the git bash command line:

$ cd <parent directory>

$ mkdir GWU_data_mining

$ cd GWU_data_mining

$ git init

$ git remote add origin https://github.com/<your username>/GWU_data_mining.git

$ git remote add upstream https://github.com/jphall663/GWU_data_mining.git

$ git pull origin master

$ git lfs install

$ git lfs track '*.jpg' '*.png' '*.csv' '*.sas7bdat'
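
The statements above contain two placeholders, the parent directory and your GitHub username. The short Python sketch below only prints the same command sequence with those placeholders filled in, so you can review it before pasting it into git bash; it does not run git. The function name `fork_setup_commands` and its parameters are hypothetical.

```python
def fork_setup_commands(username, parent_dir="."):
    """Return the git bash statements listed above with placeholders filled in."""
    repo = "GWU_data_mining"
    return [
        f"cd {parent_dir}",
        f"mkdir {repo}",
        f"cd {repo}",
        "git init",
        # Your fork, created with the 'Fork' button.
        f"git remote add origin https://github.com/{username}/{repo}.git",
        # The instructor's original repository.
        f"git remote add upstream https://github.com/jphall663/{repo}.git",
        "git pull origin master",
        "git lfs install",
        "git lfs track '*.jpg' '*.png' '*.csv' '*.sas7bdat'",
    ]


if __name__ == "__main__":
    # "your-username" is a stand-in; replace it with your GitHub username.
    for command in fork_setup_commands("your-username"):
        print("$", command)
```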

Docker

Dockerfile to create Anaconda Python 3.5 environment with H2O, XGBoost, and GraphViz.

Start the image with:

docker run -i -t -p 8888:8888 <image_id> /bin/bash -c "/opt/conda/bin/conda install jupyter -y --quiet && /opt/conda/bin/jupyter notebook --notebook-dir=/GWU_data_mining --ip='*' --port=8888 --no-browser"
