
KangboLu / Uc Davis Cs Exams Analysis

License: MIT
📈 Regression and Classification with UC Davis student quiz data and exam data

Programming Languages

R

Projects that are alternatives of or similar to Uc Davis Cs Exams Analysis

Dat8
General Assembly's 2015 Data Science course in Washington, DC
Stars: ✭ 1,516 (+4493.94%)
Mutual labels:  web-scraping, logistic-regression, linear-regression
Data Science Toolkit
Collection of stats, modeling, and data science tools in Python and R.
Stars: ✭ 169 (+412.12%)
Mutual labels:  statistics, logistic-regression, statistical-analysis
25daysinmachinelearning
I will update this repository to learn Machine learning with python with statistics content and materials
Stars: ✭ 53 (+60.61%)
Mutual labels:  statistics, logistic-regression, linear-regression
srqm
An introductory statistics course for social scientists, using Stata
Stars: ✭ 43 (+30.3%)
Mutual labels:  linear-regression, statistical-analysis, logistic-regression
Python For Probability Statistics And Machine Learning
Jupyter Notebooks for Springer book "Python for Probability, Statistics, and Machine Learning"
Stars: ✭ 481 (+1357.58%)
Mutual labels:  statistics, probability, statistical-analysis
machine-learning-course
Machine Learning Course @ Santa Clara University
Stars: ✭ 17 (-48.48%)
Mutual labels:  linear-regression, logistic-regression, unsupervised-learning
Machine Learning With Python
Python code for common Machine Learning Algorithms
Stars: ✭ 3,334 (+10003.03%)
Mutual labels:  logistic-regression, linear-regression
2018 Machinelearning Lectures Esa
Machine Learning Lectures at the European Space Agency (ESA) in 2018
Stars: ✭ 280 (+748.48%)
Mutual labels:  text-mining, linear-regression
Stats
A C++ header-only library of statistical distribution functions.
Stars: ✭ 292 (+784.85%)
Mutual labels:  statistics, probability
Stats Maths With Python
General statistics, mathematical programming, and numerical/scientific computing scripts and notebooks in Python
Stars: ✭ 381 (+1054.55%)
Mutual labels:  statistics, probability
Text-Analysis
Explaining textual analysis tools in Python. Including Preprocessing, Skip Gram (word2vec), and Topic Modelling.
Stars: ✭ 48 (+45.45%)
Mutual labels:  text-mining, web-scraping
Basic Mathematics For Machine Learning
The motive behind Creating this repo is to feel the fear of mathematics and do what ever you want to do in Machine Learning , Deep Learning and other fields of AI
Stars: ✭ 300 (+809.09%)
Mutual labels:  statistics, probability
Teaching
Teaching Materials for Dr. Waleed A. Yousef
Stars: ✭ 435 (+1218.18%)
Mutual labels:  statistics, probability
Expan
Open-source Python library for statistical analysis of randomised control trials (A/B tests)
Stars: ✭ 275 (+733.33%)
Mutual labels:  statistics, statistical-analysis
Shendusuipian
To know stats by heart
Stars: ✭ 275 (+733.33%)
Mutual labels:  statistics, probability
Fuku Ml
Simple, easy-to-use machine learning library
Stars: ✭ 280 (+748.48%)
Mutual labels:  logistic-regression, linear-regression
Probability Theory
A quick introduction to all most important concepts of Probability Theory, only freshman level of mathematics needed as prerequisite.
Stars: ✭ 25 (-24.24%)
Mutual labels:  statistics, probability
Machine learning basics
Plain python implementations of basic machine learning algorithms
Stars: ✭ 3,557 (+10678.79%)
Mutual labels:  logistic-regression, linear-regression
Tensorflow Book
Accompanying source code for Machine Learning with TensorFlow. Refer to the book for step-by-step explanations.
Stars: ✭ 4,448 (+13378.79%)
Mutual labels:  logistic-regression, linear-regression
Bagofconcepts
Python implementation of bag-of-concepts
Stars: ✭ 18 (-45.45%)
Mutual labels:  unsupervised-learning, text-mining

Probabilistic and Statistical Modeling Project

Detailed Description:

Click Here

Word Cloud for ECS132 Exams

First release of the ECS 132 term project:

  • ProblemA.R performs a statistical analysis of the quiz averages of ECS132, ECS145, and ECS154 students at the University of California, Davis.
  • ProblemB.R gathers, cleans, and organizes the training data (previous exams) into a document-term matrix, which is used to fit 9 logistic models, one per course. The 9 models are then used to predict which course a test exam belongs to.

Task A

Here you will do some statistical analysis on my undergrad quiz data, with a Description goal. Here are the details:

  • I have made available data on the following for each student:

    • Course name. We will have ECS 132, 145 and 158.

    • Year/quarter offered. E.g. 2012.1 is Winter 2012, 2015.3 is Fall 2015. This data will be used to determine whether there has been some time trend in my quiz grades in recent years.

    • Student major (CS, CSE only).

    • Overall quiz average.

    Please note: A very important part of your job will be to take the data in the form I provide it, and create one big R data frame, with columns 'course name', 'year offered', 'major' and 'quiz average'. Use R's read.table() or some other R function to read the original data from our Web site, then other R code to create the data frame and work with it. You are required to use R for all aspects of this, and explain in your report what you did in this regard.

  • Do the following analyses:

    • Assuming no time trend, find approximate 95% confidence intervals for the population mean quiz average for each of the four courses. Comment.

    • Assuming no time trend, find an approximate 95% confidence interval for the difference in population mean quiz averages for ECS 132 and 145. Comment.

    • Assuming no time trend, find an approximate 95% confidence interval for the difference in population mean quiz averages in ECS 145 between the two majors. Comment.

    • Fit a linear regression model in which quiz average is predicted from year, course and major. For the last two, create dummy variables (Sec. 21.12). Use this to determine whether there is a substantial time trend. Also use it to compare ECS 132 and 145, and CS majors to CSE. (This is different from above, because now we are adjusting for a possible time trend.)

    • Do an analysis of your choice (justified!) that investigates whether there is a time trend in the proportion of CS majors in our department, based on this data.
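
The Task A analyses above can be sketched in R roughly as follows. This is a minimal illustration, not the project's ProblemA.R: the data frame is simulated (the real data must be read from the course Web site), the column names `course`, `year`, `major` and `quizavg`, the helper functions `mean_ci` and `diff_ci`, and the quiz scale are all assumptions made for the example. The intervals use the usual large-sample form, estimate plus or minus 1.96 standard errors.

```r
# Simulated stand-in for the real quiz data; every value here is made up.
set.seed(1)
n <- 300
quiz <- data.frame(
  course = sample(c("ECS132", "ECS145", "ECS158"), n, replace = TRUE),
  year   = sample(c(2012.1, 2013.1, 2014.3, 2015.3), n, replace = TRUE),
  major  = sample(c("CS", "CSE"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
quiz$quizavg <- rnorm(n, mean = 2.8, sd = 0.7)  # hypothetical scale

# Approximate 95% CI for a population mean: xbar +/- 1.96 * s / sqrt(n)
mean_ci <- function(x) {
  se <- sd(x) / sqrt(length(x))
  mean(x) + c(-1.96, 1.96) * se
}

# Approximate 95% CI for the difference of two independent means
diff_ci <- function(x, y) {
  se <- sqrt(var(x) / length(x) + var(y) / length(y))
  (mean(x) - mean(y)) + c(-1.96, 1.96) * se
}

# One interval per course (tapply returns a list of length-2 vectors)
ci_by_course <- tapply(quiz$quizavg, quiz$course, mean_ci)

# ECS 132 versus ECS 145
ci_132_vs_145 <- diff_ci(quiz$quizavg[quiz$course == "ECS132"],
                         quiz$quizavg[quiz$course == "ECS145"])

# CS versus CSE majors within ECS 145
e145 <- quiz[quiz$course == "ECS145", ]
ci_cs_vs_cse <- diff_ci(e145$quizavg[e145$major == "CS"],
                        e145$quizavg[e145$major == "CSE"])

# Hand-made dummy variables; ECS158 and CSE serve as the baselines
quiz$is132 <- as.integer(quiz$course == "ECS132")
quiz$is145 <- as.integer(quiz$course == "ECS145")
quiz$isCS  <- as.integer(quiz$major == "CS")

# Linear model: the coefficient on year estimates the time trend,
# adjusted for course and major
fit <- lm(quizavg ~ year + is132 + is145 + isCS, data = quiz)
print(summary(fit)$coefficients)

# One possible choice for the last analysis: a logit of "is a CS major"
# on (centered) year, to look for a trend in the proportion of CS majors
trend <- glm(isCS ~ I(year - 2012), family = binomial, data = quiz)
print(summary(trend)$coefficients)
```

In the fitted linear model, the difference between the `is132` and `is145` coefficients compares the two courses with year and major held fixed, and the `isCS` coefficient compares CS to CSE majors in the same way.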

Task B

Here you will do some predictive modeling (machine learning), involving text data. One active branch of this field is text classification, e.g. sentiment analysis. We will be less ambitious here, but the principles are the same. Here are the details.

  • The data consists of all files in my course Web page site with names of the form *1/Exams/tex, *2/Exams/tex or **50/Exams/tex. (Go into one directory level within *Exams).

  • As in Problem A, provide and explain your complete R code for fetching the data and for your analyses.

  • This will be a classification problem, as in Chapter 16. The classes here will be...classes! You will predict the class, i.e. one of ECS 50, 132, 145, 152A, 154A, 154B, 156, 158, 188 and 256 from the words present in an exam.

  • You will use the logistic model, fitting 10 logit models, one for each class. The predictor variables are counts of specific words. For a given new case, you plug the word counts into the logit function, giving you an estimated conditional probability of that class. Whichever class has the highest conditional probability, you guess this case to be in that class.

  • The criterion here is prediction accuracy: What proportion of new cases is predicted correctly? To simulate having new cases, it is customary to divide one's data into a training set and a test set. We fit our models to the training set, then predict the test set, pretending that we don't know the classes of the test set. We of course do know their classes, so we can evaluate the proportion of our predictions that come out correct. There are something like 293 exams in the above directories; you will choose 50 at random for your test set (sometimes called the holdout set).

  • You will use R's glm() function to fit the logit models, as in Chapter 11. Your data will consist of an R matrix or data frame, one row per exam in the training set. All but one of the columns will be word counts, with the remaining one being an indicator variable for the class of interest (1 for being in the class, 0 not).

  • A major issue is how to get the word counts. You will use R's tm package, which removes punctuation, white space etc. You can get counts from the output. You will decide what to remove and what not, including the issue of whether to remove the LaTeX keywords. There are lots of tutorials on tm on the Web. Explain your decision on this thoroughly in your report.

  • The other major issue is which words to use. This is hard. A rough rule of thumb is to use no more than sqrt(n) predictor variables, where n is the number of cases in the training set, thus no more than sqrt(n) word counts here. But which ones? Explain your decision on this thoroughly in your report.

  • This is another of those assignments in which you at first will have little or no idea as to what to do. Give it a lot of thought, and discuss it vigorously in your group. Your solution will gradually take shape. Of course, feel free to ask Robin or me if you get stuck and you are not sure about something.
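
The Task B pipeline described above (word counts, a random holdout set, one logit model per class, and a guess of the class with the highest estimated probability) might look like the sketch below. It is an illustration under stated assumptions, not the project's ProblemB.R: the word counts are simulated with `rpois()` rather than extracted from real exams, only three of the classes are used, and the four-word vocabulary and the usage rates are invented. In the actual assignment the counts would come from the tm package, for example from the document-term matrix it builds.

```r
set.seed(2)
classes <- c("ECS50", "ECS132", "ECS145")  # subset of the real classes, for brevity
words   <- c("register", "probability", "python", "thread")  # hypothetical vocabulary
n <- 120                                   # simulated number of exams

# Simulated word counts: each class uses "its" word more often (made-up rates)
label <- sample(classes, n, replace = TRUE)
rate <- matrix(1, nrow = length(classes), ncol = length(words),
               dimnames = list(classes, words))
rate["ECS50",  "register"]    <- 4
rate["ECS132", "probability"] <- 4
rate["ECS145", "python"]      <- 4
counts <- matrix(rpois(n * length(words), lambda = rate[label, ]),
                 nrow = n, dimnames = list(NULL, words))
exams <- data.frame(counts, label = label, stringsAsFactors = FALSE)

# Random holdout set (the assignment uses 50 of about 293 exams)
test_idx <- sample(n, 30)
train <- exams[-test_idx, ]
test  <- exams[test_idx, ]

# Rule of thumb from the text: no more than sqrt(n) predictors
stopifnot(length(words) <= sqrt(nrow(train)))

# One logit model per class: the response is 1 for "in this class", 0 otherwise
fits <- lapply(classes, function(cl) {
  d <- train[, words]
  d$y <- as.integer(train$label == cl)
  glm(y ~ ., data = d, family = binomial)
})
names(fits) <- classes

# Estimated conditional probability of each class for each test exam
# (rows are test exams, columns are classes)
probs <- sapply(fits, function(f) {
  predict(f, newdata = test[, words], type = "response")
})

# Guess the class with the highest estimated probability
pred <- classes[max.col(probs, ties.method = "first")]

# Proportion of test exams classified correctly
accuracy <- mean(pred == test$label)
print(accuracy)
```

With real exams, `glm()` may warn about fitted probabilities of 0 or 1 when a word separates a class perfectly; that is one reason the choice of words matters as much as the text says it does.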
