Licence: mit
MSAN692 Data Acquisition

MSDS692 Data acquisition

There are lots of exciting and interesting problems in data science, such as figuring out what the right question is, selecting features, training a model, and interpreting results. But all of that presupposes a tidy data set that is suitable for analysis or training models. Industry experts all agree that data collection and preparation are roughly 3/4 of any analysis effort. Kareem Carr makes a similar point (I'm guessing he includes data acquisition and organization in the term "cleaning").

The title of this course is "Data Acquisition" but, of course, once we get the data, we have to organize it into handy data structures and typically have to extract information from the raw data. For example, we might need to boil down a Twitter stream into a single positive or negative sentiment score for a given user. This course teaches you how to collect, organize, coalesce, and extract information from multiple sources in preparation for your analysis work. Along the way, you'll learn about the commandline, git, networks, internet protocols, and building your own web servers.

This course is part of the MS in Data Science program at the University of San Francisco.

Course details

INSTRUCTOR. Terence Parr. I’m a professor in the computer science and data science programs and was founding director of the MS in Analytics program at USF (which became the MS in Data Science program). Please call me Terence or Professor (not “Terry”).

OFFICE HOURS

Terence is generally available on slack or email on-demand.

SPATIAL COORDINATES:

All classes are remote but live online, courtesy of COVID-19.

TEMPORAL COORDINATES. Tue Oct 13, 2020 - Tue Dec 1, 2020 (No lecture Dec 3)

There are lectures on Tuesday and Thursday each week from 10am - 12 noon California time.

  • Live lecture: Tue and Thur 10:00AM - 12Noon

Terence is generally available on-demand for help with exercises from the lecture or projects, even on weekends.

Exams:

  • Exam 1: online, open-book and available for 24 hours
  • Exam 2: online, open-book and available for 24 hours

INSTRUCTION FORMAT. Live class runs for 2 hours, 2 days/week. Instructor-student interaction during lecture is encouraged by speaking up in zoom. We'll often mix in mini-exercises / labs during class. All programming will be done in the Python 3 programming language, unless otherwise specified.

PROFESSIONALISM

The following items are even more important because all of us will be remote this Fall:

  • Showing respect for your classmates and your professor
  • Getting to class on time every time
  • No cellphones, email, social media, slack, or texting during class
  • Turn off all of your various notifications so you are not distracted
  • Turn on your webcam on zoom

Student evaluation

Artifact                        Grade weight   Due date
Data pipeline                   5%             Thu, Oct 22
Search Engine Implementation    11%            Tue, Nov 3
TFIDF document summarization    9%             Thu, Nov 12
Recommending Articles           7%             Thu, Nov 19
Tweet Sentiment Analysis        9%             Thu, Dec 3
Code reviews for 5 projects     5%             11:59PM on the day the associated project is due
Exam 1                          27%            2-3:30PM Tue, Nov 10 and 12:01AM-1:31AM Nov 11
Exam 2                          27%            2-3:30PM Mon, Dec 7 and 12:01AM-1:31AM Dec 8

All projects will be graded with the specific input or tests given in the project description, so you understand precisely what is expected of your program. Consequently, projects will be graded in binary fashion: They either work or they do not. The only exception is when your program does not run on the grader's or my machine because of some cross-platform issue. This is typically because a student has hardcoded some file name or directory into their program. In that case, we will take off a minimum of 10% instead of giving you a 0, depending on the severity of the mistake. Some projects will be tested with some hidden unit tests; e.g., see the evaluation section of the search project.
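One way to avoid the hardcoded-path problem mentioned above is to take file names from the commandline. The following is a minimal sketch, not part of any project spec; the names `count_lines` and `main` are hypothetical and chosen only for illustration.

```python
import sys
from pathlib import Path


def count_lines(filename):
    """Return the number of lines in the text file at `filename`."""
    # Path accepts whatever path the caller passes in, so nothing about
    # the author's own machine or directory layout is baked into the code
    with Path(filename).open(encoding="utf-8") as f:
        return sum(1 for _ in f)


def main(argv):
    """Print the line count of the file named in argv[1]; return an exit code."""
    # The file name comes from the commandline rather than from a
    # hardcoded string such as "/Users/me/data/tweets.txt"
    if len(argv) != 2:
        print(f"usage: python {argv[0]} FILE", file=sys.stderr)
        return 1
    print(count_lines(argv[1]))
    return 0
```

A script built this way would typically end with `sys.exit(main(sys.argv))`, so the grader can run it against any input file on any machine.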

Please go to github and verify that the website has the proper files for your solution. That is what I will download for testing.

Each project has a hard deadline and only those projects working correctly before the deadline get credit. My grading script pulls from github at the deadline. All projects are due at the start of class on the day indicated, unless otherwise specified.

Groups. All projects are individual projects, not group efforts! You will be assigned to a two- or three-person group for each project in order to encourage you to meet your fellow students and discuss the design of each project. You're not allowed to share code at any time before the project due date and time. After all projects are submitted to github, you will share zips of your code with your partner or partners. Then, you will provide a quick (less than 30 minute) code review for your partner using a code review template. There is an assignment on canvas where you can submit a PDF of the notebook. If there are 3 people in your group, you can pick which person's work to review. These code reviews do not affect the reviewed person's grade; they are meant to help you and your partner become better programmers. Since students don't do anything unless you give them points, each review gives you 1% of your grade. You just have to make a decent effort to get credit; otherwise you lose that one percent. Naturally, you are free to discuss the design of your projects with any of your fellow students.

Grading standards. I consider an A grade to be above and beyond what most students have achieved. A B grade is an average grade for a student or what you could call "competence" in a business setting. A C grade means that you either did not or could not put forth the effort to achieve competence. Below C implies you did very little work or had great difficulty with the class compared to other students.

Syllabus

We're going to start the class with a cool lab to extract coronavirus data from Wikipedia.

Tools

Before we get to the meat of the course, we need to get familiar with some important tools: the commandline (Terminal.app) and git.

Data formats

Most data you encounter will be in the form of human readable text, such as comma-separated value (CSV) files. We begin the course by studying how characters are stored in files and learning about the key data formats.
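As a small illustration of both ideas (characters stored as bytes, and CSV as a text format), here is a hedged sketch using only the Python standard library. The sample data is made up, and UTF-8 is an assumed encoding; real files may use others.

```python
import csv
import io

# Made-up sample data with a header row and two non-ASCII characters
text = "name,city\nZoë,São Paulo\n"

# Characters become bytes when written to a file. UTF-8 uses one byte for
# each ASCII character and two or more for others such as "ë" and "ã",
# so the byte count exceeds the character count here
data = text.encode("utf-8")
print(len(text), "characters,", len(data), "bytes")

# Decoding with the same encoding recovers the original characters
assert data.decode("utf-8") == text

# csv.DictReader parses each row into a dict keyed by the header line
rows = list(csv.DictReader(io.StringIO(text)))
print(rows)
```

Decoding the same bytes with a different encoding (for example Latin-1) would not raise an error here but would garble the non-ASCII characters, which is why knowing a file's encoding matters.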

There are also plenty of nontext, binary formats. You can learn more from the msds501 boot camp material for audio processing and image processing.

Organizing data in memory into structures

Text feature extraction

How the web works

Now that you know how to work with data files already sitting on your desk, we turn towards a study of computer networking and web infrastructure.

Data sources

With an understanding of how the Internet and web work, it's time to start pulling data from various web sources. The difficulty of collecting data depends a great deal on the permissions and services available for a site or page. A good analogy is: some doors are open, some doors are closed, some doors are locked, some "doors" are not doors but reinforced steel walls.
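Pulling data from a web service usually starts with building a URL that carries query parameters. The sketch below shows that step with the standard library only and makes no network request; the endpoint `api.example.com/v1/search` and its parameters `q` and `limit` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and parameters, for illustration only
base = "https://api.example.com/v1/search"
params = {"q": "data acquisition", "limit": 10}

# urlencode escapes characters that are not allowed in a query string;
# by default a space becomes "+"
url = f"{base}?{urlencode(params)}"
print(url)  # https://api.example.com/v1/search?q=data+acquisition&limit=10

# urlsplit breaks a URL back into its components
parts = urlsplit(url)
print(parts.scheme, parts.netloc, parts.path)  # https api.example.com /v1/search

# parse_qs decodes the query string into a dict of lists of strings
print(parse_qs(parts.query))  # {'q': ['data acquisition'], 'limit': ['10']}
```

Whether a given site accepts such a request at all (an open door, a locked one, or a steel wall) depends on the permissions and services it offers.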

Misc

Administrative details

ACADEMIC HONESTY

You must abide by the copyright laws of the United States and academic honesty policies of USF. You may not copy code from other current or previous students. All suspicious activity will be investigated and, if warranted, passed to the Dean of Sciences for action. Copying answers or code from other students or sources during a quiz, exam, or for a project is a violation of the university’s honor code and will be treated as such. Plagiarism consists of copying material from any source and passing off that material as your own original work. Plagiarism is plagiarism: it does not matter if the source being copied is on the Internet, from a book or textbook, or from quizzes or problem sets written up by other students. Giving code or showing code to another student is also considered a violation.

The golden rule: You must never represent another person’s work as your own.

If you ever have questions about what constitutes plagiarism, cheating, or academic dishonesty in my course, please feel free to ask me.

All persons with common code are likely to be considered at fault.

USF policies and legal declarations

Students with Disabilities

If you are a student with a disability or disabling condition, or if you think you may have a disability, please contact USF Student Disability Services (SDS) for information about accommodations.

Behavioral Expectations

All students are expected to behave in accordance with the Student Conduct Code and other University policies.

Academic Integrity

USF upholds the standards of honesty and integrity from all members of the academic community. All students are expected to know and adhere to the University's Honor Code.

Counseling and Psychological Services (CAPS)

CAPS provides confidential, free counseling to student members of our community.

Confidentiality, Mandatory Reporting, and Sexual Assault

For information and resources regarding sexual misconduct or assault, visit the Title IX coordinator or USF's Callisto website.

todo

check out fastapi for server 2021.
