whythawk / data-as-a-science

Licence: other
Lesson guide and textbook for "Data as a Science" course.

Data as a Science

Data has become the most important language of our era, informing everything from intelligence in automated machines, to predictive analytics in medical diagnostics. The plunging cost and easy accessibility of the raw requirements for such systems - data, software, distributed computing, and sensors - are driving the adoption and growth of data-driven decision-making.

A data scientist is a researcher who answers a research question using data, and can lead the development of the research process. They may design the methods to acquire primary or secondary sources of data that inform the research process, monitor and ensure ethical responsibilities, curate the research data and results, or communicate the process and results to stakeholders. Coding is incidental to that process, and it is possible to be a data scientist without programming at all.

Higher education course modules continue to be an atomised collection of dissociated curricula, since the heart of the university process is the assumption that graduates serve apprenticeships in labs or organisations. But data-driven careers don’t offer an artisanship of learning where an inter-generational accumulation of experience is passed on. Instead, online-first education has become equivalent to a best-of collection with no context or process.

As it becomes ever-easier to collect data about individuals and systems, a diverse range of professionals - who have never been trained for such requirements - grapple with inadequate analytic and data management skills, as well as the ethical risks arising from the possession and consequences of such data and tools.

Ordinarily, when teaching data science, everyone - from teachers to students - prefers to focus on analysis and presentation, since these are more fun and involve less frustration with messy data or ethical dilemmas. Working data scientists will point out that the bulk of their time is taken up with social and ethical negotiations, and with complex and tedious data integration.

There are two objectives for this syllabus:

  1. To ensure students have a comprehensive grasp of a data-driven research process. Data as a Science guides learners to confidence in the ethics, curation, analysis, and presentation of data, integrating each of these topics into each lesson.
  2. To support the growing desire of universities around the world, but especially in emerging-market countries, to offer Data Science degree courses, by providing a free, openly-licenced core curriculum for adoption and adaptation by their degree programmes.

Pedagogy

The course is based on the Sloyd model of technical training. Each lesson is discrete, builds on the previous lesson, and provides a functional and holistic understanding of the scientific method as it applies to data. It is not about learning an algorithm and applying it to abstract, arbitrary data. The objective is to train complete data scientists: you will learn how research works and apply tools to a specific case-study.

Each lesson starts with a research question and progresses by teaching a complete and practical set of skills, allowing students to learn at their own pace and in an order which suits their current understanding. Case-studies and tutorials are drawn from public health, economics and social issues, and the course is accessible to anyone with an interest in data. Course materials, case studies and guided tutorials are presented in Jupyter Notebooks, permitting learners to run and test code and gain hands-on understanding of the techniques discussed.

Lesson structure and approach

Each lesson is guided by the following four topics:

  • Ethics: determine the social and behavioural challenges posed by a research question;
  • Curation: establish the research requirements for data collection and management;
  • Analysis: investigate, explore and analyse research data;
  • Presentation: prepare and present the results of analysis to promote a response.

Case-studies: review and replicate

Science is a set of defined methods that stands up to scrutiny, supports replication, and is supported by ethical measurement data acquired during the study process. The way to gain confidence in these methods is to review the work of others.

Each lesson will guide you through review of published scholarly work in the following ways:

  • Review: apply learned techniques to open access published research, and review and reflect on the methodology, analysis and results presented;
  • Replication: using source or synthetic data, reproduce the methodology used in open access published research to test whether the claimed analysis and results are replicable.

Synthetic data will include lessons in dependent randomisation, as well as agent-based modelling.
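
As a rough illustration of what synthetic data with dependent fields can look like, the sketch below generates hypothetical patient records in which the chance of a condition rises with age. This is one possible reading of "dependent randomisation", not a definition taken from the course materials. The function names, field names and probabilities are all assumptions made up for this example.

```python
import random


def synthetic_patients(n, seed=0):
    """Generate n hypothetical patient records with dependent random fields."""
    rng = random.Random(seed)  # fixed seed so a replication run is repeatable
    records = []
    for i in range(n):
        age = rng.randint(18, 90)
        # Assumed relationship: the chance of the condition rises linearly
        # with age, so this field is random but not independent of age.
        p_condition = 0.05 + 0.5 * (age - 18) / (90 - 18)
        has_condition = rng.random() < p_condition
        records.append({"id": i, "age": age, "has_condition": has_condition})
    return records


def condition_rate(records, min_age, max_age):
    """Share of records with the condition within an inclusive age band."""
    band = [r for r in records if min_age <= r["age"] <= max_age]
    return sum(r["has_condition"] for r in band) / len(band)


if __name__ == "__main__":
    data = synthetic_patients(5000, seed=42)
    # The older band should show a noticeably higher rate than the younger one.
    print(condition_rate(data, 18, 40), condition_rate(data, 70, 90))
```

Because the generator is seeded, anyone re-running the sketch gets the same records, which is the property a replication exercise relies on.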

On completion of each lesson, students gain useful and meaningful skills, and are not left stranded. This means that even partial completion of the material permits students to be productive members of a research team. The first lesson will ensure students can become professional data wranglers, and – on completion of the first ten lessons – graduates will be capable of taking on a responsible data research role.

This is a brief video demonstrating the first module: https://www.youtube.com/watch?v=nZRL3OabbsY

Course outline

I have prepared an overview of 20 lessons, each requiring two to three weeks to learn, which would comprise the complete course.

  1. Module 1:
    • Lesson 1: Introduction to data as a science (view)
    • Lesson 2: Research and experiments with data (view)
    • Lesson 3: Probability, randomness, and the risk of de-anonymization (view)
    • Lesson 4: Sampling, data distribution, and secure data custody (view)
    • Lesson 5: Expected statistical outcomes using distributions, and issues for analysis (#4)
    • Lesson 6: Techniques in data and population sampling, and assessing standard error (#5)
    • Lesson 7: Hypothesis testing, and risks for policy from poor data (#6)
    • Lesson 8: Bootstrapping and the risks of algorithmic decision-making (#7)
    • Lesson 9: Sample robustness, the central limit theorem, and the ethics and abuses of p-hacking (#8)
    • Lesson 10: Publishing and evaluating studies based on cohort data and analysis of variance (#9)
  2. Module 2:
    • Lesson 1: Trolley problems, and predictions using regression and least squares (#10)
    • Lesson 2: Doctrine of double effect, and interpreting regression with visual and numerical diagnostics (#11)
    • Lesson 3: Reflective equilibrium, and methods for multiple regression (#12)
    • Lesson 4: Ultimatum games, “fairness” and model selection for multiple regression (#13)
    • Lesson 5: Strong and weak machine intelligence, and classification using logistic regression (#14)
    • Lesson 6: Emergent systems, strange loops, and supervised and unsupervised learning techniques (#15)
    • Lesson 7: Counterfactual consequences, and implementing, testing and optimising classifiers (#16)
    • Lesson 8: Human agency and autonomous systems, and permutation testing for classification (#17)
    • Lesson 9: Liquid modernity, multiple jurisdictions, and assessing causality in randomised control trials (#18)
    • Lesson 10: Consolidate what you have learned, and explore machine learning (#19)

The first two lessons are complete, and I estimate about 6 weeks to research and create each of the remaining 18 lessons.

Supporting continued development of Data as a Science

This course is not complete. My objective is that Data as a Science becomes a standard data science core syllabus, much as CORE Econ has become for economics. Progress is slow and depends on the support and goodwill of others.

Each lesson costs about $5,000 to research and create, and is released here on completion. Please contact me at gchait @ whythawk . com should you wish to sponsor a lesson (or part thereof).

Whois

My name is Gavin Chait, and I am an independent data scientist specialising in economic development and data curation. I spent more than a decade in economic and development initiatives in South Africa. I was the commercial lead of open data projects at the Open Knowledge Foundation, leading the open source CKAN development team, and led the implementation of numerous open data technical and research projects around the world. Recently, I have been working on Sqwyre.com, an initiative to develop a comprehensive business intelligence search engine for entrepreneurs. Its data are drawn from open data and Freedom of Information requests.

I have extensive experience in leading research projects, implementing open source software initiatives, and developing and leading seminars and workshops. I have taught for 25 years, including undergraduate courses, adult education, and technical and analytical teaching at all levels.

This pedagogy and syllabus structure was developed with support from the Gates Foundation and WHO. Initial research into the need for education capacity building arose as a result of research supported by the Hewlett Foundation, Wellcome Trust and Public Health Research Data Forum.

Chait, Gavin; Sujith, Eramangalath; Grzywinska, Dominika; Wainwright, Mark (2018): Supporting capacity and skills development for public health data research management in low- and medium income countries. Wellcome Trust. Journal contribution. https://doi.org/10.6084/m9.figshare.6087161.v1

Citation

Chait, Gavin (2020): Data as a Science. Whythawk. https://doi.org/10.5281/zenodo.4194973

And as a BibTeX entry:

@book{chait_data_2020,
	  title = {Data as a {Science}},
	  copyright = {Creative Commons Attribution-ShareAlike 4.0 International and the GNU Affero General Public License},
	  publisher = {Whythawk},
	  author = {Chait, Gavin},
	  year = {2020},
	  doi = {10.5281/zenodo.4194973},
	  url = {https://doi.org/10.5281/zenodo.4194973}
}

Licensing and release

Course content, materials and approach are copyright Gavin Chait, and released under both the Creative Commons Attribution-ShareAlike 4.0 International licence and the GNU Affero General Public License.

The objective is to encourage reuse, and to require that any modifications or adaptations of the source material be released under an equivalent licence.
