Data as a Science

Data has become the most important language of our era, informing everything from intelligence in automated machines to predictive analytics in medical diagnostics. The plunging cost and easy accessibility of the raw requirements for such systems - data, software, distributed computing, and sensors - are driving the adoption and growth of data-driven decision-making.

A data scientist is a researcher who answers a research question using data, and can lead the development of the research process. They may design the methods to acquire primary or secondary sources of data that inform the research process, monitor and ensure ethical responsibilities, curate the research data and results, or communicate the process and results to stakeholders. Coding is incidental to that process, and it is possible to be a data scientist without programming at all.

Higher education course modules continue to be an atomised collection of dissociated curricula, since the heart of the university process is the assumption that graduates serve apprenticeships in labs or organisations. But data-driven careers don’t offer an artisanship of learning where an inter-generational accumulation of experience is passed on. Instead, online-first education has become equivalent to a best-of collection with no context or process.

As it becomes ever-easier to collect data about individuals and systems, a diverse range of professionals - who have never been trained for such requirements - grapple with inadequate analytic and data management skills, as well as the ethical risks arising from the possession and consequences of such data and tools.

Ordinarily, when teaching data science, everyone - from teachers to students - prefers to focus on analysis and presentation, since these are more fun and involve less frustration with messy data or ethical dilemmas. Working data scientists will point out that the bulk of their time is taken up with social and ethical negotiations, and with complex and tedious data integration.

There are two objectives for this syllabus:

To ensure students have a comprehensive grasp of a data-driven research process. Data as a Science guides learners to confidence in the ethics, curation, analysis, and presentation of data, integrating each of these topics into each lesson.
To support the growing desire of universities around the world, especially in emerging-market countries, to offer Data Science degree courses by providing a free, openly licensed core curriculum that their degree programs can adopt and adapt.

Pedagogy

The course is based on the Sloyd model of technical training. Each lesson is discrete, building on the previous lesson, and provides a functional and holistic understanding of the scientific method as it applies to data. It is not about learning an algorithm and applying it to abstract, arbitrary data. The objective of the course is to train complete data scientists: you will learn how research works and apply tools to a specific case-study.

Each lesson starts with a research question, and progresses by teaching a complete and practical set of skills, allowing students to learn at their own pace and in an order which suits their current understanding. Case-studies and tutorials are drawn from public health, economics and social issues, and the course is accessible to anyone with an interest in data. Course materials, case studies and guided tutorials are presented in Jupyter Notebooks, permitting learners to test running code and gain hands-on understanding of the techniques discussed.

Lesson structure and approach

Each lesson is guided by the following four topics:

Ethics: determine the social and behavioural challenges posed by a research question;
Curation: establish the research requirements for data collection and management;
Analysis: investigate, explore and analyse research data;
Presentation: prepare and present the results of analysis to promote a response.

Case-studies: review and replicate

Science is a set of defined methods that stands up to scrutiny, supports replication, and is supported by ethical measurement data acquired during the study process. The way to gain confidence in these methods is to review the work of others.

Each lesson will guide you through review of published scholarly work in the following ways:

Review: apply learned techniques to open access published research, and review and reflect on the methodology, analysis and results presented;
Replication: using source- or synthetic data, reproduce the methodology used in open access published research to test whether claimed analysis and results are replicable.

Synthetic data will include lessons in dependent randomisation, as well as agent-based modelling.
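
As a rough illustration of what synthetic data with dependent randomisation might look like, the sketch below generates records in which the chance of a binary outcome depends on a randomly assigned age group. It is not taken from the course materials: the interpretation of "dependent randomisation", the function names `synthetic_cohort` and `outcome_rate`, the field names, and the probabilities are all assumptions invented for this example.

```python
import random


def synthetic_cohort(n, seed=0):
    """Generate n hypothetical records where 'outcome' depends on 'age_group'."""
    rng = random.Random(seed)  # seeded so the synthetic data is reproducible
    # Hypothetical conditional probabilities, invented for illustration only.
    p_outcome = {"18-39": 0.30, "40-64": 0.20, "65+": 0.10}
    groups = list(p_outcome)
    records = []
    for _ in range(n):
        # First draw: assign an age group uniformly at random.
        age_group = rng.choice(groups)
        # Second draw: the outcome probability depends on the first draw.
        outcome = rng.random() < p_outcome[age_group]
        records.append({"age_group": age_group, "outcome": outcome})
    return records


def outcome_rate(records, age_group):
    """Share of records in one age group where the outcome is True."""
    group = [r for r in records if r["age_group"] == age_group]
    return sum(r["outcome"] for r in group) / len(group)


cohort = synthetic_cohort(10_000, seed=42)
for g in ("18-39", "40-64", "65+"):
    print(g, round(outcome_rate(cohort, g), 3))
```

With a large enough sample, the observed rate in each group should sit close to the probability used to generate it, which gives a known ground truth against which a replicated analysis can be checked.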

On completion of each lesson, students gain useful and meaningful skills, and are not left stranded. This means that even partial completion of the material permits students to be productive members of a research team. The first lesson will ensure students can become professional data wranglers, and – on completion of the first ten lessons – graduates will be capable of taking on a responsible data research role.

This is a brief video demonstrating the first module: https://www.youtube.com/watch?v=nZRL3OabbsY

Course outline

I have prepared an overview of 20 lessons, each requiring two to three weeks to learn, which would comprise the complete course.

Module 1:
- Lesson 1: Introduction to data as a science (view)
- Lesson 2: Research and experiments with data (view)
- Lesson 3: Probability, randomness, and the risk of de-anonymization (view)
- Lesson 4: Sampling, data distribution, and secure data custody (view)
- Lesson 5: Expected statistical outcomes using distributions, and issues for analysis (#4)
- Lesson 6: Techniques in data and population sampling, and assessing standard error (#5)
- Lesson 7: Hypothesis testing, and risks for policy from poor data (#6)
- Lesson 8: Bootstrapping and the risks of algorithmic decision-making (#7)
- Lesson 9: Sample robustness, the central limit theorem, and the ethics and abuses of p-hacking (#8)
- Lesson 10: Publishing and evaluating studies based on cohort data and analysis of variance (#9)
Module 2:
- Lesson 1: Trolley problems, and predictions using regression and least squares (#10)
- Lesson 2: Doctrine of double effect, and interpreting regression with visual and numerical diagnostics (#11)
- Lesson 3: Reflective equilibrium, and methods for multiple regression (#12)
- Lesson 4: Ultimatum games, “fairness” and model selection for multiple regression (#13)
- Lesson 5: Strong and weak machine intelligence, and classification using logistic regression (#14)
- Lesson 6: Emergent systems, strange loops, and supervised and unsupervised learning techniques (#15)
- Lesson 7: Counterfactual consequences, and implementing, testing and optimising classifiers (#16)
- Lesson 8: Human agency and autonomous systems, and permutation testing for classification (#17)
- Lesson 9: Liquid modernity, multiple jurisdictions, and assessing causality in randomised controlled trials (#18)
- Lesson 10: Consolidate what you have learned, and explore machine learning (#19)

The first two lessons are complete, and I estimate about 6 weeks to research and create each of the remaining 18 lessons.

Supporting continued development of Data as a Science

This course is not complete. My objective is that Data as a Science becomes a standard data science core syllabus, much as Core Econ has become for Economics. Progress is slow and dependent on the support and goodwill of others.

Each lesson costs about $5,000 to research and create, and is released here on completion. Please contact me at gchait @ whythawk . com should you wish to sponsor a lesson (or part thereof).

Whois

My name is Gavin Chait, and I am an independent data scientist specialising in economic development and data curation. I spent more than a decade in economic and development initiatives in South Africa. I was the commercial lead of open data projects at the Open Knowledge Foundation, leading the open source CKAN development team, and led the implementation of numerous open data technical and research projects around the world. Recently, I have developed Sqwyre.com, an initiative to develop a comprehensive business intelligence search engine for entrepreneurs. Data are based on open data and Freedom of Information requests.

I have extensive experience in leading research projects, implementing open source software initiatives, and developing and leading seminars and workshops. I have taught for 25 years, including undergraduate courses, adult education, and technical and analytical training at all levels.

This pedagogy and syllabus structure was developed with support from the Gates Foundation and WHO. Initial research into the need for education capacity building arose as a result of research supported by the Hewlett Foundation, Wellcome Trust and Public Health Research Data Forum.

Chait, Gavin; Sujith, Eramangalath; Grzywinska, Dominika; Wainwright, Mark (2018): Supporting capacity and skills development for public health data research management in low- and medium income countries. Wellcome Trust. Journal contribution. https://doi.org/10.6084/m9.figshare.6087161.v1

Citation

Chait, Gavin (2020): Data as a Science. Whythawk. https://doi.org/10.5281/zenodo.4194973

And as a BibTeX entry:

@book{chait_data_2020,
	  title = {Data as a {Science}},
	  copyright = {Creative Commons Attribution-ShareAlike 4.0 International and the GNU Affero General Public License},
	  publisher = {Whythawk},
	  author = {Chait, Gavin},
	  year = {2020},
	  doi = {10.5281/zenodo.4194973},
	  url = {https://doi.org/10.5281/zenodo.4194973}
}

Licensing and release

Course content, materials and approach are copyright Gavin Chait, and released under both the Creative Commons Attribution-ShareAlike 4.0 International and the GNU Affero General Public License licences.

The objective is to ensure reuse, and to ensure that any modifications or adaptations of the source material are released under an equivalent licence.
