International Data Science in Schools Project: Searchable Index

Abbreviated Topic Lists: A Searchable Index

Dynamic Search - type in a keyword: Search for any word matches?

Outline of Unit 1

	Topic Area - 1.1: Data science and me
		1.1	Role of data in decision-making ― at home, in government, business, industry, sport, ...
		1.2	Introduction to Data Science and the Data Science learning cycle
1	What is Data Science?	1.3	Data Science success stories
		1.4	Data Science disasters
		1.5	Elaboration of steps in the data-science cycle
		1.6T	Diverse uses of the term " Data Science "
		1.7T	Sources of information about Data Science and its activities

		2.1	Data I keep
		2.2	Data about me
2	What does Data Science	2.3	Data on friends and family
	have to do with me?	2.4	What are data?
		2.5	Privacy, security and openness/accessibility ― issues and trade-offs
		2.6T	Pedagogical issues relating to leading discussions and extracting issues

		3.1	Examples of data
		3.2	What are data?
		3.3	How do we get useful data? Primary versus secondary data
3	Sources of data	3.4	Privacy, security and openness/accessibility -- issues and trade-offs
		3.5	Thinking critically about data: introduction
		3.6	Thinking critically about data: data quality and GIGO
		3.7	Thinking critically about data: ways in which data need critical appraisal
		3.8T	Pedagogical issues relating to these topics

		4.1	Ideas (with examples) about using data
		4.2	Examples of social and personal consequences
4	Examples of Data Science	4.3	How can the prediction process go wrong?
	problems	4.4	Examples of causal explanation and use for control
		4.5	How can the process of finding causes go wrong?
		4.6T	Pedagogical issues relating to these topics

5T	Extracting pertinent lessons	5.1T	Leading discussions and extracting key issues from student contributions
	from student discussions	5.2T	Story Telling

	Topic Area 1.2 Basic tools for exploration and analysis. Part 1: Tools for a single variable
		1.1	Observations and features/variables
		1.2	Numerical versus categorical variables
1	Rectangular data sets	1.3	Importing data files in simple formats
		1.4	The need for data cleaning
		1.5T	Knowledge of Topic Area 1.6
		1.6T	Pedagogical issues relating to these topics

		2.1	Frequency tables and bar charts (counts and proportions)
		2.3	Good ways of ordering groups for displays and tables
2	Graphics and summaries	2.4T	Proportions versus counts: what works best for what?
	for a single categorical	2.5T	Weaknesses of pie charts and stacked bar charts
	feature/variable	2.6T	Pedagogical issues relating to these topics

		3.1	Graphics: Dot plots and histograms as large-data-set alternative.
		3.2	Characteristics to look out for and their implications
		3.3	Summaries
3	Graphics and summaries	3.4	Box plot as plot of summaries
	for a single numeric	3.6	Converting numerical features /variables to categorical
	feature/variable	3.6T	How the summary measures are obtained
		3.7T	Pedagogical issues relating to these topics


	Topic Area 1.3 Basic techniques for exploration and analysis. Part 2: Pairs of features/variables
		1.1	Making comparisons
		1.2	Interpreting group comparisons
1	Comparing groups	1.3	Extension to panel plots
		1.4T	Pedagogical issues relating to these topics
		1.5T	Comparing and evaluating different presentations

		2.1	Scatter plots
		2.2	Outcome/Response features/variables versus Predictor/Explanatory features/variables
2	Relationships between two	2.3	Construction
	numerical features/variables	2.4	Structure in scatter plots
		2.5	Basic ideas of prediction
		2.6	Vertical strips as a guide for sketching trend curves by eye
		2.7	How predictions can fail
		2.8	Minimizing average prediction errors
		2.9	Obtaining trend lines and slider-controlled smooths from software
		2.10	(Straight) lines and interpreting the intercept and slope coefficients of a trend line
		2.11	Positive and negative associations
		2.12	Modifications to scatter plots to overcome perceptual problems
		2.13T	Working with algebraic expressions
		2.14T	Pedagogical issues relating to these topics

		3.1	Two-way tables of counts and proportions
3	Relationships between	3.2	Side-by-side and separate bar charts or dot charts of proportions
	categorical features/variables	3.3T	Pedagogical issues relating to these topics

4	Filtering data	4.1	Filtering data by levels of a categorical feature/variable
		4.2T	Pedagogical issues relating to these topics


	Topic Area 1.4 Basic tools for exploration and analysis. Part 3: Three or more variables
1	Pairs plots	1.1	Pairs plots that will cope with categorical and numerical features/variables
		1.2T	Pedagogical issues relating to these topics

		2.1	Panel plots/faceting and 3-dimensional summary tables
2	Subsetting by a third	2.2	Playing or stepping through the sequence of plots in panel display
	feature/variable	2.3	Highlighting subgroups in a scatter plot or dot plot
		2.4T	Pedagogical issues relating to these topics.

		3.1	Coloring points in dot plots and scatter plots
3	Other ways of adding	3.2	Sizing points in scatter plots
	information on additional	3.3	Labeling points
	features/variables to 1- and 2-	3.4	Strengths and weaknesses of methods of adding information
	feature/variable plots	3.5T	Pedagogical issues relating to these topics

		4.1	Plots that allow querying of elements
4	Interactive plots	4.2	Linked plots, linked plots-and-tables
		4.3T	Pedagogical issues relating to these topics


	Topic Area 1.5 Graphs and Tables: how to construct them and when to use them
1	When to use a graphical	1.1	Exploration and discovery versus presentation
	display and when to use a table	1.2	Infographics and infotables

		2.1	The graphical process: a chain from graph creator to graph interpreter
		2.2	Visualization Principle 1. Use Position along a common scale
		2.3	Visualization Principle 2. Choose an appropriate Aspect Ratio
2	What makes a graphical	2.4	Visualization Principle 3. Encoding features/variables
	display good or bad?	2.5	Visualization Principle 4. Supply an informative caption
		2.6	Visualization Principle 5. May need more than one graph
		2.7	Other factors that good software generally gets right by default
		2.8T	Pedagogical issues relating to comparing different types of graphs

		3.1	Plotting samples of numerical data to explore relationships
3	What sort of graphical	3.2	Plotting a numerical and a categorical feature/variable
	display should I use?	3.3	Plotting features/variables that change over time
		3.3	Plotting two numerical features/variables to explore relationships
		3.5T	Pedagogical issues relating to comparing different types of graphs

		4.1	Role of tables
4	Tables: their purpose, and	4.2	Principles for making patterns in tabular information
	how to create good tables	4.3	When it is often better to use a table rather than a graph
		4.4T	Pedagogical issues relating to using tables

	Topic Area 1.6 The data-handling pipeline
		1.1	Automating the data science process
		1.2	Data management principles
1	Introduction to tool	1.3	Case studies of real data-science projects using various toolsets
	support for the data-handling pipeline	1.4T	Characteristics and evaluation of some widespread tool-sets
		1.5T	Pedagogical issues relating to tool use/mastery


		2.1	Data sources
		2.2	Logical data formats
2	Getting and storing data	2.3	Physical file formats
		2.4T	Case studies of good data sources
		2.5T	Comparison and evaluation of storage approaches
		2.6T	Pedagogy issues relating to data sources and storage

		3.1	Automating an analysis
		3.2	Data cleaning
3	Tool support for exploring	3.3	Data transformations
	and analyzing data	3.4T	Learning to use more aspects of the programming language
		3.5T	Pedagogical issues for coding

		4.1	Principles of communication
		4.2	Customizing graphs and tables
4	Generating presentations	4.3	Combining explanation with graphs/tables
	Of the data	4.4T	Comparison and evaluation of tools that allow generating presentations
		4.5T	Pedagogy issues relating to generating presentations

	Topic Area 1.7 Avoiding being misled by data
		1.1	What do we mean by GIGO?
1	GIGO - Garbage In,	1.2	Examples of garbage. How can we avoid collecting or using garbage?
	Garbage Out	1.3T	Pedagogical issues relating to these topics

		2.1	Biases due to measurement issues
2	Bias and what we can do	2.2	Biases due to selection or filtering in data streams
	about it	2.3	(Discussion Topic) Extrapolating from the data we have to a larger setting
		2.4T	Pedagogical issues relating to these topics

		3.1	Allowing for an important third feature/variable
3	Problems and solutions in	3.2	Difference between an observational study and a randomized experiment
	reaching causal conclusions	3.3	Extrapolating from the data at hand to a larger setting
		3.4T	Pedagogical issues relating to these topics

		4.1	Learning to ask questions that can be answered from data
4	Questions that can and	4.2	Learning to spot questions that cannot be answered from the available data
	cannot be answered by	4.3T	Pedagogical issues relating to these topics
	data

		5.1	Random sampling is not perfect
5	Sampling errors and	5.2	Unpacking the likely extent of sampling error
	confidence intervals	5.3	Experiencing how confidence intervals can be constructed
		5.4T	Pedagogical issues relating to these topics

		6.1	Randomized assignment is not perfect
6	Randomization variation	6.2	When to conclude that observed group differences are real?
	in experiments	6.3	Experiencing a two-group randomization test
		6.4T	Pedagogical issues relating to these topics

	Topic Area 2.1 Time Series data
1	Problem elicitation and	1.1	The nature of time series data
	formulation: Time Series	1.2	Reasons why people are often interested in time series data
	data

		2.1	Obtaining time-series datasets
2	Getting the data	2.2	Some common date-and-time feature/variable formats
		2.3	Transforming and reshaping data sets

		3.1	Basic time-series plots and recognizing features
3	Exploring the data	3.2	Identifying the trend + seasonal oscillation components
		3.3	Decomposition into trend + season + residual
		3.4	Comparing related series

		4.1	Forecasting as projecting patterns from the past
4	Analyzing the data:	4.2	Making an informal forecast
	Modelling and Forecasting	4.3	Experience with using a formal forecasting method
		4.4T	Pedagogical issues relating to these topics

		5.1	Selecting features to be communicated
5T	Communicating the	5.2	Choosing a communication method
	results; next question?	5.3	Telling the story

	Topic Area 2.2 Map data
		1.1	Ubiquitous nature of maps
		1.2	Constituent components of maps
1	What are the purposes of	1.3	Separation of data from display
	maps?	1.4	Multiple dimensions
		1.5	Interactive example as a tool for exploratory learning

		2.1	Location and Region Data as common archetypes
		2.2	Plotting points on downloaded map tiles, relationship to scatterplots
2	How do we build and	2.3	Coding added-variable-information at location points; interpretation
	work with location maps?	2.4	Subsetting/Faceting; ways of showing changes over time
		2.5	Interactivity with location map-plots

		3.1	Shape files and choropleth maps; region labels
		3.2	Matching regions in a dataset to regions in a shape file
		3.3	Representing two or more features/variables; issues of scales
3	How do we build and	3.4	Perceptual problems with choropleth maps; alternative representations
	work with regional maps?	3.5	Subsetting/Faceting as a tool; ways of showing changes over time
		3.6	Interactivity with regional maps
		3.7T	Subtleties of color and scale choice - communication enhancement
		3.8T	Distortions and bias: avoiding misleading figures projections

		3.3T	Maps as visualization versus maps as data
4	(TEACHER-only TOPIC)	3.4T	Maps as data
	What is a Map, and how is	3.5T	Finding patterns in data through maps
	this Data?	3.6T	Overlays on maps

	Topic Area 2.3 Text data
		1.1	Examples of questions to be addressed using natural language
		1.2	Important characteristics of text data
1	Problem elicitation and	1.3	Extracting tokens
	formulation: Text data	1.4	Removing stop words and performing stemming
		1.5T	Characteristics of text data

		2.1	Constructing frequency tables of tokens
		2.2	Generating bar charts and word clouds of token frequencies
		2.3	Limitations of unigrams; extracting bigrams
2	Bag of words analysis of	2.4	Summarizing bigrams; comparing unigrams and bigrams
	text data	2.5	Exploring differences between documents
		2.6	Distinguishing the content of documents
		2.7	Limitations of bag of words analysis
		2.8T	Pedagogical issues relating to text data analysis

		3.1	What is sentiment and why do people want to use it?
3	Sentiment Analysis	3.2	Merging tokens with sentiment data tables
		3.3	Summarizing sentiment in a document; differences between sources
		3.4	Limitations of sentiment analysis
		3.5T	Issues relating to sentiment analysis


	Topic Area 2.4 Supervised learning
1	Problem elicitation and	1.1	What is classification?
	formulation: Supervised	1.2	Classification models and rules
	Classification	1.3	Measuring how well the classification model works

		2.1	Components of a classification tree and how it works
2	Introduction to	2.2	Misclassification rate
	Classification Trees	2.3	Node/leave "purity"
		2.4	Generating Classification and Regression Trees
		2.5	Consequences of misclassification

		3.1	Introduction to R/Python commands to grow and visualize a CART
3	Growing Classification	3.2	When to stop growing the tree
	Trees	3.3	What is overfitting? Validation data
		3.4	Pruning

		4.1	What does the tree tell you about how classifications are made?
4	Communicating the	4.2	Using the tree to make decisions
	Results; next question?	4.3T	Pedagogical issues relating to these topics

		5.1	Measuring quality of prediction
		5.2	Interpreting regression trees
5	Introduction to	5.3	Building regression trees with one predictor feature/variable
	Regression Trees	5.4	Building regression trees with more than one predictor feature/variable
		5.5	Comparing trees with a validation set


	Topic Area 2.5 Unsupervised Learning
1	Problem elicitation and	1.1	What is unsupervised learning?
	formulation: Unsupervised	1.2	Creating clusters (groups) of data points based on attributes of the data
	Learning	1.3	Contrast with classification or supervised learning

		2.1	Obtain appropriate datasets
2	Getting and exploring data	2.2	Interpreting the structure of the data
		2.3	Contrast with Supervised Learning
		2.4	Motivation for identifying clusters automatically

		3.1	K-means is a clustering algorithm which is iterative in nature
		3.2	Use of distance metric to assign each data point to a cluster automatically
3	Example of Unsupervised	3.3	Explanation of iterative procedure
	learning algorithm:	3.4	The need to repeat with different initial guesses for cluster centers
	K-means clustering	3.5	Use a small set of data points to explain the K-means algorithm manually
		3.6	What is an outlier in this context?
		3.7	Discuss distance of an object from the center of its assigned cluster

		4.1	Introduction to format of the data set
		4.2	Introduction to the programming environment to use for clustering
4	Implementing K-means	4.3	Clean and transform the data set to observations versus features
	clustering on a large data	4.4	Choose K and run the algorithm using the code snippet provided
	set	4.5	Interpret the results
		4.6	Change the value for K, and repeat; compare how selections have changed
		4.7T	Pedagogical issues relating to these topics

		5.1	When is unsupervised learning as best approach for problem at hand?
		5.2	Exploratory analysis and graphs to summarize the input data.
5	Use in Problem solving	5.3	Visualizations to show differences for different values of K
		5.4	Making an optimal choice of K
		5.5	Descriptive statistics for the features in each cluster
		5.6	Interpreting and communicating the results
		5.7	Effect of human factors

6	Other unsupervised	6.1	Examples when distance-based methods may not be appropriate
	learning methods	6.2	Motivate need for other clustering methods
	Alternatives to K-means	6.3	Visualizations of different cluster shapes unsuited to K-means
	clustering	6.4	Examples of visualizations

	Topic Area 2.6 Recommender Systems
		1.1	Examples of some recommender systems
		1.2	Desirable features of recommender systems
1	Problem elicitation and	1.3	Ethical issues for recommender and personalized systems
	formulation:	1.4	Additional complexities
	Recommender Systems	1.5	Communicating the recommenders,
		1.6T	Characteristics of a wide variety of recommender systems
		1.7T	Pedagogical issues relating to recommender systems

		2.1	Ratings (on user-item pairs): sparsity of the data.
		2.2	Data quality issues
2	The data used by	2.3	Feature data on items, demographic data on users
	Recommender Systems	2.4	Ethics with data
		2.5	Storing the data
		2.6T	Tools to collect and manipulate ratings data
		2.7T	Pedagogical issues relating to data for recommender system

		3.1	Concept: Recommendation based on single-user data, from item similarity
		3.2	Measures of item similarity
3	Content-based	3.3	Analysis and recommendation based on calculating nearest neighbors
	recommendation	3.4	Analysis and recommendation based on forming clusters of items
		3.5T	Tools for calculating similarity, clusters etc

		4.1	Concept: recommend items that are liked by similar users
		4.2	Define similarity of users
4	Collaborative-filtering	4.3	Use regression to predict unseen rating from known ones
		4.4	Ethics issues
		4.5T	Tools that calculate predicted ratings
		4.6T	Pedagogical issues relating to collaborative filter

		5.1	User satisfaction; proxies to measure this
		5.2	Measures from information retrieval
5	Evaluation of a	5.3	Ethics rules that apply to user studies
	Recommender System	5.4T	Tools to calculate IR measures
		5.5T	Pedagogical issues relating to recommendation evaluation

	Topic Area 2.7 Interactive Visualization
1	Why visualization? The	1.1	The power of visualization
	role of Visualization	1.2	Visual Exploratory Data Analysis
	in the Data Science	1.3	Visual Communication and Presentation
	Learning cycle	1.4	Considerations and challenges of visualization design


		2.1	Brief introduction to perceptual and cognitive capacity.
2	Data Types and Visual	2.2	Hierarchy of visual variables
	Variables	2.3	Color
		2.4	Motion

		3.1	The role of interaction
3	Interaction	3.2	Types of interaction 1. Basic
		3.3	Types of interaction 2. Manipulating the layout
		3.4	Types of interaction 3. Manipulating the data

		4.1	Understanding the audience and the task
4	Critique	4.2	Perceptual biases that affect visualization efficacy
		4.3	Inappropriate encodings of data
		4.4	Scales, legends, decorations

	Topic Area 2.8 Confidence intervals and the bootstrap
1	Parameters versus	1.1	Motivation and examples
	estimates

		2.1	Behavior of sampling errors in various contexts
		2.2	What is a standard error?
2	Sampling error	2.3	Relation between sample size and standard error
		2.4	Inverse-root-n relationship between sample size and sampling error
		2.5	Effect of population size on sampling error
		2.6T	Simulation and simulation tools

		3.1	The concept of a confidence interval
3	Confidence intervals and	3.2	Interpretation of confidence intervals for a single parameter
	their implementation using	3.3	Experiencing bootstrap resampling
	bootstrapping	3.4	Comparing bootstrap standard error with usual approach
		3.5T	Ways to facilitate learning from simulation

		4.1	Coverage properties
4	Investigating performance	4.2	(Optional) More advanced situations
		4.3T	Ways to facilitate learning from simulation

5	Differences and ratios	5.1	Constructing and interpreting bootstrap confidence intervals for differences
		5.2T	Ways to facilitate learning from simulation

6	Further exploration	6.1	(Optional) Explore sampling from a set of theoretical distributions
		6.2T	Ways to facilitate learning from simulation

	Topic Area 2.9 Randomization tests and Significance testing
1	Randomized Experiments	1.1	The issue of assessing a possible treatment effect
	and Randomization variation
		2.1	Generation and display of randomization distributions
		2.2	Discussion: Concluding that you had evidence of a real difference
2	Towards the randomization	2.3	Motivate concept of re-randomization
	test	2.4	Generating the randomization distribution
		2.5	Scope of inference justified by the experiment
		2.6T	Simulation and simulation tools

		3.1	A general procedure for conducting randomization tests
3	Randomization test	3.2	Analyzing data from several experiments
		3.3	Language and idea of P-value and sidedness of a test
		3.4	Generalize to 3 or more groups using simple distance measure
		3.5T	Simulation and simulation tools

4	Investigating the
	performance of	4.1	Explore using large data set
	randomization tests in a	4.2T	Simulation and simulation tools
	sampling context

5	Relationships between	5.1	Assessing treatment differences using randomization
	categorical features/variables	5.2T	Simulation and simulation tools

	Topic Area 2.10 Image data
1	Examples of image data	1.1	Sources and types
	and their uses	1.2	Images as data and methods for representing them


		2.1	Sources of images
2	Image data origins and	2.2	Ways of encoding images in digital form
	encodings	2.3T	Comparing encoding methods
		2.4T	Encodings for video and audio


		3.1	Representing images using a numerical matrix
3	Image data representation	3.2	Relating individual pixels to color
	and basic mathematical	3.3T	Visual effects of mathematical operations on numerical versions of images
	operations on images	3.4T	Higher-dimensional representations of 3D spatial data
		3.5T	Using image processing operations to augment image data



		4.1	Examples of low-level features
4	Feature detection in s	4.2	Overview of some ways features can be identified in an image
	image	4.3	A feature vector as an imperfect representation of the image.
		4.4T	Alternative choices for the features to consider
		4.5T	Impact of high-dimensionality of typical feature vector representations


		5.1	Review supervised-learning approach to a classification task
5	Image classification	5.2	Obtaining high-quality training data for image classification
		5.3	Applying an automated classifier to learn a classification
		5.4T	Advantages and disadvantages of running a classifier directly on image data
		5.5T	Unsupervised learning approaches to image classification