Study Guide
Please use this study guide to create your certification self-study plan. We’ve included the objectives you should meet for each assessed competency, with links to relevant skill assessments.
Exam DS101: Exploratory Analysis, Statistical Experimentation, and Data Management in R or Python
1.1 Calculate metrics to effectively report characteristics of data and relationships between features
- Calculate the measures of center (e.g. mean, median, mode) for variables using R or Python.
- Calculate the measures of spread (e.g. range, standard deviation, variance) for variables using R or Python.
- Calculate the skewness for variables using R or Python.
- Calculate the missingness for variables and understand its influence on reporting characteristics of data and relationships in R or Python.
- Calculate the correlation between variables using R or Python.
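As an illustration of the 1.1 objectives, here is a minimal Python sketch that uses only the standard library. The data and the helper names `skewness` and `pearson` are hypothetical; in practice a data-frame library would likely be used instead.

```python
import math
import statistics

# Hypothetical sample data; None marks a missing observation.
ages = [23, 25, 25, 31, None, 40, 58]
incomes = [31, 35, 36, 44, 50, 52, 80]  # same row order as ages

observed = [a for a in ages if a is not None]

# Measures of center
mean_age = statistics.mean(observed)
median_age = statistics.median(observed)
mode_age = statistics.mode(observed)

# Measures of spread (sample versions, n - 1 denominator)
age_range = max(observed) - min(observed)
var_age = statistics.variance(observed)
sd_age = statistics.stdev(observed)

def skewness(xs):
    """Moment-based (population) skewness: m3 / m2 ** 1.5."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# Missingness: share of rows with no recorded age
missing_share = ages.count(None) / len(ages)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Correlation on complete pairs only (rows with a missing age are dropped)
pairs = [(a, i) for a, i in zip(ages, incomes) if a is not None]
r = pearson([a for a, _ in pairs], [i for _, i in pairs])
```

Dropping incomplete rows before computing the correlation is one choice among several; the missingness share shows how much data that choice discards.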
1.2 Create data visualizations in a coding language to demonstrate the characteristics of data
- Create and customize a bar chart using R or Python.
- Create and customize a box plot using R or Python.
- Create and customize a line graph using R or Python.
- Create and customize a histogram using R or Python.
1.3 Create data visualizations in a coding language to represent the relationships between features
- Create and customize a scatterplot using R or Python.
- Create and customize a heatmap using R or Python.
- Create and customize a pivot table using R or Python.
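The pivot table objective can be sketched without any plotting or data-frame library. The example below is a minimal illustration with hypothetical sales records; it covers only the pivot table, not the scatterplot or heatmap.

```python
from collections import defaultdict

# Hypothetical sales records: (region, quarter, revenue)
records = [
    ("North", "Q1", 100),
    ("North", "Q2", 120),
    ("South", "Q1", 80),
    ("South", "Q2", 95),
    ("North", "Q1", 40),
]

# Rows are regions, columns are quarters, cells hold summed revenue.
pivot = defaultdict(lambda: defaultdict(int))
for region, quarter, revenue in records:
    pivot[region][quarter] += revenue

# Print the table with a header row and tab-separated cells.
quarters = sorted({q for _, q, _ in records})
print("region", *quarters, sep="\t")
for region in sorted(pivot):
    print(region, *(pivot[region][q] for q in quarters), sep="\t")
```

Customizing a pivot table usually means changing the aggregation (sum, mean, count) or the row and column keys; here that corresponds to changing the `+=` step or the tuple fields used as keys.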
1.4 Identify and reduce the impact of characteristics of data
- Describe when a transformation should be applied to a variable and implement a suitable transformation method using R or Python.
- Identify the missing data and implement suitable imputation methods to reduce its impact on analysis or modeling using R or Python.
- Identify and remove the outliers using R or Python.
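The three 1.4 objectives can be sketched together as follows. This is a minimal illustration on hypothetical values: median imputation, the 1.5 × IQR rule, and a log transform are common choices, not the only suitable ones.

```python
import math
import statistics

# Hypothetical measurements; None marks a missing value.
values = [12.0, 15.0, None, 14.0, 13.0, 250.0, 16.0, None, 11.0]

# Imputation: replace missing values with the median of the observed ones.
observed = [v for v in values if v is not None]
fill = statistics.median(observed)
imputed = [fill if v is None else v for v in values]

# Outliers: keep values within 1.5 * IQR of the quartiles.
q1, _, q3 = statistics.quantiles(imputed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [v for v in imputed if low <= v <= high]

# Transformation: a log transform compresses right-skewed positive values.
logged = [math.log(v) for v in cleaned]
```

The median is used for imputation because, unlike the mean, it is not pulled toward the extreme value 250.0.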
2.1 Apply sampling methods to data
- Distinguish between different types of random sampling techniques and apply the methods using R or Python.
- Sample data from a statistical distribution (e.g. normal, binomial, Poisson, exponential, etc.) using R or Python.
- Calculate the probability from a statistical distribution (e.g. normal, binomial, Poisson, exponential, etc.) using R or Python.
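A minimal sketch of the 2.1 objectives using Python's `random`, `math`, and `statistics` modules. The population, strata, and the helpers `binomial_pmf` and `poisson_pmf` are hypothetical names introduced for illustration.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(42)  # seeded so the draws are repeatable
population = list(range(1, 101))

# Simple random sampling without replacement
simple_sample = rng.sample(population, 10)

# Stratified sampling: draw the same number from each stratum
strata = {"low": population[:50], "high": population[50:]}
stratified_sample = [x for group in strata.values() for x in rng.sample(group, 5)]

# Systematic sampling: every 10th unit after a random start
start = rng.randrange(10)
systematic_sample = population[start::10]

# Sampling from distributions
normal_draws = [rng.gauss(0, 1) for _ in range(5)]
exp_draws = [rng.expovariate(2.0) for _ in range(5)]  # rate 2, mean 0.5

# Probabilities from distributions
p_normal = NormalDist(mu=0, sigma=1).cdf(1.96)  # P(Z <= 1.96)

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

p_binom = binomial_pmf(3, 10, 0.5)
p_pois = poisson_pmf(0, 2.0)
```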
2.2 Implement methods for performing statistical tests
- Use different types of graphs to analyze the normality of the samples using R or Python.
- Run simple statistical tests (e.g. t-test, ANOVA test, chi-square test) using R or Python.
- Run suitable statistical tests in the context of the business question using R or Python.
- Interpret the results of statistical tests run in R or Python.
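As one illustration of running and interpreting a test, the sketch below computes a chi-square test of independence for a hypothetical 2×2 table by hand. It applies no continuity correction, so library routines that apply one may report a different p-value. The p-value step relies on the fact that a chi-square variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math
from statistics import NormalDist

# Hypothetical 2x2 table: rows are page variants,
# columns are (converted, did not convert).
observed = [[30, 70],
            [50, 50]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs - expected) ** 2 / expected

# Valid only for 1 degree of freedom (a 2x2 table).
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

# Interpretation at the 5% level: a small p-value is evidence that
# conversion depends on the page variant.
reject_independence = p_value < 0.05
```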
1.1 Perform standard data import, joining and aggregation tasks
- Import data from flat files and databases using R or Python.
- Aggregate numeric, categorical variables and dates by groups using R or Python.
- Combine multiple tables by rows or columns using R or Python.
- Filter the data based on different criteria using R or Python.
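The import, combine, filter, and aggregate tasks above can be sketched with the standard library alone. In-memory text stands in for flat files here, and the table and column names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# In-memory text stands in for two flat files.
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n"
    "1,10,25.0\n"
    "2,11,40.0\n"
    "3,10,15.5\n"
)
customers_csv = io.StringIO(
    "customer_id,segment\n"
    "10,retail\n"
    "11,wholesale\n"
)

# Import: every field arrives as a string.
orders = list(csv.DictReader(orders_csv))
customers = list(csv.DictReader(customers_csv))
for order in orders:
    order["amount"] = float(order["amount"])

# Combine by columns: look up each order's customer segment.
segment_by_customer = {c["customer_id"]: c["segment"] for c in customers}
joined = [
    {**order, "segment": segment_by_customer.get(order["customer_id"])}
    for order in orders
]

# Filter: keep orders of at least 20.
large_orders = [o for o in joined if o["amount"] >= 20]

# Aggregate: total amount per segment.
totals = defaultdict(float)
for o in joined:
    totals[o["segment"]] += o["amount"]
```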
1.2 Perform standard cleaning tasks to prepare data for analysis
- Match strings in the dataset against specific patterns using R or Python.
- Identify different data types in R or Python and convert values between types.
- Clean categorical and text data by manipulating strings in R or Python.
- Clean date and time data by manipulating dates and times in R or Python.
- Explain the concept of tidy data and transform the messy data into tidy data using R or Python.
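A minimal sketch of the cleaning tasks above on two hypothetical records: pattern matching, type conversion, string and date cleaning, and reshaping wide data into tidy form (one row per observation). The ID pattern and the two accepted date formats are assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical messy records: q1 and q2 are quarterly values stored as columns.
raw = [
    {"id": "A-001", "city": "  new york ", "joined": "2023-01-15", "q1": "10", "q2": "12"},
    {"id": "B-17", "city": "BOSTON", "joined": "15/02/2023", "q1": "7", "q2": "9"},
]

# Assumed ID format: one capital letter, a hyphen, three digits.
id_pattern = re.compile(r"^[A-Z]-\d{3}$")

def parse_date(text):
    """Try each assumed date format in turn; return None if none match."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

tidy = []
for row in raw:
    city = row["city"].strip().title()         # trim whitespace, fix case
    joined = parse_date(row["joined"])         # text to date
    valid_id = bool(id_pattern.match(row["id"]))
    # Wide to long: one output row per (id, quarter) observation.
    for quarter in ("q1", "q2"):
        tidy.append({
            "id": row["id"],
            "valid_id": valid_id,
            "city": city,
            "joined": joined,
            "quarter": quarter,
            "value": int(row[quarter]),        # text to integer
        })
```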
1.3 Assess data quality and perform validation tasks
- Identify, calculate and replace the missing values using R or Python.
- Identify, calculate and remove the duplicates using R or Python.
- Perform different types of data validation tasks (e.g. constraint validation, data range validation, code validation, data type validation) using R or Python.
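The duplicate and validation objectives can be sketched as below. The rows, the allowed country codes, the age range of 0 to 120, and the `validate` helper are all hypothetical.

```python
# Hypothetical rows with one exact duplicate and several invalid values.
rows = [
    {"id": 1, "age": 34, "country": "US"},
    {"id": 2, "age": -5, "country": "DE"},
    {"id": 1, "age": 34, "country": "US"},
    {"id": 3, "age": "41", "country": "XX"},
    {"id": 4, "age": None, "country": "FR"},
]
ALLOWED_COUNTRIES = {"US", "DE", "FR"}

# Duplicates: keep the first occurrence of each identical row.
seen = set()
unique_rows = []
for row in rows:
    key = (row["id"], row["age"], row["country"])
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)
n_duplicates = len(rows) - len(unique_rows)

def validate(row):
    """Return a list of problems found in one row (empty if none)."""
    problems = []
    age = row["age"]
    if age is None:
        problems.append("age missing")             # missing value
    elif not isinstance(age, int):
        problems.append("age has wrong type")      # data type validation
    elif not 0 <= age <= 120:
        problems.append("age out of range")        # data range validation
    if row["country"] not in ALLOWED_COUNTRIES:
        problems.append("unknown country code")    # code validation
    return problems

report = {row["id"]: validate(row) for row in unique_rows}
n_missing_age = sum(row["age"] is None for row in unique_rows)
```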
1.4 Collect data from non-standard formats by modifying existing code
- Import data from an API using R or Python.
- Identify the structure of HTML and JSON data and parse them into a usable format for data processing and analysis using R or Python.
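A minimal sketch of parsing JSON and HTML into rows. No network request is made: the `payload` string stands in for an API response body and the `html` string for a downloaded page, and both are invented for the example. The `CellCollector` class is a hypothetical helper built on the standard `html.parser` module.

```python
import json
from html.parser import HTMLParser

# Stand-in for the body of an API response.
payload = (
    '{"results": [{"name": "Ada", "langs": ["R", "Python"]},'
    ' {"name": "Lin", "langs": ["Python"]}]}'
)
data = json.loads(payload)

# Flatten the nested structure: one row per (name, language) pair.
rows = [
    {"name": person["name"], "lang": lang}
    for person in data["results"]
    for lang in person["langs"]
]

class CellCollector(HTMLParser):
    """Collect the text of table cells, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and self.rows:
            self.rows[-1].append(data.strip())

# Stand-in for a fetched HTML page.
html = (
    "<table><tr><th>city</th><th>pop</th></tr>"
    "<tr><td>Oslo</td><td>700000</td></tr></table>"
)
parser = CellCollector()
parser.feed(html)
table = parser.rows
```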
1.1 Perform data extraction, joining and aggregation tasks
- Aggregate numeric, categorical variables and dates by groups using PostgreSQL.
- Interpret the database schema and combine multiple tables by rows or columns using PostgreSQL.
- Extract the data based on different conditions using PostgreSQL.
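The query pattern behind these objectives (join, filter, group, aggregate) can be tried from Python. The sketch below runs against an in-memory SQLite database purely as a stand-in, since no PostgreSQL server is assumed to be available. The schema and data are hypothetical, and the query sticks to standard SQL that PostgreSQL should also accept.

```python
import sqlite3

# SQLite in memory is a stand-in here, not PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    amount REAL,
    order_date TEXT
);
INSERT INTO customers VALUES (10, 'retail'), (11, 'wholesale');
INSERT INTO orders VALUES
    (1, 10, 25.0, '2023-01-05'),
    (2, 11, 40.0, '2023-01-20'),
    (3, 10, 15.5, '2023-02-02');
""")

# Join the tables, extract rows by condition, then aggregate by group.
query = """
SELECT c.segment, COUNT(*) AS n_orders, SUM(o.amount) AS total
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id
WHERE o.amount >= 20
GROUP BY c.segment
ORDER BY c.segment;
"""
result = conn.execute(query).fetchall()
conn.close()
```

Date handling differs between the two systems, so date aggregation is left out of this sketch.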
2.1 Prepare data for modeling by implementing relevant transformations.
- Create new categories from existing data (e.g. seasons from date, categories from continuous data, combining categories from categorical data) using R or Python.
- Explain the importance of splitting data and split data for training, testing, and validation using R or Python.
- Explain the importance of scaling data and implement the scaling using R or Python.
- Transform categorical data into numerical data using R or Python.
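A minimal sketch of the 2.1 preparation steps on hypothetical rows. It assumes northern-hemisphere meteorological seasons, an 80/20 train/test split, min-max scaling, and one-hot encoding. The data are split first so that the scaling range and the category list come from the training rows only, which keeps information from the test rows out of the preparation step.

```python
import random
from datetime import date

# Hypothetical raw rows
rows = [
    {"day": date(2023, 1, 10), "temp": 2.0, "color": "red"},
    {"day": date(2023, 4, 3), "temp": 12.0, "color": "blue"},
    {"day": date(2023, 7, 21), "temp": 27.0, "color": "red"},
    {"day": date(2023, 10, 30), "temp": 9.0, "color": "green"},
    {"day": date(2023, 12, 1), "temp": -1.0, "color": "blue"},
]

def season(d):
    """New category from a date (northern-hemisphere assumption)."""
    return ("winter", "spring", "summer", "autumn")[(d.month % 12) // 3]

def temp_band(t):
    """New category from a continuous value; cut points are arbitrary."""
    if t < 5:
        return "cold"
    if t < 20:
        return "mild"
    return "warm"

# Split into training and test rows with a seeded shuffle.
rng = random.Random(0)
shuffled = rows[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train_rows, test_rows = shuffled[:cut], shuffled[cut:]

# Fit scaling range and category list on the training rows only.
train_temps = [r["temp"] for r in train_rows]
lo, hi = min(train_temps), max(train_temps)
colors = sorted({r["color"] for r in train_rows})

def prepare(r):
    features = {
        "season": season(r["day"]),
        "band": temp_band(r["temp"]),
        "temp_scaled": (r["temp"] - lo) / (hi - lo),  # min-max scaling
    }
    for c in colors:                                   # one-hot encoding
        features[f"color_{c}"] = int(r["color"] == c)
    return features

train = [prepare(r) for r in train_rows]
test = [prepare(r) for r in test_rows]
```

A test row can scale outside the 0 to 1 range, or have all one-hot columns equal to 0, when its value was not seen in training; that is expected with this approach.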
2.2 Implement standard modeling approaches for supervised learning problems.
- Identify the problems that supervised learning models are targeted at.
- Select the regression and classification models and implement the model using R or Python.
- Select the ensemble methods and implement the model using R or Python.
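To make the regression and classification objectives concrete, here is a minimal from-scratch sketch: simple linear regression fitted by least squares and a k-nearest-neighbours classifier. The function names and data are hypothetical, ensemble methods are not covered, and a modelling library would normally be used for real work.

```python
from collections import Counter

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    return slope, my - slope * mx

def knn_predict(train, query, k=3):
    """Majority label among the k training points nearest to query."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Regression: numeric target. The points lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])

# Classification: categorical target.
points = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]
label = knn_predict(points, (4.9, 5.0))
```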
2.3 Implement approaches for unsupervised learning problems.
- Identify the problems that unsupervised learning models are targeted at.
- Select the clustering models and implement the model using R or Python.
- Explain the dimensionality reduction techniques and implement the techniques using R or Python.
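As one illustration of a clustering model, the sketch below implements k-means for one-dimensional data with fixed starting centers, so the result is repeatable. The `kmeans_1d` name and the data are hypothetical, and dimensionality reduction is not covered.

```python
def kmeans_1d(values, centers, iterations=10):
    """Alternate between assigning points and updating centers."""
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster;
        # an empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

values = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
```

No labels are supplied: the groups come from the data alone, which is what separates this from the supervised case.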
2.4 Use suitable methods to assess the performance of a model.
- Select the metrics to evaluate the regression models and calculate the metrics using R or Python.
- Select the metrics to evaluate the classification models and calculate the metrics using R or Python.
- Select the metrics to evaluate the clustering models and calculate the metrics using R or Python.
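A minimal sketch of common evaluation metrics, written from their definitions: RMSE and MAE for regression; accuracy, precision, recall, and F1 for binary classification; and within-cluster sum of squares (often called inertia) for clustering. The function names are hypothetical, and the selection is illustrative rather than exhaustive.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def inertia(clusters):
    """Within-cluster sum of squared distances to each cluster mean (1-D)."""
    total = 0.0
    for cluster in clusters:
        center = sum(cluster) / len(cluster)
        total += sum((v - center) ** 2 for v in cluster)
    return total

reg_error = rmse([1, 2, 3], [1, 2, 5])
clf = classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
wss = inertia([[1, 2, 3], [10, 11, 12]])
```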
3.1 Use common programming constructs to write repeatable production quality code for analysis.
- Define, write and execute functions in R or Python.
- Use and write the control flow statements in R or Python.
- Use and write the loops and iterations in R or Python.
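The three constructs above can be shown in a few lines. The function name, pass mark, and scores are hypothetical.

```python
def classify_score(score, pass_mark=50):
    """Return 'invalid', 'fail', or 'pass' for a numeric score."""
    # Control flow: exactly one branch runs.
    if not 0 <= score <= 100:
        return "invalid"
    elif score < pass_mark:
        return "fail"
    else:
        return "pass"

scores = [72, 48, 101, 50]

# Iteration with a for loop (written as a list comprehension)
labels = [classify_score(s) for s in scores]

# A while loop with an explicit stopping condition:
# accumulate scores until the running total reaches 100.
total, i = 0, 0
while i < len(scores) and total < 100:
    total += scores[i]
    i += 1
```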
3.2 Demonstrate best practices in production code including version control, testing, and package development.
- Describe the basic flow and structure of package development in R or Python.
- Explain how to document code in a package, subpackage, or module in R or Python.
- Explain the importance of testing and write test statements in R or Python.
- Use version control and interpret the changes between versions from history files in R or Python.
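A minimal sketch of the documentation and testing objectives, using a docstring and Python's built-in `unittest` module. The `safe_ratio` function is hypothetical. Package layout and version control are not shown, since they involve files and tools outside a single script.

```python
import unittest

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero.

    A docstring like this one is what documentation tools and help()
    display for the function.
    """
    if denominator == 0:
        return None
    return numerator / denominator

class SafeRatioTests(unittest.TestCase):
    """Each test method checks one expected behaviour."""

    def test_regular_division(self):
        self.assertEqual(safe_ratio(6, 3), 2)

    def test_zero_denominator(self):
        self.assertIsNone(safe_ratio(1, 0))

# Run the tests programmatically and keep the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SafeRatioTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests like these matter because they catch a changed behaviour, such as raising an error on a zero denominator instead of returning None, before it reaches production.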