PhD Toolbox for data analysis
Academic year 2022/2023
- Daniel Edward Chamberlain
- 1st year, 2nd year, 3rd year
- Teaching period
- Course disciplinary sector (SSD): BIO/01 - general botany; BIO/03 - environmental and applied botany; BIO/05 - zoology; BIO/07 - ecology
- Formal authority
- Type of examination: Practice test
Course summary
The overall objectives of the PhD Toolbox are to strengthen the expertise of doctoral students so that it contributes significantly to their current research, and to improve their competitiveness in the post-doctoral job market through the acquisition of transferable skills. The focus is on technical know-how, with the specific goal of acquiring expertise in advanced programming and statistical techniques for exploratory analyses, including data management, the use of graphical tools for data presentation, and statistical hypothesis testing.
Learning outcomes
Knowledge and Understanding
- An understanding of what reproducible research means and the good practices to achieve it
- An understanding of what R is and why it is an important tool for every scientist
- An understanding of the underlying principles of the use of linear models in statistical analyses
- An understanding of the underlying principles of the use of multivariate statistics and the most common machine learning algorithms
- An understanding of how to develop packages for R
Applying knowledge and understanding
- The ability to use R to clean, transform and visualise biological data
- The ability to analyse data using linear models in the R programming environment
- The basics of using R for principal component analysis, clustering, multilayer perceptrons and support vector machines
- The ability to design personalized packages in R
- Improved clarity of presentation of research results in scientific publications, in particular in terms of graphical tools in R (all)
The course will be divided into 7 modules for a total of 40 hours. The modules are divided into two streams. Stream 1 consists of basic modules that are obligatory for all students on the PhD course. Stream 2 consists of optional modules that deal with more advanced statistics. A full timetable will be published in due course.
- R for Data Science: R for Beginners – 8 hours
This module is dedicated to those approaching R and RStudio for the first time. It will include a practical introduction to R, functions to import and handle tables in R, and base graphics in R. Students will be guided by the lecturers in constructing scripts that let them learn the main functions and their use directly. For each topic, students will be able to propose their own datasets as real case studies.
- Essential Statistical Analyses in R – 4 hours
The first part of this module will provide a refresher course on basic statistical concepts, such as key descriptive statistics (mean, variance, confidence intervals), different types of data, data distributions and transformations, and statistical hypothesis testing. We will address essential statistical tests in R such as t-tests, ANOVA and correlation, and also consider non-parametric equivalents (e.g. Wilcoxon and Kruskal-Wallis tests and Spearman correlation). In the second part, we will consider standard parametric tests for normally distributed data in the context of linear modelling in R. We will develop models from the simple univariate model (i.e. linear regression) to multivariate analyses, and will include key topics such as interactions, assessing model fit and graphical representations of results.
- Machine Learning with R – 8 hours
This module provides PhD students with the basics of Machine Learning using a hands-on, application-oriented lab approach. The first part of the course will look at how conventional statistical analysis relates to Machine Learning and compare the two. We will then focus on Unsupervised Learning, exploring the most common techniques, from Clustering and Cluster Validation to Dimensionality Reduction, and discussing the advantages and disadvantages of each algorithm. We will then concentrate on Supervised Learning, using some of the most popular algorithms and introducing the concepts of Classification, Training and Testing Split, Neural Networks, Support Vector Machines, and Feature Extraction and Selection. We will also consider how to present the results of these analyses by exploring different data visualization techniques. On completion of the Machine Learning module, students will be expected to have a good understanding of the fundamental issues and challenges of these topics: they should possess practical knowledge of Supervised and Unsupervised approaches, know the strengths and weaknesses of the most popular techniques, and be able to implement various algorithms in a range of realistic research applications.
- Writing R Packages – 4 hours
R offers a very wide variety of packages that greatly expand its analytical capabilities beyond those of the base packages. Even though the number of available packages is large, some functions are lacking, and developing new functions or modifying existing ones may sometimes be necessary. Recently developed tools for creating R packages automate the procedures needed to define the structure of a package. This module will provide general guidelines on these tools and on their potential to optimize data analysis routines, and will highlight their importance in the dissemination of research.
- R for Advanced Users – 8 hours
Topics: how to use lists in R, graphics in R, and reproducible research with RMarkdown. This Stream 2 module is dedicated to students who want to become familiar with some essential tools in advanced R applications, including effective scripting, scientific graphics and reproducible research. The course is application-oriented, so most of the time will be spent working hands-on at the computer.
- Introduction to Linear Models and Generalized Linear Models – 8 hours
This module will address topics in Generalized Linear Modelling in R. The first part of this module will advance from the normal distribution to other common data distributions (Poisson, binomial). The second part will address more complex mixed models. Information-theoretic approaches to model selection will be introduced, including both multi-model inference and model averaging, which are especially relevant to models where several explanatory variables are under consideration. Introductions to broader concepts in modelling, such as Generalized Additive Models, spatial and temporal autocorrelation, and model testing, will also be included depending on progress.
- Plotting maps in R – 2 hours
This module will provide a brief introduction to plotting maps in R and will include: spatial systems, vector and raster data; how to plot country maps and add data to them; and how to plot maps with ggplot2.
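As orientation for the two linear-modelling modules listed above, the step from the linear model to a generalized linear model, and the criterion most often used in information-theoretic model selection, can be sketched as follows. This is a generic textbook formulation, not taken from the course material; the symbols ($y_i$ for the response, $x_{ij}$ for the explanatory variables, $\beta_j$ for the coefficients, $\eta_i$ for the linear predictor, $g$ for the link function, $k$ for the number of estimated parameters and $\hat{L}$ for the maximized likelihood) are notation assumed here for illustration.

```latex
% Linear model: normally distributed errors, identity link
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

% Generalized linear model: a link function g relates the mean to the linear predictor
g(\mu_i) = \eta_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad \mu_i = \mathrm{E}[y_i]

% Canonical links for the two distributions named in the module
\text{Poisson (counts):} \quad \log \mu_i = \eta_i
\qquad
\text{Binomial (proportions):} \quad \log \frac{\mu_i}{1 - \mu_i} = \eta_i

% Akaike information criterion, used to compare candidate models (lower is better)
\mathrm{AIC} = 2k - 2 \ln \hat{L}
```

With the identity link $g(\mu_i) = \mu_i$ and normal errors, the generalized linear model reduces to the linear model in the first line.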
Teaching in person may be provided in computer laboratories (for modules concerning analytical and programming techniques, which will consist of traditional lessons interspersed with practical demonstrations of the use of software tools). Exercises will be set during lessons (some will be expected to be completed outside the set timetable of lessons), followed by class discussions.
In the event that teaching in person is not possible due to Covid, online teaching will be provided through appropriate online platforms (such as Webex).
Learning assessment methods
Assessments will be made both through exercises given during the lesson, and through set homework. No formal mark will be given, but attribution of credits will be based on delivery of set exercises.
Suggested readings and bibliography
Beckerman et al. (2017). Getting Started with R. 2nd Edn. Oxford University Press, Oxford.
Zuur et al. (2009). A Beginner's Guide to R. Springer, New York.
Peng (2012). R Programming for Data Science.
Wickham (2014). Advanced R.
Grolemund (2014). Hands-On Programming with R.
Grolemund & Wickham (2016). R for Data Science.
The latter four books can be read for free here: https://bookdown.org/
STREAM 1 - all lessons will take place in Aula 1, Via Accademia Albertina 13
- R for Data Science: R for Beginners: 13/02/23, 09:00-13:00
- Essential Statistical Analyses in R: 16/02/23, 08:30-12:30
- Machine Learning with R: 21/02/23, 09:00-13:00
- Writing R Packages: 24/02/23, 09:00-13:00
- Introduction to LMs and GLMMs (Aula 1, Via Accademia Albertina 13): 27/02/23, 14:00-18:00; 02/03/23, 14:00-18:00
- R Advanced (Auletta 1, Orto Botanico): 07/03/23, 09:00-13:00; 09/03/23, 09:00-13:00
- R Mapping (Auletta 1, Orto Botanico): 10/03/23, 09:00-13:00