PhD Toolbox and data analysis
Academic year 2020/2021
- Dr. Daniel Edward Chamberlain
Prof. Marco Gamba
Stefano Ghignone (Lecturer)
Marco Chiapello (Lecturer)
- 1st year, 2nd year, 3rd year
- Teaching period
- To be defined
- Course disciplinary sector (SSD)
- BIO/05 - zoology
BIO/07 - ecology
BIO/09 - physiology
BIO/11 - molecular biology
- Formal authority
- Type of examination
- Practice test
- See details for specific modules.
Course summary
The overall objectives of the PhD Toolbox are to strengthen the expertise of doctoral students so that they can contribute significantly to their current research, and to improve their competitiveness in the post-doctoral job market through the acquisition of transferable skills. The focus is on technical know-how, with the specific goal of acquiring expertise in advanced programming and statistical techniques for exploratory analyses, including data management, the use of graphical tools for data presentation, and statistical hypothesis testing.
Results of learning outcomes
Knowledge and Understanding
- An understanding of what reproducible research means and the good practices to achieve it (Chiapello module)
- An understanding of what R is and why it is an important tool for every scientist (Chiapello module)
- An understanding of the underlying principles of the use of linear models in statistical analyses (Chamberlain module)
- An understanding of the underlying principles of the use of multivariate statistics and the most common machine learning algorithms (Gamba module)
Applying knowledge and understanding
- The ability to use R to clean, transform and visualise biological data (Chiapello module)
- The ability to analyse data using linear models in the R programming environment (Chamberlain module)
- The basics of using R for principal component analysis, clustering, multilayer perceptrons and support vector machines (Gamba module)
- Improved clarity of presentation of research results in scientific publications, in particular in terms of graphical tools in R (all)
The course will be divided into 4 modules for a total of 36 hours. A full timetable will be published in due course.
- Introduction to UNIX Environment and Command Line Basics (Stefano Ghignone) – 6 hours
UNIX is an operating system which was first developed in the 1960s, and has been under constant development ever since. By operating system, we mean the group of programs which make the computer work. UNIX is a stable, multi-user, multi-tasking system for servers, desktops and laptops, and it is popular in bioinformatics because its powerful command-line tools make scripting and automated analyses relatively easy. In this brief introductory course, we will focus on commands, those pesky little words you type at a command-line prompt to tell the system what to do. The logic behind the command line is fundamental for understanding how the R statistical environment works. We will learn the basics of the UNIX environment and how to interact with it through the terminal, how to move around the filesystem, how to view and edit files, how to manipulate files and directories, and work through some basic bioinformatics examples.
- R for Data Science (Marco Chiapello) – 14 hours
Computers are increasingly essential to the study of all aspects of biology. Data management skills are needed for entering data without errors, storing it in a usable way, and extracting key aspects of the data for analysis. Basic programming is required for everything from accessing and managing data, to statistical analysis, to modelling. This course will provide an introduction to data management, manipulation, and analysis, with an emphasis on biological problems. Classes will typically consist of short introductions or question & answer sessions, followed by hands-on computing exercises. The course will be taught using R, but the concepts learned will easily apply to all programming languages and database management systems. No background in programming or databases is required.
- Generalized Linear Models & Multi-Model Inference (Dan Chamberlain) – 8 hours (per student)
This module will be divided into two streams to match different levels of statistical experience among the students.
Stream 1. Essential Statistical Analyses in R (4 hours); Introduction to Linear Models (4 hours)
The first part of this module will provide a refresher course on basic statistical concepts, such as key descriptive statistics (mean, variance, confidence intervals), different types of data, data distributions and transformations, and statistical hypothesis testing. We will address essential statistical tests in R such as t-tests, ANOVA and correlation, and also consider non-parametric equivalents (e.g. Wilcoxon and Kruskal-Wallis tests and Spearman correlation). In the second part, we will consider standard parametric tests for normally distributed data in the context of linear modelling in R. We will develop models from the simple univariate model (i.e. linear regression) to multivariate analyses, and will include key topics such as interactions, assessing model fit and graphical representations of results.
Prerequisites - R for Data Science (see above) or equivalent basic R knowledge. Knowledge of basic statistical tests and data distributions would be an advantage.
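As an illustration of the simple linear regression mentioned in Stream 1, the sketch below fits a straight line by ordinary least squares and reports R² as a measure of model fit. The course itself uses R, where the same fit is obtained with `lm()`; Python is used here only as a language-neutral sketch. The function name `fit_simple_linear_regression` and the wing-length and body-mass values are hypothetical and not taken from the course material.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squares of x and sum of cross-products of x and y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R squared: proportion of the variance in y explained by the line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return intercept, slope, r_squared


# Hypothetical data: body mass (g) against wing length (mm)
wing_length = [60.0, 62.0, 64.0, 66.0, 68.0]
body_mass = [15.0, 16.1, 16.9, 18.2, 18.8]
intercept, slope, r_squared = fit_simple_linear_regression(wing_length, body_mass)
print(f"mass = {intercept:.2f} + {slope:.3f} * wing length (R^2 = {r_squared:.3f})")
```

The slope is the expected change in the response per unit change in the predictor; adding further predictors or interaction terms extends the same idea to the multiple-predictor models covered later in the stream.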
Stream 2. Introduction to Linear Models (4 hours); Generalized Linear Models (4 hours)
This stream will address topics in Generalized Linear Modelling in R. The first part will advance from the normal to other common data distributions (Poisson, binomial). The second part will address more complex mixed models. Information-theoretic approaches to model selection will be introduced, including both multi-model inference and model averaging, which are especially relevant to models where several explanatory variables are under consideration. Introductions to broader concepts in modelling, such as Generalized Additive Models, spatial and temporal autocorrelation, and model testing, will also be included depending on progress.
Prerequisites - Experience of basic programming and data analysis in R. Knowledge of common statistical tests based on normally distributed data (t-test, ANOVA, correlation, linear regression).
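The arithmetic behind information-theoretic model selection can be sketched in a few lines. The example below assumes that model selection is based on AIC, which the module description does not state explicitly, and computes AIC differences, Akaike weights and a model-averaged coefficient. The model names, AIC values and coefficient estimates are hypothetical, and Python is used only as a language-neutral sketch since the course itself works in R.

```python
import math


def akaike_weights(aic_values):
    """Return (delta_aic, weights) for a dict mapping model name -> AIC."""
    if not aic_values:
        raise ValueError("at least one model is required")
    best = min(aic_values.values())
    # Difference between each model's AIC and the lowest AIC in the set
    deltas = {name: aic - best for name, aic in aic_values.items()}
    # Relative likelihood of each model given the data
    rel_lik = {name: math.exp(-0.5 * d) for name, d in deltas.items()}
    total = sum(rel_lik.values())
    # Normalise so that the weights sum to 1 across the candidate set
    weights = {name: rl / total for name, rl in rel_lik.items()}
    return deltas, weights


def model_averaged_estimate(estimates, weights):
    """Weighted mean of a coefficient over all candidate models.

    Models that do not contain the term contribute an estimate of 0
    (the 'full' averaging convention).
    """
    return sum(weights[name] * estimates.get(name, 0.0) for name in weights)


# Hypothetical candidate set: AIC for three models of the same response
aic = {"temp": 102.0, "temp + rain": 100.0, "null": 110.0}
deltas, weights = akaike_weights(aic)

# Hypothetical estimates of the temperature coefficient in each model
temp_coef = {"temp": 0.50, "temp + rain": 0.40}
averaged = model_averaged_estimate(temp_coef, weights)

for name in aic:
    print(f"{name}: delta AIC = {deltas[name]:.1f}, weight = {weights[name]:.3f}")
print(f"model-averaged temperature coefficient = {averaged:.3f}")
```

A weight can be read as the relative support for a model within the candidate set, so a model with a large AIC difference from the best one contributes very little to the averaged estimate.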
- Machine Learning with R (Marco Gamba) – 8 hours
This module provides Ph.D. students with the basics of Machine Learning using a hands-on lab and application-oriented approach. The first part of the course will look at how conventional statistical analysis relates to Machine Learning, and compare the two. We will then focus on Unsupervised Learning, exploring the most common techniques, from Clustering and Cluster Validation to Dimensionality Reduction, and discussing the advantages and disadvantages of each algorithm. We will then concentrate on Supervised Learning, using some of the most popular algorithms and introducing the concepts of Classification, Training and Testing Split, Neural Network, Support Vector Machine, and Feature Extraction & Selection. We will also consider how to present the results of these analyses by exploring different data visualization techniques. On completion of the Machine Learning module, students will be expected to have a good understanding of the fundamental issues and challenges of these topics: they should possess practical knowledge of Supervised and Unsupervised approaches, know the strengths and weaknesses of the most popular techniques, and be able to implement various algorithms in a range of realistic research applications.
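The idea of a training and testing split can be sketched with a deliberately simple classifier. The example below uses a nearest-centroid classifier as a stand-in for the neural networks and support vector machines named above, since it fits in a few lines. All function names and the two-class data set are hypothetical, and Python is used only as a language-neutral sketch since the module itself works in R.

```python
import random
from collections import defaultdict


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(train):
    """Compute the mean feature vector of each class in the training set."""
    grouped = defaultdict(list)
    for features, label in train:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, test):
    """Proportion of held-out rows that are classified correctly."""
    correct = sum(predict(centroids, features) == label for features, label in test)
    return correct / len(test)


# Hypothetical data: two well-separated classes described by two features
data = [((1.0 + 0.1 * i, 1.0 - 0.1 * i), "A") for i in range(8)]
data += [((5.0 + 0.1 * i, 5.0 - 0.1 * i), "B") for i in range(8)]

train, test = train_test_split(data, test_fraction=0.25, seed=42)
centroids = fit_centroids(train)
print(f"{len(train)} training rows, {len(test)} test rows")
print(f"accuracy on held-out data = {accuracy(centroids, test):.2f}")
```

The classifier is fitted on the training rows only and scored on rows it has never seen, which is what makes the accuracy an estimate of performance on new data rather than a measure of how well the model memorised its input.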
In-person teaching may take place in computer laboratories. Modules on analytical and programming techniques will consist of traditional lessons interspersed with practical demonstrations of software tools. Exercises will be set during lessons (some are expected to be completed outside the timetabled lessons), followed by class discussions.
In the event that teaching in person is not possible, online teaching will be provided through appropriate online platforms (such as Webex).
Learning assessment methods
Assessments will be made both through exercises given during the lesson, and through set homework. No formal mark will be given, but attribution of credits will be based on delivery of set exercises.
Suggested readings and bibliography
Beckerman et al. (2017). Getting Started with R. 2nd Edn. Oxford University Press, Oxford.
Zuur et al. (2009). A Beginner’s Guide to R. Springer, New York.
Peng (2012). R Programming for Data Science.
Wickham (2014). Advanced R.
Grolemund (2014). Hands-On Programming with R.
Grolemund & Wickham (2016). R for Data Science.
The latter four books can be read for free here: https://bookdown.org/
This is the schedule of the course:
10 May 09:00-16:00
11 May 11:00-17:00
12 May 10:00-16:00
13 May 10:00-16:00
14 May 09:00-12:00
17 May 14:00-18:00 (Stream 1)
19 May 14:00-18:00 (Stream 2)
20 May 14:00-18:00 (Stream 1)
24 May 14:00-18:00 (Stream 2)
26 May 14:00-16:00
27 May 14:00-17:00
28 May 14:00-17:00
- Enrollment opening date
- 01/03/2020 at 00:00
- Enrollment closing date
- 31/12/2022 at 23:55