PYTHON AND R FOR DATA SCIENCE (LAB)

Valerio Rughetti

Learning objectives

The course aims to provide technical coding skills for data analysis. The Python programming language and the R environment are presented with a specific focus on the libraries, modules and functions that allow students to manage data effectively. This provides an in-depth understanding of the approaches used to preprocess, clean, visualize and analyze data from a wide range of contexts. Students will mainly acquire the practical skills necessary to analyze real data.

Expected learning outcomes

Knowledge and understanding: The course offers key techniques for managing and analyzing data programmatically in order to extract useful insights. It provides a good understanding of the fundamental issues in data analysis, along with knowledge of the programming libraries needed to analyze data effectively.

Applying knowledge and understanding: On successful completion of this course, students will be able to:
● Organize, visualize, and analyze large, complex datasets using the Python and R programming languages.
● Extract knowledge from data.
● Make effective use of data science tools and techniques.

Making judgements: Students are expected to analyze complex datasets. Working both independently and in small groups, students will be required to complete project work (three students per group). The project work will allow students to make their own judgements about the most appropriate techniques to apply in a given use case. This will, in turn, allow them to critically assess the advantages and disadvantages of the different approaches.

Communication skills: The course will enable students to understand the terms and concepts of data science, along with the main concepts, libraries and abstractions of R and Python programming in this context. Communication within small groups of three students will play a key role in the development of the course project work. Small groups are important communication units in academic, professional, civic, and personal contexts; students will learn to communicate their ideas and analyses effectively and to write appropriate technical reports on the data analyses carried out.

Learning skills: This course will enable students to enhance their technical skills in order to manage data effectively with the Python and R programming languages. Strong emphasis will be given to the practical application of data science libraries to real use cases.

Course contents

The course will cover the following topics:
● Python and R programming languages (variables, conditional expressions, loops, functions).
● Data Loading and File Formats.
● Data Cleaning.
● Data Manipulation with pandas.
● Plotting and Visualization.
● Building and optimizing pipelines in scikit-learn.
● Data Visualization with ggplot2.
● Data Transformation with dplyr.
● Relational Data with dplyr.
● PySpark: How to use Apache Spark from Python.

Reference texts

Lecture notes, research papers and course material will be made available on the e-learning platform.

Recommended reading:
● “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython”, 2nd Edition, by Wes McKinney. Publisher: O'Reilly Media, Inc. Release Date: October 2017. ISBN: 9781491957660.
● “R for Data Science: Import, Tidy, Transform, Visualize, and Model Data” by Garrett Grolemund and Hadley Wickham. Release Date: December 2016. ISBN: 978-1491910399.

Teaching methods

The course consists of practical lab sessions and group project work, complemented by lectures that explain the concepts underlying the laboratory sessions.

Assessment methods

The final exam is worth 30% of the final evaluation. It consists of a written exam with questions on programming code, programming abstractions and libraries (Python and R). The remaining 70% of the final evaluation is given by the delivered project work.

The project work must be carried out by groups of exactly 3 students. Students are required to deliver updates of the project work continuously during the course timeframe via Git. The instructor will define the project work three weeks after the start of the course by presenting several project options (students can choose the project they are most interested in). The project work consists of 2 assignments: one requiring code written in R and one requiring code written in Python. The instructor will reserve adequate time during the course to comment on, and give feedback about, the delivered code.

Bonus points are granted to students who deliver code during the course timeframe and present their final project to the instructor (and to the class) in the last lesson; no bonus points are granted to students who deliver the project only a few days ahead of the exam session. Bonus points may also be granted to students who present interesting project advancements to the instructor during classes (e.g., noteworthy results).

Criteria for assigning the final thesis

A thesis will be assigned (upon specific request to the instructor) to students who demonstrate a serious and motivated interest in the course subjects.

Week 1

Introduction to the course. Python and R programming basics.

Week 2

Python programming basics: Conditional statements, loops, functions. Integrated Development Environments (IDEs). Notebook environments. Jupyter and Google Colab. How to structure Python code in a project. How to manage libraries in Python using virtual environments.
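
As an illustration of the basic constructs listed for this week, the following is a minimal sketch combining a conditional statement, a loop, and functions. It is not official course material; the function names and the sample readings are hypothetical and chosen only for the example.

```python
def classify_temperature(celsius):
    """Return a coarse label for a temperature reading in degrees Celsius."""
    if celsius < 10:
        return "cold"
    elif celsius < 25:
        return "mild"
    return "hot"


def count_labels(readings):
    """Count how many readings fall under each label."""
    counts = {}
    for value in readings:
        label = classify_temperature(value)
        counts[label] = counts.get(label, 0) + 1
    return counts


# Hypothetical sample data, for illustration only.
readings = [3.5, 12.0, 27.1, 18.4, 31.0]
print(count_labels(readings))
```

Splitting the logic into two small functions keeps the conditional rule separate from the loop that applies it, which makes each part easier to test on its own.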

Week 3

R programming basics: Conditional statements and loops. Integrated Development Environments (IDEs). How to structure R code in a project. How to manage the installation of libraries in the R programming language.

Week 4

R and Python: ● Data Loading and File Formats. ● Data Cleaning and Preparation.
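
To give a flavour of what loading and cleaning a small dataset can involve, here is a minimal Python sketch that uses only the standard library. It is an assumption-laden illustration rather than course material: the inline CSV text, the column names, and the helper functions `load_rows` and `clean_rows` are all hypothetical, and a dedicated library such as pandas would usually handle these steps in practice.

```python
import csv
import io

# Hypothetical raw data: a missing age, stray whitespace,
# a missing city, and a duplicated record.
RAW_CSV = """name,age,city
Alice,34,Rome
Bob,,Milan
 Carla ,29,
Alice,34,Rome
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Strip whitespace, convert ages to int, mark gaps as None, drop duplicates."""
    cleaned = []
    seen = set()
    for row in rows:
        name = row["name"].strip()
        city = row["city"].strip() or None
        age = int(row["age"]) if row["age"].strip() else None
        key = (name, age, city)
        if key in seen:
            continue  # skip exact duplicates of an earlier record
        seen.add(key)
        cleaned.append({"name": name, "age": age, "city": city})
    return cleaned


rows = load_rows(RAW_CSV)
for record in clean_rows(rows):
    print(record)
```

Representing missing values explicitly as `None` (instead of empty strings) makes later checks and aggregations less error-prone.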

Week 5

Python: ● Data Manipulation with pandas. ● Array and numerical operations with NumPy.

Continuous Assessment: Evaluation of first commits / delivered versions of the Python assignment. Comments and feedback by the instructor during the classes.

Week 6

Python: ● Plotting and Visualization. ● Data Aggregation and Group Operations.

Week 7

Python: ● Building and optimizing pipelines in scikit-learn.

Continuous Assessment: Second round of evaluation of the incremental deliveries of the Python assignment. Comments and feedback by the instructor during the classes.

Week 8

Introduction to the use of version control systems (e.g., Git).

Week 9

R and introduction to Tidyverse: ● Data Visualization with ggplot2. ● Data Transformation with dplyr.

Continuous Assessment: Evaluation of first commits / delivered versions of the R assignment. Comments and feedback by the instructor during the classes.

Week 10

R and introduction to Tidyverse: ● Relational Data with dplyr.

Python: ● PySpark: How to use Apache Spark from Python.

Week 11

Python: PySpark exercises.

Continuous Assessment: Second round of evaluation of the incremental deliveries of the R assignment. Comments and feedback by the instructor during the classes.

Week 12

Presentation of the delivered projects to the instructor and to the class.