DATA AND ARTIFICIAL INTELLIGENCE LABS

Stefano Guarino

Instructional goals

The aim of this course is to introduce students to the basic principles of data analysis and machine learning through examples and real-world case studies. Acquainting students with these topics is part of the overarching goal of training learners with a competitive profile and valuable skills for the ever-changing labour market.

Prerequisites

No previous knowledge or IT/engineering skills are required, but some basic understanding of statistics and programming may be of help.

Intended learning outcomes

By the end of the course, students will be familiar with the main concepts of data analysis and will understand the importance of using suitable algorithms to extract trends and patterns from data by combining techniques from data mining, predictive modeling, and machine learning. The course will teach students to take a data-driven approach to problem-solving and decision-making, fostering their critical thinking and their ability to work alone or in a group.

Course Contents

Principles of Python programming and scientific programming. Data ingestion, cleaning and preprocessing. Statistical distribution visualization and fitting. Correlation analysis, regression models and prediction. Customer segmentation and the RFM model. Recommendation systems. Unsupervised and supervised learning and classification.

Reference Books

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York: Springer.

VanderPlas, J. (2016). Python data science handbook: Essential tools for working with data. O'Reilly Media.

Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2019). Data mining for business analytics: Concepts, techniques and applications in Python. John Wiley & Sons.

Teaching Methods

All lectures will be highly interactive. The students will be given direct access to the datasets and to the code used to analyze them. They will be asked to discuss the problem at hand and to identify goals and possible approaches to attain them. As the teacher presents the techniques, the students will be asked to test them on the provided data and discuss the results.

Assessment Method

The last day of lectures will be used to test the students' understanding of the topic by involving them in a hands-on project to be carried out in small teams.

Thesis assignment criteria

Students showing a deep understanding of the topic will be eligible for a thesis on the application of data mining and machine learning to the analysis of communication campaigns on social media.

Week 1: Content of online and on-campus sessions

No teaching

Week 2: Content of online and on-campus sessions

No teaching

Week 3: Content of online and on-campus sessions

No teaching

Week 4: Content of online and on-campus sessions

Introduction to scientific programming in Python (4 hours). We will learn how to use the Python interpreter and IPython notebooks. We will introduce the basic concepts of Python programming: variables and assignments; operators; input/output; data types; functions; control flow (if-elif-else statements, the while statement, the for loop); data structures (strings, lists, tuples, dictionaries); installing packages and importing modules and functions. We will review the main Python libraries for scientific programming: NumPy, SciPy, Pandas, Scikit-learn and Matplotlib.
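
As an illustration of the basic constructs listed above, the following minimal sketch combines variables, a function, an if-elif-else chain, a for loop, tuples and a dictionary. The data and the names `purchases`, `label_price` and `summarize` are hypothetical and are not taken from the course material.

```python
# Hypothetical sample data: (item, unit price, quantity) tuples in a list.
purchases = [("bread", 1.50, 2), ("milk", 0.90, 3), ("wine", 12.00, 1)]


def label_price(price):
    """Classify a unit price with an if-elif-else chain."""
    if price < 1.0:
        return "cheap"
    elif price < 10.0:
        return "medium"
    else:
        return "expensive"


def summarize(rows):
    """Return the total spend and a dictionary mapping each item to its label."""
    total = 0.0
    labels = {}
    for item, price, quantity in rows:  # tuple unpacking inside a for loop
        total += price * quantity
        labels[item] = label_price(price)
    return total, labels


total, labels = summarize(purchases)
print(f"Total spend: {total:.2f}")
for item, label in labels.items():
    print(item, "->", label)
```

The thresholds in `label_price` are arbitrary and only serve to show how the branches are selected.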

Week 5: Content of online and on-campus sessions

Analysis of data from a retail store (4 hours). We will consider a small dataset of purchases made in a retail store over one year. We will introduce the principles of exploratory data analysis and visualization, such as data preprocessing, cleaning, filtering and rescaling. We will focus on univariate and bivariate descriptive statistics, studying probability distributions and their fitting, either parametric or non-parametric. We will try to understand the purchasing patterns of customers in order to optimize opening hours and resource allocation in the store.
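
The kind of analysis described above can be sketched with the Python standard library alone. The example below counts purchases per hour (a univariate summary relevant to opening hours), fits a normal distribution to the amounts spent (a parametric fit), and computes the Pearson correlation between hour and amount (a bivariate statistic). The records are invented, and the choice of a normal distribution is an assumption made only for illustration; the course may use other data, distributions and libraries.

```python
from collections import Counter
from statistics import NormalDist, mean

# Hypothetical purchase records: (hour of day, amount spent).
purchases = [(9, 12.0), (10, 8.5), (10, 15.0), (12, 30.0),
             (12, 22.5), (12, 18.0), (17, 9.0), (18, 25.0)]

hours = [hour for hour, _ in purchases]
amounts = [amount for _, amount in purchases]

# Univariate: number of purchases in each hour, and the busiest hour.
per_hour = Counter(hours)
busiest_hour, busiest_count = per_hour.most_common(1)[0]

# Parametric fit: estimate the mean and standard deviation of a normal
# distribution from the observed amounts (assumed model, for illustration).
fitted = NormalDist.from_samples(amounts)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


r = pearson(hours, amounts)

print("Busiest hour:", busiest_hour, "with", busiest_count, "purchases")
print(f"Fitted normal: mean={fitted.mean:.2f}, stdev={fitted.stdev:.2f}")
print(f"Correlation between hour and amount: {r:.3f}")
```

With so few points the fitted parameters and the correlation carry little statistical weight; the sketch only shows the steps.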

Week 6: Content of online and on-campus sessions

Analysis of data from an online retail store (4 hours). We will consider real and synthetic data related to transactions and customers for an online retail store. We will introduce techniques such as data standardization and rescaling, outlier detection, and unsupervised multivariate clustering. We will focus on customers, presenting a few methods for customer segmentation and for estimating customer value (e.g., through the RFM model).
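
A minimal sketch of the RFM idea mentioned above: for each customer we compute Recency (days since the last purchase), Frequency (number of transactions) and Monetary value (total spend), then standardize one column to z-scores. The transactions, the reference date and the function names `rfm_table` and `standardize` are hypothetical; a real analysis would typically rely on a data-frame library and on the course datasets.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev

# Hypothetical transactions: (customer, purchase date, amount).
transactions = [
    ("alice", date(2023, 1, 5), 40.0),
    ("alice", date(2023, 3, 20), 60.0),
    ("bob", date(2023, 2, 10), 15.0),
    ("carol", date(2023, 1, 15), 30.0),
    ("carol", date(2023, 3, 28), 200.0),
    ("carol", date(2023, 3, 30), 50.0),
]
reference_day = date(2023, 4, 1)  # assumed "today" for the recency computation


def rfm_table(rows, today):
    """Return {customer: (recency in days, frequency, monetary value)}."""
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in rows:
        if customer not in last_purchase or day > last_purchase[customer]:
            last_purchase[customer] = day
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        customer: ((today - last_purchase[customer]).days,
                   frequency[customer],
                   monetary[customer])
        for customer in last_purchase
    }


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


rfm = rfm_table(transactions, reference_day)
monetary_z = standardize([m for _, _, m in rfm.values()])

for customer, (recency, freq, money) in rfm.items():
    print(customer, recency, freq, money)
```

Standardized columns such as `monetary_z` are the usual input to a clustering step, since they put recency, frequency and monetary value on comparable scales.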

Week 7: Content of online and on-campus sessions

Design of a recommendation system from Amazon rating data (4 hours). We will consider real data related to Amazon purchases. We will introduce techniques to embed entities into a suitable vector space and measure their mutual similarity, also addressing dimensionality reduction. We will then illustrate different ways to define a recommendation system, starting from popularity-based recommendations and then moving to user-based and item-based collaborative filtering. Finally, we will review the main principles of association rule mining.
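
The core step of item-based collaborative filtering can be sketched as follows: each item is represented by the vector of ratings it received, and items are compared by cosine similarity. The ratings and the helper names `item_vectors`, `cosine` and `most_similar` are hypothetical; this is one simple way to measure similarity, not necessarily the one used in the lectures.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "u1": {"book": 5, "lamp": 3, "pen": 4},
    "u2": {"book": 4, "pen": 5},
    "u3": {"lamp": 5, "mug": 4},
    "u4": {"book": 5, "pen": 4, "mug": 1},
}


def item_vectors(user_ratings):
    """Map each item to a sparse {user: rating} vector."""
    vectors = {}
    for user, items in user_ratings.items():
        for item, score in items.items():
            vectors.setdefault(item, {})[user] = score
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; missing entries count as 0."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(item, vectors):
    """Return the other item whose rating vector is closest to `item`."""
    scored = [(cosine(vectors[item], vec), name)
              for name, vec in vectors.items() if name != item]
    return max(scored)[1]


vectors = item_vectors(ratings)
similar_to_book = most_similar("book", vectors)
print("Item most similar to 'book':", similar_to_book)
```

Treating missing ratings as zeros is a simplifying assumption; it biases similarity toward items rated by the same users, which is often acceptable for a first baseline.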

Week 8: Content of online and on-campus sessions

Wine quality prediction with machine learning (4 hours). We will consider a dataset of wines described by a number of parameters, such as acidity, alcohol and residual sugar, and classified based on their quality, as judged by wine experts. We will start by cleaning and visualizing the data, and then evaluate the correlation between different covariates and perform feature selection. We will then introduce the basics of supervised learning: how to split the data into training and test sets, how to perform k-fold cross-validation and tune the parameters of a model, and how to evaluate the performance of a classification algorithm. Finally, we will test several families of regression and classification models on our dataset, and see how to select the one giving the most accurate prediction.
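
To make the mechanics of k-fold cross-validation concrete, here is a minimal sketch that uses only the standard library. It splits a tiny invented dataset into k folds, trains a 1-nearest-neighbour classifier on the remaining folds, and reports the accuracy on each held-out fold. The feature values, the labels and the function names are hypothetical, and the classifier is chosen only because it is short to write; in practice a library such as Scikit-learn would normally handle both the splitting and the models.

```python
from math import dist  # Euclidean distance, available from Python 3.8

# Hypothetical samples: ((alcohol, volatile acidity), quality label).
wines = [
    ((12.8, 0.30), "good"), ((9.4, 0.70), "bad"),
    ((13.1, 0.28), "good"), ((9.8, 0.66), "bad"),
    ((12.5, 0.35), "good"), ((9.5, 0.72), "bad"),
    ((13.0, 0.31), "good"), ((10.0, 0.60), "bad"),
]


def k_fold_indices(n, k):
    """Yield (train indices, test indices); sample i belongs to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def predict_1nn(train_rows, features):
    """Return the label of the training sample closest to `features`."""
    _, label = min(train_rows, key=lambda row: dist(row[0], features))
    return label


def cross_validate(rows, k):
    """Return the accuracy obtained on each of the k held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        train_rows = [rows[i] for i in train_idx]
        correct = sum(predict_1nn(train_rows, rows[i][0]) == rows[i][1]
                      for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


scores = cross_validate(wines, k=4)
print("Accuracy per fold:", scores)
print("Mean accuracy:", sum(scores) / len(scores))
```

The folds here are assigned deterministically and the features are not rescaled, so the alcohol column dominates the distance; shuffling the data and standardizing the features first are the usual refinements.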

Week 9: Content of online and on-campus sessions

Final review and practice (4 hours). We will review the main concepts seen in the previous lectures. The students will be provided with additional datasets and will be asked to work together to analyze them.

Week 10: Content of online and on-campus sessions

No teaching

Week 11: Content of online and on-campus sessions

No teaching

Week 12: Content of online and on-campus sessions

No teaching