LABS: CODING IN ACTION (MODULE II)

Stefano Guarino

Instructional goals

This course builds on the work done during the first module of the Coding in Action Lab. In the first phase, the course introduces the concepts of computational cost and efficiency of a program, illustrates some best practices for writing readable, organized, and efficient code, and presents the main tools for data processing, analysis, and visualization available in Python, one of the reference programming languages in the "data science" world. In the second phase, the course focuses on the main data processing and analysis techniques: data acquisition, cleaning and preprocessing; statistical analysis and data visualization; clustering and classification; and supervised and unsupervised learning. To this end, students will tackle concrete problems using real data: customer segmentation, defining a recommendation system, and designing a product quality predictive system.

Intended learning outcomes

Knowledge and understanding: By the end of this course, students will understand that a good program solves the given task using as few resources as possible. They will learn the Python syntax and best practices needed to write code faster and to improve its performance. Students will also learn how to use, at an introductory level, the best-known Python packages for data analysis and visualization: NumPy, Pandas, Scikit-learn, and Matplotlib. They will become familiar with the main concepts of data analysis and will understand the importance of using suitable algorithms to extract trends and patterns from data by combining techniques from data mining, predictive modeling, and machine learning. The course will teach students to take a data-driven approach to problem-solving and decision-making, fostering their critical thinking and their ability to work alone or in a group.

Applying knowledge and understanding: To test their understanding of the concepts seen in class, students will be assigned projects that require them to work with real data. They will be asked to:
• Rapidly manipulate and process large data sets using dataframes and arrays
• Plot data and functions
• Perform basic descriptive analysis and modeling (statistics, histograms, interpolation, clustering, fitting)
The course will prepare students for more advanced data analysis courses and for projects in other courses that require a computational approach to data.

Making judgements: Upon completing the study program, students will be able to:
• Compare simple algorithms that solve the same task in terms of their computational cost
• Carry out simple data analysis projects

Communication skills: Having been introduced to the concepts of computational and memory efficiency and to their formalization, students will understand that the "cost" of solving a problem can be precisely quantified and expressed. Through examples, case studies, and projects, students will learn how to communicate the results of a data analysis task and how to justify the choice of specific algorithms, methods, and techniques.

Learning skills: Students will be introduced to a set of advanced, professional libraries for data analysis. At the end of the course, they will be able to browse the Python standard library and the web on their own to find the right libraries and tools for a given task.
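As an illustration of what comparing two algorithms by computational cost can look like, here is a minimal sketch that is not taken from the course material. Both functions (hypothetical names) answer the same question, whether a list contains a duplicate, but the first compares every pair of elements while the second makes a single pass using a set.

```python
def has_duplicates_quadratic(items):
    """Compare every pair: about n*(n-1)/2 comparisons in the worst case, O(n^2)."""
    n = len(items)
    for i in range(n):
        for j in range(i + 1, n):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_linear(items):
    """Remember the elements seen so far in a set: one pass, O(n) on average.

    Assumes the elements are hashable (numbers, strings, tuples, ...).
    """
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


# Worst case for both functions: no duplicates, so nothing stops the search early.
data = list(range(1000))
assert has_duplicates_quadratic(data) == has_duplicates_linear(data)
```

Timing both functions on growing inputs, for example with the standard `timeit` module, is one way to observe the difference in growth rate empirically.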

Course Contents

The course will cover the following aspects of computer programming:
• Principles of computational complexity and the efficiency of an algorithm
• Pythonic programming: conditional expressions; EAFP (Easier to Ask for Forgiveness than Permission); list comprehensions; any and all; sets
• Libraries for scientific programming: NumPy, Pandas, SciPy, Scikit-learn, Matplotlib
• Data ingestion, cleaning, and preprocessing. Visualization of statistical distributions. Correlation analysis, regression models, and prediction. Customer segmentation and the RFM model. Recommendation systems. Unsupervised and supervised learning and classification.
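The "Pythonic programming" items listed above can be seen together in a few lines. The following is a minimal sketch with made-up records and a hypothetical helper `parse_age`; it is meant only to show the idioms, not to reproduce any course exercise.

```python
def parse_age(record):
    """EAFP: attempt the conversion and handle failure, instead of checking first."""
    try:
        return int(record["age"])
    except (KeyError, ValueError, TypeError):
        # Missing key, non-numeric text, or a value int() cannot handle
        return None


# Hypothetical raw records, some of them incomplete or malformed
records = [{"age": "31"}, {"age": "n/a"}, {}, {"age": "47"}]

# List comprehensions: build new lists in a single expression
ages = [parse_age(r) for r in records]
valid_ages = [a for a in ages if a is not None]

# Conditional expression: choose a value inline
label = "complete" if len(valid_ages) == len(records) else "incomplete"

# any / all: test a condition over an iterable
has_senior = any(a >= 45 for a in valid_ages)
all_adults = all(a >= 18 for a in valid_ages)

# Set: distinct values with fast membership tests
distinct_ages = set(valid_ages)
```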

Reference Books

Downey, A. B. Think Python: How to Think Like a Computer Scientist (2nd ed.). O'Reilly Media. ISBN-13: 978-1491939369.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.
VanderPlas, J. (2016). Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media.
Shmueli, G., Bruce, P. C., Gedeck, P., & Patel, N. R. (2019). Data Mining for Business Analytics: Concepts, Techniques and Applications in Python. John Wiley & Sons.

Teaching Methods

Face-to-face teaching will be used to introduce students to the new topics. For each topic, a number of examples and case studies will then be considered for students to tackle independently, either alone or in groups, in line with the "learning by doing" paradigm. Student participation in class is strongly encouraged.

Assessment Method

Assessment for this course will be based on a combination of programming projects, to be completed during the weeks of the course and discussed with the professor (70%), and a final exam (30%). The final exam will consist of a multiple-choice quiz. All programming projects will be graded as follows:
- If the code provides the correct answer using the required approach: 100%
- If it generally uses the correct approach, but the answer is incorrect due to some minor errors: 70-90%
- If it uses the basic concepts correctly and makes some progress towards solving the problem: 40-60%
- If the answer shows a lack of understanding of some or all of the fundamental concepts: 0-40%
For non-attending students, the projects will count for 50% of the grade and an additional oral examination will account for the remaining 20%.
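The weighting above amounts to a simple weighted average. The sketch below is illustrative only: the function and dictionary names are hypothetical, scores are assumed to be on a 0-100 scale, and the 30% final-exam weight for non-attending students is an assumption inferred from the stated 50% and 20% shares.

```python
# Weights in percent; integers keep the arithmetic exact.
ATTENDING_WEIGHTS = {"projects": 70, "final_exam": 30}
# Assumption: the final exam keeps its 30% weight for non-attending students.
NON_ATTENDING_WEIGHTS = {"projects": 50, "oral_exam": 20, "final_exam": 30}


def final_grade(scores, weights):
    """Weighted average of component scores (each on a 0-100 scale)."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(weights[name] * scores[name] for name in weights) / 100


attending = final_grade({"projects": 80, "final_exam": 90}, ATTENDING_WEIGHTS)
non_attending = final_grade(
    {"projects": 80, "oral_exam": 70, "final_exam": 90}, NON_ATTENDING_WEIGHTS
)
```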

Thesis assignment criteria

No thesis will be assigned

Week 1: Content of online and on-campus sessions

4 hours of lecture + 1-2 days of independent group work:
- principles of computational complexity; effectiveness vs. efficiency of a program; examples of algorithms that solve the same problem with different complexity
- conditional expressions; EAFP (Easier to Ask for Forgiveness than Permission) and try-except statements; list comprehensions; the any and all built-in functions; sets
- introduction to the libraries NumPy, Pandas, SciPy, Scikit-learn, and Matplotlib
- how to import a dataset; finding and removing missing, duplicate, and redundant data; checking the internal consistency of a dataset

4 hours of lecture + 1-2 days of independent group work:
- grouping and sorting data by the value of selected features
- statistical distributions, mean, median, quartiles, skewness, rescaling, standardization, outliers
- computing and visualizing histograms and feature distributions, possibly grouped by other attributes (e.g., temporal ones)
- correlation between attributes: correlation measures and the correlation matrix; scatterplot and linear fit
- embedding a non-numeric dataset in a vector space; spatial distance as a measure of similarity; cosine similarity
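Two of the topics above, standardization and cosine similarity, reduce to short formulas. The following is a minimal NumPy sketch on made-up data, with hypothetical function names, and it assumes the data has already been embedded as numeric vectors.

```python
import numpy as np


def standardize(X):
    """Column-wise z-scores: subtract each column's mean, divide by its standard deviation.

    Uses NumPy's default population standard deviation (ddof=0);
    pandas' DataFrame.std defaults to ddof=1, so results differ slightly.
    Assumes no column is constant (its standard deviation would be zero).
    """
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors: u.v / (|u| |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical toy data: rows are customers, columns are (visits, spend)
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
Z = standardize(X)  # each column of Z now has mean 0 and standard deviation 1

# Vectors pointing in the same direction have similarity 1, orthogonal ones 0
same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Cosine similarity ignores vector length, which is why it is often preferred over Euclidean distance when only the relative proportions of the features matter.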

Week 2: Content of online and on-campus sessions

4 hours of lecture + 1-2 days of independent group work:
- spatial clustering, k-means, elbow method, silhouette method
- the dimensionality problem and sparsity of data; dimensionality reduction and singular value decomposition
- segmentation of customers based on the RFM (Recency - Frequency - Monetary value) model: RFM by spatial clustering and relative labeling of segments
- defining a recommendation system: popularity-based recommendation; user-based systems; item-based systems

4 hours of lecture + 1-2 days of independent group work:
- introduction to machine learning: supervised and unsupervised learning; regression vs. classification
- regression: cost function; decision trees, ensembles, bootstrapping, and random forests; cross-validation, overfitting, and underfitting
- classification: binary and multi-label classification; measuring the quality of a classification: precision, recall, F1-score, confusion matrix, ROC curve, and AUC; classification by decision tree and random forest
- application to predicting the quality/customer satisfaction of a wine
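To make the classification metrics listed above concrete, here is a minimal pure-Python sketch for the binary case, using made-up labels and hypothetical function names. In practice, Scikit-learn provides equivalent functions in `sklearn.metrics` (for example `precision_score`, `recall_score`, and `f1_score`).

```python
def binary_confusion(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean.

    Returns 0.0 for a metric whose denominator is zero (a convention chosen here).
    """
    tp, fp, fn, _ = binary_confusion(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = "good wine", 0 = "not good"
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tp, fp, fn, tn = binary_confusion(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Precision answers "of the items predicted positive, how many were right?", while recall answers "of the truly positive items, how many were found?"; F1 summarizes the two in a single number.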

Week 3: Content of online and on-campus sessions

As per the curriculum, this is a 2-week intensive course taught in the gap between the summer exam session and the summer break.
