PYTHON AND R FOR DATA SCIENCE (LAB)

Marco Querini

Instructional goals

The course aims to provide technical coding skills for data analysis. The Python programming language and the R environment are presented with a specific focus on the libraries, modules and functions that allow students to manage data effectively. This provides an in-depth understanding of the approaches used to preprocess, clean, visualize and analyze data from a wide range of contexts. Students will mainly acquire the practical skills needed to analyze real data.

Intended learning outcomes

Knowledge and understanding: The course offers key techniques for managing and analyzing data programmatically in order to extract useful insights. It provides a good understanding of the fundamental issues in data analysis, along with knowledge of the programming libraries needed to analyze data effectively.

Applying knowledge and understanding: On successful completion of this course, students will be able to:
● Organize, visualize, and analyze large, complex datasets using the Python and R programming languages.
● Extract knowledge from data.
● Make effective use of data science tools and techniques.

Making judgements: Students are expected to analyze complex datasets. Working both independently and in small groups, students will be required to complete a project work (three students per group). The project work will allow students to make their own judgements about the most appropriate techniques to apply in a given use case. This will, in turn, allow them to critically assess the advantages and disadvantages of the different approaches.

Communication skills: The course will enable students to understand the terms and concepts of data science, along with the main concepts, libraries and abstractions of R and Python programming in this context. Communication within small groups (that is, three students) will play a key role in the development of the course project work. Small groups are important communication units in academic, professional, civic, and personal contexts; students will learn to communicate their ideas and analyses effectively and to write appropriate technical reports on the data analyses they carry out.

Learning skills: This course will enable students to enhance their technical skills in order to manage data effectively using the Python and R programming languages. Strong emphasis will be placed on the practical application of data science libraries to real use cases.

Course Contents

The course will cover the following topics:
● Python and R programming languages (variables, conditional expressions, loops, functions).
● Data Loading and File Formats.
● Data Cleaning.
● Data Manipulation with Pandas.
● Plotting and Visualization.
● Building and optimizing pipelines in scikit-learn.
● Data Visualization with ggplot2.
● Data Transformation with dplyr.
● Relational Data with dplyr.
● PySpark: How to use Apache Spark from Python.

Reference Books

Lecture notes, research papers and course material will be made available on the e-learning platform. Recommended reading:
● “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython”, 2nd Edition, by Wes McKinney. Publisher: O'Reilly Media, Inc. Release Date: October 2017. ISBN: 9781491957660.
● “R for Data Science: Import, Tidy, Transform, Visualize, and Model Data” by Garrett Grolemund and Hadley Wickham. Release Date: December 2016. ISBN: 9781491910399.

Teaching Methods

The course consists of practical lab sessions and group project work, complemented by lectures that explain the underlying concepts required for the laboratory sessions.

Assessment Method

The final evaluation is a pass or fail grade. To pass, students must obtain a weighted arithmetic mean equal to or greater than 21 points out of 30; if this condition is met, a PASS grade is registered. The components are weighted as follows: the final written exam is worth 20% of the final evaluation and consists of questions on programming code, programming abstractions and libraries (Python and R); the tests taken during the course are worth 40%; the project work is worth the remaining 40%. The project work must be carried out by groups of exactly 3 students, who are required to deliver updates continuously during the course timeframe. Three weeks after the start of the course, the instructor will define several project options, and students can choose the project they are most interested in. The project work consists of 2 assignments, which require writing code in both the Python and R programming languages. Bonus points are granted to students who deliver code during the course timeframe.
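
The pass condition described above can be sketched in Python as follows. This is only an illustration under stated assumptions: each component is assumed to be scored out of 30, bonus points are not modelled because the syllabus does not say how they are applied, and the names WEIGHTS_PERCENT, weighted_mean and final_grade are hypothetical rather than part of any official grading tool.

```python
# Weights (in percent) taken from the assessment description.
WEIGHTS_PERCENT = {"written_exam": 20, "course_tests": 40, "project_work": 40}
PASS_THRESHOLD = 21  # minimum weighted mean, in points out of 30


def weighted_mean(scores):
    """Return the weighted arithmetic mean of the component scores.

    `scores` maps each component name to a score out of 30 (an assumption).
    """
    missing = set(WEIGHTS_PERCENT) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    total = sum(WEIGHTS_PERCENT[name] * scores[name] for name in WEIGHTS_PERCENT)
    return total / 100


def final_grade(scores):
    """Return 'PASS' if the weighted mean reaches the threshold, else 'FAIL'."""
    return "PASS" if weighted_mean(scores) >= PASS_THRESHOLD else "FAIL"


# Hypothetical student: weaker written exam, stronger tests and project.
example = {"written_exam": 18, "course_tests": 24, "project_work": 22}
print(weighted_mean(example), final_grade(example))
```

Because the written exam carries only 20% of the weight, a score below 21 on it can be offset by the course tests and the project work, as in the hypothetical example.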

Thesis assignment criteria

A thesis will be assigned (upon specific request to the instructor) to students who demonstrate a serious and motivated interest in the course subjects.

Does the syllabus cover sustainability topics?

Sustainability topics are not dealt with.

Week 1 Content of online and on-campus sessions

Introduction to the course. Python and R programming basics.

Week 2 Content of online and on-campus sessions

Python programming basics: Conditional statements, loops, functions. Integrated Development Environments (IDEs). How to structure Python code in a project. How to manage libraries in Python using virtual environments.

Week 3 Content of online and on-campus sessions

R programming basics: Conditional statements and loops. Integrated Development Environments (IDEs). How to structure R code in a project. How to manage the installation of libraries in the R programming language. Definition of the project work to be carried out by students. The instructor publishes a list of project definitions, and each group chooses the project it will carry out.

Week 4 Content of online and on-campus sessions

R and Python: ●Data Loading and File Formats. ●Data Cleaning and Preparation.
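
As a flavour of the loading and cleaning tasks listed for this week, here is a minimal sketch that uses only Python's standard library. It is an assumption-laden illustration, not the course's own material: the lab sessions may well use pandas or R for the same steps, and the sample data, the column names and the load_and_clean helper are all hypothetical.

```python
import csv
import io

# Hypothetical raw CSV with stray whitespace, missing values and a duplicate row.
RAW = """name,age,city
Alice, 34 ,Rome
Bob,,Milan
Alice, 34 ,Rome
Carla,n/a,Turin
"""


def load_and_clean(text):
    """Parse CSV text, strip whitespace, convert ages, and drop duplicate rows."""
    rows = []
    seen = set()
    for record in csv.DictReader(io.StringIO(text)):
        cleaned = {key: value.strip() for key, value in record.items()}
        # Non-numeric or empty ages are treated as missing (None).
        age = cleaned["age"]
        cleaned["age"] = int(age) if age.isdigit() else None
        fingerprint = tuple(cleaned.items())
        if fingerprint in seen:
            continue  # skip exact duplicates of an earlier row
        seen.add(fingerprint)
        rows.append(cleaned)
    return rows


for row in load_and_clean(RAW):
    print(row)
```

The same three steps (normalizing text, marking missing values, removing duplicates) recur in most cleaning workflows regardless of the library used.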

Week 5 Content of online and on-campus sessions

Python: ●Data Manipulation with Pandas. Continuous Assessment: Evaluation of first commits / delivered versions of the Python assignment. Comments and feedback by the instructor during the classes.

Week 6 Content of online and on-campus sessions

Python: ●Plotting and Visualization. ●Data Aggregation and Group Operations.

Week 7 Content of online and on-campus sessions

Python: ●Building and optimizing pipelines in scikit-learn. Continuous Assessment: Second round of evaluation of the incremental deliveries of the Python assignment. Comments and feedback by the instructor during the classes.

Week 8 Content of online and on-campus sessions

Introduction to the use of version control systems (e.g., Git).

Week 9 Content of online and on-campus sessions

R: ●Data Visualization with ggplot2. ●Data Transformation with dplyr. ●Relational Data with dplyr. Continuous Assessment: Evaluation of first commits / delivered versions of the R assignment. Comments and feedback by the instructor during the classes.

Week 10 Content of online and on-campus sessions

Python: ●PySpark library: How to use Spark from Python.

Week 11 Content of online and on-campus sessions

Python: PySpark exercises. Continuous Assessment: Second round of evaluation of the incremental deliveries of the R assignment. Comments and feedback by the instructor during the classes.

Week 12 Content of online and on-campus sessions

●Course Recap. ●Optional presentation of the delivered projects to the instructor and the class.