PYTHON AND R FOR DATA SCIENCE (LAB)
Instructional goals
The course aims to provide technical skills in the coding aspects of data analysis. It presents the Python programming language and the R environment, focusing on the libraries, modules, and functions that allow students to manage data effectively. The course provides an in-depth understanding of approaches to preprocessing, cleaning, visualizing, and analyzing data from various contexts. Students will mainly acquire the practical skills necessary to analyze real data.
Intended learning outcomes
Knowledge and understanding:
The course will offer key techniques to manage and analyze data programmatically in order to extract useful insights. It will provide a good understanding of the fundamental issues in data analysis, along with knowledge of the programming libraries needed to analyze data effectively.
Applying knowledge and understanding:
On successful completion of this course, students will be able to:
- Organize, visualize, and analyze large, complex datasets using the Python and R programming languages.
- Extract knowledge from data.
- Make use of data science tools and techniques effectively.
Making judgements:
Students are expected to analyze complex datasets. Working both independently and in small groups, students will be required to complete a project. The project will allow students to make their own judgments about the most appropriate techniques to apply in a given use case. This will, in turn, allow them to critically assess the advantages and disadvantages of the different approaches.
Communication skills:
The course will enable students to understand the terms and concepts of data science, along with the main concepts, libraries, and abstractions of R and Python programming in this context. Communication within small groups will play a key role in the development of the course project. Small groups are important communication units in academic, professional, civic, and personal contexts. Students will learn to communicate their ideas and analyses effectively and to write appropriate technical reports on the data analyses they carry out.
Learning skills:
This course will enable students to strengthen the technical skills needed to manage data effectively using the Python and R programming languages. Strong emphasis will be placed on the practical application of data science libraries to real use cases.
Course Contents
The course will cover the following topics:
- Python and R Programming Languages
- Data Loading and Main File Formats
- Data Cleaning
- Data Manipulation and Transformation
- Data Visualization
Different frameworks, libraries, modules, and packages will be presented, including: numpy, pandas, matplotlib, seaborn, scikit-learn, ggplot2, and dplyr.
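As a rough illustration of how the first topics above (data loading, cleaning, and manipulation) fit together, the following minimal sketch uses pandas. The dataset, the column names, and the helper functions load_and_clean and summarize are hypothetical and serve only as an example; they are not part of the course material.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a real data file.
RAW_CSV = """city,year,temperature_c
Rome,2021,16.1
Rome,2022,
Milan,2021,13.4
Milan,2022,14.0
Milan,2022,14.0
"""


def load_and_clean(csv_text: str) -> pd.DataFrame:
    """Load CSV text, drop exact duplicate rows, and fill missing values."""
    # Data loading: parse the CSV text into a DataFrame.
    df = pd.read_csv(io.StringIO(csv_text))
    # Data cleaning: remove rows that are exact duplicates.
    df = df.drop_duplicates().reset_index(drop=True)
    # Fill each missing temperature with the mean of its own city.
    city_mean = df.groupby("city")["temperature_c"].transform("mean")
    df["temperature_c"] = df["temperature_c"].fillna(city_mean)
    return df


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Data manipulation: compute the mean temperature per city."""
    return (
        df.groupby("city", as_index=False)["temperature_c"]
        .mean()
        .sort_values("city")
        .reset_index(drop=True)
    )


clean = load_and_clean(RAW_CSV)
summary = summarize(clean)
print(summary)
```

Filling missing values with a per-group mean is only one possible cleaning choice; depending on the dataset, dropping incomplete rows or using another imputation strategy may be more appropriate.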
Reference Books
Lecture notes, research papers and course material will be available on the e-learning platform.
Recommended reading:
- “Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython”, 2nd Edition, by Wes McKinney. Publisher: O'Reilly Media, Inc. Release Date: October 2017. ISBN: 9781491957660.
- “R for Data Science: Import, Tidy, Transform, Visualize, and Model Data” by Hadley Wickham and Garrett Grolemund. Publisher: O'Reilly Media, Inc. Release Date: December 2016. ISBN: 9781491910399.
Teaching Methods
The course consists of practical lab sessions and group project work, complemented by lectures that explain the underlying concepts required for the laboratory sessions.
Assessment Method
To pass the exam, students must obtain at least 80% of the maximum score in the following assessments:
1) midterm written exam (weeks 6-7)
2) final written exam (end of the course)
3) course project work
The midterm and final exams are written exams with questions and exercises on programming code, programming abstractions, and libraries (Python and R). The final written exam will include additional questions if the midterm exam is skipped or failed.
The project work will be carried out by groups of 3 students. Students will be required to deliver regular updates on the project work throughout the course. The project work will be defined by the teacher 4 weeks after the start of the course. There will be two phases of project evaluation during the semester.
Thesis assignment criteria
A thesis will be assigned (upon specific request to the instructor) to students who demonstrate a serious and motivated interest in the course topics.
Week 1
Python and R Language: basics (part I)
Week 2
Python and R Language: basics (part II)
Week 3
Python and R Language: basics (part III)
Week 4
Python and R language: Data Loading and File Formats
Week 5
Python and R language: Data Cleaning, Preparation, and Manipulation
Python Packages: NumPy and Pandas
Week 6
Python and R language: Data Visualization (part I)
Python Package: Matplotlib
Week 7
Python and R language: Data Visualization (part II)
Python Package: Seaborn
Week 8
Python and R language: advanced features (part I)
Week 9
Python and R language: advanced features (part II)
Week 10
Python and R language: step-by-step analysis of real-world datasets.
Week 11
Python and R language: step-by-step analysis of real-world datasets.
Week 12
Presentation of the delivered projects to the instructor and to the class.