BIG DATA AND SMART DATA ANALYTICS

Irene Finocchi

finocchi@luiss.it

Course code:

DSM08

General discipline (SSD):

SECS-S/01

Course year:

Semester:

First semester

Partition of students:

Unique

Credits:

Teaching language:

English

Total lesson hours:

Academic year:

2023/2024

Description

Extended program and reference teaching materials

Instructional goals

As datasets grow to petabyte scale, traditional analysis models and computation paradigms become obsolete. The course focuses on the fundamental algorithmic, statistical, and programming issues posed by big-data analytics, tackling major problems and techniques for extracting knowledge from massive amounts of data. By the end of the course, students will understand both the theory and the computational practice of modern methods for big data analytics, with particular emphasis on advanced statistical methods and algorithms for mining massive and noisy datasets as well as rapidly changing streams of data.

Intended learning outcomes

Knowledge and understanding: Upon successful completion of the course, students will be familiar with data mining problems and techniques, advanced statistical methods, and computational models and frameworks for analyzing and extracting insights from massive, possibly distributed or rapidly changing datasets.

Applying knowledge and understanding: After this course, students will be able to proficiently develop innovative big data solutions, based on sound statistical and algorithmic techniques, in different application domains. They will also be able to implement these solutions on top of industry-standard frameworks, e.g., Apache Spark™, in order to tackle real-world problems such as those typically faced by big tech companies.

Making judgements: Throughout the course, students will be invited to critically assess the strengths and weaknesses of the methods and tools presented in class. After this course, they will be able to analyze different solutions to big data problems and to demonstrate an in-depth, critical understanding of the scope and challenges of different data-driven analytics techniques.

Communication skills: This course will give students the opportunity to acquire and understand the major terms and concepts needed to communicate effectively their ideas, findings, proposals, analyses, and critical reasoning in the area of data-driven analytics. Special emphasis will be placed on oral presentations and pitches in group project work.

Learning skills: This course will give students the ability to learn cutting-edge design and analysis tools and to apply them to real-world data analytics problems. The method of study will enable students to break down complex problems arising in specific applications into manageable pieces and to apply different patterns in order to design rigorous and documentable solutions. Strong emphasis will be placed on directly applying the techniques and tools covered in this course to complex problems that are typical of today's data-driven industry.

Course Contents

Defining big data: the five V's. From big data to actionable insights: smart data. Data sources, statistical features of big data. Pattern discovery. Sampling and estimators: traditional approaches, data stream algorithmics. Predictive analytics, recommender systems. Analytics at scale in Apache Spark™: batch and streaming data analytics, SQL analytics.
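
To give a flavor of the "sampling" and "data stream algorithmics" topics listed above, the following is a minimal sketch of reservoir sampling, a classical technique for drawing a uniform random sample of fixed size from a stream whose length is not known in advance. It is only an illustration, not part of the official course material: the function name `reservoir_sample` and its parameters are hypothetical, and the course labs may well present the idea differently (for instance, on top of Apache Spark).

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from `stream`.

    The stream is read once and only k items are kept in memory.
    (Hypothetical helper written for illustration only.)
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a position in [0, i]; the new item replaces a stored one
            # with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: sample 5 values from a stream of one million integers.
sample = reservoir_sample(range(1_000_000), k=5, seed=42)
print(sample)
```

The design choice worth noting is that memory usage depends only on the sample size `k`, not on the length of the stream, which is the kind of constraint data stream algorithms are built around.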

Reference Books

Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman. Second edition. Apache Spark™ manual. Lecture notes, research papers, and other course material made available on the e-learning platform.

Teaching Methods

The course consists of traditional lectures complemented by hands-on lab sessions and industry testimonials, which will guide students in the use of good, industry-standard analytics practices.

Assessment Method

There will be three written intermediate tests (highly recommended) and a group software project with a final presentation. Students not taking the intermediate tests will be required to take a written exam in one of the standard sessions.

Thesis assignment criteria

The final work will be assigned (upon specific request to the instructor) to students who demonstrate a serious and motivated interest in the course topics.

Week 1 Content of online and on-campus sessions

Introduction. The five V's of big data. Smart data: from data to actionable insights. (On campus) Lab: hands-on Apache Spark™. (Online)

Week 2 Content of online and on-campus sessions

The new software stack for big data. Google's MapReduce and Apache Spark™. Lab: hands-on Apache Spark™. (Online)

Week 3 Content of online and on-campus sessions

Statistical data literacy: how (not) to go wrong with data. (On campus) Lab: hands-on Apache Spark™. (Online)

Week 4 Content of online and on-campus sessions

Finding similar items. (On campus) Lab: hands-on Apache Spark™. (Online)

Week 5 Content of online and on-campus sessions

Locality-sensitive hashing. (On campus) Lab: hands-on Apache Spark™. (Online)

Week 6 Content of online and on-campus sessions

Finding frequent itemsets. (On campus) Lab: hands-on Apache Spark™. (Online)

Week 7 Content of online and on-campus sessions

Mining social-network graphs. (On campus) Lab: hands-on Apache Spark™. (Online)

Week 8 Content of online and on-campus sessions

Mining social-network graphs. (On campus) Lab: hands-on Apache Spark™. (Online)

Week 9 Content of online and on-campus sessions

Recommendation systems. (On campus) Lab: hands-on Apache Spark™. (Online)

Week 10 Content of online and on-campus sessions

Mining data streams. (On campus) Lab: hands-on Apache Spark™. (Online)

Week 11 Content of online and on-campus sessions

Mining data streams. (On campus) Lab: hands-on Apache Spark™. (Online)

Week 12 Content of online and on-campus sessions

Recap and Q&A session. (On campus) Lab: hands-on Apache Spark™. (Online)