BIG DATA AND SMART DATA ANALYTICS
Learning objectives
As datasets grow to petabyte scale, traditional analysis models and computation paradigms become obsolete. The course focuses on the fundamental algorithmic, statistical, and programming issues posed by big-data analytics, tackling major problems and techniques for extracting knowledge from massive amounts of data. By the end of the course, students will understand the theory and computational practice of modern methods for big data analytics, with particular emphasis on advanced statistical methods and algorithms for mining massive and noisy datasets as well as rapidly changing streams of data.
Expected learning outcomes
Knowledge and understanding:
Upon successful completion of the course, students will be familiar with data mining problems and techniques, advanced statistical methods, and computational models and frameworks for analyzing and extracting insights from massive, possibly distributed or rapidly changing data.
Applying knowledge and understanding:
After this course, students will be able to develop innovative big data solutions, based on sound statistical and algorithmic techniques, in different application domains. They will also be able to implement the proposed solutions on top of industry-standard frameworks, e.g., Apache Spark™, in order to tackle real-world problems such as those typically faced by big tech companies.
Making judgements:
Throughout the course, students will be invited to critically assess the strengths and weaknesses of the methods and tools presented in class. After this course, they will be able to analyze different solutions to big data problems and to demonstrate an in-depth, critical understanding of the scope and challenges of different data-driven analytics techniques.
Communication skills:
This course will help students acquire and understand the major terms and concepts they need to communicate their ideas, findings, proposals, analysis, and critical reasoning effectively in the area of data-driven analytics. Special emphasis will be given to oral presentations and pitches in group project work.
Learning skills:
This course will provide students with the ability to learn cutting-edge design and analysis tools and to apply them to real-world data analytics problems. The method of study will enable students to break down complex problems arising in specific applications into manageable pieces and to apply different patterns in order to design rigorous and documentable solutions. Strong emphasis will be given to the direct application of the techniques and tools covered in this course to complex problems that are typical of today’s data-driven industry.
Course contents
Defining big data: the five V's. From big data to actionable insights: smart data.
Data sources, statistical features of big data.
Pattern discovery.
Sampling and estimators: traditional approaches, data stream algorithmics.
Predictive analytics, recommender systems.
Analytics at scale in Apache Spark™: batch and streaming data analytics, SQL analytics.
Reference texts
Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman. Second edition.
Apache Spark™ manual.
Lecture notes, research papers, and other course material made available on the e-learning platform.
Teaching methods
The course consists of traditional lectures complemented by hands-on lab sessions and talks by industry guests, which will guide students in the use of good, industry-standard analytics practices.
Assessment methods
There will be three written intermediate tests (highly recommended) and a group software project with a final presentation. Students not taking the intermediate tests will be required to take a written exam in one of the standard sessions.
Criteria for assigning the final thesis
A final thesis will be assigned (upon specific request to the instructor) to students who demonstrate a serious and motivated interest in the course topics.
Week 1
Introduction. The five V's of big data. Smart data: from data to actionable insights. (On campus)
Lab: hands-on Apache Spark™. (Online)
Week 2
The new software stack for big data. Google's MapReduce and Apache Spark™.
Lab: hands-on Apache Spark™. (Online)
Week 3
Statistical data literacy: how (not) to go wrong with data. (On campus)
Lab: hands-on Apache Spark™. (Online)
Week 4
Finding similar items. (On campus)
Lab: hands-on Apache Spark™. (Online)
Week 5
Locality-sensitive hashing. (On campus)
Lab: hands-on Apache Spark™. (Online)
Week 6
Finding frequent itemsets. (On campus)
Lab: hands-on Apache Spark™. (Online)
Week 7
Mining social-network graphs. (On campus)
Lab: hands-on Apache Spark™. (Online)
Week 8
Mining social-network graphs. (On campus)
Lab: hands-on Apache Spark™. (Online)
Week 9
Recommendation systems. (On campus)
Lab: hands-on Apache Spark™. (Online)
Week 10
Mining data streams. (On campus)
Lab: hands-on Apache Spark™. (Online)
Week 11
Mining data streams. (On campus)
Lab: hands-on Apache Spark™. (Online)
Week 12
Recap and Q&A session. (On campus)
Lab: hands-on Apache Spark™. (Online)