BIG DATA AND SMART DATA ANALYTICS
Instructional goals
As datasets grow to petabyte scale, traditional analysis models and computation paradigms become obsolete. The course focuses on the fundamental algorithmic, statistical, and programming issues posed by big-data analytics, and covers the major problems and techniques involved in extracting knowledge from massive amounts of data. By the end of the course, students will understand the theory and computational practice of modern methods for big data analytics, with particular emphasis on advanced statistical methods and algorithms for mining massive and noisy datasets as well as rapidly changing streams of data.
Intended learning outcomes
Knowledge and understanding:
Upon successful completion of the course, students will be familiar with data mining problems and techniques, as well as with computational models and frameworks for analyzing and extracting insights from massive, possibly distributed or rapidly changing datasets.
Applying knowledge and understanding:
After this course, students will be able to develop efficient data analytics solutions. They will also be able to implement these solutions on top of industry-standard frameworks, e.g., Apache Spark, in order to tackle real-world problems such as those typically faced by big tech companies.
Making judgements:
Throughout the course, students will be invited to critically assess the strengths and weaknesses of the methods and tools presented in class. After this course, they will be able to analyze different solutions to big data problems and to demonstrate an in-depth, critical understanding of the scope and challenges of different data-driven analytics techniques.
Communication skills:
This course will give students the opportunity to learn the major terms and concepts of data-driven analytics, so that they can effectively communicate their ideas, findings, proposals, analyses, and critical reasoning in this area. Special emphasis will be placed on oral presentations and pitches in group project work.
Learning skills:
This course will give students the ability to learn cutting-edge design and analysis tools and to apply them to real-world data analytics problems. The method of study will enable students to break down complex problems arising in specific applications into manageable pieces and to apply different patterns in order to design rigorous and well-documented solutions. Strong emphasis will be placed on the direct application of the techniques and tools covered in this course to complex problems that are typical of today’s data-driven industry.
Course Contents
Defining big data. From data to insights. Data analytics workflow.
Data sources, statistical features of big data.
Pattern discovery.
Sampling and estimators: traditional approaches, data stream algorithmics.
Predictive analytics, recommender systems, massive network analysis.
Analytics at scale in Apache Spark: batch data analytics, SQL, streaming.
Reference Books
Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman. Second edition.
Apache Spark™ manual.
Lecture notes, research papers, and other course material made available on the e-learning platform.
Teaching Methods
The course consists of traditional lectures complemented by hands-on lab sessions and industry testimonials, which will guide students in the use of good, industry-standard analytics practices.
Assessment Method
There will be written intermediate tests (highly recommended) and a group software project with a final presentation. Students who do not take the intermediate tests will be required to take an oral exam in one of the standard sessions. Non-attending students will take an oral exam on the course contents (including Apache Spark) and write an essay on a topic selected with the professor.
Thesis assignment criteria
Theses are assigned, upon specific request to the professor, to students who demonstrate a serious and motivated interest in the course topics.
Week 1
Introduction. Big data characteristics. Data analytics workflow.
Lab: hands-on Apache Spark.
Week 2
Big data systems: hardware/software stack for big data. Software design principles (using Google's MapReduce as an example).
Lab: hands-on Apache Spark.
Project presentation, part 1.
Week 3
Finding similar items: shingling and minhashing.
Lab: hands-on Apache Spark.
Week 4
Finding similar items: locality-sensitive hashing.
Lab: hands-on Apache Spark.
Week 5
Finding frequent itemsets.
Lab: hands-on Apache Spark.
Week 6
Industry testimonial: data analytics in practice.
Lab: hands-on Apache Spark.
Week 7
Intermediate test (tentative). Project presentation, part 2.
Lab: hands-on Apache Spark.
Week 8
Mining social-network graphs.
Lab: hands-on Apache Spark.
Week 9
Mining social-network graphs.
Lab: hands-on Apache Spark.
Week 10
Mining data streams.
Lab: hands-on Apache Spark.
Week 11
Mining data streams.
Lab: hands-on Apache Spark.
Week 12
Final test. Theory recap, Q&A session.
Lab: project recap and Q&A.