Support Material

Theoretical classes will be supported by two MOOCs:

MOOC Técnico Data Science: kdd process
MOOC Técnico Data Science: classification

Students may enrol on MOOC Técnico free of charge. They should use their IST-ID, since the second MOOC is only available to students in this course. Besides the videos, these MOOCs include exercises to help prepare for the exam.
Students are expected to prepare for each class by watching the listed videos beforehand. Class time will be used to present the topics in more detail and to discuss a case study.

Lectures

| Lecture | Video | Topic | Duration | Bibliography | Resources | Extra |
| --- | --- | --- | --- | --- | --- | --- |
| T1 | | Introduction to Data Science | | | | |
| | kdd_T0_V1 | AI and Data Science | 6:16 | | slides | |
| | kdd_T1_V2 | Data Science | 5:47 | | | |
| | kdd_T1_V1 | Basic Concepts | 4:43 | Zaki: 1; Han: 2.1 | slides | |
| | kdd_T1_V2 | KDD process | 5:37 | | | |
| | kdd_T1_V3 | Evaluation | 3:11 | | | |
| T2 | | Data Profiling and Preparation | | | | |
| | kdd_T2_V1 | Data Profiling | 4:01 | Zaki: 2.1-2.3, 2.5, 3.1-3.4; Han: 2.1-2.3 | slides | |
| | kdd_T2_V2 | Distribution - part I and II | 5:19 and 7:05 | | | |
| | kdd_T2_V3 | Granularity | 3:02 | | | |
| | kdd_T2_V4 | Sparsity | 6:06 | | | |
| | kdd_T2_V5 | Dimensionality | 4:52 | | | |
| | kdd_T3_V2 | Missing values imputation | 7:03 | Han: 3.1, 3.2 | slides | |
| | kdd_T3_V4 | Dummification | 3:24 | | | |
| T3 | | Modeling and ML Tasks | | | | |
| | kdd_T4_V1 | Modeling | 4:35 | Han: 8.1 | slides | The five tribes by Pedro Domingos (60:00) |
| | class_T0_V1 | The notion of concept | 5:03 | | | |
| | class_T0_V2 | Classification task | 5:29 | | | |
| | class_T1_V1 | Evaluation metrics | 10:41 | Zaki: 22; Han: 8.5 | | |
| | class_T1_V2 | Training strategies | 6:18 | | | |
| | class_T3_V1 | Analogizers | 7:43 | Zaki: 18.3; Han: 9.5 | slides | |
| | kdd_T2_V6 | Similarity measures | 5:26 | Zaki: 3.4; Han: 2.4 | slides | |
| | kdd_T3_V5 | Scaling | 5:47 | Zaki: 2.4; Han: 3.5 | slides | |
| T4 | Zoom | Deloitte Presentation | | | | |
| T5 | | Bayesians | | | | |
| | class_T4_V1 | Bayesians and MAP classifier | 8:08 | Zaki: 18.1-18.2; Han: 8.3 | slides | |
| | class_T4_V2 | Naive Bayes | 6:41 | | | |
| | kdd_T3_V6 | Data balancing | 5:54 | | slides | |
| | class_T1_V3 | Performance estimation | 5:49 | Zaki: 22.2; Han: 8.5.5 | | |
| T6 | | Symbolists | | | | |
| | class_T2_V1 | Decision trees | 7:04 | Zaki: 19; Han: 8.2 | slides | |
| | class_T2_V2 | Algorithms | 7:35 | | | |
| | class_T2_V3 | Metrics | 4:09 | | | |
| | class_T2_V4 | Pruning | 4:27 | | | |
| | class_T1_V4 | Overfitting | 5:34 | Zaki: 22.3 | slides | |
| T7 | | Ensembles | | | | |
| | class_T7_V1 | Ensembles | 7:31 | Zaki: 22.4, 24; Han: 8.6 | slides | |
| | class_T7_V2 | Bagging | 5:08 | | | |
| | class_T7_V3 | Random Forests | 6:29 | | | |
| | kdd_T5_V1 | Feature Engineering | 3:05 | | slides | |
| | kdd_T5_V2 | Feature selection | 9:52 | Han: 3.4.4 | | |
| | kdd_T5_V4 | Feature generation | 5:08 | | | |
| T8 | | Connectionists and Boosting | | | | |
| | class_T5_V1 | Neural Networks and Perceptron | 5:37 | Zaki: 25; Han: 9.2 | slides | video |
| | class_T5_V2 | Gradient descent algorithm | 6:57 | | | video |
| | class_T5_V3 | Multi-layer perceptrons | 4:49 | | | video (PT) subtitles |
| | class_T3_V5 | Logistic regression | 4:56 | Zaki: 24 | | video |
| | class_T7_V4 | Boosting | 5:21 | Zaki: 22.4; Han: 8.6.3 | slides | |
| | class_T7_V5 | Gradient Boosting | 7:19 | | | GB example by Josh Starmer (17:02) |
| T9 | | Support Vector Machines | | | | |
| | class_T3_V3 | SVMs and Kernels | | Zaki: 21, 5; Han: 9.3 | slides | |
| T10 | | Clustering | | | | |
| | kdd_T4_V4 | Clustering - part I and II | 3:51 and 7:47 | Zaki: 17; Han: 10.1, 10.6 | slides | |
| | | Clustering algorithms | | Zaki: 13-15; Han: 11.2-11.4, 11.1.3 | slides | |
| | kdd_T5_V3 | Feature extraction | 8:53 | Zaki: 7; Han: 3.4 | slides | |
| T11 | | Pattern Mining and Anomaly Detection | | | | |
| | kdd_T4_V4 | Pattern Mining - part I and II | 6:29 and 7:28 | Zaki: 8, 12; Han: 6 | slides | |
| | | Apriori algorithm | | | | |
| | kdd_T3_V2 | Discretisation | 3:14 | Zaki: 3.5; Han: 3.5 | | |
| | kdd_T4_V6 | Anomaly Detection - part I and II | 5:20 and 5:46 | Han: 12 | slides | |
| | | LOF algorithm | | | | |
| T12 | | Time Series | | | | |
| | | Time series profiling | 10:35 | Mitsa: 2 | slides | video |
| | | Time series transformation | 8:13 | | slides | video |
| T13 | | Forecasting | | | | |
| | kdd_T4_V3 | Forecasting and Evaluation | 3:38 | Zaki: 27; Mitsa: 4; Shumway: 3 | slides | |
| | Zoom (pw="forecasting2022!") | Regression Models | | Zaki: 23 | slides | ARMA model by IBM (6:41); ARIMA and SARIMA models by IBM (12:21) |
| | | RNNs and LSTMs | | Zaki: 26.1-26.2 | slides | |
| T14 | | Closing Remarks | | | | |
| | Zoom (pw="forecasting2022!") | AutoML | | | slides | |
| | kdd_T6_V1 | Challenges | 4:36 | | | |
| | kdd_T6_V2 | Ethical concerns | 6:49 | | slides | |
| | | The Social Dilemma | 94:00 | | | The discrimination dilemma (3:36); The democracy dilemma (3:01); The mental health dilemma (3:10) |

Bibliography

Multivariate data topics (lessons 2 to 10) are covered by Zaki's book; Han's book covers essentially the same topics, but in less depth.

Time series are covered by Mitsa's and Shumway's books.

Texts about Ethical Concerns in Data Science

Past Exams

Attachments