Support Material
Theoretical classes will be supported by two MOOCs:
- MOOC Técnico Data Science: kdd process
- MOOC Técnico Data Science: classification
Students may enrol on MOOC Técnico free of charge using their IST-ID, which is required because the second MOOC is
available only to students in this course. Besides videos, these MOOCs provide exercises to help students
prepare for the exam.
Students are expected to prepare for each class by watching the listed videos beforehand. Classes
will be used to present the topics in more detail and to discuss a case study.
Lectures
| Lecture | Video | Topic | Duration | Bibliography | Resources | Extra |
|---|---|---|---|---|---|---|
| T1 | | Introduction to Data Science | | | | |
| | kdd_T0_V1 | AI and Data Science | 6:16 | | slides | |
| | kdd_T1_V2 | Data Science | 5:47 | | | |
| | kdd_T1_V1 | Basic Concepts | 4:43 | Zaki: 1; Han: 2.1 | slides | |
| | kdd_T1_V2 | KDD process | 5:37 | | | |
| | kdd_T1_V3 | Evaluation | 3:11 | | | |
| T2 | | Data Profiling and Preparation | | | | |
| | kdd_T2_V1 | Data Profiling | 4:01 | Zaki: 2.1-2.3, 2.5, 3.1-3.4; Han: 2.1-2.3 | slides | |
| | kdd_T2_V2 | Distribution - part I and II | 5:19 and 7:05 | | | |
| | kdd_T2_V3 | Granularity | 3:02 | | | |
| | kdd_T2_V4 | Sparsity | 6:06 | | | |
| | kdd_T2_V5 | Dimensionality | 4:52 | | | |
| | kdd_T3_V2 | Missing values imputation | 7:03 | Han: 3.1, 3.2 | slides | |
| | kdd_T3_V4 | Dummification | 3:24 | | | |
| T3 | | Modeling and ML Tasks | | | | |
| | kdd_T4_V1 | Modeling | 4:35 | Han: 8.1 | slides | The five tribes by Pedro Domingos (60:00) |
| | class_T0_V1 | The notion of concept | 5:03 | | | |
| | class_T0_V2 | Classification task | 5:29 | | | |
| | class_T1_V1 | Evaluation metrics | 10:41 | Zaki: 22; Han: 8.5 | | |
| | class_T1_V2 | Training strategies | 6:18 | | | |
| | class_T3_V1 | Analogizers | 7:43 | Zaki: 18.3; Han: 9.5 | slides | |
| | kdd_T2_V6 | Similarity measures | 5:26 | Zaki: 3.4; Han: 2.4 | slides | |
| | kdd_T3_V5 | Scaling | 5:47 | Zaki: 2.4; Han: 3.5 | slides | |
| T4 | Zoom | Deloitte Presentation | | | | |
| T5 | | Bayesians | | | | |
| | class_T4_V1 | Bayesians and MAP classifier | 8:08 | Zaki: 18.1-18.2; Han: 8.3 | slides | |
| | class_T4_V2 | Naive Bayes | 6:41 | | | |
| | kdd_T3_V6 | Data balancing | 5:54 | | slides | |
| | class_T1_V3 | Performance estimation | 5:49 | Zaki: 22.2; Han: 8.5.5 | | |
| T6 | | Symbolists | | | | |
| | class_T2_V1 | Decision trees | 7:04 | Zaki: 19; Han: 8.2 | slides | |
| | class_T2_V2 | Algorithms | 7:35 | | | |
| | class_T2_V3 | Metrics | 4:09 | | | |
| | class_T2_V4 | Pruning | 4:27 | | | |
| | class_T1_V4 | Overfitting | 5:34 | Zaki: 22.3 | slides | |
| T7 | | Ensembles | | | | |
| | class_T7_V1 | Ensembles | 7:31 | Zaki: 22.4, 24; Han: 8.6 | slides | |
| | class_T7_V2 | Bagging | 5:08 | | | |
| | class_T7_V3 | Random Forests | 6:29 | | | |
| | kdd_T5_V1 | Feature Engineering | 3:05 | | slides | |
| | kdd_T5_V2 | Feature selection | 9:52 | Han: 3.4.4 | | |
| | kdd_T5_V4 | Feature generation | 5:08 | | | |
| T8 | | Connectionists and Boosting | | | | |
| | class_T5_V1 | Neural Networks and Perceptron | 5:37 | Zaki: 25; Han: 9.2 | slides | video |
| | class_T5_V2 | Gradient descent algorithm | 6:57 | | | video |
| | class_T5_V3 | Multi-layer perceptrons | 4:49 | | | video (PT) subtitles |
| | class_T3_V5 | Logistic regression | 4:56 | Zaki: 24 | | video |
| | class_T7_V4 | Boosting | 5:21 | Zaki: 22.4; Han: 8.6.3 | slides | |
| | class_T7_V5 | Gradient Boosting | 7:19 | | | GB example by Josh Starmer (17:02) |
| T9 | | Support Vector Machines | | | | |
| | class_T3_V3 | SVMs and Kernels | | Zaki: 21, 5; Han: 9.3 | slides | |
| T10 | | Clustering | | | | |
| | kdd_T4_V4 | Clustering - part I and II | 3:51 and 7:47 | Zaki: 17; Han: 10.1, 10.6 | slides | |
| | | Clustering algorithms | | Zaki: 13-15; Han: 11.2-11.4, 11.1.3 | slides | |
| | kdd_T5_V3 | Feature extraction | 8:53 | Zaki: 7; Han: 3.4 | slides | |
| T11 | | Pattern Mining and Anomaly Detection | | | | |
| | kdd_T4_V4 | Pattern Mining - part I and II | 6:29 and 7:28 | Zaki: 8, 12; Han: 6 | slides | |
| | | Apriori algorithm | | | | |
| | kdd_T3_V2 | Discretisation | 3:14 | Zaki: 3.5; Han: 3.5 | | |
| | kdd_T4_V6 | Anomaly Detection (Part I and Part II) | 5:20 and 5:46 | Han: 12 | slides | |
| | | LOF algorithm | | | | |
| T12 | | Time Series | | | | |
| | | Time series profiling | 10:35 | Mitsa: 2 | slides | video |
| | | Time Series transformation | 8:13 | | slides | video |
| T13 | | Forecasting | | | | |
| | kdd_T4_V3 | Forecasting and Evaluation | 3:38 | Zaki: 27; Mitsa: 4; Shumway: 3 | slides | |
| | Zoom (pw="forecasting2022!") | Regression Models | | Zaki: 23 | slides | ARMA model by IBM (6:41); ARIMA and SARIMA models by IBM (12:21) |
| | | RNNs and LSTMs | | Zaki: 26.1-26.2 | slides | |
| T14 | | Closing Remarks | | | | |
| | Zoom (pw="forecasting2022!") | AutoML | | | slides | |
| | kdd_T6_V1 | Challenges | 4:36 | | | |
| | kdd_T6_V2 | Ethical concerns | 6:49 | | slides | |
| | | The Social Dilemma | 94:00 | | | The discrimination dilemma (3:36); The democracy dilemma (3:01); The mental health dilemma (3:10) |
Bibliography
Multivariate data topics (lessons 2 to 10) are covered by Zaki's book; Han's book covers essentially the same topics in less depth.
Time series are covered by Mitsa's and Shumway's books.
- Mohammed J. Zaki, Wagner Meira, Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. 2nd edition. Cambridge University Press. 2020
- Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2011
- Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman. Mining Massive Datasets. 3rd edition. 2019
- Theophano Mitsa. Temporal Data Mining. 2010
- Robert H. Shumway, David S. Stoffer. Time Series Analysis and Its Applications. 3rd edition. Springer
- Charu C. Aggarwal (ed). Social Network Data Analytics
- Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, Eamonn Keogh. Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords and Shapelets. IEEE ICDM 2016.
Texts about Ethical Concerns in Data Science
- GDPR
- ACM code of ethics
- Doing good data science, by DJ Patil, Hilary Mason and Mike Loukides. O'Reilly. July 10, 2018.
- The ethical side of data science and AI, by Shreyas S. Medium. Oct 21, 2018
Past Exams