Tabular Data Normalization for Deep Learning
Data normalization plays a key role in applying Deep Learning (DL) to tabular data. In most applications, practitioners resort to Z-score normalization for numerical features and to one-hot encoding or embeddings for categorical features. Other preprocessing methods have been shown to work better in some scenarios, but they have not yet been evaluated on tabular financial fraud data. The goal of this project is to study the impact of feature normalization methods on the performance of Deep Learning algorithms on a financial fraud dataset and to test more advanced normalization techniques. Candidate methods include grouping low-cardinality categoricals into better embeddings, byte pair encoding, and quantile normalization for numerical features; a more extensive literature review of the most recent methods should also be performed.
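A minimal sketch contrasting the Z-score baseline with quantile normalization for a skewed numerical feature, using scikit-learn; the synthetic "amount" column and parameter choices are illustrative assumptions, not part of the proposal.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic, heavy-tailed "transaction amount" column, typical of fraud data.
df = pd.DataFrame({"amount": rng.lognormal(mean=3.0, sigma=1.5, size=10_000)})

# Baseline: Z-score normalization (sensitive to outliers and heavy tails).
z = StandardScaler().fit_transform(df[["amount"]])

# Alternative: quantile normalization maps values to a Gaussian via their ranks,
# which often behaves better for skewed monetary features.
q = QuantileTransformer(output_distribution="normal",
                        n_quantiles=1_000,
                        subsample=10_000).fit_transform(df[["amount"]])

print("z-score  min/max:", z.min(), z.max())
print("quantile min/max:", q.min(), q.max())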
References:
Huang, Xin, et al. "Tabtransformer: Tabular data modeling using contextual embeddings." arXiv preprint arXiv:2012.06678 (2020).
Badirli, Sarkhan, et al. "Gradient boosting neural networks: Grownet." arXiv preprint arXiv:2002.07971 (2020).
Arik, Sercan Ö., and Tomas Pfister. "Tabnet: Attentive interpretable tabular learning." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 8. 2021.
Wang, Ruoxi, et al. "Deep & cross network for ad click predictions." Proceedings of the ADKDD'17. 2017. 1-7.
Song, Weiping, et al. "Autoint: Automatic feature interaction learning via self-attentive neural networks." Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2019.
Contact: Pedro Saleiro ( pedro.saleiro@feedzai.com )
Dealing with Selective Labels in Fraud Detection
Online merchants use machine learning models to protect themselves from fraud: whenever the model predicts that a purchase is fraudulent, they reject that purchase. To keep up with ever-changing fraud trends, machine learning models need to be retrained with the most recent instances. However, some labels will not be available: those of the instances that were rejected. This is known as the selective labels problem. Should the new model ignore these instances and risk missing out on valuable information? Or should it assume they are fraudulent, use them, and risk training on inaccurate labels? Are there more sophisticated alternatives? The goal of this project is to delve into the relevant literature and benchmark methodologies on a financial fraud dataset, from the perspective of both performance and algorithmic fairness.
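A minimal sketch of the two naive baselines mentioned above, dropping rejected instances versus labeling them all as fraud; the column names, threshold, and synthetic data are illustrative assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "score": rng.uniform(size=n),                        # previous model's fraud score
    "label": rng.integers(0, 2, size=n).astype(float),   # true label (fraud = 1)
})
df["approved"] = df["score"] < 0.9            # rejected purchases are never observed...
df.loc[~df["approved"], "label"] = np.nan     # ...so their labels are missing

# Baseline 1: ignore rejected instances (risk: discarding the riskiest region of the space).
train_ignore = df[df["approved"]]

# Baseline 2: assume every rejected instance was fraud (risk: noisy positive labels).
train_assume_fraud = df.copy()
train_assume_fraud["label"] = train_assume_fraud["label"].fillna(1.0)

print(len(train_ignore), train_assume_fraud["label"].mean())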
References:
Hand, D. J., & Henley, W. E. (1993). Can reject inference ever work?. IMA Journal of Management Mathematics, 5(1), 45-55.
Bücker, M., van Kampen, M., & Krämer, W. (2013). Reject inference in consumer credit scoring with nonignorable missing data. Journal of Banking & Finance, 37(3), 1040-1045.
Mancisidor, R. A., Kampffmeyer, M., Aas, K., & Jenssen, R. (2020). Deep generative models for reject inference in credit scoring. Knowledge-Based Systems, 196, 105758.
Pombal, J., Saleiro, P., Figueiredo, M. A., & Bizarro, P. (2022). Prisoners of Their Own Devices: How Models Induce Data Bias in Performative Prediction. arXiv preprint arXiv:2206.13183.
Contact: Pedro Saleiro ( pedro.saleiro@feedzai.com )
Learning to Quantify in Dynamic Environments
Real-world fraud detection is highly dynamic due to its adversarial nature. Fraudsters change strategies in response to a newly deployed machine learning model. Therefore, the actions of the system may influence the environment and the underlying data distribution. One key challenge is estimating fraud rates (i.e., class prevalence) in a given period of time in order to adjust model thresholds without access to the true labels (true fraud labels may take months to arrive). In this work, we aim to study existing methods for supervised prevalence estimation and to test their application to a real-world fraud detection problem.
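A minimal sketch of Adjusted Classify and Count (ACC), a classic quantification baseline covered in the references below; the synthetic scores, threshold, and TPR/FPR values are illustrative assumptions.

import numpy as np

def acc_prevalence(scores_new, threshold, tpr, fpr):
    """Estimate class prevalence on unlabeled data using a fixed classifier.

    cc  : raw fraction of instances classified positive ("classify and count")
    ACC : corrects cc with the classifier's TPR/FPR measured on labeled data:
          p_hat = (cc - fpr) / (tpr - fpr)
    """
    cc = np.mean(scores_new >= threshold)
    p_hat = (cc - fpr) / (tpr - fpr)
    return float(np.clip(p_hat, 0.0, 1.0))

# TPR and FPR would be estimated on a past, fully labeled validation period.
tpr, fpr = 0.80, 0.05
rng = np.random.default_rng(0)
scores_new = rng.uniform(size=5_000)   # stand-in for current-period model scores
print(acc_prevalence(scores_new, threshold=0.5, tpr=tpr, fpr=fpr))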
References:
del Coz, Juan José, et al. "Learning to quantify: Methods and applications (LQ 2021)." Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021.
González, Pablo, et al. "A review on quantification learning." ACM Computing Surveys (CSUR) 50.5 (2017): 1-40.
Pérez-Gállego, Pablo, et al. "Dynamic ensemble selection for quantification tasks." Information Fusion 45 (2019): 1-15.
Ditzler, Gregory, et al. "Learning in nonstationary environments: A survey." IEEE Computational Intelligence Magazine 10.4 (2015): 12-25.
Contact: Pedro Saleiro ( pedro.saleiro@feedzai.com )