Contact Person: Fernando Silva
Email: fernando.silva@novobanco.pt


Title 1. client2vec: 

Sequences of bank transactions are predictable in much the same way as sequences of words: some words are more likely than others in a given context, and the same holds for bank transactions and purchases. Given the success of word embeddings, which project words into a high-dimensional space where distance carries semantic meaning, we can also learn meaningful embeddings of clients from their banking transactions. Extending the analogy, account transactions can be seen as words, clients as documents (bags or sequences of words), and the behaviour of a client as the summary of a document. Just like word or document embeddings, client embeddings should exhibit the fundamental property that neighbouring points in the embedding space correspond to clients with similar behaviours. The project entails experimenting with the approach described in the paper "client2vec: Towards Systematic Baselines for Banking Applications" and applying it to a marketing prediction problem: can you predict who's going to buy Chinese food based on their embedding?
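As a starting point, here is a minimal sketch of the simplest version of this idea: represent each client as a bag of transaction "words" and compare clients by cosine similarity. The client names and merchant categories are invented toy data, not from any real dataset; a full project would replace the count vectors with learned embeddings.

```python
import numpy as np

# Toy transaction histories (invented data): each client is treated as a
# bag of merchant-category "words".
clients = {
    "ana":   ["grocery", "chinese_food", "grocery", "fuel"],
    "bruno": ["chinese_food", "chinese_food", "grocery", "grocery"],
    "carla": ["fuel", "fuel", "insurance", "grocery"],
}

vocab = sorted({t for txns in clients.values() for t in txns})
col = {t: i for i, t in enumerate(vocab)}

# Bag-of-transactions count matrix (clients x vocabulary).
names = list(clients)
counts = np.zeros((len(names), len(vocab)))
for row, name in enumerate(names):
    for t in clients[name]:
        counts[row, col[t]] += 1

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Neighbouring clients should behave alike: ana and bruno both buy Chinese
# food, so ana's vector should be closer to bruno's than to carla's.
sim_ab = cosine(counts[0], counts[1])
sim_ac = cosine(counts[0], counts[2])
print(sim_ab, sim_ac)
```

The "who buys Chinese food" question then becomes a nearest-neighbour query: look at the embeddings of known Chinese-food buyers and score other clients by their proximity to them.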


Title 2. Entity (categorical) embeddings: 

Deep learning has seen tremendous success across a variety of problems in recent years. However, on real-world tabular data, other approaches (e.g. gradient boosting machines) are often more successful. That doesn't mean deep learning isn't useful in tabular settings: one of its most interesting contributions is the concept of embeddings for categorical data. Traditional techniques for handling categorical variables (e.g. one-hot encoding, target encoding) have strong limitations, either because they increase the dimensionality of the feature space or because they are tied to the specific task being solved. Categorical embeddings address both issues by mapping the categories into a dense vector space where similar categories lie near one another. This project entails implementing categorical embeddings for Portuguese locations (concelhos/freguesias): how similar are "Freixo de Espada à Cinta" and "Vila Nova da Barquinha"?


Title 3. Causal machine learning: 

Typical machine learning models learn correlations between features and targets. That's good enough if you're trying to predict what will happen, but not if you're predicting what will happen if you do something. For instance, if you're assessing a new drug, you don't want to predict whether a patient will get better; you want to predict whether the patient will get better if treated with that drug but would remain ill otherwise. That means your model is estimating the effect of an intervention, not the probability of an event. If you can run an A/B test, you can estimate that effect directly. But sometimes all you have is so-called observational data: data that did not come from a proper experiment. In these scenarios, the solution is causal machine learning. In this project, you'll estimate the effect of marketing campaigns on individual customers: how big an effect do you actually get by calling a customer? You'll try different causal techniques (S-learner, T-learner, X-learner, Orthogonal Forests, and Doubly Robust Learners) and explore the proper metrics for assessing these models.
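Of the techniques listed, the T-learner is the simplest to sketch: fit one outcome model per treatment arm and subtract the predictions to estimate the individual effect (the CATE). The snippet below demonstrates it on simulated campaign data with linear models; the variable names and the data-generating process are illustrative assumptions, not a real campaign dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Simulated data: x is customer engagement, t = 1 means the customer was
# called, y is spending afterwards. Treatment is randomized here for
# simplicity; real campaign data is usually observational.
x = rng.uniform(0, 1, n)
t = rng.integers(0, 2, n)
true_cate = 2.0 * x                       # calling helps engaged customers more
y = 1.0 + 3.0 * x + true_cate * t + rng.normal(0, 0.5, n)

def fit_linear(xs, ys):
    # Ordinary least squares with an intercept.
    A = np.column_stack([np.ones(len(xs)), xs])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef

# T-learner: one outcome model per arm.
c0 = fit_linear(x[t == 0], y[t == 0])     # control model, mu0
c1 = fit_linear(x[t == 1], y[t == 1])     # treated model, mu1

def cate(xq):
    # Estimated uplift of calling a customer with engagement xq.
    return (c1[0] + c1[1] * xq) - (c0[0] + c0[1] * xq)

print(cate(np.array([0.1, 0.9])))  # uplift estimates for two customers
```

The S-learner would instead fit a single model with the treatment as an extra feature, and the X-learner and doubly robust variants refine this two-model idea; with simulated data like the above you can check any of them against the known true effect, which is exactly the kind of validation the real project will need proper metrics for.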