Title 2. Entity (categorical) embeddings:
Deep learning has been remarkably successful across many problems in recent years. On real-world tabular data, however, other approaches (e.g. GBMs) usually perform better. That doesn't mean deep learning has nothing to offer in tabular settings: one of its most interesting contributions is the concept of embeddings for categorical data. Traditional techniques for treating categorical variables (e.g. one-hot encoding, target encoding) have strong limitations, either because they increase the dimensionality of the feature space or because they are tied to the specific task being solved. Categorical embeddings attempt to address both issues by mapping the categories into a dense vector space, of much lower dimensionality than a one-hot encoding, where similar categories are near one another. This project would entail implementing categorical embeddings for Portuguese locations (concelhos/freguesias): how similar are "Freixo de Espada à Cinta" and "Vila Nova da Barquinha"?
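To make the idea concrete, here is a minimal sketch of what an embedding table looks like and how "similarity" between two locations could be measured. It is an illustration only: the helper names (`make_embedding_table`, `cosine_similarity`), the choice of 3 dimensions, and the short list of concelhos are assumptions, and the vectors are randomly initialised rather than learned, so the similarity printed here carries no meaning until the table is trained on a real task.

```python
import math
import random


def make_embedding_table(categories, dim, seed=0):
    """Assign each category a dense vector of length `dim`.

    The vectors are random placeholders (an assumption for this sketch);
    in a real model they would be parameters learned by gradient descent.
    """
    rng = random.Random(seed)
    return {c: [rng.gauss(0.0, 1.0) for _ in range(dim)] for c in categories}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Illustrative subset of concelhos; a full list would have a few hundred entries.
concelhos = [
    "Freixo de Espada à Cinta",
    "Vila Nova da Barquinha",
    "Lisboa",
    "Porto",
]

# One-hot encoding would need len(concelhos) columns; the embedding needs `dim`.
table = make_embedding_table(concelhos, dim=3)

sim = cosine_similarity(
    table["Freixo de Espada à Cinta"],
    table["Vila Nova da Barquinha"],
)
print(f"similarity (untrained, so not meaningful yet): {sim:.3f}")
```

Once the vectors have been trained, the same `cosine_similarity` call is one reasonable way to answer the question of how similar two locations are; Euclidean distance is another common choice.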
Title 3. Causal machine learning:
Typical machine learning models learn correlations between features and targets. That's good enough if you're trying to predict what will happen, but not if you're trying to predict what will happen if you do something. For instance, when assessing a new drug, it isn't enough to predict whether the patient will get better: you want to predict whether the patient will get better if treated with the drug but would remain ill without it. That means your model is estimating the effect of an intervention and not the probability of an event. If you can conduct an A/B test, you can estimate that effect directly. But sometimes all you have is so-called observational data: data that did not result from a proper experiment. In these scenarios, causal machine learning techniques can help. In this project, you'll estimate the effect of marketing campaigns on individual customers: how big an effect do you actually get by calling a customer? You'll try different causal techniques (S-learner, T-learner, X-learner, Orthogonal Forests, and Doubly Robust Learners) and explore the proper metrics to assess these models.
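As a rough illustration of the simplest of these techniques, the sketch below implements a T-learner: fit one outcome model on the customers who were called, another on those who were not, and take the difference of their predictions as the estimated effect. Everything here is an assumption made for the example: the data is synthetic and noise-free, the segments ("new", "loyal") and all numbers are invented, and the "model" is just a per-segment mean rather than a real learner such as a GBM.

```python
from collections import defaultdict


def fit_group_means(rows):
    """Toy outcome model: predicts the mean outcome for each segment."""
    totals, counts = defaultdict(float), defaultdict(int)
    for segment, outcome in rows:
        totals[segment] += outcome
        counts[segment] += 1
    return {s: totals[s] / counts[s] for s in totals}


def t_learner(data):
    """Estimate the effect per segment as mu1(x) - mu0(x)."""
    treated = [(x, y) for x, t, y in data if t == 1]
    control = [(x, y) for x, t, y in data if t == 0]
    mu1 = fit_group_means(treated)  # model fitted on called customers
    mu0 = fit_group_means(control)  # model fitted on uncalled customers
    return {x: mu1[x] - mu0[x] for x in mu1 if x in mu0}


def make_synthetic_data():
    """Invented observational data: loyal customers spend more AND get called more."""
    baseline = {"new": 10.0, "loyal": 50.0}   # spend without a call (made up)
    effect = {"new": 5.0, "loyal": 1.0}       # true effect of a call (made up)
    n_called = {"new": 20, "loyal": 80}
    n_not_called = {"new": 80, "loyal": 20}
    data = []
    for seg in baseline:
        data += [(seg, 1, baseline[seg] + effect[seg])] * n_called[seg]
        data += [(seg, 0, baseline[seg])] * n_not_called[seg]
    return data


data = make_synthetic_data()
cate = t_learner(data)

# Naive comparison that ignores who was chosen for a call.
treated_y = [y for _, t, y in data if t == 1]
control_y = [y for _, t, y in data if t == 0]
naive_diff = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)

print("T-learner effect per segment:", cate)
print("Naive treated-vs-control difference:", round(naive_diff, 2))
```

In this toy setup the naive difference is far larger than either per-segment effect, because loyal customers both spend more and are called more often. The T-learner recovers the per-segment effects only because the segment is the sole confounder and is observed, which is an assumption that real observational data rarely satisfies so cleanly.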