TL;DR

This blog post focuses on ML design and architecture and tries to give some intuition and hints for deciding between one generalized model and multiple specialized models for the same business requirement and dataset.

Consider it a nudge to dive deeper into the topic.

Why should I care?

Transparency in machine learning is crucial for business stakeholders because it fosters trust and informed decision-making. Business stakeholders need to understand not only the potential benefits but also the limitations and risks associated with machine learning models.

As data scientists, we play an important role in this process, as we bridge the gap between complex algorithms and business objectives. Educating stakeholders about the inner workings of machine learning algorithms, including their inputs, outputs, and potential biases, empowers them to make well-informed decisions, manage expectations, and mitigate risks effectively.

“But this customer segment has way higher churn. Shouldn’t the model focus on this segment only?”

This is one of those situations where you, as an ML engineer, need to shed some light on your design decisions and explain why you chose a specific algorithm for the source data.

Use Case

We are focusing on a “bread and butter” binary classification model for churn in the telecommunications sector and will compare the two approaches on it.

“Churn” refers to the phenomenon where customers switch from one telecommunications service provider to another or discontinue their subscription altogether. It is a significant metric and concern for telecommunications companies because retaining existing customers is often more cost-effective than acquiring new ones.

A binary classification model is a type of machine learning model used to classify data into one of two categories: a positive class (labeled as 1) or a negative class (labeled as 0). It learns from labeled data during training and uses features to make predictions. The model is evaluated based on metrics like accuracy, precision, recall, and F1-score and can be applied to various tasks, such as spam detection, disease diagnosis, and sentiment analysis. It’s a fundamental tool for making decisions with two possible outcomes.
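As a minimal, self-contained sketch of such a model (using synthetic stand-in data rather than a real telco dataset, and scikit-learn as one possible library), training and evaluation could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled churn dataset: 1 = churned, 0 = stayed.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = GradientBoostingClassifier().fit(X_train, y_train)

# Precision, recall, F1-score and accuracy -- the metrics mentioned above.
print(classification_report(y_test, model.predict(X_test)))
```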

General Models vs Specialized Models

Suppose we have a dataset which consists of a matrix of predictors (called X) and a target variable (called y). X contains n columns (features) that could be used to segment the dataset, e.g. by age, product, sales channel, or network experience.
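For illustration, such a dataset could look like this (all column names and values are invented):

```python
import pandas as pd

# Hypothetical churn data: every column except "churn" is a predictor in X
# and a candidate key for segmenting the dataset.
df = pd.DataFrame({
    "age":                [23, 41, 35, 58],
    "product":            ["prepaid", "postpaid", "postpaid", "prepaid"],
    "sales_channel":      ["online", "shop", "online", "partner"],
    "network_experience": [0.9, 0.4, 0.7, 0.2],
    "churn":              [1, 0, 0, 1],
})

X = df.drop(columns=["churn"])  # predictor matrix (n feature columns)
y = df["churn"]                 # target variable
```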

So what are the two approaches we can choose between?

General Model

One main model is fitted on the whole training set, including all segments; the performance is then measured on the test set.

Specialized Model

The dataset is split into its segments, and a separate model is fitted on each subset. This means that we repeat training and testing k times, once per segment.
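Here is a compact sketch of both approaches side by side, on invented data with k = 3 segments (the column names, the segment effect, and the choice of scikit-learn's GradientBoostingClassifier are all assumptions for the sake of the example):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 3000

# Invented churn data with k = 3 segments.
df = pd.DataFrame({
    "usage": rng.uniform(0, 1, n),
    "tenure": rng.uniform(0, 1, n),
    "segment": rng.integers(0, 3, n),
})
logit = 2 * df["usage"] - 3 * df["tenure"] + 0.3 * df["segment"]
df["churn"] = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logit))).astype(int)

train, test = train_test_split(df, test_size=0.3, random_state=0)

# General model: one fit on all segments, with segment as a feature.
features = ["usage", "tenure", "segment"]
general = GradientBoostingClassifier().fit(train[features], train["churn"])
pred = general.predict_proba(test[features])[:, 1]
print("general AUC:", roc_auc_score(test["churn"], pred))

# Specialized models: training and testing repeated once per segment.
for seg, tr in train.groupby("segment"):
    te = test[test["segment"] == seg]
    spec = GradientBoostingClassifier().fit(tr[["usage", "tenure"]], tr["churn"])
    pred = spec.predict_proba(te[["usage", "tenure"]])[:, 1]
    print(f"segment {seg} AUC:", roc_auc_score(te["churn"], pred))
```

To compare the two approaches fairly, you would evaluate the general model per segment as well and set each segment's score against the corresponding specialized model, which is exactly the comparison discussed below.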

Intuition

Using specialized models obviously has some practical disadvantages, as you need to perform some tasks k times, which leads to:

  • higher complexity
  • higher effort on maintenance
  • almost redundant processes

So why should anyone favour specialized models at any point in time?

The prejudice against general models is as follows: advocates for specialized models argue that a single, all-encompassing model might lack precision within a specific subset, as it has also learned the characteristics of various other subsets.

In my opinion, this intuition is built on the assumption that all machine learning models work similarly to simple models such as linear regression.

The type of model counts

Linear regression assumes that there is a linear relationship between the independent variable(s) X and the dependent variable y. This means that the relationship can be approximated by a straight line. While it assumes linearity, it can be extended to capture more complex relationships by introducing higher-order terms or using more advanced regression techniques like polynomial regression or multiple linear regression when dealing with multiple independent variables. Still, it lacks accuracy if different parts of the dataset behave differently.
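A tiny invented illustration of that last point: two segments with opposite trends, which no single straight line can capture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
segment = rng.integers(0, 2, size=200)

# Segment 0 trends upward, segment 1 trends downward.
y = np.where(segment == 0, 3 * x[:, 0], -3 * x[:, 0]) + rng.normal(0, 1, 200)

# One global line averages the two behaviours away; R^2 stays near zero.
print("R^2 of a single linear fit:", LinearRegression().fit(x, y).score(x, y))
```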

The intuition does not necessarily hold true for the de facto standard algorithms for tabular data: boosted tree models like XGBoost, LightGBM, or CatBoost.

Boosted tree algorithms outperform linear models like linear regression in modeling complexity. Their main advantages over linear models, which are relevant to defuse the intuition against general models, are their ability to:

  • Handle Non-Linearity: Boosted trees capture non-linear relationships, unlike linear regression, which assumes linearity.

  • Ensemble Learning: They use ensemble learning to combine multiple decision trees, enabling them to create complex models.

  • Feature Interactions: Boosted trees automatically detect and model feature interactions, which linear regression struggles with.

  • Model Flexibility: Boosted trees adapt to data complexity by adding more trees to the ensemble.

This is the main reason why there is no theoretical argument for preferring several specialized models over one general model; a short sketch after the next list makes this concrete. Other advantages are:

  • Robustness to Outliers: They are more robust to outliers and noisy data, making them suitable for real-world scenarios.

  • Handling Heterogeneous Data: They handle various data types, including categorical features, without extensive preprocessing.

  • Fewer Assumptions: Unlike linear regression, they have fewer rigid assumptions, making them versatile in diverse datasets.
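To make the first points above concrete, here is the toy data from the linear-regression example again, with the segment passed in as an ordinary second feature. scikit-learn's GradientBoostingRegressor serves as a stand-in for XGBoost, LightGBM, or CatBoost, which expose similar fit/predict interfaces:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
segment = rng.integers(0, 2, size=200)
y = np.where(segment == 0, 3 * x[:, 0], -3 * x[:, 0]) + rng.normal(0, 1, 200)

# The trees learn the interaction "slope depends on segment" on their own,
# without any manual splitting of the dataset.
X = np.column_stack([x[:, 0], segment])
gbt = GradientBoostingRegressor(random_state=0).fit(X, y)
print("R^2 of one general boosted tree:", gbt.score(X, y))  # close to 1 (in-sample)
```

One general model, given the segment as a feature, recovers what the two specialized linear fits would have had to be built for by hand.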

And what about the data?

When looking at the data and keeping in mind the capabilities of non-linear models, we should always prefer a general model over specialized models when there is some similarity across the segments composing the dataset.

As the segments diverge further from one another, the benefits of employing a universal model diminish progressively. If the segments are entirely dissimilar, the disparity between the two approaches should converge to zero.

But what would that mean for our real-life use case? Will customer segments really be completely different? Should we assume that the behaviour of people differs completely based on age, product, sales channel, or network experience?

As you can already see, this is a rhetorical question: you cannot assume completely different behaviour for a more or less homogeneous group.

I want proof

The thoughts above are based on two very nice blog posts by Samuele Mazzanti, which are referenced below.

He did the math and, based on two experiments, calculated whether a general model outperforms specialized models.

On average, the general model outperforms the specialized models. He also ran a short stress test with completely diverged groups, which should be a really rare occurrence in any real-life business case; even then, the general model's AUC was only 0.53% worse than that of the specialized models.

Conclusion

Whether a generalized model or specialized models are better for the same dataset depends on various factors, including the nature of the data, the algorithm used, the specific problem you're trying to solve, and your objectives.

Generalized Model:

  • Advantages:

    • Simplicity: Generalized models are often simpler to implement and maintain.
    • Resource Efficiency: Training and deploying a single model can be more resource-efficient than multiple specialized models.
    • Applicability: Generalized models can be useful when the differences between subgroups in the data are relatively small, and a single model can provide satisfactory performance across all groups.
  • Use Cases:

    • When the dataset is relatively homogenous, and there are no strong reasons to believe that different subgroups require significantly different modeling approaches.
    • In scenarios where model interpretability and ease of deployment are critical.

Specialized Models:

  • Advantages:

    • Improved Performance: Specialized models can potentially provide better predictive performance for specific subgroups or behaviours within the data if those subgroups differ substantially from one another.
    • Customization: They allow you to tailor the model to the unique characteristics of different segments, which can lead to more accurate predictions.
    • Flexibility: Specialized models can handle cases where the relationships between features and the target variable vary significantly between subgroups.
  • Use Cases:

    • When there are clear distinctions or significant variations in behavior or patterns among different subgroups within the dataset.
    • In cases where the overall dataset is large, but specific subgroups have limited data, making it challenging for a generalized model to capture their nuances effectively.
    • When optimizing performance for specific subgroups is critical, even if it requires more complex model development and maintenance.

In practice, it’s all about the data: if the data doesn’t exhibit significant subgroup differences, it is, in my view, always beneficial to use a generalized model. Specialized models may only be beneficial for a very heterogeneous dataset or under specific requirements.

As written above, the model type also plays a crucial part, and the common intuition is often a bit misleading. The results from the quoted experiments clearly show that using boosted tree models in one general model is favourable.

The trade-offs between model complexity and performance always have to be taken into account.

Resources

Samuele Mazzanti: What Is Better: One General Model or Many Specialized Models?

Samuele Mazzanti: The Unreasonable Effectiveness of General Models