Complexity of a regression tree

  • 17 Aug 2024 8:16 AM
    Reply # 13394810 on 13394513

    "You're right that the effective number of parameters in a regression or classification tree goes beyond simply counting the nodes. The complexity of the tree, including its structure and the selection of variables at each split, plays a significant role in determining the model's effective capacity.

    1. Formal Mathematical Frameworks

    One formal approach to quantifying the effective number of parameters is through degrees of freedom (DoF), which takes the tree structure into account. The idea is to estimate complexity by considering not just the count of fitted quantities (split points and terminal-node values) but also how flexibly the model can fit the data. Calculating DoF for decision trees isn't straightforward, however, because of their non-linear, non-parametric nature. Some approaches involve:

    • Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC): These criteria penalize models with more parameters. Although derived for likelihood-based parametric models rather than trees, they can be adapted by treating the number of splits or leaves as the parameter count in the penalty term (a sketch of this adaptation follows this list).
    • Vapnik-Chervonenkis (VC) Dimension: The VC dimension gives a theoretical upper bound on a model class's capacity to fit arbitrary labelings. For trees, it grows with the maximum depth and the number of leaves.
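
    As a minimal sketch of that adaptation, here is one way to compute Gaussian-likelihood AIC and BIC for a fitted regression tree, treating each leaf as one parameter. The dataset, tree settings, and the leaf-based parameter count are illustrative assumptions, not an established criterion for trees:

        # Heuristic AIC/BIC for a regression tree: treat each leaf as one
        # fitted parameter k, and use the Gaussian-likelihood forms
        # AIC = n*log(RSS/n) + 2k, BIC = n*log(RSS/n) + k*log(n).
        import numpy as np
        from sklearn.datasets import make_regression
        from sklearn.tree import DecisionTreeRegressor

        X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
        tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

        n = len(y)
        rss = np.sum((y - tree.predict(X)) ** 2)  # in-sample residual sum of squares
        k = tree.get_n_leaves()                   # heuristic parameter count

        aic = n * np.log(rss / n) + 2 * k
        bic = n * np.log(rss / n) + k * np.log(n)
        print(f"leaves={k}, AIC={aic:.1f}, BIC={bic:.1f}")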

    2. Heuristics for Effective Parameters

    Given the difficulty of formalizing the effective number of parameters, some heuristics have been proposed:

    • Number of Leaves (Terminal Nodes): This is a common heuristic where each leaf is treated as an independent parameter. However, this doesn't account for the tree's structure.
    • Tree Depth: Some heuristics multiply the number of leaves by a factor related to the tree depth, recognizing that deeper trees have more complexity.
    • Effective Number of Parameters (ENP): A heuristic suggested in some statistical literature is ENP = α × (number of leaves), where α is a factor that accounts for the structure of the tree (e.g., α = 1.5 or 2.0 as a rough estimate); see the sketch after section 3.

    3. Effective Degrees of Freedom

    A practical approximation might be to treat the Effective Degrees of Freedom (EDF) as the number of leaves plus a penalty for tree depth. One possible formula could be:

    EDF = Number of Leaves + β × Tree Depth

    where β is a chosen constant that reflects the additional complexity introduced by deeper trees.
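
    Here is a minimal sketch of both heuristics, assuming scikit-learn; the values of α and β are arbitrary placeholders rather than recommendations:

        # ENP and EDF heuristics for a fitted tree. alpha and beta are
        # illustrative placeholders, not derived constants.
        from sklearn.datasets import make_regression
        from sklearn.tree import DecisionTreeRegressor

        X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
        tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

        leaves = tree.get_n_leaves()
        depth = tree.get_depth()

        alpha = 1.5   # rough structural factor for ENP
        beta = 0.5    # per-level depth penalty for EDF

        enp = alpha * leaves          # ENP = alpha * (number of leaves)
        edf = leaves + beta * depth   # EDF = leaves + beta * (tree depth)
        print(f"leaves={leaves}, depth={depth}, ENP={enp:.1f}, EDF={edf:.1f}")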

    These approaches recognize that the effective number of parameters in a tree is a combination of the number of decision rules (nodes), the depth (which controls the flexibility), and the tree's overall structure. While these are not strict mathematical formulas, they provide useful guidelines for estimating the complexity of a decision tree model in practice."

    Last modified: 17 Aug 2024 8:42 AM | Duncan Lowes
  • 16 Aug 2024 3:12 PM
    Message # 13394513

    I am interested in a simple rough formula for the effective number of parameters for a regression or classification tree, which I am teaching as part of a business analytics course.

    The most obvious formula is just the total number of nodes, since each decision node involves estimating a cutoff and each terminal node involves estimating a mean or probability. This is what ChatGPT told me!
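
    For concreteness, that count is easy to extract from a fitted tree; a minimal sketch, assuming scikit-learn (the data and tree settings are placeholders):

        # The "count every node" formula: each decision node contributes a
        # cutoff, each terminal node a fitted mean.
        from sklearn.datasets import make_regression
        from sklearn.tree import DecisionTreeRegressor

        X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
        tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

        n_total = tree.tree_.node_count   # decision nodes + terminal nodes
        n_leaves = tree.get_n_leaves()    # one fitted mean per leaf
        n_splits = n_total - n_leaves     # one fitted cutoff per split
        print(f"splits={n_splits}, leaves={n_leaves}, total={n_total}")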

    But fitting a tree involves more than estimating these numbers: it also involves selecting which variable is used at each decision node, as well as the overall structure of the tree.

    Are there any formal mathematical frameworks for defining the effective number of parameters, for instance, using rank-like properties of the mapping from observed to predicted values?

    Are there any heuristics that folks know of for including the tree structure, to make a rough adjustment to the number of nodes?

