All of this means that if we want to minimize surprise FPs between model releases, we must ensure DV ordering preservation.  

XGBoost is flexible because its Newton-Raphson solver requires only the gradient and Hessian of the objective rather than the objective itself. By adding small perturbations to the gradient and to the Hessian, we can replace the standard XGBoost objective function with one that includes a loss for failing to rank DVs according to the DV ranking defined by the previous model release, thereby promoting model release stability.  

Mathematical Description of XGBoost Optimization 

The following, up to but not including the example, is taken predominantly from the XGBoost Project docs. The XGBoost model consists of an ensemble of $K$ trees

$$\{f_1, f_2, \ldots, f_K\}, \quad f_k \in \mathcal{F},$$

such that

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i).$$

The objective function we leverage for training the binary classifier is the binary logistic loss function with complexity regularization

$$\text{obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),$$

where

$$l\left(y_i, \hat{y}_i\right) = y_i \ln\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i) \ln\left(1 + e^{\hat{y}_i}\right)$$

and

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,$$

with $T$ the number of leaves in the tree and $w_j$ the leaf weights.
At each iteration $t$, the goal is to find the tree $f_t$ that minimizes $\text{obj}^{(t)}$. In the case of a neural network, loss minimization requires computing the rate of change of the loss with respect to the model weights. In the case of XGBoost, we compute the second-order Taylor expansion of the loss $l$ and provide the gradient and Hessian to the Newton-Raphson solver to find the optimal $f_t$ given the previously constructed trees $f_s$, $s < t$.

The second-order Taylor expansion of the objective takes the form

$$\text{obj}^{(t)} = \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t),$$

where

$$g_i = \partial_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right) \quad \text{and} \quad h_i = \partial^2_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right).$$

The upshot is that if we want to customize the XGBoost objective, we need only provide the updated gradient $g_i$ and Hessian $h_i$.
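
Concretely, in the XGBoost Python API a custom objective is just a callable that returns these two per-sample arrays. Here is a minimal sketch reproducing the standard logistic objective (the formulas for the gradient and Hessian are derived just below):

```python
import numpy as np
from scipy.special import expit  # sigmoid with under/overflow protection


def logistic_objective(y_pred: np.ndarray, dtrain) -> tuple:
    """Standard binary logistic objective expressed as (gradient, Hessian)."""
    y_true = dtrain.get_label()  # dtrain is an xgb.DMatrix
    p = expit(y_pred)            # y_pred holds raw margins, not probabilities
    grad = p - y_true            # g_i = p_i - y_i
    hess = p * (1.0 - p)         # h_i = p_i (1 - p_i)
    return grad, hess
```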

A note to the observant reader (not from the docs): In the above expansion, the loss function

$$l\left(y_i, \hat{y}_i\right) = y_i \ln\left(1 + e^{-\hat{y}_i}\right) + (1 - y_i) \ln\left(1 + e^{\hat{y}_i}\right)$$

is being expanded around

$$\hat{y}_i^{(t-1)},$$

where the independent variable is the raw margin (log-odds)

$$\hat{y}_i = \sum_{k} f_k(x_i)$$

and

$$p_i = \sigma\left(\hat{y}_i\right) = \frac{1}{1 + e^{-\hat{y}_i}}.$$

Computing

$$g_i = \partial_{\hat{y}_i} l\left(y_i, \hat{y}_i\right) \quad \text{and} \quad h_i = \partial^2_{\hat{y}_i} l\left(y_i, \hat{y}_i\right)$$

gives

$$g_i = p_i - y_i \quad \text{and} \quad h_i = p_i (1 - p_i).$$
For the sake of making these equations more interpretable and concrete, assume we have a sample $x$ such that the XGBoost model outputs $p = f(x) = 0.2$, and assume the true label is $y = 1$. The gradient of the logistic loss for this sample is $g = p - y = -0.8$. This will encourage the $(t+1)$st tree to be constructed so as to push the prediction value for this sample higher.
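
A quick numeric check of this example, using the gradient and Hessian formulas derived above:

```python
p, y = 0.2, 1.0
g = p - y          # negative, so the next tree pushes this DV higher
h = p * (1.0 - p)  # always positive, keeping the Newton step well-posed
print(g, h)        # approximately -0.8 and 0.16
```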

The adjustments to the gradient and Hessian are then

$$\tilde{g}_i = g_i + w \, v_i$$

and

$$\tilde{h}_i = h_i + w \, \lvert v_i \rvert,$$

respectively, where $v_i$ is the rank-based perturbation for sample $i$ (defined concretely in the Python implementation below) and $w$ is a small weight multiplier; the absolute value keeps the Hessian positive.

The takeaway is that a negative gradient pushes the prediction value, and therefore the DV, higher, as the sigmoid function is everywhere increasing. This means that if we want to customize the objective function in such a way that the DV of a given sample is pushed higher as subsequent trees are added, we should add a number $v < 0$ to the gradient for that sample.
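
To see why, recall the optimal leaf weight from the XGBoost docs: for leaf $j$ containing the sample set $I_j$,

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},$$

so adding $v < 0$ to a sample's gradient makes the numerator more negative, which increases the leaf weight and hence the raw margin and the DV.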

An Intuitive Toy Example

Assume we have sorted the samples in the training corpus of model N by DV in ascending order and stacked the remaining (new) samples below them. Assume $y_{\text{pred}} = [1, 2, 3, 4, 5, 7, 6]$. The resulting addition to the gradient should be something like $[0, 0, 0, 0, 0, 1, -1]$. The intuition is that we want to move the prediction of the sample whose current prediction is 6 a little higher and the prediction of the sample whose current prediction is 7 a little lower. Keep in mind that the ordering in terms of row position of the underlying samples in the training set is correct by assumption. This will enforce the proper ordering $[1, 2, 3, 4, 5, 6, 7]$.
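
A minimal NumPy sketch of this intuition (variable names are illustrative):

```python
import numpy as np

# Rows are assumed pre-sorted by the old model's DVs, so any adjacent pair
# where the left prediction exceeds the right one is misordered.
y_pred = np.array([1, 2, 3, 4, 5, 7, 6], dtype=float)

pert = np.zeros_like(y_pred)
diffs = y_pred[:-1] - y_pred[1:]  # > 0 exactly where a pair is misordered
mis = diffs > 0
pert[:-1][mis] += diffs[mis]      # push the too-high prediction lower
pert[1:][mis] -= diffs[mis]       # push the too-low prediction higher

print(pert)  # [ 0.  0.  0.  0.  0.  1. -1.]
```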

Experiments, Code, and Results

Experimental Setup

Each experiment consists of training exactly three XGBoost binary classifiers on a corpus of PE files with a 90/10 dirty/clean split. Featurization was performed with an internally developed static parser, but the method itself is parser-agnostic; one could leverage the open-source EMBER parser, for example. The first model represents the “N” release, trained with the standard XGBoost logistic loss objective; we call this the “old” model. The second model represents the standard “N+1” release, trained with the same objective as the “old” model but with 10% more data at the same label balance; we call this the “full” model. The third model represents the candidate “N+1” release, trained with the custom objective described above on the same dataset as the “full” model.

We ran two separate experiments, differing only in the number of training samples. In both, the custom objective succeeded in reducing swap-in, or “surprise,” FPs with a minimal trade-off in true positives.

Results

Table 1. 119,494 samples; objective restricted to clean DVs within the 5% and 80% target FPR thresholds; weight multiplier for $g_i$ = 1e-11. Parenthetical values show the change relative to the “full” model.

| Comparison | Swap-Ins | Persistent FPs | Non-Swap New FPs | Total FPs, Old Model | Total FPs, New Model | Total TPs, Old Model | Total TPs, New Model |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Old vs. Full | 32 | 194 | 23 | 226 | 250 | 25,267 | 28,111 |
| Old vs. Candidate | 26 (-18.75%) | 199 | 25 | 226 | 250 | 25,267 | 28,104 (-0.025%) |

Table 2. 284,657 samples; objective restricted to clean DVs within the 5% and 80% target FPR thresholds; weight multiplier for $g_i$ = 1e-11. Parenthetical values show the change relative to the “full” model.

| Comparison | Swap-Ins | Persistent FPs | Non-Swap New FPs | Total FPs, Old Model | Total FPs, New Model | Total TPs, Old Model | Total TPs, New Model |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Old vs. Full | 59 | 382 | 56 | 446 | 497 | 62,157 | 69,059 |
| Old vs. Candidate | 53 (-10.2%) | 387 | 56 | 446 | 497 | 62,157 | 69,053 (-0.009%) |

Python Implementation

The perturbation value we decided to use was simply the difference between the prediction values of each pair of misordered samples (ordered according to the DVs output by model N, the “old” model). Note that this requires a perturbation to the Hessian as well. The code below assumes the values in the argument y_pred are ordered according to the DVs output by model N; take care to note that this does not mean the values themselves are sorted on the real number line. The SciPy function expit is the sigmoid function with built-in underflow and overflow protection.
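
The following is a minimal sketch of such an objective, assuming the training rows (and hence y_pred) are pre-sorted by the old model's DVs; the exact Hessian perturbation and the weighting scheme shown here are illustrative assumptions, not the original implementation:

```python
import numpy as np
import xgboost as xgb
from scipy.special import expit  # sigmoid with under/overflow protection


class CustomObjective:
    """Rank-preserving logistic objective (illustrative sketch).

    Assumes the training rows are sorted by the DVs of the previous
    ("old") model release, so row order encodes the target DV ranking.
    """

    def __init__(self, weight: float = 1e-11):
        self.weight = weight  # small multiplier applied to the perturbation

    def __call__(self, y_pred: np.ndarray, dtrain: xgb.DMatrix):
        y_true = dtrain.get_label()
        p = expit(y_pred)      # raw margins -> probabilities
        grad = p - y_true      # standard logistic gradient
        hess = p * (1.0 - p)   # standard logistic Hessian

        # Perturbation: for each adjacent pair misordered relative to the
        # old model's ranking, use the difference between the pair's raw
        # predictions to push the too-high one down and the too-low one up.
        # (The restriction to clean DVs within the target FPR thresholds
        # described above is omitted from this sketch.)
        pert = np.zeros_like(y_pred)
        diffs = y_pred[:-1] - y_pred[1:]  # > 0 where a pair is misordered
        mis = diffs > 0
        pert[:-1][mis] += diffs[mis]
        pert[1:][mis] -= diffs[mis]

        grad = grad + self.weight * pert
        hess = hess + self.weight * np.abs(pert)  # assumed form; keeps hess > 0
        return grad, hess
```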

The callable CustomObjective class instantiation is then passed to the standard xgb.train function. Incidentally, a callable class is another way, in addition to lambda functions, to pass extra arguments to Python functions that must conform to a fixed-signature interface.
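
For example, with synthetic stand-in data (parameter values are illustrative):

```python
# Synthetic stand-in data; in practice the rows are sorted by the old
# model's DVs before building the DMatrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (rng.random(1000) < 0.9).astype(float)  # ~90/10 dirty/clean labels
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.train(
    {"max_depth": 6, "eta": 0.3},
    dtrain,
    num_boost_round=100,
    obj=CustomObjective(weight=1e-11),  # callable instance as the objective
)
```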

Employing an XGBoost Custom Objective Function Results in More Predictable Model Behavior with Fewer FPs 

XGBoost classifier consistency between releases can be improved with an XGBoost custom objective function that is easy to implement and mathematically sound, with a minimal trade-off in true positive rate. The results are more predictable model behavior, less chaotic customer environments, and fewer threat researcher cycles wasted on surprise FP remediation.

CrowdStrike’s Research Investment Pays Off for Customers and the Cybersecurity Industry

Research is a critical function at CrowdStrike, ensuring we continue to take a leadership role in advancing the global cybersecurity ecosystem. The results of groundbreaking work — like that done by the team who conducted the research into the XGBoost custom objective function — ensure CrowdStrike customers enjoy state-of-the-art protection and advance cyber defenses globally against sophisticated adversaries.
