Malware Detection with Machine Learning

robertmcgrath

Apr 29

For many years, research labs have experimented with using machine learning to detect malware. There have been quite a few successful research projects, but researchers from the UK and Germany report that these models have notably failed to transfer to real-world situations [1]. As they ask, "what's the deal?"

Much of "the deal" seems to have to do with issues in training the ML models.

Malware detection is difficult not only because of deliberate obfuscation, but also because malware is generally a tiny fraction of the overall code. (Heck, the fraction of code that actually runs is minuscule compared to all the code sitting there ready to run.)

Reliably and precisely detecting a small target in a large pool is always hard for machine learning because it is difficult to create representative training sets.
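
To make the "small target in a large pool" point concrete, here is a minimal sketch of the arithmetic. The detector numbers are hypothetical (mine, not the authors'): it catches 99% of malware and wrongly flags 1% of goodware. The sketch only shows how the share of correct alarms changes as malware becomes rarer than it was in a balanced lab dataset.

```python
def precision_at_base_rate(base_rate, true_positive_rate, false_positive_rate):
    """Fraction of flagged samples that really are malware (Bayes' rule)."""
    true_alarms = base_rate * true_positive_rate
    false_alarms = (1.0 - base_rate) * false_positive_rate
    return true_alarms / (true_alarms + false_alarms)


# Hypothetical detector: 99% detection rate, 1% false alarm rate.
# base_rate is the share of malware in the pool being scanned.
for base_rate in (0.5, 0.01, 0.001):
    p = precision_at_base_rate(base_rate, 0.99, 0.01)
    print(f"malware share {base_rate:.1%} -> precision {p:.1%}")
```

With a half-and-half lab dataset nearly every alarm is correct. At a 1% malware share only about half of the alarms are real, and at 0.1% fewer than one in ten are, even though the detector itself has not changed.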

A second problem for malware detection is that the training set is generally badly out of date. Learning to detect malware from several years ago probably isn't relevant to detecting today's malware.

All machine learning may suffer from out-of-date training sets, but some targets, such as natural language or image classification, are relatively immune. Human language changes slowly enough that a chatbot trained on examples from 2010-2019 probably still works pretty well. But both malware and "goodware" (as these researchers term it) change rapidly. Samples of software from 2010-2019 are embarrassingly irrelevant today.

(As an aside, I'll note that ML for real-time navigation and autonomous driving is probably vulnerable to this issue, at least more than, say, language analysis. Driving has a lot of unchanging stuff, but there are plenty of wild cards and new stuff every year, and those are some of the most important parts! Training on road conditions and driving experience from 2010-2019 is only partly useful for predicting driving conditions today. It's not surprising, then, that AI for self-driving cars has proved hard.)

This problem of a rapidly evolving training set is exacerbated by the adversarial nature of the malware game. Not only is a lot of malware new, today's software has been modified to defeat last year's malware, and vice versa. A machine learning model that sifts through last year's code and malware is, well, irrelevant, because everything has been patched in the meantime.

The authors note that some research projects aggregate samples across time, creating a training set that doesn't represent any situation that could have occurred in real life. I.e., a dataset that included patched and unpatched versions of software may be easy to learn (e.g., 'unpatched is a sign of malware present'), but this is not very useful in the real world.
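
A minimal sketch of the alternative, a split that respects time: train only on samples seen before a cutoff date and test only on samples seen afterwards, so the model never gets to peek at the future. The `Sample` type, its field names, and the toy data are hypothetical, made up for illustration; this is not the authors' tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    name: str          # hypothetical identifier
    first_seen: date   # when the sample was first observed
    is_malware: bool


def temporal_split(samples, cutoff):
    """Train on samples seen before `cutoff`, test on those seen on or after it."""
    train = [s for s in samples if s.first_seen < cutoff]
    test = [s for s in samples if s.first_seen >= cutoff]
    return train, test


# Toy data, invented for the example.
samples = [
    Sample("app-v1", date(2021, 3, 1), False),
    Sample("trojan-a", date(2021, 6, 15), True),
    Sample("app-v2", date(2022, 2, 10), False),
    Sample("trojan-b", date(2022, 5, 20), True),
]

train, test = temporal_split(samples, cutoff=date(2022, 1, 1))
print([s.name for s in train])  # samples from before the cutoff
print([s.name for s in test])   # samples from after the cutoff
```

A random shuffle of the same data would happily put `trojan-b` in the training set and `app-v1` in the test set, which is exactly the kind of situation that could not have occurred in real life.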

These researchers would like to see the development of open, carefully documented, standard datasets, with timestamps. This would be analogous to the existing research datasets used to train image and language models. The timestamps are critical because it is necessary to know the temporal relationships that represent versions and adversarial tit-for-tat.

They also note that ML models will degrade over time unless retrained to reflect all the patches and new attacks. So, it is important to measure them over time, not just once. (And this means that publishing one initial good result doesn't mean the model will be useful for long.) They suggest, too, that it would be useful to automatically detect when a model is out of date to the point that it should not be used without retraining.
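
Here is a rough sketch of what "measure them over time" and "detect when a model is out of date" could look like. The function names, the 10-point tolerance, and the toy records are all my own assumptions, not anything from the article. It uses plain accuracy to keep the example short; given how rare malware is, a real evaluation would want a metric better suited to imbalanced data.

```python
from collections import defaultdict


def accuracy_by_period(records):
    """records: iterable of (period, predicted_label, true_label) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, predicted, actual in records:
        totals[period] += 1
        hits[period] += int(predicted == actual)
    # Periods are returned in sorted order (e.g. "2023-01" before "2023-02").
    return {p: hits[p] / totals[p] for p in sorted(totals)}


def first_stale_period(scores, baseline, max_drop=0.10):
    """First period scoring more than `max_drop` below `baseline`, else None."""
    for period, score in scores.items():
        if baseline - score > max_drop:
            return period
    return None


# Toy records, invented for the example: the model slowly gets worse.
records = [
    ("2023-01", True, True), ("2023-01", False, False),
    ("2023-01", True, True), ("2023-01", False, False),
    ("2023-02", True, True), ("2023-02", False, False),
    ("2023-02", False, True), ("2023-02", False, False),
    ("2023-03", False, True), ("2023-03", False, False),
    ("2023-03", False, True), ("2023-03", False, False),
]

scores = accuracy_by_period(records)
print(scores)
print(first_stale_period(scores, baseline=scores["2023-01"]))
```

The point of the second function is the one the authors make: a single good score at publication time says little, and it is the trend across periods that shows when the model needs retraining.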

I'll add one other thought.

I'm pretty sure that malware creators are already using ML to find new vulnerabilities, to automatically create code to exploit targets, and to improve obfuscation (e.g., to simulate traffic patterns and other normal behaviors).

Which means that, soon enough, if not already, there will be machine learning trying to learn to detect the malicious activities of other machine learning, and vice versa. I can't help but think of Spy vs. Spy!

Kewl!

But also, not so cool for puny humans, caught in the crossfire.

[1] Lorenzo Cavallaro, Johannes Kinder, Feargus Pendlebury, and Fabio Pierazzi, "Are Machine Learning Models for Malware Detection Ready for Prime Time?" IEEE Security & Privacy, 21(2):53-56, 2023. https://www.computer.org/csdl/magazine/sp/2023/02/10102612/1MkXXhq8D7O
