privacysavvy

Saturday, April 29, 2023

[New post] Malware Detection with Machine Learning

robertmcgrath

Apr 29

For many years, research labs have experimented with using machine learning to detect malware.  There have been quite a few successful research projects, but researchers from the UK and Germany report that these models have notably failed to transfer to real-world situations [1].  As they ask, "what's the deal?"

Much of "the deal" seems to have to do with issues in training the ML models.

Malware detection is difficult not only because of deliberate obfuscation, but because malware is generally a tiny fraction of the overall code.  (Heck, the fraction of code that actually runs is minuscule, compared to all the code sitting there ready to run.)

Reliably and precisely detecting a small target in a large pool is always hard for machine learning because it is difficult to create representative training sets. 
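
One way to see why the ratio matters is a little base-rate arithmetic.  The sketch below is mine, not from the paper, and every number in it is a hypothetical chosen for illustration: a detector that catches 99% of malware and falsely flags 1% of goodware, scored once on a balanced sample and once on a pool where only 0.1% of the code is malware.

```python
def precision_at_base_rate(base_rate, true_positive_rate, false_positive_rate):
    """Fraction of raised alarms that are really malware (Bayes' rule)."""
    true_alarms = base_rate * true_positive_rate
    false_alarms = (1.0 - base_rate) * false_positive_rate
    return true_alarms / (true_alarms + false_alarms)


# Hypothetical detector: catches 99% of malware, falsely flags 1% of goodware.
balanced = precision_at_base_rate(0.5, 0.99, 0.01)    # half the samples are malware
rare = precision_at_base_rate(0.001, 0.99, 0.01)      # 1 sample in 1000 is malware

print(f"precision with 50% malware:  {balanced:.3f}")
print(f"precision with 0.1% malware: {rare:.3f}")
```

Under these assumed rates, the same detector goes from almost every alarm being real on the balanced sample to roughly one alarm in eleven being real in the realistic pool.  A result measured on an unrepresentative mix says little about what happens in the field.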

A second problem for malware detection is that the training set is generally badly out of date.  Learning to detect malware from several years ago probably isn't relevant to detecting today's malware.

All machine learning may suffer from out-of-date training sets, but some targets such as natural language or image classification are relatively immune.  Human language changes slowly enough that a chatbot trained on examples from 2010-2019 probably still works pretty well.  But both malware and "goodware" (as these researchers term it) change rapidly.  Samples of software from 2010-2019 are embarrassingly irrelevant today.

(As an aside, I'll note that ML for real-time navigation and autonomous driving is probably vulnerable to this issue, at least more than, say, language analysis.  Driving has a lot of unchanging stuff, but there are plenty of wild cards and new stuff every year, and they are some of the most important stuff!  Training on road conditions and driving experience from 2010-2019 is only partly useful for predicting driving conditions today.  It's not surprising, then, that AI for self-driving cars has proved hard.)

This problem of a rapidly evolving training set is exacerbated by the adversarial nature of the malware game.  Not only is a lot of malware new, today's software has been modified to defeat last year's malware, and vice versa.  A machine learning model that sifts through last year's code and malware is, well, irrelevant, because everything has been patched in the meantime.

The authors note that some research projects aggregate samples across time, creating a training set that doesn't represent any situation that could have occurred in real life.  I.e., a dataset that included patched and unpatched versions of software may be easy to learn (e.g., 'unpatched is a sign of malware present'), but this is not very useful in the real world.
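
A minimal sketch of the alternative, assuming each sample carries a first-seen date: split by time, so the model is only ever tested on software that appeared after everything it was trained on.  The sample records, field names, and cutoff below are hypothetical, invented for illustration.

```python
from datetime import date


def temporal_split(samples, cutoff):
    """Train on samples first seen before the cutoff; test on the rest."""
    train = [s for s in samples if s["first_seen"] < cutoff]
    test = [s for s in samples if s["first_seen"] >= cutoff]
    return train, test


# Hypothetical samples; a real dataset would also carry features.
samples = [
    {"name": "app_a", "first_seen": date(2021, 3, 1), "malware": False},
    {"name": "app_b", "first_seen": date(2021, 9, 15), "malware": True},
    {"name": "app_c", "first_seen": date(2022, 2, 10), "malware": False},
    {"name": "app_d", "first_seen": date(2022, 6, 30), "malware": True},
]

train, test = temporal_split(samples, cutoff=date(2022, 1, 1))
print([s["name"] for s in train])
print([s["name"] for s in test])
```

A random shuffle of the same records could put app_d in the training set and app_a in the test set, which amounts to letting the model peek at the future.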

These researchers would like to see the development of open, carefully documented, standard datasets, with timestamps.  This would be analogous to existing research datasets for training image and language models.  The timestamps are critical because it is necessary to know the temporal relationships that represent versions and adversarial tit-for-tat.

They also note that ML models will degrade over time unless retrained to reflect all the patches and new attacks.  So, it is important to measure them over time, not just once.  (And this means that publishing one initial good result doesn't mean the model will be useful for long.) They suggest, too, that it would be useful to automatically detect when a model is out of date to the point that it should not be used without retraining.
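
Here is a rough sketch of what measuring over time might look like.  It is my own illustration, not the authors' method: it assumes true labels eventually become available for each period, and the quarterly results and the 0.7 threshold are made up.

```python
from collections import defaultdict


def f1_score(records):
    """F1 over (is_malware, flagged) pairs; 0.0 if nothing was caught."""
    tp = sum(1 for truth, pred in records if truth and pred)
    fp = sum(1 for truth, pred in records if not truth and pred)
    fn = sum(1 for truth, pred in records if truth and not pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_period(results):
    """Group (period, is_malware, flagged) triples and score each period."""
    by_period = defaultdict(list)
    for period, truth, pred in results:
        by_period[period].append((truth, pred))
    return {period: f1_score(recs) for period, recs in sorted(by_period.items())}


def first_stale_period(scores, min_f1):
    """Earliest period whose score falls below min_f1, or None."""
    for period, score in scores.items():
        if score < min_f1:
            return period
    return None


# Hypothetical results for one model, trained once and never retrained.
results = [
    ("2023-Q1", True, True), ("2023-Q1", True, True),
    ("2023-Q1", False, False), ("2023-Q1", False, False),
    ("2023-Q2", True, True), ("2023-Q2", True, False),
    ("2023-Q2", False, False), ("2023-Q2", False, False),
    ("2023-Q3", True, False), ("2023-Q3", True, False),
    ("2023-Q3", False, True), ("2023-Q3", False, False),
]

scores = f1_by_period(results)
stale = first_stale_period(scores, min_f1=0.7)
print(scores)
print("retrain before using past:", stale)
```

In this toy data the model looks perfect in the first quarter and is already below the threshold by the second.  The hard part in practice is the assumption I waved away: getting trustworthy labels soon enough to notice.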


I'll add one other thought. 

I'm pretty sure that malware creators are already using ML to find new vulnerabilities, to automatically create code to exploit targets, and to improve obfuscation (e.g., to simulate traffic patterns and other normal behaviors).

Which means that, soon enough, if not already, there will be machine learning trying to learn to detect the malicious activities of other machine learning, and vice versa.  I can't help but think of Spy vs. Spy!

Kewl!

But also, not so cool for puny humans, caught in the crossfire.


  1. Lorenzo Cavallaro, Johannes Kinder, Feargus Pendlebury, and Fabio Pierazzi, "Are Machine Learning Models for Malware Detection Ready for Prime Time?" IEEE Security & Privacy, 21(2):53-56, 2023. https://www.computer.org/csdl/magazine/sp/2023/02/10102612/1MkXXhq8D7O

Original post: https://robertmcgrath.wordpress.com/2023/04/29/malware-detection-with-machine-learning/
