Notes from GT-IISP’s Cyber Security Lecture Series: Automatic Feature Engineering: Learning to Detect Malware by Mining the Scientific Literature


As a Georgia Tech OMSCS student and working software professional, I always want to learn more about advanced security topics. Georgia Tech’s Institute for Information Security & Privacy is presenting a weekly Cybersecurity Lecture Series on Fridays this fall, and since I live nearby, I’ve started attending. Here are my quick (and not necessarily complete) notes from this week’s presentation by Tudor Dumitraș, Assistant Professor at the University of Maryland.

Security and Machine Learning

  • ML is used to classify malicious vs. benign software
  • Used for detecting spam, phishing, malware, network attacks, data breaches, etc.

How to define similarity

  • Forecasting exploits with Twitter content + analytics
    • Compare information dissemination and vulnerability characteristics
    • Need domain-specific features, outside of Twitter
  • Detecting malware delivery on the client side
  • Android malware detection
    • How should we compare samples?
    • Permissions
      • Protect sensitive data + functionality
      • Permission-based features do NOT work for privilege escalation
    • API method calls
      • Reveal malware behaviors
  • Feature engineering
    • Use domain knowledge to identify useful features
    • Must consider threat semantics
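One way to picture "how should we compare samples?" is to treat each app as a set of permission and API-call features. The sketch below is my own illustration, not something shown in the talk: the app names, the `perm:`/`api:` prefixes, and the choice of Jaccard similarity are all assumptions.

```python
from typing import Dict, List, Set


def build_vocabulary(samples: Dict[str, Set[str]]) -> List[str]:
    """Collect every feature seen in any sample, in sorted order."""
    return sorted(set().union(*samples.values()))


def to_vector(features: Set[str], vocabulary: List[str]) -> List[int]:
    """Encode one app as a binary vector: 1 if the feature is present."""
    return [1 if name in features else 0 for name in vocabulary]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity of two apps: shared features divided by all features."""
    if not a and not b:
        return 1.0  # two empty feature sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical apps; the feature names are illustrative only.
samples = {
    "app_a": {"perm:READ_PHONE_STATE", "perm:INTERNET", "api:getDeviceId"},
    "app_b": {"perm:READ_PHONE_STATE", "perm:SEND_SMS",
              "api:getDeviceId", "api:sendTextMessage"},
    "app_c": {"perm:INTERNET"},
}

vocabulary = build_vocabulary(samples)
vectors = {name: to_vector(feats, vocabulary) for name, feats in samples.items()}
similarity_ab = jaccard(samples["app_a"], samples["app_b"])
```

The binary vectors are what a classifier would consume; the set similarity is just a quick way to see which apps share behavior-relevant features.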

The security body of knowledge

  • Growing volume of papers, industry reports, blogs, etc
  • Dilemma!
    • Growing body of knowledge vs need for good features
  • Can we engineer features automatically by mining security documents?

Security threats in natural language

  • Understanding the semantic meaning requires common sense and knowledge of the security domain
  • Every year a roughly constant number of new security terms is introduced in the literature (IEEE Symposium on Security and Privacy, 2008-2016)
  • Security arms race
  • Must discover open-ended behaviors

Intuition for automatic feature engineering

  • Map suspicious API calls to known behaviors, example:
    • getDeviceId, getSubscriberId (access sensitive data)
    • execHttpRequest (connect over network)
  • Extract subject-verb-object triples to identify behaviors
    • Create smaller sentences from complex sentences
    • Link behaviors to concrete features
      • “API calls for accessing sensitive data, such as getDeviceId()”
      • “accessing sensitive data” -> “getDeviceId()”
    • Link behaviors to malware
      • “ZSone malware is designed to send SMS messages to premium numbers”
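To make the two linking steps above concrete, here is a rough sketch that handles only the two example sentences from the talk. It uses plain string patterns as a stand-in; a real system would presumably rely on a proper NLP parser, and the function names and regular expressions here are my own assumptions.

```python
import re
from typing import List, Optional, Tuple

# A "concrete feature" is approximated as an identifier followed by "()".
FEATURE_RE = re.compile(r"\b([A-Za-z_]\w*)\(\)")
# Leading phrases that introduce a behavior but are not part of it (assumed list).
PREFIX_RE = re.compile(r"^(?:API calls|methods|functions) (?:for|used for) ",
                       re.IGNORECASE)
# Matches sentences shaped like "<Name> malware is designed to <behavior>".
MALWARE_RE = re.compile(
    r"^(?P<name>\w+) malware is designed to (?P<behavior>.+?)\.?$")


def link_behavior_to_features(sentence: str) -> List[Tuple[str, str]]:
    """Return (behavior, feature) pairs from '<behavior>, such as <features>'."""
    head, sep, tail = sentence.partition(", such as ")
    if not sep:
        return []  # the sentence does not follow the assumed pattern
    behavior = PREFIX_RE.sub("", head.strip()).strip()
    return [(behavior, name) for name in FEATURE_RE.findall(tail)]


def link_malware_to_behavior(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (malware, behavior) if the sentence matches, else None."""
    match = MALWARE_RE.match(sentence.strip())
    if match is None:
        return None
    return match.group("name"), match.group("behavior")


pairs = link_behavior_to_features(
    "API calls for accessing sensitive data, such as getDeviceId()")
family = link_malware_to_behavior(
    "ZSone malware is designed to send SMS messages to premium numbers")
```

Here `pairs` holds the behavior-to-feature link and `family` the malware-to-behavior link, the two kinds of relations the notes describe.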

Semantic networks

  • Nodes: security concepts
    • Malware families: named entities
    • Concrete features: named entities
    • Behaviors: open ended
  • Edges: semantically related concepts
    • Weighted
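A semantic network like this can be held in a small graph structure with typed nodes and weighted, undirected edges. The class below is a minimal sketch of my own; in particular, the rule that repeated mentions add to an edge's weight is an assumption, since the talk did not spell out how weights are computed.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

NODE_KINDS = {"malware", "behavior", "feature"}


class SemanticNetwork:
    """Typed nodes joined by weighted, undirected edges."""

    def __init__(self) -> None:
        self.kind: Dict[str, str] = {}
        self.weight: Dict[str, Dict[str, float]] = defaultdict(dict)

    def add_node(self, name: str, kind: str) -> None:
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kind[name] = kind

    def add_edge(self, a: str, b: str, weight: float = 1.0) -> None:
        """Add weight to the edge a-b (assumed rule: repeated mentions accumulate)."""
        for node in (a, b):
            if node not in self.kind:
                raise KeyError(node)
        total = self.weight[a].get(b, 0.0) + weight
        self.weight[a][b] = total
        self.weight[b][a] = total

    def neighbors(self, name: str, kind: str) -> List[Tuple[str, float]]:
        """Neighbors of the given kind, heaviest edge first."""
        found = [(n, w) for n, w in self.weight[name].items()
                 if self.kind[n] == kind]
        return sorted(found, key=lambda item: (-item[1], item[0]))


# Hypothetical contents, loosely based on the examples in these notes.
net = SemanticNetwork()
net.add_node("ZSone", "malware")
net.add_node("send SMS messages", "behavior")
net.add_node("access sensitive data", "behavior")
net.add_node("sendTextMessage", "feature")
net.add_node("getDeviceId", "feature")
net.add_edge("ZSone", "send SMS messages")
net.add_edge("ZSone", "send SMS messages")  # second mention raises the weight
net.add_edge("send SMS messages", "sendTextMessage")
net.add_edge("access sensitive data", "getDeviceId")
```

Keeping the node kind separate from the edges makes it easy to ask questions such as "which behaviors are linked to this malware family?" without scanning the whole graph.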

Construction of the network (FeatureSmith)

  • Android docs -> features
  • Literature -> behaviors -> weighted behaviors
  • Malware names -> malware
  • End product is Explanations
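As a rough illustration of how such a pipeline could end in explanations, the sketch below ranks features for a malware family by following malware -> behavior -> feature paths and keeps the strongest path as a human-readable explanation. The scoring rule (multiply the two edge weights, sum over paths) and all of the weights are my own assumptions, not FeatureSmith's actual method.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, Dict[str, float]]


def rank_features(malware: str,
                  malware_to_behaviors: Weights,
                  behavior_to_features: Weights) -> List[Tuple[str, float, str]]:
    """Return (feature, score, explanation) tuples, highest score first."""
    totals: Dict[str, float] = {}
    best: Dict[str, Tuple[float, str]] = {}  # feature -> (path weight, behavior)
    for behavior, w1 in malware_to_behaviors.get(malware, {}).items():
        for feature, w2 in behavior_to_features.get(behavior, {}).items():
            path = w1 * w2  # assumed scoring rule: product of edge weights
            totals[feature] = totals.get(feature, 0.0) + path
            if feature not in best or path > best[feature][0]:
                best[feature] = (path, behavior)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(feature, score, f"{malware} -> {best[feature][1]} -> {feature}")
            for feature, score in ranked]


# Hypothetical weights for illustration only.
malware_to_behaviors = {
    "ZSone": {"send SMS messages": 0.9, "access sensitive data": 0.4},
}
behavior_to_features = {
    "send SMS messages": {"sendTextMessage": 0.8},
    "access sensitive data": {"getDeviceId": 0.7, "getSubscriberId": 0.5},
}

ranking = rank_features("ZSone", malware_to_behaviors, behavior_to_features)
for feature, score, explanation in ranking:
    print(f"{score:.2f}  {explanation}")
```

The explanation string is the point: each suggested feature comes with the behavior that connects it to the malware family, so an analyst can judge whether the link makes sense.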


  • Analyzed 1068 security papers
  • Automatically engineered 195 features relevant to Android malware
    • Out of 383 found in the papers
  • Used Drebin as ground truth (manually engineered)
  • Compared the performance of FeatureSmith vs. Drebin
    • Random forest used for classification
    • Same corpus of benign and malicious apps
    • Same feature types
  • FeatureSmith actually discovered new features not found in Drebin

More Information