As a Georgia Tech OMSCS student and working software professional, I always want to learn more about advanced security topics. Georgia Tech’s Institute for Information Security & Privacy is presenting a weekly Cybersecurity Lecture Series on Fridays this fall, and as a local, I’ve started attending. Here are my quick (and not necessarily complete) notes from this week’s presentation by Tudor Dumitraș, Assistant Professor at the University of Maryland.

Security and Machine Learning

ML is used to classify malicious vs. benign software
Also used for detecting spam, phishing, malware, network attacks, data breaches, etc.

How to define similarity

Forecasting exploits with Twitter content + analytics
- Compare information dissemination and vulnerability characteristics
- Need domain-specific features, outside of Twitter
- http://ter.ps/sec15exploit
Detecting malware delivery on the client side
- Reconstruct who-downloads-whom graph
- Identify malicious vs benign download activity
- http://ter.ps/ccs15dropper
Android malware detection
- How should we compare samples?
- Permissions
  - Protect sensitive data + functionality
  - DOES NOT work for privilege escalation
- API method calls
  - Reveal malware behaviors
Feature engineering
- Use domain knowledge to identify useful features
- Must consider threat semantics
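
To make the "how should we compare samples?" question concrete, here is a minimal sketch of one common option: treat each app as a set of features (permissions and API calls) and compare sets with Jaccard similarity. This is my own illustration, not necessarily the measure used in the talk, and the sample apps and the `perm:`/`api:` naming are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical apps described by the permissions and API calls they use.
sample_a = {"perm:SEND_SMS", "perm:READ_PHONE_STATE", "api:getDeviceId"}
sample_b = {"perm:SEND_SMS", "perm:INTERNET", "api:getDeviceId", "api:getSubscriberId"}
sample_c = {"perm:INTERNET", "perm:CAMERA"}

sim_ab = jaccard(sample_a, sample_b)  # shares two features with sample_a
sim_ac = jaccard(sample_a, sample_c)  # shares nothing with sample_a
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

The choice of features matters more than the metric here: two apps can look similar on permissions alone while behaving very differently, which is why the notes stress domain knowledge and threat semantics.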

The security body of knowledge

Growing volume of papers, industry reports, blogs, etc
Dilemma!
- Growing body of knowledge vs need for good features
Can we engineer features automatically by mining security documents?

Security threats in natural language

Understanding the semantic meaning requires common sense and knowledge of the security domain.
Every year, a roughly constant number of new security terms is introduced in the literature (IEEE Symposium on Security and Privacy, 2008-2016)
Security arms race
Must discover open-ended behaviors

Intuition for automatic feature engineering

Map suspicious API calls to known behaviors, example:
- getDeviceId, getSubscriberId (access sensitive data)
- execHttpRequest (connect over network)
Extract subject-verb-object triples to identify behaviors
- Create smaller sentences from complex sentences
- Link behaviors to concrete features
  - “API calls for accessing sensitive data, such as getDeviceId()”
  - “accessing sensitive data” -> “getDeviceId()”
- Link behaviors to malware
  - “ZSone malware is designed to send SMS messages to premium numbers”
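
The two ideas above can be sketched roughly as follows. The lookup table uses the API calls and behaviors from the notes. The regular expression is only a toy stand-in for real subject-verb-object parsing, and it assumes sentences shaped like the "such as" example. The function names are my own.

```python
import re

# API call -> behavior, taken from the examples in the notes.
KNOWN_BEHAVIORS = {
    "getDeviceId": "access sensitive data",
    "getSubscriberId": "access sensitive data",
    "execHttpRequest": "connect over network",
}


def behaviors_for(api_calls):
    """Return the sorted set of known behaviors implied by a list of API calls."""
    return sorted({KNOWN_BEHAVIORS[c] for c in api_calls if c in KNOWN_BEHAVIORS})


# Toy pattern (an assumption, not real NLP): "... for <behavior>, such as <name>()".
SUCH_AS = re.compile(r"for (?P<behavior>[^,]+), such as (?P<feature>\w+\(\))")


def link_behavior_to_feature(sentence):
    """Return (behavior, feature) if the sentence matches the toy pattern, else None."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return None
    return match.group("behavior").strip(), match.group("feature")


found = behaviors_for(["getDeviceId", "execHttpRequest", "someOtherCall"])
link = link_behavior_to_feature(
    "API calls for accessing sensitive data, such as getDeviceId()"
)
print(found)
print(link)
```

A real system would need a parser that can split complex sentences and handle open-ended phrasing, which a single regular expression cannot do.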

Semantic networks

Nodes: security concepts
- Malware families: named entities
- Concrete features: named entities
- Behaviors: open ended
Edges: semantically related concepts
- Weighted
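
As a rough picture of the data structure, here is a minimal sketch of a weighted, undirected graph whose nodes are tagged as malware, behavior, or feature. The class name, method names, and all edge weights are hypothetical. The node labels come from the examples in these notes.

```python
from collections import defaultdict


class SemanticNetwork:
    """Undirected weighted graph of security concepts (illustrative only)."""

    def __init__(self):
        self.kind = {}                  # node -> "malware" | "behavior" | "feature"
        self.edges = defaultdict(dict)  # node -> {neighbor: weight}

    def add_node(self, name, kind):
        self.kind[name] = kind

    def add_edge(self, a, b, weight):
        # Store both directions so lookups work from either endpoint.
        self.edges[a][b] = weight
        self.edges[b][a] = weight

    def neighbors(self, name, kind=None):
        """Neighbors by descending weight, optionally filtered by node kind."""
        items = [
            (n, w)
            for n, w in self.edges.get(name, {}).items()
            if kind is None or self.kind.get(n) == kind
        ]
        return sorted(items, key=lambda nw: (-nw[1], nw[0]))


net = SemanticNetwork()
net.add_node("ZSone", "malware")
net.add_node("send SMS messages to premium numbers", "behavior")
net.add_node("accessing sensitive data", "behavior")
net.add_node("getDeviceId", "feature")
net.add_node("getSubscriberId", "feature")

# Hypothetical weights; in the talk these are derived from the literature.
net.add_edge("ZSone", "send SMS messages to premium numbers", 0.8)
net.add_edge("accessing sensitive data", "getDeviceId", 0.9)
net.add_edge("accessing sensitive data", "getSubscriberId", 0.6)

print(net.neighbors("accessing sensitive data", kind="feature"))
```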

Construction of the network (FeatureSmith)

Android docs -> features
Literature -> behaviors -> weighted behaviors
Malware names -> malware
The end product is a set of explanations
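
One way to picture how such a pipeline could turn weighted links into ranked features with explanations is sketched below. The scoring rule (sum over behaviors of malware-to-behavior weight times behavior-to-feature weight) is my own assumption, not FeatureSmith's actual formula. All weights are made up, and `sendTextMessage` is an illustrative feature that was not mentioned in the talk.

```python
# Hypothetical weights; the real ones would come from mining the literature.
malware_to_behavior = {
    "ZSone": {
        "send SMS messages to premium numbers": 0.8,
        "accessing sensitive data": 0.3,
    },
}
behavior_to_feature = {
    "send SMS messages to premium numbers": {"sendTextMessage": 0.9},
    "accessing sensitive data": {"getDeviceId": 0.9, "getSubscriberId": 0.6},
}


def rank_features(malware):
    """Return (feature, score, strongest linking behavior), best score first."""
    scores = {}
    because = {}  # feature -> (behavior, contribution) with the largest contribution
    for behavior, w_mb in malware_to_behavior.get(malware, {}).items():
        for feature, w_bf in behavior_to_feature.get(behavior, {}).items():
            contribution = w_mb * w_bf
            scores[feature] = scores.get(feature, 0.0) + contribution
            if feature not in because or contribution > because[feature][1]:
                because[feature] = (behavior, contribution)
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return [(f, scores[f], because[f][0]) for f in ranked]


ranked = rank_features("ZSone")
for feature, score, behavior in ranked:
    # The linking behavior serves as a human-readable explanation.
    print(f"{feature}: linked to ZSone via '{behavior}' (score {score:.2f})")
```

Keeping the linking behavior next to each score is what makes the output an explanation and not just a ranked list.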

Validation

Analyzed 1068 security papers
Automatically engineered 195 features relevant to Android malware
- Out of 383 found in the papers
Used Drebin's manually engineered features as ground truth
Compared the performance of FeatureSmith vs Drebin
- Random forest used for classification
- Same corpus of benign and malicious apps
- Same feature types
FeatureSmith actually discovered new features not found in Drebin

More Information

http://ter.ps/featuresmith

Notes from GT-IISP’s Cyber Security Lecture Series: Automatic Feature Engineering: Learning to Detect Malware by Mining the Scientific Literature
