Vulnerability Response Guide


Well… it happened… again…

<Insert Beercules meme from How I Met Your Mother>

Only 3 things in life are guaranteed:

  • Death
  • Taxes
  • A new critical vuln with a cool name will ruin your weekend at least once a year

Incident Response plans for “we got hacked” and “there was an earthquake” are fairly common, but I’ve yet to find a general guide for “how to handle the hot new vuln impacting 99% of my infra”. This post aims to be that guide and points out the common pitfalls I’ve encountered along the way.

Phase 0: Vulnerability Announced

Marketing teams and attorneys tend to over-use the phrase “zero day”… so I’ll avoid it here. When a new high profile/critical vulnerability is announced, I start by asking the following questions:

  • What software/OS versions are impacted?
  • Is there a specific config that must exist for the exploit to work?
  • Is this “just a DoS for now”?

The first two make sense… they inform my follow-up questions:

  • How do I find all the matching systems, and their owners?
  • How do I verify a host/application is at risk?

But the last one is important when it comes to “playing the long game”. If my company doesn’t have a hard requirement for 99.999% uptime, and doesn’t have a history of people attempting to DoS us… my team shouldn’t be losing sleep over patching/chasing down the issue. One thing to keep in mind: “this vulnerability existed yesterday, we just didn’t know about it.”

… <climbs off soapbox> …

With those 3 questions answered, it’s time to tighten up my processes. We really need 3 things:

  • A way to detect if something is at risk
  • A way to kickoff and track efforts to fix it
  • A way to report our progress to the rest of the company

Phase 1: Detect if something is at risk

Don’t worry, your CISO’s inbox is likely full of vendor emails offering to help you solve the new “ZERO DAY VULNERABILITY!!!”. You’ll find a tool to do the actual detection. That said, keep the following questions in mind when selecting one:

How does the detection work?

The most basic checks are often “does this filename exist on the host”, or “run this command to get your software version” (a minimal sketch of one such check follows the list below). These types of checks aren’t great, because engineers love pointing out edge cases…

  • “The file exists, but we don’t use it…”
  • “That’s a dependency of a dependency, and we can’t fix it yet”
  • “This doesn’t apply to our configuration… read the CVE, security people!”
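
For illustration, here’s a minimal sketch of what one of those “basic” checks boils down to. The indicator filename and scan root are made up; nothing here is tied to a real scanner:

```python
from pathlib import Path

# Hypothetical "does this filename exist on the host" style check.
# Presence of the file only proves the artifact is on disk, not that the
# vulnerable code path is reachable, which is exactly why engineers push back.

SUSPECT_JAR = "log4j-core-2.14.1.jar"  # made-up indicator filename


def host_flagged(scan_root: str = "/opt") -> bool:
    """Flag the host if any file matching the indicator name exists under scan_root."""
    return any(Path(scan_root).rglob(SUSPECT_JAR))


if __name__ == "__main__":
    # "The file exists, but we don't use it..." -- this check can't tell the difference.
    print("AT RISK" if host_flagged() else "no indicator file found")
```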

How often does the scan run?

As an engineer, one of the most frustrating things in the world is when I take time to solve a problem, but no one uses my solution. Worse still is when I solve the problem, but people still complain “X is broken” without actually checking to see if I fixed it. We need to understand the timing of our scanning and reporting tools, and set expectations accordingly. At the beginning of your engagement, you should share with the organization:

  • How often you will rescan for the vuln
  • How they can self-service a rescan (if possible)
  • How long it will take for reporting to indicate something has been fixed/mitigated/marked as false positive (a quick arithmetic sketch follows this list).
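
None of this needs to be fancy; the point is to do the arithmetic once and publish it. A back-of-the-envelope sketch, with assumed cadences you’d swap for whatever your scanner and dashboard actually do:

```python
# Assumed cadences; replace with your own tooling's real numbers.
SCAN_INTERVAL_HOURS = 24     # assumption: the scanner rescans nightly
REPORT_REFRESH_HOURS = 6     # assumption: the dashboard rebuilds every 6 hours


def worst_case_hours_until_green() -> int:
    """If a team patches right after a scan starts, how long until reporting shows it?"""
    return SCAN_INTERVAL_HOURS + REPORT_REFRESH_HOURS


print(f"A fix may take up to {worst_case_hours_until_green()} hours to show as resolved.")
```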

What metadata does the tool provide to aid in tracking/deduplication?

Most security tools don’t take into account the operational nature of the work they create. “Found a bug, that’s a finding”… You end up with 5,000 tickets for something that only needs to be fixed in a single place. Consider aggregating findings at the highest level possible for the work that needs to happen (a sketch of this grouping follows the list):

  • Finding per file/entry in a file is generally a bad idea
  • Assuming a single team manages a repository, aggregate findings at the repository level. If it’s a cloud resource, aggregate them at the service-tag or subscription level.
  • Don’t forget that different tools may report the same finding; consolidate those if possible.
    • Example: Dependency Analysis tools will find that unsafe version of Guava… but so will container registry scanning tools… and if you’re really lucky your static code analysis tool also does dependency analysis… If you can map resources back to the repo that creates them, you can group all of these findings into a single “fix this file” task.
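
A minimal sketch of that grouping, assuming every finding can be mapped back to the repository that produces the affected artifact. The tool names, repo names, and CVE id below are all made up:

```python
from collections import defaultdict

# Raw findings as three different tools might report them; field names are illustrative.
findings = [
    {"tool": "dependency-scan", "repo": "payments-api",  "cve": "CVE-2021-XXXX", "resource": "pom.xml"},
    {"tool": "container-scan",  "repo": "payments-api",  "cve": "CVE-2021-XXXX", "resource": "registry/payments:1.2"},
    {"tool": "sast",            "repo": "payments-api",  "cve": "CVE-2021-XXXX", "resource": "pom.xml"},
    {"tool": "container-scan",  "repo": "reporting-svc", "cve": "CVE-2021-XXXX", "resource": "registry/reporting:9"},
]

# One work item per (repo, CVE) instead of one ticket per raw finding.
work_items = defaultdict(list)
for finding in findings:
    work_items[(finding["repo"], finding["cve"])].append(finding)

for (repo, cve), evidence in work_items.items():
    tools = sorted({e["tool"] for e in evidence})
    print(f"{repo}: fix {cve} once (reported by {', '.join(tools)})")
```

Two work items instead of four raw findings; in real life the ratio is usually far more dramatic.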

How do we know we aren’t closing things that aren’t really closed…

This is a really fun one… Take care not to assume that “if it’s not in today’s list, it must be solved”. Over the long term (and in lower-stakes situations), it’s a reasonable assumption. However, keep in mind the following edge cases (a small guardrail sketch follows the list):

  • We keep this server turned off because it’s expensive; it only turns on Thursday and Friday to run reports.
  • Data science folks love Jupyter Notebook, but because of The Billing Sev of 2010, we automatically shut those down after 15 mins of inactivity. They’ll turn back on after lunch.
  • The vendor is swamped because everyone is doing vuln scans; our scan failed and returned 0 results. That never happens, so our automation doesn’t know how to handle “no vulns found”.
  • SRE noticed our scans were slowing down prod a lot more than usual, so they disabled our scanning account.
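
A small guardrail for the auto-closing logic, assuming you keep yesterday’s count around and know whether the scan job itself succeeded. The thresholds are arbitrary placeholders; tune them to your environment:

```python
MIN_EXPECTED_RESULTS = 1   # assumed floor for a "healthy" scan during an active vuln
MAX_DROP_RATIO = 0.5       # assumed: a >50% day-over-day drop deserves a human look


def safe_to_auto_close(today_count: int, yesterday_count: int, scan_succeeded: bool) -> bool:
    """Only trust "it's gone from the list" when the scan itself looks healthy."""
    if not scan_succeeded:
        return False
    if today_count < MIN_EXPECTED_RESULTS and yesterday_count > 0:
        # Zero findings right after a noisy week is more likely a broken scan
        # (disabled account, vendor outage, powered-off hosts) than a miracle.
        return False
    if yesterday_count and (yesterday_count - today_count) / yesterday_count > MAX_DROP_RATIO:
        return False
    return True


print(safe_to_auto_close(today_count=0,   yesterday_count=412, scan_succeeded=True))   # False
print(safe_to_auto_close(today_count=380, yesterday_count=412, scan_succeeded=True))   # True
```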

How do we not create 500 tickets for 5 things?

  • Any new automation you build during this effort is likely full of bugs and lacking tests (No judgement, that’s most things I write…). At a minimum, spend the time to emit metrics throughout the ticketing pipeline, then monitor them at reasonable intervals.
  • Good metrics for this type of work (a sketch of emitting them follows this list):
    • # Results reported by scanning API
    • # Results saved to disk
    • Job Run Time
    • # of findings opened by automation
    • # of findings closed by automation
    • # findings marked as false positive
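
A bare-bones version of “emit metrics throughout the pipeline”, using a plain counter and a log line. In practice you’d push these to whatever metrics backend you already have; the pipeline steps and counts below are placeholders:

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
metrics = Counter()


def record(metric: str, value: int = 1) -> None:
    """Bump a named counter; swap this out for statsd/Prometheus/etc. in real life."""
    metrics[metric] += value


def run_pipeline(api_results: list) -> None:
    record("results_reported_by_api", len(api_results))
    # ... persist to disk, dedupe, open/close tickets ...
    record("results_saved_to_disk", len(api_results))
    record("findings_opened_by_automation", 2)        # placeholder outcomes
    record("findings_closed_by_automation", 1)
    record("findings_marked_false_positive", 1)


run_pipeline(api_results=[{"id": i} for i in range(5)])
for name, value in sorted(metrics.items()):
    logging.info("%s=%d", name, value)
# A sudden drop in results_reported_by_api, or a spike in findings opened,
# is your early warning before 500 tickets land on someone's board.
```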

Speaking of false positives: Plan for them, and for the decisions the business may make about “accepted risks”.

  • I like to keep a table of asset:findingId that I check before cutting new tickets
  • I use Jira labels like “Snooze:”, “MarkedAsDone”, and “RiskAccepted” to filter my reports (a sketch of both checks follows).
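
A sketch of both checks: the asset:findingId suppression table consulted before cutting a ticket, and the label filter applied when building reports. The storage (a plain set) and the exact label matching are assumptions; the label names are the ones above, and the asset/finding names are made up:

```python
# Suppression table of (asset, finding_id) pairs that have already been triaged.
SUPPRESSED = {
    ("payments-api", "CVE-2021-XXXX"),      # risk accepted by the service owner
    ("legacy-batch-host", "CVE-2021-XXXX"), # confirmed false positive
}


def should_cut_ticket(asset: str, finding_id: str) -> bool:
    """Skip ticket creation for anything the business has already triaged."""
    return (asset, finding_id) not in SUPPRESSED


def include_in_report(ticket_labels: set) -> bool:
    """Drop tickets carrying a triage label ("Snooze:<date>" is matched by prefix)."""
    for label in ticket_labels:
        if label in {"MarkedAsDone", "RiskAccepted"} or label.startswith("Snooze:"):
            return False
    return True


print(should_cut_ticket("payments-api", "CVE-2021-XXXX"))    # False: already accepted
print(include_in_report({"vuln-response", "RiskAccepted"}))  # False: filtered from report
```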

Phase 2: Kickoff and Track Efforts

Now that we have a solid set of findings, it’s time to get to work. Key points for this phase of the engagement:

  • Don’t forget to set expectations on scan/report frequency. You MUST know these things before starting Phase 2.

  • Let the risk do the talking. If you find yourself saying “because security”, you didn’t collect enough data, or you’re pushing for the wrong decision. Engineering teams likely understand how their platform works better than security folks do, so walk through the attack path with hesitant teams, seeking to understand whether the CVSS risk aligns with the real world.

  • Action, Owner, Date. For each finding, identify the proper owner of the fix action, and get their commitment to completing it by a specific date (a tiny tracking sketch follows this list). This helps in a few ways:

    • Leaders are now accountable for their commitments, and the bystander effect is eliminated.
    • You no longer have to track/review every finding… leaders can track and report on their progress towards their commitments; your data is just there to show the facts as they change.
  • Set up a recurring meeting with all leaders who own incomplete actions. Again… there are multiple benefits to a “group huddle”:

    • Strategies used by one team may benefit another; ask about challenges the teams are facing, and they’ll help each other.
    • You’ll get more feedback when you’re outnumbered by non-security folks (which is a good thing. Feedback is a gift :p )
  • Look for opportunities to… Do LESS? Oftentimes we’re planning on sunsetting an API/service, and no one wants to fix the problem. If deprecation is already planned, ask the team if it’s worth just allocating more resources to that instead. If a team has plans to move from an image they maintain to a centrally supported image, offer to let them migrate instead of patching. Big wins are hiding somewhere; you just have to put down the “security says” sign and look for them.
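
The tracking sketch mentioned above: nothing more than a list of commitments with an owner and a due date, and a check for the overdue ones. Names and dates here are made up, and the hard-coded “today” is only there to keep the example repeatable:

```python
from dataclasses import dataclass
from datetime import date

# A tiny "Action, Owner, Date" tracker; replace with your ticketing system of choice.

@dataclass
class Commitment:
    finding: str
    action: str
    owner: str
    due: date
    done: bool = False


commitments = [
    Commitment("payments-api / CVE-2021-XXXX", "bump dependency and redeploy", "J. Leader", date(2024, 1, 19)),
    Commitment("reporting-svc / CVE-2021-XXXX", "migrate to central base image", "A. Director", date(2024, 2, 2), done=True),
]

today = date(2024, 1, 25)  # pinned for the example; use date.today() in practice
for c in commitments:
    if not c.done and c.due < today:
        print(f"OVERDUE: {c.owner} committed to '{c.action}' for {c.finding} by {c.due}")
```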

Phase 3: Reporting our progress

I’m out of steam… look for part 2… eventually…