3 parameters to measure SAST testing
Asaf Biton
Shani Gal
August 3, 2021
In our previous blog on why you can’t compare SAST tools using only lists, test suites, and benchmarks, we explored the various tools and metrics commonly used today to assess and compare SAST testing tools. We also looked at a few reasons why those tools might produce inconsistent results and might not be at all reliable for the purpose of assessing a SAST testing tool.
Instead, when assessing a SAST testing tool, there are 3 parameters you will want to consider:
accuracy
completeness
any unique additional values
In this blog, we are going to explore those parameters and look at ways to measure them. When assessing a SAST testing tool, there are two relevant types of measurements — quantitative (meaning the number of results versus “noise”) and qualitative (specifically language depth and support).
Quantitative aspects
The following definitions of accuracy and completeness can seem a bit complex at first because they are two sides of the same coin. According to Rice’s Theorem, it is mathematically impossible to have a perfect static program analysis. One might think that increasing the number of suggestions would find all possible issues. Sadly, this also increases the number of false positives (FPs) to a level where the noise makes the results impossible to work with. There are some tricks SAST testing vendors can use to improve the results, but a perfect analysis remains mathematically impossible.
Accuracy
In the context of SAST testing, accuracy is loosely defined as having the highest number of TPs (true positives, findings that are actual issues) while keeping the number of FPs (findings that are not vulnerabilities and are therefore wrong) as low as possible.
Accuracy is especially important. A high accuracy rate means we get more actionable results, and less “noise” (irrelevant, unactionable reports). “Noise” is also the number one factor that deters developers from using SAST testing products, which is why the higher the accuracy, the more satisfying the overall developer experience will be.
In order to calculate accuracy, you will first need to triage the results. The formula is then TP*100/(TP+FP), a ratio more commonly known as precision. This will yield a number between 0 and 100. The higher the number, the higher the accuracy. For example, a tool that finds 140 TPs and 40 FPs would have an accuracy rate of 77.8%.
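As a minimal sketch, the calculation can be written in a few lines of Python. The function name `accuracy_rate` is an illustrative choice rather than part of any tool, and the counts are assumed to come from your own triage.

```python
def accuracy_rate(true_positives: int, false_positives: int) -> float:
    """Return TP*100/(TP+FP): the share of reported findings that are real issues."""
    total_findings = true_positives + false_positives
    if total_findings == 0:
        # The tool reported nothing, so the rate is undefined.
        raise ValueError("no findings to evaluate")
    return true_positives * 100 / total_findings


# The example from the text: 140 TPs and 40 FPs (77.77...%, rounded to one decimal).
print(f"{accuracy_rate(140, 40):.1f}%")  # prints 77.8%
```

Raising an error for an empty result set is a design choice here: a tool that reports nothing has no meaningful accuracy, and returning 0 or 100 would be misleading.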
Completeness
NIST defines completeness, sometimes referred to as recall rate, as a measure of the real issues found (TPs) versus all possible issues (TPs and false negatives), that is, TP/(TP+FN). The higher the completeness (up to its theoretical maximum of 1), the better a tool covers the existing issues in the code. In practical terms, completeness reflects the number of real issues that were missed: the false negatives (FNs).
The more complete the tool is, the better visibility and protection you will have. This might also translate into more findings, of course, but coupled with a high accuracy rate, most of those findings should prove to be relevant. That said, the severity of those findings always plays a key role in completeness: a thousand low severity FNs are not necessarily a bad thing if you are trying to minimise noise. As a general rule of thumb, the fewer FNs, the better. And it’s important to know how to reduce them so that actual vulnerabilities aren’t missed.
This metric can only be generated if you already know of vulnerabilities in your code, or if you are comparing multiple tools and have found differences in their findings. Another angle is to look at the severity of the FNs and then focus on the higher priority ones. FNs are hard to measure because they are unknown unknowns. Tradeoffs are inevitable. Experience shows that in non-trivial projects FNs always have to be expected. In cybersecurity, letting your guard down by feeling too secure is never an option.
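One way to approximate completeness when comparing tools is to pool the triaged TPs from every tool, add any issues you already know about, and treat that pool as the reference set. The sketch below assumes each finding can be reduced to a comparable key (here a hypothetical `(file, line, issue type)` tuple); the tool names and findings are made up. Because issues that no tool found stay invisible, the result should be read as an upper bound on the real completeness.

```python
from typing import Dict, Optional, Set, Tuple

Finding = Tuple[str, int, str]  # (file, line, issue type): a hypothetical key


def estimated_completeness(
    confirmed_by_tool: Dict[str, Set[Finding]],
    known_issues: Optional[Set[Finding]] = None,
) -> Dict[str, float]:
    """Estimate TP / (TP + FN) per tool against the pooled set of real issues."""
    # Reference set: every triaged TP from any tool plus issues known beforehand.
    reference: Set[Finding] = set(known_issues or ())
    for findings in confirmed_by_tool.values():
        reference |= findings
    if not reference:
        raise ValueError("no confirmed issues to compare against")
    # Anything in the reference set that a tool did not report counts as its FN.
    return {
        tool: len(findings) / len(reference)
        for tool, findings in confirmed_by_tool.items()
    }


# Made-up triaged true positives for two hypothetical tools.
triaged = {
    "tool_a": {("auth.py", 42, "sql-injection"), ("views.py", 10, "xss"), ("forms.py", 5, "csrf")},
    "tool_b": {("auth.py", 42, "sql-injection"), ("upload.py", 7, "path-traversal")},
}
known = {("config.py", 3, "hardcoded-secret")}

for tool, score in estimated_completeness(triaged, known).items():
    print(tool, f"{score:.2f}")  # tool_a 0.60, tool_b 0.40
```

The same pooled set can also be filtered by severity first if you only care about the higher priority FNs.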
The qualitative aspect
Qualitative measurement looks at how language and vulnerability support are approached. As we explored in the previous blog, sticking to just known vulnerability lists, test suites, and intentionally-vulnerable repositories yields an incomplete picture. Therefore, a good SAST is one that goes beyond the lists.
This measurement can be further split into two areas of interest: language/vulnerability support and approach to depth/accuracy.
How is language support determined?
It’s important to understand how language support is prioritised and determined for the SAST you are assessing.
We already know vulnerability lists are not enough. A more comprehensive approach would be to aggregate data from multiple sources to create solid language support that would be both up-to-date with today's cyber risks, as well as contextually-relevant.
So while lists can serve as one reference, there are additional sources to explore:
News sources — “trending” vulnerabilities and newly-published vectors are more likely to be exploited
Known vulnerability databases and exploit data, like the National Vulnerability Database (NVD) and the Snyk Vulnerability Database
Language and framework-specific best practices and context
Zero-day vulnerability research, covering both new patterns and existing ones
When determining which languages and frameworks to support in Snyk Code, we use all of the above and more to build a list of the most relevant issues customers should care about.
How is language depth and accuracy approached?
While having a strong list of supported languages and vulnerabilities is a major first step, you also have to consider how well that support is translated into results.
For example, a SAST that relies on the open source community to build and push new rules with little to no strict review process is prone to a high number of FPs and most often yields inconsistent results between different languages and vulnerabilities.
With Snyk Code, we have a team of dedicated security researchers who work constantly to add support for more languages and vulnerabilities, as well as to improve existing support through added depth and accuracy.
What is the development velocity and maintenance of the SAST?
As shown above, maintenance and continuous development of a SAST solution are important. This translates into two areas: the roadmap of the product and the ability of the company or community to fulfill it. On the roadmap side, recent advances in machine learning make it interesting to see how they play into the plans. Also key are the support of modern languages, the use of a modern engine, and the velocity with which new languages are added.
Secondly, it’s important to understand how capable a company or community is of maintaining the SAST knowledge base. As said above, security requires constant observation of, and reaction to, various sources.
Bringing it all together
With quantitative assessment, you want to understand how the tool actually fares in the real world. Having the best language and vulnerability support is not enough if the tool produces an unmanageable number of results (even if they are TPs) or a lot of noise (FPs). A security professional wants a high level of completeness, but a developer is more interested in practical advice they can act on. So, it is important to strike a balance between the number of suggestions, their prioritization, and the ability of the developer team to address them. Our experience and research show that overwhelming developers with suggestions (particularly if accuracy is low) demotivates them and actually slows down the process.
From a qualitative perspective, we suggest making a list of aspects of interest in your environment, building a matrix and inserting the values for each competitor.
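Such a matrix can be as simple as a table with one row per aspect and one column per tool. The sketch below is only an illustration: the aspect names, tool names, and values are placeholders, and `build_matrix` is a hypothetical helper, so substitute whatever matters in your environment.

```python
from typing import Dict, List


def build_matrix(aspects: List[str], tools: Dict[str, Dict[str, str]]) -> str:
    """Render a plain-text matrix: one row per aspect, one column per tool."""
    names = list(tools)
    header = ["Aspect"] + names
    # Missing values are shown as "?" so gaps in the evaluation stay visible.
    rows = [[aspect] + [tools[name].get(aspect, "?") for name in names] for aspect in aspects]
    table = [header] + rows
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    )


# Placeholder aspects and values for two hypothetical tools.
aspects = ["Languages we use", "Rule review process", "New language velocity"]
tools = {
    "Tool A": {"Languages we use": "all", "Rule review process": "in-house team"},
    "Tool B": {"Languages we use": "partial", "New language velocity": "slow"},
}
print(build_matrix(aspects, tools))
```

Keeping the values as free text rather than numeric scores avoids implying a precision that a qualitative comparison does not have.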
To measure the quantitative characteristics, we suggest picking real projects that you are familiar with. For extra convenience, we suggest choosing a small to medium-sized project. As mentioned in the previous blog, you should refrain from using an intentionally-vulnerable app, as it is unlikely to be indicative of the tool’s real value.
After you run the SAST and receive the results, it’s time to triage them. Triaging means determining whether a certain result is a TP (a real issue) or an FP (not a real issue). SAST results are often contextual, which is why having good knowledge of the project you are scanning is important. In the end, your personal expertise and work cannot be replaced by pre-cooked benchmark results.
Finally, you will need to calculate the accuracy and completeness based on the formulae mentioned earlier in this post.
Why not simply collect all the tools and run them?
As shown above, every tool adds TPs and FPs. So while it might seem to make sense to simply use all possible tools, in reality the work of separating out the noise will outweigh the value added by another tool. Developers will need to work through duplicate suggestions from different tools using different formats, which produces a lot of overhead, not to mention runtime constraints. While this is a good way to get FN and FP numbers, it is not feasible for continuous operations. In our experience, having a developer-friendly platform is most important. If you are working in a very security-critical or regulated environment, you might want to add specialized tools later in the CI/CD process.
A SAST is undoubtedly a powerful tool that every developer should have in their “toolbox” and can really make a difference in your application security. It’s therefore imperative that you choose the best tool for you and your organization. We hope the information and steps outlined above and in the previous blog post can help you make smarter decisions.