Exploring the advanced technologies behind Snyk Code
Frank Fischer
October 20, 2021
Snyk Code is the static application security testing (SAST) solution from Snyk, and it introduces some revolutionary technologies into the SAST space. It is based on research and technologies developed by DeepCode, a spin-off from ETH Zurich (Switzerland) that joined Snyk at the end of 2020. This article covers these technologies and describes how Snyk not only gives back to the open source community but also promotes and works with the academic community in the field of static program analysis.
The Snyk Code process
The overall process is depicted below.
It starts on the left with “Big Code”, a training set of hundreds of thousands of open source repositories and their change history, covering a range of languages. The first step is to parse the code into intermediate representations. Snyk Code uses an abstract syntax tree (AST) and an event graph (EG), which together allow a data flow-sensitive, context-aware analysis. You can find more details in the academic papers cited later in this article. The EG is capable of representing various programming languages.
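To make the parsing step more concrete, here is a minimal sketch that uses Python's standard `ast` module to turn source text into an AST and then extract simple "value flows from one name to another" edges. It is only an illustration of the general idea: the function name `assignment_flows` and the sample snippet are hypothetical, and this is far simpler than, and not a description of, Snyk Code's event graph.

```python
import ast


def assignment_flows(source: str) -> list[tuple[str, str]]:
    """Return (from_name, to_name) pairs for simple assignments in `source`."""
    tree = ast.parse(source)  # source text -> abstract syntax tree
    flows = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            # Names that are read on the right-hand side of the assignment.
            reads = sorted(
                {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            )
            # Each plain-name target receives data from every name that is read.
            for target in node.targets:
                if isinstance(target, ast.Name):
                    flows.extend((read, target.id) for read in reads)
    return flows


# Hypothetical two-line program used only for illustration.
SNIPPET = "user = request_param\nquery = prefix + user\n"
flows = assignment_flows(SNIPPET)
print(flows)
```

Edges like these are one possible intermediate representation that a later analysis step can reason over without looking at the original source text again.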
The next step applies a logic solver to the intermediate representation and a set of logic rules. Three aspects of the solver are noteworthy. First, it is proprietary and optimized for this task, using algorithms with lower time complexity than those used in standard Datalog solvers. Second, that lower complexity points to industry-leading runtime behavior. Third, as a constraint-based system, it generates semantic facts that can be used as input to the human-guided learning process.
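For readers unfamiliar with Datalog-style solving, the sketch below shows the naive fixpoint evaluation that standard solvers build on, applied to a toy taint analysis. Snyk's solver is proprietary and this is not its algorithm; the rules, the fact names, and the function `solve_taint` are assumptions made purely for illustration.

```python
def solve_taint(sources, flows, sinks):
    """Naive fixpoint evaluation of three Datalog-style rules:

    tainted(X) :- source(X).
    tainted(Y) :- tainted(X), flows(X, Y).
    finding(Y) :- tainted(Y), sink(Y).
    """
    tainted = set(sources)
    changed = True
    while changed:  # repeat until no rule derives a new fact
        changed = False
        for src, dst in flows:
            if src in tainted and dst not in tainted:
                tainted.add(dst)
                changed = True
    return tainted, tainted & set(sinks)


# Hypothetical facts, as they might be extracted from an intermediate representation.
flow_facts = {
    ("request_param", "user"),
    ("user", "query"),
    ("prefix", "query"),
    ("config", "timeout"),
}
tainted, findings = solve_taint({"request_param"}, flow_facts, {"query", "timeout"})
print(sorted(tainted), sorted(findings))
```

The derived `tainted` and `findings` sets are examples of semantic facts: statements about program behavior rather than about program text. The naive loop rescans every flow fact on each pass, which is exactly the kind of cost that an optimized solver avoids.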
Using open source, Snyk Code mines the knowledge of the global developer community to identify and address security issues. For example, it can even find combinations of sources and sinks that do not yet appear in the training set but could pose a threat. It also makes it possible to add rules for zero-day exploits even when no training data is available yet.
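One simple way to picture the point about unseen combinations: if sources and sinks are learned as separate facts, any source can be paired with any sink, including pairs that were never observed together. The sketch below illustrates that idea only; the source and sink names and the function `unseen_pairs` are hypothetical, and the actual learning process is more involved.

```python
from itertools import product


def unseen_pairs(observed):
    """Return source/sink pairs that are possible but were never observed together."""
    sources = {source for source, _ in observed}
    sinks = {sink for _, sink in observed}
    # Every source paired with every sink, minus the pairs already seen.
    return set(product(sources, sinks)) - set(observed)


# Hypothetical (source, sink) pairs seen in training data.
OBSERVED = {("http_param", "sql_query"), ("cookie", "shell_exec")}
candidates = unseen_pairs(OBSERVED)
print(sorted(candidates))
```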
The learning process involves Snyk security engineers, who work with machine learning algorithms to generate and maintain the knowledge base. They also add information that helps developers understand the issues that are found and take action on them.
In production, Snyk Code follows the same pipeline but does not use the scanned code for learning. Instead, it applies the knowledge base and produces accurate results in minutes or even seconds. And there you have it! All of this is how we've been able to make Snyk Code such a powerful SAST.
Participation in academic publications and conferences
On top of building this SAST tool, the team behind Snyk Code has been actively involved in publishing at top machine learning and programming languages conferences over many years as well as publishing several academic tools on static program analysis.
Here is a sample of publications and research systems:
Berabi, B., He, J., Raychev, V. and Vechev, M., 2021, June. TFix: Learning to fix coding errors with a text-to-text transformer. In Proceedings of the 38th International Conference on Machine Learning, PMLR 139 (pp. 780-791).
He, J., Lee, C.C., Raychev, V. and Vechev, M., 2021, June. Learning to find naming issues with big code and small supervision. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (pp. 296-311).
Eberhardt, J., Steffen, S., Raychev, V. and Vechev, M., 2019, June. Unsupervised learning of API aliasing specifications. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (pp. 745-759).
Chibotaru, V., Bichsel, B., Raychev, V. and Vechev, M., 2019, June. Scalable taint specification inference with big code. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (pp. 760-774).
Paletov, R., Tsankov, P., Raychev, V. and Vechev, M., 2018. Inferring crypto API rules from code changes. ACM SIGPLAN Notices, 53(4), pp. 450-464.
Bielik, P., Raychev, V. and Vechev, M., 2017, July. Learning a static analyzer from data. In International Conference on Computer Aided Verification (pp. 233-253). Springer, Cham.
Raychev, V., Bielik, P. and Vechev, M., 2016. Probabilistic model for code with decision trees. ACM SIGPLAN Notices, 51(10), pp. 731-747.
Bielik, P., Raychev, V. and Vechev, M., 2016, June. PHOG: Probabilistic model for code. In International Conference on Machine Learning (pp. 2933-2942). PMLR.
Raychev, V., Bielik, P., Vechev, M. and Krause, A., 2016. Learning programs from noisy data. ACM SIGPLAN Notices, 51(1), pp. 761-774.
Raychev, V., Vechev, M. and Yahav, E., 2014, June. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation (pp. 419-428).
Raychev, V., Vechev, M. and Krause, A., 2015. Predicting program properties from "big code". ACM SIGPLAN Notices, 50(1), pp. 111-124.
Bichsel, B., Raychev, V., Tsankov, P. and Vechev, M., 2016, October. Statistical deobfuscation of Android applications. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (pp. 343-355).
He, J., Ivanov, P., Tsankov, P., Raychev, V. and Vechev, M., 2018, October. DEBIN: Predicting debug information in stripped binaries. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (pp. 1667-1680).
Publicly available tools
Additionally, during the development of the technologies underlying Snyk Code, research tools were developed and published. The following list gives an overview of these tools.
Note: Snyk is not responsible for these tools and does not maintain or host them. Rather, they are part of the research projects in which Snyk team members have been involved.
JSNice
JSNice de-obfuscates JavaScript programs using machine learning. It is used by tens of thousands of programmers worldwide.
Nice2Predict
Nice2Predict is an efficient and scalable open source framework for structured prediction that makes it quicker to build new statistical engines.
DeGuard
DeGuard reverses the process of layout obfuscation for Android apps. It is used by security analysts on a daily basis.
Wrapping up
As shown above, the Snyk Code team is on the leading edge of research, and they contribute back by authoring scientific papers, hosting students, and participating in other scientific projects. Experience the results of their research for yourself by signing up for Snyk Code today.