Trojans in your source code
Updated on: 12 Nov 2021
As part of my work I frequently perform source code reviews for security issues. Looking for vulnerabilities in the logic of the source code is not easy, but when the encoding of that code is attacked, things get unreal pretty fast, especially when you realise how often code is copy-pasted from sites like StackOverflow.
This article describes the dangers of hidden Unicode control characters and how they can make your source code appear differently than it is executed.
This code contains a simple "demo" function that starts by setting the "isAdmin" variable to "false" on line 2. Line 4 starts with a comment and then appears to check whether "isAdmin" is set to "true", in which case the message "You are an admin" would be printed on line 5. Logically, this message should never be printed, since "isAdmin" is explicitly set to "false". However, when the script is executed, the following output is generated:
You are an admin
How does this work?
Using Unicode control characters, we can reorder the tokens of the source code as they are displayed. By carefully placing these characters, we can make the code rendered on screen differ from the logic that the compiler or interpreter actually processes.
Modifying the code
The table below lists the abbreviations of the special Unicode control characters that are used to create the examples in this article.
RLO (Right-to-Left Override): Treat following text as right-to-left
LRI (Left-to-Right Isolate): Treat following text as left-to-right without affecting adjacent text
PDI (Pop Directional Isolate): Terminate the nearest LRI or RLI
As you can see, line 4 is processed as a normal comment section with some special characters in it. However, when the line is displayed on screen in the IDE, the special characters inside the comment manipulate the text so that it is rendered in a different order. This vulnerability is tracked as CVE-2021-42574. The sample code can be found here.
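To see what the compiler or interpreter actually receives, it helps to make the control characters visible. Below is a minimal sketch of how that could be done in Python. The function name `reveal_bidi_controls` and the sample comment line are my own assumptions for illustration, not the sample code referenced above.

```python
# Abbreviations of the bidirectional control characters defined by Unicode.
BIDI_CONTROLS = {
    "\u202a": "LRE",  # Left-to-Right Embedding
    "\u202b": "RLE",  # Right-to-Left Embedding
    "\u202c": "PDF",  # Pop Directional Formatting
    "\u202d": "LRO",  # Left-to-Right Override
    "\u202e": "RLO",  # Right-to-Left Override
    "\u2066": "LRI",  # Left-to-Right Isolate
    "\u2067": "RLI",  # Right-to-Left Isolate
    "\u2068": "FSI",  # First Strong Isolate
    "\u2069": "PDI",  # Pop Directional Isolate
}


def reveal_bidi_controls(text: str) -> str:
    """Replace every bidi control character with a visible [ABBR] tag."""
    return "".join(
        f"[{BIDI_CONTROLS[ch]}]" if ch in BIDI_CONTROLS else ch for ch in text
    )


# Hypothetical comment line with hidden control characters (an assumption,
# written with escapes so nothing invisible hides in this snippet itself).
line = "/*\u202e } \u2066if (isAdmin)\u2069 \u2066 begin admins only */"
print(reveal_bidi_controls(line))
# /*[RLO] } [LRI]if (isAdmin)[PDI] [LRI] begin admins only */
```

Printed this way, the whole line is clearly one comment, regardless of how an editor would draw it.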
Adversaries can use this difference between how Unicode control characters are processed and how they are rendered to hide malicious behaviour from human reviewers. The impact of this kind of attack is greatest in software supply chains, where open source projects are used by other applications. When a vulnerability like this slips through code review and ends up unnoticed in an open source library, it is likely to be inherited downstream by the software using that library.
Visually reordering the source code can be done in different ways:
Commenting-out
This is the example explained above. Text that visually appears to be executable code is in fact part of a comment, so it is never executed by the compiler or interpreter.
Commenting-in
In this case, executable code visually appears to be a comment, but is actually executed. For example, in the Python code below, the code "amount -= 70" in line 3 appears to be part of a comment section:
However, after running the script, we see it is actually executed:
Amount: 30 done
Below is the actual code, with Unicode control characters made visible, as it would be processed by the Python interpreter:
The sample code can be found here.
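The commenting-in idea can also be reproduced in a few lines. The sketch below is a hypothetical reconstruction, not the sample code itself: the starting amount of 100, the docstring text and the use of RLI are assumptions. The source is built from escape sequences and passed to `exec`, so the snippet contains no invisible characters.

```python
RLI = "\u2067"  # Right-to-Left Isolate

# Line 2 is a string literal that ends at the second ''' and is followed by a
# real statement. The RLI sits inside the string, where Python accepts it.
source = (
    "amount = 100\n"
    f"''' Subtract the fee later {RLI}''' ;amount -= 70\n"
    "print('Amount:', amount)\n"
)

namespace = {}
exec(source, namespace)  # prints "Amount: 30"
```

In an editor that applies bidirectional rendering, everything after the RLI is laid out right-to-left, so the closing quotes are drawn at the far end of the line and the statement appears to sit inside the docstring. The exact rendering varies per editor.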
Early returns
This technique is a variation on the commenting-in technique. It executes a return statement that appears to be part of a comment in order to exit a function early. In the Python example below, the "return" statement in line 3 appears to be included in the comment, but is actually executed.
first comment done
The figure below shows the code as it is processed by the Python interpreter:
The early returns example can be found here.
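One way to check what the interpreter sees, independent of rendering, is to ask the parser. The sketch below uses Python's `ast` module on a hypothetical function; the name `demo`, the messages and the use of RLI are assumptions chosen to match the output shown above.

```python
import ast

RLI = "\u2067"  # Right-to-Left Isolate

# Hypothetical reconstruction of an early-return function (an assumption).
source = (
    "def demo():\n"
    "    print('first comment')\n"
    f"    ''' second comment follows {RLI}''' ;return\n"
    "    print('second comment')\n"
)

# List the statement types the parser actually finds in the function body.
function = ast.parse(source).body[0]
statement_types = [type(node).__name__ for node in function.body]
print(statement_types)  # ['Expr', 'Expr', 'Return', 'Expr']

namespace = {}
exec(source, namespace)
namespace["demo"]()  # prints only "first comment"
print("done")
```

The `Return` node in the third position shows that the statement is real code, whatever the editor displays.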
Stretched strings
Another way to exploit this vulnerability is to have pieces of string literals render as code. This could break equality checks like the one in the Bash script below:
Line 2 assigns the value "user" to the ACCESS_LEVEL variable. If the ACCESS_LEVEL is not equal to "user" (line 4), the message "You are an admin" is displayed, otherwise the output is "You are a user". Since the ACCESS_LEVEL has been explicitly set to "user", we would expect the latter. The output, however, is the following:
You are an admin
The figure below shows the source code as it is processed by the Bash interpreter:
Notice the string literal in line 4 is not "user", but rather contains other characters as well. This way, given the explicit assignment in line 2, the comparison in line 4 always yields true (i.e. ACCESS_LEVEL is never equal to "user") and the message "You are an admin" is output.
The sample code for stretched strings can be found here.
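The same effect can be shown in Python. This is only a sketch under assumptions: the helper `check_access` and the exact contents of the stretched literal are hypothetical and are not taken from the Bash sample.

```python
RLO, LRI, PDI = "\u202e", "\u2066", "\u2069"

# Hypothetical stretched literal: it starts with "user" but also carries hidden
# control characters and text that an editor may draw outside the quotes.
STRETCHED = f"user{RLO} {LRI}# Check if admin{PDI} {LRI}"


def check_access(expected: str, access_level: str = "user") -> str:
    """Mimic the comparison: anything that is not equal counts as admin."""
    if access_level != expected:
        return "You are an admin"
    return "You are a user"


print(check_access("user"))     # You are a user
print(check_access(STRETCHED))  # You are an admin
print(ascii(STRETCHED))         # shows the hidden characters as \u escapes
```

The built-in `ascii()` is a quick way to inspect a suspicious literal, since it escapes every non-ASCII character.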
A homoglyph attack exploits the fact that two characters look alike. For example, it is hard to see the difference between the Cyrillic letter "а" (U+0430) and the letter "a" (U+0061) from the Latin alphabet. This is similar to the IDN homograph attack, in which domain names are spoofed using the same principle.
It would be very difficult for a reviewer to visually make a distinction between the function call to the good and the evil demo function in line 8. This vulnerability is tracked as CVE-2021-42694.
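Since the eye cannot be trusted here, a reviewer could let a script point out identifiers that contain non-ASCII characters. The sketch below uses Python's `tokenize` and `unicodedata` modules. The function `suspicious_identifiers` and the `is_admin` sample are hypothetical, and flagging every non-ASCII identifier is a deliberately blunt assumption that will also flag legitimate non-English names.

```python
import io
import tokenize
import unicodedata


def suspicious_identifiers(source: str):
    """Return (line, identifier, character names) for non-ASCII identifiers."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            names = [
                unicodedata.name(ch, "UNKNOWN")
                for ch in tok.string
                if not ch.isascii()
            ]
            found.append((tok.start[0], tok.string, names))
    return found


# Hypothetical sample: the second function name uses Cyrillic "а" (U+0430).
sample = (
    "def is_admin():\n"
    "    return False\n"
    "def is_\u0430dmin():\n"
    "    return True\n"
    "print(is_\u0430dmin())\n"
)

for line_no, name, chars in suspicious_identifiers(sample):
    print(line_no, ascii(name), chars)
# 3 'is_\u0430dmin' ['CYRILLIC SMALL LETTER A']
# 5 'is_\u0430dmin' ['CYRILLIC SMALL LETTER A']
```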
The issues mentioned in this article can be mitigated by limiting the way the bidirectional Unicode control characters are processed, either by making them clearly visible or by producing errors or warnings when they are encountered:
IDEs and other code editors should provide visual feedback for these characters, either by making them clearly visible or by notifying the user with a warning. The figure below shows, as an example, how IntelliJ handles the visual representation of Unicode control characters:
The example below shows how GitHub handles the issue in comment_in.py:
Automated systems and software
Systems and software that process source code, such as compilers, interpreters and build pipelines, should throw exceptions or generate warnings whenever bidirectional control characters are encountered.
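As an illustration of what such a check could look like in a build pipeline, here is a minimal sketch. The names `BidiControlError` and `check_source` are hypothetical, and rejecting every bidirectional control character outright is an assumption. Projects that legitimately contain right-to-left text would need a more lenient policy.

```python
import re

# The nine bidirectional control characters: U+202A..U+202E and U+2066..U+2069.
BIDI_PATTERN = re.compile("[\u202a-\u202e\u2066-\u2069]")


class BidiControlError(ValueError):
    """Raised when source text contains a bidirectional control character."""


def check_source(text: str, filename: str = "<source>") -> None:
    """Raise BidiControlError at the first bidi control character found."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        match = BIDI_PATTERN.search(line)
        if match:
            code_point = f"U+{ord(match.group()):04X}"
            raise BidiControlError(
                f"{filename}:{line_no}:{match.start() + 1}: "
                f"bidirectional control character {code_point}"
            )


try:
    check_source("x = 1\n# \u202e hidden\n", filename="example.py")
except BidiControlError as error:
    print(error)  # example.py:2:3: bidirectional control character U+202E
```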
Language specifications should disallow unterminated bidirectional control characters in comments and string literals.
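A rough sketch of such a rule is shown below. It is a simplification under stated assumptions: it works per line instead of per comment or string literal, it only counts opening and closing characters, and the function name `unterminated_lines` is hypothetical. A real implementation would follow the language's own tokenizer.

```python
OPEN_ISOLATES = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
POP_ISOLATE = "\u2069"                                      # PDI
OPEN_EMBEDDINGS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
POP_FORMATTING = "\u202c"                                   # PDF


def unterminated_lines(text: str) -> list[int]:
    """Return the line numbers that leave a bidi control character open."""
    bad_lines = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        isolates = embeddings = 0
        for ch in line:
            if ch in OPEN_ISOLATES:
                isolates += 1
            elif ch == POP_ISOLATE and isolates:
                isolates -= 1
            elif ch in OPEN_EMBEDDINGS:
                embeddings += 1
            elif ch == POP_FORMATTING and embeddings:
                embeddings -= 1
        if isolates or embeddings:
            bad_lines.append(line_no)
    return bad_lines


sample = "ok = True\n# \u2067 left open\n# \u2066closed\u2069 properly\n"
print(unterminated_lines(sample))  # [2]
```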
Reference: Nicholas Boucher and Ross Anderson, "Trojan Source: Invisible Vulnerabilities", 2021 (paper, PDF).