CAPEC-80

Name

Using UTF-8 Encoding to Bypass Validation Logic

Likelihood of Attack

High

Typical Severity

High

Status

Draft

Published

2014-06-23 00:00 +00:00

Modified

2022-09-29 00:00 +00:00

Official links

CAPEC Mitre.org

CAPEC Description

This attack is a specific variation on leveraging alternate encodings to bypass validation logic. The attacker encodes potentially harmful input in UTF-8 and submits it to an application that does not expect this encoding or does not validate it effectively, which makes input filtering difficult. UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Legal UTF-8 characters are one to four bytes long. However, early versions of the UTF-8 specification got some entries wrong (in some cases they permitted overlong characters). UTF-8 encoders are supposed to use the "shortest possible" encoding, but naive decoders may accept encodings that are longer than necessary. According to RFC 3629, a particularly subtle form of this attack can be carried out against a parser that performs security-critical validity checks against the UTF-8 encoded form of its input but interprets certain illegal octet sequences as characters.
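The mismatch described above can be illustrated with a minimal sketch. The function `naive_decode` is a hypothetical decoder written only for this example (it is not taken from any real product); it accepts the overlong two-byte form C0 AF as '/', while Python's built-in strict decoder rejects the same bytes. The byte-level filter shown is likewise an assumption about how a vulnerable application might check its input.

```python
def naive_decode(data: bytes) -> str:
    """Hypothetical naive decoder: handles ASCII and 2-byte sequences only,
    and never checks for overlong forms."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):
            # Combine 5 payload bits of the lead byte with 6 bits of the trail byte.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("byte not handled by this sketch")
    return "".join(out)


def strict_decode_fails(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder rejects the input."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


payload = b"..\xc0\xaf..\xc0\xafetc"   # C0 AF is an overlong encoding of '/'
filter_passes = b"/" not in payload     # a byte-level check sees no slash

print(filter_passes)                    # True
print(naive_decode(payload))            # ../../etc
print(strict_decode_fails(payload))     # True
```

The check runs on the encoded bytes, but the dangerous character only appears after the lenient decoder has run, which is the ordering problem this attack pattern relies on.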

CAPEC Information

Execution Flow

1) Explore

[Survey the application for user-controllable inputs] Using a browser or an automated tool, an attacker follows all public links and actions on a web site. They record all the links, the forms, the resources accessed and all other potential entry-points for the web application.

Technique

Use a spidering tool to follow and record all links and analyze the web pages to find entry points. Make special note of any links that include parameters in the URL.
Use a proxy tool to record all user input entry points visited during a manual traversal of the web application.
Use a browser to manually explore the website and analyze how it is constructed. Many browser plugins are available to facilitate the analysis or automate the discovery.

2) Experiment

[Probe entry points to locate vulnerabilities] The attacker uses the entry points gathered in the "Explore" phase as a target list and injects various UTF-8 encoded payloads to determine if an entry point actually represents a vulnerability with insufficient validation logic and to characterize the extent to which the vulnerability can be exploited.

Technique

Try to use UTF-8 encoding of content in Scripts in order to bypass validation routines.
Try to use UTF-8 encoding of content in HTML in order to bypass validation routines.
Try to use UTF-8 encoding of content in CSS in order to bypass validation routines.
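When testing your own application against the techniques above, it helps to generate the non-shortest forms of a filtered character systematically. The following is a sketch only; `overlong_encode` is a hypothetical helper name, and the percent-encoding step assumes the input travels in a URL.

```python
from urllib.parse import quote_from_bytes


def overlong_encode(cp: int, nbytes: int) -> bytes:
    """Encode code point cp using exactly nbytes (2 to 4) UTF-8 style bytes.

    The result is overlong (and therefore ill-formed) whenever cp would fit
    in fewer bytes, e.g. any ASCII character encoded with 2 or more bytes.
    """
    if nbytes == 2 and cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if nbytes == 3 and cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if nbytes == 4 and cp <= 0x1FFFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("unsupported length or code point out of range")


# Test vectors for '<' (U+003C) in 2-, 3- and 4-byte overlong forms.
for n in (2, 3, 4):
    raw = overlong_encode(ord("<"), n)
    print(n, raw.hex(), quote_from_bytes(raw))
```

A conforming decoder should reject every one of these vectors; any entry point that treats them as '<' decodes overlong forms.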

Prerequisites

The application's UTF-8 decoder accepts and interprets illegal UTF-8 characters or the non-shortest form of a UTF-8 encoding.
Input filtering and validation are not done properly, which leaves the door open to characters that are harmful to the target host.

Skills Required

An attacker can inject different representations of a filtered character in UTF-8 format.
An attacker may craft subtle encoding of input data by using the knowledge that they have gathered about the target host.

Mitigations

The Unicode Consortium recognized multiple representations to be a problem and has revised the Unicode Standard to make multiple representations of the same code point with UTF-8 illegal. The UTF-8 Corrigendum lists the newly restricted UTF-8 range (see References). Many current applications may not have been revised to follow this rule. Verify that your application conforms to the latest UTF-8 encoding specification. Pay extra attention to the filtering of illegal characters.

The exact response required from a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:

1. Insert a replacement character (e.g. '?', '').
2. Ignore the bytes.
3. Interpret the bytes according to a different character encoding (often the ISO-8859-1 character map).
4. Not notice and decode as if the bytes were some similar bit of UTF-8.
5. Stop decoding and report an error (possibly giving the caller the option to continue).

It is possible for a decoder to behave in different ways for different types of invalid input.
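Several of these behaviors can be observed with the error handlers of Python's built-in `bytes.decode`. This is a sketch for illustration; the sample input and the helper name `try_strict` are assumptions, and other languages and libraries may behave differently.

```python
bad = b"a\xc0\xafb"   # contains the overlong form C0 AF


def try_strict(data: bytes) -> str:
    """Behavior 5: stop decoding and report an error."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        return f"error at byte offset {exc.start}"


replaced = bad.decode("utf-8", errors="replace")   # behavior 1: U+FFFD inserted
ignored = bad.decode("utf-8", errors="ignore")     # behavior 2: bytes dropped
as_latin1 = bad.decode("latin-1")                  # behavior 3: other encoding
strict = try_strict(bad)                           # behavior 5: error reported

print(repr(replaced))
print(repr(ignored))
print(repr(as_latin1))
print(strict)
```

None of these handlers produces '/', so Python's decoder does not exhibit behavior 4. Note that "ignore" silently joins the surrounding text, which can itself create a string that a filter would have rejected.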

RFC 3629 only requires that UTF-8 decoders must not decode "overlong sequences" (where a character is encoded in more bytes than needed but still follows the general lead-byte and continuation-byte pattern of UTF-8). The Unicode Standard requires a Unicode-compliant decoder to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."

Overlong forms are one of the most troublesome types of UTF-8 data. The current RFC says they must not be decoded, but older specifications for UTF-8 only gave a warning, and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high-profile products, including Microsoft's IIS web server. Therefore, great care must be taken to avoid security issues if validation is performed before conversion from UTF-8, and it is generally much simpler to handle overlong forms before any input validation is done.

To maintain security in the case of invalid input, there are two options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input, returns either an error or text that the application considers to be harmless. Another possibility is to avoid conversion out of UTF-8 altogether but this relies on any other software that the data is passed to safely handling the invalid data.
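The first two options can be combined: decode with a strict decoder first, reject anything ill-formed, and only then validate the decoded text. The sketch below assumes the input is a file name; `decode_then_validate` and the `ALLOWED_NAME` pattern are hypothetical and would need to be adapted to the application's actual requirements.

```python
import re

# Hypothetical allowlist: a simple file name with an optional short extension.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}(\.[A-Za-z0-9]{1,8})?")


def decode_then_validate(raw: bytes) -> str:
    """Decode strictly, then validate the decoded text (never the raw bytes)."""
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # Overlong and other ill-formed sequences stop here.
        raise ValueError("ill-formed UTF-8 rejected") from exc
    if ALLOWED_NAME.fullmatch(text) is None:
        raise ValueError("input does not match the allowlist")
    return text


print(decode_then_validate(b"report.txt"))
for attempt in (b"..\xc0\xafetc", b"../etc"):
    try:
        decode_then_validate(attempt)
    except ValueError as err:
        print(attempt, "->", err)
```

Because validation sees exactly the characters the rest of the program will see, an alternate encoding of '/' cannot slip past the check.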

Another consideration is error recovery. To guarantee correct recovery after corrupt or lost bytes, decoders must be able to recognize the difference between lead and trail bytes, rather than just assuming that bytes will be of the type allowed in their position.
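Distinguishing lead bytes from trail bytes can be sketched as a simple classification by value range. The helper names `byte_kind` and `resync` are hypothetical; the ranges follow the well-formed byte ranges of RFC 3629, under which C0, C1 and F5 to FF never appear in valid UTF-8.

```python
def byte_kind(b: int) -> str:
    """Classify a single byte by the role it can play in UTF-8."""
    if b <= 0x7F:
        return "ascii"
    if b <= 0xBF:
        return "trail"      # 0x80..0xBF: continuation byte
    if 0xC2 <= b <= 0xDF:
        return "lead2"      # starts a 2-byte sequence
    if 0xE0 <= b <= 0xEF:
        return "lead3"      # starts a 3-byte sequence
    if 0xF0 <= b <= 0xF4:
        return "lead4"      # starts a 4-byte sequence
    return "invalid"        # 0xC0, 0xC1, 0xF5..0xFF


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that is not a trail byte."""
    while pos < len(data) and byte_kind(data[pos]) == "trail":
        pos += 1
    return pos


# The lead byte of a 3-byte character was lost; skip the orphaned trail bytes.
damaged = b"\x82\xacA"
print(resync(damaged, 0))   # 2
```

A decoder that assumes "the next two bytes are trail bytes" without checking would instead consume unrelated data after a corrupt lead byte.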

For security reasons, a UTF-8 decoder must not accept UTF-8 sequences that are longer than necessary to encode a character. If you use a parser to decode the UTF-8 encoding, make sure that the parser filters invalid UTF-8 characters (invalid forms or overlong forms).
Look for overlong UTF-8 sequences starting with a malicious pattern. You can also use a UTF-8 decoder stress test to test your UTF-8 parser (see Markus Kuhn's UTF-8 and Unicode FAQ in the References section).
Assume all input is malicious. Create an allowlist that defines all valid input to the software system based on the requirements specifications. Input that does not match against the allowlist should not be permitted to enter into the system. Test your decoding process against malicious input.
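Looking for overlong sequences in raw input can be sketched as a scan for the lead-byte patterns that only occur in non-shortest forms. The function name `find_overlong` is hypothetical, and the scan is meant for logging or testing; it is an assumption of this sketch that rejection itself is still done by a strict decoder.

```python
from typing import List


def find_overlong(data: bytes) -> List[int]:
    """Return offsets of lead bytes that begin an overlong UTF-8 sequence."""
    hits = []
    for i, b in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else None
        if b in (0xC0, 0xC1):
            # 2-byte forms of code points below U+0080.
            hits.append(i)
        elif b == 0xE0 and nxt is not None and 0x80 <= nxt <= 0x9F:
            # 3-byte forms of code points below U+0800.
            hits.append(i)
        elif b == 0xF0 and nxt is not None and 0x80 <= nxt <= 0x8F:
            # 4-byte forms of code points below U+10000.
            hits.append(i)
    return hits


print(find_overlong(b"a\xc0\xafb"))                        # [1]
print(find_overlong(b"\xe0\x80\xaf"))                      # [0]
print(find_overlong("\u00e9\u20ac".encode("utf-8")))       # []
```

Running such a scan over the vectors in a decoder stress test gives a quick indication of which inputs a parser must refuse.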

Related Weaknesses

CWE-ID	Weakness Name	Description
CWE-173	Improper Handling of Alternate Encoding	The product does not properly handle when an input uses an alternate encoding that is valid for the control sphere to which the input is being sent.
CWE-172	Encoding Error	The product does not properly encode or decode the data, resulting in unexpected values.
CWE-180	Incorrect Behavior Order: Validate Before Canonicalize	The product validates input before it is canonicalized, which prevents the product from detecting data that becomes invalid after the canonicalization step.
CWE-181	Incorrect Behavior Order: Validate Before Filter	The product validates data before it has been filtered, which prevents the product from detecting data that becomes invalid after the filtering step.
CWE-73	External Control of File Name or Path	The product allows user input to control or influence paths or file names that are used in filesystem operations.
CWE-74	Improper Neutralization of Special Elements in Output Used by a Downstream Component ('Injection')	The product constructs all or part of a command, data structure, or record using externally-influenced input from an upstream component, but it does not neutralize or incorrectly neutralizes special elements that could modify how it is parsed or interpreted when it is sent to a downstream component.
CWE-20	Improper Input Validation	The product receives input or data, but it does not validate or incorrectly validates that the input has the properties that are required to process the data safely and correctly.
CWE-697	Incorrect Comparison	The product compares two entities in a security-relevant context, but the comparison is incorrect, which may lead to resultant weaknesses.
CWE-692	Incomplete Denylist to Cross-Site Scripting	The product uses a denylist-based protection mechanism to defend against XSS attacks, but the denylist is incomplete, allowing XSS variants to succeed.

References

REF-1

Exploiting Software: How to Break Code
G. Hoglund, G. McGraw.

REF-112

Secure Programming for Linux and Unix HOWTO
David Wheeler.
http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/character-encoding.html

REF-530

Writing Secure Code
Michael Howard, David LeBlanc.

REF-531

Security Risks of Unicode
Bruce Schneier.
https://www.schneier.com/crypto-gram/archives/2000/0715.html

REF-532

Wikipedia
http://en.wikipedia.org/wiki/UTF-8

REF-533

RFC 3629 - UTF-8, a transformation format of ISO 10646
F. Yergeau.
http://www.faqs.org/rfcs/rfc3629.html

REF-114

IDS Evasion with Unicode
Eric Hacker.
http://www.securityfocus.com/infocus/1232

REF-535

Corrigendum #1: UTF-8 Shortest Form
http://www.unicode.org/versions/corrigendum1.html

REF-525

UTF-8 and Unicode FAQ for Unix/Linux
Markus Kuhn.
http://www.cl.cam.ac.uk/~mgk25/unicode.html

REF-537

UTF-8 decoder capability and stress test
Markus Kuhn.
http://www.cl.cam.ac.uk/%7Emgk25/ucs/examples/UTF-8-test.txt

Submission

Name	Organization	Date	Date release
CAPEC Content Team	The MITRE Corporation	2014-06-23 +00:00

Modifications

Name	Organization	Date	Comment
CAPEC Content Team	The MITRE Corporation	2018-07-31 +00:00	Updated References
CAPEC Content Team	The MITRE Corporation	2020-07-30 +00:00	Updated Example_Instances, Execution_Flow, Mitigations, Skills_Required
CAPEC Content Team	The MITRE Corporation	2021-06-24 +00:00	Updated Related_Weaknesses
CAPEC Content Team	The MITRE Corporation	2022-09-29 +00:00	Updated Example_Instances, Mitigations
