Corrected uses of Whitelist to Safelist (#1464)
And introduced a (deprecated) compatibility shim.
Tested with japicmp binary and source compatibility report, and checked binary compat with some test utilities (local testing).
Many thanks to @drei01 (#1408), @nkiesel (#1423) and others for highlighting the issue and for PRs which this commit refines.
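
For reviewers, the shim pattern used here (rename the class, keep the old name as a deprecated subclass, and add deprecated overloads that delegate) can be sketched in isolation. This is a minimal illustration, not the jsoup code: `AllowList`, `LegacyAllowList`, `Filter` and `ShimDemo` are hypothetical stand-ins for the renamed class, the shim, and a consumer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the renamed class; not the jsoup Safelist.
class AllowList {
    private final Set<String> tags = new HashSet<>();

    AllowList addTags(String... names) {
        tags.addAll(Arrays.asList(names));
        return this;
    }

    boolean isSafeTag(String name) {
        return tags.contains(name);
    }
}

/** Old name, kept as an empty deprecated subclass so existing callers still compile and link. */
@Deprecated
class LegacyAllowList extends AllowList {
}

// Hypothetical consumer that exposes both the new and the deprecated constructor.
class Filter {
    private final AllowList allowList;

    Filter(AllowList allowList) {
        if (allowList == null) throw new IllegalArgumentException("allowList must not be null");
        this.allowList = allowList;
    }

    /** Deprecated overload: keeps the old constructor signature available to old callers. */
    @Deprecated
    Filter(LegacyAllowList legacy) {
        // Delegate with this(...). Writing `new Filter(...)` here would build and
        // discard a second object and leave this instance's field unset.
        this((AllowList) legacy);
    }

    boolean permits(String tag) {
        return allowList.isSafeTag(tag);
    }
}

public class ShimDemo {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        LegacyAllowList legacy = new LegacyAllowList();
        legacy.addTags("b", "i");

        // Resolves to the deprecated overload, which forwards to the new constructor.
        Filter filter = new Filter(legacy);
        System.out.println(filter.permits("b"));
        System.out.println(filter.permits("script"));
    }
}
```

The explicit upcast in the deprecated overload is what selects the new signature; without it the call would resolve back to the deprecated overload itself.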
diff --git a/CHANGES b/CHANGES
index 3d3f1e8..76e2d89 100644
--- a/CHANGES
+++ b/CHANGES
@@ -5,6 +5,10 @@
* Change: updated the minimum Android API level from 8 to 10.
+ * Improvement: renamed the Whitelist class to Safelist, with the goal of more inclusive language. A shim is provided
+ for backwards compatibility (source and binary). This shim is marked as deprecated and will be removed in the
+ jsoup 1.15.1 release.
+
* Improvement: added support for loading and parsing gzipped HTML files in Jsoup.parse(File in, charset, baseUri).
* Improvement: reduced thread contention in HttpConnection and Document.
@@ -458,12 +462,12 @@
* Added the new selector :containsData(), to find elements that hold data, like script and style tags.
* Changed Jsoup.isValid(bodyHtml) to validate that the input contains only body HTML that is safe according to the
- whitelist, and does not include HTML errors. And in the Jsoup.Cleaner.isValid(Document) method, make sure the doc
+ safelist, and does not include HTML errors. And in the Jsoup.Cleaner.isValid(Document) method, make sure the doc
only includes body HTML.
<https://github.com/jhy/jsoup/issues/245>
<https://github.com/jhy/jsoup/issues/632>
- * In Whitelists, validate that a removed protocol exists before removing said protocol.
+ * In Safelists, validate that a removed protocol exists before removing said protocol.
* Allow the Jsoup.Connect thread to be interrupted when reading the input stream; helps when reading from a long stream
of data that doesn't read timeout.
@@ -658,10 +662,10 @@
or your JDK doesn't support SNI.
<https://github.com/jhy/jsoup/pull/343>
- * Added ability to further tweak the canned Cleaner Whitelists by removing existing settings.
+ * Added ability to further tweak the canned Cleaner Safelists by removing existing settings.
<https://github.com/jhy/jsoup/pull/449>
- * Added option in Cleaner Whitelist to allow linking to in-page anchors (#)
+ * Added option in Cleaner Safelist to allow linking to in-page anchors (#)
<https://github.com/jhy/jsoup/pull/441>
* Use a lowercase doctype tag for HTML5 documents.
@@ -734,7 +738,7 @@
* If pretty-print is disabled, don't trim outer whitespace in Element.html()
<https://github.com/jhy/jsoup/issues/368>
- * In the HTML Cleaner, allow span tags in the basic whitelist, and span and div tags in the relaxed whitelist.
+ * In the HTML Cleaner, allow span tags in the basic safelist, and span and div tags in the relaxed safelist.
* Added Element.cssSelector(), which returns a unique CSS selector/path for an element.
<https://github.com/jhy/jsoup/pull/459>
@@ -769,7 +773,7 @@
* Added support for 'application/*+xml' mimetypes.
<https://github.com/jhy/jsoup/pull/444>
- * Fixed support for allowing script tags in cleaner whitelists.
+ * Fixed support for allowing script tags in cleaner Safelists.
<https://github.com/jhy/jsoup/issues/299>
<https://github.com/jhy/jsoup/issues/388>
@@ -833,7 +837,7 @@
* Introduced Parser.parseXmlFragment(), to allow easy parsing of XML fragments.
<https://github.com/jhy/jsoup/issues/279>
- * Allow Whitelist test methods to be extended
+ * Allow Safelist test methods to be extended
<https://github.com/jhy/jsoup/issues/85>
* Added Document.OutputSettings.outline mode, to aid HTML debugging by printing out in outline mode, similar to
@@ -925,13 +929,13 @@
* Fixed NPE when HTML fragment parsing a <style> tag
<https://github.com/jhy/jsoup/issues/189>
- * Fixed issue with :all pseudo-tag in HTML sanitizer when cleaning tags previously defined in whitelist
+ * Fixed issue with :all pseudo-tag in HTML sanitizer when cleaning tags previously defined in safelist
<https://github.com/jhy/jsoup/issues/156>
* Fixed NPE in Parser.parseFragment() when context parameter is null.
<https://github.com/jhy/jsoup/issues/195>
- * In HTML whitelists, when defining allowed attributes for a tag, automatically add the tag to the allowed list.
+ * In HTML Safelists, when defining allowed attributes for a tag, automatically add the tag to the allowed list.
*** Release 1.6.2 [2012-Mar-27]
* Added a simplified XML parsing mode, which can usefully parse valid and invalid XML, but does not enforce any HTML
@@ -950,7 +954,7 @@
* Updated jsoup.connect so that when requests made as POSTs are redirected, the redirect is followed as a GET.
<https://github.com/jhy/jsoup/issues/120>
- * Updated the Cleaner and whitelists to optionally preserve related links in elements, instead of converting them
+ * Updated the Cleaner and Safelists to optionally preserve relative links in elements, instead of converting them
to absolute links.
* Updated the Cleaner to support custom allowed protocols such as "cid:" and "data:".
@@ -1255,7 +1259,7 @@
prepend(String), append(String); bulk methods for corresponding
methods in Element.
- * New feature: Jsoup.isValid(html, whitelist) method for user input
+ * New feature: Jsoup.isValid(html, safelist) method for user input
form validation.
* Improved Elements.attr(String) to find first matching element
diff --git a/README.md b/README.md
index bc24827..76d8855 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,7 @@
* scrape and [parse](https://jsoup.org/cookbook/input/parse-document-from-string) HTML from a URL, file, or string
* find and [extract data](https://jsoup.org/cookbook/extracting-data/selector-syntax), using DOM traversal or CSS selectors
* manipulate the [HTML elements](https://jsoup.org/cookbook/modifying-data/set-html), attributes, and text
-* [clean](https://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer) user-submitted content against a safe white-list, to prevent XSS attacks
+* [clean](https://jsoup.org/cookbook/cleaning-html/safelist-sanitizer) user-submitted content against a safe-list, to prevent XSS attacks
* output tidy HTML
jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
diff --git a/pom.xml b/pom.xml
index d49e6c9..d0d0000 100644
--- a/pom.xml
+++ b/pom.xml
@@ -185,7 +185,9 @@
<version>0.14.4</version>
<configuration>
<parameter>
- <!-- Just running the reports for now. jsoup policy is ok to remove deprecated methods on minor but not builds. -->
+ <!-- jsoup policy: deprecated methods may be removed in minor releases, but not in build releases. These checks will need to be temporarily removed on the bump to 1.15.1, with compatibility validated manually. -->
+ <breakBuildOnBinaryIncompatibleModifications>true</breakBuildOnBinaryIncompatibleModifications>
+ <breakBuildOnSourceIncompatibleModifications>true</breakBuildOnSourceIncompatibleModifications>
</parameter>
</configuration>
<executions>
diff --git a/src/main/java/org/jsoup/Jsoup.java b/src/main/java/org/jsoup/Jsoup.java
index 84a5e34..5fe6286 100644
--- a/src/main/java/org/jsoup/Jsoup.java
+++ b/src/main/java/org/jsoup/Jsoup.java
@@ -1,11 +1,12 @@
package org.jsoup;
+import org.jsoup.helper.DataUtil;
+import org.jsoup.helper.HttpConnection;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.safety.Cleaner;
+import org.jsoup.safety.Safelist;
import org.jsoup.safety.Whitelist;
-import org.jsoup.helper.DataUtil;
-import org.jsoup.helper.HttpConnection;
import java.io.File;
import java.io.IOException;
@@ -184,70 +185,106 @@
}
/**
- Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted
+ Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through an allow-list of safe
tags and attributes.
@param bodyHtml input untrusted HTML (body fragment)
@param baseUri URL to resolve relative URLs against
- @param whitelist white-list of permitted HTML elements
+ @param safelist list of permitted HTML elements
@return safe HTML (body fragment)
@see Cleaner#clean(Document)
*/
- public static String clean(String bodyHtml, String baseUri, Whitelist whitelist) {
+ public static String clean(String bodyHtml, String baseUri, Safelist safelist) {
Document dirty = parseBodyFragment(bodyHtml, baseUri);
- Cleaner cleaner = new Cleaner(whitelist);
+ Cleaner cleaner = new Cleaner(safelist);
Document clean = cleaner.clean(dirty);
return clean.body().html();
}
/**
- Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of permitted
+ Use {@link #clean(String, String, Safelist)} instead.
+ @deprecated as of 1.14.1.
+ */
+ @Deprecated
+ public static String clean(String bodyHtml, String baseUri, Whitelist safelist) {
+ return clean(bodyHtml, baseUri, (Safelist) safelist);
+ }
+
+ /**
+ Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a safe-list of permitted
tags and attributes.
@param bodyHtml input untrusted HTML (body fragment)
- @param whitelist white-list of permitted HTML elements
+ @param safelist list of permitted HTML elements
@return safe HTML (body fragment)
@see Cleaner#clean(Document)
*/
- public static String clean(String bodyHtml, Whitelist whitelist) {
- return clean(bodyHtml, "", whitelist);
+ public static String clean(String bodyHtml, Safelist safelist) {
+ return clean(bodyHtml, "", safelist);
}
/**
- * Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a white-list of
+ Use {@link #clean(String, Safelist)} instead.
+ @deprecated as of 1.14.1.
+ */
+ @Deprecated
+ public static String clean(String bodyHtml, Whitelist safelist) {
+ return clean(bodyHtml, (Safelist) safelist);
+ }
+
+ /**
+ * Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a safe-list of
* permitted tags and attributes.
* <p>The HTML is treated as a body fragment; it's expected the cleaned HTML will be used within the body of an
* existing document. If you want to clean full documents, use {@link Cleaner#clean(Document)} instead, and add
- * structural tags (<code>html, head, body</code> etc) to the whitelist.
+ * structural tags (<code>html, head, body</code> etc) to the safelist.
*
* @param bodyHtml input untrusted HTML (body fragment)
* @param baseUri URL to resolve relative URLs against
- * @param whitelist white-list of permitted HTML elements
+ * @param safelist list of permitted HTML elements
* @param outputSettings document output settings; use to control pretty-printing and entity escape modes
* @return safe HTML (body fragment)
* @see Cleaner#clean(Document)
*/
- public static String clean(String bodyHtml, String baseUri, Whitelist whitelist, Document.OutputSettings outputSettings) {
+ public static String clean(String bodyHtml, String baseUri, Safelist safelist, Document.OutputSettings outputSettings) {
Document dirty = parseBodyFragment(bodyHtml, baseUri);
- Cleaner cleaner = new Cleaner(whitelist);
+ Cleaner cleaner = new Cleaner(safelist);
Document clean = cleaner.clean(dirty);
clean.outputSettings(outputSettings);
return clean.body().html();
}
/**
- Test if the input body HTML has only tags and attributes allowed by the Whitelist. Useful for form validation.
+ Use {@link #clean(String, String, Safelist, Document.OutputSettings)} instead.
+ @deprecated as of 1.14.1.
+ */
+ @Deprecated
+ public static String clean(String bodyHtml, String baseUri, Whitelist safelist, Document.OutputSettings outputSettings) {
+ return clean(bodyHtml, baseUri, (Safelist) safelist, outputSettings);
+ }
+
+ /**
+ Test if the input body HTML has only tags and attributes allowed by the Safelist. Useful for form validation.
<p>The input HTML should still be run through the cleaner to set up enforced attributes, and to tidy the output.
<p>Assumes the HTML is a body fragment (i.e. will be used in an existing HTML document body.)
@param bodyHtml HTML to test
- @param whitelist whitelist to test against
+ @param safelist safelist to test against
@return true if no tags or attributes were removed; false otherwise
- @see #clean(String, org.jsoup.safety.Whitelist)
+ @see #clean(String, Safelist)
*/
- public static boolean isValid(String bodyHtml, Whitelist whitelist) {
- return new Cleaner(whitelist).isValidBodyHtml(bodyHtml);
+ public static boolean isValid(String bodyHtml, Safelist safelist) {
+ return new Cleaner(safelist).isValidBodyHtml(bodyHtml);
+ }
+
+ /**
+ Use {@link #isValid(String, Safelist)} instead.
+ @deprecated as of 1.14.1.
+ */
+ @Deprecated
+ public static boolean isValid(String bodyHtml, Whitelist safelist) {
+ return isValid(bodyHtml, (Safelist) safelist);
}
}
diff --git a/src/main/java/org/jsoup/safety/Cleaner.java b/src/main/java/org/jsoup/safety/Cleaner.java
index 22ecb0b..9b6242f 100644
--- a/src/main/java/org/jsoup/safety/Cleaner.java
+++ b/src/main/java/org/jsoup/safety/Cleaner.java
@@ -18,34 +18,44 @@
/**
- The whitelist based HTML cleaner. Use to ensure that end-user provided HTML contains only the elements and attributes
+ The safelist based HTML cleaner. Use to ensure that end-user provided HTML contains only the elements and attributes
that you are expecting; no junk, and no cross-site scripting attacks!
<p>
- The HTML cleaner parses the input as HTML and then runs it through a white-list, so the output HTML can only contain
- HTML that is allowed by the whitelist.
+ The HTML cleaner parses the input as HTML and then runs it through a safe-list, so the output HTML can only contain
+ HTML that is allowed by the safelist.
</p>
<p>
It is assumed that the input HTML is a body fragment; the clean methods only pull from the source's body, and the
- canned white-lists only allow body contained tags.
+ canned safe-lists only allow body contained tags.
</p>
<p>
Rather than interacting directly with a Cleaner object, generally see the {@code clean} methods in {@link org.jsoup.Jsoup}.
</p>
*/
public class Cleaner {
- private Whitelist whitelist;
+ private Safelist safelist;
/**
- Create a new cleaner, that sanitizes documents using the supplied whitelist.
- @param whitelist white-list to clean with
+ Create a new cleaner, that sanitizes documents using the supplied safelist.
+ @param safelist safe-list to clean with
*/
- public Cleaner(Whitelist whitelist) {
- Validate.notNull(whitelist);
- this.whitelist = whitelist;
+ public Cleaner(Safelist safelist) {
+ Validate.notNull(safelist);
+ this.safelist = safelist;
}
/**
- Creates a new, clean document, from the original dirty document, containing only elements allowed by the whitelist.
+ Use {@link #Cleaner(Safelist)} instead.
+ @deprecated as of 1.14.1.
+ */
+ @Deprecated
+ public Cleaner(Whitelist whitelist) {
+ Validate.notNull(whitelist);
+ this.safelist = whitelist; // assign directly; `new Cleaner(...)` would discard the new instance and leave this.safelist null
+ }
+
+ /**
+ Creates a new, clean document, from the original dirty document, containing only elements allowed by the safelist.
The original document is not modified. Only elements from the dirty document's <code>body</code> are used.
@param dirtyDocument Untrusted base document to clean.
@return cleaned document.
@@ -61,8 +71,8 @@
}
/**
- Determines if the input document <b>body</b>is valid, against the whitelist. It is considered valid if all the tags and attributes
- in the input HTML are allowed by the whitelist, and that there is no content in the <code>head</code>.
+ Determines if the input document <b>body</b> is valid, against the safelist. It is considered valid if all the tags and attributes
+ in the input HTML are allowed by the safelist, and that there is no content in the <code>head</code>.
<p>
This method can be used as a validator for user input. An invalid document will still be cleaned successfully
using the {@link #clean(Document)} method. If using as a validator, it is recommended to still clean the document
@@ -107,7 +117,7 @@
if (source instanceof Element) {
Element sourceEl = (Element) source;
- if (whitelist.isSafeTag(sourceEl.normalName())) { // safe, clone and copy safe attrs
+ if (safelist.isSafeTag(sourceEl.normalName())) { // safe, clone and copy safe attrs
ElementMeta meta = createSafeElement(sourceEl);
Element destChild = meta.el;
destination.appendChild(destChild);
@@ -121,7 +131,7 @@
TextNode sourceText = (TextNode) source;
TextNode destText = new TextNode(sourceText.getWholeText());
destination.appendChild(destText);
- } else if (source instanceof DataNode && whitelist.isSafeTag(source.parent().nodeName())) {
+ } else if (source instanceof DataNode && safelist.isSafeTag(source.parent().nodeName())) {
DataNode sourceData = (DataNode) source;
DataNode destData = new DataNode(sourceData.getWholeData());
destination.appendChild(destData);
@@ -131,7 +141,7 @@
}
public void tail(Node source, int depth) {
- if (source instanceof Element && whitelist.isSafeTag(source.nodeName())) {
+ if (source instanceof Element && safelist.isSafeTag(source.nodeName())) {
destination = destination.parent(); // would have descended, so pop destination stack
}
}
@@ -151,12 +161,12 @@
Attributes sourceAttrs = sourceEl.attributes();
for (Attribute sourceAttr : sourceAttrs) {
- if (whitelist.isSafeAttribute(sourceTag, sourceEl, sourceAttr))
+ if (safelist.isSafeAttribute(sourceTag, sourceEl, sourceAttr))
destAttrs.put(sourceAttr);
else
numDiscarded++;
}
- Attributes enforcedAttrs = whitelist.getEnforcedAttributes(sourceTag);
+ Attributes enforcedAttrs = safelist.getEnforcedAttributes(sourceTag);
destAttrs.addAll(enforcedAttrs);
return new ElementMeta(dest, numDiscarded);
diff --git a/src/main/java/org/jsoup/safety/Safelist.java b/src/main/java/org/jsoup/safety/Safelist.java
new file mode 100644
index 0000000..91ad06b
--- /dev/null
+++ b/src/main/java/org/jsoup/safety/Safelist.java
@@ -0,0 +1,655 @@
+package org.jsoup.safety;
+
+/*
+ Thank you to Ryan Grove (wonko.com) for the Ruby HTML cleaner http://github.com/rgrove/sanitize/, which inspired
+ this safe-list configuration, and the initial defaults.
+ */
+
+import org.jsoup.helper.Validate;
+import org.jsoup.nodes.Attribute;
+import org.jsoup.nodes.Attributes;
+import org.jsoup.nodes.Element;
+
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+import static org.jsoup.internal.Normalizer.lowerCase;
+
+
+/**
+ Safe-lists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.
+ <p>
+ Start with one of the defaults:
+ </p>
+ <ul>
+ <li>{@link #none}
+ <li>{@link #simpleText}
+ <li>{@link #basic}
+ <li>{@link #basicWithImages}
+ <li>{@link #relaxed}
+ </ul>
+ <p>
+ If you need to allow more through (please be careful!), tweak a base safelist with:
+ </p>
+ <ul>
+ <li>{@link #addTags}
+ <li>{@link #addAttributes}
+ <li>{@link #addEnforcedAttribute}
+ <li>{@link #addProtocols}
+ </ul>
+ <p>
+ You can remove any setting from an existing safelist with:
+ </p>
+ <ul>
+ <li>{@link #removeTags}
+ <li>{@link #removeAttributes}
+ <li>{@link #removeEnforcedAttribute}
+ <li>{@link #removeProtocols}
+ </ul>
+
+ <p>
+ The cleaner and these safelists assume that you want to clean a <code>body</code> fragment of HTML (to add user
+ supplied HTML into a templated page), and not to clean a full HTML document. If the latter is the case, either wrap the
+ document HTML around the cleaned body HTML, or create a safelist that allows <code>html</code> and <code>head</code>
+ elements as appropriate.
+ </p>
+ <p>
+ If you are going to extend a safelist, please be very careful. Make sure you understand what attributes may lead to
+ XSS attack vectors. URL attributes are particularly vulnerable and require careful validation. See
+ http://ha.ckers.org/xss.html for some XSS attack examples.
+ </p>
+
+ @author Jonathan Hedley
+ */
+public class Safelist {
+ private Set<TagName> tagNames; // tags allowed, lower case. e.g. [p, br, span]
+ private Map<TagName, Set<AttributeKey>> attributes; // tag -> attribute[]. allowed attributes [href] for a tag.
+ private Map<TagName, Map<AttributeKey, AttributeValue>> enforcedAttributes; // always set these attribute values
+ private Map<TagName, Map<AttributeKey, Set<Protocol>>> protocols; // allowed URL protocols for attributes
+ private boolean preserveRelativeLinks; // option to preserve relative links
+
+ /**
+ This safelist allows only text nodes: all HTML will be stripped.
+
+ @return safelist
+ */
+ public static Safelist none() {
+ return new Safelist();
+ }
+
+ /**
+ This safelist allows only simple text formatting: <code>b, em, i, strong, u</code>. All other HTML (tags and
+ attributes) will be removed.
+
+ @return safelist
+ */
+ public static Safelist simpleText() {
+ return new Safelist()
+ .addTags("b", "em", "i", "strong", "u")
+ ;
+ }
+
+ /**
+ <p>
+ This safelist allows a fuller range of text nodes: <code>a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li,
+ ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul</code>, and appropriate attributes.
+ </p>
+ <p>
+ Links (<code>a</code> elements) can point to <code>http, https, ftp, mailto</code>, and have an enforced
+ <code>rel=nofollow</code> attribute.
+ </p>
+ <p>
+ Does not allow images.
+ </p>
+
+ @return safelist
+ */
+ public static Safelist basic() {
+ return new Safelist()
+ .addTags(
+ "a", "b", "blockquote", "br", "cite", "code", "dd", "dl", "dt", "em",
+ "i", "li", "ol", "p", "pre", "q", "small", "span", "strike", "strong", "sub",
+ "sup", "u", "ul")
+
+ .addAttributes("a", "href")
+ .addAttributes("blockquote", "cite")
+ .addAttributes("q", "cite")
+
+ .addProtocols("a", "href", "ftp", "http", "https", "mailto")
+ .addProtocols("blockquote", "cite", "http", "https")
+ .addProtocols("cite", "cite", "http", "https")
+
+ .addEnforcedAttribute("a", "rel", "nofollow")
+ ;
+
+ }
+
+ /**
+ This safelist allows the same text tags as {@link #basic}, and also allows <code>img</code> tags, with appropriate
+ attributes, with <code>src</code> pointing to <code>http</code> or <code>https</code>.
+
+ @return safelist
+ */
+ public static Safelist basicWithImages() {
+ return basic()
+ .addTags("img")
+ .addAttributes("img", "align", "alt", "height", "src", "title", "width")
+ .addProtocols("img", "src", "http", "https")
+ ;
+ }
+
+ /**
+ This safelist allows a full range of text and structural body HTML: <code>a, b, blockquote, br, caption, cite,
+ code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub,
+ sup, table, tbody, td, tfoot, th, thead, tr, u, ul</code>
+ <p>
+ Links do not have an enforced <code>rel=nofollow</code> attribute, but you can add that if desired.
+ </p>
+
+ @return safelist
+ */
+ public static Safelist relaxed() {
+ return new Safelist()
+ .addTags(
+ "a", "b", "blockquote", "br", "caption", "cite", "code", "col",
+ "colgroup", "dd", "div", "dl", "dt", "em", "h1", "h2", "h3", "h4", "h5", "h6",
+ "i", "img", "li", "ol", "p", "pre", "q", "small", "span", "strike", "strong",
+ "sub", "sup", "table", "tbody", "td", "tfoot", "th", "thead", "tr", "u",
+ "ul")
+
+ .addAttributes("a", "href", "title")
+ .addAttributes("blockquote", "cite")
+ .addAttributes("col", "span", "width")
+ .addAttributes("colgroup", "span", "width")
+ .addAttributes("img", "align", "alt", "height", "src", "title", "width")
+ .addAttributes("ol", "start", "type")
+ .addAttributes("q", "cite")
+ .addAttributes("table", "summary", "width")
+ .addAttributes("td", "abbr", "axis", "colspan", "rowspan", "width")
+ .addAttributes(
+ "th", "abbr", "axis", "colspan", "rowspan", "scope",
+ "width")
+ .addAttributes("ul", "type")
+
+ .addProtocols("a", "href", "ftp", "http", "https", "mailto")
+ .addProtocols("blockquote", "cite", "http", "https")
+ .addProtocols("cite", "cite", "http", "https")
+ .addProtocols("img", "src", "http", "https")
+ .addProtocols("q", "cite", "http", "https")
+ ;
+ }
+
+ /**
+ Create a new, empty safelist. Generally it will be better to start with a default prepared safelist instead.
+
+ @see #basic()
+ @see #basicWithImages()
+ @see #simpleText()
+ @see #relaxed()
+ */
+ public Safelist() {
+ tagNames = new HashSet<>();
+ attributes = new HashMap<>();
+ enforcedAttributes = new HashMap<>();
+ protocols = new HashMap<>();
+ preserveRelativeLinks = false;
+ }
+
+ /**
+ Copy an existing Safelist to a new Safelist. Only the top-level collections are copied; nested sets and maps are shared with the source.
+ @param copy the Safelist to copy
+ */
+ public Safelist(Safelist copy) {
+ this();
+ tagNames.addAll(copy.tagNames);
+ attributes.putAll(copy.attributes);
+ enforcedAttributes.putAll(copy.enforcedAttributes);
+ protocols.putAll(copy.protocols);
+ preserveRelativeLinks = copy.preserveRelativeLinks;
+ }
+
+ /**
+ Add a list of allowed elements to a safelist. (If a tag is not allowed, it will be removed from the HTML.)
+
+ @param tags tag names to allow
+ @return this (for chaining)
+ */
+ public Safelist addTags(String... tags) {
+ Validate.notNull(tags);
+
+ for (String tagName : tags) {
+ Validate.notEmpty(tagName);
+ tagNames.add(TagName.valueOf(tagName));
+ }
+ return this;
+ }
+
+ /**
+ Remove a list of allowed elements from a safelist. (If a tag is not allowed, it will be removed from the HTML.)
+
+ @param tags tag names to disallow
+ @return this (for chaining)
+ */
+ public Safelist removeTags(String... tags) {
+ Validate.notNull(tags);
+
+ for(String tag: tags) {
+ Validate.notEmpty(tag);
+ TagName tagName = TagName.valueOf(tag);
+
+ if(tagNames.remove(tagName)) { // Only look in sub-maps if tag was allowed
+ attributes.remove(tagName);
+ enforcedAttributes.remove(tagName);
+ protocols.remove(tagName);
+ }
+ }
+ return this;
+ }
+
+ /**
+ Add a list of allowed attributes to a tag. (If an attribute is not allowed on an element, it will be removed.)
+ <p>
+ E.g.: <code>addAttributes("a", "href", "class")</code> allows <code>href</code> and <code>class</code> attributes
+ on <code>a</code> tags.
+ </p>
+ <p>
+ To make an attribute valid for <b>all tags</b>, use the pseudo tag <code>:all</code>, e.g.
+ <code>addAttributes(":all", "class")</code>.
+ </p>
+
+ @param tag The tag the attributes are for. The tag will be added to the allowed tag list if necessary.
+ @param attributes List of valid attributes for the tag
+ @return this (for chaining)
+ */
+ public Safelist addAttributes(String tag, String... attributes) {
+ Validate.notEmpty(tag);
+ Validate.notNull(attributes);
+ Validate.isTrue(attributes.length > 0, "No attribute names supplied.");
+
+ TagName tagName = TagName.valueOf(tag);
+ tagNames.add(tagName);
+ Set<AttributeKey> attributeSet = new HashSet<>();
+ for (String key : attributes) {
+ Validate.notEmpty(key);
+ attributeSet.add(AttributeKey.valueOf(key));
+ }
+ if (this.attributes.containsKey(tagName)) {
+ Set<AttributeKey> currentSet = this.attributes.get(tagName);
+ currentSet.addAll(attributeSet);
+ } else {
+ this.attributes.put(tagName, attributeSet);
+ }
+ return this;
+ }
+
+ /**
+ Remove a list of allowed attributes from a tag. (If an attribute is not allowed on an element, it will be removed.)
+ <p>
+ E.g.: <code>removeAttributes("a", "href", "class")</code> disallows <code>href</code> and <code>class</code>
+ attributes on <code>a</code> tags.
+ </p>
+ <p>
+ To make an attribute invalid for <b>all tags</b>, use the pseudo tag <code>:all</code>, e.g.
+ <code>removeAttributes(":all", "class")</code>.
+ </p>
+
+ @param tag The tag the attributes are for.
+ @param attributes List of invalid attributes for the tag
+ @return this (for chaining)
+ */
+ public Safelist removeAttributes(String tag, String... attributes) {
+ Validate.notEmpty(tag);
+ Validate.notNull(attributes);
+ Validate.isTrue(attributes.length > 0, "No attribute names supplied.");
+
+ TagName tagName = TagName.valueOf(tag);
+ Set<AttributeKey> attributeSet = new HashSet<>();
+ for (String key : attributes) {
+ Validate.notEmpty(key);
+ attributeSet.add(AttributeKey.valueOf(key));
+ }
+ if(tagNames.contains(tagName) && this.attributes.containsKey(tagName)) { // Only look in sub-maps if tag was allowed
+ Set<AttributeKey> currentSet = this.attributes.get(tagName);
+ currentSet.removeAll(attributeSet);
+
+ if(currentSet.isEmpty()) // Remove tag from attribute map if no attributes are allowed for tag
+ this.attributes.remove(tagName);
+ }
+ if(tag.equals(":all")) // Attribute needs to be removed from all individually set tags
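+ for(TagName name: new HashSet<>(this.attributes.keySet())) { // iterate over a copy: entries may be removed below, which would otherwise risk a ConcurrentModificationException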
+ Set<AttributeKey> currentSet = this.attributes.get(name);
+ currentSet.removeAll(attributeSet);
+
+ if(currentSet.isEmpty()) // Remove tag from attribute map if no attributes are allowed for tag
+ this.attributes.remove(name);
+ }
+ return this;
+ }
+
+ /**
+ Add an enforced attribute to a tag. An enforced attribute will always be added to the element. If the element
+ already has the attribute set, it will be overridden with this value.
+ <p>
+ E.g.: <code>addEnforcedAttribute("a", "rel", "nofollow")</code> will make all <code>a</code> tags output as
+ <code>&lt;a href="..." rel="nofollow"&gt;</code>
+ </p>
+
+ @param tag The tag the enforced attribute is for. The tag will be added to the allowed tag list if necessary.
+ @param attribute The attribute name
+ @param value The enforced attribute value
+ @return this (for chaining)
+ */
+ public Safelist addEnforcedAttribute(String tag, String attribute, String value) {
+ Validate.notEmpty(tag);
+ Validate.notEmpty(attribute);
+ Validate.notEmpty(value);
+
+ TagName tagName = TagName.valueOf(tag);
+ tagNames.add(tagName);
+ AttributeKey attrKey = AttributeKey.valueOf(attribute);
+ AttributeValue attrVal = AttributeValue.valueOf(value);
+
+ if (enforcedAttributes.containsKey(tagName)) {
+ enforcedAttributes.get(tagName).put(attrKey, attrVal);
+ } else {
+ Map<AttributeKey, AttributeValue> attrMap = new HashMap<>();
+ attrMap.put(attrKey, attrVal);
+ enforcedAttributes.put(tagName, attrMap);
+ }
+ return this;
+ }
+
+ /**
+ Remove a previously configured enforced attribute from a tag.
+
+ @param tag The tag the enforced attribute is for.
+ @param attribute The attribute name
+ @return this (for chaining)
+ */
+ public Safelist removeEnforcedAttribute(String tag, String attribute) {
+ Validate.notEmpty(tag);
+ Validate.notEmpty(attribute);
+
+ TagName tagName = TagName.valueOf(tag);
+ if(tagNames.contains(tagName) && enforcedAttributes.containsKey(tagName)) {
+ AttributeKey attrKey = AttributeKey.valueOf(attribute);
+ Map<AttributeKey, AttributeValue> attrMap = enforcedAttributes.get(tagName);
+ attrMap.remove(attrKey);
+
+ if(attrMap.isEmpty()) // Remove tag from enforced attribute map if no enforced attributes are present
+ enforcedAttributes.remove(tagName);
+ }
+ return this;
+ }
+
+ /**
+ * Configure this Safelist to preserve relative links in an element's URL attribute, or convert them to absolute
+ * links. By default, this is <b>false</b>: URLs will be made absolute (i.e. will start with an allowed protocol,
+ * such as {@code http://}).
+ * <p>
+ * Note that when handling relative links, the input document must have an appropriate {@code base URI} set when
+ * parsing, so that the link's protocol can be confirmed. Regardless of the setting of the {@code preserve relative
+ * links} option, the link must be resolvable against the base URI to an allowed protocol; otherwise the attribute
+ * will be removed.
+ * </p>
+ *
+ * @param preserve {@code true} to allow relative links, {@code false} (default) to deny
+ * @return this Safelist, for chaining.
+ * @see #addProtocols
+ */
+ public Safelist preserveRelativeLinks(boolean preserve) {
+ preserveRelativeLinks = preserve;
+ return this;
+ }
+
+ /**
+ Add allowed URL protocols for an element's URL attribute. This restricts the possible values of the attribute to
+ URLs with the defined protocol.
+ <p>
+ E.g.: <code>addProtocols("a", "href", "ftp", "http", "https")</code>
+ </p>
+ <p>
+ To allow a link to an in-page URL anchor (i.e. <code>&lt;a href="#anchor"&gt;</code>), add a <code>#</code>:<br>
+ E.g.: <code>addProtocols("a", "href", "#")</code>
+ </p>
+
+ @param tag Tag the URL protocol is for
+ @param attribute Attribute name
+ @param protocols List of valid protocols
+ @return this, for chaining
+ */
+ public Safelist addProtocols(String tag, String attribute, String... protocols) {
+ Validate.notEmpty(tag);
+ Validate.notEmpty(attribute);
+ Validate.notNull(protocols);
+
+ TagName tagName = TagName.valueOf(tag);
+ AttributeKey attrKey = AttributeKey.valueOf(attribute);
+ Map<AttributeKey, Set<Protocol>> attrMap;
+ Set<Protocol> protSet;
+
+ if (this.protocols.containsKey(tagName)) {
+ attrMap = this.protocols.get(tagName);
+ } else {
+ attrMap = new HashMap<>();
+ this.protocols.put(tagName, attrMap);
+ }
+ if (attrMap.containsKey(attrKey)) {
+ protSet = attrMap.get(attrKey);
+ } else {
+ protSet = new HashSet<>();
+ attrMap.put(attrKey, protSet);
+ }
+ for (String protocol : protocols) {
+ Validate.notEmpty(protocol);
+ Protocol prot = Protocol.valueOf(protocol);
+ protSet.add(prot);
+ }
+ return this;
+ }
+
+ /**
+ Remove allowed URL protocols for an element's URL attribute. If you remove all protocols for an attribute, that
+ attribute will allow any protocol.
+ <p>
+ E.g.: <code>removeProtocols("a", "href", "ftp")</code>
+ </p>
+
+ @param tag Tag the URL protocol is for
+ @param attribute Attribute name
+ @param removeProtocols List of invalid protocols
+ @return this, for chaining
+ */
+ public Safelist removeProtocols(String tag, String attribute, String... removeProtocols) {
+ Validate.notEmpty(tag);
+ Validate.notEmpty(attribute);
+ Validate.notNull(removeProtocols);
+
+ TagName tagName = TagName.valueOf(tag);
+ AttributeKey attr = AttributeKey.valueOf(attribute);
+
+ // make sure that what we're removing actually exists; otherwise can open the tag to any data and that can
+ // be surprising
+ Validate.isTrue(protocols.containsKey(tagName), "Cannot remove a protocol that is not set.");
+ Map<AttributeKey, Set<Protocol>> tagProtocols = protocols.get(tagName);
+ Validate.isTrue(tagProtocols.containsKey(attr), "Cannot remove a protocol that is not set.");
+
+ Set<Protocol> attrProtocols = tagProtocols.get(attr);
+ for (String protocol : removeProtocols) {
+ Validate.notEmpty(protocol);
+ attrProtocols.remove(Protocol.valueOf(protocol));
+ }
+
+ if (attrProtocols.isEmpty()) { // Remove protocol set if empty
+ tagProtocols.remove(attr);
+ if (tagProtocols.isEmpty()) // Remove entry for tag if empty
+ protocols.remove(tagName);
+ }
+ return this;
+ }
+
+ /**
+ * Test if the supplied tag is allowed by this safelist
+ * @param tag test tag
+ * @return true if allowed
+ */
+ protected boolean isSafeTag(String tag) {
+ return tagNames.contains(TagName.valueOf(tag));
+ }
+
+ /**
+ * Test if the supplied attribute is allowed by this safelist for this tag
+ * @param tagName tag to consider allowing the attribute in
+ * @param el element under test, to confirm protocol
+ * @param attr attribute under test
+ * @return true if allowed
+ */
+ protected boolean isSafeAttribute(String tagName, Element el, Attribute attr) {
+ TagName tag = TagName.valueOf(tagName);
+ AttributeKey key = AttributeKey.valueOf(attr.getKey());
+
+ Set<AttributeKey> okSet = attributes.get(tag);
+ if (okSet != null && okSet.contains(key)) {
+ if (protocols.containsKey(tag)) {
+ Map<AttributeKey, Set<Protocol>> attrProts = protocols.get(tag);
+ // ok if no protocols are defined for this attribute; otherwise test
+ return !attrProts.containsKey(key) || testValidProtocol(el, attr, attrProts.get(key));
+ } else { // attribute found, no protocols defined, so OK
+ return true;
+ }
+ }
+ // might be an enforced attribute?
+ Map<AttributeKey, AttributeValue> enforcedSet = enforcedAttributes.get(tag);
+ if (enforcedSet != null) {
+ Attributes expect = getEnforcedAttributes(tagName);
+ String attrKey = attr.getKey();
+ if (expect.hasKeyIgnoreCase(attrKey)) {
+ return expect.getIgnoreCase(attrKey).equals(attr.getValue());
+ }
+ }
+ // no attributes defined for tag, try :all tag
+ return !tagName.equals(":all") && isSafeAttribute(":all", el, attr);
+ }
+
+ private boolean testValidProtocol(Element el, Attribute attr, Set<Protocol> protocols) {
+ // try to resolve relative urls to abs, and optionally update the attribute so output html has abs.
+ // rels without a baseuri get removed
+ String value = el.absUrl(attr.getKey());
+ if (value.length() == 0)
+ value = attr.getValue(); // if it could not be made abs, run as-is to allow custom unknown protocols
+ if (!preserveRelativeLinks)
+ attr.setValue(value);
+
+ for (Protocol protocol : protocols) {
+ String prot = protocol.toString();
+
+ if (prot.equals("#")) { // allows anchor links
+ if (isValidAnchor(value)) {
+ return true;
+ } else {
+ continue;
+ }
+ }
+
+ prot += ":";
+
+ if (lowerCase(value).startsWith(prot)) {
+ return true;
+ }
+ }
+ return false;
+ }
+
+ private boolean isValidAnchor(String value) {
+ return value.startsWith("#") && !value.matches(".*\\s.*");
+ }
+
+ Attributes getEnforcedAttributes(String tagName) {
+ Attributes attrs = new Attributes();
+ TagName tag = TagName.valueOf(tagName);
+ if (enforcedAttributes.containsKey(tag)) {
+ Map<AttributeKey, AttributeValue> keyVals = enforcedAttributes.get(tag);
+ for (Map.Entry<AttributeKey, AttributeValue> entry : keyVals.entrySet()) {
+ attrs.put(entry.getKey().toString(), entry.getValue().toString());
+ }
+ }
+ return attrs;
+ }
+
+ // named types for config. All just hold strings, but here for my sanity.
+
+ static class TagName extends TypedValue {
+ TagName(String value) {
+ super(value);
+ }
+
+ static TagName valueOf(String value) {
+ return new TagName(value);
+ }
+ }
+
+ static class AttributeKey extends TypedValue {
+ AttributeKey(String value) {
+ super(value);
+ }
+
+ static AttributeKey valueOf(String value) {
+ return new AttributeKey(value);
+ }
+ }
+
+ static class AttributeValue extends TypedValue {
+ AttributeValue(String value) {
+ super(value);
+ }
+
+ static AttributeValue valueOf(String value) {
+ return new AttributeValue(value);
+ }
+ }
+
+ static class Protocol extends TypedValue {
+ Protocol(String value) {
+ super(value);
+ }
+
+ static Protocol valueOf(String value) {
+ return new Protocol(value);
+ }
+ }
+
+ abstract static class TypedValue {
+ private String value;
+
+ TypedValue(String value) {
+ Validate.notNull(value);
+ this.value = value;
+ }
+
+ @Override
+ public int hashCode() {
+ final int prime = 31;
+ int result = 1;
+ result = prime * result + ((value == null) ? 0 : value.hashCode());
+ return result;
+ }
+
+ @Override
+ public boolean equals(Object obj) {
+ if (this == obj) return true;
+ if (obj == null) return false;
+ if (getClass() != obj.getClass()) return false;
+ TypedValue other = (TypedValue) obj;
+ if (value == null) {
+ return other.value == null;
+ } else return value.equals(other.value);
+ }
+
+ @Override
+ public String toString() {
+ return value;
+ }
+ }
+}
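Review note on the protocol check in `Safelist.testValidProtocol` above: a minimal standalone sketch of the matching loop, assuming plain strings in place of the `Protocol` wrapper and skipping the `absUrl` resolution step. The class name `ProtocolCheckSketch` and the `isAllowed` helper are hypothetical and not part of jsoup; `toLowerCase(Locale.ENGLISH)` stands in for the `lowerCase` helper imported in the original.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical standalone sketch; not part of jsoup.
class ProtocolCheckSketch {
    // Follows the loop in testValidProtocol: "#" permits in-page anchors,
    // any other entry must prefix the value as "<protocol>:".
    static boolean isAllowed(String value, Set<String> protocols) {
        for (String protocol : protocols) {
            if (protocol.equals("#")) {
                // An anchor must start with '#' and contain no whitespace.
                if (value.startsWith("#") && !value.matches(".*\\s.*")) return true;
                continue;
            }
            // The comparison is case-insensitive on the value side only.
            if (value.toLowerCase(Locale.ENGLISH).startsWith(protocol + ":")) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> allowed = new LinkedHashSet<>(Arrays.asList("http", "https", "#"));
        System.out.println(isAllowed("HTTPS://example.com/", allowed));
        System.out.println(isAllowed("javascript:alert(1)", allowed));
        System.out.println(isAllowed("#top", allowed));
    }
}
```

The sketch leaves out the side effect in the real method, which rewrites the attribute to its absolute form unless `preserveRelativeLinks` is set.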
diff --git a/src/main/java/org/jsoup/safety/Whitelist.java b/src/main/java/org/jsoup/safety/Whitelist.java
index 229ab36..986cf4d 100644
--- a/src/main/java/org/jsoup/safety/Whitelist.java
+++ b/src/main/java/org/jsoup/safety/Whitelist.java
@@ -1,643 +1,115 @@
package org.jsoup.safety;
-/*
- Thank you to Ryan Grove (wonko.com) for the Ruby HTML cleaner http://github.com/rgrove/sanitize/, which inspired
- this whitelist configuration, and the initial defaults.
- */
-
-import org.jsoup.helper.Validate;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Element;
-import java.util.HashMap;
-import java.util.HashSet;
-import java.util.Map;
-import java.util.Set;
-
-import static org.jsoup.internal.Normalizer.lowerCase;
-
-
/**
- Whitelists define what HTML (elements and attributes) to allow through the cleaner. Everything else is removed.
+ @deprecated As of release <code>v1.14.1</code>, this class is deprecated in favour of {@link Safelist}. The name has
+ been changed with the intent of promoting more inclusive language. {@link Safelist} is a drop-in replacement, and no
+ further changes other than updating the name in your code are required to cleanly migrate. This class will be
+ removed in <code>v1.15.1</code>. Until that release, this class acts as a shim to maintain code compatibility
+ (source and binary).
<p>
- Start with one of the defaults:
- </p>
- <ul>
- <li>{@link #none}
- <li>{@link #simpleText}
- <li>{@link #basic}
- <li>{@link #basicWithImages}
- <li>{@link #relaxed}
- </ul>
- <p>
- If you need to allow more through (please be careful!), tweak a base whitelist with:
- </p>
- <ul>
- <li>{@link #addTags}
- <li>{@link #addAttributes}
- <li>{@link #addEnforcedAttribute}
- <li>{@link #addProtocols}
- </ul>
- <p>
- You can remove any setting from an existing whitelist with:
- </p>
- <ul>
- <li>{@link #removeTags}
- <li>{@link #removeAttributes}
- <li>{@link #removeEnforcedAttribute}
- <li>{@link #removeProtocols}
- </ul>
-
- <p>
- The cleaner and these whitelists assume that you want to clean a <code>body</code> fragment of HTML (to add user
- supplied HTML into a templated page), and not to clean a full HTML document. If the latter is the case, either wrap the
- document HTML around the cleaned body HTML, or create a whitelist that allows <code>html</code> and <code>head</code>
- elements as appropriate.
- </p>
- <p>
- If you are going to extend a whitelist, please be very careful. Make sure you understand what attributes may lead to
- XSS attack vectors. URL attributes are particularly vulnerable and require careful validation. See
- http://ha.ckers.org/xss.html for some XSS attack examples.
- </p>
-
- @author Jonathan Hedley
- */
-public class Whitelist {
- private Set<TagName> tagNames; // tags allowed, lower case. e.g. [p, br, span]
- private Map<TagName, Set<AttributeKey>> attributes; // tag -> attribute[]. allowed attributes [href] for a tag.
- private Map<TagName, Map<AttributeKey, AttributeValue>> enforcedAttributes; // always set these attribute values
- private Map<TagName, Map<AttributeKey, Set<Protocol>>> protocols; // allowed URL protocols for attributes
- private boolean preserveRelativeLinks; // option to preserve relative links
-
- /**
- This whitelist allows only text nodes: all HTML will be stripped.
-
- @return whitelist
- */
- public static Whitelist none() {
- return new Whitelist();
- }
-
- /**
- This whitelist allows only simple text formatting: <code>b, em, i, strong, u</code>. All other HTML (tags and
- attributes) will be removed.
-
- @return whitelist
- */
- public static Whitelist simpleText() {
- return new Whitelist()
- .addTags("b", "em", "i", "strong", "u")
- ;
- }
-
- /**
- <p>
- This whitelist allows a fuller range of text nodes: <code>a, b, blockquote, br, cite, code, dd, dl, dt, em, i, li,
- ol, p, pre, q, small, span, strike, strong, sub, sup, u, ul</code>, and appropriate attributes.
- </p>
- <p>
- Links (<code>a</code> elements) can point to <code>http, https, ftp, mailto</code>, and have an enforced
- <code>rel=nofollow</code> attribute.
- </p>
- <p>
- Does not allow images.
- </p>
-
- @return whitelist
- */
- public static Whitelist basic() {
- return new Whitelist()
- .addTags(
- "a", "b", "blockquote", "br", "cite", "code", "dd", "dl", "dt", "em",
- "i", "li", "ol", "p", "pre", "q", "small", "span", "strike", "strong", "sub",
- "sup", "u", "ul")
-
- .addAttributes("a", "href")
- .addAttributes("blockquote", "cite")
- .addAttributes("q", "cite")
-
- .addProtocols("a", "href", "ftp", "http", "https", "mailto")
- .addProtocols("blockquote", "cite", "http", "https")
- .addProtocols("cite", "cite", "http", "https")
-
- .addEnforcedAttribute("a", "rel", "nofollow")
- ;
-
- }
-
- /**
- This whitelist allows the same text tags as {@link #basic}, and also allows <code>img</code> tags, with appropriate
- attributes, with <code>src</code> pointing to <code>http</code> or <code>https</code>.
-
- @return whitelist
- */
- public static Whitelist basicWithImages() {
- return basic()
- .addTags("img")
- .addAttributes("img", "align", "alt", "height", "src", "title", "width")
- .addProtocols("img", "src", "http", "https")
- ;
- }
-
- /**
- This whitelist allows a full range of text and structural body HTML: <code>a, b, blockquote, br, caption, cite,
- code, col, colgroup, dd, div, dl, dt, em, h1, h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, sub,
- sup, table, tbody, td, tfoot, th, thead, tr, u, ul</code>
- <p>
- Links do not have an enforced <code>rel=nofollow</code> attribute, but you can add that if desired.
- </p>
-
- @return whitelist
- */
- public static Whitelist relaxed() {
- return new Whitelist()
- .addTags(
- "a", "b", "blockquote", "br", "caption", "cite", "code", "col",
- "colgroup", "dd", "div", "dl", "dt", "em", "h1", "h2", "h3", "h4", "h5", "h6",
- "i", "img", "li", "ol", "p", "pre", "q", "small", "span", "strike", "strong",
- "sub", "sup", "table", "tbody", "td", "tfoot", "th", "thead", "tr", "u",
- "ul")
-
- .addAttributes("a", "href", "title")
- .addAttributes("blockquote", "cite")
- .addAttributes("col", "span", "width")
- .addAttributes("colgroup", "span", "width")
- .addAttributes("img", "align", "alt", "height", "src", "title", "width")
- .addAttributes("ol", "start", "type")
- .addAttributes("q", "cite")
- .addAttributes("table", "summary", "width")
- .addAttributes("td", "abbr", "axis", "colspan", "rowspan", "width")
- .addAttributes(
- "th", "abbr", "axis", "colspan", "rowspan", "scope",
- "width")
- .addAttributes("ul", "type")
-
- .addProtocols("a", "href", "ftp", "http", "https", "mailto")
- .addProtocols("blockquote", "cite", "http", "https")
- .addProtocols("cite", "cite", "http", "https")
- .addProtocols("img", "src", "http", "https")
- .addProtocols("q", "cite", "http", "https")
- ;
- }
-
- /**
- Create a new, empty whitelist. Generally it will be better to start with a default prepared whitelist instead.
-
- @see #basic()
- @see #basicWithImages()
- @see #simpleText()
- @see #relaxed()
- */
+ For a clear rationale for this change, please see
+ <a href="https://tools.ietf.org/html/draft-knodel-terminology-04" title="draft-knodel-terminology-04">Terminology,
+ Power, and Inclusive Language in Internet-Drafts and RFCs</a> */
+@Deprecated
+public class Whitelist extends Safelist {
public Whitelist() {
- tagNames = new HashSet<>();
- attributes = new HashMap<>();
- enforcedAttributes = new HashMap<>();
- protocols = new HashMap<>();
- preserveRelativeLinks = false;
+ super();
}
- /**
- Add a list of allowed elements to a whitelist. (If a tag is not allowed, it will be removed from the HTML.)
+ public Whitelist(Safelist copy) {
+ super(copy);
+ }
- @param tags tag names to allow
- @return this (for chaining)
- */
+ static public Whitelist basic() {
+ return new Whitelist(Safelist.basic());
+ }
+
+ static public Whitelist basicWithImages() {
+ return new Whitelist(Safelist.basicWithImages());
+ }
+
+ static public Whitelist none() {
+ return new Whitelist(Safelist.none());
+ }
+
+ static public Whitelist relaxed() {
+ return new Whitelist(Safelist.relaxed());
+ }
+
+ static public Whitelist simpleText() {
+ return new Whitelist(Safelist.simpleText());
+ }
+
+ @Override
public Whitelist addTags(String... tags) {
- Validate.notNull(tags);
-
- for (String tagName : tags) {
- Validate.notEmpty(tagName);
- tagNames.add(TagName.valueOf(tagName));
- }
+ super.addTags(tags);
return this;
}
- /**
- Remove a list of allowed elements from a whitelist. (If a tag is not allowed, it will be removed from the HTML.)
-
- @param tags tag names to disallow
- @return this (for chaining)
- */
+ @Override
public Whitelist removeTags(String... tags) {
- Validate.notNull(tags);
-
- for(String tag: tags) {
- Validate.notEmpty(tag);
- TagName tagName = TagName.valueOf(tag);
-
- if(tagNames.remove(tagName)) { // Only look in sub-maps if tag was allowed
- attributes.remove(tagName);
- enforcedAttributes.remove(tagName);
- protocols.remove(tagName);
- }
- }
+ super.removeTags(tags);
return this;
}
- /**
- Add a list of allowed attributes to a tag. (If an attribute is not allowed on an element, it will be removed.)
- <p>
- E.g.: <code>addAttributes("a", "href", "class")</code> allows <code>href</code> and <code>class</code> attributes
- on <code>a</code> tags.
- </p>
- <p>
- To make an attribute valid for <b>all tags</b>, use the pseudo tag <code>:all</code>, e.g.
- <code>addAttributes(":all", "class")</code>.
- </p>
-
- @param tag The tag the attributes are for. The tag will be added to the allowed tag list if necessary.
- @param attributes List of valid attributes for the tag
- @return this (for chaining)
- */
+ @Override
public Whitelist addAttributes(String tag, String... attributes) {
- Validate.notEmpty(tag);
- Validate.notNull(attributes);
- Validate.isTrue(attributes.length > 0, "No attribute names supplied.");
-
- TagName tagName = TagName.valueOf(tag);
- tagNames.add(tagName);
- Set<AttributeKey> attributeSet = new HashSet<>();
- for (String key : attributes) {
- Validate.notEmpty(key);
- attributeSet.add(AttributeKey.valueOf(key));
- }
- if (this.attributes.containsKey(tagName)) {
- Set<AttributeKey> currentSet = this.attributes.get(tagName);
- currentSet.addAll(attributeSet);
- } else {
- this.attributes.put(tagName, attributeSet);
- }
+ super.addAttributes(tag, attributes);
return this;
}
- /**
- Remove a list of allowed attributes from a tag. (If an attribute is not allowed on an element, it will be removed.)
- <p>
- E.g.: <code>removeAttributes("a", "href", "class")</code> disallows <code>href</code> and <code>class</code>
- attributes on <code>a</code> tags.
- </p>
- <p>
- To make an attribute invalid for <b>all tags</b>, use the pseudo tag <code>:all</code>, e.g.
- <code>removeAttributes(":all", "class")</code>.
- </p>
-
- @param tag The tag the attributes are for.
- @param attributes List of invalid attributes for the tag
- @return this (for chaining)
- */
+ @Override
public Whitelist removeAttributes(String tag, String... attributes) {
- Validate.notEmpty(tag);
- Validate.notNull(attributes);
- Validate.isTrue(attributes.length > 0, "No attribute names supplied.");
-
- TagName tagName = TagName.valueOf(tag);
- Set<AttributeKey> attributeSet = new HashSet<>();
- for (String key : attributes) {
- Validate.notEmpty(key);
- attributeSet.add(AttributeKey.valueOf(key));
- }
- if(tagNames.contains(tagName) && this.attributes.containsKey(tagName)) { // Only look in sub-maps if tag was allowed
- Set<AttributeKey> currentSet = this.attributes.get(tagName);
- currentSet.removeAll(attributeSet);
-
- if(currentSet.isEmpty()) // Remove tag from attribute map if no attributes are allowed for tag
- this.attributes.remove(tagName);
- }
- if(tag.equals(":all")) // Attribute needs to be removed from all individually set tags
- for(TagName name: this.attributes.keySet()) {
- Set<AttributeKey> currentSet = this.attributes.get(name);
- currentSet.removeAll(attributeSet);
-
- if(currentSet.isEmpty()) // Remove tag from attribute map if no attributes are allowed for tag
- this.attributes.remove(name);
- }
+ super.removeAttributes(tag, attributes);
return this;
}
- /**
- Add an enforced attribute to a tag. An enforced attribute will always be added to the element. If the element
- already has the attribute set, it will be overridden with this value.
- <p>
- E.g.: <code>addEnforcedAttribute("a", "rel", "nofollow")</code> will make all <code>a</code> tags output as
- <code><a href="..." rel="nofollow"></code>
- </p>
-
- @param tag The tag the enforced attribute is for. The tag will be added to the allowed tag list if necessary.
- @param attribute The attribute name
- @param value The enforced attribute value
- @return this (for chaining)
- */
+ @Override
public Whitelist addEnforcedAttribute(String tag, String attribute, String value) {
- Validate.notEmpty(tag);
- Validate.notEmpty(attribute);
- Validate.notEmpty(value);
-
- TagName tagName = TagName.valueOf(tag);
- tagNames.add(tagName);
- AttributeKey attrKey = AttributeKey.valueOf(attribute);
- AttributeValue attrVal = AttributeValue.valueOf(value);
-
- if (enforcedAttributes.containsKey(tagName)) {
- enforcedAttributes.get(tagName).put(attrKey, attrVal);
- } else {
- Map<AttributeKey, AttributeValue> attrMap = new HashMap<>();
- attrMap.put(attrKey, attrVal);
- enforcedAttributes.put(tagName, attrMap);
- }
+ super.addEnforcedAttribute(tag, attribute, value);
return this;
}
- /**
- Remove a previously configured enforced attribute from a tag.
-
- @param tag The tag the enforced attribute is for.
- @param attribute The attribute name
- @return this (for chaining)
- */
+ @Override
public Whitelist removeEnforcedAttribute(String tag, String attribute) {
- Validate.notEmpty(tag);
- Validate.notEmpty(attribute);
-
- TagName tagName = TagName.valueOf(tag);
- if(tagNames.contains(tagName) && enforcedAttributes.containsKey(tagName)) {
- AttributeKey attrKey = AttributeKey.valueOf(attribute);
- Map<AttributeKey, AttributeValue> attrMap = enforcedAttributes.get(tagName);
- attrMap.remove(attrKey);
-
- if(attrMap.isEmpty()) // Remove tag from enforced attribute map if no enforced attributes are present
- enforcedAttributes.remove(tagName);
- }
+ super.removeEnforcedAttribute(tag, attribute);
return this;
}
- /**
- * Configure this Whitelist to preserve relative links in an element's URL attribute, or convert them to absolute
- * links. By default, this is <b>false</b>: URLs will be made absolute (e.g. start with an allowed protocol, like
- * e.g. {@code http://}.
- * <p>
- * Note that when handling relative links, the input document must have an appropriate {@code base URI} set when
- * parsing, so that the link's protocol can be confirmed. Regardless of the setting of the {@code preserve relative
- * links} option, the link must be resolvable against the base URI to an allowed protocol; otherwise the attribute
- * will be removed.
- * </p>
- *
- * @param preserve {@code true} to allow relative links, {@code false} (default) to deny
- * @return this Whitelist, for chaining.
- * @see #addProtocols
- */
+ @Override
public Whitelist preserveRelativeLinks(boolean preserve) {
- preserveRelativeLinks = preserve;
+ super.preserveRelativeLinks(preserve);
return this;
}
- /**
- Add allowed URL protocols for an element's URL attribute. This restricts the possible values of the attribute to
- URLs with the defined protocol.
- <p>
- E.g.: <code>addProtocols("a", "href", "ftp", "http", "https")</code>
- </p>
- <p>
- To allow a link to an in-page URL anchor (i.e. <code><a href="#anchor"></code>, add a <code>#</code>:<br>
- E.g.: <code>addProtocols("a", "href", "#")</code>
- </p>
-
- @param tag Tag the URL protocol is for
- @param attribute Attribute name
- @param protocols List of valid protocols
- @return this, for chaining
- */
+ @Override
public Whitelist addProtocols(String tag, String attribute, String... protocols) {
- Validate.notEmpty(tag);
- Validate.notEmpty(attribute);
- Validate.notNull(protocols);
-
- TagName tagName = TagName.valueOf(tag);
- AttributeKey attrKey = AttributeKey.valueOf(attribute);
- Map<AttributeKey, Set<Protocol>> attrMap;
- Set<Protocol> protSet;
-
- if (this.protocols.containsKey(tagName)) {
- attrMap = this.protocols.get(tagName);
- } else {
- attrMap = new HashMap<>();
- this.protocols.put(tagName, attrMap);
- }
- if (attrMap.containsKey(attrKey)) {
- protSet = attrMap.get(attrKey);
- } else {
- protSet = new HashSet<>();
- attrMap.put(attrKey, protSet);
- }
- for (String protocol : protocols) {
- Validate.notEmpty(protocol);
- Protocol prot = Protocol.valueOf(protocol);
- protSet.add(prot);
- }
+ super.addProtocols(tag, attribute, protocols);
return this;
}
- /**
- Remove allowed URL protocols for an element's URL attribute. If you remove all protocols for an attribute, that
- attribute will allow any protocol.
- <p>
- E.g.: <code>removeProtocols("a", "href", "ftp")</code>
- </p>
-
- @param tag Tag the URL protocol is for
- @param attribute Attribute name
- @param removeProtocols List of invalid protocols
- @return this, for chaining
- */
+ @Override
public Whitelist removeProtocols(String tag, String attribute, String... removeProtocols) {
- Validate.notEmpty(tag);
- Validate.notEmpty(attribute);
- Validate.notNull(removeProtocols);
-
- TagName tagName = TagName.valueOf(tag);
- AttributeKey attr = AttributeKey.valueOf(attribute);
-
- // make sure that what we're removing actually exists; otherwise can open the tag to any data and that can
- // be surprising
- Validate.isTrue(protocols.containsKey(tagName), "Cannot remove a protocol that is not set.");
- Map<AttributeKey, Set<Protocol>> tagProtocols = protocols.get(tagName);
- Validate.isTrue(tagProtocols.containsKey(attr), "Cannot remove a protocol that is not set.");
-
- Set<Protocol> attrProtocols = tagProtocols.get(attr);
- for (String protocol : removeProtocols) {
- Validate.notEmpty(protocol);
- attrProtocols.remove(Protocol.valueOf(protocol));
- }
-
- if (attrProtocols.isEmpty()) { // Remove protocol set if empty
- tagProtocols.remove(attr);
- if (tagProtocols.isEmpty()) // Remove entry for tag if empty
- protocols.remove(tagName);
- }
+ super.removeProtocols(tag, attribute, removeProtocols);
return this;
}
- /**
- * Test if the supplied tag is allowed by this whitelist
- * @param tag test tag
- * @return true if allowed
- */
+ @Override
protected boolean isSafeTag(String tag) {
- return tagNames.contains(TagName.valueOf(tag));
+ return super.isSafeTag(tag);
}
- /**
- * Test if the supplied attribute is allowed by this whitelist for this tag
- * @param tagName tag to consider allowing the attribute in
- * @param el element under test, to confirm protocol
- * @param attr attribute under test
- * @return true if allowed
- */
+ @Override
protected boolean isSafeAttribute(String tagName, Element el, Attribute attr) {
- TagName tag = TagName.valueOf(tagName);
- AttributeKey key = AttributeKey.valueOf(attr.getKey());
-
- Set<AttributeKey> okSet = attributes.get(tag);
- if (okSet != null && okSet.contains(key)) {
- if (protocols.containsKey(tag)) {
- Map<AttributeKey, Set<Protocol>> attrProts = protocols.get(tag);
- // ok if not defined protocol; otherwise test
- return !attrProts.containsKey(key) || testValidProtocol(el, attr, attrProts.get(key));
- } else { // attribute found, no protocols defined, so OK
- return true;
- }
- }
- // might be an enforced attribute?
- Map<AttributeKey, AttributeValue> enforcedSet = enforcedAttributes.get(tag);
- if (enforcedSet != null) {
- Attributes expect = getEnforcedAttributes(tagName);
- String attrKey = attr.getKey();
- if (expect.hasKeyIgnoreCase(attrKey)) {
- return expect.getIgnoreCase(attrKey).equals(attr.getValue());
- }
- }
- // no attributes defined for tag, try :all tag
- return !tagName.equals(":all") && isSafeAttribute(":all", el, attr);
+ return super.isSafeAttribute(tagName, el, attr);
}
- private boolean testValidProtocol(Element el, Attribute attr, Set<Protocol> protocols) {
- // try to resolve relative urls to abs, and optionally update the attribute so output html has abs.
- // rels without a baseuri get removed
- String value = el.absUrl(attr.getKey());
- if (value.length() == 0)
- value = attr.getValue(); // if it could not be made abs, run as-is to allow custom unknown protocols
- if (!preserveRelativeLinks)
- attr.setValue(value);
-
- for (Protocol protocol : protocols) {
- String prot = protocol.toString();
-
- if (prot.equals("#")) { // allows anchor links
- if (isValidAnchor(value)) {
- return true;
- } else {
- continue;
- }
- }
-
- prot += ":";
-
- if (lowerCase(value).startsWith(prot)) {
- return true;
- }
- }
- return false;
- }
-
- private boolean isValidAnchor(String value) {
- return value.startsWith("#") && !value.matches(".*\\s.*");
- }
-
+ @Override
Attributes getEnforcedAttributes(String tagName) {
- Attributes attrs = new Attributes();
- TagName tag = TagName.valueOf(tagName);
- if (enforcedAttributes.containsKey(tag)) {
- Map<AttributeKey, AttributeValue> keyVals = enforcedAttributes.get(tag);
- for (Map.Entry<AttributeKey, AttributeValue> entry : keyVals.entrySet()) {
- attrs.put(entry.getKey().toString(), entry.getValue().toString());
- }
- }
- return attrs;
- }
-
- // named types for config. All just hold strings, but here for my sanity.
-
- static class TagName extends TypedValue {
- TagName(String value) {
- super(value);
- }
-
- static TagName valueOf(String value) {
- return new TagName(value);
- }
- }
-
- static class AttributeKey extends TypedValue {
- AttributeKey(String value) {
- super(value);
- }
-
- static AttributeKey valueOf(String value) {
- return new AttributeKey(value);
- }
- }
-
- static class AttributeValue extends TypedValue {
- AttributeValue(String value) {
- super(value);
- }
-
- static AttributeValue valueOf(String value) {
- return new AttributeValue(value);
- }
- }
-
- static class Protocol extends TypedValue {
- Protocol(String value) {
- super(value);
- }
-
- static Protocol valueOf(String value) {
- return new Protocol(value);
- }
- }
-
- abstract static class TypedValue {
- private String value;
-
- TypedValue(String value) {
- Validate.notNull(value);
- this.value = value;
- }
-
- @Override
- public int hashCode() {
- final int prime = 31;
- int result = 1;
- result = prime * result + ((value == null) ? 0 : value.hashCode());
- return result;
- }
-
- @Override
- public boolean equals(Object obj) {
- if (this == obj) return true;
- if (obj == null) return false;
- if (getClass() != obj.getClass()) return false;
- TypedValue other = (TypedValue) obj;
- if (value == null) {
- return other.value == null;
- } else return value.equals(other.value);
- }
-
- @Override
- public String toString() {
- return value;
- }
+ return super.getEnforcedAttributes(tagName);
}
}
-
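Review note on the `Whitelist` shim above: the compatibility approach appears to rest on three things, namely a copy constructor on the parent, static factories that hide the parent's factories, and overrides with covariant return types so chained calls keep the old type. A minimal sketch of that pattern follows, under the assumption that it captures the intent; `ShimSketch`, `NewList` and `OldList` are hypothetical names for illustration and are not jsoup classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical miniature of the shim pattern; names are illustrative, not jsoup's.
class ShimSketch {
    // Stands in for the renamed class, which holds all state and behaviour.
    static class NewList {
        private final Set<String> tags = new HashSet<>();

        NewList() {}

        // Copy constructor, so the shim's factories can wrap a preset instance.
        NewList(NewList copy) {
            tags.addAll(copy.tags);
        }

        static NewList simpleText() {
            return new NewList().addTags("b", "em", "i", "strong", "u");
        }

        NewList addTags(String... names) {
            tags.addAll(Arrays.asList(names));
            return this;
        }

        boolean isSafeTag(String tag) {
            return tags.contains(tag);
        }
    }

    // Stands in for the deprecated old name: no state of its own, only delegation.
    @Deprecated
    static class OldList extends NewList {
        OldList() { super(); }

        OldList(NewList copy) { super(copy); }

        // Hides the parent's static factory so callers still get the old type back.
        static OldList simpleText() {
            return new OldList(NewList.simpleText());
        }

        // Covariant return type keeps chained calls typed as OldList.
        @Override
        OldList addTags(String... names) {
            super.addTags(names);
            return this;
        }
    }

    public static void main(String[] args) {
        // Old call sites compile unchanged and the result is usable as the new type.
        OldList legacy = OldList.simpleText().addTags("span");
        NewList modern = legacy;
        System.out.println(modern.isSafeTag("span"));
        System.out.println(modern.isSafeTag("script"));
    }
}
```

Because the old type is a subclass of the new one, any API that accepts the new type also accepts instances built through the old name, which is what lets existing callers migrate at their own pace.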
diff --git a/src/main/java/org/jsoup/safety/package-info.java b/src/main/java/org/jsoup/safety/package-info.java
index ac890f0..26b4b70 100644
--- a/src/main/java/org/jsoup/safety/package-info.java
+++ b/src/main/java/org/jsoup/safety/package-info.java
@@ -1,4 +1,4 @@
/**
- Contains the jsoup HTML cleaner, and whitelist definitions.
+ Contains the jsoup HTML cleaner, and safelist definitions.
*/
package org.jsoup.safety;
diff --git a/src/main/javadoc/overview.html b/src/main/javadoc/overview.html
index bbeac18..0fa25e1 100644
--- a/src/main/javadoc/overview.html
+++ b/src/main/javadoc/overview.html
@@ -1,31 +1,31 @@
<!DOCTYPE html>
<html>
<head>
- <title>jsoup Javadoc overview</title>
+ <title>jsoup Javadoc overview</title>
</head>
<body>
<h1>jsoup: Java HTML parser that makes sense of real-world HTML soup.</h1>
-<p><b>jsoup</b> is a Java library for working with real-world HTML. It provides a very convenient API
-for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.</p>
+<p><b>jsoup</b> is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs
+ and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.</p>
-<p>jsoup implements the <a href="http://whatwg.org/html">WHATWG HTML</a> specification, and parses HTML to the same DOM
-as modern browsers do.</p>
+<p>jsoup implements the <a href="https://html.spec.whatwg.org/multipage/">WHATWG HTML</a> specification, and parses HTML to the same DOM
+ as modern browsers do.</p>
<ul>
-<li>parse HTML from a URL, file, or string
-<li>find and extract data, using DOM traversal or CSS selectors
-<li>manipulate the HTML elements, attributes, and text
-<li>clean user-submitted content against a safe white-list, to prevent XSS
-<li>output tidy HTML
+ <li>parse HTML from a URL, file, or string
+ <li>find and extract data, using DOM traversal or CSS selectors
+ <li>manipulate the HTML elements, attributes, and text
+ <li>clean user-submitted content against a safelist, to prevent XSS
+ <li>output tidy HTML
</ul>
<p>jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating,
-to invalid tag-soup; jsoup will create a sensible parse tree.</p>
+ to invalid tag-soup; jsoup will create a sensible parse tree.</p>
<p>See <a href="https://jsoup.org/"><b>jsoup.org</b></a> for downloads, documentation, and examples...</p>
-@author <a href="http://jonathanhedley.com/">Jonathan Hedley</a>
+@author <a href="https://jonathanhedley.com/">Jonathan Hedley</a>
</body>
</html>
diff --git a/src/test/java/org/jsoup/parser/HtmlParserTest.java b/src/test/java/org/jsoup/parser/HtmlParserTest.java
index 8eb27cb..36d4a31 100644
--- a/src/test/java/org/jsoup/parser/HtmlParserTest.java
+++ b/src/test/java/org/jsoup/parser/HtmlParserTest.java
@@ -5,7 +5,7 @@
import org.jsoup.integration.ParseTest;
import org.jsoup.internal.StringUtil;
import org.jsoup.nodes.*;
-import org.jsoup.safety.Whitelist;
+import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;
import org.junit.jupiter.api.Disabled;
import org.junit.jupiter.api.Test;
@@ -1158,8 +1158,8 @@
parser.parseInput(html, "");
assertEquals(0, parser.getErrors().size());
- assertTrue(Jsoup.isValid(html, Whitelist.basic()));
- String clean = Jsoup.clean(html, Whitelist.basic());
+ assertTrue(Jsoup.isValid(html, Safelist.basic()));
+ String clean = Jsoup.clean(html, Safelist.basic());
assertEquals("<p>test<br>test<br></p>", clean);
}
@@ -1170,8 +1170,8 @@
assertEquals(1, parser.getErrors().size());
assertEquals("18: Tag cannot be self closing; not a void tag", parser.getErrors().get(0).toString());
- assertFalse(Jsoup.isValid(html, Whitelist.relaxed()));
- String clean = Jsoup.clean(html, Whitelist.relaxed());
+ assertFalse(Jsoup.isValid(html, Safelist.relaxed()));
+ String clean = Jsoup.clean(html, Safelist.relaxed());
assertEquals("<p>test</p> <div></div> <div> Two </div>", StringUtil.normaliseWhitespace(clean));
}
@@ -1294,7 +1294,7 @@
public void testH20() {
// https://github.com/jhy/jsoup/issues/731
String html = "H<sub>2</sub>O";
- String clean = Jsoup.clean(html, Whitelist.basic());
+ String clean = Jsoup.clean(html, Safelist.basic());
assertEquals("H<sub>2</sub>O", clean);
Document doc = Jsoup.parse(html);
@@ -1305,7 +1305,7 @@
public void testUNewlines() {
// https://github.com/jhy/jsoup/issues/851
String html = "t<u>es</u>t <b>on</b> <i>f</i><u>ir</u>e";
- String clean = Jsoup.clean(html, Whitelist.basic());
+ String clean = Jsoup.clean(html, Safelist.basic());
assertEquals("t<u>es</u>t <b>on</b> <i>f</i><u>ir</u>e", clean);
Document doc = Jsoup.parse(html);
diff --git a/src/test/java/org/jsoup/safety/CleanerTest.java b/src/test/java/org/jsoup/safety/CleanerTest.java
index 707313f..bce555d 100644
--- a/src/test/java/org/jsoup/safety/CleanerTest.java
+++ b/src/test/java/org/jsoup/safety/CleanerTest.java
@@ -18,21 +18,21 @@
public class CleanerTest {
@Test public void simpleBehaviourTest() {
String h = "<div><p class=foo><a href='http://evil.com'>Hello <b id=bar>there</b>!</a></div>";
- String cleanHtml = Jsoup.clean(h, Whitelist.simpleText());
+ String cleanHtml = Jsoup.clean(h, Safelist.simpleText());
assertEquals("Hello <b>there</b>!", TextUtil.stripNewlines(cleanHtml));
}
@Test public void simpleBehaviourTest2() {
String h = "Hello <b>there</b>!";
- String cleanHtml = Jsoup.clean(h, Whitelist.simpleText());
+ String cleanHtml = Jsoup.clean(h, Safelist.simpleText());
assertEquals("Hello <b>there</b>!", TextUtil.stripNewlines(cleanHtml));
}
@Test public void basicBehaviourTest() {
String h = "<div><p><a href='javascript:sendAllMoney()'>Dodgy</a> <A HREF='HTTP://nice.com'>Nice</a></p><blockquote>Hello</blockquote>";
- String cleanHtml = Jsoup.clean(h, Whitelist.basic());
+ String cleanHtml = Jsoup.clean(h, Safelist.basic());
assertEquals("<p><a rel=\"nofollow\">Dodgy</a> <a href=\"http://nice.com\" rel=\"nofollow\">Nice</a></p><blockquote>Hello</blockquote>",
TextUtil.stripNewlines(cleanHtml));
@@ -40,33 +40,33 @@
@Test public void basicWithImagesTest() {
String h = "<div><p><img src='http://example.com/' alt=Image></p><p><img src='ftp://ftp.example.com'></p></div>";
- String cleanHtml = Jsoup.clean(h, Whitelist.basicWithImages());
+ String cleanHtml = Jsoup.clean(h, Safelist.basicWithImages());
assertEquals("<p><img src=\"http://example.com/\" alt=\"Image\"></p><p><img></p>", TextUtil.stripNewlines(cleanHtml));
}
@Test public void testRelaxed() {
String h = "<h1>Head</h1><table><tr><td>One<td>Two</td></tr></table>";
- String cleanHtml = Jsoup.clean(h, Whitelist.relaxed());
+ String cleanHtml = Jsoup.clean(h, Safelist.relaxed());
assertEquals("<h1>Head</h1><table><tbody><tr><td>One</td><td>Two</td></tr></tbody></table>", TextUtil.stripNewlines(cleanHtml));
}
@Test public void testRemoveTags() {
String h = "<div><p><A HREF='HTTP://nice.com'>Nice</a></p><blockquote>Hello</blockquote>";
- String cleanHtml = Jsoup.clean(h, Whitelist.basic().removeTags("a"));
+ String cleanHtml = Jsoup.clean(h, Safelist.basic().removeTags("a"));
assertEquals("<p>Nice</p><blockquote>Hello</blockquote>", TextUtil.stripNewlines(cleanHtml));
}
@Test public void testRemoveAttributes() {
String h = "<div><p>Nice</p><blockquote cite='http://example.com/quotations'>Hello</blockquote>";
- String cleanHtml = Jsoup.clean(h, Whitelist.basic().removeAttributes("blockquote", "cite"));
+ String cleanHtml = Jsoup.clean(h, Safelist.basic().removeAttributes("blockquote", "cite"));
assertEquals("<p>Nice</p><blockquote>Hello</blockquote>", TextUtil.stripNewlines(cleanHtml));
}
@Test public void testRemoveEnforcedAttributes() {
String h = "<div><p><A HREF='HTTP://nice.com'>Nice</a></p><blockquote>Hello</blockquote>";
- String cleanHtml = Jsoup.clean(h, Whitelist.basic().removeEnforcedAttribute("a", "rel"));
+ String cleanHtml = Jsoup.clean(h, Safelist.basic().removeEnforcedAttribute("a", "rel"));
assertEquals("<p><a href=\"http://nice.com\">Nice</a></p><blockquote>Hello</blockquote>",
TextUtil.stripNewlines(cleanHtml));
@@ -74,53 +74,53 @@
@Test public void testRemoveProtocols() {
String h = "<p>Contact me <a href='mailto:[email protected]'>here</a></p>";
- String cleanHtml = Jsoup.clean(h, Whitelist.basic().removeProtocols("a", "href", "ftp", "mailto"));
+ String cleanHtml = Jsoup.clean(h, Safelist.basic().removeProtocols("a", "href", "ftp", "mailto"));
assertEquals("<p>Contact me <a rel=\"nofollow\">here</a></p>",
TextUtil.stripNewlines(cleanHtml));
}
@MultiLocaleTest
- public void whitelistedProtocolShouldBeRetained(Locale locale) {
+ public void safeListedProtocolShouldBeRetained(Locale locale) {
Locale.setDefault(locale);
- Whitelist whitelist = Whitelist.none()
+ Safelist safelist = Safelist.none()
.addTags("a")
.addAttributes("a", "href")
.addProtocols("a", "href", "something");
- String cleanHtml = Jsoup.clean("<a href=\"SOMETHING://x\"></a>", whitelist);
+ String cleanHtml = Jsoup.clean("<a href=\"SOMETHING://x\"></a>", safelist);
assertEquals("<a href=\"SOMETHING://x\"></a>", TextUtil.stripNewlines(cleanHtml));
}
@Test public void testDropComments() {
String h = "<p>Hello<!-- no --></p>";
- String cleanHtml = Jsoup.clean(h, Whitelist.relaxed());
+ String cleanHtml = Jsoup.clean(h, Safelist.relaxed());
assertEquals("<p>Hello</p>", cleanHtml);
}
@Test public void testDropXmlProc() {
String h = "<?import namespace=\"xss\"><p>Hello</p>";
- String cleanHtml = Jsoup.clean(h, Whitelist.relaxed());
+ String cleanHtml = Jsoup.clean(h, Safelist.relaxed());
assertEquals("<p>Hello</p>", cleanHtml);
}
@Test public void testDropScript() {
String h = "<SCRIPT SRC=//ha.ckers.org/.j><SCRIPT>alert(/XSS/.source)</SCRIPT>";
- String cleanHtml = Jsoup.clean(h, Whitelist.relaxed());
+ String cleanHtml = Jsoup.clean(h, Safelist.relaxed());
assertEquals("", cleanHtml);
}
@Test public void testDropImageScript() {
String h = "<IMG SRC=\"javascript:alert('XSS')\">";
- String cleanHtml = Jsoup.clean(h, Whitelist.relaxed());
+ String cleanHtml = Jsoup.clean(h, Safelist.relaxed());
assertEquals("<img>", cleanHtml);
}
@Test public void testCleanJavascriptHref() {
String h = "<A HREF=\"javascript:document.location='http://www.google.com/'\">XSS</A>";
- String cleanHtml = Jsoup.clean(h, Whitelist.relaxed());
+ String cleanHtml = Jsoup.clean(h, Safelist.relaxed());
assertEquals("<a>XSS</a>", cleanHtml);
}
@@ -128,15 +128,15 @@
String validAnchor = "<a href=\"#valid\">Valid anchor</a>";
String invalidAnchor = "<a href=\"#anchor with spaces\">Invalid anchor</a>";
- // A Whitelist that does not allow anchors will strip them out.
- String cleanHtml = Jsoup.clean(validAnchor, Whitelist.relaxed());
+ // A Safelist that does not allow anchors will strip them out.
+ String cleanHtml = Jsoup.clean(validAnchor, Safelist.relaxed());
assertEquals("<a>Valid anchor</a>", cleanHtml);
- cleanHtml = Jsoup.clean(invalidAnchor, Whitelist.relaxed());
+ cleanHtml = Jsoup.clean(invalidAnchor, Safelist.relaxed());
assertEquals("<a>Invalid anchor</a>", cleanHtml);
- // A Whitelist that allows them will keep them.
- Whitelist relaxedWithAnchor = Whitelist.relaxed().addProtocols("a", "href", "#");
+ // A Safelist that allows them will keep them.
+ Safelist relaxedWithAnchor = Safelist.relaxed().addProtocols("a", "href", "#");
cleanHtml = Jsoup.clean(validAnchor, relaxedWithAnchor);
assertEquals(validAnchor, cleanHtml);
@@ -148,13 +148,13 @@
@Test public void testDropsUnknownTags() {
String h = "<p><custom foo=true>Test</custom></p>";
- String cleanHtml = Jsoup.clean(h, Whitelist.relaxed());
+ String cleanHtml = Jsoup.clean(h, Safelist.relaxed());
assertEquals("<p>Test</p>", cleanHtml);
}
@Test public void testHandlesEmptyAttributes() {
String h = "<img alt=\"\" src= unknown=''>";
- String cleanHtml = Jsoup.clean(h, Whitelist.basicWithImages());
+ String cleanHtml = Jsoup.clean(h, Safelist.basicWithImages());
assertEquals("<img alt=\"\">", cleanHtml);
}
@@ -168,74 +168,74 @@
String nok5 = "<p>Test <b><a href='http://example.com/' rel='nofollowme'>OK</a></b></p>";
String nok6 = "<p>Test <b><a href='http://example.com/'>OK</b></p>"; // missing close tag
String nok7 = "</div>What";
- assertTrue(Jsoup.isValid(ok, Whitelist.basic()));
- assertTrue(Jsoup.isValid(ok1, Whitelist.basic()));
- assertFalse(Jsoup.isValid(nok1, Whitelist.basic()));
- assertFalse(Jsoup.isValid(nok2, Whitelist.basic()));
- assertFalse(Jsoup.isValid(nok3, Whitelist.basic()));
- assertFalse(Jsoup.isValid(nok4, Whitelist.basic()));
- assertFalse(Jsoup.isValid(nok5, Whitelist.basic()));
- assertFalse(Jsoup.isValid(nok6, Whitelist.basic()));
- assertFalse(Jsoup.isValid(ok, Whitelist.none()));
- assertFalse(Jsoup.isValid(nok7, Whitelist.basic()));
+ assertTrue(Jsoup.isValid(ok, Safelist.basic()));
+ assertTrue(Jsoup.isValid(ok1, Safelist.basic()));
+ assertFalse(Jsoup.isValid(nok1, Safelist.basic()));
+ assertFalse(Jsoup.isValid(nok2, Safelist.basic()));
+ assertFalse(Jsoup.isValid(nok3, Safelist.basic()));
+ assertFalse(Jsoup.isValid(nok4, Safelist.basic()));
+ assertFalse(Jsoup.isValid(nok5, Safelist.basic()));
+ assertFalse(Jsoup.isValid(nok6, Safelist.basic()));
+ assertFalse(Jsoup.isValid(ok, Safelist.none()));
+ assertFalse(Jsoup.isValid(nok7, Safelist.basic()));
}
@Test public void testIsValidDocument() {
String ok = "<html><head></head><body><p>Hello</p></body><html>";
String nok = "<html><head><script>woops</script><title>Hello</title></head><body><p>Hello</p></body><html>";
- Whitelist relaxed = Whitelist.relaxed();
+ Safelist relaxed = Safelist.relaxed();
Cleaner cleaner = new Cleaner(relaxed);
Document okDoc = Jsoup.parse(ok);
assertTrue(cleaner.isValid(okDoc));
assertFalse(cleaner.isValid(Jsoup.parse(nok)));
- assertFalse(new Cleaner(Whitelist.none()).isValid(okDoc));
+ assertFalse(new Cleaner(Safelist.none()).isValid(okDoc));
}
@Test public void resolvesRelativeLinks() {
String html = "<a href='/foo'>Link</a><img src='/bar'>";
- String clean = Jsoup.clean(html, "http://example.com/", Whitelist.basicWithImages());
+ String clean = Jsoup.clean(html, "http://example.com/", Safelist.basicWithImages());
assertEquals("<a href=\"http://example.com/foo\" rel=\"nofollow\">Link</a>\n<img src=\"http://example.com/bar\">", clean);
}
@Test public void preservesRelativeLinksIfConfigured() {
String html = "<a href='/foo'>Link</a><img src='/bar'> <img src='javascript:alert()'>";
- String clean = Jsoup.clean(html, "http://example.com/", Whitelist.basicWithImages().preserveRelativeLinks(true));
+ String clean = Jsoup.clean(html, "http://example.com/", Safelist.basicWithImages().preserveRelativeLinks(true));
assertEquals("<a href=\"/foo\" rel=\"nofollow\">Link</a>\n<img src=\"/bar\"> \n<img>", clean);
}
@Test public void dropsUnresolvableRelativeLinks() {
String html = "<a href='/foo'>Link</a>";
- String clean = Jsoup.clean(html, Whitelist.basic());
+ String clean = Jsoup.clean(html, Safelist.basic());
assertEquals("<a rel=\"nofollow\">Link</a>", clean);
}
@Test public void handlesCustomProtocols() {
String html = "<img src='cid:12345' /> <img src='data:gzzt' />";
- String dropped = Jsoup.clean(html, Whitelist.basicWithImages());
+ String dropped = Jsoup.clean(html, Safelist.basicWithImages());
assertEquals("<img> \n<img>", dropped);
- String preserved = Jsoup.clean(html, Whitelist.basicWithImages().addProtocols("img", "src", "cid", "data"));
+ String preserved = Jsoup.clean(html, Safelist.basicWithImages().addProtocols("img", "src", "cid", "data"));
assertEquals("<img src=\"cid:12345\"> \n<img src=\"data:gzzt\">", preserved);
}
@Test public void handlesAllPseudoTag() {
String html = "<p class='foo' src='bar'><a class='qux'>link</a></p>";
- Whitelist whitelist = new Whitelist()
+ Safelist safelist = new Safelist()
.addAttributes(":all", "class")
.addAttributes("p", "style")
.addTags("p", "a");
- String clean = Jsoup.clean(html, whitelist);
+ String clean = Jsoup.clean(html, safelist);
assertEquals("<p class=\"foo\"><a class=\"qux\">link</a></p>", clean);
}
@Test public void addsTagOnAttributesIfNotSet() {
String html = "<p class='foo' src='bar'>One</p>";
- Whitelist whitelist = new Whitelist()
+ Safelist safelist = new Safelist()
.addAttributes("p", "class");
- // ^^ whitelist does not have explicit tag add for p, inferred from add attributes.
- String clean = Jsoup.clean(html, whitelist);
+ // ^^ safelist does not have explicit tag add for p, inferred from add attributes.
+ String clean = Jsoup.clean(html, safelist);
assertEquals("<p class=\"foo\">One</p>", clean);
}
@@ -247,8 +247,8 @@
os.charset("ascii");
String html = "<div><p>&bernou;</p></div>";
- String customOut = Jsoup.clean(html, "http://foo.com/", Whitelist.relaxed(), os);
- String defaultOut = Jsoup.clean(html, "http://foo.com/", Whitelist.relaxed());
+ String customOut = Jsoup.clean(html, "http://foo.com/", Safelist.relaxed(), os);
+ String defaultOut = Jsoup.clean(html, "http://foo.com/", Safelist.relaxed());
assertNotSame(defaultOut, customOut);
assertEquals("<div><p>&Bscr;</p></div>", customOut); // entities now prefers shorted names if aliased
@@ -258,37 +258,37 @@
os.charset("ASCII");
os.escapeMode(Entities.EscapeMode.base);
- String customOut2 = Jsoup.clean(html, "http://foo.com/", Whitelist.relaxed(), os);
+ String customOut2 = Jsoup.clean(html, "http://foo.com/", Safelist.relaxed(), os);
assertEquals("<div><p>&#x212c;</p></div>", customOut2);
}
@Test public void handlesFramesets() {
String dirty = "<html><head><script></script><noscript></noscript></head><frameset><frame src=\"foo\" /><frame src=\"foo\" /></frameset></html>";
- String clean = Jsoup.clean(dirty, Whitelist.basic());
+ String clean = Jsoup.clean(dirty, Safelist.basic());
assertEquals("", clean); // nothing good can come out of that
Document dirtyDoc = Jsoup.parse(dirty);
- Document cleanDoc = new Cleaner(Whitelist.basic()).clean(dirtyDoc);
+ Document cleanDoc = new Cleaner(Safelist.basic()).clean(dirtyDoc);
assertNotNull(cleanDoc);
assertEquals(0, cleanDoc.body().childNodeSize());
}
@Test public void cleansInternationalText() {
- assertEquals("привет", Jsoup.clean("привет", Whitelist.none()));
+ assertEquals("привет", Jsoup.clean("привет", Safelist.none()));
}
@Test
- public void testScriptTagInWhiteList() {
- Whitelist whitelist = Whitelist.relaxed();
- whitelist.addTags( "script" );
- assertTrue( Jsoup.isValid("Hello<script>alert('Doh')</script>World !", whitelist ) );
+ public void testScriptTagInSafeList() {
+ Safelist safelist = Safelist.relaxed();
+ safelist.addTags( "script" );
+ assertTrue( Jsoup.isValid("Hello<script>alert('Doh')</script>World !", safelist) );
}
@Test
public void bailsIfRemovingProtocolThatsNotSet() {
assertThrows(IllegalArgumentException.class, () -> {
// a case that came up on the email list
- Whitelist w = Whitelist.none();
+ Safelist w = Safelist.none();
// note no add tag, and removing protocol without adding first
w.addAttributes("a", "href");
@@ -298,20 +298,20 @@
@Test public void handlesControlCharactersAfterTagName() {
String html = "<a/\06>";
- String clean = Jsoup.clean(html, Whitelist.basic());
+ String clean = Jsoup.clean(html, Safelist.basic());
assertEquals("<a rel=\"nofollow\"></a>", clean);
}
@Test public void handlesAttributesWithNoValue() {
// https://github.com/jhy/jsoup/issues/973
- String clean = Jsoup.clean("<a href>Clean</a>", Whitelist.basic());
+ String clean = Jsoup.clean("<a href>Clean</a>", Safelist.basic());
assertEquals("<a rel=\"nofollow\">Clean</a>", clean);
}
@Test public void handlesNoHrefAttribute() {
String dirty = "<a>One</a> <a href>Two</a>";
- Whitelist relaxedWithAnchor = Whitelist.relaxed().addProtocols("a", "href", "#");
+ Safelist relaxedWithAnchor = Safelist.relaxed().addProtocols("a", "href", "#");
String clean = Jsoup.clean(dirty, relaxedWithAnchor);
assertEquals("<a>One</a> <a>Two</a>", clean);
}
@@ -319,7 +319,7 @@
@Test public void handlesNestedQuotesInAttribute() {
// https://github.com/jhy/jsoup/issues/1243 - no repro
String orig = "<div style=\"font-family: 'Calibri'\">Will (not) fail</div>";
- Whitelist allow = Whitelist.relaxed()
+ Safelist allow = Safelist.relaxed()
.addAttributes("div", "style");
String clean = Jsoup.clean(orig, allow);
diff --git a/src/test/java/org/jsoup/safety/CompatibilityTests.java b/src/test/java/org/jsoup/safety/CompatibilityTests.java
new file mode 100644
index 0000000..7586950
--- /dev/null
+++ b/src/test/java/org/jsoup/safety/CompatibilityTests.java
@@ -0,0 +1,99 @@
+package org.jsoup.safety;
+
+import org.jsoup.Jsoup;
+import org.jsoup.MultiLocaleExtension;
+import org.jsoup.TextUtil;
+import org.jsoup.nodes.Document;
+import org.jsoup.nodes.Entities;
+import org.junit.jupiter.api.Test;
+
+import java.util.Locale;
+
+import static org.junit.jupiter.api.Assertions.*;
+
+/**
+ Tests source compatibility of the deprecated {@link org.jsoup.safety.Whitelist} class. Will be removed in
+ <code>v1.15.1</code>. There are no net new tests here, so this file is safe to delete along with the shim.
+ */
+public class CompatibilityTests {
+ @Test
+ public void resolvesRelativeLinks() {
+ String html = "<a href='/foo'>Link</a><img src='/bar'>";
+ String clean = Jsoup.clean(html, "http://example.com/", Whitelist.basicWithImages());
+ assertEquals("<a href=\"http://example.com/foo\" rel=\"nofollow\">Link</a>\n<img src=\"http://example.com/bar\">", clean);
+ }
+
+ @Test
+ public void testDropsUnknownTags() {
+ String h = "<p><custom foo=true>Test</custom></p>";
+ String cleanHtml = Jsoup.clean(h, Whitelist.relaxed());
+ assertEquals("<p>Test</p>", cleanHtml);
+ }
+
+ @Test
+ public void preservesRelativeLinksIfConfigured() {
+ String html = "<a href='/foo'>Link</a><img src='/bar'> <img src='javascript:alert()'>";
+ String clean = Jsoup.clean(html, "http://example.com/", Whitelist.basicWithImages().preserveRelativeLinks(true));
+ assertEquals("<a href=\"/foo\" rel=\"nofollow\">Link</a>\n<img src=\"/bar\"> \n<img>", clean);
+ }
+
+ @Test
+ public void handlesCustomProtocols() {
+ String html = "<img src='cid:12345' /> <img src='data:gzzt' />";
+ String dropped = Jsoup.clean(html, Whitelist.basicWithImages());
+ assertEquals("<img> \n<img>", dropped);
+
+ String preserved = Jsoup.clean(html, Whitelist.basicWithImages().addProtocols("img", "src", "cid", "data"));
+ assertEquals("<img src=\"cid:12345\"> \n<img src=\"data:gzzt\">", preserved);
+ }
+
+ @Test
+ public void handlesFramesets() {
+ String dirty = "<html><head><script></script><noscript></noscript></head><frameset><frame src=\"foo\" /><frame src=\"foo\" /></frameset></html>";
+ String clean = Jsoup.clean(dirty, Whitelist.basic());
+ assertEquals("", clean); // nothing good can come out of that
+
+ Document dirtyDoc = Jsoup.parse(dirty);
+ Document cleanDoc = new Cleaner(Whitelist.basic()).clean(dirtyDoc);
+ assertNotNull(cleanDoc);
+ assertEquals(0, cleanDoc.body().childNodeSize());
+ }
+
+ @Test
+ public void supplyOutputSettings() {
+ // test that one can override the default document output settings
+ Document.OutputSettings os = new Document.OutputSettings();
+ os.prettyPrint(false);
+ os.escapeMode(Entities.EscapeMode.extended);
+ os.charset("ascii");
+
+ String html = "<div><p>&bernou;</p></div>";
+ String customOut = Jsoup.clean(html, "http://foo.com/", Whitelist.relaxed(), os);
+ String defaultOut = Jsoup.clean(html, "http://foo.com/", Whitelist.relaxed());
+ assertNotSame(defaultOut, customOut);
+
+ assertEquals("<div><p>&Bscr;</p></div>", customOut); // entities now prefers shorted names if aliased
+ assertEquals("<div>\n" +
+ " <p>ℬ</p>\n" +
+ "</div>", defaultOut);
+
+ os.charset("ASCII");
+ os.escapeMode(Entities.EscapeMode.base);
+ String customOut2 = Jsoup.clean(html, "http://foo.com/", Whitelist.relaxed(), os);
+ assertEquals("<div><p>&#x212c;</p></div>", customOut2);
+ }
+
+ @MultiLocaleExtension.MultiLocaleTest
+ public void safeListedProtocolShouldBeRetained(Locale locale) {
+ Locale.setDefault(locale);
+
+ Whitelist safelist = Whitelist.none()
+ .addTags("a")
+ .addAttributes("a", "href")
+ .addProtocols("a", "href", "something");
+
+ String cleanHtml = Jsoup.clean("<a href=\"SOMETHING://x\"></a>", safelist);
+
+ assertEquals("<a href=\"SOMETHING://x\"></a>", TextUtil.stripNewlines(cleanHtml));
+ }
+}