Class HTMLSanitizer
object --+
         |
        HTMLSanitizer
A filter that removes potentially dangerous HTML tags and attributes
from the stream.
>>> from genshi import HTML
>>> html = HTML('<div><script>alert(document.cookie)</script></div>', encoding='utf-8')
>>> print(html | HTMLSanitizer())
<div/>
The default set of safe tags and attributes can be modified when the filter
is instantiated. For example, to allow inline style attributes, the
following instantiation would work:
>>> html = HTML('<div style="background: #000"></div>', encoding='utf-8')
>>> sanitizer = HTMLSanitizer(safe_attrs=HTMLSanitizer.SAFE_ATTRS | set(['style']))
>>> print(html | sanitizer)
<div style="background: #000"/>
Note that even in this case, the filter does attempt to remove dangerous
constructs from style attributes:
>>> html = HTML('<div style="background: url(javascript:void); color: #000"></div>', encoding='utf-8')
>>> print(html | sanitizer)
<div style="color: #000"/>
This handles HTML entities and Unicode escapes in CSS and JavaScript text,
among other things. However, the style attribute is still excluded by
default because it is very hard for such sanitizing to be completely safe,
especially considering how much error recovery current web browsers perform.
The filter also does some basic filtering of CSS properties that may be used
for typical phishing attacks. For more sophisticated filtering, this class
provides a couple of hooks that can be overridden in subclasses.
Warning: Note that this special processing of CSS is currently only applied to
style attributes, not style elements.
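
The safe_attrs customisation shown above relies on ordinary Python set arithmetic: the union of a frozenset with a set yields a new frozenset, so the class-level default is not modified. A minimal sketch of that behaviour follows, written for Python 3 and using stand-in values. The names DEFAULT_ATTRS, extended and restricted are hypothetical and are not part of Genshi, and the stand-in set is far shorter than the real HTMLSanitizer.SAFE_ATTRS.

```python
# Hypothetical, abbreviated stand-in for a class-level default such as
# HTMLSanitizer.SAFE_ATTRS (the real set is much longer).
DEFAULT_ATTRS = frozenset(['abbr', 'accept', 'href'])

# Union builds a new frozenset; DEFAULT_ATTRS itself is left untouched.
extended = DEFAULT_ATTRS | set(['style'])

# Set difference works the same way when an entry should be dropped.
restricted = DEFAULT_ATTRS - set(['href'])

print(sorted(extended))
print(sorted(restricted))
```

Either result could then be passed as the safe_attrs argument in place of the default.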
      
 
Instance Methods

__init__(self,
        safe_tags=frozenset(['a', 'abbr', 'acronym', 'address', 'area', 'b', 'bi...,
        safe_attrs=frozenset(['abbr', 'accept', 'accept-charset', 'accesskey', 'a...,
        safe_schemes=frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto']),
        uri_attrs=frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc',...,
        safe_css=frozenset(['background', 'background-attachment', 'background-...)
    Create the sanitizer.

__call__(self, stream)
    Apply the filter to the given stream.

is_safe_css(self, propname, value)
    Determine whether the given CSS property declaration is to be
    considered safe for inclusion in the output. Returns bool.

is_safe_elem(self, tag, attrs)
    Determine whether the given element should be considered safe for
    inclusion in the output. Returns bool.

is_safe_uri(self, uri)
    Determine whether the given URI is to be considered safe for
    inclusion in the output. Returns bool.

sanitize_css(self, text)
    Remove potentially dangerous property declarations from CSS code.
    Returns list.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Class Variables

SAFE_TAGS = frozenset(['a', 'abbr', 'acronym', 'address', 'are...
SAFE_ATTRS = frozenset(['abbr', 'accept', 'accept-charset', 'a...
SAFE_CSS = frozenset(['background', 'background-attachment', '...
SAFE_SCHEMES = frozenset([None, 'file', 'ftp', 'http', 'https'...
URI_ATTRS = frozenset(['action', 'background', 'dynsrc', 'href...

Properties

Inherited from object: __class__

Method Details

__init__(self,
        safe_tags=frozenset(['a', 'abbr', 'acronym', 'address', 'area', 'b', 'bi...,
        safe_attrs=frozenset(['abbr', 'accept', 'accept-charset', 'accesskey', 'a...,
        safe_schemes=frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto']),
        uri_attrs=frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc',...,
        safe_css=frozenset(['background', 'background-attachment', 'background-...)
    (Constructor)

Create the sanitizer. The exact set of allowed elements and attributes can be configured.

    Parameters:
        safe_tags - a set of tag names that are considered safe
        safe_attrs - a set of attribute names that are considered safe
        safe_schemes - a set of URI schemes that are considered safe
        uri_attrs - a set of names of attributes that contain URIs
    Overrides:
        object.__init__
 
__call__(self, stream)
    (Call operator)

Apply the filter to the given stream.

    Parameters:
        stream - the markup event stream to filter
 
is_safe_css(self, propname, value)

Determine whether the given CSS property declaration is to be
considered safe for inclusion in the output.

    Parameters:
        propname - the CSS property name
        value - the value of the property
    Returns: bool
        whether the property value should be considered safe
 
is_safe_elem(self, tag, attrs)

Determine whether the given element should be considered safe for
inclusion in the output.

    Parameters:
        tag (QName) - the tag name of the element
        attrs (Attrs) - the element attributes
    Returns: bool
        whether the element should be considered safe
 
is_safe_uri(self, uri)

Determine whether the given URI is to be considered safe for
inclusion in the output.

The default implementation checks whether the scheme of the URI is in
the set of allowed URI schemes (safe_schemes).
>>> sanitizer = HTMLSanitizer()
>>> sanitizer.is_safe_uri('http://example.org/')
True
>>> sanitizer.is_safe_uri('javascript:alert(document.cookie)')
False
    Returns: bool
        True if the URI can be considered safe, False otherwise
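
The scheme check described above can be approximated with the Python 3 standard library. The sketch below is not Genshi's implementation: the function name scheme_is_safe is hypothetical, and the use of urllib.parse.urlsplit to extract the scheme is an assumption made for illustration. The SAFE_SCHEMES value mirrors the documented default, where None stands for a URI without a scheme.

```python
from urllib.parse import urlsplit

# Mirrors the documented SAFE_SCHEMES default; None means "no scheme",
# which covers relative URIs.
SAFE_SCHEMES = frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto'])


def scheme_is_safe(uri, safe_schemes=SAFE_SCHEMES):
    """Return True if the URI's scheme (or its absence) is in safe_schemes.

    Hypothetical helper for illustration only; not part of Genshi.
    """
    # urlsplit() reports an empty string when there is no scheme;
    # map that to None so it can be looked up in the set.
    scheme = urlsplit(uri.strip()).scheme.lower() or None
    return scheme in safe_schemes


print(scheme_is_safe('http://example.org/'))
print(scheme_is_safe('javascript:alert(document.cookie)'))
```

A subclass that needs a stricter policy would more likely pass a smaller safe_schemes set to the constructor, or override is_safe_uri, than reimplement the check this way.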
 
sanitize_css(self, text)

Remove potentially dangerous property declarations from CSS code.

In particular, properties using the CSS url() function with a scheme
that is not considered safe are removed:
>>> sanitizer = HTMLSanitizer()
>>> sanitizer.sanitize_css(u'''
...   background: url(javascript:alert("foo"));
...   color: #000;
... ''')
[u'color: #000']

Also, the proprietary Internet Explorer function expression() is
always stripped:
>>> sanitizer.sanitize_css(u'''
...   background: #fff;
...   color: #000;
...   width: e/**/xpression(alert("foo"));
... ''')
[u'background: #fff', u'color: #000']
    Parameters:
        text - the CSS text; this is expected to be unicode and to not
            contain any character or numeric references
    Returns: list
        a list of declarations that are considered safe
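
The two rules illustrated above (dropping url() values with an unsafe scheme, and dropping expression() even when it is split by a comment) can be sketched in a few lines of Python 3. This is a simplified illustration rather than Genshi's implementation: the names filter_declarations and _scheme, and both regular expressions, are assumptions, and the sketch does not decode CSS escapes or check property names against safe_css.

```python
import re

# Mirrors the documented SAFE_SCHEMES default; None means "no scheme".
SAFE_SCHEMES = frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto'])

# CSS comments are removed first so "e/**/xpression" becomes "expression".
_COMMENT = re.compile(r'/\*.*?\*/', re.DOTALL)
# Captures the start of a url(...) argument, up to a quote or ")".
_URL = re.compile(r'url\s*\(\s*["\']?\s*([^"\')]*)', re.IGNORECASE)


def _scheme(uri):
    """Return the lower-cased text before the first ':', or None."""
    head, sep, _ = uri.partition(':')
    return head.strip().lower() if sep else None


def filter_declarations(text):
    """Return the declarations in text that pass the two checks.

    Hypothetical helper for illustration only; not part of Genshi.
    """
    safe = []
    for decl in _COMMENT.sub('', text).split(';'):
        decl = decl.strip()
        if not decl:
            continue
        # Rule 1: expression() is always rejected.
        if 'expression' in decl.lower():
            continue
        # Rule 2: every url() argument must use an allowed scheme.
        if any(_scheme(u) not in SAFE_SCHEMES for u in _URL.findall(decl)):
            continue
        safe.append(decl)
    return safe


print(filter_declarations('background: url(javascript:alert("foo")); color: #000;'))
```

Removing comments before testing for expression() is what catches the obfuscated form in the second doctest; checking the raw text would miss it.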
 
Class Variable Details

SAFE_TAGS
    Value:
        frozenset(['a','abbr','acronym','address','area','b','big','blockquote',...

SAFE_ATTRS
    Value:
        frozenset(['abbr','accept','accept-charset','accesskey','action','align','alt','axis',...

SAFE_CSS
    Value:
        frozenset(['background','background-attachment','background-color','background-image','background-position','background-repeat','border','border-bottom',...

SAFE_SCHEMES
    Value:
        frozenset([None, 'file', 'ftp', 'http', 'https', 'mailto'])

URI_ATTRS
    Value:
        frozenset(['action', 'background', 'dynsrc', 'href', 'lowsrc', 'src'])