Class HTMLPurifier_Encoder

Inheritance: HTMLPurifier_Encoder

A UTF-8 specific character encoder that handles cleaning and transforming.

Public Methods

Method | Description | Defined By
cleanUTF8() | Cleans a UTF-8 string for well-formedness and SGML validity. | HTMLPurifier_Encoder
convertFromUTF8() | Converts a string from UTF-8 based on configuration. | HTMLPurifier_Encoder
convertToASCIIDumbLossless() | Lossless (character-wise) conversion of HTML to ASCII. | HTMLPurifier_Encoder
convertToUTF8() | Converts a string to UTF-8 based on configuration. | HTMLPurifier_Encoder
iconv() | Iconv wrapper which mutes errors and works around bugs. | HTMLPurifier_Encoder
iconvAvailable() | | HTMLPurifier_Encoder
muteErrorHandler() | Error handler that mutes errors; an alternative to the shut-up operator. | HTMLPurifier_Encoder
testEncodingSupportsASCII() | This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail. | HTMLPurifier_Encoder
testIconvTruncateBug() | Glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable. | HTMLPurifier_Encoder
unichr() | | HTMLPurifier_Encoder
unsafeIconv() | Iconv wrapper which mutes errors, but doesn't work around bugs. | HTMLPurifier_Encoder

Constants

Constant | Value | Description | Defined By
ICONV_OK | 0 | No bugs detected in iconv. | HTMLPurifier_Encoder
ICONV_TRUNCATES | 1 | Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found. | HTMLPurifier_Encoder
ICONV_UNUSABLE | 2 | Iconv does not support //IGNORE, making it unusable for transcoding purposes. | HTMLPurifier_Encoder

Method Details

cleanUTF8() public static method

Cleans a UTF-8 string for well-formedness and SGML validity

It parses the input as UTF-8 and returns a valid UTF-8 string, with non-SGML code points excluded.

public static string cleanUTF8 ( $str, $force_php = false )
$str string

The string to clean

$force_php bool
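
The two-stage idea described above (enforce well-formedness, then drop non-SGML code points) can be illustrated with a minimal sketch. This is not HTML Purifier's implementation: `sketch_clean_utf8()` is a hypothetical helper, and the set of code points treated as SGML-valid here (tab, LF, CR, U+0020 to U+007E, U+00A0 to U+D7FF, U+E000 to U+FFFD, and the supplementary planes) is an assumption made for illustration.

```php
<?php
// Hypothetical helper; a sketch of the idea, not the library's code.
function sketch_clean_utf8(string $str): string
{
    // Stage 1: keep only well-formed UTF-8 sequences. The pattern works
    // on raw bytes (no /u flag), so malformed bytes are simply skipped.
    $wellFormed = '/[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}/';
    preg_match_all($wellFormed, $str, $matches);
    $valid = implode('', $matches[0]);

    // Stage 2: remove code points outside the assumed SGML-valid ranges
    // (for example NUL, DEL, and the C1 controls U+0080 to U+009F).
    $nonSgml = '/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}'
        . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';
    return preg_replace($nonSgml, '', $valid) ?? '';
}
```

With this sketch, a stray `\xFF` byte or an overlong sequence such as `\xC0\xAF` is dropped in the first stage, while a well-formed but disallowed character such as NUL is dropped in the second.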

convertFromUTF8() public static method

Converts a string from UTF-8 based on configuration.

public static string convertFromUTF8 ( $str, $config, $context )
$str string

The string to convert

$config HTMLPurifier_Config
$context HTMLPurifier_Context

convertToASCIIDumbLossless() public static method

Lossless (character-wise) conversion of HTML to ASCII

public static string convertToASCIIDumbLossless ( $str )
$str string

UTF-8 string to be converted to ASCII

return string

ASCII-encoded string with non-ASCII characters converted to entities
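
A character-wise conversion of this kind can be sketched as follows. `sketch_to_ascii_entities()` is a hypothetical name, the decimal `&#N;` entity form is an assumption about the output format, and the input is assumed to be valid UTF-8 already.

```php
<?php
// Hypothetical helper; illustrates entity-izing non-ASCII characters.
function sketch_to_ascii_entities(string $utf8): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u', // every non-ASCII character, one at a time
        static function (array $m): string {
            $bytes = array_values(unpack('C*', $m[0]));
            $n = count($bytes);
            // Payload bits of the lead byte: 5 (2-byte), 4 (3-byte), 3 (4-byte).
            $code = $bytes[0] & (0xFF >> ($n + 1));
            for ($i = 1; $i < $n; $i++) {
                // Each continuation byte contributes its low six bits.
                $code = ($code << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $code . ';';
        },
        $utf8
    );
    // preg_replace_callback() yields null on malformed UTF-8 input.
    return $result ?? '';
}
```

ASCII characters pass through untouched, so the conversion is lossless in the sense that every original character can be recovered from the entities.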

convertToUTF8() public static method

Convert a string to UTF-8 based on configuration.

public static string convertToUTF8 ( $str, $config, $context )
$str string

The string to convert

$config HTMLPurifier_Config
$context HTMLPurifier_Context
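
As a rough illustration of what convertToUTF8() and convertFromUTF8() do conceptually, the sketch below converts between a given encoding and UTF-8 with PHP's plain `iconv()`. The helper names are hypothetical, and the `$encoding` argument merely stands in for whatever encoding the configuration object specifies; the real methods read it from `$config` and apply additional workarounds.

```php
<?php
// Hypothetical helpers; $encoding stands in for the configured encoding.
function sketch_to_utf8(string $str, string $encoding): string
{
    if (strcasecmp($encoding, 'utf-8') === 0) {
        return $str; // nothing to convert
    }
    $out = iconv($encoding, 'UTF-8', $str);
    if ($out === false) {
        throw new RuntimeException("Cannot convert from $encoding");
    }
    return $out;
}

function sketch_from_utf8(string $str, string $encoding): string
{
    if (strcasecmp($encoding, 'utf-8') === 0) {
        return $str;
    }
    $out = iconv('UTF-8', $encoding, $str);
    if ($out === false) {
        throw new RuntimeException("Cannot convert to $encoding");
    }
    return $out;
}
```

A typical flow would convert input to UTF-8, process it, and convert the result back to the original encoding on output.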

iconv() public static method

Iconv wrapper which mutes errors and works around bugs.

public static string iconv ( $in, $out, $text, $max_chunk_size = 8000 )
$in string

Input encoding

$out string

Output encoding

$text string

The text to convert

$max_chunk_size int
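
The `$max_chunk_size` parameter suggests that the text is converted in pieces. One way this could work is sketched below: split the UTF-8 input into chunks of at most a given number of bytes without cutting a multi-byte character in half, then convert each chunk separately. Both function names are hypothetical, and this is an assumption about the general technique rather than the library's exact code.

```php
<?php
// Hypothetical helper: split UTF-8 text on character boundaries.
function sketch_utf8_chunks(string $text, int $maxBytes = 8000): array
{
    $chunks = [];
    $len = strlen($text);
    $pos = 0;
    while ($pos < $len) {
        $end = min($pos + $maxBytes, $len);
        // If the next chunk would start on a continuation byte (10xxxxxx),
        // back up until the cut falls on a character boundary.
        while ($end < $len && $end > $pos && (ord($text[$end]) & 0xC0) === 0x80) {
            $end--;
        }
        if ($end === $pos) {
            // Degenerate case (malformed input or tiny limit): force progress.
            $end = min($pos + $maxBytes, $len);
        }
        $chunks[] = substr($text, $pos, $end - $pos);
        $pos = $end;
    }
    return $chunks;
}

// Hypothetical helper: convert chunk by chunk and join the results.
function sketch_chunked_iconv(string $in, string $out, string $text, int $maxBytes = 8000): string
{
    $converted = '';
    foreach (sketch_utf8_chunks($text, $maxBytes) as $chunk) {
        $converted .= (string) iconv($in, $out, $chunk);
    }
    return $converted;
}
```

The chunking assumes the input encoding is UTF-8; for other input encodings a different boundary rule would be needed.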

iconvAvailable() public static method

public static bool iconvAvailable ( )

muteErrorHandler() public static method

Error handler that mutes errors; an alternative to the shut-up operator (@).

public static void muteErrorHandler ( )
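
The general pattern of muting errors with a handler instead of `@` can be sketched with PHP's `set_error_handler()` and `restore_error_handler()`. The wrapper name `sketch_call_muted()` is hypothetical and only shows how such a handler might be installed around a single call.

```php
<?php
// Hypothetical helper: run a callable with PHP errors temporarily muted.
function sketch_call_muted(callable $fn)
{
    // A handler that reports the error as handled, so nothing is printed.
    set_error_handler(static function (): bool {
        return true;
    });
    try {
        return $fn();
    } finally {
        // Always put the previous handler back, even if $fn throws.
        restore_error_handler();
    }
}
```

Unlike `@`, this approach scopes the muting explicitly and restores the previous handler afterwards.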

testEncodingSupportsASCII() public static method

This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.

public static Array testEncodingSupportsASCII ( $encoding, $bypass = false )
$encoding string

Encoding name to test, as per iconv format

$bypass bool

Whether or not to bypass the precompiled arrays.

return Array

Array mapping UTF-8 characters to their corresponding ASCII characters, which can be used to "undo" any overzealous iconv action.
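
A simplified version of such a test can be sketched by decoding each printable ASCII byte as text in the target encoding and recording every case where the result is not the same character. `sketch_ascii_differences()` is a hypothetical helper; the real method also consults precompiled arrays (see `$bypass`) and may probe the encoding differently.

```php
<?php
// Hypothetical helper: find printable ASCII bytes that an encoding
// interprets as some other character.
function sketch_ascii_differences(string $encoding): array
{
    $map = [];
    for ($i = 0x20; $i <= 0x7E; $i++) {
        $byte = chr($i);
        // Decode the single byte as if it were text in $encoding.
        // "@" is used here only to keep the sketch short.
        $decoded = @iconv($encoding, 'UTF-8', $byte);
        if ($decoded !== false && $decoded !== '' && $decoded !== $byte) {
            // UTF-8 character => the ASCII character it stands in for.
            $map[$decoded] = $byte;
        }
    }
    return $map;
}
```

An empty array means the encoding is ASCII-compatible for printable characters; a non-empty array lists the substitutions that would need to be reversed after conversion.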

testIconvTruncateBug() public static method

Glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.

public static int testIconvTruncateBug ( )
return int

Error code indicating severity of bug.
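
The detection described above can be sketched as a probe: convert a string that starts with a non-ASCII character followed by roughly 9000 ASCII characters to ASCII with //IGNORE, then inspect the result. The function and constant names below are hypothetical (the values mirror the ICONV_* constants listed on this page), and the probe string is an assumption about how such a test could be built.

```php
<?php
// Hypothetical constants mirroring ICONV_OK, ICONV_TRUNCATES, ICONV_UNUSABLE.
const SKETCH_ICONV_OK = 0;
const SKETCH_ICONV_TRUNCATES = 1;
const SKETCH_ICONV_UNUSABLE = 2;

// Hypothetical helper: probe iconv for the //IGNORE truncation bug.
function sketch_test_iconv_truncate_bug(): int
{
    // One non-encodable character (Greek alpha) followed by 9000 ASCII bytes.
    $input = "\xCE\xB1" . str_repeat('a', 9000);
    $result = @iconv('UTF-8', 'ASCII//IGNORE', $input);
    if ($result === false) {
        // PHP honoured the error code: //IGNORE cannot be relied on.
        return SKETCH_ICONV_UNUSABLE;
    }
    if (strlen($result) < 9000) {
        // Output was cut short after the unencodable character.
        return SKETCH_ICONV_TRUNCATES;
    }
    return SKETCH_ICONV_OK;
}
```

The outcome depends on the platform's iconv implementation and the PHP version, so a caller would typically run the probe once and cache the result.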

unichr() public static method

public static void unichr ( $code )
$code
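
This page gives no description for unichr(). Judging by its name and its `$code` parameter, it appears to translate a Unicode code point into the corresponding UTF-8 character. Under that assumption, the encoding step can be sketched as follows; `sketch_unichr()` is a hypothetical stand-in, and returning an empty string for invalid code points is a choice made for this sketch only.

```php
<?php
// Hypothetical helper: encode one Unicode code point as UTF-8.
function sketch_unichr(int $code): string
{
    // Reject values outside Unicode and the surrogate range.
    if ($code < 0 || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return '';
    }
    if ($code < 0x80) {
        return chr($code); // 1 byte: 0xxxxxxx
    }
    if ($code < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
            . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($code >> 12))
            . chr(0x80 | (($code >> 6) & 0x3F))
            . chr(0x80 | ($code & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($code >> 18))
        . chr(0x80 | (($code >> 12) & 0x3F))
        . chr(0x80 | (($code >> 6) & 0x3F))
        . chr(0x80 | ($code & 0x3F));
}
```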

unsafeIconv() public static method

Iconv wrapper which mutes errors, but doesn't work around bugs.

public static string unsafeIconv ( $in, $out, $text )
$in string

Input encoding

$out string

Output encoding

$text string

The text to convert