Class HTMLPurifier_Encoder
Inheritance | HTMLPurifier_Encoder |
---|
A UTF-8 specific character encoder that handles cleaning and transforming.
Public Methods
Method | Description | Defined By |
---|---|---|
cleanUTF8() | Cleans a UTF-8 string for well-formedness and SGML validity | HTMLPurifier_Encoder |
convertFromUTF8() | Converts a string from UTF-8 based on configuration. | HTMLPurifier_Encoder |
convertToASCIIDumbLossless() | Lossless (character-wise) conversion of HTML to ASCII | HTMLPurifier_Encoder |
convertToUTF8() | Converts a string to UTF-8 based on configuration. | HTMLPurifier_Encoder |
iconv() | Iconv wrapper which mutes errors and works around bugs. | HTMLPurifier_Encoder |
iconvAvailable() | | HTMLPurifier_Encoder |
muteErrorHandler() | Error-handler that mutes errors, alternative to shut-up operator. | HTMLPurifier_Encoder |
testEncodingSupportsASCII() | This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail. | HTMLPurifier_Encoder |
testIconvTruncateBug() | Glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable. | HTMLPurifier_Encoder |
unichr() | | HTMLPurifier_Encoder |
unsafeIconv() | Iconv wrapper which mutes errors, but doesn't work around bugs. | HTMLPurifier_Encoder |
Constants
Constant | Value | Description | Defined By |
---|---|---|---|
ICONV_OK | 0 | No bugs detected in iconv. | HTMLPurifier_Encoder |
ICONV_TRUNCATES | 1 | Iconv truncates output if converting from UTF-8 to another character set with //IGNORE, and a non-encodable character is found | HTMLPurifier_Encoder |
ICONV_UNUSABLE | 2 | Iconv does not support //IGNORE, making it unusable for transcoding purposes | HTMLPurifier_Encoder |
Method Details
Cleans a UTF-8 string for well-formedness and SGML validity
It parses the input as UTF-8 and returns a valid UTF-8 string with non-SGML codepoints excluded.
public static string cleanUTF8 ( $str, $force_php = false ) | ||
$str | string | The string to clean |
$force_php | bool |
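
As a rough illustration of the "non-SGML codepoints excluded" part of that description, the sketch below strips disallowed codepoints from a string that is already valid UTF-8. `stripNonSgmlCodepoints` is a hypothetical helper, not part of HTML Purifier, and the allowed ranges in the pattern are an assumption about what counts as SGML-valid here. The documented method also deals with well-formedness, which this sketch only detects and rejects.

```php
<?php
// Hypothetical helper, not the library's implementation.
// Assumed allowed codepoints: tab, LF, CR, U+0020-U+007E, U+00A0-U+D7FF,
// U+E000-U+FFFD and U+10000-U+10FFFF. Everything else is removed.
function stripNonSgmlCodepoints(string $str): string
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}'
             . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    // With the /u modifier, preg_replace() returns null when the subject
    // is not valid UTF-8, so malformed input is reported instead of repaired.
    $clean = preg_replace($pattern, '', $str);
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}
```

For example, a NUL byte or a C1 control character such as U+0085 would be dropped, while tabs and ordinary text pass through unchanged.
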
Converts a string from UTF-8 based on configuration.
public static string convertFromUTF8 ( $str, $config, $context ) | ||
$str | string | The string to convert |
$config | HTMLPurifier_Config | |
$context | HTMLPurifier_Context |
Lossless (character-wise) conversion of HTML to ASCII
public static string convertToASCIIDumbLossless ( $str ) | ||
$str | string | UTF-8 string to be converted to ASCII |
return | string | ASCII-encoded string with non-ASCII characters converted to entities |
---|
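
The idea of a character-wise, lossless conversion to ASCII can be sketched as follows: every non-ASCII character is replaced by a numeric character reference, so the original text can still be recovered by an HTML parser. `entityizeNonAscii` is a hypothetical name, the use of decimal references is an assumption, and the input is assumed to be valid UTF-8; this is not necessarily how the library implements the method.

```php
<?php
// Hypothetical helper illustrating the technique, not the library's code.
function entityizeNonAscii(string $str): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u', // one match per non-ASCII UTF-8 character
        function (array $m): string {
            $bytes = array_values(unpack('C*', $m[0]));
            $n = count($bytes);
            // Drop the length-marker bits from the lead byte.
            $code = $bytes[0] & (0xFF >> ($n + 1));
            // Append the low six bits of each continuation byte.
            for ($i = 1; $i < $n; $i++) {
                $code = ($code << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $code . ';';
        },
        $str
    );
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}
```

ASCII characters, including markup such as `<` and `&`, are left untouched, which is why the conversion is described as "dumb": it does not interpret the HTML at all.
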
Converts a string to UTF-8 based on configuration.
public static string convertToUTF8 ( $str, $config, $context ) | ||
$str | string | The string to convert |
$config | HTMLPurifier_Config | |
$context | HTMLPurifier_Context |
Iconv wrapper which mutes errors and works around bugs.
public static string iconv ( $in, $out, $text, $max_chunk_size = 8000 ) | ||
$in | string | Input encoding |
$out | string | Output encoding |
$text | string | The text to convert |
$max_chunk_size | int |
public static bool iconvAvailable ( ) |
Error-handler that mutes errors, alternative to shut-up operator.
public static void muteErrorHandler ( ) |
This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.
public static Array testEncodingSupportsASCII ( $encoding, $bypass = false ) | ||
$encoding | string | Encoding name to test, as per iconv format |
$bypass | bool | Whether or not to bypass the precompiled arrays. |
return | Array | Array mapping UTF-8 characters to their corresponding ASCII |
---|
Glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.
public static int testIconvTruncateBug ( ) | ||
return | int | Error code indicating severity of bug. |
---|
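
The chunking workaround described above depends on cutting the input into segments without splitting a multi-byte character. A minimal sketch of that splitting step, assuming valid UTF-8 input, is shown below. `splitUtf8Chunks` is a hypothetical helper, and the library's own `iconv()` wrapper may handle boundaries differently.

```php
<?php
// Hypothetical helper: split UTF-8 text into chunks of at most $max bytes,
// always cutting on a character boundary. Assumes valid UTF-8 input.
function splitUtf8Chunks(string $text, int $max = 8000): array
{
    if ($max < 4) {
        // A UTF-8 character can be up to 4 bytes long.
        throw new InvalidArgumentException('$max must be at least 4');
    }
    $chunks = [];
    $len = strlen($text);
    $pos = 0;
    while ($pos < $len) {
        $end = min($pos + $max, $len);
        // If the cut would land on a continuation byte (10xxxxxx),
        // back up until it sits at the start of a character.
        while ($end < $len && $end > $pos && (ord($text[$end]) & 0xC0) === 0x80) {
            $end--;
        }
        $chunks[] = substr($text, $pos, $end - $pos);
        $pos = $end;
    }
    return $chunks;
}
```

Each chunk could then be passed to PHP's `iconv()` separately and the results concatenated, so that a truncation caused by the bug affects at most one segment rather than the rest of the document.
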
public static void unichr ( $code ) | ||
$code |
Iconv wrapper which mutes errors, but doesn't work around bugs.
public static string unsafeIconv ( $in, $out, $text ) | ||
$in | string | Input encoding |
$out | string | Output encoding |
$text | string | The text to convert |