function UTF8Utils::convertToUTF8
Convert data from the given encoding to UTF-8.
This has not yet been tested with character sets other than UTF-8. It should work with ISO-8859-1/-13 and the standard Latin Windows charsets.
Parameters
string $data: The data to convert.
string $encoding: A valid encoding. Examples: http://www.php.net/manual/en/mbstring.supported-encodings.php
Return value
string
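As a rough illustration of what the conversion does, the sketch below calls `mb_convert_encoding` directly on an ISO-8859-1 string, which is the same call the mbstring branch of this function makes. It assumes the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// Minimal sketch of the mbstring path; requires the mbstring extension.
// "café" in ISO-8859-1: the byte 0xE9 encodes é.
$latin1 = "caf\xE9";

// Argument order: data, target encoding, source encoding.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// In UTF-8, é is the two-byte sequence 0xC3 0xA9.
var_dump($utf8 === "caf\xC3\xA9");
```

With the library autoloaded, the equivalent call would presumably be `UTF8Utils::convertToUTF8($latin1, 'ISO-8859-1')`.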
2 calls to UTF8Utils::convertToUTF8()
- Scanner::__construct in vendor/masterminds/html5/src/HTML5/Parser/Scanner.php - Create a new Scanner.
- StringInputStream::__construct in vendor/masterminds/html5/src/HTML5/Parser/StringInputStream.php - Create a new InputStream wrapper.
File
- vendor/masterminds/html5/src/HTML5/Parser/UTF8Utils.php, line 76
Class
UTF8Utils
Namespace
Masterminds\HTML5\Parser
Code
public static function convertToUTF8($data, $encoding = 'UTF-8') {
  /*
   * From the HTML5 spec: Given an encoding, the bytes in the input stream must be converted
   * to Unicode characters for the tokeniser, as described by the rules for that encoding,
   * except that the leading U+FEFF BYTE ORDER MARK character, if any, must not be stripped
   * by the encoding layer (it is stripped by the rule below). Bytes or sequences of bytes
   * in the original byte stream that could not be converted to Unicode characters must be
   * converted to U+FFFD REPLACEMENT CHARACTER code points.
   */
  // mb_convert_encoding is chosen over iconv because of a bug. The best
  // details for the bug are on http://us1.php.net/manual/en/function.iconv.php#108643
  // which contains links to the actual bug reports as well as workaround
  // details.
  if (function_exists('mb_convert_encoding')) {
    // mb library has the following behaviors:
    // - UTF-16 surrogates result in false.
    // - Overlongs and outside Plane 16 result in empty strings.
    // Before we run mb_convert_encoding we need to tell it what to do with
    // characters it does not know. This could be different than the parent
    // application executing this library so we store the value, change it
    // to our needs, and then change it back when we are done. This feels
    // a little excessive and it would be great if there was a better way.
    $save = mb_substitute_character();
    mb_substitute_character('none');
    $data = mb_convert_encoding($data, 'UTF-8', $encoding);
    mb_substitute_character($save);
  }
  elseif (function_exists('iconv') && 'auto' !== $encoding) {
    // fprintf(STDOUT, "iconv found\n");
    // iconv has the following behaviors:
    // - Overlong representations are ignored.
    // - Beyond Plane 16 is replaced with a lower char.
    // - Incomplete sequences generate a warning.
    $data = @iconv($encoding, 'UTF-8//IGNORE', $data);
  }
  else {
    throw new Exception('Not implemented, please install mbstring or iconv');
  }
  /*
   * One leading U+FEFF BYTE ORDER MARK character must be ignored if any are present.
   */
  // U+FEFF is encoded in UTF-8 as the three bytes EF BB BF.
  if ("\xEF\xBB\xBF" === substr($data, 0, 3)) {
    $data = substr($data, 3);
  }
  return $data;
}
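The source comments note that saving and restoring the substitute character "feels a little excessive". One possible variation, shown here only as a sketch and not as the library's code, wraps the conversion in `try`/`finally` so the caller's setting is restored even if the conversion throws. Both helper names below are hypothetical; the second isolates the BOM check, using the fact that U+FEFF is the byte sequence EF BB BF in UTF-8. The mbstring extension is assumed.

```php
<?php
// Hypothetical helper, not part of masterminds/html5: converts to UTF-8 with
// the mbstring substitute character temporarily set to 'none', and restores
// the caller's previous setting even if the conversion throws.
function convert_with_restored_substitute(string $data, string $encoding): string
{
    $saved = mb_substitute_character();   // int code point or a string such as "none"
    mb_substitute_character('none');      // drop unconvertible bytes silently
    try {
        return mb_convert_encoding($data, 'UTF-8', $encoding);
    } finally {
        mb_substitute_character($saved);  // put the application's setting back
    }
}

// Hypothetical helper: removes one leading UTF-8 byte order mark, if present.
function strip_leading_bom(string $data): string
{
    return strncmp($data, "\xEF\xBB\xBF", 3) === 0 ? substr($data, 3) : $data;
}

$converted = convert_with_restored_substitute("caf\xE9", 'ISO-8859-1');
$stripped = strip_leading_bom("\xEF\xBB\xBF<html>");
```

Only a single leading BOM is removed, matching the rule quoted in the source; a second BOM immediately after it would be left in place.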