This section describes a number of functions for dealing with
Unicode characters and strings. There are analogues of the
traditional ctype.h character classification
and case conversion functions, UTF-8 analogues of some string utility
functions, functions to perform normalization, case conversion and
collation on UTF-8 strings and finally functions to convert between
the UTF-8, UTF-16 and UCS-4 encodings of Unicode.
Checks whether ch is a valid Unicode character. Some possible
integer values of ch will not be valid. 0 is considered a valid
character, though it's normally a string terminator.
Determines whether a character is numeric (i.e. a digit). This
covers ASCII 0-9 and also digits in other languages/scripts. Given
some UTF-8 text, obtain a character value with g_utf8_get_char().
Determines whether a character is printable and not a space
(returns FALSE for control characters, format characters, and
spaces). g_unichar_isprint() is similar, but returns TRUE for
spaces. Given some UTF-8 text, obtain a character value with
g_utf8_get_char().
Determines whether a character is printable.
Unlike g_unichar_isgraph(), returns TRUE for spaces.
Given some UTF-8 text, obtain a character value with
g_utf8_get_char().
Determines whether a character is a space, tab, or line separator
(newline, carriage return, etc.). Given some UTF-8 text, obtain a
character value with g_utf8_get_char().
(Note: don't use this to do word breaking; you have to use
Pango or equivalent to get word breaking right, the algorithm
is fairly complex.)
Determines if a character is titlecase. Some characters in
Unicode which are composites, such as the DZ digraph
have three case variants instead of just two. The titlecase
form is used at the beginning of a word where only the
first letter is capitalized. The titlecase form of the DZ
digraph is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.
Determines the break type of c. c should be a Unicode character
(to derive a character from UTF-8 encoded text, use
g_utf8_get_char()). The break type is used to find word and line
breaks ("text boundaries"). Pango implements the Unicode boundary
resolution algorithms and normally you would use a function such
as pango_break() instead of caring about break types yourself.
Computes the canonical ordering of a string in-place.
This rearranges decomposed characters in the string
according to their combining classes. See the Unicode
manual for more information.
In Unicode, some characters are mirrored. This
means that their images are mirrored horizontally in text that is laid
out from right to left. For instance, "(" would become its mirror image,
")", in right-to-left text.
If ch has the Unicode mirrored property and there is another Unicode
character that typically has a glyph that is the mirror image of ch's
glyph, puts that character in the address pointed to by mirrored_ch.
ch :
a Unicode character
mirrored_ch :
location to store the mirrored character
Returns :
TRUE if ch has a mirrored character and mirrored_ch is
filled in, FALSE otherwise
Since 2.4
g_utf8_next_char()
#define g_utf8_next_char(p)
Skips to the next character in a UTF-8 string. The string must be
valid; this macro is as fast as possible, and has no error-checking.
You would use this macro to iterate over a string character by
character. The macro returns the start of the next UTF-8 character.
Before using this macro, use g_utf8_validate() to validate strings
that may contain invalid UTF-8.
Converts a sequence of bytes encoded as UTF-8 to a Unicode character.
If p does not point to a valid UTF-8 encoded character, results are
undefined. If you are not sure that the bytes are complete
valid Unicode characters, you should use g_utf8_get_char_validated()
instead.
Converts a sequence of bytes encoded as UTF-8 to a Unicode character.
This function checks for incomplete characters, for invalid characters
such as characters that are out of the range of Unicode, and for
overlong encodings of valid characters.
p :
a pointer to Unicode character encoded as UTF-8
max_len :
the maximum number of bytes to read, or -1, for no maximum.
Returns :
the resulting character. If p points to a partial
sequence at the end of a string that could begin a valid
character, returns (gunichar)-2; otherwise, if p does not point
to a valid UTF-8 encoded Unicode character, returns (gunichar)-1.
Finds the previous UTF-8 character in the string before p.
p does not have to be at the beginning of a UTF-8 character. No check
is made to see if the character found is actually valid other than
it starts with an appropriate byte. If p might be the first
character of the string, you must use g_utf8_find_prev_char() instead.
p :
a pointer to a position within a UTF-8 encoded string
Finds the start of the next UTF-8 character in the string after p.
p does not have to be at the beginning of a UTF-8 character. No check
is made to see if the character found is actually valid other than
it starts with an appropriate byte.
p :
a pointer to a position within a UTF-8 encoded string
end :
a pointer to the end of the string, or NULL to indicate
that the string is nul-terminated, in which case
the returned value will be
Given a position p within a UTF-8 encoded string str, find the start
of the previous UTF-8 character starting before p. Returns NULL if no
UTF-8 characters are present in str before p.
p does not have to be at the beginning of a UTF-8 character. No check
is made to see if the character found is actually valid other than
it starts with an appropriate byte.
str :
pointer to the beginning of a UTF-8 encoded string
max :
the maximum number of bytes to examine. If max
is less than 0, then the string is assumed to be
nul-terminated. If max is 0, p will not be examined and
may be NULL.
Like the standard C strncpy() function, but
copies a given number of characters instead of a given number of
bytes. The src string must be valid UTF-8 encoded text.
(Use g_utf8_validate() on all text before trying to use UTF-8
utility functions with it.)
Finds the leftmost occurrence of the given ISO10646 character
in a UTF-8 encoded string, while limiting the search to len bytes.
If len is -1, allow unbounded search.
p :
a nul-terminated UTF-8 encoded string
len :
the maximum length of p
c :
an ISO10646 character
Returns :
NULL if the string does not contain the character,
otherwise, a pointer to the start of the leftmost occurrence of
the character in the string.
Finds the rightmost occurrence of the given ISO10646 character
in a UTF-8 encoded string, while limiting the search to len bytes.
If len is -1, allow unbounded search.
p :
a nul-terminated UTF-8 encoded string
len :
the maximum length of p
c :
an ISO10646 character
Returns :
NULL if the string does not contain the character,
otherwise, a pointer to the start of the rightmost occurrence of the
character in the string.
Reverses a UTF-8 string. str must be valid UTF-8 encoded text.
(Use g_utf8_validate() on all text before trying to use UTF-8
utility functions with it.)
Note that unlike g_strreverse(), this function returns
newly-allocated memory, which should be freed with g_free() when
no longer needed.
str :
a UTF-8 encoded string
len :
the maximum length of str to use. If len < 0, then
the string is nul-terminated.
Returns :
a newly-allocated string which is the reverse of str.
Validates UTF-8 encoded text. str is the text to validate;
if str is nul-terminated, then max_len can be -1, otherwise
max_len should be the number of bytes to validate.
If end is non-NULL, then the end of the valid range
will be stored there (i.e. the start of the first invalid
character if some bytes were invalid, or the end of the text
being validated otherwise).
Note that g_utf8_validate() returns FALSE if max_len is
positive and NUL is met before max_len bytes have been read.
Returns TRUE if all of str was valid. Many GLib and GTK+
routines require valid UTF-8 as input;
so data read from a file or the network should be checked
with g_utf8_validate() before doing anything else with it.
Converts all Unicode characters in the string that have a case
to uppercase. The exact manner in which this is done depends
on the current locale, and may result in the number of
characters in the string increasing. (For instance, the
German ess-zet will be changed to SS.)
str :
a UTF-8 encoded string
len :
length of str, in bytes, or -1 if str is nul-terminated.
Returns :
a newly allocated string, with all characters
converted to uppercase.
Converts all Unicode characters in the string that have a case
to lowercase. The exact manner in which this is done depends
on the current locale, and may result in the number of
characters in the string changing.
str :
a UTF-8 encoded string
len :
length of str, in bytes, or -1 if str is nul-terminated.
Returns :
a newly allocated string, with all characters
converted to lowercase.
Converts a string into a form that is independent of case. The
result will not correspond to any particular case, but can be
compared for equality or ordered with the results of calling
g_utf8_casefold() on other strings.
Note that calling g_utf8_casefold() followed by g_utf8_collate() is
only an approximation to the correct linguistic case insensitive
ordering, though it is a fairly good one. Getting this exactly
right would require a more sophisticated collation function that
takes case sensitivity into account. GLib does not currently
provide such a function.
str :
a UTF-8 encoded string
len :
length of str, in bytes, or -1 if str is nul-terminated.
Returns :
a newly allocated string that is a
case-independent form of str.
Converts a string into canonical form, standardizing
such issues as whether a character with an accent
is represented as a base character and combining
accent or as a single precomposed character. You
should generally call g_utf8_normalize() before
comparing two Unicode strings.
The normalization mode G_NORMALIZE_DEFAULT only
standardizes differences that do not affect the
text content, such as the above-mentioned accent
representation. G_NORMALIZE_ALL also standardizes
the "compatibility" characters in Unicode, such
as SUPERSCRIPT THREE to the standard forms
(in this case DIGIT THREE). Formatting information
may be lost but for most text operations such
characters should be considered the same.
For example, g_utf8_collate() normalizes
with G_NORMALIZE_ALL as its first step.
G_NORMALIZE_DEFAULT_COMPOSE and G_NORMALIZE_ALL_COMPOSE
are like G_NORMALIZE_DEFAULT and G_NORMALIZE_ALL,
but return a result with composed forms rather
than a maximally decomposed form. This is often
useful if you intend to convert the string to
a legacy encoding or pass it to a system with
less capable Unicode handling.
str :
a UTF-8 encoded string.
len :
length of str, in bytes, or -1 if str is nul-terminated.
mode :
the type of normalization to perform.
Returns :
a newly allocated string that is the
normalized form of str.
Defines how a Unicode string is transformed into a canonical
form, standardizing such issues as whether a character with an accent is
represented as a base character and combining accent or as a single precomposed
character. Unicode strings should generally be normalized before comparing them.
G_NORMALIZE_DEFAULT
standardize differences that do not affect the
text content, such as the above-mentioned accent representation.
G_NORMALIZE_NFD
another name for G_NORMALIZE_DEFAULT.
G_NORMALIZE_DEFAULT_COMPOSE
like G_NORMALIZE_DEFAULT, but with composed
forms rather than a maximally decomposed form.
G_NORMALIZE_NFC
another name for G_NORMALIZE_DEFAULT_COMPOSE.
G_NORMALIZE_ALL
beyond G_NORMALIZE_DEFAULT also standardize the
"compatibility" characters in Unicode, such as SUPERSCRIPT THREE to the
standard forms (in this case DIGIT THREE). Formatting information may be
lost but for most text operations such characters should be considered the
same.
G_NORMALIZE_NFKD
another name for G_NORMALIZE_ALL.
G_NORMALIZE_ALL_COMPOSE
like G_NORMALIZE_ALL, but with composed
forms rather than a maximally decomposed form.
Compares two strings for ordering using the linguistically
correct rules for the current locale. When sorting a large
number of strings, it will be significantly faster to
obtain collation keys with g_utf8_collate_key() and
compare the keys with strcmp() when
sorting instead of sorting the original strings.
str1 :
a UTF-8 encoded string
str2 :
a UTF-8 encoded string
Returns :
< 0 if str1 compares before str2,
0 if they compare equal, > 0 if str1 compares after str2.
Converts a string into a collation key that can be compared
with other collation keys using strcmp().
The results of comparing the collation keys of two strings
with strcmp() will always be the same as
comparing the two original strings with g_utf8_collate().
str :
a UTF-8 encoded string.
len :
length of str, in bytes, or -1 if str is nul-terminated.
Returns :
a newly allocated string. This string should
be freed with g_free() when you are done with it.
Converts a string from UTF-8 to UTF-16. A 0 word will be
added to the result after the converted text.
str :
a UTF-8 encoded string
len :
the maximum length of str to use. If len < 0, then
the string is nul-terminated.
items_read :
location to store number of bytes read, or NULL.
If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be
returned in case str contains a trailing partial
character. If an error occurs then the index of the
invalid input is stored here.
items_written :
location to store number of words written, or NULL.
The value stored here does not include the trailing
0 word.
error :
location to store the error occurring, or NULL to ignore
errors. Any of the errors in GConvertError other than
G_CONVERT_ERROR_NO_CONVERSION may occur.
Returns :
a pointer to a newly allocated UTF-16 string.
This value must be freed with g_free(). If an
error occurs, NULL will be returned and
error set.
Converts a string from UTF-8 to a 32-bit fixed width
representation as UCS-4. A trailing 0 will be added to the
string after the converted text.
str :
a UTF-8 encoded string
len :
the maximum length of str to use. If len < 0, then
the string is nul-terminated.
items_read :
location to store number of bytes read, or NULL.
If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be
returned in case str contains a trailing partial
character. If an error occurs then the index of the
invalid input is stored here.
items_written :
location to store number of characters written or NULL.
The value stored here does not include the trailing 0
character.
error :
location to store the error occurring, or NULL to ignore
errors. Any of the errors in GConvertError other than
G_CONVERT_ERROR_NO_CONVERSION may occur.
Returns :
a pointer to a newly allocated UCS-4 string.
This value must be freed with g_free(). If an
error occurs, NULL will be returned and
error set.
Converts a string from UTF-8 to a 32-bit fixed width
representation as UCS-4, assuming valid UTF-8 input.
This function is roughly twice as fast as g_utf8_to_ucs4()
but does no error checking on the input.
str :
a UTF-8 encoded string
len :
the maximum length of str to use. If len < 0, then
the string is nul-terminated.
items_written :
location to store the number of characters in the
result, or NULL.
Returns :
a pointer to a newly allocated UCS-4 string.
This value must be freed with g_free().
Converts a string from UTF-16 to UCS-4. The result will be
terminated with a 0 character.
str :
a UTF-16 encoded string
len :
the maximum length of str to use. If len < 0, then
the string is terminated with a 0 character.
items_read :
location to store number of words read, or NULL.
If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be
returned in case str contains a trailing partial
character. If an error occurs then the index of the
invalid input is stored here.
items_written :
location to store number of characters written, or NULL.
The value stored here does not include the trailing
0 character.
error :
location to store the error occurring, or NULL to ignore
errors. Any of the errors in GConvertError other than
G_CONVERT_ERROR_NO_CONVERSION may occur.
Returns :
a pointer to a newly allocated UCS-4 string.
This value must be freed with g_free(). If an
error occurs, NULL will be returned and
error set.
Converts a string from UTF-16 to UTF-8. The result will be
terminated with a 0 byte.
Note that the input is expected to be already in native endianness;
an initial byte-order-mark character is not handled specially.
g_convert() can be used to convert a byte buffer of UTF-16 data of
ambiguous endianness.
str :
a UTF-16 encoded string
len :
the maximum length of str to use. If len < 0, then
the string is terminated with a 0 character.
items_read :
location to store number of words read, or NULL.
If NULL, then G_CONVERT_ERROR_PARTIAL_INPUT will be
returned in case str contains a trailing partial
character. If an error occurs then the index of the
invalid input is stored here.
items_written :
location to store number of bytes written, or NULL.
The value stored here does not include the trailing
0 byte.
error :
location to store the error occurring, or NULL to ignore
errors. Any of the errors in GConvertError other than
G_CONVERT_ERROR_NO_CONVERSION may occur.
Returns :
a pointer to a newly allocated UTF-8 string.
This value must be freed with g_free(). If an
error occurs, NULL will be returned and
error set.