Class PDFStringUtil
Utility methods for working with PDF strings:
- converting to text strings
- converting to PDFDocEncoded strings
- converting to UTF-16BE strings
- converting basic strings between byte and string representations
We refer to basic strings as those corresponding to the PDF 'string' type.
PDFRenderer represents these as Strings, though this is somewhat deceptive:
they are effectively just sequences of bytes, although byte values <= 127 do
correspond to the ASCII character set. Beyond that, the 'string' type, as
represented by basic strings, does not possess any character set or encoding,
and byte values >= 128 are entirely acceptable. For a basic string represented
as a String, each character has a value less than 256 and is stored as if the
bytes it represents were in ISO-8859-1 encoding. This, however, is merely for
convenience. For strings that are user visible, and that don't merely
represent some identifying token, the PDF standard employs a 'text string'
type, which treats the basic string as an encoding of text in either UTF-16BE
(with a byte order mark) or a specific 8-bit encoding, PDFDocEncoding. Using a
basic string without conversion when the actual type is a 'text string' is
erroneous (though without consequence if the string consists only of ASCII
alphanumeric values). Care must be taken to convert basic strings to text
strings (also expressed as a String) when appropriate, using either the
methods in this class or PDFObject.getTextStringValue(). For strings that are
'byte strings', asBytes(String) or PDFObject.getStream() should be used.
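The ISO-8859-1 convention described above can be illustrated with a minimal sketch. It uses only JDK classes; `BasicStringSketch` and its method names are hypothetical and are not PDFRenderer's implementation. It shows that arbitrary bytes survive a round trip through a String under ISO-8859-1, whereas a charset such as UTF-8 would corrupt byte values >= 128.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of PDFRenderer.
final class BasicStringSketch {

    // Each byte becomes one char with the same numeric value (0..255).
    static String bytesToBasic(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Each char (assumed to be < 256) becomes one byte again.
    static byte[] basicToBytes(String basicString) {
        return basicString.getBytes(StandardCharsets.ISO_8859_1);
    }

    // True if decoding and re-encoding with the charset returns the same bytes.
    static boolean roundTrips(byte[] raw, Charset charset) {
        return Arrays.equals(raw, new String(raw, charset).getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] raw = {0x50, 0x44, 0x46, (byte) 0xE9, (byte) 0xFF};
        String basic = bytesToBasic(raw);
        System.out.println(basic.length());        // one char per byte
        System.out.println((int) basic.charAt(3)); // 233, i.e. 0xE9
        System.out.println(roundTrips(raw, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrips(raw, StandardCharsets.UTF_8));
    }
}
```

The UTF-8 round trip fails here because 0xE9 followed by 0xFF is not a valid UTF-8 sequence, which is why a byte-preserving charset is the convenient carrier for basic strings.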
Author Luke Kirby
-
Constructor Summary
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| static String | asBasicString(byte[] bytes) | Create a basic string from bytes. |
| static String | asBasicString(byte[] bytes, int offset, int length) | Create a basic string from bytes. |
| static byte[] | asBytes(String basicString) | Get the corresponding byte array for a basic string. |
| static String | asPDFDocEncoded(String basicString) | Take a basic PDF string and produce a string of its bytes as encoded in PDFDocEncoding. |
| static String | asTextString(String basicString) | Take a basic PDF string and determine if it is in UTF-16BE encoding by looking at the lead characters for a byte order marking (BOM). |
| static String | asUTF16BEEncoded(String basicString) | Take a basic PDF string and produce a string from its bytes as a UTF-16BE encoding. |
| byte[] | toPDFDocEncoded(String string) | toPDFDocEncoded. |
-
Constructor Details
-
PDFStringUtil
public PDFStringUtil()
-
-
Method Details
-
asTextString
Take a basic PDF string and determine if it is in UTF-16BE encoding by looking at the lead characters for a byte order marking (BOM). If it appears to be UTF-16BE, we return the string representation of the UTF-16BE encoding of those bytes. If the BOM is not present, the bytes from the input string are decoded using the PDFDocEncoding charset.
From the PDF Reference 1.7, p158:
The text string type is used for character strings that are encoded in either PDFDocEncoding or the UTF-16BE Unicode character encoding scheme. PDFDocEncoding can encode all of the ISO Latin 1 character set and is documented in Appendix D. UTF-16BE can encode all Unicode characters. UTF-16BE and Unicode character encoding are described in the Unicode Standard by the Unicode Consortium (see the Bibliography). Note that PDFDocEncoding does not support all Unicode characters whereas UTF-16BE does.
- Parameters:
basicString
- the basic PDF string, as offered by PDFObject.getStringValue()
- Returns:
- the input decoded as UTF-16BE if it starts with a BOM, otherwise the input decoded as PDFDocEncoding
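The BOM check described for asTextString can be sketched as follows. This is a minimal illustration using only JDK classes; `TextStringSketch` and `toTextString` are hypothetical names. The JDK ships no PDFDocEncoding charset, so the fallback branch here simply returns the input unchanged, which is only correct for byte values where PDFDocEncoding and ISO-8859-1 agree. That simplification is an assumption of the sketch, not the library's behavior.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFRenderer.
final class TextStringSketch {

    static String toTextString(String basicString) {
        // A basic string holds one byte per char, so the BOM bytes 0xFE 0xFF
        // appear as the chars U+00FE and U+00FF at the start of the string.
        if (basicString.startsWith("\u00FE\u00FF")) {
            byte[] bytes = basicString.getBytes(StandardCharsets.ISO_8859_1);
            // Skip the two BOM bytes and decode the remainder as UTF-16BE.
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // ASSUMPTION: a full implementation would decode with PDFDocEncoding
        // here; this sketch leaves the string as-is (ISO-8859-1 view).
        return basicString;
    }

    public static void main(String[] args) {
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x48, 0x00, 0x69};
        String basic = new String(utf16, StandardCharsets.ISO_8859_1);
        System.out.println(toTextString(basic));   // prints "Hi"
        System.out.println(toTextString("Title")); // prints "Title"
    }
}
```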
-
asPDFDocEncoded
Take a basic PDF string and produce a string of its bytes as encoded in PDFDocEncoding. The PDFDocEncoding is described in the PDF Reference.
- Parameters:
basicString
- the basic PDF string, as offered by PDFObject.getStringValue()
- Returns:
- the decoding of the string's bytes in PDFDocEncoding
-
asUTF16BEEncoded
Take a basic PDF string and produce a string by decoding its bytes as UTF-16BE. The first 2 bytes are presumed to be the big-endian byte order mark, 0xFE and 0xFF; this method does not check that.
- Parameters:
basicString
- the basic PDF string, as offered by PDFObject.getStringValue()
- Returns:
- the decoding of the string's bytes in UTF-16BE
-
asBytes
Get the corresponding byte array for a basic string. This is effectively the char[] array cast to byte[], as chars in basic strings only use the least significant byte.
- Parameters:
basicString
- the basic PDF string, as offered by PDFObject.getStringValue()
- Returns:
- the bytes corresponding to its characters
-
asBasicString
Create a basic string from bytes. This is effectively the byte array cast to a char array and turned into a String.
- Parameters:
bytes
- the source of the bytes for the basic string
offset
- the offset into bytes where the string starts
length
- the number of bytes to turn into a string
- Returns:
- the corresponding string
-
asBasicString
Create a basic string from bytes. This is effectively the byte array cast to a char array and turned into a String.
- Parameters:
bytes
- the bytes, all of which are used
- Returns:
- the corresponding string
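The casts described for asBytes and asBasicString can be sketched as below. `CastSketch` and its method names are hypothetical, and the code is an illustration of the technique rather than the library's source. One detail worth noting: Java bytes are signed, so the widening step masks with 0xFF to keep values >= 128 in the 0..255 range instead of sign-extending them.

```java
// Hypothetical helper for illustration only; not part of PDFRenderer.
final class CastSketch {

    // Widen a slice of bytes to chars, one char per byte.
    static String bytesToBasic(byte[] bytes, int offset, int length) {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            // Mask so that e.g. (byte) 0x80 becomes U+0080, not U+FF80.
            chars[i] = (char) (bytes[offset + i] & 0xFF);
        }
        return new String(chars);
    }

    // Narrow each char to a byte, keeping only its least significant byte.
    static byte[] basicToBytes(String basicString) {
        byte[] bytes = new byte[basicString.length()];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) basicString.charAt(i);
        }
        return bytes;
    }
}
```

For example, `CastSketch.bytesToBasic(data, 1, 2)` would produce a two-character string from `data[1]` and `data[2]`, and `basicToBytes` on that string would return those two bytes.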
-
toPDFDocEncoded
Encode a String as bytes in PDFDocEncoding.
- Parameters:
string
- a String object.
- Returns:
- an array of byte
- Throws:
CharacterCodingException
- if encoding fails.
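As a rough illustration of how an encoding method can surface a CharacterCodingException, the sketch below uses the JDK's strict CharsetEncoder API. ISO-8859-1 stands in for PDFDocEncoding, which the JDK does not provide; `StrictEncodeSketch` and `encodeStrict` are hypothetical names, and whether toPDFDocEncoded works this way internally is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFRenderer.
final class StrictEncodeSketch {

    static byte[] encodeStrict(String text) throws CharacterCodingException {
        // ASSUMPTION: ISO-8859-1 is a stand-in for PDFDocEncoding.
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        // REPORT makes the encoder throw instead of substituting '?'.
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(text));
        byte[] out = new byte[encoded.remaining()];
        encoded.get(out);
        return out;
    }
}
```

With this setup, a string made only of characters the charset can represent encodes to one byte per character, while a character outside the charset (such as U+20AC in ISO-8859-1) causes the exception to be thrown rather than being silently replaced.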
-