Class PDFStringUtil

java.lang.Object
org.loboevolution.pdfview.PDFStringUtil

public class PDFStringUtil extends Object
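Utility methods for dealing with PDF strings, such as converting basic strings to text strings, decoding basic strings as PDFDocEncoding or UTF-16BE, and converting between basic strings and byte arrays.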

We refer to basic strings as those corresponding to the PDF 'string' type. PDFRenderer represents these as Strings, though this is somewhat deceptive: they are effectively just sequences of bytes, although byte values <= 127 do correspond to the ASCII character set. Beyond this, the 'string' type, as represented by basic strings, does not possess any character set or encoding, and byte values >= 128 are entirely acceptable. For a basic string represented as a String, each character has a value less than 256 and appears in the String as if its bytes had been decoded as ISO-8859-1. This, however, is merely for convenience.

For strings that are user visible, and that don't merely represent some identifying token, the PDF standard employs a 'text string' type, which treats the basic string as an encoding of text in either UTF-16BE (with a byte order mark) or a specific 8-bit encoding, PDFDocEncoding. Using a basic string without conversion when the actual type is a 'text string' is erroneous (though without consequence if the string consists only of ASCII alphanumeric values). Care must be taken to convert basic strings to text strings (also expressed as a String) when appropriate, using either the methods in this class or PDFObject.getTextStringValue(). For strings that are 'byte strings', asBytes(String) or PDFObject.getStream() should be used.
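The difference between a basic string and a text string can be illustrated with plain JDK calls. The following is a minimal sketch, not this class's implementation; the class name BasicVersusTextString and the helpers basicView and textView are hypothetical names introduced only for illustration.

```java
import java.nio.charset.StandardCharsets;

public class BasicVersusTextString {
    // Hypothetical helper: view raw bytes as a basic string, one char per byte.
    // ISO-8859-1 maps each byte 0x00..0xFF to the char with the same value.
    static String basicView(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical helper: decode the bytes after a 2-byte BOM as UTF-16BE.
    // Assumes the caller has already confirmed that the BOM is present.
    static String textView(byte[] raw) {
        return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // BOM (0xFE 0xFF) followed by U+00E9 encoded as UTF-16BE.
        byte[] raw = {(byte) 0xFE, (byte) 0xFF, 0x00, (byte) 0xE9};
        System.out.println(basicView(raw).length()); // prints 4
        System.out.println(textView(raw).length());  // prints 1
    }
}
```

The same four bytes yield a four-character basic string but a one-character text string, which is why displaying a basic string without conversion gives the wrong result for non-ASCII text.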

Author Luke Kirby

  • Constructor Details

    • PDFStringUtil

      public PDFStringUtil()
  • Method Details

    • asTextString

      public static String asTextString(String basicString)

      Take a basic PDF string and determine whether it is in UTF-16BE encoding by checking its leading characters for a byte order mark (BOM). If it appears to be UTF-16BE, we return the string obtained by decoding those bytes as UTF-16BE. If the BOM is not present, the bytes from the input string are decoded using the PDFDocEncoding charset.

      From the PDF Reference 1.7, p158:

      The text string type is used for character strings that are encoded in either PDFDocEncoding or the UTF-16BE Unicode character encoding scheme. PDFDocEncoding can encode all of the ISO Latin 1 character set and is documented in Appendix D. UTF-16BE can encode all Unicode characters. UTF-16BE and Unicode character encoding are described in the Unicode Standard by the Unicode Consortium (see the Bibliography). Note that PDFDocEncoding does not support all Unicode characters whereas UTF-16BE does.
      Parameters:
      basicString - the basic PDF string, as offered by PDFObject.getStringValue()
      Returns:
      the input decoded as UTF-16BE if it begins with a byte order mark, otherwise the input decoded as PDFDocEncoding
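The BOM check described for asTextString can be sketched with JDK classes alone. This is an illustrative approximation under stated assumptions, not the library's code: the class TextStringSketch and the method decodeTextString are hypothetical, and because the JDK ships no PDFDocEncoding charset, the fallback branch returns the input unchanged (equivalent to ISO-8859-1), which matches PDFDocEncoding only for some byte values.

```java
import java.nio.charset.StandardCharsets;

public class TextStringSketch {
    // Hypothetical stand-in for the BOM dispatch described above.
    static String decodeTextString(String basicString) {
        boolean hasBom = basicString.length() >= 2
                && basicString.charAt(0) == 0xFE
                && basicString.charAt(1) == 0xFF;
        if (hasBom) {
            // Recover the raw bytes, then decode everything after the BOM.
            byte[] bytes = basicString.getBytes(StandardCharsets.ISO_8859_1);
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // ASSUMPTION: placeholder for PDFDocEncoding decoding. A real
        // implementation would remap the bytes where PDFDocEncoding and
        // ISO-8859-1 differ.
        return basicString;
    }

    public static void main(String[] args) {
        String withBom = new String(new char[] {0xFE, 0xFF, 0x00, 0x41});
        System.out.println(decodeTextString(withBom)); // prints A
        System.out.println(decodeTextString("plain")); // prints plain
    }
}
```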
    • asPDFDocEncoded

      public static String asPDFDocEncoded(String basicString)
      Take a basic PDF string and produce a string by decoding its bytes as PDFDocEncoding. PDFDocEncoding is described in the PDF Reference.
      Parameters:
      basicString - the basic PDF string, as offered by PDFObject.getStringValue()
      Returns:
      the decoding of the string's bytes in PDFDocEncoding
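To give a feel for what decoding as PDFDocEncoding involves, here is a deliberately partial sketch. It remaps only three byte values (bullet, em dash and euro sign) and passes every other byte through unchanged; the full table is in Appendix D of the PDF Reference. The class PdfDocDecodeSketch and the method decodePartial are hypothetical names, and this is not the implementation used by this class.

```java
import java.util.HashMap;
import java.util.Map;

public class PdfDocDecodeSketch {
    // ASSUMPTION: an intentionally incomplete table, for illustration only.
    private static final Map<Integer, Character> OVERRIDES = new HashMap<>();
    static {
        OVERRIDES.put(0x80, '\u2022'); // bullet
        OVERRIDES.put(0x84, '\u2014'); // em dash
        OVERRIDES.put(0xA0, '\u20AC'); // euro sign
    }

    // Hypothetical helper: map each byte of a basic string through the table,
    // leaving unmapped bytes at the same code point.
    static String decodePartial(String basicString) {
        StringBuilder sb = new StringBuilder(basicString.length());
        for (int i = 0; i < basicString.length(); i++) {
            int b = basicString.charAt(i) & 0xFF;
            Character mapped = OVERRIDES.get(b);
            sb.append(mapped != null ? mapped.charValue() : (char) b);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String basic = String.valueOf((char) 0xA0) + "5";
        System.out.println(decodePartial(basic).equals("\u20AC5")); // prints true
    }
}
```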
    • asUTF16BEEncoded

      public static String asUTF16BEEncoded(String basicString)
      Take a basic PDF string and produce a string by decoding its bytes as UTF-16BE. The first 2 bytes are presumed to be the big-endian byte order mark, 0xFE and 0xFF; this is not checked by this method.
      Parameters:
      basicString - the basic PDF string, as offered by PDFObject.getStringValue()
      Returns:
      the decoding of the string's bytes in UTF-16BE
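Because the BOM is presumed rather than checked, passing a string without one silently loses its first two bytes. The sketch below shows that effect using JDK calls only; the class Utf16BeSketch and the methods decodeSkippingBom and basic are hypothetical names, and the code assumes the input is at least two bytes long.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BeSketch {
    // Hypothetical stand-in: drop two leading bytes unconditionally and
    // decode the remainder as UTF-16BE.
    static String decodeSkippingBom(String basicString) {
        byte[] bytes = basicString.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
    }

    // Hypothetical helper: build a basic string from raw byte values.
    static String basic(int... values) {
        StringBuilder sb = new StringBuilder();
        for (int v : values) {
            sb.append((char) (v & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String withBom = basic(0xFE, 0xFF, 0x00, 'H', 0x00, 'i');
        String withoutBom = basic(0x00, 'H', 0x00, 'i');
        System.out.println(decodeSkippingBom(withBom));    // prints Hi
        System.out.println(decodeSkippingBom(withoutBom)); // prints i
    }
}
```

Callers that cannot guarantee a BOM may prefer to test the first two characters for 0xFE and 0xFF before decoding.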
    • asBytes

      public static byte[] asBytes(String basicString)
      Get the corresponding byte array for a basic string. This is effectively the char[] array cast to byte[], as chars in basic strings only use the least significant byte.
      Parameters:
      basicString - the basic PDF string, as offered by PDFObject.getStringValue()
      Returns:
      the bytes corresponding to its characters
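The narrowing described for asBytes amounts to keeping the low byte of each char. A minimal sketch, with the hypothetical names AsBytesSketch and lowBytes, might look like this:

```java
public class AsBytesSketch {
    // Hypothetical helper: keep only the least significant byte of each char.
    static byte[] lowBytes(String basicString) {
        byte[] out = new byte[basicString.length()];
        for (int i = 0; i < out.length; i++) {
            // The cast discards the high byte, which is zero for a basic string.
            out[i] = (byte) basicString.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = lowBytes("A" + (char) 0xFF);
        System.out.println(bytes[0]); // prints 65
        System.out.println(bytes[1]); // prints -1 (0xFF as a signed byte)
    }
}
```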
    • asBasicString

      public static String asBasicString(byte[] bytes, int offset, int length)
      Create a basic string from bytes. This is effectively the byte array cast to a char array and turned into a String.
      Parameters:
      bytes - the source of the bytes for the basic string
      offset - the offset into bytes where the string starts
      length - the number of bytes to turn into a string
      Returns:
      the corresponding string
    • asBasicString

      public static String asBasicString(byte[] bytes)
      Create a basic string from bytes. This is effectively the byte array cast to a char array and turned into a String.
      Parameters:
      bytes - the bytes, all of which are used
      Returns:
      the corresponding string
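Both asBasicString overloads describe widening bytes to chars. Since Java bytes are signed, a sketch of that widening needs a mask so that values of 0x80 and above stay below 256. The class AsBasicStringSketch and the method fromBytes are hypothetical names, not the library's code.

```java
public class AsBasicStringSketch {
    // Hypothetical helper: widen a slice of bytes to chars in the range 0..255.
    static String fromBytes(byte[] bytes, int offset, int length) {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            // Mask with 0xFF so negative bytes do not sign-extend to 0xFFxx.
            chars[i] = (char) (bytes[offset + i] & 0xFF);
        }
        return new String(chars);
    }

    // Convenience overload that uses the whole array.
    static String fromBytes(byte[] bytes) {
        return fromBytes(bytes, 0, bytes.length);
    }

    public static void main(String[] args) {
        byte[] data = {0x41, 0x42, 0x43, (byte) 0xE9};
        System.out.println(fromBytes(data, 1, 2));            // prints BC
        System.out.println((int) fromBytes(data).charAt(3));  // prints 233
    }
}
```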
    • toPDFDocEncoded

      public byte[] toPDFDocEncoded(String string) throws CharacterCodingException

      Encode a string as bytes in PDFDocEncoding.

      Parameters:
      string - a String object.
      Returns:
      an array of byte objects.
      Throws:
      CharacterCodingException - if any.