org.loboevolution.pdfview.PDFStringUtil

public class PDFStringUtil extends Object

Utility methods for dealing with PDF Strings, such as:

converting to text strings
converting to PDFDocEncoded strings
converting to UTF-16BE strings
converting basic strings between byte and string representations

We refer to basic strings as those corresponding to the PDF 'string' type. PDFRenderer represents these as Strings, though this is somewhat deceptive: they are effectively just sequences of bytes, although byte values <= 127 do correspond to the ASCII character set. Beyond that, the 'string' type, as represented by basic strings, does not possess any character set or encoding, and byte values >= 128 are entirely acceptable. For a basic string represented as a String, each character has a value less than 256 and is stored as if the bytes had been decoded as ISO-8859-1. This, however, is merely for convenience.

For strings that are user visible, and that don't merely represent some identifying token, the PDF standard employs a 'text string' type, which treats the basic string as an encoding of text in either UTF-16BE (with a byte order mark) or a specific 8-bit encoding, PDFDocEncoding. Using a basic string without conversion when the actual type is a 'text string' is erroneous (though without consequence if the string consists only of ASCII alphanumeric values). Care must be taken to convert basic strings to text strings (also expressed as a String) when appropriate, using either the methods in this class or PDFObject.getTextStringValue(). For strings that are 'byte strings', asBytes(String) or PDFObject.getStream() should be used.
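The byte-container nature of a basic string can be illustrated with plain JDK calls. The following is a minimal sketch, not the library's implementation: the class `BasicStringSketch` and its method names are hypothetical, and it assumes only the ISO-8859-1 convention described above (one char per byte, each char value below 256).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of org.loboevolution.pdfview.
final class BasicStringSketch {

    // Bytes -> basic string: each byte becomes one char with the same value (0..255).
    static String toBasicString(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Basic string -> bytes: each char (assumed to be below 256) becomes one byte.
    static byte[] toBytes(String basicString) {
        return basicString.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Every possible byte value survives the round trip unchanged.
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String basic = toBasicString(all);
        System.out.println(basic.length() == 256);
        System.out.println(Arrays.equals(toBytes(basic), all));
    }
}
```

Because the mapping is one-to-one for all 256 byte values, the String here carries no textual meaning of its own; it only becomes text once it is decoded as a text string.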

Author Luke Kirby

Constructor Summary

Constructors

Constructor

Description

PDFStringUtil()
Method Summary

Modifier and Type

Method

Description

static String

asBasicString(byte[] bytes)

Create a basic string from bytes.

static String

asBasicString(byte[] bytes, int offset, int length)

Create a basic string from bytes.

static byte[]

asBytes(String basicString)

Get the corresponding byte array for a basic string.

static String

asPDFDocEncoded(String basicString)

Take a basic PDF string and produce a string of its bytes as encoded in PDFDocEncoding.

static String

asTextString(String basicString)

Take a basic PDF string and determine if it is in UTF-16BE encoding by looking at the lead characters for a byte order marking (BOM).

static String

asUTF16BEEncoded(String basicString)

Take a basic PDF string and produce a string by decoding its bytes as UTF-16BE.

byte[]

toPDFDocEncoded(String string)

Encode a string as bytes in PDFDocEncoding.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- PDFStringUtil
  
  public PDFStringUtil()
Method Details
- asTextString
  
  public static String asTextString(String basicString)
  
  Take a basic PDF string and determine if it is in UTF-16BE encoding by looking at the lead characters for a byte order marking (BOM). If it appears to be UTF-16BE, we return the string representation of the UTF-16BE encoding of those bytes. If the BOM is not present, the bytes from the input string are decoded using the PDFDocEncoding charset.
  
  From the PDF Reference 1.7, p158:
  The text string type is used for character strings that are encoded in either PDFDocEncoding or the UTF-16BE Unicode character encoding scheme. PDFDocEncoding can encode all of the ISO Latin 1 character set and is documented in Appendix D. UTF-16BE can encode all Unicode characters. UTF-16BE and Unicode character encoding are described in the Unicode Standard by the Unicode Consortium (see the Bibliography). Note that PDFDocEncoding does not support all Unicode characters whereas UTF-16BE does.
  
  Parameters:
  
  basicString - the basic PDF string, as offered by PDFObject.getStringValue()
  
  Returns:
  
  the input decoded as UTF-16BE if a BOM is present, otherwise the input decoded as PDFDocEncoding
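
  The BOM check described above can be sketched with standard JDK classes. This is an illustrative approximation, not the library's code: `TextStringSketch` and its methods are hypothetical names, and the non-BOM branch simply returns the input unchanged as a stand-in for real PDFDocEncoding decoding (which is only equivalent for plain ASCII input).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of org.loboevolution.pdfview.
final class TextStringSketch {

    // A UTF-16BE text string starts with the bytes 0xFE 0xFF.
    static boolean hasUtf16BeBom(String basicString) {
        return basicString.length() >= 2
                && basicString.charAt(0) == 0xFE
                && basicString.charAt(1) == 0xFF;
    }

    static String asTextSketch(String basicString) {
        if (hasUtf16BeBom(basicString)) {
            // Recover the raw bytes, then decode everything after the 2-byte BOM.
            byte[] bytes = basicString.getBytes(StandardCharsets.ISO_8859_1);
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Stand-in only: a real implementation would decode as PDFDocEncoding here.
        return basicString;
    }

    public static void main(String[] args) {
        // Bytes FE FF 00 48 00 69 are the UTF-16BE encoding of "Hi" with a BOM.
        System.out.println(asTextSketch("\u00FE\u00FF\u0000H\u0000i"));
        System.out.println(asTextSketch("plain ASCII"));
    }
}
```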
- asPDFDocEncoded
  
  public static String asPDFDocEncoded(String basicString)
  
  Take a basic PDF string and produce a string of its bytes as encoded in PDFDocEncoding. The PDFDocEncoding is described in the PDF Reference.
  
  Parameters:
  
  basicString - the basic PDF string, as offered by PDFObject.getStringValue()
  
  Returns:
  
  the decoding of the string's bytes in PDFDocEncoding
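
  To show why PDFDocEncoding decoding differs from reading the basic string as-is, here is a deliberately partial sketch. The class `PdfDocDecodeSketch` is hypothetical, and its table holds only two entries (byte 0x80 as the bullet and byte 0xA0 as the Euro sign, as listed in the PDF Reference's PDFDocEncoding table). All other bytes fall through to their Latin-1 value, which is not accurate for the full encoding.

```java
import java.util.Map;

// Hypothetical illustration class; not part of org.loboevolution.pdfview.
final class PdfDocDecodeSketch {

    // Partial, illustrative table: two bytes whose meaning differs from ISO-8859-1.
    private static final Map<Integer, Character> OVERRIDES = Map.of(
            0x80, '\u2022',   // bullet
            0xA0, '\u20AC');  // Euro sign

    static String decode(String basicString) {
        StringBuilder out = new StringBuilder(basicString.length());
        for (int i = 0; i < basicString.length(); i++) {
            int b = basicString.charAt(i) & 0xFF;
            Character mapped = OVERRIDES.get(b);
            // Assumption: unmapped bytes are passed through as Latin-1 characters.
            out.append(mapped != null ? mapped.charValue() : (char) b);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("\u0080 item");
        System.out.println(decoded.charAt(0) == '\u2022');
    }
}
```

  A complete decoder would need the full 256-entry table from the PDF Reference rather than this two-entry map.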
- asUTF16BEEncoded
  
  public static String asUTF16BEEncoded(String basicString)
  
  Take a basic PDF string and produce a string by decoding its bytes as UTF-16BE. The first 2 bytes are presumed to be the big-endian byte order mark, 0xFE and 0xFF; this method does not check that they are.
  
  Parameters:
  
  basicString - the basic PDF string, as offered by PDFObject.getStringValue()
  
  Returns:
  
  the decoding of the string's bytes in UTF-16BE
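
  The unchecked skip of the first two bytes can be sketched as follows. This is an assumption-laden illustration using only JDK classes; `Utf16BeSketch` and `decodeSkippingBom` are hypothetical names, and the sketch throws an exception for inputs shorter than two characters, which may differ from the real method.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of org.loboevolution.pdfview.
final class Utf16BeSketch {

    // Drops the first two bytes without inspecting them, then decodes the rest as UTF-16BE.
    static String decodeSkippingBom(String basicString) {
        byte[] bytes = basicString.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        // FE FF D8 3D DE 00: a BOM followed by the surrogate pair for U+1F600.
        String emoji = decodeSkippingBom("\u00FE\u00FF\u00D8\u003D\u00DE\u0000");
        System.out.println(Integer.toHexString(emoji.codePointAt(0)));

        // The first two bytes are not validated: "XY" is discarded just like a BOM.
        System.out.println(decodeSkippingBom("XY\u0000A"));
    }
}
```

  Since nothing is validated, a caller that is unsure whether a BOM is present should go through asTextString instead.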
- asBytes
  
  public static byte[] asBytes(String basicString)
  
  Get the corresponding byte array for a basic string. This is effectively the char[] array cast to byte[], as chars in basic strings only use the least significant byte.
  
  Parameters:
  
  basicString - the basic PDF string, as offered by PDFObject.getStringValue()
  
  Returns:
  
  the bytes corresponding to its characters
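
  The "char array cast to bytes" behaviour can be sketched like this. It is a minimal illustration under the assumption that each char is narrowed to its low byte; `LowByteSketch` and `lowBytes` are hypothetical names, not the library's code.

```java
// Hypothetical illustration class; not part of org.loboevolution.pdfview.
final class LowByteSketch {

    // Keeps only the least significant byte of each char.
    static byte[] lowBytes(String basicString) {
        byte[] out = new byte[basicString.length()];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) basicString.charAt(i);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] ok = lowBytes("\u00FFA");
        System.out.println((ok[0] & 0xFF) + " " + ok[1]);

        // A char above 255 is not a valid basic-string char; its high byte is silently lost.
        byte[] lossy = lowBytes("\u0141");
        System.out.println(lossy[0]);
    }
}
```

  The second call shows why this conversion is only meaningful for genuine basic strings: a text string containing characters above 255 would be silently corrupted.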
- asBasicString
  
  public static String asBasicString(byte[] bytes, int offset, int length)
  
  Create a basic string from bytes. This is effectively the byte array cast to a char array and turned into a String.
  
  Parameters:
  
  bytes - the source of the bytes for the basic string
  
  offset - the offset into bytes where the string starts
  
  length - the number of bytes to turn into a string
  
  Returns:
  
  the corresponding string
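
  The offset and length parameters, and the byte-to-char widening, can be sketched as below. `FromBytesSketch` and `fromBytes` are hypothetical names. The `& 0xFF` mask is an assumption about how a correct implementation must behave, since Java bytes are signed and a plain cast would turn 0xE9 into the char 0xFFE9.

```java
// Hypothetical illustration class; not part of org.loboevolution.pdfview.
final class FromBytesSketch {

    // Widens each byte in [offset, offset + length) to a char in the range 0..255.
    static String fromBytes(byte[] bytes, int offset, int length) {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = (char) (bytes[offset + i] & 0xFF); // mask avoids sign extension
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // A PDF literal string "(Hi\351)" as raw bytes; skip the surrounding parentheses.
        byte[] raw = {'(', 'H', 'i', (byte) 0xE9, ')'};
        String basic = fromBytes(raw, 1, 3);
        System.out.println(basic.length());
        System.out.println((int) basic.charAt(2));
    }
}
```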
- asBasicString
  
  public static String asBasicString(byte[] bytes)
  
  Create a basic string from bytes. This is effectively the byte array cast to a char array and turned into a String.
  
  Parameters:
  
  bytes - the bytes, all of which are used
  
  Returns:
  
  the corresponding string
- toPDFDocEncoded
  
  public byte[] toPDFDocEncoded(String string) throws CharacterCodingException
  
  Encode a string as bytes in PDFDocEncoding.
  
  Parameters:
  
  string - a String object.
  
  Returns:
  
  an array of byte values.
  
  Throws:
  
  CharacterCodingException - if any.
