com.openexchange.mail.text
Class HTMLProcessing

java.lang.Object
  extended by com.openexchange.mail.text.HTMLProcessing

public final class HTMLProcessing
extends java.lang.Object

HTMLProcessing - Various methods for HTML processing.

Author:
Thorben Betten

Field Summary
static java.util.regex.Pattern PATTERN_LINK
          The regular expression to match URLs and anchors inside text.
static java.util.regex.Pattern PATTERN_LINK_WITH_GROUP
          The regular expression to match URLs and anchors inside text.
static java.util.regex.Pattern PATTERN_URL
          The regular expression to match URLs inside text.
 
Method Summary
static java.lang.String convertAndKeepQuotes(java.lang.String htmlContent, Html2TextConverter converter)
          Converts given HTML content into plain text, but keeps <blockquote> tags if any are present.
static org.w3c.dom.Document createDOMDocument(java.lang.String string)
          Creates a DOM document from specified XML/HTML string.
static java.lang.String filterExternalImages(java.lang.String htmlContent, boolean[] modified)
          Filters externally loaded images out of specified HTML content.
static java.lang.String filterInlineImages(java.lang.String content, com.openexchange.session.Session session, MailPath msgUID)
          Filters inline images occurring in the HTML content of a message. The source of an inline image is in the message itself.
static java.lang.String filterWhitelist(java.lang.String htmlContent)
          Filters specified HTML content according to white-list filter.
static java.lang.String formatContentForDisplay(java.lang.String content, java.lang.String charset, boolean isHtml, com.openexchange.session.Session session, MailPath mailPath, UserSettingMail usm, boolean[] modified, DisplayMode mode)
          Performs all the formatting for both text and HTML content for a proper display according to specified user's mail settings.
static java.lang.String formatHrefLinks(java.lang.String content)
          Searches for non-HTML links and converts them to valid HTML links.
static java.lang.String formatHTMLForDisplay(java.lang.String content, java.lang.String charset, com.openexchange.session.Session session, MailPath mailPath, UserSettingMail usm, boolean[] modified, DisplayMode mode)
          Performs all the formatting for HTML content for a proper display according to specified user's mail settings.
static java.lang.String formatTextForDisplay(java.lang.String content, UserSettingMail usm, DisplayMode mode)
          Performs all the formatting for text content for a proper display according to specified user's mail settings.
static java.lang.String getConformHTML(java.lang.String htmlContent, ContentType contentType)
          Creates valid HTML that conforms to W3C standards from specified HTML content.
static java.lang.String getConformHTML(java.lang.String htmlContent, java.lang.String charset)
          Creates valid HTML that conforms to W3C standards from specified HTML content.
static java.lang.Character getHTMLEntity(java.lang.String entity)
          Maps specified HTML entity - e.g. &uuml; - to corresponding character.
static java.io.InputStream getTidyMessages()
          Gets the messages used by JTidy as an input stream.
static java.lang.String htmlFormat(java.lang.String plainText)
          Formats plain text to HTML by escaping HTML special characters, e.g. "<" is converted to "&lt;".
static java.lang.String htmlFormat(java.lang.String plainText, boolean withQuote)
          Formats plain text to HTML by escaping HTML special characters, e.g. "<" is converted to "&lt;".
static java.lang.String prettyPrint(java.lang.String htmlContent)
          Pretty prints specified HTML content.
static java.lang.String prettyPrintXML(org.w3c.dom.Node node)
          Pretty-prints specified XML/HTML node.
static java.lang.String prettyPrintXML(java.lang.String string)
          Pretty-prints specified XML/HTML string.
static java.lang.String replaceHTMLEntities(java.lang.String content)
          Replaces all HTML entities occurring in specified HTML content.
static java.lang.String replaceHTMLSimpleQuotesForDisplay(java.lang.String htmlText)
          Turns all simple quotes "&gt; " occurring in specified HTML text to colored "<blockquote>" tags according to configured quote colors.
static java.lang.String urlEncodeSafe(java.lang.String text, java.lang.String charset)
          Translates specified string into application/x-www-form-urlencoded format using a specific encoding scheme.
static java.lang.String validate(java.lang.String htmlContent)
          Validates specified HTML content with the Tidy HTML library.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

PATTERN_URL

public static final java.util.regex.Pattern PATTERN_URL
The regular expression to match URLs inside text:
\(?\b(?:https?://|ftp://|mailto:|news\.|www\.)[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]

Parentheses, if present, are allowed in the URL; a leading one is absorbed into the match, too. An enclosing pair can be stripped from the match as follows:

 String s = matcher.group();
 int mlen = s.length() - 1;
 if (mlen > 0 && '(' == s.charAt(0) && ')' == s.charAt(mlen)) {
     s = s.substring(1, mlen);
 }
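
The snippet above can be placed inside an ordinary find loop. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, and the pattern is a local copy of the expression documented here (with the news prefix written as news\.), not the library constant itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlPatternSketch {

    // Local copy of the documented expression (assumption: not the library's own constant).
    static final Pattern URL = Pattern.compile(
        "\\(?\\b(?:https?://|ftp://|mailto:|news\\.|www\\.)"
        + "[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    /** Collects all URLs found in the text, stripping one enclosing pair of parentheses. */
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            String s = matcher.group();
            int mlen = s.length() - 1;
            // The leading '(' is absorbed by the pattern, so remove it together with the trailing ')'.
            if (mlen > 0 && '(' == s.charAt(0) && ')' == s.charAt(mlen)) {
                s = s.substring(1, mlen);
            }
            urls.add(s);
        }
        return urls;
    }
}
```

Because the final character class excludes punctuation such as "." and ",", a sentence-ending period directly after a URL is not part of the match.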
 


PATTERN_LINK

public static final java.util.regex.Pattern PATTERN_LINK
The regular expression to match URLs and anchors inside text.
 String s = matcher.group();
 int mlen = s.length() - 1;
 if (mlen > 0 && '(' == s.charAt(0) && ')' == s.charAt(mlen)) {
     s = s.substring(1, mlen);
 }
 


PATTERN_LINK_WITH_GROUP

public static final java.util.regex.Pattern PATTERN_LINK_WITH_GROUP
The regular expression to match URLs and anchors inside text. The URLs are matched in capturing group #1.
 String s = matcher.group(1);
 int mlen = s.length() - 1;
 if (mlen > 0 && '(' == s.charAt(0) && ')' == s.charAt(mlen)) {
     s = s.substring(1, mlen);
 }
 

Method Detail

formatTextForDisplay

public static java.lang.String formatTextForDisplay(java.lang.String content,
                                                    UserSettingMail usm,
                                                    DisplayMode mode)
Performs all the formatting for text content for a proper display according to specified user's mail settings.

Parameters:
content - The plain text content
usm - The settings used for formatting content
mode - The display mode
Returns:
The formatted content
See Also:
formatContentForDisplay(String, String, boolean, Session, MailPath, UserSettingMail, boolean[], DisplayMode)

formatHTMLForDisplay

public static java.lang.String formatHTMLForDisplay(java.lang.String content,
                                                    java.lang.String charset,
                                                    com.openexchange.session.Session session,
                                                    MailPath mailPath,
                                                    UserSettingMail usm,
                                                    boolean[] modified,
                                                    DisplayMode mode)
Performs all the formatting for HTML content for a proper display according to specified user's mail settings.

Parameters:
content - The HTML content
charset - The character encoding
session - The session
mailPath - The message's unique path in mailbox
usm - The settings used for formatting content
modified - A boolean array with length 1 to store modified status of external images filter
mode - The display mode
Returns:
The formatted content
See Also:
formatContentForDisplay(String, String, boolean, Session, MailPath, UserSettingMail, boolean[], DisplayMode)

formatContentForDisplay

public static java.lang.String formatContentForDisplay(java.lang.String content,
                                                       java.lang.String charset,
                                                       boolean isHtml,
                                                       com.openexchange.session.Session session,
                                                       MailPath mailPath,
                                                       UserSettingMail usm,
                                                       boolean[] modified,
                                                       DisplayMode mode)
Performs all the formatting for both text and HTML content for a proper display according to specified user's mail settings.

If content is plain text:

  1. Plain text content is converted to valid HTML if at least DisplayMode.MODIFYABLE is given
  2. If enabled by settings, simple quotes are turned into colored block quotes if DisplayMode.DISPLAY is given
  3. HTML links and URLs found in content are prepared for proper display if DisplayMode.DISPLAY is given

If content is HTML:

  1. Both inline and non-inline images found in HTML content are prepared according to settings if DisplayMode.DISPLAY is given
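
The rules above can be read as a small decision table. The sketch below only lists which steps would apply for a given input; it performs no formatting. The Mode enum is a hypothetical stand-in for DisplayMode (the RAW constant and the ordering of constants are assumptions), and quotesEnabled stands in for the relevant user setting.

```java
import java.util.ArrayList;
import java.util.List;

final class DisplayStepsSketch {

    /** Hypothetical stand-in for DisplayMode; RAW and the constant order are assumptions. */
    enum Mode { RAW, MODIFYABLE, DISPLAY }

    /** Returns the documented steps that apply, in order; does not transform any content. */
    static List<String> plannedSteps(boolean isHtml, Mode mode, boolean quotesEnabled) {
        List<String> steps = new ArrayList<>();
        boolean display = (mode == Mode.DISPLAY);
        if (isHtml) {
            if (display) {
                steps.add("prepare inline and non-inline images according to settings");
            }
            return steps;
        }
        // "At least MODIFYABLE" is modeled by comparing enum order.
        if (mode.compareTo(Mode.MODIFYABLE) >= 0) {
            steps.add("convert plain text to valid HTML");
        }
        if (display && quotesEnabled) {
            steps.add("turn simple quotes into colored block quotes");
        }
        if (display) {
            steps.add("prepare HTML links and URLs for display");
        }
        return steps;
    }
}
```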

Parameters:
content - The content
charset - The character encoding (only needed by HTML content; may be null on plain text)
isHtml - true if content is of type text/html; otherwise false
session - The session
mailPath - The message's unique path in mailbox
usm - The settings used for formatting content
modified - A boolean array with length 1 to store modified status of external images filter (only needed by HTML content; may be null on plain text)
mode - The display mode
Returns:
The formatted content

formatHrefLinks

public static java.lang.String formatHrefLinks(java.lang.String content)
Searches for non-HTML links and converts them to valid HTML links.

Example: http://www.somewhere.com is converted to <a href="http://www.somewhere.com">http://www.somewhere.com</a>.
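
The conversion in the example can be approximated with a plain regular-expression replacement. This is a simplified sketch, not the method's actual implementation: the class name and pattern are assumptions, only http and https URLs are handled, and text that already contains anchor tags is not treated specially.

```java
import java.util.regex.Pattern;

final class HrefLinkSketch {

    // Simplified URL pattern (assumption); trailing punctuation such as '.' is left outside the link.
    private static final Pattern BARE_URL = Pattern.compile(
        "\\bhttps?://[-A-Za-z0-9+&@#/%?=~_|!:,.;]*[-A-Za-z0-9+&@#/%=~_|]");

    /** Wraps every bare URL in an anchor whose href and label are both the URL. */
    static String linkify(String content) {
        // $0 refers to the whole match and is inserted literally.
        return BARE_URL.matcher(content).replaceAll("<a href=\"$0\">$0</a>");
    }
}
```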

Parameters:
content - The content to search in
Returns:
The given content with all non-HTML links converted to valid HTML links

getConformHTML

public static java.lang.String getConformHTML(java.lang.String htmlContent,
                                              ContentType contentType)
Creates valid HTML that conforms to W3C standards from specified HTML content.

Parameters:
htmlContent - The HTML content
contentType - The corresponding content type (including charset parameter)
Returns:
The HTML content conforming to W3C standards

getConformHTML

public static java.lang.String getConformHTML(java.lang.String htmlContent,
                                              java.lang.String charset)
Creates valid HTML that conforms to W3C standards from specified HTML content.

Parameters:
htmlContent - The HTML content
charset - The charset parameter
Returns:
The HTML content conforming to W3C standards

createDOMDocument

public static org.w3c.dom.Document createDOMDocument(java.lang.String string)
Creates a DOM document from specified XML/HTML string.

Parameters:
string - The XML/HTML string
Returns:
A newly created DOM document or null if given string cannot be transformed to a DOM document
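
Since the method signals failure with null instead of an exception, callers need a null check. The sketch below shows the same contract with the standard JAXP parser; which parser the library actually uses is not stated here, so DomSketch and parseOrNull are hypothetical names, and unlike the documented method this version accepts only well-formed XML.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class DomSketch {

    /** Parses well-formed markup into a DOM document; returns null if parsing fails. */
    static Document parseOrNull(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Mirror the documented contract: no exception, just null.
            return null;
        }
    }
}
```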

prettyPrintXML

public static java.lang.String prettyPrintXML(java.lang.String string)
Pretty-prints specified XML/HTML string.

Parameters:
string - The XML/HTML string to pretty-print
Returns:
The pretty-printed XML/HTML string
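
For comparison, pretty-printing well-formed markup can be done with the JDK's identity transformer. This is a sketch under assumptions, not the library's implementation: PrettyPrintSketch is a hypothetical name, the exact indentation depends on the JDK's transformer, and the input is returned unchanged if it cannot be processed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class PrettyPrintSketch {

    /** Re-serializes well-formed XML with indentation enabled. */
    static String prettyPrint(String xml) {
        try {
            // A transformer created without a stylesheet copies input to output.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Assumption: fall back to the unmodified input on failure.
            return xml;
        }
    }
}
```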

prettyPrintXML

public static java.lang.String prettyPrintXML(org.w3c.dom.Node node)
Pretty-prints specified XML/HTML node.

Parameters:
node - The XML/HTML node to pretty-print
Returns:
The pretty-printed XML/HTML node

getTidyMessages

public static java.io.InputStream getTidyMessages()
                                           throws java.io.IOException
Gets the messages used by JTidy as an input stream.

Returns:
The messages used by JTidy as an input stream
Throws:
java.io.IOException - If input stream cannot be generated

validate

public static java.lang.String validate(java.lang.String htmlContent)
Validates specified HTML content with the Tidy HTML library.

Parameters:
htmlContent - The HTML content
Returns:
The validated HTML content

prettyPrint

public static java.lang.String prettyPrint(java.lang.String htmlContent)
Pretty prints specified HTML content.

Parameters:
htmlContent - The HTML content
Returns:
Pretty printed HTML content

convertAndKeepQuotes

public static java.lang.String convertAndKeepQuotes(java.lang.String htmlContent,
                                                    Html2TextConverter converter)
                                             throws java.io.IOException
Converts given HTML content into plain text, but keeps <blockquote> tags if any are present.
NOTE: returned content is again HTML content.

Parameters:
htmlContent - The HTML content
converter - The instance of Html2TextConverter
Returns:
The partially converted plain text version of given HTML content as HTML content
Throws:
java.io.IOException - If an I/O error occurs

replaceHTMLEntities

public static java.lang.String replaceHTMLEntities(java.lang.String content)
Replaces all HTML entities occurring in specified HTML content.

Parameters:
content - The content
Returns:
The content with HTML entities replaced
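
Entity replacement of this kind is typically a table lookup driven by a regular expression. The sketch below is an illustration only: the class name, the pattern, and the deliberately tiny entity table are assumptions, and unknown entities are left untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntitySketch {

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    // Small illustrative table (assumption); a real table covers far more entities.
    private static final Map<String, Character> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", '&');
        ENTITIES.put("lt", '<');
        ENTITIES.put("gt", '>');
        ENTITIES.put("quot", '"');
        ENTITIES.put("auml", '\u00E4');
        ENTITIES.put("ouml", '\u00F6');
        ENTITIES.put("uuml", '\u00FC');
    }

    /** Replaces known named entities with their characters; unknown ones stay as-is. */
    static String replaceEntities(String content) {
        Matcher m = ENTITY.matcher(content);
        StringBuffer sb = new StringBuffer(content.length());
        while (m.find()) {
            Character c = ENTITIES.get(m.group(1));
            String replacement = (c == null) ? m.group() : String.valueOf(c);
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```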

getHTMLEntity

public static java.lang.Character getHTMLEntity(java.lang.String entity)
Maps specified HTML entity - e.g. &uuml; - to corresponding character.

Parameters:
entity - The HTML entity
Returns:
The corresponding character or null

htmlFormat

public static java.lang.String htmlFormat(java.lang.String plainText,
                                          boolean withQuote)
Formats plain text to HTML by escaping HTML special characters, e.g. "<" is converted to "&lt;".

Parameters:
plainText - The plain text
withQuote - Whether to escape quotes (") or not
Returns:
properly escaped HTML content

htmlFormat

public static java.lang.String htmlFormat(java.lang.String plainText)
Formats plain text to HTML by escaping HTML special characters, e.g. "<" is converted to "&lt;".

This is just a convenience method which invokes htmlFormat(String, boolean) with the latter parameter set to true.

Parameters:
plainText - The plain text
Returns:
properly escaped HTML content
See Also:
htmlFormat(String, boolean)
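
The relationship between the two overloads can be sketched as follows. This is a minimal illustration, not the library code: HtmlEscapeSketch is a hypothetical name, and only the four basic special characters are escaped, whereas the real method may handle more (line breaks or non-ASCII characters, for instance).

```java
final class HtmlEscapeSketch {

    /** Escapes '&', '<' and '>'; escapes '"' only when withQuote is true. */
    static String escape(String plainText, boolean withQuote) {
        StringBuilder sb = new StringBuilder(plainText.length() + 16);
        for (int i = 0; i < plainText.length(); i++) {
            char c = plainText.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                case '"':
                    sb.append(withQuote ? "&quot;" : "\"");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Convenience overload: delegates with withQuote set to true. */
    static String escape(String plainText) {
        return escape(plainText, true);
    }
}
```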

replaceHTMLSimpleQuotesForDisplay

public static java.lang.String replaceHTMLSimpleQuotesForDisplay(java.lang.String htmlText)
Turns all simple quotes "&gt; " occurring in specified HTML text to colored "<blockquote>" tags according to configured quote colors.

Parameters:
htmlText - The HTML text
Returns:
The HTML text with simple quotes replaced with block quotes

filterWhitelist

public static java.lang.String filterWhitelist(java.lang.String htmlContent)
Filters specified HTML content according to white-list filter.

Parameters:
htmlContent - The HTML content
Returns:
The filtered HTML content

filterExternalImages

public static java.lang.String filterExternalImages(java.lang.String htmlContent,
                                                    boolean[] modified)
Filters externally loaded images out of specified HTML content.

Parameters:
htmlContent - The HTML content
modified - A boolean array with length 1 to store modified status
Returns:
The HTML content stripped of external images
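
The one-element boolean array acts as an out-parameter: the caller passes it in and reads index 0 afterwards to learn whether anything was filtered. The sketch below illustrates that calling convention only. The class name and the regular expression are assumptions, and the real filter may treat images differently (for example by rewriting rather than removing them).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExternalImageSketch {

    // Matches <img> tags whose src starts with http:// or https:// (simplified assumption).
    private static final Pattern EXTERNAL_IMG = Pattern.compile(
        "<img\\b[^>]*\\bsrc\\s*=\\s*[\"']?https?://[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Removes external images and records in modified[0] whether any were removed. */
    static String stripExternalImages(String html, boolean[] modified) {
        Matcher m = EXTERNAL_IMG.matcher(html);
        if (!m.find()) {
            return html; // nothing to filter; modified stays untouched
        }
        if (modified != null && modified.length > 0) {
            modified[0] = true;
        }
        // replaceAll resets the matcher before replacing every occurrence.
        return m.replaceAll("");
    }
}
```

A caller would typically allocate new boolean[1], invoke the filter, and then check element 0 to decide whether to offer the user an option to load the blocked images.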

filterInlineImages

public static java.lang.String filterInlineImages(java.lang.String content,
                                                  com.openexchange.session.Session session,
                                                  MailPath msgUID)
Filters inline images occurring in the HTML content of a message. The source of an inline image is in the message itself.

Parameters:
content - The HTML content possibly containing images
session - The session
msgUID - The message's unique path in mailbox
Returns:
The HTML content with all inline images replaced with valid links

urlEncodeSafe

public static java.lang.String urlEncodeSafe(java.lang.String text,
                                             java.lang.String charset)
Translates specified string into application/x-www-form-urlencoded format using a specific encoding scheme. This method uses the supplied encoding scheme to obtain the bytes for unsafe characters.

Parameters:
text - The string to be translated.
charset - The character encoding to use; should be UTF-8 according to W3C
Returns:
The translated string or the string itself if any error occurred
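
The "safe" behavior described in the return value can be sketched with java.net.URLEncoder, whose documentation this method's description mirrors. Whether the library delegates to URLEncoder is an assumption, and UrlEncodeSketch is a hypothetical name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class UrlEncodeSketch {

    /** Encodes text as application/x-www-form-urlencoded; returns the input unchanged on error. */
    static String urlEncodeSafe(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            // Unknown charset: fall back to the original string instead of failing.
            return text;
        }
    }
}
```

For example, encoding "a b&c" with UTF-8 yields "a+b%26c", while an unsupported charset name returns the text as passed in.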