Documentation

Readability in package EasyBlog

Arc90's Readability ported to PHP for FiveFilters.org.
Based on readability.js version 1.7.1 (without multi-page support).

Original URL: http://lab.arc90.com/experiments/readability/js/readability.js
Arc90's project URL: http://lab.arc90.com/experiments/readability/
JS Source: http://code.google.com/p/arc90labs-readability
Ported by: Keyvan Minoukadeh, http://www.keyvan.net
Modded by: Dither, https://dithersky.wordpress.com
More information: http://fivefilters.org/content-only/
License: Apache License, Version 2.0
Requires: PHP version 5.2.0+
Date: 2013-08-02

Differences between the PHP port and the original

Arc90's Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page's CSS styles have been applied and its JavaScript code has executed. This PHP port does not run inside a browser. We use PHP's ability to parse HTML to build our DOM tree, but we cannot rely on CSS or JavaScript support. As such, the results will not always match Arc90's Readability. (For example, if a web page contains CSS style rules or JavaScript code that hide certain HTML elements from display, Arc90's Readability will dismiss those elements from consideration, but our PHP port, unable to understand CSS or JavaScript, will not know any better.)
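To make the limitation concrete, the following minimal sketch parses a page with PHP's DOM extension alone. It does not use the Readability class; the markup and the class name `x1` are made up for illustration. A paragraph hidden by a stylesheet rule is still present in the parsed tree.

```php
<?php
// Hypothetical input: the second paragraph is hidden by a CSS rule.
$html = '<html><head><style>.x1 { display: none; }</style></head>'
      . '<body><p>Visible text.</p><p class="x1">Hidden text.</p></body></html>';

$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$paragraphs = array();
foreach ($dom->getElementsByTagName('p') as $p) {
    $paragraphs[] = $p->textContent;
}

// Both paragraphs are listed: the parser has no notion of CSS visibility.
print_r($paragraphs);
```

A browser would render only the first paragraph, which is why results can differ from the original JavaScript version.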

Another significant difference is that Arc90's Readability aims to re-present the main content block of a given web page so users can read it more easily in their browsers. Correctly identifying, cleaning up, and separating the content block is only part of that process. This PHP port is concerned only with that part; it does not include code relating to presentation in the browser. Arc90 already does that extremely well, and for PDF output there's FiveFilters.org's PDF Newspaper: http://fivefilters.org/pdf-newspaper/.

Finally, this class contains methods that might be useful for developers working on HTML document fragments. So without deviating too much from the original code (which I don't want to do because it makes debugging and updating more difficult), I've tried to make it a little more developer-friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.

Constants

FLAG_CLEAN_CONDITIONALLY = 4
FLAG_DISABLE_POSTFILTER = 16
FLAG_DISABLE_PREFILTER = 8
FLAG_STRIP_UNLIKELYS = 1
FLAG_WEIGHT_ATTRIBUTES = 2
GRANDPARENT_SCORE_DIVISOR = 2.2
MAX_LINK_DENSITY = 0.25
MIN_ARTICLE_LENGTH = 200
MIN_COMMAS_IN_PARAGRAPH = 6
MIN_NODE_LENGTH = 80
MIN_PARAGRAPH_LENGTH = 20
SCORE_CHARS_IN_PARAGRAPH = 100
SCORE_WORDS_IN_PARAGRAPH = 20

Properties

$articleContent : mixed
$articleTitle : mixed
$convertLinksToFootnotes : mixed
$debug : mixed
$dom : mixed
$lightClean : mixed
$original_html : mixed
$regexps : mixed: All of the regular expressions in use within readability.
$revertForcedParagraphElements : mixed
$tidied : mixed
$tidy_config : mixed
$url : mixed
$body : mixed
$bodyCache : mixed
$domainRegExp : mixed
$flags : mixed
$html : mixed
$logger : mixed
$parser : mixed
$post_filters : mixed
$pre_filters : mixed
$success : mixed
$useTidy : mixed

Methods

__construct() : mixed: Create instance of Readability.
addFlag() : mixed: Add a flag.
addFootnotes() : mixed: For easier reading, convert this document to have footnotes at the bottom rather than inline links.
addPostFilter() : mixed: Add post filter for raw output HTML processing.
addPreFilter() : mixed: Add pre filter for raw input HTML processing.
clean() : mixed: Clean a node of all elements of type "tag".
cleanConditionally() : mixed: Clean an element of all tags of type "tag" if they look fishy.
cleanHeaders() : mixed: Clean out spurious headers from an Element. Checks things like classnames and link density.
cleanStyles() : mixed: Remove the style attribute from $e and every element under it.
flagIsActive() : bool: Check if the given flag is active.
getCommaCount() : int: Get the number of commas in a given text.
getContent() : DOMElement: Get article content element.
getInnerText() : string: Get the inner text of a node.
getLinkDensity() : int: Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.
getTitle() : DOMElement: Get article title element.
getWeight() : int: Get an element's relative weight.
getWordCount() : int: Get the number of words in a given text, assuming words are separated by a single space.
init() : bool: Runs readability.
killBreaks() : mixed: Remove extraneous break tags from a node.
postProcessContent() : mixed: Run any post-process modifications to article content as necessary.
prepArticle() : mixed: Prepare the article node for display. Clean out any inline styles, iframes, forms, strip extraneous <p> tags, etc.
removeFlag() : mixed: Remove a flag.
dump_dbg() : mixed: Dump debug info.
getArticleTitle() : DOMElement: Get the article title as an H1.
grabArticle() : DOMElement|bool: Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.
initializeNode() : mixed: Initialize a node with the readability object. Also checks the className/id for special names to add to its score.
prepDocument() : mixed: Prepare the HTML document for readability to scrape it.
reinitBody() : mixed: Will recreate previously deleted body property.
weightAttribute() : int: Get an element weight by attribute.
loadHtml() : mixed: Load HTML in a DOMDocument.

FLAG_CLEAN_CONDITIONALLY


    public
        mixed
    FLAG_CLEAN_CONDITIONALLY
    = 4

FLAG_DISABLE_POSTFILTER


    public
        mixed
    FLAG_DISABLE_POSTFILTER
    = 16

FLAG_DISABLE_PREFILTER


    public
        mixed
    FLAG_DISABLE_PREFILTER
    = 8

FLAG_STRIP_UNLIKELYS


    public
        mixed
    FLAG_STRIP_UNLIKELYS
    = 1

FLAG_WEIGHT_ATTRIBUTES


    public
        mixed
    FLAG_WEIGHT_ATTRIBUTES
    = 2

GRANDPARENT_SCORE_DIVISOR


    public
        mixed
    GRANDPARENT_SCORE_DIVISOR
    = 2.2

MAX_LINK_DENSITY


    public
        mixed
    MAX_LINK_DENSITY
    = 0.25

MIN_ARTICLE_LENGTH


    public
        mixed
    MIN_ARTICLE_LENGTH
    = 200

MIN_COMMAS_IN_PARAGRAPH


    public
        mixed
    MIN_COMMAS_IN_PARAGRAPH
    = 6

MIN_NODE_LENGTH


    public
        mixed
    MIN_NODE_LENGTH
    = 80

MIN_PARAGRAPH_LENGTH


    public
        mixed
    MIN_PARAGRAPH_LENGTH
    = 20

SCORE_CHARS_IN_PARAGRAPH


    public
        mixed
    SCORE_CHARS_IN_PARAGRAPH
    = 100

SCORE_WORDS_IN_PARAGRAPH


    public
        mixed
    SCORE_WORDS_IN_PARAGRAPH
    = 20

$articleContent


    public
        mixed
    $articleContent

$articleTitle


    public
        mixed
    $articleTitle

$convertLinksToFootnotes


    public
        mixed
    $convertLinksToFootnotes
     = \false

$debug


    public
        mixed
    $debug
     = \false

$dom


    public
        mixed
    $dom

$lightClean


    public
        mixed
    $lightClean
     = \true

$original_html


    public
        mixed
    $original_html

$regexps

All of the regular expressions in use within readability.


    public
        mixed
    $regexps
     = array('unlikelyCandidates' => '/display\s*:\s*none|ignore|\binfos?\b|annoy|clock|date|time|author|intro|hidd?e|about|archive|\bprint|bookmark|tags|tag-list|share|search|social|robot|published|combx|comment|mast(?:head)|subscri|community|category|disqus|extra|head|head(?:er|note)|floor|foot(?:er|note)|menu|tool\b|function|nav|remark|rss|shoutbox|widget|meta|banner|sponsor|adsense|inner-?ad|ad-|sponsor|\badv\b|\bads\b|agr?egate?|pager|sidebar|popup|tweet|twitter|eb-rating|eb-reaction/i', 'okMaybeItsACandidate' => '/article\b|contain|\bcontent|column|general|detail|shadow|lightbox|blog|body|entry|main|page|eb-post-image/i', 'positive' => '/read|full|article|body|\bcontent|contain|entry|main|markdown|page|attach|pagination|post|text|blog|story/i', 'negative' => '/bottom|stat|info|discuss|e[\-]?mail|comment|reply|log.{2}(n|ed)|sign|single|combx|com-|contact|_nav|link|media|\bout|promo|\bad-|related|scroll|shoutbox|sidebar|sponsor|shopping|teaser|recommend|eb-rating|eb-reaction/i', 'divToPElements' => '/<(?:blockquote|header|section|code|div|article|footer|aside|p|pre|dl|ol|ul)/mi', 'killBreaks' => '/(<br\s*\/?>([ \r\n\s]|&nbsp;?)*)+/', 'media' => '!//(?:[^\.\?/]+\.)?(?:youtu(?:be)?|soundcloud|dailymotion|vimeo|pornhub|xvideos|twitvid|rutube|viddler)\.(?:com|be|org|net)/!i', 'skipFootnoteLink' => '/^\s*(\[?[a-z0-9]{1,2}\]?|^|edit|citation needed)\s*$/i')

Defined up here so we don't instantiate them repeatedly in loops.

$revertForcedParagraphElements


    public
        mixed
    $revertForcedParagraphElements
     = \true

$tidied


    public
        mixed
    $tidied
     = \false

$tidy_config


    public
        mixed
    $tidy_config
     = array(
    'tidy-mark' => \false,
    'vertical-space' => \false,
    'doctype' => 'omit',
    'numeric-entities' => \false,
    // 'preserve-entities' => true,
    'break-before-br' => \false,
    'clean' => \true,
    'output-xhtml' => \true,
    'logical-emphasis' => \true,
    'show-body-only' => \false,
    'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog figcaption figure footer header hgroup main menu menuitem nav section source summary template track video',
    'new-empty-tags' => 'command embed keygen source track wbr',
    'new-inline-tags' => 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr',
    'wrap' => 0,
    'drop-empty-paras' => \true,
    'drop-proprietary-attributes' => \false,
    'enclose-text' => \true,
    'enclose-block-text' => \true,
    'merge-divs' => \true,
    // 'merge-spans' => true,
    'input-encoding' => '????',
    'output-encoding' => 'utf8',
    'hide-comments' => \true,
)

$url


    public
        mixed
    $url
     = \null

$body


    protected
        mixed
    $body
     = \null

$bodyCache


    protected
        mixed
    $bodyCache
     = \null

$domainRegExp


    protected
        mixed
    $domainRegExp
     = \null

$flags


    protected
        mixed
    $flags
     = 7

$html


    protected
        mixed
    $html

$logger


    protected
        mixed
    $logger

$parser


    protected
        mixed
    $parser

$post_filters


    protected
        mixed
    $post_filters
     = array(
    // replace excessive br's
    '/<br\s*\/?>\s*<p/i' => '<p',
    // replace empty tags that break layouts
    '!<(?:a|div|p)[^>]+/>!is' => '',
    // remove all attributes on text tags
    //'!<(\s*/?\s*(?:blockquote|br|hr|code|div|article|span|footer|aside|p|pre|dl|li|ul|ol)) [^>]+>!is' => "<\\1>",
    //single newlines cleanup
    "/\n+/" => "\n",
    // modern web...
    '!<pre[^>]*>\s*<code!is' => '<pre',
    '!</code>\s*</pre>!is' => '</pre>',
    '!<[hb]r>!is' => '<\1 />',
)

$pre_filters


    protected
        mixed
    $pre_filters
     = array(
    // remove obvious scripts
    '!<script[^>]*>(.*?)</script>!is' => '',
    // remove obvious styles
    '!<style[^>]*>(.*?)</style>!is' => '',
    // remove spans as we redefine styles and they're probably special-styled
    '!</?span[^>]*>!is' => '',
    // HACK: firewall-filtered content
    '!<font[^>]*>\s*\[AD\]\s*</font>!is' => '',
    // HACK: replace linebreaks plus br's with p's
    '!(<br[^>]*>[ \r\n\s]*){2,}!i' => '</p><p>',
    // replace noscripts
    //'!</?noscript>!is' => '',
    // replace fonts to spans
    '!<(/?)font[^>]*>!is' => '<\1span>',
)

$success


    protected
        mixed
    $success
     = \false

$useTidy


    protected
        mixed
    $useTidy

__construct()

Create instance of Readability.


    public
                    __construct(mixed $html[, mixed $url = null ][, mixed $parser = 'libxml' ][, mixed $use_tidy = true ]) : mixed

Parameters

$html : mixed
$url : mixed = null
$parser : mixed = 'libxml'
$use_tidy : mixed = true

addFlag()

Add a flag.


    public
                    addFlag(int $flag) : mixed

Parameters

$flag : int

addFootnotes()

For easier reading, convert this document to have footnotes at the bottom rather than inline links.


    public
                    addFootnotes(DOMElement $articleContent) : mixed

Parameters

$articleContent : DOMElement

addPostFilter()

Add post filter for raw output HTML processing.


    public
                    addPostFilter(mixed $filter[, mixed $replacer = '' ]) : mixed

Parameters

$filter : mixed
$replacer : mixed = ''

addPreFilter()

Add pre filter for raw input HTML processing.


    public
                    addPreFilter(mixed $filter[, mixed $replacer = '' ]) : mixed

Parameters

$filter : mixed
$replacer : mixed = ''
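As a rough illustration of what a filter does, the sketch below applies pattern/replacement pairs to raw HTML with preg_replace(). It does not call the class. The first two patterns are copied from the documented $pre_filters default; the comment-stripping pattern is a hypothetical custom filter of the kind that could be passed as $filter with the default empty $replacer. How the class applies its filters internally is an assumption here.

```php
<?php
// Two patterns copied from the documented $pre_filters default.
$filters = array(
    '!<script[^>]*>(.*?)</script>!is' => '',
    '!<(/?)font[^>]*>!is' => '<\1span>',
);

// Hypothetical custom filter: drop HTML comments (empty replacement).
$filters['/<!--.*?-->/s'] = '';

$html = '<p><font color="red">Hi</font><script>alert(1)</script><!-- note --></p>';

// preg_replace() applies each pattern in order to the result of the previous one.
$clean = preg_replace(array_keys($filters), array_values($filters), $html);

echo $clean; // <p><span>Hi</span></p>
```

Filters work on the HTML string, not on the DOM, so they suit markup-level fixes better than structural changes.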

clean()

Clean a node of all elements of type "tag".


    public
                    clean(DOMElement $e, string $tag) : mixed

(Unless it's a YouTube/Vimeo video. People love movies.)

Updated 2012-09-18 to preserve YouTube/Vimeo iframes.

Parameters

$e : DOMElement
$tag : string

cleanConditionally()

Clean an element of all tags of type "tag" if they look fishy.


    public
                    cleanConditionally(DOMElement $e, string $tag) : mixed

"Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.

Parameters

$e : DOMElement
$tag : string

cleanHeaders()

Clean out spurious headers from an Element. Checks things like classnames and link density.


    public
                    cleanHeaders(DOMElement $e) : mixed

Parameters

$e : DOMElement

cleanStyles()

Remove the style attribute from $e and every element under it.


    public
                    cleanStyles(DOMElement $e) : mixed

Parameters

$e : DOMElement
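The effect described above can be sketched with the DOM extension alone. The function name cleanStylesSketch is hypothetical and the class's own traversal may differ; this only shows the outcome on a small fragment.

```php
<?php
// Sketch: strip the style attribute from an element and all its descendants.
function cleanStylesSketch(DOMElement $e)
{
    $e->removeAttribute('style');
    // '*' matches every descendant element of $e.
    foreach ($e->getElementsByTagName('*') as $child) {
        $child->removeAttribute('style');
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<div style="color:red"><p style="margin:0">Text <b style="font-size:2em">bold</b></p></div>');
$div = $dom->getElementsByTagName('div')->item(0);

cleanStylesSketch($div);

// Serialize only the cleaned element.
$output = $dom->saveHTML($div);
echo $output;
```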

flagIsActive()

Check if the given flag is active.


    public
                    flagIsActive(int $flag) : bool

Parameters

$flag : int

Return values

bool
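The flag constants are powers of two, so they can be combined in a single integer. The sketch below shows the bitwise arithmetic that addFlag(), removeFlag(), and flagIsActive() suggest, using the documented constant values and the documented default of 7 for $flags. It is written with plain variables rather than the class, and the exact internal implementation is an assumption.

```php
<?php
// Values copied from the Constants section.
const FLAG_STRIP_UNLIKELYS = 1;
const FLAG_WEIGHT_ATTRIBUTES = 2;
const FLAG_CLEAN_CONDITIONALLY = 4;
const FLAG_DISABLE_PREFILTER = 8;

// Documented default: 1 | 2 | 4 = 7.
$flags = 7;

// Adding a flag sets its bit.
$flags = $flags | FLAG_DISABLE_PREFILTER;   // 15

// Removing a flag clears its bit.
$flags = $flags & ~FLAG_STRIP_UNLIKELYS;    // 14

// A flag is active when its bit is set.
$stripActive = ($flags & FLAG_STRIP_UNLIKELYS) > 0;     // false
$cleanActive = ($flags & FLAG_CLEAN_CONDITIONALLY) > 0; // true
```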

getCommaCount()

Get the number of commas in a given text.


    public
                    getCommaCount(string $text) : int

Parameters

$text : string

Return values

int

getContent()

Get article content element.


    public
                    getContent() : DOMElement

Return values

DOMElement

getInnerText()

Get the inner text of a node.


    public
                    getInnerText(DOMElement $e[, bool $normalizeSpaces = true ][, bool $flattenLines = false ]) : string

This also strips out any excess whitespace to be found.

Parameters

$e : DOMElement
$normalizeSpaces : bool = true
$flattenLines : bool = false

Return values

string
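A minimal sketch of the behaviour described above, written as a standalone function. The name innerTextSketch and both regular expressions are assumptions; the class may normalize whitespace differently.

```php
<?php
// Sketch: return trimmed text content, optionally collapsing whitespace.
function innerTextSketch(DOMElement $e, $normalizeSpaces = true, $flattenLines = false)
{
    $text = trim($e->textContent);
    if ($flattenLines) {
        // Assumed behaviour: join lines by replacing line breaks with a space.
        $text = preg_replace('/[\r\n]+/', ' ', $text);
    }
    if ($normalizeSpaces) {
        // Collapse any run of two or more whitespace characters to one space.
        $text = preg_replace('/\s{2,}/', ' ', $text);
    }
    return $text;
}

$dom = new DOMDocument();
$dom->loadHTML("<p>  Hello   world\n  again </p>");
$p = $dom->getElementsByTagName('p')->item(0);

$inner = innerTextSketch($p);
echo $inner; // Hello world again
```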

getLinkDensity()

Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.


    public
                    getLinkDensity(DOMElement $e[, string $excludeExternal = false ]) : int

Can exclude external references to differentiate between simple text and menus/infoblocks.

Parameters

$e : DOMElement
$excludeExternal : string = false

Return values

int
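The ratio described above can be sketched as follows. The function name linkDensitySketch is hypothetical, the $excludeExternal option is not modelled, and the sketch returns a float because the ratio is fractional. With the sample markup the result equals the documented MAX_LINK_DENSITY of 0.25.

```php
<?php
// Sketch: length of text inside links divided by total text length of the node.
function linkDensitySketch(DOMElement $e)
{
    $textLength = strlen(trim($e->textContent));
    if ($textLength === 0) {
        return 0.0; // avoid division by zero on empty nodes
    }
    $linkLength = 0;
    foreach ($e->getElementsByTagName('a') as $link) {
        $linkLength += strlen(trim($link->textContent));
    }
    return $linkLength / $textLength;
}

$dom = new DOMDocument();
$dom->loadHTML('<div><a href="#">abcd</a>efghijklmnop</div>');
$node = $dom->getElementsByTagName('div')->item(0);

// 4 characters of link text out of 16 characters in total.
$density = linkDensitySketch($node); // 0.25
```

A navigation menu consists almost entirely of link text and scores close to 1, while ordinary article paragraphs score close to 0.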

getTitle()

Get article title element.


    public
                    getTitle() : DOMElement

Return values

DOMElement

getWeight()

Get an element's relative weight.


    public
                    getWeight(DOMElement $e) : int

Parameters

$e : DOMElement

Return values

int

getWordCount()

Get the number of words in a given text, assuming words are separated by a single space.


    public
                    getWordCount(string $text) : int

Input string should be normalized.

Parameters

$text : string

Return values

int
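The requirement that the input be normalized is easiest to see with a small sketch. The two functions below are hypothetical stand-ins for getCommaCount() and getWordCount(), built on substr_count(); the class's actual counting logic is an assumption.

```php
<?php
// Sketch: count commas in a string.
function commaCountSketch($text)
{
    return substr_count($text, ',');
}

// Sketch: count words by counting single-space separators.
function wordCountSketch($text)
{
    $text = trim($text);
    if ($text === '') {
        return 0;
    }
    return substr_count($text, ' ') + 1;
}

$commas = commaCountSketch('One, two, three words here'); // 2
$words  = wordCountSketch('One, two, three words here');  // 5

// Un-normalized input inflates the count: two spaces look like an extra word.
$inflated = wordCountSketch('a  b'); // 3
```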

init()

Runs readability.


    public
                    init() : bool

Workflow:

1. Prep the document by removing script tags, CSS, etc.
2. Build readability's DOM tree.
3. Grab the article content from the current DOM tree.
4. Replace the current DOM tree with the new one.
5. Read peacefully.

Return values

bool: true if we found content, false otherwise

killBreaks()

Remove extraneous break tags from a node.


    public
                    killBreaks(DOMElement $node) : mixed

Parameters

$node : DOMElement
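The documented killBreaks pattern in $regexps matches runs of break tags together with the whitespace and non-breaking spaces between them. The sketch below applies that pattern to a plain string; replacing each run with a single `<br />` is an assumption about what the method does with a match.

```php
<?php
// Pattern copied from the 'killBreaks' entry of the documented $regexps.
$killBreaks = '/(<br\s*\/?>([ \r\n\s]|&nbsp;?)*)+/';

// Hypothetical input with three consecutive breaks in different spellings.
$html = "Line one<br><br />\n&nbsp;<br/>Line two";

// Collapse the whole run into one break tag.
$result = preg_replace($killBreaks, '<br />', $html);

echo $result; // Line one<br />Line two
```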

postProcessContent()

Run any post-process modifications to article content as necessary.


    public
                    postProcessContent(DOMElement $articleContent) : mixed

Parameters

$articleContent : DOMElement

prepArticle()

Prepare the article node for display. Clean out any inline styles, iframes, forms, strip extraneous <p> tags, etc.


    public
                    prepArticle(DOMElement $articleContent) : mixed

Parameters

$articleContent : DOMElement

removeFlag()

Remove a flag.


    public
                    removeFlag(int $flag) : mixed

Parameters

$flag : int

dump_dbg()

Dump debug info.


    protected
                    dump_dbg() : mixed

Since Monolog gathers the log, this method is no longer needed.

getArticleTitle()

Get the article title as an H1.


    protected
                    getArticleTitle() : DOMElement

Return values

DOMElement

grabArticle()

Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.


    protected
                    grabArticle([DOMElement $page = null ]) : DOMElement|bool

Parameters

$page : DOMElement = null

Return values

DOMElement|bool

initializeNode()

Initialize a node with the readability object. Also checks the className/id for special names to add to its score.


    protected
                    initializeNode(DOMElement $node) : mixed

Parameters

$node : DOMElement

prepDocument()

Prepare the HTML document for readability to scrape it.


    protected
                    prepDocument() : mixed

This includes things like stripping javascript, CSS, and handling terrible markup.

reinitBody()

Will recreate previously deleted body property.


    protected
                    reinitBody() : mixed

weightAttribute()

Get an element's weight by attribute.


    protected
                    weightAttribute(DOMElement $element, string $attribute) : int

Uses regular expressions to tell if this element looks good or bad.

Parameters

$element : DOMElement
$attribute : string

Return values

int
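A minimal sketch of regex-based weighting, applied to an attribute value such as a class name or id. The two patterns are abbreviated subsets of the documented 'positive' and 'negative' entries in $regexps. The function name and the step of 25 points are assumptions for illustration, not values confirmed by this page.

```php
<?php
// Sketch: score an attribute value (for example a class name or id).
function weightAttributeSketch($value)
{
    // Abbreviated subsets of the documented 'positive' and 'negative' patterns.
    $positive = '/article|body|\bcontent|entry|main|post|text|story/i';
    $negative = '/comment|sidebar|sponsor|promo|related/i';

    $weight = 0;
    if (preg_match($negative, $value)) {
        $weight -= 25; // assumed penalty
    }
    if (preg_match($positive, $value)) {
        $weight += 25; // assumed bonus
    }
    return $weight;
}

$good  = weightAttributeSketch('article-body'); // 25
$bad   = weightAttributeSketch('sidebar');      // -25
$mixed = weightAttributeSketch('post-comment'); // 0, matches both patterns
```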

loadHtml()

Load HTML in a DOMDocument.


    private
                    loadHtml() : mixed

Applies the pre-filters, then cleans up the HTML using Tidy (if enabled).
