Readability
in package
Arc90's Readability ported to PHP for FiveFilters.org. Based on readability.js version 1.7.1 (without multi-page support).
- Original URL: http://lab.arc90.com/experiments/readability/js/readability.js
- Arc90's project URL: http://lab.arc90.com/experiments/readability/
- JS Source: http://code.google.com/p/arc90labs-readability
- Ported by: Keyvan Minoukadeh, http://www.keyvan.net
- Modded by: Dither, https://dithersky.wordpress.com
- More information: http://fivefilters.org/content-only/
- License: Apache License, Version 2.0
- Requires: PHP version 5.2.0+
- Date: 2013-08-02
Differences between the PHP port and the original
Arc90's Readability is designed to run in the browser. It works on the DOM tree (the parsed HTML) after the page's CSS styles have been applied and its JavaScript code executed. This PHP port does not run inside a browser. We use PHP's ability to parse HTML to build our DOM tree, but we cannot rely on CSS or JavaScript support. As such, the results will not always match Arc90's Readability. (For example, if a web page contains CSS style rules or JavaScript code that hides certain HTML elements from display, Arc90's Readability will dismiss those elements from consideration, but our PHP port, unable to understand CSS or JavaScript, will not know any better.)
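To make the limitation concrete, here is a minimal sketch using PHP's built-in DOM extension directly (it does not use the Readability class, and the sample markup is invented for illustration). The paragraph that a browser would hide through a stylesheet rule is still present in the parsed tree.

```php
<?php
// Sketch: PHP's DOM parser builds a tree from markup only; CSS is never applied.
$html = '<html><head><style>.promo { display: none; }</style></head>'
      . '<body><p>Visible text.</p>'
      . '<p class="promo">Hidden by CSS in a browser.</p></body></html>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Collect the text of every <p> element found in the parsed tree.
$paragraphs = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $paragraphs[] = $p->textContent;
}

// Both paragraphs are in the tree, including the one a browser would hide.
echo count($paragraphs), "\n";
```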
Another significant difference is that Arc90's Readability aims to re-present the main content block of a given web page so users can read it more easily in their browsers. Correct identification, clean-up, and separation of the content block is only one part of that process. This PHP port is concerned only with that part; it does not include code that relates to presentation in the browser. Arc90 already do that extremely well, and for PDF output there's FiveFilters.org's PDF Newspaper: http://fivefilters.org/pdf-newspaper/.
Finally, this class contains methods that might be useful for developers working on HTML document fragments. So without deviating too much from the original code (which I don't want to do because it makes debugging and updating more difficult), I've tried to make it a little more developer friendly. You should be able to use the methods here on existing DOMElement objects without passing an entire HTML document to be parsed.
Table of Contents
Constants
- FLAG_CLEAN_CONDITIONALLY = 4
- FLAG_DISABLE_POSTFILTER = 16
- FLAG_DISABLE_PREFILTER = 8
- FLAG_STRIP_UNLIKELYS = 1
- FLAG_WEIGHT_ATTRIBUTES = 2
- GRANDPARENT_SCORE_DIVISOR = 2.2
- MAX_LINK_DENSITY = 0.25
- MIN_ARTICLE_LENGTH = 200
- MIN_COMMAS_IN_PARAGRAPH = 6
- MIN_NODE_LENGTH = 80
- MIN_PARAGRAPH_LENGTH = 20
- SCORE_CHARS_IN_PARAGRAPH = 100
- SCORE_WORDS_IN_PARAGRAPH = 20
Properties
- $articleContent : mixed
- $articleTitle : mixed
- $convertLinksToFootnotes : mixed
- $debug : mixed
- $dom : mixed
- $lightClean : mixed
- $original_html : mixed
- $regexps : mixed
- All of the regular expressions in use within readability.
- $revertForcedParagraphElements : mixed
- $tidied : mixed
- $tidy_config : mixed
- $url : mixed
- $body : mixed
- $bodyCache : mixed
- $domainRegExp : mixed
- $flags : mixed
- $html : mixed
- $logger : mixed
- $parser : mixed
- $post_filters : mixed
- $pre_filters : mixed
- $success : mixed
- $useTidy : mixed
Methods
- __construct() : mixed
- Create instance of Readability.
- addFlag() : mixed
- Add a flag.
- addFootnotes() : mixed
- For easier reading, convert this document to have footnotes at the bottom rather than inline links.
- addPostFilter() : mixed
- Add post filter for raw output HTML processing.
- addPreFilter() : mixed
- Add pre filter for raw input HTML processing.
- clean() : mixed
- Clean a node of all elements of type "tag".
- cleanConditionally() : mixed
- Clean an element of all tags of type "tag" if they look fishy.
- cleanHeaders() : mixed
- Clean out spurious headers from an Element. Checks things like classnames and link density.
- cleanStyles() : mixed
- Remove the style attribute on every $e and under.
- flagIsActive() : bool
- Check if the given flag is active.
- getCommaCount() : int
- Get the number of commas in a given text.
- getContent() : DOMElement
- Get article content element.
- getInnerText() : string
- Get the inner text of a node.
- getLinkDensity() : int
- Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.
- getTitle() : DOMElement
- Get article title element.
- getWeight() : int
- Get an element's relative weight.
- getWordCount() : int
- Get the number of words in a given text, assuming words are separated by a space.
- init() : bool
- Runs readability.
- killBreaks() : mixed
- Remove extraneous break tags from a node.
- postProcessContent() : mixed
- Run any post-process modifications to article content as necessary.
- prepArticle() : mixed
- Prepare the article node for display. Clean out any inline styles, iframes, forms, strip extraneous <p> tags, etc.
- removeFlag() : mixed
- Remove a flag.
- dump_dbg() : mixed
- Dump debug info.
- getArticleTitle() : DOMElement
- Get the article title as an H1.
- grabArticle() : DOMElement|bool
- Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.
- initializeNode() : mixed
- Initialize a node with the readability object. Also checks the className/id for special names to add to its score.
- prepDocument() : mixed
- Prepare the HTML document for readability to scrape it.
- reinitBody() : mixed
- Recreate the previously deleted body property.
- weightAttribute() : int
- Get an element's weight by attribute.
- loadHtml() : mixed
- Load HTML in a DOMDocument.
Constants
FLAG_CLEAN_CONDITIONALLY
public
mixed
FLAG_CLEAN_CONDITIONALLY
= 4
FLAG_DISABLE_POSTFILTER
public
mixed
FLAG_DISABLE_POSTFILTER
= 16
FLAG_DISABLE_PREFILTER
public
mixed
FLAG_DISABLE_PREFILTER
= 8
FLAG_STRIP_UNLIKELYS
public
mixed
FLAG_STRIP_UNLIKELYS
= 1
FLAG_WEIGHT_ATTRIBUTES
public
mixed
FLAG_WEIGHT_ATTRIBUTES
= 2
GRANDPARENT_SCORE_DIVISOR
public
mixed
GRANDPARENT_SCORE_DIVISOR
= 2.2
MAX_LINK_DENSITY
public
mixed
MAX_LINK_DENSITY
= 0.25
MIN_ARTICLE_LENGTH
public
mixed
MIN_ARTICLE_LENGTH
= 200
MIN_COMMAS_IN_PARAGRAPH
public
mixed
MIN_COMMAS_IN_PARAGRAPH
= 6
MIN_NODE_LENGTH
public
mixed
MIN_NODE_LENGTH
= 80
MIN_PARAGRAPH_LENGTH
public
mixed
MIN_PARAGRAPH_LENGTH
= 20
SCORE_CHARS_IN_PARAGRAPH
public
mixed
SCORE_CHARS_IN_PARAGRAPH
= 100
SCORE_WORDS_IN_PARAGRAPH
public
mixed
SCORE_WORDS_IN_PARAGRAPH
= 20
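The five FLAG_* constants are powers of two, which suggests they are combined as a bit mask; the documented default of the protected $flags property, 7, equals the first three flags OR-ed together. The sketch below illustrates that reading with standalone helper functions. The `sketch*` names are hypothetical and are not the class's own addFlag(), removeFlag(), or flagIsActive() methods, whose bodies are not shown on this page. It assumes PHP 7.0 or later for the type declarations.

```php
<?php
// Flag values copied from the constants listed above.
const FLAG_STRIP_UNLIKELYS     = 1;
const FLAG_WEIGHT_ATTRIBUTES   = 2;
const FLAG_CLEAN_CONDITIONALLY = 4;
const FLAG_DISABLE_PREFILTER   = 8;
const FLAG_DISABLE_POSTFILTER  = 16;

// 1 | 2 | 4 = 7, matching the documented default of $flags.
$flags = FLAG_STRIP_UNLIKELYS | FLAG_WEIGHT_ATTRIBUTES | FLAG_CLEAN_CONDITIONALLY;

// Hypothetical helpers showing the usual bit-mask operations.
function sketchFlagIsActive(int $flags, int $flag): bool
{
    return ($flags & $flag) > 0;   // bit is set
}

function sketchAddFlag(int $flags, int $flag): int
{
    return $flags | $flag;         // set the bit
}

function sketchRemoveFlag(int $flags, int $flag): int
{
    return $flags & ~$flag;        // clear the bit
}

$withoutPrefilter = sketchAddFlag($flags, FLAG_DISABLE_PREFILTER);
$noStripping      = sketchRemoveFlag($flags, FLAG_STRIP_UNLIKELYS);
```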
Properties
$articleContent
public
mixed
$articleContent
$articleTitle
public
mixed
$articleTitle
$convertLinksToFootnotes
public
mixed
$convertLinksToFootnotes
= \false
$debug
public
mixed
$debug
= \false
$dom
public
mixed
$dom
$lightClean
public
mixed
$lightClean
= \true
$original_html
public
mixed
$original_html
$regexps
All of the regular expressions in use within readability.
public
mixed
$regexps
= array('unlikelyCandidates' => '/display\s*:\s*none|ignore|\binfos?\b|annoy|clock|date|time|author|intro|hidd?e|about|archive|\bprint|bookmark|tags|tag-list|share|search|social|robot|published|combx|comment|mast(?:head)|subscri|community|category|disqus|extra|head|head(?:er|note)|floor|foot(?:er|note)|menu|tool\b|function|nav|remark|rss|shoutbox|widget|meta|banner|sponsor|adsense|inner-?ad|ad-|sponsor|\badv\b|\bads\b|agr?egate?|pager|sidebar|popup|tweet|twitter|eb-rating|eb-reaction/i', 'okMaybeItsACandidate' => '/article\b|contain|\bcontent|column|general|detail|shadow|lightbox|blog|body|entry|main|page|eb-post-image/i', 'positive' => '/read|full|article|body|\bcontent|contain|entry|main|markdown|page|attach|pagination|post|text|blog|story/i', 'negative' => '/bottom|stat|info|discuss|e[\-]?mail|comment|reply|log.{2}(n|ed)|sign|single|combx|com-|contact|_nav|link|media|\bout|promo|\bad-|related|scroll|shoutbox|sidebar|sponsor|shopping|teaser|recommend|eb-rating|eb-reaction/i', 'divToPElements' => '/<(?:blockquote|header|section|code|div|article|footer|aside|p|pre|dl|ol|ul)/mi', 'killBreaks' => '/(<br\s*\/?>([ \r\n\s]| ?)*)+/', 'media' => '!//(?:[^\.\?/]+\.)?(?:youtu(?:be)?|soundcloud|dailymotion|vimeo|pornhub|xvideos|twitvid|rutube|viddler)\.(?:com|be|org|net)/!i', 'skipFootnoteLink' => '/^\s*(\[?[a-z0-9]{1,2}\]?|^|edit|citation needed)\s*$/i')
Defined up here so we don't instantiate them repeatedly in loops.
$revertForcedParagraphElements
public
mixed
$revertForcedParagraphElements
= \true
$tidied
public
mixed
$tidied
= \false
$tidy_config
public
mixed
$tidy_config
= array(
'tidy-mark' => \false,
'vertical-space' => \false,
'doctype' => 'omit',
'numeric-entities' => \false,
// 'preserve-entities' => true,
'break-before-br' => \false,
'clean' => \true,
'output-xhtml' => \true,
'logical-emphasis' => \true,
'show-body-only' => \false,
'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog figcaption figure footer header hgroup main menu menuitem nav section source summary template track video',
'new-empty-tags' => 'command embed keygen source track wbr',
'new-inline-tags' => 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr',
'wrap' => 0,
'drop-empty-paras' => \true,
'drop-proprietary-attributes' => \false,
'enclose-text' => \true,
'enclose-block-text' => \true,
'merge-divs' => \true,
// 'merge-spans' => true,
'input-encoding' => '????',
'output-encoding' => 'utf8',
'hide-comments' => \true,
)
$url
public
mixed
$url
= \null
$body
protected
mixed
$body
= \null
$bodyCache
protected
mixed
$bodyCache
= \null
$domainRegExp
protected
mixed
$domainRegExp
= \null
$flags
protected
mixed
$flags
= 7
$html
protected
mixed
$html
$logger
protected
mixed
$logger
$parser
protected
mixed
$parser
$post_filters
protected
mixed
$post_filters
= array(
// replace excessive br's
'/<br\s*\/?>\s*<p/i' => '<p',
// replace empty tags that break layouts
'!<(?:a|div|p)[^>]+/>!is' => '',
// remove all attributes on text tags
//'!<(\s*/?\s*(?:blockquote|br|hr|code|div|article|span|footer|aside|p|pre|dl|li|ul|ol)) [^>]+>!is' => "<\\1>",
//single newlines cleanup
"/\n+/" => "\n",
// modern web...
'!<pre[^>]*>\s*<code!is' => '<pre',
'!</code>\s*</pre>!is' => '</pre>',
'!<[hb]r>!is' => '<\1 />',
)
$pre_filters
protected
mixed
$pre_filters
= array(
// remove obvious scripts
'!<script[^>]*>(.*?)</script>!is' => '',
// remove obvious styles
'!<style[^>]*>(.*?)</style>!is' => '',
// remove spans as we redefine styles and they're probably special-styled
'!</?span[^>]*>!is' => '',
// HACK: firewall-filtered content
'!<font[^>]*>\s*\[AD\]\s*</font>!is' => '',
// HACK: replace linebreaks plus br's with p's
'!(<br[^>]*>[ \r\n\s]*){2,}!i' => '</p><p>',
// replace noscripts
//'!</?noscript>!is' => '',
// replace fonts to spans
'!<(/?)font[^>]*>!is' => '<\1span>',
)
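As a rough illustration of what these pattern/replacement pairs do, the sketch below runs a subset of them over an invented HTML fragment, in the order listed. Applying them one by one with `preg_replace` is an assumption about how the class uses the array; `sketchApplyFilters` is a hypothetical helper, not part of the documented API.

```php
<?php
// A subset of the $pre_filters patterns documented above, in the same order.
$preFilters = [
    '!<script[^>]*>(.*?)</script>!is' => '',
    '!<style[^>]*>(.*?)</style>!is'   => '',
    '!</?span[^>]*>!is'               => '',
    '!(<br[^>]*>[ \r\n\s]*){2,}!i'    => '</p><p>',
    '!<(/?)font[^>]*>!is'             => '<\1span>',
];

// Hypothetical helper: apply each pattern in turn to the raw HTML string.
function sketchApplyFilters(string $html, array $filters): string
{
    foreach ($filters as $pattern => $replacement) {
        $html = preg_replace($pattern, $replacement, $html);
    }
    return $html;
}

$raw = '<p>Hello<script>alert(1)</script> <span class="x">world</span>'
     . '<br><br>Next <font color="red">line</font></p>';

$filtered = sketchApplyFilters($raw, $preFilters);
echo $filtered, "\n";
// prints: <p>Hello world</p><p>Next <span>line</span></p>
```

Because the span-stripping pattern runs before the font-to-span pattern, the spans created from `<font>` tags survive in the result.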
$success
protected
mixed
$success
= \false
$useTidy
protected
mixed
$useTidy
Methods
__construct()
Create instance of Readability.
public
__construct(mixed $html[, mixed $url = null ][, mixed $parser = 'libxml' ][, mixed $use_tidy = true ]) : mixed
Parameters
- $html : mixed
- $url : mixed = null
- $parser : mixed = 'libxml'
- $use_tidy : mixed = true
addFlag()
Add a flag.
public
addFlag(int $flag) : mixed
Parameters
- $flag : int
addFootnotes()
For easier reading, convert this document to have footnotes at the bottom rather than inline links.
public
addFootnotes(DOMElement $articleContent) : mixed
Parameters
- $articleContent : DOMElement
addPostFilter()
Add post filter for raw output HTML processing.
public
addPostFilter(mixed $filter[, mixed $replacer = '' ]) : mixed
Parameters
- $filter : mixed
- $replacer : mixed = ''
addPreFilter()
Add pre filter for raw input HTML processing.
public
addPreFilter(mixed $filter[, mixed $replacer = '' ]) : mixed
Parameters
- $filter : mixed
- $replacer : mixed = ''
clean()
Clean a node of all elements of type "tag".
public
clean(DOMElement $e, string $tag) : mixed
(Unless it's a YouTube/Vimeo video. People love movies.)
Updated 2012-09-18 to preserve YouTube/Vimeo iframes.
Parameters
- $e : DOMElement
- $tag : string
cleanConditionally()
Clean an element of all tags of type "tag" if they look fishy.
public
cleanConditionally(DOMElement $e, string $tag) : mixed
"Fishy" is an algorithm based on content length, classnames, link density, number of images & embeds, etc.
Parameters
- $e : DOMElement
- $tag : string
cleanHeaders()
Clean out spurious headers from an Element. Checks things like classnames and link density.
public
cleanHeaders(DOMElement $e) : mixed
Parameters
- $e : DOMElement
cleanStyles()
Remove the style attribute on every $e and under.
public
cleanStyles(DOMElement $e) : mixed
Parameters
- $e : DOMElement
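One way to picture "every $e and under" is a pass that removes the style attribute from the element itself and from all of its descendants. The sketch below does this with DOMXPath; it is an illustration under that assumption, not the method's actual body, and `sketchCleanStyles` is a hypothetical name. It assumes PHP 7.1 or later for the `void` return type.

```php
<?php
// Hypothetical stand-in for cleanStyles(): strip style attributes from
// the element and from every descendant element.
function sketchCleanStyles(DOMElement $e): void
{
    $e->removeAttribute('style');

    $xpath = new DOMXPath($e->ownerDocument);
    // XPath results are a snapshot, so removing attributes while looping is safe.
    foreach ($xpath->query('.//*[@style]', $e) as $node) {
        $node->removeAttribute('style');
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<div id="a" style="color:red"><p style="margin:0">Hi '
             . '<b style="font-weight:bold">there</b></p></div>');

$div = $dom->getElementsByTagName('div')->item(0);
sketchCleanStyles($div);

$cleaned = $dom->saveHTML($div);
echo $cleaned, "\n";
```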
flagIsActive()
Check if the given flag is active.
public
flagIsActive(int $flag) : bool
Parameters
- $flag : int
Return values
bool
getCommaCount()
Get the number of commas in a given text.
public
getCommaCount(string $text) : int
Parameters
- $text : string
Return values
int
getContent()
Get article content element.
public
getContent() : DOMElement
Return values
DOMElement
getInnerText()
Get the inner text of a node.
public
getInnerText(DOMElement $e[, bool $normalizeSpaces = true ][, bool $flattenLines = false ]) : string
This also strips out any excess whitespace to be found.
Parameters
- $e : DOMElement
- $normalizeSpaces : bool = true
- $flattenLines : bool = false
Return values
string
getLinkDensity()
Get the density of links as a percentage of the content. This is the amount of text that is inside a link divided by the total text in the node.
public
getLinkDensity(DOMElement $e[, string $excludeExternal = false ]) : int
Can exclude external references to differentiate between simple text and menus/infoblocks.
Parameters
- $e : DOMElement
- $excludeExternal : string = false
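The ratio described above can be sketched in a few lines of standalone DOM code. This is an illustration only: `sketchLinkDensity` is a hypothetical function, it ignores the $excludeExternal option, it measures length in bytes with `strlen`, and it returns the ratio as a float rather than following the documented `int` signature.

```php
<?php
// Hypothetical sketch: text inside <a> elements divided by all text in the node.
function sketchLinkDensity(DOMElement $e): float
{
    $textLength = strlen(trim($e->textContent));
    if ($textLength === 0) {
        return 0.0;   // avoid division by zero on empty nodes
    }

    $linkLength = 0;
    foreach ($e->getElementsByTagName('a') as $a) {
        $linkLength += strlen(trim($a->textContent));
    }

    return $linkLength / $textLength;
}

$dom = new DOMDocument();
$dom->loadHTML('<div><a href="#">12345</a>67890abcdefghij</div>');
$div = $dom->getElementsByTagName('div')->item(0);

// 5 characters of link text out of 20 characters in total.
$density = sketchLinkDensity($div);
echo $density, "\n";   // prints 0.25
```

A result of 0.25 sits exactly at the MAX_LINK_DENSITY value listed among the constants.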
Return values
int
getTitle()
Get article title element.
public
getTitle() : DOMElement
Return values
DOMElement
getWeight()
Get an element's relative weight.
public
getWeight(DOMElement $e) : int
Parameters
- $e : DOMElement
Return values
int
getWordCount()
Get the number of words in a given text, assuming words are separated by a space.
public
getWordCount(string $text) : int
Input string should be normalized.
Parameters
- $text : string
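The note that the input should be normalized matters if words are counted by splitting on single spaces, since runs of whitespace would then produce empty "words". The sketch below shows that effect with hypothetical helpers (`sketchNormalizeSpaces`, `sketchWordCount`, `sketchCommaCount`); splitting with `explode` and counting with `substr_count` are assumptions, not the class's confirmed implementation.

```php
<?php
// Hypothetical helpers illustrating space-separated word counting.
function sketchNormalizeSpaces(string $text): string
{
    // Collapse every run of whitespace to one space and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

function sketchWordCount(string $text): int
{
    return $text === '' ? 0 : count(explode(' ', $text));
}

function sketchCommaCount(string $text): int
{
    return substr_count($text, ',');
}

$raw        = "One,  two,\n three  four";
$normalized = sketchNormalizeSpaces($raw);      // "One, two, three four"

$rawWords   = sketchWordCount($raw);            // 6: empty pieces are counted
$words      = sketchWordCount($normalized);     // 4
$commas     = sketchCommaCount($normalized);    // 2
```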
Return values
int
init()
Runs readability.
public
init() : bool
Workflow:
- Prep the document by removing script tags, css, etc.
- Build readability's DOM tree.
- Grab the article content from the current dom tree.
- Replace the current DOM tree with the new one.
- Read peacefully.
Return values
bool —true if we found content, false otherwise
killBreaks()
Remove extraneous break tags from a node.
public
killBreaks(DOMElement $node) : mixed
Parameters
- $node : DOMElement
postProcessContent()
Run any post-process modifications to article content as necessary.
public
postProcessContent(DOMElement $articleContent) : mixed
Parameters
- $articleContent : DOMElement
prepArticle()
Prepare the article node for display. Clean out any inline styles, iframes, forms, strip extraneous <p> tags, etc.
public
prepArticle(DOMElement $articleContent) : mixed
Parameters
- $articleContent : DOMElement
removeFlag()
Remove a flag.
public
removeFlag(int $flag) : mixed
Parameters
- $flag : int
dump_dbg()
Dump debug info.
protected
dump_dbg() : mixed
Since Monolog gathers the logs, we don't need it.
getArticleTitle()
Get the article title as an H1.
protected
getArticleTitle() : DOMElement
Return values
DOMElement
grabArticle()
Using a variety of metrics (content score, classname, element types), find the content that is most likely to be the stuff a user wants to read. Then return it wrapped up in a div.
protected
grabArticle([DOMElement $page = null ]) : DOMElement|bool
Parameters
- $page : DOMElement = null
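To give a feel for what a "content score" can look like, here is a heavily simplified sketch. It is NOT the library's grabArticle(): the scoring formula (one base point, one point per comma, up to three points for length) is an assumption modelled on the general Readability approach, and it ignores class names, link density, and unlikely-candidate stripping. The constant values come from this page; `sketchTopCandidate` and the sample markup are hypothetical. It assumes PHP 7.1 or later.

```php
<?php
// Values taken from the constants documented on this page.
const MIN_PARAGRAPH_LENGTH      = 20;
const SCORE_CHARS_IN_PARAGRAPH  = 100;
const GRANDPARENT_SCORE_DIVISOR = 2.2;

// Hypothetical, simplified candidate selection based on paragraph scores.
function sketchTopCandidate(DOMDocument $dom): ?DOMElement
{
    $scores   = [];   // node path => accumulated score
    $elements = [];   // node path => DOMElement

    foreach ($dom->getElementsByTagName('p') as $p) {
        $text = trim($p->textContent);
        if (strlen($text) < MIN_PARAGRAPH_LENGTH) {
            continue;   // too short to count as content
        }

        // Assumed formula: 1 base point, 1 per comma, up to 3 for length.
        $score = 1 + substr_count($text, ',')
               + min((int) floor(strlen($text) / SCORE_CHARS_IN_PARAGRAPH), 3);

        // The parent gets the full score, the grandparent a reduced share.
        $parent      = $p->parentNode;
        $grandparent = $parent ? $parent->parentNode : null;
        foreach ([[$parent, 1.0], [$grandparent, GRANDPARENT_SCORE_DIVISOR]] as [$node, $divisor]) {
            if (!$node instanceof DOMElement) {
                continue;
            }
            $key            = $node->getNodePath();
            $elements[$key] = $node;
            $scores[$key]   = ($scores[$key] ?? 0) + $score / $divisor;
        }
    }

    if ($scores === []) {
        return null;
    }
    arsort($scores);    // highest score first
    reset($scores);
    return $elements[key($scores)];
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<html><body><div id="side"><p>Short, ad.</p></div>'
    . '<div id="main"><p>First paragraph, with commas, and enough text to count.</p>'
    . '<p>Second paragraph, also long enough, to be scored.</p></div></body></html>'
);

$top   = sketchTopCandidate($dom);
$topId = $top ? $top->getAttribute('id') : null;
echo $topId, "\n";   // prints main
```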
Return values
DOMElement|bool
initializeNode()
Initialize a node with the readability object. Also checks the className/id for special names to add to its score.
protected
initializeNode(DOMElement $node) : mixed
Parameters
- $node : DOMElement
prepDocument()
Prepare the HTML document for readability to scrape it.
protected
prepDocument() : mixed
This includes things like stripping javascript, CSS, and handling terrible markup.
reinitBody()
Recreate the previously deleted body property.
protected
reinitBody() : mixed
weightAttribute()
Get an element's weight by attribute.
protected
weightAttribute(DOMElement $element, string $attribute) : int
Uses regular expressions to tell if this element looks good or bad.
Parameters
- $element : DOMElement
- $attribute : string
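A minimal sketch of the "looks good or bad" idea follows. The two patterns are abbreviated subsets of the 'positive' and 'negative' entries in $regexps, and the step of 25 points per match is an assumption borrowed from the general Readability approach; neither the step nor the function body is confirmed by this page. `sketchWeightAttribute` is a hypothetical name.

```php
<?php
// Abbreviated subsets of the 'positive' and 'negative' patterns from $regexps.
const SKETCH_POSITIVE = '/article|body|\bcontent|entry|main|post|text|story/i';
const SKETCH_NEGATIVE = '/comment|promo|related|sidebar|sponsor/i';

// Hypothetical sketch: score an attribute value against both patterns.
// The default $step of 25 is an assumed value, not taken from this page.
function sketchWeightAttribute(DOMElement $element, string $attribute, int $step = 25): int
{
    if (!$element->hasAttribute($attribute)) {
        return 0;
    }
    $value  = $element->getAttribute($attribute);
    $weight = 0;
    if (preg_match(SKETCH_NEGATIVE, $value)) {
        $weight -= $step;   // looks like boilerplate
    }
    if (preg_match(SKETCH_POSITIVE, $value)) {
        $weight += $step;   // looks like content
    }
    return $weight;
}

$dom = new DOMDocument();

$good = $dom->createElement('div');
$good->setAttribute('class', 'article-body');

$bad = $dom->createElement('div');
$bad->setAttribute('class', 'sidebar');

$mixed = $dom->createElement('div');
$mixed->setAttribute('class', 'comment-text');

$plain = $dom->createElement('div');   // no class attribute at all

$weights = [
    sketchWeightAttribute($good, 'class'),    // 25
    sketchWeightAttribute($bad, 'class'),     // -25
    sketchWeightAttribute($mixed, 'class'),   // 0: one match on each side
    sketchWeightAttribute($plain, 'class'),   // 0
];
```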
Return values
int
loadHtml()
Load HTML in a DOMDocument.
private
loadHtml() : mixed
Apply pre-filters. Clean up the HTML using Tidy (or not).