Michael Rog presents

RogEE: Add-ons for ExpressionEngine 2

Scraper 1.1.1

Scraper lets you pull in HTML content from remote pages, select elements with CSS selectors, and output that content through EE template variables.

Examples: General Use

Simply specify a URL and a root selector; Scraper will import the remote content and make the matched elements, along with their neighbors in the DOM (parent, children, siblings), available as EE variables.

{exp:scraper url="http://some-external-site.com" selector="div.foo"}

    {count} / {total_results}

    {if children_total_results}
        {children}
            {tag}, etc.
            {attr:id}, {attr:class}, etc.
            {children_count} / {children_total_results}
        {/children}
    {/if}

    {parent} ... {/parent}
    {first_child} ... {/first_child}
    {last_child} ... {/last_child}
    {next_sibling} ... {/next_sibling}
    {prev_sibling} ... {/prev_sibling}

    {find selector=".bar"} ... {/find}

{/exp:scraper}
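
As a more concrete sketch, the template below lists the direct children of a navigation list on a remote page. The URL and the ul.nav selector are placeholders, and the HTML wrapped around the variables is only illustrative:

{exp:scraper url="http://some-external-site.com" selector="ul.nav"}

    {if children_total_results}
        <ol>
        {children}
            <li>{children_count}. {plaintext}</li>
        {/children}
        </ol>
    {/if}

{/exp:scraper}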


Using a variable prefix, it might look something like this:

{exp:scraper url="http://some-external-site.com" selector="div.foo" prefix="scraper:"}

    {if scraper:children_total_results} ... {/if}

    {scraper:find selector=".bar"} ... {/scraper:find}

{/exp:scraper}
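
A prefix is mostly useful when Scraper is nested inside another tag whose variables would otherwise collide with its own. The following is a rough sketch, assuming a channel named "links" with a custom field {link_url} (both hypothetical). Passing a field value as the url parameter this way relies on the outer tag being parsed first, which is standard EE behavior for nested tags but is not something this page confirms for Scraper:

{exp:channel:entries channel="links" limit="5"}

    <h2>{title} ({count} of {total_results})</h2>

    {exp:scraper url="{link_url}" selector="div.foo" prefix="scraper:"}
        <p>{scraper:count} of {scraper:total_results}: {scraper:plaintext}</p>
    {/exp:scraper}

{/exp:channel:entries}

Here the unprefixed {count} and {total_results} belong to the channel entries loop, while the prefixed variables belong to Scraper.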




Element variables

  • {tag}
  • {outertext}
  • {innertext}
  • {plaintext}
  • {attr:[attribute]}

Additionally, count variables are available for the root element(s):

  • {count}
  • {total_results}
  • {children_total_results}

Related Element variable pairs

  • {children}
  • {parent}
  • {first_child}
  • {last_child}
  • {next_sibling}
  • {prev_sibling}

The Element variables (except count variables) are also available for each of the Related Element pairs.

The children pair makes available its own count variable:

  • {children_count}

Find variable pair

  • {find selector=""}

The Find pair takes a selector parameter and an optional index parameter, and searches for matching elements inside the root element.

The Element variables (except count variables) are available for each of the found elements. Each Find variable pair also makes available its own count variables:

  • {found_count}
  • {found_total_results}
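
For illustration, a Find pair that lists every link inside the root element might look like the following. The URL and selectors are placeholders, and the index parameter is left out because its exact semantics are not described here:

{exp:scraper url="http://some-external-site.com" selector="div.foo"}

    {find selector="a"}
        {found_count} of {found_total_results}: <a href="{attr:href}">{plaintext}</a>
    {/find}

{/exp:scraper}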

Dealing with Foreign Character Encodings

If you find funny-looking characters where international characters, diacritics, accent marks, or symbols ought to be, you might need to specify the encoding of the source document. (It'll be converted to UTF-8, which is required for the DOM parser to work properly.) You must use cURL as the fetching method in order to enable the character set conversion feature:

{exp:scraper url="http://some-external-site.com" selector="div.foo"
   fetch_method="curl" source_encoding="Windows-1250"}