Screaming Frog is the go-to web scraping tool for SEOs. The options are endless, so here are a ton of custom web-scraping syntaxes.

How to Use Screaming Frog Custom Extraction

In Screaming Frog, go to Configuration > Custom > Extraction.

Next, click +Add and set up your extraction rules.

Add a title, select if you need CSSPath, XPath, or Regex, then add your search function. If you aren't sure which selector or function you need, look at the examples below.

Example:

Here is an example of how you would scrape for a Facebook Pixel ID.

In the results, you can see that one of my pages is missing a Facebook Pixel.
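If you want to sanity-check a rule like this outside of the tool, here is a minimal Python sketch of the same idea. It is not how Screaming Frog works internally: the URLs and HTML are made up, `extract_pixel_ids` is a hypothetical helper, and Python's `re` module is assumed to behave closely enough to the crawler's regex engine for this pattern (the Facebook Pixel regex listed further down).

```python
import re

# Facebook Pixel pattern from the regex list later in this article.
FB_PIXEL = re.compile(r"""fbq\(["']init["'], ["'](.*?)["']""")

# Hypothetical crawl data: URL -> raw HTML source (made up for illustration).
pages = {
    "https://example.com/": "<script>fbq('init', '1234567890');</script>",
    "https://example.com/about": "<script>console.log('no pixel here');</script>",
}

def extract_pixel_ids(html_by_url):
    """Return {url: first captured pixel ID, or None when nothing matches}."""
    results = {}
    for url, html in html_by_url.items():
        match = FB_PIXEL.search(html)
        results[url] = match.group(1) if match else None
    return results

results = extract_pixel_ids(pages)
# Pages with no match are the ones missing the pixel.
missing = [url for url, pixel_id in results.items() if pixel_id is None]

print(results)
print("Missing pixel:", missing)
```

The capture group `(.*?)` is what ends up in the extraction column, and an empty cell corresponds to `None` here.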

Basic Syntax for XPath Web Scraping

SYNTAX FUNCTION
// Search anywhere in the document
/ Search within the root
@ Select a specific attribute of an element
* Wildcard, used to select any element
[ ] Predicate, used to find a specific element by position or condition
. Specifies the current element
.. Specifies the parent element
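Here is a rough sketch of the syntax above using Python's built-in `xml.etree.ElementTree`. Treat it as an approximation. ElementTree only supports a subset of XPath, paths must be relative (so `//p` becomes `.//p`), the markup has to be well-formed XML, and attributes are read with `.get()` rather than `@attr` at the end of the path. The sample HTML is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document.
doc = ET.fromstring("""<html>
  <body>
    <div class="author"><p class="bio">Jane writes about SEO.</p></div>
    <div><p>Second paragraph.</p></div>
  </body>
</html>""")

# //  search anywhere in the document
all_paragraphs = doc.findall(".//p")

# /   step down one level at a time, starting from the root element
top_level_divs = doc.findall("./body/div")

# *   wildcard, combined with [ ] and @ to filter on an attribute value
bio = doc.find(".//*[@class='bio']")

# ..  step up to the parent of the matched element
bio_parent = doc.find(".//p[@class='bio']/..")

print(len(all_paragraphs))
print([div.get("class") for div in top_level_divs])
print(bio.text)
print(bio_parent.get("class"))
```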


XPath functions

OPERATOR FUNCTION
starts-with(x,y) Checks if x starts with y
contains(x,y) Checks if x contains y
last() Finds the last item in a set
count(XPath) Counts occurrences of the XPath extraction
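To show what these functions do, here is a small Python sketch. Python's built-in ElementTree does not implement `starts-with()`, `contains()`, or `count()`, so those three are imitated with plain Python string and list operations. Only `last()` is run as real XPath. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample markup.
doc = ET.fromstring("""<body>
  <h2>Intro</h2>
  <h2>Pricing</h2>
  <a href="/post" title="Written by Sam">read the post</a>
  <a href="/buy" title="Shop">click here to buy</a>
  <ul><li>one</li><li>two</li><li>three</li></ul>
</body>""")

links = doc.findall(".//a")

# starts-with(@title,'Written by'), imitated with str.startswith
written_by = [a.get("href") for a in links
              if a.get("title", "").startswith("Written by")]

# contains(.,'click here'), imitated with a substring test on the link text
click_here = [a.get("href") for a in links
              if "click here" in "".join(a.itertext())]

# last() is supported natively as a position predicate
last_item = doc.find(".//ul/li[last()]").text

# count(//h2), imitated by taking the length of the match list
h2_count = len(doc.findall(".//h2"))

print(written_by, click_here, last_item, h2_count)
```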

How to Extract Common HTML Elements

XPATH OUTPUT
//h1 Extract all H1 tags
//h3[1] Extract the first H3 tag (the first H3 within each parent element)
//h3[2] Extract the second H3 tag (the second H3 within each parent element)
//div/p Extract any <p> contained within a <div>
//div[@class='author'] Extract any <div> with class “author”
//p[@class='bio'] Extract any <p> with class “bio”
//*[@class='bio'] Extract any element with class “bio”
//ul/li[last()] Extract the last <li> in a <ul>
//ol[@class='cat']/li[1] Extract the first <li> in an <ol> with class “cat”
count(//h2) Count the number of H2s (set extraction filter to “Function Value”)
//a[contains(.,'click here')] Extract any link with anchor text containing “click here”
//a[starts-with(@title,'Written by')] Extract any link with a title starting with “Written by”
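One detail worth seeing in action is how position predicates like `[1]` behave. In XPath the position is counted among siblings under the same parent, so `//h3[1]` can return more than one element on a page. Below is a minimal sketch with Python's ElementTree, which handles this part of XPath the same way. The markup is made up and the paths are written in ElementTree's relative form.

```python
import xml.etree.ElementTree as ET

# Made-up markup with two sections that each contain two H3 tags.
doc = ET.fromstring("""<body>
  <div><h3>A1</h3><h3>A2</h3></div>
  <div><h3>B1</h3><h3>B2</h3></div>
  <ol class="cat"><li>News</li><li>Guides</li></ol>
</body>""")

# //h3[1]: the first H3 under each parent, so one result per <div>
first_h3 = [h.text for h in doc.findall(".//h3[1]")]

# //h3[2]: the second H3 under each parent
second_h3 = [h.text for h in doc.findall(".//h3[2]")]

# //ol[@class='cat']/li[1]: the first <li> in the <ol> with class "cat"
first_category = doc.find(".//ol[@class='cat']/li[1]").text

print(first_h3, second_h3, first_category)
```

If you only want a single value per page, checking how many matches come back on a sample page first is a cheap way to avoid surprises.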

How to Extract Common HTML Attributes

XPATH OUTPUT
//@href Extract every href attribute value (all links)
//a[starts-with(@href,'mailto')]/@href Extract links that start with “mailto” (email addresses)
//img/@src Extract all image source URLs
//img[contains(@class,'aligncenter')]/@src Extract all image source URLs for images with the class name containing “aligncenter”
//link[@rel='alternate'] Extract elements with the rel attribute set to “alternate”
//@hreflang Extract all hreflang values
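The attribute rules above can be imitated in Python as a rough sketch. ElementTree cannot return attribute nodes from a path the way `//@href` does, so this version selects the elements that carry the attribute and then reads it with `.get()`. The `starts-with()` and `contains()` checks are done in plain Python. The sample markup is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample markup.
doc = ET.fromstring("""<body>
  <a href="mailto:hi@example.com">Email us</a>
  <a href="/contact">Contact</a>
  <img class="wp-image aligncenter" src="/img/hero.png"/>
  <img class="thumb" src="/img/thumb.png"/>
</body>""")

# //@href: every element that has an href, then read the attribute value
hrefs = [el.get("href") for el in doc.findall(".//*[@href]")]

# //a[starts-with(@href,'mailto')]/@href
mailto_links = [a.get("href") for a in doc.findall(".//a[@href]")
                if a.get("href").startswith("mailto")]

# //img[contains(@class,'aligncenter')]/@src
centered_images = [img.get("src") for img in doc.findall(".//img")
                   if "aligncenter" in img.get("class", "")]

print(hrefs, mailto_links, centered_images)
```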

Extract Meta Tags (use inner HTML)

XPATH OUTPUT
//meta[@property='article:published_time']/@content Extract the article publish date (commonly-found meta tag on WordPress websites)

Extract Open Graph

XPATH OUTPUT
//meta[@property='og:type']/@content Extract the Open Graph type object
//meta[@property='og:image']/@content Extract the Open Graph featured image URL
//meta[@property='og:updated_time']/@content Extract the Open Graph updated time

Extract Twitter Cards

XPATH OUTPUT
//meta[@name='twitter:card']/@content Extract the Twitter Card type
//meta[@name='twitter:title']/@content Extract the Twitter Card title
//meta[@name='twitter:site']/@content Extract the Twitter Card site object (Twitter handle)
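The meta tag, Open Graph, and Twitter Card rules all follow the same shape: find a `<meta>` by one attribute, then read its `content`. Here is a small Python sketch of that pattern. `meta_content` is a hypothetical helper, and the `<head>` snippet is made up.

```python
import xml.etree.ElementTree as ET

# Made-up <head> snippet.
head = ET.fromstring("""<head>
  <meta property="article:published_time" content="2020-01-15T08:00:00+00:00"/>
  <meta property="og:type" content="article"/>
  <meta property="og:image" content="https://example.com/cover.jpg"/>
  <meta name="twitter:card" content="summary_large_image"/>
</head>""")

def meta_content(root, attr, value):
    """Rough equivalent of //meta[@attr='value']/@content (first match only)."""
    element = root.find(f".//meta[@{attr}='{value}']")
    return element.get("content") if element is not None else None

og_type = meta_content(head, "property", "og:type")
og_image = meta_content(head, "property", "og:image")
twitter_card = meta_content(head, "name", "twitter:card")
twitter_site = meta_content(head, "name", "twitter:site")  # not in the sample

print(og_type, og_image, twitter_card, twitter_site)
```

Note that Open Graph tags are matched on `property` while Twitter Cards are matched on `name`, which is the same split you see in the XPath rows above.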

Extract Schema Types

XPATH OUTPUT
//*[@itemtype]/@itemtype Extract all of the types of schema markup on a page

Extract Breadcrumb Schema

XPATH OUTPUT
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]/a/@href Extract all breadcrumb links
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop][1]/a/@href Extract the first breadcrumb link
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop] Extract breadcrumb names (set extraction filter to “Extract Text”)
count(//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]) Count the number of breadcrumb list items (set extraction filter to “Function Value”)

Extract Product Schema

XPATH OUTPUT
//*[@itemprop='name']/@content Extract product name
//*[@itemprop='description']/@content Extract product description
//*[@itemprop='price']/@content Extract product price
//*[@itemprop='priceCurrency']/@content Extract product currency
//*[@itemprop='availability']/@href Extract product availability
//*[@itemprop='sku']/@content Extract product SKU
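For microdata, the rules above boil down to "find the element with this `itemprop`, then read `content` or `href`". The following is a minimal Python sketch under that assumption. The product markup is invented, and `itemprop_attr` is a hypothetical helper that returns the first match only.

```python
import xml.etree.ElementTree as ET

# Invented product microdata.
doc = ET.fromstring("""<div itemscope="" itemtype="https://schema.org/Product">
  <meta itemprop="name" content="Blue Widget"/>
  <meta itemprop="sku" content="BW-001"/>
  <span itemprop="offers" itemscope="" itemtype="https://schema.org/Offer">
    <meta itemprop="price" content="19.99"/>
    <meta itemprop="priceCurrency" content="USD"/>
    <link itemprop="availability" href="https://schema.org/InStock"/>
  </span>
</div>""")

def itemprop_attr(root, itemprop, attribute):
    """Rough equivalent of //*[@itemprop='...']/@attribute (first match only)."""
    element = root.find(f".//*[@itemprop='{itemprop}']")
    return element.get(attribute) if element is not None else None

product = {
    "name": itemprop_attr(doc, "name", "content"),
    "price": itemprop_attr(doc, "price", "content"),
    "currency": itemprop_attr(doc, "priceCurrency", "content"),
    "availability": itemprop_attr(doc, "availability", "href"),
    "sku": itemprop_attr(doc, "sku", "content"),
}
print(product)
```

These rules only return a value when the site stores it in a `content` or `href` attribute. If the value is the visible text of the element, you would extract the element's text instead.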

Extract Review Schema

XPATH OUTPUT
//*[@itemprop='reviewCount'] Extract review count
//*[@itemprop='ratingValue'] Extract rating value
//*[@itemprop='bestRating'] Extract best review rating
//*[@itemprop='review']/*[@itemprop='name'] Extract review name
//*[@itemprop='review']/*[@itemprop='author'] Extract review author
//*[@itemprop='review']/*[@itemprop='datePublished']/@content Extract the publish date of reviews
//*[@itemprop='review']/*[@itemprop='reviewBody'] Extract the body content of reviews

Extract Local Business & Organization Schema

XPATH OUTPUT
//*[contains(@itemtype,'Organization')]/*[@itemprop='name'] Extract the organization’s name
//*[@itemprop='address']/*[@itemprop='streetAddress'] Extract the street address
//*[@itemprop='address']/*[@itemprop='addressLocality'] Extract the address locality
//*[@itemprop='address']/*[@itemprop='addressRegion'] Extract the address region
//*[@itemprop='telephone'] Extract the telephone number
//*[@itemprop='sameAs']/@href Extract the “sameAs” links

Extract Article Schema

XPATH OUTPUT
//*[contains(@itemtype,'Article')]/*[@itemprop='headline'] Extract the article headline
//*[@itemprop='author']/*[@itemprop='name']/@content Extract author name
//*[@itemprop='publisher']/*[@itemprop='name']/@content Extract publisher name
//*[@itemprop='datePublished']/@content Extract publish date
//*[@itemprop='dateModified']/@content Extract modified date

Custom Extraction with Regex

Wildcards

SYNTAX FUNCTION
. Match any 1 character
* Match preceding character 0 or more times
? Match preceding character 0 or 1 time
+ Match preceding character 1 or more times
| OR

Anchors

SYNTAX FUNCTION
^ String begins with the succeeding character
$ String ends with the preceding character

Groups

SYNTAX FUNCTION
( ) Group: match enclosed characters in exact order (and capture the match)
[ ] Character set: match any one of the enclosed characters
- Inside [ ], match any character within the specified range (for example, [a-z] or [0-9])

Escape

SYNTAX FUNCTION
\ Treat character literally, not as regex
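A quick way to get a feel for the syntax above is to try each piece on a short string. The sketch below uses Python's `re` module, which treats these basic operators the way the tables describe. The test strings are made up.

```python
import re

text = "color colour colouur"

# Wildcards and quantifiers
any_char = re.findall(r"h.t", "hat hot hut")        # . matches any one character
zero_or_more = re.findall(r"colou*r", text)         # * allows 0 or more "u"
zero_or_one = re.findall(r"colou?r", text)          # ? allows 0 or 1 "u"
one_or_more = re.findall(r"colou+r", text)          # + requires 1 or more "u"
either = re.findall(r"cat|dog", "cat, dog, bird")   # | means OR

# Anchors (and \ to treat the dot literally)
starts_with_https = bool(re.search(r"^https", "https://example.com"))
ends_with_pdf = bool(re.search(r"\.pdf$", "/files/report.pdf"))

# Groups, character sets, and ranges
repeated_group = re.fullmatch(r"(ab)+", "ababab") is not None  # "ab" in exact order
any_of_set = re.findall(r"[abc]", "cabbage")                   # any one of a, b, c
digit_runs = re.findall(r"[0-9]+", "UA-12345-6")               # range 0 through 9

print(zero_or_more, zero_or_one, one_or_more)
print(any_of_set, digit_runs)
```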

Regex Custom Extraction

REGEX OUTPUT
["'](UA-.*?)["'] Extract the Google Analytics tracking ID
["'](AW-.*?)["'] Extract the Google Ads conversion ID and/or remarketing tag
["'](GTM-.*?)["'] Extract the Google Tag Manager and/or Google Optimize ID
fbq\(["']init["'], ["'](.*?)["'] Extract the Facebook Pixel ID
\{ti:["'](.*?)["']\} Extract the Bing Ads UET tag
adroll_adv_id = ["'](.*?)["'] Extract the AdRoll Advertiser ID
adroll_pix_id = ["'](.*?)["'] Extract the AdRoll Pixel ID
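Here is a small Python sketch that runs a few of the patterns above against a made-up page source, so you can see what each capture group returns. The IDs and the HTML are invented, and Python's `re` is only assumed to match the crawler's behavior for these simple patterns.

```python
import re

# A few of the patterns from the table above.
PATTERNS = {
    "google_analytics": r"""["'](UA-.*?)["']""",
    "google_ads": r"""["'](AW-.*?)["']""",
    "tag_manager": r"""["'](GTM-.*?)["']""",
    "adroll_adv_id": r"""adroll_adv_id = ["'](.*?)["']""",
}

# Invented page source with made-up IDs.
html = """<script>
gtag('config', 'UA-1234567-1');
gtag('config', 'AW-987654321');
})(window,document,'script','dataLayer','GTM-ABC123');
</script>
<script>adroll_adv_id = "ABCDEFGHIJ";</script>"""

# findall returns the captured group for every match on the page.
found = {name: re.findall(pattern, html) for name, pattern in PATTERNS.items()}

for name, values in found.items():
    print(name, values)
```

Each of the ID patterns expects a quote directly before the prefix, so an ID that only appears inside a URL query string (for example `?id=GTM-ABC123`) would not be captured by them.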

Extract All Schema Markup and Schema Types

REGEX OUTPUT
["']application/ld\+json["']>(.*?)</script> Extract all of the JSON-LD schema markup
["']@type["']: *["'](.*?)["'] Extract all of the types of JSON-LD schema markup on a page
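To see what these two patterns capture, here is a minimal Python sketch on a made-up, single-line JSON-LD block. Parsing the captured block with `json.loads` is an extra step I added for illustration and is not part of the extraction rule. One caveat for the Python version: `.` does not match line breaks unless you pass `re.DOTALL`, so pretty-printed JSON-LD would need that flag here.

```python
import json
import re

# Made-up page source with a single-line JSON-LD block.
html = (
    '<script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Article", '
    '"headline": "Custom extraction tips", '
    '"publisher": {"@type": "Organization", "name": "Example Media"}}'
    "</script>"
)

JSON_LD = re.compile(r"""["']application/ld\+json["']>(.*?)</script>""")
SCHEMA_TYPE = re.compile(r"""["']@type["']: *["'](.*?)["']""")

blocks = JSON_LD.findall(html)       # raw JSON-LD strings
types = SCHEMA_TYPE.findall(html)    # every @type value on the page

# Optional extra step: parse the captured JSON to read fields reliably.
parsed = [json.loads(block) for block in blocks]

print(types)
print(parsed[0]["headline"])
```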

Extract Breadcrumb Schema

REGEX OUTPUT
["']item["']: *\{["']@id["']: *["'](.*?)["'] Extract breadcrumb links
["']item["']: *\{["']@id["']: *["'].*?["'], *["']name["']: *["'](.*?)["'] Extract breadcrumb names

Extract Product Schema

REGEX OUTPUT
["']@type["']: *["']Product["'].*?["']name["']: *["'](.*?)["'] Extract product name
["']@type["']: *["']Product["'].*?["']description["']: *["'](.*?)["'] Extract product description
["']@type["']: *["']Product["'].*?["']price["']: *["'](.*?)["'] Extract product price
["']@type["']: *["']Product["'].*?["']priceCurrency["']: *["'](.*?)["'] Extract product currency
["']@type["']: *["']Product["'].*?["']availability["']: *["'](.*?)["'] Extract product availability
["']@type["']: *["']Product["'].*?["']sku["']: *["'](.*?)["'] Extract product SKU
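The product patterns above share one template: find the Product type, skip ahead lazily, then capture the quoted value of a field. The Python sketch below builds that template for any field name. `product_field` is a hypothetical helper and the JSON-LD is made up.

```python
import re

# Made-up, single-line Product JSON-LD.
json_ld = (
    '{"@context": "https://schema.org", "@type": "Product", '
    '"name": "Blue Widget", "sku": "BW-001", '
    '"offers": {"@type": "Offer", "price": "19.99", '
    '"priceCurrency": "USD", "availability": "https://schema.org/InStock"}}'
)

def product_field(field, source):
    """Apply the article's Product template for one field (first match only)."""
    pattern = (
        r"""["']@type["']: *["']Product["'].*?["']"""
        + re.escape(field)
        + r"""["']: *["'](.*?)["']"""
    )
    match = re.search(pattern, source)
    return match.group(1) if match else None

name = product_field("name", json_ld)
price = product_field("price", json_ld)
currency = product_field("priceCurrency", json_ld)
sku = product_field("sku", json_ld)

# The template expects a quoted value, so an unquoted number is not captured.
unquoted_price = product_field("price", '{"@type": "Product", "price": 19.99}')

print(name, price, currency, sku, unquoted_price)
```

Because the field name must be followed directly by a closing quote, the "price" pattern does not accidentally match "priceCurrency".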

Extract Review Schema

REGEX OUTPUT
["']reviewCount["']: *["'](.*?)["'] Extract review count
["']ratingValue["']: *["'](.*?)["'] Extract rating value
["']bestRating["']: *["'](.*?)["'] Extract best rating

Extract Local Business & Organization Schema

REGEX OUTPUT
["']@type["']: *["']Organization["'].*?["']name["']: *["'](.*?)["'] Extract organization name
["']streetAddress["']: *["'](.*?)["'] Extract the street address
["']addressLocality["']: *["'](.*?)["'] Extract the address locality
["']addressRegion["']: *["'](.*?)["'] Extract the address region
["']telephone["']: *["'](.*?)["'] Extract the telephone number
["']sameAs["']: *\[(.*?)\] Extract the “sameAs” links

Extract Article or BlogPosting Schema

REGEX OUTPUT
["']headline["']: *["'](.*?)["'] Extract article headline
["']author["'].*?["']name["']: *["'](.*?)["'] Extract author name
["']publisher["'].*?["']name["']: *["'](.*?)["'] Extract publisher name
["']datePublished["']: *["'](.*?)["'] Extract publish date
["']dateModified["']: *["'](.*?)["'] Extract modified date

The possibilities are endless. Please let me know if you want any extractions added to this list.