Screaming Frog Custom Extractions: A Guide to Extracting Crawl Data

Screaming Frog (screamingfrog.co.uk) is a powerful SEO tool with many search engine optimization features. One of its lesser-known features, Custom Extraction, allows you to easily extract data from your crawls. This blog post explains how Screaming Frog Custom Extraction works and how it can support your SEO efforts, including e-commerce SEO strategies.

Websites hold a lot of helpful information, but visiting every page to copy product data, metadata, title tags, and anchor text into a spreadsheet is usually too laborious. This is where Screaming Frog comes to the rescue: custom extractions automate the process. Custom extractions are a form of web scraping (also called web harvesting or web data extraction) that pulls data from websites and lets you store it locally on your computer.

For beginners, here are some questions you might have:

What is the Screaming Frog SEO Spider?

The Screaming Frog SEO Spider is a website crawler with a graphical user interface (GUI). It helps you improve onsite SEO by crawling your website and extracting data for analysis.

What are custom extractions?

Custom extractions are functions in the Screaming Frog SEO Spider that extract specific information from web pages. These extractions support technical SEO audits: they help you optimize your site for search results, gather essential data on your copy, and locate and fix errors.

How is Data Extraction done?

Data extraction with Screaming Frog involves crawling your website and pulling the required data from each page. The information is saved within Screaming Frog’s memory, giving you the option to export your crawl results to Excel or Google Sheets for further review.
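
Once the results are exported, a short script can handle the review step. The sketch below is only an illustration: it assumes a CSV export with an `Address` column and a custom extraction column named `Publish Date 1`. Both column names and all rows are hypothetical, so adjust them to match your own export.

```python
import csv
import io

# Hypothetical export: the column names and rows are assumptions for illustration.
EXPORT = """Address,Status Code,Publish Date 1
https://example.com/post-a,200,2021-03-10
https://example.com/post-b,200,
https://example.com/post-c,200,2024-04-05
"""

def pages_missing_value(csv_text, column):
    """Return the URLs whose extraction column is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Address"] for row in reader if not (row[column] or "").strip()]

missing = pages_missing_value(EXPORT, "Publish Date 1")
print(missing)
```

With a real file, replacing `io.StringIO(EXPORT)` with an opened file handle should be the only change needed.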

Why is Data Extraction critical?

Data extraction allows you to harvest large amounts of data quickly and efficiently. This automation gives you a quick view of your web architecture, saving time and resources while providing the data you need to plan your search engine optimization strategy. Screaming Frog is a go-to web scraper and data extractor for SEOs. The options are endless; this guide lists many custom extraction expressions, so check the tutorial below.

How to Extract Custom Data using Screaming Frog

1. In Screaming Frog, go to Configuration > Custom > Extraction.

2. Next, click +Add and set up your extraction rules.

Custom Extraction Settings: select elements of internal HTML using the Custom Extraction tab

3. Add a title.
4. Select CSSPath, XPath, or Regex.
5. Add your search expression.

If you aren’t sure which selector or function you need, look at the examples below or use the Inspect function in Google Chrome DevTools. You can open DevTools by right-clicking an element on the page in Chrome and choosing Inspect.

Example:

Here is an example of how you would scrape for a Facebook Pixel ID.

In the results, you can see that one of my pages is missing a Facebook Pixel.
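
To see the idea outside the tool, here is a minimal Python sketch that applies the Facebook Pixel regex from the regex section of this guide to two made-up pages. The URLs, markup, and pixel ID are hypothetical, and Python's `re` module is used here as a stand-in, so behavior may differ slightly from Screaming Frog's own regex engine.

```python
import re

# Same pattern as the Facebook Pixel row in the regex table of this guide.
FB_PIXEL = re.compile(r"""fbq\(["']init["'], ["'](.*?)["']""")

# Hypothetical pages: URLs, markup, and the pixel ID are made up for illustration.
pages = {
    "https://example.com/": "<script>fbq('init', '1234567890');</script>",
    "https://example.com/contact": "<script>console.log('no pixel');</script>",
}

def pixel_ids(html):
    """Return every pixel ID captured by the pattern (empty list if none)."""
    return FB_PIXEL.findall(html)

results = {url: pixel_ids(html) for url, html in pages.items()}
missing = [url for url, ids in results.items() if not ids]

for url, ids in results.items():
    print(url, ids or "MISSING")
```

Pages that end up in `missing` correspond to the empty cells you would see in the extraction results.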

Below are predefined custom extraction datasets to get you started.

Basic Syntax for using XPath Web Scraping

SYNTAX	FUNCTION
`//`	Search anywhere within the document
`/`	Select from the root of the document
`@`	Select a specific attribute of an element
`*`	Wildcard is used to select any element
`[ ]`	Predicate: filter elements by a condition or position
`.`	Specifies the current element
`..`	Specifies the parent element

XPath functions

XPATH	OUTPUT
`//h1`	Extract all H1 tags
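`//h2[1]`	Extract the first H2 tag within each parent element (use `(//h2)[1]` for the first H2 on the page)
`//h2[2]`	Extract the second H2 tag within each parent element
`//div/p`	Extracts any <p> that is a direct child of a <div>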
`//div[@class='author']`	Extracts any <div> with class “author”
`//p[@class='content']`	Extracts any <p> with class “content”
`//*[@class='content']`	Extracts any element with class “content”
`//ul/li[last()]`	Extracts the last <li> in a <ul>
`//ol[@class='cat']/li[1]`	Extracts the first <li> in a <ol> with class “cat”
`count(//h2)`	Counts the number of H2s (set extraction filter to “Function Value”)
`//a[contains(.,'learn more')]`	Extracts any link with anchor text containing “learn more”
`//a[starts-with(@title,'Written by')]`	Extracts any link with a title starting with “Written by”
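
To experiment with a few of these expressions before adding them to a crawl, a small Python sketch can help. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath (no `contains()`, `starts-with()`, or `count()`) and expects paths relative to the root, hence the leading `.`. The sample page is hypothetical and must be well-formed for this parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html><body>
  <h1>Guide</h1>
  <div class="author"><p>Written by Sam</p></div>
  <div><h2>First</h2><h2>Second</h2></div>
  <ul><li>one</li><li>two</li><li>three</li></ul>
  <ol class="cat"><li>alpha</li><li>beta</li></ol>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

def texts(path):
    """Return the text of every element matched by an ElementTree path."""
    return [el.text for el in root.findall(path)]

h1 = texts(".//h1")                          # like //h1
first_h2 = texts(".//h2[1]")                 # like //h2[1]
div_p = texts(".//div/p")                    # like //div/p
author_divs = root.findall(".//div[@class='author']")
last_li = texts(".//ul/li[last()]")          # like //ul/li[last()]
first_cat = texts(".//ol[@class='cat']/li[1]")
h2_count = len(root.findall(".//h2"))        # stands in for count(//h2)

print(h1, first_h2, div_p, last_li, first_cat, h2_count)
```

Real-world HTML is often not well-formed, so treat this as a way to reason about the expressions rather than a replacement for the crawler.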

How to Extract Common HTML Elements

XPATH	OUTPUT
`//@href`	Extracts all href attribute values (links)
`//a[starts-with(@href,'mailto')]/@href`	Extracts links that start with “mailto:” (email address)
`//a[starts-with(@href,'tel')]/@href`	Extracts links that start with “tel:” (telephone number)
`//img/@src`	Extracts all image source URLs
`//img[contains(@class,'aligncenter')]/@src`	Extracts all image source URLs for images containing the class name “aligncenter”
`//link[@rel='alternate']`	Extracts elements with the rel attribute set to “alternate”
`//@hreflang`	Extracts all hreflang values

Extract Meta Tags (use inner HTML element)

XPATH	OUTPUT
`//meta[@property='article:published_time']/@content`	Extracts the article publish date (commonly-found meta tag on WordPress websites)

Extract Open Graph

XPATH	OUTPUT
`//meta[@property='og:type']/@content`	Extracts the Open Graph type object
`//meta[@property='og:image']/@content`	Extracts the Open Graph featured image URL
`//meta[@property='og:updated_time']/@content`	Extracts the Open Graph updated time

Extract Twitter Cards

XPATH	OUTPUT
`//meta[@name='twitter:card']/@content`	Extracts the Twitter Card type
`//meta[@name='twitter:title']/@content`	Extracts the Twitter Card title
`//meta[@name='twitter:site']/@content`	Extracts the Twitter Card site object (Twitter handle)
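
As a rough illustration of what the Open Graph and Twitter Card expressions return, the sketch below selects `<meta>` elements by attribute and reads their `content` value. It relies on Python's `xml.etree.ElementTree`, which cannot select attributes directly with `/@content`, so the attribute is read with `.get()` instead. The sample `<head>` and the helper name `meta_content` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical <head> fragment, written as well-formed XML for this parser.
HEAD = """
<head>
  <meta property="og:type" content="article" />
  <meta property="og:image" content="https://example.com/cover.png" />
  <meta name="twitter:card" content="summary_large_image" />
  <meta name="twitter:site" content="@example" />
</head>
"""

root = ET.fromstring(HEAD.strip())

def meta_content(attr, value):
    """Return the content of the first <meta> whose attr equals value, or None."""
    el = root.find(f".//meta[@{attr}='{value}']")
    return el.get("content") if el is not None else None

og_type = meta_content("property", "og:type")
og_image = meta_content("property", "og:image")
twitter_card = meta_content("name", "twitter:card")
twitter_title = meta_content("name", "twitter:title")  # absent in this sample

print(og_type, og_image, twitter_card, twitter_title)
```

A `None` result plays the same role as an empty cell in the crawl results: the tag is missing from that page.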

Extract Schema Types

XPATH	OUTPUT
`//*[@itemtype]/@itemtype`	Extracts all of the types of schema markup on a page

Extract Breadcrumb Schema

Here are the custom extractions you use to check breadcrumbs in Screaming Frog.

XPATH	OUTPUT
`//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]/a/@href`	Extracts all breadcrumb links
`//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop][1]/a/@href`	Extracts the first breadcrumb link
`//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]`	Extracts breadcrumb names (set extraction filter to “Extract Text”)
`count(//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop])`	Counts the number of breadcrumb list items (set extraction filter to “Function Value”)

Extract Product Schema

XPATH	OUTPUT
`//*[@itemprop='name']/@content`	Extracts product name
`//*[@itemprop='description']/@content`	Extracts product description
`//*[@itemprop='price']/@content`	Extracts product price
`//*[@itemprop='priceCurrency']/@content`	Extracts product currency
`//*[@itemprop='availability']/@href`	Extracts product availability
`//*[@itemprop='sku']/@content`	Extracts product SKU

Extract Review Schema

XPATH	OUTPUT
`//*[@itemprop='reviewCount']`	Extracts review count
`//*[@itemprop='ratingValue']`	Extracts rating value
`//*[@itemprop='bestRating']`	Extracts the best review rating
`//*[@itemprop='review']/*[@itemprop='name']`	Extracts review name
`//*[@itemprop='review']/*[@itemprop='author']`	Extracts review author
`//*[@itemprop='review']/*[@itemprop='datePublished']/@content`	Extracts the publish date of reviews
`//*[@itemprop='review']/*[@itemprop='reviewBody']`	Extracts the body content of reviews

Extract Local Business & Organization Schema

XPATH	OUTPUT
`//*[contains(@itemtype,'Organization')]/*[@itemprop='name']`	Extracts the organization’s name
`//*[@itemprop='address']/*[@itemprop='streetAddress']`	Extracts the street address
`//*[@itemprop='address']/*[@itemprop='addressLocality']`	Extracts the address locality
`//*[@itemprop='address']/*[@itemprop='addressRegion']`	Extracts the address region
`//*[@itemprop='telephone']`	Extracts the telephone number
`//*[@itemprop='sameAs']/@href`	Extracts the “sameAs” links

Extract Article Schema

XPATH	OUTPUT
`//*[contains(@itemtype,'Article')]/*[@itemprop='headline']`	Extracts the article headline
`//*[@itemprop='author']/*[@itemprop='name']/@content`	Extracts author name
`//*[@itemprop='publisher']/*[@itemprop='name']/@content`	Extracts publisher name
`//*[@itemprop='datePublished']/@content`	Extracts publish date
`//*[@itemprop='dateModified']/@content`	Extracts modified date

Custom Data Extraction with Regex

Wildcards

SYNTAX	FUNCTION
`.`	Match any 1 character
`*`	Match the preceding character 0 or more times
`?`	Match the preceding character 0 or 1 time
`+`	Match the preceding character 1 or more times
`|`	OR

Anchors

SYNTAX	FUNCTION
`^`	Matches at the start of the string (the pattern that follows must begin the string)
`$`	Matches at the end of the string (the pattern before it must end the string)

Groups

SYNTAX	FUNCTION
`( )`	Group the enclosed characters and match them in the exact order
`[ ]`	Match any one of the enclosed characters
`-`	Inside `[ ]`, match any character within the specified range (for example, `[a-z]`)

Escape

SYNTAX	FUNCTION
`\`	Treat character literally, not as regex.
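
The syntax above can be tried out quickly in Python. This is a small sketch using Python's `re` module with made-up sample strings; Screaming Frog uses its own regex engine, so treat the results as illustrative rather than a guarantee of identical behavior.

```python
import re

# Wildcards
any_char = re.findall(r"h.t", "hat hot hut ht")      # "." needs exactly one character
zero_or_more = re.findall(r"go*d", "gd god good")    # "*" allows zero or more "o"
optional = re.findall(r"colou?r", "color colour")    # "?" makes the "u" optional
one_or_more = re.findall(r"go+d", "gd god good")     # "+" needs at least one "o"
either = re.findall(r"cat|dog", "cat dog bird")      # "|" means OR

# Anchors
starts = bool(re.search(r"^https", "https://example.com"))
ends = bool(re.search(r"\.pdf$", "/files/report.pdf"))   # "\." is a literal dot

# Groups, character sets, and ranges
grouped = re.search(r"(ab)+", "ababab").group(0)     # "ab" repeated as a unit
char_set = re.findall(r"[abc]", "cab!")              # any one of a, b, or c
digits = re.findall(r"[0-9]+", "UA-12345-6")         # range 0 through 9

# Escape
literal_plus = re.findall(r"ld\+json", "application/ld+json")

print(any_char, zero_or_more, optional, one_or_more, either)
print(starts, ends, grouped, char_set, digits, literal_plus)
```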

Regex Custom Data Extraction

REGEX	OUTPUT
`["'](UA-.*?)["']`	Extract the Google Analytics tracking ID
`["'](G-.*?)["']`	Extract the Google Analytics 4 (GA4) tracking ID
`["'](AW-.*?)["']`	Extract the Google Ads conversion ID and/or remarketing tag
`["'](GTM-.*?)["']`	Extract the Google Tag Manager and/or Google Optimize ID
`fbq\(["']init["'], ["'](.*?)["']`	Extract the Facebook Pixel ID
`\{ti:["'](.*?)["']\}`	Extract the Bing Ads UET tag
`adroll_adv_id = ["'](.*?)["']`	Extract the AdRoll Advertiser ID
`adroll_pix_id = ["'](.*?)["']`	Extract the AdRoll Pixel ID
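
As a quick sanity check of the tracking ID patterns, the sketch below runs four of them against a made-up snippet. The markup, the IDs, and the `loadTagManager` function name are hypothetical, and Python's `re` module stands in for the crawler's regex engine.

```python
import re

# Patterns copied from the table above.
PATTERNS = {
    "Universal Analytics": r"""["'](UA-.*?)["']""",
    "GA4": r"""["'](G-.*?)["']""",
    "Google Ads": r"""["'](AW-.*?)["']""",
    "Tag Manager": r"""["'](GTM-.*?)["']""",
}

# Hypothetical page source; loadTagManager is a made-up function name.
HTML = """
<script>
  ga('create', 'UA-12345-1', 'auto');
  gtag('config', 'G-ABC123XYZ');
  loadTagManager('GTM-ABCD123');
</script>
"""

found = {name: re.findall(pattern, HTML) for name, pattern in PATTERNS.items()}

for name, ids in found.items():
    print(f"{name}: {ids or 'not found'}")
```

An empty list, as for Google Ads in this sample, is the scripted equivalent of a blank extraction cell.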

Extract All Schema Markup and Schema Types

REGEX	OUTPUT
`["']application/ld\+json["']>(.*?)</script>`	Extracts all of the JSON-LD schema markups
`["']@type["']: ["'](.*?)["']`	Extracts all of the types of JSON-LD schema markup on a page
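
The following sketch shows how these two expressions behave on a small, made-up JSON-LD block. It assumes Python's `re` module, where `re.DOTALL` is needed for `.` to span line breaks. It also parses the captured block with `json.loads`, which is an extra step added here for illustration and not something the crawler does for you.

```python
import json
import re

# Hypothetical page fragment with one JSON-LD block.
HTML = '''<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Desk Lamp",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}
</script>'''

JSON_LD = re.compile(r"""["']application/ld\+json["']>(.*?)</script>""", re.DOTALL)
SCHEMA_TYPE = re.compile(r"""["']@type["']: ["'](.*?)["']""")

blocks = JSON_LD.findall(HTML)            # raw JSON-LD text, one item per script tag
schema_types = SCHEMA_TYPE.findall(HTML)  # every "@type" value on the page

# Optional: parse the first block to work with it as a dictionary.
data = json.loads(blocks[0])

print(schema_types)
print(data["name"])
```

The `@type` pattern expects exactly one space after the colon, so minified JSON-LD without that space would not match as written.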

Extract Breadcrumb Schema

REGEX	OUTPUT
`["']item["']: \{["']@id["']: ["'](.*?)["']`	Extracts breadcrumb links
`["']item["']: \{["']@id["']: ["'].*?["'], ["']name["']: ["'](.*?)["']`	Extracts breadcrumb names

Extract Product Schema

REGEX	OUTPUT
`["']@type["']: ["']Product["'].*?["']name["']: ["'](.*?)["']`	Extracts product name
`["']@type["']: ["']Product["'].*?["']description["']: ["'](.*?)["']`	Extracts product description
`["']@type["']: ["']Product["'].*?["']price["']: ["'](.*?)["']`	Extracts product price
`["']@type["']: ["']Product["'].*?["']priceCurrency["']: ["'](.*?)["']`	Extracts product currency
`["']@type["']: ["']Product["'].*?["']availability["']: ["'](.*?)["']`	Extracts product availability
`["']@type["']: ["']Product["'].*?["']sku["']: ["'](.*?)["']`	Extracts product SKU
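
To illustrate how the product patterns pick out individual fields, here is a minimal sketch that builds the same kind of expression for any field name and runs it on made-up JSON-LD. The helper `product_field` and the sample data are hypothetical, and `re.DOTALL` is an assumption needed in Python so the lazy `.*?` can cross line breaks.

```python
import re

# Hypothetical JSON-LD for a product; all values are made up.
JSON_LD = '''{"@context": "https://schema.org", "@type": "Product",
 "name": "Desk Lamp", "sku": "DL-001",
 "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}}'''

def product_field(field, text):
    """Return the first quoted value of `field` found after the Product type."""
    pattern = (
        r"""["']@type["']: ["']Product["'].*?["']"""
        + re.escape(field)
        + r"""["']: ["'](.*?)["']"""
    )
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None

fields = ["name", "price", "priceCurrency", "sku", "availability"]
product = {field: product_field(field, JSON_LD) for field in fields}

print(product)
```

These patterns only capture quoted values, so a price written as a bare number (for example `"price": 19.99`) would come back empty.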

Extract Review Schema

REGEX	OUTPUT
`["']reviewCount["']: ["'](.*?)["']`	Extracts review count
`["']ratingValue["']: ["'](.*?)["']`	Extracts rating value
`["']bestRating["']: ["'](.*?)["']`	Extracts the best rating

Extract Local Business & Organization Schema

REGEX	OUTPUT
`["']@type["']: ["']Organization["'].*?["']name["']: ["'](.*?)["']`	Extracts organization name
`["']streetAddress["']: ["'](.*?)["']`	Extracts the street address
`["']addressLocality["']: ["'](.*?)["']`	Extracts the address locality
`["']addressRegion["']: ["'](.*?)["']`	Extracts the address region
`["']telephone["']: ["'](.*?)["']`	Extracts the telephone number
`["']sameAs["']: \[(.*?)\]`	Extracts the “sameAs” links

Extract Article or BlogPosting Schema

REGEX	OUTPUT
`["']headline["']: ["'](.*?)["']`	Extracts article headline
`["']author["'].*?["']name["']: ["'](.*?)["']`	Extracts author name
`["']publisher["'].*?["']name["']: ["'](.*?)["']`	Extracts publisher name
`["']datePublished["']: ["'](.*?)["']`	Extracts publish date
`["']dateModified["']: ["'](.*?)["']`	Extracts modified date

The possibilities are endless; please let me know if you want any extractions added to this list.

Published on: 2021-03-10
Updated on: 2024-04-05

Isaac Adams-Hands

Isaac Adams-Hands is the SEO Director at SEO North, a company that provides Search Engine Optimization services. As an SEO Professional, Isaac has considerable expertise in On-page SEO, Off-page SEO, and Technical SEO, which gives him a leg up against the competition.
