Do you want to find the best keywords for your content? Do you need to do a competitive analysis on your blog or website? If so, Screaming Frog is an invaluable tool. This article will walk through how to use it and what information can be found with this tool.
Screaming Frog’s custom extraction feature lets you choose which fields to pull from each webpage and save the results as CSV files. You can then open these in Excel and filter them by data type, such as dates, numbers, or words. For example, if you wanted all the URLs that contain “keyword” but only the pages with more than 5 matches per URL, you could filter the export down to the URLs that meet those requirements. Keep reading to learn more.
- How to Use Screaming Frog Custom Extraction
- Basic Syntax for XPath Web Scraping
- XPath functions
- How to Extract Common HTML Elements
- How to Extract Common HTML Attributes
- Extract Meta Tags (use inner HTML)
- Extract Open Graph
- Extract Twitter Cards
- Extract Schema Types
- Extract Breadcrumb Schema
- Extract Product Schema
- Extract Review Schema
- Extract Local Business & Organization Schema
- Extract Article Schema
- Custom Extraction with Regex
- Regex Custom Extraction
- Extract All Schema Markup and Schema Types
How to Use Screaming Frog Custom Extraction
If you haven’t already, please download and install Screaming Frog. The free version scans up to 500 pages at a time; if you want to scan more than that, you will have to pay for the upgrade.
Go to the Custom Extraction Setting
In Screaming Frog, go to Configuration > Custom > Extraction.
+Add the extraction rules
Next, you will need to +Add and set up your extraction rules.
Give the rule a name, choose CSSPath, XPath, or Regex as the selector type, then enter your expression. If you aren’t sure which selector or expression you need, look at the examples below.
Example #1 – Extracting Facebook Pixels (tracking codes)
Here is an example of how you would scan for missing tracking codes.
1. Go to the Custom Extraction settings under Configuration > Custom > Extraction.

2. Add the snippet below to the Regex value by pressing the +Add button and pasting.
fbq\(["']init["'], ["'](.*?)["']

You can use the same approach for finding missing Google Analytics, Google Ads, or Google Tag Manager tracking codes.
REGEX | OUTPUT |
---|---|
["'](UA-.*?)["'] | Google Analytics |
["'](AW-.*?)["'] | Google Ads conversion ID |
["'](GTM-.*?)["'] | Google Tag Manager |
3. After pressing Start and scanning your website, you can find the results under the Custom Extraction tab. As you can see, one of my pages is missing a Facebook Pixel.
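Before running a full crawl, it can help to check what these patterns capture from a snippet of page source. The sketch below uses Python's `re` module for that. The sample HTML and the IDs in it are invented for illustration, and Python's regex flavour may differ in small ways from the one Screaming Frog uses.

```python
import re

# Hypothetical page source; every ID below is made up.
SAMPLE_HTML = """
<script>fbq('init', '1234567890');</script>
<script>gtag('config', "UA-12345-1");</script>
<script>})(window,document,'script','dataLayer','GTM-ABC123');</script>
"""

# The same patterns as in the table above.
PATTERNS = {
    "Facebook Pixel": r"""fbq\(["']init["'], ["'](.*?)["']""",
    "Google Analytics": r"""["'](UA-.*?)["']""",
    "Google Ads": r"""["'](AW-.*?)["']""",
    "Google Tag Manager": r"""["'](GTM-.*?)["']""",
}


def extract_ids(html):
    """Return {label: [captured IDs]} for every pattern."""
    return {label: re.findall(pattern, html) for label, pattern in PATTERNS.items()}


results = extract_ids(SAMPLE_HTML)
for label, ids in results.items():
    # An empty list means the tracking code was not found on the page.
    print(label, ids or "MISSING")
```

In this sample, the Google Ads pattern finds nothing, which is the same signal as an empty cell in the Custom Extraction tab.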

Example #2 – Extracting Email Addresses
Here is how to scrape email addresses from a list of websites.
1. Go to the Custom Extraction settings under Configuration > Custom > Extraction.

2. Add the snippet below to the XPath value by pressing the +Add button and pasting.
//a[starts-with(@href,'mailto')]/@href

3. Enable List Mode by going to Mode > List.
Upload your list of pages you want to target. Or use Spider Mode to scan every page on a specific website.
Press Start and let the program run.

4. Find your results under the Custom Extraction Tab.

Note: Don’t be a jerk! Please respect crawl rates and people’s privacy, and do not use this to spam people.
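The extracted values are raw `href` attributes, so they still carry the `mailto:` prefix and sometimes a `?subject=...` query string. A minimal clean-up sketch in Python, assuming you have copied the extracted values out of the export into a list (the `clean_mailto` helper and the sample addresses are hypothetical):

```python
from urllib.parse import unquote


def clean_mailto(values):
    """Turn extracted mailto: hrefs into a sorted list of unique addresses."""
    emails = set()
    for value in values:
        if not value.lower().startswith("mailto:"):
            continue
        # Drop the scheme and any ?subject=... query string.
        address_part = value[len("mailto:"):].split("?", 1)[0]
        # A mailto link may list several comma-separated recipients.
        for address in unquote(address_part).split(","):
            address = address.strip().lower()
            if address:
                emails.add(address)
    return sorted(emails)


# Made-up values in the shape the XPath above would return.
extracted = [
    "mailto:Info@example.com?subject=Hello",
    "mailto:sales@example.com,support@example.com",
    "mailto:info@example.com",
]
print(clean_mailto(extracted))
```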
Basic Syntax for XPath Web Scraping
Here is the basic XPath syntax you will need to write your own extraction rules:
SYNTAX | FUNCTION |
---|---|
// | Search anywhere in the document |
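/ | Select from the document root, or step to a direct child of the previous element |
@ | Select a specific attribute of an element |
* | The wildcard is used to select any element. |
[ ] | Filter elements by a condition, such as a position or an attribute value |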
. | Specifies the current element |
.. | Specifies the parent element |
XPath functions
OPERATOR | FUNCTION |
---|---|
starts-with(x,y) | Checks if x starts with y |
contains(x,y) | Checks if x contains y |
last() | Finds the last item in a set |
count(XPath) | Counts occurrences of the XPath extraction |
How to Extract Common HTML Elements
XPATH | OUTPUT |
---|---|
//h1 | Extract all H1 tags |
(//h3)[1] | Extract the first H3 tag on the page (a bare //h3[1] returns the first H3 inside each parent element) |
(//h3)[2] | Extract the second H3 tag on the page |
//div/p | Extract any <p> contained within a <div> |
//div[@class='author'] | Extract any <div> with class “author” |
//p[@class='bio'] | Extract any <p> with class “bio” |
//*[@class='bio'] | Extract any element with class “bio” |
//ul/li[last()] | Extract the last <li> in a <ul> |
//ol[@class='cat']/li[1] | Extract the first <li> in a <ol> with class “cat” |
count(//h2) | Count the number of H2’s (set extraction filter to “Function Value”) |
//a[contains(.,'click here')] | Extract any link with anchor text containing “click here.” |
//a[starts-with(@title,'Written by')] | Extract any link with a title starting with “Written by.” |
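Positional predicates such as `[1]` and `[last()]` are evaluated relative to each parent element, which is easy to miss. The sketch below illustrates this with Python's standard-library `xml.etree.ElementTree`. Note the assumptions: ElementTree supports only a small subset of XPath (no `contains()`, `starts-with()`, or `count()`), it needs well-formed markup, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup (ElementTree cannot parse sloppy HTML).
SAMPLE = """
<html><body>
  <div><h3>First in div one</h3><h3>Second in div one</h3></div>
  <div><h3>First in div two</h3></div>
  <ul><li>one</li><li>two</li><li>three</li></ul>
  <p class="bio">Short bio</p>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# .//h3[1] matches the first <h3> inside EACH parent, so two results here.
first_per_parent = [h.text for h in root.findall(".//h3[1]")]

# find() returns only the first match in document order.
first_on_page = root.find(".//h3").text

# [last()] picks the last <li> of its parent <ul>.
last_item = [li.text for li in root.findall(".//ul/li[last()]")]

# [@class='bio'] filters on an attribute value; * matches any element.
bio = [el.text for el in root.findall(".//*[@class='bio']")]

print(first_per_parent)
print(first_on_page)
print(last_item)
print(bio)
```

If a rule returns more values per page than you expected, a per-parent position match like this is a likely cause.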
How to Extract Common HTML Attributes
XPATH | OUTPUT |
---|---|
//@href | Extract every href value on the page (links, plus any other element with an href, such as <link>) |
//a[starts-with(@href,'mailto')]/@href | Extract links that start with “mailto” (email addresses) |
//img/@src | Extract all image source URLs |
//img[contains(@class,'aligncenter')]/@src | Extract all image source URLs for images with the class name containing “aligncenter.” |
//link[@rel='alternate'] | Extract elements with the rel attribute set to “alternate.” |
//@hreflang | Extract all hreflang values |
Extract Meta Tags (use inner HTML)
XPATH | OUTPUT |
---|---|
//meta[@property='article:published_time']/@content | Extract the article publish date (commonly-found meta tag on WordPress websites) |
Extract Open Graph
XPATH | OUTPUT |
---|---|
//meta[@property='og:type']/@content | Extract the Open Graph type object |
//meta[@property='og:image']/@content | Extract the Open Graph featured image URL |
//meta[@property='og:updated_time']/@content | Extract the Open Graph updated time |
Extract Twitter Cards
XPATH | OUTPUT |
---|---|
//meta[@name='twitter:card']/@content | Extract the Twitter Card type |
//meta[@name='twitter:title']/@content | Extract the Twitter Card title |
//meta[@name='twitter:site']/@content | Extract the Twitter Card site object (Twitter handle) |
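To sanity-check a single page outside of a crawl, the same meta, Open Graph, and Twitter Card values can be read with a few lines of Python. This is only a sketch using the standard-library `html.parser`; the `MetaCollector` class and the sample `<head>` are hypothetical and not part of Screaming Frog.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> content values keyed by their property or name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph uses property="..."; Twitter Cards use name="...".
        key = attrs.get("property") or attrs.get("name")
        if key and "content" in attrs:
            self.meta[key] = attrs["content"]


# Invented sample markup.
SAMPLE_HEAD = """
<head>
  <meta property="og:type" content="article">
  <meta property="og:image" content="https://example.com/cover.png">
  <meta name="twitter:card" content="summary_large_image">
  <meta name="twitter:site" content="@example">
</head>
"""

collector = MetaCollector()
collector.feed(SAMPLE_HEAD)
print(collector.meta["og:type"])
print(collector.meta["twitter:card"])
```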
Extract Schema Types
XPATH | OUTPUT |
---|---|
//*[@itemtype]/@itemtype | Extract all of the types of schema markup on a page |
Extract Breadcrumb Schema
XPATH | OUTPUT |
---|---|
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]/a/@href | Extract all breadcrumb links |
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop][1]/a/@href | Extract the first breadcrumb link |
//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop] | Extract breadcrumb names (set extraction filter to “Extract Text”) |
count(//*[contains(@itemtype,'BreadcrumbList')]/*[@itemprop]) | Count the number of breadcrumb list items (set extraction filter to “Function Value”) |
Extract Product Schema
XPATH | OUTPUT |
---|---|
//*[@itemprop='name']/@content | Extract product name |
//*[@itemprop='description']/@content | Extract product description |
//*[@itemprop='price']/@content | Extract product price |
//*[@itemprop='priceCurrency']/@content | Extract product currency |
//*[@itemprop='availability']/@href | Extract product availability |
//*[@itemprop='sku']/@content | Extract product SKU |
Extract Review Schema
XPATH | OUTPUT |
---|---|
//*[@itemprop='reviewCount'] | Extract review count |
//*[@itemprop='ratingValue'] | Extract rating value |
//*[@itemprop='bestRating'] | Extract best review rating |
//*[@itemprop='review']/*[@itemprop='name'] | Extract review name |
//*[@itemprop='review']/*[@itemprop='author'] | Extract review author |
//*[@itemprop='review']/*[@itemprop='datePublished']/@content | Extract the publish date of reviews |
//*[@itemprop='review']/*[@itemprop='reviewBody'] | Extract the body content of reviews |
Extract Local Business & Organization Schema
XPATH | OUTPUT |
---|---|
//*[contains(@itemtype,'Organization')]/*[@itemprop='name'] | Extract the organization’s name |
//*[@itemprop='address']/*[@itemprop='streetAddress'] | Extract the street address |
//*[@itemprop='address']/*[@itemprop='addressLocality'] | Extract the address locality |
//*[@itemprop='address']/*[@itemprop='addressRegion'] | Extract the address region |
//*[@itemprop='telephone'] | Extract the telephone number |
//*[@itemprop='sameAs']/@href | Extract the “sameAs” links |
Extract Article Schema
XPATH | OUTPUT |
---|---|
//*[contains(@itemtype,'Article')]/*[@itemprop='headline'] | Extract the article headline |
//*[@itemprop='author']/*[@itemprop='name']/@content | Extract author name |
//*[@itemprop='publisher']/*[@itemprop='name']/@content | Extract publisher name |
//*[@itemprop='datePublished']/@content | Extract publish date |
//*[@itemprop='dateModified']/@content | Extract modified date |
Custom Extraction with Regex
Wildcards
SYNTAX | FUNCTION |
---|---|
. | Match any 1 character |
* | Match preceding character 0 or more times |
? | Match preceding character 0 or 1 time |
+ | Match preceding character 1 or more times |
| | OR |
Anchors
SYNTAX | FUNCTION |
---|---|
^ | Match at the start of the string; the pattern that follows must begin there |
$ | Match at the end of the string; the pattern before it must finish there |
Groups
SYNTAX | FUNCTION |
---|---|
( ) | Group the enclosed pattern and capture what it matches |
[ ] | Match any one of the enclosed characters |
- | Inside [ ], match any character in the specified range, e.g. [0-9] |
Escape
SYNTAX | FUNCTION |
---|---|
\ | Treat character literally, not as regex. |
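Most of the patterns in this cheat sheet rely on `.*?`, where the `?` makes `.*` stop at the first possible match instead of the last. A short Python sketch of that difference, plus character classes, alternation, and escaping; the sample strings are made up, and Python's `re` is used here only as a stand-in for trying the syntax out.

```python
import re

# Made-up source with two IDs on one line.
text = 'ga("create", "UA-111-1"); ga("create", "UA-222-2");'

# Lazy: .*? stops at the first closing quote, so each ID is captured separately.
lazy = re.findall(r"""["'](UA-.*?)["']""", text)

# Greedy: .* runs to the LAST quote, swallowing everything in between.
greedy = re.findall(r"""["'](UA-.*)["']""", text)

# [0-9]+ matches one or more characters from the range 0 to 9.
digits = re.findall(r"[0-9]+", "UA-111-1")

# (UA|GTM) matches either alternative and captures which one was found.
prefixes = re.findall(r"(UA|GTM)-", "UA-1 GTM-2 AW-3")

# \( matches a literal parenthesis instead of opening a group.
literal = re.findall(r"fbq\(", "fbq('init')")

print(lazy)
print(greedy)
print(digits)
print(prefixes)
print(literal)
```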
Regex Custom Extraction
REGEX | OUTPUT |
---|---|
["'](UA-.*?)["'] | Extract the Google Analytics tracking ID |
["'](AW-.*?)["'] | Extract the Google Ads conversion ID and/or remarketing tag |
["'](GTM-.*?)["'] | Extract the Google Tag Manager and/or Google Optimize ID |
fbq\(["']init["'], ["'](.*?)["'] | Extract the Facebook Pixel ID |
\{ti:["'](.*?)["']\} | Extract the Bing Ads UET tag |
adroll_adv_id = ["'](.*?)["'] | Extract the AdRoll Advertiser ID |
adroll_pix_id = ["'](.*?)["'] | Extract the AdRoll Pixel ID |
Extract All Schema Markup and Schema Types
REGEX | OUTPUT |
---|---|
["']application/ld\+json["']>(.*?)</script> | Extract all of the JSON-LD schema markup |
["']@type["']: *["'](.*?)["'] | Extract all of the types of JSON-LD schema markup on a page |
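Once the raw JSON-LD block has been extracted, parsing it as JSON is often more dependable than stacking further regexes on top. The sketch below applies the first pattern above in Python and then walks the parsed data for `@type` values. The sample markup and the `collect_types` helper are hypothetical, and the `re.DOTALL` flag is an assumption added here so that `.` can span the line breaks in multi-line JSON.

```python
import json
import re

# Invented sample page source.
SAMPLE_HTML = """
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {"@type": "Organization", "name": "Example Co"},
    {"@type": "BreadcrumbList",
     "itemListElement": [{"@type": "ListItem", "position": 1, "name": "Home"}]}
  ]
}
</script>
"""

JSON_LD_RE = re.compile(
    r"""["']application/ld\+json["']>(.*?)</script>""",
    re.DOTALL,  # assumption: lets "." match newlines inside the block
)


def collect_types(node, found):
    """Recursively gather every @type value from parsed JSON-LD."""
    if isinstance(node, dict):
        schema_type = node.get("@type")
        if isinstance(schema_type, str):
            found.append(schema_type)
        elif isinstance(schema_type, list):
            found.extend(schema_type)
        for value in node.values():
            collect_types(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_types(item, found)
    return found


blocks = [json.loads(raw) for raw in JSON_LD_RE.findall(SAMPLE_HTML)]
types = collect_types(blocks, [])
print(types)
```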
Extract Breadcrumb Schema
REGEX | OUTPUT |
---|---|
["']item["']: *\{["']@id["']: *["'](.*?)["'] | Extract breadcrumb links |
["']item["']: *\{["']@id["']: *["'].*?["'], *["']name["']: *["'](.*?)["'] | Extract breadcrumb names |
Extract Product Schema
REGEX | OUTPUT |
---|---|
["']@type["']: *["']Product["'].*?["']name["']: *["'](.*?)["'] | Extract product name |
["']@type["']: *["']Product["'].*?["']description["']: *["'](.*?)["'] | Extract product description |
["']@type["']: *["']Product["'].*?["']price["']: *["'](.*?)["'] | Extract product price |
["']@type["']: *["']Product["'].*?["']priceCurrency["']: *["'](.*?)["'] | Extract product currency |
["']@type["']: *["']Product["'].*?["']availability["']: *["'](.*?)["'] | Extract product availability |
["']@type["']: *["']Product["'].*?["']sku["']: *["'](.*?)["'] | Extract product SKU |
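One thing to watch with these patterns is that they expect every value to be wrapped in quotes. JSON-LD allows a price to be written as a bare number, and in that case the price pattern finds nothing. The sketch below shows this with an invented product block, with parsed JSON as a fallback; `re.DOTALL` is an assumption added so the pattern can cross line breaks.

```python
import json
import re

# Invented JSON-LD where the price is a number, not a quoted string.
PRODUCT_JSON = """
{"@type": "Product", "name": "Example Widget", "sku": "EW-1",
 "offers": {"@type": "Offer", "price": 19.99, "priceCurrency": "USD"}}
"""

# The price and SKU patterns from the table above.
PRICE_RE = r"""["']@type["']: *["']Product["'].*?["']price["']: *["'](.*?)["']"""
SKU_RE = r"""["']@type["']: *["']Product["'].*?["']sku["']: *["'](.*?)["']"""

regex_price = re.findall(PRICE_RE, PRODUCT_JSON, re.DOTALL)
regex_sku = re.findall(SKU_RE, PRODUCT_JSON, re.DOTALL)

# Parsing the block as JSON reads the value whatever its type.
product = json.loads(PRODUCT_JSON)
parsed_price = product["offers"]["price"]

print(regex_price)   # empty: the unquoted number is not matched
print(regex_sku)
print(parsed_price)
```

If a price column comes back empty during a crawl, checking whether the site outputs numeric prices is a reasonable first step.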
Extract Review Schema
REGEX | OUTPUT |
---|---|
["']reviewCount["']: *["'](.*?)["'] | Extract review count |
["']ratingValue["']: *["'](.*?)["'] | Extract rating value |
["']bestRating["']: *["'](.*?)["'] | Extract best rating |
Extract Local Business & Organization Schema
REGEX | OUTPUT |
---|---|
["']@type["']: *["']Organization["'].*?["']name["']: *["'](.*?)["'] | Extract organization name |
["']streetAddress["']: *["'](.*?)["'] | Extract the street address |
["']addressLocality["']: *["'](.*?)["'] | Extract the address locality |
["']addressRegion["']: *["'](.*?)["'] | Extract the address region |
["']telephone["']: *["'](.*?)["'] | Extract the telephone number |
["']sameAs["']: *\[(.*?)\] | Extract the “sameAs” links |
Extract Article or BlogPosting Schema
REGEX | OUTPUT |
---|---|
["']headline["']: *["'](.*?)["'] | Extract article headline |
["']author["'].*?["']name["']: *["'](.*?)["'] | Extract author name |
["']publisher["'].*?["']name["']: *["'](.*?)["'] | Extract publisher name |
["']datePublished["']: *["'](.*?)["'] | Extract publish date |
["']dateModified["']: *["'](.*?)["'] | Extract modified date |
The possibilities are endless; please let me know if you want any extractions added to this Screaming Frog XPath cheat sheet.
Screaming Frog is a valuable tool for anyone who wants to find the best keywords, do a competitive analysis on their blog or website, or perform any other task that requires crawling and analyzing your site. It’s easy to use and can provide you with a lot of information about your own content and what potential competitors are doing online. If you have questions about how it works or want more in-depth guidance on using the software, please email me. I will be happy to help!