I recently stumbled upon Screaming Frog’s Head of SEO Patrick Langridge’s article on How to Scrape Google Search Features Using XPath. In the comment section, a user asked whether Screaming Frog could scrape more than just the initial questions Google provides. Langridge responded:
Hey John – the config doesn’t crawl more than the standard 4 questions. Do share if you have a python solution, that would be cool to see! – Patrick Langridge
Feeling inspired by this response, I decided to come up with my own Python solution that could scrape more than the four initial questions.
Google’s “People also ask” section with the initial standard questions
As you already know, in order to get more than the initial standard questions Google provides, a user has to click one of the question toggles. For each toggle clicked, Google appends additional related questions below the existing ones.
Infinitely expanding questions – Source: https://moz.com/blog/infinite-people-also-ask-boxes
So all we really want to do is click the question toggles. Sounds easy. The issue is that the current version of Screaming Frog isn’t able to interact with webpages, or more specifically, it can’t click the question accordions to reveal more questions. Fortunately for us, there’s open-source software that can automate this type of web browser behavior.
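To illustrate the idea, here’s a minimal sketch of a click loop in the style Selenium makes possible. The CSS selector is an assumption (Google changes its markup often, so verify it in DevTools), and the function name is my own, not taken from the script:

```python
import time

# Hypothetical CSS selector for the PAA toggles. Google's markup changes
# frequently, so verify the selector in DevTools before relying on it.
PAA_TOGGLE_SELECTOR = "div.related-question-pair"

def expand_paa(driver, total_clicks, pause=1.0):
    """Click PAA toggles until total_clicks is reached or no new ones appear.

    `driver` is expected to be a Selenium WebDriver, but only its
    `find_elements` method is used, so any object providing it works.
    """
    clicks = 0
    while clicks < total_clicks:
        # Re-query on every pass: each click makes Google append new toggles.
        toggles = driver.find_elements("css selector", PAA_TOGGLE_SELECTOR)
        if clicks >= len(toggles):
            break  # Google stopped appending new questions
        toggles[clicks].click()
        clicks += 1
        time.sleep(pause)  # give the new questions time to render
    return clicks
```

Each click reveals more questions, so re-querying the toggle list on every iteration is what keeps the loop moving past the initial four.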
What exactly do we want to extract?
The basic anatomy of a PAA result.
The following pieces of data from each related question will be extracted:
- Question text
- Answer content text
- Answer title tag text
- Answer URL
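Put another way, each related question yields one record with four fields. A small container like this (the names are my own, not the script’s) shows the shape of the data we’re after:

```python
from dataclasses import dataclass

@dataclass
class PAAResult:
    question: str      # the related question's text
    answer_text: str   # the answer content shown in the box
    answer_title: str  # the answering page's title tag text
    answer_url: str    # the answering page's URL
```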
OK, how do we actually do this?
Short Answer: With the script below. 👇 (You’ll need to do some stuff before you actually use it.)
(The full script is embedded here as a GitHub gist.)
The script above requires the following modules to be installed on your machine. If you’re unfamiliar with them, please take some time to get acquainted:
- Selenium (Automates web browser behavior)
- BeautifulSoup (HTML parser / iterates & navigates the DOM)
- lxml (HTML parser)
- ChromeDriver (Allows us to use Chrome with Selenium)
- XlsxWriter (Creates Excel files)
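The Python packages can typically be installed with pip (ChromeDriver is a separate binary download, matched to your Chrome version):

```shell
pip install selenium beautifulsoup4 lxml XlsxWriter
```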
I’ve got those modules installed now. What else do I have to do?
Starting at line 103, there are a few variables that you’ll need to update.
- questions: The list of initial questions you want to scrape.
- pathToChromeDriver: The directory path to your ChromeDriver executable.
- totalClicks: The total number of clicks ChromeDriver will make before moving on to the next question.
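Filled in, those variables might look something like this. The values below are placeholders, so swap in your own queries and path:

```python
# Placeholder values -- substitute your own.
questions = [
    "how to scrape google search features",
    "what is xpath",
]
pathToChromeDriver = "/usr/local/bin/chromedriver"  # wherever the binary lives
totalClicks = 10  # toggles clicked per question before moving on
```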
Once you’ve updated those variables, you should be good to run the script. When it finishes, you’ll have a fresh .xlsx file in the same directory you ran the script from.
The following columns of data are provided:
- The initial question
- The related question
- The related question’s title tag
- The related question’s title tag character count (I figure why not)
- The related question’s answer content
- The related question’s answer content character count (Again, just because.)
- The related question’s URL
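The two character-count columns are just `len()` of the corresponding text. As a sketch, each spreadsheet row could be assembled like this (the function name and column order here mirror the list above but are my own, not the script’s):

```python
def build_row(initial_q, related_q, title, answer, url):
    """Assemble one output row, including the two character counts."""
    return [
        initial_q,
        related_q,
        title,
        len(title),   # title tag character count
        answer,
        len(answer),  # answer content character count
        url,
    ]
```

Each list can then be written to the sheet with XlsxWriter’s `worksheet.write_row(row_index, 0, row)`.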
And that’s it.
(If you have any questions please feel free to reach out.)