Screen Scrape with Headless Chrome and Puppeteer
Screen scrape more effectively with Chrome and Puppeteer.
I have been screen scraping for over a decade, according to this blog. I chose my wedding day by screen scraping. Most recently, I started building up git histories of the contents of various websites.
Over time these techniques have gotten more and more advanced. The oldest relied on mere numeric or date-based URL patterns; the newest (before this post) used CSS selectors to extract the contents of a given page.
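For readers who have not seen the date-based approach, here is a minimal sketch of it. The URL layout (`/archive/YYYY/MM/DD/`), the `example.com` host, and the `dateUrls` helper are all made up for illustration; a real site will have its own pattern.

```javascript
// Build one URL per day for a hypothetical archive laid out as
// <base>/YYYY/MM/DD/. Nothing is fetched here; this only generates the list.
function dateUrls(base, start, days) {
  const urls = [];
  for (let i = 0; i < days; i++) {
    // Date.UTC normalizes overflowing days, so Feb 29 in a non-leap year
    // rolls over to Mar 1.
    const d = new Date(Date.UTC(start.year, start.month - 1, start.day + i));
    const yyyy = d.getUTCFullYear();
    const mm = String(d.getUTCMonth() + 1).padStart(2, '0');
    const dd = String(d.getUTCDate()).padStart(2, '0');
    urls.push(`${base}/${yyyy}/${mm}/${dd}/`);
  }
  return urls;
}

console.log(dateUrls('https://example.com/archive', { year: 2023, month: 2, day: 27 }, 3));
// prints the URLs for 2023/02/27, 2023/02/28, and 2023/03/01
```

Each generated URL can then be fetched and parsed with whatever tool fits the site.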
Yesterday I discovered that one of my targets added some anti-scrape technology. I briefly considered just giving up, but their website is so slow and I have to use it, so I forged ahead.
🔗 Screen Scraping with Puppeteer and Headless Chrome
Rather than point at some poor webhost for these examples, I’ll point at my own blog. You can all scrape this and I’ll be fine. I am not doing anything to counter scraping, but the code I’ll share here works better than anything else I’ve done so far.
Puppeteer is a tool provided by Google to drive headless Chrome. I find this uses fewer resources than Selenium (see the appendix). Here’s an example script:
#!/usr/bin/node
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: 'new' });
  const page = await browser.newPage();
  await page.goto("https://blog.afoolishmanifesto.com/");

  // Collect the href of every link inside an h1.
  const results = await page.$$eval('h1 a',
    es => es.map(
      e => e.getAttribute('href')
    )
  );

  process.stdout.write(JSON.stringify(results) + "\n");
  await browser.close();
})();
To run this you’ll need to install nodejs and also install puppeteer (npm i puppeteer). Puppeteer automatically downloads a copy of Chrome, so I found it very easy to set up.
Also note the headless: 'new' parameter: that is a brand new feature of Chrome that makes the headless browser much closer to a genuine browser.
The hardest part about the above for me was learning all the new JavaScript syntax. I found myself reading about await, Promise, arrow function expressions, and more.
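If that syntax is new to you as well, here is a tiny sketch of the three pieces working together, with no browser involved. The `sleep` and `slowDouble` names are invented for this example.

```javascript
// An arrow function that returns a Promise; the Promise resolves after
// roughly `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// An async function always returns a Promise. Inside it, `await` pauses
// the function until the awaited Promise settles.
const slowDouble = async (n) => {
  await sleep(10);
  return n * 2;
};

// The same "async arrow function, called immediately" shape as the
// scraping script, which is what lets `await` be used at the top of a script.
(async () => {
  const value = await slowDouble(21);
  console.log(value); // prints 42
})();
```

The scraping script has the same structure: every Puppeteer call returns a Promise, and `await` unwraps each one in order.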
By the way, I had hoped to use TypeScript for this but the compiler was slow enough that I just rolled with pure JavaScript. I wonder how long it will take me to regret that?
Sometimes software is a punishment, but without any edification. I reject any assertion that I must use software as built or intended. I scrape ethically: I will do what I can to avoid undue load on the remote site, and I don’t scrape and then resell contents. I will use the skills I have developed to make my life easier.
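One way to keep the load down is to fetch pages strictly one at a time with a pause between them. This is a sketch under assumptions, not the exact code I run: `fetchPolitely`, `wait`, and the one second default delay are all made up for illustration, and the fetching itself is passed in as a function so the example runs without a browser.

```javascript
// Resolve after roughly `ms` milliseconds.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit URLs sequentially, pausing `delayMs` between requests so the
// remote site never sees more than one request from us at a time.
// `fetchOne` is any async function that takes a URL and returns a result.
async function fetchPolitely(urls, fetchOne, delayMs = 1000) {
  const results = [];
  for (const [i, url] of urls.entries()) {
    if (i > 0) await wait(delayMs); // no pause before the first request
    results.push(await fetchOne(url));
  }
  return results;
}

// Demo with a stand-in fetcher that touches no network.
(async () => {
  const fakeFetch = async (url) => `contents of ${url}`;
  const pages = await fetchPolitely(['/a', '/b'], fakeFetch, 50);
  console.log(pages);
})();
```

With Puppeteer, `fetchOne` could be something like `async (url) => { await page.goto(url); return page.title(); }`, reusing a single page for every URL.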
(Affiliate Links Below.)
Here are a few books I recently bought and suggest checking out:
The Idiot: I have been wanting to read this for a while. It was a struggle to read, but I enjoyed it through and through. I was surprised how relatable it was! I found the book much more charming than Crime and Punishment, but still firmly Dostoyevsky.
The Name of the Rose: My better half suggested this one to me. Normally we don’t read the same kind of literature but she thought I’d enjoy this and she’s absolutely right. I love the philosophical asides and the fourteenth century setting.
The Practicing Stoic: Stoicism is embarrassingly popular right now. I heard of this book in a class put on by Mahmoud Rasmi. I had already read a couple Ryan Holiday books and all of Taleb’s Incerto, so this was me deepening my understanding rather than getting started.
Hands Employed Aright: This is a book about Jonathan Fisher, a parson from Blue Hill, Maine. Fisher’s life and breadth of activity (notably the woodworking) inspire me. This is a great book to read in the evenings when winding down.
🔗 Appendix: Selenium and Python
Originally I solved the above using Python and Selenium. Here’s a translation of the above code:
#!/usr/bin/python3
import json

from pyvirtualdisplay import Display
from selenium import webdriver

# Run Firefox inside an invisible virtual display.
display = Display(visible=0, size=(800, 600))
display.start()
b = webdriver.Firefox()
try:
    links = []
    b.get('https://blog.afoolishmanifesto.com/')
    for el in b.find_elements('css selector', 'h1 a'):
        links.append(el.get_attribute('href'))
    print(json.dumps(links))
finally:
    b.quit()
    display.stop()
There were two main reasons I switched the code to JavaScript. First and foremost, the selenium documentation was hard for me to follow. I suspect that if I used Python all the time I’d get used to whatever style this is. The second reason was that I was writing my code in the context of an app that has other JavaScript (though certainly not only JavaScript.) I do not intend to only write in a single language, but where possible I try to reduce the requisite ecosystem for a given project.
Posted Mon, Feb 20, 2023