Hi! In this video we are going to scrape data from websites. To be precise, in this example we want to extract an email address, in my case workflows@gmail.com, from my website. I've also prepared a second version of this website, scrape.workflows.com, where the presentation of the email is a bit different.
Let's go now to the n8n workflow and see how we can do a very basic scrape of those websites. I've prepared two HTTP Request nodes, which are very simple GET requests to both websites: in the first node I pointed to the URL workflows.com, but in the second one the URL is scrape.workflows.com.
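As a rough equivalent of what these two HTTP Request nodes do, here is a minimal sketch in plain Node.js (this is my illustration, not anything shown in the video; it assumes Node 18+ for the built-in fetch, and the URLs are the ones from the video):

```js
// Plain GET requests, like the two HTTP Request nodes: no browser, no JavaScript execution.
// Save as scrape.mjs and run with: node scrape.mjs (top-level await needs an ES module).
const urls = ['https://workflows.com', 'https://scrape.workflows.com'];

for (const url of urls) {
  const response = await fetch(url);     // simple GET request
  const html = await response.text();    // raw HTML source, exactly as the server sent it
  console.log(url, html.slice(0, 200));  // preview the beginning of each page
}
```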
Let me now execute this workflow and see what the difference between those two websites is. As you can see, the HTTP Request node returns the HTML representation of the website, so if I want to find an email address I simply need to press Cmd+F and look for it in the source code. In this case my email address is hidden in an a-tag in the HTML code. But how does the situation look in the second node? Well, if I press Cmd+F once again and paste my email address, there are no results. But when I shorten the search to just gmail.com, you can see that the email address does exist, only it is hidden behind JavaScript code. So the HTTP Request node can return the code of a website, but it doesn't include a JavaScript interpreter.

Let's see now how we can automate the extraction of email addresses from those outputs. I've pinned the data in both HTTP Request nodes, and I've also added Code nodes where I pasted a snippet of JavaScript. In this code you can find a regular expression that should help us retrieve email addresses from a string. Here is an example string, and as you can see, only the email address matches the pattern of the regular expression. The next lines of code look for the parts of the string that match the pattern, filter out the empty results, and return an array that includes only the email addresses.
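The exact snippet isn't shown on screen, but a minimal sketch of such a Code node could look like this (the regex pattern and the data property name are my assumptions; an HTTP Request node set to return a string typically puts the body under data):

```js
// n8n Code node, "Run Once for All Items" mode - a sketch, not the video's exact snippet.
// Assumption: a common email regex; the one used in the video is not shown.
const emailPattern = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Assumption: the HTTP Request node returned the page body as a string under "data".
const html = $input.first().json.data || '';

const matches = html.match(emailPattern) || []; // every substring matching the pattern
const emails = matches.filter(Boolean);         // filter out empty results

return [{ json: { emails } }];                  // an array holding only the email addresses
```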
I've connected both HTTP Request nodes to the same code, so let's see what the outputs look like. In the first Code node you can see that the email address was returned properly, but in the second one we just have an empty array. The reason for this situation is that there is no readable email address in the output of the second HTTP Request node. We are going to change that with Puppeteer. Puppeteer is basically an API to control headless Chrome, so you can do a lot of things with it: you can, for example, generate screenshots, automate form submissions, and do many other things that you would normally do manually in your browser.

Now I go to VS Code and make a new Node project. I've typed npm init and pressed Enter through whatever shows up, and once I have my package.json I install Puppeteer into the project by typing npm install puppeteer. After a few seconds, when Puppeteer is installed, you can go to your project folder and simply add a new file, index.js. This is the file where we add all the instructions to control the browser. I've pasted a ready snippet of code there, and you can also find it in the description. Let's go quickly through this code and see what it actually does: in the first few lines we define the getPageContent function and add some parameters; then we tell the browser to open a new page and tell it which page to open, in this case workflows.com; next it should extract the text from all the elements on this website; and finally it should print the extracted text to the console and close the browser. Of course, at the end we simply call the getPageContent function.
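The exact file is linked in the video description; as a hedged reconstruction of what the narration describes, index.js could look roughly like this:

```js
// index.js - a sketch following the narration, not necessarily the exact file
// from the description. launch, newPage, goto, evaluate, and close are all
// standard Puppeteer API calls.
const puppeteer = require('puppeteer');

async function getPageContent(url) {
  const browser = await puppeteer.launch();             // start headless Chrome
  const page = await browser.newPage();                 // open a new page
  await page.goto(url, { waitUntil: 'networkidle0' });  // load the page and let its scripts run
  const text = await page.evaluate(() => document.body.innerText); // rendered text only
  await browser.close();
  return text;
}

getPageContent('https://workflows.com').then(text => console.log(text));
```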
Everything is in the file index.js, so we simply need to type node index.js, and as you can see, only the text elements have been returned to the console by calling this function. In this case you cannot see the email address, because on workflows.com it's hidden behind a link, an a-tag. But let's change the address to scrape.workflows.com.
That's the site where the email address is hidden behind JavaScript code, and when we run the function again we should be able to see all the text, including the text hidden behind the JavaScript; in our case that's the email workflows@gmail.com. Okay, but how can you run this code in n8n, how can you use it there? Basically, the best option is to modify this code and run it in Google Cloud Functions. So I will now modify this code to read the target website from the name parameter in the URL of our function; you can find the modified version of this code in the description. I'm also adding one additional line of code to make this function actually work in Google Cloud Functions. And, very importantly, I will not use the latest version of Puppeteer; I will use version 19.8.8, which is a bit older than the current one. I will also add a new file to the function, .puppeteerrc.cjs; more on that in a moment.
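Before moving to the Cloud console, here is a hedged sketch of what the modified index.js could look like. The real file is in the video description; my assumptions are that the "additional line" is the Functions Framework registration and that the target website arrives in the name query parameter:

```js
// Sketch of the Cloud Functions variant - the actual file is linked in the description.
const functions = require('@google-cloud/functions-framework');
const puppeteer = require('puppeteer');

// Register the HTTP handler under the entry-point name used in the video.
functions.http('getPageContent', async (req, res) => {
  const url = req.query.name; // e.g. ?name=https://scrape.workflows.com

  const browser = await puppeteer.launch({
    args: ['--no-sandbox'], // assumption: commonly required in serverless environments
  });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });
  const text = await page.evaluate(() => document.body.innerText);
  await browser.close();

  res.send(text); // plain rendered text, no HTML and no styling
});
```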
Now go to your Google Cloud dashboard, look for Cloud Functions, and simply create a new function. I will use the second generation of the environment and I will name my function get-page-content. I will also allow unauthenticated invocations, which is not recommended, but I will do it for the purposes of this tutorial. The minimum memory you should use for Puppeteer is one gigabyte. Now what you need to do is simply paste in the code that you have already prepared: first I copy the index.js code and paste it into index.js in Google Cloud Functions, and I also edit the entry point from helloHttp to the getPageContent function. Next I copy the package.json content, so I go to the already prepared file, copy it, and paste it into package.json.
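The prepared package.json isn't shown in full; a minimal sketch consistent with what the video describes (Puppeteer pinned to the 19.8.8 version mentioned above, plus the Functions Framework assumed in the earlier sketch) might be:

```json
{
  "name": "get-page-content",
  "version": "1.0.0",
  "main": "index.js",
  "dependencies": {
    "@google-cloud/functions-framework": "^3.0.0",
    "puppeteer": "19.8.8"
  }
}
```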
The last step is creating the new file, .puppeteerrc.cjs: simply add a new file, name it .puppeteerrc.cjs, remember to add the dot at the beginning of the file name, and paste in the content of the file that you have already prepared.
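The file's content isn't read out in the video, but Puppeteer's documented configuration file for environments like Cloud Functions usually just relocates the browser cache into the project directory, along these lines:

```js
// .puppeteerrc.cjs - sketch based on Puppeteer's documented configuration file.
const { join } = require('path');

module.exports = {
  // Store the downloaded Chrome inside the function's own directory so it is
  // bundled with the deployment instead of landing in a home-directory cache.
  cacheDirectory: join(__dirname, '.cache', 'puppeteer'),
};
```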
When you do this, your function will be ready to deploy, so you simply need to click the Deploy button at the bottom of the page. It may take a few seconds, sometimes a bit longer, even a few minutes, before your function is activated, so please be patient. When everything is fine you should finally see the green icon next to the name of your function, and the URL of your function should also be available. Copy it, paste it into a new tab of your browser, add the name parameter, and then paste in the address of the website that you want the function to return; in my case it's scrape.workflows.com. And look what happened: I received in the output only the text from this website, without any HTML code and without any styling. This is exactly what I wanted.

Right now I will edit my second HTTP Request node and change its URL to the function URL, including the parameter with the website that I want to scrape. So now this node should call the function in Google Cloud Functions and return only the text from the website passed in the name parameter, the website that I want to scrape. And yes, this is what happens: I receive in the output only the text from the website scrape.workflows.com.
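For reference, the URL in that HTTP Request node ends up having roughly this shape (the host is a placeholder; your function's actual URL will differ):

```
https://REGION-PROJECT_ID.cloudfunctions.net/get-page-content?name=https://scrape.workflows.com
```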