How to Scrape Websites Without Paid APIs Using n8n

Bart Slodyczka
Hey guys, in this video I break down the mechanics of web scraping using standard HTTP calls. I show...
Video Transcript:
Hello legends! This video is an introduction to web scraping without using a paid API service. The template I have in front of me takes what I'd call a mechanical approach to scraping websites: it hits them directly, tries to find all the URLs associated with the site (all of its different pages), grabs that content, and puts it into a Google Sheet. From there you can feed it into some kind of LLM to process, or to extract the specific bits of data you want.

I think the two main use cases for this are: one, you're already paying for a scraping API and it's too expensive, so you want something cheaper; or two, you're using one of those paid services (or scraping websites yourself) and you need very granular control over the data you get back. This approach gets you closer to that.

At a high level, when you're doing some kind of lead gen, maybe pulling from Google Maps or some paid lead-gen service, what you typically get is the base URL of a website. If you make an HTTP call to that base URL, you get back the HTML content of that one page. But if I click around the site (order online, reservations, gluten free), I land on different pages, and the URL changes each time. So I'd have to somehow find every URL on the website to scrape all of its content. My example here is a pizza store, which is admittedly not the best target, but if you want to research competitors, monitor a website daily, or pull out all of a store's product information, then yes, you probably want every single URL on that site, and you want to store that information somewhere.
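To make that concrete, here's a minimal sketch of what a single HTTP call to a base URL looks like outside of n8n (plain TypeScript on Node 18+ with its built-in fetch; the URL and User-Agent string are placeholders, not values from the video):

```typescript
// A minimal sketch of one HTTP GET for a page's raw HTML.
const baseUrl = "https://example.com";

async function fetchHtml(url: string): Promise<string> {
  const res = await fetch(url, {
    // Some sites reject requests that don't look like a browser.
    headers: { "User-Agent": "Mozilla/5.0 (compatible; scraper-demo)" },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}

fetchHtml(baseUrl).then((html) => console.log(html.slice(0, 500)));
```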
There are a couple of ways to do this, and the most common is to take the base URL of the website and append /robots.txt to the end. At a high level, robots.txt is something websites supply to crawlers like Google: it acts as instructions for what to crawl and how to process and present the site. I'm not an expert on it (any SEO folks who know more, put it in the comments), but the useful part for us is this: if you append /robots.txt to the base URL and hit enter, you land on a page that looks very techy, and somewhere on it you'll usually find a Sitemap entry. That sitemap, step two, contains even more information about the website. Typically it either goes directly to a list of all the URLs on the site, so you can take those URLs and visit each one by one, or, for bigger websites like e-commerce stores or blogs, it points to a category XML page. So let's copy this sitemap URL and paste it into the address bar. By the way, when I paste it into the address bar, the browser is making exactly the same HTTP call we'd make from n8n, make.com, or directly from code with cURL; what we see on screen is what that call would return.
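Here's a sketch of that robots.txt step, again in plain TypeScript; the function name and shape are my own, not the workflow's:

```typescript
// Fetch robots.txt and pull out any "Sitemap:" lines. The robots.txt
// convention is one "Sitemap: <url>" entry per line, so a
// line-anchored regex is enough here.
async function findSitemaps(baseUrl: string): Promise<string[]> {
  const res = await fetch(new URL("/robots.txt", baseUrl));
  if (!res.ok) return []; // no robots.txt -- we'll have to guess instead
  const body = await res.text();
  return [...body.matchAll(/^sitemap:\s*(\S+)/gim)].map((m) => m[1]);
}

// e.g. findSitemaps("https://example.com").then(console.log);
```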
com or just directly from curs using Code so now we're in this XML sitemap and we have these like uh highle category Pages which if you click into one of these for example I have let's say mega menu sitemap this extension here so once again we are now in this category uh XML so if I click into this uh Mega menu I open it up and now I have um actual URLs that I can get to and these URLs which this is a pretty bad example or makes more sense later these are direct URLs that take me to certain pages on the website so yeah this is probably uh you know it's a very bad example here but if this was a u an e-commerce store you might go into this category page again this category page and you might have categories of products or categories of services and you might click into like the products category page and it would list all the products that that business has so you can kind of see here that this way like this is actually organized in such a way where it's easy for Google to understand what's the content how's it structured how do I find it like how do I link to it how do I use it so that's why this robots is actually a very good first step when you're trying to scrape a website because it will give you information on um all of this stuff now that's typical for e-commerce stores SEO friendly stores blog post like blog Pages they follow that you know robots extension you go to the sitemap the category page and then URLs exactly what we just did here right we went through that entire process um but not every website's the same sometimes the website um they might have the robots extension you go to the site map which is this you go to this and in our example here we have this category page right we have this category but sometimes it just goes straight to the URLs so it goes straight to the URLs um so that's one option another option is they don't even have the robots. txt it's maybe a small site like maybe a very um like a mechanic store or a cafe or a bakery they might not have a robots in which case you can still actually access the list of URLs or the category like you can still kind of get to either um here or uh yeah into here or here find the category or the URLs by guessing what the sitemap extension is so even with this sitemap by the way it's just if I go one step back um as you can see here it's uh it's just the extension off of the actual home like the base URL so the base URL slit mapcore index. 
In practice there are four or five main sitemap extensions you can append to the base URL to find the category pages or the URLs themselves. So even if there's no robots.txt, you can still give each of these a try, and often you'll get lucky and land on the category page or the URL page. If this sounds confusing, it'll make more sense in a second when I walk through the actual flow, because you'll see you need a robust solution: it tries one approach first, and if that doesn't work it tries the next, and then the next, since not every website is the same. The final fallback is that there's no robots page and no sitemap page at all, and you have to hit the homepage yourself and figure out how to grab all the URLs from there. So there are a bunch of different pathways, and depending on the industry you're targeting, e-commerce stores or big, SEO-friendly, built-out websites, the robots/sitemap route is probably the one you'll follow most often.

Okay, let's start drilling into the flow. The first thing is that we interact with this system via a chat interface. You could plug this into Google Sheets and iterate over each new row you add, but for this example it's easier to use the chat window. I paste in a URL, and the first step is an AI step that extracts the base URL in a specific format. I use AI here to keep things simple, but you could do this mechanically too, with a small script that does the same job.
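A mechanical version of that base-URL extraction might look like this (a sketch; the function name is mine, not a node from the template):

```typescript
// Normalize whatever the user pastes in (with or without a protocol,
// possibly a deep link) down to just the scheme + host.
function toBaseUrl(raw: string): string {
  const withScheme = raw.startsWith("http") ? raw : `https://${raw}`;
  return new URL(withScheme).origin; // e.g. "https://example.com"
}

// toBaseUrl("example.com/menu/pizza") => "https://example.com"
```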
We need the base URL in that format because the next step checks for robots.txt: we take the base URL that came out of the OpenAI block and append /robots.txt, exactly like we did in the browser. We land on the robots.txt page and check: do we have a sitemap? At this point there are two options: one, the site has a robots.txt; or two, it doesn't.
Like we saw earlier, bigger websites usually have a robots.txt and smaller ones might not, so already we have to take some control and make the flow robust. So we check whether the robots.txt exists.
If it does, we take the sitemap we find there and extract it. This step is basically saying: hey, we're on the robots.txt page; find me the sitemap, wherever it appears on the page and in whichever format, and give me back a JSON key/value of "sitemap" that I can use in my next HTTP call. Then we literally target that sitemap: copy the URL, call it, and what we see is exactly the output of this block, the same list of entries we saw in the browser.

Before I move on, let's pause for a second on the other branch: if there is no robots.txt, we have to guess the sitemap. That's what happens here. We append /robots.txt and ask: did that HTTP call succeed or fail? If it succeeded, great, off we go. If it failed, we try each of the four sitemap extensions. If you click into these nodes, you'll see all we're doing is appending sitemap.xml, then sitemap_index.xml, then the variation with a dash, then sitemap.json.
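Here's a sketch of that fallback in plain TypeScript; the candidate list mirrors the four variations mentioned here, though exact spellings vary from site to site:

```typescript
// When robots.txt is missing or lists no sitemap, try the common
// sitemap paths one by one and keep the first that responds OK.
const CANDIDATE_PATHS = [
  "/sitemap.xml",
  "/sitemap_index.xml",
  "/sitemap-index.xml",
  "/sitemap.json",
];

async function guessSitemap(baseUrl: string): Promise<string | null> {
  for (const path of CANDIDATE_PATHS) {
    const url = new URL(path, baseUrl).toString();
    const res = await fetch(url);
    if (res.ok) return url; // first hit wins
  }
  return null; // nothing found -- last resort is scraping the homepage
}
```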
We try our luck with all of these, and sometimes it works: there's no robots.txt, we drop into this fallback branch, one of the extensions hits, and we keep pushing forward with the flow. At this stage, in this merge block, we've reached the sitemap page: whether the robots.txt route worked or one of the guessed extensions worked, we're now at the exact same checkpoint, on the sitemap page, just like the one in the browser. The next thing we do, if you open up this node, is extract all of the <loc> elements. I'll run it in a second so you can see exactly what's going on, but essentially we're checking: do the entries on this page still have an .xml extension, like these do (XML, XML, XML)? If so, we're on the categories page and need to go one level deeper to get the URLs. Or are the entries already direct page URLs on the domain?
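Here's a sketch of that final step: pull out the <loc> entries and, if they still point at .xml files, go one level deeper (treating a .xml suffix as the nested-sitemap signal is the same heuristic described above):

```typescript
// Extract every <loc> entry from a sitemap, then decide whether we're
// looking at a sitemap index (entries ending in .xml) or final page URLs.
async function extractLocs(sitemapUrl: string): Promise<string[]> {
  const xml = await (await fetch(sitemapUrl)).text();
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
}

async function collectPageUrls(sitemapUrl: string): Promise<string[]> {
  const locs = await extractLocs(sitemapUrl);
  const nested = locs.filter((u) => u.endsWith(".xml"));
  if (nested.length === 0) return locs; // already direct page URLs
  // Category/index sitemap: go one level deeper into each child sitemap.
  const pages: string[] = [];
  for (const child of nested) pages.push(...(await extractLocs(child)));
  return pages;
}
```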