This is How I Scrape 99% of Sites

John Watson Rooney
Check Out ProxyScrape here: https://proxyscrape.com/?ref=jhnwr ➡ JOIN MY MAILING LIST https://johnw...
Video Transcript:
A large part of the work I do in scraping is e-commerce data: competitor analysis, product analysis and all that. In this video I want to show you how I go about scraping almost every single site that I come up against, especially ones like this. I've covered this before, but you absolutely don't want to be trying to pull out links and scrape the HTML. That's just not going to work. If you look over my head here (I'll make it a bit bigger), parsing the HTML for this is just not going to work. What we want to do is find the backend API that this site uses to hydrate the front end, to populate this data.

To find that, we open up the inspect tool, the dev tools here in Chrome, and go to Network. I'll try and make this a little bit bigger. Then we need to start interrogating the site. The first thing I always do is scroll around and see what pops up. I'm going to click on Fetch/XHR, and it's the responses that are JSON that we're interested in. You can either move around, go to different categories, or click on a product; a product will do just fine.

When you start to scale up projects like this one, you'll find that your requests start to get blocked, and that's where you need to start using high-quality proxies. I want to share with you the proxy provider that I use and the sponsor of this video, ProxyScrape. ProxyScrape gives us access to high-quality, secure, fast and ethically sourced proxies that cover residential, datacenter and mobile, with rotating and sticky session options. There are 10 million plus proxies in the pool to use, with unlimited concurrent sessions from countries all over the globe, enabling us to scrape quickly and efficiently. My go-to is either geo-targeted residential proxies based on the location of the website, or the mobile proxies, as these are the best options for passing anti-bot protection on sites, and with auto rotation or sticky
sessions it's a good first step to avoid being blocked. For the project we're working on today I'm going to use sticky sessions with residential proxies, holding on to a single IP for about 3 minutes. It's still only one line of code to add to your project, and then we can let ProxyScrape handle the rest from there. Also, any traffic you purchase is yours to use whenever you need, as it doesn't ever expire. So if this all sounds good to you, go ahead and check out ProxyScrape at the link in the description below. Let's get on with the video.

So let's look at what we've got here. Right away I can see a load of images and a load of JSON data. The one that I'm interested in straight away says availability, and this has all the product availability: basically the stock numbers, the SKUs, etc. for this item. That's pretty handy, very relevant. The other one is right here, which is the whole product data, everything that comes with it. We can see we've got all the images and stuff like that, and there's pricing information in here, and metadata. If I collapse these we can see everything coming up, including the pricing information. So this is essentially the data that I want.

I've shown you all this before in other videos, and if this is new to you then I'll cover everything you need to get started with this. But what I haven't done before is show you more of a full project, which is what I'm going to go through in a minute. The first thing, though, is that we need to understand the API and the endpoints and what's happening. So I'm going to copy the request URL for this one, which is the product. We can see that this is essentially just their API, and by hitting it like this we do indeed get the JSON response for this data. What that means is we could take a different product, for example, let's see if I can
grab the data for this one, the code for this one, and just put it on the end here, and we're going to get that information. But how do we go about getting these product codes? Well, there's another way that we can do this. I'm going to keep this one open, so now I've got the product link here, and I'm going to open the availability one as well so we can have all three. Where is the availability? Here. Again, the availability is very straightforward: I paste this in here and we get the availability. If I change the product code, it's going to give us the availability for that product.

Now, to actually find the product IDs: how would you find them on the website? You could either go to a category or you might want to search, and this is where I tend to go to start with. So I might type something like "boots" into the search, again with this open on this side. Here we go, 431 results. This is how I would typically look to get this information. If I come back over to the data that I had, I need to scroll to the bottom; somewhere around here we're going to find a request. I wish it wouldn't show me all of these. Actually, what I'm going to do is delete all this, since I had all the other ones, and search again just so it comes up at the top.

Okay, so this is it loading up. You can see it's loading up all these products, and this is because these are the products that have come from the search. This endpoint is actually slightly different; it's going to give you different bits of information, and we will cover that. The one I'm looking for is the actual search one here, search query. There, I found it. What this is, is basically hitting the API endpoint with the search query that we gave it. And again I can put this in here (I wish this would go away, I don't know what this is for) and I can put
this in here, and here is the response. I'm going to collapse a lot of this information and get rid of all of this, because we're not that interested in it. What we are interested in, if I make this full screen and have a good look, is this: we have a view size, a view set size, we have the count, which is 431, which was the whole of the search, we have the search term, and then we have the items at 48 per page, which was the view size. We also have the current set, and there should be another one: start index, here we go.

So what we can do is start to see whether any of these parameters are available for us to manipulate. If I change the start index to 10, what happens? Okay, that wasn't the right one. Start index didn't work, so I'm going to change it, and quite often it's just "start". Okay, start is the start index, that's fine. To find that out you could try and guess it like that, but what you could also do is come back here and manually go to the next page with the developer tools open, and you would see it there. If we scroll down, somewhere along here: start is 48, we can see that there. So you can do everything that you would do on the page, just keep an eye on the Network tab, and you'll see everything come through.

So now that I know that the start index works (oh, way too big), we can start to put together something that we can use to search. We want to start on index zero, I guess, and then we can go through the items. What we have in the actual items response, somewhere down here, is a lot of good information. In some cases this is enough, but in a lot of cases you do want to go deeper into the product itself. We have a product ID; this product is some kind of kids' Superstar boots, right? So now we come back to our products
endpoint, and we hit this in here. Here's the product: straight away it has come back and given us all this information. The one I want to look at the most is the pricing information; it's got a discount, all this cool stuff right here. Then we can of course go to the availability one, put the product code in, and here's the availability, and this one has some availability. So you can see that we're starting to work out how their API works. This is not that difficult, especially if you've either worked with REST APIs before or built REST APIs before, but my best advice, as I said, is just to look through the website.

What I want to do now is take this and turn it into something we can repeat within our code. I'm going to get rid of this at the moment, I don't think I'm going to need it, and we can always come back to it. I've got my terminal open here in a new folder (let's make this a bit bigger), and I'm going to create a virtual environment like so and activate it. What I want to show you now is a couple of interesting things. I'm going to use curl. I take this endpoint that we know works in our browser, paste it here, and we get denied. So this is a curl error, and this is basically akin to "we can't get this data like this".

Well, let's try it with requests. So let's import requests and do response = requests.get(...) with the URL in there. You can see that we're having issues here; we're not able to stream the data for whatever reason. So I'm going to change the headers. (Can I clear this up? We'll do it this way.) I'll say our headers are equal to, because you always want a good user agent, right, User-Agent, and let me just grab one. "My user agent", this one will be fine. Put that in here. Oh, I need to sanitize and paste, please. There we go, cool. So now we'll import requests again and do response = requests.get(...), grab our URL again, put it in there, and say headers is equal to the headers that we just created, which is the user agent. And response.status_code: 403.

Now, this is because of TLS fingerprinting. I'm going to cover this in much more depth in a video coming up, so if you're interested in finding out why this is happening, what you can do to avoid it, and how everything works under the hood, you want to subscribe for that video. But essentially, what we're going to do is (I was going to come out of this so I don't get any namespace issues, but actually I don't need to) from curl_cffi import requests as cureq. curl_cffi is going to give us a more consistent fingerprint that looks like a real browser. So now I can go up to here (we don't need this one, we just want this), and instead of using actual requests I'm going to use cureq, the curl_cffi requests, and I'll do response.status_code. I got 403 because I forgot to do this: impersonate is equal to, and we can just put "chrome" in here, you don't have to put the version. Now if I do response.status_code we get our 200, and response.json() is all the data.

So we needed to get our fingerprint sorted to make the request. You'll notice I didn't need any cookies, I didn't need any headers, I didn't need anything other than what curl_cffi, or other TLS fingerprint spoofers, do. There are a few out there, and as I said, I'll cover that in a following video.

Now that I know this is going to work, we need to activate this one here. I'm going to do pip3 install curl_cffi, and I'm going to use rich (I always use rich for printing). We're also going to use pydantic, because I want to get it to a point where we have modeled the data a bit better. So I'll install these; I think that should be enough for us in this instance. Then I'm going to touch main.py and open it here.

Now, I've imported everything that we're going to need, and I'm going to look at modeling my data a little bit closer. I've done this already, but essentially I'm going to take from the products one and the search one so we can get that information. I haven't done the availability one, but you can add that on nice and easy now that you know the endpoint. So we're going to model this information: I'm just going to take what I want from here and create a pydantic model with it. The first one is the search item, which has the product ID, the model ID, price, sale price, the display name and the rating. That all comes from the search endpoint. Then I have the same thing with the search response, which means I can easily find out and manipulate what page, count, etc. So we can see the search term, the count of total items for that search, the start index which I told you about earlier, and then the items, which is the list of search items. Then I've modeled the item detail, which is the information that I was after before. I've just put the product description and the pricing information in as dictionaries rather than modeling them, because this data is quite dynamic: I found some products don't have all of this information, so it was easier to do it like this, and again with the product description. So it's up to you, but basically what I'm saying is: model your data from here.

Next I'm creating a new session. I created a function for this because initially I thought maybe I would want to expand on this project and then be able to import this new session function into a different file or a different part of the project. All I'm doing is creating a session using requests.Session, and again this is curl_cffi, so we have this impersonate here, and I'm also importing my proxy. I talked about sticky proxies earlier, and that's what I'm going to be using here. It's not actually essential with this specific site, but there are sites that will match your fingerprint or your request with the IP address, and if it starts to differ it starts to get flagged. That's a lot less common though, so this should be fine.

Now I'm going to write a function that's going to query the search API. We need our session, which we're going to create, our query string and our start number, and I've just put an f-string into the URL to do that. Then I'm going to get the data from here. We want to put in something to handle a bad response, so I've put raise_for_status, which is going to throw an exception if we get an error status such as a 403 instead of a 200, basically letting me know if we're starting to get blocked. I'm not too fond of this, I think there's probably a more elegant way of handling it, but this will work just fine for now. Then we take the response data and push it into our model, our search response model. I'm unpacking from the raw and item list, which is essentially this piece here: raw, then this one, and then this one here, and then I unpack everything that fits into my models like so. Again, it's up to you how you model your data. Then I return the search, which is of the type of the search response model. I'm going to do exactly the same now for the detail API, very, very similar. We're going to put the item.
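
The data models described in the transcript might look roughly like this. The field names and the `raw` / `itemList` nesting are assumptions reconstructed from what is said aloud, not confirmed keys of the real API, and the sample payload is made up apart from the search term, count and view size quoted in the video.

```python
from typing import List, Optional

from pydantic import BaseModel


class SearchItem(BaseModel):
    # Assumed key names; the transcript only lists the fields verbally.
    productId: str
    modelId: str
    price: float
    salePrice: Optional[float] = None
    displayName: str
    rating: Optional[float] = None


class SearchResponse(BaseModel):
    searchTerm: str
    count: int
    startIndex: int
    items: List[SearchItem]


class ItemDetail(BaseModel):
    # Left as plain dicts because these sections vary between products.
    product_description: dict
    pricing_information: dict


# Invented payload shaped like the search response described in the transcript.
sample = {
    "raw": {
        "itemList": {
            "searchTerm": "boots",
            "count": 431,
            "startIndex": 0,
            "viewSize": 48,  # not modeled; pydantic ignores extra keys by default
            "items": [
                {
                    "productId": "X0001",
                    "modelId": "M0001",
                    "price": 50.0,
                    "displayName": "Placeholder boot",
                    "rating": 4.5,
                }
            ],
        }
    }
}

# Unpack the nested item list into the model, as the transcript describes.
search = SearchResponse(**sample["raw"]["itemList"])
```

Optional fields default to `None`, which is one way to cope with products that lack a sale price or rating.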
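
The paging behaviour found earlier (431 results, 48 items per page, offset passed as `start`) implies a fixed list of offsets to request. A small standard-library sketch of that arithmetic follows; the base URL and the `query` parameter name are placeholders, not the real endpoint.

```python
from urllib.parse import urlencode

# Placeholder endpoint: the real search URL is not given in the transcript.
SEARCH_URL = "https://www.example.com/api/search"


def page_starts(count: int, view_size: int) -> list:
    """Return the `start` offset of every page needed to cover `count` items."""
    return list(range(0, count, view_size))


def search_url(query: str, start: int) -> str:
    """Build the search URL for one page of results."""
    return f"{SEARCH_URL}?{urlencode({'query': query, 'start': start})}"


# 431 results at 48 per page need nine requests, the last one starting at 384.
starts = page_starts(431, 48)
urls = [search_url("boots", start) for start in starts]
```

Reading `count` from the first response and then looping over `page_starts(count, 48)` avoids guessing how many pages exist.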