Python AI Web Scraper Tutorial - Use AI To Scrape ANYTHING

Tech With Tim
In this video, I'll be showing you how to build an AI web scraper using Python. The application itse...
Video Transcript:
Today I'll be showing you how to build an AI web scraper using Python. That's right, we'll be building an application that can scrape literally any website using AI. This is super cool, so let's dive in and let me give you a quick demo of how this application works.
The functionality here is pretty basic: all we need to do is give it a website URL, maybe something like the medal table from the recent Olympics. Let's paste that right here. We scrape the site, grab the DOM content, and then we can pass a prompt to our AI, which will be able to grab any information from this website that we want. All right, so we have our DOM content here; this is all the content from the site. Now I can paste in a prompt like "give me a table that contains all of the countries and their medal count". We wait a second and we get that information. There we go: we just got a table that contains the medal counts for the countries that were on that website.
Let's look at a few other examples, then dive into the code, and I'll show you exactly how we build something like this. Here's a more e-commerce-related example. It's a website called True Classic, which actually has my favorite t-shirts, and I'll show you how we can scrape it and grab all of the different t-shirts from this site. I'm just looking at the polo shirts here. I'll paste in this URL, scrape the site, and then give it a prompt to grab that information. Let's parse the content now using this prompt, and here you go: we get a result with a few different tables containing the different products' names, their price, currency, ratings, etc.
Lastly, let's look at a real estate example. You can see here I have a Property Finder website for properties in Dubai Marina, which is a place I'm going to be moving to shortly. If we scrape the site and give it a prompt again, we can get that data, and here you go: we get a bunch of different properties with all of their relevant information. All right, now that we've looked at that demo, we're going to start building out
the project. In order to do this, we're going to need to install a few different dependencies, because we're going to be using Streamlit for the front end, Selenium for actually doing the web scraping, and LangChain for calling the AI and having the AI parse through our data. So what we're going to do is set up a virtual environment. Notice I've just opened up a new window in Visual Studio Code. Let's do that, install our dependencies, and then we'll start building this project.
To do that, I'm going to say python3 -m venv and then env, or I can name this anything I want; in this case I'm going to name it ai. If you're on Windows, you can simply replace python3 with python, and this should spin up a new virtual environment for you using venv. Let's hit enter, and now you can see that we have a new one called ai.
Now we need to activate that virtual environment. If you're on Mac or Linux, the command to do that is source, then the name of your virtual environment, then /bin/activate. When you do this, you should see a prefix before your command line with the name of the virtual environment. If you're on Windows, I'm going to show two commands on screen right now that will show you how to activate your virtual environment; they differ depending on whether you're using CMD or PowerShell. Just type that command in and activate your virtual environment. Now we're going to install a few different dependencies.
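The on-screen activation commands do not survive in the transcript. As a rough reference, the standard layout that Python's venv module produces can be summarized in a small lookup. The helper name `activation_command` and the shell labels are made up for this illustration, and `ai` is simply the environment name chosen above.

```python
def activation_command(env_name: str, shell: str) -> str:
    """Return the usual command that activates a venv-created environment.

    The shell labels ("bash", "cmd", "powershell") are illustrative keys,
    not anything defined by Python itself.
    """
    commands = {
        # macOS / Linux shells such as bash or zsh
        "bash": f"source {env_name}/bin/activate",
        # Windows Command Prompt
        "cmd": rf"{env_name}\Scripts\activate.bat",
        # Windows PowerShell
        "powershell": rf"{env_name}\Scripts\Activate.ps1",
    }
    return commands[shell]


for shell in ("bash", "cmd", "powershell"):
    print(f"{shell:>10}: {activation_command('ai', shell)}")
```

Whichever variant applies, the prompt should gain a prefix with the environment name once activation succeeds.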
Now, all the dependencies that we need I've actually put up on GitHub. I'm going to link this in the description, but go there and click on requirements.txt. requirements.txt contains all of the requirements that we need to install, so I'm just going to copy the contents of this file and make a new file in my local directory that contains the same. I'm going to say requirements.txt here and paste all of these requirements inside. So again: go to the GitHub (it'll be linked in the description), copy the contents of the requirements.txt file, paste them inside of a new file in your local directory, and then we can install them using this file.
What we're going to do is simply type pip install -r requirements.txt. Make sure that when you're running this command you're in the same directory as this file, and then go ahead and hit enter. This will install all the dependencies that we need inside of this virtual environment, so that they're contained and we're not installing them system-wide. You'll notice that we're using Streamlit, LangChain, Selenium, Beautiful Soup, and python-dotenv, plus a few other modules, and these will allow us to build the entire project out. I'll explain each library and why we're using it when we get to that stage; it's just easier to install all of them at once.
All right, all of those dependencies have been installed, so I'm going to close my terminal and we're going to move on to the next step. The first thing I'm going to do is create a Python file called main.py. Now, just to quickly walk
you through the steps that we're going to follow: the first thing we're going to do is create a very simple Streamlit user interface. If you're unfamiliar with Streamlit, it's a way to create really simple Python web applications in just a few lines of code, so it's probably the easiest way to interact with things like LLMs, which is what we're going to be using here. Once we've built the Streamlit UI, the next step is to actually grab data from the website that we want to scrape. In order to do that, we're going to use a Python module known as Selenium. Selenium allows us to automate a web browser, so we can navigate to a web page and grab all of the content that's on that page. Once we have that content, we can do some filtering on it and pass it into an LLM, an LLM being something like ChatGPT, GPT-4, Llama 3, or whatever you want to use. Then we can use that LLM to parse through the data and give us a meaningful response. So let's
start here by building out the Streamlit UI, and we'll build this as we go through the video. We're going to say import streamlit as st. Beneath that, we're going to add a title for our website, so we're going to say st.title, and this is going to be "AI Web Scraper" (if we could type this correctly). Then we're just going to put a URL input box here, so we're going to say url is equal to st.text_input with the label "Enter a website URL". This creates a very simple text input box for us. Perfect.
Beneath that we're going to say if st.button, and this button is going to be "Scrape Site"; then we're going to do something inside of here. This is as simple as it is to create a Streamlit UI: this makes a title, this adds a text input, and then for the button we say that if it is clicked, we execute whatever code is inside of this if statement. For now, if we want to test that, we can just say st.write and write something like "Scraping the website", and then we can proceed to scrape the website down below.
To run our Streamlit application, we type streamlit run and then the name of the Python file that contains our Streamlit app; in this case it's main.py. Again, notice that I'm running this from the same directory as my Python file. So I do streamlit run main.py, and now it's going to spin up a web server for us. You see that we have AI Web Scraper; we can enter a website URL and then press Scrape Site. When we do that, it says "Scraping the website". If we refresh, it'll just go back to that main state. Perfect. We can still write code while this is running and it will automatically refresh, but for now I'm just going to close this with Ctrl+C and we're going to go on to the web scraping component of this app. So in order
to do the web scraping here, we're going to use Selenium. I'm going to make a new file called scrape.py. This is where we'll write our scraping code, just to separate it from the main file so it's a little easier for us to navigate. We're going to import a few Selenium modules, or I guess Selenium classes, that we need, and then we're going to write a function that takes a website URL and returns all of the content from that website. As I said, we're going to use Selenium to do this, and then I'm going to show you how we can connect this with a service known as Bright Data, which is actually the sponsor of this video (don't worry, we'll talk about them later on). It will allow us to do this at scale and to get past things like CAPTCHAs, IP bans, and other blocks that you'll commonly encounter.
So what I'm going to do here is import selenium.webdriver as webdriver, and then I'm going to say from selenium.webdriver.chrome.service import Service. Okay, now we're going to write a simple function that will allow us to grab the content from a website using Selenium. Again, what Selenium allows us to do is control our web browser. We can do things like click buttons and interact with text widgets, or interact with the page as if a human was doing it, but in this case we're just going to use it to grab the content from a site. So I'm going to make a function
called scrape_website, and inside of here I'm going to put website as the parameter. First I'm going to do a print statement saying "Launching Chrome browser", just so we know what's going on, and then I'm going to write a few variables. The first thing I'm going to do is say chrome_driver_path is equal to an empty string. I am then going to specify some options: options is going to be webdriver.ChromeOptions(). We need this when we set up the web driver; even though it's empty we still need to pass it, and we could specify options here later on. Then I'm going to say driver is equal to webdriver.Chrome, with service equal to Service, passing our Chrome driver path inside of the Service, and options equal to options. Let's save that, and I'll just make this a bit bigger so we can see.
Let's write the rest of the code, and then I'll quickly explain what all of this is doing. We're going to write a try block and say driver.get, and we're going to try to get the website. What this does is use our web driver to go to the specific website, and once we're on the site we can grab the content from it. We're just going to do a print statement and say print "Page loaded..." so we know that it's loaded, and then we can grab the HTML by simply saying html is equal to driver.page_source. Then we can return the HTML, and we
can have a finally block here where we say driver.quit(), and that's it for this function. Okay, I know I went fast; don't worry, I'm going to explain what's happening here and what the next step is.
The first thing we need to do is specify where our ChromeDriver is. This is an application that we'll need to download in just a second that allows us to control Chrome, and it will be different depending on what operating system you're using. We then have options. The options are there so we can specify how the Chrome web driver should operate. We don't need to pass any options right now, which is why it's blank, but we could specify that we want to run it in headless mode, or that we want to ignore images, or something along those lines. Next, we set up our actual driver, so we say webdriver.Chrome. In this case we're going to use Google Chrome, but we could be using Firefox or Safari; it doesn't really matter, we can automate pretty much any browser that we want. We then specify the service that we're using, which points to wherever the ChromeDriver application lives, and then we give it the options and we're good to go. Then we can use the driver, which is effectively automating our browser, with commands like .get. Then we grab the page source, which is our HTML, and we return that. Lastly, I'm just going to import time and do a sleep so we can see what's going on here, so I'm going to say time.sleep(10).
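Pulled together, the function described so far might look roughly like the sketch below. Splitting it into two functions, the `pause_seconds` parameter, and the lazy Selenium import are choices made for this illustration only (they let the page-loading logic be tried with a stand-in driver); the walkthrough itself writes a single scrape_website function. The default driver path is also an assumption, since the real value is filled in a little later.

```python
import time


def fetch_page_source(driver, website, pause_seconds=10):
    """Load `website` in an already-created driver and return the page HTML.

    `pause_seconds` mirrors the time.sleep(10) from the walkthrough, which is
    only there so the browser window stays visible for a moment.
    """
    try:
        driver.get(website)        # navigate the automated browser to the URL
        print("Page loaded...")
        html = driver.page_source  # full HTML of the currently loaded page
        time.sleep(pause_seconds)
        return html
    finally:
        driver.quit()              # always close the browser, even on errors


def scrape_website(website, chrome_driver_path="./chromedriver"):
    """Launch a local Chrome through ChromeDriver and return the page HTML."""
    # Imported here (illustration only) so fetch_page_source can be exercised
    # without Selenium installed.
    import selenium.webdriver as webdriver
    from selenium.webdriver.chrome.service import Service

    print("Launching Chrome browser...")
    options = webdriver.ChromeOptions()  # left empty; headless etc. could go here
    driver = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
    return fetch_page_source(driver, website)
```

Calling `scrape_website("https://techwithtim.net")` would need Chrome plus a matching ChromeDriver binary at the given path, which is the setup covered next.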
Now that we've done this, the next thing we need to do is grab our Chrome web driver. To do that, we're going to go to a link that I'll leave in the description (it's this one right here), and you're going to click on Stable. When you do that, it gives you the download links for the appropriate ChromeDriver builds for various operating systems. The ChromeDriver version needs to match your Google Chrome version, so I recommend just updating Google Chrome to the most recent version, and then you can download the most recent version of ChromeDriver. Anyway, all the versions are here, and you want to go with the latest stable version or whatever matches your Google Chrome version.
In my case I'm going to download the ChromeDriver for mac-arm64, because I'm using the new M2 or M3 or whatever chips they have, but if you're running an Intel Mac it's going to be this one, and then you have Windows and Linux. Go to this download link and it's going to download a zip folder for you. I already have that downloaded, and you'll see that it gives you a zip folder that looks like this. What you want to do is extract that zip folder so you get the actual folder, and then copy the application that's inside and put it in the same directory where your code is. I'm just going to copy the ChromeDriver in here by pasting it into VS Code, and now I have access to it in the same path where my code is.
So again: go to this website, download the appropriate ChromeDriver, extract the zip folder, grab the ChromeDriver application (it may be named something slightly different depending on your operating system), and paste it in the directory where your code is. Great. Now that you've done that, you're simply going to write ./chromedriver for the Chrome driver path. This is the name of your application, so if you're on Windows I believe it's a .exe, but whatever the name of this app
is, you want this variable to match it. Okay, now that we've got that, we can go back to main.py, import this function, call it, and test this out.
In main.py, in order to use this, we're going to say from scrape import scrape_website. Then, inside of the if block for the button, we're going to call scrape_website with the URL, store that in result, and simply print out the result, just so that we can see in our terminal whether we're getting any response. Let me just get out of whatever I have here and clear. Now we're going to rerun our application, so we say streamlit run main.py. Let's go ahead and do that, and now we can enter a URL to test. I'm going to test my website (feel free to do this, by the way), which is just techwithtim.net. I'm going to press Scrape Site, and we should see that ChromeDriver launches and shows our website. We're just waiting for 10 seconds here, where you can check out my software development program (great plug there), and then it will quit and return the HTML. If we go back to our terminal, you can see we get all of the content from that page. Perfect.
So that is exactly how you do the web scraping using Selenium. However, I want to show you that this doesn't always work as expected. So if I actually try to scrape something
like Amazon (I'm going to do amazon.ca), notice what happens. Let's go to amazon.ca, and you see that we actually get a CAPTCHA. The reason for this is that when we do web scraping locally on our own computer using something like ChromeDriver, it's very easy for the website to detect that we're a bot. There are all kinds of things it can look for, and you'll notice that as you do this more and more, a ton of websites are going to block you. They could be blocking you with that CAPTCHA, or with things like IP bans, or by just not showing you the correct content, and it can be a huge nightmare, especially if you want to put this app out into production.
That's where the sponsor of this video, Bright Data, comes in. Bright Data provides a ton of different products that make web scraping and data collection on the internet significantly easier. We're going to use them in this video entirely for free, and you're going to see how we're able to scrape sites that would typically block us using this service. What I'm going to focus on here is something known as their Scraping Browser; however, they have a ton of other products as well, like residential proxies, a Web Unblocker, and a new search engine API. They've got a ton of stuff coming out, so definitely check them out. I'll leave a link in the description where you can sign up for a new account and get some free credits so that you can follow along with the rest of this tutorial.
Keep in mind that you can just keep doing web scraping the way we're doing it right now; you don't need to change this. But if you want to get unblocked, or there are websites that are giving you CAPTCHAs or causing you issues, or eventually you want to push this into production and do web scraping at scale, you will need to use a service like Bright Data. Anyway, I want to show you exactly what we're going to use here and how it works. I'm going to go to my user dashboard, which is only available once you make an account. Now I
have some credits on here, and what we're going to do is make a new instance of something called their Scraping Browser. To do that, I'm going to click on this little icon right here for Proxies and Scraping Infrastructure. Notice they have a ton of other stuff, like a Web Scraper API and datasets that are already made for you, so if you're doing web scraping at all, they have a bunch of services. What we can do is create a new Scraping Browser by simply clicking Add and then clicking Scraping Browser.
The Scraping Browser includes a CAPTCHA solver and connects to a proxy network. That means it will automatically give you new IP addresses and cycle through them, so that you look like a real user accessing a website. It also means that if a CAPTCHA occurs, it will automatically solve it for you, so you don't need to deal with being blocked by CAPTCHAs. I'm going to give this a new name (we'll just call it AI scraper). There are a few other options, but we can just click Add, and once we do, we confirm that yes, we want to go ahead and create this. I'll show you how we can connect to this from our code.
The main advantage for developers is that this just works with the code you already have. In our case we're using Selenium, so we can continue using Selenium, but rather than doing the scraping on our own computer, we push that out to the cloud and use a remote browser instance, which is what this Scraping Browser is. Rather than running the Google Chrome browser on our own computer, we let Bright Data run it for us, where it's connected to all of its tools, and it works exactly the same way. This works with Playwright, Puppeteer, Selenium, Scrapy I believe as well, and some other languages. Let me show that to you. Okay, now that this is created, all I have to do is go to Check out code and
integration examples, and if I do that it's going to show me how to connect to this from our code. What we'll do is modify our code to use Bright Data, so that we're able to scrape at scale and get past all of those blocks. You can see that we have a few different options here, like Node.js, Python, C, and Java. We're using Python, so we'll select that, and then we can choose what library we're using, which in our case is Selenium. We'll just copy the code sample that it has here and retrofit it to our current code. You can also just test it here, by the way, if you want to do that.
Okay, let's go back and make a new section in our code. Let me zoom out so we can see. You'll notice that really all we're doing is, rather than connecting to the web driver that we downloaded onto our own computer, we're connecting to a remote instance, which is defined by this URL right here. So we have this sbr_connection: we're connecting to this remote browser instance, and then we can do the exact same thing that we did before. However, there's this code that we can uncomment, and it will actually solve CAPTCHAs for us. If we think a CAPTCHA is going to occur on our page, whether it does or doesn't, we can execute this command on the remote browser instance and it will automatically solve that CAPTCHA for us. You can see it will tell us what the status of that was and then
continue on with the rest of the code. Okay, so I'm going to copy this, remove the rest of what we've got, and paste it inside of here. Let's get rid of all this, paste this code, and take these three lines and put them at the top of our program. Let's replace the old imports with those, because we no longer need them. Now if we go here, you can see that we have the print for "Launching Chrome browser", we have this sbr_connection, and then we're connecting as driver and going to this website. Let's change example.com to website. We can get rid of this first print statement, run the CAPTCHA solver, grab the page source, and then return the HTML this time, rather than what the sample was doing before, which was just printing it out. So again we have the exact same thing, except we're using Bright Data.
Let's test this out and make sure that this works. I want you to focus on the fact that this was literally just a few lines of code changed; we're just using this rather than setting up our own web driver, and it's also significantly easier than having to download the web driver, set it up, etc. Let me go back to my Streamlit app. Let's make sure that's running, by the way: let's quit this and rerun it so that we refresh our code, and let's try to go to Amazon now. We're going to go to amazon.ca and see the result that we get. All right, I'm just looking at the console here because the scraping has finished, and you can see that there's actually no CAPTCHA. I know it's a little bit difficult to navigate here, but we are actually able to access amazon.ca and grab the content from that site. It's not popping up on our own computer because we're connecting to that remote instance; we could go in and debug and view that if we wanted, but the point is
we're getting all of this content and we're not being blocked like we were before, because now we've connected with Bright Data. That's great. Let's continue on with the rest of the tutorial, now that we're able to grab this HTML source.
Now that we have the HTML source, we want to clean it up a little bit and make sure that we're not grabbing things like script tags, style tags, and other pieces of content that we don't want to pass to the LLM. The idea is that we take just the textual content, pass that to the LLM, and let it parse it, so that we have the fewest tokens possible and reduce the number of characters or batches that we need to submit to the LLM to get a valid response. You'll see what I mean in a minute, but for now I want to write a few helper functions that will clean this HTML for us.
The first thing I want to do is write an extract_body_content function, which takes in html_content. What this is going to do is simply extract the body. We're going to say soup is equal to BeautifulSoup, so we need to import Beautiful Soup now, which is an HTML parser. In order to import it, we say from bs4 import BeautifulSoup. Then we pass it the HTML content and specify that we want it to use "html.parser".
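A minimal sketch of where this helper is heading, assuming the bs4 package from requirements.txt is installed. The body lookup and the empty-string fallback shown here are the two steps the walkthrough explains immediately afterwards; the type hints and the sample HTML are additions for illustration.

```python
from bs4 import BeautifulSoup


def extract_body_content(html_content: str) -> str:
    """Return the <body> element of an HTML document as a string."""
    soup = BeautifulSoup(html_content, "html.parser")
    body_content = soup.body  # None when the document has no <body> tag
    if body_content is not None:
        return str(body_content)
    return ""  # fall back to an empty string rather than raising


sample = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
print(extract_body_content(sample))
```

Everything in the head, such as the title and metadata, is dropped at this stage, which already trims the text that will later be sent to the LLM.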
Next, we're going to say that body_content is equal to soup.body, so we can just grab the body tag directly, because it will parse it for us. We're going to say if body_content exists, then we return the string of the body content; otherwise we just return an empty string so that we don't potentially get any errors.
Next, we're going to clean the body content, so we're going to define clean_body_content. We take in body_content as a parameter and we clean it. To do that, we again say soup is equal to BeautifulSoup, passing the body content and then "html.parser", so we just reparse it. Then we say for script_or_style in soup, passing it a list with "script" and "style", and inside of the loop we say script_or_style.extract(). What this effectively does is look inside of our parsed content for any scripts or styles and simply remove them. That's all this is doing: getting rid of those tags, because we don't care about the styling or the scripts, which are just unnecessary characters. Next, we're going to say cleaned_content is equal to soup.get_text, and we specify that the separator is equal to the backslash-n character, the newline character, which just says okay, get
all of the text and then separate it with a newline. Perfect. Now we're going to say cleaned_content is equal to "\n".join, and here I'm going to write some fancy code that effectively removes any newline characters that are unnecessary. A lot of times when you grab the content from the HTML, you get a bunch of empty text strings, and I want to remove all of those so that we don't have them in our text. To do that, I can say line.strip() for line in cleaned_content.splitlines() if line.strip(). What this effectively says is that if a newline character is not separating anything, that is, if there's no text between it and the next thing, then we remove it. That gets rid of all the random newline characters that aren't actually separating any text and just sit in text that we don't need. Stripping also removes any leading or trailing spaces. Great. After this we return the cleaned content.
The last thing we're going to do is split this content up into batches. When we want to use an LLM, we have a specific token limit. That limit is measured in tokens and varies by model, but as a rough working figure I'll treat it as about 8,000 characters. That means that if we have a really big web page, it's possible that the LLM can't take the entire web page at once and parse all of it.
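Putting the cleaning steps above together, along with a first look at the batching helper the walkthrough builds next, gives roughly the following sketch. It assumes bs4 is installed; the function and parameter names follow the transcript, while the type hints and the sample HTML are additions for illustration.

```python
from bs4 import BeautifulSoup


def clean_body_content(body_content: str) -> str:
    """Strip scripts, styles, and blank lines, leaving only visible text."""
    soup = BeautifulSoup(body_content, "html.parser")

    # Remove <script> and <style> elements entirely; their text is noise.
    for script_or_style in soup(["script", "style"]):
        script_or_style.extract()

    # One text fragment per line, then drop empty lines and edge whitespace.
    cleaned_content = soup.get_text(separator="\n")
    cleaned_content = "\n".join(
        line.strip() for line in cleaned_content.splitlines() if line.strip()
    )
    return cleaned_content


def split_dom_content(dom_content: str, max_length: int = 6000) -> list[str]:
    """Cut the text into consecutive chunks of at most max_length characters."""
    return [
        dom_content[i : i + max_length]
        for i in range(0, len(dom_content), max_length)
    ]


sample = "<body><script>var x = 1;</script><h1> Title </h1>\n\n<p>Hello</p></body>"
print(clean_body_content(sample))
print([len(chunk) for chunk in split_dom_content("x" * 13000)])
```

Note that the batches are cut purely by character count, so a chunk boundary can fall in the middle of a word or a table row; that is a simplification the approach accepts in exchange for very little code.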
So what we need to do is split our text into a bunch of batches of whatever the maximum size is, maybe seven batches of 8,000 characters or something along those lines, and then feed those to the LLM one batch at a time so that it's able to process them. Again, that's because there's a token limit and we can't submit every single character if we have too many. That's what this function is going to do.
We're going to define split_dom_content, which takes dom_content and a max_length, which I'm going to set to 6,000. I believe the max is typically eight thousand; I'm just leaving it at six to ensure we don't go over. I'm then going to return a list of dom_content from i to i plus max_length, for i in range, and this is a slightly advanced step but I'll explain it: range from zero to the length of dom_content, with a step of max_length.
The way this works is that I want to create batches of 6,000 characters. So I take my dom_content from index i, which starts out at zero, up to i plus the maximum length, which in this case is 6,000; that grabs the first 6,000 characters for me. Then the for loop steps by max_length up to the length of the DOM content. So after I've grabbed the first 6,000 characters, i will be equal to 6,000, because it steps forward by max_length, not by one. We then start at 6,000 and go up to i plus the next 6,000 characters, and we keep doing that until we reach the length of our DOM content.
Okay, perfect. That's it for the scraping. Now we need to connect these functions in our main.py. How are we going to do that? Well, first let's import them. We're going to say from scrape import scrape_website, and we're also going to import the split_dom_content, the clean_body_content, and the extract
body content functions now what we're going to do is call those functions and then kind of print out the result in streamlet so we can see what it looks like so we're going to say the body content is equal to extract body content and we're going to take the result we're then going to say the cleaned content is equal to the clean body content and we're going to pass in the body content then we're going to store this in the session for streamlet so we're going to say st. sessionstate dodore content is equal to the
cleaned content this way we can access it later then we're going to say with st. expander now this is an expander text box that allows us to expand it to view more content we're going to say view Dom content and we're going to write inside of here that content so we're going to say st. text area and we're going to say Dom content and then this is going to be the cleaned content and we can give this a height of 300 so sorry let me clarify this the expander is kind of like a button that
will toggle what uh we're showing in here so when we click on it it will show whatever is inside of here when we click on it again it will collapse it then the text area is something that we can expand the size of so we can view like a little bit of content or a ton of content and that's why we give it a starting height and then we say that this is kind of the title of the text area and this is the content that we want to display okay so now let's actually view
this. To do that, let's refresh our Streamlit app. Let's make sure it is running, and I believe it is, and now we can view my website again. So let's do techwithtim.net. Let's scrape this. It's going to take a second, and then once we get that content it will show it to us in the expander view. Okay, amazing. So now we have this View DOM Content window, and when I expand it you can see we get all of the textual content from the main page of my website. Perfect, so that is that. That's actually
a lot of the hard work done. The next step now is to take this DOM content and pass it into an LLM that will parse it based on what we ask it to do. So what we need to do is ask the user for a prompt, like, hey, what information do you actually want from this site? Once they give that to us, we can pass that prompt as well as the DOM content to an LLM and have it extract that data. So we're going to say if dom_content is in st.session_state, so if we've saved that, then we say parse_description is equal to st.text_area, and we can say "Describe what you want to parse?". Then we can say if st.button, and this button can be "Parse Content", and we can say if parse_description, then st.write, and we can write "Parsing the content", and inside of here we can parse it. In order to parse it, we're going to say dom_chunks is equal to split_dom_content, and we're going to take st.session_state.dom_content and pass that to it. Then of course you need to pass this to an LLM, which will be the next step. So we're going to take these chunks, pass them to an LLM, get some response, and then write that. That's pretty much it, but we obviously need to write the LLM component. Anyway, if we go back here, you can see now that if we refresh and we scrape the site again, so let's go
here and scrape it, so techwithtim.net. After we do this it should pop up a text box asking us to enter a description. Okay, so you can see now that it gives us this text box saying describe what you want to parse, and I've got to fix that spelling mistake, and then we can write whatever we want in here and then parse the content. Perfect. So now we're going to move on to the LLM component. We're going to make a new file here called parse.py, and let's start writing the code that we
need, and then we're going to download Ollama, which is what we're going to be using to actually run the LLM. If you're unfamiliar with Ollama, it's something that allows you to run open-source LLMs locally on your own computer, so you don't need to rely on things like API tokens from OpenAI and you don't need to pay. This is completely free. So let's import here. I'm going to say from langchain_ollama import OllamaLLM. I'm then going to say from langchain_core.prompts import ChatPromptTemplate. Now, OllamaLLM is because we're using Ollama, but if you wanted to you could use something like OpenAI, you could use Gemini, you can use whatever you want really for the LLM. LangChain is something that allows us to connect LLMs to our Python code, and that's why we're using it. Now what we'll do is say model is equal to OllamaLLM, and then we'll pass model equals "llama3". Now I'm going to show you how you
know what to put here in just one minute, but I want to write all of the code first so that we're not jumping back and forth between too many windows. Okay, so now we have our model. This is only going to work once Ollama is installed, which, again, don't worry, I'm going to show you in just a minute. But we're going to write a function now that will use this to parse our content. So we're going to say parse_with_ollama. We're going to take in our dom_chunks and our parse_description. Okay, and then inside
of here we are going to create a way of calling the LLM with that content and the description. Now we know that when we call an LLM we need a prompt. We will have a prompt that comes from Streamlit, so it's like, you know, describe what you want to parse, but we need to give it some more detail so that it knows what to do with that as well as the DOM content that we're about to pass it. So what I'm going to do is copy in a
template here from my GitHub. So this is the template. Let's just make this a little bit smaller so you can actually read it. You'll notice that inside of the template we have two variables, dom_content and parse_description. If you're unfamiliar with how templates work, all you do is write normal text in strings, which is what I'm doing, and then you specify any variables that you want to be injected when you actually execute the prompt. In this case we want to pass the dom_content and the parse_description.
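To make the idea of injected variables concrete, here is a minimal sketch that uses plain Python string formatting instead of LangChain's ChatPromptTemplate, so it runs without any extra packages. The template wording and the fill_template helper are hypothetical stand-ins, not the text from the GitHub repo; only the two placeholder names, dom_content and parse_description, come from the video.

```python
# Hypothetical template wording; only the two placeholder names
# ({dom_content} and {parse_description}) are taken from the video.
TEMPLATE = (
    "You are extracting specific information from the following text: "
    "{dom_content}\n"
    "Only return the information that matches this description: "
    "{parse_description}"
)


def fill_template(dom_content: str, parse_description: str) -> str:
    """Inject both variables into the template, as happens at invoke time."""
    return TEMPLATE.format(
        dom_content=dom_content,
        parse_description=parse_description,
    )


prompt_text = fill_template("Gold: 40, Silver: 44", "the gold medal count")
print(prompt_text)
```

The placeholders in curly brackets are replaced only when the template is filled in, which is why the same template can be reused for every chunk.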
Now I won't read the entire thing to you, but it pretty much says, you know, you're extracting specific information from this content, here's the description of what the user wants, and then it just gives some more specific instructions to make sure that we get a decent response from the LLM. You could improve this prompt, you can change it around, but this is one that I found worked fairly well, so that's why I'm copying it in here. Now if you want to find that prompt yourself you can
go to GitHub, you can go to parse.py, and you can simply copy it directly from there, which is exactly what I just did. Okay, so now we're going to go into the parse_with_ollama function. What we're going to do is say prompt is equal to ChatPromptTemplate, and then this is going to be .from_template (not .from_messages), and we're going to pass the template that we wrote. We're then going to say chain is equal to prompt, pipe, and then model. This is the straight vertical pipe on your
keyboard. If you're not familiar with it, it's typically on the far right-hand side, beneath the delete key. I just say that because a lot of people aren't familiar with this character, and it's usually on the same key as the backslash. Okay, anyway, now we have a chain. This means that we're first going to go to the prompt and then we will call the model. That's kind of how it works with LangChain. I'm going to create a list here called parsed_results, and what we're going to do is pass all of the different chunks to
our LLM and then grab the results and store them in here. So we're going to say for i, chunk in enumerate, and we're going to enumerate over the dom_chunks, and we're going to say start equals 1. When you say start equals 1, that just means that i will start at one rather than zero, so that we don't need to add one to it when we print out the value. It doesn't change which chunks we get, it just means that i starts counting at one rather than zero. Okay, next we're going to
grab a response. So we're going to say response is equal to chain.invoke. This is how you call the LLM, and when you call it you need to pass two variables, and they need to match the variables that you specified in the template. If you change the prompt and add other variables, then you need to pass those as well. So we're going to pass dom_content, which is equal to the chunk, because we're passing one chunk at a time, and then we're going to pass parse_description, okay, set to parse_description, like
that. Okay, now we have a response, and what I'm going to do is just print out some logging information. So I'm going to print "Parsed batch", and notice that I'm using an f-string here so that I can embed variables directly inside of curly brackets, and I'm going to say i, "of", and then len of my dom_chunks, also inside curly brackets. The reason for this is so I know how many chunks we're parsing, because sometimes this will take a fair amount of time and I want to have
some kind of output so I know that something's going on. Then I'm going to say parsed_results.append, and I'm going to append the response that I got from the LLM. Lastly, I'm going to return "\n".join of my parsed_results, which is going to take all my results and just join them with a newline character. Okay, so that is how this works, right? What we're doing is creating a prompt template, we're using that for our model, and we're then just taking all of the chunks from our DOM, right,
so all of those 6,000-character chunks, and we're going to pass them into the prompt. So we pass them as dom_content, and then our LLM knows what to do because of the instructions that we've given it as well as that parse_description that we provided from the Streamlit UI. We parse one chunk at a time, we append the responses to the results, and then we join them and return them. That's it. Now obviously you can do this with other LLMs, and the faster your computer is, the faster this is going to execute.
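The whole flow described above (split into 6,000-character chunks, call the model once per chunk, log progress, join the responses) can be sketched as follows. This is not the video's parse.py: the call_model parameter and the fake_model function are assumptions added here so the sketch runs without Ollama installed. In the real code, the value passed as call_model would be the invoke method of the prompt-plus-model chain.

```python
from typing import Callable, Dict, List


def split_dom_content(dom_content: str, max_length: int = 6000) -> List[str]:
    """Slice the text into consecutive chunks of at most max_length characters."""
    return [
        dom_content[i:i + max_length]
        for i in range(0, len(dom_content), max_length)
    ]


def parse_chunks(
    dom_chunks: List[str],
    parse_description: str,
    call_model: Callable[[Dict[str, str]], str],
) -> str:
    """Send each chunk to the model in turn and join the responses."""
    parsed_results = []
    for i, chunk in enumerate(dom_chunks, start=1):
        response = call_model(
            {"dom_content": chunk, "parse_description": parse_description}
        )
        print(f"Parsed batch {i} of {len(dom_chunks)}")
        parsed_results.append(response)
    return "\n".join(parsed_results)


def fake_model(inputs: Dict[str, str]) -> str:
    """Hypothetical stand-in for the LLM: reports the chunk size it received."""
    return f"[{len(inputs['dom_content'])} chars]"


chunks = split_dom_content("a" * 13000)
result = parse_chunks(chunks, "anything", fake_model)
print(result)
```

Passing the model call in as a parameter is only a convenience for this sketch. It makes it easy to check the chunking and joining logic on its own before a real model is involved.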
There are ways to parallelize this, add asynchronous code, and make this run a lot faster, but for the simplicity of this video this is what I am going to be doing. Okay, so this is great, but this only works once Ollama is installed, so let me quickly show you how we install Ollama. It's pretty straightforward. You're just going to search "Ollama install" on Google and you'll see that you have a download Ollama link. So I'm just going to click on this one right here, and notice that you have four options, or three options, sorry: Mac,
Linux and Windows. So obviously click the appropriate one and download Ollama. Now once Ollama is downloaded you need to run it, and in order to run it you're going to open up your terminal or your command prompt and type the ollama command. When you type ollama you should get something that looks like this, and now what we need to do is pull an Ollama model. The Ollama model is what we're actually going to be using, and we need to download the model locally before we can execute it. So in order to
do that, actually, let's go to the Ollama GitHub quickly and I'll show you a bunch of different options that you have. If you go to the Ollama GitHub, you'll see that if you scroll down on this page it shows you all the different models that you can use: Llama 3.1, Mistral, Code Llama, you've got a bunch of different options. So you can pick what you want, but notice the size of them. Obviously the bigger the model you download, the more capable it's going to be, but the harder it's going to be to
run, and it will specify down here how many GB of RAM you need in order to run these various models. For example, if you want to try to run, you know, the Llama 3.1 version that has 405 billion parameters, you're going to need a very powerful computer. I don't even think my computer, which is an M2 Max, would be able to run that. So pick an appropriate model based on the specs of your computer, and then you can run it. So for example, let's say you just want to do Llama 3.1. Okay, what we can
do is go here to our terminal and type ollama pull and then the name of the model, which is llama3.1. Now I already have this so I'm not going to run it, but you would run this command, it would then download the model for you, and then you just need to wait for that to finish and you're able to use it. Now in order to use the model you can say ollama run and then something like llama3. When you do that it's going to give
you a prompt window. You can say something like hello world and then you can actually use this model. So I have llama3, and obviously if you have llama3.1 then you would change this to llama3.1. And I think I want to exit, so can I do /exit? Yes, and then that will get me out of that. Okay, I know I went fast there, but I'm just trying to quickly show you how to get Ollama on your system. Now that Ollama is on your system, we can start using this code. So let's go back
to main.py and let's import that function. So we're going to say from parse import, and then this is going to be parse_with_ollama. Now all we need to do is say result is equal to parse_with_ollama, and then pass the dom_chunks and the parse_description. The parse_description is right here, okay, so that's perfect, and then we can just say st.write and write the result to the page. That's it. So now let's test this out and see if it works. So we're going to go back
here. We'll do techwithtim.net just to check my website. So let's refresh this and just make sure that this is working. We'll go to https://techwithtim.net and we'll wait for that to scrape, and then we'll just give it a prompt like can you describe this site or something and see what it tells us. Okay, great, so we've got the DOM content here. So now I'm going to say, hey, can you put all of the main titles from this page in a table for me, question mark. I don't know exactly what kind of result
we're going to get from that, but let's see what happens if we do this. And you see that it shows us the main titles that it found, which actually are correct: tutorials, courses, community, gear, shop, donate, software development program, my software development program, and Tech With Tim. Boom, it just parsed that content for us. Now we can try with a more complex site, so let's use one of those real estate sites I showed you before. Okay, so here's a Property Finder site that just has a bunch of different properties from Dubai Marina. This is the same
example we looked at at the beginning of the video, and this is interesting because I'm actually going to be moving to Dubai shortly, so I might actually want to make a web scraper that can organize a bunch of these properties for me. Anyway, point is, I put that link here, let's scrape it, and then I'll give this a prompt and we'll see if it can organize that for us. Okay, so the site is scraped, and if we go here I'll just say, can you please collect all the relevant information and organize it in a table,
and let's see what that gives us. And you'll notice, by the way, if we go back here to the terminal, that it should be showing us the batches. So here it said, you know, parsed batch one of one. And this one, actually, let me just click it again, I don't think that button worked. It should pop up and tell us, you know, parsing batch one, parsing batch two, parsing batch three, as it goes through there. Okay, and there we go, that actually happened quite quickly, and it gives us a table
here of this property information. Sweet, so that is pretty much it for this tutorial. I wanted to create a simple AI web scraper for you and I wanted to leave this so that you could extend it and make it more customizable to whatever it is you want it to do. Obviously this isn't going to work perfectly and there are a lot of tweaks that you could make, but you can imagine the potential you have with something like this, right? You can use this to scrape real estate, you could use this to scrape e-commerce, you
could use this to scrape really anything that you want. And if you knew the kind of website you wanted to scrape, you could customize this even more, give it a better prompt, give it more information, and you could even make this go back and forth, where you could ask multiple questions and just keep refining the result that you want to get. So much potential here with this type of AI web scraping. This really just scratches the surface, and I hope you enjoyed this video. If you did, make sure to leave a
like, subscribe, and I will see you in the next one. [Music]
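The video mentions that the chunk loop could be parallelized but does not show how. One possible approach, sketched here under stated assumptions, is a thread pool from the Python standard library. The parse_chunks_concurrently function, its max_workers default, and the fake_model stand-in are all hypothetical names added for illustration. Whether this actually speeds things up with a local Ollama model depends on how the Ollama server handles concurrent requests, which is not covered in the video.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def parse_chunks_concurrently(
    dom_chunks: List[str],
    parse_description: str,
    call_model: Callable[[Dict[str, str]], str],
    max_workers: int = 4,
) -> str:
    """Send several chunks to the model at once and join the responses."""
    inputs = [
        {"dom_content": chunk, "parse_description": parse_description}
        for chunk in dom_chunks
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs,
        # so the joined output keeps the original chunk order.
        results = list(pool.map(call_model, inputs))
    return "\n".join(results)


def fake_model(inputs: Dict[str, str]) -> str:
    """Hypothetical stand-in for the LLM call: upper-cases the chunk."""
    return inputs["dom_content"].upper()


combined = parse_chunks_concurrently(
    ["first", "second", "third"], "anything", fake_model
)
print(combined)
```

Threads are a reasonable fit here because each call mostly waits on the model's response rather than doing work in Python itself.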