Qwen 2.5 VL Computer Use: FULLY FREE AI Agent With UI CAN DO ANYTHING! (Beats OpenAI Operator)

26.04k views1784 WordsCopy TextShare

WorldofAI

In my last video, I covered Qwen 2.5 Max, which dominated GPT-4o, DeepSeek V3, and Claude 3.5 Sonnet...

Video Transcript:

a few days back I had made a video on a new remarkable model by Quin which was the 2. 5 Max Quin model that outperforms and out pieces deep seeks version 3 gbt 4 Omni plus claw 3. 5 Sonic and various benchmarks now in that video I had promised to make a video on another model they had released alongside with the 2.

5 Max model which is their quen 2. 5 Vision model this is their new flagship vision model of the coin series and it's also a significant leap from its previous coin 2vl model it's a great model for computer use like open AI operator and what's Wild is that the Quin 2. 5 VL 72b will perform as well as the base model for the operator agent meaning that you can use this local VL open- Source model to automate any computer-based task just take a look at the performance evaluation sheet that showcas es the performance of this Vision model where it delivers competitive performance across various benchmarks it excels in document and diagram understanding while functioning as a visual agent without task specific fine-tuning it matches state-of-the-art models in tasks like math question answering as well as video comprehension against the gp4 Omni model CL 3.

5 Sonet as well as its predecessor which is the quen 2vl it also closely Rivals Gemini 2. 0 flash where it out competes it in most benchmarks but Falls a bit short in triple muu now overall this model is open sourced with both base and instruct models in three different sizes including the 3B the 7B as well as a 72 billion parameter model which are available through hugging face you can also access the Quin 2. 5 VL 72b instruct model through quen chat which is their chat bot and in this case I'm going ahead and asking it what are these attractions please give their names in English and you can see list out giza's pyramids you have the Great Wall of China uh you have Statue of Liberty as well as Terracotta Army within China before we get started I got a huge new update this is where I've launched a new newsletter this is something that's going to be sent out on a weekly basis and essentially going to be updating you on the latest AI advancements comparison of different large language models AI news as well as ranking different AI agents so definitely go ahead and subscribe to this cuz you don't want to miss out on free AI news now guys when I say that this model is truly remarkable in terms of its visual understanding it definitely is cuz it has the capability to recognize objects analyze texts charts layouts within images and so much more better than any other Vision model can it functions as a visual agent for both computers as well as for phone use and it can comprehend long videos by pinpointing key events localizes objects with precise bounding boxes and it can even generate structured outputs for scan documents this is something that will make it highly useful in different categories like finance and e-commerce or having it so that it could help you process and sparse different types of data so you can see why this model could be really useful for many of us so imagine if we are to take this model's Vision capabilities and combine it with something like browser use which is a web automation framework you're definitely going to get the best accurate automation performance cuz as we all know browser use is a new framework that has came out recently which is open source and it basically is an AI agent that provides a powerful way for you to automate any website based task now you can see on the web agent accuracy benchmark test it is something that even outpaces the open AI operator which is recording an 87 percentage in terms of web agent accuracy in terms of Performing task and with browser use in comparison which scores an 89 percentage now just imagine if you are to combine this framework with the vision capabilities of quen you're definitely going to get the best precise output of any automation on the web which is just insane now there's a couple of ways to get started you can either use quen's API from their Cloud API platform to access the vision model capabilities through browser use or you can go ahead and install this inference based API server that is going to be accessible through open AI compatible API endpoint and to get started there's a couple of prerequisites for having browser use installed you're going to need to make sure that you have python installed UV to set up your virtual environment you'll need pip which is to help you install the packages and eventually you will also need git to help you clone the repository which is something that most people should have but once you have those prerequisites fulfilled head over to the web UI GitHub repository click on this green button copy this link to clipboard and I want you guys to open up your command prompt simply go ahead and type in get clone and paste in the link this will clone the repository of web UI once that is done you can then scroll all the way down and then you can type in CD web-ui to get into the web UI directory you want to then start off by creating your virtual environment for this so that it is contained in that environment then you want to activate your virtual environment so go ahead and send in this command which will activate this environment that you have set up once you have sent it in you can then install the dependencies so go ahead and copy this command now after you have installed all the dependencies you want to make sure you go ahead and install playright this is super simple you just need to go ahead and and copy this into your command prompt and this will install all the requirements for playwright and once that is done you can go ahead and open this up by going over and running this python script to start up the web UI with browser use that is functional and once that is done you can then go ahead and click enter and it will then open up within your Local Host within a couple minutes and then you can go ahead and open it up within your web browser now first things first you want to head over to llm configuration and obviously if you went along and set it up with the open AI compatible uh server or endpoint you will then need to select your model which is going to be the one that you just installed with it and then you want to set your Local Host to to the base URL that is said within that GitHub repository and then you want to leave your API key blank and then what you can do is head over to the browser settings as well as run agent to go ahead and execute any command so let's start off and test this we're going to go ahead and run this agent so that it goes over to World of AI my YouTube channel and it's going to go ahead and Source me the most popular video so right now you can see that it opened up a new tab to execute this and within a couple seconds it's going to go over to YouTube and find World of AI for me so within the search tab it should go ahead and search up world of AI and it will then go over to Source the latest or the most popular video so let's see what it ends up doing so it looks like it has found my channel and now it is going to sort through different videos to find the most popular one so it looks like the next step is going to be clicking on the popular button to sort through different videos that have the most amount of views and it should show up with fabric and there we go was able to find the fabric video that is the most popular on my channel so it was able to perform this task pretty quickly actually and it was able to do it quite accurately now I'm going to go ahead and send over this simple task but essentially what I'm trying to Showcase is that this model is exceptional in terms of going ahead and executing webbased task as it can go ahead and easily search for things quite quickly I went along and I requested it to search up trending AI research papers and we can see right now it is already on Google and it's going ahead to search this up now if you compare this with many of the other different types of computer agents it would take them a lot longer to process a query like this so in this case it's already went along and pasted in the prompt within Google search tab of trending AI research papers and already went along and searched it up and you can see there's already a list of different trending AI repositories that I can go ahead and click on if you like this video and would love to support the channel you can consider donating to my Channel Through the super thanks option below or you can consider joining our private Discord where you can access multiple subscriptions to different AI tools for free on a monthly basis plus daily AI news and exclusive content plus a lot more but guys that's basically it for today's video on the Quin 2.

5 BL model it's a truly remarkable model that excels in visual understanding and it's something that will help you recognize objects even better than something like Gemini 2.