Offline AI on iOS and Android

Siraj Raval
I compiled a PyTorch AI model to run locally on iOS & Android! This keeps conversations private, fre...
Video Transcript:
Hello world, it's Siraj, and I'm going to show you how to run any machine learning model locally on iOS or Android. I've been working on this for the past few weeks as part of my project DoctorGPT, which is now named Doctor Dignity to avoid naming conflicts with a certain OpenAI. In this video I'm going to show you how to do this for any large language model and how to run it on iOS or Android.

What you're seeing here is a demo of Doctor Dignity (I've just renamed it) running on my device. As soon as I load the model into memory, the RAM usage shoots up to 2.18 gigabytes, and as soon as I ask it a question it shoots up even more. It's taking a lot of memory, but the point is that it's running locally, which is great: offline models offer privacy for patients (or anyone, really), freedom from OpenAI or any paid API you'd otherwise have to use, and complete ownership and control over your own machine. So I think this is a super useful idea — compiling machine learning models to work not just on Nvidia GPUs but on any GPU (Qualcomm, AMD, Apple), TPUs, any type of processor.

I'm going to show you exactly how I built Doctor Dignity, but to get there we're going to start small and build our way up, because this can get complicated. This video is aimed to be accessible to anybody, and I want to get you to the point where you understand how the tensor virtual machine works, because that's the core technology I used to get my machine learning model running on iOS and Android. You can see I have Android Studio up as well, not just Xcode, along with an Android version of the app, and all of it works through the tensor virtual machine. But before we talk about the tensor virtual machine, we have to start small, so let's build a very simple language model in Python and continually build up from there.

Okay, so let's talk about our dependencies. The first dependency we need is Python, for programming — that one should be obvious. The next is NumPy, which lets us speed up certain operations on a CPU, and that's what this Google Colab notebook is using (a CPU made by Intel — I'll explain what that is, and how it differs from a GPU, later). Then we're going to use CUDA, which is Nvidia's platform for programming their GPUs. Nvidia isn't the only game in town, though. Apple makes GPUs too, and Apple has its own language for programming them called Metal. And it's not just Apple: on Android there are lots of vendors — Qualcomm, Huawei, MediaTek — all building GPUs, and Vulkan is a general-purpose API that lets us use all of those different Android GPUs. Then there's the tensor virtual machine, which lets us skip worrying about Metal or Vulkan altogether: we take our language model straight from Hugging Face and compile it down to iOS and Android. Before we get there, though, I need to show you how it does that by introducing you to Metal and Vulkan. We'll also use a language called Relay. Relay is part of the tensor virtual machine: it's an intermediary language that takes a model in any framework — PyTorch, scikit-learn, whatever — converts it into an intermediate representation, and compiles it down to a hardware-specific runtime, whether that's a GPU, a TPU, or anything else. The last library we'll use is MLC LLM, which is a wrapper around the tensor virtual machine and is how I got Doctor Dignity to work.

So we can install all the dependencies we need right here: PyCUDA, for writing CUDA code inside a Google Colab notebook; Vulkan, because we want to use it for GPU acceleration later; the tensor virtual machine, an Apache project, which we'll use at the end; and finally MLC LLM — machine learning compilation for large language models — which is a way of using the tensor virtual machine specifically to compile large language models, not just any machine learning model like a time-series or regression model. While all of that downloads, let's go right to our first step.

Our first step is to build a simple language model in Python. This model does something very simple: it predicts the next character in a sequence based on the previous characters, and it's trained on a dataset of just five individual characters — a, b, c, d, and e. That's it. The way we encode these characters into something the model accepts is a technique called one-hot encoding, where we turn a categorical variable (in this case, a character) into a column that represents that value as either a one or a zero in a matrix. It's a great way of representing data so that a very simple neural network can perform next-character prediction on it.

So let's look at the code, which I've commented extensively (with the help of ChatGPT, of course). First we import NumPy for those matrix operations. Then we initialize our vocabulary — only five characters, a through e — and map each character to a number with a dictionary; that gives us our one-hot encoding. The vocabulary is every possible output and every possible input of this language model — it's the model's search space. Normal large language models have a search space of alphanumeric characters, not just letters but numbers too; here we keep it tiny because we want this to be really simple. Next we initialize a weight matrix for the neural network. Remember, neural networks operate on a very simple principle: take your input data, multiply it by the weight matrix, add a bias value, and feed all of that into an activation function, which produces an output. That output is compared against the target value, we compute the difference, and we use that difference to compute what's called a gradient. The gradient, using calculus, points us in the direction to adjust the weight matrix so that, when multiplied by our input, it becomes more and more likely to give us the expected output.
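To make that concrete, here is a minimal sketch of that kind of model. The five-character vocabulary and the one-hot encoding follow the description above; the training pairs, learning rate, and variable names are my own reconstruction rather than the notebook's exact code.

```python
import numpy as np

# Tiny vocabulary and its one-hot encoding
vocab = ['a', 'b', 'c', 'd', 'e']
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    v = np.zeros(len(vocab))
    v[char_to_idx[ch]] = 1.0
    return v

# Training pairs (my choice): each character maps to the one after it
pairs = [('a', 'b'), ('b', 'c'), ('c', 'd'), ('d', 'e')]

# Single-layer network: logits = x @ W + b, softmax over the vocabulary
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), len(vocab)))
b = np.zeros(len(vocab))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Training loop: cross-entropy loss, plain gradient descent
lr = 0.5
for epoch in range(500):
    for x_ch, y_ch in pairs:
        x, y = one_hot(x_ch), one_hot(y_ch)
        probs = softmax(x @ W + b)
        grad_logits = probs - y              # dL/dlogits for softmax + cross-entropy
        W -= lr * np.outer(x, grad_logits)
        b -= lr * grad_logits

def predict_next(ch):
    """Forward pass (inference): return the most likely next character."""
    probs = softmax(one_hot(ch) @ W + b)
    return vocab[int(np.argmax(probs))]

print(predict_next('a'))  # with these training pairs, prints 'b'
```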
So that's the data initialization. Then there's the training loop, where we train the weight matrix of our single-layer neural network, continually improving it so it's more likely to predict the next character. And then we perform inference: inference is the step where we do a forward pass on a given input — a, b, c, or d — through the trained model, and it gives us an output. You can see that given "a" the output is "c", so it's already predicting a next character.

The way this worked is by training the model on what's called the CPU. The CPU is a processor, and every computer has one. To run any program, a CPU performs four steps in order. First, it fetches instructions from memory — multiply, add, divide, whatever they are. Second, it decodes those instructions into commands specific to that CPU architecture. Third, it executes those commands — every CPU has an arithmetic logic unit containing all the logical computations that can occur. And last, once those computations have happened, it stores the results back in memory. The full data flow looks like this: from RAM — our temporary Random Access Memory — data moves into the L2 cache, a temporary buffer, and then into L1 before it reaches the core. The reason a CPU has L2 and L1 caches is that they make it more efficient in the temporal and spatial allocation of resources as it processes data — these operations are literally laid out spatially on the chip — and that matters when you're counting nanoseconds and picoseconds of processing time. So remember: fetch, decode, execute, store. That's how a CPU works — pretty simple stuff. And we can actually clock how fast our trained model runs on the CPU with the measure function for inference that you can see here.
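In the notebook that measure function is already written; here is a hedged sketch of the same idea — the helper name, repeat count, and use of `time.perf_counter` are my choices, not necessarily the notebook's exact code.

```python
import time

def measure_inference(fn, arg, repeats=1000):
    """Average wall-clock time of a CPU inference call, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1000.0

# With the toy model sketched above:
#   print(f"inference took {measure_inference(predict_next, 'a'):.4f} ms")
```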
In this case, inference took 0.07 milliseconds: once our language model was trained, given an input it took 0.07 milliseconds to run the model and predict the next character. That's kind of cool, right? But you can imagine that much larger language models like Doctor Dignity aren't going to take 0.07 milliseconds — they'll take 20 milliseconds, or 30, or 50, or a thousand if you have a 70-billion-parameter model. So we really want to make this fast; maybe not for this toy model, but at scale it matters.

So how do we surpass the performance of a CPU? We have to introduce another processor entirely, because we're limited by the computing capacity — the floating-point operations per second, or FLOPS — of a CPU. That other processor is the graphics processing unit, the GPU. The GPU has way more cores than a CPU, and it lets us use those cores in a very specific way. The GPU is good for a lot of things, but not everything: the CPU is great for sequential operations, while the GPU is great for parallel operations. When would we want parallel operations? In the case of neural networks — for example, at inference time. When we run inference on tokens — say we give our large language model a sentence and want it to answer — we could feed it one character at a time, but ideally we'd feed it all the characters at once and have it process them together so it can give us an output faster. The way to do that is to create a bunch of threads, each operating in parallel. Those threads live in what are called blocks, and those blocks live in what are called grids. Oregon State has a great slide deck I found on how CUDA works and what that pipeline looks like, but the idea is this: you can have some code — say, C code that just iterates through a loop — and to run it on CUDA it only changes a little, by introducing variables like the block ID and the thread ID. So what are grids, blocks, and threads? They're a way of subdividing data and processing it in parallel on the GPU; this is Nvidia's paradigm for GPU programming. A grid contains a bunch of blocks, a block contains a bunch of threads, and we can carefully run different computations in parallel (and sometimes sequentially) and have those threads and blocks share data with each other.

You can imagine a wide variety of applications for this. It was originally used for graphics processing in games and video — it's called video RAM, after all — but it turns out to be great for deep learning, because deep learning's matrix operations require tons of parallel computations: multiplications, additions, all of it. We can do a lot on the GPU: run training, feed it from our data loader, run inference in parallel. When data is fed into the model in parallel like that, it's called data parallelism. And you can see Nvidia has extensive docs for CUDA — it's been around for over a decade and is clearly the most supported, most popular GPU programming platform out there. Thank you, Nvidia, for creating it; your stock price reflects that.
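To make the grid/block/thread idea concrete, here is a small PyCUDA sketch — not the video's exact inference kernel, just a toy matrix-vector product in which every CUDA thread computes one output element (we install PyCUDA in the Colab notebook in a moment):

```python
import numpy as np
import pycuda.autoinit              # creates a CUDA context on the default GPU
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# A CUDA kernel: each thread handles one row of the matrix-vector product
mod = SourceModule("""
__global__ void matvec(const float *W, const float *x, float *out, int n)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // which output element am I?
    if (row < n) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += W[row * n + j] * x[j];
        out[row] = acc;
    }
}
""")
matvec = mod.get_function("matvec")

n = 1024
W = np.random.rand(n, n).astype(np.float32)
x = np.random.rand(n).astype(np.float32)
out = np.zeros(n, dtype=np.float32)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block

matvec(cuda.In(W), cuda.In(x), cuda.Out(out), np.int32(n),
       block=(threads_per_block, 1, 1), grid=(blocks, 1))

print(np.allclose(out, W @ x, rtol=1e-3, atol=1e-3))  # should print True
```

Each block here holds 256 threads, and the grid holds just enough blocks to cover all 1,024 rows — the same subdivision idea scales up to the matrix multiplies inside a language model.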
Okay, but I want you to understand how CUDA works — that's why I'm explaining it. We don't want everything to run on CUDA, only the part that needs to be parallelized. We have a program written in C, C++, Python, whatever, and some portion of it is CUDA. The host code runs on the CPU, and the CUDA code gets compiled — and remember, CUDA isn't just a compiler, it's also a runtime: it compiles code and then runs it. The compiled kernel runs on the GPU, results come back to the CPU, and if more GPU processing is needed it goes back to the GPU again, with the final result stored on the CPU side. That's how the process works, and we can initialize and run our own CUDA operation inside a Colab notebook by installing PyCUDA. You can see it right here: we run the inference function in CUDA as a kernel, with threads and blocks, performing those operations in parallel, and that gives us a speedup. When we run the measure function to time inference on an Nvidia GPU — we need to connect to a T4 GPU in Colab to do that, so I'll connect to one — we can see the inference is even faster, which is exactly what we wanted.

So the inference speed is faster — awesome. But like I said, Nvidia is not the only kid on the block. There are a lot of other kids on the block, like Apple, and unfortunately Nvidia doesn't make the GPUs in iOS or Android devices, so we need alternatives. Apple has its own GPUs: the M1 and M2 in the MacBooks, and the A-series chips — the A15 and A16 Bionic — in iPhones. The way to program the GPU on these chips is Apple's native API, Metal. Metal is an amazing technology, and just like CUDA it's both a compiler and a runtime: it lets us take a model written in a high-level framework like PyTorch and run some of those functions on the A15 (or whatever chip your iPhone has). So let me show you how to do this. We're going to convert that code into Metal code, and unfortunately we can't do that in a Colab notebook, because Colab gives us an Nvidia GPU, so we'll open up Xcode.

In Xcode I have this file here, which is essentially an Objective-C implementation of our little language model. It looks much the same, and I can run it right here: I hit run, it builds, and you can see the output of the build — boom, it just built — along with the measured inference time. I'm actually timing it in nanoseconds, because it's that fast: milliseconds just showed up as zero whenever I clocked it, so I had to redo the timing in nanoseconds for it to register. Here's the Metal part: we create a .metal file and paste the inference function into it — it uses the Metal standard library — and the rest of the code is just Objective-C. That's it: Objective-C code for a small language model. We load up our Metal program and the GPU buffers, and we do the computation on one side and the memory allocation separately, just like a good GPU programmer should — it's very similar on CUDA. And that's how we run a small language model on Metal devices, on iOS.

But let's keep going and do the same for Android. Android is the biggest mobile operating system in the world by far — about 71% market share as of last year, sorry Apple — and it has a lot of different GPU vendors: Samsung, Qualcomm, MediaTek, Unisoc, all these different ones. OpenGL used to be the library of choice for graphics processing, but it's old and effectively deprecated. Vulkan is the successor to OpenGL: it uses much less CPU and leans much more heavily on the GPU, and while OpenGL required each vendor to keep their GPU drivers up to date, Vulkan puts that responsibility on the developer. Our code gets much more verbose and difficult, but in exchange we can write one API that runs on any GPU — and that's huge. Seventy-plus percent of mobile devices run Android, so as hard as Vulkan is, if we learn it, one language can run on roughly 70% of the GPUs out there that aren't Nvidia. And Vulkan wasn't really meant only for Android: you can run Vulkan code on Nvidia GPUs, or on Apple hardware via Metal, so it's almost a general-purpose GPU programming language. GPUs used to be used for graphics tasks — rasterization, vertex assembly, shading, rendering, ray tracing; that's why it's called video RAM — but now we're using them for the matrix operations of deep neural networks, specifically large language models.

You can see the Vulkan code right here, and it took me quite a while to produce it: I had to ask ChatGPT to output the code in small bits to fit its context window — that's how big this Vulkan code is — and I'll paste all my ChatGPT prompts for you in the video description. I pasted the result into Visual Studio Code. It's all written in C, with Vulkan included as a single header. We create a Vulkan instance, pick the physical device, create the logical device, set up the pipeline, the command pool, the buffers and memory, and then the compute pipeline. It's a lot of stuff — 338 lines of code, which I'm sharing in the description — but it compiles and works, and I can show it to you right here: I compile it, boom, it runs, and you can see the binary.

The idea here is that I had to repurpose the concept of a shader. Shaders come from graphics programming — they're a way to create all sorts of visual effects — but we can use shaders for tensor operations as well, in this case just running the simple inference computation on the device. Here Vulkan is running on my MacBook, which is the M1 chip, but Vulkan can run on Android as well.

So let's move on to the next and final step: the tensor virtual machine, and why it matters to you. We could learn Vulkan — it's very verbose. We could learn Metal — also very verbose. And those aren't the only two: there are tons of different GPUs out there, each of which processes data differently, and there are many techniques we could use to make inference faster on a GPU. There are general-purpose ones — data tiling, where we share data as it's processed in a more memory-efficient way; memory coalescing; loop unrolling; vectorization; batching; pipeline parallelism — and there are optimizations specific to particular GPU vendors: shared memory and warp shuffle in CUDA, threadgroup memory and the Performance Shaders in Metal, local data share and wavefront optimizations in AMD's ROCm, and OpenCL has its own as well. You can see where I'm going with this: it's hard to learn, and there's a lot out there. So what we want is to learn one tool that compiles down to all of this stuff and lets us take our simple Hugging Face model and run it on our Android device, our iOS device — every device. That technology is the tensor virtual machine.

So let me explain how the tensor virtual machine works, because this thing is awesome. It's an Apache project — the foundation well known in open-source software for a bunch of different technologies already — and it works like this. We take our machine learning model, whatever language or framework it's in — PyTorch, TensorFlow, Keras, MXNet — and compile it into a framework-agnostic intermediary language called Relay. Relay is a way of defining the computation graph of our machine learning model. From that intermediary language, an optimizer is applied — the AutoTVM optimizer, or auto-scheduler — which breaks the computation graph down into subgraphs, and each subgraph is then optimized for the specific hardware we're targeting. It guesses and checks a bunch of candidate subgraphs to see which ordering of operations — additions, multiplications, activation functions, all of it — is fastest for a given target hardware, and the search is guided by a learned cost model, gradient-boosted decision trees (XGBoost), that helps find the optimal subgraph. Once that process is complete, the model is quantized to be as small as possible. Quantization is the process in which all the weights and activations of a model are made smaller — less precise, but they take up less space, and inference is faster.
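As a toy illustration of what quantization means — this is plain per-tensor symmetric 4-bit rounding, not the exact grouped q4f16 scheme MLC LLM uses later in the video:

```python
import numpy as np

def quantize_int4(weights):
    """Toy symmetric 4-bit quantization: int4 codes plus one fp32 scale per tensor."""
    scale = np.abs(weights).max() / 7.0          # symmetric int4 range is [-7, 7]
    codes = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return codes, scale

def dequantize_int4(codes, scale):
    """Recover approximate fp32 weights for use at inference time."""
    return codes.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
codes, scale = quantize_int4(w)
w_hat = dequantize_int4(codes, scale)
print(np.abs(w - w_hat).max())   # small but nonzero: precision traded for size
```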
And the TVM — the tensor virtual machine, which I'll just call the TVM from now on — is both a compiler and a runtime, just like CUDA, just like Metal, all of them: it compiles the model down and then runs it as well. Like I said, that computation graph is created from whatever framework we're using, and then it's optimized — and not only optimized but partitioned, which means it's chunked, or sharded, into little pieces: in the case of PyTorch, .bin files. These .bin files let a device load a large language model into memory piece by piece even when it doesn't have much VRAM, which is what lets smaller GPUs run these models. If it were one giant .bin file, probably only a big Nvidia GPU could load something like that. And the TVM is performing code generation: on one hand it's optimizing the graph using all those candidate subgraphs, and on the other it's generating code for whatever hardware it's targeting — CUDA code, Metal code, Vulkan code. It sits on top of all of those technologies. So we can just take our PyTorch model, give it to the TVM, build it into Relay, the intermediary language, and then from Relay compile it down to iOS, or Android, or Nvidia, or whatever we want.

Here you can see a bit more detail on how that subgraph optimization happens, and this is really important, because there are two ideas at play: abstraction and transformation. If you're interested in this technology, you should definitely read up on it — this is Apache's tensor virtual machine, and you can see the overview they publish of how it all works; it's super interesting. There's also a course called the Machine Learning Compilation course at mlc.ai — go check it out. It explains how automatic program optimization works: how the tensor virtual machine applies stochastic transformations to take an original program, generate a bunch of possible programs, and then select the best, most optimal program for your specific hardware environment. So definitely check out the mlc.ai course as well.
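Here is a rough sketch of what that automatic program optimization loop looks like through TVM's Python auto-scheduler API. The tiny dense-plus-ReLU module stands in for a real imported model graph, and the trial budget and log-file name are arbitrary:

```python
import numpy as np
import tvm
from tvm import relay, auto_scheduler

# A tiny Relay module standing in for a real imported model graph
data = relay.var("data", shape=(1, 64), dtype="float32")
weight = relay.var("weight", shape=(64, 64), dtype="float32")
net = relay.nn.relu(relay.nn.dense(data, weight))
mod = tvm.IRModule.from_expr(relay.Function([data, weight], net))
params = {"weight": np.random.rand(64, 64).astype("float32")}
target = tvm.target.Target("llvm")   # could be "metal", "vulkan", "cuda", ...

# 1. Break the graph into tunable subgraphs ("tasks")
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

# 2. Guess-and-check candidate schedules, guided by a learned (XGBoost) cost model
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(auto_scheduler.TuningOptions(
    num_measure_trials=64,                                  # tiny budget for a sketch
    measure_callbacks=[auto_scheduler.RecordToFile("tune_log.json")],
))

# 3. Re-compile the model using the best schedules that were found
with auto_scheduler.ApplyHistoryBest("tune_log.json"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
```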
This idea of abstraction is super powerful, because once we abstract the computation graph away from any sort of vendor lock-in, we can perform a bunch of transformations on it — so again, two ideas: abstraction and transformation. Different transformations of the same matrix operations can yield different results: we can cache things, and, as I mentioned earlier, apply all those fancy techniques like data tiling, vectorization, and loop unrolling. It can get complicated, but the point is that in this intermediary language, Relay, which the tensor virtual machine uses, code is generated for a specific target — whatever target we specify — and a bunch of operations can be applied along the way: it can fuse operations (for example a multiply and an add) so they take up less space, it can parallelize them, it can vectorize them, and so on.

So there's a six-step process here. Step one: import our PyTorch model. Step two: optimize the model. Step three: partition it into small chunks. Step four: generate actual code to run inference on iOS and Android. Step five: package it for iOS and Android. Step six: deploy it and start running it in Xcode or Android Studio. Let's go through this code in Python. First we import the tensor virtual machine as well as the Relay language. Then we load up a PyTorch model — this one is just Fashion-MNIST, a super simple image classification model, purely for demonstration: I want to show you how we can take any PyTorch model and convert it for both Android and iOS. (Before, we built a small language model from scratch; in this case we're using PyTorch — I just want to point that out in case it's confusing.) There are only ten class names here — pullover, trouser, sneaker, sandal, and so on — and we can just download this pre-trained model and run inference on it. The model has several parts: a structure — the skeleton, the computation graph — and the weights, the parameters. And now we can take this model and optimize it for Android and iOS; we can compile it for Android and iOS using the tensor virtual machine.
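Here is a hedged sketch of that import-compile-run flow with TVM's Python API. I don't have the video's exact Fashion-MNIST model, so a stock torchvision ResNet stands in for it, and the input name "input0" is simply my choice:

```python
import torch
import torchvision
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Step 1: import -- trace a PyTorch model and convert it to Relay.
# (A stock ResNet stands in for the Fashion-MNIST classifier here;
#  older torchvision versions use pretrained=False instead of weights=None.)
model = torchvision.models.resnet18(weights=None).eval()
example = torch.randn(1, 3, 224, 224)
scripted = torch.jit.trace(model, example)
mod, params = relay.frontend.from_pytorch(scripted, [("input0", (1, 3, 224, 224))])

# Steps 2-4: optimize and generate code for a target.
# "llvm" is the host CPU; "metal", "vulkan", or an Android cross-compile
# target are what you would use for phones.
target = tvm.target.Target("llvm")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Steps 5-6: package and deploy -- here we just run it locally.
dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input0", tvm.nd.array(example.numpy()))
module.run()
print(module.get_output(0).numpy().shape)   # (1, 1000) for this stand-in model
```

For a phone you would swap the target (Metal, Vulkan, or an Android cross-compilation target) and export the compiled library instead of running it in-process.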
I've been trying out a bunch of different models for Doctor Dignity — you can see in my GitHub project that it has gained a lot of attention, which is super interesting — but I'll say that the 7-billion-parameter model was way too big to be running on most mobile devices; I get that now. Three billion parameters is the sweet spot. Unfortunately we can't do more than that, and we can only move forward, not backward. When it comes to 3-billion-parameter models there are quite a lot of them, and the Open LLM Leaderboard on Hugging Face is constantly being updated, so I recommend you check it out. What I like to do is look for actual 3-billion-parameter models — not 70-billion-parameter models — and see how they're doing. Several models sit at the top of that list, and one in particular that really got to the top is Marx 3B v2. Marx 3B v2 is built on OpenLLaMA 3B v2 — an open version of Meta's LLaMA, but at three billion parameters — fine-tuned on GPT-4 conversation data. That's a common thread across the Open LLM Leaderboard: a lot of the top LLMs were fine-tuned on GPT-4 data, which harkens back to another paper — "your LLM is secretly your reward function" is what I remember it being called — and the idea there is that we can use a teacher language model to train a much smaller language model, and that this improves accuracy a lot because it's a much better representation of the data — really of the knowledge, not even the data. There are lots of ways of quantizing these models, but Marx 3B was the best; it was trained for two epochs, and I'm sure it could be improved with more — there's a lot to improve here. But I couldn't get any of these models to compile down to Android or iOS using MLC LLM, except for RWKV Raven.

RWKV is a recurrent network that offers Transformer-level performance — super interesting stuff; the paper is "Reinventing RNNs for the Transformer Era." The idea of "attention is all you need" took the world by storm, but it turns out we do still need recurrence — recurrence didn't go away; it actually gives us optimization and speed at inference time over Transformers. So RWKV offers Transformer-level performance during training but recurrent-network-level performance during inference, mixing the ideas of attention and recurrence together. It's called Receptance Weighted Key Value, and I recommend you check the paper out. The point is that it's super light, it could work, and it's what I used. So let's look at how I used it.

If we go to the MLC LLM documentation and go through the list together: the first thing we do is install MLC LLM. MLC LLM was created by a group of researchers who are trying to make TVM Unity much more accessible to people who want to run large language models on their mobile devices specifically, and they have an app called MLC Chat, which I basically white-labeled for Doctor Dignity — I had to rebuild it and change a lot of things to make it unique enough to get onto the App Store. And it is on the App Store now, by the way; you can see it right here. It's still a work in progress — don't expect anything too amazing or fancy; I just barely got it on there and there's still a lot of work to do — but it's there, so you can download it for iOS, and I'm actively working on Android right now, so please be patient with that.

MLC Chat — machine learning compilation chat — shows you how to use their GitHub repository, so let's do that together. To compile a model with MLC, we first git clone their repository, cd into that directory, and pip install all of the repo's dependencies one by one (there are quite a lot of them). Once that's done, we can verify the installation by running python3 -m mlc_llm.build --help, which tells us whether it was installed — and it looks like it was. Now we click on Android, and what we want to build is RedPajama 3B v1 for target android, with a max sequence length of 768 (how many tokens the model handles) and quantization q4f16 — that's just a standard format for how precise we want the weights and activations to be: sixteen-bit versus four-bit. Except we don't actually want to do this for RedPajama: we want to do it for RWKV.
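Putting those options together, the build step looks roughly like the following — a hedged sketch wrapped in Python rather than a shell one-liner. The model name is a placeholder, and flag spellings differ between MLC LLM versions, so treat the `--help` output as the source of truth:

```python
import subprocess

# Hypothetical build invocation based on the options described above.
# Flag spellings vary across mlc-llm versions -- run
#   python3 -m mlc_llm.build --help
# and adjust to whatever your checkout actually accepts.
cmd = [
    "python3", "-m", "mlc_llm.build",
    "--model", "rwkv-raven-3b",       # placeholder model name / local directory
    "--target", "android",            # or "iphone" for the iOS build
    "--quantization", "q4f16_1",      # 4-bit weights, fp16 activations
    "--max-seq-len", "768",
]
subprocess.run(cmd, check=True)
```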
Now it's going to do that subgraph optimization process I talked about, and it should look something like this: it builds all of these pieces, and what it's doing is essentially what I told you — it takes that giant PyTorch model, abstracts it into Relay, the intermediary language, and applies a bunch of transformations to it, using XGBoost to guide the search. The computation graph is subdivided into subgraphs, a bunch of candidates are tried out for our target device (in this case the iPhone), the functions are executed to see how fast they run, and that feedback is used to create an optimized final output graph, which gets saved. Once it's saved we can ls to check: do we have it? Yep. Do we have the parameters as well? Yes, and the tokenizer too. Great stuff.

Now that we have that, we can build the app. How do we do that? First we clone the iOS app — we clone mlc-llm — then we make a dist directory, clone the prebuilt libraries into that directory, and clone the model into it as well. Then we build the auxiliary components by cd-ing into ios and running prepare_libs.sh, which prepares those libraries to run on iPhone. So we'll run prepare_libs.sh — this is using Rust, by the way; the tensor virtual machine build is using Rust — and perfect, it built the runtime for iOS. You can see I have all these library files for iPhone, and I can prepare the parameters as well — I'll run that — perfect. Now I can build the iOS app: I open the app in Xcode and build it.

What you're seeing is the final version of this. It's got RWKV — you can see all the .bin files — and you can see my MLC chat config, where I specify the model, the local model files, and the conversation template; you need that in place for MLC Chat to recognize the model. There's also the app config, where I've got RWKV Raven listed. Another thing you need to do is in Build Phases: under "Link Binary With Libraries," MLCSwift should be there, and under Build Settings, "Other Linker Flags" should have the flags that link all the modules. Then we can run it, and you can see it's running — it's taking up a lot of memory, but hey, it's running locally, which is awesome.

So that's the idea behind iOS; now let me show you Android. They've got a pre-built Android package as well, but we're not going to use that — we're going to do the same thing, but for Android. I will say I'm still having some trouble getting my Android MLC LLM build running; I've got it open in Android Studio here, and you can see that I've built it. We basically do the same thing for the Android version: we build the model for the target, which in this case is Android — Vulkan; it's going to output Vulkan — and then in our app config…