About
This application stack is composed of three pieces of software:
- Docker: the container management system
- Portainer: a container that manages other containers
- NVIDIA Container Toolkit: NVIDIA’s custom runtime, libraries and APIs that allow their datacenter-class GPUs to be utilised by Docker containers
Installation
From the SHARON AI public web billing portal, choose your desired virtual machine product. We have a wide array of CPU and GPU based virtual machines with dedicated resources that guarantee you performance without contention.
After choosing your product, choose your Operating System and Application. We recommend a recent Ubuntu LTS based distribution such as Ubuntu 22.04 or 24.04 (released in 2022 and 2024 respectively, and each maintained with security patches for 5 years). Older distributions may have problems or performance issues with outdated versions of Python and hardware drivers, and are not recommended.
Configure the rest of the options to suit your needs, including your disk space, SSH public key, etc.
NOTE: The password you set here will be applied to the default `ubuntu` user. We will need this to log in to JupyterHub later.
When happy with your configuration, complete your order process and wait for your virtual machine to start. This process can take several minutes as the application deployment collects the various applications and drivers necessary. Output can be seen in the files `/var/log/cloud-init.log` and `/var/log/cloud-init-output.log`, and for newer distributions followed via the systemd-journal logger using the command `sudo journalctl -f`.
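Once provisioning completes, you can sanity-check the stack over SSH. A quick sketch (the Portainer container name is an assumption and may differ on your image):

```shell
# Confirm the NVIDIA driver can see the GPU(s)
nvidia-smi

# Confirm Docker is running and the NVIDIA runtime is registered
sudo docker info --format '{{json .Runtimes}}'

# Confirm the Portainer container is up (name may differ)
sudo docker ps --filter name=portainer
```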
Using the application
Portainer allows simple management of Docker from a web GUI. It exposes itself on TCP/9443 for HTTPS. Find your VM’s IP on your product information page:
So, for example, if your product was assigned the IP “203.0.113.25”, you could connect to your Portainer service in your browser at “https://203.0.113.25:9443/”. Note that Portainer uses a self-signed SSL certificate by default, so you will need to accept it in your browser when you first connect.
When first starting up, Portainer asks the user to create a new account which will be the administrator account. Because these services are often live on the public Internet, a timeout is in place so that you don’t expose the system in an unconfigured state for too long. If you ordered your VM instance and then got distracted by another task, you may come back to the following screen:
If you see this screen, you will need to restart the container manually. You can do this by connecting to your instance via SSH, and running the command:
sudo systemctl restart docker
This will restart the Docker service as well as the Portainer container. On reconnect, you should see the default screen prompting you to create an admin account:
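To confirm the restart worked before returning to the browser, a quick check over the same SSH session (the container name “portainer” is an assumption and may differ on your image):

```shell
# The Docker service should be active again
sudo systemctl is-active docker

# Portainer should be listed as running, with a recent "Up" time
sudo docker ps --filter name=portainer
```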
Once logged in, we see our default management screen:
First, let’s do some setup. By default Portainer doesn’t allow containers to use GPUs, which isn’t a whole lot of fun at all. So let’s enable that.
Clicking “Home” shows us our default “local” environment:
Clicking on the environment name “local” itself shows us the dashboard for the “local” environment:
In our left-hand menu, we navigate to Host -> Setup. In the right hand window, scroll all the way to the bottom. Enable “Show GPU in the UI”, and click “Save configuration”:
By default Portainer allows a number of public container registries from across the Internet. One popular registry that’s missing however is the GitHub container registry, or `ghcr.io`. Let’s add that in as a custom registry so we can easily run containers from there. In our menu above, click the drop-down arrow next to “Hosts” and then “Registries”:
Here we see that Portainer has set up the popular DockerHub container registry with free anonymous access. Click the “Add registry” button:
We’re going to choose “Custom registry”, give it the name “ghcr.io”, and the URL “ghcr.io”. Click “Add registry” to save the changes.
And that’s it! We’re ready to start adding containers to our system.
Open WebUI and Ollama – testing containers with LLMs
As an interesting use case, let’s run a container that offers a fully open source, locally hosted, private LLM (Large Language Model).
Often we’ll hear concern over privacy when it comes to using third-party LLM services. Running open source models on trusted infrastructure gives us the peace of mind to test LLMs in safe environments. We can add strict firewall settings, monitor traffic, encrypt storage and communication, and destroy instances when we’re done testing.
Two excellent projects that can work hand-in-hand to test models are:
- Open WebUI: An open source “chat” interface that can use a number of engines and LLMs behind the scenes
- Ollama: An open source “engine” that can talk between user tools such as chat interfaces or software development tools, and an LLM
Thankfully the clever developers behind Open WebUI have provided a container with both of these tools configured and ready to go. The project’s GitHub page tells us to run the following:
docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama
But we’re not going to do that. Instead, we’re going to use the mighty Portainer container manager!
Make sure you’ve followed all the steps above – for this example to work, we need to have enabled GPU access in the Portainer GUI, and we need to have added the “ghcr.io” GitHub Container Registry.
Click on the “Dashboard” link under our “local” environment, and we can see what containers are running. Right now there’s only Portainer itself, so we have a single running container, and a single volume which is the persistent disk storage for the Portainer container.
Click the “Volume” button on the right (or alternatively the “Volumes” link in the left hand menu), and we’ll see the persistent storage volumes. Again, with only one active container, all we see is the “portainer_data” volume that was created by the automation tool.
Click “Add volume” to create a new volume. The docker command line we looked at above wants to map a Docker volume called “ollama”, and present that to the container as the path “/root/.ollama” inside the running container. So we’ll call our volume “ollama”. Simply type in the name, leave all the other values default, and click “Create volume”:
Once done, you should see a message saying it was successful, and you’ll return to the “Volumes” screen and see the new “ollama” volume listed, and tagged as “unused”.
Let’s repeat for the “open-webui” volume. Volumes -> Add volume. Name it “open-webui”, and click “Create the volume”. Again, this should return us to the “Volumes” screen, showing our two new volumes, both in the “Unused” state.
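For reference, the same two volumes could be created over SSH with the docker CLI; this is a sketch of the equivalent commands, not something you need to run if you used the GUI:

```shell
# Create the two named volumes the container will use
sudo docker volume create ollama
sudo docker volume create open-webui

# Confirm both volumes exist
sudo docker volume ls
```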
Time to create/pull the container itself. Click on the “Containers” link in the left hand menu. Again, we see just our Portainer container running and nothing else. Click “Add container”:
We’re going to copy-paste all the values from the Docker command supplied above and enter them in the relevant fields.
- Name: open-webui
- Registry: ghcr.io (Not seeing this? You forgot to follow the steps to add it above)
- Image: open-webui/open-webui:ollama
- Port mapping:
- Host (i.e.: what we expose on our VM to the world): 3000
- Container (i.e.: what the service inside the container is configured to run): 8080
- Note that your “host” ports must be unique: you can’t specify the same listening port twice on the host side. Each container spawns in a separate network, so there are no problems with containers having conflicting ports internally, as they’re kept segregated. This allows us to map any port we wish on the host onto listening ports inside containers.
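The host/container split maps directly to docker’s `-p` flag; a sketch with a hypothetical second container to illustrate the unique-host-port rule:

```shell
# -p HOST:CONTAINER — the host-side port must be unique per host,
# but many containers can listen on the same internal port:
#   -p 3000:8080   # first container, reachable on host port 3000
#   -p 3001:8080   # hypothetical second container, host port 3001
```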
We’re not quite finished yet, but here’s what it should look like so far:
We still need a few more options set. Scroll down to “Advanced container settings” and click the “Volumes” button. Click “Map additional volume”. The settings should be:
- Container (i.e.: the path inside the container): /root/.ollama
- Volume (i.e.: the volume on our host that we created in the earlier step): Choose “ollama – local” from the drop down
Repeat once more for the second volume – click “Map additional volume”, and make the settings:
- Container (i.e.: the path inside the container): /app/backend/data
- Volume (i.e.: the volume on our host that we created in the earlier step): Choose “open-webui – local” from the drop down
Next, click “Restart policy”. You can choose from four options:
- Never – if the container is manually stopped, crashes, or the system reboots, this specific container never restarts
- Always – no matter what happens, the container always restarts
- On failure – the container only restarts if there’s an internal problem with it (it crashes, for example)
- Unless stopped – a special option that allows you to manually stop a container, and that state will be remembered even after reboot. However if the container was running at reboot time, it will be started up again after reboot.
We’ve committed to matching the docker command supplied by the developers, so let’s set ours to “Always”:
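The four GUI options correspond to docker’s `--restart` values. If you ever need to change the policy on an existing container from the shell, `docker update` can do it (container name as used in this guide):

```shell
# GUI option        docker flag
# Never          -> --restart no
# Always         -> --restart always
# On failure     -> --restart on-failure
# Unless stopped -> --restart unless-stopped

# Change the policy on a running container without recreating it
sudo docker update --restart always open-webui
```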
And finally, the reason we’re all here. Click the “Runtime and resources” button. The automation tool that deployed all of this for you was configured to make the NVIDIA Container Runtime the default runtime. You don’t need to select it from the list, and can leave that option as “Default”. However you can see “nvidia” in the list should you wish to specify that manually for whatever reason.
Scroll down a little to “Enable GPU”, and enable that option. By default it will select “Use All GPUs”. If you have ordered a multi-GPU instance, you can decide here if you want to split your GPUs up amongst different containers, or present all GPUs to all containers. How you configure that is up to you, but also note that not all applications can use multiple GPUs. You’ll need to consult each application’s documentation individually to see what it can do.
The default “capability” options selected are “compute” and “utility”. These are all we need for our particular container. But the options on offer are:
- Compute – use the CUDA compute component of the NVIDIA GPU
- Utility – be able to access metrics and tools such as the NVIDIA command line tool `nvidia-smi` to query the state, status and driver level of the GPU
- Compat32 – enable legacy 32-bit compatibility mode (unlikely to be necessary, as almost all modern tools are 64-bit)
- Video – utilise the onboard transcode ASICs inside the GPU for hardware accelerated video encode / decode / transcode, and various transforms like colour correction, tone mapping, HDR processing, etc
- Display – access the DRI/DRM display level hardware for producing graphical output, necessary if you were running a graphical application or desktop based on X11 or Wayland, virtual desktop, etc.
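For reference, with the NVIDIA Container Toolkit these capabilities can also be requested from the CLI via an environment variable; a sketch for a one-off test (the CUDA image tag is illustrative):

```shell
# Request only the compute and utility capabilities
sudo docker run --rm --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```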
Once done, scroll back up slightly and you’ll see the “Deploy container” button just above the “Advanced container settings” area we just configured. Click it to save our settings and begin pulling the container image, which will then start running!
Note that this process can take a while. Some containers are quite large, and can take some time to download from their respective Internet based container registries. After a short wait, Portainer will tell us when the container is ready to use. You may see it sitting in the “starting” state for a few seconds. Once “healthy”, it’s ready to use.
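If you prefer to watch the state change from the shell, the container’s health status is visible via `docker ps`:

```shell
# Shows "Up ... (starting)" and then "Up ... (healthy)" once ready
sudo docker ps --filter name=open-webui --format '{{.Names}}: {{.Status}}'
```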
Using our Open-WebUI container
We followed the instructions to launch the container and expose it on TCP/3000. Find your VM’s IP on your product information page:
So, for example, if your product was assigned the IP “203.0.113.25”, you could connect to your Open-WebUI service in your browser at “http://203.0.113.25:3000/”. Note that this connection is unencrypted; adding SSL/TLS encryption to this instance is left up to the user.
On first connect, Open-WebUI will ask you to create a new user who will be the administrator user. Enter in any details you like (feel free to use a fake email address if you want – no checking or validation is done for this local instance), and add a password. Once done, you’ll be sent to the main screen with a “what’s new” popup. Dismiss this, and you’ll see the chat interface:
What we have here is Open-WebUI (the chat interface we can see) and Ollama (the LLM engine/framework behind the scenes that we can’t see), but no model. We can find out what models are available on the Ollama website:
These are all open source models that can be downloaded and used entirely privately. In this example, I’m going to pick a nice small model, the Deepseek-R1 1.5b parameter model. You don’t have to use that, and can use anything you like. But do be aware of the following constraints:
- This model will be downloaded to your VM instance and run locally, which is the whole point of this “private LLM” exercise. This takes time. Huge models will take a long time to download.
- It needs to fit on the disk of your VM. If you’ve ordered the smallest VM SHARON AI offers at 32GB of space (and factoring in things like the Linux operating system, NVIDIA drivers, Docker runtimes, container images, etc.), you won’t be able to fit some of the larger 400GB models on the disk.
- The models need to fit inside the GPU RAM, with room left over to process information. Our H100 fleet is blessed with an impressive 96GB of RAM, but other GPUs may have less. In general, go for a model that is smaller than your GPU RAM.
In the name of expedience, I will grab the smallest Deepseek-R1 model:
This 1.5 billion parameter model is a mere 1.1GB, and should download quite quickly. The Ollama site tells us that the ollama command to pull the model down is:
ollama run deepseek-r1:1.5b
We’re going to take just the model name, “deepseek-r1:1.5b”, and copy that. Back on our Open-WebUI instance, we can click “Select a model” in the top-left of the screen. Pasting in our “deepseek-r1:1.5b” text, we see the drop-down say “Pull ‘deepseek-r1:1.5b’ from Ollama.com”. Click that to begin downloading:
A progress bar will tell us how long that’s expected to take, and a popup will show when complete. Once again, click on “Select a model” in the top-left area, and now we should see our Deepseek-R1 1.5b model available. Select it:
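As an aside, the same model pull can be done from the shell, using the ollama binary bundled inside the container:

```shell
# Pull the model via the container's own ollama binary
sudo docker exec -it open-webui ollama pull deepseek-r1:1.5b

# List the models now available to Ollama
sudo docker exec -it open-webui ollama list
```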
Start interacting with the LLM by asking it questions. You can verify that it’s using the GPU by logging on to your VM instance and running the command:
sudo nvidia-smi -l 1
This will run NVIDIA’s GPU query tool and refresh the output every second. You will see the ollama_llama_server process running, and how much GPU memory it’s consuming. While the LLM is idle, the GPU load may be low or even 0%. As you interact with it and ask detailed questions (the more complex and wordy, the better), you will see the GPU utilisation grow. Other handy outputs include the power draw of the GPU in Watts.
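For a more focused view than the full `nvidia-smi` screen, its query mode can print just the interesting columns, refreshed every second:

```shell
# GPU utilisation, memory used, and power draw, once per second
nvidia-smi --query-gpu=utilization.gpu,memory.used,power.draw \
  --format=csv -l 1
```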
Further exercises
From here, the sky’s the limit. Some options to try:
- Edit the Open-WebUI/Ollama container to expose the internal Ollama API port, then connect a code editor running on a low-powered laptop to the LLM over that API
- Try other models and see how they compare for performance and accuracy
- Remember that this is just a single container! Millions of different containers are published for free on the Internet. Try them out across compute-heavy industries like data science, life sciences, engineering, medical research, genomics, remote sensing, computer vision, media processing, visual effects, and countless others.
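As a starting point for the first exercise: if you edit the container in Portainer to add a port mapping for Ollama’s API (host 11434 to container 11434), you can query the model remotely. A sketch, with the IP as a placeholder:

```shell
# Ask the model a question over Ollama's HTTP API
curl http://YOUR_VM_IP:11434/api/generate \
  -d '{"model": "deepseek-r1:1.5b", "prompt": "Hello", "stream": false}'
```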