https://www.youtube.com/watch?v=guHW1Eb3xSs

Here’s a breakdown of the transcript with headings, based on the logical flow of the speaker’s content:

OpenAI’s GPT-OSS: Initial Impressions & Release Details

Okay, so OpenAI has finally dropped their open weights models, and we've actually got two of them. In this video, I'm going to go through them. Unlike a lot of AI influencers, I won't be calling these models insane or getting giddy over how they're the best thing since sliced bread. I want to go through these and have a look at what's good, what's bad, and perhaps even a little bit of the ugly that's in here.

Model Overview & Licensing

(0:24) Okay, so first off, we've got two models that have actually come out: a 120 billion parameter model and a 20 billion parameter model. You'll see that those numbers are actually a little bit off, which we'll talk about later on. But it is really good to go into Hugging Face and see that there are new OpenAI models. This is one of the areas where OpenAI really did go off the rails and perhaps made a lot of enemies, in that they didn't release any open LLMs after GPT-2. And while we've seen some great contributions from them for things like Whisper and the CLIP model, it's great to see them getting back in the open LLM game. (1:05) And that brings us to what they've actually released. These two models have been released under an Apache 2.0 license, which is really one of the good things here. We don't have any sort of weird model license where you can only use it if you've got less than 10 million users, or conditions around how you can use it. This is totally open in the sense that it's Apache 2.0: you can pretty much do with these weights what you want.

”Open Source” vs. “Open Weights”: A Naming Debate

(1:34) Now, that said, I'm not sure they really deserve the name GPT-OSS. Looking around and asking ChatGPT itself what the OSS stands for, it eventually comes out with this being the Open Source Series of models. But I did find it very amusing that when you actually look at its thinking, it originally thought it might be the GPT One Stop Shop series of models. So that's a little tidbit that perhaps even some of their own models were a little bit surprised about this being called open source. (2:07) And my issue with this is really that these are open weight models, not open source models. True open source models are like some of the things we've seen from AI2, where not only do we get the instruction-tuned model, which is what OpenAI has put out, but we also get the base models, the training code, checkpoints, and access to the data, so that it's fully reproducible. So I'm not sure these deserve to be called open source models. And it is funny that while they have GPT-OSS in the title, they refer to them as open weight models themselves as we go through this.

Training and Capabilities

(2:49) Now, with that bit over, let's have a look at some of the details in here. They claim that these have been trained in a very similar way to how they've been training the O3 and O4 models, meaning they've used a variety of reinforcement learning techniques as well as supervised instruction tuning, et cetera. And it does seem that they've chosen these sizes quite deliberately. (3:12) It looks like they wanted a model that people could run in the cloud or with a decent amount of GPUs, that being the 120B model, but at the same time they also wanted something people could run locally on their own computer with Ollama, LM Studio, et cetera, and that's where the 20B model comes in. So the 120B they're comparing to O4 Mini, which is pretty impressive, right? That they can get something like this. And the smaller one they're comparing to O3 Mini, which I think is really fantastic for a lot of people who want to run things like agents and do a variety of stuff locally. Hopefully this is going to unlock a whole bunch of use cases. And while a lot of the Chinese models have been getting really good, things like the really big Qwen models and Kimi K2, et cetera, are just too big for most people to be running locally on their machines. (4:09) Reinforcing the whole agentic angle, they say these models are compatible with the Responses API and are designed for agentic workflows, and they've had post-training specifically for things like instruction following, tool use, web search, Python code execution, and reasoning abilities. And that brings us to how these models actually deal with reasoning: both the 20B and the bigger 120B (the big one is actually about 117B) support three levels of reasoning effort: low, medium, and high. They talk about this being a trade-off between latency and performance, and it's something you can actually set with the system prompt. So this is going to be something that really requires a bunch of testing, just to see, okay, if we do give these models a long time to think, how much can we actually get out of them?
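To make that concrete, here's a minimal sketch of what setting the effort in the system prompt could look like against any OpenAI-compatible endpoint serving these weights. The endpoint URL and model name are placeholders, and the exact "Reasoning: high" wording follows the published Harmony prompt format, so treat it as an assumption to verify against the model card.

```python
# A minimal sketch: setting the reasoning effort via the system prompt.
# Assumptions: an OpenAI-compatible server at this URL, this model name,
# and the "Reasoning: high" system-prompt convention from the Harmony format.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[
        # low / medium / high trade latency for performance
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Prove that the square root of 2 is irrational."},
    ],
)
print(response.choices[0].message.content)
```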

Pre-training and Model Architecture

(5:51) So looking at the pre-training here, there's not a lot of detail. They basically just say they've done it similarly to what they do for their proprietary models. I don't think it's any big surprise that both of these are mixture of experts models; it's great to see that they are. We know that pretty much every proprietary model out there now is an MoE model, so it would have been disappointing if they had released something that wasn't. And we can see the sizes here: the 120B is a very big model, but it's running with only about 5 billion active parameters, and the 20B is running with only 3.6 billion active parameters. The closest to this we've seen is Qwen, where they've done some really nice things like a 30B-plus model with 3 billion active. So this one is interesting in that they've made the model smaller but the active parameters bigger. It's going to be interesting to compare these kinds of models, and hopefully we'll see DeepSeek, some of the other Chinese companies, perhaps even the Gemini team from Google, release more of these bigger models with lower amounts of active parameters. (7:01) Just going through this quickly, it is interesting that the attention patterns they're using are similar to GPT-3's. I'm not sure if that would be the case in their proprietary models. And they're also using Rotary Positional Embeddings, which allow these models to go out to 128K. I haven't had a chance to read the model card in depth to see what they were actually trained at initially; my guess is it was probably 32K.

Training Data & Post-training

(7:25) And looking in here, it's good to see that both models, even the 20B, actually support that 128K context length. Unfortunately for people who were hoping to get something multilingual, it looks like it's pretty much English only, which is not surprising; pretty much all the labs have released something mostly English-based for their first model. And then future models, and we do hope that we will get future models from OpenAI, perhaps will be a little bit more multilingual. (7:56) Now, it is interesting to look at the post-training: they talk about it as having a similar process to O4 Mini. Not exactly the same process, maybe. This language is vague, so, yes, pretty much every model is going to have a "similar process to O4 Mini" if that just means some supervised fine-tuning and some kind of RL stage for alignment. Unfortunately, we don't seem to have a lot of details about this. Not surprising; I wouldn't expect that from OpenAI or from any of the frontier labs at the moment.

Benchmark Performance Analysis

(8:29) Let's look at the benchmarks. I'm not going to spend a lot of time on these. It is interesting that they only decided to compare against their own models. Over the next day or so, my guess is we'll see comparisons come out against the Chinese models and against other proprietary models, perhaps the Claude models, Gemini models, et cetera, so that we get a sense of just where this stands. So far, I haven't seen anything about an LMArena score, which in some ways I don't mind at all, because honestly, I would rather have a model that generalizes well than one that's been overfitted for that LMArena chat format. (9:00) Now, looking at Humanity's Last Exam, these scores are really good, right? They're not as high as some of the proprietary models out there, obviously. But what both models are able to get with tools, which is substantially above the without-tools scores, hints at these being really good for the agentic uses that many of us actually want to see. (9:30) Looking at some of the other benchmarks, the big challenge is that they've already been maxed out by so many different models that they're perhaps not great benchmarks to look at anymore. Even with the AIME ones. Definitely really good scores in here, and great to see that these are so close, but you've got to wonder, when the 20B model is actually scoring more than the O3 model, are we overfitting on benchmarks here? (9:59) One of the nice things to see is just how well both these models are doing on function calling benchmarks, with the big model actually surpassing O4 Mini, and it's not that far off O3.

Scaling Reasoning with Tokens

(10:14) And then we've got a nice diagram showing, sure enough, that if you want these things to score with high accuracy, you give them longer chains of thought, so basically use that high level of reasoning effort that comes with them. That seems to be true both for the competition math and for GPQA.

Safety & Worst-Case Fine-tuning

(10:34) All right, let's jump into the demo and play with the models and see what they can actually do.

Demo: Using GPT-OSS via OpenRouter API

(10:39) Okay, so the simplest way to get started using this is probably through OpenRouter. We can see they've got a number of different versions of the 120B model up here with various providers, with both the Groq and Cerebras versions being extremely fast considering the size of the model. (11:01) So if I jump into a Colab and just sort of show you that: here I've basically set this up, and we can see the OpenAI way of doing this, which is not ideal because it's using Chat Completions; the model is actually built for the Responses API, but this does work. And you can see here, if we come in and have a look, we ask it the meaning of life, which is a very generic question, but we can see that, okay, we're getting lots of answers back. We're getting tables back. We're getting a whole bunch of stuff coming through. And it does feel like an OpenAI model. I've got to say, when you go through and read the responses, it's good to see that this has the sort of personality of the OpenAI models. Now, you may like that, you may not, that's totally up to you, but it definitely feels different from the Chinese models to me.
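For reference, the Colab setup is essentially the standard OpenAI client pointed at OpenRouter's Chat Completions endpoint, something like the sketch below. The model slug and the OPENROUTER_API_KEY environment variable are my assumptions based on OpenRouter's usual conventions.

```python
# A minimal sketch of calling gpt-oss-120b through OpenRouter with the
# standard OpenAI client (Chat Completions, not the Responses API).
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],  # assumed env var name
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # OpenRouter's slug for the 120B model
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
)
print(response.choices[0].message.content)
```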

Demo: Advanced Reasoning with OpenRouter API

(11:54) The other way you can do this is to use the OpenRouter native API. I've set up a real simple example here doing the same thing, and you can see we get a very similar kind of response. And what we can do with the native API is actually turn on the reasoning. So if you want to turn on reasoning, this is how to do it: you include the max tokens that you want overall, and then you set the reasoning effort to high, medium, or low. I think at high it's using roughly 80% of the accessible tokens for reasoning. You can see here we basically print out the reasoning response. So the reasoning response here is for, okay, how would you build the world's tallest skyscraper? And you can see that it goes through this. I do find that when you don't have a lot of tokens, you get a lot of "oh, we should be careful not to give away anything dangerous" and things like that. That doesn't seem to happen as much when you're using a high amount of max tokens and effort, but if we reduce this, we tend to get these, not refusals, but sort of guides to be careful and that kind of thing. So those are the reasoning tokens there, and the output itself is here. Now, clearly this model likes tables; it will often answer with things in tables and try to compare things. Like I said, it has its own personality, and it's still really early days, so I'm still working out what that is, how it actually works, and how much you can guide it, et cetera.
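Here's a rough sketch of that native call with reasoning turned on. The "reasoning" request field and the "reasoning" key on the returned message follow OpenRouter's unified reasoning interface as I understand it; treat the exact field names as assumptions and check the OpenRouter docs for your provider.

```python
# A sketch of the OpenRouter native API with reasoning effort set to high.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "openai/gpt-oss-120b",
        "max_tokens": 4096,               # overall budget, reasoning included
        "reasoning": {"effort": "high"},  # "low" | "medium" | "high"
        "messages": [
            {
                "role": "user",
                "content": "How would you build the world's tallest skyscraper?",
            }
        ],
    },
)
message = resp.json()["choices"][0]["message"]
print(message.get("reasoning"))  # reasoning tokens, if the provider returns them
print(message["content"])        # the final answer
```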

Running GPT-OSS Locally with Transformers

(13:38) All right. So if you want to run this locally on a GPU with Transformers, you really need to make sure you've got Triton installed as well. This is key, because the model is actually shipped with a 4-bit floating point quantization (MXFP4). If you don't have that, you'll find it loads the model in 16-bit and it becomes really large, right? So even the 20B model is way too big to fit on things like Colab and so on if you don't have Triton enabled.
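For anyone trying this, a minimal sketch following the pattern on the Hugging Face model card looks like the following; the extra `kernels` package in the install line is my assumption for getting the MXFP4 path working, so check the model card for the exact setup.

```python
# A minimal sketch of running the 20B model with Transformers.
# Assumes something like: pip install -U transformers accelerate triton kernels
# Without the Triton/MXFP4 path, the weights dequantize to 16-bit and balloon.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",  # keep the quantized weights where supported
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain mixture-of-experts in one paragraph."}
]
out = pipe(messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```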

OpenAI Harmony SDK for Responses

(14:12) Now, if you wanted to serve this, you can basically serve it with the OpenAI Responses or Chat Completions endpoints, and they've got a nice guide here for setting that up. That seems to be fine. You can also use their new SDK, OpenAI Harmony. The model has been trained on this response format, so think of it as like a ChatML format. And you can see that it's quite detailed in the extra tokens that it has, so it has tokens that it responds to in a certain way, and the Harmony API takes care of a lot of that for you. It will do things like insert the knowledge cut-off, insert the current date, all of those things. (14:55) Now, just looking at this, another interesting point is that the knowledge cut-off is 2024-06. That fits with my testing. I will show you some stuff with Ollama in a second, but I'm definitely seeing that the knowledge cut-off for this is over a year old now. And that hints to me that they've probably used one of the datasets they had in the past; they don't want to use the latest stuff. Obviously, later this week they've got GPT-5 dropping, so my guess is there's been a conscious decision about what dataset to use and how to use it here. (15:28) Using the Harmony SDK, you can actually combine it with Transformers as well. You can see in here, we've basically got the Harmony SDK showing how you assemble messages and things like that. We've still got system roles, we've still got the various other roles in here. It's interesting that they've got a system role, a developer role, and a user role, which is new, right? So I think it's going to be interesting to see how GPT-5 responds to and works with that as well.
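Here's a short sketch of assembling those roles with the Harmony SDK, based on its published examples; treat the exact class and method names as assumptions to check against the openai-harmony docs.

```python
# A sketch of building a Harmony-format prompt with system, developer,
# and user roles, then rendering it to token IDs for a model's generate().
from openai_harmony import (
    Conversation,
    DeveloperContent,
    HarmonyEncodingName,
    Message,
    Role,
    SystemContent,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

convo = Conversation.from_messages([
    # SystemContent.new() handles details like knowledge cut-off and date
    Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
    Message.from_role_and_content(
        Role.DEVELOPER,
        DeveloperContent.new().with_instructions("Answer concisely."),
    ),
    Message.from_role_and_content(Role.USER, "What is the capital of France?"),
])

tokens = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
print(len(tokens), "prompt tokens")
```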

Demo: Running GPT-OSS with Ollama

(16:01) Okay, so if you want to run the model locally yourself, probably the easiest way to get started is to use Ollama. You can see that on ollama.com they're already promoting that they support the models; they've obviously got a whole thing up about this. And this is the new quantization format I was talking about that they're actually using for this. I think the cool thing here is that as long as you've got 16GB of RAM, you can run the 20B model quite easily. With the 120B model, I think you'll be pushing it unless you've got a pretty serious machine.
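If you'd rather script it than use the UI, a minimal sketch with the Ollama Python client might look like this, assuming you've pulled the model first (the `gpt-oss:20b` tag is what Ollama appears to be using).

```python
# A minimal sketch of chatting with the 20B model through Ollama.
# Assumes: ollama pull gpt-oss:20b, and pip install ollama.
import ollama

response = ollama.chat(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Give me three facts about the Moon."}],
)
print(response["message"]["content"])
```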

Ollama Demo: US Presidents Example (Knowledge Cutoff)

(16:37) Okay, to get started, when you actually go in and select the latest model in here, this 20B model, Ollama will start to download it straight away, and then you'll be able to use it as you go. Now, just reinforcing that whole idea that the cut-off is actually about a year old: if I ask it to list the US presidents, first off you'll see that we get all these thinking tokens coming out, which is cool. Then it goes through, and it lists the last president as Joe Biden, and we can see that's clearly wrong as the current president as of 2025. So be aware of that. Now, I don't think Ollama is actually passing in the full Harmony prompt that OpenAI has recommended, just from playing around with it. But they do have it set up so that you can see some of the thinking tokens, et cetera, although I don't think we're able to turn on the different levels of reasoning effort, at least in the current UI. My guess is you could imagine that's something coming soon.

Ollama Demo: WW2 Counterfactual (Detailed & Thorough)

(17:47) And you can see that, okay, I can basically pass in a prompt, and it will think for a while running this model before it actually starts to give you the response. It's certainly not the quickest model we've got in Ollama. But testing this, I'm seeing that it's comparable with some of the bigger Qwen models, and that's pretty impressive, that these models are able to take on models that are actually bigger than they are. Okay, so if we look at the output we're getting here, we can see it's very thorough, right? We're getting good stuff here, with a long amount of thinking going on. To me, the thinking looks more like summaries of thinking than the actual chains of thought being generated, but we don't have those on the OpenAI models to check against anyway; they don't really give us the raw thinking. I was hoping that with some of these models we would get really nice, raw, step-by-step chain of thought reasoning tokens, but it seems we're mostly getting summaries of that. That said, the answer is very thorough, right? It was not quick in coming out, but we can see that it's given a very detailed answer.

Demo: Agentic Frameworks (Pydantic, LangChain, LangGraph)

(19:08) So lastly, I ran it through my agent checker notebook, and it seems to be doing quite well in the way it's handling the function calls for things like Pydantic, with different tools, et cetera. The LangChain responses seem to be going well, and same for the LangGraph stuff.
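To give a flavour of the kind of check I mean, here's a rough sketch of a tool-calling test with LangChain against the OpenRouter-hosted model; the weather tool is a made-up example, and the model slug and env var are assumptions.

```python
# A sketch of a simple tool-calling check with LangChain.
import os

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def get_weather(city: str) -> str:
    """Return a canned weather report for a city."""
    return f"It is sunny in {city}."

llm = ChatOpenAI(
    model="openai/gpt-oss-20b",  # via OpenRouter
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# A model well post-trained for tool use should return a structured
# tool call here rather than a prose answer.
result = llm.bind_tools([get_weather]).invoke("What's the weather in Sydney?")
print(result.tool_calls)  # e.g. [{'name': 'get_weather', 'args': {'city': 'Sydney'}, ...}]
```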

Final Thoughts & Future Outlook

(19:27) So it does look like it's going in the right direction for the agentic stuff, for the things like that that we care about. I think I'll make some follow-up videos about using this for agents, trying it out both with the Agents SDK and with some of the other agentic frameworks, et cetera, and trying out versions that run in the cloud as well as versions that run locally with something like Ollama for local agents.

On the whole, though, I've got to say that this is actually a very good release, right? It's looking like it's maybe not the best state-of-the-art open weights model, and I'm really not a fan of how they've called it OSS. But consider that OpenAI's last open large language model was GPT-2, and I'm guessing most of you weren't even using models back then. And back then, just to put it in a bit of context, they didn't make that model open in one shot: they released a small 117 million parameter version first, then all the different sizes over time, before we actually got the full 1.5 billion parameter one. This is certainly a step in the right direction for OpenAI, and it really puts pressure on a lot of the other frontier labs, especially ones in the West, and specifically the United States, to release more open models.

Now, my guess is that one of the reasons OpenAI has done this is that we're literally three days away from the GPT-5 launch. So it's going to be really interesting to see how much better those models are at all of these kinds of agentic things, and how quickly people forget these open models and focus on the proprietary ones. That will certainly be interesting to see.

Let me know in the comments what you think of the model yourself. I'm really curious to see. I'm still halfway through a lot of my tests on it, so I will be running those over the next few days, but I'd love to hear from other people: what do you find are its strengths? What are its weaknesses? Is anyone actually going to use this for code, for example? And what sort of agentic things are you going to try it with? Anyway, as always, if you found the video useful, please click like and subscribe, and I will talk to you in the next video. Bye for now.