Let's Build the GPT Tokenizer [video] (youtube.com)
640 points by davidbarker 3 months ago | hide | past | favorite | 51 comments



Andrej's video on building nanoGPT is an excellent tutorial on all of the steps involved in a modern LLM.



His earlier videos on micrograd and makemore are a gold mine as well.


I can't recommend enough the whole series, zero to hero: https://karpathy.ai/zero-to-hero.html

No metaphors trying to explain "complex" ideas, making them scary and seem overly complex. Instead, hands-on implementations with analogy explainers, where you can actually understand the ideas and see how simple they are.

Steeper learning curve at first, but it is much more satisfying, and you actually earn the ability to reason about this stuff instead of writing over-the-top influencer BS.


One thing I like about the zero-to-hero series is how he almost never hand-waves over seemingly minor details.

Definitely recommend watching those videos and doing the exercises, if you have any interest in how LLMs work.


Thanks for this link - I have some free time coming up, and this seems like a great use of it!


A noob question: do you all intend to work on LLMs, or are you watching the content just for the curious mind? I am asking how anyone like me, a software generalist, can make use of this amazing content. Anyone with insights on how to transition from a generalist backend engineer to an AI engineer? Or is it a niche where the only path is the PhD route?


Speaking for myself, and aside from just being curious, it's mostly for similar reasons as why you'd want to read, for example, CLRS, even though you'll probably never implement an algorithm like that yourself in a real production environment. It's not so much about learning how, but rather why, because it'll help you answer your whys in the future (not that the how can't also be important, of course).


I was not really interested in LLMs until a month ago. I had an earlier product where I wanted a no-code app for business insights on any data source: plug in MySQL, PostgreSQL, APIs like Stripe, Salesforce, Shopify, even CSV files, and it would be able to generate queries from the user's GUI interactions. Like Airtable, but for your own data sources. I was generating SQL, including JOINs, or HTTPS API calls.

Then I abandoned it in 2021. This year, it struck me that LLMs would be great for inferring business insights from the schema. I could create reports and dashboards automatically, surfacing critical action points straight from the schema/data and from users chatting with the app.

So for the last couple of weeks, I have been building it, running tests on LLMs (CodeLlama, Zephyr, Mistral, Llama 2, Claude and ChatGPT). The results are quite good. There is a lot of tech that I need to handle: schema analysis, SQL or API calls, and the whole UI. But without LLMs, there was no clear way for me to infer business insights from the schema plus user chats.

To me, this is not a niche anymore now that I have found a problem I wanted to tackle already.


I would compare it to when I was taught how to build my own compiler. Taking away the magic was empowering. Later on I saw many opportunities to use some of the basic compiler techniques, even though I'm not out there writing the JDK.

If you had to pick, building a project using off the shelf tech would better prepare you to work your first AI engineering job. However, the knowledge in these videos could help you land that first job, and is a useful base for concepts that aren't going away any time soon.

Also, please let us know if you figure out the secret. I would love to also switch from generalist backend to ML/AI.


There’s a market (or soon to be) for software people who can evaluate a use case and apply an LLM if warranted. You don’t need a PhD, but you do need a good working knowledge of the nuts and bolts to speak truth to hype. Karpathy has a YouTube video titled something like “A Busy Person’s Guide to LLMs”, and in it he describes the model as an operating system kernel with tools and utilities surrounding it. You can build and understand those valuable tools and utilities without having a PhD in AI. I think that’s the way to break into the AI market as a traditional developer.

A good example is LangChain.


2009: Porter stemming with NLTK

2013: LDA with MALLET

2015: spaCy

2018: BERT

2023: GPT-4

2024: every person is an NLP expert in four lines of LangChain code


It's how we went from mathematics, theory of computation, algorithms, and programming language research to everybody being a web developer.


Just a guess, but understanding how LLMs are built may also help you if you want to fine-tune a model. Someone who knows more may confirm or contradict this.


No wonder GPT does so horribly on anything involving spelling, or the exact specifications of letters.

To fix it, I'd throw a few gigabytes of synthetic data into the training mix before fine-tuning, including the alphabets of all the relevant languages; things like:

  A is an upper case a
  a is a lower case A
  the sequence of numbers is 0 1 2 3 4 5 6 7 8 9 10 11 12
  0 + 1 = 1
  1 + 1 = 2
etc.

It still amazes me that Word2Vec is as useful as it is, let alone LLMs. The structure inherent in language really does convey far more meaning than we assume. We're like fish, not being aware of water, when we use language.


I am curious whether an LLM trained using a tokenizer that renders words into the IPA alphabet would make a better bot for creative writing, especially for things like rhyming, assonance, puns, and other sound-based word games. It might also do better on "fringe" languages, where the corpus in the language is small but the words might have cognates in more widely known languages.


We know OpenAI trains on significant amounts of synthetic data; they probably have something like this.


Humans, a host form. Language, a life form.


There should be awards for this type of content, with Andrew Ng's series and Karpathy's series as the first inductees to the hall of fame.


Had to double check my playback speed - he talks like a 1.25x playback speaker sounds.


Let him teach; we don't need one more rapper.


It’s pretty wild how little discussion there's been about the core feature of these models. It's as if this aspect of their development has been solved. Basically all NLP publications today take these BPE tokens as a starting point and if they are mentioned at all they’re mentioned in passing.


It makes sense - publications write about the things they added, changed or evaluated, not about all the (many!) things they do exactly as everyone else; so tokenization would be mentioned only if the publication is explicitly about a different tokenization.

And while it's a core feature, it's a fairly robust one: you can get some targeted improvements, but the default option(s) are good enough and you won't improve much over them.


Thanks for your reply.

That's my first point. In 10 years we have word2vec, GloVe, GPT-2 and... tiktoken. lol. It's as if directional, numeric magnitudes in an embedding space of arbitrary dimensionality have magically captured, or will magically capture, the nuances and expressivity of language. Optimization techniques and new strategies for domain adaptation are what matters, particularly for mobile devices, on-device ASR and short-form videos.

I don't think robust is a good characterization of clusters of semantic attributes in space or a distributional semantics of language. I'd say crude and without understanding are more accurate descriptions. Capturing semantic properties sometimes is not the same thing as having a semantics.

By targeted improvements you must be referring to domain adaptation and by the default option you must be referring to attention over BPE tokens? You can move directional quantities around in directional quantity space all day. If it results in expected behavior for your application that you weren't getting before that's great. If that's all you want to get out of these models then indeed there's nothing to do here. I'm not after improvements so much as I'm after something that works.


What I mean by robust and targeted improvements is not about the concept as such, but about any choices specifically with respect to how you build the tokenization layer. If you're making a particular system, making better targeted choices for tokenization, character filtering/preprocessing, or vocabulary can give you some improvements in efficiency, but they are rarely a deal-breaker, and tokenization is never a key enabler. If some tokenization or filtering destroys data your specific task happens to need, that's a problem, but you don't need advanced future research to fix it; going back to simpler tokenization and removing features is sufficient. At the extreme, you could always use a naive character-level tokenizer: it's trivial, but simply less computationally efficient.

If you don't care about tokenization and use any of the reasonable default options without caring about them, and if you're doing a proper pre-training on non-tiny quantities of data, then the next few layers of whatever neural architecture you have on top of these tokens will generally be able to learn to compensate for any drawbacks in your tokenization, perhaps at some computation overhead - e.g. perhaps you could have had one less layer or smaller layers if you had the best tokenization possible, and edging out that computation cost improvement is pretty much the only thing you can hope to get out of having a better tokenizer.
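The naive character-level fallback mentioned above really is trivial; a minimal sketch (a purely illustrative `CharTokenizer`, not any library's API):

```python
# Naive character-level tokenizer: every distinct character is its own
# token. Trivial and lossless, but sequences get long, which is exactly
# the computational overhead the parent comment describes.

class CharTokenizer:
    def __init__(self, corpus):
        # Build the vocabulary from the characters seen in the corpus.
        self.chars = sorted(set(corpus))
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}
        self.itos = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
assert tok.decode(ids) == "hello"   # round-trips losslessly
```

One token per character means no information is ever destroyed by the tokenizer; the cost is that every downstream layer has to attend over a much longer sequence.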


Thanks for this perspective on the tradeoff between accuracy and efficiency and the insight that an adequately pre-trained model should be in a position to recover lost information from bad tokens.

Tokenization, the gateway to word embeddings, is a means to an end. I'm not suggesting that better tokens are needed or that BPE tokens should be replaced with something else. I'm suggesting that aiming for a distributional semantics is setting the bar pretty low and that there are better places to end up than These Things Are Over Here And Those Things Are Over There Let's Combine Them And See What Happens. I'm expressing disbelief that these representations have been taken at face value and that there has been practically no discussion of applying alternative formalisms which may be more expressive.

Modeling language in a latent space only makes sense for certain aspects of language and certain kinds of analyses. Crucially, you have to have meaningful primitives to begin with. This line of thinking that an understanding of language and an understanding of the world is somehow going to emerge from mapping character spans onto a latent space and combining them with dot product attention is pretty half baked. These systems remain in Firth Mode™.


It's kind of like lexers and parsers for compilers. It's largely a solved problem so doesn't get much attention.


Thanks for your reply.

It's exactly like lexers for compilers. This parsing strategy coupled with the decision to then map the results into an embedding space of arbitrary dimensionality is why these models don't work and cannot be said to understand language. They cannot reliably handle fundamental aspects of meaning. They aren't equipped for it.

They're pretty good at coming up with well-formed sentences of English, though. They ought to be, given the excessive amounts of data they've seen.


The best thing is that I know Andrej reads all these comments. Hi Andrej! This is your calling. Miss you though!


Even if you pay, it is hard to get content of such high quality!


I've been learning a few new CS things recently, and honestly I mostly find an inverse correlation between cost and quality.

There are books from O'Reilly and paid MOOC courses that are just padded with lots of unnecessary text or silly "concept definition" quizzes to make them seem worth the price.

And there are excellent free YT video lectures, free books or blog posts.

Andrej's YT videos are one great example. https://course.fast.ai is another.


It's not only about the cost, though. There's an inverse correlation with the glossiness of the content as well.

If the web page/content is too polished, they're most likely optimizing for wooing users.

Unlike a lot of the examples I gave in the sibling comments, where the optimization is driven only by love for the topic being discussed.


  There's an inverse correlation with the glossiness of the content as well.
This is probably due to survivorship bias. Sites that have poor content and poor visual appeal (glossiness) never get on your radar.

i.e. Berkson's Paradox: https://en.wikipedia.org/wiki/Berkson%27s_paradox


There are some extremely good CS textbooks which cost money. That being said, many good ML/AI texts are free. But it's not easy reading.


>And there are excellent free YT video lectures, free books or blog posts.

There's also a tremendous amount of extremely low quality YouTube and blog content.


Sure. I don't claim the free content is all good.

But from my limited sample size, the best free content is better than the best paid content.


Full ACK. I have also grown weary of paid course offerings, because many I have checked out were basically low quality or shallow.


Do you have recommendations for other high quality courses teaching CS things?


- Operating Systems: Three Easy Pieces (https://pages.cs.wisc.edu/~remzi/OSTEP) is incredible for learning OS internals

- Beej's networking guide is the best thing for network-layer stuff: https://beej.us/guide/

- Explained from First Principles is great too: https://explained-from-first-principles.com/

- Pintos from Stanford: https://web.stanford.edu/class/cs140/projects/pintos/pintos_...


Wow. Thanks for sharing. I had no idea that Professor Remzi and his wife Andrea wrote a book on Operating Systems. I loved his class (took it almost 22 years ago.) Will have to check his book out.


Build an 8-bit computer from scratch https://eater.net/8bit/ https://www.youtube.com/playlist?list=PLowKtXNTBypGqImE405J2...

Andreas Kling. OS hacking: Making the system boot with 256MB RAM https://www.youtube.com/watch?v=rapB5s0W5uk

MIT 6.006 Introduction to Algorithms, Spring 2020 https://www.youtube.com/playlist?list=PLUl4u3cNGP63EdVPNLG3T...

MIT 6.824: Distributed Systems https://www.youtube.com/@6.824

MIT 6.172 Performance Engineering of Software Systems, Fall 2018 https://www.youtube.com/playlist?list=PLUl4u3cNGP63VIBQVWguX...

CalTech cs124 Operating Systems https://duckduckgo.com/?t=ffab&q=caltech+cs124&ia=web

try searching here at HN for recommendations https://hn.algolia.com


Thank you a ton for the links.


I can highly recommend CS50 from Harvard (https://www.youtube.com/@cs50). Even after being involved in tech for 25+ years, I learnt a lot from just the first lecture alone.

Disclosure: Professor Malan is a friend of mine, but I was a fan of CS50 long before that!


Replying to bookmark(hoard) all the thread links later.

Fellow hackers might also enjoy:

https://www.nand2tetris.org/


nand2tetris: https://www.nand2tetris.org/

I like the book better than the online course.


His previous video on LLM transformer foundations is extremely useful.


His video on backpropagation was a revelation to me:

https://www.youtube.com/watch?v=q8SA3rM6ckI


I would love a video series from him where he makes a text2img diffusion model. I found the fast.ai course a bit unfocused and annoying.


Very grateful that he puts out this kind of education. The one nitpick I have is that he didn't explain all the abstract questions at the beginning, which left a bad taste, I guess. I hope I am not being disrespectful.


> you see when it's a space egg, it's a single token

I'm not sure if the crew of the Nostromo would agree ;)


Probably more coming soon, given he just left OpenAI to pursue other things.



