<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Nuss and Bolts]]></title><description><![CDATA[Writing about the nitty-gritty of machine learning]]></description><link>https://www.nuss-and-bolts.com</link><image><url>https://substackcdn.com/image/fetch/$s_!iWip!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e4cd6a-d0ae-4cc7-8749-9754ee665360_1024x1024.png</url><title>Nuss and Bolts</title><link>https://www.nuss-and-bolts.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 02 May 2026 10:19:08 GMT</lastBuildDate><atom:link href="https://www.nuss-and-bolts.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Zach Nussbaum]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[zanussbaum@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[zanussbaum@substack.com]]></itunes:email><itunes:name><![CDATA[Zach Nussbaum]]></itunes:name></itunes:owner><itunes:author><![CDATA[Zach Nussbaum]]></itunes:author><googleplay:owner><![CDATA[zanussbaum@substack.com]]></googleplay:owner><googleplay:email><![CDATA[zanussbaum@substack.com]]></googleplay:email><googleplay:author><![CDATA[Zach Nussbaum]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[On the Lost Nuance of Grep vs. Semantic Search]]></title><description><![CDATA[the answer is...it depends]]></description><link>https://www.nuss-and-bolts.com/p/on-the-lost-nuance-of-grep-vs-semantic</link><guid isPermaLink="false">https://www.nuss-and-bolts.com/p/on-the-lost-nuance-of-grep-vs-semantic</guid><dc:creator><![CDATA[Zach Nussbaum]]></dc:creator><pubDate>Fri, 14 Nov 2025 14:02:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/42eb8732-4956-4572-8057-72c549c57313_6144x3452.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Two years ago, RAG (Retrieval Augmented Generation) meant vector databases and embedding models. Now, Claude Code, Codex, Cline, and others have popularized a <a href="https://x.com/pashmerepat/status/1926717705660375463">vector-less approach</a> by using grep, bash tools, and good ol&#8217; reasoning. If you took Twitter, the everything App, as conventional wisdom, then you might believe that vectors are overkill and agentic search with grep<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> is <em>really</em> all you need. </p><p>That is, until <a href="https://cursor.com/blog/semsearch">Cursor published a blog</a> on how they use vector search alongside grep. Who knew there could be nuance!</p><h2>A Really Contrived Example</h2><p>My initial reaction to agentic search was one of (naive) dismissal. What a waste of compute! Why use all of your hard-earned Jensen Bucks when you have a compact, efficient embedding model that understands language? This vector-less approach clearly works. When does it not work? Is it always superior to vectors?</p><p>So I built a really dumb testbed. I split out each row from the Natural Questions corpus and saved it as a text file. For each query, I removed stop words and grep&#8217;d all documents that had a match to any of the remaining keywords<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. </p><p>Simple!
But slow<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><p>Latency seems to scale linearly with index size and is much slower than numpy on my MacBook.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nUQH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd13d79-580b-4fde-8150-08af2cdec13e_2231x595.png"><img src="https://substackcdn.com/image/fetch/$s_!nUQH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd13d79-580b-4fde-8150-08af2cdec13e_2231x595.png" width="1456" height="388" alt=""></a></figure></div><p>Unsurprisingly, using only the keywords present in the query showed poor performance. We don&#8217;t get the soft matching and flexibility of embeddings. We only find matches to the correct document when there is an exact keyword match between the query and document. 
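</p><p>The testbed fits in a few lines. Here is a minimal sketch, assuming ripgrep (<code>rg</code>) is on the PATH; the stop-word list is a stand-in, and &#8220;scoring&#8221; is just the per-file match-line counts from <code>rg -i -c</code> with no normalization:</p>

```python
import re
import subprocess

# Toy stop-word list for illustration; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "to", "who", "what", "when", "where", "how", "did"}

def extract_keywords(query: str) -> list[str]:
    """Drop stop words; everything left becomes a grep keyword."""
    return [w for w in re.findall(r"\w+", query.lower()) if w not in STOP_WORDS]

def parse_rg_counts(output: str) -> list[tuple[str, int]]:
    """Parse `rg -c` output lines of the form 'path:count'."""
    scores = []
    for line in output.splitlines():
        path, _, count = line.rpartition(":")
        scores.append((path, int(count)))
    return scores

def keyword_search(query: str, docs_dir: str, top_k: int = 10) -> list[tuple[str, int]]:
    """Rank documents by how many of their lines match any query keyword."""
    keywords = extract_keywords(query)
    if not keywords:
        return []
    # rg -i -c prints "path:count" for every file with at least one match
    proc = subprocess.run(
        ["rg", "-i", "-c", "|".join(keywords), docs_dir],
        capture_output=True, text=True,
    )
    return sorted(parse_rg_counts(proc.stdout), key=lambda s: s[1], reverse=True)[:top_k]
```

<p>Swapping <code>extract_keywords</code> for a call to a cheap LLM that proposes related keywords gives the query-expansion variant discussed next.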
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ggxd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8ed2bd8-ba01-4785-8e63-1026ee69df34_1618x822.png"><img src="https://substackcdn.com/image/fetch/$s_!Ggxd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8ed2bd8-ba01-4785-8e63-1026ee69df34_1618x822.png" width="1456" height="740" alt=""></a><figcaption class="image-caption">Retrieval performance over different index sizes for 100 random queries from NQ. NQ is in-domain for Nomic Embed</figcaption></figure></div><p>But using a cheap model like gpt-5-mini to return relevant keywords based on the query nearly 10x&#8217;d performance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><p>So what is grep good for? Exact matches for a <strong>known or easily derived</strong> keyword<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. But that keyword may not always be known. </p><h3>RAG is Dead&#8230;Long Live RAG?</h3><p>Compared to a numpy vector search, grep is much slower. 
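</p><p>The numpy baseline here is nothing fancy; with L2-normalized embeddings, brute-force search is a single matrix multiply (a sketch, not any particular library&#8217;s implementation):</p>

```python
import numpy as np

def vector_search(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Indices of the top_k most similar documents.

    Assumes query and document embeddings are L2-normalized, so the
    dot product equals cosine similarity.
    """
    scores = doc_embs @ query_emb                   # (n_docs,)
    top = np.argpartition(-scores, top_k)[:top_k]   # partial sort: O(n_docs)
    return top[np.argsort(-scores[top])]            # order the winners by score
```

<p>A linear scan plus <code>np.argpartition</code> avoids a full sort, which is part of why numpy stays fast at these index sizes.</p><p>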
But embedding models converge to a <a href="https://x.com/bo_wangbo/status/1869301556186587416?s=20">bag of semantic tokens</a> and don&#8217;t offer much flexibility for queries outside of their training data. </p><p>Take the <a href="https://brightbenchmark.github.io/">BRIGHT</a> benchmark: many approaches now include some form of query rewriting and expansion. ReasonIR showed that training with expanded queries<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> <em>and</em> reranking with an LLM<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> improved over their baseline.  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7reD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff63250e9-7a2f-4d01-8a7c-ad26e0901dfa_864x746.png"><img src="https://substackcdn.com/image/fetch/$s_!7reD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff63250e9-7a2f-4d01-8a7c-ad26e0901dfa_864x746.png" width="510" height="440" alt=""></a></figure></div><p>In a nutshell, you&#8217;re trading latency and tokens for flexibility when using grep+keywords over embeddings.</p><p>But when should 
you use either? </p><p><a href="https://seconds0.substack.com/p/heres-whats-next-in-agentic-coding">Seconds</a> describes it quite clearly: </p><blockquote><p>If it&#8217;s not readily apparent what the name of a variable or a particular stage of your pipeline is, but you can reference some oblique aspect of it, embeddings will get you a lot closer than grep will.</p></blockquote><h2>Cursor&#8217;s Embedding Model</h2><p>Cursor&#8217;s embedding model seems to improve performance <strong>for all models</strong> on their internal Cursor Context Bench.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_ro7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae7c789-e17f-4962-9030-6388688fe5f9_1288x554.png"><img src="https://substackcdn.com/image/fetch/$s_!_ro7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae7c789-e17f-4962-9030-6388688fe5f9_1288x554.png" width="1288" height="554" alt=""></a></figure></div><p>So how is it different from a regular code embedding model? They leverage rich user-agent interactions:</p><blockquote><p>We provide these traces to an LLM, which ranks what content would have been most helpful at each step. We then train our embedding model to align its similarity scores with these LLM-generated rankings. 
This creates a feedback loop where the model can learn from how agents actually work through coding tasks, rather than relying on generic code similarity.</p></blockquote><p>Taking the traces (expanded queries), they train an embedding model to retrieve the documents that an agent found using tools like grep and file read. This is quite similar to ReasonIR! While they may not be explicitly modeling query/keyword expansion, training over the traces distills that information from grep and file read.</p><p>Is it better for the embedding model to explicitly learn to do the query expansion, or to learn it implicitly by mining the correct traces? Maybe <a href="https://www.pinecone.io/learn/splade/">SPLADE</a> is another alternative. Who knows, but it would be fun to try :)</p><p>At the end of the day, comparisons of grep to semantic search are missing context. Agentic search gives you flexible retrieval by offloading the learned semantics of an embedding model to an LLM. This also makes using grep in any codebase simple. You no longer have to maintain an index, worry about any potential security implications, or think about how to best chunk your files for your embedding model. However, I believe that agentic search shows where embedding models can and should improve. </p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>&#8220;with grep&#8221; is doing a lot of heavy lifting here</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Functionally, I ran <code>f"rg -i -c {'|'.join(query)}"</code>. Docs are &#8220;scored&#8221; by counts of each word. 
Because this is a toy example, I didn&#8217;t bother with any score normalization. We could use BM25 if we wanted.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Cognition recently trained <a href="https://cognition.ai/blog/swe-grep">specialized models</a> for faster (and parallel) agentic retrieval. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>It&#8217;s not lost on me that this dataset is really popular and probably not the fairest comparison. However</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>However, embeddings are still the way to go for the continuous domain</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>It&#8217;s worth noting that the model itself doesn&#8217;t rewrite the queries</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>This even works using an <a href="https://x.com/zach_nussbaum/status/1922427785710121186">off-the-shelf LLM</a></p></div></div>]]></content:encoded></item><item><title><![CDATA[I Like Big Batches and I Cannot Lie]]></title><description><![CDATA[Understanding How to Train SoTA Embedding Models for 
Cheap]]></description><link>https://www.nuss-and-bolts.com/p/i-like-big-batches-and-i-cannot-lie</link><guid isPermaLink="false">https://www.nuss-and-bolts.com/p/i-like-big-batches-and-i-cannot-lie</guid><dc:creator><![CDATA[Zach Nussbaum]]></dc:creator><pubDate>Mon, 02 Jun 2025 21:47:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5W28!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>If you want to train the next best text embedding model, chances are you&#8217;ll need to use a large batch size. But naively scaling to the critical batch size requires lots of GPUs! What happens if you don&#8217;t have the compute budget to do so? GradCache allows you to fit large batch sizes with limited memory by decoupling the batch size from the gradient calculation, the main source of memory usage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5W28!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5W28!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!5W28!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!5W28!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!5W28!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5W28!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png" width="862" height="862" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a30c9269-2312-4b3d-851f-999846242381_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:862,&quot;bytes&quot;:1213809,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.nuss-and-bolts.com/i/164949360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5W28!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!5W28!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!5W28!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!5W28!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We&#8217;ll explore why naive gradient accumulation doesn&#8217;t work and break down how GradCache works. We&#8217;ve used GradCache to train some of the embedding models at Nomic and I&#8217;ve found it essential for understanding the fundamentals of contrastive learning.  </p><h2>Big Batches are Better for Contrastive Learning </h2><p>Contrastive representation learning trains a model to learn an embedding space such that similar data points are close to each other while dissimilar points are far away. Many modern embedding models, such as CLIP and OpenAI text-embedding-large, are trained with the InfoNCE loss. For a given batch size N of paired data<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, the model is trained to identify the positive pair amongst N-1 negative pairs in the batch. For example, each text caption is compared with every image in the batch. 
The loss pushes all N-1 negative representations away from the caption and pulls the positive image representation closer to the caption embedding. </p><p>Performance improves as you increase the batch size, since there are more negative examples to compare against, but doing so requires fitting the whole NxN similarity matrix and its activations into GPU memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W7qG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W7qG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 424w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 848w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 1272w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W7qG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png" width="424" height="324.1449275362319" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd711609-a83c-493c-bae4-75a429f54fff_1104x844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:844,&quot;width&quot;:1104,&quot;resizeWidth&quot;:424,&quot;bytes&quot;:144329,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.nuss-and-bolts.com/i/164949360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W7qG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 424w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 848w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 1272w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>But what happens if you don&#8217;t have enough memory to do so?</p><h2><a href="https://arxiv.org/abs/2101.06983">GradCache: Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup</a></h2><p>GradCache is a technique to reduce memory requirements by removing the backward pass&#8217;s dependency on the batch size. Let&#8217;s dig into how this works!</p><h3>How Loss is Computed</h3><p>The InfoNCE loss minimizes the categorical cross entropy loss between the positive pair and all other pairs in the batch. Each summation requires fitting the <em>whole batch</em> in memory. 
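</p><p>To make that concrete, here is a minimal NumPy sketch (my own illustration, not code from the post or the GradCache paper) of the InfoNCE loss over a batch of N paired embeddings. The N&#215;N logits matrix is the quantity that has to fit in memory:</p>

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, targets: np.ndarray, tau: float = 0.07) -> float:
    """InfoNCE loss over N paired embeddings; both arrays have shape (N, D)."""
    # L2-normalize so the dot products below are cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    # The N x N similarity matrix: entry (i, j) compares query i with target j.
    # Its memory footprint grows quadratically with the batch size.
    logits = (q @ t.T) / tau
    # Row-wise log-softmax (shifted for numerical stability); the positive
    # pair for query i sits on the diagonal.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())
```

<p>Doubling the batch size quadruples the logits matrix, and in a real encoder the cached activations grow with the batch as well, which is why naive scaling quickly runs out of GPU memory.</p><p>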
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hbgs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hbgs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 424w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 848w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 1272w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hbgs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png" width="398" height="101.95073891625616" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:208,&quot;width&quot;:812,&quot;resizeWidth&quot;:398,&quot;bytes&quot;:31606,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.nuss-and-bolts.com/i/164949360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hbgs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 424w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 848w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 1272w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">InfoNCE loss formulation used to train embedding models. 
Here <code>f</code> and <code>g</code> are models that output representations, and <code>S</code> and <code>T</code> are the paired data points in the batch.</figcaption></figure></div><h3>Why Can&#8217;t You Use Gradient Accumulation?</h3><p>Naive gradient accumulation computes the loss and gradients in sub-batches and then updates the model parameters. In the case of your standard language modeling loss, the loss for each data point is independent of every other point in the batch! If you used gradient accumulation with the InfoNCE loss, you would only be computing the negatives within each sub-batch. </p><h3>Derivative of InfoNCE Loss</h3><p>Let&#8217;s break down the derivatives of the InfoNCE loss. The models <code>f</code> and <code>g</code> are parameterized by <code>&#920;</code> and <code>&#923;</code>. Given the loss:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = -\\frac{1}{|S|} \\sum_{s_i \\in S} \\log \\frac{\\exp(f(s_i)^T g(t_{r_i})/\\tau)}{\\sum_{t_j \\in T} \\exp(f(s_i)^T g(t_j)/\\tau)}&quot;,&quot;id&quot;:&quot;QAJHORGFPF&quot;}" data-component-name="LatexBlockToDOM"></div><p> we want to derive the partial derivatives of the loss with respect to <code>&#920; </code>and <code>&#923;</code>: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\frac{\\partial \\mathcal{L}}{\\partial \\Theta} &amp;= \\sum_{s_i \\in S} \\frac{\\partial \\mathcal{L}}{\\partial f(s_i)} \\frac{\\partial f(s_i)}{\\partial \\Theta} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial \\Lambda} &amp;= \\sum_{t_j \\in T} \\frac{\\partial \\mathcal{L}}{\\partial g(t_j)} \\frac{\\partial g(t_j)}{\\partial \\Lambda}\n\\end{align}&quot;,&quot;id&quot;:&quot;LDQKYEPBXE&quot;}" data-component-name="LatexBlockToDOM"></div><p>To make this more palatable, let&#8217;s work out the derivative for a <em>single</em> data point in <code>S</code> and <code>T</code>, respectively.</p><div 
class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\mathcal{L}_i &amp;= -\\log \\frac{\\exp(f(s_i)^T g(t_{i}))}{\\sum_{t_j \\in T} \\exp(f(s_i)^T g(t_j))} \\\\\n\\mathcal{L}_i &amp;= -\\log \\frac{e^{a_i}}{\\sum_{j} e^{a_j}} \\\\\n&amp;\\text{where } a_i = f(s_i)^T g(t_{i})\\\\ \\\\\n&amp;=  -a_i + \\log(\\sum_{j} e^{a_j}) \\\\\n\\frac{\\partial \\mathcal{L}_i}{\\partial f(s_i)} &amp;= \n\\frac{\\partial \\mathcal{L}_i}{\\partial a_i} \\cdot \\frac{\\partial a_i}{\\partial f(s_i)} + \n\\sum_j \\frac{\\partial \\mathcal{L}_i}{\\partial a_j} \\cdot \\frac{\\partial a_j}{\\partial f(s_i)} \\\\\n&amp;=  -g(t_i) + \\sum_{t_j \\in T}p_{ij} g(t_j) \\tag{Appendix A} \\\\\n&amp;\\text{where } p_{ij} = \\frac{\\exp(f(s_i)^T g(t_{j}))}{\\sum_{t \\in T} \\exp(f(s_i)^T g(t))} \\\\ \\\\ \n\\frac{\\partial L}{\\partial f(s_i)} &amp;= -\\frac{1}{|S|} \\left(g(t_i) - \\sum_{t_j \\in T}p_{ij} g(t_j)\\right)\\\\\n\\end{align}&quot;,&quot;id&quot;:&quot;BNASZUUAUY&quot;}" data-component-name="LatexBlockToDOM"></div><p>We can interpret this partial derivative as how much we should <strong>pull the query representation</strong> <code>f(s_i)</code> toward the correct target representation <code>g(t_i)</code>, and <strong>push it away</strong> from all other targets in the batch, weighted by their similarity.</p><p>Similarly, the partial derivative with respect to <code>g(t_j)</code> has a mirrored structure:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\frac{\\partial \\mathcal{L}_i}{\\partial g(t_i)} &amp;= \n\\frac{\\partial \\mathcal{L}_i}{\\partial a_i} \\cdot \\frac{\\partial a_i}{\\partial g(t_i)} + \n\\sum_j \\frac{\\partial \\mathcal{L}_i}{\\partial a_j} \\cdot \\frac{\\partial a_j}{\\partial g(t_i)} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial g(t_j)} &amp;= -\\frac{1}{|S|} \\left( \n\\sum_k \\mathbf{1}[r_k = j] f(s_k) - \\sum_{s_i \\in S} p_{ij} 
f(s_i)\n\\right)\n\n\\end{align}&quot;,&quot;id&quot;:&quot;DDEBCVZZPR&quot;}" data-component-name="LatexBlockToDOM"></div><p>We <strong>push the target representation</strong> <code>g(t_j)</code> away from all queries that treat it as a negative (weighted by how similar they are), and <strong>pull it toward</strong> the corresponding query representation <code>f(s_k)</code> if it is the true positive match.</p><p>Looking at the partial derivatives, we can see we can&#8217;t use naive gradient accumulation as they rely on the full batch similarities.</p><p>So what <em>can</em> we do to reduce memory?</p><h3>Breaking Apart</h3><p>Remember the partial derivatives of the loss with respect to <code>&#920; </code>and <code>&#923;?</code></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\frac{\\partial \\mathcal{L}}{\\partial \\Theta} &amp;= \\sum_{s_i \\in S} \\frac{\\partial \\mathcal{L}}{\\partial f(s_i)} \\frac{\\partial f(s_i)}{\\partial \\Theta} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial \\Lambda} &amp;= \\sum_{t_j \\in T} \\frac{\\partial \\mathcal{L}}{\\partial g(t_j)} \\frac{\\partial g(t_j)}{\\partial \\Lambda}\n\\end{align}&quot;,&quot;id&quot;:&quot;LMPZCDXSOF&quot;}" data-component-name="LatexBlockToDOM"></div><p>We can take advantage of two key properties of the gradient computation:</p><ol><li><p>The loss gradient with respect to the representations (e.g., &#8706;L/&#8706;f(s_i)) depends only on the numerical values of the representations and not on the encoder parameters &#920; or &#923;.</p></li><li><p>The gradient with respect to the encoder parameters (e.g., &#8706;f(s_i)/&#8706;&#920;) depends only on s_i and &#920;, but <strong>does not require the full batch</strong>.</p></li></ol><p>This lets us avoid building a full end-to-end computational graph from input &#8594; encoder &#8594; embeddings &#8594; loss &#8594; gradient.</p><p>Instead, we can:</p><ul><li><p>Compute the representations without 
tracking gradients</p></li><li><p>Compute the loss and the numerical gradients with respect to <code>f(s_i)</code> using the full batch.</p></li><li><p>Re-run the forward pass of the encoder for each <code>s_i</code> and use the precomputed &#8706;L/&#8706;f(s_i) to backpropagate and obtain &#8706;L/&#8706;&#920;.</p></li></ul><p>This saves memory by avoiding the need to store encoder activations for the entire batch, while still enabling correct gradient computation.</p><p>GradCache can be thought of as a specialized case of gradient checkpointing. Training normally, you store all intermediate activations to compute gradients during the backward pass. Gradient checkpointing saves memory by discarding these activations and recomputing them during backward using a second forward pass.</p><p>GradCache takes this a step further for contrastive learning: it discards <em>all</em> activations for the encoder and recomputes <em>only what's needed</em> using precomputed gradients of the loss with respect to the embeddings. This means you avoid storing full-batch activations entirely. </p><h2>Conclusion</h2><p>In this article, we walked through why naive gradient accumulation doesn&#8217;t work for contrastive learning setups and how GradCache removes the batch dependency for gradient accumulation. GradCache allows you to scale to the critical batch size with limited hardware. Maybe you don&#8217;t need as much compute as you thought to train the next best embedding model!</p><h2>Appendix</h2><h3>A.) 
Partial Derivative of LogSumExp</h3><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nl &amp;= \\log\\left(\\sum_{j} \\exp(x_j)\\right)  \\\\\n\\frac{\\partial l}{\\partial x_i} &amp;= \\frac{\\partial l}{\\partial x_i} \\log\\left(\\sum_{j} \\exp(x_j)\\right)  \\\\\n&amp;=  \\frac{\\partial l}{\\partial x_i}  \\log(z) \\\\\n&amp;\\quad \\text{where } z = \\sum_{j} \\exp(x_j) \\\\\n&amp;= \\frac{\\partial l}{\\partial x_i}\\frac{1}{\\sum_{j} \\exp(x_j)} * \\exp(x_j)\\\\\n&amp;= \\frac{\\exp(x_i)}{\\sum_{j} \\exp(x_j)}\n\\end{align}&quot;,&quot;id&quot;:&quot;ZKRIDNHVMM&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Examples of this include question-answer pairs from search engines and images and their text captions. There is a lot of naturally occurring paired data; however not all of it is high quality!</p></div></div>]]></content:encoded></item><item><title><![CDATA[Optimizing a WebGPU Matmul Kernel for 1TFLOP+ Performance]]></title><description><![CDATA[Building Surfgrad, a high-performant, WebGPU-powered autograd library]]></description><link>https://www.nuss-and-bolts.com/p/optimizing-a-webgpu-matmul-kernel</link><guid isPermaLink="false">https://www.nuss-and-bolts.com/p/optimizing-a-webgpu-matmul-kernel</guid><dc:creator><![CDATA[Zach Nussbaum]]></dc:creator><pubDate>Mon, 11 Nov 2024 16:30:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fak3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I work at <a href="https://nomic.ai">Nomic</a>, where many of my colleagues work on building large TSNE-like 
visualizations that <em>work</em> in the browser<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Showing tens of millions of data points in the browser without turning your computer into an oven is no easy challenge. I overhear discussions of many of the scaling problems solved by <a href="https://github.com/nomic-ai/deepscatter">Deepscatter</a>, first developed by Ben Schmidt. </p><p>However, many conversations that I overhear tend to revolve around TypeScript and how awesome WebGPU is. At the time of writing, I couldn&#8217;t find any autograd libraries built with WebGPU<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. So as an educational exercise to learn WebGPU and TypeScript, I decided to build <strong>Surfgrad</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, a high-performance, WebGPU-powered autograd library that enables browser-based tensor operations. </p><p>In this post, I&#8217;ll cover how I optimized a naive WebGPU matrix multiplication (matmul) kernel to 1 <a href="https://en.wikipedia.org/wiki/Floating_point_operations_per_second">TFLOPS</a>+ of compute throughput<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. The goal isn&#8217;t to build the <strong>fastest</strong> autograd library, but to show the nuances of WebGPU and how it might differ from CUDA.</p><p>Perhaps in the future, we can even use <a href="https://github.com/zanussbaum/surfgrad">Surfgrad</a> for running the next Llama models. </p><h2>What is WebGPU?</h2><p>WebGPU is an API designed for people to write GPU code that runs on any phone or computer with a web browser.  
Previously, people hacked around WebGL to run machine learning workloads with tricks like rendering to an <a href="https://benschmidt.org/post/2023-03-07-webGPU-day/">invisible canvas and reading the numbers back as colors</a>. Now people can take advantage of the increasing power of GPUs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> in laptops and run compute kernels (i.e. data in, data out, without any funny business). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fak3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fak3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 424w, https://substackcdn.com/image/fetch/$s_!fak3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 848w, https://substackcdn.com/image/fetch/$s_!fak3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 1272w, https://substackcdn.com/image/fetch/$s_!fak3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fak3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png" width="1456" height="769" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:769,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:212600,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fak3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 424w, https://substackcdn.com/image/fetch/$s_!fak3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 848w, https://substackcdn.com/image/fetch/$s_!fak3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 1272w, https://substackcdn.com/image/fetch/$s_!fak3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>WebGPU was created to give the &#8220;compute&#8221; shader first-class support and open the doors for in-browser, private machine learning development. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oTen!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oTen!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 424w, https://substackcdn.com/image/fetch/$s_!oTen!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 848w, https://substackcdn.com/image/fetch/$s_!oTen!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 1272w, https://substackcdn.com/image/fetch/$s_!oTen!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oTen!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png" width="1456" height="734" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222165,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oTen!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 424w, https://substackcdn.com/image/fetch/$s_!oTen!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 848w, https://substackcdn.com/image/fetch/$s_!oTen!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 1272w, https://substackcdn.com/image/fetch/$s_!oTen!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The compute (and vertex and fragment) shaders are written in <a href="https://www.w3.org/TR/WGSL/">WGSL</a>. WGSL is designed for developers to write a single shader that gets compiled to lower-level languages like SPIR-V for Vulkan and MSL for Metal. </p><p>Ben&#8217;s also written some great articles on what WebGPU is and why it&#8217;s important:</p><ul><li><p><a href="https://benschmidt.org/post/2020-01-15/2020-01-15-webgpu/">Javascript and the next decade of data programming</a></p></li><li><p><a href="https://benschmidt.org/post/2023-03-07-webGPU-day/">Happy WebGPU Day</a></p></li></ul><h3>WebGPU vs. CUDA</h3><p>NVIDIA is the most popular hardware choice, and CUDA, its programming API, is one of the reasons why, but CUDA only runs on NVIDIA hardware. </p><p>WebGPU and CUDA share similar <a href="https://github.com/googlefonts/compute-shader-101/blob/main/docs/glossary.md">terminology</a>, but they don&#8217;t have the exact same functionality. 
WebGPU <em>just </em>introduced support for <a href="https://developer.chrome.com/blog/new-in-webgpu-128#experimenting_with_subgroups">subgroups</a>, which allow threads within a group to efficiently share data. This is a big win for things like matrix multiplies, where you may otherwise recalculate similar values. </p><p>WebGPU also sits a half step above CUDA in that it compiles to other GPU APIs like Vulkan and Metal. It&#8217;s kind of like React Native for GPU compute shaders. </p><h3>WebGPU Compute Shader Basics</h3><p>The smallest unit of execution is a <strong>thread</strong>, which runs the compute shader. </p><p><strong>workGroups </strong>are groups of threads that run together in parallel (they&#8217;re called threadBlocks in CUDA) and can access the same shared memory.</p><p>WebGPU can dispatch many of these <strong>workGroups</strong> at once; CUDA calls this a Grid (which is made of threadBlocks). </p><p>Similarly to CUDA, <strong>workGroups</strong> and <strong>workGroup dispatches</strong> are defined in 3D. The size of a <strong>workGroup </strong>is defined by <code>@workgroup_size(x, y, z)</code>, where the number of threads per workGroup is <code>x * y * z</code>. </p><h2>Writing a Fast Matrix Multiply</h2><p>Matrix multiplications make up most of the floating point operations (<a href="https://en.wikipedia.org/wiki/Floating_point_operations_per_second">FLOPs</a>) in Large Language Models like GPT-4 and Llama; the matmul is the basic primitive for most training and inference workloads.</p><p>Native WebGPU support for matrix multiplies is limited to <a href="https://webgpu.rocks/wgsl/language/types/#matrix">small matrices</a>, which aren&#8217;t useful for modern Deep Learning workloads where matrices can be large<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>A few quick notes on notation. 
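</p><p>To make the workGroup sizing above concrete, here&#8217;s a tiny Python sketch (the helper is purely illustrative, not part of the WebGPU API) of how many threads a dispatch launches:</p><pre><code>def total_threads(workgroup_size, dispatch):
    # Threads launched = (x * y * z per workGroup) * (number of workGroups).
    x, y, z = workgroup_size
    wx, wy, wz = dispatch
    return (x * y * z) * (wx * wy * wz)

# A @workgroup_size(8, 8, 1) kernel dispatched over a 16x16x1 grid runs
# 64 * 256 = 16,384 threads: one per entry of a 128x128 output matrix.
assert total_threads((8, 8, 1), (16, 16, 1)) == 128 * 128</code></pre><p>Every kernel below is just a different way of assigning these threads to entries of the output matrix. 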
</p><h3>Matrix Multiply</h3><p>First, a <a href="https://en.wikipedia.org/wiki/Matrix_multiplication">matrix multiply</a> is defined by three matrices: A, B, C. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!COlz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!COlz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 424w, https://substackcdn.com/image/fetch/$s_!COlz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 848w, https://substackcdn.com/image/fetch/$s_!COlz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!COlz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!COlz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png" width="1456" height="888" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:225493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!COlz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 424w, https://substackcdn.com/image/fetch/$s_!COlz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 848w, https://substackcdn.com/image/fetch/$s_!COlz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!COlz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The total <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#math-mem">FLOPs required of a matrix multiply</a> are <code>2 * M * K * N </code>as each operation requires both a multiply and an add (hence the 2). </p><h3>Lower Bounding Our Kernel</h3><p>Following the example from <a href="https://siboehm.com/articles/22/CUDA-MMM#lower-bounding-the-fastest-possible-runtime">Simon Boehm's great article</a>, we have two 4092x4092 matrices followed by the addition of a 4092x4092 matrix. 
Similarly, we have </p><ol><li><p>Total FLOPs: 137 GFLOPs</p></li><li><p>Total data to read: 201MB</p></li><li><p>Total data to store: 67MB</p></li></ol><p>However, I am developing on a <a href="https://www.cpu-monkey.com/en/igpu-apple_m2_pro_16_core">Mac M2 Pro, which has ~6 TFLOP/s</a> of compute and <a href="https://pocketnow.com/apple-m2-vs-pro-vs-max/">200GB/s of memory bandwidth</a>.</p><p>So, the minimum time the compute can take is </p><p><code>(137 GFLOP) / (6 TFLOP/s) &#8776; 23ms</code></p><p>and memory access takes </p><p><code>(268MB) / (200GB/s) = 1.34ms</code></p><p>so we should be compute bound (by ~17x too!). </p><h2>Writing the Kernel</h2><h3>Kernel 1: Naive Kernel</h3><p>The simplest way to compute C = A &#215; B is: for each of the <strong>M</strong> rows of A, iterate over each of the <strong>N</strong> columns of B and accumulate the dot product along the shared <strong>K</strong> dimension. In Python, this looks like </p><pre><code>def matmul(a, b, c):
    """
    Perform naive matrix multiplication: C = A * B
    
    :param a: Input matrix A of shape (m, k)
    :param b: Input matrix B of shape (k, n)
    :param c: Output matrix C of shape (m, n) to store the result
    """
    m = len(a)
    k = len(a[0])
    n = len(b[0])
    
    # Perform the matrix multiplication
    for i in range(m):
        for j in range(n):
            c[i][j] = 0
            for l in range(k):
                c[i][j] += a[i][l] * b[l][j]</code></pre><p>Similar to the Python code above, we define<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> our inputs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><pre><code>struct Dimensions {
  M: u32,
  K: u32,
  N: u32,
}
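// Note: a, b, and result are bound as flat, row-major arrays:
// a holds M*K floats, b holds K*N, and result holds M*N.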

@group(0) @binding(0) var&lt;uniform&gt; dimensions: Dimensions;
@group(0) @binding(1) var&lt;storage, read&gt; a: array&lt;f32&gt;;
@group(0) @binding(2) var&lt;storage, read&gt; b: array&lt;f32&gt;;
@group(0) @binding(3) var&lt;storage, read_write&gt; result: array&lt;f32&gt;;</code></pre><p>and our compute kernel:</p><pre><code>@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {
  let index = global_id.x;
  let row = index / dimensions.N;
  let col = index % dimensions.N;

  if (index &lt; dimensions.M * dimensions.N) {
    var sum = 0.0;
    for (var i: u32 = 0u; i &lt; dimensions.K; i = i + 1u) {
      sum = sum + a[row * dimensions.K + i] * b[i * dimensions.N + col];
    }
    result[row * dimensions.N + col] = sum;
  }
}</code></pre><p>The code is functionally equivalent to the Python code above! We define our <strong>workGroup</strong> size with <code>workgroup_size(1)</code> (remember, this is represented in 3D). </p><p>So each workGroup, since it&#8217;s only one thread, computes a single <code>result[i, j]</code>. </p><p>To calculate the full matrix, we need to launch one workGroup per entry of the output and call <a href="https://developer.mozilla.org/en-US/docs/Web/API/GPUComputePassEncoder/dispatchWorkgroups">dispatchWorkgroups</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> </p><pre><code>pass.dispatchWorkgroups(a.shape[0] * b.shape[1]) </code></pre><p>where <code>a.shape[0] == M</code> and <code>b.shape[1] == N</code> for an MxN output matrix. </p><p>Now, as we see below, we have lots of room for improvement!</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/8b7XC/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2509ea49-83ae-4e13-bf7a-ffb14d04bdcc_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:397,&quot;title&quot;:&quot;Naive Kernel GFLOPs vs. 
Matrix Size&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/8b7XC/1/" width="730" height="397" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The largest square matrix multiply we can calculate is 128x128 due to limits in WebGPU (more on this later). We only achieve 1.64 <strong>GFLOPS/s </strong>a far cry from the theoretical max of 6 <strong>TFLOPS/s</strong>. </p><p>Why is this kernel so slow? In effect, each workgroup calculates a single entry of the 16,384 total elements (128^2). Although we are running in parallel, each workGroup loads its own copy of the matrices. The overhead to launch more workGroups is likely more than if our workGroup had more threads and calculated more results per workGroup and each workGroup isn&#8217;t able to take advantage of any caching of the inputs. </p><h3>Kernel 2: Moarrr Threads!</h3><p>With the first kernel, we&#8217;re only able to compute small square matrices due to limits on the number of <strong>workGroups</strong> (<a href="https://developer.mozilla.org/en-US/docs/Web/API/GPUSupportedLimits">maxComputeWorkgroupsPerDimension</a>) you can <strong>dispatch </strong>at once. </p><p>Since we&#8217;re launching one workgroup per entry, a 256x256 matrix is larger than our limit!</p><p>Remember this line?</p><pre><code><code>@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {</code> </code></pre><p>We can reduce the number of <strong>dispatched workGroups</strong> by increasing the number of <strong>threads </strong>per <strong>workGroup</strong>! </p><p>If we update our code </p><pre><code><code>@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) { </code></code></pre><p>we can reduce the number of total dispatched workGroups per dimension:</p><pre><code><code>const WORKGROUP_SIZE = 256;
pass.dispatchWorkgroups((a.shape[0] * b.shape[1]) / WORKGROUP_SIZE);</code></code></pre><p>Why 256? Well, there&#8217;s another <a href="https://www.w3.org/TR/webgpu/#limits">limit</a> :) </p><p>By increasing the workGroup size, we&#8217;re able to improve our kernel by 200x! </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/iMGEH/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b66ef49d-b541-4e76-ae78-0b2fa063ec8f_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:389,&quot;title&quot;:&quot;Adding More Threads Increases GFLOPs&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/iMGEH/1/" width="730" height="389" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h3>Kernel 3: Calculating with 2D workGroups</h3><p>However, doing all the computation in &#8220;one dimension&#8221; limits the matrix size we can calculate<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>.</p><p>Although we don&#8217;t change much about our code, distributing the work in 2 dimensions lets us bypass these limits and launch more, larger workGroups. This allows us to calculate a 4096x4096 matmul. </p><p>We update our <code>@workgroup_size(8, 8)</code> and check our bounds: </p><pre><code>@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {
  let row = global_id.x;
  let col = global_id.y;

  if (row &lt; dimensions.M &amp;&amp; col &lt; dimensions.N) {
    var sum : f32 = 0.0;
    for (var i: u32 = 0u; i &lt; dimensions.K; i = i + 1u) {
      sum = sum + a[row * dimensions.K + i] * b[i * dimensions.N + col];
    }
    result[row * dimensions.N + col] = sum;
  }
}</code></pre><p>and dispatch workgroups in 2D </p><pre><code>const WORKGROUP_SIZE = 16; 
pass.dispatchWorkgroups(
    Math.ceil(a.shape[0] / WORKGROUP_SIZE),
    Math.ceil(b.shape[1] / WORKGROUP_SIZE),
);    </code></pre><p>But this is slower than our original kernel! What&#8217;s going on? </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/cF7Na/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0d0addb-0466-4e3b-b5d5-da2bc1f3c64f_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:395,&quot;title&quot;:&quot;2D Dispatch Workgroup Size vs. GFLOPS&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/cF7Na/2/" width="730" height="395" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>If we make a small change to the code</p><pre><code>@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {
  let row = global_id.y;
let col = global_id.x;</code></pre><p>we get much better kernel performance. </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/aXWxs/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9434bfc-e590-4e04-9402-6a1b9451f58b_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:395,&quot;title&quot;:&quot;2D Dispatch with Better Cache Utilization&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/aXWxs/1/" width="730" height="395" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Why is this? We&#8217;re able to take better advantage of cached inputs. The x dimension is incremented before the y dimension in the <code>global_invocation_id</code>, so with the swap more threads in each workGroup share the same row of matrix A. With the original mapping, the row changes at each invocation within the workGroup, and each thread spends a few extra cycles reading from global memory rather than the cache. </p><h3>Kernel 4: Kernel Tiling</h3><p>Another thing to consider is how much work each thread does. </p><p>Up to now, each thread only computes one entry. But launching each workGroup has some overhead, which we could amortize by computing more than one element per thread! 
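</p><p>In Python terms (an illustrative sketch mirroring the structure of the WGSL, not a real GPU API), 1D tiling means each simulated &#8220;thread&#8221; owns one row and <code>TILESIZE</code> adjacent columns, loading each element of A once and reusing it:</p><pre><code>TILESIZE = 4

def matmul_tiled(a, b, c):
    # Assumes the number of columns of b is a multiple of TILESIZE.
    m, k, n = len(a), len(a[0]), len(b[0])
    for row in range(m):
        for col in range(0, n, TILESIZE):      # one "thread" per iteration
            sums = [0.0] * TILESIZE
            for i in range(k):
                a_elem = a[row][i]             # loaded once, reused 4 times
                for t in range(TILESIZE):
                    sums[t] += a_elem * b[i][col + t]
            for t in range(TILESIZE):
                c[row][col + t] = sums[t]</code></pre><p>The WGSL kernel that follows has exactly this shape, with the inner loop over the tile unrolled into four accumulators. 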
</p><p>If calculating more elements per thread is faster than the overhead to launch each workGroup, we should see a big speedup<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>.</p><p>To do so, we calculate 4 results per thread (e.g. a 1x4 Tile). </p><pre><code>const BLOCKSIZE: u32 = 16;
const TILESIZE: u32 = 4;
@compute @workgroup_size(BLOCKSIZE, BLOCKSIZE)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {
    let row = global_id.y;
    let col = global_id.x * TILESIZE;

    if (row &gt;= dimensions.M || col &gt;= dimensions.N) {
        return;
    }

    var sum00: f32 = 0.0;
    var sum01: f32 = 0.0;
    var sum02: f32 = 0.0;
    var sum03: f32 = 0.0;

    for (var i: u32 = 0u; i &lt; dimensions.K; i = i + 1u) {
        let a_elem = a[row * dimensions.K + i];
        sum00 = sum00 + a_elem * b[i * dimensions.N + col];
        sum01 = sum01 + a_elem * b[i * dimensions.N + col + 1u];
        sum02 = sum02 + a_elem * b[i * dimensions.N + col + 2u];
        sum03 = sum03 + a_elem * b[i * dimensions.N + col + 3u];
    }

    result[row * dimensions.N + col] = sum00;
    result[row * dimensions.N + col + 1u] = sum01;
    result[row * dimensions.N + col + 2u] = sum02;
    result[row * dimensions.N + col + 3u] = sum03;
}</code></pre><p>The kernel looks roughly the same as before except we&#8217;ve <a href="https://en.wikipedia.org/wiki/Loop_unrolling">unrolled</a> the computation and are calculating <code>TILESIZE </code>results per thread.  </p><p>We can take this a step further and calculate 2D results per thread! Instead of calculating 4 elements per single row, we can calculate 4 elements for 4 rows (e.g. a 2D tile). </p><pre><code>const BLOCKSIZE: u32 = 16;
const TILE_M: u32 = 4;  // Tile size in M dimension
const TILE_N: u32 = 4;  // Tile size in N dimension

@compute @workgroup_size(BLOCKSIZE, BLOCKSIZE)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {
    let row = global_id.y * TILE_M;
    let col = global_id.x * TILE_N;

    // initialize the array with all 0s
    var sums: array&lt;array&lt;f32, TILE_N&gt;, TILE_M&gt;;
    for (var i = 0u; i &lt; TILE_M; i++) {
        for (var j = 0u; j &lt; TILE_N; j++) {
            sums[i][j] = 0.0;
        }
    }

    // Compute the 2D tile
    for (var k = 0u; k &lt; dimensions.K; k++) {
        // for each row
        for (var i = 0u; i &lt; TILE_M; i++) {
            let a_element = a[(row + i) * dimensions.K + k];
            // calculate the dot product
            for (var j = 0u; j &lt; TILE_N; j++) {
                let b_element = b[k * dimensions.N + (col + j)];
                sums[i][j] += a_element * b_element;
            }
        }
    }

    // Write results
    for (var i = 0u; i &lt; TILE_M; i++) {
        for (var j = 0u; j &lt; TILE_N; j++) {
            let output_row = row + i;
            let output_col = col + j;
            if (output_row &lt; dimensions.M &amp;&amp; output_col &lt; dimensions.N) {
                result[output_row * dimensions.N + output_col] = sums[i][j];
            }
        }
    }
}</code></pre><p>Each thread now calculates a 4x4 tile of the output matrix, and we see a slight improvement over the last kernel. </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/p5Fkp/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb464404-b5aa-4fac-ab00-4e5cc81c1542_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:395,&quot;title&quot;:&quot;1D and 2D Kernel Tiling&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/p5Fkp/2/" width="730" height="395" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Surprisingly, 2D tiling is quite slow. Why haven&#8217;t we amortized the time it takes to launch workGroups by doing more work? And why are we slower than doing one item of work per thread? </p><h3>Kernel 5: Unrolling</h3><p>To answer the last question, we will need to dig into the compiled WebGPU kernels. </p><p>Some compilers will automatically unroll loops if the bounds of the loop are known at compile time. However, we&#8217;ve been writing a general kernel for variable-shaped inputs!</p><p>Also, when writing WGSL, we don&#8217;t have any control over the compiler&#8217;s directives. </p><p>Looking at the assembly bitcode compiled from Metal, we can see that the generated code still includes the for loop! </p><pre><code>%51 = phi i32 [ 0, %41 ], [ %61, %50 ]
%52 = add i32 %37, %51
%53 = zext i32 %52 to i64
%54 = getelementptr inbounds [1 x float], ptr addrspace(1) %3, i64 0, i64 %53
%55 = load float, ptr addrspace(1) %54, align 4, !tbaa !27, !alias.scope !43, !noalias !44
%56 = zext i32 %51 to i64
%57 = getelementptr inbounds %struct.type_5, ptr %7, i64 0, i32 0, i64 %49, i32 0, i64 %56
%58 = load float, ptr %57, align 4, !tbaa !27
%59 = fmul fast float %55, %48
%60 = fadd fast float %58, %59
store float %60, ptr %57, align 4, !tbaa !27
%61 = add nuw nsw i32 %51, 1
%62 = icmp eq i32 %61, 4
br i1 %62, label %38, label %50 // branching for loop</code></pre><p>Whereas the unrolled WGSL code gets compiled to </p><pre><code>...
%141 = fmul fast float %112, %103
%142 = fadd fast float %141, %82
%143 = fmul fast float %116, %103
%144 = fadd fast float %143, %81
%145 = fmul fast float %120, %103
%146 = fadd fast float %145, %80
%147 = fmul fast float %124, %103
%148 = fadd fast float %147, %79
%149 = fmul fast float %112, %107
%150 = fadd fast float %149, %78
%151 = fmul fast float %116, %107
%152 = fadd fast float %151, %77
%153 = fmul fast float %120, %107
%154 = fadd fast float %153, %76
%155 = fmul fast float %124, %107
%156 = fadd fast float %155, %75
%157 = add nuw i32 %91, 1
%158 = icmp eq i32 %157, %27
br i1 %158, label %159, label %74 </code></pre><p>Because of the manual unrolling, the GPU is able to reduce overhead by not having to initialize and increment the inner loop, take advantage of instruction-level parallelism, and amortize the cost of launching workGroups by doing more work per thread. With the loop in place, the kernel (#4) wasn&#8217;t able to take advantage of these optimizations and was slower than just launching more workGroups (#3). </p><p>And if we make our per-thread tile 8x8, we get a 3x boost over the 4x4 version and surpass 1 TFLOP! </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/W2laF/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f6777c9-9486-47fe-b35f-befa67005b15_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:395,&quot;title&quot;:&quot;2D Kernels Unrolled&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/W2laF/1/" width="730" height="395" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h2>Conclusion</h2><p>Through these optimizations, we built a performant matmul kernel that is 1000x faster than the naive kernel and approaches the Apple M2 Pro&#8217;s theoretical peak. </p><p>And with frequent updates to WebGPU, there are still optimizations to be made! 
For example, we didn&#8217;t take advantage of <a href="https://github.com/gpuweb/gpuweb/issues/3950">subgroups</a>, a feature that is new as of Chrome 125 and should allow for faster memory access and sharing across subgroups to reduce repeated computations. </p><p>And a big thank you to <a href="https://x.com/owl_poster">Abhishaike Mahajan</a> (who writes an <a href="https://www.owlposting.com/?utm_source=global-search">incredible blog</a>) and <a href="https://x.com/elmanmansimov">Elman Mansimov</a> for feedback and encouragement to write this article!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Visualizing these 2-dimensional maps poses two problems: projecting (e.g. with TSNE or UMAP) into a 2D coordinate system is slow and not RAM-friendly as dataset size grows, and rendering millions of datapoints in the browser risks turning your laptop into a toaster.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I would be remiss not to mention two repos that do similar things: <a href="https://github.com/0hq/WebGPT">webGPT</a> (Transformer-based inference only) and <a href="https://github.com/milhidaka/webgpu-blas">webgpu-blas</a> (fast matmul kernels in WebGPU). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Highly inspired (in name and function) by <a href="https://github.com/tinygrad/tinygrad">tinygrad</a> and <a href="https://github.com/karpathy/micrograd">micrograd</a>. 
</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The format of the blog follows a similar path to <a href="https://siboehm.com/articles/22/CUDA-MMM">Simon Boehm&#8217;s article on Optimizing a CUDA Matmul Kernel</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Apple&#8217;s M3 Pro has a <a href="https://nanoreview.net/en/cpu-compare/apple-m3-pro-vs-apple-m3">reported</a> ~7TFLOPS. You can even run <a href="https://x.com/xenovacom/status/1840767709317046460">Llama3.2 (with ONNX)</a> in your browser with 85 tokens/s</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>For reference, <a href="https://huggingface.co/meta-llama/Llama-3.1-70B/blob/main/config.json">Llama 3.1 70B</a> has matrices of size (8192x28672)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>There&#8217;s quite a bit of boilerplate for running WebGPU code from Typescript, which I&#8217;ll leave for the curious to explore: https://webgpufundamentals.org/webgpu/lessons/webgpu-fundamentals.html</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>WGSL supports a <a href="https://google.github.io/tour-of-wgsl/types/">number</a> of <a 
href="https://www.w3.org/TR/WGSL/#types">types</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>To simplify the article and the amount of code, I removed much of the boilerplate needed to set up the GPU buffers and focus only on what&#8217;s required to understand how I optimized the WGSL kernels. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Due to another limitation: <a href="https://developer.mozilla.org/en-US/docs/Web/API/GPUSupportedLimits#maxComputeWorkgroupsPerDimension">maxComputeWorkgroupsPerDimension</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>And this is something <a href="https://devstreaming-cdn.apple.com/videos/wwdc/2016/606oluchfgwakjbymy8/606/606_advanced_metal_shader_optimization.pdf?dl=1">Apple suggests when building compute kernels</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Stonks Only Go Up: Building a WallStreetBets Sentiment Model]]></title><description><![CDATA[What I Wish I Knew Building A Full-Stack ML System]]></description><link>https://www.nuss-and-bolts.com/p/stonks-only-go-up-building-a-nlp</link><guid isPermaLink="false">https://www.nuss-and-bolts.com/p/stonks-only-go-up-building-a-nlp</guid><dc:creator><![CDATA[Zach Nussbaum]]></dc:creator><pubDate>Tue, 26 Apr 2022 20:24:08 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!9yo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the past few years, I&#8217;ve spent a majority of my time trying to learn more about ML. From learning how to approach a problem to MLOps, my projects (and mentors) have been incredible resources in taking me from a noob to a baseline level of competence as a <a href="https://www.toptal.com/machine-learning">machine learning engineer</a>. </p><p>There are many tutorials and classes (my favorite being <a href="http://cs231n.stanford.edu/">CS231N</a>) that delve into the fundamentals of machine learning, a necessity for any later research/serious application of ML. Additionally, there are millions (~80M!) of &#8220;<a href="https://en.wikipedia.org/wiki/%22Hello,_World!%22_program">Hello World</a>&#8221; versions of machine learning, many of which use a framework to build a model to <a href="https://www.tensorflow.org/tutorials/quickstart/beginner">classify handwritten digits</a>. However, I found the content on building a solution to a business problem relatively sparse (or at least from what I was searching). So here&#8217;s my attempt at filling in the gap between basic blogs and research papers on arXiv. </p><p>This article (and possibly others) will cover some work I did with <a href="https://topstonks.com/">TopStonks</a>, how we approached an ML problem end to end, and things I learned along the way. </p><h2>WTF is TopStonks</h2><p>TopStonks aggregates &#8220;The best advice from the worst investors on the internet&#8221;. At a high level, they utilize data from financial communities (e.g. <a href="https://www.reddit.com/r/wallstreetbets/">r/wallstreetbets</a>) and turn that into meaningful insights. 
They have been featured in major financial publications including the <a href="https://www.wsj.com/articles/how-redditors-find-the-next-gamestop-stock-11613644201">Wall Street Journal</a>, <a href="https://markets.businessinsider.com/news/stocks/retail-traders-trumps-media-spac-deal-dwac-meme-stoocks-wallstreetbets-2021-10">Business Insider</a> and <a href="https://www.forbes.com/sites/greatspeculations/2021/04/19/saving-investors-from-meme-stocks-amc-entertainment-amc/?sh=37184dc551e4">Forbes</a>. </p><p>Working with an accomplished friend, I was lucky enough to get access to this data and learn from him how to approach problems like this. </p><h2>Deriving Sentiment</h2><p>Once you have all this data, the question then becomes: what do you do with it? Theoretically (and I&#8217;m sure some people do), you could parse every comment to find some <a href="https://www.investopedia.com/terms/a/alpha.asp">alpha</a>. But this is not scalable, unless your day job is to just read Reddit threads. </p><p>Ok, what next? You could apply some hard rules and search for terms like "&#128640;" or "to the moon &#127769; " to score a rough sentiment, but that only gets you so far. These rules capture only a small slice of the nuance of language, and learning that nuance is not easy!</p><p>For example, how would you write rules to classify these comments? Not so easy!</p><pre><code>SPCE is already printing tendies and continuing to go up as I type this</code></pre><p></p><pre><code>MSFT and BABA both shitting the bed, my spy calls didnt get filled this morning, my V calls didnt get filled last minute last night (getting some REAL fomo there). what do i buy now :( ?</code></pre><p></p><p>This is where we can use machine learning to determine the sentiment of a comment, i.e. how bullish an investor is. We have the data and we have a baseline rules-based approach, but it&#8217;s not good enough. 
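</p><p>As a sketch, such a rules-based scorer might look like the following (the phrase lists and scoring here are made up for illustration, not the actual rules we used):</p><pre><code>BULLISH_PHRASES = ['\U0001F680', 'to the moon', 'tendies']
BEARISH_PHRASES = ['puts', 'shitting the bed', 'drilling']

def rough_sentiment(comment):
    # crude score: +1 per bullish phrase, -1 per bearish phrase
    text = comment.lower()
    bullish = sum(text.count(p) for p in BULLISH_PHRASES)
    bearish = sum(text.count(p) for p in BEARISH_PHRASES)
    return bullish - bearish

rough_sentiment('SPCE to the moon \U0001F680')  # 2
rough_sentiment('MSFT shitting the bed')        # -1</code></pre><p>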
</p><h2>Looking At Your Data</h2><p>One thing that gets lost in many of the tutorials and classes is that data in the real world is USUALLY messy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. At the very least, there were some nuanced choices that were made, so you had better be sure you know exactly what you&#8217;re working with. </p><p>Data is the most important part of the process, and if you don&#8217;t take the time to understand its ins-and-outs, your model will suffer. Your model will ONLY be as good as the data you are working with. When working with my friend, I was truly surprised to see that the model development cycle consisted of ~75% working with data<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> (exploratory data analysis, labeling, and cleaning) and only ~25% model building and tuning. His attention to detail on cleaning and labeling data was surprising; I assumed most people just did some basic EDA. Most of our first few weeks were spent diving into the data trying to answer: </p><ul><li><p>What macro things are people talking about? Are themes local to the thread or are they shared across threads?</p></li><li><p>Are there comments we can omit? Is there spam? What does it look like?</p></li><li><p>Are stocks talked about in different ways? Are there stock-specific verbiages seen across a longer time period (e.g. on the order of months)?</p></li></ul><p>Without going through each comment, we would have missed out on so much. We <em>could</em> have thrown the kitchen sink at the problem, but eventually that time would have been spent figuring out why the model (most likely) sucked. </p><p>If you learn anything at all from this article, it&#8217;s to spend more time with your data. Get into the weeds and understand as much as you can about what you are working with. 
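</p><p>As a concrete starting point, even a few lines of Python (sketched here on a hypothetical list of raw comments) can surface spam and junk before any modeling:</p><pre><code>from collections import Counter

def quick_eda(comments):
    # the most repeated comments are often spam or copypasta
    counts = Counter(c.strip().lower() for c in comments)
    repeated = [(c, n) for c, n in counts.most_common(5) if n > 1]
    # very short comments rarely carry usable sentiment
    too_short = sum(1 for c in comments if len(c.split()) in (1, 2))
    return repeated, too_short

comments = ['GME \U0001F680\U0001F680', 'gme \U0001F680\U0001F680', 'buy high sell low', 'lol']
repeated, too_short = quick_eda(comments)  # one repeated comment, 3 short ones</code></pre><p>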
It will especially pay off when looking at pitfalls of the model. Granted, it&#8217;s easier to look into the data when it&#8217;s human understandable and not something that requires some domain expertise like genomic data. </p><h2>Data Label Like Your Model Depends on It</h2><p>So, your data journey begins with millions of unlabelled comments. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9yo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9yo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!9yo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg" width="500" height="498" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38772,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9yo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p>Label the data! This takes time, but it&#8217;s important: as you label, you start to understand what&#8217;s confusing and you add context to the problem.</p><p>Although tedious, attention to detail is paramount. No one will (usually) put more effort and care into the boring stuff than you. This is something I learned the hard way, as early iterations of our model stunk, primarily due to my lack of attention and poor data quality.</p><h2>The Fun Part: Modeling</h2><p>We now needed to see what a naive approach would yield, then iterate on it. 
And sometimes the simplest approach produces a 6/10 or 7/10 product, and small hacks can take it to the next level. </p><p>So we wrote some basic rules on what we thought were strong signals of a bullish/bearish comment, one such case being "X to the moon &#128640; ". These rules are an example of <a href="https://en.wikipedia.org/wiki/Precision_and_recall">high-precision, low-recall</a> classifiers: they&#8217;re accurate when a rule matches a comment, but otherwise (which is most of the time) they have no opinion.</p><p>Ok, so we wanted our models to be more sensitive to the actual text and content, but we didn&#8217;t have enough labeled data to train a sentiment model from scratch. What should we do? Utilize existing models!</p><p>Thankfully, <a href="https://huggingface.co/models?pipeline_tag=text-classification&amp;sort=downloads">HuggingFace</a> has many binary sentiment classification models trained on similar data, like movie reviews. </p><p>Evaluating these models showed that they were better than a pure rules-based approach, but that they failed on some edge cases specific to the data. The datasets these HuggingFace models were trained on differ from Reddit comments, which are full of what I like to term &#8220;meme-speak&#8221;. Things like</p><p><code>Tsla puts, musk is mole person. #DD</code></p><p>would never be found in a Yelp review or a Rotten Tomatoes movie review.</p><p>We can even see how inaccurately such a model scores out-of-distribution data. For example, the <a href="https://huggingface.co/textattack">textattack</a>/<a href="https://huggingface.co/textattack/bert-base-uncased-rotten-tomatoes">bert-base-uncased-rotten-tomatoes</a> model scores the following comments as incredibly bearish:</p><pre><code>SPCE to the moon

STONKS only go up 

SPCE is already printing tendies and continuing to go up as I type this

HOLD THE LINE BULLS &#128002;&#128002;&#128002;</code></pre><p>when it&#8217;s clear these are bullish comments.</p><h3>At a Crossroads</h3><p>We now have a bunch of pretrained models trained on tangential data. We also have hand-written rules that are highly accurate when they do fire, which is not often. How can we build a model combining all of these different signals?</p><h3>Snorkel</h3><p><a href="https://snorkel.ai/">Snorkel</a>! Snorkel allows users to create rules and learns a weighting to best approximate the true labels. Instead of having to choose one model or a set of rules, we could combine them into an ensemble!</p><p>Writing these rules was no easy task, however. We spent a lot of time iterating on what to include in the functions, mainly thinking about:</p><ul><li><p>How many pretrained models do we need? At what number of models do we hit diminishing returns?</p></li><li><p>What are key words that are easy predictors of sentiment?</p></li><li><p>Does comment length matter?</p></li></ul><p>A few example functions (Snorkel passes each labeling function a data point <code>x</code>; here <code>x.text</code> is the comment, and <code>contains_phrase</code> is a small case-insensitive helper):</p><pre><code>from snorkel.labeling import labeling_function

ABSTAIN, BULLISH = -1, 1

def contains_phrase(text, phrases):
    return any(p in text.lower() for p in phrases)

@labeling_function()
def fed(x):
    phrases = ['fed', ' powell']
    return BULLISH if contains_phrase(x.text, phrases) else ABSTAIN

@labeling_function()
def hold(x):
    phrases = ['holding', 'hold', 'holder', 'hodl', 'diamond hand', 'diamond hands']
    return BULLISH if contains_phrase(x.text, phrases) else ABSTAIN
</code></pre><p>At the end of the day, we came up with an ensemble model using ~50 rules and ~5 pretrained models. We could now more accurately predict sentences such as </p><p><code>GME to the moon!</code></p><p>or even more complex comments like </p><p><code>Earnings coming up , I think they&#8217;ll do really well and surpass earnings. They&#8217;re also down from their ATH which imo will shoot up the price midway from now to ATH. Everyone even boomers want to cut the cord and buy a roku stick, expected sales on tvs with roku built it surged during the holidays, and they&#8217;re making a significant money from ads from their free platform. Idk. My opinion, change...</code></p><h3>Caveats</h3><p>However, the model was not perfect. For one, not all comments are either bullish or bearish. Some are neutral or exist on a spectrum:</p><p><code>Yeah I&#8217;m in the same boat. Apple and MSFT trading at record highs and high PE? I&#8217;m 30 years from retirement...I&#8217;ll buy on the way up, on the way down, and sideways for the next 5 years at least. Might dump more $$ on large corrections...but I&#8217;ll DCA for the next 5-10 years and worry about it when I&#8217;m 60. I&#8217;m most confident in those 2 names specifically then any other.</code></p><p>How do you model a comment like this where they are (somewhat) bearish/nervous in the short term but bullish long term? Do you add more categories or do you (somehow) give a discrete value to the comment? </p><p>In general, the model was better suited to short comments with &#8220;meme-speak&#8221; sprinkled in. It tended to be extremely confident on comments with phrases like <code>stonks only go up</code> but less certain on other comments. 
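</p><p>Under the hood, Snorkel&#8217;s LabelModel learns how much to trust each labeling function from their agreements and disagreements. As a rough, hypothetical stand-in (with hard-coded weights instead of learned ones), combining the signals looks something like:</p><pre><code>ABSTAIN, BEARISH, BULLISH = -1, 0, 1

def weighted_vote(lf_outputs, weights):
    # lf_outputs: one label per rule/model; ABSTAIN votes are skipped
    # weights: per-function trust, which Snorkel would learn for us
    tally = {BULLISH: 0.0, BEARISH: 0.0}
    for label, weight in zip(lf_outputs, weights):
        if label != ABSTAIN:
            tally[label] += weight
    if tally[BULLISH] == tally[BEARISH]:
        return ABSTAIN
    return max(tally, key=tally.get)

# two rules fire BULLISH, one pretrained model says BEARISH
weighted_vote([BULLISH, BULLISH, BEARISH, ABSTAIN], [0.6, 0.5, 0.9, 0.7])  # BULLISH</code></pre><p>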
Given more time and resources, my friend and I would have loved to look more into predicting emotions based on <a href="https://en.wikipedia.org/wiki/Robert_Plutchik#Plutchik's_wheel_of_emotions">Plutchik&#8217;s Wheel of Emotions</a>, but it would have taken serious effort to crowdsource the labels. </p><p>We then tried weak supervision to improve the model. Using the confident outputs as ground truth, and also labelling a few thousand more examples, we fine-tuned a Large Language Model (LLM) with little improvement.</p><h3>Active Learning Attempts</h3><p>To improve the fine-tuned LLM further, we also tried <a href="https://en.wikipedia.org/wiki/Active_learning_(machine_learning)">active learning</a>, a technique for selecting the most beneficial examples to label. Put another way: what examples do I need to label to most improve my model? Figuring out which examples to choose is difficult, however. One standard way to choose important examples is to use a model&#8217;s uncertainty (calculated via <a href="https://youtu.be/v68zYyaEmEA">entropy</a>). Even still, we found little to no improvement over our ensemble. </p><h3>Wrapping Up</h3><p>As with any project, there are many things you wish you had time to keep trying and figuring out, even with this (already too long) article.</p><p>This project really improved my understanding of applied ML and how to approach most problems. I am eternally grateful to my friend for teaching me a smol portion of how to be effective at building useful ML products, and I wanted to share a slice of what I have learned. 
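</p><p>For the curious, the entropy-based selection mentioned in the active learning section can be sketched in a few lines (the comments and model probabilities here are made up):</p><pre><code>import math

def entropy(probs):
    # higher entropy = the model is less sure = more useful to label
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_most_uncertain(batch, k):
    # batch: list of (comment, predicted class probabilities) pairs
    ranked = sorted(batch, key=lambda pair: entropy(pair[1]), reverse=True)
    return [comment for comment, probs in ranked[:k]]

batch = [('stonks only go up', [0.98, 0.02]),
         ('Apple at record highs but I am 30 years from retirement', [0.55, 0.45])]
pick_most_uncertain(batch, 1)  # the ambiguous retirement comment</code></pre><p>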
</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>At least in my experience, n=1</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For a more detailed how-to on training a NN, see Andrej Karpathy&#8217;s <a href="https://karpathy.github.io/2019/04/25/recipe/">A Recipe for Training Neural Networks</a>.</p></div></div>]]></content:encoded></item></channel></rss>