<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Nuss and Bolts]]></title><description><![CDATA[Writing about the nitty-gritty of machine learning]]></description><link>https://www.nuss-and-bolts.com</link><image><url>https://substackcdn.com/image/fetch/$s_!iWip!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15e4cd6a-d0ae-4cc7-8749-9754ee665360_1024x1024.png</url><title>Nuss and Bolts</title><link>https://www.nuss-and-bolts.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 02 May 2026 10:19:08 GMT</lastBuildDate><atom:link href="https://www.nuss-and-bolts.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Zach Nussbaum]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[zanussbaum@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[zanussbaum@substack.com]]></itunes:email><itunes:name><![CDATA[Zach Nussbaum]]></itunes:name></itunes:owner><itunes:author><![CDATA[Zach Nussbaum]]></itunes:author><googleplay:owner><![CDATA[zanussbaum@substack.com]]></googleplay:owner><googleplay:email><![CDATA[zanussbaum@substack.com]]></googleplay:email><googleplay:author><![CDATA[Zach Nussbaum]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[On the Lost Nuance of Grep vs. Semantic Search]]></title><description><![CDATA[the answer is...it depends]]></description><link>https://www.nuss-and-bolts.com/p/on-the-lost-nuance-of-grep-vs-semantic</link><guid isPermaLink="false">https://www.nuss-and-bolts.com/p/on-the-lost-nuance-of-grep-vs-semantic</guid><dc:creator><![CDATA[Zach Nussbaum]]></dc:creator><pubDate>Fri, 14 Nov 2025 14:02:03 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/42eb8732-4956-4572-8057-72c549c57313_6144x3452.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Two years ago, RAG (Retrieval Augmented Generation) meant vector databases and embedding models. Now, Claude Code, Codex, Cline, and others have popularized a <a href="https://x.com/pashmerepat/status/1926717705660375463">vector-less approach</a> by using grep, bash tools, and good ol&#8217; reasoning. If you took Twitter, the everything App, as conventional wisdom, then you might believe that vectors are overkill and agentic search with grep<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> is <em>really</em> all you need. </p><p>That is, until <a href="https://cursor.com/blog/semsearch">Cursor published a blog</a> on how they use vector search alongside grep. Who knew there could be nuance!</p><h2>A Really Contrived Example</h2><p>My initial reaction to agentic search was one of (naive) dismissal. What a waste of compute! Why use all of your hard-earned Jensen Bucks when you have a compact, efficient embedding model that understands language? This vector-less approach clearly works. When does it not work? Is it always superior to vectors?</p><p>So I built a really dumb testbed. I split out each row from the Natural Questions corpus and saved it as a text file. For each query, I removed stop words and grep&#8217;d all documents that had a match to any of the remaining keywords<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. </p><p>Simple!
But slow<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><p>Latency seems to scale linearly with index size and is much slower than numpy on my MacBook.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nUQH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd13d79-580b-4fde-8150-08af2cdec13e_2231x595.png"><img src="https://substackcdn.com/image/fetch/$s_!nUQH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfd13d79-580b-4fde-8150-08af2cdec13e_2231x595.png" width="1456" height="388" alt=""></a></figure></div><p>Unsurprisingly, using only the keywords present in the query showed poor performance. We don&#8217;t get the soft matching and flexibility of embeddings. We only find matches to the correct document when there is an exact keyword match between the query and document. 
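</p><p>The testbed fits in a few lines. Here is a minimal sketch, assuming ripgrep (<code>rg</code>) is on the PATH; the stop-word list is a stand-in, and &#8220;scoring&#8221; is just the per-file match-line counts from <code>rg -i -c</code> with no normalization:</p>

```python
import re
import subprocess

# Toy stop-word list for illustration; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "in", "to", "who", "what", "when", "where", "how", "did"}

def extract_keywords(query: str) -> list[str]:
    """Drop stop words; everything left becomes a grep keyword."""
    return [w for w in re.findall(r"\w+", query.lower()) if w not in STOP_WORDS]

def parse_rg_counts(output: str) -> list[tuple[str, int]]:
    """Parse `rg -c` output lines of the form 'path:count'."""
    scores = []
    for line in output.splitlines():
        path, _, count = line.rpartition(":")
        scores.append((path, int(count)))
    return scores

def keyword_search(query: str, docs_dir: str, top_k: int = 10) -> list[tuple[str, int]]:
    """Rank documents by how many of their lines match any query keyword."""
    keywords = extract_keywords(query)
    if not keywords:
        return []
    # rg -i -c prints "path:count" for every file with at least one match
    proc = subprocess.run(
        ["rg", "-i", "-c", "|".join(keywords), docs_dir],
        capture_output=True, text=True,
    )
    return sorted(parse_rg_counts(proc.stdout), key=lambda s: s[1], reverse=True)[:top_k]
```

<p>Swapping <code>extract_keywords</code> for a call to a cheap LLM that proposes related keywords gives the query-expansion variant discussed next.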
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ggxd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8ed2bd8-ba01-4785-8e63-1026ee69df34_1618x822.png"><img src="https://substackcdn.com/image/fetch/$s_!Ggxd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8ed2bd8-ba01-4785-8e63-1026ee69df34_1618x822.png" width="1456" height="740" alt=""></a><figcaption class="image-caption">Retrieval performance over different index sizes for 100 random queries from NQ. NQ is in-domain for Nomic Embed</figcaption></figure></div><p>But using a cheap model like gpt-5-mini to return relevant keywords based on the query nearly 10x&#8217;d performance<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>.</p><p>So what is grep good for? Exact matches for a <strong>known or easily derived</strong> keyword<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>. But that keyword may not always be known. </p><h3>RAG is Dead&#8230;Long Live RAG?</h3><p>Compared to a numpy vector search, grep is much slower. 
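</p><p>The numpy baseline here is nothing fancy; with L2-normalized embeddings, brute-force search is a single matrix multiply (a sketch, not any particular library&#8217;s implementation):</p>

```python
import numpy as np

def vector_search(query_emb: np.ndarray, doc_embs: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Indices of the top_k most similar documents.

    Assumes query and document embeddings are L2-normalized, so the
    dot product equals cosine similarity.
    """
    scores = doc_embs @ query_emb                   # (n_docs,)
    top = np.argpartition(-scores, top_k)[:top_k]   # partial sort: O(n_docs)
    return top[np.argsort(-scores[top])]            # order the winners by score
```

<p>A linear scan plus <code>np.argpartition</code> avoids a full sort, which is part of why numpy stays fast at these index sizes.</p><p>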
But embedding models converge to a <a href="https://x.com/bo_wangbo/status/1869301556186587416?s=20">bag of semantic tokens</a> and don&#8217;t offer much flexibility for queries outside of their training data. </p><p>Take the <a href="https://brightbenchmark.github.io/">BRIGHT</a> benchmark: many approaches now include some form of query rewriting and expansion. ReasonIR showed that training with expanded queries<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> <em>and</em> reranking with an LLM<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> improved over their baseline.  </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7reD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff63250e9-7a2f-4d01-8a7c-ad26e0901dfa_864x746.png"><img src="https://substackcdn.com/image/fetch/$s_!7reD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff63250e9-7a2f-4d01-8a7c-ad26e0901dfa_864x746.png" width="510" height="440" alt=""></a></figure></div><p>In a nutshell, you&#8217;re trading latency and tokens for flexibility when using grep+keywords over embeddings.</p><p>But when should 
you use either? </p><p><a href="https://seconds0.substack.com/p/heres-whats-next-in-agentic-coding">Seconds</a> describes it quite clearly: </p><blockquote><p>If it&#8217;s not readily apparent what the name of a variable or a particular stage of your pipeline is, but you can reference some oblique aspect of it, embeddings will get you a lot closer than grep will.</p></blockquote><h2>Cursor&#8217;s Embedding Model</h2><p>Cursor&#8217;s embedding model seems to improve performance <strong>for all models</strong> on their internal Cursor Context Bench.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_ro7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae7c789-e17f-4962-9030-6388688fe5f9_1288x554.png"><img src="https://substackcdn.com/image/fetch/$s_!_ro7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ae7c789-e17f-4962-9030-6388688fe5f9_1288x554.png" width="1288" height="554" alt=""></a></figure></div><p>So how is it different from a regular code embedding model? They leverage rich user-agent interactions:</p><blockquote><p>We provide these traces to an LLM, which ranks what content would have been most helpful at each step. We then train our embedding model to align its similarity scores with these LLM-generated rankings. 
This creates a feedback loop where the model can learn from how agents actually work through coding tasks, rather than relying on generic code similarity.</p></blockquote><p>Taking the traces (expanded queries), they train an embedding model to retrieve the documents that an agent found using tools like grep and file read. This is quite similar to ReasonIR! While they may not be explicitly modeling query/keyword expansion, training over the traces distills that information from grep and file read.</p><p>Is it better for the embedding model to explicitly learn to do the query expansion, or to learn it implicitly by mining the correct traces? Maybe <a href="https://www.pinecone.io/learn/splade/">SPLADE</a> is another alternative. Who knows, but it would be fun to try :)</p><p>At the end of the day, comparisons of grep to semantic search are missing context. Agentic search gives you flexible retrieval by offloading the learned semantics of an embedding model to an LLM. This also makes using grep in any codebase simple. You no longer have to maintain an index, worry about any potential security implications, or think about how to best chunk your files for your embedding model. However, I believe that agentic search shows where embedding models can and should improve. </p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>&#8220;with grep&#8221; is doing a lot of heavy lifting here</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Functionally, I ran <code>f"rg -i -c {'|'.join(query)}"</code>. Docs are &#8220;scored&#8221; by counts of each word. 
Because this is a toy example, I didn&#8217;t bother with any score normalization. We could use BM25 if we wanted.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Cognition recently trained <a href="https://cognition.ai/blog/swe-grep">specialized models</a> for faster (and parallel) agentic retrieval. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>It&#8217;s not lost on me that this dataset is really popular and probably not the fairest comparison. However</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>However, embeddings are still the way to go for the continuous domain</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>It&#8217;s worth noting that the model itself doesn&#8217;t rewrite the queries</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>This even works using an <a href="https://x.com/zach_nussbaum/status/1922427785710121186">off-the-shelf LLM</a></p></div></div>]]></content:encoded></item><item><title><![CDATA[I Like Big Batches and I Cannot Lie]]></title><description><![CDATA[Understanding How to Train SoTA Embedding Models for 
Cheap]]></description><link>https://www.nuss-and-bolts.com/p/i-like-big-batches-and-i-cannot-lie</link><guid isPermaLink="false">https://www.nuss-and-bolts.com/p/i-like-big-batches-and-i-cannot-lie</guid><dc:creator><![CDATA[Zach Nussbaum]]></dc:creator><pubDate>Mon, 02 Jun 2025 21:47:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5W28!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>If you want to train the next best text embedding model, chances are you&#8217;ll need to use a large batch size. But naively scaling to the critical batch size requires lots of GPUs! What happens if you don&#8217;t have the compute budget to do so? GradCache allows you to fit large batch sizes with limited memory by decoupling the batch size from the gradient calculation, the main source of memory usage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5W28!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5W28!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!5W28!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!5W28!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!5W28!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5W28!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png" width="862" height="862" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a30c9269-2312-4b3d-851f-999846242381_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:862,&quot;bytes&quot;:1213809,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.nuss-and-bolts.com/i/164949360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5W28!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!5W28!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!5W28!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!5W28!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa30c9269-2312-4b3d-851f-999846242381_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We&#8217;ll explore why naive gradient accumulation doesn&#8217;t work and break down how GradCache works. We&#8217;ve used GradCache to train some of the embedding models at Nomic and I&#8217;ve found it essential for understanding the fundamentals of contrastive learning.  </p><h2>Big Batches are Better for Contrastive Learning </h2><p>Contrastive representation learning trains a model to learn an embedding space such that similar data points are close to each other while dissimilar points are far away. Many modern embedding models, such as CLIP and OpenAI text-embedding-large, are trained with the InfoNCE loss. For a given batch size N of paired data<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, the model is trained to identify the positive pair amongst N-1 negative pairs in the batch. For example, each text caption is compared with every image in the batch. 
The loss pushes all N-1 negative representations away from the caption and pulls the positive image representation closer to the caption embedding. </p><p>Performance improves as you increase the batch size, since there are more negative examples to compare against, but doing so requires fitting the whole NxN similarity matrix and its activations into GPU memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W7qG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W7qG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 424w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 848w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 1272w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W7qG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png" width="424" height="324.1449275362319" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bd711609-a83c-493c-bae4-75a429f54fff_1104x844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:844,&quot;width&quot;:1104,&quot;resizeWidth&quot;:424,&quot;bytes&quot;:144329,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.nuss-and-bolts.com/i/164949360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W7qG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 424w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 848w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 1272w, https://substackcdn.com/image/fetch/$s_!W7qG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbd711609-a83c-493c-bae4-75a429f54fff_1104x844.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 
20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>But what happens if you don&#8217;t have enough memory to do so?</p><h2><a href="https://arxiv.org/abs/2101.06983">GradCache: Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup</a></h2><p>GradCache is a technique to reduce memory requirements by removing the backward pass&#8217;s dependency on the batch size. Let&#8217;s dig into how this works!</p><h3>How Loss is Computed</h3><p>The InfoNCE loss minimizes the categorical cross entropy loss between the positive pair and all other pairs in the batch. Each summation requires fitting the <em>whole batch</em> in memory. 
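</p><p>To make that concrete, here is a minimal NumPy sketch (my own illustration, not code from the post or the GradCache paper) of the InfoNCE loss over a batch of N paired embeddings. The N&#215;N logits matrix is the quantity that has to fit in memory:</p>

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, targets: np.ndarray, tau: float = 0.07) -> float:
    """InfoNCE loss over N paired embeddings; both arrays have shape (N, D)."""
    # L2-normalize so the dot products below are cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    # The N x N similarity matrix: entry (i, j) compares query i with target j.
    # Its memory footprint grows quadratically with the batch size.
    logits = (q @ t.T) / tau
    # Row-wise log-softmax (shifted for numerical stability); the positive
    # pair for query i sits on the diagonal.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())
```

<p>Doubling the batch size quadruples the logits matrix, and in a real encoder the cached activations grow with the batch as well, which is why naive scaling quickly runs out of GPU memory.</p><p>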
</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hbgs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hbgs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 424w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 848w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 1272w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hbgs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png" width="398" height="101.95073891625616" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:208,&quot;width&quot;:812,&quot;resizeWidth&quot;:398,&quot;bytes&quot;:31606,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.nuss-and-bolts.com/i/164949360?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hbgs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 424w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 848w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 1272w, https://substackcdn.com/image/fetch/$s_!hbgs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5b398a67-01c5-488f-a4b3-5e2cc2f4c3c5_812x208.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">InfoNCE loss formulation used to train embedding models. 
Here <code>f</code> and <code>g</code> are models that output representations, and <code>S</code> and <code>T</code> are the paired data points in the batch.</figcaption></figure></div><h3>Why Can&#8217;t You Use Gradient Accumulation?</h3><p>Naive gradient accumulation computes the loss and gradients in sub-batches and then updates the model parameters. In the case of your standard language modeling loss, the loss for each data point is independent of every other point in the batch! If you used gradient accumulation with the InfoNCE loss, you would only be computing the negatives within each sub-batch. </p><h3>Derivative of InfoNCE Loss</h3><p>Let&#8217;s break down the derivatives of the InfoNCE loss. The models <code>f</code> and <code>g</code> are parameterized by <code>&#920;</code> and <code>&#923;</code>. Given the loss:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L} = -\\frac{1}{|S|} \\sum_{s_i \\in S} \\log \\frac{\\exp(f(s_i)^T g(t_{r_i})/\\tau)}{\\sum_{t_j \\in T} \\exp(f(s_i)^T g(t_j)/\\tau)}&quot;,&quot;id&quot;:&quot;QAJHORGFPF&quot;}" data-component-name="LatexBlockToDOM"></div><p> we want to derive the partial derivatives of the loss with respect to <code>&#920; </code>and <code>&#923;</code>: </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\frac{\\partial \\mathcal{L}}{\\partial \\Theta} &amp;= \\sum_{s_i \\in S} \\frac{\\partial \\mathcal{L}}{\\partial f(s_i)} \\frac{\\partial f(s_i)}{\\partial \\Theta} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial \\Lambda} &amp;= \\sum_{t_j \\in T} \\frac{\\partial \\mathcal{L}}{\\partial g(t_j)} \\frac{\\partial g(t_j)}{\\partial \\Lambda}\n\\end{align}&quot;,&quot;id&quot;:&quot;LDQKYEPBXE&quot;}" data-component-name="LatexBlockToDOM"></div><p>To make this more palatable, let&#8217;s work out the derivative for a <em>single</em> data point in <code>S</code> and <code>T</code>, respectively.</p><div 
class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\mathcal{L}_i &amp;= -\\log \\frac{\\exp(f(s_i)^T g(t_{i}))}{\\sum_{t_j \\in T} \\exp(f(s_i)^T g(t_j))} \\\\\n\\mathcal{L}_i &amp;= -\\log \\frac{e^{a_i}}{\\sum_{j} e^{a_j}} \\\\\n&amp;\\text{where } a_i = f(s_i)^T g(t_{i})\\\\ \\\\\n&amp;=  -a_i + \\log(\\sum_{j} e^{a_j}) \\\\\n\\frac{\\partial \\mathcal{L}_i}{\\partial f(s_i)} &amp;= \n\\frac{\\partial \\mathcal{L}_i}{\\partial a_i} \\cdot \\frac{\\partial a_i}{\\partial f(s_i)} + \n\\sum_j \\frac{\\partial \\mathcal{L}_i}{\\partial a_j} \\cdot \\frac{\\partial a_j}{\\partial f(s_i)} \\\\\n&amp;=  -g(t_i) + \\sum_{t_j \\in T}p_{ij} g(t_j) \\tag{Appendix A} \\\\\n&amp;\\text{where } p_{ij} = \\frac{\\exp(f(s_i)^T g(t_{j}))}{\\sum_{t \\in T} \\exp(f(s_i)^T g(t))} \\\\ \\\\ \n\\frac{\\partial L}{\\partial f(s_i)} &amp;= -\\frac{1}{|S|} \\left(g(t_i) - \\sum_{t_j \\in T}p_{ij} g(t_j)\\right)\\\\\n\\end{align}&quot;,&quot;id&quot;:&quot;BNASZUUAUY&quot;}" data-component-name="LatexBlockToDOM"></div><p>We can interpret this partial derivative as how much we should <strong>pull the query representation</strong> <code>f(s_i)</code> toward the correct target representation <code>g(t_i)</code>, and <strong>push it away</strong> from all other targets in the batch, weighted by their similarity.</p><p>Similarly, the partial derivative with respect to <code>g(t_j)</code> has a mirrored structure:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\frac{\\partial \\mathcal{L}_i}{\\partial g(t_i)} &amp;= \n\\frac{\\partial \\mathcal{L}_i}{\\partial a_i} \\cdot \\frac{\\partial a_i}{\\partial g(t_i)} + \n\\sum_j \\frac{\\partial \\mathcal{L}_i}{\\partial a_j} \\cdot \\frac{\\partial a_j}{\\partial g(t_i)} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial g(t_j)} &amp;= -\\frac{1}{|S|} \\left( \n\\sum_k \\mathbf{1}[r_k = j] f(s_k) - \\sum_{s_i \\in S} p_{ij} 
f(s_i)\n\\right)\n\n\\end{align}&quot;,&quot;id&quot;:&quot;DDEBCVZZPR&quot;}" data-component-name="LatexBlockToDOM"></div><p>We <strong>push the target representation</strong> <code>g(t_j)</code> away from all queries that treat it as a negative (weighted by how similar they are), and <strong>pull it toward</strong> the corresponding query representation <code>f(s_k)</code> if it is the true positive match.</p><p>Looking at the partial derivatives, we can see we can&#8217;t use naive gradient accumulation as they rely on the full batch similarities.</p><p>So what <em>can</em> we do to reduce memory?</p><h3>Breaking Apart</h3><p>Remember the partial derivatives of the loss with respect to <code>&#920; </code>and <code>&#923;?</code></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n\\frac{\\partial \\mathcal{L}}{\\partial \\Theta} &amp;= \\sum_{s_i \\in S} \\frac{\\partial \\mathcal{L}}{\\partial f(s_i)} \\frac{\\partial f(s_i)}{\\partial \\Theta} \\\\\n\\frac{\\partial \\mathcal{L}}{\\partial \\Lambda} &amp;= \\sum_{t_j \\in T} \\frac{\\partial \\mathcal{L}}{\\partial g(t_j)} \\frac{\\partial g(t_j)}{\\partial \\Lambda}\n\\end{align}&quot;,&quot;id&quot;:&quot;LMPZCDXSOF&quot;}" data-component-name="LatexBlockToDOM"></div><p>We can take advantage of two key properties of the gradient computation:</p><ol><li><p>The loss gradient with respect to the representations (e.g., &#8706;L/&#8706;f(s_i)) depends only on the numerical values of the representations and not on the encoder parameters &#920; or &#923;.</p></li><li><p>The gradient with respect to the encoder parameters (e.g., &#8706;f(s_i)/&#8706;&#920;) depends only on s_i and &#920;, but <strong>does not require the full batch</strong>.</p></li></ol><p>This lets us avoid building a full end-to-end computational graph from input &#8594; encoder &#8594; embeddings &#8594; loss &#8594; gradient.</p><p>Instead, we can:</p><ul><li><p>Compute the representations without 
tracking gradients</p></li><li><p>Compute the loss and the numerical gradients with respect to <code>f(s_i)</code> using the full batch.</p></li><li><p>Re-run the forward pass of the encoder for each <code>s_i</code> and use the precomputed &#8706;L/&#8706;f(s_i) to backpropagate and obtain &#8706;L/&#8706;&#920;.</p></li></ul><p>This saves memory by avoiding the need to store encoder activations for the entire batch, while still enabling correct gradient computation.</p><p>GradCache can be thought of as a specialized case of gradient checkpointing. Training normally, you store all intermediate activations to compute gradients during the backward pass. Gradient checkpointing saves memory by discarding these activations and recomputing them during backward using a second forward pass.</p><p>GradCache takes this a step further for contrastive learning: it discards <em>all</em> activations for the encoder and recomputes <em>only what's needed</em> using precomputed gradients of the loss with respect to the embeddings. This means you avoid storing full-batch activations entirely. </p><h2>Conclusion</h2><p>In this article, we walked through why naive gradient accumulation doesn&#8217;t work for contrastive learning setups and how GradCache removes the batch dependency for gradient accumulation. GradCache allows you to scale to the critical batch size with limited hardware. Maybe you don&#8217;t need as much compute as you thought to train the next best embedding model!</p><h2>Appendix</h2><h3>A.) 
Partial Derivative of LogSumExp</h3><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\nl &amp;= \\log\\left(\\sum_{j} \\exp(x_j)\\right)  \\\\\n\\frac{\\partial l}{\\partial x_i} &amp;= \\frac{\\partial l}{\\partial x_i} \\log\\left(\\sum_{j} \\exp(x_j)\\right)  \\\\\n&amp;=  \\frac{\\partial l}{\\partial x_i}  \\log(z) \\\\\n&amp;\\quad \\text{where } z = \\sum_{j} \\exp(x_j) \\\\\n&amp;= \\frac{\\partial l}{\\partial x_i}\\frac{1}{\\sum_{j} \\exp(x_j)} * \\exp(x_j)\\\\\n&amp;= \\frac{\\exp(x_i)}{\\sum_{j} \\exp(x_j)}\n\\end{align}&quot;,&quot;id&quot;:&quot;ZKRIDNHVMM&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Examples of this include question-answer pairs from search engines and images and their text captions. There is a lot of naturally occurring paired data; however not all of it is high quality!</p></div></div>]]></content:encoded></item><item><title><![CDATA[Optimizing a WebGPU Matmul Kernel for 1TFLOP+ Performance]]></title><description><![CDATA[Building Surfgrad, a high-performant, WebGPU-powered autograd library]]></description><link>https://www.nuss-and-bolts.com/p/optimizing-a-webgpu-matmul-kernel</link><guid isPermaLink="false">https://www.nuss-and-bolts.com/p/optimizing-a-webgpu-matmul-kernel</guid><dc:creator><![CDATA[Zach Nussbaum]]></dc:creator><pubDate>Mon, 11 Nov 2024 16:30:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!fak3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I work at <a href="https://nomic.ai">Nomic</a>, where many of my colleagues work on building large TSNE-like 
visualizations that <em>work</em> in the browser<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Showing tens of millions of data points in the browser without turning your computer into an oven is no easy challenge. I overhear discussions of many of the scaling problems solved by <a href="https://github.com/nomic-ai/deepscatter">Deepscatter</a>, first developed by Ben Schmidt. </p><p>However, many conversations that I overhear tend to revolve around TypeScript and how awesome WebGPU is. At the time of writing, I couldn&#8217;t find any autograd libraries built with WebGPU<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. So as an educational exercise to learn WebGPU and TypeScript, I decided to build <strong>Surfgrad</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, a high-performance, WebGPU-powered autograd library that enables browser-based tensor operations. </p><p>In this post, I&#8217;ll cover how I optimized a naive WebGPU matrix multiplication (matmul) kernel to 1 <a href="https://en.wikipedia.org/wiki/Floating_point_operations_per_second">TFLOPS</a>+ of compute throughput<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. The goal isn&#8217;t to build the <strong>fastest</strong> autograd library, but to show the nuances of WebGPU and how it might differ from CUDA.</p><p>Perhaps in the future, we can even use <a href="https://github.com/zanussbaum/surfgrad">Surfgrad</a> for running the next Llama models. </p><h2>What is WebGPU?</h2><p>WebGPU is an API designed for people to write GPU code that runs on any phone or computer with a web browser.  
Previously, people hacked around WebGL to run machine learning workloads with tricks like rendering to an <a href="https://benschmidt.org/post/2023-03-07-webGPU-day/">invisible canvas and reading the numbers back as colors</a>. Now people can take advantage of the increasing power of GPUs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> in laptops and run compute kernels (i.e. data in, data out, without any funny business). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fak3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fak3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 424w, https://substackcdn.com/image/fetch/$s_!fak3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 848w, https://substackcdn.com/image/fetch/$s_!fak3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 1272w, https://substackcdn.com/image/fetch/$s_!fak3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!fak3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png" width="1456" height="769" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:769,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:212600,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fak3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 424w, https://substackcdn.com/image/fetch/$s_!fak3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 848w, https://substackcdn.com/image/fetch/$s_!fak3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 1272w, https://substackcdn.com/image/fetch/$s_!fak3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fca0b4e-318c-43ec-8d23-c84393b86723_1704x900.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>WebGPU was created to give the &#8220;compute&#8221; shader first-class support and open the doors for in-browser, private machine learning development. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oTen!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oTen!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 424w, https://substackcdn.com/image/fetch/$s_!oTen!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 848w, https://substackcdn.com/image/fetch/$s_!oTen!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 1272w, https://substackcdn.com/image/fetch/$s_!oTen!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oTen!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png" width="1456" height="734" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:222165,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oTen!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 424w, https://substackcdn.com/image/fetch/$s_!oTen!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 848w, https://substackcdn.com/image/fetch/$s_!oTen!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 1272w, https://substackcdn.com/image/fetch/$s_!oTen!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0071d1ea-8ad5-4fc9-8972-99ad3f7501a7_1742x878.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The compute (and vertex and fragment) shaders are written in <a href="https://www.w3.org/TR/WGSL/">WGSL</a>. WGSL is designed for developers to write a single shader that gets compiled to lower-level languages like SPIR-V for Vulkan and MSL for Metal. </p><p>Ben&#8217;s also written some great articles on what WebGPU is and why it&#8217;s important:</p><ul><li><p><a href="https://benschmidt.org/post/2020-01-15/2020-01-15-webgpu/">Javascript and the next decade of data programming</a></p></li><li><p><a href="https://benschmidt.org/post/2023-03-07-webGPU-day/">Happy WebGPU Day</a></p></li></ul><h3>WebGPU vs. CUDA</h3><p>NVIDIA is the most popular hardware choice, and CUDA, its programming API, is one of the reasons why, but CUDA only runs on NVIDIA hardware. </p><p>WebGPU and CUDA share similar <a href="https://github.com/googlefonts/compute-shader-101/blob/main/docs/glossary.md">terminology</a>, but they don&#8217;t have the exact same functionality. 
WebGPU <em>just </em>introduced support for <a href="https://developer.chrome.com/blog/new-in-webgpu-128#experimenting_with_subgroups">subgroups</a>, which allow threads within a group to efficiently share data. This is a big win for things like matrix multiplies, where you may otherwise recalculate similar values. </p><p>WebGPU also sits a half step above CUDA in that it compiles to other GPU APIs like Vulkan and Metal. It&#8217;s kind of like React Native for GPU compute shaders. </p><h3>WebGPU Compute Shader Basics</h3><p>The smallest unit of execution is a <strong>thread</strong>, which runs the compute shader. </p><p><strong>workGroups </strong>are groups of threads that run together in parallel (they&#8217;re called threadBlocks in CUDA) and can access the same shared memory.</p><p>WebGPU can dispatch many of these <strong>workGroups</strong> at once; CUDA calls this a Grid (which is made of threadBlocks). </p><p>Similarly to CUDA, <strong>workGroups</strong> and <strong>workGroup dispatches</strong> are defined in 3D. The size of a <strong>workGroup </strong>is defined by <code>@workgroup_size(x, y, z)</code>, where the number of threads per workGroup is <code>x * y * z</code>. </p><h2>Writing a Fast Matrix Multiply</h2><p>Matrix multiplications make up most of the floating point operations (<a href="https://en.wikipedia.org/wiki/Floating_point_operations_per_second">FLOPs</a>) in Large Language Models like GPT-4 and Llama; the matmul is the basic primitive for most training and inference workloads.</p><p>Native WebGPU support for matrix multiplies is limited to <a href="https://webgpu.rocks/wgsl/language/types/#matrix">small matrices</a>, which aren&#8217;t useful for modern Deep Learning workloads where matrices can be large<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p>A few quick notes on notation. 
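</p><p>To make the workGroup sizing above concrete, here&#8217;s a tiny Python sketch (the helper is purely illustrative, not part of the WebGPU API) of how many threads a dispatch launches:</p><pre><code>def total_threads(workgroup_size, dispatch):
    # Threads launched = (x * y * z per workGroup) * (number of workGroups).
    x, y, z = workgroup_size
    wx, wy, wz = dispatch
    return (x * y * z) * (wx * wy * wz)

# A @workgroup_size(8, 8, 1) kernel dispatched over a 16x16x1 grid runs
# 64 * 256 = 16,384 threads: one per entry of a 128x128 output matrix.
assert total_threads((8, 8, 1), (16, 16, 1)) == 128 * 128</code></pre><p>Every kernel below is just a different way of assigning these threads to entries of the output matrix. 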
</p><h3>Matrix Multiply</h3><p>First, a <a href="https://en.wikipedia.org/wiki/Matrix_multiplication">matrix multiply</a> is defined by three matrices: A, B, C. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!COlz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!COlz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 424w, https://substackcdn.com/image/fetch/$s_!COlz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 848w, https://substackcdn.com/image/fetch/$s_!COlz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!COlz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!COlz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png" width="1456" height="888" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:888,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:225493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!COlz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 424w, https://substackcdn.com/image/fetch/$s_!COlz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 848w, https://substackcdn.com/image/fetch/$s_!COlz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 1272w, https://substackcdn.com/image/fetch/$s_!COlz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3171894-84a2-467d-adcd-4fd6446217fc_1926x1174.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The total <a href="https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#math-mem">FLOPs required of a matrix multiply</a> are <code>2 * M * K * N </code>as each operation requires both a multiply and an add (hence the 2). </p><h3>Lower Bounding Our Kernel</h3><p>Following the example from <a href="https://siboehm.com/articles/22/CUDA-MMM#lower-bounding-the-fastest-possible-runtime">Simon Boehm's great article</a>, we have two 4092x4092 matrices followed by the addition of a 4092x4092 matrix. 
Similarly, we have </p><ol><li><p>Total FLOPs: 137 GFLOPs</p></li><li><p>Total data to read: 201MB</p></li><li><p>Total data to store: 67MB</p></li></ol><p>However, I am developing on a <a href="https://www.cpu-monkey.com/en/igpu-apple_m2_pro_16_core">Mac M2 Pro, which has ~6 TFLOP/s</a> of compute and <a href="https://pocketnow.com/apple-m2-vs-pro-vs-max/">200GB/s of memory bandwidth</a>.</p><p>So, the minimum time the compute can take is </p><p><code>(137 GFLOP) / (6 TFLOP/s) &#8776; 23ms</code></p><p>and memory access takes </p><p><code>(268MB) / (200GB/s) = 1.34ms</code></p><p>so we should be compute bound (by ~17x too!). </p><h2>Writing the Kernel</h2><h3>Kernel 1: Naive Kernel</h3><p>The simplest way to compute C = A &#215; B is: for each of the <strong>M</strong> rows of A, iterate over each of the <strong>N</strong> columns of B and accumulate the dot product along the shared <strong>K</strong> dimension. In Python, this looks like </p><pre><code>def matmul(a, b, c):
    """
    Perform naive matrix multiplication: C = A * B
    
    :param a: Input matrix A of shape (m, k)
    :param b: Input matrix B of shape (k, n)
    :param c: Output matrix C of shape (m, n) to store the result
    """
    m = len(a)
    k = len(a[0])
    n = len(b[0])
    
    # Perform the matrix multiplication
    for i in range(m):
        for j in range(n):
            c[i][j] = 0
            for l in range(k):
                c[i][j] += a[i][l] * b[l][j]</code></pre><p>Similar to the Python code above, we define<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> our inputs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><pre><code>struct Dimensions {
  M: u32,
  K: u32,
  N: u32,
}
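// Note: a, b, and result are bound as flat, row-major arrays:
// a holds M*K floats, b holds K*N, and result holds M*N.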

@group(0) @binding(0) var&lt;uniform&gt; dimensions: Dimensions;
@group(0) @binding(1) var&lt;storage, read&gt; a: array&lt;f32&gt;;
@group(0) @binding(2) var&lt;storage, read&gt; b: array&lt;f32&gt;;
@group(0) @binding(3) var&lt;storage, read_write&gt; result: array&lt;f32&gt;;</code></pre><p>and our compute kernel:</p><pre><code>@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {
  let index = global_id.x;
  let row = index / dimensions.N;
  let col = index % dimensions.N;

  if (index &lt; dimensions.M * dimensions.N) {
    var sum = 0.0;
    for (var i: u32 = 0u; i &lt; dimensions.K; i = i + 1u) {
      sum = sum + a[row * dimensions.K + i] * b[i * dimensions.N + col];
    }
    result[row * dimensions.N + col] = sum;
  }
}</code></pre><p>The code is functionally equivalent to the Python code above! We define our <strong>workGroup</strong> size with <code>workgroup_size(1)</code> (remember, this is represented in 3D). </p><p>So each workGroup, since it&#8217;s only one thread, computes a single <code>result[i, j]</code>. </p><p>To calculate the full matrix, we need to launch one workGroup per entry of the output and call <a href="https://developer.mozilla.org/en-US/docs/Web/API/GPUComputePassEncoder/dispatchWorkgroups">dispatchWorkgroups</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> </p><pre><code>pass.dispatchWorkgroups(a.shape[0] * b.shape[1]) </code></pre><p>where <code>a.shape[0] == M</code> and <code>b.shape[1] == N</code> for an MxN output matrix. </p><p>Now, as we see below, we have lots of room for improvement!</p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/8b7XC/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2509ea49-83ae-4e13-bf7a-ffb14d04bdcc_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:397,&quot;title&quot;:&quot;Naive Kernel GFLOPs vs. 
Matrix Size&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/8b7XC/1/" width="730" height="397" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>The largest square matrix multiply we can calculate is 128x128 due to limits in WebGPU (more on this later). We only achieve 1.64 <strong>GFLOPS/s </strong>a far cry from the theoretical max of 6 <strong>TFLOPS/s</strong>. </p><p>Why is this kernel so slow? In effect, each workgroup calculates a single entry of the 16,384 total elements (128^2). Although we are running in parallel, each workGroup loads its own copy of the matrices. The overhead to launch more workGroups is likely more than if our workGroup had more threads and calculated more results per workGroup and each workGroup isn&#8217;t able to take advantage of any caching of the inputs. </p><h3>Kernel 2: Moarrr Threads!</h3><p>With the first kernel, we&#8217;re only able to compute small square matrices due to limits on the number of <strong>workGroups</strong> (<a href="https://developer.mozilla.org/en-US/docs/Web/API/GPUSupportedLimits">maxComputeWorkgroupsPerDimension</a>) you can <strong>dispatch </strong>at once. </p><p>Since we&#8217;re launching one workgroup per entry, a 256x256 matrix is larger than our limit!</p><p>Remember this line?</p><pre><code><code>@compute @workgroup_size(1)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {</code> </code></pre><p>We can reduce the number of <strong>dispatched workGroups</strong> by increasing the number of <strong>threads </strong>per <strong>workGroup</strong>! </p><p>If we update our code </p><pre><code><code>@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) { </code></code></pre><p>we can reduce the number of total dispatched workGroups per dimension:</p><pre><code><code>const WORKGROUP_SIZE = 256;
pass.dispatchWorkgroups((a.shape[0] * b.shape[1]) / WORKGROUP_SIZE);</code></code></pre><p>Why 256? Well, there&#8217;s another <a href="https://www.w3.org/TR/webgpu/#limits">limit</a> :) </p><p>By increasing the workGroup size, we&#8217;re able to improve our kernel by 200x! </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/iMGEH/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b66ef49d-b541-4e76-ae78-0b2fa063ec8f_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:389,&quot;title&quot;:&quot;Adding More Threads Increases GFLOPs&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/iMGEH/1/" width="730" height="389" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h3>Kernel 3: Calculating with 2D workGroups</h3><p>However, doing all the computation in &#8220;one dimension&#8221; limits the matrix size we can calculate<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a>.</p><p>Although we don&#8217;t change much about our code, distributing the work in 2 dimensions lets us bypass these limits and launch more, larger workGroups. This allows us to calculate a 4096x4096 matmul. </p><p>We update our <code>@workgroup_size(8, 8)</code> and check our bounds: </p><pre><code>@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {
  let row = global_id.x;
  let col = global_id.y;

  if (row &lt; dimensions.M &amp;&amp; col &lt; dimensions.N) {
    var sum : f32 = 0.0;
    for (var i: u32 = 0u; i &lt; dimensions.K; i = i + 1u) {
      sum = sum + a[row * dimensions.K + i] * b[i * dimensions.N + col];
    }
    result[row * dimensions.N + col] = sum;
  }
}</code></pre><p>and dispatch workgroups in 2D </p><pre><code>const WORKGROUP_SIZE = 16; 
pass.dispatchWorkgroups(
    Math.ceil(a.shape[0] / WORKGROUP_SIZE),
    Math.ceil(b.shape[1] / WORKGROUP_SIZE),
);    </code></pre><p>But this is slower than our original kernel! What&#8217;s going on? </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/cF7Na/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0d0addb-0466-4e3b-b5d5-da2bc1f3c64f_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:395,&quot;title&quot;:&quot;2D Dispatch Workgroup Size vs. GFLOPS&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/cF7Na/2/" width="730" height="395" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>If we make a small change to the code</p><pre><code>@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {
  let row = global_id.y;
let col = global_id.x;</code></pre><p>we get much better kernel performance. </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/aXWxs/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d9434bfc-e590-4e04-9402-6a1b9451f58b_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:395,&quot;title&quot;:&quot;2D Dispatch with Better Cache Utilization&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/aXWxs/1/" width="730" height="395" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Why is this? We&#8217;re able to take better advantage of cached inputs. The x dimension is incremented before the y dimension in the <code>global_invocation_id</code>, so with the swap more threads in each workGroup share the same row of matrix A. With the original mapping, the row changes at each invocation within the workGroup, and each thread spends a few extra cycles reading from global memory rather than the cache. </p><h3>Kernel 4: Kernel Tiling</h3><p>Another thing to consider is how much work each thread does. </p><p>Up to now, each thread only computes one entry. But launching each workGroup has some overhead, which we could amortize by computing more than one element per thread! 
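</p><p>In Python terms (an illustrative sketch mirroring the structure of the WGSL, not a real GPU API), 1D tiling means each simulated &#8220;thread&#8221; owns one row and <code>TILESIZE</code> adjacent columns, loading each element of A once and reusing it:</p><pre><code>TILESIZE = 4

def matmul_tiled(a, b, c):
    # Assumes the number of columns of b is a multiple of TILESIZE.
    m, k, n = len(a), len(a[0]), len(b[0])
    for row in range(m):
        for col in range(0, n, TILESIZE):      # one "thread" per iteration
            sums = [0.0] * TILESIZE
            for i in range(k):
                a_elem = a[row][i]             # loaded once, reused 4 times
                for t in range(TILESIZE):
                    sums[t] += a_elem * b[i][col + t]
            for t in range(TILESIZE):
                c[row][col + t] = sums[t]</code></pre><p>The WGSL kernel that follows has exactly this shape, with the inner loop over the tile unrolled into four accumulators. 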
</p><p>If calculating more elements per thread is faster than the overhead to launch each workGroup, we should see a big speedup<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>.</p><p>To do so, we calculate 4 results per thread (e.g. a 1x4 Tile). </p><pre><code>const BLOCKSIZE: u32 = 16;
const TILESIZE: u32 = 4;
@compute @workgroup_size(BLOCKSIZE, BLOCKSIZE)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {
    let row = global_id.y;
    let col = global_id.x * TILESIZE;

    if (row &gt;= dimensions.M || col &gt;= dimensions.N) {
        return;
    }

    var sum00: f32 = 0.0;
    var sum01: f32 = 0.0;
    var sum02: f32 = 0.0;
    var sum03: f32 = 0.0;

    for (var i: u32 = 0u; i &lt; dimensions.K; i = i + 1u) {
        let a_elem = a[row * dimensions.K + i];
        sum00 = sum00 + a_elem * b[i * dimensions.N + col];
        sum01 = sum01 + a_elem * b[i * dimensions.N + col + 1u];
        sum02 = sum02 + a_elem * b[i * dimensions.N + col + 2u];
        sum03 = sum03 + a_elem * b[i * dimensions.N + col + 3u];
    }

    result[row * dimensions.N + col] = sum00;
    result[row * dimensions.N + col + 1u] = sum01;
    result[row * dimensions.N + col + 2u] = sum02;
    result[row * dimensions.N + col + 3u] = sum03;
}</code></pre><p>The kernel looks roughly the same as before except we&#8217;ve <a href="https://en.wikipedia.org/wiki/Loop_unrolling">unrolled</a> the computation and are calculating <code>TILESIZE </code>results per thread.  </p><p>We can take this a step further and calculate 2D results per thread! Instead of calculating 4 elements per single row, we can calculate 4 elements for 4 rows (e.g. a 2D tile). </p><pre><code>const BLOCKSIZE: u32 = 16;
const TILE_M: u32 = 4;  // Tile size in M dimension
const TILE_N: u32 = 4;  // Tile size in N dimension

@compute @workgroup_size(BLOCKSIZE, BLOCKSIZE)
fn main(@builtin(global_invocation_id) global_id: vec3&lt;u32&gt;) {
    let row = global_id.y * TILE_M;
    let col = global_id.x * TILE_N;

    // initialize the array with all 0s
    var sums: array&lt;array&lt;f32, TILE_N&gt;, TILE_M&gt;;
    for (var i = 0u; i &lt; TILE_M; i++) {
        for (var j = 0u; j &lt; TILE_N; j++) {
            sums[i][j] = 0.0;
        }
    }

    // Compute the 2D tile
    for (var k = 0u; k &lt; dimensions.K; k++) {
        // for each row
        for (var i = 0u; i &lt; TILE_M; i++) {
            let a_element = a[(row + i) * dimensions.K + k];
            // calculate the dot product
            for (var j = 0u; j &lt; TILE_N; j++) {
                let b_element = b[k * dimensions.N + (col + j)];
                sums[i][j] += a_element * b_element;
            }
        }
    }

    // Write results
    for (var i = 0u; i &lt; TILE_M; i++) {
        for (var j = 0u; j &lt; TILE_N; j++) {
            let output_row = row + i;
            let output_col = col + j;
            if (output_row &lt; dimensions.M &amp;&amp; output_col &lt; dimensions.N) {
                result[output_row * dimensions.N + output_col] = sums[i][j];
            }
        }
    }
}</code></pre><p>Each thread now calculates a 4x4 tile of the output matrix, and we see a slight improvement over the last kernel. </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/p5Fkp/2/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cb464404-b5aa-4fac-ab00-4e5cc81c1542_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:395,&quot;title&quot;:&quot;1D and 2D Kernel Tiling&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/p5Fkp/2/" width="730" height="395" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><p>Surprisingly, 2D tiling is quite slow. Why haven&#8217;t we amortized the time it takes to launch workGroups by doing more work? And why are we slower than doing one item of work per thread? </p><h3>Kernel 5: Unrolling</h3><p>To answer the last question, we will need to dig into the compiled WebGPU kernels. </p><p>Some compilers will automatically unroll loops if the bounds of the loop are known at compile time. However, we&#8217;ve been writing a general kernel for variable-shaped inputs!</p><p>Also, when writing WGSL, we don&#8217;t have any control over the compiler&#8217;s directives. </p><p>Looking at the assembly bitcode compiled from Metal, we can see that the generated code still includes the for loop! </p><pre><code>%51 = phi i32 [ 0, %41 ], [ %61, %50 ]
%52 = add i32 %37, %51
%53 = zext i32 %52 to i64
%54 = getelementptr inbounds [1 x float], ptr addrspace(1) %3, i64 0, i64 %53
%55 = load float, ptr addrspace(1) %54, align 4, !tbaa !27, !alias.scope !43, !noalias !44
%56 = zext i32 %51 to i64
%57 = getelementptr inbounds %struct.type_5, ptr %7, i64 0, i32 0, i64 %49, i32 0, i64 %56
%58 = load float, ptr %57, align 4, !tbaa !27
%59 = fmul fast float %55, %48
%60 = fadd fast float %58, %59
store float %60, ptr %57, align 4, !tbaa !27
%61 = add nuw nsw i32 %51, 1
%62 = icmp eq i32 %61, 4
br i1 %62, label %38, label %50 // branching for loop</code></pre><p>Whereas the unrolled WGSL code gets compiled to </p><pre><code>...
%141 = fmul fast float %112, %103
%142 = fadd fast float %141, %82
%143 = fmul fast float %116, %103
%144 = fadd fast float %143, %81
%145 = fmul fast float %120, %103
%146 = fadd fast float %145, %80
%147 = fmul fast float %124, %103
%148 = fadd fast float %147, %79
%149 = fmul fast float %112, %107
%150 = fadd fast float %149, %78
%151 = fmul fast float %116, %107
%152 = fadd fast float %151, %77
%153 = fmul fast float %120, %107
%154 = fadd fast float %153, %76
%155 = fmul fast float %124, %107
%156 = fadd fast float %155, %75
%157 = add nuw i32 %91, 1
%158 = icmp eq i32 %157, %27
br i1 %158, label %159, label %74 </code></pre><p>Because of the manual unrolling, the GPU is able to reduce overhead by not having to initialize and increment the inner loop, take advantage of instruction-level parallelism, and amortize the cost of launching workGroups by doing more work per thread. With the loop in place, the kernel (#4) wasn&#8217;t able to take advantage of these optimizations and was slower than just launching more workGroups (#3). </p><p>And if we make our per-thread tile 8x8, we get a 3x boost over the 4x4 version and surpass 1 TFLOP! </p><div id="datawrapper-iframe" class="datawrapper-wrap outer" data-attrs="{&quot;url&quot;:&quot;https://datawrapper.dwcdn.net/W2laF/1/&quot;,&quot;thumbnail_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5f6777c9-9486-47fe-b35f-befa67005b15_1260x660.png&quot;,&quot;thumbnail_url_full&quot;:&quot;&quot;,&quot;height&quot;:395,&quot;title&quot;:&quot;2D Kernels Unrolled&quot;,&quot;description&quot;:&quot;&quot;}" data-component-name="DatawrapperToDOM"><iframe id="iframe-datawrapper" class="datawrapper-iframe" src="https://datawrapper.dwcdn.net/W2laF/1/" width="730" height="395" frameborder="0" scrolling="no"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(e){if(void 0!==e.data["datawrapper-height"]){var t=document.querySelectorAll("iframe");for(var a in e.data["datawrapper-height"])for(var r=0;r<t.length;r++){if(t[r].contentWindow===e.source)t[r].style.height=e.data["datawrapper-height"][a]+"px"}}}))}();</script></div><h2>Conclusion</h2><p>Through these optimizations, we built a performant matmul kernel that is 1000x faster than the naive kernel and approaches the Apple M2 Pro&#8217;s theoretical peak. </p><p>And with frequent updates to WebGPU, there are still optimizations to be made! 
For example, we didn&#8217;t take advantage of <a href="https://github.com/gpuweb/gpuweb/issues/3950">subgroups</a>, a feature that is new as of Chrome 125 and should allow for faster memory access and sharing across subgroups to reduce repeated computations. </p><p>And a big thank you to <a href="https://x.com/owl_poster">Abhishaike Mahajan</a> (who writes an <a href="https://www.owlposting.com/?utm_source=global-search">incredible blog</a>) and <a href="https://x.com/elmanmansimov">Elman Mansimov</a> for feedback and encouragement to write this article!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Visualizing these 2-dimensional maps poses two problems: projecting (e.g. with TSNE or UMAP) into a 2D coordinate system is slow and not RAM-friendly as dataset size grows, and rendering millions of datapoints in the browser risks turning your laptop into a toaster.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I would be remiss not to mention two repos that do similar things: <a href="https://github.com/0hq/WebGPT">webGPT</a> (Transformer-based inference only) and <a href="https://github.com/milhidaka/webgpu-blas">webgpu-blas</a> (fast matmul kernels in WebGPU). </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Highly inspired (in name and function) by <a href="https://github.com/tinygrad/tinygrad">tinygrad</a> and <a href="https://github.com/karpathy/micrograd">micrograd</a>. 
</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The format of the blog follows a similar path to <a href="https://siboehm.com/articles/22/CUDA-MMM">Simon Boehm&#8217;s article on Optimizing a CUDA Matmul Kernel</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Apple&#8217;s M3 Pro has a <a href="https://nanoreview.net/en/cpu-compare/apple-m3-pro-vs-apple-m3">reported</a> ~7TFLOPS. You can even run <a href="https://x.com/xenovacom/status/1840767709317046460">Llama3.2 (with ONNX)</a> in your browser with 85 tokens/s</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>For reference, <a href="https://huggingface.co/meta-llama/Llama-3.1-70B/blob/main/config.json">Llama 3.1 70B</a> has matrices of size (8192x28672)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>There&#8217;s quite a bit of boilerplate for running WebGPU code from Typescript, which I&#8217;ll leave for the curious to explore: https://webgpufundamentals.org/webgpu/lessons/webgpu-fundamentals.html</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>WGSL supports a <a href="https://google.github.io/tour-of-wgsl/types/">number</a> of <a 
href="https://www.w3.org/TR/WGSL/#types">types</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>To simplify the article and the amount of code, I removed much of the boilerplate needed to set up the GPU buffers and focus only on what&#8217;s required to understand how I optimized the WGSL kernels. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Due to another limitation: <a href="https://developer.mozilla.org/en-US/docs/Web/API/GPUSupportedLimits#maxComputeWorkgroupsPerDimension">maxComputeWorkgroupsPerDimension</a>. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>And this is something <a href="https://devstreaming-cdn.apple.com/videos/wwdc/2016/606oluchfgwakjbymy8/606/606_advanced_metal_shader_optimization.pdf?dl=1">Apple suggests when building compute kernels</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Stonks Only Go Up: Building a WallStreetBets Sentiment Model]]></title><description><![CDATA[What I Wish I Knew Building A Full-Stack ML System]]></description><link>https://www.nuss-and-bolts.com/p/stonks-only-go-up-building-a-nlp</link><guid isPermaLink="false">https://www.nuss-and-bolts.com/p/stonks-only-go-up-building-a-nlp</guid><dc:creator><![CDATA[Zach Nussbaum]]></dc:creator><pubDate>Tue, 26 Apr 2022 20:24:08 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!9yo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the past few years, I&#8217;ve spent a majority of my time trying to learn more about ML. From learning how to approach a problem to MLOps, my projects (and mentors) have been incredible resources in taking me from a noob to a baseline level of competence as a <a href="https://www.toptal.com/machine-learning">machine learning engineer</a>. </p><p>There are many tutorials and classes (my favorite being <a href="http://cs231n.stanford.edu/">CS231N</a>) that delve into the fundamentals of machine learning, a necessity for any later research/serious application of ML. Additionally, there are millions (~80M!) of &#8220;<a href="https://en.wikipedia.org/wiki/%22Hello,_World!%22_program">Hello World</a>&#8221; versions of machine learning, many of which use a framework to build a model to <a href="https://www.tensorflow.org/tutorials/quickstart/beginner">classify handwritten digits</a>. However, I found the content on building a solution to a business problem relatively sparse (or at least from what I was searching). So here&#8217;s my attempt at filling in the gap between basic blogs and research papers on arXiv. </p><p>This article (and possibly others) will cover some work I did with <a href="https://topstonks.com/">TopStonks</a>, how we approached an ML problem end to end, and things I learned along the way. </p><h2>WTF is TopStonks</h2><p>TopStonks aggregates &#8220;The best advice from the worst investors on the internet&#8221;. At a high level, they utilize data from financial communities (e.g. <a href="https://www.reddit.com/r/wallstreetbets/">r/wallstreetbets</a>) and turn that into meaningful insights. 
They have been featured in major financial publications including the <a href="https://www.wsj.com/articles/how-redditors-find-the-next-gamestop-stock-11613644201">Wall Street Journal</a>, <a href="https://markets.businessinsider.com/news/stocks/retail-traders-trumps-media-spac-deal-dwac-meme-stoocks-wallstreetbets-2021-10">Business Insider</a> and <a href="https://www.forbes.com/sites/greatspeculations/2021/04/19/saving-investors-from-meme-stocks-amc-entertainment-amc/?sh=37184dc551e4">Forbes</a>. </p><p>Working with an accomplished friend, I was lucky enough to get access to this data and learn from him how to approach problems like this. </p><h2>Deriving Sentiment</h2><p>Once you have all this data, the question then becomes: what do you do with it? Theoretically (and I&#8217;m sure some people do), you could parse every comment to find some <a href="https://www.investopedia.com/terms/a/alpha.asp">alpha</a>. But this is not scalable, unless your day job is to just read Reddit threads. </p><p>Ok, what next? You could apply some hard rules and search for terms like "&#128640;" or "to the moon &#127769; " to score a rough sentiment, but that only gets you so far. These rules capture only a small slice of the nuance of language, and learning that nuance is not easy!</p><p>For example, how would you write rules to classify these comments? Not so easy!</p><pre><code>SPCE is already printing tendies and continuing to go up as I type this</code></pre><p></p><pre><code>MSFT and BABA both shitting the bed, my spy calls didnt get filled this morning, my V calls didnt get filled last minute last night (getting some REAL fomo there). what do i buy now :( ?</code></pre><p></p><p>This is where we can use machine learning to determine the sentiment of a comment, i.e. how bullish an investor is. We have the data and we have a baseline rules-based approach, but it&#8217;s not good enough. 
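</p><p>As a sketch, such a rules-based scorer might look like the following (the phrase lists and scoring here are made up for illustration, not the actual rules we used):</p><pre><code>BULLISH_PHRASES = ['\U0001F680', 'to the moon', 'tendies']
BEARISH_PHRASES = ['puts', 'shitting the bed', 'drilling']

def rough_sentiment(comment):
    # crude score: +1 per bullish phrase, -1 per bearish phrase
    text = comment.lower()
    bullish = sum(text.count(p) for p in BULLISH_PHRASES)
    bearish = sum(text.count(p) for p in BEARISH_PHRASES)
    return bullish - bearish

rough_sentiment('SPCE to the moon \U0001F680')  # 2
rough_sentiment('MSFT shitting the bed')        # -1</code></pre><p>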
</p><h2>Looking At Your Data</h2><p>One thing that gets lost in many of the tutorials and classes is that data in the real world is USUALLY messy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. At the very least, there were some nuanced choices that were made, so you had better be sure you know exactly what you&#8217;re working with. </p><p>Data is the most important part of the process, and if you don&#8217;t take the time to understand its ins-and-outs, your model will suffer. Your model will ONLY be as good as the data you are working with. When working with my friend, I was truly surprised to see that the model development cycle consisted of ~75% working with data<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> (exploratory data analysis, labeling, and cleaning) and only ~25% model building and tuning. His attention to detail on cleaning and labeling data was surprising; I assumed most people just did some basic EDA. Most of our first few weeks were spent diving into the data trying to answer: </p><ul><li><p>What macro things are people talking about? Are themes local to the thread or are they shared across threads?</p></li><li><p>Are there comments we can omit? Is there spam? What does it look like?</p></li><li><p>Are stocks talked about in different ways? Are there stock-specific verbiages seen across a longer time period (e.g. on the order of months)?</p></li></ul><p>Without going through each comment, we would have missed out on so much. We <em>could</em> have thrown the kitchen sink at the problem, but eventually that time would have been spent figuring out why the model (most likely) sucked. </p><p>If you learn anything at all from this article, it&#8217;s to spend more time with your data. Get into the weeds and understand as much as you can about what you are working with. 
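</p><p>As a concrete starting point, even a few lines of Python (sketched here on a hypothetical list of raw comments) can surface spam and junk before any modeling:</p><pre><code>from collections import Counter

def quick_eda(comments):
    # the most repeated comments are often spam or copypasta
    counts = Counter(c.strip().lower() for c in comments)
    repeated = [(c, n) for c, n in counts.most_common(5) if n > 1]
    # very short comments rarely carry usable sentiment
    too_short = sum(1 for c in comments if len(c.split()) in (1, 2))
    return repeated, too_short

comments = ['GME \U0001F680\U0001F680', 'gme \U0001F680\U0001F680', 'buy high sell low', 'lol']
repeated, too_short = quick_eda(comments)  # one repeated comment, 3 short ones</code></pre><p>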
It will especially pay off when looking at pitfalls of the model. Granted, it&#8217;s easier to look into the data when it&#8217;s human understandable and not something that requires some domain expertise like genomic data. </p><h2>Data Label Like Your Model Depends on It</h2><p>So, your data journey begins with millions of unlabelled comments. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9yo9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9yo9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!9yo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg" width="500" height="498" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:498,&quot;width&quot;:500,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:38772,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9yo9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9yo9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F01fa0d05-d19a-4840-8a8b-d4c8f21c9e98_500x498.jpeg 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p>Label the data! This takes time, but it&#8217;s important: as you label, you start to understand what&#8217;s confusing and you add context to the problem.</p><p>Although tedious, attention to detail is paramount. No one will (usually) put more effort and care into the boring stuff than you. This is something I learned the hard way, as early iterations of our model stunk, primarily due to my lack of attention and poor data quality.</p><h2>The Fun Part: Modeling</h2><p>We now needed to see what a naive approach would yield, then iterate on it. 
And sometimes the simplest approach produces a 6/10 or 7/10 product, and small hacks can take it to the next level. </p><p>So we wrote some basic rules on what we thought were strong signals of a bullish/bearish comment, one such case being "X to the moon &#128640; ". These rules are an example of <a href="https://en.wikipedia.org/wiki/Precision_and_recall">high-precision, low-recall</a> classifiers: they&#8217;re accurate when a rule matches a comment, but otherwise (which is most of the time) they have no opinion.</p><p>Ok, so we wanted our models to be more sensitive to the actual text and content, but we didn&#8217;t have enough labeled data to train a sentiment model from scratch. What should we do? Utilize existing models!</p><p>Thankfully, <a href="https://huggingface.co/models?pipeline_tag=text-classification&amp;sort=downloads">HuggingFace</a> has many binary sentiment classification models trained on similar data, like movie reviews. </p><p>Evaluating these models showed that they were better than a pure rules-based approach, but that they failed on some edge cases specific to the data. The datasets these HuggingFace models were trained on differ from Reddit comments, which are full of what I like to term &#8220;meme-speak&#8221;. Things like</p><p><code>Tsla puts, musk is mole person. #DD</code></p><p>would never be found in a Yelp review or a Rotten Tomatoes movie review.</p><p>We can even see how inaccurately such a model scores out-of-distribution data. For example, the <a href="https://huggingface.co/textattack">textattack</a>/<a href="https://huggingface.co/textattack/bert-base-uncased-rotten-tomatoes">bert-base-uncased-rotten-tomatoes</a> model scores the following comments as incredibly bearish:</p><pre><code>SPCE to the moon

STONKS only go up 

SPCE is already printing tendies and continuing to go up as I type this

HOLD THE LINE BULLS &#128002;&#128002;&#128002;</code></pre><p>when it&#8217;s clear these are bullish comments.</p><h3>At a Crossroads</h3><p>We now have a bunch of pretrained models trained on tangential data. We also have hand-written rules that are highly accurate when they do fire, which is not often. How can we build a model combining all of these different signals?</p><h3>Snorkel</h3><p><a href="https://snorkel.ai/">Snorkel</a>! Snorkel allows users to create rules and learns a weighting to best approximate the true labels. Instead of having to choose one model or a set of rules, we could combine them into an ensemble!</p><p>Writing these rules was no easy task, however. We spent a lot of time iterating on what to include in the functions, mainly thinking about:</p><ul><li><p>How many pretrained models do we need? At what number of models do we hit diminishing returns?</p></li><li><p>What are key words that are easy predictors of sentiment?</p></li><li><p>Does comment length matter?</p></li></ul><p>A few example functions (Snorkel passes each labeling function a data point <code>x</code>; here <code>x.text</code> is the comment, and <code>contains_phrase</code> is a small case-insensitive helper):</p><pre><code>from snorkel.labeling import labeling_function

ABSTAIN, BULLISH = -1, 1

def contains_phrase(text, phrases):
    return any(p in text.lower() for p in phrases)

@labeling_function()
def fed(x):
    phrases = ['fed', ' powell']
    return BULLISH if contains_phrase(x.text, phrases) else ABSTAIN

@labeling_function()
def hold(x):
    phrases = ['holding', 'hold', 'holder', 'hodl', 'diamond hand', 'diamond hands']
    return BULLISH if contains_phrase(x.text, phrases) else ABSTAIN
</code></pre><p>At the end of the day, we came up with an ensemble model using ~50 rules and ~5 pretrained models. We could now more accurately predict sentences such as </p><p><code>GME to the moon!</code></p><p>or even more complex comments like </p><p><code>Earnings coming up , I think they&#8217;ll do really well and surpass earnings. They&#8217;re also down from their ATH which imo will shoot up the price midway from now to ATH. Everyone even boomers want to cut the cord and buy a roku stick, expected sales on tvs with roku built it surged during the holidays, and they&#8217;re making a significant money from ads from their free platform. Idk. My opinion, change...</code></p><h3>Caveats</h3><p>However, the model was not perfect. For one, not all comments are either bullish or bearish. Some are neutral or exist on a spectrum:</p><p><code>Yeah I&#8217;m in the same boat. Apple and MSFT trading at record highs and high PE? I&#8217;m 30 years from retirement...I&#8217;ll buy on the way up, on the way down, and sideways for the next 5 years at least. Might dump more $$ on large corrections...but I&#8217;ll DCA for the next 5-10 years and worry about it when I&#8217;m 60. I&#8217;m most confident in those 2 names specifically then any other.</code></p><p>How do you model a comment like this where they are (somewhat) bearish/nervous in the short term but bullish long term? Do you add more categories or do you (somehow) give a discrete value to the comment? </p><p>In general, the model was better suited to short comments with &#8220;meme-speak&#8221; sprinkled in. It tended to be extremely confident on comments with phrases like <code>stonks only go up</code> but less certain on other comments. 
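</p><p>Under the hood, Snorkel&#8217;s LabelModel learns how much to trust each labeling function from their agreements and disagreements. As a rough, hypothetical stand-in (with hard-coded weights instead of learned ones), combining the signals looks something like:</p><pre><code>ABSTAIN, BEARISH, BULLISH = -1, 0, 1

def weighted_vote(lf_outputs, weights):
    # lf_outputs: one label per rule/model; ABSTAIN votes are skipped
    # weights: per-function trust, which Snorkel would learn for us
    tally = {BULLISH: 0.0, BEARISH: 0.0}
    for label, weight in zip(lf_outputs, weights):
        if label != ABSTAIN:
            tally[label] += weight
    if tally[BULLISH] == tally[BEARISH]:
        return ABSTAIN
    return max(tally, key=tally.get)

# two rules fire BULLISH, one pretrained model says BEARISH
weighted_vote([BULLISH, BULLISH, BEARISH, ABSTAIN], [0.6, 0.5, 0.9, 0.7])  # BULLISH</code></pre><p>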
Given more time and resources, my friend and I would have loved to look more into predicting emotions based on <a href="https://en.wikipedia.org/wiki/Robert_Plutchik#Plutchik's_wheel_of_emotions">Plutchik&#8217;s Wheel of Emotions</a>, but it would have taken serious effort to crowdsource the labels. </p><p>We then tried weak supervision to improve the model. Using the confident outputs as ground truth, and also labelling a few thousand more examples, we fine-tuned a Large Language Model (LLM) with little improvement.</p><h3>Active Learning Attempts</h3><p>To improve the fine-tuned LLM further, we also tried <a href="https://en.wikipedia.org/wiki/Active_learning_(machine_learning)">active learning</a>, a technique for selecting the most beneficial examples to label. Put another way: what examples do I need to label to most improve my model? Figuring out which examples to choose is difficult, however. One standard way to choose important examples is to use a model&#8217;s uncertainty (calculated via <a href="https://youtu.be/v68zYyaEmEA">entropy</a>). Even still, we found little to no improvement over our ensemble. </p><h3>Wrapping Up</h3><p>As with any project, there are many things you wish you had time to keep trying and figuring out, even with this (already too long) article.</p><p>This project really improved my understanding of applied ML and how to approach most problems. I am eternally grateful to my friend for teaching me a smol portion of how to be effective at building useful ML products, and I wanted to share a slice of what I have learned. 
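</p><p>For the curious, the entropy-based selection mentioned in the active learning section can be sketched in a few lines (the comments and model probabilities here are made up):</p><pre><code>import math

def entropy(probs):
    # higher entropy = the model is less sure = more useful to label
    return -sum(p * math.log(p) for p in probs if p > 0)

def pick_most_uncertain(batch, k):
    # batch: list of (comment, predicted class probabilities) pairs
    ranked = sorted(batch, key=lambda pair: entropy(pair[1]), reverse=True)
    return [comment for comment, probs in ranked[:k]]

batch = [('stonks only go up', [0.98, 0.02]),
         ('Apple at record highs but I am 30 years from retirement', [0.55, 0.45])]
pick_most_uncertain(batch, 1)  # the ambiguous retirement comment</code></pre><p>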
</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>At least in my experience, n=1</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For a more detailed how-to on training a NN, see Andrej Karpathy&#8217;s <a href="https://karpathy.github.io/2019/04/25/recipe/">A Recipe for Training Neural Networks</a>.</p></div></div>]]></content:encoded></item></channel></rss>