“Why are we limited to chunk-level relevance? Are we really at the mercy of how we chopped up the text?”
From single vectors to hundreds of tiny spotlights
Most embedding models collapse an entire sentence (or longer document chunk) into one vector. That abstraction is perfect for ranking documents, yet it intentionally discards token-level nuance.
The good news: those fine-grained token vectors are still hiding inside the model’s last hidden layer; we just rarely look at them.
In this post we dust them off and treat every token as a tiny queryable unit.
By taking a simple cosine similarity between a user question and each token, we can paint a heatmap that lights up where the semantic match lives, even inside a 32k-token context window.
We can then leverage this information to extract relevant text spans.
This paves the way for a chunk-free RAG paradigm, one that retrieves only the most semantically relevant spans, not pre-defined chunks, guided entirely by the query’s content.
A running example you can meme-orise
To keep things concrete we’ll work with an odd corpus: a 2005 “SpongeBob SquarePants Laptop” user manual. It is publicly available, short enough to follow on screen while still making use of the lengthy context window. We will:
Embed the entire PDF and a sample query with Qwen3-Embedding-0.6B.
Compute per-token relevance and surface the results as color-graded heatmaps.
Postprocess the relevance signal to extract text spans.
Observe the results for other queries.
Compare the Qwen3 results with Jina Embeddings v4 (used in its ColBERT-style multi-vector mode) and see how they fare.
All with cool animated illustrations to support the process.
What you will learn
How dense embeddings include token-level information, and how to extract it.
How to compute relevance scores for every token in a document, using a single matrix multiplication.
How to visualize the results as a heatmap, and how to extract relevant text spans from the heatmap using a purpose-built algorithm.
The drawbacks of using this approach in a production setting (spoiler: it’s the massive storage requirements).
How ColBERT embeddings, which offer token-level embeddings, could also be used for this task.
from IPython.display import HTML, display


def display_video(url: str, chapters: list[float]) -> None:
    """
    Display a video player with specified chapters.

    Args:
        url (str): The URL of the video to display.
        chapters (list[float]): List of chapter start times in seconds.
    """
    chapter_qs = ",".join(map(str, chapters))
    html = f"""
    <iframe src="player.html?chapters={chapter_qs}&video={url}"
            style="width:100%; aspect-ratio:16/9; border:0;"
            loading="lazy"></iframe>
    """
    display(HTML(html))


video_url = "https://storage.googleapis.com/onielfa.com/articles/qwen3-span-relevance/ProjectVideo.mp4"
chapters = [0, 8, 15.5, 22, 28.4, 38.5, 41, 44]
display_video(video_url, chapters)
The model
The Qwen3 family of embedding models was released in July 2025 under the Apache-2.0 license.
They come in three sizes (0.6B, 4B and 8B), and feature a decoder-only architecture, based on that of the Qwen3 LLMs.
They currently top the embedding leaderboards, comfortably outperforming previous models and setting a new state of the art for text embeddings.
Let’s load the 0.6B version!
Code
import os

import matplotlib.pyplot as plt
import plotly.io as pio
import torch
from transformers import AutoModel, AutoTokenizer

# For rendering in vscode + quarto
pio.renderers.default = "plotly_mimetype+notebook"

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the Qwen3 model and tokenizer
qwen3_model = (
    AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B", torch_dtype=torch.bfloat16)
    .eval()
    .to(DEVICE)
)
qwen3_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
print("Qwen3 Embedding-0.6B model loaded successfully!")
Qwen3 Embedding-0.6B model loaded successfully!
The document
As mentioned before, we will use the manual for the VTech SpongeBob laptop as our example document.
Here is a preview of the document so you can browse through it (if the preview doesn't work, try reloading the page):
Converting the document to markdown
Using pymupdf4llm we can easily convert our lovely SpongeBob laptop manual to markdown and preview its first page.
We will also compute the character position at which each page starts, so we can later display page numbers in the plots.
Code
import itertools

import httpx
import pymupdf4llm
from IPython.display import display_markdown

url = "https://www.vtechkids.com/assets/data/products/%7BF162CE37-57EC-4B6E-AFBE-D9DA9CFA098D%7D/manuals/80-102900-Sponge_Bob_Laptop.pdf"

# Download the PDF file
response = httpx.get(url, follow_redirects=True)
response.raise_for_status()  # Ensure the request was successful

document = "spongebob_laptop_manual"

# Save the PDF file to the local filesystem
with open(f"assets/{document}.pdf", "wb") as f:
    f.write(response.content)

md_conversion = pymupdf4llm.to_markdown(f"assets/{document}.pdf", page_chunks=True)
text_per_page = [c["text"] for c in md_conversion]

# Compute the character at which each page starts
acc_chars_per_page = list(itertools.accumulate([len(text) for text in text_per_page]))
text = "".join(text_per_page)

# Add a quote to each line so it's displayed as a blockquote
text_preview = "\n".join(f"> {line}" for line in text_per_page[0].splitlines())
display_markdown(text_preview, raw=True)
We will use the model to create our embeddings for the document and query.
How can we generate token-level embeddings from a model meant to generate sentence-level embeddings?
By default, these embedding models generate a single vector of 1024 dimensions (matching the “hidden size” of the 0.6B model) for each input text.
In embedding models, the single embedding vector is generated by pooling the last hidden state, usually by averaging the embeddings of all tokens or by selecting the embedding of a special token.
In the case of the Qwen3 embeddings, this is achieved by simply selecting the embedding of the special token <|endoftext|>, which the tokenizer appends as the last token of the sequence.
During training the model learns to make the embedding of this special token represent the entire input text, and thus it can be used as a compact representation of the input.
However, we can choose to forgo this pooling step. Skipping it yields a matrix of shape (tokenized_text_length, 1024): one embedding vector per token.
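A minimal sketch of the difference, assuming tokenized holds the tokenizer output as in the embedding function defined below:

# last_hidden has shape (batch, seq_len, hidden_size)
last_hidden = qwen3_model(**tokenized).last_hidden_state

# Pooled, sentence-level embedding: keep only the final <|endoftext|> position.
pooled = last_hidden[:, -1]         # (batch, hidden_size)

# Token-level embeddings: skip pooling and keep one vector per position.
per_token = last_hidden.squeeze(0)  # (seq_len, hidden_size)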
Generating the embeddings for the document
As we are going to embed our document as a single chunk, one might worry about the contents being too lengthy and running up against the model’s context window limit.
For these Qwen3 embedding models, however, the context window is a staggering 32k tokens, which makes it likely that our whole document will fit. Let’s make sure by checking the size of our tokenized document.
While computing these embeddings, we will also extract a text representation of each token and a mapping from tokens to text positions. This is useful for plotting each token and for converting token indices back to text indices once we know which tokens are relevant.
Code
import torch
import torch.nn.functional as F
from transformers import PreTrainedModel, PreTrainedTokenizer


@torch.inference_mode()
def embed_text_qwen(
    text: str,
    max_length: int = 32768,
    model: PreTrainedModel = qwen3_model,
    tokenizer: PreTrainedTokenizer = qwen3_tokenizer,
) -> tuple[list[str], torch.Tensor, torch.Tensor]:
    """
    Embed the given text using the Qwen3 model and tokenizer.

    Args:
        text (str): The text to embed.

    Returns:
        tuple: A tuple containing the tokens, the offsets and the text embeddings.
            - text_toks (list[str]): The text representation of each token.
            - offsets (torch.Tensor): The offsets of the tokens in the original text.
            - text_embeds (torch.Tensor): The embeddings of the text.
    """
    # Tokenize using the Qwen3 tokenizer
    # and return the offsets and the text embeddings
    # The offsets will be used to map the embeddings back to the original text
    tokenized = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=max_length,
        return_offsets_mapping=True,
    ).to(DEVICE)
    # Run the tokenized input through the Qwen3 model to get the embeddings
    # The output is a tensor of shape (1, sequence_length, embedding_dim)
    text_embeds = model(**tokenized).last_hidden_state.squeeze(0)
    # Normalize each of the vectors to unit length
    # For cosine similarity calculations
    text_embeds = F.normalize(text_embeds, p=2, dim=-1)
    # Return the split text tokens
    text_toks = [
        tokenizer.decode(token_id, skip_special_tokens=False)
        for token_id in tokenized["input_ids"].squeeze().tolist()
    ]
    return text_toks, tokenized["offset_mapping"].squeeze(), text_embeds.cpu()


def map_offsets_to_pages(
    offsets: torch.Tensor, acc_chars_per_page: list[int]
) -> list[int]:
    # Mapping of character offsets to tokens for later visualizations
    acc_tokens_per_page = []
    for chars_in_page in acc_chars_per_page:
        # Remove the last offset token (special token)
        # And only check the start token in the offset
        is_in_page = offsets[:-1, :1] <= chars_in_page
        acc_tokens_per_page.append(is_in_page.sum().item())
    return acc_tokens_per_page


text_toks, offsets, text_embeds = embed_text_qwen(text)
print(f"Number of tokens: {len(offsets)}")
print(f"Document embedding shape: {tuple(text_embeds.shape)}")
assert text_embeds.shape[0] == offsets.shape[0], (
    "Text and offsets should have the same number of tokens"
)

# Map offsets to pages
acc_tokens_per_page = map_offsets_to_pages(offsets, acc_chars_per_page)

# Create inverse mapping from token to char
token_char_starts = {i: int(start) for i, (start, end) in enumerate(offsets)}
token_char_ends = {i: int(end) for i, (start, end) in enumerate(offsets)}
assert len(token_char_starts) == len(token_char_ends) == text_embeds.shape[0], (
    "Token to char mapping should be consistent"
)
Number of tokens: 3597
Document embedding shape: (3597, 1024)
We can see that the embedding process resulted in 3597 vectors, meaning we are well below the 32k token limit!
Generating the embeddings for the query
For embedding the query, the Qwen team recommends using an instruction prompt such as:
Instruct: Given a web search query, retrieve relevant passages that answer the query
Query:{query}
They indicate that not using a prompt can drop performance by 1-5%.
Since in our case we aren’t after the last bit of performance, we’ll embed the query without a prompt, for simplicity.
Exploring how the relevance of the words change depending on the prompt might be an interesting thing to do in the future.
The query we are going to use is: What game makes me reason by weighting objects?
The answer to this query is on page 9 of the document, and the game in question is called “Weighty Food”.
Code
query ="What game makes me reason by weighting objects?"query_text_toks, _, query_embeds = embed_text_qwen(query)print(f"Query text tokens: {query_text_toks}")assert query_embeds.shape[0] ==len(query_text_toks), ("Query tokens and colbert vector should have the same number of tokens")print(f"Number of query tokens: {len(query_text_toks)}")print(f"Query embedding shape: {tuple(query_embeds.shape)}")
Query text tokens: ['What', ' game', ' makes', ' me', ' reason', ' by', ' weighting', ' objects', '?', '<|endoftext|>']
Number of query tokens: 10
Query embedding shape: (10, 1024)
How we score every token: from cosine similarity to one big matrix
When comparing sentence embeddings, we usually use cosine similarity to measure how close they are:
\[
\text{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
\]
Since we already normalized every token embedding to unit length, cosine similarity reduces to a plain dot product. We can therefore score every query token against every document token at once with a single matrix multiplication:
\[
R = Q D^{\top}
\]
where \(Q\) holds the query token embeddings (one per row) and \(D\) holds the document token embeddings.
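In code, this is a single matrix multiplication (mirroring the process_query code later in the post); relevances is the matrix the plotting code below consumes:

# Score every query token against every document token in one operation.
# Both matrices were L2-normalized inside embed_text_qwen, so the dot
# products are cosine similarities.
relevances = query_embeds @ text_embeds.T  # shape: (10, 3597)
print(f"Relevance matrix shape: {tuple(relevances.shape)}")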
To get some insight into the relationships of relevance between the query and document tokens, let’s display an interactive heatmap where we can observe the values of the relevance matrix.
Code
import numpy as np
import plotly.express as px
from IPython.display import HTML, display


def create_heatmap(relevances, query_text_toks, text_toks, acc_tokens_per_page):
    n_rows, n_cols = relevances.shape
    # Display heatmap
    fig = px.imshow(
        relevances,
        labels=dict(color="Relevance"),
        x=list(range(n_cols)),  # numeric x-coords → we’ll override ticks later
        y=query_text_toks,
        color_continuous_scale="Agsunset",
        aspect="auto",
    )
    # Show tokens on hover
    customdata = np.tile(text_toks, (n_rows, 1))  # duplicate token list down the rows
    fig.update_traces(
        customdata=customdata,
        hovertemplate=(
            "Query token: %{y}<br>"
            "Doc token : %{customdata}<br>"
            "Relevance : %{z:.3f}<extra></extra>"
        ),
    )
    # Show page numbers on the x-axis
    page_tick_positions = [0] + acc_tokens_per_page
    page_tick_texts = [f"Page {i + 1}" for i in range(len(page_tick_positions) - 1)] + [
        "<|endoftext|>"
    ]
    fig.update_xaxes(
        tickmode="array",
        tickvals=page_tick_positions,
        ticktext=page_tick_texts,
        tickangle=90,
        side="bottom",
    )
    return fig


def display_plot(fig):
    html = fig.to_html(full_html=False, include_plotlyjs="cdn")
    display(HTML(html))


fig = create_heatmap(
    relevances.to(torch.float32), query_text_toks, text_toks, acc_tokens_per_page
)
fig.update_layout(
    title="Heatmap for relevance between query and document tokens.", height=320
)
display_plot(fig)
Collapsing to document token relevance over the whole query
We know that the Qwen3 embedding models are trained by taking the embedding for the <|endoftext|> token as the representative embedding for the whole text.
So the relevance score that we would see when using the embedding model with pooling corresponds to the value in the last row and last column of the relevance matrix, which measures the relevance between the two <|endoftext|> tokens.
Using this information, if we wish to see the relevance of each document token to the whole query, it is as easy as visualizing only the last row of this relevance matrix: we are comparing the <|endoftext|> token (representative of the query) against each document token.
Knowing this, the relevance computation could have been simplified even further by computing only the last row of the matrix instead of the whole matrix:
\[
R_{\text{last}} = Q_\text{last} D^{\top}
\]
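In code, this collapse is just a vector-matrix product, a minimal sketch using the variables from the earlier cells:

# Relevance of each document token to the whole query: dot the query's
# <|endoftext|> embedding (the last row of query_embeds) with every
# document token embedding.
doc_relevances = query_embeds[-1] @ text_embeds.T  # shape: (3597,)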
Focusing our attention on this plot, we can observe some patterns appearing, namely, hotspots with high relevance along the document.
Code
fig_collapsed = create_heatmap(
    relevances[-1:, :-1],  # Only the last row and remove the last column
    query_text_toks[-1:],  # Last query token
    text_toks[:-1],  # Remove the last document token
    acc_tokens_per_page,
)
fig_collapsed.update_layout(
    title="Heatmap for relevance between query and document tokens (collapsed to last token).",
    height=240,
)
display_plot(fig_collapsed)
Heatmap of token relevance between whole query and document tokens. Hover for details.
Let’s now visualize it as a line plot to make it clearer.
Code
import pandas as pd


# Create a line plot for the last row of the heatmap
def create_line(relevances_vector, text_toks, acc_tokens_per_page):
    df = pd.DataFrame(
        relevances_vector,
        columns=["Raw relevance"],
    )
    # Display line plot
    fig = px.line(df)
    # Show tokens on hover
    customdata = text_toks
    fig.update_traces(
        customdata=customdata,
        hovertemplate=(
            "Query token: %{y}<br>"
            "Doc token : %{customdata}<br>"
            "Relevance : %{y:.3f}<extra></extra>"
        ),
        line_color="gray",
    )
    if len(text_toks) > 1000:
        # Show page numbers on the x-axis
        page_tick_positions = [0] + acc_tokens_per_page
        page_tick_texts = [
            f"Page {i + 1}" for i in range(len(page_tick_positions) - 1)
        ]
        fig.update_xaxes(
            tickmode="array",
            tickvals=page_tick_positions,
            ticktext=page_tick_texts,
            tickangle=90,
            side="bottom",
            title="Token index",
            showgrid=True,
        )
    else:
        fig.update_xaxes(
            title="Tokens",
            tickangle=90,
            tickvals=list(range(len(text_toks))),
            ticktext=text_toks,
            showgrid=True,
        )
    fig.update_yaxes(title="Relevance", showgrid=True)
    return fig


fig_line = create_line(
    relevances[-1, :-1],  # Only the last row and remove the last column
    text_toks[:-1],  # Remove the last document token
    acc_tokens_per_page,
)
fig_line.update_layout(
    template="simple_white",
    title=f"Line plot for relevance between query <i>{query}</i> and document tokens.",
    showlegend=False,
    height=320,
)
display_plot(fig_line)
Much better!
We can now observe how the relevance is not uniform across the document, and that we have the strongest relevance in the middle of page 9, and along page 10, which matches the general area where the answer to our query is located.
Finding relevant spans
Now, how can we extract relevant spans from this noisy data?
Looking at the raw data (you may hover over the line graph to see which token each score corresponds to), there seems to be a pattern: relevance is higher at the end of relevant sentences rather than on specific relevant tokens.
It seems the model may be combining and encoding the meaning of the preceding tokens into those later tokens, which is consistent with the Qwen3 embedding models being based on a decoder-only architecture that, unlike encoder-based embedding models, does not use bidirectional attention.
This means that a high-relevance token may indicate that the tokens preceding it are relevant as well.
Let’s zoom into a specific sentence to showcase how high relevance appears at line endings such as \n or .
Code
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import Normalize

# Example sentence indices (you may alter these to visualize different sentences)
sentence_idxs = (1717, 1761)

# Sample data
tokens = [
    a if a != "\n" else "\\n" for a in text_toks[sentence_idxs[0] : sentence_idxs[1]]
]
values = (
    relevances[-1, sentence_idxs[0] : sentence_idxs[1]].cpu().numpy()
)  # values between -1 and 1

# Normalize values to [0, 1]
norm = Normalize(vmin=min(values), vmax=max(values))
cmap = matplotlib.colormaps["Blues"]


def get_text_color(rgb):
    # rgb values are in [0,1]; calculate luminance
    r, g, b = rgb[:3]
    luminance = 0.299 * r + 0.587 * g + 0.114 * b
    return "black" if luminance > 0.5 else "white"


fig, ax = plt.subplots(figsize=(21, 2))
ax.axis("off")

x = 0.01
y = 0.5
for word, val in zip(tokens, values, strict=False):
    bg_color = cmap(norm(val))
    text_color = get_text_color(bg_color)
    txt = ax.text(
        x,
        y,
        word,
        fontsize=12,
        va="center",
        ha="left",
        color=text_color,
        bbox=dict(facecolor=bg_color, edgecolor="none", boxstyle="square,pad=0.2"),
    )
    renderer = fig.canvas.get_renderer()
    bbox = txt.get_window_extent(renderer=renderer).transformed(
        ax.transData.inverted()
    )
    x = bbox.x1 + 0.005

sm = ScalarMappable(cmap=cmap, norm=norm)
cbar = plt.colorbar(sm, ax=ax, orientation="horizontal", fraction=0.2, pad=0)
cbar.set_label("Token Relevance")
plt.title("Token relevance of sentence tokens for query: " + query)
plt.show()
We now know that the peaks in the relevance signal are key, matching the ends of relevant sentences and, occasionally, exact text matches.
Let’s devise an algorithm to extract these relevant spans. A naive implementation might look like this:
Preprocess the relevance signal (optional):
I have found that using a smoothing filter such as a Gaussian filter makes the signal easier to work with in the next steps.
The adjustable parameter for this filter is sigma, which controls the amount of smoothing applied. A larger value will result in a smoother signal, while a smaller value will retain more of the original signal’s peaks and troughs.
Detect peaks: Find tokens with relevance above a set threshold.
Cluster nearby peaks: Use a sliding-window approach to cluster nearby peaks; the maximum allowed gap between peaks is controlled by a parameter delta.
Filter out small spans (optional): Remove spans with fewer than min_span_size tokens. This step helps remove matches that may be based purely on a single token matching lexically.
Extend the clusters: Extend each match forwards and backwards until a separator is found, so that the span covers a semantic unit, up to a maximum of max_extension tokens in each direction.
Compute scores for each span: This can be done via different aggregations of the relevance scores within the span. By using the maximum, we get a representative value that is invariant to how the cluster is extended.
Code
# Now create a line plot of the smoothed token relevances
import re

import numpy as np
from scipy.ndimage import gaussian_filter1d

doc_relevances = relevances[-1, :-1].to(torch.float32)

# These could be expanded in the future.
SEPARATORS = ["\n", "\t", "."]
sep_re = re.compile("[" + re.escape("".join(SEPARATORS)) + "]")


def detect_spans(
    doc_relevances: torch.Tensor,
    doc_tokens: list[str],
    threshold: float,
    delta: int,
    min_span_size: int,
    max_extension: int = 32,
) -> tuple[list[tuple[int, int]], list[float]]:
    """
    Detects spans of relevance in the document relevances based on a threshold.

    Args:
        doc_relevances (torch.Tensor): The relevance scores for each token in the document.
        doc_tokens (list[str]): The tokens of the document.
        threshold (float): The threshold for relevance to consider a peak.
        delta (int): Maximum allowed gap between peaks to consider them in the same cluster.
        min_span_size (int): Minimum size of a span to be considered relevant.
        max_extension (int): Maximum number of tokens to extend the span to a separator in each direction.

    Returns:
        tuple: A tuple containing:
            - clustered_spans (list): A list of tuples representing the start and end indices of relevant spans.
            - scores (list): A list of scores for each clustered span, computed as the maximum relevance in the span.
    """
    peaks = torch.where(doc_relevances > threshold)[0].cpu().numpy()
    clustered_spans = []
    if peaks.size > 0:
        current_start = peaks[0]
        current_end = peaks[0]
        for idx in peaks[1:]:
            if idx <= current_end + delta:
                # same cluster: just move the end to this new peak
                current_end = idx
            else:
                # gap is too large -> finish current cluster, start a new one
                if current_end - current_start + 1 >= min_span_size:
                    clustered_spans.append((current_start, current_end))
                current_start = idx
                current_end = idx
        if current_end - current_start + 1 >= min_span_size:
            clustered_spans.append((current_start, current_end))
    if not clustered_spans:
        return [], []

    # Pre-compute a lookup for nearby separators
    tokens_arr = np.asarray(doc_tokens)
    is_sep = np.vectorize(lambda s: bool(sep_re.search(s)))(tokens_arr)
    last_seen = np.where(is_sep, np.arange(len(tokens_arr)), -1)
    prev_sep = np.maximum.accumulate(last_seen)
    next_seen = np.where(is_sep, np.arange(len(tokens_arr)), len(tokens_arr))
    next_sep = np.minimum.accumulate(next_seen[::-1])[::-1]

    # Extend left and right edge until the nearest separator
    extended = []
    for start, end in clustered_spans:
        lb = -1 if start == 0 else prev_sep[start - 1]
        rb = next_sep[end]
        new_start = max(lb + 1, start - max_extension)
        new_end = min(rb - 1, end + max_extension)
        if new_end >= new_start:
            extended.append((new_start, new_end))

    # Compute scores for the clustered spans (max relevance in the span)
    scores = [doc_relevances[start : end + 1].max().item() for start, end in extended]
    return extended, scores


def mask_from_spans(
    doc_relevances: torch.Tensor, clustered_spans: list[tuple[int, int]]
) -> torch.Tensor:
    """
    Create a mask from clustered spans.

    Args:
        doc_relevances (torch.Tensor): The relevance scores for each token in the document.
        clustered_spans (list): A list of tuples representing the start and end indices of relevant spans.

    Returns:
        torch.Tensor: A boolean mask indicating the positions of relevant spans.
    """
    mask = torch.zeros_like(doc_relevances, dtype=torch.bool)
    for start, end in clustered_spans:
        mask[start : end + 1] = True
    return mask


threshold = 0.39

# Smooth the token relevances using a Gaussian filter
smoothed_relevances = torch.Tensor(
    gaussian_filter1d(doc_relevances.numpy(), sigma=4, mode="nearest", order=0)
)

clustered_spans, scores = detect_spans(
    smoothed_relevances,
    text_toks[:-1],
    threshold=threshold,
    delta=170,
    min_span_size=15,
)

# Create mask for clustered spans
mask = mask_from_spans(smoothed_relevances, clustered_spans)
Code
import plotly.graph_objects as go


def plot_relevances_with_spans(
    fig: go.Figure,
    doc_relevances: torch.Tensor,
    smoothed_relevances: torch.Tensor | np.ndarray,
    mask: np.ndarray,
    threshold: float,
    acc_tokens_per_page: list[int],
    mask_gt: np.ndarray | None = None,
    row: int | None = None,
):
    """
    Plots the token relevances with shaded spans of relevance.

    Args:
        fig (go.Figure): The Plotly figure to add traces to.
        doc_relevances (torch.Tensor): The raw token relevances.
        smoothed_relevances (np.ndarray): The smoothed token relevances.
        mask (np.ndarray): A boolean mask indicating relevant spans.
        threshold (float): The threshold for relevance to consider a span.
        acc_tokens_per_page (list[int]): Cumulative token counts per page for x-axis ticks.
        mask_gt (np.ndarray | None): Optional boolean mask with the ground truth spans.
        row (int | None): The row number for subplotting. If None, plot in the first row.
    """
    kwargs = {"row": row, "col": 1} if row is not None else {}
    show_legend = row is None or row == 1
    x = np.arange(len(doc_relevances))
    y = doc_relevances.cpu().numpy()
    # Raw token-level relevances
    fig.add_trace(
        go.Scatter(
            x=x,
            y=y,
            name="Raw Relevance",
            mode="lines",
            line=dict(color="gray", width=1),
            opacity=0.5,
            showlegend=show_legend,
        ),
        **kwargs,
    )
    # Smoothed curve
    fig.add_trace(
        go.Scatter(
            x=x,
            y=smoothed_relevances,
            name="Smoothed Relevance",
            mode="lines",
            line=dict(color="gray", width=2),  # full opacity
            showlegend=show_legend,
        ),
        **kwargs,
    )

    def shade_span(ma: np.ndarray, opacity: float = 0.5, color: str = "salmon"):
        # Shade contiguous “relevant” spans exactly where mask==True
        in_span, start = False, 0
        for i, m in enumerate(ma):
            if m and not in_span:  # span starts
                in_span, start = True, i
            elif not m and in_span:  # span ends
                in_span = False
                fig.add_vrect(
                    x0=start - 0.5,
                    x1=i - 0.5,
                    fillcolor=color,
                    opacity=opacity,
                    layer="below",
                    line_width=0,
                    **kwargs,
                )
        # If the document ends inside a span
        if in_span:
            fig.add_vrect(
                x0=start - 0.5,
                x1=len(y) - 0.85,
                fillcolor=color,
                opacity=opacity,
                layer="below",
                line_width=0,
                **kwargs,
            )
            in_span, start = False, 0

    if mask_gt is not None:
        shade_span(mask_gt, opacity=0.85, color="lightskyblue")
    shade_span(mask)

    # Horizontal threshold line
    fig.add_hline(
        y=threshold,
        line=dict(color="red", dash="dash"),
        opacity=0.5,
        annotation_text="Relevance threshold",
        annotation_position="bottom right",
        **kwargs,
    )

    # Axis styling, page tick labels, grid, legend, size, theme
    tick_vals = [0] + acc_tokens_per_page
    tick_text = [f"Page {i + 1}" for i in range(len(tick_vals))]
    fig.update_xaxes(
        title_text="Token index",
        tickmode="array",
        tickvals=tick_vals,
        ticktext=tick_text,
        tickangle=90,
        showgrid=True,
        **kwargs,
    )
    fig.update_yaxes(title_text="Relevance score", showgrid=True, **kwargs)
    fig.update_layout(
        title="Token relevances and ground truth spans",
        template="plotly_white",
        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0),
        margin=dict(l=40, r=40, t=70, b=40),
        autosize=True,
    )

    # Manually add legend for the spans and threshold line
    if show_legend:
        if mask_gt is not None:
            fig.add_trace(
                go.Scatter(
                    x=[None],  # No x data, just for legend
                    y=[None],  # No y data, just for legend
                    mode="lines",
                    line=dict(color="lightskyblue", width=2),
                    name="Ground truth spans",
                    showlegend=True,
                )
            )
        fig.add_trace(
            go.Scatter(
                x=[None],  # No x data, just for legend
                y=[None],  # No y data, just for legend
                mode="lines",
                line=dict(color="salmon", width=2),
                name="Relevant Spans",
                showlegend=True,
            )
        )
        fig.add_trace(
            go.Scatter(
                x=[None],  # No x data, just for legend
                y=[None],  # No y data, just for legend
                mode="lines",
                line=dict(color="red", dash="dash"),
                name="Relevance Threshold",
                showlegend=True,
            )
        )


fig_spans = go.Figure()
plot_relevances_with_spans(
    fig_spans,
    doc_relevances,
    smoothed_relevances,
    mask,
    threshold,
    acc_tokens_per_page,
)
fig_spans.update_layout(
    height=320,
)
display_plot(fig_spans)
Code
import markdown
from IPython.display import display_html, display_markdown


def display_text_from_results(
    query: str,
    clustered_spans: list[tuple[int, int]],
    scores: list[float],
    top_k: int = 2,
):
    """
    Display the text for each clustered span with its score.

    Args:
        query (str): The original query text.
        clustered_spans (list[tuple[int, int]]): List of tuples representing the start and end indices of relevant spans.
        scores (list[float]): List of scores for each span.
        top_k (int): Number of top spans to display.
    """
    if not clustered_spans or not scores:
        display_markdown("No relevant spans found.", raw=True)
        return

    # Convert spans to text
    texts = [
        text[token_char_starts[start] : token_char_ends[end]]
        for start, end in clustered_spans
    ]

    # Sort spans by score
    sorted_spans = sorted(zip(texts, scores), key=lambda x: x[1], reverse=True)
    # Limit to top_k spans
    sorted_spans = sorted_spans[:top_k]

    display_markdown(f"### Top {top_k} spans for query: `{query}`", raw=True)
    results = ""
    for i, (txt, score) in enumerate(sorted_spans):
        # Add quote to the text for better formatting
        txt = "\n".join(f"\t{line}" for line in txt.splitlines())
        t = f"**Text for span {i + 1} - Score {score:.2f}**\n\n{txt}\n"
        results += t

    # Display as a collapsable section
    html = f"""
    <details>
    <summary>Click to view results</summary>
    <pre>{markdown.markdown(results)}</pre>
    </details>
    """
    display_html(html, raw=True)


# Display the text for the clustered spans
display_text_from_results(query, clustered_spans, scores, top_k=2)
Top 2 spans for query: What game makes me reason by weighting objects?
Click to view results
Text for span 1 - Score 0.45
### 7
3) Word Scramble
Help SpongeBob sort out the condiment bottles.
A word is given, then it will be scrambled up. Help
SpongeBob unscramble the word by placing the
bottles in order. Use the left and right arrow buttons
to select a bottle and press the Enter button. Then,
choose a second bottle to swap it with. Repeat until
the word is spelled out correctly.
MATH
There are three activities in this category. They teach counting, addition,
subtraction, and the concept of weight.
4) Weighty Food
There are two scales weighing three different foods.
From the selection of food shown, you will be asked
to find the heaviest or lightest food. Use the left and
right arrow buttons to select the correct food and
press the Enter button to answer.
5) Add Seasoning
Help SpongeBob add seasoning to the Krabby
Patty. A formula is shown, along with five
condiments. Each condiment displays a different
number. Select two condiments that will make
the formula work. Use the left and right buttons to
select the condiments and press the Enter button
to confirm. You can also press a number button to answer.
6) Boating Bubble
Help Patrick drive safely to pick up the Krabby Patty
buns. On the way, Patrick will encounter bubbles
with different numbers. You will be asked to count in
multiples of a certain number to help Patrick along
his way. Steer with the up and down arrow buttons
and catch the correct bubbles as you go.
### 8
LOGIC
There are three activities in this category. They teach memory, logic,
and patterns.
7) Assembling Patties
There are three Krabby Patties on the grill, and each
one has a different pattern. Help SpongeBob keep
track of the Krabby Patty with the given pattern.
Watch carefully as SpongeBob flips the Patties over
and swaps their places. Then, use the left and right
arrow buttons to select the correct Krabby Patty and
press the Enter button to answer.
8) Jellyfishing
There are eight jellyfish on the screen and only two
of them look exactly the same. Watch carefully and
catch the two jellyfish that match each other. Use
the arrow buttons to select and press the Enter
button to confirm.
9) Patty Catch
SpongeBob bumped into the shelf and all the
ingredients are falling down! Help SpongeBob
make some Krabby Patties with the falling
ingredients by following the model on the side of
the screen. Use the left and right arrow buttons to
catch the correct ingredients, and use the down
arrow button to make the ingredients fall faster.
### 9
CREATIVITY AND GAMES
There are three activities in this category
Text for span 2 - Score 0.42
Press a number button to hear the number, or use these buttons to answer
questions in the Math category.
**7. Arrow Buttons**
Press these buttons to make a selection or answer
a question
Running it for multiple queries
Let’s now visualize the output for different queries and compare it to the ground truth spans where the answers are located, showcasing how the algorithm behaves with different relevance signals.
Code
queries = [
    "What game makes me reason by weighting objects?",  # Middle of page 9
    "The display is too bright, what can I do?",  # Middle of page 7
    "What type of battery does the laptop use?",  # Second half of page 4
    "The laptop is not starting, what can I do?",  # End of page 12
]
gt_spans = [
    (1676, 1761),
    (1133, 1163),
    (655, 698),
    (2645, 2734),
]
gt_masks = [np.zeros_like(doc_relevances, dtype=bool) for _ in queries]
for (start, end), m in zip(gt_spans, gt_masks):
    m[start : end + 1] = True

for query, gt_span in zip(queries, gt_spans):
    print(f"Query: {query}")
    gt_text = "".join(text_toks[gt_span[0] : gt_span[1]])
    # Quote each line for better formatting
    gt_text = "\n".join(f"> {line}" for line in gt_text.splitlines())
    html = f"""
    <details>
    <summary>Click to view ground truth span</summary>
    <pre>{markdown.markdown(gt_text)}</pre>
    </details>
    """
    display_html(html, raw=True)
Query: What game makes me reason by weighting objects?
Click to view ground truth span
MATH
There are three activities in this category. They teach counting, addition,
subtraction, and the concept of weight.
4) Weighty Food
There are two scales weighing three different foods.
From the selection of food shown, you will be asked
to find the heaviest or lightest food. Use the left and
right arrow buttons to select the correct food and
press the Enter button to answer.
Query: The display is too bright, what can I do?
Click to view ground truth span
Contrast Slider
Slide this to the right to darken the screen contrast, or slide to the left to
make the screen contrast lighter.
Query: What type of battery does the laptop use?
Click to view ground truth span
three new “AA” (AM-3/LR6)
batteries into the compartment as
illustrated. (The use of new, alkaline
batteries is recommended for
maximum performance.)
Query: The laptop is not starting, what can I do?
Click to view ground truth span
If your VTech [®] SpongeBob Laptop stops working or does not turn on:
Check your batteries. Make sure the batteries are fresh and
properly installed.
If you are still having problems, visit our website at
www.vtechkids.com for troubleshooting tips.
If nothing happens when you press the On/Off button:
Check to see that the batteries are aligned correctly.
Code
from plotly.subplots import make_subplots


def process_query(fig, query, gt_mask, row):
    query_text_toks, _, query_embeds = embed_text_qwen(query)
    relevances = query_embeds @ text_embeds.T
    doc_relevances = relevances[-1, :-1].to(
        torch.float32
    )  # Only the last row and remove the last column
    smoothed_relevances = gaussian_filter1d(
        doc_relevances.cpu().numpy(), sigma=4, mode="nearest", order=0
    )
    threshold = 0.39
    spans, scores = detect_spans(
        torch.tensor(smoothed_relevances),
        text_toks,
        threshold=threshold,
        delta=170,
        min_span_size=15,
    )
    mask = mask_from_spans(doc_relevances, spans)
    fig = plot_relevances_with_spans(
        fig,
        doc_relevances,
        smoothed_relevances,
        mask,
        threshold=threshold,
        acc_tokens_per_page=acc_tokens_per_page,
        mask_gt=gt_mask,
        row=row,
    )
    return spans, scores


# Create a figure with subplots for each query
fig_queries = make_subplots(
    rows=len(queries),
    cols=1,
    shared_xaxes=False,
    vertical_spacing=0.12,
    subplot_titles=[f"Query: {q}" for q in queries],
)
results_per_query = []
for i, (query, gt) in enumerate(zip(queries, gt_masks)):
    results_per_query.append(process_query(fig_queries, query, gt, row=i + 1))

fig_queries.update_layout(
    height=1024,
    title=dict(y=0.98),
    legend=dict(y=1.04),
    margin=dict(t=100),  # increase top margin if needed
)
display_plot(fig_queries)
In the graph above, we can see that (1) the different queries match very different parts of the document, and (2) the detected spans generally align with the ground truth spans.
Below, you can explore the most relevant text spans identified for each query.
Code
# Let's see the most relevant spans for each query
for i, (query, (text_spans, scores)) in enumerate(zip(queries, results_per_query)):
    # Print all spans and scores for the query
    display_text_from_results(query, text_spans, scores, top_k=3)
Top 3 spans for query: What game makes me reason by weighting objects?
Click to view results
Text for span 1 - Score 0.45
### 7
3) Word Scramble
Help SpongeBob sort out the condiment bottles.
A word is given, then it will be scrambled up. Help
SpongeBob unscramble the word by placing the
bottles in order. Use the left and right arrow buttons
to select a bottle and press the Enter button. Then,
choose a second bottle to swap it with. Repeat until
the word is spelled out correctly.
MATH
There are three activities in this category. They teach counting, addition,
subtraction, and the concept of weight.
4) Weighty Food
There are two scales weighing three different foods.
From the selection of food shown, you will be asked
to find the heaviest or lightest food. Use the left and
right arrow buttons to select the correct food and
press the Enter button to answer.
5) Add Seasoning
Help SpongeBob add seasoning to the Krabby
Patty. A formula is shown, along with five
condiments. Each condiment displays a different
number. Select two condiments that will make
the formula work. Use the left and right buttons to
select the condiments and press the Enter button
to confirm. You can also press a number button to answer.
6) Boating Bubble
Help Patrick drive safely to pick up the Krabby Patty
buns. On the way, Patrick will encounter bubbles
with different numbers. You will be asked to count in
multiples of a certain number to help Patrick along
his way. Steer with the up and down arrow buttons
and catch the correct bubbles as you go.
### 8
LOGIC
There are three activities in this category. They teach memory, logic,
and patterns.
7) Assembling Patties
There are three Krabby Patties on the grill, and each
one has a different pattern. Help SpongeBob keep
track of the Krabby Patty with the given pattern.
Watch carefully as SpongeBob flips the Patties over
and swaps their places. Then, use the left and right
arrow buttons to select the correct Krabby Patty and
press the Enter button to answer.
8) Jellyfishing
There are eight jellyfish on the screen and only two
of them look exactly the same. Watch carefully and
catch the two jellyfish that match each other. Use
the arrow buttons to select and press the Enter
button to confirm.
9) Patty Catch
SpongeBob bumped into the shelf and all the
ingredients are falling down! Help SpongeBob
make some Krabby Patties with the falling
ingredients by following the model on the side of
the screen. Use the left and right arrow buttons to
catch the correct ingredients, and use the down
arrow button to make the ingredients fall faster.
### 9
CREATIVITY AND GAMES
There are three activities in this category
Text for span 2 - Score 0.42
Press a number button to hear the number, or use these buttons to answer
questions in the Math category.
**7. Arrow Buttons**
Press these buttons to make a selection or answer
a question
Top 3 spans for query: The display is too bright, what can I do?
Click to view results
Text for span 1 - Score 0.45
the volume.
13. Contrast Slider
Slide this to the right to darken the screen contrast, or slide to the left to
make the screen contrast lighter
Top 3 spans for query: What type of battery does the laptop use?
Click to view results
Text for span 1 - Score 0.45
||||||
||||||
4 Category 26 Letter
Buttons Buttons
10 Number
Buttons
4 Arrow
Buttons
Repeat
Button
O n/Off
Button
Enter
Button
Cursor Mouse
with 4 Arrows
### 2
### INCLUDED IN THIS PACKAGE
- One VTech [®] SpongeBob Laptop learning toy
- One user’s manual
WARNING: All packing materials, such as tape, plastic sheets,
packaging locks, wire ties and tags are not part of this
toy, and should be discarded for your child’s safety.
Note: Please keep the user manual as it contains important
information.
Unlock the packaging locks:
Rotate the packaging locks
90 degrees counter-clockwise.
Pull out the packaging locks.
### GETTING STARTED
BATTERY INSTALLATION
1. Make sure the unit is OFF.
2. Locate the battery cover on the bottom
of the unit.
3. Open the battery cover.
4. Install three new “AA” (AM-3/LR6)
batteries into the compartment as
illustrated. (The use of new, alkaline
batteries is recommended for
maximum performance.)
5. Replace the battery cover.
### 3
BATTERY NOTICE
- Install batteries correctly observing the polarity (+, -) signs to avoid
leakage.
- Do not mix old and new batteries.
- Do not mix batteries of different types: alkaline, standard (carbon-zinc)
or rechargeable (nickel-cadmium).
- Remove the batteries from the equipment when the unit will not be
used for an extended period of time.
- Always remove exhausted batteries from the equipment.
- Do not dispose of batteries in fire.
- Do not attempt to recharge ordinary batteries.
- The supply terminals are not to be short-circuited.
- Only batteries of the same and equivalent type as recommended
are to be used.
WE DO NOT RECOMMEND THE USE OF RECHARGEABLE BATTERIES.
### PRODUCT FEATURES
1. On/Off Button
To turn the unit on, press the On/Off button. Press the On/Off button
again to turn the unit off.
2. Category Buttons
Press a category button to choose one of the four learning categories.
3. Character Buttons
Press a character button to play a mini-game featuring that character.
### 4
4. Enter Button
Press this button to enter a choice.
5. Letter Buttons
Press a letter button to hear the letter name, or use these buttons to
answer questions in the Word Challenge category.
6. Number Buttons
Press a number button to hear the number, or use these buttons to answer
questions in the Math category.
**7. Arrow Buttons**
Press these buttons to make a selection or answer
a question.
8. Repeat Button
Press this button to hear the last instruction or question repeated.
9. Answer Button
Press this button to reveal the answer.
### 5
10
Text for span 2 - Score 0.42
- Check to see that the batteries are aligned correctly.
### 11
3. If you cannot hear any sound:
- Adjust the volume switch to adjust the sound level of the speaker.
TECHNICAL SUPPORT
If you have a problem that cannot be solved by using this manual,
we encourage you to visit us online or contact our Consumer Services
Department with any problems and/or suggestions that you might have.
A support representative will be happy to assist you.
Before requesting support, please be ready to provide or include the
information below:
- The name of your product or model number (the model number is
typically located on the back or bottom of your product
Text for span 3 - Score 0.41
Screen
Slider
Buttons
Volume
Slider
Demo
Button
Esc
Button
Answer
Button
|or mouse, kids will experience excitement and independent p hey learn.|Col2|Col3|Col4|Col5|
|---|---|---|---|---
Top 3 spans for query: The laptop is not starting, what can I do?
Click to view results
Text for span 1 - Score 0.41
2. If nothing happens when you press the On/Off button:
- Check to see that the batteries are aligned correctly.
### 11
3. If you cannot hear any sound
The drawbacks
This looks amazing! When can I have this in my production RAG system?
Hold your horses! There are some important drawbacks to consider before rushing to production with this approach.
The attention mechanism may be playing tricks on us.
One interesting pattern that shows up across all queries is a rise in relevance toward the end of the document, peaking right at its conclusion.
This is notable because, according to the idea that Qwen3’s unidirectional attention encodes the meaning of earlier tokens into later ones, the model seems to be capturing the document’s meaning in its final tokens. This aligns with how the model has been trained to handle the <|endoftext|> token, which appears at the end of each document.
However, this behavior could limit the general effectiveness of this method for identifying relevant spans. If the model consistently assigns higher relevance to the end of the document, it may skew the results.
To address this, the model would need to learn to encode relevance at the level of individual tokens rather than concentrating it at the end. This reflects a limitation in how the Qwen3 embedding models are currently trained.
In theory, models trained with more fine-grained supervision, like the ColBERT family, shouldn’t have this issue. We’ll explore that in the next section.
The storage requirements are massive.
For our small 12-page document we generated 3,597 vectors, one for each token in the document. If we had used a traditional chunking method with 1,000-token chunks, we would have ended up with just 4 chunks, resulting in only 4 vectors for the entire document. That means we’re generating roughly 900 times more vectors than with a traditional chunking approach.
To put this in perspective using storage size (assuming float16 precision), we would store \(3597 \times 1024 \times 2 = 7366656\) bytes, or approximately 7.37 MB for a single document. In contrast, the traditional approach would use \(4 \times 1024 \times 2 = 8192\) bytes, or just 8 KB.
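The same back-of-the-envelope arithmetic in code:

n_tokens, hidden_dim, fp16_bytes = 3597, 1024, 2

per_token_storage = n_tokens * hidden_dim * fp16_bytes  # 7,366,656 bytes ≈ 7.37 MB
chunked_storage = 4 * hidden_dim * fp16_bytes           # 8,192 bytes = 8 KB

print(f"Per-token: {per_token_storage / 1e6:.2f} MB")
print(f"Chunked:   {chunked_storage / 1e3:.2f} KB")
print(f"Overhead:  {per_token_storage / chunked_storage:.0f}x")  # ≈ 900x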
Techniques like quantization can help reduce vector size. However, it might be more effective to shift our attention to models specifically trained to output compact, token-level embeddings, such as the ones taking a ColBERT approach. Let’s see how they perform in the next section.
Comparison to Late-Interaction Embedding Models (ColBERT)
Late-interaction models such as ColBERT are embedding models that represent queries and documents using token-level embeddings, and they score relevance using the MaxSim metric. These models seem well-suited for span-level matching, as they are designed to output a vector for each token.
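For reference, here is a minimal sketch of MaxSim (the ranking metric these models are scored with, not the per-token aggregation we use later): each query token is matched against its most similar document token, and those best matches are summed.

import torch


def maxsim(query_embeds: torch.Tensor, doc_embeds: torch.Tensor) -> float:
    """MaxSim late-interaction score between one query and one document."""
    sim = query_embeds @ doc_embeds.T          # (n_query_toks, n_doc_toks)
    return sim.max(dim=1).values.sum().item()  # best doc token per query token, summed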
Initially, I experimented with two popular ColBERT variants that support large contexts: jina-colbert-v2 from Jina and GTE-ModernColBERT-v1 from LightOn AI. The original ColBERT and ColBERT v2.0 were not tested due to their limited context length.
These models produced mixed results, underperforming compared to the Qwen3 model when using the same methodology.
However, while writing this post, a new and more powerful embedding model was released: Jina Embeddings v4. This multimodal embedding model is built on the Qwen2.5 VL 3B architecture and has been trained to generate both a single vector for an entire text and individual vectors for each token.
This section explores the results of applying this model using the same approach described above.
Loading the Jina Embeddings v4 model
Code
import torch
from transformers import AutoModel

# Initialize the model
colbert_model = AutoModel.from_pretrained(
    "jinaai/jina-embeddings-v4", trust_remote_code=True, torch_dtype=torch.float16
)
# Set verbosity to 0 to suppress tqdm
colbert_model.base_model.model.verbosity = 0
colbert_model.to(DEVICE)
print("Jina Embeddings v4 model loaded successfully!")
Embedding the document and query with Jina Embeddings v4
Code
from typing import Any

import torch


@torch.inference_mode()
def embed_colbert(
    model: Any, text: str, is_query: bool
) -> tuple[list[str], torch.Tensor, torch.Tensor]:
    prompt_name = "query" if is_query else "passage"
    task = "retrieval"
    encode_kwargs = model._validate_encoding_params(
        truncate_dim=None, prompt_name=prompt_name
    )
    tokenized_outputs = model.processor(
        text=f"{encode_kwargs['prefix']}: {text}",
        return_tensors="pt",
        return_offsets_mapping=True,
    )
    text_toks = [
        model.processor.decode(token_id, skip_special_tokens=False)
        for token_id in tokenized_outputs["input_ids"].squeeze().tolist()
    ]
    # Encode using the model
    embeds = model.encode_text(
        text,
        return_multivector=True,
        return_numpy=True,
        prompt_name=prompt_name,
        task=task,
    )
    return text_toks, tokenized_outputs["offset_mapping"].squeeze(), embeds


doc_toks_col, doc_offsets_col, doc_embeds_col = embed_colbert(
    colbert_model, text, is_query=False
)
# Remove the prefix tokens
doc_toks_col = doc_toks_col[3:]
doc_offsets_col = doc_offsets_col[3:]
doc_embeds_col = doc_embeds_col[3:]
print(f"Number of tokens: {len(doc_offsets_col)}")
print(f"Document embedding shape: {tuple(doc_embeds_col.shape)}")
assert doc_embeds_col.shape[0] == doc_offsets_col.shape[0], (
    "Text and offsets should have the same number of tokens ",
    doc_embeds_col.shape,
    doc_offsets_col.shape,
)

query = "What game makes me reason by weighting objects?"
query_toks_col, _, query_embeds_col = embed_colbert(colbert_model, query, is_query=True)
print(f"Query text tokens: {query_toks_col}")

relevances_col = torch.Tensor(query_embeds_col @ doc_embeds_col.T)
print(f"Relevance matrix shape: {list(relevances_col.shape)}")
acc_tokens_per_page_col = map_offsets_to_pages(doc_offsets_col, acc_chars_per_page)
fig = create_heatmap(
    relevances_col, query_toks_col, doc_toks_col, acc_tokens_per_page_col
)
fig.update_layout(
    title="Heatmap for relevance between query and document tokens using ColBERT.",
    height=670,
)
display_plot(fig)
Since we no longer have access to the <|endoftext|> token (which previously allowed us to extract a relevance score for the entire query), we need to aggregate token-level relevance scores into a single relevance score vector.
I compared two aggregation methods to achieve this:
Maximum: For each document token, take the maximum relevance across all query tokens (see the sketch after this list). This aligns with the MaxSim metric commonly used in relevance computations for these models and intuitively captures the strongest signal among the query tokens.
Single query vector: This model is trained to produce a single embedding for an entire input. Using that vector to represent the query is conceptually similar to how the <|endoftext|> token is used in the Qwen3 model.
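As a minimal sketch of the first method (this mirrors the process_query_col code further below):

# Collapse the (n_query_tokens, n_doc_tokens) relevance matrix into one score
# per document token: for each document token, keep its maximum similarity
# over all query tokens.
doc_relevances_col = torch.amax(relevances_col, dim=0)  # shape: (n_doc_tokens,)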
The resulting relevance signals from these methods are compared to the Qwen3 model’s signal in the following plot.
Additionally, we compute the F1 Score between the gold relevance span and the spans identified by each model and aggregation method.
In this context, the F1 Score measures the overlap between the predicted and gold spans, combining precision (avoiding extra tokens) and recall (capturing all relevant tokens) into a single metric ranging from 0 to 1.
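As a minimal sketch of how such a token-level F1 can be computed over boolean masks like the ones built earlier (the helper name span_f1 is illustrative, not from the original code):

import numpy as np


def span_f1(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Token-level F1 between a predicted span mask and a gold span mask."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    if tp == 0:
        return 0.0
    precision = tp / pred_mask.sum()
    recall = tp / gt_mask.sum()
    return float(2 * precision * recall / (precision + recall))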
In the comparison, we can observe that the Jina model displays a clean relevance signal when using the maximum aggregation, with a peak at the ground truth span.
The F1 scores corroborate this: the maximum aggregation achieves a strong F1 score, while Qwen3’s score is a fair bit lower, reflecting its lower precision.
This all looks promising, but let’s check out the results for the other queries.
Code
from plotly.subplots import make_subplots


def process_query_col(fig, query, gt_mask, row):
    query_text_toks, _, query_embeds = embed_colbert(
        colbert_model, query, is_query=True
    )
    relevances = torch.Tensor(query_embeds @ doc_embeds_col.T)
    # For each document token, take the maximum relevance over all query tokens
    doc_relevances = torch.amax(relevances, dim=0)
    # doc_relevances = relevances[-1]
    smoothed_relevances = gaussian_filter1d(
        doc_relevances.cpu().numpy(), sigma=4, mode="nearest", order=0
    )
    threshold = 0.39
    spans, scores = detect_spans(
        torch.tensor(smoothed_relevances),
        doc_toks_col,
        threshold=threshold,
        delta=170,
        min_span_size=15,
    )
    mask = mask_from_spans(doc_relevances, spans)
    fig = plot_relevances_with_spans(
        fig,
        doc_relevances,
        smoothed_relevances,
        mask,
        threshold=threshold,
        acc_tokens_per_page=acc_tokens_per_page_col,
        mask_gt=gt_mask,
        row=row,
    )
    return spans, scores


# Create a figure with subplots for each query
fig_queries = make_subplots(
    rows=len(queries),
    cols=1,
    shared_xaxes=False,
    vertical_spacing=0.12,
    subplot_titles=[f"Query: {q}" for q in queries],
)
results_per_query = []
for i, (query, gt) in enumerate(zip(queries, gt_masks)):
    results_per_query.append(process_query_col(fig_queries, query, gt, row=i + 1))

fig_queries.update_layout(
    height=1024,
    title=dict(
        text="Token relevances and ground truth spans for Jina Embeddings v4 (ColBERT) model.",
        y=0.98,
    ),
    legend=dict(y=1.04),
    margin=dict(t=100),  # increase top margin if needed
)
display_plot(fig_queries)
As shown in the graph, the relevance scores for the second query are noticeably lower than those for the others.
If this were a common issue, it would be a major drawback for using the model in production, since it would make it hard to set a single threshold that works well across all queries. Even if this is just a one-off case with this particular query, it still shows there’s room for improvement in how this approach is applied to both this model and ColBERT models more generally.
That said, the Jina Embeddings v4 model looks promising. It also solves the storage problem we saw with the Qwen3 model by using 128-dimensional token embeddings instead of 1024-dimensional ones. That makes it a more practical option for this kind of task.
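Assuming a similar token count for the Jina tokenizer (~3.6k tokens for our document; the exact count differs between tokenizers), the earlier storage estimate shrinks accordingly:

n_tokens, fp16_bytes = 3597, 2

qwen3_storage = n_tokens * 1024 * fp16_bytes  # ≈ 7.37 MB
jina_storage = n_tokens * 128 * fp16_bytes    # ≈ 0.92 MB, an 8x reduction

print(f"Qwen3: {qwen3_storage / 1e6:.2f} MB, Jina v4: {jina_storage / 1e6:.2f} MB")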
Conclusion
In short, dense embedding models hide far richer signals than their single-vector interfaces suggest: by skipping pooling and scoring every query-document token pair, Qwen3 can surface span-level matches that a light Gaussian-smoothing and peak-clustering step turns into tidy, high-recall snippets, pointing toward a truly "chunk-less" RAG pipeline.
There are still trade-offs. Qwen3's 1024-dimensional token vectors balloon storage and sometimes bias relevance toward the document tail. Jina Embeddings v4 trims those vectors to 128 dimensions and, in most tests, edges out Qwen3 on F1 while slashing disk requirements. Yet its signal dipped noticeably on one of the four benchmark queries, showing that thresholds can be brittle and that the late-interaction recipe still has rough edges to sand down.
Bottom line: chunk-less, span-first retrieval has moved from sci-fi to the staging server; it just needs a bit more tuning to hit production.
This concludes the article! For any questions or comments, feel free to comment on the related LinkedIn post or DM me directly at linkedin.com/in/carlesonielfa.