Stack Overflow tags#
Stack Overflow is a community-based question-and-answer website for computer programmers. Users post questions to get help on a wide range of technical areas, and experts provide answers. A system of votes and reputation points is used to incentivize high-quality content.
To help experts identify questions matching their areas of expertise, a set of tags can be specified. Unlike taxonomy terms, tags are not fixed by a central authority. Hence, they tend to be diverse, and their number usually grows over time, as dedicated tags may emerge to account for one-time events. When choosing tags, askers effectively face a trade-off between precision and coverage, typically solved by using both broad and narrow terms that are recognized by the community, a practice that incentivizes tag co-optation.
In this example, we seek to get embeddings of such tags.
While we could argue about the relevance of the order of tags in a question (e.g. is it from broader to narrower?), we follow the simple assumption that order (and therefore distance) does not matter in a set of tags. This allows us to apply itembed to define and train an unsupervised task, leveraging tag co-occurrences in questions.
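Concretely, treating a question's tags as an unordered set means that every pair of distinct tags appearing in the same question is a positive co-occurrence example. A minimal sketch of this pair extraction (illustrative only; itembed handles this internally):

```python
from itertools import combinations

def cooccurrence_pairs(itemset):
    """Yield every unordered pair of distinct tags in one itemset."""
    return list(combinations(sorted(set(itemset)), 2))

# Each pair is one positive training example for the unsupervised task
pairs = cooccurrence_pairs(["c#", "linq", "web-services", ".net-3.5"])
# 4 tags yield 6 unordered pairs
```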
import numpy as np
import pandas as pd
import umap
from bokeh.plotting import ColumnDataSource, figure, show
from bokeh.io import output_notebook
from bokeh.models import LinearColorMapper, SingleIntervalTicker
from itembed import (
pack_itemsets,
initialize_syn,
UnsupervisedTask,
train,
normalize,
)
Data acquisition#
Stack Overflow, as part of the Stack Exchange network, provides an SQL endpoint through the Stack Exchange Data Explorer, from which the first one million questions with at least 4 tags were extracted.
Note
As described in the documentation, the data is released under CC BY-SA 4.0.
| | id | tags |
|---|---|---|
| 0 | 4 | c#;floating-point;type-conversion;double;decimal |
| 1 | 11 | c#;datetime;time;datediff;relative-time-span |
| 2 | 13 | html;browser;timezone;user-agent;timezone-offset |
| 3 | 16 | c#;linq;web-services;.net-3.5 |
| 4 | 17 | mysql;database;binary-data;data-storage |
| ... | ... | ... |
| 999995 | 11810477 | java;jsp;tomcat;ant |
| 999996 | 11810479 | architecture;content-management-system;liferay... |
| 999997 | 11810489 | c++;actionscript-3;network-programming;winsock2 |
| 999998 | 11810501 | c#;asp.net-mvc;razor;foreach |
| 999999 | 11810505 | php;xml;xml-parsing;simplexml |
1000000 rows × 2 columns
Training embeddings#
In this simple scenario, there is a single domain: all items are tags. We need to encode our itemsets as packed arrays, the format expected by itembed. Note that we discard any tag that appears fewer than 10 times in the whole dataset.
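To make the packed format concrete, here is a toy re-implementation for exposition (not itembed's actual code; it skips the minimum-count filter): all itemsets are concatenated into one flat index array, with an offsets array marking where each itemset ends.

```python
import numpy as np

def pack(itemsets):
    # Build a vocabulary: one integer index per distinct label
    labels = sorted({tag for itemset in itemsets for tag in itemset})
    index_of = {tag: i for i, tag in enumerate(labels)}
    # Concatenate all itemsets into one flat index array...
    indices = np.array([index_of[tag] for itemset in itemsets for tag in itemset])
    # ...and record where each itemset ends
    offsets = np.cumsum([len(itemset) for itemset in itemsets])
    return labels, indices, offsets

labels, indices, offsets = pack([["java", "jsp"], ["java", "ant", "tomcat"]])
# labels  -> ['ant', 'java', 'jsp', 'tomcat']
# indices -> [1, 2, 1, 0, 3]
# offsets -> [2, 5], i.e. itemset i spans indices[offsets[i-1]:offsets[i]]
```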
# Pack itemsets into contiguous arrays
labels, indices, offsets = pack_itemsets(
itemsets,
min_count=10,
min_length=2,
)
num_label = len(labels)
num_label
Only one unsupervised task is needed. The associated embedding sets must be initialized, typically close (but not equal) to zero.
Tip
The choice of embedding size is arbitrary. It should be large enough to provide sufficient degrees of freedom, capturing semantic nuances. However, it should not be so large that it incurs excessive computational costs or risks overfitting. Typically, sizes range from 50 to 300 dimensions, balancing detail and computational efficiency.
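For intuition, a plausible way to draw such near-zero values is a uniform distribution scaled by the dimensionality; whether initialize_syn uses this exact scheme is an assumption, but the idea is the same: break symmetry with small random values.

```python
import numpy as np

def init_embeddings(num_label, num_dimension, seed=0):
    # Small uniform values break symmetry without saturating early updates
    rng = np.random.default_rng(seed)
    scale = 0.5 / num_dimension
    return rng.uniform(-scale, scale, size=(num_label, num_dimension)).astype(np.float32)

syn = init_embeddings(1000, 64)
# One 64-dimensional row per label, all entries close to zero
```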
# Initialize embeddings sets from uniform distribution
num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)
# Define unsupervised task, i.e. using co-occurrences
task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)
Finally, the model can be trained for a few epochs with the train function, which updates the embedding sets in place.
100%|██████████████████████████████████████████████████████████████████████| 1562400/1562400 [05:47<00:00, 4493.71it/s]
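For intuition, each of those iterations performs a skip-gram-style update with negative sampling: the dot product of an observed tag pair is pushed up, while dot products with a few randomly drawn labels are pushed down. A simplified numpy sketch of such an update (illustrative only, not itembed's internals; the learning rate and negative samples are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
num_label, num_dimension, learning_rate = 100, 16, 0.1

# Two embedding sets, initialized close to zero as above
syn0 = rng.uniform(-0.01, 0.01, (num_label, num_dimension))
syn1 = rng.uniform(-0.01, 0.01, (num_label, num_dimension))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(left, right, negatives):
    """One gradient step: attract the observed pair, repel the negatives."""
    gradient = np.zeros(num_dimension)
    for j, target in ((right, 1.0), *((n, 0.0) for n in negatives)):
        # Binary logistic loss on the dot product of the (left, j) pair
        g = (target - sigmoid(syn0[left] @ syn1[j])) * learning_rate
        gradient += g * syn1[j]
        syn1[j] += g * syn0[left]
    syn0[left] += gradient

before = syn0[3] @ syn1[7]
for _ in range(20):
    sgns_step(3, 7, negatives=[10, 20, 30, 40, 50])
after = syn0[3] @ syn1[7]
# The observed pair's dot product increases after the updates
```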
Similarity measure#
As covered in the mathematical details, itembed
is essentially optimizing the embeddings to maximize the dot product of observed pairs.
This is closely related to the cosine similarity, which adds an additional normalization term.
We can exploit this to find nearest neighbors of a given tag in this latent space.
# Use one embedding set as the item representation
# (both sets live in the same latent space; either works)
syn = syn0

# Cosine similarity is equivalent to dot product with normalized vectors
syn_normalized = normalize(syn)
# Closest tags to "java"
i = labels.index("java")
similarities = syn_normalized @ syn_normalized[i]
for j in np.argsort(-similarities)[:10]:
print(f"#{j} {labels[j]}: {similarities[j]}")
#1 java: 0.9999999403953552
#2787 classcastexception: 0.6866645812988281
#6002 scjp: 0.6859150528907776
#3520 java-6: 0.6858662962913513
#13000 parseexception: 0.6839389801025391
#709 nullpointerexception: 0.6838805675506592
#11738 bluej: 0.6777161359786987
#93 jakarta-ee: 0.6611512303352356
#12822 modelandview: 0.654578685760498
#12016 worldwind: 0.6532822847366333
Display latent space#
While listing the closest tags is useful, and shows to some extent the value of such a representation, it is limited in scope. To get a bigger picture, dimensionality reduction techniques may be applied, such as t-SNE1 or UMAP2. Furthermore, we can overlay some additional information, for example the similarity measured just before.
# Project with UMAP, using cosine similarity measure
model = umap.UMAP(metric="cosine")
projection = model.fit_transform(syn)
# Wrap relevant information as frame
df = pd.DataFrame()
df["tag"] = labels
df[["x", "y"]] = projection
df["similarity"] = similarities
df
| | tag | x | y | similarity |
|---|---|---|---|---|
| 0 | c# | 4.289159 | 7.173223 | 0.358225 |
| 1 | java | 0.924450 | 3.678328 | 1.000000 |
| 2 | javascript | 4.935195 | 9.128860 | 0.347055 |
| 3 | php | 1.192827 | 8.273848 | 0.374358 |
| 4 | jquery | 4.904626 | 9.344658 | 0.265963 |
| ... | ... | ... | ... | ... |
| 14331 | node-mysql | 0.377622 | 6.455106 | 0.439549 |
| 14332 | kivy | 6.896026 | 4.613998 | 0.429449 |
| 14333 | wifi-direct | 5.147499 | 5.774789 | 0.358921 |
| 14334 | quartz.net-2.0 | 2.977240 | 5.269260 | 0.313900 |
| 14335 | umbraco5 | 2.462544 | 9.421915 | 0.187395 |
14336 rows × 4 columns
# Create figure with tooltips
p = figure(
sizing_mode="stretch_width",
match_aspect=True,
height=600,
tooltips=[("tag", "@tag")],
)
# Use color gradient based on the similarity computed above
cmap = LinearColorMapper(
palette="Sunset3",
low=df["similarity"].quantile(0.9),
high=df["similarity"].quantile(0.1),
)
# Show the whole latent space
source = ColumnDataSource(df)
p.circle(
"x",
"y",
source=source,
line_color=None,
fill_color={"field": "similarity", "transform": cmap},
alpha=0.5,
)
# Clean a bit the grid
ticker = SingleIntervalTicker(interval=5)
p.xaxis.ticker = ticker
p.yaxis.ticker = ticker
p.grid.visible = False
# Finally, render dynamic plot
show(p)
-
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008. ↩
-
Leland McInnes, John Healy, and James Melville. UMAP: uniform manifold approximation and projection for dimension reduction. 2020. arXiv:1802.03426. ↩