Stack Overflow tags

Stack Overflow is a community-based question-and-answer website for computer programmers. Users post questions to get help on a wide range of technical areas, and experts provide answers. A system of votes and reputation points is used to incentivize high-quality content.

To help experts identify questions matching their expertise, a set of tags can be specified. Unlike taxonomy terms, tags are not fixed by a central authority. They are therefore diverse, and their number usually grows over time, as dedicated tags may emerge to account for one-time events. When choosing tags, askers face a trade-off between precision and coverage, typically resolved by combining broad and narrow terms recognized by the community, a practice that encourages tags to co-occur.

In this example, we seek to learn embeddings of such tags. While one could argue about the relevance of tag order within a question (e.g. is it from broader to narrower?), we follow the simple assumption that order (and therefore distance) does not matter in a set of tags. This allows us to apply itembed to define and train an unsupervised task that leverages tag co-occurrences in questions.
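
To make this assumption concrete, the short sketch below (plain Python, using one of the questions from the dataset) enumerates the unordered co-occurrence pairs that a single question contributes; pairs like these drive the unsupervised task defined below.

from itertools import combinations

# One question's tags; any permutation of this list yields exactly the
# same set of unordered pairs.
tags = ["c#", "linq", "web-services", ".net-3.5"]

for pair in combinations(sorted(tags), 2):
    print(pair)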

In [1]:
import numpy as np
import pandas as pd

import umap

from bokeh.plotting import ColumnDataSource, figure, show
from bokeh.io import output_notebook
from bokeh.models import LinearColorMapper, SingleIntervalTicker

from itembed import (
    pack_itemsets,
    initialize_syn,
    UnsupervisedTask,
    train,
    normalize,
)
In [2]:
# Set up Bokeh renderer
output_notebook()
Out [2]:
Loading BokehJS ...

Data acquisition

Stack Overflow, as part of the Stack Exchange network, provides an SQL endpoint through the Stack Exchange Data Explorer. Tags are stored as a single string of the form <tag1><tag2>…, so counting occurrences of < counts the tags. The first one million questions with at least 4 tags can be extracted using the following query:

SELECT Id, Tags
FROM Posts
WHERE LEN(Tags) - LEN(REPLACE(Tags, '<', '')) >= 4
ORDER BY Id
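
As a sanity check, the counting trick in the WHERE clause is easy to reproduce in Python on a raw tag string:

# Tags are stored as a single string like "<c#><linq><web-services><.net-3.5>",
# so counting "<" counts the tags themselves.
raw = "<c#><linq><web-services><.net-3.5>"
assert raw.count("<") == 4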

Note

As described in the documentation, the data is released under CC BY-SA 4.0. More information about licensing of content posted on Stack Overflow is available here.

In [3]:
# Load query output
tag_df = pd.read_csv("./data/stackoverflow/tags.csv")
tag_df
Out [3]:
id tags
0 4 c#;floating-point;type-conversion;double;decimal
1 11 c#;datetime;time;datediff;relative-time-span
2 13 html;browser;timezone;user-agent;timezone-offset
3 16 c#;linq;web-services;.net-3.5
4 17 mysql;database;binary-data;data-storage
... ... ...
999995 11810477 java;jsp;tomcat;ant
999996 11810479 architecture;content-management-system;liferay...
999997 11810489 c++;actionscript-3;network-programming;winsock2
999998 11810501 c#;asp.net-mvc;razor;foreach
999999 11810505 php;xml;xml-parsing;simplexml

1000000 rows × 2 columns

Training embeddings

In this simple scenario, there is a single domain: all items are tags. We need to encode our itemsets as packed arrays, the format expected by itembed. Note that we discard any tag that appears fewer than 10 times in the whole dataset.
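
For intuition, the packed format might look as follows on a toy input; the exact values below are illustrative assumptions, not a guaranteed layout of the library.

# Two toy itemsets, packed with min_count=1 (illustrative values only):
#
#   itemsets = [["a", "b"], ["b", "c", "a"]]
#   labels   = ["a", "b", "c"]     vocabulary of retained tags
#   indices  = [0, 1, 1, 2, 0]     all itemsets flattened into label indices
#   offsets  = [0, 2, 5]           boundaries of each itemset within indices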

In [4]:
# Get tags as a list of lists of strings
itemsets = tag_df["tags"].str.split(";").to_list()
In [5]:
# Pack itemsets into contiguous arrays
labels, indices, offsets = pack_itemsets(
    itemsets,
    min_count=10,
    min_length=2,
)
num_label = len(labels)
num_label
Out [5]:
14336

Only one unsupervised task is needed. The associated embedding sets must be initialized, typically with values close (but not equal) to zero.

Tip

The choice of embedding size is arbitrary. It should be large enough to provide sufficient degrees of freedom, capturing semantic nuances. However, it should not be so large that it incurs excessive computational costs or risks overfitting. Typically, sizes range from 50 to 300 dimensions, balancing detail and computational efficiency.
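
For intuition only, a hand-rolled initialization in the same spirit might look like the sketch below; this is an assumption about the general approach, not initialize_syn's actual scheme.

# Small uniform values centered on zero, scaled down by the dimension, so
# that initial dot products are near (but not equal to) zero.
rng = np.random.default_rng(0)
syn_manual = rng.uniform(-0.5, 0.5, size=(14336, 64)) / 64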

In [6]:
# Initialize embedding sets from a uniform distribution
num_dimension = 64
syn0 = initialize_syn(num_label, num_dimension)
syn1 = initialize_syn(num_label, num_dimension)
In [7]:
# Define unsupervised task, i.e. using co-occurrences
task = UnsupervisedTask(indices, offsets, syn0, syn1, num_negative=5)
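
The num_negative argument controls negative sampling: each observed pair of co-occurring tags is contrasted against a few randomly drawn tags. Up to notation, the standard skip-gram-with-negative-sampling objective maximized for a positive pair $(i, j)$ with $K$ negatives $n_1, \dots, n_K$ is

$$\log \sigma(u_i^\top v_j) + \sum_{k=1}^{K} \log \sigma(-u_i^\top v_{n_k})$$

where $u$ and $v$ denote rows of syn0 and syn1, $\sigma$ is the logistic function, and here $K = 5$; see the mathematical details for itembed's exact formulation.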

Finally, the model can be trained for a few epochs, which updates the embedding sets in place.

In [8]:
# Do training
train(task, num_epoch=100)
Out [8]:
100%|██████████████████████████████████████████████████████████████████████| 1562400/1562400 [05:47<00:00, 4493.71it/s]
In [9]:
# Both embedding sets are equivalent; just choose one of them
syn = syn0

Similarity measure

As covered in the mathematical details, itembed essentially optimizes the embeddings to maximize the dot product of observed pairs. This is closely related to cosine similarity, which merely adds a normalization step. We can exploit this to find the nearest neighbors of a given tag in the latent space.
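
This equivalence is easy to verify directly on any pair of embedding vectors:

u, v = syn[0], syn[1]
cosine = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
dot_normalized = (u / np.linalg.norm(u)) @ (v / np.linalg.norm(v))
assert np.isclose(cosine, dot_normalized)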

In [10]:
# Cosine similarity is equivalent to dot product with normalized vectors
syn_normalized = normalize(syn)
In [11]:
# Closest tags to "java"
i = labels.index("java")
similarities = syn_normalized @ syn_normalized[i]
for j in np.argsort(-similarities)[:10]:
    print(f"#{j} {labels[j]}: {similarities[j]}")
Out [11]:
#1 java: 0.9999999403953552
#2787 classcastexception: 0.6866645812988281
#6002 scjp: 0.6859150528907776
#3520 java-6: 0.6858662962913513
#13000 parseexception: 0.6839389801025391
#709 nullpointerexception: 0.6838805675506592
#11738 bluej: 0.6777161359786987
#93 jakarta-ee: 0.6611512303352356
#12822 modelandview: 0.654578685760498
#12016 worldwind: 0.6532822847366333

Display latent space

While listing the closest tags is useful and shows to some extent the value of such a representation, it is limited in scope. To get a bigger picture, dimensionality reduction techniques such as t-SNE [1] or UMAP [2] can be applied. Furthermore, we can overlay additional information, for example the similarities computed just before.

In [12]:
# Project with UMAP, using cosine similarity measure
model = umap.UMAP(metric="cosine")
projection = model.fit_transform(syn)
In [13]:
# Wrap relevant information as frame
df = pd.DataFrame()
df["tag"] = labels
df[["x", "y"]] = projection
df["similarity"] = similarities
df
Out [13]:
tag x y similarity
0 c# 4.289159 7.173223 0.358225
1 java 0.924450 3.678328 1.000000
2 javascript 4.935195 9.128860 0.347055
3 php 1.192827 8.273848 0.374358
4 jquery 4.904626 9.344658 0.265963
... ... ... ... ...
14331 node-mysql 0.377622 6.455106 0.439549
14332 kivy 6.896026 4.613998 0.429449
14333 wifi-direct 5.147499 5.774789 0.358921
14334 quartz.net-2.0 2.977240 5.269260 0.313900
14335 umbraco5 2.462544 9.421915 0.187395

14336 rows × 4 columns

In [14]:
# Create figure with tooltips
p = figure(
    sizing_mode="stretch_width",
    match_aspect=True,
    height=600,
    tooltips=[("tag", "@tag")],
)

# Use color gradient based on the similarity computed above
cmap = LinearColorMapper(
    palette="Sunset3",
    low=df["similarity"].quantile(0.9),
    high=df["similarity"].quantile(0.1),
)

# Show the whole latent space
source = ColumnDataSource(df)
p.circle(
    "x",
    "y",
    source=source,
    line_color=None,
    fill_color={"field": "similarity", "transform": cmap},
    alpha=0.5,
)

# Clean up the grid a bit
ticker = SingleIntervalTicker(interval=5)
p.xaxis.ticker = ticker
p.yaxis.ticker = ticker
p.grid.visible = False

# Finally, render dynamic plot
show(p)
Out [14]:
(Interactive Bokeh scatter plot of the UMAP projection, colored by similarity to "java".)

  1. Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.

  2. Leland McInnes, John Healy, and James Melville. UMAP: uniform manifold approximation and projection for dimension reduction. 2020. arXiv:1802.03426