A Data Science Approach To Optimizing Internal Link Structure


This article explores how internal link structure can be optimized with a data science approach. It discusses some of the more common concerns that marketers may have when implementing such an approach and provides concrete examples to illustrate it, before concluding with potential next steps for applying data science in this area.



If you want your web pages to rank for their target keywords, you’ll need to optimize the internal linking. Internal linking refers to the links between pages on your own website, i.e., pages that receive links from other pages on the same site.

This is significant because it is partly on this basis that Google and other search engines determine a page’s value relative to other pages on your website.

It also affects how likely users are to discover content on your site, and content discovery is the basis of Google’s PageRank algorithm.

Today, we’re looking at a data-driven strategy for optimizing a website’s internal linking for more effective technical site SEO. The aim is to ensure that internal domain authority is distributed in line with the site structure.

Using Data Science to Improve Internal Link Structures

Our data-driven method will concentrate on just one component of internal link architecture optimization: modeling the distribution of internal links by site depth and then focusing on the pages that are lacking links for their particular site depth.


Before examining the data, we first import the libraries and data, tidying up the column names:

import pandas as pd
import numpy as np

site_name = 'ON24'
site_filename = 'on24'
website = 'www.on24.com'

# import the crawl data exported from Sitebulb
crawl_data = pd.read_csv('data/' + site_filename + '_crawl.csv')

# tidy up the column names (regex=False treats the characters literally)
crawl_data.columns = crawl_data.columns.str.replace(' ', '_', regex=False)
crawl_data.columns = crawl_data.columns.str.replace('.', '', regex=False)
crawl_data.columns = crawl_data.columns.str.replace('(', '', regex=False)
crawl_data.columns = crawl_data.columns.str.replace(')', '', regex=False)
crawl_data.columns = map(str.lower, crawl_data.columns)

print(crawl_data.shape)
print(crawl_data.dtypes)

(8611, 104)
url                          object
base_url                     object
crawl_depth                  object
crawl_status                 object
host                         object
                              ...
redirect_type                object
redirect_url                 object
redirect_url_status          object
redirect_url_status_code     object
Length: 104, dtype: object

Screenshot of the imported crawl data and column dtypes, Andreas Voniatis, November 2021

A sample of the data loaded from the Sitebulb desktop crawler is shown above. There are just over 8,600 rows, and not all of them will be unique to the site, since resource URLs and external outbound link URLs are also included.

We also have more than 100 columns, most of which aren’t essential, so some column selection will be necessary.


However, before we get into it, let’s have a look at how many site levels there are:
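A minimal sketch of how to produce this breakdown, assuming the cleaned-up crawl_depth column from the import step above, is a quick Pandas value_counts():

# count URLs at each site level; sort_index() orders by the level labels
# note: crawl_depth is still a string here, so the ordering is alphabetical, not numeric
print(crawl_data['crawl_depth'].value_counts().sort_index())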

crawl_depth
0             1
1            70
10            5
11            1
12            1
13            2
14            1
2           303
3           378
4           347
5           253
6           194
7            96
8            33
9            19
Not Set    2351
dtype: int64

As can be seen from the above, there are 14 site levels, and most URLs fall under “Not Set,” meaning they sit in the XML sitemap rather than the crawlable site architecture.

You may have noticed from the output of Pandas (the Python library used here for data manipulation) that the site levels are ordered by digit rather than numerically.

That is because the site levels are currently character strings rather than numbers. This will be fixed in the code that follows, since it affects the data visualization (“viz”).

We’ll now filter the rows and choose the columns.

# filter for live (2xx) URLs and select the columns needed for the analysis
redir_live_urls = crawl_data[['url', 'crawl_depth', 'http_status_code', 'indexable_status',
                              'no_internal_links_to_url', 'host', 'title']]
redir_live_urls = redir_live_urls.loc[redir_live_urls.http_status_code.str.startswith(('2'), na=False)]

# make crawl_depth an ordered category so the site levels sort correctly in charts
redir_live_urls['crawl_depth'] = redir_live_urls['crawl_depth'].astype('category')
redir_live_urls['crawl_depth'] = redir_live_urls['crawl_depth'].cat.reorder_categories(
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', 'Not Set'])

# keep only URLs on the target domain, then drop the host column
redir_live_urls = redir_live_urls.loc[redir_live_urls.host == website]
del redir_live_urls['host']

print(redir_live_urls.shape)
redir_live_urls

(4055, 6)

Screenshot of the filtered data frame, Andreas Voniatis, November 2021

By filtering rows for indexable URLs and selecting the necessary fields, we now have a more streamlined data frame (think of it as the Pandas version of a spreadsheet tab).

Investigating Internal Link Distribution

We can now use data visualization to get a sense of how internal links are distributed across the site and by site depth.

# plotnine provides the ggplot grammar used for the charts below
from plotnine import *
import matplotlib.pyplot as plt

pd.set_option('display.max_colwidth', None)
%matplotlib inline

# distribution of internal links to URLs across the whole site
ove_intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'no_internal_links_to_url')) +
                        geom_histogram(fill = 'blue', alpha = 0.6, bins = 7) +
                        labs(y = '# Internal Links to URL') +
                        theme_classic() +
                        theme(legend_position = 'none')
                       )
ove_intlink_dist_plt

Histogram of internal links per URL, Andreas Voniatis, November 2021

We can see from the above that most URLs have few or no internal links, so strengthening the internal linking would be a great opportunity to improve the SEO.

Let’s look at some site-level statistics.
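The exact code for this summary isn’t reproduced here, but as a sketch (assuming no_internal_links_to_url is numeric in the crawl export, and using crawl_depth_stats purely as an illustrative variable name), the per-level statistics described below can be computed with a Pandas groupby:

# count, mean, median and standard deviation of internal links per URL, by site level
crawl_depth_stats = (redir_live_urls
                     .groupby('crawl_depth')['no_internal_links_to_url']
                     .agg(['count', 'mean', 'median', 'std'])
                     .reset_index())
print(crawl_depth_stats)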


Table of internal link summary statistics (count, mean, median, standard deviation) by site level, Andreas Voniatis, November 2021

The table above displays the average (mean) and median (50th percentile) number of internal links by site level.

It also shows the variation within each site level (std, for standard deviation), which indicates how close the pages within a level are to the average; in other words, how consistent the internal link distribution is around the average.

With the exception of the home page (crawl depth 0) and the first level pages (crawl depth 1), we may deduce that the average by site-level runs from 0 to 4 per URL.

For a more visual approach, consider the following:

# distribution of internal links to URLs by site level
intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'no_internal_links_to_url')) +
                    geom_boxplot(fill = 'blue', alpha = 0.8) +
                    labs(y = '# Internal Links to URL', x = 'Site Level') +
                    theme_classic() +
                    theme(legend_position = 'none')
                   )
intlink_dist_plt.save(filename = 'images/1_intlink_dist_plt.png', height=5, width=5, units = 'in', dpi=1000)
intlink_dist_plt

Box plot of internal links per URL by site level, Andreas Voniatis, November 2021

The graph above confirms our earlier observation that the home page and the pages linked directly from it receive the majority of the internal links.


With the scales as they are, we don’t get much of a view of the distribution at the lower levels. We’ll change this by plotting the y-axis on a logarithmic scale:

# distribution of internal links to URLs by site level, with a log scale on the y-axis
from mizani.formatters import comma_format

intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'no_internal_links_to_url')) +
                    geom_boxplot(fill = 'blue', alpha = 0.8) +
                    labs(y = '# Internal Links to URL', x = 'Site Level') +
                    scale_y_log10(labels = comma_format()) +
                    theme_classic() +
                    theme(legend_position = 'none')
                   )
intlink_dist_plt.save(filename = 'images/1_log_intlink_dist_plt.png', height=5, width=5, units = 'in', dpi=1000)
intlink_dist_plt

Box plot of internal links per URL by site level with a log y-axis, Andreas Voniatis, November 2021

With the logarithmic view, we see the same distribution of links, but it is much easier to read and helps corroborate the distribution averages for the lower levels.

The mismatch between the first two site levels and the rest of the site points to a skewed distribution.


As a consequence, I’ll take the logarithm of the internal link counts, which will help normalize the distribution.
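The transform itself isn’t shown in the original output; a minimal sketch is to add a log column to the data frame (the column name log_intlinks matches the plotting code below, while adding 1 before taking the log, and the choice of base 2, are assumptions made here to avoid taking the log of zero):

# log-transform the internal link counts; +1 avoids log(0) for URLs with no links
redir_live_urls['log_intlinks'] = np.log2(redir_live_urls['no_internal_links_to_url'] + 1)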

We now have the normalized link counts, which we will plot:

# distribution of log internal links to URLs by site level
intlink_dist_plt = (ggplot(redir_live_urls, aes(x = 'crawl_depth', y = 'log_intlinks')) +
                    geom_boxplot(fill = 'blue', alpha = 0.8) +
                    labs(y = '# Log Internal Links to URL', x = 'Site Level') +
                    theme_classic() +
                    theme(legend_position = 'none')
                   )
intlink_dist_plt

Box plot of log internal links per URL by site level, Andreas Voniatis, November 2021

Because the boxes (interquartile ranges) have a more gradual step shift from site level to site level, the distribution seems to be less skewed.

This puts us in a good position to analyze the data before determining which URLs are under-optimized in terms of internal links.


Quantifying The Problem

For each site depth, the code below calculates the lower 35th quantile (the data science term for percentile) of internal links.

# lower quantile of (log) internal links for each site level
# helper assumed from the text above: the 35th-percentile cut-off per level
def quantile_lower(x):
    return x.quantile(0.35)

quantiled_intlinks = redir_live_urls.groupby('crawl_depth').agg({'log_intlinks': [quantile_lower]}).reset_index()

# flatten the multi-index column names, then rename them
quantiled_intlinks.columns = ['_'.join(col) for col in quantiled_intlinks.columns]
quantiled_intlinks = quantiled_intlinks.rename(columns = {'crawl_depth_': 'crawl_depth',
                                                          'log_intlinks_quantile_lower': 'sd_intlink_lowqua'})
quantiled_intlinks

Table of the lower 35th quantile of log internal links by site level, Andreas Voniatis, November 2021

The computations are shown above. At this point the numbers mean little to an SEO practitioner on their own; they are somewhat arbitrary and simply serve as the cut-off for under-linked URLs at each site level.

Now that we have the table, we’ll combine it with the main data set to see whether the URLs are under-linked row by row.


# flag URLs whose (log) internal links fall at or below the lower quantile for their level
# helper assumed from the description above: 1 = under-linked, 0 = okay
def sd_intlinkscount_underover(row):
    if row['log_intlinks'] <= row['sd_intlink_lowqua']:
        return 1
    return 0

# first join the quantiles to the main data frame, then flag each URL
redir_live_urls_underidx = redir_live_urls.merge(quantiled_intlinks, how = 'left', on = 'crawl_depth')

redir_live_urls_underidx['sd_int_uidx'] = redir_live_urls_underidx.apply(sd_intlinkscount_underover, axis=1)
redir_live_urls_underidx['sd_int_uidx'] = np.where(redir_live_urls_underidx['crawl_depth'] == 'Not Set', 1,
                                                   redir_live_urls_underidx['sd_int_uidx'])

redir_live_urls_underidx

We now have a data frame in which each under-linked URL is flagged with a 1 in the 'sd_int_uidx' column.

This allows us to calculate the number of under-linked site pages by site depth:

# summarise under-linked URL counts by site level
intlinks_agged = redir_live_urls_underidx.groupby('crawl_depth').agg({'sd_int_uidx': ['sum', 'count']}).reset_index()

# flatten the multi-index column names, then rename
intlinks_agged.columns = ['_'.join(col) for col in intlinks_agged.columns]
intlinks_agged = intlinks_agged.rename(columns = {'crawl_depth_': 'crawl_depth'})

# proportion of URLs at each level that are under-linked
intlinks_agged['sd_uidx_prop'] = intlinks_agged.sd_int_uidx_sum / intlinks_agged.sd_int_uidx_count * 100
print(intlinks_agged)

 

   crawl_depth  sd_int_uidx_sum  sd_int_uidx_count  sd_uidx_prop
0            0                0                  1      0.000000
1            1               41                 70     58.571429
2            2               66                303     21.782178
3            3              110                378     29.100529
4            4              109                347     31.412104
5            5               68                253     26.877470
6            6               63                194     32.474227
7            7                9                 96      9.375000
8            8                6                 33     18.181818
9            9                6                 19     31.578947
..         ...              ...                ...           ...
15     Not Set             2351               2351    100.000000

We can now see that, even though pages at site depth 1 have a higher-than-average number of links per URL, there are still 41 pages at that level that are under-linked.

To make things more visual:

# bar chart of under-linked URLs by site level
depth_uidx_plt = (ggplot(intlinks_agged, aes(x = 'crawl_depth', y = 'sd_int_uidx_sum')) +
                  geom_bar(stat = 'identity', fill = 'blue', alpha = 0.8) +
                  labs(y = '# Under Linked URLs', x = 'Site Level') +
                  scale_y_log10() +
                  theme_classic() +
                  theme(legend_position = 'none')
                 )
depth_uidx_plt.save(filename = 'images/1_depth_uidx_plt.png', height=5, width=5, units = 'in', dpi=1000)
depth_uidx_plt

Bar chart of under-linked URLs by site level, Andreas Voniatis, November 2021

The distribution of under-linked URLs is close to normal, as shown by the near bell shape, with the exception of the XML sitemap URLs. Most of the under-linked URLs sit at site levels 3 and 4.


Exporting The Under-Linked URLs List

Now that we have a handle on the under-linked URLs per site level, we can export the data and come up with creative ways of closing the gaps in site depth, as shown below.

# data dump of the under-linked URLs
underlinked_urls = redir_live_urls_underidx.loc[redir_live_urls_underidx.sd_int_uidx == 1]
underlinked_urls = underlinked_urls.sort_values(['crawl_depth', 'no_internal_links_to_url'])
underlinked_urls.to_csv('exports/underlinked_urls.csv')
underlinked_urls

Table of exported under-linked URLs, Andreas Voniatis, November 2021

Other Internal Linking Data Science Techniques

We briefly discussed why strengthening a site’s internal links is important before looking at how internal links are distributed across the site, by site level.


Then we used numerical and visual methods to measure the magnitude of the under-linking problem before exporting the data for suggestions.

Naturally, site-level internal links are just one aspect of internal linking that can be statistically explored and optimized.

Other areas where data science techniques could be applied to internal linking include, but are not limited to:

  • Offsite page-level authority.
  • Anchor text relevance (see the sketch after this list).
  • Search intent.
  • The user’s search journey.
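As an illustration of the anchor text point above (a hypothetical sketch only, not part of the analysis in this article; the example link pairs and the use of scikit-learn are assumptions), the relevance of an anchor text to its target page could be scored with TF-IDF cosine similarity:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical (anchor text, target page title) pairs
links = [('pricing plans', 'Pricing And Plans | Example Site'),
         ('click here', 'Webinar Marketing Guide')]

vectorizer = TfidfVectorizer()
for anchor_text, target_title in links:
    # score how similar the anchor text is to the page it points at
    tfidf = vectorizer.fit_transform([anchor_text, target_title])
    score = cosine_similarity(tfidf[0], tfidf[1])[0][0]
    print(f"{anchor_text!r} -> {target_title!r}: {score:.2f}")

Low-scoring anchors (for example, generic phrases like "click here") would then be candidates for rewriting with more descriptive, relevant text.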

What are some of the topics you’d like to see covered?

Please leave a comment in the section below.




Image credit: Optimarc/Shutterstock


