What is duplicate content?

Duplicate content seems like an easy issue to avoid, yet countless websites across the internet are still filled with it.

While there are certainly best practices that help websites rank higher on Google, exactly how to break through on the search engine remains something of a mystery.

However, the way we structure our content can greatly improve, or greatly reduce, our chances of ranking well and reaching more people.

Understanding duplicate content and copied content, and the difference between them, is key to protecting those rankings.

Let’s dive into duplicate content, why it’s so bad for your site, and how you can avoid it in the first place. 

What is duplicate content? 

According to Google, duplicate content “generally refers to substantive blocks of content within or across domains that either completely match other content, or are appreciably similar. Mostly, this is not deceptive in origin”.

In normal-person terms, duplicate content is content that appears on the internet in more than one place. It’s worth noting that this relates to written content like blog posts and web copy. Feel free to post videos across multiple social media platforms, especially if you’re repurposing said content.

Examples of non-malicious duplicate content could include:

  • Discussion forums that can generate both regular and stripped-down pages targeted at mobile devices
  • Items in an online store that are shown or linked to by multiple distinct URLs
  • Printer-only versions of web pages

How do I avoid duplicate content?

Your best defence against duplicate content is to know how to avoid creating it in the first place. 

From Google, here are some proactive steps you can take to avoid duplicate content:

(Note: things get a little technical here. If any of this feels like too much, it’s best to discuss it with your IT team.)

  • Use 301s: If you’ve restructured your site, use 301 redirects (“RedirectPermanent”) to smartly redirect users, Googlebot, and other spiders. (In Apache, you can do this with an .htaccess file; in IIS, through the administrative console.) There’s a minimal .htaccess sketch after this list.
  • Be consistent: Try to keep your internal linking consistent. For example, don’t link to http://www.example.com/page/ and http://www.example.com/page and http://www.example.com/page/index.htm.
  • Use top-level domains: To help Google serve the most appropriate version of a document, use top-level domains whenever possible to handle country-specific content. Google is more likely to know that http://www.example.de contains Germany-focused content, for instance, than http://www.example.com/de or http://de.example.com.
  • Syndicate carefully: If you syndicate your content on other sites, Google will always show the version it thinks is most appropriate for users in each given search, which may or may not be the version you’d prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to use the noindex tag to prevent search engines from indexing their version of the content (see the noindex sketch after this list).
  • Minimise boilerplate repetition: Instead of including lengthy copyright text on the bottom of every page, include a brief summary and then link to a page with more details. In addition, you can use the Parameter Handling tool to specify how you would like Google to treat URL parameters.
  • Avoid publishing stubs: Users don’t like seeing “empty” pages, so avoid placeholders where possible. For example, don’t publish pages for which you don’t yet have real content. If you do create placeholder pages, use the noindex tag to block these pages from being indexed.
  • Understand your content management system: Make sure you’re familiar with how content is displayed on your website. Blogs, forums, and related systems often show the same content in multiple formats. For example, a blog entry may appear on the home page of a blog, on an archive page, and in a page of other entries with the same label.
  • Minimise similar content: If you have many similar pages, consider expanding each page or consolidating the pages into one. For instance, if you have a travel site with separate pages for two cities but the same information on both pages, you could either merge the pages into one page about both cities or expand each page to contain unique content about each city.
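
To make the 301 redirect step concrete, here’s a minimal sketch of what it might look like in an Apache .htaccess file. The paths and domain are hypothetical placeholders, so adapt them to your own site structure.

```
# Hypothetical example: permanently redirect an old URL to its new home.
# "Redirect 301" is the mod_alias equivalent of "RedirectPermanent".
Redirect 301 /old-page/ https://www.example.com/new-page/

# Redirect a whole restructured section with one pattern-based rule.
RedirectMatch 301 ^/blog/archive/(.*)$ https://www.example.com/blog/$1
```

And for the syndication and placeholder-page advice, noindex is just a meta tag in a page’s head. A minimal sketch, with nothing site-specific about it:

```
<!-- Placed in the <head> of the syndicated or placeholder page. -->
<!-- It tells search engines not to add this version to their index. -->
<meta name="robots" content="noindex">
```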

Why is duplicate content so bad? 

Let’s say you run a website or write a blog. Now, in an ideal world, you want as many people visiting your site and reading your blog posts as possible. However, if your site is filled with duplicate content, you risk poor rankings and losing traffic. 

Google wants to offer people browsing the web the best possible experience. This means search engines will do their best to avoid showing people the same content twice. In the case of duplicate content, Google will only show the duplicate it believes is the best version.

We all understand the importance of internal links, right? Well, what happens when those links point to multiple versions of the same content instead of just one? See what we’re getting at here? Inbound links are a crucial ranking factor for your content, and if that link equity is diluted across duplicates, your content will suffer from decreased visibility.

Duplicate content vs copied content

Duplicate content is usually the result of a technical error, or of the same content being served at multiple URLs.

Copied content is a whole new ball game. Google’s statement that duplicate content “is not deceptive in origin” implies that you won’t be penalised for it. Copied content, however, will more than likely see you penalised.

Copied content is text lifted from an existing URL and rehashed for use in another piece of content. Even if you pad the new post with a few extra keywords and change it a little, you’re going to get penalised for it.

Should you block duplicate content on your website?

In short, no. 

Google is well trained at finding and dealing with duplicates. When it discovers multiple versions of a page, it works out which version is best and consolidates the rest. Usually, the original article becomes the page it decides to rank.

It’s important to note that Google needs access to all the URLs that may contain duplicate content. That means you should avoid blocking Googlebot from crawling them, for example with your robots.txt file.

If you do block them, Google will treat each duplicate as a separate page, and you run the risk of Google viewing your duplicate content as copied content!

Instead of blocking crawlers, try these tips:

  • Allow robots to crawl these URLs
  • Mark the content as duplicate by using rel=canonical (see the sketch after this list)
  • Use Google’s URL Parameter Handling tool to determine how parameters should be handled
  • Make sure you use 301 redirects to send users and crawlers to the canonical URL
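
As a rough illustration, here’s what rel=canonical looks like in practice. The URLs are hypothetical; point the tag at whichever version of the page you want treated as the original.

```
<!-- Placed in the <head> of each duplicate page (for example a
     print-only or parameterised version), pointing at the preferred URL. -->
<link rel="canonical" href="https://www.example.com/page/">
```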

For more information on this, check out our post on rel=canonical: a guide to getting started.

For even more SEO advice and tools, click here.

Download your free SEO checklist now

Discover how SEO can help you rank higher, increase traffic, find quality customers, and more.