Don’t Get Scraped: Putting An End To Web Scraping, Content Theft

This is a guest blog post written and contributed by Rami Essaid, founder and CEO of Distil, a content protection network that helps websites prevent malicious web scraping and stop content theft. Distil is a Rackspace Cloud Tools partner.

Calling All Web Scrapers! Get Rich Quick By Stealing Content!

Did you know you could make a lot of money with web scraping? It is very easy to do. All you have to do is leech off other people’s websites, duplicate their content and steal their web visitors. But wait, isn’t that wrong? Surely no one really does that, right? In fact, for most legitimate businesses, web scraping is a real and persistent threat to their site traffic, revenue, brand and network resources.

Web scraping and content theft are nothing new. They have been around for many years in the form of copyright infringement – duplication of the written word, music, images, etc. However, in the age of the Internet, stealing and duplicating a website’s content has become even easier and more lucrative. The worst part of content theft is that most websites just tolerate it; they assume it is part of doing business on the Internet and think nothing can be done to stop it.

You’re A Witness

You probably witness the effects of web scraping and content theft every day. When was the last time you searched Google for information, came across a site that appeared to be a duplicate and found yourself staring at an article buried in a sea of ads? It might have been what you were looking for, but it was unclear who owned the site or who wrote the content. In most cases, you had come across a scraped version of someone else’s content that had been copied and published elsewhere. This practice is commonly referred to as web scraping.

Definition: Web scraping (also called web harvesting or web data extraction) is a computer software technique of extracting information from websites.

Example: Original Vs. Scraped

Here is an example of an original article versus a scraped article:

Original Article
Scraped Article

Unfortunately, malicious web scraping is far too common and costs web publishers nearly $1 billion in losses and damages each year.

Industries Affected By Web Scraping:

  • Airline and travel industry
  • Digital publishing
  • Directories and classifieds
  • Ecommerce
  • Social media and forums

“So What?” – The Real Impact Of Web Scraping

Surprisingly, most businesses are aware of web scraping but very few realize the full impact it has on their web traffic, SEO, brand, revenue, total network costs and, ultimately, their business. Here are just a few of the real impacts of malicious web scraping:

  • Loss of sales and ancillary revenue
  • Decreased traffic and visitor engagement
  • Legal fees to handle duplicated content and copyright infringement
  • Loss of readership and subscriber base
  • Decreased advertising revenue
  • Lower SEO rankings
  • Deflated brand awareness
  • Increased network and bandwidth costs
  • Poor user experience

How To Track And Prevent Web Scraping

Most businesses don’t have the tools to track and identify web scrapers, even if they wanted to. Web scrapers can be highly sophisticated and very rarely show up in traditional analytics and tracking tools. In some cases, web scrapers appear to be legitimate traffic engaging with your site.
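To make the tracking problem concrete, one simple heuristic is to go back to the raw server access logs and flag any client that requests pages faster than a human reader plausibly could. The sketch below is only an illustration and is not a description of how Distil works. The `(ip, timestamp)` input format, the `flag_heavy_clients` name and the threshold of 60 requests per minute are all assumptions, and a sophisticated scraper that spreads its requests across many IP addresses would slip past it.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_heavy_clients(requests, max_per_window=60, window=timedelta(minutes=1)):
    """Return the set of client IPs that make more than max_per_window
    requests within any time span of length `window`.

    requests: iterable of (ip, timestamp) tuples (hypothetical format,
    e.g. parsed from a web server access log).
    """
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)

    flagged = set()
    for ip, stamps in by_ip.items():
        stamps.sort()
        start = 0
        for end, ts in enumerate(stamps):
            # Advance the left edge until the span fits inside the window.
            while ts - stamps[start] > window:
                start += 1
            if end - start + 1 > max_per_window:
                flagged.add(ip)
                break
    return flagged


# Usage: one client hits the site twice a second, another every 30 seconds.
base = datetime(2012, 1, 1, 12, 0, 0)
log = [("203.0.113.5", base + timedelta(milliseconds=500 * i)) for i in range(200)]
log += [("198.51.100.7", base + timedelta(seconds=30 * i)) for i in range(10)]
print(flag_heavy_clients(log))  # prints {'203.0.113.5'}
```

A rate threshold like this is deliberately crude. It catches only the noisiest scrapers, which is consistent with the point above that many of them look like ordinary visitors.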

Sometimes it is too late to stop web scrapers from taking your content. For these cases, the US Government created the Digital Millennium Copyright Act (DMCA). You need to search the Internet continually for duplications of your content, file a DMCA notice, potentially hire a lawyer and wait for the duplicate content to be removed. This can take months, and by the time the duplicate content is removed, another site has duplicated your content and the process begins again.
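The "search continually for duplications" step can be partly automated once you have the text of a suspect page in hand. A minimal sketch, assuming you have already fetched both texts by some other means: split each text into overlapping runs of words (shingles) and compare the two sets with Jaccard similarity. The function names, the five-word shingle size and any cut-off you choose for "this is a copy" are illustrative assumptions, not an established standard.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Text shorter than k words: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a, b, k=5):
    """Jaccard similarity of the shingle sets of two texts, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Usage: a word-for-word copy scores 1.0; unrelated text scores 0.0.
original = "Web scraping and content theft are nothing new on the Internet."
print(jaccard_similarity(original, original))
print(jaccard_similarity(original, "A completely different sentence about gardening tools and soil."))
```

A score close to 1.0 suggests a near-verbatim copy worth reviewing by hand before any DMCA notice is filed. The comparison itself says nothing about who published first.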

There are hardware and software solutions that can partially address the problem. However, we spoke with companies around the world about web scraping and knew there had to be a better way. What if you could stop web scrapers before they ever accessed your content? This is where Distil stepped in. A few years back, after trying to help a company find a solution to its web scraping problems, we realized there were no viable solutions that were easy to set up and inexpensive to use. So we brought together a team of engineers and created the very first content protection network to help websites identify and block malicious web scraping and content theft in real time.

Mini Case Study

The Background – We recently had a customer move onto our platform because it was seeing its content duplicated across the Internet. This particular company generated more than 100 new articles or posts on its site each day. Given the opportunity, other websites were simply duplicating this content and stealing legitimate traffic.

The Results – After moving onto our platform, the web scraping nearly vanished, and the company’s legitimate traffic increased for the first time in approximately three years. Meanwhile, its server and bandwidth expenses dropped noticeably. In some cases, we were even able to identify who was trying to scrape its content. It turned out there were legitimate businesses that were trying to access the content and were willing to pay for that access via our client’s API.

Summary – The company was able to protect its content, reduce business costs and open new revenue streams that were previously inaccessible.

End Of The Line

So yes, it’s true: if you’re willing to cheat, steal and duplicate other people’s content, you can make a lot of money from web scraping. But we’re convinced that once businesses realize there is something they can do to prevent web scraping, this lucrative line of malicious content theft will come to a very quick end.


  1. The chart suggests that normal requests increased by the same amount as scraper requests decreased, at the same time. To what is that increase attributed? One would expect the total number of requests to drop in Sep-11 in a similar fashion.

    • I thought the same thing for a second. No, the chart starts by already segmenting legit traffic from bot traffic. Notice the blue line is always the legit line. There is no total traffic line. So this outlines that legit traffic was unaffected, and bot traffic dropped off completely.

  2. Ironic to see this article published here, as most scraping suspects we’ve battled recently launch their efforts from Rackspace IP addresses.

  3. I had read somewhere about just doing an intro (200-300 characters) followed by the actual link to prevent robot scraping. I seem to have lost the article link, as I forgot to save it. Is this possible, and how does one do it? I saw that one of my YouTube videos and its write-up had been scraped and then reposted on another site with monetised hyperlinks. It was totally word for word, not even respun or with new words added.

  4. frankly, you are more likely for your “data” to be read, then when i realise it’s obviously not from this site i’m gonna find out who wrote it, in short …who cares, this is the same line of thought as annoying DRM, essentially what you are saying is your website is so toxic (adverts and persistant rubbish) that no-one wants to visit it

  5. Uh, so when you installed distil in Sep/11, the number of legitimate requests was instantly increased by ~900,000?

