Comments on How I dealt with web content scrapers

Be civil and read the entire article first. This is not a support forum. Comments from new contributors are moderated. English only.

Vimal

Hello Dan,

Seriously, I'm fed up with scrapers, so I would like to know: how can I manipulate RSS feeds to deal with such jerks?? I have tried every option available out there but to no avail, because if I disable the RSS feeds then it's taking more than 2 3 days to get indexed in SERP, but with RSS feeds enabled, it's getting indexed within an hour or so.

Thank you in advance for your help. Please contact me at the given email address at your convenience.

> I have tried every option available out there […]

DMCA takedown notices are just about the only way to handle the problem. Bing, Google, and Yandex are very responsive to such requests, and that minimizes the impact. Hosting companies are supposed to be responsive, but their formal processes and responsiveness are quite lacking. Emailing the site owner is usually entirely pointless, but still a good first step.

> how can I manipulate RSS feeds to deal with such jerks?

I looked up the IP addresses of the scraping websites and matched them against requests for the RSS feed. They requested the feed every couple of minutes, so they were easy to identify. I then made a small script that generated a couple of new randomized junk entries, and returned those instead of the regular feed when the request IP matched. It’s a bit of work, but it was effective.

This has to be used with care and I don’t intend to share the code. It’s not difficult to reproduce the setup, but I don’t want to support or encourage its use. It essentially generates webspam/junk, making the problem worse overall. A good number of sites didn’t notice the defacing for years, and ended up as humongous junkyards because of me. I still generated “new content” for them, so they were, apparently, happy.
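
For the identification step only (not the junk-feed generation), here is a minimal sketch of how frequent feed requesters could be found in a web server access log. It assumes the log uses the common/combined log format and that the feed lives at `/feed.xml`; the function names, the path, and the `min_requests` threshold are all hypothetical and not taken from the setup described above.

```python
import re
from collections import Counter

# Matches the start of a common/combined log format line:
# client IP, identity, user, [timestamp], "METHOD path ..."
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (\S+)')


def count_feed_requests(lines, feed_path="/feed.xml"):
    """Count requests for the feed per client IP address."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the assumed format
        ip, path = match.groups()
        # Ignore any query string when comparing paths.
        if path.split("?", 1)[0] == feed_path:
            counts[ip] += 1
    return counts


def frequent_requesters(lines, feed_path="/feed.xml", min_requests=100):
    """Return (ip, count) pairs at or above the threshold, busiest first."""
    counts = count_feed_requests(lines, feed_path)
    return [(ip, n) for ip, n in counts.most_common() if n >= min_requests]
```

Feeding it a day's worth of log lines, for example `frequent_requesters(open("access.log"))`, gives a short list of candidates to compare against the IP addresses of the scraping sites. Legitimate feed readers also poll frequently, so the list needs that manual comparison before acting on it.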

> […] if I disable the RSS feeds then it's taking more than 2 3 days to get indexed in SERP […]

The IndexNow initiative solves this problem for Bing and Yandex. Google isn’t yet on board. However, the company has said it’s looking into it now. https://www.indexnow.org/
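
As a rough illustration, an IndexNow submission is a single HTTP request. The sketch below builds a JSON `POST` for the shared `api.indexnow.org` endpoint using only the Python standard library. The host, key, and URL in the usage note are placeholders, the helper names are hypothetical, and it assumes the key file is hosted at the site root as `<key>.txt`; check the IndexNow documentation before relying on it.

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host, key, urls, key_location=None):
    """Build (but do not send) an IndexNow submission for the given URLs."""
    payload = {
        "host": host,
        "key": key,
        # Assumption: the key file sits at the site root unless told otherwise.
        "keyLocation": key_location or f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def submit(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A call such as `submit(build_indexnow_request("www.example.com", "your-key", ["https://www.example.com/new-post"]))` would be made right after publishing, so the feed is no longer the only signal that new content exists.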