Project Yellow Brick Road: Fighting Scrapers with 10 Downing Street, the White House, and Some Mildly Plausible Nonsense

Back in the mid-2000s, I was working at a very popular property website in Ireland. At the time, there was effectively a duopoly in the market. We had a large, established competitor, bankrolled by a prestigious media group and backed by some of the better-known estate agents in the country. And then there was us - closer to a scrappy start-up, with a lot of people on their first or second real job out of college. It was an exciting time! Made more so by the fact that we'd recently become the biggest site in the country, by number of listings.

Property in Ireland is a serious business, almost a national obsession - even when we're not actively looking to move home, we're looking at photos, trying to see what the kitchen in a €500,000 house looks like, or the bedroom in the house that sold down the road. There were popular bulletin boards focused just on property, doing huge traffic numbers. This obsession, allied with the booming property market, meant it was a very good business to be in! And, being a good business, new competitors arrived all the time. Barely a week went by without a flashy new site launching - ads on the side of buses, a big spread in the newspapers, a jazzy new website, and then... not much.

Usually they vanished within 3-4 months. This was typically because of the content problem. When you start a new property website, you need to convince estate agents that it's worthwhile to upload their listings to your portal. There was no central tool for agents to upload once and have the content syndicated out - you had to upload individually to each site. This was a nice moat for us and our main competitor, and made it very difficult for new companies to get started. There's a classic chicken-and-egg problem there - agents won't go to the effort of uploading property to a site with no visitors, and nobody will visit a site with no content. This pattern continued for a while, until one day a very different kind of launch happened.

Islands in the stream bay

Coming soon, to a bay near you


In September 2006, a planning application was filed with Dublin City Council to build the Dublin Coastal Development. This was to be an Irish answer to Dubai's Palm Islands - 50,000 high-rise apartments, luxury retail and casino areas, and the world's first giraffe-only zoo. It would be built on three new artificial islands in the bay, shaped to appear like a shamrock. This planning application was accompanied by a very fancy video and website:

The campaign received a lot of attention, being featured on RTÉ, Ireland's national broadcaster. All of Ireland was talking about it. But then came the reveal - it wasn't actually a new development, but a viral launch for another new property site. This was bad news for us - an expensive campaign (the equivalent of about €80k in 2025 money) suggested the new entrant had deep pockets, and the ability to outlast our previous competitors. This suspicion was confirmed when the new entrant was revealed to be a partnership spearheaded by a large and well-established property website in the Netherlands - let's call them WonderHomes for the purposes of this story. They were the biggest site in the Netherlands, so clearly had both the expertise and the financial backing to be a big problem for us, upending the competitive landscape in our small part of the world.

Classifieds structure

A brief digression here for any reader who is not especially familiar with the typical structure of a classified website. When someone comes to the site and does a search for something (property, books, cars, etc), they enter their filters, and get shown a set of search results from the site's database.

Search results

If they see something they like in this list, they click on it, and are brought through to the details page. This is the page where the more targeted calls-to-action ("contact agent", "buy now") will be located.

Seeing more details

In the case of a property website, the idea is that you then contact the real estate agent, and buy or rent the property. Happy days, job done.

Sold! To the man with the 70s moustache


Combining the chicken and the egg

Those near-empty search listings are the chicken-and-egg content problem mentioned earlier. WonderHomes's approach was to scrape our listings and store the info in their database. This meant our listings would appear in their search results, mixed in with any they had legitimately collected themselves. If a user clicked on a listing that originated on our site, they would be sent to the details page on our site to contact the agent.

The goal here was to have a great set of listings, grow the audience, then be able to use that large audience to convince estate agents that it's worth moving to the new platform. At this point we'd be cut out of the loop completely. Bad times.

The Plan: Swap Us out for Them. Bad times.


Options

What are our options in this case?

  • 🙏   Ask them nicely to stop: This can be an effective technique if you're a large fish and dealing with a new minnow in the market. The implied threat of legal action for copying content would sometimes do the trick here. But in this case, the copied content was a key part of WonderHomes's expansion roadmap, so no dice.

  • 🧑‍⚖️   Go to court: A difficult one for us, for a number of reasons. First, it's potentially very expensive. I mentioned "scrappy start-up" earlier, and we had neither the funds nor the contact rolodex to get high-powered lawyers involved. Case law in Ireland around screenscraping was also not settled at that point. It could be that it takes 2 years to get a judgment, but in the meantime it becomes irrelevant as they grow and we get put out of business. Interestingly, this was the route that our Irish competitors went down when they saw their own listings being misappropriated in this way, but not an option for us.

  • 👩‍💻   Technical countermeasures: We were a tech-focused company, so we reached for what we knew best: technical ways to fight off the threat.

Technical countermeasures

We started by checking our access logs, and immediately got very lucky indeed. In amongst all of the Irish IP addresses, visiting the Irish website, were regular, batched sets of visits from Dutch IP addresses. When we traced these IP addresses back to their owners, we found a Dutch company whose website announced that they were specialists in screen scraping, and had worked with WonderHomes in the past. So we'd a pretty good idea how they were getting the content!

Suspicious traffic

Straight away, we blocked their IPs at firewall level. This hard block was noticed on their side pretty quickly, so they started to rotate between IPs, we'd track them down, block again, they'd move again, rinse and repeat... This game of cat and mouse wasn't helping us much, as between blocks they were managing to get a good chunk of content from our site. We needed something a little more drastic.
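Alongside the firewall rules, a check like this can also live in the application itself, which makes it much quicker to update as a scraper hops between addresses. Below is a hypothetical sketch of how an `is_known_scraper()` check against a list of suspect IP ranges might look - this is not our actual code, and the ranges shown are documentation placeholders, not the scraper's real addresses:

```php
<?php
// Hypothetical sketch: match a visitor's IP against known-scraper CIDR
// ranges. The ranges below are invented placeholders for illustration.
const SCRAPER_RANGES = [
    ['net' => '203.0.113.0',  'bits' => 24],
    ['net' => '198.51.100.0', 'bits' => 24],
];

function ip_in_range(string $ip, string $net, int $bits): bool
{
    // Build a netmask from the prefix length, then compare network parts.
    $mask = -1 << (32 - $bits);
    return (ip2long($ip) & $mask) === (ip2long($net) & $mask);
}

function is_known_scraper(string $ip): bool
{
    foreach (SCRAPER_RANGES as $range) {
        if (ip_in_range($ip, $range['net'], $range['bits'])) {
            return true;
        }
    }
    return false;
}
```

Keeping the list in code (or better, in config) means a new scraper IP is a one-line change rather than a firewall ticket - handy when the other side is rotating addresses daily.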

Fuzzing things up

We implemented a chunk of code which would check if a site visitor was coming from one of these known-scraper IP addresses. If they were, then when a property was shown in the listings, we would "fuzz" the details a bit. Maybe a 3 bed house shows as having 4 beds? Or 2? Maybe the price of €300,000 is showing as €270,000? Or €330,000? We put this code together, deployed it, and waited for the next round of scraped data to appear on their website. And it worked! From the next morning, we began to see properties on their website which differed in key ways from the genuine listings on our side.
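The fuzzing idea can be sketched roughly as below. This is an illustrative reconstruction, not the code we actually shipped - the field names and fuzz amounts are assumptions for the example:

```php
<?php
// Illustrative sketch of the fuzzing step: given a listing, nudge the
// price and bed count so the scraped copy differs from the real thing.
function fuzz_listing(array $ad): array
{
    // Shift the price by up to ±10%, rounded to the nearest €1,000
    // so it still looks like a plausible asking price.
    $factor = 1 + (mt_rand(-10, 10) / 100);
    $ad['price'] = (int) (round($ad['price'] * $factor / 1000) * 1000);

    // Add or subtract a bedroom, keeping at least one.
    $ad['beds'] = max(1, $ad['beds'] + (mt_rand(0, 1) ? 1 : -1));

    return $ad;
}

$listing = ['price' => 300000, 'beds' => 3];
$fuzzed  = fuzz_listing($listing);
// e.g. price => 270000, beds => 4 - close enough to pass a sanity
// check, wrong enough to matter to a buyer.
```

The point of rounding and small deltas is that the fuzzed values stay structurally plausible - nothing an automated validator would flag, but plenty for a human comparing listings to notice.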

Key differences emerged...


But what was our game plan here? Well, the rough idea was that people would see the listings on WonderHomes, click through to our site, and see the difference. Buying a property can be stressful, so when you think you've found a great one, finding out at the last second that key details like the price or number of bedrooms are wrong is a pretty maddening experience. Our thinking was that these users would get angry at the obvious errors on the WonderHomes side, and complain to the estate agent. The agent would then get annoyed at these angry calls, get in touch with WonderHomes, and tell them to take the properties in question down. It's not exactly a strategy which would increase the net sum of human happiness, but a strategy it was. And it worked!

... unhappy people, time wasted


Well, to a point. While some agents did complain to WonderHomes, in many cases both the agents and the users complained to us as well. Our expectation was that, as the more established brand, users would ultimately trust that our data was right, but it didn't quite play out that way. Seeing two property sites differing on key details, users had no way to know which one was untrustworthy, so they complained to all sites involved about wasting their time. Our support case volume spiked, as did the number of angry calls our account managers got from agents, wondering why so many people were giving out to them. So we needed to change tack again.

Follow the Yellow Brick Road

We went back to the drawing board, at which point our CTO had a very unusual idea. Ciarán Maher was our CTO at the time, and had been with the company since basically day one, so was very personally invested in seeing off this foe. Ciarán's got an amazing brain, unlike anyone I've ever worked with. He has a fantastic ability to short circuit a conversation about how we go from A to B to C to D, by figuring out a left-field way to go straight from A to D. It's a strange thing to see in practice - simultaneously awe-inspiring and maddening, in an "of course! Why couldn't I see that?!?" kind of way, but really quite something special. In this instance, Ciarán kicked off Project Yellow Brick Road, named after the pathway in the Wizard of Oz which leads to a wonderland where nothing is quite as it seems. If our problem was that users couldn't tell which site had the more credible results, how about we make it a lot more obvious for them?

Rather than just fuzzing the price and bedroom details, what about messing with the address and photo information too? We would do it in a way that was semantically sound (it looks fine to the machines validating scraped data), but obvious nonsense to any human being seeing it. The area and county data in Ireland is a lot more rigid than the street-level address data, so as long as we kept those parts with legitimate values, there was very little to give these fake results away from a purely structural perspective.

We adapted our "fuzzing" code to start swapping addresses. 15 Main Street, Swords, Co. Dublin became 10 Downing Street, London, Swords, Co. Dublin. On our side we have The Gallops, Salthill, Co. Galway, but to the scrapers it is The White House, 1600 Pennsylvania Avenue, Salthill, Co. Galway. And so on, with other notable addresses like Willy Wonka's Chocolate Factory, Dundalk, Co. Louth, or Emerald Palace, Yellow Brick Road, Oz, Ennis, Co. Clare. Each of these fantastical addresses would have a corresponding photo to go with it.
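The swap can be sketched like this - again a hypothetical reconstruction rather than our real code, with invented field names and a hand-picked list of famous fictional addresses:

```php
<?php
// Illustrative sketch of the Yellow Brick Road swap: replace only the
// street-level line of the address with a famous (fictional or foreign)
// one, keeping the area and county fields legitimate so the scraped
// record still validates structurally.
const FANTASY_STREETS = [
    '10 Downing Street, London',
    'The White House, 1600 Pennsylvania Avenue',
    "Willy Wonka's Chocolate Factory",
    'Emerald Palace, Yellow Brick Road, Oz',
];

function swap_address(array $ad): array
{
    // Only the street line changes; area and county stay untouched.
    $ad['street'] = FANTASY_STREETS[array_rand(FANTASY_STREETS)];
    return $ad;
}

$ad = ['street' => '15 Main Street', 'area' => 'Swords', 'county' => 'Co. Dublin'];
$swapped = swap_address($ad);
// e.g. street => '10 Downing Street, London', with Swords, Co. Dublin intact
```

Because the area and county values remain real, an automated validator checking "is this a known area in a known county?" passes the record happily - only a human reader notices that 10 Downing Street has apparently relocated to north County Dublin.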

Fantastical addresses and where to find them


We deployed the code, sat back, and waited. The next morning, after the scrapers had run again, we hurriedly went to check the WonderHomes site, and, success! Our nonsensical listings had been scraped, ingested, and were now appearing among the listings on their side. As we had such a large volume of listings, very quickly the majority of their listings started to appear full of this junk.

The fake ads appear

I mentioned the Irish obsession with property earlier. People very quickly noticed that this flashy new site they'd heard so much about was filled with joke listings, and started posting about it on social media. This was an unexpected bonus for us. Our plan was for people who visited the site to question the credibility of the listings; as more and more people shared jokes and screenshots, they were spreading the word for us.

Of course the WonderHomes team soon spotted this junk themselves, and deleted the offending listings. Our site was rescraped, we tweaked some of the addresses and info returned, and again, junk data showed up. They now had a problem - they couldn't trust the data coming in, but needed it to be live to give them the listings volume they hoped for. But if the data looks semantically sound, it's very hard to snag it with automated processes. At that time, things like AI models trained to pick this out weren't an option - it was manual checking or bust. And given the volume of listings we had, manual checking was not going to be viable. Aside from the time lag it would cause before listings went live, the resources required were not on the roadmap when the Irish expansion was planned. After a couple of false starts (and more junk data ingestion), they gave up completely on scraping us. They refocused their efforts on scraping listings from our competitor, whilst simultaneously battling them in court. After several months of this, due in no small part to being unable to bootstrap their business with our content, they withdrew from the market. The traction they had hoped for wasn't there, so a hasty retreat was beaten. All in all, from our side, a flawless victory!

Flawless Victory! (Almost...)

When we were working on this, one of our developers doing final testing was checking how the fuzzed results would look, and had modified his code locally to something like the below:

if (is_known_scraper() || true) {
  $fake = get_fake_property();
  $ad->address = $fake->address;
  $ad->photo = $fake->photo;
}

The more technical among you will no doubt have spotted the problem straight away here... The stray || true at the end there is a quick and dirty way of testing what happens when the code block is executed, by making it execute every single time, regardless of the user's IP. It's a bit like having a condition if hungry OR always then eat and wondering why you're putting on weight non-stop. I mentioned "scrappy start-up" and "young team starting out" once or twice already, and it's fair to say that our technical processes at the time were not the most robust. Git and GitHub were still not even a dot on the horizon. Code reviews would occasionally happen, typically as over-the-shoulder scanning of the code. We had an SVN repo, but in many cases, our deployment processes weren't a million miles from "FTP to production now, check in later". This code got pushed to production, and we in the dev team took a very self-satisfied victory lap.

For about two minutes. Someone from the sales team came running down to us, asking in a panic "has the site been hacked? There are joke listings all over the place". We hurriedly realised what had happened, and pushed a quick fix, to much embarrassment and wind being taken out of our celebratory sails.

The silver lining was that it gave us an incentive to rapidly improve our dev processes. There's a saying in the industry that "regulations are written in blood" after a disaster of some sort. This wasn't quite that drastic, but "dev processes are written in the screamed obscenities of a furious tech lead" is not a huge distance from the truth in this case.

Reset the counter...

A modern challenge

In this case, we set up a kind of tarpit for our adversary - we gave them something sticky and time consuming to deal with, in the hope of wearing down their resources. This tale of battling scrapers with misdirection, while nearly two decades old, highlights principles that are surprisingly relevant in today's AI-driven landscape.

It's quite common to see tweets like the below on social media - site owners complaining that their bandwidth bills have shot up due to AI bots crawling their site. The bots are consuming a huge amount of resources, incurring real costs, and in many cases, not giving a huge amount of value back to the original site owner.

To combat this problem, Cloudflare launched AI Labyrinth. The idea is that when an AI bot is detected coming to your site, instead of Cloudflare serving your site's content, it returns an AI-generated page of slop for the bot to deal with. Semi-plausible slop, but slop all the same. Within this slop are links to other pages on your site - pages which are also full of AI-generated slop for the bot to wade through. Each of these pages contains links to yet more slop, and so on, and so on... The idea is to tie the bot up in a resource-intensive wasted journey for junk data. As with many Cloudflare products, it can be toggled on in the dashboard, no code required, so it's definitely worth checking out if rising bandwidth costs from AI bots are a problem for you.

Outrunning the lions

An important takeaway here is that we didn't need to have a comprehensive, 110% victory against our threat actor here. We just had to be annoying enough for them to move on to an easier target. It's a little like the story of the group who are on safari when a lion gets loose and starts running towards the group. They start to run, when one guy stops to tighten up his laces. The guy next to him is panicking, and says "what are you doing? You'll never outrun the lion if you're stopping to fix your laces." The guy looks up from his laces and says "I don't need to outrun the lion - I just need to outrun the slowest person in the group."

Not necessarily the most selfless message, but a practical one - security, like survival, is often about being a harder target than the next guy.

Becoming a harder target




IPC Berlin 2025

In June 2025, I'll be speaking at the International PHP Conference in Berlin. I'll be talking about idempotency – what it is, why it’s so useful, and how big players like Stripe and AWS leverage it in production. Expect real-world examples, practical takeaways, and a deep dive into making your systems more robust and reliable.

Get your ticket now and I'll see you there!

