The web does not need gatekeepers: Cloudflare’s new “signed agents” pitch

by positiveblue (positiveblue.substack.com)

Everyone loves the dream of a free for all and open web.

But the reality is how can someone small protect their blog or content from AI training bots? E.g.: They just blindly trust someone is sending Agent vs Training bots and super duper respecting robots.txt? Get real...

Or, fine, what if they do respect robots.txt, but then buy the data, which may or may not have been shielded through liability layers as "licensed data"?

Unless you're Reddit, X, Google, or Meta, with their scary unlimited-budget legal teams, you have no power.

Great video: https://www.youtube.com/shorts/M0QyOp7zqcY

> Everyone loves the dream of a free for all and open web... But the reality is how can someone small protect their blog or content from AI training bots?

Aren't these statements entirely in conflict? You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.

No, that is not true. It is only true if you just equate "AI training bots" with "people" on some kind of nominal basis without considering how they operate in practice.

It is like saying "If your grocery store is open to the public, why is it not open to this herd of rhinoceroses?" Well, the reason is because rhinoceroses are simply not going to stroll up and down the aisles and head to the checkout line quietly with a box of cereal and a few bananas. They're going to knock over displays and maybe even shelves and they're going to damage goods and generally make the grocery store unusable for everyone else. You can say "Well, then your problem isn't rhinoceroses, it's entities that damage the store and impede others from using it" and I will say "Yes, and rhinoceroses are in that group, so they are banned".

It's certainly possible to imagine a world where AI bots use websites in more acceptable ways --- in fact, it's more or less the world we had prior to about 2022, where scrapers did exist but were generally manageable with widely available techniques. But that isn't the world that we live in today. It's also certainly true that many humans are using websites in evil ways (notably including the humans who are controlling many of these bots), and it's also very true that those humans should be held accountable for their actions. But that doesn't mean that blocking bots makes the internet somehow unfree.

This type of thinking that freedom means no restrictions makes sense only in a sort of logical dreamworld disconnected from practical reality. It's similar to the idea that "freedom" in the socioeconomic sphere means the unrestricted right to do whatever you please with resources you control. Well, no, that is just your freedom. But freedom globally construed requires everyone to have autonomy and be able to do things, not just those people with lots of resources.

They're basically describing the tragedy of the commons, but one where a handful of the people have bulldozers to rip up all the grass and trees.

We can't have nice things because the powerful cannot be held accountable. The powerful are powerful due to their legal teams and money, and power is the ability to carve exceptions to rules.

>You either have a free for all open web or you don't. Blocking AI training bots is not free and open for all.

It's perfectly legit to want a web that's "free and open for all except big corporations and AI engines".

I think that was the point. Everyone loves the dream, but the reality is different.

Nothing is "free". AI bots eat up my blog like crazy and I have to pay for its hosting.

And? Paying Cloudflare or someone else to block bad actors is required these days unless you have the scale and expertise to do it yourself.

Why is outsourcing this to Cloudflare bad and doing it yourself ok? Am I allowed to buy a license to a rate limiter or do I need to code my own? Am I allowed to use a firewall or is blocking people from probing my server not free enough?

Why are bots or any other user entitled to unlimited visits to my website? The entitlement is kind of unreal at this point

That's a very "BSD is freedom and GPL isn't" kind of philosophy.

Nothing is truly free unless you give equal respect to fellow hobbyists and megacorps using your labor for their profit.

The dream is real, man. If you want open content on the Internet, it's never been a better time. My blog is open to all - machine or man. And it's hosted on my home server next to me. I don't see why anyone would bother trying to distinguish humans from AI. A human hitting your website too much is no different from an AI hitting your website too much.

I have a robots.txt that tries to help bots not get stuck in loops, but if they want to, they're welcome to. Let the web be open. Slurp up my stuff if you want to.

Amazonbot seems to love visiting my site, and it is always welcome.

> I don't see why anyone would bother trying to distinguish humans from AI.

Because a hundred thousand people reading a blog post is more beneficial to the world than an AI scraper bot fetching my (unchanged) blog post a hundred thousand times just in case it's changed in the last hour.

If AI bots were well-behaved, maintained a consistent user agent, used consistent IP subnets, and respected robots.txt, I wouldn't have a problem with them. You could manage your content filtering however you want (or not at all) and that would be that. Unfortunately, at the moment AI bots do everything they can to bypass any restrictions or blocks or rate limits you put on them; they behave as though they're completely entitled to overload your servers in their quest to train their AI models so they can make billions of dollars on the new AI craze while giving nothing back to the people whose content they're misappropriating.

The only bot that bugs the crap out of me is Anthropic's one. They're the reason I set up a labyrinth using iocaine (https://iocaine.madhouse-project.org/). Their bot was absurdly aggressive, particularly with retries.

It's probably trivial in the whole scheme of things, but I love that Anthropic spent months making about 10 rps against my stupid blog, getting Markov chain responses generated from the text of Moby Dick. (Looks like they haven't crawled my site for about a fortnight now.)
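
For the curious, the Markov-chain babble that tarpits like iocaine serve is only a few lines of code. A rough Python sketch of the idea (the corpus filename and parameters are placeholders; iocaine itself is a separate, self-hosted tool, not this snippet):

```python
import random
from collections import defaultdict

def build_chain(text: str, order: int = 2) -> dict:
    """Map each `order`-word prefix to the words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain: dict, length: int = 50) -> str:
    """Walk the chain to emit plausible-looking nonsense for the crawler."""
    state = random.choice(list(chain))
    out = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            state = random.choice(list(chain))  # dead end: re-seed somewhere else
            continue
        word = random.choice(followers)
        out.append(word)
        state = (*state[1:], word)
    return " ".join(out)

corpus = open("moby_dick.txt").read()  # any public-domain text works as the corpus
print(babble(build_chain(corpus)))
```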

It's traditional to include a link when claiming to be invulnerable. :)

By developing Free Software that combats this hostile software.

Corporations develop hostile AI agents,

Capable hackers develop anti-AI-agents.

This defeatist attitude of "we have no power".

Yes, I obviously agree with you. I think my comment's point was missed a little by you, though. CF is making these tools and giving millions of people access to them.

So basically cloudflare but self-hosted (with all the pain that comes from that)?

That's a mantra, not a solution.

Sometimes it's a hardware problem, not a software problem.

How does an agent help my website not get crushed by traffic load, and how is this proposal any different from the gatekeeping problem to the open web, except even less transparent and accountable because now access is gated by logic inside an impenetrable web of NN weights?

This seems like slogan-based planning with no actual thought put into it.

This is the attitude I like to see. As they say (and I actually hate this phrase because of its past connotations), "freedom isn't free".

> But the reality is how can someone small protect their blog or content from AI training bots? E.g.: They just blindly trust someone is sending Agent vs Training bots and super duper respecting robots.txt? Get real...

Baking hashcash into HTTP 1.0/1.1/2/3, SMTP, IMAP, POP3, TLS, and SSH. Then all of this becomes too expensive for spammers and training bots. But the IETF is infiltrated by government and corporate interests...
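
For illustration, a hashcash-style proof of work reduces to a sketch like the one below, assuming a simple leading-zero-bits difficulty rule rather than the original hashcash stamp format (the resource string and difficulty are placeholders):

```python
import hashlib
import itertools

def mint(resource: str, bits: int = 20) -> int:
    """Find a nonce whose SHA-256 over (resource, nonce) has `bits` leading zero bits."""
    target = 1 << (256 - bits)  # any digest below this value has at least `bits` leading zeros
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{resource}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(resource: str, nonce: int, bits: int = 20) -> bool:
    """Cheap server-side check: one hash, no matter how long minting took."""
    digest = hashlib.sha256(f"{resource}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))

# A client pays roughly 2^20 hashes to fetch /blog/post-1; the server verifies instantly.
nonce = mint("GET /blog/post-1", bits=20)
assert verify("GET /blog/post-1", nonce, bits=20)
```

The asymmetry is the point: minting costs the sender roughly 2^bits hash evaluations, while verification stays a single hash for the receiver.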

Spammers will buy ASICs and get a huge advantage over consumer CPUs

We have thousands of engineers from these companies right here on Hacker News, and they cry and scream about privacy and data governance on every topic but their own work. If you guys need a mirror to do some self-reflection, I am offering to buy one.

In recent days, the biggest delu-lulz was delivered by that guy who'd bravely decided to boycott Grok out of... environmental concerns, apparently. It's curious how everybody is so anxious these days, about AI among other things, in our little corner of the web. I swear, every other day it's some new big fight against something... bad. Surely it couldn't ALL be attributed to policy in the US!

I'll contribute for the mirror. The hypocrisy is so loud, aliens in outer space can hear it (and sound doesn't even travel in vacuum).

“But the reality is how can someone small protect their blog or content from AI training bots?”

Why would you need to?

If your inability to assemble basic HTML forces you to adopt enormous, bloated frameworks that require two full CPU cores to render your post…

… or if you think your online missives are a step in the road to content creator riches …

… then I suppose I see the problem.

Otherwise there’s no problem.

It's not a question of languages or frameworks, but hardware. I cannot finance servers large enough to keep up with AI bots constantly scraping my host, bypassing cache directives, or changing IPs to avoid bans.

I have had to disable at least one service because AI bots kept hitting it and it started impacting other stuff I was running that I am more interested in. Part of it was the CPU load on the database rendering dozens of 404s per second (which still required a database call), part of it was that the thumbnail images were being queried over and over again with seemingly different parameters for no reason.

I'm sure there are AI bots that are good and respect the websites they operate on. Most of them don't seem to, and I don't care enough about the AI bubble to support them.

When AI companies stop people from using them as cheap scrapers, I'll rethink my position. So far, there's no way to distinguish any good AI bot from a bad one.

So by a free and open-for-all web you mean one only for the tech priests competent enough to build the skills and maintain them in light of changes to the spec (hope these people didn't run across XML/XSLT-dependent techniques building their site), or those with a rich enough family that they can casually learn a skill while not worrying about putting food on the table?

There’s going to be bad actors taking advantage of people who cannot fight back without regulations and gatekeepers, suggesting otherwise is about as reasonable as ancaps idea of government

What we need is some legal teeth behind robots.txt. It won't stop everyone, but Big Corp would be a tasty target for lawsuits.

I don't know about this. This means I'd get sued for using a feed reader on Codeberg[1], or for mirroring repositories from there (e.g. with Forgejo), since both are automated actions that are not caused directly by a user interaction (i.e. bots, rather than user agents).

[1]: https://codeberg.org/robots.txt#:~:text=Disallow:%20/.git/,....

It wouldn’t stop anyone. The bots you want to block already operate out of places where those laws wouldn’t be enforced.

What we need is to stop fighting robots and start welcoming and helping them. I see zero reason to oppose robots visiting any website I would build. The only purpose I ever used disallow rules for was preventing search engines from indexing incomplete versions or following paths which really make no sense for them to go down. Now I think we should write separate instructions for different kinds of robots: a search engine indexer shouldn't open pages which have serious side effects (e.g. placing an order) or display semi-realtime technical details, but an LLM agent may be on a legitimate mission involving exactly that.
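
Per-agent rules like that already fit in robots.txt. A small sketch using Python's standard urllib.robotparser, with made-up agent names and paths, shows how different robots get different answers from the same file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with different rules for different kinds of robots.
robots_txt = """
User-agent: Googlebot
Disallow: /cart/checkout

User-agent: ExampleLLMAgent
Allow: /

User-agent: *
Disallow: /drafts/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/cart/checkout"))        # False
print(rp.can_fetch("ExampleLLMAgent", "https://example.com/cart/checkout"))  # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/drafts/post"))       # False
```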

The funny thing about the good old WWW is that the first two W's stand for World Wide.

So which legal teeth?

It should have the same protections as an EULA, where the crawler is the end user, and crawlers should be required to read it and apply it.

I have the feeling that it's the small players that cause problems.

Dumb bots that don't respect robots.txt or nofollow are the ones trying all combinations of the filters available in your search options and requesting all pages for each such combination.

The number of search pages can easily be exponential in the number of filters you offer.

Bots walking into these traps do it because they are dumb. But even a small, degenerate bot can send more requests than 1M MAUs.

At least that's my impression of the problem we're sometimes facing.

Signed agents seem like a horrific solution. And maybe just serving the traffic is better.

No, we don't.

- Moral rules are never really effective

- Legal threats are never really effective

Effective solutions are:

- Technical

- Monetary

I like the idea of the web as a blockchain of content. If you want to pull some data, you have to pay for it with some kind of token. You either buy that token to consume information, if you're the leecher type, or earn some through contributions that gain tokens back.

It's more or less the same concept as torrents back in the day.

This should be applied to email too. The regular person sends what, 20 emails per day max? Say it costs $0.01 per mail; anyone could pay that. But if you want to spam 1,000,000 people every day, that becomes prohibitive (1,000,000 × $0.01 = $10,000 a day).

I recently found out my website has been blocking AI agents, when I had never asked for that. It seems to be opt-out by default, but in an obscure way. Very frustrating. I think some of these companies (one in particular) are risking burning a lot of goodwill, although I think they have been on that path for a while now.

Are you talking about Cloudflare? The default seems indeed to be to block AI crawlers when you set up a new site with them.

You can lock it up with a user account and payment system. The fact that the site is up on the internet doesn't mean you can or cannot profit from it. It's up to you. What I would like is a way to notify my ISP and say: block this traffic to my site.

> What I would like is a way to notify my ISP and say: block this traffic to my site.

I would love that, and make it automated.

A single message from your IP to your router: block this traffic. That router sends it upstream, and it also blocks it. Repeat ad nauseam until the source changes ASN or (if the originator is on the same ASN) it reaches the router nearest the originator, routing table space notwithstanding. Maybe it expires after some auto-expiry -- a day or a month or however long your IP lease exists. Plus, of course, a way to query what blocks I've requested and a way to unblock.

Onion sites have bots and scrapers.

They don't use Cloudflare AFAIK.

They normally use a puzzle that the website generates, or they use a proof-of-work based captcha. I've found proof of work good enough out of these two, and it also means that the site owner can run it themselves instead of being reliant on Cloudflare and third parties.

You can't trust everyone will be polite or follow "standards".

However, you can incentivize good behavior. Let's say there's a scraping agent: you could make an x402-compatible endpoint and offer them a discount or something.

Kinda like piracy; if you offer a good, simple, cheap service people will pay for it versus go through the hassle of pirating.

You might have this the wrong way around.

It's not the publishers who need to do the hard work, it's the multi-billion dollar investments into training these systems that need to do the hard work.

We are moving to a position whereby if you or I want to download something without compensating the publisher, that's jail time, but if it's Zuck, Bezos or Musk, they get a free pass.

That's the system that needs to change.

I should not have to defend my blog from these businesses. They should be figuring out how to pay me for the value my content adds to their business model. And if they don't want to do that, then they shouldn't get to operate that model, in the same way I don't get to build a whole set of technologies on papers published by Springer Nature without paying them.

This power imbalance is going to be temporary. These trillion-dollar market cap companies think if they just speed run it, they'll become too big, too essential, the law will bend to their fiefdom. But in the long term, it won't - history tells us that concentration of power into monarchies descends over time, and the results aren't pretty. I'm not sure I'll see the guillotine scaffolds going up in Silicon Valley or Seattle in my lifetime, but they'll go up one day unless these companies get a clue from history as to what they need to do.

> But the reality is how can someone small protect their blog or content from AI training bots?

A paywall.

In reality, what some want is to get all the benefits of having their content on the open internet while still controlling who gets to access it. That is the root cause here.

This. We need to get rid of the ad-supported free internet economy. If you want your content to be free, you release it and have no issues with AI. If you want to make money off your content, add a paywall.

We need micropayments going forward, Lightning (Bitcoin backend) could be the solution.

Which is really all that cloudflare is building here that people are mad about. It’s a way to give bots access to paywalled content.

> Everyone loves the dream of a free for all and open web.

> protect their blog or content from AI training bots

It strikes me that one needs to choose one of these as their visionary future.

Specifically: a free and open web is one where read access is unfettered to humans and AI training bots alike.

So much of the friction and malfunction of the web stems from efforts to exert control over the flow (and reuse) of information. But this is in conflict with the strengths of a free and open web, chief of which is the stone cold reality that bytes can trivially be copied and distributed permissionlessly for all time.

It's the new "ban cassette tapes to prevent people from listening to unauthorized music," but wrapped in an anti-corporate skin delivered by a massive, powerful corporation that could sell themselves to Microsoft tomorrow.

The AI crawlers are going to get smarter at crawling, and they'll have crawled and cached everything anyway; they'll just be reading your new stuff. They should literally just buy the Internet Archive jointly, and only read everything once a week or so. But people (to protect their precious ideas) will then just try to figure out how to block the IA.

One thing I wish people would stop doing is conflating their precious ideas and their bandwidth. The bandwidth is one very serious issue, because it's a denial of service attack. But it can be easily solved. Your precious ideas? Those have to be protected by a court. And I don't actually care iff the copyright violation can go both ways; wealthy people seem to be free to steal from the poor at will, even rewarded, "normal" (upper-middle class) people can't even afford to challenge obviously fraudulent copyright claims, and the penalties are comically absurd and the direct result of corruption.

Maybe having pay-to-play justice systems that punish the accused before conviction with no compensation was a bad idea? Even if it helped you to feel safe from black people? Maybe copyright is dumb now that there aren't any printers anymore, just rent-seekers hiding bitfields?

Maybe this is a naive question, but why not just cut an IP off temporarily if it sends too many requests or sends them too fast?

They use many IPs, often not identifiable as the same bot.
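
For context, the naive per-IP throttle the parent describes is only a few lines (an in-memory sliding window; the names and thresholds here are illustrative), and the point of the reply is that a scrape spread across thousands of residential IPs never trips it:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 120                       # per IP per window; illustrative threshold
_hits: dict[str, deque] = defaultdict(deque)

def allow(ip: str) -> bool:
    """Return True if this IP is still under its per-window budget."""
    now = time.monotonic()
    q = _hits[ip]
    while q and now - q[0] > WINDOW_SECONDS:  # drop hits that fell out of the window
        q.popleft()
    if len(q) >= MAX_REQUESTS:
        return False                          # temporarily cut off this IP
    q.append(now)
    return True
```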

It is a service available to Cloudflare customers and is opt-in. I fail to see how they're being gatekeepers when site owners have the option not to use it.

Why do you have to protect it? Have you suffered any actual problem, or are you being overly paranoid? I think only a few people have actually received DDoS-level traffic, and the rest are being paranoid.

I care more about the dream of a wide open free web than a small time blogger’s fears of their content being trained on by an AI that might only ever emit text inspired by their content a handful of times in their life.

"I want an open web!"

"Okay, that means AI companies can train on your content."

"Well, actually, we need some protections..."

"So you want a closed web with access controls?"

"No no no, I support openness! Can't we just have, like, ethical openness? Where everyone respects boundaries but there's no enforcement mechanism? Why are you making this so black and white?"

> “When we started the “free speech movement,” we had a bold new vision. No longer would dissenters’ views be silenced. With the government out of the business of policing the content of speech, robust debate and the marketplace of ideas would lead us toward truth and enlightenment. But it turned out that freedom of the press meant freedom for those who owned one. The wealthy and powerful dominated the channels of speech. The privileged had a megaphone and used free speech protections to immunize their own complacent or even hateful speech. Clearly, the time has come to denounce the naïve idealism of the past and offer a new movement, Speech 2.0, which will pay more attention to the political economy of media and aim at “free-ish” speech — the good stuff without the bad.”

https://openfuture.eu/paradox-of-open-responses/misunderesti...

> Everyone loves the dream of a free for all and open web. But the reality is how can someone small protect their blog or content from AI training bots?

I'm old enough to remember when people asked the same questions of Hotbot, Lycos, Altavista, Ask Jeeves, and -- eventually -- Google.

Then, as now, it never felt like the right way to frame the question. If you want your content freely available, make it freely available... including to the bots. If you want your content restricted, make it restricted... including to the humans.

It's also not clear to me that AI materially changes the equation, since Google has for many years tried to cut out links to the small sites anyway in favor of instant answers.

(FWIW, the big companies typically do honor robots.txt. It's everyone else that does what they please.)

What if I want my content freely available to humans, and not to bots? Why is that such an insane, unworkable ask? All I want is a copyleft protection that specifically allows humans to access my work to their heart's content, but disallows AI use of it in any form. Is that truly so unreasonable?

Google (and the others) crawl from a published IP range, with "Google" in the user agent. They read robots.txt. They are very easy to block

The AI scum companies crawl from infected botnet IPs, with the user agent the same as the latest Chrome or Safari.
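
For what it's worth, verifying the well-behaved crawlers really is easy. A sketch of the reverse-then-forward DNS check that Google documents for Googlebot (error handling kept minimal):

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Reverse-DNS the IP, check the hostname, then forward-resolve to confirm it maps back."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)          # e.g. crawl-66-249-66-1.googlebot.com
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]  # forward lookup must return the same IP
    except OSError:
        return False
```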

Don't publish things if you don't want them published.

Get real yourself.

Nonsense.

I'm routinely denied access to websites now.

"Enable JavaScript and unblock cookies to continue."

You could run https://zadzmo.org/code/nepenthes/ to punish the AI scrapers.

Everyone loves a free for all and open web because it works really well.

Basic tools like Anubis and fail2ban are very effective at keeping most of this evil at bay.

Nobody cares about robots.txt, nor should they.

If this is your primary argument against being scraped (viz that your robots.txt said not to) then you’re naive and you’re doing it wrong.

If the internet is open, then data on it is going to be scraped lol. You can’t have it both ways.

It seems the Open Internet is idealistic.

If others respected robots.txt, we would not need solutions like what Cloudflare is presenting here. Since abuse is rampant, people are looking for mitigations and this CF offering is an interesting one to consider.

How about we discuss, design, and implement a system that charges them for their actions? We could put patterns in our sites that impose this cost through some sort of problem-solving challenge, harvesting the energy of their scraping/LLM tools and directing it toward causes that profit our site, in exchange for revealing some content that also achieves their scraping mission. It looks like things like this already exist to a degree.

Why should your blog be protected? Information wants to be free.

It's amazing how this catchphrase has reversed meanings for some people. It was previously used against walled gardens and paywalls, but these corporate LLMs are the ultimate walled garden for information because in most cases you can't even find out who created the information in the first place.

"Information wants to be free! That's why I support hiding it behind a chatbot paywall that makes a few people billionaires"

I personally love the idea of a free and open internet and also have no issues with bots scraping or training off of my data.

I would much rather have it open for all, including companies, than the coming dystopian landscape of paywall gates. I don’t care about respecting robots.txt or any other types of rules. If it’s on the internet it’s for all to consume. The moment you start carving out certain parties is the moment it becomes a slippery slope.

For what it’s worth, I think CF will lose this battle and fundamentally feeding the bots will just become normal and wanted

> But the reality is how can someone small protect their blog or content from AI training bots?

First off, there's no harm from well-behaved bots. Badly behaved bots that cause problems for the server are easily detected (by the problems they cause), classified, and blocked or heavily throttled.

Of course, if you mean "protect" in the sense of "keep AI companies from getting a copy" (which you may have, given that you mentioned training) - you simply can't, unless you consider "don't put it on the web" a solution.

It's impossible to make something "public, but not like that". Either you publish or you don't.

If anything, it's a legal issue (copyright/fair use), not a technical one. Technical solutions won't work.

I'm not sure why people are so confused by this. The Mastodon/AP userbase put their public content on a publicly federated protocol then lost their shit and sent me death threats when I spidered and indexed it for network-wide search.

There are upsides and downsides to publishing things you create. One of the downsides is that it will be public and accessible to everyone.

I have zero issue with AI agents, if there's a real user behind there somewhere. I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity and OpenAI - it's really annoying realising that we're tying up several CPU cores on AI crawling. Less than on real users and Google et al., but still.

I've some personal apps online and I had to turn the Cloudflare AI bot protection on, because one of them had 1.6 TB of data accessed by the bots in the last month, 1.3 million requests per day, just nonstop hammering it with no limits.

So, under the free traffic tier of any decent provider.

They're getting to the point of 200-300 RPS for some of my smaller marketing sites, hallucinating URLs like crazy. It's fucking insane.

You'd think they would have an interest in developing reasonable crawling infrastructure, like Google, Bing or Yandex. Instead they go all in on hosts with no metering. All of the search majors reduce their crawl rate as request times increase.

On one hand these companies announce themselves as sophisticated, futuristic and highly-valued, on the other hand we see rampant incompetence, to the point that webmasters everywhere are debating the best course of action.

I'm seeing around the same, as a fairly constant base load. Even more annoying when it's hitting auth middleware constantly, over and over again somehow expecting a different answer.

I wonder how many CPU cycles are spent because of AI companies scraping content. This factor isn't usually considered when estimating “environmental impact of AI.” What’s the overhead of this on top of inference and training?

To be fair, an accurate measurement would need to consider how many of those CPU cycles would be spent by the human user who is driving the bot. From that perspective, maybe the scrapers can "make up for it" by crawling efficiently, i.e. avoiding loading tracker scripts, images, etc. unless necessary to solve the query. This way they'll still burn CPU cycles, but at least it'll be fewer cycles than a human user with a headful browser instance.

Same with me. If there is a real user behind the use of the AI agents and they do not make excessive accesses in order to do what they are trying to do, then I do not have a complaint (the use of AI agents is not something I intend, but that is up to whoever is using them and not up to me). I do not like the excessive crawling.

However, what is more important to me than AI agents, is that someone might want to download single files with curl, or use browsers such as Lynx, etc, and this should work.

Cloudflare is trying to gatekeep which user-initiated agents are allowed to read website content, which is of course very different from scraping websites for training data. Meta, Perplexity and OpenAI all have some kind of web-search functionality where they send requests based on user prompts. These are not requests that get saved to train the next LLM. Cloudflare intentionally blurs the line between both types of bots, and in that sense it is a bait-and-switch where they claim to 'protect content creators' by being the man in the middle and collecting tolls from LLM providers to pay creators (and of course take a cut for themselves). It's not something they do because it would be fair; there's a financial motivation.

> Cloudflare is trying to gatekeep which user-initiated agents are allowed to read website content, which is of course very different from scraping websites for training data.

That distinction requires you to take companies which benefit from amassing as much training data as possible at their word when they pinky swear that a particular request is totally not for training, promise.

> I DO have a major issue with my sites being crawled extremely aggressively by offenders including Meta, Perplexity and OpenAI

Gee, if only we had, like, one central archive of the internet. We could even call it the internet archive.

Then, all these AI companies could interface directly with that single entity on terms that are agreeable.

Internet Archive is missing enormous chunks of the internet though. And I don't mean weird parts of the internet, just regional stuff.

Not even news articles from top 10 news websites from my country are usually indexed there.

You think they care about that? They'd still crawl like this just in case, which is why they don't rate limit at the moment.

I use uncommon web browsers that don't leak a lot of information. To Cloudflare, I am indistinguishable from a bot.

Privacy cannot exist in an environment where the host gets to decide who accesses the web page. I'm okay with rate limiting or otherwise blocking activity that creates too much of a load, but trying to prevent automated access is impossible without preventing access from real people.

And god forbid you live in an authoritarian country and must use VPN to protect your freedom. Internet becomes captcha hell run by 2-3 companies.

I've had far fewer issues with my own bots that access cloudflare protected websites, than during my regular browsing with privacy respecting browsers and a VPN.

As a side note: I'm at least thankful Microsoft isn't behind web gatekeeping. Try to solve any Microsoft captcha behind a VPN - it's like writing a thesis; you gotta dedicate like 5 minutes, full attention.

The website owner has rights too. Are you arguing they cannot choose to implement such gatekeeping to keep their site operating in a financially viable manner?

The first article of our constitution says people shall be treated equally in equal situations. I presume that most countries have similar clauses but, beyond legalese, it's also simply in line with my ethics to treat everyone equally

There are people behind those connection requests. I don't try to guess on my server who is a bot and who is not; I'll make mistakes and probably bias against people who use uncommon setups (those needing accessibility aids or using e.g. experimental software that improves some aspect like privacy or functionality)

Sure, I have rights as a website owner. I can take the whole thing offline; I can block every 5th request; I can allow each /16 block to make 1000 requests per day; I can accept requests only from clients that have a Firefox user agent string. So long as it's equally applied to everyone and it's not based on a prohibited category such as gender or religious conviction, I am free to decide on such cuts and I'd encourage everyone to apply a policy that they believe is fair

Cloudflare and its competitors, as far as I can tell, block arbitrary subgroups of people based on secret criteria. It does not appear to be applied fairly, such as allowing everyone to make the same number of requests per unit time. I'm probably bothered even more because I happen to be among the blocked subgroup regularly (but far from all the time, just little enough to feel the pain)

I never said there was anything prohibiting them, just that they will be losing users. (Although blocking some access can be illegal, for example when accessibility tools are blocked.)

There's a whole spectrum of gatekeeping on communications with users, from static sites that broadcast their information to anyone, and stores that let you order without even making an account, to organizations that require you to install local software just to access data and perform transactions. The latter means 90%+ of your users will hate you for it, and half will walk away, but it's still very common, collectively costing the businesses that do so billions of dollars a year. (https://www.forbes.com/sites/johnkoetsier/2021/02/15/91-of-u... to-install-apps-to-do-business-costing-brands-billions/)

When companies get big enough to have entire departments devoted to tasks, those departments will follow the fads that bring them the most prestige, at the cost of the rest of the company. Eventually the company will lose out to newer, more efficient businesses that forgo fads in favor of serving customers, and the cycle continues.

I'm just pointing out how a new fad is hurting businesses, but I by no means wish to limit their ability to do so. They just won't be getting my business, nor business from a quickly growing cohort that desires anonymity, or even requires it to get around growing local censorship.

If you put your information freely on the web, you should have minimal expectations on who uses it and how. If you want to make money from it, put up a paywall.

If you want the best of both worlds, i.e. just post freely but make money from ads, or inserting hidden pixels to update some profile about me, well good luck. I'll choose whether I want to look at ads, or load tracking pixels, and my answer is no.

I also do the same and get caught up by bot blockers.

However, I do believe the host can do whatever they want with my request also.

This issue becomes more complex when you start talking about government sites, since ideally they have a much stronger mandate to serve everyone fairly.

I agree with you, but the website owners just don't seem to understand that they are making their small problem into a big problem for real people, some of which will drop off.

Do you currently get blocked a lot by Cloudflare/Turnstile then? Sorry, I think you implied that; I just want to be clear.

Well, if you have a better way to solve this that’s open I’m all ears. But what Cloudflare is doing is solving the real problem of AI bots. We’ve tried to solve this problem with IP blocking and user agents, but they do not work. And this is actually how other similar problems have been solved. Certificate authorities aren’t open and yet they work just fine. Attestation providers are also not open and they work just fine.

> Well, if you have a better way to solve this that’s open I’m all ears.

Regulation.

Make it illegal to request the content of a webpage by crawler if a website operator doesn't explicitly allow it via robots.txt. Institute a government agency that is tasked with enforcement. If you as a website operator can show that traffic came from bots, you can open a complaint with the government agency and they take care of shaking painful fines out of the offending companies. Force cloud hosts to keep books on who was using which IP addresses. Will it be a 100% fix? No. Will it have a massive chilling effect if done well? Absolutely.

I'm not anti-government, but a technical solution that eliminates the problem is infinitely better than regulating around it.

The internet is too big and distributed to regulate. Nobody will agree on what the rules should be, and certain groups or countries will disagree in any case and refuse to enforce them.

Existing regulation rarely works, and enforcement is half-assed at best. Ransomware is regulated and illegal, but we see articles about major companies getting infected all the time.

I don't think registering with Cloudflare is the answer, but regulation definitely isn't the answer.

The biggest issue right now seems to be people renting their residential IP addresses to scraper companies, who then distribute large scrapes across these mostly distinct IPs. These addresses are from all over the world, not just your own country, so we'll either need a World Government, or at least massive intergovernmental cooperation, for regulation to help.

This is hilarious.

You are either from the EU or living a couple of decades in the past.

Agreed. It might not be THE BEST solution, but it is a solution that appears to work well.

Centralization bad, yada yada. But if Cloudflare can get most major AI players to participate, then convince the major CDNs to also participate... ipso facto columbo oreo... standard.

yep, that's why I am writing this now :)

You can see it in the web vs mobile apps.

Many people may not see a problem with walled gardens, but the reality is that we have much less innovation in mobile than on the web, because anyone can spin up a web server versus having to publish an app in the App Store (Apple).

I'm not sure if things are as fine as you say they are. Certificate authorities were practically unheard of outside of corporate websites (and even then mostly restricted to login pages) until Let's Encrypt normalized HTTPS. Without the openness of Let's Encrypt, we'd still be sharing our browser history and search queries with our ISPs for data mining. Attestation providers have so far refused to revoke attestation for known-vulnerable devices (because customers needing to replace thousands of devices would be an unacceptable business decision), making the entire market rather useless.

That said, what I am missing from these articles is an actual solution. Obviously we don't want Cloudflare to become an internet gatekeeper. It's a bad solution. But: it's a bad solution to an even worse problem.

Alternatives do exist, even decentralised ones, in the form of remote attestation ("can't access this website without secure boot and a TPM and a known-good operating system"), paying for every single visit or for subscriptions to every site you visit (which leads to centralisation because nobody wants a subscription to just your blog), or self-hosted firewalls like Anubis that mostly rely on AI abuse being the result of lazy or cheap parties.

People drinking the AI Kool-Aid will tell you to just ignore the problem, pay for the extra costs, and scale up your servers, because it's *the future*, but ignoring problems is exactly why Cloudflare still exists. If ISPs hadn't ignored spoofing, DDoS attacks, botnets within their network, """residential proxies""", and other such malicious acts, Cloudflare would've been an Akamai competitor rather than a middle man to most of the internet.

Certificate authorities don't block humans if they 'look' like a bot

AI poisoning is a better protection. Cloudflare is capable of serving stashes of bad data to AI bots as protective barrier to their clients.

AI poisoning is going to get a lot of people killed, because the AI won't stop being used.

You don't think that the AI companies will take efforts to detect and filter bad data for training? Do you suppose they are already doing this, knowing that data quality has an impact on model capabilities?

Are they? Until Let's Encrypt came along and democratised the CA scene, it was a hell hole. Web security depended on how deep your pockets were. One can argue that the same path is being laid in front of us until a Let's Encrypt comes along and democratises it. And here, since it's about attestation, how are we going to prevent gatekeepers from doing "selective attestations with arguable criteria"? How will we prevent political forces?

Sorry, the "web" isn't "open" and hasn't been for a while.

Most interaction, publication, and dissemination takes place behind authentication:

Most social media, newspapers, etc. throttle, block, or otherwise truncate non-authenticated clients.

Blogs are an extremely small tranche of information that the average netizen consumes.

>Sorry, the "web" isn't "open" and hasn't been for a while.

Doesn't matter; it still doesn't need gatekeepers, and if there are already a lot of them, we should reduce them, not increase them.

We have far too many gatekeepers as it is. Any attempt to add any more should be treated as an act of aggression.

Cloudflare seems very vocal about its desire to become yet another digital gatekeeper as of late, and so is Google. I want both reduced to rubble if they persist in it.

Several companies are looking to provide a solution for the AI bot problem. Cloudflare stands to make a lot of money if people pick their solution. But Cloudflare backing down won't make the problem go away, and someone else's bad solution will be chosen instead.

The gatekeeping described here is gatekeeping a website owner chooses. It's an alternative to pay walls, bespoke bot detection, or some kind of ID verification. Cloudflare already provides a service, but standardising the service will open up the market (at the cost of competitors adopting Cloudflare's standard).

The freedom of the open web also extends to the owners of the websites people visit.

What do you mean Google "desires" to become a gatekeeper? They have been a gatekeeper for years, since they control the browser everyone uses, and Firefox usage is now in the noise. Google just steers the www where they want it to go. Killing ublock, pushing .webp trash, etc.

> An allowlist run by ONE company?

An allowlist run by one company that site owners chose to engage with. But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…

Cloudflare is implementing the (still-emerging) Web Bot Auth standard. We're working on the same at Stytch for https://IsAgent.dev .

The discourse around this is a little wild and I'm glad you said this. The allowlist is a Cloudflare feature and their customers are free to use it. The core functionality involving HTTP Message Signatures is decentralized and open, so anyone can adopt it and benefit.
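
For anyone wondering what the open part looks like: the core of HTTP Message Signatures is that the bot signs a canonicalized set of request components with a key it publishes, and the origin verifies the signature before deciding how to treat the request. A deliberately simplified Python sketch (the covered fields and their layout are illustrative, not the exact RFC 9421 wire format):

```python
import base64
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Hypothetical bot key pair; in Web Bot Auth the public key would be discoverable
# via a directory that the site (or its CDN) trusts.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

def signature_base(method: str, authority: str, path: str, agent: str) -> bytes:
    # Simplified stand-in for the RFC 9421 signature base: the covered components
    # are canonicalized into one byte string that both sides reconstruct identically.
    return f'"@method": {method}\n"@authority": {authority}\n"@path": {path}\n"signature-agent": {agent}'.encode()

# Bot side: sign the request components and attach the result as a header value.
base = signature_base("GET", "example.com", "/post/42", "crawler.example")
sig_header = base64.b64encode(private_key.sign(base)).decode()

# Origin side: rebuild the same base from the incoming request and verify.
try:
    public_key.verify(base64.b64decode(sig_header), base)
    print("request provably comes from the key holder; apply the site's policy for that agent")
except InvalidSignature:
    print("unverified bot traffic; fall back to rate limiting or a challenge")
```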

THANK YOU. The discourse on this is wild; people seem to be ranting against the Web Bot Auth standard without understanding what it is, because of their (honestly quite legitimate) fears about Cloudflare's gatekeeping near-monopoly.

If there's a way that the Web Bot Auth standard might make their near-monopoly more harmful, we can talk about it, but let's focus on that -- the Web Bot Auth standard itself is solving a problem that we in fact need solved, and seems to be designed properly for the use case. From my point of view as a site operator, it will actually help me allow in bot agents I want to allow in, which I'm currently forced to block because my only defense is to block all bot actors, without exception, given their expense to my site. I want to be able to make exceptions!

The giant wave of ridiculous distributed bot traffic of the past 1-2 years is very very real.

> An allowlist run by one company that site owners chose to engage with.

Exactly, no problem with that, just hinting that's not a protocol.

> But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts

Wait, what?

> Wait, what?

I was referring to the following image:

https://substackcdn.com/image/fetch/$s_!zRK-!,w_1250,h_703,c...

It's a frying pan/fire choice that could create a de-facto standard we end up depending on, during a critical moment where the hot topic could have a protocol or standards based solution. Cloudflare is actively trying to make a blue ocean for themselves of a real issue affecting everyone.

>But the irony of taking an ideological stance about fairness while using AI generated comics for blog posts…

"But you participate in society!"

This is sort of like how email is based on Internet standards but a large percentage of email users use Gmail. The Internet standards Cloudflare is promoting are open, but Cloudflare has a lot of power due to having so many customers.

(What are some good alternatives to Cloudflare?)

Another way the situation is similar: email delivery is often unreliable and hard to implement due to spam filters. A similar thing seems to be happening to the web.

It is a big problem. There is no good alternative to Cloudflare as a free CDN. They put servers all over the world and they are giving them away for free. And making their money on premium serverless services.

Not to mention the big cloud providers are unhinged with their egress pricing.

> Not to mention the big cloud providers are unhinged with their egress pricing.

I always wonder why this status quo persisted even after Cloudflare. Their pricing is indeed so unhinged, that they're not even in consideration for me for things where egress is a variable.

Why is egress seemingly free for Cloudflare or Hetzner but feels like they launch spaceships at AWS and GCP every time you send a data packet to the outside world?

The web doesn't need attestation. It doesn't need signed agents. It doesn't need Cloudflare deciding who's a "real" user agent. It needs people to remember that "public" means PUBLIC and implement basic damn rate limiting if they can't handle the traffic.

The web doesn't need to know if you're a human, a bot, or a dog. It just needs to serve bytes to whoever asks, within reasonable resource constraints. That's it. That's the open web. You'll miss it when it's gone.

Basic damn rate limiting is pretty damn exploitable. Even ignoring botnets (which is impossible), usefully rate limiting IPv6 is anything but basic. If you just pick some prefix from /48 to /64 to key your rate limits on, you'll either be exploitable by IPs from providers that hand out /48s like candy or you'll bucket a ton of mobile users together for a single rate limit.
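
To make the IPv6 point concrete, here is a sketch of collapsing addresses to a prefix before keying the limiter; the prefix length choice is exactly the trade-off being described:

```python
import ipaddress

def rate_limit_key(addr: str) -> str:
    """Collapse an address to the prefix used as the rate-limit bucket."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ip)  # one IPv4 address, one bucket
    # For IPv6 there is no safe universal choice: a /64 buckets a whole home or
    # mobile-carrier segment together, while a /48 hands an abuser 65,536 fresh /64s.
    return str(ipaddress.ip_network(f"{ip}/64", strict=False).network_address)

print(rate_limit_key("203.0.113.7"))               # -> 203.0.113.7
print(rate_limit_key("2001:db8:abcd:12:1:2:3:4"))  # -> 2001:db8:abcd:12::
```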

You make unauthenticated requests cheap enough that you don't care about volume. Reserve rate limiting for authenticated users where you have real identity. The open web survives by being genuinely free to serve, not by trying to guess who's "real."

A basic Varnish setup should get you most of the way there, no agent signing required!

What you're proposing is that a lot of small websites should simply shut down, in the name of the open internet. The goals seem self contradictory.

Modern AI crawlers are indistinguishable from malicious botnets. There's no longer any rate limiting strategy that's effective, that's entirely the point of what cloudflare is attempting to solve

"It needs people to remember that "public" means PUBLIC and implement basic damn rate limiting if they can't handle the traffic."

And publish the acceptable rate.

But anyone who has ever been blocked for sending a _single_ HTTP request with the "wrong" user-agent string knows that the issue website operators are worried about is not necessarily rate (behaviour). Website operators routinely believe there is no such thing as a well-behaved bot. Thus they disregard behaviour and only focus on identity. If their crude heuristics with high probability of false positives suggest "bot" as the identity then their decision is to block, irrespective of behaviour, and ignore any possibility the heuristics may have failed. Operators routinely make (incorrect) assumptions about intent based on identity not behaviour.

Yes, I think that you are right (although rate limiting can sometimes be difficult to work properly).

Delegation of authorization can be useful for things that require it (as in some of the examples given in the article), but public files should not require authorization nor authentication for accessing it. Even if delegation of authorization is helpful for some uses, Cloudflare (or anyone else, other than whoever is delegating the authorization) does not need to be involved in them.

> public files should not require authorization nor authentication for accessing it

Define "public files" in this case?

If I have a server with files, those are my private files. If I choose to make them accessible to the world then that's fine, but they're still private files and no one else has a right to access them except under the conditions that I set.

What Cloudflare is suggesting is that content owners (such as myself, HN, the New York Times, etc.) should be provided with the tools to restrict access to their content if unfettered access to all people is burdensome to them. For example, if AI scraper bots are running up your bandwidth bill or server load, shouldn't you be able to stop them? I would argue yes.

And yet you can't. These AI bots will ignore your robots.txt, they'll change user agents if you start to block their user agents, they'll use different IP subnets if you start to block IP subnets. They behave like extremely bad actors and ignore every single way you can tell them that they're not welcome. They take and take and provide nothing in return, and they'll do so until your website collapses under the weight and your readers or users leave to go somewhere else.

> within reasonable resource constraints

And let’s all hold hands and sing koombaya

Maybe the title means something more like "The web should not have gatekeepers (Cloudflare)". They do seem to say as much toward the end:

>We need protocols, not gatekeepers.

But until we have working protocols, many webmasters literally do need a gatekeeper if they want to realistically keep their site safe and online.

I wish this weren't the case, but I believe the "protocol" era of the web was basically ended when proprietary web 2.0 platforms emerged that explicitly locked users in with non-open protocols. Facebook doesn't want you to use Messenger in an open client next to AIM, MSN, and IRC. And the bad guys won.

But like I said, I hope I'm wrong.

>We need protocols, not gatekeepers

The funny thing is that this blog post is complaining about a proposed protocol from Cloudflare (one which will identify bots so that good bots can be permitted). The signup form is just a method to ask Cloudflare (or any other website owner/CDN) to be categorized as a good bot.

It's not a great protocol if you're in the business of scraping websites or selling people bots to access websites for them, but it's a great protocol for people who just want their website to work without being overwhelmed by the bad side of the internet.

The whitelist approach Cloudflare takes isn't good for the internet, but for website owners who are already behind Cloudflare, it's better than the alternative. Someone will need to come up with a better protocol that also serves the website owners' needs if they want Cloudflare to fail here. The AI industry simply doesn't want to cooperate, so their hand must be forced, and only companies like Cloudflare are powerful enough to accomplish that.

Conventional crawlers already have a way to identify themselves, via a JSON file containing a list of IP addresses. Cloudflare is fully aware of this de facto standard.
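
As an illustration of that de facto convention (the URL and JSON schema below are placeholders; each crawler documents its own), verifying a claimed crawler is just a membership check against its published ranges:

```python
import ipaddress
import json
from urllib.request import urlopen

# Hypothetical URL; each well-behaved crawler publishes its own list and documents where it lives.
RANGES_URL = "https://crawler.example/ip-ranges.json"

def load_ranges(url: str = RANGES_URL) -> list:
    """Assumes a simple schema like {"prefixes": ["66.249.64.0/19", ...]}."""
    with urlopen(url) as resp:
        data = json.load(resp)
    return [ipaddress.ip_network(prefix) for prefix in data["prefixes"]]

def claims_check_out(client_ip: str, ranges: list) -> bool:
    """True if the client IP falls inside any of the crawler's published prefixes."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in ranges)
```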

I agree with pretty much everything the author has said. I’ve been looking at the problem more on the enterprise side of things: how do you control what agents can and can’t do on a complex private network, let alone the internet.

I’ve actually just built an “identity token” using biscuit that you can delegate however you want after. So I can authenticate (to my service, but it could be federated or something just as well), get a token, then choose to create a delegated identity token from that for my agent. Then my agent could do the same for subagents.

In my system, you then have to exchange your identity token for an authorization token to do anything (single scope, single use).

For the internet, I’ve wondered about exchanging the identity token + a small payment (like a minuscule crypto amount) for an authorization token. Human users would barely spend anything. Bots crawling the web would spend a lot.
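
Not biscuit's actual format, but a macaroon-style HMAC chain is a compact way to show the delegation shape being described: whoever holds a token can only narrow it before handing it to an agent, and the issuer can verify the whole chain. Everything here (key, caveat strings, helper names) is illustrative:

```python
import hashlib
import hmac

ROOT_KEY = b"server-side secret"  # never leaves the issuing service

def mint(identity: str) -> tuple[list, bytes]:
    """Root identity token: a caveat list plus a chained MAC."""
    caveats = [f"user = {identity}"]
    sig = hmac.new(ROOT_KEY, caveats[0].encode(), hashlib.sha256).digest()
    return caveats, sig

def attenuate(caveats: list, sig: bytes, caveat: str) -> tuple[list, bytes]:
    """Anyone holding a token can narrow it (e.g. before handing it to an agent), never widen it."""
    new_sig = hmac.new(sig, caveat.encode(), hashlib.sha256).digest()
    return caveats + [caveat], new_sig

def verify(caveats: list, sig: bytes) -> bool:
    """The issuer recomputes the MAC chain from the root key (evaluating caveat predicates is elided)."""
    expected = hmac.new(ROOT_KEY, caveats[0].encode(), hashlib.sha256).digest()
    for caveat in caveats[1:]:
        expected = hmac.new(expected, caveat.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(expected, sig)

token = mint("alice")
agent_token = attenuate(*token, "agent = research-bot")
sub_token = attenuate(*agent_token, "scope = read:blog")
assert verify(*sub_token)
```

The exchange step described above would sit on top of this: present the attenuated token (plus, optionally, a micro-payment) and receive a short-lived, single-scope authorization credential back.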

I think the reality is, we need identity on both the client and server sides.

At some point soon, if not now, assume everything is generated by AI unless proven otherwise using a decentralized ID.

Likewise, on the server side, assume it’s a bot unless proven otherwise using a decentralized ID.

We can still have anonymity using decentralized IDs. An identity can be an anonymous identity, it’s not all (verified by some central official party) or nothing.

It comes down to different levels of trust.

Decoupling identity and trust is the next step.

It's called an IP address. Since some ISPs don't assign a fixed IP to a subscriber, a timestamp is nowadays necessary. The combination is traceable to a subscriber who is responsible for the line, either to work with law enforcement if subpoenaed or to not send abusive traffic via the line themselves

Why law enforcement doesn't do their job, resulting in people not bothering to report things anymore, is imo the real issue here. Third party identification services to replace a failing government branch is pretty ugly as a workaround, but perhaps less ugly than the commercial gatekeepers popping up today

DID spec, also used in ATProto, is quite flexible. It would be nice to see it used in more places and processes

https://www.w3.org/TR/did-1.1/

I pretty much use Perplexity exclusively at this point, instead of Google. I'd rather just get my questions answered than navigate all of the ads and slowness that Google provides. I'm fine with paying a small monthly fee, but I don't want Cloudflare being the gatekeeper.

Perhaps a way to serve ads through the agents would be good enough. I'd prefer that to be some open protocol than controlled by a company.

This has been my experience more recently as well. I've finally migrated from Google to Brave Search, since Google was just slow for me.

I also appreciate the AI search results a bit when I'm looking for something very specific (like what the YAML definition for a Docker Swarm deployment constraint looks like), because the AI just gives me the snippet while the search results are 300 Medium blog posts about how to use Docker and none of them explain the variables or what each does. Even the official Docker documentation website is a mess to navigate and find anything relevant in!

Not to mention how much worse it is on mobile. Every web site asks me to accept their cookies, close layers of ads with tiny buttons, and loads slowly with ads spread throughout the content. And that’s just to figure out if I’m even on the right page.

Perplexity has been one of the AI companies that created the problem that gave rise to this CF proposal. Why doesn't Perplexity invest more into being a responsible scraper?

https://blog.cloudflare.com/perplexity-is-using-stealth-unde...

Re-read what I wrote.

Perplexity is the problem Cloudflare and companies like it are trying to solve. The company refuses to take no for an answer and will mislead and fake their way through until they've crawled the content they wanted to crawl.

The problem isn't just that ads can't be served. It's that every technical measure attempting to block their service is met with new ways of misleading website owners and the services they use. Perplexity evades any attempt at detecting and preventing abuse from their servers.

None of this would've been necessary if companies like Perplexity had just acted like responsible web services and told their customers, "sorry, this website doesn't allow Perplexity to act on your behalf".

The open protocol you want already exists: it's the user agent. A responsible bot will set the correct user agent, maybe follow the instructions in robots.txt, and leave it at that. Companies like Perplexity (and many (AI) scrapers) don't want to participate in such a protocol. They will seek out and abuse any loopholes in any well-intended protocol anyone can come up with.
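For what it's worth, participating in that protocol is very little code. A minimal sketch, assuming a made-up agent name and info URL:

```python
# Announce who you are and honor robots.txt before fetching anything.
import urllib.request
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleAgent/1.0 (+https://example.com/bot-info)"

def polite_fetch(url: str) -> bytes | None:
    robots = RobotFileParser()
    robots.set_url(urljoin(url, "/robots.txt"))
    robots.read()
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the site said no; a responsible bot stops here
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```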

I don't think anyone wants Cloudflare to have even more influence on the internet, but it's thanks to the growth of inconsiderate AI companies like Perplexity that these measures are necessary. The protocol Cloudflare proposes is open (it's just a signature); the problem people have with it is that they have to ask Cloudflare nicely to permit website owners to track and prevent abuse from bots. For any Azure-gated websites, your bot would need to ask permission there as well, as with Akamai-gated websites, and maybe even individual websites.

A new protocol is a technical solution. Technical solutions work for technical problems. The problem Cloudflare is trying to solve isn't a technical problem; it's a social problem.

You’re referencing an old and outdated technology that has no way to handle things like revenue and attribution. New protocols will need to evolve to match current use. Owners want money, so make the protocol focus on that use case.

I’m not here to propose a solution. I’m here as an end-user saying I won’t go back to the old experience which is outdated and broken.

>but I don't want Cloudflare being the gatekeeper

Cloudflare is not the gatekeeper; it's the owner of the site that blocks Perplexity who's "gatekeeping" you. You're telling me that's not right?

Cloudflare is a gatekeeper because they’re trying to insert themselves between the owner and the end-user. Despite all the altruistic signaling, they really just want to capitalize on AI. And they’re happy to do that even if it results in a subpar experience for the end-user. They started this with a focus on news organizations, so I’m not particularly excited about trying to block AI access and lock down the web through one private company just so we can preserve 90s era clickbait businesses.

Cloudflare has been really annoying lately. It looks like they desperately want to close off the web so they can take their 30% cut of AI crawling fees.

We don't need gatekeepers. We do need a way to verify agents that act, in a reasonable way, on behalf of a human, versus an agent swarm/bot-mining operation (whether run by a large lab or by a kid programming Claude Code to DDoS his buddy's Next.js deployment).

Cloudflare slows whole damn websites down. It takes many seconds to get through their trash. I hope they crash and burn. Let's get back to very low-latency websites without the Cloudflare garbage.

Cloudflare as a CDN greatly, greatly speeds up the web.

All the custom code they write on top of that to transform HTML for you? Ehhhh... don't use those features. Most are easily reproducible on the backend.

So Cloudflare becomes the gatekeeper then?

I kind of want my site to be indexed by agents and used without any interference.

If you don't use Cloudflare, your website will be indexed by everyone. The gatekeeper aspect only applies if you use Cloudflare to distribute your website (and even then, Cloudflare offers options to control this bot-shield behavior).

I want it to be indexed by everyone; that's the whole point.

So what then, Cloudflare can use all these websites as leverage against Google, OpenAI, and Microsoft? I kind of want my content to be indexed.

The private tracker community has long since figured this out. Put content behind invite-only user registration, and treeban users if they ever break the rules.

This doesn't scale to the general web, does it? I think invite-only might work to build communities, but you end up in the situation we're in today where people are buying/selling invites, and that's with treebans in place.

I do fear that the current bot landscape is going to push almost everything behind auth walls though, perhaps even paid auth walls.

I've been considering building this for the web. Why wouldn't it scale? Those selling invites would get banned soon enough if the people they hand invites to then send abusive traffic. Mystery shoppers can also make it a risky business if selling invites is disallowed (forcing them to be mostly free, so the giver has nothing to gain from inviting someone who is willing to pay).

The practical problem I saw instead was bootstrapping: how do you convince any website owner to use it when very few people are on the system? And where should those people find someone to get invites from?

As for tracking (auth walls), the website need not know who you are. It just sees random tokens with signatures and can verify the signature. If there's abuse, it sends evidence to the tree system, where it could be handled similarly to HN: lots of flags from different systems make an automated response kick in, but otherwise a person looks at the issue and decides whether to issue a warning or a timeout. (Of course, the abuse-reporting mechanism can itself be abused, so, again similar to HN, if you abuse the abuse mechanism then your reports stop counting towards future actions.)
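Roughly what the website's side of that could look like, assuming the tokens are Ed25519-signed blobs (the key format and the reporting flow are my assumptions, not part of any spec):

```python
# The site only sees a pseudonymous public key, a token, and a signature. It
# can verify the signature and, on abuse, report that key to the hypothetical
# tree system, without ever learning who the visitor is.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def token_is_valid(pub_key_bytes: bytes, token: bytes, signature: bytes) -> bool:
    """True if the presented token really was signed by the presented key."""
    try:
        Ed25519PublicKey.from_public_bytes(pub_key_bytes).verify(signature, token)
        return True
    except InvalidSignature:
        return False
```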

Ideally, we'd not need this and could let real judges do the job of convicting people of abuse and computer fraud, but until such time, I'd rather use the internet anonymously with whatever setup I like than face blocks regularly while doing nothing wrong.

As a Cloudflare customer, I am happy with their proposition. I personally do not want companies like Perplexity, which fake their user agent and ignore my robots.txt, trespassing on my sites.

And isn't this why people sign up with Cloudflare in the first place? For bot protection? To me, this is just the same, but with agents.

I love the idea of an open internet, but it requires all parties to be honest. A company like Perplexity that fakes its user agent to get around blocks disrespects that idea.

My attitude towards agents is positive. If a user uses an LLM to access my websites and web apps, I'm all for it. But the LLM providers must disclose who they are: that they are OpenAI, Google, Meta, or the snake-oil company Perplexity.

Your complaints about "faking their user-agent" remind me of this 15-year-old but still relevant classic post about the history of the user-agent string:

https://webaim.org/blog/user-agent-string-history/

TL;DR: the UA string has always been "faked", even in the scenarios you might think are most legitimate.

The traditional UA fakery (adding Mozilla to the start and then just tacking on browser engine names) was a workaround for outdated websites that broke in browsers they didn't recognize.

The problematic fakery here is that bots are pretending to be people by emulating browsers to evade rate limits and other technical controls.

That second category has also been with us since the dawn of the internet, but it has always been something worth complaining about. No trustworthy tool or service will pretend to be a real browser, at least not by default.

If AI agents just identified themselves as such, we wouldn't need elaborate schemes to block them when they need to be blocked.

The point is: "should everyone just have an account with Cloudflare, then?"

With what they say about authorization, I think X.509 would help. (Although central certificate authorities are often used with X.509, it does not have to be that way; the service you are operating can issue the certificate to you instead, or they can accept a self-signed certificate which is associated with you the first time it is used to create an account on their service.)

You can use the admin certificate issued to you to issue a certificate to the agent, which will contain an extension limiting what it can be used for (and it might also expire in a few hours, and might be revoked later). That certificate can in turn be used to issue even more restricted certificates to sub-agents.
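A rough sketch of that chain using Python's cryptography package. The names and the four-hour lifetime are illustrative, and the service-defined extension that narrows what the agent may do is omitted, since it would be whatever the service documents:

```python
# The admin certificate's key signs a short-lived certificate for the agent's
# own key. ca=True with path_length=0 lets the agent issue leaf certificates
# to sub-agents but nothing deeper; revocation is left out of this sketch.
import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

admin_key = Ed25519PrivateKey.generate()   # key behind your admin certificate
agent_key = Ed25519PrivateKey.generate()   # key the agent will hold

now = datetime.datetime.now(datetime.timezone.utc)
agent_cert = (
    x509.CertificateBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "alice-agent")]))
    .issuer_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "alice-admin")]))
    .public_key(agent_key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + datetime.timedelta(hours=4))  # expires in a few hours
    .add_extension(x509.BasicConstraints(ca=True, path_length=0), critical=True)
    .sign(admin_key, algorithm=None)  # Ed25519 signing takes no separate digest
)
```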

This is already possible (and would be better than the "fine-grained personal access tokens" that GitHub uses), but does not seem to be commonly implemented. It also improves security in other ways.

So, it can be done in such a way that Cloudflare does not need to issue authorization to you, or necessarily to be involved at all. Google does not need to be involved either.

However, that only covers things that would normally require authorization anyways. Reading public data is not something that should require authorization; the problems there are excessive scraping (there seems to be far too much LLM scraping, and other scraping that is just as excessive) and excessive blocking (e.g. someone using a different web browser, or curl to download one file, or even someone using a common browser and configuration where something strange and unexpected happens, etc.). The above is unrelated to that, so certificates and the like do not help, because they solve a different problem.

What problem does this solve that a basic API key doesn't solve already? The issue with that approach is that you will require accounts/keys/certificates for all hosts you intend to visit, and malicious bots can create as many accounts as they need. You're just adding a registration step to the crawling process.

Your suggested approach works for websites that want to offer AI access as a service to their customers, but the problem Cloudflare is trying to solve is that most AI bots are doing things that website owners don't want them to do. The goal is to identify and block bad actors, not to make things easier for good actors.

Using mTLS/client certificates also exposes people (who don't use AI bots) to the awful UI that browsers have for this kind of authentication. We'll need to get that sorted before an X.509-based solution makes any sense.

> What problem does this solve that a basic API key doesn't solve already?

Many things, including improved security and the possibility of delegating authorization in the ways described in their article (if you do not restrict the certificate from issuing further certificates, if you define an extension for use with your service to specify narrower authorization, and if you document this).

> The issue with that approach is that you will require accounts/keys/certificates for all hosts you intend to visit, and malicious bots can create as many accounts as they need. You're just adding a registration step to the crawling process.

Read the last paragraph of what I wrote, which explains why that issue does not apply. However, even if registration is required (which I say should not be required for most things anyways, especially read-only stuff), it does not necessarily have to be that fast or automatic.

> Your suggested approach works for websites that want to offer AI access as a service to their customers, but the problem Cloudflare is trying to solve is that most AI bots are doing things that website owners don't want them to do. The goal is to identify and block bad actors, not to make things easier for good actors.

The approach I describe would work for many things where authentication and authorization help (most of which do not involve AI).

I do know that it does not solve the problem Cloudflare is trying to solve, but it does do what the article says about authorization, and in a secure way. And it is open, interoperable, and standardized.

The problem that Cloudflare is trying to solve cannot be solved in this way, and the way Cloudflare tries to do it is not good either.

The things AI bots are doing to others' sites include excessive scraping rather than accessing private data (and even where they do the latter, Cloudflare's solution won't help with that at all either). (There is also excessive blocking, but Cloudflare is part of that problem, even if some of the things they do sometimes help.)

See comment 45068556. Not everything should require authentication or authorization. Also see many other comments that explain why it does not help.

> Using mTLS/client certificates also exposes people (that don't use AI bots) to the awful UI that browsers have for this kind of authentication. We'll need to get that sorted before an X509-based solution makes any sense.

OK, that is a valid point, but it could be improved independently. (Before it is fixed (and even afterward, if wanted), X.509 could be offered as just one type of authentication; the service could also allow a username/password (and/or other factors, such as TOTP) for people who do not want to use X.509.)

Also, AI bots are not the only kind of automated access (and not one that I use personally, although other people might); you could also be using an API for other purposes, or a command-line program for manual access without a web browser, etc.