Science gave us miracles

September 24, 2024

[GIF: Tony Montana, Scarface]

...

Comments

Anonymous, September 24, 2024 at 6:02 PM
For three decades, a tiny text file has kept the internet from chaos. This text file has no particular legal or technical authority, and it’s not even particularly complicated. It represents a handshake deal between some of the earliest pioneers of the internet to respect each other’s wishes and build the internet in a way that benefitted everybody. It’s a mini constitution for the internet, written in code.

It’s called robots.txt and is usually located at yourwebsite.com/robots.txt. That file allows anyone who runs a website — big or small, cooking blog or multinational corporation — to tell the web who’s allowed in and who isn’t. Which search engines can index your site? What archival projects can grab a version of your page and save it? Can competitors keep tabs on your pages for their own files? You get to decide and declare that to the web.
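As a minimal sketch of the "usually located at yourwebsite.com/robots.txt" convention, the snippet below derives that location from any page address on a site. The helper name `robots_url` and the sample URL are made up for illustration; only the standard library is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the conventional robots.txt location for the site hosting page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host: the file sits at the root of the site,
    # so the page's own path and query string are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://yourwebsite.com/recipes/soup?page=2"))
# prints https://yourwebsite.com/robots.txt
```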

It’s not a perfect system, but it works. Used to, anyway. For decades, the main focus of robots.txt was on search engines; you’d let them scrape your site and in exchange they’d promise to send people back to you. Now AI has changed the equation: companies around the web are using your site and its data to build massive sets of training data, in order to build models and products that may not acknowledge your existence at all.

The robots.txt file governs a give and take; AI feels to many like all take and no give. But there’s now so much money in AI, and the technological state of the art is changing so fast that many site owners can’t keep up. And the fundamental agreement behind robots.txt, and the web as a whole — which for so long amounted to “everybody just be cool” — may not be able to keep up either.
Anonymous, September 24, 2024 at 6:03 PM
In the early days of the internet, robots went by many names: spiders, crawlers, worms, WebAnts, web crawlers. Most of the time, they were built with good intentions. Usually it was a developer trying to build a directory of cool new websites, make sure their own site was working properly, or build a research database — this was 1993 or so, long before search engines were everywhere and in the days when you could fit most of the internet on your computer’s hard drive.
The only real problem then was the traffic: accessing the internet was slow and expensive both for the person seeing a website and the one hosting it. If you hosted your website on your computer, as many people did, or on hastily constructed server software run through your home internet connection, all it took was a few robots overzealously downloading your pages for things to break and the phone bill to spike.
Over the course of a few months in 1994, a software engineer and developer named Martijn Koster, along with a group of other web administrators and developers, came up with a solution they called the Robots Exclusion Protocol. The proposal was straightforward enough: it asked web developers to add a plain-text file to their domain specifying which robots were not allowed to scour their site, or listing pages that are off limits to all robots. (Again, this was a time when you could maintain a list of every single robot in existence — Koster and a few others helpfully did just that.) For robot makers, the deal was even simpler: respect the wishes of the text file.
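To make both halves of that deal concrete, here is a rough sketch using Python's standard `urllib.robotparser` module. The file contents, the robot names `BadBot` and `FriendlyBot`, and the `example.com` URLs are all invented for illustration: one named robot is banned outright, and two paths are off limits to every robot. A well-behaved robot would run a check like `allowed` before each request.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file contents: one named robot is banned entirely,
# and two paths are off limits to every robot.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(robot: str, url: str) -> bool:
    """The robot maker's side of the deal: ask the file before fetching."""
    return parser.can_fetch(robot, url)


print(allowed("BadBot", "https://example.com/index.html"))               # False
print(allowed("FriendlyBot", "https://example.com/index.html"))          # True
print(allowed("FriendlyBot", "https://example.com/private/notes.html"))  # False
```

Nothing in the file enforces any of this. The check only matters if the robot's author chooses to run it.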
From the beginning, Koster made clear that he didn’t hate robots, nor did he intend to get rid of them. “Robots are one of the few aspects of the web that cause operational problems and cause people grief,” he said in an initial email to a mailing list called WWW-Talk (which included early-internet pioneers like Tim Berners-Lee and Marc Andreessen) in early 1994. “At the same time they do provide useful services.” Koster cautioned against arguing about whether robots are good or bad — because it doesn’t matter, they’re here and not going away. He was simply trying to design a system that might “minimise the problems and may well maximize the benefits.”
By the summer of that year, his proposal had become a standard — not an official one, but more or less a universally accepted one. Koster pinged the WWW-Talk group again in June with an update. “In short it is a method of guiding robots away from certain areas in a Web server’s URL space, by providing a simple text file on the server,” he wrote. “This is especially handy if you have large archives, CGI scripts with massive URL subtrees, temporary information, or you simply don’t want to serve robots.” He’d set up a topic-specific mailing list, where its members had agreed on some basic syntax and structure for those text files, changed the file’s name from RobotsNotWanted.txt to a simple robots.txt, and pretty much all agreed to support it.
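The "basic syntax and structure" is small enough to read by hand. The sketch below is a simplified reader for `User-agent` and `Disallow` lines, not a full implementation of the later formal standard. The function names and the sample file (which fences off a script area and a scratch area, echoing Koster's examples) are assumptions made for illustration.

```python
def parse_robots(text: str) -> dict[str, list[str]]:
    """Map each User-agent name to the path prefixes it is asked to avoid."""
    rules: dict[str, list[str]] = {}
    agents: list[str] = []  # agents named by the record currently being read
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped:
            agents = []  # a blank line ends the current record
            continue
        line = stripped.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # comment-only line
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
            rules.setdefault(value, [])
        elif field == "disallow" and value:
            for agent in agents:
                rules[agent].append(value)
    return rules


def is_blocked(rules: dict[str, list[str]], agent: str, path: str) -> bool:
    """Use the agent's own record if present, otherwise the '*' record."""
    prefixes = rules.get(agent, rules.get("*", []))
    return any(path.startswith(prefix) for prefix in prefixes)


# Hypothetical file in the spirit of Koster's examples.
EXAMPLE = """\
# Keep robots out of the script and scratch areas.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

rules = parse_robots(EXAMPLE)
print(rules)                                         # {'*': ['/cgi-bin/', '/tmp/']}
print(is_blocked(rules, "AnyBot", "/cgi-bin/search"))  # True
print(is_blocked(rules, "AnyBot", "/index.html"))      # False
```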
And for most of the next 30 years, that worked pretty well.
Anonymous, September 24, 2024 at 6:04 PM
But the internet doesn’t fit on a hard drive anymore, and the robots are vastly more powerful. Google uses them to crawl and index the entire web for its search engine, which has become the interface to the web and brings the company billions of dollars a year. Bing’s crawlers do the same, and Microsoft licenses its database to other search engines and companies. The Internet Archive uses a crawler to store webpages for posterity. Amazon’s crawlers traipse the web looking for product information, and according to a recent antitrust suit, the company uses that information to punish sellers who offer better deals away from Amazon. AI companies like OpenAI are crawling the web in order to train large language models that could once again fundamentally change the way we access and share information.
The ability to download, store, organize, and query the modern internet gives any company or developer something like the world’s accumulated knowledge to work with. In the last year or so, the rise of AI products like ChatGPT, and of the large language models underlying them, has made high-quality training data one of the internet’s most valuable commodities. That has caused internet providers of all sorts to reconsider the value of the data on their servers, and rethink who gets access to what. Being too permissive can bleed your website of all its value; being too restrictive can make you invisible. And you have to keep making that choice with new companies, new partners, and new stakes all the time.
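One way a site owner might express that choice is a per-crawler policy: turn away an AI training crawler while staying open to search. The sketch below assumes a hypothetical policy and an `example.com` URL, and uses `GPTBot` (the user-agent token OpenAI publishes for its crawler) purely as an example. It again relies on the standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: refuse one AI crawler, allow everyone else everywhere.
POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def access_report(policy: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each crawler name, whether the policy permits fetching url."""
    parser = RobotFileParser()
    parser.parse(policy.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


report = access_report(
    POLICY,
    ["GPTBot", "Googlebot", "bingbot"],
    "https://example.com/articles/1",
)
print(report)
# {'GPTBot': False, 'Googlebot': True, 'bingbot': True}
```

The policy has to be revisited whenever a new crawler appears, since an unlisted robot falls under the permissive `*` record.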
Anonymous, September 24, 2024 at 6:08 PM
The biggest question most website owners historically had to answer was whether to allow Googlebot to crawl their site. The tradeoff is fairly straightforward: if Google can crawl your page, it can index it and show it in search results. Any page you want to be Googleable, Googlebot needs to see. (How and where Google actually displays that page in search results is of course a completely different story.) The question is whether you’re willing to let Google eat some of your bandwidth and download a copy of your site in exchange for the visibility that comes with search.
For most websites, this was an easy trade. “Google is our most important spider,” says Medium CEO Tony Stubblebine. Google gets to download all of Medium’s pages, “and in exchange we get a significant amount of traffic. It’s win-win. Everyone thinks that.” This is the bargain Google made with the internet as a whole, to funnel traffic to other websites while selling ads against the search results.