• Ekybio@lemmy.world
    21 days ago

    Can someone with more knowledge shine a bit more light on this whole situation? I’m out of the loop on the technical details.

    • snooggums@lemmy.world
      21 days ago

      AI crawlers tend to overwhelm websites by scraping data in the least efficient way possible, basically DDoSing a huge portion of the internet. Perplexity already scraped the net for training data and is now hammering it inefficiently for searches.

      Cloudflare is just trying to keep the bots from overwhelming everything.

    • BetaDoggo_@lemmy.world
      21 days ago

      Perplexity (an “AI search engine” company with 500 million in funding) can’t bypass Cloudflare’s anti-bot checks. For each search, Perplexity scrapes the top results and summarizes them for the user. Cloudflare intentionally blocks Perplexity’s scrapers because they ignore robots.txt and mimic real users to get around Cloudflare’s blocking features. Perplexity argues that its scraping is acceptable because it’s user-initiated.
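For anyone unsure what "ignoring robots.txt" means in practice, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page. It uses Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the rules, and the URLs are made up for illustration and are not taken from any real site or from Perplexity's setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (an assumption for this sketch).
# A real crawler would download this from https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The named bot is blocked everywhere; other agents only from /private/.
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", "https://example.com/article"))
    print(is_allowed(ROBOTS_TXT, "SomeBrowser", "https://example.com/article"))
```

The check is purely voluntary: nothing stops a client from skipping it or sending a different user agent, which is why sites fall back on network-level blocking.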

      Personally I think Cloudflare is in the right here. The scraped sites get 0 revenue from Perplexity searches (unless the user decides to go through the sources section and click the links) and Perplexity’s scraping is unnecessarily traffic intensive since they don’t cache the scraped data.

      • lividweasel@lemmy.world
        21 days ago

        …and Perplexity’s scraping is unnecessarily traffic intensive since they don’t cache the scraped data.

        That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.

        • jballs@lemmy.world
          21 days ago

          It’s worth giving the article a read. It seems that they’re not using the data for training, but for real-time results.

        • snooggums@lemmy.world
          21 days ago

          They do it this way in case the data changed, similar to how a person would be viewing the current site. The training was for the basic understanding; the real-time scraping is to account for changes.

          It is also horribly inefficient and works like a small-scale DDoS attack.
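Freshness and caching are not mutually exclusive. As a rough sketch (not anything Perplexity is known to run), a short-lived cache can reuse a page for a limited window and only refetch once that window expires. The class name `TTLPageCache`, the `fetch` callable, and the 300-second default are all assumptions made up for this example.

```python
import time
from typing import Callable, Dict, Tuple


class TTLPageCache:
    """Reuse a fetched page until it is older than ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical function that downloads a URL
        self._ttl = ttl_seconds      # how long a cached copy counts as fresh
        self._clock = clock          # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]         # still fresh: no request hits the site
        body = self._fetch(url)      # missing or stale: fetch once and store
        self._entries[url] = (now, body)
        return body
```

With a window of even a few minutes, many users asking about the same popular page would cost the site one request instead of one per search, at the price of results that can be slightly out of date.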

        • rdri@lemmy.world
          20 days ago

          First we complain that AI steals and trains on our data. Then we complain when it doesn’t train. Cool.