
Loading common libraries from a CDN will no longer bring any shared cache benefits, at least in most major browsers. Here's Chrome's intent to ship: https://chromestatus.com/feature/5730772021411840 Safari already does this, and I think Firefox will, or is already, as well.


For info, this shipped in Chrome 86 just last week:

https://developers.google.com/web/updates/2020/10/http-cache...


I'm not sure I understand the threat here. Say I visit SiteA which references jQuery from CDNJS, then later visit SiteB which references exactly the same jQuery from CDNJS - what's the problem?


I'm guessing many websites are identifiable by which patterns of libs and specific versions they will force you to cache. SiteA would then be able to tell that a user visited SiteB (which, depending on the website, may or may not be problematic).


I'm sure some sites would be identifiable by their cached libs, but the cache is shared, so any overlapping dependencies would decrease the accuracy to unusable levels. The best you could do is know someone did not visit a site in the last ${cache_time}.

There are, of course, other vectors to consider, but I can't think of any that could be abused by third parties. If anything, isolating caches would make it easier for the CDN themselves to carry out the attack you mentioned, as they would be receiving all the requests in one batch.


What if my website tries to load a JS file that only foxnews.com loads (maybe with a less restrictive CORS config)?

I'd be able to tell if you visited Fox news recently, correct?


It would be extremely unlikely for only one site to use a specific file from a public CDN (like cdnjs). As for site-specific files like JS bundles and other static assets, those would be served on a "private" CDN, usually under the same domain (like cdn.foxnews.com) and with restrictive CORS settings for this very reason (and also to prevent bandwidth stealing).


A single file? Highly unlikely.

But three specific files can already be pretty unique. I use Chart.js with two specific plugins in my toy project, and I'm willing to bet that no one else in the world uses the exact same set and version configuration.


Exactly, but a third party can't see that set from the cache; they see the union of every website recently visited. They would see hundreds of files from many websites, and if any of those other sites use the same files yours does, it's impossible to tell for sure without a file that isn't used anywhere else on the Web. Your site uses A+B+C, site 2 uses A+D+E, site 3 uses B+F, site 4 uses C. The cache contents are A+B+C+D+E+F+... did the user visit your site? It's like trying to get individual pictures out of a single piece of film that was exposed multiple times: you can make some guesses and rule some possibilities out, but nothing will be conclusive.


Like a discount Bloom Filter?


It would behave like one, yes.


You only need ~30 bits of information to uniquely fingerprint someone.


Could you explain this further?


There are about 8 billion people. 33 bits is enough to give each one a unique number. A whole bunch of them don't have access to the Internet, so fewer than 33 bits are enough to identify someone on the Internet.


It can be used to track users across domains to some degree.


Thanks, I understand the issue now - I haven't thought about CDNs from a privacy perspective before.

I suppose with HTTP2 some of the benefits of serving JS through CDNs are gone anyway, so I guess it's time to stop using them.


Not to be dense but wasn't that always the purpose of running a CDN service for common scripts and libraries?


The script wouldn't have to be from a CDN to track people using the browser cache. I could infer whether you've visited a site that doesn't use CDNs or trackers by asking you to load something from that site and inferring whether you have that resource cached by the time it took you to load it.


This is true, but if you're running a CDN you have access to cross-domain user information just based on the headers, no?


The CDN is not the one you have to worry about.

If Site A loads a specific JavaScript file for users with an administrator account, Site B can check to see if the JavaScript file is in your cache, and infer that you must have an administrator account if the file is there.

The attack can happen with different types of resources (such as images).


This I understand, the risk of third-parties monitoring. The attacks are pretty obvious. My confusion is over what the business model of a commercial CDN is if not to track users across multiple sites? How do they pay for bandwidth?


The problem is not the CDN (or arbitrary shared source domain) being able to track you but the sites which use the CDN.

Furthermore, a CDN can't track you as simply as you might think; it often requires things which need explicit opt-in agreements on a per-website basis to be legal.

Furthermore, due to technical limitations, you can only get that permission from the user after the CDN has already been used.

CDNs can still track aggregated information to some degree but they can't legally act like a tracker cookie.


How serious is this type of threat? Compared to all the info about us that is already shared by data brokers?


There are several data-broker-esque "services" that actually do this already with FB, Google, etc assets (favicon.ico and similar, loggedIn urls, ...) to check whether you have visited those pages, or whether you are logged in to those services by trying to request a URL that might return a large image if logged in, or fail rapidly if logged out. -- This has been a thing for a long time: https://news.ycombinator.com/item?id=15499917

If you don't use any of those sites, you're considered higher risk/fraudulent user/bot.

Here's an example of a very short and easy way to see if someone is probably gay: https://pastebin.com/raw/CFaTet0K

On Chrome, I consistently get 1-5 back after it's been cached, and 100+ on a clean visit. On Firefox with resistFingerprinting, I always get 0.


Thank you, that was insightful.

> Here's an example of a very short and easy way to see if someone is probably gay

Ok, but now the resource is in my cache, so from now on they will think I'm gay?


> Ok, but now the resource is in my cache, so from now on they will think I'm gay?

This resource is just generic, so probably not, but if you actually visited Grindr's site without heavy adblocking, they load Google Tag Manager and a significant number of other tracking services, which will almost certainly associate your advertising profile and identifiers with 'gay'.

I also can't believe they send/sell your information to three pages' worth of third-party monetization providers/adtech companies for something that is this critically sensitive.


You could have run this on a private window of the browser (and in that case, they would surely think you're a closeted gay).


Couldn't this be solved by Grindr setting up CORS properly for that resource? It's unlikely anyone would ever open the script directly in their browser.


CORS wouldn't help here. CORS prevents you from reading the response or making cross-origin XHR requests, not from loading an external resource from a different domain via a script or img tag.


Fun fact: browsers put scary warnings in their dev console (and some web sites log warnings to the console) because some people love copy-pasting code they got from sketchy people trying to bypass all the browser security.


It's being actively used by the ad networks to do user fingerprinting instead of cookies, since the latter are more and more blocked.


I guess as serious as any other privacy threats but one that doesn't get enough attention in my opinion. CDNs and web fonts are definitely being used to track us and can bypass mitigations like private mode in your browser and ad/tracker blockers by tracking your IP address across sites.


Try to load assets from another domain and observe whether they were probably cached or not, and you can know whether the user visited that site.


I guess, but that disadvantage seems massively outweighed by the benefits. You can always use something like [1] to check if a client is active on interesting sites.

[1] - https://www.webdigi.co.uk/demos/how-to-detect-visitors-logge...


Are there actually any benefits though? I saw an article a few years back about how, when loading jQuery from Google's CDN, there was about a 1% chance the user had it cached already. Since you need the same library, from the same CDN source, at the same version, it's almost never the case that the user has grabbed it recently enough that it hasn't been evicted.

Plus the trend now is to use webpack and have all of your deps bundled in and served from the same server.


Can’t always use that. It’s much less specific compared to the potential of cache, only works when websites provide that type of redirect, doesn’t work if you block third-party cookies (I think a form of that might already be the default in some browsers), etc.


Maybe not with CDNJS, but perhaps you don't want every website to know you have AshleyMadison.com assets cached.


Can websites even tell what is cached and what’s pulled fresh?


Yes, using timing.


Wouldn't the act of timing a download mean that I download and pollute my cache with new assets from the site trying find where else I've been? Does this only work for the first site that tries to fingerprint a browser in this way?


Is there a noCache option? Or can JS remove entries from the cache to reset it?

Someone below mentioned doing requests for a large image that requires authentication. Short response time means the user isn't logged in (they got a 403), long response time means they downloaded the image and are logged in.


Not if the javascript starts running only after all resources have loaded.


No, there could still be timing attacks afterwards. Just dynamically request a cross-domain asset.


Then those requests should not be cached?


  // Time a cross-origin fetch and guess from the latency whether it came
  // from cache. (await needs an async wrapper; "no-cors" yields an opaque
  // response, but the timing is still observable.)
  async function probeCache(url) {
    const start = performance.now();
    await fetch(url, { mode: "no-cors" });
    const elapsed = performance.now() - start;
    if (elapsed < 10 /* ms */) {
      console.log("cached");
    } else {
      console.log("not cached");
    }
  }
  probeCache("https://example.com/asset_that_may_be_cached.jpg");


In that case, the browser would always load the asset (it is not cached). So the rule would be that only stuff that is directly in the <head> may be cached (or stuff that is on the same domain).


To be clear, the context of the thread is "why do we need to partition the HTTP cache per domain." My example code works under the (soon-to-be-false) assumption that the cache is NOT partitioned (i.e. there is a global HTTP cache).

> In that case, the browser would always load the asset (it is not cached).

Agreed, if the cache is partitioned per domain AND the current domain has not requested the resource on a prior load. If the cache is global, then the asset will be loaded from cache if it is present: https://developer.mozilla.org/en-US/docs/Web/API/Request/cac...

> So the rule would be that only stuff that is directly in the <head> may be cached (or stuff that is on the same domain).

You could be more precise here: with a domain-partitioned cache, all resources regardless of domain loaded by any previous request on the same domain could be cached. So if I load HN twice and HN uses https://example.com/image.jpg on both pages, then the second request will use the cached asset.


> To be clear, the context of the thread is "why do we need to partition the HTTP cache per domain."

Ah right, the thread is becoming long :)

> So if I load HN twice and HN uses https://example.com/image.jpg on both pages, then the second request will use the cached asset.

Good point!


On the other hand, the URL for a common library hosted on cdnjs (or one of the other big JavaScript CDNs) and included on many different websites is much more likely to already be cached on edge servers close to your users than if you host the file yourself.


The time to connect to the CDN hostname will negate any benefit, especially if HTTP/2 push can be used instead.


You can mitigate this by getting your website itself on a CDN. If the page is cached, then its assets (including JavaScript) would be too.

And by going that route you make sure that all pieces of your website have the same availability guarantees, the same performance profile, and the same security guarantees that the content was not manipulated by a 3rd party.


> And by going that route you make sure that all pieces of your website have the same availability guarantees, the same performance profile, and the same security guarantees that the content was not manipulated by a 3rd party.

You can already guarantee the security of the file by using the integrity attribute on the <script> tag. And the performance of your CDN is probably worse than the Google CDN (not to mention that you lose out on the shared cache).


I agree on the security side if you use that attribute. However:

> And the performance of your CDN is probably worse than the Google CDN

What does "probably" mean? Other CDNs (Akamai, CloudFront, Cloudflare, etc.) are also fast.

And by pushing one piece of your website to a different CDN, you force your users' browsers to create an additional HTTPS connection, which takes additional round-trips, instead of being able to leverage one connection for all assets. This alone may well outweigh the performance differences between CDNs.

Also the "shared cache" benefit might go away, if I read the other answers in this topic correctly.


Does someone know why they don't split the cookie storage by the top origin as well?

I mean, wouldn't that take care of a whole class of attack vectors and make cross-origin requests possible without having to worry about CSRF?


One of the problems is that it breaks use cases like logging into stackoverflow.com and then visiting serverfault.com, or (if you do it by top-level origin) even en.wikipedia.org and then visiting de.wikipedia.org. [1]

While privacy sensitive users may consider this a feature in case of e.g. google.com and youtube.com, the average user is more likely to consider it an annoyance, and worse, it is likely to break some obscure portal somewhere that is never going to be updated, so if one browser does it and another doesn't, the solution will be a hastily hacked note "this doesn't work in X, use Y instead" added to the portal. And no browser vendor wants to be X.

[1] The workaround of using the public suffix list for such purposes is being discouraged by the public suffix list maintainers themselves IIRC, so the "right" thing to do would be breaking Wikipedia.

Edit: If done naively on an origin basis right now, it would break the Internet. You couldn't use _any_ site/app that has login/account management on a separate host name. You couldn't log into your Google account with such a browser anymore (because accounts.google.com != mail.google.com). Countless web sites that require logins would fail, both company-internal portals and public sites.


It's possible to get around this with a redirect staple. E.g. if Google wants you to be logged in on youtube.com and google.com simultaneously:

1) User logs in at google.com/login and sets google.com cookies.

2) Server generates a nonce and redirects to youtube.com/login?auth=$NONCE

3) youtube.com checks the $NONCE and sets youtube.com cookies.

4) youtube.com redirects back to google.com.

Firefox's container tabs can maintain isolation despite this since even this redirect will stay within a container. However there is a usability penalty since the user has to open links for sites in the right container (and automatically opening certain sites in certain containers will enable cross-container stapling again).


"if"?

webapps.stackexchange.com/questions/30254/why-does-gmail-login-go-through-youtube-com



Oh, that's interesting. I guess it makes sense from a security and privacy perspective.


Will this cause performance issues for sites that use static cookieless domains for JS, images, etc.?

Google themselves do this with gstatic.net, ytimg.com, etc.


> Will this cause performance issues for sites that use static cookieless domains for js, images etc

> Google themselves do this with gstatic.net and ytimg.com etc

Most probably not. The point of cookieless domains is that you can use a very simple web server to serve content (no need to handle user sessions; files are pre-compressed and cached, etc.) and it lowers incoming bandwidth a lot. If you have a lot of requests (images, CSS, JS), the cookie information adds up quickly.

Opening video thumbnails from ytimg.com will still be cached for youtube.com as before. The only thing that will change is for embedded videos on 3rd-party websites, as those won't be able to use cached ytimg.com thumbnails from elsewhere.


Couldn't the same thing be achieved by routing e.g. google.com/static/ to a separate simple webserver, instead of using another domain? Or use a subdomain, e.g. static.google.com.

The current way seems like needless DNS spam to me...


Even if Google used a separate highly optimised webserver for google.com/static/jquery.js, users who are logged in would be sending their auth cookies when requesting the library.

Given that generally people have slower upload than download, shaving off a few bytes from requests is worth it.

I also recall that browsers [used to (?)] limit concurrent requests per domain, which this helps work around.


Good! The whole idea of loading JS from a CDN was supposed to make it easier for entry-level front-end devs to start coding. I think that is great for a school exercise, but it should never be used on business or production sites.

And on a side note, I'm very unhappy about how the bar to entry for becoming a developer has lowered significantly over the last 10 years or so.


Wouldn't a better, but partial, solution be for browsers to preload the top x common libraries? All other libraries would probably have to follow this new rule.


Isn’t this essentially what Decentraleyes does?

https://decentraleyes.org/


I've been using LocalCDN; it seems to be more actively maintained and has a better selection of libraries.

https://www.localcdn.org/


What version(s) of those libraries? I mean, I don't deal with this anymore... but I've seen sites literally loading half a dozen different copies of jQuery in the past (still makes me cringe).


Please no, don't create such barriers.


Every other language runtime has a standard library, it's always been a shortcoming of the web IMO


At that point wouldn't it make more sense to just have the browsers include that functionality?


As soon as you add something to JS you have to support it forever. It's better to let sites pick what they need and scrap what they no longer need. JS has already added most of the useful stuff from jQuery. If browsers included a built-in version of React, it would pretty much lock in the design as it is now, without room to remove and replace bad ideas.


I wonder if an nginx plugin could be made to auto-cache CDN JavaScript/CSS files and edit the HTML on the fly to serve them locally.


You can set up a path that does a cached proxy to the CDN and just edit the HTML yourself. It's a bit annoying to get the cache settings to work properly, but editing the HTML is easy.


You’re probably thinking about a caching proxy like squid cache.




