I like research projects on subjects that I feel have no hope. So here’s hoping for hope! This research attempts specifically to defend onion services from being fingerprinted. The most common attack scenario is an adversary who can inspect the traffic between the Tor client and the network and correlate the volume of traffic sent with the known sizes of onion service pages. I believe this is a very accurate method of identifying the web pages an anonymous user visits.
The defense presented in this research uses various methods to make it more difficult to correlate the transfer size with the intended web page.
Right now there are PoCs in the GitHub project below with scripts that do a lot of this padding. It sounds like the Tor Project has accepted that padding is a necessary evil, even with the cost in latency and bandwidth. In this case, implementing some of these methods increases latency by 50% and decreases available bandwidth by 85%, but the results in the paper show that doing so reduces the accuracy of an attack from 69.6% to 10%.
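To make the size-correlation idea concrete, here’s a minimal sketch (my own illustration, not the paper’s actual scripts) of one common padding strategy: pad every transfer up to the next power-of-two bucket, so many distinct page sizes collapse into the same observed size, at the cost of wasted bandwidth.

```python
import math

def padded_size(true_size: int) -> int:
    """Pad a transfer up to the next power-of-two bucket.

    Collapsing many real sizes into a few bucket sizes makes it
    harder to correlate observed traffic volume with a known page.
    """
    if true_size <= 1:
        return 1
    return 2 ** math.ceil(math.log2(true_size))

def overhead(true_size: int) -> float:
    """Fraction of extra bytes sent as padding."""
    return padded_size(true_size) / true_size - 1

# Pages of 600, 700, and 900 cells all look like 1024 cells on the wire.
assert padded_size(600) == padded_size(700) == padded_size(900) == 1024
```

The trade-off the paper quantifies falls straight out of a scheme like this: coarser buckets hide more pages behind each observed size but waste more bandwidth on padding.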
You can also check out one of the co-authors’ other papers, Bayes, not Naive: Security Bounds on Website Fingerprinting Defenses, which goes deeper into fingerprinting attack and defense metrics.
Just when you thought website fingerprinting was bad, this research goes into fingerprinting the specific keywords you search for on Google, Bing, or DuckDuckGo. They call it keyword fingerprinting (KF), and they apply some of the same machine learning methods used for website traffic correlation specifically to search engines. And just like the other attack, it requires a set of keywords the classifier has already been trained on. You can’t identify some random search string; it has to be a value in the original data set.
Nick Mathewson noted that some interesting new features were used as classifier inputs: cumulative sizes of TLS records, the number of Tor cells, and total packet statistics.
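A rough sketch of what a feature vector built from those signals might look like (illustrative only; the paper’s exact feature set and sampling differ): given a trace of signed packet sizes, derive cumulative byte counts, an estimated Tor cell count, and simple packet totals.

```python
def extract_features(trace):
    """Build a toy fingerprinting feature vector from a traffic trace.

    `trace` is a list of signed packet sizes: positive = outgoing,
    negative = incoming. Features are in the spirit of those mentioned
    above: the cumulative transfer curve sampled at fixed points, an
    estimated Tor cell count, and total packet counts per direction.
    """
    CELL_SIZE = 512  # classic Tor cell size on the wire

    # Cumulative absolute bytes over the trace.
    cumulative, total = [], 0
    for size in trace:
        total += abs(size)
        cumulative.append(total)

    # Sample the cumulative curve at a fixed number of points so traces
    # of different lengths always yield equal-length feature vectors.
    n_samples = 4
    samples = [
        cumulative[min(len(cumulative) - 1, i * len(cumulative) // n_samples)]
        for i in range(1, n_samples + 1)
    ]

    n_cells = total // CELL_SIZE
    n_out = sum(1 for s in trace if s > 0)
    n_in = len(trace) - n_out
    return samples + [n_cells, n_out, n_in]

features = extract_features([512, -1460, -1460, 512, -1460])
```

Vectors like this are what the supervised classifier is trained on, one labeled trace per keyword in the training set.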
Funny how one of the keyword datasets used for their supervised learning system was the Google Blacklisted Keywords list from 2600.com.
The bad of it: this is shockingly accurate under the right conditions. The paper outlines the different learning methods and which ones are most effective.
The good: performing this analysis on real-world traffic makes it less accurate. Active defenses like padding are also discussed in the paper and seem to offer decent mitigation. Plus, as I mentioned, the classifier has to be trained on a specific site with specific words, making the attack difficult to scale. Cool stuff nonetheless.
Now Se Eun Oh presents "Fingerprinting Past the Front Page: Identifying Keywords in Search Engine Queries over Tor" #pets17— Nick Mathewson (@nickm_tor) July 19, 2017