I was reading this ACLU blog post about how DreamHost was served with a warrant to hand over IP addresses of some 1.3 million visitors to a website they host, and it got me thinking: do websites really need to store IP addresses of their visitors?
There are a lot of VPN companies such as Private Internet Access that advertise far and wide that they explicitly chose not to keep any logs. The idea is that if the VPN provider is served with a warrant for user activity, they would have no data to hand over, because they never stored anything in the first place. Why don't websites do that?
Most websites log your IP address like crazy -- at the very least, your IP might show up in the server access logs as you navigate the site. Software like Apache httpd and nginx logs IPs by default in its access logs. Furthermore, most web apps explicitly store IP addresses alongside comments posted by visitors or in account logs. I'm guilty of this too: I log the IP address for each comment posted, but I realize now that I never actually do anything with that information, so I probably don't need to bother storing IP addresses at all.
At one point I thought storing IP addresses might be useful in case I had to ban a spammy user. But I never do that, because banning users by IP address is not effective: a lot of users have wildly dynamic IPs that change frequently, a whole other subset of users can change their own IP just by rebooting their cable modem, and if you keep a long-term ban on an IP ("you can never visit my site again!"), you'll just end up banning innocent users who have inherited the banned IP address.
Configure your web server software (Apache, nginx, etc.) to not store IPs in access logs. They do by default, so you'd have to explicitly configure them not to.
In your custom web app, think long and hard every time you consider putting an IP address into your database or application logs. The default answer should be "do NOT store them without a good reason."
If you have a good reason to store the IP address, still consider the length of time that you need to store it and how to make bulk record warrants impractical. Ideally, you should only store them in memory (not written to the filesystem, where they might be left in an old log file or recovered by file undeletion tools) and get rid of them as soon as you no longer need them.
If you need to temporarily store IP addresses to rate limit failed login attempts and protect account passwords from being brute forced "online," you could store this information temporarily in Redis. A typical rate limiting scheme might be to prevent more than 5 failed login attempts over the span of 30 minutes, so in this example you wouldn't need to cache an IP any longer than 30 minutes or so.
If you need to compare IP addresses (e.g. to tell whether the current request has an IP that matches a recently seen user), you might store hashes of the IP address rather than the IP address itself. This still lets you tell visitors apart, but obfuscates their IP address. If you're asked to hand over records, you would hand over the list of hashes because you didn't store the original IP.
But hashes can be brute forced (basically by guessing every possible IP address, and checking if the hash matches), and IP addresses are particularly predictable, so you'd wanna use a slow hashing algorithm.
Use a purposefully slow hashing algorithm like bcrypt or Argon2. These hashing algorithms are designed for securely storing passwords by making it so it takes a "long time" to compute even a single hash.
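Here's a minimal sketch of the hashing idea, under a few assumptions. bcrypt and Argon2 both need third-party packages in Python, so this uses PBKDF2 from the standard library as a stand-in; it is deliberately slow but not memory-hard, so treat it as an illustration rather than a recommendation. The names `hash_ip`, `same_visitor`, `SITE_SALT` and `ITERATIONS` are made up for this example, and the iteration count is a guess you'd want to tune.

```python
import hashlib
import hmac
import ipaddress
import os

# Hypothetical site-wide secret salt. It's generated at startup here, so
# hashes stop matching after a restart (fine for short-lived data).
SITE_SALT = os.urandom(16)

# Assumed work factor; higher is slower for you and for an attacker.
ITERATIONS = 200_000


def hash_ip(ip: str, salt: bytes = SITE_SALT, iterations: int = ITERATIONS) -> str:
    """Return a slow, salted hash of an IPv4 or IPv6 address."""
    # Parse first so different spellings of one address hash identically.
    packed = ipaddress.ip_address(ip).packed
    digest = hashlib.pbkdf2_hmac("sha256", packed, salt, iterations)
    return digest.hex()


def same_visitor(ip: str, stored_hash: str) -> bool:
    """Check whether an incoming IP matches a previously stored hash."""
    return hmac.compare_digest(hash_ip(ip), stored_hash)
```

Note the single shared salt: to look up "have I seen this IP before?" the same address has to produce the same hash every time, which a random per-record salt wouldn't give you without re-hashing against every stored record.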
So at best, the site host could hand over a list of bcrypt hashes for the most recent hour-or-so of user activity, and let them have fun brute forcing the IP addresses that correspond to those hashes. There are only about 4.3 billion IPv4 addresses, so while it would be slow, they might eventually crack all of them. But for IPv6 addresses? They'd probably never brute force even a single bcrypt hash, at least not without some way to narrow down which prefixes to try. Searching the whole space is akin to randomly guessing an AES 128-bit encryption key (something I've never heard of being done successfully), because IPv6 addresses are also 128-bit numbers! Have fun slowly passing those through a bcrypt algorithm one at a time. ;)
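To put rough numbers on that, here's a back-of-the-envelope calculation. The 0.1 seconds per hash and the 1,000 parallel workers are assumptions picked for illustration, not measurements, and `years_to_exhaust` is just a helper name for this example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_to_exhaust(address_bits: int, seconds_per_hash: float, workers: int = 1) -> float:
    """Years needed to hash every address in a 2**address_bits space."""
    total_seconds = (2 ** address_bits) * seconds_per_hash / workers
    return total_seconds / SECONDS_PER_YEAR


# Assumed cost: 0.1 s per slow hash on one worker.
ipv4_years = years_to_exhaust(32, 0.1)
ipv4_farm_days = years_to_exhaust(32, 0.1, workers=1000) * 365.25
ipv6_farm_years = years_to_exhaust(128, 0.1, workers=1000)

print(f"IPv4, 1 worker:     {ipv4_years:.1f} years")
print(f"IPv4, 1000 workers: {ipv4_farm_days:.1f} days")
print(f"IPv6, 1000 workers: {ipv6_farm_years:.2e} years")
```

Under those assumptions the whole IPv4 space takes a bit over 13 years on one worker but only about 5 days on a thousand, so a slow hash buys time for IPv4 rather than real secrecy. The full IPv6 space stays hopelessly out of reach either way.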
If your site is actively being hammered by a spammer, you would be able to determine their IP address at the time because you store it temporarily in Redis or somewhere. So when you're under active attack, you can figure out WTF is going on and blackhole the IP address or whatever.
But two weeks later? When that spammer has given up and gone home? Your server should retain no trace at all of that IP address anywhere. No log files, no databases; it should've long since been purged from Redis and be nowhere to be seen in system RAM.
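The "keep it briefly, then forget it" idea from the rate limiting example (5 failed logins per 30 minutes) can be sketched like this. It's an in-memory stand-in so the example runs on its own; with Redis you'd more likely lean on key expiry (the EXPIRE command) to do the forgetting for you. `FailedLoginLimiter` and its method names are invented for this sketch, and the key could be a raw IP or a hash of one.

```python
import time
from collections import deque

WINDOW_SECONDS = 30 * 60  # how long a failed attempt is remembered
MAX_FAILURES = 5          # failures allowed inside one window


class FailedLoginLimiter:
    """In-memory failed-login counter that forgets entries after the window."""

    def __init__(self, max_failures=MAX_FAILURES, window=WINDOW_SECONDS,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window = window
        self.clock = clock      # injectable so the expiry can be tested
        self._failures = {}     # key -> deque of failure timestamps

    def _prune(self, key):
        """Drop timestamps older than the window; forget empty keys."""
        times = self._failures.get(key)
        if times is None:
            return
        cutoff = self.clock() - self.window
        while times and times[0] <= cutoff:
            times.popleft()
        if not times:
            del self._failures[key]

    def record_failure(self, key):
        self._prune(key)
        self._failures.setdefault(key, deque()).append(self.clock())

    def is_blocked(self, key):
        self._prune(key)
        return len(self._failures.get(key, ())) >= self.max_failures

    def purge_expired(self):
        """Sweep every key; call this periodically so idle keys vanish too."""
        for key in list(self._failures):
            self._prune(key)
```

Nothing here touches the filesystem, and once the window passes the key is deleted outright rather than left around with an empty list.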
I'll update my Rophako CMS to not explicitly store IP addresses in its database files, and configure nginx to not log IPs into the log files.
I literally never check IP addresses so I don't expect that this will disrupt anything. Your use case might be different. If you do need to store addresses temporarily, still do so carefully as outlined above.
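For the nginx side, something like the fragment below should do it. This is a sketch, not my exact config: the format name `noip` and the log path are placeholders, and the fields are just the usual combined-log fields minus `$remote_addr`.

```nginx
# Inside the http { } block: define a log format that omits $remote_addr.
log_format noip '[$time_local] "$request" $status $body_bytes_sent '
                '"$http_referer" "$http_user_agent"';

# Inside http, server, or location: use that format for the access log.
access_log /var/log/nginx/access.log noip;
```

As far as I know this only covers the access log; nginx's error log can still mention client addresses in its messages, so that file deserves a short retention period too.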
I think "not storing IPs of website visitors" could become a selling point for privacy-oriented websites in the future, just like it is for VPN providers now. My blog is neither of those things, but my blog has always been where I get to experiment with neat things like this, so I'll be trying this anyway.
There are 4 comments on this page.
Another interesting thought. TL;DR: do websites need to store email addresses?
How about this: instead of saving a user's email address, which makes the database something valuable to break into, we save only a salted hash of the email?
Someone breaks into your site and steals your entire database... so what! All they have is a useless salted hash of an email that means nothing!
A user needs to request a password reset? They type in their email, it goes through the same salt+hash process one time, the site checks whether that hash exists, and then sends the reset message to the email that was entered, without storing it.
It literally makes whoever is running the site incapable of sending spam to the end user. The downside is that it also stops the site admin from being able to send any kind of notices.
This is exactly what I want to do: I want to store the hash of $remote_addr (i.e. the client IP) in the nginx access.log output, but I have no idea how to do it.
I've been looking at Lua, and at adding Perl modules to nginx, but I would ideally not want to build a custom version of nginx to do this, as I would have to recompile nginx every time a new version comes out.
You wouldn't know how to do it in nginx, would you?
@peter: not sure. What I'd do is not let nginx log the IP (set a custom log format that doesn't include $remote_addr). Depending on your app stack, you might handle your own logging in your application, where you can hash the IPs as normal.