Ask HN: How to handle user file uploads?
214 points by matusgallik008 15 days ago | 186 comments
Hey, I work as an SRE for a company where we allow users to upload media files (e.g. profile pictures, attaching docs or videos to tasks... the usual). We currently just generate an S3 pre-signed URL and let the user upload stuff. Occasionally, limits are set on the <input/> element for file types.

I don't feel this is safe enough. I also feel we could do better by optimizing images on the BE, or creating thumbnails for videos. But then there is the question of cost on the AWS side.

Anybody have experience with any of this? I imagine having a big team and dedicated services for media processing could work, but what about small teams?

All thoughts/discussions are welcome.




I would encourage not directly using the user-uploaded images. But uploading directly to S3 is probably fine. I just wouldn't use the raw file.

1. Re-encoding the image is a good idea to make it harder to distribute exploits (a rough sketch follows after this list). For example, imagine the recent WebP vulnerability. A malicious user could upload a compromised image as their profile picture and pwn anyone who saw that image in the app. There is a chance that the image survives the re-encoding, but it is much less likely, and at the very least it makes your app not the easiest channel to distribute it.

2. It gives you a good place to strip metadata. For example, you should almost certainly be stripping geolocation. But in general I would recommend stripping everything non-essential.

3. Generating different sizes as you mentioned can be useful.

4. It allows accepting a variety of formats without requiring consumers to support them all, since you just transcode in one place.
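
A minimal sketch of what points 1–3 could look like, assuming Pillow as the library and made-up size limits (not necessarily what anyone here actually runs):

    # Sketch: re-encode an untrusted upload with Pillow, dropping metadata,
    # animation frames, and oversized dimensions. MAX_DIM and the output
    # format are arbitrary example values.
    import io
    from PIL import Image

    MAX_DIM = 4096  # downscale anything larger than this

    def reencode_image(raw: bytes) -> bytes:
        src = Image.open(io.BytesIO(raw))
        src = src.convert("RGB")           # drops alpha; keeps only the first frame of animated inputs
        src.thumbnail((MAX_DIM, MAX_DIM))  # cap dimensions, preserving aspect ratio
        out = io.BytesIO()
        # Saving a fresh Image object writes no EXIF/GPS metadata by default.
        src.save(out, format="JPEG", quality=85)
        return out.getvalue()

If decoding fails, you reject the upload; if it succeeds, only the freshly encoded bytes ever get served.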

I don't know much about the cost on the AWS side, but it seems like you are always at some sort of risk, given that if the user knows the bucket name they can create infinite billable requests. Can you create a size limit on the pre-signed URL? That would be a basic line of defence. Once the URL expires, you probably also want to validate the uploaded data and decide whether it conforms to your expectations (and delete it if you aren't interested in preserving the original data).


> For example, you should almost certainly be stripping geolocation. But in general I would recommend stripping everything non-essential.

Including animation in most cases. Otherwise somebody can use a single frame with very long duration that will be reviewed by moderators, then follow that frame with a different frame containing objectionable material which will eventually be shown to people.


Wow, that's something I had never considered. I did run into a bug with someone uploading a gif to our servers and our resize script spitting out a file per resized gif frame (base_1.png, base_2.png, ...) that I had to fix but I'd never have thought of your example. We just took the first gif frame in our case, which would have been safe from this thankfully.


I often find you can upload animated GIFs to sites that don't allow them by just renaming them to PNG first.


> recent WebP vulnerability

For anyone wondering:

https://blog.cloudflare.com/uncovering-the-hidden-webp-vulne...

> The vulnerability allows an attacker to create a malformed WebP image file that makes libwebp write data beyond the buffer memory allocated to the image decoder. By writing past the legal bounds of the buffer, it is possible to modify sensitive data in memory, eventually leading to execution of the attacker's code.


I'm not the sharpest tool in the shed (just learning Python right now), so can the wizards explain how this would work? Isn't data stored in memory super random, so the attacker would need to upload an image multiple times to get at sensitive data? Or does the WebP image itself contain code that will be run and can retrieve the data?


Generally with a buffer overrun, you're trying to put executable code in, and get it to a spot where it'll be run, allowing you to do further stuff.

Here is a canonical old article on one type of buffer overflow:

http://www.phrack.org/issues/49/14.html#article

(The WebP one, iirc, is not a stack buffer overflow but rather a heap overflow, but that'll give you a general sense of what's going on, even if that article is a bit ... dated)


>Re-encoding the image is a good idea to make it harder to distribute exploits. For example, imagine the recent WebP vulnerability.

And now your server which is doing re-encoding is pwned. Gotta segment the server doing that somehow, or use pledge, capsicum, or seccomp I guess.


This is a pretty good use case for serverless endpoints, assuming the volume is pretty low.


Serverless + making sure the serverless function doesn't have any privileges. More often than not I see people use serverless in the name of security, and then give the function write access to prod resources.


Ideal use case for a cheeky OpenBSD machine...

Not that OpenBSD is actually unhackable or anything, but I doubt many attackers would guess you're running imagemagick on OpenBSD in your image pipeline.

I rather like it for such use cases; it has the added benefit that it never, ever seems to die. The other week I found a 6.0 machine I had set up to do some kind of risky Kafka processing that had an uptime of 6 years (since migrated).


I had to reboot one of our servers last year, also at just over 6 years of uptime. The reboot was because of a physical move to another server hotel, not because it wasn't working :-)

It is running ImageMagick to optimize images, create different resolutions and re-encode them. It's only open for us to manually upload files from our customers though; they can't upload themselves. Input anything, output JPG, very easy to use.


I've run into a security issue where a serverless function had a pretty large range of AWS access and a pentester was able to exploit that.


That's a bad use case for serverless :)


For video you might want to use the GPU, but for images this sounds like a good use case for Wasm



Draw image to canvas in the browser. Read image from canvas. Upload. If you are completely paranoid you could then upload raw pixel data only and construct whatever image format you wanted server side.


The user could push the file to the endpoint directly without using the client-side functionality.


I think the idea is that a browser needs to do this conversion to display images, so you'd be writing an HTTP client that will run server-side and process each uploaded file as if it were an HTTP response. It'd need to be transmissible via screenshot at this point, right?


By running a web browser on the server you are at square one again: running the encoder on the server, which is risky.


In this case isn't the browser decoding the image and drawing a bitmap? That sounds more like a critical vulnerability in e.g. Chrome.

In my mind the chain of events looks like:

- Alice uploads an image file to Bob's file server.

- Charlie's image converter server invokes a local browser with the address of the file on Bob's server, resulting in a .bmp (or .png? a little out of my element here) saved to Charlie's server.

- Doris can now pull images from Charlie's server knowing they've been vetted by a major browser.

Does that make sense?


Yes, which outsources the problem to Charlie. Charlie has to firewall and lockdown that browser as much as possible to reduce the risk of Alice pwning Charlie, else Charlie's server could (among other things) be made to taint all output files. Browsers are not magically immune, they also just call decoder libraries.

Did you forget to never trust the client? Please tell me you haven't built any products using this philosophy.


It's not a terrible idea... This doesn't "trust" the client, it just interprets the data that the client sent as an array of pixel values. In a memory safe language (e.g. JS, C#, Go, Rust, ...), that would make it basically impossible to pwn: the worst thing an attacker could do is upload an arbitrary image.
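
To make "interprets the data as an array of pixel values" concrete, a minimal server-side sketch (Pillow assumed; the dimension caps are made-up):

    # Sketch: treat the upload as a raw RGBA pixel buffer rather than a
    # compressed image. Width/height limits are arbitrary example values.
    from PIL import Image

    MAX_W, MAX_H = 4096, 4096

    def image_from_raw_pixels(data: bytes, width: int, height: int) -> Image.Image:
        if not (0 < width <= MAX_W and 0 < height <= MAX_H):
            raise ValueError("unreasonable dimensions")
        if len(data) != width * height * 4:
            raise ValueError("payload size does not match declared dimensions")
        # No decompression happens here; the bytes are copied straight into a bitmap.
        return Image.frombytes("RGBA", (width, height), data)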


My sibling comments are the terrible sort of ‘instead of applying your own logic, blindly and sometimes incorrectly pattern match against a checklist of ‘best practices’’ internet lectures. Image processing exploits almost always exist at the decompression stage. Once you’re processing a bitmap of pixels, there is a whole lot less that can go wrong. Having decompression happen at the client has an obvious performance impact. Ignoring that though, given that it implicitly involves changing the server-side API, it’s not “trusting the client”. It very clearly offloads a bunch of risk. The underlying premise is that raw image data is harder to shove an exploit in. People are so quick to lecture others before thinking twice.


Thanks for one of the only rational replies to this thread. I often hold back from commenting at all on HN as the replies are so often full of low effort non-constructive criticisms.


Cool you have a bitmap. Now what? You're going to distribute all that child porn people enclosed in the bitmap (byte array) that doesn't render as a valid image?

It's just not secure - anything you do on the client can be trivially circumvented.


The image handling libraries in those langs are almost all written in C/C++. If you're just wrapping ffmpeg or imagemagick or libpng, you're not really protected from much of anything.

False security, if anything.


The image is still being posted somewhere, right? What guarantee do you have that it was your wasm blob doing the post vs some j33t haxxors curl command from his kali vm?


So you're just making your own image format with no compression?

A 1080p image with 8 bits per channel would be 6 MB. Real mobile-friendly...


We run these services on isolated machines.


How isolated? If it's an image processing service then it can't live in a vacuum and needs to talk to something else at a certain point.


Only accept in-bound requests. Don't provide it with any special credentials.

Someone might still be able to get an RCE on them and burn a bit of money, but they certainly wouldn't be able to move laterally.


> 3. Generating different sizes as you mentioned can be useful.

For years now I've used nginx's image filter [1], which handles image resizing quite nicely. Resized images are cached by nginx. For some use cases it works very well, and I no longer need to specify sizes beforehand; you just ask for the size by crafting your URL properly.

[1] https://nginx.org/en/docs/http/ngx_http_image_filter_module....


How does that scale if, say, image access is relatively random and your data set exceeds server RAM by a couple of orders of magnitude?


Cannot say; scaled images are saved and served from the filesystem, not kept in RAM. One possible solution would be to use a CDN.


Do note that re-encoding can also be used as part of the exploit. E.g. Team Fortress 2 recently had one that exploited a similar system.


I don't think the exploit in that case was re-encoding. What happened is an image with very large dimensions was uploaded. When this was decoded into a raw pixel buffer on the client it used tons of memory. It was effectively a zip bomb attack.

In fact re-encoding probably would have solved this as the server could enforce the expected dimensions and rescale or reject the image.


There've been exploits in image, video, and audio codecs... which is why it's important to protect your users, but also your servers...

Best to sandbox/jail/etc as tightly as possible, and limit the codecs to only what you need. You can configure the ffmpeg builds pretty granularly... default will include too much.


I care more about protecting my servers than my users. If one user attacks another that's not really my problem, blame can easily be shifted to the browser. But if someone hacks my servers and leaks everyone's data, that's my problem.

So not encoding is probably the safer way to go for the business.


Yikes.

What if the other user getting attacked is you or another admin on your team?

Now the attacker has admin access and can compromise your servers and “leak everyone’s data” just fine.

I don’t think you’ve thought this through.


Trust me, your website will always get blamed, regardless of whether it's the user's fault.

If websites get blamed when users reuse passwords, they are definitely getting blamed if you distribute malicious files.


You can think that but you should never say or write it. :p


It was just a joke, your honor.


I got the picture.


>Re-encoding the image is a good idea to make it harder to distribute exploits.

Famously, the Dangerous Kitten hacking toolset was distributed with the classic zip-in-a-jpeg technique, because imageboards used to not re-encode images.

https://web.archive.org/web/20120322025258/http://partyvan.i...


Didn’t some photo gallery service (Google Photos or Takeout?) make the news because they re-encoded users’ images but didn’t preserve the original files? People who relied on the service as a safe backup lost their original quality images. So in some cases, you may want to re-encode/optimize uploaded images for display but also archive the original files somewhere.


It obviously depends on the use case.


> Can you create a size limit on the pre-signed URL

Yes, a pre-signed URL can have the `Content-Length` set and Amazon S3 checks it. However, note that this is true for Amazon S3 but not necessarily for others like Backblaze or R2. Last time I tried, Backblaze didn't support it.


Only with createPresignedPost, not with getSignedUrl.


Presigned post lets you set content-length-range: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-authen... You can specify content length on presigned PUTs, but it needs to be set as a header and added to the signed headers for SIGv4. It can't be set as a query param.
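
For example, a sketch of the POST flavour with boto3 (bucket, key and limits are placeholders):

    # Sketch: presigned POST that rejects uploads outside 1 byte .. 10 MB.
    # Bucket name, key template, and expiry are example values.
    import boto3

    s3 = boto3.client("s3")

    post = s3.generate_presigned_post(
        Bucket="example-upload-bucket",
        Key="uploads/${filename}",
        Conditions=[
            ["content-length-range", 1, 10 * 1024 * 1024],
        ],
        ExpiresIn=300,  # seconds
    )
    # post["url"] and post["fields"] go to the client, which submits a
    # multipart/form-data POST; S3 enforces the size range server-side.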


This is good advice. Just pick a resize dimension and try to resize everything that comes in. If it fails it's not an image.

You can hang an event off S3 and have a lambda that does the work / warns you of a bad upload.
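
A minimal sketch of such a Lambda (bucket names, sizes and the tagging behaviour are assumptions; Pillow has to be packaged with the function):

    # Sketch: S3-triggered Lambda that re-encodes/resizes new uploads into a
    # second bucket, and tags anything that doesn't decode as an image.
    import io
    from urllib.parse import unquote_plus

    import boto3
    from PIL import Image

    s3 = boto3.client("s3")
    PROCESSED_BUCKET = "example-processed-bucket"

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            try:
                img = Image.open(io.BytesIO(body)).convert("RGB")
                img.thumbnail((1024, 1024))
            except Exception:
                # Not a decodable image: flag it (or delete/alert) instead of serving it.
                s3.put_object_tagging(
                    Bucket=bucket, Key=key,
                    Tagging={"TagSet": [{"Key": "status", "Value": "rejected"}]},
                )
                continue
            out = io.BytesIO()
            img.save(out, format="JPEG", quality=85)
            s3.put_object(Bucket=PROCESSED_BUCKET, Key=key,
                          Body=out.getvalue(), ContentType="image/jpeg")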


Resizing also helps reduce costs. The latest phones can generate ridiculously large images with 200+ megapixels. You really don't want to dump that kind of behemoth in your S3 bucket and serve it as somebody's profile pic.

Videos will add even more to your AWS bill if you're not careful. Re-encode that 4K cat video as soon as it comes in, or wire up a CDN to do it for you.


> Can you create a size limit on the pre-signed URL?

Yes, if you use the POST method you can set the content-length-range property in your presigned URL form inputs to limit min and max bytes. https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-authen...


There's also an old bug where browsers re-interpreted images as HTML (even with the correct MIME type set), and this allowed hosting exploits on user-upload sites. Not sure if any modern browser still has this problem, but it used to be a concern. Re-encoding the image usually broke those exploits. Though it could also break some metadata - e.g. if you go from JPEG to PNG you could lose EXIF data.


I’m pretty sure, if memory serves me, some of the older versions of IE (Internet Explorer, I imagine some people here may be young enough to not be aware of that) used to let you serve any sort of file hosted over HTTP with an arbitrary extension (e.g. ‘.jpeg’). However, it would actually do some MIME sniffing and execute the file based on the actual content, so you could literally set a .exe with a jpeg file extension and IE would let you run arbitrary executables on Windows users’ systems with the full permissions of the user running it (which was pretty often admin). You could also do other file types, but an executable was the most glaring example of how bad the issue was. Obviously you could embed this in HTML as well…


This still generally works with SVGs.


You can tell a lot from the exif metadata of images so that's one reason (user privacy) to always re-encode images.


Sounds complicated. Any recommendations for existing libraries I can use to handle this?



I know lemmy uses pict-rs https://git.asonix.dog/asonix/pict-rs/


> Re-encoding the image is a good idea to make it harder to distribute exploits

Which immediately makes you vulnerable to zip-bomb attacks.

If you want to be safer, you first have to do all kinds of checks on the image headers.


#1 sounds interesting. Any examples of this being used?

Read through the comments and was surprised no one mentioned libvips - https://github.com/libvips/libvips. At my current small company we were trying to allow image uploads and started with imagemagick, but certain images took too long to process, so we went looking for faster alternatives. It's a great tool with minimal overhead. For video thumbnails we use ffmpeg, which is really heavy, so we off-load video thumbnail generation to a queue. We've had great luck with these tools.
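
For a flavour of the API, a tiny sketch via the pyvips binding (file names, sizes and save options are placeholders):

    # Sketch: fast thumbnailing with libvips via pyvips. libvips streams the
    # image rather than decoding it fully into memory, which is where much of
    # the speed comes from.
    import pyvips

    # Decode-and-shrink in one pass to a 320px-wide thumbnail.
    thumb = pyvips.Image.thumbnail("input.jpg", 320)
    # strip=True drops metadata (EXIF, ICC, etc.) from the output.
    thumb.write_to_file("thumb.jpg", Q=85, strip=True)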


+1 to vips! It's amazingly fast and stable. I even wrote (some minimal) Swift bindings for it to be used with a Swift backend: https://github.com/gh123man/SwiftVips


Thanks for the suggestion, and Meson keeps popping up.

I vaguely remember flattening images in some manner so imagemagick wouldn't choke. Something about setting key tones? It's been a while.


Didn't know if you were asking but Meson is used to build libvips from source. It took a couple days for me to get up to speed on it but wasn't difficult to understand. https://mesonbuild.com/ https://www.libvips.org/install.html As for imagemagick, I just couldn't get the speed I wanted out of it. This is partly due to the fact that the system I was building allows many different image types. I want to say gifs were almost always an issue but I'm reaching here. Been a bit for me as well.


Yep, I keep wondering why I'm not just using make and a text editor on a lot of projects, but make feels old and creaky for some unexamined reason so I'm looking around.

One of the major lessons I pulled out of that time is that most library authors will happily fix bugs or add features if you give them money. You can also just ask them to sell you a different license if FOSS ruins your plan.


I don't understand the details of cloud stuff, and it took me a heck of a long time to google just what a "pre-signed URL" was. In case anybody else is in the same bucket (ahem):

Users can't upload to your S3 storage because they lack credentials. (It would be dangerous to make it public.) But you can give them access with a specially-generated URL (generated for each time they want to upload). So your server makes a special URL, "signed" with its own authorization. That lets them upload one file, to that specific URL.
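
In SDK terms it's typically a single call on the server; a sketch with boto3 (bucket/key are placeholders):

    # Sketch: server generates a time-limited URL; the client then PUTs the
    # file straight to S3 with it. Bucket and key are example values.
    import boto3

    s3 = boto3.client("s3")

    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "example-upload-bucket", "Key": "uploads/user-123/avatar.png"},
        ExpiresIn=300,  # the URL stops working after 5 minutes
    )
    # Hand `url` to the browser; it uploads with a plain HTTP PUT,
    # no AWS credentials needed on the client side.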

(I dunno about anybody else, but I find working with AWS always involves cramming in a lot of security-related concepts that I had never had to think about before. It puts a lot of overhead on simple operations, but presumably is mandatory if you want an application accessible to the world.)


It's a cool feature. You don't need to proxy the upload via your web server; instead it is handled directly by AWS.

It's most likely more efficient, faster, and cheaper (no need to handle the traffic and hardware to proxy).


Calling it a "single use upload url" would probably be a lot clearer. Describe the purpose, not the mechanism...


Single use upload URL isn't the only use case for pre-signed URLs though. They are more commonly used for quick-expiring downloads for example.


It sure sounds like it. I'm not sure how much of AWS could be clarified by better names, but I suspect it's a lot.


Pre-signed URLs also allow you to set conditions such as file size limits, type of file, file name etc. to ensure a malicious party isn't uploading a massive 1 TB file into the bucket you serve profile pics from. However, while Amazon S3 supports these "conditions", others implementing S3, like Backblaze, may or may not. So beware.


Note that limiting content length only works with createPresignedPost, not with getSignedUrl.


If you were building a system with the same architecture, you’d have the same security considerations. Nobody is telling you to use pre-signed URLs, even if you’re using AWS. This is an apples and oranges comparison. A seasoned technologist such as yourself should be able to see past the word “cloud”, understand the architectural differences and the engineering trade offs being made. It’s evident from your comments that you seem to wear this ignorance as a point of pride. I’m no tech hipster and I can still see that this isn’t the good look that you think it is.


You can use an EBS volume attached to the pod, or a shared EFS volume as well.

But for certain applications, S3 is simpler to use, especially when you need to scale.


I have zero idea what an EBS, EFS, or pod is.

Not that I'm asking for an explanation. Just illustrating how much stuff there is to learn for the basic operation of "run this on your computer", even for an experienced developer. (I've been doing this for nearly 40 years.)


You should probably learn.


If I ever need it, I will. It's not useful in my area of work so I haven't bothered.

I am curious, so I took the opportunity to look up the strange sounding term "pre signed url". But that's just being a dilettante.


I took a class for AWS certification. Even though I never ended up taking the test, it has been extremely helpful throughout my career.


Both Azure and AWS have this feature and it’s very useful to avoid having to proxy large files through your own backend.


We allow uploads with pre-signed URLs at my work too. We added a few more constraints. First, we only allow clients to upload files with extensions that are reasonable for our apps. Files uploaded to S3 are quarantined with tags until a Lambda validates that the binary contents appear to match the signature for the given extension/MIME type. Second, we use ClamAV to scan the uploaded files. Once these checks have completed, we generate a thumbnail with a Lambda and then make the file available to users.

I’m honestly surprised this isn’t a value-added capability offered by AWS S3, because it’s such a common need and undifferentiated work.
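
A rough sketch of the tag-based quarantine piece (bucket/key are placeholders and `scan_file` is a hypothetical stand-in for whatever scanner integration is used):

    # Sketch: keep new objects tagged as quarantined until a scan passes.
    # Downstream access (e.g. via a bucket policy or CloudFront) can deny
    # reads on objects still tagged status=quarantined.
    import boto3

    s3 = boto3.client("s3")

    def tag(bucket: str, key: str, status: str) -> None:
        s3.put_object_tagging(
            Bucket=bucket, Key=key,
            Tagging={"TagSet": [{"Key": "status", "Value": status}]},
        )

    def handle_new_upload(bucket: str, key: str) -> None:
        tag(bucket, key, "quarantined")
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        clean = scan_file(body)  # hypothetical: returns False if malware is detected
        tag(bucket, key, "clean" if clean else "infected")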


Virus scanning user uploaded files always seemed like a very sensible thing to me.

I've used ClamAV before in web apps, and it was fairly easy to get working - but it has a few big flaws: it's very slow, it uses a huge amount of memory, and detection rates are... honestly, pretty terrible.

I've used Verisys Antivirus API [0] for 2 recent projects, and it works great, a huge improvement over self-hosted ClamAV. There are a couple of other virus scan APIs available, including Scanii [1], which I think has been mentioned on HN before, and Trend Micro has something too - but their pricing is secret, so I have to assume it's expensive.

[0] https://www.ionxsolutions.com/products/verisys-virus-api [1] https://scanii.com


How are those better than ClamAV? I am not saying they aren't but you just said the detection rates of one is terrible compared to the rest.

Things got a lot better for ClamAV after the Cisco purchase, but detection rates remain atrocious. There are some commercial signatures available, which would certainly help, but it still doesn't come close to the detection rates of commercial scanning engines.

ClamAV needs lots of memory to work - we needed at least 8GB in one project I worked on (where we also used commercial signatures); 4GB was just about enough for another. The more signatures you have, the more memory it uses.

ClamAV chews through a lot of CPU, and it's so very slow. For example, scanning a 25MB MSI file from my test corpus takes 65 seconds - and that's on a fairly beefy machine!

I've used both of the services linked above, and both seem excellent - detection rates and performance are far superior to ClamAV, and they Just Work(TM).


Azure Storage seems to have an integrated AV option:

https://learn.microsoft.com/en-us/azure/defender-for-cloud/d...

Mind you, last time I designed something using this we just used ClamAV - which is pretty easy to develop against even if it is a slight pain to manage on an ongoing basis.


Is using ClamAV because of some compliance checklist that forces you to include a virus scanner in there? Otherwise I don’t see how adding a ton of additional complexity with a proven track record of CVEs would add any security at all.


No compliance reason other than adding it in couldn’t hurt.


How does

> a ton of additional complexity with a proven track record of CVEs

become:

> adding it in couldn’t hurt

?


By running it in an ephemeral container with no network access for each file you want to check?


Thanks for the insights! I had a look at ClamAV - I'd never heard of or worked with it before. This was something I was looking for.

Do you also handle large files, e.g. videos, with Lambda? Don't you struggle with Lambda limits then?


Great question. We use spot EC2 instances for ClamAV scans. For thumbnails, we still use Lambdas, but use AWS MediaConvert for video files to select a frame for the thumbnail.


It's been a while, but I would benchmark how ClamAV fares. At $(day job) we implemented scanning for file uploads, and after our security team tested it they noticed that it detected things fairly poorly (60% was on the high side) for very well known malicious content, iirc.

We ultimately scrapped it, if anyone else has any better experience I'd love to hear it.


I am not a proper developer so I don't have the technical details on this, but another thing I have seen done is to buy access to an API on VirusTotal [1] and submit file hashes to be compared against their database of about 70 or so virus scanners. It's not foolproof, as malware can be recompiled and hidden from VT until the actual file is submitted for sandbox analysis, which you probably don't want to do with potentially sensitive data, so it really only catches known-knowns, but apparently some companies find it useful. I utilize it on a Windows machine using Process Explorer, which can automatically submit hashes, but the API volume is limited for free/anonymous access.

[1] - https://docs.virustotal.com/reference/overview


CISA is offering a malware analysis service now; it requires a Login.gov account to use and the user must be a US citizen. Unsure how it compares to VirusTotal, it’s in our backlog to spike a PoC in our IR workflow.

https://www.cisa.gov/resources-tools/services/malware-next-g...


I have to say this user agreement sounds like TikTok privacy policy, but my oh my! The government does its job. RESPECT.


If I was a malware distributor, I'd just throw a couple random bytes at the end each time I upload a malicious file, it'd still work and have a different checksum/hash.


Antivirus products are really stupid tools, but not that stupid. I say that from a time when I had to work around tools being flagged by antivirus. Among the stupid things they do is flagging part of an executable: some NSIS plugins got the whole package flagged as a virus as soon as you included them. I think they probably hash files by chunks; if you have too many bad chunks then you are a virus. A few bytes at the end don't change that.


I'm not sure if it's still true, but it used to be that ~half of all antivirus would flag an executable if you compressed it with UPX.


You could sometimes bypass this by opening the file with a hex editor and changing a meaningless value. When UPX was popular there were also alternative file compressors that could sometimes be used to bypass this issue.


If S3 did this, what would that look like?


I recently redesigned my stack for validating uploaded files and creating thumbnails from them. My approach is to have different binaries per file type (currently images JPEG/PNG, videos H264/H265 and TrueType fonts). Each of them is implemented in a way that it receives the raw data stream via stdin and then either generates an error message or a raw planar RGBA data stream via stdout. The validation and thumbnail process is triggered after first locking the process into a seccomp strict-mode jail, before touching any of the untrustworthy data. Seccomp prevents them from making basically every syscall except read/write. Even if there were an exploit in the format parser, it would very likely not get anywhere, as there's literally nothing it could do except write to stdout. On the outside, a strict time limit is enforced.

The raw RGBA output is then received and converted back into PNG or similar. It was a bit tricky to get everything working without additional allocations or syscalls triggered by glibc somewhere, but it works pretty well now and is fast enough for my use case (around 20ms/item).
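
The supervising side (not the seccomp jail itself, which lives in the helper binaries) can be sketched roughly like this; the binary name, limits and timeout are stand-ins:

    # Sketch: run a sandboxed decoder that reads the untrusted file on stdin
    # and writes raw RGBA on stdout, with CPU/memory limits and a hard timeout.
    # "./decode_jpeg" is a hypothetical helper like the ones described above.
    import resource
    import subprocess

    def limit_resources():
        # Applied in the child just before exec: cap CPU seconds and address space.
        resource.setrlimit(resource.RLIMIT_CPU, (5, 5))
        resource.setrlimit(resource.RLIMIT_AS, (512 * 1024 * 1024, 512 * 1024 * 1024))

    def decode_untrusted(raw: bytes) -> bytes:
        proc = subprocess.run(
            ["./decode_jpeg"],
            input=raw,
            capture_output=True,
            timeout=10,                # wall-clock limit enforced from the outside
            preexec_fn=limit_resources,
        )
        if proc.returncode != 0:
            raise ValueError(proc.stderr.decode(errors="replace"))
        return proc.stdout             # raw planar RGBA, re-encoded elsewhere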


Oh could you expand briefly on what the stack looks like to accomplish this? Or do you have a write up on a blog/site you could share?


I wrote a little bit more about this here: https://community.info-beamer.com/t/an-updated-approach-to-c...

I’m a huge fan of building minimal self-contained tools, so all of the C programs statically link in the required parser libs (libavcodec/wuffs/freetype) so the resulting binaries don’t require additional dependencies on the target machine. The python wrapping code is rather straightforward as well and is only like 300 lines of code.


Some application security thoughts for serving untrusted content. Not all are required but the main thing is that you don't want the user to be able to serve html or similar (pdf, SVG?) file formats that can use your origin and therefore gain access to anything your origin can do:

- serve on a different top level domain, ideally with random subdomains per uploaded file or user who provides the content. This is really most important for serving document types and probably not for images though SVG I think is the exception as it can have scripting and styling within when loaded outside of an IMG tag

- set "content-security-policy: sandbox" (don't allow scripts and definitely don't allow same origin)

- set "X-Content-Type-Options: no sniff" - disabling sniffing makes it a lot harder to submit an image that's actually interpreted as html or js later.

Transforming the uploaded file also would defeat most exploit paths that depend on sniffing the content type.
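
If the files are served through your own app rather than straight from a bucket/CDN, those headers are a few lines of middleware; a sketch assuming a FastAPI/Starlette stack and a made-up path prefix:

    # Sketch: attach the "treat this as inert, unprivileged content" headers to
    # everything served from the user-content routes.
    from fastapi import FastAPI, Request

    app = FastAPI()

    @app.middleware("http")
    async def untrusted_content_headers(request: Request, call_next):
        response = await call_next(request)
        if request.url.path.startswith("/user-content/"):
            response.headers["Content-Security-Policy"] = "sandbox"
            response.headers["X-Content-Type-Options"] = "nosniff"
            # Optionally force download rather than inline rendering for risky types:
            # response.headers["Content-Disposition"] = "attachment"
        return response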


As a side issue here, I can't believe sniffing is still the default.

From basically the day it existed it's been considered a best practice to send that header with every request, because there might be a vulnerability introduced otherwise. I've never heard of anyone relying on it. Why can't modern browsers flip the default on this Internet Explorer era magic footgun?


> As a side issue here I can't believe sniffing is still the default

It isn't the default in modern browsers, or at least not in the sense you mean. The header still controls some things, but even without it, browsers will not sniff something served as content type image/png as html or anything like that. How modern browsers interpret that is different than IE.


I think it's still the default with text/plain. Haven't verified that though.

PDF isn't too bad; it has JavaScript but is generally sandboxed, so you can't do anything bad with it.

SVG is basically the same as HTML, and you can do standard XSS stuff.

Hosting on a separate domain (not just a subdomain but an actual separate domain) is a must if you allow formats like SVG, and a good idea generally.


We use Cloudflare Images and Cloudflare Stream (Video) to process images and video that are uploaded to our site.

Both have worked well for us so far but I don't know about your scale and impact on pricing (we're small scale so far).

Cloudflare Images lets you auto resize images to generate thumbnails, etc. Same with video, where they will auto-encode the video based on who's watching it where. So for us it's just a matter of uploading it, getting an identifier, and storing that identifier.


You'll probably have no issue sending the image to your backend directly, doing whatever you want to it (compression, validation, etc.), and then uploading it to S3 from there. It's not a lot of overhead (and I'd argue it's more testable and easier to run locally).

You can do the math on the ingress to your service (let's say it's EC2), and then the upload from EC2 to S3.

It appears that AWS doesn't charge for ingress [0]. "There is no charge for inbound data transfer across all services in all Regions".

S3 is half a cent per 1000 PUT requests [1], but you were already going to pay that anyway. Storage costs would also be paid anyway with a presigned URL.

You'll have more overhead on your server, but how often do people upload files? It'll depend on your application. Generally, I'd lean towards sending it to your backend until it becomes an issue, but I doubt it ever will. Having the ability to run all the post processing you want and validate it is a big win. It's also just so much easier to test when everything is right there in your application.

[0] https://aws.amazon.com/blogs/architecture/overview-of-data-t...

[1] https://aws.amazon.com/s3/pricing/


Aside from the extra load (and long requests, which is a bigger factor for non-async app servers), you need to take into account the physical location of your app servers. Depending on where the user is, there can be drastic differences in performance uploading to, say, us-east-1 vs ap-southeast-1. With S3, you can enable transfer acceleration so that the user uploads to whichever is their nearest region and then AWS handles it from there. This can give huge speedups.


Lots of other comments give good suggestions on how to handle uploading and processing, but nothing mentions serving resulting content, so let me chime in:

Do not serve content from S3 directly.

ISPs often deprioritize traffic from S3, so downloading assets can be very slow. I've seen kilobytes/s on a connection that Speedtest.net says has a download speed of 850 Mbit/s. Putting CloudFront in front of S3 solves that.


> ISPs often deprioritize traffic from S3

I wonder why


You should never be serving content from S3 directly anyway. Always setup a cloudfront distribution for your buckets and allow origin access identity (OAI) access to the bucket.


Also a good idea to stream large files instead of trying to directly serve the file as-is. For video you essentially have to stream the file in order for it to load on Safari.


Cloudfront does not solve it. Comcast deprioritizes any content that comes from Amazon.


Not to mention using Cloudfront in front of S3 can lower cost, due to caching.


I remember one major web host in 2004 .. I noticed they weren't checking the extension of profile pic uploads, so I uploaded a .aspx file that I wrote a file tree explorer into.

From there I could browse through all of their customers' home directories; eventually I found the SQL database admin password, which turned out to be the same as their administrator password for the Windows server it was running on: "internet".

This was a big lesson for me in upload sanitizing.


'04 was good times for that indeed lol. Early enough that almost no regular developers knew anything about security. SQL injection vulnerabilities were the norm lol. It's really a shame that things turned nasty around then as well. Most people hacked stuff for lulz in those days, but now they do extreme damage.


Yeah, I should have added a caveat that I emailed the admins immediately after I found the administrator password and let them know what was going on. I did set my profile pic to goatse on the way out though, just to put a fire under them to fix it.

> Occasionally, limits are set on the <input/> element for file types.

Since this isn't enforced by the presigned PUT URL, you can't trust that the limits have been respected without inspecting whatever was uploaded to S3. You can get a lot more flexibility in what is allowed if you use an S3 presigned POST, which lets you set minimum and maximum allowed content lengths.

[0]: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPO...


Look at https://uppy.io/ - open source with a lot of integrations. You can keep moving to different levels of abstraction as required and see some good practices for how things are done.


What do you do with the uploaded images? You could be exposed to risks that may not be immediately obvious.

I have seen a team struggle for over a month to eliminate NSFW content - avatars - uploaded by a user, which led to their site being demonetised.


I had a site demonetized because about 1 in 10,000 pictures were of deformed penises, dead nazis, things like that.


I think these days you'd have to deploy an NSFW detector model. And you'd have to silently reject the upload* and limit profile picture change rate, else the attacker could use the endpoint as an oracle to build an adversarial model, which can eventually generate inputs that the detector won't flag.

*: act as if successful, including processing time, but throw away the upload


For images we simply use Cloudflare Images which takes care of everything.

Images are easy to display but for other media files you will probably need some streaming solution, a player, etc.

We're in the audio hosting and processing space. We still don't have an API though.

For video maybe look into Wistia, Cloudflare, BunnyCdn, etc.


2 buckets:

- upload bucket

- processed bucket

The upload bucket has an event triggered on new file uploads, which triggers a Lambda; the Lambda will re-encode, do whatever you deem fit, and upload to the new bucket.

Your app will use the processed bucket.


Following a similar stack, has anyone found success handling iPhone HDR and Live Photos? Both seem to cause issues with standard HTML formats. I believe we’re using an AWS service to convert videos to various qualities (maybe Elastic Transcoder or MediaConvert), and those iPhone video formats cause the service to error out.


> But then there is the question of cost on AWS side.

Use Cloudflare R2 for storage, public bucket for delivery, and a $10/month Hetzner server to run conversions. You don't have to take your users' money just to burn it on AWS...

If you are using AWS and are worried about cost, the first step is to not use AWS.


As someone outside of the AWS trap, I am amazed at how many services people are saying you must use to have file uploads: 2 storage buckets that are charged in 10 different ways, a CDN, a processing server, a "lambda", "media convert", virus scanning and a third-party service for resizing... Extremely, unnecessarily overcomplicated and probably overly expensive.

Yeah, companies really seem to like pissing away their users money on AWS. And then people wonder why everything is an expensive subscription now and somehow Amazon keeps getting richer...

I can think of 100 circumstances off the top of my head where this “cost saving” wouldn’t be net worth it. It very well may be sometimes, too. But your comment is very absolute, and very obviously intentionally so, so you could use this lecturing tone.


We map the TUS [0] protocol to S3 multipart upload operations. This lets us obscure the S3 bucket from the client and authorize each interaction. The TUS operations are handled by a dedicated micro-service. It could be done in a Lambda or anything.

Once the upload completes we kick off a workflow to virus scan, unzip, decrypt, and process the file depending on what it is. We do some preliminary checks in the service looking at the file name, extension, magic bytes, that sort of stuff and reject anything that is obviously wrong.

For virus scanning, we started with ClamAV[1], but eventually bought a Trend Micro product[2] for reasons that may not apply to you. It is serverless based on SQS, Lambda, and SNS. Works fine.

Once scanned, we do a number of things. For images that you are going to serve back out, you for sure want to re-encode those and strip metadata. I haven't worked directly on that part in years, but my prototype used ImageMagick[3] to do this. I remember being annoyed with a Java binding for it.

[0] https://tus.io/ [1] https://www.clamav.net/ [2] https://cloudone.trendmicro.com/ [3] https://imagemagick.org/index.php


Kinda incredible how nearly every comment here mentions S3. Cloud storage is not the only backend in existence :)


This is your moment. Let's hear about alternatives.


That’s because the requester is already using AWS and S3 specifically.

One of the most important, yet oft ignored on HN, principles of architecting solutions is having an unbiased view of products and then choosing what will fit within an existing organisation’s architecture.

Given the specifications shared, utilising S3 further is the correct advice.


It gives you at least some basic security. Nothing can get executed on S3; there is no lateral movement. If you are using your own servers for user uploads you have to do a damn good job of quarantining them.


A lot of people here are telling you to do stuff, but not really explaining "why".

A different suggestion would be to build a threat model.

Who are your uploaders? Who are your viewers? What are the threats?

Only when you've figured out what the threats are, can the solutions make sense.

Internal site? Internal users? Publicly available upload link? What media types? ....

Threat modelling is a thing that adds lots of value (imho) when thinking about adding new capabilities to a system.


Hi! I've been working on this exact problem myself recently and decided to build a product out of it. Take a look: www.bucketscan.com

The intention is to develop an API-driven approach that can be easily integrated into your own products as part of your file upload mechanism. We're really early stage, so we're looking for businesses and individuals to help us define what the product should look like. If you'd be up for sharing your thoughts, you can email us at info@bucketscan.com and/or complete this short product survey:

https://forms.gle/rywgnQ7zqsPuLdMd6


I have used https://github.com/cshum/imagor in front of S3 before and liked it; there are many (some commercial) offerings for this.


Beyond taking advantage of validations that are enforced with IAM policies, you can also have a background job handle making thumbnails or whatever you want.

Also I don't think the Content-Type is actually verified by S3 so technically users can still upload malicious files such as an executable with a png extension.

On the bright side, S3 supports requesting a range of bytes. You can use that to perform validation server-side afterwards to enforce that it's really a PNG, JPG or whatever format you want. Here are examples in Python and Ruby to verify common image types by reading bytes: https://nickjanetakis.com/blog/validate-file-types-by-readin...
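
A sketch of that range-read check with boto3 (the signature table is deliberately tiny and the bucket/key are placeholders):

    # Sketch: fetch only the first bytes of an uploaded object and check them
    # against known magic numbers before accepting the upload.
    import boto3

    s3 = boto3.client("s3")

    MAGIC = {
        "image/png": b"\x89PNG\r\n\x1a\n",
        "image/jpeg": b"\xff\xd8\xff",
        "image/gif": b"GIF8",
    }

    def sniff_type(bucket: str, key: str):
        head = s3.get_object(Bucket=bucket, Key=key, Range="bytes=0-15")["Body"].read()
        for mime, sig in MAGIC.items():
            if head.startswith(sig):
                return mime
        return None  # unknown/unsupported: quarantine or delete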


In my fediverse server project[1], I convert all user-uploaded images to high-quality webp and store them like that. I discard the original files after that's done. I use imgproxy[2] to further resize and convert them on the fly for actual display. In general, I try my best to treat the original user-uploaded files like they're radioactive, getting rid of them as soon as possible.

I don't do videos yet, but I'm kinda terrified of the idea of putting user-uploaded files through ffmpeg if/when I'll support them.

[1] https://github.com/grishka/Smithereen

[2] https://github.com/imgproxy/imgproxy


It’s amazing that this was an issue in 2004 and it’s still an issue today. I don’t have much to add aside from what was already said. There are services like Uppy, Transloadit, etc. that simplify this, but they might be more expensive than S3+CF.


Assuming you serve that content out through a CDN a lot of optimization work will be handled there and customization should also be handled there. I’d be shocked if CDNs don’t allow you to do much/all of that out of the box.

Honestly though if this is an authenticated function and you have a small user base… who cares? Is there a reasonable chance at this disrupting any end user services? Maybe it’s not the best way to spend hundreds of hours and thousands of dollars.

Granted, you’re an SRE so it’s your job to think about this. I’d just push back on defaulting to dropping serious resources on a process that might be entirely superfluous for your use case.


If it's AWS there's basically nothing like this available as default. There are transcoding services and edge functions but you need to set them up yourself.


I tried to do the pre-signed URL thing but gave up quickly. I don't know how you'd do it properly. You're going to want a record of that in your database, right? So what, you have the client upload the image and then send a 2nd request to your server to tell you they uploaded it?

I ended up piping it through my server. I can limit file size, authenticate, insert it into my DB and do what not this way.


Usually it’s the opposite order. Client requests an upload, an entry is made in the database and a pre signed url is generated and returned to the client. The client uploads the image. Optionally, the client could then tell the server the upload was completed, although I’ve never done that final step.


Like you, we use pre-signed S3 upload urls. From there we use Transloadit [0] to crop and sanitize and convert and generate thumbnails. Transloadit is basically ImageMagick-as-a-Service. Running ImageMagick yourself on a huge variety of untrusted user input would be terrifying.

[0] https://transloadit.com/


This is how I would build this in AWS. Upload to S3 via pre-signed URLs. Create a notification on the bucket which publishes new objects to an SNS topic. Then create a Lambda function with its own dedicated SQS queue subscribing to the SNS topic mentioned earlier. This setup would allow you to post-process new uploads without data loss (and at scale), especially if you use a DLQ. The Lambda would drop all post-processed images into another bucket from which you can serve them safely via a CloudFront distribution.

Beware, if you have a high-traffic site this will probably cost you an arm and a leg in S3 costs and Lambda executions. You could consider aging out items in the upload bucket to save some money.

Similarly, you could use an auto-scaled ECS task instead of Lambda, which would scale out (and in) based on the number of items in the queue. Not for the faint of heart, I know :))


You should quarantine them until you’ve analyzed them.

Like you stated, an async process using a function would suffice. I previously used ClamAV for this in a private cloud solution; I’ve also used the built-in anti-virus support on Azure Blob Storage if you don’t mind multi-cloud, plus Azure Functions support blob triggers, which is a nice feature.

The file types scan is relatively simple. You just need a list of known “magic string” header values to do a comparison, and for that you only need a max of 40 bytes of the beginning of the file to do the check (from memory). Depending on your stack, there are usually some libraries already available to perform the matching.

And it goes without saying, but never trust the client, and always generate your own filenames.

https://en.m.wikipedia.org/wiki/List_of_file_signatures


I'm pretty sure ClamAV would add more vulnerabilities to your stack than it might prevent from being exploited.


It’s an old stack. Can you suggest an alternative self hosted option?


I consider the approach of using virus scanning inherently flawed: They rely on heuristics and rules to essentially create a blacklist. And they add a ton of complexity (more code -> usually more bugs) which is not something you want if you work with untrustworthy data.

What you instead want is a whitelist: Only allow properly formatted images and videos and ruthlessly reject anything else. I wrote about how I implemented this for my service in another response.


If you are a big enough target, people will try to compromise your infrastructure or your users through these uploads.

Some problems you can run into: exploiting image or video processing tools with crafted inputs, leading to server compromise when encoding/resizing. Having you host illegal or objectionable material. Causing extreme resource consumption, especially around video processing and storage. Having you host material that in some way compromises your clients (exploits bugs in jpeg libraries, cross-site scripting, etc.).

I can't really talk about what is done at the FAANG that I worked at on this stuff, but if you are a large enough target, this is a huge threat vector.


If you're looking for a good image optimization product, I've had excellent results from ImageOptim (I have no affiliation with them). They have a free Mac App, they have an API service, but also they kindly link to lots of other free similar products and services: https://imageoptim.com/versions.html

If you can spare the CPU cycles (depending on how you optimize, they can actually be expensive) and if your images will be downloaded frequently, your users will thank you.


Ensure you have adequate cleanup procedures too. I heard from ex-employees of a major car reselling platform that CSAM was distributed by creating a draft car ad and never publishing it; the CSAM was then hosted and accessible via direct image URLs. The records were orphaned; not sure how they got the tip-off.


As pointed out already, Magika is useful for good file type analysis. Also, scan thoroughly for viruses, potentially using multiple engines. This can be tricky depending on the confidentiality of the uploaded files, so take good care not to submit them to sandboxes, AV platforms etc. if that's not allowed. I would really recommend it though.

If you want to get into the nitty-gritty of file type abuse, to learn how to detect it well: Ange Albertini's work on polymorphic file types and the ezine PoC||GTFO, as well as lots of articles by AV devs, are available. It's a really hard problem and also depends a lot on what program is interpreting the submitted files. If it's some custom tool, there might even be unique vectors to take into account. Fuzzing and pentesting the upload form and any tools handling these files can potentially shed light on those issues.



I use lambda for image uploads.

The function ensures images aren’t bigger than 10 MB; the image is compressed to our sizing and put into S3.


With Google Cloud Storage you can simply set a header with the max size allowed when signing the upload URL.


I believe AWS S3 allows setting the Content-Length header as well. So you might not need the lambda


Can you explain what you mean? I'm confused about the "might not need the lambda". Thanks!

In my case we allow up to 10 MB file uploads. For example, just testing something right now from my iPhone, I selected a 3.7 MB image which was uploaded and resized to the "frame" it will go into in the UI, and it's now 288 KB in S3. We have a few "frame" sizes which are consistent throughout our application. Now, my CloudFront is serving a 288 KB file instead of 3.7 MB, which is good for me because I want to avoid the bandwidth costs and users honestly can't tell the difference, and we aren't a photo gallery app anyway.

So in my case I need the lambda to resize the image to our "frame" sizes. I wonder if I didn't explain it right. I'm ESL so I just want to make sure, since I'm confused about your last sentence.


They mean you might not need the lambda if all you want to do is enforce a max size; if you want to do resizing/other processing then you’d still need the lambda


You're good mate. I meant exactly what aprilnya said. I thought you were using the Lambda just to validate the 10 MB limit - and just for that you don't need the lambda :) and could save yourself some cents & effort.

But if you do resizing as well, then ofc you need the Lambda (or some other compute engine).


For what it's worth, processing the files is probably more risky for your internal infra than doing nothing. I've seen an RCE exploit from resizing profile images before.

On the other hand, not processing/scanning your uploads is probably more risky for your users/the rest of the internet.


My SaaS uses Cloudinary for uploading and storing images.

It’s not particularly cheap. But it is fast and flexible and safe.


My previous company used Cloudinary for this purpose and an amazing amount of time was saved.


Slight OT...

I created a program for profile pictures. It uses face recognition technology so as not to deform faces when resizing photos. This may be useful to you.

https://github.com/jftuga/photo_id_resizer


I was part of designing a user file upload flow. It was a B2B product with a limited number of users, who were in principle trusted, but similar to other comments we did something like:

- some file type and size checks in web app

- pre-signed URL

- upload bucket

- lambdas for processing and sanity checks

- processed bucket


It depends on how small. I'm working with really small files (like less than 5 MB per upload) and use a FastAPI endpoint on my API to receive the file and then have other Python code save it (or reject it).
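
A sketch of such an endpoint (the cap, the destination path and the random-name choice are example decisions, not necessarily what I run):

    # Sketch: receive a small upload through the API itself and enforce a size
    # cap before handing the bytes to whatever validation/storage comes next.
    # Note: form uploads in FastAPI need the python-multipart package, and a
    # reverse-proxy body-size limit should also cap the request upstream.
    import uuid

    from fastapi import FastAPI, File, HTTPException, UploadFile

    app = FastAPI()
    MAX_BYTES = 5 * 1024 * 1024  # example limit matching the ~5 MB above

    @app.post("/upload")
    async def upload(file: UploadFile = File(...)):
        data = await file.read()
        if len(data) > MAX_BYTES:
            raise HTTPException(status_code=413, detail="file too large")
        # Validate magic bytes / re-encode here before trusting the content.
        # Never reuse the client-supplied filename; generate your own.
        name = uuid.uuid4().hex
        with open(f"/var/uploads/{name}", "wb") as fh:  # placeholder destination
            fh.write(data)
        return {"id": name, "size": len(data)}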


We run an image scaler on AWS Lambda based on libvips. We cache the responses from it with Cloudflare. We compared it to letting Cloudflare handle the scaling, and the Lambda was several times cheaper.


https://github.com/google/magika

    Magika is a novel AI powered file type detection tool that relies on the recent advance of deep learning to provide accurate detection.
It's not a silver bullet, but I've used it recently for inspecting file types instead of magic-file detection.

One advantage is that it detects composite files. Take a PDF+EXE file for example: the library will report something like 70% PDF and 30% PE binary.


I don't think this project was actually meant for production use, and especially not under decidedly hostile conditions.

I'd suggest existing tooling like `exiftool` to do mimetype detection and metadata stripping.


Have you considered a commercial solution?

https://developer.massive.io/js-uploader/

Just point it to your local files, as big as you want, and it does the rest. It handles the complexities of S3 juggling and browser constraints. Your users pay nothing to upload, you pay for egress.

Full disclosure: I'm MASV's developer advocate.


I'm not affiliated, but a cloud file service from someone like Backblaze may interest you.


> I don't feel this is safe enough. I also feel we could do better...

What is the business case for making the necessary changes?

Good luck.


Well, security for starters. Having user A upload a malicious_file.png and then user B download it would not be ideal. Secondly, UX - having thumbnails for files can improve loading speed. Arguably not the most important UX feature, but we have the resources to spend on this, so why not.


> Having user A upload a malicious_file.png and then user B download it

You will usually want the profile image of user A to be shown to user B. Same for videos and other files - users likely upload them so that others can download them / consume them in some way, right?

I think where you need to start is doing some threat analysis, and proceed from there. Hosting user content can be built out from "very small" to "very big", depending on the particular threat scenario/use cases/your particular userbase. With the description that you are giving, I would say you are more at risk of building an overcomplicated ("oversecured") solution which might compromise UX for the sake of some protection that is not necessarily needed.

If you are a small team, likely you could use an image resizing / video thumbnailing proxy server such as https://www.imgix.com/ https://imgproxy.net/ etc. You generate a signed URL to it and then the service picks up the file from S3 and does $thing to it. https://www.thumbor.org/ is another such tool. There are quite a few.

Re. uploads and downloads - you have quite a few options with S3. You can generate the presigned upload URL on the server (in fact: you should do just that), make it time-limited and add the Content-Length of the upload to the signed headers - this way the server may restrict the size of the upload. Similarly, access to display the images can be done via a CDN or using low-TTL signed URLs... plenty of things to do.


> I don't feel this is safe enough. I also feel we could do better by optimizing images on the BE, or creating thumbnails for videos.

Yeah, definitely. Even optimizing the vids. I just spent time writing scripts to convert, in parallel, a massive amount of JPG, PNG, PDF, MP4 and even some HEIC files customers sent of their ID (identity card or passport, basically). I shrank them all to a reasonable size.

The issue is: if you let users do anything, you'll have that one user, once in a while, who will send a 30 MB JPG of his ID. Recto. Then verso.

Then the signed contracts: imagine a user printing a 15-page contract, signing/initialing every single page, then not scanning it but taking a 30 MB picture with his phone, at an angle, in perspective. And sending all the files individually.

After a decade, this represented a massive amount of data.

It was beautiful to crush that data to anywhere from 1/4th to 1/10th of its size and see all the cores working at full speed, compressing everything to reasonable sizes.

Many sites and 3rd-party identity verification services (whatever these are called) do put a limit on the allowed size per document, which already helps.

In my case I simply used ImageMagick (mogrify), ffmpeg (to convert to x265) and... GhostScript (good old gs command). PDFs didn't have to be searchable for text so there's that too (and often already weren't at least not easily, due to users taking pictures then creating a PDF out of the picture).

This was not in Amazon S3 but basically all in Google Workspace: it was for an SME, to make everything leaner, snappier, quicker, smaller. Cheaper too (no need to buy additional storage).

Backups of all the original full-size files were of course made too, but these will probably never be needed.

In my case I downloaded everything. Both to create backups (offsite, offline) and to crush everything locally (simply on an AMD 7700X: powerful enough as long as you don't have months of videos to encode).

> Anybody have experience with any of this? I imagine having a big team and dedicated services for media processing could work, but what about small teams?

I did it as a one-person job. Putting limits in place, or automatically resizing, right after upload, a 30 MB JPG file which you know is of an ID card down to a 3 MB JPG file, doesn't require a team.

Same for invoking the following to downsize vids:

    ffmpeg -i input.mp4 -vcodec libx265 -crf 28 output.mp4    (I think that's what I'm using)
My script's logic was quite simple: files above a certain size were candidates for downsizing; then, if the output was successful and took less than a certain amount of time, use it, otherwise keep the original.
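
Roughly, a sketch of that logic (thresholds, paths and the timeout are made-up example values; the ffmpeg flags are the ones above):

    # Sketch: downsize videos over a size threshold with ffmpeg/x265; keep the
    # original if re-encoding fails, runs too long, or doesn't actually shrink it.
    import os
    import subprocess

    SIZE_THRESHOLD = 20 * 1024 * 1024   # only bother with files over ~20 MB
    TIMEOUT_SECONDS = 30 * 60

    def maybe_downsize(path: str) -> None:
        if os.path.getsize(path) < SIZE_THRESHOLD:
            return
        tmp = path + ".x265.mp4"
        try:
            subprocess.run(
                ["ffmpeg", "-y", "-i", path, "-vcodec", "libx265", "-crf", "28", tmp],
                check=True, timeout=TIMEOUT_SECONDS,
            )
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            if os.path.exists(tmp):
                os.remove(tmp)
            return                      # keep the original
        if os.path.getsize(tmp) < os.path.getsize(path):
            os.replace(tmp, path)       # use the smaller re-encode
        else:
            os.remove(tmp)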

I didn't bother verifying that the files visually matched (once again: all the originals are available on offline backups in case something went south and some file is really badly needed) but I could have done that too. There was a blog post posted here a few years ago where a book author would visually compare thumbnails of different revisions of his book, to make sure that nothing changed too much between two minor revisions. I considered doing something similar but didn't bother.

Needless to say my client is very happy with the results and the savings.

YMMV but worked for me and worked for my client.


Why are you persistently storing copies of user passports? This is a data breach waiting to happen. Surely these should only be kept until whatever verification you wanted to do is complete, and then deleted forever?


Is HN turning into StackOverflow now?


This would probably be closed on SO as too broad.


Good. It doesn't work here either.


Is this really the state of the industry? Where an SRE is asking how to handle user media on the web?

I'm not diminishing asking the question in principle, I'm questioning the role and forum that the question is being asked on.




