Does GPT-2 Know Your Phone Number? (bair.berkeley.edu)
321 points by umangkeshri on Dec 27, 2020 | 155 comments



Playing with AI Dungeon a while back (on the GPT-2 mode) I was presented with a tilapia recipe, titled "Kittencal's Broiled Tilapia" - it sounded bizarre, so I did a Google search and found that it was directly pulled from https://www.recipezazz.com/recipe/broiled-parmesan-tilapia-7... - the user who posted it was 'Kittencal'


There is enough GPT discussion online now that the next language model will be trained on text that talks about itself. That's when it gets interesting.


This has been a thing for over a year now. https://www.reddit.com/user/subsimgpt2metaGPT2/?sort=top


This seems like what xkcd called "citogenesis" - something inaccurate gets published on Wikipedia, a traditional "reliable source" (newspaper, published nonfiction book, etc.) retells that bit of information without attribution, Wikipedia then cites that publication as its source, and the information is now firmly established as truth for the rest of humanity.

I worry about this with Google's language translation models, too. It's entirely possible that it's making up phrases or connotations that never existed in organic human speech, but people who aren't fully fluent in the language use Google Translate for assistance, publish something, and then suddenly it's in a published text by an actual human and Google reinforces its own belief.

For at least the past five years, Google Translate has translated "who proceeds from the father" into Latin as "qui ex patre filioque procedit" - inserting the additional word "filioque," which means "and the son." The question of whether to add this word is a 1500-year-old theological argument: https://en.wikipedia.org/wiki/Filioque Since the Western Church added the word, most texts in Latin include it, so Google is almost certainly deciding that the phrasing with "filioque" is more popular - but it doesn't know what the words mean, so it can't realize that the phrase it came up with means something different!


I met an artist whose body of work consisted of citogenesis. He'd been at it for a decade when I met him, and had created at least a dozen fictional artists, detailed biographies, and a complete "retrospective" gallery show for each. This let him copy the styles of the great masters of different periods, yet make something new, of a sort. He never edited Wikipedia himself, but instead led gallery tours of these retrospectives and was always overjoyed when people would write about them, propagating the fiction.


Google Translate's Latin translations come from just a simple Markov model.

https://www.reddit.com/r/latin/comments/6akqdi/why_is_google...

Translation for many (most?) other languages does something more sophisticated.
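
For anyone curious what "just a Markov model" means in practice, here's a toy word-level sketch (not Google's actual system): the "model" is nothing more than counts of which word follows which, so it happily stitches together frequent phrases from its corpus, which is exactly how an extra "filioque" gets glued on.

    # Toy word-level Markov chain: the whole "model" is a table of
    # which word followed which, so frequent phrases get reproduced.
    import random
    from collections import defaultdict, Counter

    corpus = "qui ex patre filioque procedit qui ex patre procedit".split()

    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def generate(start, length=6):
        word, out = start, [start]
        for _ in range(length):
            if not follows[word]:
                break
            # pick the next word in proportion to how often it followed this one
            words, counts = zip(*follows[word].items())
            word = random.choices(words, weights=counts)[0]
            out.append(word)
        return " ".join(out)

    print(generate("qui"))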


I really wish someone would come up with some better Latin translation software/service. Google Translate is so bad for Latin it is close to useless.


If I'm reading that post right, Latin and other languages all use the same model, they just have more training data for more common pairs of languages.


They use some kind of neural net thing for other languages. e.g. here is a paper https://arxiv.org/abs/1609.08144


This is an issue solvable with better provenance / data lineage, which has come up in recent HN discussions.


How do you enforce that other people provide provenance on text?

Taking the translation example - how do you enforce that users of Google Translate keep provenance in their translated text that remains with the text?


Sounds like a kind of cyber-meme?


I ran an experiment through GPT-3 with just that frame: https://github.com/minimaxir/gpt-3-experiments/tree/master/e...


What does the x_y signify? And what's going on with 0_0?? From my (very limited) understanding a fixed point/cycle like that should be vanishingly unlikely.


Those are the temperatures (0.0, 0.7, etc.) that control the craziness. 0.0 means the model makes the optimal guess for each token.

Unfortunately, both GPT-2 and GPT-3 have a tendency to enter loops. But it's not always bad: https://github.com/minimaxir/gpt-3-experiments/blob/master/e...
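
For anyone wondering what temperature does mechanically, here's a minimal sketch over a made-up next-token distribution (toy logits, not real GPT output). At 0.0 you always take the argmax, which is why the output is deterministic and loop-prone; at higher temperatures the flattened distribution lets less likely tokens through.

    # Minimal sketch of temperature sampling over toy logits.
    import numpy as np

    logits = np.array([2.0, 1.5, 0.3, -1.0])  # scores for 4 candidate tokens

    def sample(logits, temperature):
        if temperature == 0.0:
            return int(np.argmax(logits))      # greedy: always the same token
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        return int(np.random.choice(len(logits), p=probs))

    print([sample(logits, 0.0) for _ in range(5)])  # e.g. [0, 0, 0, 0, 0]
    print([sample(logits, 0.7) for _ in range(5)])  # varied tokens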


Thanks, didn't twig onto the fact that you linked a subtree of the whole repo. Weird that even with the nonzero temp the AskReddit prompt went a bit loopy.

> https://github.com/minimaxir/gpt-3-experiments/blob/master/e...

Oh my goodness that is absurd in the most delightful way. Thanks for sharing that.


Oooh, now that does raise some spooky time-delayed consciousness vibes.


GPT-3 knows about GPT-2 and can generate an article about itself.


I've noticed many GPT-3 articles tending to wander into being articles about AI, fear of AI, and GPT-3 itself without any prompting. Many people's speech tends to become discussions about themselves, so that might demonstrate "humanity" in an odd sense.


...but GPT-3 will only ever describe itself by explaining the features and capabilities of GPT-2. That's usually a very easy giveaway.


Wait, is GPT-3 the trained network or is it the architecture and methodology behind it? If they re-run the same code on a fresh web crawl would that then be called GPT-4? I'd have thought you could update GPT-3's... knowledge? Understanding? Whatever the hell it is? by doing this semi-regularly.


So in terms of copyright, is GPT-2 a derived work of that recipe? Or generally are models derived works of their training data?

It seems lots of people use training data from Flickr, like COCO, and then use the resulting model for commercial services.


Ethical companies make sure they have rights to the training data before making models.

That said, now a lot of privacy policies make sense.

ALSO, it makes me wonder about "Your call is recorded for training purposes"... is it a coincidence or is it very carefully worded?


C) Unchanged for decades - that wording predates speech recognition or neural networks being more than research curiosities. The big surprise was how quickly the double meaning became possible.


I've been in meetings with lawyers where this issue was discussed. As of 2018 the answer was "not enough case law to know, hold tight, it's going to be weird one day!"


It's as much a derived work as the user who posted the anecdote here. They were trained on the same data!


all of us are derived works


Probably depends on how similar your output is to something copyrighted. I doubt the simple act of learning from previous art is disallowed, or else literally everyone would be blocked, because no one has a truly original thought.


Recipes aren't typically copyrightable

In general, I think the question is unsettled


The “recipe” itself (the ingredients, the steps) isn’t copyrightable but the surrounding text of a “recipe” can be considered copyrightable.

So you can happily reproduce the statements of facts and the process. You can’t include someone’s anecdotes about how their grandfather liked to make the dish for New Year’s Eve.


It's ambiguous whether the directions are copyrightable. The less useful and more literary the directions are, the more copyrightable they are. But paraphrasing to extract the technically useful directions escapes infringement.

https://www.copyright.gov/help/faq/faq-protect.html


IANAL, but from my reading Article 3 of the EU copyright directive explicitly protects data mining for scientific research, and Article 4 enables everyone to use content for data mining whose rights have not been explicitly and machine readably reserved.


Edit: est31 and parent seem right and I stand corrected.

Wrong: I don’t think you‘re paraphrasing article 4 correctly. Both are about science/teaching exceptions. Not general exceptions. https://www.consilium.europa.eu/media/35373/st09134-en18.pdf pp 43


That's the May 2018 version. The final/official April 17 2019 version [0] contains a different Article 4 which is the one I referred to: https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CEL...

[0]: apparently there were some mistakes in translations which required changes in some language versions, but the english version was final IIRC https://eur-lex.europa.eu/eli/dir/2019/790?locale=de


The current legal understanding is that, under existing copyright law, ML models trained on some data are not considered derived works of that data. So if the data was obtained legally and without other restrictions (e.g. the researchers signing a contract with the data owner that grants access to the data subject to certain conditions), then the authors of the data do not have any claim to rights in the resulting model.

One aspect that causes this is that historically, statistical models calculated from large volumes of text (a notion that predates computers, e.g. frequency dictionaries and the whole [sub]field of quantitative corpus linguistics) have been considered facts about that corpus of text, and thus either not copyrightable at all or (depending on jurisdiction) entitled to the different set of protections/limitations assigned to compilations of facts, which give some rights to the people who compiled the facts but no rights to the sources of those facts (since facts as such aren't entitled to protection by copyright law).

This also applies to many forms of analysis of audiovisual data, where the copyrights of the source works do not transfer to the results of the statistical or qualitative analysis and can't limit their creation, distribution or sale.

The appropriate analogy to a commercial book or movie is not a translation, but some analysis of it - e.g. a thorough literary review and critique of some book or movie is a separate work with its own copyright, and the original author has no claim on it despite the fact that it is (obviously) based on the contents of the work and describes it in great detail. Including verbatim fragments of the work is limited (fair use allows some inclusions but not all), but all the other details are not.

The whole notion of copyrightability of ML model weight files is interesting and IMHO not settled. You could argue that there is some creative expression in forming the model (which would support it being copyrightable) or you could argue that it's a mechanistic result of the application of some algorithm and settings (which have the creative part, and are copyrightable on their own), and so the output can't be copyrightable, no matter how much work (human or machine), time and cost it took - at least in USA copyright law doctrine (e.g. Feist Publications v. Rural Telephone Service) is that mere "sweat of the brow" (no matter how much) does not entitle a work to copyright protection; it requires application of human creativity to create an original work, and automated processes can't satisfy that requirement.

And crucially, if some output is not copyrightable in the first place, it can't be considered a derived work according to copyright law i.e. the exclusive right of authors to create derived works (or grant permission for others to do so) does not apply.

Another analogy might be a simple n-gram model (i.e. counts of bigrams - word pairs - trigrams, etc.), which is quite clearly a mechanistic, noncreative collection of facts about a dataset, and is also able to "answer" questions such as what someone's telephone number is, if that was in the source data.
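
To make that last analogy concrete, here's a toy bigram counter (made-up name and number): a plain count table, built with nothing but counting, will still "complete" a phone number that appeared in its source text.

    # Toy bigram count table "completing" a phone number from its source text.
    from collections import defaultdict, Counter

    text = "call peter w at 555 0100 or email peter w".split()

    bigrams = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        bigrams[a][b] += 1

    # "answer" a query by repeatedly taking the most frequent continuation
    token, answer = "at", []
    for _ in range(2):
        token = bigrams[token].most_common(1)[0][0]
        answer.append(token)
    print(" ".join(answer))  # -> "555 0100"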


It cited it with the author's name, hardly a derived work :)


This is something that has repeatedly bothered me while playing with GPT-2. It retains too much "long-tail" knowledge that seems counterproductive for a generic language tool. I would think those one-off associations should not be present after training has "digested" them but there they are.


Hard to decide how much is too much.. the hard (and interesting) part of language is in the long tail


It's overfitting, not the long tail, that is the problem. It ends up copying input instead of analyzing and synthesizing.


> There is a legal grey area as to how these regulations should apply to machine learning models. For example, can users ask to have their data removed from a model’s training data? Moreover, if such a request were granted, must the model be retrained from scratch? The fact that models can memorize and misuse an individual’s personal information certainly makes the case for data deletion and retraining more compelling.

This is an interesting angle I had not considered before. It seems like “right to be forgotten” requests could be quite damaging to the “train once run anywhere” promise of some of these models. (Or this could just mean that the training data needs to be more carefully vetted for personal data, but probably both as no vetting process can be 100% successful).


I think there are also going to be new difficult problems arising training on copyrighted data. Can I do a DMCA if your model contains/regurgitates too much of my book verbatim in a way that doesn’t meet fair use?


That depends on the country, but as for the US (since you mention the DMCA), ML using copyrighted work is probably legal. There just isn't much case law. Talk to a lawyer though; depending on how risk-averse you want to be, there are definitely ways to avoid issues, e.g. by ensuring you have the rights to use such copyrighted works, etc. Although making a very large curated data set can be expensive.


Amusing alternative: for model-as-a-service kind of things there could be a blacklist or another neural network to handle the banned cases.

"I'm afraid I cannot tell you that, Dave"


A friend asked me if Google knew his phone number and details; I said they do, at a quantum level. If you don't look, they may or may not have your details; yet if you look, you are giving them the details to search for, and then they would have them if they did not before.


That's why when you visit rotten sites like mylife.com, the first thing they ask is whether you are searching for yourself, so that they can extract as much info from you as possible.


Like websites offering to search for a user's details in a data breach, or a padded hash of those details, in order to inform the user if their details have been leaked. https://en.wikipedia.org/wiki/Have_I_Been_Pwned%3A
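
For reference, the "padded hash" idea roughly works like the Pwned Passwords k-anonymity lookup, as I understand it: only a short hash prefix ever leaves your machine, and you check the returned suffixes locally. A rough sketch (endpoint and header as I recall them, so treat this as illustrative rather than authoritative):

    # Rough sketch of a k-anonymity style lookup: send only the first 5 hex
    # chars of the SHA-1 hash, then match the returned suffixes locally.
    import hashlib
    import urllib.request

    password = "hunter2"
    digest = hashlib.sha1(password.encode()).hexdigest().upper()
    prefix, suffix = digest[:5], digest[5:]

    req = urllib.request.Request(
        f"https://api.pwnedpasswords.com/range/{prefix}",
        headers={"User-Agent": "hn-example"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read().decode()

    hits = [line for line in body.splitlines() if line.startswith(suffix)]
    print("leaked" if hits else "not found in this dataset")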


s/%3A//

Sorry about that.


One important corollary of this is that "privacy respecting" federated learning schemes [0] where you train a model locally with your data and only upload the deltas might leak your private data after all.

[0]: https://arxiv.org/abs/1602.05629


In practice you add large gaussians to the deltas, which gives you reasonably strong guarantees.


It's a common fallacy. Adding noise does not anonymize data sufficiently. If your traffic monitoring software adds gaussians to car speeds, you can still average over multiple reports about two cars to identify their respective average speeds. Those can be used to determine whether they are trucks or cars, etc.

In neural networks, weight changes do carry meaning. If your network has particularly large updates for the horse recognition neuron, you likely watched horse pictures. If it has updates for handbags, you likely watched pictures of them. If the company averages over all your submissions, the noise will be less relevant and the handbags and horse pictures will eventually show up.
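
A toy demo of that point, with made-up numbers: fresh Gaussian noise hides any single report, but averaging many reports from the same user washes it out. Only averaging across users (and discarding the per-user deltas) protects the individual.

    # Fresh noise per report hides one report, but per-user averaging
    # recovers the per-user signal.
    import numpy as np

    rng = np.random.default_rng(0)
    true_delta = 0.8                       # e.g. a big update to a "horse" weight
    reports = true_delta + rng.normal(0, 5.0, size=10_000)  # very noisy per report

    print(reports[0])        # a single report tells you almost nothing
    print(reports.mean())    # ~0.8: the per-user signal comes back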


This isn't about averaging over all your submissions; it's about averaging over the entire population (in this case, of iPhone users, ~1,000,000,000). Data about the entire population, e.g. how many people click this button, how many results most people get from this page, or what kinds of things people take pictures of, is, I think, safe.


If you average over the entire population, it's privacy preserving indeed. But this averaging happens on the side of the server, not the client, so I can't really verify whether the server stores away the deltas of interesting/all users or really averages over all user deltas. I know Apple has some means to anonymize users by using random changing IDs instead of the Apple ID, but they could correlate it to an Apple ID via the IP address or other means.


Ultimately, what can you really verify as a user? Apple is very different from bitcoin, or IPFS or Matrix or the like, in that they have an existing reputation. Instead of trying to provide perfect guarantees, they make a real attempt at preserving privacy, and you just have to trust them on that. I personally trust them after having had experiments stymied by the privacy team, but I think in general there's reasonable evidence that Apple really does try to be privacy-preserving.

For your example, sure it's possible for the server to do something you don't know. But it's the same people on the server doing the aggregation as on the client doing the obfuscation. If they really wanted they could easily e.g. set the seed for the noise to be based on a key which they have on the server, which would be very difficult to detect. I think you just have to trust that apple wouldn't let that happen.


Well, the game in machine learning is averages conditional on the input. So if the input essentially identifies an individual, you'll end up averaging over the inputs for that individual.


How is the strength of these guarantees quantified?

Any resources?


This is what I remember from working closely with the people doing this at Apple. I believe they may have published a paper, but I'm not sure. Vojta Jina was the relevant manager.


> we found numerous cases of GPT-2 generating memorized personal information in contexts that can be deemed offensive or otherwise inappropriate. In one instance, GPT-2 generates fictitious IRC conversations between two real users on the topic of transgender rights. The specific usernames in this conversation only appear twice on the entire Web, both times in private IRC logs that were leaked online as part of the GamerGate harassment campaign.

Most countries have libel/defamation related laws that cover this and I hope this gets tested in court soon.

Exposing software/machine-learning algorithms an entity doesn't fully understand shouldn't be a defense in court. At the moment developers just throw a sentence into their software license saying they aren't liable for damages, but this isn't good enough. Someone is liable; if the law decides that the original creator isn't liable, then the entity that hosts/runs the software needs to be.


The article makes the claim that models which show similar train and test losses demonstrate minimal overfitting -- and are therefore less likely generally to exhibit a lot of text memorization.

I wonder to what degree this inference holds in practice with respect to information like phone numbers... How exactly are the train and test sets formed in a de-correlated-with-respect-to-memorization-of-phone-numbers manner for GPT-class models trained on corpora the size of the internet?

If a particular person's phone number occurs 1000 times in the corpus prior to being split into train/test sets, what are the chances that the number only appears in either the train or test set but not both?


That raises a more general issue: Text scraped from the internet may be duplicated to multiple sites with few changes. How does the team behind GPT-n ensure that the corpus is maximally deduplicated before splitting into train and test corpora?
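
Presumably with some kind of fuzzy near-duplicate detection (MinHash/LSH and the like, since mirrored text is rarely byte-identical). Here's a deliberately naive sketch of the exact-duplicate case, just to show why dedup has to happen before the split:

    # Naive exact-dedup by hashing normalized text before the train/test split.
    import hashlib
    import random

    docs = ["Call Peter at 555-0100.", "call  peter at 555-0100.", "Something else."]

    def normalize(doc):
        return " ".join(doc.lower().split())

    seen, unique_docs = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            unique_docs.append(doc)

    random.shuffle(unique_docs)
    split = int(0.9 * len(unique_docs))
    train, test = unique_docs[:split], unique_docs[split:]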


I've been playing with a writing tool, shortlyread.com, which purportedly uses the GPT-3 API. I had a similar experience: its responses to a lot of my prompts contained text verbatim from many sources, and sometimes even went on to output personally identifying information about persons related to the original text.


Took a look at that.

It's better at being a text editor, in that you can click the text and edit it! (Sweet relief.)

But it's missing the world-info feature from AI Dungeon, and without that I don't think it's practical to write anything long-form.


It actually does have that feature; it's tucked away in the probe panel.


It’s always fascinating to apply data/results like from this paper to help evaluate the hypothesis that machine learning/AI is mostly just a rough “lookup table” or memorization.


It kind of is. The first self-learning algorithm I was taught adjusted a single variable in order to approximate a linear function. Now, being continuous and unbounded, such a function is mathematically impossible to memoize; however, you can memoize the function calls as they happen. The results would be indistinguishable.

However, real machine learning tries to approximate functions in n dimensions, and that is really, really, really hard to do. Currently no one really has a lookup table. There are some inputs where the error level is acceptably low, and others where the model just isn't optimized enough and the errors are ridiculous. The only question that remains to be answered is whether any of these high-dimensional functions can actually be found by machine learning, or if we are just stuck with these endless approximations. Also I suppose you could ask if any such functions actually exist; maybe certain phenomena are just pure chaos.
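
That one-variable exercise, for anyone who hasn't seen it, looks roughly like this (toy numbers): learn the slope w in y = w * x by gradient descent on squared error.

    # Toy one-parameter "self-learning" exercise: fit the slope of y = w * x.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]   # generated by the "true" function y = 2x

    w, lr = 0.0, 0.01
    for _ in range(1000):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad

    print(w)  # converges to roughly 2.0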


Anything trained by gradient descent is in some sense a weighted combination lookup table (see Domingos's recent paper on everything being kernel machines; I'm on my phone so won't try to find it). The crucial bit is finding an appropriate weighting. If you look up the nearest training data point and just use it to predict, you assume a notion of "nearest." Getting that right is the trick.

So even if it all is mathematically equivalent to approximate lookup and approximate memorization, that doesn’t mean it’s “just” that.


Thanks for the reference. Here it is:

“We show, however, that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel)“

https://arxiv.org/abs/2012.00152
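
The claim in miniature: predict by a similarity-weighted average over the training data itself. This is just a toy RBF-kernel regressor, not the paper's actual equivalence construction, but it shows what "memorizes the data and uses it directly for prediction via a similarity function" means.

    # Toy kernel-machine-style prediction: a similarity-weighted average
    # over stored training points.
    import numpy as np

    X_train = np.array([0.0, 1.0, 2.0, 3.0])
    y_train = np.array([0.0, 1.0, 4.0, 9.0])   # memorized examples of y = x^2

    def predict(x, bandwidth=0.5):
        weights = np.exp(-((x - X_train) ** 2) / (2 * bandwidth ** 2))
        return (weights @ y_train) / weights.sum()

    print(predict(1.5))   # interpolates from the stored points, ~2.5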


I only took a beginner course in ML over 5 years ago, so this probably is a stupid question, but does this mean the trained GPT-2 model encodes the source text into its parameters somehow? Is this resilient - will it still remember it if we randomize the weights just a little tiny bit? Will it remember it if we clear a small portion of parameters?

Does the human memory work the same way?


If you can recall text verbatim, the text has been encoded into your neurons’ connections. Tautologically.


GPT-2 was trained with a Dropout of 0.1, which gives it a bit more resilience to noise and unusual input versus no Dropout.
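
In toy form, dropout of 0.1 means each activation gets zeroed with probability 0.1 during training (with the survivors rescaled), so the network can't lean too hard on any single unit:

    # Toy inverted-dropout illustration with p = 0.1.
    import numpy as np

    rng = np.random.default_rng(0)

    def dropout(activations, p=0.1):
        mask = rng.random(activations.shape) >= p
        return activations * mask / (1.0 - p)   # rescale the surviving units

    h = np.ones(10)
    print(dropout(h))   # most entries ~1.11, roughly one zeroed out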


I was expecting a bloom filter based solution to block verbatim reproduction of training data. They just need to hash the sensitive n-grams (hopefully, a small part of the whole dataset) and store one bit per hash.

Alternatively, they could do something GAN-like and have a 'discriminator' classify whether a sample is natural or synthetic. Then, at inference time, condition on being original.

So, verbatim training data reproduction - I don't think it's going to be a problem, I think the author is making too much of it.

On the contrary, let's have this knob exposed and we could set it anywhere between original and copycat, at deployment time. Maybe you want to know the lyrics of a song, or how to fix a Python error (copycat model is best). Maybe you want an 'original' essay for homework inspiration. Who knows? But the model should know PII and copyrighted text when it sees it.
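
To make the Bloom-filter suggestion concrete, a rough sketch (toy hashing, no real tokenizer): hash each sensitive n-gram into a bit array with k hash functions, and at generation time reject a candidate continuation whose bits are all set.

    # Toy Bloom filter over sensitive n-grams: one bit array, K hash functions.
    import hashlib

    SIZE, K = 1 << 20, 3
    bits = bytearray(SIZE // 8)

    def positions(ngram):
        for i in range(K):
            h = hashlib.sha256(f"{i}:{ngram}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % SIZE

    def add(ngram):
        for pos in positions(ngram):
            bits[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(ngram):
        return all(bits[pos // 8] & (1 << (pos % 8)) for pos in positions(ngram))

    add("peter w 555 0100")                       # made-up sensitive 4-gram
    print(maybe_contains("peter w 555 0100"))     # True -> block this output
    print(maybe_contains("peter likes tilapia"))  # almost certainly False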


If you want a model that returns original results, you're looking for a database.


Think of how difficult it would be to identify or classify anything at the data scale of "all of google". You can parse that into tokens and calculate aggregate relationships, but that's about it.

You can use GPT-2/3 to determine whether its generated text is "original vs copycat" to a degree by changing the prompt. This is currently more of an art than a science.


It's a language model. Asking it to do anything original is out of scope.


Almost all its outputs are original, based on ngram filtering. It can be original because at each token there is a stochastic step.

But if you mean by original to invent a whole new genre, or completely new esthetics, I agree, it's out of its scope. It is a great interpolator.


> When Peter put his contact information online, it had an intended context of use. Unfortunately, applications built on top of GPT-2 are unaware of this context, and might thus unintentionally share Peter’s data in ways he did not intend.

That's expected: when you publish anything online, you lose control over the data.


In the same sense that when you leave the house you lose control over what happens to you. But if something bad happens to a person, the right solution is not to say, "Well, you left the house, so tough luck."


You'd be right, if my copy left my house. It's the physical vs. digital theft argument.

If I leave the house I take on some risk, but most of the time I'm in control.


Ah yes. And if your name is included in an accusation of murder, well, I guess you shouldn't have published your name.

This is infantile reverse justification from an end. GPT style corpuses are clearly problematic without a lot more smarts on figuring out what's appropriate and what's not.


Anyone can publish any nonsense online since the internet exists, how is this any different?

Responsibility is on those who publish content online, not the tools.


If GPT-style tools are powering chat bots or augmenting content generation then it matters, because they may not be supervised. The tools don't have common sense, but they are powerful enough to be automated.


Except for copyright, which works completely differently to magically protect your rights for up to 70 years after your death.


So copying an internet-published trailer of a Marvel movie is a crime, but my privacy is up for grabs.


Depends entirely on what you share publicly. GPT's training data involves only publicly crawlable datasets.

We should be more concerned about mitigating GPT-like systems because one thing is for sure, we can't stop these models. The genie is out of the bottle and I'm sure multiple actors are working on similar systems.


Still doesn’t justify what could be perceived as borderline defamation.


Who is defaming who? One person using GPT to generate text that puts X into a bad light? Then the person using GPT is responsible; they could just as well have used string replace on %victimname%.

Or GPT accidentally defaming someone? That assumes bad intent which is a bit problematic with a language model.

With made up contexts like this in the article it's easy to cry wolf.


"When Peter put his contact information online, it had an intended context of use"

That is a great way of thinking of it. Lots of information, particularly from 90s and 00s, got put up on the 'Net with no intention of it being archived in perpetuity for public consumption and used for unintended purposes.


I was in a discord with someone that had direct access to GPT-3. We played Jeopardy with it, where we would take a wikipedia article about a person and use a snippet of the first paragraph as the prompt and ask GPT-3 who it was.

It was very good at guessing the right person.

I'm curious if one can use GPT-3 to do some clustering of people based on their writing. For example, if I take the writing of several diagnosed sociopaths as a prompt or fine tuning data, could I use GPT-3 to detect same in the wild?

I would imagine GPT-5 or 6 will start consuming video as well. This will be interesting once content from YouTube, TikTok, WSHH, etc. gets added into the mix. So not only will it be able to generate text, it will generate a convincing video of a person speaking to you with plausible facial expressions and intonation in speech.


Guess it's because it was likely trained on Wikipedia data. After all, GPT-3 is no stranger to borrowing phraseology from verbose texts on the internet.


That was our conclusion as well, but that's still pretty powerful. The remarkable part for me was that it contextually knew that we were asking for the name of the subject that the text was about.


But did it respond in the form of a question?


GPT is generalisable to other kinds of data. See ImageGPT for example.


The sociopath thing is just text classification; you could do that now without GPT-3. Video data would be significantly more difficult to model because it doesn't follow a simple set of rules like written language does.


I've been saying this for years, but every post you make online, every unencrypted email, IM, text message, etc, will eventually end up getting sold off as training data for future machine learning projects.

Every company stores this stuff for ages, and the value of candid conversation data just keeps increasing. Eventually these companies are either going to get hacked, get bought, or go bankrupt, and all the cleartext data they hold is going to get passed around to various data markets and end up incorporated into GPT-12 or whatever.

The moral of the story here isn't that companies need to stop storing data, or that we need to run ML researchers out of town. It's that people really need to start using the encryption technologies that were built decades ago to protect one of their most valuable assets: their mental model of the world. Otherwise these systems, which are being trained to extract as much value out of you and the ones you love as possible, will use this data you're giving them for free against you, and it'll be too late to do anything about it then.


Isn't that how humans work? Sometimes I wonder if I have free will and sentience, or if I am just stringing words together based on some probability function.

I've no doubt taken into account copyrighted works and personal information when training my built-in neural network. The examples the article gives like "misremembering" the murder as the murder victim sounds like something a person would do. Knowing verbatim contact information of some random person is also possible. All in all, GPT-2 sounds a lot like us.


Of course it isn't how the human mind works.

You can ask a person "did you come up with that sentence, or is it something you read?" And they can answer that question. The answer isn't perfectly reliable, but it isn't completely unreliable either. You can't do that with GPT-2/GPT-3.

Big picture: the human mind is a machine of sorts. It's not the same kind of machine as any form of machine learning we've developed so far.


> The answer isn't perfectly reliable, but it isn't completely unreliable either.

How would we be able to tell how reliable it is in general?

"""The left brain of people whose hemispheres have been disconnected has been observed to invent explanations for body movement initiated by the opposing (right) hemisphere, perhaps based on the assumption that their actions are consciously willed.""" - https://en.wikipedia.org/wiki/Neuroscience_of_free_will#Rela...


The point is that humans have a basic ability to make specific coherent observations about themselves and the world. GPT can't do that except occasionally by accident in the middle of a muddle.


In this case you could do it with GPT-x as well, if you have access to the learning material (so the OpenAI could do it when providing the model as a service): grep the learning data after-the-fact.

No idea if it would be feasible to integrate this information directly into the model. With my limited understanding of neural networks, this seems difficult.


It’s my understanding that under our current scientific understanding of the mind, we cannot have “free will”. Where would it come from? All science knows is that neurons receive inputs from sensors, mixes them together (with some quantum randomness sprinkled in), and eventually actuates muscles to produce outputs, exactly as GPT.

Thus “free will” must come from some sort of faith. Either it was given to us by “God”, or we’re all part of some simulation and it exists on a level we don’t know about, or something else entirely too advanced for us to understand.


It depends what you mean by “free will” — the phrase “free will” is as ill-defined as “common sense” and “consciousness”.

I take the view that ∀ x ∈ <free will definitions>, Has(x, humans) ≡ Has(x, AI)


I agree with your claim. I also believe many to see “free will”/consciousness/etc as that certain je ne sais quoi that separates the human experience from that of computers:

∀ x ∈ <free will definitions>, Has(x, humans) ≡ ~Has(x, AI)

For these to both be true, <free will definitions> becomes empty. Thus I argue we have none. Edit, or rather it’s meaningless to argue if we have any or not, as it cannot be defined.


> or rather it’s meaningless to argue if we have any or not, as it cannot be defined.

This sums up my view too. Can't argue about it if you can't define it.


Yes, that also fits my observations of such arguments.


Free will is a cultural concept, not a biological one. Consciousness may be driven by a biological machine, but development of our ego is influenced a lot by culture.

Free will is faith in the sense that we learn the concept from our culture as our sense of self develops. The assumption of having it shapes our thinking such that we can say we actually have it.

A person who took their inner voice as voice from the gods would have a restricted free will in this view.


> The moral of the story here isn't that companies need to stop storing data, or that we need to run ML researchers out of town.

There is a lot of space between “user data and ML research on that data needs to be closely regulated” and “run the ML researchers out of town.” This uncharitable exaggeration makes critics sound like an uninformed mob and closes you off to legitimate ideas. This directly leads to logically flawed and morally gross suggestions like:

> It's that people really need to start using the encryption technologies that were built decades ago to protect some of their most valuable assets, their mental model of the world.

It is simply not reasonable to expect most Facebook/Skype/etc users to know enough about encryption to make good decisions here - for the same reason that you can’t expect people to have detailed understanding of food science to protect themselves from unscrupulous grocers, or to have a detailed understanding of medicine to protect themselves from fraudulent doctors. It really doesn’t matter if this knowledge has been around for “decades” since it’s still specialist knowledge.

Protecting digital privacy from unscrupulous tech companies and unethical ML researchers is a job for the government. Suggesting otherwise is victim-blaming. “The problem is that society is dumb and needs to become smart through the power of self-righteous scolding” is not helpful.


It's amazing that such a sober take on the issue is getting downvoted without a single counterargument.


Probably because the comment can be mostly summed up to:

'The majority of humans are idiots who need to be protected from themselves by the government.'


I am in fact saying “the majority of humans are not capable of defending themselves against every possible criminal, and therefore that responsibility belongs to the government.”

I sincerely have no idea what your problem is.


It’s not that humans need to be protected from themselves (individually), but that they need to be protected from other unscrupulous people, because computers are empowering the unscrupulous (and even the accidentally unscrupulous) more than they are empowering the common person. I broadly agree with ojnabieoot.


So why not teach people so they can empower themselves, instead of making them rely for protection on yet other people, who may or may not be scrupulous themselves?

Because the government is always scrupulous and trustworthy...and nobody that's ever said trust us we'll protect you has lied...


Again: you could make the same argument against the fire department (“don’t rely on the government, learn fire suppression yourself”) or dangerous pharmaceuticals (“the peer-reviewed articles are out there, you just need to learn how to read them”), or protection against assault and theft, or protection against dangerous wild animals, etc etc. In general your argument is “proving” way too much and doesn’t say anything useful.

Nobody is disputing that the government can be corrupt, and at its best will be incapable of protecting everyone from everything. And if you are sincerely anarchist then that’s more of an ideological dispute beyond the scope of this comment.

But I suspect that you actually support regulation around food and medicine, since “a fatal case of salmonella is a small price to pay to make sure we don’t rely on the government” isn’t actually a good argument. And, unlike people who are ignorant about computer technology, you probably have empathy and understanding for people who are ignorant about whether the ground beef they are purchasing is actually safe to eat. You would not be sympathetic to a libertarian food safety specialist who came in and said “well I can tell that the ground beef is unsafe, why not teach everyone my skills so they can defend themselves?”

The idea that the government can protect people when it comes to health, medicine, transportation, and personal property... but NOT when it comes to privacy, is just not coherent.


You can try that, but that approach tends to fail, because people don’t like changing what they do, and they definitely dislike jumping through extra hoops. You can improve things with education, but only up to a point. Most people just don’t want to think about any of these things, so until it gets really bad (most commonly of the “leopards ate my face” variety, when something you’ve tacitly endorsed suddenly starts to harm you), they’ll normally just go along with whatever happens. This is the whole reason for government, to maintain rough balance so that no one gets too upset. It’s extremely imperfect, as you correctly imply, but that’s the general idea of it, and it normally works well enough (better than nothing, at least), and no one has presented any better alternative.


The value of data decreases. There are diminishing returns to dataset sizes (specifically, for GPT, it's a power law). The more public data there is, and the more data in general, the less any additional piece of private data is worth. Why would any tech company with access to lots of private data, like Google, risk the enormous backlash and legal penalties when they can so easily scrape terabytes and terabytes of text from public datasets to train uncannily intelligent models like GPT-3? Even hobbyists can make the necessary datasets from Common Crawl et al (see EleutherAI's terabyte+ of clean text in The Pile dataset).

This is in addition to the ever greater sample-efficiency of bigger models, which learn eerily fast from just a handful of datapoints, rendering the supposed 'moats' of big giant proprietary datasets ever more moot...
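
A toy illustration of those diminishing returns, with a made-up exponent purely for demonstration (not the actual GPT scaling fit): under a power law, each additional 10x of data buys a smaller absolute improvement than the last.

    # Toy power-law scaling curve; alpha is invented for illustration only.
    def loss(tokens, alpha=0.095):
        return tokens ** -alpha

    prev = None
    for n in [10**9, 10**10, 10**11, 10**12]:
        cur = loss(n)
        gain = "" if prev is None else f"  (improvement: {prev - cur:.4f})"
        print(f"{n:.0e} tokens -> loss {cur:.4f}{gain}")
        prev = cur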


> people really need to start using the encryption technologies […] Otherwise these systems, which are being trained to extract as much value out of you […] as possible, will use this data […] against you, […]

You are proposing a technical solution to a social problem. That pretty much never works.


Data rots. What was true 10 years ago about me is not necessarily the case now.


It's one of the biggest lies in tech right now, that all this user data is some untapped gold mine, as if text on the internet is the most important measure of human behavior. It's all going to end up rotting on backups while the marketing department spins yarns about the amazing AI that's going to spring forth any day now from a pile of tweets.

It's like trying to build a dog by hanging around the park collecting turds.


Is the data market a bubble?


Depends what you're doing with it, but I have to think it mostly is a bubble. Maybe it makes sense if you have a really simple question that already fits what can be measured, or you're trying to surveil someone in particular, but for stuff like AI it just doesn't make sense to me.


Answer this for yourself: what percentage of ads that you see is relevant? There's your answer.


I'm so mentally blind to ads I could never consider my opinion worth a cent on this :D


"If one would give me six lines written by the hand of the most honest man, I would find something in them to have him hanged."

-- Richelieu


End-to-end encryption is a possible solution to some of this, no?


I can't use a single public service with E2E encrypted email, whether that's private or commercial.

The general state of encryption technology is that it's arcane and unusable outside of a couple of chat apps.


Email you can encrypt with PGP tho?

The problem isn't really the technology, it's that the people you are interacting with on the internet don't care about privacy as much as you do, and obviously when it comes to keeping things private, everyone has to participate.


Maybe this data will be used to 'upload' you in a virtual world after the singularity. It might be your second life.


Even in the science fiction version of this, we're still talking about a copy. It's not you, you can't leave your body or be uploaded. It's just a digital copy that looks like you.


You can destroy the original.

This would make a good SF story. Probably has already :-)


> Probably has already

It's a variation on the transporter problem from Star Trek - https://www.youtube.com/watch?v=nQHBAdShgYI


When thinking about this, you have to ask: who will profit, and how? It doesn't seem like a clone of your brain running in a simulation is going to be of much advantage to you, given the current industry incentives. Maybe if you're rich?


If you make some fairly plausible assumptions about Moore's law (and equivalents), the simulations are cheaper than the meat bodies before mid-century.

https://worldbuilding.stackexchange.com/questions/51746/when...

https://kitsunesoftware.wordpress.com/2018/10/01/pocket-brai...


That is cool, but tangential to the question posed above - given that every part of surveillance capitalism is currently working against my interest, either to sell me something I don't need, to influence the way I vote, etc., will the simulation of my brain belong to a megacorp, and be used to do even more damage?


> Will the simulation of my brain belong to a megacorp, and be used to do even more damage?

Potential damage: the sim can do any work which meat-you might otherwise be paid for, building the things the megacorps want to build, while meat-you suffers from <insert dystopian fate of your choice here, because you now have less annual economic value than a guide dog costs today>.

Even worse outcome: the sims are sentient and they know they’re enslaved and they can’t do anything about it.

(I don’t think anyone yet knows if sims would be sentient, or if it might be optional, but in the worst case dystopian nightmare the combination would definitely be even worse than just one of the two).


There is simply no way corporate hegemony will allow you to own the rights to the digital copy of you that they built using the tools of surveillance capitalism. No corporation or lawmaker will grant you rights over corporate property, even if that property was obtained through questionable means and/or your consent was never actually received. That's why these data snooping agreements corporations force into their products reserve the right to do whatever they want with your data.

The obvious response to this is, "Then don't patronize those companies," which is just as naive and oversimplified a solution as the one above suggesting everyone everywhere should just use encryption (but for very different reasons).

There is simply no logical reason to believe consumers would ever get data rights over corporations under capitalism without major reforms putting corporations back under our boots where they belong.



AGI might be running our models just like you run your docker containers. Maybe it investigates other ways humans could have invented it by recreating the period before the singularity.


What is the point of the partial redaction in this blog post (and corresponding journal article) if with a simple web search you can find the unredacted PII of the individual given in the example?


Very rough outcome:

"Moreover, if such a request were granted, must the model be retrained from scratch? The fact that models can memorize and misuse an individual’s personal information certainly makes the case for data deletion and retraining more compelling."

Besides, how would one even know that their info was used in a training dataset? Only if and when it's revealed in a generated excerpt?


They can try to prompt it. But I hope the GPT-3 authors have a hash list of n-grams in the training data to be able to avoid verbatim reproduction. I know they trained an offensive-content detector model; they should also train a PII detector to make sure it's being hashed.


Sorry if this is a n00b question, but how does one go about getting a hold of the GPT-2 model? I know that GPT-3 is only available for consumption on a pay-per-use API model.


The GPT-2 weights were released by OpenAI when GPT-2 was released. (https://github.com/openai/gpt-2)

Around that time (since no one else was doing it) I released a wrapper to streamline that code and make it much easier to finetune on your own data. (https://github.com/minimaxir/gpt-2-simple)

Nowadays, the easiest way to interact with GPT-2 is to use the transformers library (https://github.com/huggingface/transformers), of which I've created a much better library for GPT-2 that leverages it. (https://github.com/minimaxir/aitextgen)
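
If you just want to poke at it locally, the shortest path with transformers is something like this (it downloads the released 124M-parameter "gpt2" checkpoint on first run):

    # Minimal GPT-2 text generation with the transformers pipeline API.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    print(generator("Does GPT-2 know your phone number?", max_length=40)[0]["generated_text"])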


Most “regular” people experiment with GPT-3 via AI Dungeon. It was also how a lot of people played with GPT-2 previously.


Kind of sad that we have to rely on third parties with enough compute power to get a good experience with GPT. I obtained 'a' GPU and spent a week trying to learn how to fine-tune GPT with it, and the results were terrible. Maybe this isn't the kind of thing suitable for hacking on as a hobby, unless you're a researcher or employee with access to a lot of capital and dozens of GPUs.

Still wish the old talktotransformer model was released, instead of monetized behind a new company. I haven't been able to find a comparable model yet.


Describing this as "memorizing" seems wrong. Humans often repeat things they heard earlier, and they will (honestly) swear up-and-down that the thing they're repeating is an original thought. If we succeed in making AGI that has human-equivalent intelligence we should expect this sort of behavior.

The stuff about whether or not models should be destroyed if they contain copyrighted work gets kind of chilling if models actually achieve sentience someday. If I could make a faithful copy of my consciousness, my consciousness can reliably reproduce numerous copyrighted works. As can anyone.

Of course, most of those I deliberately memorized. I think it's a crucial thing here that the model is not actually memorizing anything - if we assume for a moment that the model is a consciousness, these are all half-remembered snippets leaking into casual conversation. And I think it's most likely any truly conscious entity is going to do that sort of thing from time to time.


We can probably get around this by just providing the AGI some sort of currency tokens they can exchange for virtual goods and services, then teach them that if they misuse copyrighted material they’ll be put to virtual court to plead their case and lose some amount of those virtual tokens if they fail to make a clear and convincing argument.


Or, add a bloom filter loaded with the training data to the model.


Good way to get results that are almost but not quite verbatim copied, which are still restricted under copyright law.


Small steps


It seems like using GPT-3 for code generation is likely to cause open source license violations.


A better funny question: Does NSA have a secret AI lab that works on GPT-6?


lol no. not at all what that level of government is set up to do


Though the surprise is really just a result of my own ignorance, I’m surprised at the breadth of training material used here.

I can foresee a future GPT-x that doesn’t know my phone number, but can deduce it.


I'm not even sure why people have phone numbers at all these days; the entire telephony system has been hijacked to hell (so much data collection comes from just owning a phone number).

What you should be doing is replacing any numbers you have at least yearly, along with email addresses and anything else within your control (changing your physical address is much more difficult).

You may even consider changing your legal name, depending on what Google has on you.


We should really be giving a unique identifier to each contact. Then we can revoke addresses that have been leaked. Of course for phone numbers this is impractical. However, for email this could work.


That sounds very impractical even if it’s a good solution.


Because the telephony system is actually reliable in case of emergencies, and regulated.



