The power of inference: biometric psychography and large language models
More expressive bits + better algorithms = more knowledge. And who benefits?
Knowledge is almost never raw data. It is usually gained by reasoning about raw data to arrive at some new fact or conclusion. This reasoning process, a.k.a. inference, is behind almost every meaningful piece of knowledge that can be gained. A browsing history isn't interesting at face value; what makes gathering it worthwhile is what you can infer about someone's personality from it.
Inference can be surprisingly powerful. Knowing an American's gender, zip code and birth date is enough, in 84-97% of cases, to infer their identity. Birth date, of course, narrows down the population quite a bit, but the inferential effectiveness of combining this with gender and zip code seems like a large and fairly unintuitive leap from data to knowledge. Maybe you'll think twice before giving your gender and birth date to random websites now (especially as they can likely get your zip code from your browser).
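A quick back-of-envelope calculation shows why that triple is so identifying. The population figures below are rough assumptions for illustration, not census data:

```python
# How identifying is {gender, zip code, birth date}?
# All numbers are rough assumptions, not census figures.
US_POPULATION = 330_000_000   # assumed
ZIP_CODES = 42_000            # assumed count of US zip codes
GENDERS = 2
BIRTH_DATES = 365 * 79        # assumed ~79-year spread of ages

combinations = ZIP_CODES * GENDERS * BIRTH_DATES
people_per_combination = US_POPULATION / combinations

print(f"{combinations:,} possible combinations")
print(f"~{people_per_combination:.2f} people per combination on average")
# There are far more combinations than people, so on average each
# combination describes less than one person -- i.e. most combinations
# that occur at all occur exactly once, and therefore re-identify.
```

The leap from data to knowledge stops feeling unintuitive once you see that the three attributes jointly carve the population into billions of buckets.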
The knowledge that can be inferred increases when either 1) your raw data gets more expressive, or 2) your reasoning methods get more powerful.
As a society, we're getting better at collecting more expressive data. For example, the new data generated by immersive technologies like AR and VR represents a huge leap forward in biometric information gathering.
VR/AR generate incredible amounts of fine-grained biometric information. Necessarily so: creating a realistic immersive world for a user entails eye tracking (to reduce simulator sickness) and movement tracking via optical sensors and gyroscopes (to understand what people are trying to do in the simulated world). One can use this information to directly improve the experience, of course, but one can also store it and later feed it through statistical models.
Scholar Brittan Heller points out in her paper “Watching Androids Dream of Electric Sheep” that you can use eye tracking to infer what interests a person, what their gaze lingers on; pupil dilation can tell you who they're sexually attracted to, and even predict autism and dementia.
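The simplest version of this kind of inference is easy to sketch. Here's a toy illustration (not Heller's method, and not any real headset's API) of turning raw per-frame gaze samples into a ranking of what a user's gaze lingers on:

```python
from collections import defaultdict

def dwell_times(gaze_samples, sample_period_s=1 / 90):
    """Total gaze time per object, from one gazed-at object id per
    tracker frame. Headsets commonly sample at ~90 Hz, hence the
    default sample period of 1/90 s."""
    totals = defaultdict(float)
    for obj in gaze_samples:
        totals[obj] += sample_period_s
    return dict(totals)

# ~3 seconds of simulated samples: the user mostly looks at "ad_banner".
samples = ["ad_banner"] * 180 + ["doorway"] * 60 + ["ad_banner"] * 30
times = dwell_times(samples)
ranked = sorted(times, key=times.get, reverse=True)
print(ranked[0])  # prints "ad_banner": the object the gaze lingered on longest
```

Real systems are vastly more sophisticated, but the shape is the same: involuntary signals in, a ranked list of inferred interests out.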
This gives rise to biometric data that can help you infer someone's personality and interests on the physical and biological level. (This gets even more powerful if you're combining it with first-person video recordings of their house through the front-facing camera!)
Heller calls this kind of information biometric psychography:
Biometric psychography is a new concept for a novel type of bodily-centered information that can reveal intimate details about users’ likes, dislikes, preferences, and interests. Immersive technology must capture this data to function, meaning that while biometric psychography may be relevant beyond immersive tech, it will become increasingly inescapable as immersive tech spreads... current thinking around biometrics is focused primarily on identity, but biometric psychography... (can) identify a person’s interests.
It's not just VR and AR technologies that give rise to biometric psychography; so do facial scans, EEGs, EMGs, ECGs, galvanic skin responses, and the like. I can see how useful this could be in the medical context, and how equally sinister it could be in the consumer context.
What's alarming about the inferential power of biometric psychography is that your biological responses are almost entirely involuntary, and it's extraordinarily difficult to understand how the data might be used. I have no idea which signals facial tracking algorithms pick up from my face to infer my feelings, let alone how they're processed. Especially if ML is involved, I'm not sure any engineer creating the system could really tell me, either.
[Commercial] VR systems typically track body movements 90 times per second to display the scene appropriately, and high-end systems record 18 types of movements across the head and hands. Consequently, spending 20 minutes in a VR simulation leaves just under 2 million unique recordings of body language. — Bailenson (2018)
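Bailenson's figure is easy to verify, which is part of what makes it so striking: it's just the sampling rate multiplied out.

```python
# Reproducing Bailenson's arithmetic: 90 samples per second,
# 18 tracked movement types, 20 minutes in the headset.
samples_per_second = 90
movement_types = 18
minutes = 20

recordings = samples_per_second * movement_types * minutes * 60
print(f"{recordings:,} recordings")  # 1,944,000 -- "just under 2 million"
```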
No wonder Facebook is interested in building the Metaverse, given its data-hungry, advertising-first business model. The petabytes of information that will inevitably pour in from VR make for a supremely rich source of data.
They could combine biometric data with other kinds of data on you, which makes the magnifying glass of inference even more powerful. Birth date + gender + zip code is a potent combination; your music listening history + your movie watching history is worth more than the sum of its parts; access to mountains of your Instagram + Facebook + WhatsApp + VR activity is, of course, far better still.
Many ML researchers and data scientists will be excited to train algorithms on or otherwise use this information to serve ads or build AGI or do psychology research. I mean, of course—this is powerful stuff, and I know firsthand how amazing capabilities can be created from this information. But this demographic of people, the ones who take care of the inference side, need to think about whether, how and why to use this kind of data.
Not only can you get more knowledge by running existing inference methods on novel data; novel forms of data allow for, and call for, novel forms of inference, too. We figured out how to vectorise text into word embeddings: a representation of text that captures semantic meaning and enables new inference methods that rely on that meaning. The raw data gets more expressive, our reasoning methods get newer and better, rinse and repeat.
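To make the embeddings point concrete, here's a toy sketch. The three-dimensional vectors below are made up for illustration; real embeddings have hundreds of dimensions and are learned from huge text corpora (e.g. by word2vec or GloVe):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means pointing
    the same way, 0.0 means unrelated directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "embeddings", hand-picked so that related words
# point in similar directions.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "bread": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine_similarity(embeddings["king"], embeddings["bread"])
print(sim_related, sim_unrelated)
```

Once words are vectors, geometric closeness becomes a computable proxy for semantic relatedness, and a whole family of inference methods (clustering, analogy, semantic search) falls out of that single representational choice.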
As newer models spit out better predictions from less data coming from more unintuitive sources, we have to think about what can be inferred.
What happens when you combine novel bits gathered from different corners of your life; what happens when you run your ever-improving reasoning processes through that ever-growing pile of bits? There are debates being hashed out over whether these predictive processes are good or scientific. Regardless, we know they’re powerful — so one question is, who has a say or stake in that power?
Are our social and economic structures set up to handle this world trajectory? When large neural networks can hoover up terabytes of text or biometric measurements generated by everyday people doing their everyday thing to create incredibly powerful prediction machines, surely there are some rights and rewards that everyday people can expect from contributing their information. Many digital services are financially free because people subsidise them by paying in data; I wonder whether that remains a sustainable model for the future.
Much of the internet is a collectively created commons, with massive inferential leverage. So far, large platforms have mostly harnessed data generated by their own users to make their services better. This will get more powerful as things like biometric psychography become possible.
But we’re also good at, and getting better at, collecting and using data we didn’t help to create. The growing data market involves data brokers selling information collected in one context to parties who use it in another. Information across the internet is being harnessed by (mostly private) entities that build, for example, powerful and lucrative language models dependent on common data, but that aren’t set up to pay the gains back. The data wasn’t contributed under a contract, and certainly not for this context. In a way, this privatises the commons.
How can we improve the situation? People should be able to share in whatever (economic and epistemic) upside is created, and have a say in it, rather than have it leveraged on behalf of other interests, or directly against their own. And we need to draw better lines around knowledge-gathering: lines that delineate agency and control, that safeguard identities and spaces, that people can trust in.
I’ve been thinking about this stuff both unofficially and Officially. Officially, I’m part of the pilot cohort of the Interact Summer Residency program in Brooklyn, NY this summer. We’re hosting a Symposium across multiple spaces in Vinegar Hill to present what we’re working on; this is happening Aug 10-13th. If you want to hear me talk about this stuff on August 13th, details/RSVP are here. And reach out if you’d like to come to our Symposium more generally and I’ll send you the invite — there are lots of awesome people working on awesome things you should see!