MBTI Prediction Based on Twitter Content

Using machine learning, we explored the relationship between a user's Twitter (now X) content and their MBTI classification. 

Link to View Project

Project Overview

In this project, we explored the relationship between a user’s Twitter content and their MBTI classification. We used Twitter and MBTI information from a dataset that contains 8,328 users and analyzed 5 tweets per user using sentiment analysis and frequency distribution plots. We then trained a Support Vector Machine (SVM) model to predict a user's MBTI type based on their Twitter content. Our results were inconclusive: we could not establish a relationship between the variables analyzed and a user’s MBTI type.

Team Contributions

  • My contributions: Background and Prior Work, Dataset Info, Frequency Distribution using EDA (Exploratory Data Analysis)

  • Ashley Ho: Data Cleaning, Data Analysis and Results

  • Ariann Manlangit: Background Info, Research Question, Script, Slides

  • Akhila Nivarthi: Ethics and Privacy, Conclusion & Discussion, Script

  • Audrey Chung: Found Data, Ethics and Privacy, Conclusion & Discussion, Data Analysis

All team members were present at meetings and thoroughly communicated with one another.

Research Question

Can we predict how an individual's MBTI is classified based on the content they share on Twitter, specifically the text sentiment and word frequency of their posts, as well as average user tweet statistics (average tweet length, average mentions count, average media count, and average retweet count)?

Hypothesis

We hypothesize that there is an underlying relationship between an individual's MBTI classification and the content of the tweets they post. We believe that textual components such as word choice, capitalization, punctuation usage, and emoji usage, as well as quantitative measures such as tweet length and tweet frequency, are indicative of an individual’s personality traits. Our background research indicated that individuals are likely to express their true personas online and that how we identify in real life is often reflected in our online presence.

Dataset

This dataset contains information sourced from the Twitter API about 8,328 Twitter users who self-reported their MBTI types in their profile descriptions. The dataset consists of three CSV files. The first file stores users' MBTI classifications. The second file includes publicly available data about their accounts, such as username, follower count, location, and verification status. The final file contains each user's 200 most recent tweets posted on or before March 31, 2020.

Background & Setup

MBTI stands for Myers-Briggs Type Indicator. It describes a person's personality using 4 categories, each with two options:

  • [I/E] introversion/extroversion

  • [S/N] sensing/intuition

  • [T/F] thinking/feeling

  • [J/P] judging/perceiving
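
Because each of the 4 categories has two options, a full MBTI type is one of 2 × 2 × 2 × 2 = 16 four-letter codes. The short sketch below enumerates them; the names CATEGORIES, MBTI_TYPES, and split_type are made up for illustration and are not taken from our project code.

```python
from itertools import product

# One pair of letters per category, in the order listed above.
CATEGORIES = [("I", "E"), ("S", "N"), ("T", "F"), ("J", "P")]

# Every combination of one letter per category gives the 16 possible types.
MBTI_TYPES = ["".join(letters) for letters in product(*CATEGORIES)]


def split_type(mbti_type):
    """Validate a four-letter type and return its letters, one per category."""
    mbti_type = mbti_type.upper()
    if mbti_type not in MBTI_TYPES:
        raise ValueError(f"not a valid MBTI type: {mbti_type!r}")
    return tuple(mbti_type)


print(len(MBTI_TYPES))     # 16
print(split_type("infp"))  # ('I', 'N', 'F', 'P')
```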

We used a single dataset for our project, Twitter MBTI Personality Types, which has a total of 8,328 observations.

Data Cleaning

We took 4 steps when cleaning the data.

  • Step 1: Merged MBTI classifications, profile information, and tweets into the variable df (data frame).

  • Step 2: Checked for any missing values and dropped any rows/columns with missing data.

  • Step 3: Wrote a function using detect to identify and filter out tweets that were not in English.

  • Step 4: Applied word_tokenize from nltk (Natural Language Toolkit) to all tweets to prepare for EDA (Exploratory Data Analysis).
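
The four steps can be sketched roughly as follows. This is a minimal illustration, not our actual notebook code. The column names "id" and "tweet" and the function name clean_tweets are assumptions. We also assume here that detect comes from the langdetect package, and this sketch drops only rows with missing values. Note that nltk's word_tokenize needs its tokenizer data downloaded before first use.

```python
import pandas as pd


def clean_tweets(mbti, profiles, tweets, detect_language=None, tokenize=None):
    """Merge, drop missing rows, keep English tweets, and tokenize them.

    Column names ("id", "tweet") are hypothetical placeholders.
    """
    if detect_language is None:
        # Assumption: `detect` is langdetect's language detector.
        from langdetect import detect as detect_language
    if tokenize is None:
        from nltk.tokenize import word_tokenize as tokenize

    # Step 1: merge the three tables into one data frame.
    df = mbti.merge(profiles, on="id").merge(tweets, on="id")

    # Step 2: drop rows that have any missing value.
    df = df.dropna()

    # Step 3: keep only tweets detected as English.
    def is_english(text):
        try:
            return detect_language(text) == "en"
        except Exception:
            # Detection can fail on text with no usable features (e.g. only emoji).
            return False

    df = df[df["tweet"].apply(is_english)]

    # Step 4: tokenize each tweet to prepare for EDA.
    df = df.assign(tokens=df["tweet"].apply(tokenize))
    return df.reset_index(drop=True)
```

Passing detect_language and tokenize as arguments makes it easy to try the function on a few rows with simple stand-ins before running the real detectors on the full dataset.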

Analysis

The plots of mean mention count, mean tweet length, and mean media count for each MBTI type were not extremely skewed. The only plot that showed noticeable differences was the mean retweet count for each MBTI type.
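
A per-type mean like the ones plotted can be computed with a pandas group-by. The sketch below uses invented numbers and hypothetical column names (mbti, tweet_length, mentions_count, media_count, retweet_count); it only shows the shape of the calculation, not our real values.

```python
import pandas as pd

# Hypothetical column names for the per-tweet statistics.
STAT_COLUMNS = ["tweet_length", "mentions_count", "media_count", "retweet_count"]


def mean_stats_by_type(df):
    """Return one row per MBTI type with the mean of each statistic."""
    return df.groupby("mbti")[STAT_COLUMNS].mean()


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "mbti": ["INFP", "INFP", "ESTJ"],
        "tweet_length": [100, 120, 80],
        "mentions_count": [1, 3, 0],
        "media_count": [0, 1, 1],
        "retweet_count": [2, 2, 10],
    }
)

summary = mean_stats_by_type(toy)
print(summary)
```

Calling summary.plot(kind="bar", subplots=True) would then draw one bar chart per statistic, assuming matplotlib is installed.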

After exploring the data, we built a model that takes an individual's tweet and predicts their MBTI type. We trained a linear Support Vector Machine (SVM) that analyzes the text (tweet content) and then predicts the label or group (MBTI classification).
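
One common way to set up a linear SVM on text is a scikit-learn pipeline, sketched below. The TF-IDF vectorizer is an assumption chosen for illustration, since this page does not record how the text was turned into features. The tweets and labels are invented and say nothing about real MBTI types.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def train_mbti_classifier(tweets, mbti_labels):
    """Fit a TF-IDF + linear SVM pipeline mapping tweet text to an MBTI label."""
    model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
    model.fit(tweets, mbti_labels)
    return model


# Toy training data, invented for illustration only.
toy_tweets = [
    "reading a book at home tonight",
    "quiet evening with a good book",
    "big concert with friends tonight",
    "heading to a concert with the whole group",
]
toy_labels = ["INFP", "INFP", "ESTP", "ESTP"]

model = train_mbti_classifier(toy_tweets, toy_labels)
prediction = model.predict(["another concert this weekend"])[0]
print(prediction)
```

On real data, the tweets would be split into training and test sets so that accuracy is measured on tweets the model has not seen.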

Results & Conclusion

Based on the results of our model, we were unable to confirm our hypothesis that MBTI can be identified through Twitter content.

There were also various limitations in our procedure, such as the sample size of our observations and the extensive set of MBTI categories.

Although these methods did not produce conclusive results, using VADER, a sentiment analysis tool, we found some correlation between MBTI type and text sentiment.
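
A comparison of VADER sentiment across MBTI types could look roughly like the sketch below. It assumes the vaderSentiment package and uses its compound score, which ranges from -1 (most negative) to +1 (most positive). The function name and the (type, text) input format are hypothetical, and this is not the exact analysis we ran.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment_by_type(rows, score=None):
    """Average a sentiment score per MBTI type.

    rows:  iterable of (mbti_type, tweet_text) pairs (hypothetical format).
    score: function mapping text to a number; defaults to VADER's compound score.
    """
    if score is None:
        # Assumption: the vaderSentiment package is installed.
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

        analyzer = SentimentIntensityAnalyzer()

        def score(text):
            return analyzer.polarity_scores(text)["compound"]

    scores_by_type = defaultdict(list)
    for mbti_type, text in rows:
        scores_by_type[mbti_type].append(score(text))

    return {mbti_type: mean(values) for mbti_type, values in scores_by_type.items()}
```

Called as mean_sentiment_by_type(rows), it returns a dictionary mapping each MBTI type to its average compound score, which can then be compared across types.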

Reflections

Working on my first machine learning project was a transformative experience that expanded my understanding of data science and its applications. Digging into sentiment analysis was particularly enlightening, as it demonstrated the power of machine learning in interpreting human emotions from textual data. One of the key lessons I learned was the critical importance of data cleaning; I discovered that high-quality, well-prepared data is essential for building accurate and reliable models. This project not only equipped me with technical skills but also emphasized the importance of a meticulous approach to data preprocessing, ultimately shaping my ability to handle complex datasets and derive meaningful insights.

Get in Touch

777 24th st.
San Diego, CA 92154

619-870-4721

  • LinkedIn
