Reddit Success Predictor

Analyzed 2,100 Reddit posts using NLP to uncover what makes content go viral · NLP · Data Science

📌 TL;DR

I collected 2,100 top posts from 7 subreddits (AskReddit, relationships, science, LifeProTips, personalfinance, confession, offmychest) and used NLP and machine-learning techniques – sentiment analysis, LDA topic modeling, a random forest, and SHAP – to identify the psychological drivers of viral content. Key findings: posts with a question in the title get 1.8× more engagement, 40–60% subjectivity (a balance of personal and factual tone) is the sweet spot, and negative posts get more comments but fewer upvotes. I distilled the insights into the REDDIT framework – a practical guide for anyone creating content.

Problem and Motivation

Every day, millions of posts compete for attention on Reddit. Some go viral; most don't. As a data scientist and aspiring product manager, I wanted to move beyond guesswork and ask: What psychological patterns make content spread?

This project uses natural language processing and machine learning to decode the DNA of viral Reddit posts, providing actionable insights for marketers, community managers, and creators.

Data Collection

I used the Reddit API (PRAW) to collect the 300 top posts of the past year from each of 7 psychologically diverse subreddits:

  • Emotional communities: r/relationships, r/confession, r/offmychest
  • Factual communities: r/science, r/personalfinance, r/LifeProTips
  • Mixed / general: r/AskReddit
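
The collection step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the `Post` container and the `collect_top_posts` helper are hypothetical names, and the function assumes it is handed an already authenticated `praw.Reddit` instance.

```python
from dataclasses import dataclass

# The seven subreddits named in this write-up.
SUBREDDITS = [
    "AskReddit", "relationships", "science", "LifeProTips",
    "personalfinance", "confession", "offmychest",
]


@dataclass
class Post:
    """Hypothetical record holding the fields used later in the analysis."""
    subreddit: str
    post_id: str
    title: str
    body: str
    upvotes: int
    comments: int
    created_utc: float


def collect_top_posts(reddit, subreddits=SUBREDDITS, per_subreddit=300):
    """Fetch the top posts of the past year from each subreddit.

    `reddit` is assumed to be an authenticated praw.Reddit instance, e.g.
    praw.Reddit(client_id=..., client_secret=..., user_agent=...).
    """
    posts = []
    for name in subreddits:
        listing = reddit.subreddit(name).top(time_filter="year", limit=per_subreddit)
        for s in listing:
            posts.append(Post(name, s.id, s.title, s.selftext,
                              s.score, s.num_comments, s.created_utc))
    return posts
```

With 7 subreddits and 300 posts each, a full run yields at most 7 × 300 = 2,100 records.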

Engagement metric: 0.7 × upvotes + 0.3 × comments – validated by Reddit's algorithm research.
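
As a quick illustration, the weighted score can be written as a one-line function. The name `engagement_score` and the keyword arguments are illustrative choices; only the 0.7 and 0.3 weights come from the formula above.

```python
def engagement_score(upvotes, comments, w_upvotes=0.7, w_comments=0.3):
    """Weighted engagement: 0.7 x upvotes + 0.3 x comments by default."""
    return w_upvotes * upvotes + w_comments * comments


# Example: a post with 100 upvotes and 50 comments.
score = engagement_score(100, 50)  # 0.7 * 100 + 0.3 * 50
```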

Total posts: 2,100.

Feature Engineering

I engineered features to capture psychological dimensions:

Feature               | Description
Sentiment polarity    | TextBlob polarity (-1 to 1)
Subjectivity          | TextBlob subjectivity (0 to 1)
Is question           | Whether title contains ?
Title length          | Word count
Self-references       | Count of I, me, my
Emotional words       | Count of words like feel, happy, sad, angry
Post hour and weekday | From UTC timestamp (converted to US/Eastern)
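
A rough sketch of how these features could be computed is shown below. The function names are hypothetical, the emotion lexicon is only the four example words listed in the table (the real word list is not given here), and the tokenizer is a simple regex, so contractions such as "I'm" are not counted as self-references. TextBlob is imported lazily so the rest of the block runs without it.

```python
import re
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

SELF_WORDS = {"i", "me", "my"}
# Assumption: illustrative subset taken from the table's examples.
EMOTION_WORDS = {"feel", "happy", "sad", "angry"}


def text_features(title):
    """Title-level features that need no external library."""
    tokens = re.findall(r"[a-z']+", title.lower())
    return {
        "is_question": "?" in title,
        "title_length": len(title.split()),
        "self_references": sum(t in SELF_WORDS for t in tokens),
        "emotional_words": sum(t in EMOTION_WORDS for t in tokens),
    }


def time_features(created_utc, tz="US/Eastern"):
    """Convert a UTC epoch timestamp to local posting hour and weekday (Mon=0)."""
    local = datetime.fromtimestamp(created_utc, tz=timezone.utc).astimezone(ZoneInfo(tz))
    return {"post_hour": local.hour, "weekday": local.weekday()}


def sentiment_features(text):
    """Polarity (-1 to 1) and subjectivity (0 to 1) from TextBlob."""
    from textblob import TextBlob  # lazy import: only needed for this function
    sentiment = TextBlob(text).sentiment
    return {"polarity": sentiment.polarity, "subjectivity": sentiment.subjectivity}
```

For example, `text_features("Why do I feel sad about my job?")` flags a question with 8 words, 2 self-references, and 2 emotional words.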

Key Findings

1. Questions Get 1.8× More Engagement

Posts with a question in the title have significantly higher median engagement than statements (p < 0.05). Why? Questions trigger social response mechanisms – they invite participation and feel less like broadcasting.

Figure: Questions vs statements – median engagement comparison.
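
The comparison behind this finding can be sketched as a ratio of medians plus a significance check. The write-up does not say which test produced p < 0.05, so the permutation test below is a stand-in chosen for illustration, and both function names are hypothetical.

```python
import random
from statistics import median


def median_ratio(question_scores, statement_scores):
    """How many times higher the median engagement of questions is."""
    return median(question_scores) / median(statement_scores)


def permutation_p_value(a, b, n_iter=2000, seed=0):
    """Two-sided permutation test on the difference of medians.

    Assumption: used here only as an example test; the original
    analysis may have used a different one.
    """
    rng = random.Random(seed)
    observed = abs(median(a) - median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(median(pooled[:len(a)]) - median(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_iter + 1)
```

A ratio of 1.8 from `median_ratio` corresponds to the "1.8× more engagement" headline; the p-value indicates whether a gap that large is plausible under random group labels.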

2. The Subjectivity Sweet Spot (40–60%)

Content that is too objective (dry facts) or too subjective (pure emotion) underperforms. The highest-engagement posts blend personal experience with factual evidence – a balance that creates cognitive ease and credibility.

Figure: Balanced subjectivity (0.3–0.6) generates the highest engagement.
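
One way to check for a sweet spot is to bin posts by subjectivity and compare the median engagement per bin. The sketch below assumes three bands with adjustable cut-offs; the defaults of 0.4 and 0.6 follow the section heading, and the band labels and function names are illustrative.

```python
from collections import defaultdict
from statistics import median


def subjectivity_band(subjectivity, low=0.4, high=0.6):
    """Label a TextBlob subjectivity score (0 to 1) as one of three bands."""
    if subjectivity < low:
        return "objective"
    if subjectivity <= high:
        return "balanced"
    return "subjective"


def median_engagement_by_band(posts, low=0.4, high=0.6):
    """posts: iterable of (subjectivity, engagement) pairs."""
    groups = defaultdict(list)
    for subjectivity, engagement in posts:
        groups[subjectivity_band(subjectivity, low, high)].append(engagement)
    return {band: median(values) for band, values in groups.items()}
```

If the "balanced" band shows the highest median across several reasonable cut-offs, the sweet-spot pattern is less likely to be an artifact of where the bins were drawn.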

3. The Sentiment Tradeoff

There's a fascinating paradox: negative posts get 45% more comments but 30% fewer upvotes. Controversy drives discussion, but positive posts earn silent approval. Choose your goal – engagement type matters.

Figure: Correlation heatmap – sentiment vs upvotes vs comments.
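
The tradeoff can be expressed as two percent differences between negative and positive posts. This sketch assumes the comparison baseline is positive-polarity posts and that group means are used; neither detail is stated above, and the function names are hypothetical.

```python
from statistics import mean


def percent_difference(group, baseline):
    """Percent by which the mean of `group` differs from the mean of `baseline`."""
    return (mean(group) - mean(baseline)) / mean(baseline) * 100


def sentiment_tradeoff(posts):
    """posts: iterable of (polarity, upvotes, comments) tuples.

    Assumption: negative posts (polarity < 0) are compared against
    positive posts (polarity > 0); neutral posts are ignored.
    """
    negative = [(u, c) for p, u, c in posts if p < 0]
    positive = [(u, c) for p, u, c in posts if p > 0]
    return {
        "comments_pct": percent_difference([c for _, c in negative],
                                           [c for _, c in positive]),
        "upvotes_pct": percent_difference([u for u, _ in negative],
                                          [u for u, _ in positive]),
    }
```

A positive `comments_pct` together with a negative `upvotes_pct` is the pattern described in this section.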

4. Community-Specific Language

Different communities have different “language identities”:

  • r/relationships thrives on emotional authenticity – “I feel…”
  • r/science values cognitive, factual language – “Research shows…”
  • r/AskReddit loves open-ended questions – “What's your experience…”

Figure: Emotional language distribution across communities.

Predictive Model

I built a Random Forest Regressor (150 trees, max depth 7) to predict engagement using psychological features. Then I used SHAP (SHapley Additive exPlanations) to interpret the model.

Top drivers of engagement (from SHAP):

  • Is question (most important)
  • Subjectivity (balanced tone wins)
  • Post hour (timing matters)
  • Title length (shorter is better for some subreddits)

Figure: SHAP feature importance – what really drives engagement.
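
A minimal sketch of the modeling step, using the hyperparameters stated above (150 trees, max depth 7). Everything else is an assumption made for illustration: the feature subset, the helper names, and especially the demo data, which is synthetic and does not reproduce the project's results. SHAP is imported lazily so the model-fitting part runs without it.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Assumption: a small illustrative subset of the engineered features.
FEATURES = ["is_question", "subjectivity", "post_hour", "title_length"]


def fit_engagement_model(X, y, seed=0):
    """Random forest with the settings described in the text."""
    model = RandomForestRegressor(n_estimators=150, max_depth=7, random_state=seed)
    model.fit(X, y)
    return model


def mean_abs_shap(model, X, feature_names=FEATURES):
    """Rank features by mean absolute SHAP value (largest first)."""
    import shap  # lazy import: only needed for the explanation step
    shap_values = shap.TreeExplainer(model).shap_values(X)
    importance = np.abs(shap_values).mean(axis=0)
    return dict(sorted(zip(feature_names, importance), key=lambda kv: -kv[1]))


# Synthetic demo data (NOT the Reddit dataset): engagement is driven
# almost entirely by the first column so the example has a known answer.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(0, 2, 200),   # is_question
    rng.random(200),           # subjectivity
    rng.integers(0, 24, 200),  # post_hour
    rng.integers(3, 30, 200),  # title_length
])
y_demo = 50 * X_demo[:, 0] + rng.normal(0, 5, 200)
model = fit_engagement_model(X_demo, y_demo)
```

Calling `mean_abs_shap(model, X_demo)` would then return the features ordered by their average contribution to the predictions, which is the kind of ranking summarized in the list above.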

The REDDIT Framework

I distilled the findings into a memorable acronym – REDDIT – a practical guide for creating viral content:

Step     | Description                                  | Example
Reveal   | Share personal experiences (I-statements)    | “I struggled with this for years…”
Emotion  | Include emotional language (feel/felt)       | “I felt overwhelmed when…”
Direct   | Clear question in title (What/How/Why)       | “What techniques helped you?”
Dialogue | Encourage conversation in comments           | “What are your experiences?”
Insight  | Provide novel information/value              | “Research shows this method works 73% better”
Timing   | Post at community peak hours                 | r/relationships: 9 PM, r/science: 2 PM

What I Learned

  • NLP can uncover actionable psychological patterns, but community context matters as much as the algorithm. A post that works in r/relationships may well flop in r/science.
  • Simple models with good features can be enough. Random Forest + basic text features gave me clear, interpretable results, and SHAP made them explainable.
  • The REDDIT framework is my main takeaway – it's a tool anyone can use to improve their content, whether for social media, marketing, or product communication.

References and Tools

  • Reddit API (PRAW) – data collection
  • TextBlob – sentiment and subjectivity
  • scikit-learn – Random Forest, preprocessing
  • SHAP – model interpretability
  • LDA (gensim) – topic modeling
  • Python, pandas, matplotlib, seaborn

📂 Full code and analysis available on GitHub.