Reddit Success Predictor
Analyzed 2,100 Reddit posts using NLP to uncover what makes content go viral · NLP · Data Science
📌 TL;DR
I collected 2,100 top posts from 7 subreddits (AskReddit, relationships, science, LifeProTips, personalfinance, confession, offmychest) and combined NLP with machine learning (sentiment analysis, LDA topic modeling, a random forest, and SHAP) to identify the psychological drivers of viral content. Key findings: questions get 1.8× more engagement, 40–60% subjectivity (a balance of personal and factual tone) is the sweet spot, and negative posts get more comments but fewer upvotes. I distilled the insights into the REDDIT framework, a practical guide for anyone creating content.
Problem and Motivation
Every day, millions of posts compete for attention on Reddit. Some go viral; most don't. As a data scientist and aspiring product manager, I wanted to move beyond guesswork and ask: What psychological patterns make content spread?
This project uses natural language processing and machine learning to decode the DNA of viral Reddit posts, providing actionable insights for marketers, community managers, and creators.
Data Collection
I used the Reddit API (via PRAW) to collect the 300 top posts of the past year from each of 7 psychologically diverse subreddits:
- Emotional communities: r/relationships, r/confession, r/offmychest
- Factual communities: r/science, r/personalfinance, r/LifeProTips
- Mixed / general: r/AskReddit
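The collection step might look roughly like the sketch below. It assumes a PRAW `Reddit` client built from your own API credentials; the function names `make_client` and `collect_top_posts`, the placeholder credential strings, and the choice of record fields are illustrative and not taken from the project's code.

```python
SUBREDDITS = [
    "relationships", "confession", "offmychest",   # emotional
    "science", "personalfinance", "LifeProTips",   # factual
    "AskReddit",                                   # mixed / general
]


def make_client():
    """Build a PRAW client. The credential strings are placeholders."""
    import praw  # third-party (pip install praw), imported lazily

    return praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="reddit-success-predictor (by u/your_username)",
    )


def collect_top_posts(reddit, subreddits=SUBREDDITS, limit=300):
    """Return one dict per post: the top `limit` posts of the past year
    from each subreddit in `subreddits`."""
    rows = []
    for name in subreddits:
        for post in reddit.subreddit(name).top(time_filter="year", limit=limit):
            rows.append({
                "subreddit": name,
                "id": post.id,
                "title": post.title,
                "body": post.selftext,
                "upvotes": post.score,
                "comments": post.num_comments,
                "created_utc": post.created_utc,
            })
    return rows
```

Calling `collect_top_posts(make_client())` would then request 7 × 300 posts. Passing the client in as an argument keeps the loop easy to test with a stand-in object.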
Engagement metric: 0.7 × upvotes + 0.3 × comments – validated by Reddit's algorithm research.
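As a small illustration, the weighted score can be written as a one-line function. The function and parameter names here are hypothetical; only the 0.7 and 0.3 weights come from the text above.

```python
def engagement_score(upvotes, comments, w_upvotes=0.7, w_comments=0.3):
    """Weighted engagement: 0.7 x upvotes + 0.3 x comments by default."""
    return w_upvotes * upvotes + w_comments * comments
```

Under this formula a post with 1,000 upvotes and 200 comments scores about 760, so upvotes dominate unless a post is unusually comment-heavy.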
Total posts: 2,100.
Feature Engineering
I engineered features to capture psychological dimensions:
| Feature | Description |
|---|---|
| Sentiment polarity | TextBlob polarity (-1 to 1) |
| Subjectivity | TextBlob subjectivity (0 to 1) |
| Is question | Whether the title contains a question mark (?) |
| Title length | Word count |
| Self-references | Count of I, me, my |
| Emotional words | Count of words like feel, happy, sad, angry |
| Post hour and weekday | From UTC timestamp (converted to US/Eastern) |
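A minimal sketch of how the features in the table could be computed is shown below. Several details are assumptions rather than the project's actual choices: sentiment is scored on the title plus body, the emotional lexicon contains only the four words named in the table, and the helper names (`text_features`, `time_features`, `textblob_sentiment`) are hypothetical. `America/New_York` is used as the IANA name for US/Eastern.

```python
import re
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

SELF_REFERENCES = re.compile(r"\b(?:i|me|my)\b", re.IGNORECASE)
EMOTION_WORDS = {"feel", "happy", "sad", "angry"}  # assumption: a tiny lexicon


def textblob_sentiment(text):
    """Return (polarity, subjectivity) from TextBlob's default analyzer."""
    from textblob import TextBlob  # third-party, imported lazily

    sentiment = TextBlob(text).sentiment
    return sentiment.polarity, sentiment.subjectivity


def text_features(title, body="", sentiment_fn=textblob_sentiment):
    """Text-based features for one post."""
    text = f"{title} {body}".strip()
    polarity, subjectivity = sentiment_fn(text)
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "polarity": polarity,                    # -1 to 1
        "subjectivity": subjectivity,            # 0 to 1
        "is_question": "?" in title,
        "title_length": len(title.split()),      # word count
        "self_references": len(SELF_REFERENCES.findall(text)),
        "emotional_words": sum(w in EMOTION_WORDS for w in words),
    }


def time_features(created_utc):
    """Posting hour and weekday (Monday = 0) in US/Eastern time."""
    local = datetime.fromtimestamp(created_utc, tz=timezone.utc).astimezone(
        ZoneInfo("America/New_York")
    )
    return {"post_hour": local.hour, "weekday": local.weekday()}
```

The `sentiment_fn` parameter is there so the sentiment scorer can be swapped or stubbed out without touching the rest of the feature code.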
Key Findings
1. Questions Get 1.8× More Engagement
Posts with a question in the title have significantly higher median engagement than statements (p < 0.05). Why? Questions trigger social response mechanisms – they invite participation and feel less like broadcasting.
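The write-up does not say which significance test was used. One distribution-free option, sketched here under that caveat, is a permutation test on the difference in medians between question and statement posts. The function names and default parameters are illustrative.

```python
import random
from statistics import median


def median_ratio(questions, statements):
    """Median engagement of question posts relative to statement posts."""
    return median(questions) / median(statements)


def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in medians.

    Shuffles the pooled values, re-splits them into groups of the
    original sizes, and counts how often the shuffled gap in medians
    is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(median(a) - median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled_a, shuffled_b = pooled[:len(a)], pooled[len(a):]
        if abs(median(shuffled_a) - median(shuffled_b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```

With real data, `a` would hold the engagement scores of question-titled posts and `b` those of the rest; a ratio near 1.8 together with a p-value below 0.05 would match the finding reported above.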
2. The Subjectivity Sweet Spot (40–60%)
Content that is too objective (dry facts) or too subjective (pure emotion) underperforms. The highest-engagement posts blend personal experience with factual evidence – a balance that creates cognitive ease and credibility.
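One simple way to check for a sweet spot like this is to bucket posts by subjectivity and compare median engagement per bucket. The sketch below assumes the 0.4 and 0.6 cut points implied by the 40–60% range; the band labels and function names are made up for illustration.

```python
from statistics import median


def subjectivity_band(subjectivity):
    """Map a TextBlob-style subjectivity score (0 to 1) to a band."""
    if subjectivity < 0.4:
        return "objective"
    if subjectivity <= 0.6:
        return "balanced"
    return "subjective"


def median_engagement_by_band(posts):
    """posts: iterable of (subjectivity, engagement) pairs."""
    groups = {}
    for subjectivity, engagement in posts:
        groups.setdefault(subjectivity_band(subjectivity), []).append(engagement)
    return {band: median(values) for band, values in groups.items()}
```

If the balanced band shows a clearly higher median than the other two, the data supports the sweet-spot reading; finer bins would show whether the peak really sits between 0.4 and 0.6.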
3. The Sentiment Tradeoff
There's a fascinating paradox: negative posts get 45% more comments but 30% fewer upvotes. Controversy drives discussion, but positive posts earn silent approval. Choose your goal – engagement type matters.
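The percentages above can be reproduced with a comparison like the following sketch. It assumes the baseline is positive posts, that "negative" and "positive" mean polarity below and above zero, and that group means are compared; the text does not confirm any of these choices.

```python
from statistics import mean


def percent_difference(group, baseline):
    """Relative difference of group mean vs. baseline mean, in percent."""
    base = mean(baseline)
    return 100.0 * (mean(group) - base) / base


def sentiment_tradeoff(posts):
    """posts: iterable of dicts with 'polarity', 'upvotes', 'comments'."""
    posts = list(posts)
    negative = [p for p in posts if p["polarity"] < 0]
    positive = [p for p in posts if p["polarity"] > 0]
    return {
        "comments_pct": percent_difference(
            [p["comments"] for p in negative], [p["comments"] for p in positive]
        ),
        "upvotes_pct": percent_difference(
            [p["upvotes"] for p in negative], [p["upvotes"] for p in positive]
        ),
    }
```

A result near `{"comments_pct": 45, "upvotes_pct": -30}` would correspond to the tradeoff described above. Posts with exactly zero polarity are left out of both groups.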
4. Community-Specific Language
Different communities have different “language identities”:
- r/relationships thrives on emotional authenticity – “I feel…”
- r/science values cognitive, factual language – “Research shows…”
- r/AskReddit loves open-ended questions – “What's your experience…”
Predictive Model
I built a Random Forest Regressor (150 trees, max depth 7) to predict engagement using psychological features. Then I used SHAP (SHapley Additive exPlanations) to interpret the model.
Top drivers of engagement (from SHAP):
- Is question (most important)
- Subjectivity (balanced tone wins)
- Post hour (timing matters)
- Title length (shorter is better for some subreddits)
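The modeling step could be sketched as follows, using scikit-learn's `RandomForestRegressor` with the stated 150 trees and depth 7, and SHAP's `TreeExplainer` to rank features by mean absolute SHAP value. The feature list, function names, and random seed are assumptions; the project's actual feature set and preprocessing may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Assumed column order for the feature matrix X.
FEATURES = [
    "is_question", "subjectivity", "polarity", "title_length",
    "self_references", "emotional_words", "post_hour", "weekday",
]


def fit_engagement_model(X, y, seed=42):
    """Fit a random forest with 150 trees and max depth 7."""
    model = RandomForestRegressor(n_estimators=150, max_depth=7, random_state=seed)
    model.fit(X, y)
    return model


def rank_features_by_shap(model, X, feature_names=FEATURES):
    """Return (feature, mean |SHAP value|) pairs, most important first."""
    import shap  # third-party (pip install shap), imported lazily

    shap_values = shap.TreeExplainer(model).shap_values(X)
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]
```

Mean absolute SHAP value gives a global ranking like the list above, while the per-post SHAP values show in which direction each feature pushed an individual prediction.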
The REDDIT Framework
I distilled the findings into a memorable acronym – REDDIT – a practical guide for creating viral content:
| Step | Description | Example |
|---|---|---|
| Reveal | Share personal experiences (I-statements) | “I struggled with this for years…” |
| Emotion | Include emotional language (feel/felt) | “I felt overwhelmed when…” |
| Direct | Clear question in title (What/How/Why) | “What techniques helped you?” |
| Dialogue | Encourage conversation in comments | “What are your experiences?” |
| Insight | Provide novel information/value | “Research shows this method works 73% better” |
| Timing | Post at community peak hours | r/relationships: 9 PM, r/science: 2 PM |
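The write-up does not describe how the peak hours in the Timing row were derived. One plausible approach, sketched here as an assumption, is to pick the posting hour with the highest median engagement in each subreddit. The function name and input shape are hypothetical.

```python
from collections import defaultdict
from statistics import median


def peak_hour_by_subreddit(posts):
    """posts: iterable of (subreddit, post_hour, engagement) tuples.

    Returns {subreddit: hour with the highest median engagement}.
    """
    by_key = defaultdict(list)
    for subreddit, hour, engagement in posts:
        by_key[(subreddit, hour)].append(engagement)

    best = {}
    for (subreddit, hour), values in by_key.items():
        score = median(values)
        if subreddit not in best or score > best[subreddit][1]:
            best[subreddit] = (hour, score)
    return {subreddit: hour for subreddit, (hour, _) in best.items()}
```

Hours with only a handful of posts can win by chance, so in practice a minimum post count per hour would be a sensible extra filter.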
What I Learned
- NLP can uncover actionable psychological patterns, but community context matters as much as the algorithm. A post that works in r/relationships will flop in r/science.
- Simple models with good features beat complex black boxes. Random Forest + basic text features gave me clear, interpretable results. SHAP made it explainable.
- The REDDIT framework is my main takeaway – it's a tool anyone can use to improve their content, whether for social media, marketing, or product communication.
References and Tools
📂 Full code and analysis available on GitHub.