This classifier was trained on a large sample of users from the r/politicalcompassmemes subreddit. Each user's flair defines their political leaning and is used as the target data. The input features for a given user are the number of comments they have made in each subreddit. Since most users have never commented in the vast majority of subreddits, each user is encoded as a sparse array.
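As a rough illustration of the sparse encoding, the sketch below stores only the non-zero comment counts as (row, column, value) triplets. The user names, subreddit counts, and the helper names `build_vocabulary` and `to_coo` are all made up for this example and are not taken from the project's code.

```python
from collections import Counter


def build_vocabulary(user_counts):
    """Map every subreddit seen in the data to a fixed column index."""
    names = sorted({sub for counts in user_counts.values() for sub in counts})
    return {name: col for col, name in enumerate(names)}


def to_coo(user_counts, vocabulary):
    """Return (rows, cols, values) lists holding only the non-zero counts."""
    rows, cols, values = [], [], []
    for row, user in enumerate(sorted(user_counts)):
        for sub, n in user_counts[user].items():
            # Skip zero counts and subreddits missing from the vocabulary.
            if n and sub in vocabulary:
                rows.append(row)
                cols.append(vocabulary[sub])
                values.append(n)
    return rows, cols, values


# Hypothetical comment histories (illustrative data only).
comments = {
    "user_a": Counter({"politicalcompassmemes": 12, "askreddit": 3}),
    "user_b": Counter({"politicalcompassmemes": 5}),
}

vocabulary = build_vocabulary(comments)
rows, cols, values = to_coo(comments, vocabulary)
print(sorted(zip(rows, cols, values)))
```

Triplets in this form are the usual input to a sparse matrix constructor such as `scipy.sparse.coo_matrix((values, (rows, cols)), shape=...)`, so the memory used grows with the number of subreddits a user has actually commented in rather than with the total number of subreddits.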
The model used for classification is a logistic regression model. Before training, the two axes of the political compass (horizontal and vertical) must be extracted from each user's flair as a (horizontal, vertical) pair. For example: 'authleft' -> ('L', 'A'), 'lib' -> (None, 'L').
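The flair-to-axes step could look something like the sketch below. Only the two mappings given above ('authleft' and 'lib') come from this project; the function name `flair_to_axes`, the substring matching, and the handling of any other flair text are assumptions made for illustration.

```python
def flair_to_axes(flair):
    """Split a flair string into a (horizontal, vertical) pair.

    horizontal is 'L' (left), 'R' (right) or None;
    vertical is 'A' (auth), 'L' (lib) or None.
    """
    text = flair.lower()

    horizontal = None
    if "left" in text:
        horizontal = "L"
    elif "right" in text:
        horizontal = "R"

    vertical = None
    if "auth" in text:
        vertical = "A"
    elif "lib" in text:
        vertical = "L"

    return horizontal, vertical


print(flair_to_axes("authleft"))  # ('L', 'A')
print(flair_to_axes("lib"))       # (None, 'L')
```

Returning `None` for a missing axis makes it possible to drop a user from the training data for that axis only, while still using them for the other one.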
Currently, the model achieves a precision of 0.8 and a recall of 0.8 on a test set of unseen users, also drawn from r/politicalcompassmemes. While these metrics seem quite good, the model is only trained and tested on users from this small community, so its predictions may not generalise well across Reddit's wider userbase. This is especially a concern because intuition suggests that users of r/politicalcompassmemes are likely to comment in more political subreddits than the average Reddit user. Users in the test set will therefore tend to have more relevant data than a typical user, which allows better predictions to be made for them.
Due to the significant class imbalance present in the training data (the number of users who lean 'lib' on the vertical axis is far greater than the number who lean 'auth'), it may be useful to consider alternative metrics such as balanced accuracy (bACC) or predicted positive condition rate (PPCR).
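To show why these metrics matter under class imbalance, here is a minimal sketch using the standard definitions: bACC is the mean of the true positive rate and the true negative rate, and PPCR is the fraction of all instances that are predicted positive. The labels below are invented for illustration and are not results from this model; treating 'A' (auth) as the positive class is also an assumption.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def balanced_accuracy(tp, fp, tn, fn):
    """Mean of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr + tnr) / 2


def ppcr(tp, fp, tn, fn):
    """Predicted positive condition rate: share of predictions that are positive."""
    return (tp + fp) / (tp + fp + tn + fn)


# Made-up vertical-axis labels: 8 'lib' users and 2 'auth' users,
# scored against a degenerate model that always predicts 'lib'.
y_true = ["L"] * 8 + ["A"] * 2
y_pred = ["L"] * 10

counts = confusion_counts(y_true, y_pred, positive="A")
accuracy = (counts[0] + counts[2]) / len(y_true)

print(accuracy)                    # 0.8
print(balanced_accuracy(*counts))  # 0.5
print(ppcr(*counts))               # 0.0
```

In this toy case the always-'lib' model looks reasonable on plain accuracy, but bACC falls to chance level and PPCR shows that the minority class is never predicted at all.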
I hope to add more features in future. NLP could be a promising way to improve predictions for users who have only a few comments available.