As a researcher and university lecturer, I regularly supervise and evaluate bachelor’s and master’s theses. Over the years, I have developed a structured approach to thesis evaluation that aims to be transparent for students while also offering a systematic framework that colleagues may adapt to their own supervision or grading practices. Since questions about how theses are graded often arise, I would like to outline my approach here, focusing especially on empirical projects in psychology or linguistics.
In the German academic system, grades range from 1.0 (excellent) to 5.0 (the lowest possible grade), with intermediate steps such as 1.3, 1.7, and 2.3 in between. Since lower numbers are better, 4.0 is the lowest passing grade, and anything worse counts as a fail. These intermediate steps allow for more nuanced differentiation and often reflect quality differences more precisely than full grades alone.
One thing I always try to do is communicate criteria openly from the very beginning of supervision. This isn’t just about fairness – it’s also practical. When students know what matters, they can better judge their own progress and focus on the things that will genuinely improve their thesis. I’ve seen many students become more confident when expectations are no longer a mystery.
The criteria I use are designed primarily for empirical master’s theses in psychology or linguistics. Such theses typically involve planning and conducting an empirical study, or performing secondary data analyses, including data acquisition or access, statistical evaluation, and interpretation. Other disciplines or programs may follow different requirements – for example, theoretical theses in the humanities or more design-oriented work in applied fields.
For bachelor’s theses, I apply the same general criteria but with lower expectations regarding scope and depth. The research question is usually narrower, the literature review more focused, and the methodological approach less complex than in a master’s project. The scientific standards remain, but the level of sophistication differs.
My evaluation framework comprises 27 parameters divided into eight sections:
1. General aspects of the thesis
2. Literature and theoretical background
3. Specification and justification of the research question
4. Data collection and methodology
5. Statistical analysis
6. Presentation of results
7. Discussion and interpretation
8. General evaluation criteria
Those 27 parameters might sound like a lot, but they help me stay consistent – no gut feelings, no “this just seems like a 2.0”.
For each parameter, I start with a 1.0 in mind and only lower the grade if I can clearly justify why. I write these justifications in the column "Comments" as I go – partly for memory, partly because they turn out to be extremely helpful when I later write the official evaluation report.
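If you want to keep such records in a script or spreadsheet export rather than on paper, the idea can be sketched in a few lines of Python. The class name, parameter name, and comment below are hypothetical examples, not my actual evaluation sheet:

```python
from dataclasses import dataclass

@dataclass
class Parameter:
    """One evaluation parameter; names and comments here are illustrative."""
    section: str
    name: str
    grade: float = 1.0   # start from 1.0; lower it only with a written justification
    comment: str = ""    # the "Comments" column: why the grade was lowered

# Lowering a grade together with its explicit justification
p = Parameter(section="Statistical analysis", name="Choice of statistical tests")
p.grade = 1.7
p.comment = "Linearity test with only seven observations is not meaningful."
```

The default of 1.0 mirrors the starting point described above: a parameter keeps the best grade unless a documented reason says otherwise.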
After evaluating all parameters one by one, I calculate an average for each section (i.e., 8 intermediate grades) and then average those section results to obtain the overall grade. This corresponds to an equal weighting of all sections, although you can apply a different weighting scheme if you'd like to – for example, giving more importance to methodology in empirical work.
The last step is the formal written review. Because the evaluation is detailed and documented, writing the report is straightforward. It summarizes strengths and weaknesses, explains the grade clearly, and provides constructive feedback.
An anonymized example of such a review follows below.
Review of the master’s thesis “Semantic Priming and Conceptual Framing in Multilingual Environments: Effects on Cognitive Load and Lexical Retrieval” by Lena Hinterberger
The thesis presents a study evaluating CogLex, a browser-based platform developed for examining semantic priming effects and reducing lexical retrieval delays among multilingual participants in Germany. In a randomized controlled trial, participants completed psycholinguistic assessments before and after using the platform for two weeks. Cognitive load and lexical retrieval accuracy were evaluated and compared across the two measurement points. No significant differences between the measurement points were found in either the treatment or the control group, although no decisive conclusions can be drawn due to the small sample size. Additionally, user satisfaction and its association with the potential effect of platform use were examined. The topic of the thesis is highly relevant, and the thesis has a clear applied focus.
The work is well-structured, and the theoretical parts (introduction and discussion) are clear and comprehensive. The literature considered is relevant and presented at an appropriate level of precision and detail. The only part missing from the introduction is a subsection on user satisfaction with the platform (one of the critical variables in the study). As a result of this omission, Hypothesis 3 (“User’s satisfaction with the browser-based platform is negatively associated with cognitive load at T2 in the intervention group”) seems to have little support in the reviewed literature. The other theoretical questions and research hypotheses are clearly justified.
Citations are correctly formatted, and around 50% of the sources in the reference list were published in the last five years, which underlines the high relevance of the topic. Most visuals and tables are used appropriately and correctly, although I found the fonts in the figures too small and hard to read. Some distribution plots in Appendix D could be omitted entirely, or narrower bins should have been used to make them more informative. The list of abbreviations is missing several critical abbreviations, such as RT, CLQ, and USQ.
The study was conducted to a high methodological standard. The sample size calculation and sample description are adequate. The data collection process is carefully documented, and its description allows the study to be reproduced. The thesis includes four appendices that supplement the main text with helpful information about the data collection process and statistical analyses. However, the bilingual versions of the CLQ questionnaire differ across mentions (cf. pp. 57 and 63), which appears to be an unfortunate oversight.
Most statistical procedures are chosen correctly and applied appropriately. The two exceptions are the linearity test with just seven observations (see Figure 9), which I would not recommend conducting, and the examination of Hypothesis 3, in which user satisfaction was simply correlated with cognitive load at the post-test measurement (T2). I do not see the theoretical rationale behind this analysis and would suggest correlating the user satisfaction score with the difference score (T2 minus T1) or conducting a moderation analysis instead. Additionally, it would help readers if the exact directions of statistical differences were summarized in the main text, i.e., not merely “A statistically significant effect of X was found” but rather “group A had significantly higher X than group B”.
The main results are briefly summarized at the beginning of the discussion section. The study is discussed in the context of relevant literature. Study limitations are critically reflected, particularly the study’s small sample size, which could not be increased for organizational reasons beyond Ms. Hinterberger’s control. The implications of the study are discussed in a focused and constructive manner. The discussion of broader context effects (the implementation of language policy changes in multilingual education, which coincided with data collection; see pp. 31–32) is profound and outstanding.
While working on the thesis, Ms. Hinterberger demonstrated a high degree of independence and personal initiative.
Based on the criteria outlined above, the thesis deserves a grade of 1.7.