Ask HN: Better approach for plagiarism detection in self-hosted LMS?

1 point | 3 hours ago | 0 comments

I'm building an open-source LMS and added plagiarism detection using OpenSearch's more_like_this query plus character n-grams for similarity scoring.

Basically, when a student submits an answer, I search for similar answers from other students to the same question. It works decently but feels a bit hacky, since I'm just reusing the search engine I already had.

Current setup:

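  import re

  # keep only documents that contain an answer to this question
  search = cls.search().filter(
      "nested",
      path="answers",
      query={"term": {"answers.question_id": str(question_id)}},
  )
  # score candidates by textual similarity to the submitted answer
  search = search.query(
      "nested",
      path="answers",
      query={
          "more_like_this": {
              "fields": ["answers.answer"],
              "like": text,
              "min_term_freq": 1,
              "minimum_should_match": "1%",
          }
      },
  )

  # get top 10, then re-rank in Python
  # (slicing + execute() assumes the opensearch-dsl Search API)
  response = search[:10].execute()

  def normalize(t):
      # drop all whitespace so formatting differences don't affect the score
      return re.sub(r"\s+", "", t.strip())

  def char_ngrams(t, n=3):
      return {t[i:i+n] for i in range(len(t) - n + 1)}

  norm_text = normalize(text)
  text_ngrams = char_ngrams(norm_text)

  flagged = []  # illustrative result list of (hit, ratio) pairs
  for hit in response.hits:
      norm_answer = normalize(hit.answer)
      answer_ngrams = char_ngrams(norm_answer)

      # Jaccard similarity of the two trigram sets, as a whole percentage
      intersection = len(text_ngrams & answer_ngrams)
      union = len(text_ngrams | answer_ngrams)
      # integer math avoids float truncation; guard against two very short texts
      ratio = intersection * 100 // union if union else 0

      if ratio >= 60:
          flagged.append((hit, ratio))  # flag as similar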
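
One thing worth double-checking in the query above: the question filter and the more_like_this clause sit in two separate nested clauses, and each nested clause is matched independently. If a document holds several answers (an assumption, the mapping isn't shown here), a document can pass the question filter through one answer and be scored on a different one. Below is a minimal sketch of a variant that keeps both conditions inside a single nested clause, written as a raw query body. `build_similarity_query` is a hypothetical helper name, not part of any library.

```python
def build_similarity_query(question_id, text, size=10):
    """Build a raw search body where the question filter and the
    more_like_this match must both hold for the SAME nested answer.

    Field names follow the post; the helper itself is illustrative.
    """
    return {
        "size": size,
        "query": {
            "nested": {
                "path": "answers",
                "query": {
                    "bool": {
                        # non-scoring restriction to the question at hand
                        "filter": [
                            {"term": {"answers.question_id": str(question_id)}}
                        ],
                        # scoring clause: textual similarity to the submission
                        "must": [
                            {
                                "more_like_this": {
                                    "fields": ["answers.answer"],
                                    "like": text,
                                    "min_term_freq": 1,
                                    "minimum_should_match": "1%",
                                }
                            }
                        ],
                    }
                },
            }
        },
    }
```

Only the shape of the query is the point here; the dict would be sent as the body of a normal search request.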

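For sanity-checking the 60% threshold without a running cluster, the re-ranking step can be pulled out into a standalone function. This is a sketch of the same character-trigram Jaccard score; `similarity_percent` is a name made up for this example.

```python
import re


def normalize(t: str) -> str:
    """Remove all whitespace so spacing and line breaks don't affect the score."""
    return re.sub(r"\s+", "", t)


def char_ngrams(t: str, n: int = 3) -> set:
    """Return the set of overlapping character n-grams in t."""
    return {t[i:i + n] for i in range(len(t) - n + 1)}


def similarity_percent(a: str, b: str, n: int = 3) -> int:
    """Jaccard similarity of the two n-gram sets, floored to a whole percent."""
    grams_a = char_ngrams(normalize(a), n)
    grams_b = char_ngrams(normalize(b), n)
    union = len(grams_a | grams_b)
    if union == 0:  # both texts are shorter than n characters
        return 0
    return len(grams_a & grams_b) * 100 // union


# "abcd" -> {abc, bcd}, "abce" -> {abc, bce}: 1 shared trigram out of 3
print(similarity_percent("abcd", "abce"))      # 33
# whitespace is stripped before comparing, so these are identical
print(similarity_percent("a b\tc d", "abcd"))  # 100
```

Because the score is computed on sets, repeated trigrams count once, so a long answer that repeats itself scores the same as a shorter one with the same trigrams.
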
Constraints:
- Self-hosted only, no external APIs
- Few thousand students
- Want simple operations, already running OpenSearch anyway

Questions:
- Is this approach reasonable, or am I missing something obvious?
- What do other self-hosted systems use? I checked the Moodle docs, but their plagiarism plugins mostly call external services.
- Has anyone tried lightweight ML models for this that don't need a GPU?

The search engine approach works, but I'm curious whether there's a better way that fits our constraints.
