Class: FactDb::Validation::AliasFilter

Inherits:
Object
Defined in:
lib/fact_db/validation/alias_filter.rb

Overview

Filters out invalid aliases such as pronouns, common terms, and generic references. Used by extractors, services, and models to ensure alias quality.

Constant Summary

PRONOUNS =

English pronouns (subject, object, possessive, reflexive), plus interrogative, demonstrative, and indefinite forms

%w[
  i me my mine myself
  you your yours yourself yourselves
  he him his himself
  she her hers herself
  it its itself
  we us our ours ourselves
  they them their theirs themselves
  who whom whose
  this that these those
  what which
  one ones
  all any both each either neither none some
  another other others
].freeze
GENERIC_TERMS =

Common generic terms that shouldn’t be aliases

%w[
  a an the
  man woman person people men women
  boy girl child children
  husband wife brother sister father mother son daughter
  king queen prince princess lord lady
  sir madam mr mrs ms miss dr
  someone something somewhere anyone anything anywhere
  everyone everything everywhere nobody nothing nowhere
  here there
  today yesterday tomorrow
  now then
].freeze
GENERIC_ROLES =

Common role/title references that are too generic

%w[
  the\ man the\ woman the\ person the\ people
  a\ man a\ woman a\ person
  this\ man this\ woman this\ person
  that\ man that\ woman that\ person
  the\ king the\ queen the\ lord the\ lady
  the\ brother the\ sister the\ father the\ mother
  the\ husband the\ wife
  the\ boy the\ girl the\ child
  believers disciples apostles
  men greek\ men
].freeze
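
The backslash-escaped spaces in GENERIC_ROLES are significant: inside a %w[] literal, whitespace separates elements, so writing the\ man is how a multi-word phrase is kept as a single string. A short illustration, using a hypothetical SAMPLE_ROLES constant that is not part of FactDb:

```ruby
# SAMPLE_ROLES is a hypothetical constant used only for illustration.
SAMPLE_ROLES = %w[
  the\ man a\ woman
  believers greek\ men
].freeze

SAMPLE_ROLES.length               # 4 elements, not 6
SAMPLE_ROLES.include?("the man")  # true: the escaped space stays inside the element
SAMPLE_ROLES.include?("the")      # false: "the" is never an element on its own
```

Because each phrase is one element, a lookup against the list presumably compares the whole normalized alias rather than individual words.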
AMBIGUOUS_FIRST_NAMES =

Common first names that are too ambiguous to use as standalone aliases. These should only be valid when part of a fuller name.

%w[
  simon peter john james paul mark matthew luke andrew philip
  thomas james joseph mary martha elizabeth sarah anna david
  michael robert william richard henry george charles edward
  mary ann jane elizabeth margaret catherine alice
].freeze
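
The "standalone versus fuller name" rule can be sketched as follows. The private helper that applies AMBIGUOUS_FIRST_NAMES is not shown on this page, so the method name ambiguous_standalone?, the shortened SAMPLE_FIRST_NAMES list, and the single-word test are all assumptions made for illustration:

```ruby
# Hypothetical stand-in list; the real constant is AMBIGUOUS_FIRST_NAMES.
# The repeated entry is harmless for include?-style lookups.
SAMPLE_FIRST_NAMES = %w[simon peter john james mary james].freeze

# Assumed rule: a single word found in the list is too ambiguous,
# while the same word inside a longer name is acceptable.
def ambiguous_standalone?(text)
  words = text.to_s.strip.downcase.split
  words.length == 1 && SAMPLE_FIRST_NAMES.include?(words.first)
end

ambiguous_standalone?("Peter")        # true
ambiguous_standalone?("Simon Peter")  # false
ambiguous_standalone?("Cephas")       # false
```

The real helper also receives the entity's name, which this sketch ignores.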

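Class Method Summary

  • .filter(aliases, name: nil) ⇒ Array<String>

    Filter an array of aliases, returning only valid ones.

  • .rejection_reason(text, name: nil) ⇒ String?

    Get a human-readable reason why an alias was rejected.

  • .valid?(text, name: nil) ⇒ Boolean

    Check if a potential alias is valid.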

Class Method Details

.filter(aliases, name: nil) ⇒ Array<String>

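Filters an array of aliases, returning only the valid ones. Each entry is converted to a string and stripped, blank entries are dropped, and case-insensitive duplicates are collapsed to their first occurrence. A non-Array argument returns an empty array.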

Parameters:

  • aliases (Array<String>)

    Array of potential aliases

  • name (String, nil) (defaults to: nil)

    The entity’s name

Returns:

  • (Array<String>)

    Array of valid aliases



# File 'lib/fact_db/validation/alias_filter.rb', line 89

def filter(aliases, name: nil)
  return [] unless aliases.is_a?(Array)

  aliases
    .map { |a| a.to_s.strip }
    .reject { |a| a.empty? }
    .select { |a| valid?(a, name: name) }
    .uniq { |a| a.downcase }
end
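
To see what each stage of the pipeline contributes, here is a minimal self-contained sketch. The method name sample_filter and the block that stands in for valid? are assumptions for illustration; only the chain of map, reject, select, and uniq comes from the listing above:

```ruby
# sample_filter mirrors the pipeline above; the block stands in for valid?.
def sample_filter(aliases)
  return [] unless aliases.is_a?(Array)

  aliases
    .map { |a| a.to_s.strip }   # nil becomes "", padding is removed
    .reject(&:empty?)           # blanks are dropped
    .select { |a| yield(a) }    # stand-in validity check
    .uniq(&:downcase)           # the first spelling of each alias wins
end

# Assumed stand-in rule: reject anything shorter than two characters.
input  = ["  Cephas ", "", nil, "cephas", "X", "Simon Peter"]
result = sample_filter(input) { |a| a.length >= 2 }
# result == ["Cephas", "Simon Peter"]
```

Note that uniq compares the downcased form but returns the original spelling, so "Cephas" survives and the later "cephas" is discarded.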

.rejection_reason(text, name: nil) ⇒ String?

Get a human-readable reason why an alias was rejected

Parameters:

  • text (String)

    The alias text

  • name (String, nil) (defaults to: nil)

    The entity’s name

Returns:

  • (String, nil)

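    Rejection reason, or nil when none of the checks in this method match. Unlike .valid?, this method does not call matches_canonical?, so an alias rejected only for matching the entity’s name returns nil here.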



# File 'lib/fact_db/validation/alias_filter.rb', line 103

def rejection_reason(text, name: nil)
  return "empty or nil" if text.nil? || text.to_s.strip.empty?

  normalized = text.to_s.strip.downcase

  return "too short (less than 2 characters)" if too_short?(normalized)
  return "is a pronoun" if pronoun?(normalized)
  return "is a generic term" if generic_term?(normalized)
  return "is a generic role reference" if generic_role?(normalized)
  return "contains only articles and generic words" if only_articles_and_generic?(normalized)
  return "is an ambiguous standalone first name" if ambiguous_standalone_name?(normalized, name)

  nil
end
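
The guard clauses above are evaluated in order, so when an alias fails several checks only the first matching reason is reported. The sketch below illustrates that first-match behaviour with a cut-down rule table; SAMPLE_PRONOUNS, SAMPLE_RULES, and sample_rejection_reason are hypothetical names, and the two predicates are guesses at what the private helpers do:

```ruby
# Hypothetical, cut-down rule table; predicates and word lists are stand-ins.
SAMPLE_PRONOUNS = %w[i he she it they].freeze

SAMPLE_RULES = [
  ["too short (less than 2 characters)", ->(t) { t.length < 2 }],
  ["is a pronoun", ->(t) { SAMPLE_PRONOUNS.include?(t) }]
].freeze

def sample_rejection_reason(text)
  normalized = text.to_s.strip.downcase
  return "empty or nil" if normalized.empty?

  # find stops at the first rule whose check returns true.
  rule = SAMPLE_RULES.find { |_reason, check| check.call(normalized) }
  rule&.first
end

sample_rejection_reason("I")       # "too short (less than 2 characters)": the earlier rule wins
sample_rejection_reason("They")    # "is a pronoun"
sample_rejection_reason("Cephas")  # nil
sample_rejection_reason("  ")      # "empty or nil"
```

A table like this is only one way to express the ordering; the original uses plain guard clauses, which read just as clearly for a short list of checks.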

.valid?(text, name: nil) ⇒ Boolean

Check if a potential alias is valid

Parameters:

  • text (String)

    The alias text to validate

  • name (String, nil) (defaults to: nil)

    The entity’s name (for comparison)

Returns:

  • (Boolean)

    true if the alias is valid



# File 'lib/fact_db/validation/alias_filter.rb', line 68

def valid?(text, name: nil)
  return false if text.nil?

  normalized = text.to_s.strip.downcase

  return false if normalized.empty?
  return false if too_short?(normalized)
  return false if pronoun?(normalized)
  return false if generic_term?(normalized)
  return false if generic_role?(normalized)
  return false if matches_canonical?(normalized, name)
  return false if only_articles_and_generic?(normalized)
  return false if ambiguous_standalone_name?(normalized, name)

  true
end
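
The predicates called by valid? are private and not documented on this page. As a rough sketch of how the name: argument might be used, the example below assumes that matches_canonical? is a case-insensitive comparison against the entity's name. The names sample_matches_canonical? and sample_valid? are hypothetical, and the length check is a stand-in for too_short?:

```ruby
# Assumed behaviour: an alias that merely repeats the entity's name adds nothing.
def sample_matches_canonical?(normalized, name)
  return false if name.nil?

  normalized == name.to_s.strip.downcase
end

def sample_valid?(text, name: nil)
  return false if text.nil?

  normalized = text.to_s.strip.downcase
  return false if normalized.empty?
  return false if normalized.length < 2 # stand-in for too_short?
  return false if sample_matches_canonical?(normalized, name)

  true
end

sample_valid?("Simon Peter", name: "simon peter")  # false: repeats the entity's name
sample_valid?("Cephas", name: "Simon Peter")       # true
sample_valid?(nil)                                 # false
```

Normalizing once at the top, as the original does, keeps every later check free of case and whitespace handling.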