Summaries

There was no class

22 October 2018, 15:30 Helena Galhardas

The instructor went to a conference abroad.

The planned schedule already accounts for this absence.


Duplicate detection and elimination with PDI

22 October 2018, 15:30 Diogo Ribeiro Ferreira

Pairwise comparison by join. Pairwise similarity and thresholds. Calculating similarity/distance measures in PDI. Weighted similarity on multiple attributes. Reducing the number of comparisons. Clustering and transitive closure. Merging of duplicates.
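
As a rough illustration of the steps listed above (pairwise comparison, weighted similarity over several attributes, a threshold, and clustering by transitive closure), here is a minimal Python sketch written outside PDI. The sample records, the attribute weights, the 0.8 threshold and the use of `difflib.SequenceMatcher` as the string measure are all assumptions made for this example, not the settings used in class.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records, weights and threshold, chosen only for illustration.
records = {
    1: {"name": "John Smith", "city": "Lisbon"},
    2: {"name": "Jon Smith", "city": "Lisbon"},
    3: {"name": "J. Smith", "city": "Lisboa"},
    4: {"name": "Maria Costa", "city": "Porto"},
}
WEIGHTS = {"name": 0.7, "city": 0.3}  # weights sum to 1
THRESHOLD = 0.8


def attribute_similarity(a, b):
    """Similarity in [0, 1]; SequenceMatcher stands in for any string measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2):
    """Weighted sum of the per-attribute similarities."""
    return sum(w * attribute_similarity(r1[k], r2[k]) for k, w in WEIGHTS.items())


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_duplicates(recs):
    """Compare all pairs, link those above the threshold, return the clusters."""
    parent = {rid: rid for rid in recs}
    for a, b in combinations(recs, 2):  # n * (n - 1) / 2 comparisons
        if record_similarity(recs[a], recs[b]) >= THRESHOLD:
            parent[find(parent, a)] = find(parent, b)  # union = transitive closure
    clusters = {}
    for rid in recs:
        clusters.setdefault(find(parent, rid), []).append(rid)
    return sorted(sorted(c) for c in clusters.values())
```

With the sample data, `cluster_duplicates(records)` places records 1, 2 and 3 in one cluster and leaves record 4 on its own. Because clusters are built by transitive closure, two records can share a cluster even when their direct similarity is below the threshold, as long as a chain of matching pairs connects them.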


Duplicate detection and elimination with PDI and Lab Guide 5

22 October 2018, 09:30 João Pedro Lebre Magalhães Pereira

  • How to specify a duplicate detection and elimination process with PDI transformations
  • Resolution of Lab Guide 5


String Matching

18 October 2018, 15:30 Diogo Ribeiro Ferreira

The Damerau-Levenshtein distance. Converting a distance measure into a similarity measure. The Needleman-Wunsch measure. The Jaro and Jaro-Winkler measures. The Jaccard measure. Phonetic measures: Soundex and Refined Soundex.
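
To make two of these measures concrete, the sketch below computes an edit distance with adjacent transpositions, converts it into a similarity, and computes a Jaccard score. It makes several assumptions: it implements the optimal string alignment variant (a restricted form of Damerau-Levenshtein), it normalises by the longer string length, which is only one of several possible conversions, and it applies Jaccard to character bigrams. The function names are made up for this example.

```python
def osa_distance(s, t):
    """Optimal string alignment distance: insertions, deletions,
    substitutions and transpositions of adjacent characters, each costing 1.
    This is a restricted variant of the Damerau-Levenshtein distance."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of s[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def distance_to_similarity(s, t):
    """Map the distance to [0, 1] by dividing by the longer string length."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - osa_distance(s, t) / longest


def jaccard(s, t, n=2):
    """Jaccard measure over the sets of character n-grams of s and t."""
    def grams(x):
        return {x[i:i + n] for i in range(len(x) - n + 1)}
    a, b = grams(s), grams(t)
    return 1.0 if not a and not b else len(a & b) / len(a | b)
```

For example, `osa_distance("abcd", "acbd")` is 1 because a single transposition of `b` and `c` suffices, so `distance_to_similarity("abcd", "acbd")` is 0.75. `jaccard("dave", "dav")` is 2/3, since the two strings share two of three distinct bigrams.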


Data matching and fusion

18 October 2018, 14:30 Helena Galhardas

Data matching (detection of approximate duplicates):

  • Two challenges: accuracy and efficiency
  • Record-oriented matching techniques: rule-based matching 
  • Scaling-up record-set oriented matching: sorted neighbourhood method
  • Measures and data sets
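
The sorted neighbourhood method mentioned above can be sketched as follows: sort the records on a key, slide a fixed-size window over the sorted list, and compare only records that fall in the same window. This is a minimal Python illustration; the sample records, the key (first two letters of the surname plus the zip code) and the window sizes are assumptions chosen for the example.

```python
def sorted_neighbourhood_pairs(records, key, window):
    """Yield candidate pairs of record ids.

    Records are sorted by `key`; each record is paired only with the
    next `window - 1` records in that order."""
    ordered = sorted(records, key=lambda rid: key(records[rid]))
    for i, rid in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            yield (rid, other)


def sorting_key(record):
    """Hypothetical sorting key: surname prefix followed by zip code."""
    return record["surname"][:2].lower() + record["zip"]


# Hypothetical sample data.
people = {
    1: {"surname": "Smith", "zip": "1000"},
    2: {"surname": "Smyth", "zip": "1000"},
    3: {"surname": "Costa", "zip": "4000"},
    4: {"surname": "Smith", "zip": "1000"},
    5: {"surname": "Almeida", "zip": "2000"},
}

candidates = list(sorted_neighbourhood_pairs(people, sorting_key, window=3))
```

With five records, an all-pairs comparison needs 10 comparisons, while a window of 3 produces 7 candidate pairs here, and the three "Sm" records are still compared with one another. The saving grows with the number of records, since the number of candidate pairs is roughly proportional to n times the window size rather than to n squared. The price is that duplicates whose keys sort far apart are never compared.
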

Data fusion (elimination of approximate duplicates):

  • Types of data conflicts
  • Data conflict resolution strategies and functions
  • Relational operators and extensions
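
As a small illustration of conflict resolution functions, the Python sketch below fuses a cluster of duplicate records into a single record by applying one resolution function per attribute. The three functions (`vote`, `longest`, `coalesce`), the sample cluster and the choice of function per attribute are assumptions made for this example; they are not a list of the functions covered in the lecture.

```python
from collections import Counter


def vote(values):
    """Return the most frequent non-null value, or None if all are null."""
    present = [v for v in values if v is not None]
    return Counter(present).most_common(1)[0][0] if present else None


def longest(values):
    """Return the longest non-null string, or None if all are null."""
    present = [v for v in values if v is not None]
    return max(present, key=len) if present else None


def coalesce(values):
    """Return the first non-null value, or None if all are null."""
    return next((v for v in values if v is not None), None)


def fuse(cluster, resolvers):
    """Merge duplicate records, resolving each attribute with its own function."""
    return {
        attr: resolve([record.get(attr) for record in cluster])
        for attr, resolve in resolvers.items()
    }


# Hypothetical cluster of records already identified as duplicates.
cluster = [
    {"name": "J. Smith", "city": "Lisbon", "phone": None},
    {"name": "John Smith", "city": "Lisbon", "phone": "555-0100"},
    {"name": "Jon Smith", "city": "Lisboa", "phone": None},
]
resolvers = {"name": longest, "city": vote, "phone": coalesce}

fused = fuse(cluster, resolvers)
```

Here `fused` holds the name "John Smith" (longest value), the city "Lisbon" (majority value) and the phone "555-0100" (only non-null value). The null phone values show an uncertainty, where a value is missing, while the differing city spellings show a contradiction, where non-null values disagree.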