Web Crawling

14 November 2019, 15:30 Bruno Emanuel Da Graça Martins

  • Motivation and taxonomy of crawlers
  • Basic crawlers and implementation issues
    • BFS vs. DFS traversal
    • Fetching the contents
    • Parsing HTML and other formats
    • Relative vs. absolute URLs
    • URL canonicalization
    • Avoiding spider traps
    • The page repository
    • Concurrent crawlers
  • Universal crawlers
    • Performance and scalability
    • Crawling policies
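
As a rough illustration of how several of the topics above fit together (BFS traversal, relative vs. absolute URLs, URL canonicalization, and a depth limit as one simple guard against spider traps), here is a minimal Python sketch. It is not taken from the lecture: the names `canonicalize`, `bfs_crawl`, `TOY_WEB` and `toy_fetch_links` are hypothetical, the normalization rules are only a small illustrative subset, and an in-memory dictionary stands in for real fetching and HTML parsing.

```python
from collections import deque
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Apply a few common normalizations (illustrative subset, not a standard).

    Simplification: any userinfo in the URL is discarded.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"  # an empty path is treated as "/"
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def bfs_crawl(seed, fetch_links, max_pages=100, max_depth=5):
    """Visit pages breadth-first, returning canonical URLs in visit order.

    fetch_links(url) is assumed to return the raw href values found on a page.
    max_depth is a crude guard against spider traps (unbounded link chains).
    """
    seed = canonicalize(seed)
    frontier = deque([(seed, 0)])  # FIFO queue gives BFS; a stack would give DFS
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url, depth = frontier.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for href in fetch_links(url):
            # Resolve relative links against the page URL, then canonicalize.
            link = canonicalize(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order


# Toy link graph standing in for the Web (hypothetical data).
TOY_WEB = {
    "http://example.org/": ["a.html", "/b/", "HTTP://EXAMPLE.ORG:80/a.html#top"],
    "http://example.org/a.html": ["b/", "http://example.org/"],
    "http://example.org/b/": ["../a.html", "c.html"],
    "http://example.org/b/c.html": [],
}


def toy_fetch_links(url):
    return TOY_WEB.get(url, [])


if __name__ == "__main__":
    for page in bfs_crawl("http://example.org", toy_fetch_links):
        print(page)
```

In this sketch the `seen` set plays the role of the duplicate check, so the three different spellings of `a.html` in the toy graph are visited only once. A real crawler would additionally need politeness delays, robots.txt handling, and a persistent page repository, none of which are shown here.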