Web Crawling
14 November 2019, 15:30 • Bruno Emanuel Da Graça Martins
- Motivation and taxonomy of crawlers
- Basic crawlers and implementation issues
- BFS vs. DFS traversal
- Fetching the contents
- Parsing HTML and other formats
- Relative vs. absolute URLs
- URL canonicalization
- Avoiding spider traps
- The page repository
- Concurrent crawlers
- Universal crawlers
- Performance and scalability
- Crawling policies
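
To make a few of the topics above concrete (BFS traversal, parsing HTML, relative vs. absolute URLs, URL canonicalization, and a simple guard against spider traps), here is a minimal sketch in Python. It is an illustration under stated assumptions, not the implementation presented in the lecture: the names `LinkExtractor`, `canonicalize`, `crawl_bfs`, and `PAGES` are hypothetical, the "web" is an in-memory dictionary so the code runs without network access, and the canonicalization rules shown are only a small subset of what a real crawler would apply.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def canonicalize(url):
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def crawl_bfs(seed, fetch, max_depth=2):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    start = canonicalize(seed)
    frontier = deque([(start, 0)])
    seen = {start}
    order = []
    while frontier:
        # A FIFO queue gives BFS; popping from the right instead would give DFS.
        url, depth = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        if depth == max_depth:
            # Depth cap: a crude guard against spider traps (assumed policy).
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links against the current page, then canonicalize.
            link = canonicalize(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return order


# Hypothetical in-memory "web" standing in for real HTTP fetches.
PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="/b.html#top">B</a>',
    "http://example.org/a.html": '<a href="HTTP://EXAMPLE.ORG/b.html">B again</a>',
    "http://example.org/b.html": '<a href="/">home</a>',
}

print(crawl_bfs("http://example.org", PAGES.get))
```

In this sketch the three differently written links to `b.html` collapse to one canonical URL, so the page is visited only once. Replacing `PAGES.get` with a function that performs real HTTP requests would be the next step, at which point politeness policies and robots.txt handling become necessary.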