Project and essay

  • Project and essay statement v0
    • first delivery deadline: October 30th
    • [new] second delivery deadline: November 30th 
    • [new] essay delivery deadline: December 6th
    • corrections and clarifications will be posted here
  • Project material (credentials given after signing the form)
  • Report templates: project (word, latex) and essay (word, latex)
  • Functional demonstration Jupyter notebook template (Project 1): notebook
  • Project demonstrations
    Thursday Dec 10th               Friday Dec 11th


Please always consult the FAQ before posting questions to the faculty hosts.

Notes on the essay:

  • On the essay challenges:
    • (a) ii: 'filtering' is the task of retrieving documents from a collection using a document of interest as the query
    • (b) ii: the goal here is to discuss how classifiers can leverage the document structure (zones)
    • (c) i: 'original IR system' can be seen either as a classic IR system (similar to the one proposed in the first project delivery) or as an IR system using link-based ranking
    • (c) iii: if your earlier project delivery already considers a topic-dependent graph, you can use that strategy as a starting point
  • If some of the proposed principles have cons, do not forget to identify them along with the pros.

Questions related to the second delivery:

  1. Can you clarify the programming and reporting requirements for the second delivery?
    • program: functionalities 3.1 to 3.3 and a jupyter notebook displaying their invocation and outputs
    • report: compact, direct answers to the questions posed in sections 3.1-3.3, along with the supporting charts
  2. Functionalities on Dtrain and Dtest take too much time
    • As in the first delivery, you can reduce the collections to consider only documents with judgments for a given subset of topics.
    • [new] Note that the use of pairwise similarities in some clustering, classification and graph algorithms can become computationally expensive for a collection with >1k documents. In this context, we suggest using collection samples of roughly 1000 documents.
  3. Can we use the scikit-learn TFIDF vectorizer instead of the traditional inverted index?
    • Yes. In fact, we recommend using the TFIDF vectorizer for all three parts (3.1-3.3) of this second delivery. Still, keep in mind that real-world search engines use an inverted index to support these tasks.
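
As a rough illustration of answers 2 and 3 above, the sketch below restricts a collection to judged documents, samples it down to a target size, builds a TF-IDF matrix with scikit-learn's TfidfVectorizer, and computes pairwise cosine similarities. The toy documents, the `sample_docs` helper, and the seed are illustrative assumptions and not part of the project material.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def sample_docs(docs, judged_ids, max_docs=1000, seed=0):
    """Keep only judged documents, then sample at most max_docs of them.

    docs: dict mapping doc id -> text (hypothetical in-memory collection).
    judged_ids: ids of documents with judgments for the chosen topics.
    """
    kept = sorted(doc_id for doc_id in docs if doc_id in judged_ids)
    if len(kept) > max_docs:
        kept = sorted(random.Random(seed).sample(kept, max_docs))
    return {doc_id: docs[doc_id] for doc_id in kept}


# Toy stand-in for the real collection (assumption for illustration only).
docs = {
    "d1": "oil prices rise as supply falls",
    "d2": "crude oil supply falls again",
    "d3": "football club wins the cup final",
    "d4": "unjudged document about weather",
}
judged = {"d1", "d2", "d3"}

sample = sample_docs(docs, judged, max_docs=1000)
ids = list(sample)

# Sparse TF-IDF document-term matrix, one row per sampled document.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([sample[i] for i in ids])

# Dense n x n similarity matrix: memory and time grow quadratically
# with n, which is why sampling to roughly 1000 documents helps.
sim = cosine_similarity(X)
```

Fixing the seed keeps the sample reproducible across runs, which makes results in the report easier to compare.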

Questions related to the first delivery:

  1. Can you clarify the programming and reporting requirements for the first delivery?
    • IR system program: implement functionalities (a)-(e) and the reciprocal rank function
    • report: compact, direct answers to the questions, along with the supporting charts
  2. Do we need to program the indexing function from scratch?
    • No. You can use facilities from packages such as whoosh (check Lab 3).
  3. Do we need to be able to differentiate the fields (elements) of each new document?
    • No. Yet, if you are curious to see how fields affect IR, you can for instance increase the weight of occurrences in the headline.
  4. Indexing and searching Dtest documents from the RCV1 collection takes too much time.
    • Please consider the following principles:
      • select a small sample of documents from Dtest for development purposes and only use the whole Dtest to gather the final results
      • focus on 3 to 5 of the provided topics to minimize the number of queries
  5. There are some additional folders (dtds, codes). Do we need to use them?
    • No.
  6. What does "the Boolean querying should tolerate up to round(0.2 × k) term mismatches" mean?
    • up to 20% of the k terms in a Boolean query may be absent from a document and it still counts as a match
    • a different stance is to consider word/phrase similarity (terms in the document should be at least 80% similar). You may adopt this stance, but state the decision upfront in your report
  7. The high number of documents in the collection, relative to those with available judgments, prevents a robust evaluation of some queries due to a scarcity of judged documents.
    • To mitigate this problem, when subsampling the Reuters collection you can ensure that only documents judged for a given set of queries are selected.
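
The first reading of the mismatch tolerance in question 6 above can be sketched as follows. This is a minimal illustration, not a reference solution: the function name `tolerant_boolean_match` is hypothetical, and taking k as the number of distinct query terms is an assumption.

```python
def tolerant_boolean_match(query_terms, doc_terms, tolerance=0.2):
    """Return True if the document misses at most round(tolerance * k)
    of the k distinct query terms (k counted over distinct terms is an
    assumption of this sketch)."""
    terms = set(query_terms)
    k = len(terms)
    allowed_misses = round(tolerance * k)
    doc_vocab = set(doc_terms)
    misses = sum(1 for term in terms if term not in doc_vocab)
    return misses <= allowed_misses


# k = 5 query terms, so round(0.2 * 5) = 1 mismatch is tolerated.
query = ["oil", "price", "supply", "crude", "market"]
doc_a = ["oil", "price", "supply", "crude"]   # misses "market" only
doc_b = ["oil", "price", "supply"]            # misses two terms

match_a = tolerant_boolean_match(query, doc_a)
match_b = tolerant_boolean_match(query, doc_b)
```

Here `doc_a` matches and `doc_b` does not. For very short queries (k = 1 or 2) the tolerance rounds to zero, so the check falls back to a strict conjunctive match.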