  This class will be offered in the Fall 2008 semester.
 
 


Tentative Syllabus

  1. We will cover the following subjects:
    • Review of basic internet technologies: HTML, PHP, Java HttpURLConnection
    • Introduction to Information Retrieval (text).
    • Inverted indices and boolean queries.
    • Query optimization.
    • Unstructured vs semi-structured text.
    • Text encoding: tokenization, stemming, lemmatization, stop words, phrases.
    • Proximity and phrase queries. Positional indices.
    • The vector space retrieval model.
    • tf.idf weighting. Scoring documents. The cosine measure.
    • Introduction to data clustering.
    • Partitioning methods: k-means clustering. Hierarchical clustering.
    • Introduction to text classification. Naive Bayes models. Email spam filtering.
    • k-nearest neighbors, decision boundaries, vector space classification using centroids, decision trees.
    • The structure of the Web graph.
    • Zipf's and Pareto's Laws.
    • Web search overview, web structure, the user, paid placement, search engine optimization/spam
    • Web characteristics: Web size measurement, near-duplicate detection
    • Web crawling and web indices
    • Link analysis
    • PageRank and HITS ranking methods
    • Recognizing web spam with statistical and graph-theoretic methods
    • Web community discovery