Tentative Syllabus
We will cover the following subjects:
• Review of basic internet technologies: HTML, PHP, Java HttpURLConnection
• Introduction to Information Retrieval (text).
• Inverted indices and boolean queries.
• Query optimization.
• Unstructured vs semi-structured text.
• Text encoding: tokenization, stemming, lemmatization, stop words, phrases.
• Proximity and phrase queries. Positional indices.
• The vector space retrieval model.
• tf.idf weighting. Scoring documents. The cosine measure.
• Introduction to data clustering.
• Partitioning methods: k-means clustering. Hierarchical clustering.
• Introduction to text classification. Naive Bayes models. Email spam filtering.
• k-nearest neighbors, decision boundaries, vector space classification using centroids, decision trees
• The structure of the Web graph.
• Zipf's and Pareto's Laws.
• Web search overview, web structure, the user, paid placement, search engine optimization/spam
• Web characteristics: Web size measurement, Near-duplicate detection
• Web Crawling and web indexes
• Link analysis
• PageRank and HITS ranking methods
• Recognizing web spam with statistical and graph-theoretic methods
• Web community discovery