Intro to AI 2
What are the two components of Automated Reasoning?
Knowledge Base (KB): what we know about the world (a domain of interest, e.g. number theory, medicine).
Inference Engine: how we think; used to answer queries and derive implicit knowledge about the world (the modeled domain).
What was the initial proposal in the early days of AI?
Deductive Logic: a top-down approach to logical thinking where conclusions are derived from general principles. E.g. all dogs have ears; golden retrievers are dogs; therefore golden retrievers have ears.
What is the weakness of Deductive Logic?
Deductive Logic fails in a domain like medical diagnosis (and others such as law, business, automobile repair, gardening, dating) due to: laziness, theoretical ignorance, and practical ignorance.
What are the limits of Deductive Logic?
Deductive Logic is monotonic: once we deduce something from a KB, we can never invalidate the deduction by acquiring more knowledge.
What is the Qualification Problem?
To deduce a conclusion without relying on assumptions, all necessary preconditions must be enumerated and met, e.g. for a bird to fly: it must have two wings, must not be afraid of flying, must have already learned how to fly, etc.
What are some solutions to the limits of monotonic logic?
Solution 1: Non-Monotonic Logics. Solution 2: Degree of Belief.
What are Non-Monotonic Logics?
Equipping logic with the ability to jump to conclusions. Requires: mechanisms for managing assumptions, and criteria for deciding which assumptions to assert and retract, and when. Consistency-based approach: assert as many assumptions as possible, as long as they do not lead to a logical inconsistency.
What is the problem with Non-Monotonic Logics?
They have contradicting extensions/conclusions!
How do you resolve the conflict for Non-Monotonic Logics?
Belief revision in non-monotonic logics, or use the notion of a Degree of Belief and probabilistic reasoning.
What is a Degree of Belief?
Instead of declaring facts (as in Deductive Logic) or assumptions (as in Non-Monotonic Logics), assign a degree of belief to propositions.
Degrees of belief assigned are...
Interpreted as probabilities and manipulated by the laws of probability.
De Finetti‘s Theorem (1931):
It is impossible to act rationally under uncertainty using degrees of belief that violate the axioms of probability.
What does Probability mean? What are the two approaches?
Objectivist (frequentist) approach; Subjectivist (Bayesian) approach.
What is an Objectivist (frequentist) approach?
Probabilities are seen as inherent objective properties of objects, real aspects of the universe; a view of probability based on the Law of Large Numbers (the probability of an event corresponds to its relative frequency over an infinite number of trials). Source of probability numbers: (only) experiments.
What is an example of a frequentist approach?
A frequentist probability of 0.3 for "a patient has a cavity". Interpretation: among 100 examined patients, we should detect a cavity approximately 30 times.
What is a Subjectivist (Bayesian) approach?
Probability quantifies subjective belief in the occurrence of an event and reflects the state of knowledge of an individual (person, agent). It allows any self-consistent ascription of prior probabilities to propositions, but insists on proper Bayesian updating as new evidence arrives.
What is an example of a Subjectivist (Bayesian) approach?
A Bayesian probability of 0.3 for "a patient has a cavity". Interpretation: with 30% confidence I believe that this particular patient belongs to those people who have a cavity.
Example of Degree of Belief
Initial evidence E0: we see a bird. Since there is no evidence that something is wrong with this bird, we believe "this bird is normal" and derive "this bird can fly". New evidence E1: we see the same bird from a new viewing angle. Based on E1 we update the belief "this bird is normal", and in turn update the belief "this bird can fly".
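A minimal sketch of one such belief update via Bayes' rule; the prior and likelihoods are hypothetical numbers chosen only for illustration, not taken from the course.

```python
# Hypothetical single-step Bayesian belief update for "this bird is normal".
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(H | E) from P(H), P(E | H) and P(E | not H)."""
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    return likelihood_if_true * prior / evidence

p_normal = 0.95  # belief before seeing evidence E1 (made-up value)
# E1 (the new viewing angle) is assumed much more likely if the bird is NOT normal.
p_normal = bayes_update(p_normal, likelihood_if_true=0.1, likelihood_if_false=0.9)
print(f"Updated P(bird is normal) = {p_normal:.2f}")  # belief drops from 0.95 to ~0.68
```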
Degree of Belief vs. Assumptions
Assumptions: either true or false; decisions tend to follow naturally from the assumptions made. Pro: less information is needed. Con: derived facts might turn out to be false later. Degree of Belief: continuous in the interval [0, 1], gives more information than assumptions. Decision-making: Decision Theory = Probability Theory + Utility Theory. Con: more information is needed. Pro: degrees of belief do not imply any particular truth of the underlying propositions, so pitfalls can be avoided.
Why should a machine learn?
1. We cannot anticipate all possible future situations 2. Sometimes it is not clear how to program a solution
Supervised Learning: Features and labels are variables that can be of two types:
Numerical (continuous, discrete) or categorical (nominal, ordinal).
Definition
When the label is numerical, the learning problem is called regression. When the label is categorical, the learning problem is called classification.
There are various techniques to evaluate a model against the test set
Train/test split (e.g. 80%/20%); K-fold cross-validation; leave-one-out.
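A minimal sketch of the simplest of these, an 80%/20% train/test split, assuming scikit-learn is available; X and y are placeholder arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)          # 100 samples, 4 numerical features (toy data)
y = np.random.randint(0, 2, 100)    # binary labels

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
```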
What are the typical performance indicators?
Mean squared error for regression; accuracy for classification (correctly classified / total samples).
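Both indicators can be computed directly with NumPy; the arrays below are toy values used only to show the formulas.

```python
import numpy as np

# Regression: mean squared error.
y_true = np.array([3.0, 1.5, 2.0])
y_pred = np.array([2.5, 1.0, 2.5])
mse = np.mean((y_true - y_pred) ** 2)           # 0.25

# Classification: accuracy = correctly classified / total samples.
labels_true = np.array([1, 0, 1, 1])
labels_pred = np.array([1, 0, 0, 1])
accuracy = np.mean(labels_true == labels_pred)  # 0.75
print(mse, accuracy)
```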
K-fold Cross validation
K determines the number of partitions (and iterations). It is a complete evaluation because it uses all available data to evaluate the model. It is more costly than a train/test split because K models need to be trained. Leave-one-out is an extreme case of cross-validation where K equals the size of the dataset.
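A sketch of K-fold cross-validation with K = 5: K models are trained, each evaluated on the fold it did not see. Assumes scikit-learn; the decision tree is an arbitrary placeholder model.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold
print(np.mean(scores))  # average performance over the 5 folds
```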
Bias vs Variance
If a model performs poorly on the training data, we say that the model underfits the data. If a model performs well on the training data but poorly on the test data, the model overfits the data. Simple models (e.g. linear) tend to have high bias, which leads to underfitting (poor performance on training data). Overly complex (or overtrained) models tend to have low bias but high variance, which leads to overfitting (poor performance on test data).
What is a decision tree?
A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.
Decision trees – when to stop?
We can stop when all the leaves are pure, but in general other criteria can be used: max depth, min samples in a split, min samples in a leaf, min impurity decrease.
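For reference, these stopping criteria appear as hyperparameters in scikit-learn's DecisionTreeClassifier (a sketch, assuming scikit-learn; the values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    max_depth=5,                 # max depth
    min_samples_split=10,        # min samples required to split an internal node
    min_samples_leaf=5,          # min samples required in a leaf
    min_impurity_decrease=0.01,  # min impurity decrease required for a split
)
tree.fit(X, y)
print(tree.get_depth())          # actual depth after the stopping criteria kick in
```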
Advantages vs Disadvantages of Decision trees?
Advantages: white-box method, totally transparent; little data preparation needed (can deal with categorical and numerical data, no scaling needed); robust against missing data; non-linearity. Disadvantages: high-variance algorithm (small changes in the input data can lead to very different trees, prone to overfitting); can have problems with unbalanced data; poorly suited for regression; greedy algorithm.
Random forest?
It uses bagging (bootstrap aggregating) to create different trees from the same dataset and then aggregates the results. Random forests combine the simplicity of decision trees with flexibility, resulting in a vast improvement in accuracy.
Random forest - Training
The first step is creating a bootstrapped dataset from the original data. A bootstrapped dataset is a new dataset of the same size as the original, where the samples are randomly selected from the original dataset. IMPORTANT: repetitions in the random selection are allowed (sampling with replacement).
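A minimal sketch of building one bootstrapped dataset with NumPy: sample n indices with replacement from a dataset of size n.

```python
import numpy as np

X = np.random.rand(10, 3)                # toy dataset: 10 samples, 3 features
y = np.random.randint(0, 2, 10)

idx = np.random.choice(len(X), size=len(X), replace=True)  # repetitions allowed
X_boot, y_boot = X[idx], y[idx]          # same size as the original dataset
print(sorted(idx))                       # some indices appear twice, some not at all
```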
Random forest - Training
We build a tree 1) using the bootstrapped dataset and 2) only considering a random subset of variables at each step.
Random forest - Evaluation
Repeat the process until the desired number of trees is reached. Bootstrapping and random feature selection ensure variation among the created trees. To evaluate the forest, we run each sample through each generated tree and look at the results. The sample is classified based on the class selected by the majority of the trees.
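The same idea is packaged in scikit-learn's RandomForestClassifier (a sketch, assuming scikit-learn and its synthetic-data helper): n_estimators is the desired number of trees and max_features the size of the random feature subset considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)                 # each tree sees its own bootstrapped dataset
print(forest.predict(X[:3]))     # each prediction is a majority vote over the trees
```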
Advantages vs Disadvantages of Random forests
Advantages: generally more accurate than single trees; the process can be easily parallelized; they also provide additional information, like a ranking of the most influential features (useful in unsupervised learning); no need for data preprocessing (like trees). Disadvantages: they sacrifice readability; can be computationally intensive.
K-Nearest Neighbors
The K-Nearest Neighbors (KNN) algorithm is another non-parametric method. It relies on the idea that similar data points tend to have similar labels or values. During the training phase, the KNN algorithm stores the entire training dataset as a reference. When making predictions, it calculates the distance between the input data point and all the training examples, using a chosen distance metric such as Euclidean distance. The algorithm assigns the most common class label among the K nearest neighbors as the predicted label for the input data point (for regression, a weighted average of the neighbors' values is used).
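A minimal from-scratch sketch of that procedure: Euclidean distance to all stored training points, then a majority vote among the K nearest. The toy arrays are made up for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)             # Euclidean distances
    nearest = np.argsort(dists)[:k]                          # indices of the K nearest
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority class label

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # -> 1
```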
Advantages vs Disadvantages of K-Nearest Neighbors
Advantages: white-box model; no training needed (only parameter tuning); non-linearity; only one parameter to tune (K). Disadvantages: can be slow with large datasets; scaling is necessary for the distance measure to work properly; sensitive to outliers; curse of dimensionality.
Types of Interaction input-output
Linear methods: every input contributes separately to the output. Decision tree: there is some interaction between input variables, but limited. Deep NN: long computational paths, lots of interactions between input variables.
Brains
10^11 neurons of > 20 types, 10^14 synapses, 1 ms–10 ms cycle time. Signals are noisy "spike trains" of electrical potential.
McCulloch–Pitts “unit”
a_i ← g(in_i) = g(Σ_j W_{j,i} a_j)
Activation functions
Examples: the step (threshold) function; the rectified linear function ReLU(x) = max(0, x). The smooth (everywhere-differentiable) version of ReLU is called softplus: softplus(x) = log(1 + e^x). Changing the bias weight W_{0,i} moves the threshold location.
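A small NumPy sketch of a single unit a_i = g(Σ_j W_{j,i} a_j) together with the activation functions above; the weights and inputs are arbitrary toy values.

```python
import numpy as np

def step(x):     return np.where(x >= 0, 1.0, 0.0)   # threshold function
def relu(x):     return np.maximum(0.0, x)           # max(0, x)
def softplus(x): return np.log(1.0 + np.exp(x))      # smooth version of ReLU

a = np.array([1.0, 0.5, -0.3])    # outputs a_j of the incoming units
W = np.array([0.2, -0.4, 0.7])    # weights W_{j,i} on the incoming links
in_i = W @ a                      # weighted sum of inputs
print(step(in_i), relu(in_i), softplus(in_i))
```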
Expressiveness of perceptrons
Can represent AND, OR, NOT, majority, etc., but not XOR. A perceptron represents a linear separator in input space.
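An illustrative check (not a proof) of why XOR is out of reach: a brute-force search over a finite grid of weights and bias finds a single threshold unit for AND but none for XOR, since XOR is not linearly separable.

```python
import numpy as np
from itertools import product

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]

def representable(target):
    grid = np.linspace(-2, 2, 41)                 # candidate weights/bias in steps of 0.1
    for w1, w2, b in product(grid, repeat=3):
        out = [int(w1 * x1 + w2 * x2 + b >= 0) for x1, x2 in inputs]
        if out == target:
            return True
    return False

print(representable(AND))  # True:  a linear separator exists
print(representable(XOR))  # False: no linear separator found on this grid
```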
Multi-layer perceptrons
1) A single perceptron cannot represent complex non-linear relations. 2) Neural networks are also known as multi-layer perceptrons. 3) When every node of a layer is connected to all nodes of the next layer, this is called a fully connected neural network. 4) When a network has connections only in one direction, from input to output, it is called a feed-forward neural network.
Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?). Perceptrons (one-layer networks) are insufficiently expressive. Multi-layer networks are sufficiently expressive and can be trained by gradient descent, i.e., error back-propagation. Many applications: speech, driving, handwriting, fraud detection, etc. The engineering, cognitive modelling, and neural system modelling subfields have largely diverged.
How does convolution work?
Repeated application of a filter (kernel) on a sliding window. The stride indicates how much the kernel shifts at each step (e.g. a stride of 1). Kernels can be of various shapes, e.g. 3x3x1. The same kernel is applied to every image. The values inside each kernel are learned with backprop. Because we compute the dot product between the input and the filter, we say that the filter is convolved with the input.
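A minimal sketch of sliding a 3x3 kernel over a 2D input with stride 1 (technically cross-correlation, as commonly implemented in deep-learning libraries); the image and kernel are random placeholders, whereas in a CNN the kernel values are learned.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)   # dot product of window and filter
    return out

image  = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)
print(convolve2d(image, kernel).shape)            # (4, 4) feature map
```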
Activation function and Pooling
An activation function (ReLU) is applied to the feature map. A (max) pooling layer groups together a selection of pixels and selects the max. This identifies the area of the image where the filter found the best match, making the network robust against small shifts in the image and reducing the size of the input while keeping the important information.
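A sketch of ReLU followed by 2x2 max pooling on a toy feature map.

```python
import numpy as np

def max_pool(feature_map, size=2):
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            out[i // size, j // size] = feature_map[i:i+size, j:j+size].max()
    return out

fmap = np.random.randn(4, 4)
activated = np.maximum(0.0, fmap)   # ReLU
pooled = max_pool(activated)        # keep the strongest response in each 2x2 block
print(pooled.shape)                 # (2, 2): smaller input, important info kept
```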
Generative Adversarial Networks (GAN) - Introduction
CNNs are well suited for image classification, but is it possible to exploit them to create an image generator? GANs are implemented as a min-max two-player game between two systems: a Generator G and a Discriminator D. In adversarial training, the goal of the generator is to create images that seem as real as possible; the goal of the discriminator is to detect which images are real and which are generated. They are trained in cycles. At the end, the discriminator can be thrown away and we have a generator that can create seemingly real images from random noise.
GAN – why does it work?
It exploits the loss function of the discriminator to train the generator. The generator never interacts directly with the dataset, only through the discriminator; this minimizes cases where the generator "cheats" by simply copying images from the dataset. This technique can be used with any machine learning algorithm, not only for image generation (e.g. cheap chatbots).
Transformer - Architecture
Transformers are the basic architecture used in NLP (chatbots, translators). Contrary to LSTMs, they do not work sequentially, which allows high parallelization. They need positional encoding to specify the position of each word. They use multiple attention layers to keep track of important information across sentences. The output sentence is produced word by word: the output of the network is a probability distribution across all words in the dictionary, used to predict the next word in the sentence. The process stops only when <EOS> is predicted.
Transformer - Vector embeddings
Key idea: similar words should have similar representation vectors. Words (or images) are mapped into a multi-dimensional space called the embedding space, or latent space; in this space, similar concepts are close to each other. For example, the words "cat" and "dog" are often used in similar contexts, therefore they are close in the latent space. The word "car", while very similar to the word "cat" in spelling, is used in different contexts, and is therefore far away in the latent space.
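A sketch of "close in the latent space" using cosine similarity; the 3-dimensional vectors below are made-up toy embeddings, not learned ones.

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

emb = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}
print(cosine(emb["cat"], emb["dog"]))  # high: used in similar contexts
print(cosine(emb["cat"], emb["car"]))  # low: similar spelling, different contexts
```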
Transformer - Positional encoding
Key idea: add a new vector containing positional information to the current vectors. Checklist: 1) unique encoding for each time-step; 2) consistent distance between any two time-steps; 3) should generalize to longer sentences; 4) deterministic.
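One common scheme that meets this checklist is the sinusoidal positional encoding from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); a NumPy sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # time-steps 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions
    return pe                                    # added to the word embeddings

print(positional_encoding(seq_len=10, d_model=16).shape)  # (10, 16)
```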
Transformer - Single-Head attention
Key idea: Calculate attention using different copies of the same input.
Query Key Value - Intuition
Queries, keys and values: the query indicates what you are interested in (e.g. the name of a person); keys are pointers to different concepts (name, height, age); the value is the actual content (e.g. the name itself). Based on the query, the keys select the values that are most related to the query.
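A minimal NumPy sketch of single-head scaled dot-product attention, the standard way this query/key/value matching is computed; the projection matrices Wq, Wk, Wv are random placeholders (in a real model they are learned).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how well each query matches each key
    weights = softmax(scores, axis=-1)  # attention weights (rows sum to 1)
    return weights @ V                  # weighted combination of the values

n, d = 4, 8                             # 4 tokens, dimension 8
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Z = attention(X @ Wq, X @ Wk, X @ Wv)   # Q, K, V are projections of the same input
print(Z.shape)                          # (4, 8)
```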
Transformer - Multi-Head Attention
1) Concatenate all the attention heads. 2) Multiply with a weight matrix W^o that was trained jointly with the model. 3) The result is the Z matrix that captures information from all the attention heads; we can send this forward to the FFNN.
Transformer - Add & Norm
Add & Norm: layer normalization is applied to the sum of the output of the previous layer (from the attention block) and its input embedding (from the first step), i.e. a residual connection. Benefits: faster training, reduced bias, prevents weight explosion. Types of normalization: batch and layer normalization; layer normalization is preferable for transformers, especially for natural language processing tasks.
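A sketch of the Add & Norm step with NumPy: a residual connection followed by layer normalization over each token's feature vector; the learnable scale and shift parameters are omitted for brevity, and the inputs are random placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)       # learnable scale/shift omitted

x = np.random.randn(4, 8)                 # input to the attention block
sublayer_out = np.random.randn(4, 8)      # output of the attention block
out = layer_norm(x + sublayer_out)        # Add (residual) & Norm
print(out.shape)                          # (4, 8)
```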
Transformer - Feed forward
A standard feed-forward, fully connected NN with two layers. It helps to format the input for the decoder layer; its non-linear layers can capture more abstract concepts from the input data and pass them forward.
Conclusion
CNNs revolutionized the image classification field thanks to smart architecture choices that allow them to exploit the spatial relations of images. The success of GANs as image generators builds upon the success of CNNs: this architecture made it possible to train a generator by exploiting the loss function of the (CNN) discriminator. Transformers use complex architectural tricks in order to encode the sentence structure while using a non-sequential algorithm (at least for the training).
What is clustering?
A smart way of grouping items/examples/instances. Items are described by numerical features; feature extraction and scaling are needed before clustering. It is unsupervised learning: no labeled data as in classification. Application areas: image processing, data mining / big data, information retrieval (documents/texts).
Types of clustering
Hierarchical methods; partitioning methods. Hard clustering: each item belongs to exactly one cluster. Soft clustering: each item belongs to some/all clusters with a certain probability.
Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree • Can be visualized as a dendrogram – A tree-like diagram that records the sequences of merges or splits
Strengths of Hierarchical Clustering
No assumptions on the number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
Types of Hierarchical Clustering
Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left. Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
Complexity of hierarchical clustering
• Distance matrix is used for deciding which clusters to merge/split • At least quadratic in the number of data points
Agglomerative clustering algorithm
The most popular hierarchical clustering technique. Basic algorithm: 1. compute the distance matrix between the input data points; 2. let each data point be a cluster; 3. repeat: 4. merge the two closest clusters; 5. update the distance matrix; 6. until only a single cluster remains. The key operation is the computation of the distance between two clusters; different definitions of the distance between clusters lead to different algorithms.
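A sketch of agglomerative clustering with SciPy (assumed available): linkage() repeatedly merges the two closest clusters and records the merges (the dendrogram), and fcluster() 'cuts' the result into a desired number of clusters. Single linkage is just one choice of cluster distance; the two-blob data is made up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 5])  # two toy blobs
Z = linkage(X, method="single")                   # "single" = distance between closest points
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the dendrogram into 2 clusters
print(labels)
```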