Data Team
The total time taken is recorded by the system.
Please enter below | |
---|---|
Name: | |
Email: | |
Question 2a - Automatic Matching (~20min)
In order to build a machine that is automatically able to decide which results should be 1s (matched) and 0s (non-matched), we defined specific features to compare the data.
One of the main features compares the subject name (e.g. Edward Lampert) with the name found in the result ('resultName').
In the following Excel sheet (file here: nameComparingFeature) you can see the name comparison and its associated similarity score for different results.
(For every result you can also find the manual analyst feedback on the result, but remember that name comparison is only one of the factors an analyst takes into consideration when labeling a result.)
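To make the idea of a name-comparison feature concrete, here is a minimal sketch of one possible similarity score. It is an illustration only and is not the method used to produce 'Feature Score' in the sheet: it uses the standard-library `difflib.SequenceMatcher` ratio, and the helper names `normalize` and `name_similarity` are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def name_similarity(subject_name: str, result_name: str) -> float:
    """Return a character-level similarity ratio in [0, 1] for two names.

    Assumption: a plain sequence ratio is used here for illustration;
    the sheet's actual feature may be computed differently.
    """
    return SequenceMatcher(None, normalize(subject_name), normalize(result_name)).ratio()
```

A score like this treats names as plain character strings, so it does not know about nicknames, initials, or swapped first and last names.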
2a.
What is the method (the function) that generates the similarity score ('Feature Score') in the Excel sheet? You can describe the calculation in your own words, or write a formula.
A good comparison method will mimic the work methodology of an Analyst in the best manner.
(Think how your brain works when looking at the comparison of personal names and try to translate it into a fun
2b.
What are the disadvantages of the method used to calculate the score in the Excel sheet?
In the next slide you can re-upload the excel file with your method suggestion score(s), but it's not a must. You can describe here your method and its fun
You can re-upload the excel file with your method suggestion score(s), but it's not a must.
Question 3 – Understanding which information is obtainable through each source (~20min -> 10min per scenario)
Background
Assume you want to collect information on an individual using the automated machine.
Assume that the machine is able to keep only the following details about a person:
- Full name
- Date of birth
- Address
- Employment
- SSN
The machine can use those details for two purposes:
- Search the sources and collect results.
- Decide whether a result is 1 or 0.
The machine works in the following manner ('run process'):
1. It receives an input from the user and saves it.
2. It searches a source using what it knows so far about the person (if it doesn't have the minimum input for the source, no results will be received).
3. It decides whether each result is 1 or 0 based on what the machine knows so far about the person.
4. If results that received 1 contain unknown details about the person, the machine keeps the new details and uses them for future searches and scoring.
5. The machine repeats stages 2-4 until it has gone over all of the sources. Each source can be searched only once!
Each source has a minimum input that the machine must already know in order to get results. Without this input, the data source cannot be used:
- Twitter - full name
- Person - SSN OR (full name + employment)
- News - full name
- Officer - employment + (full name OR address)
- Contributions - full name
- Legal - full name + address
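The run process and the minimum inputs above can be sketched as a small simulation. This is a rough sketch under stated assumptions, not the real machine: the `yields` argument is hypothetical (the details each source's matched results would reveal are not given here), and the 1/0 scoring step is not modeled.

```python
from typing import Dict, Iterable, List, Set

# Minimum inputs per source, transcribed from the list above.
# Each source maps to alternative sets of details; knowing any one full set is enough.
MIN_INPUTS: Dict[str, List[Set[str]]] = {
    "Twitter": [{"full name"}],
    "Person": [{"SSN"}, {"full name", "employment"}],
    "News": [{"full name"}],
    "Officer": [{"employment", "full name"}, {"employment", "address"}],
    "Contributions": [{"full name"}],
    "Legal": [{"full name", "address"}],
}


def can_search(source: str, known: Set[str]) -> bool:
    """True if the known details cover at least one minimum-input option."""
    return any(required <= known for required in MIN_INPUTS[source])


def run_process(initial: Iterable[str], order: List[str],
                yields: Dict[str, Set[str]]) -> List[str]:
    """Visit each source once, in order; return the sources that produced results.

    `yields` is a hypothetical mapping from a source to the new details its
    matched results are assumed to reveal. It is invented for illustration.
    """
    known = set(initial)
    searched = []
    for source in order:
        if can_search(source, known):
            searched.append(source)
            known |= yields.get(source, set())
        # Otherwise the single attempt at this source is used up with no results.
    return searched
```

For example, starting from only a full name and pretending that Twitter results reveal employment, `run_process({"full name"}, ["Person", "Twitter"], {"Twitter": {"employment"}})` returns `["Twitter"]`, while the order `["Twitter", "Person"]` returns both sources.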
You are the designer of the machine, so you can decide in what order it will search the sources.
e.g.: 1. Twitter, 2. Person, 3. Legal, etc.
Your objective is to get as many correct results as possible in each run of the machine (stages 1-5 of the 'run process'). The order of the sources should depend on the user input, which might be different in each run.
The sources you have at your disposal are the same as in the attached Edward Lampert excel document (Person, News, Officer, Twitter, Contributions and Legal).
Your Task
In each of the following 3 cases, your machine will receive a different input.
For each of the following inputs (i.e. the only information you initially received about the target), explain in what order you would run the sources.
Order | |
---|---|
Twitter (min input: full name) | |
Person (min input: SSN OR (full name + employment)) | |
News (min input: full name) | |
Officer (min input: employment + (full name OR address)) | |
Contributions (min input: full name) | |
Legal (min input: full name + address) | |
Each question has only one correct answer.
Bill traveled for 3 hours.
The next slide will start a 90min data analysis task. You can use your favourite tool to answer it (Excel, Python, R, SAS, etc.).
Move next when you are ready.
In the following zip, there are 3 csv files. Each file represents a different model (A, B, C) that ran at a different time.
The data collected for each one is:
- DataSourceId - The data source which the system pulled results from
- subjectId - The subject Identifier which the system was looking for
- score - The system's automatic match decision (1 for records that were matched, 0 otherwise)
- label - An analyst's manual label. The analyst checked each record and marked 1 if the record mentioned the subject, and 0 otherwise.
Your goal is to compare the models' performance in order to decide which model performed best.
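Since each record carries both an automatic score and an analyst label, one common way to summarize a model is through confusion-matrix counts. The sketch below is only one possible approach and not a prescribed solution: the choice of precision, recall, and F1 is an assumption, and the function name `evaluate` is hypothetical. Only the column names `score` and `label` come from the description above.

```python
from typing import Any, Dict, Iterable, Mapping


def evaluate(rows: Iterable[Mapping[str, Any]]) -> Dict[str, float]:
    """Compare the automatic `score` with the analyst `label` for one model.

    Each row needs `score` and `label` values of 0 or 1 (numbers or strings,
    as csv.DictReader would produce). The metric choice is an assumption.
    """
    tp = fp = fn = tn = 0
    for row in rows:
        score = int(float(row["score"]))
        label = int(float(row["label"]))
        if score == 1 and label == 1:
            tp += 1  # matched by the system and confirmed by the analyst
        elif score == 1 and label == 0:
            fp += 1  # matched by the system, rejected by the analyst
        elif score == 0 and label == 1:
            fn += 1  # missed by the system, confirmed by the analyst
        else:
            tn += 1  # rejected by both
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}
```

Running the same function on each model's rows gives directly comparable numbers. Because the models ran at different times, it may also be worth checking whether they saw the same subjects and data sources before comparing them.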
Download link: compare_models_file
1. Briefly describe the steps you took to measure and compare the models' performance.
2. Which model did you find to be the best one? Why?
In the next slide you can upload an Excel or R/Python script to show your process and/or figures. You can also upload a zip file if you want to share multiple files.
Please move to the next slide only when you've finished the task.
You can upload an Excel file or a zip file with an R/Python script to show your process and/or figures. If you want to share multiple files, please add them all to a single zip file.
Please do not upload a script file directly (.py/.R/etc).
Script files can be uploaded only as part of a zip/rar file.
Please move to the next slide only when you've finished the task.
Once you press Finish below, your test will be submitted. Good luck!