Day 12

Today

For next time

Text Analysis vs Text Analytics

Distinguishing between text analysis and text analytics will be helpful. Here is a fun scenario.

In MP3, it’s possible to use text analysis to present qualitative stories of data, to use text analytics to present quantitative information about data, or to do both. You can use Python packages to have your scripts analyze text in many ways, such as flagging keywords within a body of text that the program classifies as “aggressive” or some other term you specify. An analytics approach might present a graph to show how many students in a given class demonstrated patterns of aggression in public social media posts they made.
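The distinction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword set, the function names, and the sample posts are all hypothetical and are not part of MP3 or any particular package. The first function does the "analysis" step (flagging words in one text), and the second does the "analytics" step (producing the number a graph might display).

```python
import re

# Hypothetical keyword list; a real project would need a carefully chosen
# (and carefully questioned) definition of "aggressive".
AGGRESSIVE_KEYWORDS = {"hate", "destroy", "crush"}


def flag_keywords(text, keywords=AGGRESSIVE_KEYWORDS):
    """Text analysis: return the flagged words found in one body of text."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word in keywords]


def count_flagged_students(posts_by_student, keywords=AGGRESSIVE_KEYWORDS):
    """Text analytics: count students with at least one flagged post."""
    return sum(
        1
        for posts in posts_by_student.values()
        if any(flag_keywords(post, keywords) for post in posts)
    )


# Made-up sample data for illustration only.
sample_posts = {
    "ana": ["We will crush the exam"],
    "ben": ["Nice weather today"],
}
print(flag_keywords(sample_posts["ana"][0]))
print(count_flagged_students(sample_posts))
```

Note that "crush the exam" is flagged even though it is not aggressive in any meaningful sense, which is the kind of limitation worth discussing in a reflection.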

What if…? Imagine a scenario where we, as professors, mined popular social media sites to collect a corpus of each student’s public posts. Then, we examined the number of exclamation points used per 100 characters written. We then set a threshold (e.g., >12) to classify students with a higher number of exclamation points as “lacking professionalism,” and reduced those students’ “professionalism” scores for this course.

Are there ethical issues with us making use of this public data in this way?

Is there ambiguity in how exclamation points are used (in English-based social media posts)? Can we write Python programs that can do a decent enough job telling the difference (if your grade depended on it)? What else would we need to consider?
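To make the scenario concrete, here is a minimal sketch of the metric it describes: exclamation points per 100 characters, compared against a threshold of 12. The function names and sample posts are hypothetical, and joining a student's posts with spaces is an assumption about how the corpus would be assembled.

```python
def exclamations_per_100_chars(text):
    """Return the number of '!' characters per 100 characters of text."""
    if not text:
        return 0.0
    return 100 * text.count("!") / len(text)


def flag_students(posts_by_student, threshold=12):
    """Return the students whose combined posts exceed the threshold."""
    flagged = []
    for student, posts in posts_by_student.items():
        corpus = " ".join(posts)  # assumption: posts are simply concatenated
        if exclamations_per_100_chars(corpus) > threshold:
            flagged.append(student)
    return flagged


# Made-up sample data for illustration only.
sample_posts = {
    "ana": ["Happy birthday!!!"],
    "ben": ["See you at the meeting tomorrow."],
}
print(flag_students(sample_posts))
```

A short birthday greeting is enough to cross the threshold here, which suggests how poorly a raw punctuation count separates enthusiasm from unprofessionalism.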

Ethics in text mining and analysis

Some of the key points from RJ11 are summarized here:

Personally Identifiable Information protection principles

| Principle | PII protection rationale | Incompatibility with “Big Data” analysis |
| --- | --- | --- |
| Collection Limitation | There should be limits to the collection of PII; it should be obtained lawfully and fairly and, ideally, with the knowledge/consent of the data subject. | The larger the data collection, the better the potential for identifying interesting correlations. |
| Data Quality | PII should be relevant to the purposes for which it is to be used, and should be accurate, complete and up-to-date enough for those purposes. | “Messy data” is fine; it’s not clear what is relevant until it’s analysed, and even inaccurate or incomplete data can be useful. |
| Purpose Specification | Purposes for which PII are collected should be specified at the time of data collection. Subsequent use should be limited to those purposes or such others compatible with those purposes and specified on each change of purpose. | Data may have been collected for a particular purpose, but analysis may indicate further unrelated and previously unknown, but valuable, purposes. Data as collected may not be obviously PII, but analysis of it may identify individuals. |
| Use Limitation | PII should not be disclosed, made available or otherwise used for unspecified purposes except with data subject consent or by authority of law. | There may be value in sharing and aggregating data that may not be apparent at the time of collection. |
| Security Safeguards | PII should be protected by reasonable security safeguards against such risks as loss or unauthorised access, destruction, use, modification or disclosure of data. | It may be unclear what security issues, if any, arise from a particular collection of data or its analysis. |
| Openness | Data subjects should be able to establish the existence and nature of PII, the main purposes of its use, and the identity and location of the data controller. | Where data is collected and analysed, it may not be obvious that it is PII, and even in circumstances where it is, the researcher may have no way of informing the data subject of its use. |
| Individual Participation | An individual should be able to be informed by a data controller whether it holds PII relating to him or her; to have the PII communicated to him or her in meaningful form and reasonable time and at reasonable cost; to be informed if the PII will not be communicated and to be able to challenge that denial; and, where the PII is not lawfully held, to have it erased, rectified, completed or amended. | Data that is anonymous may still be utilised in ways that can cause risk/harm to an individual. |
| Accountability | A data controller should be accountable for complying with measures that give effect to the other principles. | How and when might a researcher be held accountable, and for what? |

Failure modes of data analysis

Alignment and bias from SoftDes past

Let’s examine a reflection write-up from a previous semester’s MP3 that has been adapted to serve as a prompt for building up numerous “ask-analyze-assess” alignment possibilities. We will do so today with an eye toward biases that could have played a role in the data and algorithms (or the people who generated them) used by the systems that this project incorporated.

Work with one or two people near you as you read the reflection below and do exercises 1 through 5 related to alignment. Examining an already-completed MP3 can give you practice considering limitations and biases in complex systems before you reflect upon your own assignment. The exercise also enables instructors to introduce some considerations that might be less obvious.


Text caching class example

This is an activity for students who were unable to make it to class for the in-class design challenge below.

We’ve created an example program to demonstrate a) caching text data as local files, and b) the utility of custom classes.

You are free to build on this program in your MP3, but as with all code you didn’t write, you must make sure you understand how it works (and ask questions if you don’t). Since this code was provided by course staff, you don’t need to cite its source.

Exercise: Try adding a lines method to the class that returns a list of all the individual lines in the text file, so that you can use it to write code like:

example = Text(my_url)
for line in example.lines():
    do_something(line)
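For readers without the example program at hand, here is a rough sketch of what a caching Text class could look like. It is not the course staff's program: the cache directory name, the hashed file names, the fetch_url helper, and the fetch parameter are all assumptions made for illustration. It also shows one possible shape for the lines method from the exercise.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def fetch_url(url):
    """Download a URL and return its body decoded as UTF-8 text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


class Text:
    """Text from a URL, cached in a local file after the first download."""

    def __init__(self, url, cache_dir="text_cache", fetch=fetch_url):
        self.url = url
        # Hash the URL to get a stable, filesystem-safe cache file name.
        name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".txt"
        self.path = Path(cache_dir) / name
        if self.path.exists():
            # Cache hit: read the local copy instead of downloading again.
            self.contents = self.path.read_text(encoding="utf-8")
        else:
            # Cache miss: download once, then save for next time.
            self.contents = fetch(url)
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(self.contents, encoding="utf-8")

    def lines(self):
        """Return a list of the individual lines in the text."""
        return self.contents.splitlines()
```

Passing the download function in as a parameter is a design choice that makes the class easy to try out without a network connection, since any function that takes a URL and returns a string can stand in for fetch_url.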

Continuing our in-studio challenge pilot

Today, we continue piloting a new-for-Spring-2020 interaction style for class. We noted our reasons for trying a new approach to instruction in select studio sessions in Day 11’s notes.

A Context for today’s challenge

The find command in Linux is a program written in C that walks through files starting at a certain folder and prints all files that match a certain pattern. For example, you can find all files that end in .txt. For this exercise, you will implement a simpler version of this command in Python.
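As a rough picture of the behavior being described, the sketch below walks a directory tree and collects files whose names match a pattern such as *.txt. It uses the standard library's os.walk and fnmatch. The function name find_files is hypothetical, and this is not the interface of the starter files for the challenge.

```python
import fnmatch
import os


def find_files(base_dir, pattern):
    """Return sorted paths of files under base_dir whose names match pattern."""
    matches = []
    # os.walk visits base_dir and every folder beneath it.
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in filenames:
            # fnmatch compares a file name against a shell-style pattern.
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Print every .txt file at or below the current directory.
    for path in find_files(".", "*.txt"):
        print(path)
```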

Setting up working groups

Today’s tasks

Get the starter files from this link.

For everyone, your job is to implement the class FileFinder in file_finder.py by filling in the bodies of the specified functions. Do not edit the other file. You can run the program with python3 find_file.py <base_dir> <file_name>, where <base_dir> is the name of the directory to start searching in and <file_name> is the name of the file to find. You can add --num-trials 10 to repeat the process 10 times.

Then, in your groups, do the following:

Working through tasks for 30 minutes

Quick reflection

The students give us feedback verbally and through the Tell Us Tuesdays forms that Course Assistants made available.