Day 12
Today
- Text analysis, text analytics, and ethics
- Alignment and bias example text from a former SoftDes student
- Exercises for exploring classes and files
- Additional in-studio Design Challenges
For next time
- Please fill out the MP3 mid-point survey so we can collect general information about the data sets and tools people are using, which will let us provide better support.
- Final Reminder: Please complete Reflection and Teaming Survey 2 if you have not already.
Text Analysis vs Text Analytics
Distinguishing between text analysis and text analytics will be helpful, and the "What if…?" scenario below gives us something concrete to argue about.
In MP3, it’s possible to use text analysis to present qualitative stories of data. One can also use text analytics to present quantitative information about data. Or do both. You can use Python packages to have your scripts analyze text in many ways, such as flagging keywords within a body of text that the program classifies as “aggressive” or some other term you specify. An analytics approach might present a graph to show how many students in a given class demonstrated patterns of aggression in public social media posts they made.
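As a toy illustration of the distinction (the posts and the keyword list below are entirely made up), an analysis step might surface the flagged passages themselves for a human to read, while an analytics step reduces the same texts to numbers you could graph:

```python
# Toy illustration only: the posts and the keyword list are invented.
posts = [
    "I can't BELIEVE the grader did that!!!",
    "Office hours were super helpful today.",
    "Fight me on this: tabs are better than spaces!!",
]
flag_words = {"fight", "believe"}  # words we (arbitrarily) decide to flag

# "Analysis": pull out the passages themselves so a human can read them in context.
flagged = [post for post in posts if any(word in post.lower() for word in flag_words)]
for post in flagged:
    print("flagged:", post)

# "Analytics": reduce the same texts to counts that could feed a graph.
print(f"{len(flagged)} of {len(posts)} posts contained a flagged word")
```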
What if…? Imagine a scenario where we, as professors, mined popular social media sites to collect a corpus of each student’s public posts. Then, we examined the number of exclamation points used per 100 characters written. We then set a threshold (e.g., >12) to classify students with a higher number of exclamation points as “lacking professionalism,” and reduced those students’ “professionalism” scores for this course.
Are there ethical issues with us making use of this public data in this way?
Is there ambiguity in how exclamation points are used (in English-based social media posts)? Can we write Python programs that can do a decent enough job telling the difference (if your grade depended on it)? What else would we need to consider?
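To make the hypothetical exclamation-point metric concrete, here is a minimal sketch; the threshold of 12 comes from the scenario above, and everything else (the example text, the function name) is invented:

```python
def exclamations_per_100_chars(text):
    """Count exclamation points per 100 characters of text."""
    if not text:
        return 0.0
    return 100 * text.count("!") / len(text)


# Invented example; in the scenario this would be a corpus of a student's public posts.
corpus = "Our project demo is today!!! So hyped!!! Come by!!!"
score = exclamations_per_100_chars(corpus)

THRESHOLD = 12  # the (arbitrary) cutoff from the scenario above
print(f"{score:.1f} exclamation points per 100 characters")
if score > THRESHOLD:
    print("classified as 'lacking professionalism' -- is that fair?")
```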
Ethics in text mining and analysis
Some of the key points from RJ11 are summarized here:
Personally Identifiable Information protection principles
| Principle | PII protection rationale | Incompatibility with “Big Data” analysis |
|---|---|---|
| Collection Limitation | There should be limits to the collection of PII; it should be obtained lawfully and fairly and, ideally, with the knowledge/consent of the data subject | The larger the data collection, the better the potential for identifying interesting correlations |
| Data Quality | PII should be relevant to the purposes for which it is to be used, and should be accurate, complete and up-to-date enough for those purposes | “Messy data” is fine; it’s not clear what is relevant until it is analysed, and even inaccurate or incomplete data can be useful |
| Purpose Specification | The purposes for which PII are collected should be specified at the time of data collection. Subsequent use should be limited to those purposes or such others compatible with those purposes and specified on each change of purpose. | Data may have been collected for a particular purpose, but analysis may indicate further unrelated and previously unknown, but valuable, purposes. Data as collected may not be obviously PII, but analysis of it may identify individuals |
| Use Limitation | PII should not be disclosed, made available or otherwise used for unspecified purposes except with data subject consent or by authority of law. | There may be value in sharing and aggregating data that may not be apparent at the time of collection |
| Security Safeguards | PII should be protected by reasonable security safeguards against such risks as loss or unauthorised access, destruction, use, modification or disclosure of data. | It may be unclear what security issues, if any, arise from a particular collection of data or its analysis |
| Openness | Data subjects should be able to establish the existence and nature of PII, and the main purposes of its use, and the identity and location of the data controller. | Where data is collected and analysed, it may not be obvious that it is PII, and even in circumstances where it is, the researcher may have no way of informing the data subject of its use |
| Individual Participation | An individual should be able to be informed by a data controller whether it holds PII relating to him or her; to have the PII communicated to him or her in meaningful form, within a reasonable time and at reasonable cost; to be informed if the PII will not be communicated and to be able to challenge that denial; and, where the PII is not lawfully held, to have it erased, rectified, completed or amended. | Data that is anonymous may still be utilised in ways that can cause risk/harm to an individual |
| Accountability | A data controller should be accountable for complying with measures that give effect to the other principles | How and when might a researcher be held accountable, and for what? |
Failure modes of data analysis
- overfitting
- selection bias
- cherry-picking
- confirmation bias
Alignment and bias from SoftDes past
Let’s examine a reflection write-up from a previous semester’s MP3 that has been adapted to serve as a prompt, allowing us to build up numerous “ask-analyze-assess” alignment possibilities. We will do so today with an eye toward biases that could have played a role in the data and algorithms used by the systems this project incorporated (or in the people who generated them).
Work with one or two people near you as you read the reflection below and do exercises 1 through 5 related to alignment. Examining an already-completed MP3 can give you practice considering limitations and biases in complex systems before you reflect upon your own assignment. The exercise also enables instructors to introduce some considerations that might be less obvious.
- What the creator was asking:
- For this project, I created a program that asks the user for a music artist, scrapes the lyrics of their songs, and performs a sentiment analysis on them.
- Which types of data were analyzed:
- I used Google and lyricsfreak.com as data sources. Once I parsed through the HTML and gathered the lyrics of each song, I used a sentiment analyzer to gauge the overall mood of their songs. I hoped to get more comfortable using Python to interact with information on the web and use an appropriate computational analysis on it. Additionally, I wanted to create a visual representation of artists’ lyrics to compare and contrast in hopes of finding trends. Once the user enters an artist’s name, my program performs a Google search with those terms and finds the artist’s song list on the LyricsFreak website. I looked at different lyric websites, but I chose this one because the body of the lyrics was easy to identify and easily accessible using Beautiful Soup. From the artist’s page, I extracted the URLs of each song and went to each individual page to scrape the lyrics. Keeping the lyrics from each song separate, I used the NLTK sentiment intensity analyzer to find the polarity scores of each song. Finally, using matplotlib, I plotted each song by the selected artist on a scatter plot using the negative and positive scores from the analysis.
- How the tools selected helped answer the question:
- Graphs for artists with a lot of songs were difficult to read. I tried for a long time to figure out how to get the labels to appear when they were hovered over, but I was unsuccessful and could not find a way to interact with the figures I generated. Instead, I randomly labeled a few of the songs so that the viewer could identify specific songs to make sense of the plot but still be able to observe any trends. The final representation of my information is a scatter plot with every song’s positive and negative sentiment values for a particular artist. One thing that surprised me was that the songs turned out to be a lot more neutral than I had expected. I tend to think songs written by an artist have similar tones, but judging by the analysis I performed, I don’t think I can fully support that hypothesis. However, there were slight trends; for example, Adele’s songs were more negative than positive for the most part, but generally the points were pretty scattered. Also, I thought it was interesting to look at the songs that were very close to either axis. This suggests that they were almost entirely negative or positive and more skewed to one side or the other.
- How the student felt that the questions they asked aligned with the data sources they chose and the tools they used to analyze them and present their work:
- [Do the exercises below before viewing the author’s remarks in this section.]
- How the creator felt after completing Mini-Project 3:
- After doing the project, I am very pleased with the skill set I learned from this project. This was one of the first projects that allowed for some freedom in what topics and concepts we wanted to dive deeper into. I was definitely challenged by this project. One of the biggest changes I can pinpoint was visualizing the flow of the program and going from what was in my head to lines of code. I had to make decisions about which functions would handle individual song lyrics and which ones would be performing a function on a list of scraped information. It helped me to write out the different functions I thought I needed in plain English along with the parameters and return objects and to create a flow diagram. I felt as though this was well scoped and that I gained a significant amount of independence during this project. Although I might not use the specific results from this project, I now have a handful of new tools that I look forward to incorporating in my future projects.
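For reference, here is a minimal sketch of the kind of sentiment-scoring and plotting step the reflection describes. This is not the student’s code: the scraping is omitted, the song lyrics are placeholder strings, and only NLTK’s SentimentIntensityAnalyzer and matplotlib are shown.

```python
import matplotlib.pyplot as plt
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # needed once before the analyzer can run

# Placeholder "lyrics"; in the project these came from scraping LyricsFreak.
songs = {
    "Song A": "I love the sunshine and I feel so alive",
    "Song B": "Everything is broken and I cry alone tonight",
    "Song C": "We walked along the road and watched the cars go by",
}

analyzer = SentimentIntensityAnalyzer()
scores = {title: analyzer.polarity_scores(lyrics) for title, lyrics in songs.items()}

# Scatter each song by its negative (x) and positive (y) score, as in the reflection.
for title, score in scores.items():
    plt.scatter(score["neg"], score["pos"])
    plt.annotate(title, (score["neg"], score["pos"]))
plt.xlabel("negative sentiment")
plt.ylabel("positive sentiment")
plt.title("Sentiment of placeholder 'songs'")
plt.show()
```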
- Exercises: put yourselves in this student’s shoes and consider the tools mentioned above or any others that you can imagine helping to achieve the stated goal.
- Take the potential sources of lyrics, lyricsfreak.com and Google search results, and propose at least one way that each source may be limited.
- 1a. A potential lyricsfreak.com limitation…
- 1b. A potential Google search limitation…
- Take the tool to analyze the language of lyrics, NLTK, and imagine ways that the tool might have been trained to classify a set of words as positive or negative. Think of times when such a toolkit may struggle to classify a particular set of lyrics based on how you think it was trained.
- 2a. NLTK might produce satisfactory results given lyrical data that has the following characteristics, based on how we think it was trained…
- 2b. NLTK might struggle to produce satisfactory results given lyrical data that has the following characteristics, based on how we think it was trained…
- Consider the tool for visualizing the results of the analyses, matplotlib, or something similar (e.g., printing text to the terminal). What are conditions (e.g., with a lot of lyrical data available for an artist or only a little) that might make it difficult for a user to confirm or challenge assumptions they made about the musician they searched for?
- Once you have responses to the three prompts above, read the alignment excerpt written by the original project creator.
- Contemplate whether you went further in your consideration of potential bias than the project’s creator. [Optional] See texts such as Algorithms of Oppression that introduce ways in which Google searches and the algorithms that produce the results can be problematic for many users.
Text caching class example
This is an activity that a student can do if they were unable to make it to class for the in-class design challenge below.
We’ve created an example program to demonstrate a) caching text data as local files, and b) the utility of custom classes.
You are free to build on this program in your MP3, but as with all code you didn’t write you must make sure you understand how it works (and ask questions if you don’t). Since this code was provided by course staff, you don’t need to cite its source.
Exercise: Try adding a `lines` method to the class that returns a list of all the individual lines in the text file, so that you can use it to write code like:
```python
example = Text(my_url)
for line in example.lines():
    do_something(line)
```
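If you were not able to grab the example program, here is a rough sketch of the general shape such a class might take. The class name, constructor, and caching scheme below are assumptions rather than the course staff’s actual code, and the `lines` exercise is left as a stub so you can still try it yourself.

```python
import os
from urllib.request import urlopen


class Text:
    """Fetch text from a URL, caching it in a local file so repeat runs skip the network."""

    def __init__(self, url, cache_path="cached_text.txt"):
        self.url = url
        self.cache_path = cache_path
        if os.path.exists(cache_path):
            # Reuse the local copy written by an earlier run.
            with open(cache_path, encoding="utf-8") as cache_file:
                self.contents = cache_file.read()
        else:
            # First run: download the text and save it for next time.
            with urlopen(url) as response:
                self.contents = response.read().decode("utf-8")
            with open(cache_path, "w", encoding="utf-8") as cache_file:
                cache_file.write(self.contents)

    def lines(self):
        """Exercise: return a list of the individual lines in the text."""
        raise NotImplementedError("Try writing this one yourself!")
```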
Continuing our in-studio challenge pilot
Today, we continue piloting a new-for-Spring-2020 interaction style for class. We noted our reasons for trying a new approach to instruction in select studio sessions in Day 11’s notes.
A Context for today’s challenge
The `find` command in Linux is a program written in C that walks through files starting at a certain folder and prints all files that match a certain pattern. For example, you can find all files that end in `.txt`. For this exercise, you will implement a simpler version of this command in Python.
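If you have not walked a directory tree in Python before, `os.walk` is one common starting point. The sketch below only shows the flavor of the task (matching on a filename suffix); it is not the starter code you will be given.

```python
import os


def find_files(base_dir, suffix):
    """Print the path of every file under base_dir whose name ends with suffix."""
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for filename in filenames:
            if filename.endswith(suffix):
                print(os.path.join(dirpath, filename))


find_files(".", ".txt")  # e.g., list every .txt file under the current directory
```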
Setting up working groups
- The whole class counts off 1 to 4.
- Half of the class (1s and 3s) moves to a set of chairs close to the west whiteboard.
- The other half of the class (2s and 4s) moves to chairs near the east whiteboard.
- An instructor sitting with the east group will explain Task E, then introduce two different ways of carrying out the task in Python.
- A different instructor sitting with the west group will explain Task W, then introduce two different ways of carrying out the task in Python.
Today’s tasks
Get the starter files from this link.
For everyone, your job is to implement the class `FileFinder` in `file_finder.py` by filling in the bodies of the specified functions. Do not edit the other file. You can run the file with `python3 find_file.py <base_dir> <file_name>`, where `<base_dir>` is the name of the directory to start searching in and `<file_name>` is the name of the file to find. You can add `--num-trials 10` to repeat the process 10 times.
Then, in your groups, do the following:
- East 1’s will time how long it takes to run the process 10-1000 times, whatever gives a running time of 2-10 seconds.
- East 3’s will time how long it takes to run the process the same number of times as the East 1’s, but using memoization (one generic memoization pattern is sketched after this list). You will need to edit your implementation to do this.
- West 2’s will write additional code to measure how many files are looked through while searching, and try to correlate the running time with how many files are searched through.
- West 4’s will write additional code to measure how many matches are found, and try to correlate the running time with how many matches there were.
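The starter code defines its own interface, so the sketch below is deliberately generic: `slow_search` is an invented stand-in for whatever function your group decides to memoize and time, using `functools.lru_cache` and `time.perf_counter`.

```python
import functools
import time


@functools.lru_cache(maxsize=None)  # memoize: cache return values keyed by the arguments
def slow_search(base_dir, file_name):
    """Invented stand-in for an expensive search; swap in the real work from file_finder.py."""
    time.sleep(0.01)  # pretend the search takes a while
    return f"found {file_name} somewhere under {base_dir}"


start = time.perf_counter()
for _ in range(100):
    slow_search("/tmp", "notes.txt")  # after the first call, repeats hit the cache
elapsed = time.perf_counter() - start
print(f"100 calls took {elapsed:.3f} seconds")
```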
Working through tasks for 30 minutes
- Students in each group will work with others assigned the same number, spending 10 minutes implementing the approach they’ve been given.
- Spend about 4 minutes testing with different values.
- Each numbered group will split into two subgroups, such as 1a’s and 1b’s. 1a’s will move to sit with 3a’s, who were working with a different approach. 3b’s will join 1b’s for a share and discussion session. 2a’s will move to sit with 4a’s. 4b’s will join 2b’s for a share and discussion session.
- Try each other’s test cases and discuss.
Quick reflection
The students give us feedback verbally and using the Tell Us Tuesdays forms that Course Assistants made available.