Python Scraping, PDF2Text Conversion – first steps

At the beginning of this semester, I joined Manisha Goel, one of Pomona’s economics professors, to help with the technical side of her research. The project aims to analyze the effects of government actions on businesses and business management. To do this, we needed to analyze the tone and diction of reports from tens of thousands of listed businesses, searching for hints of doubt or uncertainty. Before any analysis could happen, we first had to gather the text and turn it into something usable later down the line. The business reports start out in PDF format, but plain text was needed in order to process the language used.

There are many ways to extract the text of a PDF into plain text, but as I’ve found, some work better than others. Initially, my team was throwing around the idea of using an optical character recognition (OCR) tool developed within ITS, but we eventually decided to just solve the problem in Python. I used the PyPDF2 library, while a fellow researcher used pdftotext. Both libraries have the same purpose, but the pdftotext implementation has had higher accuracy than mine. This difference could be explained by pdftotext being a stronger tool, but I think the real difference was the experience of my colleague, Maxwell Rose. Regardless, I learned how to walk directories and convert and create files in Python, useful tools for later research.
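My conversion loop looked roughly like the sketch below. This is a minimal illustration rather than the exact research code: the folder names are hypothetical, and it assumes PyPDF2’s older API (PdfFileReader and extractText), which newer releases of the library have since renamed.

import os
from PyPDF2 import PdfFileReader  # older PyPDF2 API; newer releases use PdfReader instead

# Hypothetical folder names, for illustration only.
pdf_dir = "reports_pdf"
txt_dir = "reports_txt"
os.makedirs(txt_dir, exist_ok=True)

for name in os.listdir(pdf_dir):
    if not name.lower().endswith(".pdf"):
        continue
    with open(os.path.join(pdf_dir, name), "rb") as f:
        reader = PdfFileReader(f)
        # Pull the text out of every page and join it into one plain-text document.
        text = "\n".join(reader.getPage(i).extractText() for i in range(reader.getNumPages()))
    with open(os.path.join(txt_dir, os.path.splitext(name)[0] + ".txt"), "w", encoding="utf-8") as out:
        out.write(text)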

The next step for this project is the actual analysis of the text files produced in this stage, which will hopefully yield the results and insights we are looking for within the corpus we’ve collected. For my part, I hope to revisit PDF conversion with a different package, pdfminer.
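For the curious, pdfminer (in its maintained form, pdfminer.six) exposes a very small high-level API. A minimal sketch, assuming the pdfminer.six package and a hypothetical file name:

from pdfminer.high_level import extract_text

# Hypothetical file name; extract_text returns the whole document as one string.
text = extract_text("annual_report.pdf")
with open("annual_report.txt", "w", encoding="utf-8") as out:
    out.write(text)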

By Sam Millette

 

Project: JupyterHub (Part 2, Edit 1)

Lessons Learned:

The virtual host was originally Ubuntu 16. An in-place upgrade moves the operating system from Ubuntu 16.04 LTS to 18.04 LTS on the same machine, keeping the installed packages and data rather than rebuilding the server from scratch. To upgrade to v18 in place:

$ apt update

$ apt upgrade

$ apt dist-upgrade

$ apt install update-manager-core

$ do-release-upgrade

This was per https://www.zdnet.com/article/how-to-upgrade-from-ubuntu-linux-16-04-to-18-04/

I ran those steps in a tmux session so the upgrade would keep going, and the output would not be lost, if I got disconnected from the remote host, since those steps take quite a while to complete.

The main configuration file of TLJH is a YAML file, a plain-text file with a specific, indentation-based syntax. TLJH comes with a utility command, “tljh-config”, to set its values.

Example usage:

$ tljh-config set foo bar

That was cool. However, it didn’t seem to handle array values very well, so for those I edited the YAML file by hand. I was unsure at first how to write arrays in YAML. They look like this:

Array-name:
  - Array entry 1
  - Array entry 2

The spacing is important: two spaces per indent level.

Also, and this was probably the most embarrassing mistake of all, I confused the documentation of one version of JupyterHub with the documentation for the version I was actually working with, The Littlest JupyterHub (TLJH). They looked identical, but they were not. I spent hours configuring LDAP authentication according to those docs, only to learn at the end, when I tried to enable it, that it is not an option in the version I was installing. However, as a bit of a consolation prize, that version supported authenticating against GitHub credentials, so I set that up instead.

Also, after successfully installing JupyterHub, I was baffled by how to install additional conda packages into the environment. I could not find the installed Python virtual environments (venvs) on the command line. In the JupyterHub web interface, under the same “New” menu where you launch a new notebook, you can also (or at least I could as an admin) launch a new “Terminal” session. This gave me an interface into the underlying venv, and I could use pip and conda to my heart’s content. One of these days, I’ll figure out where those venvs live on the actual server. That is how I installed the ldap-authenticator plugin ($ conda install -c conda-forge jupyterhub-ldapauthenticator). The “-c” in that command tells conda to use the conda-forge channel, which I believe is where the kind folks offering the ldap-authenticator plugin publish it.

Regarding the OS: originally I tried installing JupyterHub on one of the CentOS 7 servers we had been using for a professor’s economics research. It refused to install, requiring an Ubuntu 18 server. So I tried spinning up an Ubuntu container on it using Docker (“docker pull ubuntu:latest” to pull down an official Ubuntu image from Docker Hub, then “docker container run -it ubuntu:latest /bin/bash” to start it and attach a shell). However, that turned out to be a very minimal Ubuntu image; it didn’t have “vi”, “which”, “man”, or “ping.” It was difficult to perform the simplest tasks. The image seemingly caters to the microservices crowd, where a container ships only what its one service needs, rather than to someone who wants a full interactive server. After poking around to find a fuller Ubuntu image on Docker Hub (and failing … or was it that Docker Hub failed me?), I punted and created a full-blown Ubuntu VM from an Ubuntu template in vCenter, which gave me all the tools of a proper Ubuntu install. After that it was pretty smooth sailing.

By Andrew Crawford

Project: JupyterHub (Part 1, Edit 1)

Background: JupyterHub is an easy-to-use, browser-based interface to the Spark + Scala + Python environment we’ve been experimenting with over the past few months. JupyterHub is an always-on Jupyter notebook environment that, unlike a plain Jupyter notebook, does not require users to configure it on their local laptops and allows long-running jobs. Think of it as what GitHub is for git, or what Docker Hub is for Docker: JupyterHub is that for Jupyter notebooks. It is multi-user, which lets multiple researchers share the environment.

In practice, when a researcher is ready to start coding their project in Python or Scala and that code’s execution needs to be striped, in parallel, across multiple high-performance computing nodes, the researcher can simply point their browser at the JupyterHub URL and log in, and they will be presented with a fairly respectable Integrated Development Environment (IDE) that will execute, line by line and with reviewable output, any code they write. And that code is executed against the multi-node HPC environment. It is very powerful, and very cool.
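To make that concrete, here is a rough sketch of the kind of cell a researcher might run in a notebook. It is illustrative only: it assumes the notebook’s kernel is wired up to a Spark cluster, and the application name and sample size are made up.

from pyspark.sql import SparkSession
import random

spark = SparkSession.builder.appName("notebook-demo").getOrCreate()

# A toy computation striped across the cluster: estimate pi by random sampling.
def inside(_):
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

n = 10_000_000
count = spark.sparkContext.parallelize(range(n)).map(inside).sum()
print("pi is roughly", 4.0 * count / n)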

To appreciate what JupyterHub does, it would probably help to understand what a researcher would have to do without it. The answer: a lot. They would need deep Unix shell experience and proficiency with a Java build tool called Maven. They would have to understand building .jars (packaged Java code), and the Java build environment, and the Spark submit environment. They would have to know an editor such as vim; Python and Python virtual environments; a Python virtual environment tool called conda; and a Python library management tool called pip. With JupyterHub in place, most if not all of the above is handled by an administrator instead, and the researcher can focus on writing and executing code against the environment. However, if the researcher does have those skills, JupyterHub gives them the option of launching a web-based terminal session into the underlying environment. Which is also really cool.

Project Details

 Servers: pom-itb-jhubdev.campus.pomona.edu

An Ubuntu 18 box built for the project. The version of JupyterHub used, “The Littlest JupyterHub” (TLJH), requires Ubuntu 18. Configured to serve JupyterHub via https.

Packages installed:

Python3, git, curl (apt-get)

JupyterHub:

curl https://raw.githubusercontent.com/jupyterhub/the-littlest-jupyterhub/master/bootstrap/bootstrap.py | sudo -E python3 - --admin <admin-user-name>

JupyterHub plugin for LDAP authentication:

https://github.com/jupyterhub/ldapauthenticator

Configurations:

This is the configuration of JupyterHub on the server. It uses the handy “tljh-config show” command to dump it (“tljh” stands for The Littlest JupyterHub, which is the name of the version we installed):

root@pom-itb-jhubdev:~# tljh-config show

users:
  admin:
    - andrew
auth:
  type: oauthenticator.github.GitHubOAuthenticator
  GitHubOAuthenticator:
    client_id: <secret from github>
    client_secret: <secret from github>
    oauth_callback_url: https://pom-itb-jhubdev.campus.pomona.edu/hub/oauth_callback
ssl_key: /root/campus.key
ssl_cert: /root/campus.pem
https:
  enabled: true
  tls:
    key: /root/campus.key
    cert: /root/campus.crt

People To Thank:

Jonathan Lanyon – assistance with configuring AD authentication. Ultimately we learned that this TLJH version doesn’t appear to support AD authentication, but the work performed will be needed when we put in a version that does.

Pat Flannery – guidance with virtual hosts and building and configuring the Ubuntu VM.

Michael Ramsey – assistance with diagnosing Active Directory configuration issues by comparing them with environments known to work.

Asya Shklyar – the omniscient leader of all things HPC, without whose vision there would be nothing. She confirmed JupyterHub as the way to go and offered Binder as the next step in the evolution. She also set the challenge of AD authentication, which has not been met … yet. (Foreshadowing.)

More Reading – Additional Docs:

The main JupyterHub docs: https://jupyterhub.readthedocs.io/en/stable/

TLJH docs: the-littlest-jupyterhub.readthedocs.io/en/latest

Ldapauthenticator docs: https://github.com/jupyterhub/ldapauthenticator

Configuring authenticators for tljh: http://tljh.jupyter.org/en/latest/topic/authenticator-configuration.html

Binder blog: https://blog.jupyter.org/mybinder-org-serves-two-million-launches-7543ae498a2a

Binder docs: https://binderhub.readthedocs.io/en/latest/

By Andrew Crawford

Rendering scenes with pbrt3

At the end of our Computer Graphics class, we had an assignment to design our own ray-tracing renderer. It was an extremely difficult task, and our group failed in the end despite our efforts. Because of that, I wanted to see how a working ray-tracing engine performs when rendering a scene.

Here I would like to share the procedure I used to run this ray-tracing code, as well as the huge excitement I felt at the time. If you are not familiar with cmake and make, you can also get a grasp of them from this post.

So, what is ray tracing? Imagine your eyes are a camera. When you look at the screen, you are looking at about 2560 x 1600 pixels (the number varies by computer) all at once. For each pixel, the computer needs to put a color there. So how does it determine that color? In other words, how does your computer project a 3D scene onto a 2D surface? In short, the “camera” shoots a ray out through each pixel. When a ray hits an object, it reads the color at that hit point, a value determined by the material, the reflection algorithm, and the light sources. Eventually, the ray hands that RGB value back to the pixel it passed through on the screen.
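pbrt does vastly more than this, but the loop described above fits in a few lines of Python. The sketch below is a toy illustration only: a single hard-coded diffuse sphere, one directional light, and ASCII output instead of an image file. It is not how pbrt itself is implemented.

import math

WIDTH, HEIGHT = 64, 32                 # a tiny "screen"
sphere_center, sphere_radius = (0.0, 0.0, -3.0), 1.0
light_dir = (-0.5, 1.0, 0.5)           # direction toward the light

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def hit_sphere(origin, direction):
    # Solve |origin + t*direction - center|^2 = radius^2 for the nearest positive t.
    oc = tuple(o - c for o, c in zip(origin, sphere_center))
    b = 2.0 * sum(o * d for o, d in zip(oc, direction))
    c = sum(o * o for o in oc) - sphere_radius ** 2
    disc = b * b - 4.0 * c
    if disc < 0:
        return None
    t = (-b - math.sqrt(disc)) / 2.0
    return t if t > 0 else None

light_dir = normalize(light_dir)
for j in range(HEIGHT):
    row = ""
    for i in range(WIDTH):
        # Shoot one ray from the camera at the origin through pixel (i, j).
        x = (i + 0.5) / WIDTH * 2.0 - 1.0
        y = 1.0 - (j + 0.5) / HEIGHT * 2.0
        direction = normalize((x, y, -1.0))
        t = hit_sphere((0.0, 0.0, 0.0), direction)
        if t is None:
            row += " "                 # the ray missed: background color
        else:
            # Lambertian shading: brightness = max(0, surface normal . light direction).
            p = tuple(t * d for d in direction)
            n = normalize(tuple(pc - cc for pc, cc in zip(p, sphere_center)))
            shade = max(0.0, sum(a * b for a, b in zip(n, light_dir)))
            row += ".:-=+*#%@"[min(8, int(shade * 9))]
    print(row)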

In computer graphics, the best book on rendering is Physically Based Rendering. It is already in its third edition, and the fourth is coming next year. It seems that CG researchers keep discovering better and faster ways to render.

The beautiful bathroom picture below is my goal today – actually, it was the goal 24 days ago, and you will see what I mean by that later. Yes, it is a completely modeled scene. It is not a picture taken by a camera.

Salle_de_bains, created by nacimus, downloaded from Blend Swap

The first step is getting the code from this book’s official GitHub. https://github.com/mmp/pbrt-v3

All the code in this book is published there. The README file is also detailed enough to follow, but I’d still like to offer here a concise procedure with screenshots. I ran this on a MacBook, but keep in mind that you can also do it on Windows command prompt in the same way with a slightly different syntax. After all, there is only one cmake in the world.

First, I created an empty folder called “build” inside the folder I had just cloned. Then I cd’d into that build folder and ran “$ cmake ..” in the terminal. This is how cmake works: you run “$ cmake [the folder containing CMakeLists.txt]”, and it generates a Makefile in the current folder.

 

 

 

After cmake finished, there was a Makefile in the build folder. Still inside that folder, I ran “$ make”. There you will see some beautiful purple and green messages. This step can take a while.

 

 

 

 

Cool. Now my build folder looks like this:

There is an executable file called “pbrt”, and that is exactly what we wanted after all this mess! Think of it as software you normally use, except that you don’t double-click it; you run it by calling it from the command prompt.

Now that we have a stove, we still need some raw food to cook. In this case, that means a 3D scene. If you go to https://pbrt.org/scenes-v3.html, you will see a lot of good-looking scenes created by artists in the community. At the top of the page there is a GitHub repository to clone from. Here we will use “Salle de Bain”, created by nacimus, as our example.

Finally, we get to render it! On the command line, you tell the pbrt engine which scene file to render. Still in the build directory, enter a line like this:

$ ./pbrt [the .pbrt file]

And then it starts running! Note that it shows how many cores it detects; we will use that later. Then look at the progress bar. It is still empty. The number on the left in the parentheses is the time already taken, and the number on the right (still a ‘?’ at this point) is the estimated remaining time. In the beginning, the computer cannot yet tell how long it will take. But feel free to take a guess now 🙂

So, what is your guess? Is it 6 hours or even longer than a day? The answer is 23 days on my MacBook Air with 4 cores!

~ 2 million seconds ≈ 551 h ≈ 23 days

That’s why the smartest CG researchers in the world are still studying ray tracing after all these years. The resulting pictures certainly look nice – but it is too slow.

However, it does get better if you have stronger hardware. This is the result I got running on my MSI gaming laptop with 8 cores:

~ 1 million seconds ≈ 278 h ≈ 11.6 days

Asya said that the speed is mostly related to the number of cores. That makes sense here: the laptop with 8 cores has twice the speed of the one with 4 cores. They also have different GPUs and RAM, but I guess those are not that important. My take is that the program ran in many threads, each requiring a core; however, no single thread is that much work, so the other factors did not influence the speed very much.

However, when I ran this on Mudd’s research computer, I was amazed to find that even though the system only detected 4 cores, it finished the work in a mere 3 days! Computing still has plenty of secrets left to explore.

By Jack Chen

Memory in HPC

Chris Nardi from HPC Support discusses the challenges that memory-hungry applications pose in an HPC environment.


Everything tends to be more difficult when running applications in a High-Performance Computing environment. To need an HPC environment in the first place, the task at hand must demand an enormous amount of computational resources that simply aren’t available on a personal laptop or workstation. Though there are countless other challenges that HPC applications face, this post focuses on memory usage, as memory has been a roadblock for two particular projects that we’ve worked on in the past year. We tend to classify applications as limited (or “bound”) by the three main components of a computer.


An application can be CPU-bound if the application would be sped up by adding additional cores to the server, memory-bound if the processing speed is limited by the amount of data that can be stored in RAM and swap, or I/O-bound if the major hurdle is reading and writing data, either from persistent storage or RAM. An application can be a mix of these types, as a process that requires reading a lot of data and making use of all of it at the same time could be both I/O-bound and memory-bound. Broadly classifying applications in this way can help to identify which components of the system (or even the code) need to be improved so that performance is optimized.


While an application that is CPU-bound or I/O-bound can typically run on less powerful machines (though possibly slowly), a memory-bound application requires a certain amount of memory to be available before it can run at all. This is further complicated by the price of RAM sticks, as they can be one of the most expensive components of a computer or server. Applications that need to run on an HPC environment are oftentimes memory-bound, as they typically are analyzing or creating a large amount of data that cannot (easily) be processed independently. While many applications are also CPU-bound, long-running computations are acceptable to researchers. What’s not acceptable is a computation that cannot even be finished due to a lack of memory.


We’ve encountered memory-bound applications in two projects that I’ve worked on in the past year. The first involved a several-terabyte dataset of hundreds of thousands of text posts, on which we were attempting to run a Latent Dirichlet Allocation (LDA) model so that topics could be discovered without manual tagging. The most common implementation of an LDA model requires all of the data to be loaded into memory at the same time, something that wasn’t feasible locally, as the most RAM on any individual server we currently have is 768 GB.


We looked into using AWS to find a cloud node with over 1 TB of RAM, but costs can quickly add up, as x1e.16xlarge or x1e.32xlarge instances with 1-3 TB of RAM are $14-$25 an hour. We found some implementations of LDA models that incorporated batching, meaning the algorithm is run on subsets of the data and the intermediate results are combined to generate a final output, but this would require additional coding and implementation work.
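As an illustration of the batching idea (not the implementation we actually evaluated), scikit-learn’s online LDA can be fed the corpus in chunks via partial_fit, so only one chunk of documents needs to be in memory at a time. The file names and parameters below are hypothetical.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical chunked corpus: each iteration yields a manageable list of documents.
def document_chunks():
    for path in ["posts_part1.txt", "posts_part2.txt"]:  # placeholder file names
        with open(path, encoding="utf-8") as f:
            yield f.readlines()

# HashingVectorizer is stateless, so it never needs to see the whole corpus at once.
vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False, stop_words="english")
lda = LatentDirichletAllocation(n_components=50, learning_method="online", random_state=0)

for docs in document_chunks():
    X = vectorizer.transform(docs)   # sparse term counts for this chunk only
    lda.partial_fit(X)               # update the topic model incrementally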


Though the second application was performing calculations on only a moderately sized dataset in R, it required an exponential expansion of probabilities to reach a final output. As a result, it too would require more than 768 GB of RAM to complete. This application proved trickier to find a solution for, as it would be hard to batch out the work of the associated algorithm the way we could with LDA.


         So, what can be done when you find an application that is limited by memory? Generally speaking, there are a few possible solutions. As discussed, a simple one is going to the cloud, though this can be expensive. For academic use, XSEDE (a partnership of colleges and universities that allows sharing of computing resources) can be a great resource, but submitting a job on it requires that the data and code be shareable with no restrictions. More complicated solutions likely require a rework of the application itself.


         In some instances, data can be compressed without losing any (or much) information, meaning that a similar output could be generated without needing as much memory. If the data can be processed independently, tools like Apache Spark can also help by dispatching tasks to worker nodes and bringing the data together at the end to create a final output.
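As a hedged sketch of that pattern (the input path and application name are placeholders), a PySpark word count maps over the data on the worker nodes, and only the small aggregated result ever comes back to the driver:

from pyspark.sql import SparkSession

# Illustrative sketch: count word frequencies across a dataset too large for one machine's RAM.
spark = SparkSession.builder.appName("distributed-aggregation").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/posts/*.txt")   # placeholder path
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))                  # combined on the workers

# Only the small, aggregated result is brought back to the driver.
print(counts.take(10))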


         Of course, buying more memory is also a possibility. However, at $1000+ per 128 GB stick, money can limit the feasibility of this option. As with our two examples, it’s usually possible to find ways to rework the application if you search hard enough (and have people with the expertise to implement the required changes).


         Memory usage is a pitfall in HPC, but it is surmountable with the appropriate tools.


By Chris Nardi

DIY Easy Smart Mirror

Ever Wanted to Make a Do It Yourself Smart Mirror? Lindsey Tam Tells You How, Step by Step.

Making a smart mirror is a fun DIY project that anyone can do. A smart mirror is a device in which a two-way mirror is placed in front of a screen that displays information such as the time, the weather, or the date. More advanced smart mirrors also support touch-screen capabilities. Personally, I do not want a touch-screen mirror (who wants to deal with constant smudges and fingerprints?), so I opted for a simpler display.

I chose to use a tablet as the technology behind the glass. You could also use a monitor and a Raspberry Pi behind the glass, but this would require programming. There is a lot of documentation online, so it would be straightforward to find and adapt existing code. Either method requires the mirror to be plugged in 24/7.

My Materials:

  1. A shadow box frame (37″ x 17″)
  2. A 7″ Amazon Fire tablet + Charger
  3. White wood spray paint
  4. Two-way Acrylic Mirror
  5. Black masking tape + Black construction paper
  6. D nails

Steps:

  1. Start by disassembling a shadowbox frame.

 

Take the backing off of the frame and cover it with black paper.

 

 

It is important to cover the backing with black paper because the darker it is on the inside, the better and clearer the mirror will be.

2. Download a smart mirror app on your tablet.

 

 

 

 

The app I am using is called Digital Wall Clock.

If there is no smart mirror app, you can look for apps such as “clock display” or “always on”. I feel that Android tablets support better smart mirror apps, but I chose an Amazon tablet because I happened to have it lying around. What I like about my Amazon tablet is that it supports Alexa. So my mirror is not just a display, but is also capable of playing music and videos through voice commands.

3. Drill holes in the back of the board to allow sound to pass through. I added an additional hole at the bottom for the wiring.

 

4. Use masking tape to securely fasten the device onto the board. I also taped down the wiring onto the board.

 

 

5. Place the mirror on the inside of the frame and seal the back using D hooks, which I screwed in by hand. For extra security, I also added a layer of black tape.

 

 

Note: I got my mirror online from https://www.tapplastics.com. The quality is very poor, but their rates were the most affordable.

 

And you are done! The end product is a unique and fun spin on a traditional mirror. The materials are easily accessible and assembly is fast. Now you are not only able to see your reflection but can also check the time and listen to music. This project makes a great gift and is an excellent accessory to have in the home. You can modify your mirror to your liking, and the possibilities are endless!

By Lindsey Tam

The Case Against Google

How Destructive Corporate Practices Necessitate Groups like HPC Support

There are many changes a reader of this post could make that would let me count writing this piece as a successful use of time. Switching to DuckDuckGo, migrating away from Gmail, writing to a congressperson about privacy concerns, and spreading the facts are a few of the outcomes I’d like.

It is quite simply a fact that Google has accumulated too much power without a policy in place to regulate it. The technological revolution that started in the early ’80s and was catalyzed by the Internet has happened much too quickly for Congress to keep up. Do we need to amend the Constitution to account for changes in the fundamental way we live our lives? Are Google (and Facebook) tricking the public into thinking that privacy is nothing to worry about? Do items like fake news and content published on social media sites really fall under the jurisdiction of those same sites, so that the responsibility falls to Facebook and Google to clean up the messes they are making by allowing conspiracy theories to proliferate?

I contend the answer to all three of these questions is yes. Yes, we need a constitutional revolution to account for modern day life. Yes, the public has no idea what they are in for if they do not fight for the fundamental right of privacy. And yes, huge tech corporations are responsible for some amount of the malicious behavior that has made its way from their platforms through politics and into daily life.

Let me present the reader with a couple of facts that I hope will heighten worry and promote action. Google runs on about 75% of the most-trafficked websites on the internet. Much of what it is running is Google Analytics, a program that serves as a tool for data collection. It helps Google (and Facebook has a similar program) build an in-depth profile of each individual that can be sold to advertising agencies. But, of course, the power is not just in the individual but in the collective. It is not individual data that Google cares about but the aggregation of millions of data points. When they can see trends across whole communities, they can profit off of the information.

Sure, it is true that some of this is helpful. Whenever we go to a new site that requires a login, Google and Facebook both appear as easy ways to do a one-click sign-in instead of spending ten minutes entering all of our information. But it is hard for most of us to see that this helpful feature is another way for Google and Facebook to track you. Because we use this sign-in tool, and sometimes it is the only option, both companies can now track what we do even off of their sites. Because this does not feel like a source of anxiety, and warnings of a coming surveillance state are dismissed as alarmism, the public seems woefully blind to its own potential downfall.

Just imagine a case of profitable behavior that Google is incentivized to perform: selling your information without your consent in a way that could cost you your life. Google is able to track your mouse as it passes over their sites. By recording that data, they could aggregate it and compare it against all known users who have Parkinson’s disease. The illness shows itself online as slower and shakier than usual mouse control. With vast amounts of data at their disposal, Google could predict with very high confidence whether you have Parkinson’s. Here comes the part that does violence to commonly held societal goals.

Google is incentivized to sell this information to your insurance company, which could result in a spike in your insurance costs or in the company dropping you as a client. There are currently no laws in place that unilaterally prohibit this type of behavior. No laws require Google to compensate you for the information, or even to tell you that you likely have Parkinson’s disease. Do not be fooled by the admittedly fantastic productivity boost these companies offer; you are the product, not the consumer.

I am no Luddite advocating a return to the often overly romanticized times of hunter-gatherers or cabins in the woods. I am a staunch supporter of coding, data analysis, and the integration of life with technology. We need supportive online communities, not corrosive ones. We need a tool for the people, not a tool used on the people. We need our lives to be protected by the institutions that have a duty to serve us.

All of this is to say that information is the skeleton key in the war of ideas. It unlocks avenues and ways of thinking that were previously locked away. One way to inform ourselves is to learn about the various methods these companies use to analyze such large amounts of data. Highly parallel computing is one of these tools. This group, under the enlightening leadership of Asya Shklyar, teaches the fundamental knowledge of the computing world. From the terminology to various hands-on experiments, my time in the HPC group, though brief so far, has helped me form more informed opinions on a variety of topics.

I do not wish to say that in the HPC group we focus explicitly on Google’s or Facebook’s or any other company’s policies. Rather, you get a taste of what is going on in a world in which it is crucially important to be an informed citizen in this day and age. Facts such as that the critical path is of paramount importance in determining the runtime of a program, or that Moore’s Law predicts the number of transistors on a chip will double every 18 to 24 months, are of the most utility to programmers, data scientists, and technicians. But they also offer the layperson an avenue to more accurate predictions and a sense of the capabilities she has. “Knowledge is power,” as the old saying goes. At the HPC support group, Asya and our team empower a generation of students from academic disciplines across the college.

Further Resources:

Articles:

Are You Ready? Here Is All the Data Google Has on You by Dylan Curran

Google Just Got Some Record-Breaking Bad News by Maya Kosoff

Podcasts:

Waking Up with Sam Harris #152 – The Trouble with Facebook

The Knowledge Project with Shane Parrish – Popping the Filter Bubble

 

By Malcolm Yeary

Editor’s Comment: Thank you, Malcolm, for such an inspiring article. More students in non-technical fields need to know how technology works, to make informed decisions and participate in policy-making. We have our work cut out for us!

Ekeka Abazie Learns How To Create ML Workflows in GCP

Learning about GCP (Google Cloud Platform)

On May 29, I attended a machine learning workshop at Google’s office in Venice Beach. I was super excited to be able to visit one of the leaders of the AI industry, as well as a major search engine; in fact, since coming to California, this has been a big goal of mine as an aspiring computer scientist.

I attended the workshop with my HPC supervisor, Asya Shklyar, who also generously provided me with transportation to the event. The workshop was aimed at marketing Google’s AI software, such as BigQuery, hosted JupyterLab, and Dataflow, to businesses, and it closed with a discussion of ethics and applications of machine learning. That discussion featured representatives from the USC Keck School of Medicine, Pluto (which is now in partnership with Google), and a financial consulting business.

Soaking in the tech atmosphere

BigQuery is the Google service businesses use to survey very large amounts of information (terabytes of data). Using it, a business can apply different machine learning models to, for example, predict whether or not a customer who visits their website will end up buying by the end of their visit. The first practical lab portion of the workshop gave me a chance to use a binary label of “will buy” vs. “will not buy” and pose a query that yields predictions based on trends in the input data the model had been fed. The query results were also listed in a very user-friendly format, with specifics of the training data presented so that even a layperson could understand them.
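For a flavor of what that kind of lab looks like, here is a hedged sketch, not the exact workshop notebook, of training a “will buy vs. will not buy” classifier with BigQuery ML from Python. The dataset, table, and column names are hypothetical, and the client assumes Google Cloud credentials are already configured.

from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials are set up

# Hypothetical dataset/table/columns; BigQuery ML trains the model with plain SQL.
train_sql = """
CREATE OR REPLACE MODEL `my_dataset.will_buy_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['will_buy']) AS
SELECT pageviews, time_on_site, traffic_source, will_buy
FROM `my_dataset.visits`
"""
client.query(train_sql).result()   # wait for training to finish

# Ask the trained model for predictions on new visits.
predict_sql = """
SELECT predicted_will_buy, predicted_will_buy_probs
FROM ML.PREDICT(MODEL `my_dataset.will_buy_model`,
                (SELECT pageviews, time_on_site, traffic_source FROM `my_dataset.new_visits`))
"""
for row in client.query(predict_sql).result():
    print(dict(row.items()))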


Putting it all together

I found Dataflow to be a very interesting development because it works on real-time data, as opposed to the earlier model, Dataproc, which works only on data that has already been collected. Dataflow lets the data stream into the machine learning model as it arrives, rather than the model being run over a static dataset after the fact. It also has much more practical applications: I could envision customer activity continuously becoming new data for the machine learning model, which could increase the accuracy of its predictions and yield better results for the business.
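Dataflow pipelines are written with the Apache Beam SDK. A minimal sketch of a streaming pipeline, with a hypothetical Pub/Sub topic and a stand-in scoring function in place of a real model, might look like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def score(event_bytes):
    # Hypothetical stand-in for calling a trained model on one event.
    event = event_bytes.decode("utf-8")
    return (event, len(event) % 2)   # pretend "will buy" / "will not buy"

# streaming=True tells Beam this is an unbounded, real-time pipeline.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/site-visits")
     | "ScoreEvents" >> beam.Map(score)
     | "LogPredictions" >> beam.Map(print))

On GCP the same pipeline would be submitted with the Dataflow runner; Beam’s local direct runner can execute it for testing.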

At the end of the workshop, we heard from representatives of different sectors, namely health (the Keck School of Medicine) and finance. I found it amazing that the USC School of Medicine, one of the premier research-based health institutions in the US, uses machine learning models from Google to match patients with appropriate clinical trials. It’s also important to note that this workshop showed me that Google is, at its core, an AI business and that this is central to how it makes its money, something I had always been curious about. There was also a fireside chat, but Asya Shklyar arranged for me to have a tour of the building instead, for which I am very grateful. Also, the food was awesome.

By Ekeka Abazie

Structure in HPC Support

HPC Support Student Malcolm Yeary critically analyzes the positives and negatives of HPC structures, offering solutions and optimism for the future.

Through Asya’s prompting, the HPC group has turned its attention to identifying both problems and potential solutions. HPC Support has become an agglomeration of highly talented and curious students under the guidance of a patient and knowledgeable leader. Now, it may be fair to ask: what’s the problem? Groups without structure can sacrifice some productivity and efficiency. I remain open to the possibility that a trade-off is made in our group, where some portion of productivity is lost for greater creativity, which would probably benefit the group in the long term. So, this post is dedicated to uncovering reasons for creating more structure in the HPC Support community. Just how much structure would benefit our group remains an ongoing discussion, and I encourage anyone reading this to reply below to help continue the conversation. Here, I hope to set the groundwork for that discussion.

I will begin by recapitulating the problems that Asya and others briefly discussed in our meeting on May 16, 2019. I will present three: the continuous stream of new members to HPC, student burn-out, and HPC’s current anonymity.

In the first case, due to HPC Support’s effort to be open and inclusive, it uses a rolling application that allows students to come sit in on meetings, talk with Asya, and get involved if approved for a project. This policy has prevented some of the senior members of the group from diving deeper into the topics they enjoy and has curtailed the learning of more advanced skills. By having to explain the basics over and over again, the group loses its ability to progress, because progressing would mean leaving the newest members behind.

Of course, whatever structure HPC and Asya decide to implement, this welcoming spirit is something everyone wants to keep. I felt the tremendous power of this positive attitude recently when I joined. Asya welcomed me into the meeting and dedicated time to answering my questions while still managing to help other people with questions about their projects. I recognize that some people at the meeting were listening to the information for the fifth or sixth time. As good as that is for cementing the basics, I can imagine some of them want to explore HPC-related themes further.

The second, recurring problem is student burn-out. Pomona students, and 5C students generally, opt into a system that takes their time in exchange for skills and knowledge. It takes tremendous effort and concentration to become a practicing biologist, programmer, writer, or artist. So professors assign the amount of work they feel is apt to train us in the ways they feel are needed. This fact of life at college affects HPC and all other campus jobs and groups. I find it a privilege to have such specialized attention and so much responsibility coming from my course load, and I sometimes find my time strained to accommodate more work. It seems other members of HPC have felt the same. Commitment and interest run high at the beginning of the semester, when course work is light and intellectual curiosity has not yet felt the punch of imminent deadlines. People take on projects but soon lose the ability to both keep up with classes and participate fully in HPC work.

The third and final problem to highlight is the lack of awareness the Pomona community has about our group. One goal in creating this group was to spread the knowledge that high performance computing can help many different fields of study. This potential to advance a research project or create new ways of thinking is not limited to the STEM disciplines. Professors and the student body at large have not had a sufficient introduction to the HPC community. This lack of awareness could also contribute to the lack of a stable funding source from the Pomona administration.

So, for these three problems, I propose a three-part solution. HPC may thrive from delineating three groups of people: (1) a group of teachers and mentors, (2) a group committed to HPC support during the semester, and (3) HPC advocates. I will briefly explain what the duties and benefits of each group may be.

The first group would mitigate the constant reiteration of basic topics to the entire group. A select portion of the HPC members would volunteer to become mentors who meet with new students who wish to join (and who have been vetted by Asya). They would explain the basics of what HPC is and why we exist. If this structure goes into place, the mentors could also explain how the group works and get the new person thinking about where she may like to go.

This mentor meeting could be folded into the weekly meetings, in which case the two weekly meetings should be split by experience. Thursday would be a beginner meeting where everyone is welcome, and new people are encouraged to ask Asya and the team about topics and interests. Friday would be the advanced meeting where, again, everyone is welcome, but it is geared toward the more detailed questions of people currently working directly with Asya or with a faculty adviser on an assigned project. Splitting the meetings like this would ease new people into the fold and get them acquainted with the terminology, while allowing more experienced members to dive deeper.

The second group, the helpers of HPC Support, would examine their upcoming semester and commit to some number of hours working directly on HPC projects. With the increase in the amount and kind of HPC tasks, it may be pertinent to hire a “student staff” working directly under Asya. This would ease the workload for Asya and give curious students direct experience working with an ITS team. Moreover, it would keep the second problem, burn-out, from becoming an issue for the group. Students who know they will not have enough time in the semester can sit in on the meetings and follow the messages on Slack. They will be able to signal this in advance so that Asya can plan ahead for how to divide the projects. This dedicated group could help the group work more efficiently and allow more people to state the number of hours they can commit to HPC.

I foresee a potential objection to the line of reasoning used to promote the creation of the second group. An objector may claim that this will in fact limit productivity, because people will be unsure about their exact course load. Due to this fear and uncertainty, fewer people will sign up for the committed support position. If we keep the group the way it is, with an open structure of doing projects when you can, more people may actually work on them throughout the semester. I am open to this criticism, but I have an intuition that it does not apply to HPC. I would love to hear more in person if anyone holds this view (or respond below!).

The third group, the HPC advocates, would be tasked with spreading the word. They would meet regularly outside of the HPC meetings and make plans to get HPC information out into the open. This group’s job is the dissemination of knowledge. A couple of ideas that jump to mind right away: info-posters, meetings with the administration to work on getting a budget, attending the trustee event (this time we can do it!), and hosting workshops to teach basic command-line skills (with snacks as incentives). Of course, these are first thoughts, and if the group is put in place, the team will spend more time focusing on effective strategies for teaching about this new world of knowledge.

The last thing to mention is that Asya also brought up the possibility of distinguishing between groups on a temporal basis. In other words, there would be a fall-semester group, a spring-semester group, and a summer group. This would help with organization. It would also allow students to check in with themselves: they could reflect on whether they wish to continue at their current level, increase their commitment and responsibility, or lessen their involvement ahead of a hard semester.

This post is supposed to be the beginning of a longer conversation about what to do. If anyone disagrees with the problems as conceptualized in the beginning or the solutions proposed, please write a response! I hope this helps get the ball rolling.

By Malcolm Yeary

How I Created a Virtual Reality Environment and (almost) Presented it at a Conference

Sabina Ku and Ino Tsichrintzi

I always found virtual reality fascinating, but I never expected not only to have access to it, but also to be able to create a VR environment AND have an opportunity to present it at a conference!

By joining the HPC Support group at Pomona College, I got to experiment with VR and use it to create an environment that let us share some of the things we do at the In The Know Lab. I have to admit that at first I was really excited, and despite my initial fear, I came to feel confident that I could make this application, and make it good. The process, however, was a bit harder than one would think.

The main problem I had to deal with was including 360-degree videos of students in the environment. Those videos showed my fellow students in the lab talking about the different tools we have and how they use them for our educational purposes. Though the videos looked fine on the computer, when integrated into the VR environment they became distorted. I tried to approach the problem in many different ways after doing some research online and following the advice of others who had encountered similar problems. Many of the solutions did not work, making me feel desperate, but in the end I found a way to project the videos into the surroundings of the application’s user without distorting them. I later realized that the reason behind the issue was that the 360 videos were not in the appropriate format for use on a VR platform.

After completing the application, I felt really proud of myself and very excited to have something finished to present at a conference called ELI. The conference can be best explained by its own definition: “The EDUCAUSE Learning Initiative (ELI) is a collaborative community committed to advancing learning through IT innovation.” ELI is associated with EDUCAUSE, which states on its website: “We are a nonprofit association and the largest community of technology, academic, industry, and campus leaders advancing higher education through the use of IT.”

That was the first conference I attended as a presenter, and I was very excited about it. When we got there, however, things did not run as smoothly as I would have liked. As we were setting up our booth, I had to set up the VR headset, but the one I had been using back at the In the Know Lab was an Oculus Rift, and the one at the conference was different (a Samsung Odyssey, which runs on Microsoft’s Windows Mixed Reality platform), making it harder to use. I had never tested with a different headset, and setting it up while people were coming to our booth was not ideal. After a while, though, I managed to set it up and was ready to run my program. When running the program, however, I realized that the headset I had developed with used a different setup in Unity than the one we were using at the conference. That was when I started panicking. I tried to find a way to make it work but, in the end, it was too late and I gave up.

I did not get to present what I did, but I still had the opportunity to share my excitement about the new technology with the people who came to our booth.

Ino is pensive, starting to realize things are not going great
Ino is ready to give up
Ino (left) in a Hololens AR headset experiencing an Anatomy class developed at Case Western

Even though this was not my greatest achievement, it was still a very valuable experience. At the time I was very disappointed, since I felt like all the hard work I had done was for nothing, but I later realized that I learned a great deal throughout this journey. For me, it highlighted that the work I had done mattered more than its recognition.

By Ino Tsichrintzi