## Monthly Archives: May 2017

## The system of software that connects Internet devices

Glow-in-the-dark objects seem magical when you’re a kid — they can brighten up a dark room without the need for electricity, batteries or a light bulb. Then at some point you learn the science behind this phenomenon. Chemical compounds called chromophores become energized, or excited, when they absorb visible light. As they return to their normal state, the stored energy is released as light, which we perceive as a glow. In materials science, researchers rely on a similar phenomenon to study the structures of materials that will eventually be used in chemical catalysis, batteries, solar applications and more.

When a molecule absorbs a photon — the fundamental particle of light — electrons in the molecular system are promoted from a low-energy (ground) state to a higher-energy (excited) state. These responses resonate at specific light frequencies, leaving “spectral fingerprints” that illuminate the atomic and electronic structures of the system being studied.

In experiments, these “spectral fingerprints,” collectively known as the absorption spectrum, are measured at state-of-the-art facilities like the Advanced Light Source (ALS) at the U.S. Department of Energy’s Lawrence Berkeley National Laboratory (Berkeley Lab). In computer simulations, these measurements are typically captured with a quantum mechanical method called Time Dependent Density Functional Theory (TDDFT). The computational models are critical in helping researchers make the most of their experiments by predicting and validating results.

Yet despite its usefulness, there are times when TDDFT cannot be used to calculate the absorption spectrum of a system because it would require too much time and too many computer resources. This is where a new mathematical “shortcut” developed by researchers in Berkeley Lab’s Computational Research Division (CRD) comes in handy. Their algorithm speeds up absorption calculations by a factor of five, so simulations that used to take 10 to 15 hours to compute can now be done in approximately 2.5 hours.

A paper describing this method was published in the Journal of Chemical Theory and Computation (JCTC), and the new approach for computing the absorption spectrum will be incorporated in an upcoming release of the widely used NWChem computational chemistry software suite later this year.

**New Algorithms Lead to Computational Savings**

To study the chemical structure of new molecules and materials, scientists typically probe the system with an external stimulus, usually a laser, then look for small electronic changes. Mathematically, this electronic change can be expressed as an eigenvalue problem. By solving this eigenvalue problem, researchers can get a good approximation of the absorption spectrum: the eigenvalues reveal the resonant frequencies of the system being studied, while the corresponding eigenvectors are used to calculate how intensely the system responded to the stimulus. This is essentially the principle behind the TDDFT approach, which has been implemented in several quantum chemistry software packages, including the open-source NWChem software suite.

While this approach has proven to be successful, it does have limitations for large systems. The wider the energy range of electronic responses a researcher tries to capture in a system, the more eigenvalues and eigenvectors need to be computed, which also means more computing resources are necessary. Ultimately, the absorption spectrum of a molecular system with more than 100 atoms becomes prohibitively expensive to compute with this method.

To overcome these limitations, mathematicians in CRD developed a technique to compute the absorption spectrum directly without explicitly computing the eigenvalues of the matrix.

“Traditionally, researchers have had to compute the eigenvalues and eigenvectors of very large matrices in order to generate the absorption spectrum, but we realized that you don’t have to compute every single eigenvalue to get an accurate view of the absorption spectrum,” says Chao Yang, a CRD mathematician who led the development of the new approach.

By reformulating the problem as a matrix function approximation, making use of a special transformation and taking advantage of the underlying symmetry with respect to a non-Euclidean metric, Yang and his colleagues were able to apply the Lanczos algorithm and a Kernel Polynomial Method (KPM) to approximate the absorption spectrum of several molecules. Both of these algorithms require relatively little memory compared to non-symmetrical alternatives, which is the key to the computational savings.

Because this method requires less computing power to achieve a result, researchers can also easily calculate the absorption spectrum for molecular systems with several hundred atoms.
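To make the idea of "a spectrum without eigenvalues" concrete, here is a minimal sketch rather than the researchers' implementation. It assumes an ordinary real symmetric matrix `H` and the ordinary dot product, whereas the article describes a matrix that is symmetric only with respect to a non-Euclidean metric. The names `lanczos` and `absorption`, the starting vector `d` (standing in for the stimulus), and the Lorentzian broadening width `eta` are all illustrative assumptions. A few Lanczos steps reduce `H` to a small tridiagonal matrix, and the broadened spectrum is then read off a continued fraction built from its entries.

```python
import math

def matvec(H, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def lanczos(H, d, steps):
    """Run up to `steps` Lanczos iterations on symmetric H, starting from d.

    Returns the diagonal (alphas) and off-diagonal (betas) entries of the
    small tridiagonal matrix. Only two previous vectors are kept in memory.
    """
    norm = math.sqrt(sum(x * x for x in d))
    q = [x / norm for x in d]
    q_prev = [0.0] * len(d)
    alphas, betas = [], []
    beta = 0.0
    for _ in range(steps):
        w = matvec(H, q)
        alpha = sum(a * b for a, b in zip(w, q))
        # Orthogonalize against the two most recent Lanczos vectors.
        w = [wi - alpha * qi - beta * pi for wi, qi, pi in zip(w, q, q_prev)]
        alphas.append(alpha)
        beta = math.sqrt(sum(x * x for x in w))
        if beta < 1e-12:  # Krylov space exhausted, nothing left to add
            break
        betas.append(beta)
        q_prev, q = q, [x / beta for x in w]
    return alphas, betas

def absorption(alphas, betas, weight, omega, eta=0.05):
    """Lorentzian-broadened spectrum at frequency omega.

    Evaluates -Im[d^T (z - H)^-1 d] / pi at z = omega + i*eta as a
    continued fraction, from the last Lanczos coefficient upward.
    `weight` is the squared norm of the starting vector d.
    """
    z = complex(omega, eta)
    g = 0j
    last = len(alphas) - 1
    for j in range(last, -1, -1):
        tail = betas[j] ** 2 * g if j < last else 0j
        g = 1.0 / (z - alphas[j] - tail)
    return -weight * g.imag / math.pi

# Toy usage: a 3x3 symmetric matrix and a made-up stimulus vector.
H = [[1.0, 0.2, 0.0],
     [0.2, 2.0, 0.1],
     [0.0, 0.1, 3.5]]
d = [1.0, 0.5, 0.25]
alphas, betas = lanczos(H, d, steps=3)
weight = sum(x * x for x in d)
curve = [absorption(alphas, betas, weight, 0.1 * i) for i in range(50)]
```

The cost is dominated by one matrix-vector product per Lanczos step, and the number of steps can be far smaller than the matrix dimension when only a broadened curve is needed.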

“This method is a significant step forward because it allows us to model the absorption spectrum of molecular systems of hundreds of atoms at lower computational cost,” says Niranjan Govind, a computational chemist at the Pacific Northwest National Laboratory who collaborated with the Berkeley Lab team on the development of the method in the NWChem computational chemistry program.

Recently, Berkeley Lab scientists used this method to calculate the absorption spectrum and confirm what several experimental results had been hinting at: that the element berkelium breaks form with its heavy element peers by taking on an extra positive charge when bound to a synthetic organic molecule. This property could help scientists develop better methods for handling and purifying nuclear materials. A paper highlighting this result appeared April 10 in the journal *Nature Chemistry*.

“The experimental results were hinting at this unusual behavior in berkelium, but there wasn’t enough experimental evidence to say yes, 100 percent, this is what we’re seeing,” says study co-author Wibe Albert de Jong, a CRD scientist. “To be 100 percent sure, we did large computational simulations and compared them to the experimental data and determined that they were, indeed, seeing berkelium in an unusual oxidation state.”

This new algorithm was developed through a DOE Office of Science-supported Scientific Discovery through Advanced Computing (SciDAC) project focused on advancing software and algorithms for photochemical reactions. SciDAC projects typically bring together an interdisciplinary team of researchers to develop novel computational methods for tackling some of the most challenging scientific problems.

“The interdisciplinary nature of SciDAC is a very effective way to facilitate breakthrough science, as each team member brings a different perspective to problem solving,” says Yang. “In this dynamical environment, mathematicians, like me, team up with domain scientists to identify computational bottlenecks, then we use cutting-edge mathematical techniques to address and overcome those challenges.”

## Security statistics in the analysis of existing data in the visualization software

Modern data visualization software makes it easy for users to explore large datasets in search of interesting correlations and new discoveries. But that ease of use — the ability to ask question after question of a dataset with just a few mouse clicks — comes with a serious pitfall: it increases the likelihood of making false discoveries.

At issue is what statisticians refer to as “multiple hypothesis error.” The problem is essentially this: the more questions someone asks of a dataset, the more likely they are to stumble upon something that looks like a real discovery but is actually just a random fluctuation in the dataset.

A team of researchers from Brown University is working on software to help combat that problem. This week at the SIGMOD 2017 conference in Chicago, they presented a new system called QUDE, which adds real-time statistical safeguards to interactive data exploration systems to help reduce false discoveries.

“More and more people are using data exploration software like Tableau and Spark, but most of those users aren’t experts in statistics or machine learning,” said Tim Kraska, an assistant professor of computer science at Brown and a co-author of the research. “There are a lot of statistical mistakes you can make, so we’re developing techniques that help people avoid them.”

Multiple hypothesis testing error is a well-known issue in statistics. In the era of big data and interactive data exploration, the issue has gained renewed prominence, Kraska says.

“These tools make it so easy to query data,” he said. “You can easily test 100 hypotheses in an hour using these visualization tools. Without correcting for multiple hypothesis error, the chances are very good that you’re going to come across a correlation that’s completely bogus.”

There are well-known statistical techniques for dealing with the problem. Most of those techniques involve adjusting the level of statistical significance required to validate a particular hypothesis based on how many hypotheses have been tested in total. As the number of hypothesis tests increases, the evidence needed to judge a finding as valid becomes stricter as well (in practice, the p-value cutoff shrinks).
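As a rough illustration of why the correction matters, the sketch below computes the chance of at least one false positive across a batch of tests, and the per-test cutoff under the Bonferroni correction, one of the standard after-the-fact adjustments. It assumes the tests are independent and that every null hypothesis is true; the function names and the 0.05 level are illustrative choices, not taken from the QUDE system.

```python
def familywise_error(alpha: float, m: int) -> float:
    """Chance of at least one false positive among m independent tests
    when every null hypothesis is true and each test uses cutoff alpha."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test p-value cutoff that keeps the overall false-positive
    chance at or below alpha across m tests."""
    return alpha / m

# With 100 uncorrected tests at the 0.05 level, a spurious "finding"
# is close to certain; Bonferroni tightens each test's cutoff instead.
risk_uncorrected = familywise_error(0.05, 100)
cutoff_corrected = bonferroni_threshold(0.05, 100)
risk_corrected = familywise_error(cutoff_corrected, 100)
```

Under these assumptions the uncorrected risk for 100 tests is above 99 percent, while the corrected risk stays below 5 percent.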

But these correction techniques are nearly all after-the-fact adjustments. They’re tools that are used at the end of a research project after all the hypothesis testing is complete, which is not ideal for real-time, interactive data exploration.

“We don’t want to wait until the end of a session to tell people if their results are valid,” said Eli Upfal, a computer science professor at Brown and research co-author. “We also don’t want to have the system reverse itself by telling you at one point in a session that something is significant only to tell you later — after you’ve tested more hypotheses — that your early result isn’t significant anymore.”

Both of those scenarios are possible using the most common multiple hypothesis correction methods. So the researchers developed a different method for this project that enables them to monitor the risk of false discovery as hypothesis tests are ongoing.

“The idea is that you have a budget of how much false discovery risk you can take, and we update that budget in real time as a user interacts with the data,” Upfal said. “We also take into account the ways in which a user might explore the data. By understanding the sequence of their questions, we can adapt our algorithm and change the way we allocate the budget.”
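The budget idea can be sketched with a procedure in the style of alpha-investing (Foster and Stine), a known approach to testing hypotheses one at a time. Whether QUDE uses this exact rule is not stated in the article, so treat the class below as an assumption-laden illustration: the class name, the starting budget, the payout and the fixed spending fraction are all hypothetical.

```python
class AlphaInvesting:
    """Online hypothesis testing with a running budget ("wealth") of
    false-discovery risk. Illustrative sketch, not the QUDE algorithm."""

    def __init__(self, initial_wealth: float = 0.05, payout: float = 0.05,
                 spend_fraction: float = 0.25) -> None:
        self.wealth = initial_wealth          # risk budget still available
        self.payout = payout                  # budget earned per discovery
        self.spend_fraction = spend_fraction  # share of budget risked per test

    def test(self, p_value: float) -> bool:
        """Decide on one hypothesis now; the decision is never revised."""
        if self.wealth <= 0.0:
            return False
        cost = self.spend_fraction * self.wealth  # charged if not significant
        alpha_j = cost / (1.0 + cost)             # so alpha_j / (1 - alpha_j) == cost
        if p_value <= alpha_j:
            self.wealth += self.payout            # a discovery replenishes the budget
            return True
        self.wealth -= cost                       # a miss spends part of the budget
        return False

# Usage: feed p-values in the order the user asks questions.
session = AlphaInvesting()
decisions = [session.test(p) for p in (0.001, 0.5, 0.03, 0.2)]
```

Because each decision depends only on the budget at the moment of the test, an early "significant" result is never withdrawn later, which matches the behavior the researchers describe wanting.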

For users, the experience is similar to using any data visualization software, only with color-coded feedback that gives information about statistical significance.

“Green means that a visualization represents a finding that’s significant,” Kraska said. “If it’s red, that means to be careful; this is on shaky statistical ground.”

The system can’t guarantee absolute accuracy, the researchers say. No system can. But in a series of user tests using synthetic data for which the real and bogus correlations had been ground-truthed, the researchers showed that the system did indeed reduce the number of false discoveries users made.

The researchers consider this work a step toward a data exploration and visualization system that fully integrates a suite of statistical safeguards.

“Our goal is to make data science more accessible to a broader range of users,” Kraska said. “Tackling the multiple hypothesis problem is going to be important, but it’s also very difficult to do. We see this paper as a good first step.”

## Grouping of domain-aware mashup services based on LDA models and topics from multiple data sources

Mashup is emerging as a promising software development method that allows software developers to compose existing Web APIs into new or value-added composite Web services. However, the rapid growth in the number of available Mashup services makes it difficult for developers to select a Mashup service that satisfies their requirements. Although clustering-based Mashup discovery techniques show promise for improving the quality of Mashup service discovery, clustering Mashup services with high accuracy and good efficiency remains a challenging problem.

This paper proposes a novel domain-aware Mashup service clustering method that achieves high accuracy and good efficiency by exploiting an LDA topic model built from multiple data sources, in order to improve the quality of Mashup service discovery.

The proposed method first designs a domain-aware feature selection and reduction process for Mashup services, refining the characterization of their domains to consolidate domain relevance. It then presents an extended LDA topic model built from multiple data sources (including Mashup description text, Web APIs and tags) to infer the topic probability distribution of each Mashup service, which serves as the basis for computing Mashup service similarity. Finally, the K-means and AGNES algorithms are used to cluster Mashup services according to their similarities.
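The last two steps can be sketched as follows, assuming the topic distributions have already been inferred. The abstract does not say which similarity measure is used, so Jensen-Shannon divergence is an assumption here, as are average linkage and the toy topic vectors. The code is a minimal AGNES-style bottom-up clustering, not the paper's implementation.

```python
import math
from itertools import combinations

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two topic distributions.
    0 means identical, 1 means no overlap."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def agnes(points, k, dist):
    """Bottom-up clustering with average linkage: start with one cluster
    per item and merge the closest pair of clusters until k remain."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical topic distributions for four Mashup services over three topics.
topics = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
]
groups = agnes(topics, k=2, dist=js_divergence)
```

Services whose topic mixtures are close end up in the same group, so a developer searching within one group sees only Mashups about similar subjects.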

Experimental results show that, compared with existing Mashup service clustering methods, the proposed method achieves a significant improvement in precision, recall, F-measure, purity and entropy.
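Two of these measures, purity and entropy, are specific to clustering and can be computed from cluster assignments plus known category labels. The sketch below uses common textbook definitions; the paper may normalize entropy differently, and the example labels are made up for illustration.

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Fraction of items that carry the majority label of their cluster.
    Higher is better; 1.0 means every cluster holds a single category."""
    total = sum(len(c) for c in clusters)
    majority = sum(max(Counter(labels[i] for i in c).values()) for c in clusters)
    return majority / total

def entropy(clusters, labels):
    """Size-weighted average of the label entropy (base 2) in each cluster.
    Lower is better; 0.0 means every cluster holds a single category.
    Assumes every cluster is non-empty."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(labels[i] for i in c)
        h = -sum((n / len(c)) * math.log2(n / len(c)) for n in counts.values())
        result += (len(c) / total) * h
    return result

# Hypothetical example: two clusters of item indices and their true categories.
clusters = [[0, 1], [2, 3]]
labels = ["travel", "travel", "maps", "travel"]
quality = (purity(clusters, labels), entropy(clusters, labels))
```

In this toy case the first cluster is pure while the second mixes two categories, so purity falls below 1 and entropy rises above 0.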

The results of the proposed method help software developers improve the quality of Mashup service discovery and Mashup-based software development. Future work will need to extend the method by considering heterogeneous network information among Mashups, Web APIs, tags and users, and by applying it to Mashup discovery for software developers.

## The relationship between technical debt and corrective maintenance in PHP web applications

### Context

Technical Debt Management (TDM) refers to activities performed to prevent the accumulation of Technical Debt (TD) in software. The state of research on TDM lacks empirical evidence on the relationship between the amount of TD in a software module and the interest that it accumulates. Because a large portion of software applications have been deployed on the web in recent years, we focus this study on PHP applications.

### Objective

Although the relation between debt amount and interest is well-defined in traditional economics (i.e., interest is proportional to the amount of debt), this relation has not yet been explored in the context of TD. To this end, the aim of this study is to investigate the relation between the amount of TD and the interest that has to be paid during corrective maintenance.

### Method

To explore this relation, we performed a case study on 10 open source PHP projects. The obtained data have been analyzed to assess the relation between the amount of TD and two aspects of interest: (a) corrective maintenance (i.e., bug fixing) frequency, which translates to *interest probability*, and (b) corrective maintenance effort, which is related to *interest amount*.
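The abstract does not name the statistical test used. One common way to check whether two per-module measurements rise together is Spearman rank correlation, sketched below as an assumption rather than as the study's method. The module figures in the usage lines are invented purely for illustration.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Invented per-module figures: TD amount and number of bug-fixing commits.
td_amount = [12.0, 40.0, 7.5, 88.0, 23.0]
bug_fixes = [3, 9, 1, 14, 9]
rho = spearman(td_amount, bug_fixes)
```

A value of `rho` near +1 would indicate that modules with more TD also tend to see more corrective maintenance, which is the kind of positive relation the study reports.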

### Results

Both interest probability and interest amount are positively related to the amount of TD accumulated in a specific module. Moreover, the amount of TD is able to discriminate modules that are in need of heavy corrective maintenance.

### Conclusions

The results of the study confirm the cornerstone of TD research, which suggests that modules with a higher level of incurred TD are costlier in maintenance activities. In particular, such modules prove to be more defect-prone and consequently require more (corrective) maintenance effort.