10th International Conference on Evaluation and Assessment in Software Engineering
10-11 April 2006
|Time||Monday 10th April 2006|
Chair: Barbara Kitchenham
Chair: Teresa Baldassarre
Chair: Emilia Mendes
Chair: June Verner
|18:30||Reception and Conference Dinner|
|Time||Tuesday 11th April 2006|
Chair: Magne Jørgensen
Chair: David Budgen
|12:20-13:00||A preliminary empirical investigation of the use of evidence based software engineering|
Chair: Matthias Müller
|16:00||Coffee and Program Committee meeting|
The following papers will be presented as informal posters during the conference.
|Shantha Jayalal, Pearl Brereton, and Chris Hawksley||An empirical study of approaches to determining the Semantic Relatedness of web pages|
|Phil Woodall and Pearl Brereton||Conducting a Systematic Review from the Perspective of a Ph.D. Researcher|
|Marc Bartsch and Rachel Harrison||A Coupling Framework for AspectJ|
Simula Approach to Experimentation in Software Engineering
The ultimate goal of software engineering research is to support the private and public software industry in developing higher quality systems with improved timeliness in a more cost-effective and predictable way. One contribution of the empirical software engineering community to this overall goal is the conducting of experiments to evaluate and compare technologies (processes, methods, techniques, languages and tools) for planning, building and maintaining software. However, the applicability of the experimental results to industrial practice is, in most cases, hampered by the experiments’ lack of realism and scale regarding subjects, tasks, systems and environments. In this talk, I will discuss Simula Research Laboratory’s strategy for addressing this challenge:
(1) About 25% of our budget is used for hiring software consultants as experimental subjects, mainly at the expense of employing a larger number of researchers. In the last five years, about 800 professionals from 60 companies in several countries have participated in 25 experiments (some of them very large, in order to identify the variances between sub-populations) in which the professionals worked under various controlled circumstances, such as the complexity of tasks and systems, the tools used, whether they worked in pairs, and so on.
(2) A large investment in infrastructures and apparatus has been made to support the logistics of running large experiments and surveys, and to collect and organise data with minimal overhead.
(3) A senior project manager has been employed to organise the experiments and the resulting data. To increase flexibility and save administrative overhead, Simula hires people on a short-term basis for assistance with, for example, particularly large or complex experiments. They could be students for clerical work or consultants who are particularly qualified for certain tasks, for example, a statistician.
(4) Active collaboration with industry (in addition to hiring consultants), such as taking part in industry-managed research projects on software process improvement, and giving seminars and courses, has been considered important. The focus on publicising our research in the media and disseminating it through teaching has also resulted in Simula becoming well known in the Norwegian software industry.
(5) Software engineering is typically performed by humans in organisations. Hence, we have established research collaborations with other disciplines, such as psychology, sociology and management.
Experiences Using Systematic Review Guidelines
Mark Staples and Mahmood Niazi
A systematic review is a defined and methodical way to identify, assess and analyse published primary studies in order to investigate a specific research question. Kitchenham has recently published guidelines for software engineering researchers performing systematic reviews. The objective of our paper is to critique Kitchenham’s guidelines and to comment on systematic review generally with respect to our experiences conducting our first systematic review. Our perspective as neophytes may be particularly illuminating for other software engineering researchers who are also considering conducting their first systematic review. Overall we can recommend Kitchenham’s guidelines to other researchers considering systematic reviews. We caution researchers to clearly and narrowly define the research questions they will investigate by systematic review, to reduce the overall effort and to improve the quality of the selection of papers and extraction of data. In particular, we recommend defining complementary research questions that are not within the scope of the systematic review in order to clarify the boundaries of the specific research question of interest. An instance of this recommendation is that researchers should clearly define the unit of study for the systematic review.
Prediction of Overoptimistic Predictions
Magne Jørgensen and Bjørn Faugli
Overoptimistic predictions are common in software engineering projects, e.g., the average software project cost overrun is about 30%. This paper examines the use of two popular general tests of optimism (the ASQ and the LOT-R test) to select software engineers that are less likely to provide overoptimistic predictions. A necessary, but not sufficient, condition for this use is that there is a strong relationship between optimism score, as measured by the ASQ and LOT-R tests, and predictions. We report from two experiments on this topic. The experiments suggest that the relation between optimism score as measured by ASQ or LOT-R and predictions is too weak to enable a use of these optimism measurement instruments to select more realistic estimators in software organizations. Our results also suggest that a person’s general level of optimism and over-optimistic predictions of performance are, to a large extent, unrelated.
Simulation of Experiments for Data Collection – a
Per Runeson and Mattias Wiberg
Simulations can be used as a means of extending data collection from empirical studies. A simulation model is developed, based on the data from experiments, and new data is generated from the simulation model. This paper replicates an initial investigation by Münch and Armbrust, with the purpose of evaluating the generality of their approach. We replicate their study using data from two inspection experiments. We conclude that the replicated study corroborates the original one. The deviation between the detection rate of the underlying experiment and the simulation models was 2% for the original study and is 4% in the replicated study. Both figures are acceptable for using the approach further. Still, the model is based on some adjustment variables that cannot be directly interpreted in terms of the original experiment, and hence the model is subject to improvement.
Assessing the value of Architectural Information Extracted from Patterns for
Muhammad Ali Babar, Barbara Kitchenham, and Piyush Maheshwari
Background: We have developed an approach to identifying and capturing architecturally significant information from patterns (ASIP), which can be used to improve architecture design and evaluation.
Goal: Our goal was to evaluate whether the use of the ASIP provides more effective support in understanding or designing software architecture composed of the software design patterns which are the source of the ASIP compared with the original design pattern documentation.
Experimental design: Our subjects were 20 experienced software engineers who had returned to University for a postgraduate course. All participants were taking a course in software architecture. The participants were randomly assigned to two groups of equal size. Both groups performed two tasks: understanding the use of J2EE design patterns in a given architecture based on the quality requirements the architecture was supposed to satisfy, and designing a software architecture to satisfy a given set of quality requirements using J2EE design patterns. For the first task, one group (treatment group) was given ASIP information and the other (control group) was given the standard J2EE pattern documentation. For the second task, the treatment group became the control group and vice versa, and the type of support information was kept constant. The outcome variables were the numbers of correctly identified design patterns. The participants also completed a post-experiment questionnaire.
Result: The average score for the first task for the treatment group was 23.90 and for the control group was 13.80. The difference between the groups was significant using the Mann-Whitney test (p=0.0375). The average score for the second task for the treatment group was 26.85 and for the control group was 19.60. The Mann-Whitney test revealed that the difference between the groups was again significant (p=0.035). The post-study questionnaire revealed that 18 of the 20 participants believed that the ASIP was more helpful than pattern documentation for understanding and designing architectures.
Conclusion: Our results support the hypotheses that ASIP information is more helpful in understanding or designing software architectures using software design patterns than pattern documentation itself.
Reusability Ranking of Software components by Coupling Measure
Gui Gui and Paul D. Scott
This paper provides an account of new measures of coupling developed to assess the reusability of Java components retrieved from the internet by a search engine. These measures differ from the majority of established metrics in two respects: they reflect the degree to which entities are coupled or resemble each other, and they take account of indirect couplings or similarities. An empirical comparison of the new measures with eight established metrics is described. The new measure is shown to be consistently superior at ranking components according to their reusability.
A Systematic Review of Cross- vs. Within-Company Cost Estimation Studies
Barbara Kitchenham, Emilia Mendes and Guilherme H. Travassos
OBJECTIVE – The objective of this paper is to determine under what circumstances individual organisations would be able to rely on cross-company based estimation models.
METHOD – We performed a systematic review of studies that compared predictions from cross-company models with predictions from within-company models based on analysis of project data.
RESULTS – Ten papers compared cross-company and within-company estimation models; however, only seven of the papers presented independent results. Of those seven, three found that cross-company models were as good as within-company models, and four found that cross-company models were significantly worse than within-company models. Experimental procedures used by the studies differed, making it impossible to undertake formal meta-analysis of the results. The main trend distinguishing study results was that studies with small single company data sets (i.e. <20 projects) that used leave-one-out cross-validation all found that the within-company model was significantly more accurate than the cross-company model.
CONCLUSIONS – The results of this review are inconclusive. It is clear that some organisations would be ill-served by cross-company models whereas others would benefit. Further studies are needed, but they must be independent (i.e. based on different data bases or at least different single company data sets). In addition, experimenters need to standardise their experimental procedures to enable formal meta-analysis.
Assessing the Quality and Cleaning of a Software Project Data Set:
An Experience Report
Gernot Liebchen, Martin Shepperd, Bheki Twala, and Michelle Cartwright
OBJECTIVE - The aim is to report upon an assessment of the impact noise has on predictive accuracy by comparing noise handling techniques.
METHOD - We describe the process of cleaning a large software management dataset initially comprising more than 10,000 projects. The data quality is mainly assessed through feedback from the data provider and manual inspection of the data. Three methods of noise correction (polishing, noise elimination and robust algorithms) are compared with each other by assessing their accuracy. The noise detection was undertaken by using a regression tree model.
RESULTS - Three noise correction methods are compared and different results in their accuracy
CONCLUSIONS - The results demonstrated that polishing improves classification accuracy compared to the noise elimination and robust algorithms approaches.
Experiences of Performance Tuning Software Product Family Architectures Using
a Scenario-Driven Approach
Christian Del Rosso
Performance is an important non-functional quality attribute of a software system. The ability to deliver the expected performance objectives comes from a careful design and attention to detail. Unfortunately, performance is not always considered at the beginning. However, once built, software performance can still be improved by evaluating and tuning the software architecture. When analyzing the performance of a software product family, an understanding of its architectural properties is needed. A software product family architecture’s strength is based on common assets, platforms and source code shared by its family members. Software product family design allows improved time-to-market, software quality and software reuse. At the same time, variability is the factor that allows different products to be instantiated, and the handling of the variation points must be carefully managed. In this paper I present a scenario-driven approach for analyzing the performance of software product family architectures. The process of performance tuning has been applied to a Nokia software product family architecture and two case studies are presented. The evaluation process and the tradeoffs of evaluating software product family architectures are discussed.
Do Programmer Pairs make different Mistakes than Solo Programmers?
Matthias M. Müller
Objective: Comparison of program defects caused by programmer pairs and solo programmers.
Design: Analysis of programs developed during two counterbalanced experiments.
Setting: Programming lab at University.
Experimental Units: 42 programs developed by computer science students participating in an extreme programming lab course.
Main Outcome Measures: Programmer pairs make as many algorithmic mistakes but fewer expression mistakes than solo programmers.
Results: The second result is significant at the 5 percent level.
Conclusions: For simple problems, pair programming seems to lead to fewer mistakes than solo programming.
Software Modernization and Replacement Decision Making in Industry: A Qualitative
Miia-Maarit Saarelainen, Jarmo J. Ahonen, Heikki Lintinen, Jussi Koskinen, Irja Kankaanpää, Henna Sivula, Päivi Juutilainen, and Tero Tilus
Software modernization and replacement decisions are crucial to many organizations. They greatly affect the success and well-being of the organizations and their people. Such decisions are usually presumed to be rational and based on facts. These decisions and how they are made tell much about the decision makers and the decision making tools available to them. Interviews of 29 software modernization decision makers or senior experts were analyzed in order to find out how the decisions were made and what models and tools were used. It turned out that decisions are not as rational as supposed. Intuition is the dominant factor in decision making. Formal software engineering oriented decision support methods are not used. Most decision makers did not see intuition as a preferable way to make decisions. This might be because the preferred values are rationality and formality. Since the use of intuition is not particularly valued, it is not necessarily admitted or documented either. However, truthful description and justification of decisions is important from both the practical and ethical points of view.
A preliminary empirical investigation of the use of evidence based software engineering
Austen Rainer, Tracy Hall and Nathan Baddoo
Recently, Dybå, Jørgensen and Kitchenham have proposed a methodology, Evidence-Based Software Engineering (EBSE), that is intended to help researchers and practitioners evaluate software technologies in practice. We report the conduct of a preliminary empirical investigation of the reported use of Evidence-Based Software Engineering by 15 final-year undergraduate students. The investigation produced inconsistent results: our quantitative data suggests that students are making good use of some of the EBSE guidelines, but our qualitative evidence suggests that the students are not using the guidelines properly. Possible explanations for these results include: the superficial use of EBSE by the students; the limited ‘power’ of our research instruments to assess the use of EBSE guidelines; the difficulties of combining a required coursework with a piece of research; and the different expectations of what makes good evaluations between professional researchers and novice software practitioners.
Trust in Software Outsourcing Relationships: An
Analysis of Vietnamese Practitioners’ View
Phong Thanh Nguyen, Muhammad Ali Babar, June M. Verner
Trust is considered one of the most important factors for successfully managing software outsourcing relationships. However, there is a lack of research into factors that are considered important in establishing and maintaining trust between clients and vendors. The goal of this research is to gain an understanding of vendors’ perceptions of the importance of factors that are critical to the establishment and maintenance of trust in software outsourcing projects in Vietnam. We used a multiple case study methodology to guide our research and in-depth interviews to collect qualitative data. The participants of the study were 12 Vietnamese software development practitioners drawn from 8 companies that have been developing software for offshore clients. Vendor companies identified that cultural understanding, credibility, capabilities, and personal visits are important factors in gaining the initial trust of a client, while cultural understanding, communication strategies, contract conformance, and timely delivery are vital factors in maintaining that trust. We also identify similarities and differences between Vietnamese and Indian practitioners’ views on factors affecting trust relationships.
Software Product Lines in Value Based Software Engineering
Maria Teresa Baldassarre, Danilo Caivano, and Giuseppe Visaggio
Objective : Evaluate the value of a product line in terms of maintainability, extensibility and configurability with reference to the interested stakeholders: customers, maintainers, producers.
Rationale : There are values that customers constantly require in a modern software application. Some of these values are supported by product lines. Nevertheless, in the industrial and scientific communities the conjecture that customer values clash with those of producers/maintainers is widespread.
Design of Study : We designed and carried out a case study in an industrial context on an ongoing project to verify the validity of a product line in creating value for stakeholders. Data was collected as the project was being executed over a nine-month period. Then, descriptive statistics and hypothesis testing were carried out.
Results : experience acquired during the execution of an industrial project has allowed the authors to point out the differences between program families and software product lines. Also, the case study has shown how product lines contribute to stakeholder value proposition elicitation and reconciliation.
Conclusions : This study has represented a first step towards analyzing the value that product lines represent for various stakeholders.
"The Impossible will take a little while": pragmatic approaches to evidence
The evidence based practice paradigm has been in existence for just under fifteen years. Over this time period it has come a long way, very quickly. It has permeated from medicine, through health, to other areas of public practice such as education, social care and management. It has also engaged the attention of more technical disciplines such as information systems, library and information practice and, of course, software engineering. Each domain has "tinkered and tailored" the techniques of systematic review and critical appraisal but the basic model for evidence based practice remains essentially unchallenged. Unfortunately this model presupposes a near infinite stream of reviewers on a near infinite supply of keyboards providing rigorous yet uncertain answers, using inadequate primary data and generating endless recommendations for further research. The speaker, who has been involved in evidence based practice since its early development in the UK and who remains a staunch advocate and champion, will draw upon his experience of evidence based healthcare and evidence based library and information practice to suggest pragmatic ways for advancing the paradigm. He will argue that evidence based practice is currently at a crossroads and decisions made now on how to channel our efforts and enthusiasms effectively and appropriately will have longstanding implications for the development of the evidence base for our individual disciplines.