The Artificial Characters Learning Problem
This database has been artificially generated by using a first order
theory, which describes the structure of ten capital letters of the
English alphabet, and a random-choice theorem prover, which accounts
for heterogeneity in the instances. The capital letters represented
are the following: A, C, D, E, F, G, H, L, P, R.
Each instance is structured and is described by a set of segments (lines)
which resemble the way an automatic program would segment an image.
In turn, each segment in an instance is described by seven attributes: four
of them are really necessary, one is superfluous, and the remaining two can
be computed from the necessary ones but are included for efficiency.
Instances are stored in tabular form: rows describe the segments, and
columns correspond to attribute values, according to the following format:
- #id: the number of the instance.
- #class: an integer indicating the class, as described below.
- #objnum: an integer identifier of a segment (starting from 0) within the
  instance.
- type: specifies the type of the segment and is always set to the string
  "line". Its C language type is char.
- xx1, yy1, xx2, yy2: the initial and final coordinates of the segment in a
  Cartesian plane. Their C language type is int.
- size: the length of the segment, computed as the Euclidean distance between
  the two endpoints A(xx1,yy1) and B(xx2,yy2). Its C language type is float.
- diag: the length of the diagonal of the smallest rectangle which includes
  the picture of the character. The value of this attribute is the same for
  every segment of an instance. Its C language type is float. (A sketch of
  how size and diag relate to the coordinates is given after this list.)
There are no missing attribute values.
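For reference, the two redundant attributes can be recomputed from the
coordinate attributes. The following Python sketch is not part of the
original distribution, and the example coordinates are made up; it assumes
that diag is taken over the axis-aligned bounding rectangle of all segment
endpoints of a character:

    import math

    def segment_length(xx1, yy1, xx2, yy2):
        # "size": Euclidean distance between the segment endpoints.
        return math.hypot(xx2 - xx1, yy2 - yy1)

    def character_diagonal(segments):
        # "diag": diagonal of the smallest axis-aligned rectangle that
        # encloses every endpoint of the character's segments.
        xs = [c for (xx1, yy1, xx2, yy2) in segments for c in (xx1, xx2)]
        ys = [c for (xx1, yy1, xx2, yy2) in segments for c in (yy1, yy2)]
        return math.hypot(max(xs) - min(xs), max(ys) - min(ys))

    # Example: a character made of two segments (invented coordinates).
    segments = [(0, 0, 0, 10), (0, 10, 5, 10)]
    print(segment_length(*segments[0]))   # 10.0
    print(character_diagonal(segments))   # sqrt(5^2 + 10^2), about 11.18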
Up to now, seven data files are available: six files containing 1000
instances each (100 per class), and an independent test set containing
10000 instances (1000 per class).
Note: in each file instances are numbered from 1 to 1000, so you may need to
renumber them when putting together two or more folds.
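One possible way to do the renumbering is sketched below in Python. The file
names are hypothetical, and the sketch assumes whitespace-separated columns
with the instance id in the first column; adjust the parsing to the actual
file layout if it differs:

    # Concatenate several fold files, shifting instance ids so that they
    # remain unique in the merged file.
    def merge_folds(paths, out_path):
        offset = 0
        with open(out_path, "w") as out:
            for path in paths:
                max_id = 0
                with open(path) as f:
                    for line in f:
                        fields = line.split()
                        if not fields:
                            continue
                        inst_id = int(fields[0]) + offset
                        max_id = max(max_id, inst_id)
                        out.write(" ".join([str(inst_id)] + fields[1:]) + "\n")
                offset = max_id

    merge_folds(["fold0.dat", "fold1.dat"], "learning_set.dat")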
The class value (CLASS) can take ten different values, and each letter
corresponds to exactly one class. Instances are evenly distributed over the
classes, as shown in the following table:
CLASS  NAME  FOLD 0  FOLD 1  ...  FOLD 5  TESTSET   TOTAL
  1     A      100     100   ...    100     1000     1600
  2     C      100     100   ...    100     1000     1600
  3     D      100     100   ...    100     1000     1600
  4     E      100     100   ...    100     1000     1600
  5     F      100     100   ...    100     1000     1600
  6     G      100     100   ...    100     1000     1600
  7     H      100     100   ...    100     1000     1600
  8     L      100     100   ...    100     1000     1600
  9     P      100     100   ...    100     1000     1600
 10     R      100     100   ...    100     1000     1600
-----------------------------------------------------------
TOTAL         1000    1000   ...   1000    10000    16000
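The class-to-letter correspondence in the table can be written as a small
lookup, for example in Python:

    # Mapping from the integer CLASS value to the letter it represents,
    # as listed in the table above.
    CLASS_TO_LETTER = {
        1: "A", 2: "C", 3: "D", 4: "E", 5: "F",
        6: "G", 7: "H", 8: "L", 9: "P", 10: "R",
    }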
We have performed many experiments on these data sets using Smart+ and,
more recently, combining Smart+ with two numerical knowledge refinement
systems we developed. In the experiments performed with Smart+ and reported
in the paper by M. Botta and A. Giordana:
"SMART+: A MultiStrategy Learning Tool", Proc. of IJCAI-93 (Chambery, France, 1993), pp. 937-943,
we used the first fold as the learning set and the remaining five folds as
the test set (at that time the independent test set was not available).
The results obtained on this data set are the following:
Type of Optimization Recognition Rate Error Rate Ambiguity Rate
No OPT 41.48% 3.82% 54.70%
Local OPT 98.68% 0.12% 1.20%
Local+GA OPT 99.70% 0.0% 0.30%
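The three rates in each row sum to 100%. As a rough illustration only (the
exact convention used by SMART+ may differ), one can compute such rates by
assuming that the classifier returns a set of candidate classes per instance:
"recognition" when exactly the true class is returned, "error" when a single
wrong class is returned, and "ambiguity" otherwise:

    # Hedged sketch of the three rates under the assumptions stated above.
    def rates(predictions, truths):
        recognized = errors = ambiguous = 0
        for pred, true in zip(predictions, truths):
            if len(pred) == 1:
                if true in pred:
                    recognized += 1
                else:
                    errors += 1
            else:
                ambiguous += 1
        n = len(truths)
        return (100.0 * recognized / n,
                100.0 * errors / n,
                100.0 * ambiguous / n)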
In a more recent paper, by M. Botta, A. Giordana, and R. Piola:
"An integrated framework for learning numerical terms in FOL", Proc.
of ECAI-98 (Brighton, UK, 1998), pp. 415-419, we report preliminary
results on using Smart+ to acquire an initial knowledge base that is then
refined by FONN and NTR, two refinement systems based on gradient descent
over the error.