CroALa DokuWiki

Notes on trying to install PhiloLogic 4

Source and instructions: http://github.com/ARTFL-Project/PhiloLogic4

Objective: install PhiloLogic 4, process XML files, and serve the database locally (on localhost).

Update: installing PhiloLogic 4.1 beta3

Sep 19 2015

It was best to make a clean install from the Aug 18 release: https://github.com/ARTFL-Project/PhiloLogic4/releases.

Then it was simply a matter of building PhiloLogic again, following the instructions: https://github.com/ARTFL-Project/PhiloLogic4/blob/master/docs/installation.md.

Afterwards, I copied the www directory to a test location (cp -r www ~/philo4test) and built the test database by running python load_script.py test *.xml.

Everything worked beautifully.

Local setup

apache2 apache2.conf

In /etc/apache2/apache2.conf I added this:

<Directory /home/neven/public_html/>
        Options Indexes FollowSymLinks
        Options +ExecCGI
        AllowOverride None
        Require all granted
</Directory>

apache2 philologic.conf

<VirtualHost *:80>
    ServerAdmin xxx@xxx.hr
    DocumentRoot /home/neven/public_html
    <Directory /home/neven/public_html/>
        Options Indexes FollowSymLinks MultiViews
        Options +ExecCGI
        AllowOverride All
        Order allow,deny
        allow from all
        AddHandler cgi-script .py
        # test python scripts:
        SetHandler mod_python
        PythonHandler mod_python.publisher
    </Directory>

    # Possible values include: debug, info, notice, warn, error, crit,
    # alert, emerg.
    LogLevel warn

    CustomLog ${APACHE_LOG_DIR}/access.log combined
</VirtualHost>

Test Python script

The script test.py runs in the browser when called at http://localhost/test.py:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
 
# enable debugging
import cgitb
cgitb.enable()
 
print "Content-Type: text/plain;charset=utf-8"
print 
print "Hello World!"

The same script also works when placed in the /home/neven/public_html/philologic directory and called from the address http://localhost/philologic/test.py.
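The test script above uses Python 2 print statements. As a sketch only, the same smoke test could be written so that it runs unchanged under Python 2 or Python 3; the helper name build_response is made up for this illustration and is not part of PhiloLogic.

```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""Minimal CGI smoke test; a sketch, not part of PhiloLogic."""
import sys


def build_response(body="Hello World!"):
    # A CGI response is the header block, one blank line, then the body.
    # build_response is a hypothetical helper used only in this sketch.
    return "Content-Type: text/plain;charset=utf-8\n\n" + body + "\n"


if __name__ == "__main__":
    sys.stdout.write(build_response())
```

As with test.py, the file would need to be executable and placed in a directory where Apache is allowed to run CGI scripts.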

PhiloLogic load_script.py

What I changed in the script supplied on GitHub (https://github.com/ARTFL-Project/PhiloLogic4/blob/master/extras/load_script.py):

# Set the filesystem path to the root web directory for your PhiloLogic install.
database_root = "/home/neven/public_html/philologic/"
 
# Set the URL path to the same root directory for your philologic install.
url_root = "http://localhost/philologic/"
 
template_dir = "/home/neven/philo4test/"
# The load process will fail if you haven't set up the template_dir at the correct location.

The template_dir line had to be changed from the following original value, with which the load never finished:

template_dir = database_root + "_system_dir/_install_dir/"
# The load process will fail if you haven't set up the template_dir at the correct location.

I don't understand what the following line does. Should it be changed if the XML documents contain only div elements? Must the source files contain div1, div2, etc.?

# Define navigable objects
navigable_objects = ('doc', 'div1', 'div2', 'div3')

These XPaths are appropriate for our XML files:

xpaths = [
    ("doc", "."),
    ("div", ".//text/body/div"),
    ("div", ".//text/body/div/div"),
    ("div", ".//text/body/div/div/div"),
    ("para", ".//l"),
    ("page", ".//pb"),
]
 
metadata_xpaths = [ # metadata per type.  '.' is in this case the base element for the type, as specified in XPaths above.
    # MUST MUST MUST BE SPECIFIED IN OUTER TO INNER ORDER--DOC FIRST, WORD LAST
    ("doc","./teiHeader//titleStmt/title","title"),
    ("doc","./teiHeader//titleStmt/author","author"),
    ("doc", "./teiHeader//profileDesc[1]/creation/date/@period", "date"),
    ("div","./head","head"),
    ("div",".@n","n"),
    ("div",".@id","id"),
    ("para", ".@n", "n"),
    ("page",".@n","n"),
    ("page",".@fac","img")
]
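Before loading, it can help to check that the element paths actually match something in the source files. The following is a rough sketch written for Python 3, using the standard library's ElementTree rather than PhiloLogic's own parser, so its matching rules may differ from the loader's. The sample document is made up, and it has no namespace; in files that declare the TEI namespace these plain paths would match nothing in ElementTree.

```python
import xml.etree.ElementTree as ET

# A tiny made-up TEI-like document (no namespace), for illustration only.
SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc><titleStmt>
      <title>Sample title</title>
      <author>Sample author</author>
    </titleStmt></fileDesc>
  </teiHeader>
  <text><body>
    <div n="1"><head>First</head>
      <l n="1">arma virumque cano</l>
      <div n="1.1"><l n="2">Troiae qui primus ab oris</l></div>
    </div>
  </body></text>
</TEI>"""

# Element paths copied from the xpaths list above.
PATHS = [
    ".//text/body/div",
    ".//text/body/div/div",
    ".//text/body/div/div/div",
    ".//l",
    ".//pb",
]


def count_matches(xml_string, paths):
    """Return {path: number of matching elements} for one document."""
    root = ET.fromstring(xml_string)
    return {path: len(root.findall(path)) for path in paths}


if __name__ == "__main__":
    for path, n in count_matches(SAMPLE, PATHS).items():
        print("%-28s %d" % (path, n))
```

A path that reports zero matches in every file (here the third div level and pb) is a candidate explanation for objects or metadata fields that the loader later reports as missing.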

Running load_script.py

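Bash output. Parsing seems to have succeeded, but have no words, objects or pages been found? (The "returned 0" lines may simply be exit statuses of the sort and join steps; the later line "4763 hits packed in 2930 entries" suggests that words were in fact indexed.)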

neven@figulus:~/philo4test$ python load_script.py proba ~/philo4test/*.xml
chmod: cannot access '/home/neven/public_html/philologic/proba/app/assets/css': No such file or directory
chmod: cannot access '/home/neven/public_html/philologic/proba/app/assets/js': No such file or directory
 
### Parsing files ###
Sat May 23 10:53:30 2015: parsing 3 files.
Sat May 23 10:53:30 2015: parsing 1 : ian-pan-diomed.xml
Sat May 23 10:53:30 2015: parsing 2 : milasin-f-viator.xml
Sat May 23 10:53:31 2015: parsing 3 : niger-t-divin.xml
Sat May 23 10:53:32 2015: done parsing
 
### Merge parser output ###
Sat May 23 10:53:32 2015: sorting words
Sat May 23 10:53:32 2015: word sort returned 0
Sat May 23 10:53:32 2015: sorting objects
Sat May 23 10:53:32 2015: object sort returned 0
Sat May 23 10:53:32 2015: joining pages
Sat May 23 10:53:32 2015: word join returned 0
 
### Create inverted index ###
[3, 1, 4, 1, 222, 3, 10, 36035, 1]
[2, 1, 3, 1, 8, 2, 4, 16, 1]
38 bits wide.
freq1: 10; 4 bits
freq2: 94; 7 bits
offst: 69632; 17 bits
Sat May 23 10:53:32 2015: analysis done
reading dbspecs in from /home/neven/public_html/philologic/proba/data/WORK/dbspecs4.h...
done. 4763 hits packed in 2930 entries.
Sat May 23 10:53:33 2015: all indices built. moving into place.
 
### Post-processing filters ###
Loading the pages SQLite table... done.
Loading the toms SQLite table... done.
Generating word frequencies... done.
Generating normalized word frequencies... done.
Generating metadata frequencies...
Generated metadata frequencies for title
Generated metadata frequencies for author
Generated metadata frequencies for filename
Generated metadata frequencies for head
The following fields were not found in the input corpus date, n, id
done.
Generating normalized metadata frequencies... done.
 
### Finishing up ###
wrote database info to /home/neven/public_html/philologic/proba/data/db.locals.py.
wrote Web application info to /home/neven/public_html/philologic/proba/data/web_config.cfg.

Inspecting web_config.cfg, I find traces of my files having been processed:

# search_examples = {"author": "Jean-Jacques Rousseau", "title": "Du contrat social"}
search_examples = {'head': u'Diomedis et Glauci Congressus. Locus ex Homeri Iliade Z. a versu 119. sequentes Latino carmine redditus.', 'author': u'Homer', 'filename': u'ian-pan-diomed.xml', 'title': u'Diomedis et Glauci congressus, versio electronica'}

Also, the files in the frequencies directory contain word frequencies and similar counts:

et	94
in	55
nec	37
sed	27
qui	27
ut	24
ad	21
hic	20
cum	20
nam	19
vix	18
quem	18
haec	16
ab	15
ergo	14
tibi	13
hoc	13
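To inspect such a listing without opening it by hand, a short script is enough. This is a sketch for Python 3 that assumes the format shown above (one word, a tab, and a count per line); the sample data are the first lines of the listing, and the function names are invented for the example.

```python
import io

# First lines of the frequency listing shown above (tab-separated).
SAMPLE = "et\t94\nin\t55\nnec\t37\nsed\t27\nqui\t27\n"


def read_frequencies(handle):
    """Parse 'word<TAB>count' lines into a list of (word, count) tuples."""
    pairs = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        pairs.append((word, int(count)))
    return pairs


def top_words(pairs, n):
    """Return the n most frequent words, highest count first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


if __name__ == "__main__":
    freqs = read_frequencies(io.StringIO(SAMPLE))
    print(top_words(freqs, 3))
```

For a real file, the io.StringIO object would be replaced by an open file handle on the frequency file in question.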

I cannot access the database through the browser, however.

Testing browser access

Navigating to http://localhost/philologic/proba/ results in a 500 Internal Server Error ("The server encountered an internal error or misconfiguration and was unable to complete your request.").

The log reads:

[Sat May 23 11:41:17.750998 2015] [core:alert] [pid 7565] [client ::1:42328] 
/home/neven/public_html/philologic/proba/.htaccess: Invalid command 'RewriteEngine', 
perhaps misspelled or defined by a module not included in the server configuration

After enabling mod_rewrite (sudo a2enmod rewrite, then sudo service apache2 restart), the browser still shows the same 500 error, although error.log no longer mentions 'RewriteEngine'.
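Whether a module is really enabled can be checked directly. The sketch below (Python 3) assumes the Debian/Ubuntu Apache layout, in which a2enmod places a link named after the module, such as rewrite.load, in /etc/apache2/mods-enabled; the function name is invented for the example.

```python
import os


def module_enabled(name, mods_enabled="/etc/apache2/mods-enabled"):
    """Return True if <name>.load exists in the mods-enabled directory.

    Assumes the Debian/Ubuntu layout, where a2enmod creates a symlink
    named <name>.load in mods-enabled.
    """
    return os.path.exists(os.path.join(mods_enabled, name + ".load"))


if __name__ == "__main__":
    state = "enabled" if module_enabled("rewrite") else "not enabled"
    print("rewrite: " + state)
```

The check only looks at the configuration directory; Apache still has to be restarted before a newly enabled module takes effect.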

And here is the access.log:

::1 - - [23/May/2015:11:41:17 +0200] "GET /philologic/proba/ HTTP/1.1" 500 801 "-" 
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) 
Chrome/42.0.2311.152 Safari/537.36"

dispatcher.py problems

After changing the configuration in apache2 philologic.conf, here is the relevant part of error.log. It seems that Apache does run dispatcher.py, but something is still wrong:

[Sat May 23 19:34:32.514301 2015] [cgi:error] [pid 15092] [client ::1:48088] AH01215: Traceback 
(most recent call last):
[Sat May 23 19:34:32.514467 2015] [cgi:error] [pid 15092] [client ::1:48088] AH01215:   
File "/home/neven/public_html/philologic/proba/dispatcher.py", line 4, in <module>
[Sat May 23 19:34:32.514549 2015] [cgi:error] [pid 15092] [client ::1:48088] AH01215:     
import reports
[Sat May 23 19:34:32.514592 2015] [cgi:error] [pid 15092] [client ::1:48088] AH01215: 
ImportError: No module named reports
[Sat May 23 19:34:32.517085 2015] [cgi:error] [pid 15092] [client ::1:48088] End of 
script output before headers: dispatcher.py

Permissions in philologic/proba directory

All the necessary directories seem to be present in the database directory (philologic/proba), but they did not have the necessary permissions. Output of ls -l:

neven@figulus:~/public_html/philologic/proba$ ls -l
total 112
drwx------ 5 neven www-data  4096 May 20 16:09 app
drwxr-sr-x 7 neven www-data  4096 May 23 10:53 data
-rwxr-xr-x 1 neven www-data  1508 May 19 22:06 dispatcher.py
-rwxr-xr-x 1 neven www-data  1150 May 19 22:06 favicon.ico
drwx------ 2 neven www-data  4096 May 20 16:09 functions
-rw-r--r-- 1 neven www-data 14946 May 23 10:53 ian-pan-diomed.xml
-rwxr-xr-x 1 neven www-data   110 May 23 11:02 __init__.py
-rw-r--r-- 1 neven www-data  4981 May 23 10:53 load_script.py
-rw-r--r-- 1 neven www-data 17693 May 23 10:53 milasin-f-viator.xml
-rw-r--r-- 1 neven www-data 36133 May 23 10:53 niger-t-divin.xml
drwx------ 2 neven www-data  4096 May 20 16:09 reports
drwx------ 2 neven www-data  4096 May 20 16:09 scripts

Following a recipe on Stack Overflow (http://stackoverflow.com/questions/3740152/how-to-set-chmod-for-a-folder-and-all-of-its-subfolders-and-files-in-linux-ubunt), the directory permissions were changed:

neven@figulus:~/public_html/philologic/proba$ find /home/neven/public_html/philologic/proba -type d -exec chmod 755 {} \;
neven@figulus:~/public_html/philologic/proba$ ls -l
total 112
drwxr-xr-x 5 neven www-data  4096 May 20 16:09 app
drwxr-sr-x 7 neven www-data  4096 May 23 10:53 data
-rwxr-xr-x 1 neven www-data  1508 May 19 22:06 dispatcher.py
-rwxr-xr-x 1 neven www-data  1150 May 19 22:06 favicon.ico
drwxr-xr-x 2 neven www-data  4096 May 20 16:09 functions
-rw-r--r-- 1 neven www-data 14946 May 23 10:53 ian-pan-diomed.xml
-rwxr-xr-x 1 neven www-data   110 May 23 11:02 __init__.py
-rw-r--r-- 1 neven www-data  4981 May 23 10:53 load_script.py
-rw-r--r-- 1 neven www-data 17693 May 23 10:53 milasin-f-viator.xml
-rw-r--r-- 1 neven www-data 36133 May 23 10:53 niger-t-divin.xml
drwxr-xr-x 2 neven www-data  4096 May 20 16:09 reports
drwxr-xr-x 2 neven www-data  4096 May 20 16:09 scripts
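The same directory-only permission change can be sketched in Python 3, for cases where it should be part of a setup script. This is an illustration, not something PhiloLogic provides, and the function name is made up. Unlike a bare os.chmod(path, 0o755), it keeps special bits such as the setgid bit, which the listing above shows still set on the data directory after the find command.

```python
import os
import stat
import sys


def make_dirs_traversable(root, mode=0o755):
    """Apply `mode` to root and every directory below it.

    Files are left alone, like `find -type d`. Special bits (setuid,
    setgid, sticky) already present on a directory are preserved.
    Returns the number of directories changed.
    """
    changed = 0
    for dirpath, _dirnames, _filenames in os.walk(root):
        current = stat.S_IMODE(os.lstat(dirpath).st_mode)
        os.chmod(dirpath, (current & 0o7000) | mode)
        changed += 1
    return changed


if __name__ == "__main__":
    # Only act when a directory is given explicitly on the command line.
    if len(sys.argv) > 1:
        print(make_dirs_traversable(sys.argv[1]))
```

Mode 755 makes every directory listable and traversable by all users, which is what lets the Apache user reach modules such as reports.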

There was some progress. Navigating to http://localhost/philologic/proba now results in the following message:

A server error occurred.  Please contact the administrator.

The Apache error.log says (among other things):

[Sun May 24 19:01:56.792370 2015] [cgi:error] [pid 2684] [client ::1:46878] AH01215: 
IOError: [Errno 13] Permission denied: 
'/home/neven/public_html/philologic/proba/app/assets/css/philoLogic.css'
[Sun May 24 19:02:07.475799 2015] [autoindex:error] [pid 2686] [client ::1:46882] 
AH01276: Cannot serve directory /home/neven/public_html/philologic/: No matching 
DirectoryIndex (index.html,index.cgi,index.pl,index.php,index.xhtml,index.htm) found, 
and server-generated directory index forbidden by Options directive

The scripts seem to refer to a philoLogic.css, which I cannot find anywhere (not on GitHub either).

Perhaps it should have been built while loading the database?

But when I try to rebuild the database (or, more precisely, to build a new one), I get the following error from load_script.py:

chmod: cannot access '/home/neven/public_html/philologic/proba2/app/assets/css': 
No such file or directory
chmod: cannot access '/home/neven/public_html/philologic/proba2/app/assets/js': 
No such file or directory

Update: start all over again

This time, clone the PhiloLogic4 GitHub repository:

neven@figulus:~$ git clone https://github.com/ARTFL-Project/PhiloLogic4.git

Check permissions on www:

neven@figulus:~/PhiloLogic4$ ls -l www
total 28
drwxr-xr-x 5 neven neven 4096 May 26 18:51 app
-rwxr-xr-x 1 neven neven 1508 May 26 18:51 dispatcher.py
-rwxr-xr-x 1 neven neven 1150 May 26 18:51 favicon.ico
drwxr-xr-x 2 neven neven 4096 May 26 18:51 functions
-rwxr-xr-x 1 neven neven   87 May 26 18:51 __init__.py
drwxr-xr-x 2 neven neven 4096 May 26 18:51 reports
drwxr-xr-x 2 neven neven 4096 May 26 18:51 scripts

Copy www to ~/phtest; check permissions.

neven@figulus:~/PhiloLogic4$ ls -l ~/phtest/www
total 28
drwxr-xr-x 5 neven neven 4096 May 26 18:53 app
-rwxr-xr-x 1 neven neven 1508 May 26 18:53 dispatcher.py
-rwxr-xr-x 1 neven neven 1150 May 26 18:53 favicon.ico
drwxr-xr-x 2 neven neven 4096 May 26 18:53 functions
-rwxr-xr-x 1 neven neven   87 May 26 18:53 __init__.py
drwxr-xr-x 2 neven neven 4096 May 26 18:53 reports
drwxr-xr-x 2 neven neven 4096 May 26 18:53 scripts

Change load_script.py, in particular so that template_dir points to where I copied the files from the GitHub repo.

template_dir = "/home/neven/phtest/www/"
# The load process will fail if you haven't set up the template_dir at the correct location.
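Since the load fails when template_dir is wrong, a quick check before running load_script.py can save a failed build. The sketch below (Python 3) is not part of PhiloLogic; the list of expected entries is an assumption pieced together from the ls -l www listing and from the paths in the earlier chmod errors, and the function name is invented.

```python
import os
import sys

# Assumed contents of a usable template directory, taken from the
# `ls -l www` listing and the chmod error messages in these notes.
EXPECTED = [
    "dispatcher.py",
    "functions",
    "reports",
    "scripts",
    os.path.join("app", "assets", "css"),
    os.path.join("app", "assets", "js"),
]


def missing_template_entries(template_dir, expected=EXPECTED):
    """Return the expected entries that do not exist under template_dir."""
    return [
        entry
        for entry in expected
        if not os.path.exists(os.path.join(template_dir, entry))
    ]


if __name__ == "__main__":
    # Usage: python check_template.py /home/neven/phtest/www/
    if len(sys.argv) > 1:
        missing = missing_template_entries(sys.argv[1])
        print("missing: " + (", ".join(missing) if missing else "nothing"))
```

An empty result does not guarantee that the load will succeed; it only rules out the missing-directory case seen earlier.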

Now the output of python load_script.py test *.xml looks much more promising:

neven@figulus:~/ph4test$ python load_script.py test *.xml
 
### Parsing files ###
Tue May 26 19:12:09 2015: parsing 3 files.
Tue May 26 19:12:09 2015: parsing 1 : ian-pan-diomed.xml
Tue May 26 19:12:09 2015: parsing 2 : milasin-f-viator.xml
Tue May 26 19:12:09 2015: parsing 3 : niger-t-divin.xml
Tue May 26 19:12:11 2015: done parsing
 
### Merge parser output ###
Tue May 26 19:12:11 2015: sorting words
Tue May 26 19:12:11 2015: word sort returned 0
Tue May 26 19:12:11 2015: sorting objects
Tue May 26 19:12:11 2015: object sort returned 0
Tue May 26 19:12:11 2015: joining pages
Tue May 26 19:12:11 2015: word join returned 0
 
### Create inverted index ###
[3, 1, 4, 1, 222, 3, 10, 36035, 1]
[2, 1, 3, 1, 8, 2, 4, 16, 1]
38 bits wide.
freq1: 10; 4 bits
freq2: 94; 7 bits
offst: 69632; 17 bits
Tue May 26 19:12:11 2015: analysis done
reading dbspecs in from /home/neven/public_html/philologic/test/data/WORK/dbspecs4.h...
done. 4763 hits packed in 2930 entries.
Tue May 26 19:12:11 2015: all indices built. moving into place.
 
### Post-processing filters ###
Loading the pages SQLite table... done.
Loading the toms SQLite table... done.
Generating word frequencies... done.
Generating normalized word frequencies... done.
Generating metadata frequencies...
Generated metadata frequencies for title
Generated metadata frequencies for author
Generated metadata frequencies for filename
Generated metadata frequencies for head
The following fields were not found in the input corpus date, n, id
done.
Generating normalized metadata frequencies... done.
 
### Finishing up ###
wrote database info to /home/neven/public_html/philologic/test/data/db.locals.py.
wrote Web application info to /home/neven/public_html/philologic/test/data/web_config.cfg.

Success!

z/philologic4.txt · Last modified: 2015/09/19 11:50 by njovanovic