I have this log tool that log downloads from a web page.
But so many downloads are made by spiders and/or crawlers… I’d like to mark these as “crawlers”.
To do this, I look for the date / time when the download is made.
This is a typical download. As you can tell, these are made with only a few seconds apart.
Obviously, that’s not human made.
My thought is, if I can count all posts that are made the same minute, then that post would be made from a human!
“SELECT COUNT(*) FROM [myLogTable] WHERE dteCreated = …”
Can’t you log the user-agents? Usually you can detect spiders and crawlers that way
Yes! User-agents are logged and that’s neat, but there are new crawlers all the time and I thought this might be a neat way to automate the process of finding crawlers…
Of-course, it’s possible to use a list of crawlers to pick from. Sure!
Dates are this difficult!? Hmm… I like the challenge…!!
You could group on everything but the seconds:
select substr(dteCreated, 1, length(dteCreated) -3) as dteCreated, count(*)
group by substr(dteCreated, 1, length(dteCreated) -3)
having count(*) > 1
Is it an idea to store known crawlers User-Agent data in a database table? And when you apply your automated process detects a possible crawler, it adds a row. You can add a Column to the User Agents table where you can set a confirm flag.
As many email spam detectors work with some score, you can kind of do the same. Use the seconds you scanned for in some algorithm to do so. Also, scanning for text in User Agents might help. Google and other some search engines are easy to recognize.
That will be part two!
There are plenty of crawlers… from all over the place.
I know, the idea to add crawlers to a list, but that list would be automated.
I shall try some more at the office!
SELECT COUNT(*) FROM [filesLog] WHERE dteCreated < DATEADD(second, 45, 2015-08-06 15:18:59)
What’s the missing operator in the query expression!?
I’ve tried s and ss for seconds.
I think the date in your SQL statement must be quoted:
SELECT COUNT(*) FROM [filesLog] WHERE dteCreated < DATEADD(second, 45, '2015-08-06 15:18:59')
Yes! Of-course!! Silly me!! My brain is little slow…
[code]SELECT COUNT(*) FROM [filesLog] WHERE dteCreated < DATEADD(s, 45, ‘2015-08-06 15:18:59’)
No value given for one or more required parameters. [/code]
Does you hawk-eyes see the error!?
Ah, I see that the “s” needs to be quoted too
SELECT COUNT(*) FROM [filesLog] WHERE dteCreated > DATEADD('s', -90, '2016-05-03 20:52:47')
AND dteCreated < DATEADD('s', 90, '2016-05-03 20:52:47')
Now I can easy find traces of crawlers in a large amount of data!
I love it!!